How to find x and y coordinates of a text in PDF #418

ccpplinux · 2021-05-08T06:48:56Z

Hi,
Is it possible to get the X and Y coordinates of a piece of text in a PDF file using this library? If so, please share some sample code.

Best Regards ...
Pankaj Kumar

k00ni · 2021-05-10T08:05:31Z

Yes. Here is an example.

ccpplinux · 2021-05-10T11:10:06Z

Thanks for the reply. I installed it on my server and tested it. I am using the following code to extract the X and Y coordinates of all words in a PDF file:

<?php
// Include Composer autoloader if not already done.
include 'vendor/autoload.php';
 
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile('result_sheet_format_llb_1.pdf');
 
$pages = $pdf->getPages();
$page = $pages[0];

$dataTm = $page->getDataTm();

echo("<pre>");
var_dump($dataTm);
echo("</pre>");
?>

It works, and you can see its output at https://glug4muz.org/php/pdfparser/parse.php

But there are two small problems that I would like to get rid of. At the moment I use https://github.com/measuresforjustice/textricator to extract every word of a PDF file, together with its X and Y coordinates, into a CSV file. I then import that CSV file into a MySQL database and read the X/Y coordinates from a PHP script.

With pdfparser, the X coordinate of each word matches the one reported by textricator, but the Y coordinate differs. For example, the Y coordinate of the word $roll_no is 115.216255 according to textricator but 721.786 according to pdfparser, while the X coordinate of the same word is 48.503 in both tools. Can you please tell me why?

Furthermore, pdfparser does not return the X/Y coordinates of every distinct word. Sometimes it combines two or more words. For example, as you can see at https://glug4muz.org/php/pdfparser/parse.php, it returns a single pair of X/Y coordinates for the two words $held_month $held_year. Why does it not return separate X/Y coordinates for $held_month and $held_year?

Can you please explain how these two issues can be resolved? Then I will integrate this tool in my project.

Best Regards ...
Pankaj Kumar

ccpplinux · 2021-05-10T11:18:07Z

parse.php.txt

ccpplinux · 2021-05-10T11:19:04Z

result_sheet_format_llb_1.pdf

ericksho · 2022-05-02T21:45:24Z

I have the same problem. I believe that using tables (I convert HTML to PDF) interferes with the y-coordinate.
An example: I'm searching for the tag {f{Docente}f}, which is in white font at the bottom of this document, but the reported position is x: 251.367, y: 109.779.

hohohoc · 2024-05-10T09:52:14Z

This is what I found out: you need to convert from points to mm before you call SetXY.
Also, to get the real y-coordinate, you need to subtract the reported y-coordinate from the page height (PDF coordinates are measured in points from the bottom-left corner of the page).

For the point-to-mm conversion, see the link below:
https://stackoverflow.com/questions/34545339/the-size-of-pdf-documents-how-do-i-convert-from-millimeters-to-pixels-using-spi

$parser = new \Smalot\PdfParser\Parser();

$pdf = $parser->parseFile('document.pdf');
// ... or ...
// $pdf = $parser->parseContent(file_get_contents('document.pdf'));

// Get the page you want; loop over $pages if you need to read multiple pages.
$pages = $pdf->getPages();
$page = $pages[0];

// Get coordinate info for each string on the page.
// Each element is [text matrix, text]; indexes 4 and 5 of the matrix hold x and y.
$dataTm = $page->getDataTm();

// Keyword to find
$keyword = "testing";

// Page height in points (assumes the MediaBox starts at 0,0)
$details = $page->getDetails();
$page_height = $details['MediaBox'][3];

$x = 0;
$y = 0;

// Match each string against the keyword
foreach ($dataTm as $element) {
    $pos = strpos($element[1], $keyword);
    if ($pos !== false) {
        // Convert points to mm (1 pt = 25.4 / 72 mm) and flip the y-axis
        $x = $element[0][4] * 0.352777778;
        $y = ($page_height - $element[0][5]) * 0.352777778;
    }
}

print_r($x);
print_r($y);
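
The conversion described above can also be isolated into two small helpers, which makes it easy to check the arithmetic without a PDF at hand. This is only a minimal sketch and not part of pdfparser: the names `pointsToMm` and `pdfToTopLeftMm` are hypothetical, the numbers in the example are made up, and it assumes that the page's MediaBox starts at (0, 0) and that the target API measures in millimetres from the top-left corner.

```php
<?php
// Hypothetical helpers (not part of pdfparser) illustrating the conversion.

// 1 pt = 1/72 inch and 1 inch = 25.4 mm.
const MM_PER_POINT = 25.4 / 72;

function pointsToMm(float $points): float
{
    return $points * MM_PER_POINT;
}

/**
 * Convert a PDF position (origin bottom-left, in points) to a
 * top-left-origin position in millimetres.
 * Assumes the lower-left corner of the MediaBox is (0, 0).
 */
function pdfToTopLeftMm(float $x, float $y, float $pageHeight): array
{
    return [
        'x' => pointsToMm($x),
        'y' => pointsToMm($pageHeight - $y), // flip the y-axis
    ];
}

// Made-up example: a string at (72, 720) on a page that is 792 pt tall.
$pos = pdfToTopLeftMm(72.0, 720.0, 792.0);
printf("x = %.1f mm, y = %.1f mm\n", $pos['x'], $pos['y']);
// prints "x = 25.4 mm, y = 25.4 mm"
```

In the loop above, `$element[0][4]`, `$element[0][5]` and `$page_height` would be passed as the three arguments.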

crThiago · 2024-05-20T20:52:02Z

This solution worked for me.

k00ni added the question label May 10, 2021

k00ni closed this as completed May 10, 2021
