How to find x and y coordinates of a text in PDF #418

Closed
ccpplinux opened this issue May 8, 2021 · 7 comments

@ccpplinux

Hi,
Is it possible to get the X and Y coordinates of a piece of text in a PDF file using this library? If so, please share some sample code.

Best Regards ...
Pankaj Kumar

@k00ni k00ni added the question label May 10, 2021
@k00ni
Collaborator

k00ni commented May 10, 2021

Yes. Here is an example.

@k00ni k00ni closed this as completed May 10, 2021
@ccpplinux
Author

ccpplinux commented May 10, 2021

Thanks for the reply. I installed it on my server and tested it. I am using the following code to extract the X and Y coordinates of all words in a PDF file:

<?php
// Include Composer autoloader if not already done.
include 'vendor/autoload.php';
 
// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile('result_sheet_format_llb_1.pdf');
 
$pages = $pdf->getPages();
$page = $pages[0];

$dataTm = $page->getDataTm();

echo("<pre>");
var_dump($dataTm);
echo("</pre>");
?>

It is working and you can see its output at the URL https://glug4muz.org/php/pdfparser/parse.php

But there are two small problems that I would like to get rid of. At the moment I use the package https://github.com/measuresforjustice/textricator to extract the X and Y coordinates of each word in a PDF file. The tool writes all words and their X/Y coordinates to a CSV file, which I then import into a MySQL database and read with a PHP script.

With pdfparser, the X coordinate of each word is the same as the one reported by textricator, but the Y coordinate differs. For example, the Y coordinate of the word $roll_no is 115.216255 according to textricator but 721.786 according to pdfparser, while the X coordinate is 48.503 in both tools. Can you please tell me why that is?
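
One possible explanation, offered as a sketch rather than a confirmed diagnosis: default PDF user space puts the origin at the bottom-left of the page with Y increasing upward, while some extraction tools report Y measured down from the top edge. The snippet below flips the pdfparser value under the assumption that the page is A4 (841.89 pt tall). The page size is not stated in this thread, and `flipY` is a hypothetical helper, not part of either tool.

```php
<?php
// Hypothetical helper: convert a bottom-left-origin Y (in points) into a
// top-left-origin Y by subtracting it from the page height.
function flipY(float $pageHeightPt, float $yPt): float
{
    return $pageHeightPt - $yPt;
}

// Assumption: A4 page height in points (297 mm). Not confirmed by the thread.
$pageHeightPt = 841.89;

// Y value that pdfparser reported for $roll_no in the comment above.
$flipped = flipY($pageHeightPt, 721.786);

printf("%.3f\n", $flipped);
```

Under that assumption the flipped value comes out at roughly 120.1, which is close to, but not the same as, the 115.216255 from textricator. The remaining gap of a few points might come from the two tools measuring from different reference lines of the text (baseline versus top of the glyph box), but that is a guess and would need checking against the actual page size.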

Also, pdfparser does not return the X/Y coordinates of every distinct word. Sometimes it combines two or more words. For example, it returns a single pair of X/Y coordinates for the two words $held_month $held_year combined, as you can see at https://glug4muz.org/php/pdfparser/parse.php. Why does it not return separate X/Y coordinates for $held_month and $held_year?

Can you please explain how these two issues can be resolved? Then I will integrate this tool into my project.

Best Regards ...
Pankaj Kumar

@ccpplinux
Author

parse.php.txt

@ericksho

ericksho commented May 2, 2022

I have the same problem. I believe that using tables (I convert HTML to PDF) messes with the y coordinate.
An example: I'm searching for the tag {f{Docente}f} in this document, set in white font at the bottom, but the reported position is x: 251.367, y: 109.779.

@hohohoc

hohohoc commented May 10, 2024

Here is what I found: you need to convert from points to mm before you call SetXY.
Also, to get the real y-coordinate, you need to subtract the y-coordinate from the page height.

For the point-to-mm conversion, see the link below:
https://stackoverflow.com/questions/34545339/the-size-of-pdf-documents-how-do-i-convert-from-millimeters-to-pixels-using-spi

$parser = new \Smalot\PdfParser\Parser();

$pdf = $parser->parseFile('document.pdf');
// ... or ...
// $pdf = $parser->parseContent(file_get_contents('document.pdf'));

// Get the page you want (here the first one); loop over $pdf->getPages() to read multiple pages.
$page = $pdf->getPages()[0];

// Get coordinate info for each string on the page
$dataTm = $page->getDataTm();

// Keyword to find
$keyword = "testing";

// Get the page height (in points)
$details = $page->getDetails();
$page_height = $details['MediaBox'][3];

$x = 0;
$y = 0;

// Match each string against the keyword
foreach ($dataTm as $element) {
    $pos = strpos($element[1], $keyword);
    if ($pos !== false) {
        // Convert points to mm (1 pt = 25.4 / 72 mm)
        $x = ($element[0][4]) * 0.352777778;
        $y = ($page_height - $element[0][5]) * 0.352777778;
    }
}

print_r($x);
print_r($y);
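
The loop above keeps only the last match. As a hedged variant, the sketch below wraps the same arithmetic in a function that returns every match, converted to millimetres with a top-left origin. `findTextPositionsMm` and `MM_PER_POINT` are hypothetical names, and the sample array only imitates the `[[a, b, c, d, x, y], text]` shape that the loop above reads from `getDataTm()`; its values are made up for illustration.

```php
<?php
// 1 pt = 1/72 inch = 25.4 / 72 mm (about 0.352777778 mm).
const MM_PER_POINT = 25.4 / 72;

/**
 * Hypothetical helper: return top-left-origin positions in mm
 * for every entry whose text contains $keyword.
 */
function findTextPositionsMm(array $dataTm, string $keyword, float $pageHeightPt): array
{
    $matches = [];
    foreach ($dataTm as $element) {
        [$matrix, $text] = $element;
        if (strpos($text, $keyword) === false) {
            continue; // Not a match, skip this entry.
        }
        $matches[] = [
            'text' => $text,
            'x'    => $matrix[4] * MM_PER_POINT,
            // Flip Y so it is measured down from the top edge of the page.
            'y'    => ($pageHeightPt - $matrix[5]) * MM_PER_POINT,
        ];
    }
    return $matches;
}

// Made-up sample data in the assumed getDataTm() shape.
$sample = [
    [[1, 0, 0, 1, 72.0, 720.0], 'testing one'],
    [[1, 0, 0, 1, 72.0, 700.0], 'other text'],
    [[1, 0, 0, 1, 144.0, 72.0], 'more testing'],
];

// 792 pt is the height of a US Letter page, chosen only for the example.
$found = findTextPositionsMm($sample, 'testing', 792.0);
print_r($found);
```

With real data, the third argument would be the `MediaBox` height read from the page details as in the code above, and each returned pair could be passed to SetXY.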

@crThiago

The solution above from @hohohoc worked for me.
