Extracting User provided information from PDF - how to do so? #2862
Unanswered
mangowithice
asked this question in
Q&A
I am afraid there is no way around developing your own logic that puts the user input in the right places. You will have to sort the extracted text in some clever way, for example by position on the page.
That should ultimately give you the desired output.
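One possible shape for such sorting logic, as a rough sketch only: cluster the extracted words into visual lines by their vertical midpoint, then order each line left to right, so that filled-in text lands next to the printed label on the same line. The word tuples follow the layout that `page.get_text("words")` returns in PyMuPDF, `(x0, y0, x1, y1, text, block_no, line_no, word_no)`. The function name `reading_order` and the `y_tolerance` value are assumptions made for illustration, not part of the library.

```python
from typing import List, Tuple

# Same layout as the tuples from PyMuPDF's page.get_text("words"):
# (x0, y0, x1, y1, text, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]


def reading_order(words: List[Word], y_tolerance: float = 3.0) -> List[List[str]]:
    """Group words into visual lines, then order each line left to right.

    y_tolerance (in points) is a guess; tune it to the line spacing of the form.
    """
    lines = []  # each entry is [reference_mid_y, [words on that line]]
    # Walk the words from top to bottom by vertical midpoint.
    for w in sorted(words, key=lambda w: (w[1] + w[3]) / 2):
        mid_y = (w[1] + w[3]) / 2
        if lines and abs(mid_y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(w)  # close enough: same visual line
        else:
            lines.append([mid_y, [w]])  # start a new visual line
    # Within each line, sort by the left edge and keep only the text.
    return [[w[4] for w in sorted(ws, key=lambda w: w[0])] for _, ws in lines]
```

With a real document the input would come from something like `page.get_text("words")`; the printed labels and the user text then interleave by position rather than by their order in the file.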
Hiya - I have a PDF written in Thai with empty spaces where users provide information via Fill & Comment in Acrobat. When I extract text from the PDF, the user-provided text is consolidated at the end of the output dump (i.e. the original PDF text comes first, followed by the user-provided text). If I set `sort=True`, the order of the provided text looks random. My aim is to extract the provided information and store it.
My guess is that I can find the coordinates of the text boxes where a user might fill something in and write a function that detects the text based on the bbox. But because this is not a proper form per se, I hit an issue when the input overflows the box.
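A minimal sketch of the bbox idea with some tolerance for overflow: grow each field rectangle by a small padding and assign every word to the field it overlaps most, so a word that starts inside a box and runs past its edge is still captured. The field names, the rectangles, the `pad` value, and the helper names `overlap_area` and `assign_to_fields` are all hypothetical; the word tuples use the `page.get_text("words")` layout from PyMuPDF.

```python
from typing import Dict, List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
# Layout of PyMuPDF's page.get_text("words") tuples:
# (x0, y0, x1, y1, text, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]


def overlap_area(a: Sequence[float], b: Sequence[float]) -> float:
    """Area of the intersection of two rectangles, 0.0 if they are disjoint."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def assign_to_fields(
    words: List[Word], fields: Dict[str, Rect], pad: float = 5.0
) -> Dict[str, List[str]]:
    """Assign each word to the padded field rectangle it overlaps most.

    Words that overlap no field (e.g. printed labels) are left out.
    pad is an assumed tolerance in points for text spilling over the box.
    """
    result: Dict[str, List[str]] = {name: [] for name in fields}
    for w in words:
        best_name, best_area = None, 0.0
        for name, (x0, y0, x1, y1) in fields.items():
            grown = (x0 - pad, y0 - pad, x1 + pad, y1 + pad)
            area = overlap_area(w[:4], grown)
            if area > best_area:
                best_name, best_area = name, area
        if best_name is not None:
            result[best_name].append(w[4])
    return result
```

This only helps when the overflowing text still touches its own box; input that drifts entirely into a neighbouring field would need extra rules, such as comparing against the text of a blank copy of the form.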
Is there a way for PyMuPDF to extract the text in some logical sequence? I am exploring PyMuPDF because it handles Thai much better than tabula and pdfplumber, but I have been unable to find the most appropriate approach to detect the user text.
The text highlighted in yellow below is what the user keyed in.
![image](https://private-user-images.githubusercontent.com/152759078/287495410-653a82cb-6b8e-44c3-b2b4-a385fdf3aeb2.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjAyMDg2ODcsIm5iZiI6MTcyMDIwODM4NywicGF0aCI6Ii8xNTI3NTkwNzgvMjg3NDk1NDEwLTY1M2E4MmNiLTZiOGUtNDRjMy1iMmI0LWEzODVmZGYzYWViMi5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzA1JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcwNVQxOTM5NDdaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT04ZWUwYmE0YzU4YTU5OTc0OGVhYjA1ZWEyOWNmODhhYmYwMmExNzEwY2Y3MGNlYjdlOWFhOGRmNzYwMWEzYjMyJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.GcLJsVGjU9r4a8Fv4Ahjem8axkvsUJL8JLijx5Ia6Zs)