Copying from a table in a pdf #2158
Comments
This could be exposed in Sumatra's advanced settings, right? The option could be entered as a string such as "linebreak", "tab", "comma", "semicolon", or "space". I don't know of many other delimiters, so I can't say more about those. The default delimiter should be one that is unlikely to be found in table data, which to me seems to be tab.
A PDF has no delimiters: its contents have no "columns", "words", "paragraphs", or line endings like \r or \n,
but that's stored in the clipboard just as it would be without code markers or without whitespace, as Hello
Oh, just for clarification: I did not say that the PDF itself contains those delimiters. I said that when you select the PDF content and copy it, the viewer (Sumatra or Foxit) inserts those delimiters at certain places.
Okay. I can't seem to understand it fully, but the crux seems to be that it's not an easy fix, right?
It's not easy, but MuPDF depends on other library data to guide how lines are constructed. As soon as there is a gap between glyphs or a step change like sub/superscript, the baseline may be reset, so a common baseline is never guaranteed for any line of characters. Every character could in theory be placed on its own line in a raw PDF and still be extracted as one horizontal line, as long as they all have exactly the same Y value up from the bottom of the page. But mind your p's and q's.
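To make the baseline point concrete, here is a minimal sketch of grouping glyphs into lines purely by their Y value and inserting a delimiter wherever there is a horizontal gap. This is not how Sumatra or MuPDF does it; the `(char, x, y, width)` tuples, the `glyphs_to_lines` name, and the `gap` threshold are all assumptions made up for illustration.

```python
from collections import defaultdict


def glyphs_to_lines(glyphs, gap=5.0, delimiter="\t"):
    """Group glyphs into text lines by exact baseline Y (hypothetical model).

    glyphs: iterable of (char, x, y, width) tuples, with y measured up from
    the bottom of the page, as in PDF user space.
    """
    rows = defaultdict(list)
    for ch, x, y, width in glyphs:
        # Only glyphs with exactly the same Y end up on the same line.
        rows[y].append((x, ch, width))

    lines = []
    # Larger Y is higher on the page, so emit rows from top to bottom.
    for y in sorted(rows, reverse=True):
        parts = []
        prev_end = None
        for x, ch, width in sorted(rows[y]):
            # A horizontal gap wider than the threshold is treated as a
            # column boundary and gets the delimiter.
            if prev_end is not None and x - prev_end > gap:
                parts.append(delimiter)
            parts.append(ch)
            prev_end = x + width
        lines.append("".join(parts))
    return lines


sample = [
    ("A", 0, 100, 5), ("B", 5, 100, 5), ("1", 40, 100, 5),
    ("C", 0, 80, 5), ("2", 40, 80, 5),
]
print(glyphs_to_lines(sample))
```

A superscript drawn a few units higher would land in a row of its own under this model, which is exactly why a common baseline cannot be relied on for real tables.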
This is unlikely to change in Sumatra. Extracting table data from PDF is a niche and very complicated problem. What is a possibility is that I write an online tool just for this purpose and integrate it with Sumatra to make the process easy.
What is a possibility is that I write an online tool just for this purpose and integrate it with Sumatra to make the process easy.
Yeah, anything that eases the process is welcome :)
@yashpalgoyal1304 By far the simplest way to get space-delimited characters with whitespace is to use Poppler (or xpdf, https://www.xpdfreader.com/download.html). pdftotext with the -layout option will produce a page layout as seen on screen; then import the page as text columns, where the import allows setting values for date, currency, etc. (There is also a tabular option, but it can be fickle across repeated tries.)
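As a rough sketch of the "import the page as text columns" step, the snippet below takes text that has already been produced by something like `pdftotext -layout input.pdf output.txt` and turns runs of spaces into a chosen delimiter. The function name `layout_text_to_rows` and the `min_gap` threshold are assumptions for illustration, and the approach will misalign columns when a cell is empty, so treat it as a starting point only.

```python
import re


def layout_text_to_rows(text, delimiter="\t", min_gap=2):
    """Convert layout-preserved text into delimited rows (hypothetical helper).

    Any run of at least `min_gap` spaces is treated as a column boundary;
    single spaces inside a cell (e.g. "Red apple") are kept.
    """
    splitter = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(delimiter.join(splitter.split(line)))
    return rows


layout_text = "Name         Qty   Price\nRed apple    3     1.50\n\n"
print("\n".join(layout_text_to_rows(layout_text)))
```

With a tab delimiter the resulting rows can be pasted straight into a spreadsheet; passing `delimiter=","` or `delimiter=";"` gives the other options mentioned earlier in the thread.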
Summary
Data copied from Sumatra is not suitable
Details
Video showing the difference in behaviour (crucial part: only up to 0:45)
copying.from.pdf.to.excel.via.sumatra.via.foxit.a99xPJLV10.mp4
Observation
Conclusions
D4 in the observations table says "both bad". Why so?
What can be done
Environment info