
Copying from a table in a pdf #2158

Closed
goyalyashpal opened this issue Sep 9, 2021 · 10 comments

Comments

@goyalyashpal

goyalyashpal commented Sep 9, 2021

Summary

Data copied from Sumatra is not suitable for pasting into a spreadsheet

Details

  • Source PDF contains 2-column data, presented as split into 3 super columns (so to say)
  • I copied the contents of that via Sumatra and via Foxit
  • and pasted the data into spreadsheet software (LO Calc or MS Office Excel)

Video showing the difference in behaviour (crucial part: only up to 0:45)

copying.from.pdf.to.excel.via.sumatra.via.foxit.a99xPJLV10.mp4

Observation

| -- | Sumatra | Foxit | Comment |
|----|---------|-------|---------|
| Order | Column-wise in a row, then to next row | Column-wise in a row, then to next row | good |
| Delimiter on row change | linebreak | linebreak | good |
| Delimiter on column change | linebreak | space | both bad |
Screenshot windows clipboard vivaldi_F4aKaJTwis

Conclusions

  • In these spreadsheet programs, the linebreak is interpreted as a row break
  • So, effectively, on copying via Sumatra, each cell is placed in a new row, which is not easy to deal with.
  • While on copying via Foxit, the content of all columns of each row comes space-delimited in a single cell, which in this case can be easily converted to columns using the "text to columns" wizard.
    • LO Calc: Data > Text to Columns
      image

D4 in the observations table says "both bad". Why so?

  • Sumatra's is bad because it uses the same delimiter for a column change as it uses for a row change. So it's a bit difficult to get the data back into separate columns in a spreadsheet.
  • Foxit's is bad because it uses a super common delimiter: space
    • Here, the data in each cell didn't contain spaces, so the space delimiter didn't pose any issue.
    • But say the cell's content did contain spaces (as is common in names, addresses etc.); then Foxit's space delimiter becomes an even worse situation, as far as my novice Excel knowledge goes.
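
For the Sumatra-style output described above (one cell per line), the cells can be regrouped outside the viewer before pasting. This is only a minimal sketch, assuming the number of columns is known and every cell is non-empty; `reshape_cells` is a hypothetical helper name, not part of Sumatra or any spreadsheet tool.

```python
def reshape_cells(clipboard_text: str, ncols: int) -> str:
    """Turn one-cell-per-line text into tab-delimited rows of `ncols` cells.

    Assumption: no cell is empty. Blank lines are dropped, so an empty
    cell in the source table would shift every cell after it.
    """
    cells = [line.strip() for line in clipboard_text.splitlines() if line.strip()]
    rows = [cells[i:i + ncols] for i in range(0, len(cells), ncols)]
    # Tab between cells, linebreak between rows: spreadsheets paste this
    # as one cell per column.
    return "\n".join("\t".join(row) for row in rows)


# Made-up sample: a 2-column table copied as six separate lines.
sample = "a\n1\nb\n2\nc\n3\n"
print(reshape_cells(sample, 2))
```

Pasting the printed text into LO Calc or Excel should place each row's cells in adjacent columns, since tab-delimited clipboard text is normally split on tabs.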

What can be done

  • I think the better and more inclusive way would be to somehow give an option to choose the delimiter/separator to be inserted while copying. The screenshot above showing the "text to columns" UI can be referred to as well.
  • Anyway, I would like to know how other people handle it, and any constructive views

Environment info

  • OS: Win 10 20H2
  • Sumatra: 3.2 x64 (fc8f35a)
  • Foxit: 9.3.0.10826
  • Libre Office: Version: 6.3.0.4 (x64) Build ID: 057fc023c990d676a43019934386b85b21a9ee99 (see log)
  • MS Office: 2007 MSO (12.0.4518.1014)
@goyalyashpal
Author

goyalyashpal commented Sep 9, 2021

to somehow give an option to choose delimiter/separator

This can be given in the advanced settings in Sumatra, right? So the option might be entered as a string like "linebreak", "tab", "comma", "semicolon", "space". I don't know much about other delimiters, so I can't say about those.

The default delimiter can be one that's not likely to be found in table data, which to me seems to be tab.
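
A small illustration of why tab looks like the safer default than space. This is a sketch with made-up sample data (the name and address are invented for the example), using only Python's standard `csv` module:

```python
import csv
import io

# Made-up row whose cells contain spaces, as names and addresses often do.
row = ["Asha Rao", "12 Park Street", "42"]

space_joined = " ".join(row)   # what a space delimiter would produce
tab_joined = "\t".join(row)    # what a tab delimiter would produce

# Splitting on space cannot tell cell boundaries from spaces inside cells.
space_split = space_joined.split(" ")

# Splitting on tab recovers the original three cells.
tab_split = next(csv.reader(io.StringIO(tab_joined), delimiter="\t"))

print(len(space_split), space_split)
print(len(tab_split), tab_split)
```

The space-delimited version comes back as six fragments, while the tab-delimited version comes back as the original three cells.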

example of one software which provided settings like that:
image

@GitHubRulesOK
Collaborator

GitHubRulesOK commented Sep 9, 2021

PDF has no delimiters: in the contents there are no "columns", "words", "paragraphs" or line endings like \r or \n.
PDF does occasionally use line endings for blocks of characters on a line, which gives the perception of a new line.
One issue is that a larger space between blocks of glyphs may look like a line ending. So print a PDF to a line printer and from   Hello   World you may easily get

Hello
             World

but in the clipboard that is stored just as it would be without code markers or whitespace, as

Hello
World

@goyalyashpal
Author

goyalyashpal commented Sep 10, 2021

PDF has no delimiters there are no contents "columns" "words"

@GitHubRulesOK

  • I had that idea, but I used those words to refer to what might have been the source of the generated PDF, that is, a Word document containing a table, or a spreadsheet, or maybe some advanced version of these.
  • And as you can see in the video I attached, I have opened the same file in Foxit and Sumatra, and they are in fact copying content with different delimiters. So...
  • Umm, did you read the complete description and watch the attached video?

@goyalyashpal
Author

Oh, just for clarification: I did not say that the PDF itself contained those delimiters. I said that on selecting the PDF content and copying from there, the viewer (Sumatra or Foxit) inserts those delimiters at certain places.

@GitHubRulesOK
Collaborator

GitHubRulesOK commented Sep 10, 2021

It is MuPDF that does the extraction, based on the true content received from the font library.
There are many open issues as to why those inputs are corrupt in source PDFs, and certainly Adobe (or Foxit) use different libraries to add their own corrections, interpretations and adjustments, hence they are larger and slower.
Adobe will combine sequential blocks as best it can to provide a new output more akin to reflowable paragraphs, but even that extraction will be flawed on many occasions.
Not directly related, I know, but this serves to show visually how PDF OCR text extractors see the "words" as blocks in random order and then need to group those into lines; in this case there were two identical blocks but of different qualities
image

SumatraPDF can make small modifications, but nothing can beat human restructuring of poor text sequences from OCR

@goyalyashpal
Author

Okay. I can't seem to understand it fully, but the crux seems to be that it's not an easy fix, right?

@GitHubRulesOK
Collaborator

It's not easy, but MuPDF depends on other library data to guide it on a line's base construction.
In effect, each character of text is intended to be an ink blob at a different offset from the lower left corner, so in reality the output from a PDF is objects of differing heights and distances from the origin. Some grouping is preserved by aggregating sequential characters by font style, and ascenders/descenders are not a problem in those cases.

As soon as there is a gap between glyphs or a step change like sub/superscript, the baseline may be reset, thus a common baseline is never guaranteed for any line of characters. Every character could in theory be placed on its own line in a raw PDF and still be extracted as one horizontal line, as long as they all have exactly the same Y value up from the bottom of the page, but mind your p's and q's
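
The grouping described above can be pictured with a toy model. This is a rough sketch and not how MuPDF actually works: the `Glyph` tuple, the `glyphs_to_lines` helper and the `gap` threshold are all hypothetical, and real extractors also use glyph widths, fonts and tolerances rather than an exact Y match.

```python
from collections import defaultdict
from typing import NamedTuple


class Glyph(NamedTuple):
    char: str
    x: float  # distance from the left edge of the page
    y: float  # baseline height up from the bottom of the page


def glyphs_to_lines(glyphs, gap=10.0):
    """Group glyphs into lines by exact baseline, then order them by x.

    Simplification: a space is inserted whenever two neighbouring glyph
    origins are more than `gap` apart; glyph widths are ignored.
    """
    by_baseline = defaultdict(list)
    for g in glyphs:
        by_baseline[g.y].append(g)

    lines = []
    # Origin is the lower left corner, so a larger y is higher on the page.
    for y in sorted(by_baseline, reverse=True):
        row = sorted(by_baseline[y], key=lambda g: g.x)
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            if cur.x - prev.x > gap:
                text += " "
            text += cur.char
        lines.append(text)
    return lines


# Glyphs arrive in arbitrary order; the "2" is a superscript with a raised baseline.
sample = [
    Glyph("o", 46, 700), Glyph("H", 0, 700), Glyph("2", 52, 704),
    Glyph("W", 40, 700), Glyph("i", 6, 700),
]
print(glyphs_to_lines(sample))
```

In this toy run the superscript lands on a line of its own, while the wide gap between the two blocks on the shared baseline becomes a single space, which is the kind of guesswork the comment above is pointing at.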

@kjk
Member

kjk commented Sep 10, 2021

This is unlikely to change in Sumatra.

Extracting table data from PDF is a niche and very complicated problem.

What is a possibility is that I write an online tool just for this purpose and integrate it with Sumatra to make the process easy.

@goyalyashpal
Author

goyalyashpal commented Sep 10, 2021 via email

@GitHubRulesOK
Collaborator

GitHubRulesOK commented Dec 19, 2022

@yashpalgoyal1304

By far the simplest way to get space-delimited characters with white space is to use Poppler (or xpdf https://www.xpdfreader.com/download.html): pdftotext with the -layout option will produce a page layout as seen on screen. Then import the page as text columns, where the import allows setting values for date, currency etc. (There is a tabular option, but it can be fickle across repeated tries.)
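
A minimal sketch of that workflow in script form, assuming `pdftotext` is installed and on the PATH. The helper names are hypothetical, and the rule "two or more spaces separate columns" is an assumption that fits typical -layout output but will fail on tables with empty cells or tightly packed columns.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path: str) -> str:
    """Run `pdftotext -layout`, sending the text to stdout ("-")."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def layout_to_rows(layout_text: str) -> list[list[str]]:
    """Split each non-blank line into cells on runs of 2+ whitespace chars.

    Single spaces are kept, so a cell like "Asha Rao" stays in one piece.
    """
    rows = []
    for line in layout_text.splitlines():
        if line.strip():
            rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


# Made-up text in the shape that -layout output typically has.
sample = "Name        Qty\nAsha Rao      3\n"
for row in layout_to_rows(sample):
    print("\t".join(row))
```

For a real file, `layout_to_rows(pdf_to_layout_text("table.pdf"))` would chain the two steps; `table.pdf` is a placeholder path.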
