Copying from a table in a pdf #2158

goyalyashpal · 2021-09-09T17:37:09Z

Summary

Data copied from a table in Sumatra is not suitable for pasting into a spreadsheet.

Details

The source PDF contains two-column data laid out as three side-by-side super-columns (so to say).
I copied its contents via Sumatra and via Foxit,
and pasted the data into spreadsheet software (LO Calc or MS Office Excel).

Video showing the difference in behaviour (crucial part: only up to 0:45)

copying.from.pdf.to.excel.via.sumatra.via.foxit.a99xPJLV10.mp4

Observation

| -- | Sumatra | Foxit | Comment |
|---|---|---|---|
| Order | Column wise in a row, then to next row | Column wise in a row, then to next row | good |
| Delimiter on row change | linebreak | linebreak | good |
| Delimiter on column change | linebreak | space | both bad |

Screenshot

Conclusions

In that spreadsheet software, a linebreak is interpreted as a row break.
So, effectively, on copying via Sumatra, each cell is placed in a new row, which is not easy to deal with.
On copying via Foxit, all of a row's column contents arrive space-delimited in a single cell, which in this case can easily be converted to columns using the "Text to Columns" wizard.
- LO Calc: Data > Text to Columns

D4 in the observations table says "both bad". Why so?

Sumatra's is bad because it uses the same delimiter for column changes as for row changes, so it's a bit difficult to get the data back into separate columns in a spreadsheet.
Foxit's is bad because it uses a super common delimiter: space.
- Here, the data in the cells didn't contain spaces, so the space delimiter didn't pose any issue.
- But if the cells' content did contain spaces (as is common in names, addresses, etc.), then Foxit's space delimiter would make for an even worse situation, as far as my novice knowledge of Excel goes.
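
As a workaround for the Sumatra case, the one-cell-per-line text can be regrouped before pasting. This is only a minimal sketch, assuming the number of columns is known and that no cell in the table is empty (an empty cell would shift every cell after it). The function name `regroup_cells` is made up for illustration and is not part of Sumatra.

```python
def regroup_cells(clipboard_text: str, columns: int, delimiter: str = "\t") -> str:
    """Rebuild table rows from text that has exactly one cell per line.

    Hypothetical helper: assumes `columns` is known and no cell is empty.
    """
    # Drop blank lines so stray empty lines do not shift the grouping.
    cells = [line.strip() for line in clipboard_text.splitlines() if line.strip()]
    # Take the cells `columns` at a time; a short final row is kept as-is.
    rows = [cells[i:i + columns] for i in range(0, len(cells), columns)]
    # Tab between cells, linebreak between rows: spreadsheets paste this as a grid.
    return "\n".join(delimiter.join(row) for row in rows)


if __name__ == "__main__":
    sample = "a1\nb1\na2\nb2\na3\nb3\n"
    print(regroup_cells(sample, columns=2))
```

Tab is used as the column delimiter here because cell text rarely contains it, unlike space.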

What can be done

I think the better and more inclusive way would be to somehow give an option to choose delimiter/separator to be inserted while copying. The screenshot above showing the "Text to Columns" UI can be referred to as well.
Anyway, I would like to know how other people handle this, and to hear any constructive views.

Environment info

OS: Win 10 20H2
Sumatra: 3.2 x64 (fc8f35a)
Foxit: 9.3.0.10826
Libre Office: Version: 6.3.0.4 (x64) Build ID: 057fc023c990d676a43019934386b85b21a9ee99 (see log)
MS Office: 2007 MSO (12.0.4518.1014)

goyalyashpal · 2021-09-09T18:41:35Z

to somehow give an option to choose delimiter/separator

This could be offered in Sumatra's advanced settings, right? The option could be entered as a string like "linebreak", "tab", "comma", "semicolon", or "space". I don't know much about other delimiters, so I can't say more about them.

The default delimiter could be one that is unlikely to be found in table data, which to me seems to be tab.
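
To make the idea concrete, here is a rough sketch of how such a named-delimiter option could behave. Everything in it is hypothetical: Sumatra has no such setting, and the names `DELIMITERS`, `join_row` and `format_table` are invented for illustration only.

```python
# Hypothetical mapping from the setting's string value to the inserted character.
DELIMITERS = {
    "linebreak": "\n",
    "tab": "\t",
    "comma": ",",
    "semicolon": ";",
    "space": " ",
}


def join_row(cells, delimiter_name="tab"):
    """Join one row's cells using the delimiter named in the (imagined) setting."""
    try:
        delimiter = DELIMITERS[delimiter_name]
    except KeyError:
        raise ValueError(f"unknown delimiter name: {delimiter_name!r}") from None
    return delimiter.join(cells)


def format_table(rows, column_delimiter="tab"):
    """Rows are always separated by a linebreak; columns by the chosen delimiter."""
    return "\n".join(join_row(row, column_delimiter) for row in rows)


if __name__ == "__main__":
    table = [["Name", "City"], ["Ann Lee", "New York"]]
    print(format_table(table))
```

With the tab default, cells that contain spaces (such as "Ann Lee") stay intact when pasted into a spreadsheet.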

An example of software that provides settings like that:

GitHubRulesOK · 2021-09-09T21:37:29Z

PDF has no delimiters; there are no contents like "columns", "words", "paragraphs", or line endings like \r or \n.
PDF does occasionally end a block of characters on a line, which gives the perception of a new line.
One issue is that a larger space between blocks of glyphs may look like a line ending. So print a PDF to a line printer and you may easily get, from Hello World:

Hello
             World

but that is stored in the clipboard, just as it would be without code markers or whitespace, as:

Hello
World

goyalyashpal · 2021-09-10T04:31:35Z

PDF has no delimiters there are no contents "columns" "words"

@GitHubRulesOK

I was aware of that, but I used those words to refer to whatever might have been the source of the generated PDF, that is, a Word document containing a table, a spreadsheet, or some more advanced version of these.
And as you can see in the video I attached, I opened the same file in Foxit and Sumatra, and they do in fact copy the content with different delimiters. So...
Umm, did you read the complete description and watch the attached video?

goyalyashpal · 2021-09-10T06:03:41Z

Oh, just for clarification: I did not say that the PDF itself contained those delimiters. I said that on selecting the PDF content and copying it, the viewer (Sumatra or Foxit) inserts those delimiters at certain places.

GitHubRulesOK · 2021-09-10T11:11:53Z

It is MuPDF that does the extraction, based on the true content received from the font library.
There are many open issues about why those inputs are corrupt in source PDFs, and certainly Adobe (or Foxit) use different libraries to add their own corrections, interpretations and adjustments, hence they are larger and slower.
Adobe will combine sequential blocks as best it can to provide a new output more akin to reflowable paragraphs, but even that extraction will be flawed on many occasions.
Not directly related, I know, but this serves to show visually how PDF OCR text extractors see the "words" as blocks in random order and then need to group those into lines. In this case there were two identical blocks but of different qualities.

SumatraPDF can make small modifications, but nothing can beat human restructuring of poor text sequences from OCR.

goyalyashpal · 2021-09-10T13:43:56Z

Okay. I can't seem to understand it fully, but the crux seems to be that it's not an easy fix, right?

GitHubRulesOK · 2021-09-10T16:10:10Z

It's not easy, but MuPDF depends on data from other libraries to guide how lines are constructed.
In effect, each character of text is intended to be an ink blob at some offset from the lower left corner, so in reality the output from a PDF is objects of differing heights and distances from the origin. Some grouping is preserved by aggregating sequential characters by font style, and ascenders/descenders are not a problem in those cases.

As soon as there is a gap between glyphs, or a step change like sub/superscript, the baseline may be reset, so a common baseline is never guaranteed for any line of characters. Every character could in theory be placed on its own line in a raw PDF and still be extracted as one horizontal line, as long as they all have exactly the same Y value up from the bottom of the page. But mind your p's and q's.
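
The grouping problem described above can be illustrated with a toy sketch. This is not MuPDF's algorithm; it only assumes that each text block arrives with a position measured from the lower left corner, and the `Block` type, the tolerances and the function name are all made up for illustration.

```python
from typing import NamedTuple


class Block(NamedTuple):
    x: float      # left edge, measured from the left side of the page
    y: float      # baseline, measured up from the bottom of the page
    width: float
    text: str


def blocks_to_text(blocks, y_tolerance=2.0, word_gap=1.0, column_gap=15.0):
    """Group positioned blocks into lines by baseline, then order them by x."""
    lines = []  # each entry is [baseline_y, [blocks on that line]]
    # Highest baseline first, i.e. top of the page first.
    for block in sorted(blocks, key=lambda b: -b.y):
        if lines and abs(lines[-1][0] - block.y) <= y_tolerance:
            lines[-1][1].append(block)
        else:
            lines.append([block.y, [block]])

    out = []
    for _, members in lines:
        members.sort(key=lambda b: b.x)
        parts = [members[0].text]
        for prev, cur in zip(members, members[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A wide gap is guessed to be a column change, a small one a word break.
            parts.append("\t" if gap >= column_gap else " " if gap >= word_gap else "")
            parts.append(cur.text)
        out.append("".join(parts))
    return "\n".join(out)


if __name__ == "__main__":
    page = [Block(200, 700, 30, "World"), Block(72, 700.5, 30, "Hello")]
    print(repr(blocks_to_text(page)))
```

In this sketch, a block whose baseline differs by more than `y_tolerance` (a superscript, for example) ends up on a line of its own, which is the kind of split described above. Any fixed tolerance will be wrong for some documents.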

kjk · 2021-09-10T17:51:09Z

This is unlikely to change in Sumatra.

Extracting table data from PDF is a niche and very complicated problem.

What is a possibility is that I write an online tool just for this purpose and integrate it with Sumatra to make the process easy.

goyalyashpal · 2021-09-10T17:54:04Z

What is a possibility is that I write an online tool just for this purpose and integrate it with Sumatra to make the process easy.

Yeah, anything that eases the process is welcome :)

GitHubRulesOK · 2022-12-19T00:14:16Z

@yashpalgoyal1304

By far the simplest way to get space-delimited characters with whitespace is to use pdftotext from Poppler (or xpdf, https://www.xpdfreader.com/download.html). Running it with the -layout option produces a page layout as seen on screen; then import the page as text columns, where the import allows setting values for date, currency, etc. (There is a tabular option too, but it can be fickle across repeated tries.)
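
A minimal sketch of that workflow in Python, assuming `pdftotext` is installed and on the PATH. The helper names are invented, and the rule "two or more spaces means a column break" is an assumption that will not hold for every table.

```python
import re
import subprocess


def extract_layout_text(pdf_path: str) -> str:
    """Run pdftotext in layout mode and return its output as a string."""
    # "-" as the output file asks pdftotext to write to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def layout_to_rows(layout_text: str, min_gap: int = 2):
    """Split each non-empty line into cells on runs of `min_gap` or more spaces."""
    gap = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in layout_text.splitlines():
        line = line.strip()
        if line:
            rows.append(gap.split(line))
    return rows


if __name__ == "__main__":
    sample = "Name        City\nAnn Lee     New York\n"
    for row in layout_to_rows(sample):
        print("\t".join(row))
```

Calling `layout_to_rows(extract_layout_text("table.pdf"))` on a real file (the file name is a placeholder) gives rows that can be written out tab-separated. Single spaces inside a cell survive, since only wider gaps are treated as column breaks.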

goyalyashpal mentioned this issue on Apr 17, 2022 in mkulesh/microMathematics#118, "Misaligned, fragmented pdfs (aggravated in Sumatra pdf viewer)" (closed).

goyalyashpal closed this as not planned on Dec 18, 2022.
