Parse error on Google sheets generated PDF #521

coppit · 2019-10-15T19:05:07Z

I created a PDF by "printing" from Google sheets. When I try to merge the page into a PDF, I get the stack trace below. It looks like the parser is incorrectly backing up into the %PDF-1.4 comment. If I export the document as PDF using Apple Preview, it gets converted to a new %PDF-1.3 document that parses correctly.

Traceback (most recent call last):
  File "/Users/dcoppit/documents/p4sw/sw/pvt/dcoppit/make_bom.py", line 257, in <module>
    make_pdf(LOOKUP[partner].get('doc', None), LOOKUP[partner]['sheet'])
  File "/Users/dcoppit/documents/p4sw/sw/pvt/dcoppit/make_bom.py", line 248, in make_pdf
    merge_pdfs(doc_temp_path, sheet_spares_temp_path)
  File "/Users/dcoppit/documents/p4sw/sw/pvt/dcoppit/make_bom.py", line 231, in merge_pdfs
    output_pdf_file.merge(insert_point, sheet_spares_temp_path.name, pages=(0,1))
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/merger.py", line 151, in merge
    outline = pdfr.getOutlines()
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/pdf.py", line 1346, in getOutlines
    lines = catalog["/Outlines"]
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/generic.py", line 178, in getObject
    return self.pdf.getObject(self).getObject()
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/pdf.py", line 1599, in getObject
    idnum, generation = self.readObjectHeader(self.stream)
  File "/usr/local/lib/python3.7/site-packages/PyPDF2/pdf.py", line 1668, in readObjectHeader
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: b'F-1.4'

Here's how the document starts:

%PDF-1.4
% âãÏÓ
4
0
obj
<<
/Type
/Catalog
/Names

There's a space after the % in the second line, and each word is on a separate line.

coppit · 2019-10-15T19:27:12Z

Here's a small PDF that demonstrates the problem.

coppit · 2019-10-15T20:56:52Z

I think the problem has to do with improper parsing of free objects in the xref table. Here's the table from a problematic PDF:

xref
0 12
0000000002 65535 f
0000000962 00000 n
0000000003 00000 f
0000000000 00000 f
0000000016 00000 n
0000000160 00000 n
0000000287 00000 n
0000000453 00000 n
0000000819 00000 n
0000000728 00000 n
0000000747 00000 n
0000000767 00000 n

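You can see that the 5th line (the entry for object 2) is a free entry whose first field is 3. read() in pdf.py doesn't look at the entry type (f or n), so it treats this non-existent object just like any other and uses 3 as a byte offset. Byte 3 of the file lies inside the %PDF-1.4 header, which is where the b'F-1.4' in the error comes from.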
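The effect can be reproduced without PyPDF2. The following is a minimal sketch, not the library's actual code: it assumes the object header is read as whitespace-delimited tokens starting at the offset taken from the xref entry. The helper name `read_token` is hypothetical, and the bytes of the second header line are an approximation of the file start shown above.

```python
import io

# Approximate first bytes of the file, as shown earlier in this issue.
data = b"%PDF-1.4\n% \xe2\xe3\xcf\xd3\n4\n0\nobj\n<<\n"

def read_token(stream):
    """Read bytes up to the next whitespace character (hypothetical helper)."""
    token = b""
    while True:
        ch = stream.read(1)
        if not ch or ch.isspace():
            return token
        token += ch

stream = io.BytesIO(data)
stream.seek(3)  # first field of the free entry "0000000003 00000 f", misread as an offset
idnum = read_token(stream)
print(idnum)  # b'F-1.4'

try:
    int(idnum)
except ValueError as exc:
    print(exc)  # invalid literal for int() with base 10: b'F-1.4'
```

The token read at byte 3 is the tail of the `%PDF-1.4` header, which matches the value in the ValueError above.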

I made the following change to read() in pdf.py, and it seems to work. I don't know if it's a proper fix though.

# offset, generation = line[:16].split(b_(" "))
offset, generation, kind = line[:18].split(b_(" "))
# Ignore free objects
if kind == b'f' and num > 0:
    cnt += 1
    num += 1
    continue
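As a standalone illustration of the same idea, here is a minimal sketch that parses the xref entries listed above and keeps only the in-use ones. It is not the PyPDF2 implementation and not necessarily how the eventual fix works; the function name `parse_xref_entries` and its signature are assumptions made for this example.

```python
def parse_xref_entries(lines, first_obj=0):
    """Map object number -> (byte offset, generation) for in-use entries only."""
    table = {}
    for num, line in enumerate(lines, start=first_obj):
        field1, generation, kind = line.split()
        if kind == b"f":
            # Free entry: the first field is the next free object number,
            # not a byte offset, so it must not be dereferenced.
            continue
        table[num] = (int(field1), int(generation))
    return table

# The twelve entries of the "0 12" subsection from the problematic PDF.
entries = [
    b"0000000002 65535 f",
    b"0000000962 00000 n",
    b"0000000003 00000 f",
    b"0000000000 00000 f",
    b"0000000016 00000 n",
    b"0000000160 00000 n",
    b"0000000287 00000 n",
    b"0000000453 00000 n",
    b"0000000819 00000 n",
    b"0000000728 00000 n",
    b"0000000747 00000 n",
    b"0000000767 00000 n",
]

offsets = parse_xref_entries(entries)
print(sorted(offsets))  # [1, 4, 5, 6, 7, 8, 9, 10, 11]
print(offsets[4])       # (16, 0)
```

Objects 0, 2 and 3 are skipped because they are free. Object 4 maps to offset 16, which is consistent with the file start shown earlier, where "4 0 obj" follows directly after the two header lines.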

cadu-leite · 2020-07-27T11:24:28Z

I got the exact same error with my own PDF downloaded from Google Sheets.
It is attached here:
pdf_sample_googlesheet_pages_02.pdf

The code is in https://github.com/cadu-leite/merge2pdf and the error is part of the tests.
https://github.com/cadu-leite/merge2pdf/blob/76a0ace2a10ad81ec03ad9cdbbcc11af2c18eaf4/tests/test_merge2pdf.py#L52

MartinThoma · 2022-04-07T16:29:00Z

Could somebody write a minimal snippet of Python code that shows the issue?

cadu-leite · 2022-04-10T12:49:49Z

> Could somebody write a minimal snippet of Python code that shows the issue?

Not sure, but I think that's exactly what was written right above:

> The code is in https://github.com/cadu-leite/merge2pdf and the error is part of the tests.
> https://github.com/cadu-leite/merge2pdf/blob/76a0ace2a10ad81ec03ad9cdbbcc11af2c18eaf4/tests/test_merge2pdf.py#L52

MartinThoma · 2022-04-10T13:06:56Z

@cadu-leite You're linking to a repository. Could you please paste the relevant part in here, creating an MVCE?

MartinThoma · 2022-06-30T13:08:34Z

Using https://github.com/mstamy2/PyPDF2/files/4981701/pdf_sample_googlesheet_pages_02.pdf from @cadu-leite :

>>> from PyPDF2 import PdfReader, PdfWriter, PdfMerger;reader = PdfReader("pdf_sample_googlesheet_pages_02.pdf")
>>> for page in reader.pages: page.extract_text()
... 
Invalid FloatObject b'0.000000000000-14210855'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.0000000000000000000000000000000-8716957'
Invalid FloatObject b'0.00000000000-45474735'
Invalid FloatObject b'0.00000000000-45474735'
' 1\nabr. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n20\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n01\nmai. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n20\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\njun. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n21\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n01\njul. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n23\n\n\n'
Invalid FloatObject b'0.000000000000-14210855'
Invalid FloatObject b'0.000000000000-14210855'
Invalid FloatObject b'0.000000000000000-5551115'
Invalid FloatObject b'0.000000000000000-7851537'
Invalid FloatObject b'0.0000000000000000000000000000000-6162976'
Invalid FloatObject b'0.0000000000000000000000000000000-8716957'
Invalid FloatObject b'0.000000000000-14210855'
' 2\njul. 20\n\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\nago. 20\n\nSun\nMon\nTue\nWed\nThu\nFri\nSat\n21\n01\n02\n03\n04\n05\n06\n07\n08\n09\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\n\n'
>>> import PyPDF2; PyPDF2.__version__
'2.4.1'

Looking at the document, that looks ok. So text extraction works.

MartinThoma · 2022-07-05T08:23:05Z

This issue was fixed by @Hatell via #1054. It will be part of PyPDF2>=2.4.2. I will make that release on PyPI probably this evening.

MartinThoma added the is-bug label (from a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF) Apr 7, 2022

MartinThoma mentioned this issue Jun 30, 2022

PDF from Google Sheet doesn't merge with PdfMerger when import_bookmarks is True #1034

Closed

Hatell mentioned this issue Jul 4, 2022

Resolve IndirectObject when it refers to a free entry. #1054

Merged

MartinThoma closed this as completed in 02c601c Jul 5, 2022
