KeyError: '/Contents' #353

puneetsinha · 2017-06-11T04:35:04Z

When and why do I get this error? Is there any workaround for this case?

KeyError                                  Traceback (most recent call last)
<ipython-input-5-1d51d9a98e6c> in <module>()
      9     page_content= ''
     10     for page in range(0,number_of_pages):
---> 11         page_content += read_pdf.getPage(page).extractText()
     12         #print(page_content.encode('utf-8'))
     13     textFilename = output_dir + base_file + ".txt"

~\AppData\Local\Continuum\Anaconda3\envs\tensorflow\lib\site-packages\PyPDF2\pdf.py in extractText(self)
   2655         """
   2656         text = u_("")
-> 2657         content = self["/Contents"].getObject()
   2658         if not isinstance(content, ContentStream):
   2659             content = ContentStream(content, self.pdf)

~\AppData\Local\Continuum\Anaconda3\envs\tensorflow\lib\site-packages\PyPDF2\generic.py in __getitem__(self, key)
    516 
    517     def __getitem__(self, key):
--> 518         return dict.__getitem__(self, key).getObject()
    519 
    520     ##

KeyError: '/Contents'


mpeuss · 2017-06-29T09:26:15Z

We also have this error but would expect at least a PdfReadError instead of a KeyError.
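Until the library raises something more specific, the translation can be done on the caller's side. This is a minimal sketch, not PyPDF2's own behavior: `PageExtractionError` and `extract_text_or_raise` are hypothetical names introduced for illustration, and the page is only assumed to offer the `extractText()` method seen in the traceback.

```python
class PageExtractionError(Exception):
    """Hypothetical error type for this sketch; not part of PyPDF2."""


def extract_text_or_raise(page, page_number):
    """Call page.extractText() and turn a KeyError into a clearer error."""
    try:
        return page.extractText()
    except KeyError as exc:
        # exc.args[0] is the missing dictionary key, e.g. '/Contents'
        raise PageExtractionError(
            "page %d has no %s entry" % (page_number, exc.args[0])
        ) from exc
```

Chaining with `from exc` keeps the original KeyError available as `__cause__`, so the underlying traceback is not lost.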

Amalgamator · 2017-10-08T17:27:18Z

I'm working on some code that is supposed to convert ~20000 PDFs to text files for natural language processing. I also get the above-mentioned error. I'm using this:

import os
import PyPDF2

i=0
for subdir, dirs, files in os.walk(rootdir):
	for file in files:
		i += 1
		filedir = os.path.join(subdir, file)
		print(i,filedir)
		pdfFileObj = open(filedir,'rb')
		pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

		# splitext drops only the extension; strip(".pdf") would also eat leading/trailing 'p', 'd', 'f' and '.' characters
		text_file = open(os.path.splitext(file)[0]+'.txt', "w")
		for page in pdfReader.pages:
			text = str(page.extractText())
			text = cleanup(text)  # some function that looks for odd substrings and such
			text_file.write(text)
		
		text_file.close()
		pdfFileObj.close()

Note that I had to explicitly convert the extracted text to a string (I got some errors otherwise).

Traceback (most recent call last):
  File "converter.py", line 17, in <module>
    text = str(page.extractText())
  File "/usr/local/lib/python3.5/dist-packages/PyPDF2/pdf.py", line 2591, in extractText
    content = self["/Contents"].getObject()
  File "/usr/local/lib/python3.5/dist-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
KeyError: '/Contents'

I'm guessing that this is due to a corrupted PDF (/contents instead of /Contents, or the field is missing, something like that), but since I have 20000 PDFs and I really need all of them properly converted, I need to make sure exceptions like these are handled. This error came up on the ~40th PDF, which was a non-secure, non-optimized PDF-1.2 file.

Any fixes/workarounds/suggestions? (I'm trying to see what's in the PDF.)
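One possible approach for a batch of this size, as a minimal sketch: wrap the per-file work in try/except so that a bad PDF is recorded and skipped rather than stopping the run. `convert_all` and its `convert` argument are hypothetical names for illustration; `convert` stands in for whatever opens one PDF with PyPDF2 and writes its text file.

```python
import os


def convert_all(rootdir, convert):
    """Walk rootdir, call convert(path) on every .pdf, and keep going on errors.

    Returns (converted, failed); failed holds (path, error message) pairs.
    """
    converted, failed = [], []
    for subdir, _dirs, files in os.walk(rootdir):
        for name in sorted(files):
            if not name.lower().endswith(".pdf"):
                continue  # ignore non-PDF files
            path = os.path.join(subdir, name)
            try:
                convert(path)
            except Exception as exc:
                # Broad on purpose: one bad PDF must not stop the batch.
                failed.append((path, "%s: %s" % (type(exc).__name__, exc)))
            else:
                converted.append(path)
    return converted, failed
```

The `failed` list can be inspected once the run finishes, for example to retry those files with another tool.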

puneetsinha · 2017-12-20T07:07:06Z

I have solved this temporarily by using try/except for the specific PDF, as this is an issue with the version, make, or content of the PDF.
If anyone needs the code, message me and I will pass it on.

puneetsinha · 2018-03-19T05:57:26Z

When opening the file in read mode, open it with UTF-8 encoding; that will solve the problem. Some PDFs have corrections and highlighting, which is why this error occurs.

Namrata-1995 · 2018-11-29T07:20:29Z

I am facing the same problem. Can you please help me with it?

file="Combined spec_CP1CP2_Shubham.pdf"
file1=file.encode('UTF-8')
pdfFileObj = open(file1, 'rb')             # create pdf file object(pdf file open in binary mode)
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)      # create pdf reader object
print(pdfReader.numPages)                         # number of pages in pdf file

for i in xrange(pdfReader.getNumPages()):         # get number of pages
    page = pdfReader.getPage(i)
    print 'Page No - ' + str(1+pdfReader.getPageNumber(page))
    page_content = page.extractText()                 # extract data
    print page_content
pdfFileObj.close()

anthng · 2019-03-18T04:05:06Z

I also have this error and cannot fix it. Can someone help me? Below is my code:

import textract
from PyPDF2 import PdfFileReader

def text_extractor(filePath=""):
    # the with-block closes the file even if extraction raises
    with open(filePath, 'rb') as fileObj:
        pdf = PdfFileReader(fileObj)
        totalPage = pdf.numPages

        print("This pdf file contains totally " + str(totalPage) +  " pages.")

        currentPage = 0
        text = ""

        while(currentPage < totalPage):
            pdfPage = pdf.getPage(currentPage)
            text = text + pdfPage.extractText()
            currentPage += 1

    if(text == ""):
        # fall back to OCR when no text could be extracted
        text = textract.process(filePath, method='tesseract', encoding='utf8')

    return text

anthng · 2019-03-18T05:12:33Z

When I used try/except to print the numbers of the pages that raise errors, I found that some of them are blank pages. How about you?
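Blank pages would fit the traceback: `extractText` reads `self["/Contents"]`, and `__getitem__` in generic.py delegates to `dict.__getitem__`, so a page dictionary without that key raises KeyError. A minimal sketch of checking for the key first, assuming each page behaves like a dict with an `extractText()` method; `extract_text_skipping_blank` is a hypothetical helper, not a PyPDF2 function.

```python
def extract_text_skipping_blank(pages):
    """Concatenate text from pages that have a /Contents entry.

    Returns (text, skipped); skipped lists the zero-based indexes of
    pages without a content stream.
    """
    parts, skipped = [], []
    for index, page in enumerate(pages):
        if "/Contents" not in page:  # dict membership test, no exception raised
            skipped.append(index)
            continue
        parts.append(page.extractText())
    return "".join(parts), skipped
```

With the reader from the code above, this would be called as `extract_text_skipping_blank(pdf.pages)`.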

marceloid · 2019-11-19T15:20:50Z

> I have solved this temporarily by using try/except for the specific PDF, as this is an issue with the version, make, or content of the PDF.
> If anyone needs the code, message me and I will pass it on.

Hi, I need the mentioned code. Could you send it to me?

anthng · 2020-03-12T09:38:27Z

>> I have solved this temporarily by using try/except for the specific PDF, as this is an issue with the version, make, or content of the PDF.
>> If anyone needs the code, message me and I will pass it on.
>
> Hi, I need the mentioned code. Could you send it to me?

Here:

while(currentPage < totalPage):
    pdfPage = pdf.getPage(currentPage)
    try:
        text = text + pdfPage.extractText()
    except KeyError:
        # page has no /Contents entry; report its index and move on
        print(currentPage)
    currentPage += 1

if(text == ""):
    text = textract.process(filePath, method='tesseract', encoding='utf8')

MartinThoma · 2022-04-07T14:46:05Z

Does anybody have a PDF and a code example that show the issue?

pubpub-zz · 2022-06-19T12:18:05Z

@mpeuss @marceloid @puneetsinha @Amalgamator @anthng @Namrata-1995

Many improvements have been introduced in the latest versions. Can you re-test and give feedback?

MartinThoma · 2022-06-26T07:52:12Z

I'm closing this issue now as I believe it was fixed. Please leave a comment if you still run into this problem with a recent PyPDF2 version.

tylerdq mentioned this issue Jun 21, 2019

Force abort when exceptions occur tylerdq/pdfca#2

Closed

sharang108 mentioned this issue Aug 11, 2019

BUG: Handling 'Keyerror issue" #512

Closed

MartinThoma added the "is-bug" label (From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF) on Apr 7, 2022

MartinThoma added the "needs-change" label (The PR/issue cannot be handled as issue and needs to be improved) on Apr 16, 2022

MartinThoma added the "needs-pdf" (The issue needs a PDF file to show the problem) and "needs-example-code" (The issue needs a minimal and complete (e.g. all imports) example showing the problem) labels and removed the "needs-change" label on Jun 26, 2022

MartinThoma closed this as completed on Jun 26, 2022

MartinThoma added the "key-error" label (Could be a bug, but also a robustness issue) on Aug 19, 2023
