KeyError: '/Contents' #353

Closed
puneetsinha opened this issue Jun 11, 2017 · 12 comments
Labels
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
key-error: Could be a bug, but also a robustness issue
needs-example-code: The issue needs a minimal and complete (e.g. all imports) example showing the problem
needs-pdf: The issue needs a PDF file to show the problem

Comments


puneetsinha commented Jun 11, 2017

When and why do I get this error? Is there any workaround for this case?

KeyError                                  Traceback (most recent call last)
<ipython-input-5-1d51d9a98e6c> in <module>()
      9     page_content= ''
     10     for page in range(0,number_of_pages):
---> 11         page_content += read_pdf.getPage(page).extractText()
     12         #print(page_content.encode('utf-8'))
     13     textFilename = output_dir + base_file + ".txt"

~\AppData\Local\Continuum\Anaconda3\envs\tensorflow\lib\site-packages\PyPDF2\pdf.py in extractText(self)
   2655         """
   2656         text = u_("")
-> 2657         content = self["/Contents"].getObject()
   2658         if not isinstance(content, ContentStream):
   2659             content = ContentStream(content, self.pdf)

~\AppData\Local\Continuum\Anaconda3\envs\tensorflow\lib\site-packages\PyPDF2\generic.py in __getitem__(self, key)
    516 
    517     def __getitem__(self, key):
--> 518         return dict.__getitem__(self, key).getObject()
    519 
    520     ##

KeyError: '/Contents'

mpeuss commented Jun 29, 2017

We also have this error but would expect at least a PdfReadError instead of a KeyError.
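One way to get a more descriptive error today, without patching the library, is to wrap the call yourself. The following is only a sketch: `PageContentError` and `extract_text_or_raise` are hypothetical names made up for this example, not part of PyPDF2, and the page object is assumed to expose the `extractText()` method seen in the traceback.

```python
class PageContentError(Exception):
    """Hypothetical error type: a page could not be read as expected."""


def extract_text_or_raise(page, page_number):
    """Call page.extractText() and re-raise a missing-key failure
    as PageContentError with the page number attached."""
    try:
        return page.extractText()
    except KeyError as exc:
        # exc.args[0] is the missing key, e.g. '/Contents'
        raise PageContentError(
            f"page {page_number} has no {exc.args[0]} entry"
        ) from exc
```

The original `KeyError` stays reachable through `__cause__`, so nothing is lost for debugging.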


Amalgamator commented Oct 8, 2017

I'm working on some code that is supposed to convert ~20000 PDFs to text files for natural language processing. I also get the above-mentioned error. I'm using this:

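import os
import PyPDF2

# rootdir and cleanup() are defined elsewhere in the script
i = 0
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        i += 1
        filedir = os.path.join(subdir, file)
        print(i, filedir)
        # os.path.splitext removes only the extension; file.strip(".pdf")
        # would also strip leading/trailing ".", "p", "d", "f" characters
        txt_name = os.path.splitext(file)[0] + ".txt"
        with open(filedir, "rb") as pdfFileObj, open(txt_name, "w") as text_file:
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
            for page in pdfReader.pages:
                text = str(page.extractText())
                text = cleanup(text)  # some function that looks for odd substrings and such
                text_file.write(text)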

Note that I had to explicitly state my text has to be a string (I had some errors otherwise).

Traceback (most recent call last):
  File "converter.py", line 17, in <module>
    text = str(page.extractText())
  File "/usr/local/lib/python3.5/dist-packages/PyPDF2/pdf.py", line 2591, in extractText
    content = self["/Contents"].getObject()
  File "/usr/local/lib/python3.5/dist-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
KeyError: '/Contents'

I'm guessing that this is due to a corrupted PDF (/contents instead of /Contents, or a missing field, something like that), but since I have 20000 PDFs and I really need all of them properly converted, I need to make sure exceptions like these are handled. This error came up on the ~40th PDF, which was a non-secure, non-optimized PDF-1.2 file.

Any fixes, workarounds, or suggestions? (I'm trying to see what's in the PDF.)
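For a large batch like this, one possible approach is to catch the error per page and keep a list of the pages that failed, so a single bad page does not stop the whole run. This is a minimal sketch, not a fix for the underlying problem: `extract_all_pages` is a hypothetical helper, and it assumes each page object has the `extractText()` method used above.

```python
def extract_all_pages(pages):
    """Return (text, failed), where failed lists the indexes of pages
    whose extractText() raised KeyError."""
    chunks = []
    failed = []
    for index, page in enumerate(pages):
        try:
            chunks.append(str(page.extractText()))
        except KeyError:
            # e.g. KeyError: '/Contents'; record the page and carry on
            failed.append(index)
    return "".join(chunks), failed
```

In the loop above it could be called as `text, failed = extract_all_pages(pdfReader.pages)`, and any file with a non-empty `failed` list can be logged for manual inspection afterwards.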

@puneetsinha
Author

I have solved this temporarily by wrapping the extraction in a try/except for the specific PDF, since the problem lies with the version, make, or content of that PDF.
If anyone needs the code, message me and I will pass it on.

Author

puneetsinha commented Mar 19, 2018 via email


Namrata-1995 commented Nov 29, 2018

I am facing the same problem. Can you please help me with it?

import PyPDF2

file = "Combined spec_CP1CP2_Shubham.pdf"
pdfFileObj = open(file, 'rb')                     # create pdf file object (pdf file opened in binary mode)
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)      # create pdf reader object
print(pdfReader.numPages)                         # number of pages in pdf file

for i in range(pdfReader.getNumPages()):          # loop over every page index
    page = pdfReader.getPage(i)
    print('Page No - ' + str(1 + pdfReader.getPageNumber(page)))
    page_content = page.extractText()             # extract data
    print(page_content)
pdfFileObj.close()


anthng commented Mar 18, 2019

I also have this error and cannot fix it. Can someone help me? Below is my code:

import textract
from PyPDF2 import PdfFileReader

def text_extractor(filePath=""):
    # the with block closes the file once all pages have been read
    with open(filePath, 'rb') as fileObj:
        pdf = PdfFileReader(fileObj)
        totalPage = pdf.numPages

        print("This pdf file contains totally " + str(totalPage) + " pages.")

        currentPage = 0
        text = ""

        while currentPage < totalPage:
            pdfPage = pdf.getPage(currentPage)
            text = text + pdfPage.extractText()
            currentPage += 1

    # no text layer found: fall back to OCR
    if text == "":
        text = textract.process(filePath, method='tesseract', encoding='utf8')

    return text


anthng commented Mar 18, 2019

I found out that there are some blank pages when I used try/except to print the numbers of the pages that raise errors. How about you?
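The blank-page observation fits the traceback: the PDF specification lists /Contents as an optional page entry, and a page without it is simply empty. A possible guard, assuming the page object behaves like a dict (the traceback shows `__getitem__` delegating to `dict.__getitem__`), is to test for the key before extracting. `extract_text_or_empty` is a hypothetical helper name used only for this sketch.

```python
def extract_text_or_empty(page):
    """Return "" for a page dictionary that has no /Contents entry,
    otherwise return whatever page.extractText() produces."""
    if "/Contents" not in page:
        # blank page: nothing to extract
        return ""
    return page.extractText()
```

This only covers the missing-key case. Any other failure inside `extractText()` still propagates to the caller.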

@marceloid

> I just have solved this temporarily by using try catch for a specific PDF as this is issues with version make or content of PDF.
> anyone needs code msg me i will pass

Hi, I need the mentioned code. Could you send it to me?


anthng commented Mar 12, 2020

> > I just have solved this temporarily by using try catch for a specific PDF as this is issues with version make or content of PDF.
> > anyone needs code msg me i will pass
>
> Hi, I need the mentioned code. Could you send it for me?

Here it is:

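while currentPage < totalPage:
    pdfPage = pdf.getPage(currentPage)
    try:
        text = text + pdfPage.extractText()
    except KeyError:
        # page without '/Contents' (e.g. a blank page): report its index and skip it
        print(currentPage)
    currentPage += 1

# no text extracted at all: fall back to OCR
if text == "":
    text = textract.process(filePath, method='tesseract', encoding='utf8')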

MartinThoma added the is-bug label on Apr 7, 2022
@MartinThoma
Member

Does somebody have a PDF and a code example that show the issue?

MartinThoma added the needs-change label (The PR/issue cannot be handled as issue and needs to be improved) on Apr 16, 2022
@pubpub-zz
Collaborator

@mpeuss @marceloid @puneetsinha @Amalgamator @anthng @Namrata-1995

Many improvements have been introduced in the latest versions. Can you re-test and give feedback?

MartinThoma added the needs-pdf and needs-example-code labels and removed the needs-change label on Jun 26, 2022
@MartinThoma
Member

I'm closing this issue now as I believe it was fixed. Please leave a comment if you still run into this problem with a recent PyPDF2 version.
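Before re-testing, it can help to confirm which PyPDF2 version is actually installed in the environment that shows the error. A small sketch using only the standard library (Python 3.8 or later); `installed_version` is a hypothetical helper name chosen for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None
    if the package is not installed in this environment."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("PyPDF2", installed_version("PyPDF2"))
```

Including that version string in a follow-up comment makes it easier to tell whether a report concerns a recent release.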

MartinThoma added the key-error label on Aug 19, 2023