KeyError: '/Contents' #353
Comments
We also have this error but would expect at least a PdfReadError instead of a KeyError. |
I'm working on some code that is supposed to convert ~20,000 PDFs to text files for natural language processing, and I get the same error. I'm using this:

    import os
    import PyPDF2

    i = 0
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            i += 1
            filedir = os.path.join(subdir, file)
            print(i, filedir)
            pdfFileObj = open(filedir, 'rb')
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
            # splitext removes only the extension; str.strip(".pdf") would also
            # eat leading/trailing '.', 'p', 'd' and 'f' characters of the name
            text_file = open(os.path.splitext(file)[0] + '.txt', "w")
            for page in pdfReader.pages:
                text = str(page.extractText())
                text = cleanup(text)  # some function that looks for odd substrings and such
                text_file.write(text)
            text_file.close()
            pdfFileObj.close()

Note that I had to explicitly convert the extracted text to a string (I had some errors otherwise).
I'm guessing that this is due to a corrupted PDF (/contents instead of /Contents, or a missing field, something like that), but since I have 20,000 PDFs and really need all of them properly converted, I need to make sure exceptions like these are handled. This error came up on roughly the 40th PDF, which was a non-secure, non-optimized PDF-1.2 file. Any fixes, workarounds, or suggestions? (I'm trying to see what's in the PDF.) |
I have worked around this temporarily by wrapping the extraction in a try/except for the specific PDF, since the problem lies with the version, make, or content of the PDF. |
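The try/except workaround described above could look roughly like the following. This is a minimal sketch, not the commenter's actual code: `safe_extract_text` is a hypothetical helper name, and it assumes `page` exposes `extractText()` like the PyPDF2 page objects used in this thread.

```python
def safe_extract_text(page, label=""):
    """Return the text of `page`, or "" if extraction raises KeyError.

    `page` is assumed to expose extractText(), like the PyPDF2 page
    objects used in this thread. `label` is only used in the log message.
    """
    try:
        return str(page.extractText())
    except KeyError as exc:
        # For example KeyError: '/Contents' on a page without a content stream.
        print("skipping page {}: missing key {}".format(label, exc))
        return ""
```

In a batch job, calling this helper instead of `page.extractText()` lets the loop continue past a problematic page while still logging which one was skipped.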
When opening the file for reading, open it with UTF-8 encoding; that will solve the problem.
Some PDFs contain corrections and highlighting, which is why this error occurs.
On Sun, Oct 8, 2017 at 10:57 PM, Amalgamator wrote:
    Traceback (most recent call last):
      File "converter.py", line 17, in <module>
        text = str(page.extractText())
      File "/usr/local/lib/python3.5/dist-packages/PyPDF2/pdf.py", line 2591, in extractText
        content = self["/Contents"].getObject()
      File "/usr/local/lib/python3.5/dist-packages/PyPDF2/generic.py", line 516, in __getitem__
        return dict.__getitem__(self, key).getObject()
    KeyError: '/Contents'
Best Regards,
Puneet Sinha
|
I am facing the same problem. Can you please help me with it?

    import PyPDF2

    file = "Combined spec_CP1CP2_Shubham.pdf"
    file1 = file.encode('UTF-8')
    pdfFileObj = open(file1, 'rb')  # open the PDF file in binary mode
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)  # create the PDF reader object
    print(pdfReader.numPages)  # number of pages in the PDF file
    for i in range(pdfReader.getNumPages()):  # iterate over the page indices
        page = pdfReader.getPage(i)
        print('Page No - ' + str(1 + pdfReader.getPageNumber(page)))
        page_content = page.extractText()  # extract the page text
        print(page_content)
    pdfFileObj.close()

|
I also have this error and cannot fix it. Can someone help me? Below is my code:

    import textract
    from PyPDF2 import PdfFileReader

    def text_extractor(filePath=""):
        fileObj = open(filePath, 'rb')
        pdf = PdfFileReader(fileObj)
        totalPage = pdf.numPages
        print("This pdf file contains totally " + str(totalPage) + " pages.")
        currentPage = 0
        text = ""
        while currentPage < totalPage:
            pdfPage = pdf.getPage(currentPage)
            text = text + pdfPage.extractText()
            currentPage += 1
        # Fall back to OCR when no text could be extracted
        if text == "":
            text = textract.process(filePath, method='tesseract', encoding='utf8')
        return text

|
I found that there are some blank pages when I used try/except to print the numbers of the pages that raise errors. How about you? |
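One way to check the blank-page observation is to test for the `/Contents` key before calling `extractText()`. The traceback earlier in the thread shows that page objects are looked up like dictionaries, and the PDF specification makes `/Contents` optional for a page (a page without it is empty). The sketch below assumes `pages` is an iterable of dict-like page objects such as `pdfReader.pages`; `pages_without_contents` is a hypothetical helper name.

```python
def pages_without_contents(pages):
    """Return the zero-based indices of pages that have no /Contents entry.

    `pages` is assumed to be an iterable of dict-like page objects,
    for example the `pages` attribute of a PyPDF2 reader.
    """
    return [index for index, page in enumerate(pages)
            if "/Contents" not in page]
```

Pages reported by this helper can be skipped or written out as empty text instead of being passed to `extractText()`.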
Hi, I need the code you mentioned. Could you send it to me? |
Here:

    while currentPage < totalPage:
        pdfPage = pdf.getPage(currentPage)
        try:
            text = text + pdfPage.extractText()
        except Exception:
            # Print the index of the page that failed to extract
            print(currentPage)
        currentPage += 1
    if text == "":
        text = textract.process(filePath, method='tesseract', encoding='utf8')

|
Has somebody a PDF + code example that shows the issue? |
@mpeuss @marceloid @puneetsinha @Amalgamator @anthng @Namrata-1995 Many improvements have been introduced in the latest versions. Can you re-test and give feedback? |
I'm closing this issue now as I believe it was fixed. Please leave a comment if you still run into this problem with a recent PyPDF2 version. |
When and why do I get this error? Is there any workaround for this case?