
Ingest Pipeline fails with 'can't concat int to bytes' #115

Closed
ilmcconnell opened this issue Aug 14, 2020 · 2 comments
@ilmcconnell
Contributor

Code: branch apiv1
Steps to reproduce: on cosmos0003, start a Dask cluster with two GPUs (spawn_dask_cluster_2_gpu.sh), then run the ingestion script (ingest_documents_timing.sh).
Result: ingestion fails with an error on this document: 5e7a19df998e17af826615ac.pdf
Error:

ERROR :: 2020-08-11 17:56:01,387 :: can't concat int to bytes
Traceback (most recent call last):
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 198, in pdf_to_images
meta, limit = parse_pdf(filename)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/utils/pdf_extractor.py", line 45, in parse_pdf
interpreter.process_page(page)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 908, in render_contents
self.execute(list_value(streams))
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 936, in execute
func()
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 506, in do_s
self.do_S()
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 499, in do_S
self.curpath)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/converter.py", line 115, in paint_path
pts.append(apply_matrix_pt(self.ctm, (p[i], p[i+1])))
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/utils.py", line 138, in apply_matrix_pt
return a * x + c * y + e, b * x + d * y + f
TypeError: can't concat int to bytes
distributed.worker - WARNING - Compute Failed
Function: pdf_to_images
args: ('/hdd/iain/covid_docs_25Mar_all/5e7a19df998e17af826615ac.pdf')
kwargs: {}
Exception: Exception('Parsing error', "can't concat int to bytes")
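
For anyone reading the traceback: the last frame is plain arithmetic, so the message suggests one of the path coordinates reached `apply_matrix_pt` as `bytes` rather than a number. The snippet below is a minimal sketch of that failure mode, not pdfminer's code. It uses a local copy of the function modeled on the return line shown in the traceback, and the matrix and point values are made up for illustration.

```python
def apply_matrix_pt(m, v):
    # Local stand-in modeled on the line in the traceback; assumed shape,
    # not imported from pdfminer.
    a, b, c, d, e, f = m
    x, y = v
    return a * x + c * y + e, b * x + d * y + f


# Hypothetical identity transform and a point whose x coordinate is bytes.
ctm = (1, 0, 0, 1, 0, 0)
bad_point = (b"12", 5)

try:
    apply_matrix_pt(ctm, bad_point)
    message = None
except TypeError as exc:
    # int * bytes repeats the bytes, so a * x is still bytes;
    # adding the int result of c * y to it then raises TypeError.
    message = str(exc)

print(message)
```

With integer matrix entries the multiplication succeeds (it repeats the bytes) and the failure only appears at the addition, which matches the wording of the error above.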

@ilmcconnell ilmcconnell added the bug Something isn't working label Aug 14, 2020
@ilmcconnell ilmcconnell self-assigned this Aug 14, 2020
@ilmcconnell ilmcconnell changed the title Ingest Pipeline fails with 'Parsing error' Ingest Pipeline fails with 'can't concat int to bytes' Aug 14, 2020
@ankur-gos ankur-gos self-assigned this Aug 14, 2020
@ankur-gos
Contributor

I was able to reproduce the error. It looks like a PDF parsing error that should be fixed upstream in pdfminer. In the meantime, I'm going to catch this exception and skip metadata extraction for any PDF that triggers it.
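
A rough sketch of what catching the exception and skipping metadata could look like. This is an assumption about the shape of the workaround, not the actual change. The helper name `extract_metadata_or_skip`, the `(None, None)` fallback, and the `failing_parser` stand-in are all hypothetical; only the `meta, limit = parse_pdf(filename)` call shape comes from the traceback.

```python
import logging

logger = logging.getLogger(__name__)


def extract_metadata_or_skip(filename, parse_pdf):
    """Return (meta, limit) from parse_pdf, or (None, None) if parsing fails.

    Hypothetical helper: parse_pdf is passed in so the sketch stays
    self-contained instead of importing the project's extractor.
    """
    try:
        return parse_pdf(filename)
    except Exception as exc:  # broad on purpose: the failures come from inside the PDF library
        logger.warning("Skipping metadata extraction for %s: %s", filename, exc)
        return None, None


def failing_parser(filename):
    # Stand-in that mimics the reported failure.
    raise TypeError("can't concat int to bytes")


result = extract_metadata_or_skip("broken.pdf", failing_parser)
```

Callers would then need to treat a `None` metadata value as "no metadata available" and carry on with the rest of ingestion for that document.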

@ankur-gos
Contributor

Fixed with #120
