Code: branch apiv1
Re-create: run on cosmos0003, with two GPUs in the dask cluster (spawn_dask_cluster_2_gpu.sh), then run the ingestion script (ingest_documents_timing.sh)
Result: Ingestion fails with an error on this document: 5e7a19df998e17af826615ac.pdf
Error:
ERROR :: 2020-08-11 17:56:01,387 :: can't concat int to bytes
Traceback (most recent call last):
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/ingest.py", line 198, in pdf_to_images
meta, limit = parse_pdf(filename)
File "/ssd/iain/Cosmos/cosmos/ingestion/ingest/utils/pdf_extractor.py", line 45, in parse_pdf
interpreter.process_page(page)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 908, in render_contents
self.execute(list_value(streams))
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 936, in execute
func()
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 506, in do_s
self.do_S()
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 499, in do_S
self.curpath)
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/converter.py", line 115, in paint_path
pts.append(apply_matrix_pt(self.ctm, (p[i], p[i+1])))
File "/home/imcconnell2/miniconda3/envs/cosmos/lib/python3.7/site-packages/pdfminer/utils.py", line 138, in apply_matrix_pt
return a * x + c * y + e, b * x + d * y + f
TypeError: can't concat int to bytes
distributed.worker - WARNING - Compute Failed
Function: pdf_to_images
args: ('/hdd/iain/covid_docs_25Mar_all/5e7a19df998e17af826615ac.pdf')
kwargs: {}
Exception: Exception('Parsing error', "can't concat int to bytes")
I was able to reproduce the error. It looks like a PDF parsing error inside pdfminer, and the real fix belongs upstream in pdfminer. For now, I'm going to catch this exception and skip metadata extraction for any PDF that triggers it.