Add handler to cope when indexing fails.
anjackson committed Dec 14, 2022
1 parent 7f305db commit f705e1c
Showing 1 changed file with 9 additions and 1 deletion.
10 changes: 9 additions & 1 deletion lib/windex/mr_cdx_pywb_job.py
@@ -134,7 +134,15 @@ def mapper_raw(self, warc_path, warc_uri):
         # CDX N b a m s k r M S V g
         # com,example)/ 20170306040206 http://example.com/ text/html 200 G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK - - 1242 784 example.warc.gz
         cdx11 = CDX11Indexer(inputs=[warc_path], output=cdx_file, cdx11=True, post_append=True)
-        cdx11.process_all()
+        # Some WARCs throw an exception during indexing. To avoid everything getting stuck we need to catch these errors and
+        # record them for later investigation rather than killing the whole job:
+        self.set_status('Running cdxj_indexer on %s ...' % warc_path)
+        try:
+            cdx11.process_all()
+        except Exception as e:
+            yield f"__by_file {warc_path} warc_cdx_indexing_exception_s", str(e)
+            # Do not process output of failed process:
+            return

         # The warc_path we get passed in is just the local temp filename.
         # We need to use the HDFS file URI instead and extract the path:
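
The pattern this commit introduces can be tried in isolation. Below is a minimal sketch, assuming a plain generator function in place of the real job method; `good_indexer` and `bad_indexer` are hypothetical stand-ins for `CDX11Indexer.process_all()`, and the `"cdx"` key is illustrative only. It shows that a failing input produces exactly one error record and that the bare `return` stops the generator before any further output is emitted.

```python
def mapper_raw(warc_path, indexer):
    """Yield (key, value) pairs for one WARC, or a single error record.

    `indexer` is a hypothetical callable standing in for the real indexing step.
    """
    try:
        lines = indexer(warc_path)
    except Exception as e:
        # Record the failure under a per-file key, mirroring the key format in the commit.
        yield f"__by_file {warc_path} warc_cdx_indexing_exception_s", str(e)
        # A bare return ends the generator, so no output of the failed run is processed.
        return
    for line in lines:
        yield "cdx", line


def good_indexer(path):
    # Hypothetical indexer that succeeds and returns one index line.
    return [f"com,example)/ 20170306040206 {path}"]


def bad_indexer(path):
    # Hypothetical indexer that fails, as some WARCs reportedly do.
    raise ValueError("truncated gzip member")


ok_records = list(mapper_raw("a.warc.gz", good_indexer))
failed_records = list(mapper_raw("b.warc.gz", bad_indexer))
```

Because the error is emitted as an ordinary record rather than raised, one bad file does not fail the task, and the failures can be collected from the job output afterwards.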