
The warc-hadoop-indexer project also contains metadata extraction (MDX) tools that can be applied as follows.

## Step 1 - create metadata extractions for every resource

We break the whole 1996-2013 collection into 6 chunks (at least 86,000 ARCs or WARCs per chunk) and then run the WARCMDXGenerator over each chunk to create a set of MDX metadata objects, stored in a form that supports further map-reduce processing.

It works by running the same indexer over the resources as we use in the Solr-indexing workflow, but rather than sending the data to Solr, each record is converted to a simple map of properties and stored in map-reduce-friendly sequence files. Because the sequence files are sorted by the hash of the entity body, this provides a means for re-duplication to be carried out.

Invocation looked like this:

```
hadoop jar warc-hadoop-indexer/target/warc-hadoop-indexer-2.2.0-BETA-6-SNAPSHOT-job.jar uk.bl.wa.hadoop.mapreduce.mdx.WARCMDXGenerator -i mdx_arcs_b -o mdx-arcs-b
```

where `mdx_arcs_b` is a list of the ARC and WARC files to process. The sequence files are then deposited in `mdx-arcs-b`.
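
As a quick sanity check, the resulting sequence files can be read back with the standard Hadoop `SequenceFile.Reader` API. The following is a minimal sketch, assuming the keys and values are plain `Text` (key = hash of the entity body, value = the serialised property map); check `getKeyClass()`/`getValueClass()` on the reader if the actual types differ. (Alternatively, `hadoop fs -text` will decode sequence files directly for quick inspection.)

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Minimal sketch: print the first few key/value pairs of one MDX part file.
public class MDXSeqPeek {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path part = new Path(args[0]); // e.g. mdx-arcs-b/part-r-00000
        try (SequenceFile.Reader reader =
                new SequenceFile.Reader(conf, SequenceFile.Reader.file(part))) {
            Text key = new Text();   // assumed: hash of the entity body
            Text value = new Text(); // assumed: serialised MDX property map
            int shown = 0;
            while (reader.next(key, value) && shown++ < 10) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}
```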

The time taken to process the six chunks is shown below (note that other tasks were running on the cluster at various times, hence the variability).

| Chunk | Time | Records | NULLs | Errors | HDFS bytes read | MDX seq. size (bytes) |
| --- | --- | --- | --- | --- | --- | --- |
| a | 33hrs, 57mins, 1sec | 538,761,419 | 116,127,700 | 0 | 5,661,783,544,289 (5.66 TB) | 186,053,418,102 |
| b | 19hrs, 23mins, 6sec | 475,516,515 | 77,813,176 | 0 | 6,279,242,837,014 | 158,101,113,403 |
| c | 18hrs, 56mins, 11sec | 539,713,722 | 93,696,252 | 0 | 5,802,813,832,422 | 180,454,498,713 |
| d | 28hrs, 56mins, 11sec | 524,143,344 | 89,077,559 | 8 | 5,918,383,825,363 | 177,392,768,820 |
| e | 35hrs, 26mins, 40sec | 501,582,602 | 101,396,811 | 1 | 6,505,417,693,908 | 180,565,740,110 |
| f | 72hrs, 19mins, 35sec | 1,353,142,719 | 332,045,791 | 14 | 29,129,360,605,968 | 462,439,132,652 |

So, the output is about 2-3% of the input (e.g. for chunk a, 186 GB of MDX from 5.66 TB read, or roughly 3.3%).

## Step 2 - merge the entire set of MDX sequence files

We then merge all of the chunks into one set of sequence files, sorted by hash, re-duplicating revisit records as we go.

```
hadoop jar warc-hadoop-indexer/target/warc-hadoop-indexer-2.2.0-BETA-6-SNAPSHOT-job.jar uk.bl.wa.hadoop.mapreduce.mdx.MDXSeqMerger -i mdx-arcs-ef-parts -o mdx-arcs-ef -r 50
```

Note that when merging just chunks e and f (the only ones containing WARCs), there were 162,103,566 revisit records out of 1,854,725,336 records in total, but 2,482,073 of those revisits were left unresolved!
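
To illustrate what the re-duplication involves, here is a conceptual sketch of a reducer working on records grouped by payload hash: revisit records inherit the payload-derived properties of a full record with the same hash, and revisits with no matching record are counted as unresolved. This is not the actual MDXSeqMerger implementation; the `recordType` and `properties` field names, and the use of org.json for the property map, are assumptions for illustration only.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.json.JSONObject;

// Conceptual sketch only -- not the actual MDXSeqMerger reducer.
// Assumes each value is a JSON-serialised MDX property map, with a
// hypothetical "recordType" field marking revisit records.
public class ReduplicatingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text hash, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        JSONObject original = null;
        List<JSONObject> revisits = new ArrayList<>();
        for (Text v : values) {
            JSONObject mdx = new JSONObject(v.toString());
            if ("revisit".equals(mdx.optString("recordType"))) {
                revisits.add(mdx); // resolve after the whole group is seen
            } else {
                if (original == null) {
                    original = mdx; // first full record for this payload hash
                }
                context.write(hash, new Text(mdx.toString()));
            }
        }
        for (JSONObject revisit : revisits) {
            if (original != null) {
                // Copy the payload-derived properties from the full record.
                revisit.put("properties", original.opt("properties"));
            } else {
                // No full record with this hash anywhere in the collection.
                context.getCounter("MDX", "NUM_UNRESOLVED_REVISITS").increment(1);
            }
            context.write(hash, new Text(revisit.toString()));
        }
    }
}
```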

For the full merge across all six chunks:

```
hadoop jar warc-hadoop-indexer/target/warc-hadoop-indexer-2.2.0-BETA-6-SNAPSHOT-job.jar uk.bl.wa.hadoop.mapreduce.mdx.MDXSeqMerger -i mdx-a-f -o mdx-merged -r 50
```

This took 11hrs, 50mins, 39sec, and the job counters were:

| Counter | Map | Reduce | Total |
| --- | --- | --- | --- |
| NUM_RESOLVED_REVISITS | 0 | 159,621,493 | 159,621,493 |
| NUM_REVISITS | 0 | 162,103,566 | 162,103,566 |
| NUM_RECORDS | 0 | 3,932,860,344 | 3,932,860,344 |
| NUM_UNRESOLVED_REVISITS | 0 | 2,482,073 | 2,482,073 |
| FILE_BYTES_READ | 1,379,467,889,493 | 2,940,283,760,167 | 4,319,751,649,660 |
| HDFS_BYTES_READ | 1,348,985,804,233 | 0 | 1,348,985,804,233 |
| FILE_BYTES_WRITTEN | 2,717,555,784,717 | 2,940,286,352,547 | 5,657,842,137,264 |
| HDFS_BYTES_WRITTEN | 0 | 1,354,124,391,916 | 1,354,124,391,916 |

## Step 3 - run statistics and sample generators over the merged MDX sequence files

At this point, we can run other profilers and samplers over the merged, re-duplicated MDX files and generate a range of datasets for researchers.
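
For example, a simple profiler can be written as a map-reduce job over the merged sequence files (using `SequenceFileInputFormat`). The mapper below is a sketch under the same assumptions as before (`Text` keys/values, JSON-serialised property maps) and uses a hypothetical `content_type` property; pairing it with the standard `LongSumReducer` would give a content-type breakdown of the whole collection.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONObject;

// Sketch: tally a (hypothetical) "content_type" property across all MDX
// records; combine with a LongSumReducer for the final counts.
public class ContentTypeProfilerMapper
        extends Mapper<Text, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(Text hash, Text mdxJson, Context context)
            throws IOException, InterruptedException {
        JSONObject mdx = new JSONObject(mdxJson.toString());
        String contentType = mdx.optString("content_type", "unknown");
        context.write(new Text(contentType), ONE);
    }
}
```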