Initial comparison with DLS
===============

On or about the 16th of May 2016, all the SIPs from HDFS were scanned and opened up to extract the 153,041 known WARCs of content submitted to DLS. The `ids.py` script used to do this created an `identifiers.txt` file that contains a JSON-like summary (in practice, printed Python dicts) of the submitted WARCs and ZIPs.

From about the 9th of August to the 14th of September, these c. 150,000 files (mostly WARCs) were downloaded and hash-checked using a simple script (`test.sh`). That is, we downloaded about 150TB of content from DLS over about 5 weeks. This is a pretty reasonable rate of roughly 50 MB/s, sustained over a long period without notably affecting other services.
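
As a rough check of that figure, the arithmetic can be sketched as follows. The 150TB total and the date window come from the text above; the year 2016, decimal units (1 TB = 10^12 bytes) and a continuously running transfer are simplifying assumptions.

```python
from datetime import date

# Total from the text above; decimal units (1 TB = 1e12 bytes) are an assumption.
total_bytes = 150e12
# The year is assumed to be 2016; only the day count matters here.
start, end = date(2016, 8, 9), date(2016, 9, 14)

# Assume the download ran continuously for the whole window.
elapsed_days = (end - start).days
elapsed_seconds = elapsed_days * 24 * 60 * 60

rate_mb_per_s = total_bytes / elapsed_seconds / 1e6
print("%.1f MB/s sustained over %d days" % (rate_mb_per_s, elapsed_days))
```

Under these assumptions the rate comes out slightly below 50 MB/s, which is consistent with the rounded figure quoted above.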

Here, we compare the recovered hashes with those from HDFS. We also compare the DLS export summary with the known HDFS content.

TODO
----

- Rule out temporary timeouts/failures by re-running the gaps.
- Make a production (Luigi-based) version of this comparison that knows about ALL items ever submitted.
    - The HDFS ID generation may mistakenly include second-submissions of difficult cases?

Part 1 - Comparing the hash results
---------------------------------

We load in the original identifiers from HDFS, and then load in the DLS results, outputting a comparison file `compare.out` where the first field in each line is either `OK` or `KO` depending on whether the hashes matched or not.

In [11]:
import re
import json

warcs_by_id = {}
hdfs_ids = set()
with open('identifiers.txt') as f:
    counter = 0
    for line in f:
        crawl_id, json_str = re.split(' ', line, maxsplit=1)
        # Unfortunately, the source list was just printed dict()s rather than JSON, so we hack:
        json_str = json_str.strip().replace("'", '"')
        if json_str == 'None':
            print("WARNING! no identifiers found for %s" % crawl_id)
            continue
        # Build up a dict:
        warc_info = json.loads(json_str)
        warc_info['crawl_id'] = crawl_id
        item_id = warc_info['ark'].replace("ark:/81055/", "")
        hdfs_ids.add(item_id)
        warcs_by_id[item_id] = warc_info
        if counter == 0:
            print(warc_info)
        counter += 1
        if counter%10000 == 0:
            print(item_id)
            print("...%i..." % counter)

print("\nNow running the comparison...")

id_dls_sha = {}
with open('compare.out', 'w') as fout:
    with open('test.clean.out') as f:
        counter = 0
        for line in f:
            if counter%10000 == 0:
                print("...%i..." % counter)
            counter += 1
            # Each line is a hash, a space, then the item ID (possibly prefixed with '*'):
            sha, item_id = re.split(' ', line, maxsplit=1)
            item_id = item_id.lstrip('*').strip()
            # Store:
            id_dls_sha[item_id] = sha
            # Look up the original HDFS record for this item, if any:
            if item_id in warcs_by_id:
                original_sha = warcs_by_id[item_id]['checksum']
                original_path = warcs_by_id[item_id]['path']
            else:
                original_sha = None
                original_path = None
            if sha == original_sha:
                decision = "OK"
            else:
                decision = "KO"
            # Write the decision alongside the original HDFS record (empty if unknown):
            fout.write("%s\t%s\t%s\n" % (decision, item_id, json.dumps(warcs_by_id.get(item_id,dict()))))


{u'mimetype': u'application/warc', u'checksum': u'55826718a7a72878637313e33104fba6fb58990f19a025e5c977f2e82576f673b3a198d44de05735290c97668b8e775e2fce631f616bfd48eb345a17bc85540e', 'crawl_id': '2013-domain-crawl/20130916143312', u'checksum_type': u'SHA-512', u'path': u'http://dls.httpfs.wa.bl.uk:14000/webhdfs/v1/heritrix/output/warcs/crawl0-20130412144423/BL-20130422000726955-03689-23518~crawler02~8443.warc.gz?user.name=hadoop&op=OPEN', u'ark': u'ark:/81055/vdc_100000038622.0x00815c', u'size': u'1006644129'}
vdc_100000038622.0x005a4d
...10000...
vdc_100000038622.0x00333d
...20000...
vdc_100000038622.0x000c2d
...30000...
vdc_100023997485.0x00dca4
...40000...
vdc_100023997485.0x00b594
...50000...
vdc_100023997485.0x008e84
...60000...
vdc_100023997485.0x006774
...70000...
vdc_100023997485.0x004064
...80000...
vdc_100023997485.0x001954
...90000...
vdc_100025743210.0x00002e
...100000...
vdc_100022569061.0x000001
...110000...
vdc_100022807075.0x000001
...120000...
vdc_100022565688.0x0003f5
.
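
The quote-replacement hack in the cell above works for this data, but it would break on any value containing an apostrophe. One possible alternative, sketched here under the assumption that each line is a crawl ID, a space, and then a printed Python dict (or `None`), is to parse the dict with `ast.literal_eval`. The helper name `parse_identifier_line` and the sample line are hypothetical.

```python
import ast

def parse_identifier_line(line):
    """Split one line into (crawl_id, dict-or-None).

    Assumes the 'crawl_id <printed dict or None>' layout described above.
    """
    crawl_id, literal = line.split(' ', 1)
    # literal_eval only evaluates Python literals, so it copes with
    # single-quoted strings, u'' prefixes and None without executing code.
    return crawl_id, ast.literal_eval(literal.strip())

# Made-up sample line, for illustration only:
sample = "example-crawl/20130101000000 {'ark': 'ark:/81055/vdc_0.0x000001', 'checksum': 'abc'}\n"
crawl_id, warc_info = parse_identifier_line(sample)
print(crawl_id, warc_info['ark'])
```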

This comparison file can then be processed further using grep.
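
As an alternative to grep, a minimal Python sketch for tallying the decisions is shown here. It assumes the tab-separated layout written by the cell above (decision, item ID, JSON record); the function name `tally_decisions` and the sample lines are made up for illustration.

```python
from collections import Counter

def tally_decisions(lines):
    """Count non-empty lines by their first tab-separated field (OK or KO)."""
    return Counter(line.split('\t', 1)[0] for line in lines if line.strip())

# Illustrative lines in the same layout as compare.out:
sample = [
    'OK\tvdc_0.0x000001\t{"checksum": "abc"}\n',
    'KO\tvdc_0.0x000002\t{}\n',
    'OK\tvdc_0.0x000003\t{"checksum": "def"}\n',
]
print(tally_decisions(sample))

# Against the real file this would be:
# with open('compare.out') as f:
#     print(tally_decisions(f))
```

Note that a `KO` line with an empty JSON record (`{}`) corresponds to an item returned by DLS that was not found in `identifiers.txt`, rather than to a hash mismatch.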

Part 2 - Looking at the DLS Export
--------------------------------

Here we load a copy of the export file from DLS that summarises the state according to the Boston Spa node.

We parse the replication status bitmask and turn it into a replication count. We store the lookup table of known identifiers too.

In [21]:

replication = {}
id_rep = {}
identifiers = set()
dom_ids = set()
with open('Public Web Archive Access Export.txt') as f:
    counter = 0
    for line in f:
        if counter%100000 == 0:
            print("...%i..." % counter)
        counter += 1
        parts = line.strip().split('\t')
        if len(parts) == 4:
            item_id, bs_url, check_date, rep_status = parts
            rep_status = int(rep_status)
        else:
            item_id, bs_url = parts
            check_date = None
            rep_status = 0
        identifiers.add(item_id)
        # dom_id
        dom_id = bs_url[20:]
        dom_ids.add(dom_id)
        # Replication count:
        rep_count = 0
        for i in range(0,4):
            bitmask = 1 << i
            if bitmask&rep_status:
                rep_count += 1
        # Debug output (disabled):
        #if rep_count == 0 or counter%100000 == 0:
        #    print(rep_count)
        #    print(item_id, dom_id, check_date, rep_status)
        # Store replication count for each ID:
        id_rep[item_id] = rep_count
        # Sum up
        rep_key = rep_count
        rep_key_count = replication.get(rep_key,0)
        replication[rep_key] = rep_key_count + 1
        
print(len(identifiers), len(dom_ids), counter)
print(replication)

...0...
...100000...
...200000...
...300000...
...400000...
...500000...
...600000...
(698748, 698748, 698748)
{0: 106, 1: 28, 2: 3, 3: 1382, 4: 697229}
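
To make the bitmask step concrete, here is a small standalone sketch of the same calculation. It assumes, as the loop above does, that each of the four low-order bits of the status flags one replica; the function name `replication_count` and the sample values are illustrative.

```python
def replication_count(rep_status, n_bits=4):
    """Count how many of the lowest n_bits bits are set in rep_status."""
    return sum(1 for i in range(n_bits) if rep_status & (1 << i))

# Worked examples: print each status in binary next to its replica count.
for status in (0b0000, 0b0001, 0b1011, 0b1111):
    print(format(status, '04b'), replication_count(status))
```

Bits above the fourth are ignored, matching the `range(0,4)` loop in the cell above.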


This data can then be plotted to make it easier to understand.

In [22]:
import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)

labels = ['Fully replicated', 'Under-replicated', 'Not replicated']
values = [replication[4], 0, replication[0]]
for rep_count in range(1,4):
    values[1] += replication[rep_count]

data = [go.Pie(
            labels=labels,
            values=values
    )]

py.iplot({ 'data': data, 'layout': {'title': 'Overall replication status of items in DLS'}})

labels = []
values = []
for rep_count in sorted(replication.keys()):
    if rep_count != 4:
        labels.append(rep_count)
        values.append(replication[rep_count])

data = [go.Pie(
            labels=labels,
            values=values, 
            sort=False
    )]

py.iplot({ 'data': data, 'layout': {'title': 'Replication level of the under-replicated items in DLS'}})

This shows that, although they are a relatively small percentage of the whole (about 0.2%), there are still ~1,500 items known to the system that are not fully replicated: 1,413 with one to three copies and 106 with none recorded.

We can also use this data to track the items we know we submitted, i.e. comparing HDFS and DLS holdings:

In [19]:
rep_all = {}
for item_id in warcs_by_id:
    if item_id in identifiers:
        rep_status = id_rep[item_id]
    else:
        rep_status = -1
    count = rep_all.get(rep_status, 0)
    rep_all[rep_status] = count + 1

print(rep_all)

labels = ['Fully replicated', 'Under-replicated', 'Not replicated', 'Missing']
values = [rep_all[4], 0, rep_all[0], rep_all[-1]]
for rep_count in range(1,4):
    values[1] += rep_all.get(rep_count, 0)

data = [go.Pie(
            labels=labels,
            values=values
    )]

py.iplot({ 'data': data, 'layout': {'title': 'Overall replication status of submitted items'}})


{0: 14, 1: 27, 3: 175, 4: 150557, -1: 2263}


So we now see that significantly more items (2,263) are known to us, and are believed to have been submitted to DLS, but are currently completely unknown to the system.

That said, we can also run the comparison the other way, and find there are many more DLS items not covered by the current analysis:

In [23]:
dls_only_ids = []
for item_id in identifiers:
    if item_id not in hdfs_ids:
        dls_only_ids.append(item_id)
        
print(len(dls_only_ids), len(identifiers), len(hdfs_ids))

(547975, 698748, 153036)


This is _mainly_ because this analysis does not cover _any_ WCT content, but we need a full picture of what we've submitted in order to carry this analysis out completely.
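
Both directions of the comparison can also be written as set differences. This is a sketch using tiny made-up ID sets; in the notebook the inputs would be `hdfs_ids` and `identifiers` as built above. The function name `compare_holdings` is hypothetical.

```python
def compare_holdings(hdfs_ids, dls_ids):
    """Return (submitted but unknown to DLS, in DLS only, in both)."""
    hdfs_ids, dls_ids = set(hdfs_ids), set(dls_ids)
    return hdfs_ids - dls_ids, dls_ids - hdfs_ids, hdfs_ids & dls_ids

# Made-up identifiers, for illustration only:
missing, dls_only, both = compare_holdings({'a', 'b', 'c'}, {'b', 'c', 'd', 'e'})
print(len(missing), len(dls_only), len(both))
```

Run against the real sets, the first two sizes should correspond to the 2,263 missing items and the 547,975 DLS-only items reported above.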