# WARC Index Status

We can use the Tracking Database to check the WARC index status.

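The CDX index status over the last month can be broken down into two-hour buckets using the following facet query (taking advantage of [Solr's JSON Facet API](https://lucene.apache.org/solr/guide/8_4/json-facet-api.html)). Within each time bucket, the counts are split by CDX index status and then by stream.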

In [3]:
import pandas as pd
import json
import requests

headers = {'content-type': "application/json" }

json_facet = {
    # Primary facet is by date: here we break down the last month into two-hour buckets
    'facet': {
        'dates' : { 
            'type' : 'range', 
            'field' : 'timestamp_dt', 
            'start' : "NOW/DAY-1MONTH",
            'end' : "NOW/DAY+10DAY", 
            'gap' : "+2HOUR",
            # For each time bucket, we facet on the CDX index field, and make sure items with no value get recorded:
            'facet': { 
                'index_status': { 
                    'terms': { 
                        "field": "cdx_index_ss", 
                        'mincount': 0,
                        'missing': True,
                        'facet': { 
                            'stream_s': { 
                                'terms': { 
                                    "field": "stream_s", 
                                    'mincount': 0,
                                    'missing': True
                                }
                            }
                        }
                    }
                }
            }
        } 
    }
}


params = {
  'q': 'kind_s:"warcs"',
  'rows': 0
}

r = requests.post("http://solr8.api.wa.bl.uk/solr/tracking/select", params=params, data=json.dumps(json_facet), headers=headers)

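if r.status_code != 200:
    # Show Solr's error message, then stop on HTTP errors rather than failing later on a missing 'facets' key
    print(r.text)
    r.raise_for_status()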

from solr.solr_facet_helper import flatten_solr_buckets

df = pd.DataFrame(flatten_solr_buckets(r.json()['facets']))
# Filter out empty rows (buckets with a zero count):
df = df[df['count'] != 0]

df

| | dates | index_status | stream_s | count |
|---|---|---|---|---|
| 0 | 2021-12-26T00:00:00Z | data-heritrix | frequent | 66 |
| 7 | 2021-12-26T02:00:00Z | data-heritrix | frequent | 75 |
| 14 | 2021-12-26T04:00:00Z | data-heritrix | frequent | 74 |
| 21 | 2021-12-26T06:00:00Z | data-heritrix | frequent | 71 |
| 28 | 2021-12-26T08:00:00Z | data-heritrix | frequent | 54 |
| ... | ... | ... | ... | ... |
| 2507 | 2022-01-24T12:00:00Z | data-heritrix | frequent | 12 |
| 2514 | 2022-01-24T14:00:00Z | data-heritrix | frequent | 11 |
| 2521 | 2022-01-24T16:00:00Z | data-heritrix | frequent | 7 |
| 2528 | 2022-01-24T18:00:00Z | data-heritrix | frequent | 1 |
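
The `flatten_solr_buckets` helper is imported from a module that is not shown here, so its exact behaviour is not documented in this notebook. As a rough illustration of the idea only, the sketch below walks a nested JSON Facet response and emits one flat row per innermost bucket. The function name `flatten_facets`, the use of `None` for missing values, and the sample response are all assumptions made for this example; the real helper may differ.

```python
def flatten_facets(facets, path=None):
    """Yield one flat dict per innermost bucket of a nested Solr JSON Facet response.

    Hypothetical sketch: not the actual flatten_solr_buckets implementation.
    """
    path = path or {}
    # Sub-facets are the dict-valued entries that carry a 'buckets' list
    sub_facets = {k: v for k, v in facets.items()
                  if isinstance(v, dict) and 'buckets' in v}
    if not sub_facets:
        # Innermost level: emit the accumulated facet values plus the count
        yield {**path, 'count': facets.get('count', 0)}
        return
    for name, facet in sub_facets.items():
        for bucket in facet['buckets']:
            yield from flatten_facets(bucket, {**path, name: bucket['val']})
        # 'missing': True adds a separate bucket for documents with no value
        if 'missing' in facet:
            yield from flatten_facets(facet['missing'], {**path, name: None})


# Made-up response fragment in the shape Solr returns under the 'facets' key
sample = {
    'count': 5,
    'dates': {'buckets': [{
        'val': '2021-12-26T00:00:00Z', 'count': 5,
        'index_status': {
            'buckets': [{
                'val': 'data-heritrix', 'count': 3,
                'stream_s': {'buckets': [{'val': 'frequent', 'count': 3}],
                             'missing': {'count': 0}},
            }],
            'missing': {
                'count': 2,
                'stream_s': {'buckets': [{'val': 'frequent', 'count': 2}]},
            },
        },
    }]},
}

rows = list(flatten_facets(sample))
```

Each row is a plain dict keyed by facet name, so `pd.DataFrame(rows)` would give a table with the same columns as the one above.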


This DataFrame can be used to build a simple visualisation:

In [4]:
import altair as alt

alt.Chart(df).mark_bar(size=6).encode(
    x='dates:T',
    y='count',
    color='index_status',
    tooltip=[alt.Tooltip('dates:T', format='%A, %e %B %Y'),'index_status', 'count']
).properties(width=800).interactive()
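
Because the facet query uses a two-hour `gap`, each row covers two hours rather than a whole day. If per-day totals are wanted, one option is to roll the rows up with pandas, as in the sketch below. The `daily_totals` function, the `'missing'` label for empty index status values, and the sample rows are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Made-up rows in the same shape as the flattened DataFrame
sample = pd.DataFrame({
    'dates': ['2021-12-26T00:00:00Z', '2021-12-26T02:00:00Z', '2021-12-27T00:00:00Z'],
    'index_status': ['data-heritrix', 'data-heritrix', None],
    'stream_s': ['frequent', 'frequent', 'frequent'],
    'count': [66, 75, 4],
})


def daily_totals(frame):
    """Sum two-hour bucket counts into one row per day and index status."""
    out = frame.copy()
    # Parse the ISO timestamps and truncate each one to the start of its day
    out['day'] = pd.to_datetime(out['dates']).dt.floor('D')
    # Give rows with no index status an explicit label so groupby keeps them
    out['index_status'] = out['index_status'].fillna('missing')
    return out.groupby(['day', 'index_status'], as_index=False)['count'].sum()


daily = daily_totals(sample)
```

The result has `day`, `index_status` and `count` columns, so it can be passed to the same Altair chart with `x='day:T'` in place of `x='dates:T'`.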