# Inspecting WARC

In this notebook we're going to use the Python [warcio](https://github.com/webrecorder/warcio) library to examine a [WARC](https://en.wikipedia.org/wiki/Web_ARChive) file. This particular WARC file was generated from the [Technoromanticism class blog](http://mith.umd.edu/eng738T/), a WordPress site, using [wget](https://www.gnu.org/software/wget/):

```
wget --mirror --warc-file eng738T --no-parent http://mith.umd.edu/eng738T/
```
 
<img src="images/website.png" style="width: 70%; border: thin solid #ccc;">

The goal here isn't to learn how to generate a WARC file but to learn a little about the structure of a WARC file, and how to read it with Python.
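
For orientation, a WARC file is a series of records. Each record starts with a version line and named header fields, followed by a blank line and a content block whose size is given by `Content-Length`. For a `response` record the block is the HTTP response itself. The sketch below hand-parses one made-up, uncompressed record using only the standard library. The sample bytes and the helpers `read_headers` and `read_record` are illustrative assumptions, not part of warcio; the rest of this notebook relies on warcio, which also takes care of the gzip compression in `.warc.gz` files.

```python
import io

# A made-up HTTP response to store inside the record (illustrative only).
HTTP_BLOCK = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<a href='http://example.org/about'>About</a>"
)

# A made-up WARC record: version line, header fields, blank line,
# the content block, then two CRLFs that close the record.
RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Date: 2017-01-01T00:00:00Z\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"Content-Length: " + str(len(HTTP_BLOCK)).encode("ascii") + b"\r\n"
    b"\r\n"
) + HTTP_BLOCK + b"\r\n\r\n"


def read_headers(stream):
    """Read 'Name: value' lines until the first blank line (hypothetical helper)."""
    headers = {}
    while True:
        line = stream.readline().rstrip(b"\r\n")
        if not line:
            return headers
        name, _, value = line.partition(b":")
        headers[name.decode("ascii").lower()] = value.strip().decode("utf-8")


def read_record(stream):
    """Split one uncompressed response record into its parts (hypothetical helper)."""
    version = stream.readline().strip().decode("ascii")
    warc_headers = read_headers(stream)
    block = stream.read(int(warc_headers["content-length"]))
    stream.read(4)  # consume the CRLF CRLF that ends the record
    # For a response record the block is an HTTP message.
    http = io.BytesIO(block)
    status_line = http.readline().strip().decode("ascii")
    http_headers = read_headers(http)
    body = http.read()
    return version, warc_headers, status_line, http_headers, body


version, warc_headers, status_line, http_headers, body = read_record(io.BytesIO(RECORD))
print(version, warc_headers["warc-type"], warc_headers["warc-target-uri"])
print(status_line, http_headers["content-type"])
print(body)
```

The two layers of headers (WARC headers, then HTTP headers) are the same two layers that warcio exposes on each record.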

## Imports

First we need to import a few things to help us work with the WARC data. The first is [Webrecorder](https://webrecorder.io)'s [warcio](https://github.com/webrecorder/warcio) library, which makes it possible to read and write WARC files from Python. We're only going to be reading WARC data, so we import the `ArchiveIterator` class, which lets us walk through each record in a WARC file.

In [1]:
from warcio.archiveiterator import ArchiveIterator

We're going to be looking at links in HTML pages, so we'll need a good HTML parser like [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#). Python also comes with a handy [urlparse](https://docs.python.org/3.6/library/urllib.parse.html#urllib.parse.urlparse) function for parsing URLs:

In [2]:
from bs4 import BeautifulSoup
from urllib.parse import urlparse
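
To get a feel for what `urlparse` returns, here is a small sketch using a few example links (the relative, fragment and `mailto:` links are made up for illustration). Note that `hostname` is `None` whenever a link carries no host, which is worth keeping in mind when counting hostnames.

```python
from urllib.parse import urljoin, urlparse

# An absolute link has a scheme, a hostname and a path.
absolute = urlparse("http://mith.umd.edu/eng738T/")
print(absolute.scheme, absolute.hostname, absolute.path)

# Links without a host part all report hostname as None.
relative = urlparse("/eng738T/about/")
fragment = urlparse("#comments")
mail = urlparse("mailto:someone@example.org")
print(relative.hostname, fragment.hostname, mail.hostname)

# A relative link can be resolved against the page it appeared on.
resolved = urljoin("http://mith.umd.edu/eng738T/", "about/")
print(resolved, urlparse(resolved).hostname)
```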

## Count Hostnames

Now that we've loaded the libraries we need, let's create a dictionary to keep track of the hostnames we find in links on the website.

In [3]:
counts = {}

Now let's create a function that takes a WARC record as input. If the record holds an HTML response, the function parses the HTML with BeautifulSoup, loops through all the anchor tags, and counts the hostnames they link to.

In [4]:
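def count_links(record):
    # Skip anything that isn't HTML; the Content-Type header may be missing.
    content_type = record.http_headers.get_header("content-type") or ""
    if "html" not in content_type:
        return
    doc = BeautifulSoup(record.raw_stream, "lxml")
    # Only consider anchors that actually have an href attribute.
    for link in doc.select("a[href]"):
        url = urlparse(link["href"])
        # url.hostname is None when the link has no host (e.g. a relative link).
        counts[url.hostname] = counts.get(url.hostname, 0) + 1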
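
The counting logic can be tried out without a WARC file. The sketch below is an alternative, not this notebook's method: it swaps BeautifulSoup for the standard library's `html.parser` and the plain dictionary for `collections.Counter`, and it runs on a made-up HTML snippet. The class name `LinkHostCounter` and the sample markup are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkHostCounter(HTMLParser):
    """Count the hostnames found in <a href="..."> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return  # anchor without a usable href, e.g. <a name="...">
        self.counts[urlparse(href).hostname] += 1


# Made-up markup: two links to one host, one external link, one fragment
# link, and one anchor with no href at all.
html = """
<p><a href="http://mith.umd.edu/eng738T/">home</a>
<a href="http://mith.umd.edu/eng738T/about/">about</a>
<a href="https://twitter.com/example">twitter</a>
<a href="#top">top</a>
<a name="no-href">x</a></p>
"""

parser = LinkHostCounter()
parser.feed(html)
for hostname, n in parser.counts.most_common():
    print("%-35s  %8i" % (hostname, n))
```

`Counter.most_common()` returns the hostnames already ordered by count, and the fragment link shows up under the `None` key because it has no host.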

And now we can open up our WARC file and iterate through all the records. We are only interested in counting the URLs found in the responses, so we skip every other record type.

In [None]:
with open("eng738T.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            count_links(record)
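
A hedged aside about `record.raw_stream`: it yields the HTTP body as it was stored in the WARC, and a body captured with chunked transfer encoding still contains its chunk-size lines. An odd key such as `'open\r\n002813\r\nobjects.blogspot.com'` in the counts printed in the next cell would be consistent with a chunk boundary landing inside a URL, though nothing in this notebook confirms that. The sketch below uses only the standard library to show what de-chunking does. `dechunk` is a hypothetical helper, not part of warcio, and it ignores trailers and malformed input. warcio records also offer `content_stream()`, which is documented as de-chunking and decompressing the payload.

```python
def dechunk(data: bytes) -> bytes:
    """Reassemble a chunked HTTP body (simplified, hypothetical helper)."""
    out = bytearray()
    pos = 0
    while True:
        # Each chunk starts with its size in hexadecimal, ended by CRLF.
        eol = data.index(b"\r\n", pos)
        size_field = data[pos:eol].split(b";")[0]  # drop any chunk extension
        size = int(size_field.decode("ascii"), 16)
        if size == 0:
            break  # a zero-sized chunk marks the end of the body
        start = eol + 2
        out += data[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data
    return bytes(out)


# A made-up body whose chunk boundary falls in the middle of a hostname.
chunked = b"4\r\nopen\r\n7\r\nobjects\r\n0\r\n\r\n"
print(chunked)
print(dechunk(chunked))
```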

In [6]:
print(counts)

{'mith.umd.edu': 1387127, 'www.wdl.org': 7, 'upload.wikimedia.org': 8, 'www.aoc.gov': 7, 'www.imdb.com': 7, 'www.blakearchive.org': 63, 'twitter.com': 6, None: 94, 'wordpress.org': 68, 'shelleygodwinarchive.org': 71, 'www.rc.umd.edu': 73, 'elms.umd.edu': 68, 'www.archive.org': 68, 'chronicle.com': 68, 'www.nattywp.com': 68, 'www.umd.edu': 1, 'www.elms.umd.edu': 1, 'www.tei-c.org': 3, 'digitalliterature.net': 4, 'journals.tdl.org': 1, 'juxtacommons.org': 1, 'web.mit.edu': 3, 'pmc.iath.virginia.edu': 1, 'www.heise.de': 1, 'www.deenalarsen.net': 3, 'gaslight.mtroyal.ca': 1, 'www.stanford.edu': 2, 'deoxy.org': 1, 'www.women.it': 1, 'muse.jhu.edu': 1, 'www.lacan.com': 1, 'www.gutenberg.org': 2, 'www.wsu.edu': 1, 'uwf.edu': 1, 'open\r\n002813\r\nobjects.blogspot.com': 1, 'menus.nypl.org': 1, 'transcription.si.edu': 1, 'www.galaxyzoo.org': 1, 'www.infiniteulysses.com': 1, 'en.gravatar.com': 1, 'twinery.org': 9, 'www.auntiepixelante.com': 3, 'selectadecision.info': 3, 'electricopolis.net': 3, 

That's a bit of a jumble, so let's print out the hostnames in descending order of how many links they have:

In [18]:
for hostname in sorted(counts.keys(), key=counts.get, reverse=True):
    print("%-35s  %8i" %(hostname, counts[hostname]))

mith.umd.edu                          1387127
None                                       94
www.rc.umd.edu                             73
shelleygodwinarchive.org                   71
wordpress.org                              68
elms.umd.edu                               68
www.archive.org                            68
chronicle.com                              68
www.nattywp.com                            68
www.blakearchive.org                       63
ginsbergblog.blogspot.com                  11
www.youtube.com                            10
twinery.org                                 9
upload.wikimedia.org                        8
www.wdl.org                                 7
www.aoc.gov                                 7
www.imdb.com                                7
www.literaturegeek.com                      7
twitter.com                                 6
digitalliterature.net                       4
boingboing.net                              4
www.readwriteweb.com              