<a href="https://colab.research.google.com/github/ruebot/notebooks/blob/main/digital-research-web-archive-analysis-workshop/digital_research_web_archive_analysis_workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started with Web Archive Analysis

In this workshop we will download some example derivatives from [Archive-It's ARCH](https://webservices.archive.org/pages/arch) service to demonstrate a few examples of further exploration of web archive data. We'll be using derivatives created from the [Movimiento estudiantil feminista Universidad Autónoma del Estado de México 2020](https://archive-it.org/collections/20429) collection created by [Huellas Incómodas](https://archive-it.org/organizations/2521).

## Archives Unleashed Toolkit Derivatives

The web archive derivatives we are working with were generated by ARCH using the [Archives Unleashed Toolkit](https://github.com/archivesunleashed/aut). If you have W/ARC files of your own, you can create these same derivatives, and more! Examples of how to do this, along with further documentation, can be found [here](https://aut.docs.archivesunleashed.org/docs/auk-derivatives).

## Notebook History

This notebook is a derivative of:

* [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6381036.svg)](https://doi.org/10.5281/zenodo.6381036) -- Working with ARCH Derivatives

* Article: [Creating order from the mess: web archive derivative datasets and notebooks](https://doi.org/10.1080/23257962.2022.2100336).

# Datasets

First, we will need to create variables for the derivative data from ARCH. For this, we'll just be using the download URLs for each of the derivatives. You can grab these by right-clicking on the download icon, and selecting "Copy Link".

In [None]:
domain_frequency_data = "https://webdata.archive-it.org/ait/files/download/ARCHIVEIT-ait:1796:20429/DomainFrequencyExtraction/domain-frequency.csv.gz?access=24VLVHYLY3J7FQTVLRPHUC6VC5PWU4KG"
image_info_data = "https://webdata.archive-it.org/ait/files/download/ARCHIVEIT-ait:1796:20429/ImageInformationExtraction/image-information.csv.gz?access=Q5OVM44HHNOH5X2SN63AEQFTXPVGYUKY"
web_graph_data = "https://webdata.archive-it.org/ait/files/download/ARCHIVEIT-ait:1796:20429/WebGraphExtraction/web-graph.csv.gz?access=5TAWDHCMR4I6XCQ2YGZO3AY4QOCXZCSS"
web_pages_data = "https://webdata.archive-it.org/ait/files/download/ARCHIVEIT-ait:1796:20429/WebPagesExtraction/web-pages.csv.gz?access=3I7IQZFO5D4AJA3FURDJSRG4H7OZWGXW"

# Environment

Next, we'll set up our environment so we can load our derivatives into [pandas](https://pandas.pydata.org).

In [None]:
import pandas as pd

# Data Table Display

Colab includes an extension that renders pandas dataframes into interactive displays that can be filtered, sorted, and explored dynamically. This can be very useful for taking a look at what each DataFrame provides!

Data table display for pandas dataframes can be enabled by running:
```python
%load_ext google.colab.data_table
```
and disabled by running
```python
%unload_ext google.colab.data_table
```

In [None]:
%load_ext google.colab.data_table

# Loading our ARCH Datasets as DataFrames

---



Next, we'll set up our datasets as pandas DataFrames to work with, and show a preview of each using the Data Table Display.

Each block of derivative commands creates a variable. That variable is a DataFrame with all of the information from a given derivative. After the DataFrame is created, a preview of it is shown.
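If you'd like to see the loading pattern without touching the network, here is a minimal sketch. It builds a tiny gzip-compressed CSV in memory (the rows are made up, not ARCH data) and reads it the same way the cells below read the real derivatives. The `toy_domain_frequency` name is just for illustration.

```python
import gzip
import io

import pandas as pd

# A tiny, made-up CSV with the same columns as the domain frequency derivative.
raw_csv = "domain,count\nexample.org,12\nexample.com,7\n"

# Compress it in memory so it looks like a downloaded .csv.gz file.
compressed = io.BytesIO(gzip.compress(raw_csv.encode("utf-8")))

# compression="gzip" tells pandas to decompress while reading.
toy_domain_frequency = pd.read_csv(compressed, compression="gzip")
print(toy_domain_frequency)
```

The same `pd.read_csv` call accepts a URL or a local file path in place of the in-memory buffer.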

### Domain Frequency

Provides the following columns:

* domain
* count

In [None]:
domain_frequency = pd.read_csv(domain_frequency_data, compression='gzip')
domain_frequency

### Web Graph

Provides the following columns:

* crawl date
* source
* target
* anchor text

Note that this contains all links and is not aggregated into domains.

In [None]:
web_graph = pd.read_csv(web_graph_data, compression='gzip')
web_graph

### Images

Provides the following columns:

* crawl date
* URL of the image
* filename
* image extension
* MIME type as provided by the web server
* MIME type as detected by Apache TIKA
* image width
* image height
* image MD5 hash
* image SHA1 hash

In [None]:
images = pd.read_csv(image_info_data, compression='gzip')
images

### Web Pages

Provides the following columns:

* crawl date
* web domain
* URL
* MIME type as provided by the web server
* MIME type as detected by Apache TIKA
* content (HTTP headers and HTML removed)

In [None]:
web_pages = pd.read_csv(web_pages_data, compression='gzip')
web_pages

# Data Analysis

Now that we have all of our datasets loaded up, we can begin to work with them!

## Counting total files, and unique files

Let's take a quick look at how to count items in DataFrames, and use total and unique files as an example to work with.

It's definitely worth checking out the [pandas documentation](https://pandas.pydata.org/docs/index.html). There are a lot of good examples available, along with a robust [API reference](https://pandas.pydata.org/docs/reference/index.html#api).


#### How many images are in this collection?

We can take our `images` variable and try a few functions to get the same answer.

1.   `len(images.index)`
  * Get the length of the DataFrame's index.
2.   `images.shape[0]`
  * Get the shape or dimensionality of the DataFrame, and take the first item in the tuple.
3.  `images.count()`
  * Count the non-null values in each column. When nothing is missing, every column's count equals the number of rows.
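
To see how the three approaches relate, here is a small sketch on a made-up DataFrame (`toy_images` and its values are hypothetical). It has one missing filename, which shows where `count()` can differ from the other two.

```python
import pandas as pd

# Three made-up rows; the last one has no filename.
toy_images = pd.DataFrame(
    {
        "filename": ["a.jpg", "b.png", None],
        "md5": ["aaa", "bbb", "ccc"],
    }
)

rows_by_index = len(toy_images.index)   # number of rows
rows_by_shape = toy_images.shape[0]     # also the number of rows
non_null_per_column = toy_images.count()  # non-null values per column

print(rows_by_index, rows_by_shape)
print(non_null_per_column)
```

Here the first two give the row count, while `count()` reports one fewer for the `filename` column because of the missing value.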



In [None]:
len(images.index)

In [None]:
images.shape[0]

In [None]:
images.count()

#### How many unique images are in the collection?

We can see if an image is unique or not by computing an [MD5 hash](https://en.wikipedia.org/wiki/MD5#MD5_hashes) of each image and comparing the hashes. The exact same image might have a filename of `example.jpg` or `foo.jpg`. If the hash is computed for each, we can see that even with different filenames, they are actually the same image. So, since we have both an `MD5` and a `SHA1` hash column available in our DataFrame, we can just find the unique values, and count them!
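
As a quick illustration of the idea, here is a sketch using Python's built-in `hashlib`. The byte strings and filenames are made up; they stand in for image files.

```python
import hashlib

# The same bytes stored under two different (hypothetical) filenames,
# plus one file with different contents.
image_bytes = b"\x89PNG\r\n\x1a\n-not-a-real-image-"
files = {
    "example.jpg": image_bytes,
    "foo.jpg": image_bytes,
    "other.jpg": b"different bytes",
}

# Hash the contents, not the name.
digests = {name: hashlib.md5(data).hexdigest() for name, data in files.items()}

for name, digest in digests.items():
    print(name, digest)
```

The first two digests match because the contents are identical, regardless of the filename.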




In [None]:
len(images.md5.unique())

#### What are the top 10 most occurring images in the collection?

Here we can take advantage of [`value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html) to provide us with a list of MD5 hashes, and their respective counts.
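
If `value_counts()` is new to you, this small sketch on a made-up Series (`toy_hashes` is hypothetical) shows what it returns: each distinct value with the number of times it appears, most frequent first.

```python
import pandas as pd

# Six made-up hash values, some repeated.
toy_hashes = pd.Series(["aaa", "bbb", "aaa", "ccc", "aaa", "bbb"], name="md5")

# Count each distinct value; the result is sorted by count, descending.
hash_counts = toy_hashes.value_counts()

# Keep only the two most frequent values.
top_two = hash_counts.head(2)
print(top_two)
```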

In [None]:
images["md5"].value_counts().head(10)


#### What's the information around all of the occurrences of `da5b449fff36752a93779fa4067cd2eb`?

What, you mean you don't know what `da5b449fff36752a93779fa4067cd2eb` means?

Let's find those images in the DataFrame. Here we can see some of the filenames used, its dimensions, and its URL.


In [None]:
images.loc[images["md5"] == "da5b449fff36752a93779fa4067cd2eb"]

### What does `da5b449fff36752a93779fa4067cd2eb` look like?

Let's grab the live web URL for the image, and then see if we can display it in a markdown cell.


In [None]:
pd.options.display.max_colwidth = None
one_image = images.loc[images["md5"] == "da5b449fff36752a93779fa4067cd2eb"].head(1)
one_image["url"]

![da5b449fff36752a93779fa4067cd2eb](https://t.teads.tv/track?action=placementCall&env=js-web&auctid=33f38483-62a4-4c26-b8a6-e1c7f33c1196&pageId=46587&pid=51781&debug_metadata=nYruSm6UJ5&fv=1107&ts=1670523222881&f=1&referer=https%3A%2F%2Fwww.milenio.com%2Fpolitica%2Fcomunidad%2F2020-fgjem-21-denuncias-violencia-genero-uaemex)

Unfortunately, we can't easily display it here since it is a tracker image!

While we can see that this is the most popular image in the collection, we can't tell you _why_. That's where the researcher comes in!

...though in this case, it's pretty obvious it's for tracking/advertising since it's a 1x1 empty image.



Another point of examination with the `images` DataFrame is the `height` and `width` columns. You could take a look at the largest images, or even `0x0` images, and potentially `spacer.gif` occurrences!

* “[The invention and dissemination of the spacer gif: implications for the future of access and use of web archives](https://link.springer.com/article/10.1007/s42803-019-00006-8)”
* "[GeoCities and the spacer.gif](https://ruebot.net/post/geocities-and-the-spacer-gif/)"
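
Here is a minimal sketch of that kind of filtering, using a made-up DataFrame (`toy_images` and its rows are hypothetical). It assumes the `width` and `height` columns hold numbers.

```python
import pandas as pd

# Made-up rows standing in for the images derivative.
toy_images = pd.DataFrame(
    {
        "filename": ["pixel.gif", "cover.jpg", "spacer.gif", "banner.png"],
        "width": [1, 800, 1, 1200],
        "height": [1, 600, 1, 300],
    }
)

# Tiny images: at most 1 pixel in each dimension (covers 0x0 and 1x1).
tiny = toy_images[(toy_images["width"] <= 1) & (toy_images["height"] <= 1)]

# Largest image by pixel area.
largest = toy_images.assign(
    area=toy_images["width"] * toy_images["height"]
).nlargest(1, "area")

print(tiny)
print(largest)
```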


#### What are the top 10 most occurring filenames in the collection?

Note that this is of course different from the MD5 results above. Here we are focusing _just_ on filename. So `cover.jpg`, for example, might actually be referring to different images that happen to have the same name.

Here we can use `value_counts()` again, but this time we'll create a variable for the top filenames so we can use it later.



In [None]:
top_filenames = images["filename"].value_counts().head(10)
top_filenames

#### Let's create our first graph!

We'll plot the data first with pandas [plot](https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.html) functionality, and then plot the data with [Altair](https://altair-viz.github.io/).

In [None]:
top_filenames_chart = top_filenames.plot.bar(figsize=(25, 10))

top_filenames_chart.set_title("Top Filenames", fontsize=22)
top_filenames_chart.set_xlabel("Filename", fontsize=20)
top_filenames_chart.set_ylabel("Count", fontsize=20)

Now let's set up Altair, and plot the data with it. Altair is useful for creating visualizations since they can be easily exported as a PNG or SVG.

In [None]:
import altair as alt

In [None]:
top_filenames_altair = (
    images["filename"]
    .value_counts()
    .head(10)
    .rename_axis("Filename")
    .reset_index(name="Count")
)

filenames_bar = (
    alt.Chart(top_filenames_altair)
    .mark_bar()
    .encode(x=alt.X("Filename:O", sort="-y"), y=alt.Y("Count:Q"))
)

filenames_rule = (
    alt.Chart(top_filenames_altair).mark_rule(color="red").encode(y="mean(Count):Q")
)


filenames_text = filenames_bar.mark_text(align="center", baseline="bottom").encode(
    text="Count:Q"
)

(filenames_bar + filenames_rule + filenames_text).properties(
    width=1400, height=700, title="Top Filenames"
)

#### How about a file format distribution?

What _kind_ of image files are present? We can discover this by checking their "media type", or [MIME type](https://en.wikipedia.org/wiki/Media_type). 
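
The images derivative carries two MIME type columns: one as reported by the web server, and one as detected from the file itself. To give a feel for what content-based detection means, here is a deliberately tiny sketch that checks the first few bytes of a file against three well-known image signatures. The `sniff_image_type` function is hypothetical and only for illustration; Apache Tika does far more than this.

```python
def sniff_image_type(data: bytes) -> str:
    """Guess an image media type from the first bytes of a file."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    # Unknown signature: fall back to the generic binary type.
    return "application/octet-stream"


# A GIF header stays a GIF, no matter what the filename or server claims.
print(sniff_image_type(b"GIF89a" + b"\x01\x00\x01\x00"))
```

This is why the two columns can disagree: the server's answer is a claim, while the detected type comes from the bytes.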






In [None]:
image_mime_types = (
    images["mime_type_tika"]
    .value_counts()
    .head(5)
    .rename_axis("MIME Type")
    .reset_index(name="Count")
)

image_mimes_bar = (
    alt.Chart(image_mime_types)
    .mark_bar()
    .encode(x=alt.X("MIME Type:O", sort="-y"), y=alt.Y("Count:Q"))
)

image_mime_rule = (
    alt.Chart(image_mime_types).mark_rule(color="red").encode(y="mean(Count):Q")
)

image_mime_text = image_mimes_bar.mark_text(align="center", baseline="bottom").encode(
    text="Count:Q"
)

(image_mimes_bar + image_mime_rule + image_mime_text).properties(
    width=1400, height=700, title="Image File Format Distribution"
)

#### How do I get the actual images?

...or, how do I get to the actual binary files described by each file format information derivative?

There are a few options!

1. `wget` or `curl` from the live URL, or a replay URL
  * Live web URL
    * `wget` or `curl` the value of the `url` column
  * Replay web URL
    * `wget` or `curl` the value of the `crawl_date` and `url` column using the following pattern:
      * `https://web.archive.org/web/` + `crawl_date` + `*/` + `url`
        * https://web.archive.org/web/20120119*/http://www.archive.org/images/glogo.png
2. Use a scripting language, such as Python
  * Make use of the `url` and `filename` columns (and `crawl_date` if you want to use the replay URL)
  * `import requests`
  * `r = requests.get(url, allow_redirects=True)`
  * `open(filename, 'wb').write(r.content)`
3. Use the [Archives Unleashed Toolkit](https://aut.docs.archivesunleashed.org/docs/extract-binary) (if you have access to the W/ARC files).
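
The replay URL pattern from option 1 can be sketched as a small helper. The function names `build_replay_url` and `save_binary` are hypothetical, and `save_binary` is only defined here, not called, so nothing is downloaded when you run this cell. It uses the standard library's `urllib` rather than `requests` to stay self-contained.

```python
from urllib.request import urlopen


def build_replay_url(crawl_date, url):
    """Join crawl_date and url using the pattern described above."""
    return f"https://web.archive.org/web/{crawl_date}*/{url}"


def save_binary(url, filename):
    """Fetch url and write the response body to filename (not called here)."""
    with urlopen(url) as response, open(filename, "wb") as handle:
        handle.write(response.read())


replay = build_replay_url("20120119", "http://www.archive.org/images/glogo.png")
print(replay)
```

To apply it across the DataFrame, you could loop over the `crawl_date`, `url`, and `filename` columns and pass each row's values to these helpers.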

## Let's take a look at the domain frequency derivative.

#### What does the distribution of domains look like?

Here we can see which domains are the most frequent within the collection.

In [None]:
top_domains = domain_frequency.sort_values("count", ascending=False).head(10)

top_domains_bar = (
    alt.Chart(top_domains)
    .mark_bar()
    .encode(
        x=alt.X("domain:O", title="Domain", sort="-y"),
        y=alt.Y("count:Q", title="Count, Mean of Count"),
    )
)

top_domains_rule = (
    alt.Chart(top_domains).mark_rule(color="red").encode(y="mean(count):Q")
)

top_domains_text = top_domains_bar.mark_text(align="center", baseline="bottom").encode(
    text="count:Q"
)

(top_domains_bar + top_domains_rule + top_domains_text).properties(
    width=1400, height=700, title="Domains Distribution"
)

### Top Level Domain Analysis

pandas allows you to create new columns in a DataFrame based off of existing data. This comes in handy for a number of use cases with the available data that we have. In this case, let's create a new column, `tld`, which is based off an existing column, `domain`. This example should provide you with an implementation pattern for expanding on these datasets to do further research and analysis.

A [top-level domain](https://en.wikipedia.org/wiki/Top-level_domain) refers to the highest domain in an address - i.e. `.ca`, `.com`, `.org`, or yes, even `.pizza`.

Things get a bit complicated, however, in some national TLDs. While `qc.ca` (the domain for Quebec) isn't really a top-level domain, it has many of the features of one as people can directly register under it. Below, we'll use the `suffix` attribute to include this.

> You can learn more about suffixes at https://publicsuffix.org.
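
To make the idea of a suffix concrete, here is a toy sketch that matches a domain against a tiny, hand-picked set of suffixes. Both `KNOWN_SUFFIXES` and `toy_suffix` are hypothetical, and this is not how a real library works in full: the actual Public Suffix List is much longer and has wildcard and exception rules.

```python
# Hypothetical, hand-picked subset of public suffixes.
KNOWN_SUFFIXES = {"ca", "qc.ca", "com", "org", "mx", "com.mx"}


def toy_suffix(domain):
    """Return the longest known suffix of domain, or "" if none matches."""
    labels = domain.lower().split(".")
    # Try the longest candidate first, so "qc.ca" wins over "ca".
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in KNOWN_SUFFIXES:
            return candidate
    return ""


for name in ["www.example.qc.ca", "example.ca", "example.com.mx", "uaemex.mx"]:
    print(name, "->", toy_suffix(name))
```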

We'll take the `domain` column and extract the `tld` from it with [`tldextract`](https://github.com/john-kurkowski/tldextract).

First we'll add the [`tldextract`](https://github.com/john-kurkowski/tldextract) library to the notebook. Then, we'll create the new column.

In [None]:
%%capture

!pip install tldextract

In [None]:
import tldextract

domain_frequency["tld"] = domain_frequency.apply(
    lambda row: tldextract.extract(row.domain).suffix, axis=1
)
domain_frequency

#### Next, let's count the distinct TLDs.


In [None]:
tld_count = domain_frequency["tld"].value_counts()
tld_count

#### Next, we'll plot the TLD count.


In [None]:
tld_count = (
    domain_frequency["tld"]
    .value_counts()
    .rename_axis("TLD")
    .reset_index(name="Count")
    .head(10)
)

tld_bar = (
    alt.Chart(tld_count)
    .mark_bar()
    .encode(x=alt.X("TLD:O", sort="-y"), y=alt.Y("Count:Q"))
)

tld_rule = alt.Chart(tld_count).mark_rule(color="red").encode(y="mean(Count):Q")

tld_text = tld_bar.mark_text(align="center", baseline="bottom").encode(text="Count:Q")

(tld_bar + tld_rule + tld_text).properties(
    width=1400, height=700, title="Top Level Domain Distribution"
)

## Web Crawl Frequency

Let's see what the crawl frequency looks like by examining the `web_pages` DataFrame. First we'll create a new DataFrame by extracting the `crawl_date` and `domain` columns, and count the occurrences of each domain and date combination.
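
Here is the same pattern on a few made-up rows (`toy_pages` is hypothetical), so you can see what counting combinations of two columns produces. The name of the count column after `reset_index()` can vary between pandas versions, which is one reason to set the column names explicitly.

```python
import pandas as pd

# Four made-up page captures across two crawl dates.
toy_pages = pd.DataFrame(
    {
        "crawl_date": ["20200301", "20200301", "20200301", "20200401"],
        "domain": ["example.org", "example.org", "example.com", "example.org"],
    }
)

# Count each (crawl_date, domain) pair, then turn the result into a DataFrame.
toy_counts = toy_pages[["crawl_date", "domain"]].value_counts().reset_index()
toy_counts.columns = ["Date", "Site", "Count"]
print(toy_counts)
```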

In [None]:
crawl_sites = web_pages[["crawl_date", "domain"]]
crawl_sites = crawl_sites.value_counts().reset_index()
crawl_sites.columns = ["Date", "Site", "Count"]
crawl_sites

Next, we'll create a stacked bar chart where each bar will show the distribution of pages in that crawl by domain.

**NOTE**: Charts like this one work a lot better with collections that have more than 1 or 2 crawl dates. The temporal aspect is definitely something to take into consideration with each of the examples provided in this notebook.

In [None]:
## Altair has a default limit of 5000 rows, and this DataFrame is ~7700 rows, so we're going to disable the max allowed rows.
alt.data_transformers.disable_max_rows()

crawl_chart = (
    alt.Chart(crawl_sites)
    .mark_bar()
    .encode(
        x="Date:O",
        y="Count:Q",
        color="Site",
        tooltip="Site",
        order=alt.Order("Site", sort="descending"),
    )
)

crawl_chart.properties(width=1400, height=700, title="Web Crawl Frequency")

## Examining the Web Graph

Remember the hyperlink web graph? Let's look at the web graph columns again.



In [None]:
web_graph


### What are the most frequent `source` and `target` combinations?

In [None]:
top_links = web_graph[["source", "target"]].value_counts().head(10).reset_index()
top_links.columns = ["source", "target", "count"]
top_links

## Can we create a network graph visualization with the data we have?

Yes! We can take advantage of [NetworkX](https://networkx.org/documentation/stable/index.html) to create some basic graphs.

NetworkX is *really* powerful, so there is a lot more that can be done with it than what we're demonstrating here.

First we'll import `networkx` as well as `matplotlib.pyplot`.

In [None]:
import matplotlib.pyplot as plt
import networkx as nx

We can take advantage of [`from_pandas_edgelist`](https://networkx.org/documentation/stable/reference/generated/networkx.convert_matrix.from_pandas_edgelist.html) here to initialize our graph, since our `top_links` DataFrame is an edge table.


In [None]:
G = nx.from_pandas_edgelist(
    top_links, source="source", target="target", edge_key="target", edge_attr="count"
)

Set up our graph, and draw it!


In [None]:
pos = nx.spring_layout(G, k=15)
options = {
    "node_size": 1000,
    "node_color": "#bc5090",
    "node_shape": "o",
    "alpha": 0.5,
    "linewidths": 4,
    "font_size": 10,
    "font_color": "black",
    "width": 2,
    "edge_color": "grey",
}

plt.figure(figsize=(12, 12))

nx.draw(G, pos, with_labels=True, **options)

labels = {e: G.edges[e]["count"] for e in G.edges}
nx.draw_networkx_edge_labels(G, pos, edge_labels=labels)

plt.show()

## Text Analysis

Next, we'll do some basic text analysis on our `web_pages` DataFrame with `nltk` and `spaCy`, and end with a word cloud.


In [None]:
import re

import nltk

In [None]:
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

In [None]:
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

We'll drop the `NaN` values in our DataFrame to clean things up a bit.

In [None]:
web_pages = web_pages.dropna()
web_pages

We need to set [`mode.chained_assignment`](https://pandas.pydata.org/docs/user_guide/options.html?highlight=chained_assignment) to `None` now to silence the `SettingWithCopyWarning` messages that would otherwise come up when we add new columns below.

In [None]:
pd.options.mode.chained_assignment = None

Next, we'll set up a tokenizer which will split on words, and create a new column which is the tokenized text.
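
To get a feel for what splitting on the `\w+` pattern does, here is a sketch using only Python's built-in `re` module. It assumes that, for this pattern, finding all matches gives the same tokens as the `nltk` tokenizer used below; `tokenize_words` and the sample sentence are made up for illustration.

```python
import re


def tokenize_words(text):
    """Return every run of word characters (letters, digits, underscore)."""
    return re.findall(r"\w+", text)


sample = "¡Alto a la violencia de género! #8M 2020"
tokens = tokenize_words(sample)
print(tokens)
```

Punctuation such as `¡`, `!`, and `#` is dropped, while accented letters like the `é` in `género` are kept as part of the word.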

In [None]:
tokenizer = nltk.RegexpTokenizer(r"\w+")

In [None]:
web_pages["content_tokenized"] = web_pages["content"].map(tokenizer.tokenize)

Now we'll create a column with the number of tokens in each page.

In [None]:
web_pages["content_tokens"] = web_pages["content_tokenized"].apply(lambda x: len(x))

### Basic word count statistics with pandas!

Now we can use the power of pandas [statistical functions](https://pandas.pydata.org/docs/user_guide/computation.html) to show us some basic statistics about the tokens.
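
If you'd like all of these numbers at once, pandas also offers `describe()`. Here is a sketch on a made-up Series of token counts (`toy_token_counts` is hypothetical). Note that `std()` in pandas is the sample standard deviation by default.

```python
import pandas as pd

# Eight made-up token counts.
toy_token_counts = pd.Series([2, 4, 4, 4, 5, 5, 7, 9], name="content_tokens")

# count, mean, std, min, quartiles, and max in a single call.
summary = toy_token_counts.describe()
print(summary)
```

The individual cells below compute the same mean, standard deviation, max, and min one at a time.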

**Mean**

In [None]:
web_pages["content_tokens"].mean()

**Standard deviation**


In [None]:
web_pages["content_tokens"].std()

**Max**

In [None]:
web_pages["content_tokens"].max()

**Min**

In [None]:
web_pages["content_tokens"].min()

### Pages with most words

Let's create a bar chart that shows the pages with the most words. Here we can see the power of pandas at work, in terms of both analysis and visualization.

First, let's show the query to get the data for our chart.

In [None]:
word_count = (
    web_pages[["url", "content_tokens"]]
    .sort_values(by="content_tokens", ascending=False)
    .head(25)
)

In [None]:
word_count

Next, let's create a bar chart of this.

In [None]:
word_count_bar = (
    alt.Chart(word_count)
    .mark_bar()
    .encode(x=alt.X("url:O", sort="-y"), y=alt.Y("content_tokens:Q"))
)

word_count_rule = (
    alt.Chart(word_count).mark_rule(color="red").encode(y="mean(content_tokens):Q")
)

word_count_text = word_count_bar.mark_text(align="center", baseline="bottom").encode(
    text="content_tokens:Q"
)

(word_count_bar + word_count_rule + word_count_text).properties(
    width=1400, height=700, title="Pages with the most words"
)

### How about NER on the page with the most tokens?

[Named-Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition), or NER, is an exciting field of natural language processing that lets us extract "entities" out of text; the names of people, locations, or organizations.

To do this, we first need to find the page that has the most tokens.

In [None]:
word_count_max = (
    web_pages[["url", "content_tokens", "content"]]
    .sort_values(by="content_tokens", ascending=False)
    .head(1)
)
word_count_max["url"]

We'll remove the column width limit so we can check out our content for the page.

In [None]:
pd.set_option("display.max_colwidth", None)

Let's take a look at our page's content.

In [None]:
# Take the text of the single page directly, without the index label
# that to_string() would add in front of it.
page = str(word_count_max["content"].iloc[0])
page


#### Setup spaCy

We now need to set up [spaCy](https://en.wikipedia.org/wiki/SpaCy), a natural-language processing toolkit, and we will use the Spanish language package since the collection we are working with is in Spanish!


In [None]:
%%capture
!python -m spacy download es_core_news_sm

In [None]:
import spacy
from spacy import displacy

nlp = spacy.load("es_core_news_sm")

nlp.max_length = 1100000

Next we'll run the natural language processor from spaCy, and then display the NER output. Watch how it finds organizations, people, and beyond!

In [None]:
ner = nlp(page)
displacy.render(ner, style="ent", jupyter=True)

### Wordcloud

What better way to wrap up this notebook than to create a word cloud!

Word clouds are always fun, right?! They're an interesting way to visualize word frequency, as the more times that a word occurs, the larger it will appear in the word cloud.
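
The word frequency behind a word cloud can be sketched with Python's built-in `Counter`. The token list here is made up for illustration.

```python
from collections import Counter

# A handful of made-up tokens.
tokens = [
    "violencia",
    "género",
    "universidad",
    "violencia",
    "estudiantes",
    "género",
    "violencia",
]

# Count how often each word occurs.
frequencies = Counter(tokens)

# The most frequent words would be drawn largest in the cloud.
top_words = frequencies.most_common(2)
print(top_words)
```

The `wordcloud` library does this counting for you when given text, and it also has a `generate_from_frequencies` method if you prefer to pass counts like these yourself.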

Let's set up some dependencies here. We will install the [word_cloud](https://github.com/amueller/word_cloud) library, and set up some stop words via `nltk`.

In [None]:
%%capture

!pip install wordcloud
from wordcloud import WordCloud, ImageColorGenerator

Let's remove the stopwords from our data.

In [None]:
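# Use a new name so the imported `stopwords` module isn't overwritten,
# and a set so membership checks are fast.
spanish_stopwords = set(stopwords.words("spanish"))

In [None]:
# Lowercase each token before checking it, so "De" is filtered like "de".
web_pages["stopwords"] = web_pages["content_tokenized"].apply(
    lambda x: [item.lower() for item in x if item.lower() not in spanish_stopwords]
)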

Next we'll pull 500 rows of values from our new column.

In [None]:
words = web_pages["stopwords"].head(500)

Now we can create a word cloud!

In [None]:
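# Join the tokens from all of the rows into one string. str(words) would pass
# the Series' printed representation instead, which pandas may truncate.
text = " ".join(token for tokens in words for token in tokens)

wordcloud = WordCloud(
    width=2000,
    height=1500,
    scale=10,
    max_font_size=250,
    max_words=100,
    background_color="white",
).generate(text)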
plt.figure(figsize=[35, 10])
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()