## Fetch snapshots of URLs on the Wayback Machine
- Load URLs from a `.json` file
- Fetch snapshots of the URLs from the Wayback Machine
- Process, filter and save the available snapshots

### Load URLs and IDs from a `.json` file

From the previous sketch, we have a `urls.json` file that contains the URLs and their unique IDs. We will load the URLs and their IDs from the file, which has this structure:

```python
[{'id': "abcdef", 'url': 'https://www.example.com/page1'},
 {'id': "abceef", 'url': 'https://www.example.com/page2'},
 ...]
```

In [2]:
########################
## Read URLs from a JSON file
########################
import json
URL_PATH = "urls.json"

# INITIALIZE THE URL LIST
url_list = []

with open(URL_PATH, "r") as f:
    url_list = json.load(f)

print(f"Read {len(url_list)} URLs from {URL_PATH}")
print("First 5 URLs:")
print(url_list)
for url in url_list[:5]:
    print(url)


Read 20 URLs from urls.json
First 5 URLs:
[{'id': 'ada36222219fc23621b082fa89ff77d6', 'url': 'http://www.voicenet.com/~squeeze/contras.html'}, {'id': '19b3682f1c1d94e102ec232a35afdc0d', 'url': 'http://www.mediagate.com'}, {'id': '1e5e89dbb582367562a8073a3f1b1f77', 'url': 'http://www.artzone.gr'}, {'id': '714a9b6cc3b50a18cd8cc73c6baccf80', 'url': 'http://www.noqers.org'}, {'id': 'c779d2aa9338cfaace93c747189a02d7', 'url': 'http://www.ad-guide.com'}, {'id': '637bfdf647cceb27fa18ac2177b3ce48', 'url': 'http://www.dmplaza.com'}, {'id': 'f179aedafc625fee40a6b623be081729', 'url': 'http://www.bestwebs.com/vaudeville'}, {'id': '0d14a10ed63df4b397debece15ead3d3', 'url': 'http://www.homeportfolio.com'}, {'id': '9336d859af045fbed815b745311d47b3', 'url': 'http://www.nortelnetworks.com'}, {'id': '76b72e50a809a58a564f68c92bfc6db0', 'url': 'http://www.koshergrocer.com'}, {'id': 'ef4334af1f326f8f59bb715d23f9a593', 'url': 'http://www.pacificnet.net/~kites/ads/hang.ads.html'}, {'id': '9cedb6d87b11f118d6c7

### Fetch available snapshots from Wayback Machine
The Wayback Machine provides an API for listing the available snapshots of a URL, called the [Wayback CDX Server](https://archive.org/developers/wayback-cdx-server.html#basic-usage). By querying this API programmatically, we can get the available snapshots of a URL in the archive. Currently, the CDX Server returns these fields for each snapshot:
```
["urlkey","timestamp","original","mimetype","statuscode","digest","length"]
```
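
To make the field order concrete, here is a minimal sketch that splits one line of a CDX response into the seven fields listed above. The sample line is made up for illustration and is not real archive data, and `parse_cdx_line` is a hypothetical helper, not part of any library. The sketch assumes the plain-text output, in which the fields on a line are separated by spaces.

```python
from datetime import datetime

CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length"]


def parse_cdx_line(line):
    """Split one space-separated CDX line into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(CDX_FIELDS):
        raise ValueError(f"Expected {len(CDX_FIELDS)} fields, got {len(values)}")
    return dict(zip(CDX_FIELDS, values))


# made-up sample line, for illustration only
sample_line = (
    "com,example)/page1 20010203040506 http://www.example.com/page1 "
    "text/html 200 EXAMPLEDIGEST 1234"
)

record = parse_cdx_line(sample_line)

# timestamps are 14 digits: YYYYMMDDhhmmss
captured_at = datetime.strptime(record["timestamp"], "%Y%m%d%H%M%S")
print(record["statuscode"], captured_at.isoformat())
```

Note that every value comes back as a string here. The notebook itself uses pandas to parse the response, which converts numeric-looking columns to numbers when it can.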
  
We will focus on just two of them: the `timestamp` field, which records when the snapshot was taken, and the `statuscode` field, which is the [HTTP status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) recorded for the snapshot.
- With `url` and `timestamp`, we can construct the URL of the snapshot and scrape its content later.
- With `statuscode`, we can filter out the snapshots that are not actually available on the Wayback Machine.
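
As an aside, the CDX Server can also do part of this selection on the server side through its `fl` (field list) and `filter` query parameters. The sketch below only builds such a query URL and does not send any request. `make_filtered_cdx_url` is a hypothetical helper name, and the rest of this notebook does not use these parameters: it downloads all fields and filters in Python instead.

```python
from urllib.parse import urlencode

CDX_BASE_URL = "https://web.archive.org/cdx/search/cdx"


def make_filtered_cdx_url(url, from_time, to_time):
    """Build a CDX query that returns only two fields of status-200 captures."""
    params = {
        "url": url,
        "from": from_time,
        "to": to_time,
        "fl": "timestamp,statuscode",  # return only these two fields
        "filter": "statuscode:200",    # keep only captures recorded with status 200
    }
    # urlencode percent-encodes "," and ":" in the parameter values
    return f"{CDX_BASE_URL}?{urlencode(params)}"


filtered_url = make_filtered_cdx_url("example.com", "19960101", "20051231")
print(filtered_url)
```

Filtering on the server reduces the size of the response, while filtering in Python (as done below) keeps the full list around in case other status codes turn out to be interesting.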

In [5]:
# requests is a library for making HTTP requests
# here we are just using it to construct a URL with parameters
import requests

# pandas is a powerful data manipulation library
# here we are using it to read the CSV response from the Wayback Machine
import pandas as pd

from time import sleep

from IPython.display import clear_output


# define the time range for the snapshots
from_time = "19960101"
to_time = "20051231"

def make_wm_cdx_url(url, from_time="19960101", to_time="20051231"):
    """
    Construct a URL to query the Wayback Machine CDX API
    for a given URL and time range
    """
    base_url = "https://web.archive.org/cdx/search/cdx"
    params = {
        "url": url,
        "from": from_time,
        "to": to_time,
    }

    # this will create a URL with the parameters
    # eg. https://web.archive.org/cdx/search/cdx?url=example.com&from=19960101&to=20051231
    url_with_params = requests.Request("GET", base_url, params=params).prepare().url
    return url_with_params


total_url_count = len(url_list)
sleep_time_on_error = 10
sleep_time_on_success = 1


for i, entry in enumerate(url_list):
    print(f"Processing URL {i+1}/{total_url_count}")

    url = entry["url"]
    # named url_id to avoid shadowing Python's built-in id() function
    url_id = entry["id"]
    print(f"Processing URL {url} with ID {url_id}")

    # create the URL with parameters, using the time range defined above
    wayback_cdx_url = make_wm_cdx_url(url, from_time, to_time)
    print(f"Wayback cdx url {wayback_cdx_url}")

    url_snapshots = []

    # we will try to fetch the URL multiple times in case of errors
    max_tries = 5

    for j in range(max_tries):
        # because we are sending requests to the internet, there are many things that can go wrong
        # we use a try/except block to catch any errors that occur
        # this is important because if one URL fails, we don't want the whole script to stop
        print(f"Try {j+1}/{max_tries}")
        try:
            # pandas fetches the space-separated response from the URL
            # and parses it into a DataFrame,
            # which is the main data structure in pandas
            dataframe = pd.read_csv(
                wayback_cdx_url,
                names=[
                    "urlkey",
                    "timestamp",
                    "original",
                    "mimetype",
                    "statuscode",
                    "digest",
                    "length",
                ],
                # raw string, so that "\s" is not treated as an invalid escape sequence
                sep=r"\s+",
            )

            # we convert the dataframe to a list of dictionaries
            snapshot_list = dataframe.to_dict("records")

            print(f"Found {len(snapshot_list)} snapshots for {url}")

            # only keep the timestamp and status code for each snapshot
            for snapshot in snapshot_list:
                url_snapshots.append(
                    {
                        "timestamp": snapshot["timestamp"],
                        "statuscode": snapshot["statuscode"],
                    }
                )
            break

        except Exception as e:
            print(f"Error reading {wayback_cdx_url}: {e}")
            print(f"Sleeping for {sleep_time_on_error} seconds")
            sleep(sleep_time_on_error)

            
    print(f"First 5 snapshots for {url}")
    for snapshot in url_snapshots[:5]:
        print(snapshot)
    entry["snapshots"] = url_snapshots
    
    # sleep for a while to avoid hitting the Wayback Machine too hard
    print(f"Sleeping for {sleep_time_on_success} seconds")
    sleep(sleep_time_on_success)
    clear_output()

print("First 5 entries:")
for entry in url_list[:5]:
    print(entry)

First 5 entries:
{'id': 'ada36222219fc23621b082fa89ff77d6', 'url': 'http://www.voicenet.com/~squeeze/contras.html', 'snapshots': [{'timestamp': 19961222204926, 'statuscode': 200}, {'timestamp': 19970406223306, 'statuscode': 200}, {'timestamp': 19970615113827, 'statuscode': 200}, {'timestamp': 19970804051208, 'statuscode': 200}, {'timestamp': 19991012124911, 'statuscode': 200}, {'timestamp': 19991122034535, 'statuscode': 200}, {'timestamp': 19991127111615, 'statuscode': 200}, {'timestamp': 19991128184315, 'statuscode': 200}, {'timestamp': 19991130074106, 'statuscode': 200}, {'timestamp': 20000304055004, 'statuscode': 200}, {'timestamp': 20000308183445, 'statuscode': 200}, {'timestamp': 20000414091954, 'statuscode': 200}, {'timestamp': 20000419180441, 'statuscode': 200}, {'timestamp': 20000420232447, 'statuscode': 200}, {'timestamp': 20000520034647, 'statuscode': 200}, {'timestamp': 20000526101746, 'statuscode': 200}, {'timestamp': 20000608045525, 'statuscode': 200}, {'timestamp': 200006

### Process, filter and save available snapshots
We want to keep only the snapshots that are actually available on the Wayback Machine. We also want to combine the `timestamp` and `url` into a Wayback Machine URL that points to the exact snapshot. We will save the available snapshots in a `.json` file.

You might notice there is an `if_` in the Wayback Machine URL. Adding `if_` right after the timestamp hides the default toolbar of the Wayback Machine, which is useful when we want to scrape the content of the snapshot.
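
The URL pattern can be sketched as a small helper. The names `make_snapshot_url` and `hide_toolbar` are hypothetical and only serve this illustration. The pattern itself is the one used in the code cell further down, and the timestamp is a placeholder.

```python
def make_snapshot_url(timestamp, url, hide_toolbar=True):
    """Build the Wayback Machine URL of one snapshot of `url`."""
    # "if_" directly after the timestamp hides the Wayback Machine toolbar
    modifier = "if_" if hide_toolbar else ""
    return f"https://web.archive.org/web/{timestamp}{modifier}/{url}"


page = "https://www.example.com/page1"
without_toolbar = make_snapshot_url("20210101000000", page)
with_toolbar = make_snapshot_url("20210101000000", page, hide_toolbar=False)

print(without_toolbar)
print(with_toolbar)
```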

To decide which snapshots are available, we use the `statuscode`: we only keep the snapshots that have a `statuscode` of `200`.

For this workshop, we also sample a smaller set of snapshots to save scraping time: for each URL, we randomly pick up to 10 of its available snapshots.
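
The sampling step on its own can be sketched as follows. `sample_snapshots` and its `seed` parameter are hypothetical additions for this illustration. The notebook's own code calls `random.sample` directly and does not set a seed, so its selection changes from run to run. The `min(...)` guard matters because `random.sample` raises a `ValueError` when asked for more items than the list contains.

```python
import random


def sample_snapshots(snapshots, sample_size=10, seed=None):
    """Return up to `sample_size` randomly chosen snapshots, without repeats."""
    # a dedicated Random instance makes the draw reproducible when a seed is given
    rng = random.Random(seed)
    return rng.sample(snapshots, min(sample_size, len(snapshots)))


# stand-in data: 25 placeholder snapshots instead of real timestamps
placeholder_snapshots = list(range(25))

picked = sample_snapshots(placeholder_snapshots, sample_size=10, seed=42)
short = sample_snapshots(placeholder_snapshots[:3], sample_size=10, seed=42)

print(len(picked), len(short))
```

With fewer than 10 snapshots available, all of them are returned (in random order).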

```python
[{'id': "abcdef", 'url': 'https://www.example.com/page1', 'snapshots': [{'timestamp': 20210101000000, 'statuscode': 200, 'url': 'https://web.archive.org/web/20210101000000if_/https://www.example.com/page1'}]},
 {'id': "abceef", 'url': 'https://www.example.com/page2', 'snapshots': [{'timestamp': 20210101000000, 'statuscode': 200, 'url': 'https://web.archive.org/web/20210101000000if_/https://www.example.com/page2'}]},
 ...]
```

In [6]:
import random

url_with_snapshots_path = "urls_with_snapshots.json"
print(url_list)


new_url_list = []
snapshot_sample_size = 10

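# build a new list that only contains the URLs with available snapshots
for entry in url_list:
    snapshots = entry["snapshots"]

    available_snapshots = []

    # only keep the snapshots with status code 200
    for snapshot in snapshots:
        # compare as strings: pandas may load this column as text
        # when the CDX response contains non-numeric values such as "-"
        if str(snapshot["statuscode"]) == "200":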
            # construct the URL to the snapshot
            snapshot_url = f"https://web.archive.org/web/{snapshot['timestamp']}if_/{entry['url']}"
            new_snapshot = {
                "timestamp": snapshot["timestamp"],
                "statuscode": snapshot["statuscode"],
                "url": snapshot_url,
            }
            available_snapshots.append(new_snapshot)

    # if there are no snapshots, we don't need to keep this URL
    if len(available_snapshots) == 0:
        continue
    
    
    # randomly sample up to 10 snapshots
    # because scraping all snapshots can be slow
    available_snapshots = random.sample(available_snapshots, min(snapshot_sample_size, len(available_snapshots)))

    new_entry = {
        "url": entry["url"],
        "id": entry["id"],
        "snapshots": available_snapshots
    }
    new_url_list.append(new_entry)



print(f"Kept {len(new_url_list)} URLs with snapshots")
print("First 5 entries:")
for entry in new_url_list[:5]:
    print(entry)


#######################
# WRITE THE NEW LIST TO A FILE
#######################
print(f"Writing URLs with snapshots to {url_with_snapshots_path}")
with open(url_with_snapshots_path, "w") as f:
    json.dump(new_url_list, f, indent=2)

[{'id': 'ada36222219fc23621b082fa89ff77d6', 'url': 'http://www.voicenet.com/~squeeze/contras.html', 'snapshots': [{'timestamp': 19961222204926, 'statuscode': 200}, {'timestamp': 19970406223306, 'statuscode': 200}, {'timestamp': 19970615113827, 'statuscode': 200}, {'timestamp': 19970804051208, 'statuscode': 200}, {'timestamp': 19991012124911, 'statuscode': 200}, {'timestamp': 19991122034535, 'statuscode': 200}, {'timestamp': 19991127111615, 'statuscode': 200}, {'timestamp': 19991128184315, 'statuscode': 200}, {'timestamp': 19991130074106, 'statuscode': 200}, {'timestamp': 20000304055004, 'statuscode': 200}, {'timestamp': 20000308183445, 'statuscode': 200}, {'timestamp': 20000414091954, 'statuscode': 200}, {'timestamp': 20000419180441, 'statuscode': 200}, {'timestamp': 20000420232447, 'statuscode': 200}, {'timestamp': 20000520034647, 'statuscode': 200}, {'timestamp': 20000526101746, 'statuscode': 200}, {'timestamp': 20000608045525, 'statuscode': 200}, {'timestamp': 20000613151855, 'statu