# Discovery, data analysis, and possibilities with the API

### Setting up the notebook

This notebook will introduce you to some of the analysis that can be done with data from the Discovery API. To start, please run the first cell to set up the notebook (press the play button on the left of the cell).

In [None]:
import json
import collections
import re
import nltk
from nltk.corpus import stopwords
import ipywidgets as widgets
from datetime import date, timedelta, datetime
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import requests


nltk.download('stopwords')

count = collections.Counter()
cachedStopWords = stopwords.words("english")

### Setting up data

This next cell loads a dataset. Normally, this would be done via a series of requests to the API (instructions on how to do this can [be found here in the main series of notebooks](./0-what-is-an-api.ipynb)); for this notebook we are using a pre-downloaded file. If you want to see the raw data, expand this cell and find the URL; from there, you will be able to see the data and how it is structured.

In [None]:
# Open the file in a with block so it is closed again once the data is loaded
with open('live-notebooks/dataset.json') as f:
    dataset = json.load(f)

Let's start by having a look at the data we loaded. This first analysis cell is a simple check - it's going to print the record series available to use. When you are working with APIs you typically run checks like this to ensure you have the data you expect; here it shows what data are in the pre-downloaded set.

In [None]:
series = []
for i in dataset:
  series.append(i["series"])
print("Series are:", series)
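A check like this can be extended to confirm that every entry has the fields the later cells rely on. The sketch below assumes the structure used elsewhere in this notebook (each entry has `series` and `results`, and each result has a `record` with `title`, `description`, `startDate` and `endDate`). `find_missing_fields` and `sample_dataset` are hypothetical names for illustration, and the sample data is made up.

```python
# Fields that the later analysis cells read from each record (taken from this notebook).
REQUIRED_RECORD_FIELDS = ("title", "description", "startDate", "endDate")


def find_missing_fields(dataset):
    """Return a list of (series, record index, missing field) tuples."""
    problems = []
    for entry in dataset:
        series_name = entry.get("series", "<unknown series>")
        for index, result in enumerate(entry.get("results", [])):
            record = result.get("record", {})
            for field in REQUIRED_RECORD_FIELDS:
                if field not in record:
                    problems.append((series_name, index, field))
    return problems


# Hypothetical miniature dataset, made up purely to demonstrate the check.
sample_dataset = [
    {
        "series": "EXAMPLE 1",
        "results": [
            {"record": {"title": "A", "description": "Letters",
                        "startDate": "01/01/1900", "endDate": "31/12/1901"}},
            {"record": {"title": "B", "description": "Maps"}},
        ],
    }
]

missing = find_missing_fields(sample_dataset)
print(missing)
```

An empty list means every record has the expected fields; anything else tells you which series and record to inspect before running the analysis cells.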

With the Discovery API, you would download only the data you need - downloading the entire dataset would be over 32 million records and take a long time. Here, we are simulating this choice with the next cell. Feel free to choose your preferred record series.

In [None]:
checkboxes = [widgets.Checkbox(value=False, description=label) for label in series]
output = widgets.VBox(children=checkboxes)
display(output)

And now we check to make sure the checkboxes were read correctly. This cell will print the selected record series and number. After running this cell the first time, try changing the selection in the previous cell and running this one again.

In [None]:
# Keep the description (the series name) of every ticked checkbox
selected_data = [checkbox.description for checkbox in checkboxes if checkbox.value]
print(selected_data)

### Some starter analysis

Now we have our data and are confident it is what we want, we can start to investigate it. All of the analyses here serve as a starting point; with some minor tweaks they can be expanded or merged to produce complex and interesting results.

This first analysis cell will investigate the descriptions of the record series you have chosen, and look to see what the 5 most common words are. Note that words such as "a" and "the" are likely to be the most common, but they are not very useful, so they are removed from the analysis by the code mentioning "stopwords". To see the results of the same analysis on each record individually, there is a cell at the end of the notebook for that (note that the output is very long - hence why it's not here!).

Analyses such as this are useful for getting a sense of what is in the data, and what you might want to investigate further.

In [None]:
records_and_top_5_words = []

for j in dataset:
  if j["series"] in selected_data:
    print("Series: " + j["series"])

    description = ""
    for i in j["results"]:
      description_sentence = i["record"]["description"].lower()
      # Use the stopword list cached in the setup cell rather than reloading it for every word
      description_sentence = ' '.join([word for word in description_sentence.split() if word not in cachedStopWords])
      words = re.findall(r'\w+', description_sentence)
      # print("Five most frequent words in the description of the record titled '" + i["record"]["title"] + "' are:") # - Remove the hash in front of this line
      record_and_words = {}
      record_and_words["title"] = i["record"]["title"]
      record_and_words["words"] = []
      for word, word_count in collections.Counter(words).most_common(5):
        # print(word + " which appeared " + str(word_count) + " times.") # - And this one to see the most frequent words in each record
        record_and_words["words"].append(word)
      # Add a space so the last word of one description does not merge with the first word of the next
      description = description + " " + description_sentence
      records_and_top_5_words.append(record_and_words)
    print("Five most frequent words in the description of", j["series"], "series are:")
    for word, word_count in collections.Counter(re.findall(r'\w+', description)).most_common(5):
      print(word + " which appeared " + str(word_count) + " times.")
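The core of the cell above, counting words after removing stopwords, can be pulled into a small reusable function. This is a minimal sketch rather than the notebook's own method: `top_words` is a hypothetical helper, it tokenises before filtering (the cell above filters first), and the tiny stopword set stands in for the NLTK list so the example runs without any downloads.

```python
import collections
import re

# A tiny stand-in stopword list (assumption: in the notebook you would pass cachedStopWords instead).
EXAMPLE_STOPWORDS = {"a", "the", "of", "and", "to", "in"}


def top_words(text, stop_words, n=5):
    """Return the n most common non-stopword words in text as (word, count) pairs."""
    words = re.findall(r"\w+", text.lower())
    kept = [word for word in words if word not in stop_words]
    return collections.Counter(kept).most_common(n)


# Made-up description text for illustration.
example = "The map of the coast and the map of the river"
print(top_words(example, EXAMPLE_STOPWORDS, n=2))
```

Having the logic in one function makes it easier to reuse for a single record, a whole series, or any other text field you want to compare.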

We can do a similar analysis on the length of record descriptions. This cell will print the length of the longest description of an individual record in each record series. This can be particularly useful for identifying series with atypically long descriptions, indicating a high level of detail. This could be easily modified to show the average length of a description as well - comparing the two values would then indicate if there was one unusually long description, or if the series as a whole had long descriptions. The results could be used in conjunction with the previous analysis to see if longer descriptions mean more of the most common words, or greater variety.

As with the previous cell, this cell shows a truncated output: just the length of the longest description in each series. If you would like to see the length of each record's description, there is another cell at the end of the notebook.

In [None]:
record_description_lengths = []

global_max = ("", 0)
for i in dataset:
  print(i["series"])
  local_max = 0
  for j in i["results"]:
    # Count the words in the description
    length = len(re.findall(r'\w+', j["record"]["description"].lower()))
    if length > local_max:
      local_max = length
    #print("Length of the description for the record titled '" + j["record"]["title"] + "':", length) # - Remove the hash at the start of this line to print the length for every record
    record_description_lengths.append({
        "series": i["series"],
        "title": j["record"]["title"],
        "length": length
    })
  print("The longest description in this series is " + str(local_max) + " words.")
  if global_max[1] < local_max:
    global_max = (i["series"], local_max)
print("Lengthiest description is for the series:", global_max[0], "with length", global_max[1])
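The average-length variation described above could look something like the following. It is a sketch under assumptions: it expects a list shaped like `record_description_lengths`, `sample_lengths` is made-up data so the block runs on its own, and `summarise_lengths` is a hypothetical name.

```python
from statistics import mean


def summarise_lengths(lengths):
    """Group description lengths by series and return {series: (longest, average)}."""
    by_series = {}
    for item in lengths:
        by_series.setdefault(item["series"], []).append(item["length"])
    return {name: (max(values), mean(values)) for name, values in by_series.items()}


# Made-up values in the same shape as record_description_lengths.
sample_lengths = [
    {"series": "EXAMPLE 1", "title": "A", "length": 10},
    {"series": "EXAMPLE 1", "title": "B", "length": 90},
    {"series": "EXAMPLE 2", "title": "C", "length": 40},
    {"series": "EXAMPLE 2", "title": "D", "length": 44},
]

for name, (longest, average) in summarise_lengths(sample_lengths).items():
    print(name, "longest:", longest, "average:", average)
```

A longest value far above the average, as in `EXAMPLE 1`, points to a single unusually long description; values that are close together, as in `EXAMPLE 2`, suggest the series as a whole has descriptions of similar length.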

Another important aspect of records is their covering dates. This cell looks through the metadata of each record within a series, looking for the earliest start date and latest end date, and printing the dates it finds. This specific analysis serves as an approach to verifying the accuracy of data, as the dates found by looking at the records should match the covering dates recorded in the record series metadata. An example adjustment could be to see where the most common covering dates within a series are, to see whether the series is biased towards a particular time period (e.g. if there is a single record with a longer coverage, and the rest overlap over a smaller period).

As with the description length cell, this one shows a reduced output - if you would like to see the covering dates for each record, there is a cell at the end of the notebook which will print them.

In [None]:
records_start_and_end_dates = []

for i in dataset: 
    series_details = {}
    print(i["series"])
    series_details["series"] = i["series"]
    series_data = []
    earliest_start_date = date(9999, 12, 31)
    latest_end_date = date(1, 1, 1)
    for j in i["results"]:
        if (j["record"]["startDate"] == "") or (j["record"]["endDate"] == ""):  # Skip records with no dates
            continue
        # Parse each date once and reuse the result
        start_date = datetime.strptime(j["record"]["startDate"], "%d/%m/%Y").date()
        end_date = datetime.strptime(j["record"]["endDate"], "%d/%m/%Y").date()
        earliest_start_date = min(earliest_start_date, start_date)
        latest_end_date = max(latest_end_date, end_date)
        series_data.append({
            "title": j["record"]["title"],
            "start_date": start_date.isoformat(),
            "end_date": end_date.isoformat()
        })
    print("Earliest start date:", earliest_start_date)
    print("Latest end date:", latest_end_date)
    series_details["earliest_start_date"] = earliest_start_date
    series_details["latest_end_date"] = latest_end_date
    series_details["series_data"] = series_data
    records_start_and_end_dates.append(series_details)
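The adjustment suggested above, finding the years covered by the most records in a series, could be sketched as follows. It assumes ISO-format `start_date` and `end_date` strings like those stored in `series_data`; `busiest_years` and `sample_series_data` are hypothetical, and the dates are invented for illustration.

```python
import collections
from datetime import date


def busiest_years(series_data):
    """Return (years covered by the most records, number of records covering them)."""
    coverage = collections.Counter()
    for record in series_data:
        start_year = date.fromisoformat(record["start_date"]).year
        end_year = date.fromisoformat(record["end_date"]).year
        # Count every year the record covers, inclusive of both ends.
        coverage.update(range(start_year, end_year + 1))
    if not coverage:
        return [], 0
    highest = max(coverage.values())
    return sorted(year for year, n in coverage.items() if n == highest), highest


# Invented records: one long-running record and two that overlap over a short period.
sample_series_data = [
    {"title": "A", "start_date": "1900-01-01", "end_date": "1950-12-31"},
    {"title": "B", "start_date": "1910-01-01", "end_date": "1912-12-31"},
    {"title": "C", "start_date": "1911-06-01", "end_date": "1913-01-31"},
]

years, record_count = busiest_years(sample_series_data)
print(years, record_count)
```

Here the series nominally spans 1900 to 1950, but the records cluster around 1911 and 1912, which is the kind of bias towards a particular period that the earliest and latest dates alone would hide.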

### Putting it all to use

All of these data in their various lists can become a bit unwieldy and hard to read. Rather than trying to spot data within these many lines, we can draw a graph. This is one of the big advantages of working with large datasets - it can be much easier both to see trends in data without having to go through by hand, and to use these trends to identify specific records of interest.

In [None]:
event = []
begin = []
end = []
length = []

for i in records_start_and_end_dates:
    event.append(i["series"])
    # The stored values are already date objects, so the year can be read directly
    begin.append(i["earliest_start_date"].year)
    end.append(i["latest_end_date"].year)
    # Label each bar with its date range, start date first
    length.append(i["earliest_start_date"].isoformat() + " - " + i["latest_end_date"].isoformat())

event_np = np.array(event)
begin_np = np.array(begin)
end_np = np.array(end)
length_np = np.array(length)

plt.figure(figsize=(12,6))
plt.barh(range(len(begin_np)), (end_np - begin_np), .3, left=begin_np)
plt.yticks(range(len(begin_np)), event)
plt.tick_params(axis='both', which='major', labelsize=15)
plt.tick_params(axis='both', which='minor', labelsize=20)
plt.title('Record Period', fontsize = '25')
plt.xlabel('Period', fontsize = '20')
plt.ylabel('Record', fontsize = '20')
plt.xlim(min(begin_np) - 10, max(end_np) + 10)
plt.grid(linewidth=0.4, alpha=0.6)
for i in range(len(begin_np)):
    plt.text(begin[i] + (end[i]-begin[i])/2, i+.2, length[i], ha='center', fontsize = '12')

### Using the API LIVE

If you want to continue learning what data can be gathered from the [Discovery API](https://discovery.nationalarchives.gov.uk/API/api.htm), the main series of notebooks will take you through the various steps in greater detail, including what different data you can request from the API, what filters are available, and what you can do with the results. If you want to start with an introduction to APIs, start with the [general introductory notebook](./0-what-is-an-api.ipynb); if you are confident making requests, or want to jump straight in, start with the [next one](./1-requests.ipynb), where the focus shifts more to the Discovery API.

If you'd like a quick taster, expand the cell below, and add a search term to the `search_query` variable. Then run the cell to see the results. The API provides a lot of data, so you will need to scroll down to see the results.

In [None]:
base_discovery_url = "https://discovery.nationalarchives.gov.uk/API"

search_endpoint = "/search/records"

search_query_parameter = "sps.searchQuery"

search_query = "dave" # Ensure you leave the quotes in place around the search query

search_url = base_discovery_url + search_endpoint

# Passing the query via params lets requests encode spaces and special characters in the URL
response = requests.get(search_url, params={search_query_parameter: search_query})

print(json.dumps(response.json(), indent=4))
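Search terms containing spaces or punctuation need encoding before they go into a URL. The sketch below builds the same search URL with the standard library's `urllib.parse.urlencode`; it only constructs the URL and does not contact the API. `build_search_url` is a hypothetical helper, while the base URL, endpoint and parameter name are the ones used in the cell above.

```python
from urllib.parse import urlencode

BASE_DISCOVERY_URL = "https://discovery.nationalarchives.gov.uk/API"
SEARCH_ENDPOINT = "/search/records"
SEARCH_QUERY_PARAMETER = "sps.searchQuery"


def build_search_url(search_query):
    """Return the search URL with the query safely percent-encoded."""
    query_string = urlencode({SEARCH_QUERY_PARAMETER: search_query})
    return BASE_DISCOVERY_URL + SEARCH_ENDPOINT + "?" + query_string


# A made-up two-word search term to show the space being encoded.
print(build_search_url("domesday book"))
```

Printing the URL before sending a request is a handy way to check exactly what will be asked of the API.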

### Full data output

In some of the cells above, the output was for the record series, not the individual records, as the full output for each record would be very long. If you want to see the full outputs, run the cells here. 

#### The top 5 words in each record. 

In [None]:
## print out records_and_top_5_words nicely

for record in records_and_top_5_words:
    print("For the record titled '" + record["title"] + "' the five most frequent words in the description are:")
    for word in record["words"]:
        print(word)
    print("\n")

#### The length of each record description.

In [None]:
for record in record_description_lengths:
    print("Record title - '" + record["title"] + "': \nSeries - '" + record["series"] + "': \nDescription length  - " + str(record["length"]) + ".\n\n")

#### The covering dates of each record.

In [None]:
# Use a different loop name so the `series` list from earlier cells is not overwritten
for series_details in records_start_and_end_dates:
    for record in series_details["series_data"]:
        print("Record title - '" + record["title"] + "': \nSeries - '" + series_details["series"] + "': \nStart date - " + record["start_date"] + ". \nEnd date - " + record["end_date"] + ".\n\n")