# Weird descriptions

Ever come across a newspaper headline that makes you think someone was just sticking random words together, but it turns out it was a real (if somewhat unusual) event? This notebook takes inspiration from those headlines, and runs through the steps to find weird descriptions in Discovery series.

As always, start by importing the necessary libraries.

In [None]:
# json is part of the Python standard library, so it doesn't need installing
%pip install -q wordfreq
%pip install -q matplotlib
%pip install -q numpy
import json
from wordfreq import word_frequency
import matplotlib.pyplot as plt
import numpy as np

import helper_functions as hf

## What is "weird" anyway?

To do this reliably, start by defining a metric of weirdness. The metric here is based on the idea that weird headlines tend to stick out because they use words that don't appear often, in unusual combinations. As the main series of notebooks showed, it's fairly straightforward to get all the descriptions from a series, so let's assume access to that data.

`record weirdness = average(average word weirdness from local, average word weirdness from global, average weirdness of word combinations)`

Where: 
 
`word weirdness from local = 1 - (word frequency in all records in the series / total number of words in all records in the series)` - As a hypothetical example, if all records start with "The" and then have 100 different words that don't appear anywhere else, "the" should have a low score: it's uncommon within individual records, but very common overall.

`word weirdness from global` - There are datasets that provide word frequency in the English language which can be used to get a sense of how common a word is in general. 

`weirdness of word combinations = 1 - (pair frequency in all records in the series / total number of pairs in the series)` - Looking at every adjacent pair of words in every record, how often are those two words next to each other? Similar to the word weirdness, if all descriptions start with "The Parliament" and then have 100 different words that don't appear in other records, "The Parliament" should have a low score.


These values are then averaged to accommodate the variable length of record descriptions: some have lots of words, some have very few. If the weirdness value for each word was summed, then longer records would have a higher weirdness score than shorter records.
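As a rough illustration of the two series-based parts of the metric, here is a minimal sketch on a toy corpus. The descriptions and helper names are invented for this example, the global (`wordfreq`) term is left out, and every word in a description is averaged, so the numbers will not match the cells later in this notebook exactly.

```python
from collections import Counter

# Hypothetical toy descriptions, invented purely to illustrate the formulas
descriptions = [
    "the parliament sat",
    "the parliament rose",
    "the cat sang",
]

def average(values):
    """Mean of a list, or 0 for an empty list."""
    return sum(values) / len(values) if values else 0

# Series-wide counts of words and of adjacent word pairs
tokens_per_record = [d.split(" ") for d in descriptions]
all_words = [w for tokens in tokens_per_record for w in tokens]
all_pairs = [p for tokens in tokens_per_record for p in zip(tokens, tokens[1:])]
word_counts = Counter(all_words)
pair_counts = Counter(all_pairs)

def local_word_weirdness(word):
    return 1 - word_counts[word] / len(all_words)

def pair_weirdness(pair):
    return 1 - pair_counts[pair] / len(all_pairs)

def record_scores(description):
    tokens = description.split(" ")
    pairs = list(zip(tokens, tokens[1:]))
    return {
        "words": average([local_word_weirdness(w) for w in tokens]),
        "pairs": average([pair_weirdness(p) for p in pairs]),
    }

scores = {d: record_scores(d) for d in descriptions}
for description, score in scores.items():
    print(description, score)
```

In this toy corpus "the cat sang" scores higher than "the parliament sat" on both measures, because "parliament" and the pair "the parliament" are shared between two records.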

## Let's start with getting the data

As this was explained in detail in the [main series](../1-intro-to-discovery-api.ipynb), this step of the process lives in the `helper_functions.py` file to avoid repetition and to keep this notebook free of familiar code. Simply run the cell below to get the data. The default series it'll use is [`TITH`](https://discovery.nationalarchives.gov.uk/details/r/C254), but you can change that if you want. Note that really large series (e.g. `ADM`) will take a while to run, and may result in a timeout error from too many requests to the API.

To make it easy to work through and work on the weirdness score, the data gathered from this first step is going to look like this (note the nested nature as we are gathering every record from a series - sub-series will be stepped into using the child endpoint of the API):

```JSON
{
    "series": "TITH",
    "children": [
        {
            "series": "TITH 1",
            "id": "discovery web id",
            "description": "description of the record",
            "children" : [
                {
                    "series": "TITH 1/1",
                    "id": "discovery web id",
                    "description": "description of the record",
                    "children": [ # and so on]
                }
            ]
        }
    ]
}
```
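One way to walk a nested structure like this is a small recursive generator. The sketch below is illustrative only: `iter_records` and the sample data are hypothetical, and it assumes that a record without further children either has no list under `children` or carries the placeholder string `"No children"` that the later cells in this notebook check for.

```python
def iter_records(record):
    """Yield the record and every descendant, depth first."""
    yield record
    children = record.get("children")
    # Assumption: only a list means there are child records to step into
    if isinstance(children, list):
        for child in children:
            yield from iter_records(child)

# Hypothetical miniature of the structure shown above
sample = {
    "series": "TITH",
    "children": [
        {
            "series": "TITH 1",
            "id": "id-1",
            "description": "first description",
            "children": [
                {
                    "series": "TITH 1/1",
                    "id": "id-2",
                    "description": "second description",
                    "children": "No children",
                }
            ],
        }
    ],
}

# The top level has no description, so it is skipped here
descriptions = [r["description"] for r in iter_records(sample) if r.get("description") is not None]
print(descriptions)
```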

In [None]:
data = hf.get_series_description("ACE")

print(json.dumps(data, indent=4))

With the data gathered, let's start working on the score. The first step is to see what words are in the descriptions, their frequency, and their weirdness score. The new data will be stored under a new key in the data structure, `word_weirdness`, at the same level as the `children` key. The data in it will look like this:

```JSON
{
    "series": "TITH",
    "children": [as before],
    "total_words": {
        "total": 1000,
        "unique": 250
    },
    "word_weirdness": [
        {
            "word": "the",
            "count": 100,
            "weirdness": 0.9,
            "word_frequency": 0.1
        },
        {for every word in the description}
    ]
}
```
Word frequency, in this case, is being supplied by the [`wordfreq`](https://github.com/rspeer/wordfreq) package, which has a list of word frequencies in the English language, based on similar maths using a much larger dataset. You can read more about it in the link.

These scores are based on all the descriptions in the data, not the descriptions of individual records, hence looping through every record and keeping a total score, rather than resetting the score each time. After this, the process for word pairs will be similar. 


In [None]:
words = []
word_weirdness = []

def get_words(description):
    if description["children"] != "No children":
        for child in description["children"]:
            get_words(child) # Recursion - if the record has children, call this function again with the child as the argument
    if "description" in description: # The top level of the data doesn't have a description
        if description["description"] is not None: # Sometimes the description is empty
            for word in description["description"].split(" "):
                words.append(word)
    return

get_words(data)

from collections import Counter

word_counts = Counter(words) # Count every word once, rather than calling words.count() for each unique word

for unique_word, count in word_counts.items():
    word_weirdness.append(
        {
            "word": unique_word,
            "count": count,
            "weirdness" : 1 - (count / len(words)),
            "word_frequency": word_frequency(unique_word, "en", wordlist="large")
        }
    )

word_weirdness.sort(key=lambda x: x["weirdness"], reverse=True) # This sorts the list in place, with the highest weirdness first

data["total_words"] = {
    "total": len(words),
    "unique": len(set(words))
}
data["word_weirdness"] = word_weirdness

print(json.dumps(data["word_weirdness"], indent=4))

If you scroll through the results of that cell, you'll notice that the data have a very [long tail](https://en.wikipedia.org/wiki/Long_tail). Plotting that later will allow a better look at the data. It's also interesting to note that, as word frequency scores are based on the English language as a whole, many of these words result in really tiny scores.

## Word pairs

This is a similar process to the above, with the main difference being that instead of isolating every word, we are isolating every adjacent pair of words. The output data structure is going to look essentially the same.
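As a small sketch of what "isolating every pair" means, `zip` can line each word up with the one that follows it. The description below is made up for the example.

```python
from collections import Counter

description = "the cat sat on the cat"  # Hypothetical description
tokens = description.split(" ")

# zip pairs each word with the word that follows it
pairs = list(zip(tokens, tokens[1:]))
pair_counts = Counter(pairs)

print(pairs)
print(pair_counts.most_common(1))
```

A description of n words always gives n - 1 adjacent pairs, so a one-word description contributes none.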


In [None]:
pairs = []
pair_weirdness = []

def get_pairs(description):
    if description["children"] != "No children":
        for child in description["children"]:
            get_pairs(child) # Recursion - if the record has children, call this function again with the child as the argument
    if "description" in description: # The top level of the data doesn't have a description
        if description["description"] is not None: # Sometimes the description is empty
            tokens = description["description"].split(" ") # Split once and reuse the result
            for i in range(len(tokens) - 1):
                pairs.append((tokens[i], tokens[i+1]))
    return

get_pairs(data)

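from collections import Counter

pair_counts = Counter(pairs) # Count every pair once, rather than calling pairs.count() for each unique pair

for unique_pair, count in pair_counts.items():
    pair_weirdness.append(
        {
            "pair": unique_pair,
            "count": count,
            "weirdness" : 1 - (count / len(pairs))
        }
    )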

pair_weirdness.sort(key=lambda x: x["weirdness"], reverse=True) # This sorts the list in place, with the highest weirdness first

data["pair_weirdness"] = pair_weirdness

print(json.dumps(data["pair_weirdness"], indent=4))

The big thing to notice here is that almost every pair is close to unique, getting a score close to 1. This is to be expected, as the number of possible pairs is so large. For example, 100 unique words can form 100 × 100 = 10,000 ordered pairs (9,900 if a word can't follow itself). The descriptions will only contain a small fraction of them.
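To get a feel for how sparse the observed pairs are, the number of possible ordered pairs for a vocabulary of n words is n × n when a word may follow itself. The sketch below checks this on a tiny, made-up vocabulary and compares it with the pairs that actually occur.

```python
from itertools import product

vocabulary = ["the", "cat", "sat", "on", "mat"]  # Hypothetical vocabulary
descriptions = ["the cat sat", "the cat sat on the mat"]  # Hypothetical descriptions

# Every ordered pair of vocabulary words, including a word followed by itself
possible_pairs = set(product(vocabulary, repeat=2))

# Every adjacent pair that actually appears in the descriptions
observed_pairs = set()
for description in descriptions:
    tokens = description.split(" ")
    observed_pairs.update(zip(tokens, tokens[1:]))

print(len(possible_pairs))  # 5 * 5 ordered pairs
print(len(observed_pairs))
print(len(observed_pairs) / len(possible_pairs))
```

Even with only five words, the descriptions use a fifth of the possible pairs. With a vocabulary of thousands of words the fraction becomes far smaller.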

## Adding data to the original structure

Now that scores for each word and pair have been calculated, the weirdness score for each record (using the definitions earlier in the notebook) can be added to the data structure. They will be added as new key/value pairs on each record. Given that the three weirdness scores differ so much from one another, both the average of the three and the three individual scores are included. Remember that the scores are being averaged to accommodate varying description lengths.

In [None]:
# Build lookup tables once, so each record doesn't have to scan the full lists
word_lookup = {entry["word"]: entry for entry in word_weirdness}
pair_lookup = {entry["pair"]: entry["weirdness"] for entry in pair_weirdness}

def add_description_scores(record):
    if record["children"] != "No children":
        for child in record["children"]:
            add_description_scores(child)
    if "description" in record:
        if record["description"] is not None:
            tokens = record["description"].split(" ")
            token_pairs = list(zip(tokens, tokens[1:])) # Adjacent pairs of words
            unique_tokens = list(dict.fromkeys(tokens)) # Each distinct word is counted once
            unique_pairs = list(dict.fromkeys(token_pairs)) # Each distinct pair is counted once
            # Every score is divided by the number of words in the description
            record["word_weirdness_score"] = sum(word_lookup[word]["weirdness"] for word in unique_tokens) / len(tokens)
            record["word_frequency_weirdness_score"] = sum(word_lookup[word]["word_frequency"] for word in unique_tokens) / len(tokens)
            record["pair_weirdness_score"] = sum(pair_lookup[pair] for pair in unique_pairs) / len(tokens) / 2
            record["average_weirdness_score"] = (record["word_weirdness_score"] + record["word_frequency_weirdness_score"] + record["pair_weirdness_score"]) / 3

add_description_scores(data)

print(json.dumps(data, indent=4))

## Exploring the data

To start exploring the data, the first thing is to show which records are the most and least weird. This is done with a similar recursive function, looking for the highest and lowest average weirdness scores.
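An alternative that avoids global variables is to flatten the nested records first and then use `max` and `min` with a key function. This is only a sketch: `iter_records` and the sample scores are hypothetical, and it assumes `children` is a list whenever there are child records.

```python
def iter_records(record):
    """Yield the record and every descendant (assumes children is a list when present)."""
    yield record
    children = record.get("children")
    if isinstance(children, list):
        for child in children:
            yield from iter_records(child)

# Hypothetical scored data, shaped like the structure used in this notebook
sample = {
    "series": "X",
    "children": [
        {"id": "a", "average_weirdness_score": 0.61, "children": "No children"},
        {"id": "b", "average_weirdness_score": 0.74, "children": [
            {"id": "c", "average_weirdness_score": 0.58, "children": "No children"},
        ]},
    ],
}

# Keep only records that were given a score, then pick the extremes
scored = [r for r in iter_records(sample) if "average_weirdness_score" in r]
weirdest = max(scored, key=lambda r: r["average_weirdness_score"])
least_weird = min(scored, key=lambda r: r["average_weirdness_score"])
print(weirdest["id"], least_weird["id"])
```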

In [None]:
# Look through the data and find the record with the highest average_weirdness_score

highest_weirdness = 0
weirdest_record = {}

def find_highest_weirdness(record):
    global highest_weirdness
    global weirdest_record
    if record["children"] != "No children":
        for child in record["children"]:
            find_highest_weirdness(child)
    if "average_weirdness_score" in record:
        if record["average_weirdness_score"] > highest_weirdness:
            highest_weirdness = record["average_weirdness_score"]
            weirdest_record = record

find_highest_weirdness(data)

print(json.dumps(weirdest_record, indent=4))

lowest_weirdness = 1
least_weird_record = {}

def find_lowest_weirdness(record):
    global lowest_weirdness
    global least_weird_record
    if record["children"] != "No children":
        for child in record["children"]:
            find_lowest_weirdness(child)
    if "average_weirdness_score" in record:
        if record["average_weirdness_score"] < lowest_weirdness:
            lowest_weirdness = record["average_weirdness_score"]
            least_weird_record = record

find_lowest_weirdness(data)

print(json.dumps(least_weird_record, indent=4))

Remember the long tail in the word count scores? Let's plot that to see how long it is.

In [None]:
word_and_count = {}

for word in data["word_weirdness"]:
    word_and_count[word["word"]] = word["count"]

plt_word_counts, ax_word_counts = plt.subplots(figsize=(30, 30))
ax_word_counts.bar(word_and_count.keys(), word_and_count.values())
ax_word_counts.set_xlabel('Word')
ax_word_counts.set_ylabel('Count')
ax_word_counts.set_title('Word count')
ax_word_counts.set_xticks([]) # This stops the x-ticks from being displayed - there are too many words to display them all
ax_word_counts.yaxis.set_minor_locator(plt.MultipleLocator(1))
ax_word_counts.yaxis.set_minor_formatter(plt.FuncFormatter(lambda x, _: int(x)))
ax_word_counts.grid(which="minor", axis="y", linestyle="--")
plt.show()

Yep, that's a long tail! A lot of words only appear once, and most appear fewer than 10 times.

To see how the weirdness score is distributed, let's plot that too. Here we're going to use a bar chart with one bar per record, with less refinement than the other graphs, as the goal is just an overview of how the scores are distributed. Remember that the record data is nested, so we'll need to step into the sub-series to get the scores.
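For a more compact view of the distribution, the scores could also be grouped into bins. The sketch below does this with the standard library only, on made-up scores, and assumes every score lies between 0 and 1.

```python
from collections import Counter

# Hypothetical weirdness scores, standing in for the values gathered per record
scores = [0.61, 0.62, 0.64, 0.66, 0.67, 0.71, 0.93]

n_bins = 10  # Assumed number of equal-width bins across the range 0 to 1

# Bin index i covers scores from i / n_bins up to (but not including) (i + 1) / n_bins;
# a score of exactly 1 is placed in the last bin
bin_counts = Counter(min(int(score * n_bins), n_bins - 1) for score in scores)

# Print a simple text histogram, one row per bin
for index in range(n_bins):
    lower = index / n_bins
    print(f"{lower:.1f}-{lower + 1 / n_bins:.1f} {'#' * bin_counts[index]}")
```

With matplotlib, `ax.hist(list(id_and_weirdness.values()), bins=20)` would draw the same kind of summary once the scores have been gathered.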

In [None]:
id_and_weirdness = {}

def get_id_and_weirdness(record):
    if record["children"] != "No children":
        for child in record["children"]:
            get_id_and_weirdness(child)
    if "id" in record:
        if "average_weirdness_score" in record:
            id_and_weirdness[record["id"]] = record["average_weirdness_score"]

get_id_and_weirdness(data)

plt_id_and_weirdness, ax_id_and_weirdness = plt.subplots(figsize=(10, 10))

ax_id_and_weirdness.bar(id_and_weirdness.keys(), id_and_weirdness.values())
ax_id_and_weirdness.set_xlabel('ID')
ax_id_and_weirdness.set_xticks([])
ax_id_and_weirdness.set_ylabel('Weirdness')
ax_id_and_weirdness.set_title('Weirdness by ID')
plt.show()

For the default series, this shows that most records are roughly the same weirdness. 

## Some starter stats

The weirdest record has been found, but a quick statistical analysis will give more confidence in saying "it's weird". Let's find the mean and standard deviation of the scores. If our record is more than 3 standard deviations from the mean, then it can be said to be statistically weird. As these are very widely used metrics, `numpy` makes this very approachable.
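The three-standard-deviation check can be sketched on a handful of made-up scores using the standard library. `statistics.pstdev` is the population standard deviation, which is also what `np.std` returns by default.

```python
from statistics import mean, pstdev

# Hypothetical scores: most records cluster together, one sits far away
scores = [0.60, 0.61, 0.62] * 4 + [0.95]

mu = mean(scores)
sigma = pstdev(scores)  # Population standard deviation

# A score counts as an outlier if it is more than 3 standard deviations from the mean
outliers = [score for score in scores if abs(score - mu) > 3 * sigma]

print(f"mean={mu:.3f}, standard deviation={sigma:.3f}")
print(outliers)
```

Note that with very few records no score can ever be 3 standard deviations from the mean, so the check only becomes meaningful on a reasonably large series.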

In [None]:
mean_weirdness = np.mean(list(id_and_weirdness.values()))
standard_deviation_weirdness = np.std(list(id_and_weirdness.values()))

print(f"The mean weirdness is {mean_weirdness} and the standard deviation is {standard_deviation_weirdness}")

# Reprint the weirdest record

print(json.dumps(weirdest_record, indent=4))

# see if the weirdest record is out by more than 3 standard deviations, else print the average weirdness score it would need to be an outlier

if weirdest_record["average_weirdness_score"] > mean_weirdness + 3 * standard_deviation_weirdness or weirdest_record["average_weirdness_score"] < mean_weirdness - 3 * standard_deviation_weirdness:
    print("The weirdest record is an outlier")
else:
    print(f"The weirdest record is not an outlier, it would need an average weirdness score of {mean_weirdness + 3 * standard_deviation_weirdness} or { mean_weirdness - 3 * standard_deviation_weirdness} to be one.")

## Saving the data

After a lot of analysis work, it makes sense to save the data. We're going to save it as a JSON file, as that's the format we've been working with. 
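One thing worth knowing before reloading a saved file: JSON has no tuple type, so word pairs stored as Python tuples come back as lists. The sketch below shows the round trip on a made-up fragment, writing to a temporary directory rather than the real `data.json`.

```python
import json
import os
import tempfile

# Hypothetical fragment of the data, with a word pair stored as a tuple
data_to_save = {"pair_weirdness": [{"pair": ("the", "cat"), "count": 2}]}

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "data.json")
    with open(path, "w") as file:
        json.dump(data_to_save, file, indent=4)
    with open(path) as file:
        loaded = json.load(file)

reloaded_pair = loaded["pair_weirdness"][0]["pair"]
print(reloaded_pair)         # A list, because JSON has no tuple type
print(tuple(reloaded_pair))  # Convert back if tuples are needed, e.g. as dictionary keys
```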

In [None]:
# save the data to a file called data.json

with open("data.json", "w") as file:
    json.dump(data, file, indent=4)

## Conclusion

In this notebook, we've gone through the full process of:
1. From an inspiration, defining a metric to analyse our input data
2. Gathering the correct parts of our data
3. Calculating our metric for each record
4. Visualising the results
5. A little bit of statistical analysis
6. Saving the data

Or, what appeared to be a light-hearted look at descriptions was sneakily a full data analysis project (if a small one). Hopefully this notebook has given you a good idea of how to approach a similar problem, and you had fun along the way.