# Archive data
The Wellcome archive sits in a collections management system called CALM, which follows a rough set of standards and guidelines for storing archival records called [ISAD(G)](https://en.wikipedia.org/wiki/ISAD(G)). The archive is composed of _collections_, each of which has a hierarchical set of series, sections, subjects, items and pieces sitting underneath it.  
In the following notebooks I'm going to explore it and try to make as much sense of it as I can programmatically.

Let's start by loading in a few useful packages and defining some nice utils.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
plt.rcParams['figure.figsize'] = (20, 20)

import pandas as pd
import numpy as np
import networkx as nx

from umap import UMAP
# tqdm_notebook is deprecated; tqdm.notebook.tqdm is the current notebook progress bar
from tqdm.notebook import tqdm

In [None]:
def flatten(nested_list):
    # collapse a list of lists into one flat list (avoids shadowing the built-in `list`)
    return [item for sublist in nested_list for item in sublist]

def cartesian(*arrays):
    return np.array([x.reshape(-1) for x in np.meshgrid(*arrays)]).T

def clean_subject(subject):
    return subject.strip().lower().replace('<p>', '')

Let's load up our CALM data (exported in its entirety as a single `.json` file, where each line is a record).

In [None]:
# each line of the export is one record, so read it as JSON Lines
df = pd.read_json('data/calm_records.json', lines=True)
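
If the export really is one JSON object per line (the JSON Lines layout), `pd.read_json` needs `lines=True` to parse it. Here's a minimal sketch on an in-memory stand-in. The two records and the `Title` field are made up for illustration; only the `Subject` column name comes from the real data.

```python
import io
import pandas as pd

# Two hypothetical records, one JSON object per line (JSON Lines).
toy_export = io.StringIO(
    '{"Title": "Record A", "Subject": ["Heart", "Cardiology"]}\n'
    '{"Title": "Record B", "Subject": null}\n'
)

# lines=True tells pandas to parse each line as a separate record.
toy_df = pd.read_json(toy_export, lines=True)

print(len(toy_df))                  # number of records read
print(toy_df['Subject'].dropna())   # only the records with subjects filled in
```

Records with no subjects come through as missing values, which is why `.dropna()` is useful on that column.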

In [None]:
len(df)

In [None]:
df.astype(str).describe()

# Exploring individual columns
At the moment I have no idea what kind of information CALM contains - let's look at the list of column names.

In [None]:
list(df)

Here I'm looking through a sample of values in each column, choosing the columns to explore based on their headings, a bit of contextual info from colleagues and the `df.describe()` above.

In [None]:
df['Subject']

# After much trial and error...
Subjects look like an interesting avenue to explore further. Where subjects have _actually_ been filled in and the entry is not `None`, a list of subjects is returned.  
We can explore some of the subtleties of these subjects by creating an adjacency matrix. We'll count the number of times each subject appears alongside every other subject and return a big $n \times n$ matrix, where $n$ is the total number of unique subjects.  
We can use this adjacency matrix for all sorts of stuff, but we have to build it first. To start, let's get a unique list of all subjects. This involves unpacking each sub-list and flattening them out into one long list, before finding the unique elements. We'll then use the `clean_subject` function defined above to get rid of any irregularities which might become annoying later on.

In [None]:
subjects = list(set(flatten(df['Subject'].dropna().tolist())))
clean_subjects = list(map(clean_subject, subjects))
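
One subtlety worth checking: the cell above removes duplicates before cleaning, so two raw spellings that clean to the same string stay as separate entries. Here's a small sketch on made-up rows (the helpers are re-declared so the snippet runs on its own):

```python
def flatten(nested_list):
    return [item for sublist in nested_list for item in sublist]

def clean_subject(subject):
    return subject.strip().lower().replace('<p>', '')

# Hypothetical rows containing three spellings that differ only in case, whitespace or markup.
toy_rows = [['Heart', 'Cardiology'], ['heart ', 'Disease'], ['<p>Disease']]

# Deduplicate the raw strings, as the cell above does.
raw_unique = sorted(set(flatten(toy_rows)))

# Clean first and deduplicate afterwards, for comparison.
cleaned_unique = sorted(set(clean_subject(s) for s in flatten(toy_rows)))

print(len(raw_unique), raw_unique)
print(len(cleaned_unique), cleaned_unique)
```

On this toy input the raw list has five entries while the cleaned one has three, so the number of rows in the matrix depends on which order is used.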

At this point it's often helpful to _index_ our data, i.e. transform words into numbers. We'll create two dictionaries which map back and forth between the subjects and their corresponding indices:

In [None]:
index_to_subject = {index: subject for index, subject in enumerate(clean_subjects)}
subject_to_index = {subject: index for index, subject in enumerate(subjects)}
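
A toy round trip through the two mappings, using made-up subjects, shows how they fit together: `subject_to_index` is keyed by the raw string as it appears in the data, while `index_to_subject` returns the cleaned label at the same position.

```python
# Hypothetical raw subjects with stray whitespace and markup.
toy_subjects = ['Disease', ' Heart', '<p>Cardiology']
toy_clean = [s.strip().lower().replace('<p>', '') for s in toy_subjects]

# Same construction as above: cleaned labels on the way out, raw strings on the way in.
index_to_subject = {index: subject for index, subject in enumerate(toy_clean)}
subject_to_index = {subject: index for index, subject in enumerate(toy_subjects)}

index = subject_to_index[' Heart']      # look up by the raw string
print(index, index_to_subject[index])   # read back the cleaned label
```

This only works because both lists are in the same order, so position `i` in one always refers to the same subject as position `i` in the other.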

Let's instantiate a numpy array which we'll then fill with our co-occurrence data. Each column and each row will represent a subject - each cell (the intersection of a column and row) will therefore represent the 'strength' of the interaction between those subjects. As we haven't seen any interactions yet, we'll set every array element to 0.

In [None]:
# np.zeros guarantees every cell starts at 0; np.empty would leave arbitrary values in memory
adjacency = np.zeros((len(subjects), len(subjects)), 
                     dtype=np.uint16)

To populate the matrix, we want to find every possible pair of subjects in each sub-list from our original column, i.e. if we had the subjects

`[Disease, Heart, Heart Diseases, Cardiology]`

we would want to return 

```
[['Disease', 'Disease'],
 ['Heart', 'Disease'],
 ['Heart Diseases', 'Disease'],
 ['Cardiology', 'Disease'],
 ['Disease', 'Heart'],
 ['Heart', 'Heart'],
 ['Heart Diseases', 'Heart'],
 ['Cardiology', 'Heart'],
 ['Disease', 'Heart Diseases'],
 ['Heart', 'Heart Diseases'],
 ['Heart Diseases', 'Heart Diseases'],
 ['Cardiology', 'Heart Diseases'],
 ['Disease', 'Cardiology'],
 ['Heart', 'Cardiology'],
 ['Heart Diseases', 'Cardiology'],
 ['Cardiology', 'Cardiology']]
```

The `cartesian()` function which I've defined above will do that for us. We then find the appropriate intersection in the matrix and add another unit of 'strength' to it.  
We'll do this for every row of subjects in the `['Subject']` column.

In [None]:
for row_of_subjects in tqdm(df['Subject'].dropna()):
    for subject_pair in cartesian(row_of_subjects, row_of_subjects):
        subject_index_1 = subject_to_index[subject_pair[0]]
        subject_index_2 = subject_to_index[subject_pair[1]]

        adjacency[subject_index_1, subject_index_2] += 1
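
To see what the loop produces, here is the same counting on three made-up rows. This is a sketch rather than the notebook's exact code: `itertools.product` stands in for `cartesian()`, and the `toy_` names are hypothetical.

```python
from itertools import product
import numpy as np

# Three hypothetical records, each with its own list of subjects.
toy_rows = [['Disease', 'Heart'], ['Heart', 'Cardiology'], ['Heart']]

# Index the unique subjects in a fixed (sorted) order.
toy_subjects = sorted({s for row in toy_rows for s in row})
toy_index = {s: i for i, s in enumerate(toy_subjects)}

toy_adjacency = np.zeros((len(toy_subjects), len(toy_subjects)), dtype=np.uint16)

# product(row, repeat=2) yields every ordered pair, including each subject with itself.
for row in toy_rows:
    for a, b in product(row, repeat=2):
        toy_adjacency[toy_index[a], toy_index[b]] += 1

print(toy_subjects)
print(toy_adjacency)
```

Two things follow from counting every ordered pair: the matrix is symmetric, and the diagonal holds the number of records each subject appears in (assuming no subject is listed twice in the same record).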

We can do all sorts of fun stuff now - an adjacency matrix is one of the standard ways of representing a graph, so most of graph theory is open to us. Because it's a bit more interesting, I'm going to start with some dimensionality reduction.
Using [UMAP](https://github.com/lmcinnes/umap), we can squash the $n \times n$ matrix down into an $n \times m$ one, where $m$ is some arbitrary integer. Setting $m$ to 2 will allow us to plot each subject as a point on a two dimensional plane. UMAP will try to preserve the 'distances' between subjects - in this case, that means that subjects with similar co-occurrence patterns should end up clustered together, and dissimilar subjects should move apart.

In [None]:
embedding_2d = pd.DataFrame(UMAP(n_components=2)
                            .fit_transform(adjacency))

embedding_2d.plot.scatter(x=0, y=1);
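
To get a feel for what 'distance between subjects' means here: each row of the adjacency matrix is one subject's co-occurrence profile, and UMAP's default `metric='euclidean'` compares those rows. Here's a sketch on a made-up $4 \times 4$ matrix, not the real data:

```python
import numpy as np

# Hypothetical co-occurrence counts: subjects 0 and 1 share partners, as do 2 and 3.
# Cast to float first, since subtracting unsigned integers would wrap around.
toy_adjacency = np.array([[3, 2, 0, 0],
                          [2, 3, 0, 0],
                          [0, 0, 4, 1],
                          [0, 0, 1, 2]], dtype=float)

# Pairwise Euclidean distance between every pair of rows, via broadcasting.
differences = toy_adjacency[:, None, :] - toy_adjacency[None, :, :]
distances = np.linalg.norm(differences, axis=-1)

print(distances.round(2))
```

Subjects 0 and 1 sit much closer to each other than either does to subject 2, which is the kind of structure the embedding tries to keep.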

We can isolate the clusters we've found above using a number of different methods - `scikit-learn` provides easy access to some very powerful algorithms. Here I'll use a technique called _agglomerative clustering_, and make a guess that 15 is an appropriate number of clusters to look for.

In [None]:
from sklearn.cluster import AgglomerativeClustering

n_clusters = 15
embedding_2d['labels'] = AgglomerativeClustering(n_clusters).fit_predict(embedding_2d.values)
embedding_2d.plot.scatter(x=0, y=1, c='labels', cmap='Paired');
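
As a sanity check on what the clustering step returns, here's a minimal sketch on six made-up 2D points forming two obvious groups. The points are hypothetical and the call uses scikit-learn's default linkage.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two hypothetical, well-separated blobs of three points each.
toy_points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                       [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# fit_predict returns one integer label per point; the label numbers themselves are arbitrary.
toy_labels = AgglomerativeClustering(n_clusters=2).fit_predict(toy_points)

print(toy_labels)
```

The algorithm always returns exactly `n_clusters` groups whether or not the data has that many, so the choice of 15 above is worth revisiting by eye against the scatter plot.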

We can now use the `index_to_subject` mapping that we created earlier to examine which subjects have been grouped together into clusters.

In [None]:
for i in range(n_clusters):
    print(str(i) + ' ' + '-'*80 + '\n')
    print(np.sort([clean_subject(index_to_subject[index])
                    for index in embedding_2d[embedding_2d['labels'] == i].index.values]))
    print('\n')
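
The lookup in the loop above relies on each row position in `embedding_2d` matching the subject index used to build the matrix. The same grouping can be sketched on made-up labels and subjects (all `toy_` names are hypothetical):

```python
from collections import defaultdict

# Hypothetical index -> subject mapping and one cluster label per index.
toy_index_to_subject = {0: 'cardiology', 1: 'heart', 2: 'midwifery', 3: 'obstetrics'}
toy_labels = [0, 0, 1, 1]

# Collect the subjects belonging to each cluster label.
clusters = defaultdict(list)
for index, label in enumerate(toy_labels):
    clusters[label].append(toy_index_to_subject[index])

for label in sorted(clusters):
    print(label, sorted(clusters[label]))
```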

Interesting results!