# A couple of brief notes on how archives store information
Archives are arranged in a really interesting way - they're roughly hierarchical trees (with some more graph-like elements when you get close to the leaf nodes). Each node contains multiple free-text notes fields which describe the common features of its children (this applies even at leaf node level - an _item_ in the hierarchy is usually a box containing multiple _pieces_ of physical material, i.e. multiple letters or multiple notebook pages). The tree doesn't necessarily describe the arrangement of nested items in physical space; it's more like an informational hierarchy where sections, series and items are gathered conceptually.

This sounds great in theory but, as we saw in the last notebook, the tree-based structure makes it difficult to jump quickly from a leaf node in one branch of the hierarchy to another (conceptually related) leaf in a different branch. Making the jump between the nodes would involve traversing up the tree to (at least) the point where the branches split, and then travelling all the way back down to the related leaf. That journey also requires perfect knowledge of the tree's structure and an understanding of the complete context. While archivists do a good job arranging items into conceptual hierarchies, information is an incredibly messy medium to work with and its full complexity can't always be encapsulated by the tree structure. Archivists therefore often default to the items' _provenance_ as a way of overcoming the problem while conforming to the established guidelines. _Provenance_ refers to the way that the archive material was arranged by its original owner. In this way, archives preserve the owner's usage (a good and valuable thing), but that benefit comes at a cost to users trying to discover information or make connections within/across archives.
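
To make that journey concrete, here's a minimal sketch of the up-and-down traversal on a toy hierarchy. The node names and the `parents` dictionary are made up for illustration and don't reflect the real catalogue's schema; the point is just that getting between two leaves means climbing to the shared ancestor and descending again.

```python
# Hypothetical toy hierarchy: each node maps to its parent (None marks the root).
parents = {
    "collection": None,
    "section_a": "collection",
    "section_b": "collection",
    "series_a1": "section_a",
    "series_b1": "section_b",
    "item_a1_1": "series_a1",
    "item_b1_1": "series_b1",
}


def path_to_root(node, parents):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path


def path_between(start, end, parents):
    """Walk up from `start` to where the branches split, then down to `end`."""
    up = path_to_root(start, parents)
    down = path_to_root(end, parents)
    ancestors_of_end = set(down)
    # The first ancestor of `start` that is also an ancestor of `end`
    # is the point where the two branches meet.
    split = next(node for node in up if node in ancestors_of_end)
    climb = up[: up.index(split) + 1]
    descent = down[: down.index(split)][::-1]
    return climb + descent


route = path_between("item_a1_1", "item_b1_1", parents)
print(" -> ".join(route))
```

Even in this tiny tree, two leaves which might be conceptually adjacent sit six hops apart, and the route can only be found by someone (or something) that already knows the whole structure.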

# Wikipedia linking
When arriving at a leaf node, it can be hard to find its context within the collection without full tree traversal. 

Fun idea - wouldn't it be nice to take any arbitrary record from the archive and automatically annotate it with links to relevant wikipedia articles? In other words, can we make our archive records look and feel more like wikipedia pages?  
Wikipedia's strength is in its internal links - a typical page contains dozens of links to other contextually related pages, allowing users to traverse an endless warren of information without getting stuck at leaf nodes. It's a great established model and it _works_, so why not emulate it?   
Wellcome has great links with wikipedia/wikimedia/wikidata, and a lot of our digitised material has already ended up on their platform(s). Why not use the incredible graph that is wikidata to enhance our archives, and in turn enhance wikidata when the archive data eventually makes its way there?  
Ideally we'll be able to intelligently link all of our own archive data to itself eventually, but we can make a quick, cheap start by tying our records to wikipedia first.

### Loading data
As usual, we'll start by importing a few useful packages for manipulating and displaying the data, and load in the archive data itself.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
from string import punctuation
import spacy
import re
from IPython.display import display, HTML

In [2]:
df = pd.read_json('data/calm_records.json')

### Abortion Law Reform Association
An interesting example to start with.

In [3]:
record = df.loc[269057]['AdminHistory'][0]
record[:1000]

"<p>The Abortion Law Reform Association (ALRA) was founded in 1935 for the legalisation of abortion in certain circumstances.  This was achieved by the 1967 Abortion Act: the Association continues to combat attempts to restrict the availability of legal abortions and to ensure that the intentions of the Act are being carried out.\n<p>An outline chronology follows, giving significant dates in the history of abortion law in the United Kingdom and of the ALRA:\n<p>1803                         Lord Ellenborough's Act criminalises the procurement of abortion, previously permissable up to 'quickening'\n<p>1861\t\tLegislation pertaining to abortion consolidated in the Offences Against the Person Act\n<p>1929\t\tInfant Life (Preservation) Act provides for abortion when carrying the pregnancy to term would be fatal\n<p>1931-1932   \tJustice McCardie's remarks during abortion cases at Leeds Assizes receive widespread press attention\n<p>1934\t\tCooperative Women's Guild pass a resolution in favo

This record contains a load of ugly hard-coded HTML - let's parse that and turn it into more readable plaintext.

In [4]:
soup = BeautifulSoup(record, 'html.parser')
plain_text = soup.get_text()

print(plain_text)

The Abortion Law Reform Association (ALRA) was founded in 1935 for the legalisation of abortion in certain circumstances.  This was achieved by the 1967 Abortion Act: the Association continues to combat attempts to restrict the availability of legal abortions and to ensure that the intentions of the Act are being carried out.
An outline chronology follows, giving significant dates in the history of abortion law in the United Kingdom and of the ALRA:
1803                         Lord Ellenborough's Act criminalises the procurement of abortion, previously permissable up to 'quickening'
1861		Legislation pertaining to abortion consolidated in the Offences Against the Person Act
1929		Infant Life (Preservation) Act provides for abortion when carrying the pregnancy to term would be fatal
1931-1932   	Justice McCardie's remarks during abortion cases at Leeds Assizes receive widespread press attention
1934		Cooperative Women's Guild pass a resolution in favour of legalisation of abortion at t
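
The plaintext still carries the tabs and padded spaces that were used to lay out the chronology. If that matters for display, one option is a small whitespace tidy-up like the sketch below. The `tidy_whitespace` helper is hypothetical (it isn't part of the pipeline in this notebook), and the sample is just two lines copied from the output above.

```python
import re

# Two lines copied from the record above, with their original tabs and spaces.
sample = (
    "1861\t\tLegislation pertaining to abortion consolidated in the "
    "Offences Against the Person Act\n"
    "1931-1932   \tJustice McCardie's remarks during abortion cases at "
    "Leeds Assizes receive widespread press attention"
)


def tidy_whitespace(text):
    """Collapse runs of spaces and tabs within each line, keeping line breaks."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    # Drop any lines which end up empty after stripping.
    return "\n".join(line for line in lines if line)


tidied = tidy_whitespace(sample)
print(tidied)
```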

`Spacy` is a nice natural language processing (NLP) library which rapidly adds tonnes of metadata to a document. Each word is automatically tagged with a part-of-speech tag, a word vector etc. Without the user having to do anything at all, spacy will do most of the usual pre-processing required for typical NLP tasks.

In [5]:
# The 'en' shortcut works in spaCy v2; v3+ needs a full model name, e.g. 'en_core_web_sm'
nlp = spacy.load('en')
doc = nlp(plain_text)

That's all we need to do...

We can now use spacy's named entity recognition (NER) to identify _named entities_, like people, places, or organisations. To a decent approximation, these named entities are the words which wikipedia chooses to give more context to with a link. By identifying the relevant named entities (excluding certain types, see more documentation of entity types [here](https://spacy.io/usage/linguistic-features#entity-types)), we can do a tiny amount of string manipulation and return a neat wikipedia search string.

In [6]:
ent_types = ['PERSON', 
             'NORP', 
             'FACILITY', 
             'ORG', 
             'GPE', 
             'LOC', 
             'PRODUCT', 
             'EVENT', 
             'WORK_OF_ART', 
             'LAW', 
             'LANGUAGE']

In [7]:
for ent in doc.ents:
    if ent.label_ in ent_types and len(ent.text.split()) > 1:
        words = ent.text.lower().split()
        words = [word.replace("'s", '') for word in words]
        words = [word.translate(str.maketrans('', '', punctuation)) 
                 for word in words]

        print('https://en.wikipedia.org/w/index.php?search=' + 
              '+'.join(words))

https://en.wikipedia.org/w/index.php?search=the+abortion+law+reform+association
https://en.wikipedia.org/w/index.php?search=the+1967+abortion+act
https://en.wikipedia.org/w/index.php?search=the+united+kingdom
https://en.wikipedia.org/w/index.php?search=the+offences+against+the+person+act
https://en.wikipedia.org/w/index.php?search=infant+life+preservation+act
https://en.wikipedia.org/w/index.php?search=justice+mccardie
https://en.wikipedia.org/w/index.php?search=cooperative+women+guild
https://en.wikipedia.org/w/index.php?search=meeting+1934
https://en.wikipedia.org/w/index.php?search=british+medical+association+committee
https://en.wikipedia.org/w/index.php?search=the+medical+aspects+of+abortion
https://en.wikipedia.org/w/index.php?search=stella+browne
https://en.wikipedia.org/w/index.php?search=harry+roberts
https://en.wikipedia.org/w/index.php?search=a+ludovici
https://en.wikipedia.org/w/index.php?search=foundation+of+abortion+law+reform+association
https://en.wikipedia.org/w/index.
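
The string manipulation above could also be wrapped in a small helper, which makes it easier to reuse and to check on individual strings. This is only a sketch: `wikipedia_search_url` and `SEARCH_BASE` are names invented here, and swapping the manual `'+'.join` for `urllib.parse.urlencode` is an assumption about what we'd want (it additionally percent-encodes accented or otherwise unsafe characters).

```python
from string import punctuation
from urllib.parse import urlencode

SEARCH_BASE = 'https://en.wikipedia.org/w/index.php'


def wikipedia_search_url(text):
    """Build a Wikipedia search URL from a raw entity string."""
    words = text.lower().split()
    # Drop possessive endings before stripping the remaining punctuation.
    words = [word.replace("'s", '') for word in words]
    words = [word.translate(str.maketrans('', '', punctuation))
             for word in words]
    # Skip any words which were nothing but punctuation.
    query = ' '.join(word for word in words if word)
    # urlencode turns spaces into '+' and percent-encodes anything unsafe.
    return SEARCH_BASE + '?' + urlencode({'search': query})


print(wikipedia_search_url("Cooperative Women's Guild"))
print(wikipedia_search_url('Infant Life (Preservation) Act'))
```

For these two entities the helper reproduces the corresponding URLs in the output above.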

Wikipedia's search URLs are great. If wikipedia spots that the search returns a unique result, the user is seamlessly redirected to that result's page. If multiple results are close to the search string, a disambiguation page is returned. If the search is rubbish, the raw search page is returned with a typical list of search results. Try a few of the links above and see which kinds of search work better than others.  
We'll now follow the same process as above and dump each link (wrapped with a little HTML) into a dictionary, keyed by their original plaintext strings.

In [8]:
links = {}

for ent in doc.ents:
    if ent.label_ in ent_types and len(ent.text.split()) > 1:
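        words = ent.text.lower().split()
        # Drop possessives first, as in the previous cell, so the URLs match
        words = [word.replace("'s", '') for word in words]
        words = [word.translate(str.maketrans('', '', punctuation)) 
                 for word in words]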
        url = ('https://en.wikipedia.org/w/index.php?search=' + 
                '+'.join(words))
        link = '<a href="{}">{}</a>'.format(url, ent.text.strip())
        links[ent.text.strip()] = link

We can now perform a super basic regex replacement. We're looking for the original plaintext strings which were recognised as relevant named entities, and replacing them with HTML links to the wikipedia searches. If the search is decent, the reader will be pointed straight to a page of additional contextual information, deepening their understanding of the subject matter much faster than the arduous archive traversal process described above.

In [9]:
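# Escape each key so '(' etc. match literally, and try longer keys first
keys = sorted(links, key=len, reverse=True)
pattern = re.compile(r'\b(' + '|'.join(re.escape(key) for key in keys) + r')\b')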
result = pattern.sub(lambda x: links[x.group()], str(soup))

display(HTML(result))
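
Because the pattern is built from arbitrary entity strings, it's worth checking how the replacement behaves on a small input where we know what should come out. The sketch below uses a made-up `toy_links` dictionary with placeholder `href` values and a made-up sentence, and a hypothetical `link_entities` helper. It assumes we want keys containing brackets to be matched literally and longer keys to take priority over shorter ones sharing the same start.

```python
import re

# Hypothetical miniature stand-in for a dictionary of entity -> HTML link.
toy_links = {
    'Abortion Law': '<a href="#law">Abortion Law</a>',
    'Abortion Law Reform Association':
        '<a href="#alra">Abortion Law Reform Association</a>',
    'Infant Life (Preservation) Act':
        '<a href="#ilpa">Infant Life (Preservation) Act</a>',
}


def link_entities(text, links):
    """Replace each key of `links` found in `text` with its HTML link."""
    if not links:
        return text
    # Longest keys first, so 'Abortion Law' can't pre-empt the longer
    # 'Abortion Law Reform Association' starting at the same position.
    keys = sorted(links, key=len, reverse=True)
    # re.escape stops characters like '(' and ')' being read as regex syntax.
    alternation = '|'.join(re.escape(key) for key in keys)
    # Lookarounds act like word boundaries, but also work for keys which
    # begin or end with a non-word character such as ')'.
    pattern = re.compile(r'(?<!\w)(' + alternation + r')(?!\w)')
    return pattern.sub(lambda match: links[match.group()], text)


toy_text = ('The Abortion Law Reform Association campaigned after '
            'the Infant Life (Preservation) Act.')
linked = link_entities(toy_text, toy_links)
print(linked)
```

Each entity is wrapped exactly once: `re.sub` doesn't rescan the text it has just inserted, so the shorter `Abortion Law` key isn't applied again inside the longer link.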

Admittedly this process is still far from perfect, and the flaws start to appear as soon as you click more than a few basic links... Spacy's NER algorithm is okay at best, and the wikipedia search only works well when it's provided with a clean, unambiguous string. There's a lot to improve here, and we can be much smarter about the way we make use of wikidata (see notebooks to follow), but for the work of an hour on a Monday afternoon, this isn't bad...