## Named Entity Recognition in the Trove Aboriginal Advocate

This data source is an XML dump from the NLA Trove archive of a single title, the Aboriginal Advocate. The data is one large XML file containing 3497 articles from this title. The goal here is to run named entity recognition (NER) over the documents to extract names of interest.

As with other notebooks in this project, we use the spaCy language processing library to extract names from the text. The first step is to define a reader for the XML data. This is done in the module [trovereader.py](trovereader.py), which is imported here.
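
The contents of `trovereader.py` are not shown in this notebook. As a rough sketch only, a reader of this kind could stream records with `xml.etree.ElementTree.iterparse`. The element names (`record`, `identifier`, `description`) and the function name `iter_records` are assumptions for illustration, not the Trove schema or the module's real API. The only detail taken from this notebook is that each record behaves like a dictionary mapping field names to lists of strings.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def iter_records(source, record_tag="record"):
    """Yield one dict per record element, mapping child tag -> list of texts.

    Hypothetical sketch: the tag names are assumptions, not the Trove schema.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        record = defaultdict(list)
        for child in elem:
            record[child.tag].append((child.text or "").strip())
        yield dict(record)
        elem.clear()  # drop the element's children once the record is emitted


# a tiny in-memory document standing in for the real XML file
sample = io.BytesIO(
    b"<records>"
    b"<record><identifier>doc-1</identifier><description>First text.</description></record>"
    b"<record><identifier>doc-2</identifier><description>Second text.</description></record>"
    b"</records>"
)
records = list(iter_records(sample))
print(records[0]["identifier"][0], "|", records[0]["description"][0])
```

Streaming with `iterparse` avoids loading the whole file into memory at once, which matters for a dump of several thousand articles. If the real file uses XML namespaces, the tags would carry a `{namespace}` prefix and the comparison would need adjusting.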

In [None]:
#!pip install -q -r requirements.txt

In [None]:
import spacy
import csv
import os
import geocoder
import pandas as pd
import networkx as nx
import trovereader
import matplotlib.pyplot as plt

In [None]:
# the source XML filename
xmlfile = "data/nla.obj-573721295_Aborginies_Advocate.xml"

In [None]:
# load the spaCy model we need (uncomment the download line on first use)
model = 'en_core_web_lg'
#spacy.cli.download(model)
nlp = spacy.load(model)

The next cell uses the Trove XML parser to read the separate document records in the XML file and run the NER system over them. The resulting entities are collected into a list of dictionaries, which is then converted to a Pandas DataFrame. We collect all entities that are found and, for each one, store a bit of context: the entity plus two tokens either side of it. Note that in the resulting table the `entity` column holds the entity type (for example `GPE`) and the `label` column holds the entity text.
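
The context window needs a little care at the edges of a document. With ordinary Python sequences, a negative start index counts from the end, so an entity within the first two tokens can produce an empty or misleading slice unless the start is clamped. The sketch below shows the clamping on a plain list of token strings; the function name `context_window` is hypothetical and spaCy is not involved.

```python
def context_window(tokens, start, end, width=2):
    """Return tokens[start:end] plus up to `width` tokens on each side."""
    left = max(start - width, 0)            # clamp so the index never goes negative
    right = min(end + width, len(tokens))   # clamp at the end of the document
    return " ".join(tokens[left:right])


tokens = ["Sydney", "is", "a", "city", "in", "New", "South", "Wales", "."]
print(context_window(tokens, 0, 1))   # entity "Sydney" at the very start
print(context_window(tokens, 5, 8))   # entity "New South Wales"
```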

Since finding entities can take some time, we write out the result to a CSV file.  We first check if the CSV file already exists and if it does, we read the entities from the file rather than recomputing them.  To force the NER process to run, set the variable `force` to `True` at the top of the cell.
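
The caching pattern described here can be separated from the NER step itself. The following is a minimal sketch using only the standard library rather than Pandas; `load_or_compute` and `fake_ner` are hypothetical names introduced for illustration.

```python
import csv
import os
import tempfile


def load_or_compute(path, compute, fieldnames, force=False):
    """Return cached rows from `path`, or call `compute()` and cache the result."""
    if not force and os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    rows = compute()
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows


calls = []

def fake_ner():
    calls.append(1)  # record each time the expensive step actually runs
    return [{"entity": "GPE", "label": "Sydney"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "entities.csv")
    fields = ["entity", "label"]
    first = load_or_compute(path, fake_ner, fields)              # computes and writes
    second = load_or_compute(path, fake_ner, fields)             # served from the file
    third = load_or_compute(path, fake_ner, fields, force=True)  # recomputes

print(len(calls))
```

One thing to keep in mind with any cache of this kind: values read back from a CSV file are strings (or, with Pandas, inferred types), so they may not be identical in type to the freshly computed values.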

In [None]:
csvname = "data/Aboriginies_Advocate_Entities.csv"
force = False
if not force and os.path.exists(csvname):
    entities = pd.read_csv(csvname)
else:

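    entities = []
    limit = 10000    # optional limit on how many documents we process

    for record in trovereader.trove_parser(xmlfile):
        text = record['description'][0]
        doc = nlp(text)
        for ent in doc.ents:
            # context is the entity plus up to two tokens either side;
            # clamp the start so it never becomes a negative index
            context = doc[max(ent.start - 2, 0):ent.end + 2]
            context = " ".join([w.text for w in context])
            # 'entity' holds the entity type, 'label' holds the entity text
            d = {'entity': ent.label_, 'label': ent.text, 'context': context, 'doc': record['identifier'][0]}
            entities.append(d)
        limit -= 1
        if limit <= 0:   # stop once exactly `limit` documents have been processed
            break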

    entities = pd.DataFrame(entities)
    entities.to_csv(csvname, index=False)
    
entities.head(20)

Having extracted the entities, we can now explore what we have found. Here we look at the locations (GPE) and organisations (ORG) and see what the 30 most frequent entities are in each case.
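
The ranking itself is just a filtered frequency count. As an illustration on made-up rows, here is the same idea using only the standard library; the function name `top_labels` and the sample data are hypothetical.

```python
from collections import Counter


def top_labels(rows, entity_type, n=30):
    """Count entity texts of one type and return the n most frequent."""
    counts = Counter(row["label"] for row in rows if row["entity"] == entity_type)
    return counts.most_common(n)


rows = [
    {"entity": "GPE", "label": "Sydney"},
    {"entity": "GPE", "label": "Perth"},
    {"entity": "GPE", "label": "Sydney"},
    {"entity": "ORG", "label": "Mission"},
]
print(top_labels(rows, "GPE", n=2))
```

With Pandas, `entities[entities.entity == "GPE"]['label'].value_counts().head(30)` should give the same ranking as a single column of counts.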

In [None]:
locations = entities[entities.entity == "GPE"]
locations.groupby('label').count().sort_values('entity', ascending=False).head(30)

In [None]:
print(locations.shape)
print(locations.groupby('label').count().shape)

In [None]:
orgs = entities[entities.entity == "ORG"]
orgs.groupby('label').count().sort_values('entity', ascending=False).head(30)

In [None]:
mention_graph = nx.DiGraph()
mention_graph = nx.from_pandas_edgelist(entities, source='doc', target='label', edge_attr=True, create_using=mention_graph)


In [None]:
mention_graph.number_of_nodes(), mention_graph.number_of_edges()

In [None]:
# build every ordered pair of distinct place names mentioned in the same document
result = []
for name, group in locations.groupby('doc'):
    for place1 in group.label:
        for place2 in group.label:
            if place1 != place2:
                result.append({'loc1': place1, 'loc2': place2, 'doc': name})

colloc = pd.DataFrame(result).sort_values(['loc1','loc2'])
colloc.head()

In [None]:
collocg = colloc.groupby(['loc1', 'loc2']).count()

In [None]:
df = pd.DataFrame({
    'loc1': collocg.index.get_level_values('loc1'), 
    'loc2': collocg.index.get_level_values('loc2'), 
    'count': collocg.doc,
    'weight': [1/c for c in collocg.doc]   # weight is inverse of count
})
df.index = range(df.shape[0])  # reset index to integers
df.head()

In [None]:
# select those collocations that occur more than 10 times
df10 = df[df['count'] > 10]

In [None]:
cg = nx.from_pandas_edgelist(df10, source='loc1', target='loc2', edge_attr=True, create_using=nx.DiGraph())

In [None]:
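# extract the subgraph made up of one node and its neighbours
node = "Perkins"
nodes = list(cg.neighbors(node))
nodes.append(node)
subg = cg.subgraph(nodes)
# number_of_nodes() counts nodes; size() would return the number of edges
print("Subgraph of nodes linked to", node, "contains", subg.number_of_nodes(), "nodes")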

In [None]:
plt.figure(figsize=(20,20))
pos = nx.kamada_kawai_layout(subg, weight="weight")
colours = ['g' if n == node else 'y' for n in subg.nodes]   # highlight the chosen node in green
ecol = 'grey'
nx.draw_networkx(subg, pos, edge_color=ecol, font_weight="bold", font_color='r', arrows=False, node_color=colours)

In [None]:

df10[df10.loc1==node].sort_values('count', ascending=False)

## Notes

- can this data be stored on GitHub - do we want it to be?
- should we look at how to create an Alveo resource from this collection?
- the full dataset has 3497 records; 100 records yield 6627 entities, so maybe 230k entities altogether. What do we do with the resulting entities?
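
The 230k figure in the last note is a linear extrapolation from the 100-record sample. A quick check of the arithmetic, assuming the sample is representative of the whole title:

```python
sample_records = 100
sample_entities = 6627
total_records = 3497

# scale the sample count up to the full collection (integer arithmetic)
estimate = sample_entities * total_records // sample_records
print(estimate)  # 231746
```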