## Named Entity Recognition in the Trove Aboriginal Advocate

This data source is a XML format dump from the NLA Trove archive of one title - the Aboriginal Advocate.  The data is in the form of a large XML file contianing 3497 articles from this title.  The goal here is to run a named entity recognition process over the documents to extract names of interest. 

As with other notebooks in this project we will use the SpaCy language processing library to extract names from the text.  The first step is to define a reader for the XML data, this is done in the module [trovereader.py](trovereader.py) which is then imported here. 

In [3]:
!pip install -q -r requirements.txt

In [4]:
import spacy
import csv
import geocoder
import pandas as pd
import trovereader

In [40]:
# the source XML filename
xmlfile = "data/nla.obj-573721295_Aborginies_Advocate.xml"

In [7]:
# download the spacy model we need
model = 'en_core_web_md'
spacy.cli.download(model)
nlp = spacy.load(model)


[93m    Linking successful[0m
    /opt/conda/lib/python3.6/site-packages/en_core_web_md -->
    /opt/conda/lib/python3.6/site-packages/spacy/data/en_core_web_md

    You can now load the model via spacy.load('en_core_web_md')



The next cell uses the trove XML parser to read the separate document records in the XML file and run the NER system over these. The resulting entities are collected into a list of dictionaries which is then converted to a Pandas DataFrame.   We collect all entities that are found and for each one store a bit of context - the entity plus two tokens either side of it.  

In [42]:
entities = []
limit = 100

for record in trovereader.trove_parser(xmlfile):
    text = record['description'][0]
    doc = nlp(text)
    for ent in doc.ents:
        context = doc[ent.start-2:ent.end+2]
        context = " ".join([w.text for w in context])
        d = {'entity': ent.label_, 'label': ent.text, 'context': context, 'doc': record['identifier'][0]}
        entities.append(d)
    limit -= 1
    if limit < 0:
        break
        
entities = pd.DataFrame(entities)
entities.head(20)

Unnamed: 0,context,doc,entity,label
0,( 1/ AUSTRALIA mSyoimm No,nla.obj-579161800,GPE,AUSTRALIA
1,No . 207 Registered as,nla.obj-579161800,CARDINAL,207
2,"Post . Sept 80 , 1918 . One",nla.obj-579161800,DATE,"Sept 80, 1918"
3,1918 . One Shilling per Annum,nla.obj-579161800,MONEY,One Shilling
4,,nla.obj-579161837,PERSON,Briei
5,. From Sim Who hath,nla.obj-579161837,GPE,Sim
6,"Me . Railton , Secretary",nla.obj-579161837,PERSON,Railton
7,"Secretary of the Mission , has",nla.obj-579161837,ORG,the Mission
8,"passed his 79th year , and",nla.obj-579161837,DATE,79th year
9,", after fourteen years service “",nla.obj-579161837,DATE,fourteen years


Having extracted the entities we can now explore what we have found. Here we look at the locations (GPE) and oganisations (ORG) and see what the most frequent 30 entities are in each case. 

In [43]:
locations = entities[entities.entity == "GPE"]
locations.groupby('label').count().sort_values('entity', ascending=False).head(30)

Unnamed: 0_level_0,context,doc,entity
label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Rev.,36,36,36
Perth,35,35,35
Cabbage Tree Island,27,27,27
Annandale,26,26,26
Sydney,26,26,26
Australia,22,22,22
Leederville,18,18,18
Sunday Island,18,18,18
Sevington,16,16,16
Christ,15,15,15


In [39]:
orgs = entities[entities.entity == "ORG"]
orgs.groupby('label').count().sort_values('entity', ascending=False).head(30)

Unnamed: 0_level_0,context,doc,entity
label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Treasurer,43,43,43
La Perouse,28,28,28
Church,22,22,22
Taree,20,20,20
Mission House,20,20,20
Children’s Home,17,17,17
Council,16,16,16
Madeley Wood,14,14,14
Petersham,14,14,14
Ensign Morgan,13,13,13


## Notes

- can this data be stored on Github - do we want it to be? 
- should we look at how to create an Alveo resource from this collection?
- full dataset has 3497 records, 100 records yields 6627 entities, so maybe 230k entities all together, what do we do with the resulting entities? 