In [1]:
# install all requirements quietly
!pip install -q -r requirements.txt

# Sample NER Workflow for DigiVol

Read data from a DigiVol CSV file and pass it through the spaCy NER pipeline.

In [2]:
import spacy
import csv
import geocoder
import pandas as pd
import utils

In [3]:
# download the spacy models we need
model = 'en_core_web_md'
spacy.cli.download(model)
nlp = spacy.load(model)


    Linking successful
    /opt/conda/lib/python3.6/site-packages/en_core_web_md -->
    /opt/conda/lib/python3.6/site-packages/spacy/data/en_core_web_md

    You can now load the model via spacy.load('en_core_web_md')



We first read the data using our utility function.  This gives us a data frame with the text in one column.

In [4]:
texts = utils.read_digivol_csv('data/Project-1536729-DwC.csv')
texts.head()

Unnamed: 0,text
0,Purchased this Diary for which I paid 10/ and ...
1,Went to the Gaol with Requisition for Altering...
2,Paid Cabman 10 shillings for cab to Benevolent...
3,Rather seedy first thing this morning but cont...
4,“Seedy again”. This will not do. I must turn o...


## NER

We now perform NER on the text using the spaCy library.  For each document we generate a list of location entities (those labelled `GPE`) and, for each entity, record a snippet of text around the occurrence.  The result is a DataFrame containing the placename, the context and the document number, which is really the row number in the original spreadsheet.

In [5]:
places = []

for i, t in texts.iterrows():
    text = t['text']
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == "GPE":  # compare by value; `is` tests object identity
            # clamp the start so an entity near the beginning does not produce a negative index
            context = doc[max(ent.start - 2, 0):ent.end + 2]
            context = " ".join([w.text for w in context])
            d = {'placename': ent.text, 'context': context, 'doc': i}
            places.append(d)
locations = pd.DataFrame(places)
locations

Unnamed: 0,context,doc,placename
0,"sister at Geelong , enclosing",0,Geelong
1,him to St Kilda .,1,St Kilda
2,went to St Kilda . Mr,1,St Kilda
3,"steered for Melbourne , leaving",1,Melbourne
4,correct . \r\n\r\n Made up,3,\r\n\r\n
5,"Newby at Richmond , and",3,Richmond
6,houses . \r\n\r\n Harrison walked,4,\r\n\r\n
7,closeness of Atmosphere and startling,4,Atmosphere
8,Invited to St Kilda but refused,6,St Kilda
9,General of Sydney . I,7,Sydney
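
Rows 4 and 6 above show that the model sometimes labels a run of line breaks as a place. One possible clean-up, sketched here on the plain `places` list before it becomes a DataFrame, is to drop any record whose placename contains no letters and to collapse whitespace in the context. The function name `clean_places` is hypothetical and not part of `utils`; this filter would not catch a false positive such as "Atmosphere", which needs a gazetteer or manual review.

```python
def clean_places(places):
    """Drop records whose placename has no letters and tidy the context.

    `places` is a list of dicts with 'placename', 'context' and 'doc' keys,
    as built in the NER loop. This is an illustrative helper, not part of utils.
    """
    cleaned = []
    for record in places:
        name = record['placename'].strip()
        # Skip entities made only of whitespace or punctuation, e.g. '\r\n\r\n'
        if not any(ch.isalpha() for ch in name):
            continue
        # Collapse line breaks and repeated spaces in the snippet
        context = " ".join(record['context'].split())
        cleaned.append({**record, 'placename': name, 'context': context})
    return cleaned


# Two records copied from the table above
sample = [
    {'placename': 'Geelong', 'context': 'sister at Geelong , enclosing', 'doc': 0},
    {'placename': '\r\n\r\n', 'context': 'correct . \r\n\r\n Made up', 'doc': 3},
]
cleaned = clean_places(sample)
```

Applying such a filter before `pd.DataFrame(places)` would keep the spurious rows out of the later geocoding step.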


## Visualisation

spaCy can visualise the NER results directly in the notebook.  This is of limited practical use here, but it illustrates what is possible.

In [6]:
from spacy import displacy
from IPython.display import display, HTML

doc = nlp(texts['text'][0])
# jupyter=False makes render() return the markup instead of displaying it itself
display(HTML(displacy.render(doc, style='ent', jupyter=False)))

## Geocoding

We can use the `geocoder` module to submit these place names to a geocoding service.  Here we use the GeoNames service and make a new table with the results.

In [7]:
locations = utils.geolocate_locations(locations)
locations

Unnamed: 0,context,doc,placename,address,country,lat,lng
0,"sister at Geelong , enclosing",0,Geelong,Geelong,Australia,-38.14711,144.36069
1,him to St Kilda .,1,St Kilda,St Kilda,Australia,-37.8676,144.98099
2,went to St Kilda . Mr,1,St Kilda,St Kilda,Australia,-37.8676,144.98099
3,"steered for Melbourne , leaving",1,Melbourne,Melbourne,Australia,-37.814,144.96332
4,correct . \r\n\r\n Made up,3,\r\n\r\n,Sydney,Australia,-33.86785,151.20732
5,"Newby at Richmond , and",3,Richmond,Richmond,Australia,-20.56967,142.91384
6,houses . \r\n\r\n Harrison walked,4,\r\n\r\n,Sydney,Australia,-33.86785,151.20732
7,closeness of Atmosphere and startling,4,Atmosphere,Atmosphere Kanifushi Maldives,Maldives,5.36435,73.3345
8,Invited to St Kilda but refused,6,St Kilda,St Kilda,Australia,-37.8676,144.98099
9,General of Sydney . I,7,Sydney,Sydney,Australia,-33.86785,151.20732
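
Several place names repeat in this table (St Kilda appears three times), so each distinct name only needs to be sent to the service once. The sketch below shows one way that could be done; it is not the implementation of `utils.geolocate_locations`, which is not shown here. The names `geocode_unique` and `geonames_lookup` are hypothetical, and the username placeholder is an assumption: the GeoNames provider in `geocoder` expects a registered GeoNames username passed as `key`.

```python
def geocode_unique(names, lookup):
    """Call `lookup` once per distinct name and return {name: result}."""
    cache = {}
    for name in names:
        if name not in cache:
            cache[name] = lookup(name)
    return cache


def geonames_lookup(name, username='YOUR_GEONAMES_USERNAME'):
    """Hypothetical lookup via geocoder's GeoNames provider (needs network access)."""
    import geocoder  # imported here so the caching logic can run without it
    g = geocoder.geonames(name, key=username)
    return {'address': g.address, 'country': g.country, 'lat': g.lat, 'lng': g.lng}


# Offline demonstration with a stand-in lookup that records each call it receives
calls = []

def fake_lookup(name):
    calls.append(name)
    return {'address': name}

results = geocode_unique(['St Kilda', 'Geelong', 'St Kilda'], fake_lookup)
```

With a real username, `geocode_unique(locations['placename'], geonames_lookup)` would return one result per distinct place name, which could then be mapped back onto the rows. Caching does not fix ambiguous names: the Richmond row above resolves to latitude -20.57, far from the other Victorian places, so results still need checking.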
