## Introduction

This notebook illustrates how we can use named entity recognition (NER) to search for place names in a corpus and use them to enhance a gazetteer. It uses two datasets to illustrate the concepts.

1) [Geograph](https://geograph.org.uk) 
This site invites users to take pictures in the UK and add descriptions. It has almost 7 million pictures, and the data are licensed under a CC BY-SA licence, which makes them available for research as long as we credit the contributing users and allow others access to any data we derive from them.

2) [Ordnance Survey](https://ordnancesurvey.co.uk/) 50k gazetteer
This gazetteer was published under a UK Open Government Licence and contains all place names found on 1:50k maps in the UK. It is a legacy product (i.e. no longer used or updated), but it is suitable for our purposes.

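We are going to look for names found in the Geograph data that don't exist in the gazetteer. Because many names occur in more than one place, we compare names locally, that is, only against the gazetteer entries in the grid cell where the document is located. This increases the chance that a name we flag is genuinely missing at that location.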

**The first block of our code reads in data and builds a simple spatial index for the gazetteer. We only need to do this once.**

In [None]:
import OSGridConverter #To convert between WGS84 latitude/longitude and OSGB36 grid coordinates
import pandas as pd #To use pandas for elegant data handling
import spacy #Our NLP tools
import matplotlib.pyplot as plt #To plot results

#Load a language model to do NLP
nlp = spacy.load("en_core_web_md")

In [None]:
#First we read in the geograph data
geograph = pd.read_csv('./data/geograph_mini_corpus.csv', encoding='latin-1')
print(len(geograph))
geograph.head()

sample = geograph

This block demonstrates the NLP results for a single Geograph document. We use the entities later as potential toponyms.

Here we **draw a sample of documents** from the Geograph data and perform NER on those data. 

The sample can either be random, or we can define a search term and extract all records containing that term.

We can rerun this block to build a new sample. The size of this sample can also be changed.
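
As a minimal sketch of the two sampling options, the following uses a tiny made-up DataFrame in place of the Geograph data. The column names `id` and `text` follow the ones used elsewhere in this notebook; the helper `draw_sample` and the toy rows are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the Geograph corpus (hypothetical rows)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["Path up Ben Nevis", "Loch Linnhe at dusk", "Ben Lomond summit", "Old bridge"],
})

def draw_sample(df, n=None, term=None, seed=None):
    """Return the rows containing `term`, or n random rows if no term is given."""
    if term is not None:
        # regex=False treats the term as a literal string; na=False skips missing texts
        mask = df["text"].str.contains(term, case=False, na=False, regex=False)
        return df[mask]
    return df.sample(n=n, random_state=seed)

random_sample = draw_sample(toy, n=2, seed=42)
term_sample = draw_sample(toy, term="ben")
print(term_sample["id"].tolist())
```

Passing a fixed `seed` makes a random sample repeatable; leaving it as `None` gives a fresh sample on each run.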

In [None]:
# Build a spatial index of our documents

#First we need to reproject the data so that we have them in the correct projection
for i in sample.index:
    try:
        g = OSGridConverter.latlong2grid (sample.at[i, 'lat'], sample.at[i, 'lon'], tag = 'WGS84')
        sample.at[i, 'x'] = g.E
        sample.at[i, 'y'] = g.N
    except ValueError:
        print("Problem with a document", sample.at[i,'id'])

# Now we can set up the parameters for our index        
resolution = 10000

minx = sample['x'].min()
maxx = sample['x'].max()
miny = sample['y'].min()
maxy = sample['y'].max()

w = maxx - minx
h = maxy - miny

nc = int(w/resolution) + 1
nr = int(h/resolution) + 1

#print(maxx, minx, maxy, miny)
#print(nr, nc)

#Build the spatial index now
spatialIndex = pd.DataFrame(index=range(nc),columns=range(nr))

#Now we populate the index with document ids
#Documents whose coordinates could not be converted have no x/y values, so we skip them
for index, row in sample.dropna(subset=['x', 'y']).iterrows():
    i = int((row['x'] - minx)/resolution)
    j = int((row['y'] - miny)/resolution)
    doc_id = row['id']

    #Each cell holds a dictionary mapping document id to its (x, y) coordinates
    if pd.isnull(spatialIndex.at[i,j]):
        spatialIndex.at[i,j] = {doc_id:(row['x'],row['y'])}
    else:
        names = spatialIndex.at[i,j]
        names.update({doc_id:(row['x'],row['y'])})
        spatialIndex.at[i,j] = names

spatialIndex
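
The idea behind the grid index can also be shown without pandas. The sketch below is a simplified stand-in rather than the DataFrame-based index built above: it stores document ids in a dictionary keyed by cell `(i, j)`. The function name `build_grid_index` and the toy coordinates are assumptions for illustration.

```python
from collections import defaultdict

def build_grid_index(points, minx, miny, resolution):
    """Map each (i, j) grid cell to the documents that fall inside it.

    points: dict of doc_id -> (x, y) in metres (e.g. OSGB36 eastings/northings)
    """
    index = defaultdict(dict)
    for doc_id, (x, y) in points.items():
        i = int((x - minx) // resolution)
        j = int((y - miny) // resolution)
        index[(i, j)][doc_id] = (x, y)
    return dict(index)

# Hypothetical documents: two close together, one about 30 km further east
toy_points = {"a": (216500, 771500), "b": (217900, 772300), "c": (246000, 771000)}
toy_minx = min(x for x, _ in toy_points.values())
toy_miny = min(y for _, y in toy_points.values())
toy_index = build_grid_index(toy_points, toy_minx, toy_miny, resolution=10000)
print(sorted(toy_index))
```

A sparse dictionary only stores cells that contain documents, whereas the DataFrame version allocates every cell of the bounding grid; which is preferable depends on how evenly the documents are spread.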

In [None]:
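# Do a range query on the spatial index
search_range = 100000 #(100 km); we avoid the name range so that Python's built-in range stays usable
point = (216500, 771500) #Ben Nevis as (easting, northing)
x1 = point[0] - search_range/2
x2 = point[0] + search_range/2
y1 = point[1] - search_range/2
y2 = point[1] + search_range/2

#Convert the bounding box to cell indices, clamping at 0 so that negative values do not wrap around
i1 = max(int((x1 - minx)/resolution), 0)
j1 = max(int((y1 - miny)/resolution), 0)
i2 = max(int((x2 - minx)/resolution) + 1, 0)
j2 = max(int((y2 - miny)/resolution) + 1, 0)

#The rows of the index are x cells (i) and the columns are y cells (j)
result = spatialIndex.iloc[i1:i2, j1:j2]

result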
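
A range query over a grid reduces to working out which cells a bounding box overlaps. The sketch below shows one way to do this, clamping the box so that it never extends outside the grid. The function name `cells_in_range` and the example grid dimensions are assumptions, not part of the notebook's data.

```python
def cells_in_range(point, size, minx, miny, resolution, ncols, nrows):
    """Return the (i, j) cells overlapped by a square of side `size` centred on `point`.

    Cell indices are clamped to the grid, so a box that sticks out over the edge
    only returns the cells that actually exist.
    """
    x, y = point
    half = size / 2
    i1 = max(int((x - half - minx) // resolution), 0)
    j1 = max(int((y - half - miny) // resolution), 0)
    i2 = min(int((x + half - minx) // resolution), ncols - 1)
    j2 = min(int((y + half - miny) // resolution), nrows - 1)
    return [(i, j) for i in range(i1, i2 + 1) for j in range(j1, j2 + 1)]

# Hypothetical 5 x 5 grid of 10 km cells with its origin at (200000, 750000)
cells = cells_in_range((216500, 771500), 20000, 200000, 750000, 10000, 5, 5)
corner = cells_in_range((200000, 750000), 20000, 200000, 750000, 10000, 5, 5)
print(len(cells), len(corner))
```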


In [None]:
#Read the three term lists (elements, qualities and activities), one term per line
elements = set(pd.read_csv('./data/elements.txt', header=None)[0])
qualities = set(pd.read_csv('./data/qualities.txt', header=None)[0])
activities = set(pd.read_csv('./data/activities.txt', header=None)[0])

terms = elements.union(qualities).union(activities)
lemmas = ' '.join(str(e) for e in terms)

#Lemmatise the terms so that they can be matched against the token lemmas of the documents
doc = nlp(lemmas)
terms = {token.lemma_ for token in doc}
terms

In [None]:
m = 1000
sample = geograph.sample(n = m)
#sample = geograph

# Create a postings list for our geograph documents: lemma -> {document id: term frequency}
docs = nlp.pipe(sample.text, n_process=2, batch_size=100)

postings = dict()

#nlp.pipe yields the processed documents in the same order as the rows of the sample
for (_, row), doc in zip(sample.iterrows(), docs):
    doc_id = row['id']
    for token in doc:
        lemma = token.lemma_
        #We only count the terms we are interested in
        if lemma in terms:
            tf = postings.setdefault(lemma, {})
            tf[doc_id] = tf.get(doc_id, 0) + 1
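
Once built, a postings list of this shape (lemma to a dictionary of document id and term frequency) supports simple lookups such as document frequency or ranking documents for a term. The sketch below uses a hand-written toy postings dictionary; the terms, ids and helper names are hypothetical.

```python
# Toy postings list with the same shape as the one built from the sample:
# lemma -> {document id: number of occurrences in that document}
toy_postings = {
    "hill": {101: 2, 205: 1},
    "river": {101: 1},
    "walk": {205: 3, 310: 1, 101: 1},
}

def document_frequency(postings, lemma):
    """Number of documents that contain the lemma at least once."""
    return len(postings.get(lemma, {}))

def top_documents(postings, lemma, k=2):
    """The k document ids with the highest term frequency for the lemma."""
    tf = postings.get(lemma, {})
    # Sort by descending frequency, breaking ties by document id
    return sorted(tf, key=lambda doc_id: (-tf[doc_id], doc_id))[:k]

print(document_frequency(toy_postings, "walk"))
print(top_documents(toy_postings, "walk"))
```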

This block of code does the comparisons. It iterates through all the Geograph documents and, for each document:

- Retrieves all the toponyms in the gazetteer cell at the document's location,
- Compares each name found by the NER to that list of toponyms, and
- Annotates each name as either existing (found in the gazetteer) or new (not found)

If we change the sample of names, or the resolution of the gazetteer, then the results of the following **comparison** should change.
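
The three steps above can be sketched on toy data. Here the gazetteer is a plain dictionary from cell `(i, j)` to a set of names, and entities are plain strings rather than spaCy spans. Both are simplifications, and all names and coordinates are made up for illustration.

```python
def classify_names(documents, gazetteer, minx, miny, cell_size):
    """Label each candidate name as 'Existing' or 'New' relative to its local gazetteer cell."""
    rows = []
    for doc in documents:
        i = int((doc["x"] - minx) // cell_size)
        j = int((doc["y"] - miny) // cell_size)
        local_names = gazetteer.get((i, j), set())  # empty set if the cell has no names
        for name in doc["entities"]:
            status = "Existing" if name in local_names else "New"
            rows.append((doc["id"], status, name))
    return rows

# Hypothetical gazetteer: the same name appears in two different cells
toy_gaz = {(0, 0): {"Newton", "Mill Hill"}, (3, 1): {"Newton"}}
toy_docs = [
    {"id": 1, "x": 2500, "y": 4000, "entities": ["Newton", "Witch's Pool"]},
    {"id": 2, "x": 15000, "y": 4000, "entities": ["Newton"]},
]
rows = classify_names(toy_docs, toy_gaz, minx=0, miny=0, cell_size=10000)
for row in rows:
    print(row)
```

In this toy run "Newton" is reported as new for document 2 even though the name exists elsewhere in the gazetteer, which is the effect of comparing locally rather than against the whole gazetteer.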

In [None]:
#Now we are going to compare the gazetteer names with those we found
#N.B. results, gaz and cellSize are not created in the cells above and must already exist when this cell runs

data = []
#We iterate through all our results
for record in results:
    #First we get the cell indices for the gazetteer
    x = record.get("x")
    y = record.get("y")
    i = int((x - minx)/cellSize)
    j = int((y - miny)/cellSize)
    try:
        #Now we find the names in that cell - N.B. we ignore for now the fact that a Geograph cell could be at a boundary
        gazNames = gaz.at[i,j]
        #Deal with a cell having no values in the gazetteer
        if not isinstance(gazNames, set):
            gazNames = {"NoNamesFound"}
    except KeyError:
        gazNames = {"NoNamesFound"}
    #Get back the named entities for the text
    ents = record.get("entities")
    #Now we iterate through, and find out if each name is already in the local gazetteer
    for ent in ents:
        status = "Existing" if ent.text in gazNames else "New"
        data.append([record.get("id"), status, ent.text, ent.label_, x, y])
#Store the results in a dataframe
df = pd.DataFrame(data, columns = ['id', 'status','name','type','x','y'])
df

In [None]:
#Split results into existing and candidate names for reporting
new = df.loc[df['status'] == 'New']
existing = df.loc[df['status'] == 'Existing']

We output the new candidate names that the NER labelled as 'PERSON'. You could estimate precision here by judging how many of these are really toponyms, since the comparison implicitly treats every entity, including those labelled 'PERSON', as a potential toponym.

In [None]:
#Let's look at one example NER class in the candidate names
names = new.loc[new['type'] == 'PERSON']

with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.precision', 3,
                       ):
    print(names)
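
To turn such a manual check into a precision estimate, record a judgement for each candidate name and divide the number judged to be real toponyms by the number judged. The judgements below are invented placeholders, not results from the Geograph data, and the function name `precision` is an assumption.

```python
def precision(judgements):
    """Fraction of judged candidates that were marked as real toponyms."""
    if not judgements:
        raise ValueError("No judgements supplied")
    return sum(1 for is_toponym in judgements.values() if is_toponym) / len(judgements)

# Hypothetical manual judgements: candidate name -> is it really a toponym?
toy_judgements = {
    "Candidate A": True,
    "Candidate B": False,
    "Candidate C": True,
    "Candidate D": True,
}
print(f"Precision: {precision(toy_judgements):.2f}")
```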

In [None]:
cn = len(set(new['name']))
ce = len(set(existing['name']))
print("Found", ce, "unique existing names and", cn, "unique new candidate names.")

In [None]:
#output all dependencies so that we can reproduce the notebook (we only need this to set things up for Binder)
#%load_ext watermark
#%watermark --iversions

We plot the locations of the new and existing names as simple scatter plots. A density plot would make more sense here to allow a real comparison, but this gives us a first overview.

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2)
ax1.scatter(new['x'], new['y'])
ax2.scatter(existing['x'], existing['y'])
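
A density view, as suggested above, could be drawn with matplotlib's `hexbin`, which counts points per hexagonal cell. The sketch below uses randomly generated coordinates as a stand-in for `new` and `existing`; the point counts, extents and `gridsize` are arbitrary assumptions.

```python
import random
import matplotlib.pyplot as plt

random.seed(0)
# Hypothetical coordinates standing in for the x/y columns of new and existing
new_xy = [(random.gauss(300000, 40000), random.gauss(600000, 60000)) for _ in range(500)]
existing_xy = [(random.gauss(350000, 50000), random.gauss(500000, 80000)) for _ in range(500)]

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
collections = []
for ax, xy, title in zip(axes, (new_xy, existing_xy), ("New", "Existing")):
    xs, ys = zip(*xy)
    # mincnt=1 leaves hexagons without any points blank
    hb = ax.hexbin(xs, ys, gridsize=25, mincnt=1)
    ax.set_title(title)
    fig.colorbar(hb, ax=ax, label="names per cell")
    collections.append(hb)
```

Using the same `gridsize` in both panels keeps the panels roughly comparable, although the cells only line up exactly if both panels also share the same extent.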