## Introduction

This notebook illustrates how we can use named entity recognition (NER) to search for place names in a corpus and use them to enhance a gazetteer. It uses two datasets to illustrate the concepts.

1) [Geograph](https://geograph.org.uk) 
This site invites users to take pictures in the UK and add descriptions. It has almost 7 million pictures, and the data are licensed under a CC BY-SA licence, which makes them available for research provided that we credit the users by name and give others access to any data we derive.

2) [Ordnance Survey](https://ordnancesurvey.co.uk/) 50k gazetteer
This gazetteer was published under a UK Open Government Licence and contains all place names found on 1:50k maps in the UK. It is a legacy product (i.e. no longer used or updated), but it is suitable for our purposes.

We are going to look for names found in the Geograph data that do not exist in the gazetteer. Because many place names occur in more than one location, we make this comparison locally rather than across the whole country, which increases the chance that the names we find are genuinely new.
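
The local comparison described above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the function name `new_local_names`, the 5 km radius, the tuple layout, and the toy names and coordinates are all assumptions, and the candidate names are taken as given rather than extracted with NER.

```python
import math

def new_local_names(candidates, gazetteer, radius_m=5000):
    """Return candidate names with no same-named gazetteer entry within radius_m.

    Both inputs are lists of (name, easting, northing) tuples in metres.
    This layout is a hypothetical stand-in for the real data.
    """
    new_names = []
    for name, x, y in candidates:
        # A name counts as known only if the gazetteer has it close by
        known_nearby = any(
            g_name.lower() == name.lower()
            and math.hypot(g_x - x, g_y - y) <= radius_m
            for g_name, g_x, g_y in gazetteer
        )
        if not known_nearby:
            new_names.append(name)
    return new_names

# Toy data, invented for illustration
gazetteer = [("Newton", 100000, 200000), ("Newton", 400000, 600000)]
candidates = [
    ("Newton", 101000, 201000),        # close to the first gazetteer entry
    ("Newton", 250000, 350000),        # same name, but far from both entries
    ("Example Crag", 101500, 200500),  # name absent from the gazetteer
]
result = new_local_names(candidates, gazetteer)
print(result)
```

A purely global lookup would treat the second "Newton" as already known; restricting the match to a radius is what lets a repeated name be reported as new in a different location.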

**The first block of our code reads in the Geograph data and builds a simple postings file (an inverted index of lemmas) for a random sample of 100 descriptions. We only need to do this once.**

In [59]:
import pandas as pd #To use pandas for elegant data handling
import spacy #Our NLP tools
import math

class Postings:
    
    def __init__(self, firstMondayTerms):
        #Load a language model to do NLP
        self.nlp = spacy.load("en_core_web_md")
        #First we read in the geograph data
        geograph = pd.read_csv('./data/geograph_mini_corpus.csv', encoding='latin-1')
        
        sample = geograph.sample(n = 100)
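        self.ndocs = len(sample)
        # Create the postings file up front so tfIdf also works when firstMondayTerms is False
        self.postings = dict()
        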
        # The firstMonday terms work like an inverse stop list: only words in these lists go into our postings file
        if firstMondayTerms:
            elements = set(pd.read_csv('./data/elements.txt', header=None)[0])
            qualities = set(pd.read_csv('./data/qualities.txt', header=None)[0])
            activities = set(pd.read_csv('./data/activities.txt', header=None)[0])

            terms = elements.union(qualities).union(activities)
            lemmas = ' '.join(str(e) for e in terms)

            doc = self.nlp(lemmas)
            terms = set()
            for token in doc:
                terms.add(token.lemma_)
                
            # Now we process our corpus and create a postings file
            docs = self.nlp.pipe(sample.text,n_process=2, batch_size=100)

            self.postings = dict()

            # Pair each document id with its parsed text (avoids shadowing the built-in id)
            for doc_id, parsed in zip(sample['id'], docs):
                for token in parsed:
                    lemma = token.lemma_
                    if lemma in terms:
                        # postings[lemma] maps document id -> term frequency
                        tf = self.postings.setdefault(lemma, {})
                        tf[doc_id] = tf.get(doc_id, 0) + 1
                        
    def tfIdf(self, query):
        """Rank documents for a query; returns {document id: tf-idf score}, highest first."""
        results = {}
        qdoc = self.nlp(query)
        for token in qdoc:
            qt = token.lemma_
            if qt in self.postings:
                dc = len(self.postings[qt])  # number of documents containing the term
                # Smoothed inverse document frequency (base 10)
                idf = math.log10(self.ndocs/(dc + 1))
                for doc, tf in self.postings[qt].items():
                    # Sum tf-idf over all query terms that occur in the document
                    results[doc] = results.get(doc, 0) + tf * idf
        results = dict(sorted(results.items(), key = lambda x: x[1], reverse=True))
        
        return results
                        
                        
        

In [60]:
postings = Postings(True)

In [61]:
postings.tfIdf('hill mountain')

{2775351: 2.3098039199714866,
 885215: 1.1549019599857433,
 650430: 1.1549019599857433,
 1171797: 1.1549019599857433,
 3113207: 1.1549019599857433,
 969246: 1.1549019599857433,
 210766: 0.6777807052660807,
 849212: 0.6777807052660807,
 1771153: 0.6777807052660807,
 755207: 0.6777807052660807,
 2004630: 0.6777807052660807,
 54600: 0.6777807052660807,
 1050188: 0.6777807052660807,
 2784778: 0.6777807052660807,
 2932828: 0.6777807052660807,
 1744155: 0.6777807052660807,
 3310901: 0.6777807052660807,
 1144802: 0.6777807052660807,
 407290: 0.6777807052660807,
 3108258: 0.6777807052660807,
 151138: 0.6777807052660807,
 2161064: 0.6777807052660807,
 2883804: 0.6777807052660807,
 39454: 0.6777807052660807,
 2226055: 0.6777807052660807,
 1409271: 0.6777807052660807}
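
The scores above can be checked by hand. The sketch below assumes the smoothed formula used in `tfIdf`, idf = log10(N / (df + 1)), with N = 100 sampled documents. The document frequencies 6 and 20 are not printed by the notebook; they are inferred from the number of results in each score group, and `smoothed_idf` is a helper name introduced here for illustration.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Inverse document frequency as computed in tfIdf: log10(N / (df + 1))."""
    return math.log10(n_docs / (doc_freq + 1))

n_docs = 100                           # size of the sample drawn in Postings.__init__
rare_idf = smoothed_idf(n_docs, 6)     # term found in 6 documents (inferred)
common_idf = smoothed_idf(n_docs, 20)  # term found in 20 documents (inferred)

print(round(rare_idf, 4))      # 1.1549
print(round(common_idf, 4))    # 0.6778
print(round(2 * rare_idf, 4))  # 2.3098, consistent with a term frequency of 2
```

Because of the `+ 1` in the denominator, a term that occurs in N - 1 or more documents gets an idf of zero or below, so very common terms contribute nothing to a document's score or count against it.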

In [63]:
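# `sample` is local to Postings.__init__, so it is not visible here; reload the corpus to preview it
pd.read_csv('./data/geograph_mini_corpus.csv', encoding='latin-1').head()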