## Introduction

This notebook illustrates some basic document handling using [Spacy] (https://spacy.io/). Spacy is fast, and powerful, but not completely trivial to understand. There are though lots of useful resources, and the documentation is excellent.

**The first block of our code simply sets things up - most important here is the language model that we use.**

In [1]:
import spacy #Our NLP tools
from collections import Counter #We will use this to do simple counts of terms

#Load a German language model to do German NLP - the models we use will influence our results a lot
#nlp = spacy.load('de_core_news_md')
#And an English language model for English
nlp = spacy.load('en_core_web_sm')

Now we load a default list of stop words and print them out.

Look through the list of stop words, and consider what issues they might cause if we are interested in spatial relationships?

In [2]:
# This block loads our stop words 
stopwords = nlp.Defaults.stop_words

print(len(stopwords))
print(stopwords)

326
{'indeed', 'been', 'go', 'seeming', 'around', 'from', 'hereupon', 'although', 'call', 'give', 'anyway', 'n’t', 'anywhere', 'they', 'myself', 'seems', 'via', 'but', 'please', 'together', 'though', "'ve", 'us', 'quite', 'whereafter', 'however', 'than', 'so', 'now', 'get', 'part', 'could', 'latterly', 'meanwhile', 'everywhere', 'by', 'regarding', 'whatever', 'elsewhere', 'per', '’m', 'as', 'show', 'a', 'these', 'say', 'anyone', 'which', 'forty', 'latter', 'where', 'towards', 'why', 'you', 'after', 'therein', 'other', 'two', 'does', 'hundred', 'empty', 'only', 'if', 'few', 'somehow', 'fifty', 'others', 'whence', 'fifteen', 'noone', '‘d', '‘ve', 'moreover', 'for', 'just', 'both', 'very', 'those', 'nobody', 'side', 'that', 'upon', 'rather', 'being', 'ten', 'me', 'various', 'otherwise', 'some', '’re', 'whether', 'himself', 'wherein', 'before', 'seemed', 'this', '’ll', 'ever', 'thus', 'between', 'serious', 'also', 'everything', 'over', 'was', 'in', 'amongst', 'former', 'sometime', 'sometim

This function calculates the term frequency for a given document. At the moment it removes stop words and punctuation and stores lemmas in the index. Experiment with changing this - what happens to the end results?

In [17]:
def tf(text):
    doc = nlp(text)
    n = len(doc)
    
    terms = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct ]
    tf = dict(Counter(terms))
    
    for term, count in tf.items():
        tf[term] = count/n
        
    return tf

This function calculates the document frequency - that means we run it once over our whole collection, and it only changes if we add new documents. It is important that it is consistent with our term frequency. If you change the way we calculate tf, you need to change, and recalculate df.

In [18]:
import math
def df(texts):
    df = dict()
    ndocs = len(texts)
    for text in texts:
        doc = nlp(text)
        terms = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct ]
        tf = set(terms)
        for t in tf:
            if t in df:
                count = df[t]
                count = count + 1
            else:
                count = 1
            df[t] = count/ndocs
    for term, count in df.items():
        df[term] = math.log10(ndocs/(count + 1))

    return df

Now we can calculate tf idf values for our corpus - that means we have a score for every word in every document, that can then be used in ranking. 

In [19]:
def tfidf(corpus):
    idf = df(list(corpus.values()))
    results = {}
    for id, text in corpus.items():                
        t = tf(text)
        scores = {}
        for term in t:
            scores[term] = t[term]*idf[term]
        results[id] = scores
    return results

This our search algorithm. We take a query, and the tfidf scores, and rank each document according to its score.

In [20]:
def simpleSearch(query, weights):
    q = nlp(query)
    
    #Iterate through each document and add its tf-idf score
    results = {}
    for id, scores in weights.items():
        score = 0
        for token in q:
            if token.lemma_ in scores:
                score = score + scores[token.lemma_]
        results[id] = score
    return results

In [21]:
corpus = {1:"the cat sat on the mat",
          2:"the dog played with the cat",
          3:"the cat bit the dog",
          4:"the boy was playing with the dog",
          5:"the girl saw the cat biting the dog far away"}

weights = tfidf(corpus)

results = simpleSearch('cat dog', weights)

print(results)

{1: 0.10045923649826893, 2: 0.20091847299653787, 3: 0.24110216759584546, 4: 0.12055108379792273, 5: 0}
