## Introduction

This notebook illustrates some basic document handling using [Spacy] (https://spacy.io/). Spacy is fast, and powerful, but not completely trivial to understand. There are though lots of useful resources, and the documentation is excellent.

**The first block of our code simply sets things up - most important here is the language model that we use.**

In [1]:
import spacy #Our NLP tools
import math
from collections import Counter #We will use this to do simple counts of terms

#Load a German language model to do German NLP - the models we use will influence our results a lot
#nlp = spacy.load('de_core_news_md')
#And an English language model for English
nlp = spacy.load('en_core_web_sm')

The following function calculates term frequency for a given document. It does so for lemmatised tokens, and ignores stop words and punctuation. You can try changing the terms for which frequency is calculated.

In [2]:
def tf(text):
    doc = nlp(text)
    n = len(doc)
    
    terms = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct ]
    tf = dict(Counter(terms))
    
    for term, count in tf.items():
        tf[term] = count/n
        
    return tf

This function calculates document frequency

In [3]:
def df(texts):
    df = dict()
    ndocs = len(texts)
    for text in texts:
        doc = nlp(text)
        terms = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct ]
        tf = set(terms)
        for t in tf:
            if t in df:
                count = df[t]
                count = count + 1
            else:
                count = 1
            df[t] = count/ndocs
    for term, count in df.items():
        df[term] = math.log10(ndocs/(count + 1))

    return df

Now we take a corpus, and calculate tf-idf scores for every term in every document in the corpus

In [4]:
def tfidf(corpus):
    idf = df(list(corpus.values()))
    results = {}
    for id, text in corpus.items():                
        t = tf(text)
        scores = {}
        for term in t:
            scores[term] = t[term]*idf[term]
        results[id] = scores
    return results

Finally, we are ready to do a search. We calculate scores for each document based on the query and tfidf scores.

In [5]:
def simpleSearch(query, weights):
    q = nlp(query)
    
    #Iterate through each document and add its tf-idf score
    results = {}
    for id, scores in weights.items():
        score = 0
        for token in q:
            if token.lemma_ in scores:
                score = score + scores[token.lemma_]
        results[id] = score
    return results

You can experiment with different corpora, and different queries to see how the results vary. Ranks for results would be based on tfidf scores. 

In [6]:
corpus = {1:"the cat sat on the mat",
          2:"the dog played with the cat",
          3:"the cat bit the dog",
          4:"the boy bit the dog",
          5:"the girl saw the boy far away"}

weights = tfidf(corpus)

results = simpleSearch('cat dog', weights)

print(results)

{1: 0.10045923649826893, 2: 0.20091847299653787, 3: 0.24110216759584546, 4: 0.12055108379792273, 5: 0}
