#### TFIDF Passage Retrieval

A simple and efficient way of retrieving relevant documents (e.g. sentences or paragraphs) from a document store is to use `TFIDF`. Given a collection of documents, we create a TFIDF vector for each document: a vector of TFIDF weights, one for each word in the vocabulary.

Term-frequency for term $t$ in document $d$: 

$TF(t,d) = 1 + \log_{10}(\text{count}(t,d))$

where $\text{count}(t,d)$ is the number of times term $t$ appears in document $d$, and $TF(t,d) = 0$ when the term does not appear at all. (Note that we take the log to compress the range of count values.)

Inverse-document-frequency for term $t$:

$IDF(t) = \log_{10}(\frac{N}{\text{df}(t)})$

where $N$ is the total number of documents in the collection and $\text{df}(t)$ is the number of documents in which term $t$ occurs. The TFIDF weight is then given by the following product:

$TFIDF(t,d) = TF(t,d) IDF(t)$
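
As a quick numeric check of the three formulas above, here is a minimal sketch. The helper names `tf`, `idf` and `tfidf` and the example counts are illustrative assumptions, not part of the system implemented later.

```python
import math

def tf(count):
    """Log-scaled term frequency; defined as 0 when the term is absent."""
    return 1 + math.log10(count) if count > 0 else 0.0

def idf(n_docs, doc_freq):
    """Inverse document frequency for a term found in doc_freq of n_docs documents."""
    return math.log10(n_docs / doc_freq)

def tfidf(count, n_docs, doc_freq):
    """Product of the two weights above."""
    return tf(count) * idf(n_docs, doc_freq)

# hypothetical numbers: a term occurring 10 times in a document,
# and appearing in 10 out of 1000 documents overall
weight = tfidf(count=10, n_docs=1000, doc_freq=10)
print(weight)  # (1 + 1) * 2 = 4.0

# a term that occurs in every document carries no weight
print(tfidf(count=3, n_docs=6, doc_freq=6))  # 0.0
```

Note that a word appearing in every document gets an IDF of 0, so it contributes nothing to any score.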


Given a query $q$, which is a sequence of terms, we can construct a TFIDF vector for the query as well. However, since queries are usually short and likely to contain a single occurrence of each unique term, we can simplify the query vector by setting the weight of each unique query term to 1. We can then score each document by the cosine similarity between the query vector and the corresponding document TFIDF vector $d$:

$score(q,d) = \frac{q \cdot d}{|q| |d|}$

Since $|q|$ is the same for every document, we can drop it without changing the ranking of the documents. With our simplifying assumption that $q$ is a vector of binary weights, the dot product $q \cdot d$ reduces to a sum over the query terms, which gives the following document score function:

$score(q,d) = \frac{\sum_{t\in q} TFIDF(t,d)}{\sqrt{\sum_{t\in d} TFIDF(t,d)^2}}$

where the square root in the denominator is the norm of the document TFIDF vector, and $TFIDF(t,d) = 0$ for any query term that does not occur in $d$. So the score for each document is just the sum of the TFIDF weights of the query terms that appear in that document, divided by the norm of that document's TFIDF vector.
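
The score function can be sketched directly from this formula. The example below assumes each document vector is a plain dictionary from term to TFIDF weight; the function name `score` and the weights in `doc_a` and `doc_b` are made up for illustration.

```python
import math

def score(query_terms, doc_vector):
    """Sum of the document's weights for the unique query terms,
    divided by the norm of the document vector."""
    norm = math.sqrt(sum(w * w for w in doc_vector.values()))
    if norm == 0:
        return 0.0
    # terms missing from the document contribute a weight of 0
    matched = sum(doc_vector.get(t, 0.0) for t in set(query_terms))
    return matched / norm

# hypothetical document vectors, both with norm 5
doc_a = {"open": 3.0, "country": 4.0}
doc_b = {"open": 3.0, "forest": 4.0}

query = ["open", "country"]
print(score(query, doc_a))  # (3 + 4) / 5 = 1.4
print(score(query, doc_b))  # 3 / 5 = 0.6
```

Because $|q|$ has been dropped, the value is no longer a true cosine and can exceed 1; only the relative order of the documents is meaningful.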

Not every query word occurs in every document, so instead of iterating over all documents in the collection we only need to consider the documents that contain at least one query word. We can maintain an `inverted index`, a dictionary that maps each unique word to a list of tuples, each containing a document and the TFIDF weight of the word in that document.

e.g. `inverted_index = {'w1' : [(d1, TFIDF(w1,d1)), (d2, TFIDF(w1,d2)), ...], 'w2': ...}`

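This data structure lets us compute and rank the document scores efficiently, because only documents that share at least one word with the query are ever touched. (The implementation below stores only document ids in the index and keeps the weights in a separate dictionary keyed by `(word, document)`.)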
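
To illustrate the tuple-based layout described above, here is a small sketch that builds an index from precomputed weights and uses it to accumulate the numerator of the score. The weights in `doc_vectors` and the helper name `candidate_sums` are assumptions made for the example.

```python
from collections import defaultdict

# hypothetical precomputed TFIDF weights: {doc_id: {term: weight}}
doc_vectors = {
    0: {"open": 3.0, "country": 4.0},
    1: {"open": 3.0, "forest": 4.0},
    2: {"fancy": 1.0},
}

# word -> list of (doc_id, weight) postings
inverted_index = defaultdict(list)
for doc_id, vector in doc_vectors.items():
    for term, weight in vector.items():
        inverted_index[term].append((doc_id, weight))

def candidate_sums(query_terms):
    """Accumulate matching weights per document, visiting only the
    postings of the query terms rather than every document."""
    sums = defaultdict(float)
    for term in set(query_terms):
        # .get avoids creating empty entries for unknown words
        for doc_id, weight in inverted_index.get(term, []):
            sums[doc_id] += weight
    return dict(sums)

print(candidate_sums(["open", "country"]))  # documents 0 and 1 only: sums 7.0 and 3.0
```

Document 2 shares no word with the query, so it is never visited. Dividing each sum by the corresponding document norm then gives the score.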


In [17]:
from nltk.tokenize import RegexpTokenizer
from collections import defaultdict
import math

tokenizer = RegexpTokenizer(r'\w+')

In [32]:
# test documents
test_documents = ["To be brief, I write for various reasons.", 
                  "I will confess that I have a fancy to be numbered among their honourable company.",  
                  "Sir Henry Curtis, as everybody acquainted with him knows, is one of the most hospitable men on earth", 
                  "Everybody turned and stared politely at the curious-looking little lame man, and though his size was insignificant, he was quite worth staring at.",
                  "Once it was a dense forest, now it's open level country cultivated here and there, but for the most part barren.",
                  "Christian, the number of casualties from sickness has been very small indeed, and this although they frequently sleep in the trenches of newly-turned earth at all seasons of the year."]

In [31]:
# tokenize a sentence string into lowercase words; punctuation is removed
def tokenize(sent):
    return tokenizer.tokenize(sent.lower())

In [63]:
# a basic TFIDF information retrieval system
class IR_System():
    def __init__(self, documents):
        self.documents = documents
        self.TFIDF, self.inverted_index, self.doc_tfidf_norms = self.create_inverted_index()
        

    def create_inverted_index(self):
        N = len(self.documents)
        TF = defaultdict(int)
        TFIDF = defaultdict(float)
        term_docs = defaultdict(set)
        inverted_index = defaultdict(list)

        # compute term frequency and document frequencies
        for d, doc in enumerate(self.documents):
            words = tokenize(doc)
            for w in words:
                TF[(w, d)] += 1
                term_docs[w].add(d)

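        # create inverted index: each word maps to the sorted ids of the documents containing it;
        # the TFIDF weights are stored separately, keyed by (word, doc id)
        for w, docs in term_docs.items():
            for d in sorted(docs):
                tfidf = (1 + math.log10(TF[(w,d)])) * math.log10(N/len(docs))
                inverted_index[w].append(d)
                TFIDF[(w,d)] = tfidf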

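        # compute document TFIDF vector norms
        # NOTE: this loops over every token, so a word repeated within a document adds its
        # squared weight once per occurrence; iterating over set(words) would count each
        # term once, as in the norm formula
        doc_tfidf_norms = [0] * N
        for d, doc in enumerate(self.documents):
            words = tokenize(doc)
            for w in words:
                doc_tfidf_norms[d] = doc_tfidf_norms[d] +  TFIDF[(w,d)]**2
            doc_tfidf_norms[d] = math.sqrt(doc_tfidf_norms[d])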

        return TFIDF, inverted_index, doc_tfidf_norms  


    def retrieve_docs(self, query, topk=1):
        query_words = tokenizer.tokenize(query.lower())
        #print(f"query words: {query_words}")
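        # collect candidate documents: every document containing at least one query word
        docs = []
        for w in query_words:
            docs.extend(self.inverted_index[w])
        #print(f"docs: ")    
        # score the candidates: sum the TFIDF weights of the query words, divide by the document norm
        # NOTE: a document matching several query words appears in `docs` once per match, so it is
        # accumulated and normalized repeatedly here; iterating over set(docs) would score each
        # document exactly once
        scores = defaultdict(float)
        for d in docs:
            for w in query_words:
                scores[d] += self.TFIDF[(w,d)]
            scores[d] = scores[d] / self.doc_tfidf_norms[d]        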
        #print(f"scores: {scores}")    
        # return topk documents
        sorted_scores = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        #print(f"sorted scores: {sorted_scores}")
        best = sorted_scores[:topk]
        topk_docs = []
        for doc, score in best:
            topk_docs.append((self.documents[doc], score))
        return topk_docs    

In [64]:
IR = IR_System(test_documents)
#for w, docs in IR.inverted_index.items():
#    print(f"{w}: {[(d, IR.TFIDF[(w,d)]) for d in docs]}")  
print(IR.inverted_index)      
print(IR.doc_tfidf_norms)      

defaultdict(<class 'list'>, {'to': [0, 1], 'be': [0, 1], 'brief': [0], 'i': [0, 1], 'write': [0], 'for': [0, 4], 'various': [0], 'reasons': [0], 'will': [1], 'confess': [1], 'that': [1], 'have': [1], 'a': [1, 4], 'fancy': [1], 'numbered': [1], 'among': [1], 'their': [1], 'honourable': [1], 'company': [1], 'sir': [2], 'henry': [2], 'curtis': [2], 'as': [2], 'everybody': [2, 3], 'acquainted': [2], 'with': [2], 'him': [2], 'knows': [2], 'is': [2], 'one': [2], 'of': [2, 5], 'the': [2, 3, 4, 5], 'most': [2, 4], 'hospitable': [2], 'men': [2], 'on': [2], 'earth': [2, 5], 'turned': [3, 5], 'and': [3, 4, 5], 'stared': [3], 'politely': [3], 'at': [3, 5], 'curious': [3], 'looking': [3], 'little': [3], 'lame': [3], 'man': [3], 'though': [3], 'his': [3], 'size': [3], 'was': [3, 4], 'insignificant': [3], 'he': [3], 'quite': [3], 'worth': [3], 'staring': [3], 'once': [4], 'it': [4], 'dense': [4], 'forest': [4], 'now': [4], 's': [4], 'open': [4], 'level': [4], 'country': [4], 'cultivated': [4], 'here'

In [68]:
IR.retrieve_docs(query="open country fancy", topk=2)

[("Once it was a dense forest, now it's open level country cultivated here and there, but for the most part barren.",
  0.592383863918572),
 ('I will confess that I have a fancy to be numbered among their honourable company.',
  0.283974366815071)]