#### TFIDF Passage Retrieval

A simple and efficient way of retreiving relevant documents (e.g. sentence or paragraph) from a document store is using `TFIDF`. Given a collection of documents, we can create TFIDF vectors for each document, which is a vector of TFIDF weights (one for each word from the vocabulary). 

Term-frequency for term $t$ in document $d$: 

$TF(t,d) = 1 + \log_{10}(\text{count}(t,d))$

where $\text{count}(t,d)$ is the frequency with which term $t$ appears in document $d$. (Note that we take the log to supress the range of count values).

`BM25`: By making a slight modification to how the term frequency is computer, we can improve performance. Note that the log counts can grow to large values if certain terms are extremely common, (i.e. unbounded growth). Instead of using the log function, we can use a different function which puts an upper bound on the TF value so that growth saturates when the raw counts keep increasing beyond a certain point. This modification to TF is called `BM25` and is given by the following:

$TF_{BM25}(t,d) = \frac{(1+k)  \text{ count}(t,d)}{\text{count}(t,d) + k \text{ } (1 - b + b \frac{|d|}{d_{avg}})}$

where $|d|$ is the length of the document and $d_{avg}$ is the average document length across the entire collection. $k$ is a hyperparameter which controls how quickly the TF value starts saturating and $b$ is a hyperparameter which penalizes documents which are longer than the average document length.


Inverse-document-frequency for term $t$:

$IDF(t) = \log_{10}(\frac{N}{\text{df}(t)})$

where $N$ is the total number of documents in the collectino and $\text{df}(t)$ is the number of documents in which term $t$ occurs. Then the TFIDF is given by the following product:

$TFIDF(t,d) = TF(t,d) IDF(t)$


Then given a query $q$ which is a sequence of terms, we can construct a TFIDF vector for this query. However, since queries are usually short and is likely to contain a single occurance of each unique term, we can simplify it's TFIDF vector by setting the TFIDF weight for each unique term to 1. Then we can compute a score for each document as the cosine similarity between the query vector and the corresponding document TFIDF vector $d$:

$score(q,d) = \frac{q \cdot d}{|q| |d|}$

Since $|q|$ is a fixed constant, we can ignore it because it will not affect the ranking of document scores. Then using our simplifying assumption of $q$ being a vector of binary weights, we have the following document score function:

$score(q,d) = \frac{\sum_{t\in q} TFIDF(t,d)}{\sqrt{\sum_{t\in d} TFIDF(t,d)^2}}$

where the square root term in the denominator is the norm of the document TFIDF vector. So the score for each document is just the TFIDF weights for the query terms which also appear in that document, normalized by the norm of that documents TFIDF vector.

Now each word in the query will not occur in all documents, so we need to only consider documents that actually contain these query words instead of iterating over all documents in the collection. We can maintain an `inverted index` data structure which is a dictionary that maps each unique word to a list of tuples, each tuple containing a document and the TFIDF weight.

e.g. `inverted_index = {'w1' : [(d1, TFIDF(w1,d1)), (d2, TFIDF(w1,d2),..)], 'w2': ...}`

This data structure will allow us to compute and rank the document scores very efficiently.

This simple information retrieval algorithm works well in most situation. However, note that this algorithm will only work if there is an exact match between a subset of words from the query and the relevant documents, otherwise the cosine similarity between the query and document TFIDF vector will be zero. For this types of situations, it is better to use some kind of dense vector representation for documents and queries which don't rely on raw word counts. 

In [1]:
from nltk.tokenize import RegexpTokenizer
from collections import defaultdict
from unidecode import unidecode
import math
from tqdm import tqdm

tokenizer = RegexpTokenizer(r'\w+')

In [32]:
# test documents
test_documents = ["To be brief, I write for various reasons.", 
                  "I will confess that I have a fancy to be numbered among their honourable company.",  
                  "Sir Henry Curtis, as everybody acquainted with him knows, is one of the most hospitable men on earth", 
                  "Everybody turned and stared politely at the curious-looking little lame man, and though his size was insignificant, he was quite worth staring at.",
                  "Once it was a dense forest, now it's open level country cultivated here and there, but for the most part barren.",
                  "Christian, the number of casualties from sickness has been very small indeed, and this although they frequently sleep in the trenches of newly-turned earth at all seasons of the year."]

In [2]:
# tokenize sentence string into words, punctutaions removed
def tokenize(sent):
    # Replace accented letters with regular letters
    sent = unidecode(sent)
    # Tokenize
    return tokenizer.tokenize(sent.lower())

In [3]:
from tqdm import tqdm
class IR_System():
    def __init__(self, documents, BM_25=False, k = 1.25, b = 0.75):
        self.documents = documents
        self.BM_25 = BM_25
        self.k = k
        self.b = b
        self.TFIDF, self.inverted_index, self.doc_tfidf_norms = self.create_inverted_index()
        
    def create_inverted_index(self):
        N = len(self.documents)
        TFIDF = defaultdict(float)
        inverted_index = defaultdict(list)

        # compute term frequency and document frequencies
        if self.BM_25:
            TF, term_docs = self.compute_TF_weighted()
        else:    
            TF, term_docs = self.compute_TF()

        # create inverted index
        print(f"Computing TFIDF and creating inverted index...")
        for w, docs in tqdm(term_docs.items(), total=len(term_docs)):
            for d in sorted(list(docs)):
                tfidf = TF[(w,d)] * math.log10(N/len(docs))
                inverted_index[w].append(d)
                TFIDF[(w,d)] = tfidf

        # compute document TFIDF vector norms
        print(f"Computing TFIDF vector norms...")
        doc_tfidf_norms = [0] * N
        for d, doc in tqdm(enumerate(self.documents), total=len(self.documents)):
            words = tokenize(doc)
            for w in words:
                doc_tfidf_norms[d] = doc_tfidf_norms[d] +  TFIDF[(w,d)]**2
            doc_tfidf_norms[d] = math.sqrt(doc_tfidf_norms[d])

        return TFIDF, inverted_index, doc_tfidf_norms  
      
    # regular TF
    def compute_TF(self):
        TF = defaultdict(int)
        term_docs = defaultdict(set)
        print(f"Computing TFIDF...")
        for d, doc in tqdm(enumerate(self.documents), total=len(self.documents)):
            words = tokenize(doc)
            for w in words:
                TF[(w, d)] += 1
                term_docs[w].add(d)
        # apply log
        for (w,d), tf in TF.items():
            TF[(w,d)] = 1 + math.log10(tf) 
        return TF, term_docs
    
    # weighted TF for BM25
    def compute_TF_weighted(self):
        TF = defaultdict(int)
        term_docs = defaultdict(set)
        doc_length = defaultdict(float)
        Dtotal = 0
        print(f"Computing TFIDF...")
        for d, doc in tqdm(enumerate(self.documents), total=len(self.documents)):
            words = tokenize(doc)
            for w in words:
                TF[(w, d)] += 1
                term_docs[w].add(d)
            doc_length[d] = len(words)
            Dtotal += len(words)
        Davg = Dtotal / len(self.documents)

        # compute BM25 weighted term frequencies
        TF_weighted = defaultdict(float)
        for (w,d), tf in TF.items():
            TF_weighted[(w,d)] = (tf * (self.k + 1)) / (tf + self.k * (1 - self.b + self.b * (doc_length[d]/Davg)))
        return TF_weighted, term_docs
    

    def retrieve_docs(self, query, topk=1):
        query_words = tokenizer.tokenize(query.lower())
        #print(f"query words: {query_words}")
        # get all documents which contain words from query
        docs = []
        for w in query_words:
            docs.extend([d for d in self.inverted_index[w]])
        #print(f"docs: ")    
        # score all these documents
        scores = defaultdict(float)
        for d in docs:
            for w in query_words:
                scores[d] += self.TFIDF[(w,d)]
            scores[d] = scores[d] / self.doc_tfidf_norms[d]        
        #print(f"scores: {scores}")    
        # return topk documents
        sorted_scores = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        #print(f"sorted scores: {sorted_scores}")
        best = sorted_scores[:topk]
        topk_docs = []
        for doc, score in best:
            topk_docs.append((self.documents[doc], score))
        return topk_docs


In [177]:
IR = IR_System(test_documents, BM_25=True)
#for w, docs in IR.inverted_index.items():
#    print(f"{w}: {[(d, IR.TFIDF[(w,d)]) for d in docs]}")  
print(IR.inverted_index)      
print(IR.doc_tfidf_norms)      

Computing TFIDF...


100%|██████████| 6/6 [00:00<00:00, 35696.20it/s]


Computing TFIDF and creating inverted index...


100%|██████████| 92/92 [00:00<00:00, 660746.52it/s]


Computing TFIDF vector norms...


100%|██████████| 6/6 [00:00<00:00, 57985.77it/s]

defaultdict(<class 'list'>, {'to': [0, 1], 'be': [0, 1], 'brief': [0], 'i': [0, 1], 'write': [0], 'for': [0, 4], 'various': [0], 'reasons': [0], 'will': [1], 'confess': [1], 'that': [1], 'have': [1], 'a': [1, 4], 'fancy': [1], 'numbered': [1], 'among': [1], 'their': [1], 'honourable': [1], 'company': [1], 'sir': [2], 'henry': [2], 'curtis': [2], 'as': [2], 'everybody': [2, 3], 'acquainted': [2], 'with': [2], 'him': [2], 'knows': [2], 'is': [2], 'one': [2], 'of': [2, 5], 'the': [2, 3, 4, 5], 'most': [2, 4], 'hospitable': [2], 'men': [2], 'on': [2], 'earth': [2, 5], 'turned': [3, 5], 'and': [3, 4, 5], 'stared': [3], 'politely': [3], 'at': [3, 5], 'curious': [3], 'looking': [3], 'little': [3], 'lame': [3], 'man': [3], 'though': [3], 'his': [3], 'size': [3], 'was': [3, 4], 'insignificant': [3], 'he': [3], 'quite': [3], 'worth': [3], 'staring': [3], 'once': [4], 'it': [4], 'dense': [4], 'forest': [4], 'now': [4], 's': [4], 'open': [4], 'level': [4], 'country': [4], 'cultivated': [4], 'here'




In [178]:
IR.retrieve_docs(query="open country fancy", topk=2)

[("Once it was a dense forest, now it's open level country cultivated here and there, but for the most part barren.",
  0.5806318569148247),
 ('I will confess that I have a fancy to be numbered among their honourable company.',
  0.28126663765766186)]

#### Now we will train our IR system on the SQuAD v1.0 dataset.

In [4]:
import requests
import json

"""
squad_train_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json"  
squad_dev_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json" 

# Cache the train JSON document to a file
response_train = requests.get(squad_train_url)
with open("train.json", "w") as train_file:
    json.dump(response_train.json(), train_file)

# Cache the dev JSON document to a file
response_dev = requests.get(squad_dev_url)
with open("dev.json", "w") as dev_file:
    json.dump(response_dev.json(), dev_file)
"""

# load the train and dev JSON documents
with open("train.json", "r") as train_file:
    squad_train = json.load(train_file)         
with open("dev.json", "r") as dev_file:
    squad_dev = json.load(dev_file)         

#### SQuAD v1.0 data json file structure:

    data: This is a list where each element represents a topic. Each topic is a dictionary with the following keys:
    
        title: The title of the topic (usually the title of the Wikipedia article).
    
        paragraphs: This is a list where each element represents a context passage from the topic. Each passage is a dictionary with the following keys:
    
            context: The context passage text.
        
            qas: This is a list where each element represents a question and its answer(s). Each question-answer pair is a dictionary with the following keys:
        
                answers: This is a list where each element represents an answer to the question. Each answer is a dictionary with the following keys:
        
                    answer_start: The character index in the context passage where the answer starts.
        
                    text: The text of the answer.
        
                question: The question text.
        
                id: A unique identifier for the question.

    version: The version of the SQuAD dataset.

In [84]:
squad_train['data'][0]['paragraphs'][0]

{'qas': [{'question': 'When did Beyonce start becoming popular?',
   'id': '56be85543aeaaa14008c9063',
   'answers': [{'text': 'in the late 1990s', 'answer_start': 269}],
   'is_impossible': False},
  {'question': 'What areas did Beyonce compete in when she was growing up?',
   'id': '56be85543aeaaa14008c9065',
   'answers': [{'text': 'singing and dancing', 'answer_start': 207}],
   'is_impossible': False},
  {'question': "When did Beyonce leave Destiny's Child and become a solo singer?",
   'id': '56be85543aeaaa14008c9066',
   'answers': [{'text': '2003', 'answer_start': 526}],
   'is_impossible': False},
  {'question': 'In what city and state did Beyonce  grow up? ',
   'id': '56bf6b0f3aeaaa14008c9601',
   'answers': [{'text': 'Houston, Texas', 'answer_start': 166}],
   'is_impossible': False},
  {'question': 'In which decade did Beyonce become famous?',
   'id': '56bf6b0f3aeaaa14008c9602',
   'answers': [{'text': 'late 1990s', 'answer_start': 276}],
   'is_impossible': False},
  {'q

#### Get some passages for out training data. This is a large dataset, so we won't use every single passage. We will only draw from a subset of titles.

Note that we also append the title to each passage and question. This inevitably makes the retrieval task eaier, but some questions in the dataset sometimes refer to persons using their pronouns. Without the actual name of the entity present in the query can be problematic, so we simplify things a bit by appending the title to each question.

In [5]:
print(f"Total number of titles: {len(squad_train['data'])}")

def get_passages(squad, num_titles=None):
    if num_titles is None:
        num_titles = len(squad['data'])
    # for each title, get passages and all corresponding questions from SQuAD train set
    passages = []
    questions = []
    for i in range(num_titles):
        print(f"Title# {i}: {squad['data'][i]['title']}, Number of passages: {len(squad['data'][i]['paragraphs'])}")
        for p in squad['data'][i]['paragraphs']:
            # we will append the title to each passage and question
            passages.append(squad['data'][i]['title'] + ": " + p['context'])
            qs = []
            for q in p['qas']:
                q_appended = q
                q_appended['question'] = squad['data'][i]['title'] + ": " + q['question']
                qs.append(q_appended)
            questions.append(qs)       

    print(f"Number of passages: {len(passages)}")
    return passages, questions

passages, questions = get_passages(squad_train, num_titles=200)

Total number of titles: 442
Title# 0: Beyoncé, Number of passages: 66
Title# 1: Frédéric_Chopin, Number of passages: 82
Title# 2: Sino-Tibetan_relations_during_the_Ming_dynasty, Number of passages: 72
Title# 3: IPod, Number of passages: 60
Title# 4: The_Legend_of_Zelda:_Twilight_Princess, Number of passages: 32
Title# 5: Spectre_(2015_film), Number of passages: 43
Title# 6: 2008_Sichuan_earthquake, Number of passages: 77
Title# 7: New_York_City, Number of passages: 148
Title# 8: To_Kill_a_Mockingbird, Number of passages: 62
Title# 9: Solar_energy, Number of passages: 52
Title# 10: Kanye_West, Number of passages: 79
Title# 11: Buddhism, Number of passages: 149
Title# 12: American_Idol, Number of passages: 127
Title# 13: Dog, Number of passages: 75
Title# 14: 2008_Summer_Olympics_torch_relay, Number of passages: 74
Title# 15: Genome, Number of passages: 25
Title# 16: Comprehensive_school, Number of passages: 25
Title# 17: Republic_of_the_Congo, Number of passages: 39
Title# 18: Prime_min

In [185]:
# lets train our IR system on the SqUAD train set passages
IR = IR_System(passages, BM_25=True)

Computing TFIDF...


100%|██████████| 9074/9074 [00:00<00:00, 13771.43it/s]


Computing TFIDF and creating inverted index...


100%|██████████| 49432/49432 [00:00<00:00, 61837.90it/s]


Computing TFIDF vector norms...


100%|██████████| 9074/9074 [00:00<00:00, 13052.31it/s]


In [187]:
# now let's get a query to test out our IR system
passage_idx = 225 
query = questions[passage_idx][0]['question']
query = "".join(query.split(":")[1:])
print(f"QUERY: {query}, GOLD DOCUMENT: {passages[passage_idx]}")

best_docs = IR.retrieve_docs(query, topk=5)
for doc, score in best_docs:
    print(f"SCORE: {score}, DOCUMENT: {doc}")   


QUERY:  In what year was the iPod first introduced?, GOLD DOCUMENT: IPod: Though the iPod was released in 2001, its price and Mac-only compatibility caused sales to be relatively slow until 2004. The iPod line came from Apple's "digital hub" category, when the company began creating software for the growing market of personal digital devices. Digital cameras, camcorders and organizers had well-established mainstream markets, but the company found existing digital music players "big and clunky or small and useless" with user interfaces that were "unbelievably awful," so Apple decided to develop its own. As ordered by CEO Steve Jobs, Apple's hardware engineering chief Jon Rubinstein assembled a team of engineers to design the iPod line, including hardware engineers Tony Fadell and Michael Dhuey, and design engineer Sir Jonathan Ive. Rubinstein had already discovered the Toshiba disk drive when meeting with an Apple supplier in Japan, and purchased the rights to it for Apple, and had also

#### Let's measure the Top-5 accuracy of our trained IR system across all passages used in training and all corresponding questions

In [188]:
def compute_accuracy(IR, passages, questions, topk=5, keeptitle=True, num_passages=1000):
    num_correct = 0
    num_total = 0
    num_passages = min(num_passages, len(passages))
    with tqdm(total=num_passages) as pbar:
        for i in range(num_passages):
            for q in questions[i]:
                # only use questions which have an answer
                if not q['is_impossible']:
                    query = q['question']
                    if not keeptitle:
                        query = " ".join(query.split(":")[1:])
                    best_docs = IR.retrieve_docs(query, topk=topk)
                    for doc, score in best_docs:
                        if doc == passages[i]:
                            num_correct += 1
                            break
                    num_total += 1
                    accuracy = num_correct / num_total
                    pbar.set_postfix(accuracy=accuracy)
            pbar.update(1)
    return accuracy

In [154]:
accuracy = compute_accuracy(IR, passages, questions)
print(f"IR system accuracy: {accuracy}")

100%|██████████| 1000/1000 [09:19<00:00,  1.79it/s, accuracy=0.781]

IR system accuracy: 0.7813238770685579





Accuracy with regular TFIDF on first 1000 passages is about 78%. Now let's try BM25. 

In [189]:
accuracy = compute_accuracy(IR, passages, questions)
print(f"IR system accuracy: {accuracy}")

100%|██████████| 1000/1000 [10:11<00:00,  1.63it/s, accuracy=0.792]

IR system accuracy: 0.7922998986828774





#### We get a slight increase in accuracy using BM25!

In [6]:
# now let's train on all passages
passages, questions = get_passages(squad_train)

Title# 0: Beyoncé, Number of passages: 66
Title# 1: Frédéric_Chopin, Number of passages: 82
Title# 2: Sino-Tibetan_relations_during_the_Ming_dynasty, Number of passages: 72
Title# 3: IPod, Number of passages: 60
Title# 4: The_Legend_of_Zelda:_Twilight_Princess, Number of passages: 32
Title# 5: Spectre_(2015_film), Number of passages: 43
Title# 6: 2008_Sichuan_earthquake, Number of passages: 77
Title# 7: New_York_City, Number of passages: 148
Title# 8: To_Kill_a_Mockingbird, Number of passages: 62
Title# 9: Solar_energy, Number of passages: 52
Title# 10: Kanye_West, Number of passages: 79
Title# 11: Buddhism, Number of passages: 149
Title# 12: American_Idol, Number of passages: 127
Title# 13: Dog, Number of passages: 75
Title# 14: 2008_Summer_Olympics_torch_relay, Number of passages: 74
Title# 15: Genome, Number of passages: 25
Title# 16: Comprehensive_school, Number of passages: 25
Title# 17: Republic_of_the_Congo, Number of passages: 39
Title# 18: Prime_minister, Number of passages: 3

In [7]:
IR = IR_System(passages, BM_25=True)

Computing TFIDF...


100%|██████████| 19035/19035 [00:01<00:00, 10207.41it/s]


Computing TFIDF and creating inverted index...


100%|██████████| 78134/78134 [00:01<00:00, 41137.69it/s]


Computing TFIDF vector norms...


100%|██████████| 19035/19035 [00:01<00:00, 12824.05it/s]


In [8]:
# testing with a query from outside the training set
query = "What is the fundamental teachings of the Buddha?"
print(f"QUERY: {query}")

best_docs = IR.retrieve_docs(query, topk=10)
for doc, score in best_docs:
    print(f"SCORE: {score}, DOCUMENT: {doc}")   

QUERY: What is the fundamental teachings of the Buddha?
SCORE: 0.46870225604860843, DOCUMENT: Buddhism: According to the scriptures, Gautama Buddha presented himself as a model. The Dharma offers a refuge by providing guidelines for the alleviation of suffering and the attainment of Nirvana. The Sangha is considered to provide a refuge by preserving the authentic teachings of the Buddha and providing further examples that the truth of the Buddha's teachings is attainable.
SCORE: 0.40412745416462215, DOCUMENT: Buddhism: The Mahayana sutras are a very broad genre of Buddhist scriptures that the Mahayana Buddhist tradition holds are original teachings of the Buddha. Some adherents of Mahayana accept both the early teachings (including in this the Sarvastivada Abhidharma, which was criticized by Nagarjuna and is in fact opposed to early Buddhist thought) and the Mahayana sutras as authentic teachings of Gautama Buddha, and claim they were designed for different types of persons and differe