#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2020


# Homework 1:  Information Retrieval Basics

### 100 points [7% of your final grade]

### Due: January 31 (Friday) by 11:59pm

*Goals of this homework:* In this homework you will get first-hand experience building a text-based mini search engine. In particular, there are three main learning objectives: (i) the basics of tokenization (e.g., stemming and case-folding) and its effect on information retrieval; (ii) the basics of index building and Boolean retrieval; and (iii) the basics of the Vector Space model and ranked retrieval.

*Submission instructions (eCampus):* To submit your homework, rename this notebook as `UIN_hw1.ipynb`. For example, my homework submission would be something like `555001234_hw1.ipynb`. Submit this notebook via eCampus (look for the homework 1 assignment there). Your notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so). So please run all the cells for us, and then submit.

*Late submission policy:* For this homework, you may use as many late days as you like (up to the 5 total allotted to you).

*Collaboration policy:* You are expected to complete each homework independently. Your solution should be written by you without the direct aid or help of anyone else. However, we believe that collaboration and team work are important for facilitating learning, so we encourage you to discuss problems and general problem approaches (but not actual solutions) with your classmates. You may post on Piazza, search StackOverflow, etc. But if you do get help in this way, you must inform us by **filling out the Collaboration Declarations at the bottom of this notebook**. 

*Example: I found helpful code on stackoverflow at https://stackoverflow.com/questions/11764539/writing-fizzbuzz that helped me solve Problem 2.*

The basic rule is that no student should explicitly share a solution with another student (and thereby circumvent the basic learning process), but it is okay to share general approaches, directions, and so on. If you feel like you have an issue that needs clarification, feel free to contact either me or the TA.

## Dataset

The dataset is collected from Quizlet (https://quizlet.com), a website where users can generate their own flashcards. Each flashcard consists of an entity on the front and a definition describing or explaining that entity on the back. We treat the entity on each flashcard's front as a query and the definition on its back as a document. A definition (document) is relevant to an entity (query) if it comes from the back of that entity's flashcard; otherwise it is not relevant. **In this homework, "query" and "entity" are interchangeable, as are "document" and "definition".**

The format of the dataset is like this:

**query \t document id \t document**

Examples:

decision tree	\t 27946 \t	show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers).

where "decision tree" is the entity in the front of a flashcard and "show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers)." is the definition on the flashcard's back and "27946" is the id of the definition. Naturally, this document is relevant to the query.

false positive rate	\t 686	\t fall-out; probability of a false alarm

where document 686 is not relevant to query "decision tree" because the entity of "fall-out; probability of a false alarm" is "false positive rate".
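
As a minimal sketch of reading this format, the snippet below splits one line into its three tab-separated fields. The helper name `parse_line` and the sample string are illustrative assumptions, not part of the assignment.

```python
# Hypothetical helper: split one dataset line into (query, document id, document).
def parse_line(line):
    # maxsplit=2 keeps any further tab characters inside the document text
    query, doc_id, document = line.rstrip("\n").split("\t", 2)
    return query.strip(), int(doc_id), document.strip()

# Sample line modeled on the first example above (shortened).
sample = "decision tree\t27946\tshow complex processes with multiple decision rules."
record = parse_line(sample)
print(record)
```

Converting the id to `int` is a design choice that makes later lookups and comparisons cheaper than string matching.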

# Part 1: Parsing (20 points)

First, you should tokenize documents (definitions) using **whitespace and punctuation as delimiters**. Your parser also needs to provide the following three pre-processing options:
* Remove stop words: use the nltk stop words list (`from nltk.corpus import stopwords`)
* Stemming: use the [nltk Porter stemmer](http://www.nltk.org/api/nltk.stem.html#module-nltk.stem.porter)
* Remove any other strings that you think are less informative or noisy.

Please note that you should stick to the stemming package listed above. Otherwise, given the same query, the results generated by your code may differ from those of others.
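
The tokenization step with its three switches can be sketched roughly as follows. This is only an illustration under stated assumptions: `tokenize`, `TOY_STOPWORDS`, and the `stem` callback are hypothetical names, the stop list is a tiny stand-in for the nltk list, and only ASCII letters and digits are treated as token characters. In the real parser you would pass nltk's stop words and the Porter stemmer's `stem` method.

```python
import re

# Tiny illustrative stop list; the assignment asks for nltk's English list instead.
TOY_STOPWORDS = {"a", "the", "of", "with", "as"}

def tokenize(text, stopwords=None, stem=None, drop_numeric=False):
    # case-fold, then split on any run of characters that is not a letter or digit
    tokens = [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]
    if stopwords is not None:
        # filter before stemming so tokens still match the unstemmed stop list
        tokens = [t for t in tokens if t not in stopwords]
    if drop_numeric:
        # one possible "other noise" rule: drop tokens made only of digits
        tokens = [t for t in tokens if not t.isdigit()]
    if stem is not None:
        tokens = [stem(t) for t in tokens]
    return tokens

tokens = tokenize("Display decision logic (if statements) as a set of nodes.",
                  stopwords=TOY_STOPWORDS)
print(tokens)
```

Passing the stop list and stemmer as arguments keeps the sketch independent of any installed corpus and makes each option easy to switch on and off when counting dictionary sizes.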

In [1]:
# Import necessary packages

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import time
import math

import numpy as np
from numpy import dot
from numpy.linalg import norm

from collections import Counter

In [2]:
# configuration options
remove_stopwords = True
use_stemming = True
remove_otherNoise = True

In [3]:
# Initialize required data structures

# documents indexed by their ids
# each value is a list with 2 fields: [0] -> definition, [1] -> entity
documents = {}

# document tokens indexed by their ids
# to avoid redundant processing
doc_tokens = {}

# total number of tokens over all documents (used for the average document length)
Total_document_length = 0
# inverted index for our corpus
inverted_index = {}

# word frequency for ranking the documents
# 0 -> term frequency in the corpus
word_frequency = {}

# prepare a vocabulary index for the column id of individual words
vocab_id_table = {}
vocab_id = 0


In [4]:
# Your parser function here. It will take the three option variables above as the parameters
start = time.time()

remove_punct = '!"#$&\'()*+,./;:<=>?@[\\]^`{|}~%_-'
table = str.maketrans(remove_punct,' '*len(remove_punct), '' )

# Stem the words using Porter's algorithm
if use_stemming == True:
    myStemmer = PorterStemmer()

with open('homework_1_data.txt', 'r', encoding="utf8") as f:
    lines = f.readlines()
    for line in lines:
        doc = line.split('\t')
        curr_doc_id = int(doc[1])

        # ids converted to integer for faster matching
        documents[curr_doc_id] = [doc[2],doc[0]]

        doc_without_punct = doc[2].translate(table)

        words = doc_without_punct.split()
        Total_document_length += len(words)

        # normalize each token of this document in place
        for i in range(len(words)):

            # handling leading and trailing hyphens
            words[i] = words[i].strip("-")

            # Case Folding
            words[i] = words[i].lower()            

            if use_stemming == True:
                words[i] = myStemmer.stem(words[i])

            if remove_otherNoise == True:
                remove_digits = '0123456789'
                digit_table = str.maketrans('','',remove_digits)
                words[i] = words[i].translate(digit_table )

            # is this word in our dictionary already?
            if words[i] in inverted_index:

                # add the current document id to that word's document list
                inverted_index[words[i]].add(curr_doc_id)

            # seeing the word for the first time
            else:
                inverted_index[words[i]] = set({curr_doc_id})

                # assign the next free column id to this word
                vocab_id_table[words[i]] = vocab_id
                vocab_id += 1

        # store the tokens for this document
        doc_tokens[curr_doc_id] = words

# remove stopwords directly from the inverted index
if remove_stopwords == True:
    for stopword in stopwords.words('english'):
        if stopword in inverted_index:
            del inverted_index[stopword]
            del vocab_id_table[stopword]


# total number of documents in the corpus
N = len(doc_tokens)

avgdl = Total_document_length/N

print("remove_stopwords = ", remove_stopwords, " use_stemmer = ", use_stemming, " remove noise = ", remove_otherNoise," ==>", len(inverted_index))
end = time.time()
print("Time taken to parse dataset :",end - start)


remove_stopwords =  True  use_stemmer =  True  remove noise =  True  ==> 9741
Time taken to parse dataset : 8.570852756500244


### Observations

Once you have your parser working, you should report here the size of your dictionary under the four cases. That is, how many unique tokens do you have with stemming on and casefolding on? And so on. You should fill in the following

* No pre-processing options = 16276
* Remove stop words = 16136
* Remove stop words + stemming = 10242
* Remove stop words + stemming + remove other noise = 9741

# Part 2: Boolean Retrieval (30 points)

In this part you build an inverted index to support Boolean retrieval. We only require your index to support AND queries. In other words, your index does not have to support OR, NOT, or parentheses. Also, we do not explicitly expect to see AND in queries, e.g., when we query **relational model**, your search engine should treat it as **relational** AND **model**.
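
A minimal sketch of the idea, assuming documents have already been tokenized: map each term to the set of document ids containing it, then answer an AND query by intersecting those sets. The names `build_index`, `boolean_and`, and the toy documents are illustrative assumptions.

```python
from collections import defaultdict

def build_index(doc_tokens):
    # term -> set of ids of the documents that contain the term
    index = defaultdict(set)
    for doc_id, tokens in doc_tokens.items():
        for token in tokens:
            index[token].add(doc_id)
    return index

def boolean_and(index, query_tokens):
    if not query_tokens:
        return set()
    # .get avoids creating entries for unknown terms; start from the rarest term
    postings = sorted((index.get(t, set()) for t in query_tokens), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result

# Toy corpus of already-stemmed tokens (hypothetical ids and content).
toy_docs = {
    1: ["relat", "databas", "tabl"],
    2: ["relat", "model"],
    3: ["databas", "manag", "system"],
}
toy_index = build_index(toy_docs)
matches = boolean_and(toy_index, ["relat", "databas"])
print(matches)
```

Intersecting from the shortest posting set first keeps intermediate results small, and a term that is missing from the index simply produces an empty result rather than an error.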

Search for the queries below using your index and print out matching documents (for each query, print out 5 matching documents):
* relational database
* garbage collection
* retrieval model

Please use the following format to present your results:
* query: relational database
* result 1:
* entity: database management system
* definition id: 656
* definition: software system used to manage databases
* result 2:
* ......
* query: garbage collection
* ......
* query: retrieval model
* ......

In [5]:
## query processing and setup

queries = ['relational database','garbage collection','retrieval model' ]

query_tokens = []

remove_punct = '!"#$&\'()*+,./;:<=>?@[\\]^`{|}~%_-'
q_table = str.maketrans(remove_punct,' '*len(remove_punct), '' )

# Stem the words using Porter's algorithm
if use_stemming == True:
    myStemmer = PorterStemmer()

for query in queries:
    # strip punctuation from the query,
    # in line with our parsing function
    
    query_without_punctuations = query.translate(q_table)
    query_words = query_without_punctuations.split()

    for j in range(len(query_words)):

        # handling leading and trailing hyphens
        query_words[j] = query_words[j].strip("-")

        # Case Folding
        query_words[j] = query_words[j].lower()

        if use_stemming == True:
            query_words[j] = myStemmer.stem(query_words[j])

        if remove_otherNoise == True:
            remove_digits = '0123456789'
            q_digit_table = str.maketrans('','',remove_digits)
            query_words[j] = query_words[j].translate(q_digit_table )

    # remove stopwords separately
    if remove_stopwords == True:
        for k in reversed(range(len(query_words))):
            if query_words[k] in stopwords.words('english'):
                del query_words[k]
        
    # add the tokens for reference
    query_tokens.append(query_words)

In [6]:
print(queries)
print(query_tokens)

['relational database', 'garbage collection', 'retrieval model']
[['relat', 'databas'], ['garbag', 'collect'], ['retriev', 'model']]


In [7]:
#
# Boolean retrieval
#

print("~~~ Part 2: Boolean Retrieval ~~~")
print("")

boolean_results_query = []
for query_id in range(len(queries)):

    # AND semantics: intersect the posting sets of all query terms;
    # an empty query or an unknown term yields an empty result
    res_Boolean = set()
    if query_tokens[query_id]:
        res_Boolean = inverted_index.get(query_tokens[query_id][0], set())
        for word in query_tokens[query_id][1:]:
            res_Boolean = res_Boolean.intersection(inverted_index.get(word, set()))

    boolean_results_query.append(res_Boolean)
    # Print Results
    print("query: ",queries[query_id], "\n")
    print("Total Number of retrievals: ", len(res_Boolean), "\n")
    # print at most 5 matches (there may be fewer)
    for r, doc_id in enumerate(list(res_Boolean)[:5]):
        print("result ",r+1,":")
        print("entity: ",documents[doc_id][1])
        print("definition id: ",doc_id)
        print("definition: ",documents[doc_id][0])


~~~ Part 2: Boolean Retrieval ~~~

query:  relational database 

Total Number of retrievals:  237 

result  1 :
entity:  relational database
definition id:  28160
definition:  a group of related databases associated by a key, or a common identifying (qualitative) characteristic.

result  2 :
entity:  relational databases
definition id:  5121
definition:  - relational databases store data in 2-dimensional tables - the tables establish connections between the different entities we want to model - the &"relational&" in relational database (rdb) comes from the concept of a relation in set theory.

result  3 :
entity:  relational databases
definition id:  5122
definition:  relational databases use two or more tables linked together (to form a relationship). relational databases do not store all the data in the same table. repeated data is moved into it's own table as shown in the image below:

result  4 :
entity:  relational database
definition id:  28163
definition:  a relational datab

### Observations
Could your boolean search engine find relevant documents for these queries? What is the impact of the three pre-processing options? Do they improve your search quality?

Answer:
Yes. With the pre-processing options enabled, the Boolean search engine retrieved relevant documents for all three queries.
With all pre-processing options disabled, it failed to retrieve documents for the query 'retrieval model'.
This suggests that pre-processing (stemming, stop word and noise removal, punctuation handling, etc.) does improve retrieval quality, most likely because stemming lets different surface forms of a word match the same index entry.

# Part 3: Ranking Documents (50 points) 

In this part, your job is to rank the documents that have been retrieved by the Boolean Retrieval component in Part 2, according to their relevance with each query.

### A: Ranking with simple sums of TF-IDF scores (15 points) 
For a multi-word query, we rank documents by a simple sum of the TF-IDF scores for the query terms in the document.
TF is the log-weighted term frequency $1+\log(tf)$, and IDF is the log-weighted inverse document frequency $\log(\frac{N}{df})$.
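
As a small worked example of this scoring rule, the sketch below sums $(1+\log(tf)) \cdot \log(\frac{N}{df})$ over the query terms that occur in a document. The function name `tfidf_sum`, the toy counts, and the choice of base-10 logarithms are assumptions made for illustration.

```python
import math

def tfidf_sum(query_terms, doc_term_counts, df, n_docs):
    # sum of (1 + log10(tf)) * log10(N / df) over query terms present in the document
    score = 0.0
    for term in set(query_terms):
        tf = doc_term_counts.get(term, 0)
        if tf == 0 or df.get(term, 0) == 0:
            continue  # a term that is absent contributes nothing
        score += (1 + math.log10(tf)) * math.log10(n_docs / df[term])
    return score

# Toy numbers: N = 1000 documents; "relat" occurs once in the document and in
# 10 documents overall, "databas" occurs 10 times and in 100 documents overall.
score = tfidf_sum(["relat", "databas"],
                  {"relat": 1, "databas": 10},
                  {"relat": 10, "databas": 100},
                  1000)
print(score)
```

Here the rare term contributes $1 \cdot 2 = 2$ and the frequent term contributes $2 \cdot 1 = 2$, which shows how a high term frequency can offset a low IDF.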

**Output:**
For each given query in Part 2, you should just rank the documents retrieved by your boolean search. You only need to output the top-5 results plus the TF-IDF sum score of each of these documents. Please use the following format to present your results:

* query: relational database
* result 1:
* score: 0.1
* entity: database management system
* definition id: 656
* definition: software system used to manage databases
* result 2:
* ......
* query: garbage collection
* ......
* query: retrieval model
* ......

In [8]:
#
## prepare vector space for documents
#

doc_vectors = np.zeros((N,vocab_id))
IDF = {}
IDF_BM25 = {}

for word,word_id in vocab_id_table.items():
    nq = len(inverted_index[word])
    IDF[word] = math.log10(N/nq)
    IDF_BM25[word] = math.log10((N - nq + 0.5)/(nq + 0.5))

for this_doc_id,this_doc_tokens in doc_tokens.items():

    # count the number of occurrences of each word
    term_freq = Counter(this_doc_tokens)
    # for each word and its count
    for term,count in term_freq.items():
        if remove_stopwords == True and term in stopwords.words('english'):
            continue
        tf = 1 + math.log10(count)
        # documents are rows and words are columns
        doc_vectors[this_doc_id][vocab_id_table[term]] += (tf * IDF[term])


In [9]:
# your code here
# hint: you could first call boolean retrieval function in part 2 to find possible relevant documents, 
# and then rank these documents in this part. Hence, you don't need to rank all documents.

print("~~~ Part 3A: Ranking with simple sums of TF-IDF scores ~~~")
print("")

for query_id in range(len(queries)):

    scores = []
    for match_doc_id in boolean_results_query[query_id]:
        doc_score = 0
        
        for keyword in set(query_tokens[query_id]):

            # count the number of occurrence of this keyword in document
            t_fd = doc_tokens[match_doc_id].count(keyword)
            
            tf = 1 + math.log10(t_fd)

            # sum over all terms in query AND document
            doc_score += (tf * IDF[keyword])
            
        # append the final document score for comparison
        scores.append((doc_score,match_doc_id))


    # Print Results
    print("query: ",queries[query_id], "\n")

    r = 0 # top results to print
    # replace with quickselect for larger corpora
    for score,doc_id in sorted(scores,reverse=True)[:5]:
        r += 1 
        print("result ",r,":")
        print("score: ", score)
        print("entity: ", documents[doc_id][1])
        print("definition id: ", doc_id )
        print("definition: ", documents[doc_id][0])

~~~ Part 3A: Ranking with simple sums of TF-IDF scores ~~~

query:  relational database 

result  1 :
score:  4.718083875753986
entity:  relational algebra
definition id:  7156
definition:  - a theoretical language with operations that work on one or more relations to define another relation without changing the original relation(s)  - relation-at-a-time (or set) language in which all tuples, possibly from several relations, are manipulated in one statement without looping  relational algebra, first created by edgar f. codd while at ibm, is a family of algebras with a well-founded semantics used for modelling the data stored in relational databases, and defining queries on it.  the main application of relational algebra is providing a theoretical foundation for relational databases, particularly query languages for such databases, chief among which is sql.

result  2 :
score:  4.358466421269945
entity:  relational database
definition id:  28378
definition:  a type of database system wh

### B: Ranking with vector space model with TF-IDF (15 points) 

**Cosine:** You should use cosine as your scoring function. 

**TFIDF:** For the document vectors, use the standard TF-IDF scores as introduced in A. For the query vector, use simple weights (the raw term frequency). For example:
* query: troll $\rightarrow$ (1)
* query: troll trace $\rightarrow$ (1, 1)

**Output:**
For each given query in Part 2, you should just rank the documents retrieved by your boolean search. You only need to output the top-5 documents plus the cosine score of each of these documents. Please use the following format to present your results:

* query: relational database
* result 1:
* score: 0.1
* entity: database management system
* definition id: 656
* definition: software system used to manage databases
* result 2:
* ......
* query: garbage collection
* ......
* query: retrieval model
* ......

You can additionally assume that your queries will contain at most three words. Be sure to normalize your vectors as part of the cosine calculation!
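
The cosine computation can be sketched in plain Python as below. The function name `cosine`, the three-word vocabulary, and the document weights are hypothetical; in the notebook the document entries would be TF-IDF weights.

```python
import math

def cosine(u, v):
    # dot product divided by the product of the Euclidean norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # define the similarity with an all-zero vector as 0
    return dot / (norm_u * norm_v)

# Assumed vocabulary order: ["troll", "trace", "bridge"]
query_vec = [1, 1, 0]        # raw term frequencies for the query "troll trace"
doc_vec = [2.0, 0.0, 1.0]    # made-up document weights
sim = cosine(query_vec, doc_vec)
print(sim)
```

Guarding against a zero norm avoids a division by zero for documents or queries whose terms all fall outside the vocabulary.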

In [10]:
#
## Vector Space Model
#

for query_id in range(len(queries)):

    # generating the query vector
    # count the number of occurrences of each word
    term_freq = Counter(query_tokens[query_id])
    query_vector = np.zeros(vocab_id)

    # for each word and its count
    for term,count in term_freq.items():
        if remove_stopwords == True and term in stopwords.words('english'):
            continue
        tf = count
        # weight the raw query term frequency by IDF at the term's column
        query_vector[vocab_id_table[term]] += (tf * IDF[term])

    # cosine similarity calculation for each document
    cos_similarity = np.zeros(N)

    # TODO - optimize this function. Use numpy only.
    for idx in boolean_results_query[query_id]:
        cos_similarity[idx] = dot(doc_vectors[idx], query_vector)/(norm(doc_vectors[idx])*norm(query_vector))

    # gets the top 5 in linear time
    top5_index = np.argpartition(cos_similarity, -5)[-5:]

    top5_vsm = []
    for index in top5_index :
        top5_vsm.append((cos_similarity[index ],index ))

    # Print Results
    print("~~~ Part 3B: Ranking with vector space model with TF-IDF ~~~")
    print("query: ",queries[query_id], "\n")

    r = 0 # top results to print
    for score,doc_id in sorted(top5_vsm,reverse=True)[:5]:
        r += 1 
        print("result ",r,":")
        print("score: ", score)
        print("entity: ", documents[doc_id][1])
        print("definition id: ", doc_id )
        print("definition: ", documents[doc_id][0])    

~~~ Part 3B: Ranking with vector space model with TF-IDF ~~~
query:  relational database 

result  1 :
score:  0.7537315690025447
entity:  relational database
definition id:  28227
definition:  a database using the relational data model.

result  2 :
score:  0.6712815370258216
entity:  relational database
definition id:  28210
definition:  a collection of related database tables

result  3 :
score:  0.6712815370258216
entity:  relational model
definition id:  771
definition:  a database is a collection of relations or tables.

result  4 :
score:  0.6591374090355712
entity:  relational database
definition id:  28134
definition:  a database including tables that are related to each other

result  5 :
score:  0.6005458659464757
entity:  relational database
definition id:  28205
definition:  a database built using the relational database model

~~~ Part 3B: Ranking with vector space model with TF-IDF ~~~
query:  garbage collection 

result  1 :
score:  0.7925773097508486
entity:  garbage c

### C: Ranking with BM25 (20 points) 
Finally, let's try the BM25 approach for ranking. Refer to https://en.wikipedia.org/wiki/Okapi_BM25 for the specific formula. You could choose k_1 = 1.2 and b = 0.75 but feel free to try other options.
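
For reference, a minimal sketch of the per-term BM25 score is given below. It assumes the form with the natural logarithm and `+ 1` inside the IDF, which is one common variant; the function names `bm25_term` and `bm25_score` and all toy numbers are illustrative assumptions.

```python
import math

def bm25_term(f, doc_len, avgdl, n_docs, df, k1=1.2, b=0.75):
    # IDF variant with +1 inside the log, which keeps the value positive
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # length normalization scales k1 inside the denominator
    norm = 1 - b + b * doc_len / avgdl
    return idf * f * (k1 + 1) / (f + k1 * norm)

def bm25_score(query_terms, doc_term_counts, doc_len, avgdl, n_docs, df):
    # sum the per-term scores over distinct query terms known to the index
    return sum(
        bm25_term(doc_term_counts.get(t, 0), doc_len, avgdl, n_docs, df[t])
        for t in set(query_terms) if t in df
    )

# Toy setting: 100 documents, the term appears in 10 of them.
average_doc = bm25_term(f=1, doc_len=10, avgdl=10, n_docs=100, df=10)
short_doc = bm25_term(f=1, doc_len=5, avgdl=10, n_docs=100, df=10)
long_doc = bm25_term(f=1, doc_len=20, avgdl=10, n_docs=100, df=10)
print(short_doc, average_doc, long_doc)
```

With these settings a single occurrence in a document of average length scores exactly the IDF, shorter documents score higher, and longer ones score lower, which is the length normalization at work.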

**Output:**
For each given query in Part 2, you should just rank the documents retrieved by your boolean search. You only need to output the top-5 documents plus the BM25 score of each of these documents. Please use the following format to present your results:

* query: relational database
* result 1:
* score: 0.1
* entity: database management system
* definition id: 656
* definition: software system used to manage databases
* result 2:
* ......
* query: garbage collection
* ......
* query: retrieval model
* ......

In [11]:
#
## BM25
#

print("~~~ Part 3C: Ranking with BM25 ~~~")
print(" ")

b= 0.75
k_1 = 1.2
avgdl = Total_document_length/N

for query_id in range(len(queries)):

    bm25_scores  = []

    for match_doc_id in boolean_results_query[query_id]:
        bm25_doc_score = 0
        D = len(documents[match_doc_id][0].split())

        for keyword in set(query_tokens[query_id]):
            # count the number of occurrence of this keyword in document
            f_qi_D = doc_tokens[match_doc_id].count(keyword)
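            # saturation
            saturation = f_qi_D * ( k_1 + 1 ) / (f_qi_D + k_1)
            # document length normalization (B)
            B = (1 - b) + ( b * D / avgdl)                
            # sum over all terms in query AND document
            # note: B divides the whole saturated term here, a variant of the
            # Wikipedia formula, which has f + k_1 * B in the denominator
            bm25_doc_score += IDF_BM25[keyword] * saturation / B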
        bm25_scores.append((bm25_doc_score,match_doc_id))


    # Print Results
    print("query: ",queries[query_id], "\n")

    r = 0 # top results to print
    for score,doc_id in sorted(bm25_scores,reverse=True)[:5]:
        r += 1 
        print("result ",r,":")
        print("score: ", score)
        print("entity: ", documents[doc_id][1])
        print("definition id: ", doc_id )
        print("definition: ", documents[doc_id][0]) 

~~~ Part 3C: Ranking with BM25 ~~~
 
query:  relational database 

result  1 :
score:  6.1260855760953135
entity:  relational database
definition id:  28177
definition:  relational database schema with data

result  2 :
score:  5.7111963476297145
entity:  relational database
definition id:  28205
definition:  a database built using the relational database model

result  3 :
score:  5.642429950758306
entity:  relational database
definition id:  28210
definition:  a collection of related database tables

result  4 :
score:  5.229555542655476
entity:  relational database
definition id:  28227
definition:  a database using the relational data model.

result  5 :
score:  5.229555542655476
entity:  data warehouse
definition id:  4648
definition:  -set of related databases  -hierarchy of data

query:  garbage collection 

result  1 :
score:  7.719557749327347
entity:  garbage collector
definition id:  4150
definition:  the part of the operating system that performs garbage collection.

result

### Discussion
Briefly discuss the differences you see between the three methods. Is there one you prefer?

Answer: 

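1. With the simple TF-IDF sum, each document's score is the sum of the TF-IDF weights of the query terms it contains. There is no length normalization, so documents that repeat the query terms rank higher.
2. With the vector space model, each document is a vector of TF-IDF weights, and documents are ranked by the cosine of the angle between the document vector and the query vector. Because the vectors are normalized, short definitions that consist mostly of the query terms rank highest.
3. BM25 scores each document with a saturating term-frequency component and an explicit document-length normalization. In theory it should give the most robust results.

Given the specific nature of this dataset, the first method works best and is the one I prefer: in the bonus evaluation below it has the highest Precision@5 (about 0.21, versus 0.18 for the vector space model and 0.17 for BM25).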

## Bonus: Evaluation (10 points)
Rather than just compare methods by pure observation, there are several metrics to evaluate the performance of an IR engine: Precision, Recall, MAP, NDCG, HitRate and so on. These all require a ground truth set of queries and documents with a notion of **relevance**. These ground truth judgments can be expensive to obtain, so we are cutting corners here and treating a flashcard's front and back as a "relevant" query-document pair.

That is, if a document (definition) in your top-5 results is from the back of query's (entity's) flashcard, this document is regarded as relevant to the query (entity). This document is also called a hit in IR. Based on the ground-truth, you could calculate the metrics for the three ranking methods and provide the results like these:

* metric: Precision@5
* TF-IDF - score1
* Vector Space Model with TF-IDF - score2
* BM25 - score3

You could pick any of the reasonable metrics.
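
As one example of such a metric, Precision@5 can be sketched as follows. The function name `precision_at_k` is hypothetical, and the ids are taken from the vector space output for "relational database" shown earlier in this notebook, where four of the five results come from that entity's flashcards.

```python
def precision_at_k(ranked_ids, relevant_ids, k=5):
    # fraction of the top-k results that are relevant (hits / k)
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

ranked = [28227, 28210, 771, 28134, 28205]   # top-5 ids in ranked order
relevant = {28227, 28210, 28134, 28205}      # ids whose entity matches the query
p_at_5 = precision_at_k(ranked, relevant, k=5)
print(p_at_5)
```

Dividing by `k` even when fewer than `k` results are returned is a design choice: it penalizes queries with short result lists instead of rewarding them.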

In [12]:
# your code here

## preparing the relevant document list
## as we have many documents for the same entity,
## we first prepare the ground truth list from our data set
## with this list, we will compare our results and count the intersections

relevant_documents = {}

for idx in range(len(documents)):
    document = documents[idx]
    
    query = document[1]
    query_without_punctuations = query.translate(q_table)
    query_words = query_without_punctuations.split()

    for i in range(len(query_words)):

        # handling leading and trailing hyphens
        query_words[i] = query_words[i].strip("-")

        # Case Folding
        query_words[i] = query_words[i].lower()

        if use_stemming == True:
            query_words[i] = myStemmer.stem(query_words[i])

        # strip digits, mirroring the document parser
        if remove_otherNoise == True:
            remove_digits = '0123456789'
            q_digit_table = str.maketrans('','',remove_digits)
            query_words[i] = query_words[i].translate(q_digit_table )

    # remove stopwords separately
    if remove_stopwords == True:
        for k in reversed(range(len(query_words))):
            if query_words[k] in stopwords.words('english'):
                del query_words[k]
    
    
    # add the tokens for reference
    if tuple(query_words) not in relevant_documents:
        relevant_documents[tuple(query_words)] = set([idx])
    else:
        relevant_documents[tuple(query_words)].add(idx)
    


In [13]:
## finally we will have our scores
tfidf_relevant = 0
vsm_relevant = 0
bm25_relevant = 0

for entity_query, relevant_set in relevant_documents.items():
    
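    # Boolean AND over the entity's tokens; an empty query or an
    # out-of-vocabulary term yields no candidates
    res_Boolean = set()
    if entity_query and all(word in inverted_index for word in entity_query):
        res_Boolean = inverted_index[entity_query[0]]
        for word in entity_query[1:]:
            res_Boolean = res_Boolean.intersection(inverted_index[word])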
    

    ## tfidf top-5 results
    #
    tfidf_scores = []
    for match_doc_id in res_Boolean:
        doc_score = 0

        for keyword in set(entity_query):

            # count the number of occurrence of this keyword in document
            t_fd = doc_tokens[match_doc_id].count(keyword)
            
            tf = 1 + math.log10(t_fd)

            # sum over all terms in query AND document
            doc_score += (tf * IDF[keyword])
            
        # append the final document score for comparison
        tfidf_scores.append((doc_score,match_doc_id))


    tfidf_top5 = set()
    for score,doc_id in sorted(tfidf_scores,reverse=True)[:5]:
        tfidf_top5.add(doc_id)

    tfidf_relevant += len(tfidf_top5.intersection(relevant_set))  
        
    ## VSM top5 results
    
    # generating the query vector
    # count the number of occurrences of each word
    term_freq = Counter(entity_query)
    query_vector = np.zeros(vocab_id)

    # for each word and its count
    for term,count in term_freq.items():
        if remove_stopwords == True and term in stopwords.words('english'):
            continue
        elif term not in vocab_id_table:
            continue
        
        tf = count
        # weight the raw query term frequency by IDF at the term's column
        query_vector[vocab_id_table[term]] += (tf * IDF[term])

    # cosine similarity calculation for each document
    cos_similarity = np.zeros(N)

    # TODO - optimize this function. Use numpy only.
    for idx in res_Boolean:
        cos_similarity[idx] = dot(doc_vectors[idx], query_vector)/(norm(doc_vectors[idx])*norm(query_vector))

    # gets the top 5 in linear time
    top5_index = np.argpartition(cos_similarity, -5)[-5:]
    
    vsm_top5 = set(top5_index)
    vsm_relevant += len(vsm_top5.intersection(relevant_set))
    

    b= 0.75
    k_1 = 1.2
    avgdl = Total_document_length/N
    
    bm25_scores  = []

    for match_doc_id in res_Boolean:
        bm25_doc_score = 0
        D = len(documents[match_doc_id][0].split())

        for keyword in set(entity_query):
            if remove_stopwords == True and keyword in stopwords.words('english'):
                continue
            elif keyword not in vocab_id_table:
                continue
            
            # count the number of occurrence of this keyword in document
            f_qi_D = doc_tokens[match_doc_id].count(keyword)
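            # saturation
            saturation = f_qi_D * ( k_1 + 1 ) / (f_qi_D + k_1)
            # document length normalization (B)
            B = (1 - b) + ( b * D / avgdl)                
            # sum over all terms in query AND document
            # note: B divides the whole saturated term here, a variant of the
            # Wikipedia formula, which has f + k_1 * B in the denominator
            bm25_doc_score += IDF_BM25[keyword] * saturation / B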
            
        bm25_scores.append((bm25_doc_score,match_doc_id))

    bm25_top5 = set()
    for score,doc_id in sorted(bm25_scores,reverse=True)[:5]:
        bm25_top5.add(doc_id)

    bm25_relevant += len(bm25_top5.intersection(relevant_set))  


In [14]:
## Let's check and compare the scores:

# we consider 5 retrieved results for each query;
# len(relevant_documents) is the number of distinct queries

Total_retreived = len(relevant_documents) * 5

# precision is given by relevant retrieved / total retrieved

tfidf_precision = tfidf_relevant/Total_retreived
vsm_precision = vsm_relevant/Total_retreived
bm25_precision = bm25_relevant/Total_retreived

print("~~~ Part Bonus: Evaluation ~~~")

print("Metric: Precision @ 5","\n")
print("Calculated with each entity as query, the definitions as relevant documents and the top-5 results as the retrieved documents\n")
print("TF-IDF = ", tfidf_precision)
print("VSM = ", vsm_precision)
print("BM25 = ", bm25_precision)

~~~ Part Bonus: Evaluation ~~~
Metric: Precision @ 5 

Calculated with each entity as query, the definitions as relevant documents and the top-5 results as the retrieved documents

TF-IDF =  0.2140386571719227
VSM =  0.1770091556459817
BM25 =  0.1686673448626653


# Collaboration Declarations

**You should fill out your collaboration declarations here.**

**Reminder:** You are expected to complete each homework independently. Your solution should be written by you without the direct aid or help of anyone else. However, we believe that collaboration and team work are important for facilitating learning, so we encourage you to discuss problems and general problem approaches (but not actual solutions) with your classmates. You may post on Piazza, search StackOverflow, etc. But if you do get help in this way, you must inform us by filling out the Collaboration Declarations at the bottom of this notebook.

Example: I found helpful code on stackoverflow at https://stackoverflow.com/questions/11764539/writing-fizzbuzz that helped me solve Problem 2.