#### Information Storage and Retrieval :: Modeling Texts


# Topic: Modeling Texts by Scoring, Term Weighting and the Vector Space Model

Goals of this Tutorial: In this tutorial, we explore real-world challenges of term scoring, term weighting, and document ranking on a Quizlet dataset. In addition, we use several evaluation metrics to assess the quality of each ranking algorithm.

Part1: Parsing
<br>Part2: Boolean Retrieval
<br>Part3: Ranking Documents by TF-IDF and BM25
<br>Part4: Evaluations by Precision, Recall, HitRate, MAP

<br>[Reference]
<br>Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze, "Introduction to Information Retrieval"
[Scoring, term weighting and the
vector space model](https://nlp.stanford.edu/IR-book/pdf/06vect.pdf) (Chap. 6)

## Dataset

The dataset is collected from Quizlet (https://quizlet.com), a website where users can generate their own flashcards. Each flashcard consists of an entity on the front and a definition describing or explaining that entity on the back. We treat the entity on each flashcard's front as a query and the definition on its back as a document. A definition (document) is relevant to an entity (query) if it comes from the back of that entity's flashcard; otherwise it is not relevant. **The terms "query" and "entity" are used interchangeably, as are "document" and "definition".**

The format of the dataset is like this:

**query \t document id \t document**

Examples:

decision tree	\t 27946 \t	show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers).

where "decision tree" is the entity on the front of a flashcard, "show complex processes with multiple decision rules.  display decision logic (if statements) as set of (nodes) questions and branches (answers)." is the definition on the flashcard's back, and "27946" is the id of the definition. Naturally, this document is relevant to the query.

false positive rate	\t 686	\t fall-out; probability of a false alarm

where document 686 is not relevant to query "decision tree" because the entity of "fall-out; probability of a false alarm" is "false positive rate".
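
As a minimal sketch of reading this format, assuming every line holds exactly three fields separated by single tab characters (the helper name `parse_line` is hypothetical and not part of the original notebook):

```python
def parse_line(line):
    """Split one dataset line into (query, document id, document)."""
    # maxsplit=2 keeps any tab characters inside the definition text intact.
    query, doc_id, document = line.rstrip("\n").split("\t", 2)
    # Strip stray spaces around each field (an assumption about the raw file).
    return query.strip(), doc_id.strip(), document.strip()


sample = "false positive rate\t686\tfall-out; probability of a false alarm\n"
query, doc_id, document = parse_line(sample)
print(query)     # false positive rate
print(doc_id)    # 686
print(document)  # fall-out; probability of a false alarm
```

Keeping the three fields together in one helper avoids repeating the `split('\t', 1)` calls wherever a line is processed.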

# Part 1: Parsing

First, you should tokenize documents (definitions) using **whitespace and punctuation as delimiters**. Your parser also needs to provide the following three pre-processing options:
* Remove stop words: use the nltk stop words list (from nltk.corpus import stopwords)
* Stemming: use the [nltk Porter stemmer](http://www.nltk.org/api/nltk.stem.html#module-nltk.stem.porter)
* Remove any other strings that you think are less informative or noisy.

Please note that you should stick to the stemming package listed above. Otherwise, given the same query, the results generated by your code can be different from others.
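
To illustrate what "whitespace and punctuation as delimiters" can mean in practice, here is a small standard-library sketch. The function name `simple_tokenize` and the lowercasing step are assumptions made for illustration; stop word removal and stemming are left out.

```python
import re
import string

# One delimiter is any run of whitespace and/or ASCII punctuation characters.
_DELIMITERS = re.compile(r"[\s" + re.escape(string.punctuation) + r"]+")


def simple_tokenize(text):
    """Lowercase the text and split it on whitespace and punctuation."""
    # Splitting can leave empty strings at the edges, so filter them out.
    return [tok for tok in _DELIMITERS.split(text.lower()) if tok]


print(simple_tokenize("fall-out; probability of a false alarm"))
# ['fall', 'out', 'probability', 'of', 'a', 'false', 'alarm']
```

Note the design choice: splitting on punctuation turns `fall-out` into two tokens, whereas deleting punctuation characters from a token would produce the single token `fallout`. Either choice works as long as documents and queries are processed the same way.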

In [0]:
# configuration options
#remove_stopwords = True  # or False
#use_stemming = True # or False
#remove_otherNoise = True # or False
#
import nltk
import string
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import defaultdict
import re
import glob
import math
import operator

stemmer = PorterStemmer()
stopwords = nltk.corpus.stopwords.words('english')
exclude = set(string.punctuation)

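# Tokenize the definition field of one dataset line and append the tokens to
# the list `words`. Duplicates are kept here; the caller removes them with a set.
def tokenize(line, words, remove_stopwords, use_stemming, remove_otherNoise):
    # Drop the first two tab-separated fields (query and document id).
    line = line.split('\t', 1)[-1]
    line = line.split('\t', 1)[-1]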
    temp_list = nltk.word_tokenize(line)

    for word in temp_list:
        # Remove Punctuations
        word = ''.join(ch for ch in word if ch not in exclude)
        if remove_stopwords == True:
            if word in stopwords: 
                continue
        if use_stemming == True:
            word= stemmer.stem(word)
        if remove_otherNoise == True:
            word = re.sub(r'[^\x00-\x7F]+','', word)
        words.append(word)

print (exclude)

{'&', '#', '+', '`', '(', '[', '%', '"', "'", '*', '~', '<', '|', '{', ']', '}', '.', '>', '^', '/', '?', '@', ';', '-', ')', ',', ':', '_', '!', '$', '=', '\\'}


In [0]:
# Your parser function here. It will take the three option variables above as the parameters
# add cells as needed to organize your code

def output_preprocessed_dic (words, filesPath, remove_stopwords, use_stemming, remove_otherNoise):
    # Collect the unique tokens of every file matching filesPath into the set `words`.
    for fileName in glob.iglob(filesPath):
        temp_wordList = []
        # The with-block closes the file once all of its lines have been tokenized.
        with open(fileName, 'r', encoding="utf-8") as inFile:
            for line in inFile:
                tokenize(line, temp_wordList, remove_stopwords, use_stemming, remove_otherNoise)
        words.update(temp_wordList)

# configuration options
remove_stopwords = True  # or False
use_stemming = True # or False
remove_otherNoise = True # or False

# Input/Output Parameters
filesPath = "homework_1_data.txt"
words_none = set()
output_preprocessed_dic(words_none, filesPath, remove_stopwords=False, use_stemming=False, remove_otherNoise=False)

words_remove_stopwords = set()
output_preprocessed_dic(words_remove_stopwords, filesPath, remove_stopwords=True, use_stemming=False, remove_otherNoise=False)

words_remove_stopwords_stemming = set()
output_preprocessed_dic(words_remove_stopwords_stemming, filesPath, remove_stopwords=True, use_stemming=True, remove_otherNoise=False)

# The rule of removing other noises: remove characters that are not english characters or numbers.
words_remove_stopwords_stemming_removeothernoises = set()
output_preprocessed_dic(words_remove_stopwords_stemming_removeothernoises, filesPath, remove_stopwords=True, use_stemming=True, remove_otherNoise=True)

# Total Calculations
print("None of pre-processing options = " + str(len(words_none)))
print("remove stop words = " + str(len(words_remove_stopwords)))
print("remove stop words + stemming = " + str(len(words_remove_stopwords_stemming)))
print("remove stop words + stemming + remove other noise = " + str(len(words_remove_stopwords_stemming_removeothernoises)))


None of pre-processing options = 18723
remove stop words = 18594
remove stop words + stemming = 12717
remove stop words + stemming + remove other noise = 12268


#### Observations

Once your parser is working, report the size of your dictionary under the four cases below, that is, the number of unique tokens for each combination of pre-processing options. Fill in the following:

* None of pre-processing options      = **18723**
* remove stop words       = **18594**
* remove stop words + stemming       = **12717**
* remove stop words + stemming  + remove other noise     = **12268**

# Part 2: Boolean Retrieval

In this part you build an inverted index to support Boolean retrieval. We only require your index to support AND queries. In other words, your index does not have to support OR, NOT, or parentheses. Also, we do not explicitly expect to see AND in queries, e.g., when we query **relational model**, your search engine should treat it as **relational** AND **model**.

Search for the queries below using your index and print out matching documents (for each query, print out 5 matching documents):
* relational database
* garbage collection
* retrieval model

Please use the following format to present your results:
* query: relational database
* result 1:
* entity: database management system
* definition id: 656
* definition: software system used to manage databases
* result 2:
* ......
* query: garbage collection
* ......
* query: retrieval model
* ......
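
The idea of AND-only retrieval over an inverted index can be sketched on a toy collection. This is an illustrative sketch, not the notebook's implementation: the names `build_index` and `boolean_and` and the document ids are hypothetical, and tokenization is reduced to lowercasing and whitespace splitting.

```python
from collections import defaultdict


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def boolean_and(index, query):
    """Return the ids of documents that contain every query token."""
    # .get does not create entries for unknown tokens in the defaultdict.
    postings = [index.get(token, set()) for token in query.lower().split()]
    if not postings:
        return set()
    # Intersect the smallest posting sets first to keep intermediate results small.
    postings.sort(key=len)
    return set.intersection(*postings)


toy_docs = {
    "1": "a collection of related database tables",
    "2": "software system used to manage databases",
    "3": "a database using the relational data model",
}
toy_index = build_index(toy_docs)
print(sorted(boolean_and(toy_index, "relational database")))  # ['3']
print(sorted(boolean_and(toy_index, "database")))             # ['1', '3']
```

Document 2 is not returned for `database` because, without stemming, `databases` is a different token. This is the effect the stemming option in Part 1 is meant to address.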

In [0]:
# build the index here
# add cells as needed to organize your code
from collections import defaultdict
import re
import glob
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
filesPath = "homework_1_data.txt"
stemmer = PorterStemmer()

# Tokenize Function:
exclude = set(string.punctuation)
stopwords = nltk.corpus.stopwords.words('english')
def tokenize(line):
    words=[]
    remove_stopwords = True  # or False
    use_stemming = True # or False
    remove_otherNoise = True # or False
    line = line.split('\t', 1)[-1]
    line = line.split('\t', 1)[-1]
    temp_list = nltk.word_tokenize(line)
    for word in temp_list:
        # Remove Punctuations
        word = ''.join(ch for ch in word if ch not in exclude)
        if remove_stopwords == True:
            if word in stopwords: 
                continue
        if use_stemming == True:
            word= stemmer.stem(word)
        if remove_otherNoise == True:
            word = re.sub(r'[^\x00-\x7F]+','', word)
        words.append(word)
    return words

# Dictionary for tokens -> definition id, entities_list
dict = defaultdict(list)
definitions = {}
entities = {}

for fileName in glob.iglob(filesPath):
    outFile = open(fileName, 'r', encoding="utf-8")
    for line in outFile:
        cache = {}
        words = tokenize(line)
        tmp_entity = line.split('\t', 1)[0]
        line = line.split('\t', 1)[-1]
        line = re.sub(r'[^\x00-\x7F]+','', line)
        index = line.split('\t', 1)[0]
        line = line.split('\t', 1)[-1]
        definitions[index] = line
        entities[index] = tmp_entity
        for word in words:
            if word not in cache:
                cache[word] = index
                dict[word].append(index)


In [0]:
# search for the input using your index and print out ids of matching documents.
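def Boolean_Retrieval (search_text, max_results=5):
    print (str("query: ") + search_text)
    query_terms = tokenize(search_text)
    # AND semantics: intersect the posting lists of all query terms.
    results = None
    for word in query_terms:
        if word not in dict:
            # A term missing from the index means no document can match.
            results = set()
            break
        postings = set(dict[word])
        results = postings if results is None else results.intersection(postings)
    if results is None:
        # The query contained no usable terms after pre-processing.
        results = set()
    i = 0
    for definition_id in results:
        i = i + 1
        print ("result" + str(i)+ ":")
        print ("entity: " + str(entities[definition_id]))
        print ("definition id: " + definition_id)
        print ("definition: " + str(definitions[definition_id]))
        # The task asks for 5 matching documents per query.
        if i == max_results:
            break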

search_text1 = "relational database"
search_text2 = "garbage collection"
search_text3 = "retrieval model"
Boolean_Retrieval (search_text1)
Boolean_Retrieval (search_text2)
Boolean_Retrieval (search_text3)

query: relational database
result1:
entity: relational database
definition id: 28210
definition: a collection of related database tables

result2:
entity: relational database
definition id: 28400
definition: a method of creating a database using tables of related data, with relationships between the tables.

result3:
entity: end users
definition id: 26098
definition: people using a database to support their work or life related tasks and processes

result4:
entity: relational database
definition id: 28312
definition: a collection of related relations in which each relation has a unique name  operational/transactional databases

result5:
entity: relational database
definition id: 28307
definition: a collection of related tabes. establishes the relationships between entities by means of a common field. the tables are constructed so that there is a logical link between them.  -the difference between a database and a relational database is in the way the tables are constructed. -a series o

### Observations
Could your boolean search engine find relevant documents for these queries? 
What is the impact of the three pre-processing options?
Do they improve your search quality?
<br>
(1) **YES**
<br>
(2.1) **Remove stop words: no impact on Boolean retrieval in this example, since none of the query terms are stop words.**
<br>
(2.2) **Stemming: without this option, the query "retrieval model" would match nothing, because variants such as "retrieve", "retrievals", and "models" should be matched but would be missed.**
<br>
(2.3) **Remove other noise: I remove non-ASCII characters, which does not affect Boolean retrieval in this example; it would only matter if a query included non-ASCII characters.**
<br>
(3) **Yes. As noted in (2), these options (especially stemming) noticeably improve the search quality.**


# Part 3: Ranking Documents

In this part, your job is to rank the documents that have been retrieved by the Boolean Retrieval component in Part 2, according to their relevance with each query.

### A: Ranking with simple sums of TF-IDF scores
For a multi-word query, we rank documents by a simple sum of the TF-IDF scores for the query terms in the document.
TF is the log-weighted term frequency $1+\log(tf)$, and IDF is the log-weighted inverse document frequency $\log(\frac{N}{df})$.
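
A small worked example of this scoring scheme, using base-10 logarithms. The collection size, document frequencies, and term counts below are made-up numbers chosen so the arithmetic is easy to follow, and the function names are hypothetical.

```python
import math


def tf_weight(tf):
    """Log-weighted term frequency: 1 + log10(tf), or 0 if the term is absent."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def idf_weight(n_docs, df):
    """Log-weighted inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / df)


def tfidf_sum(query_terms, term_counts, doc_freqs, n_docs):
    """Sum the TF-IDF weights of the query terms for one document."""
    return sum(
        tf_weight(term_counts.get(t, 0)) * idf_weight(n_docs, doc_freqs[t])
        for t in query_terms
        if doc_freqs.get(t, 0) > 0  # skip terms that never occur in the collection
    )


# Hypothetical statistics for one document in a collection of 1000 documents.
n_docs = 1000
doc_freqs = {"relational": 10, "database": 100}   # documents containing each term
term_counts = {"relational": 1, "database": 10}   # occurrences in this document

# relational: (1 + log10(1)) * log10(1000/10)   = 1 * 2 = 2
# database:   (1 + log10(10)) * log10(1000/100) = 2 * 1 = 2
score = tfidf_sum(["relational", "database"], term_counts, doc_freqs, n_docs)
print(round(score, 3))  # 4.0
```

In this example the rare term occurring once contributes as much as the common term occurring ten times, which is the intended effect of combining a dampened TF with IDF.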

**Output:**
For each given query in Part 2, you should just rank the documents retrieved by your boolean search. You only need to output the top-5 results plus the TF-IDF sum score of each of these documents. Please use the following format to present your results:

* query: relational database
* result 1:
* score: 0.1
* entity: database management system
* definition id: 656
* definition: software system used to manage databases
* result 2:
* ......
* query: garbage collection
* ......
* query: retrieval model
* ......

In [0]:
# your code here
# hint: you could first call boolean retrieval function in part 2 to find possible relevant documents, 
# and then rank these documents in this part. Hence, you don't need to rank all documents.
# search for the input using your index and print out ids of matching documents.
from collections import defaultdict
import re
import glob
import nltk
import string
import math
import operator
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter
from itertools import repeat, chain
stemmer = PorterStemmer()
fileName = "homework_1_data.txt"

# Tokenize Function:
exclude = set(string.punctuation)
stopwords = nltk.corpus.stopwords.words('english')
def tokenize(line):
    words=[]
    remove_stopwords = True  # or False
    use_stemming = True # or False
    remove_otherNoise = True # or False
    line = line.split('\t', 1)[-1]
    line = line.split('\t', 1)[-1]
    temp_list = nltk.word_tokenize(line)
    for word in temp_list:
        # Remove Punctuations
        word = ''.join(ch for ch in word if ch not in exclude)
        if remove_stopwords == True:
            if word in stopwords: 
                continue
        if use_stemming == True:
            word= stemmer.stem(word)
        if remove_otherNoise == True:
            word = re.sub(r'[^\x00-\x7F]+','', word)
        words.append(word)
    return words

# Dictionary for tokens -> definition id, entities_list
dict = defaultdict(list)
definitions = {}
entities = {}
num_terms = 0

outFile = open(fileName, 'r', encoding="utf-8")
for line in outFile:
    num_terms += 1
    words = tokenize(line)
    tmp_entity = line.split('\t', 1)[0]
    line = line.split('\t', 1)[-1]
    line = re.sub(r'[^\x00-\x7F]+','', line)
    index = line.split('\t', 1)[0]
    line = line.split('\t', 1)[-1]
    definitions[index] = line
    entities[index] = tmp_entity
    for word in words:
        dict[word].append(index)

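def TFIDF_Retrieval (search_text):
    print (str("query: ") + search_text)
    query_words = tokenize(search_text)
    # Keep only query terms that occur in the index; a term with df = 0 would
    # otherwise cause a division by zero in the IDF below.
    words_set = set(word for word in query_words if word in dict)

    # Candidate documents: every document containing at least one query term.
    idx_set = set()
    for part in words_set:
        idx_set.update(set(dict[part]))

    # IDF depends only on the term, so compute it once per query term.
    # Note: num_terms counts the lines read, i.e. the number of documents N.
    idf = {}
    for word in words_set:
        num_relevant_docs = len(set(dict[word]))
        idf[word] = math.log10(float(num_terms)/num_relevant_docs)

    # Calculate TF-IDF
    docid_tfidf = {}
    for docid in idx_set:
        docid_tfidf[docid] = 0
        for word in words_set:
            tf = 0
            count = dict[word].count(docid)
            if count > 0:
                tf = 1 + math.log10(count)
            docid_tfidf[docid] = docid_tfidf[docid] + tf*idf[word]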

    sorted_x = sorted(docid_tfidf.items(), key=operator.itemgetter(1), reverse=True)
    i = 0
    for item in sorted_x:
        i = i + 1
        definition_id = str(item[0])
        print ("result " + str(i)+ ":")
        print ("score: " + str(item[1]))
        print ("entity: " + str(entities[str(definition_id)]))
        print ("definition id: " + definition_id)
        print ("definition: " + str(definitions[str(definition_id)]))
        if i == 5:
            break

search_text1 = "relational database"
search_text2 = "garbage collection"
search_text3 = "retrieval model"
TFIDF_Retrieval (search_text1)
TFIDF_Retrieval (search_text2)
TFIDF_Retrieval (search_text3)

query: relational database
result 1:
score: 4.669989147383205
entity: relational algebra
definition id: 7156
definition: - a theoretical language with operations that work on one or more relations to define another relation without changing the original relation(s)  - relation-at-a-time (or set) language in which all tuples, possibly from several relations, are manipulated in one statement without looping  relational algebra, first created by edgar f. codd while at ibm, is a family of algebras with a well-founded semantics used for modelling the data stored in relational databases, and defining queries on it.  the main application of relational algebra is providing a theoretical foundation for relational databases, particularly query languages for such databases, chief among which is sql.

result 2:
score: 4.381474721573064
entity: relational database
definition id: 28378
definition: a type of database system where data is stored in  tables related by common fields. a relational databa

### B: Ranking with vector space model with TF-IDF

**Cosine:** You should use cosine as your scoring function. 

**TFIDF:** For the document vectors, use the standard TF-IDF scores as introduced in A. For the query vector, use simple weights (the raw term frequency). For example:
* query: troll $\rightarrow$ (1)
* query: troll trace $\rightarrow$ (1, 1)

**Output:**
For each given query in Part 2, you should just rank the documents retrieved by your boolean search. You only need to output the top-5 documents plus the cosine score of each of these documents. Please use the following format to present your results:

* query: relational database
* result 1:
* score: 0.1
* entity: database management system
* definition id: 656
* definition: software system used to manage databases
* result 2:
* ......
* query: garbage collection
* ......
* query: retrieval model
* ......

You can additionally assume that your queries will contain at most three words. Be sure to normalize your vectors as part of the cosine calculation!
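
As a reminder of what the normalization involves, here is a minimal cosine similarity sketch over dense vectors. The three-term vocabulary and the document weights are hypothetical; only the query weights follow the raw term frequency rule stated above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical vocabulary: (troll, trace, bridge)
query_vec = [1, 1, 0]        # query "troll trace" with raw term frequencies
doc_vec = [2.0, 0.0, 1.0]    # made-up TF-IDF weights of one document

# dot = 2, |q| = sqrt(2), |d| = sqrt(5), so cosine = 2 / sqrt(10)
print(round(cosine(query_vec, doc_vec), 4))  # 0.6325
```

The document norm is taken over all of the document's terms (including `bridge`, which is not in the query). Normalizing only over the query terms would give a different score and would no longer penalize long documents.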

In [0]:
# your code here
# hint: you could first call boolean retrieval function in part 2 to find possible relevant documents, 
# and then rank these documents in this part. Hence, you don't need to rank all documents.
# search for the input using your index and print out ids of matching documents.
from collections import defaultdict
import re
import glob
import nltk
import string
import math
import operator
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter
from itertools import repeat, chain
stemmer = PorterStemmer()
fileName = "homework_1_data.txt"

# Normalized Function:
def normalize(v):
    vmag = math.sqrt(sum(v[i]*v[i] for i in range(len(v))))
    if vmag == 0:
        return v
    return [ float(v[i])/vmag  for i in range(len(v)) ]

def normalize_mag(v):
    vmag = math.sqrt(sum(v[i]*v[i] for i in v))
    if vmag == 0:
        return v
    for i in v:
        v[i] = float(v[i])/vmag 
    return v

def dot(u, v):
    return sum(u[i]*v[i] for i in range(len(u)))

# Tokenize Function:
exclude = set(string.punctuation)
stopwords = nltk.corpus.stopwords.words('english')
def tokenize(line):
    words=[]
    remove_stopwords = True  # or False
    use_stemming = True # or False
    remove_otherNoise = True # or False
    line = line.split('\t', 1)[-1]
    line = line.split('\t', 1)[-1]
    temp_list = nltk.word_tokenize(line)
    for word in temp_list:
        # Remove Punctuations
        word = ''.join(ch for ch in word if ch not in exclude)
        if remove_stopwords == True:
            if word in stopwords: 
                continue
        if use_stemming == True:
            word= stemmer.stem(word)
        if remove_otherNoise == True:
            word = re.sub(r'[^\x00-\x7F]+','', word)
        words.append(word)
    return words

# Dictionary for tokens -> definition id, entities_list
dict = defaultdict(list)
definitions = {}
entities = {}
filenum = {}
num_terms = 0

outFile = open(fileName, 'r', encoding="utf-8")
for line in outFile:
    num_terms += 1
    words = tokenize(line)
    tmp_entity = line.split('\t', 1)[0]
    line = line.split('\t', 1)[-1]
    line = re.sub(r'[^\x00-\x7F]+','', line)
    index = line.split('\t', 1)[0]
    line = line.split('\t', 1)[-1]
    definitions[index] = line
    entities[index] = tmp_entity
    filenum[index] = index
    for word in words:
        dict[word].append(index)

def VSM_Retrieval (search_text):
    query_vec = []
    results = defaultdict(int)
    print (str("query: ") + search_text)
    query_words = tokenize(search_text)
    # Unique query terms in a fixed order, so that the query vector and the
    # document vector below are aligned component by component.
    query_terms = []
    # for the query vector, use simple weights (the raw term frequency)
    for word in query_words:
        if word in query_terms:
            continue
        query_vec.append(query_words.count(word))
        query_terms.append(word)
    query_vec = normalize(query_vec)

    # Candidate documents: every document containing at least one query term.
    # dict.get avoids inserting empty posting lists for unknown terms, which
    # would later cause a division by zero in the IDF.
    idx_set = set()
    for part in query_terms:
        idx_set.update(set(dict.get(part, [])))

    # Calculate TF-IDF (cosine)
    for docid in idx_set:
        doc_vec = {}
        for word in dict:
            # Inverse document frequency and log-weighted term frequency
            num_relevant_docs = len(set(dict[word]))
            idf = math.log10((float(num_terms)/num_relevant_docs))
            tf = 0
            if dict[word].count(docid) > 0:
                tf = 1 + math.log10(dict[word].count(docid))
            doc_vec[word] = tf*idf

        # Normalize over the full document vector, not just the query terms.
        doc_vec = normalize_mag(doc_vec)
        relevant_doc_vec = []
        for word in query_terms:
            if word not in dict:
                relevant_doc_vec.append(0)
            else:
                relevant_doc_vec.append(doc_vec[word])
        results[docid] = dot(relevant_doc_vec, query_vec)

    sorted_x = sorted(results.items(), key=operator.itemgetter(1), reverse=True)
    i = 0
    for item in sorted_x:
        i = i + 1
        definition_id = str(item[0])
        print ("result " + str(i)+ ":")
        print ("score: " + str(item[1]))
        print ("entity: " + str(entities[str(definition_id)]))
        print ("definition id: " + definition_id)
        print ("definition: " + str(definitions[str(definition_id)]))
        if i == 5:
            break

search_text1 = "relational database"
search_text2 = "garbage collection"
search_text3 = "retrieval model"
VSM_Retrieval (search_text1)
VSM_Retrieval (search_text2)
VSM_Retrieval (search_text3)

query: relational database
result 1:
score: 0.7504158931498572
entity: relational database
definition id: 28227
definition: a database using the relational data model.

result 2:
score: 0.6699603893222419
entity: relational database
definition id: 28210
definition: a collection of related database tables

result 3:
score: 0.6687342479485244
entity: relational model
definition id: 771
definition: a database is a collection of relations or tables.

result 4:
score: 0.6580175885364219
entity: relational database
definition id: 28134
definition: a database including tables that are related to each other

result 5:
score: 0.5992966164930551
entity: relational database
definition id: 28205
definition: a database built using the relational database model

query: garbage collection
result 1:
score: 0.7259400197731763
entity: garbage collector
definition id: 4150
definition: the part of the operating system that performs garbage collection.

result 2:
score: 0.636385757115941
entity: data set
d

### C: Ranking with BM25
Finally, let's try the BM25 approach for ranking. Refer to https://en.wikipedia.org/wiki/Okapi_BM25 for the specific formula. You could choose k_1 = 1.2 and b = 0.75 but feel free to try other options.
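
For orientation, the scoring formula can be sketched as a standalone function. This is an illustrative sketch rather than the notebook's implementation: it uses the raw term frequency and a smoothed IDF with `+ 1` inside the natural logarithm, and the function names and the statistics passed in are hypothetical.

```python
import math


def bm25_idf(n_docs, df):
    """Smoothed IDF; the +1 inside the logarithm keeps the value non-negative."""
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1)


def bm25_score(query_terms, term_counts, doc_len, avg_doc_len,
               doc_freqs, n_docs, k1=1.2, b=0.75):
    """BM25 score of one document for the given query terms."""
    # Length normalization: 1 for an average-length document, larger for longer ones.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    score = 0.0
    for term in query_terms:
        tf = term_counts.get(term, 0)  # raw count of the term in this document
        if tf == 0:
            continue
        idf = bm25_idf(n_docs, doc_freqs[term])
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


# Hypothetical statistics: 1000 documents, "garbage" occurs in 10 of them.
doc_freqs = {"garbage": 10}
for tf in (1, 2, 5, 50):
    s = bm25_score(["garbage"], {"garbage": tf}, doc_len=10, avg_doc_len=10,
                   doc_freqs=doc_freqs, n_docs=1000)
    print(tf, round(s, 3))
```

Running the loop shows the score growing with the term count but levelling off below `idf * (k1 + 1)`, which is the saturation behavior controlled by `k_1`.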

**Output:**
For each given query in Part 2, you should just rank the documents retrieved by your boolean search. You only need to output the top-5 documents plus the BM25 score of each of these documents. Please use the following format to present your results:

* query: relational database
* result 1:
* score: 0.1
* entity: database management system
* definition id: 656
* definition: software system used to manage databases
* result 2:
* ......
* query: garbage collection
* ......
* query: retrieval model
* ......

In [0]:
# your code here
from collections import defaultdict
import re
import glob
import nltk
import string
import math
import operator
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter
from itertools import repeat, chain
stemmer = PorterStemmer()
fileName = "homework_1_data.txt"

# Tokenize Function:
exclude = set(string.punctuation)
stopwords = nltk.corpus.stopwords.words('english')
def tokenize(line):
    words=[]
    remove_stopwords = True  # or False
    use_stemming = True # or False
    remove_otherNoise = True # or False
    line = line.split('\t', 1)[-1]
    line = line.split('\t', 1)[-1]
    temp_list = nltk.word_tokenize(line)
    for word in temp_list:
        # Remove Punctuations
        word = ''.join(ch for ch in word if ch not in exclude)
        if remove_stopwords == True:
            if word in stopwords: 
                continue
        if use_stemming == True:
            word= stemmer.stem(word)
        if remove_otherNoise == True:
            word = re.sub(r'[^\x00-\x7F]+','', word)
        words.append(word)
    return words

# Dictionary for tokens -> definition id, entities_list
dict = defaultdict(list)
definitions = {}
entities = {}
num_terms = 0

outFile = open(fileName, 'r', encoding="utf-8")

# Calculate avg_D: the average document length, in whitespace-separated words
avg_D = 0
for line in outFile:
    num_terms += 1
    words = tokenize(line)
    tmp_entity = line.split('\t', 1)[0]
    line = line.split('\t', 1)[-1]
    line = re.sub(r'[^\x00-\x7F]+','', line)
    index = line.split('\t', 1)[0]
    line = line.split('\t', 1)[-1]
    definitions[index] = line
    entities[index] = tmp_entity
    for word in words:
        dict[word].append(index)
    avg_D = avg_D + len(line.split(' '))
avg_D = avg_D/num_terms


def BM25_Retrieval (search_text):
    query_words = []
    words_set = set()
    doc_vec_dict = defaultdict(list)
    results = defaultdict(int)
    print (str("query: ") + search_text)
    query_words = tokenize(search_text)
    words_set = set(query_words)
    
    idx_set = set()
    for part in words_set:
        idx_set.update(set(dict[part]))
    
    # Calculate IDF
    idf = {}
    for word in words_set:
        # Inverse Document Freq
        num_relevant_terms = len(set(dict[word]))
        idf[word] = math.log10((num_terms - num_relevant_terms + 0.5)/(num_relevant_terms+0.5))
        
    # Calculate BM25: sum{ idf[qi] * ((tf[qi,dj] * (k_1+1))/( tf[qi,dj] + k_1 * (1 -b +b*L) ))  }, L = dj/avg_D
    k_1 = 1.2
    b = 0.75
    docid_tfidf = {}
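    for docid in idx_set:
        docid_tfidf[docid] = 0
        # Document length relative to the average, in whitespace-separated words.
        L = len( definitions[docid].split(' ') )/avg_D
        for word in words_set:
            # Note: tf here is the log-weighted term frequency from Part 3A,
            # whereas the standard BM25 formula uses the raw count.
            tf = 0
            if dict[word].count(docid) > 0:
                tf = 1 + math.log10(dict[word].count(docid))
            docid_tfidf[docid] = docid_tfidf[docid] + idf[word] * ((tf * (k_1+1)) / (tf + k_1 * (1 -b +b*L) ))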

    sorted_x = sorted(docid_tfidf.items(), key=operator.itemgetter(1), reverse=True)
    i = 0
    for item in sorted_x:
        i = i + 1
        definition_id = str(item[0])
        print ("result " + str(i)+ ":")
        print ("score: " + str(item[1]))
        print ("entity: " + str(entities[str(definition_id)]))
        print ("definition id: " + definition_id)
        print ("definition: " + str(definitions[str(definition_id)]))
        if i == 5:
            break

search_text1 = "relational database"
search_text2 = "garbage collection"
search_text3 = "retrieval model"
BM25_Retrieval (search_text1)
BM25_Retrieval (search_text2)
BM25_Retrieval (search_text3)

query: relational database
result 1:
score: 3.892073819350884
entity: relational database
definition id: 28177
definition: relational database schema with data

result 2:
score: 3.7811668780887677
entity: relational database
definition id: 28210
definition: a collection of related database tables

result 3:
score: 3.7423440294331187
entity: relational database
definition id: 28205
definition: a database built using the relational database model

result 4:
score: 3.6764055319997007
entity: relational database
definition id: 28227
definition: a database using the relational data model.

result 5:
score: 3.5772927409025206
entity: data warehouse
definition id: 4648
definition: -set of related databases  -hierarchy of data

query: garbage collection
result 1:
score: 6.078777318273037
entity: garbage collector
definition id: 4150
definition: the part of the operating system that performs garbage collection.

result 2:
score: 5.514531961077818
entity: memory safety
definition id: 13115
defin

### Discussion
Briefly discuss the differences you see between the three methods. Is there one you prefer?
<br><br> **[ANS]** I prefer BM25, provided that its computational cost is not a concern. Compared with TF-IDF, BM25 uses a revised IDF formula, which changes how much weight rare terms receive in the score. The ratio |D|/avgdl normalizes term frequency by document length, so an occurrence of a query term counts for more in a short definition than in a long one, which can make matching more accurate. The parameter b controls how strongly this length normalization is applied. Last but not least, k_1 controls the saturation of term frequency: each additional occurrence of a query term adds less to the score than the previous one, so the contribution of a term is not linear in its frequency. In sum, BM25 is my preferred method because, in theory, it accounts for the most aspects of the information retrieval problem.

## Part4: Evaluation
Rather than just compare methods by pure observation, there are several metrics to evaluate the performance of an IR engine: Precision, Recall, MAP, NDCG, HitRate and so on. These all require a ground truth set of queries and documents with a notion of **relevance**. These ground truth judgments can be expensive to obtain, so we are cutting corners here and treating a flashcard's front and back as a "relevant" query-document pair.

That is, if a document (definition) in your top-5 results comes from the back of the query's (entity's) flashcard, it is regarded as relevant to the query (entity). Such a document is also called a hit in IR. Based on this ground truth, you can calculate the metrics for the three ranking methods and present the results like this:

* metric: Precision@5
* TF-IDF - score1
* Vector Space Model with TF-IDF - score2
* BM25 - score3

You could pick any of the reasonable metrics.
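
As one possible way to compute such metrics, the sketch below compares the entity of each top-5 result against the query. The function names are hypothetical, and the average precision here is normalized by the number of hits within the top k, which is an assumption made because the total number of relevant documents per query is not used. The example ranking is the TF-IDF top-5 for "relational database" as listed in the comments of the evaluation cell.

```python
def precision_at_k(ranked_entities, query, k=5):
    """Fraction of the top-k results whose entity equals the query."""
    hits = sum(1 for entity in ranked_entities[:k] if entity == query)
    return hits / k


def hit_rate_at_k(ranked_entities, query, k=5):
    """1.0 if at least one of the top-k results is relevant, else 0.0."""
    return 1.0 if query in ranked_entities[:k] else 0.0


def average_precision_at_k(ranked_entities, query, k=5):
    """Mean of the precision values at each rank where a hit occurs (top-k only)."""
    hits = 0
    precision_sum = 0.0
    for rank, entity in enumerate(ranked_entities[:k], start=1):
        if entity == query:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Entities of the TF-IDF top-5 results for the query "relational database".
tfidf_top5 = ["relational algebra", "relational database", "relational model",
              "database management system", "relational database"]

print(precision_at_k(tfidf_top5, "relational database"))                    # 0.4
print(hit_rate_at_k(tfidf_top5, "relational database"))                     # 1.0
print(round(average_precision_at_k(tfidf_top5, "relational database"), 2))  # 0.45
```

Averaging `average_precision_at_k` over all queries gives a MAP-style score for a ranking method; the same averaging applies to Precision@5 and HitRate.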

In [0]:
# your code here
#[Ground Truth]
#relational database, relational database, relational database, relational database, relational database, garbage collection, garbage collection, garbage collection, garbage collection, garbage collection, retrieval model, retrieval model, retrieval model, retrieval model, retrieval model
#
#[TFIDF]
#relational algebra, relational database, relational model, database management system, relational database, garbage collection, garbage collection, garbage collection, garbage collection, garbage collector, data model, mathematical model, online analytical processing, online analytical processing, physical design
#
#[VSM]
#relational database, relational database, relational model, relational database, relational database, garbage collector, data set, data set, data collection, data collection, query language, online analytical processing, online analytical processing, data storage, black box
#
#[BM25]
#relational database, relational database, relational database, relational database, data warehouse, garbage collector, memory safety, garbage collection, garbage collection, garbage collection, online analytical processing, online analytical processing, query language, online analytical processing, information processing
GroundTruth = ["relational database", "relational database", "relational database", "relational database", "relational database", 
               "garbage collection", "garbage collection", "garbage collection", "garbage collection", "garbage collection",
               "retrieval model", "retrieval model", "retrieval model", "retrieval model", "retrieval model"]
#print (str(len(GroundTruth)) + str(" ") + str(len(TFIDF)) + str(" ")+ str(len(VSM)) + str(" ")+ str(len(BM25)) + str(" "))

from collections import defaultdict
import re
import nltk
import string
import math
import operator
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
fileName = "homework_1_data.txt"

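# Normalization helpers:
def normalize(v):
    """Return the list v scaled to unit Euclidean length (v itself if it is all zeros)."""
    vmag = math.sqrt(sum(x * x for x in v))
    if vmag == 0:
        return v
    return [float(x) / vmag for x in v]

def normalize_mag(v):
    """Scale the values of the dict v in place to unit Euclidean length and return it."""
    vmag = math.sqrt(sum(x * x for x in v.values()))
    if vmag == 0:
        return v
    for key in v:
        v[key] = float(v[key]) / vmag
    return v

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))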

# Tokenize Function:
exclude = set(string.punctuation)
stopwords = nltk.corpus.stopwords.words('english')
def tokenize(line):
    """Tokenize a dataset row (or a plain query string) into normalized terms."""
    words = []
    remove_stopwords = True  # or False
    use_stemming = True # or False
    remove_otherNoise = True # or False
    # Drop the leading "query \t document id \t" fields. A plain query string
    # contains no tab, so it passes through unchanged.
    line = line.split('\t', 1)[-1]
    line = line.split('\t', 1)[-1]
    for word in nltk.word_tokenize(line):
        # Remove punctuation characters
        word = ''.join(ch for ch in word if ch not in exclude)
        if remove_stopwords and word in stopwords:
            continue
        if use_stemming:
            word = stemmer.stem(word)
        if remove_otherNoise:
            # Strip non-ASCII characters
            word = re.sub(r'[^\x00-\x7F]+', '', word)
        words.append(word)
    return words

# Inverted index: token -> list of definition ids (one entry per occurrence)
dict = defaultdict(list)
definitions = {}  # definition id -> definition text
entities = {}     # definition id -> entity (query) on the flashcard's front
num_terms = 0     # number of rows (documents) read

avg_D = 0         # average definition length in space-separated tokens
with open(fileName, 'r', encoding="utf-8") as inFile:
    for line in inFile:
        num_terms += 1
        words = tokenize(line)
        tmp_entity = line.split('\t', 1)[0]
        line = line.split('\t', 1)[-1]
        line = re.sub(r'[^\x00-\x7F]+','', line)
        index = line.split('\t', 1)[0]
        line = line.split('\t', 1)[-1]
        definitions[index] = line
        entities[index] = tmp_entity
        for word in words:
            dict[word].append(index)
        avg_D = avg_D + len(line.split(' '))
avg_D = avg_D/num_terms


def TFIDF_Retrieval (search_text, result):
    """Append the entities of the top-5 documents by summed TF-IDF to result."""
    words_set = set(tokenize(search_text))

    # Candidate documents: every document containing at least one query term.
    # .get avoids inserting empty postings for unseen terms into the defaultdict.
    idx_set = set()
    for part in words_set:
        idx_set.update(dict.get(part, []))

    # Calculate TF-IDF
    docid_tfidf = {}
    for docid in idx_set:
        docid_tfidf[docid] = 0
        for word in words_set:
            postings = dict.get(word, [])
            num_relevant_docs = len(set(postings))
            if num_relevant_docs == 0:
                continue  # term not in the corpus; avoids division by zero
            idf = math.log10(float(num_terms)/num_relevant_docs)
            tf = 0
            count = postings.count(docid)
            if count > 0:
                tf = 1 + math.log10(count)
            docid_tfidf[docid] = docid_tfidf[docid] + tf*idf

    # Keep the five highest-scoring documents and report their entities.
    sorted_x = sorted(docid_tfidf.items(), key=operator.itemgetter(1), reverse=True)
    for definition_id, _score in sorted_x[:5]:
        result.append(str(entities[str(definition_id)]))
    return result

def VSM_Retrieval (search_text, result):
    """Append the entities of the top-5 documents by cosine similarity to result."""
    query_words = tokenize(search_text)
    # Unique query terms in first-occurrence order, so that the query vector and
    # the document vectors below share the same dimension order.
    query_terms = []
    for word in query_words:
        if word not in query_terms:
            query_terms.append(word)
    # for the query vector, use simple weights (the raw term frequency)
    query_vec = normalize([query_words.count(word) for word in query_terms])

    # Candidate documents: every document containing at least one query term.
    idx_set = set()
    for part in query_terms:
        idx_set.update(dict.get(part, []))

    # Calculate TF-IDF (cosine)
    VSM = {}
    for docid in idx_set:
        # TF-IDF weight of every vocabulary term, so the document vector is
        # normalized over its full length and not only over the query terms.
        doc_vec = {}
        for word in dict:
            num_relevant_docs = len(set(dict[word]))
            if num_relevant_docs == 0:
                continue  # empty postings list; avoids division by zero
            idf = math.log10((float(num_terms)/num_relevant_docs))
            tf = 0
            count = dict[word].count(docid)
            if count > 0:
                tf = 1 + math.log10(count)
            doc_vec[word] = tf*idf

        doc_vec = normalize_mag(doc_vec)
        # Project the document onto the query terms, in query_vec order.
        relevant_doc_vec = [doc_vec.get(word, 0) for word in query_terms]
        VSM[docid] = dot(relevant_doc_vec, query_vec)

    # Keep the five highest-scoring documents and report their entities.
    sorted_x = sorted(VSM.items(), key=operator.itemgetter(1), reverse=True)
    for definition_id, _score in sorted_x[:5]:
        result.append(str(entities[str(definition_id)]))
    return result

def BM25_Retrieval (search_text, result):
    """Append the entities of the top-5 documents by BM25 score to result."""
    words_set = set(tokenize(search_text))

    # Candidate documents: every document containing at least one query term.
    idx_set = set()
    for part in words_set:
        idx_set.update(dict.get(part, []))

    # Calculate IDF
    idf = {}
    for word in words_set:
        # Inverse document frequency (BM25 form); it is negative when a term
        # occurs in more than half of the documents.
        num_relevant_docs = len(set(dict.get(word, [])))
        idf[word] = math.log10((num_terms - num_relevant_docs + 0.5)/(num_relevant_docs+0.5))

    # Calculate BM25: sum{ idf[qi] * ((tf[qi,dj] * (k_1+1))/( tf[qi,dj] + k_1 * (1 -b +b*L) ))  }, L = |dj|/avg_D
    # Note: tf here is the log-scaled term frequency, 1 + log10(count).
    k_1 = 1.2
    b = 0.75
    BM25 = {}
    for docid in idx_set:
        BM25[docid] = 0
        # Document length relative to the average, computed once per document
        L = len( definitions[docid].split(' ') )/avg_D
        for word in words_set:
            tf = 0
            count = dict.get(word, []).count(docid)
            if count > 0:
                tf = 1 + math.log10(count)
            BM25[docid] = BM25[docid] + idf[word] * ((tf * (k_1+1)) / (tf + k_1 * (1 -b +b*L) ))

    # Keep the five highest-scoring documents and report their entities.
    sorted_x = sorted(BM25.items(), key=operator.itemgetter(1), reverse=True)
    for definition_id, _score in sorted_x[:5]:
        result.append(str(entities[str(definition_id)]))
    return result

search_text1 = "relational database"
search_text2 = "garbage collection"
search_text3 = "retrieval model"

################# Precision #################
TFIDF_result = []
TFIDF_result = TFIDF_Retrieval (search_text1, TFIDF_result)
TFIDF_result = TFIDF_Retrieval (search_text2, TFIDF_result)
TFIDF_result = TFIDF_Retrieval (search_text3, TFIDF_result)

VSM_result = []
VSM_result = VSM_Retrieval (search_text1, VSM_result)
VSM_result = VSM_Retrieval (search_text2, VSM_result)
VSM_result = VSM_Retrieval (search_text3, VSM_result)

BM25_result = []
BM25_result = BM25_Retrieval (search_text1, BM25_result)
BM25_result = BM25_Retrieval (search_text2, BM25_result)
BM25_result = BM25_Retrieval (search_text3, BM25_result)

TP_TFIDF = 0
TP_VSM = 0
TP_BM25 = 0
for i in range(15):
    if TFIDF_result[i] == GroundTruth[i]:
        TP_TFIDF = TP_TFIDF +1
    if VSM_result[i] == GroundTruth[i]:
        TP_VSM = TP_VSM +1
    if BM25_result[i] == GroundTruth[i]:
        TP_BM25 = TP_BM25 +1
FP_TFIDF = 15 - TP_TFIDF
FP_VSM = 15 - TP_VSM
FP_BM25 = 15 - TP_BM25

print ("################# Precision #################")
print ("metric: Precision@5")
print ( "TF-IDF - " + str(TP_TFIDF/(TP_TFIDF+FP_TFIDF)) )
print ( "Vector Space Model with TF-IDF - " + str(TP_VSM/(TP_VSM+FP_VSM)) )
print ("BM25 - " + str(TP_BM25/(TP_BM25+FP_BM25)) )

print ("################# Recall #################")
# Misses are counted against the 5 results returned per query (5*3 in total),
# not against all relevant documents in the corpus, so Recall@5 equals
# Precision@5 in this setup.
FN_TFIDF = 5*3 - TP_TFIDF
FN_VSM = 5*3 - TP_VSM
FN_BM25 = 5*3 - TP_BM25
print ("metric: Recall@5")
print ( "TF-IDF - " + str(TP_TFIDF/(TP_TFIDF+FN_TFIDF) ))
print ( "Vector Space Model with TF-IDF - " + str(TP_VSM/(TP_VSM+FN_VSM) ))
print ("BM25 - " + str(TP_BM25/(TP_BM25+FN_BM25) ))

print ("################# HitRate #################")
# HitRate@5 is computed with the same formula as Recall@5 above.
print ("metric: HitRate@5")
print ( "TF-IDF - " + str(TP_TFIDF/(TP_TFIDF+FN_TFIDF) ))
print ( "Vector Space Model with TF-IDF - " + str(TP_VSM/(TP_VSM+FN_VSM) ))
print ("BM25 - " + str(TP_BM25/(TP_BM25+FN_BM25) ))

print ("################# MAP #################")
search_text1 = "relational database"
search_text2 = "garbage collection"
search_text3 = "retrieval model"


# Each list below holds N = 15 ranked entities per method (5 per query).
# m (All_words) is the number of corpus entries whose entity is
# "relational database", "garbage collection" or "retrieval model".

TFIDF = ["relational algebra", "relational database", "relational model", "relational database", "database management system",
         "garbage collection", "garbage collection", "garbage collection", "garbage collection", "garbage collection",
         "data model", "mathematical model", "online analytical processing", "online analytical processing", "web browser"]
VSM = ["relational database", "relational database", "relational model", "relational database", "relational database", 
       "garbage collector", "data set", "data set", "data collection", "data collection",
       "query language", "online analytical processing", "online analytical processing", "data storage", "black box"]
BM25 = ["relational database", "relational database", "relational database", "relational database", "data warehouse", 
        "garbage collector", "memory safety", "garbage collection", "garbage collection", "garbage collection",
        "online analytical processing", "online analytical processing", "online analytical processing", "query language", "information processing"]

All_words = 0
for item in entities:
    if entities[item] == "relational database":
        All_words = All_words +1
    if entities[item] == "garbage collection":
        All_words = All_words +1
    if entities[item] == "retrieval model":
        All_words = All_words +1

# Precision
P_TFIDF = TP_TFIDF/(TP_TFIDF+FP_TFIDF)
P_VSM = TP_VSM/(TP_VSM+FP_VSM)
P_BM25 = TP_BM25/(TP_BM25+FP_BM25)

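# Average Precision (simplified): every hit adds the method's overall
# Precision@5 rather than the precision at the rank of that hit.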
AP_TFIDF = 0
AP_VSM = 0
AP_BM25 = 0
for i in range(15):
    if TFIDF[i] == GroundTruth[i]:
        AP_TFIDF = AP_TFIDF + P_TFIDF
    if VSM[i] == GroundTruth[i]:
        AP_VSM = AP_VSM +P_VSM
    if BM25[i] == GroundTruth[i]:
        AP_BM25 = AP_BM25 +P_BM25

print ("metric: MAP@5")
print ( "TF-IDF - " + str(AP_TFIDF/All_words))
print ( "Vector Space Model with TF-IDF - " + str(AP_VSM/All_words))
print ("BM25 - " + str(AP_BM25/All_words))


################# Precision #################
metric: Precision@5
TF-IDF - 0.4666666666666667
Vector Space Model with TF-IDF - 0.26666666666666666
BM25 - 0.4666666666666667
################# Recall #################
metric: Recall@5
TF-IDF - 0.4666666666666667
Vector Space Model with TF-IDF - 0.26666666666666666
BM25 - 0.4666666666666667
################# HitRate #################
metric: HitRate@5
TF-IDF - 0.4666666666666667
Vector Space Model with TF-IDF - 0.26666666666666666
BM25 - 0.4666666666666667
################# MAP #################
metric: MAP@5
TF-IDF - 0.010082304526748973
Vector Space Model with TF-IDF - 0.0032921810699588477
BM25 - 0.010082304526748973
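
For comparison, the textbook form of average precision averages the precision measured at each rank where a relevant document appears, and MAP is the mean of that value over all queries. The sketch below is one possible reading, not the method used in the cell above: the function names are hypothetical, the sample rankings are made up, and normalizing AP@k by min(k, number of relevant documents) is an assumption (other normalizations exist).

```python
def average_precision(ranked_ids, relevant_ids, k=5):
    """AP@k: mean of precision@rank over the ranks (up to k) that hold a relevant document."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Assumed normalization: the best achievable number of hits within k results.
    denom = min(k, len(relevant_ids))
    return precision_sum / denom if denom else 0.0

def mean_average_precision(runs, k=5):
    """runs: list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(average_precision(ranked, rel, k) for ranked, rel in runs) / len(runs)

# Made-up rankings and relevance sets for two queries.
runs = [
    (["a", "b", "c", "d", "e"], {"b", "d", "x", "y", "z"}),  # hits at ranks 2 and 4
    (["p", "q", "r", "s", "t"], {"p"}),                      # single relevant doc at rank 1
]
map_at_5 = mean_average_precision(runs)
print(map_at_5)
```

Because each hit is weighted by the precision at its own rank, two methods with the same Precision@5 can still differ in MAP when one of them places its hits higher in the list.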
