## Exercise 1

In this exercise we explore how TF/IDF ranking works.

Implement the vector space retrieval model, based on the code framework provided below.

For testing, we provide a simple collection of 5 documents in the file __bread.txt__:

  DocID | Document Text
  ------|------------------
  1     | How to Bake Breads Without Baking Recipes
  2     | Smith Pies: Best Pies in London
  3     | Numerical Recipes: The Art of Scientific Computing
  4     | Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
  5     | Pastry: A Book of Best French Pastry Recipes

Now, for the query $Q = ``baking''$, find the top ranked documents according to the TF/IDF rank.

For further testing, use the collection __epfldocs.txt__, which contains recent tweets mentioning EPFL.

Also compare your results with those obtained from the reference implementation based on the scikit-learn library.
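
To make the weighting concrete before running the full pipeline, here is a minimal sketch of the TF/IDF weight for a single term. The token lists are hand-normalised for illustration (an assumption; the actual notebook uses NLTK tokenisation and Porter stemming, whose output may differ), and the names `toy_docs`, `toy_idf`, and `toy_weight` are hypothetical.

```python
import math
from collections import Counter

# Hand-normalised tokens for the five titles (an assumption for illustration;
# the real pipeline uses NLTK tokenisation and Porter stemming).
toy_docs = [
    ["bake", "bread", "bake", "recipe"],
    ["smith", "pie", "best", "pie", "london"],
    ["numerical", "recipe", "art", "scientific", "computing"],
    ["bread", "pastry", "pie", "cake", "quantity", "bake", "recipe"],
    ["pastry", "book", "best", "french", "pastry", "recipe"],
]

def toy_idf(term, docs):
    # Natural log of (number of documents / number of documents containing the term)
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def toy_weight(term, doc, docs):
    # Term frequency normalised by the count of the most frequent term in the document
    counts = Counter(doc)
    max_count = max(counts.values())
    return counts[term] / max_count * toy_idf(term, docs)

bake_idf = toy_idf("bake", toy_docs)
bake_weights = [toy_weight("bake", doc, toy_docs) for doc in toy_docs]
```

Under these assumed tokens, "bake" occurs in two of the five documents, so its IDF is $\ln(5/2)$, and only the first and fourth documents receive a non-zero weight for it.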

In [1]:
# Loading of libraries and documents

from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import string
from nltk.corpus import stopwords
import math
from collections import Counter
nltk.download('stopwords')
nltk.download('punkt')

# Tokenize, stem a document
stemmer = PorterStemmer()
def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    return " ".join([stemmer.stem(word.lower()) for word in tokens])

# Read a list of documents from a file. Each line in a file is a document
with open("bread.txt") as f:
    content = f.readlines()
original_documents = [x.strip() for x in content] 
documents = [tokenize(d).split() for d in original_documents]

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sanadhisutandi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/sanadhisutandi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# TF/IDF code

# create the vocabulary
vocabulary = set([item for sublist in documents for item in sublist])
stop_words = set(stopwords.words('english'))  # load the stopword list once
vocabulary = [word for word in vocabulary if word not in stop_words]
vocabulary.sort()

# compute IDF, storing idf values in a dictionary
def idf_values(vocabulary, documents):
    idf = {}
    num_documents = len(documents)
    for i, term in enumerate(vocabulary):
        cnt = sum([1 for document in documents if term in document])
        idf[term] = math.log(num_documents/cnt,math.e)
    return idf

# Function to generate the vector for a document (with normalisation)
def vectorize(document, vocabulary, idf):
    vector = [0]*len(vocabulary)
    counts = Counter(document)

    max_count = counts.most_common(1)[0][1]
    for i,term in enumerate(vocabulary):
        vector[i] =  counts[term] * idf[term] / max_count
    # Return the vector of vocabularies relative to the document
    return vector

# Compute IDF values and vectors
idf = idf_values(vocabulary, documents)
document_vectors = [vectorize(s, vocabulary, idf) for s in documents]

# Function to compute cosine similarity
def cosine_similarity(v1,v2):
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    if sumxy == 0:
        result = 0
    else:
        result = sumxy / math.sqrt(sumxx*sumyy)

    return result

# computing the search result (get the topk documents)
def search_vec(query, topk):
    q = query.split()
    q = [stemmer.stem(w) for w in q]
    query_vector = vectorize(q, vocabulary, idf)
    scores = [[cosine_similarity(query_vector, document_vectors[d]), d] for d in range(len(documents))]
    scores.sort(key=lambda x: -x[0])
    # never index past the end of the score list
    for i in range(min(topk, len(scores))):
        print(original_documents[scores[i][1]])

search_vec("baking",5)
# HINTS
# natural logarithm function
#     math.log(n,math.e)
# Function to count term frequencies in a document
#     Counter(document)
# most common elements for a list
#     counts.most_common(1)

How to Bake Breads Without Baking Recipes
Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
Smith Pies: Best Pies in London
Numerical Recipes: The Art of Scientific Computing
Pastry: A Book of Best French Pastry Recipes


In [3]:
# Reference code using scikit-learn
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 1, stop_words = 'english')
features = tf.fit_transform(original_documents)
npm_tfidf = features.todense()
new_features = tf.transform(['baking'])

cosine_similarities = linear_kernel(new_features, features).flatten()
related_docs_indices = cosine_similarities.argsort()[::-1]
topk = 5
for i in range(topk):
    print(original_documents[related_docs_indices[i]])

How to Bake Breads Without Baking Recipes
Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
Pastry: A Book of Best French Pastry Recipes
Numerical Recipes: The Art of Scientific Computing
Smith Pies: Best Pies in London



## Exercise 2

Implement a probabilistic retrieval model based on the query likelihood language model, using a mixture of the document model and the collection model, each weighted at 0.5. Both are estimated as unigram models using maximum likelihood estimation (MLE). You can use the code framework provided below.
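
The mixture model described above can be written as follows. This is a sketch in notation that is not part of the original exercise: $t$ ranges over the query terms, $C$ denotes the whole collection, and $\lambda$ is the mixture weight.

$$
P(q|d_j) = \prod_{t \in q} \left[ (1-\lambda)\,P_{mle}(t|d_j) + \lambda\,P_{mle}(t|C) \right], \qquad \lambda = 0.5
$$

Here $P_{mle}(t|d_j)$ is the number of occurrences of $t$ in $d_j$ divided by the length of $d_j$, and $P_{mle}(t|C)$ is the number of occurrences of $t$ in the collection divided by the total number of tokens in the collection.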

Now, for the query $Q = ``baking''$, find the top ranked documents according to the probabilistic rank.

Compare the results with TF/IDF ranking.

In [10]:
# Probabilistic retrieval code

# smoothing parameter
lmbda = 0.5

# term frequency
def tf(word, document):
    frequency = document.count(word)
    return frequency / len(document)

# collection size
def collection_size(documents):
    cs = 0 
    for document in documents:
        cs += len(document)
    return cs

# collection frequency
def cf(word, documents):
    # use a separate local name so the function name is not shadowed
    count = 0
    for document in documents:
        count += document.count(word)
    return count / collection_size(documents)

# probabilistic relevance
def query_prob(query, document, documents):
    prob = 1
    for vocab in query:
        prob *= ( (1-lmbda)*tf(vocab,document) + lmbda*cf(vocab,documents) )
    return prob

# computing the search result
def search_prob(query, k):
    q = query.split()
    q = [stemmer.stem(w) for w in q]
    scores = [[query_prob(q, documents[d], documents),d] for d in range(len(documents))]
    scores.sort(key=lambda x: -x[0])
    # never index past the end of the score list
    for i in range(min(k, len(scores))):
        print(original_documents[scores[i][1]])

search_prob("Baking",5)
# HINTS
# counting occurrences of a word in a document:
#     document.count(word)
# length of a document:
#     len(document)
# querying:
#     search_prob("How", 5)

How to Bake Breads Without Baking Recipes
Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
Smith Pies: Best Pies in London
Numerical Recipes: The Art of Scientific Computing
Pastry: A Book of Best French Pastry Recipes
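
The ranking can be sanity-checked by hand with exact fractions. The sketch below uses assumed token statistics (hypothetical hand counts that hold only if bread.txt contains exactly the five titles listed in Exercise 1, punctuation is stripped as in `tokenize`, and "Bake" and "Baking" share a stem); the names `doc_lengths`, `query_counts`, and `mixture_scores` are illustrative.

```python
from fractions import Fraction

# Assumed token statistics for the query stem in the toy collection
# (hypothetical hand counts, see the assumptions stated above).
doc_lengths = [7, 6, 7, 8, 8]      # tokens per document
query_counts = [2, 0, 0, 1, 0]     # occurrences of the query stem per document
lam = Fraction(1, 2)               # mixture weight, as in the exercise

# Collection model: occurrences of the stem over all tokens in the collection
collection_prob = Fraction(sum(query_counts), sum(doc_lengths))

# Mixture of document model and collection model for a one-term query
mixture_scores = [
    (1 - lam) * Fraction(c, n) + lam * collection_prob
    for c, n in zip(query_counts, doc_lengths)
]

# Stable sort: documents with equal scores keep their original order
ranking = sorted(range(len(mixture_scores)), key=lambda d: -mixture_scores[d])
```

Under these assumptions the first document scores 31/168, the fourth 5/48, and the remaining three tie at 1/24, so the smoothing term alone decides their score and their order simply follows the collection order.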


## Exercise 3
Following the notation used in class, let us denote the set of terms by $T=\{k_i|i=1,...,m\}$, the set of documents by $D=\{d_j |j=1,...,n\}$, and let $d_j=(w_{1j},w_{2j},...,w_{mj})$. We are also given a query $q=(w_{1q},w_{2q},...,w_{mq})$. In the lecture we saw that

$sim(q,d_j) = \sum^m_{i=1} \frac{w_{ij}}{|d_j|}\frac{w_{iq}}{|q|}$ .  (1)

Another way of looking at the information retrieval problem is to take a probabilistic approach. The probabilistic view of information retrieval consists of determining the conditional probability $P(q|d_j)$ that, for a given document $d_j$, the query posed by the user is $q$. In practice, when a query $q$ is given, probabilistic retrieval evaluates for each document how probable it is that the query is relevant to that document, which yields a ranking of the documents.

In order to relate vector space retrieval to a probabilistic view of information retrieval, we interpret the weights in Equation (1) as follows:

-  $w_{ij}/|d_j|$ can be interpreted as the conditional probability $P(k_i|d_j)$ that for a given document $d_j$ the term $k_i$ is important (to characterize the document $d_j$).

-  $w_{iq}/|q|$ can be interpreted as the conditional probability $P(q|k_i)$ that for a given term $k_i$ the query posed by the user is $q$. Intuitively, $P(q|k_i)$ gives the amount of importance given to a particular term while querying.

With this interpretation you can rewrite Equation (1) as follows:

$sim(q,d_j) = \sum^m_{i=1} P(k_i|d_j)P(q|k_i)$ (2)

### Question a
Show that, with the probabilistic interpretation of the weights of vector space retrieval given in Equation (2), the similarity computation in vector space retrieval corresponds exactly to the probabilistic interpretation of information retrieval, i.e., $sim(q,d_j)= P(q|d_j)$.
Assume that $d_j$ and $q$ are conditionally independent given $k_i$, i.e., $P(d_j \cap q|k_i) = P(d_j|k_i)P(q|k_i)$. You may assume the existence of joint probability density functions wherever required. (Hint: you might need Bayes' theorem.)
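
One possible route is sketched below. It relies on an additional assumption not stated in the exercise, namely that the terms $k_i$ partition the event space so that the law of total probability applies; the last step of the second line uses Bayes' theorem, $P(d_j|k_i)P(k_i)/P(d_j) = P(k_i|d_j)$.

$$
\begin{aligned}
P(q|d_j) &= \frac{P(q \cap d_j)}{P(d_j)} = \frac{\sum_{i=1}^m P(q \cap d_j|k_i)\,P(k_i)}{P(d_j)} \\
&= \sum_{i=1}^m P(q|k_i)\,\frac{P(d_j|k_i)\,P(k_i)}{P(d_j)} = \sum_{i=1}^m P(k_i|d_j)\,P(q|k_i) = sim(q,d_j)
\end{aligned}
$$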

### Question b
Using the expression derived for $P(q|d_j)$ in (a), rank the following documents (sorted in descending order of their scores): $P(k_i|d_1) = (0, 1/3, 2/3)$, $P(k_i|d_2) = (1/3, 2/3, 0)$, $P(k_i|d_3) = (1/2, 0, 1/2)$, and $P(k_i|d_4) = (3/4, 1/4, 0)$, for the query $P(q|k_i) = (1/5, 0, 2/3)$.
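
The scores can be checked numerically with a minimal sketch that evaluates Equation (2) using exact fractions. The probability values are copied from the question; the names `p_term_given_doc`, `p_query_given_term`, and `eq2_score` are hypothetical helpers introduced for illustration.

```python
from fractions import Fraction as F

# Values copied from Question b; each tuple is indexed by the terms k_1..k_3.
p_term_given_doc = {
    "d1": (F(0), F(1, 3), F(2, 3)),
    "d2": (F(1, 3), F(2, 3), F(0)),
    "d3": (F(1, 2), F(0), F(1, 2)),
    "d4": (F(3, 4), F(1, 4), F(0)),
}
p_query_given_term = (F(1, 5), F(0), F(2, 3))

def eq2_score(doc_probs, query_probs):
    # Equation (2): sum over terms of P(k_i|d_j) * P(q|k_i)
    return sum(pd * pq for pd, pq in zip(doc_probs, query_probs))

eq2_scores = {
    name: eq2_score(probs, p_query_given_term)
    for name, probs in p_term_given_doc.items()
}

# Documents sorted in descending order of their scores
eq2_ranking = sorted(eq2_scores, key=eq2_scores.get, reverse=True)
```

This gives $P(q|d_1) = 4/9$, $P(q|d_3) = 13/30$, $P(q|d_4) = 3/20$, and $P(q|d_2) = 1/15$, i.e., the ranking $d_1, d_3, d_4, d_2$.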