## Exercise 1

In this exercise, we explore how TF/IDF ranking works.

Implement the vector space retrieval model, based on the code framework provided below.

For testing we have provided a simple document collection with 5 documents in file bread.txt:

  DocID | Document Text
  ------|------------------
  1     | How to Bake Breads Without Baking Recipes
  2     | Smith Pies: Best Pies in London
  3     | Numerical Recipes: The Art of Scientific Computing
  4     | Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
  5     | Pastry: A Book of Best French Pastry Recipes

Now, for the query $Q$ = "baking", find the top-ranked documents according to their TF/IDF score.
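
As a sanity check before implementing anything, the IDF of the query term can be worked out by hand. The sketch below assumes the common definition idf(t) = ln(N / df(t)) and hard-codes hand-stemmed token lists for the five titles; the names `docs` and `idf` are illustrative and not part of the provided framework.

```python
import math

# Hand-stemmed tokens for the five titles (hard-coded for illustration).
docs = [
    ["how", "bake", "bread", "without", "bake", "recip"],
    ["smith", "pie", "best", "pie", "london"],
    ["numer", "recip", "the", "art", "scientif", "comput"],
    ["bread", "pastri", "pie", "cake", "quantiti", "bake", "recip"],
    ["pastri", "a", "book", "best", "french", "pastri", "recip"],
]

def idf(term, docs):
    """Inverse document frequency ln(N / df); assumes the term occurs somewhere."""
    df = sum(term in doc for doc in docs)  # number of documents containing the term
    return math.log(len(docs) / df)

idf_bake = idf("bake", docs)    # "bake" occurs in 2 of 5 documents: ln(5/2)
idf_recip = idf("recip", docs)  # "recip" occurs in 4 of 5 documents: ln(5/4)
print(round(idf_bake, 3), round(idf_recip, 3))
```

The rarer term "bake" gets the larger weight, which is why it discriminates between documents better than "recip".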

For further testing, use the collection __epfldocs.txt__, which contains recent tweets mentioning EPFL.

Also compare your results with those of the reference implementation based on the scikit-learn library.

In [1]:
# Loading of libraries and documents

from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import string
from nltk.corpus import stopwords
import math
from collections import Counter
from operator import itemgetter
nltk.download('stopwords')
nltk.download('punkt')

# Tokenize, stem a document
stemmer = PorterStemmer()
def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    return " ".join([stemmer.stem(word.lower()) for word in tokens if word not in stopwords.words('english')])

# Read a list of documents from a file. Each line in a file is a document
with open("bread.txt") as f:
    content = f.readlines()
original_documents = [x.strip() for x in content] 
documents = [tokenize(d).split() for d in original_documents]



In [2]:
documents

[['how', 'bake', 'bread', 'without', 'bake', 'recip'],
 ['smith', 'pie', 'best', 'pie', 'london'],
 ['numer', 'recip', 'the', 'art', 'scientif', 'comput'],
 ['bread', 'pastri', 'pie', 'cake', 'quantiti', 'bake', 'recip'],
 ['pastri', 'a', 'book', 'best', 'french', 'pastri', 'recip']]

In [41]:
# TF/IDF code

# create the vocabulary
vocabulary = list(set([item for sublist in documents for item in sublist]))
vocabulary.sort()

# compute IDF, storing idf values in a dictionary
def idf_values(vocabulary, documents):
    idf = {}
    num_documents = len(documents)
    for i, term in enumerate(vocabulary):
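        # YOUR CODE HERE
        # Document frequency: number of documents that contain the term
        count = sum(term in doc for doc in documents)
        # IDF = ln(N / df), so rarer terms receive a larger weight
        idf[term] = math.log(num_documents/count, math.e)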
    return idf

# Function to generate the vector for a document (with normalisation)
def vectorize(document, vocabulary, idf):
    vector = [0]*len(vocabulary)
    counts = Counter(document)
    max_count = counts.most_common(1)[0][1]
    for i,term in enumerate(vocabulary):
        vector[i] =  counts[term]*idf[term]/max_count
    return vector

# Compute IDF values and vectors
idf = idf_values(vocabulary, documents)
document_vectors = [vectorize(s, vocabulary, idf) for s in documents]

# Function to compute cosine similarity
def cosine_similarity(v1,v2):
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    if sumxy == 0:
        result = 0
    else:
        result = sumxy/math.sqrt(sumxx*sumyy)
    return result

# computing the search result (get the topk documents)
def search_vec(query, topk):
    q = query.split()
    q = [stemmer.stem(w) for w in q]
    query_vector = vectorize(q, vocabulary, idf)
    scores = [[cosine_similarity(query_vector, document_vectors[d]), d] for d in range(len(documents))]
    scores.sort(key=lambda x: -x[0])
    result = []
    for i in range(min(topk, len(scores))):
        print(original_documents[scores[i][1]])
        result.append(scores[i][1])
    return result

# HINTS
# natural logarithm function
#     math.log(n,math.e)
# Function to count term frequencies in a document
#     Counter(document)
# most common elements for a list
#     counts.most_common(1)

In [10]:
search_vec("baking",5)

How to Bake Breads Without Baking Recipes
Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
Smith Pies: Best Pies in London
Numerical Recipes: The Art of Scientific Computing
Pastry: A Book of Best French Pastry Recipes
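
To see why the first two titles come out on top while the other three are effectively interchangeable, it helps to look at the scores themselves. The following is a minimal, self-contained sketch that assumes idf(t) = ln(N / df(t)) and max-count TF normalisation as in `vectorize`; the hard-coded `docs` list and the helper names are illustrative only.

```python
import math
from collections import Counter

# Stemmed token lists for bread.txt, hard-coded for illustration.
docs = [
    ["how", "bake", "bread", "without", "bake", "recip"],
    ["smith", "pie", "best", "pie", "london"],
    ["numer", "recip", "the", "art", "scientif", "comput"],
    ["bread", "pastri", "pie", "cake", "quantiti", "bake", "recip"],
    ["pastri", "a", "book", "best", "french", "pastri", "recip"],
]
vocab = sorted({t for d in docs for t in d})
idf = {t: math.log(len(docs) / sum(t in d for d in docs)) for t in vocab}

def vectorize(tokens):
    """TF (normalised by the most frequent term) times IDF, over the vocabulary."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    return [counts[t] * idf[t] / max_count for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = vectorize(["bake"])  # the stemmed form of "baking"
scores = [cosine(query, vectorize(d)) for d in docs]
for doc_id, score in enumerate(scores, start=1):
    print(doc_id, round(score, 3))
```

Only documents 1 and 4 contain the stem "bake", so only they receive a non-zero score. Documents 2, 3, and 5 all tie at zero, which means their relative order in the result list depends on how the sort breaks ties rather than on relevance.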


In [8]:
# Reference code using scikit-learn
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 1, stop_words = 'english')
features = tf.fit_transform(original_documents)
npm_tfidf = features.todense()
new_features = tf.transform(['baking'])

cosine_similarities = linear_kernel(new_features, features).flatten()
related_docs_indices = cosine_similarities.argsort()[::-1]
topk = 5
for i in range(topk):
    print(original_documents[related_docs_indices[i]])

How to Bake Breads Without Baking Recipes
Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
Pastry: A Book of Best French Pastry Recipes
Numerical Recipes: The Art of Scientific Computing
Smith Pies: Best Pies in London
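
The two implementations are not expected to produce identical scores. With its default settings, `TfidfVectorizer` does no stemming, uses its own stop-word list, L2-normalises each row (so `linear_kernel` yields cosine similarities), and applies a smoothed IDF, ln((1 + N) / (1 + df)) + 1. The sketch below compares the plain and smoothed IDF in pure Python; the function names are illustrative and not part of scikit-learn.

```python
import math

def plain_idf(n_docs, df):
    """Textbook IDF: ln(N / df). Zero for terms that occur in every document."""
    return math.log(n_docs / df)

def smoothed_idf(n_docs, df):
    """Smoothed IDF as used by scikit-learn's default settings (smooth_idf=True)."""
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 5
for df in (1, 2, 4, 5):
    print(df, round(plain_idf(n_docs, df), 3), round(smoothed_idf(n_docs, df), 3))
```

With smoothing, a term that appears in every document still keeps a weight of 1 instead of 0, so score magnitudes differ between the two models even when the rankings agree.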



## Exercise 2: Evaluate retrieval results

In this exercise, we treat the scikit-learn reference code as an “oracle” that supposedly gives the correct results. Your task is to compare the TF/IDF retrieval model implemented above with this oracle for the queries "computer science", "IC school", and "information systems" on the **epfldocs.txt** collection.

For this exercise, replace **bread.txt** with **epfldocs.txt** in the first cell and rerun all cells from the beginning.


In [11]:
# Read a list of documents from a file. Each line in a file is a document
with open("epfldocs.txt",encoding="utf-8") as f:
    content = f.readlines()
original_documents = [x.strip() for x in content] 
documents = [tokenize(d).split() for d in original_documents]

# create the vocabulary
vocabulary = list(set([item for sublist in documents for item in sublist]))
vocabulary.sort()

# Compute IDF values and vectors
idf = idf_values(vocabulary, documents)
document_vectors = [vectorize(s, vocabulary, idf) for s in documents]

In [14]:
# Retrieval oracle 
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 1, stop_words = 'english')
features = tf.fit_transform(original_documents)
npm_tfidf = features.todense()

# Return all document ids whose cosine similarity with the query is at least the threshold
def search_vec_sklearn(query, features, threshold=0.1):
    new_features = tf.transform([query])
    cosine_similarities = linear_kernel(new_features, features).flatten()
    related_docs_indices, cos_sim_sorted = zip(*sorted(enumerate(cosine_similarities), key=itemgetter(1), 
                                                       reverse=True))
    doc_ids = []
    for i, cos_sim in enumerate(cos_sim_sorted):
        if cos_sim < threshold:
            break
        doc_ids.append(related_docs_indices[i])
    return doc_ids

In [15]:
ret_ids = search_vec_sklearn('computer science', features)
for i, v in enumerate(ret_ids):
    print(original_documents[v])

Exciting News: "World University Rankings 2016-2017 by subject: computer science" No1 @ETH &amp; @EPFL on No8. Congrats https://t.co/ARSlXZoShQ
New computer model shows how proteins are controlled "at a distance" https://t.co/zNjK3bZ6mO  via @EPFL_en #VDtech https://t.co/b9TglXO4KD
An interview with Patrick Barth, a new @EPFL professor who combines protein #biophysics with computer modeling https://t.co/iJwBaEbocj
Exposure Science Film Hackathon 2017 applications open! Come join our Scicomm-film-hacking event! #Science #scicomm https://t.co/zwtKPlh6HT
Le mystère Soulages éblouit la science @EPFL  https://t.co/u3uNICyAdi
@cwarwarrior @EPFL_en @EPFL Doing science at @EPFL_en is indeed pretty cool!!! Thank you for visiting!!!
Blue Brain Nexus: an open-source tool for data-driven science https://t.co/m5yTgXf7ym #epfl
Swiss Data Science on Twitter: "Sign up for @EPFL_en #DataJamDays: learn more a… https://t.co/kNVILHWPGb, see more https://t.co/2wg3BbHBNq
The registration for Exposure Scienc

In [20]:
ret_ids = search_vec_sklearn("IC school", features)
for i, v in enumerate(ret_ids):
    print(original_documents[v])

Chuis à un talk de Google à la fac IC @EPFL et j'ai l'impression qu'il y a aucun étudiant Bachelor
Benoit Seguin @Seguin_Be 1st prize at #EPFL IC Research Day for #Replica Demo: Visual Search in Digital Art Collections @Isadilenardo #dhlab https://t.co/jnmLXmgUos
Registrations still open until Jan. 15th - #GETE-school https://t.co/zosnZc4rnc
#EPFL's imposter #Robotic fish infiltrates a school of zebrafish #Switzerland https://t.co/J8TP5su3HT
Heading out for a lecture on #CircuitQED @EPFL in their school on #Lightmatter2017 interactions https://t.co/NFctEddNmh #epfl #epflcampus
Stay tuned! We will be announcing our @EPFL_en #OpenScience summer school next week! Fantastic speakers and great workshops! https://t.co/zeu9qTtTA7
#Danish #school installs world’s largest #solar #façade: https://t.co/i7GsR6qW1m @EPFL_en #Copenhagen #Switzerland #Clean #Energy #Green https://t.co/5mVDxw3Ofn
#Danish #school installs world’s largest #solar #façade: https://t.co/VUoF6Zibbm @EPFL_en #Copenhagen #Swi

In [21]:
ret_ids = search_vec_sklearn("information systems", features)
for i, v in enumerate(ret_ids):
    print(original_documents[v])

Someone explores why Wikipedia is often the first stop when we start searching for information on a topic (even if we don't admit it) https://t.co/XwvnxYfUnq
"A parametric tool to evaluate the environmental and economic feasibility of decentralized energy systems." - https://t.co/mC2D0L1xRW
Ali Motamed Ph.D. of @leso_epfl presenting his thesis on control systems to improve visual comfort #EnergyEfficiency @EPFL_en @epflENAC https://t.co/DPKpwJMxg5
.@j2bryson @marcelsalathe @EPFL_en Rather than "do I trust," let's ask what we require to attain trust in #AI systems - @EffyVayena  (@uzh_news) #AIhealth #swsx #sxswiss
#RT @dgt_switzerland: RT @MartinVetterli: A must-read bestseller for everybody seriously interested in #digitalization: THE INFORMATION, by @JamesGleick https://t.co/tku6HPX2al @EPFL @EPFL_en @dgt_switzerland https://t.co/xYRzBCFWOQ
@JimStolze @aigency_com and Yves Perriard @EPFL_en are the two keynote speakers at High-Tech Systems 2018! Visitor registration is open: https://

In [19]:
queries = ["computer science", "IC school", "information systems"]

for q in queries:
    print("Query is ", q)
    print("Results are")
    print(search_vec(q,10))
    print("*"*50)

Query is  computer science
Results are
Exciting News: "World University Rankings 2016-2017 by subject: computer science" No1 @ETH &amp; @EPFL on No8. Congrats https://t.co/ARSlXZoShQ
New computer model shows how proteins are controlled "at a distance" https://t.co/zNjK3bZ6mO  via @EPFL_en #VDtech https://t.co/b9TglXO4KD
An interview with Patrick Barth, a new @EPFL professor who combines protein #biophysics with computer modeling https://t.co/iJwBaEbocj
Video of Nicola Marzari from @EPFL_en  on Computational Discovery in the 21st Century during #PASC17 now online: https://t.co/tfCkEvYKtq https://t.co/httPdHcK9W
New at @epfl_en Life Sciences @epflSV: "From PhD directly to Independent Group Leader" #ELFIR_EPFL:  Early Independence Research Scholars. See https://t.co/evqyqD7FFl, also for computational biology #compbio https://t.co/e3pDCg6NVb Deadline April 1 2018 at https://t.co/mJqcrfIqkb
@CodeWeekEU is turning 5, yay! We look very much forward to computational thinking unplugged activiti

## Exercise 2.1: Compute the precision and recall at k

In [32]:
def compute_recall_at_k(predict, gt, k):
    correct_recall = set(predict[:k]).intersection(set(gt))
    return len(correct_recall)/len(gt)

In [33]:
def compute_precision_at_k(predict, gt, k):
    correct_predict = set(predict[:k]).intersection(set(gt))
    return len(correct_predict)/k
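
A small worked example can make the two measures concrete. The sketch below re-implements them under illustrative names (`precision_at_k`, `recall_at_k`) and applies them to a made-up ranking and ground truth; the document ids are invented for the example.

```python
def precision_at_k(predict, gt, k):
    """Fraction of the top-k predictions that are relevant."""
    return len(set(predict[:k]) & set(gt)) / k

def recall_at_k(predict, gt, k):
    """Fraction of all relevant documents found in the top-k predictions."""
    return len(set(predict[:k]) & set(gt)) / len(gt)

predict = [3, 7, 1, 9, 4]  # hypothetical ranked result list
gt = [7, 4, 8]             # hypothetical set of relevant documents

for k in (1, 3, 5):
    print(k, precision_at_k(predict, gt, k), recall_at_k(predict, gt, k))
```

At k = 3 only document 7 is a hit, so precision and recall are both 1/3. At k = 5 documents 7 and 4 are hits, giving a precision of 2/5 and a recall of 2/3. Document 8 is never retrieved, so recall cannot reach 1 for this ranking.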

## Exercise 2.2: Compute the MAP score at 10

In [34]:
def compute_map(queries, topk):
    map_score = 0
    for query in queries:
        predict = search_vec(query, topk)         # ranked ids from our model
        gt = search_vec_sklearn(query, features)  # ids the oracle considers relevant
        # Interpolated precision: p_int[k-1] is the best precision at any rank >= k
        p_int = []
        for k in range(topk, 0, -1):
            pk = compute_precision_at_k(predict, gt, k)
            p_int_k = max([pk]+p_int)
            p_int.insert(0, p_int_k)
        # Sum the interpolated precision at every rank that holds a relevant document
        p_int_relevant = sum(p_int[rank] for rank, retrieved in enumerate(predict) if retrieved in gt)
        map_score += p_int_relevant/len(gt)
    map_score = map_score/len(queries)
    return map_score

In [42]:
compute_map(queries, 10)

Exciting News: "World University Rankings 2016-2017 by subject: computer science" No1 @ETH &amp; @EPFL on No8. Congrats https://t.co/ARSlXZoShQ
New computer model shows how proteins are controlled "at a distance" https://t.co/zNjK3bZ6mO  via @EPFL_en #VDtech https://t.co/b9TglXO4KD
An interview with Patrick Barth, a new @EPFL professor who combines protein #biophysics with computer modeling https://t.co/iJwBaEbocj
Video of Nicola Marzari from @EPFL_en  on Computational Discovery in the 21st Century during #PASC17 now online: https://t.co/tfCkEvYKtq https://t.co/httPdHcK9W
New at @epfl_en Life Sciences @epflSV: "From PhD directly to Independent Group Leader" #ELFIR_EPFL:  Early Independence Research Scholars. See https://t.co/evqyqD7FFl, also for computational biology #compbio https://t.co/e3pDCg6NVb Deadline April 1 2018 at https://t.co/mJqcrfIqkb
@CodeWeekEU is turning 5, yay! We look very much forward to computational thinking unplugged activities during @CodeWeek_CH https://t.co/yDP

0.3437707390648568
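
The MAP computation above relies on interpolated precision, which is easier to follow on a toy ranking. The sketch below computes the interpolated average precision for a single made-up query, assuming the same scheme as `compute_map` (interpolated precision at rank k is the best precision at any rank at or beyond k); the function names and document ids are illustrative.

```python
def precision_at_k(predict, gt, k):
    return len(set(predict[:k]) & set(gt)) / k

def interpolated_ap(predict, gt):
    """Average of the interpolated precision at the ranks of relevant documents."""
    p_int, best = [], 0.0
    # Walk from the last rank to the first, keeping the running maximum.
    for k in range(len(predict), 0, -1):
        best = max(best, precision_at_k(predict, gt, k))
        p_int.insert(0, best)
    hits = sum(p_int[rank] for rank, doc in enumerate(predict) if doc in gt)
    return hits / len(gt)

predict = [3, 7, 1, 9, 4]  # hypothetical ranked result list
gt = [7, 4, 8]             # hypothetical set of relevant documents
ap = interpolated_ap(predict, gt)
print(round(ap, 3))
```

Here the precision values at ranks 1 to 5 are 0, 1/2, 1/3, 1/4, and 2/5, so the interpolated values are 0.5, 0.5, 0.4, 0.4, and 0.4. The relevant documents sit at ranks 2 and 5, which gives (0.5 + 0.4) / 3 = 0.3. MAP is the mean of this quantity over all queries.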