# Exercise 3: Advanced Information Retrieval

## Question 1 - Latent Semantic Indexing


In this exercise, we will run latent semantic indexing on a term-document matrix using Python's NumPy library.

Suppose we are given the following term-document matrix containing eleven terms and four documents $d_1$, $d_2$, $d_3$ and $d_4$:

$
M =
  \begin{bmatrix}
    d_1 & d_2 & d_3 & d_4 \\ 
	1 & 1 & 1 & 1  \\
	0 & 1 & 1 & 1 \\
	1 & 0 & 0 & 0 \\
	0 & 1 & 0 & 0 \\
    1 & 0 & 0 & 0 \\
    1 & 0 & 1 & 2 \\
    1 & 1 & 1 & 1 \\
    1 & 1 & 1 & 0 \\
    1 & 0 & 0 & 0 \\
    0 & 2 & 1 & 1 \\
    0 & 1 & 1 & 0 \\
  \end{bmatrix}
$


<br>

###  Question 1.a

Compute the singular value decomposition of the term-document matrix M. Print the values of the output matrices $K$, $S$ and $D^t$.


<b>Hint:</b> Use the function numpy.linalg.svd. More details on this function can be found at this link:

https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html


Here's sample code:

In [2]:
# import Python matrix operations library
import numpy as np

#set M matrix using the given values.
M = [[1,1,1,1], 
     [0,1,1,1],
     [1,0,0,0],
     [0,1,0,0],
     [1,0,0,0],
     [1,0,1,2],
     [1,1,1,1],
     [1,1,1,0],
     [1,0,0,0],
     [0,2,1,1],
     [0,1,1,0]]


M = np.array(M)

# compute SVD
K, S, Dt = np.linalg.svd(M, full_matrices=False)

K

array([[-0.41291701, -0.12294407,  0.05933248, -0.03660797],
       [-0.3359611 ,  0.1962311 , -0.25246121,  0.11968319],
       [-0.07695592, -0.31917516,  0.31179369, -0.15629115],
       [-0.11909604,  0.2663899 ,  0.20432237, -0.52093504],
       [-0.07695592, -0.31917516,  0.31179369, -0.15629115],
       [-0.39922386, -0.49767812, -0.57172873,  0.04465203],
       [-0.41291701, -0.12294407,  0.05933248, -0.03660797],
       [-0.30751414, -0.01459992,  0.48607132,  0.40306708],
       [-0.07695592, -0.31917516,  0.31179369, -0.15629115],
       [-0.45505713,  0.462621  , -0.04813884, -0.40125186],
       [-0.23055822,  0.30457524,  0.17427762,  0.55935823]])
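
As a quick sanity check on the decomposition, the three factors should multiply back to $M$, and the columns of $K$ should be orthonormal. The following is a minimal sketch, assuming the same matrix as above (written one document per row and then transposed, purely for compactness). The signs of individual columns of $K$ and rows of $D^t$ may differ between platforms, since singular vectors are only determined up to a paired sign flip.

```python
import numpy as np

# Same term-document matrix as above: one document per row, then transposed
# so that rows are terms and columns are documents.
M = np.array([
    [1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0],  # d1
    [1, 1, 0, 1, 0, 0, 1, 1, 0, 2, 1],  # d2
    [1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1],  # d3
    [1, 1, 0, 0, 0, 2, 1, 0, 0, 1, 0],  # d4
]).T

K, S, Dt = np.linalg.svd(M, full_matrices=False)

# With full_matrices=False the shapes are (terms, r), (r,), (r, documents).
print(K.shape, S.shape, Dt.shape)  # (11, 4) (4,) (4, 4)

# S is returned as a 1-D vector, so it has to be turned into a diagonal matrix.
reconstructed = K @ np.diag(S) @ Dt
is_exact = np.allclose(reconstructed, M)
is_orthonormal = np.allclose(K.T @ K, np.eye(4))
print(is_exact, is_orthonormal)
```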


###  Question 1.b

Are the values of $S$ sorted? Perform latent semantic indexing by selecting the two largest singular values of the matrix $S$.

<b>Hint:</b> See the lecture slides on latent semantic indexing for more details. A sub-matrix of a numpy matrix can be computed using indexing operations (see https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.indexing.html).


In [3]:
S

array([ 4.78695453,  2.31848919,  1.762346  ,  0.77705263])

In [21]:
K_sel = K[:,:2]
S_sel = S[:2]
Dt_sel = Dt[:2,:]
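
`numpy.linalg.svd` returns the singular values in descending order, so slicing the first two entries keeps the two largest. The sketch below checks this and builds the rank-2 approximation of $M$ from the selected factors. The name `M_2` is introduced here only for illustration. The Frobenius norm of the approximation error should equal the square root of the sum of the squared singular values that were dropped.

```python
import numpy as np

# Term-document matrix from the exercise (documents as rows, then transposed).
M = np.array([
    [1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0],  # d1
    [1, 1, 0, 1, 0, 0, 1, 1, 0, 2, 1],  # d2
    [1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1],  # d3
    [1, 1, 0, 0, 0, 2, 1, 0, 0, 1, 0],  # d4
]).T

K, S, Dt = np.linalg.svd(M, full_matrices=False)

# Singular values are sorted if no consecutive difference is positive.
is_sorted = bool(np.all(np.diff(S) <= 0))
print(is_sorted)

# Keep the two largest singular values and the matching vectors.
K_sel, S_sel, Dt_sel = K[:, :2], S[:2], Dt[:2, :]

# Rank-2 approximation of M (hypothetical name M_2).
M_2 = K_sel @ np.diag(S_sel) @ Dt_sel

# Approximation error compared with the discarded singular values.
error = np.linalg.norm(M - M_2)  # Frobenius norm by default for matrices
print(error, np.sqrt(np.sum(S[2:] ** 2)))
```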

###  Question 1.c

Given the query $q$:

$
q =
  \begin{bmatrix}
	0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\
  \end{bmatrix}
$


Map query $q$ into the new document space $D$. The new query is referred to as $q^*$. 

<b>Hint:</b> Use the formulation for mapping queries provided in the lecture slides. You can also use the np.linalg.inv function to compute the inverse of a matrix.

In [22]:
# 1c
# Form q
q = np.array([0,0,0,0,0,1,0,0,0,1,1])

# Form the inverse of the diagonal matrix of the selected singular values
S_1 = np.linalg.inv(np.diag(S_sel))

# Calculate q_star
q_star = (q).dot(K_sel).dot(S_1)
q_star

array([-0.22662409,  0.11624731])
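
The cell above computes $q^* = q^T K_k S_k^{-1}$. Because $S_k$ is diagonal, multiplying by its inverse is the same as dividing element-wise by the selected singular values, so the explicit inverse is optional. The sketch below shows that variant and, as a consistency check, maps each original document column through the same formula; this should give back the corresponding column of the reduced $D^t$. Absolute values are printed because the signs of singular vectors can vary between platforms.

```python
import numpy as np

# Term-document matrix from the exercise (documents as rows, then transposed).
M = np.array([
    [1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0],  # d1
    [1, 1, 0, 1, 0, 0, 1, 1, 0, 2, 1],  # d2
    [1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1],  # d3
    [1, 1, 0, 0, 0, 2, 1, 0, 0, 1, 0],  # d4
]).T
q = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1])

K, S, Dt = np.linalg.svd(M, full_matrices=False)
K_sel, S_sel, Dt_sel = K[:, :2], S[:2], Dt[:2, :]

# q* = q^T K_k S_k^{-1}; dividing by S_sel applies the diagonal inverse.
q_star = q @ K_sel / S_sel
print(np.abs(q_star))

# Mapping the documents themselves with the same formula recovers Dt_sel:
# each row of mapped_docs is one document in the reduced space.
mapped_docs = M.T @ K_sel / S_sel
docs_match = np.allclose(mapped_docs, Dt_sel.T)
print(docs_match)
```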

###  Question 1.d

Arrange the documents based on the cosine similarity measure between $q^*$ and the new documents in the space $D$.

<b>Hint:</b> Use the cosine similarity function from the previous exercise on vector space retrieval.

In [23]:
import math

# Function for computing cosine similarity.
def cosine_similarity(v1, v2):
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy*1.0/math.sqrt(sumxx*sumyy)

d1 = Dt_sel[:,0]
d2 = Dt_sel[:,1]
d3 = Dt_sel[:,2]
d4 = Dt_sel[:,3]

cos1 = cosine_similarity(q_star,d1)
cos2 = cosine_similarity(q_star,d2)
cos3 = cosine_similarity(q_star,d3)
cos4 = cosine_similarity(q_star,d4)

print(cos1) 
print(cos2)
print(cos3)
print(cos4)

-0.0120579132787
0.938882772715
0.952477624421
0.593108626807


Arrangement: D3 > D2 > D4 > D1
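
The loop-based `cosine_similarity` can also be written in vectorised form so that all documents are scored in one call. This is a sketch only: the function name `cosine_similarities` and the toy vectors are made up for illustration, and the function assumes no vector has zero length.

```python
import numpy as np

def cosine_similarities(query, docs):
    """Cosine similarity between `query` (length k) and each row of `docs` (n x k)."""
    query = np.asarray(query, dtype=float)
    docs = np.asarray(docs, dtype=float)
    # Dot product of every row with the query, divided by the product of norms.
    return docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Toy vectors chosen only for illustration.
query = [1.0, 0.0]
docs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

scores = cosine_similarities(query, docs)
ranking = np.argsort(-scores)  # indices of documents, best match first

print(scores)
print(ranking)  # [0 2 1]
```

With the variables from the cell above, the call would be `cosine_similarities(q_star, Dt_sel.T)`, since each column of `Dt_sel` holds one document.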

###  Question 1.e

Does the order of documents change if document $d_3$ is dropped? If yes, why? 
If no, how should $d_3$ be modified to change the document ordering?

<br><b>No</b>, because D3 is aligned with D2 and D4 (and has a similar magnitude). Removing D3 therefore does not significantly change the term space ($K$) or the document space ($D$) produced by the SVD.

<br><b>To change this</b>, D3 would have to be redefined so that it points in a different direction from D2 and D4.
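
One way to check this answer empirically is to rerun the whole pipeline with the $d_3$ column removed and compare the two orderings. The helper `lsi_cosines` below is a hypothetical name that simply bundles the steps from 1.a to 1.d; no expected output is shown for the reduced collection, since it is meant to be run and inspected.

```python
import numpy as np

def lsi_cosines(M, q, k=2):
    """Cosine similarity of query q to every document of M in the rank-k LSI space."""
    K, S, Dt = np.linalg.svd(M, full_matrices=False)
    q_star = q @ K[:, :k] / S[:k]   # map the query into the reduced space
    docs = Dt[:k, :].T              # one row per document
    return docs @ q_star / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_star))

# Term-document matrix and query from the exercise.
M = np.array([
    [1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0],  # d1
    [1, 1, 0, 1, 0, 0, 1, 1, 0, 2, 1],  # d2
    [1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1],  # d3
    [1, 1, 0, 0, 0, 2, 1, 0, 0, 1, 0],  # d4
]).T
q = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1])
names = np.array(['d1', 'd2', 'd3', 'd4'])

# Ordering with all four documents.
full_order = names[np.argsort(-lsi_cosines(M, q))]
print(full_order)

# Ordering after dropping d3 (column index 2).
keep = [0, 1, 3]
reduced_order = names[keep][np.argsort(-lsi_cosines(M[:, keep], q))]
print(reduced_order)
```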

### Question 1.f [Optional]

Run latent semantic indexing for the document collection presented in the previous exercise (presented here as well):

  DocID | Document Text
  ------|------------------
  1     | How to Bake Breads Without Baking Recipes
  2     | Smith Pies: Best Pies in London
  3     | Numerical Recipes: The Art of Scientific Computing
  4     | Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
  5     | Pastry: A Book of Best French Pastry Recipes

Now, for the query $Q=$''<i>baking</i>'', find the top ranked documents according to LSI (use three singular values). 

<b>Hint:</b> Use the code for computing document_vectors from the last exercise. Note, however, that document_vectors represents a document-term matrix, whereas LSI uses a term-document matrix.
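
To illustrate the difference in orientation, here is a small sketch with a made-up document-term matrix (three documents, five terms; the weights are arbitrary and not taken from the bread collection). Transposing it gives the term-document layout, after which $K$ has one row per term and $D^t$ has one column per document.

```python
import numpy as np

# Toy document-term matrix: 3 documents (rows) x 5 terms (columns).
# The values are invented for illustration only.
document_vectors = [
    [0.5, 0.0, 1.1, 0.0, 0.2],
    [0.0, 0.9, 0.0, 0.4, 0.2],
    [0.5, 0.9, 0.0, 0.0, 0.0],
]

# LSI expects terms as rows and documents as columns, hence the transpose.
M = np.array(document_vectors).T
K, S, Dt = np.linalg.svd(M, full_matrices=False)

print(M.shape)   # (5, 3): terms x documents
print(K.shape)   # (5, 3): one row per term
print(Dt.shape)  # (3, 3): one column per document
```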

In [36]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import string
from nltk.corpus import stopwords
import math
from collections import Counter
nltk.download('stopwords')
import numpy as np

stemmer = PorterStemmer()

# Tokenize, stem a document
def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    return " ".join([stemmer.stem(word.lower()) for word in tokens])

# Read a list of documents from a file. Each line in a file is a document
with open("bread.txt") as f:
# with open("epfldocs.txt") as f:
    content = f.readlines()
original_documents = [x.strip() for x in content] 
documents = [tokenize(d).split() for d in original_documents]

# create the vocabulary
vocabulary = set([item for sublist in documents for item in sublist])
vocabulary = [word for word in vocabulary if word not in stopwords.words('english')]
vocabulary.sort()

# compute IDF, storing idf values in a dictionary
def idf_values(vocabulary, documents):
    idf = {}
    num_documents = len(documents)
    for i, term in enumerate(vocabulary):
        idf[term] = math.log(num_documents/sum(term in document for document in documents), math.e)
    return idf

# Function to generate the vector for a document (with normalisation)
def vectorize(document, vocabulary, idf):
    vector = [0]*len(vocabulary)
    counts = Counter(document)
    max_count = counts.most_common(1)[0][1]
    for i,term in enumerate(vocabulary):
        vector[i] = idf[term] * counts[term]/max_count
    return vector

def vectorize_query(query, vocabulary, idf):
    q = query.split()
    q = [stemmer.stem(w) for w in q]
    query_vector = vectorize(q, vocabulary, idf)
    return query_vector

# Compute IDF values and vectors
idf = idf_values(vocabulary, documents)
document_vectors = [vectorize(s, vocabulary, idf) for s in documents]

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sanadhisutandi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [37]:
M = np.transpose(np.array(document_vectors))

# compute SVD
K, S, Dt = np.linalg.svd(M, full_matrices=False)

In [38]:
K_sel = K[:,:3]
S_sel = S[:3]
Dt_sel = Dt[:3,:]

In [39]:
# Forming q
# Check Vocabulary
vocabulary

['art',
 'bake',
 'best',
 'book',
 'bread',
 'cake',
 'comput',
 'french',
 'london',
 'numer',
 'pastri',
 'pie',
 'quantiti',
 'recip',
 'scientif',
 'smith',
 'without']

In [40]:
# Forming q
# Forming vector 0 with size of vocabulary
q = np.array([0]*len(vocabulary))

# bake is 2nd element -> index 1
q[1] = 1

In [41]:
# Form the inverse of the diagonal matrix of the selected singular values
S_1 = np.linalg.inv(np.diag(S_sel))

# Calculate q_star
q_star = (q).dot(K_sel).dot(S_1)
q_star

array([-0.00416487,  0.11537136, -0.14603541])

In [43]:
# Calculate using approach in 1d
import math

# Function for computing cosine similarity.
def cosine_similarity(v1, v2):
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy*1.0/math.sqrt(sumxx*sumyy)

d1 = Dt_sel[:,0]
d2 = Dt_sel[:,1]
d3 = Dt_sel[:,2]
d4 = Dt_sel[:,3]
d5 = Dt_sel[:,4]

cos1 = cosine_similarity(q_star,d1)
cos2 = cosine_similarity(q_star,d2)
cos3 = cosine_similarity(q_star,d3)
cos4 = cosine_similarity(q_star,d4)
cos5 = cosine_similarity(q_star,d5)

print(cos1) 
print(cos2)
print(cos3)
print(cos4)
print(cos5)

0.998051867877
-0.657760956636
-0.00232887505297
0.723107878968
-0.655106291136

Arrangement: D1 > D4 > D3 > D5 > D2


<br>
<br>
<br>
<br>


## Question 2 - Word Embeddings

In this exercise, we will train word embeddings using fastText, a state-of-the-art embeddings library. The first step of the exercise is to install the fastText library. Proceed with the following steps:

### FastText installation


#### Run these commands on the shell terminal:

> wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip <br>
> unzip v0.1.0.zip<br>
> cd fastText-0.1.0 <br>
> make<br>

<br>

Move the epfldocs.txt file (provided in the last exercise) to the current directory. A sample command (Linux) for copying the file into the current directory is as follows:

> cp directory_path/epfldocs.txt ./

<br>

### Generate Embeddings

Next, generate fastText embeddings for the epfldocs.txt file using the following command:

> ./fasttext skipgram -input epfldocs.txt -output model_epfldocs


The above command generates word embeddings and stores them in a file named model_epfldocs.vec.

<br>

### Load Embeddings

In the second phase of this exercise, we will load these embeddings into memory using python and visualize them.
Use the following python code to load the embeddings into memory:<br><br>



In [None]:
import sys
import codecs
import numpy as np


def load_embeddings(file_name):
    with codecs.open(file_name, 'r', 'utf-8') as f_in:
        lines = f_in.readlines()
        lines = lines[1:]
        vocabulary, wv = zip(*[line.strip().split(' ', 1) for line in lines])
    wv = np.loadtxt(wv)
    return wv, vocabulary


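# Replace the path based on your own machine.
# './' is only a placeholder; it assumes the .vec file is in the working directory.
directory_path = './'
word_embeddings, vocabulary = load_embeddings(directory_path + 'model_epfldocs.vec')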
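
The loader skips the first line because a fastText `.vec` file is plain text whose first line holds the vocabulary size and the vector dimension; every following line is a word followed by its vector components. The sketch below parses a tiny in-memory sample in that layout. The sample words and numbers are invented, and `parse_vec` is a hypothetical helper, not part of fastText.

```python
import io
import numpy as np

# Invented sample in the .vec layout: header "<num_words> <dim>", then one word per line.
sample_vec = """3 4
epfl 0.1 0.2 0.3 0.4
la -0.5 0.0 0.5 1.0
robot 1.0 1.0 1.0 1.0
"""

def parse_vec(f_in):
    """Parse an open .vec text stream into (embedding matrix, list of words)."""
    n_words, dim = map(int, f_in.readline().split())
    vocabulary, rows = [], []
    for line in f_in:
        if not line.strip():
            continue  # ignore blank lines
        word, values = line.rstrip().split(' ', 1)
        vocabulary.append(word)
        rows.append([float(v) for v in values.split()])
    wv = np.array(rows)
    # The header should agree with what was actually read.
    assert wv.shape == (n_words, dim)
    return wv, vocabulary

wv, vocab = parse_vec(io.StringIO(sample_vec))
print(wv.shape)  # (3, 4)
print(vocab)     # ['epfl', 'la', 'robot']
```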






### Visualize Embeddings

In the third phase of this exercise, we will visualize the generated embeddings. First install the tsne library using pip:

> $ pip install tsne

In [None]:
from tsne import bh_sne
import matplotlib.pyplot as plt
import logging

vis_data = bh_sne(word_embeddings)

vis_data_x = vis_data[:,0]
vis_data_y = vis_data[:,1]

plt.scatter(vis_data_x, vis_data_y)
for label, x, y in zip(vocabulary, vis_data_x, vis_data_y):
    plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
plt.show()
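
If the `tsne` package cannot be installed, a linear projection onto the first two principal components is a simple stand-in for a first look at the embeddings. Note that this is PCA, not t-SNE, so clusters will generally look less separated. The sketch below uses random data in place of `word_embeddings`, and `pca_2d` is a hypothetical helper name.

```python
import numpy as np

def pca_2d(embeddings):
    """Project the rows of `embeddings` onto their first two principal components."""
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)                      # centre each dimension
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                         # coordinates along the top-2 directions

# Random stand-in for word_embeddings: 50 "words" with 10 dimensions each.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(50, 10))

coords = pca_2d(fake_embeddings)
print(coords.shape)  # (50, 2)
```

The two columns of `coords` can be passed to the scatter and annotate calls above in place of `vis_data_x` and `vis_data_y`.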


### Question 2.a

Observe the plot of word embeddings. Do you observe any patterns?



### Question 2.b

Write a python function to find the most similar terms for a given term. The similarity between two terms is defined as the cosine similarity between their corresponding word embeddings.

Find the top 3 terms that are most similar to 'la', 'EPFL', '#robot', 'this', 
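
One possible starting point is sketched below. It is not a reference solution: the function name `most_similar` and the toy vocabulary are invented, and the code assumes the term is present in the vocabulary and that no embedding is the zero vector. All vectors are normalised to unit length so that a single matrix-vector product yields every cosine similarity at once.

```python
import numpy as np

def most_similar(term, embeddings, vocabulary, top_n=3):
    """Return the top_n (word, cosine similarity) pairs closest to `term`."""
    vocab = list(vocabulary)
    idx = vocab.index(term)                      # raises ValueError if term is unknown
    X = np.asarray(embeddings, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = unit @ unit[idx]                      # cosine similarity to every word
    sims[idx] = -np.inf                          # exclude the term itself
    best = np.argsort(-sims)[:top_n]
    return [(vocab[i], float(sims[i])) for i in best]

# Toy data for illustration only.
toy_vocab = ['a', 'b', 'c', 'd']
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

neighbours = most_similar('a', toy_vectors, toy_vocab, top_n=2)
print([word for word, _ in neighbours])  # ['b', 'c']
```

With the loaded embeddings, the call would look like `most_similar('EPFL', word_embeddings, vocabulary)`.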

### Question 2.c [Optional]

Download the text file using the following command:

> wget http://mattmahoney.net/dc/text8.zip <br>
> unzip text8.zip


The above commands create a text file named 'text8'. Regenerate the fastText embeddings using the text8 file. Plot the word embeddings for the first 1000 terms in the vocabulary.

### Question 2.d [Optional]

Observe the word embeddings visualized at this link: http://www.anthonygarvan.com/wordgalaxy/ . Can you make any interesting observations?