## Latent Semantic Indexing


In this exercise, we will run latent semantic indexing (LSI) on a term-document matrix using Python's NumPy library.

Suppose we are given the following term-document matrix with eleven terms (rows) and four documents $d_1$, $d_2$, $d_3$ and $d_4$ (columns):

$
M =
  \begin{bmatrix}
    d_1 & d_2 & d_3 & d_4 \\
    1 & 1 & 1 & 1 \\
    0 & 1 & 1 & 1 \\
    1 & 0 & 0 & 0 \\
    0 & 1 & 0 & 0 \\
    1 & 0 & 0 & 0 \\
    1 & 0 & 1 & 2 \\
    1 & 1 & 1 & 1 \\
    1 & 1 & 1 & 0 \\
    1 & 0 & 0 & 0 \\
    0 & 2 & 1 & 1 \\
    0 & 1 & 1 & 0 \\
  \end{bmatrix}
$


<br>

###  Question 1.a

Compute the singular value decomposition (SVD) of the term-document matrix $M$. Print the output matrices $K$, $S$ and $D^t$.


<b>Hint:</b> Use the function numpy.linalg.svd. Its documentation is available at:

https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html


Here's sample code:

In [5]:
# import Python matrix operations library
import numpy as np

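# Term-document matrix M: 11 terms (rows) x 4 documents (columns)
M = np.array([[1, 1, 1, 1],
              [0, 1, 1, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 1, 2],
              [1, 1, 1, 1],
              [1, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 2, 1, 1],
              [0, 1, 1, 0]])

# Compute the SVD; S is returned as a 1-D array of singular values
K, S, Dt = np.linalg.svd(M)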

In [6]:
K

array([[-0.41291701, -0.12294407,  0.05933248, -0.03660797, -0.26712177,
        -0.64005397, -0.4112895 , -0.29358326, -0.26712177, -0.0331095 ,
        -0.02646149],
       [-0.3359611 ,  0.1962311 , -0.25246121,  0.11968319,  0.21609294,
         0.20940104, -0.13578172, -0.22687177,  0.21609294, -0.60596737,
        -0.44296471],
       [-0.07695592, -0.31917516,  0.31179369, -0.15629115, -0.34608445,
         0.63322438, -0.05581269, -0.33567917, -0.34608445, -0.11889882,
         0.01040529],
       [-0.11909604,  0.2663899 ,  0.20432237, -0.52093504, -0.02675801,
        -0.03718789, -0.05980868,  0.39531815, -0.02675801, -0.51079828,
         0.42207616],
       [-0.07695592, -0.31917516,  0.31179369, -0.15629115,  0.84973674,
        -0.0397257 , -0.08480244, -0.11551346, -0.15026326,  0.0510951 ,
         0.03474981],
       [-0.39922386, -0.49767812, -0.57172873,  0.04465203,  0.02360429,
         0.19154746, -0.12789716,  0.33181808,  0.02360429,  0.06933868,
         0.308

In [7]:
S

array([4.78695453, 2.31848919, 1.762346  , 0.77705263])

In [9]:
Dt

array([[-0.36838448, -0.57010731, -0.53356439, -0.50455879],
       [-0.74000417,  0.61762211,  0.0885323 , -0.25119473],
       [ 0.54948837,  0.36008671, -0.05294924, -0.75206148],
       [-0.12144645, -0.40479395,  0.83944473, -0.34165065]])
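As a quick sanity check on the decomposition (a sketch, not part of the original solution): with its default settings, `np.linalg.svd` returns an 11×11 `K`, the four singular values as a 1-D array `S`, and a 4×4 `Dt`. Multiplying the first four columns of `K` by `diag(S)` and `Dt` should reproduce `M` up to floating-point error. The names `M_rebuilt` and `reconstruction_ok` are introduced here for illustration only.

```python
import numpy as np

# Same term-document matrix as in Question 1.a
M = np.array([[1, 1, 1, 1], [0, 1, 1, 1], [1, 0, 0, 0], [0, 1, 0, 0],
              [1, 0, 0, 0], [1, 0, 1, 2], [1, 1, 1, 1], [1, 1, 1, 0],
              [1, 0, 0, 0], [0, 2, 1, 1], [0, 1, 1, 0]])

K, S, Dt = np.linalg.svd(M)
print(K.shape, S.shape, Dt.shape)  # (11, 11) (4,) (4, 4)

# Only the first len(S) columns of K are paired with a singular value
M_rebuilt = K[:, :len(S)] @ np.diag(S) @ Dt
reconstruction_ok = bool(np.allclose(M, M_rebuilt))
print(reconstruction_ok)
```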


###  Question 1.b

Are the values of $S$ sorted? Perform latent semantic indexing by keeping only the two largest singular values of $S$.

<b>Hint:</b> See the lecture slides on latent semantic indexing for more details. A sub-matrix of a numpy matrix can be computed using indexing operations (see https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.indexing.html).


In [15]:
K_sel = K[:,[0,1]]
S_sel = np.diag(S[:2])
Dt_sel = Dt[[0,1],:]

In [36]:
K_sel

array([[-0.41291701, -0.12294407],
       [-0.3359611 ,  0.1962311 ],
       [-0.07695592, -0.31917516],
       [-0.11909604,  0.2663899 ],
       [-0.07695592, -0.31917516],
       [-0.39922386, -0.49767812],
       [-0.41291701, -0.12294407],
       [-0.30751414, -0.01459992],
       [-0.07695592, -0.31917516],
       [-0.45505713,  0.462621  ],
       [-0.23055822,  0.30457524]])

In [35]:
S_sel

array([[4.78695453, 0.        ],
       [0.        , 2.31848919]])

In [40]:
Dt_sel

array([[-0.36838448, -0.57010731, -0.53356439, -0.50455879],
       [-0.74000417,  0.61762211,  0.0885323 , -0.25119473]])
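The first part of the question (whether $S$ is sorted) can also be checked programmatically. The sketch below assumes the same matrix as above and introduces the illustrative names `is_sorted`, `M_k`, `error` and `expected_error`. It also builds the rank-2 approximation of `M` and compares its Frobenius-norm error with the discarded singular values, which is what one would expect from truncating an SVD.

```python
import numpy as np

M = np.array([[1, 1, 1, 1], [0, 1, 1, 1], [1, 0, 0, 0], [0, 1, 0, 0],
              [1, 0, 0, 0], [1, 0, 1, 2], [1, 1, 1, 1], [1, 1, 1, 0],
              [1, 0, 0, 0], [0, 2, 1, 1], [0, 1, 1, 0]])
K, S, Dt = np.linalg.svd(M)

# Singular values should appear in non-increasing order
is_sorted = bool(np.all(S[:-1] >= S[1:]))
print("Sorted in descending order:", is_sorted)

# Rank-k approximation built from the k largest singular values
k = 2
M_k = K[:, :k] @ np.diag(S[:k]) @ Dt[:k, :]

# Frobenius error of the truncation vs. the discarded singular values
error = np.linalg.norm(M - M_k)
expected_error = np.sqrt(np.sum(S[k:] ** 2))
print(error, expected_error)
```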

###  Question 1.c

Given the query $q$:

$
q =
  \begin{bmatrix}
	0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\
  \end{bmatrix}
$


Map the query $q$ into the new document space $D$. The mapped query is denoted $q^*$.

<b>Hint:</b> Use the formulation for mapping queries provided in the lecture slides. You can use the np.linalg.inv function to compute the inverse of a matrix.

###  Question 1.d

Rank the documents by the cosine similarity between $q^*$ and the documents in the new space $D$.

<b>Hint:</b> Use the cosine similarity function from the previous exercise on vector space retrieval.

###  Question 1.e

Does the order of documents change if document $d_3$ is dropped? If yes, why? 
If no, how should $d_3$ be modified to change the document ordering?


### Question 1.f [Optional]

Run latent semantic indexing on the document collection from the previous exercise (repeated here):

  DocID | Document Text
  ------|------------------
  1     | How to Bake Breads Without Baking Recipes
  2     | Smith Pies: Best Pies in London
  3     | Numerical Recipes: The Art of Scientific Computing
  4     | Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
  5     | Pastry: A Book of Best French Pastry Recipes

Now, for the query $Q=$ "<i>baking</i>", find the top-ranked documents according to LSI (use three singular values).

<b>Hint:</b> Use the code for computing document_vectors from the last exercise. Note, however, that document_vectors represents a document-term matrix, whereas LSI uses a term-document matrix.

# 1.C

In [16]:
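# Query as an 11x1 column vector: terms 6, 10 and 11 are present
q = np.array([[0], [0], [0], [0], [0], [1], [0], [0], [0], [1], [1]])

# Map the query into the reduced space: q* = q^T K_sel S_sel^{-1}
q_star = q.T @ K_sel @ np.linalg.inv(S_sel)
q_star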

array([[-0.22662409,  0.11624731]])
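Because `S_sel` is diagonal, multiplying by its inverse is the same as dividing each component by the corresponding singular value. The sketch below illustrates this equivalence; it uses a 1-D query vector and the names `q_star_inv` and `q_star_div`, which are simplifications introduced here rather than part of the original solution. Singular vectors are only determined up to sign, so individual components may come out with opposite signs on other platforms.

```python
import numpy as np

M = np.array([[1, 1, 1, 1], [0, 1, 1, 1], [1, 0, 0, 0], [0, 1, 0, 0],
              [1, 0, 0, 0], [1, 0, 1, 2], [1, 1, 1, 1], [1, 1, 1, 0],
              [1, 0, 0, 0], [0, 2, 1, 1], [0, 1, 1, 0]])

# 1-D query with terms 6, 10 and 11 set (0-based indices 5, 9, 10)
q = np.zeros(11)
q[[5, 9, 10]] = 1

K, S, Dt = np.linalg.svd(M, full_matrices=False)
k = 2

# Mapping with an explicit matrix inverse, as in the cell above
q_star_inv = q @ K[:, :k] @ np.linalg.inv(np.diag(S[:k]))

# Equivalent mapping: element-wise division by the singular values
q_star_div = (q @ K[:, :k]) / S[:k]

print(q_star_inv)
print(q_star_div)
```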

# 1.D

In [50]:
import math

# Cosine similarity between two equal-length vectors
def cosine_similarity(v1, v2):
    sumxx, sumxy, sumyy = 0, 0, 0
    for x, y in zip(v1, v2):
        sumxx += x * x
        sumyy += y * y
        sumxy += x * y
    return sumxy / math.sqrt(sumxx * sumyy)

# Each column of Dt_sel is a document in the reduced space
docs_sim = {}
for i in range(M.shape[1]):
    docs_sim[i] = cosine_similarity(Dt_sel[:, i], q_star[0])

# Sort documents by decreasing similarity
sorted_docs = dict(sorted(docs_sim.items(), key=lambda item: -item[1]))

for doc_id, similarity in sorted_docs.items():
    print("The doc {0} has similarity {1}".format(doc_id+1, similarity))


The doc 3 has similarity 0.9524776244205612
The doc 2 has similarity 0.9388827727147445
The doc 4 has similarity 0.5931086268074788
The doc 1 has similarity -0.012057913278690343
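Question 1.e can be explored with the same steps wrapped in a function. The following is a minimal sketch under stated assumptions: `lsi_rank` is a hypothetical helper, not part of the original solution, that computes the SVD, maps the query, and ranks documents by cosine similarity in a vectorized way. Running it on the full matrix should reproduce the ordering above; running it on the matrix with the column for $d_3$ removed shows whether the ordering of the remaining documents changes.

```python
import numpy as np

def lsi_rank(M, q, k=2):
    """Return (column_index, cosine) pairs sorted by decreasing similarity."""
    K, S, Dt = np.linalg.svd(M, full_matrices=False)
    docs = Dt[:k, :]                      # one column per document
    q_star = (q @ K[:, :k]) / S[:k]       # query in the reduced space
    sims = (q_star @ docs) / (np.linalg.norm(q_star)
                              * np.linalg.norm(docs, axis=0))
    order = np.argsort(-sims)
    return [(int(i), float(sims[i])) for i in order]

M = np.array([[1, 1, 1, 1], [0, 1, 1, 1], [1, 0, 0, 0], [0, 1, 0, 0],
              [1, 0, 0, 0], [1, 0, 1, 2], [1, 1, 1, 1], [1, 1, 1, 0],
              [1, 0, 0, 0], [0, 2, 1, 1], [0, 1, 1, 0]])
q = np.zeros(11)
q[[5, 9, 10]] = 1

# Ranking on the full collection (0-based column indices)
full_ranking = lsi_rank(M, q)
for i, sim in full_ranking:
    print("d_{0}: {1:.4f}".format(i + 1, sim))

# Drop the column for d_3 and rank the remaining documents again
M_without_d3 = np.delete(M, 2, axis=1)
labels = [1, 2, 4]                        # original ids of remaining columns
reduced_ranking = lsi_rank(M_without_d3, q)
for i, sim in reduced_ranking:
    print("d_{0}: {1:.4f}".format(labels[i], sim))
```

Dropping a document changes the SVD itself, so the reduced space is recomputed rather than reusing `K_sel`, `S_sel` and `Dt_sel` from the full collection.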


# 1.F

In [51]:
import math
import string
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')

stemmer = PorterStemmer()

# Tokenize, stem a document
def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    return " ".join([stemmer.stem(word.lower()) for word in tokens])

# Read a list of documents from a file. Each line in a file is a document
with open("bread.txt") as f:
    content = f.readlines()
original_documents = [x.strip() for x in content] 
documents = [tokenize(d).split() for d in original_documents]

# Create the vocabulary, excluding English stop words
stop_words = set(stopwords.words('english'))
vocabulary = sorted({term for document in documents for term in document
                     if term not in stop_words})

# Compute IDF values, stored in a dictionary keyed by term
def idf_values(vocabulary, documents):
    idf = {}
    num_documents = len(documents)
    for term in vocabulary:
        document_frequency = sum(term in document for document in documents)
        idf[term] = math.log(num_documents / document_frequency)
    return idf

# Function to generate the vector for a document (with normalisation)
def vectorize(document, vocabulary, idf):
    vector = [0]*len(vocabulary)
    counts = Counter(document)
    max_count = counts.most_common(1)[0][1]
    for i,term in enumerate(vocabulary):
        vector[i] = idf[term] * counts[term]/max_count
    return vector

# Compute IDF values and vectors
idf = idf_values(vocabulary, documents)
document_vectors = [vectorize(s, vocabulary, idf) for s in documents]

vocabulary




['art',
 'bake',
 'best',
 'book',
 'bread',
 'cake',
 'comput',
 'french',
 'london',
 'numer',
 'pastri',
 'pie',
 'quantiti',
 'recip',
 'scientif',
 'smith',
 'without']

In [52]:
# document_vectors is document-term; transpose it to get the term-document matrix
M = np.array(document_vectors).T

In [53]:
M

array([[0.        , 0.        , 1.60943791, 0.        , 0.        ],
       [0.91629073, 0.        , 0.        , 0.91629073, 0.        ],
       [0.        , 0.45814537, 0.        , 0.        , 0.45814537],
       [0.        , 0.        , 0.        , 0.        , 0.80471896],
       [0.45814537, 0.        , 0.        , 0.91629073, 0.        ],
       [0.        , 0.        , 0.        , 1.60943791, 0.        ],
       [0.        , 0.        , 1.60943791, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.80471896],
       [0.        , 0.80471896, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 1.60943791, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.91629073, 0.91629073],
       [0.        , 0.91629073, 0.        , 0.91629073, 0.        ],
       [0.        , 0.        , 0.        , 1.60943791, 0.        ],
       [0.11157178, 0.        , 0.22314355, 0.22314355, 0.11157178],
       [0.        , 0.        , 1.

In [57]:
# LSI

K,S,Dt = np.linalg.svd(M)

K_sel = K[:,:3]
S_sel = np.diag(S[:3])
Dt_sel = Dt[:3,:]

In [58]:
S_sel

array([[3.22696189, 0.        , 0.        ],
       [0.        , 3.00604272, 0.        ],
       [0.        , 0.        , 1.54530928]])

In [59]:
q = np.array([0]*len(vocabulary))

# The query "baking" stems to 'bake', which has index 1 in the vocabulary
q[1] = 1

# Map the query into the reduced space: q* = q^T K_sel S_sel^{-1}
q_star = q.T@K_sel@np.linalg.inv(S_sel)
q_star

array([-0.00416487,  0.11537136, -0.14603541])

In [61]:
# Each column of Dt_sel is a document in the reduced space
docs_sim = {}
for i in range(M.shape[1]):
    docs_sim[i] = cosine_similarity(Dt_sel[:,i], q_star)

sorted_docs = dict(sorted(docs_sim.items(), key=lambda item:-item[1]))

for doc_id, similarity in sorted_docs.items():
    print("The doc {0} has similarity {1}".format(doc_id+1, similarity))

The doc 1 has similarity 0.9980518678772611
The doc 4 has similarity 0.7231078789682566
The doc 3 has similarity -0.0023288750529669527
The doc 5 has similarity -0.6551062911362043
The doc 2 has similarity -0.6577609566355738
