# Latent Semantic Indexing

---
**Author**: Marko Bajec

**Last update**: 4.5.2019

**Description**: In this example we show how to use **Singular Value Decomposition (SVD)** to transform the terms and documents of a document corpus into a *k-concept space*, and how to use the transformed vector space for Information Retrieval.  
**Required libraries** (use pip3):
* <code>numpy</code>
* <code>scipy</code>
* <code>sklearn</code> (installed with pip as <code>scikit-learn</code>)
* <code>lemmagen</code>
* <code>nltk</code> (the stopword lists must be downloaded once with <code>nltk.download('stopwords')</code>)
* <code>matplotlib</code>

---
## Document corpus
Let's say we have the following 5 sentences representing the **document corpus**:
* d<sub>1</sub>: *Romeo and Juliet.*
* d<sub>2</sub>: *Juliet: O happy dagger!*
* d<sub>3</sub>: *Romeo died by dagger.*
* d<sub>4</sub>: *“Live free or die”, that’s the New-Hampshire’s motto.*
* d<sub>5</sub>: *Did you know, New-Hampshire is in New-England.*

**Question**: How close (relevant) are the above sentences to the following query: $q=$<span style="color:blue">*died, dagger*</span>?

Recall that we used the same corpus in the exercise on *Vector Space Models*. There, the best-matching documents for our query were $d_3$ and $d_2$. The document $d_1$, however, was ranked very low, even though *Romeo and Juliet*, the play by William Shakespeare, is clearly related to the query: Juliet dies by a dagger. The problem is that $d_1$ contains none of the query words. We expect that with **LSI** and **SVD**, documents containing words that often co-occur will move closer together. This should bring the document $d_1$ closer to our query.
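
The mismatch described above can be reproduced with a minimal sketch. It assumes hand-lemmatized token lists for $d_1$ to $d_3$ and raw term counts (no weighting), so the numbers are illustrative and will not match the scores from the earlier exercise; the helper names `tf_vector` and `cosine` are hypothetical.

```python
import numpy as np

# Hypothetical, hand-lemmatized tokens for d1..d3 and the query
# (assumed for illustration only).
docs = [["romeo", "juliet"],
        ["juliet", "happy", "dagger"],
        ["romeo", "die", "dagger"]]
query = ["die", "dagger"]

vocab = sorted({t for d in docs for t in d})

def tf_vector(tokens):
    # Raw term counts in vocabulary order.
    return np.array([tokens.count(t) for t in vocab], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = tf_vector(query)
sims = [cosine(tf_vector(d), q) for d in docs]
for i, sim in enumerate(sims, start=1):
    print(f"d{i}: {sim:.4f}")
```

Because $d_1$ shares no term with the query, its similarity is exactly zero in this plain term space, no matter how related the topics are.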


## Python implementation
In the Python program below we use the <code>scipy</code> implementation of SVD to decompose our term-document matrix as $A=U \Sigma V^T$. Since the matrix is small, i.e. $6 \times 11$, we could work with the full decomposition (although that would not remove any noise). Instead, we keep only two dimensions (i.e. $k=2$) to demonstrate the complete SVD procedure, including the *reduction* of matrix dimensions. 
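
As a rough illustration of what the reduction does, the sketch below decomposes a small made-up matrix and rebuilds it from the two largest singular values only. It uses `np.linalg.svd` rather than the <code>scipy</code> routine used in this notebook, and the matrix values are hypothetical.

```python
import numpy as np

# Hypothetical 4x3 term-document matrix (rows = terms, columns = documents).
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

# Thin SVD: U is 4x3, s holds 3 singular values in descending order, VT is 3x3.
U, s, VT = np.linalg.svd(A, full_matrices=False)

# The full factorization reproduces A up to rounding error.
A_full = U @ np.diag(s) @ VT

# Keep only the k largest singular values and the matching vectors.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ VT[:k, :]

# The Frobenius error of the rank-k approximation equals the norm
# of the singular values that were dropped.
err = np.linalg.norm(A - A_k)
print(err, np.sqrt(np.sum(s[k:] ** 2)))
```

The two printed numbers agree, which shows that truncation discards exactly the part of $A$ carried by the smallest singular values.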

To compare the query and the documents in the *k-concept space*, we use **Cosine similarity** as implemented in <code>sklearn</code>. The same library is also used to build the **tf matrix**.   

### Step 1: preprocess Corpus
In this step we import the required libraries, define two supporting functions (one for preprocessing text and one for reducing the size of a two-dimensional matrix; we will need both later) and preprocess the corpus. More specifically, we remove *punctuation*, convert everything to *lowercase* and *lemmatize*; *stopwords* are removed when the vocabulary is built. At the end we print out the Corpus vocabulary.  
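
To make the individual transformations concrete, here is a minimal standard-library sketch of the same idea without lemmatization. The function name `simple_preprocess` and the tiny stopword set are hypothetical; the notebook itself relies on <code>lemmagen</code> and the full <code>nltk</code> stopword list.

```python
import string

# Hypothetical, tiny stopword list used only for this sketch.
STOPWORDS = {"and", "by", "o", "the", "is", "in", "you", "or"}

def simple_preprocess(sentences):
    # Translation table that deletes ASCII punctuation and curly quotes.
    table = str.maketrans("", "", string.punctuation + "’”")
    cleaned = [s.lower().translate(table) for s in sentences]
    # Vocabulary: unique words that are not stopwords, sorted for stable output.
    vocab = sorted({w for s in cleaned for w in s.split() if w not in STOPWORDS})
    return cleaned, vocab

cleaned, vocab = simple_preprocess(["Romeo and Juliet.", "Juliet: O happy dagger!"])
print(cleaned)  # ['romeo and juliet', 'juliet o happy dagger']
print(vocab)    # ['dagger', 'happy', 'juliet', 'romeo']
```

Note that deleting punctuation also deletes hyphens, so a token such as `New-Hampshire` becomes the single word `newhampshire`.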

In [None]:
import numpy as np
import string
from scipy.linalg import svd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
import lemmagen.lemmatizer
from lemmagen.lemmatizer import Lemmatizer
from nltk.corpus import stopwords
import matplotlib.pyplot as plt

# set figure size for plots [width, height]
plt.rcParams['figure.figsize'] = [12, 12]

# This is a simple function that processes the corpus by applying the following transformations: 
# * transform to lowercase, 
# * remove punctuation,
# * lemmatize
# All of these transformations are optional. 
# After transformations are applied, a vocabulary is created by removing stopwords (optional) and duplicates
# Preprocess works for Slovene or English. Use the parameter lang to setup the target language (slovene or english). 
# By default it is set to english. For removing SLO stopwords with nltk, you need a list of SLO stopwords. 
# The list is included in the CLARIN database. See here: https://www.clarin.si/repository/xmlui/handle/11356/1109 
def preproces(corpus, rm_punctuation=None, lemmatize=None, rm_stopwords=None, lowercase=None, lang=None):
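    # fall back to defaults for arguments that were not supplied
    if rm_punctuation is None:
        rm_punctuation = 1
    if lemmatize is None:
        lemmatize = 1
    if rm_stopwords is None:
        rm_stopwords = 1
    if lowercase is None:
        lowercase = 1
    if lang is None:
        # assign to lang itself; it is used below for lemmatization and stopwords
        lang = 'english'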
    # put to lowercase
    if lowercase:
        corpus = [s.lower() for s in corpus]
    # remove punctuation
    if rm_punctuation:
        punct = string.punctuation + '’' + '”'
        corpus = [s.translate(str.maketrans('', '', punct)) for s in corpus]
    # lemmatize
    if lemmatize:
        lemmatized_corpus = []
        if lang=='slovene':
            lemmatizer = Lemmatizer(dictionary=lemmagen.DICTIONARY_SLOVENE)            
        else:
            lemmatizer = Lemmatizer(dictionary=lemmagen.DICTIONARY_ENGLISH)
        for l in corpus:
            lemmatized_word_list = []
            for word in l.split():
                lemmatized_word_list.append(lemmatizer.lemmatize(word))
            lemmatized_corpus.append(' '.join(lemmatized_word_list))
        # replace the corpus once all sentences have been lemmatized
        corpus = lemmatized_corpus
    # create vocabulary
    vocab = []
    for doc in corpus:
        for word in doc.split():
            vocab.append(word)
    # remove stopwords (only if requested)
    if rm_stopwords:
        stop_words = set(stopwords.words(lang))
        vocab = [word for word in vocab if word not in stop_words]

    #remove duplicates
    vocab = set(vocab)
    
    return corpus, vocab

# reduce_matrix returns the two-dimensional matrix A truncated to its first k elements along the dimension dim.
# dim can be 0 (meaning rows), 1 (meaning columns), or 2 (meaning rows and columns)
def reduce_matrix(A, k, dim):
    if dim not in {0, 1 ,2}:
        return A
    if dim == 2:
        k_row = min(k, A.shape[0])
        k_col = min(k, A.shape[1])
        return A[0:k_row, 0:k_col]
    k_dim = min(k, A.shape[dim])
    if dim == 1:
        return A[:, 0:k_dim]
    if dim == 0:
        return A[0:k_dim, :]
    

# document corpus
corpus = ["Romeo and Juliet.", 
          "Juliet: O happy dagger!", 
          "Romeo died by dagger.", 
          "'Live free or die'”, that’s the New-Hampshire’s motto.", 
          "Did you know, New-Hampshire is in New-England."
         ]

# query
query = "died, dagger"

# process the corpus by removing punctuation, stopwords, change to lowercase, create vocabulary
corpus_processed, corpus_vocabulary = preproces(corpus, rm_punctuation=True, 
                                                        lemmatize=True, 
                                                        lowercase=True, 
                                                        rm_stopwords=True,
                                                        lang='english') 

print(" ")
print('Corpus vocabulary')
print("=======================================================================================")
print(corpus_vocabulary)

### Step 2: create TF matrix
In the second step we create *term-frequency matrix* for the Corpus. For this purpose we use Count Vectorizer that comes with <code>sklearn</code> library. As a result, we print out the feature names (terms from our vocabulary in the order as they appear in the TF matrix), dimensions of the TF matrix and its content.
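
For intuition, the same kind of matrix can be built by hand. The sketch below assumes three already pre-processed documents (hypothetical strings) and counts terms directly with <code>numpy</code>, with terms as rows and documents as columns.

```python
import numpy as np

# Hypothetical pre-processed documents (assumed for illustration).
docs = ["romeo juliet", "juliet happy dagger", "romeo die dagger"]
vocab = sorted({w for d in docs for w in d.split()})

# Rows are terms, columns are documents, entries are raw counts.
tf = np.zeros((len(vocab), len(docs)), dtype=int)
for j, doc in enumerate(docs):
    for word in doc.split():
        tf[vocab.index(word), j] += 1

print(vocab)
print(tf)
```

This orientation corresponds to the transposed output of the Count Vectorizer, which by itself returns one row per document.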

In [None]:
# create term-frequency matrix
cv =  CountVectorizer(vocabulary = corpus_vocabulary)
tf_array = cv.fit_transform(corpus_processed).transpose()

# print feature names and TF matrix
print(" ")
print('Feature names and TF matrix')
print("=======================================================================================")
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
corpus_vocabulary = list(cv.get_feature_names_out())
print(corpus_vocabulary)
print(" ")
print(tf_array.get_shape())
print(tf_array.todense())

### Step 3: transform TF matrix with SVD
In this step we use SVD from the <code>scipy</code> library to get the required matrix decomposition. Then we reduce the matrices to size k and print out the term representations, document representations and singular values, all for the **k-concept space**.
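
The roles of the three reduced matrices can be checked on a toy example. The sketch below uses a hypothetical 5x3 matrix and `np.linalg.svd` in place of the <code>scipy</code> call; it verifies that projecting the original document columns onto the k concepts yields the document coordinates scaled by the singular values.

```python
import numpy as np

# Hypothetical term-document matrix: 5 terms (rows) x 3 documents (columns).
A = np.array([[0., 1., 1.],
              [0., 0., 1.],
              [0., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.]])

U, s, VT = np.linalg.svd(A, full_matrices=False)

k = 2
Uk = U[:, :k]          # one row per term
Sk = np.diag(s[:k])    # k x k diagonal matrix of singular values
VTk = VT[:k, :]        # one column per document

# Projecting the original columns onto the k concepts gives the
# document coordinates scaled by the singular values.
projected = Uk.T @ A
print(np.allclose(projected, Sk @ VTk))
```

Plain slicing such as `U[:, :k]` plays the same role here as a dedicated reduction helper.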

In [None]:
# transform the term-frequency matrix with SVD 
# note that by default CountVectorizer uses sparse matrices - that's why we need the dense version

U, s, VT = svd(tf_array.todense())
Uk = reduce_matrix(U, 2, 1)
VTk = reduce_matrix(VT, 2, 0)
sk = reduce_matrix(np.diag(s), 2, 2)

# print term vectors in k-concept space (i.e. row vectors of U)
print(" ")
print('Term representations')
print("=======================================================================================")
for i in range(0, len(corpus_vocabulary)):
    print('{:15s} {:7.4f}  {:7.4f}'.format( corpus_vocabulary[i], Uk[i][0], Uk[i][1] )) 

# print document vectors in k-concept space (i.e. column vectors of VT)
print(" ")
print('Document representations')
print("=======================================================================================")
for i in range(0, len(corpus)):
    print('{:45s} {:7.4f}  {:7.4f}'.format( corpus_processed[i], VTk[0][i], VTk[1][i] )) 
    
# print singular values
print(" ")
print('Reduced singular values')
print("=======================================================================================")
print(sk)

### Step 4: calculate query representation in the k-concept space
In step 4, we transform the query so that it is represented as another document in the k-concept space. Its representation is then printed out.
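
The mapping used here is commonly written as $q_k = q^T U_k \Sigma_k^{-1}$. The sketch below applies it to a hypothetical 5x3 matrix (again with `np.linalg.svd`) and includes a sanity check: folding in a column that is already part of the matrix reproduces that document's coordinates. The helper name `fold_in` is an assumption made for this example.

```python
import numpy as np

# Hypothetical term-document matrix: 5 terms (rows) x 3 documents (columns).
A = np.array([[0., 1., 1.],
              [0., 0., 1.],
              [0., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.]])

U, s, VT = np.linalg.svd(A, full_matrices=False)
k = 2
Uk = U[:, :k]
VTk = VT[:k, :]
sk_inv = np.diag(1.0 / s[:k])  # inverse of the diagonal matrix of singular values

def fold_in(tf_vector):
    # q_k = q^T U_k Sigma_k^{-1}
    return tf_vector @ Uk @ sk_inv

# Sanity check: an existing document column maps onto its own coordinates.
d3_k = fold_in(A[:, 2])
print(np.allclose(d3_k, VTk[:, 2]))

# A query that contains the terms in rows 0 and 1 of the matrix.
q = np.array([1., 1., 0., 0., 0.])
qk = fold_in(q)
print(qk)
```

Because the query lands in the same coordinate system as the columns of the reduced $V^T$, it can be compared with documents directly.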

In [None]:
# calculate query representation in the k-concept space
# first pre-process the original query
query_processed, vocab_query = preproces([query])
# use the same Count Vectorizer (with its fixed vocabulary) to get the TF representation of q
qT = cv.transform(query_processed).todense().tolist()[0]
# calculate the inverse of the sk matrix
sk_inv = np.linalg.inv(sk)
# multiply qT, Uk and sk_inv to get qk
qk = np.matmul(np.matmul(qT, Uk), sk_inv)

# print qk
print(" ")
print('Query representation in the k-concept space')
print("=======================================================================================")
print(qk)

### Step 5: calculate Cosine similarity 
For each document in the Corpus, we calculate and print out its *Cosine similarity* to the query. For the similarity calculation we use the implementation available in the <code>sklearn</code> library.
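
For reference, the same similarity can be computed for all documents at once and turned into a ranking. The sketch below uses plain <code>numpy</code> and hypothetical two-dimensional coordinates; it is not the <code>sklearn</code> call used in this notebook.

```python
import numpy as np

# Hypothetical 2-D coordinates: one column per document, plus a query vector.
doc_vectors = np.array([[0.9, 0.1, 0.5],
                        [0.1, 0.9, 0.5]])
query_vector = np.array([1.0, 0.0])

# Cosine similarity of the query with every column at once.
dots = query_vector @ doc_vectors
norms = np.linalg.norm(doc_vectors, axis=0) * np.linalg.norm(query_vector)
similarities = dots / norms

# Indices of the documents from most to least similar.
ranking = np.argsort(-similarities)
for idx in ranking:
    print(f"d{idx + 1}: {similarities[idx]:.4f}")
```

Cosine similarity depends only on the angle between vectors, so a document with small coordinates can still rank highly if it points in the same direction as the query.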

In [None]:
# for each document as represented in VTk, calculate its cosine similarity to qk and print it
print(" ")
print('Cosine similarity (di, q)')
print("=======================================================================================")
for i in range (0, len(corpus)):
    print('{:5s} {:7.4f}'.format('d'+str(i+1), cosine_similarity( [VTk[:,i]], [qk])[0][0]))

### Step 6: plot documents, query and terms in the k-concept space 
Finally, we plot the vectors representing the documents, the query and the terms from our vocabulary in the k-concept space. We use the <code>matplotlib</code> library for this.

In [None]:
# Plot docs and the query in the k-concept space
# draw axes
ax = plt.axes()
# draw docs
for i in range (0, len(corpus)):
    ax.arrow(0, 0, VTk[0,i], VTk[1,i])
    plt.scatter(VTk[0,i], VTk[1,i], color='black')
    ax.annotate('d'+str(i+1), (VTk[0,i]+0.02,VTk[1,i]),fontsize=14)

# draw query
ax.arrow(0, 0, qk[0], qk[1], color = 'red')
plt.scatter(qk[0], qk[1], color='red')
ax.annotate('q', (qk[0]+0.02,qk[1]),fontsize=14, color='red')

# draw terms - only points
for i in range (0, len(corpus_vocabulary)):
    plt.scatter(Uk[i,0], Uk[i,1], color='blue')
    ax.annotate(corpus_vocabulary[i], (Uk[i,0]+0.02,Uk[i,1]+0.0),fontsize=11, color='blue')

plt.grid()

plt.xlim(0,1)
plt.ylim(-0.4,0.8)

plt.title('Query and corpus documents in the k-concept space',fontsize=10)

plt.show()
plt.close()

## Results
Observe that document $d_1$ (*'Romeo and Juliet.'*) is ranked much higher with **Latent Semantic Indexing** than with our **vector space model**, where its Cosine similarity to the query was zero.  

| Document | Dot product (VSM) | Cosine similarity (VSM) | Cosine similarity (LSI) |
| -------- | -----------------:| -----------------------:| -----------------------:|
| $d_1$  |           0 |                 0 | 0.8617|
| $d_2$  |       0.196 |             0.221 | 0.8440|
| $d_3$  |       0.392 |             0.666 | 0.9886|
| $d_4$  |       0.152 |             0.146 | 0.4350|
| $d_5$  |           0 |                 0 | 0.1486|
