# ECO6128 Tutorial - SVD and LSI

*Latent Semantic Indexing (LSI)* is a method for discovering hidden concepts in document data. Each document and each term (word) is expressed as a vector whose elements correspond to these concepts, and each element gives the degree to which the document or term participates in that concept. The goal is not to describe the concepts verbally, but to represent documents and terms in a unified way that exposes document-document, document-term, and term-term similarities or semantic relationships that are otherwise hidden.

Created by *Xinghao YU*, March 18th, 2023. For more details, please refer to [./Refer - SVD Tutorial (Alex Thomo).pdf]

*Copyright@Chinese University of Hong Kong, Shenzhen*
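
To make the idea concrete before turning to scikit-learn, here is a minimal NumPy sketch of the decomposition behind LSI, $X \approx U_k \Sigma_k V_k^\top$. It is not part of the original tutorial code: the small count matrix `X` (documents in rows, terms in columns) is invented purely for illustration, and `k = 2` concepts are kept.

```python
import numpy as np

# Toy count matrix (assumed for illustration): 4 documents x 5 terms.
X = np.array([
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of concepts to keep
doc_concept = U[:, :k] * s[:k]       # documents in concept space (U_k Sigma_k)
term_concept = Vt[:k, :].T * s[:k]   # terms in concept space (V_k Sigma_k)

# Rank-k reconstruction of X from the k strongest concepts.
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(doc_concept.shape, term_concept.shape)
print(np.round(X_k, 2))
```

Each row of `doc_concept` and `term_concept` is a point in the same 2-dimensional concept space, which is what allows documents and terms to be compared with one another.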

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (6, 6)

In [None]:
# Simplified sentences: most function words dropped and word forms reduced by hand (e.g. 'die' instead of 'died')
corpus = ['Romeo Juliet.',
          'Juliet happy dagger!',
          'Romeo die dagger.',
          '“Live free die”, that’s the Hampshire’s',
          'Hampshire is in.']

In [None]:
# Convert the raw documents to a term-frequency (tf) matrix: no normalization, no idf weighting
vectorizer = TfidfVectorizer(stop_words='english',
                             norm=None,
                             use_idf=False)
X = vectorizer.fit_transform(corpus)

In [None]:
# check the key terms
vectorizer.get_feature_names_out()

In [None]:
# check the sparse matrix
print(X)

In [None]:
# SVD to reduce dimensionality: here we keep only 2 concepts
svd_model = TruncatedSVD(n_components=2,
                         algorithm='randomized',
                         n_iter=5)
dc_matrix = svd_model.fit_transform(X)
# output: the document-concept matrix, scaled by the singular values (U * Sigma)
document_concept_matrix = pd.DataFrame(dc_matrix)

# label the rows d1, d2, ...
document_concept_matrix.index = [f'd{i + 1}' for i in range(document_concept_matrix.shape[0])]
document_concept_matrix

In [None]:
# svd_model.components_ holds the right singular vectors of the input data, i.e. the concept-term matrix (V^T)
# svd_model.singular_values_ holds the singular value of each selected component
# What we need is the scaled term-concept matrix, V * Sigma
tc_matrix = np.dot(svd_model.components_.T, np.diag(svd_model.singular_values_))
term_concept_matrix = pd.DataFrame(tc_matrix)
term_concept_matrix.index = vectorizer.get_feature_names_out()
term_concept_matrix
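
The relationships stated in the comments above can be checked numerically. The sketch below rebuilds the same model and confirms that the document coordinates are the projection of `X` onto the right singular vectors, and that each column of the document-concept matrix has length equal to its singular value. `random_state=0` is an addition made here for repeatability and is not in the original cells.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['Romeo Juliet.',
          'Juliet happy dagger!',
          'Romeo die dagger.',
          '“Live free die”, that’s the Hampshire’s',
          'Hampshire is in.']
X = TfidfVectorizer(stop_words='english', norm=None,
                    use_idf=False).fit_transform(corpus)
X_dense = X.toarray()

svd_model = TruncatedSVD(n_components=2, algorithm='randomized',
                         n_iter=5, random_state=0)  # random_state: assumption
dc_matrix = svd_model.fit_transform(X)

V = svd_model.components_.T           # terms x concepts
sigma = svd_model.singular_values_    # one singular value per concept

# Document coordinates: U * Sigma, which equals X @ V.
print(np.allclose(dc_matrix, X_dense @ V))

# Each document-concept column has Euclidean length sigma_i.
print(np.allclose(np.linalg.norm(dc_matrix, axis=0), sigma))

# Scaled term coordinates: V @ diag(sigma), written with broadcasting.
tc_matrix = V * sigma
print(np.allclose(tc_matrix, np.dot(V, np.diag(sigma))))
```

The broadcasting form `V * sigma` avoids building the diagonal matrix and gives the same result as `np.dot(V, np.diag(sigma))`.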

In [None]:
# plot all vectors
document_term = pd.concat([document_concept_matrix, term_concept_matrix])
plt.scatter(x = document_term[0], y = document_term[1])
# add labels to all points
for idx, row in document_term.iterrows(): 
    plt.text(row[0], row[1], idx)

## What if we use tf-idf with normalization?

In [None]:
corpus = ['Romeo and Juliet.',
          'Juliet: O happy dagger!',
          'Romeo died by dagger.',
          '“Live free or die”, that’s the New-Hampshire’s motto.',
          'Did you know, New-Hampshire is in New-England.']

In [None]:
# raw documents to tf-idf matrix (rows are L2-normalized by default, norm='l2'):
vectorizer = TfidfVectorizer(stop_words='english',
                             use_idf=True,
                             smooth_idf=True)
# SVD to reduce dimensionality:
svd_model = TruncatedSVD(n_components=2,
                         algorithm='randomized',
                         n_iter=5)
# pipeline of tf-idf + SVD, fitted to and applied to the documents:
svd_transformer = Pipeline([('tfidf', vectorizer),
                            ('svd', svd_model)])
dc_matrix = svd_transformer.fit_transform(corpus)
# dc_matrix holds the document coordinates in concept space: use it to compare documents,
# or to compare a query with the documents after projecting the query with svd_transformer.transform(...).
# Comparing words requires the term-concept matrix, which is built in the next cell.
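
As a sketch of the query-document comparison mentioned above, a query can be passed through the fitted pipeline and ranked against the documents. The query text `'dagger die'`, the use of cosine similarity, and `random_state=0` are assumptions made for this example and do not come from the original cells.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import Pipeline

corpus = ['Romeo and Juliet.',
          'Juliet: O happy dagger!',
          'Romeo died by dagger.',
          '“Live free or die”, that’s the New-Hampshire’s motto.',
          'Did you know, New-Hampshire is in New-England.']

svd_transformer = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', use_idf=True, smooth_idf=True)),
    ('svd', TruncatedSVD(n_components=2, random_state=0)),  # random_state: assumption
])
dc_matrix = svd_transformer.fit_transform(corpus)

# Project a hypothetical query into the same 2-concept space.
query = ['dagger die']
query_vec = svd_transformer.transform(query)   # shape (1, 2)

# Rank the documents by cosine similarity to the query, best match first.
scores = cosine_similarity(query_vec, dc_matrix).ravel()
ranking = np.argsort(-scores)
for i in ranking:
    print(f'd{i + 1}: {scores[i]:.3f}')
```

Note that the vectorizer does no stemming, so `die` in the query matches the `die` of the fourth document but not the `died` of the third.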

In [None]:
document_concept_matrix = pd.DataFrame(dc_matrix)

# label the rows d1, d2, ...
document_concept_matrix.index = [f'd{i + 1}' for i in range(document_concept_matrix.shape[0])]

tc_matrix = np.dot(svd_model.components_.T, np.diag(svd_model.singular_values_))
term_concept_matrix = pd.DataFrame(tc_matrix)
term_concept_matrix.index = vectorizer.get_feature_names_out()

In [None]:
document_term = pd.concat([document_concept_matrix, term_concept_matrix])
document_term

In [None]:
# plot all vectors
plt.scatter(x = document_term[0], y = document_term[1])
# add labels to all points
for idx, row in document_term.iterrows(): 
    plt.text(row[0], row[1], idx)