
# Latent Semantic Analysis

Resources:
  + [LSA (Latent Semantic Analysis) - Minsok Heo](https://www.youtube.com/watch?v=OvzJiur55vo)
  + [Introduction to Latent Semantic Analysis - Databricks Academy](https://www.youtube.com/playlist?list=PLroeQp1c-t3qwyrsq66tBxfR6iX6kSslt)

Content in this notebook is based on the Databricks Academy video series.

<img src="https://user-images.githubusercontent.com/1328807/216838739-0d05bbb0-725b-44bc-b291-46dee457970a.png" width=600></img>


We'll use TF-IDF to turn the text into numeric vectors, SVD to reduce the dimensionality of those vectors, and then explore the semantic relationships between the words and documents that the reduced representation reveals.
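
As a rough sketch of the decomposition involved (the symbols below are notation introduced here for illustration, not taken from the video series): if $X$ is the $m \times n$ document-term matrix, a truncated SVD with $k$ components approximates it as

```latex
X \approx U_k \Sigma_k V_k^{\top},
\qquad U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k \in \mathbb{R}^{k \times k},\quad
V_k \in \mathbb{R}^{n \times k}
```

Each row of $U_k \Sigma_k$ places one document in the $k$-dimensional topic space, and each row of $V_k^{\top}$ describes one topic as a set of weights over the $n$ terms.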


## Setup

In [77]:
import warnings

warnings.filterwarnings("ignore")

## Data Loading

In [78]:
from pandas import DataFrame

df = DataFrame({"text": [
    "the quick brown fox",
    "the slow brown dog",
    "the quick red dog",
    "the lazy yellow fox"
]})
df.index = df["text"]

df

text (index),text
the quick brown fox,the quick brown fox
the slow brown dog,the slow brown dog
the quick red dog,the quick red dog
the lazy yellow fox,the lazy yellow fox


## Vectorization

In [79]:
x_train = df["text"]

### Bag of Words

In [80]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

print("---------------")
print("BAG OF WORDS:")
cv_matrix = cv.fit_transform(x_train)
print(type(cv_matrix))
#print(cv_matrix.todense())

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
cv_df = DataFrame(cv_matrix.toarray(), columns=cv.get_feature_names_out(), index=x_train.index)
cv_df

---------------
BAG OF WORDS:
<class 'scipy.sparse.csr.csr_matrix'>


text,brown,dog,fox,lazy,quick,red,slow,the,yellow
the quick brown fox,1,0,1,0,1,0,0,1,0
the slow brown dog,1,1,0,0,0,0,1,1,0
the quick red dog,0,1,0,0,1,1,0,1,0
the lazy yellow fox,0,0,1,1,0,0,0,1,1
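
To make the table above concrete, here is a minimal sketch that rebuilds the same counts by hand. It assumes a plain whitespace split is enough for this tiny corpus; `CountVectorizer` itself also lowercases the text and applies the `token_pattern` shown in its parameters.

```python
from collections import Counter

docs = [
    "the quick brown fox",
    "the slow brown dog",
    "the quick red dog",
    "the lazy yellow fox",
]

# Vocabulary: every distinct token, sorted alphabetically
# (the same ordering as the columns in the table above).
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per document, one count per vocabulary term.
count_rows = []
for doc in docs:
    counts = Counter(doc.split())  # missing terms count as 0
    count_rows.append([counts[term] for term in vocabulary])

print(vocabulary)
for doc, row in zip(docs, count_rows):
    print(f"{doc:<20} {row}")
```

Each row is the document's bag-of-words vector: word order is discarded and only the per-term counts remain.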


In [81]:
print(cv.get_params())

# get_feature_names() was removed in scikit-learn 1.2
print(cv.get_feature_names_out().tolist())

print(cv.vocabulary_)

#dir(cv)

{'analyzer': 'word', 'binary': False, 'decode_error': 'strict', 'dtype': <class 'numpy.int64'>, 'encoding': 'utf-8', 'input': 'content', 'lowercase': True, 'max_df': 1.0, 'max_features': None, 'min_df': 1, 'ngram_range': (1, 1), 'preprocessor': None, 'stop_words': None, 'strip_accents': None, 'token_pattern': '(?u)\\b\\w\\w+\\b', 'tokenizer': None, 'vocabulary': None}
['brown', 'dog', 'fox', 'lazy', 'quick', 'red', 'slow', 'the', 'yellow']
{'the': 7, 'quick': 4, 'brown': 0, 'fox': 2, 'slow': 6, 'dog': 1, 'red': 5, 'lazy': 3, 'yellow': 8}


### Term Frequency - Inverse Document Frequency (TF-IDF)

In [82]:
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer()

print("---------------")
print("TFIDF:")
tv_matrix = tv.fit_transform(x_train)
print(type(tv_matrix))
#print(tv_matrix.todense())

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tv_df = DataFrame(tv_matrix.toarray(), columns=tv.get_feature_names_out(), index=x_train.index)
tv_df

---------------
TFIDF:
<class 'scipy.sparse.csr.csr_matrix'>


text,brown,dog,fox,lazy,quick,red,slow,the,yellow
the quick brown fox,0.539313,0.0,0.539313,0.0,0.539313,0.0,0.0,0.356966,0.0
the slow brown dog,0.497096,0.497096,0.0,0.0,0.0,0.0,0.630504,0.329023,0.0
the quick red dog,0.0,0.497096,0.0,0.0,0.497096,0.630504,0.0,0.329023,0.0
the lazy yellow fox,0.0,0.0,0.463458,0.587838,0.0,0.0,0.0,0.306758,0.587838
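
The weights above can be reproduced by hand. The sketch below assumes the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm="l2"`): the raw count is multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and each document row is then scaled to unit length. The helper name `tfidf_row` is ours, not part of scikit-learn.

```python
import math

docs = [
    "the quick brown fox",
    "the slow brown dog",
    "the quick red dog",
    "the lazy yellow fox",
]
vocabulary = sorted({token for doc in docs for token in doc.split()})
n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
doc_freq = {term: sum(term in doc.split() for doc in docs) for term in vocabulary}

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_row(doc):
    """Return the L2-normalized TF-IDF vector for one document."""
    tokens = doc.split()
    raw = [tokens.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


tfidf_rows = [tfidf_row(doc) for doc in docs]
for doc, row in zip(docs, tfidf_rows):
    print(f"{doc:<20}", [round(value, 6) for value in row])
```

Note how "the", which appears in every document, gets the smallest weight in each row, while words unique to one document ("slow", "red", "lazy", "yellow") get the largest.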


## Dimensionality Reduction

### Singular Value Decomposition (SVD)

In [83]:
from sklearn.decomposition import TruncatedSVD

n_components = 2

svd = TruncatedSVD(n_components=n_components)

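# LSA: fit the SVD on the bag-of-words counts
# (swap in the commented line below to use the TF-IDF matrix instead)
svd_matrix = svd.fit_transform(cv_matrix)
#svd_matrix = svd.fit_transform(tv_matrix)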

topic_names = [f"topic_{n}" for n in range(1, n_components+1)]
svd_df = DataFrame(svd_matrix, columns=topic_names, index=x_train.index)
svd_df

text,topic_1,topic_2
the quick brown fox,1.694905,0.299524
the slow brown dog,1.515851,-0.76911
the quick red dog,1.515851,-0.76911
the lazy yellow fox,1.266186,1.440585
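
As a cross-check, the same document coordinates can be obtained from a full SVD in NumPy. This is a sketch under the assumption that the count matrix from the Bag of Words section is the input; because singular vectors are only defined up to sign, we compare magnitudes rather than signed values.

```python
import numpy as np

# Count matrix from the Bag of Words section
# (columns: brown, dog, fox, lazy, quick, red, slow, the, yellow).
X = np.array([
    [1, 0, 1, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 1, 0, 1, 0],
    [0, 0, 1, 1, 0, 0, 0, 1, 1],
], dtype=float)

k = 2  # number of topics to keep
U, singular_values, Vt = np.linalg.svd(X, full_matrices=False)

# Document coordinates in topic space: the first k columns of U,
# each scaled by its singular value.
doc_topic = U[:, :k] * singular_values[:k]

# Compare magnitudes, since the sign of each column is arbitrary.
print(np.abs(doc_topic).round(6))
```

The magnitudes should agree with the `topic_1` and `topic_2` columns above, which is what "truncated" means here: only the `k` largest singular values and their vectors are kept.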


In [84]:
#import plotly.express as px
#
#px.scatter(svd_df, x="topic_1", y="topic_2")

<img src="https://user-images.githubusercontent.com/1328807/216840501-42e92a58-8b0b-42aa-ad20-9b43051f3caa.png" width=600></img>
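
One common way to quantify what the scatter plot shows is cosine similarity between document coordinates. The sketch below hard-codes the coordinates from the topic table above; the helper `cosine_similarity` is our own illustration, not something the notebook defines.

```python
import math

# Document coordinates copied from the topic table above.
doc_coords = {
    "the quick brown fox": (1.694905, 0.299524),
    "the slow brown dog": (1.515851, -0.76911),
    "the quick red dog": (1.515851, -0.76911),
    "the lazy yellow fox": (1.266186, 1.440585),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


names = list(doc_coords)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        similarity = cosine_similarity(doc_coords[first], doc_coords[second])
        print(f"{first!r} vs {second!r}: {similarity:.3f}")
```

The two "dog" documents land on exactly the same point, so their similarity is 1 even though they share only "the" and "dog"; with two topics the model cannot tell them apart.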

In [85]:
# get_feature_names() was removed in scikit-learn 1.2
dictionary = cv.get_feature_names_out().tolist()
#dictionary = tv.get_feature_names_out().tolist()
print(dictionary)

['brown', 'dog', 'fox', 'lazy', 'quick', 'red', 'slow', 'the', 'yellow']


In [86]:
svd.components_

array([[ 0.3539373 ,  0.33419932,  0.3264155 ,  0.13957787,  0.3539373 ,
         0.16709966,  0.16709966,  0.66061483,  0.13957787],
       [-0.14025617, -0.4594362 ,  0.5197363 ,  0.43027437, -0.14025617,
        -0.2297181 , -0.2297181 ,  0.0603001 ,  0.43027437]])

In [87]:
svd_encodings_df = DataFrame(svd.components_, columns=dictionary, index=topic_names)
svd_encodings_df

topic,brown,dog,fox,lazy,quick,red,slow,the,yellow
topic_1,0.353937,0.334199,0.326416,0.139578,0.353937,0.1671,0.1671,0.660615,0.139578
topic_2,-0.140256,-0.459436,0.519736,0.430274,-0.140256,-0.229718,-0.229718,0.0603,0.430274
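
These term weights are also what maps a document into topic space: for a fitted `TruncatedSVD`, `transform` multiplies the document's term vector by the transpose of `components_`. The sketch below does that by hand with the weights copied from the table above; the `project` helper and the unseen document "the lazy dog" are hypothetical examples of ours.

```python
# Term weights per topic, copied from the table above.
vocabulary = ["brown", "dog", "fox", "lazy", "quick", "red", "slow", "the", "yellow"]
components = [
    [0.353937, 0.334199, 0.326416, 0.139578, 0.353937, 0.1671, 0.1671, 0.660615, 0.139578],
    [-0.140256, -0.459436, 0.519736, 0.430274, -0.140256, -0.229718, -0.229718, 0.0603, 0.430274],
]


def project(text):
    """Map raw text to topic coordinates: count vector times components transposed."""
    tokens = text.split()
    counts = [tokens.count(term) for term in vocabulary]
    return [sum(c * w for c, w in zip(counts, topic)) for topic in components]


# A training document: should land on its row in the document-topic table.
print([round(value, 6) for value in project("the quick brown fox")])

# A hypothetical unseen document built from known words.
print([round(value, 6) for value in project("the lazy dog")])
```

For "the quick brown fox" this simply adds up the weights of "brown", "fox", "quick", and "the" within each topic, which recovers that document's coordinates from the earlier table up to rounding.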


In [88]:
svd_encodings_df.T.sort_values(by="topic_2", ascending=False)

term,topic_1,topic_2
fox,0.326416,0.519736
lazy,0.139578,0.430274
yellow,0.139578,0.430274
the,0.660615,0.0603
brown,0.353937,-0.140256
quick,0.353937,-0.140256
slow,0.1671,-0.229718
red,0.1671,-0.229718
dog,0.334199,-0.459436


In [None]:
# compare absolute values
# ... https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.abs.html

In [89]:
svd_encodings_df.abs().T.sort_values(by="topic_2", ascending=False)

term,topic_1,topic_2
fox,0.326416,0.519736
dog,0.334199,0.459436
lazy,0.139578,0.430274
yellow,0.139578,0.430274
red,0.1671,0.229718
slow,0.1671,0.229718
quick,0.353937,0.140256
brown,0.353937,0.140256
the,0.660615,0.0603


In [90]:
svd_encodings_df.abs().T.sort_values(by="topic_1", ascending=False)

term,topic_1,topic_2
the,0.660615,0.0603
quick,0.353937,0.140256
brown,0.353937,0.140256
dog,0.334199,0.459436
fox,0.326416,0.519736
red,0.1671,0.229718
slow,0.1671,0.229718
lazy,0.139578,0.430274
yellow,0.139578,0.430274


In [91]:
#import plotly.express as px
#
#chart_df = svd_encodings_df.abs().T
#chart_df["label"] = chart_df.index
#
#fig = px.scatter(chart_df, x="topic_1", y="topic_2", text="label")
#
#fig.update_traces(textposition='top center')
#
#fig.show()

## Pipeline

In [96]:
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.decomposition import TruncatedSVD


def lsa_pipeline(x_train=x_train, stop_words=None, tokenizer=None, n_components=2):
    """Vectorize the text with TF-IDF, reduce it with SVD, and return the term weights per topic."""

    print("---------------")
    print("TFIDF:")
    # with the default None values this behaves exactly like TfidfVectorizer()
    tv = TfidfVectorizer(stop_words=stop_words, tokenizer=tokenizer)
    tv_matrix = tv.fit_transform(x_train)
    print(type(tv_matrix))
    #print(tv_matrix.todense())
    # get_feature_names() was removed in scikit-learn 1.2
    feature_names = tv.get_feature_names_out()
    tv_df = DataFrame(tv_matrix.toarray(), columns=feature_names, index=x_train.index)
    print(tv_df.head())

    print("---------------")
    print("SVD:")
    svd = TruncatedSVD(n_components=n_components)
    svd_matrix = svd.fit_transform(tv_matrix)
    topic_names = [f"topic_{n}" for n in range(1, n_components+1)]
    svd_df = DataFrame(svd_matrix, columns=topic_names, index=x_train.index)
    print(svd_df.head())

    # use this vectorizer's own vocabulary rather than the global `dictionary`
    svd_encodings_df = DataFrame(svd.components_, columns=feature_names, index=topic_names)
    encodings_df = svd_encodings_df.abs().T.sort_values(by="topic_2", ascending=False)
    return encodings_df
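
For comparison, the same two steps can be chained with scikit-learn's own `make_pipeline`. This is a minimal sketch rather than a replacement for the function above; `random_state=0` is an assumption added only so repeated runs agree, and "the lazy dog" is a hypothetical unseen document.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

docs = [
    "the quick brown fox",
    "the slow brown dog",
    "the quick red dog",
    "the lazy yellow fox",
]

# TF-IDF vectorization followed by a 2-component truncated SVD.
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
doc_topic = lsa.fit_transform(docs)

# One row per document, one column per topic.
print(doc_topic.shape)

# The fitted pipeline can project unseen text into the same topic space.
print(lsa.transform(["the lazy dog"]))
```

Wrapping both steps in one estimator keeps the vocabulary and the SVD fitted together, which avoids mixing a vectorizer's feature names with components fitted on a different matrix.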




In [97]:
lsa_pipeline()

---------------
TFIDF:
<class 'scipy.sparse.csr.csr_matrix'>
                        brown       dog       fox      lazy     quick  \
text                                                                    
the quick brown fox  0.539313  0.000000  0.539313  0.000000  0.539313   
the slow brown dog   0.497096  0.497096  0.000000  0.000000  0.000000   
the quick red dog    0.000000  0.497096  0.000000  0.000000  0.497096   
the lazy yellow fox  0.000000  0.000000  0.463458  0.587838  0.000000   

                          red      slow       the    yellow  
text                                                         
the quick brown fox  0.000000  0.000000  0.356966  0.000000  
the slow brown dog   0.000000  0.630504  0.329023  0.000000  
the quick red dog    0.630504  0.000000  0.329023  0.000000  
the lazy yellow fox  0.000000  0.000000  0.306758  0.587838  
---------------
SVD:
                      topic_1   topic_2
text                                   
the quick brown fox  0.8143