Apply Singular Value Decomposition (SVD) for Latent Semantic Analysis (LSA)

1. Create a term-document matrix using TF-IDF weights (TfidfVectorizer) or raw counts (CountVectorizer).
2. Apply SVD to reduce dimensions and extract latent topics.
3. Analyze topics by examining word contributions.
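
As a rough sketch of the factorization behind step 2: assume X is the matrix with one row per document and one column per term (n documents, m terms), which is the orientation scikit-learn's vectorizers produce. Truncated SVD keeps only the k largest singular values. The symbols below are notation introduced here for illustration.

```latex
X \approx U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{n \times k}, \quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k), \quad
V_k \in \mathbb{R}^{m \times k}
```

Each row of U_k Σ_k places a document in the k-dimensional topic space (this is what `fit_transform` returns from `TruncatedSVD`), and each row of V_k^T holds one topic's weights over the vocabulary (exposed as `components_`).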

In [4]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD


documents = [
    "LSA is a technique in natural language processing.",
    "Singular value decomposition is used in LSA ",
    "Topic modeling extracts latent topics from text.",
    "Natural language processing involves machine learning."
]

Step 1: Convert text to a TF-IDF matrix


In [5]:
# Build the TF-IDF matrix; common English stop words are dropped
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)  # sparse matrix: documents x terms


Step 2: Apply SVD for dimensionality reduction (LSA)

In [6]:
num_topics = 2  # number of latent topics to keep
svd = TruncatedSVD(n_components=num_topics)
X_svd = svd.fit_transform(X)  # one row per document, one column per topic
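
One way to judge whether two topics are enough is to inspect the singular values and the share of variance each component retains. This is a hedged sketch under the same toy corpus. The `random_state=0` argument is an addition made here: `TruncatedSVD` uses a randomized solver by default, so fixing the seed keeps reruns consistent.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "LSA is a technique in natural language processing.",
    "Singular value decomposition is used in LSA",
    "Topic modeling extracts latent topics from text.",
    "Natural language processing involves machine learning.",
]

X = TfidfVectorizer(stop_words="english").fit_transform(documents)

# random_state is an assumption added for repeatability, not in the original cell
svd = TruncatedSVD(n_components=2, random_state=0)
X_svd = svd.fit_transform(X)

print("Singular values:", svd.singular_values_)
print("Explained variance ratio:", svd.explained_variance_ratio_)
print("Total variance retained:", svd.explained_variance_ratio_.sum())
```

On a corpus this small the numbers are only illustrative. On real data, a total well below 1 suggests that the kept topics discard much of the structure in the TF-IDF matrix.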

Step 3: Display top words in each topic

In [10]:
terms = vectorizer.get_feature_names_out()
for i, comp in enumerate(svd.components_):
    terms_in_topic = [terms[j] for j in np.argsort(comp)[-5:]]  # 5 highest-weighted terms, ascending
    print(f"Topic {i+1} : {','.join(terms_in_topic)}")
    print('---')
print()

# Display transformed document-topic matrix 
df = pd.DataFrame(X_svd, columns = [f'Topic {i+1}' for i in range(num_topics)])
print('Document Topic Matrix : ')
print(df)

Topic 1 : technique,lsa,language,natural,processing
---
Topic 2 : text,modeling,latent,topics,topic
---

Document Topic Matrix : 
        Topic 1       Topic 2
0  8.598110e-01  2.081668e-17
1  2.782269e-01 -5.677183e-01
2 -4.768412e-17  8.000000e-01
3  8.135507e-01  1.941544e-01
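
The document-topic coordinates can also be used to compare new text against the corpus. The sketch below is an illustration rather than part of the original notebook: it projects a hypothetical query into the same two-topic space and ranks the documents by cosine similarity. The query string and `random_state=0` are assumptions introduced here.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "LSA is a technique in natural language processing.",
    "Singular value decomposition is used in LSA",
    "Topic modeling extracts latent topics from text.",
    "Natural language processing involves machine learning.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)
svd = TruncatedSVD(n_components=2, random_state=0)
X_svd = svd.fit_transform(X)

# Hypothetical query, not part of the original corpus
query = ["natural language processing"]

# transform (not fit_transform) reuses the fitted vocabulary and components
query_svd = svd.transform(vectorizer.transform(query))

# Cosine similarity between the query and every document in topic space
sims = cosine_similarity(query_svd, X_svd)[0]
best = int(np.argmax(sims))

for doc, score in zip(documents, sims):
    print(f"{score:+.3f}  {doc}")
print("Closest document:", documents[best])
```

Cosine similarity is used because it ignores the sign and scale ambiguities of individual SVD components, which can differ between runs and library versions.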
