# 🚀 Latent Semantic Analysis (LSA) - Day 35

Welcome to Day 35 of the **100 Days of Data Science & AI** series! Today, we explore **Latent Semantic Analysis (LSA)**, a foundational technique in Natural Language Processing (NLP) used to discover hidden (latent) themes within a collection of documents.

---

## 🧐 What is LSA?

Latent Semantic Analysis (LSA) is a technique that uses **Singular Value Decomposition (SVD)** to reduce the dimensionality of a Term-Document Matrix. By doing so, it groups words that are used in similar contexts, effectively identifying "topics" or "concepts."

### Key Concepts:
1. **Semantic Structure**: LSA assumes that words that are close in meaning will occur in similar pieces of text.
2. **Noise Reduction**: By keeping only the top singular values, we filter out noise (random word variations) and focus on the core semantic structure.
3. **Topic Modeling**: It helps us answer: "What are these documents actually about?"
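
To make the first two ideas concrete, here is a minimal NumPy sketch on a hand-made term-document count matrix. The six terms and four documents are hypothetical, chosen so that "car" and "automobile" never share a document but do share neighbors. It is an illustration under those assumptions, not a result from the dataset used later.

```python
import numpy as np

terms = ["car", "automobile", "engine", "wheel", "planet", "orbit"]

# Hypothetical term-document counts (rows = terms, columns = 4 toy documents).
# "car" and "automobile" never co-occur, but both appear with "engine" and "wheel".
A = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [1, 1, 0, 0],  # wheel
    [0, 0, 1, 1],  # planet
    [0, 0, 1, 1],  # orbit
], dtype=float)


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Full SVD, then keep only the k largest singular values
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]  # each row is a term in the k-dimensional latent space

car, auto, planet = (terms.index(t) for t in ("car", "automobile", "planet"))
raw_sim = cosine(A[car], A[auto])                           # similarity on raw counts
latent_sim = cosine(term_vecs[car], term_vecs[auto])        # similarity after truncation
latent_cross = cosine(term_vecs[car], term_vecs[planet])    # unrelated pair

print(f"raw car/automobile:    {raw_sim:.3f}")
print(f"latent car/automobile: {latent_sim:.3f}")
print(f"latent car/planet:     {latent_cross:.3f}")
```

On raw counts the two words look unrelated because they share no document. After truncation they land on the same latent direction, while "planet" stays orthogonal to both.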

---

## 🛠️ The Mechanics of LSA

1. **Pre-processing**: Cleaning text (lowercase, removing stopwords, stemming/lemmatization).
2. **Vectorization (TF-IDF)**: Converting text into numerical vectors in which words that appear in most documents, and so carry little information (like 'the'), are down-weighted.
3. **Truncated SVD**: Decomposing the large TF-IDF matrix into three smaller matrices: $U$, $\Sigma$, and $V^T$. We keep only the top $k$ dimensions.
4. **Cosine Similarity**: Comparing documents or words in this new "latent" space.
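
As a compact reference for steps 3 and 4, take $X$ to be the documents-by-terms TF-IDF matrix (the orientation scikit-learn uses). The symbols $r$, $\sigma_j$, $u_j$, $v_j$, and $d_i$ are notation introduced here for illustration:

$$
X = U \Sigma V^T = \sum_{j=1}^{r} \sigma_j \, u_j v_j^T
\quad\Longrightarrow\quad
X_k = U_k \Sigma_k V_k^T = \sum_{j=1}^{k} \sigma_j \, u_j v_j^T, \qquad k < r
$$

$$
\mathrm{sim}(d_a, d_b) = \frac{d_a \cdot d_b}{\lVert d_a \rVert \, \lVert d_b \rVert},
\qquad d_i = \text{row } i \text{ of } U_k \Sigma_k
$$

In scikit-learn terms, `TruncatedSVD.fit_transform` returns $U_k \Sigma_k$ (one row per document) and `components_` holds $V_k^T$ (one row per latent component).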

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from wordcloud import WordCloud

sns.set(style="whitegrid", palette="pastel")

## 📂 Loading the Dataset

We'll use the **20 Newsgroups** dataset, focusing on a few distinct categories to see if LSA can correctly identify them.

In [None]:
categories = ['rec.sport.baseball', 'sci.space', 'talk.politics.mideast']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)

print(f"Total documents: {len(newsgroups.data)}")
print(f"Example text:\n{newsgroups.data[0][:200]}...")

## 🧹 Step 1: Text Vectorization (TF-IDF)

We convert our text into a **TF-IDF (Term Frequency-Inverse Document Frequency)** matrix. This serves as the input for our SVD algorithm.

In [None]:
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000, max_df=0.5, min_df=2)
X_tfidf = vectorizer.fit_transform(newsgroups.data)

print(f"TF-IDF Matrix Shape: {X_tfidf.shape}")

## 🧱 Step 2: Singular Value Decomposition (Truncated SVD)

We apply truncated SVD to reduce our 1000-dimensional TF-IDF space to 3 latent components (matching our 3 categories), which we treat as topics.

In [None]:
n_topics = 3
lsa = TruncatedSVD(n_components=n_topics, random_state=42)
X_lsa = lsa.fit_transform(X_tfidf)

print("Explained Variance by LSA components:", lsa.explained_variance_ratio_)
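
As a rough sketch of cosine similarity in the latent space, the example below ranks four made-up sentences against a one-word query. The corpus, the query, and the choice of 2 components are all assumptions for illustration. The same `transform` plus `cosine_similarity` pattern should carry over to `vectorizer`, `lsa`, and `X_lsa` from the cells above.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus: two baseball sentences, two space sentences
toy_docs = [
    "the pitcher threw the baseball to the catcher",
    "the batter hit the baseball over the fence for a home run",
    "the rocket launched the satellite into orbit",
    "the shuttle reached orbit around the moon",
]
query = ["rocket"]  # appears in document 2 only

vec = TfidfVectorizer(stop_words="english")
X_toy = vec.fit_transform(toy_docs)
q_toy = vec.transform(query)

# Project documents and query into the same 2-dimensional latent space
svd = TruncatedSVD(n_components=2, random_state=42)
X_latent = svd.fit_transform(X_toy)
q_latent = svd.transform(q_toy)

raw_sims = cosine_similarity(q_toy, X_toy).ravel()            # keyword overlap only
latent_sims = cosine_similarity(q_latent, X_latent).ravel()   # similarity in LSA space

for i, doc in enumerate(toy_docs):
    print(f"raw={raw_sims[i]:.2f}  latent={latent_sims[i]:.2f}  {doc}")
```

The last sentence never mentions "rocket", so its raw TF-IDF similarity to the query is zero. In the latent space it still scores high, because it shares "orbit" with the sentence that does.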

## 📊 Step 3: Visualizing the Latent Space

### 1. ☁️ Word Clouds for Topics
What words define our 'latent' topics? We'll look at the top words for each component.

In [None]:
terms = vectorizer.get_feature_names_out()

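def plot_word_clouds(lsa_model, terms, n_topics):
    """Draw one word cloud per LSA component, sizing words by their weight."""
    plt.figure(figsize=(15, 5))
    for i, comp in enumerate(lsa_model.components_):
        # Keep the 50 highest-weighted terms for this component
        sorted_terms = sorted(zip(terms, comp), key=lambda x: x[1], reverse=True)[:50]
        # Pass the positive weights as frequencies so word size reflects importance
        # (joining the words into a string would give every word the same count)
        weights = {term: float(w) for term, w in sorted_terms if w > 0}
        cloud = WordCloud(background_color='white', width=400, height=300)
        cloud = cloud.generate_from_frequencies(weights)

        plt.subplot(1, n_topics, i+1)
        plt.imshow(cloud)
        plt.title(f"Topic {i+1}")
        plt.axis('off')
    plt.tight_layout()
    plt.show()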

plot_word_clouds(lsa, terms, n_topics)

### 2. 🗺️ Document Clustering (2D Space)
Let's visualize how the documents cluster in the new semantic space.

In [None]:
# Map numeric targets to category names so seaborn labels the legend directly
labels = np.array(newsgroups.target_names)[newsgroups.target]

plt.figure(figsize=(10, 7))
sns.scatterplot(x=X_lsa[:, 0], y=X_lsa[:, 1], hue=labels, palette='Dark2', alpha=0.5)
plt.title("LSA Document Clusters (Topic 1 vs Topic 2)")
plt.xlabel("Topic 1 score")
plt.ylabel("Topic 2 score")
plt.show()

### 3. 🔥 Component Importance: Top Terms
Listing the top contributing terms for each topic, strongest first.

In [None]:
# Indices of the 15 highest-weighted terms per component, largest first
top_terms_idx = np.argsort(lsa.components_, axis=1)[:, ::-1][:, :15]
top_terms = [[terms[i] for i in topic] for topic in top_terms_idx]

for i, t in enumerate(top_terms):
    print(f"Topic {i+1} Top Words: {', '.join(t)}")
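
The same weights can also be drawn as a heatmap. The sketch below uses a hypothetical helper, `top_term_weights`, and small stand-in arrays so it runs on its own. In this notebook you would pass `lsa.components_` and `terms` instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


def top_term_weights(components, terms, top_n=15):
    """Return a (topic x term) table limited to each topic's top_n terms."""
    components = np.asarray(components)
    terms = np.asarray(terms)
    # Indices of the top_n largest weights in each row, largest first
    top_idx = np.argsort(components, axis=1)[:, ::-1][:, :top_n]
    # Union of the selected columns, keeping first-seen order
    cols = list(dict.fromkeys(top_idx.ravel().tolist()))
    index = [f"Topic {i + 1}" for i in range(components.shape[0])]
    return pd.DataFrame(components[:, cols], index=index, columns=terms[cols])


# Hypothetical stand-in values, not output from the fitted model
demo_components = np.array([[0.6, 0.5, 0.1, 0.0],
                            [0.0, 0.1, 0.7, 0.6]])
demo_terms = ["game", "team", "space", "orbit"]
weights = top_term_weights(demo_components, demo_terms, top_n=2)

plt.figure(figsize=(6, 2.5))
sns.heatmap(weights, annot=True, cmap="coolwarm", center=0)
plt.title("LSA component weights for top terms")
plt.tight_layout()
plt.show()
```

Centering the colormap at zero keeps negative weights visually distinct from positive ones. That matters because SVD components, unlike word counts, can be negative.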

---

## 🔹 Key Takeaways

✔ **Semantic Retrieval**: LSA goes beyond keyword matching by capturing the underlying context. It can place "baseball" and "bat" in the same topic even if they never appear in the same document, as long as they tend to appear alongside the same other words.

✔ **Efficiency**: Instead of 1,000 sparse TF-IDF columns, each document is now described by just 3 dense component scores.

✔ **Topic Discovery**: Using SVD on raw text, without ever seeing the labels, we recovered components that line up with the sports, space, and politics categories.

## 📌 Meta
Author: Tharun Naik Ramavath
Series: 100 Days of Data Science & AI
Day: 35
Platform: LinkedIn
Notebook: Google Colab