<a href="https://colab.research.google.com/github/tfindiamooc/mlp/blob/feature/TextAnalysisClass4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Lesson #4: Unsupervised Text Clustering with K-Means

Welcome to a lesson on **Text Clustering**!  So far, we've focused on **supervised text classification**, where we have labeled data to train models to predict categories. Now, we'll explore **unsupervised learning** with text clustering, where we aim to discover hidden structures and group similar documents **without any pre-defined labels**.

In this lesson, you will:

*   Understand **what text clustering is and when to use it**.
*   Learn about the **K-Means clustering algorithm**.
*   Build **K-Means clustering pipelines** for text data.
*   Explore **text vectorization** techniques for clustering.
*   Learn how to **evaluate** text clustering using the **Silhouette Score**.
*   **Inspect and interpret** text clusters by examining top terms.
*   Experiment with **choosing the optimal number of clusters (K)**.

Let's start by building a basic K-Means clustering pipeline for text!

In [None]:
# Code Cell 1: Basic K-Means Clustering Pipeline Code
from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import KMeans # Import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import silhouette_score # Import silhouette_score

# 1. Load Dataset (a four-category subset; the labels are not used for clustering)
newsgroups = fetch_20newsgroups(subset='all', categories=['alt.atheism', 'soc.religion.christian', 'talk.politics.mideast', 'comp.graphics']) # More categories for clustering
X = newsgroups.data # Only data, no labels (unsupervised)
target_names = newsgroups.target_names

# 2. Vectorize Text Data (TF-IDF)
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = tfidf_vectorizer.fit_transform(X) # Fit and transform for clustering

# 3. Apply K-Means Clustering
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10) # Set n_clusters = number of categories for demonstration
clusters = kmeans.fit_predict(X_tfidf) # Fit and predict clusters

# 4. Evaluate Clustering (Silhouette Score)
silhouette_avg = silhouette_score(X_tfidf, clusters)
print(f"Silhouette Score for K-Means Clustering: {silhouette_avg:.4f}")

# 5. (Optional) Print cluster sizes
import pandas as pd
cluster_series = pd.Series(clusters)
print("\nCluster Sizes:")
print(cluster_series.value_counts().sort_index())

### K-Means Clustering Pipeline - First Steps

This code sets up a basic K-Means clustering pipeline for text. Let's break down the steps:

1.  **Dataset Loading (Unlabeled Data):**
    *   We load the 20 Newsgroups dataset, but this time, we are using it for **unsupervised learning**. We are *not* using the `target` labels for clustering itself.
    *   We've selected a few more categories (`['alt.atheism', 'soc.religion.christian', 'talk.politics.mideast', 'comp.graphics']`) to make the clustering task a bit more interesting.
    *   We extract only `newsgroups.data` (the text documents) as `X`. We don't need `y` (labels) for clustering.

2.  **Text Vectorization (TF-IDF):**
    *   We use `TfidfVectorizer` to convert the text documents into a numerical matrix representation (TF-IDF matrix). This is essential because clustering algorithms like K-Means work with numerical data.

3.  **K-Means Clustering:**
    *   **`KMeans(n_clusters=4, ...)`**: We initialize the `KMeans` algorithm.
        *   **`n_clusters=4`**:  We are telling K-Means to find **4 clusters**.  Since we selected 4 categories from 20 Newsgroups, we are setting `n_clusters=4` for demonstration purposes to see if K-Means can roughly recover these topics. In a real-world unsupervised scenario, you often won't know the "true" number of clusters beforehand. We'll discuss how to choose `n_clusters` later.
        *   `random_state=42`: For reproducibility.
        *   `n_init=10`:  K-Means results depend on the initial cluster centers, which scikit-learn picks with the randomized `k-means++` scheme by default. `n_init=10` means it will run the algorithm 10 times with different initializations and keep the best result (the one with the lowest inertia).

    *   **`clusters = kmeans.fit_predict(X_tfidf)`**: We train the K-Means model on the TF-IDF vectorized data (`X_tfidf`) and get cluster assignments for each document. `clusters` will be an array where each element is the cluster index (0, 1, 2, or 3 in this case) assigned to the corresponding document.

4.  **Evaluate Clustering (Silhouette Score):**
    *   **`silhouette_score(X_tfidf, clusters)`**: We use the **Silhouette Score** to evaluate the quality of the clustering.
        *   **Silhouette Score:** Measures how well each document is clustered with documents in its own cluster, compared to documents in other clusters.
        *   Silhouette Score ranges from -1 to +1:
            *   **+1:** Best value. Indicates clusters are well-separated and documents are well-clustered within their own cluster.
            *   **0:**  Clusters are overlapping, or documents are on cluster boundaries.
            *   **-1:** Worst value. Indicates documents might be better clustered in a *different* cluster.
        *   For text clustering, we generally aim for Silhouette Scores that are positive and as close to +1 as possible, but in practice, scores are often lower.

5.  **(Optional) Print Cluster Sizes:** We use `pandas` to count the number of documents in each cluster to get an idea of cluster distribution.

Run this code to see the Silhouette Score and cluster sizes for the basic K-Means pipeline.
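To make the Silhouette Score less of a black box, here is a minimal sketch that computes it by hand for four toy 2D points. For each point, `a` is the mean distance to the other points in its own cluster, `b` is the mean distance to the nearest other cluster, and the score is `(b - a) / max(a, b)`. The helper `silhouette_by_hand` and the toy points are hypothetical names and data made up for illustration; the lesson itself uses scikit-learn's `silhouette_score`.

```python
import math

def silhouette_by_hand(points, labels):
    """Average silhouette coefficient using Euclidean distance (needs >= 2 clusters)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(math.dist(p, q))
        if own not in by_cluster:
            # A cluster with a single point gets a silhouette of 0 by convention.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of p's own cluster.
        a = sum(by_cluster[own]) / len(by_cluster[own])
        # b: mean distance to the closest cluster that p does not belong to.
        b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight pairs of points, far apart from each other (toy data).
points = [(0, 0), (0, 1), (5, 0), (5, 1)]

good = silhouette_by_hand(points, [0, 0, 1, 1])  # labels follow the real groups
bad = silhouette_by_hand(points, [0, 1, 0, 1])   # labels cut across the groups

print(f"Sensible labeling: {good:.3f}")
print(f"Mixed-up labeling: {bad:.3f}")
```

The sensible labeling scores well above zero and the mixed-up labeling scores below zero, which matches the interpretation of the +1 / 0 / -1 range described above.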

Now, let's understand more about what text clustering is and how K-Means works.

### What is Text Clustering? - Finding Structure in Unlabeled Text

**Text clustering** is an **unsupervised learning** task that aims to group similar text documents together into clusters, without using any pre-defined categories or labels.

**Key Concepts:**

*   **Unsupervised Learning:**  Unlike classification (supervised), clustering works with **unlabeled data**. We don't have pre-assigned categories for documents. The goal is to discover groupings based on the content of the documents themselves.

*   **Document Similarity:** Clustering relies on the concept of **document similarity**.  Documents within the same cluster should be more similar to each other than to documents in other clusters. Similarity is typically measured using vector representations of text (like TF-IDF vectors) and distance metrics (like cosine distance or Euclidean distance).

*   **Discovering Hidden Structure:** Text clustering is used to **discover hidden thematic structures** or topics within a collection of documents.  It can help you:
    *   **Automatically group documents by topic.**
    *   **Explore and understand the main themes** present in a text corpus.
    *   **Organize large collections of text data.**
    *   **Identify sub-groups within a user base based on their text data (e.g., customer reviews, social media posts).**

*   **Contrast with Text Classification:**
    *   **Text Classification (Supervised):**  Predicts pre-defined categories for documents based on labeled training data. You know the categories in advance.
    *   **Text Clustering (Unsupervised):**  Discovers groupings of documents automatically, without pre-defined categories. You don't know the topics beforehand; the algorithm finds them.

**Common Text Clustering Algorithms:**

*   **K-Means:**  A popular centroid-based algorithm (we'll focus on this).
*   **Hierarchical Clustering:** Builds a hierarchy of clusters (agglomerative or divisive).
*   **DBSCAN (Density-Based Spatial Clustering of Applications with Noise):**  Finds clusters based on density of data points.
*   **Topic Modeling (Latent Dirichlet Allocation - LDA, Non-negative Matrix Factorization - NMF):**  Related to clustering, but focuses on discovering latent topics and topic distributions within documents. We might explore topic modeling in a later lesson.

**Why use Text Clustering?**

*   **Exploratory Data Analysis:**  Great for understanding the structure of a new text dataset when you don't have labels.
*   **Organization and Summarization:**  Helps organize and summarize large text collections by grouping similar documents.
*   **Feature Engineering:**  Cluster assignments can sometimes be used as features in supervised learning tasks.
*   **No Labeled Data Required:**  A major advantage is that you don't need manually labeled data, which can be expensive and time-consuming to obtain.

In the next text cell, we'll focus on the K-Means algorithm itself.

### K-Means Clustering Algorithm - Step-by-Step

**K-Means** is a widely used **centroid-based** clustering algorithm. Here's how it works:

1.  **Initialization: Choose K and Initial Centroids:**
    *   You need to specify **`K`**, the **number of clusters** you want to find. This is a hyperparameter you need to decide (we'll discuss how to choose K later).
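    *   **Initial centroids** are chosen at random in the basic algorithm (scikit-learn defaults to `k-means++`, a randomized scheme that spreads the initial centroids apart). A centroid is the center point of a cluster.  For text data vectorized into a TF-IDF matrix, a centroid is a vector in the same feature space.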

2.  **Assignment Step: Assign Documents to the Nearest Centroid:**
    *   For each document, calculate the **distance** between the document's vector representation and each of the **K centroids**. Standard K-Means uses Euclidean distance; cosine-based variants also exist.
    *   Assign each document to the cluster whose centroid is **closest** to it.

3.  **Update Step: Recalculate Centroids:**
    *   For each cluster, recalculate the **centroid** by taking the **mean** of all the document vectors assigned to that cluster.  The new centroid becomes the average vector of all documents in the cluster.

4.  **Iteration:**
    *   Repeat steps 2 and 3 (Assignment and Update) until **convergence**.
    *   **Convergence** occurs when the cluster assignments no longer change significantly, or when a maximum number of iterations is reached.

5.  **Final Clusters:**  Once converged, you have your final clusters, and each document is assigned to one of the K clusters.
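
The five steps above can be sketched in a few lines of plain Python. This is a teaching sketch under simple assumptions (Euclidean distance, random initial centroids drawn from the data, toy 2D points), not scikit-learn's implementation; the function name `kmeans` and its parameters are hypothetical.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Bare-bones K-Means on a list of coordinate tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):  # Step 4: iterate until convergence or max_iter
        # Step 2: assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no assignment changed, so the algorithm has converged
        assignments = new_assignments
        # Step 3: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    # Step 5: final cluster assignments and centroids.
    return assignments, centroids

# Two small, well-separated groups of toy points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)

print("Cluster assignments:", labels)
print("Final centroids:", centroids)
```

The cluster numbers themselves (0 or 1) are arbitrary; what matters is which points end up sharing a label.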

**Visual Intuition:**

Imagine you have points scattered on a 2D plane (think of document vectors in a high-dimensional space). K-Means tries to find K cluster centers (centroids) and group the points around these centers, so that points within each group are close to each other and far from points in other groups.

**Important Considerations for K-Means:**

*   **Choosing K:** Selecting the correct number of clusters (`K`) is crucial and often not straightforward in unsupervised learning. We'll explore methods like the Elbow method and Silhouette Score to help choose K.
*   **Initialization Sensitivity:** K-Means is sensitive to the initial random placement of centroids. Running K-Means multiple times with different random initializations (controlled by `n_init` parameter in scikit-learn) and choosing the best result helps mitigate this issue.
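*   **Distance Metric:** The choice of distance metric (e.g., cosine, Euclidean) can affect clustering results. Cosine distance is often preferred for text data, especially with TF-IDF, as it focuses on the angle between vectors (topic similarity) rather than magnitude.  scikit-learn's `KMeans` supports only Euclidean distance. However, `TfidfVectorizer` L2-normalizes every document vector by default (`norm='l2'`), and for unit-length vectors the squared Euclidean distance equals `2 - 2 * cosine similarity`, so the two measures rank document pairs the same way. Centroids are generally not unit-length, so this is close to, but not exactly, cosine-based K-Means. We'll use TF-IDF and Euclidean distance in this lesson for simplicity.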
*   **Spherical Clusters:** K-Means assumes clusters are somewhat spherical and equally sized. It might not perform well if clusters have complex shapes or varying densities.
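
To see how Euclidean and cosine distance relate once vectors are scaled to unit length, the following sketch checks the identity `squared Euclidean distance = 2 - 2 * cosine similarity` on two made-up term-weight vectors. The helper names `l2_normalize` and `cosine_similarity` are hypothetical and written out only for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two made-up term-weight vectors with different magnitudes.
doc_a = [3.0, 0.0, 1.0, 2.0]
doc_b = [1.0, 2.0, 0.0, 4.0]

cos_sim = cosine_similarity(doc_a, doc_b)
unit_a, unit_b = l2_normalize(doc_a), l2_normalize(doc_b)
sq_euclid = math.dist(unit_a, unit_b) ** 2

print(f"cosine similarity:               {cos_sim:.4f}")
print(f"squared Euclidean (unit length): {sq_euclid:.4f}")
print(f"2 - 2 * cosine similarity:       {2 - 2 * cos_sim:.4f}")
```

The last two printed values agree, which is why Euclidean K-Means on L2-normalized vectors behaves much like a cosine-based clustering.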

Let's now experiment with different vectorizers in our K-Means clustering pipeline.

In [None]:
# Code Cell 2: K-Means with different vectorizers (CountVectorizer, TF-IDF)
from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer # Import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import silhouette_score

# 1. Load Dataset (Same as before)
newsgroups = fetch_20newsgroups(subset='all', categories=['alt.atheism', 'soc.religion.christian', 'talk.politics.mideast', 'comp.graphics'])
X = newsgroups.data
target_names = newsgroups.target_names

# 2. Pipelines with different vectorizers
# Pipeline with CountVectorizer (BoW)
kmeans_pipeline_bow = Pipeline([
    ('bow', CountVectorizer(stop_words='english', max_features=5000)), # CountVectorizer
    ('kmeans', KMeans(n_clusters=4, random_state=42, n_init=10))
])

# Pipeline with TfidfVectorizer (TF-IDF) - (Same as before for comparison)
kmeans_pipeline_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=5000)), # TfidfVectorizer
    ('kmeans', KMeans(n_clusters=4, random_state=42, n_init=10))
])

# 3. Fit and Predict Clusters using Pipelines
clusters_bow = kmeans_pipeline_bow.fit_predict(X) # Fit and predict in one step for pipeline
clusters_tfidf = kmeans_pipeline_tfidf.fit_predict(X)

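# 4. Evaluate Clustering - Silhouette Score (Compare BoW and TF-IDF)
# Note: Pipeline.transform would also apply KMeans.transform (distances to centroids),
# so we call only the vectorizer step to get the document-term matrix.
X_bow = kmeans_pipeline_bow.named_steps['bow'].transform(X)
X_tfidf = kmeans_pipeline_tfidf.named_steps['tfidf'].transform(X)
silhouette_avg_bow = silhouette_score(X_bow, clusters_bow)
silhouette_avg_tfidf = silhouette_score(X_tfidf, clusters_tfidf)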

print(f"Silhouette Score for K-Means with CountVectorizer (BoW): {silhouette_avg_bow:.4f}")
print(f"Silhouette Score for K-Means with TfidfVectorizer (TF-IDF): {silhouette_avg_tfidf:.4f}")

### Comparing Vectorizers with K-Means

In this code, we compare K-Means clustering performance using two different text vectorizers:

*   **`kmeans_pipeline_bow`**: Uses `CountVectorizer` (Bag of Words).
*   **`kmeans_pipeline_tfidf`**: Uses `TfidfVectorizer` (TF-IDF).

Run the code and compare the Silhouette Scores for both pipelines.

**Questions to consider:**

*   Does using TF-IDF vectorization lead to a better Silhouette Score compared to using Bag of Words (CountVectorizer) for K-Means clustering?
*   Is the difference in Silhouette Scores significant?
*   Based on these scores, which vectorization method seems to produce better-defined clusters for K-Means?

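TF-IDF is often preferred for text clustering because it downweights common words and emphasizes words that are more distinctive to specific documents, which can help K-Means find better clusters. Keep in mind that the two scores are computed in different feature spaces, so treat the comparison as a rough guide and confirm it by inspecting the clusters.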
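
As a small illustration of the downweighting effect, the sketch below vectorizes three made-up sentences with both vectorizers and compares the weight of a word that appears in every document (`the`) with one that appears in only one (`image`). The sentences are invented for this example, and no stop word list is used so that `the` stays in the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up mini corpus: "the" occurs in every document, "image" in just one.
docs = [
    "the graphics card renders image",
    "the church holds sunday service",
    "the election shapes regional politics",
]

bow = CountVectorizer().fit(docs)
tfidf = TfidfVectorizer().fit(docs)

counts = bow.transform(docs).toarray()
weights = tfidf.transform(docs).toarray()

# Column positions of the two words in each vectorizer's vocabulary.
bow_the, bow_image = bow.vocabulary_["the"], bow.vocabulary_["image"]
tfidf_the, tfidf_image = tfidf.vocabulary_["the"], tfidf.vocabulary_["image"]

print("Counts in doc 0  (the, image):", counts[0, bow_the], counts[0, bow_image])
print("TF-IDF in doc 0  (the, image):",
      round(weights[0, tfidf_the], 3), round(weights[0, tfidf_image], 3))
print("IDF values       (the, image):",
      round(tfidf.idf_[tfidf_the], 3), round(tfidf.idf_[tfidf_image], 3))
```

Both words occur once in the first document, so their counts are equal, but TF-IDF gives the distinctive word the larger weight.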

Now, let's inspect the clusters to understand the topics K-Means has discovered.

In [None]:
# Code Cell 3: Inspecting K-Means Clusters - Top Terms
from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import silhouette_score
import pandas as pd
import numpy as np

# 1. Load Dataset & Vectorize (TF-IDF Pipeline - for inspection)
newsgroups = fetch_20newsgroups(subset='all', categories=['alt.atheism', 'soc.religion.christian', 'talk.politics.mideast', 'comp.graphics'])
X = newsgroups.data
target_names = newsgroups.target_names

kmeans_pipeline_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=5000)),
    ('kmeans', KMeans(n_clusters=4, random_state=42, n_init=10))
])

# 2. Fit Pipeline and Get Clusters
clusters_tfidf = kmeans_pipeline_tfidf.fit_predict(X)

# 3. Get Vectorizer and KMeans model from pipeline
tfidf_vectorizer = kmeans_pipeline_tfidf.named_steps['tfidf']
kmeans_model = kmeans_pipeline_tfidf.named_steps['kmeans']

# 4. Get Feature Names (Words) and Cluster Centers
feature_names = tfidf_vectorizer.get_feature_names_out() # Get word vocabulary
cluster_centers = kmeans_model.cluster_centers_ # Get cluster centroids (vectors)

# 5. Function to get top terms per cluster
def get_top_terms(cluster_index, top_n=15):
    centroid = cluster_centers[cluster_index] # Get centroid for cluster
    top_term_indices = centroid.argsort()[-top_n:][::-1] # Get indices of top terms (words) in centroid
    top_terms = feature_names[top_term_indices] # Get actual words
    return top_terms

# 6. Print Top Terms for each cluster
print("Top terms per cluster:")
for i in range(4): # Assuming n_clusters=4
    top_terms = get_top_terms(i)
    print(f"\nCluster {i}:")
    for term in top_terms:
        print(f"- {term}")

# 7. (Optional) Print some example documents from each cluster (for qualitative inspection)
print("\nExample documents from each cluster (first 2 from each):")
for i in range(4):
    print(f"\nCluster {i} - Example Documents:")
    cluster_docs_indices = np.where(clusters_tfidf == i)[0] # Indices of docs in cluster i
    for doc_index in cluster_docs_indices[:2]: # First 2 documents
        print(f"Doc index: {doc_index}:")
        print(X[doc_index][:200] + "...") # Print first 200 chars of doc
        print("-" * 20)

### Explanation of K-Means Cluster Inspection - Top Terms

Let's understand the cluster inspection code:

*   **Steps 1-3:**  We set up and fit the K-Means pipeline with TF-IDF, similar to before. We also extract the trained `TfidfVectorizer` and `KMeans` model from the pipeline.

*   **Step 4: Get Feature Names and Cluster Centers:**
    *   **`feature_names = tfidf_vectorizer.get_feature_names_out()`**: Gets the vocabulary (words) from the `TfidfVectorizer`.
    *   **`cluster_centers = kmeans_model.cluster_centers_`**:  This is important! It retrieves the **cluster centroids** from the trained K-Means model.
        *   **Cluster Centroids:** In K-Means, each cluster is represented by a centroid, which is the mean vector of all documents assigned to that cluster.
        *   For TF-IDF vectorized text, a centroid is a vector in the same TF-IDF feature space.  The values in the centroid vector represent the "average" TF-IDF weight of each word in the cluster.

*   **Step 5: `get_top_terms(cluster_index, top_n=15)` Function:**
    *   This function takes a `cluster_index` (0, 1, 2, 3 in our case) and `top_n` (number of top terms to retrieve).
    *   **`centroid = cluster_centers[cluster_index]`**: Gets the centroid vector for the specified cluster.
    *   **`top_term_indices = centroid.argsort()[-top_n:][::-1]`**:  Finds the indices of the **top `top_n` terms** in the centroid vector.
        *   `centroid.argsort()`: Returns indices that would sort the centroid vector in ascending order.
        *   `[-top_n:]`: Selects the indices of the `top_n` largest values (highest TF-IDF weights) in the centroid.
        *   `[::-1]`: Reverses the order to get indices in descending order (highest to lowest weights).
    *   **`top_terms = feature_names[top_term_indices]`**:  Uses the `top_term_indices` to retrieve the actual **words** (feature names) from the vocabulary.

*   **Step 6: Print Top Terms for each cluster:**  We iterate through each cluster (0 to 3) and call `get_top_terms()` to get the top words for each cluster and print them.

*   **Step 7: (Optional) Print Example Documents:**  This part is for qualitative inspection. It prints the first 200 characters of the first 2 documents assigned to each cluster to give you a sense of what kind of documents are in each cluster.

Run this code and examine the "Top terms per cluster" output.
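
If the `argsort()[-top_n:][::-1]` idiom is new to you, this tiny standalone sketch walks through it on a five-word vocabulary. The words and centroid weights are made up purely for illustration; they are not output from the model above.

```python
import numpy as np

# Made-up vocabulary and centroid weights, for illustration only.
feature_names = np.array(["bible", "graphics", "image", "israel", "software"])
centroid = np.array([0.01, 0.30, 0.22, 0.00, 0.15])

ascending = centroid.argsort()   # indices ordered from smallest to largest weight
top_3 = ascending[-3:][::-1]     # keep the last three, then reverse to descending

print("Indices in ascending order of weight:", ascending)
print("Indices of the top 3 weights:", top_3)
print("Top 3 terms:", feature_names[top_3])
```

Indexing `feature_names` with the index array works because `get_feature_names_out()` also returns a NumPy array.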

**Interpreting Top Terms:**

*   Examine the top words for each cluster. Do they seem to represent coherent topics?
*   Do the top terms for each cluster relate to the original categories we used (`alt.atheism`, `soc.religion.christian`, `talk.politics.mideast`, `comp.graphics`)?  Remember, clustering is unsupervised, so it won't perfectly match the original categories, but you should see some thematic overlap if clustering is working reasonably well.
*   Look at the example documents to get a qualitative sense of the cluster content.

Now, let's discuss how to choose the number of clusters, `K`.

### Choosing the Number of Clusters (K) - Elbow Method and Silhouette Score

Choosing the right number of clusters (`K`) is a critical step in K-Means clustering (and in clustering in general).  In many real-world scenarios, you won't know the "true" number of clusters beforehand.  Here are two common methods to help you choose K:

**1. Elbow Method (using Inertia):**

*   **Inertia:** In K-Means, **inertia** is the sum of squared distances of samples to their closest cluster center.  It represents the within-cluster sum of squares. Lower inertia is better, meaning clusters are more compact.
*   **Elbow Plot:**
    *   Run K-Means for a range of possible `K` values (e.g., from 2 to 10).
    *   For each `K`, calculate the **inertia**.
    *   Plot **Inertia vs. K**.
    *   Look for an "elbow" in the plot. The "elbow" point is often considered a good indication of a reasonable `K`.  The idea is that inertia decreases as K increases, but the rate of decrease slows down after the "elbow."
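
To connect the definition of inertia to what scikit-learn reports, here is a minimal sketch on six toy 2D points. It recomputes the sum of squared distances by hand, compares it with `kmeans.inertia_`, and then prints the inertia for a few values of `K`. The points are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two small, well-separated groups of toy points.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(points)

# Inertia by hand: squared distance from each point to its assigned centroid, summed.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((points - assigned_centers) ** 2).sum()

print(f"Manual inertia:  {manual_inertia:.4f}")
print(f"kmeans.inertia_: {km.inertia_:.4f}")

# Inertia for a few K values: it keeps shrinking as K grows, which is why we
# look for the point where the improvement levels off rather than the minimum.
for k in (1, 2, 3):
    inertia_k = KMeans(n_clusters=k, random_state=42, n_init=10).fit(points).inertia_
    print(f"K={k}: inertia = {inertia_k:.4f}")
```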

**2. Silhouette Score Method:**

*   We've already used the Silhouette Score to evaluate a clustering for a fixed `K`.
*   **Silhouette Score Plot:**
    *   Run K-Means for a range of `K` values (e.g., from 2 to 10).
    *   For each `K`, calculate the **average Silhouette Score** for all documents.
    *   Plot **Silhouette Score vs. K**.
    *   Look for the `K` that maximizes the Silhouette Score.  A higher Silhouette Score indicates better-defined clusters.



**Code Example - Combining Elbow Method and Silhouette Score:**

In [None]:
# Code Cell 4: Choosing K - Elbow Method and Silhouette Score
from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# 1. Load Dataset & Vectorize (TF-IDF Pipeline)
newsgroups = fetch_20newsgroups(subset='all', categories=['alt.atheism', 'soc.religion.christian', 'talk.politics.mideast', 'comp.graphics'])
X = newsgroups.data

tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = tfidf_vectorizer.fit_transform(X)

# 2. Range of K values to try
k_range = range(2, 11) # Try K from 2 to 10

# 3. Lists to store inertia and silhouette scores
inertia_values = []
silhouette_scores = []

# 4. Loop through different K values and run K-Means
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    clusters = kmeans.fit_predict(X_tfidf)
    inertia_values.append(kmeans.inertia_) # Store inertia
    silhouette_avg = silhouette_score(X_tfidf, clusters) # Calculate Silhouette Score
    silhouette_scores.append(silhouette_avg) # Store Silhouette Score
    print(f"For K={k}, Silhouette Score: {silhouette_avg:.4f}") # Print Silhouette Score for each K

# 5. Elbow Plot (Inertia vs. K)
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(k_range, inertia_values, marker='o')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Inertia")
plt.title("Elbow Method for Optimal K")
plt.xticks(k_range)
plt.grid(True)

# 6. Silhouette Score Plot (Silhouette Score vs. K)
plt.subplot(1, 2, 2)
plt.plot(k_range, silhouette_scores, marker='o', color='green')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score for Optimal K")
plt.xticks(k_range)
plt.grid(True)

plt.tight_layout()
plt.show()

Run this code to generate the Elbow Plot and Silhouette Score plot.

**Interpreting the Plots:**

*   **Elbow Plot:** Look for an "elbow" point in the Inertia plot. Where does the decrease in inertia start to become less steep? This point might suggest a reasonable K.
*   **Silhouette Score Plot:** Look for the K value that corresponds to the highest Silhouette Score.



**Important Notes on Choosing K:**

*   **No Single "Best" K:** In many unsupervised clustering tasks, there isn't a single definitively "correct" number of clusters. The "best" K often depends on your goals and how you want to interpret the clusters.
*   **Domain Knowledge:** Domain knowledge about your data can be helpful in guiding your choice of K. For example, if you are clustering news articles and you know there are roughly 5-7 major topics covered, you might try K values in that range.
*   **Iterative Process:** Choosing K is often an iterative process. You might try different K values, inspect the resulting clusters (top terms, example documents), evaluate using metrics like Silhouette Score and Inertia, and then refine your choice of K based on these insights.
*   **Other Evaluation Metrics:** Besides Silhouette Score and Inertia, other clustering evaluation metrics exist (e.g., the [Davies-Bouldin Index](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html)). You can explore these as well.
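
As a hedged sketch of how a second metric can be used alongside the Silhouette Score, the code below scores K-Means on synthetic 2D data with three obvious groups. The data comes from `make_blobs` with centers chosen by hand for this example, so it says nothing about the newsgroups corpus. Note the opposite directions: a higher Silhouette Score is better, while a lower Davies-Bouldin Index is better. `davies_bouldin_score` is applied to a dense array here; sparse TF-IDF input may need to be converted or reduced first.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic demo data: three tight, well-separated groups (centers chosen by hand).
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

results = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    sil = silhouette_score(X_demo, labels)
    dbi = davies_bouldin_score(X_demo, labels)
    results[k] = (sil, dbi)
    print(f"K={k}: Silhouette = {sil:.3f}, Davies-Bouldin = {dbi:.3f}")

# Silhouette: pick the maximum. Davies-Bouldin: pick the minimum.
best_k_silhouette = max(results, key=lambda k: results[k][0])
best_k_davies_bouldin = min(results, key=lambda k: results[k][1])

print("Best K by Silhouette Score:", best_k_silhouette)
print("Best K by Davies-Bouldin Index:", best_k_davies_bouldin)
```

On clean synthetic data the two metrics agree; on real text they can disagree, which is one more reason to inspect the clusters yourself.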

Now, let's move on to experimentation prompts to further explore text clustering with K-Means.

### Experimentation Prompts - K-Means Text Clustering Deep Dive

Time to experiment with K-Means text clustering! Try these:

1.  **Datasets and K-Means Clustering:**
    *   Change the `categories` in `fetch_20newsgroups` to different sets of categories (e.g., more categories, different combinations).
    *   For each dataset:
        *   Run the Elbow method and Silhouette Score analysis to help choose a suitable `K`.
        *   Cluster the data using K-Means with your chosen `K`.
        *   Inspect the clusters (top terms, example documents).
        *   How do the clusters and evaluation metrics change with different datasets? Does K-Means find meaningful clusters for different topic sets?

2.  **Vectorizers and K-Means - Deeper Exploration:**
    *   Experiment with different vectorizers and vectorizer parameters in your K-Means pipeline:
        *   **`CountVectorizer` vs. `TfidfVectorizer`:** Compare them systematically across different datasets and K values.
        *   **Vary `ngram_range`, `max_df`, `min_df`, `max_features` in `TfidfVectorizer` and `CountVectorizer`.** How do these vectorizer parameters affect clustering quality (Silhouette Score, cluster interpretability)?

3.  **Distance Metrics in K-Means (Advanced - Optional):**
    *   scikit-learn's `KMeans` always uses Euclidean distance.  Cosine distance, which is often preferred for text similarity, is not available as an option in `sklearn.cluster.KMeans`.
    *   **If you want to experiment with cosine distance:**
        *   L2-normalize your vectors so that Euclidean distance ranks documents the same way cosine distance does. `TfidfVectorizer` already does this by default (`norm='l2'`); `CountVectorizer` output is not normalized, so apply `sklearn.preprocessing.normalize` with `norm='l2'` to it.  Alternatively, you might explore other K-Means implementations that directly support cosine distance (which might be more advanced).
        *   Does clustering L2-normalized vectors improve results compared to clustering unnormalized vectors (raw counts, or TF-IDF built with `norm=None`)?

4.  **Initialization Methods for K-Means (Advanced - Optional):**
    *   Experiment with different initialization methods for K-Means using the `init` parameter:
        *   `init='k-means++'` (default - smart initialization that often leads to better results and faster convergence).
        *   `init='random'` (random initialization).
        *   How does the initialization method affect clustering performance (Silhouette Score, inertia, convergence speed)?
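
As a possible starting point for prompts 3 and 4, the sketch below builds raw (`norm=None`) and L2-normalized TF-IDF matrices for a made-up four-document corpus, confirms the row lengths, and runs K-Means with both initialization methods. The corpus is invented and far too small to draw conclusions from; swap in the newsgroups data for a real comparison.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus with short and long documents on two topics.
docs = [
    "graphics image rendering software",
    "image rendering graphics card driver update",
    "church faith god",
    "god faith church bible study group sunday",
]

X_raw = TfidfVectorizer(norm=None).fit_transform(docs)  # no length normalization
X_l2 = TfidfVectorizer().fit_transform(docs)            # default norm='l2'

# Row lengths: raw vectors grow with document length, normalized ones are all 1.
raw_norms = np.linalg.norm(X_raw.toarray(), axis=1)
l2_norms = np.linalg.norm(X_l2.toarray(), axis=1)
print("Raw TF-IDF row norms:", raw_norms.round(3))
print("L2 TF-IDF row norms: ", l2_norms.round(3))

# Compare both matrices and both initialization methods.
for name, X_variant in [("raw", X_raw), ("l2-normalized", X_l2)]:
    for init in ("k-means++", "random"):
        km = KMeans(n_clusters=2, init=init, n_init=10, random_state=42).fit(X_variant)
        print(f"{name:>13} | init={init:<9} | labels={km.labels_} | inertia={km.inertia_:.3f}")
```

Inertia values are only comparable within the same matrix, because the raw and normalized vectors live on different scales.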

Think about these questions as you experiment:

*   How robust is K-Means clustering for text data? Does it consistently find meaningful clusters?
*   What is the impact of vectorization choices on clustering quality?
*   How sensitive is K-Means to the choice of `K`?
*   When is K-Means a good choice for text clustering, and when might you consider other clustering algorithms or topic modeling techniques?

After your experiments, read the summary and key takeaways for this lesson.

### Summary and Next Steps - Unveiling Text Structure with K-Means

Excellent work exploring text clustering with K-Means! In this lesson, you've:

*   Understood **text clustering** as an unsupervised learning task for grouping similar documents.
*   Learned about the **K-Means clustering algorithm** and its iterative process.
*   Built **K-Means clustering pipelines** for text data using scikit-learn.
*   Experimented with **text vectorization** (TF-IDF, CountVectorizer) for clustering.
*   Used the **Silhouette Score** to evaluate clustering quality.
*   Learned how to **inspect clusters** by examining top terms and example documents.
*   Explored methods for **choosing the number of clusters (K)**, including the Elbow method and Silhouette Score analysis.

**Key Takeaways for K-Means Text Clustering:**

*   Text clustering is a powerful technique for **discovering hidden structure and organizing unlabeled text data**.
*   **K-Means is a widely used and relatively simple clustering algorithm** that can nevertheless be effective for text clustering.
*   **Text vectorization (like TF-IDF) is essential** to represent text documents numerically for K-Means.
*   **Choosing the number of clusters (K) is a crucial step**, and methods like the Elbow method and Silhouette Score can help guide this choice.
*   **Inspecting clusters (top terms, example documents) is important** to understand the topics discovered by K-Means and evaluate the quality of the clustering qualitatively.
*   K-Means is a good starting point for text clustering, but be aware of its limitations (sensitivity to initialization, assumptions about cluster shape, need to pre-specify K).

*   There are more advanced techniques for text analysis and unsupervised learning that are beyond the scope of this course:
    *   **Topic Modeling (LDA, NMF)** to discover latent topics in more detail.
    *   **Word Embeddings** (Word2Vec, GloVe, FastText) for richer text representations that can be used for both clustering and classification.
    *   Other clustering algorithms beyond K-Means (e.g., Hierarchical Clustering, DBSCAN).

You are now equipped with valuable unsupervised learning skills for text data! Keep experimenting and exploring different datasets and techniques!

### Key Takeaways for Lesson #4 (K-Means Text Clustering Specific):

*   **Text Clustering is unsupervised learning to group similar documents without labels.**
*   **K-Means is a centroid-based algorithm that iteratively assigns documents to clusters and updates centroids.**
*   **Vectorize text (e.g., TF-IDF) before applying K-Means.**
*   **Silhouette Score evaluates clustering quality.**
*   **Inspect clusters by examining top terms and example documents.**
*   **Choose K using Elbow method and Silhouette Score analysis (iterative process).**
*   **K-Means is a useful baseline for text clustering, but has limitations.**

### Resources for Lesson #4 (K-Means Text Clustering Specific):

*   **Scikit-learn documentation on `KMeans`:** [https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
*   **Scikit-learn documentation on `silhouette_score`:** [https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)

### Additional Notes - K-Means Text Clustering Specific Considerations:

*   **Pre-processing:** Text pre-processing steps (lowercase, punctuation removal, stop word removal, stemming/lemmatization) are important for text clustering, just as they are for classification. Experiment with different pre-processing strategies.

*   **Feature Scaling (for Euclidean Distance):** K-Means with Euclidean distance is sensitive to vector magnitude, which for text mostly reflects document length. `TfidfVectorizer` already applies L2 normalization by default (`norm='l2'`), so its output needs no extra scaling. If you use `CountVectorizer` or set `norm=None`, consider L2-normalizing the vectors yourself so that K-Means focuses on relative term weights rather than document length.

*   **High Dimensionality:** Text data is often high-dimensional (many words/features).  Dimensionality reduction techniques (like Principal Component Analysis - PCA, `TruncatedSVD`, which works directly on sparse TF-IDF matrices, or Non-negative Matrix Factorization - NMF) can sometimes be applied *before* K-Means to reduce dimensionality and potentially improve clustering, especially for very large vocabularies. However, for moderately sized vocabularies (like `max_features=5000` in our examples), dimensionality reduction might not always be necessary or beneficial.

*   **Cluster Size Imbalance:** K-Means can sometimes produce clusters of very different sizes. This might be acceptable depending on your data, but if you want more balanced clusters, you might explore other clustering algorithms or techniques to address cluster imbalance.

*   **Beyond K-Means:**  Remember that K-Means is just one clustering algorithm. For more complex text clustering tasks, consider exploring other algorithms like Hierarchical Clustering, DBSCAN, or topic modeling techniques like LDA and NMF, which might be better suited for discovering more nuanced thematic structures in text.
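
To illustrate the dimensionality reduction idea from the notes above, here is a minimal sketch that reduces TF-IDF vectors before K-Means. It uses `TruncatedSVD` (often called latent semantic analysis) rather than PCA because it accepts sparse matrices directly; that swap, the re-normalization step with `Normalizer`, and the six made-up documents are assumptions for this example, not part of the lesson's pipeline.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Made-up mini corpus: three documents about graphics, three about religion.
docs = [
    "graphics image rendering software",
    "image rendering graphics card",
    "graphics software image format",
    "church faith god prayer",
    "god faith church bible",
    "faith god church sermon",
]

lsa_kmeans = make_pipeline(
    TfidfVectorizer(),                               # sparse TF-IDF vectors
    TruncatedSVD(n_components=2, random_state=42),   # reduce to 2 dense dimensions
    Normalizer(copy=False),                          # rescale reduced vectors to unit length
    KMeans(n_clusters=2, random_state=42, n_init=10),
)

labels = lsa_kmeans.fit_predict(docs)
print("Cluster assignments:", labels)
```

With real data you would typically keep far more than two components; the value here is small only because the toy vocabulary is tiny.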