# Lab 6: Text Clustering

## K-Means Text Clustering on Movie Reviews
Objective

Cluster a set of movie reviews into groups based on their content similarity using K-Means clustering and TF-IDF vectorization.

1. Movie Reviews Dataset

We create a small sample dataset of movie reviews:

docs = [
    "I loved the movie, the acting was fantastic",
    "The film was boring and too long",
    "Amazing plot and great visuals",
    "Waste of time, not recommended",
    "Average movie, some good scenes but dull overall",
]

2. TF-IDF Vectorization

Convert the text reviews into numerical vectors using TfidfVectorizer.
This captures the importance of each word relative to the document and the corpus.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
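To make the weighting concrete, here is a minimal pure-Python sketch of the calculation. It assumes the scikit-learn defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization, tokens of two or more word characters). The helpers `tokenize` and `tfidf_vectors` are hypothetical names used only for illustration and are not part of scikit-learn.

```python
import math
import re
from collections import Counter

docs = [
    "I loved the movie, the acting was fantastic",
    "The film was boring and too long",
    "Amazing plot and great visuals",
    "Waste of time, not recommended",
    "Average movie, some good scenes but dull overall",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_vectors(docs):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalize so every document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

vectors = tfidf_vectors(docs)
# "movie" occurs in two reviews and "loved" in only one,
# so "loved" receives the larger weight in the first review.
print(vectors[0]["movie"] < vectors[0]["loved"])  # True
```

Terms that appear in fewer reviews get a higher IDF, which is why rare words dominate each vector.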

3. K-Means Clustering

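Use the K-Means algorithm to partition the reviews into k groups.
Clusters form around shared vocabulary, so they group reviews with similar wording; they line up with sentiment only when sentiment shows in the words the reviews have in common.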

from sklearn.cluster import KMeans

model = KMeans(n_clusters=2, random_state=42)
model.fit(X)
clusters = model.labels_
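For intuition about what `fit` does, the following is a rough sketch of the basic K-Means loop (Lloyd's algorithm) on toy 2-D points. It is not scikit-learn's implementation: seeding the centroids with the first k points is a simplifying assumption (scikit-learn uses k-means++ seeding by default), and the function name `kmeans` and the sample points are hypothetical.

```python
import math

def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Two well-separated toy groups
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

The same two steps run on TF-IDF vectors; only the dimensionality changes.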

4. Assign Cluster Labels

Each review is assigned a cluster label:

for i, doc in enumerate(docs):
    print(f"Doc {i} -> Cluster {clusters[i]}: {doc}")


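Example output (from the run recorded below; cluster numbering, and possibly the assignments themselves, may differ across scikit-learn versions):

Doc 0 -> Cluster 0: I loved the movie, the acting was fantastic
Doc 1 -> Cluster 0: The film was boring and too long
Doc 2 -> Cluster 0: Amazing plot and great visuals
Doc 3 -> Cluster 0: Waste of time, not recommended
Doc 4 -> Cluster 1: Average movie, some good scenes but dull overall

5. Notes

TF-IDF represents each review as a vector of word weights, giving the most weight to words that are frequent in one review but rare across the corpus.

K-Means groups reviews whose TF-IDF vectors lie close together.

In this example, four of the five reviews fall into cluster 0 and only Doc 4 falls into cluster 1, so the clusters do not separate positive from negative reviews. The reviews are short and share almost no vocabulary, which leaves K-Means little signal to work with.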

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def kmeans_text_clustering(docs, k=2):
    """
    Cluster text documents using K-Means.
    """
    # Convert documents to TF-IDF vectors
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)

    # Fit K-Means
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(X)

    # Assign cluster labels
    clusters = model.labels_
    for i, doc in enumerate(docs):
        print(f"Doc {i} -> Cluster {clusters[i]}: {doc}")

# -------------------------
# Movie Reviews Dataset
# -------------------------
docs = [
    "I loved the movie, the acting was fantastic",
    "The film was boring and too long",
    "Amazing plot and great visuals",
    "Waste of time, not recommended",
    "Average movie, some good scenes but dull overall",
]

# Cluster into 2 groups (unsupervised: the groups need not match positive vs negative)
kmeans_text_clustering(docs, k=2)


Doc 0 -> Cluster 0: I loved the movie, the acting was fantastic
Doc 1 -> Cluster 0: The film was boring and too long
Doc 2 -> Cluster 0: Amazing plot and great visuals
Doc 3 -> Cluster 0: Waste of time, not recommended
Doc 4 -> Cluster 1: Average movie, some good scenes but dull overall


## K-Medoids Text Clustering on Movie Reviews
Objective

Cluster a set of movie reviews into groups based on their content similarity using K-Medoids clustering and TF-IDF vectorization.

K-Medoids is similar to K-Means but selects actual data points as cluster centers (medoids), making it more robust to outliers.
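A small numeric illustration of that robustness claim, using hypothetical one-dimensional values rather than review vectors: the mean is pulled toward an outlier, while the medoid stays on an actual data point near the bulk of the data. The helper `medoid` is an illustrative name, not a library function.

```python
def medoid(points, dist):
    """Return the point with the smallest total distance to all points."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

values = [1.0, 2.0, 3.0, 100.0]  # 100.0 plays the role of an outlier

mean = sum(values) / len(values)
med = medoid(values, dist=lambda a, b: abs(a - b))

print(mean)  # 26.5
print(med)   # 2.0
```

The mean (26.5) is not close to any observation, whereas the medoid (2.0) is one of the observations.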

1. Movie Reviews Dataset

We use the same sample movie reviews as in the K-Means lab:

docs = [
    "I loved the movie, the acting was fantastic",
    "The film was boring and too long",
    "Amazing plot and great visuals",
    "Waste of time, not recommended",
    "Average movie, some good scenes but dull overall",
]

2. TF-IDF Vectorization

Convert the text reviews into numerical vectors:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
doc_vectors = vectorizer.fit_transform(docs).toarray()


Each review is represented as a vector of word importance.

Stopwords are removed to focus on meaningful content.

3. Compute Distance Matrix

Compute pairwise Euclidean distances between document vectors:

from sklearn.metrics.pairwise import euclidean_distances

dist_matrix = euclidean_distances(doc_vectors)
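Because `TfidfVectorizer` L2-normalizes each row by default, Euclidean distance between two document vectors is tied to their cosine similarity by d² = 2 − 2·cos, and two reviews with no terms in common end up exactly √2 apart. The sketch below checks this on hypothetical toy vectors; the helper names are illustrative only.

```python
import math

def normalize(v):
    # Scale a vector to unit length (L2 norm of 1)
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    # Plain dot product; equals cosine similarity only for unit vectors
    return sum(a * b for a, b in zip(u, v))

a = normalize([1.0, 2.0, 0.0])  # shares its second term with b
b = normalize([0.0, 1.0, 3.0])
c = normalize([0.0, 0.0, 5.0])  # shares no term with a

# Squared Euclidean distance equals 2 - 2 * cosine similarity
print(math.isclose(euclidean(a, b) ** 2, 2 - 2 * cosine(a, b)))  # True
# Vectors with no shared terms are sqrt(2) apart
print(math.isclose(euclidean(a, c), math.sqrt(2)))  # True
```

On a corpus where most reviews share no words, many entries of the distance matrix therefore tie at √2.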

4. K-Medoids Clustering

Perform K-Medoids clustering:

import kmedoids

km = kmedoids.KMedoids(n_clusters=2, method='fasterpam', random_state=42)
kmedoids_result = km.fit(dist_matrix)


n_clusters=2 specifies the number of clusters.

method='fasterpam' selects FasterPAM, a faster variant of the classic PAM (Partitioning Around Medoids) algorithm.
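To show what clustering on a precomputed distance matrix involves, here is a simplified alternating k-medoids sketch. It is deliberately not FasterPAM and not the `kmedoids` package's code: it assigns points to the nearest medoid and then re-picks each cluster's medoid until nothing changes. Seeding with the first k points, the function name `k_medoids`, and the toy data are all assumptions made for illustration.

```python
def k_medoids(dist, k, n_iter=100):
    """Alternating k-medoids on a precomputed distance matrix."""
    n = len(dist)
    medoids = list(range(k))  # assumption: seed with the first k points
    labels = []
    for _ in range(n_iter):
        # Assign every point to its closest medoid
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        # Within each cluster, pick the member with the lowest total distance
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster is empty
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                min(members, key=lambda i: sum(dist[i][j] for j in members))
            )
        if new_medoids == medoids:
            break  # medoids are stable, so the assignments are final
        medoids = new_medoids
    return medoids, labels

# Toy 1-D data with two obvious groups; distances are absolute differences
values = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]

medoids, labels = k_medoids(dist, k=2)
print(medoids)  # [1, 4]
print(labels)   # [0, 0, 0, 1, 1, 1]
```

The returned medoids are indices into the original data, so each cluster is represented by one of its own members.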

5. Cluster Results

Retrieve medoids and cluster assignments:

print("KMedoids Medoid Indices:", kmedoids_result.medoid_indices_)
print("KMedoids Cluster Labels:", kmedoids_result.labels_)


medoid_indices_ holds, for each cluster, the index in docs of its representative review (the medoid).

labels_ assigns each review to a cluster.

6. Notes

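K-Medoids is less sensitive to outliers than K-Means.

It operates directly on a precomputed distance matrix, such as the pairwise Euclidean distances between TF-IDF vectors, so another dissimilarity measure can be substituted without changing the clustering code.

Cluster assignments can reveal positive vs negative reviews or similar thematic content. In the run recorded below, cluster 1 holds Docs 0, 2, and 4 and cluster 0 holds Docs 1 and 3, although with only five short reviews the grouping rests on very little shared vocabulary.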

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances
import kmedoids

# --- TF-IDF Vectorization ---
vectorizer = TfidfVectorizer(stop_words='english')
doc_vectors = vectorizer.fit_transform(docs).toarray()

# --- KMedoids Clustering ---
dist_matrix = euclidean_distances(doc_vectors)
km = kmedoids.KMedoids(n_clusters=2, method='fasterpam', random_state=42)
kmedoids_result = km.fit(dist_matrix)
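# n_clusters is the requested number of clusters; the medoid indices
# themselves are available as kmedoids_result.medoid_indices_
print("KMedoids Number of Clusters:", kmedoids_result.n_clusters)
print("KMedoids Cluster Labels:", kmedoids_result.labels_)

KMedoids Number of Clusters: 2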
KMedoids Cluster Labels: [1 0 1 0 1]


## Text Shingling
Objective

To measure the similarity between movie reviews based on shared k-shingles (substrings of length k). This is useful for detecting near-duplicate reviews or clustering similar text documents.

1. Shingles Generation

A shingle is a substring of length k extracted from a document. By converting a document into a set of shingles, we can compare the overlap between documents.

def get_shingles(text, k=3):
    """Generate k-shingles (substrings of length k)."""
    text = text.lower()  # normalize to lowercase
    return {text[i:i+k] for i in range(len(text) - k + 1)}


Example:
"movie" with k=3 → {'mov', 'ovi', 'vie'}

2. Jaccard Similarity

The Jaccard similarity between two documents is defined as:

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

Where:

$A$ and $B$ are the sets of shingles for the two documents.

$|A \cap B|$ is the number of shingles common to both documents.

$|A \cup B|$ is the total number of unique shingles across both documents.

def jaccard_similarity(doc1, doc2, k=3):
    shingles1 = get_shingles(doc1, k)
    shingles2 = get_shingles(doc2, k)
    intersection = len(shingles1 & shingles2)
    union = len(shingles1 | shingles2)
    return intersection / union if union != 0 else 0
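As a small worked instance of the definition, consider the hypothetical pair "movie" and "movies" with k=3. They share three shingles out of four unique ones, so the similarity is 3/4. The helper below repeats the shingling logic so the snippet runs on its own.

```python
def get_shingles(text, k=3):
    """Generate the set of k-character shingles of a lowercased string."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

a = get_shingles("movie")   # {'mov', 'ovi', 'vie'}
b = get_shingles("movies")  # {'mov', 'ovi', 'vie', 'ies'}

common = a & b    # shingles present in both sets
combined = a | b  # all unique shingles across the two sets

similarity = len(common) / len(combined)
print(similarity)  # 0.75
```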

3. Example Movie Reviews
review1 = "I loved the movie, the acting was fantastic"
review2 = "Amazing plot and great visuals"
review3 = "Waste of time, not recommended"

print("Review1 vs Review2 Jaccard:", jaccard_similarity(review1, review2, k=3))
print("Review1 vs Review3 Jaccard:", jaccard_similarity(review1, review3, k=3))
print("Review2 vs Review3 Jaccard:", jaccard_similarity(review2, review3, k=3))


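Output:

Review1 vs Review2 Jaccard: 0.03125
Review1 vs Review3 Jaccard: 0.047619047619047616
Review2 vs Review3 Jaccard: 0.01818181818181818

4. Observations

All three pairs share a few 3-character shingles, so every similarity is low but non-zero.

The overlaps are incidental character sequences rather than shared meaning: Reviews 1 and 2 share "ing" and "ng " (from "acting" and "Amazing"), Reviews 1 and 3 share "was", "ast", and "e, ", and Reviews 2 and 3 share only "ot " (from "plot" and "not").

Character shingles respond to any shared substring, so near-duplicate reviews score close to 1 while unrelated reviews like these score close to 0. This is what makes Jaccard similarity useful for detecting near-duplicates.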
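One way to turn this into a near-duplicate check is to compare every pair of reviews and keep the pairs above a cutoff. The sketch below assumes a threshold of 0.5, which is an arbitrary illustrative choice. The second review is a hypothetical near-duplicate added for the demonstration, and `near_duplicates` is an illustrative helper name.

```python
from itertools import combinations

def get_shingles(text, k=3):
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def near_duplicates(docs, k=3, threshold=0.5):
    """Return index pairs whose shingle sets have Jaccard >= threshold."""
    shingle_sets = [get_shingles(d, k) for d in docs]
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if jaccard(shingle_sets[i], shingle_sets[j]) >= threshold
    ]

reviews = [
    "I loved the movie, the acting was fantastic",
    "I loved this movie, the acting was fantastic",  # hypothetical near-duplicate
    "Waste of time, not recommended",
]

print(near_duplicates(reviews))  # [(0, 1)]
```

Comparing all pairs is quadratic in the number of reviews, which is fine for a handful of documents like this.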

5. Extensions

Change k: Increasing k makes the comparison stricter (fewer matches).

Preprocessing: Removing punctuation, stopwords, or stemming can improve meaningful similarity detection.
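Both extensions can be tried with a few lines. This sketch strips ASCII punctuation before shingling and then compares Review 1 and Review 3 at several values of k. The `normalize` helper is a hypothetical preprocessing step, not part of the lab code, and it covers only punctuation removal (no stopword removal or stemming).

```python
import string

def normalize(text):
    # Lowercase, drop ASCII punctuation, collapse runs of whitespace
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def get_shingles(text, k):
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(doc1, doc2, k):
    a = get_shingles(normalize(doc1), k)
    b = get_shingles(normalize(doc2), k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

review1 = "I loved the movie, the acting was fantastic"
review3 = "Waste of time, not recommended"

# For this pair the similarity shrinks as k grows and reaches 0 at k=5
for k in (2, 3, 5):
    print(k, round(jaccard(review1, review3, k), 4))
```

With punctuation removed, the comma-based shingle "e, " no longer counts as a match between the two reviews.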

In [4]:
# --- Jaccard Similarity on Movie Reviews ---

def get_shingles(text, k=3):
    """Generate k-shingles (substrings of length k)."""
    text = text.lower()  # normalize to lowercase
    return {text[i:i+k] for i in range(len(text) - k + 1)}

def jaccard_similarity(doc1, doc2, k=3):
    shingles1 = get_shingles(doc1, k)
    shingles2 = get_shingles(doc2, k)
    intersection = len(shingles1 & shingles2)
    union = len(shingles1 | shingles2)
    return intersection / union if union != 0 else 0

# --- Example Movie Reviews ---
review1 = "I loved the movie, the acting was fantastic"
review2 = "Amazing plot and great visuals"
review3 = "Waste of time, not recommended"

# --- Compute Jaccard Similarities ---
print("Review1 vs Review2 Jaccard:", jaccard_similarity(review1, review2, k=3))
print("Review1 vs Review3 Jaccard:", jaccard_similarity(review1, review3, k=3))
print("Review2 vs Review3 Jaccard:", jaccard_similarity(review2, review3, k=3))


Review1 vs Review2 Jaccard: 0.03125
Review1 vs Review3 Jaccard: 0.047619047619047616
Review2 vs Review3 Jaccard: 0.01818181818181818
