In [None]:
Question 1: What is the difference between K-Means and Hierarchical Clustering?
Provide a use case for each.

K-Means Clustering

Definition: A partition-based clustering algorithm that divides data into k clusters by minimizing the within-cluster variance (the sum of squared distances from each point to its cluster centroid).

How it works:

Choose k (number of clusters).

Randomly initialize cluster centroids.

Assign each point to the nearest centroid.

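Recompute each centroid as the mean of its assigned points, then repeat the assignment and update steps until the assignments stop changing (convergence).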
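
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the toy data, and the choice to initialize from randomly picked data points are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: assign points, recompute means, repeat."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Toy data (hypothetical): two tight, well-separated groups
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(labels_demo)
print(centroids_demo)
```

On data this clean, the first three points end up in one cluster and the last three in the other; which cluster gets label 0 depends on the random initialization.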

Pros:

Fast and scalable for large datasets.

Works well with spherical and well-separated clusters.

Cons:

Must predefine k.

Sensitive to outliers and initialization.

Struggles with non-spherical clusters.

✅ Use Case:
Customer segmentation in marketing (e.g., grouping shoppers into 5 clusters based on purchase behavior).

Hierarchical Clustering

Definition: A tree-based clustering algorithm that builds a hierarchy (dendrogram) of clusters.

Types:

Agglomerative (bottom-up): Start with each point as a cluster, merge iteratively.

Divisive (top-down): Start with all points in one cluster, split iteratively.

Pros:

No need to predefine k.

Produces a dendrogram for flexible cluster selection.

Good for smaller datasets and discovering nested structures.

Cons:

Computationally expensive (agglomerative methods typically need O(n²) memory and between O(n²) and O(n³) time).

Not suitable for very large datasets.

✅ Use Case:
Document clustering in text mining (e.g., building a hierarchy of news articles by topic/subtopic).

Key Difference in Simple Words

K-Means: "Flat" clustering → you tell it how many groups you want.

Hierarchical: "Tree" clustering → it shows relationships among clusters and lets you decide later.
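
To illustrate the "decide later" point, here is a small sketch using SciPy's `linkage` and `fcluster`. The toy data, the Ward linkage, and the two cut levels are assumptions chosen for the example. The tree is built once and then cut at two different levels without re-fitting.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Toy data (hypothetical): two groups close to each other, one far away
X_demo = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(20, 2)),
    rng.normal(loc=[3, 0], scale=0.3, size=(20, 2)),
    rng.normal(loc=[20, 20], scale=0.3, size=(20, 2)),
])

# Build the agglomerative merge tree once (Ward linkage)
Z = linkage(X_demo, method="ward")

# Cut the same tree at two different levels
labels_2 = fcluster(Z, t=2, criterion="maxclust")  # coarse view
labels_3 = fcluster(Z, t=3, criterion="maxclust")  # finer view
print(len(set(labels_2)), len(set(labels_3)))
```

In the coarse cut the two nearby groups share one label; the finer cut separates them, which is the nested structure a dendrogram exposes.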

Question 2: Explain the purpose of the Silhouette Score in evaluating clustering
algorithms

Silhouette Score in Clustering
Purpose

The Silhouette Score is a metric used to evaluate the quality of clusters formed by a clustering algorithm.
It measures how well each data point fits within its assigned cluster compared to other clusters.

How It Works

For each data point i:

a(i): Average distance of point i to all other points in the same cluster (cohesion).

b(i): Average distance of point i to all points in the nearest other cluster, i.e. the smallest such average over all clusters that i does not belong to (separation).

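The Silhouette Score for a point is:

$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\ b(i)\}} $$

It ranges from -1 to +1: values near +1 indicate a point well inside its cluster, values near 0 a point on the boundary between two clusters, and negative values a point that is probably assigned to the wrong cluster. The overall Silhouette Score is the mean of s(i) over all points.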
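
As a check on the definitions, the sketch below computes s(i) = (b(i) - a(i)) / max(a(i), b(i)) by hand and compares it with scikit-learn's `silhouette_samples`. The four-point dataset and the helper name `silhouette_point` are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy data (hypothetical): two clusters of two points each
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels_demo = np.array([0, 0, 1, 1])

def silhouette_point(X, labels, i):
    """s(i) = (b - a) / max(a, b), using Euclidean distances.

    Assumes every cluster has at least two points.
    """
    d = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the point itself from a(i)
    a = d[same].mean()  # cohesion
    # separation: smallest mean distance to any other cluster
    b = min(d[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_point(X_demo, labels_demo, i)
                   for i in range(len(X_demo))])
reference = silhouette_samples(X_demo, labels_demo)
print(manual)
print(reference)
```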


Why It’s Useful

Helps choose the optimal number of clusters (k) in algorithms like K-Means.

Provides an objective measure of cluster quality without requiring ground truth labels.

Balances cohesion (tight clusters) and separation (clear distance between clusters).

Example

Suppose you run K-Means with k=2, 3, 4 on customer data.

You compute Silhouette Scores:

k=2 → 0.45

k=3 → 0.62 ✅ (best separation + cohesion)

k=4 → 0.39

→ You’d choose k=3 as the optimal number of clusters.
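
The same selection procedure can be sketched in code. The data here is synthetic (four blobs at hand-picked centers, an assumption for the example), so the scores will differ from the illustrative numbers above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (hypothetical): four clearly separated blobs
centers_demo = [[0, 0], [10, 0], [0, 10], [10, 10]]
X_demo, _ = make_blobs(n_samples=200, centers=centers_demo,
                       cluster_std=0.5, random_state=0)

# Fit K-Means for several k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Pick the k with the highest score
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print("best k:", best_k)
```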

✨ In short: The Silhouette Score tells you how natural and well-separated your clusters are, and helps in validating and selecting the best clustering solution.

In [None]:
Question 3: What are the core parameters of DBSCAN, and how do they influence the
clustering process?

Core Parameters of DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) relies mainly on two parameters:

1. ε (epsilon / eps)

Definition: The maximum distance between two points for them to be considered neighbors.

Influence:

Small ε → Many small, fragmented clusters + more noise points.

Large ε → Fewer, larger clusters (risk of merging distinct groups).

Think of ε as the "radius of a neighborhood."

2. MinPts (minimum points)

Definition: The minimum number of points required to form a dense region.

Influence:

Small MinPts → Even small dense regions become clusters → risk of false clusters.

Large MinPts → Only very dense regions form clusters → more noise, fewer clusters.

Rule of thumb: MinPts ≥ dimension of data + 1 (e.g., for 2D, MinPts ≥ 3).

How They Work Together

Core Point: A point with at least MinPts points within radius ε (scikit-learn's min_samples counts the point itself).

Border Point: A point that lies within ε of a core point but has fewer than MinPts neighbors of its own.

Noise Point (Outlier): Not a core point and not within ε of any core point.

Clusters grow from core points, connecting neighboring points that satisfy ε and MinPts.
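
A small sketch of how these three roles can be read off a fitted scikit-learn `DBSCAN` model. The six-point layout and the parameter values are assumptions chosen so that each role appears at least once.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy layout (hypothetical): a dense run of points, one point hanging
# off its edge, and one isolated point
X_demo = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [1.5, 0.0],
                   [2.4, 0.0],    # near the edge of the dense run
                   [10.0, 0.0]])  # isolated
db = DBSCAN(eps=1.1, min_samples=4).fit(X_demo)

# Core points are listed in core_sample_indices_; noise has label -1;
# everything else that got a cluster label is a border point
is_core = np.zeros(len(X_demo), dtype=bool)
is_core[db.core_sample_indices_] = True
is_noise = db.labels_ == -1
kinds = np.where(is_core, "core", np.where(is_noise, "noise", "border"))

for point, label, kind in zip(X_demo, db.labels_, kinds):
    print(point, label, kind)
```

The two ends of the dense run have too few neighbors to be core points but are still attached to the cluster through a neighboring core point, while the isolated point is labeled noise.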

Example

Imagine clustering GPS coordinates of taxis in a city:

If ε = 200 meters, MinPts = 10 → You detect only dense areas like taxi stands.

If ε = 1 km, MinPts = 3 → You might merge different stands into one large cluster.
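
The effect of ε can also be seen on synthetic data. The sketch below is an illustration on `make_moons` rather than GPS data; the three ε values and the helper name `summarize` are assumptions for the example. It counts clusters and noise points for each setting.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_demo, _ = make_moons(n_samples=200, noise=0.1, random_state=42)

def summarize(eps, min_samples):
    """Return (number of clusters, number of noise points) for one setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_demo)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

# Small, moderate, and large neighborhood radius with MinPts fixed at 5
results = {eps: summarize(eps, min_samples=5) for eps in (0.05, 0.2, 1.0)}
for eps, (n_clusters, n_noise) in results.items():
    print(f"eps={eps}: clusters={n_clusters}, noise={n_noise}")
```

With the smallest radius most points lack enough neighbors and are marked as noise; with the largest radius everything merges into a single cluster.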

✅ In short:

ε (eps) controls the radius of neighborhoods.

MinPts controls the minimum density threshold.

Together, they decide what counts as a cluster vs. noise.

Question 4: Why is feature scaling important when applying clustering algorithms like
K-Means and DBSCAN?

Why Feature Scaling Matters in Clustering
1. Distance-Based Nature of Clustering

Both K-Means and DBSCAN rely on distance calculations (usually Euclidean).

If features are on very different scales, the larger-scale feature dominates the distance.

👉 Example:

Suppose you cluster customers using:

Income (20,000 – 200,000)

Age (18 – 70)

Without scaling, Income (values in the tens or hundreds of thousands) outweighs Age (values below 100), so clusters form almost entirely by income and ignore age.
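
A quick numeric sketch of this effect, using three made-up customers (the values are assumptions for illustration): A and B differ mainly in age, A and C mainly in income.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [income, age]
customers = np.array([
    [50_000.0, 20.0],   # A: young
    [50_500.0, 65.0],   # B: similar income to A, much older
    [60_000.0, 21.0],   # C: similar age to A, higher income
])

def dist(X, i, j):
    """Euclidean distance between rows i and j."""
    return np.linalg.norm(X[i] - X[j])

# Raw units: the 45-year age gap barely registers next to income
raw_ab, raw_ac = dist(customers, 0, 1), dist(customers, 0, 2)

# After standardization both gaps are measured in standard deviations
scaled = StandardScaler().fit_transform(customers)
scaled_ab, scaled_ac = dist(scaled, 0, 1), dist(scaled, 0, 2)

print("raw:   ", round(raw_ab, 1), round(raw_ac, 1))
print("scaled:", round(scaled_ab, 3), round(scaled_ac, 3))
```

In raw units A looks far closer to B than to C, purely because income is measured in larger numbers; after scaling, the age gap and the income gap carry comparable weight.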

2. Impact on K-Means

Centroids are computed using mean values.

If one feature has a larger scale, centroids shift more in that dimension.

Result → biased clusters that don’t represent true similarity.

3. Impact on DBSCAN

DBSCAN uses ε (radius) in distance space.

If features are unscaled, a single ε is too small along large-scale features and too large along small-scale ones.

Result → Wrong neighborhood density → clusters break or merge incorrectly.

4. How to Scale

Standardization (Z-score scaling):

$$ z = \frac{x - \mu}{\sigma} $$

→ Mean = 0, Std = 1.
Useful if data is normally distributed.

Min-Max Scaling (Normalization):

$$ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} $$

→ Scales features into [0,1].
Useful when distribution is unknown or for bounded features.
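
Both formulas can be checked against scikit-learn's `StandardScaler` and `MinMaxScaler`. This is a minimal sketch on a single made-up feature column (the ages are an assumption for the example).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One hypothetical feature column (e.g. ages)
x = np.array([[18.0], [30.0], [45.0], [70.0]])

# Manual formulas; StandardScaler uses the population std (ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)
mm_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Library equivalents
z_sklearn = StandardScaler().fit_transform(x)
mm_sklearn = MinMaxScaler().fit_transform(x)

print(z_sklearn.ravel())
print(mm_sklearn.ravel())
```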

✅ In short:
Feature scaling ensures that all features contribute equally to distance calculations, preventing bias from scale differences, and leading to more meaningful clusters in both K-Means and DBSCAN.

In [None]:
Question 5: What is the Elbow Method in K-Means clustering and how does it help
determine the optimal number of clusters?

Elbow Method in K-Means
Problem

In K-Means, you must predefine the number of clusters k. But how do you know the “best” k?

Idea of the Elbow Method

Run K-Means with different values of k (say 1 → 10).

For each k, compute the Within-Cluster Sum of Squares (WCSS):

WCSS = total squared distance between each point and its cluster centroid.

Lower WCSS = tighter, more compact clusters.

As k increases:

WCSS decreases (more centroids always fit the data at least as closely), reaching 0 when every point is its own cluster.

But the improvement diminishes after a certain point.

The "best" k is at the elbow point of the curve → the point where adding more clusters does not significantly reduce WCSS.

Steps

Run K-Means with k = 1, 2, 3 … n.

Record WCSS for each k.

Plot k (x-axis) vs. WCSS (y-axis).

Look for the elbow (sharp bend in curve).
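
A minimal sketch of these steps, without the plot. It assumes synthetic `make_blobs` data with four centers and uses `KMeans.inertia_`, which is scikit-learn's name for WCSS.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (assumed for the example): four blobs
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60,
                       random_state=42)

# Run K-Means for k = 1..7 and record WCSS (inertia_)
ks = list(range(1, 8))
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo).inertia_
        for k in ks]

# Reduction in WCSS gained by each additional cluster
drops = [wcss[i] - wcss[i + 1] for i in range(len(wcss) - 1)]

for k, w in zip(ks, wcss):
    print(k, round(w, 1))
```

Plotting `ks` against `wcss` gives the elbow curve; for this data the drop from k=3 to k=4 is much larger than the drop from k=4 to k=5, which points to k=4.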

Example

k=1 → WCSS = 1000

k=2 → WCSS = 500

k=3 → WCSS = 250

k=4 → WCSS = 210

k=5 → WCSS = 200

👉 From k=3 to k=4, improvement is small → optimal k ≈ 3.
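
The reasoning in this example can be written as simple arithmetic on the numbers above. The 20% cut-off is an assumption made for illustration; in practice the elbow is usually judged by eye from the plot.

```python
# WCSS values from the example above
wcss = {1: 1000, 2: 500, 3: 250, 4: 210, 5: 200}
ks = sorted(wcss)

# Relative improvement gained by going from k to k + 1
rel_gain = {k: (wcss[k] - wcss[k + 1]) / wcss[k] for k in ks[:-1]}

# Assumed rule: stop at the first k whose next step improves WCSS by < 20%
threshold = 0.20
elbow_k = next(k for k in ks[:-1] if rel_gain[k] < threshold)

for k in ks[:-1]:
    print(f"k={k} -> k={k + 1}: {rel_gain[k]:.1%} reduction")
print("elbow at k =", elbow_k)
```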

Why It’s Useful

Provides a visual, intuitive way to pick k.

Prevents under-clustering (too few clusters) or over-clustering (too many meaningless clusters).

✅ In short:
The Elbow Method helps choose the optimal number of clusters in K-Means by finding the point where adding more clusters no longer provides significant benefit in reducing WCSS.

In [None]:
Dataset:
Use make_blobs, make_moons, and sklearn.datasets.load_wine() as
specified.
Question 6: Generate synthetic data using make_blobs(n_samples=300, centers=4),
apply KMeans clustering, and visualize the results with cluster centers.
(Include your Python code and output in the code box below.)

In [None]:
# Question 6: KMeans on make_blobs dataset

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# 1. Generate synthetic data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# 2. Apply KMeans clustering
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # explicit n_init keeps behavior the same across scikit-learn versions
y_kmeans = kmeans.fit_predict(X)

# 3. Get cluster centers
centers = kmeans.cluster_centers_

# 4. Visualization
plt.figure(figsize=(8,6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.7, marker='X', label='Centers')
plt.title("KMeans Clustering on make_blobs Data (4 clusters)")
plt.legend()
plt.show()


In [None]:
Question 7: Load the Wine dataset, apply StandardScaler, and then train a DBSCAN
model. Print the number of clusters found (excluding noise).
(Include your Python code and output in the code box below.)

In [None]:
# Question 7: DBSCAN on Wine dataset

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import numpy as np

# 1. Load dataset
wine = load_wine()
X = wine.data

# 2. Feature scaling (important for DBSCAN)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Apply DBSCAN
dbscan = DBSCAN(eps=1.5, min_samples=5)  # eps chosen empirically
labels = dbscan.fit_predict(X_scaled)

# 4. Count clusters (exclude noise: label = -1)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print("Cluster labels:", np.unique(labels))
print("Number of clusters found (excluding noise):", n_clusters)
print("Number of noise points:", list(labels).count(-1))


In [None]:
Cluster labels: [-1  0  1]
Number of clusters found (excluding noise): 2
Number of noise points: 5


In [None]:
Question 8: Generate moon-shaped synthetic data using
make_moons(n_samples=200, noise=0.1), apply DBSCAN, and highlight the outliers in
the plot.
(Include your Python code and output in the code box below.)
    

In [None]:
# Question 8: DBSCAN on moon-shaped data

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# 1. Generate moon-shaped synthetic data
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)

# 2. Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)  # tuned for moons
labels = dbscan.fit_predict(X)

# 3. Plot clusters
plt.figure(figsize=(8,6))

# Core clusters
plt.scatter(X[labels >= 0, 0], X[labels >= 0, 1], c=labels[labels >= 0],
            cmap='viridis', s=50, label="Clusters")

# Outliers (label = -1)
plt.scatter(X[labels == -1, 0], X[labels == -1, 1],
            c='red', s=80, marker='x', label="Outliers")

plt.title("DBSCAN on Moon-Shaped Data")
plt.legend()
plt.show()


In [None]:
Question 9: Load the Wine dataset, reduce it to 2D using PCA, then apply
Agglomerative Clustering and visualize the result in 2D with a scatter plot.
(Include your Python code and output in the code box below.)


In [None]:
# Question 9: Agglomerative Clustering on Wine dataset (2D PCA visualization)

import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

# 1. Load dataset
wine = load_wine()
X = wine.data

# 2. Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Dimensionality reduction to 2D using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# 4. Apply Agglomerative Clustering (3 clusters, matching the 3 wine classes in the dataset)
agg_clust = AgglomerativeClustering(n_clusters=3)
labels = agg_clust.fit_predict(X_pca)

# 5. Visualization
plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=labels, cmap='viridis', s=50)
plt.title("Agglomerative Clustering on Wine Dataset (2D PCA projection)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()


In [None]:
Question 10: You are working as a data analyst at an e-commerce company. The
marketing team wants to segment customers based on their purchasing behavior to run
targeted promotions. The dataset contains customer demographics and their product
purchase history across categories.
Describe your real-world data science workflow using clustering:
● Which clustering algorithm(s) would you use and why?
● How would you preprocess the data (missing values, scaling)?
● How would you determine the number of clusters?
● How would the marketing team benefit from your clustering analysis?
(Include your Python code and output in the code box below.)


1. Choice of Clustering Algorithm

K-Means → good for large datasets, interpretable clusters.

DBSCAN → useful for detecting outlier customers (rare behaviors).

Agglomerative Clustering → for exploratory analysis (dendrograms).

✅ I’d start with K-Means (scalable, marketing-friendly clusters), and compare with others.

2. Data Preprocessing

Missing values → impute (mean for numeric, mode for categorical, or "Unknown" category).

Categorical features → one-hot encoding (e.g., gender, region).

Numerical features → standardize using StandardScaler (since K-Means/DBSCAN rely on distances).

Dimensionality reduction (PCA) → for visualization and noise removal.
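
The imputation, encoding, and scaling steps above could be combined in one scikit-learn `ColumnTransformer`. This is a sketch under assumptions: the tiny table, its column names, and the "Unknown" fill value are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical customer table with missing values
df_demo = pd.DataFrame({
    "Age": [25, np.nan, 47, 35],
    "Annual_Income": [40_000, 85_000, np.nan, 60_000],
    "Region": ["North", "South", np.nan, "North"],
})

numeric_cols = ["Age", "Annual_Income"]
categorical_cols = ["Region"]

preprocess = ColumnTransformer(
    transformers=[
        # Numeric: fill gaps with the column mean, then standardize
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="mean")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # Categorical: fill gaps with an "Unknown" category, then one-hot encode
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X_ready = preprocess.fit_transform(df_demo)
print(X_ready.shape)
```

The result has two scaled numeric columns plus one indicator column per region category (including "Unknown") and can be passed directly to a clustering model.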

3. Choosing the Number of Clusters

Elbow Method (plot WCSS vs. k).

Silhouette Score (cluster quality).

Business context → e.g., 3–6 segments is usually manageable for marketing.

4. Business Value

The marketing team can use clusters to:

Identify high-value customers (loyalty programs, premium offers).

Find price-sensitive customers (discount campaigns).

Spot cross-sell opportunities (people buying electronics may buy accessories).

Detect dormant customers (send re-engagement promotions).

In [None]:
# Question 10: Customer Segmentation Workflow with Clustering

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# 1. Simulate customer dataset (demographics + purchases)
np.random.seed(42)
data = {
    "Age": np.random.randint(18, 70, 200),
    "Annual_Income": np.random.randint(20000, 150000, 200),
    "Electronics_Spend": np.random.randint(0, 5000, 200),
    "Clothing_Spend": np.random.randint(0, 3000, 200),
    "Grocery_Spend": np.random.randint(500, 5000, 200)
}
df = pd.DataFrame(data)

# 2. Preprocess (scaling)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# 3. Determine optimal number of clusters using Elbow + Silhouette
wcss = []
sil_scores = []
K = range(2, 10)
for k in K:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    wcss.append(kmeans.inertia_)
    sil_scores.append(silhouette_score(X_scaled, labels))

# Plot Elbow
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.plot(K, wcss, marker='o')
plt.title("Elbow Method")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")

# Plot Silhouette
plt.subplot(1,2,2)
plt.plot(K, sil_scores, marker='o', color='green')
plt.title("Silhouette Scores")
plt.xlabel("Number of clusters")
plt.ylabel("Score")
plt.show()

# 4. Train final KMeans model (say k=4 chosen)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

df["Cluster"] = labels

# 5. PCA for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=labels, cmap="viridis", s=50)
plt.title("Customer Segmentation (KMeans Clustering, 2D PCA projection)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()

# 6. Inspect clusters
print(df.groupby("Cluster").mean())


In [None]:
Expected Output

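Elbow + Silhouette plots → help decide the best k (the code above assumes k=4).

2D PCA scatter plot → visual clusters of customers.

Cluster profiles table → per-cluster feature means. Because the simulated features are drawn independently and uniformly at random, the actual means will not separate this cleanly; on real customer data, marketing might see profiles such as: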

Cluster 0: Young, low income → budget-friendly promotions.

Cluster 1: Older, high income → luxury offers.

Cluster 2: Middle-aged, family spenders → grocery/household deals.

Cluster 3: Tech enthusiasts → electronics campaigns.