Theory Questions:
1. What is unsupervised learning in the context of machine learning?

Unsupervised learning is a type of machine learning where the model is trained on data without labeled outputs. The goal is to uncover hidden structure in the data, for example by grouping similar points (clustering) or by finding a compact representation (dimensionality reduction).

2. How does K-Means clustering algorithm work?

K-Means partitions data into k clusters by initializing k centroids, assigning each point to the nearest centroid, then updating centroids as the mean of assigned points. This repeats until centroids stabilize.
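
The assign-then-update loop described above can be sketched in NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name kmeans_sketch, the random choice of initial centroids, and the handling of empty clusters are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations: assign to nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid: shape (n, k)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # New centroid = mean of assigned points; keep the old one if a cluster is empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of 2-D points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initialization; only the grouping itself is meaningful.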

3. Explain the concept of a dendrogram in hierarchical clustering.

A dendrogram is a tree-like diagram that shows the arrangement of the clusters formed at each step of hierarchical clustering. It helps visualize how clusters are merged or split at various levels.
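
To see what a dendrogram encodes without drawing it, the sketch below builds the merge table with SciPy and asks dendrogram for the tree layout only (no_plot=True). The five 1-D points are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Five 1-D points: two tight pairs and one far-away point
X = np.array([[0.0], [0.5], [5.0], [5.4], [20.0]])

# Each row of Z records one merge: (cluster i, cluster j, merge distance, new cluster size)
Z = linkage(X, method="average")
print(Z)

# no_plot=True returns the tree layout without drawing; "ivl" is the left-to-right leaf order
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])
```

The third column of Z holds the heights at which branches join in the plotted dendrogram, so cutting the tree at a given height corresponds to keeping only the merges below it.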

4. What is the main difference between K-Means and Hierarchical Clustering?

K-Means is a partitional method that requires the number of clusters k to be fixed in advance. Hierarchical clustering builds a tree of nested clusters and does not need k up front; the number of clusters is chosen afterwards by cutting the tree at a given level.

5. What are the advantages of DBSCAN over K-Means?

No need to specify the number of clusters.
Can find clusters of arbitrary shapes.
Handles noise/outliers well.

6. When would you use Silhouette Score in clustering?

Silhouette Score is used to evaluate the quality of clustering. It measures how similar a point is to its own cluster compared to other clusters and helps determine the optimal number of clusters.
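
As a worked illustration, the silhouette of a single point can be computed by hand as s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to any other cluster. The helper name silhouette_of_point is hypothetical, and the sketch assumes every cluster has at least two points.

```python
import numpy as np

def silhouette_of_point(X, labels, i):
    """Silhouette s(i) = (b - a) / max(a, b) for a single point i."""
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    # a: mean distance to the other members of the point's own cluster
    a = dists[own].sum() / (own.sum() - 1)
    # b: smallest mean distance to the members of any other cluster
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
s0 = silhouette_of_point(X, labels, 0)
print(s0)  # a = 1, b = 10.5, so s = 9.5 / 10.5
```

Averaging this value over all points gives the overall score that sklearn.metrics.silhouette_score reports.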

7. What are the limitations of Hierarchical Clustering?

Computationally expensive for large datasets: the standard agglomerative algorithm needs the full pairwise distance matrix (O(n²) memory), so it does not scale well.
Once a merge/split is done, it cannot be undone.
Sensitive to noise and outliers.

8. Why is feature scaling important in clustering algorithms like K-Means?

K-Means relies on distance calculations (e.g., Euclidean distance). If features are on different scales, one feature may dominate the clustering, leading to poor results.
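
A small numeric illustration of the effect, using made-up values for two hypothetical features (income and age). Standardization is done by hand here with the same zero-mean, unit-variance formula that StandardScaler applies.

```python
import numpy as np

# Hypothetical features: column 0 is income, column 1 is age
X = np.array([[30000.0, 25.0],
              [30500.0, 60.0],
              [60000.0, 26.0]])

def dist(u, v):
    return float(np.linalg.norm(u - v))

# Raw distances: the income column dominates, so the 35-year age gap barely registers
raw_01 = dist(X[0], X[1])
raw_02 = dist(X[0], X[2])
print(raw_01, raw_02)

# Standardise each column to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_01 = dist(X_std[0], X_std[1])
std_02 = dist(X_std[0], X_std[2])
print(std_01, std_02)
```

Before scaling, row 0 looks far closer to row 1 than to row 2 purely because income is measured in larger units; after scaling, the age gap and the income gap contribute on comparable terms.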

9. How does DBSCAN identify noise points?

Points that are not core points and do not fall within the neighborhood (ε) of any core point are classified as noise or outliers.
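
The rule above can be written out directly. The sketch below only classifies points as core, border, or noise; it does not assign cluster labels. The function name classify_points is hypothetical, and a point is counted as its own neighbor, which matches how scikit-learn's min_samples is defined.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point 'core', 'border' or 'noise' following the DBSCAN definitions."""
    # Pairwise Euclidean distances; a point counts as its own neighbour
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = d <= eps
    is_core = neighbours.sum(axis=1) >= min_pts
    # Border: not core, but inside the eps-neighbourhood of at least one core point
    near_core = (neighbours & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense square
              [0.35, 0.1],                                      # edge of the square
              [5.0, 5.0]])                                      # isolated point
kinds = classify_points(X, eps=0.3, min_pts=4)
print(kinds)
```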

10. Define inertia in the context of K-Means.

Inertia is the sum of squared distances between each point and the centroid of its cluster. Lower inertia indicates tighter, more compact clusters.
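
A worked example of the definition on four made-up points with a fixed assignment. With scikit-learn, the same quantity is available as the inertia_ attribute of a fitted KMeans model.

```python
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [14.0, 0.0]])
labels = np.array([0, 0, 1, 1])

# Centroid of each cluster = mean of its points
centroids = np.array([X[labels == j].mean(axis=0) for j in np.unique(labels)])

# Inertia = sum over all points of the squared distance to the assigned centroid
inertia = float(((X - centroids[labels]) ** 2).sum())
print(centroids)  # centroids at x = 1 and x = 12
print(inertia)    # 1 + 1 + 4 + 4 = 10.0
```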

11. What is the elbow method in K-Means clustering?

A method to choose the number of clusters by plotting inertia against the number of clusters k. Inertia generally keeps decreasing as k grows, so the "elbow" is the point beyond which adding another cluster gives only a small further reduction; that k is taken as a reasonable choice.
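
The elbow can also be inspected numerically, without a plot, by looking at how much inertia each additional cluster removes. This is a sketch on synthetic blobs; the dataset parameters and n_init=10 are choices made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

# How much inertia each extra cluster removes; the drop shrinks sharply after the elbow
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 7)}
for k, drop in drops.items():
    print(k, round(drop, 1))
```

On data generated from three blobs, the drop when going from 2 to 3 clusters should be much larger than the drop from 3 to 4, which is what the visual elbow at k = 3 reflects.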

12. Describe the concept of "density" in DBSCAN.

Density refers to the number of points within a specified radius (ε). A region is considered dense if it has at least a minimum number of points (MinPts) within ε.

13. Can hierarchical clustering be used on categorical data?

Yes, but standard hierarchical clustering using Euclidean distance is not ideal. A dissimilarity suited to categories is needed instead, such as Hamming distance for purely categorical data or Gower’s distance for mixed categorical and numeric data.
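
A minimal sketch of the Hamming-distance route with SciPy, assuming the categories have already been encoded as integer codes. The records and their meaning (colour, size, shape) are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical categorical records encoded as integer codes: (colour, size, shape)
X = np.array([[0, 1, 2],
              [0, 1, 1],
              [0, 1, 2],
              [3, 0, 0],
              [3, 0, 1]])

# Hamming distance = fraction of attributes on which two records disagree
dists = pdist(X, metric="hamming")

# Average linkage on the precomputed (condensed) distances, then cut into 2 clusters
Z = linkage(dists, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The integer codes are used only to test equality, so their numeric order carries no meaning here.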

14. What does a negative Silhouette Score indicate?

A negative Silhouette Score means a point is, on average, closer to the points of a neighboring cluster than to those of its own cluster, which suggests it may have been assigned to the wrong cluster.
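
A small numeric case, using made-up 1-D points where one point is deliberately given the wrong label, shows how the score turns negative.

```python
import numpy as np

# 1-D points; the point at 9.0 is deliberately given label 0 although it sits next to cluster 1
X = np.array([0.0, 1.0, 9.0, 10.0, 11.0])
labels = np.array([0, 0, 0, 1, 1])

i = 2
d = np.abs(X - X[i])
a = d[(labels == 0) & (np.arange(len(X)) != i)].mean()  # mean distance to own cluster: 8.5
b = d[labels == 1].mean()                                # mean distance to the other cluster: 1.5
s = (b - a) / max(a, b)
print(a, b, s)
```

Because a is larger than b, the numerator b - a is negative, and so is the score.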

15. Explain the term "linkage criteria" in hierarchical clustering.

Linkage criteria determine how the distance between clusters is calculated. Common types include:
Single linkage (min distance),
Complete linkage (max distance),
Average linkage (mean distance),
Ward’s method (minimize variance).
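
The four criteria listed above can be compared on two small made-up clusters A and B. For Ward, the sketch computes the increase in within-cluster sum of squares that merging A and B would cause, which is the quantity Ward's method minimizes at each step.

```python
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise distances between a point in A and a point in B: shape (len(A), len(B))
pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = pair.min()      # closest pair           -> 3.0
complete = pair.max()    # farthest pair          -> 6.0
average = pair.mean()    # mean over all pairs    -> (4 + 6 + 3 + 5) / 4 = 4.5

# Ward: increase in within-cluster sum of squares caused by merging A and B
def sse(P):
    return float(((P - P.mean(axis=0)) ** 2).sum())

ward_increase = sse(np.vstack([A, B])) - sse(A) - sse(B)
print(single, complete, average, ward_increase)
```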

16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?

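Because K-Means minimizes the squared distance of points to their cluster centroid, it implicitly favors roughly spherical clusters of similar size and density. When clusters are elongated, or differ a lot in size or density, points near a large or sparse cluster's edge can end up assigned to a neighboring centroid.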

17. What are the core parameters in DBSCAN, and how do they influence clustering?

ε (epsilon): Radius to consider for neighborhood.
MinPts: Minimum number of points within ε to form a dense region.
These parameters determine what is considered a cluster and what is labeled noise.
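
The influence of the two parameters can be seen by running scikit-learn's DBSCAN on the same tiny dataset with different settings. The points and the helper name summarize are made up for this illustration; min_samples is scikit-learn's name for MinPts and counts the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of four points plus one isolated point
X = np.array([[0.0, 0.0], [0.0, 0.2], [0.2, 0.0], [0.2, 0.2],
              [3.0, 3.0], [3.0, 3.2], [3.2, 3.0], [3.2, 3.2],
              [10.0, 10.0]])

def summarize(eps, min_samples):
    """Return (number of clusters, number of noise points) for one parameter setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

for eps, min_samples in [(0.1, 3), (0.5, 3), (5.0, 3), (0.5, 5)]:
    print(eps, min_samples, summarize(eps, min_samples))
```

A very small ε leaves every point as noise, a moderate ε recovers the two groups, a very large ε merges them into one cluster, and raising MinPts above the group size turns everything back into noise.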

18. How does K-Means++ improve upon standard K-Means initialization?

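K-Means++ picks the first centroid uniformly at random from the data, then picks each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen. This spreads the initial centroids out, which reduces the chance of a poor local optimum and usually speeds up convergence.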
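
A minimal sketch of k-means++ seeding as it is usually described: each new centre is drawn at random, with probability proportional to its squared distance from the nearest centre chosen so far. The function name kmeans_pp_init and the toy data are assumptions for the example; scikit-learn's own routine differs in details.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: sample each new centre with probability proportional to D(x)^2."""
    rng = np.random.default_rng(seed)
    # First centre: uniformly at random from the data
    centres = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # D(x)^2: squared distance from each point to its nearest centre chosen so far
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        # Far-away points are more likely, but not certain, to be picked
        probs = d2 / d2.sum()
        centres.append(X[rng.choice(len(X), p=probs)])
    return np.array(centres)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0], [10.1, 0.0]])
centres = kmeans_pp_init(X, k=3)
print(centres)
```

A point that is already a centre has distance zero to itself, so it cannot be drawn again; the remaining probability mass concentrates on points far from all current centres.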

19. What is agglomerative clustering?

A type of hierarchical clustering that starts with each point as its own cluster and iteratively merges the closest pair of clusters until a single cluster, or the desired number of clusters, remains.
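
The bottom-up procedure can be sketched in plain Python. This naive version uses single linkage on 1-D points and is meant only to show the merge loop; the function name agglomerate is hypothetical, and real implementations such as AgglomerativeClustering are far more efficient.

```python
def agglomerate(points, n_clusters):
    """Naive single-linkage agglomerative clustering on 1-D points."""
    # Start with every point in its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest single-linkage distance
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

result = agglomerate([1.0, 1.5, 2.0, 8.0, 8.4, 20.0], n_clusters=3)
print(result)
```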

20. What makes Silhouette Score a better metric than just inertia for model evaluation?

Inertia only measures intra-cluster distance, not inter-cluster separation. Silhouette Score considers both compactness and separation, giving a more balanced view of clustering quality.

Practical Questions:

In [None]:
#21. Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a scatter plot.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42)
y_pred = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', s=30)
plt.title("K-Means Clustering on Synthetic Data (4 centers)")
plt.show()


In [None]:
#22. Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels.
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

iris = load_iris()
X = iris.data
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X)

print("First 10 predicted labels:", labels[:10])


In [None]:
#23. Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import numpy as np

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='plasma', s=30)
outliers = labels == -1
plt.scatter(X[outliers, 0], X[outliers, 1], c='red', s=50, label='Outliers')
plt.legend()
plt.title("DBSCAN on make_moons data")
plt.show()


In [None]:
#24. Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster.
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from collections import Counter

wine = load_wine()
X = wine.data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_scaled)

cluster_sizes = Counter(labels)
print("Cluster sizes:", cluster_sizes)


In [None]:
#25. Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result.
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=300, noise=0.05, factor=0.4, random_state=42)
dbscan = DBSCAN(eps=0.15, min_samples=5)
labels = dbscan.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='coolwarm', s=30)
plt.title("DBSCAN clustering on make_circles")
plt.show()


In [None]:
#26. Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster centroids.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

cancer = load_breast_cancer()
X = cancer.data
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X_scaled)

centroids = pd.DataFrame(kmeans.cluster_centers_, columns=cancer.feature_names)
print("Cluster centroids:\n", centroids)


In [None]:
#27. Generate synthetic data using make_blobs with varying cluster standard deviations and cluster with DBSCAN.
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=[0.3, 1.5, 0.5], random_state=42)
dbscan = DBSCAN(eps=0.8, min_samples=5)
labels = dbscan.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='Set1', s=30)
plt.title("DBSCAN on blobs with varying std dev")
plt.show()


In [None]:
#28. Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
X = digits.data

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Cluster on the full 64-dimensional features; PCA is used only for the 2D view
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(X)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='tab10', s=30)
plt.title("K-Means clusters on Digits, shown in 2D PCA space")
plt.show()


In [None]:
#29. Create synthetic data using make_blobs and evaluate silhouette scores for k = 2 to 5. Display as a bar chart.
from sklearn.metrics import silhouette_score
import numpy as np

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.7, random_state=42)
scores = []

for k in range(2, 6):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    scores.append(score)

plt.bar(range(2, 6), scores, color='skyblue')
plt.xlabel("Number of clusters (k)")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Scores for different k values")
plt.show()


In [None]:
#30. Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage.
from scipy.cluster.hierarchy import dendrogram, linkage

iris = load_iris()
X = iris.data

linked = linkage(X, method='average')

plt.figure(figsize=(10, 6))
dendrogram(linked, labels=iris.target, orientation='top')
plt.title("Dendrogram (Average linkage) for Iris dataset")
plt.xlabel("Samples")
plt.ylabel("Distance")
plt.show()


In [None]:
#31. Generate synthetic data with overlapping clusters using make_blobs, then apply K-Means and visualize with decision boundaries.
import numpy as np
from matplotlib.colors import ListedColormap

X, y = make_blobs(n_samples=300, centers=3, cluster_std=2.5, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Create mesh grid for decision boundaries
h = 0.1
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
plt.figure(figsize=(8,6))
plt.contourf(xx, yy, Z, cmap=cmap_light, alpha=0.5)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis', edgecolor='k')
plt.title("K-Means with overlapping clusters and decision boundaries")
plt.show()


In [None]:
#32. Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results.
from sklearn.manifold import TSNE

digits = load_digits()
X = digits.data

tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

dbscan = DBSCAN(eps=3, min_samples=5)
labels = dbscan.fit_predict(X_tsne)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='tab20', s=30)
plt.title("DBSCAN on t-SNE reduced Digits data")
plt.show()


In [None]:
#33. Generate synthetic data using make_blobs and apply Agglomerative Clustering with complete linkage. Plot the result.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
agg = AgglomerativeClustering(n_clusters=3, linkage='complete')
labels = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='Set1', s=30)
plt.title("Agglomerative Clustering with complete linkage")
plt.show()


In [None]:
#34. Load the Breast Cancer dataset and compare inertia values for K = 2 to 6 using K-Means. Show results in a line plot.
cancer = load_breast_cancer()
X = cancer.data

inertia_values = []
k_range = range(2, 7)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia_values.append(kmeans.inertia_)

plt.plot(k_range, inertia_values, marker='o')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia")
plt.title("K-Means Inertia for Breast Cancer dataset")
plt.show()


In [None]:
#35. Generate synthetic concentric circles using make_circles and cluster using Agglomerative Clustering with single linkage.
from sklearn.datasets import make_circles
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

X, _ = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)
agg = AgglomerativeClustering(n_clusters=2, linkage='single')
labels = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title("Agglomerative Clustering (Single linkage) on Concentric Circles")
plt.show()


In [None]:
#36. Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise).
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_wine
import numpy as np

wine = load_wine()
X = wine.data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

dbscan = DBSCAN(eps=1.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Number of clusters (excluding noise):", n_clusters)


In [None]:
#37. Generate synthetic data with make_blobs and apply KMeans. Then plot the cluster centers on top of the data points.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.7, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', marker='X')
plt.title("K-Means with cluster centers")
plt.show()


In [None]:
#38. Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise.
iris = load_iris()
X = iris.data

dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)

noise_count = list(labels).count(-1)
print("Number of noise samples detected:", noise_count)


In [None]:
#39. Generate synthetic non-linearly separable data using make_moons, apply K-Means, and visualize the clustering result.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='coolwarm')
plt.title("K-Means clustering on make_moons (non-linear data)")
plt.show()


In [None]:
#40. Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D scatter plot.
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA

digits = load_digits()
X = digits.data

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

# Cluster in the 3-component PCA space, following the task order (PCA first, then KMeans)
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(X_pca)

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=labels, cmap='tab10', s=40)
ax.set_title("3D PCA + K-Means on Digits dataset")
plt.show()


In [None]:
#41. Generate synthetic blobs with 5 centers and apply KMeans. Then use silhouette_score to evaluate the clustering.
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.7, random_state=42)
kmeans = KMeans(n_clusters=5, random_state=42)
labels = kmeans.fit_predict(X)

score = silhouette_score(X, labels)
print("Silhouette Score for 5-cluster KMeans:", score)


In [None]:
#42. Load the Breast Cancer dataset, reduce dimensionality using PCA, and apply Agglomerative Clustering. Visualize in 2D.
cancer = load_breast_cancer()
X = cancer.data

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

agg = AgglomerativeClustering(n_clusters=2)
labels = agg.fit_predict(X_pca)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', s=30)
plt.title("Agglomerative Clustering after PCA on Breast Cancer data")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()


In [None]:
#43. Generate noisy circular data using make_circles and visualize clustering results from KMeans and DBSCAN side-by-side.
X, _ = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=42)

kmeans = KMeans(n_clusters=2, random_state=42)
labels_km = kmeans.fit_predict(X)

dbscan = DBSCAN(eps=0.15, min_samples=5)
labels_db = dbscan.fit_predict(X)

fig, axs = plt.subplots(1, 2, figsize=(12,5))

axs[0].scatter(X[:, 0], X[:, 1], c=labels_km, cmap='coolwarm')
axs[0].set_title("KMeans Clustering")

axs[1].scatter(X[:, 0], X[:, 1], c=labels_db, cmap='coolwarm')
axs[1].set_title("DBSCAN Clustering")

plt.show()


In [None]:
#44. Load the Iris dataset and plot the Silhouette Coefficient for each sample after KMeans clustering.
from sklearn.metrics import silhouette_samples
import seaborn as sns

iris = load_iris()
X = iris.data

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

sil_samples = silhouette_samples(X, labels)

sns.scatterplot(x=range(len(sil_samples)), y=sil_samples, hue=labels, palette='Set1', legend='full')
plt.xlabel("Sample Index")
plt.ylabel("Silhouette Coefficient")
plt.title("Silhouette Coefficient per sample (Iris + KMeans)")
plt.show()


In [None]:
#45. Generate synthetic data using make_blobs and apply Agglomerative Clustering with 'average' linkage. Visualize clusters.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=42)
agg = AgglomerativeClustering(n_clusters=4, linkage='average')
labels = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='Set2', s=30)
plt.title("Agglomerative Clustering (Average linkage)")
plt.show()


In [None]:
#46. Load the Wine dataset, apply KMeans, and visualize the cluster assignments in a seaborn pairplot (first 4 features).
import pandas as pd
import seaborn as sns

wine = load_wine()
X = wine.data[:, :4]
df = pd.DataFrame(X, columns=wine.feature_names[:4])

kmeans = KMeans(n_clusters=3, random_state=42)
df['cluster'] = kmeans.fit_predict(X).astype(str)

sns.pairplot(df, hue='cluster', palette='Set1')
plt.suptitle("Wine dataset cluster assignments (first 4 features)", y=1.02)
plt.show()


In [None]:
#47. Generate noisy blobs using make_blobs and use DBSCAN to identify both clusters and noise points. Print the count.
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.5, random_state=42)
dbscan = DBSCAN(eps=1.2, min_samples=5)
labels = dbscan.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)

print(f"Clusters found: {n_clusters}")
print(f"Noise points detected: {n_noise}")


In [None]:
#48. Load the Digits dataset, reduce dimensions using t-SNE, then apply Agglomerative Clustering and plot the clusters.
X = load_digits().data
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

agg = AgglomerativeClustering(n_clusters=10)
labels = agg.fit_predict(X_tsne)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='tab10', s=30)
plt.title("Agglomerative Clustering on t-SNE reduced Digits data")
plt.show()

