
# **Theoretical Questions**
### **1. What is unsupervised learning in the context of machine learning?**

Unsupervised learning is a type of machine learning where the algorithm learns from data that has **not been labeled** or categorized. Unlike supervised learning, there is no "ground truth" or target variable. The goal is to find hidden patterns, structures, or groupings within the data.

### **2. How does the K-Means clustering algorithm work?**

K-Means follows an iterative process to partition data into $k$ distinct clusters:

1. **Initialization:** Select $k$ random points as initial centroids.
2. **Assignment:** Assign each data point to the nearest centroid based on a distance metric (usually Euclidean distance).
3. **Update:** Calculate the mean of all points in each cluster and move the centroid to that mean position.
4. **Repeat:** Steps 2 and 3 are repeated until centroids no longer change or a maximum number of iterations is reached.
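
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation; the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain Lloyd-style K-Means; illustrative only (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: index of the nearest centroid (Euclidean) per point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points
        #    (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups (made-up values).
X_demo = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(labels_demo)
print(centroids_demo)
```

In practice `sklearn.cluster.KMeans` is the better choice, since it adds smarter initialization and multiple restarts.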

### **3. Explain the concept of a dendrogram in hierarchical clustering.**

A dendrogram is a **tree-like diagram** that records the sequence of merges or splits during hierarchical clustering.

* The **y-axis** represents the distance (dissimilarity) between clusters.
* The **x-axis** represents individual data points.
* By cutting the dendrogram horizontally at a certain height, you can determine the number of clusters in the dataset.
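
As a small sketch of the horizontal cut, SciPy's `fcluster` can turn a linkage matrix into flat labels at a chosen height. The toy data and the cut height of `1.0` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 1-D data with three obvious groups (illustrative values).
X_demo = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [10.0]])
Z = linkage(X_demo, method='average')

# Each row of Z is one merge: (cluster_a, cluster_b, merge_distance, new_size).
# Column 2 holds the merge heights, i.e. the y-axis of the dendrogram.
print(Z[:, 2])

# "Cut" the tree at height 1.0: only merges below this height are kept.
labels_cut = fcluster(Z, t=1.0, criterion='distance')
print(labels_cut)
print("Number of clusters:", len(set(labels_cut)))
```

Raising or lowering `t` moves the cut line up or down the tree and changes how many clusters come out.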

### **4. What is the main difference between K-Means and Hierarchical Clustering?**

* **K-Means:** Requires the number of clusters ($k$) to be specified in advance. It is computationally efficient for large datasets and produces "flat" partitions.
* **Hierarchical Clustering:** Does not require a pre-defined number of clusters. It creates a nested structure (hierarchy) of clusters, which is better for understanding data relationships but is computationally expensive (typically $O(n^2)$ to $O(n^3)$ in the number of points).

### **5. What are the advantages of DBSCAN over K-Means?**

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) offers three main advantages:

1. **Arbitrary Shapes:** It can find clusters of any shape, whereas K-Means assumes clusters are spherical.
2. **No K-Value:** You don’t need to specify the number of clusters beforehand.
3. **Noise Handling:** It explicitly identifies and ignores outliers (noise points).

### **6. When would you use Silhouette Score in clustering?**

The Silhouette Score is used to evaluate the **quality and separation** of clusters. It measures how similar an object is to its own cluster compared to other clusters. You use it when you want to validate the choice of $k$ or compare different clustering algorithms.

### **7. What are the limitations of Hierarchical Clustering?**

* **Scalability:** It is very slow on large datasets because the complexity increases drastically with the number of points.
* **Irreversibility:** Once a merge or split is done, it cannot be undone.
* **Sensitivity:** It is sensitive to noise and outliers.

### **8. Why is feature scaling important in clustering algorithms like K-Means?**

K-Means relies on **distance calculations** (like Euclidean distance). If one feature has a range of 0–1 and another has a range of 0–10,000, the larger feature will dominate the distance calculation, making the smaller feature irrelevant. Scaling ensures all features contribute equally.
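
A quick numeric sketch of the effect, using three made-up samples whose second column is on a much larger scale than the first:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: column 0 lies in 0-1, column 1 lies in 0-10,000.
X_demo = np.array([[0.1, 1000.0],
                   [0.9, 1100.0],
                   [0.1, 9000.0]])

def euclid(a, b):
    return float(np.linalg.norm(a - b))

# Raw data: the large-scale column decides everything. Rows 0 and 1 differ
# almost maximally in column 0, yet that barely registers in the distance.
raw_01 = euclid(X_demo[0], X_demo[1])
raw_02 = euclid(X_demo[0], X_demo[2])
print(f"raw:    d(0,1)={raw_01:.2f}  d(0,2)={raw_02:.2f}")

# After standardization each column has zero mean and unit variance,
# so a large difference in either column counts comparably.
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_01 = euclid(X_scaled[0], X_scaled[1])
scaled_02 = euclid(X_scaled[0], X_scaled[2])
print(f"scaled: d(0,1)={scaled_01:.2f}  d(0,2)={scaled_02:.2f}")
```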

### **9. How does DBSCAN identify noise points?**

DBSCAN classifies a point as **Noise** if it is neither a "Core Point" (having at least `minPts` within a radius `eps`) nor a "Border Point" (within the radius of a Core Point but having fewer than `minPts` neighbors).
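
The three point types can be recovered from a fitted scikit-learn `DBSCAN` model, as in the sketch below. The toy coordinates and the `eps`/`min_samples` values are assumptions picked so that each type appears once; note that scikit-learn's `min_samples` counts the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: a tight group of five points, one point on its edge, one far away.
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
                   [0.3, 0.05],   # near a core point, but has too few neighbors itself
                   [5.0, 5.0]])   # isolated
db = DBSCAN(eps=0.22, min_samples=5).fit(X_demo)

# Core points are listed explicitly by the fitted model.
core_mask = np.zeros(len(X_demo), dtype=bool)
core_mask[db.core_sample_indices_] = True

# Noise points get the label -1; border points are clustered but not core.
noise_mask = db.labels_ == -1
border_mask = ~core_mask & ~noise_mask

print("core:  ", np.where(core_mask)[0])
print("border:", np.where(border_mask)[0])
print("noise: ", np.where(noise_mask)[0])
```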

### **10. Define inertia in the context of K-Means.**

Inertia (or Within-Cluster Sum of Squares) is the sum of squared distances of samples to their closest cluster center. It measures how **internally coherent** clusters are; lower inertia indicates more tightly packed clusters.
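
Written out, inertia is $\sum_i \lVert x_i - \mu_{c(i)} \rVert^2$, where $\mu_{c(i)}$ is the centroid of the cluster that sample $x_i$ is assigned to. As a sanity check, the sketch below recomputes it by hand and compares it with scikit-learn's `inertia_` attribute (the blob data is synthetic and only for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Look up the centroid each sample was assigned to, then sum squared distances.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = float(np.sum((X_demo - assigned_centers) ** 2))

print("manual: ", manual_inertia)
print("sklearn:", km.inertia_)
```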

---

### **11. What is the elbow method in K-Means clustering?**

The elbow method is a heuristic used to find the optimal number of clusters ($k$). You plot the **Inertia** against the number of clusters. The "elbow" point on the graph, where the rate of decrease in inertia slows down significantly, is usually considered the ideal $k$.

### **12. Describe the concept of "density" in DBSCAN.**

Density is defined by the number of points within a specific neighborhood. A region is considered "dense" if it contains at least a minimum number of points (`minPts`) within a specified radius (`eps`). Clusters are essentially high-density regions separated by low-density regions.
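
To make the definition concrete, the sketch below counts how many points fall inside each point's `eps`-neighborhood using scikit-learn's `NearestNeighbors`. The data and the `eps`/`min_pts` values are assumptions for this example; each point counts as its own neighbor here, matching scikit-learn's DBSCAN convention.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four points packed together and one far away (illustrative values).
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]])
eps, min_pts = 0.2, 4

# For every point, find the indices of all points within radius eps.
nn = NearestNeighbors(radius=eps).fit(X_demo)
neighborhoods = nn.radius_neighbors(X_demo, return_distance=False)
counts = np.array([len(idx) for idx in neighborhoods])

# A neighborhood is "dense" when it holds at least min_pts points.
is_dense = counts >= min_pts
print("neighbors within eps:", counts)
print("dense neighborhood:  ", is_dense)
```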

### **13. Can hierarchical clustering be used on categorical data?**

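Yes, but Euclidean distance is not meaningful for unordered categories. Instead, compute a dissimilarity suited to categorical data, such as **Gower’s distance** (for mixed types) or the **Jaccard distance** (one minus Jaccard similarity, for binary or one-hot encoded features), and pair it with a linkage that accepts arbitrary dissimilarities (for example average or complete linkage rather than Ward’s).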
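
One possible way to do this with SciPy is to compute a condensed Jaccard distance matrix with `pdist` and pass it to `linkage`. The one-hot encoded records below are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical one-hot encoded categorical records (rows) over six indicators.
X_demo = np.array([[1, 0, 1, 0, 1, 0],
                   [1, 0, 1, 0, 0, 1],
                   [0, 1, 0, 1, 1, 0],
                   [0, 1, 0, 1, 0, 1]], dtype=bool)

# Condensed pairwise Jaccard distances (1 - intersection / union).
D = pdist(X_demo, metric='jaccard')

# Average linkage works directly on a precomputed dissimilarity vector.
Z = linkage(D, method='average')
labels_cat = fcluster(Z, t=2, criterion='maxclust')
print(labels_cat)
```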

### **14. What does a negative Silhouette Score indicate?**

A negative value (closer to -1) indicates that a sample has probably been assigned to the **wrong cluster**, because on average it is closer to the points of a neighboring cluster than to those of the cluster it currently belongs to.
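
For a single sample the coefficient is $s = (b - a) / \max(a, b)$, where $a$ is the mean distance to the other points in its own cluster and $b$ is the mean distance to the points of the nearest other cluster. The sketch below deliberately mislabels one point (an assumption made purely for illustration) to show the sign flipping.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two tight 1-D groups around 0 and around 5 (toy values).
X_demo = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

# Deliberately mislabel the last point: it sits with the second group
# but carries the label of the first.
labels_demo = np.array([0, 0, 0, 1, 1, 0])

s = silhouette_samples(X_demo, labels_demo)
for value, label, coef in zip(X_demo[:, 0], labels_demo, s):
    print(f"x={value:>4}  label={label}  silhouette={coef:+.2f}")
```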

### **15. Explain the term "linkage criteria" in hierarchical clustering.**

Linkage criteria determine how the distance between two *clusters* is calculated:

* **Single Linkage:** Distance between the closest points of two clusters.
* **Complete Linkage:** Distance between the farthest points.
* **Average Linkage:** Average distance between all pairs of points.
* **Ward’s Linkage:** Minimizes the variance within clusters.
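
The four criteria can be compared on two tiny clusters. This is an illustrative sketch with made-up points; for Ward's criterion it shows the increase in within-cluster sum of squares that merging would cause, which is the quantity Ward's method keeps as small as possible at each step.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small clusters on a line (toy values).
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [6.0, 0.0]])

# All pairwise distances between a point in A and a point in B.
D = cdist(A, B)

single = D.min()     # closest pair
complete = D.max()   # farthest pair
average = D.mean()   # mean over all pairs

def within_ss(points):
    """Sum of squared distances to the cluster mean."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

# Ward: how much the total within-cluster variance grows if A and B merge.
ward_increase = within_ss(np.vstack([A, B])) - within_ss(A) - within_ss(B)

print(single, complete, average, ward_increase)
```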

### **16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?**

K-Means tries to minimize the variance, which leads it to prefer clusters of **similar size and spherical shape**. If one cluster is very dense and another is very sparse, or if one is much larger than the other, K-Means may "break" the large/sparse cluster to balance the inertia.

### **17. What are the core parameters in DBSCAN, and how do they influence clustering?**

1. **Epsilon (eps):** The maximum distance between two samples for one to be considered as in the neighborhood of the other. If `eps` is too small, most data will be noise; if too large, clusters will merge.
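2. **MinPts (`min_samples` in scikit-learn):** The minimum number of samples in a neighborhood (counting the point itself in scikit-learn) for a point to be considered a core point. Higher values give more conservative clusters and work better for noisy datasets.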
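
A small sweep over `eps` illustrates both failure modes. The three `eps` values below are assumptions chosen to be clearly too small, reasonable, and clearly too large for this synthetic moons data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_demo, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

results = {}
for eps in (0.05, 0.2, 2.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_demo)
    # Count clusters without the noise label, and count noise points.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With a very large `eps` every point is reachable from every other, so the two moons collapse into a single cluster.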

### **18. How does K-Means++ improve upon standard K-Means initialization?**

Standard K-Means picks initial centroids uniformly at random, which can lead to slow convergence or a poor local optimum. **K-Means++** spreads out the initial centroids: it chooses the first one at random from the data, then picks each subsequent centroid from the data points with probability proportional to its squared distance from the closest centroid chosen so far.
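
The seeding rule can be sketched in NumPy as follows. This is a simplified illustration rather than scikit-learn's implementation; the helper name `kmeans_pp_init` and the toy data are assumptions.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Simplified K-Means++ seeding (hypothetical helper, illustration only)."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from every point to its closest chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2,
        # so far-away points are favored and already-chosen points get 0.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Three small, well-separated groups (toy values).
X_demo = np.array([[0.0, 0.0], [0.1, 0.2],
                   [5.0, 5.0], [5.1, 4.9],
                   [10.0, 0.0], [9.9, 0.2]])
centers_demo = kmeans_pp_init(X_demo, k=3)
print(centers_demo)
```

In scikit-learn, `KMeans(init='k-means++')` applies this kind of seeding and is the default.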

### **19. What is agglomerative clustering?**

Agglomerative clustering is the most common type of hierarchical clustering. It is a **"bottom-up"** approach where each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

### **20. What makes Silhouette Score a better metric than just inertia for model evaluation?**

Inertia only measures how tight clusters are (cohesion), and it keeps decreasing as you add more clusters, so on its own it cannot tell you when to stop. **Silhouette Score** considers both **cohesion** (how close each point is to the other points in its own cluster) and **separation** (how far it is from the points of the nearest other cluster), providing a more balanced view of cluster quality.
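
The contrast shows up when both metrics are computed over a range of cluster counts. The sketch below uses synthetic blobs with three true centers (an assumption for illustration): inertia keeps shrinking as the count grows, while the silhouette score typically peaks near the true number of groups.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X_demo, km.labels_)
    print(f"k={k}: inertia={inertias[k]:.1f}  silhouette={silhouettes[k]:.3f}")

# The k with the highest silhouette score is the one this metric prefers.
best_k = max(silhouettes, key=silhouettes.get)
print("k preferred by silhouette:", best_k)
```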

---


# **Practical Questions**

### **1. K-Means with 4 Centers (make_blobs)**

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)

plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', marker='X')
plt.title("K-Means with 4 Clusters")
plt.show()

```

### **2. Agglomerative Clustering on Iris Dataset**

```python
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

iris = load_iris()
agg_clustering = AgglomerativeClustering(n_clusters=3)
labels = agg_clustering.fit_predict(iris.data)

print("First 10 predicted labels:", labels[:10])

```

### **3. DBSCAN on Moon Data (Highlighting Outliers)**

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)

plt.scatter(X[labels != -1, 0], X[labels != -1, 1], c=labels[labels != -1], cmap='Paired')
plt.scatter(X[labels == -1, 0], X[labels == -1, 1], c='black', label='Outliers')
plt.legend()
plt.title("DBSCAN: Clusters and Noise")
plt.show()

```

### **4. Wine Dataset: Standardization & K-Means**

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
import pandas as pd

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
kmeans = KMeans(n_clusters=3, random_state=42).fit(X_scaled)

clusters = pd.Series(kmeans.labels_)
print("Cluster Sizes:\n", clusters.value_counts())

```

### **5. DBSCAN on Concentric Circles**

```python
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=500, factor=0.5, noise=0.05)
dbscan = DBSCAN(eps=0.15, min_samples=5)
plt.scatter(X[:, 0], X[:, 1], c=dbscan.fit_predict(X), cmap='plasma')
plt.title("DBSCAN on Concentric Circles")
plt.show()

```

### **6. Breast Cancer Dataset: MinMaxScaler & Centroids**

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler

data = load_breast_cancer()
X_scaled = MinMaxScaler().fit_transform(data.data)
kmeans = KMeans(n_clusters=2, random_state=42).fit(X_scaled)

print("Cluster Centroids:\n", kmeans.cluster_centers_)

```

### **7. DBSCAN with Varying Cluster Densities**

```python
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=[0.5, 1.5, 0.7], random_state=42)
dbscan = DBSCAN(eps=0.8, min_samples=5)
plt.scatter(X[:, 0], X[:, 1], c=dbscan.fit_predict(X), cmap='tab10')
plt.title("DBSCAN with Varying Densities")
plt.show()

```

### **8. PCA Reduction & K-Means (Digits Dataset)**

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
pca = PCA(n_components=2)
X_pca = pca.fit_transform(digits.data)

kmeans = KMeans(n_clusters=10, random_state=42).fit(X_pca)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans.labels_, cmap='Spectral', s=5)
plt.title("K-Means on Digits (PCA Reduced)")
plt.show()

```

### **9. Silhouette Scores Bar Chart (k=2 to 5)**

```python
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
scores = []
ks = [2, 3, 4, 5]

for k in ks:
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    scores.append(silhouette_score(X, labels))

plt.bar(ks, scores, color='skyblue')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.show()

```

### **10. Dendrogram for Iris (Average Linkage)**

```python
from scipy.cluster.hierarchy import dendrogram, linkage

iris = load_iris()
linked = linkage(iris.data, method='average')


plt.figure(figsize=(10, 7))
dendrogram(linked)
plt.title("Iris Dendrogram (Average Linkage)")
plt.show()

```

### **11. K-Means with Decision Boundaries**

```python
import numpy as np

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=42)
kmeans = KMeans(n_clusters=3).fit(X)

h = .02 # Mesh step size
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, edgecolor='k')
plt.title("K-Means Decision Boundaries")
plt.show()

```

### **12. t-SNE + DBSCAN (Digits Dataset)**

```python
from sklearn.manifold import TSNE

digits = load_digits()
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(digits.data)
dbscan = DBSCAN(eps=3, min_samples=10).fit(X_tsne)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=dbscan.labels_, cmap='tab10', s=5)
plt.title("t-SNE + DBSCAN (Digits)")
plt.show()

```

### **13. Agglomerative Clustering (Complete Linkage)**

```python
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
agg = AgglomerativeClustering(n_clusters=4, linkage='complete')
plt.scatter(X[:, 0], X[:, 1], c=agg.fit_predict(X), cmap='viridis')
plt.show()

```

### **14. Elbow Method: Inertia for K=2 to 6**

```python
cancer = load_breast_cancer()
inertia = []
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=42).fit(cancer.data)
    inertia.append(km.inertia_)


plt.plot(range(2, 7), inertia, marker='o')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.show()

```

### **15. Agglomerative with Single Linkage (Circles)**

```python
X, _ = make_circles(n_samples=500, factor=0.5, noise=0.05)
agg = AgglomerativeClustering(n_clusters=2, linkage='single')
plt.scatter(X[:, 0], X[:, 1], c=agg.fit_predict(X))
plt.title("Agglomerative (Single Linkage) on Circles")
plt.show()

```

---

### **16–20: Quick Highlights**

* **Wine + DBSCAN:** Always use `StandardScaler` first. To count clusters: `len(set(labels)) - (1 if -1 in labels else 0)`.
* **Iris Noise:** DBSCAN with `eps=0.5` on raw Iris data usually results in several noise points (`-1`).
* **Non-Linear Moons:** K-Means will fail (splitting the moons vertically or horizontally), while DBSCAN will succeed.
* **3D Digits:** Use `PCA(n_components=3)` and `ax = fig.add_subplot(111, projection='3d')`.

---

### **21. Java + DSA (Quick Summary)**

If you are implementing these in Java, you would typically use libraries like **Weka** or **Apache Commons Math**. From a data structures and algorithms (DSA) perspective, the usual time complexities are:

* **K-Means Complexity:** $O(n \cdot k \cdot i \cdot d)$ (n=points, k=clusters, i=iterations, d=dimensions).
* **Hierarchical Complexity:** $O(n^3)$ for basic implementations, can be optimized to $O(n^2 \log n)$.

---

### **22. Silhouette Score Evaluation (5 centers)**

```python
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)
km = KMeans(n_clusters=5).fit(X)
print(f"Silhouette Score for K=5: {silhouette_score(X, km.labels_):.3f}")

```

### **23. PCA + Agglomerative (Breast Cancer)**

```python
X_pca = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(cancer.data))
agg = AgglomerativeClustering(n_clusters=2).fit_predict(X_pca)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=agg)
plt.show()

```




### **24. K-Means vs. DBSCAN Side-by-Side (Noisy Circles)**

This comparison highlights how K-Means fails at non-linear geometry while DBSCAN excels.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler

# Generate data
X, _ = make_circles(n_samples=500, factor=0.5, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)

# Apply Algorithms
km = KMeans(n_clusters=2, random_state=42).fit_predict(X)
db = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(X[:, 0], X[:, 1], c=km, cmap='viridis')
ax1.set_title("K-Means (Failed Structure)")
ax2.scatter(X[:, 0], X[:, 1], c=db, cmap='plasma')
ax2.set_title("DBSCAN (Successful Structure)")
plt.show()

```

### **25. Silhouette Coefficient Plot per Sample (Iris)**

Unlike a single score, this plot shows how well each individual sample fits its cluster.

```python
import numpy as np
import matplotlib.cm as cm
from sklearn.metrics import silhouette_samples, silhouette_score

iris = load_iris()
X = iris.data
n_clusters = 3
km = KMeans(n_clusters=n_clusters, random_state=42)
labels = km.fit_predict(X)

score = silhouette_score(X, labels)
sample_values = silhouette_samples(X, labels)

y_lower = 10
for i in range(n_clusters):
    ith_values = sample_values[labels == i]
    ith_values.sort()
    size_cluster_i = ith_values.shape[0]
    y_upper = y_lower + size_cluster_i
    color = cm.nipy_spectral(float(i) / n_clusters)
    plt.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_values, facecolor=color)
    y_lower = y_upper + 10


plt.title("Silhouette Plot for Iris Clusters")
plt.xlabel("Silhouette Coefficient")
plt.show()

```

### **26. Agglomerative Clustering with 'Average' Linkage**

Average linkage is often more robust than single linkage as it considers all pairs between clusters.

```python
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
agg = AgglomerativeClustering(n_clusters=3, linkage='average')
labels = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='rainbow')
plt.title("Agglomerative Clustering: Average Linkage")
plt.show()

```

### **27. Wine Dataset: Seaborn Pairplot of Assignments**

This is a great way to visualize how clusters separate across multiple dimensions.

```python
import seaborn as sns
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
km = KMeans(n_clusters=3, random_state=42).fit(X_scaled)

# Create DataFrame with first 4 features and labels
df = pd.DataFrame(wine.data[:, :4], columns=wine.feature_names[:4])
df['cluster'] = km.labels_

sns.pairplot(df, hue='cluster', palette='bright')
plt.show()

```

### **28. Noisy Blobs: DBSCAN Identification and Counts**

DBSCAN categorizes noise as `-1`. Here is how to programmatically count them.

```python
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=1.5, random_state=42)
db = DBSCAN(eps=0.8, min_samples=7).fit(X)
labels = db.labels_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)

print(f"Number of clusters found: {n_clusters}")
print(f"Number of noise points: {n_noise}")

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10')
plt.title(f"DBSCAN: {n_clusters} Clusters, {n_noise} Noise Points")
plt.show()

```

### **29. t-SNE + Agglomerative Clustering (Digits)**

Since Digits has 64 features, t-SNE helps visualize the high-dimensional clusters in 2D space.

```python
from sklearn.manifold import TSNE

digits = load_digits()
# Reduce 64D to 2D
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(digits.data)

# Cluster the reduced data
agg = AgglomerativeClustering(n_clusters=10)
labels = agg.fit_predict(X_tsne)

plt.figure(figsize=(10, 8))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='nipy_spectral', s=10)
plt.colorbar()
plt.title("t-SNE Visualization of Agglomerative Clusters (Digits)")
plt.show()

```

---





### **30. Iris Hierarchical Clustering & Dendrogram**

Hierarchical clustering creates a tree of relationships. The dendrogram visualizes the distance at which clusters merge.

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

iris = load_iris()
linked = linkage(iris.data, method='average')
plt.figure(figsize=(10, 5))
dendrogram(linked)
plt.title("Iris Dendrogram (Average Linkage)")
plt.show()

```

### **31. Overlapping Blobs & Decision Boundaries**

K-Means partitions space into Voronoi cells. Even with overlapping data, the boundaries between clusters remain piecewise linear.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=2.5, random_state=42)
km = KMeans(n_clusters=3).fit(X)

h = .02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = km.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=km.labels_, s=10, edgecolor='k')
plt.title("K-Means Decision Boundaries (Overlapping Blobs)")
plt.show()

```

### **32. t-SNE & DBSCAN (Digits Dataset)**

t-SNE reduces the 64-dimensional Digits data to 2D, allowing DBSCAN to find clusters based on local density.

```python
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_digits

digits = load_digits()
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(digits.data)
db = DBSCAN(eps=4, min_samples=5).fit(X_tsne)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=db.labels_, cmap='tab10', s=5)
plt.title("DBSCAN on t-SNE Reduced Digits")
plt.show()

```

### **33. Agglomerative Clustering (Complete Linkage)**

Complete linkage merges clusters based on the maximum distance between points in each cluster.

```python
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
agg = AgglomerativeClustering(n_clusters=4, linkage='complete')
plt.scatter(X[:, 0], X[:, 1], c=agg.fit_predict(X), cmap='viridis')
plt.title("Complete Linkage Clustering")
plt.show()

```

### **34. Elbow Method (Breast Cancer Dataset)**

This plot shows the drop in inertia (SSE) as K increases. The "elbow" suggests the best K.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
inertia = [KMeans(n_clusters=k).fit(data.data).inertia_ for k in range(2, 7)]
plt.plot(range(2, 7), inertia, marker='o')
plt.title("Elbow Method (Breast Cancer)")
plt.show()

```

### **35. Concentric Circles (Single Linkage)**

Single linkage can follow thin paths of high density, allowing it to correctly cluster nested circles.

```python
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=500, factor=0.5, noise=0.05)
agg = AgglomerativeClustering(n_clusters=2, linkage='single')
plt.scatter(X[:, 0], X[:, 1], c=agg.fit_predict(X))
plt.title("Single Linkage on Concentric Circles")
plt.show()

```

### **36. Wine Dataset: Scaled DBSCAN**

Standardization is critical for distance-based algorithms like DBSCAN.

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
db = DBSCAN(eps=2.5, min_samples=5).fit(X_scaled)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(f"Number of clusters (excluding noise): {n_clusters}")

```

### **37. Visualizing Cluster Centers**

The `cluster_centers_` attribute provides the coordinates of the centroids calculated by K-Means.

```python
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=km.labels_, alpha=0.5)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], s=200, c='red', marker='X', label='Centroids')
plt.legend()
plt.show()

```

### **38. Iris Noise Detection with DBSCAN**

Points labeled as `-1` are considered outliers by the DBSCAN algorithm.

```python
iris = load_iris()
db = DBSCAN(eps=0.5, min_samples=5).fit(iris.data)
noise_count = list(db.labels_).count(-1)
print(f"Number of noise samples in Iris: {noise_count}")

```

### **39. Non-Linear Moons (K-Means Failure)**

K-Means assumes clusters are spherical and will fail to correctly cluster "moon" shapes.

```python
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
km = KMeans(n_clusters=2).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=km)
plt.title("K-Means failure on non-linear data")
plt.show()

```

### **40. 3D PCA Visualization (Digits)**

Visualizing the first three principal components provides a spatial view of how clusters separate.

```python
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D

digits = load_digits()
X_pca = PCA(n_components=3).fit_transform(digits.data)
km = KMeans(n_clusters=10).fit_predict(X_pca)

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=km, s=5)
plt.title("3D PCA Clustering (Digits)")
plt.show()

```

### **41. Silhouette Score (5 Centers)**

The Silhouette Score evaluates how similar an object is to its own cluster compared to others.

```python
from sklearn.metrics import silhouette_score
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)
km = KMeans(n_clusters=5).fit(X)
print(f"Silhouette Score: {silhouette_score(X, km.labels_):.3f}")

```

### **42. PCA + Agglomerative (Breast Cancer)**

Dimensionality reduction before clustering helps in visualizing the high-dimensional cancer data.

```python
X_pca = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(data.data))
agg = AgglomerativeClustering(n_clusters=2).fit_predict(X_pca)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=agg, cmap='coolwarm')
plt.title("Agglomerative Clustering on PCA Components")
plt.show()

```

### **43. Silhouette Plot (Iris)**

A per-sample visualization of the Silhouette Coefficient.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

X = load_iris().data
km = KMeans(n_clusters=3, random_state=42)
labels = km.fit_predict(X)
sample_values = silhouette_samples(X, labels)

# Stack each cluster's sorted silhouette values in its own band so they do not overlap
y_lower = 10
for i in range(3):
    vals = np.sort(sample_values[labels == i])
    y_upper = y_lower + len(vals)
    plt.fill_betweenx(np.arange(y_lower, y_upper), 0, vals)
    y_lower = y_upper + 10  # leave a gap before the next cluster
plt.title("Silhouette Coefficient per Sample")
plt.show()

```

### **44. Wine Pairplot (Seaborn)**

Visualizing cluster assignments across multiple features simultaneously.

```python
import seaborn as sns
import pandas as pd
df = pd.DataFrame(wine.data[:, :4], columns=wine.feature_names[:4])
df['Cluster'] = KMeans(n_clusters=3).fit_predict(StandardScaler().fit_transform(wine.data))
sns.pairplot(df, hue='Cluster')
plt.show()

```

### **45. t-SNE + Agglomerative (Digits)**

Combining non-linear dimensionality reduction with hierarchical grouping.

```python
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(digits.data)
agg = AgglomerativeClustering(n_clusters=10).fit_predict(X_tsne)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=agg, cmap='nipy_spectral', s=5)
plt.title("Agglomerative Clustering on t-SNE")
plt.show()

```





### **46. Side-by-Side: K-Means vs. DBSCAN (Noisy Circles)**

This exercise demonstrates why density-based clustering is superior for non-linear structures. K-Means attempts to split the circles into two halves, while DBSCAN identifies the inner and outer rings.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler

# Generate noisy circular data
X, _ = make_circles(n_samples=500, factor=0.5, noise=0.08, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Apply Algorithms
km_labels = KMeans(n_clusters=2, random_state=42).fit_predict(X_scaled)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# Plot Side-by-Side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(X_scaled[:, 0], X_scaled[:, 1], c=km_labels, cmap='viridis')
ax1.set_title("K-Means Clustering (Incorrect)")

ax2.scatter(X_scaled[:, 0], X_scaled[:, 1], c=db_labels, cmap='plasma')
ax2.set_title("DBSCAN Clustering (Correct)")
plt.show()

```

### **47. Agglomerative Clustering: 'Average' Linkage (Blobs)**

Average linkage calculates the distance between two clusters as the average distance between every point in one cluster and every point in the other. It is generally more robust than 'single' linkage.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
agg = AgglomerativeClustering(n_clusters=3, linkage='average')
labels = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='rainbow')
plt.title("Agglomerative Clustering: Average Linkage")
plt.show()

```

### **48. DBSCAN: Identifying Clusters and Noise Points**

In DBSCAN, any point not belonging to a cluster is assigned the label `-1`. This code counts both the distinct groups found and the outliers.

```python
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=1.5, random_state=42)
db = DBSCAN(eps=0.8, min_samples=7).fit(X)
labels = db.labels_

# Identify number of clusters (ignoring noise) and noise points
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print(f"Number of clusters: {n_clusters_}")
print(f"Number of noise points: {n_noise_}")

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10')
plt.title(f"DBSCAN: {n_clusters_} Clusters and {n_noise_} Noise Points")
plt.show()

```

### **49. t-SNE + Agglomerative Clustering (Digits)**

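For complex datasets like Digits, hierarchical clustering often yields cleaner, easier-to-inspect groups in a reduced 2D space. Keep in mind that t-SNE mainly preserves local neighborhoods, so distances between far-apart groups in the embedding should not be over-interpreted.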

```python
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_digits

digits = load_digits()
# Reduce 64D to 2D using t-SNE
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(digits.data)

# Apply Agglomerative Clustering on the 2D representation
agg = AgglomerativeClustering(n_clusters=10)
labels = agg.fit_predict(X_tsne)

plt.figure(figsize=(10, 8))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='nipy_spectral', s=10)
plt.colorbar(label='Digit Cluster')
plt.title("Digits: t-SNE Reduction with Agglomerative Clustering")
plt.show()

```

---

