#Theory questions



**1. What is unsupervised learning in the context of machine learning?**
Unsupervised learning is a type of machine learning where the model is given input data without labeled responses. The goal is to discover hidden patterns, structures, or groupings in the data, with clustering being one of the primary tasks.

**2. How does K-Means clustering algorithm work?**
K-Means initializes `k` centroids, assigns each data point to its nearest centroid, and then recomputes each centroid as the mean of the points assigned to it. The assignment and update steps repeat until the assignments stop changing or a maximum number of iterations is reached.
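
As a rough illustration of the loop described above, here is a minimal NumPy sketch of the assign-and-update iteration. The function name `kmeans_lloyd`, the random-sample initialization, and the toy data are assumptions made for this example; scikit-learn's `KMeans` is more elaborate.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=100, seed=0):
    """Hypothetical helper: assign points to the nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: sample k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: mean of each cluster (keep the old centroid if a cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated groups of 2-D points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_lloyd(X, k=2)
print(labels)
```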

**3. Explain the concept of a dendrogram in hierarchical clustering.**
A dendrogram is a tree-like diagram used to visualize the merging (or splitting) of clusters in hierarchical clustering. It helps determine the number of clusters by cutting the tree at a desired height.
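
One way to see the "cut at a height" idea in code is SciPy's `fcluster`, which cuts a linkage tree at a distance threshold. The toy points and the threshold `t=5.0` are assumptions chosen for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two tight groups that sit far apart
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Build the merge tree that a dendrogram would draw
Z = linkage(X, method='average')

# Cut the tree at height 5.0: merges above this distance are not applied
labels = fcluster(Z, t=5.0, criterion='distance')
n_clusters = len(set(labels.tolist()))
print(labels, n_clusters)
```

Raising the threshold above the last merge height would return a single cluster, while lowering it far enough would leave every point on its own.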

**4. What is the main difference between K-Means and Hierarchical Clustering?**
K-Means is a partitional algorithm that requires the number of clusters as input and assigns data points based on distance to centroids. Hierarchical clustering builds a hierarchy of clusters and does not need the number of clusters to be specified upfront.

**5. What are the advantages of DBSCAN over K-Means?**
DBSCAN can detect clusters of arbitrary shapes, handles noise well, and doesn’t require the number of clusters to be specified beforehand, unlike K-Means which assumes spherical clusters and needs `k` as input.

**6. When would you use Silhouette Score in clustering?**
Silhouette Score is used to measure how well a point fits within its cluster compared to other clusters. It helps in evaluating and selecting the optimal number of clusters, especially when ground truth is not available.

**7. What are the limitations of Hierarchical Clustering?**
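Hierarchical clustering scales poorly to large datasets: standard agglomerative algorithms typically need O(n²) memory for the pairwise distance matrix and between O(n²) and O(n³) time, depending on the linkage and implementation. It is also sensitive to noise and outliers, and it is greedy: once a merge or split has been made, it cannot be undone.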

**8. Why is feature scaling important in clustering algorithms like K-Means?**
Since K-Means relies on distance calculations, unscaled features can distort cluster formation. Feature scaling ensures all features contribute equally, preventing dominant features from skewing results.
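
A small sketch of the effect, assuming two hypothetical features on very different scales (income in dollars, age in years). Before scaling, income alone decides which points are "close"; after `StandardScaler`, both features carry comparable weight.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical records: [annual income in dollars, age in years]
X = np.array([[30_000.0, 25.0],
              [31_000.0, 60.0],
              [60_000.0, 26.0]])

def dist(A, i, j):
    """Euclidean distance between rows i and j of A."""
    return np.linalg.norm(A[i] - A[j])

# Raw features: the income column dominates every distance
raw_ratio = dist(X, 0, 2) / dist(X, 0, 1)

# Standardized features: each column has zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = dist(X_scaled, 0, 2) / dist(X_scaled, 0, 1)

print(f"raw ratio: {raw_ratio:.1f}, scaled ratio: {scaled_ratio:.2f}")
```

In the raw data, the 35-year age gap between the first two records is almost invisible next to a 1,000-dollar income gap, which is the distortion the answer above describes.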

**9. How does DBSCAN identify noise points?**
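DBSCAN labels a point as noise if it is not a core point (it has fewer than `min_samples` points within its `eps` radius) and it does not lie within `eps` of any core point. Non-core points that do lie within `eps` of a core point are border points and join that cluster. Noise points are left unassigned; scikit-learn gives them the label `-1`.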
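
The sketch below shows how this looks with scikit-learn's `DBSCAN`, which marks noise with the label `-1`. The toy coordinates and parameter values are assumptions for illustration. The fifth point is not dense enough to be a core point but lies within `eps` of one (often called a border point), so it joins the cluster; the last point is isolated and becomes noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense group
              [0.35, 0.0],                                      # near the edge of the group
              [5.0, 5.0]])                                      # isolated point

# min_samples counts the point itself in scikit-learn
db = DBSCAN(eps=0.3, min_samples=4).fit(X)
labels = db.labels_

noise_mask = labels == -1
print("labels:", labels)
print("core points:", db.core_sample_indices_)
print("noise points:", np.flatnonzero(noise_mask))
```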

**10. Define inertia in the context of K-Means.**
Inertia is the sum of squared distances between each point and its assigned cluster centroid. It measures the compactness of clusters and is used to evaluate the clustering performance.
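
As a quick worked example of that definition, the NumPy snippet below computes inertia by hand for four points and a fixed assignment. The data and labels are made up for illustration.

```python
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])

# Centroid of each cluster is the mean of its members: (1, 0) and (11, 0)
centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])

# Inertia: squared distance from each point to its own centroid, summed over all points
inertia = ((X - centroids[labels]) ** 2).sum()
print(inertia)  # each point is 1 unit from its centroid, so 1 + 1 + 1 + 1 = 4.0
```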

**11. What is the elbow method in K-Means clustering?**
The elbow method plots inertia against different values of `k`. Inertia always decreases as `k` grows, so the “elbow”, the point after which adding clusters yields only small reductions, is taken as a reasonable choice for the number of clusters.

**12. Describe the concept of "density" in DBSCAN.**
Density in DBSCAN refers to the number of data points within a given radius (`eps`) of a point. A high-density area forms a cluster, while low-density regions are treated as noise or boundaries.

**13. Can hierarchical clustering be used on categorical data?**
Yes, hierarchical clustering can be applied to categorical data, but it requires appropriate distance metrics like Hamming distance or Gower distance, as Euclidean distance is unsuitable.
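
A minimal sketch of this with SciPy, assuming the categorical attributes have already been integer-encoded (the four records below are hypothetical). `pdist` with `metric='hamming'` returns the fraction of attributes on which two records differ, and `linkage` accepts that condensed distance matrix directly.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical integer-encoded categorical records (3 attributes each)
X = np.array([[0, 1, 2],
              [0, 1, 1],
              [3, 0, 0],
              [3, 0, 2]])

# Hamming distance: fraction of mismatching attributes between two records
D = pdist(X, metric='hamming')

# Average linkage on the precomputed distances, then ask for 2 flat clusters
Z = linkage(D, method='average')
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```

Ward linkage is left out on purpose here, since it is defined in terms of Euclidean geometry.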

**14. What does a negative Silhouette Score indicate?**
A negative silhouette value means that a sample is, on average, closer to the points of a neighboring cluster than to the points of its own cluster. This suggests the sample may be assigned to the wrong cluster; a negative average score indicates poor clustering structure overall.
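
To make this concrete, the sketch below computes the silhouette value `s = (b - a) / max(a, b)` for a single sample, where `a` is the mean distance to the other members of its own cluster and `b` is the smallest mean distance to any other cluster. The helper name `silhouette_of` and the deliberately wrong labeling are assumptions for illustration; the helper also assumes every cluster has at least two members.

```python
import numpy as np

def silhouette_of(i, X, labels):
    """Hypothetical helper: silhouette value of sample i using Euclidean distance."""
    d = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    # a: mean distance to the other members of the sample's own cluster
    a = d[own].sum() / (own.sum() - 1)
    # b: smallest mean distance to the members of any other cluster
    b = min(d[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
# The point at 2.0 is deliberately placed in the far cluster
labels = np.array([0, 0, 1, 1, 1])

s_bad = silhouette_of(2, X, labels)   # a = 8.5, b = 1.5, so s is negative
s_good = silhouette_of(0, X, labels)  # well-placed point, s is positive
print(round(s_bad, 3), round(s_good, 3))
```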

**15. Explain the term "linkage criteria" in hierarchical clustering.**
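Linkage criteria determine how the distance between two clusters is calculated. Common methods include single linkage (minimum pairwise distance), complete linkage (maximum pairwise distance), average linkage (mean pairwise distance), and Ward linkage (the merge that least increases within-cluster variance).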
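
The three pairwise-distance criteria can be computed directly for two small clusters. The points below are made up so that the arithmetic is easy to check by hand.

```python
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0]])   # cluster A
B = np.array([[3.0, 0.0], [5.0, 0.0]])   # cluster B

# All pairwise distances between a point in A and a point in B: [[3, 5], [2, 4]]
pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = pair.min()     # closest pair: 2.0
complete = pair.max()   # farthest pair: 5.0
average = pair.mean()   # mean of all pairs: 3.5
print(single, complete, average)
```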

**16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?**
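K-Means minimizes squared Euclidean distance to centroids, which implicitly favors roughly spherical clusters of similar size and density. When clusters differ significantly in size or density, it can split a large or sparse cluster and merge parts of it into neighboring ones, leading to poor clustering results.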

**17. What are the core parameters in DBSCAN, and how do they influence clustering?**
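DBSCAN uses two main parameters: `eps` (the neighborhood radius) and `min_samples` (the minimum number of points required within that radius for a point to count as a core point). Together they define what counts as a dense region. A larger `eps` or a smaller `min_samples` tends to produce fewer but larger clusters, while a smaller `eps` or a larger `min_samples` labels more points as noise.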
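
The effect of `eps` can be seen on a tiny 1-D dataset. The points, the `eps` values, and the helper name `count_clusters` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two groups of three points, spaced 0.5 apart, with a gap of 2.0 between the groups
X = np.array([[0.0], [0.5], [1.0], [3.0], [3.5], [4.0]])

def count_clusters(eps, min_samples=2):
    """Hypothetical helper: number of DBSCAN clusters, excluding noise (-1)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return len(set(labels.tolist()) - {-1})

# Small eps: no point has a neighbor, so everything is noise.
# Medium eps: each group becomes its own cluster.
# Large eps: the gap is bridged and the groups merge into one cluster.
counts = {eps: count_clusters(eps) for eps in (0.3, 0.6, 2.5)}
print(counts)
```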

**18. How does K-Means++ improve upon standard K-Means initialization?**
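K-Means++ picks the first centroid uniformly at random from the data and each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen. This spreads the initial centroids out, which reduces the chance of a poor initialization and typically improves both convergence speed and the quality of the final clustering.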
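
A minimal NumPy sketch of this seeding idea is shown below. The function name `kmeans_pp_init` is hypothetical, and this is the basic sampling scheme only; scikit-learn's `init='k-means++'` adds refinements that are not reproduced here.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Hypothetical helper: spread-out initial centroids via squared-distance sampling."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # Far-away points are proportionally more likely to be picked next;
        # already-chosen points have distance 0 and therefore probability 0
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
C = kmeans_pp_init(X, k=3)
print(C)
```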

**19. What is agglomerative clustering?**
Agglomerative clustering is a type of hierarchical clustering that builds clusters in a bottom-up approach, starting with individual points and merging the closest pairs step-by-step until all points form one cluster.

**20. What makes Silhouette Score a better metric than just inertia for model evaluation?**
While inertia measures only compactness, Silhouette Score considers both cohesion (intra-cluster similarity) and separation (inter-cluster dissimilarity), offering a more holistic view of clustering quality.
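
The difference shows up when both metrics are computed over a range of `k`. In the sketch below, which assumes three well-separated synthetic blobs with hand-picked centers, inertia keeps shrinking as `k` grows and so cannot simply be minimized, while the silhouette score peaks at the true number of clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumed toy data: three blobs with explicit, widely spaced centers
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)

# Pick the k with the highest silhouette score
best_k = max(silhouettes, key=silhouettes.get)
for k in range(2, 7):
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))
print("best k by silhouette:", best_k)
```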




#Practical assignment

# Practical Assignment: Clustering Algorithms using Synthetic and Real Datasets

# Imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs, make_moons, make_circles, load_iris, load_wine, load_digits, load_breast_cancer
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from scipy.cluster.hierarchy import dendrogram, linkage

# 1. Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a scatter plot
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4)
labels = kmeans.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title("K-Means Clustering (4 centers)")
plt.show()

# 2. Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels
iris = load_iris()
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(iris.data)
print("First 10 Agglomerative Clustering Labels on Iris:", labels[:10])

# 3. Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot
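X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_
noise = labels == -1  # DBSCAN marks noise points with the label -1
plt.scatter(X[~noise, 0], X[~noise, 1], c=labels[~noise], cmap='Spectral')
# Draw noise points as black crosses so outliers stand out from the clusters
plt.scatter(X[noise, 0], X[noise, 1], c='black', marker='x', label='Noise')
plt.legend()
plt.title("DBSCAN on make_moons")
plt.show()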

# 4. Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster
wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
labels = KMeans(n_clusters=3, random_state=42).fit_predict(X_scaled)
unique, counts = np.unique(labels, return_counts=True)
print("Cluster Sizes in Wine Dataset:", dict(zip(unique, counts)))

# 5. Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result
X, _ = make_circles(n_samples=300, noise=0.05, factor=0.5)
labels = DBSCAN(eps=0.2).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='plasma')
plt.title("DBSCAN on make_circles")
plt.show()

# 6. Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster centroids
data = load_breast_cancer()
X_scaled = MinMaxScaler().fit_transform(data.data)
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X_scaled)
print("KMeans Cluster Centroids on Breast Cancer:\n", kmeans.cluster_centers_)

# 7. Generate synthetic data using make_blobs with varying cluster std devs and cluster with DBSCAN
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=42)
labels = DBSCAN(eps=1.5).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10')
plt.title("DBSCAN with Varying Cluster Standard Deviations")
plt.show()

# 8. Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means
digits = load_digits()
X_pca = PCA(n_components=2).fit_transform(digits.data)
labels = KMeans(n_clusters=10, random_state=42).fit_predict(X_pca)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='nipy_spectral')
plt.title("KMeans Clustering on Digits PCA")
plt.show()

# 9. Create synthetic data using make_blobs and evaluate silhouette scores for k = 2 to 5. Display as a bar chart
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
scores = [silhouette_score(X, KMeans(n_clusters=k).fit_predict(X)) for k in range(2, 6)]
plt.bar(range(2, 6), scores)
plt.title("Silhouette Scores for K = 2 to 5")
plt.xlabel("k")
plt.ylabel("Silhouette Score")
plt.show()

# 10. Load the Iris dataset and use hierarchical clustering. Plot a dendrogram with average linkage
linked = linkage(iris.data, method='average')
plt.figure(figsize=(10, 5))
dendrogram(linked, labels=iris.target)
plt.title("Hierarchical Clustering - Average Linkage")
plt.show()

# 11. Generate overlapping clusters using make_blobs, apply KMeans and visualize with decision boundaries
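X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.5, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42).fit(X)
labels = kmeans.predict(X)
# Predict on a dense grid so each centroid's decision region can be shaded
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.2, cmap='coolwarm')
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='coolwarm', edgecolor='k')
plt.title("KMeans with Overlapping Clusters")
plt.show()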

# 12. Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(digits.data)
labels = DBSCAN(eps=5).fit_predict(X_tsne)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='Spectral')
plt.title("DBSCAN on Digits (t-SNE)")
plt.show()

# 13. Generate synthetic data and apply Agglomerative Clustering with complete linkage. Plot the result
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = AgglomerativeClustering(linkage='complete', n_clusters=3).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='Accent')
plt.title("Agglomerative Clustering (Complete Linkage)")
plt.show()

# 14. Load the Breast Cancer dataset and compare inertia values for K = 2 to 6 using KMeans
X_scaled = StandardScaler().fit_transform(data.data)
inertias = [KMeans(n_clusters=k).fit(X_scaled).inertia_ for k in range(2, 7)]
plt.plot(range(2, 7), inertias, marker='o')
plt.title("KMeans Inertia on Breast Cancer")
plt.xlabel("k")
plt.ylabel("Inertia")
plt.show()

# 15. Generate synthetic concentric circles and cluster using Agglomerative Clustering with single linkage
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05)
labels = AgglomerativeClustering(linkage='single', n_clusters=2).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='Set2')
plt.title("Agglomerative Clustering on make_circles")
plt.show()

# 16. Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise)
X_scaled = StandardScaler().fit_transform(wine.data)
labels = DBSCAN(eps=1.5).fit_predict(X_scaled)
clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("DBSCAN Clusters on Wine (excluding noise):", clusters)

# 17. Generate data and apply KMeans. Plot cluster centers on data points
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='Paired')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker='X')
plt.title("KMeans Centers on make_blobs")
plt.show()

# 18. Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise
X = StandardScaler().fit_transform(iris.data)
labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(X)
noise_points = list(labels).count(-1)
print("DBSCAN Noise Samples in Iris:", noise_points)

# 19. Generate make_moons, apply KMeans, and visualize clustering
X, _ = make_moons(n_samples=300, noise=0.1)
labels = KMeans(n_clusters=2).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='spring')
plt.title("KMeans on Non-Linearly Separable Data (make_moons)")
plt.show()

# 20. Load Digits, PCA (3D), KMeans, and visualize
X_pca = PCA(n_components=3).fit_transform(digits.data)
labels = KMeans(n_clusters=10).fit_predict(X_pca)
fig = plt.figure(figsize=(8, 5))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=labels, cmap='Spectral')
ax.set_title("3D Clustering of Digits via PCA + KMeans")
plt.show()

# 21. Generate blobs with 5 centers, apply KMeans, and evaluate using silhouette_score
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)
labels = KMeans(n_clusters=5).fit_predict(X)
score = silhouette_score(X, labels)
print("Silhouette Score (5-center Blobs):", score)

# 22. Breast Cancer dataset, PCA, Agglomerative Clustering, 2D visualization
X_pca = PCA(n_components=2).fit_transform(data.data)
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X_pca)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='cool')
plt.title("Breast Cancer - PCA + Agglomerative")
plt.show()

# 23. make_circles clustering: KMeans vs DBSCAN side-by-side
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X[:, 0], X[:, 1], c=KMeans(n_clusters=2).fit_predict(X))
axes[0].set_title("KMeans")
axes[1].scatter(X[:, 0], X[:, 1], c=DBSCAN(eps=0.2).fit_predict(X))
axes[1].set_title("DBSCAN")
plt.suptitle("make_circles: KMeans vs DBSCAN")
plt.show()

# 24. Load Iris dataset and plot Silhouette Coefficient for each sample after KMeans clustering
from sklearn.metrics import silhouette_samples
X = StandardScaler().fit_transform(iris.data)
labels = KMeans(n_clusters=3).fit_predict(X)
sample_silhouette_values = silhouette_samples(X, labels)
plt.bar(range(len(X)), sample_silhouette_values)
plt.title("Silhouette Coefficient per Sample (Iris)")
plt.show()

# 25. make_blobs + Agglomerative Clustering with 'average' linkage. Visualize clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = AgglomerativeClustering(n_clusters=3, linkage='average').fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='Dark2')
plt.title("Agglomerative Clustering (Average Linkage)")
plt.show()

# 26. Wine dataset + KMeans + seaborn pairplot (first 4 features)
import pandas as pd
df = pd.DataFrame(wine.data[:, :4], columns=wine.feature_names[:4])
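# Standardize first: raw Wine features differ by orders of magnitude and would skew the distances
df['cluster'] = KMeans(n_clusters=3, random_state=42).fit_predict(StandardScaler().fit_transform(wine.data))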
sns.pairplot(df, hue='cluster')
plt.suptitle("KMeans Cluster Assignment on Wine Data (First 4 Features)", y=1.02)
plt.show()

# 27. Generate noisy blobs, use DBSCAN to identify clusters and noise, print the count
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)
labels = DBSCAN(eps=1.2).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print("DBSCAN: Clusters =", n_clusters, ", Noise Points =", n_noise)

# 28. Digits dataset, reduce with t-SNE, apply Agglomerative Clustering, plot
X_tsne = TSNE(n_components=2).fit_transform(digits.data)
labels = AgglomerativeClustering(n_clusters=10).fit_predict(X_tsne)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='tab10')
plt.title("Digits: Agglomerative Clustering on t-SNE")
plt.show()
