**Clustering Assignment**

**Clustering Theory Questions: Answers**


1. What is unsupervised learning in the context of machine learning?
   
   -->

   Unsupervised learning is a branch of machine learning where the algorithm learns patterns and structures from unlabeled data. Unlike supervised learning, there are no predefined output labels or target variables. The goal is to discover inherent groupings, relationships, or representations within the data without human guidance on what the output should be. Common tasks include clustering, dimensionality reduction, and anomaly detection.

2. How does K-Means clustering algorithm work?
   
   -->

   K-Means clustering is an iterative, centroid-based clustering algorithm that aims to partition 'n' observations into 'k' clusters, where each observation belongs to the cluster with the nearest mean (centroid). Here's how it works:

Initialization: Randomly select 'k' data points from the dataset as initial cluster centroids.

Assignment Step (E-step): Each data point is assigned to the cluster whose centroid is closest to it (e.g., using Euclidean distance).

Update Step (M-step): The centroids of the clusters are re-calculated by taking the mean of all data points assigned to that cluster.

Iteration: The assignment and update steps are repeated until the cluster assignments no longer change (convergence) or a maximum number of iterations is reached.
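The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the toy data, and the choice to keep an old centroid when a cluster empties are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Minimal Lloyd-style K-Means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: squared Euclidean distance to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Convergence: stop once no point changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two well-separated groups of points as a toy input.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids_demo, labels_demo = kmeans_sketch(X_demo, k=2)
print(labels_demo)
```

On this toy input the first three points end up in one cluster and the last three in the other; which cluster receives the label 0 depends on the random initialization.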

3. Explain the concept of a dendrogram in hierarchical clustering.
  
   -->

   A dendrogram is a tree-like diagram that visually represents the sequence of merges or splits in hierarchical clustering.
   
   Agglomerative Clustering (bottom-up): A dendrogram for agglomerative clustering starts with each data point as its own cluster at the bottom. As the algorithm proceeds, individual points and then clusters are successively merged. The height of the 'U'-shaped link in the dendrogram indicates the dissimilarity (or distance) at which two clusters were merged. The longer the vertical line, the greater the dissimilarity between the merged clusters.
   
   Divisive Clustering (top-down): A dendrogram for divisive clustering starts with all data points in one large cluster at the top, which is then successively split into smaller clusters.
   
   Dendrograms help choose the number of clusters: a horizontal "cut" across the tree yields one cluster per branch it crosses, and a cut placed inside a large vertical gap (a long stretch of height with no merges) tends to produce well-separated groups.
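As a rough illustration of cutting a dendrogram, the sketch below builds the merge tree with SciPy and cuts it inside the largest gap. The toy data and the midpoint rule for the cut height are assumptions made for this example; `scipy.cluster.hierarchy.dendrogram(Z)` would draw the same tree.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data: two tight groups that are far apart.
X_demo = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [5.0, 5.0], [5.1, 5.2], [5.3, 5.1]])

# Each row of Z records one merge: (cluster a, cluster b, merge height, new size).
Z = linkage(X_demo, method="ward")

# The last merge joins the two groups, so its height is far larger than the rest.
merge_heights = Z[:, 2]
print(merge_heights)

# Cut the tree halfway between the last two merge heights (a choice for this toy data).
cut_height = (merge_heights[-2] + merge_heights[-1]) / 2
labels_demo = fcluster(Z, t=cut_height, criterion="distance")
print(labels_demo)
```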

4. What is the main difference between K-Means and Hierarchical Clustering?
   
   -->

   The main differences between K-Means and Hierarchical Clustering are:

   Approach:
   
   K-Means: A partitioning method that aims to partition data into a pre-defined number of 'k' clusters. It's iterative and optimizes an objective function (minimizing within-cluster sum of squares).
   
   Hierarchical Clustering: Builds a hierarchy of clusters. It can be agglomerative (bottom-up, merging clusters) or divisive (top-down, splitting clusters). It does not require 'k' to be specified beforehand, but 'k' is often chosen by cutting the dendrogram.

   Number of Clusters (k):

   K-Means: Requires the number of clusters 'k' to be specified before running the algorithm.

   Hierarchical Clustering: Does not explicitly require 'k' beforehand; it produces a hierarchy, and 'k' can be chosen afterward by cutting the dendrogram at a certain dissimilarity level.

   Output:

   K-Means: Produces a single set of 'k' clusters.

   Hierarchical Clustering: Produces a hierarchy (dendrogram) that shows how clusters are nested at various levels of granularity.

   Computational Cost:

   K-Means: Generally faster for large datasets (O(n⋅k⋅i⋅d) where i is iterations, d is dimensions).

   Hierarchical Clustering: Can be computationally more expensive, especially for agglomerative clustering ($O(n^2 \cdot d)$ to $O(n^3)$ depending on linkage, where d is dimensions), making it less suitable for very large datasets.

5. What are the advantages of DBSCAN over K-Means?
   
   -->

   DBSCAN (Density-Based Spatial Clustering of Applications with Noise) offers several advantages over K-Means:
   
   Handles Arbitrary Shapes: DBSCAN can discover clusters of arbitrary shapes (e.g., non-spherical, complex structures) because it's based on density, unlike K-Means which assumes spherical clusters.
   
   Identifies Noise/Outliers: DBSCAN can explicitly identify "noise" points (outliers) that do not belong to any cluster. K-Means forces every data point into a cluster, even outliers.
   
   Does Not Require 'k': DBSCAN does not require the number of clusters 'k' to be specified beforehand. It discovers the number of clusters based on the density parameters.
   
   Robust to Noise: By explicitly modeling noise, DBSCAN is more robust to the presence of outliers in the data.

6. When would you use Silhouette Score in clustering?
   
   -->

   The Silhouette Score (or Silhouette Coefficient) is used to evaluate the quality of clustering results, particularly when the true labels are unknown (unsupervised evaluation). You would use it:

To Determine Optimal 'k': When trying to find the best number of clusters (e.g., for K-Means or hierarchical clustering), you can calculate the Silhouette Score for different values of 'k' and choose the 'k' that yields the highest score.

To Compare Different Clustering Algorithms: When comparing the performance of different clustering algorithms on the same dataset, the algorithm that produces clusters with a higher Silhouette Score is generally considered better.

To Assess Cluster Cohesion and Separation: A high Silhouette Score indicates that objects are well-matched to their own cluster (high cohesion) and poorly matched to neighboring clusters (high separation).

Score range: −1 to 1.

Near +1: Data point is far away from the neighboring clusters.

Near 0: Data point is on or very close to the decision boundary between two clusters.

Near -1: Data point is likely assigned to the wrong cluster.
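A minimal sketch of using the score to compare values of 'k' with scikit-learn follows. The blob centers and the range of k are assumptions chosen so that the three groups are clearly separated.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with three clearly separated groups (centers chosen for illustration).
centers_demo = [[0, 0], [10, 10], [-10, 10]]
X_demo, _ = make_blobs(n_samples=300, centers=centers_demo,
                       cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):  # the score needs at least two clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("k with the highest silhouette score:", best_k)
```

On real data the peak is usually less pronounced, so the score is best read alongside domain knowledge rather than on its own.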

7. What are the limitations of Hierarchical Clustering?
   
   -->

   Despite its flexibility, Hierarchical Clustering has several limitations:

Computational Complexity: For large datasets, its time complexity can be very high, typically $O(n^2 \log n)$ or $O(n^3)$ depending on the linkage method, making it impractical for millions of data points.

Space Complexity: It requires storing the dissimilarity matrix ($O(n^2)$), which can be memory-intensive for large datasets.

Irrevocable Decisions: Once a merge or split is made, it cannot be undone. This can lead to suboptimal clusters if an early decision was poor.

Sensitivity to Noise/Outliers: Especially with certain linkage criteria (like single linkage), hierarchical clustering can be very sensitive to noise and outliers, leading to "chaining" effects where clusters are incorrectly merged due to single close points.

Difficulty with Non-Globular Shapes: Like K-Means, it can struggle with clusters that are non-globular or have varying densities, although some linkage methods can mitigate this.

8. Why is feature scaling important in clustering algorithms like K-Means?
   
   -->

   Feature scaling is crucial in clustering algorithms like K-Means because these algorithms typically use distance metrics (e.g., Euclidean distance) to determine the similarity or dissimilarity between data points.

If features have different scales (e.g., one feature ranges from 0-1000 and another from 0-1), the feature with the larger range will disproportionately influence the distance calculations. This means:

Dominance of Large-Scale Features: The clustering algorithm will implicitly give more weight to features with larger numerical ranges, even if they are not inherently more important.

Distorted Distances: The calculated distances between points will be heavily skewed by the feature with the largest values, leading to inaccurate groupings and suboptimal clusters.

Scaling ensures that all features contribute equally to the distance calculations, preventing features with larger numerical values from dominating the clustering process and leading to more meaningful and accurate clusters. Common scaling methods include Standardization (Z-score normalization) and Min-Max Scaling.
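The effect can be seen on three hypothetical points, where one feature spans hundreds of units and the other stays between 0 and 1. The values and the helper `dist` are assumptions made for this sketch.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: column 0 spans hundreds of units, column 1 spans 0-1.
X_demo = np.array([[100.0, 0.1],
                   [110.0, 0.9],
                   [900.0, 0.1]])

def dist(a, b):
    """Euclidean distance between two points."""
    return np.linalg.norm(a - b)

# Raw distances: the second column barely registers.
print(dist(X_demo[0], X_demo[1]), dist(X_demo[0], X_demo[2]))

X_scaled = StandardScaler().fit_transform(X_demo)
# After standardization each column has mean 0 and unit variance,
# so the 0.1 vs 0.9 gap in the second column now carries comparable weight.
print(dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2]))
```

Before scaling, point 0 looks far closer to point 1 than to point 2; after scaling, the two distances are of similar size.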

9. How does DBSCAN identify noise points?
   
   -->

   DBSCAN identifies noise points (outliers) based on its core concept of density reachability:

Core Points: A data point is a "core point" if there are at least min_samples (a parameter, minimum number of points) within its eps (epsilon, a radius) neighborhood.

Border Points: A data point is a "border point" if it is within the eps neighborhood of a core point but is not a core point itself (i.e., it has fewer than min_samples within its own eps neighborhood).

Noise Points (Outliers): Any data point that is neither a core point nor a border point is considered a noise point. These are points that lie in low-density regions and are too far from any core point to be part of a cluster.

By explicitly categorizing points as core, border, or noise, DBSCAN effectively separates outliers from dense clusters, which is a significant advantage over partition-based methods like K-Means.
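The three categories can be recovered from a fitted scikit-learn `DBSCAN` object, as in the sketch below. The toy data and the parameter values are assumptions chosen so that the isolated point is clearly noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two dense groups plus one isolated point (values are illustrative).
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
                   [20.0, 20.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X_demo)
labels = db.labels_

is_core = np.zeros(len(X_demo), dtype=bool)
is_core[db.core_sample_indices_] = True
is_noise = labels == -1            # scikit-learn marks noise with the label -1
is_border = ~is_core & ~is_noise   # in a cluster, but not dense enough to be core

print(labels)
print("core:", is_core.sum(), "border:", is_border.sum(), "noise:", is_noise.sum())
```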

10. Define inertia in the context of K-Means.
    
   -->

   Inertia (also known as "within-cluster sum of squares" or WCSS) in the context of K-Means clustering is a measure of how internally coherent clusters are. It is defined as:

$$\text{Inertia} = \sum_{i=0}^{n-1} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right)$$

Where:

$x_i$ is a data point.

$\mu_j$ is the centroid of cluster $j$.

$C$ is the set of all cluster centroids.

$\lVert x_i - \mu_j \rVert^2$ is the squared Euclidean distance between data point $x_i$ and the centroid $\mu_j$ of the cluster it is assigned to.

The goal of the K-Means algorithm is to minimize inertia. A lower inertia value generally indicates better clustering, as it means data points are closer to their respective cluster centroids. However, inertia always decreases as 'k' increases, so it cannot be used as the sole metric to determine the optimal number of clusters.
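As a check on the definition, inertia can be recomputed by hand from the fitted centroids and compared with scikit-learn's `inertia_` attribute. The synthetic data below is an assumption made for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.8, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Squared Euclidean distance from every point to every centroid: shape (n, k).
sq_dists = ((X_demo[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)

# Inertia: for each point take the distance to its nearest centroid, then sum.
inertia_manual = sq_dists.min(axis=1).sum()

print(inertia_manual, km.inertia_)
```

The two printed values should agree up to floating-point rounding.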

11. What is the elbow method in K-Means clustering?
    
   -->

   The elbow method is a heuristic used to determine an optimal number of clusters (k) for K-Means clustering. It involves:

Run K-Means for various 'k': Perform K-Means clustering for a range of k values (e.g., from 1 to 10).

Calculate Inertia: For each k, calculate the inertia (WCSS - Within-Cluster Sum of Squares).

Plot Inertia vs. 'k': Plot the inertia values on the y-axis against the number of clusters (k) on the x-axis.

Find the "Elbow": Look for an "elbow" point on the plot. This is the point where the rate of decrease in inertia sharply changes, forming an "elbow" shape. The k value at this elbow is often considered the optimal number of clusters, as adding more clusters beyond this point provides diminishing returns in terms of reducing within-cluster variance.
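The procedure can be sketched as a loop that records inertia for each k. The blob centers are assumptions chosen to give an obvious elbow at k = 3; plotting `k_values` against `inertias` with `plt.plot(k_values, inertias, marker="o")` would show the curve.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with three clearly separated groups (centers chosen for illustration).
centers_demo = [[0, 0], [10, 10], [-10, 10]]
X_demo, _ = make_blobs(n_samples=300, centers=centers_demo,
                       cluster_std=0.5, random_state=0)

k_values = list(range(1, 8))
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias.append(km.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Here inertia drops steeply up to k = 3 and only marginally afterwards. On real data the bend is often much softer, which is why the elbow is a heuristic rather than a rule.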

12. Describe the concept of "density" in DBSCAN.
    
   -->

   In DBSCAN, "density" is a fundamental concept used to define clusters and identify noise. It's not a global measure but is defined locally around each data point using two parameters:

ϵ (epsilon): This defines the maximum radius of the neighborhood around a data point. If eps is too small, many points might be labeled as noise; if too large, distinct clusters might merge.

MinPts (Minimum Points): This defines the minimum number of data points required within the $\epsilon$-neighborhood of a point for that point to be considered a "dense" region (a core point). If MinPts is too small, noise points might form small clusters; if too large, sparse clusters might be missed.

A cluster in DBSCAN is then defined as a dense region in the data space that is reachable from a core point through a chain of other core points. Points that are not part of any such dense region are classified as noise.

13. Can hierarchical clustering be used on categorical data?
    
   -->

   Yes, hierarchical clustering can be used on categorical data, but it requires appropriate handling and distance metrics. Standard distance metrics like Euclidean distance are not suitable for categorical data.

To use hierarchical clustering with categorical data:

Encoding: Categorical data usually needs to be converted into a numerical format.

One-Hot Encoding: Converts each category into a binary (0 or 1) feature. This can lead to very high-dimensional sparse data, which can be problematic for distance calculations and memory usage.

Ordinal Encoding: If there's an inherent order, categories can be mapped to integers.

Distance Metrics: Use distance metrics appropriate for categorical data:

Hamming Distance: For binary data (like after one-hot encoding), it counts the number of positions at which the corresponding symbols are different.

Gower Distance: A general distance metric that can handle mixed data types (numerical and categorical) by calculating a weighted average of individual attribute distances.

While possible, the choice of encoding and distance metric significantly impacts the quality of the clustering.
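One possible pipeline is sketched below: one-hot encode, compute Hamming distances, then run agglomerative clustering with SciPy. The records are hypothetical, and average linkage is an assumption made here because Ward linkage presumes Euclidean distances.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical categorical records.
df_demo = pd.DataFrame({
    "color": ["red", "red", "blue", "blue"],
    "size": ["S", "S", "L", "L"],
})

# One-hot encode, then compute pairwise Hamming distances
# (the fraction of encoded positions that differ).
encoded = pd.get_dummies(df_demo).to_numpy(dtype=bool)
distances = pdist(encoded, metric="hamming")

# Build the hierarchy from the condensed distance vector and ask for two clusters.
Z = linkage(distances, method="average")
labels_demo = fcluster(Z, t=2, criterion="maxclust")
print(labels_demo)
```

For mixed numerical and categorical columns, Gower distance would take the place of `pdist(..., metric="hamming")`; it is not part of SciPy, so it would need a separate implementation or package.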

14. What does a negative Silhouette Score indicate?
    
   -->

   A negative Silhouette Score indicates that a data point might be assigned to the wrong cluster. Specifically:

For a data point i, the Silhouette Score s(i) is calculated as:
$$s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$$

where:

a(i) is the average distance from i to all other points in the same cluster. (Measures cohesion)

b(i) is the minimum average distance from i to all points in a different cluster (the nearest neighboring cluster). (Measures separation)

If s(i) is negative (b(i)−a(i)<0, meaning b(i)<a(i)): This implies that the average distance from data point i to points in its own cluster (a(i)) is greater than the minimum average distance from i to points in a neighboring cluster (b(i)). In essence, the point is, on average, closer to points in a different cluster than to points in its assigned cluster. This suggests a poor or incorrect clustering assignment for that point.
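A small worked example makes the sign concrete. The 1-D points and the helper `silhouette_of` are assumptions made for this sketch; the point 10.0 is deliberately placed in the wrong cluster.

```python
import numpy as np

# 1-D toy data; the point 10.0 sits next to cluster 1 but is labeled 0.
points = np.array([0.0, 1.0, 10.0, 9.0, 11.0])
labels = np.array([0, 0, 0, 1, 1])

def silhouette_of(i, points, labels):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for one point."""
    dists = np.abs(points - points[i])
    same = labels == labels[i]
    same[i] = False  # a(i) excludes the point itself
    a = dists[same].mean()
    # b(i): smallest mean distance to the points of any other cluster.
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

s_misassigned = silhouette_of(2, points, labels)
print(round(s_misassigned, 3))  # a = 9.5, b = 1.0, so s = -8.5 / 9.5
```

For the misassigned point, a(i) = 9.5 and b(i) = 1.0, giving s(i) ≈ -0.895. The same function applied to the point 0.0 gives a positive value (a = 5.5, b = 10, s = 0.45).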

15. Explain the term "linkage criteria" in hierarchical clustering.
    
   -->

   In agglomerative hierarchical clustering, "linkage criteria" (also known as linkage methods or merging strategies) determine how the distance between two clusters is calculated when deciding which clusters to merge at each step. This distance is not just between individual points but between groups of points.

Common linkage criteria include:

Single Linkage: The distance between two clusters is the minimum distance between any single point in one cluster and any single point in the other cluster. Prone to "chaining" effect.

Complete Linkage: The distance between two clusters is the maximum distance between any single point in one cluster and any single point in the other cluster. Tends to produce more compact, spherical clusters.

Average Linkage: The distance between two clusters is the average distance between all pairs of points, where one point is from each cluster.

Ward's Linkage: Calculates the increase in the within-cluster sum of squares (variance) when two clusters are merged. It merges clusters that result in the minimum increase in total within-cluster variance. Tends to produce well-balanced clusters of similar size.

The choice of linkage criteria significantly influences the shape and structure of the resulting clusters and the appearance of the dendrogram.
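To make the definitions concrete, the sketch below computes each criterion by hand for two small toy clusters. The data is an assumption made for this example, and the Ward value is the raw increase in within-cluster sum of squares, which libraries may rescale before reporting it as a merge height.

```python
import numpy as np

# Two small toy clusters on a line.
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise Euclidean distances between a point in A and a point in B.
pairwise = np.linalg.norm(cluster_a[:, None, :] - cluster_b[None, :, :], axis=2)

single_link = pairwise.min()     # closest pair
complete_link = pairwise.max()   # farthest pair
average_link = pairwise.mean()   # mean over all pairs

# Ward: increase in within-cluster sum of squares caused by the merge.
n_a, n_b = len(cluster_a), len(cluster_b)
centroid_gap = np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))
ward_increase = (n_a * n_b) / (n_a + n_b) * centroid_gap ** 2

print(single_link, complete_link, average_link, ward_increase)
```

For these points the pairwise distances are 3, 5, 2 and 4, so single linkage gives 2, complete linkage gives 5, average linkage gives 3.5, and the Ward increase is 12.25.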

16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?
    
   -->

   K-Means clustering often performs poorly on data with varying cluster sizes or densities due to its fundamental assumptions and optimization objective:

Assumption of Spherical/Globular Clusters: K-Means defines clusters based on centroids and assumes that clusters are convex and isotropic (i.e., spherical or globular). When clusters are elongated, crescent-shaped, or have other arbitrary forms, K-Means struggles to correctly identify them.

Assumption of Similar Density: K-Means tries to minimize the within-cluster sum of squares (inertia). This objective implicitly pushes it to create clusters of roughly similar densities. If one cluster is very dense and another is sparse, K-Means might split the dense cluster or merge parts of the sparse one, creating suboptimal divisions.

Sensitivity to Centroid Initialization: For varying densities, a centroid might get "pulled" into a denser region even if it's not the true center of the intended cluster, leading to misassignments.

Fixed Number of Clusters (k): The need to pre-specify k means the algorithm doesn't adapt to the natural varying sizes/densities that might suggest a different number of inherent groups.

Algorithms like DBSCAN (density-based) or hierarchical clustering with appropriate linkage can be more suitable for such data.

17. What are the core parameters in DBSCAN, and how do they influence clustering?
    
   -->

   The two core parameters in DBSCAN are:

ϵ (epsilon or eps): This defines the maximum distance between two samples for one to be considered as in the neighborhood of the other. It determines the radius of the neighborhood to consider around each point.

Influence:

Small eps: Can lead to many points being labeled as noise and potentially splitting true clusters into smaller ones.

Large eps: Can cause multiple distinct clusters to merge into a single large cluster.

MinPts (Minimum Points or min_samples): This defines the minimum number of data points required within an $\epsilon$-neighborhood for a point to be considered a "core point" (i.e., part of a dense region).

Influence:

Small MinPts: Can lead to "noisy" clusters, where even small, sparse groupings are considered clusters.

Large MinPts: Can cause more points to be labeled as noise and potentially miss less dense but valid clusters.

Choosing optimal eps and MinPts is crucial and often involves domain knowledge, visual inspection (e.g., using a k-distance graph), or trial and error.
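The influence of eps can be observed by fixing min_samples and sweeping eps on a small synthetic dataset. The eps values below are illustrative assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_demo, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Vary eps with min_samples fixed; record (number of clusters, number of noise points).
results = {}
for eps in (0.05, 0.1, 0.3, 2.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_demo)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: clusters={n_clusters}, noise points={n_noise}")
```

With min_samples fixed, the number of noise points can only stay the same or shrink as eps grows, and at a very large eps (2.0 here) everything collapses into a single cluster.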

18. How does K-Means++ improve upon standard K-Means initialization?
    
   -->

   K-Means++ is an initialization algorithm for K-Means that improves upon the standard (random) K-Means initialization by selecting initial centroids in a way that aims to be "smarter" and spread them out across the data. This helps in:

Faster Convergence: By selecting initial centroids that are already somewhat representative and well-separated, K-Means++ often leads to fewer iterations for the K-Means algorithm to converge.

Better Quality Clusters: It significantly reduces the chances of converging to a poor local optimum (suboptimal clustering) that can occur with purely random initialization, especially when clusters are well-separated or have varying densities.

How K-Means++ works:

Choose the first centroid uniformly at random from the data points.

For each remaining data point, calculate its distance to the closest centroid already chosen.

Choose the next centroid from the remaining data points with a probability proportional to the squared distance to the closest existing centroid. This means points far from existing centroids are more likely to be chosen as new centroids.

Repeat the previous two steps (distance calculation and weighted selection) until k centroids have been chosen.
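The seeding procedure can be sketched in NumPy as follows. The function name `kmeans_pp_init` and the toy data are assumptions made for this example; scikit-learn's `KMeans(init="k-means++")` uses its own, more elaborate implementation.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids from the rows of X using K-Means++ seeding."""
    rng = np.random.default_rng(seed)
    # First centroid: uniform at random.
    chosen = [int(rng.integers(len(X)))]
    # Squared distance from each point to its closest chosen centroid.
    closest_sq = ((X - X[chosen[0]]) ** 2).sum(axis=1)
    while len(chosen) < k:
        # Sample the next centroid with probability proportional to that squared distance.
        probs = closest_sq / closest_sq.sum()
        idx = int(rng.choice(len(X), p=probs))
        chosen.append(idx)
        # Refresh the closest-centroid distances with the new centroid.
        new_sq = ((X - X[idx]) ** 2).sum(axis=1)
        closest_sq = np.minimum(closest_sq, new_sq)
    return X[chosen]

X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
init_demo = kmeans_pp_init(X_demo, k=3)
print(init_demo)
```

An already chosen point has squared distance zero to itself, so it has zero probability of being picked again as long as the data contains no duplicate rows.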

19. What is agglomerative clustering?
    
   -->

   Agglomerative clustering is a "bottom-up" approach to hierarchical clustering. It starts with each data point as its own individual cluster. Then, it iteratively merges the closest pairs of clusters until all data points are in a single large cluster, or a stopping criterion (e.g., a specific number of clusters k) is met.

The "closeness" or "distance" between clusters is determined by a linkage criterion (e.g., single, complete, average, or Ward's linkage), which defines how the distance between two groups of points is calculated. The result of agglomerative clustering is a dendrogram, which visually represents the merging process and the hierarchy of clusters.

20. What makes Silhouette Score a better metric than just inertia for model evaluation?
    
   -->

   While inertia (WCSS) is the objective function that K-Means directly tries to minimize, the Silhouette Score is often considered a better metric for evaluating the quality of clustering results because:

Considers Both Cohesion and Separation:

Inertia: Only measures the cohesion of clusters (how close points are to their own centroid). It always decreases as the number of clusters (k) increases, even when adding more clusters doesn't make logical sense, making it a poor sole indicator for optimal k.

Silhouette Score: Measures both the cohesion (how similar a point is to its own cluster) and the separation (how dissimilar it is to other clusters). It provides a measure of how well-defined and separated the clusters are.

Provides Intuitive Interpretation: A high Silhouette Score (closer to 1) indicates that points are well-matched to their own cluster and well-separated from neighboring clusters. A low score (near 0) suggests overlapping clusters, and a negative score suggests misclassified points. This makes it more intuitive for interpreting cluster quality.

Better for Optimal 'k' Selection: Because it balances cohesion and separation, the Silhouette Score often peaks at a k that represents a more natural and meaningful clustering, unlike inertia, which keeps decreasing as k grows. This makes it a useful complement or alternative to the inertia-based "elbow method" for k selection.

**Practical Questions // Answers**

In [1]:
# ==============================================================================
# Complete Python Code for Clustering Techniques Practical Questions (21-48)
# This code is designed to be run in a Google Colab environment.
# ==============================================================================

# --- Essential Library Imports ---
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn imports for various models, datasets, and metrics
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import (
    make_blobs,
    make_moons,
    make_circles,
    load_iris,
    load_wine,
    load_breast_cancer,
    load_digits
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import silhouette_score, mean_squared_error, r2_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from scipy.cluster.hierarchy import dendrogram, linkage # For hierarchical clustering visualization


# To suppress warnings that might arise during execution
import warnings
warnings.filterwarnings('ignore')

print("All necessary libraries imported successfully.")

# ==============================================================================
# Q21. Generate synthetic data with 4 centers using make_blobs and apply K-Means
#      clustering. Visualize using a scatter plot.
# ==============================================================================
print("\n" + "="*80)
print("Q21. K-Means on make_blobs with 4 centers and visualization.")
print("="*80)

# 1. Generate synthetic data with 4 centers
X_q21, y_q21 = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# 2. Apply K-Means clustering
kmeans_q21 = KMeans(n_clusters=4, random_state=0, n_init=10) # n_init for robust initialization
kmeans_q21.fit(X_q21)
labels_q21 = kmeans_q21.labels_
centroids_q21 = kmeans_q21.cluster_centers_

# 3. Visualize using a scatter plot
plt.figure(figsize=(10, 7))
plt.scatter(X_q21[:, 0], X_q21[:, 1], c=labels_q21, cmap='viridis', s=50, alpha=0.8, label='Data Points')
plt.scatter(centroids_q21[:, 0], centroids_q21[:, 1], c='red', s=200, alpha=0.9, marker='X', label='Centroids')
plt.title('K-Means Clustering on Synthetic Blobs (4 Clusters)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.show()
print("K-Means clustering visualization with 4 centers displayed above.\n")


# ==============================================================================
# Q22. Load the Iris dataset and use Agglomerative Clustering to group the data
#      into 3 clusters. Display the first 10 predicted labels.
# ==============================================================================
print("\n" + "="*80)
print("Q22. Agglomerative Clustering on Iris dataset and first 10 labels.")
print("="*80)

# 1. Load the Iris dataset
iris_q22 = load_iris()
X_q22 = iris_q22.data
y_q22_true = iris_q22.target # True labels for comparison, though not used in clustering

# 2. Apply Agglomerative Clustering to group data into 3 clusters
# Using 'ward' linkage which is common for general purpose clustering
agg_clust_q22 = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels_q22 = agg_clust_q22.fit_predict(X_q22)

# 3. Display the first 10 predicted labels
print(f"First 10 predicted cluster labels for Iris dataset: {labels_q22[:10]}\n")


# ==============================================================================
# Q23. Generate synthetic data using make_moons and apply DBSCAN. Highlight
#      outliers in the plot.
# ==============================================================================
print("\n" + "="*80)
print("Q23. DBSCAN on make_moons with outlier highlighting.")
print("="*80)

# 1. Generate synthetic data using make_moons
X_q23, y_q23 = make_moons(n_samples=200, noise=0.05, random_state=0)

# 2. Apply DBSCAN
# Optimal eps and min_samples often require tuning; these are common starting points
dbscan_q23 = DBSCAN(eps=0.3, min_samples=5) # Tune these parameters for best results
labels_q23 = dbscan_q23.fit_predict(X_q23)

# 3. Highlight outliers in the plot
# Noise points are labeled as -1 by DBSCAN
core_samples_mask_q23 = np.zeros_like(labels_q23, dtype=bool)
core_samples_mask_q23[dbscan_q23.core_sample_indices_] = True

n_clusters_q23 = len(set(labels_q23)) - (1 if -1 in labels_q23 else 0)
n_noise_q23 = list(labels_q23).count(-1)

print(f'Estimated number of clusters: {n_clusters_q23}')
print(f'Estimated number of noise points: {n_noise_q23}')

plt.figure(figsize=(10, 7))
unique_labels_q23 = set(labels_q23)
colors_q23 = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels_q23))]

for k, col in zip(unique_labels_q23, colors_q23):
    if k == -1: # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask_q23 = (labels_q23 == k)

    # Plot core points
    xy_q23 = X_q23[class_member_mask_q23 & core_samples_mask_q23]
    plt.plot(xy_q23[:, 0], xy_q23[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=8)

    # Plot non-core points (border points or noise)
    xy_q23 = X_q23[class_member_mask_q23 & ~core_samples_mask_q23]
    plt.plot(xy_q23[:, 0], xy_q23[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=4)

plt.title(f'DBSCAN on make_moons (Clusters: {n_clusters_q23}, Noise: {n_noise_q23})')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
print("DBSCAN clustering visualization with outliers highlighted displayed above.\n")


# ==============================================================================
# Q24. Load the Wine dataset and apply K-Means clustering after standardizing
#      the features. Print the size of each cluster.
# ==============================================================================
print("\n" + "="*80)
print("Q24. K-Means on Wine dataset with standardization and cluster sizes.")
print("="*80)

# 1. Load the Wine dataset
wine_q24 = load_wine()
X_q24 = wine_q24.data
y_q24_true = wine_q24.target # True labels (3 classes in Wine dataset)

# 2. Standardize the features
scaler_q24 = StandardScaler()
X_scaled_q24 = scaler_q24.fit_transform(X_q24)

# 3. Apply K-Means clustering (using k=3 based on true classes for demonstration)
kmeans_q24 = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_q24.fit(X_scaled_q24)
labels_q24 = kmeans_q24.labels_

# 4. Print the size of each cluster
unique_labels_q24, counts_q24 = np.unique(labels_q24, return_counts=True)
cluster_sizes_q24 = dict(zip(unique_labels_q24, counts_q24))

print("Cluster sizes after K-Means on standardized Wine dataset:")
for cluster_id, size in cluster_sizes_q24.items():
    print(f"  Cluster {cluster_id}: {size} samples")
print("\n")


# ==============================================================================
# Q25. Use make_circles to generate synthetic data and cluster it using DBSCAN.
#      Plot the result.
# ==============================================================================
print("\n" + "="*80)
print("Q25. DBSCAN on make_circles data and visualization.")
print("="*80)

# 1. Use make_circles to generate synthetic data
X_q25, y_q25 = make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=0)

# 2. Cluster it using DBSCAN
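# DBSCAN can separate concentric rings because it follows density, not distance to a centroid
# eps and min_samples below are illustrative; if most points come back as noise (-1),
# increase eps or lower min_samples
dbscan_q25 = DBSCAN(eps=0.1, min_samples=10)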
labels_q25 = dbscan_q25.fit_predict(X_q25)

# 3. Plot the result
plt.figure(figsize=(10, 7))
# Color noise points (-1) differently if present
unique_labels_q25 = set(labels_q25)
colors_q25 = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels_q25))]

for k, col in zip(unique_labels_q25, colors_q25):
    if k == -1: # Black for noise
        col = [0, 0, 0, 1]
    class_member_mask_q25 = (labels_q25 == k)
    xy_q25 = X_q25[class_member_mask_q25]
    plt.plot(xy_q25[:, 0], xy_q25[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

n_clusters_q25 = len(set(labels_q25)) - (1 if -1 in labels_q25 else 0)
n_noise_q25 = list(labels_q25).count(-1)

plt.title(f'DBSCAN Clustering on Concentric Circles (Clusters: {n_clusters_q25}, Noise: {n_noise_q25})')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
print("DBSCAN clustering visualization on make_circles displayed above.\n")


# ==============================================================================
# Q26. Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with
#      2 clusters. Output the cluster centroids.
# ==============================================================================
print("\n" + "="*80)
print("Q26. K-Means on Breast Cancer with MinMaxScaler and centroids.")
print("="*80)

# 1. Load the Breast Cancer dataset
breast_cancer_q26 = load_breast_cancer()
X_q26 = breast_cancer_q26.data
y_q26_true = breast_cancer_q26.target # True labels (2 classes)
feature_names_q26 = breast_cancer_q26.feature_names

# 2. Apply MinMaxScaler
scaler_q26 = MinMaxScaler()
X_scaled_q26 = scaler_q26.fit_transform(X_q26)

# 3. Use K-Means with 2 clusters
kmeans_q26 = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans_q26.fit(X_scaled_q26)

# 4. Output the cluster centroids
centroids_q26_scaled = kmeans_q26.cluster_centers_

# It's often useful to inverse transform centroids to original scale for interpretability
centroids_q26_original_scale = scaler_q26.inverse_transform(centroids_q26_scaled)

print("K-Means Cluster Centroids (Original Scale) for Breast Cancer dataset:")
centroid_df_q26 = pd.DataFrame(centroids_q26_original_scale, columns=feature_names_q26)
print(centroid_df_q26)
print("\n")


# ==============================================================================
# Q27. Generate synthetic data using make_blobs with varying cluster standard
#      deviations and cluster with DBSCAN.
# ==============================================================================
print("\n" + "="*80)
print("Q27. DBSCAN on make_blobs with varying standard deviations.")
print("="*80)

# 1. Generate synthetic data using make_blobs with varying cluster standard deviations
# make_blobs can take a list for cluster_std to create varying densities
X_q27, y_q27 = make_blobs(n_samples=500, centers=3, cluster_std=[0.5, 1.0, 0.2], random_state=42)

# 2. Cluster with DBSCAN
# DBSCAN might struggle with very varying densities or require careful tuning of eps/min_samples
# to capture all clusters appropriately.
# For illustration, let's pick parameters that might show its behavior.
dbscan_q27 = DBSCAN(eps=0.5, min_samples=5) # Example parameters
labels_q27 = dbscan_q27.fit_predict(X_q27)

n_clusters_q27 = len(set(labels_q27)) - (1 if -1 in labels_q27 else 0)
n_noise_q27 = list(labels_q27).count(-1)

print(f'Estimated number of clusters by DBSCAN: {n_clusters_q27}')
print(f'Estimated number of noise points by DBSCAN: {n_noise_q27}')

# Plotting the result
plt.figure(figsize=(10, 7))
unique_labels_q27 = set(labels_q27)
colors_q27 = [plt.cm.viridis(each) for each in np.linspace(0, 1, len(unique_labels_q27))]

for k, col in zip(unique_labels_q27, colors_q27):
    if k == -1:
        # Black color for noise points
        col = [0, 0, 0, 1]

    class_member_mask_q27 = (labels_q27 == k)
    xy_q27 = X_q27[class_member_mask_q27]
    plt.plot(xy_q27[:, 0], xy_q27[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('DBSCAN Clustering on Blobs with Varying Standard Deviations')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
print("DBSCAN clustering visualization on make_blobs with varying std deviations displayed above.\n")
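# Optional sketch: a k-distance heuristic for choosing eps. The Q27 comments
# note that eps/min_samples need careful tuning when densities vary. A common
# heuristic is to sort each point's distance to its k-th nearest neighbour and
# look for a knee. The *_q27_kd names and the percentiles printed are
# assumptions made for illustration only.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X_q27_kd, _ = make_blobs(n_samples=500, centers=3,
                         cluster_std=[0.5, 1.0, 0.2], random_state=42)
min_samples_q27_kd = 5

# Querying the training data returns each point as its own first neighbour
# (distance 0), which matches DBSCAN's min_samples counting the point itself.
nn_q27_kd = NearestNeighbors(n_neighbors=min_samples_q27_kd).fit(X_q27_kd)
dist_q27_kd, _ = nn_q27_kd.kneighbors(X_q27_kd)
k_dist_q27_kd = np.sort(dist_q27_kd[:, -1])  # distance to the farthest of the k neighbours

for q in (50, 90, 99):
    print(f"  {q}th percentile of k-distance: {np.percentile(k_dist_q27_kd, q):.3f}")
# A single eps cannot suit all three blobs if these percentiles differ widely.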


# ==============================================================================
# Q28. Load the Digits dataset, reduce it to 2D using PCA, and visualize
#      clusters from K-Means.
# ==============================================================================
print("\n" + "="*80)
print("Q28. K-Means on Digits dataset (PCA-reduced to 2D) and visualization.")
print("="*80)

# 1. Load the Digits dataset
digits_q28 = load_digits()
X_q28 = digits_q28.data
y_q28_true = digits_q28.target # True labels (0-9, so 10 classes)

# 2. Reduce it to 2D using PCA
pca_q28 = PCA(n_components=2, random_state=42)
X_pca_q28 = pca_q28.fit_transform(X_q28)

# 3. Apply K-Means (10 clusters for digits 0-9)
kmeans_q28 = KMeans(n_clusters=10, random_state=42, n_init=10)
kmeans_q28.fit(X_pca_q28)
labels_q28 = kmeans_q28.labels_
centroids_q28 = kmeans_q28.cluster_centers_

# 4. Visualize clusters from K-Means
plt.figure(figsize=(10, 7))
plt.scatter(X_pca_q28[:, 0], X_pca_q28[:, 1], c=labels_q28, cmap='tab10', s=20, alpha=0.8)
plt.scatter(centroids_q28[:, 0], centroids_q28[:, 1], c='black', s=100, marker='X', label='Centroids')
plt.title('K-Means Clustering on Digits Dataset (PCA-reduced to 2D)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.grid(True)
plt.show()
print("K-Means clustering visualization on Digits dataset (PCA-reduced) displayed above.\n")
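# Optional sketch: how much variance do two principal components keep?
# Clusters found in a 2D projection only reflect the structure that survives
# the reduction, so it may help to report the retained variance next to the
# plot. The *_q28_chk names are assumptions introduced for this sketch.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

pca_q28_chk = PCA(n_components=2, random_state=42).fit(load_digits().data)
retained_q28_chk = pca_q28_chk.explained_variance_ratio_.sum()
print(f"Variance retained by 2 principal components (sketch): {retained_q28_chk:.2%}")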


# ==============================================================================
# Q29. Create synthetic data using make_blobs and evaluate silhouette scores
#      for k=2 to 5. Display as a bar chart.
# ==============================================================================
print("\n" + "="*80)
print("Q29. Silhouette Scores for K-Means (k=2 to 5) and bar chart.")
print("="*80)

# 1. Create synthetic data using make_blobs
X_q29, y_q29 = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=42)

# 2. Evaluate silhouette scores for k=2 to 5
k_values_q29 = range(2, 6) # k=2, 3, 4, 5
silhouette_scores_q29 = []

for k in k_values_q29:
    kmeans_q29 = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_q29 = kmeans_q29.fit_predict(X_q29)
    score = silhouette_score(X_q29, labels_q29)
    silhouette_scores_q29.append(score)
    print(f"  K = {k}, Silhouette Score: {score:.4f}")

# 3. Display as a bar chart
plt.figure(figsize=(8, 6))
plt.bar(k_values_q29, silhouette_scores_q29, color='lightgreen')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Scores for K-Means at different k values")
plt.xticks(k_values_q29)
plt.ylim(0, 1) # Silhouette ranges from -1 to 1; only [0, 1] is shown because scores for these blobs are expected to be positive
plt.grid(axis='y', linestyle='--')
plt.show()
print("Silhouette scores bar chart displayed above.\n")


# ==============================================================================
# Q30. Load the Iris dataset and use hierarchical clustering to group data.
#      Plot a dendrogram with average linkage.
# ==============================================================================
print("\n" + "="*80)
print("Q30. Hierarchical Clustering on Iris and dendrogram with average linkage.")
print("="*80)

# 1. Load the Iris dataset
iris_q30 = load_iris()
X_q30 = iris_q30.data

# 2. Use hierarchical clustering (we need the linkage matrix for dendrogram)
# 'average' linkage: distance between two clusters is the mean distance over all pairs of observations, one from each cluster.
Z_q30 = linkage(X_q30, method='average')

# 3. Plot a dendrogram with average linkage
plt.figure(figsize=(15, 8))
# Label each leaf with its true species name; reuse the dataset loaded above instead of loading it again
dendrogram(Z_q30, leaf_rotation=90, leaf_font_size=8, labels=iris_q30.target_names[iris_q30.target])
plt.title('Hierarchical Clustering Dendrogram (Average Linkage) on Iris Dataset')
plt.xlabel('Sample (labelled by true species)')
plt.ylabel('Distance')
plt.show()
print("Dendrogram with average linkage for Iris dataset displayed above.\n")
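# Optional sketch: turning the dendrogram into flat cluster labels. The
# dendrogram only visualizes the merge order; to actually "group the data"
# the tree has to be cut. One hedged way is scipy's fcluster with the
# 'maxclust' criterion. Cutting at 3 clusters is an assumption (one per Iris
# species), and the *_q30_chk names exist only in this sketch.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_iris

Z_q30_chk = linkage(load_iris().data, method='average')
flat_q30_chk = fcluster(Z_q30_chk, t=3, criterion='maxclust')  # labels start at 1

# np.bincount index 0 is unused because fcluster never emits label 0
sizes_q30_chk = np.bincount(flat_q30_chk)[1:]
print(f"Flat cluster sizes from average linkage (sketch): {sizes_q30_chk.tolist()}")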


# ==============================================================================
# Q31. Generate synthetic data with overlapping clusters using make_blobs, then
#      apply K-Means and visualize with decision boundaries.
# ==============================================================================
print("\n" + "="*80)
print("Q31. K-Means on overlapping make_blobs with decision boundaries.")
print("="*80)

# 1. Generate synthetic data with overlapping clusters
X_q31, y_q31 = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42) # Higher std for overlap

# 2. Apply K-Means
kmeans_q31 = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_q31.fit(X_q31)
labels_q31 = kmeans_q31.labels_
centroids_q31 = kmeans_q31.cluster_centers_

# 3. Visualize with decision boundaries
# Create a meshgrid to plot decision boundaries
h_q31 = 0.02 # step size in the mesh
x_min, x_max = X_q31[:, 0].min() - 1, X_q31[:, 0].max() + 1
y_min, y_max = X_q31[:, 1].min() - 1, X_q31[:, 1].max() + 1
xx_q31, yy_q31 = np.meshgrid(np.arange(x_min, x_max, h_q31), np.arange(y_min, y_max, h_q31))

# Predict cluster for each point in the meshgrid
Z_q31 = kmeans_q31.predict(np.c_[xx_q31.ravel(), yy_q31.ravel()])
Z_q31 = Z_q31.reshape(xx_q31.shape)

plt.figure(figsize=(10, 7))
plt.imshow(Z_q31, interpolation='nearest',
           extent=(xx_q31.min(), xx_q31.max(), yy_q31.min(), yy_q31.max()),
           cmap=plt.cm.Paired, aspect='auto', origin='lower', alpha=0.8)

plt.scatter(X_q31[:, 0], X_q31[:, 1], c=labels_q31, cmap='viridis', s=50, edgecolors='k', label='Data Points')
plt.scatter(centroids_q31[:, 0], centroids_q31[:, 1], marker='X', s=200, linewidths=3,
            color='red', edgecolors='black', zorder=10, label='Centroids')
plt.title('K-Means Clustering with Decision Boundaries on Overlapping Blobs')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.show()
print("K-Means clustering visualization with decision boundaries on overlapping blobs displayed above.\n")
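# Optional sketch: why the boundaries above are straight lines. K-Means
# assigns a point to its nearest centroid, so the decision regions form a
# Voronoi partition of the plane. The check below compares predict() with an
# explicit nearest-centroid rule on random probe points. The probe points and
# the *_q31_chk names are assumptions introduced for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_q31_chk, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)
km_q31_chk = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_q31_chk)

rng_q31_chk = np.random.default_rng(0)
probe_q31_chk = rng_q31_chk.uniform(X_q31_chk.min(axis=0) - 1,
                                    X_q31_chk.max(axis=0) + 1, size=(200, 2))

# Pairwise Euclidean distances, shape (n_probe, n_centroids)
dists_q31_chk = np.linalg.norm(
    probe_q31_chk[:, None, :] - km_q31_chk.cluster_centers_[None, :, :], axis=2)
nearest_q31_chk = dists_q31_chk.argmin(axis=1)

agreement_q31_chk = np.mean(nearest_q31_chk == km_q31_chk.predict(probe_q31_chk))
print(f"predict() vs. nearest-centroid agreement (sketch): {agreement_q31_chk:.2%}")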


# ==============================================================================
# Q32. Load the Digits dataset and apply DBSCAN after reducing dimensions with
#      t-SNE. Visualize the results.
# ==============================================================================
print("\n" + "="*80)
print("Q32. DBSCAN on Digits (t-SNE reduced) and visualization.")
print("="*80)

# 1. Load the Digits dataset
digits_q32 = load_digits()
X_q32 = digits_q32.data
y_q32_true = digits_q32.target # True labels (10 classes)

# 2. Reduce dimensions with t-SNE
print("  Applying t-SNE (this might take a moment)...")
tsne_q32 = TSNE(n_components=2, random_state=42)
X_tsne_q32 = tsne_q32.fit_transform(X_q32)
print("  t-SNE dimensionality reduction complete.")

# 3. Apply DBSCAN
# DBSCAN parameters need careful tuning for t-SNE output, as densities can vary
dbscan_q32 = DBSCAN(eps=2.5, min_samples=10) # Example parameters; tune for optimal results
labels_q32 = dbscan_q32.fit_predict(X_tsne_q32)

n_clusters_q32 = len(set(labels_q32)) - (1 if -1 in labels_q32 else 0)
n_noise_q32 = list(labels_q32).count(-1)

print(f'Estimated number of clusters by DBSCAN: {n_clusters_q32}')
print(f'Estimated number of noise points by DBSCAN: {n_noise_q32}')

# 4. Visualize the results
plt.figure(figsize=(10, 7))
unique_labels_q32 = set(labels_q32)
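# plt.cm.get_cmap was deprecated in Matplotlib 3.7 and later removed; plt.get_cmap accepts the same arguments
cmap_q32 = plt.get_cmap('tab10', len(unique_labels_q32))
colors_q32 = [cmap_q32(i) for i in range(len(unique_labels_q32))]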

for k, col in zip(unique_labels_q32, colors_q32):
    if k == -1: # Black for noise
        col = (0, 0, 0, 1) # Full black

    class_member_mask_q32 = (labels_q32 == k)
    xy_q32 = X_tsne_q32[class_member_mask_q32]
    plt.plot(xy_q32[:, 0], xy_q32[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6, alpha=0.7)

plt.title(f'DBSCAN Clustering on Digits Dataset (t-SNE Reduced)\n(Clusters: {n_clusters_q32}, Noise: {n_noise_q32})')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.grid(True)
plt.show()
print("DBSCAN clustering visualization on Digits dataset (t-SNE reduced) displayed above.\n")


# ==============================================================================
# Q33. Generate synthetic data using make_blobs and apply Agglomerative
#      Clustering with complete linkage. Plot the result.
# ==============================================================================
print("\n" + "="*80)
print("Q33. Agglomerative Clustering (complete linkage) on make_blobs.")
print("="*80)

# 1. Generate synthetic data using make_blobs
X_q33, y_q33 = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=42)

# 2. Apply Agglomerative Clustering with complete linkage
# Complete linkage: distance between two clusters is the maximum distance between any pair of points, one from each cluster.
agg_clust_q33 = AgglomerativeClustering(n_clusters=4, linkage='complete')
labels_q33 = agg_clust_q33.fit_predict(X_q33)

# 3. Plot the result
plt.figure(figsize=(10, 7))
plt.scatter(X_q33[:, 0], X_q33[:, 1], c=labels_q33, cmap='viridis', s=50, alpha=0.8)
plt.title('Agglomerative Clustering (Complete Linkage) on Synthetic Blobs (4 Clusters)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
plt.show()
print("Agglomerative Clustering visualization (complete linkage) displayed above.\n")


# ==============================================================================
# Q34. Load the Breast Cancer dataset and compare inertia values for K=2 to 6
#      using K-Means. Show results in a line plot.
# ==============================================================================
print("\n" + "="*80)
print("Q34. K-Means Inertia (WCSS) comparison for K=2 to 6 on Breast Cancer.")
print("="*80)

# 1. Load the Breast Cancer dataset
breast_cancer_q34 = load_breast_cancer()
X_q34 = breast_cancer_q34.data

# It's good practice to scale data before K-Means
scaler_q34 = StandardScaler()
X_scaled_q34 = scaler_q34.fit_transform(X_q34)

# 2. Compare inertia values for K=2 to 6
k_values_q34 = range(2, 7) # K=2, 3, 4, 5, 6
inertia_values_q34 = []

print("Inertia values for K-Means on Breast Cancer dataset:")
for k in k_values_q34:
    kmeans_q34 = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans_q34.fit(X_scaled_q34)
    inertia_values_q34.append(kmeans_q34.inertia_)
    print(f"  K = {k}, Inertia: {kmeans_q34.inertia_:.4f}")

# 3. Show results in a line plot (Elbow Method visualization)
plt.figure(figsize=(10, 6))
plt.plot(k_values_q34, inertia_values_q34, marker='o', linestyle='-')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Inertia (Within-Cluster Sum of Squares)")
plt.title("K-Means Inertia vs. Number of Clusters (Breast Cancer Dataset)")
plt.xticks(k_values_q34)
plt.grid(True)
plt.show()
print("Inertia comparison line plot displayed above.\n")
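# Optional sketch: reading the elbow numerically. Inertia almost always falls
# as K grows, so the plot is read by looking for where the drop flattens. One
# hedged way to support that reading is to print the drop gained by each
# extra cluster. Expressing the drop relative to the K=2 inertia is an
# assumption of this sketch, as are the *_q34_chk names.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X_q34_chk = StandardScaler().fit_transform(load_breast_cancer().data)
ks_q34_chk = list(range(2, 7))
inertias_q34_chk = np.array([
    KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_q34_chk).inertia_
    for k in ks_q34_chk
])

# Positive values mean inertia decreased when moving from K-1 to K
drops_q34_chk = -np.diff(inertias_q34_chk)
for k, drop in zip(ks_q34_chk[1:], drops_q34_chk):
    print(f"  K = {k - 1} -> {k}: inertia drop {drop:.1f} "
          f"({drop / inertias_q34_chk[0]:.1%} of the K=2 inertia)")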


# ==============================================================================
# Q35. Generate synthetic concentric circles using make_circles and cluster
#      using Agglomerative Clustering with single linkage.
# ==============================================================================
print("\n" + "="*80)
print("Q35. Agglomerative Clustering (single linkage) on concentric circles.")
print("="*80)

# 1. Generate synthetic concentric circles
X_q35, y_q35 = make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=0)

# 2. Cluster using Agglomerative Clustering with single linkage
# Single linkage is often effective for elongated or intertwined clusters like circles
agg_clust_q35 = AgglomerativeClustering(n_clusters=2, linkage='single') # Expecting 2 circles
labels_q35 = agg_clust_q35.fit_predict(X_q35)

# 3. Plot the result
plt.figure(figsize=(10, 7))
plt.scatter(X_q35[:, 0], X_q35[:, 1], c=labels_q35, cmap='plasma', s=50, alpha=0.8)
plt.title('Agglomerative Clustering (Single Linkage) on Concentric Circles (2 Clusters)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
plt.show()
print("Agglomerative Clustering visualization (single linkage) on concentric circles displayed above.\n")


# ==============================================================================
# Q36. Use the Wine dataset, apply DBSCAN after scaling the data, and count the
#      number of clusters (excluding noise).
# ==============================================================================
print("\n" + "="*80)
print("Q36. DBSCAN on Wine dataset (scaled) and cluster count.")
print("="*80)

# 1. Use the Wine dataset
wine_q36 = load_wine()
X_q36 = wine_q36.data

# 2. Apply scaling to the data
scaler_q36 = StandardScaler()
X_scaled_q36 = scaler_q36.fit_transform(X_q36)

# 3. Apply DBSCAN after scaling the data
# DBSCAN parameters (eps, min_samples) are crucial and dataset-dependent.
# These are illustrative values and might need tuning.
dbscan_q36 = DBSCAN(eps=1.5, min_samples=5) # Example parameters
labels_q36 = dbscan_q36.fit_predict(X_scaled_q36)

# 4. Count the number of clusters (excluding noise)
# Noise points are labeled as -1 by DBSCAN
n_clusters_q36 = len(set(labels_q36)) - (1 if -1 in labels_q36 else 0)
n_noise_q36 = list(labels_q36).count(-1)

print(f"Number of clusters found (excluding noise): {n_clusters_q36}")
print(f"Number of noise points identified: {n_noise_q36}\n")
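# Optional sketch: how sensitive is the Q36 cluster count to eps? The scaled
# Wine data has 13 features, so pairwise distances tend to be larger than in
# 2D and a small eps can label most samples as noise. Sweeping a few eps
# values shows the effect. The eps grid and the *_q36_chk names are
# assumptions made for illustration.
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X_q36_chk = StandardScaler().fit_transform(load_wine().data)
results_q36_chk = []
for eps_q36_chk in (1.5, 2.0, 2.5, 3.0):
    labels_q36_chk = DBSCAN(eps=eps_q36_chk, min_samples=5).fit_predict(X_q36_chk)
    n_clusters_q36_chk = len(set(labels_q36_chk)) - (1 if -1 in labels_q36_chk else 0)
    n_noise_q36_chk = int((labels_q36_chk == -1).sum())
    results_q36_chk.append((eps_q36_chk, n_clusters_q36_chk, n_noise_q36_chk))
    print(f"  eps = {eps_q36_chk}: clusters = {n_clusters_q36_chk}, noise = {n_noise_q36_chk}")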


# ==============================================================================
# Q37. Generate synthetic data with make_blobs and apply KMeans. Then plot the
#      cluster centers on top of the data points.
# ==============================================================================
print("\n" + "="*80)
print("Q37. K-Means on make_blobs and plot cluster centers.")
print("="*80)

# 1. Generate synthetic data with make_blobs
X_q37, y_q37 = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=42)

# 2. Apply KMeans
kmeans_q37 = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_q37.fit(X_q37)
labels_q37 = kmeans_q37.labels_
centroids_q37 = kmeans_q37.cluster_centers_

# 3. Plot the cluster centers on top of the data points
plt.figure(figsize=(10, 7))
plt.scatter(X_q37[:, 0], X_q37[:, 1], c=labels_q37, cmap='coolwarm', s=50, alpha=0.8, label='Data Points')
plt.scatter(centroids_q37[:, 0], centroids_q37[:, 1], c='black', s=250, alpha=1.0, marker='*', edgecolor='white', linewidth=1.5, label='Cluster Centroids')
plt.title('K-Means Clustering with Cluster Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.show()
print("K-Means clustering visualization with cluster centroids displayed above.\n")


# ==============================================================================
# Q38. Load the Iris dataset, cluster with DBSCAN, and print how many samples
#      were identified as noise.
# ==============================================================================
print("\n" + "="*80)
print("Q38. DBSCAN on Iris dataset and noise sample count.")
print("="*80)

# 1. Load the Iris dataset
iris_q38 = load_iris()
X_q38 = iris_q38.data

# It's good practice to scale data for DBSCAN
scaler_q38 = StandardScaler()
X_scaled_q38 = scaler_q38.fit_transform(X_q38)

# 2. Cluster with DBSCAN
# Illustrative parameters: eps=0.5 and min_samples=5 are also the scikit-learn defaults, and the noise count depends strongly on them
dbscan_q38 = DBSCAN(eps=0.5, min_samples=5) # Tune eps/min_samples as needed
labels_q38 = dbscan_q38.fit_predict(X_scaled_q38)

# 3. Print how many samples were identified as noise
n_noise_q38 = list(labels_q38).count(-1)
print(f"Number of samples identified as noise by DBSCAN on Iris dataset: {n_noise_q38}\n")


# ==============================================================================
# Q39. Generate synthetic non-linearly separable data using make_moons, apply
#      K-Means, and visualize the clustering result.
# ==============================================================================
print("\n" + "="*80)
print("Q39. K-Means on non-linearly separable make_moons data and visualization.")
print("="*80)

# 1. Generate synthetic non-linearly separable data using make_moons
X_q39, y_q39 = make_moons(n_samples=200, noise=0.05, random_state=0)

# 2. Apply K-Means (K-Means will struggle here due to non-linear shape)
kmeans_q39 = KMeans(n_clusters=2, random_state=0, n_init=10) # Expecting 2 crescent shapes
kmeans_q39.fit(X_q39)
labels_q39 = kmeans_q39.labels_
centroids_q39 = kmeans_q39.cluster_centers_

# 3. Visualize the clustering result
plt.figure(figsize=(10, 7))
plt.scatter(X_q39[:, 0], X_q39[:, 1], c=labels_q39, cmap='spring', s=50, alpha=0.8, label='K-Means Clusters')
plt.scatter(centroids_q39[:, 0], centroids_q39[:, 1], c='blue', s=200, marker='X', label='Centroids')
plt.title('K-Means Clustering on Non-Linearly Separable Make_Moons')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.show()
print("K-Means clustering visualization on make_moons displayed above (note its struggle with non-linear shapes).\n")
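# Optional sketch: quantifying the "struggle" on make_moons. The plot shows
# K-Means cutting the moons with a straight boundary. A hedged way to put a
# number on that is to compare the Adjusted Rand Index (ARI) of K-Means with
# that of a density-based method on the same data. The DBSCAN parameters
# (eps=0.2, min_samples=5) and the *_q39_chk names are assumptions chosen for
# this sketch, not tuned values.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X_q39_chk, y_q39_chk = make_moons(n_samples=200, noise=0.05, random_state=0)

labels_km_q39_chk = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X_q39_chk)
labels_db_q39_chk = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_q39_chk)

ari_km_q39_chk = adjusted_rand_score(y_q39_chk, labels_km_q39_chk)
ari_db_q39_chk = adjusted_rand_score(y_q39_chk, labels_db_q39_chk)
print(f"  ARI K-Means: {ari_km_q39_chk:.3f}")
print(f"  ARI DBSCAN : {ari_db_q39_chk:.3f}")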


# ==============================================================================
# Q40. Load the Digits dataset, apply PCA to reduce to 3 components, then use
#      KMeans and visualize with a 3D scatter plot.
# ==============================================================================
print("\n" + "="*80)
print("Q40. K-Means on Digits (PCA-reduced to 3D) and 3D visualization.")
print("="*80)

# 1. Load the Digits dataset
digits_q40 = load_digits()
X_q40 = digits_q40.data
y_q40_true = digits_q40.target # True labels (0-9, 10 classes)

# 2. Apply PCA to reduce to 3 components
pca_q40 = PCA(n_components=3, random_state=42)
X_pca_q40 = pca_q40.fit_transform(X_q40)

# 3. Use KMeans (10 clusters for digits 0-9)
kmeans_q40 = KMeans(n_clusters=10, random_state=42, n_init=10)
kmeans_q40.fit(X_pca_q40)
labels_q40 = kmeans_q40.labels_
centroids_q40 = kmeans_q40.cluster_centers_

# 4. Visualize with a 3D scatter plot
fig_q40 = plt.figure(figsize=(10, 8))
ax_q40 = fig_q40.add_subplot(111, projection='3d')

scatter_q40 = ax_q40.scatter(X_pca_q40[:, 0], X_pca_q40[:, 1], X_pca_q40[:, 2],
                             c=labels_q40, cmap='tab20', s=30, alpha=0.8)
ax_q40.scatter(centroids_q40[:, 0], centroids_q40[:, 1], centroids_q40[:, 2],
               c='black', s=150, marker='X', edgecolor='white', linewidth=1.5, label='Centroids')

ax_q40.set_title('K-Means Clustering on Digits Dataset (PCA-reduced to 3D)')
ax_q40.set_xlabel('Principal Component 1')
ax_q40.set_ylabel('Principal Component 2')
ax_q40.set_zlabel('Principal Component 3')
fig_q40.colorbar(scatter_q40, ax=ax_q40, label='Cluster Label')
ax_q40.legend() # Attach the legend to the 3D axes explicitly rather than relying on the current axes
plt.show()
print("K-Means clustering visualization on Digits dataset (PCA-reduced to 3D) displayed above.\n")


# ==============================================================================
# Q41. Generate synthetic blobs with 5 centers and apply KMeans. Then use
#      silhouette_score to evaluate the clustering.
# ==============================================================================
print("\n" + "="*80)
print("Q41. K-Means on 5-center blobs and Silhouette Score evaluation.")
print("="*80)

# 1. Generate synthetic blobs with 5 centers
X_q41, y_q41 = make_blobs(n_samples=400, centers=5, cluster_std=0.8, random_state=42)

# 2. Apply KMeans
kmeans_q41 = KMeans(n_clusters=5, random_state=42, n_init=10)
labels_q41 = kmeans_q41.fit_predict(X_q41)

# 3. Use silhouette_score to evaluate the clustering
score_q41 = silhouette_score(X_q41, labels_q41)
print(f"Silhouette Score for K-Means (5 clusters) on synthetic blobs: {score_q41:.4f}\n")


# ==============================================================================
# Q42. Load the Breast Cancer dataset, reduce dimensionality using PCA, and
#      apply Agglomerative Clustering. Visualize in 2D.
# ==============================================================================
print("\n" + "="*80)
print("Q42. Agglomerative Clustering on Breast Cancer (PCA-reduced to 2D) and visualization.")
print("="*80)

# 1. Load the Breast Cancer dataset
breast_cancer_q42 = load_breast_cancer()
X_q42 = breast_cancer_q42.data
y_q42_true = breast_cancer_q42.target

# Scale the data first
scaler_q42 = StandardScaler()
X_scaled_q42 = scaler_q42.fit_transform(X_q42)

# 2. Reduce dimensionality using PCA
pca_q42 = PCA(n_components=2, random_state=42)
X_pca_q42 = pca_q42.fit_transform(X_scaled_q42)

# 3. Apply Agglomerative Clustering (2 clusters, matching the number of true classes; the labels themselves are not used)
agg_clust_q42 = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels_q42 = agg_clust_q42.fit_predict(X_pca_q42)

# 4. Visualize in 2D
plt.figure(figsize=(10, 7))
plt.scatter(X_pca_q42[:, 0], X_pca_q42[:, 1], c=labels_q42, cmap='cividis', s=50, alpha=0.8)
plt.title('Agglomerative Clustering on Breast Cancer (PCA-reduced to 2D)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()
print("Agglomerative Clustering visualization on Breast Cancer (PCA-reduced) displayed above.\n")


# ==============================================================================
# Q43. Generate noisy circular data using make_circles and visualize clustering
#      results from KMeans and DBSCAN side-by-side.
# ==============================================================================
print("\n" + "="*80)
print("Q43. K-Means vs. DBSCAN on noisy circular data side-by-side.")
print("="*80)

# 1. Generate noisy circular data
X_q43, y_q43 = make_circles(n_samples=400, factor=0.5, noise=0.1, random_state=42)

# 2. Apply K-Means
kmeans_q43 = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_kmeans_q43 = kmeans_q43.fit_predict(X_q43)

# 3. Apply DBSCAN
dbscan_q43 = DBSCAN(eps=0.2, min_samples=10) # Tune as needed
labels_dbscan_q43 = dbscan_q43.fit_predict(X_q43)

# 4. Visualize clustering results side-by-side
fig_q43, axes_q43 = plt.subplots(1, 2, figsize=(16, 7))

# K-Means plot
axes_q43[0].scatter(X_q43[:, 0], X_q43[:, 1], c=labels_kmeans_q43, cmap='viridis', s=50, alpha=0.8)
axes_q43[0].set_title('K-Means Clustering on Noisy Circles')
axes_q43[0].set_xlabel('Feature 1')
axes_q43[0].set_ylabel('Feature 2')
axes_q43[0].grid(True)

# DBSCAN plot
# Handle noise points for DBSCAN visualization
unique_labels_dbscan_q43 = set(labels_dbscan_q43)
colors_dbscan_q43 = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels_dbscan_q43))]

for k, col in zip(unique_labels_dbscan_q43, colors_dbscan_q43):
    if k == -1:
        col = [0, 0, 0, 1] # Black for noise
    class_member_mask_dbscan_q43 = (labels_dbscan_q43 == k)
    xy_dbscan_q43 = X_q43[class_member_mask_dbscan_q43]
    axes_q43[1].plot(xy_dbscan_q43[:, 0], xy_dbscan_q43[:, 1], 'o', markerfacecolor=tuple(col),
                     markeredgecolor='k', markersize=6, alpha=0.8)

axes_q43[1].set_title('DBSCAN Clustering on Noisy Circles')
axes_q43[1].set_xlabel('Feature 1')
axes_q43[1].set_ylabel('Feature 2')
axes_q43[1].grid(True)

plt.tight_layout()
plt.show()
print("K-Means vs. DBSCAN clustering on noisy circular data side-by-side displayed above.\n")


# ==============================================================================
# Q44. Load the Iris dataset and plot the Silhouette Coefficient for each sample
#      after KMeans clustering.
# ==============================================================================
print("\n" + "="*80)
print("Q44. Silhouette Coefficient per sample for Iris K-Means.")
print("="*80)

# 1. Load the Iris dataset
iris_q44 = load_iris()
X_q44 = iris_q44.data

# Scale the data first
scaler_q44 = StandardScaler()
X_scaled_q44 = scaler_q44.fit_transform(X_q44)

# 2. Apply KMeans clustering (3 clusters)
kmeans_q44 = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_q44 = kmeans_q44.fit_predict(X_scaled_q44)

# 3. Calculate the Silhouette Coefficient for each sample
from sklearn.metrics import silhouette_samples
silhouette_per_sample_q44 = silhouette_samples(X_scaled_q44, labels_q44)

# 4. Plot the Silhouette Coefficient for each sample
plt.figure(figsize=(10, 7))
y_lower = 10
for i in range(3): # Iterate through each cluster (0, 1, 2)
    # Aggregate the silhouette scores for samples belonging to cluster i, and sort them
    ith_cluster_silhouette_values = silhouette_per_sample_q44[labels_q44 == i]
    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    color = plt.get_cmap("Spectral")(float(i) / 3) # plt.cm.get_cmap was deprecated in Matplotlib 3.7 and later removed
    plt.fill_betweenx(np.arange(y_lower, y_upper),
                      0, ith_cluster_silhouette_values,
                      facecolor=color, edgecolor=color, alpha=0.7)

    # Label the silhouette plots with their cluster numbers at the middle
    plt.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

    # Compute the new y_lower for next plot
    y_lower = y_upper + 10 # Leave a 10-row gap between consecutive cluster bands

plt.title("Silhouette plot for the various clusters (Iris Dataset)")
plt.xlabel("The silhouette coefficient values")
plt.ylabel("Cluster label")
plt.axvline(x=silhouette_score(X_scaled_q44, labels_q44), color="red", linestyle="--", label='Average Silhouette Score')
plt.legend()
plt.yticks([]) # Clear the yaxis labels / ticks
plt.xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
plt.show()
print("Silhouette Coefficient per sample plot displayed above.\n")
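# Optional sketch: summarizing the silhouette plot per cluster. The plot shows
# every sample; a short table of per-cluster means and the number of samples
# with a negative coefficient (closer, on average, to another cluster) can
# make it easier to read. The *_q44_chk names are assumptions introduced for
# this sketch.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

X_q44_chk = StandardScaler().fit_transform(load_iris().data)
labels_q44_chk = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_q44_chk)
sil_q44_chk = silhouette_samples(X_q44_chk, labels_q44_chk)

for c in np.unique(labels_q44_chk):
    vals = sil_q44_chk[labels_q44_chk == c]
    print(f"  Cluster {c}: n = {vals.size}, mean silhouette = {vals.mean():.3f}, "
          f"negative = {int((vals < 0).sum())}")

# The overall silhouette_score is the mean of the per-sample values
mean_q44_chk = sil_q44_chk.mean()
print(f"  Mean over all samples: {mean_q44_chk:.3f}")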


# ==============================================================================
# Q45. Generate synthetic data using make_blobs and apply Agglomerative
#      Clustering with 'average' linkage. Visualize clusters.
# ==============================================================================
print("\n" + "="*80)
print("Q45. Agglomerative Clustering ('average' linkage) on make_blobs and visualization.")
print("="*80)

# 1. Generate synthetic data using make_blobs
X_q45, y_q45 = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=42)

# 2. Apply Agglomerative Clustering with 'average' linkage
# 'average' linkage: distance between two clusters is the mean distance over all pairs of observations, one from each cluster.
agg_clust_q45 = AgglomerativeClustering(n_clusters=4, linkage='average')
labels_q45 = agg_clust_q45.fit_predict(X_q45)

# 3. Visualize clusters
plt.figure(figsize=(10, 7))
plt.scatter(X_q45[:, 0], X_q45[:, 1], c=labels_q45, cmap='magma', s=50, alpha=0.8)
plt.title('Agglomerative Clustering (Average Linkage) on Synthetic Blobs (4 Clusters)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
plt.show()
print("Agglomerative Clustering visualization (average linkage) displayed above.\n")


# ==============================================================================
# Q46. Load the Wine dataset, apply KMeans, and visualize the cluster
#      assignments in a seaborn pairplot (first 4 features).
# ==============================================================================
print("\n" + "="*80)
print("Q46. K-Means on Wine dataset and pairplot visualization.")
print("="*80)

# 1. Load the Wine dataset
wine_q46 = load_wine()
X_q46 = wine_q46.data
feature_names_q46 = wine_q46.feature_names

# Scale the data first for K-Means
scaler_q46 = StandardScaler()
X_scaled_q46 = scaler_q46.fit_transform(X_q46)

# 2. Apply KMeans (3 clusters, matching the number of true wine classes; the labels themselves are not used)
kmeans_q46 = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_q46 = kmeans_q46.fit_predict(X_scaled_q46)

# 3. Visualize the cluster assignments in a seaborn pairplot (first 4 features)
# Wrap the scaled data in a DataFrame so pairplot can use the feature names
df_q46 = pd.DataFrame(X_scaled_q46, columns=feature_names_q46)
df_q46['Cluster'] = labels_q46

# Select first 4 features plus the 'Cluster' column for pairplot
selected_features_q46 = list(feature_names_q46[:4]) + ['Cluster']
sns.pairplot(df_q46[selected_features_q46], hue='Cluster', palette='viridis', diag_kind='kde')
plt.suptitle('K-Means Cluster Assignments on Wine Dataset (First 4 Features)', y=1.02) # Adjust suptitle position
plt.show()
print("K-Means cluster assignments visualization in a seaborn pairplot displayed above.\n")


# ==============================================================================
# Q47. Generate noisy blobs using make_blobs and use DBSCAN to identify both
#      clusters and noise points. Print the count.
# ==============================================================================
print("\n" + "="*80)
print("Q47. DBSCAN on noisy blobs: cluster and noise count.")
print("="*80)

# 1. Generate noisy blobs using make_blobs (higher cluster_std for more noise/overlap)
X_q47, y_q47 = make_blobs(n_samples=500, centers=3, cluster_std=1.2, random_state=42)

# 2. Use DBSCAN to identify both clusters and noise points
# Parameters need tuning for noisy data
dbscan_q47 = DBSCAN(eps=0.8, min_samples=8) # Example parameters
labels_q47 = dbscan_q47.fit_predict(X_q47)

# 3. Print the count of clusters and noise points
n_clusters_q47 = len(set(labels_q47)) - (1 if -1 in labels_q47 else 0)
n_noise_q47 = list(labels_q47).count(-1)

print(f"Number of clusters identified by DBSCAN: {n_clusters_q47}")
print(f"Number of noise points identified by DBSCAN: {n_noise_q47}\n")

# Optional: Visualize for confirmation
plt.figure(figsize=(10, 7))
unique_labels_q47 = set(labels_q47)
colors_q47 = [plt.cm.turbo(each) for each in np.linspace(0, 1, len(unique_labels_q47))]

for k, col in zip(unique_labels_q47, colors_q47):
    if k == -1: # Noise points are black
        col = [0, 0, 0, 1]
    class_member_mask_q47 = (labels_q47 == k)
    xy_q47 = X_q47[class_member_mask_q47]
    plt.plot(xy_q47[:, 0], xy_q47[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6, alpha=0.7)
plt.title(f'DBSCAN Clustering on Noisy Blobs (Clusters: {n_clusters_q47}, Noise: {n_noise_q47})')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
plt.show()
print("DBSCAN clustering visualization on noisy blobs (optional) displayed above.\n")


# ==============================================================================
# Q48. Load the Digits dataset, reduce dimensions using t-SNE, then apply
#      Agglomerative Clustering and plot the clusters.
# ==============================================================================
print("\n" + "="*80)
print("Q48. Agglomerative Clustering on Digits (t-SNE reduced) and visualization.")
print("="*80)

# 1. Load the Digits dataset
digits_q48 = load_digits()
X_q48 = digits_q48.data
y_q48_true = digits_q48.target # True labels (10 classes)

# 2. Reduce dimensions using t-SNE
print("  Applying t-SNE (this might take a moment)...")
tsne_q48 = TSNE(n_components=2, random_state=42)
X_tsne_q48 = tsne_q48.fit_transform(X_q48)
print("  t-SNE dimensionality reduction complete.")

# 3. Apply Agglomerative Clustering (e.g., 10 clusters for digits)
# Linkage method choice can significantly impact results on t-SNE output.
agg_clust_q48 = AgglomerativeClustering(n_clusters=10, linkage='ward')
labels_q48 = agg_clust_q48.fit_predict(X_tsne_q48)

# 4. Plot the clusters
plt.figure(figsize=(10, 7))
plt.scatter(X_tsne_q48[:, 0], X_tsne_q48[:, 1], c=labels_q48, cmap='tab20', s=30, alpha=0.8)
plt.title('Agglomerative Clustering on Digits Dataset (t-SNE Reduced to 2D)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.grid(True)
plt.show()
print("Agglomerative Clustering visualization on Digits dataset (t-SNE reduced) displayed above.\n")

print("\n" + "="*80)
print("All practical clustering questions have been addressed with runnable code.")
print("="*80)

