# **THEORY QUESTIONS**

Q1. What is unsupervised learning in the context of machine learning?
- Unsupervised learning is a type of machine learning where the model is trained on data that does not have labeled outputs. The goal is to identify patterns, structures, or groupings within the data without any prior knowledge of the categories or classes. Common applications of unsupervised learning include clustering, dimensionality reduction, and anomaly detection. The algorithms analyze the input data to find hidden structures or relationships, allowing for insights that can be used for further analysis or decision-making.

Q2. How does the K-Means clustering algorithm work?
- The K-Means clustering algorithm works through the following steps:

 Initialization: Choose the number of clusters $ k $ and randomly initialize $ k $ centroids in the feature space.

 Assignment Step: Assign each data point to the nearest centroid based on a distance metric (usually Euclidean distance). This forms $ k $ clusters.

 Update Step: Calculate the new centroids by taking the mean of all data points assigned to each cluster.

 Convergence Check: Repeat the assignment and update steps until the centroids no longer change significantly or a maximum number of iterations is reached.

 The algorithm aims to minimize the within-cluster sum of squares (also called inertia), which is the sum of squared distances between data points and their respective centroids.
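
The four steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_lloyd`, its parameters, and the toy data `X_demo` are all assumptions made for this example.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=100, tol=1e-6, seed=0):
    """Plain Lloyd-style K-Means; returns (centroids, labels, inertia)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: mean of each cluster (keep the old centroid if a cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence check: stop once the centroids barely move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment and inertia with respect to the final centroids
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return centroids, labels, inertia

# Hypothetical toy data: two well-separated groups of 50 points each
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
])
centroids, labels, inertia = kmeans_lloyd(X_demo, k=2)
print("Centroids:\n", centroids)
print("Inertia:", inertia)
```

In practice `sklearn.cluster.KMeans` is the better choice; the sketch only makes the assignment and update steps visible.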

Q3. Explain the concept of a dendrogram in hierarchical clustering.
- A dendrogram is a tree-like diagram that visually represents the arrangement of clusters formed through hierarchical clustering. It illustrates how clusters are merged or split at various levels of similarity or distance. The vertical axis typically represents the distance or dissimilarity between clusters, while the horizontal axis represents the individual data points or clusters.

 In a dendrogram, each leaf node corresponds to a data point, and the branches indicate the merging of clusters. The height at which two clusters are joined reflects the distance between them, allowing for an intuitive understanding of the relationships and hierarchy among the clusters.

Q4. What is the main difference between K-Means and Hierarchical Clustering?
- The main differences between K-Means and hierarchical clustering are:

 Cluster Number Specification: K-Means requires the user to specify the number of clusters $ k $ in advance, while hierarchical clustering does not require this; it builds a hierarchy of clusters that can be cut at different levels to obtain various numbers of clusters.

 Algorithm Structure: K-Means is a partitional clustering method that iteratively refines clusters based on centroids, while hierarchical clustering creates a tree structure (dendrogram) that shows how clusters are formed and related.

 Cluster Shape: K-Means tends to form roughly spherical clusters and may struggle with clusters of varying shapes and densities, whereas hierarchical clustering, depending on the linkage criterion, can follow more irregular shapes (single linkage, for example, can trace elongated or chained clusters).

Q5. What are the advantages of DBSCAN over K-Means?
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise) has several advantages over K-Means:

 No Need for Predefined Clusters: DBSCAN does not require the number of clusters to be specified in advance, making it more flexible for discovering clusters of varying shapes and sizes.

 Handling Noise and Outliers: DBSCAN can effectively identify and separate noise and outliers from the clusters, labeling them as noise points, while K-Means can be heavily influenced by outliers.

 Arbitrary Cluster Shapes: DBSCAN can find clusters of arbitrary shapes because it groups points based on density rather than distance to centroids; reliance on centroids is what limits K-Means to roughly convex clusters.

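 Scalability: DBSCAN visits each point once and relies on neighborhood queries instead of iterating to convergence like K-Means, so with a spatial index it can be efficient on low-dimensional data. Its worst-case cost is still quadratic in the number of points, so this advantage depends on the data and the index used.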

Q6. When would you use Silhouette Score in clustering?

- Optimal Number of Clusters: The Silhouette Score helps in selecting the optimal number of clusters by evaluating different values of k and identifying the one that maximizes the score.

 Cluster Quality Assessment: It provides a quantitative measure of how well-separated and compact the clusters are, allowing for a better understanding of clustering performance.

 Visual Insights: Silhouette plots can visually represent the quality of clusters, making it easier to identify potential issues with clustering.

 Unsupervised Evaluation: It is particularly useful in unsupervised learning scenarios where true labels are not available.

Q7. What are the limitations of Hierarchical Clustering?

- Computational Complexity: Hierarchical clustering can be computationally expensive, especially for large datasets, as it requires calculating distances between all pairs of points.

 Scalability Issues: The algorithm does not scale well with increasing data size, making it less practical for very large datasets.

 Sensitivity to Noise and Outliers: Hierarchical clustering can be significantly affected by noise and outliers, which can distort the dendrogram and lead to misleading clusters.

 Dendrogram Interpretation: The interpretation of the dendrogram can be subjective, and determining the optimal number of clusters from it can be challenging.

Q8. Why is feature scaling important in clustering algorithms like K-Means?

- Equal Contribution: Feature scaling ensures that all features contribute equally to the distance calculations, preventing features with larger ranges from dominating the clustering process.

 Improved Convergence: It can lead to faster convergence of the K-Means algorithm, as the centroids will be updated more effectively when features are on a similar scale.

 Better Cluster Formation: Without scaling, the algorithm may produce clusters that do not accurately reflect the underlying data structure, leading to poor clustering results.
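
The effect on distances can be seen in a small example. The toy data below (an income column in dollars and an age column in years) is hypothetical and chosen only to make the scale difference obvious; the standardization is done by hand to mirror what `StandardScaler` computes.

```python
import numpy as np

# Hypothetical toy data: column 0 is income in dollars, column 1 is age in years
X_demo = np.array([
    [30_000.0, 25.0],
    [30_500.0, 60.0],
    [60_000.0, 26.0],
])

def pairwise_dist(A):
    """Euclidean distance between every pair of rows."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

raw = pairwise_dist(X_demo)

# Standardize each column to zero mean and unit variance
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
scaled = pairwise_dist(X_std)

print("raw    d(0,1), d(0,2):", raw[0, 1], raw[0, 2])
print("scaled d(0,1), d(0,2):", scaled[0, 1], scaled[0, 2])
```

Before scaling, the 35-year age gap between rows 0 and 1 is almost invisible next to the income gap between rows 0 and 2. After scaling, both differences contribute on a comparable scale.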

Q9. How does DBSCAN identify noise points?

- Density-Based Approach: DBSCAN identifies noise by examining the density of points around each point. A point is classified as noise when it has fewer than MinPts neighbors within the specified distance (epsilon) and is also not within epsilon of any core point.

 Core Points and Border Points: A core point has at least MinPts neighbors within epsilon. A border point is not a core point itself but lies within epsilon of one, so it joins that core point's cluster. Points that are neither core points nor border points are labeled as noise.

 Robustness to Outliers: This method allows DBSCAN to effectively handle outliers, as they are naturally identified as noise rather than being forced into clusters.
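
A small sketch can make the three point types concrete. The six points in `X_demo` and the parameter values are hypothetical, chosen so that one point of each non-core type appears; `core_sample_indices_` and `labels_` are attributes of a fitted scikit-learn `DBSCAN`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: a tight group, one point on its edge, one far away
X_demo = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense group
    [0.35, 0.1],                                      # near the edge of the group
    [5.0, 5.0],                                       # isolated
])

# In scikit-learn, min_samples counts the point itself
db = DBSCAN(eps=0.3, min_samples=4).fit(X_demo)

is_core = np.zeros(len(X_demo), dtype=bool)
is_core[db.core_sample_indices_] = True
is_noise = db.labels_ == -1
is_border = ~is_core & ~is_noise  # in a cluster, but not dense enough to be core

for point, core, border, noise in zip(X_demo, is_core, is_border, is_noise):
    kind = "core" if core else "border" if border else "noise"
    print(point, kind)
```

The edge point has too few neighbors to be a core point, yet it is within `eps` of a core point, so it is kept in the cluster as a border point. Only the isolated point receives the label `-1`.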

Q10. Define inertia in the context of K-Means.

- Sum of Squared Distances: Inertia refers to the sum of squared distances between each data point and its assigned cluster centroid. It quantifies how tightly the clusters are packed.

 Cluster Compactness: Lower inertia values indicate more compact clusters, while higher values suggest that the clusters are spread out and less well-defined.

 Optimization Goal: The K-Means algorithm aims to minimize inertia during the clustering process, leading to better-defined clusters.
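
As a quick check of the definition, inertia can be recomputed by hand and compared with the value scikit-learn reports. The dataset parameters and the name `manual_inertia` are assumptions for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.8, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Inertia by hand: squared distance from each point to its assigned centroid
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = ((X_demo - assigned_centroids) ** 2).sum()

print("Manual inertia: ", manual_inertia)
print("KMeans.inertia_:", km.inertia_)
```

The two numbers should agree up to floating-point error.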

Q11. What is the elbow method in K-Means clustering?

- The elbow method is a heuristic used to determine the optimal number of clusters in K-Means clustering. It involves running the K-Means algorithm for a range of values of $ k $ (the number of clusters) and calculating the within-cluster sum of squares (WCSS) for each $ k $. The WCSS measures the compactness of the clusters; lower values indicate tighter clusters. The results are plotted on a graph with $ k $ on the x-axis and WCSS on the y-axis. The "elbow" point, where the rate of decrease sharply changes, suggests the optimal number of clusters, as adding more clusters beyond this point yields diminishing returns in terms of WCSS reduction.

Q12. Describe the concept of "density" in DBSCAN?

- In DBSCAN (Density-Based Spatial Clustering of Applications with Noise), "density" refers to the number of data points within a specified radius (epsilon, $ \varepsilon $) around a given point. DBSCAN identifies clusters based on the density of points: a point is considered a core point if it has at least a minimum number of neighboring points (MinPts) within the radius $ \varepsilon $. Points that are within the $ \varepsilon $ radius of a core point are considered part of the same cluster. Points that are not core points and do not fall within the $ \varepsilon $ radius of any core point are classified as noise.

Q13. Can hierarchical clustering be used on categorical data?

- Yes, hierarchical clustering can be used on categorical data, but it requires a different approach than numerical data. Because distance metrics such as Euclidean distance are not meaningful for unordered categories, alternative measures such as Jaccard distance or Hamming distance are used instead. These metrics quantify the dissimilarity between observations described by categorical variables, so the algorithm can group similar observations. The linkage must also be compatible with the chosen metric: Ward's linkage, for example, assumes Euclidean distances.
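
A minimal sketch of this idea, assuming integer-encoded categorical records and SciPy's `pdist`, `linkage`, and `fcluster`. The `records` array and its encoding are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical categorical records: (color, size, shape), integer-encoded
# color: 0=red, 1=blue; size: 0=small, 1=large; shape: 0=round, 1=square
records = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
])

# Hamming distance = fraction of attributes on which two records disagree
dist = pdist(records, metric="hamming")

# Average linkage works with any dissimilarity; Ward would assume Euclidean distances
Z = linkage(dist, method="average")

# Cut the hierarchy into two flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The integer codes are only identifiers here: Hamming distance checks whether two values are equal, so no ordering between categories is implied.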

Q14. What does a negative Silhouette Score indicate?

- A negative Silhouette Score indicates that a data point is likely assigned to the wrong cluster. The Silhouette Score measures how similar a point is to its own cluster compared to other clusters. It ranges from -1 to 1: a score close to 1 indicates that the point is well-clustered, a score of 0 indicates that the point is on or very close to the decision boundary between two neighboring clusters, and a negative score means that the point is, on average, closer to the points of a neighboring cluster than to the points of its own cluster.
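
The sign of the score follows directly from its standard definition. The symbols $ a(i) $ and $ b(i) $ are notation introduced here: $ a(i) $ is the mean distance from point $ i $ to the other points in its own cluster, and $ b(i) $ is the smallest mean distance from point $ i $ to the points of any other cluster.

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
$$

The score is negative exactly when $ a(i) > b(i) $, that is, when the point is on average farther from its own cluster than from the nearest neighboring cluster. For example, $ a(i) = 4 $ and $ b(i) = 3 $ give $ s(i) = (3 - 4) / 4 = -0.25 $.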

Q15. Explain the term "linkage criteria" in hierarchical clustering?

- Linkage criteria in hierarchical clustering refer to the method used to determine the distance between clusters when merging them. Different linkage criteria can lead to different clustering results. Common linkage methods include:

 Single Linkage: The distance between the closest points of two clusters.

 Complete Linkage: The distance between the farthest points of two clusters.

 Average Linkage: The average distance between all pairs of points in two clusters.

 Ward's Linkage: Minimizes the total within-cluster variance by merging clusters that result in the smallest increase in total variance.

 The choice of linkage criteria can significantly affect the shape and size of the resulting clusters.
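
The four criteria can be compared on a tiny example. The two one-dimensional clusters and the helper `ward_increase` are hypothetical, introduced only to show how each criterion turns the same pair of clusters into a single number.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical toy clusters on a line
cluster_a = np.array([[0.0], [1.0]])
cluster_b = np.array([[3.0], [6.0]])

# All pairwise distances between points of A and points of B
d = cdist(cluster_a, cluster_b)

single_link = d.min()     # distance between the closest pair
complete_link = d.max()   # distance between the farthest pair
average_link = d.mean()   # mean distance over all pairs

def ward_increase(a, b):
    """Increase in total within-cluster sum of squares if a and b are merged."""
    na, nb = len(a), len(b)
    gap = a.mean(axis=0) - b.mean(axis=0)
    return na * nb / (na + nb) * float(gap @ gap)

print("single:  ", single_link)
print("complete:", complete_link)
print("average: ", average_link)
print("ward increase:", ward_increase(cluster_a, cluster_b))
```

Single, complete, and average linkage all read from the same pairwise distance matrix, while Ward's criterion depends only on the cluster sizes and the distance between the cluster means.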

Q16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?

- Assumption of Equal Size and Shape: K-Means assumes that clusters are spherical and of similar sizes. When clusters vary significantly in size or density, K-Means may incorrectly merge smaller clusters or split larger ones.

 Sensitivity to Outliers: Outliers can disproportionately affect the position of centroids, leading to poor clustering results, especially in datasets with varying densities.

 Distance Metric Limitations: K-Means uses Euclidean distance, which may not effectively capture the true structure of clusters that are irregularly shaped or have different densities.

Q17. What are the core parameters in DBSCAN, and how do they influence clustering?

- Epsilon (ε): This parameter defines the maximum distance between two points for them to be considered as part of the same neighborhood. A smaller ε may lead to many points being classified as noise, while a larger ε can merge distinct clusters.

 MinPts: This parameter specifies the minimum number of points required to form a dense region (core point). A higher MinPts value can prevent the formation of small clusters, while a lower value may lead to noise being classified as clusters.

 Influence on Clustering: Together, these parameters determine the density of clusters and how well DBSCAN can identify clusters of varying shapes and sizes. Proper tuning is essential for optimal performance.

Q18. How does K-Means++ improve upon standard K-Means initialization?

- Strategic Centroid Selection: K-Means++ chooses the first centroid uniformly at random from the data, then chooses each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen. The initial centroids therefore tend to be spread out, which helps in forming well-separated clusters.

 Reduced Likelihood of Poor Clustering: Spread-out starting centroids make it less likely that K-Means converges to a poor local minimum, so K-Means++ often yields lower within-cluster variance than purely random initialization.

 Faster Convergence: The strategic placement of centroids allows the algorithm to converge more quickly, requiring fewer iterations to reach a stable solution.
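
The seeding step can be sketched as follows. This is a simplified illustration of distance-weighted sampling, not scikit-learn's implementation; the function name `kmeans_pp_init` and the toy data are assumptions.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids by sampling proportionally to squared distance."""
    rng = np.random.default_rng(seed)
    # First centroid: chosen uniformly at random from the data
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min(
            ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        # Next centroid: sampled with probability proportional to d2
        # (assumes the data are not all identical, so d2.sum() > 0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Hypothetical toy data: three tight groups
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(loc=c, scale=0.2, size=(30, 2))
    for c in [(0, 0), (4, 0), (2, 4)]
])
init = kmeans_pp_init(X_demo, k=3)
print(init)
```

Points already close to a chosen centroid have a small `d2` and are unlikely to be picked again, which is what pushes the seeds apart. In scikit-learn this behavior is available through `KMeans(init='k-means++')`, which is the default.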

Q19. What is agglomerative clustering?

- Hierarchical Clustering Method: Agglomerative clustering is a bottom-up approach where each data point starts as its own cluster. The algorithm iteratively merges the closest pairs of clusters based on a distance metric.

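 Linkage Criteria: Common ways to measure the distance between two clusters include single linkage (minimum distance), complete linkage (maximum distance), average linkage (mean distance), and Ward's linkage (smallest increase in within-cluster variance).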

 Dendrogram Representation: The results can be visualized using a dendrogram, which illustrates the merging process and helps in determining the optimal number of clusters.

Q20. What makes Silhouette Score a better metric than just inertia for model evaluation?

- Comparison of Clusters: The Silhouette Score measures how similar a data point is to its own cluster compared to other clusters, providing a more comprehensive evaluation of cluster quality.

 Range of Values: The score ranges from -1 to 1, where higher values indicate better-defined clusters. In contrast, inertia only measures the compactness of clusters without considering their separation.

 Insight into Cluster Structure: Silhouette Score helps identify whether clusters are well-separated and appropriately defined, making it a more informative metric for evaluating clustering performance.

# **PRACTICAL QUESTIONS**

In [None]:
# Q21.  Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a scatter plot
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Step 1: Generate synthetic data with 4 centers
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Step 2: Apply KMeans clustering
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# Step 3: Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering (4 Centers)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Q22.  Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

# Load the Iris dataset
iris = load_iris()
X = iris.data

# Apply Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X)

# Display the first 10 predicted labels
print("First 10 predicted labels:", labels[:10])


In [None]:
# Q23. Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Generate synthetic data
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)

# Plot DBSCAN results
plt.figure(figsize=(8, 6))
# Core and border points
plt.scatter(X[labels != -1][:, 0], X[labels != -1][:, 1], c=labels[labels != -1], cmap='viridis', s=50)
# Outliers
plt.scatter(X[labels == -1][:, 0], X[labels == -1][:, 1], c='red', s=50, label='Outliers')
plt.title("DBSCAN Clustering with make_moons Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Q24. Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np

# Load the Wine dataset
wine = load_wine()
X = wine.data

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Print the size of each cluster
unique, counts = np.unique(labels, return_counts=True)
cluster_sizes = dict(zip(unique, counts))
print("Cluster sizes:", cluster_sizes)


In [None]:
# Q25. Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN

# Generate synthetic data
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=42)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)

# Plot the clustering result
plt.figure(figsize=(8, 6))
# Core and border points
plt.scatter(X[labels != -1][:, 0], X[labels != -1][:, 1], c=labels[labels != -1], cmap='viridis', s=50)
# Outliers
plt.scatter(X[labels == -1][:, 0], X[labels == -1][:, 1], c='red', s=50, label='Outliers')
plt.title("DBSCAN Clustering on make_circles Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Q26. Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster centroids
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data

# Apply MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means clustering with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X_scaled)

# Output the cluster centroids
print("Cluster centroids:\n", kmeans.cluster_centers_)


In [None]:
# Q27. Generate synthetic data using make_blobs with varying cluster standard deviations and cluster with DBSCAN
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

# Generate synthetic data with varying standard deviations
X, _ = make_blobs(n_samples=500,
                  centers=[[0, 0], [3, 3], [0, 4]],
                  cluster_std=[0.2, 0.5, 1.0],
                  random_state=42)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X)

# Plot the result
plt.figure(figsize=(8, 6))
plt.scatter(X[labels != -1][:, 0], X[labels != -1][:, 1], c=labels[labels != -1], cmap='viridis', s=50)
plt.scatter(X[labels == -1][:, 0], X[labels == -1][:, 1], c='red', s=50, label='Outliers')
plt.title("DBSCAN Clustering on make_blobs Data with Varying std")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Q28.  Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Load the Digits dataset
digits = load_digits()
X = digits.data

# Reduce dimensions to 2D using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(X_pca)

# Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='tab10', s=50)
plt.title("K-Means Clustering on Digits Data (2D PCA)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.grid(True)
plt.show()


In [None]:
# Q29. Create synthetic data using make_blobs and evaluate silhouette scores for k = 2 to 5. Display as a bar chart
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate synthetic data
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.60, random_state=42)

# Evaluate silhouette scores for k = 2 to 5
k_values = range(2, 6)
scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    scores.append(score)

# Display as a bar chart
plt.figure(figsize=(8, 6))
plt.bar(k_values, scores, color='skyblue')
plt.title("Silhouette Scores for Different k in K-Means")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score")
plt.xticks(k_values)
plt.grid(axis='y')
plt.show()


In [None]:
# Q30. Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import dendrogram, linkage

# Load the Iris dataset
iris = load_iris()
X = iris.data

# Perform hierarchical clustering with average linkage
linked = linkage(X, method='average')

# Plot the dendrogram
plt.figure(figsize=(10, 6))
dendrogram(linked, labels=iris.target, leaf_rotation=90, leaf_font_size=10)
plt.title("Hierarchical Clustering Dendrogram (Average Linkage)")
plt.xlabel("Sample (leaf labeled with its true class)")
plt.ylabel("Distance")
plt.grid(True)
plt.show()


In [None]:
# Q31. Generate synthetic data with overlapping clusters using make_blobs, then apply K-Means and visualize with decision boundaries
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import numpy as np

# Generate synthetic data with overlapping clusters
X, y = make_blobs(n_samples=500, centers=3, cluster_std=1.5, random_state=42)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
labels = kmeans.predict(X)
centers = kmeans.cluster_centers_

# Create a meshgrid for decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),
                     np.linspace(y_min, y_max, 500))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot the decision boundaries and clusters
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis', edgecolor='k')
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, marker='X', label='Centroids')
plt.title("K-Means Clustering with Decision Boundaries")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Q32. Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

# Load the Digits dataset
digits = load_digits()
X = digits.data

# Reduce dimensions to 2D using t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30, init='pca')
X_tsne = tsne.fit_transform(X)

# Apply DBSCAN
dbscan = DBSCAN(eps=5, min_samples=5)
labels = dbscan.fit_predict(X_tsne)

# Plot the clustering result
plt.figure(figsize=(8, 6))
plt.scatter(X_tsne[labels != -1, 0], X_tsne[labels != -1, 1], c=labels[labels != -1], cmap='tab10', s=50)
plt.scatter(X_tsne[labels == -1, 0], X_tsne[labels == -1, 1], c='red', s=30, label='Outliers')
plt.title("DBSCAN Clustering on Digits Dataset (t-SNE Reduced)")
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Q33. Generate synthetic data using make_blobs and apply Agglomerative Clustering with complete linkage. Plot the result
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Apply Agglomerative Clustering with complete linkage
agg_clust = AgglomerativeClustering(n_clusters=4, linkage='complete')
labels = agg_clust.fit_predict(X)

# Plot the clustering result
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title("Agglomerative Clustering with Complete Linkage")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()


In [None]:
# Q34. Load the Breast Cancer dataset and compare inertia values for K = 2 to 6 using K-Means. Show results in a line plot
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Compare inertia values for K = 2 to 6
inertia_values = []
k_range = range(2, 7)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia_values.append(kmeans.inertia_)

# Plot the inertia values
plt.figure(figsize=(8, 6))
plt.plot(k_range, inertia_values, marker='o', linestyle='-')
plt.title("K-Means Inertia for K = 2 to 6")
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Inertia")
plt.grid(True)
plt.show()


In [None]:
# Q35.  Generate synthetic concentric circles using make_circles and cluster using Agglomerative Clustering with single linkage
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.cluster import AgglomerativeClustering

# Generate synthetic concentric circles
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=42)

# Apply Agglomerative Clustering with single linkage
agg_clust = AgglomerativeClustering(n_clusters=2, linkage='single')
labels = agg_clust.fit_predict(X)

# Plot the clustering result
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title("Agglomerative Clustering (Single Linkage) on Concentric Circles")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()


In [None]:
# Q36. Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise)
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import numpy as np

# Load the Wine dataset
data = load_wine()
X = data.data

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply DBSCAN
dbscan = DBSCAN(eps=1.2, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# Count the number of clusters (excluding noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Number of clusters (excluding noise):", n_clusters)


In [None]:
# Q37. Generate synthetic data with make_blobs and apply KMeans. Then plot the cluster centers on top of the data points
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Step 1: Generate synthetic data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Step 2: Apply KMeans clustering
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# Step 3: Plot the clusters and the cluster centers
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')  # Clustered points

# Plot cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X', label='Centroids')

plt.title("KMeans Clustering with Cluster Centers")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Q38. Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise
from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np

# Step 1: Load the Iris dataset
iris = load_iris()
X = iris.data

# Step 2: Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# Step 4: Count and print number of noise points (label = -1)
n_noise = np.sum(labels == -1)
print(f"Number of noise samples: {n_noise}")


In [None]:
# Q39.  Generate synthetic non-linearly separable data using make_moons, apply K-Means, and visualize the clustering result
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

# Step 1: Generate synthetic moon-shaped data
X, y = make_moons(n_samples=300, noise=0.1, random_state=42)

# Step 2: Apply K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(X)

# Step 3: Plot the clustering result
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', s=200, marker='X', label='Centroids')
plt.title("K-Means Clustering on make_moons Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Q40. Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D scatter plot
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
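from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the '3d' projection on older Matplotlib)

# Load the Digits dataset
digits = load_digits()
X = digits.data

# Reduce dimensions to 3 principal components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

# Apply K-Means clustering (one cluster per digit class)
kmeans = KMeans(n_clusters=10, random_state=0)
labels = kmeans.fit_predict(X_pca)

# Visualize the clusters in a 3D scatter plot
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')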
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=labels, cmap='tab10', s=50)
ax.set_title('Digits Dataset: PCA (3D) + KMeans Clustering')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()


In [None]:
# Q41.  Generate synthetic blobs with 5 centers and apply KMeans. Then use silhouette_score to evaluate the clustering
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Generate synthetic data with 5 centers
X, y_true = make_blobs(n_samples=500, centers=5, cluster_std=0.60, random_state=42)

# Apply KMeans
kmeans = KMeans(n_clusters=5, random_state=42)
labels = kmeans.fit_predict(X)

# Evaluate using silhouette score
score = silhouette_score(X, labels)
print(f"Silhouette Score: {score:.4f}")

# Plot the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10', s=30)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            color='black', marker='X', s=200, label='Centers')
plt.title('KMeans Clustering on Synthetic Blobs')
plt.legend()
plt.show()


In [None]:
# Q42. Load the Breast Cancer dataset, reduce dimensionality using PCA, and apply Agglomerative Clustering. Visualize in 2D
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Load Breast Cancer dataset
data = load_breast_cancer()
X = data.data

# Reduce dimensionality using PCA to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Apply Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=2)
labels = agg.fit_predict(X_pca)

# Visualize the clustering result
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='Set1', s=30)
plt.title('Agglomerative Clustering on Breast Cancer Dataset (PCA-reduced)')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.show()
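
PCA is sensitive to feature scale, which matters for this dataset because its features span very different numeric ranges. As a rough check, the sketch below compares the explained variance ratio of the first two components with and without standardization. Using `StandardScaler` here is one common choice, assumed for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data

# PCA on the raw features: large-valued features dominate the variance
pca_raw = PCA(n_components=2).fit(X)
raw_ratio = pca_raw.explained_variance_ratio_

# PCA on standardized features: every feature contributes unit variance
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=2).fit(X_scaled)
scaled_ratio = pca_scaled.explained_variance_ratio_

print(f"Raw features:    PC1={raw_ratio[0]:.3f}, PC2={raw_ratio[1]:.3f}")
print(f"Scaled features: PC1={scaled_ratio[0]:.3f}, PC2={scaled_ratio[1]:.3f}")
```

If the first raw component captures nearly all of the variance, the 2D projection mostly reflects the few largest-scale features rather than the dataset as a whole.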


In [None]:
# Q43. Generate noisy circular data using make_circles and visualize clustering results from KMeans and DBSCAN side-by-side
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN
import matplotlib.pyplot as plt

# Generate noisy circular data
X, _ = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=42)

# Apply KMeans (n_init set explicitly for consistent results across scikit-learn versions)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels_kmeans = kmeans.fit_predict(X)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels_dbscan = dbscan.fit_predict(X)

# Plot side-by-side
plt.figure(figsize=(12, 5))

# KMeans
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=labels_kmeans, cmap='tab10', s=30)
plt.title('KMeans Clustering')

# DBSCAN
plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=labels_dbscan, cmap='tab10', s=30)
plt.title('DBSCAN Clustering')

plt.suptitle('Clustering Comparison on Noisy Circles')
plt.show()
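
The visual comparison can be backed by a number. Since `make_circles` also returns the true ring membership, the sketch below scores both clusterings with the adjusted Rand index (ARI), where 1.0 means perfect agreement and values near 0 mean chance-level agreement. Using ARI here is an added illustration, not part of the original exercise.

```python
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

# Same data as in the Q43 cell, but keeping the true ring labels this time
X, y_true = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=42)

labels_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
labels_dbscan = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# ARI compares two labelings regardless of how the cluster ids are numbered.
# DBSCAN noise points (label -1) are treated as one extra group here.
ari_kmeans = adjusted_rand_score(y_true, labels_kmeans)
ari_dbscan = adjusted_rand_score(y_true, labels_dbscan)

print(f"KMeans ARI: {ari_kmeans:.3f}")
print(f"DBSCAN ARI: {ari_dbscan:.3f}")
```

KMeans separates points with straight boundaries between centroids, so it cannot isolate one ring inside another, while a density-based method can follow each ring.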


In [None]:
# Q44. Load the Iris dataset and plot the Silhouette Coefficient for each sample after KMeans clustering
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import numpy as np

# Load Iris dataset
iris = load_iris()
X = iris.data

# Apply KMeans clustering (n_init set explicitly for consistent results across scikit-learn versions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Compute silhouette scores for each sample
silhouette_vals = silhouette_samples(X, labels)

# Plot silhouette scores
plt.figure(figsize=(8, 5))
y_ticks = []
y_lower = 0

for i in range(3):
    cluster_silhouette_vals = silhouette_vals[labels == i]
    cluster_silhouette_vals.sort()
    y_upper = y_lower + len(cluster_silhouette_vals)
    plt.barh(range(y_lower, y_upper), cluster_silhouette_vals, edgecolor='none')
    y_ticks.append((y_lower + y_upper) / 2)
    y_lower = y_upper

# Mark the mean silhouette score over all samples
plt.axvline(silhouette_score(X, labels), color='red', linestyle='--', label='Mean score')
plt.legend()
plt.yticks(y_ticks, [f'Cluster {i}' for i in range(3)])
plt.xlabel('Silhouette Coefficient')
plt.title('Silhouette Plot for KMeans Clustering (Iris Dataset)')
plt.show()
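
To make the plotted quantity concrete, the sketch below recomputes the silhouette coefficient of a single sample by hand as s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. The helper `manual_silhouette` is a hypothetical name introduced for illustration, and Euclidean distance is assumed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

def manual_silhouette(X, labels, idx):
    """Silhouette coefficient of sample idx (illustrative helper, Euclidean distance)."""
    dists = np.linalg.norm(X - X[idx], axis=1)
    own = labels[idx]
    same = labels == own
    n_same = same.sum()
    if n_same == 1:
        return 0.0  # convention for a cluster containing a single sample
    # a: mean distance to the other members of the same cluster
    # (the sample's distance to itself is 0, so divide by n_same - 1)
    a = dists[same].sum() / (n_same - 1)
    # b: smallest mean distance to the members of any other cluster
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != own)
    return (b - a) / max(a, b)

manual = manual_silhouette(X, labels, 0)
library = silhouette_samples(X, labels)[0]
print(f"Manual:  {manual:.6f}")
print(f"Library: {library:.6f}")
```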


In [None]:
# Q45. Generate synthetic data using make_blobs and apply Agglomerative Clustering with 'average' linkage. Visualize clusters
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Apply Agglomerative Clustering with 'average' linkage
agg = AgglomerativeClustering(n_clusters=4, linkage='average')
labels = agg.fit_predict(X)

# Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10', s=40)
plt.title("Agglomerative Clustering with 'average' Linkage")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()


In [None]:
# Q46. Load the Wine dataset, apply KMeans, and visualize the cluster assignments in a seaborn pairplot (first 4 features)
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.cluster import KMeans

# Load Wine dataset
wine = load_wine()
X = wine.data[:, :4]  # Use first 4 features
df = pd.DataFrame(X, columns=wine.feature_names[:4])

# Apply KMeans clustering (n_init set explicitly for consistent results across scikit-learn versions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
df['Cluster'] = labels

# Visualize using seaborn pairplot, colored by cluster assignment
sns.pairplot(df, hue='Cluster', palette='tab10', diag_kind='hist')
plt.show()


In [None]:
# Q47. Generate noisy blobs using make_blobs and use DBSCAN to identify both clusters and noise points. Print the count
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import numpy as np

# Generate noisy blobs
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=42)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.9, min_samples=5)
labels = dbscan.fit_predict(X)

# Count clusters and noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = np.sum(labels == -1)

print(f"Number of clusters: {n_clusters}")
print(f"Number of noise points: {n_noise}")
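
The cluster and noise counts depend strongly on `eps`. One common heuristic for choosing it is to look at the sorted distance from each point to its k-th nearest neighbor, with k tied to `min_samples`. The sketch below computes those k-distances for the same data and reports a few percentiles. The percentile cut-offs are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Same data and min_samples as in the Q47 cell
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=42)
min_samples = 5
eps = 0.9

# Querying the training data returns each point as its own first neighbor
# (distance 0), which matches DBSCAN counting the point itself in min_samples.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Distance to the min_samples-th neighbor (self included), sorted ascending
k_distances = np.sort(distances[:, -1])

for q in (50, 90, 95, 99):
    print(f"{q}th percentile of k-distance: {np.percentile(k_distances, q):.3f}")

# Points whose k-distance is at most eps satisfy DBSCAN's core-point condition
core_fraction = np.mean(k_distances <= eps)
print(f"Fraction of core points at eps={eps}: {core_fraction:.3f}")
```

Plotting `k_distances` and picking `eps` near the point where the curve bends sharply upward is the usual way to read this heuristic.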


In [None]:
# Q48. Load the Digits dataset, reduce dimensions using t-SNE, then apply Agglomerative Clustering and plot the clusters
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Load Digits dataset
digits = load_digits()
X = digits.data

# Reduce dimensions using t-SNE
# (n_iter was renamed to max_iter in scikit-learn 1.5; the default of 1000 iterations is used here)
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X)

# Apply Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=10)
labels = agg.fit_predict(X_tsne)

# Plot the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='tab10', s=30)
plt.title('Agglomerative Clustering on Digits Dataset (t-SNE Reduced)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.show()
