In [None]:
#Q 1:   What is the difference between K-Means and Hierarchical Clustering? Provide a use case for each.

    :-K-Means clustering is a partition-based algorithm that divides data into a predefined
      number of clusters (K) by minimizing intra-cluster variance. It is fast and efficient
      for large datasets but requires the number of clusters in advance.
      Hierarchical clustering builds a tree-like structure (dendrogram) of clusters without
      requiring a predefined number of clusters. It can be agglomerative (bottom-up) or divisive
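      (top-down) and is useful for understanding cluster relationships. Agglomerative variants
      typically need all pairwise distances, which gets expensive on large datasets.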
      Use cases:
      K-Means: Customer segmentation in large e-commerce datasets
      Hierarchical: Gene expression analysis or document clustering
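
      To make the bottom-up idea concrete, here is a minimal from-scratch sketch of agglomerative
      clustering with single linkage. The 1-D data, the linkage choice, and the helper name
      single_linkage are assumptions for illustration; this is not how scikit-learn or SciPy implement it.

```python
# Made-up 1-D points: two tight pairs and one far-away point.
points = [1.0, 1.5, 5.0, 5.4, 9.0]

def single_linkage(a, b):
    """Cluster distance = smallest distance between any two members."""
    return min(abs(x - y) for x in a for y in b)

clusters = [[p] for p in points]   # start: every point is its own cluster
merges = []                        # merge history (what a dendrogram draws)

while len(clusters) > 1:
    # Find the pair of clusters that are closest to each other.
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    i, j = min(pairs, key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
    dist = single_linkage(clusters[i], clusters[j])
    merges.append((clusters[i], clusters[j], round(dist, 2)))
    # Replace the two clusters with their union.
    merged = clusters[i] + clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

for left, right, dist in merges:
    print(left, "+", right, "merged at distance", dist)
```

      Unlike K-Means, no K is chosen up front: stopping the merges early (cutting the dendrogram)
      gives any number of clusters from 5 down to 1.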

#Q 2: Explain the purpose of the Silhouette Score in evaluating clustering algorithms. 
 
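:-    The Silhouette Score measures how well a data point fits within
      its assigned cluster compared to the nearest other cluster.
      For each point, s = (b - a) / max(a, b), where a is the mean distance to points in
      its own cluster and b is the mean distance to points in the nearest other cluster.
      Its value ranges from -1 to +1:
      near +1 → the point is well matched to its cluster
      near 0 → the point lies on the boundary between clusters (overlap)
      near -1 → the point is probably assigned to the wrong cluster
      The mean score over all points summarizes clustering quality and helps in selecting the number of clusters.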
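
      As a rough illustration, the sketch below computes per-point silhouette values by hand on
      made-up 1-D data, once with sensible labels and once with one point deliberately mislabeled.
      The data and function names are assumptions; in practice sklearn.metrics.silhouette_score does this.

```python
def mean_dist(p, others):
    return sum(abs(p - q) for q in others) / len(others)

def silhouette_scores(points, labels):
    """Per-point silhouette: (b - a) / max(a, b)."""
    scores = []
    for i, p in enumerate(points):
        own = [q for k, q in enumerate(points) if labels[k] == labels[i] and k != i]
        if not own:                  # singleton cluster: score is defined as 0
            scores.append(0.0)
            continue
        a = mean_dist(p, own)        # cohesion: mean distance to own cluster
        b = min(                     # separation: mean distance to nearest other cluster
            mean_dist(p, [q for k, q in enumerate(points) if labels[k] == other])
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return scores

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]      # two obvious groups
good = silhouette_scores(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_scores(points, [0, 0, 1, 1, 1, 1])   # 3.0 put in the wrong cluster

print("good labels, mean score:", round(sum(good) / len(good), 3))
print("bad labels, mean score: ", round(sum(bad) / len(bad), 3))
print("score of the mislabeled point:", round(bad[2], 3))
```

      The mislabeled point gets a negative score, and the mean score drops with it.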

#Q 3: What are the core parameters of DBSCAN, and how do they influence the clustering process?
 
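:-   The two core parameters of DBSCAN are:
     eps (ε): Maximum distance between two points for them to be considered neighbors
     min_samples: Minimum number of points within eps of a point (in scikit-learn, the point itself
                  included) for it to count as a core point of a dense region
     Influence:
     Small eps → few points reach min_samples, so many points are marked as noise
     Large eps → neighborhoods overlap, so separate clusters may merge
     Higher min_samples → stricter cluster formation (sparser regions become noise)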
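
     The effect of the two parameters can be seen by counting core points directly. This is a
     from-scratch sketch on made-up 2-D points, assuming Euclidean distance and counting the point
     itself as a neighbor; core_points is a hypothetical helper, not a scikit-learn function.

```python
import math

# Made-up points: a tight group (0-3), a looser group (4-6), one isolated point (7).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (2.0, 2.0), (2.4, 2.0), (2.0, 2.4),
          (8.0, 8.0)]

def core_points(points, eps, min_samples):
    """Indices of points with at least min_samples points (itself included) within eps."""
    core = []
    for i, p in enumerate(points):
        neighbors = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighbors >= min_samples:
            core.append(i)
    return core

for eps, min_samples in [(0.2, 3), (0.5, 3), (0.5, 4), (3.0, 3)]:
    print(f"eps={eps}, min_samples={min_samples} -> core points:",
          core_points(points, eps, min_samples))
```

     With a small eps only the tight group is dense enough; raising min_samples removes the looser
     group again; a very large eps makes almost every point a core point, which is how clusters merge.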

#Q 4: Why is feature scaling important when applying clustering algorithms like K-Means and DBSCAN?

:-    Feature scaling is important because clustering algorithms like K-Means and DBSCAN rely on distance calculations.
      If features are on different scales, larger-scale features dominate the distance, so the clusters mostly reflect those features.
      Standardization (zero mean, unit variance per feature) lets all features contribute on a comparable scale.
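
      A small worked example, with made-up customers and a hand-written standardize helper (both
      assumptions for illustration), shows how scaling can change which points count as "close".
      The helper uses the population standard deviation, which is also what StandardScaler uses.

```python
import math
import statistics

# Hypothetical customers: (age in years, annual income in dollars).
customers = [(25, 50_000), (60, 51_000), (26, 58_000), (40, 90_000)]

def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

# Customer 0 vs 1: 35 years apart, $1,000 apart.
# Customer 0 vs 2: 1 year apart, $8,000 apart.
raw_01 = math.dist(customers[0], customers[1])
raw_02 = math.dist(customers[0], customers[2])

scaled = standardize(customers)
scaled_01 = math.dist(scaled[0], scaled[1])
scaled_02 = math.dist(scaled[0], scaled[2])

print("raw:   ", round(raw_01, 1), round(raw_02, 1))
print("scaled:", round(scaled_01, 3), round(scaled_02, 3))
```

      On the raw data, income dominates and customer 0 looks closest to the 60-year-old; after
      standardization the age gap counts too and customer 0 is closest to the 26-year-old.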

#Q 5: What is the Elbow Method in K-Means clustering and how does it help determine the optimal number of clusters?
    
:-     The Elbow Method is a heuristic for choosing the number of clusters (K) by plotting K against
       the Within-Cluster Sum of Squares (WCSS, exposed by scikit-learn's KMeans as inertia_).
       As K increases, WCSS decreases, because more centers always fit the data at least as closely.
       The point where the decrease slows down significantly (forming an “elbow”) is taken as a good choice of K.
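
       The shape of the curve can be sketched without plotting. Below is a from-scratch 1-D
       K-Means (Lloyd's algorithm) on made-up data with three groups; the deterministic
       initialization and the helper name kmeans_1d are assumptions of this sketch, not what
       scikit-learn does.

```python
def kmeans_1d(data, k, iters=50):
    """Plain Lloyd's algorithm on 1-D data; returns the final WCSS."""
    data = sorted(data)
    n = len(data)
    # Deterministic init (an assumption of this sketch): evenly spaced data points.
    centers = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[nearest].append(x)
        # Update step: each center moves to the mean of its points.
        new_centers = [sum(g) / len(g) if g else centers[c] for c, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    # WCSS: squared distance of every point to its nearest center.
    return sum(min((x - c) ** 2 for c in centers) for x in data)

data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0]   # three obvious groups
wcss = {k: kmeans_1d(data, k) for k in range(1, 6)}

for k, value in wcss.items():
    print(k, round(value, 2))
```

       WCSS falls steeply up to K=3 and only marginally afterwards, so the elbow sits at K=3,
       matching the three groups in the data.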

In [1]:
#Q 6: Generate synthetic data using make_blobs(n_samples=300, centers=4), apply KMeans clustering, and visualize the results with cluster centers. (Include your Python code and output in the code box below.)

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

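# n_init is set explicitly so the result does not depend on the library version's default
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
centers = kmeans.cluster_centers_

# Points colored by assigned cluster; centers drawn as large red X markers
plt.scatter(X[:,0], X[:,1], c=labels)
plt.scatter(centers[:,0], centers[:,1], c='red', marker='X', s=200, label='Cluster centers')
plt.legend()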
plt.title("K-Means Clustering with Cluster Centers")
plt.show()

ModuleNotFoundError: No module named 'sklearn'

In [2]:
#Q 7: Load the Wine dataset, apply StandardScaler, and then train a DBSCAN model. Print the number of clusters found (excluding noise).

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)

db = DBSCAN(eps=1.5, min_samples=5)
labels = db.fit_predict(X)

# DBSCAN labels noise points as -1, so that label is excluded from the cluster count
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Number of clusters:", n_clusters)

ModuleNotFoundError: No module named 'sklearn'

In [3]:
#Q 8: Generate moon-shaped synthetic data using make_moons(n_samples=200, noise=0.1), apply DBSCAN, and highlight the outliers in the plot.

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)

db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)

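# DBSCAN labels noise points as -1; plot them separately so they stand out
outliers = labels == -1
plt.scatter(X[~outliers,0], X[~outliers,1], c=labels[~outliers], label="Clustered points")
plt.scatter(X[outliers,0], X[outliers,1], c="red", marker="x", s=80, label="Outliers")
plt.legend()
plt.title("DBSCAN on Moon-Shaped Data (Outliers Highlighted)")
plt.show()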

ModuleNotFoundError: No module named 'sklearn'

In [4]:
#Q 9: Load the Wine dataset, reduce it to 2D using PCA, then apply Agglomerative Clustering and visualize the result in 2D with a scatter plot.

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X_pca)

plt.scatter(X_pca[:,0], X_pca[:,1], c=labels)
plt.title("Agglomerative Clustering after PCA")
plt.show()

ModuleNotFoundError: No module named 'sklearn'

In [None]:
#Q 10: You are working as a data analyst at an e-commerce company. The marketing team wants to segment customers based on their purchasing behavior to run targeted promotions.
#           The dataset contains customer demographics and their product purchase history across categories. Describe your real-world data science workflow using clustering:
#           ● Which clustering algorithm(s) would you use and why? ● How would you preprocess the data (missing values, scaling)? ● How would you determine the number of clusters?
#           ● How would the marketing team benefit from your clustering analysis?

:-    Algorithm: I would start with K-Means for its efficiency and ease of interpretation, and consider DBSCAN
       if I suspect irregularly shaped clusters or many outliers (for example, one-time buyers).
      Preprocessing:
        Missing values: Use median imputation for numerical data (like age) and a "Missing" category for categorical data,
                        then one-hot encode the categorical features so distances can be computed.
        Scaling: Use StandardScaler because purchase amounts and demographics are on very different scales.
      Determining clusters: Use the Elbow Method combined with the Silhouette Score to check that clusters are distinct and meaningful.
      Marketing benefit:
        Personalization: Send high-end luxury offers to the "High Spender" cluster.
        Retention: Send discount codes to the "Churn-risk" cluster (infrequent buyers).
        Efficiency: Save budget by not sending generic ads to groups unlikely to respond.
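
      The missing-value step could look roughly like this. The records and field names are
      hypothetical, and the sketch uses only the standard library; with scikit-learn available,
      SimpleImputer would be the usual tool.

```python
import statistics

# Hypothetical customer records; None marks a missing value.
customers = [
    {"age": 34, "channel": "web"},
    {"age": None, "channel": "app"},
    {"age": 52, "channel": None},
    {"age": 41, "channel": "web"},
]

# Median of the observed ages is used to fill the gaps (robust to extreme values).
known_ages = [c["age"] for c in customers if c["age"] is not None]
median_age = statistics.median(known_ages)

cleaned = [
    {
        "age": c["age"] if c["age"] is not None else median_age,
        # Missing categories become their own explicit level.
        "channel": c["channel"] if c["channel"] is not None else "Missing",
    }
    for c in customers
]

for row in cleaned:
    print(row)
```

      After this step every record is complete, so encoding and scaling can be applied to the whole table.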