Ques 1:

In [None]:
# Q1
import pandas as pd
data_path = '/content/USA_Housing.csv'
data = pd.read_csv(data_path)
data.head()

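# Drop the last column (assumed to be non-numeric, e.g. an address) and copy,
# so that adding a 'Cluster' column later does not write into a slice of `data`.
data_omit_last = data.iloc[:, :-1].copy()
data_omit_last.head()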
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_omit_last)

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
k = 3
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(data_scaled)
labels = kmeans.labels_

inertia = []
k_range = range(1, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(data_scaled)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(8, 4))
plt.plot(k_range, inertia, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()

k_optimal = 3
kmeans = KMeans(n_clusters=k_optimal, random_state=42)
kmeans.fit(data_scaled)

final_labels = kmeans.labels_
data_omit_last['Cluster'] = final_labels
data_omit_last.head()

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data_scaled)

plt.figure(figsize=(8, 6))
plt.scatter(data_pca[:, 0], data_pca[:, 1], c=final_labels, cmap='viridis', marker='o', edgecolor='k')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('K-Means Clusters (PCA-Reduced)')
plt.colorbar(label='Cluster')
plt.show()

Ques 2:

In [None]:
# Q2
!pip install scikit-learn-extra
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids
import matplotlib.pyplot as plt

data_path = '/content/USA_Housing.csv'
data = pd.read_csv(data_path)
data_omit_last = data.iloc[:, :-1].copy()  # copy so 'Cluster' can be added safely later

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_omit_last)
# Note: for KMedoids, inertia_ is the sum of distances (not squared distances)
# from each sample to its closest medoid, so it is not an SSE.
inertia = []
k_range = range(1, 11)

for k in k_range:
    kmedoids = KMedoids(n_clusters=k, random_state=42, method='pam')
    kmedoids.fit(data_scaled)
    inertia.append(kmedoids.inertia_)

plt.figure(figsize=(8, 4))
plt.plot(k_range, inertia, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (sum of distances to medoids)')
plt.title('Elbow Method for Optimal k (K-Medoids)')
plt.show()

k_optimal = 3
kmedoids = KMedoids(n_clusters=k_optimal, random_state=42, method='pam')
kmedoids.fit(data_scaled)

labels = kmedoids.labels_
data_omit_last['Cluster'] = labels
silhouette_avg = silhouette_score(data_scaled, labels)
print(f'Silhouette Score for k={k_optimal}: {silhouette_avg:.2f}')

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data_scaled)

plt.figure(figsize=(8, 6))
plt.scatter(data_pca[:, 0], data_pca[:, 1], c=labels, cmap='viridis', marker='o', edgecolor='k')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('K-Medoids Clusters (PCA-Reduced)')
plt.colorbar(label='Cluster')
plt.show()

$$
S(i) = \frac{b(i) - a(i)}{\max\bigl(a(i),\, b(i)\bigr)}
$$


The **Silhouette Score** evaluates the quality of the clusters produced by an algorithm such as **K-Means** or **K-Medoids**: how cohesive each cluster is and how well separated it is from the others. It is not a replacement for the **Elbow Method** but a complement to it; used **together**, the two give a more robust basis for choosing the number of clusters (`k`).

Here’s why using the **Silhouette Score** alongside the **Elbow Method** is a good idea:

### 1. **Elbow Method (Inertia)**
The **Elbow Method** helps identify the **range** of values for `k` (the number of clusters) where the inertia (within-cluster sum of squared distances) starts to level off.
- **Pros**: It’s a simple and widely used heuristic that helps you visually find the point where increasing `k` no longer results in significant improvements in clustering.
- **Limitations**: It only measures how tightly the points are packed within clusters but **doesn't account for how well-separated the clusters are**. This means that clusters could be tight but overlap with each other, making them poorly defined.
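
As a rough illustration of that limitation, the sketch below fits K-Means on synthetic data and records the inertia for several values of `k`. The data is an assumption made for the example (three Gaussian blobs from `make_blobs`, not the housing data). Inertia keeps falling past the three groups actually present, so the curve alone says nothing about whether the extra clusters are well separated.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: three well-separated blobs (illustrative only).
X, _ = make_blobs(n_samples=300, centers=[[-8, -8], [0, 8], [8, -8]],
                  cluster_std=1.0, random_state=42)

# Record the inertia (within-cluster sum of squared distances) for each k.
inertia_by_k = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertia_by_k[k] = km.inertia_

for k, value in inertia_by_k.items():
    print(f"k={k}: inertia={value:.1f}")
```

The printed values should drop sharply up to `k=3` and then continue to shrink slowly, which is the "elbow" shape described above.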

### 2. **Silhouette Score**
The **Silhouette Score** measures the **quality** of clustering by considering both the cohesion (how close the points within the same cluster are) and separation (how far apart the clusters are). It helps you understand how well-defined and distinct your clusters are.
- **Pros**: The Silhouette Score is a **direct measure of clustering quality**, and it takes both cohesion and separation into account. This gives you more information about the cluster structure, and it’s not just about minimizing inertia.
- **Limitations**: It implicitly favors compact, convex clusters. When the true groups are non-convex (e.g., elongated or ring-shaped) or overlap heavily, the score can be low or misleading even for a sensible clustering.
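
To make the per-point formula concrete, here is a minimal sketch that computes it by hand and compares the result with scikit-learn's `silhouette_samples`. It takes `a(i)` as the mean distance from point `i` to the other points in its own cluster and `b(i)` as the smallest mean distance from `i` to the points of any other cluster. The six-point dataset and the helper name `silhouette_of` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Tiny hand-made dataset: two obvious groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])


def silhouette_of(i, X, labels):
    """Silhouette value of sample i, using Euclidean distances."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the point itself from its own cluster
    if not same.any():
        return 0.0  # convention for a cluster containing a single point
    a = dists[same].mean()  # cohesion: mean distance within own cluster
    # separation: mean distance to the nearest other cluster
    b = min(dists[labels == c].mean()
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)


manual = np.array([silhouette_of(i, X, labels) for i in range(len(X))])
reference = silhouette_samples(X, labels)
print(np.allclose(manual, reference))
```

Averaging these per-point values gives the single number returned by `silhouette_score`.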

### Using **Both Methods Together**
By using the **Elbow Method** and **Silhouette Score** together, you get a more robust understanding of how to choose the optimal number of clusters:

1. **Step 1: Use the Elbow Method** to identify a **range** of possible values for `k`. This gives you an idea of where inertia starts to level off.
   - You can **look for the "elbow"** in the inertia plot, which is often the point where the rate of decrease in inertia slows down.
   
2. **Step 2: Use the Silhouette Score** to **refine your choice of `k`** within that range.
   - The silhouette score helps you confirm whether the clusters are well-defined and distinct at each candidate `k`. You may find that even if the inertia starts to level off at a certain `k`, the silhouette score suggests that a different `k` gives the best clustering quality.

### **Why Use Both Methods?**
- **Elbow Method** tells you where the inertia (error) stabilizes, but **Silhouette Score** tells you if those clusters are well-separated and meaningful.
- In some cases, the elbow method might suggest a larger `k`, but the silhouette score could indicate that fewer clusters give a more optimal separation. For example:
  - You might find that `k=6` gives a sharp elbow in the inertia plot, but the silhouette score is highest for `k=4`. In this case, `k=4` is likely the better choice.

### Example Workflow:

#### 1. **Elbow Method**:
   - First, plot the inertia for different values of `k` (1 to 10, for example).
   - Look for the **elbow** where the rate of inertia reduction slows down. This is typically a good starting point for selecting `k`.

#### 2. **Silhouette Score**:
   - After identifying the possible range of `k` from the elbow plot, compute the silhouette score for each candidate `k` (for example 2 to 10; the score is undefined for `k=1` because it needs at least two clusters).
   - Choose the `k` with the **highest silhouette score**, which gives you the **most well-defined clusters** in terms of both **cohesion** and **separation**.
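
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (four blobs placed far apart with `make_blobs`, so the expected answer is known in advance), and on real data such as the housing set the features would be scaled first, as in the notebook cells.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: four blobs placed far apart (illustrative only).
X, _ = make_blobs(n_samples=400,
                  centers=[[-10, -10], [-10, 10], [10, -10], [10, 10]],
                  cluster_std=1.0, random_state=42)

inertia = {}      # step 1: values for the elbow plot
silhouette = {}   # step 2: cluster quality for each candidate k
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertia[k] = km.inertia_
    if k >= 2:  # the silhouette score needs at least two clusters
        silhouette[k] = silhouette_score(X, km.labels_)

# Pick the k with the highest average silhouette score.
best_k = max(silhouette, key=silhouette.get)
print(f"best k by silhouette: {best_k}")
```

Plotting `inertia` against `k` gives the elbow curve, while `silhouette` can be plotted the same way to check that both methods point to a similar `k`.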

### Example Scenario:
Let’s say you run both methods on a dataset, and the results are as follows:

- **Elbow Method**: The elbow appears at `k=4`, suggesting that increasing `k` beyond 4 doesn’t significantly reduce inertia.
- **Silhouette Scores**: The silhouette score is `0.72` for `k=4` and drops to `0.68` for `k=5`.

Here the two methods agree: the elbow points to `k=4` on the basis of inertia, and the silhouette score confirms that `k=4` also gives better-defined clusters than `k=5`.

### When to Rely on One Over the Other?
- **If the Elbow Method is unclear** (e.g., if there's no clear "elbow" or if the inertia decreases very gradually), the **Silhouette Score** can help you choose a more meaningful `k`.
- **If the Elbow Method suggests a range of possible values for `k`**, the **Silhouette Score** can help you **narrow it down** to the best `k` that produces well-defined and separated clusters.

### Conclusion:
- **The Elbow Method** and **Silhouette Score** serve complementary purposes: one helps find a range of possible `k` values, while the other helps assess the quality of the clustering.
- **Use both methods together** to make a more informed decision when selecting the optimal number of clusters.

