# Unsupervised Learning with Hierarchical Clustering  
### What will we learn?
* The intuition behind hierarchical (agglomerative) clustering and the Ward linkage.  
* How to read a dendrogram and decide on an appropriate number of clusters.  
* Practical data‑preprocessing steps (selecting numeric columns, scaling).  
* Dimensionality reduction with PCA for 2‑D exploratory plots.  
* Communicating cluster insights on a world map.

### Part 1: Loading the Country‑Level Indicator Data  
We will use the *“Country‑data.csv”* file from Kaggle  
(<https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data>).  
Each row represents a country and each column an economic or health indicator (e.g., GDP per capita, child mortality).  

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("rohan0301/unsupervised-learning-on-country-data")

print("Path to dataset files:", path)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

# Assuming your CSV file is named 'Country-data.csv'
# and is located inside the downloaded directory
file_path = os.path.join(path, 'Country-data.csv')

# Now read the CSV file using the file_path
data = pd.read_csv(file_path)
data.head() # Print the first few rows of the DataFrame

### Part 2: Feature Selection & Quick EDA
* We keep only the numeric indicator columns as features, i.e. every column except the country name.
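
A minimal sketch of this selection step, assuming the name column is called `country` (the mapping code later in the notebook uses that name). The helper `select_numeric_features`, the column names `indicator_a` / `indicator_b`, and the toy values are hypothetical and only there to make the example runnable; in the notebook you would pass `data` instead.

```python
import pandas as pd

def select_numeric_features(df: pd.DataFrame, id_col: str = "country") -> pd.DataFrame:
    """Return the numeric columns of df, excluding the identifier column."""
    features = df.drop(columns=[id_col], errors="ignore")
    return features.select_dtypes(include="number")

# Toy stand-in for the real dataset (names and values are made up)
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "indicator_a": [1.5, 2.5, 3.5],
    "indicator_b": [100, 200, 300],
})

X = select_numeric_features(toy)
print(X.columns.tolist())
print(X.isna().sum().sum())  # quick sanity check: total number of missing values
```

Using `select_dtypes` rather than a hard-coded column list means any extra text column would also be excluded, which is usually what a distance-based method needs.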

In [None]:

# Optional sanity‑check visual


### Part 3: Standardize by Scaling
- Hierarchical clustering with Ward linkage relies on Euclidean distance, so indicators with large numeric ranges (e.g. GDP per capita) would dominate those with small ranges (e.g. fertility) if left unscaled.  
- We standardize every feature to zero mean and unit variance using `StandardScaler`.  
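
The effect of scaling can be sketched as follows. The two synthetic features are assumptions chosen only to mimic one large-range and one small-range indicator; in the notebook, the numeric feature matrix from Part 2 would take the place of `X`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales (illustrative only)
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(20_000, 8_000, size=50),  # large-range feature
    rng.normal(3.0, 1.0, size=50),       # small-range feature
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X.std(axis=0))          # raw spreads differ by orders of magnitude
print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, a one-unit difference means "one standard deviation" for every feature, so no single indicator controls the Euclidean distance.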


### Part 5: Building the Hierarchical Tree
* **Ward linkage** merges, at each step, the pair of clusters whose union yields the *smallest* increase in total within‑cluster variance.  
* The dendrogram gives us two insights:  
  1. Similarity structure (who merges early).  
  2. Reasonable cut heights (horizontal line) for k clusters.  

We truncate the dendrogram to the last 30 merges to keep the plot readable.  
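
One possible way to build and plot the tree uses SciPy's `linkage` and `dendrogram`. This is a sketch under assumptions: the random matrix stands in for the standardized features, and `truncate_mode="lastp"` with `p=30` is one way to keep only the top of the tree (it draws 30 leaves, each a sample or a collapsed sub-cluster).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Synthetic stand-in for the standardized feature matrix (illustrative only)
rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(100, 4))

# Ward linkage on Euclidean distances; Z has one row per merge
Z = linkage(X_scaled, method="ward")

fig, ax = plt.subplots(figsize=(10, 5))
# Keep only the top of the tree: 30 leaves, lower merges are collapsed
tree = dendrogram(Z, truncate_mode="lastp", p=30, ax=ax)
ax.set_xlabel("Cluster size (in parentheses) or sample index")
ax.set_ylabel("Ward merge distance")
ax.set_title("Truncated dendrogram (Ward linkage)")
plt.show()
```

A horizontal line drawn across a tall vertical gap in this plot crosses as many branches as there are clusters at that cut height.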

In [None]:

# Standardize the numeric features (centering and scaling), then build the Ward linkage and plot the dendrogram


### Part 6: Choosing k & Assigning Clusters
After visually inspecting the dendrogram, we select **k=3** (feel free to experiment).  
`AgglomerativeClustering` with the same Ward linkage produces integer labels that we append to the dataframe as a `Cluster` column.  
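
A minimal sketch of the assignment step, assuming the names `X_scaled`, `cluster_labels`, and `data` used elsewhere in this notebook. The three synthetic groups and the placeholder country names are made up so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Three well-separated synthetic groups (illustrative stand-in for X_scaled)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_scaled = np.vstack([c + rng.normal(size=(20, 2)) for c in centers])
data = pd.DataFrame({"country": [f"country_{i}" for i in range(len(X_scaled))]})

k = 3
model = AgglomerativeClustering(n_clusters=k, linkage="ward")
cluster_labels = model.fit_predict(X_scaled)  # integer labels 0..k-1

# Append the labels so later cells can colour plots and maps by cluster
data["Cluster"] = cluster_labels
print(data["Cluster"].value_counts().sort_index())
```

The label numbers themselves are arbitrary; only the grouping carries meaning, so cluster 0 in one run need not match cluster 0 after changing `k`.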


In [None]:

# Select k and assign cluster labels with fit_predict()


### Part 7: Low‑Dimensional Insight with PCA
**Note:** PCA is *only* for display; it was **not** used to fit the clusters.  
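
Because a 2‑D projection discards part of the variance, clusters that overlap in the plot may still be separated in the full feature space. A quick check of how much variance the two components retain can be sketched like this; the random matrix is an assumed stand-in for the standardized features.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the standardized feature matrix (illustrative only)
rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(100, 5))

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Fraction of total variance captured by each of the two components
retained = pca.explained_variance_ratio_.sum()
print(f"Variance retained by 2 components: {retained:.1%}")
```

A low retained fraction is a hint to read the scatter plot with caution rather than a sign that the clustering itself is poor.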

In [None]:
# Visualize the clustering results using PCA
from sklearn.decomposition import PCA

# Reduce the dimensions for visualization (2D scatter plot)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)  # X_scaled: the standardized feature matrix

plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='viridis', s=60, edgecolor='k', alpha=0.7)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Agglomerative Clustering on Country Data (PCA projection)')
plt.legend(*scatter.legend_elements(), title="Clusters")
plt.grid(True)
plt.show()

### Part 8: Putting Clusters on the Map  
Choropleth maps make the result tangible for non‑technical audiences.  
Plotly Express offers an immediate interactive world map keyed by **country name**.  

In [None]:
import plotly.express as px

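# Define a discrete color mapping for the clusters (adjust colors as needed).
# Keys are strings because Plotly Express only treats non-numeric columns as discrete.
color_map = {"0": "blue", "1": "orange", "2": "green", "3": "red"}  # add more if k > 4

# Cast the integer labels to strings so each cluster gets its own legend entry
map_df = data[['country', 'Cluster']].astype({'Cluster': str})

# Create the choropleth map
fig = px.choropleth(
    map_df,
    locationmode='country names',
    locations='country',
    title='Country Clusters on World Map',
    color='Cluster',
    color_discrete_map=color_map
)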

# Update the geographic layout and legend settings
fig.update_geos(fitbounds="locations", visible=True)
fig.update_layout(
    legend_title_text='Cluster',
    legend_title_side='top'
)

# Render the interactive figure (kaleido is only needed for static image export)
fig.show()

In [None]:
# --- OPTIMAL k: Silhouette Analysis -------------------------------------
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Range of candidate cluster counts
k_range = range(2, 11)     # try 2–10 clusters; adjust as you like
sil_scores = []

for k in k_range:
    # Fit hierarchical clustering with Ward linkage (same as dendrogram)
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X_scaled)

    # Silhouette: near +1 = dense & well-separated, near 0 = overlapping, negative = points likely in the wrong cluster
    score = silhouette_score(X_scaled, labels)
    sil_scores.append(score)

# Plot the curve
plt.figure(figsize=(7,4))
plt.plot(list(k_range), sil_scores, marker="o")
plt.xticks(list(k_range))
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Average Silhouette Score")
plt.title("Silhouette Analysis for Agglomerative (Ward) Clustering")
plt.grid(True, alpha=0.3)
plt.show()

# Optional: print best k
best_k = k_range[np.argmax(sil_scores)]
print(f"Best k by silhouette: {best_k}  (score={max(sil_scores):.3f})")