<a href="https://colab.research.google.com/github/vidhya2324/Machine_learning_starting_project/blob/main/Clustering_Countries_Data_(DBSCAN_%26_PCA)_Unsupervised.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Clustering Countries Using Unsupervised Learning for Strategic Aid Allocation by HELP International

In [2]:
!kaggle datasets download -d rohan0301/unsupervised-learning-on-country-data

Dataset URL: https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data
License(s): MIT
Downloading unsupervised-learning-on-country-data.zip to /content
  0% 0.00/5.21k [00:00<?, ?B/s]
100% 5.21k/5.21k [00:00<00:00, 7.12MB/s]


In [3]:
!unzip \*.zip && rm *.zip

Archive:  unsupervised-learning-on-country-data.zip
  inflating: Country-data.csv        
  inflating: data-dictionary.csv     


#Data Transformation and Feature Engineering

Yeo-Johnson Transformation:

The PowerTransformer with the **Yeo-Johnson method** is applied to stabilize variance and make the data distribution more Gaussian-like, which can improve the effectiveness of clustering algorithms.
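
To make the effect of the transformation concrete, here is a minimal sketch on synthetic, right-skewed data (not the country dataset). The names `demo`, `demo_pt`, and `demo_transformed` are illustrative only and do not appear in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic, right-skewed data (log-normal); NOT the country dataset
rng = np.random.default_rng(0)
demo = pd.DataFrame({"skewed": rng.lognormal(mean=0.0, sigma=1.0, size=1000)})

# Yeo-Johnson estimates one power parameter (lambda) per column by maximum likelihood;
# with the default standardize=True the output is also rescaled to zero mean, unit variance
demo_pt = PowerTransformer(method="yeo-johnson")
demo_transformed = pd.DataFrame(demo_pt.fit_transform(demo), columns=demo.columns)

# Compare sample skewness before and after the transformation
skew_before = demo["skewed"].skew()
skew_after = demo_transformed["skewed"].skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
print(f"estimated lambda: {demo_pt.lambdas_[0]:.2f}")
```

A skewness closer to zero after the transformation indicates a more symmetric, more Gaussian-like distribution.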

Dimensionality Reduction with **PCA**

Principal Component Analysis (PCA) reduces the number of dimensions by projecting the features onto a smaller set of principal components, keeping enough of them to retain 95% of the variance in the data.
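
As a rough illustration of the 95% idea, the sketch below assumes the figure refers to retained explained variance and uses synthetic data rather than the country dataset. Passing a float between 0 and 1 as `n_components` is standard scikit-learn behaviour; the variable names (`demo_X`, `demo_pca`, and so on) are made up for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 observed features driven by 3 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
demo_X = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

# PCA is sensitive to feature scale, so standardize first
demo_scaled = StandardScaler().fit_transform(demo_X)

# A float in (0, 1) asks PCA to keep the smallest number of components whose
# cumulative explained variance reaches that fraction
demo_pca = PCA(n_components=0.95)
demo_reduced = demo_pca.fit_transform(demo_scaled)

print("components kept:", demo_pca.n_components_)
print("variance retained:", demo_pca.explained_variance_ratio_.sum())
```

Because the ten synthetic features are driven by only three underlying factors, far fewer than ten components are needed to reach the threshold.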

Clustering with **DBSCAN**

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is selected for its ability to identify clusters of arbitrary shapes and its robustness to noise.
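
Both properties can be seen on a toy example. The sketch below uses synthetic half-moon data with three hand-placed isolated points; the `eps` and `min_samples` values are illustrative choices for this toy data, not the settings used for the country data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic data: two interleaving half-moons (non-convex clusters)
moons, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Three isolated points far from both moons, appended as the last rows
outliers = np.array([[4.0, 4.0], [-4.0, 4.0], [4.0, -4.0]])
demo_points = np.vstack([moons, outliers])

# eps and min_samples are illustrative values chosen for this toy data
demo_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(demo_points)

# DBSCAN assigns the label -1 to noise points, so exclude it when counting clusters
n_clusters = len(set(demo_labels)) - (1 if -1 in demo_labels else 0)
print("clusters found:", n_clusters)
print("labels of the isolated points:", demo_labels[-3:])
```

Unlike k-means, DBSCAN does not take the number of clusters as input; the result is driven by `eps` (neighbourhood radius) and `min_samples` (density threshold).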

In [4]:
# Import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import RobustScaler, PowerTransformer
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

In [5]:
# Load the country data from the specified CSV file
df = pd.read_csv("/content/Country-data.csv")

In [7]:
# Define the list of socio-economic and health-related features to be used for clustering
features = [
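    "child_mort",          # Deaths of children under 5 per 1,000 live births
    "exports",             # Exports of goods and services, as % of GDP per capita
    "gdpp",                # Gross Domestic Product per capita
    "health",              # Total health spending, as % of GDP per capita
    "imports",             # Imports of goods and services, as % of GDP per capita
    "income",              # Net income per person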
    "inflation",           # Inflation rate
    "life_expec",          # Life expectancy
    "total_fer",           # Total fertility rate
]
# Extract the feature matrix 'X' by selecting the defined features from the DataFrame
X = df[features].copy()


#Apply Yeo-Johnson Transformation

In [8]:
# Initialize the PowerTransformer with the Yeo-Johnson method to stabilize variance and make the data more Gaussian-like
pt = PowerTransformer(method="yeo-johnson")

In [9]:
# Fit the transformer to the data and apply the transformation, creating a new DataFrame with transformed features
X_transformed = pd.DataFrame(pt.fit_transform(X), columns=X.columns)

In [10]:
# Generate interaction terms to capture the combined effect of pairs of features, potentially enhancing clustering performance
X_transformed["gdpp_health"] = X_transformed["gdpp"] * X_transformed["health"]
X_transformed["exports_imports"] = X_transformed["exports"] * X_transformed["imports"]
X_transformed["life_expec_child_mort"] = (
    X_transformed["life_expec"] * X_transformed["child_mort"]
)
X_transformed["income_gdpp"] = X_transformed["income"] * X_transformed["gdpp"]
X_transformed["health_life_expec"] = (
    X_transformed["health"] * X_transformed["life_expec"]
)
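
One property of interaction terms built from standardized columns is worth checking: the product of two zero-mean, unit-variance features is generally neither centered nor unit-variance. The sketch below demonstrates this on a synthetic correlated pair (not the country data); `a`, `b`, `z`, and `product` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic pair of correlated features; NOT the country dataset
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = 0.8 * a + 0.6 * rng.normal(size=500)

# Standardize both columns (zero mean, unit variance), as PowerTransformer
# does by default after applying the power transform
z = StandardScaler().fit_transform(np.column_stack([a, b]))
product = z[:, 0] * z[:, 1]

# The mean of the product of two standardized columns equals their Pearson
# correlation, so the interaction term is no longer centered at zero
print("mean of product:", product.mean())
print("correlation:    ", np.corrcoef(a, b)[0, 1])
print("std of product: ", product.std())
```

This suggests that interaction columns end up on a different scale from the original transformed features, which is one reason to rescale all columns together before a distance-based method such as DBSCAN.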


#Scale the Features