<a href="https://colab.research.google.com/github/vidhya2324/Machine_learning_starting_project/blob/main/Clustering_Countries_Data_(DBSCAN_%26_PCA)_Unsupervised.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Clustering Countries Using Unsupervised Learning for Strategic Aid Allocation by HELP International

In [1]:
!kaggle datasets download -d rohan0301/unsupervised-learning-on-country-data

Dataset URL: https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data
License(s): MIT
Downloading unsupervised-learning-on-country-data.zip to /content
  0% 0.00/5.21k [00:00<?, ?B/s]
100% 5.21k/5.21k [00:00<00:00, 9.04MB/s]


In [2]:
!unzip \*.zip && rm *.zip

Archive:  unsupervised-learning-on-country-data.zip
  inflating: Country-data.csv        
  inflating: data-dictionary.csv     


#Data Transformation and Feature Engineering

Yeo-Johnson Transformation:

The PowerTransformer with the **Yeo-Johnson method** is applied to stabilize variance and make the data distribution more Gaussian-like, which can improve the effectiveness of clustering algorithms.
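To make the transformation concrete, here is a minimal NumPy sketch of the Yeo-Johnson formula for a fixed `lmbda`. The helper name `yeo_johnson` is hypothetical. scikit-learn's `PowerTransformer` additionally estimates a separate lambda per feature by maximum likelihood and, by default, standardizes the result, neither of which this sketch does.

```python
import numpy as np

def yeo_johnson(x, lmbda):
    """Apply the Yeo-Johnson transform to x for a fixed lambda (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0

    # Non-negative values: Box-Cox-style transform of (x + 1)
    if abs(lmbda) > 1e-12:
        out[pos] = ((x[pos] + 1.0) ** lmbda - 1.0) / lmbda
    else:
        out[pos] = np.log1p(x[pos])

    # Negative values: mirrored transform with exponent (2 - lambda)
    if abs(lmbda - 2.0) > 1e-12:
        out[~pos] = -(((-x[~pos] + 1.0) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

skewed = np.array([-2.0, 0.0, 1.0, 10.0, 100.0])
print(yeo_johnson(skewed, 1.0))  # lambda = 1 leaves the values unchanged
print(yeo_johnson(skewed, 0.0))  # lambda = 0 compresses the large positive values
```

Unlike Box-Cox, the transform accepts zero and negative inputs, which matters here because `inflation` can be negative.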

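Dimensionality Reduction with **PCA**

Principal Component Analysis (PCA) projects the data onto a smaller set of uncorrelated components. Here it keeps the fewest components that together explain 95% of the variance, which reduces the number of dimensions passed to the clustering step.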
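As a rough illustration of what a 95% variance target means, the sketch below counts how many principal components are needed to pass a given threshold. The helper `components_for_variance` and the toy matrix are assumptions made for this example, not part of the notebook.

```python
import numpy as np

def components_for_variance(X, threshold=0.95):
    """Return the smallest number of principal components whose
    cumulative explained-variance ratio exceeds `threshold`."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)

    # Squared singular values of the centered data are proportional
    # to the variance explained by each principal component.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained / explained.sum())

    k = int(np.searchsorted(cumulative, threshold, side="right") + 1)
    return min(k, len(cumulative))

# Toy data: the first axis carries 90% of the variance, the second 10%.
toy = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
print(components_for_variance(toy, 0.85))  # one component is enough
print(components_for_variance(toy, 0.95))  # both components are needed
```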

Clustering with **DBSCAN**

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is selected for its ability to identify clusters of arbitrary shapes and its robustness to noise.
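A small toy run shows how DBSCAN reports noise. The points and the `eps` and `min_samples` values below are made up for illustration and are unrelated to the country data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0],   # dense group
    [5.0, 5.0], [5.0, 5.1],               # second dense group
    [10.0, 0.0],                          # isolated point
])

# Points with at least `min_samples` neighbours (including themselves)
# within distance `eps` seed clusters; unreachable points get label -1.
toy_labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(toy_labels)
```

The two dense groups receive labels 0 and 1, and the isolated point is labelled -1, which is how noise appears in the results later in this notebook.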

In [3]:
# Import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import RobustScaler, PowerTransformer
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

In [4]:
# Load the country data from the specified CSV file
df = pd.read_csv("/content/Country-data.csv")

In [5]:
# Define the list of socio-economic and health-related features to be used for clustering
features = [
    "child_mort",          # Child mortality rate
    "exports",             # Exports (% of GDP)
    "gdpp",                # Gross Domestic Product per capita
    "health",              # Health expenditure (% of GDP)
    "imports",             # Imports (% of GDP)
    "income",              # Average income
    "inflation",           # Inflation rate
    "life_expec",          # Life expectancy
    "total_fer",           # Total fertility rate
]
# Extract the feature matrix 'X' by selecting the defined features from the DataFrame
X = df[features].copy()


#Apply Yeo-Johnson Transformation

In [6]:
# Initialize the PowerTransformer with the Yeo-Johnson method to stabilize variance and make the data more Gaussian-like
pt = PowerTransformer(method="yeo-johnson")

In [7]:
# Fit the transformer to the data and apply the transformation, creating a new DataFrame with transformed features
X_transformed = pd.DataFrame(pt.fit_transform(X), columns=X.columns)

In [8]:
# Generate interaction terms to capture the combined effect of pairs of features, potentially enhancing clustering performance
X_transformed["gdpp_health"] = X_transformed["gdpp"] * X_transformed["health"]
X_transformed["exports_imports"] = X_transformed["exports"] * X_transformed["imports"]
X_transformed["life_expec_child_mort"] = (
    X_transformed["life_expec"] * X_transformed["child_mort"]
)
X_transformed["income_gdpp"] = X_transformed["income"] * X_transformed["gdpp"]
X_transformed["health_life_expec"] = (
    X_transformed["health"] * X_transformed["life_expec"]
)


#Scale the Features

In [9]:
# Initialize the RobustScaler to scale features, making them robust to outliers by removing the median and scaling the data according to the quantile range
scaler = RobustScaler()

In [10]:
# Fit the scaler to the transformed data and apply the scaling
X_scaled = scaler.fit_transform(X_transformed)
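With its default settings, `RobustScaler` subtracts the median of each feature and divides by its interquartile range. The sketch below reproduces that arithmetic for a single column; `robust_scale` is a hypothetical helper and the values are made up.

```python
import numpy as np

def robust_scale(column):
    """Center a 1-D array on its median and scale by its interquartile range."""
    column = np.asarray(column, dtype=float)
    median = np.median(column)
    q1, q3 = np.percentile(column, [25, 75])
    return (column - median) / (q3 - q1)

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
print(robust_scale(values))
```

The extreme value 100 remains far from the rest after scaling, but it does not shift the center or inflate the scale the way a mean and standard deviation would.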

In [11]:
# Initialize PCA to reduce the dimensionality of the data while retaining 95% of the variance
pca = PCA(n_components=0.95)

In [12]:
# Fit PCA on the scaled data and transform it, resulting in a lower-dimensional representation
X_pca = pca.fit_transform(X_scaled)

In [13]:
# Initialize variables to keep track of the best silhouette score and corresponding DBSCAN parameters
best_score = -np.inf
best_eps = 0
best_min_samples = 0

In [14]:
# Define the range of epsilon values to explore for DBSCAN (from 0.1 to 2.0 with step size 0.1)
eps_range = np.arange(0.1, 2.1, 0.1)

In [15]:
# Define the range of min_samples values to explore for DBSCAN (from 2 to 10)
min_samples_range = range(2, 11)

In [16]:
# Iterate over each combination of epsilon and min_samples
for eps in eps_range:
    for min_samples in min_samples_range:
        # Initialize DBSCAN with the current epsilon and min_samples
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)

        # Fit DBSCAN on the PCA-transformed data and predict cluster labels
        labels = dbscan.fit_predict(X_pca)
        # Calculate the number of clusters found (excluding noise points labeled as -1)
        unique_labels = set(labels[labels != -1])
        if len(unique_labels) >= 2:
            # Create a mask to exclude noise points from the silhouette score calculation
            mask = labels != -1
            if np.sum(mask) > len(unique_labels):  # silhouette_score needs more samples than clusters
                # Calculate the silhouette score for the current clustering
                score = silhouette_score(X_pca[mask], labels[mask])

                # Update the best score and corresponding parameters if the current score is higher
                if score > best_score:
                    best_score = score
                    best_eps = eps
                    best_min_samples = min_samples


In [17]:
# Check if a valid clustering solution was found
if best_score == -np.inf:
    # Raise an error if no valid clustering solution was found
    raise ValueError("No valid clustering solution found")

# Initialize DBSCAN with the best epsilon and min_samples found during grid search
final_dbscan = DBSCAN(eps=best_eps, min_samples=best_min_samples)
# Fit DBSCAN on the PCA-transformed data and predict the final cluster labels
final_labels = final_dbscan.fit_predict(X_pca)

In [18]:
# Create a mask to exclude noise points from the silhouette score calculation
mask = final_labels != -1

In [19]:
# Calculate the silhouette score for the final clustering
final_score = silhouette_score(X_pca[mask], final_labels[mask])

In [20]:
# Print the best DBSCAN parameters found
print(f"Best Parameters: eps={best_eps:.2f}, min_samples={best_min_samples}")

# Print the number of clusters formed with the best parameters
print(f"Number of clusters: {len(set(final_labels[final_labels != -1]))}")

# Print the silhouette score of the final clustering
print(f"Silhouette Score: {final_score:.4f}")

# Print the number of noise points identified by DBSCAN
print(f"Number of noise points: {np.sum(final_labels == -1)}")

# Print the number of PCA components used
print(f"Number of PCA components: {X_pca.shape[1]}")

Best Parameters: eps=0.30, min_samples=2
Number of clusters: 2
Silhouette Score: 0.9285
Number of noise points: 163
Number of PCA components: 7
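The silhouette score above covers only the non-noise points, so it helps to read it alongside the cluster sizes. The sketch below rebuilds a label array from the counts reported in this notebook (163 noise points and two clusters of two countries each) and summarizes it. The array is reconstructed for illustration; it is not the actual `final_labels` ordering.

```python
import numpy as np

# Label counts taken from the notebook's reported results:
# 163 noise points and two clusters of two countries each.
reported_labels = np.array([-1] * 163 + [0] * 2 + [1] * 2)

values, counts = np.unique(reported_labels, return_counts=True)
cluster_sizes = {int(v): int(c) for v, c in zip(values, counts)}
noise_fraction = cluster_sizes.get(-1, 0) / reported_labels.size

print(cluster_sizes)
print(f"Noise fraction: {noise_fraction:.1%}")
```

With these counts, the score of 0.9285 describes only four of the 167 countries.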


#Real-Time Code Building


In [36]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import RobustScaler, PowerTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# Load the country data
df = pd.read_csv("/content/Country-data.csv")

# Features for clustering
features = [
    "child_mort", "exports", "gdpp", "health",
    "imports", "income", "inflation",
    "life_expec", "total_fer"
]
X = df[features].copy()

# Preprocessing: Yeo-Johnson Transformation
pt = PowerTransformer(method="yeo-johnson")
X_transformed = pd.DataFrame(pt.fit_transform(X), columns=X.columns)

# Add interaction terms
X_transformed["gdpp_health"] = X_transformed["gdpp"] * X_transformed["health"]
X_transformed["exports_imports"] = X_transformed["exports"] * X_transformed["imports"]
X_transformed["life_expec_child_mort"] = X_transformed["life_expec"] * X_transformed["child_mort"]
X_transformed["income_gdpp"] = X_transformed["income"] * X_transformed["gdpp"]
X_transformed["health_life_expec"] = X_transformed["health"] * X_transformed["life_expec"]

# Scaling and PCA
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_transformed)

pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

# Clustering using DBSCAN
best_eps = 0.30  # Use optimal values from previous training
best_min_samples = 2
dbscan = DBSCAN(eps=best_eps, min_samples=best_min_samples)
final_labels = dbscan.fit_predict(X_pca)

# Adding cluster labels to the dataframe
df['Cluster'] = final_labels

# Function to suggest a country for aid allocation
def suggest_countries(df, cluster_priority=0):
    """
    Suggest countries for aid allocation based on cluster priority.

    Args:
        df (pd.DataFrame): Dataframe with country data and cluster labels.
        cluster_priority (int): Cluster number to prioritize for aid.

    Returns:
        list: List of countries in the prioritized cluster.
    """
    # Filter countries in the specified cluster
    target_countries = df[df['Cluster'] == cluster_priority]['country'].tolist()
    return target_countries

# Example: Get countries in cluster 0 (assumed here to be the priority cluster; DBSCAN label numbers are arbitrary)
priority_countries = suggest_countries(df, cluster_priority=0)

# Display suggested countries
print("Suggested countries for aid allocation:")
print(priority_countries)

# Optional: Save the output to a CSV for record-keeping
df.to_csv("clustered_countries.csv", index=False)


Suggested countries for aid allocation:
['Croatia', 'Poland']
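DBSCAN assigns label numbers in discovery order, so cluster 0 is not necessarily the neediest group. One way to choose the priority cluster from the data, sketched here under the assumption that higher mean child mortality signals greater need, is to rank the clusters by that indicator. The helper `pick_priority_cluster` and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the country names and values are made up for illustration.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "child_mort": [90.0, 80.0, 5.0, 6.0, 40.0],
    "Cluster": [1, 1, 0, 0, -1],
})

def pick_priority_cluster(df, indicator="child_mort"):
    """Return the non-noise cluster label with the highest mean of `indicator`."""
    clustered = df[df["Cluster"] != -1]
    return int(clustered.groupby("Cluster")[indicator].mean().idxmax())

priority = pick_priority_cluster(toy)
priority_list = toy.loc[toy["Cluster"] == priority, "country"].tolist()
print(priority, priority_list)
```

The returned label could then be passed as `cluster_priority` to `suggest_countries` instead of a hard-coded 0.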


In [37]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import RobustScaler, PowerTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# Load the country data
df = pd.read_csv("/content/Country-data.csv")

# Features for clustering
features = [
    "child_mort", "exports", "gdpp", "health",
    "imports", "income", "inflation",
    "life_expec", "total_fer"
]
X = df[features].copy()

# Preprocessing: Yeo-Johnson Transformation
pt = PowerTransformer(method="yeo-johnson")
X_transformed = pd.DataFrame(pt.fit_transform(X), columns=X.columns)

# Add interaction terms
X_transformed["gdpp_health"] = X_transformed["gdpp"] * X_transformed["health"]
X_transformed["exports_imports"] = X_transformed["exports"] * X_transformed["imports"]
X_transformed["life_expec_child_mort"] = X_transformed["life_expec"] * X_transformed["child_mort"]
X_transformed["income_gdpp"] = X_transformed["income"] * X_transformed["gdpp"]
X_transformed["health_life_expec"] = X_transformed["health"] * X_transformed["life_expec"]

# Scaling and PCA
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_transformed)

pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

# Clustering using DBSCAN
best_eps = 0.30  # Use optimal values from previous training
best_min_samples = 2
dbscan = DBSCAN(eps=best_eps, min_samples=best_min_samples)
final_labels = dbscan.fit_predict(X_pca)

# Adding cluster labels to the dataframe
df['Cluster'] = final_labels

# Prepare transformation pipelines
def preprocess_user_input(user_input):
    """
    Preprocess the user input using the trained pipelines (PowerTransformer, RobustScaler, PCA).

    Args:
        user_input (dict): Dictionary of user-provided socio-economic data.

    Returns:
        np.array: PCA-transformed user input.
    """
    input_df = pd.DataFrame([user_input], columns=features)

    # Apply Yeo-Johnson transformation
    input_transformed = pd.DataFrame(pt.transform(input_df), columns=input_df.columns)

    # Add interaction terms
    input_transformed["gdpp_health"] = input_transformed["gdpp"] * input_transformed["health"]
    input_transformed["exports_imports"] = input_transformed["exports"] * input_transformed["imports"]
    input_transformed["life_expec_child_mort"] = input_transformed["life_expec"] * input_transformed["child_mort"]
    input_transformed["income_gdpp"] = input_transformed["income"] * input_transformed["gdpp"]
    input_transformed["health_life_expec"] = input_transformed["health"] * input_transformed["life_expec"]

    # Apply scaling and PCA
    scaled_input = scaler.transform(input_transformed)
    pca_input = pca.transform(scaled_input)

    return pca_input

# Suggest a cluster for a new country
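def suggest_cluster(user_input):
    """
    Suggest the cluster for the user input.

    DBSCAN has no predict method, and refitting it on a single point always
    labels that point as noise. Instead, the input is assigned the label of
    the nearest core sample of the fitted model if that sample lies within
    eps; otherwise it is treated as noise.

    Args:
        user_input (dict): Dictionary of socio-economic metrics for a country.

    Returns:
        int: Cluster label for the input country (-1 for noise).
    """
    processed_input = preprocess_user_input(user_input)
    if len(dbscan.components_) == 0:
        return -1
    # Distances from the input to every core sample of the fitted DBSCAN model
    distances = np.linalg.norm(dbscan.components_ - processed_input[0], axis=1)
    nearest = int(np.argmin(distances))
    if distances[nearest] > dbscan.eps:
        return -1
    cluster_label = int(dbscan.labels_[dbscan.core_sample_indices_[nearest]])
    return cluster_label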

# Real-time user input
print("Provide the socio-economic metrics for the country you want to analyze.")
user_input = {
    "child_mort": float(input("Child mortality rate: ")),
    "exports": float(input("Total exports (% of GDP): ")),
    "gdpp": float(input("GDP per capita: ")),
    "health": float(input("Health expenditure (% of GDP): ")),
    "imports": float(input("Total imports (% of GDP): ")),
    "income": float(input("Average income: ")),
    "inflation": float(input("Inflation rate: ")),
    "life_expec": float(input("Life expectancy: ")),
    "total_fer": float(input("Total fertility rate: "))
}

# Determine the cluster for the user input
cluster = suggest_cluster(user_input)

# Output the cluster label and decision
if cluster == -1:
    print("The input country is classified as noise, which means it doesn't fit into any defined cluster.")
else:
    print(f"The input country belongs to cluster {cluster}.")
    if cluster == 0:  # Assuming cluster 0 is the priority for aid
        print("Aid allocation is recommended for this country.")
    else:
        print("Aid allocation may not be necessary for this country.")


Provide the socio-economic metrics for the country you want to analyze.
Child mortality rate: 25
Total exports (% of GDP): 30
GDP per capita: 1500
Health expenditure (% of GDP): 4
Total imports (% of GDP): 40
Average income: 2500
Inflation rate: 3
Life expectancy: 15
Total fertility rate: 3
The input country is classified as noise, which means it doesn't fit into any defined cluster.
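Because scikit-learn's `DBSCAN` has no `predict` method, a new point has to be labelled some other way. One common workaround, sketched here on made-up toy data, is to give the point the label of its nearest core sample when that sample is within `eps`, and to call it noise otherwise. The helper `assign_to_cluster` is hypothetical; `components_`, `core_sample_indices_`, and `labels_` are attributes of a fitted `DBSCAN` model.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def assign_to_cluster(model, new_point):
    """Return the label of the nearest core sample within eps, else -1."""
    if len(model.components_) == 0:
        return -1
    distances = np.linalg.norm(model.components_ - new_point, axis=1)
    nearest = int(np.argmin(distances))
    if distances[nearest] > model.eps:
        return -1
    return int(model.labels_[model.core_sample_indices_[nearest]])

# Toy training data: two tight pairs, each forming one cluster.
train = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
model = DBSCAN(eps=0.5, min_samples=2).fit(train)

near_first = assign_to_cluster(model, np.array([0.1, 0.0]))
near_second = assign_to_cluster(model, np.array([5.0, 5.2]))
far_away = assign_to_cluster(model, np.array([20.0, 20.0]))
print(near_first, near_second, far_away)
```

This rule leaves the fitted clusters unchanged. A point that falls outside every core sample's `eps` neighbourhood is reported as noise.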


In [30]:
df["Cluster"] = final_labels
print(final_labels)

[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  0 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1  0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]


In [32]:
cluster_0_countries = df[df["Cluster"] == 0]
cluster_1_countries = df[df["Cluster"] == 1]


In [33]:
print(cluster_0_countries)


     country  child_mort  exports  health  imports  income  inflation  \
41   Croatia         5.5     37.6    7.76     38.1   20100      0.821   
121   Poland         6.0     40.1    7.46     42.1   21800      1.660   

     life_expec  total_fer   gdpp  Cluster  
41         76.3       1.55  13500        0  
121        76.3       1.41  12600        0  


In [34]:
noise_countries = df[df["Cluster"] == -1]

In [35]:
print("Countries in Cluster 0:")
print(cluster_0_countries["country"].tolist())

print("Countries in Cluster 1:")
print(cluster_1_countries["country"].tolist())

print("Noise Countries:")
print(noise_countries["country"].tolist())


Countries in Cluster 0:
['Croatia', 'Poland']
Countries in Cluster 1:
['Italy', 'Spain']
Noise Countries:
['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.', 'Costa Rica', "Cote d'Ivoire", 'Cyprus', 'Czech Republic', 'Denmark', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Fiji', 'Finland', 'France', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Grenada', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq', 'Irel