## 1. Importing Required Libraries

In [None]:
# Filtering out the warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing the required libraries
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt

## 2. Reading the Country-data.csv file

In [None]:
#Read the csv file and print first 5 rows of the file
data=pd.read_csv('Country-data.csv')
data.head()

# Data Description

In [None]:
#Displaying the number of rows and columns in the data
data.shape

## Observation:

   - We have 167 rows and 10 columns in our data

In [None]:
#Displaying data types and non-Null value count of all columns
data.info()         

## Insights:
   - `income` and `gdpp` are integers and `country` is an object (string) column, as expected.
   - `child_mort`, `exports`, `health`, `imports`, `inflation`, `life_expec` and `total_fer` are floats, as expected.
   - All columns have the expected data types, so no type conversion is needed for this dataset.

In [None]:
#Statistical Summary of the dataset
data.describe()

## 3. Inspecting Missing Values

In [None]:
# Verify null value count for all columns
data.isnull().sum()

## Observation:

   - There are no null values in any column, so null value treatment is not required for this dataset.

## 4. Data Validation

In [None]:
data.head()

### Convert `exports`, `health` and `imports` to absolute values, since they are given as a percentage of GDP per capita

In [None]:
# Convert exports, health and imports from a percentage of gdpp to absolute per-capita values:
# absolute value = percentage * gdpp / 100
data['exports']=data['exports']*data['gdpp']/100
data['health']=data['health']*data['gdpp']/100
data['imports']=data['imports']*data['gdpp']/100
data

## Insights:
   - The features `exports`, `health` and `imports` are given as a percentage of GDP per capita, so we converted them to absolute values in the step above, which helps the further analysis.
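
The conversion described above can be sketched on a toy frame. This is only a minimal illustration, assuming the three columns hold percentages of `gdpp`; the helper name `pct_to_absolute` and the sample values are hypothetical and not part of the dataset.

```python
import pandas as pd

def pct_to_absolute(df, pct_cols, base_col="gdpp"):
    """Return a copy in which each percentage column becomes an absolute per-capita value."""
    out = df.copy()
    for col in pct_cols:
        # percentage of the base column -> absolute value
        out[col] = out[col] * out[base_col] / 100
    return out

# Hypothetical sample values, chosen only to make the arithmetic easy to follow
toy = pd.DataFrame({
    "exports": [10.0, 50.0],
    "health": [5.0, 8.0],
    "imports": [40.0, 25.0],
    "gdpp": [1000, 20000],
})
converted = pct_to_absolute(toy, ["exports", "health", "imports"])
print(converted)
```

Working on a copy keeps the original percentages available, which makes it harder to apply the conversion twice by re-running a cell.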

## Negative Value Check

### Let's validate that there are no negative values in the data. Negative values may indicate incorrect data in some of the features

In [None]:
data[data['child_mort']<0]

In [None]:
data[data['exports']<0]

In [None]:
data[data['health']<0]

In [None]:
data[data['imports']<0]

In [None]:
data[data['income']<0]

In [None]:
data[data['life_expec']<0]

In [None]:
data[data['total_fer']<0]

In [None]:
data[data['gdpp']<0]

## Insights:
   - There are no negative values in `child_mort`, `exports`, `health`, `imports`, `income`, `life_expec`, `total_fer` or `gdpp`, which is a good sign, so we can go ahead with the further steps.

In [None]:
data[data['inflation']<0]

## Insights:

   - Some countries have negative inflation, which is genuinely possible, so let's not remove them at this point. We can treat them later in the analysis if required.
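
The per-column checks above could also be done in one pass. The sketch below is an assumption about how that might look, run on a small hypothetical frame rather than on `data`; it counts negative entries in every numeric column.

```python
import pandas as pd

# Hypothetical sample rows; the column names mirror the dataset, the values are made up
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "child_mort": [90.2, 16.6, 27.3],
    "inflation": [9.44, -1.9, 16.1],
})

# Compare every numeric column against zero and count the True values per column
negative_counts = (toy.select_dtypes(include="number") < 0).sum()
print(negative_counts)
```

A count of zero for a column means no row needs inspection; a non-zero count points at the columns worth filtering individually.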


## 5. Univariate Analysis

In [None]:
data.head()

In [None]:
# Reusable method for performing univariate analysis

def univ_anal(col_name):  # Pass feature name for making Box plot.
    plt.title('Data Distribution of '+ col_name+ ' column',size=15,color='green')
    sns.boxplot(y=data[col_name])    # Box plot can be drawn for the given feature.
    plt.ylabel(col_name,size=12)
    plt.show()

In [None]:
# Box plot for child_mort
univ_anal('child_mort')

## Insights:

   - `child_mort` is the number of deaths of children under 5 per 1000 live births. A few countries have a child mortality rate above 150, which is a serious concern for our outcome.
   - High `child_mort` may indicate poorer countries, so let's keep this feature for further analysis.
   - This feature is a strong indicator of which countries need help.

In [None]:
univ_anal('exports')

## Insights:

   - `exports` measures exports of goods and services. We can see that there are outliers.
   - Let's cap the `exports` column, which will help reduce skewness.

In [None]:
univ_anal('health')

## Insights:

   - As the boxplot above shows, the `health` feature has an outlier near 40000, which looks suspicious. Let's cap the outliers at the 95th percentile in a later step, after also analysing the distribution plot.
    

In [None]:
univ_anal('imports')

## Insights:
    
   - As the boxplot above shows, the `imports` feature has an outlier near 14000, which is the maximum value and looks suspicious. Let's cap it at the 95th percentile.
   - We also noticed a large number of values above the 75th percentile. Since the dataset is small, dropping them would lose data, so let's cap them at the 95th percentile instead.

In [None]:
univ_anal('income')

## Insights:

   - The `income` column has a large number of outliers. Let's cap them at the 95th percentile.
   - `income` also has a suspicious-looking outlier above 120000; let's treat that value as well.

In [None]:
univ_anal('inflation')

## Insights:

   - The box plot above shows outliers near 100, which look suspicious. However, some countries can genuinely have high inflation, so let's not remove them. We will also check the distribution to better understand the data.

In [None]:
univ_anal('life_expec')

## Insights:

   - `life_expec` has a low value near 32 and a few more outliers near 46. Let's look at the distribution plot and cap the outliers in the next steps if required.

In [None]:
univ_anal('total_fer')

## Insights:

   - `total_fer` has a good spread of data and nothing suspicious other than a value near 7. Let's analyse the distribution and treat the outlier if required.
   - A fertility value near 7 is plausible, so let's study the further plots before deciding whether to treat it.

In [None]:
univ_anal('gdpp')

## Insights:

   - The `gdpp` column has a fair number of outliers. Let's look at the percentile values and cap them in later steps, after checking the distribution plot.

## Understand the distribution of data 

In [None]:
sns.distplot(data['child_mort'])
plt.title('Distribution plot for child_mort',size=15,color='green')
plt.show()

## Insights:

   - As shown above, `child_mort` is right skewed. This is expected, since some countries do have very high child mortality.
   - `child_mort` gives good insight into which countries need aid.

In [None]:
sns.distplot(data['exports'])
plt.title('Distribution plot for exports',size=15,color='green')
plt.show()

## Insights:

   - As shown above, `exports` is right skewed. This is expected, since some countries export far more than others thanks to well-developed manufacturing and farming, among many other possible reasons.
   - Because `exports` is right skewed, it may affect the model's performance.

In [None]:
sns.distplot(data['health'])
plt.title('Distribution plot for health',size=15,color='green')
plt.show()

## Insights:

   - As shown above, `health` is right skewed. Since this may affect the model's performance, let's cap the feature at the 95th percentile in a later step.
   - Capping the outliers is the better option because the dataset is small.

In [None]:
sns.distplot(data['imports'])
plt.title('Distribution plot for imports',size=15,color='green')
plt.show()

## Insights:

   - As shown above, `imports` is right skewed. This is expected, since some countries have little farmland, industry or raw material and therefore rely on imports.
   - Because `imports` is right skewed, it may affect the model's performance, so let's cap the outliers in the next step.

In [None]:
sns.distplot(data['income'])
plt.title('Distribution plot for income',size=15,color='green')
plt.show()

## Insights:

   - `income` is the net income per person in the country.
   - As shown above, `income` is right skewed. This is expected, since in some countries jobs are scarce and many citizens are unemployed.
   - Because `income` is right skewed, it may affect the model's performance, so let's cap the outliers in the next step.

In [None]:
sns.distplot(data['inflation'])
plt.title('Distribution plot for inflation',size=15,color='green')
plt.show()

## Insights:
   - `inflation` measures the annual growth rate of the total GDP.
   - As shown above, `inflation` is right skewed. When inflation rises, economic growth slows and the cost of living changes drastically.
   - This is a strong indicator of where people need aid, because the cost of living is very high when inflation is high.

In [None]:
sns.distplot(data['life_expec'])
plt.title('Distribution plot for life_expec',size=15,color='green')
plt.show()

## Insights:

   - `life_expec` is the average number of years a newborn child would live if current mortality patterns remained the same.
   - As shown above, `life_expec` is left skewed. Life expectancy for some countries looks unusually low; we decide in the next step whether these outliers need treatment.

In [None]:
sns.distplot(data['total_fer'])
plt.title('Distribution plot for total_fer',size=15,color='green')
plt.show()

## Insights:

   - `total_fer` is the number of children that would be born to each woman if the current age-fertility rates remained the same.
   - As shown above, `total_fer` is only slightly right skewed, so it does not require any outlier treatment.

In [None]:
sns.distplot(data['gdpp'])
plt.title('Distribution plot for gdpp',size=15,color='green')
plt.show()

## Insights:

   - `gdpp` is the GDP per capita, calculated as the total GDP divided by the total population.
   - As shown above, `gdpp` is right skewed. This is expected, since some developed countries have a high GDP.
   - The majority of countries have a GDP per capita below 10000.
   - Because `gdpp` is right skewed, it may affect the model's performance, so let's cap the outliers in the next step.

## Capping outliers: since the dataset is small, we cap the data instead of dropping the outliers

### We are not capping some of the features, for the following reasons:

  - `life_expec` is not heavily skewed, so we are not treating outliers in this feature.
  - `child_mort` indicates the need for aid. Countries with a very high `child_mort` may reflect a genuine need, so let's not drop or cap those outliers.
  - `total_fer` is not heavily skewed, as observed in the plots above, so no outlier treatment is required.
  - `inflation` has some negative values and is not heavily skewed in the positive range, so we are not capping it.

## Let's treat the columns that are highly skewed by capping values below the 5th percentile and above the 95th percentile

In [None]:
# split1 hold feature names which helps in capping the outliers

split1=['exports','health','imports','income','gdpp']
len_split1=len(split1)  # length/count of the features

plt.figure(figsize=(14,10))

print('Distribution of features before performing capping on outliers')
for i,j in zip(split1,range(len_split1)): 
    plt.subplot(3,2,j+1)
    sns.distplot(data[i])   # Distribution plot

## Insights:

   - The plot above shows the distribution of `exports`, `health`, `imports`, `income` and `gdpp` before capping the outliers.
   - The insights for each of these features were already covered in the earlier steps on distribution plots.

## Look for different percentile values before capping the outliers

In [None]:
# split1 hold feature names which helps in capping the outliers

split1=['exports','health','imports','income','gdpp']
len_split1=len(split1)
print('Percentile values before capping the outliers\n')
for i,j in zip(split1,range(len_split1)):
    print('Percentile values before capping the outliers for '+ i +' is :\n', data[i].quantile([0,0.05,0.1,0.9,0.95,0.99,1]))
    print()

## Insights:

   - In the step above we looked at several percentiles (such as 0.05, 0.95 and 1) of `exports`, `health`, `imports`, `income` and `gdpp` before capping the outliers.
   - Let's cap these features: values below the 5th percentile are set to the 5th percentile value, and values above the 95th percentile are set to the 95th percentile value.

## Perform capping on the features

In [None]:
# split1 hold feature names which helps in capping the outliers

split1=['exports','health','imports','income','gdpp']
len_split1=len(split1)

for i,j in zip(split1,range(len_split1)):
    percentilevalues = data[i].quantile([0.05,0.95]).values
    data[i] = np.clip(data[i], percentilevalues[0], percentilevalues[1])  # Replace the original features after capping the data 
data.head()

## Insights:

   - In the step above we capped the outliers of `exports`, `health`, `imports`, `income` and `gdpp`.
   - Let's look at the distribution plots to see how the skewness has changed after this step.

In [None]:
split1=['exports','health','imports','income','gdpp']
len_split1=len(split1)

print('Percentile values After capping the outliers\n')
for i,j in zip(split1,range(len_split1)):
    print('percentile values After applying capping for '+ i +' is :\n', data[i].quantile([0,0.05,0.1,0.9,0.95,0.99,1]))
    print()

## Insights:

   - In the step above we looked at several percentiles (such as 0.05, 0.95 and 1) of `exports`, `health`, `imports`, `income` and `gdpp` after capping the outliers.
   - Values below the 5th percentile were capped to the 5th percentile value, and values above the 95th percentile were capped to the 95th percentile value.

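
The capping rule described above can be sketched on a toy series. This is a minimal illustration under stated assumptions: the helper name `cap_outliers` is hypothetical, and the series of 1 to 101 is made up so that the 5th and 95th percentiles are easy to verify by hand.

```python
import numpy as np
import pandas as pd

def cap_outliers(series, lower_q=0.05, upper_q=0.95):
    """Clip values below the lower quantile and above the upper quantile."""
    lower, upper = series.quantile([lower_q, upper_q])
    return series.clip(lower=lower, upper=upper)

# Hypothetical data: the numbers 1..101, so the 5th percentile is about 6 and the 95th about 96
values = pd.Series(np.arange(1, 102, dtype=float))
capped = cap_outliers(values)
print(capped.min(), capped.max(), len(capped))
```

Capping keeps every row, which is why it suits a small dataset better than dropping the outliers.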
In [None]:
# split1 hold feature names which helps in capping the outliers

split1=['exports','health','imports','income','gdpp']
len_split1=len(split1)  # length/count of the features

plt.figure(figsize=(14,10))

print('Distribution of features after performing capping on outliers')
for i,j in zip(split1,range(len_split1)): 
    plt.subplot(3,2,j+1)
    sns.distplot(data[i])   # Distribution plot

## Insights:

   - The skewness of the data has been reduced, which is good for building the clusters.
   - Most of the skewed data has now been treated.

## 6. Bivariate and Multivariate Analysis

## Numeric - Numeric analysis

- There are three ways to analyse the *`numeric- numeric`* data types simultaneously.
- **Scatter plot**: describes the pattern that how one variable is varying with other variable.
- **Correlation matrix**: to describe the linearity of two numeric variables.
- **Pair plot**: group of scatter plots of all numeric variables in the data frame.

### Lets visualise the correlation between variables

In [None]:
sns.heatmap(data.iloc[:,1:].corr(),cmap='gray',cbar=True,annot=True)
plt.show()

## Insights:

   - The heatmap above shows the correlation values, which helps in understanding which variables are highly correlated.
   - Let's look at the most strongly correlated variables in the next step.

In [None]:
# Exclude the non-numeric country column before computing correlations
data.iloc[:,1:].corr()

In [None]:
# Top correlated variables
data.iloc[:,1:].corr().abs().unstack().sort_values(ascending= False)[9:30]

## Insights:

   - The output above lists the feature pairs with the highest correlation values.
   - `income` and `health` have the highest correlation among all the features.
   - `income` and `gdpp` have the second highest correlation, with a value of 0.94151.
   - `imports` and `exports` have the third highest correlation, with a value of 0.933.
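
Unstacking a full correlation matrix lists every pair twice and includes the diagonal. One possible alternative, sketched here on hypothetical data, keeps only the upper triangle so that each pair appears once. The helper name `top_correlated_pairs` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=3):
    """Return the n feature pairs with the largest absolute correlation, each pair once."""
    corr = df.corr().abs()
    # Boolean mask that is True strictly above the diagonal
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return upper.stack().dropna().sort_values(ascending=False).head(n)

# Hypothetical numeric columns: a and b move together, c moves against them
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 3, 4, 1, 2],
})
pairs = top_correlated_pairs(toy)
print(pairs)
```

With this layout there is no need to skip a fixed number of leading entries to get past the self-correlations.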

In [None]:
# reusable method for bivariate analysis
def bi_anal(feature1,feature2):
    plt.title('Data Distribution of '+ feature1+ ' versus ' + feature2,size=15,color='green')
    sns.scatterplot(x=feature1,y=feature2,data=data)  # pass x and y as keywords; recent seaborn versions reject positional arguments
    plt.xlabel(feature1,size=12)
    plt.ylabel(feature2,size=12)
    plt.show()

In [None]:
bi_anal('income','gdpp')

## Insights:

   - As shown in the plot above, `gdpp` and `income` are highly correlated: as income increases, `gdpp` also increases linearly.
    

In [None]:
bi_anal('income','health')

## Insights:

   - As seen in the correlation matrix, `income` and `health` are highly correlated. As the average income per person increases, total health spending also increases.

In [None]:
bi_anal('imports','exports')

## Insights:

   - As seen in the correlation matrix, `imports` and `exports` are highly correlated.
   - Countries with high imports tend to have high exports too: no country is good at producing every kind of goods, so limited resources lead countries to exchange goods with one another.

In [None]:
bi_anal('total_fer','child_mort')

## Insights:
   - `total_fer` is the number of children that would be born to each woman if the current age-fertility rates remained the same.
   - As `total_fer` increases, `child_mort` increases, which is consistent with the positive correlation between the two variables.

In [None]:
bi_anal('health','child_mort')

## Insights:

   - The plot above shows that as `health` (total health spending per capita) increases, `child_mort` decreases.
   - Lower `health` spending goes with higher `child_mort`.

In [None]:
bi_anal('health','gdpp')

## Insights:

  - `health` is the total health spending per capita.
  - Developed countries tend to have a high GDP, and their average income and health spending are usually higher too, which is why `health` and `gdpp` are correlated.

In [None]:
bi_anal('child_mort','life_expec')

## Insights:

   - When `life_expec` is high, `child_mort` is low. This is expected, since `child_mort` is the number of deaths of children under 5 years of age per 1000 live births.
   - So the two variables are negatively correlated.


In [None]:
bi_anal('imports','gdpp')

## Insights:

   - `gdpp` and `imports` are linearly related: when imports are in the lower range, GDP per capita is also low.
   - As imports increase, GDP per capita also increases linearly.
   - However, many other factors need to be considered when calculating GDP.

In [None]:
bi_anal('life_expec','total_fer')

## Insights:

  - We can see that `life_expec` is higher when fertility is lower, but we cannot conclude much from this: `life_expec` covers a wide range of values, and most `total_fer` values lie between 1 and 5, as observed in the `total_fer` box plot.

In [None]:
# Create a new dataframe data1 which holds features other than country column
data1=data.iloc[:,1:]
data1.head()

## 7. Hopkins Test

### Hopkins analysis to understand how suitable our data is for clustering

In [None]:
#Calculating the Hopkins statistic
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
from math import isnan
 
def hopkins(X):
    d = X.shape[1]
    n = len(X) # rows
    m = int(0.1 * n) 
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values) 
    rand_X = sample(range(0, n, 1), m) 
    ujd = []
    wjd = []
    for j in range(0, m):
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X,axis=0),np.amax(X,axis=0),d).reshape(1, -1), 2, return_distance=True)
        ujd.append(u_dist[0][1])
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1]) 
    HO = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(HO):
        print(ujd, wjd)
        HO = 0
 
    return HO

In [None]:
# verify hopkins value 10 times/multiple times to make sure our data is well suited for clustering
for i in range(10):
    print(hopkins(data1))

## Insights:

   - Here we consider all the columns and see what the Hopkins statistic says about how suitable the data is for clustering.

## Let's analyse the features `child_mort`, `gdpp` and `income` and form clusters using them

In [None]:
# Create a new dataframe data2 which holds the features gdpp, child_mort and income
# .copy() makes data2 independent of data, so columns can be added to it safely later
data2=data[['gdpp','child_mort','income']].copy()
data2.head()

In [None]:
# verify hopkins value 10 times/multiple times to make sure our data is well suited for clustering
for i in range(10):
    print(hopkins(data2))

## Insights:

   - The Hopkins statistic stays above 0.85 across all 10 runs, which is a strong indicator that the data is suitable for clustering.
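
To make the idea behind the statistic concrete, here is a NumPy-only sketch. It is not the `hopkins` function used in this notebook: it is a simplified variant, assuming Euclidean distance, that compares the nearest-neighbour distances of uniformly generated points with those of sampled real points. The function name `hopkins_statistic` and the synthetic blobs are hypothetical.

```python
import numpy as np

def hopkins_statistic(X, sample_frac=0.1, seed=0):
    """Values near 1 suggest clustered data; values near 0.5 suggest uniform noise."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = max(1, int(sample_frac * n))
    rng = np.random.default_rng(seed)

    def nearest_distance(point, exclude=None):
        dist = np.linalg.norm(X - point, axis=1)
        if exclude is not None:
            dist[exclude] = np.inf  # ignore the point itself
        return dist.min()

    # u: distance from each uniform random point to its nearest real point
    uniform_points = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = sum(nearest_distance(p) for p in uniform_points)

    # w: distance from each sampled real point to its nearest other real point
    sample_idx = rng.choice(n, size=m, replace=False)
    w = sum(nearest_distance(X[i], exclude=i) for i in sample_idx)

    return u / (u + w)

# Hypothetical data: two tight, well-separated blobs
blob_rng = np.random.default_rng(1)
blobs = np.vstack([
    blob_rng.normal(0, 0.05, size=(50, 2)),
    blob_rng.normal(5, 0.05, size=(50, 2)),
])
h = hopkins_statistic(blobs)
print(round(h, 3))
```

Because both samples are random, the value changes from run to run unless a seed is fixed, which is why the statistic is evaluated several times above.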
   

In [None]:
data2.corr()

In [None]:
sns.heatmap(data2.corr(),cmap='gray',cbar=True,annot=True)
plt.show()

## Insights:

   - The heatmap above shows a high correlation of 0.94 between `gdpp` and `income`.
   - `child_mort` and `income` are also strongly correlated.

In [None]:
# Importing minmax scaler
from sklearn.preprocessing import MinMaxScaler
mm=MinMaxScaler()

## 8. Finding the optimal k value for k-means

### Finding the optimal k value with the within-cluster sum of squares method

In [None]:
# Import KMeans
from sklearn.cluster import KMeans

## Optimal cluster identification using within cluster sum of squares

In [None]:
wcss=[]  # collects the within-cluster sum of squares for each k
k=range(1,15)
for i in k:
    kmeans=KMeans(i,random_state=42)
    kmeans.fit(data2)
    wcss.append(kmeans.inertia_)
print(wcss)
plt.plot(k,wcss)                                                      #plot the diagram in the form of x and y 
plt.title("Find optimal k value")
plt.xlabel("k value")
plt.ylabel("wcss value : within cluster sum of squares")
plt.show()

In [None]:
kmeans.inertia_

In [None]:
help(kmeans)

## Insights:

   - The elbow plot suggests that k=2 and k=3 are candidates for the optimal value of k.
   - Let's use these k values to analyse how the clusters explain the spread of the data, and decide on the better k for our model.
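
The quantity plotted in the elbow curve is the sum of squared distances from each point to the centre of its assigned cluster. The sketch below computes it by hand on four hypothetical points; the function name `wcss_by_hand`, the labels and the centres are assumptions chosen for illustration.

```python
import numpy as np

def wcss_by_hand(X, labels, centers):
    """Sum of squared distances between each point and the centre of its cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centers = np.asarray(centers, dtype=float)
    diffs = X - centers[labels]  # pick each point's own centre
    return float((diffs ** 2).sum())

# Hypothetical points forming two obvious groups
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_cluster = wcss_by_hand(X, [0, 0, 0, 0], [[5.0, 1.0]])
two_clusters = wcss_by_hand(X, [0, 0, 1, 1], [[0.0, 1.0], [10.0, 1.0]])
print(one_cluster, two_clusters)
```

The sharp drop from one cluster to two, followed by much smaller gains, is the "elbow" that the plot is used to find.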

In [None]:
# Verify that the data2 has only three columns before applying clustering on that
data2.head()

## `K=2`

In [None]:
# Fit the data with k=2 clusters
kmeans=KMeans(2,random_state=42)
kmeans.fit(data2)

## Insights:

   - We have fitted the data with 2 clusters. Let's look at the cluster counts across the data in the next steps.

In [None]:
# store the cluster labels in the variable y_kmeans
y_kmeans=kmeans.fit_predict(data2)
y_kmeans

In [None]:
# Add the cluster column to the dataframe : data2
data2['cluster']=y_kmeans
data2.head()

## Insights:

   - We have added cluster labels in the previous step to the dataframe data2

In [None]:
data2.cluster.value_counts()

## Insights:

   - Cluster 0 has 128 countries and cluster 1 has 39.
   - Cluster 0 holds most of the data points, so let's try k=3 and see whether it explains the data points better.

In [None]:
# Lets remove the cluster column from data2 before fitting the data2 to kmeans algorithm
data2=data2.iloc[:,:-1]
data2.head()

## `K=3`

In [None]:
# Fit the data with k=3 clusters
kmeans=KMeans(3,random_state=42)
kmeans.fit(data2)

## Insights:

   - We have fitted the data with 3 clusters. Let's look at the cluster counts across the data in the next steps.

In [None]:
# store the cluster labels in the variable y_kmeans
y_kmeans=kmeans.fit_predict(data2)
y_kmeans

## Insights:

   - The output above gives the cluster number that each row of the data belongs to. Let's add this cluster value to the dataframe.

In [None]:
# Add the cluster column to the dataframe : data2
data2['cluster']=y_kmeans
data2.head()

In [None]:
data2.cluster.value_counts()

## Insights:

   - Cluster 0 has 96 countries, cluster 1 has 40 and cluster 2 has 31.
   - With k=3 the clusters explain the data well and the data is not concentrated in a single cluster, so let's use k=3.

In [None]:
# cluster centers
kmeans.cluster_centers_

## Insights:
   - The cluster centre values are displayed in the step above.

In [None]:
# Add the country column to the dataframe : data2
data2['country']=data['country']
data2.head()

## Insights:

   - In the steps above we added the country and cluster labels to the dataframe `data2`, which helps in later steps to identify which country falls under which cluster.

## Before visualising the clusters, let's find the optimal k value using silhouette analysis

## Silhouette Analysis

## Optimal value of k can be obtained using silhouette_score

In [None]:
# scaling
# Create new dataframe data3  with features: gdpp, child_mort and income and apply minmax scaler
data2_columns=data2.iloc[:,:-2].columns
data3=pd.DataFrame(mm.fit_transform(data2.iloc[:,:-2]),columns=data2_columns)
data3.head()

## Insights:

   - Min-max scaling has been applied to the features `gdpp`, `child_mort` and `income`.
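
Min-max scaling maps each column to the range 0 to 1 using (x - min) / (max - min). The sketch below applies that formula directly with pandas on hypothetical values; the helper name `min_max_scale` is an assumption, and it would divide by zero for a constant column.

```python
import pandas as pd

def min_max_scale(df):
    """Rescale every column to [0, 1] using (x - min) / (max - min)."""
    return (df - df.min()) / (df.max() - df.min())

# Hypothetical values on very different scales
toy = pd.DataFrame({
    "gdpp": [500.0, 5000.0, 50000.0],
    "child_mort": [100.0, 20.0, 4.0],
})
scaled = min_max_scale(toy)
print(scaled)
```

After scaling, both columns contribute on the same footing to Euclidean distances, so the larger raw values of `gdpp` no longer dominate.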

In [None]:
#silhouette_score for finding optimal value of k 
from sklearn.metrics import silhouette_score

In [None]:
#silhoutee_avg=[]
#k=range(1,10)
range_clusters=[2,3,4,5,6,7,8]
for i in range_clusters:
    kmeans=KMeans(i,random_state=42)
    kmeans.fit(data3)
    cluster_labels=kmeans.labels_
    silhouette=silhouette_score(data3,cluster_labels)
    print(silhouette)

In [None]:
## Learning

In [None]:
# Return the cluster centers(Centroids)
kmeans.cluster_centers_

In [None]:
# Cluster labels tells to which cluster the data point belongs  to
kmeans.labels_

In [None]:
data3.shape


## Insights:

   - The analysis above gives the best silhouette scores for k=2 (0.65) and k=3 (0.52).
   - The within-cluster sum of squares analysis pointed to the same k values, but we saw a better distribution of countries across clusters with k=3 in the step above. Let's take k=3 as the optimal value and visualise the spread of the clusters.
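
For one point, the silhouette value is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The NumPy sketch below averages that value over all points; the function name `silhouette_mean` and the four sample points are hypothetical, and it assumes Euclidean distance and at least two clusters.

```python
import numpy as np

def silhouette_mean(X, labels):
    """Average silhouette value over all points (Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        a = dist[i, same].sum() / (n_same - 1)  # own cluster, excluding the point itself
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Hypothetical points: two groups far apart on the x axis
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_mean(X, [0, 0, 1, 1])  # groups match the geometry
bad = silhouette_mean(X, [0, 1, 0, 1])   # groups cut across the geometry
print(round(good, 3), round(bad, 3))
```

A higher average means points sit closer to their own cluster than to the neighbouring one, which is the basis for comparing different k values.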

### After verifying k values at 2 and 3 data is well explained when k=3

## Lets visualise the clusters

In [None]:
data2.head()

In [None]:
sns.scatterplot(x = 'income', y ='child_mort', hue = 'cluster', data =data2,palette=['blue','green','red'])
plt.xlabel('income',fontsize=13)
plt.ylabel('child_mort',fontsize=13)
plt.legend()
plt.show()

## Insights:

   - The plot shows that wherever income is low, child mortality is high, which indicates that average income per person is a strong indicator of the mortality rate.
   - The three clusters are clearly separated in the data, as shown above.
   - Cluster 0 will be our priority, since it shows a high child mortality rate and a low average income.

In [None]:
sns.scatterplot(x = 'income', y ='gdpp', hue = 'cluster', data =data2,palette=['blue','green','red'])
plt.xlabel('income',fontsize=13)
plt.ylabel('gdpp',fontsize=13)
plt.legend()
plt.show()

## Insights:

   - Net income per person looks linearly related to the GDP per capita of a country.
   - As the average income per person increases, `gdpp` gradually increases.
   - Countries with a low average income and low `gdpp` are where the trust can provide funds and help.
   - As observed, cluster 0 matches this criterion.

In [None]:
sns.scatterplot(x = 'child_mort', y ='gdpp', hue = 'cluster', data =data2,palette=['blue','green','red'])
plt.xlabel('child_mort',fontsize=13)
plt.ylabel('gdpp',fontsize=13)
plt.legend()
plt.show()

## Insights:

   - As observed in the plot above, `child_mort` is high when `gdpp` is low.
   - `gdpp` is a strong indicator of `child_mort`.
   - Cluster 0 corresponds to a high child mortality rate.
   - In further steps we will filter the countries where this pattern is observed.
   - We can concentrate on cluster 0, where `child_mort` is high and `gdpp` is low.

In [None]:
# scaling
# Create new dataframe data3  with features: gdpp, child_mort and income and apply minmax scaler
data2_columns=data2.iloc[:,:-2].columns
data3=pd.DataFrame(mm.fit_transform(data2.iloc[:,:-2]),columns=data2_columns)
data3

## Insights:

   - After scaling the data, we store it in `data3`.
   - We applied the min-max scaler in the previous step. Scaling matters because, without it, features with larger values would get higher priority.

In [None]:
# Add cluster value to data3
data3['cluster']=data2['cluster']
data3.head()

In [None]:
# Add country column to data3
data3['country']=data2['country']
data3.head()

## Insights:

   - In the two steps above we added the country and cluster labels to the dataframe `data3`, which helps in later steps to identify which country falls under which cluster.

In [None]:
# Group by cluster and plot the mean of each numeric feature (the country column is skipped)
data3.groupby('cluster').mean(numeric_only=True).plot(kind='bar')

## Insights:

   - Cluster 0 has high `child_mort` and low `income` and `gdpp`. These are the features to concentrate on when identifying the countries that need help from the charity/trust.

### Taking cluster 0 as the priority in our scenario, let's see which countries in cluster 0 need help the most

In [None]:
data3[data3['cluster']==0].sort_values(by=['child_mort','gdpp','income'],ascending=[False,True,True]).head()

## Insights:

   - Countries with a high child mortality rate, low `gdpp` and low income are considered the priority for where the trust can spend money on people in need.
   - The countries above are listed in order of priority for the trust's help.
   - In order of priority, the countries that need help are: `Haiti`, `Sierra Leone`, `Chad`, `Central African Republic` and `Mali`.

## 9. Hierarchical clustering
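
The ranking above relies on a multi-key sort: highest `child_mort` first, with ties broken by lowest `gdpp` and then lowest `income`. The sketch below shows that ordering on a hypothetical frame; the country names A to D and all values are made up.

```python
import pandas as pd

# Hypothetical scaled values; countries A and B tie on child_mort
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "child_mort": [0.9, 0.9, 0.4, 0.7],
    "gdpp": [0.05, 0.01, 0.02, 0.03],
    "income": [0.02, 0.03, 0.01, 0.04],
    "cluster": [0, 0, 1, 0],
})

# Keep the priority cluster, then sort: child_mort descending, gdpp and income ascending
priority = toy[toy["cluster"] == 0].sort_values(
    by=["child_mort", "gdpp", "income"],
    ascending=[False, True, True],
)
ranked = priority["country"].tolist()
print(ranked)
```

Since the later keys only break ties in the earlier ones, `child_mort` dominates the ordering; a different key order would express a different priority.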

In [None]:
# import libraries

from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

In [None]:
# Dataframe which we will use in hierarchical clustering
data3.head()

## Agglomerative clustering
- Here we start with each data point as its own cluster and compute the distance from each point to every other point. The two closest points are merged into one cluster, leaving n-1 clusters.
- This process repeats until all the data points are merged into a single cluster.
- We can visualise the resulting dendrograms, which we will see in the next steps.
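
The merge process described above can be sketched with SciPy on a handful of hypothetical one-dimensional points. This is only an illustration: `linkage` records the n-1 merges and `cut_tree` turns them into flat cluster labels; the sample points are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

# Hypothetical points: three near 0 and two near 10
points = np.array([[0.0], [0.1], [0.2], [10.0], [10.1]])

# Each row of the linkage matrix describes one merge, so 5 points give 4 rows
merges = linkage(points, method="complete", metric="euclidean")

# Cut the tree so that exactly two clusters remain
labels = cut_tree(merges, n_clusters=2).reshape(-1)
print(merges.shape, labels)
```

Cutting the same tree at a different number of clusters needs no refitting, which is what makes it cheap to compare two and three clusters later on.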

### single linkage

In [None]:
# single linkage
plt.figure(figsize=(16,8))
single_linkage=linkage(data3.iloc[:,:-2],method='single',metric='euclidean')   # single linkage method
dendrogram(single_linkage)  # draws the dendrogram
plt.show()          # display the dendrogram on the screen

## Insights:

- In the diagram above the dendrogram is tightly packed and hard to read, so let's try other linkage methods such as complete and average linkage.

In [None]:
# complete linkage
plt.figure(figsize=(16,8))
complete_linkage=linkage(data3.iloc[:,:-2],method='complete',metric='euclidean')   #Complete linkage method
dendrogram(complete_linkage)  # draws the dendrogram
plt.show()          # display the dendrogram

## Insights:

- Complete linkage performs better than single linkage and produces a clearer dendrogram. Let's cut the tree at different levels and find the optimal number of clusters.

In [None]:
# average linkage
plt.figure(figsize=(15,8))
avg_linkage=linkage(data3.iloc[:,:-2],method='average',metric='euclidean')   #Average linkage method 
dendrogram(avg_linkage)  # draws the dendrogram
plt.show()          # display the dendrogram

## Insights:

- We have obtained dendrograms for single linkage, complete linkage and average linkage.
- Let's cut the tree to obtain 2 and 3 clusters. Then we can conclude which performs better.

## `n_clusters=2` with the complete linkage method

In [None]:
cluster_labels=cut_tree(complete_linkage,2).reshape(-1,)
cluster_labels

## Insights:
- We now have the cluster labels for 2 clusters.

In [None]:
# Add the cluster labels to the dataframe 'data3'
data3['cluster_labels']=cluster_labels
data3

In [None]:
# Verify the count of cluster labels in data3
data3.cluster_labels.value_counts()

## Insights:
- As observed, 129 countries fall under cluster 0 and 38 countries fall under cluster 1.

## `n_clusters=3`

In [None]:
cluster_labels=cut_tree(complete_linkage,3).reshape(-1,)
cluster_labels

## Insights:
- We now have the cluster labels for 3 clusters.

In [None]:
data3['cluster_labels']=cluster_labels
data3

In [None]:
# Count the hierarchical cluster labels (cluster_labels), not the k-means labels (cluster)
data3.cluster_labels.value_counts()

## Insights:
- As observed, 96 countries fall under cluster 0, 31 countries fall under cluster 1 and 40 countries fall under cluster 2.

In [None]:
data3.head()

In [None]:
# Create a new dataframe data4 which holds gdpp,child_mort,income,country and cluster_labels column
data4=data3.drop('cluster',axis=1)
data4

## Let's group the data based on cluster labels

In [None]:
# Group by cluster_labels and plot the mean of each numeric feature (the country column is skipped)
data4.groupby('cluster_labels').mean(numeric_only=True).plot(kind='bar')

## Insights:

- Cluster 0 has high `child_mort` and low `gdpp` and `income`, which is a good indicator of people in need of aid.
- Help can be concentrated on cluster 0, whose countries are the ones in need.

In [None]:
# Filter the rows labelled cluster 0, including the country names
data4[data4.cluster_labels==0]

In [None]:
# Top 5 countries which require help
data4[data4['cluster_labels']==0].sort_values(by=['child_mort','gdpp','income'],ascending=[False,True,True]).head()

## Insights:
- HELP International can prioritise these top 5 countries in fighting poverty and in providing the people of underdeveloped countries with basic amenities and relief during disasters and natural calamities.
- The countries above are listed in order of priority for the trust's help.
- In order of priority, the countries that need help are: `Haiti`, `Sierra Leone`, `Chad`, `Central African Republic` and `Mali`.
- We can cut the tree at a different level to obtain a different number of clusters, based on business understanding.

## Overall Insights:
- HELP International can concentrate on the top 5 countries in need of help. In order of priority, they are: `Haiti`, `Sierra Leone`, `Chad`, `Central African Republic` and `Mali`.
