## Countries Population from 1995 to 2020 dataset

## STEPS :
1. Perform Data Pre-processing and Data Visualization on your data set to develop a thorough understanding of the data. 

2. Regression

3. Apply clustering techniques

This dataset consists of 4195 observations on the following 14 columns (features):
 - **Year**
 - **Country**
 - **Population**
 - **Yearly % Change**
 - **Yearly Change**
 - **Migrants (net)**
 - **Median Age**
 - **Fertility Rate**
 - **Density (P/Km²)**
 - **Urban Pop %**
 - **Urban Population**
 - **Country's Share of World Pop %**
 - **World Population**
 - **Country Global Rank**

In [None]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly
import plotly.express as px

In [None]:
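# read dataset
data=pd.read_csv('../input/countries-population-from-1955-to-2020/Countries Population from 1995 to 2020.csv')
# independent copies: later in-place drops on `data` must not remove columns (e.g. 'Country') from them
ps = data.copy()
ds = data.copy()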

In [None]:
#to display first 5 rows of dataset
data.head()

In [None]:
#to display last 5 rows of dataset
data.tail()

# Data Preprocessing

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
#dropping rows that contain null values
data.dropna(axis=0,inplace=True)

In [None]:
sns.heatmap(data.isnull(),yticklabels=False,cbar=False,cmap='viridis')

> **After dropping rows with missing values, there are no null values left in the dataset**

In [None]:
#Plotting boxplot (displaying the distribution of data based on a five number summary )
sns.set(style="whitegrid")
fig,ax = plt.subplots(nrows=2, ncols=2, figsize=(8,8))
plt.suptitle('Box Plot',fontsize=24)
sns.boxplot(x="Country Global Rank", data=data,ax=ax[0,0],palette='Set2')
sns.boxplot(x="Urban Pop %", data=data,ax=ax[0,1],palette='Set1')
sns.boxplot (x ='Fertility Rate', data=data, ax=ax[1,0], palette='Set1')
sns.boxplot(x='Median Age', data=data, ax=ax[1,1],palette='Set2')
plt.show()

> **The median (second quartile) marks the mid-point of each attribute (e.g. around 50 for Urban Pop %) and is shown by the line that divides the box into two parts.**
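
The five-number summary that a box plot draws can also be computed directly. The sketch below is only an illustration: `values` is a made-up sample, not a column from this dataset.

```python
import numpy as np

# Hypothetical sample standing in for one numeric column
values = np.array([12, 25, 38, 47, 50, 53, 61, 74, 90], dtype=float)

# Quartiles: the box spans q1..q3 and the line inside the box is the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range, i.e. the width of the box

summary = {
    "min": values.min(),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": values.max(),
}
print(summary, "IQR:", iqr)
```

On a real column, `data['Urban Pop %'].describe()` reports the same quartiles.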

# Exploratory Data Analysis

In [None]:
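#Finding correlation between all numeric attributes of dataset
fig, ax = plt.subplots(figsize=(10,9))
# restrict to numeric columns: 'Country' is text and cannot be correlated
sns.heatmap(data.select_dtypes(include='number').corr(), center=0, cmap='BrBG', annot=True)
ax.set_title('HEAT MAP')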

**INFERENCE**:
> #### The attributes with the highest correlation (0.95) are Country's Share of World Pop % and Population.
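
For reference, each cell of the heat map is a Pearson correlation coefficient. A minimal sketch of that computation follows; the arrays `x` and `y` are invented values, not taken from the dataset.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))

# Hypothetical, nearly linear pair of variables
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

r = pearson(x, y)
print(r)  # close to 1 because y grows almost linearly with x
```

A value near +1 or -1 means the two attributes move together almost linearly, and a value near 0 means there is little linear relationship.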

In [None]:
sns.jointplot(x="Urban Pop %",y="Year",data=data,kind="hex",color="magenta")

**INFERENCE:**
> **In this plot, hexagons containing more points are drawn in a darker color. It can be inferred that observations are densest at an Urban Pop % of around 50%, for years between 2015 and 2020.**

In [None]:
#plotting histogram
data.hist(figsize=(20,30),color="orange")

> **It shows the distribution of various attributes in the dataset**

In [None]:
#graph showing population of top 20 countries in 2020
# sort explicitly so the 20 rows kept really are the most populous countries
current_population = data[data['Year'] == 2020].sort_values('Population', ascending=False).head(20)
plt.rcParams['figure.figsize'] = (25, 7)
ax = sns.barplot(x = 'Country', y = 'Population', data = current_population, palette = 'dark')
ax.set_xlabel(xlabel = 'Countries', fontsize = 10)
ax.set_ylabel(ylabel = 'Population in Billion', fontsize = 10)
ax.set_title(label = 'Population of top 20 countries in 2020', fontsize = 20)
plt.xticks(rotation = 90)
plt.show()

**INFERENCE**
> **It can be seen that China has the highest population in 2020 followed by India**

In [None]:
population_2020 = data[data['Year'] == 2020]

In [None]:
fig = px.choropleth(population_2020, locations="Country", 
                    locationmode='country names', color="Density (P/Km²)", 
                    hover_name="Country", 
                    color_continuous_scale="blues", 
                    title='Density of Countries in 2020')
fig.update(layout_coloraxis_showscale=True)
fig.show()

In [None]:
#plot showing the share of the world's population held by the top 10 countries
unique_countries = data['Country'].unique()
plt.style.use("seaborn-talk")
# set year
year = 2020
df_last_year = data[data['Year'] == year]
series_last_year = df_last_year.groupby('Country')['Population'].sum().sort_values(ascending=False)
print(series_last_year)
labels = []
values = []
country_count = 10
other_total = 0
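for country in series_last_year.index:
    if country_count > 0:
        # keep the 10 most populous countries as separate slices
        labels.append(country)
        values.append(series_last_year[country])
        country_count -= 1
    else:
        # every remaining country is summed into the "Other" slice
        other_total += series_last_year[country]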
labels.append("Other")
values.append(other_total)
wedge_dict = {
'edgecolor': 'black',
'linewidth': 2
}
explode = (0, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
plt.title(f"Share of World Population for the top 10 countries in {year}")
plt.pie(values, labels=labels, explode=explode, autopct='%1.2f%%', wedgeprops=wedge_dict)
plt.show()

**INFERENCE**
> **It can be seen that China accounts for the highest share in World's population followed by India in 2020**

In [None]:
india=data[data['Country'] == "India"]

In [None]:
#plot showing the yearly population change of India in different years
fig = plt.figure(figsize=(10,5))
plt.plot(india['Year'], india['Yearly Change'])
plt.title('Yearly Population Change in India')
plt.xlabel('Year')
plt.ylabel('Yearly Change in 10 Million')
plt.show()

**INFERENCE**
> **It can be seen that India's yearly population change was highest around 2000**

In [None]:
#plotting violin plots to visualise the distribution of the data and its probability density
fig,ax = plt.subplots(nrows=2, ncols=2, figsize=(10,10))
plt.suptitle('Violin Plots',fontsize=24)
sns.violinplot(x="Migrants (net)", data=data,ax=ax[0,0],palette="Set1")
sns.violinplot(x="World Population", data=data,ax=ax[0,1],palette="Set2")
sns.violinplot (x ='Yearly Change', data=data, ax=ax[1,0],palette="Set2")
sns.violinplot(x='Country\'s Share of World Pop %', data=data, ax=ax[1,1],palette="Set1")

plt.show()

> **The white dot in the middle is the median value and the thick black bar in the centre represents the interquartile range. The thin black line extended from it represents the upper (max) and lower (min) adjacent values in the data.**

In [None]:
data.drop(['Country'], axis=1, inplace=True)
data.drop(['Migrants (net)'], axis=1,inplace=True)
data.drop(['Density (P/Km²)'], axis=1,inplace=True)

In [None]:
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
scaled_data = standard_scaler.fit_transform(data)

# K Means Clustering

In [None]:
from sklearn.cluster import KMeans

In [None]:
# nan_to_num returns a new array, so the result has to be assigned back
scaled_data = np.nan_to_num(scaled_data)

In [None]:
# Using the elbow method to find the optimal number of clusters
X =scaled_data[: , :]   #taking all the columns into account
wcss = []
for n in range(1 , 11):
    algorithm = (KMeans(n_clusters = n ,init='k-means++', n_init = 10 ,max_iter=300))
    algorithm.fit(X)
    wcss.append(algorithm.inertia_)
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 11) , wcss , 'o')
plt.plot(np.arange(1 , 11) , wcss , '-' )
plt.xlabel('Number of Clusters') , plt.ylabel('WCSS')
plt.title('Elbow Method Diagram')
plt.show()


> **It can be seen that the optimal number of clusters is 3**
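
The quantity on the y-axis of the elbow diagram, WCSS (reported by scikit-learn as `inertia_`), is the sum of squared distances from each point to its cluster centroid. The sketch below computes it by hand on a tiny invented point set. The helper `wcss` and the arrays are illustrative assumptions, not part of the analysis above.

```python
import numpy as np

def wcss(points, labels):
    """Sum of squared distances from every point to the centroid of its cluster."""
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)

# Hypothetical data: three tight groups of three points each
points = np.array([
    [0, 0], [0, 1], [1, 0],
    [10, 10], [10, 11], [11, 10],
    [20, 0], [20, 1], [21, 0],
], dtype=float)

one_cluster = np.zeros(len(points), dtype=int)   # everything in a single cluster
three_clusters = np.repeat([0, 1, 2], 3)         # one label per group

print(wcss(points, one_cluster), wcss(points, three_clusters))
```

WCSS drops sharply until the number of clusters matches the number of natural groups and only slowly afterwards, which is what produces the "elbow".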

In [None]:
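# astype returns a new Series, so assign it back for the conversion to take effect
data['Urban Pop %'] = data['Urban Pop %'].astype('int64')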

In [None]:
# Fitting K-Means to the dataset
algorithm = (KMeans(n_clusters = 3 ,init='k-means++', n_init = 10 ,max_iter=300, 
                        tol=0.0001,  random_state= 823) )
algorithm.fit(X)
labels = algorithm.labels_

In [None]:
data['Cluster'] = labels
data.tail(10)

In [None]:
#representing which features are important to the clustering using RandomForest Feature Importance Plot
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state = 823)
df_dv = data.copy()
df_dv.drop('Cluster', axis = 1, inplace = True)
rfc.fit(df_dv,data['Cluster'])
features = df_dv.columns.tolist()
feature_value = rfc.feature_importances_
d = {'Features' : features, 'Values' : feature_value}
fi = pd.DataFrame(d).sort_values('Values', ascending = False).reset_index()
fi
plt.rcParams['figure.figsize'] = (20.0, 5.0)
ax = sns.barplot(x=fi['Features'], y = fi['Values'], data = fi, palette="Blues_d")

> **Fertility Rate has the highest importance for separating the clusters, followed by Median Age**

In [None]:
data

In [None]:
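# positions 0, 4, 5, 9 after the drops above, assuming the column order listed at the top:
# Year, Median Age, Fertility Rate, World Population
X = data.iloc[:, [0,4,5,9]].values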

In [None]:
#Some zoomed in biplots
fig, axs = plt.subplots(ncols=2,nrows=2, figsize = (15,15))
sns.scatterplot(x="Fertility Rate", y="Year", hue="Cluster",
                     palette = 'colorblind',data =data , legend = False, s = 100, ax=axs[0][0])
sns.scatterplot(x="Fertility Rate", y="Population", hue="Cluster",
                     palette = 'colorblind', data=data , legend = False, s = 100, ax=axs[0][1])
sns.scatterplot(x="Median Age", y="World Population", hue="Cluster",
                     palette = 'colorblind', data = data, legend = False, s = 100,  ax=axs[1][0])
sns.scatterplot(x="Year", y="World Population", hue="Cluster",
                     palette = 'colorblind', data = data, legend = False, s = 100, ax=axs[1][1])

In [None]:
#visualisation of pairplots using top 5 important features for the clusters
sns.pairplot(data[['Fertility Rate','Median Age','Year','World Population','Urban Pop %','Cluster']],palette = 'colorblind',hue='Cluster');

In [None]:
#obtain the principal components
from sklearn.decomposition import PCA
pca=PCA(n_components=2)
principal_comp=pca.fit_transform(X)
principal_comp

In [None]:
#create dataframe with two components
pca_df=pd.DataFrame(data=principal_comp,columns=['pca1','pca2'])
pca_df.head()

In [None]:
preds=pd.Series(KMeans(n_clusters=3).fit_predict(pca_df))

In [None]:
#concatenate the cluster labels to the dataframe
pca_df=pd.concat([pca_df,preds],axis=1)

In [None]:
pca_df.columns=['pca1','pca2','Cluster']

In [None]:
pca_df

In [None]:
plt.figure(figsize=(7,7))
ax=sns.scatterplot(x='pca1',y='pca2',hue=pca_df.Cluster.tolist(),data=pca_df,palette=['red','blue','green'])
ax.legend(title='Cluster')
plt.show()


# Hierarchical Clustering

In [None]:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram,linkage

In [None]:
# Using the dendrogram to find the optimal number of clusters(ward linkage)
import scipy.cluster.hierarchy as sch
fig = plt.figure(figsize =(12,12),facecolor='w')
dendrogram=sch.dendrogram(sch.linkage(X,method='ward'))
plt.title("Dendrogram",fontsize=20)
plt.xlabel('X',fontsize=12)
plt.ylabel('Euclidean Distances',fontsize=12)
plt.show()

In [None]:
from sklearn.cluster import AgglomerativeClustering
# ward linkage always uses Euclidean distance, so the distance argument is left at its default
hc = AgglomerativeClustering(n_clusters = 3, linkage = 'ward')
y_hc = hc.fit_predict(X)

In [None]:
#obtain the principal components
from sklearn.decomposition import PCA
pca=PCA(n_components=2)
principal_comp=pca.fit_transform(X)
principal_comp

In [None]:
#create dataframe with two components
pca_df_agglomerative=pd.DataFrame(data=principal_comp,columns=['pca1','pca2'])
pca_df_agglomerative.head()

In [None]:
preds=pd.Series(AgglomerativeClustering(n_clusters=3).fit_predict(pca_df_agglomerative))

In [None]:
pca_df_agglomerative=pd.concat([pca_df_agglomerative,preds],axis=1)

In [None]:
pca_df_agglomerative.columns=['pca1','pca2','Cluster']

In [None]:
pca_df_agglomerative

In [None]:
pca_df_agglomerative.columns=['pca1','pca2','Cluster']
pca_df_agglomerative.head()

In [None]:
plt.figure(figsize=(7,7))
ax=sns.scatterplot(x='pca1',y='pca2',hue=pca_df_agglomerative.Cluster.tolist(),data=pca_df_agglomerative,palette=['red','blue','green'])
ax.legend(title='Cluster')
plt.show()

In [None]:
# Using the dendrogram to find the optimal number of clusters(complete linkage)
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X, method = 'complete'))
plt.title('Dendrogram')
plt.xlabel('data points')
plt.ylabel('Euclidean distances')
plt.show()

In [None]:
# Using the dendrogram to find the optimal number of clusters(single linkage)
import scipy.cluster.hierarchy as sch
# create the figure first so the dendrogram is drawn on it
fig = plt.figure(figsize =(12,12),facecolor='w')
dendrogram = sch.dendrogram(sch.linkage(X, method = 'single'))
plt.title('Dendrogram')
plt.xlabel('data points')
plt.ylabel('Euclidean distances')
plt.show()

In [None]:
best_cols = ['Fertility Rate','Median Age','World Population','Year']
# copy so that adding the 'cluster' column later does not touch `data`
data_final = data[best_cols].copy()

In [None]:
# create a 'cluster' column
data_final['cluster'] = labels
best_cols.append('cluster')

In [None]:
data_final['cluster'].value_counts().plot.bar(figsize=(10,5),color = list('rgbkymc'), title='Entries by cluster');


In [None]:
sns.pairplot(data_final[best_cols], hue='cluster', x_vars=['Fertility Rate','Median Age','World Population','Year'],
            y_vars=['cluster'],
            height=3, aspect=1)

#  Performance measures

## a) Silhouette Analysis

In [None]:
from sklearn.metrics import silhouette_score  
no_of_clusters = [3,4, 5,6] 
silhouette_coeff = []
for n_clusters in no_of_clusters: 
  
    cluster = KMeans(n_clusters = n_clusters) 
    cluster_labels = cluster.fit_predict(X) 
  
    # The silhouette_score gives the  
    # average value for all the samples. 
    silhouette_avg = silhouette_score(X, cluster_labels) 
    silhouette_coeff.append(silhouette_avg)
    print("For no of clusters =", n_clusters, 
          " The average silhouette_score is :", silhouette_avg)

In [None]:
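from sklearn.metrics import silhouette_score  
# exclude the 'Cluster' label column so earlier assignments do not leak into the features
features = data.drop(columns=['Cluster'])
silhouette_coefficients = []
for k in range(3, 6):
    kmeans = KMeans(n_clusters = k, init = 'k-means++', max_iter=300, n_init=10)
    kmeans.fit(features)
    score = silhouette_score(features, kmeans.labels_)
    silhouette_coefficients.append(score)
    print("For no of clusters =", k, " The average silhouette_score is :", score)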

## b) Davies-Bouldin Index

In [None]:
from sklearn.metrics import davies_bouldin_score 

In [None]:
kmeans=KMeans(n_clusters=3,random_state=1).fit(X)
# to store the cluster labels 
klabels = kmeans.labels_ 

print(davies_bouldin_score(X, klabels)) 

**INFERENCE**
> **The lower the Davies-Bouldin index, the better the clustering: clusters are more compact and further apart.**
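
To make the "lower is better" reading concrete, the sketch below scores two labelings of the same invented points with `davies_bouldin_score`. The point set and both label arrays are hypothetical and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Hypothetical data: three tight, well-separated groups
points = np.array([
    [0, 0], [0, 1], [1, 0],
    [10, 10], [10, 11], [11, 10],
    [20, 0], [20, 1], [21, 0],
], dtype=float)

good_labels = np.repeat([0, 1, 2], 3)  # each group gets its own label
bad_labels = np.tile([0, 1, 2], 3)     # labels cut across the groups

good_score = davies_bouldin_score(points, good_labels)
bad_score = davies_bouldin_score(points, bad_labels)
print("good:", good_score, "bad:", bad_score)
```

The labeling that follows the natural groups gets the smaller index, because its clusters are compact relative to the distance between their centroids.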

# DBSCAN

**DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular unsupervised clustering method. It groups points that lie in high-density regions into clusters and treats points in low-density regions as noise.**
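
A minimal sketch of this behaviour on invented points follows. The coordinates and the `eps` and `min_samples` values are chosen for the toy example only and say nothing about good settings for the population data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense groups and one isolated point
dense_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
dense_b = dense_a + 5.0
outlier = np.array([[20.0, 20.0]])
points = np.vstack([dense_a, dense_b, outlier])

# eps: neighbourhood radius; min_samples: neighbours needed to form a dense region
model = DBSCAN(eps=0.5, min_samples=3).fit(points)
labels = model.labels_

print(labels)  # the isolated point receives the noise label -1
```

Unlike K-Means, the number of clusters is not given in advance. It follows from `eps` and `min_samples`, and points that belong to no dense region are labelled `-1`.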

In [None]:
from sklearn.cluster import DBSCAN

In [None]:
#taking fertility rate and yearly % change
X = data.iloc[:, [2,5]].values 
X

In [None]:
model=DBSCAN(eps=0.25, min_samples=10)
model.fit(X)

In [None]:
model.labels_

In [None]:
fig,ax = plt.subplots(figsize=(6,5))
ax.scatter(X[:,0], X[:,1] , c=model.labels_)
fig.show()

# INDIA

In [None]:
# Storing the rows for India in a new DataFrame
Ind=ps.loc[ps['Country']=='India'].copy()

In [None]:
Ind.head()

#### Population change and India's contribution to it

In [None]:
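# label both lines so they can be told apart in the legend
plt.plot(Ind['Year'], Ind['Population'], color='g', label='India')
plt.plot(Ind['Year'], Ind['World Population'], color='orange', label='World')
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('Population Change')
plt.legend()
plt.show()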

#### Change in India's population density over the years

In [None]:
plt.plot(Ind['Year'], Ind['Density (P/Km²)'])
plt.gca().invert_yaxis()
plt.show()

##### Sudden change in past few years

# Multiple Regression

In [None]:
# creating new dataframe
ds1=pd.DataFrame()
ds1['Year']=Ind['Year']
ds1['Fertility Rate']=Ind['Fertility Rate']
ds1['Migrants (net)']=Ind['Migrants (net)']
ds1['Population']=Ind['Population']

In [None]:
ds1.head()

In [None]:
# our features
X = ds1[['Fertility Rate', 'Migrants (net)']]
y = ds1['Population']

In [None]:
# Testing and training dataset split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

In [None]:
# Building model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
# Beta coefficients of our model
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df

##### Our  β Coefficients value in Multiple Regression

In [None]:
# predicting the value
y_pred = regressor.predict(X_test)

In [None]:
# Actual and Predicted value
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1=df
df

In [None]:
# Difference in actual and predicted value
df1.plot(kind='bar')

##### Our Regression model performs quite well as can be seen from the graph

**R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model**

**Root mean squared error tells you how concentrated the data is around the line of best fit.**
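
Both measures can be written out from their definitions. The sketch below uses invented actual and predicted values (`y_true`, `y_hat` are illustrative names, not the regression output from this notebook).

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

residuals = y_true - y_hat
ss_res = (residuals ** 2).sum()                 # variation left unexplained by the model
ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total variation around the mean

r2 = 1.0 - ss_res / ss_tot                      # share of variance explained
rmse = np.sqrt((residuals ** 2).mean())         # typical error, in the units of y

print("R^2:", r2, "RMSE:", rmse)
```

These are the same quantities that `r2_score` and the square root of `mean_squared_error` return.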

In [None]:
# accuracy check
from sklearn import metrics
from sklearn.metrics import mean_squared_error, r2_score
rmsd = np.sqrt(mean_squared_error(y_test, y_pred))      
r2_value = r2_score(y_test, y_pred)                     

print("Root Mean Square Error :", rmsd)
print("R^2 Value :", r2_value)

##### An R^2 value close to 1 indicates that the model explains most of the variance in the test data.


# Forecasting

In [None]:
ds1.head()

In [None]:
# Dropping irrelevant features
ds1.drop(['Fertility Rate','Migrants (net)'],axis=1,inplace=True)

In [None]:
# making year as index
ds1.set_index('Year',inplace=True)

In [None]:
ds1.head()

In [None]:
Test=ds1[:5] 
Train=ds1[5:]

## Naive Forecasting : 
**A forecasting technique in which the last period's actual value is used as the forecast for the following periods, without adjusting it or attempting to establish causal factors. It is used mainly as a baseline for comparison with forecasts generated by more sophisticated techniques.**
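
As a small illustration of the idea, the sketch below repeats the last observed value over a forecast horizon. The function `naive_forecast` and the numbers in `train` are hypothetical.

```python
import numpy as np

def naive_forecast(history, horizon):
    """Repeat the last observed value `horizon` times."""
    history = np.asarray(history, dtype=float)
    return np.full(horizon, history[-1])

# Hypothetical past observations, oldest first
train = [100.0, 110.0, 125.0]

forecast = naive_forecast(train, horizon=3)
print(forecast)
```

Because the forecast is a flat line, any trend in the series shows up directly as forecast error, which is why this method is usually treated as a baseline.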

In [None]:
# Naive forecast - repeat the most recent training observation over the test period
y_hat = Test.copy()
# sort by year so iloc[-1] is the latest training value, whatever the row order
y_hat['naive'] = Train.sort_index()['Population'].iloc[-1]
plt.plot(Train.index, Train['Population'], label='Train')
plt.plot(Test.index,Test['Population'], label='Test')
plt.plot(y_hat.index,y_hat['naive'], label='Naive Forecast')
plt.legend(loc='best')
plt.title("Naive Forecast")
plt.show()

##### So, the naive forecast gives a rough baseline for how our country's population may vary in the coming years

## THANK YOU !!!