# Introduction
Suicide has been a serious public health problem in many countries for many years.
In this study, we analyze suicide rates across countries to understand the trends and gain some insights. We then cluster the countries to look for patterns.

The study is divided into two main parts:

1. Exploratory Data Analysis
2. Clustering

First, we'll import some libraries for the study.


In [None]:
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
import geopandas as gpd
import plotly.express as px
%matplotlib inline

The dataset consists of a single CSV file, which we load with the Pandas library. The name `suicide_raw_df` indicates that this is the raw data, which still needs further processing.

In [None]:
suicide_raw_df = pd.read_csv('../input/suicide-rates-overview-1985-to-2016/master.csv')
suicide_raw_df.head(5)

After loading the dataset, we will go to the next step of preprocessing the data for the analysis.

# Data Preprocessing
Let's select a subset of columns with the relevant data for our analysis. 

In [None]:
suicide_raw_df.columns

In [None]:
selected_columns = [
                    'country', 'year', 'sex', 'age', 'suicides_no', 'population',
                    'suicides/100k pop', 'HDI for year', ' gdp_for_year ($) ', 
                    'gdp_per_capita ($)', 'generation'
                    ]
len(selected_columns)

We will extract a copy of the data from these columns into a new dataframe `suicide_df`, so that we can keep modifying it without affecting the original dataframe.

In [None]:
suicide_df = suicide_raw_df[selected_columns].copy()
suicide_df.head()

Let's view some basic information about the data frame first.

In [None]:
suicide_df.shape

The dataframe contains 11 columns and 27820 rows. 
Let's look at the list of columns in the dataframe.

In [None]:
suicide_df.columns

It seems we have to rename some of the columns to make them tidier and easier to work with.

In [None]:
suicide_df = suicide_df.rename(columns={'suicides/100k pop': 'suicides/100k_pop',
                   'HDI for year': 'HDI_for_year',
                   ' gdp_for_year ($) ': 'gdp_for_year',
                   'gdp_per_capita ($)': 'gdp_per_capita'}
                  )
suicide_df.columns

In [None]:
suicide_df.info()

Five columns have the data type object, and six columns are numeric.

It appears that the column `HDI_for_year` contains many missing values, since its non-null count is lower than the total number of rows (27820). We'll drop this column, which leaves 10 columns for processing.

In [None]:
suicide_df.drop(columns=['HDI_for_year'], inplace=True)
suicide_df.info()

As for the column `gdp_for_year`, we need to convert it from object to a numeric type. To make the analysis easier, we also divide the values by 1000000000 so that the unit is billions.

In [None]:
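# Use an explicit 64-bit integer: GDP values exceed the 32-bit range on platforms where `int` is 32-bit.
suicide_df["gdp_for_year"] = suicide_df["gdp_for_year"].str.replace(",", "").astype("int64") / 1_000_000_000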
suicide_df.info()

Now view some basic statistics about numeric columns.

In [None]:
suicide_df.describe()

The numeric data seems valid. 

We've cleaned up and prepared the dataset for analysis. Let's see some samples from the dataframe.

In [None]:
suicide_df.sample(5)

# Exploratory Data Analysis & Data Visualization
First, we are going to explore the variables to understand how representative the data is of countries worldwide.

## Country
Let's look at the number of countries that are included in this data.

In [None]:
suicide_df.country.nunique()

## Suicide Number & Suicide Rate
The suicide rate for a year is calculated by dividing the number of suicides by the population of the respective country and multiplying by 100,000.
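As a quick worked example of this formula, here is a minimal sketch. The helper name `rate_per_100k` and the numbers are illustrative assumptions, not part of the notebook's pipeline.

```python
def rate_per_100k(suicides: int, population: int) -> float:
    """Suicides per 100,000 people."""
    return suicides / population * 100_000

# Illustrative numbers only: 21 suicides in a group of 312,900 people.
print(round(rate_per_100k(21, 312_900), 2))  # 6.71
```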

### Suicide Number & Suicide Rate by Years

In [None]:
plt.figure(figsize=(18,5))

plt.subplot(121)
overall_suicides_no = suicide_df.groupby(['year'], as_index=False)['suicides_no'].sum()
sns.lineplot(x='year', y='suicides_no', data=overall_suicides_no)
plt.title('Global Suicide Number by Years')

plt.subplot(122)
sns.lineplot(x='year', y='suicides/100k_pop', data=suicide_df)
plt.title('Overall Suicide Rate by Years');

As we can see, the number of suicides peaked around 1999, while the highest suicide rate was recorded in 1995. The lowest number of suicides was in 2016 and the lowest suicide rate was in 2011.

Both the overall number of suicides and the suicide rate show a decreasing trend from 1995 to 2015.

Although the number of suicides stayed between 200,000 and 250,000 during this period, the population grew at the same time, which makes the decrease in the suicide rate quite significant.

In [None]:
suicide_df.groupby(['year'])['population'].sum().plot(label='population');
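One point worth keeping in mind when reading the rate plot: averaging the per-row `suicides/100k_pop` values gives every demographic group the same weight, regardless of its size. The sketch below uses made-up toy numbers (not the dataset) to contrast that unweighted mean with a pooled rate computed from total suicides and total population.

```python
import pandas as pd

# Toy rows (illustrative numbers, not taken from the dataset): two demographic
# groups of very different size in the same year.
toy = pd.DataFrame({
    "year": [2000, 2000],
    "suicides_no": [10, 100],
    "population": [100_000, 10_000_000],
})
toy["rate"] = toy["suicides_no"] / toy["population"] * 100_000

# Unweighted mean of the per-row rates (what averaging the rate column does).
mean_of_rates = toy.groupby("year")["rate"].mean()

# Pooled rate: total suicides divided by total population, per 100,000.
totals = toy.groupby("year")[["suicides_no", "population"]].sum()
pooled_rate = totals["suicides_no"] / totals["population"] * 100_000

print(mean_of_rates.loc[2000])
print(round(pooled_rate.loc[2000], 2))
```

The two numbers can differ a lot when group sizes are unequal, so it may be worth checking both before drawing conclusions about a trend.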

### Suicide Number & Suicide Rate by Countries
We can see the top 10 countries with the highest suicide number and suicide rate.

In [None]:
plt.figure(figsize=(25,5))

plt.subplot(121)
suicides_no_country_df = suicide_df.groupby('country')['suicides_no'].sum().sort_values(ascending=False).head(10)
suicides_no_country_df.sort_values().plot.barh()
plt.title('Top 10 Countries with Highest Suicide Number')
plt.xlabel('suicide number')

plt.subplot(122)
suicides_rate_country_df = suicide_df.groupby('country')['suicides/100k_pop'].mean().sort_values(ascending=False).head(10)
suicides_rate_country_df.sort_values().plot.barh()
plt.title('Top 10 Countries with Highest Suicide Rate')
plt.xlabel('suicide rate');

The Russian Federation has the highest total number of suicides (around 1.2M), followed by the United States and Japan. Lithuania has the highest suicide rate, followed by Sri Lanka and the Russian Federation.

Let's also look at the 10 countries with the lowest suicide numbers and suicide rates.

In [None]:
plt.figure(figsize=(25,5))

plt.subplot(121)
suicides_no_country_low_df = suicide_df.groupby('country')['suicides_no'].sum().sort_values(ascending=True).head(10)
suicides_no_country_low_df.sort_values(ascending=False).plot.barh()
plt.title('Countries with Top 10 Lowest Suicide Numbers')
plt.xlabel('suicide number')

plt.subplot(122)
suicides_rate_country_low_df = suicide_df.groupby('country')['suicides/100k_pop'].mean().sort_values(ascending=True).head(10)
suicides_rate_country_low_df.sort_values(ascending=False).plot.barh()
plt.title('Countries with Top 10 Lowest Suicide Rates')
plt.xlabel('suicide rate');

As shown in the plots, Dominica and Saint Kitts and Nevis have the lowest suicide numbers and suicide rates.

Next, we'll make a geospatial visualization of the suicide numbers and suicide rates in different countries over the years.

In [None]:
suicides_no_country_year_df = suicide_df[['country','year', 'suicides_no']]
px.choropleth(suicides_no_country_year_df, locations='country', 
              locationmode='country names', 
              animation_frame=suicides_no_country_year_df['year'], 
              color=np.log(suicides_no_country_year_df['suicides_no']),
              color_continuous_scale=px.colors.sequential.Viridis,
              title='World Suicides Numbers in different countries over years')

In [None]:
suicide_rate_country_year_df = suicide_df[['country','year', 'suicides/100k_pop']]
px.choropleth(suicide_rate_country_year_df, locations='country', 
              locationmode='country names', 
              animation_frame=suicide_rate_country_year_df['year'], 
              color=np.log(suicide_rate_country_year_df['suicides/100k_pop']),
              color_continuous_scale=px.colors.sequential.Viridis,
              title='World Suicides Rate in different countries over years')
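A caveat on the log scale used for the colors: `np.log(0)` is `-inf`, and the dataframe has several rows per country and year (one per sex and age group). The following is only a sketch on made-up toy data of one way to prepare a single finite value per country and year before plotting; the column name `log_suicides` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up numbers): several demographic groups per country and year.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2000, 2000, 2000],
    "suicides_no": [0, 9, 0, 0],
})

# One value per country and year, so each map region gets a single color.
per_country_year = toy.groupby(["country", "year"], as_index=False)["suicides_no"].sum()

# log1p(x) = log(1 + x) stays finite at zero, unlike log(0) = -inf.
per_country_year["log_suicides"] = np.log1p(per_country_year["suicides_no"])
print(per_country_year)
```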

To see the trends in the top 10 countries with the highest suicide numbers and rates, we will use line plots.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

suicides_no_top10country_year_df = suicide_df.loc[suicide_df['country'].isin(suicides_no_country_df.index)]
suicides_no_top10country_year_df = suicides_no_top10country_year_df.groupby(['year', 'country'])['suicides_no'].sum()
suicides_no_top10country_year_df.unstack().plot(style='-o', ax=ax1)
ax1.legend(loc='upper right', fontsize=7.5)
ax1.set_title('Top 10 countries with highest suicide numbers over years')
ax1.set_ylabel('suicide number')

suicides_rate_top10country_year_df = suicide_df.loc[suicide_df['country'].isin(suicides_rate_country_df.index)]
suicides_rate_top10country_year_df = suicides_rate_top10country_year_df.groupby(['year', 'country'])['suicides/100k_pop'].mean()
suicides_rate_top10country_year_df.unstack().plot(style='-o', ax=ax2)
ax2.legend(loc='upper right', fontsize=7.5)
ax2.set_title('Top 10 countries with highest suicide rates over years')
ax2.set_ylabel('suicide rate');

The United States and the Republic of Korea show a significant increasing trend and Brazil shows a gradual increase, while the Russian Federation (the country with the highest total number of suicides over the years) shows a significant decreasing trend in the number of suicides.

As for the suicide rates, all of the ten countries with the highest suicide rates show a decreasing trend over the years.

Let's see whether the three countries (the United States, Brazil and the Republic of Korea) also show an increasing trend in suicide rates.

In [None]:
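# Use a separate name so the top-10 series `suicides_rate_country_df` from earlier is not overwritten.
suicides_rate_country_year_df = suicide_df.groupby(['year', 'country'], as_index=False)['suicides/100k_pop'].mean()
px.line(suicides_rate_country_year_df, x='year', y='suicides/100k_pop', animation_frame='country', range_y=[0,55])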

From the interactive line plot above, we can see that the Republic of Korea, Suriname and Guyana show an increasing trend in suicide rates over the years.

The suicide rates of Brazil and the United States appear steady over the years. This is probably due to population growth: although their suicide numbers increase year by year, their suicide rates stay steady.

### Suicide Number & Suicide Rate by Continents
Now, let's look at the suicide number and suicide rate by continent. To get the continent data, we will use the dataset bundled with the geopandas library. We need to rename its `name` column to match our data.

In [None]:
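# Note: `gpd.datasets` was deprecated in geopandas 0.14 and removed in 1.0, so this call needs geopandas < 1.0.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))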
world.rename(columns={'name': 'country'}, inplace=True)
world.head()

Now that the `world` data is ready, we can merge the dataframes to visualize the overall suicide number and suicide rate in different continents and observe which continent has the highest suicide number and rate.

Some country names need to be renamed first.

In [None]:
suicide_df[~suicide_df['country'].isin(world['country'])].country.unique()

Some countries are missing from the geopandas data; we'll ignore them for this visualization.

In [None]:
suicide_map_df = suicide_df.copy()
# Map dataset names to the names used in `world`. Dominica is deliberately left
# unmapped: it is a different country from the Dominican Republic.
suicide_map_df['country'] = suicide_map_df['country'].replace({
    'Bosnia and Herzegovina': 'Bosnia and Herz',
    'Czech Republic': 'Czechia',
    'Republic of Korea': 'South Korea',
    'Russian Federation': 'Russia',
    'United States': 'United States of America'})
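To check which country names still fail to match after renaming, a left merge with `indicator=True` can help. This is a minimal sketch on toy frames with hypothetical contents (`world_toy`, `ours_toy`), not the real tables.

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical names, not the real data).
world_toy = pd.DataFrame({"country": ["Russia", "Czechia"],
                          "continent": ["Europe", "Europe"]})
ours_toy = pd.DataFrame({"country": ["Russia", "Czechia", "Aruba"]})

# A left merge keeps every row of `ours_toy`; `_merge` records whether a
# match was found in `world_toy`.
checked = ours_toy.merge(world_toy, on="country", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "country"].tolist()
print(unmatched)  # ['Aruba']
```

Rows listed in `unmatched` would be silently dropped by the default inner merge, so it is useful to inspect them before aggregating by continent.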

In [None]:
suicide_no_map_df = pd.merge(world, suicide_map_df[['country', 'suicides_no', 'sex']], on='country')
suicide_no_map_df = suicide_no_map_df.groupby(['continent', 'sex'], as_index=False)['suicides_no'].sum()
suicide_no_map_df

In [None]:
suicide_rate_map_df = pd.merge(world, suicide_map_df[['country', 'suicides/100k_pop', 'sex']], on='country')
suicide_rate_map_df = suicide_rate_map_df.groupby(['continent', 'sex'], as_index=False)['suicides/100k_pop'].mean()
suicide_rate_map_df

In [None]:
plt.figure(figsize=(20,5))

plt.subplot(121)
sns.barplot(data=suicide_no_map_df, x='suicides_no', y='continent', hue='sex')
plt.title('Suicide Numbers in Different Continents')

plt.subplot(122)
sns.barplot(data=suicide_rate_map_df, x='suicides/100k_pop', y='continent', hue='sex')
plt.title('Average Suicide Rates in Different Continents');

As shown in the plot, Europe has the highest suicide number and suicide rate. We can also see that most of the people who died by suicide are male.

Let's see the trends of suicide number and suicide rate in the different continents over the years.

In [None]:
suicide_no_map_df = pd.merge(world, suicide_map_df[['country', 'suicides_no', 'year']], on='country')
suicide_no_map_df = suicide_no_map_df.groupby(['continent', 'year'], as_index=False)['suicides_no'].sum()

suicide_rate_map_df = pd.merge(world, suicide_map_df[['country', 'suicides/100k_pop', 'year']], on='country')
suicide_rate_map_df = suicide_rate_map_df.groupby(['continent', 'year'], as_index=False)['suicides/100k_pop'].mean()

fig, ax = plt.subplots(1, 2, figsize=(20, 5))

# Draw on the axes created above instead of creating new subplots.
sns.lineplot(data=suicide_no_map_df, x='year', y='suicides_no', hue='continent', ax=ax[0])
ax[0].set_title('Suicide Number by Continents')

sns.lineplot(data=suicide_rate_map_df, x='year', y='suicides/100k_pop', hue='continent', ax=ax[1])
ax[1].set_title('Suicide Rates by Continents');

Europe has the highest suicide numbers over the years, followed by Asia and North America.

As for suicide rates, Europe, Asia and North America show a decreasing trend from around 1995, whereas South America shows an increasing trend from 1996 onward.

### Suicide Number & Suicide Rate by Sex
As mentioned in the previous section, more males than females died by suicide. Let's look at the sex distribution of the total number of suicides.

In [None]:
suicides_sex_df = suicide_df.groupby('sex')['suicides_no'].sum()
suicides_sex_df.plot(kind='pie', labels=suicides_sex_df.index, autopct='%1.1f%%');
plt.title('Sex Distribution of Total Suicide Deaths')
plt.ylabel(None);

Next, let's see the number of suicides by sex over the years.

In [None]:
suicides_sex_year_df = suicide_df.groupby(['year','sex'])['suicides_no'].sum()
suicides_sex_year_df.unstack().plot.line(style='.-')
plt.ylabel('suicide number')
plt.title('Number of Suicides by Sex over Years');

In 1989, the number of suicides rose sharply, especially among males. We can see that most of the people who died by suicide are male throughout the years. Now, let's look at the suicide rates by sex in each country over the years to check whether this holds in every country.

In [None]:
suicides_rate_sex_df = suicide_df.groupby(['year', 'country','sex'], as_index=False)['suicides/100k_pop'].mean()
px.line(suicides_rate_sex_df, x='year', y='suicides/100k_pop',animation_frame='country', range_y=[0,70], color='sex')

Let's see if there's a country having a higher average suicide rate for females than for males.

In [None]:
suicides_rate_sex_df = suicide_df.groupby(['country','sex'])['suicides/100k_pop'].mean().to_frame().unstack()
suicides_rate_sex_df[suicides_rate_sex_df[('suicides/100k_pop', 'female')] > suicides_rate_sex_df[('suicides/100k_pop', 'male')]]

In [None]:
sns.lineplot(data=suicide_df[suicide_df.country == 'Maldives'], x='year', y='suicides/100k_pop', hue='sex', estimator='mean');

As we can see here, in 2001 the average suicide rate for females in the Maldives was higher than for males, which was enough to raise the country's overall average female rate above the male rate.

### Suicide Number & Suicide Rate by Age
The age distribution of the people who died by suicide is another crucial factor to look at.
First, let's look at the age distribution of the total number of suicides. A pie chart is a good way to visualize the distribution.

In [None]:
suicides_age_df = suicide_df.groupby('age')['suicides_no'].sum()
suicides_age_df.plot(kind='pie', labels=suicides_age_df.index, autopct='%1.1f%%');
plt.title('Age Distribution of Total Suicide Deaths')
plt.ylabel(None);

It appears that a large percentage of the people who died by suicide were 35-54 years old.

Next, let's see the suicide numbers and suicide rates by age group over the years.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

suicides_age_year_df = suicide_df.groupby(['year','age'])['suicides_no'].sum()
suicides_age_year_df.unstack().plot.line(style='.-', ax=ax[0])
ax[0].set_ylabel('suicide number')
ax[0].set_title('Number of Suicides by Age Group over Years')

suicides_age_year_df = suicide_df.groupby(['year','age'])['suicides/100k_pop'].mean()
suicides_age_year_df.unstack().plot.line(style='.-', ax=ax[1])
ax[1].set_ylabel('suicide rate')
ax[1].set_title('Age Distribution for Suicide Rates over Years');

It seems that most of the people who died by suicide were aged 35-54.

The suicide rate for people aged 75+ is the highest over the years. The suicide rate for people aged 5-14 remains steady, and for the other age groups the suicide rates decrease from 1995 onwards.

### Suicide Number & Suicide Rate by Generation
Let's have a look at the unique values of the column `generation`.

In [None]:
suicide_df.generation.unique()

Here are the birth years usually associated with each generation.
- Greatest Generation (G.I. Generation): 1901–1927
- Silent Generation: 1928–1945
- Baby Boomers: 1946–1964
- Generation X: 1965–1980
- Millennials (Generation Y): 1981–1996
- Generation Z: 1997–2010


Now, let's look at the generation distribution of the total number of suicides using a pie chart.

In [None]:
suicides_gen_df = suicide_df.groupby('generation')['suicides_no'].sum()
suicides_gen_df.plot(kind='pie', labels=suicides_gen_df.index, autopct='%1.1f%%');
plt.title('Generation Distribution of Total Suicide Deaths')
plt.ylabel(None);

Most of the people who died by suicide belong to the Baby Boomers, the Silent Generation and Generation X.

Let's see the suicide numbers and suicide rates by generation over the years.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

suicides_gen_year_df = suicide_df.groupby(['year','generation'])['suicides_no'].sum()
suicides_gen_year_df.unstack().plot.line(style='.-', ax=ax[0])
ax[0].set_ylabel('suicide number')
ax[0].set_title('Number of Suicides by Generation over Years')

suicides_gen_year_df = suicide_df.groupby(['year','generation'])['suicides/100k_pop'].mean()
suicides_gen_year_df.unstack().plot.line(style='.-', ax=ax[1])
ax[1].set_ylabel('suicide rate')
ax[1].set_title('Generation Distribution for Suicide Rates over Years');

The Boomers hit the highest suicide number in 1994.

As seen in the previous section, the suicide rate for people aged 75+ is the highest over the years. This plot shows the same pattern, with the G.I. Generation having the highest suicide rates. The highest suicide rate occurred in 1995.

## GDP and Per Capita GDP
GDP is the main indicator of a country’s economic productivity. The GDP of a country shows the market value of the goods and services it produces.

Per capita gross domestic product (GDP) is a metric that breaks down a country's economic output per person and is calculated by dividing the GDP of a country by its population. 

GDP per capita is usually analyzed together with GDP to compare a country's productivity with that of other countries, since it takes into account both GDP and population.

### GDP and GDP Per Capita by Years

The values in the columns `gdp_for_year` and `gdp_per_capita` are repeated on every row for a given country and year, so we need to drop the duplicates first.

In [None]:
suicide_df[(suicide_df.country == 'Albania') & (suicide_df.year == 1987)]

In [None]:
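# Keep `country` so that two countries with the same value in the same year are not merged into one row.
overall_GDP = suicide_df[['country', 'year', 'gdp_for_year']].drop_duplicates()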
overall_GDP

In [None]:
overall_GDP_per_capita = suicide_df[['country', 'year', 'gdp_per_capita']].drop_duplicates()
overall_GDP_per_capita
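One thing to watch when de-duplicating on a subset of columns is which columns define a "duplicate". The sketch below, on made-up toy rows, shows how leaving out the country column can drop a country whose value happens to equal another country's value in the same year.

```python
import pandas as pd

# Toy rows (made-up numbers): two countries that happen to share the same
# per-capita GDP in the same year, each repeated for two demographic groups.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2000, 2000, 2000],
    "gdp_per_capita": [5000, 5000, 5000, 5000],
})

without_country = toy[["year", "gdp_per_capita"]].drop_duplicates()
with_country = toy[["country", "year", "gdp_per_capita"]].drop_duplicates()

print(len(without_country))  # 1: country B is lost
print(len(with_country))     # 2: one row per country and year
```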

In [None]:
plt.figure(figsize=(18,5))

plt.subplot(121)
overall_GDP = overall_GDP.groupby(['year'], as_index=False)['gdp_for_year'].sum()
sns.lineplot(x='year', y='gdp_for_year', data=overall_GDP)
plt.title('Global GDP by Years')
plt.ylabel('GDP for year (billion)')

plt.subplot(122)
overall_GDP_per_capita = overall_GDP_per_capita.groupby(['year'], as_index=False)['gdp_per_capita'].mean()
sns.lineplot(x='year', y='gdp_per_capita', data=overall_GDP_per_capita)
plt.title('Average GDP per Capita by Years')
plt.ylabel('Per capita GDP');

In [None]:
overall_GDP.loc[overall_GDP.gdp_for_year.idxmax()]  # idxmax returns an index label, so use .loc

In [None]:
overall_GDP_per_capita.loc[overall_GDP_per_capita.gdp_per_capita.idxmax()]  # idxmax returns an index label, so use .loc

As we can see, the total GDP and the average GDP per capita both peaked in 2013. Both plots show a significant increasing trend.

## Correlation Analysis
Correlation analysis quantifies the strength of the linear relationship between two variables.
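For reference, the Pearson coefficient (the default method of pandas' `DataFrame.corr()`) can be computed by hand as the covariance divided by the product of the standard deviations. This is a small sketch on made-up values, not on the dataset.

```python
import numpy as np

# Made-up values for two variables.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r: covariance divided by the product of the standard deviations.
x_c = x - x.mean()
y_c = y - y.mean()
r = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

print(round(r, 4))  # 0.7746
```

Because it only measures linear association, a value near zero does not rule out a non-linear relationship.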

Let's select some features to carry out the analysis.

In [None]:
col = ['suicides_no','population','suicides/100k_pop','gdp_for_year','gdp_per_capita']
correlations = suicide_df[col].corr()

ax = sns.heatmap(
    correlations, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.color_palette("magma", as_cmap=True),
    square=True,
    annot=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);

The correlations between the features are low, except for population with suicide number and population with GDP for year. These pairs show a strong positive correlation, which is reasonable.

There is a moderate positive association between suicide number and GDP for year: when the GDP for year is higher, the suicide number tends to be higher as well. Features that do not appear in this dataset may have stronger associations with the suicide number.

# Clustering
In this part, we're going to cluster countries based on their suicide rates, GDP per capita and population.

GDP per capita is, in theory, the share of a country's economic output attributable to each individual. It is a much better indicator of living standards than GDP alone.

To take population into account, we use suicide rate instead of suicide number for this analysis.

In [None]:
df = suicide_df.groupby(['country'])[['suicides/100k_pop', 'gdp_per_capita', 'population']].mean()
df.head(5)

## Data Preprocessing for Modeling

All distance-based algorithms, such as K-means clustering, are affected by the scale of the features. Without scaling, the algorithm would be biased towards features with a larger magnitude. We therefore scale all the features so that the model can better determine the similarity between data points. Since the features have incomparable units, it is recommended to standardize them.
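Standardization here means z-scoring each feature column. The following is a minimal NumPy sketch on a made-up matrix (`X_toy` is an assumption for illustration) of what that transformation does.

```python
import numpy as np

# Made-up feature matrix: rows are countries, columns are features on very
# different scales (e.g. a rate, a per-capita amount, a head count).
X_toy = np.array([
    [5.0, 1_000.0, 2_000_000.0],
    [10.0, 20_000.0, 50_000_000.0],
    [15.0, 45_000.0, 8_000_000.0],
])

# z-score each column: subtract the column mean, divide by the column
# standard deviation (population std, ddof=0, as StandardScaler does).
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_std.mean(axis=0).round(6))
print(X_std.std(axis=0).round(6))
```

After the transformation every column has mean 0 and standard deviation 1, so no single feature dominates the Euclidean distance.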


In [None]:
from sklearn.preprocessing import StandardScaler

X = df.values

scaler = StandardScaler()
X_transformed = scaler.fit_transform(X)

## Modeling

In this section, we will cluster the countries with two algorithms: K-means clustering and hierarchical clustering.

### K Means Clustering

The K-means clustering algorithm computes the cluster centroids and iterates until the centroids stop changing. It assumes that the number of clusters is already known. Because the result depends on the starting centroids, it is recommended to try several different initializations. To determine the best number of clusters, we have to do hyperparameter tuning.
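To make the iteration concrete, here is a minimal NumPy sketch of the assign-then-update loop (Lloyd's algorithm) on made-up points. The function `kmeans_sketch` is a hypothetical helper for illustration; it is not scikit-learn's implementation, which adds smarter initialization and multiple restarts.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated made-up groups of points.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_sketch(X_toy, k=2)
print(labels)
```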

#### Hyperparameter Tuning

We will run our model through different number of clusters, to pick the best model.



In [None]:
from sklearn.cluster import KMeans

result = []

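for i in range(2,15):
    # n_init=10 runs K-means from 10 different centroid initializations and keeps the best run
    model = KMeans(n_clusters=i, n_init=10, random_state=4)
    model.fit(X_transformed)
    result.append(model.inertia_)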

sns.lineplot(x=range(2,15), y=result, marker="o")
plt.title('Sum of squared distances of samples to their closest cluster center')
plt.xlabel('Number of clusters')
plt.ylabel('SSE');

According to the elbow method, if the plot looks like an arm, the elbow of the arm marks the optimal k. Here the elbow is at k = 4. We can also use a Python package, kneed, to identify the elbow point programmatically.

In [None]:
!pip install kneed

In [None]:
from kneed import KneeLocator
kl = KneeLocator(range(2,15), result, curve="convex", direction="decreasing")
kl.elbow
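For intuition, the elbow can also be approximated by hand: rescale the curve to the unit square and pick the point that lies farthest below the straight line joining its two ends. The sketch below is a simple heuristic in that spirit, with a hypothetical helper `elbow_by_chord` and made-up SSE values; it is not a reimplementation of kneed and assumes a decreasing curve.

```python
import numpy as np

def elbow_by_chord(ks, sse):
    """Return the k whose point lies farthest below the first-to-last chord."""
    ks = np.asarray(ks, dtype=float)
    sse = np.asarray(sse, dtype=float)
    # Rescale both axes to [0, 1] so neither dominates the comparison.
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (sse - sse.min()) / (sse.max() - sse.min())
    # For a decreasing curve the chord runs from (0, 1) to (1, 0),
    # i.e. y = 1 - x; the gap below it is (1 - x) - y.
    gap = (1.0 - x) - y
    return int(ks[gap.argmax()])

# Made-up SSE values for k = 1..6.
print(elbow_by_chord([1, 2, 3, 4, 5, 6], [100, 50, 20, 15, 12, 10]))  # 3
```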

Let's build our K means clustering model by using k=4.

In [None]:
model = KMeans(n_clusters=4, n_init=10, random_state=4)
model.fit(X_transformed)

cluster = model.predict(X_transformed)
suicide_kmeans = df.copy()
suicide_kmeans['cluster'] = cluster
suicide_kmeans.head(6)

#### Result

First, we visualize the clusters together with the cluster centers found by the k-means algorithm.

In [None]:
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import ListedColormap

cmap = ListedColormap(sns.color_palette("rainbow"))
centers = model.cluster_centers_

fig = plt.figure(figsize=(8,5))
ax = fig.add_subplot(111, projection='3d')
sc = ax.scatter(X_transformed[:,0], X_transformed[:,1], X_transformed[:,2], c=cluster, cmap=cmap)
ax.scatter(centers[:, 0], centers[:, 1], centers[:, 2], c='green', s=50)
ax.set_title('Clusters of Countries (K Means Clustering Model)')
ax.set_xlabel('suicides/100k_pop')
ax.set_ylabel('gdp_per_capita')
ax.set_zlabel('population')
ax.legend(*sc.legend_elements(), bbox_to_anchor=(1.05, 0.95), loc=2, title='cluster');

Then, we can also look at the members of each cluster.

In [None]:
for i in range(4):
    print(f'Countries belonging to cluster {i}:')
    print(list(suicide_kmeans.groupby('cluster').groups[i]))
    print()

In [None]:
# Dominica is deliberately not renamed: it is a different country from the Dominican Republic.
suicide_kmeans.rename(index={'Bosnia and Herzegovina': 'Bosnia and Herz',
                            'Czech Republic': 'Czechia',
                            'Republic of Korea': 'South Korea',
                            'Russian Federation': 'Russia',
                            'United States': 'United States of America'}, inplace=True)

In [None]:
suicide_map_df = pd.merge(world, suicide_kmeans.reset_index(), on='country')
fig, ax = plt.subplots(figsize = (15, 5))
ax.set_title("Clusters of Countries (K Means Clustering Model)")
suicide_map_df.plot(column='cluster', ax = ax, legend=True, legend_kwds={'label': "cluster"});

We can create a bar chart to show the characteristics of each cluster.

In [None]:
df_kmeans = suicide_kmeans.copy()
df_kmeans['gdp_per_capita'] = df_kmeans['gdp_per_capita'] / 1000
df_kmeans['population'] = df_kmeans['population'] / 1000000
df_kmeans.rename(columns={'gdp_per_capita':'gdp_per_capita (thousand)',
                   'population': 'population (million)'}, inplace=True)
df_kmeans.head(5)

In [None]:
df_kmeans = pd.melt(df_kmeans, id_vars="cluster", var_name="features", value_name="value")
df_kmeans.head(5)

In [None]:
sns.catplot(x='cluster', y='value', hue='features', data=df_kmeans, kind='bar');

We can interpret the clusters as:

| Cluster | Suicide Rate | GDP Per Capita | Population |
| --- | --- | --- | --- |
| 0 | High | Low | Low |
| 1 | Medium/High | Medium | High |
| 2 | Low | Low | Low |
| 3 | Medium | High | Low |

As we can observe from the clustering result, countries with high suicide rates tend to have low or medium GDP per capita, and countries with low suicide rates tend to have low GDP per capita and small populations.
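A table like the one above can be cross-checked numerically by averaging each feature within each cluster. The sketch below uses made-up values and labels (the frame `toy` is an assumption for illustration), not the notebook's results.

```python
import pandas as pd

# Made-up per-country averages with cluster labels already assigned.
toy = pd.DataFrame({
    "suicides/100k_pop": [20.0, 24.0, 3.0, 5.0],
    "gdp_per_capita": [4_000.0, 6_000.0, 3_000.0, 5_000.0],
    "cluster": [0, 0, 2, 2],
}, index=["A", "B", "C", "D"])

# One row per cluster, one column per feature.
profile = toy.groupby("cluster").mean()
print(profile)
```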

### Agglomerative Hierarchical Clustering
Hierarchical clustering is another distance-based clustering method like k-means clustering. It is also used to group the unlabeled data points having similar characteristics together. We will plot the dendrogram of the datapoints to observe the optimal number of clusters.


In [None]:
import scipy.cluster.hierarchy as sch

dendrogram = sch.dendrogram(sch.linkage(X_transformed, method="ward"))
plt.title('Dendrogram')
plt.xlabel('Countries')
plt.ylabel('Euclidean distances')
plt.xticks([])
plt.show()

To determine the optimal number of clusters from this diagram, we generally set a threshold so that it cuts the longest vertical line. In this case, we set the threshold at 8. A horizontal line drawn at 8 intersects 4 vertical lines, so the number of clusters will be 4.
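Cutting a dendrogram at a given height can also be done programmatically with SciPy's `fcluster`. The following is a minimal sketch on made-up points with an arbitrary threshold of 5, chosen only for this toy example.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Two tight, well-separated pairs of made-up points.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

Z = sch.linkage(X_toy, method="ward")
# Cut the tree at height 5: merges above that height are undone.
labels = sch.fcluster(Z, t=5, criterion="distance")
print(len(set(labels)))  # 2
```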

In [None]:
from sklearn.cluster import AgglomerativeClustering 

# Ward linkage always uses Euclidean distance; the `affinity` argument was removed in scikit-learn 1.4.
model = AgglomerativeClustering(n_clusters=4, linkage='ward')
cluster = model.fit_predict(X_transformed)
suicide_ahc = df.copy()
suicide_ahc['cluster'] = cluster
suicide_ahc.head(5)

#### Result


In [None]:
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import ListedColormap

cmap = ListedColormap(sns.color_palette('rainbow'))

fig = plt.figure(figsize=(8,5))
ax = fig.add_subplot(111, projection='3d')
sc = ax.scatter(X_transformed[:,0], X_transformed[:,1], X_transformed[:,2], c=cluster, cmap=cmap)
ax.set_title('Clusters of Countries (Agglomerative Hierarchical Clustering Model)')
ax.set_xlabel('suicides/100k_pop')
ax.set_ylabel('gdp_per_capita')
ax.set_zlabel('population')
ax.legend(*sc.legend_elements(), bbox_to_anchor=(1.05, 0.95), loc=2, title='cluster');

Now, let's look at the members of each cluster.

In [None]:
for i in range(4):
    print(f'Countries belonging to cluster {i}:')
    print(list(suicide_ahc.groupby('cluster').groups[i]))
    print()

In [None]:
# Dominica is deliberately not renamed: it is a different country from the Dominican Republic.
suicide_ahc.rename(index={'Bosnia and Herzegovina': 'Bosnia and Herz',
                            'Czech Republic': 'Czechia',
                            'Republic of Korea': 'South Korea',
                            'Russian Federation': 'Russia',
                            'United States': 'United States of America'}, inplace=True)

In [None]:
suicide_map_df = pd.merge(world, suicide_ahc.reset_index(), on='country')
fig, ax = plt.subplots(figsize = (15, 5))
ax.set_title("Clusters of Countries (Agglomerative Hierarchical Clustering)")
suicide_map_df.plot(column='cluster', ax = ax, legend=True, legend_kwds={'label': "cluster"});

In [None]:
df_ahc = suicide_ahc.copy()
df_ahc['gdp_per_capita'] = df_ahc['gdp_per_capita'] / 1000
df_ahc['population'] = df_ahc['population'] / 1000000
df_ahc.rename(columns={'gdp_per_capita':'gdp_per_capita (thousand)',
                   'population': 'population (million)'}, inplace=True)
df_ahc = pd.melt(df_ahc, id_vars="cluster", var_name="features", value_name="value")
sns.catplot(x='cluster', y='value', hue='features', data=df_ahc, kind='bar');

We can interpret the clusters as:

| Cluster | Suicide Rate | GDP Per Capita | Population |
| --- | --- | --- | --- |
| 0 | Low/Medium | Medium | High | 
| 1 | Low | Low | Low |
| 2 | High | Low | Low |
| 3 | Medium | High | Low |

This time, we can see that countries with high suicide rates tend to have low GDP per capita and small populations, while countries with low suicide rates tend to have low or medium GDP per capita.

# Conclusion

There is a decreasing trend in the overall suicide rate. Europe has the highest suicide numbers and suicide rates. Among the countries, the Republic of Korea shows the most significant increase in suicide rate over the years. This may be due to environmental, social, and economic reasons. The government should take action against the suicide problem in the Republic of Korea.

Many factors can contribute to suicide. Males tend to have a higher risk of suicide. People aged 35 to 54 account for the largest number of suicides, which may be related to midlife crisis, while people aged 75+ have the highest suicide rate. From the correlation analysis, GDP does not appear to be strongly related to the suicide rate. Other factors that are not included in this dataset may better explain the suicide problem.

We also clustered the countries with two different algorithms, K-means clustering and agglomerative hierarchical clustering, to see how the observations separate based on GDP per capita, suicide rate and population. The clustering results also suggest that other factors may contribute to the suicide problem globally. A country with a low GDP per capita can have either a high or a low suicide rate. GDP per capita may be related to the suicide rate, but the relationship is not strong enough to identify it as the key factor behind the suicide problem.