This kernel explores the data with a focus on the countries with the highest suicide rates, measured as the number of suicides per 100k population.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

Now we double-check where our input is.

In [None]:
!ls '../input/'

In [None]:
sns.set(style="whitegrid")

Finally, we read our data. It is grouped by country, year, sex, age, and generation, and it also includes GDP for the year and GDP per capita. Now let's explore further.

In [None]:
suicides_df = pd.read_csv('../input/suicide-rates-overview-1985-to-2016/master.csv')
suicides_df.head()

In [None]:
suicides_df.info()

In [None]:
suicides_df.isnull().sum()

In [None]:
suicides_df.shape

We saw that HDI is missing for about 19k of the 27k rows. That much missing data is troublesome, and we have no good way to fill it in, so we simply drop this feature. We're also dropping country-year, since it's just a combination of country and year.

In [None]:
suicides_df.drop(['HDI for year', 'country-year'], axis=1, inplace=True)
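
The missing-data check that motivates this drop can be expressed as a share per column rather than a raw count. The snippet below is a minimal sketch on a made-up toy frame (not the real dataset); the 0.5 threshold is an arbitrary value chosen for illustration, not something the kernel prescribes.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for suicides_df; the values are made up for illustration.
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'HDI for year': [np.nan, 0.7, np.nan, np.nan],
    'suicides_no': [10, 12, 3, 5],
})

# isnull() gives a boolean frame; the column-wise mean is the share of missing rows.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Columns above a chosen (hypothetical) threshold are candidates for dropping.
threshold = 0.5
too_sparse = missing_share[missing_share > threshold].index.tolist()
print(too_sparse)
```

On the real data, about 19k missing out of 27k rows corresponds to a share of roughly 0.7 for HDI.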

Notice that when we called .info(), we saw an inconsistency in the column names: gdp_for_year ($) has leading and trailing whitespace. We clean this up by passing a lambda that calls strip() on each column name.

In [None]:
suicides_df.rename(columns=lambda x: x.strip(), inplace=True)
suicides_df.columns

First, let's look at the raw suicide count per country, then plot it as a bar plot using the seaborn package.

In [None]:
suic_total = suicides_df[['country','suicides_no']].groupby('country').sum()
suic_total = suic_total.reset_index()
suic_total.sort_values(by=['suicides_no'], ascending=False, inplace=True)
suic_total.head()

In [None]:
fig, ax = plt.subplots(figsize=(20,5))
sns.barplot(y='suicides_no', x='country', data=suic_total.head(10), ax=ax, palette=sns.color_palette('cubehelix'))
plt.title('Countries with highest number of suicides in the last 3 decades (1985 - 2016)')
plt.ylabel('count')

Notice how the Russian Federation has the highest suicide count in the world, followed by the US, Japan, France, Ukraine, Germany, South Korea, Brazil, Poland, and the UK.

Now let's check whether the suicide rate gives the same results.

In [None]:
suic_rate = suicides_df[['country','suicides/100k pop']].groupby('country').sum()
suic_rate = suic_rate.reset_index()
suic_rate.sort_values(by=['suicides/100k pop'], ascending=False, inplace=True)
suic_rate.head()

In [None]:
fig, ax = plt.subplots(figsize=(20,5))
sns.barplot(y='suicides/100k pop', x='country', data=suic_rate.head(10), ax=ax, palette=sns.color_palette('cubehelix'))
plt.title('Countries with highest rate (suicides per 100k population) of suicides in the last 3 decades (1985 - 2016)')
plt.ylabel('suicide rate')

Interestingly, Russia also has the highest suicide rate in the world. Japan, Korea, and Ukraine are still in the top 10 as well. Lithuania, Hungary, Kazakhstan, Austria, Finland, and Belgium now join the top 10, since we are considering the suicide count relative to the population.
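
One caveat about this ranking: it sums the per-row rates, so a country that reports more years (or more sex and age rows) accumulates a larger total. A possible cross-check is a pooled rate, total suicides divided by total population. The sketch below uses made-up numbers and assumes a `population` column alongside `suicides_no`; it is an illustration of the difference, not the kernel's method.

```python
import pandas as pd

# Made-up rows: country A reports two years, country B only one.
toy = pd.DataFrame({
    'country':     ['A', 'A', 'B'],
    'year':        [2000, 2001, 2000],
    'suicides_no': [10, 10, 30],
    'population':  [100_000, 100_000, 200_000],
})
toy['suicides/100k pop'] = toy['suicides_no'] / toy['population'] * 100_000

# Summing per-row rates, as the ranking above does.
summed = toy.groupby('country')['suicides/100k pop'].sum()

# Pooled rate: total suicides over total population, per 100k.
totals = toy.groupby('country')[['suicides_no', 'population']].sum()
pooled = totals['suicides_no'] / totals['population'] * 100_000

comparison = pd.DataFrame({'summed': summed, 'pooled': pooled})
print(comparison)
```

In this toy case the summed figure ranks A first only because A has two rows, while the pooled rate ranks B first. Whether the real top 10 changes under a pooled rate would need to be checked on the actual data.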

From here on, we're more interested in the rate. We'll use it for the next graphs, focusing on the 10 countries with the highest suicide rate.

In [None]:
top_countries = suic_rate.head(10)['country']
top_countries.reset_index(drop=True)

Let's now look at the suicide rate by sex, to see how it is distributed between males and females.

In [None]:
suic_gender = suicides_df.loc[suicides_df['country'].isin(top_countries)]
suic_gender = suic_gender[['country', 'sex', 'suicides/100k pop']].groupby(['country', 'sex']).sum()
suic_gender = suic_gender.reset_index()
suic_gender.sort_values(by=['suicides/100k pop'], ascending=False, inplace=True)
suic_gender.head()

In [None]:
fig, ax = plt.subplots(figsize=(20,5))
sns.barplot(y='suicides/100k pop', x='country', hue='sex', data=suic_gender, ax=ax, palette=sns.color_palette('cubehelix', 2))
plt.title('Suicide rate by sex in the countries with the highest suicide rate (1985 - 2016)')
plt.ylabel('suicide rate')

As expected, the suicide rate is higher in males. Several studies have suggested that males are more prone to suicide due to a number of factors, such as male stereotypes and gender roles.
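
To put a number on the gap, one option is a male-to-female ratio per country. This is a minimal sketch on made-up values shaped like `suic_gender`; the `'male'` and `'female'` labels and the `male_to_female` column name are assumptions for illustration.

```python
import pandas as pd

# Made-up summed rates per country and sex, shaped like suic_gender.
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'sex': ['male', 'female', 'male', 'female'],
    'suicides/100k pop': [40.0, 10.0, 30.0, 15.0],
})

# One row per country, one column per sex.
wide = toy.pivot(index='country', columns='sex', values='suicides/100k pop')

# A ratio above 1 means the male figure is higher than the female one.
wide['male_to_female'] = wide['male'] / wide['female']
print(wide.sort_values('male_to_female', ascending=False))
```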

Next, let's check the rate by sex for each year.

In [None]:
suic_year = suicides_df.loc[suicides_df['country'].isin(top_countries)]
suic_year = suic_year[['country', 'sex', 'suicides/100k pop', 'year']].groupby(['country', 'sex', 'year']).sum()
suic_year = suic_year.reset_index()
suic_year.sort_values(by=['year', 'suicides/100k pop'], ascending=(True, False), inplace=True)
suic_year.head()

In [None]:
# catplot draws one facet per country and dodges the bars by sex,
# so male and female bars sit side by side instead of overlapping.
bps = sns.catplot(data=suic_year, x='year', y='suicides/100k pop', hue='sex',
                  col='country', col_wrap=1, kind='bar', height=3, aspect=3.5,
                  palette=sns.color_palette('cubehelix', 2))
bps.set_ylabels('suicide rate')
# Use the country name alone as each facet title
bps.set_titles('{col_name}')
bps.set_xticklabels(rotation=90)

Again, males are more prone to suicide. However, in Finland from 1985 to 1997 the female rate appears to be equal to or higher than the male rate, and Ukraine almost follows this trend.

Now let's look at the rate by age.

In [None]:
suic_age = suicides_df.loc[suicides_df['country'].isin(top_countries)]
suic_age = suic_age[['country', 'age', 'suicides/100k pop']].groupby(['country', 'age']).sum()
suic_age = suic_age.reset_index()
suic_age['age'] = suic_age['age'].map({'5-14 years': 1, '15-24 years': 2, '25-34 years': 3, 
                        '35-54 years': 4, '55-74 years': 5, '75+ years': 6})
suic_age.sort_values(by=['suicides/100k pop', 'age'], ascending=(False, True), inplace=True)
suic_age.head()

In [None]:
fig, ax = plt.subplots(figsize=(20,5))
colors = sns.color_palette('cubehelix', 6)
sns.barplot(y='suicides/100k pop', x='country', hue='age', data=suic_age, ax=ax, palette=colors)
plt.title('Countries with highest suicide rate in the last 3 decades (1985 - 2016) categorised by age')
plt.ylabel('suicide rate')
# Reuse the handles seaborn created (they already carry the palette colors)
# and replace the numeric age codes with readable labels.
handles, _ = ax.get_legend_handles_labels()
ax.legend(handles, ['5-14 yo', '15-24 yo', '25-34 yo', '35-54 yo', '55-74 yo', '75+ yo'])


In almost all of the top 10 countries, the 75+ age group has the highest suicide rate. The exceptions are Lithuania and Finland, where the 35-54 age group is higher than the others.

Sadly, we can see that the summed rate for the 5-14 age group reaches about 200 in Kazakhstan. The same age group also shows a nonzero rate in Korea, Austria, Russia, Lithuania, Belgium, Ukraine, Japan, and Finland.
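
A side note on the age codes: the labels were mapped to integers so that they sort in age order, which is why the legend has to be relabelled afterwards. An ordered categorical is one alternative that sorts correctly and keeps the original labels. The sketch below uses made-up values; only the age labels come from the data.

```python
import pandas as pd

age_order = ['5-14 years', '15-24 years', '25-34 years',
             '35-54 years', '55-74 years', '75+ years']

# Made-up rows in scrambled order.
toy = pd.DataFrame({
    'age': ['75+ years', '5-14 years', '35-54 years', '15-24 years'],
    'suicides/100k pop': [30.0, 1.0, 20.0, 8.0],
})

# A plain string sort would put '15-24 years' before '5-14 years'.
string_sorted = sorted(toy['age'])
print(string_sorted)

# An ordered categorical sorts by the declared order and keeps the labels.
toy['age'] = pd.Categorical(toy['age'], categories=age_order, ordered=True)
toy = toy.sort_values('age').reset_index(drop=True)
print(toy)
```

The same idea would apply to the generation labels, with the category list written in birth order.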

Now let's look at the rate by generation.

In [None]:
suic_gen = suicides_df.loc[suicides_df['country'].isin(top_countries)]
suic_gen = suic_gen[['country', 'generation', 'suicides/100k pop']].groupby(['country', 'generation']).sum()
suic_gen = suic_gen.reset_index()
suic_gen['generation'] = suic_gen['generation'].map({'G.I. Generation': 1, 'Silent': 2, 'Boomers': 3, 
                        'Generation X': 4, 'Millennials': 5, 'Generation Z': 6})
suic_gen.sort_values(by=['suicides/100k pop', 'generation'], ascending=(False, True), inplace=True)
suic_gen.head()

In [None]:
generations = ['G.I. Generation', 'Silent', 'Boomers', 'Generation X', 'Millennials', 'Generation Z']

In [None]:
fig, ax = plt.subplots(figsize=(20,5))
colors = sns.color_palette('cubehelix', 6)
sns.barplot(y='suicides/100k pop', x='country', hue='generation', data=suic_gen, ax=ax, palette=colors)
plt.title('Countries with highest suicide rate in the last 3 decades (1985 - 2016) categorised by generation')
plt.ylabel('suicide rate')

# Reuse the handles seaborn created (they already carry the palette colors)
# and replace the numeric generation codes with the generation names.
handles, _ = ax.get_legend_handles_labels()
ax.legend(handles, generations)

The Silent generation seems to be the most prone, with the highest summed rate across the top countries by suicide rate. Again, though, Finland has almost the same value for the Silent and Boomer generations.

Now let's see whether GDP is related to the suicide rate.

In [None]:
gdp = suicides_df[['country', 'gdp_per_capita ($)']].groupby('country').sum()
gdp = gdp.reset_index()
gdp.sort_values(by=['gdp_per_capita ($)'], ascending=False, inplace=True)
gdp.head()

In [None]:
top_countries = top_countries.tolist()

In [None]:
top_gdp = gdp.head(25)
# Highlight countries that are also in the top 10 by suicide rate
clrs = ['red' if country in top_countries else 'grey' for country in top_gdp['country']]

In [None]:
fig, ax = plt.subplots(figsize=(20,7))
bp = sns.barplot(y='gdp_per_capita ($)', x='country', data=top_gdp, ax=ax, palette=clrs)
bp.set_xticklabels(bp.get_xticklabels(), rotation=90)
plt.title('Countries with highest GDP per capita in the last 3 decades (1985 - 2016)')
plt.ylabel('GDP per capita ($)')

Interestingly, high GDP per capita does not appear to be closely tied to a country's suicide rate. Among the 25 countries with the highest GDP per capita from 1985 to 2016, only Japan, Austria, Finland, and Belgium are also among the top 10 countries by suicide rate.
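
The overlap between two top-N lists is only a rough indicator. A correlation coefficient over all countries would be a more direct check. The sketch below runs on made-up per-country figures, so its output says nothing about the real data; it only shows the calculation.

```python
import pandas as pd

# Made-up per-country figures; real values would come from suicides_df.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E'],
    'gdp_per_capita ($)': [5_000, 12_000, 30_000, 45_000, 60_000],
    'suicides/100k pop': [12.0, 25.0, 9.0, 14.0, 11.0],
}).set_index('country')

gdp_pc = toy['gdp_per_capita ($)']
rate = toy['suicides/100k pop']

# Pearson correlation measures linear association.
pearson = gdp_pc.corr(rate)

# Ranking both series first gives a Spearman-style rank correlation
# (done by hand here so that only pandas is needed).
spearman = gdp_pc.rank().corr(rate.rank())

print(round(pearson, 3), round(spearman, 3))
```

A value near zero on the real data would support the observation above, while a clearly positive or negative value would call for a closer look.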