# **Historical suicide rates: data visualization**
This dataset contains the following columns:
* country
* year
* sex
* age
* suicides_no
* population
* suicides/100k pop
* country-year
* HDI for year
* gdp_for_year
* gdp_per_capita
* generation


My goal is to explore the basic features of the data and analyse the patterns in it. Understanding the factors that influence suicide is an important step towards reducing the number of cases.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os 
import warnings
from numpy import cov
from scipy.stats import pearsonr
warnings.filterwarnings('ignore')


Let's take a first look at the data.

In [None]:
df=pd.read_csv('../input/suicide-rates-overview-1985-to-2016/master.csv')
print(df.columns)

In [None]:
df.head(20)

In [None]:
df.describe()
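
`describe()` summarises the numeric columns, but gaps in the data are easy to miss there. A quick sketch of a missing-value check follows. The frame `toy` is made up for illustration, and the idea that a column such as `HDI for year` has gaps is an assumption to verify against the real file.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; these are not values from master.csv.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "suicides_no": [10, 12, 3, 4],
    "HDI for year": [np.nan, 0.7, np.nan, 0.8],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data the same line would be `df.isna().mean().sort_values(ascending=False)`.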

In [None]:
df2 = df.rename({'gdp_per_capita ($)': 'gdp_per_capita', ' gdp_for_year ($) ': 'gdp_for_year'}, axis='columns')
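
After the rename, `gdp_for_year` may still not be numeric. The sketch below assumes the raw column holds strings with thousands separators (for example `"2,156,624,900"`); that format is an assumption, and the rows are made up.

```python
import pandas as pd

# Made-up rows that mimic the raw column names; the string format of
# ' gdp_for_year ($) ' is an assumption about the source file.
raw = pd.DataFrame({
    "country": ["A", "A", "B"],
    " gdp_for_year ($) ": ["2,156,624,900", "2,500,000,000", "63,067,077,179"],
    "gdp_per_capita ($)": [796, 900, 2309],
})

clean = raw.rename(columns={
    " gdp_for_year ($) ": "gdp_for_year",
    "gdp_per_capita ($)": "gdp_per_capita",
})

# Strip the thousands separators, then convert to integers.
clean["gdp_for_year"] = (
    clean["gdp_for_year"].str.replace(",", "", regex=False).astype("int64")
)
print(clean.dtypes)
```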



The first step is to see how the number of suicides has changed over the years.

In [None]:
plt.figure(figsize=(15,6))
# Pass x and y as keywords; recent seaborn versions reject positional data arguments
sns.lineplot(x=df2.year, y=df2.suicides_no, ci=None)

It looks like the number of suicides rose rapidly until approximately 1993 and then started to decline.
Another phase of growth began in 2011-2012, but it soon stopped (hopefully!), and after 2014 there was a dramatic fall.
But these are raw counts that do not account for population. What about the suicides/population ratio?

In [None]:
df_sumByYear = df2.groupby(["year"]).agg({'gdp_per_capita': 'mean', 'suicides_no' : 'mean', 'population' : "mean"}).sort_values(by=['suicides_no'],  ascending=False).reset_index()


df_sumByYear["Suicides/Population"] = df_sumByYear.suicides_no/df_sumByYear.population
plt.figure(figsize=(15,6))

sns.lineplot(x=df_sumByYear["year"], y=df_sumByYear["Suicides/Population"], ci=None)
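
The ratio above divides one yearly mean by another. Within a year both means are taken over the same rows, so this equals total suicides divided by total population. Below is a small sketch of the same idea, written with sums and scaled to "per 100k" like the dataset's own `suicides/100k pop` column. The frame `toy` and the column name `rate_per_100k` are made up for illustration.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "suicides_no": [10, 30, 6, 18],
    "population": [100_000, 300_000, 100_000, 100_000],
})

# Sum counts and population per year, then scale to a rate per 100k people.
by_year = toy.groupby("year")[["suicides_no", "population"]].sum()
by_year["rate_per_100k"] = by_year["suicides_no"] * 100_000 / by_year["population"]
print(by_year)
```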

Could it be explained by the significant rise in GDP over recent years?

## Suicides and GDP:

In [None]:
plt.figure(figsize=(15,6))
sns.lineplot(x=df2.year, y=df2.gdp_per_capita, ci=None)

Let's look at the GDP of the first and last 10 countries ranked by suicide rate.

In [None]:
df_sumByCountry = df2.groupby(["country","sex"]).agg({'gdp_per_capita': 'mean', 'suicides_no' : 'mean', 'population' : "mean"}).sort_values(by=['suicides_no'],  ascending=False).reset_index()

df_sumByCountry.head(5)

df_sumByCountry["Suicides/Population"] = df_sumByCountry.suicides_no/df_sumByCountry.population
df_sumByCountry = df_sumByCountry.sort_values(by=['Suicides/Population'],  ascending=False)
df_sumByCountry.tail(30)
df_sumByCountry["Suicides/Population"].mean()
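
Note that the grouping above is by country and sex, so each country contributes two rows, and `head(10)` lists country/sex pairs rather than ten distinct countries. If one row per country is wanted, a possible sketch looks like this (toy data; `per_country` and `rate_per_100k` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "sex": ["male", "female"] * 3,
    "suicides_no": [30, 10, 8, 2, 45, 15],
    "population": [100_000] * 6,
})

# Collapse both sexes into one row per country before ranking.
per_country = toy.groupby("country")[["suicides_no", "population"]].sum()
per_country["rate_per_100k"] = (
    per_country["suicides_no"] * 100_000 / per_country["population"]
)

top = per_country["rate_per_100k"].nlargest(2)      # highest rates
bottom = per_country["rate_per_100k"].nsmallest(2)  # lowest rates
print(top)
print(bottom)
```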

In [None]:
plt.figure(figsize=(15,6))
plt.bar(df_sumByCountry["country"].head(10), df_sumByCountry["Suicides/Population"].head(10),color=(0.2, 0.1, 0.1, 0.6))
plt.title('10 countries with highest Suicides/Population rate')
plt.xlabel('Countries', fontsize=12)
plt.ylabel('Suicide/Population', fontsize=12)

Interesting fact: almost all of the top 10 countries are post-Soviet...

In [None]:
plt.figure(figsize=(15,6))
plt.bar(df_sumByCountry["country"].tail(10), df_sumByCountry["Suicides/Population"].tail(10),color=(0.8, 0.8, 0.8, 0.6))
plt.title('10 countries with lowest Suicides/Population rate')
plt.xlabel('Countries', fontsize=12)
plt.ylabel('Suicide/Population', fontsize=12)

Let's take a look at bar charts of GDP per capita for the countries at the top and bottom of the list.

In [None]:
plt.figure(figsize=(15,6))
plt.bar(df_sumByCountry["country"].head(10), df_sumByCountry["gdp_per_capita"].head(10),color=(0.2, 0.1, 0.1, 0.6))
plt.title('GDP for countries with highest Suicides/Population rate')

In [None]:
plt.figure(figsize=(15,6))
plt.bar(df_sumByCountry["country"].tail(10), df_sumByCountry["gdp_per_capita"].tail(10),color=(0.8, 0.8, 0.8, 0.6))
plt.title('GDP for countries with lowest Suicides/Population rate')

I'm going to create a new column that rates the Suicides/Population value as very high, high, moderate or low.
To choose the thresholds, let's take a look at the Suicides/Population boxplot.


In [None]:
ax = sns.boxplot(x=df_sumByCountry["Suicides/Population"])


In [None]:
conditions = [
    (df_sumByCountry["Suicides/Population"] >= 0.0003),
    (df_sumByCountry["Suicides/Population"] < 0.0003) & (df_sumByCountry["Suicides/Population"] >= 0.00015),
    (df_sumByCountry["Suicides/Population"] < 0.00015) & (df_sumByCountry["Suicides/Population"] >= 0.000075),
    (df_sumByCountry["Suicides/Population"] < 0.000075) & (df_sumByCountry["Suicides/Population"] >= 0.000)]
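values = ["Very high", "High", "Moderate", "Low"]
# A string default keeps the result dtype consistent (the implicit default is the integer 0)
df_sumByCountry["Rating"] = np.select(conditions, values, default="Unknown")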
df_sumByCountry.head(20)
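
The same banding could also be written with `pd.cut`, which keeps the thresholds in one list. This is just an alternative sketch on made-up ratios, using the thresholds from the cell above; `right=False` makes each band include its lower edge, matching the `>=` conditions.

```python
import numpy as np
import pandas as pd

# Made-up ratios for illustration; thresholds copied from the conditions above.
rates = pd.Series([0.00035, 0.0003, 0.0002, 0.0001, 0.00005, 0.0])

bins = [0.0, 0.000075, 0.00015, 0.0003, np.inf]
labels = ["Low", "Moderate", "High", "Very high"]

# right=False gives intervals of the form [lower, upper).
rating = pd.cut(rates, bins=bins, labels=labels, right=False)
print(rating)
```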

In [None]:
plt.figure(figsize=(7,5))
ax = sns.boxplot(x="Rating", y="gdp_per_capita", data=df_sumByCountry)
ax.set(ylim=(0, 80000))

The boxplots show that the lowest median GDP per capita is in the group with a low suicides/population ratio, which is surprising. At the same time, the group with the highest ratio has a similar median GDP per capita. It looks like there's no connection.

In [None]:
plt.scatter(df2.gdp_per_capita, df2.suicides_no)
corr, _ = pearsonr(df2.gdp_per_capita, df2.suicides_no)
print('Pearsons correlation: %.3f' % corr)

Well, GDP doesn't seem to be the answer.
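
The correlation above uses raw suicide counts, which also depend on how large each population is. As a cross-check, one could correlate GDP per capita with a per-country rate instead. The sketch below uses made-up rows, and `per_country` and `rate_per_100k` are illustrative names; it shows the calculation only and says nothing about the real result.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "suicides_no": [10, 20, 5, 5, 40, 20],
    "population": [100_000, 100_000, 50_000, 50_000, 200_000, 200_000],
    "gdp_per_capita": [1_000, 1_200, 20_000, 22_000, 5_000, 5_500],
})

# One row per country: total counts, total population, mean GDP per capita.
per_country = toy.groupby("country").agg(
    suicides_no=("suicides_no", "sum"),
    population=("population", "sum"),
    gdp_per_capita=("gdp_per_capita", "mean"),
)
per_country["rate_per_100k"] = (
    per_country["suicides_no"] * 100_000 / per_country["population"]
)

# Series.corr computes the Pearson correlation by default.
corr = per_country["gdp_per_capita"].corr(per_country["rate_per_100k"])
print(f"Pearson correlation (toy data): {corr:.3f}")
```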

## Suicides and gender:

In [None]:
plt.figure(figsize=(8,6))
men = df[df.sex == "male"]
women = df[df.sex == "female"]
# Plot each subset against its own suicides_no column, not the full frame's
sns.lineplot(x=men.year, y=men.suicides_no, ci=None)
sns.lineplot(x=women.year, y=women.suicides_no, ci=None)
plt.legend(["male", 'female'])
plt.show()

The overall number of suicides committed by men is considerably higher than the number committed by women over the whole period. What is more, the female values were almost always stable, except for the period after 2014, while the male values fluctuated.
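
To put a number on the gap, one could compute the male-to-female ratio per year. A minimal sketch on made-up rows (the column name `male_to_female` is illustrative):

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "sex": ["male", "female", "male", "female"],
    "suicides_no": [90, 30, 80, 20],
})

# One row per year, one column per sex.
by_sex = toy.pivot_table(index="year", columns="sex",
                         values="suicides_no", aggfunc="sum")
by_sex["male_to_female"] = by_sex["male"] / by_sex["female"]
print(by_sex)
```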

In [None]:
countries = np.array(df_sumByCountry["country"].head(10))
fig = plt.figure(figsize=(15,10))
fig.subplots_adjust(hspace=0.4, wspace=0.4)
for i in range(0, 10):
    ax = fig.add_subplot(2, 5, i+1)
    sns.barplot(x="sex", y="suicides_no", data=df_sumByCountry[df_sumByCountry.country == countries[i]])
    plt.title(countries[i])



In the 10 countries with the highest Suicides/Population rate, the values for women are almost 3 times lower than for men. However, the female figures look relatively higher in Hungary, Slovenia and Sri Lanka.

## Suicides and age:

In [None]:
# Restrict the aggregation to the numeric column we need
suicides_by_ageGroup = df.pivot_table(index='age', values='suicides_no', aggfunc='sum')
suicides_by_ageGroup["suicides_no"]

In [None]:
plt.figure(figsize=(12,10))
sns.lineplot(x="year", y="suicides_no", hue="age",
             data=df, linewidth=2.5, style="age", markers=False,
             dashes=False)
plt.title("Suicides by age group over time")
plt.show()

Here we can see that middle-aged people are most at risk of committing suicide. The elderly show the lowest values. It is interesting that all groups follow practically the same pattern over time.

In [None]:
suicides_by_generation = df.pivot_table(index='generation', values='suicides_no', aggfunc='sum')
suicides_by_generation["suicides_no"]

In [None]:
df_countries_gen = df2.groupby(["country","generation"]).agg({'gdp_per_capita': 'mean', 'suicides_no' : 'mean', 'population' : "mean"}).sort_values(by=['suicides_no'],  ascending=False).reset_index()
# Divide by the population of the same country/generation group
df_countries_gen["Suicides/Population"] = df_countries_gen.suicides_no/df_countries_gen.population

fig = plt.figure(figsize=(30,25))
fig.subplots_adjust(hspace=0.4, wspace=0.4)
generations = df_countries_gen["generation"].unique()
for i in range(10):
    ax = fig.add_subplot(4, 3, i+1)
    sns.barplot(x="generation", y="Suicides/Population", data= df_countries_gen[ df_countries_gen.country == countries[i]], order=generations)
    plt.title(countries[i])



Wow! Here the situation changes completely from country to country. Historical and social differences might lead one generation or another to commit more suicides in particular countries. In Russia it is Generation Z, in Belarus the G.I. Generation, in Ukraine the Millennials, and in Latvia the Silent Generation. Quite interesting for relatively close post-Soviet countries.

In [None]:
fig = plt.figure(figsize=(10,8))
sns.barplot(x="generation", y="Suicides/Population", data= df_countries_gen)


However, that was only the top 10 countries. Across all countries, Generation X is the most likely to commit suicide.

What about gender and age group?

In [None]:
dfByAges = df.groupby(["age", "sex"]).agg({ 'suicides_no' : 'sum', 'population' : "sum"}).sort_values(by=['suicides_no'],  ascending=False).reset_index()

ages = np.array(df["age"].unique())
fig = plt.figure(figsize=(15,10))
fig.subplots_adjust(hspace=0.4, wspace=0.4)
for i in range(0, len(ages)):
    ax = fig.add_subplot(2, 3, i + 1)
    sns.barplot(x="sex", y="suicides_no", data=dfByAges[dfByAges.age == ages[i]])
    plt.title(ages[i])

These bar charts show that the female share of suicides is higher than average in the youngest age group and, in contrast, in the 75+ group.
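
Reading shares off bar heights is error-prone, so here is a rough sketch of how the female share per age group could be computed and compared with the overall share. The rows are made up, and `female_share` and `overall_share` are illustrative names.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "age": ["5-14 years", "5-14 years", "35-54 years", "35-54 years"],
    "sex": ["male", "female", "male", "female"],
    "suicides_no": [6, 4, 80, 20],
})

# One row per age group, one column per sex.
counts = toy.pivot_table(index="age", columns="sex",
                         values="suicides_no", aggfunc="sum")
counts["female_share"] = counts["female"] / (counts["female"] + counts["male"])

# Female share across all age groups, for comparison.
overall_share = counts["female"].sum() / (
    counts["female"].sum() + counts["male"].sum()
)
print(counts)
print(f"Overall female share (toy data): {overall_share:.3f}")
```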

## Results:
- The highest number of suicides was in 2006-2008; the global financial situation may be a reason
- High GDP per capita doesn't mean low level of suicides
- Men commit suicide much more often than women
- The female share of suicides is higher than average in the 5-14 and 75+ age groups
- Generally, Generation X is the most likely to commit suicide
