
**SUICIDE RATES OVERVIEW 1985-2016**

*Exploratory Data Analysis*

Data source: Kaggle dataset, created by Kaggle user: Russel Yates
This dataset was compiled from datasets sourced from: the United Nations Development Program, the World Bank, Suicide in the Twenty-First Century (Szamil), and the World Health Organization

# PREPROCESSING DATA

In [None]:

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy.stats import ttest_ind, ttest_rel
print(os.listdir("../input"))

In [None]:
df = pd.read_csv("../input/master.csv")
df.head()

In [None]:
df.info()

## CHECK FOR MISSING VALUES IN DATASET ##

In [None]:
for col in df.columns:
    n_missing = df[col].isnull().sum()
    print('column {} has {} missing values'.format(col, n_missing))
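The same check can be done without an explicit loop. Below is a minimal sketch on a small made-up frame; `toy` is hypothetical and only stands in for `df`. It also reports the share of missing values per column, which helps when deciding whether a column is worth keeping.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df, with gaps in one column
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1985, 1986, 1985, 1986],
    "HDI for year": [np.nan, 0.7, np.nan, np.nan],
})

# Number and share of missing values per column
missing = toy.isnull().sum()
missing_share = toy.isnull().mean()
print(pd.DataFrame({"missing": missing, "share": missing_share}))
```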

## DROP UNNECESSARY INFORMATION ##

Drop the 'HDI for year' column because it has many missing values. In addition, some columns duplicate information from other columns: for example, the information in 'country-year' can also be found in the 'country' and 'year' columns.

In [None]:
df = df.drop(['HDI for year','country-year'], axis = 1)
df.head()

## NUMERIC AND CATEGORICAL DATA ##
**NUMERIC FEATURES**

Let's take a look at statistics of numeric columns and categorical columns separately.

In [None]:
df_num = df.drop(['country','sex','generation'] ,axis = 1)
df_num.describe()


* All columns have the same number of records.
* Column 'year' ranges from 1985 to 2016.
* 75% of the 'suicides_no' values are under 131, but the remaining values are many times larger than the rest of the dataset, which produces a strong right skew.
* 'suicides/100k pop' lets us compare suicide rates between countries with different populations on the same scale, and it shows that 75% of records do not exceed 16 persons per 100k population.
* 'gdp_per_capita' suggests that countries from all over the world were included in the dataset, because GDP per capita ranges from 251 to 126352 dollars.

**CATEGORICAL FEATURES**

In [None]:
df_cats = df.drop(['year','suicides_no','population','suicides/100k pop',\
                   'gdp_per_capita ($)',' gdp_for_year ($) '], axis=1)
df_cats.describe()

* The categorical columns also all have the same number of records.
* 101 countries are represented in the dataset, and the Netherlands has the most records.
* Female and male records are balanced, with 13910 records each.
* Ages are split into 6 age groups, the most frequent being 15-24 years.
* Generation has 6 unique categories; the most frequent is Generation X with 6408 records.

## NUMERICAL DATA DISTRIBUTION

In [None]:
df_num = df_num[df_num['suicides_no'] < 2000]
fig, axarr = plt.subplots(2, 2, figsize=(10, 10))
df_num['year'].hist(ax=axarr[0][0])
axarr[0][0].set_title("year", fontsize=18)
df_num['suicides_no'].hist(ax=axarr[0][1], bins = 25)
axarr[0][1].set_title("suicides_no", fontsize=18)
df_num['population'].hist(ax=axarr[1][0])
axarr[1][0].set_title("population", fontsize=18)
df_num['suicides/100k pop'].hist(ax=axarr[1][1], bins = 20)
axarr[1][1].set_title("suicides/100k pop", fontsize=18)

### LOG Normal distribution ###

In [None]:
fig, axarr = plt.subplots(2, 2, figsize=(10, 10))
x = np.log10(df['suicides_no'].replace(0, np.nan).dropna()).hist(ax=axarr[0][0], bins =20)
axarr[0][0].set_title("suicides_no", fontsize=18)
x = np.log10(df['population'].replace(0, np.nan).dropna()).hist(ax=axarr[0][1], bins =20)
axarr[0][1].set_title("population", fontsize=18)
x = np.log10(df['suicides/100k pop'].replace(0, np.nan).dropna()).hist(ax=axarr[1][0], bins =20)
axarr[1][0].set_title("suicides/100k pop", fontsize=18)
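To put a number on how much the log transform helps, sample skewness can be compared before and after `np.log10`. This is only a sketch on synthetic data: the log-normal `counts` series below is an assumption standing in for 'suicides_no', not the real column.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed counts (log-normal), standing in for 'suicides_no'
rng = np.random.default_rng(0)
counts = pd.Series(rng.lognormal(mean=3.0, sigma=1.5, size=5000)).round()

# Zeros cannot be log-transformed, so they are dropped as in the cell above
positive = counts[counts > 0]
skew_raw = positive.skew()
skew_log = np.log10(positive).skew()
print("skewness raw: {:.2f}, after log10: {:.2f}".format(skew_raw, skew_log))
```

A skewness close to 0 after the transform is consistent with a roughly log-normal shape; a large positive value means the right tail is still heavy.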



### Testing for difference across samples by decades

To check whether the sample means of suicide numbers change significantly over time (for instance, every 10 or 20 years) or whether the samples come from different populations, I define a function and iterate it over pairs of years. A two-sample t-test at the 0.05 significance level (95% confidence) is used.

In [None]:
import itertools

def ttest_decade(year1, year2):
    # Suicide counts of all records for each of the two years being compared
    sample1 = df[df['year'] == year1]["suicides_no"]
    sample2 = df[df['year'] == year2]["suicides_no"]

    ttest, pval = ttest_ind(sample1, sample2)

    if pval == 1:
        # Identical samples (a year compared with itself): nothing to report
        return None
    # Reject the null hypothesis of equal means at the 0.05 significance level
    verdict = "-IS-" if pval < 0.05 else "is -NO-"
    print("ttest", ttest.round(3), "\np-value", pval.round(3),
          "\nThere {} significant difference between sample means "
          "of {} and {} years\n".format(verdict, year1, year2))

Y1 = [1985, 1995]
Y2 = [1995, 2005, 2015]
for year_1, year_2 in itertools.product(Y1, Y2):
    ttest_decade(year_1, year_2)

Here we fail to reject the null hypothesis for every pair of years, which suggests that the data is homogeneous and was collected in a consistent way over the whole period.
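Two details of the test are worth a quick illustration. By default `ttest_ind` assumes equal variances in both samples; passing `equal_var=False` switches to Welch's t-test, which drops that assumption. Also, the `pval == 1` branch fires when a year is compared with itself (`Y1` and `Y2` both contain 1995). The sketch below uses synthetic log-normal samples as an assumption, not the real data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic right-skewed samples standing in for two years of 'suicides_no'
rng = np.random.default_rng(1)
year_a = rng.lognormal(mean=3.0, sigma=1.0, size=500)
year_b = rng.lognormal(mean=3.0, sigma=1.0, size=500)

# Student's t-test (equal variances assumed), the scipy default
t_student, p_student = ttest_ind(year_a, year_b)
# Welch's t-test does not assume equal variances
t_welch, p_welch = ttest_ind(year_a, year_b, equal_var=False)
# Comparing a sample with itself gives t = 0 and p = 1
t_same, p_same = ttest_ind(year_a, year_a)

print("Student p-value: {:.3f}".format(p_student))
print("Welch p-value:   {:.3f}".format(p_welch))
print("Same sample:     t = {:.1f}, p = {:.1f}".format(t_same, p_same))
```

Because the counts are heavily right-skewed, a t-test on raw values is dominated by a few very large records, so the result is best read as a rough homogeneity check.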

### Overall suicide rate increase in % from 1985 to 2015 ###

In [None]:
first = df[df['year']==1985]['suicides_no'].mean()
second = df[df['year']==2015]['suicides_no'].mean()
print('Suicides overall rate increased by {:.2f}% during last 30 years'.format(((second / first)-1)*100))

first = df[df['year']==1985]['population'].mean()
second = df[df['year']==2015]['population'].mean()
print('Population increased by {:.2f}% during last 30 years'.format(((second / first)-1)*100))

As we can see, over the last 30 years the suicide numbers grew at about the same pace as the population covered by the dataset: 35.84% and 36.22% respectively.
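The percentage formula used above, and what equal growth implies for the rate per 100k, can be sketched with round numbers. The totals below are hypothetical and not taken from the dataset; `pct_change` is a helper name introduced only for this illustration.

```python
def pct_change(first, second):
    """Relative change from first to second, in percent."""
    return (second / first - 1) * 100

# Hypothetical yearly totals, chosen so that both grow by the same factor
suicides = {1985: 100_000, 2015: 135_000}
population = {1985: 1_000_000_000, 2015: 1_350_000_000}

# Rate per 100k population for each year
rate_1985 = suicides[1985] / population[1985] * 100_000
rate_2015 = suicides[2015] / population[2015] * 100_000

print("suicides:   {:+.2f}%".format(pct_change(suicides[1985], suicides[2015])))
print("population: {:+.2f}%".format(pct_change(population[1985], population[2015])))
print("rate/100k:  {:+.2f}%".format(pct_change(rate_1985, rate_2015)))
```

When counts and population grow by the same percentage, the rate per 100k stays flat, which is why the two figures are worth reading side by side.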

# DATA VISUALIZATION

## Top 10 countries by number of suicides throughout the 30 years


In [None]:
# Drop 2016 due to the lack of records for this year
df = df[df['year'] != 2016]

plt.figure(figsize = (12,5))
# Keep the result in its own variable instead of attaching it to df as an attribute
top_countries = df.groupby('country')['suicides_no'].sum().sort_values(ascending = False)[:10]
top_countries.plot(kind='bar', fontsize=12)
plt.ylabel('suicides number')
plt.title('Top-10 countries by suicides 1985-2015', fontsize = 20)

## Suicide rate by age group 1985-2015

In [None]:
# All six age groups, listed from youngest to oldest
cats = ['5-14 years', '15-24 years', '25-34 years', '35-54 years', '55-74 years', '75+ years']
cat_dtype = pd.api.types.CategoricalDtype(categories=cats, ordered=True)
# Ordered categorical so the groups follow age order rather than alphabetical order
age_ordered = df['age'].astype(cat_dtype)

suicides_rate_per_years = df.groupby(['year', age_ordered], observed=True)['suicides_no'].sum()
suicides_rate_per_years.unstack().plot(kind='line', linewidth=5.0, figsize=(15,7), grid=True, fontsize=12)
plt.ylabel('suicides number')
plt.title('Suicide number by age group 1985-2015', fontsize = 20)

Here we see that the majority of people who died by suicide were 35-74 years old at the time of their death. The upward trend in the number of suicides in all groups from 1987 to 2000 may mean that people were facing difficult living conditions.
However, the numbers started to go back down after the 2000s, which "proves" the phrase "More suiciders - less suiciders" (joke).
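Why the order of the age groups needs to be set explicitly is easy to see on a tiny example. The labels follow the dataset's age groups, but the `ages` series itself is made up for illustration. Note that any label missing from the category list silently turns into NaN on conversion.

```python
import pandas as pd

ages = pd.Series(['5-14 years', '15-24 years', '75+ years', '35-54 years'])

# Plain strings sort alphabetically, which puts '5-14 years' after '35-54 years'
alphabetical = sorted(ages)

# An ordered categorical sorts by the position in the category list instead
cats = ['5-14 years', '15-24 years', '25-34 years', '35-54 years', '55-74 years', '75+ years']
cat_dtype = pd.api.types.CategoricalDtype(categories=cats, ordered=True)
by_age = ages.astype(cat_dtype).sort_values().tolist()

# Leaving a label out of the categories turns its rows into NaN
five_only = pd.api.types.CategoricalDtype(categories=cats[:5], ordered=True)
n_lost = ages.astype(five_only).isna().sum()

print(alphabetical)
print(by_age)
print("rows lost with an incomplete category list:", n_lost)
```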


## Plotting graph by age groups and generations

First, I set the correct order of the age groups, group the suicide numbers by year and age, and then visualize them. The table below lists the birth years that define each generation.

In [None]:
d = { 'G.I. Generation':['1910-1924'],'Silent':['1925-1945'],'Boomers':['1946-1964'],\
     'Generation X':['1965-1979'],'Millenials':['1980-1994'],'Generation Z':['1995-2012']}
col=['period']
gens = pd.DataFrame(data=d, index=col).T
gens

## Relative weight of each age group 1985-2015

In [None]:
plt.figure(figsize=(12,5))
total = float(len(df_cats) )

ax = sns.countplot(x="generation", data=df_cats)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.1f}%'.format((height/total)*100),
            ha="center") 
plt.title('Relative weight of each age group 1985-2015',fontsize = 20)
plt.show()

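This graph shows what share of the dataset's records belongs to each generation, which I read as a rough picture of how suicides are spread across generations. Over the last 30 years, most records belong to the adult generations, with fewer for the oldest and the youngest ones.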
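Counting records and summing suicides are two different measures, and they can disagree. A minimal sketch with hypothetical numbers (the `toy` frame is not taken from the dataset):

```python
import pandas as pd

# Hypothetical records: Boomers have half of the rows but most of the suicides
toy = pd.DataFrame({
    "generation": ["Boomers", "Boomers", "Generation X", "Silent"],
    "suicides_no": [30, 50, 15, 5],
})

# Share of records per generation (what a count plot displays)
record_share = toy["generation"].value_counts(normalize=True) * 100

# Share of suicides per generation
suicide_share = (toy.groupby("generation")["suicides_no"].sum()
                 / toy["suicides_no"].sum() * 100)

print(record_share.round(1))
print(suicide_share.round(1))
```

In this toy case Boomers account for 50% of the records but 80% of the suicides, so the two charts would tell different stories.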

## Suicide rate by generations 1985-2015

In [None]:

suicides_rate_per_gen = df.groupby(['year','generation'])['suicides_no'].sum()
suicides_rate_per_gen.unstack().plot(kind='line',linewidth=6.0, figsize=(12,5), fontsize=12)
plt.title("Suicide numbers by generations 1985-2015", fontsize = 20)
plt.ylabel('suicides number')

The graph indicates logical links between generations and timeline:
* The G.I. Generation, as the oldest one, stops in the 2000s, which suggests that people older than 76 tend not to commit suicide.
* On the other hand, Generation Z, the youngest one, has only a few cases since 2007.
* The Boomers and Generation X lines have the same shape: Boomers were 35-54 years old in the 1990s, when they faced a global financial crisis (which could be a strong catalyst), and Generation X was in the same situation during the 2008 global crisis. This leads to the assumption that Millennials will see a peak in suicide numbers during the next global crisis.

## Suicide rates by gender 1985-2015

In [None]:
pal = ['pink','powderblue']
suicides_rate = df.groupby(['year','sex'])['suicides_no'].sum()
suicides_rate.unstack().plot(kind='area',linewidth=3.0, figsize=(12,5), color = pal, fontsize=12)
plt.ylabel('suicides number')
plt.title('Suicide rates by gender 1985-2015', fontsize = 20)

It is quite frightening that men commit suicide about 5 times more often than women. This may mean that men are exposed to more stress factors than women.
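The male-to-female ratio can be computed from the grouped sums directly instead of being read off the chart. The sketch below uses hypothetical counts; with the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical yearly counts by sex
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "sex": ["female", "male", "female", "male"],
    "suicides_no": [20, 80, 25, 75],
})

# One row per year, one column per sex
by_sex = toy.groupby(["year", "sex"])["suicides_no"].sum().unstack()

# Ratio per year and over the whole period
ratio_per_year = by_sex["male"] / by_sex["female"]
ratio_overall = by_sex["male"].sum() / by_sex["female"].sum()

print(ratio_per_year)
print("overall male/female ratio: {:.2f}".format(ratio_overall))
```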

## Heatmap of suicide rate per countries per year

In [None]:
countries = ['Russian Federation','Latvia','Estonia','Ukraine','Belarus',
             'Lithuania','United States']
adf = df.loc[df['country'].isin(countries)]
# Use adf for every argument so that rows and values come from the same subset
x = pd.crosstab(adf['country'], adf['year'], values = adf['suicides/100k pop'], aggfunc = 'sum').round(0)
x = x.apply(lambda row: row.fillna(row.mean()), axis=1)
plt.figure(figsize=(15,8))
sns.heatmap(x)
plt.title('Heatmap of suicide rate per countries per year', fontsize = 25)
plt.ylabel('')


Here I want to point out a few things:
* The Baltic countries differ in their suicide rates, with Lithuania in first place, which could be a topic for separate research.
* In the Eastern region, Russia has the highest score, most probably due to the dissolution of the USSR, bankruptcy and a high crime rate.
* Within the Eastern region, Ukraine has the lowest suicide rate, which may suggest that Ukrainians are more enduring and stress-resistant.
* Although the USA takes second place by overall number of suicides, its suicide rate per 100k population is stable and low. Therefore, we can't treat the USA as a country whose population is prone to suicide. Most probably, its high total is explained by the absence of missing values for the USA and by its large population compared to the other countries.
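The gap between total counts and rates can be shown on a toy example. Summing the per-group 'suicides/100k pop' values, as the heatmap does, gives a score that is comparable between countries with the same number of groups, but it is not the country's actual rate. The actual rate comes from total suicides divided by total population. All numbers below are hypothetical.

```python
import pandas as pd

# Hypothetical data: two demographic groups per country in one year
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2000, 2000, 2000],
    "suicides_no": [10, 30, 100, 300],
    "population": [100_000, 100_000, 2_000_000, 2_000_000],
})
toy["suicides/100k pop"] = toy["suicides_no"] / toy["population"] * 100_000

# Country-level rate: total suicides over total population
totals = toy.groupby(["country", "year"])[["suicides_no", "population"]].sum()
totals["rate"] = totals["suicides_no"] / totals["population"] * 100_000

# Sum of the per-group rates, for comparison
summed = toy.groupby(["country", "year"])["suicides/100k pop"].sum()

print(totals)
print(summed)
```

Country B has ten times more suicides than country A here, yet its rate per 100k is half as large, which mirrors the USA observation above.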


## Suicide rate and GDP per capita correlation 1985-2015

In [None]:
# Delete 3 biggest suicide countries (Russia, USA, Japan) to ease up the correlation graph interpretation
df_without_top3 = df[(df['country']!='Russian Federation')&(df['country']!='United States')\
        &(df['country']!='Japan')]
#Make a pivot table with summarized suicides numbers each year per each country
df_sui = pd.pivot_table(df_without_top3, values='suicides_no', index='country', columns='year', \
                        aggfunc = 'sum')

#Fill in all the gaps with mean values to receive the same sample sizes for every country
df_sui = df_sui.apply(lambda row: row.fillna(row.mean()), axis=1)
#Create new summarizing column with average values
df_sui['Avg_suicide'] = df_sui.apply(lambda row: row.mean(), axis=1).round(0)
country_mean_suicide = df_sui[['Avg_suicide']]

#Repeat all above with second feature
df_gdp = pd.pivot_table(df_without_top3, values='gdp_per_capita ($)', index='country', columns='year')
df_gdp = df_gdp.apply(lambda row: row.fillna(row.mean()), axis=1)
df_gdp['Avg_gdp'] = df_gdp.apply(lambda row: row.mean(), axis=1).round(0)
country_mean_gdp = df_gdp[['Avg_gdp']]

plt.figure(figsize=(12,5))
plt.scatter(x=country_mean_gdp['Avg_gdp'], y=country_mean_suicide['Avg_suicide'], s = 100)
plt.title('Suicide rate and GDP per capita correlation 1985-2015', fontsize = 20)
plt.xlabel('Average GDP per capita ($)')
plt.ylabel('Average suicides number per year')

On this graph we don't see any global correlation; only after excluding outliers like Russia, the USA and Japan can we detect some negative correlation. It suggests that the higher the GDP per capita, the fewer suicides happen. In addition, the majority of countries have low suicide numbers: in absolute values, under 2000 suicides per year per country.
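A scatter plot gives only a visual impression, so a correlation coefficient is a useful companion. Below is a sketch with made-up country averages; the column names mirror 'Avg_gdp' and 'Avg_suicide', but the values are hypothetical. Spearman's coefficient is computed here as the Pearson correlation of the ranks, which makes it less sensitive to outliers.

```python
import pandas as pd

# Hypothetical per-country averages
toy = pd.DataFrame({
    "Avg_gdp": [1000, 5000, 10000, 20000, 40000],
    "Avg_suicide": [900, 700, 650, 300, 200],
}, index=["A", "B", "C", "D", "E"])

# Pearson measures linear association
pearson = toy["Avg_gdp"].corr(toy["Avg_suicide"])

# Spearman measures monotonic association: Pearson correlation of the ranks
spearman = toy["Avg_gdp"].rank().corr(toy["Avg_suicide"].rank())

print("Pearson:  {:.2f}".format(pearson))
print("Spearman: {:.2f}".format(spearman))
```

A coefficient near 0 would support the "no global correlation" reading, while a clearly negative one would support the inverse relationship.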


# SUMMARY #
1. The dataset contains adequate, clean and structured information; however, each country has missing values for a few years.

2. 101 countries and 6 generations are represented in the dataset, with GDP per capita from 251 to 126352 dollars.

3. The numerical data has a strong right skew, which means that the majority of values lie on the left side of the X axis; the data would need further transformation or grouping to approach a normal distribution.

4. The countries with the largest number of suicides are Russia, USA, Japan, France, Ukraine.

5. Most people who committed suicide were between 35 and 54 years old and belong to the 'Silent Generation' and 'Generation X'.

6. Suicide numbers increased during each global crisis: in the 1990s among the 'Silent Generation' and in 2008 among 'Generation X'.

7. Men are about 5 times more likely to commit suicide than women.

8. Among the Baltic states, Lithuania's suicide rates are 2 times higher than Latvia's or Estonia's.

9. The USA takes second place by total number of suicides not because of a high suicide rate, but due to its relatively large population.

10. The higher the GDP per capita, the fewer suicides happen in a country.

11. Suicide numbers grew at about the same pace as the population over the last 30 years: 35.84% and 36.22% respectively.