# Suicide analysis

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os
print(os.listdir("../input"))


['master.csv']


#### Reading the data

In [2]:
df = pd.read_csv('../input/master.csv')

In [3]:
df.sample(5)

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
10459,Grenada,1992,female,75+ years,0,1675,0.0,Grenada1992,,310160444,3713,G.I. Generation
16497,Mauritius,2011,male,15-24 years,13,101183,12.85,Mauritius2011,0.762,11518393367,9817,Millenials
4205,Brazil,1987,female,15-24 years,291,14145640,2.06,Brazil1987,,294084112393,2394,Generation X
15351,Luxembourg,1988,female,15-24 years,0,25900,0.0,Luxembourg1988,,9750161053,27676,Generation X
7128,Czech Republic,1992,female,25-34 years,52,660200,7.88,Czech Republic1992,,34590052812,3573,Boomers


In [4]:
df.describe()

Unnamed: 0,year,suicides_no,population,suicides/100k pop,HDI for year,gdp_per_capita ($)
count,27820.0,27820.0,27820.0,27820.0,8364.0,27820.0
mean,2001.258375,242.574407,1844794.0,12.816097,0.776601,16866.464414
std,8.469055,902.047917,3911779.0,18.961511,0.093367,18887.576472
min,1985.0,0.0,278.0,0.0,0.483,251.0
25%,1995.0,3.0,97498.5,0.92,0.713,3447.0
50%,2002.0,25.0,430150.0,5.99,0.779,9372.0
75%,2008.0,131.0,1486143.0,16.62,0.855,24874.0
max,2016.0,22338.0,43805210.0,224.97,0.944,126352.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27820 entries, 0 to 27819
Data columns (total 12 columns):
country               27820 non-null object
year                  27820 non-null int64
sex                   27820 non-null object
age                   27820 non-null object
suicides_no           27820 non-null int64
population            27820 non-null int64
suicides/100k pop     27820 non-null float64
country-year          27820 non-null object
HDI for year          8364 non-null float64
 gdp_for_year ($)     27820 non-null object
gdp_per_capita ($)    27820 non-null int64
generation            27820 non-null object
dtypes: float64(2), int64(4), object(6)
memory usage: 2.5+ MB


In [6]:
df.isnull().sum()

country                   0
year                      0
sex                       0
age                       0
suicides_no               0
population                0
suicides/100k pop         0
country-year              0
HDI for year          19456
 gdp_for_year ($)         0
gdp_per_capita ($)        0
generation                0
dtype: int64

### Understanding the data

The country-year field just concatenates the country name and the year of the record, so it is redundant and will be discarded. The 'HDI for year' field will also be discarded, because most of its values are missing (19456 of 27820 records).

In [7]:
df.drop(['country-year', 'HDI for year'], inplace=True, axis = 1)
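One way to back up the decision to drop 'HDI for year' is to look at the share of missing values per column rather than the raw counts. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from master.csv); on the real data, the counts above work out to 19456 / 27820, roughly 70% missing.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the values are illustrative only.
toy = pd.DataFrame({
    'year': [1990, 1991, 1992, 1993],
    'HDI for year': [np.nan, np.nan, 0.7, np.nan],
})

# isnull() gives booleans, so the column mean is the fraction of missing values.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```
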

Let's rename some columns simply to make it easier to access them.

In [8]:
df = df.rename(columns={'gdp_per_capita ($)': 'gdp_per_capita', ' gdp_for_year ($) ':'gdp_for_year'})

The 'gdp_for_year' field is stored as a string with comma thousands separators, so let's convert it to a number.

In [None]:
# Strip the thousands separators and convert in one vectorized step.
# Assigning element by element with df['gdp_for_year'][i] = ... is chained
# assignment, which raises SettingWithCopyWarning and may not modify df at all.
df['gdp_for_year'] = df['gdp_for_year'].str.replace(',', '').astype('int64')
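An alternative, if re-reading the file is acceptable, is to let `pd.read_csv` handle the separators through its `thousands` parameter, so that no string clean-up is needed afterwards. This is only a sketch on made-up CSV text: the `raw` buffer, the simplified column name and the values are assumptions for illustration.

```python
import io
import pandas as pd

# Illustrative CSV text; the numbers are made up and the header is simplified.
raw = io.StringIO('country,gdp_for_year\nA,"1,234,567"\nB,"89,000"\n')

# thousands=',' tells the parser to drop the separators while converting to numbers.
toy = pd.read_csv(raw, thousands=',')
print(toy.dtypes)
```
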
  


## Data Description

Each record in the data set represents one combination of year, country, age range and sex. For example, in Brazil in 1985, 129 men over 75 years old died by suicide.

The data set has 10 attributes:

- Country: country of the record;
- Year: year of the record;
- Sex: sex (male or female);
- Age: age range of the suicides, with ages divided into six bands;
- Suicides_no: number of suicides;
- Population: population of this sex, in this age range, in this country and in this year;
- Suicides / 100k pop: ratio between the number of suicides and the population, expressed per 100k people;
- GDP_for_year: GDP of the country in the year in question;
- GDP_per_capita: ratio between the country's GDP and its population;
- Generation: generation of the people in the record, one of 6 possible categories.
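The 'suicides/100k pop' attribute can be checked by hand against three rows of the `df.sample(5)` output shown earlier. The helper name `rate_per_100k` is mine, introduced only for this check.

```python
# (country, suicides_no, population, reported suicides/100k pop),
# copied from the df.sample(5) output earlier in the notebook.
rows = [
    ('Mauritius', 13, 101183, 12.85),
    ('Brazil', 291, 14145640, 2.06),
    ('Czech Republic', 52, 660200, 7.88),
]

def rate_per_100k(suicides_no, population):
    # Suicides per 100,000 people of the same country, year, sex and age range.
    return suicides_no / population * 100_000

for country, n, pop, reported in rows:
    print(country, round(rate_per_100k(n, pop), 2), reported)
```
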

In [None]:
df['age'].unique()

In [None]:
df['generation'].unique()

## Adding some things

Since the HDI was discarded, and it is still interesting to assess whether a country's level of development influences the suicide rate, I took a list of first and second world countries from the following site:

http://worldpopulationreview.com

Then I categorized each country in the data set into first, second and third world.

In [None]:
# Names must match the 'country' column exactly; the data set lists South Korea
# as 'Republic of Korea' (see the country-code lookup further down).
Frist_world = ['United States', 'Germany', 'Japan', 'Turkey', 'United Kingdom', 'France', 'Italy', 'Republic of Korea',
              'Spain', 'Canada', 'Australia', 'Netherlands', 'Belgium', 'Greece', 'Portugal', 
              'Sweden', 'Austria', 'Switzerland', 'Israel', 'Singapore', 'Denmark', 'Finland', 'Norway', 'Ireland',
              'New Zealand', 'Slovenia', 'Estonia', 'Cyprus', 'Luxembourg', 'Iceland']

Second_world = ['Russian Federation', 'Ukraine', 'Poland', 'Uzbekistan', 'Romania', 'Kazakhstan', 'Azerbaijan', 'Czech Republic',
               'Hungary', 'Belarus', 'Tajikistan', 'Serbia', 'Bulgaria', 'Slovakia', 'Croatia', 'Moldova', 'Georgia',
               'Bosnia And Herzegovina', 'Albania', 'Armenia', 'Lithuania', 'Latvia', 'Brazil', 'Chile', 'Argentina',
               'China', 'India', 'Bolivia']

In [None]:
# 1 = first world, 2 = second world, 3 = everything else.
# np.select picks the value of the first condition that holds for each row.
df['country_world'] = np.select(
    [df['country'].isin(Frist_world), df['country'].isin(Second_world)],
    [1, 2],
    default=3,
)
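Because the categorization relies on exact string matches, a name that is spelled differently from the 'country' column silently ends up in category 3. A small sanity check, sketched here with hypothetical stand-in names (`dataset_countries` and `category_list` are not the notebook's variables), lists the entries that match nothing:

```python
# Hypothetical stand-ins: names present in the data vs. names typed into a category list.
dataset_countries = {'Brazil', 'New Zealand', 'Republic of Korea', 'Romania'}
category_list = ['Brazil', 'New Zeland', 'South Korea', 'Romania']

# Names in the list that match no country in the data never receive their category.
unmatched = sorted(set(category_list) - dataset_countries)
print(unmatched)
```

In the notebook the same idea would be `set(Frist_world + Second_world) - set(df['country'])`. Some entries will show up simply because the country has no records in the data set, so the result needs a manual look.
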

# Exploratory analysis

I will analyze the impact of some attributes, each in isolation, on the number of suicides. We start with the year.

#### Year

In [None]:
suicides_no_year = []

for y in df['year'].unique():
    suicides_no_year.append(sum(df[df['year'] == y]['suicides_no']))

n_suicides_year = pd.DataFrame(suicides_no_year, columns=['suicides_no_year'])
n_suicides_year['year'] = df['year'].unique()

top_year = n_suicides_year.sort_values('suicides_no_year', ascending=False)['year']
top_suicides = n_suicides_year.sort_values('suicides_no_year', ascending=False)['suicides_no_year']

plt.figure(figsize=(8,5))
plt.xticks(rotation=90)
sns.barplot(x = top_year, y = top_suicides)
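The per-year totals built by the loop above can also be obtained with a single `groupby`, which avoids filtering the frame once per year. The sketch below runs on made-up records; `toy` and `per_year` are illustrative names.

```python
import pandas as pd

# Toy records; the column names follow the notebook, the numbers are made up.
toy = pd.DataFrame({
    'year': [1990, 1990, 1991, 1991, 1991],
    'suicides_no': [10, 5, 7, 1, 2],
})

# One pass over the data: total suicides per year, largest first.
per_year = toy.groupby('year')['suicides_no'].sum().sort_values(ascending=False)
print(per_year)
```

The same pattern applies to the age, sex, country and generation totals in the cells that follow, with the grouping column swapped.
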

#### Age

In [None]:
suicides_no_age = []

for a in df['age'].unique():
    suicides_no_age.append(sum(df[df['age'] == a]['suicides_no']))

plt.xticks(rotation=30)
sns.barplot(x = df['age'].unique(), y = suicides_no_age)

#### Sex

In [None]:
suicides_no_sex = []

for s in df['sex'].unique():
    suicides_no_sex.append(sum(df[df['sex'] == s]['suicides_no']))

sns.barplot(x = df['sex'].unique(), y = suicides_no_sex)

#### Country

In [None]:
suicides_no_pais = []
for c in df['country'].unique():
    suicides_no_pais.append(sum(df[df['country'] == c]['suicides_no']))
    
n_suicides_pais = pd.DataFrame(suicides_no_pais, columns=['suicides_no_pais'])
n_suicides_pais['country'] = df['country'].unique()

quant = 15
top_paises = n_suicides_pais.sort_values('suicides_no_pais', ascending=False)['country'][:quant]
top_suicides = n_suicides_pais.sort_values('suicides_no_pais', ascending=False)['suicides_no_pais'][:quant]
sns.barplot(x = top_suicides, y = top_paises)

In [None]:
suicides_no_pais = []
for c in df['country'].unique():
    suicides_no_pais.append(sum(df[df['country'] == c]['suicides/100k pop']))
    
n_suicides_pais = pd.DataFrame(suicides_no_pais, columns=['suicides_no_pais'])
n_suicides_pais['country'] = df['country'].unique()

quant = 15
top_paises = n_suicides_pais.sort_values('suicides_no_pais', ascending=False)['country'][:quant]
top_suicides = n_suicides_pais.sort_values('suicides_no_pais', ascending=False)['suicides_no_pais'][:quant]
sns.barplot(x = top_suicides, y = top_paises)
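Note that the cell above adds up the 'suicides/100k pop' values of every record, so a country with more records (more years reported) gets a larger total even if its rate is the same. A possible alternative, sketched here on made-up numbers and not what the notebook does, is a pooled rate: total suicides divided by total population.

```python
import pandas as pd

# Made-up records: country A has two records, country B has one.
toy = pd.DataFrame({
    'country': ['A', 'A', 'B'],
    'suicides_no': [20, 20, 20],
    'population': [100_000, 100_000, 100_000],
})
toy['rate'] = toy['suicides_no'] / toy['population'] * 100_000

# Summing the per-record rates rewards countries with more records.
summed_rate = toy.groupby('country')['rate'].sum()

# Pooled rate: total suicides over total population, per 100k.
totals = toy.groupby('country')[['suicides_no', 'population']].sum()
pooled_rate = totals['suicides_no'] / totals['population'] * 100_000

print(summed_rate)
print(pooled_rate)
```
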

#### Generation

In [None]:
suicides_no_gen = []
for g in df['generation'].unique():
    suicides_no_gen.append(sum(df[df['generation'] == g]['suicides_no']))

plt.figure(figsize=(8,5))
sns.barplot(x = df['generation'].unique(), y = suicides_no_gen)

#### Country world

In [None]:
suicides_no_world = []
for w in df['country_world'].unique():
    suicides_no_world.append(sum(df[df['country_world'] == w]['suicides_no']))
    
sns.barplot(x = df['country_world'].unique(), y = suicides_no_world)

In [None]:
suicides_no_world = []
for w in df['country_world'].unique():
    suicides_no_world.append(sum(df[df['country_world'] == w]['suicides/100k pop']))
    
sns.barplot(x = df['country_world'].unique(), y = suicides_no_world)

#### GDP for year

In [None]:
sns.scatterplot(x = 'gdp_for_year', y = 'suicides_no', data = df)

#### GDP per capita

In [None]:
sns.scatterplot(x = 'gdp_per_capita', y = 'suicides_no', data = df)

## Attribute correlation

In [None]:
plt.figure(figsize=(8,7))
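# Correlate numeric columns only; recent pandas versions raise an error on text columns.
sns.heatmap(df.select_dtypes('number').corr(), cmap = 'coolwarm', annot=True)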

## Checking the suicides/100k distribution of some countries

In [None]:
countries = ['Russian Federation', 'Brazil', 'Poland', 'Italy', 'United States', 'Germany', 'Japan', 'Spain', 'France']
df_filtred = df[df['country'].isin(countries)]

plt.figure(figsize=(12,6))
sns.boxplot(x = 'suicides/100k pop', y = 'country', data = df_filtred)

### General Plot of the World

In [None]:
# plotly.plotly moved to the separate chart_studio package in plotly 4 and is not used here
import plotly.graph_objs as go 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

In [None]:
init_notebook_mode(connected=True) 

In [None]:
cod = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2014_world_gdp_with_codes.csv')

In [None]:
codes = []
for i in range(len(n_suicides_pais)):
    c = n_suicides_pais['country'][i]
    f = 0
    for j in range(len(cod)):
        if c == cod['COUNTRY'][j]:
            tmp = cod['CODE'][j]
            f = 1
            break
    if f == 0:
        if c == 'Bahamas':
            tmp  = 'BHM'
        elif c == 'Republic of Korea':
            tmp = 'KOR'
        elif c == 'Russian Federation':
            tmp = 'RUS'
        else:
            tmp = 'VC'
    codes.append(tmp)
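The nested loop above is effectively a table lookup with a few manual overrides. A dictionary-based version is sketched below; `code_table`, `overrides` and `lookup_code` are hypothetical names, and the tiny tables are stand-ins for the real `cod` frame.

```python
# Hypothetical excerpt of a country -> ISO-3 code table, plus manual overrides
# for names that are spelled differently in the two sources.
code_table = {'Brazil': 'BRA', 'Japan': 'JPN'}
overrides = {'Republic of Korea': 'KOR', 'Russian Federation': 'RUS'}

def lookup_code(country, default=None):
    # Look in the main table first, then in the overrides, else return the default.
    if country in code_table:
        return code_table[country]
    return overrides.get(country, default)

names = ['Brazil', 'Republic of Korea', 'Atlantis']
found = [lookup_code(name) for name in names]
print(found)
```

With the real data, the table could be built as `dict(zip(cod['COUNTRY'], cod['CODE']))`. Returning `None` for unknown names makes missing codes easy to spot before plotting.
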

In [None]:
data = dict(
        type = 'choropleth',
        locations = codes,
        # n_suicides_pais keeps the summed 'suicides/100k pop' per country in 'suicides_no_pais'
        z = n_suicides_pais['suicides_no_pais'],
        text = n_suicides_pais['country'],
        colorbar = {'title' : 'suicides/100k pop (summed)'},
      )

In [None]:
layout = dict(
    title = 'Suicide heat map 1985-2016',
    geo = dict(
        showframe = False,
        projection = {'type':'equirectangular'}
    )
)

In [None]:
choromap = go.Figure(data = [data],layout = layout)
iplot(choromap)

## Brazil Data

As a Brazilian, I have a particular interest in the suicide rate in Brazil. So I'm going to try to analyze the specific indices of this country.

In [None]:
df_brasil = df[df['country'] == 'Brazil']

The country and country_world fields now hold the same value in every row, so they are discarded.

In [None]:
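# Reassign instead of dropping in place: df_brasil is a slice of df, and
# modifying it in place triggers SettingWithCopyWarning.
df_brasil = df_brasil.drop(['country', 'country_world'], axis = 1)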

I'm going to repeat many of the plots already made above, now for Brazil only.

In [None]:
suicides_no_year = []

for y in df_brasil['year'].unique():
    suicides_no_year.append(sum(df_brasil[df_brasil['year'] == y]['suicides_no']))

n_suicides_year = pd.DataFrame(suicides_no_year, columns=['suicides_no_year'])
n_suicides_year['year'] = df_brasil['year'].unique()

top_year = n_suicides_year.sort_values('suicides_no_year', ascending=False)['year']
top_suicides = n_suicides_year.sort_values('suicides_no_year', ascending=False)['suicides_no_year']

plt.figure(figsize=(8,5))
plt.xticks(rotation=90)
sns.barplot(x = top_year, y = top_suicides)

In [None]:
suicides_no_age = []

# Iterate in the same order as the x labels used in the plot below.
for a in df_brasil['age'].unique():
    suicides_no_age.append(sum(df_brasil[df_brasil['age'] == a]['suicides_no']))

plt.xticks(rotation=30)
sns.barplot(x = df_brasil['age'].unique(), y = suicides_no_age)

In [None]:
suicides_no_sex = []

# Iterate in the same order as the x labels used in the plot below.
for s in df_brasil['sex'].unique():
    suicides_no_sex.append(sum(df_brasil[df_brasil['sex'] == s]['suicides_no']))

sns.barplot(x = df_brasil['sex'].unique(), y = suicides_no_sex)

In [None]:
suicides_no_gen = []
# Iterate in the same order as the x labels used in the plot below.
for g in df_brasil['generation'].unique():
    suicides_no_gen.append(sum(df_brasil[df_brasil['generation'] == g]['suicides_no']))

plt.figure(figsize=(8,5))
sns.barplot(x = df_brasil['generation'].unique(), y = suicides_no_gen)

In [None]:
sns.scatterplot(x = 'gdp_for_year', y = 'suicides_no', data = df_brasil)

In [None]:
sns.scatterplot(x = 'gdp_per_capita', y = 'suicides_no', data = df_brasil)

In [None]:
plt.figure(figsize=(8,7))
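# Correlate numeric columns only; recent pandas versions raise an error on text columns.
sns.heatmap(df_brasil.select_dtypes('number').corr(), cmap = 'coolwarm', annot=True)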