# Suicide Rates Analysis, 1985 to 2016

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import os
print(os.listdir("../input"))

In [None]:
data = pd.read_csv('../input/master.csv')
data.head()

We can observe that:
* There are 12 features in total
* **age** is grouped into ranges of years
* **country-year** is a combination of the country and year features
* **HDI for year** has missing data
* **gdp_for_year** is stored as a string and needs to be converted to an integer type

In [None]:
data.columns.values

### Renaming columns

In [None]:
data.columns = ['country', 'year', 'sex', 'age', 'suicides_no', 'population',
       'suicidesper100kpop', 'country-year', 'HDI for year',
       'gdp_for_year_dollars', 'gdp_per_capita_dollars', 'generation']
data.columns.values

**gdp_for_year** is a numerical feature, but because its values contain comma separators it is stored as a string

In [None]:
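# Remove the thousands separators, then cast to a 64-bit integer (a 32-bit int would overflow on large GDP values)
data['gdp_for_year_dollars'] = data['gdp_for_year_dollars'].str.replace(',', '', regex=False).astype('int64')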
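
The conversion above can be tried in isolation on a few hand-written values. This is only a sketch: the series `raw_gdp` and its numbers are made up for illustration, and the extra `str.strip()` is an assumption that the raw strings may carry stray whitespace.

```python
import pandas as pd

# Invented values in the same comma-separated format as the raw GDP column.
raw_gdp = pd.Series(["2,156,624,900", "63,067,077,179", " 1,000 "])

# Strip whitespace and separators, then cast to a 64-bit integer explicitly,
# because a 32-bit integer cannot hold values above roughly 2.1 billion.
clean_gdp = (
    raw_gdp.str.strip()
           .str.replace(",", "", regex=False)
           .astype("int64")
)
print(clean_gdp)
```

Asking for `int64` explicitly avoids depending on the platform's default integer width.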

In [None]:
data.info()

There are **27820** entries in total.

**Numerical features**: year, suicides_no, population, suicidesper100kpop, HDI for year, gdp_for_year_dollars, gdp_per_capita_dollars

**Categorical features**: country, sex, age, generation


In [None]:
data.isnull().sum().sort_values(ascending=False)

Only the **HDI for year** feature has null values, so we drop it. We also don't need the **country-year** feature, since it only combines country and year, so we drop it too.

In [None]:
data_n = data.drop(['HDI for year', 'country-year'], axis=1)
data_n.head(3)
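
Before dropping a column for missing data, it can help to look at the share of missing values rather than the raw count. The sketch below uses a small invented frame (`toy`) and a hypothetical cutoff of 50%; neither comes from the original analysis.

```python
import numpy as np
import pandas as pd

# Invented values; 'HDI for year' mimics a column that is mostly empty.
toy = pd.DataFrame({
    "suicides_no": [21, 16, 14, 1],
    "HDI for year": [np.nan, np.nan, 0.7, np.nan],
})

# isnull().mean() gives the fraction of missing values per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical rule: drop columns where more than half of the values are missing.
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)
print(reduced.columns.tolist())
```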

### Distribution of numerical feature values

In [None]:
data_n.describe()

### Distribution of categorical features

In [None]:
data_n.describe(include=['O'])

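We observe that:
* There are **101 unique countries** in the dataset
* The age feature has **6 unique age groups**
* The generation feature has **6 generations**
* The `top` and `freq` rows count records, not suicides, so this summary alone cannot show that **males** have more suicides than females; the plots in the next section do
* **Generation X** is the most frequent generation label, which is not the same as having the highest suicide rate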
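
A categorical summary counts records, so comparing groups by suicide level needs an explicit aggregation. As a rough sketch on invented numbers (the frame `toy` is hypothetical), a rate per 100k can be computed from summed counts and summed population:

```python
import pandas as pd

# Invented records: two rows per sex, with counts and population.
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "suicides_no": [30, 10, 50, 10],
    "population": [100_000, 100_000, 300_000, 300_000],
})

# Sum counts and population per group first, then derive the rate.
totals = toy.groupby("sex")[["suicides_no", "population"]].sum()
totals["rate_per_100k"] = totals["suicides_no"] * 100_000 / totals["population"]
print(totals)
```

Summing before dividing weights each record by its population, which differs from averaging per-record rates.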


## Pivoting data

### 1. Age, Sex, Suicides_no

In [None]:
data_n[['sex','suicides_no']].groupby(['sex']).mean().sort_values(by='suicides_no', ascending=False).plot(kind='bar')

We observe that the average number of suicides per record is much higher for **males** than for females

In [None]:
plt.figure(figsize=(10,5))
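# Sum only suicides_no, so that text columns are not passed through sum()
age_sex = data_n.groupby(['age', 'sex'])['suicides_no'].sum().reset_index()
sns.barplot(x='age', y='suicides_no', hue='sex', data=age_sex).set_title('Age vs Suicides')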
plt.xticks(rotation=90)

We observe here that:
* The number of suicides is **highest** in the **35-54 years** age group
* The number of suicides is **lowest** in the **5-14 years** age group

### 2. Country, suicides_no

In [None]:
country_suicides = data_n[['country','suicides_no']].groupby(['country']).sum()
country_suicides.plot(kind='bar', figsize=(40,10), fontsize=25)

* Total suicide counts are **highest in the Russian Federation, the United States and Japan**
* Totals are very low in many countries
* Totals are **moderate in France, Ukraine, Germany, Brazil, Republic of Korea, Poland, Thailand, United Kingdom, Canada, Italy, Mexico, etc.**
* These are raw counts, so more populous countries tend to rank higher


### Top 15 countries with the most suicides

In [None]:
country_suicides = country_suicides.reset_index().sort_values(by='suicides_no', ascending=False)
top15 = country_suicides[:15]
sns.barplot(x='country', y='suicides_no', data=top15).set_title('countries with most suicides')
plt.xticks(rotation=90)

### 15 countries with the fewest suicides

In [None]:
bottom15 = country_suicides[-15:]
sns.barplot(x='country', y='suicides_no', data=bottom15).set_title('countries with least suicides')
plt.xticks(rotation=90)

* The suicide count for ***Dominica*** and ***Saint Kitts and Nevis*** is **zero**.
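
As an alternative to sorting and slicing, pandas offers `nlargest` and `nsmallest` for picking the extremes of a series. The sketch below uses an invented series of per-country totals (`totals`, with placeholder country names), not the real data.

```python
import pandas as pd

# Invented per-country totals with placeholder country names.
totals = pd.Series(
    {"A": 120, "B": 0, "C": 45, "D": 300, "E": 7},
    name="suicides_no",
)

# The two largest and two smallest totals, each already sorted.
top2 = totals.nlargest(2)
bottom2 = totals.nsmallest(2)
print(top2)
print(bottom2)
```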

### Suicides by year distribution

In [None]:
data_n[['year','suicides_no']].groupby(['year']).sum().plot()

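We can observe that:
* The total number of suicides grew rapidly from 1990 onwards
* The total drops sharply in 2016; this may reflect incomplete data for that year (for example, fewer countries reporting) rather than a real decline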
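
One way to judge whether a change in the yearly totals above is real is to check how many countries report in each year. This is a sketch on invented data: the frame `toy` and the derived column names are hypothetical.

```python
import pandas as pd

# Invented records: three countries report in 2014, two in 2015, one in 2016.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "A"],
    "year": [2014, 2014, 2014, 2015, 2015, 2016],
    "suicides_no": [10, 20, 30, 11, 19, 12],
})

# Yearly total next to the number of distinct countries behind it.
per_year = toy.groupby("year").agg(
    total_suicides=("suicides_no", "sum"),
    countries_reporting=("country", "nunique"),
)
per_year["suicides_per_country"] = (
    per_year["total_suicides"] / per_year["countries_reporting"]
)
print(per_year)
```

If the number of reporting countries falls together with the total, the drop is at least partly a coverage effect.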

### Suicides categorised by generations

In [None]:
# countplot counts records (rows) per generation, not the number of suicides
ax = sns.countplot(x='generation', data=data_n)
_ = plt.setp(ax.get_xticklabels(), rotation=45)

1. **The Greatest Generation**: Born between **1901 and 1924**, they experienced the Great Depression and World War II in their adulthood.
2. **The Silent Generation**: Born between **1924 and 1945**, they grew up in conditions complicated by war and economic downturn and came of age during the happier postwar years.
3. **Baby Boomers**: Born in the years **after World War II**. These are the men and women who tuned in, got high, dropped out, dodged the draft, swung in the Sixties and became hippies in the Seventies. They are described as the first tolerant generation, and as seeing technology and innovation as requiring a learning process.
4. **Generation X**: Born between **1965 and 1980**, they are the “latch-key kids” who grew up street-smart but isolated, often with divorced or career-driven parents.
5. **Millennials**: Researchers and commentators use birth years ranging from the **early 1980s to the early 2000s.** Known as sophisticated, technology-wise and immune to most traditional marketing and sales pitches, they have seen it all and been exposed to it all since early childhood.
6. **Generation Z**: The generation born **after 1995**; they have never known a world without computers and cell phones.

In [None]:
gen_year = data_n[['suicides_no','generation','year']].groupby(['generation','year']).sum().reset_index()
plt.figure(figsize=(25,10))
sns.set(font_scale=1.5)
plt.xticks(rotation=90)
sns.barplot(y='suicides_no', x='year', hue='generation', data=gen_year, palette='deep').set_title('Suicides vs generations per year')

* Suicide counts for Generation X (born between **1965 and 1980**) increase from 1995 onwards.
* Suicide counts for the Silent Generation are high and rise strongly between 1985 and 2010.
* Suicide counts for Boomers were high in 1991-1994, and Boomers were the generation with the most suicides from 1991 until 2008.
* Suicide counts for Millennials increase from 2011 onwards.
* Generation Z has very few suicides.


## Suicide counts of the top 15 countries by sex

In [None]:
top15data = data_n.loc[data_n['country'].isin(top15.country)]
country_suicides_sex = top15data[['country','suicides_no','sex']].groupby(['country','sex']).sum().reset_index().sort_values(by='suicides_no', ascending=False)
plt.figure(figsize=(25,10))
plt.xticks(rotation=90)
sns.barplot(x='country', y='suicides_no', hue='sex', data=country_suicides_sex).set_title('Suicide counts by sex, 15 countries with most suicides')

In [None]:
bottom15data = data_n.loc[data_n['country'].isin(bottom15.country)]
country_suicides_sex = bottom15data[['country','suicides_no','sex']].groupby(['country','sex']).sum().reset_index().sort_values(by='suicides_no', ascending=False)
plt.figure(figsize=(25,10))
plt.xticks(rotation=90)
sns.barplot(x='country', y='suicides_no', hue='sex', data=country_suicides_sex).set_title('Suicide counts by sex, 15 countries with fewest suicides')

### Female suicide counts by country

In [None]:
female_data = data_n.loc[data_n['sex']=='female']
female_suicides = female_data[['country','suicides_no','sex']].groupby(['country','sex']).sum().reset_index().sort_values(by='suicides_no', ascending=False)
plt.figure(figsize=(25,10))
plt.xticks(rotation=90)
sns.barplot(x='country', y='suicides_no', data=female_suicides).set_title('Female suicide counts by country')

We can observe that female suicide counts are **highest** in countries such as **Japan, the Russian Federation and the United States.**

In [None]:
f, ax = plt.subplots(figsize=(5,5))
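# Correlate numeric columns only; recent pandas versions raise an error on text columns
sns.heatmap(data_n.select_dtypes(include='number').corr(), annot=True, linewidths=.5, fmt='.1f', ax=ax)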

The heat map shows that:
1. Population and gdp_for_year are correlated
2. Population and suicides_no are also correlated
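
The heat map is built from pairwise Pearson correlations of the numeric columns. A minimal sketch on an invented frame (`toy`, whose values are chosen to be perfectly linear) shows the selection and the resulting matrix:

```python
import pandas as pd

# Invented values: suicides_no rises with population, GDP per capita falls.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "population": [100, 200, 300, 400],
    "suicides_no": [1, 2, 3, 4],
    "gdp_per_capita_dollars": [40, 30, 20, 10],
})

# Keep only numeric columns, then compute Pearson correlations.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr.round(2))
```

A positive correlation between population and raw suicide counts is expected, since counts scale with the number of people.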

### Country, Population

In [None]:
plt.figure(figsize=(25,10))
plt.xticks(rotation=90)
sns.barplot(x='country', y='population', hue='sex', data=data_n).set_title('country vs population')

* The United States has the highest population, followed by Brazil, the Russian Federation, Japan and Mexico, with females outnumbering males
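
Each record covers one age group in one year, so a bar plot of raw rows shows the average population per record. A country-level figure can be sketched by summing over age groups within each year and then averaging over years. The frame `toy` and its numbers are invented for illustration.

```python
import pandas as pd

# Invented records: one country, two years, two sexes, two age groups.
toy = pd.DataFrame({
    "country": ["A"] * 8,
    "year": [2000] * 4 + [2001] * 4,
    "sex": ["female", "female", "male", "male"] * 2,
    "age": ["15-24 years", "25-34 years"] * 4,
    "population": [100, 150, 90, 140, 110, 160, 100, 150],
})

# Step 1: total population per country, year and sex (sum over age groups).
yearly = toy.groupby(["country", "year", "sex"])["population"].sum()

# Step 2: average those yearly totals over the years.
avg_population = yearly.groupby(["country", "sex"]).mean()
print(avg_population)
```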

**Let's group the years into roughly decade-long periods**

In [None]:
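def decade_mapping(year):
    """Map a year to one of three period labels (the last one spans 2005-2016)."""
    if 1985 <= year <= 1994:
        return "1985-1994"
    elif 1995 <= year <= 2004:
        return "1995-2004"
    else:
        # Any other year (2005-2016 in this dataset) falls into the last period
        return "2005-2016"

# Note: this overwrites the numeric year column with period labels
data_n['year'] = data_n['year'].apply(decade_mapping)
data_n.head(3)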
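
The mapping above can also be expressed with `pd.cut`, which bins numeric values by explicit edges. This is a sketch of an alternative, not the approach used in this kernel; the series `years` is invented, and the bin edges are chosen so that each label covers the same years as the function.

```python
import pandas as pd

# Boundary years of each period, to check the bin edges.
years = pd.Series([1985, 1994, 1995, 2004, 2005, 2016])

# Bins are right-inclusive by default: (1984, 1994], (1994, 2004], (2004, 2016].
periods = pd.cut(
    years,
    bins=[1984, 1994, 2004, 2016],
    labels=["1985-1994", "1995-2004", "2005-2016"],
)
print(periods.astype(str).tolist())
```

Unlike the `else` branch of the function, `pd.cut` leaves years outside the bins as missing values, which makes unexpected input visible. Storing the result in a separate column would also keep the numeric year available.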

**Generation, Sex, Year**

In [None]:
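# 'height' replaces the older 'size' argument of FacetGrid
grid = sns.FacetGrid(data_n, row='generation', col='year', height=5, aspect=1.5)
# Fix the category order so that every facet draws male and female in the same position
grid.map(sns.barplot, 'sex', 'suicides_no', order=['male', 'female'], alpha=.5, ci=None)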
grid.add_legend()

We observe that:
* Suicide counts for Generation Z are very low.
* Among Boomers, male suicide counts are higher in the period 1995-2004.
* Female suicide counts have increased over the periods in Generation X, the Silent Generation, Boomers and Millennials.

***Please upvote if this kernel helped you in any way; it will boost my confidence. Please also comment if I am wrong somewhere. Thank you!***