This is an Exploratory Data Analysis with Python.

The exploration has three sections, focusing on **gender**, **generations**, and **age groups**.

The dataset [WHO Suicide Statistics](https://www.kaggle.com/szamil/who-suicide-statistics) was compiled by @Szamil using the WHO Mortality Database online tool.

**Import libraries and load data**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv("../input/suicide-rates-overview-1985-to-2016/master.csv")
df.head()

In [None]:
df.shape

**Data wrangling**

In [None]:
missing_data = df.isnull()

for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print('')
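
The per-column loop above can also be condensed into one summary table. This is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not taken from the Kaggle file), counting missing values per column and their share of all rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'suicides_no': [10, 12, 7, 9],
    'HDI for year': [np.nan, 0.71, np.nan, np.nan],
})

# Count missing values per column and express them as a share of all rows
missing_summary = pd.DataFrame({
    'missing': toy.isnull().sum(),
    'share': toy.isnull().mean(),
})
print(missing_summary)
```

A column with a large share of missing values, like the toy `HDI for year` here, is a natural candidate for dropping.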

In [None]:
df.drop(['country-year','HDI for year'], axis = 1, inplace = True)
df.head()

In [None]:
# remove the duplicates (drop_duplicates returns a new DataFrame, so reassign it)
df = df.drop_duplicates()
df.shape

In [None]:
df.dtypes

**Male vs. Female**

In [None]:
# select the columns needed for the gender plot
df_sex = df[['year', 'suicides_no', 'sex']]

# total suicides per year (both sexes combined)
df_total = df_sex.groupby('year')[['suicides_no']].sum()

# suicides per year and sex
df_sex = df_sex.groupby(['year', 'sex'])[['suicides_no']].sum()
df_sex.head()

In [None]:
# total suicides per year
df_total = df_total.rename(columns = {'suicides_no':'Total'}).reset_index()

# male suicides per year
df_male = (df_sex.query("sex == 'male'")
                 .rename(columns = {'suicides_no':'Male'})
                 .reset_index()
                 .drop(columns = 'sex'))

# female suicides per year
df_female = (df_sex.query("sex == 'female'")
                   .rename(columns = {'suicides_no':'Female'})
                   .reset_index()
                   .drop(columns = 'sex'))

# merge the three dataframes for plotting
suicides = df_male.merge(df_female, on = 'year').merge(df_total, on = 'year')
suicides = suicides.set_index('year')
# drop 2016, whose data is incomplete
suicides = suicides.drop(index = 2016)
suicides.head()
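
The query, rename and merge steps above could likely be replaced by a single pivot. This is a sketch under assumptions: `toy` is a hypothetical frame with the same three columns, and `by_sex` is an illustrative name that does not appear in the notebook.

```python
import pandas as pd

# Hypothetical toy rows with the same three columns used above
toy = pd.DataFrame({
    'year': [1985, 1985, 1985, 1986, 1986],
    'sex': ['male', 'female', 'male', 'male', 'female'],
    'suicides_no': [30, 10, 20, 40, 15],
})

# One row per year, one column per sex, summed counts in the cells
by_sex = toy.pivot_table(index='year', columns='sex',
                         values='suicides_no', aggfunc='sum')
by_sex = by_sex.rename(columns={'male': 'Male', 'female': 'Female'})
by_sex.columns.name = None

# Add the yearly total and put the columns in plotting order
by_sex['Total'] = by_sex['Male'] + by_sex['Female']
by_sex = by_sex[['Male', 'Female', 'Total']]
print(by_sex)
```

The result has the same layout as `suicides` (year as the index; `Male`, `Female` and `Total` as columns), so it could be passed to the area plot in the same way.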

**Create the area plot:**

In [None]:
# area plot to show the total suicides and distribution between men and women
suicides.index = suicides.index.map(int)
suicides.plot(kind = 'area',
              alpha = 0.20,
              stacked = False,
              color = ['coral', 'green', 'purple'],
              figsize = (20, 8))

plt.title('World Suicides Trend from 1985 to 2015')
plt.ylabel('Number of Suicides')
plt.xlabel('Years')

plt.show()

**Analysis:**

   * Male suicides outnumbered female suicides by roughly two to three times.
   * The number of suicides grew significantly at the end of the 1980s, followed by more gradual growth from 1990 to 1996, reaching around 250,000 cases per year.
   * There was a drop from 2003 to 2005, and a downward trend after 2010.
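
The "two to three times" observation can be checked numerically by dividing the male column by the female column. This is a minimal sketch with made-up numbers; in the notebook the same division would be applied to the `suicides` frame.

```python
import pandas as pd

# Hypothetical yearly totals; the real values come from the `suicides` frame
toy = pd.DataFrame({'Male': [300, 330], 'Female': [100, 110]},
                   index=pd.Index([1985, 1986], name='year'))

# Male-to-female ratio for each year
ratio = toy['Male'] / toy['Female']
print(ratio.round(2))
```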

**Generational Highlight**

In [None]:
# select the generation figures for plotting
df_g = df[['year', 'suicides_no', 'generation']]
df_g = df_g.groupby(['year', 'generation'])[['suicides_no']].sum()
df_g = df_g.reset_index()
df_g.head()

In [None]:
df_g['generation'].value_counts()

In [None]:
# G.I. Generation - people born from 1901 to 1927
df_GI = (df_g.query("generation == 'G.I. Generation'")
             .rename(columns = {'suicides_no':'G.I. Generation'})
             .drop(columns = 'generation')
             .reset_index(drop = True))

# Silent - people born from 1928 to 1945
df_silent = (df_g.query("generation == 'Silent'")
                 .rename(columns = {'suicides_no':'Silent'})
                 .drop(columns = 'generation')
                 .reset_index(drop = True))

# Boomers - people born from 1946 to 1964
df_boomers = (df_g.query("generation == 'Boomers'")
                  .rename(columns = {'suicides_no':'Boomers'})
                  .drop(columns = 'generation')
                  .reset_index(drop = True))

# Generation X - people born from 1960s to 1970s
df_X = (df_g.query("generation == 'Generation X'")
            .rename(columns = {'suicides_no':'Generation X'})
            .drop(columns = 'generation')
            .reset_index(drop = True))

# Millennials - people born from 1981 to 1996 (the dataset spells the label 'Millenials')
df_M = (df_g.query("generation == 'Millenials'")
            .rename(columns = {'suicides_no':'Millenials'})
            .drop(columns = 'generation')
            .reset_index(drop = True))

# Generation Z - people born from 1997 onward
df_Z = (df_g.query("generation == 'Generation Z'")
            .rename(columns = {'suicides_no':'Generation Z'})
            .drop(columns = 'generation')
            .reset_index(drop = True))

In [None]:
# merge the dataframes for plotting
generation = (df_silent.merge(df_X, on = 'year')
                       .merge(df_boomers, how = 'left', on = 'year')
                       .merge(df_M, how = 'left', on = 'year')
                       .merge(df_GI, how = 'left', on = 'year')
                       .merge(df_Z, how = 'left', on = 'year'))
generation = generation[['year','G.I. Generation','Silent','Boomers','Generation X','Millenials','Generation Z']]
generation.set_index('year', inplace = True)
generation

In [None]:
# replace the Boomers NaN with the column mean
avg_boomers = generation['Boomers'].astype('float').mean()
generation['Boomers'] = generation['Boomers'].fillna(avg_boomers)

# replace the remaining NaN with 0 (a number, so the columns stay numeric)
generation = generation.fillna(0)

# drop the 2016 row because its data is incomplete and the figures look unreliable
generation = generation.drop(index = 2016)

# cast the years (the index) to int and turn the index into a 'Year' column
generation.index = generation.index.astype(int)
generation.index.name = 'Year'
generation = generation.reset_index()

generation.head()

In [None]:
generation.dtypes

In [None]:
generation[['Boomers', 'Millenials','G.I. Generation','Generation Z']] = generation[['Boomers', 'Millenials','G.I. Generation','Generation Z']].astype('int')
generation.dtypes

In [None]:
# normalise the data
norm_GI = (generation['G.I. Generation'] - generation['G.I. Generation'].min()) / (generation['G.I. Generation'].max() - generation['G.I. Generation'].min())
norm_silent = (generation['Silent'] - generation['Silent'].min()) / (generation['Silent'].max() - generation['Silent'].min())
norm_boomers = (generation['Boomers'] - generation['Boomers'].min()) / (generation['Boomers'].max() - generation['Boomers'].min())
norm_X = (generation['Generation X'] - generation['Generation X'].min()) / (generation['Generation X'].max() - generation['Generation X'].min())
norm_M = (generation['Millenials'] - generation['Millenials'].min()) / (generation['Millenials'].max() - generation['Millenials'].min())
norm_Z = (generation['Generation Z'] - generation['Generation Z'].min()) / (generation['Generation Z'].max() - generation['Generation Z'].min())
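
Each line above applies the same min-max formula, `(x - min) / (max - min)`, which rescales a column to the range 0 to 1. The repetition could be factored into a small helper. In this sketch the function name `min_max` and the toy data are hypothetical, and the guard for a constant column is an addition that the cell above does not have.

```python
import pandas as pd

def min_max(series: pd.Series) -> pd.Series:
    """Rescale a numeric Series to the [0, 1] range."""
    span = series.max() - series.min()
    if span == 0:
        # Constant column: avoid dividing by zero and return all zeros
        return pd.Series(0.0, index=series.index, name=series.name)
    return (series - series.min()) / span

# Hypothetical toy columns; the second one is constant on purpose
toy = pd.DataFrame({'Silent': [10, 20, 30], 'Generation Z': [5, 5, 5]})

# Apply the helper to every column at once
normalised = toy.apply(min_max)
print(normalised)
```

The normalised values are only used to size the bubbles in the scatter plot, so a constant column mapping to zero simply yields the minimum bubble size.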

In [None]:
# G.I. Generation
ax0 = generation.plot(kind='scatter',x='Year',y='G.I. Generation',figsize=(20, 12),alpha=0.5,color='pink',
                      s=norm_GI * 1500 + 10,xlim=(1985, 2015),ylim=(0, 150000))
# Silent
ax1 = generation.plot(kind='scatter',x='Year',y='Silent',figsize=(20, 12),alpha=0.5,color='gray',
                      s=norm_silent * 1500 + 10,xlim=(1985, 2015),ylim=(0, 150000), ax=ax0)
# Boomers
ax2 = generation.plot(kind='scatter',x='Year',y='Boomers',figsize=(20, 12),alpha=0.5,color='green',
                      s=norm_boomers * 1500 + 10,xlim=(1985, 2015),ylim=(0, 150000), ax=ax0)
# Generation X
ax3 = generation.plot(kind='scatter',x='Year',y='Generation X',figsize=(20, 12),alpha=0.5,color='Coral',
                      s=norm_X * 1500 + 10,xlim=(1985, 2015),ylim=(0, 150000), ax=ax0)
# Millenials
ax4 = generation.plot(kind='scatter',x='Year',y='Millenials',figsize=(20, 12),alpha=0.5,color='blue',
                      s=norm_M * 1500 + 10,xlim=(1985, 2015),ylim=(0, 150000), ax=ax0)
# Generation Z
ax5 = generation.plot(kind='scatter',x='Year',y='Generation Z',figsize=(20, 12),alpha=0.5,color='brown',
                      s=norm_Z * 1500 + 10,xlim=(1985, 2015),ylim=(0, 150000), ax=ax0)

ax0.set_ylabel('Number of Suicides')
ax0.set_title('World Suicides Trend by Generations from 1985 to 2015')
ax0.legend(['G.I. Generation (1901-1927)','Silent (1928-1945)','Boomers (1946-1964)','Generation X (1960s-1970s)','Millenials (1981-1996)',
            'Generation Z (1997 Onward)'], loc='upper left', fontsize='x-large')

ax0.text(2007, 118000, '2007-2008 Financial Crisis') 
ax0.text(1992, 105000, '1990s Recession')

In [None]:
generation.set_index('Year', inplace = True)

In [None]:
# create a line plot for analysing the trends
generation.plot(kind = 'line', figsize=(20,8))


plt.title('World Suicides Trend by Generations from 1985 to 2015')
plt.ylabel('Number of Suicides')
plt.xlabel('Years')
plt.legend(['G.I. Generation (1901-1927)','Silent (1928-1945)','Boomers (1946-1964)','Generation X (1960s-1970s)','Millenials (1981-1996)',
            'Generation Z (1997 Onward)'], loc='upper left', fontsize='11')

plt.text(2006, 120000, '2007-2008 Financial Crisis') 
plt.text(1991,116000, '1990s Recession')

plt.show()

**According to the bubble chart and the line chart, there were two strong signals, in *1990 to 1995* and *2008 to 2011*, which coincide with two major economic crises: the early 1990s recession and the 2007-08 financial crisis.**

**1990~1995**
* Signals: Boomer suicides rose significantly, followed by a sharp drop in 1994. G.I. Generation suicides fell steeply at the start of the 1990s. Generation X suicides increased sharply from 1994 to 1995.
* Background: **Early 1990s Recession**
* Analysis:
    *    **Generation X** (25 to 35 years old)↗ had reached their mid-20s to mid-30s, the stage of looking for jobs and establishing a career path, just as the world was in recession.
    *    **Boomers** (30 to 50 years old)↗ had reached the prime of their lives, when some people may start to face a midlife crisis.
    *    **G.I. Generation** (65 years old and over)↘ had reached the last chapter of their lives; reduced stress may have contributed to the fall in the number of suicides.

**2008~2011**
* Signals: Suicides rose significantly among Generation X and Millennials, whereas they dropped sharply among the Silent generation and mildly among Boomers.
* Background: **2007-2008 Financial Crisis**
* Analysis: 
    *     **Millennials** (15 to 30 years old)↗ had just started their careers; many may have been laid off because of the crisis, and finding another job was relatively difficult at that time.
    *     **Generation X** (40 to 50 years old)↗ were at the peak of their careers and were often the main providers for their families, yet some may have lost their jobs and savings in the financial crisis.
    *     **Boomers** (46 to 64 years old)↘ showed a sharp downward trend after 2010; the reason could lie in the characteristics of this generation, which remains to be explored.
    *     **Silents** (65 to 82 years old)↘ were moving into life after 65, with fewer things to worry about.


**Age Highlight**

In [None]:
df_age = df[['year','age','suicides_no']]
df_age = df_age.groupby(['year','age'])[['suicides_no']].sum()
df_age.rename(columns = {'suicides_no':'total'}, inplace=True)
df_age = df_age.reset_index()
df_age.head(10)

In [None]:
# 5-14
df_5 = (df_age.query("age == '5-14 years'")
              .rename(columns = {'total':'5-14 years'})
              .drop(columns = 'age'))

# 15-24
df_15 = (df_age.query("age == '15-24 years'")
               .rename(columns = {'total':'15-24 years'})
               .drop(columns = 'age'))

# 25-34
df_25 = (df_age.query("age == '25-34 years'")
               .rename(columns = {'total':'25-34 years'})
               .drop(columns = 'age'))

# 35-54
df_35 = (df_age.query("age == '35-54 years'")
               .rename(columns = {'total':'35-54 years'})
               .drop(columns = 'age'))

# 55-74
df_55 = (df_age.query("age == '55-74 years'")
               .rename(columns = {'total':'55-74 years'})
               .drop(columns = 'age'))

# 75+
df_75 = (df_age.query("age == '75+ years'")
               .rename(columns = {'total':'75+ years'})
               .drop(columns = 'age'))

# merge the dataframes for plotting
age = (df_5.merge(df_15, on = 'year')
           .merge(df_25, on = 'year')
           .merge(df_35, on = 'year')
           .merge(df_55, on = 'year')
           .merge(df_75, on = 'year'))
age = age.set_index('year')
age.head()

In [None]:
age.plot(kind = 'area',
         alpha = 0.25,
         stacked = True,
         figsize = (20, 8),
         color = ['coral', 'darkslateblue', 'mediumseagreen','deeppink','yellow','lightskyblue'])

plt.title('World Suicides Trend by Age Groups from 1985 to 2015')
plt.ylabel('Number of Suicides')
plt.xlabel('Years')

plt.show()

In [None]:
count, bin_edges = np.histogram(age, 15) 

age.plot(kind='hist', 
         figsize=(20,8),
         bins = 15, 
         alpha = 0.5,
         stacked = False,
         color = ['coral', 'darkslateblue', 'mediumseagreen','deeppink','yellow','lightskyblue'],
         xticks = bin_edges)

plt.title('Distribution of Yearly Suicides by Age Groups from 1985 to 2015')
plt.ylabel('Number of Years')
plt.xlabel('Number of Suicides')

plt.show()

In [None]:
age.plot(kind = 'line', figsize=(20,8),color = ['coral', 'darkslateblue', 'mediumseagreen','deeppink','yellow','lightskyblue'])


plt.title('World Suicides Trend by Age Groups from 1985 to 2015')
plt.ylabel('Number of Suicides')
plt.xlabel('Years')
plt.legend(['5-14 years','15-24 years','25-34 years','35-54 years','55-74 years',
            '75+ years'], loc='upper left', fontsize='11')

plt.show()

* Analysis 
    * The plots show that the 35-54 age group had the highest number of suicides, whereas the 5-14 age group had the lowest.
    * The number of suicides among people aged 75 and over climbed slowly from the 1990s and surpassed that of the 15-24 age group in 2014.
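
A crossover like the one in the last point can be located programmatically instead of being read off the chart. This is a minimal sketch with made-up counts shaped like the `age` frame; the values and the name `first_year` are hypothetical.

```python
import pandas as pd

# Hypothetical counts shaped like the `age` frame (index = year)
toy = pd.DataFrame({'15-24 years': [120, 115, 110, 100],
                    '75+ years': [90, 100, 112, 118]},
                   index=pd.Index([2012, 2013, 2014, 2015], name='year'))

# Boolean mask of the years in which the older group is larger
older_higher = toy['75+ years'] > toy['15-24 years']

# idxmax returns the label of the first True value; guard the all-False case
first_year = older_higher.idxmax() if older_higher.any() else None
print(first_year)
```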

**Conclusion**

Across the three parts of this exploration, the results indicate that gender and age group are strongly associated with the number of suicides. However, in the face of an external crisis such as a recession, the trends differ between generations.