# Dissecting Suicide Rates: A Data Analysis

In this analysis, I have used the [Suicide Rates Overview 1985 to 2016](https://www.kaggle.com/datasets/russellyates88/suicide-rates-overview-1985-to-2016) dataset to perform a step-by-step analysis and answer the following questions:
<ul>
<li>How many people lost their lives to suicide each year?</li>
<li>Which gender is more likely to commit suicide?</li>
<li>Which age group tends to have the most victims?</li>
<li>How are suicide rates related to the GDP per Capita?</li>
<li>What are the average suicide rates across generations over time?</li>
<li>When was this issue at its peak?</li>
</ul>

### Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.interpolate import griddata

### Loading the data

In [None]:
df = pd.read_csv('../data/master.csv')
df.head()

### Data Exploration

In [None]:
print(f'The dataset has {df.shape[0]} rows and {df.shape[1]} columns.')

The dataset has 27820 rows and 12 columns.

In [None]:
print('The column names are: ', df.columns.tolist())

The column names are: country, year, sex, age, suicides_no, population, suicides/100k pop, country-year, HDI for year,  gdp_for_year \(\$\) , gdp_per_capita \(\$\), generation

In [None]:
print('The data types of the columns are:\n', df.dtypes)

<i>**NOTE:**</i> The data type of `gdp_for_year ($)` is `object`, probably because its values contain commas as thousands separators.
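As a quick illustration of why commas lead to a non-numeric column, the sketch below reads a tiny made-up CSV (not the real dataset) twice: once as-is, and once with the `thousands=','` option of `pd.read_csv`, which is one alternative to cleaning the column after loading. The column name `gdp_for_year` and the sample rows are assumptions for the example.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics a comma-formatted GDP column
sample_csv = 'country,gdp_for_year\nA,"2,156,624,900"\nB,"63,067,077,179"\n'

# Read as-is: the commas keep pandas from parsing the values as numbers
raw = pd.read_csv(io.StringIO(sample_csv))

# Read with thousands=',': the separators are dropped during parsing
parsed = pd.read_csv(io.StringIO(sample_csv), thousands=',')

print(raw['gdp_for_year'].dtype)     # a string/object dtype, not numeric
print(parsed['gdp_for_year'].dtype)  # an integer dtype
```

With the real file, the same option could be passed when loading `master.csv`, at the cost of applying it to every column.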

In [None]:
print(f'There are {df.duplicated().sum()} duplicate rows in the dataset.')

There are 0 duplicate rows in the dataset.

In [None]:
# Checking for missing values
df.isnull().sum()

<i>**NOTE:**</i> `HDI for year` is a column with 69.93% missing values.
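A percentage like the one above can be computed directly from the null mask. The following is a minimal sketch on made-up data (the values are hypothetical), relying on the fact that the mean of a boolean column is the fraction of `True` entries.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 3 of the 4 HDI values are missing
sample = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'HDI for year': [np.nan, 0.7, np.nan, np.nan],
})

# isnull() yields booleans; their column-wise mean is the missing fraction
missing_pct = sample.isnull().mean() * 100
print(missing_pct.round(2))
```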

In [None]:
print('The possible values for age are: ', df['age'].unique())
print('The possible values for sex are: ', df['sex'].unique())
print('The possible values for generation are: ', df['generation'].unique())

The possible values for age are: 15-24 years, 35-54 years, 75+ years, 25-34 years, 55-74 years, 5-14 years<br />
The possible values for sex are: male, female<br />
The possible values for generation are: Generation X, Silent, G.I. Generation, Boomers, Millenials, Generation Z<br />
<br />
<i>**NOTE:**</i> The age groups are not in ascending order (`5-14 years` appears last).

### Data Cleaning

In [None]:
df = df.drop(columns='HDI for year')
# validation
print('Updated columns: ', df.columns.tolist())

Removed the `HDI for year` column because most of its values are missing.

In [None]:
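# The raw column name has surrounding spaces (' gdp_for_year ($) '); rename it
# so the cleaned values replace the original column instead of adding a new one
df = df.rename(columns={' gdp_for_year ($) ': 'gdp_for_year ($)'})
df['gdp_for_year ($)'] = df['gdp_for_year ($)'].str.replace(',', '').astype(float)
# validation
print('The data types of the columns now:\n', df.dtypes)

Stripped the surrounding spaces from the `gdp_for_year ($)` column name, removed the commas from its values, and converted them to floats.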

In [None]:
# Ordering Age
age_order = ['5-14 years', '15-24 years', '25-34 years', '35-54 years', '55-74 years', '75+ years']
df['age'] = pd.Categorical(df['age'], categories=age_order, ordered=True)

Converted `age` to an ordered categorical so that the age groups sort from youngest to oldest.
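To see what the ordered categorical changes, the sketch below sorts a small made-up series of age labels twice: as plain strings and as an ordered categorical. The sample values are hypothetical; the category list is the same one used above.

```python
import pandas as pd

age_order = ['5-14 years', '15-24 years', '25-34 years',
             '35-54 years', '55-74 years', '75+ years']

# Hypothetical sample of age labels in arbitrary order
ages = pd.Series(['75+ years', '5-14 years', '35-54 years', '15-24 years'])

# Plain strings sort character by character, so '5-14 years' lands after '35-54 years'
alphabetical = ages.sort_values().tolist()

# An ordered categorical sorts by the position of each label in age_order
categorical_ages = pd.Series(pd.Categorical(ages, categories=age_order, ordered=True))
ordered = categorical_ages.sort_values().tolist()

print(alphabetical)
print(ordered)
```

The same ordering carries over to `groupby` results and plot axes, which is why the age bar chart later appears youngest to oldest.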

### Data Visualization

In [None]:
annual_suicides = df.groupby('year')['suicides_no'].sum().reset_index()

sns.set_style("darkgrid")
sns.set_context("notebook")

plt.figure(figsize=(10,6))
line_plot = sns.lineplot(x='year', y='suicides_no', data=annual_suicides, color='blue', linewidth=2.5)

x = annual_suicides['year']
y1 = annual_suicides['suicides_no']
plt.fill_between(x, y1, color="blue", alpha=0.1)

plt.title('Annual Suicides Globally', fontsize=20, fontweight='bold')
plt.xlabel('Year', fontsize=15)
plt.ylabel('Number of Suicides', fontsize=15)
sns.despine()
plt.show()

<i>**NOTE:**</i> The sharp drop in 2016 is most likely an artifact of incomplete reporting for that year (the dataset's cutoff) rather than a real decline.
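One way to check whether a year is incomplete is to count how many countries report in it. The sketch below does this on made-up data; the 50% threshold is an arbitrary rule of thumb chosen for illustration, not something taken from the dataset's documentation.

```python
import pandas as pd

# Hypothetical sample: three countries report in 2014 and 2015, only one in 2016
sample = pd.DataFrame({
    'country': ['A', 'B', 'C', 'A', 'B', 'C', 'A'],
    'year': [2014, 2014, 2014, 2015, 2015, 2015, 2016],
    'suicides_no': [10, 20, 30, 12, 18, 33, 11],
})

# Number of reporting countries and total suicides per year
coverage = sample.groupby('year').agg(
    countries=('country', 'nunique'),
    suicides=('suicides_no', 'sum'),
)
print(coverage)

# Assumed rule of thumb: keep years where at least half of the
# maximum number of countries report
threshold = 0.5 * coverage['countries'].max()
complete_years = coverage.index[coverage['countries'] >= threshold].tolist()
print(complete_years)
```

Applied to the real `df`, a low country count for 2016 would support excluding that year from trend plots.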

In [None]:
suicides_and_gender = df.groupby('sex')['suicides_no'].sum().reset_index()

sns.set_style("darkgrid")
sns.set_context("notebook")

plt.figure(figsize=(10,6))
bar_plot = sns.barplot(x='sex', y='suicides_no', data=suicides_and_gender, palette=['#1f77b4', '#ff7f0e'])

plt.title('Net Suicides by Gender', fontsize=20, fontweight='bold')
plt.xlabel('Gender', fontsize=15)
plt.ylabel('Number of Suicides', fontsize=15)

for p in bar_plot.patches:
    bar_plot.annotate(format(p.get_height(), '.0f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 10), 
                   textcoords = 'offset points')

sns.despine()
plt.show()

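This graph shows roughly 3.3 times as many male victims as female victims. Because these are raw counts, reading this as "3.3x more likely" assumes that the male and female populations are of similar size.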
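Because the bar chart compares raw counts, one way to check the "more likely" reading is to normalize by population. The following is a minimal sketch on made-up numbers (not results from the dataset), using the `suicides_no` and `population` columns to compute a rate per 100k for each sex.

```python
import pandas as pd

# Hypothetical sample with unequal male and female populations
sample = pd.DataFrame({
    'sex': ['male', 'female', 'male', 'female'],
    'suicides_no': [30, 10, 36, 10],
    'population': [100_000, 100_000, 120_000, 100_000],
})

# Sum counts and populations first, then divide, so that larger
# groups carry proportionally more weight
totals = sample.groupby('sex')[['suicides_no', 'population']].sum()
totals['rate_per_100k'] = totals['suicides_no'] / totals['population'] * 100_000

ratio = totals.loc['male', 'rate_per_100k'] / totals.loc['female', 'rate_per_100k']
print(totals)
print(f'Male-to-female rate ratio: {ratio:.2f}')
```

Note that this differs from averaging the `suicides/100k pop` column, which weights every row equally regardless of population.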

In [None]:
suicides_and_age = df.groupby('age')['suicides_no'].sum().reset_index()

sns.set_style("darkgrid")
sns.set_context("notebook")
plt.figure(figsize=(10,6))
bar_plot = sns.barplot(x='age', y='suicides_no', data=suicides_and_age, palette=sns.color_palette("husl", 6))
plt.title('Net Suicides by Age', fontsize=20, fontweight='bold')
plt.xlabel('Age Group', fontsize=15)
plt.ylabel('Number of Suicides', fontsize=15)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)

for p in bar_plot.patches:
    bar_plot.annotate(format(p.get_height(), '.0f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 10), 
                   textcoords = 'offset points')

sns.despine()
plt.show()

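The number of suicides spikes in the 35-54 age group. Note that this bracket spans 20 years while the younger brackets span only 10, which accounts for part of the difference.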

In [None]:
suicides_and_gdp = df.groupby('country')[['suicides/100k pop', 'gdp_per_capita ($)']].mean().reset_index()

sns.set_style("darkgrid")
sns.set_context("notebook")

plt.figure(figsize=(10,6))
scatter_plot = sns.scatterplot(x='gdp_per_capita ($)', y='suicides/100k pop', data=suicides_and_gdp, color='red', alpha=0.6)

plt.title('Average Suicide Rates vs. GDP per Capita', fontsize=20, fontweight='bold')
plt.xlabel('GDP per Capita', fontsize=15)
plt.ylabel('Suicides per 100k People', fontsize=15)

plt.grid(True, linestyle='-', linewidth=0.5)
sns.despine()

plt.show()

There seems to be no clear trend, with lower-GDP countries showing a wide variation in average suicide rates.
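A visual impression of "no clear trend" can be backed up with a correlation coefficient. The sketch below uses made-up per-country averages (not values from the dataset) and computes both a Pearson correlation and a Spearman rank correlation, the latter obtained as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical per-country averages
sample = pd.DataFrame({
    'gdp_per_capita ($)': [1_000, 5_000, 20_000, 40_000, 60_000],
    'suicides/100k pop': [3.0, 18.0, 9.0, 14.0, 11.0],
})

x = sample['gdp_per_capita ($)']
y = sample['suicides/100k pop']

# Pearson measures linear association
pearson = x.corr(y)

# Spearman measures monotonic association: Pearson applied to the ranks
spearman = x.rank().corr(y.rank())

print(f'Pearson: {pearson:.2f}, Spearman: {spearman:.2f}')
```

Values near zero on the real `suicides_and_gdp` frame would be consistent with the scatter plot; the rank-based version is less sensitive to a few very high-GDP countries.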

In [None]:
suicides_and_generations = df.groupby(['year', 'generation'])['suicides/100k pop'].mean().reset_index()

sns.set_style("darkgrid")
sns.set_context("notebook")

plt.figure(figsize=(12,8))
line_plot = sns.lineplot(x='year', y='suicides/100k pop', hue='generation', data=suicides_and_generations, palette=sns.color_palette("husl", 6), linewidth=2.5)

plt.title('Average Suicide Rates Over Time by Generation', fontsize=20, fontweight='bold')
plt.xlabel('Year', fontsize=15)
plt.ylabel('Suicides per 100k People', fontsize=15)

plt.grid(True, linestyle='-', linewidth=0.5, color='white')
plt.legend(loc='upper right', title='Generation', title_fontsize='13', fontsize='12', facecolor='darkgrey')

sns.despine()
plt.show()

The suicide rates of the different generations seem to rise and fall in a similar way over time. This suggests that the rates are influenced by conditions shared across generations (the state of the world at a given time) rather than by generation alone.

In [None]:
grouped_df = df.groupby(['year', 'age', 'sex'])['suicides_no'].sum().reset_index()

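# Represent each age group by an approximate midpoint so it can be placed on a numeric axis
age_mapping = {'5-14 years': 10, '15-24 years': 20, '25-34 years': 30, '35-54 years': 45, '55-74 years': 65, '75+ years': 80}
# Mapping a categorical column returns a categorical; cast to int for the interpolation below
grouped_df['age_num'] = grouped_df['age'].map(age_mapping).astype(int)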

mean_suicides = grouped_df.groupby(['year', 'age_num'])['suicides_no'].mean().reset_index()

years_grid, age_num_grid = np.mgrid[mean_suicides['year'].min():mean_suicides['year'].max():100j, 
                                    mean_suicides['age_num'].min():mean_suicides['age_num'].max():100j]

suicides_grid = griddata((mean_suicides['year'], mean_suicides['age_num']), mean_suicides['suicides_no'], 
                         (years_grid, age_num_grid), method='cubic')

fig = plt.figure(figsize=(16, 9))
ax = fig.add_subplot(111, projection='3d')

surf = ax.plot_surface(years_grid, age_num_grid, suicides_grid, cmap='cool', edgecolor='k')

ax.set_xlabel('Year', fontsize=12)
ax.set_ylabel('Age', fontsize=12)
ax.set_zlabel('Suicides', fontsize=12)
ax.set_title('Suicides by Year and Age Group', fontsize=12)

fig.colorbar(surf, shrink=0.5, aspect=10)

plt.show()

This graph shows that the number of suicides peaked between 1995 and 2000, among people aged roughly 40 to 50 (the 35-54 group, which is plotted at 45).
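The peak can also be located numerically instead of being read off the surface. The following is a minimal sketch on made-up data (the numbers are hypothetical) that uses `idxmax` on grouped sums to find the year and age group with the most suicides.

```python
import pandas as pd

# Hypothetical sample covering three years and two age groups
sample = pd.DataFrame({
    'year': [1995, 1995, 1999, 1999, 2005, 2005],
    'age': ['35-54 years', '15-24 years'] * 3,
    'suicides_no': [80, 30, 95, 35, 70, 25],
})

# Total suicides per (year, age) pair; idxmax returns the index label of the maximum
by_year_age = sample.groupby(['year', 'age'])['suicides_no'].sum()
peak_year, peak_age = by_year_age.idxmax()

# Peak year across all age groups combined
peak_year_overall = sample.groupby('year')['suicides_no'].sum().idxmax()

print(f'Peak cell: {peak_year}, {peak_age}')
print(f'Peak year overall: {peak_year_overall}')
```

Unlike the interpolated surface, this works on the actual grouped values, so it is not affected by the cubic smoothing or by the midpoint chosen for each age bracket.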