# Suicide Rate Prediction

## EDA, Decision Tree, Linear Regression and much more! 

Hey guys, I hope you are doing well! I'm still a beginner in the Data Science field, so if you have any recommendations or suggestions, please put them in the comments down below! Thanks :D

Today in our analysis we will look at the "Suicide Rates Overview 1985 to 2016" dataset. This dataset contains 12 features: 
- country: The country of residence of the individual
- year: The year the suicides happened
- sex: The gender of the individual (male/female)
- age group: The age group of the individual
- count of suicides: The number of suicides that happened
- population: The population of the corresponding group (country, year, sex and age group)
- suicide rate: The number of suicides per 100,000 people
- country-year composite key: A code containing the country name plus the year of the suicide
- HDI for year: The Human Development Index, a composite statistical index of life expectancy, education, and per capita income indicators.
- gdpforyear: The GDP of the country at that year
- gdppercapita: The GDP per capita of the country at that year
- generation: The name of the corresponding generation 

During this analysis, we will first clean our data and make the needed transformations (we will keep transforming our data whenever it's needed during the analysis). Second, we will perform an exploratory data analysis where we will try to extract a few useful insights from our dataset. Finally, we will try to predict the suicide rate given the features we have.

![](https://uploads-ssl.webflow.com/5a4c78412b69220001d82c7d/5a4c78412b69220001d82dc2_171128_cropped_tsis-lowres.jpg)

## Data Preprocessing:

In [None]:
#Importing needed packages
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
sns.set()

- First, we'll start by importing our data. 

In [None]:
#Importing the dataset
raw_data = pd.read_csv('../input/suicide-rates-overview-1985-to-2016/master.csv')

- Now, we'll check for any null values.

In [None]:
#Checking for any null values
raw_data.isnull().sum()

- The HDI for year column has about two thirds of its values missing, so unfortunately we'll have to drop it. The country-year column isn't that useful either, so we'll discard it too.
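
As a side note, the share of missing values per column can be computed directly instead of read off the raw counts. Here is a minimal sketch on a made-up toy frame (the `toy` data is an assumption for illustration, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_data (values are invented for illustration)
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B', 'C', 'C'],
    'HDI for year': [0.7, np.nan, np.nan, np.nan, 0.8, np.nan],
})

# isnull() gives True/False per cell, so its mean is the fraction missing per column
missing_share = toy.isnull().mean()
print(missing_share)
```

On the real data, the same call on `raw_data` would show how much of each column is missing.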

In [None]:
#Getting the names of the columns we have
print(raw_data.columns)

In [None]:
#Removing the HDI and country-year columns
no_na_data = raw_data[['country', 'year', 'sex', 'age', 'suicides_no', 'population',
       'suicides/100k pop', ' gdp_for_year ($) ', 'gdp_per_capita ($)', 'generation']]
no_na_data.head()

- Let's now check for any outliers or wrong entries using the pandas describe method, which gives us a small summary of our numerical features. This will help us detect any visible anomalies in our data.

In [None]:
#Describing our dataset
no_na_data.describe()

- All the features seem to be in good order, except the suicide count and rate variables. It is very odd to have 0 suicides in a year! Let's check the entries for which we have no suicides in a given year.

In [None]:
#Checking the entries where suicides_no = 0 
no_na_data[no_na_data['suicides_no']==0]

- The entries with 0 suicides don't represent a whole country in a given year; each one represents a specific age category. So we are good to go: it's safe to say that our data is ready and we can work with it.
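
One way to double-check this is to add up the suicides per country and year and look for totals of zero. A minimal sketch on made-up rows (the `toy` frame is an assumption; on the real data the same grouping would be applied to `no_na_data`):

```python
import pandas as pd

# Toy rows: each row is one age category of a country in a given year
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'year': [2000, 2000, 2000, 2000, 2000],
    'age': ['5-14 years', '15-24 years', '75+ years', '5-14 years', '75+ years'],
    'suicides_no': [0, 12, 30, 0, 4],
})

# Total suicides per country and year
totals = toy.groupby(['country', 'year'])['suicides_no'].sum()

# Country-years where the total is zero (empty here, even though some rows are 0)
zero_country_years = totals[totals == 0]
print(zero_country_years)
```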

In [None]:
#Creating a new dataframe 'clean_data' to work with 
clean_data = no_na_data.copy()

## Exploratory Data Analysis:

In [None]:
#Grouping our data by year
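#numeric_only=True skips the text columns, which recent pandas versions refuse to average
gp_year_data = clean_data.groupby('year', as_index=False).mean(numeric_only=True)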

#Plotting the suicides rates by years 
fig, ax = plt.subplots(figsize=(12,4))
sns.lineplot(x='year', y='suicides/100k pop', color=sns.husl_palette(6)[5], data=gp_year_data, ax=ax)
plt.xlabel('Year')
plt.ylabel('Suicides/100k')
plt.title('Evolution of suicide rates\nthroughout the years', size=15)
plt.show()

- From the line plot above we can see that the suicide rate oscillated between 11 and 16 suicides per 100,000 people. It peaked in 1995 at almost 16 suicides per 100,000 people, then started dropping, reaching a minimum of about 11 suicides per 100,000 people in 2010 and 2014. Let's check if we can determine which countries have the highest number of suicides.
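
A small caveat on the yearly figures: taking the mean of the suicides/100k pop column gives every row the same weight, whatever the size of its population. A population-weighted rate (total suicides divided by total population) can differ from it. A minimal sketch with invented numbers, just to show the difference:

```python
import pandas as pd

# Toy rows with invented counts and populations
toy = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001],
    'suicides_no': [10, 90, 20, 80],
    'population': [100_000, 300_000, 100_000, 400_000],
})
toy['suicides/100k pop'] = toy['suicides_no'] / toy['population'] * 100_000

# Unweighted: plain mean of the row-level rates
unweighted = toy.groupby('year')['suicides/100k pop'].mean()

# Weighted: total suicides over total population, per 100k
sums = toy.groupby('year')[['suicides_no', 'population']].sum()
weighted = sums['suicides_no'] / sums['population'] * 100_000

print(pd.DataFrame({'unweighted': unweighted, 'weighted': weighted}))
```

In this toy example the two rates agree for 2001 but not for 2000, because the larger group in 2000 has a higher rate.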

In [None]:
#Grouping the data by country
gp_cnt_data = clean_data.groupby('country', as_index=False).sum(numeric_only=True)
top_ten = gp_cnt_data.nlargest(10, 'suicides_no').sort_values('suicides_no', ascending=False)

#Plotting the number of suicides according to the countries 
fig, ax = plt.subplots(figsize=(12,4))
sns.barplot(x='suicides_no', y='country', palette='husl', data=top_ten, ax=ax)
plt.xlabel('Suicides')
plt.ylabel('Country')
plt.title('Suicides according\nto the country', size=15)
ax.ticklabel_format(style='plain', axis='x')
plt.show()

- The countries that have the highest number of suicides from 1985 until 2016 are: 
    1. Russia 
    2. USA
    3. Japan

Let's check if we can determine which generation has the highest suicide rates.

In [None]:
#Grouping the data by generations
gp_gen_data = clean_data.groupby('generation', as_index=False).mean(numeric_only=True)

#Plotting the suicide rates according to the generations 
fig, ax = plt.subplots(figsize=(12,4))
sns.barplot(x='generation', y='suicides/100k pop', palette='husl', data=gp_gen_data, ax=ax, 
            order=['G.I. Generation', 'Silent', 'Boomers', 'Generation X', 'Millenials', 'Generation Z'])
plt.xlabel('Generation')
plt.ylabel('Suicides/100k')
plt.title('Suicide rates according\nto the generation', size=15)
plt.show()

- As we can see, the G.I. Generation, or the Greatest Generation (the generation that lived through WWII), has the highest suicide rate with almost 25 suicides per 100,000 people. This is a very big number compared to younger generations. It might be because this generation suffered a lot during WWII: many of them lost their loved ones and experienced traumatic events. The suicide rate decreases from one generation to the next, and Generation Z has the lowest rate with about 1 suicide per 100,000 people. Let's check the suicide rates according to the age categories.

In [None]:
#Grouping the data by age
gp_age_data = clean_data.groupby('age', as_index=False).mean(numeric_only=True)

#Plotting the suicide rates according to the age categories
fig, ax = plt.subplots(figsize=(12,4))
sns.barplot(x='age', y='suicides/100k pop', palette='husl', data=gp_age_data, ax=ax, 
           order=['5-14 years', '15-24 years', '25-34 years', '35-54 years', '55-74 years', '75+ years'])
plt.xlabel('Age categories')
plt.ylabel('Suicides/100k')
plt.title('Suicide rates according\nto the age categories', size=15)
plt.show()

- We can see that the suicide rate increases with age. This could be explained by the fact that important life changes that happen as we get older may cause feelings of uneasiness, stress, and sadness. However, it might also be because old people (75+ years) belong to the G.I. Generation, which already has the highest suicide rate. To explore this further, we need to look at the suicide rates within each age category for each generation separately. This will help us identify whether the elevated rates are due to the age factor or to the generation.
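
Before plotting, a quick way to see which age categories exist for each generation is a cross-tabulation. This is only a sketch on a few made-up rows (the `toy` frame is an assumption; on the real data the two columns would come from `clean_data`):

```python
import pandas as pd

# Toy rows: which (generation, age) pairs appear in the data
toy = pd.DataFrame({
    'generation': ['G.I. Generation', 'G.I. Generation', 'Silent',
                   'Silent', 'Boomers', 'Boomers'],
    'age': ['55-74 years', '75+ years', '35-54 years',
            '55-74 years', '25-34 years', '35-54 years'],
})

# Rows are generations, columns are age categories, cells are row counts;
# a 0 means that combination never appears
coverage = pd.crosstab(toy['generation'], toy['age'])
print(coverage)
```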

In [None]:
#Grouping our data by generation and age 
gp_gen_age_data = clean_data.groupby(['generation', 'age'], as_index=False).mean(numeric_only=True)

#Making a list containing all the gens 
gens = ['G.I. Generation', 'Silent', 'Boomers', 'Generation X', 'Millenials', 'Generation Z']

#Creating the axis of the plots
plt.figure(figsize=(12,18))
ax1 = plt.subplot2grid((6,1),(0,0))
ax2 = plt.subplot2grid((6,1),(1,0))
ax3 = plt.subplot2grid((6,1),(2,0))
ax4 = plt.subplot2grid((6,1),(3,0))
ax5 = plt.subplot2grid((6,1),(4,0))
ax6 = plt.subplot2grid((6,1),(5,0))

#Making a list containing all the axes
axes = [ax1, ax2, ax3, ax4, ax5, ax6]

#Making a for loop to plot the needed plots 
for gen, ax in zip(gens, axes):
    sns.barplot(x='age', y='suicides/100k pop', palette='husl', 
                data=gp_gen_age_data[gp_gen_age_data['generation'] == gen],
                ax=ax, order=['5-14 years', '15-24 years', '25-34 years', '35-54 years', 
                          '55-74 years', '75+ years'])
    ax.set_xlabel('Age categories')
    ax.set_ylabel('Suicides/100k')
    ax.set_title(gen, size=15)
plt.tight_layout()

- From the plot above we can see that, unfortunately, we don't have every age category for each generation. This is because the data collection started around 1985, which means that: 

    1. People who belong to the G.I. Generation were already 55+ years old in 1985, so we won't have any records for this generation in the younger age categories. 
    2. Likewise, we won't have any records for people aged 25-34 years (or younger) in the Silent Generation, because this generation was at least 35 years old in 1985. 
    3. The same goes for younger generations in the other direction: we have the young age categories, but since the data collection stopped around 2015 or so, we won't see any Boomers older than 75 or Millennials older than 35. 

Because of this, we can't tell whether the elevated suicide rates are due to the generation or to the age factor. Let's see if the economy of a country has any effect on suicide rates.

In [None]:
#Grouping the data by country
gp_cnt_data = clean_data.groupby('country', as_index=False).mean(numeric_only=True)

#Plotting the suicide rates according to the GDP per capita
fig, ax = plt.subplots(figsize=(12,4))
sns.scatterplot(x='gdp_per_capita ($)', y='suicides/100k pop', color=sns.husl_palette(6)[4], data=gp_cnt_data, ax=ax)
plt.xlabel('GDP per capita')
plt.ylabel('Suicides/100k')
plt.title('Suicide rates according\nto the GDP per capita', size=15)
plt.show()

- From the scatterplot above we can't really see a relation between the suicide rates and the GDP per capita. Let's present the data in a different way; maybe this will help us detect a pattern. We're going to split the data into bins, so that each GDP per capita category has its corresponding suicide rate.
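
As a small illustration of how `pd.cut` assigns values to bins, here is a sketch with a few made-up values and only two bins. By default the intervals are closed on the right, and values outside all bins become NaN:

```python
import pandas as pd

# Invented GDP per capita values, including one far outside the bins
values = pd.Series([500, 20000, 20001, 150000])
bins = [0, 20000, 40000]
labels = ['0-20,000', '20,000-40,000']

# Default right=True: the bins are (0, 20000] and (20000, 40000]
binned = pd.cut(values, bins=bins, labels=labels)
print(binned)
```

So 20000 falls in the first bin, 20001 in the second, and 150000 gets no bin at all.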

In [None]:
#Making bins and labels for the gdp_per_capita feature
bins = list(range(0, 160000, 20000))
labels = ['0-20,000', '20,000-40,000', '40,000-60,000', '60,000-80,000', '80,000-100,000', '100,000-120,000', '+120,000']
clean_data['gdp_per_capita_bins'] = pd.cut(clean_data['gdp_per_capita ($)'], bins=bins, labels=labels)

#Plotting the suicide rates according to the gdp per capita bins 
fig, ax = plt.subplots(figsize=(12,4))
sns.barplot(x='gdp_per_capita_bins', y='suicides/100k pop', palette='husl', data=clean_data, ax=ax)
plt.xlabel('GDP per capita')
plt.ylabel('Suicides/100k')
plt.title('Suicide rates according\nto the GDP per capita', size=15)
plt.show()

- Even after this transformation we can't detect a clear pattern in our data. All the GDP per capita categories give a suicide rate between 10 and 13 suicides per 100k people. If we had to say something based on this data, it would be that GDP per capita doesn't seem to have a clear effect on the suicide rates.
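
To complement the visual check, a correlation coefficient puts a single number on how strongly two columns move together. A minimal sketch with invented values (not taken from the dataset), using the Pearson correlation from pandas:

```python
import pandas as pd

# Invented country-level values, for illustration only
toy = pd.DataFrame({
    'gdp_per_capita ($)': [1000, 5000, 20000, 40000, 60000],
    'suicides/100k pop': [12, 10, 13, 11, 12],
})

# Pearson correlation: close to 0 means no clear linear relation
r = toy['gdp_per_capita ($)'].corr(toy['suicides/100k pop'])
print(round(r, 2))
```

On the real data, the same call could be applied to the two columns of `gp_cnt_data`.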


## Modelling and predictions: 


- In this part of the notebook, we will try to fit two different models to our data. First we are going to try the linear regression model, then the decision tree regressor. We will use the scikit-learn library for both algorithms and for data preprocessing too. We will start by selecting the independent features and storing them in a variable named X, and our dependent feature in a variable named y.

In [None]:
#Selecting the dependent and independent features
X = clean_data[['country', 'sex', 'population', 'age', 'gdp_per_capita ($)', 'generation']]
y = clean_data['suicides/100k pop']

- As we know, the linear regression algorithm doesn't work directly with categorical features. To deal with this problem we'll need to transform our categorical data into dummy variables. For this we will use the pandas function get_dummies.
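
To make the idea concrete, here is a minimal sketch of what `get_dummies` does on a tiny made-up frame. With `drop_first=True`, the first category of each text column is dropped, since it can be deduced from the others:

```python
import pandas as pd

# Tiny invented frame with one categorical and one numerical column
toy = pd.DataFrame({
    'sex': ['male', 'female', 'male'],
    'population': [100, 200, 300],
})

# 'sex' becomes a single indicator column; 'female' is the dropped baseline
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies)
```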

In [None]:
#Transforming the categorical variables to dummy variables
X = pd.get_dummies(X, drop_first=True)

- After dealing with categorical variables, we can move on to scaling our numerical features. StandardScaler standardises each feature to zero mean and unit variance, which puts the features on a comparable scale. Scaling can also speed up the calculations in some algorithms.

In [None]:
#Importing needed package for scaling
from sklearn.preprocessing import StandardScaler 

In [None]:
#Scaling our data 
sc = StandardScaler()
X[['population', 'gdp_per_capita ($)']] = sc.fit_transform(X[['population', 'gdp_per_capita ($)']])

In [None]:
#Importing needed package for splitting the dataset
from sklearn.model_selection import train_test_split

- Finally, we'll split our data into two sets: training and testing. The sizes will be 80% for the training data and 20% for testing data. 

In [None]:
#Splitting the dataset 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
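
A common refinement, not used in this notebook, is to split first and fit the scaler on the training rows only, so that the test set doesn't influence the scaling statistics. A minimal sketch on random toy data (the arrays and the `random_state` values are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random toy features and target, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y_toy = rng.normal(size=100)

# Split first (80% train, 20% test)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # mean and std come from the training rows only
X_te_scaled = scaler.transform(X_te)      # the test rows reuse those statistics
```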

- Now we can move to creating our models. 

### Linear Regression: 

- We'll start with the linear regression. First we'll import the needed tools from scikit-learn.

In [None]:
#Importing the Linear Regression algorithm 
from sklearn.linear_model import LinearRegression

- Afterwards, we'll need to initialize our model and fit it to the training set.

In [None]:
#Initializing our Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)

- Now we can make predictions using X_test.

In [None]:
#Predicting the test values
lr_y_pred = lr.predict(X_test)

- We saved the predictions in a variable named lr_y_pred. To compare them with the real values, we will plot both in the same figure. For a good model, the points should fall along the 45° line.

In [None]:
#Plotting the results
fig, ax = plt.subplots(figsize=(12,4))
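#Actual values go on x and predictions on y, matching the axis labels below
sns.scatterplot(x=y_test, y=lr_y_pred, ax=ax, color=sns.husl_palette(10)[0])
#Reference 45° line where predicted = actual
sns.lineplot(x=[0, 175], y=[0, 175], color=sns.husl_palette(10)[5], ax=ax)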
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.title('Prediction evaluation (Linear Regression)', size=15)
plt.show()

- Our model doesn't seem to be doing a good job. This might be because the features we selected aren't good enough, or because the relationship in our data isn't linear, in which case a linear model won't estimate the values well. Maybe the decision tree will perform better, let's check it out.
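
To put a number on this visual impression, we could compute error metrics such as the mean absolute error (MAE) and R². A minimal sketch with made-up values (on the real data, `y_test` and `lr_y_pred` would take the place of the toy arrays):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Invented actual and predicted values, for illustration only
y_true = np.array([5.0, 10.0, 20.0, 40.0])
y_pred = np.array([6.0, 9.0, 25.0, 30.0])

mae = mean_absolute_error(y_true, y_pred)  # average absolute error
r2 = r2_score(y_true, y_pred)              # 1.0 would be a perfect fit

print(mae, round(r2, 3))
```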

### Decision Tree Regressor: 

- We will follow exactly the same steps we used for the linear regression. First we'll import the tools from scikit-learn.

In [None]:
#Importing the Decision Tree algorithm 
from sklearn.tree import DecisionTreeRegressor

- Then we'll initialize the model and fit it to the train data.

In [None]:
#Initializing our Decision Tree
dt = DecisionTreeRegressor()
dt.fit(X_train, y_train)

- Now, we'll predict the test set results.

In [None]:
#Predicting the test values
dt_y_pred = dt.predict(X_test)

- Finally we will compare our predictions to the real values by plotting them on the same figure.

In [None]:
#Plotting the results
fig, ax = plt.subplots(figsize=(12,4))
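#Actual values go on x and predictions on y, matching the axis labels below
sns.scatterplot(x=y_test, y=dt_y_pred, ax=ax, color=sns.husl_palette(10)[0])
#Reference 45° line where predicted = actual
sns.lineplot(x=[0, 175], y=[0, 175], color=sns.husl_palette(10)[5], ax=ax)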
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.title('Prediction evaluation (Decision Tree)', size=15)
plt.show()

- The predictions are visibly much better than the ones the linear regression produced. Still, we can't say that the model produced good results. The decision tree needs further tuning in order to produce better results. Our data might also need more transformation, or we might need more features, in order to produce more accurate results.
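
As a possible starting point for that tuning, here is a minimal sketch of a grid search over two decision tree parameters. It runs on synthetic data, and the parameter grid is an assumption chosen for illustration, not a recommendation for this dataset:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for X_train / y_train
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hypothetical grid: limit the depth and the minimum leaf size to reduce overfitting
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_leaf': [1, 5, 20],
}

# 5-fold cross-validation, scored with the (negated) mean absolute error
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring='neg_mean_absolute_error',
)
search.fit(X_toy, y_toy)

print(search.best_params_)
```

The best parameters found this way depend on the data, so the search would need to be rerun on the real training set.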