This is a narrative notebook for my coursework in the INM430 Principles of Data Science module at City, University of London (2020-21 academic year). It contains the extra code and report documents for the Tiny Data Science project, submitted via Turnitin as of 20 December 2020. Provisional score: Distinction.

***Abstract***
This study examines the question of rising suicide rates among young women in the Republic of Korea. Several feature selection steps are considered during data processing. Focusing on the general relationship between the predictors (gender, age) and the suicide rate, socioeconomic attributes are added to identify which attributes are significant.
Alongside the weak correlation between the indicators (population and economic index) and the target value, the simplest regression model highlights the need for stronger features. According to the coefficient values, a richer dataset is important for improving the accuracy of the regression model.
The research could eventually help national leaders prevent a further acceleration in suicide rates.

***Keywords: suicide rates, regression prediction model***

**Please reach out to me on Unibuddy if you have questions about City's MSc Data Science programme!**
https://www.city.ac.uk/study/ask-a-student?unibuddy=buddies/students/5e21c2fcd16678055b9fac23

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import sklearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

## Part1. Data Loading & Manipulation

In [None]:
suicide_data = pd.read_csv('../input/suicide-rates-overview-1985-to-2016/master.csv')
suicide_data.head(3)

***First, rename the columns:***

In [None]:
# suicides/100k pop -> suicides_per_100k
# HDI for year -> hdi_for_year
# gdp_for_year ($) -> gdp_for_year
# gdp_per_capita ($) -> gdp_per_capita

cols_rename = {"suicides/100k pop": "suicides_per_100k", "HDI for year": "hdi_for_year",
               " gdp_for_year ($) ": "gdp_for_year", "gdp_per_capita ($)": "gdp_per_capita"}

suicide_data.rename(axis=1, mapper = cols_rename, inplace = True)

***process the data so it's easier to use for analysis***

In [None]:
# country-year just decomposes to year and country so ignore this
# drop country-year
suicide_data.drop(columns = ["country-year"], axis = 1, inplace = True)

# drop generation as that is an indicator of age
suicide_data.drop(columns = ["generation"], axis = 1, inplace = True)

***change the columns to be easier to access***

In [None]:
# let's encode the age bands with integers since these are ordinal
suicide_data.loc[:, "age"].unique()

# encode age bands
# (assign by column name: on recent pandas, assigning through .loc[:, "age"] can keep the object dtype)
age_encoding = {"5-14 years": 1, "15-24 years": 2, "25-34 years": 3, "35-54 years": 4, "55-74 years": 5, "75+ years": 6}
suicide_data["age"] = suicide_data["age"].map(age_encoding)
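
As a small illustration of the ordinal encoding above, the sketch below applies the same dictionary to a toy column. The toy values are invented for the example and are not taken from `master.csv`. Passing the dictionary to `Series.map` turns any band that is missing from the dictionary into `NaN`, so counting missing values afterwards is a cheap sanity check.

```python
import pandas as pd

# Toy age column (illustrative values only, not from master.csv)
toy = pd.DataFrame({"age": ["15-24 years", "75+ years", "5-14 years", "35-54 years"]})

age_encoding = {"5-14 years": 1, "15-24 years": 2, "25-34 years": 3,
                "35-54 years": 4, "55-74 years": 5, "75+ years": 6}

# Map each band to its rank; a band absent from the dictionary would become NaN
toy["age_code"] = toy["age"].map(age_encoding)

# Sanity check: count bands that were not found in the dictionary
n_unmapped = int(toy["age_code"].isna().sum())
print(toy)
print("unmapped bands:", n_unmapped)
```

The integer codes only preserve the order of the bands; the bands have different widths, so treating the codes as evenly spaced is a modelling assumption.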

In [None]:
# change gdp_for_year ($) to be numeric (strip thousands separators, then cast)
suicide_data["gdp_for_year"] = suicide_data["gdp_for_year"].str.replace(",", "").astype(float)

***the overview after processing***

In [None]:
suicide_data.head()

In [None]:
suicide_data.describe()

## Part2. Exploratory Data Analysis(EDA)

### Check whether similarly named columns are genuinely correlated

In [None]:
# suicides_no is probably correlated to suicides_per_100k 
suicide_data.plot.scatter(y = "suicides_per_100k", x = "suicides_no")

In [None]:
# have a look at the top 10 countries by total number of suicides
suicide_data.groupby("country")["suicides_no"].sum().sort_values(ascending=False).head(10)

In [None]:
# get names of the top 10 countries by summed suicide rate (not by suicide number)
top_10 = suicide_data.groupby("country")["suicides_per_100k"].sum().sort_values(ascending=False).head(10).index
top_10

In [None]:
top_10[0]

In [None]:
# let's have a look at the relationship between suicide number and suicide rate for the top 10 countries
for country in top_10:
    country_rows = suicide_data[suicide_data.country == country]
    corr = country_rows[["suicides_no", "suicides_per_100k"]].corr().iloc[0, 1]

    # plotting
    country_rows.plot.scatter(x="suicides_per_100k", y="suicides_no", title=f"{country} correlation: {corr:.2f}")

***mostly positive correlations -> focus on predicting the suicide rate rather than the suicide number, as the rate is less sensitive to population size***
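
To make the point about population size concrete, here is a minimal sketch. It assumes that `suicides_per_100k` is the count divided by the population and scaled to 100,000 people, which is what the column name suggests; the two groups and their numbers are invented.

```python
def rate_per_100k(count, population):
    """Deaths per 100,000 people (assumed definition of suicides_per_100k)."""
    return count * 100_000 / population

# Two hypothetical groups with the same count but very different population sizes
small_group = rate_per_100k(count=50, population=200_000)
large_group = rate_per_100k(count=50, population=5_000_000)

print(small_group)  # 25.0
print(large_group)  # 1.0
```

The counts are identical, yet the rates differ by a factor of 25, which is why the rate is the more comparable target across groups.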

In [None]:
#drop suicides_no column
suicide_data.drop(columns = ["suicides_no"], axis = 1, inplace = True)

In [None]:
# repeat the same between gdp per capita and gdp for year
for country in top_10:
    country_rows = suicide_data[suicide_data.country == country]
    corr = country_rows[["gdp_per_capita", "gdp_for_year"]].corr().iloc[0, 1]

    # plotting
    country_rows.plot.scatter(x="gdp_per_capita", y="gdp_for_year", title=f"{country} correlation: {corr:.2f}")

***very strong correlation -> drop gdp_for_year***

In [None]:
suicide_data.drop(columns = ["gdp_for_year"], axis = 1, inplace = True)

***are there columns with high missingness?***

In [None]:
suicide_data.apply(axis=0, func = lambda x: pd.isna(x).sum())

In [None]:
suicide_data[~pd.isna(suicide_data.hdi_for_year)].describe()

***HDI is missing across all years, ages and countries, so the missingness is not biased towards a particular demographic!***
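
One way to check this kind of claim more directly is to compute the share of missing values within each group. The sketch below does this on an invented toy frame; only the column names `year` and `hdi_for_year` come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; NaN marks a missing HDI entry
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2002, 2002],
    "hdi_for_year": [0.80, np.nan, np.nan, np.nan, 0.82, np.nan],
})

# Share of missing values per column
missing_share = toy.isna().mean()

# Share of missing HDI values within each year
missing_by_year = toy["hdi_for_year"].isna().groupby(toy["year"]).mean()

print(missing_share)
print(missing_by_year)
```

If the per-group shares are similar, the missingness is spread evenly over that grouping variable; a group with a share close to 1 would be a sign of bias.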

In [None]:
#Hence, best to drop HDI
suicide_data.drop(columns = ["hdi_for_year"], axis = 1, inplace=True)

### Secondly, let's have a look at some univariate relationships between each predictor and suicide rate across the years

In [None]:
# sex
sex_data = suicide_data.groupby(by = ["sex", "year"]).apply(func = lambda x: pd.Series({"total": x.suicides_per_100k.sum()}))
sex_data = sex_data.reset_index()

# plotting
fig, ax = plt.subplots(figsize=(10, 10))
ax = sns.barplot(ax = ax, y="year", x="total", hue = "sex", data=sex_data, orient="h").set_title("total suicide rate by sex and year")

***looks like males have a higher rate overall across the years, with an upward trend following a decreasing trend***
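
Note that the chart above plots the sum of `suicides_per_100k` over rows, which is not the same as a population-weighted rate. The sketch below contrasts the two on invented numbers. It assumes the count and population columns are both available, so in this notebook it would have to run before `suicides_no` is dropped.

```python
import pandas as pd

# Invented rows: two age bands for one sex and one year
toy = pd.DataFrame({
    "suicides_no": [20, 45],
    "population": [100_000, 900_000],
})
toy["suicides_per_100k"] = toy["suicides_no"] * 100_000 / toy["population"]

# What the notebook plots: the sum of the per-row rates
sum_of_rates = toy["suicides_per_100k"].sum()

# Population-weighted alternative: total deaths over total population
pooled_rate = toy["suicides_no"].sum() * 100_000 / toy["population"].sum()

print(sum_of_rates)  # 25.0
print(pooled_rate)   # 6.5
```

The summed rate also grows with the number of rows (countries and age bands) reporting in a given year, so trends in it should be read with that in mind.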

In [None]:
# how many males/females per year?
sex_data = suicide_data.groupby(by = ["sex", "year"]).apply(func = lambda x: pd.Series({"total": x.population.sum()}))
sex_data = sex_data.reset_index()

# plotting
fig, ax = plt.subplots(figsize=(10, 10))
ax = sns.barplot(ax =ax, y="year", x="total", hue = "sex", data=sex_data, orient="h").set_title("sample size by sex and year")

***about the same levels for both sexes each year, which suggests that suicide rates are dominated by males in general***

In [None]:
# now lets look at age
age_data = suicide_data.groupby(by = ["age", "year"]).apply(func = lambda x: pd.Series({"total": x.suicides_per_100k.sum()}))
age_data = age_data.reset_index()

# plotting
fig, ax = plt.subplots(figsize=(10, 10))
ax = sns.barplot(y="year", x="total", hue = "age", data=age_data, orient="h").set_title("total suicide rate by age and year")

***looks like age has an impact on suicide rate across years with older individuals more likely to commit suicide***

In [None]:
# again check the sample size to see if there is an unequal balance across age bands
age_data = suicide_data.groupby(by = ["age", "year"]).apply(func = lambda x: pd.Series({"total": x.population.sum()}))
age_data = age_data.reset_index()

# plotting
fig, ax = plt.subplots(figsize=(10, 15))
ax = sns.barplot(y="year", x="total", hue = "age", data=age_data, orient="h").set_title("sample size by age and year")

***again, it seems that the sample sizes are uniform across age bands***

In [None]:
# generally appears to be negative correlation between gdp_per_capita and suicides_per_100k

for year in suicide_data.year.unique():
    fig, ax = plt.subplots(figsize = (10, 5))
    year_data = suicide_data[suicide_data.year == year]
    sns.scatterplot(ax = ax, x = "gdp_per_capita", y = "suicides_per_100k", data = year_data).set_title(year)

### Lastly, have a look at the year indicator

In [None]:
country_year_data = suicide_data.groupby(by = ["country",  "year"]).apply(func = lambda x: pd.Series({"total": x.suicides_per_100k.sum()})).reset_index()

for country in country_year_data.country.unique():
    fig, ax = plt.subplots(figsize = (10, 5))
    sns.lineplot(ax = ax, x = "year", y = "total",data = country_year_data[country_year_data.country == country]).set_title(country)

***the majority of countries show a decreasing suicide rate over the years, while some countries, such as South Korea, show an increase***

In [None]:
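# any easily correlated variables?
# understandably gdp_per_capita has increased over the years
# also age is correlated with the suicide rate as previously seen
# numeric_only=True skips the text columns (country, sex), which recent pandas would otherwise reject
suicide_data.corr(numeric_only=True)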

## Part3. Design a prediction model
### Now let's try to predict the suicide rate given the predictors (type):
year (ordinal), age (ordinal), population (real), country (nominal; removed after filtering to South Korea only), gender (nominal), gdp per capita (real)

In [None]:
# filter to only South Korea
suicide_data = suicide_data[suicide_data.country == "Republic of Korea"]

In [None]:
# country is now constant, so drop it (despite its name, this frame holds only the Korean rows)
suicide_data_wo_korea = suicide_data.drop(columns = ["country"], axis = 1)

In [None]:
# columns for one hot encoding (gender: female or male)
y = suicide_data_wo_korea.loc[:, "suicides_per_100k"]
nominal_cols = suicide_data_wo_korea.dtypes[suicide_data_wo_korea.dtypes == object].reset_index().iloc[:, 0]

suicide_data_nominal = suicide_data_wo_korea.loc[:, nominal_cols] #only gender
suicide_data_quant = suicide_data_wo_korea.drop(columns = nominal_cols, axis = 1)

# remove target from suicide_data_quant
suicide_data_quant.drop(columns = ["suicides_per_100k"], axis = 1, inplace = True)

In [None]:
suicide_data_quant

In [None]:
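# one hot transformer
# (sparse_output replaces the older `sparse` argument in scikit-learn >= 1.2; use sparse=False on older versions)
enc = OneHotEncoder(sparse_output=False)
enc.fit(suicide_data_nominal)
enc_out = enc.transform(suicide_data_nominal) #first column is female, second column is male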

# column bind onto quant data
X = np.concatenate([suicide_data_quant, enc_out], axis = 1)
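
A side note on the encoding above: the female and male columns always sum to 1, so together with the intercept that `LinearRegression` fits by default they are perfectly collinear. The fit still runs, but the two gender coefficients are not uniquely determined. The sketch below illustrates this with NumPy on an invented toy column; dropping one of the two dummy columns is a common remedy, not necessarily the approach used in this coursework.

```python
import numpy as np

# Toy gender column (invented)
sex = np.array(["female", "male", "male", "female"])
female = (sex == "female").astype(float)
male = (sex == "male").astype(float)
intercept = np.ones_like(female)

# Intercept plus both dummy columns: three columns, but one is redundant
full = np.column_stack([intercept, female, male])
# Intercept plus a single dummy column: no redundancy
reduced = np.column_stack([intercept, male])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print(full.shape[1], rank_full)        # 3 columns, rank 2
print(reduced.shape[1], rank_reduced)  # 2 columns, rank 2
```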

In [None]:
enc.categories_

In [None]:
X

In [None]:
y

In [None]:
lm = LinearRegression()
lm.fit(X, y.values)

In [None]:
lm.coef_ # y = a1*x1 + a2*x2 + a3*x3 + ...
# a1 (year) has a positive relation with y; a2 (age), a3 (population) and a4 (gdp) are negative
# but very weak ... the last two coefficients are for gender (female shows a lower suicide rate than male)
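
When reading raw coefficients like these, it helps to remember that their size depends on the units of each feature. The sketch below uses synthetic data and a plain least-squares fit (not the notebook's model) to show that expressing the same feature in persons or in millions of persons changes the coefficient by a factor of one million while the fit itself is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.uniform(1e6, 5e6, size=50)  # synthetic population, in persons
y = 3.0 + 2e-6 * population                  # synthetic target, exactly linear

def fit_slope(x, y):
    """Least-squares slope of y on x with an intercept."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

slope_persons = fit_slope(population, y)         # about 2e-6 per person
slope_millions = fit_slope(population / 1e6, y)  # about 2.0 per million persons

print(slope_persons, slope_millions)
```

A very small coefficient on `population` or `gdp_per_capita` therefore does not by itself mean the feature is unimportant; standardising the features first makes the coefficients easier to compare.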

In [None]:
suicide_data_wo_korea

In [None]:
mean_absolute_error(y.values, lm.predict(X))
# the result means that, for a true value of 7, the prediction is on average 21
# therefore this is not a good model, which is why the master data needs to be combined with other datasets
# that supply more features for training the model.
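
The error above is measured on the same rows the model was fitted on. A minimal sketch of a stricter check is shown below: hold out part of the data and compare the model's mean absolute error with a baseline that always predicts the training median. The data here is synthetic and the fit uses NumPy least squares, so this illustrates the procedure only, not the notebook's results.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 3))  # synthetic features
y = 10.0 + X @ np.array([4.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=n)

# Random split: 150 rows for fitting, 50 held out for evaluation
idx = rng.permutation(n)
train, test = idx[:150], idx[150:]

# Fit a linear model with an intercept on the training rows only
design_train = np.column_stack([np.ones(len(train)), X[train]])
coef, *_ = np.linalg.lstsq(design_train, y[train], rcond=None)

# Predict the held-out rows
design_test = np.column_stack([np.ones(len(test)), X[test]])
pred = design_test @ coef

mae_model = np.mean(np.abs(y[test] - pred))
mae_baseline = np.mean(np.abs(y[test] - np.median(y[train])))

print("model MAE:", mae_model)
print("median-baseline MAE:", mae_baseline)
```

With scikit-learn, `train_test_split` and `DummyRegressor(strategy="median")` would express the same idea. A model is only useful if its held-out error is clearly below that of the baseline.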

In [None]:
np.median(y.values)

## Part4. Optimise the first model
For a more accurate regression model, we combined the original features with additional features from a new dataset [../input/world-happiness]. To represent mental health, alcohol consumption per capita from [1] was added. In terms of access to information, we would expect national internet/mobile network speed by country [2] to be a dominant predictor once it is merged into the analysis frame.


[1] Ritchie, H. and Roser, M., 2018. Alcohol Consumption. [online] Our World in Data. Available at: <https://ourworldindata.org/alcoholconsumption> [Accessed 19 December 2020]. <br>
[2] Speedtest Global Index. 2020. Speedtest Global Index – Monthly Comparisons Of Internet Speeds From Around The World. [online] Available at: <https://www.speedtest.net/global-index> [Accessed 19 December 2020].
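
Combining the sources described in this part comes down to joining tables on shared keys. The sketch below shows one possible way with `DataFrame.merge`; the toy values and the column name `alcohol_litres_per_capita` are hypothetical and do not reflect the actual files. The `indicator=True` option flags rows that found no match, which also helps catch countries that are spelled differently across sources.

```python
import pandas as pd

# Korea-only rows from the main table (invented values)
suicide_toy = pd.DataFrame({
    "country": ["Republic of Korea"] * 3,
    "year": [2013, 2014, 2015],
    "suicides_per_100k": [28.0, 27.0, 26.0],
})

# Hypothetical extra table; the column name is an assumption for illustration
alcohol_toy = pd.DataFrame({
    "country": ["Republic of Korea", "Republic of Korea"],
    "year": [2014, 2015],
    "alcohol_litres_per_capita": [9.0, 9.1],
})

# Left join keeps every row of the main table; _merge records the match status
merged = suicide_toy.merge(alcohol_toy, on=["country", "year"], how="left", indicator=True)
n_unmatched = int((merged["_merge"] == "left_only").sum())

print(merged)
print("rows without a match:", n_unmatched)
```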

## Part5. Findings and Future work
There are more relevant features now that the new dataset supplies meaningful attributes. In particular, it is interesting to observe the positive impact of GDP on suicide rates when we expand from the single case of South Korea to cover countries worldwide. (omitted)

The National Statistical Portal (Korean Statistical Information Service, KOSIS) has offered monthly suicide death statistics only for a short period, since January 2020, following the establishment of a new statistical system for suicide prevention. The monthly statistics are published on KOSIS two months after the deaths occur and include gender and age class. (omitted) Hence, as further work, this study could be improved by securing a common band of resources.