Continuing from the previous machine learning problem, we return to the pre-processed
Suicide Rates Overview 1985 to 2016 dataset. The goal is a machine learning
model that predicts the suicide rate, 'suicides/100k pop'.

In [1]:
import math
import pandas as pd 
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sea
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

## 1. [20 pts] Use your previous pre-processed dataset, keep the variables as one-hot encoded and develop a multiple linear regression model. Use your model to predict the target variable for the people with age 20, male, and generation X. What is the MAE error of this prediction? How many regression coefficients are there?

In [2]:
modelData = pd.read_csv('master.csv')
modelData

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers
...,...,...,...,...,...,...,...,...,...,...,...,...
27815,Uzbekistan,2014,female,35-54 years,107,3620833,2.96,Uzbekistan2014,0.675,63067077179,2309,Generation X
27816,Uzbekistan,2014,female,75+ years,9,348465,2.58,Uzbekistan2014,0.675,63067077179,2309,Silent
27817,Uzbekistan,2014,male,5-14 years,60,2762158,2.17,Uzbekistan2014,0.675,63067077179,2309,Generation Z
27818,Uzbekistan,2014,female,5-14 years,44,2631600,1.67,Uzbekistan2014,0.675,63067077179,2309,Generation Z


In [3]:
modelData = modelData.rename(columns=lambda x: x.strip())
meanHDI = modelData['HDI for year'].mean()
modelData['HDI'] = modelData['HDI for year'].fillna(value=meanHDI)
modelData['GDPyear'] = modelData['gdp_for_year ($)'].apply(lambda x: float(x.replace(",", "")))
modelData = modelData.drop(columns=['HDI for year', 'gdp_for_year ($)'])
modelData

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,gdp_per_capita ($),generation,HDI,GDPyear
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,796,Generation X,0.776601,2.156625e+09
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,796,Silent,0.776601,2.156625e+09
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,796,Generation X,0.776601,2.156625e+09
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,796,G.I. Generation,0.776601,2.156625e+09
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,796,Boomers,0.776601,2.156625e+09
...,...,...,...,...,...,...,...,...,...,...,...,...
27815,Uzbekistan,2014,female,35-54 years,107,3620833,2.96,Uzbekistan2014,2309,Generation X,0.675000,6.306708e+10
27816,Uzbekistan,2014,female,75+ years,9,348465,2.58,Uzbekistan2014,2309,Silent,0.675000,6.306708e+10
27817,Uzbekistan,2014,male,5-14 years,60,2762158,2.17,Uzbekistan2014,2309,Generation Z,0.675000,6.306708e+10
27818,Uzbekistan,2014,female,5-14 years,44,2631600,1.67,Uzbekistan2014,2309,Generation Z,0.675000,6.306708e+10


In [4]:
oneHotdata = modelData.copy()
cols = ['sex', 'age', 'generation']
for col in cols:
    oneHotdata = pd.concat([oneHotdata, pd.get_dummies(oneHotdata[col])], axis=1)
oneHotdata['suicides'] = oneHotdata['suicides/100k pop']
oneHotdata = oneHotdata.drop(columns=cols)
oneHotdata = oneHotdata.drop(columns=['country', 'suicides_no', 'country-year', 'suicides/100k pop'])
oneHotdata

Unnamed: 0,year,population,gdp_per_capita ($),HDI,GDPyear,female,male,15-24 years,25-34 years,35-54 years,5-14 years,55-74 years,75+ years,Boomers,G.I. Generation,Generation X,Generation Z,Millenials,Silent,suicides
0,1987,312900,796,0.776601,2.156625e+09,0,1,1,0,0,0,0,0,0,0,1,0,0,0,6.71
1,1987,308000,796,0.776601,2.156625e+09,0,1,0,0,1,0,0,0,0,0,0,0,0,1,5.19
2,1987,289700,796,0.776601,2.156625e+09,1,0,1,0,0,0,0,0,0,0,1,0,0,0,4.83
3,1987,21800,796,0.776601,2.156625e+09,0,1,0,0,0,0,0,1,0,1,0,0,0,0,4.59
4,1987,274300,796,0.776601,2.156625e+09,0,1,0,1,0,0,0,0,1,0,0,0,0,0,3.28
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27815,2014,3620833,2309,0.675000,6.306708e+10,1,0,0,0,1,0,0,0,0,0,1,0,0,0,2.96
27816,2014,348465,2309,0.675000,6.306708e+10,1,0,0,0,0,0,0,1,0,0,0,0,0,1,2.58
27817,2014,2762158,2309,0.675000,6.306708e+10,0,1,0,0,0,1,0,0,0,0,0,1,0,0,2.17
27818,2014,2631600,2309,0.675000,6.306708e+10,1,0,0,0,0,1,0,0,0,0,0,1,0,0,1.67


In [5]:
LR = LinearRegression()
X = oneHotdata.iloc[:,:19]
y = oneHotdata.iloc[:,19]
LR.fit(X,y)

LinearRegression()

In [6]:
target=oneHotdata[oneHotdata['15-24 years'] == 1]
target=target[target['male'] == 1]
target=target[target['Generation X'] == 1]
target

Unnamed: 0,year,population,gdp_per_capita ($),HDI,GDPyear,female,male,15-24 years,25-34 years,35-54 years,5-14 years,55-74 years,75+ years,Boomers,G.I. Generation,Generation X,Generation Z,Millenials,Silent,suicides
0,1987,312900,796,0.776601,2.156625e+09,0,1,1,0,0,0,0,0,0,0,1,0,0,0,6.71
13,1988,319200,769,0.776601,2.126000e+09,0,1,1,0,0,0,0,0,0,0,1,0,0,0,5.33
28,1989,323500,833,0.776601,2.335125e+09,0,1,1,0,0,0,0,0,0,0,1,0,0,0,3.71
37,1992,263700,251,0.776601,7.094526e+08,0,1,1,0,0,0,0,0,0,0,1,0,0,0,3.41
48,1993,243300,437,0.776601,1.228071e+09,0,1,1,0,0,0,0,0,0,0,1,0,0,0,7.40
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27632,1996,2231200,703,0.776601,1.394889e+10,0,1,1,0,0,0,0,0,0,0,1,0,0,0,11.43
27644,1997,2279700,724,0.776601,1.474460e+10,0,1,1,0,0,0,0,0,0,0,1,0,0,0,12.19
27656,1998,2334400,719,0.776601,1.498897e+10,0,1,1,0,0,0,0,0,0,0,1,0,0,0,11.39
27667,1999,2406752,801,0.776601,1.707847e+10,0,1,1,0,0,0,0,0,0,0,1,0,0,0,13.63


In [7]:
X_test = target.iloc[:,:19]
y_test = target.iloc[:,19]
predictions = LR.predict(X_test)
predictions

array([17.79075973, 17.67437042, 17.5566525 , ..., 16.91845259,
       16.81470025, 13.28253586])

In [8]:
MAE = (len(y_test)**-1) * np.sum(np.abs(predictions-y_test))
print(MAE, len(LR.coef_), np.mean(predictions))

9.575823828519754 19 17.076808619158736


### The MAE is 9.5758 and there are 19 regression coefficients (not counting the intercept).
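
As a cross-check on the hand-written MAE formula in cell 8, here is a minimal sketch comparing it with `sklearn.metrics.mean_absolute_error`. The true values are the first five `suicides` entries of the target group shown above; the predictions are hypothetical numbers chosen only for illustration, not output of the fitted model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# First five observed rates for the target group (from the table above).
y_true = np.array([6.71, 5.33, 3.71, 3.41, 7.40])
# Hypothetical predictions, made up for this illustration.
y_pred = np.array([8.0, 5.0, 4.0, 2.0, 9.0])

# Same quantity as (1/n) * sum(|prediction - truth|) used in the notebook.
manual_mae = np.mean(np.abs(y_pred - y_true))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Both expressions compute the same average of absolute errors, so either can be used. Note that in this notebook the target rows were also part of the training data, so the reported MAE is an in-sample figure.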

## 2. [30 pts] Now use the original sex, age and generation variables in numerical form and develop a new model. Use your model to predict the target value for the people with age 20, male, and generation X. What is the MAE error of this prediction? How many line coefficients are there? (Note that for this step you have to think of a way of encoding the original nominal age feature and generation feature into numerical features.)

In [9]:
numData = modelData.copy()
numData['suicides'] = numData['suicides/100k pop']
numData = numData.drop(columns=['country-year', 'suicides/100k pop', 'suicides_no'])
numData

Unnamed: 0,country,year,sex,age,population,gdp_per_capita ($),generation,HDI,GDPyear,suicides
0,Albania,1987,male,15-24 years,312900,796,Generation X,0.776601,2.156625e+09,6.71
1,Albania,1987,male,35-54 years,308000,796,Silent,0.776601,2.156625e+09,5.19
2,Albania,1987,female,15-24 years,289700,796,Generation X,0.776601,2.156625e+09,4.83
3,Albania,1987,male,75+ years,21800,796,G.I. Generation,0.776601,2.156625e+09,4.59
4,Albania,1987,male,25-34 years,274300,796,Boomers,0.776601,2.156625e+09,3.28
...,...,...,...,...,...,...,...,...,...,...
27815,Uzbekistan,2014,female,35-54 years,3620833,2309,Generation X,0.675000,6.306708e+10,2.96
27816,Uzbekistan,2014,female,75+ years,348465,2309,Silent,0.675000,6.306708e+10,2.58
27817,Uzbekistan,2014,male,5-14 years,2762158,2309,Generation Z,0.675000,6.306708e+10,2.17
27818,Uzbekistan,2014,female,5-14 years,2631600,2309,Generation Z,0.675000,6.306708e+10,1.67


In [10]:
le = LabelEncoder()
cols = ['country', 'sex', 'age','generation']
for col in cols:
    le.fit(numData[col])
    print(dict(zip(le.classes_, le.transform(le.classes_))))
    numData[col] = le.transform(numData[col])
numData

{'Albania': 0, 'Antigua and Barbuda': 1, 'Argentina': 2, 'Armenia': 3, 'Aruba': 4, 'Australia': 5, 'Austria': 6, 'Azerbaijan': 7, 'Bahamas': 8, 'Bahrain': 9, 'Barbados': 10, 'Belarus': 11, 'Belgium': 12, 'Belize': 13, 'Bosnia and Herzegovina': 14, 'Brazil': 15, 'Bulgaria': 16, 'Cabo Verde': 17, 'Canada': 18, 'Chile': 19, 'Colombia': 20, 'Costa Rica': 21, 'Croatia': 22, 'Cuba': 23, 'Cyprus': 24, 'Czech Republic': 25, 'Denmark': 26, 'Dominica': 27, 'Ecuador': 28, 'El Salvador': 29, 'Estonia': 30, 'Fiji': 31, 'Finland': 32, 'France': 33, 'Georgia': 34, 'Germany': 35, 'Greece': 36, 'Grenada': 37, 'Guatemala': 38, 'Guyana': 39, 'Hungary': 40, 'Iceland': 41, 'Ireland': 42, 'Israel': 43, 'Italy': 44, 'Jamaica': 45, 'Japan': 46, 'Kazakhstan': 47, 'Kiribati': 48, 'Kuwait': 49, 'Kyrgyzstan': 50, 'Latvia': 51, 'Lithuania': 52, 'Luxembourg': 53, 'Macau': 54, 'Maldives': 55, 'Malta': 56, 'Mauritius': 57, 'Mexico': 58, 'Mongolia': 59, 'Montenegro': 60, 'Netherlands': 61, 'New Zealand': 62, 'Nicaragu

Unnamed: 0,country,year,sex,age,population,gdp_per_capita ($),generation,HDI,GDPyear,suicides
0,0,1987,1,0,312900,796,2,0.776601,2.156625e+09,6.71
1,0,1987,1,2,308000,796,5,0.776601,2.156625e+09,5.19
2,0,1987,0,0,289700,796,2,0.776601,2.156625e+09,4.83
3,0,1987,1,5,21800,796,1,0.776601,2.156625e+09,4.59
4,0,1987,1,1,274300,796,0,0.776601,2.156625e+09,3.28
...,...,...,...,...,...,...,...,...,...,...
27815,100,2014,0,2,3620833,2309,2,0.675000,6.306708e+10,2.96
27816,100,2014,0,5,348465,2309,5,0.675000,6.306708e+10,2.58
27817,100,2014,1,3,2762158,2309,3,0.675000,6.306708e+10,2.17
27818,100,2014,0,3,2631600,2309,3,0.675000,6.306708e+10,1.67


In [11]:
X = numData.iloc[:,:9]
y = numData.iloc[:,9]
numLR = LinearRegression()
numLR.fit(X, y)

LinearRegression()

In [12]:
target = numData[numData['age'] ==0]
target = target[target['sex'] == 1]
target = target[target['generation'] == 2]
target

Unnamed: 0,country,year,sex,age,population,gdp_per_capita ($),generation,HDI,GDPyear,suicides
0,0,1987,1,0,312900,796,2,0.776601,2.156625e+09,6.71
13,0,1988,1,0,319200,769,2,0.776601,2.126000e+09,5.33
28,0,1989,1,0,323500,833,2,0.776601,2.335125e+09,3.71
37,0,1992,1,0,263700,251,2,0.776601,7.094526e+08,3.41
48,0,1993,1,0,243300,437,2,0.776601,1.228071e+09,7.40
...,...,...,...,...,...,...,...,...,...,...
27632,100,1996,1,0,2231200,703,2,0.776601,1.394889e+10,11.43
27644,100,1997,1,0,2279700,724,2,0.776601,1.474460e+10,12.19
27656,100,1998,1,0,2334400,719,2,0.776601,1.498897e+10,11.39
27667,100,1999,1,0,2406752,801,2,0.776601,1.707847e+10,13.63


In [13]:
X_test = target.iloc[:,:9]
y_test = target.iloc[:,9]
predictions = numLR.predict(X_test)
predictions

array([14.27019825, 14.2202757 , 14.1688619 , ..., 17.32430022,
       17.27635174, 13.69952954])

In [14]:
MAE = (len(y_test)**-1) * np.sum(np.abs(predictions-y_test))
print(MAE, len(numLR.coef_), np.mean(predictions))

8.957106978430698 9 15.4264996564108


### The MAE is 8.957 and there are 9 regression coefficients (not counting the intercept).
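
`LabelEncoder` assigns codes in alphabetical order, which is why '5-14 years' received label 3 above and sits between '35-54 years' and '55-74 years'. One possible alternative, sketched below under stated assumptions, is an explicit mapping that follows the natural order of the categories. The two ordering lists (`age_order`, `generation_order`) are assumptions introduced here, not part of the original solution, and the three-row frame is a toy sample.

```python
import pandas as pd

# Assumed orderings: youngest-to-oldest age bands, earliest-to-latest generations.
# Category spellings follow the dataset (including 'Millenials').
age_order = ['5-14 years', '15-24 years', '25-34 years',
             '35-54 years', '55-74 years', '75+ years']
generation_order = ['G.I. Generation', 'Silent', 'Boomers',
                    'Generation X', 'Millenials', 'Generation Z']

age_map = {label: code for code, label in enumerate(age_order)}
gen_map = {label: code for code, label in enumerate(generation_order)}

# Toy rows standing in for the real data.
sample = pd.DataFrame({
    'age': ['15-24 years', '75+ years', '5-14 years'],
    'generation': ['Generation X', 'G.I. Generation', 'Generation Z'],
})

encoded = sample.assign(
    age=sample['age'].map(age_map),
    generation=sample['generation'].map(gen_map),
)
print(encoded)
```

With an ordered mapping, a single linear coefficient on `age` or `generation` describes a trend across the categories rather than across their alphabetical positions. Whether this lowers the MAE on this dataset is not something this sketch establishes.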

## 3. [10 pts] Did you note any change in these two model performances?

The MAE improved slightly, from 9.58 to 8.96, but it is still large relative to the predicted rates. I think more preprocessing could reduce it further. The mean predicted rate for the target group also dropped from 17.08 to 15.43.

## 4. [10 pts] What is the prediction for age 33, male and generation Alpha (i.e. the generation after generation Z)?

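Like much of the class, I am a little confused here. What I will do is take the males in the age group that contains 33 (25-34 years, label 1) and set their generation to label 6, the first unused label after the six encoded generations (0 to 5).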

In [15]:
target = numData[numData['age'] ==1]
target = target[target['sex'] == 1]
target['generation'] = 6
target

Unnamed: 0,country,year,sex,age,population,gdp_per_capita ($),generation,HDI,GDPyear,suicides
4,0,1987,1,1,274300,796,6,0.776601,2.156625e+09,3.28
20,0,1988,1,1,279900,769,6,0.776601,2.126000e+09,1.79
25,0,1989,1,1,283600,833,6,0.776601,2.335125e+09,6.35
39,0,1992,1,1,245500,251,6,0.776601,7.094526e+08,2.85
51,0,1993,1,1,230100,437,6,0.776601,1.228071e+09,3.91
...,...,...,...,...,...,...,...,...,...,...
27761,100,2010,1,1,2375259,1533,6,0.655000,3.933277e+10,9.89
27773,100,2011,1,1,2457050,1767,6,0.661000,4.591519e+10,11.03
27786,100,2012,1,1,2548472,1964,6,0.668000,5.182157e+10,11.38
27797,100,2013,1,1,2644648,2150,6,0.672000,5.769045e+10,12.40


In [16]:
X_test = target.iloc[:,:9]
y_test = target.iloc[:,9]
predictions = numLR.predict(X_test)
np.mean(predictions)

13.701537532584128

### The mean prediction is 13.70 suicides per 100k population.
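
Setting `generation` to 6 asks the linear model to extrapolate beyond the codes it was trained on (0 to 5). The sketch below uses a synthetic, exactly linear toy target (an assumption made purely for illustration) to show what that means: the prediction for the unseen code is the intercept plus the coefficient times 6.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Generation codes 0..5, as seen during training.
codes = np.arange(6).reshape(-1, 1)
# Synthetic target that is exactly linear in the code (not real suicide rates).
rates = 2.0 * codes.ravel() + 1.0

model = LinearRegression().fit(codes, rates)

# Prediction for the unseen code 6 (standing in for "generation Alpha").
alpha_pred = model.predict(np.array([[6]]))[0]
# The same number, computed by hand from the fitted line.
manual = model.intercept_ + model.coef_[0] * 6

print(alpha_pred, manual)
```

Because the label-encoded generations are in alphabetical rather than chronological order, the value the real model returns for label 6 reflects that arbitrary ordering and should be read with caution.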

## 5. [10 pts] Give one advantage when using regression (as opposed to classification with nominal features) in terms of input data features.

When using regression (as opposed to classification with nominal features), the input features can be nominal or numerical. Questions 1 and 2 showed two ways to incorporate nominal features: one-hot encoding them or converting them into numerical labels. The reverse conversion, from numerical to nominal, may not be possible for classification, since floating-point values can be hard to categorize.

## 6. [10 pts] Give one advantage when using regular numerical values rather than one-hot encoding for regression.

One main advantage of numerical label encoding is dimensionality reduction. Problem 1 had 19 coefficients, whereas problem 2 had only 9, which can reduce training time. Also, the model treats one-hot encoded columns as independent of one another when in reality they depend on each other.
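
The difference in feature count, and the dependence among one-hot columns, can be seen on a toy frame. This is a minimal sketch on four made-up rows, not the assignment data; the `drop_first=True` variant is shown only as one common way to remove the redundant column per feature and was not used in the solution above.

```python
import pandas as pd

# Toy rows with two nominal features.
sample = pd.DataFrame({
    'sex': ['male', 'female', 'male', 'female'],
    'age': ['15-24 years', '35-54 years', '75+ years', '25-34 years'],
})

# One-hot: one column per category (2 for sex + 4 for age).
one_hot = pd.get_dummies(sample, columns=['sex', 'age'])

# Integer codes: one column per original feature.
label = sample.apply(lambda col: col.astype('category').cat.codes)

# Dropping the first category of each feature removes the redundant column.
reduced = pd.get_dummies(sample, columns=['sex', 'age'], drop_first=True)

# Within one feature the one-hot columns always sum to 1 per row,
# so they are linearly dependent rather than independent.
sex_row_sums = one_hot.filter(like='sex_').sum(axis=1)

print(one_hot.shape[1], label.shape[1], reduced.shape[1])
```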

## 7. [10 pts] Now that you developed both a classifier and a regression model for the problem in this assignment, which method do you suggest to your machine learning model customer? Classifier or regression? Why?

I would suggest regression to a customer. The value we are trying to predict is either the number of suicides per 100k population or simply the number of suicides a group could have. In classification, my dependent variable indicated whether a group was 'at risk'. That term was not well defined, and one fewer suicide could move a group from at risk to not at risk. With regression, we predict a number rather than a class, which in my mind is more important.