## Purpose
The goal of this assignment is to practice building regression models that generate predictions from specific features of the dataset.

## Pre-processing

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
from IPython.display import display
import numpy as np
import pandas as pd
import seaborn as sns; sns.set(style="ticks", color_codes=True)

# Load data file and print data head to check proper loading
df = pd.read_csv('Datasets/master.csv')
df.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


In [2]:
# Renaming the columns to provide simpler category names
df.columns = ['CNTRY', 'YR', 'SEX', 'AGE', 'SUIC_NUM', 
              'POP', 'SUIC_PER100K_POP', 'CNTRY_YR',
              'HDI', 'GDP_FOR_YR', 'GDP_PER_CAP', 'GEN']

In [3]:
# 'HDI' is removed due to its large quantity of missing values
# Country-level columns (country, year, GDP) are removed due to redundancy
df = df.drop(columns=['CNTRY', 'YR', 'CNTRY_YR', 'HDI',
                      'GDP_FOR_YR', 'GDP_PER_CAP'])

In [4]:
# Check to make sure feature removal was performed correctly
df.head()

Unnamed: 0,SEX,AGE,SUIC_NUM,POP,SUIC_PER100K_POP,GEN
0,male,15-24 years,21,312900,6.71,Generation X
1,male,35-54 years,16,308000,5.19,Silent
2,female,15-24 years,14,289700,4.83,Generation X
3,male,75+ years,1,21800,4.59,G.I. Generation
4,male,25-34 years,9,274300,3.28,Boomers


In [5]:
# Check for duplicates; this adds a temporary marker column to the dataset
df["is_duplicate"]= df.duplicated()

print(f"Training data size = {len(df)}")
print(f"Duplicates in TRAINING data = {len(df[df['is_duplicate']==True])}")

Training data size = 27820
Duplicates in TRAINING data = 204


In [6]:
# Drop the duplicate rows using index
index_to_drop = df[df['is_duplicate']==True].index
df.drop(index_to_drop, inplace=True)

# Remove the duplicate marker column
df.drop(columns='is_duplicate', inplace=True)
print(f'Training count = {len(df)}')

Training count = 27616
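
The two cells above mark duplicates in a helper column and then drop them by index. As a point of comparison, here is a minimal sketch of the same clean-up on a toy frame using `DataFrame.drop_duplicates`, which keeps the first occurrence of each repeated row just as `duplicated()` flags every later one. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame (made-up values) with one exact repeat of the first row.
toy = pd.DataFrame({
    "SEX": ["male", "male", "female", "male"],
    "SUIC_NUM": [21, 21, 14, 9],
})

# duplicated() flags every repeat after the first occurrence.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() removes exactly those flagged rows in one step.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Rows before = {len(toy)}, duplicates = {n_dupes}, rows after = {len(deduped)}")
```

This route needs no marker column, so there is nothing to remove afterwards.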


In [7]:
# Do we have NaN in our dataset?
df.isnull().any()

SEX                 False
AGE                 False
SUIC_NUM            False
POP                 False
SUIC_PER100K_POP    False
GEN                 False
dtype: bool

In [8]:
# Check the unique levels of each categorical column and the markers used
for col in df.columns:
    if df[col].dtype == object:
        print(col, df[col].unique())

SEX ['male' 'female']
AGE ['15-24 years' '35-54 years' '75+ years' '25-34 years' '55-74 years'
 '5-14 years']
GEN ['Generation X' 'Silent' 'G.I. Generation' 'Boomers' 'Millenials'
 'Generation Z']


In [9]:
# One-hot encoder to convert a nominal feature into indicator columns
# named '<feature> - <level>' (e.g. 'SEX - male')
def encode_onehot(_df, _f):
    _df2 = pd.get_dummies(_df[_f], prefix=_f, prefix_sep=' - ')
    _df3 = pd.concat([_df, _df2], axis=1)
    _df3 = _df3.drop([_f], axis=1)
    return _df3

In [10]:
from sklearn import preprocessing 

# One hot encode nominal features
df_o = encode_onehot(df, 'SEX')
df_o = encode_onehot(df_o, 'AGE')
df_o = encode_onehot(df_o, 'GEN')

In [11]:
# Check that one hot encoding was performed correctly
df_o.head()

Unnamed: 0,SUIC_NUM,POP,SUIC_PER100K_POP,SEX - female,SEX - male,AGE - 15-24 years,AGE - 25-34 years,AGE - 35-54 years,AGE - 5-14 years,AGE - 55-74 years,AGE - 75+ years,GEN - Boomers,GEN - G.I. Generation,GEN - Generation X,GEN - Generation Z,GEN - Millenials,GEN - Silent
0,21,312900,6.71,False,True,True,False,False,False,False,False,False,False,True,False,False,False
1,16,308000,5.19,False,True,False,False,True,False,False,False,False,False,False,False,False,True
2,14,289700,4.83,True,False,True,False,False,False,False,False,False,False,True,False,False,False
3,1,21800,4.59,False,True,False,False,False,False,False,True,False,True,False,False,False,False
4,9,274300,3.28,False,True,False,True,False,False,False,False,True,False,False,False,False,False


### Develop a linear regression model for the target of age 20, male, and Generation X

In [12]:
# Prepare the input X matrix and target y vector
X_q1 = df_o.loc[:, df_o.columns != 'SUIC_PER100K_POP'].values
y_q1 = df_o.loc[ :, df_o.columns == 'SUIC_PER100K_POP'].values.ravel()

In [13]:
# Obtain the order of feature column names to create a target prediction unit
print(df_o.iloc[0])

SUIC_NUM                     21
POP                      312900
SUIC_PER100K_POP           6.71
SEX - female              False
SEX - male                 True
AGE - 15-24 years          True
AGE - 25-34 years         False
AGE - 35-54 years         False
AGE - 5-14 years          False
AGE - 55-74 years         False
AGE - 75+ years           False
GEN - Boomers             False
GEN - G.I. Generation     False
GEN - Generation X         True
GEN - Generation Z        False
GEN - Millenials          False
GEN - Silent              False
Name: 0, dtype: object


In [14]:
# Compute the means of the two numerical features; they fill those slots in the target prediction unit
mean_suic_num = df_o["SUIC_NUM"].mean()
mean_pop = df_o["POP"].mean()

In [15]:
# Generate the prediction unit 
# Age 20, male, Generation X
predict_q1 = [[mean_suic_num,
               mean_pop,
               False,
               True, # Male
               True, # 15-24 years
               False,
               False,
               False,
               False,
               False,
               False,
               False,
               True, # Generation X
               False,
               False,
               False]]
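
Because the prediction unit is a positional list, a misplaced `True` would silently select the wrong category. The sketch below builds the same row by column name instead. The helper `make_unit` is hypothetical, the feature order is copied from the `df_o.iloc[0]` printout with the target column removed, and the two means are placeholders standing in for `mean_suic_num` and `mean_pop`.

```python
# Feature order copied from the df_o.iloc[0] printout, minus the target column.
FEATURES = [
    "SUIC_NUM", "POP",
    "SEX - female", "SEX - male",
    "AGE - 15-24 years", "AGE - 25-34 years", "AGE - 35-54 years",
    "AGE - 5-14 years", "AGE - 55-74 years", "AGE - 75+ years",
    "GEN - Boomers", "GEN - G.I. Generation", "GEN - Generation X",
    "GEN - Generation Z", "GEN - Millenials", "GEN - Silent",
]


def make_unit(numeric, active):
    """Hypothetical helper: return a 1 x 16 row laid out in FEATURES order."""
    unknown = (set(numeric) | set(active)) - set(FEATURES)
    if unknown:
        raise KeyError(f"unknown feature(s): {sorted(unknown)}")
    row = {name: False for name in FEATURES}  # every indicator off by default
    row.update(numeric)                       # fill the numerical slots
    for name in active:                       # switch on the chosen levels
        row[name] = True
    return [[row[name] for name in FEATURES]]


# Placeholder means (assumed values, not taken from the dataset).
unit = make_unit(
    numeric={"SUIC_NUM": 0.0, "POP": 0.0},
    active=["SEX - male", "AGE - 15-24 years", "GEN - Generation X"],
)
print(unit)
```

A misspelled column name raises a `KeyError` here, whereas a positional list would accept the mistake without complaint.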

In [16]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

# Linear Regression training with all values in the dataset
lin_reg_q1 = LinearRegression()
lin_reg_q1.fit(X_q1, y_q1)

In [17]:
# Generate the sub-population of age 20, male, Generation X
# (rows in the 15-24 years bracket). One boolean mask selects the same
# rows for both the features and the target.
mask_q1 = (df_o["SEX - male"] &
           df_o["GEN - Generation X"] &
           df_o["AGE - 15-24 years"])
X_test = df_o.loc[mask_q1, df_o.columns != 'SUIC_PER100K_POP'].values
y_test = df_o.loc[mask_q1, 'SUIC_PER100K_POP'].values

In [18]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Predict the target sub-population and obtain metrics for prediction
y_pred = lin_reg_q1.predict(X_test)
mae_q1 = mean_absolute_error(y_test, y_pred) 

In [19]:
from sklearn.metrics import r2_score

# Predict the value for the singular target prediction unit
predict_q1 = lin_reg_q1.predict(predict_q1)
print(f'Mean Absolute Error: {mae_q1:.3f} \nPredicted Value = {predict_q1}')

Mean Absolute Error: 8.512 
Predicted Value = [16.51877515]
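
For reference, the MAE reported above is the average absolute gap between observed and predicted rates. Since the model was fitted on the full dataset and the sub-population is drawn from it, this is an in-sample error. The sketch below computes the same quantity by hand on toy numbers: the function name `mean_absolute_error_manual` is hypothetical, the "true" values are the first three rates in the data preview, and the "predicted" values are made up.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |observed - predicted| over paired values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


toy_true = [6.71, 5.19, 4.83]   # first three SUIC_PER100K_POP values in the preview
toy_pred = [7.71, 4.19, 4.83]   # made-up predictions for illustration

toy_mae = mean_absolute_error_manual(toy_true, toy_pred)
print(f"Toy MAE: {toy_mae:.3f}")
```

The absolute errors here are 1, 1, and 0, so the toy MAE comes to two thirds of a unit.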


#### Regression coefficients 

There are 16 regression coefficients: 14 for the one-hot encoded features and two for the numerical features.

In [20]:
# Print the coefficients in the linear regression model
lin_reg_q1.coef_

array([ 7.16064593e-03, -8.31623301e-07, -6.48579116e+00,  6.48579116e+00,
       -2.54806174e+00, -3.34527997e-01,  9.82803695e-01, -9.44131932e+00,
        1.89687352e+00,  9.44423184e+00,  3.95694604e-02,  3.36912811e+00,
       -4.79156915e-01, -1.43060317e+00, -1.50981851e+00,  1.08810269e-02])
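
The raw array is easier to read once each coefficient is paired with its feature. The sketch below does that with the values printed above, assuming the column order of `X_q1` matches the `df_o.iloc[0]` printout with the target removed. It also sums the coefficients within each one-hot group; each sum comes out at approximately zero, which suggests the indicator coefficients are best read relative to one another within a group rather than in isolation.

```python
feature_names = [
    "SUIC_NUM", "POP",
    "SEX - female", "SEX - male",
    "AGE - 15-24 years", "AGE - 25-34 years", "AGE - 35-54 years",
    "AGE - 5-14 years", "AGE - 55-74 years", "AGE - 75+ years",
    "GEN - Boomers", "GEN - G.I. Generation", "GEN - Generation X",
    "GEN - Generation Z", "GEN - Millenials", "GEN - Silent",
]

# Values copied from the lin_reg_q1.coef_ output.
coefficients = [
    7.16064593e-03, -8.31623301e-07, -6.48579116e+00, 6.48579116e+00,
    -2.54806174e+00, -3.34527997e-01, 9.82803695e-01, -9.44131932e+00,
    1.89687352e+00, 9.44423184e+00, 3.95694604e-02, 3.36912811e+00,
    -4.79156915e-01, -1.43060317e+00, -1.50981851e+00, 1.08810269e-02,
]

named = dict(zip(feature_names, coefficients))
for name, coef in named.items():
    print(f"{name:<24}{coef:>14.6f}")

# Sum of coefficients within each one-hot group.
group_sums = {
    prefix: sum(c for n, c in named.items() if n.startswith(prefix))
    for prefix in ("SEX - ", "AGE - ", "GEN - ")
}
print(group_sums)
```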

### Develop a new model but convert categorical values into numerical values

In [21]:
# Generate a deep copy of the pre-processed data and check that it was properly copied
df_q2 = df.copy(deep=True)
df_q2.head()

Unnamed: 0,SEX,AGE,SUIC_NUM,POP,SUIC_PER100K_POP,GEN
0,male,15-24 years,21,312900,6.71,Generation X
1,male,35-54 years,16,308000,5.19,Silent
2,female,15-24 years,14,289700,4.83,Generation X
3,male,75+ years,1,21800,4.59,G.I. Generation
4,male,25-34 years,9,274300,3.28,Boomers


In [22]:
# Convert the categorical values of the dataset into numerical values
# Age and generation values are the average of the years in each category's range
# The result is assigned back to the column rather than replaced in place
age_map = {'15-24 years': 19.5, '35-54 years': 44.5, '75+ years': 87.5,
           '25-34 years': 29.5, '55-74 years': 64.5, '5-14 years': 9.5}
gen_map = {'Generation X': 1972.5, 'Silent': 1936.5, 'G.I. Generation': 1924.5,
           'Boomers': 1955, 'Millenials': 1988.5, 'Generation Z': 2004.5}
sex_map = {'male': 0, 'female': 1}

df_q2['AGE'] = df_q2['AGE'].map(age_map)
df_q2['GEN'] = df_q2['GEN'].map(gen_map)
df_q2['SEX'] = df_q2['SEX'].map(sex_map)
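
The age values used above can be reproduced from the category labels themselves. The sketch below takes the midpoint of each bracket; the helper `midpoint` is hypothetical, and the upper bound of 100 for the open-ended `75+ years` bracket is an assumption inferred from the value 87.5 used in the notebook.

```python
def midpoint(label, open_upper=100):
    """Midpoint of an age label such as '15-24 years' or '75+ years'.

    open_upper is the assumed upper bound for open-ended brackets.
    """
    span = label.replace(" years", "")
    if span.endswith("+"):
        low, high = int(span[:-1]), open_upper
    else:
        low, high = (int(part) for part in span.split("-"))
    return (low + high) / 2


age_labels = ["15-24 years", "35-54 years", "75+ years",
              "25-34 years", "55-74 years", "5-14 years"]
age_midpoints = {label: midpoint(label) for label in age_labels}
print(age_midpoints)
```

Changing `open_upper` shows how sensitive the `75+ years` encoding is to that choice, which matters because the model treats the encoded age as a continuous quantity.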

In [23]:
# Check that numerical conversions were properly done
df_q2.head()

Unnamed: 0,SEX,AGE,SUIC_NUM,POP,SUIC_PER100K_POP,GEN
0,0,19.5,21,312900,6.71,1972.5
1,0,44.5,16,308000,5.19,1936.5
2,1,19.5,14,289700,4.83,1972.5
3,0,87.5,1,21800,4.59,1924.5
4,0,29.5,9,274300,3.28,1955.0


In [24]:
# Generate the prediction unit 
# Age 20, male, Generation X

predict_q2 = [[0, 20, mean_suic_num, mean_pop, 1972.5]]

In [25]:
# Prepare the input X matrix and target y vector
X_q2 = df_q2.loc[:, df_q2.columns != 'SUIC_PER100K_POP'].values
y_q2 = df_q2.loc[ :, df_q2.columns == 'SUIC_PER100K_POP'].values.ravel()

In [26]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Linear Regression
lin_reg_q2 = LinearRegression()
lin_reg_q2.fit(X_q2, y_q2)

In [27]:
# Generate the sub-population of age 20, male, Generation X
# (rows in the 15-24 years bracket, encoded as 19.5). One boolean mask
# selects the same rows for both the features and the target.
mask_q2 = ((df_q2["SEX"] == 0) &
           (df_q2["GEN"] == 1972.5) &
           (df_q2["AGE"] == 19.5))
X_test = df_q2.loc[mask_q2, df_q2.columns != 'SUIC_PER100K_POP'].values
y_test = df_q2.loc[mask_q2, 'SUIC_PER100K_POP'].values

In [28]:
# Predict the target sub-population and obtain metrics for prediction
y_pred = lin_reg_q2.predict(X_test)
mae_q2 = mean_absolute_error(y_test, y_pred) 

In [29]:
from sklearn.metrics import r2_score

# Predict the value for the singular target prediction unit
predict_q2 = lin_reg_q2.predict(predict_q2)
print(f'Mean Absolute Error: {mae_q2:.3f} \nPredicted Value = {predict_q2}')

Mean Absolute Error: 7.763 
Predicted Value = [14.44021591]


#### Regression coefficients 

There are 5 regression coefficients, one for each of the five features.

In [30]:
# Print the coefficients in the linear regression model
lin_reg_q2.coef_

array([-1.29359666e+01,  2.00446071e-01,  7.28783483e-03, -8.58614104e-07,
       -3.77206363e-02])

### What are the differences between the model with categorical values vs. the model with numerical values?

The MAE of the one-hot encoded regression was 8.512, compared with 7.763 for the numerical version, a gap of 0.749 or approximately 9.2% relative to the average of the two. Both errors were measured on the same male, 15-24 years, Generation X sub-population, so the numerical model fits that group somewhat more closely. If a gap of similar relative size carried over to individual predictions, the two models would disagree by roughly 9.2%. For suicides per 100,000 people, this could be as much as 20 units at the top of the range (the maximum value in the dataset is 224.97). Against a mean value of 12.9, a discrepancy on that scale can become problematic when predicting values.
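
The 9.2% figure depends on which baseline the gap is divided by. The arithmetic below, using only the two MAE values reported earlier, shows the symmetric percent difference (relative to the average of the two) alongside the gap relative to each model on its own. The variable names are illustrative.

```python
mae_onehot = 8.512    # MAE of the one-hot encoded model
mae_numeric = 7.763   # MAE of the numerical model

abs_gap = mae_onehot - mae_numeric

# Symmetric percent difference: gap relative to the average of the two MAEs.
pct_symmetric = abs_gap / ((mae_onehot + mae_numeric) / 2) * 100

# Gap relative to each model's own MAE.
pct_vs_onehot = abs_gap / mae_onehot * 100
pct_vs_numeric = abs_gap / mae_numeric * 100

print(f"Absolute gap:          {abs_gap:.3f}")
print(f"Symmetric difference:  {pct_symmetric:.2f}%")
print(f"Relative to one-hot:   {pct_vs_onehot:.2f}%")
print(f"Relative to numerical: {pct_vs_numeric:.2f}%")
```

The symmetric form gives 9.20%, while the one-sided forms give about 8.80% and 9.65%, so the baseline should be stated whenever the percentage is quoted.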

### Identify prediction for specified target

In [31]:
# Generate the prediction unit 
# Age 33, male, generation Alpha
predict_onehot_q4 = [[mean_suic_num,
               mean_pop,
               False,
               True, # Male
               False,
               True, # Age 25-34 years
               False,
               False,
               False,
               False,
               False,
               False,
               False,
               False,
               False,
               False]]

In [32]:
predict_onehot_q4 = lin_reg_q1.predict(predict_onehot_q4)
print(f'Predicted Value - One Hot Encoding = {predict_onehot_q4}')

Predicted Value - One Hot Encoding = [19.21146581]


In [33]:
# Generate the prediction unit 
# Age 33, male, generation Alpha
predict_numerical_q4 = [[0, 33, mean_suic_num, mean_pop, 2017.5]]

predict_numerical_q4 = lin_reg_q2.predict(predict_numerical_q4)
print(f'Predicted Value - Numerical = {predict_numerical_q4}')

Predicted Value - Numerical = [15.34858621]


### Discussion

One advantage of using regression is that it provides an equation. The equation has a coefficient for each independent variable, which makes the model easier to interpret. It also lets the user see the range of values and how each independent variable affects the prediction. In comparison, classification with nominal features is not always easy to interpret, depending on the type of classifier used. For example, a neural network is harder to visualize than a linear equation, and the effect of each independent variable on the classification may be more nuanced than a set of raw numbers.

One advantage of using plain numerical values is that it reduces the number of features used by the model (5 here, compared with 16 for the one-hot encoded version). As noted in a previous lesson, the curse of dimensionality can occur when there are far too many features relative to the number of measurements. This can result in overfitting and reduce the ability of the model to find meaningful relationships between variables. The reduction in features also means the model uses less memory and less computational power.

If my customer were looking to solve the binary problem of categorizing a person or country as having a high or low suicide rate, I would suggest using a regression model. One reason is that a regression model provides a numerical value, to which the user can apply a threshold of their own choosing. This threshold can be altered to fit the user's needs when deciding what counts as a high or low rate. A second reason I prefer the regression model is its scalability. Having an equation that can be fitted to various population sizes and groupings gives the user access to a wide range of information. A third reason is that the regression model can be interpreted through its equation, which can help bridge the gap between the model and the end user. Seeing how the regression model works through its equation may help the end user apply it.
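
A minimal sketch of the thresholding idea follows, using the two predicted rates obtained earlier for the age 33, male target. The function `label_rate` is hypothetical, and the cutoff of 12.9 (the dataset mean quoted above) is only an illustrative choice that the user would be free to change.

```python
def label_rate(predicted_rate, threshold):
    """Label a predicted suicide rate per 100k as 'high' or 'low'."""
    return "high" if predicted_rate >= threshold else "low"


# Predicted rates reported earlier for the age 33, male target.
predictions = {"one-hot": 19.21146581, "numerical": 15.34858621}

# Hypothetical cutoff: the dataset mean quoted in the comparison section.
threshold = 12.9

labels = {model: label_rate(rate, threshold) for model, rate in predictions.items()}
print(labels)
```

Raising the cutoff to a value between the two predictions would split the labels, which illustrates how the choice of threshold interacts with the choice of model.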