## Use your previous pre-processed dataset, keep the variables as one-hot encoded and develop a multiple linear regression model. Use your model to predict the target variable for the people with age 20, male, and generation X. What is the MAE error of this prediction? How many regression coefficients are there?

In [1]:
%%time

'''
Data exploration was not done to a great extent for this assignment, since this was a dataset 
previously explored and processed from module three.
'''

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from tqdm import tqdm

##get current working directory
cwd = os.getcwd()

##build the data path in an OS-independent way; it is opened as a pandas dataframe below
datapath = os.path.join(cwd, 'data', 'master.csv')

##get dataframe
data = pd.read_csv(datapath)

##check contents, dtypes, and description:
print('\nHead:\n\n ', data.head())
print('\n\nDatatypes:\n\n ', data.dtypes)
print('\nData Description: \n\n', data.describe())

##get feature names
features = list(data)

##picked a specific year to narrow down the data
data = data[data.year == 2000]
data = data[data.sex == 'male']

##make copy for later analysis
Data = data.copy(deep=True)

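##one-hot encode these features
encode_list = ['sex', 'age', 'generation']
for feature in encode_list:
    ##axis is passed by keyword (required by recent pandas); dtype=int keeps the dummies numeric
    data = pd.concat((data, pd.get_dummies(data[feature], drop_first=False, dtype=int)), axis=1)
    data = data.drop(feature, axis = 1)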

##separate the dependent variable, then drop it together with country and year
##Note: country changed the MAE by only about 1 per 100k, so it was dropped
y = data.suicides_100k_pop.values
data.drop(['suicides_100k_pop', 'country', 'year'], axis=1, inplace=True)
y = y.reshape(-1, 1)

##keep the encoded feature names (as an array and as a list) for later use
features_e = np.array(list(data))
features_el = list(data)

from sklearn.preprocessing import StandardScaler
x_scaler = StandardScaler().fit(data.values)
X_scaled = x_scaler.transform(data.values)

##importing train_test_split from sklearn for data split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.25, 
                                                    random_state = 42)

##extract data of interest from X_test and y_test
##this includes all male, Generation X, 15-24 years samples within the test set, which will 
##be used for prediction
X_test_20 = x_scaler.inverse_transform(X_test)
test_df = pd.DataFrame(X_test_20, columns = features_e)
test_df['y_test'] = y_test
##np.isclose guards against floating-point round-off from the scale/inverse-scale round trip
test_df = test_df[np.isclose(test_df['15-24 years'], 1) & np.isclose(test_df['Generation X'], 1) &
                        np.isclose(test_df['male'], 1)] 
y_test_20 = test_df['y_test'].values
test_df.drop('y_test', axis=1, inplace=True)
test_df.to_csv('test_df_encoded.csv')
X_test_20 = x_scaler.transform(test_df.values)


Head:

     country  year     sex          age  suicides_no  population  \
0  Albania  1987    male  15-24 years           21      312900   
1  Albania  1987    male  35-54 years           16      308000   
2  Albania  1987  female  15-24 years           14      289700   
3  Albania  1987    male    75+ years            1       21800   
4  Albania  1987    male  25-34 years            9      274300   

   suicides_100k_pop  gdp_per_capita_dollars       generation  
0               6.71                     796     Generation X  
1               5.19                     796           Silent  
2               4.83                     796     Generation X  
3               4.59                     796  G.I. Generation  
4               3.28                     796          Boomers  


Datatypes:

  country                    object
year                        int64
sex                        object
age                        object
suicides_no                 int64
population             



In [2]:
##sanity check and preprocess summary
print('\n\nPreprocess Summary')
print('------------------------')
print('Pre-encoded Features: ', features)
print('Dependent Variable: Suicides_100k_pop' )
print('Number of Encoded features: ', len(features_el))
print('X_test shape: ', X_test.shape)
print('y_test shape: ', y_test.shape)



Preprocess Summary
------------------------
Pre-encoded Features:  ['country', 'year', 'sex', 'age', 'suicides_no', 'population', 'suicides_100k_pop', 'gdp_per_capita_dollars', 'generation']
Dependent Variable: Suicides_100k_pop
Number of Encoded features:  15
X_test shape:  (129, 15)
y_test shape:  (129, 1)


In [3]:
%%time

##train and predict using linear regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression().fit(X_train, y_train)

##predict on the held-out test set (the R2 score is printed in the summary below)
y_pred = regressor.predict(X_test)

##evaluate performance using mean-absolute-error
##Note: adapted from module07_regression_notebook.html
def mae(_y: np.ndarray, _y_pred: np.ndarray) -> float:
    '''Calculates mean-absolute-error'''
    ##ravel both inputs so that (n,) and (n, 1) arrays cannot broadcast to (n, n)
    return np.mean(np.abs(np.ravel(_y_pred) - np.ravel(_y)))

def mse(_y: np.ndarray, _y_pred: np.ndarray) -> float:
    '''Calculates mean-squared-error'''
    return np.mean((np.ravel(_y_pred) - np.ravel(_y))**2)

##calculate metrics
MAE = mae(y_test, y_pred)
MSE = mse(y_test, y_pred)

##Summary: R2-score, intercept, coefficients and MAE
##Note: the R2 score is computed on the training set; MAE and MSE use the test set
print('\n\nSummary: ')
print('-----------------------')
print('R2 Score: ', regressor.score(X_train, y_train))
print('Number of Coefficients: ', regressor.coef_.size)
print('Intercept: ', regressor.intercept_)
print('Mean-absolute-error (MAE): ', MAE)
print('Mean-squared-error (MSE): ', MSE)

# regressor.coef_.shape
print('\n\nCoef for Each Feature: ')
coeff_df = pd.DataFrame()
coeff_df['features'] = features_e.flatten()
coeff_df['coef'] = regressor.coef_.flatten()
print(coeff_df.head())



Summary: 
-----------------------
R2 Score:  0.31786611617948446
Number of Coefficients:  15
Intercept:  [21.28316457]
Mean-absolute-error (MAE):  15.348952017356728
Mean-squared-error (MSE):  568.4010168761149


Coef for Each Feature: 
                 features          coef
0             suicides_no  1.073593e+01
1              population -6.342996e+00
2  gdp_per_capita_dollars -2.455410e-01
3                    male  1.397944e+13
4             15-24 years  1.525076e+14
Wall time: 37.1 ms
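
The coefficients of order 1e13 for male and 15-24 years stand out against the other values. One plausible explanation, not verified against the original data, is perfect collinearity: with drop_first=False the dummy columns of each feature sum to one and so duplicate the intercept, and male is constant after filtering on sex. The sketch below uses a made-up four-row frame (toy and rank_with_intercept are hypothetical names) to compare the rank of the design matrix with its column count under both settings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three age bands and a single sex (as after filtering on male)
toy = pd.DataFrame({
    'age': ['15-24 years', '25-34 years', '35-54 years', '15-24 years'],
    'sex': ['male', 'male', 'male', 'male'],
})

def rank_with_intercept(frame: pd.DataFrame) -> tuple:
    '''Returns (rank, number of columns) of the design matrix [1 | frame].'''
    design = np.column_stack([np.ones(len(frame)), frame.to_numpy(dtype=float)])
    return int(np.linalg.matrix_rank(design)), design.shape[1]

# Keep every dummy column versus drop the first level of each feature
full = pd.get_dummies(toy, drop_first=False, dtype=float)
reduced = pd.get_dummies(toy, drop_first=True, dtype=float)

full_rank, full_cols = rank_with_intercept(full)
reduced_rank, reduced_cols = rank_with_intercept(reduced)
print('drop_first=False: rank', full_rank, 'of', full_cols, 'columns')
print('drop_first=True:  rank', reduced_rank, 'of', reduced_cols, 'columns')
```

When the rank is below the column count, the least-squares solution is not unique, so individual coefficients should not be interpreted even if the predictions themselves remain reasonable.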


## Now use the original sex, age and generation variables in numerical form and develop a new model. Use your model to predict the target value for the people with age 20, male, and generation X. What is the MAE error of this prediction? How many line coefficients are there? (Note that for this step you have to think of a way of encoding the original nominal age feature and generation feature into numerical features.)

In [4]:
%%time

##starting fresh
datapath = os.path.join(cwd, 'data', 'master.csv')

##get dataframe
data = pd.read_csv(datapath)
data = data[data.year == 2000]
data = data[data.sex == 'male']

def getUnique(col_name: str, dataframe: pd.DataFrame) -> list:
    '''Returns the unique values of a dataframe column'''
    values = list(dataframe[col_name])
    return list(set(values))

print('\nEncodings')
print('--------------------------------------------------------')
##encode each 'age' band with a representative age (roughly the band midpoint; 75 for '75+ years')
ages = getUnique('age', data)
print('\nUnique Ages: ', ages)
age_dict = {'5-14 years': 10, '15-24 years': 20, '25-34 years': 30, '35-54 years': 45,
           '55-74 years': 65, '75+ years': 75}

##encode generations with counting numbers
generations = getUnique('generation', data)
print('\nUnique Generations: ', generations)
generation_dict = {'G.I. Generation': 0, 'Silent': 1, 'Boomers': 2, 'Generation X': 3,
           'Millenials': 4, 'Generation Z': 5}

##encode sex as a binary value
sex = getUnique('sex', data)
print('\nUnique Sex: ', sex)
sex_dict = {'male': 0, 'female': 1}

##change the data values with the encoded values
df = data.copy(deep=True)
for idx, val in age_dict.items():
    df.loc[df.age == idx, 'age'] = val
    
for idx, val in generation_dict.items():
    df.loc[df.generation == idx, 'generation'] = val
    
for idx, val in sex_dict.items():
    df.loc[df.sex == idx, 'sex'] = val

##separate the dependent variable, then drop it together with country and year
y = df['suicides_100k_pop'].values
df.drop(['suicides_100k_pop', 'country', 'year'], axis=1, inplace=True)

##sanity check
print('\nFeatures: ', list(df))
print('\n\nDataframe Head \n', df.head())

##scale data
from sklearn.preprocessing import StandardScaler
x_scaler = StandardScaler().fit(df.values)
X_scaled = x_scaler.transform(df.values)

##split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.25, 
                                                    random_state = 42)

##Extract male, Generation X, age-20 samples within the test set, which will be used 
##for prediction
X_test_20 = x_scaler.inverse_transform(X_test)
test_df = pd.DataFrame(X_test_20, columns = list(df))
test_df['y_test'] = y_test
##np.isclose guards against floating-point round-off from the scale/inverse-scale round trip
test_df = test_df[np.isclose(test_df['age'], 20) & np.isclose(test_df['generation'], 3) &
                        np.isclose(test_df['sex'], 0)] 
y_test_20 = test_df['y_test'].values
test_df.drop('y_test', axis=1, inplace=True)
X_test_20 = x_scaler.transform(test_df.values)


Encodings
--------------------------------------------------------

Unique Ages:  ['55-74 years', '35-54 years', '15-24 years', '5-14 years', '25-34 years', '75+ years']

Unique Generations:  ['Millenials', 'Boomers', 'Silent', 'Generation X', 'G.I. Generation']

Unique Sex:  ['male']

Features:  ['sex', 'age', 'suicides_no', 'population', 'gdp_per_capita_dollars', 'generation']


Dataframe Head 
     sex age  suicides_no  population  gdp_per_capita_dollars generation
132   0  30           17      232000                    1299          3
133   0  65           10      177400                    1299          1
135   0  75            1       24900                    1299          0
137   0  20            5      240000                    1299          3
140   0  45            4      374700                    1299          2
Wall time: 75 ms
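
An alternative way to apply the same dictionaries is Series.map, which returns a numeric column in one step. The following is a minimal sketch on a made-up three-row frame; age_dict and generation_dict are copied from the cell above, while toy and encoded are hypothetical names used only for illustration.

```python
import pandas as pd

# Encodings copied from the cell above
age_dict = {'5-14 years': 10, '15-24 years': 20, '25-34 years': 30, '35-54 years': 45,
            '55-74 years': 65, '75+ years': 75}
generation_dict = {'G.I. Generation': 0, 'Silent': 1, 'Boomers': 2, 'Generation X': 3,
                   'Millenials': 4, 'Generation Z': 5}

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    'age': ['15-24 years', '75+ years', '35-54 years'],
    'generation': ['Generation X', 'G.I. Generation', 'Boomers'],
})

# map looks every label up in the dictionary and returns a new numeric Series
encoded = toy.assign(
    age=toy['age'].map(age_dict),
    generation=toy['generation'].map(generation_dict),
)
print(encoded.dtypes)
print(encoded)
```

Labels missing from a dictionary become NaN under map, which makes unmapped categories easy to spot before scaling.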


In [5]:
%%time

##train and predict using linear regression
regressor = LinearRegression().fit(X_train, y_train)

##predict on the held-out test set (the R2 score is printed in the summary below)
y_pred = regressor.predict(X_test)
y_pred = y_pred.flatten()

MAE = mae(y_test, y_pred)
MSE = mse(y_test, y_pred)

##Summary: R2-score, intercept, coefficients and MAE
##Note: the R2 score is computed on the training set; MAE and MSE use the test set
print('\n\nSummary: ')
print('-----------------------')
print('R2 Score: ', regressor.score(X_train, y_train))
print('Number of Coefficients: ', regressor.coef_.size)
print('Intercept: ', regressor.intercept_)
print('Mean-absolute-error: ', MAE)
print('Mean-squared-error (MSE): ', MSE)

print('\n\nCoef for Each Feature: ')
coeff_df = pd.DataFrame()
coeff_df['features'] = np.array(list(df)).flatten()
coeff_df['coef'] = regressor.coef_.flatten()
print(coeff_df)



Summary: 
-----------------------
R2 Score:  0.28990148311997077
Number of Coefficients:  6
Intercept:  21.28646184250393
Mean-absolute-error:  16.035183213866983
Mean-squared-error (MSE):  568.4702602623679


Coef for Each Feature: 
                 features       coef
0                     sex   0.000000
1                     age  -1.817471
2             suicides_no  10.919402
3              population  -6.096847
4  gdp_per_capita_dollars  -0.322176
5              generation -11.178976
Wall time: 8 ms
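
Both summaries report MAE over the whole test set, while the questions ask about the age-20, male, Generation X subgroup; X_test_20 and y_test_20 are prepared above but are not used afterwards. In the notebook the subgroup error could presumably be obtained as mae(y_test_20, regressor.predict(X_test_20)). The sketch below illustrates the idea on synthetic data only: the data, the group indicator, and all variable names are assumptions, and scikit-learn's mean_absolute_error stands in for the notebook's mae helper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: column 0 is a 0/1 subgroup indicator, column 1 is continuous
rng = np.random.default_rng(0)
n = 200
group = rng.integers(0, 2, size=n)
x1 = rng.normal(size=n)
X = np.column_stack([group, x1])
y = 5.0 * group + 2.0 * x1 + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Select only the test rows that belong to the subgroup of interest
mask = X_test[:, 0] == 1
overall_mae = mean_absolute_error(y_test, model.predict(X_test))
subgroup_mae = mean_absolute_error(y_test[mask], model.predict(X_test[mask]))

print('Test rows in subgroup: ', int(mask.sum()))
print('Overall MAE: ', overall_mae)
print('Subgroup MAE: ', subgroup_mae)
```

The subgroup figure is based on fewer rows than the overall figure, so it is worth reporting the number of rows next to it.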


## Any change in these two model performances?

- The MAE of the model trained with one-hot-encoded data (15.35) is about four percent lower than that of the model trained on numerically encoded data (16.04). The smaller the MAE, the better the performance, so the one-hot-encoded model performed better on this metric.


- The difference in MSE between the two models is negligible (568.40 for one-hot encoding versus 568.47 for numerical encoding). The smaller the MSE, the better the performance, so the one-hot-encoded model is marginally better here, but the two are effectively tied.


- The R2 score of the model trained with one-hot-encoded data (0.318) is about 0.03 higher, roughly ten percent in relative terms, than that of the model trained on numerically encoded data (0.290). R2 measures goodness of fit, that is, how well the model's predictions approximate the real data. The closer R2 is to 1.0, the better the model fits the data. In this case, the one-hot-encoded model has the better R2 score as well. Note that both scores were computed on the training data, so they describe fit rather than generalization.

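- In the recorded runs, the training cell for the one-hot-encoded model took 37.1 ms of wall time versus 8 ms for the numerically encoded model, more than four times as long. Single-run timings this short are noisy, and the first cell also includes the LinearRegression import, so this comparison is only indicative.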
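
The relative differences between the two models can be checked with a few lines of arithmetic. The sketch below uses the figures copied from the two printed summaries and the recorded wall times; relative_change is a hypothetical helper name.

```python
# Metrics copied from the two 'Summary' printouts; wall times from the same cells
one_hot = {'r2': 0.31786611617948446, 'mae': 15.348952017356728,
           'mse': 568.4010168761149, 'wall_ms': 37.1}
numeric = {'r2': 0.28990148311997077, 'mae': 16.035183213866983,
           'mse': 568.4702602623679, 'wall_ms': 8.0}

def relative_change(new: float, base: float) -> float:
    '''Percentage change of new relative to base.'''
    return 100.0 * (new - base) / base

# One-hot model measured against the numerically encoded model as the baseline
changes = {key: relative_change(one_hot[key], numeric[key]) for key in one_hot}
for key, pct in changes.items():
    print(f'{key}: one-hot vs numerical = {pct:+.2f}%')
```

A negative change for MAE and MSE favors the one-hot model, while a positive change for R2 does the same.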

## What is the prediction for age 33, male and generation Alpha (i.e. the generation after generation Z)?

In [6]:
'''
There are quite a few ways to handle this question. For one thing, the model uses more than 
just age, sex, and generation: the other independent variables are suicides_no, population, and
gdp_per_capita_dollars. Another consideration is that generation alpha does not turn 33 until 2043
at the earliest, so year should not be used and was therefore dropped. For these reasons, 
I created a synthetic dataset based on real data for people in the age range of 
25-34 years during 2010 to narrow the data down. I set generation to 6 (the numerical 
encoding for generation alpha) and age to 33 in every row. 
Every other value stayed the same with regard to suicides_no, population, and
gdp_per_capita_dollars. This data was saved locally as synth_data.csv and is shown below as test data.
The 89 samples of synthetic test data were then scaled using the fitted standard scaler and run 
through the numerically encoded regression model. An alternative approach would be to fit a line 
through the observed points on a plot and extrapolate it. Both methods should be taken with a grain 
of salt, since a generation value of 6 lies outside the range seen in training (0 to 4). 
As can be seen below, the average predicted rate is 11.3 suicides per 100k population.
'''
datapath = os.path.join(cwd, 'data', 'synth_data.csv')
X_test_33 = np.array(pd.read_csv(datapath).values)
print('\nTest Data: \n', X_test_33)

X_scaled = x_scaler.transform(X_test_33)
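##Note: np.absolute mirrors any negative prediction to a positive value rather than clipping it at zero
y_pred = np.absolute(regressor.predict(X_scaled))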
y_pred_mean = np.round(np.mean(y_pred), 1)
print('\n\nPrediction Data: \n', y_pred)
print('\nAverage Suicides per 100k population for generation alpha is: ', y_pred_mean)


Test Data: 
 [[       0       33        9   179720     4359        6]
 [       0       33      485  3237957    11273        6]
 [       0       33       10   202936     3460        6]
 [       0       33        0     5568    25974        6]
 [       0       33      311  1606526    54887        6]
 [       0       33       85   542986    49181        6]
 [       0       33        0    28398    30239        6]
 [       0       33        6   235591    22572        6]
 [       0       33        0    18826    17034        6]
 [       0       33      382   723856     6371        6]
 [       0       33      184   696461    47355        6]
 [       0       33        4    24053     4923        6]
 [       0       33     1832 17023785    12161        6]
 [       0       33       57   566462     7066        6]
 [       0       33      409  2366854    49974        6]
 [       0       33      307  1289792    13874        6]
 [       0       33      436  3779392     6836        6]
 [       0       
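
The code that produced synth_data.csv is not shown, so the following is only a sketch of how such a file could be built from the numerically encoded data, assuming a frame that still contains year. The toy rows are illustrative (the first two reuse values from the test data above, the third is made up), and toy and synth are hypothetical names.

```python
import pandas as pd

# Hypothetical numerically encoded rows; the 2009 row exists only to show the year filter
toy = pd.DataFrame({
    'year': [2010, 2010, 2009],
    'sex': [0, 0, 0],
    'age': [30, 30, 30],
    'suicides_no': [9, 485, 12],
    'population': [179720, 3237957, 150000],
    'gdp_per_capita_dollars': [4359, 11273, 4000],
    'generation': [3, 3, 3],
})

# Keep the 25-34 band (encoded as 30) in 2010, drop year, then overwrite age and generation
synth = (
    toy[(toy['year'] == 2010) & (toy['age'] == 30)]
    .drop(columns='year')
    .assign(age=33, generation=6)
)
print(synth)
```

The column order matches the feature order used when the scaler was fitted, which matters because the scaler is applied to the raw values by position.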

## List one advantage when using regression (as opposed to classification with nominal features) in terms of input data features.

One advantage of regression over classification with nominal features is how it handles continuous inputs. Take a decision tree classifier as an example. The tree has to evaluate candidate splits over the unique values of each independent variable. This is usually fast for a moderately sized dataset, but if the independent variables include continuous data with many distinct values, it can become a heavy computation, especially without early stopping. Linear regression, on the other hand, consumes continuous inputs directly, and their presence does not slow down the computation. Because regression maps continuous input values to continuous output values, it also tends to give more realistic results for quantities such as time, currency, and other continuous features.

## List one advantage when using regular numerical values rather than one-hot encoding for regression.

One obvious advantage of using regular numerical values rather than one-hot encoding is that one-hot encoding creates more data: each categorical feature is replaced by one column per category. In this assignment the one-hot model has 15 coefficients versus 6 for the numerical model. This might not matter for small datasets, but for large datasets it can become a limiting factor. A feature with hundreds or thousands of categories produces hundreds or thousands of new columns, and the number of new columns times the number of samples can be very costly in terms of run time, compute power, and storage.
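
The growth in size can be illustrated with a small experiment. The sketch below uses a made-up categorical column (labels, with hypothetical cat_N names) and compares the shape and memory footprint of its one-hot encoding with a single integer-coded column; the row and category counts are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Made-up categorical column: 10,000 rows drawn from 500 categories
rng = np.random.default_rng(0)
n_rows, n_categories = 10_000, 500
labels = pd.Series(rng.integers(0, n_categories, size=n_rows)).map(lambda i: f'cat_{i}')

# One column per category versus a single numeric code per row
one_hot = pd.get_dummies(labels, dtype=np.float64)
ordinal = labels.astype('category').cat.codes.astype(np.float64)

print('one-hot shape:', one_hot.shape, 'bytes:', one_hot.to_numpy().nbytes)
print('ordinal shape:', ordinal.shape, 'bytes:', ordinal.to_numpy().nbytes)
```

The integer codes are far smaller, but they impose an arbitrary order on the categories, which a linear model will treat as meaningful. That trade-off is why the numerical encoding in this assignment was chosen by hand for age and generation.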

## Now that you developed both a classifier and a regression model for the problem in this assignment, which method do you suggest to your machine learning model customer? Classifier or regression? Why?

If the customer wants to predict suicide rates for datapoints that do not exist, then regression is preferable. Datapoints that do not exist would include, for example, new age groups or new years. I think that this would likely be the case for this problem, since we found that the dependent variable is continuous and the independent variables are a mix of categorical and continuous data. Using classification would require an extra step, such as discretizing the dependent variable into classes. This would add complexity, restrict the model to predicting only those discrete values, and most likely be less useful.
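
The extra discretization step that classification would need can be sketched with pandas.cut. The rates, bin edges, and class labels below are illustrative assumptions, not values chosen in this assignment.

```python
import pandas as pd

# Illustrative suicides_100k_pop values
rates = pd.Series([0.0, 3.28, 4.59, 6.71, 21.3, 48.9])

# Hypothetical bin edges and class labels for a classifier target
bins = [0, 5, 15, float('inf')]
class_names = ['low', 'medium', 'high']

# include_lowest=True places a rate of exactly 0 in the first bin
classes = pd.cut(rates, bins=bins, labels=class_names, include_lowest=True)
print(pd.DataFrame({'rate': rates, 'class': classes}))
```

After this step a rate of 21.3 and a rate of 48.9 fall into the same class, which shows the loss of resolution that regression avoids.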