# Lab 8: Implement Your Machine Learning Project Plan

In this lab assignment, you will implement the machine learning project plan you created in the written assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Perform exploratory data analysis on your data to determine which feature engineering and data preparation techniques you will use.
3. Prepare your data for your model and create features and a label.
4. Fit your model to the training data and evaluate your model.
5. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.

### Import Packages

Before you get started, import a few packages.

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need for this task.

In [44]:
from math import sqrt

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor


## Part 1: Load the Data Set


You have chosen to work with one of four data sets. The data sets are located in a folder named "data." The file names of the four data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultData.csv`
* The Airbnb NYC "listings" data set is located in file `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`
* The book review data set is located in file `bookReviewsData.csv`



<b>Task:</b> In the code cell below, use the same method you have been using to load your data using `pd.read_csv()` and save it to DataFrame `df`.

In [3]:
# YOUR CODE HERE
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
df = pd.read_csv(WHRDataSet_filename, header = 0)

## Part 2: Exploratory Data Analysis

The next step is to inspect and analyze your data set with your machine learning problem and project plan in mind. 

This step will help you determine data preparation and feature engineering techniques you will need to apply to your data to build a balanced modeling data set for your problem and model. These data preparation techniques may include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI


Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
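As a minimal sketch of the inspection techniques listed above, the snippet below runs `dtypes`, `describe()`, and a simple outlier check on a small toy DataFrame. The column names and values are hypothetical, and the 1.5 × IQR rule is only one common convention for flagging outliers, assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a real data set (hypothetical values).
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "score": [4.1, 4.4, 4.0, 4.3, 4.2, 9.5],   # 9.5 is a deliberate outlier
    "gdp": [7.2, 7.4, np.nan, 7.3, 7.5, 7.1],  # one missing value
})

print(toy.dtypes)       # data type of each column
print(toy.describe())   # count, mean, std, and quartiles for numeric columns

# Flag outliers in 'score' with the 1.5 * IQR rule (an assumed convention).
q1, q3 = toy["score"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy["score"] < q1 - 1.5 * iqr) | (toy["score"] > q3 + 1.5 * iqr)]
print(outliers)
```

On a real data set, a Seaborn box plot of the same column would show the flagged rows as points beyond the whiskers.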


<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. 

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [4]:
# YOUR CODE HERE
df.shape

(1562, 19)

In [5]:
df.head()

Unnamed: 0,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year"
0,Afghanistan,2008,3.72359,7.16869,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,-1.92969,-1.655084,1.774662,0.4766,,,
1,Afghanistan,2009,4.401778,7.33379,0.552308,49.624432,0.678896,0.203614,0.850035,0.583926,0.237092,0.611545,-2.044093,-1.635025,1.722688,0.391362,,,0.441906
2,Afghanistan,2010,4.758381,7.386629,0.539075,50.008961,0.600127,0.13763,0.706766,0.618265,0.275324,0.299357,-1.99181,-1.617176,1.878622,0.394803,,,0.327318
3,Afghanistan,2011,3.831719,7.415019,0.521104,50.367298,0.495901,0.175329,0.731109,0.611387,0.267175,0.307386,-1.919018,-1.616221,1.78536,0.465942,,,0.336764
4,Afghanistan,2012,3.782938,7.517126,0.520637,50.709263,0.530935,0.247159,0.77562,0.710385,0.267919,0.43544,-1.842996,-1.404078,1.798283,0.475367,,,0.34454


In [6]:
# identify missingness
nan_count = np.sum(df.isnull(), axis = 0)
nan_count

country                                                       0
year                                                          0
Life Ladder                                                   0
Log GDP per capita                                           27
Social support                                               13
Healthy life expectancy at birth                              9
Freedom to make life choices                                 29
Generosity                                                   80
Perceptions of corruption                                    90
Positive affect                                              18
Negative affect                                              12
Confidence in national government                           161
Democratic Quality                                          171
Delivery Quality                                            171
Standard deviation of ladder by country-year                  0
Standard deviation/Mean of ladder by cou

In [7]:
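# Find the features that correlate most strongly with the label to predict ('Social support').
# Only numeric columns are used, because recent pandas versions raise an error on string columns such as 'country'.
corrs = df.select_dtypes(include='number').corr()['Social support']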
corrs

year                                                       -0.052845
Life Ladder                                                 0.700299
Log GDP per capita                                          0.658591
Social support                                              1.000000
Healthy life expectancy at birth                            0.586759
Freedom to make life choices                                0.418213
Generosity                                                  0.077543
Perceptions of corruption                                  -0.217857
Positive affect                                             0.459656
Negative affect                                            -0.352552
Confidence in national government                          -0.160353
Democratic Quality                                          0.536387
Delivery Quality                                            0.545010
Standard deviation of ladder by country-year               -0.174091
Standard deviation/Mean of ladder 

In [8]:
corrs_sorted = corrs.sort_values(ascending = False)
corrs_sorted

# Conclusion: use the eight features most positively correlated with the label (Life Ladder through Generosity)

Social support                                              1.000000
Life Ladder                                                 0.700299
Log GDP per capita                                          0.658591
Healthy life expectancy at birth                            0.586759
Delivery Quality                                            0.545010
Democratic Quality                                          0.536387
Positive affect                                             0.459656
Freedom to make life choices                                0.418213
Generosity                                                  0.077543
year                                                       -0.052845
GINI index (World Bank estimate), average 2000-15          -0.128284
GINI index (World Bank estimate)                           -0.148387
Confidence in national government                          -0.160353
Standard deviation of ladder by country-year               -0.174091
Perceptions of corruption         

In [9]:
chosen_features = list(corrs_sorted[1:9].index)
chosen_features

['Life Ladder',
 'Log GDP per capita',
 'Healthy life expectancy at birth',
 'Delivery Quality',
 'Democratic Quality',
 'Positive affect',
 'Freedom to make life choices',
 'Generosity']
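The selection step above can be wrapped in a small helper. This is a sketch on synthetic data, not the notebook's data set; the function name `top_k_correlated` and the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data: three numeric features with known relationships to the label.
rng = np.random.default_rng(0)
n = 200
label = rng.normal(size=n)
toy = pd.DataFrame({
    "label": label,
    "strong": label + rng.normal(scale=0.3, size=n),     # strongly positive
    "weak": label + rng.normal(scale=3.0, size=n),       # weakly positive
    "negative": -label + rng.normal(scale=0.3, size=n),  # strongly negative
    "name": ["x"] * n,                                   # non-numeric, ignored below
})

def top_k_correlated(df, label_col, k):
    """Return the k numeric columns with the highest signed correlation to label_col."""
    corrs = df.select_dtypes(include="number").corr()[label_col]
    corrs = corrs.drop(label_col).sort_values(ascending=False)
    return list(corrs.index[:k])

selected = top_k_correlated(toy, "label", k=2)
print(selected)
```

Note that ranking by signed correlation places strongly negative features last, which is why `Generosity` (0.078) was chosen ahead of `Perceptions of corruption` (-0.218) above. Ranking by absolute correlation is a possible alternative when negatively related features may also be informative.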

*The cells below replace the missing values in the selected columns with each column's mean.*

In [10]:
condition = nan_count != 0 
nan_detected = condition

In [11]:
is_int_or_float = (df.dtypes == 'int64') | (df.dtypes == 'float64')

In [12]:
to_impute = (nan_detected) & (is_int_or_float)
to_impute

country                                                     False
year                                                        False
Life Ladder                                                 False
Log GDP per capita                                           True
Social support                                               True
Healthy life expectancy at birth                             True
Freedom to make life choices                                 True
Generosity                                                   True
Perceptions of corruption                                    True
Positive affect                                              True
Negative affect                                              True
Confidence in national government                            True
Democratic Quality                                           True
Delivery Quality                                             True
Standard deviation of ladder by country-year                False
Standard d

In [13]:
to_impute_selected = ['Social support','Life Ladder','Log GDP per capita','Healthy life expectancy at birth',
 'Delivery Quality','Democratic Quality','Positive affect','Freedom to make life choices', 'Generosity']

In [14]:
for colname in to_impute_selected:
    df[colname + '_na'] = df[colname].isnull()

In [15]:
df.head()

Unnamed: 0,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,...,"gini of household income reported in Gallup, by wp5-year",Social support_na,Life Ladder_na,Log GDP per capita_na,Healthy life expectancy at birth_na,Delivery Quality_na,Democratic Quality_na,Positive affect_na,Freedom to make life choices_na,Generosity_na
0,Afghanistan,2008,3.72359,7.16869,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,...,,False,False,False,False,False,False,False,False,False
1,Afghanistan,2009,4.401778,7.33379,0.552308,49.624432,0.678896,0.203614,0.850035,0.583926,...,0.441906,False,False,False,False,False,False,False,False,False
2,Afghanistan,2010,4.758381,7.386629,0.539075,50.008961,0.600127,0.13763,0.706766,0.618265,...,0.327318,False,False,False,False,False,False,False,False,False
3,Afghanistan,2011,3.831719,7.415019,0.521104,50.367298,0.495901,0.175329,0.731109,0.611387,...,0.336764,False,False,False,False,False,False,False,False,False
4,Afghanistan,2012,3.782938,7.517126,0.520637,50.709263,0.530935,0.247159,0.77562,0.710385,...,0.34454,False,False,False,False,False,False,False,False,False


In [16]:
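for colname in to_impute_selected:
    # Assign the result back rather than using inplace=True on a column selection,
    # which newer pandas versions warn against
    col_mean = df[colname].mean()
    df[colname] = df[colname].fillna(col_mean)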

In [17]:
for colname in to_impute_selected:
    print("{} missing values count :{}".format(colname, np.sum(df[colname].isnull(), axis = 0)))

Social support missing values count :0
Life Ladder missing values count :0
Log GDP per capita missing values count :0
Healthy life expectancy at birth missing values count :0
Delivery Quality missing values count :0
Democratic Quality missing values count :0
Positive affect missing values count :0
Freedom to make life choices missing values count :0
Generosity missing values count :0
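The imputation steps above (record a missingness flag, then fill with the column mean) can be collected into one function. The following is a minimal sketch on toy data; the helper name `impute_with_mean` is hypothetical and not part of the course materials.

```python
import numpy as np
import pandas as pd

def impute_with_mean(df, columns):
    """Return a copy of df with a '<col>_na' flag and mean-filled values for each column."""
    out = df.copy()
    for col in columns:
        out[col + "_na"] = out[col].isnull()           # remember which rows were missing
        out[col] = out[col].fillna(out[col].mean())    # mean() skips NaN by default
    return out

# Toy data with one missing value per column (hypothetical values).
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})
filled = impute_with_mean(toy, ["a", "b"])
print(filled)
```

One design point to consider: this notebook computes the means on the full data set before splitting. Computing them on the training split only and reusing them for the test split would keep test-set information out of the preprocessing step.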


## Part 3: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. You will:

1. Prepare your data for your model and create features and a label.
2. Fit your model to the training data and evaluate your model.
3. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.


Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.

###### First Model: Linear Regression with the Selected Features (chosen for their positive correlation with the label)
*Note: missing values in the selected columns were filled with the corresponding column mean.*

Result:
* Root Mean Squared Error (RMSE): 0.07685762425370715
* R-squared (R2): 0.576913701814282

In [18]:
y = df['Social support'] 

# Build X from the selected features, excluding the label 'Social support'
all_cols = df.columns.tolist()

cols_to_drop = [col for col in all_cols if col not in to_impute_selected or col == 'Social support']

X = df.drop(columns = cols_to_drop)

X.head()

Unnamed: 0,Life Ladder,Log GDP per capita,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Positive affect,Democratic Quality,Delivery Quality
0,3.72359,7.16869,49.209663,0.718114,0.181819,0.517637,-1.92969,-1.655084
1,4.401778,7.33379,49.624432,0.678896,0.203614,0.583926,-2.044093,-1.635025
2,4.758381,7.386629,50.008961,0.600127,0.13763,0.618265,-1.99181,-1.617176
3,3.831719,7.415019,50.367298,0.495901,0.175329,0.611387,-1.919018,-1.616221
4,3.782938,7.517126,50.709263,0.530935,0.247159,0.710385,-1.842996,-1.404078


In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1234)

In [20]:
X_train.head()

Unnamed: 0,Life Ladder,Log GDP per capita,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Positive affect,Democratic Quality,Delivery Quality
1326,7.776209,10.935776,72.734001,0.945428,0.125727,0.859107,1.529229,1.879037
1069,6.89414,9.542232,66.400909,0.640219,0.091073,0.819987,0.280292,0.015082
520,5.148242,10.117517,71.780342,0.4383,-0.296735,0.602939,-0.126617,0.004947
643,7.060155,11.066487,71.709785,0.905341,0.206802,0.833389,-0.126617,0.004947
446,7.670627,10.659014,69.745049,0.934179,0.015997,0.772778,1.454463,1.9804


In [21]:
# Create the  LinearRegression model object 
model = LinearRegression()

# Fit the model to the training data 
model.fit(X_train, y_train)

#  Make predictions on the test data 
prediction = model.predict(X_test)

In [22]:
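# Calculate RMSE (Root Mean Squared Error) by taking the square root of the MSE;
# the `squared=False` argument is no longer accepted by newer scikit-learn releases
rmse = sqrt(mean_squared_error(y_test, prediction))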

# Calculate R-squared (coefficient of determination)
r2 = r2_score(y_test, prediction)

# Print the results
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2):", r2)

Root Mean Squared Error (RMSE): 0.07685762425370715
R-squared (R2): 0.576913701814282
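To make the two metrics above concrete, the sketch below computes RMSE and R2 by hand with NumPy and compares them with the scikit-learn functions. The `y_true` and `y_pred` arrays are made-up values for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical labels and predictions.
y_true = np.array([0.5, 0.6, 0.7, 0.8])
y_pred = np.array([0.55, 0.58, 0.72, 0.75])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual, in the units of the label.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)

print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```

Because `Social support` lies roughly between 0 and 1, an RMSE of about 0.077 means a typical prediction error of under a tenth of that range, while an R2 of about 0.58 means the model explains a little over half of the label's variance on the test set.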


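###### Second Model: Linear Regression with Standardized (Scaled) Chosen Features
*Note: scaling (2.2 Feature Transformations) was chosen because the inputs and the label are all numeric. Scaling is a feature transformation, not a feature selection technique.*

Result: no improvement over the previous model. This is expected, because ordinary least squares with an intercept produces the same predictions after each feature is linearly rescaled; only the coefficients change.
* Root Mean Squared Error (RMSE): 0.07685762425370714
* R-squared (R2): 0.5769137018142821


<b>Note:</b> the cells below standardize the chosen features with `StandardScaler` and refit the linear regression model to see whether the score from the first model (features chosen by correlation) improves.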

In [23]:
scaler = StandardScaler()

In [24]:
X_train_scaled = scaler.fit_transform(X_train)

In [25]:
X_test_scaled = scaler.transform(X_test)

In [26]:
# Create the  LinearRegression model object 
model = LinearRegression()

# Fit the model to the training data 
model.fit(X_train_scaled, y_train)

#  Make predictions on the test data 
prediction = model.predict(X_test_scaled)

In [27]:
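# Calculate RMSE (Root Mean Squared Error) by taking the square root of the MSE;
# the `squared=False` argument is no longer accepted by newer scikit-learn releases
rmse = sqrt(mean_squared_error(y_test, prediction))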

# Calculate R-squared (coefficient of determination)
r2 = r2_score(y_test, prediction)

# Print the results
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2):", r2)

Root Mean Squared Error (RMSE): 0.07685762425370714
R-squared (R2): 0.5769137018142821
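The identical scores above follow from a property of ordinary least squares: standardizing the features changes the fitted coefficients but not the predictions. The sketch below illustrates this on synthetic data; the data, the feature scales, and the simple 70/30 split are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical data).
rng = np.random.default_rng(1234)
X = rng.normal(size=(100, 3)) * np.array([1.0, 50.0, 0.01])
y = X @ np.array([0.5, 0.02, 30.0]) + rng.normal(scale=0.1, size=100)

X_train, X_test = X[:70], X[70:]
y_train = y[:70]

# Model 1: fit on the raw features.
raw_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# Model 2: fit on standardized features (scaler fitted on the training data only).
scaler = StandardScaler().fit(X_train)
scaled_pred = (
    LinearRegression()
    .fit(scaler.transform(X_train), y_train)
    .predict(scaler.transform(X_test))
)

# The two sets of predictions agree up to floating-point error.
print(np.allclose(raw_pred, scaled_pred))
```

Scaling tends to matter more for models that depend on feature magnitudes, such as regularized regression or distance-based methods, than for plain linear regression or tree-based models.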


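###### Third Model: Decision Tree Regressor with `max_depth` Chosen by Cross-Validation
*Note: Performing Model Selection to Choose a DT (5.2)*

Result: slightly below the linear regression model's performance. The tree is built without a fixed `random_state`, so the scores can vary slightly between runs; the values below are those printed by the evaluation cell at the end of this section.
* Root Mean Squared Error (RMSE): 0.08081505655624481
* R-squared (R2): 0.5322221952455799


<b>Note:</b> the cells below select the decision tree's `max_depth` hyperparameter using 5-fold cross-validation.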

In [28]:
hyperparams = [n for n in range(1,10)]
hyperparams

[1, 2, 3, 4, 5, 6, 7, 8, 9]

In [29]:
cv_r2_scores = []

for depth in hyperparams:
    # 1. Create a DecisionTreeRegressor model object for this depth
    model = DecisionTreeRegressor(max_depth = depth, min_samples_leaf = 1)

    # 2. Perform 5-fold cross-validation; for a regressor the default score is R2, not accuracy
    fold_scores = cross_val_score(model, X_train, y_train, cv = 5)

    # 3. Find the mean of the resulting R2 scores
    mean_score = np.mean(fold_scores)

    # 4. Append the mean score to the list cv_r2_scores
    cv_r2_scores.append(mean_score)

print('Done\n')

for s in range(len(cv_r2_scores)):
    print('Mean CV R2 for max_depth {0}: {1}'.format(hyperparams[s], cv_r2_scores[s]))

Done

Mean CV R2 for max_depth 1: 0.31157830714142987
Mean CV R2 for max_depth 2: 0.3866785245139477
Mean CV R2 for max_depth 3: 0.43906480618809846
Mean CV R2 for max_depth 4: 0.4491209567885903
Mean CV R2 for max_depth 5: 0.49122346305037706
Mean CV R2 for max_depth 6: 0.48162047345675596
Mean CV R2 for max_depth 7: 0.45846855578327983
Mean CV R2 for max_depth 8: 0.4440808990205638
Mean CV R2 for max_depth 9: 0.40368972726579655


In [30]:
best_score_index = cv_r2_scores.index(max(cv_r2_scores))
best_max_depth = hyperparams[best_score_index]
print(best_max_depth)

5
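The manual search over `max_depth` above can also be expressed with `GridSearchCV`, which this notebook imports. The following is a sketch on synthetic data rather than the WHR data; the data-generating function and the fixed `random_state` are assumptions added so the example is reproducible.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (hypothetical data).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=300)

# Same candidate depths as the manual loop: 1 through 9.
param_grid = {"max_depth": list(range(1, 10))}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=1234),  # fixed seed for repeatable results
    param_grid,
    cv=5,          # 5-fold cross-validation
    scoring="r2",  # stated explicitly; R2 is also the regressor default
)
search.fit(X, y)

print(search.best_params_)  # depth with the highest mean cross-validated R2
print(search.best_score_)   # that mean R2
```

With the default `refit=True`, `search.best_estimator_` is already refitted on all the data passed to `fit`, so it can be used for prediction directly.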


In [32]:
# 1. Create a DecisionTreeRegressor model object with the best max_depth found above
model = DecisionTreeRegressor(max_depth = best_max_depth, min_samples_leaf = 1)

# 2. Fit the model to the training data
model.fit(X_train, y_train)

# 3. Make predictions on the test data
prediction = model.predict(X_test)

# 4. Evaluate with mean absolute error and mean squared error
mean_abs_error = mean_absolute_error(y_test, prediction)
mean_sqr_error = mean_squared_error(y_test, prediction)

print(f"Mean Absolute Error: {mean_abs_error}")
print(f"Mean Squared Error: {mean_sqr_error}")

Mean Absolute Error: 0.05680746569596405
Mean Squared Error: 0.006531073366189047


In [33]:
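# Calculate RMSE (Root Mean Squared Error) by taking the square root of the MSE;
# the `squared=False` argument is no longer accepted by newer scikit-learn releases
rmse = sqrt(mean_squared_error(y_test, prediction))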

# Calculate R-squared (coefficient of determination)
r2 = r2_score(y_test, prediction)

# Print the results
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2):", r2)

Root Mean Squared Error (RMSE): 0.08081505655624481
R-squared (R2): 0.5322221952455799


###### Fourth Model: Random Forest Regressor
*Note: Building Random Forests (6.2)*

Result: of the three values tried for `n_estimators` (20, 50, and 100), 100 gave the best R2 score.
* Root Mean Squared Error (RMSE): 0.06498641758092867
* R-squared (R2): 0.697517507678241

In [45]:
print('Begin Random Forest Implementation...')

# 20 estimators: create the RandomForestRegressor, fit it to the training data,
# and make predictions on the test data with predict()
rf_20_model = RandomForestRegressor(n_estimators = 20, random_state = 1234)
rf_20_model.fit(X_train, y_train)
rf_20_predictions = rf_20_model.predict(X_test)

# 50 estimators
rf_50_model = RandomForestRegressor(n_estimators = 50, random_state = 1234)
rf_50_model.fit(X_train, y_train)
rf_50_predictions = rf_50_model.predict(X_test)

# 100 estimators
rf_100_model = RandomForestRegressor(n_estimators = 100, random_state = 1234)
rf_100_model.fit(X_train, y_train)
rf_100_predictions = rf_100_model.predict(X_test)

print('End')

Begin Random Forest Implementation...
End


The scores for the Random Forest models with 20, 50, and 100 estimators are shown below.

In [46]:
rmse_rf_20_model = sqrt(mean_squared_error(y_test, rf_20_predictions))
r2_rf_20_model = r2_score(y_test, rf_20_predictions)

print(f"Root Mean Squared Error (RMSE): {rmse_rf_20_model}")
print(f"R-squared (R2): {r2_rf_20_model}")

Root Mean Squared Error (RMSE): 0.0654597862768621
R-squared (R2): 0.6930948224721885


In [48]:
rmse_rf_50_model = sqrt(mean_squared_error(y_test, rf_50_predictions))
r2_rf_50_model = r2_score(y_test, rf_50_predictions)

print(f"Root Mean Squared Error (RMSE): {rmse_rf_50_model}")
print(f"R-squared (R2): {r2_rf_50_model}")

Root Mean Squared Error (RMSE): 0.06532887099907979
R-squared (R2): 0.6943211753704577


In [47]:
rmse_rf_100_model = sqrt(mean_squared_error(y_test, rf_100_predictions))
r2_rf_100_model = r2_score(y_test, rf_100_predictions)

print(f"Root Mean Squared Error (RMSE): {rmse_rf_100_model}")
print(f"R-squared (R2): {r2_rf_100_model}")

Root Mean Squared Error (RMSE): 0.06498641758092867
R-squared (R2): 0.697517507678241
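The three near-identical fit-and-score cells above could be folded into a single loop. The sketch below shows one way to do that on synthetic data; the data and the `results` dictionary are hypothetical, while the `n_estimators` values, `test_size`, and `random_state` mirror those used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (hypothetical data).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 3))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.2, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1234
)

# Fit one forest per n_estimators value and record both metrics.
results = {}
for n_estimators in (20, 50, 100):
    rf = RandomForestRegressor(n_estimators=n_estimators, random_state=1234)
    rf.fit(X_train, y_train)
    pred = rf.predict(X_test)
    results[n_estimators] = {
        "rmse": np.sqrt(mean_squared_error(y_test, pred)),
        "r2": r2_score(y_test, pred),
    }

for n_estimators, scores in results.items():
    print(n_estimators, scores)
```

Keeping the scores in one dictionary makes it easier to compare settings side by side and to extend the search to further values.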


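Score Analysis: The Random Forest Regressor improves markedly on the earlier models. With 100 estimators, R2 rises from about 0.577 (linear regression) to about 0.698, and RMSE falls from about 0.077 to about 0.065.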