# Lab 8: Define and Solve an ML Problem of Your Choosing

In [614]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports.  You will then inspect the data with your problem in mind and begin to formulate a  project plan. You will then implement the machine learning project plan. 

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [615]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")

df = pd.read_csv(adultDataSet_filename, header = 0)

print(df.shape)
df.head()

(32561, 15)


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income_binary
0,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Non-Female,2174,0,40.0,United-States,<=50K
1,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,13.0,United-States,<=50K
2,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Non-Female,0,0,40.0,United-States,<=50K
3,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Non-Female,0,0,40.0,United-States,<=50K
4,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40.0,Cuba,<=50K


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classificaiton or multi-class classifiction problem?
4. What are your features? (note: this list may change after your explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

1. I am using the censusData.csv which is the Census information from 1994.
2. The goal is to predict the number of hours per week that an individual spends at work based on other census details like their financial status, demographics, native country etc. The label would be 'hours-per-week' column.
3. This is a Supervised Learning Regression problem.
4. Currently the features are all columns other than the 'hours-per-week' column. May change after inspecting data.
5. Predicting the hours per week that an individual will invest in their work can help companies make administrative and recruitment decisions. It can also be used to create policies that prevent overworking of employees and provide the necessary resources to families.

Note: I originally had 'education' as the label and then set up a binary classification problem. However, I realized I would have to remove most of the feature columns since occupation, income and many other statistics are based on education and generally come after an individual has completed their education and acquired a job. There would be feature leakage if all the features were kept.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * finding and replacing outliers
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) you would like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-drown menu.

#### Data Preparation Techniques

In [616]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income_binary
0,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Non-Female,2174,0,40.0,United-States,<=50K
1,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,13.0,United-States,<=50K
2,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Non-Female,0,0,40.0,United-States,<=50K
3,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Non-Female,0,0,40.0,United-States,<=50K
4,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40.0,Cuba,<=50K


In [617]:
np.sum(df.isnull(), axis = 0)

age                162
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex_selfID           0
capital-gain         0
capital-loss         0
hours-per-week     325
native-country     583
income_binary        0
dtype: int64

In [618]:
df.dtypes

age               float64
workclass          object
fnlwgt              int64
education          object
education-num       int64
marital-status     object
occupation         object
relationship       object
race               object
sex_selfID         object
capital-gain        int64
capital-loss        int64
hours-per-week    float64
native-country     object
income_binary      object
dtype: object

There are clearly missing values in the data both in categorical and numerical columns that need to be addressed.
The label column which is the 'hours-per-week' column also has missing values. Examples with missing label values cannot be used for training and testing models. 

1. Addressing missing values:
For numerical features, I will replace the missing values with the mean values of the column. For categorical columns, I will replace the missing values with 'Unknown'. Since the label, which is the 'hours-per-week column has 325 missing values, I will remove these 325 examples from the dataset.
2. I will normalize all the numerical feature columns too to improve model performance.(Not the label)
3. For the categorical features like marital status, race etc. I will use One Hot Encoding to convert them into a numerical form which can be easily be used for ML model training.

In [619]:
categorical_cols = df.columns[df.dtypes == 'object']
for col in categorical_cols:
    print(col, ":", df[col].nunique())
print()
for col in categorical_cols:
    print(col, ":", df[col].unique())

workclass : 8
education : 16
marital-status : 7
occupation : 14
relationship : 6
race : 5
sex_selfID : 2
native-country : 41
income_binary : 2

workclass : ['State-gov' 'Self-emp-not-inc' 'Private' 'Federal-gov' 'Local-gov' nan
 'Self-emp-inc' 'Without-pay' 'Never-worked']
education : ['Bachelors' 'HS-grad' '11th' 'Masters' '9th' 'Some-college' 'Assoc-acdm'
 'Assoc-voc' '7th-8th' 'Doctorate' 'Prof-school' '5th-6th' '10th'
 '1st-4th' 'Preschool' '12th']
marital-status : ['Never-married' 'Married-civ-spouse' 'Divorced' 'Married-spouse-absent'
 'Separated' 'Married-AF-spouse' 'Widowed']
occupation : ['Adm-clerical' 'Exec-managerial' 'Handlers-cleaners' 'Prof-specialty'
 'Other-service' 'Sales' 'Craft-repair' 'Transport-moving'
 'Farming-fishing' 'Machine-op-inspct' 'Tech-support' nan
 'Protective-serv' 'Armed-Forces' 'Priv-house-serv']
relationship : ['Not-in-family' 'Husband' 'Wife' 'Own-child' 'Unmarried' 'Other-relative']
race : ['White' 'Black' 'Asian-Pac-Islander' 'Amer-Indian-Inuit'

4. The 'education', 'native-country', and 'occupation' columns have many possible values. They seem to be either open-ended text features or features with many categories. The values of these column can be combined into a smaller range of categories.

In [620]:
df.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,32399.0,32561.0,32561.0,32561.0,32561.0,32236.0
mean,38.589216,189778.4,10.080679,615.907773,87.30383,40.450428
std,13.647862,105550.0,2.57272,2420.191974,402.960219,12.353748
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,14084.0,4356.0,99.0


5. The mean and the median values for the numerical features 'age', 'fnlweight' and 'education-num' and the label 'hours-per-week' are not very far apart. However, the 'capital-gain' and 'capital-loss' columns have a huge descrepancy between the mean and the median values specially because there are a lot of 0s. Winsorization would handle some of the outliers in these two columns and some of the slightly high or low values in the other columns.

6. Feature selection: Since there are only 15 columns, it would be a good idea to try with all the features first. I could remove the occupation column since there are many possible values it can take and almost half of its values are missing. I will keep the workclass feature, even if it has a lot of missing values, so that it can capture the occupation type to some extent. Next, I will calculate the correlation of the feature values with the label values (after the one-hot-encoding is done and the data is cleaned) to obtain new feature sets without the features that are not correlated to the label (below 0.2?) and will not add much value to the model.
Note: It turned out that the features were not correlated to the label and the highest correlation value was less than 0.26. So I couldn't remove features with a correlation value less than 0.2.

#### Machine Learning Models Appropriate for Predictive Problem and Data
Since this is a Regression problem, I will start of the Linear Regression model as it is a low-complexity model. I will increase the complexity of the model and use a Decision Tree Model and a K-Nearest Neighbors Model. I will normalize the numerical feature columns for better prediction.

#### Evaluating and Improving Model Performance
Again, since this is a Regression problem with the label values being continuous and numerical, I will use the Root Mean Squared Error (RMSE) value and the Coefficient of Determination (R^2) value to evaluate and compare the model performance. For Linear Regression, I will use the Mean Squared Error loss function to train the model.

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
- Yes, I will modify the education and native-country columns to reduce the range of categories. I will also remove the occupation column since it has a lot of open-ended text values and the workclass column already groups the type of occupation to some extent. With this modified set of features, I will try running the models and observe the change in performance.
- I will first start with all the features in the modified data set to train the models and test performance.
- Next, I will use the correlation values calculated between the features and the label to create feature sets of increasing size (including the most correlated features and removing other irrelevant features) and compare the results of training the models on these sets.
  
Explain different data preparation techniques that you will use to prepare your data for modeling.
- I will use the Pandas fillna() method to impute all the missing values in the numerical columns with the mean of the columns and all the missing values in the categorical features with the value 'Unknown'
- I will use the Pandas dropna() method to remove all the rows that have missing values in the 'hours-per-week' column which is the label. This will make sure that there aren't any examples without label values being used in the training and testing of the models. 
- I will use Winzorization to handle outliers. This can be done using the winsorize function in the Scipy Stats package. The main outliers are in the 'capital-gain' and the 'capital-loss' columns but some of the other columns like 'fnlwgt' also have a high maximum or low minimimum value.
- I will normalize all the numeric features by subtracting by the mean and dividing by the standard deviation. This will ensure that all the columns have a standardized range. Useful for all models but especially Linear Regression.
- I will also group some of the categorical feature values especially those like 'native-country', which appears to be either text-based or has many categories, using Numpy where() function. I will also use the where() function to assign numerical values to the binary features.
- Finally, I will perform One-Hot-Encoding to convert the categorical features to numerical ones using the Pandas get_dummies() function.

What is your model (or models)?
- I want to test a wide variety of models with differing complexities.
- First I will start with a simple linear Regression model which is less complex. Then I will move onto the Decision Tree and the KNN models, modifying their complexities by testing multiple hyperparameters.

Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data.
- I will first split the data into training and testing. For model selection, I will evaluate and compare each model on the testing data to perform out-sample-validation instead of splitting the data into a third validation set too.
- First I will start with training the Linear Regression model and attempt regularization only if the model is overfitting.
- Next I will train a decision tree model with default parameters and evaluate the model.
- Then I will perform gridsearch and 3 fold cross validation to test different hyperparamter combinations and find the one performing best according to the cross-validation testing. I will then find the performance of the best model on the testing data set to compare with other model types.
- Finally, I will train a KNN model with different values for the hyperparameter 'n' and evaluate each one on the test data set. Finding the KNN model with the best out-of-sample score, I will compare the results with the other two types of models.
- I will also try all these models on different features sets using stepwise feature selection as mentioned above and then record the results.
- Finally, I will compare all the results that I will document in the Result section to find the model with the best out-of-sample score without overfitting to the data. 

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [621]:
import sklearn as skl
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import root_mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

### Data Preparation

#### Handling Missing Values

In [622]:
# Inspect which columns have missing values
nan_count = np.sum(df.isnull(), axis = 0)
print(nan_count)

# Inspect datatypes
col_types = df.dtypes
col_types

age                162
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex_selfID           0
capital-gain         0
capital-loss         0
hours-per-week     325
native-country     583
income_binary        0
dtype: int64


age               float64
workclass          object
fnlwgt              int64
education          object
education-num       int64
marital-status     object
occupation         object
relationship       object
race               object
sex_selfID         object
capital-gain        int64
capital-loss        int64
hours-per-week    float64
native-country     object
income_binary      object
dtype: object

There are two numeric columns that have missing values : age and hours-per-week<br>
Since age is a feature column, I will replace all the missing values with the mean age value. However, since hours-per-week is the label, I will remove all the rows with a missing value in the hours-per-week column.

In [623]:
# Replacing numeric values in the age column with mean values. 
mean_age = np.mean(df['age'])
df['age'].fillna(value=mean_age, inplace=True)

# Check missing values again
np.sum(df.isnull(), axis = 0)

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex_selfID           0
capital-gain         0
capital-loss         0
hours-per-week     325
native-country     583
income_binary        0
dtype: int64

In [624]:
# Removing all rows with missing values in the hours-per-week column
print("Original number of examples : ", df.shape[0])

df = df.dropna(subset=['hours-per-week'])

# Checking that the code above worked and that there are no more missing values in the age column
# 32561 - 325 = 32236
print(df['hours-per-week'].isnull().unique())
print("Number of rows after removal : ", df.shape[0])
print(np.sum(df.isnull(), axis = 0))

Original number of examples :  32561
[False]
Number of rows after removal :  32236
age                  0
workclass         1812
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1818
relationship         0
race                 0
sex_selfID           0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     580
income_binary        0
dtype: int64


The missing values left are now in the workclass, occupation, and native-country columns.

In [625]:
# Handling missing values in 'Native_Country' column
df['native-country'].fillna(value='Unknown', inplace = True)
np.sum(df.isnull(), axis = 0)

age                  0
workclass         1812
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1818
relationship         0
race                 0
sex_selfID           0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country       0
income_binary        0
dtype: int64

In [626]:
# Looking at the different occupation categories
print(df['occupation'].unique())
# It is not clear if nan values represent unknown occupation or no occupation.
print(df['workclass'].unique())

# For these two columns, slightly more than 1/2 of the values are missing.
df['occupation'].fillna(value='Unknown or None', inplace=True)
df['workclass'].fillna(value='Unknown', inplace=True)
np.sum(df.isnull(), axis = 0)

['Adm-clerical' 'Exec-managerial' 'Handlers-cleaners' 'Prof-specialty'
 'Other-service' 'Sales' 'Craft-repair' 'Transport-moving'
 'Farming-fishing' 'Tech-support' nan 'Protective-serv'
 'Machine-op-inspct' 'Armed-Forces' 'Priv-house-serv']
['State-gov' 'Self-emp-not-inc' 'Private' 'Federal-gov' 'Local-gov' nan
 'Self-emp-inc' 'Without-pay' 'Never-worked']


age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex_selfID        0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income_binary     0
dtype: int64

#### Handling Outliers

In [627]:
# Handle outliers
# Get all columns that are have numeric type values
condition = (df.dtypes == 'int64') | (df.dtypes == 'float64')
numeric_cols = df.columns[condition]
print("Numeric columns :", numeric_cols)

# Wisorizing the numerical columns
for col in numeric_cols:
    df[col] = stats.mstats.winsorize(df[col], limits=[0.01, 0.01])

Numeric columns : Index(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
       'hours-per-week'],
      dtype='object')


#### Normalizing Numerical Features

In [628]:
# Normalize all the columns except the label column.
condition = ((df.dtypes == 'int64') | (df.dtypes == 'float64')) & (df.columns != 'hours-per-week')
to_normalize = df.columns[condition]
print(to_normalize)
for col in to_normalize:
    mean = np.mean(df[col])
    print("Mean Value : ", mean)
    std = np.std(df[col])
    print("Standard Deviation : ", std)
    df[col] = (df[col] - mean)/std

Index(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss'], dtype='object')
Mean Value :  38.53632161962905
Standard Deviation :  13.451769492821699
Mean Value :  188626.11384787195
Standard Deviation :  99780.21161316789
Mean Value :  10.090457873185258
Standard Deviation :  2.5480262552707473
Mean Value :  614.6973259709641
Standard Deviation :  2417.9844286055336
Mean Value :  83.90246928899367
Standard Deviation :  383.15014005304164


#### Grouping values to create Categorical features

In [629]:
# Get all categorical columns
categorical_cols = df.columns[df.dtypes == 'object']
print("Categorical columns :", categorical_cols)

# Check the number of unique values for each column:
print(df[categorical_cols].nunique())

Categorical columns : Index(['workclass', 'education', 'marital-status', 'occupation',
       'relationship', 'race', 'sex_selfID', 'native-country',
       'income_binary'],
      dtype='object')
workclass          9
education         16
marital-status     7
occupation        15
relationship       6
race               5
sex_selfID         2
native-country    42
income_binary      2
dtype: int64


In [630]:
# First we will consolidate the native-country column so that the countries are grouped by geographical regions
# This helps reduce the number of categories in the feature. 
print(df['native-country'].unique())
condition = (df['native-country'] == 'United-States') | (df['native-country'] == 'Canada') | ( df['native-country'] == 'Outlying-US(Guam-USVI-etc)')
df['native-country'] = np.where(condition, 'North_America', df['native-country'])

condition = (df['native-country'] == 'Cuba') | (df['native-country'] == 'Jamaica')| (df['native-country'] == 'Puerto-Rico')| (df['native-country'] == 'Trinadad&Tobago') | (df['native-country'] == 'Haiti') | (df['native-country'] == 'Dominican-Republic')
df['native-country'] = np.where(condition, 'Caribbean', df['native-country'])

condition = (df['native-country'] == 'Mexico') | (df['native-country'] == 'South')| (df['native-country'] == 'Honduras')| (df['native-country'] == 'Ecuador') | (df['native-country'] == 'Ecuador')| (df['native-country'] == 'El-Salvador') | (df['native-country'] == 'Guatemala') | (df['native-country'] == 'Peru') | (df['native-country'] == 'Nicaragua') | (df['native-country'] == 'Columbia')
df['native-country'] = np.where(condition, 'Latin_America', df['native-country'])

condition = (df['native-country'] == 'Germany') | (df['native-country'] == 'England')| (df['native-country'] == 'Italy')| (df['native-country'] == 'Portugal') | (df['native-country'] == 'France')| (df['native-country'] == 'Greece') | (df['native-country'] == 'Scotland') | (df['native-country'] == 'Ireland') | (df['native-country'] == 'Holand-Netherlands')
df['native-country'] = np.where(condition, 'Western_Europe', df['native-country'])

condition = (df['native-country'] == 'India')
df['native-country'] = np.where(condition, 'South_Asia', df['native-country'])

condition = (df['native-country'] == 'Philippines') | (df['native-country'] == 'Cambodia')| (df['native-country'] == 'Thailand')| (df['native-country'] == 'Laos') | (df['native-country'] == 'Vietnam')
df['native-country'] = np.where(condition, 'Southeast_Asia', df['native-country'])

condition = (df['native-country'] == 'China') | (df['native-country'] == 'Japan')| (df['native-country'] == 'Taiwan')| (df['native-country'] == 'Hong')
df['native-country'] = np.where(condition, 'Southeast_Asia', df['native-country'])

condition = (df['native-country'] == 'Poland') | (df['native-country'] == 'Hungary')| (df['native-country'] == 'Yugoslavia')
df['native-country'] = np.where(condition, 'Central_and_Eastern_Europe', df['native-country'])

condition = (df['native-country'] == 'Iran')
df['native-country'] = np.where(condition, 'Middle_East', df['native-country'])
df['native-country'].unique()

['United-States' 'Cuba' 'Jamaica' 'India' 'Unknown' 'Mexico' 'South'
 'Puerto-Rico' 'Honduras' 'Canada' 'Germany' 'Iran' 'Philippines'
 'England' 'Italy' 'Poland' 'Columbia' 'Cambodia' 'Thailand' 'Ecuador'
 'Laos' 'Taiwan' 'Haiti' 'Portugal' 'Dominican-Republic' 'El-Salvador'
 'France' 'Guatemala' 'China' 'Japan' 'Yugoslavia' 'Peru'
 'Outlying-US(Guam-USVI-etc)' 'Scotland' 'Trinadad&Tobago' 'Greece'
 'Nicaragua' 'Vietnam' 'Hong' 'Ireland' 'Hungary' 'Holand-Netherlands']


array(['North_America', 'Caribbean', 'South_Asia', 'Unknown',
       'Latin_America', 'Western_Europe', 'Middle_East', 'Southeast_Asia',
       'Central_and_Eastern_Europe'], dtype=object)

In [631]:
# Rename the native-country column to native-region
df.rename({'native-country' : 'native-region'}, inplace = True, axis = 1)
list(df.columns)

['age',
 'workclass',
 'fnlwgt',
 'education',
 'education-num',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex_selfID',
 'capital-gain',
 'capital-loss',
 'hours-per-week',
 'native-region',
 'income_binary']

In [632]:
# Group education feature categories.
df['education'].unique()
condition = (df['education'] == 'Preschool') | (df['education'] == '1st-4th') | (df['education'] == '5th-6th') | (df['education'] == '10th') | (df['education'] == '7th-8th') | (df['education'] == '9th') | (df['education'] == '11th') | (df['education'] == '12th') | (df['education'] == 'HS-grad') 
df['education'] = np.where(condition, '<=_12th_grade', df['education'])

condition = (df['education'] == 'Bachelors') | (df['education'] == 'Masters') | (df['education'] == 'Some-college') | (df['education'] == 'Assoc-acdm') | (df['education'] == 'Assoc-voc') | (df['education'] == 'Doctorate') | (df['education'] == 'Prof-school')
df['education'] = np.where(condition, '>_12th_grade', df['education'])
df['education'].unique()

array(['>_12th_grade', '<=_12th_grade'], dtype=object)

In [633]:
# Drop occupation category since it is very open ended.
df.drop(columns = 'occupation', inplace = True)
# Note: Earlier when the feature was included, correlation analysis did not show any high correlation between any of the one-hot-encoded occupation features and the label

#### One-Hot-Encoding

In [634]:
categorical_cols = df.columns[df.dtypes == 'object']
print(df[categorical_cols].nunique())
df.head()

workclass         9
education         2
marital-status    7
relationship      6
race              5
sex_selfID        2
native-region     9
income_binary     2
dtype: int64


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-region,income_binary
0,0.03447,State-gov,-1.113549,>_12th_grade,1.141881,Never-married,Not-in-family,White,Non-Female,0.644877,-0.218981,40.0,North_America,<=50K
1,0.852206,Self-emp-not-inc,-1.055471,>_12th_grade,1.141881,Married-civ-spouse,Husband,White,Non-Female,-0.254219,-0.218981,13.0,North_America,<=50K
2,-0.03987,Private,0.270794,<=_12th_grade,-0.427962,Divorced,Not-in-family,White,Non-Female,-0.254219,-0.218981,40.0,North_America,<=50K
3,1.075225,Private,0.461964,<=_12th_grade,-1.212883,Married-civ-spouse,Husband,Black,Non-Female,-0.254219,-0.218981,40.0,North_America,<=50K
4,-0.783267,Private,1.501128,>_12th_grade,1.141881,Married-civ-spouse,Wife,Black,Female,-0.254219,-0.218981,40.0,Caribbean,<=50K


Note that the 'income_binary' and 'sex_selfID' columns are already binary columns. No on-hot-encoding is needed; they just need to be made numerical

In [635]:
# One-hot-encode all categorical columns except 'income_binary' and 'sex_selfID'
to_encode = df.columns[(df.dtypes == 'object') & (df.columns != 'income_binary') & (df.columns != 'sex_selfID') & (df.columns != 'education')]

for col in to_encode:
    # Get dataframe with the one-hot-encoded values of the current column
    df_temp = pd.get_dummies(df[col], prefix= (str(col) + "_"))
    # Join the one-hot-encoded dataframe to the main dataframe
    df = df.join(df_temp)
    # remove the old column
    df.drop(columns = col, inplace=True)
    
print(df.shape)

(32236, 45)


In [636]:
# Converting 'income_binary' and 'sex_selfID' into numerical binary features with values of 0 and 1
# 'sex_selfID' => 1 if 'Female' and 0 if 'Non-Female'
# 'income_binary' => 1 if '>50K' and 0 if '<=50K'
print(df['sex_selfID'].unique())
print(df['income_binary'].unique())

print()
print("self_sexID feature")
print("Before conversion : ")
print(df['sex_selfID'].head())
df['sex_selfID'] = np.where(df['sex_selfID'] == 'Female', 1, 0)
print("After conversion : ")
print(df['sex_selfID'].head())

print()
print("income_binary feature")
print("Before conversion : ")
print(df['income_binary'].head(10))
df['income_binary'] = np.where(df['income_binary'] == '>50K', 1, 0)
print("After conversion : ")
print(df['income_binary'].head(10))

print()
print("education feature")
print("Before conversion : ")
print(df['education'].head(10))
df['education'] = np.where(df['education'] == '>_12th_grade', 1, 0)
print("After conversion : ")
print(df['education'].head(10))

['Non-Female' 'Female']
['<=50K' '>50K']

self_sexID feature
Before conversion : 
0    Non-Female
1    Non-Female
2    Non-Female
3    Non-Female
4        Female
Name: sex_selfID, dtype: object
After conversion : 
0    0
1    0
2    0
3    0
4    1
Name: sex_selfID, dtype: int64

income_binary feature
Before conversion : 
0    <=50K
1    <=50K
2    <=50K
3    <=50K
4    <=50K
5    <=50K
6    <=50K
7     >50K
8     >50K
9     >50K
Name: income_binary, dtype: object
After conversion : 
0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    1
8    1
9    1
Name: income_binary, dtype: int64

education feature
Before conversion : 
0     >_12th_grade
1     >_12th_grade
2    <=_12th_grade
3    <=_12th_grade
4     >_12th_grade
5     >_12th_grade
6    <=_12th_grade
7    <=_12th_grade
8     >_12th_grade
9     >_12th_grade
Name: education, dtype: object
After conversion : 
0    1
1    1
2    0
3    0
4    1
5    1
6    0
7    0
8    1
9    1
Name: education, dtype: int64


#### Correlation Matrix
Get the correlations of all the features with the label to see which features are suitable for the model and which ones are irrelevant.

In [637]:
corr_matrix = round(df.corr(),5)
corr_matrix

Unnamed: 0,age,fnlwgt,education,education-num,sex_selfID,capital-gain,capital-loss,hours-per-week,income_binary,workclass__Federal-gov,...,race__White,native-region__Caribbean,native-region__Central_and_Eastern_Europe,native-region__Latin_America,native-region__Middle_East,native-region__North_America,native-region__South_Asia,native-region__Southeast_Asia,native-region__Unknown,native-region__Western_Europe
age,1.0,-0.07836,-0.01433,0.04056,-0.08856,0.12521,0.05459,0.0745,0.23689,0.05282,...,0.0341,0.01398,0.0156,-0.0593,0.0022,0.0203,-0.00162,-0.00622,0.00098,0.02356
fnlwgt,-0.07836,1.0,-0.02799,-0.04452,-0.0286,-0.00161,-0.00823,-0.01765,-0.00901,-0.00758,...,-0.05476,0.02855,0.00041,0.1392,-0.00206,-0.08124,-0.01125,-0.01506,0.00517,-0.01466
education,-0.01433,-0.02799,1.0,0.73263,0.02165,0.10897,0.0508,0.07376,0.23655,0.05404,...,0.03881,-0.03278,0.00344,-0.10662,0.0255,0.03953,0.03216,0.03416,0.02197,0.00598
education-num,0.04056,-0.04452,0.73263,1.0,-0.01276,0.16797,0.08071,0.15243,0.33703,0.06081,...,0.05144,-0.05808,0.00328,-0.20982,0.03153,0.10053,0.05045,0.04208,0.02707,0.00588
sex_selfID,-0.08856,-0.0286,0.02165,-0.01276,1.0,-0.07181,-0.04806,-0.23398,-0.21547,0.00022,...,-0.10367,0.03555,-0.00138,-0.02,-0.01213,0.00731,-0.02563,0.00248,-0.01427,0.0072
capital-gain,0.12521,-0.00161,0.10897,0.16797,-0.07181,1.0,-0.05567,0.10477,0.34732,0.0084,...,0.02832,-0.01563,-0.00179,-0.02684,0.01121,0.01622,0.00969,-0.00332,0.00675,0.00305
capital-loss,0.05459,-0.00823,0.0508,0.08071,-0.04806,-0.05567,1.0,0.05575,0.15141,0.01176,...,0.02156,-0.01099,-0.0058,-0.02189,0.00452,0.01068,0.00623,0.00634,0.00965,-0.00315
hours-per-week,0.0745,-0.01765,0.07376,0.15243,-0.23398,0.10477,0.05575,1.0,0.23569,0.01444,...,0.05173,-0.01264,-0.00545,-0.01133,0.0114,0.00182,0.00564,-0.00706,0.01287,0.01438
income_binary,0.23689,-0.00901,0.23655,0.33703,-0.21547,0.34732,0.15141,0.23569,1.0,0.05908,...,0.08525,-0.02954,-0.00027,-0.07387,0.01246,0.03647,0.02028,0.01237,0.00343,0.01705
workclass__Federal-gov,0.05282,-0.00758,0.05404,0.06081,0.00022,0.0084,0.01176,0.01444,0.05908,1.0,...,-0.0507,-0.00375,-0.00557,-0.02358,0.00412,0.01565,-0.00292,0.00544,-0.0,-0.00409


In [638]:
# Get the correlation values of all the features with the label.
corrs = corr_matrix["hours-per-week"]
corrs

age                                          0.07450
fnlwgt                                      -0.01765
education                                    0.07376
education-num                                0.15243
sex_selfID                                  -0.23398
capital-gain                                 0.10477
capital-loss                                 0.05575
hours-per-week                               1.00000
income_binary                                0.23569
workclass__Federal-gov                       0.01444
workclass__Local-gov                         0.01193
workclass__Never-worked                     -0.00941
workclass__Private                          -0.02060
workclass__Self-emp-inc                      0.12935
workclass__Self-emp-not-inc                  0.09217
workclass__State-gov                        -0.02433
workclass__Unknown                          -0.16948
workclass__Without-pay                      -0.01342
marital-status__Divorced                     0

In [639]:
# Sort the correlated values in descending order.
corrs_sorted = corrs.sort_values(ascending = False)
corrs_sorted

hours-per-week                               1.00000
relationship__Husband                        0.25158
income_binary                                0.23569
marital-status__Married-civ-spouse           0.21781
education-num                                0.15243
workclass__Self-emp-inc                      0.12935
capital-gain                                 0.10477
workclass__Self-emp-not-inc                  0.09217
age                                          0.07450
education                                    0.07376
capital-loss                                 0.05575
race__White                                  0.05173
marital-status__Divorced                     0.02664
workclass__Federal-gov                       0.01444
native-region__Western_Europe                0.01438
native-region__Unknown                       0.01287
workclass__Local-gov                         0.01193
native-region__Middle_East                   0.01140
relationship__Not-in-family                  0

In [640]:
# Sort the correlated values in ascending order.
reverse_sorted = corrs.sort_values(ascending = True)
reverse_sorted

relationship__Own-child                     -0.25673
sex_selfID                                  -0.23398
marital-status__Never-married               -0.20330
workclass__Unknown                          -0.16948
marital-status__Widowed                     -0.10783
relationship__Wife                          -0.06560
race__Black                                 -0.05524
relationship__Other-relative                -0.04995
relationship__Unmarried                     -0.03776
workclass__State-gov                        -0.02433
workclass__Private                          -0.02060
fnlwgt                                      -0.01765
marital-status__Separated                   -0.01637
workclass__Without-pay                      -0.01342
native-region__Caribbean                    -0.01264
native-region__Latin_America                -0.01133
workclass__Never-worked                     -0.00941
marital-status__Married-spouse-absent       -0.00779
race__Other                                 -0

### Train and Evaluate Models.

#### Function for Feature Selection

In [641]:
# This function will be used in feature selection to get different feature sets based on the correlation values above.
def get_top_corr_features(top = 5, bottom = 5):
    combined_series = pd.concat([corrs_sorted[1:(top + 1)], reverse_sorted[0:bottom]])
    top_features = list(combined_series.index)
    return top_features
get_top_corr_features()

['relationship__Husband',
 'income_binary',
 'marital-status__Married-civ-spouse',
 'education-num',
 'workclass__Self-emp-inc',
 'relationship__Own-child',
 'sex_selfID',
 'marital-status__Never-married',
 'workclass__Unknown',
 'marital-status__Widowed']

#### Define the Features and Label

In [642]:
print(df.shape)
df.head()

(32236, 45)


Unnamed: 0,age,fnlwgt,education,education-num,sex_selfID,capital-gain,capital-loss,hours-per-week,income_binary,workclass__Federal-gov,...,race__White,native-region__Caribbean,native-region__Central_and_Eastern_Europe,native-region__Latin_America,native-region__Middle_East,native-region__North_America,native-region__South_Asia,native-region__Southeast_Asia,native-region__Unknown,native-region__Western_Europe
0,0.03447,-1.113549,1,1.141881,0,0.644877,-0.218981,40.0,0,0,...,1,0,0,0,0,1,0,0,0,0
1,0.852206,-1.055471,1,1.141881,0,-0.254219,-0.218981,13.0,0,0,...,1,0,0,0,0,1,0,0,0,0
2,-0.03987,0.270794,0,-0.427962,0,-0.254219,-0.218981,40.0,0,0,...,1,0,0,0,0,1,0,0,0,0
3,1.075225,0.461964,0,-1.212883,0,-0.254219,-0.218981,40.0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,-0.783267,1.501128,1,1.141881,1,-0.254219,-0.218981,40.0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [643]:
# X is a matrix holding all feature values
X = df.drop(columns = 'hours-per-week', inplace = False) # Includes all the features in the modified dataframe except the label
# y holds the label column
y = df['hours-per-week']
print(X.shape)
print(y.shape)

(32236, 44)
(32236,)


#### Split Training and Testing Data

In [644]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 123)

In [645]:
print(X_train.shape)
print(X_test.shape)

(22565, 44)
(9671, 44)


#### Function For Evaluating The Models on Test Data

In [646]:
# Creates a function that evalutes the training and test loss and coefficient of determination. 
def evaluate_results(predictions):
    print("Test set results")
    # The mean squared error
    test_rmse = root_mean_squared_error(y_test, predictions[1])
    print("Root mean squared error: ", test_rmse)
    # The coefficient of determination: 1 is perfect prediction
    test_r2 = r2_score(y_test, predictions[1])
    print("Coefficient of determination(r^2): ", test_r2)
    
    print()
    print("Training set results")
    train_rmse = root_mean_squared_error(y_train, predictions[0])
    print("Root mean squared error: ", train_rmse)
    train_r2 = r2_score(y_train, predictions[0])
    print("Coefficient of determination(r^2): ", train_r2)
    return (train_rmse, train_r2, test_rmse, test_r2)

#### Training Linear Regression Model

##### Linear Regressor Function

In [647]:
# Creates a Linear Regression Function that returns trains a Linear Regression Model and returns the 
# predictions on the training and testing datasets. 
def LinearRegressorFunction(X_train, y_train, X_test, y_test):
    # Define model object
    model = LinearRegression()
    # Train the model
    model.fit(X_train, y_train)
    # Test the model
    test_predictions = model.predict(X_test)
    train_predictions = model.predict(X_train)
    return (train_predictions, test_predictions)

##### Using all features in the modified dataframe

In [648]:
# Training on all of the features currently in the dataframe. 
predictions = LinearRegressorFunction(X_train, y_train, X_test, y_test)
train_rmse, train_r2, test_rmse, test_r2 = evaluate_results(predictions)

Test set results
Root mean squared error:  10.796274076598182
Coefficient of determination(r^2):  0.17467770591202825

Training set results
Root mean squared error:  10.801527884175536
Coefficient of determination(r^2):  0.18555213911871993


##### Testing different feature sets

In [649]:
# Testing different number of features
# Using the step-wise method for feature selection
# This takes the highest positively and negatively correlated features. 
params = [2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
for i in params: 
    print("Number of Features: ", i*2, "\n" )
    selected_features = get_top_corr_features(top = i, bottom = i)
    X = df[selected_features]
    y = df['hours-per-week']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 123)
    predictions = LinearRegressorFunction(X_train, y_train, X_test, y_test)
    train_rmse, train_r2, test_rmse, test_r2 = evaluate_results(predictions)
    print("\n" , "\n" )

Number of Features:  4 

Test set results
Root mean squared error:  11.100395775648607
Coefficient of determination(r^2):  0.12752558275026915

Training set results
Root mean squared error:  11.153261923167674
Coefficient of determination(r^2):  0.13164620360304713

 

Number of Features:  8 

Test set results
Root mean squared error:  10.923557856097387
Coefficient of determination(r^2):  0.15510254695497494

Training set results
Root mean squared error:  10.998121363451675
Coefficient of determination(r^2):  0.1556355883878422

 

Number of Features:  10 

Test set results
Root mean squared error:  10.861505393933916
Coefficient of determination(r^2):  0.16467434700984696

Training set results
Root mean squared error:  10.909040313572657
Coefficient of determination(r^2):  0.16925832453349587

 

Number of Features:  12 

Test set results
Root mean squared error:  10.86110085385201
Coefficient of determination(r^2):  0.16473656976862772

Training set results
Root mean squared error: 

In [650]:
# Testing different number of features
# Using the step-wise method for feature selection
# This takes the only the highest positively correlated features. 
params = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
for i in params: 
    print("Number of Features: ", i, "\n" )
    selected_features = get_top_corr_features(top = i, bottom = 0)
    X = df[selected_features]
    y = df['hours-per-week']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 123)
    predictions = LinearRegressorFunction(X_train, y_train, X_test, y_test)
    train_rmse, train_r2, test_rmse, test_r2 = evaluate_results(predictions)
    print("\n" , "\n" )

Number of Features:  4 

Test set results
Root mean squared error:  11.321068156156633
Coefficient of determination(r^2):  0.09249174459212339

Training set results
Root mean squared error:  11.381879574869519
Coefficient of determination(r^2):  0.09568262058322852

 

Number of Features:  5 

Test set results
Root mean squared error:  11.301148340913553
Coefficient of determination(r^2):  0.09568251998969646

Training set results
Root mean squared error:  11.329362426934168
Coefficient of determination(r^2):  0.10400859213081282

 

Number of Features:  6 

Test set results
Root mean squared error:  11.301121962199423
Coefficient of determination(r^2):  0.09568674163226376

Training set results
Root mean squared error:  11.326408737994367
Coefficient of determination(r^2):  0.1044757207879613

 

Number of Features:  7 

Test set results
Root mean squared error:  11.272769277962281
Coefficient of determination(r^2):  0.1002186005776664

Training set results
Root mean squared error:  1

The Linear Regression model gave the best results (highest r^2 and lowest error value on the test set) with the feature set of size 26 consisting of the 13 highest positively correlated features and the 13 highest negatively correlated features. 

##### Results

In [651]:
# Best Linear Regressor Model based on the analysis above:
selected_features = get_top_corr_features(top = 13, bottom = 13)
X = df[selected_features]
y = df['hours-per-week']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 123)
predictions = LinearRegressorFunction(X_train, y_train, X_test, y_test)
train_rmse, train_r2, test_rmse, test_r2 = evaluate_results(predictions)

Test set results
Root mean squared error:  10.789199451093351
Coefficient of determination(r^2):  0.1757589924853148

Training set results
Root mean squared error:  10.807657440614799
Coefficient of determination(r^2):  0.18462752536966665


The Linear Regression above gives the highest r^2 and lowest error value in the test without overfitting. We know that the model is not overfitting since the training performance is not signficantly better than the testing performance. This performance, however, is not very good overall. Closer the coefficient of determination is to 1 the better is the model's performance. Therefore, a r^2 value of 0.176 and 0.185 is low and the model is underfitting. This is likely because the features themselves do not correlate well with the label and/or because there might not be a strong linear relationship between the feature values and the label values. 

#### Training a Decision Tree

##### Decision Tree Regressor Function

In [652]:
# Creates a Decision Tree Regressor that returns the RMSE and R^2 value for the data. 
def DecisionTreeRegressorFunction(X_train, y_train, X_test, y_test, depth = None, leaf = 1):
    # Define model object
    # max_depth=None, min_samples_leaf=1 Default values
    model = DecisionTreeRegressor(max_depth = depth, min_samples_leaf = leaf)
    # Train the model
    model.fit(X_train, y_train)
    # Test the model
    test_predictions = model.predict(X_test)
    train_predictions = model.predict(X_train)
    return (train_predictions, test_predictions)

##### Using all features in the modified dataframe

In [653]:
# Training on all of the features currently in the dataframe. 
# Default parameters
X = df.drop(columns = 'hours-per-week', inplace = False) # All features in the dataframe except the label
y = df['hours-per-week']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 123)

predictions = DecisionTreeRegressorFunction(X_train, y_train, X_test, y_test)
train_rmse, train_r2, test_rmse, test_r2 = evaluate_results(predictions)


Test set results
Root mean squared error:  14.727868748674824
Coefficient of determination(r^2):  -0.535873986891904

Training set results
Root mean squared error:  0.7243356178618762
Coefficient of determination(r^2):  0.996337542552922


This model uses all the features currently in the modified dataset. The performance on the training set is pretty good for this model with a high r^2 value and low error value. However, the performance on the testing data is not good at all. This clearly shows that the model is overfitting.

##### Testing on different feature sets

In [654]:
# Testing different number of features
# Using the step-wise method for feature selection
# This takes the highest positively and negatively correlated features. 
params = [2, 4, 5, 6, 7, 8, 9]
for i in params: 
    print("Number of Features: ", i*2, "\n" )
    selected_features = get_top_corr_features(top = i, bottom = i)
    X = df[selected_features]
    y = df['hours-per-week']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 123)
    predictions = DecisionTreeRegressorFunction(X_train, y_train, X_test, y_test)
    train_rmse, train_r2, test_rmse, test_r2 = evaluate_results(predictions)
    print("\n" , "\n" )

Number of Features:  4 

Test set results
Root mean squared error:  11.091591279401356
Coefficient of determination(r^2):  0.128909074193856

Training set results
Root mean squared error:  11.144337613716152
Coefficient of determination(r^2):  0.1330352784175819

 

Number of Features:  8 

Test set results
Root mean squared error:  10.834871820204254
Coefficient of determination(r^2):  0.16876594015121504

Training set results
Root mean squared error:  10.63300386421654
Coefficient of determination(r^2):  0.2107677047986426

 

Number of Features:  10 

Test set results
Root mean squared error:  10.886967922828392
Coefficient of determination(r^2):  0.16075326355857555

Training set results
Root mean squared error:  10.488827403682182
Coefficient of determination(r^2):  0.23202553008123317

 

Number of Features:  12 

Test set results
Root mean squared error:  11.032884562555827
Coefficient of determination(r^2):  0.1381058702991731

Training set results
Root mean squared error:  10.

In [655]:
# Testing different number of features
# Using the step-wise method for feature selection
# This takes the only the highest positively correlated features. 
params = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
for i in params: 
    print("Number of Features: ", i, "\n" )
    selected_features = get_top_corr_features(top = i, bottom = 0)
    X = df[selected_features]
    y = df['hours-per-week']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 123)
    predictions = DecisionTreeRegressorFunction(X_train, y_train, X_test, y_test)
    train_rmse, train_r2, test_rmse, test_r2 = evaluate_results(predictions)
    print("\n" , "\n" )

Number of Features:  4 

Test set results
Root mean squared error:  11.218996303790698
Coefficient of determination(r^2):  0.1087823387515573

Training set results
Root mean squared error:  11.201268167013529
Coefficient of determination(r^2):  0.12415492107224257

 

Number of Features:  5 

Test set results
Root mean squared error:  11.228217924991155
Coefficient of determination(r^2):  0.10731663734579999

Training set results
Root mean squared error:  11.128995509756807
Coefficient of determination(r^2):  0.13542068843369526

 

Number of Features:  6 

Test set results
Root mean squared error:  11.34417795864769
Coefficient of determination(r^2):  0.08878295284819326

Training set results
Root mean squared error:  10.924641277089208
Coefficient of determination(r^2):  0.1668805465662111

 

Number of Features:  7 

Test set results
Root mean squared error:  11.42878105347622
Coefficient of determination(r^2):  0.07514084467172744

Training set results
Root mean squared error:  10.

Based on the results above, the combination of the 4 most positively correlated features and the 4 most negatively correlated features gives the results with the best test score (high r^2 and low error) without overfitting. It is however, underfitting which is probably because none of the features correlate well with the label. 

In [656]:
# Use top 8 features that gave the best results above. 
selected_features = get_top_corr_features(top = 4, bottom = 4)
X = df[selected_features]
y = df['hours-per-week']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 123)

##### Testing hyperparameter values using GridSearch

In [657]:
# Using grid search to test different decision tree hyperparameters. 
# Uses 3-fold cross validation to select the best parameters.

# Parameter grid
param_grid = {'max_depth' : [4, 8, 16, 32], 'min_samples_leaf' : [4, 8, 16, 32]}
print('Running Grid Search...')
# Create a DecisionTreeRegressor model
model = DecisionTreeRegressor()
# Running a Grid Search with 3-fold cross-validation
grid = GridSearchCV(model, param_grid, cv = 3, scoring = 'neg_root_mean_squared_error')
# Fit the model
grid_search = grid.fit(X_train, y_train)

print('Done')

Running Grid Search...
Done


In [658]:
# neg_root_mean_squared_error gives a negative value
rmse = -1 * grid_search.best_score_
print("[DT] RMSE for the best model is : {:.2f}".format(rmse) )

[DT] RMSE for the best model is : 10.82


In [659]:
best_params = grid_search.best_params_
best_params

{'max_depth': 8, 'min_samples_leaf': 16}

The best decision tree model, from the versions tested, is the one with 'max_depth' value of 8 and 'min_samples_leaf' value of 16

##### Results

In [660]:
# Results of the best model version on the test data
predictions = DecisionTreeRegressorFunction(X_train, y_train, X_test, y_test, depth = best_params['max_depth'], leaf = best_params['min_samples_leaf'])
train_rmse, train_r2, test_rmse, test_r2 = evaluate_results(predictions)

Test set results
Root mean squared error:  10.766914200365393
Coefficient of determination(r^2):  0.17916043991932495

Training set results
Root mean squared error:  10.729040737897659
Coefficient of determination(r^2):  0.19644669189183572


The above decision tree with 'max_depth': 8 and 'min_samples_leaf': 16 and the feature set of size 8 (4 positively correlated and 4 negatively correlated) led to the highest r^2 value and lowest error compared to the other decision trees. 
These results are better than the best linear regression model determined above. 
However, the model is still undefitting overall again because none of the features are highly correlated to the label. None of them have a correlation value (to the label) higher than 0.26 or more negative than -0.26.

#### Training KNN Models

##### KNN Regressor Function 

In [661]:
def KNNRegressorFunction(X_train, y_train, X_test, y_test, n = 5):
    # Define model object
    # max_depth=None, min_samples_leaf=1 Default values
    model = KNeighborsRegressor(n_neighbors = n)
    # Train the model
    model.fit(X_train, y_train)
    # Test the model
    test_predictions = model.predict(X_test)
    train_predictions = model.predict(X_train)
    return (train_predictions, test_predictions)

##### Training on all the features in the modified dataframe

In [662]:
# Training on all of the features currently in the dataframe.
X = df.drop(columns = 'hours-per-week', inplace = False) # All features in the dataframe except the label
y = df['hours-per-week']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 123)

predictions = LinearRegressorFunction(X_train, y_train, X_test, y_test)
train_rmse, train_r2, test_rmse, test_r2 = evaluate_results(predictions)

Test set results
Root mean squared error:  10.796274076598182
Coefficient of determination(r^2):  0.17467770591202825

Training set results
Root mean squared error:  10.801527884175536
Coefficient of determination(r^2):  0.18555213911871993


##### Testing on different feature sets

In [663]:
# Testing different number of features
# Using the step-wise method for feature selection
# This takes the highest positively and negatively correlated features. 
params = [2, 4, 5, 6, 7, 8, 9, 10, 11]
for i in params: 
    print("Number of Features: ", i*2, "\n" )
    selected_features = get_top_corr_features(top = i, bottom = i)
    X = df[selected_features]
    y = df['hours-per-week']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 123)
    predictions = KNNRegressorFunction(X_train, y_train, X_test, y_test)
    train_rmse, train_r2, test_rmse, test_r2 = evaluate_results(predictions)
    print("\n" , "\n" )

Number of Features:  4 

Test set results
Root mean squared error:  12.094930919388641
Coefficient of determination(r^2):  -0.03581586572744211

Training set results
Root mean squared error:  12.057555525356873
Coefficient of determination(r^2):  -0.014872409425789934

 

Number of Features:  8 

Test set results
Root mean squared error:  11.731717175302794
Coefficient of determination(r^2):  0.02546162867979984

Training set results
Root mean squared error:  11.608128374580273
Coefficient of determination(r^2):  0.05937328431868705

 

Number of Features:  10 

Test set results
Root mean squared error:  11.720511853046547
Coefficient of determination(r^2):  0.02732236250377118

Training set results
Root mean squared error:  11.542187898513872
Coefficient of determination(r^2):  0.07002947324539788

 

Number of Features:  12 

Test set results
Root mean squared error:  11.799041828361235
Coefficient of determination(r^2):  0.014244393265900435

Training set results
Root mean squared e

In [664]:
# Testing different number of features
# Using the step-wise method for feature selection
# This takes the only the highest positively correlated features. 
params = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
for i in params: 
    print("Number of Features: ", i, "\n" )
    selected_features = get_top_corr_features(top = i, bottom = 0)
    X = df[selected_features]
    y = df['hours-per-week']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 123)
    predictions = KNNRegressorFunction(X_train, y_train, X_test, y_test)
    train_rmse, train_r2, test_rmse, test_r2 = evaluate_results(predictions)
    print("\n" , "\n" )

Number of Features:  4 

Test set results
Root mean squared error:  11.995290496944639
Coefficient of determination(r^2):  -0.01881965424827614

Training set results
Root mean squared error:  12.026258146193774
Coefficient of determination(r^2):  -0.009610708797833656

 

Number of Features:  5 

Test set results
Root mean squared error:  12.01710547982868
Coefficient of determination(r^2):  -0.022528733794866085

Training set results
Root mean squared error:  12.059542653965902
Coefficient of determination(r^2):  -0.015206946251811404

 

Number of Features:  6 

Test set results
Root mean squared error:  11.836894838287492
Coefficient of determination(r^2):  0.00790935838176332

Training set results
Root mean squared error:  11.659948321467612
Coefficient of determination(r^2):  0.050956419445783

 

Number of Features:  7 

Test set results
Root mean squared error:  11.961658320467853
Coefficient of determination(r^2):  -0.013114567496750551

Training set results
Root mean squared e

The KNN Model performed best with 18 features including 9 highest positively correlated and 9 highest negatively correlated features. We will use this combination to do empirical testing of hyperparameters to find the best model performance results. 

In [665]:
selected_features = get_top_corr_features(top = 9, bottom = 9)
X = df[selected_features]
y = df['hours-per-week']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 123)

In [666]:
n_values = [3, 5, 10, 50, 100, 150, 200]
for i in n_values:
    print("Number of neighbours : ", i)
    predictions = KNNRegressorFunction(X_train, y_train, X_test, y_test, n = i)
    train_rmse, train_r2, test_rmse, test_r2 = evaluate_results(predictions)
    print("\n\n")

Number of neighbours :  3
Test set results
Root mean squared error:  11.76650683198496
Coefficient of determination(r^2):  0.019673196396374104

Training set results
Root mean squared error:  9.463007149994704
Coefficient of determination(r^2):  0.37489749219295254



Number of neighbours :  5
Test set results
Root mean squared error:  11.141784089810765
Coefficient of determination(r^2):  0.1210073350535007

Training set results
Root mean squared error:  9.565258986640034
Coefficient of determination(r^2):  0.3613155074180807



Number of neighbours :  10
Test set results
Root mean squared error:  10.694993934259697
Coefficient of determination(r^2):  0.1900898165367031

Training set results
Root mean squared error:  9.825537661574588
Coefficient of determination(r^2):  0.3260843311087268



Number of neighbours :  50
Test set results
Root mean squared error:  10.429963275398961
Coefficient of determination(r^2):  0.22973292754561403

Training set results
Root mean squared error:  10.

The best value for n based on the above results is 50 because it has the highest r^2 value and lowest mean squared error value out of all the tested values. It isn't overfitting since the training and testing values are similar. However, the mode is underfitting because neither the training or testing r^2 values are very good. Again, this is because none of the features highly correlate with the label. 

##### Results

In [667]:
# Best model KNN model for this dataset based on the analysis above:
predictions = KNNRegressorFunction(X_train, y_train, X_test, y_test, n = 50)
train_rmse, train_r2, test_rmse, test_r2 = evaluate_results(predictions)

Test set results
Root mean squared error:  10.429963275398961
Coefficient of determination(r^2):  0.22973292754561403

Training set results
Root mean squared error:  10.224447857714626
Coefficient of determination(r^2):  0.27025246998329366


### Results

#### Linear Regression Model

In [651]:
# Best Linear Regressor Model based on the analysis above:
selected_features = get_top_corr_features(top = 13, bottom = 13)
X = df[selected_features]
y = df['hours-per-week']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 123)
predictions = LinearRegressorFunction(X_train, y_train, X_test, y_test)
train_rmse, train_r2, test_rmse, test_r2 = evaluate_results(predictions)

Test set results
Root mean squared error:  10.789199451093351
Coefficient of determination(r^2):  0.1757589924853148

Training set results
Root mean squared error:  10.807657440614799
Coefficient of determination(r^2):  0.18462752536966665


The Linear Regression model above, with a feature set of size 26 consisting of the 13 highest positively correlated features and the 13 highest negatively correlated features, gives the highest r^2 and lowest error value in the test without overfitting. We know that the model is not overfitting since the training performance is not signficantly better than the testing performance. This performance, however, is not very good overall. Closer the coefficient of determination is to 1 the better is the model's performance. Therefore, a r^2 value of 0.176 and 0.185 is low and the model is underfitting. This is likely because the features themselves do not correlate well with the label and/or because there might not be a strong linear relationship between the feature values and the label values. 

#### Decision Tree Regression Model

In [656]:
# Use top 8 features that gave the best results. 
selected_features = get_top_corr_features(top = 4, bottom = 4)
X = df[selected_features]
y = df['hours-per-week']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 123)

In [660]:
# Results of the best model version on the test data
predictions = DecisionTreeRegressorFunction(X_train, y_train, X_test, y_test, depth = best_params['max_depth'], leaf = best_params['min_samples_leaf'])
train_rmse, train_r2, test_rmse, test_r2 = evaluate_results(predictions)

Test set results
Root mean squared error:  10.766914200365393
Coefficient of determination(r^2):  0.17916043991932495

Training set results
Root mean squared error:  10.729040737897659
Coefficient of determination(r^2):  0.19644669189183572


The above decision tree with 'max_depth': 8 and 'min_samples_leaf': 16 and the feature set of size 8 (4 positively correlated and 4 negatively correlated) led to the highest r^2 value and lowest error compared to the other decision trees. 
These results are better than the best linear regression model determined above. 
However, the model is still undefitting overall again because none of the features are highly correlated to the label. None of them have a correlation value (to the label) higher than 0.26 or more negative than -0.26.

#### KNN Model

In [665]:
selected_features = get_top_corr_features(top = 9, bottom = 9)
X = df[selected_features]
y = df['hours-per-week']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 123)

In [667]:
# Best model KNN model for this dataset based on the analysis above:
predictions = KNNRegressorFunction(X_train, y_train, X_test, y_test, n = 50)
train_rmse, train_r2, test_rmse, test_r2 = evaluate_results(predictions)

Test set results
Root mean squared error:  10.429963275398961
Coefficient of determination(r^2):  0.22973292754561403

Training set results
Root mean squared error:  10.224447857714626
Coefficient of determination(r^2):  0.27025246998329366


The KNN Model with an n_neighbors parameter value of 50 and a feature set with 18 features including 9 highest positively correlated and 9 highest negatively correlated features had the best out-of-sample performance scores with the highest r^2 value and lowest mean squared error value out of all the other tested version. It isn't overfitting since the training and testing values are similar. However, the mode is underfitting because neither the training or testing r^2 values are very good. Again, this is because none of the features highly correlate with the label. 

#### Best Model

The best performing model was the KNN model with a n_neighbors parameter value of 50 and a feature set with 18 features including 9 highest positively correlated and 9 highest negatively correlated features.<br>

In [665]:
selected_features = get_top_corr_features(top = 9, bottom = 9)
X = df[selected_features]
y = df['hours-per-week']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 123)

In [667]:
# Best model KNN model for this dataset based on the analysis above:
predictions = KNNRegressorFunction(X_train, y_train, X_test, y_test, n = 50)
train_rmse, train_r2, test_rmse, test_r2 = evaluate_results(predictions)

Test set results
Root mean squared error:  10.429963275398961
Coefficient of determination(r^2):  0.22973292754561403

Training set results
Root mean squared error:  10.224447857714626
Coefficient of determination(r^2):  0.27025246998329366


Although, this model performed the best out of all the models trained and tested in the model selection process, the coeeficient of determination(r^2) value is not at all close to 1 and the root mean squared error is also too high. The model is underfitting.

#### How can I improve this?

- To improve the results above, the first thing to do would be to collect more relevant data that can help predict the hours per week of work for an individual more accurately. Having features that better correlate to the label or influence it would be useful.
- With enough data, I could try using ensemble models to better capture the different relationships and reduce the model estimation error caused by bias and variance.
- I could also try splitting the data into a training, testing, and validation set and use the validation set for model selection to prevent fitting to closely to the test data.
- The way I grouped the 'native-country' and 'education' columns would have affected the results too. I could test different number of groups to see if it would improve feature correlation to the label and the model's overall out-of-sample performance.  

### Changes I made

- Originally I had chosen 'education' as my label but then realized that all the other data columns would be logically influenced by the amount of education and individual would have. I would have had to remove a lot of the features to prevent feature leakage.
- I also tested the model without grouping any of the features and also keeping the occupation column as it is. Training the model on all the features as they were lead to overfitting for the deicion tree model. It would also not make sense to leave the columns that way since, in the real world, if the column was truly open-ended text values, there could be many other values assigned to the feature that would not be captured by this model.
- I also tried using the feature importance values of the decision tree to get the top 5 most important features but it didn't give good results. I new I needed more features in the feature set. So, instead of trying to adjust the number of features one-by-one, I decided to try a range of different feature set sizes iteratively, based on the correlation data, to get the best performing models.