# Lab 8: Implement Your Machine Learning Project Plan

In this lab assignment, you will implement the machine learning project plan you created in the written assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Perform exploratory data analysis on your data to determine which feature engineering and data preparation techniques you will use.
3. Prepare your data for your model and create features and a label.
4. Fit your model to the training data and evaluate your model.
5. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.

### Import Packages

Before you get started, import a few packages.

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need for this task.

In [2]:
# YOUR CODE HERE

## Part 1: Load the Data Set


You have chosen to work with one of four data sets. The data sets are located in a folder named "data." The file names of the four data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultData.csv`
* The Airbnb NYC "listings" data set is located in file `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`
* The book review data set is located in file `bookReviewsData.csv`



<b>Task:</b> In the code cell below, load your data with `pd.read_csv()`, as you have done throughout this course, and save it to DataFrame `df`.

In [3]:
filename = os.path.join(os.getcwd(), "data", "adultData.csv")
df = pd.read_csv(filename)
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income_binary
0,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Non-Female,2174,0,40.0,United-States,<=50K
1,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,13.0,United-States,<=50K
2,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Non-Female,0,0,40.0,United-States,<=50K
3,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Non-Female,0,0,40.0,United-States,<=50K
4,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40.0,Cuba,<=50K


## Part 2: Exploratory Data Analysis

The next step is to inspect and analyze your data set with your machine learning problem and project plan in mind. 

This step will help you determine data preparation and feature engineering techniques you will need to apply to your data to build a balanced modeling data set for your problem and model. These data preparation techniques may include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI
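
Of the techniques listed above, winsorization is not demonstrated later in this notebook. A minimal sketch follows, assuming quantile-based clipping as the winsorization method. The `toy` DataFrame, its values, and the 10th/90th percentile cutoffs are made up for illustration and are not taken from the lab data.

```python
import pandas as pd

# Hypothetical numeric column with two extreme values (10.0 and 99.0).
toy = pd.DataFrame(
    {"hours": [10.0, 35.0, 38.0, 40.0, 40.0, 40.0, 42.0, 45.0, 50.0, 99.0]}
)

# Assumed cutoffs: the 10th and 90th percentiles of the column.
lower = toy["hours"].quantile(0.10)
upper = toy["hours"].quantile(0.90)

# Values below `lower` are raised to `lower`; values above `upper` are lowered to `upper`.
toy["hours_win"] = toy["hours"].clip(lower=lower, upper=upper)

print(toy)
```

Clipping keeps every row, so the sample size is unchanged; only the most extreme values are pulled in toward the rest of the distribution.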


Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
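
As a rough illustration of the inspection steps described above, the sketch below summarizes a numeric column with `describe()` and checks for class imbalance with `value_counts(normalize=True)`. The `toy` DataFrame and its values are hypothetical stand-ins, not rows from any of the lab data sets.

```python
import pandas as pd

# Hypothetical sample with one numeric feature and a binary label.
toy = pd.DataFrame(
    {
        "age": [39.0, 50.0, 38.0, 53.0, 28.0, 37.0],
        "income_binary": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
    }
)

# Key statistics (count, mean, std, quartiles) for the numeric columns.
summary = toy.describe()

# Share of examples in each class; a lopsided split indicates class imbalance.
class_share = toy["income_binary"].value_counts(normalize=True)

print(summary)
print(class_share)
```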


<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. 

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [4]:
df.shape
df.dtypes

age               float64
workclass          object
fnlwgt              int64
education          object
education-num       int64
marital-status     object
occupation         object
relationship       object
race               object
sex_selfID         object
capital-gain        int64
capital-loss        int64
hours-per-week    float64
native-country     object
income_binary      object
dtype: object

In [5]:
# Collect the categorical (object-typed) columns, which are candidates for one-hot encoding.
to_encode = list(df.select_dtypes(include=['object']).columns)

# Drop four categorical columns instead of encoding them.
# 'education' is redundant because 'education-num' already represents it numerically.
cols_to_drop = ['native-country', 'occupation', 'education', 'relationship']
df = df.drop(columns=cols_to_drop)
to_encode = [col for col in to_encode if col not in cols_to_drop]

In [6]:
df[to_encode].nunique()

workclass         8
marital-status    7
race              5
sex_selfID        2
income_binary     2
dtype: int64

In [7]:

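# Collapse the workclass values into broader groups.
# Note: every value not listed below, including 'Private', is assigned to 'Unknown'.
df['workclass'] = ['Government' if value in ['State-gov', 'Federal-gov', 'Local-gov']
                   else 'Self-Employed' if value in ['Self-emp-not-inc', 'Self-emp-inc']
                   else 'Unemployed' if value in ['Never-worked', 'Without-pay']
                   else 'Unknown'
                   for value in df['workclass']]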

print(df['workclass'].unique())

['Government' 'Self-Employed' 'Unknown' 'Unemployed']


In [8]:
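# Collapse marital status into two groups; every value not listed here becomes 'Single'.
married = ['Married-civ-spouse', 'Married-spouse-absent', 'Married-AF-spouse']
df['marital-status'] = ['Married' if value in married else 'Single' for value in df['marital-status']]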
print(df['marital-status'].unique())

['Single' 'Married']


In [9]:
print(df['race'].unique())
print(df['sex_selfID'].unique())
print(df['income_binary'].unique())
to_encode.remove('income_binary')

['White' 'Black' 'Asian-Pac-Islander' 'Amer-Indian-Inuit' 'Other']
['Non-Female' 'Female']
['<=50K' '>50K']


In [10]:
df.isnull().sum()

age               162
workclass           0
fnlwgt              0
education-num       0
marital-status      0
race                0
sex_selfID          0
capital-gain        0
capital-loss        0
hours-per-week    325
income_binary       0
dtype: int64

In [11]:
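# Replace missing values with the column mean. Assigning the result back avoids
# calling fillna(inplace=True) on a column selection, which recent pandas versions warn about.
mean_age = df['age'].mean()
df['age'] = df['age'].fillna(mean_age)

mean_hours_per_week = df['hours-per-week'].mean()
df['hours-per-week'] = df['hours-per-week'].fillna(mean_hours_per_week)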

In [12]:
df.isnull().sum()
to_encode

['workclass', 'marital-status', 'race', 'sex_selfID']

In [13]:
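from sklearn.preprocessing import OneHotEncoder

# Create the encoder. In scikit-learn releases before 1.2, the `sparse_output`
# argument is named `sparse`.
encoder = OneHotEncoder(handle_unknown='error', sparse_output=False)

# Apply the encoder:
df_enc = pd.DataFrame(encoder.fit_transform(df[to_encode]))

# Reinstate the original column names. In scikit-learn releases before 1.0,
# use `get_feature_names` instead.
df_enc.columns = encoder.get_feature_names_out(to_encode)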

In [19]:
df_enc.head()

Unnamed: 0,workclass_Government,workclass_Self-Employed,workclass_Unemployed,workclass_Unknown,marital-status_Married,marital-status_Single,race_Amer-Indian-Inuit,race_Asian-Pac-Islander,race_Black,race_Other,race_White,sex_selfID_Female,sex_selfID_Non-Female
0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
2,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
3,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
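
The `df.head()` output in Part 3 shows the encoded columns alongside the numeric ones, but the cell that combines them does not appear in this notebook. One way the step could look is sketched below, assuming both DataFrames share the same default integer index. The names `toy_df`, `toy_enc`, and `toy_to_encode` and all of their values are hypothetical stand-ins for `df`, `df_enc`, and `to_encode`.

```python
import pandas as pd

# Hypothetical stand-in for `df`: one numeric and one categorical column.
toy_df = pd.DataFrame(
    {
        "age": [39.0, 50.0, 28.0],
        "sex_selfID": ["Non-Female", "Non-Female", "Female"],
    }
)
toy_to_encode = ["sex_selfID"]

# Hypothetical stand-in for `df_enc`: the one-hot columns for the same three rows.
toy_enc = pd.DataFrame(
    {
        "sex_selfID_Female": [0.0, 0.0, 1.0],
        "sex_selfID_Non-Female": [1.0, 1.0, 0.0],
    }
)

# Drop the original categorical columns, then attach the encoded ones by index.
toy_joined = toy_df.drop(columns=toy_to_encode).join(toy_enc)

print(toy_joined)
```

Because `join` aligns rows on the index, this approach only works as intended when neither DataFrame has been re-indexed or filtered since the encoder was applied.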


## Part 3: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. You will:

1. Prepare your data for your model and create features and a label.
2. Fit your model to the training data and evaluate your model.
3. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.


Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.
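
The three steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`), the model is a decision tree classifier, and the hyperparameter grid values are arbitrary choices, none of which come from the project plan itself.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and binary label.
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)

# 1. Create the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

# 2. Fit a baseline model and evaluate it on the test set.
baseline = DecisionTreeClassifier(max_depth=4, random_state=0)
baseline.fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

# 3. Model selection: cross-validated grid search over assumed hyperparameter values.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 10, 25]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_tr, y_tr)
tuned_acc = accuracy_score(y_te, search.best_estimator_.predict(X_te))

print("Baseline accuracy:", baseline_acc)
print("Best parameters:", search.best_params_)
print("Tuned accuracy:", tuned_acc)
```

The grid search uses only the training set for cross-validation, so the test set remains untouched until the final comparison.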

In [15]:
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import OneHotEncoder 
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [22]:
df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,income_binary,workclass_Government,workclass_Self-Employed,workclass_Unemployed,workclass_Unknown,marital-status_Married,marital-status_Single,race_Amer-Indian-Inuit,race_Asian-Pac-Islander,race_Black,race_Other,race_White,sex_selfID_Female,sex_selfID_Non-Female
0,39.0,77516,13,2174,0,40.0,<=50K,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,50.0,83311,13,0,0,13.0,<=50K,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
2,38.0,215646,9,0,0,40.0,<=50K,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
3,53.0,234721,7,0,0,40.0,<=50K,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,28.0,338409,13,0,0,40.0,<=50K,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [33]:
df['income_binary_encoded'] = df['income_binary'].apply(lambda x: 0 if x == "<=50K" else 1)
df = df.drop(columns='income_binary')

In [34]:
df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_Government,workclass_Self-Employed,workclass_Unemployed,workclass_Unknown,marital-status_Married,marital-status_Single,race_Amer-Indian-Inuit,race_Asian-Pac-Islander,race_Black,race_Other,race_White,sex_selfID_Female,sex_selfID_Non-Female,income_binary_encoded
0,39.0,77516,13,2174,0,40.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0
1,50.0,83311,13,0,0,13.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0
2,38.0,215646,9,0,0,40.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0
3,53.0,234721,7,0,0,40.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0
4,28.0,338409,13,0,0,40.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0


In [35]:
y = df['income_binary_encoded']
X = df.drop(columns = 'income_binary_encoded')

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state = 1234)

In [44]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Note: the label is binary (0/1), so the regressor's predictions are scores
# between 0 and 1 rather than class labels.

# 1. Create the DecisionTreeRegressor model object with max_depth=4 and
# min_samples_leaf=10 and assign it to variable 'dt_model'
dt_model = DecisionTreeRegressor(max_depth=4, min_samples_leaf=10)

# 2. Fit the model to the training data
dt_model.fit(X_train, y_train)

# 3. Call predict() to use the fitted model to make predictions on the test data.
# Save the results to variable 'y_dt_pred'
y_dt_pred = dt_model.predict(X_test)

# 4. Compute the RMSE and R2 (on y_test and y_dt_pred) and save the results to dt_rmse and dt_r2.
# np.sqrt is used instead of squared=False, which newer scikit-learn releases no longer accept.
dt_rmse = np.sqrt(mean_squared_error(y_test, y_dt_pred))
dt_r2 = r2_score(y_test, y_dt_pred)


print('[DT] Root Mean Squared Error: {0}'.format(dt_rmse))
print('[DT] R2: {0}'.format(dt_r2))


[DT] Root Mean Squared Error: 0.3380225411215275
[DT] R2: 0.36525213567581705
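
Because the label is binary, the regressor's outputs can also be read as scores and converted to class predictions, which allows the model to be judged with `accuracy_score` (imported earlier in this notebook). The sketch below shows the idea on synthetic data. The 0.5 threshold, the generated data set, and all variable names are assumptions for illustration; no result from the lab data is implied.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared features and 0/1 label.
X_toy, y_toy = make_classification(n_samples=400, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1
)

# Same hyperparameters as the regressor above, fitted on the toy data.
reg = DecisionTreeRegressor(max_depth=4, min_samples_leaf=10, random_state=1)
reg.fit(X_tr, y_tr)

# Each prediction is the mean of the 0/1 labels in a leaf, so it lies in [0, 1].
scores = reg.predict(X_te)

# Assumed decision rule: predict class 1 when the score is at least 0.5.
labels = (scores >= 0.5).astype(int)
acc = accuracy_score(y_te, labels)

print("Accuracy at threshold 0.5:", acc)
```

Fitting a `DecisionTreeClassifier` directly is the more conventional choice for a binary label; thresholding is shown here only because it reuses the regressor already trained above.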
