# Lab 8: Define and Solve an ML Problem of Your Choosing

In [262]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate a project plan. You will then implement the machine learning project plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [263]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(adultDataSet_filename)

df.head(50)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income_binary
0,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Non-Female,2174,0,40.0,United-States,<=50K
1,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,13.0,United-States,<=50K
2,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Non-Female,0,0,40.0,United-States,<=50K
3,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Non-Female,0,0,40.0,United-States,<=50K
4,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40.0,Cuba,<=50K
5,37.0,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40.0,United-States,<=50K
6,49.0,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16.0,Jamaica,<=50K
7,52.0,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,45.0,United-States,>50K
8,31.0,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50.0,United-States,>50K
9,42.0,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,5178,0,40.0,United-States,>50K


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

I have chosen the Census data set for this project. The goal is to predict whether an individual has had twelve years or less of schooling. This is a supervised learning problem, specifically a binary classification task: the label has two categories and is known during training. For features, I currently plan to use most of the data set's attributes except the `fnlwgt` column, which is a sampling weight and not relevant to predicting education level. As I explore the data, I will remove any features that do not look useful for the model.
This problem is important because education level is a key indicator that influences income, employment opportunities, and access to social programs. Accurately predicting education levels can help organizations allocate resources more effectively, design educational initiatives, and improve social services to support communities.


## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) you would like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
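As a minimal sketch of the inspection techniques listed above, the snippet below applies them to a small, made-up DataFrame. The variable `toy` and its values are illustrative assumptions, not the lab data; only the column names are borrowed from the census file.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a real data set (all values are made up).
toy = pd.DataFrame({
    "hours-per-week": [40.0, 13.0, np.nan, 45.0, 50.0, 40.0],
    "workclass": ["Private", "State-gov", None, "Private", "Private", "Local-gov"],
    "label": [0, 0, 0, 1, 1, 0],
})

summary = toy.describe()                     # key statistics for the numeric columns
column_types = toy.dtypes                    # data type of each column
missing_per_column = toy.isnull().sum()      # number of missing values per column
class_balance = toy["label"].value_counts(normalize=True)  # share of each class

print(summary)
print(column_types)
print(missing_per_column)
print(class_balance)
```

For a classification problem, a `class_balance` that is far from an even split is a hint that accuracy alone may be a misleading metric.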

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

<b>Step 1</b>: First, I will check every feature to see which ones have many missing values. I will then handle each feature in the way that best fits the problem, either by removing it or by addressing missingness with one of the methods learned in this course.

In [264]:
nan_count = np.sum(df.isnull(), axis = 0)
nan_count

age                162
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex_selfID           0
capital-gain         0
capital-loss         0
hours-per-week     325
native-country     583
income_binary        0
dtype: int64

In [265]:
print(df['age'].dtype)
print(df['workclass'].dtype)
print(df['occupation'].dtype)
print(df['hours-per-week'].dtype)
print(df['native-country'].dtype)

float64
object
object
float64
object


In [266]:
# Drop the fnlwgt column: it is a sampling weight and is unlikely to help with this problem
df.drop(columns=['fnlwgt'], inplace=True)

# For object-type columns with many missing values, replace NaN with 'unavailable'.
# Assigning the result back avoids the chained-assignment pitfalls of fillna(inplace=True).
df['workclass'] = df['workclass'].fillna('unavailable')
df['occupation'] = df['occupation'].fillna('unavailable')
df['native-country'] = df['native-country'].fillna('unavailable')

# For float-type columns, replace NaN values with the column median
df['age'] = df['age'].fillna(df['age'].median())
df['hours-per-week'] = df['hours-per-week'].fillna(df['hours-per-week'].median())


In [267]:
nan_count_1 = np.sum(df.isnull(), axis = 0)
nan_count_1

age               0
workclass         0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex_selfID        0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income_binary     0
dtype: int64

In [268]:
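# Restrict to numeric columns explicitly; newer pandas versions raise on object columns
corr_matrix = df.select_dtypes(include='number').corr().round(5)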
corrs = corr_matrix['education-num']
corrs

age               0.03670
education-num     1.00000
capital-gain      0.16709
capital-loss      0.07992
hours-per-week    0.14657
Name: education-num, dtype: float64

- The correlations between `education-num` and the other numerical features are all weak (at most about 0.17). Because `age` (0.037) and `capital-loss` (0.080) have the weakest correlations, I will remove them from the feature set.
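One caveat worth keeping in mind: the values above are Pearson coefficients, which only measure linear association, so a low value does not by itself prove a feature is useless to a tree-based model. The sketch below illustrates this on synthetic data (the variables `x` and `y` are made up for illustration and have nothing to do with the census columns).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic example: the label depends on |x|, a purely non-linear relationship.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=2000)
y = (np.abs(x) > 1).astype(int)

# Pearson correlation between the feature and the label is close to zero...
pearson_r = np.corrcoef(x, y)[0, 1]

# ...yet a shallow decision tree can still separate the classes using x alone.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_accuracy = cross_val_score(
    tree, x.reshape(-1, 1), y, cv=5, scoring="accuracy"
).mean()

print(f"Pearson r: {pearson_r:.3f}")
print(f"Cross-validated tree accuracy: {tree_accuracy:.3f}")
```

A complementary check, under the same assumptions, would be to keep the feature for one run and compare validation accuracy with and without it.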


In [269]:
df.drop(columns = ['age'], inplace=True)
df.drop(columns = ['capital-loss'], inplace=True)
df.drop(columns = ['education'], inplace=True)


- Next, I will look at the remaining columns to see how to transform the string values into something the model can use.

In [270]:
df.dtypes

workclass          object
education-num       int64
marital-status     object
occupation         object
relationship       object
race               object
sex_selfID         object
capital-gain        int64
hours-per-week    float64
native-country     object
income_binary      object
dtype: object

In [271]:
to_encode = list(df.select_dtypes(include=['object']).columns)
df[to_encode].nunique()

workclass          9
marital-status     7
occupation        15
relationship       6
race               5
sex_selfID         2
native-country    42
income_binary      2
dtype: int64

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

Features I Chose to Keep (and Remove)

After exploring the dataset and dealing with missing values, I made a few decisions to simplify and improve my model:

- I removed **`fnlwgt`** because it's more of a sampling weight and doesn’t add much to the prediction.
- I also dropped **`age`** and **`capital-loss`** since they had very weak correlations with the label I’m predicting.
- I will also drop **`education`**, because it is a direct indicator of the answer: each degree corresponds to a fixed number of years of schooling, so keeping it would leak the label.
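The leakage concern in the last bullet can be checked directly. The sketch below uses only the (`education`, `education-num`) pairs visible in the `df.head()` output earlier in this notebook; the frame `sample` is a hand-copied stand-in, and running the same check on the full `df` is left as an assumption.

```python
import pandas as pd

# (education, education-num) pairs copied from the ten rows shown by df.head() above.
sample = pd.DataFrame({
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors",
                  "Masters", "9th", "HS-grad", "Masters", "Bachelors"],
    "education-num": [13, 13, 9, 7, 13, 14, 5, 9, 14, 13],
})

# If every category maps to exactly one numeric code, the category reveals the code.
codes_per_category = sample.groupby("education")["education-num"].nunique()
is_one_to_one = bool((codes_per_category == 1).all())

print(codes_per_category)
print("education determines education-num:", is_one_to_one)
```

When a feature determines the column the label is derived from, a model can reach near-perfect accuracy without learning anything useful, which is why the feature is removed.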

While exploring the data, I also began preparing it by handling missing values and dropping the features listed above. To continue this process, I checked the number of unique values in each categorical column. Since the data set is large enough, I plan to one-hot encode the categorical features to get them ready for the model. I will also prepare the label by binarizing it.
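As a rough illustration of the one-hot encoding step described above, the sketch below encodes a tiny made-up frame with `pd.get_dummies`. The frame `toy` and the list `categorical_cols` are illustrative assumptions; the notebook's own encoding cell may differ in details such as the column prefix.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column (values are made up).
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private"],
    "sex_selfID": ["Female", "Non-Female", "Female"],
    "hours-per-week": [40.0, 13.0, 50.0],
})

categorical_cols = ["workclass", "sex_selfID"]

# Passing `columns=` encodes the listed columns and drops the originals in one call.
encoded = pd.get_dummies(toy, columns=categorical_cols)

print(encoded.columns.tolist())
```

Encoding all categorical columns in a single call avoids a separate step to drop the original string columns afterwards.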

For my model I will use a *RandomForestClassifier*. Tree-based models split on feature thresholds, so they handle a mix of numeric and one-hot encoded features without any need to scale the data. Since this is a binary classification problem, a classifier is the best-suited approach.

After training a first model, I will tune its hyperparameters to try to improve accuracy. To find the best hyperparameters, I will use GridSearchCV.
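The tuning plan can be sketched as follows. This is a minimal example on synthetic data from `make_classification`, with a deliberately small grid; the grid values, `n_estimators=25`, and the variable names are assumptions chosen to keep the example fast, not the settings used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the prepared census features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Small illustrative grid: 2 x 2 = 4 candidate settings.
param_grid = {"max_depth": [4, 8], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is already retrained on all of X_train.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

print(search.best_params_)
print(f"mean CV accuracy: {search.best_score_:.3f}, test accuracy: {test_accuracy:.3f}")
```

The test set is only touched once, after the search, so the reported test accuracy is not used to pick the hyperparameters.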


## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [272]:
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix


<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [273]:
# First, binarize the label and one-hot encode the categorical features.
# The label is 1 when education-num > 10 and 0 otherwise.
df['education_binary'] = (df['education-num'] > 10).astype(int)

# drop the original education-num column
df = df.drop(columns=['education-num'])

for colname in to_encode:
    df_encoded = pd.get_dummies(df[colname], prefix=colname +'_')
    df = df.join(df_encoded)

In [274]:
df.columns

Index(['workclass', 'marital-status', 'occupation', 'relationship', 'race',
       'sex_selfID', 'capital-gain', 'hours-per-week', 'native-country',
       'income_binary', 'education_binary', 'workclass__Federal-gov',
       'workclass__Local-gov', 'workclass__Never-worked', 'workclass__Private',
       'workclass__Self-emp-inc', 'workclass__Self-emp-not-inc',
       'workclass__State-gov', 'workclass__Without-pay',
       'workclass__unavailable', 'marital-status__Divorced',
       'marital-status__Married-AF-spouse',
       'marital-status__Married-civ-spouse',
       'marital-status__Married-spouse-absent',
       'marital-status__Never-married', 'marital-status__Separated',
       'marital-status__Widowed', 'occupation__Adm-clerical',
       'occupation__Armed-Forces', 'occupation__Craft-repair',
       'occupation__Exec-managerial', 'occupation__Farming-fishing',
       'occupation__Handlers-cleaners', 'occupation__Machine-op-inspct',
       'occupation__Other-service', 'occupati

In [275]:
# Drop the original string columns now that their one-hot encoded versions exist
df.drop(columns=to_encode, inplace=True)
df.isnull().values.any()

False

- Now I will create my labeled examples and create my training and test sets.

In [276]:
y = df['education_binary']
X = df.drop(columns='education_binary')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=123)
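A possible variation on the split above (an assumption, not what this notebook does) is to pass `stratify=y`, which keeps the class proportions the same in the training and test sets. The toy arrays `X_toy` and `y_toy` below are made up to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with a 70/30 class split (values are made up).
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy asks for the same class proportions in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123, stratify=y_toy
)

print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test: ", y_te.mean())
```

Stratifying matters most when one class is much rarer than the other, since a purely random split can then leave the test set with too few minority examples.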

# Train, Test and Evaluate our Model: Random Forest

In [277]:
# Baseline random forest model, before any hyperparameter tuning.

rf_model = RandomForestClassifier(max_depth = 32,n_estimators = 300)

rf_model.fit(X_train, y_train)

label_predictions = rf_model.predict(X_test)

acc_score = accuracy_score(y_test, label_predictions)

print(acc_score)
print(confusion_matrix(y_test, label_predictions))


0.777356945439656
[[5807  893]
 [1282 1787]]


In [285]:
# labels=[1, 0] puts the positive class (education_binary == 1) in the first row and column
c_m = confusion_matrix(y_test, label_predictions, labels=[1, 0])
pd.DataFrame(
    c_m,
    columns=['Predicted: education-num > 10', 'Predicted: education-num <= 10'],
    index=['Actual: education-num > 10', 'Actual: education-num <= 10']
)

Unnamed: 0,Predicted: education-num > 10,Predicted: education-num <= 10
Actual: education-num > 10,1787,1282
Actual: education-num <= 10,893,5807


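- The table above breaks the model's predictions down by class: rows are the actual classes and columns are the predicted classes, so the diagonal holds the correct predictions.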
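To make the table easier to interpret, the counts can be turned into standard metrics. The sketch below hard-codes the four counts from the table above, treating `education_binary == 1` as the positive class; the variable names are illustrative.

```python
# Counts copied from the confusion matrix table above
# (positive class = education_binary == 1).
tp, fn = 1787, 1282   # actual positives: predicted positive / predicted negative
fp, tn = 893, 5807    # actual negatives: predicted positive / predicted negative

accuracy = (tp + tn) / (tp + fn + fp + tn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

The accuracy computed this way matches the value printed by `accuracy_score`, while precision and recall show how the errors are distributed for the positive class.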

## Improving Performance

- Implementing GridSearchCV

In [286]:
print('Running Grid Search...')

md = [8, 16, 50, 100]

msl = [1, 2, 3, 4, 5, 10, 25, 50]
param_grid={'max_depth':md, 'min_samples_leaf':msl}

rf_class = RandomForestClassifier(random_state=123)

rf_grid = GridSearchCV(rf_class, param_grid=param_grid, cv=3,scoring='accuracy', verbose=1)

rf_grid_search = rf_grid.fit(X_train, y_train)

print('Done')

Running Grid Search...
Fitting 3 folds for each of 32 candidates, totalling 96 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  96 out of  96 | elapsed:  2.7min finished


Done


In [287]:
rf_best_params = rf_grid_search.best_params_
rf_best_params

{'max_depth': 50, 'min_samples_leaf': 2}

- Now that we have the best hyperparameters, we will retrain our model with them.

In [288]:
rf_model_2 = RandomForestClassifier(
    max_depth=rf_best_params['max_depth'],
    min_samples_leaf=rf_best_params['min_samples_leaf'],n_estimators = 300)

rf_model_2.fit(X_train, y_train)

label_predictions_2 = rf_model_2.predict(X_test)

acc_score_2 = accuracy_score(y_test, label_predictions_2)

print(acc_score_2)
print(confusion_matrix(y_test, label_predictions_2))


0.8000818916982291
[[6077  623]
 [1330 1739]]


- After the grid search, test accuracy improved by about 2.3 percentage points (from 0.777 to 0.800).

In [290]:
# labels=[1, 0] puts the positive class (education_binary == 1) in the first row and column
c_m_2 = confusion_matrix(y_test, label_predictions_2, labels=[1, 0])
pd.DataFrame(
    c_m_2,
    columns=['Predicted: education-num > 10', 'Predicted: education-num <= 10'],
    index=['Actual: education-num > 10', 'Actual: education-num <= 10']
)

Unnamed: 0,Predicted: education-num > 10,Predicted: education-num <= 10
Actual: education-num > 10,1739,1330
Actual: education-num <= 10,623,6077


Using GridSearchCV, we increased the model's test accuracy by about 2.3 percentage points, from roughly 77.7% to 80.0%. Along the way, I removed the `education` feature: earlier runs reached 100% accuracy, and I realized this feature was leaking the label. Careful feature selection and hyperparameter tuning ultimately helped us build a more accurate classifier. This improvement is meaningful because it directly supports the goal of predicting education level from demographic data. In real-world applications, even modest increases in accuracy can make an impact.
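As a closing sanity check, the sketch below compares both models against a majority-class baseline and looks at recall per class. It uses only the two confusion matrices printed above (rows are actual classes 0 and 1, columns are predicted classes 0 and 1); the helper `summarize` is hypothetical and introduced here for illustration. Because neither final model was given a `random_state`, rerunning the notebook may produce slightly different counts.

```python
# Confusion matrices copied from the printed outputs above.
# Rows are actual classes [0, 1]; columns are predicted classes [0, 1].
before = [[5807, 893], [1282, 1787]]   # baseline random forest
after = [[6077, 623], [1330, 1739]]    # random forest with tuned hyperparameters


def summarize(cm):
    """Return accuracy and per-class recall for a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall_class_0": tn / (tn + fp),
        "recall_class_1": tp / (fn + tp),
    }


# Accuracy of always predicting the majority class (class 0) on this test set.
n_class_0 = sum(before[0])
n_total = sum(before[0]) + sum(before[1])
majority_baseline = n_class_0 / n_total

print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
for name, cm in [("before tuning", before), ("after tuning", after)]:
    stats = summarize(cm)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Read this way, both models beat the baseline, and the accuracy gain after tuning comes from class 0, while recall for class 1 is slightly lower than before. Whether that trade-off is acceptable depends on which kind of error matters more for the application.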