# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os

## Part 1: Build the DataFrame

For my project I will be using `censusData.csv`, a file with data from the "census" data set that contains Census information from 1994.

I imported the .csv file into a pandas DataFrame by building its path relative to the current working directory and reading it with `pd.read_csv`, so that I can manipulate the data before I create my machine learning model.

In [2]:
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
df = pd.read_csv(adultDataSet_filename)

df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income_binary
0,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Non-Female,2174,0,40.0,United-States,<=50K
1,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,13.0,United-States,<=50K
2,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Non-Female,0,0,40.0,United-States,<=50K
3,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Non-Female,0,0,40.0,United-States,<=50K
4,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40.0,Cuba,<=50K


## Part 2: Define the ML Problem

Using the Census dataset, I will predict whether an individual's income exceeds $50,000 (the label is the `income_binary` column). Because the label is a binary representation of whether the individual's income is less than or equal to the target income or greater than it, the model I create will solve a supervised learning problem, more specifically a binary classification problem.

In [3]:
df['income_binary']

0        <=50K
1        <=50K
2        <=50K
3        <=50K
4        <=50K
         ...  
32556    <=50K
32557     >50K
32558    <=50K
32559    <=50K
32560     >50K
Name: income_binary, Length: 32561, dtype: object

The features I will initially use are:
- `age` (as a higher age corresponds with more work experience and an increase in an individual's income)
- `education-num` (the revised education column that sequences education levels from lowest to highest)
- `occupation` (some occupations pay better than others; I will convert this column into numerical data, similar to the `education-num` column)
- `race` and `sex_selfID` (to understand trends with racial and sexual bias in the workplace in correlation with an individual's income)
- `hours-per-week` (part-time employees make less than full-time employees as they do not work as consistently as their full-time counterparts)
- `native-country` (migrant workers in the United States have historically taken lower-income jobs and do not earn as much as citizens by birth)

I plan on using Scikit-learn's `SelectKBest` class after I train my model for the first time to confirm which features actually correspond the most to the examples' labels. I will use a `LogisticRegression` model, which outputs a probability between 0 and 1 for each example and assigns the example to one of the two income classes depending on whether that probability is closer to 0 or to 1.
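
As a rough illustration of how `SelectKBest` ranks features, here is a minimal sketch on made-up data. The column names `informative` and `noise` and all values are hypothetical and are not part of the Census dataset; `f_classif` (an ANOVA F-test) is used as the scoring function.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data (hypothetical): one column that tracks the label and one pure-noise column.
rng = np.random.default_rng(0)
toy_y = rng.integers(0, 2, size=200)
toy_X = pd.DataFrame({
    'informative': toy_y + rng.normal(0, 0.1, size=200),
    'noise': rng.normal(0, 1, size=200),
})

# Score every feature against the label with an ANOVA F-test and keep the top k.
selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(toy_X, toy_y)

# get_support() returns a boolean mask over the input columns.
selected = list(toy_X.columns[selector.get_support()])
print(selected)
```

Note that `SelectKBest` scores each feature directly against the label, so it does not depend on a trained model.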

I want to initially focus on two features, which I briefly explained above: `race` and `sex_selfID`. There exists a wage gap between male and female employees, as there is between white employees and employees of color. Per <b>[Pew Research Center](https://www.pewresearch.org/short-reads/2016/07/01/racial-gender-wage-gaps-persist-in-u-s-despite-some-progress/)</b>'s 2015 findings, more than two decades after the creation of the Census dataset, Black and Hispanic men made about 70% of a white man's salary, while women of all ethnicities made less than their male coworkers. I want to determine if an individual's demographics, not just their education level, occupation, or hours worked, affect their yearly income. I plan on creating two sets of `X` data: one with all the above features and one without the `race` and `sex_selfID` features.

## Part 3: Understand Your Data

Now that we have defined our machine learning problem and have identified our features and label, we need to inspect the dataset. We've looked at the first few examples of our dataset, so we can now start analyzing the shape of our dataset.

In [4]:
df.shape

(32561, 15)

We have determined our dataset has 32,561 examples. One of the most important steps in data preprocessing is determining if any columns have missing values and how to address them to make a clean dataset our model can be trained on. I will be using the `df.describe()` method to first analyze the numerical values and determine how to clean the dataset using various statistical metrics.

In [5]:
df.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,32399.0,32561.0,32561.0,32561.0,32561.0,32236.0
mean,38.589216,189778.4,10.080679,615.907773,87.30383,40.450428
std,13.647862,105550.0,2.57272,2420.191974,402.960219,12.353748
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,14084.0,4356.0,99.0


Above is the statistical breakdown of the numeric columns in the Census dataset, in which three of my initial features appear. Before any data manipulation is performed on the DataFrame, we need to make sense of the numerical data my model will be based around:

#### Understanding the `age` feature
The `age` feature is self-explanatory: it gives the age of a given individual at the time of the census. The mean age is between 38 and 39, a typical age for a mid-career individual. The lowest age is 17 and the highest is 90, and I would like to limit the influence of outliers in my dataset. I will use SciPy's `winsorize` function to address outliers and reduce the `age` column's standard deviation.

Also notice that most columns have a <b>count</b> value of 32,561, the number of examples in the census dataset. The `age` column has a lower number, so it contains missing values. I will create a new column and drop the original `age` column so that my dataset favors the winsorized values, then I'll fill in any missing values with the mean of the winsorized data.
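
To make the effect of winsorizing concrete, here is a small sketch on a made-up array (the values in `toy_ages` are hypothetical, and the 10% limits are chosen only so the effect is visible on ten values). The extreme value at each end is replaced by its nearest remaining neighbor, which lowers the standard deviation.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical toy ages with one extreme value at each end.
toy_ages = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# limits=[0.1, 0.1] clips the lowest 10% and highest 10% of values
# to the nearest remaining value on each side.
toy_wins = np.asarray(winsorize(toy_ages, limits=[0.1, 0.1]))

print(toy_wins.tolist())
print(toy_ages.std(), toy_wins.std())
```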

#### Understanding the `education-num` feature
This column takes the original `education` column in the dataset and encodes its contents as integers in ascending order of attainment, corresponding with a given individual's highest education level. For example, a 1 represents preschool as the individual's highest education level while a 16 represents a doctorate degree. The average individual in the dataset has completed some college, as represented by a mean value just greater than 10, while the middle half of the dataset includes individuals with education levels ranging from a high school diploma (9) to an associate degree (12).

This column does not have any missing values to worry about.

#### Understanding the `hours-per-week` feature
The average individual in the dataset works about 40 hours a week, as shown by the mean of the `hours-per-week` column as well as the difference of only 5 hours between the first and third quartiles. To address any missing values in this column, I will replace them with the <b>mean</b> value, as it is close to the `hours-per-week` values within the interquartile range.
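
Mean imputation can be sketched on a made-up column as follows (the values in `toy_hours` are hypothetical). The mean is computed over the known values only and then written into the missing slot.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing entry.
toy_hours = pd.Series([40.0, 40.0, np.nan, 45.0, 35.0])

# Series.mean() skips NaN by default, so the mean is taken over the 4 known values.
toy_filled = toy_hours.fillna(toy_hours.mean())
print(toy_filled.tolist())
```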

## Part 4: Define the Project Plan

I plan on initially using a `LogisticRegression` model, using all default arguments except for a `max_iter` value of 1000, for a basic binary classification: predict whether an unlabeled individual will have an income of less than or equal to 50,000 or greater than 50,000. I will use two different sets of features for `X`: one that includes all the features outlined in Part 2, and one that uses the same features but omits `race` and `sex_selfID`, to understand wage gap trends affecting minority workers. I would like to return the probability predictions as well as the accuracy scores of the model on my two `X` datasets.

From here, I will create a parameter grid with various regularization values for `C`, as well as various values for `max_iter`. I will then perform a grid search with this parameter grid on another `LogisticRegression` model with no arguments given, to find the best values for `C` and `max_iter` on both datasets. Finally, I will create another `LogisticRegression` model using the best parameters and train that on both `X` datasets.

My goal is to see how important an individual's race and sex are to their income, and I want to see if a model that omits such categories outperforms or underperforms one trained on the full dataset.
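
The two outputs named in this plan, probability predictions and an accuracy score, could be obtained roughly as sketched below. The data comes from `make_classification` and is purely synthetic; it stands in for the Census features only to show the calls involved.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical); not the Census dataset.
toy_X, toy_y = make_classification(n_samples=300, n_features=5, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0)

toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_X_train, toy_y_train)

# predict_proba returns one row per example and one column per class,
# ordered as in toy_model.classes_.
toy_proba = toy_model.predict_proba(toy_X_test)
toy_acc = accuracy_score(toy_y_test, toy_model.predict(toy_X_test))

print(np.round(toy_proba[:3], 3))
print(toy_acc)
```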

## Part 5: Implement the Project Plan

As I just mentioned, I will need to import several classes and functions from the Scikit-learn package. More will be imported later in this notebook, such as `SelectKBest` and `f_classif`.

In [6]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

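I will begin preparing the dataset to be used by a machine learning model by converting the object-typed label to an integer, so that my model can work with it directly. With this encoding, a 1 means the individual's income is at most 50K and a 0 means it is greater than 50K.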

In [7]:
df['income_binary'] = (df['income_binary'] == '<=50K').astype(int)
df['income_binary']

0        1
1        1
2        1
3        1
4        1
        ..
32556    1
32557    0
32558    1
32559    1
32560    0
Name: income_binary, Length: 32561, dtype: int64

Now that our label is properly converted into an easier-to-understand datatype, we can start cleaning the rest of the data. We will begin with determining which columns have missing values.

In [8]:
df.isnull().sum()

age                162
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
sex_selfID           0
capital-gain         0
capital-loss         0
hours-per-week     325
native-country     583
income_binary        0
dtype: int64
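
Raw counts are easier to compare once they are expressed as a share of all rows. A minimal sketch on a made-up frame (the rows in `toy_df` are hypothetical; only the column names mirror the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror the notebook's for readability.
toy_df = pd.DataFrame({
    'age': [39.0, np.nan, 38.0, 53.0],
    'occupation': ['Sales', None, None, 'Tech-support'],
    'race': ['White', 'Black', 'White', 'Black'],
})

# isnull() marks missing cells; the column-wise mean of that mask is the missing fraction.
toy_missing_share = toy_df.isnull().mean()
print(toy_missing_share)
```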

Some of the non-numerical features I wanted to include in my training data have many missing values. As I said earlier, I will replace missing values in the `age` column with the mean of its winsorized values, and missing values in the `hours-per-week` column with the mean of that column. I will replace any null value in `native-country` with the phrase "unavailable". The `occupation` column has far more missing values than the other columns I planned on using, so I will omit it from my cleaned data. I will also drop the `workclass` column: it might have come in handy later, but I did not plan on using it initially and it has nearly as many missing values.

Before we fill in any missing values, we will use SciPy's `winsorize` function to replace outliers in the `age` column. We will create a new column of the DataFrame with the replaced values for `age`.

In [9]:
import scipy.stats as stats
df['age-wins'] = stats.mstats.winsorize(df['age'], limits = [0.01, 0.01])

We will check that the columns `age` and `age-wins` are not identical by finding the unique differences between values in the two columns.

In [10]:
(df['age'] - df['age-wins']).unique()

array([ 0., nan,  1., 12.,  2.,  3., 10.,  4.,  5.,  6.,  7.,  8.,  9.])

Notice how there is an `NaN` element in the array of unique differences. We haven't replaced any missing values yet, but now that the outliers in the `age` column have been replaced with values that are closer to the rest of the data, we can start manipulating null elements.

We will now remove the `workclass`, `occupation`, and `age` columns.

In [11]:
df.drop(columns = ['workclass', 'occupation', 'age'], axis = 1, inplace = True)

We will also drop some columns that may not contribute to the machine learning model's predictions. The `fnlwgt`, `education`, `marital-status`, and `relationship` columns will be removed.

In [12]:
df.drop(columns = ['fnlwgt', 'education', 'marital-status', 'relationship'], axis = 1, inplace = True)

We will now replace missing values in the `age-wins`, `hours-per-week`, and `native-country` columns with their predetermined values. Using NumPy's `mean` function, we will find the average value of each of our two numerical columns and replace any missing entry with it.

In [13]:
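# Assign the filled columns back rather than calling fillna(inplace=True) on a column selection
df['age-wins'] = df['age-wins'].fillna(np.mean(df['age-wins']))
df['hours-per-week'] = df['hours-per-week'].fillna(np.mean(df['hours-per-week']))
df['native-country'] = df['native-country'].fillna('unavailable')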

Additionally, we will encode the `sex_selfID`, `race`, and `native-country` columns numerically. For `sex_selfID`, I will take an approach similar to converting our label into an integer column. For the other two, I will perform one-hot encoding, create a separate DataFrame for each one-hot encoding, and join those DataFrames to the original.

In [14]:
df['sex_selfID'] = (df['sex_selfID'] == 'Female').astype(int)

df_race = pd.get_dummies(df['race'], prefix = 'race_')
df_native_country = pd.get_dummies(df['native-country'], prefix = 'native_country_')

df = df.join([df_race, df_native_country])
df.drop(columns = ['race', 'native-country'], axis = 1, inplace = True)
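
A small sketch of how `pd.get_dummies` builds its column names, using a made-up column (`toy_race` and its values are hypothetical). It shows why a prefix that already ends in an underscore produces names with a double underscore.

```python
import pandas as pd

# Hypothetical toy column.
toy_race = pd.Series(['White', 'Black', 'White'], name='race')

# get_dummies joins prefix and category with prefix_sep, which defaults to '_'.
# A prefix that already ends in '_' therefore yields a double underscore.
toy_double = pd.get_dummies(toy_race, prefix='race_')
toy_single = pd.get_dummies(toy_race, prefix='race')

print(list(toy_double.columns))
print(list(toy_single.columns))
```

Either naming works for the model; what matters is that any later code that selects these columns by name matches the prefix that was actually used.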

After all the changes to our data, let's observe the first few examples of the DataFrame.

In [15]:
df.head(10)

Unnamed: 0,education-num,sex_selfID,capital-gain,capital-loss,hours-per-week,income_binary,age-wins,race__Amer-Indian-Inuit,race__Asian-Pac-Islander,race__Black,...,native_country__Puerto-Rico,native_country__Scotland,native_country__South,native_country__Taiwan,native_country__Thailand,native_country__Trinadad&Tobago,native_country__United-States,native_country__Vietnam,native_country__Yugoslavia,native_country__unavailable
0,13,0,2174,0,40.0,1,39.0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,13,0,0,0,13.0,1,50.0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,9,0,0,0,40.0,1,38.0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,7,0,0,0,40.0,1,53.0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
4,13,1,0,0,40.0,1,28.0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
5,14,1,0,0,40.0,1,37.0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
6,5,1,0,0,16.0,1,49.0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
7,9,0,0,0,45.0,0,52.0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
8,14,1,14084,0,50.0,0,31.0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9,13,0,5178,0,40.0,0,42.0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


Our DataFrame has been manipulated to be better understood by a machine learning model.

We will now define our `X` and `y` parameters for our model. Remember that we will be using two separate sets of `X`: one that includes the `race` and `sex_selfID` features and one that does not.

In [16]:
y = df['income_binary']
X_1 = df.drop(columns = 'income_binary', axis = 1)

cols_to_omit = [col for col in list(df.columns) if 'race_' in col]
cols_to_omit.append('income_binary')
cols_to_omit.append('sex_selfID')
X_2 = df.drop(columns = cols_to_omit, axis = 1)

Our dataset has been manipulated to create two sets of features `X` and a label `y`. We will now use Scikit-learn's `train_test_split` function to divide our data into training data and test data. Our test data will be 10% the size of the original dataset.

In [17]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_1, y, test_size = 0.10, random_state = 123)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_2, y, test_size = 0.10, random_state = 123)
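
Because both calls use the same `test_size` and `random_state` on inputs of the same length, they should select the same rows for the test set, which keeps the two feature sets comparable. A minimal check on made-up frames (all `toy_` names and values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frames standing in for the two feature sets.
toy_X1 = pd.DataFrame({'a': np.arange(100), 'b': np.arange(100) * 2})
toy_X2 = toy_X1.drop(columns='b')
toy_y = pd.Series(np.arange(100) % 2)

# Same test_size and random_state on inputs of equal length.
tr1, te1, ytr1, yte1 = train_test_split(toy_X1, toy_y, test_size=0.10, random_state=123)
tr2, te2, ytr2, yte2 = train_test_split(toy_X2, toy_y, test_size=0.10, random_state=123)

print(len(tr1), len(te1))
print(te1.index.equals(te2.index))
```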

Remember that the DataFrame `X_1` contains all the features we kept after cleaning, while `X_2` is identical but omits any columns associated with an individual's race and sex.

We will now create our default `LogisticRegression` models. Each model is placed in a pipeline behind a `StandardScaler`, which standardizes every feature to zero mean and unit variance before the model is fit.

In [18]:
model_default1 = LogisticRegression()
scaler1 = StandardScaler()
pipeline_default1 = make_pipeline(scaler1, model_default1)
pipeline_default1.fit(X_train1, y_train1)

preds_default1 = pipeline_default1.predict(X_test1)
acc_default1 = accuracy_score(y_test1, preds_default1)
print('Accuracy score for default model on dataset 1:', acc_default1)

model_default2 = LogisticRegression()
scaler2 = StandardScaler()
pipeline_default2 = make_pipeline(scaler2, model_default2)
pipeline_default2.fit(X_train2, y_train2)

preds_default2 = pipeline_default2.predict(X_test2)
acc_default2 = accuracy_score(y_test2, preds_default2)
print('Accuracy score for default model on dataset 2:', acc_default2)

Accuracy score for default model on dataset 1: 0.8265274792754068
Accuracy score for default model on dataset 2: 0.8203868590727663


When we create a default `LogisticRegression` model, we get an 82.7% accuracy rate on dataset 1 and an 82.0% accuracy rate on dataset 2. Already we can see that the full dataset yields slightly more accurate results than the one with the race and sex features removed, and we notice there are already signs of a wage gap affecting women and employees of color.
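
One way to put accuracy scores like these in context is to compare them with the accuracy of always predicting the most common class. The sketch below shows the calculation on made-up labels (`toy_labels` is hypothetical and does not reflect the Census class balance).

```python
import pandas as pd

# Hypothetical toy labels: 1 is the more common class here.
toy_labels = pd.Series([1, 1, 1, 0, 1, 0, 1, 1])

# Share of each class; the largest share is the accuracy of always
# predicting the most common class.
toy_class_share = toy_labels.value_counts(normalize=True)
toy_baseline_acc = toy_class_share.max()
print(toy_baseline_acc)
```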

We will now create our second set of models. Rather than fixing a single maximum number of iterations, we will create a parameter grid containing various values for `max_iter` (up to 10,000), the regularization strength `C`, and the `solver`, to be used in a grid search.

In [19]:
param_grid = {
    'logisticregression__max_iter': [100, 200, 500, 1000, 5000, 10000],
    'logisticregression__C': [0.01, 0.1, 1, 10, 100],
    'logisticregression__solver': ['lbfgs', 'liblinear']
}
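
The `logisticregression__` prefix in the grid keys follows from how `make_pipeline` names its steps. A short sketch that lists the step names and the matching parameter names (the `toy_` variables are introduced here only for illustration):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

toy_pipeline = make_pipeline(StandardScaler(), LogisticRegression())

# make_pipeline names each step after its lowercased class name.
toy_step_names = [name for name, _ in toy_pipeline.steps]
print(toy_step_names)

# Parameters of a step are addressed as '<step name>__<parameter>'.
toy_param_names = [p for p in toy_pipeline.get_params() if p.startswith('logisticregression__')]
print(sorted(toy_param_names)[:5])
```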

We will use Scikit-learn's `GridSearchCV` to begin a grid search using a new default model, the parameter grid we just declared, and a cross-validation value of 5. This may take a few minutes to run.

In [20]:
print('Running grid search on dataset 1')

model_grid1 = LogisticRegression()
scaler_grid1 = StandardScaler()
pipeline_grid1 = make_pipeline(scaler_grid1, model_grid1)

grid1 = GridSearchCV(pipeline_grid1, param_grid, cv = 5)
grid_search1 = grid1.fit(X_train1, y_train1)

print('Done with dataset 1')

print('Running grid search on dataset 2')

model_grid2 = LogisticRegression()
scaler_grid2 = StandardScaler()
pipeline_grid2 = make_pipeline(scaler_grid2, model_grid2)

grid2 = GridSearchCV(pipeline_grid2, param_grid, cv = 5)
grid_search2 = grid2.fit(X_train2, y_train2)

print('Done with dataset 2')

Running grid search on dataset 1
Done with dataset 1
Running grid search on dataset 2
Done with dataset 2


We will print the results of each grid search, beginning with dataset 1, the full dataset.

In [28]:
grid_search1.best_params_

{'logisticregression__C': 1,
 'logisticregression__max_iter': 100,
 'logisticregression__solver': 'liblinear'}

Interestingly, the best model for dataset 1 uses the default values for `C` and `max_iter`. (Note that a default `LogisticRegression` model has a `C` regularization value of 1.0 and a `max_iter` value of 100.) The one difference from the default model is the solver: the grid search selected `liblinear`, whereas the default is `lbfgs`. We will now print the best parameters of dataset 2, the dataset with race and sex categories omitted.

In [29]:
grid_search2.best_params_

{'logisticregression__C': 0.01,
 'logisticregression__max_iter': 100,
 'logisticregression__solver': 'liblinear'}
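
Besides `best_params_`, a fitted `GridSearchCV` also exposes the refit pipeline itself. The sketch below uses synthetic data from `make_classification` and a deliberately tiny grid; it is an illustration of the attributes, not a rerun of the searches above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical); not the Census dataset.
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)

toy_grid = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression()),
    {'logisticregression__C': [0.01, 1, 100]},
    cv=5,
)
toy_grid.fit(toy_X, toy_y)

# With the default refit=True, best_estimator_ is the whole pipeline
# refit on all the data passed to fit() using best_params_.
toy_best_pipeline = toy_grid.best_estimator_
print(toy_grid.best_params_)
print(toy_grid.best_score_)  # mean cross-validated accuracy of the best setting
toy_preds = toy_best_pipeline.predict(toy_X[:5])
```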

Now that we know which parameters gave the best cross-validated performance, we can create our two best models using the best `C` and `max_iter` values, even though for dataset 1 these happen to match the default model.

In [22]:
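# Pull the tuned values out of each grid search's best parameters
best_maxiter1 = grid_search1.best_params_['logisticregression__max_iter']
best_C1 = grid_search1.best_params_['logisticregression__C']
best_maxiter2 = grid_search2.best_params_['logisticregression__max_iter']
best_C2 = grid_search2.best_params_['logisticregression__C']

model_best1 = LogisticRegression(max_iter = best_maxiter1, C = best_C1)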
scaler_best1 = StandardScaler()
pipeline_best1 = make_pipeline(scaler_best1, model_best1)
pipeline_best1.fit(X_train1, y_train1)

preds_best1 = pipeline_best1.predict(X_test1)
acc_best1 = accuracy_score(y_test1, preds_best1)
print('Accuracy score for best model on dataset 1:', acc_best1)

model_best2 = LogisticRegression(max_iter = best_maxiter2, C = best_C2)
scaler_best2 = StandardScaler()
pipeline_best2 = make_pipeline(scaler_best2, model_best2)
pipeline_best2.fit(X_train2, y_train2)

preds_best2 = pipeline_best2.predict(X_test2)
acc_best2 = accuracy_score(y_test2, preds_best2)
print('Accuracy score for best model on dataset 2:', acc_best2)

Accuracy score for best model on dataset 1: 0.8265274792754068
Accuracy score for best model on dataset 2: 0.8216149831132944


For dataset 1, the "best" model, which uses the same `C` and `max_iter` values as the default model, still has an 82.7% accuracy rate. Dataset 2's best model, however, does perform slightly better than its default counterpart, with an 82.2% accuracy rate compared to 82.0%. As a result, we see a smaller gap between the accuracy on the dataset where all the features are included and the accuracy on the dataset where any mention of the race or sex of an individual is omitted.

No matter which model is trained on the data, whether a default model or an optimized one, we still notice a slight difference in accuracy between including an individual's race and sex in the features and leaving them out. This suggests that a wage gap does exist in the workplace against women and employees of color, as models that include race and sex information predict slightly better than models that omit such information about an individual. While this dataset is about three decades old, wage gaps still exist due to many workplaces being historically dominated by white men. This project aimed to examine how machine learning models predict an individual's income slightly less accurately without knowing their demographic information (that is, the race and sex they identify with most).
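
When two accuracy scores differ by a fraction of a percentage point, it can help to check how much accuracy varies from one split to another. The following is only a sketch of that idea on synthetic data: `fold_scores` is a hypothetical helper, and dropping the last two columns merely mimics omitting features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical); not the Census dataset.
toy_X, toy_y = make_classification(n_samples=300, n_features=6, random_state=0)
toy_X_reduced = toy_X[:, :4]  # pretend the last two columns are the omitted features

def fold_scores(features, labels):
    """Return 5-fold cross-validated accuracy for a scaled logistic regression."""
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(pipeline, features, labels, cv=5, scoring='accuracy')

toy_full_scores = fold_scores(toy_X, toy_y)
toy_reduced_scores = fold_scores(toy_X_reduced, toy_y)

# Compare the gap between means with the fold-to-fold spread.
print(toy_full_scores.mean() - toy_reduced_scores.mean())
print(toy_full_scores.std(), toy_reduced_scores.std())
```

If the gap between the two means is comparable to the spread across folds, the difference between the feature sets is harder to distinguish from split-to-split noise.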