# GSB 545: Advanced Machine Learning for Business Analytics

## Predicting Income
### Primary Goals:
In this lab, we'll be using a dataset from Kaggle yet again... it's just so fun and rich! We're using the following income dataset, where the goal is to use the other features to predict whether someone makes over $50,000 per year.

### Data
https://www.kaggle.com/datasets/lodetomasi1995/income-classification?sort=published

### Assignment Specs
You need to use Naive Bayes and neural networks in your work to answer the question above, and you should explore at least two other models so that you can answer it as well as possible. You may use multiple neural network models if you like, but I'd encourage you to consider past model types we've discussed.

This dataset has variables of multiple types. So, this should give you an opportunity to explore how neural networks can (or can't) handle data of different types. You may need to one-hot encode the character variables...

Your submission should be built and written with non-experts as the target audience. All of your code should still be included, but do your best to narrate your work in accessible ways.

In [42]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import BernoulliNB
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score
from sklearn.metrics import f1_score, confusion_matrix
import warnings
warnings.filterwarnings("ignore")


In [10]:
income_data = pd.read_csv("income_evaluation.csv")
income_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [37]:
income_data["income"].value_counts()

# Imbalanced Data set

income
<=50K    24720
>50K      7841
Name: count, dtype: int64

Because the data is imbalanced, we cannot rely on accuracy alone to give us a good understanding of each model's performance. We report the F1-score and the confusion matrix for each model instead.
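To make the imbalance concrete, here is a minimal sketch using only the class counts printed above. The "always predict <=50K" baseline is a hypothetical model introduced for illustration; it is not one of the models fitted in this lab.

```python
# Class counts reported by income_data["income"].value_counts() above.
n_low, n_high = 24720, 7841  # "<=50K" and ">50K"
total = n_low + n_high

# Hypothetical baseline (an assumption for illustration):
# a model that always predicts "<=50K".
baseline_accuracy = n_low / total

# The baseline never predicts ">50K", so it finds none of the high earners.
true_positives = 0
baseline_recall = true_positives / n_high

print(f"Baseline accuracy: {baseline_accuracy:.3f}")        # 0.759
print(f"Baseline recall for >50K: {baseline_recall:.3f}")   # 0.000
```

A model that learns nothing would still look about 76% accurate, which is why accuracy by itself says little here.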

In [27]:
X = income_data.drop(["income"], axis=1)
y = income_data["income"]

# Encode the target variable
y = LabelEncoder().fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing pipeline for categorical and numerical columns
ct = ColumnTransformer(
  [
    ("dummify", OneHotEncoder(sparse_output=False, handle_unknown='ignore', drop="first"), make_column_selector(dtype_include=object)),
    ("standardize", StandardScaler(), make_column_selector(dtype_include=np.number))
  ],
  remainder="passthrough"
).set_output(transform="pandas")
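For readers new to these preprocessing steps, the sketch below mimics on a handful of values what the one-hot encoding and standardization steps do. It is a simplified pure-Python illustration, not scikit-learn's implementation; the helper names `one_hot_drop_first` and `standardize` are hypothetical, and the toy values are the `workclass` and `age` entries from the first five rows shown earlier.

```python
from statistics import mean, pstdev

def one_hot_drop_first(values):
    """Turn categories into 0/1 indicator columns, dropping the first category."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category is implied when every indicator is 0
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows

def standardize(values):
    """Rescale numbers to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation, as StandardScaler uses
    return [(v - mu) / sigma for v in values]

# Toy values taken from the first five rows of the dataset.
workclass = ["State-gov", "Self-emp-not-inc", "Private", "Private", "Private"]
age = [39, 50, 38, 53, 28]

columns, encoded = one_hot_drop_first(workclass)
scaled_age = standardize(age)

print(columns)
for original, row in zip(workclass, encoded):
    print(original, row)
print([round(a, 2) for a in scaled_age])
```

Each text category becomes a set of yes/no columns, and each numeric column is put on a common scale, which matters most for the neural network.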




## Initial Models

### Naive Bayes Classifier

In [52]:
# Naive Bayes pipeline
nb_pipeline = Pipeline(
  [("preprocessing", ct),
   ("naive_bayes", BernoulliNB())]
).set_output(transform="pandas")

nb_pipeline.fit(X_train, y_train)
y_pred = nb_pipeline.predict(X_test)

print("F1 Score of the Naive Bayes model:", f1_score(y_test, y_pred))
print("Confusion matrix of the Naive Bayes model:\n", confusion_matrix(y_test, y_pred))


F1 Score of the Naive Bayes model: 0.6649499284692417
Confusion matrix of the Naive Bayes model:
 [[4180  762]
 [ 409 1162]]
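As a reading aid, the sketch below shows how the F1-score follows from the Naive Bayes confusion matrix. It assumes scikit-learn's layout (rows are actual classes, columns are predicted classes) and that label encoding maps `<=50K` to 0 and `>50K` to 1.

```python
# Naive Bayes confusion matrix printed above (rows = actual, columns = predicted).
tn, fp = 4180, 762   # actual <=50K: correctly predicted / wrongly flagged as >50K
fn, tp = 409, 1162   # actual >50K: missed / correctly predicted

precision = tp / (tp + fp)   # of everyone predicted >50K, how many really are
recall = tp / (tp + fn)      # of everyone who is >50K, how many the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.4f}")
```

The F1 value reproduces the score printed by `f1_score`, so the same arithmetic can be applied to the other models' matrices.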


In [53]:
# Cross-Val Score
scores = abs(cross_val_score(nb_pipeline, X, y, cv=5, scoring='f1_macro'))
print(scores.mean())

0.7637051391499947
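The cross-validated score uses `scoring='f1_macro'`, which averages the F1-score of both classes, so it is not directly comparable with the single-class F1 printed for the test split. The sketch below illustrates the difference using the Naive Bayes test-set confusion matrix; the helper `f1_from_counts` is a hypothetical name, and the result will not match the cross-validated figure exactly because cross-validation uses different data splits.

```python
def f1_from_counts(tp, fp, fn):
    """F1-score written directly in terms of confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)

# Naive Bayes test-set confusion matrix (rows = actual, columns = predicted).
matrix = [[4180, 762],
          [409, 1162]]

# F1 when ">50K" (class 1) is treated as the positive class.
f1_high = f1_from_counts(tp=matrix[1][1], fp=matrix[0][1], fn=matrix[1][0])
# F1 when "<=50K" (class 0) is treated as the positive class.
f1_low = f1_from_counts(tp=matrix[0][0], fp=matrix[1][0], fn=matrix[0][1])

# Macro F1 is the unweighted mean of the two per-class scores.
macro_f1 = (f1_low + f1_high) / 2

print(f"F1 for >50K:  {f1_high:.3f}")
print(f"F1 for <=50K: {f1_low:.3f}")
print(f"Macro F1:     {macro_f1:.3f}")
```

Because the majority class is easier to predict, its higher F1 pulls the macro average above the F1 for the `>50K` class alone.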


### Neural Network Classifier

In [56]:
# Neural Network pipeline
nn_pipeline = Pipeline(
  [("preprocessing", ct),
   ("neural_network", MLPClassifier())]
).set_output(transform="pandas")

# Fit and evaluate Neural Network model
nn_pipeline.fit(X_train, y_train)
y_pred = nn_pipeline.predict(X_test)

print("F1 Score of the Neural Network model:", f1_score(y_test, y_pred))
print("Confusion matrix of the Neural Network model:\n", confusion_matrix(y_test, y_pred))

F1 Score of the Neural Network model: 0.6478971150503997
Confusion matrix of the Neural Network model:
 [[4568  374]
 [ 639  932]]


In [57]:
# Cross-Val Score
scores = abs(cross_val_score(nn_pipeline, X, y, cv=5, scoring='f1_macro'))
print(scores.mean())

0.7791319151944827


### XGBoosting Model

In [58]:
# XGBoosting pipeline
xgboost_pipeline = Pipeline(
  [("preprocessing", ct),
   ("xgboost", XGBClassifier())]
).set_output(transform="pandas")

xgboost_pipeline.fit(X_train, y_train)
y_pred = xgboost_pipeline.predict(X_test)

# Evaluate the performance of the model
print("F1 Score of the XGBoosting model:", f1_score(y_test, y_pred))
print("Confusion matrix of the XGBoosting model:\n", confusion_matrix(y_test, y_pred))

F1 Score of the XGBoosting model: 0.7193162393162393
Confusion matrix of the XGBoosting model:
 [[4640  302]
 [ 519 1052]]


In [59]:
# Cross-Val Score
scores = abs(cross_val_score(xgboost_pipeline, X, y, cv=5, scoring='f1_macro'))
print(scores.mean())

0.8150121464997288


### RandomForest

In [60]:
# Random forest pipeline
rf_pipeline = Pipeline(
  [("preprocessing", ct),
   ("rf", RandomForestClassifier())]
).set_output(transform="pandas")

rf_pipeline.fit(X_train, y_train)
y_pred = rf_pipeline.predict(X_test)

# Evaluate the performance of the model
print("F1 Score of the RandomForest model:", f1_score(y_test, y_pred))
print("Confusion matrix of the RandomForest model:\n", confusion_matrix(y_test, y_pred))

F1 Score of the RandomForest model: 0.6925722145804677
Confusion matrix of the RandomForest model:
 [[4612  330]
 [ 564 1007]]


In [61]:
# Cross-Val F1 Score
scores = abs(cross_val_score(rf_pipeline, X, y, cv=5, scoring='f1_macro'))
print(scores.mean())

0.7884985803053365


In [62]:
# Cross-Val Recall Score
scores = abs(cross_val_score(rf_pipeline, X, y, cv=5, scoring='recall'))
print(scores.mean())

0.6208411376022684


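The XGBoost model performs best, with a cross-validated macro F1-score of about 0.815, ahead of the random forest (0.788), the neural network (0.779), and Naive Bayes (0.764). It also has the highest F1-score for the >50K class on the test set (0.719). The F1-score balances precision and recall, so a high value means the model identifies positive cases while keeping both false positives and false negatives low. Recall for the positive class is only moderate, however. The cross-validated recall above was computed for the random forest model (about 62.08%), and on the test set XGBoost correctly identifies 1,052 of the 1,571 people earning over $50K (about 67%).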
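Recall for the >50K class on the held-out test split can be read directly from each confusion matrix reported above. The sketch below does that arithmetic for all four models; it assumes rows are actual classes and columns are predicted classes, and the helper `positive_recall` is a hypothetical name. These are single-split figures, so they are not the same quantity as a cross-validated recall.

```python
# Test-set confusion matrices reported above (rows = actual, columns = predicted).
matrices = {
    "Naive Bayes": [[4180, 762], [409, 1162]],
    "Neural Network": [[4568, 374], [639, 932]],
    "XGBoost": [[4640, 302], [519, 1052]],
    "Random Forest": [[4612, 330], [564, 1007]],
}

def positive_recall(matrix):
    """Share of actual >50K earners that the model predicted as >50K."""
    fn, tp = matrix[1]
    return tp / (tp + fn)

recalls = {name: positive_recall(m) for name, m in matrices.items()}

# Print models from highest to lowest recall.
for name, value in sorted(recalls.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

On this split Naive Bayes finds the largest share of high earners, but at the cost of many more false positives (762), which is why its F1-score is still lower than XGBoost's.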