# Naive-Bayes Analysis of Adult Income

In [1]:
Name = "Zachary Kekoa"

### Description
The goal of this project is to use the Naive Bayes and logistic regression methods to classify whether an individual earns more than $\$50,000$ (> 50k) or $\$50,000$ or less (<= 50k) based on features such as age, education, and occupation, and then to compare which model performs better. Each row of the data represents a different individual. The data was collected from https://www.kaggle.com/datasets/wenruliu/adult-income-dataset

### Import Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import OrdinalEncoder # for encoding categorical features from strings to number arrays
from sklearn.naive_bayes import MultinomialNB, CategoricalNB

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

### Import Data

In [2]:
adult = pd.read_csv('adult.csv', sep = ',')
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


### Splitting the Data

In [3]:
# check value counts of our label
adult['income'].value_counts(normalize=True)

<=50K    0.760718
>50K     0.239282
Name: income, dtype: float64

In [4]:
# set seed
np.random.seed(123)

# Randomize the dataset 
data_randomized = adult.sample(frac = 1)

# Calculate index for 80:20 split
trainsize = round(len(data_randomized) * 0.8)

# Split into training and test sets
training_set = data_randomized[:trainsize].reset_index(drop=True)
test_set = data_randomized[trainsize:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(39074, 15)
(9768, 15)


In [5]:
# sanity check
print(type(training_set))
training_set.head(6)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,52,Private,117700,HS-grad,9,Divorced,Adm-clerical,Not-in-family,White,Female,0,0,40,United-States,<=50K
1,19,Private,351757,10th,6,Never-married,Other-service,Unmarried,White,Male,0,0,30,El-Salvador,<=50K
2,31,Federal-gov,101345,HS-grad,9,Never-married,Handlers-cleaners,Own-child,White,Female,0,0,40,United-States,<=50K
3,25,Private,324854,Bachelors,13,Never-married,Sales,Not-in-family,White,Female,0,0,40,United-States,<=50K
4,36,Private,245521,7th-8th,4,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,35,Mexico,<=50K
5,50,Private,144084,Bachelors,13,Divorced,Adm-clerical,Not-in-family,White,Female,0,0,40,United-States,<=50K


In [6]:
# make sure training set proportions are similar to original df
training_set['income'].value_counts(normalize=True)

<=50K    0.760301
>50K     0.239699
Name: income, dtype: float64

In [7]:
# make sure test set proportions are similar to original df
test_set['income'].value_counts(normalize=True)

<=50K    0.762387
>50K     0.237613
Name: income, dtype: float64
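
The checks above confirm that a plain random shuffle happened to preserve the class proportions. As an alternative, a stratified split enforces them by construction. The sketch below uses scikit-learn's `train_test_split` on a hypothetical toy frame (`toy`, with a made-up 24% share of `>50K`), not on the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical toy data: 1000 rows, 24% labelled ">50K"
rng = np.random.default_rng(123)
toy = pd.DataFrame({
    "age": rng.integers(17, 90, size=1000),
    "income": np.where(np.arange(1000) < 240, ">50K", "<=50K"),
})

# stratify keeps the label proportions (almost) identical in both parts
train_part, test_part = train_test_split(
    toy, test_size=0.2, stratify=toy["income"], random_state=123
)

train_share = (train_part["income"] == ">50K").mean()
test_share = (test_part["income"] == ">50K").mean()
print(train_share, test_share)
```

With a dataset as large as this one the difference from a plain shuffle is small, but stratification matters more when a class is rare or the sample is small.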

### Creating The Model

In [8]:
trainX = training_set.iloc[:,:-1]
trainy = training_set['income']

colnames = trainX.columns

trainX.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
0,52,Private,117700,HS-grad,9,Divorced,Adm-clerical,Not-in-family,White,Female,0,0,40,United-States
1,19,Private,351757,10th,6,Never-married,Other-service,Unmarried,White,Male,0,0,30,El-Salvador
2,31,Federal-gov,101345,HS-grad,9,Never-married,Handlers-cleaners,Own-child,White,Female,0,0,40,United-States
3,25,Private,324854,Bachelors,13,Never-married,Sales,Not-in-family,White,Female,0,0,40,United-States
4,36,Private,245521,7th-8th,4,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,35,Mexico


In [9]:
testX = test_set.iloc[:,:-1]
testy = test_set['income']

colnames = testX.columns

testX.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
0,17,Private,181580,11th,7,Never-married,Other-service,Own-child,White,Female,0,0,16,United-States
1,18,Private,205218,11th,7,Never-married,Sales,Own-child,White,Female,0,0,20,United-States
2,27,Local-gov,298510,HS-grad,9,Divorced,Protective-serv,Own-child,White,Male,0,0,40,United-States
3,37,Private,196338,9th,5,Separated,Priv-house-serv,Unmarried,White,Female,0,0,16,Mexico
4,38,Local-gov,218763,Masters,14,Separated,Prof-specialty,Unmarried,White,Female,0,0,40,United-States


### Encoding and Training
I will fit an NB model on the train data and evaluate it on both the train and the test data. Comparing the two accuracies shows how well the model generalizes to unseen data, which is often discussed in terms of the bias-variance tradeoff.

#### Train Data

In [10]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# encode the train labels as 0/1 ('<=50K' -> 0, '>50K' -> 1)
trainBrnli = le.fit_transform(trainy)

# preview the first five encoded labels
trainBrnli[:5]

array([0, 0, 0, 0, 0])
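
As a small illustration of what `LabelEncoder` does here, the sketch below uses a hypothetical three-element label list (`toy_labels`, not taken from the dataset). The encoder sorts the distinct labels and numbers them from 0, so `<=50K` maps to 0 and `>50K` to 1.

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical toy labels, for illustration only
toy_labels = ["<=50K", ">50K", "<=50K"]

toy_le = LabelEncoder()
toy_codes = toy_le.fit_transform(toy_labels)

# classes_ lists the original labels in code order (index = code)
label_to_code = {str(label): code for code, label in enumerate(toy_le.classes_)}
print(label_to_code)

# inverse_transform maps codes back to the original strings
decoded = [str(label) for label in toy_le.inverse_transform(toy_codes)]
print(decoded)
```

Keeping the fitted encoder around makes it possible to translate predictions back into readable income labels later.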

In [11]:
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()

trainX = enc.fit_transform(trainX)

trainX = pd.DataFrame(trainX, columns=colnames) 

trainX.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
0,35.0,4.0,6047.0,11.0,8.0,0.0,1.0,1.0,4.0,0.0,0.0,0.0,39.0,39.0
1,2.0,4.0,22373.0,0.0,5.0,4.0,8.0,4.0,4.0,1.0,0.0,0.0,29.0,8.0
2,14.0,1.0,4452.0,11.0,8.0,4.0,6.0,3.0,4.0,0.0,0.0,0.0,39.0,39.0
3,8.0,4.0,21544.0,9.0,12.0,4.0,12.0,1.0,4.0,0.0,0.0,0.0,39.0,39.0
4,19.0,4.0,18025.0,5.0,3.0,2.0,5.0,0.0,4.0,1.0,0.0,0.0,34.0,26.0
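
The encoded table can look surprising at first: in the first row, `age` 52 became 35.0. `OrdinalEncoder` treats every column as categorical, numeric ones included. It sorts the distinct values in each column and replaces each value with its position in that sorted list. A minimal sketch on a hypothetical five-row frame (`toy`, not taken from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# hypothetical toy frame with one numeric and one string column
toy = pd.DataFrame({
    "age": [52, 19, 31, 25, 36],
    "workclass": ["Private", "Private", "Federal-gov", "Private", "Local-gov"],
})

toy_enc = OrdinalEncoder()
toy_codes = toy_enc.fit_transform(toy)

# categories_ holds the sorted distinct values per column;
# a value's code is its index in that list
print(toy_enc.categories_)
print(pd.DataFrame(toy_codes, columns=toy.columns))
```

One consequence is that a high-cardinality numeric column such as `fnlwgt` ends up with thousands of categories, most of them seen only once.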


In [12]:
#create model object
model = CategoricalNB()

# fit on train data
model.fit(trainX,trainBrnli)

# predict on train data
yhattrain = model.predict(trainX)

CategoricalNB()
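
To make the fitted quantities more concrete: per the scikit-learn documentation, `CategoricalNB` estimates each conditional probability as $P(x_i = t \mid y = c) = (N_{tic} + \alpha) / (N_c + \alpha n_i)$, where $N_{tic}$ counts how often category $t$ of feature $i$ occurs in class $c$, $N_c$ is the class size, $n_i$ is the number of categories of feature $i$, and $\alpha$ is the smoothing parameter. The sketch below recomputes these values by hand on hypothetical toy data (`X_toy`, `y_toy`) and compares them with the fitted model:

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# hypothetical toy data: one feature with categories 0, 1, 2 and a binary label
X_toy = np.array([[0], [1], [1], [2], [0], [1]])
y_toy = np.array([0, 0, 1, 1, 1, 1])

alpha = 1.0          # Laplace smoothing (the CategoricalNB default)
n_categories = 3

nb = CategoricalNB(alpha=alpha)
nb.fit(X_toy, y_toy)

# recompute P(x = t | y = c) = (N_tc + alpha) / (N_c + alpha * n_categories)
manual = np.zeros((2, n_categories))
for c in (0, 1):
    counts = np.bincount(X_toy[y_toy == c, 0], minlength=n_categories)
    manual[c] = (counts + alpha) / (counts.sum() + alpha * n_categories)

# feature_log_prob_[0] has shape (n_classes, n_categories) for feature 0
fitted = np.exp(nb.feature_log_prob_[0])
print(np.allclose(manual, fitted))
```

The smoothing term is what keeps a category that never occurs in one class from receiving probability zero.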

In [14]:
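# confusion matrix; the arguments are passed as (predicted, true),
# so rows are predicted labels and columns are true labels
confusion_matrix(yhattrain, trainBrnli)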

array([[25765,  1323],
       [ 3943,  8043]], dtype=int64)
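
Accuracy alone hides how the errors are distributed across the two classes. Because the call above passes the predictions first, the rows of this matrix are predicted labels and the columns are true labels. The sketch below copies the reported counts, transposes them into scikit-learn's usual layout (rows = true, columns = predicted), and derives precision and recall for the `>50K` class. The variable names are illustrative only.

```python
import numpy as np

# matrix as reported above: rows = predicted label, columns = true label
reported = np.array([[25765, 1323],
                     [3943, 8043]])

# transpose to the usual layout: rows = true label, columns = predicted label
cm = reported.T
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of those predicted >50K, the share that truly are
recall = tp / (tp + fn)      # of those truly >50K, the share that were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```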

In [15]:
# accuracy
accuracy_score(yhattrain, trainBrnli)

0.8652300762655474

#### Test Data

In [16]:
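# same procedure, but with the test data
# reuse the label encoder fitted on the train labels
testBrnli = le.transform(testy)

# caution: fit_transform refits the encoder on the test features, so a test code
# is not guaranteed to denote the same category as the same code in trainX
testX = enc.fit_transform(testX)
testX = pd.DataFrame(testX, columns=colnames)

# refit on the train data, then predict on the test data
model = CategoricalNB()
model.fit(trainX,trainBrnli)
yhattest = model.predict(testX)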
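
The cell above calls `fit_transform` on the test features, which builds a fresh category-to-code mapping from the test data. One way to keep the codes consistent is to fit the encoder on the train data only and send categories unseen during training to a reserved code. The following is a minimal sketch on hypothetical toy frames (`train_df`, `test_df`), not the procedure used to produce the results in this notebook. Applying it to the real data would likely change the reported numbers.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# hypothetical toy data, for illustration only
train_df = pd.DataFrame({
    "workclass": ["Private", "Private", "Local-gov", "Federal-gov", "Private", "Local-gov"],
    "education": ["HS-grad", "Bachelors", "Masters", "Bachelors", "HS-grad", "Masters"],
})
train_labels = np.array([0, 1, 1, 1, 0, 1])

test_df = pd.DataFrame({
    "workclass": ["Private", "Self-emp"],   # "Self-emp" never appears in train_df
    "education": ["Masters", "HS-grad"],
})

# unseen categories are encoded as -1; shifting every code by +1 turns that into
# a reserved code 0, since CategoricalNB requires non-negative integer categories
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train_df).astype(int) + 1
test_codes = encoder.transform(test_df).astype(int) + 1   # transform only, no refit

clf = CategoricalNB()
clf.fit(train_codes, train_labels)
test_pred = clf.predict(test_codes)
print(test_codes)
print(test_pred)
```

The reserved code 0 never occurs in training, so its probability comes entirely from the smoothing term and it contributes little evidence for either class.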

In [17]:
# arguments are (predicted, true): rows are predicted labels, columns are true labels
confM = confusion_matrix(yhattest, testBrnli)
confM

array([[6649,  891],
       [ 798, 1430]], dtype=int64)

In [18]:
acc = accuracy_score(yhattest, testBrnli)
acc

0.827088452088452

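The test accuracy (82.7%) is lower than the train accuracy (86.5%). This is expected: the model's parameters were estimated from the train data, so the train accuracy is an optimistic estimate, and accuracy usually drops on data the model has not seen. This gap is commonly discussed in terms of the bias-variance tradeoff. For context, always predicting the majority class (<= 50k) would already be correct for about 76% of the test set, so a test accuracy of about 83% indicates that the model is a reasonably effective predictor of income level (> 50k or <= 50k).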
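
To put the test accuracy in context, it helps to compare it with the accuracy of always predicting the majority class. The sketch below derives both numbers from the test confusion matrix reported above (rows = predicted, columns = true, because the predictions were passed first). The variable names are illustrative only.

```python
import numpy as np

# test-set matrix as reported above: rows = predicted label, columns = true label
reported_test = np.array([[6649, 891],
                          [798, 1430]])

n_test = reported_test.sum()
true_class_counts = reported_test.sum(axis=0)   # column sums = true class sizes

# accuracy of always predicting the larger class
baseline_accuracy = true_class_counts.max() / n_test

# correct predictions lie on the diagonal
model_accuracy = np.trace(reported_test) / n_test

print(f"baseline={baseline_accuracy:.4f} model={model_accuracy:.4f}")
```

The difference between the two values is the improvement that the model adds over a trivial classifier.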