In [1]:
from sklearn import naive_bayes
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the data 

Load the Adult income data set. The data has no column labels, so read the data set documentation at https://archive.ics.uci.edu/ml/datasets/Adult and use the built-in pandas options to attach column labels to the data frame.

In [2]:
        
names = ["Age", "Workclass", "fnlwgt", "Education", "Education-Num", 
"Martial Status","Occupation", "Relationship", "Race", "Sex",
"Capital Gain", "Capital Loss", "Hours per week", "Country", "Target"]

adult_dat = pd.read_csv('../assets/datasets/adult.csv', skiprows=1, names=names)        
adult_dat.head()

Unnamed: 0,Age,Workclass,fnlwgt,Education,Education-Num,Martial Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours per week,Country,Target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,small
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,small
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,small
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,small
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,small


For simplicity, let's drop any missing values.

In [3]:
adult_dat.dropna(inplace=True)

In [4]:
adult_dat["Target"].value_counts() / len(adult_dat)

small    0.751078
large    0.248922
Name: Target, dtype: float64

## Feature Engineering

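We're only going to use the categorical features: workclass, education, sex, marital status, occupation, relationship and race.

Create dummies for these variables. Note that the solution cells below build dummies for every feature in this list except education.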

In [6]:
def create_dummy(column):
    # Drop the least frequent category so that it acts as the reference level
    dropper = adult_dat[column].value_counts().index[-1]
    return pd.get_dummies(adult_dat[column]).drop(dropper, axis=1)
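
To see what this helper does, here is a minimal sketch on a made-up column. The `toy` data frame and the `create_dummy_from` name are hypothetical; the function takes the frame as an argument so the example runs on its own, and `dtype=int` is passed only so the output shows 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical toy column: Private x3, State-gov x2, Local-gov x1
toy = pd.DataFrame(
    {"Workclass": ["Private", "Private", "State-gov",
                   "Private", "Local-gov", "State-gov"]}
)

def create_dummy_from(frame, column):
    # value_counts() sorts by frequency, so the last index entry is the rarest
    rarest = frame[column].value_counts().index[-1]
    # One 0/1 column per category, minus the rarest one
    return pd.get_dummies(frame[column], dtype=int).drop(rarest, axis=1)

dummies = create_dummy_from(toy, "Workclass")
print(dummies.columns.tolist())
print(dummies.iloc[4].tolist())  # the Local-gov row
```

The dropped category is not lost: a row that belongs to it is simply all zeros across the remaining columns, so it serves as the reference level.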

In [16]:
Sex = create_dummy('Sex')
Workclass = create_dummy('Workclass') 
Marital = create_dummy('Martial Status')
Occupation = create_dummy('Occupation')
Relationship = create_dummy('Relationship')
Race = create_dummy('Race')
Target = create_dummy('Target')


df = pd.concat([Sex, Workclass, Marital, Occupation, Relationship, Race, Target], axis = 1)

df.head()

Unnamed: 0,Male,Federal-gov,Local-gov,Private,Self-emp-inc,Self-emp-not-inc,State-gov,Divorced,Married-civ-spouse,Married-spouse-absent,...,Husband,Not-in-family,Own-child,Unmarried,Wife,Amer-Indian-Eskimo,Asian-Pac-Islander,Black,White,small
0,1,0,0,0,0,0,1,0,0,0,...,0,1,0,0,0,0,0,0,1,1
1,1,0,0,0,0,1,0,0,1,0,...,1,0,0,0,0,0,0,0,1,1
2,1,0,0,1,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,1,1
3,1,0,0,1,0,0,0,0,1,0,...,1,0,0,0,0,0,0,1,0,1
4,0,0,0,1,0,0,0,0,1,0,...,0,0,0,0,1,0,0,1,0,1


# Partition the data

Without using any method or library that does this automatically, partition the data set 70/30. You can use anything from the math, pandas, or numpy libraries, but no others (don't use train/test split!).

In [9]:
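# Boolean mask: each row goes to the training set with probability 0.70,
# so the split is approximately (not exactly) 70/30
partition_val = np.random.rand(len(df)) < 0.70
train = df[partition_val]
test = df[~partition_val]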
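
The random mask above gives a split that is only close to 70/30. If an exact proportion matters, one alternative is to shuffle the row positions and cut at a fixed index. This is a sketch using only numpy and pandas; the `split_frame` helper, the `seed` argument, and the `toy` frame are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def split_frame(frame, train_frac=0.70, seed=0):
    # Shuffle the row positions reproducibly
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(frame))
    # Cut point for the training portion
    cut = int(round(train_frac * len(frame)))
    return frame.iloc[shuffled[:cut]], frame.iloc[shuffled[cut:]]

# Hypothetical 10-row frame to check the proportions
toy = pd.DataFrame({"x": range(10)})
train_toy, test_toy = split_frame(toy)
print(len(train_toy), len(test_toy))
```

Every row ends up in exactly one of the two parts, and fixing the seed makes the partition repeatable between runs.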

# Define your feature set and define your target 

In [10]:
target_train = train['small']
feature_train = train.drop('small', axis=1)


# Run Naive Bayes Classifier

Instantiate the Naive Bayes predictor from scikit-learn and fit it to the training data.

In [11]:
Cat_Naive_Bayes = naive_bayes.MultinomialNB()
Cat_Naive_Bayes.fit(feature_train, target_train)


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

# Check Accuracy / Score for Naive Bayes

Define the target and feature set for the test data

In [12]:
target_test = test['small']
feature_test =  test.drop('small', axis = 1)

Score the Naive Bayes classifier on the test data

In [13]:
Cat_Naive_Bayes.score(feature_test, target_test)

0.76860697622750496
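
To put this score in context, it helps to compare it with a majority-class baseline. The class balance computed earlier shows about 75% of rows labelled `small`, so a model that always predicts `small` would score roughly 0.75 on a random split. A minimal sketch of the calculation on made-up labels (`toy_target` is hypothetical and not taken from the data set):

```python
import pandas as pd

# Hypothetical toy labels standing in for the test target
toy_target = pd.Series(["small", "small", "small", "large"])

# Accuracy obtained by always predicting the most frequent class
baseline_accuracy = toy_target.value_counts(normalize=True).max()
print(baseline_accuracy)
```

On the notebook's own 0/1 `target_test` column, the same quantity would be `max(target_test.mean(), 1 - target_test.mean())`.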

# Check Accuracy / Score for a Logistic Classifier 

Define a logistic regression model and train it on the training features and target.

In [14]:
import sklearn.linear_model as linear_model

logistic_class = linear_model.LogisticRegression()
logit = logistic_class.fit(feature_train, target_train)

Compute the accuracy of the logistic regression on the test set.

In [15]:
logit.score(feature_test, target_test)

0.81415241057542764

### Does Naive Bayes perform better or worse than logistic regression? Why?

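Naive Bayes performs worse here: about 0.77 accuracy versus about 0.81 for logistic regression. Naive Bayes assumes the features are independent of each other given the class, whereas logistic regression does not strictly assume independence between the variables. In this case there's a good chance that the variables are somewhat related (for example, sex and occupation), so the Naive Bayes assumption is violated.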
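
The effect of related features on Naive Bayes can be illustrated with a toy experiment that is not based on the Adult data. The sketch below builds one noisy binary feature, then feeds the model three identical copies of it. Because Naive Bayes treats the copies as independent evidence, its predicted probability becomes more extreme even though no new information was added. `BernoulliNB` is used here because the synthetic feature is a plain 0/1 indicator; all variable names are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
n = 5000

# Balanced binary target and a feature that agrees with it about 70% of the time
y = rng.integers(0, 2, size=n)
flip = rng.random(n) < 0.3
x = np.where(flip, 1 - y, y).reshape(-1, 1)

# Same information, presented once versus three times
single = BernoulliNB().fit(x, y)
tripled = BernoulliNB().fit(np.hstack([x, x, x]), y)

p_single = single.predict_proba([[1]])[0, 1]
p_tripled = tripled.predict_proba([[1, 1, 1]])[0, 1]
print(p_single, p_tripled)
```

The second probability is noticeably higher than the first: the model counts the same evidence three times. Correlated dummy columns can distort the Naive Bayes estimates in a similar, if milder, way.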

**Bonus**: what could you do if you wanted to bring back in the numeric features?


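You could combine models. Because Naive Bayes assumes the features are independent given the class, you can fit one Naive Bayes model on the numeric features (for example, a Gaussian one) and another on the categorical features, then combine their predicted class probabilities.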

[source from stackoverflow](http://stackoverflow.com/questions/14254203/mixing-categorial-and-continuous-data-in-naive-bayes-classifier-using-scikit-lea)
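
One way to combine two Naive Bayes models is sketched below on synthetic data, under the assumption that the numeric and categorical blocks are independent given the class. In that case the joint posterior is proportional to the product of the two posteriors divided by the class prior, which becomes a sum and a subtraction in log space. The data, the variable names, and the choice of `BernoulliNB` for the 0/1 block are all illustrative assumptions, not the notebook's method.

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)

# Synthetic numeric block: class 1 has its mean shifted by 1
X_num = rng.normal(loc=y[:, None] * 1.0, scale=1.0, size=(n, 2))
# Synthetic 0/1 block: each indicator is more likely to be 1 for class 1
X_cat = (rng.random((n, 3)) < np.where(y[:, None] == 1, 0.7, 0.3)).astype(int)

# One model per feature type
num_model = GaussianNB().fit(X_num, y)
cat_model = BernoulliNB().fit(X_cat, y)

# log P(y | all) = log P(y | num) + log P(y | cat) - log P(y) + constant
log_prior = np.log(num_model.class_prior_)
log_post = (num_model.predict_log_proba(X_num)
            + cat_model.predict_log_proba(X_cat)
            - log_prior)
# Normalise each row so the class probabilities sum to 1
log_post -= logsumexp(log_post, axis=1, keepdims=True)

combined_pred = num_model.classes_[np.argmax(log_post, axis=1)]
print((combined_pred == y).mean())
```

Subtracting the log prior once matters: each model's posterior already includes the prior, so adding the two without the correction would count it twice.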