# Handling Categorical Values

Adult Census Income Data Set: `datasets/adult.csv`

The prediction task is to determine whether a person makes over 50K a year.
For more information, see https://www.kaggle.com/uciml/adult-census-income

(<b>Note</b>: the attribute `fnlwgt` (final weight) is a demographic score assigned to an individual based on information such as state of residence and type of employment. People with similar demographic characteristics should have similar weights.)

Source: https://cseweb.ucsd.edu/classes/wi17/cse258-a/reports/a120.pdf

1.) Import the relevant libraries (don't forget the scikit-learn `preprocessing` module)

In [31]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn import preprocessing

2.) Load the `adult.csv` dataset located in the `datasets` folder

In [32]:
filename = '../datasets/adult.csv'
df = pd.read_csv(filename, header=0)

3.) Inspect your data. Notice that some values are `?`. These indicate missing values. 
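
One possible way to carry out this inspection is sketched below. The helper name `count_placeholder_values` and the toy frame are assumptions made for illustration; they are not part of the original notebook, and the toy values are made up rather than taken from `adult.csv`.

```python
import pandas as pd


def count_placeholder_values(frame: pd.DataFrame, marker: str = "?") -> pd.Series:
    """Count, per column, how many cells are exactly equal to the marker."""
    return (frame == marker).sum()


# Toy frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["Private", "?", "?"],
    "occupation": ["Sales", "?", "Tech-support"],
})

print(toy.head())                     # first rows, as a quick visual check
print(count_placeholder_values(toy))  # number of '?' cells in each column
```

On the real data the same helper could be called as `count_placeholder_values(df)`. Passing `na_values='?'` to `pd.read_csv` would instead load these cells as `NaN`; this notebook appears to keep `?` as an ordinary category value, so that option is mentioned only as an alternative.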

4.) Extract features and labels.

In [34]:
X = df.iloc[0:10000, :-1]
y = df.iloc[0:10000, -1]

In [35]:
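# Drop 'education'; the numeric 'education.num' column encodes the same information.
# (Passing the axis positionally, as in drop('education', 1), no longer works in pandas 2.x.)
X = X.drop(columns='education')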

In [36]:
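# Encode 'sex' as integers. Categories are sorted alphabetically, so Female -> 0, Male -> 1.
# (Assigning to .cat.categories directly was removed in pandas 2.x.)
X['sex'] = X['sex'].astype("category").cat.rename_categories([0, 1])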
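
As a quick check of which integer each category receives, the toy series below (made-up values, not from the dataset) shows that pandas sorts string categories lexically before assigning codes. This is a minimal sketch; the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the 'sex' column (illustrative values only).
toy_sex = pd.Series(["Male", "Female", "Female", "Male"], name="sex")

as_category = toy_sex.astype("category")
category_order = list(as_category.cat.categories)  # sorted category labels
codes = as_category.cat.codes.tolist()             # integer code for each row

print(category_order)
print(codes)
```

Because the order is lexical, `Female` comes first and receives code 0, and `Male` receives code 1.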

In [37]:
# Collapse 'relationship' into a binary flag: 1 for Husband/Wife, 0 otherwise
X['rel_num'] = X['relationship'].map({
    'Not-in-family': 0, 'Unmarried': 0, 'Own-child': 0, 'Other-relative': 0,
    'Husband': 1, 'Wife': 1,
})
del X['relationship']

# One-hot encode the nominal features (one indicator column per category)
nominals = ['workclass', 'marital.status', 'occupation', 'race']
X = pd.get_dummies(X, prefix=nominals, columns=nominals)

# Binarize 'native.country' by hand: 1 for United-States, 0 for every other value
X['native.country'] = np.where(X['native.country'] == 'United-States', 1, 0)

print(X.columns.values)

['age' 'fnlwgt' 'education.num' 'sex' 'capital.gain' 'capital.loss'
 'hours.per.week' 'native.country' 'rel_num' 'workclass_?'
 'workclass_Federal-gov' 'workclass_Local-gov' 'workclass_Never-worked'
 'workclass_Private' 'workclass_Self-emp-inc' 'workclass_Self-emp-not-inc'
 'workclass_State-gov' 'workclass_Without-pay' 'marital.status_Divorced'
 'marital.status_Married-AF-spouse' 'marital.status_Married-civ-spouse'
 'marital.status_Married-spouse-absent' 'marital.status_Never-married'
 'marital.status_Separated' 'marital.status_Widowed' 'occupation_?'
 'occupation_Adm-clerical' 'occupation_Armed-Forces'
 'occupation_Craft-repair' 'occupation_Exec-managerial'
 'occupation_Farming-fishing' 'occupation_Handlers-cleaners'
 'occupation_Machine-op-inspct' 'occupation_Other-service'
 'occupation_Priv-house-serv' 'occupation_Prof-specialty'
 'occupation_Protective-serv' 'occupation_Sales' 'occupation_Tech-support'
 'occupation_Transport-moving' 'race_Amer-Indian-Eskimo'
 'race_Asian-Pac-Island