In [16]:
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import pandas as pd

## boilerplate that limits how many rows and columns pandas prints,
## which keeps the displayed tables short
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

## Read in the data 


In [2]:
def strip(text):
    try:
        return text.strip()
    except AttributeError:
        return text

In [3]:
train_data = pd.read_csv('census.train', converters={'education' : strip,
                                    'marital-status' : strip,
                                    'occupation' : strip,
                                    'race' : strip,
                                    'sex' : strip,
                                    'hours-per-week' : strip,
                                    'prediction' : strip})

test_data = pd.read_csv('census.test', converters={'education' : strip,
                                    'marital-status' : strip,
                                    'occupation' : strip,
                                    'race' : strip,
                                    'sex' : strip,
                                    'hours-per-week' : strip,
                                    'prediction' : strip})
train_data = train_data.drop(['Unnamed: 0'], axis=1)
test_data = test_data.drop(['Unnamed: 0'], axis=1)
train_data

Unnamed: 0,age,people-represented,education,marital-status,occupation,race,sex,hours-per-week,prediction
0,39,77516,Bachelors,Never-married,Adm-clerical,White,Male,40,<=50K
1,50,83311,Bachelors,Married,Exec-managerial,White,Male,13,<=50K
2,38,215646,HS-grad,Divorced,Handlers-cleaners,White,Male,40,<=50K
3,53,234721,Some HS,Married,Handlers-cleaners,Black,Male,40,<=50K
4,28,338409,Bachelors,Married,Prof-specialty,Black,Female,40,<=50K
...,...,...,...,...,...,...,...,...,...
32556,27,257302,Associate,Married,Tech-support,White,Female,38,<=50K
32557,40,154374,HS-grad,Married,Machine-op-inspct,White,Male,40,>50K
32558,58,151910,HS-grad,Widowed,Adm-clerical,White,Female,40,<=50K
32559,22,201490,HS-grad,Never-married,Adm-clerical,White,Male,20,<=50K


In [4]:
income_boundary = train_data['prediction']
y = np.where(income_boundary == '<=50K',1,0)
train_data = train_data.drop(['prediction'], axis=1)


## Understanding the data
Before working with a dataset, I spend a little time getting to know it. I skim the table to get an idea of what types of variables are represented, what values they can take, and so on. One of the first things I investigate is the number of unique values in each column.

In [5]:
# iterate through the columns and count the unique values in each column
for column in train_data:
    print(column, len(np.unique(train_data[column])))

age 73
people-represented 21648
education 7
marital-status 4
occupation 15
race 5
sex 2
hours-per-week 94


This gives me a few insights into which techniques may be best for processing the data. For example, we can now see that sex has only two categories, so a single feature is enough to represent it.
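As a rough sketch, the same counts can also be read off with pandas' built-in `nunique`, shown here next to each column's dtype. The small `toy` dataframe is made up for illustration and only mimics a few of the census columns.

```python
import pandas as pd

# hypothetical toy data that mimics a few of the census columns
toy = pd.DataFrame({
    'age': [39, 50, 38, 53, 28],
    'sex': ['Male', 'Male', 'Male', 'Male', 'Female'],
    'race': ['White', 'White', 'White', 'Black', 'Black'],
})

# one row per column: its dtype and its number of distinct values
summary = pd.DataFrame({
    'dtype': toy.dtypes.astype(str),
    'n_unique': toy.nunique(),
})
print(summary)
```

Columns with very few unique values (like `sex`) are usually the categorical ones, while columns with many unique values tend to be continuous.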

## Dealing with categorical data
Before we can frame this problem properly, we need to deal with the categorical data.

There are a number of techniques for this, and the right one depends on the type of data.

The simplest technique is to enumerate all possible categories and replace each category with its numerical position. Fortunately, sklearn has a class for exactly this: [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)


Let's start with the simple case - sex

In [6]:
from sklearn.preprocessing import LabelEncoder as LE

num_encoder = LE()
encoded_train_data = train_data.copy()
#to convert into numbers
print('before encoding', np.unique(train_data['sex']))
encoded_train_data['sex'] = num_encoder.fit_transform(train_data['sex'])
print('after encoding', np.unique(encoded_train_data['sex']))

before encoding ['Female' 'Male']
after encoding [0 1]
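
To make the idea concrete, here is a minimal sketch of what label encoding amounts to, written with plain Python and pandas rather than sklearn. The toy `sex` series is made up, and sorting the categories alphabetically is an assumption chosen to match the output above.

```python
import pandas as pd

# hypothetical toy column
sex = pd.Series(['Male', 'Female', 'Female', 'Male'])

# enumerate the sorted categories: each category's position becomes its code
categories = sorted(sex.unique())
mapping = {category: position for position, category in enumerate(categories)}

# replace every value with the code of its category
encoded = sex.map(mapping)
print(mapping)
print(encoded.tolist())
```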


Of course we will want to do the same for the other categorical variables. **Write a function that does this for a given list of column names**

In [7]:
def le_categories(df, labels):
    # test to make sure labels are in dataframe
    assert set(labels).issubset(df.columns), "Labels not in column names"
    encoded_df = df.copy()
    ### STUDENT SOL START ###
    for label in labels:
        encoder = LE()
        encoded_df[label] = encoder.fit_transform(encoded_df[label])
    ### STUDENT SOL END ###
    return encoded_df
le_categories(train_data, ['sex', 'education', 'marital-status', 'occupation', 'race'])

Unnamed: 0,age,people-represented,education,marital-status,occupation,race,sex,hours-per-week
0,39,77516,1,2,1,4,1,40
1,50,83311,1,1,4,4,1,13
2,38,215646,3,0,6,4,1,40
3,53,234721,5,1,6,2,1,40
4,28,338409,1,1,10,2,0,40
...,...,...,...,...,...,...,...,...
32556,27,257302,0,1,13,4,0,38
32557,40,154374,3,1,7,4,1,40
32558,58,151910,3,3,1,4,0,40
32559,22,201490,3,2,1,4,1,20


### There's another way (and it's sometimes better!)
Another technique to encode categorical data is called **One Hot Encoding**. Essentially, you take every possible category of a categorical feature and make it its own binary feature.

One hot encoding can improve the performance of a classifier by removing the notion of order that the enumerated labels produced by label encoding imply. It does add more features, which can slow training, but other techniques exist to offset that cost. Now let's try this out with marital status.

In [8]:
# reset the data to undo the stuff we did in LE

In [13]:
#OHE EXAMPLE
from sklearn.preprocessing import OneHotEncoder as OHE


marstat_unique = np.unique(train_data['marital-status'])
# older versions of sklearn's OneHotEncoder only deal with numbers, so we run LE first
encoded_train_data = le_categories(train_data, ['marital-status'])
ohe_encoder = OHE()
# the encoder expects a 2D array, so take the values and reshape them into a single column
ohe_encoder.fit(encoded_train_data['marital-status'].values.reshape(-1, 1))
# check number of categories found
# (categories_ replaces the n_values_ attribute, which newer sklearn versions removed)
len(ohe_encoder.categories_[0])

4

In [14]:
# transform takes the marital-status column and returns the one hot encoded matrix
marstat_ohe = ohe_encoder.transform(encoded_train_data['marital-status'].values.reshape(-1, 1)).toarray()

# np.int was removed from recent NumPy releases; the builtin int does the same job
df = pd.DataFrame(marstat_ohe, dtype=int, columns=marstat_unique)
df

Unnamed: 0,Divorced,Married,Never-married,Widowed
0,0,0,1,0
1,0,1,0,0
2,1,0,0,0
3,0,1,0,0
4,0,1,0,0
...,...,...,...,...
32556,0,1,0,0
32557,0,1,0,0
32558,0,0,0,1
32559,0,0,1,0


Now let's join the two together 

In [15]:
# encoded_train_data = encoded_train_data.drop(['marital-status'], axis=1)
encoded_train_data = encoded_train_data.join(df)
encoded_train_data

Unnamed: 0,age,people-represented,education,marital-status,occupation,race,sex,hours-per-week,Divorced,Married,Never-married,Widowed
0,39,77516,Bachelors,2,Adm-clerical,White,Male,40,0,0,1,0
1,50,83311,Bachelors,1,Exec-managerial,White,Male,13,0,1,0,0
2,38,215646,HS-grad,0,Handlers-cleaners,White,Male,40,1,0,0,0
3,53,234721,Some HS,1,Handlers-cleaners,Black,Male,40,0,1,0,0
4,28,338409,Bachelors,1,Prof-specialty,Black,Female,40,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,257302,Associate,1,Tech-support,White,Female,38,0,1,0,0
32557,40,154374,HS-grad,1,Machine-op-inspct,White,Male,40,0,1,0,0
32558,58,151910,HS-grad,3,Adm-clerical,White,Female,40,0,0,0,1
32559,22,201490,HS-grad,2,Adm-clerical,White,Male,20,0,0,1,0


**Adapt the above example to work on a dataframe when given a list of column names**

In [None]:

def ohe_categories(df, labels):
    assert set(labels).issubset(df.columns), "Labels not in column names"
    # work on a copy of the dataframe that was passed in
    encoded_df = df.copy()
    ### STUDENT SOL START ###
    
    ### STUDENT SOL END ###
    return encoded_df
ohe_categories(train_data, ['sex', 'education', 'marital-status', 'occupation', 'race'])
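
One possible way to approach this exercise is sketched below using pandas' `get_dummies` instead of sklearn's `OneHotEncoder`. This is a swapped-in technique, not necessarily the intended solution. The function name `ohe_categories_sketch` and the toy dataframe are made up for illustration.

```python
import pandas as pd

def ohe_categories_sketch(df, labels):
    # hypothetical helper: one hot encode the given columns with pandas
    assert set(labels).issubset(df.columns), "Labels not in column names"
    # get_dummies removes each listed column and adds one 0/1 column
    # per category, named <column>_<category>
    return pd.get_dummies(df, columns=labels, dtype=int)

# made-up toy data
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'sex': ['Male', 'Male', 'Female'],
    'marital-status': ['Never-married', 'Married', 'Divorced'],
})
encoded = ohe_categories_sketch(toy, ['sex', 'marital-status'])
print(encoded.columns.tolist())
```

Prefixing each new column with the original column name avoids clashes when two features share a category name.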

There you go! Now you have two excellent techniques that you can use to convert pesky categorical data into something your classifier will *love*. 

Choosing which method to use is up to you, but here are a few caveats to help make your decision easier:

* Label encoding bakes in a notion of distance, so it is best suited to categorical data that has a natural order.
* If there is no notion of order, it's probably better to use one hot encoding, as this better reflects the Hamming distance (aka similarity) you'd expect between datapoints with such labels.
* One hot encoding comes with the trade-off of increased dimensionality, which can quickly increase the time your algorithm takes to train. This is especially noticeable for categorical data with many categories, such as occupation in this dataset.
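
The first two points can be sketched with the four marital-status categories, assuming they are enumerated alphabetically. With label encoding, the gap between codes differs from pair to pair. With one hot encoding, any two distinct categories differ in exactly two positions.

```python
import numpy as np

# label codes from enumerating the categories alphabetically (an assumption)
categories = ['Divorced', 'Married', 'Never-married', 'Widowed']
codes = np.arange(len(categories))

# absolute difference between label codes: depends on the arbitrary ordering
label_dist = np.abs(codes[:, None] - codes[None, :])

# one hot vectors are the rows of the identity matrix
one_hot = np.eye(len(categories), dtype=int)
# Hamming distance: number of positions in which two vectors differ
hamming_dist = (one_hot[:, None, :] != one_hot[None, :, :]).sum(axis=2)

print(label_dist)
print(hamming_dist)
```

Under label encoding, 'Divorced' ends up three times as far from 'Widowed' as from 'Married', which is an artifact of the alphabetical order rather than a property of the data.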

## Moving On - Binning
Now that we've explored categorical data, we can do some tricks to improve performance using continuous data as well. 

A popular and useful technique is called binning. The procedure involves breaking up a range of continuous numbers into equally sized chunks, then assigning each datapoint to its respective chunk. This significantly reduces the number of possible values and can also significantly increase performance in many cases.
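
As a minimal sketch of equal-width binning, pandas' `cut` can split a numeric column into a fixed number of equally wide intervals. The ages and the choice of 4 bins are made up for illustration.

```python
import pandas as pd

# hypothetical ages
ages = pd.Series([22, 27, 28, 38, 39, 41, 50, 53, 58])

# split the range of ages into 4 equal-width bins;
# labels=False returns the index of the bin instead of the interval
age_bin = pd.cut(ages, bins=4, labels=False)

print(pd.DataFrame({'age': ages, 'bin': age_bin}))
```

If bins holding roughly equal numbers of datapoints are preferred over bins of equal width, `pd.qcut` is the quantile-based counterpart.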