Now that we've seen how to create a multi-layer neural network to classify MNIST images, let's see if we can use neural networks for a different problem space and dataset. As we mentioned in the PPT, we're going to predict income levels (<=50K or >50K) based on census data.

In [159]:
import tensorflow as tf
import numpy as np
import pandas as pd # Run 'pip install pandas' in your terminal if you get an error here

# Visualizing Our Data

As a bit of a disclaimer, we did a little preprocessing on this dataset to make it easier for you all. We downloaded the data from the [website](https://archive.ics.uci.edu/ml/datasets/Census+Income), converted it to CSV files, added column names, and removed some uninterpretable columns (like fnlwgt).

In [160]:
trainSet = pd.read_csv('Data/train.csv')
testSet = pd.read_csv('Data/test.csv')

In [161]:
trainSet.head()

Unnamed: 0,Age,Work Class,Education,Education Number,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours Per Week,Native Country,Income
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [162]:
numTrainExamples = trainSet.shape[0]
numTestExamples = testSet.shape[0]
assert trainSet.shape[1] == testSet.shape[1] # Make sure that both have the same number of features

# Cleaning Our Dataset

For those of you who were here for our Kaggle hack session, you'll know that cleaning the dataset is a large component of any successful machine learning pipeline. Here's a general pipeline for those who couldn't make that workshop.

1) Determine your problem space. Do you have a classification problem, or a regression problem?

2) Determine what model you want to use (Always good to start off with simple models).

3) Load in and preprocess your dataset. Examine your dataset to see if there are any NULL or non-numeric values.

4) Split up your dataset into training and testing components.

5) Create your model. This entails defining your function, your placeholders, the loss function, and the optimizer.

6) Train, evaluate, and iterate on your model!
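
Step 3 of the pipeline above can be sketched with a few pandas calls. The tiny DataFrame below is made-up stand-in data (not the census file), with one missing value and a `' ?'` placeholder like the one the census data uses.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data; in the notebook you would inspect trainSet instead
df = pd.DataFrame({
    'Age': [39, 50, np.nan, 53],
    'Work Class': [' State-gov', ' ?', ' Private', ' Private'],
    'Hours Per Week': [40, 13, 40, 40],
})

# Count missing (NaN) entries in each column
null_counts = df.isnull().sum()

# Columns that are not numeric still need to be converted to numbers
non_numeric_cols = df.select_dtypes(exclude='number').columns.tolist()

# Some datasets mark missing values with a placeholder string instead of NaN
placeholder_count = (df['Work Class'].str.strip() == '?').sum()

print(null_counts)
print(non_numeric_cols)
print(placeholder_count)
```

Note that `isnull()` alone would miss the `' ?'` entries, which is why it is worth looking at the actual values of each string column as well.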

So, the first thing we want to do is decide which columns to drop or preprocess. Machine learning models expect numeric inputs, not strings. As you can see in our dataset, we have a lot of categorical columns (columns whose values are strings). Education, marital status, and occupation are some examples. We want to convert these into numbers. One way to do so is by creating a mapping from each string value to an integer. Let's create one for Sex.

In [163]:
# The space before the male and female is because that's how the values are written in the CSV -_-
# Perfect example of the type of data cleaning you need to do with ML problems
mapping = {
    ' Female': 1,
    ' Male': 0
}
trainSet['Sex'] = trainSet['Sex'].map(mapping)

Now let's see how our dataframe changed.

In [164]:
trainSet.head()

Unnamed: 0,Age,Work Class,Education,Education Number,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours Per Week,Native Country,Income
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,0,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,0,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,0,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,0,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,1,0,0,40,Cuba,<=50K


Let's do the same for Work Class. Since it's not clear how many distinct values that column has, let's use the value_counts() function to find out.

In [165]:
trainSet['Work Class'].value_counts()

 Private             20945
 Self-emp-not-inc     2367
 Local-gov            1931
 ?                    1680
 State-gov            1194
 Self-emp-inc         1040
 Federal-gov           891
 Without-pay            13
 Never-worked            5
Name: Work Class, dtype: int64

Now, let's create our mapping. Right now, we're just hard-coding the mapping for readability's sake, but there are more Pythonic ways of doing this.

In [166]:
mapping = {
    ' ?': 0,
    ' Private': 1,
    ' Self-emp-not-inc': 2,
    ' Local-gov': 3,
    ' State-gov': 4,
    ' Self-emp-inc': 5,
    ' Federal-gov': 6,
    ' Without-pay': 7,
    ' Never-worked': 8,
}
trainSet['Work Class'] = trainSet['Work Class'].map(mapping)
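
The hard-coded dictionary above could also be built programmatically. Here is a minimal sketch, assuming a small made-up Series in place of `trainSet['Work Class']`. It numbers categories in order of frequency, so the integer codes it produces will not match the hand-written mapping above.

```python
import pandas as pd

# Made-up stand-in for trainSet['Work Class'] (note the leading spaces)
work_class = pd.Series([' Private', ' Private', ' State-gov', ' ?', ' Private', ' State-gov'])

# value_counts() returns categories sorted by frequency (most frequent first),
# so enumerate() assigns 0 to the most common category, 1 to the next, and so on
mapping = {label: code for code, label in enumerate(work_class.value_counts().index)}

encoded = work_class.map(mapping)
print(mapping)
print(encoded.tolist())
```

Whichever way you build the mapping, build it once from the training set and reuse the same dictionary for the test set, so that a given integer always means the same category.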

Let's make sure the values got updated in our dataframe. 

In [167]:
trainSet.head()

Unnamed: 0,Age,Work Class,Education,Education Number,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours Per Week,Native Country,Income
0,39,4,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,0,2174,0,40,United-States,<=50K
1,50,2,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,0,0,0,13,United-States,<=50K
2,38,1,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,0,0,0,40,United-States,<=50K
3,53,1,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,0,0,0,40,United-States,<=50K
4,28,1,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,1,0,0,40,Cuba,<=50K


**TODO: Do the same data preprocessing for Marital Status, Occupation, Relationship, Race, and Native Country**

Just for simplicity's sake, we're going to drop Education (because Education Number conveys the same info), Capital Gain, and Capital Loss. But if you have time at the end of the workshop, try preprocessing these columns and see if they help your model out!

# Create Training/Testing Matrices

So, now that we've made our final changes to our dataframe, we want to convert it into a matrix of numbers. We want our Y matrix to be filled with binary labels indicating whether the person has an income of at most 50K or more than 50K. Our X matrix should contain all of the features that represent each individual.

In [168]:
# Preprocessing the Income
mapping = {
    ' <=50K': 1,
    ' >50K': 0
}
trainSet['Income'] = trainSet['Income'].map(mapping)

In [169]:
trainSet.head()

Unnamed: 0,Age,Work Class,Education,Education Number,Marital Status,Occupation,Relationship,Race,Sex,Capital Gain,Capital Loss,Hours Per Week,Native Country,Income
0,39,4,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,0,2174,0,40,United-States,1
1,50,2,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,0,0,0,13,United-States,1
2,38,1,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,0,0,0,40,United-States,1
3,53,1,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,0,0,0,40,United-States,1
4,28,1,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,1,0,0,40,Cuba,1


In [170]:
# TODO: AS YOU START PREPROCESSING MORE COLUMNS, ADD THEM HERE SO THEY CAN BE PART OF YOUR X MATRIX
X = trainSet[['Age', 'Work Class', 'Education Number', 'Sex', 'Hours Per Week']].values
Y = trainSet['Income'].values
# Turn Y into one hot vectors instead of single labels like 0 or 1
Y = pd.get_dummies(trainSet['Income']).values
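
To see what the one-hot step produces, here is a small sketch on made-up labels. `pd.get_dummies` creates one column per distinct label value, ordered by value, so with the Income mapping above column 0 corresponds to label 0 (>50K) and column 1 to label 1 (<=50K).

```python
import pandas as pd

# Made-up labels using the same encoding as above: 1 means <=50K, 0 means >50K
labels = pd.Series([1, 1, 0, 1])

# One column per distinct label; cast to int because newer pandas versions return booleans
one_hot = pd.get_dummies(labels).astype(int).values
print(one_hot)
```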

# Preprocessing Our Test Set

An important note is that we have to do the same preprocessing for our test set. 

In [171]:
testSet['Sex'] = testSet['Sex'].map({' Female': 1,' Male': 0})
testSet['Work Class'] = testSet['Work Class'].map({
    ' ?': 0,
    ' Private': 1,
    ' Self-emp-not-inc': 2,
    ' Local-gov': 3,
    ' State-gov': 4,
    ' Self-emp-inc': 5,
    ' Federal-gov': 6,
    ' Without-pay': 7,
    ' Never-worked': 8,
})
testSet['Income'] = testSet['Income'].map({' <=50K.': 1,' >50K.': 0})
XTest = testSet[['Age', 'Work Class', 'Education Number', 'Sex', 'Hours Per Week']].values
YTest = testSet['Income'].values
# Turn Y into one hot vectors instead of single labels like 0 or 1
YTest = pd.get_dummies(testSet['Income']).values
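
One way to keep the two sets in sync is to put the shared steps in a function and apply it to both. The sketch below uses a hypothetical helper, `encode_column`, and made-up data. It also checks for values the mapping did not cover, since `map` silently turns those into NaN (the test file's Income labels, which end in a period, are a good example of how that can happen).

```python
import pandas as pd

def encode_column(df, column, mapping):
    """Hypothetical helper: map a string column to integers, failing loudly on unknown values."""
    encoded = df[column].map(mapping)
    unmapped = df.loc[encoded.isnull(), column].unique().tolist()
    if unmapped:
        raise ValueError('Unmapped values in %r: %r' % (column, unmapped))
    df[column] = encoded.astype(int)
    return df

sex_mapping = {' Female': 1, ' Male': 0}

# Made-up stand-ins for trainSet and testSet
train = pd.DataFrame({'Sex': [' Male', ' Female', ' Male']})
test = pd.DataFrame({'Sex': [' Female', ' Male']})

train = encode_column(train, 'Sex', sex_mapping)
test = encode_column(test, 'Sex', sex_mapping)
print(train['Sex'].tolist(), test['Sex'].tolist())
```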

# Create Model

Now that we have all of our data loaded in and preprocessed, we can start on creating our model. This component is pretty open ended. You have the freedom to choose whichever model you'd like to create. You can go with logistic regression or use the neural network techniques we've been learning about. If you need inspiration, take a look at our past notebooks. 

- Think about what types of objects you'll need to create: placeholders, variables, optimizers, etc.
- Think back to the different intermediate calculations we needed to make. 

In [172]:
# TODO Create your model here

# Hint: Create placeholders

# Hint: Define hyperparameters

# Hint: Create weight matrices and bias variables

# Hint: Compute intermediate values like h1 and h2

# Hint: Define loss function and optimizer

# The below code can help you see your accuracy during training. If you use the below two 
# lines, make sure that your variable names match up. 
#correct_predictions = tf.equal(tf.argmax(yPred, 1), tf.argmax(y_, 1)) 
#accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32))
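
Before writing the TensorFlow version, it can help to check the shapes with plain NumPy. The sketch below is not the workshop's solution. It is a forward pass for one possible architecture (a single hidden layer of 16 units, chosen arbitrarily) on random stand-in data with 5 features and 2 classes, matching the shapes of `X` and `Y` above.

```python
import numpy as np

rng = np.random.RandomState(0)

numFeatures, numHidden, numClasses = 5, 16, 2  # numHidden is an arbitrary choice
batch = rng.rand(8, numFeatures)               # stand-in for 8 rows of X

# Weight matrices and bias vectors (randomly initialized, untrained)
W1 = rng.randn(numFeatures, numHidden) * 0.1
b1 = np.zeros(numHidden)
W2 = rng.randn(numHidden, numClasses) * 0.1
b2 = np.zeros(numClasses)

h1 = np.maximum(0, batch.dot(W1) + b1)         # hidden layer with ReLU activation
logits = h1.dot(W2) + b2                       # one score per class

# Softmax turns scores into probabilities; subtracting the row max avoids overflow
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
yPred = exp / exp.sum(axis=1, keepdims=True)

print(yPred.shape)
```

In the TensorFlow graph, `batch` would become a placeholder, the weights and biases would become variables, and the loss and optimizer would be defined on top of the logits.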

# Train Model 

Now that you've created your model by defining your computational graph, you're ready to start training the model. Remember that training a model basically means that we want to run our optimizer object over different parts of our training dataset. A few other reminders:

- Remember to create a TensorFlow session and initialize all of your variables
- Run your optimizer object at every iteration
- Keep track of how your model is doing every now and again

In [173]:
numIterations = 1000 # Adjust this number as you see fit!
# TODO Create session and initialize variables
for i in range(numIterations):
    ...
    # TODO Run optimizer object over your data
    # TODO Check accuracy every once in a while
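
"Different parts of our training dataset" usually means mini-batches. Here is a minimal sketch, assuming a hypothetical helper `getBatch` and random stand-in arrays in place of `X` and `Y`. Inside the training loop, the returned arrays would be what you feed to your placeholders.

```python
import numpy as np

def getBatch(X, Y, batchSize, rng):
    """Hypothetical helper: sample batchSize rows (without replacement) from X and Y together."""
    indices = rng.choice(X.shape[0], size=batchSize, replace=False)
    return X[indices], Y[indices]

rng = np.random.RandomState(0)
X = rng.rand(100, 5)                      # stand-in features
Y = np.eye(2)[rng.randint(0, 2, 100)]     # stand-in one-hot labels

batchX, batchY = getBatch(X, Y, batchSize=32, rng=rng)
print(batchX.shape, batchY.shape)
```

Using the same indices for both arrays keeps each row of features paired with its own label.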

# Test Model

By now, you have a trained model and you're ready to test your model! We want to now see how our model does on data that it has never seen before. We want to compute our predictions for the test set. 

- No need to initialize variables or anything. Everything is already trained! We just want to compute our predictions for this new set of data.

In [174]:
# TODO Compute the predictions for the testing set 
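
Once you have predicted probabilities for the test set, accuracy is the fraction of rows where the most likely class matches the label. Here is a NumPy sketch on made-up values; it is the same computation as the commented-out `tf.argmax` lines in the model cell.

```python
import numpy as np

# Made-up predicted probabilities and one-hot labels for four examples
yPred = np.array([[0.2, 0.8], [0.6, 0.4], [0.1, 0.9], [0.7, 0.3]])
yTrue = np.array([[0, 1], [0, 1], [0, 1], [1, 0]])

# argmax picks the most likely class per row; compare it to the true class
correct = np.argmax(yPred, axis=1) == np.argmax(yTrue, axis=1)
accuracy = correct.mean()
print(accuracy)
```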