### Working with Structured Data

In this notebook, we'll train models on structured data (e.g., a CSV file) using the new [Datasets API](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/docs_src/programmers_guide/datasets.md) and Estimators. Together, these give you a way to efficiently train neural networks on potentially large amounts of data (e.g., more than could fit into memory). We'll demonstrate a number of different techniques you can use to represent your features, including bucketing, crossing, and embeddings. That's the interesting part of this notebook, and it's worth experimenting with.

### Neat optional bonus

You can explore this dataset with [Facets](https://github.com/pair-code/facets) - try the [online demo](https://pair-code.github.io/facets/) or optionally download and install.

### Before getting started

* This code requires TensorFlow ***version 1.3+*** (which, at the time of writing, has not been released). You can install the pre-release, v1.3rc2. See the [README](../README.md) for instructions.

### Tip

One last tip if you run this code multiple times: the estimators will automatically load and continue training from checkpoints, so if you modify the features and would like to start from scratch, be sure to delete ```./census``` (the saved models directory) first.
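
If you'd rather reset from code than from the shell, here is a minimal sketch. The `reset_model_dir` helper is a hypothetical name (it is not part of TensorFlow), and the demo runs on a throwaway temporary directory rather than your real `./census`.

```python
import os
import shutil
import tempfile

def reset_model_dir(model_dir):
    """Delete a saved-models directory if it exists; return True if removed."""
    if os.path.isdir(model_dir):
        shutil.rmtree(model_dir)
        return True
    return False

# Demonstrate on a throwaway directory laid out like ./census.
demo_dir = os.path.join(tempfile.mkdtemp(), 'census')
os.makedirs(os.path.join(demo_dir, 'linear'))
removed = reset_model_dir(demo_dir)
print('removed:', removed)
```

In the notebook itself you would call `reset_model_dir('./census')` before re-creating the estimators.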

## Imports

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections

import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import tensorflow as tf
print('This code requires TensorFlow v1.3+')
print('You have:', tf.__version__)

### Download the dataset

Here, we'll work with the [Adult dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/old.adult.names). Our task is to predict whether a given adult makes more than $50,000 a year, based on attributes such as their occupation and the number of hours they work per week. The code presented here can serve as a starting point for a problem you care about.

In [None]:
census_train_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
census_train_path = tf.contrib.keras.utils.get_file('census.train', census_train_url)
census_test_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'
census_test_path = tf.contrib.keras.utils.get_file('census.test', census_test_url)

### Load the data with Pandas

In [None]:
column_names = [
  'age', 'workclass', 'fnlwgt', 'education', 'education-num',
  'marital-status', 'occupation', 'relationship', 'race', 'gender',
  'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
  'income'
]

census_train = pd.read_csv(census_train_path, index_col=False, names=column_names)
# The test file starts with a non-data line, so skip it.
census_test = pd.read_csv(census_test_path, index_col=False, names=column_names, skiprows=1)

# Convert the label column to true/false.
# Values keep the space after each comma, and test labels end with a period.
census_train_label = census_train.pop('income') == " >50K"
census_test_label = census_test.pop('income') == " >50K."

In [None]:
print("Training examples: %d" % census_train.shape[0])
print("Test examples: %d" % census_test.shape[0])

In [None]:
# handy method to preview the data
census_train.head(10)

In [None]:
# here's how the label looks after we've converted it
census_train_label[:5]

### Prepare to train a linear model
We'll train a logistic regression model to start, then we'll use a DNN. There are different considerations in how you represent your features for linear and deep models.

## Use the Datasets API to write an Input Function

We'll use the same input functions for both the linear and deep model. Note, we could also have used the [pandas input function](https://www.tensorflow.org/api_docs/python/tf/estimator/inputs/pandas_input_fn), but the Datasets API has a nicer interface and scales better - you should use Datasets moving forward.
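
One reason a streaming pipeline scales is that it only ever holds a small window of the data in memory. The pure-Python sketch below illustrates the general idea of a shuffle buffer followed by batching. It is not TensorFlow's implementation, and the `shuffle_buffer` and `batches` helpers are hypothetical names introduced for illustration.

```python
import itertools
import random

def shuffle_buffer(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Emit a random buffered item and put the new one in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    for item in buffer:
        yield item

def batches(items, batch_size):
    """Group a stream into lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch

rows = range(10)  # stand-in for lines read lazily from a file
result = list(batches(shuffle_buffer(rows, buffer_size=4), batch_size=3))
print(result)
```

Note that the first item emitted can only come from the first `buffer_size` rows. A buffer that is small relative to the dataset therefore shuffles only locally, which is why the `buffer_size` you pass to `shuffle` deserves some care.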

There's some additional code below to handle small quirks in the format of the CSV files we're using; see the comments.
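
To make those quirks concrete, here is a rough pure-Python analogue of the same steps: skip a non-data first line, drop empty lines, turn each row into a dictionary, and separate the label. The `parse_lines` helper and the sample rows are made up for illustration. The rows only mimic the layout of the files, in which every comma is followed by a space.

```python
COLUMNS = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
    'income',
]

def parse_lines(lines, skip_first=False, positive_label=' >50K'):
    """Yield (features, label) pairs, ignoring blank lines."""
    for number, line in enumerate(lines):
        if skip_first and number == 0:
            continue  # e.g. a non-data first line
        if not line.strip():
            continue  # e.g. a trailing empty line
        # Split on ',' only, so the space after each comma stays in the value.
        row = dict(zip(COLUMNS, line.rstrip('\n').split(',')))
        label = row.pop('income') == positive_label
        yield row, label

# Hypothetical rows in the same layout as the census files.
sample = [
    'not a data row\n',
    '30, Private, 1000, Bachelors, 13, Never-married, Sales, '
    'Not-in-family, White, Male, 0, 0, 40, United-States, <=50K\n',
    '45, Private, 2000, Masters, 14, Married-civ-spouse, Exec-managerial, '
    'Wife, White, Female, 0, 0, 50, United-States, >50K\n',
    '\n',
]
parsed = list(parse_lines(sample, skip_first=True))
for features, label in parsed:
    print(repr(features['sex']), label)
```

Because the values keep their leading space, anything compared against them (labels, vocabulary entries) has to include that space as well.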

In [None]:
# Specify default values for each of the CSV columns.
csv_defaults = collections.OrderedDict([
  ('age',[0]),
  ('workclass',['']),
  ('fnlwgt',[0]),
  ('education',['']),
  ('education-num',[0]),
  ('marital-status',['']),
  ('occupation',['']),
  ('relationship',['']),
  ('race',['']),
  ('sex',['']),
  ('capital-gain',[0]),
  ('capital-loss',[0]),
  ('hours-per-week',[0]),
  ('native-country',['']),
  ('income',['']),
])

# Here's how we'll decode each line
def csv_decoder(line):
    """Converts a CSV row to a dictionary containing each feature."""
    parsed = tf.decode_csv(line, list(csv_defaults.values()))
    return dict(zip(csv_defaults.keys(), parsed))

# The train file has an extra empty line at the end.
# We want to ignore this.
def filter_empty_lines(line):
    """Returns True if the line is non-empty and False otherwise."""
    return tf.not_equal(tf.size(tf.string_split([line], ',').values), 0)

def train_input_fn(path):
    def input_fn():    
        dataset = (
            tf.contrib.data.TextLineDataset(path)  # create dataset from disk file
                .filter(filter_empty_lines)  # ignore empty lines
                .map(csv_decoder)  # get values on the csv row
                .shuffle(buffer_size=1000)  # shuffle the dataset, careful with the buffer_size
                .repeat()  # repeat the dataset indefinitely
                .batch(32))  # batch the data

        # create iterator
        columns = dataset.make_one_shot_iterator().get_next()
        
        # separate the label and convert it to true/false
        income = tf.equal(columns.pop('income')," >50K") 
        return columns, income
    return input_fn

def test_input_fn(path):
    def input_fn():    
        dataset = (
            tf.contrib.data.TextLineDataset(path)  # create dataset from disk file
                .skip(1) # the test file has a strange first line, we want to ignore this.
                .filter(filter_empty_lines)  # ignore empty lines
                .map(csv_decoder)  # get values on the csv row
                .batch(32))  # batch the data

        # create iterator
        columns = dataset.make_one_shot_iterator().get_next()
        
        # separate the label and convert it to true/false
        # (labels in the test file end with a period)
        income = tf.equal(columns.pop('income'), " >50K.")
        return columns, income
    return input_fn

### Specify the features we'll use and how we'd like them represented

Our goal here is to demonstrate how to work with different types of features, rather than to aim for an accurate model. Here are five different types we'll use:
* A *numeric_column*. This is just a real-valued attribute.
* A *bucketized_column*. TensorFlow buckets a numeric column for us, using the boundaries we provide.
* A *categorical_column_with_vocabulary_list*. This is just a categorical column, where you know the possible values in advance. This is useful when you have a small number of possibilities.
* A *categorical_column_with_hash_bucket*. This is a useful way to represent categorical features when you have a large number of possible values. Beware of hash collisions.
* A *crossed_column*. Linear models cannot learn interactions between features on their own, so we'll ask TensorFlow to cross features for us.

Using these can be nicer than having to manually preprocess your data in many different ways as you experiment. You can see a more extensive example of them in action [here](https://github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/examples/learn/wide_n_deep_tutorial.py) (note: this code hasn't yet been updated for v1.3), and you can read more about feature columns [here](https://www.tensorflow.org/api_docs/python/tf/contrib/layers/feature_column) and [here](https://www.tensorflow.org/tutorials/wide_and_deep).
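
To build some intuition for what these columns compute, here is a rough pure-Python sketch of bucketing, vocabulary lookup, hashing, and crossing. It is an approximation for illustration only: the helper names are hypothetical, and MD5 is a stand-in for whatever hash function TensorFlow uses internally, so the bucket ids will not match TensorFlow's.

```python
import bisect
import hashlib

AGE_BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def bucketize(value, boundaries):
    """Index of the bucket holding value; each bucket includes its left edge."""
    return bisect.bisect_right(boundaries, value)

def vocabulary_id(value, vocabulary):
    """Position of value in a known vocabulary, or -1 if it is absent."""
    return vocabulary.index(value) if value in vocabulary else -1

def hash_bucket(value, num_buckets):
    """Map a string to one of num_buckets ids (MD5 is a stand-in hash)."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets

def cross(values, num_buckets):
    """Hash a combination of feature values into a single bucket id."""
    return hash_bucket('_X_'.join(str(v) for v in values), num_buckets)

print(bucketize(17, AGE_BOUNDARIES), bucketize(18, AGE_BOUNDARIES))
print(vocabulary_id('b', ['a', 'b']), vocabulary_id('z', ['a', 'b']))
print(hash_bucket('some-country', 1000))
print(cross([bucketize(13, list(range(10))), 'b'], int(1e4)))
```

Ten boundaries produce eleven buckets, since values below the first boundary and values at or above the last one each get a bucket of their own. Two different strings can land in the same hash bucket, which is the collision risk mentioned above; a larger `num_buckets` makes that less likely at the cost of more parameters.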

In [None]:
# a numeric feature
hour_per_week = tf.feature_column.numeric_column('hours-per-week')

# a bucketed feature
# the boundaries can be generated (as here)
# or listed explicitly (as for age, below)
education_num = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('education-num'), 
    list(range(10))
)

age_buckets = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('age'), 
    boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
)

# a categorical feature with a known list of values
# (entries must match the raw CSV values exactly, including the leading space)
gender = tf.feature_column.categorical_column_with_vocabulary_list('sex', [' Male', ' Female'])

# a categorical feature with a possibly large number of values
# beware of hash collisions
native_country = tf.feature_column.categorical_column_with_hash_bucket('native-country', 1000)

# a crossed column
education_num_x_gender = tf.feature_column.crossed_column(
    [education_num, gender],
    hash_bucket_size=int(1e4)
)

# these are the features we'll use for our linear model
feature_columns = [
    hour_per_week,
    education_num,
    age_buckets,
    gender,
    native_country,
    education_num_x_gender
]

In [None]:
train_input = train_input_fn(census_train_path)
test_input = test_input_fn(census_test_path)

The next two blocks contain some code you can use to debug your input functions. It won't normally be necessary to create a Session and run them manually.

In [None]:
training_batch = train_input()
with tf.Session() as sess:
    features, label = sess.run(training_batch)
    print(features['education'])
    print(label)

In [None]:
testing_batch = test_input()
with tf.Session() as sess:
    features, label = sess.run(testing_batch)
    print(features['education'])
    print(label)

### Train and Evaluate a Canned Linear Estimator

In [None]:
estimator = tf.estimator.LinearClassifier(feature_columns, model_dir='census/linear', n_classes=2)

In [None]:
estimator.train(train_input, steps=1000)

In [None]:
estimator.evaluate(test_input)

### Add an embedding feature(!)

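Instead of feeding the hash-bucketed categorical feature to the model directly, why not learn an embedding for it? (Cool, right?) We'll also update how the other features are represented for our deep model: a DNN needs dense inputs, so categorical columns are wrapped in an indicator (one-hot) column or an embedding column.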
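
As a rough illustration of the difference, the pure-Python sketch below compares a one-hot (indicator) vector with a row looked up from an embedding table. The helper names and the bucket id are hypothetical, and the table is filled with random values here, whereas in the real model those values are learned during training.

```python
import random

def one_hot(index, size):
    """Indicator representation: a vector of zeros with a single 1."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def make_embedding_table(num_buckets, dimension, seed=0):
    """Randomly initialised table; training would adjust these values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dimension)]
            for _ in range(num_buckets)]

# Same sizes as the notebook: 1000 hash buckets, 10 embedding dimensions.
table = make_embedding_table(num_buckets=1000, dimension=10)
bucket_id = 42  # hypothetical id for one category
sparse = one_hot(bucket_id, 1000)
dense = table[bucket_id]
print(len(sparse), len(dense))
```

The embedding gives the network a short dense vector per category instead of a long, mostly-zero one, and categories that behave similarly can end up with similar vectors.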

In [None]:
feature_columns = [
    tf.feature_column.numeric_column('education-num'),
    tf.feature_column.numeric_column('hours-per-week'),
    tf.feature_column.bucketized_column(
        tf.feature_column.numeric_column('age'),
        boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
    ),
    # vocabulary entries must match the raw CSV values exactly,
    # including the space that follows each comma
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list('sex', [' Male', ' Female'])),
    tf.feature_column.embedding_column(  # now using embedding!
        tf.feature_column.categorical_column_with_hash_bucket('native-country', 1000), 10)
]

In [None]:
# restart the input functions
train_input = train_input_fn(census_train_path)
test_input = test_input_fn(census_test_path)

### Train and evaluate a deep model

In [None]:
estimator = tf.estimator.DNNClassifier(hidden_units=[128,128], 
                                       feature_columns=feature_columns, 
                                       n_classes=2, 
                                       model_dir='census/dnn')

In [None]:
estimator.train(train_input, steps=5000)

In [None]:
# Feel free to experiment with features and params!
# If you'd like to learn how to train a joint model,
# check out the wide and deep tutorial above.
estimator.evaluate(test_input)