TODO: 
* Switch datasets (goodbye Census)

##### Copyright 2018 The TensorFlow Authors.

# Classify structured data using feature columns


<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/tutorials/estimators/linear"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/structured_data/feature_cols_keras.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/structured_data/feature_cols_keras.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

This tutorial demonstrates how to classify structured data. We will use [tf.keras](https://www.tensorflow.org/guide/keras) to define our model, and [feature columns](https://www.tensorflow.org/guide/feature_columns) to describe how the data should be represented.

## Overview

Using [census data](https://archive.ics.uci.edu/ml/datasets/Census+Income), which contains a person's age, education, marital status, and occupation (the *features*), we will try to predict whether the person earns more than 50,000 dollars a year (the target *label*).

We will train a neural network that, given an individual's information, outputs a number between 0 and 1. This can be interpreted as the probability that the individual has an annual income of over 50,000 dollars.

Key Point: As a modeler and developer, think about how this data is used and the potential benefits and harm a model's predictions can cause. A model built on a dataset like this could reinforce societal biases and disparities. Is each feature relevant to the problem you want to solve, or will it introduce bias? For more information, read about [ML fairness](https://developers.google.com/machine-learning/fairness-overview/).

## Setup

Import TensorFlow, feature columns, and supporting libraries.

In [0]:
import pandas as pd
import tensorflow as tf

from sklearn.model_selection import train_test_split
from tensorflow.python.feature_column import feature_column_v2 as fc

Let's enable eager execution for easier debugging. As of TensorFlow v2.0 (coming in 2019), this will be enabled by default.

In [0]:
tf.enable_eager_execution()

## Download the Census dataset

We will use a version of this dataset that has been lightly preprocessed for consistent formatting, in order to minimize the preprocessing code in this tutorial.

In [0]:
URL = 'https://storage.googleapis.com/applied-dl/uci_census_cleaned.csv'
data = tf.keras.utils.get_file('uci_census_cleaned.csv', URL)

## Use Pandas to load and preprocess the data

[Pandas](https://pandas.pydata.org/) is an open-source Python library with many helpful utilities for loading and working with structured data. We will use Pandas in this tutorial to load and preprocess the census dataset before classifying it with TensorFlow.

In [4]:
# `data` is the local path returned by tf.keras.utils.get_file above
dataframe = pd.read_csv(data)
dataframe.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


The last column in the above output (income_bracket) is the label we will predict. Notice it is represented as a string. We will use Pandas to convert it to a number (0.0 or 1.0). This datatype will be needed later by our classifier.

In [5]:
dataframe['income_bracket'] = dataframe['income_bracket'].map(lambda x: x == '>50K')
dataframe['income_bracket'] = dataframe['income_bracket'].astype(float)
dataframe.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0.0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0.0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0.0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0.0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0.0


## Split the dataset into train, validation, and test

The dataset we downloaded was a single CSV file. We will split this first into train and test sets, and next into train and validation sets.

In [6]:
train, test = train_test_split(dataframe, test_size=0.1)
train, val = train_test_split(train, test_size=0.1)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

39561 train examples
4396 validation examples
4885 test examples
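
The counts above follow from applying a 10% split twice. As a rough check, the arithmetic can be reproduced with a small helper. Note that `split_sizes` is a hypothetical function written for this illustration, and it assumes that a fractional `test_size` is rounded up to a whole number of rows.

```python
import math

def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test) for one split, rounding the held-out share up."""
    # Assumption: the held-out share is rounded up to a whole number of rows.
    n_test = math.ceil(test_fraction * n_rows)
    return n_rows - n_test, n_test

total_rows = 39561 + 4396 + 4885  # row counts printed above

# First split: hold out 10% of all rows for testing.
remaining, n_test = split_sizes(total_rows, 0.1)
# Second split: hold out 10% of what is left for validation.
n_train, n_val = split_sizes(remaining, 0.1)

print(n_train, 'train examples')
print(n_val, 'validation examples')
print(n_test, 'test examples')
```

Because the second split is taken from the 90% that remains, the validation set is about 9% of the original data and the training set about 81%.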


## Create an input pipeline using tf.data

Next, we will wrap our Pandas dataframes with [tf.data](https://www.tensorflow.org/guide/datasets) datasets. This lets us use feature columns as a bridge that maps the columns in the dataframe to the features used to train our model.

In [0]:
# Convert a Pandas dataframe into a tf.data dataset of (features, labels) batches
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('income_bracket')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.repeat().batch(batch_size)
  return ds
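
To make the order of operations in `df_to_dataset` concrete, here is a plain-Python sketch of a shuffle, repeat, batch pipeline. It is only an illustration of the idea and not how tf.data is implemented; `element_stream` and `batches` are hypothetical helpers.

```python
import itertools
import random

def element_stream(rows, shuffle, rng):
    """Yield rows forever, reshuffling before each pass over the data."""
    rows = list(rows)
    while True:  # plays the role of repeat(): the stream never ends
        if shuffle:
            rng.shuffle(rows)
        yield from rows

def batches(rows, batch_size, shuffle=True, seed=0):
    """Group the endless stream of rows into lists of `batch_size` rows."""
    stream = element_stream(rows, shuffle, random.Random(seed))
    while True:
        yield list(itertools.islice(stream, batch_size))

# With shuffling off, the third batch wraps around to the start of the data.
first_three = list(itertools.islice(batches(range(7), 3, shuffle=False), 3))
print(first_three)  # [[0, 1, 2], [3, 4, 5], [6, 0, 1]]
```

Because repeating happens before batching, every batch is full and the stream has no natural end. That is why the training code later passes `steps_per_epoch` to tell Keras how many batches make up one epoch.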

Create our datasets. We'll use a tiny batch size so that the output of each feature column is easy to read when we demo them below.

In [0]:
batch_size = 5
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

Let's take a look at what the train dataset returns.

In [9]:
for feature_batch, label_batch in train_ds.take(1):
  print('All features:', list(feature_batch.keys()))
  print('A batch of ages:', feature_batch['age'])
  print('A batch of labels:', label_batch)

All features: ['age', 'workclass', 'education', 'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'gender', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country']
A batch of ages: tf.Tensor([42 24 29 37 53], shape=(5,), dtype=int32)
A batch of labels: tf.Tensor([0. 1. 1. 0. 1.], shape=(5,), dtype=float64)


Finally, we'll retrieve a single batch of data from the training dataset, and keep this in memory. We'll use this batch to demo a few different types of feature columns below.

In [0]:
# A batch of data
# We'll use this to show the output of various
# feature columns.
example_batch = list(train_ds.take(1))[0][0]

## Create feature columns
Next, we'll create a few different types of feature columns, and demonstrate what they return when called on an example batch of data, using the helper method below.

In [0]:
# Call a feature column on a batch of data and show the result
def demo(feature_column):
  feature_layer = fc.FeatureLayer([feature_column])
  print(feature_layer(example_batch).numpy())

### Numeric features
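First, we'll use a plain numeric feature. A numeric column is the simplest type of feature column: it passes the value from the dataframe to the model unchanged, as a floating-point number. It suits real-valued features such as age.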

In [12]:
age = fc.numeric_column("age")
demo(age)

[[44.]
 [20.]
 [51.]
 [37.]
 [43.]]


### Bucketized features
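Often, you don't want to feed a number directly into the model, but instead split its value into categories based on numerical ranges. A bucketized column does this: given a numeric column and a sorted list of boundaries, it outputs a one-hot vector that indicates which range each value falls into. The ten boundaries used here define eleven buckets, which is why each row in the output has eleven entries.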

In [13]:
age_buckets = fc.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
demo(age_buckets)

[[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]
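
The mapping from an age to its one-hot row can be sketched in plain Python with the standard `bisect` module. This is an illustration of the bucketing rule, not the TensorFlow implementation; `bucketize_one_hot` is a hypothetical helper.

```python
import bisect

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def bucketize_one_hot(value, boundaries):
    """One-hot encode `value` by the bucket it falls into.

    The buckets are (-inf, b0), [b0, b1), ..., [b_last, +inf), so there is
    one more bucket than there are boundaries.
    """
    # bisect_right counts the boundaries that are <= value, which is
    # exactly the index of the bucket.
    index = bisect.bisect_right(boundaries, value)
    one_hot = [0.0] * (len(boundaries) + 1)
    one_hot[index] = 1.0
    return one_hot

ages = [44, 20, 51, 37, 43]  # the ages from the numeric example above
for age_value in ages:
    print(bucketize_one_hot(age_value, boundaries))
```

For example, 44 falls in the range from 40 up to (but not including) 45, which is the bucket at index 5. This matches the first row of the output above.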


### One-hot encodings for categorical features
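In this dataset, education is represented as a string (for example, 'Bachelors' or 'HS-grad'). We cannot feed strings directly to a model. Instead, we first map each string to an integer index using a vocabulary list, and then wrap the result in an indicator column, which represents each value as a one-hot vector.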

In [0]:
education = fc.categorical_column_with_vocabulary_list(
      'education', [
          'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
          'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
          '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])

In [15]:
# This creates a one-hot representation
# We will reuse the education feature above later in other contexts
education_one_hot = fc.indicator_column(education)
demo(education_one_hot)

Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
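
As a plain-Python sketch of the same idea, the position of a string in the vocabulary list decides which entry of the row is set to 1. The `one_hot` helper is hypothetical, and treating unknown strings as an all-zero row is an assumption made for this sketch.

```python
vocabulary = [
    'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
    'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
    '5th-6th', '10th', '1st-4th', 'Preschool', '12th']

def one_hot(value, vocabulary):
    """Return a one-hot row for `value` based on its vocabulary position."""
    row = [0.0] * len(vocabulary)
    # Assumption for this sketch: a value missing from the vocabulary
    # leaves the row as all zeros.
    if value in vocabulary:
        row[vocabulary.index(value)] = 1.0
    return row

for level in ['Masters', 'HS-grad', 'Bachelors']:
    print(one_hot(level, vocabulary))
```

'Masters' is the fourth entry in the vocabulary, so its row has a 1 at index 3, as in the first row of the output above.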


### Embeddings for categorical features
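A one-hot vector grows with the number of categories, which becomes impractical when a feature has thousands of possible values. An embedding column instead represents each category as a lower-dimensional, dense vector of floating-point numbers. The values in these vectors are weights that the model learns during training. Before training they are randomly initialized, which is what the output below shows. The size of the vector (8 in this example) is a parameter you must tune.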

In [16]:
education_embedding = fc.embedding_column(education, dimension=8)
demo(education_embedding)

[[ 0.17029224  0.38602605  0.46251673 -0.01845278 -0.03972173  0.32146436
   0.34972754  0.14873436]
 [ 0.5718432  -0.3138958  -0.13732137 -0.08816642  0.55092806 -0.14656846
   0.14499006 -0.11601342]
 [ 0.5718432  -0.3138958  -0.13732137 -0.08816642  0.55092806 -0.14656846
   0.14499006 -0.11601342]
 [ 0.5718432  -0.3138958  -0.13732137 -0.08816642  0.55092806 -0.14656846
   0.14499006 -0.11601342]
 [ 0.41000214 -0.30275458  0.4371216  -0.20242569 -0.5092193   0.18223254
  -0.00375817 -0.1666437 ]]
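
Conceptually, an embedding is a lookup table with one row of numbers per category. The sketch below builds such a table with random values in plain Python. The helper name, the shortened vocabulary, and the range of the random values are all illustrative assumptions, and unlike a real embedding these vectors are never trained.

```python
import random

def make_embedding_table(vocabulary, dimension, seed=0):
    """Assign each vocabulary entry a random vector of length `dimension`."""
    rng = random.Random(seed)
    return {
        word: [rng.uniform(-0.5, 0.5) for _ in range(dimension)]
        for word in vocabulary
    }

vocabulary = ['Bachelors', 'HS-grad', 'Masters']  # shortened for illustration
table = make_embedding_table(vocabulary, dimension=8)

# Embedding a batch is one table lookup per example.
batch = ['Masters', 'HS-grad', 'HS-grad', 'Bachelors']
embedded = [table[word] for word in batch]
for vector in embedded:
    print([round(x, 3) for x in vector])
```

Examples that share a category receive identical vectors, which is why several rows in the output above repeat.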


### Hashed feature columns
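Another way to represent a categorical feature with many possible values is a hashed column. Rather than looking each string up in a vocabulary, this column computes a hash of the input and uses it to select one of `hash_bucket_size` buckets. You do not have to provide a vocabulary, and you can choose a number of buckets much smaller than the number of categories to save space. The tradeoff is that different strings may be mapped to the same bucket. Each row below is a one-hot vector of length 1000, so the printed output is abbreviated and no 1 is visible.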

In [17]:
occupation = fc.categorical_column_with_hash_bucket(
      'occupation', hash_bucket_size=1000)
demo(fc.indicator_column(occupation))

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
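
The bucketing step can be sketched in plain Python. TensorFlow uses its own hash function, so `hashlib.md5` here is only a stand-in that gives a stable result across runs, and `hash_bucket` is a hypothetical helper.

```python
import hashlib

def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket index in the range [0, hash_bucket_size)."""
    # md5 is a stand-in for illustration; Python's built-in hash() is
    # avoided because it changes between interpreter runs.
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size

occupations = ['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners',
               'Prof-specialty']

# With 1000 buckets, collisions among a handful of strings are unlikely.
print([hash_bucket(name, 1000) for name in occupations])

# With only 3 buckets, at least two of the four strings must collide.
print([hash_bucket(name, 3) for name in occupations])
```

The second call shows the tradeoff: a small `hash_bucket_size` saves space, but forces unrelated categories to share a bucket.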


### Crossed feature columns
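Combining several features into a single feature, better known as a feature cross, lets a model learn a separate weight for each combination of values. Here we cross `age_buckets` with `education`. Note that `crossed_column` does not build the full table of all possible combinations, which could be very large. Instead, it hashes each combination into one of `hash_bucket_size` buckets, so you choose how large the table is.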

In [18]:
crossed_feature = fc.crossed_column([age_buckets, education], hash_bucket_size=1000)
demo(fc.indicator_column(crossed_feature))

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
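
A feature cross can be sketched as hashing a combined key. As an illustration only, the helper below joins the values with a separator and uses `hashlib.md5`; both choices are assumptions for this sketch and differ from what TensorFlow does internally.

```python
import hashlib

def crossed_bucket(values, hash_bucket_size):
    """Hash a combination of feature values into a single bucket index."""
    # Join the values into one key so that each distinct combination is
    # treated as its own category before hashing.
    key = '|'.join(str(value) for value in values)
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size

# Hypothetical inputs: an age-bucket index crossed with an education level.
print(crossed_bucket((5, 'Masters'), 1000))
print(crossed_bucket((5, 'HS-grad'), 1000))

# 11 age buckets x 16 education levels gives 176 possible combinations,
# which is well below the 1000 buckets used in this tutorial.
n_combinations = 11 * 16
print(n_combinations)
```

When the number of possible combinations is much larger than `hash_bucket_size`, some combinations will necessarily share a bucket.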


## Train a model
Now that we've seen how to use a few different types of feature columns, let's use some of them to train a model. We've chosen these columns somewhat arbitrarily. If your aim is to build an accurate model on your own dataset, think carefully about which features to include and how to represent them. Choosing the right features for your domain is the most important thing you can do.

In [0]:
age = fc.numeric_column('age')
education_num = fc.numeric_column('education_num')
capital_gain = fc.numeric_column('capital_gain')
capital_loss = fc.numeric_column('capital_loss')
hours_per_week = fc.numeric_column('hours_per_week')

In [0]:
education = fc.categorical_column_with_vocabulary_list(
    'education', [
        'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
        'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
        '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])

marital_status = fc.categorical_column_with_vocabulary_list(
    'marital_status', [
        'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',
        'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])

relationship = fc.categorical_column_with_vocabulary_list(
    'relationship', [
        'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',
        'Other-relative'])

workclass = fc.categorical_column_with_vocabulary_list(
    'workclass', [
        'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',
        'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])

In [0]:
occupation = fc.categorical_column_with_hash_bucket(
    'occupation', hash_bucket_size=1000)

In [0]:
age_buckets = fc.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])

In [0]:
# Cross two categorical features, referenced by their dataframe column names
education_occupation = fc.crossed_column(
    ['education', 'occupation'], hash_bucket_size=1000)

# Cross the bucketized age with the same two categorical features
age_education_occupation = fc.crossed_column(
    [age_buckets, 'education', 'occupation'], hash_bucket_size=1000)

In [0]:
all_columns = [
    age,
    education_num,
    capital_gain,
    capital_loss,
    hours_per_week,
    fc.indicator_column(workclass),
    fc.indicator_column(education),
    fc.indicator_column(marital_status),
    fc.indicator_column(relationship),
    fc.embedding_column(education_occupation, dimension=8),
    fc.embedding_column(age_education_occupation, dimension=8),
    fc.embedding_column(occupation, dimension=8),
]

In [0]:
batch_size = 256
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, batch_size=batch_size)
test_ds = df_to_dataset(test, batch_size=batch_size)

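### Create a feature layer

Now that we have defined our feature columns, we will use a `FeatureLayer` to feed them to our Keras model. The layer converts each batch of raw features into a single dense tensor that the `Dense` layers can consume.
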
In [0]:
feature_layer = fc.FeatureLayer(all_columns)

In [0]:
model = tf.keras.Sequential([
  feature_layer,
  tf.keras.layers.Dense(128, activation=tf.nn.relu),
  tf.keras.layers.Dense(128, activation=tf.nn.relu),
  tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)
])
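
The final `Dense` layer above uses a sigmoid activation, which squashes any real number into the range between 0 and 1. The sketch below shows the function in plain Python. Turning the probability into a yes/no answer with a threshold of 0.5 is a common choice, used here as an assumption for illustration.

```python
import math

def sigmoid(x):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical pre-activation values from the last layer.
for x in (-4.0, 0.0, 4.0):
    probability = sigmoid(x)
    predicted_over_50k = probability > 0.5
    print(x, round(probability, 3), predicted_over_50k)
```

An input of 0 maps to exactly 0.5. Large negative inputs approach 0 and large positive inputs approach 1.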

In [0]:
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss=tf.keras.losses.binary_crossentropy,
              metrics=['accuracy'])
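
Binary cross-entropy measures how far the predicted probabilities are from the 0.0 or 1.0 labels. The following is a minimal plain-Python sketch of the formula, not the Keras implementation. The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y * log(p) + (1 - y) * log(1 - p)) over all examples."""
    losses = []
    for y, p in zip(y_true, y_pred):
        # Keep p strictly inside (0, 1) so both logarithms are defined.
        p = min(max(p, eps), 1.0 - eps)
        losses.append(-(y * math.log(p) + (1.0 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)

labels = [0.0, 1.0, 1.0, 0.0]
good_predictions = [0.1, 0.9, 0.8, 0.2]
poor_predictions = [0.9, 0.1, 0.2, 0.8]

print(binary_crossentropy(labels, good_predictions))
print(binary_crossentropy(labels, poor_predictions))
```

Confident predictions that agree with the labels give a loss close to 0, while confident predictions that disagree are penalized heavily.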

In [29]:
model.fit(train_ds, 
          steps_per_epoch=len(train)//batch_size,
          validation_data=val_ds, 
          validation_steps=len(val)//batch_size,
          epochs=2)

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f42218b4c50>
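
The `steps_per_epoch`, `validation_steps`, and `steps` arguments in this section are computed with integer division. As a quick check of what that means for the split sizes printed earlier:

```python
batch_size = 256
split_sizes = {'train': 39561, 'validation': 4396, 'test': 4885}

steps = {}
for name, n_rows in split_sizes.items():
    # Integer division counts only the full batches in one pass.
    steps[name] = n_rows // batch_size
    leftover = n_rows - steps[name] * batch_size
    print(name, steps[name], 'steps,', leftover, 'rows beyond the last full batch')
```

With a batch size of 256, this gives 154 training steps, 17 validation steps, and 19 test steps.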

Finally, let's evaluate our model on the test data.

In [30]:
loss, accuracy = model.evaluate(test_ds, steps=len(test) // batch_size)
print("Accuracy", accuracy)

Accuracy 0.7915296052631579


## Next steps
* TODO