##### Copyright 2018 The TensorFlow Authors.



In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Load CSV with tf.data

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/alpha/tutorials/load_data/text"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/load_data/text.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/load_data/text.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

## Setup

In [0]:
!pip install tensorflow==2.0.0-alpha0

In [0]:
from __future__ import absolute_import, division, print_function

import numpy as np
import tensorflow as tf


In [0]:
TRAIN_DATA_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
TEST_DATA_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"

train_file_path = tf.keras.utils.get_file("adults.data", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("adults.test", TEST_DATA_URL)

In [0]:
# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

## Load data

To get oriented, let's look at the top of the CSV file we're working with.

In [0]:
!head {train_file_path}

As you can see, the columns in the CSV are not labeled. When you create a dataset from a CSV file that has no header row, the column names need to be supplied as a list of strings.

If the file you are working with contains the column names in the first line, omit the `column_names` argument from the `make_csv_dataset` function. The constructor will then get the names from the file.


In [0]:
# CSV columns in the input file.
CSV_COLUMNS = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
               'marital_status', 'occupation', 'relationship', 'race', 'gender',
               'capital_gain', 'capital_loss', 'hours_per_week',
               'native_country', 'income_bracket']
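
As a rough illustration of what supplying column names to a headerless file means, here is a minimal sketch that uses Python's standard `csv` module rather than `tf.data`. The two rows in `SAMPLE_CSV` are made up for the example (they are not records from the dataset), and `COLUMN_NAMES` simply repeats the list above so that the snippet stands on its own.

```python
import csv
import io

COLUMN_NAMES = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                'marital_status', 'occupation', 'relationship', 'race',
                'gender', 'capital_gain', 'capital_loss', 'hours_per_week',
                'native_country', 'income_bracket']

# Made-up rows in the same headerless layout (not real records).
SAMPLE_CSV = (
    "41, Private, 100000, Bachelors, 13, Never-married, Sales, "
    "Not-in-family, White, Female, 0, 0, 40, United-States, <=50K\n"
    "52, Self-emp-inc, 200000, Masters, 14, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 5000, 0, 50, United-States, >50K\n"
)

# Because fieldnames is given, the first line is treated as data, not a header.
# skipinitialspace drops the space that follows each comma.
reader = csv.DictReader(io.StringIO(SAMPLE_CSV), fieldnames=COLUMN_NAMES,
                        skipinitialspace=True)
rows = list(reader)
print(rows[0]['age'], rows[0]['income_bracket'])
```

If the file did contain a header line, leaving out `fieldnames` would make `csv.DictReader` take the names from that first line instead.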

This example is going to use a subset of the available columns (we're omitting `fnlwgt` and `native_country`). To specify which columns to use, pass a list of column names in the `select_columns` argument of the constructor.

We can also pass in a set of default values to use for these columns, in case any rows in the original data have empty values.

In [0]:
USED_COLUMNS = ['age', 'workclass', 'education', 'education_num',
                'marital_status', 'occupation', 'relationship', 'race',
                'gender', 'capital_gain', 'capital_loss', 'hours_per_week',
                'income_bracket']

USED_COLUMN_DEFAULTS = [[0], [''], [''], [0], [''], [''], [''], [''], [''], [0],
                        [0], [0], ['']]
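
The following sketch shows, in plain Python, one way to think about column selection and defaults: keep only the named fields and substitute a default wherever a field is empty. The helper `select_with_defaults` and the sample `row` are hypothetical and are not part of `tf.data`. The sketch assumes that the type of each default also decides how the field is parsed.

```python
# Hypothetical parsed row: every field arrives as a string, '' means missing.
row = {'age': '41', 'workclass': '', 'fnlwgt': '100000', 'education_num': ''}

selected = ['age', 'workclass', 'education_num']  # subset of columns to keep
defaults = [0, '', 0]                             # one default per kept column


def select_with_defaults(row, selected, defaults):
    """Keep only `selected` fields, substituting the default for empty ones."""
    result = {}
    for name, default in zip(selected, defaults):
        raw = row[name]
        if raw == '':
            result[name] = default
        else:
            # Assumption: the default's type decides how the field is parsed.
            result[name] = type(default)(raw)
    return result


clean = select_with_defaults(row, selected, defaults)
print(clean)
```

Note that the list of defaults has to line up one-to-one with the list of selected columns, which is why `USED_COLUMN_DEFAULTS` has exactly as many entries as `USED_COLUMNS`.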

We also have to identify what column will serve as the labels for each example, and what those labels are.

In [0]:
LABELS = ['<=50K', '>50K']
LABEL_COLUMN = 'income_bracket'

FEATURE_COLUMNS = list(USED_COLUMNS)
FEATURE_COLUMNS.remove(LABEL_COLUMN)

Now that these constructor argument values are in place, read the CSV data from the file and create a dataset. The arguments we haven't mentioned are:

-  `batch_size` — the number of (example, label) pairs that will be combined into each element of the dataset 
-  `na_value` — a string to represent NA or NaN values
-  `num_epochs` — an int specifying the number of times this dataset is repeated
-  `ignore_errors` — if true, malformed rows are discarded

(For the full documentation, see `tf.data.experimental.make_csv_dataset`)


In [0]:
def get_dataset(file_path):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=64, # Artificially small to make examples easier to show.
      column_names=CSV_COLUMNS,
      label_name=LABEL_COLUMN,
      select_columns=USED_COLUMNS,
      column_defaults=USED_COLUMN_DEFAULTS,
      na_value="?",
      num_epochs=1,
      ignore_errors=True)
  return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

Each item in the dataset is a batch, represented as a tuple of (*many examples*, *many labels*). The data from the examples is organized in column-based tensors (rather than, for example, row-based tensors). 

It might help to see this yourself.

In [0]:
examples, labels = next(iter(raw_train_data)) # Just the first batch of 64.
print("EXAMPLES: \n", examples, "\n")
print("LABELS: \n", labels)
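
The column-based layout described above can be mimicked with NumPy. This is only a sketch of the idea, not how `make_csv_dataset` is implemented: the generator `to_column_batch` and the sample rows are hypothetical.

```python
import numpy as np

# Made-up rows: (age, workclass, income_bracket).
rows = [
    (41, 'Private', '<=50K'),
    (52, 'Self-emp-inc', '>50K'),
    (29, 'State-gov', '<=50K'),
]


def to_column_batch(rows, feature_names, batch_size):
    """Yield (features, labels); features maps column name -> 1-D array."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        columns = list(zip(*chunk))  # transpose rows into columns
        features = {name: np.array(col)
                    for name, col in zip(feature_names, columns[:-1])}
        labels = np.array(columns[-1])  # last field is the label
        yield features, labels


batches = list(to_column_batch(rows, ['age', 'workclass'], batch_size=2))
first_features, first_labels = batches[0]
print(first_features['age'], first_labels)
```

Each batch holds one array per column, so `first_features['age']` contains the ages of every row in the batch rather than all the fields of one row.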

## Data preprocessing

### Categorical data

Some of the columns in the CSV data are categorical columns. That is, the content should be one of a limited set of options.

In the CSV, these options are represented as text. This text needs to be converted to integers before the model can be trained. To facilitate that, we need to create a list of categorical columns, along with a list of the options available in each column.

In [0]:
CATEGORIES = {
    'education': ['Bachelors', 'Some-college', '11th', 'HS-grad', 'Prof-school',
                  'Assoc-acdm', 'Assoc-voc', '9th', '7th-8th', '12th',
                  'Masters', '1st-4th', '10th', 'Doctorate', '5th-6th',
                  'Preschool'],
    'marital_status': ['Married-civ-spouse', 'Divorced', 'Never-married',
                       'Separated', 'Widowed', 'Married-spouse-absent',
                       'Married-AF-spouse'],
    'relationship': ['Wife', 'Own-child', 'Husband', 'Not-in-family',
                     'Other-relative', 'Unmarried'],
    'workclass': ['Private', 'Self-emp-not-inc', 'Self-emp-inc', 'Federal-gov',
                  'Local-gov', 'State-gov', 'Without-pay', 'Never-worked'],
    'occupation': ['Tech-support', 'Craft-repair', 'Other-service', 'Sales',
                   'Exec-managerial', 'Prof-specialty', 'Handlers-cleaners',
                   'Machine-op-inspct', 'Adm-clerical', 'Farming-fishing',
                   'Transport-moving', 'Priv-house-serv', 'Protective-serv',
                   'Armed-Forces'],
    'gender': ['Male', 'Female'],
    'race': ['White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other',
             'Black'],
}

Write a function that takes a tensor of categorical values, matches it to a list of value names, and then performs a one-hot encoding.

In [0]:
def process_categorical_data(data, categories):
  """Returns a one-hot encoded tensor representing categorical values."""
  
  # Remove leading ' '.
  data = tf.strings.regex_replace(data, '^ ', '')
  # Remove trailing '.'.
  data = tf.strings.regex_replace(data, r'\.$', '')
  
  # ONE HOT ENCODE
  # Reshape data from 1d (a list) to a 2d (a list of one-element lists)
  data = tf.reshape(data, [-1, 1])
  # For each element, create a new list of boolean values the length of categories,
  # where the truth value is element == category label
  data = tf.equal(categories, data)
  # Cast booleans to floats.
  data = tf.cast(data, tf.float32)
  
  # The entire encoding can fit on one line:
  # data = tf.cast(tf.equal(categories, tf.reshape(data, [-1, 1])), tf.float32)
  return data
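
To make the broadcasting step easier to follow, here is a NumPy stand-in for the same idea. The function `one_hot` and its inputs are hypothetical. It compares a column vector of values against a row of category names, which yields one row per value and one column per category.

```python
import re

import numpy as np


def one_hot(values, categories):
    """One-hot encode strings by comparing a column vector with a row of names."""
    # Strip one leading space and one trailing period, like the regexes above.
    cleaned = [re.sub(r'\.$', '', re.sub(r'^ ', '', v)) for v in values]
    column = np.array(cleaned).reshape(-1, 1)  # shape (batch, 1)
    names = np.array(categories)               # shape (num_categories,)
    # Broadcasting compares every value with every name: (batch, num_categories).
    return (column == names).astype(np.float32)


encoded = one_hot([' Private', ' State-gov', ' ?', ' Private.'],
                  ['Private', 'Self-emp-inc', 'State-gov'])
print(encoded)
```

A value that matches none of the category names, such as `'?'` here, produces a row of all zeros.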

To help you visualize this, we'll take a single category-column tensor from the first batch, preprocess it, and show the before and after state.

In [0]:
workclass_tensor = examples['workclass']
workclass_tensor

In [0]:
workclass_categories = CATEGORIES['workclass']
workclass_categories

In [0]:
processed_workclass = process_categorical_data(workclass_tensor, workclass_categories)
processed_workclass

Notice the relationship between the lengths of the two inputs and the shape of the output.

In [0]:
print("Size of batch: ", len(workclass_tensor.numpy()))
print("Number of category labels: ", len(workclass_categories))
print("Shape of one-hot encoded tensor: ", processed_workclass.shape)

### Continuous data

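Continuous data should be scaled so that typical values fall roughly between 0 and 1. To do that, write a function that multiplies each value by 1 over twice the mean of the column values. A value equal to the mean maps to 0.5, and values larger than twice the mean map to numbers above 1.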

The function should also reshape the data into a two dimensional tensor.


In [0]:
def process_continuous_data(data, mean):
  # Normalize data
  data = tf.cast(data, tf.float32) * 1/(2*mean)
  return tf.reshape(data, [-1, 1])
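
A small NumPy sketch of the same scaling shows what happens to individual values. The function name `scale_by_twice_mean`, the sample values, and the assumed column mean of 40 are all made up for illustration.

```python
import numpy as np


def scale_by_twice_mean(values, mean):
    """Divide by 2 * mean and return a float32 column vector."""
    scaled = np.asarray(values, dtype=np.float32) / (2 * mean)
    return scaled.reshape(-1, 1)


# Hypothetical values, scaled with an assumed column mean of 40.
scaled = scale_by_twice_mean([0, 20, 40, 80, 200], mean=40.0)
print(scaled.ravel())
```

The value 40 (the assumed mean) lands at 0.5 and 80 lands at 1.0, while the outlier 200 lands at 2.5, so this scaling does not strictly bound the output.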

To do this calculation, you need the column means. In a real project you would compute these from the training data, but for this example we'll just provide them.

In [0]:
MEANS = {
    'age': 38.64358543876172,
    'education_num': 10.078088530363212,
    'capital_gain': 1079.0676262233324,
    'capital_loss': 87.50231358257237,
    'hours_per_week': 40.422382375824085,
}
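
For reference, a dictionary like this could be computed along the following lines. This is a sketch over made-up rows using only the standard library. The helper `column_means` is hypothetical, and a real pipeline would read the values from the training file.

```python
from statistics import fmean

# Made-up training rows; in practice these would come from the training file.
rows = [
    {'age': 30, 'hours_per_week': 40},
    {'age': 40, 'hours_per_week': 50},
    {'age': 50, 'hours_per_week': 30},
]


def column_means(rows, columns):
    """Return {column: mean over all rows} for the requested numeric columns."""
    return {name: fmean(row[name] for row in rows) for name in columns}


means = column_means(rows, ['age', 'hours_per_week'])
print(means)
```

A common practice is to compute such statistics on the training split only and reuse them unchanged for the test split.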

Again, to see what this function is actually doing, we'll take a single tensor of continuous data and show it before and after processing.

In [0]:
age_tensor = examples['age']
age_tensor

In [0]:
process_continuous_data(age_tensor, MEANS['age'])

### Preprocess the data

Now assemble these preprocessing tasks into a single function that can be mapped to each batch in the dataset. 



In [0]:
def preprocess(features, labels):
  
  # Process categorical features.
  for feature in CATEGORIES.keys():
    features[feature] = process_categorical_data(features[feature],
                                                 CATEGORIES[feature])

  # Process continuous features.
  for feature in MEANS.keys():
    features[feature] = process_continuous_data(features[feature],
                                                MEANS[feature])

  # Process the labels. (Labels are also categorical.)
  labels = process_categorical_data(labels, LABELS)
  
  # Assemble features into a single tensor.
  features = tf.concat([features[column] for column in FEATURE_COLUMNS], 1)
  
  return features, labels

train_data = raw_train_data.map(preprocess)
test_data = raw_test_data.map(preprocess)
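
The final assembly step joins the per-column blocks side by side. The NumPy sketch below uses hypothetical preprocessed values for a batch of three rows to show how the column widths add up.

```python
import numpy as np

# Hypothetical preprocessed columns for a batch of 3 rows.
processed = {
    'age': np.array([[0.4], [0.5], [0.6]], dtype=np.float32),        # (3, 1)
    'gender': np.array([[1, 0], [0, 1], [1, 0]], dtype=np.float32),  # (3, 2)
    'hours_per_week': np.array([[0.5], [0.6], [0.4]], dtype=np.float32),
}
feature_order = ['age', 'gender', 'hours_per_week']

# Join the per-column blocks along axis 1, keeping one row per example.
features = np.concatenate([processed[name] for name in feature_order], axis=1)
print(features.shape)
```

Here the widths 1 + 2 + 1 give four values per row. By the same arithmetic, the categories listed earlier contribute 58 one-hot columns and the five continuous features contribute one column each, for 63 values per row.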

Now let's see what a single preprocessed batch looks like.

In [0]:
examples, labels = next(iter(train_data))

examples, labels

The examples and labels are both two-dimensional arrays with 64 rows each (the batch size). Each row represents a single row in the original CSV file.

## Build the model

This example uses the [Keras Functional API](https://www.tensorflow.org/alpha/guide/keras/functional) wrapped in a `get_model` constructor to build up a simple model. 

In [0]:
def get_model(input_dim, labels_dim, hidden_units=(100,), learning_rate=0.01):
  """Create a Keras model with layers.

  Args:
    input_dim: (int) The shape of an item in a batch. 
    labels_dim: (int) The shape of a label.
    hidden_units: [int] the layer sizes of the DNN (input layer first)
    learning_rate: (float) the learning rate for the optimizer.

  Returns:
    A Keras model.
  """

  inputs = tf.keras.Input(shape=(input_dim,))
  x = inputs

  for units in hidden_units:
    x = tf.keras.layers.Dense(units, activation=tf.keras.backend.relu)(x)
  outputs = tf.keras.layers.Dense(labels_dim, activation='softmax')(x)

  model = tf.keras.Model(inputs, outputs)
  model.compile(
      loss='categorical_crossentropy',
      optimizer=tf.keras.optimizers.RMSprop(learning_rate),
      metrics=['accuracy'])
  return model

The `get_model` constructor needs to know the input and output shapes of your data (not including the batch size).

In [0]:
input_shape, output_shape = train_data.output_shapes

input_dimension = input_shape.dims[1] # [0] is the batch size
output_dimension = output_shape.dims[1] 

## Train, evaluate, and predict

Now the model can be instantiated and trained.

In [0]:
model = get_model(input_dimension, output_dimension)


model.fit(train_data, epochs=20)

Once the model is trained, we can check its accuracy on the `test_data` set.

In [0]:
model.evaluate(test_data)

To get the model's actual outputs rather than summary metrics, use `tf.keras.Model.predict` to infer labels on a batch or a dataset of batches.

In [0]:
predictions = model.predict(test_data)

print("Predictions:\n", predictions)
print("\nShape:\n", predictions.shape)
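
Each row of the prediction array holds one probability per label. One way to turn those rows back into label names is to take the index of the largest value, as in this sketch. The array `sample_predictions` contains made-up numbers, not output from the trained model, and `LABEL_NAMES` repeats the `LABELS` list so that the snippet stands on its own.

```python
import numpy as np

LABEL_NAMES = ['<=50K', '>50K']

# Hypothetical softmax outputs for three rows (each row sums to 1).
sample_predictions = np.array([[0.9, 0.1],
                               [0.3, 0.7],
                               [0.55, 0.45]])

# The index of the largest probability selects the predicted label.
predicted_index = np.argmax(sample_predictions, axis=1)
predicted_labels = [LABEL_NAMES[i] for i in predicted_index]
print(predicted_labels)
```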
