##### Copyright 2019 The TensorFlow Authors.

In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Census using tf.Keras + TF.data.Dataset in TF 2.0

<table>![Keras+Tensorflow](https://avatars0.githubusercontent.com/u/15658638?s=200&v=4 =85x)
</table>

## Overview

This tutorial shows how to train a neural network locally using the Keras
Sequential API and how to generate predictions
from that model.

Keras is a high-level API to build and train deep learning models.
[tf.keras](https://www.tensorflow.org/guide/keras) is TensorFlow’s
implementation of this API.

### Dataset

Using census data which contains data for a person's age, education, marital status, and occupation (the features), we will try to predict whether or not the person earns more than 50,000 USD a year (the target label). We will then train a neural network model that, given an individual's information (features), outputs a number between 0 and 1—this can be interpreted as the probability that the individual has an annual income of over 50,000 USD.

**Key Point:** As a modeler and developer, think about how this data is used and the potential benefits and harm a model's predictions can cause. A model like this could reinforce societal biases and disparities. Is each feature relevant to the problem you want to solve or will it introduce bias? For more information, read about ML fairness.

## Setup

You must do several things before you can train a model:

* Set up your development environment. (Skip this step if you're using
Colaboratory.)

If using Google Cloud:

* Create a Google Cloud Platform (GCP) project with Billing and the necessary
  APIs enabled.
* Authenticate your GCP account in this notebook.
* Create a Google Cloud Storage bucket to store your training package and your
  trained model.


### Set up your development environment

If you are using Colaboratory, skip this step.

Otherwise, make sure your environment meets this notebook's requirements. You
need the following:

* Google Cloud SDK
* Python 3
* virtualenv
* Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [Setting up a Python development
environment](https://cloud.google.com/python/setup) and the [Jupyter
installation guide](https://jupyter.org/install) provide detailed instructions
for meeting these requirements. The following steps provide a condensed set of
instructions:

1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)

2. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)

3. [Install
   virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)
   and create a virtual environment that uses Python 3.

4. Activate that environment. Run `pip install jupyter` in a shell to install
   Jupyter.

5. Run `jupyter notebook` in a shell to launch Jupyter.

6. Open this notebook in the Jupyter Notebook Dashboard.

### Set up your GCP project

If you are using Colaboratory, skip this step.

Follow the first three steps of [these setup
instructions](https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction#set-up-your-gcp-project)
to set up a GCP project, enable billing, and enable the
Compute Engine APIs. Enter the ID of your project in the cell below.


In [0]:
PROJECT_ID = "dpe-cloud-mle" #@param {type:"string"}

### Authenticate your GCP account

**If you are not running this notebook in Colaboratory**, follow step four of
the [setup
instructions](https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction#set-up-your-gcp-project)
to create a service account key and save it to your machine. Enter the path to
your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` constant in the
cell below.

**If you _are_ using Colaboratory**, run the cell below and follow the
instructions when prompted to authenticate your account via OAuth.

In [0]:
import sys

# If you are running this notebook in Colaboratory, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Google Cloud Storage bucket and lets us submit training jobs and prediction
# requests.

if 'google.colab' in sys.modules:
  from google.colab import auth as google_auth
  google_auth.authenticate_user()

# If you are running this notebook locally, please follow these instructions
# to create a service account key:
# https://cloud.google.com/ml-engine/docs/tensorflow/python-guide#set-up-your-gcp-project
# Then, replace the string below with the path to your service account key
# and run this cell to authenticate your GCP account.
else:
  GOOGLE_APPLICATION_CREDENTIALS='/path/to/your/service-account-key.json' #@param {type:"string"}
  %env GOOGLE_APPLICATION_CREDENTIALS {GOOGLE_APPLICATION_CREDENTIALS}


Run the following cell to make sure the Cloud SDK uses the right project for
all the commands in this notebook.

Note: Jupyter interpolates Python variables in curly braces into shell commands.

In [0]:
! gcloud config set project {PROJECT_ID}

### Create a Google Cloud Storage bucket

In this tutorial, the trained TensorFlow model is exported to a Google Cloud Storage bucket.

Set the name of your Cloud Storage bucket below. It must be unique across all
Cloud Storage buckets. You may also change the `REGION` variable. Make sure to
[choose a region where Cloud ML Engine services are
available](https://cloud.google.com/ml-engine/docs/tensorflow/regions), because
you must run your Cloud ML Engine jobs in your Cloud Storage bucket's region.

In [0]:
MODEL_NAME = "census_model" #@param {type:"string"}
BUCKET_NAME = "tony-dev" #@param {type:"string"}
REGION = "us-central1" #@param ["us-central1", "us-east1", "europe-west1"]


Run the following cell to create your Cloud Storage bucket. If it already exists, skip this step.

In [0]:
! gsutil mb -l {REGION} gs://{BUCKET_NAME}

## Install TensorFlow 2.0 preview

In [0]:
! pip install -U tf-nightly-2.0-preview

## Part 1. Training a Keras model

In this section we will build a Keras model from scratch.
We will perform the following steps:
- Data download
- Data preparation
- Data standardization
- Model creation
- Model training
- Model evaluation
- Model serving

We will create the model and export it to serve requests in ML Engine.

### Set up your GCP project

Verify that you have followed the initial steps for setting up your project at the beginning of this notebook.
Update your parameters accordingly.

### Import libraries
Import TensorFlow and supporting modules:

In [0]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import pandas as pd
import tensorflow as tf

import os
import time
import math
import matplotlib.pyplot as plt

from six.moves import urllib

# Software versions
print(__import__('sys').version)
print(tf.__version__)
print(tf.keras.__version__)

### Define Constants

In [0]:
# Storage directory
DATA_DIR = '/tmp/census_data/'

# Download options.
DATA_URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult'
TRAINING_FILE = 'adult.data'
EVAL_FILE = 'adult.test'
TRAINING_URL = '%s/%s' % (DATA_URL, TRAINING_FILE)
EVAL_URL = '%s/%s' % (DATA_URL, EVAL_FILE)

# These are the features in the dataset.
_CSV_COLUMNS = [
    'age', 'workclass', 'fnlwgt', 'education', 'education_num',
    'marital_status', 'occupation', 'relationship', 'race', 'gender',
    'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
    'income_bracket'
]

_CATEGORICAL_TYPES = {
  'workclass': pd.api.types.CategoricalDtype(categories=[
    'Federal-gov', 'Local-gov', 'Never-worked', 'Private', 'Self-emp-inc',
    'Self-emp-not-inc', 'State-gov', 'Without-pay'
  ]),
  'marital_status': pd.api.types.CategoricalDtype(categories=[
    'Divorced', 'Married-AF-spouse', 'Married-civ-spouse',
    'Married-spouse-absent', 'Never-married', 'Separated', 'Widowed'
  ]),
  'occupation': pd.api.types.CategoricalDtype([
    'Adm-clerical', 'Armed-Forces', 'Craft-repair', 'Exec-managerial',
    'Farming-fishing', 'Handlers-cleaners', 'Machine-op-inspct',
    'Other-service', 'Priv-house-serv', 'Prof-specialty', 'Protective-serv',
    'Sales', 'Tech-support', 'Transport-moving'
  ]),
  'relationship': pd.api.types.CategoricalDtype(categories=[
    'Husband', 'Not-in-family', 'Other-relative', 'Own-child', 'Unmarried',
    'Wife'
  ]),
  'race': pd.api.types.CategoricalDtype(categories=[
    'Amer-Indian-Eskimo', 'Asian-Pac-Islander', 'Black', 'Other', 'White'
  ]),
  'native_country': pd.api.types.CategoricalDtype(categories=[
    'Cambodia', 'Canada', 'China', 'Columbia', 'Cuba', 'Dominican-Republic',
    'Ecuador', 'El-Salvador', 'England', 'France', 'Germany', 'Greece',
    'Guatemala', 'Haiti', 'Holand-Netherlands', 'Honduras', 'Hong', 'Hungary',
    'India', 'Iran', 'Ireland', 'Italy', 'Jamaica', 'Japan', 'Laos', 'Mexico',
    'Nicaragua', 'Outlying-US(Guam-USVI-etc)', 'Peru', 'Philippines', 'Poland',
    'Portugal', 'Puerto-Rico', 'Scotland', 'South', 'Taiwan', 'Thailand', 
    'Trinadad&Tobago', 'United-States', 'Vietnam', 'Yugoslavia'
  ]),
  'income_bracket': pd.api.types.CategoricalDtype(categories=[
    '<=50K', '>50K'
  ])
}

# This is the label (target) we want to predict.
_LABEL_COLUMN = 'income_bracket'

_CSV_COLUMN_DEFAULTS = [[0], [''], [0], [''], [0], [''], [''], [''], [''], [''],
                        [0], [0], [0], [''], ['']]

# Use one CPU for this example
NUM_CPUS = 1

# This is the training batch size
BATCH_SIZE = 40
# This is the number of epochs (passes over the full training data)
EPOCHS = 40
# Define learning rate.
LEARNING_RATE = 0.001

_NUM_EXAMPLES = {
    'train': 32561,
    'validation': 16281,
}

In [0]:
# Clean up directory each run
! rm -rf {DATA_DIR}

### Helper function to download and clean files


In [0]:
def _download_and_clean_file(filename, url):
  """ Downloads data from url, and makes changes to match the CSV format.
      Removes excessive whitespace
  """
  temp_file, _ = urllib.request.urlretrieve(url)
  with tf.io.gfile.GFile(temp_file, 'r') as temp_file_object:
    with tf.io.gfile.GFile(filename, 'w') as file_object:
      for line in temp_file_object:
        line = line.strip()
        line = line.replace(', ', ',')
        if not line or ',' not in line:
          continue
        if line[-1] == '.':
          line = line[:-1]
        line += '\n'
        file_object.write(line)
  tf.io.gfile.remove(temp_file)
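
As a rough illustration of the per-line cleanup that `_download_and_clean_file` performs, the sketch below applies the same steps to a few in-memory strings. The helper name `clean_line` and the sample lines are hypothetical; they only mimic the shape of rows in the raw files.

```python
def clean_line(line):
  """Returns the cleaned CSV line, or None if the line should be skipped."""
  line = line.strip()
  line = line.replace(', ', ',')
  # Skip blank lines and lines that are not comma-separated records.
  if not line or ',' not in line:
    return None
  # Drop a trailing period after the last field.
  if line[-1] == '.':
    line = line[:-1]
  return line


# Hypothetical sample input: two records, one header-like line, one blank line.
sample_lines = [
    '39, State-gov, 77516, <=50K\n',
    '|1x3 Cross validator\n',
    '25, Private, 226802, <=50K.\n',
    '\n',
]
cleaned = [clean_line(line) for line in sample_lines]
kept = [line for line in cleaned if line is not None]
print(kept)
```

Only the two comma-separated records survive, with the spaces after commas and the trailing period removed.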

### Function to download Training and Evaluation files
 

In [0]:
def download(data_dir):
  """Download census data if it is not already present."""
  tf.io.gfile.makedirs(data_dir)

  training_file_path = os.path.join(data_dir, TRAINING_FILE)
  if not tf.io.gfile.exists(training_file_path):
    _download_and_clean_file(training_file_path, TRAINING_URL)

  eval_file_path = os.path.join(data_dir, EVAL_FILE)
  if not tf.io.gfile.exists(eval_file_path):
    _download_and_clean_file(eval_file_path, EVAL_URL)
  print('Download is completed!')

### Download Census Dataset



In [0]:
# Download Census dataset: Training and test csv files.
download(DATA_DIR)

In [0]:
# Verify data is downloaded successfully. You will see 2 files: adult.data and adult.test
! ls -l {DATA_DIR}

In [0]:
# Define the full path for training and test files.
train_file = os.path.join(DATA_DIR, TRAINING_FILE)
test_file = os.path.join(DATA_DIR, EVAL_FILE)

### Load files into a pandas DataFrame

In [0]:
# This census data uses the value '?' for fields that are missing data.
# Passing na_values="?" makes pandas read those entries as NaN.
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

train = pd.read_csv(train_file, names=_CSV_COLUMNS, na_values="?")
test = pd.read_csv(test_file, names=_CSV_COLUMNS, na_values="?")

# Here's what the data looks like before we preprocess the data.
train.head()
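
The effect of `na_values` can be checked on a tiny in-memory CSV. This is a minimal sketch; the two rows and the three-column subset are made up for illustration and are not taken from the Census files.

```python
import io

import pandas as pd

# Hypothetical two-row sample with one missing workclass value marked as '?'.
csv_text = '39,State-gov,<=50K\n54,?,>50K\n'
sample = pd.read_csv(
    io.StringIO(csv_text),
    names=['age', 'workclass', 'income_bracket'],
    na_values='?')

# The '?' entry is now NaN, so isna() flags it.
missing_mask = sample['workclass'].isna().tolist()
print(missing_mask)
```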

#### Drop Unused Features and Features that are Biased

In [0]:
# Dataset information: https://archive.ics.uci.edu/ml/datasets/census+income

"""
These are columns we will not use as features for training. There are many
reasons not to use certain attributes of data for training. Perhaps their
values are noisy or inconsistent, or perhaps they encode bias that we do not
want our model to learn. For a deep dive into the features of this Census
dataset and the challenges they pose, see the Introduction to ML Fairness
notebook: https://colab.research.google.com/github/google/eng-edu/blob/master/ml/cc/exercises/intro_to_fairness.ipynb"""

UNUSED_COLUMNS = ['fnlwgt', 'education', 'gender']

### Process Numerical and Categorical columns

The Census dataset contains both numbers and strings.
We need to convert the string data into numbers before we can train the model.
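
The conversion relies on pandas categorical codes: each string is replaced by its position in a fixed category list. The sketch below uses a shortened, hypothetical three-value vocabulary rather than the full `_CATEGORICAL_TYPES` mapping.

```python
import pandas as pd

# Hypothetical, shortened vocabulary for illustration only.
workclass_type = pd.api.types.CategoricalDtype(
    categories=['Federal-gov', 'Local-gov', 'Private'])

raw = pd.Series(['Private', 'Local-gov', None, 'Unlisted'], dtype='object')

# Each value maps to its index in the category list. Missing values and
# values outside the list both map to -1.
codes = raw.astype(workclass_type).cat.codes
print(codes.tolist())
```

Because the category list is fixed up front, the same string always maps to the same code in the training, evaluation, and prediction data. Note that missing values are encoded as -1 rather than being dropped.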

In [0]:
def preprocess(dataframe):  
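  """Converts the categorical features of a DataFrame to numeric codes.

  Args:
    dataframe: A `pandas.DataFrame` to process.

  Returns:
    The DataFrame with unused columns dropped, integer columns cast to
    float32, and categorical columns replaced by integer codes.
  """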

  dataframe = dataframe.drop(columns=UNUSED_COLUMNS)
  
  # Convert integer valued (numeric) columns to floating point
  numeric_columns = dataframe.select_dtypes(['int64']).columns
  dataframe[numeric_columns] = dataframe[numeric_columns].astype('float32')

  # Convert categorical columns to numeric
  cat_columns = dataframe.select_dtypes(['object']).columns
  # Keep categorical columns always using same values based on dict.
  dataframe[cat_columns] = dataframe[cat_columns].apply(lambda x: x.astype(_CATEGORICAL_TYPES[x.name]))
  dataframe[cat_columns] = dataframe[cat_columns].apply(lambda x: x.cat.codes)
  return dataframe

In [0]:
train = preprocess(train)
test = preprocess(test)

In [0]:
# Here's how the data has changed after we preprocessed it.
# Note how columns like workclass, marital_status, occupation, relationship,
# race, native_country and income_bracket have been converted to integer
# category codes, and how the unused columns have been dropped.
train.head()

### Split Features and Labels

In [0]:
# Split train and test data into features and labels.
# The pop() method removes the label column from the DataFrame and returns it.
_train_x, train_y = train, train.pop(_LABEL_COLUMN)
_test_x, test_y = test, test.pop(_LABEL_COLUMN)
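
The one-line split works because `pop()` mutates the DataFrame in place, so the name on the left refers to the same object with the label column already removed. A minimal sketch with a made-up two-row frame:

```python
import pandas as pd

# Hypothetical miniature frame; the real one has many more columns.
frame = pd.DataFrame({'age': [39.0, 50.0], 'income_bracket': [0, 1]})

features, labels = frame, frame.pop('income_bracket')

print(list(features.columns))
print(labels.tolist())
print(features is frame)
```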

In [0]:
# Reshape Label for Dataset. 
train_y = np.asarray(train_y).astype('float32').reshape((-1, 1))
test_y = np.asarray(test_y).astype('float32').reshape((-1, 1))

### Normalize data for Model Convergence

In [0]:
def standardization(dataset):
  """Standardizes numeric fields to zero mean and unit standard deviation
  (z-score).

  Args:
    dataset: A `pandas.DataFrame`.
  """
  dtypes = list(zip(dataset.dtypes.index, map(str, dataset.dtypes)))
  # Normalize numeric columns.
  for column, dtype in dtypes:
      if dtype == 'float32':
          dataset[column] -= dataset[column].mean()
          dataset[column] /= dataset[column].std()
  return dataset
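
To make the z-score step concrete, here is a small sketch on a made-up float32 column. It uses the same `mean()` and `std()` calls as `standardization`; pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical ages, chosen so the arithmetic is easy to follow.
ages = pd.Series([20.0, 30.0, 40.0], dtype='float32')

mean = ages.mean()  # 30.0
std = ages.std()    # 10.0 (sample standard deviation)
z_scores = (ages - mean) / std
print(z_scores.tolist())
```

The standardized column is centered on zero, and a value one standard deviation above the mean becomes 1.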

In [0]:
# Join all dataframes to standardize data, then split.
all_data = pd.concat([_train_x, _test_x], keys=[0, 1])
standardized_data = standardization(all_data)
train_x, test_x = standardized_data.xs(0), standardized_data.xs(1)

In [0]:
# Verify dataset features
# Note how only the numeric fields (not categorical) have been standardized
train_x.head()

#### DataFrame Length

List the lengths of the training and testing data. Expected output: (32561, 32561, 16281, 16281)

In [0]:
len(train_x), len(train_y), len(test_x), len(test_y)

### Dataset input function

Keras supports tf.data.Dataset as a model input. We will use this functionality to build the input pipeline.

https://www.tensorflow.org/api_docs/python/tf/data/Dataset
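
As a minimal sketch of such a pipeline, the snippet below builds a dataset from small made-up NumPy arrays and pulls one batch from it in eager mode. The array contents and the batch size of 4 are assumptions for illustration; a shuffle buffer at least as large as the dataset gives a uniform shuffle.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 6 examples with 2 features each, plus binary labels.
features = np.arange(12, dtype='float32').reshape(6, 2)
labels = np.array([[0], [1], [0], [1], [0], [1]], dtype='float32')

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=len(features))
dataset = dataset.repeat()   # Repeat indefinitely; the caller limits the steps.
dataset = dataset.batch(4)

# In eager mode the dataset is a Python iterable of (features, labels) batches.
batch_features, batch_labels = next(iter(dataset))
print(batch_features.shape, batch_labels.shape)
```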

In [0]:
def input_fn(features, labels, shuffle, num_epochs, batch_size):
  """Generate a tf.data.Dataset to feed the Keras model."""
  
  if labels is None:
    inputs = features
  else:
    inputs = (features, labels)    
  dataset = tf.data.Dataset.from_tensor_slices(inputs)
  
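  if shuffle:
    # The shuffle buffer should cover the whole dataset for a uniform shuffle;
    # num_epochs is not a meaningful buffer size.
    dataset = dataset.shuffle(buffer_size=len(features))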

  # We call repeat after shuffling, rather than before, to prevent separate
  # epochs from blending together.
  dataset = dataset.repeat()
  dataset = dataset.batch(batch_size)
  dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
  return dataset

### Create a Keras Model

We'll create our neural network using the Keras Sequential API. Keras is a high-level API to build and train deep learning models and is user friendly, modular and easy to extend. **tf.keras** is TensorFlow's implementation of this API and it supports such things as eager execution, **tf.data** pipelines and Estimators.

Architecture-wise, we'll build a binary classifier: a deep neural network (DNN) with several hidden layers and a single sigmoid output, where:

- The first hidden layer, which receives the input features, will have 100 units using the ReLU activation function.
- The second hidden layer will have 70 units using the ReLU activation function.
- The third hidden layer will have 50 units using the ReLU activation function.
- The fourth hidden layer will have 25 units using the ReLU activation function.
- The output layer will have 1 unit and use the sigmoid function.
- We will use the binary crossentropy loss function and the RMSprop optimizer.
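
As a sanity check for the model summary, the number of trainable parameters can be worked out by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below uses the layer widths from `create_keras_model` (100, 70, 50, 25, 1) and assumes 11 input features, which is what remains after dropping the unused columns and the label; `dense_params` is a hypothetical helper.

```python
def dense_params(inputs, units):
  """Weights plus biases of a fully connected layer."""
  return inputs * units + units


input_dim = 11  # Assumption: 15 CSV columns - 3 unused columns - 1 label.
layer_units = [100, 70, 50, 25, 1]

per_layer = []
previous = input_dim
for units in layer_units:
  per_layer.append(dense_params(previous, units))
  previous = units

total_params = sum(per_layer)
print(per_layer, total_params)
```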

In [0]:
# *This may change once Keras supports Feature Columns.
# https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/fast-and-lean-data-science/08_Taxifare_Keras_FeatureColumns_solution.ipynb

def create_keras_model(input_dim, learning_rate):
  """Create a Keras model for binary classification."""
  model = tf.keras.Sequential()
  model.add(tf.keras.layers.Dense(100, activation=tf.nn.relu, input_shape=(input_dim,)))
  model.add(tf.keras.layers.Dense(70, activation=tf.nn.relu))
  model.add(tf.keras.layers.Dense(50, activation=tf.nn.relu))
  model.add(tf.keras.layers.Dense(25, activation=tf.nn.relu))
  # The single output node and Sigmoid activation makes this a Logistic Regression.
  model.add(tf.keras.layers.Dense(1, activation=tf.nn.sigmoid))

  # Custom Optimizer:
  # https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/RMSprop
  optimizer = tf.keras.optimizers.RMSprop(learning_rate=learning_rate,
                                          rho=0.9,
                                          epsilon=1e-08,
                                          decay=learning_rate/10)

  # Compile Keras model
  model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])
  return model

In [0]:
# Input dimensions
input_dim = train_x.shape[1]
print('Total features: {}'.format(input_dim))

In [0]:
# Create the Keras Model

keras_model = create_keras_model(input_dim=input_dim, learning_rate=LEARNING_RATE)

In [0]:
# Take a detailed look inside the model
keras_model.summary()

### Train and Evaluate

After defining the model, let's train it. Training is a single call to `fit`, with the data supplied through the tf.data.Dataset API.

In [0]:
# Pass a numpy array by passing DataFrame.values
training_dataset = input_fn(features=train_x.values, 
                    labels=train_y, 
                    shuffle=True, 
                    num_epochs=40, 
                    batch_size=BATCH_SIZE)

# Pass a numpy array by passing DataFrame.values
validation_dataset = input_fn(features=test_x.values, 
                    labels=test_y, 
                    shuffle=False, 
                    num_epochs=1, 
                    batch_size=BATCH_SIZE)                

In [0]:
# Setup Learning Rate decay.
lr_decay = tf.keras.callbacks.LearningRateScheduler(lambda epoch: 0.0001 + 0.02 * math.pow(0.5, 1+epoch), verbose=True)
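
To see what this schedule does, the same expression can be evaluated outside Keras. The sketch below copies the lambda into a plain function (the name `schedule` is hypothetical) and prints the learning rate for the first few epochs; the rate halves its distance to a floor of 0.0001 every epoch.

```python
import math


def schedule(epoch):
  """Same expression as the lambda passed to LearningRateScheduler."""
  return 0.0001 + 0.02 * math.pow(0.5, 1 + epoch)


rates = [schedule(epoch) for epoch in range(4)]
for epoch, rate in enumerate(rates):
  print('epoch {}: {:.5f}'.format(epoch, rate))
```

Since the callback sets the optimizer's learning rate at the start of every epoch, the values from this schedule are the ones in effect during training.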

**tf.data.Dataset** using Keras

In this example we will use a tf.data.Dataset to train our Keras model.
Because the dataset repeats indefinitely, two additional parameters are required:

- steps_per_epoch = number of training samples / training batch size
- validation_steps = number of validation samples / evaluation batch size

These set how many batches are drawn per epoch, so that each epoch covers approximately:

- the entire training set
- the entire validation set
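
With the example counts and batch size defined in the constants cell, the arithmetic works out as in the sketch below. The variable names are local to this illustration; `int()` truncates, so one example per split falls outside the last full batch of an epoch.

```python
num_examples = {'train': 32561, 'validation': 16281}
batch_size = 40

steps = {}
leftover = {}
for split, count in num_examples.items():
  # Truncating division: only full batches are counted.
  steps[split] = int(count / batch_size)
  leftover[split] = count - steps[split] * batch_size

print(steps)
print(leftover)
```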


In [0]:
history = keras_model.fit(training_dataset, 
                          validation_data=validation_dataset, 
                          steps_per_epoch=int(_NUM_EXAMPLES['train']/BATCH_SIZE), 
                          validation_steps=int(_NUM_EXAMPLES['validation']/BATCH_SIZE), 
                          epochs=EPOCHS, 
                          callbacks=[lr_decay],
                          verbose=2)

**Reference:** Traditional Keras training mode.

```keras_model.fit(x=train_x, y=train_y, epochs=40, validation_data=(test_x, test_y), verbose=1, callbacks=[lr_decay])```

### Visualize Model history

In [0]:
# Visualize History for Accuracy.
plt.title('Keras Model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epochs')
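# The metric key is 'accuracy' in TF 2.0 releases and 'acc' in some earlier builds.
acc_key = 'accuracy' if 'accuracy' in history.history else 'acc'
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])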
plt.legend(['training', 'testing'], loc='upper left')
plt.show()

In [0]:
# Visualize History for Loss.
plt.title('Keras Model loss')
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['training', 'testing'], loc='upper left')
plt.show()

### Export model for Serving

In [0]:
# Exporting model to GCS. 
export_path = tf.keras.experimental.export(keras_model, os.path.join('gs://', BUCKET_NAME, 'keras_export'))
export_path = export_path.decode('utf-8')
print("Model exported to: ", export_path)

#### Generate online predictions

In [0]:
# Download sample data to verify predictions.
! rm -rf test.*
! wget https://raw.githubusercontent.com/GoogleCloudPlatform/cloudml-samples/master/census/test.csv

# Extract the file into a Pandas dataframe to process it for Predictions.
predictions_df = pd.read_csv('test.csv', names=_CSV_COLUMNS)
# Display data
predictions_df.head()

In [0]:
# Preprocess data as Serving function is expecting numeric data.
predict = preprocess(predictions_df)

In [0]:
# Split features and label. We will pass features only to Serving model.
_predict_x, predict_y = predict, predict.pop(_LABEL_COLUMN)

In [0]:
# Concatenate training and prediction data to perform standardization.
all_data = pd.concat([_train_x, _predict_x], keys=[0, 1])
# Standardize predictions using training data + prediction data.
standardized_data = standardization(all_data)
train_x, predict_x = standardized_data.xs(0), standardized_data.xs(1)

In [0]:
predict_x.head()

### Local Predictions

In [0]:
# Predict using Keras
predict_x.to_csv('predictions.csv', header=False)
! more predictions.csv

predictions = keras_model.predict_classes(predict_x, verbose=1)
print(['<=50K' if x==0 else '>50K' for x in predictions])
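
For a single sigmoid output, turning probabilities into class labels amounts to thresholding at 0.5. The following sketch shows that step with NumPy on made-up probabilities; it illustrates the idea and does not call the model.

```python
import numpy as np

# Hypothetical sigmoid outputs for three individuals, shaped like model output.
probabilities = np.array([[0.08], [0.51], [0.97]], dtype='float32')

# Class 1 when the predicted probability exceeds 0.5, otherwise class 0.
classes = (probabilities > 0.5).astype('int32')
income_labels = ['<=50K' if c == 0 else '>50K' for c in classes.ravel()]
print(income_labels)
```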


## Questions? Feedback?
Feel free to send us an email (cloudml-feedback@google.com) if you run into any issues or have any questions/feedback!