<h1> Create TensorFlow model </h1>

This notebook illustrates:
<ol>
<li> Creating a model using the high-level Estimator API 
</ol>

In [18]:
import os
output = os.popen("gcloud config get-value project").readlines()
project_name = output[0][:-1]

# change these to try this notebook out
PROJECT = project_name
BUCKET = project_name
BUCKET = BUCKET.replace("qwiklabs-gcp-", "inna-bckt-")
REGION = 'europe-west3'  # 'eu-west3' is not a valid GCP region name

print(PROJECT)
print(BUCKET)

qwiklabs-gcp-66dff79c51c6ef7e
inna-bckt-66dff79c51c6ef7e


In [19]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [20]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi

In [21]:
import shutil
import numpy as np
import tensorflow as tf

## enable eager execution (mostly to see mistakes right away):
## TURN OFF FOR BETTER PERFORMANCE. 
## also doesn't work for estimator API.
# tf.enable_eager_execution()

### Helper Code Snippets

In [22]:
## [[todo]]
## see also `lab-snippets.py`
%pwd
%ls *.csv
!head -n 2 train.csv

eval.csv  train.csv
6.0009827716399995,False,14,Single(1),40.0,4740473290291881219
7.3744626639,False,17,Single(1),42.0,4740473290291881219


<h2> Create TensorFlow model using TensorFlow's Estimator API </h2>
<p>
First, write an input_fn to read the data.
<p>

## Lab Task 1
Verify that the headers match your CSV output

In [23]:
import shutil
import numpy as np
import tensorflow as tf

In [24]:
# Determine CSV, label, and key columns
CSV_COLUMNS = 'weight_pounds,is_male,mother_age,plurality,gestation_weeks,key'.split(',')
LABEL_COLUMN = 'weight_pounds'
KEY_COLUMN = 'key'

# Set default values for each CSV column
DEFAULTS = [[0.0], ['null'], [0.0], ['null'], [0.0], ['nokey']]
TRAIN_STEPS = 1000
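
One way to carry out the header check from Lab Task 1 without TensorFlow is sketched below. It is only an illustration: `check_row` is a hypothetical helper, and it mimics how `record_defaults` decides each field's type and fallback value. The sample line is the first row of train.csv printed earlier.

```python
import csv
import io

# Column names and defaults copied from the cell above.
CSV_COLUMNS = 'weight_pounds,is_male,mother_age,plurality,gestation_weeks,key'.split(',')
DEFAULTS = [[0.0], ['null'], [0.0], ['null'], [0.0], ['nokey']]

def check_row(line):
  """Parse one CSV line and cast each field to the type of its default."""
  fields = next(csv.reader(io.StringIO(line)))
  if len(fields) != len(CSV_COLUMNS):
    raise ValueError(
      'expected {} fields, got {}'.format(len(CSV_COLUMNS), len(fields)))
  row = {}
  for name, default, raw in zip(CSV_COLUMNS, DEFAULTS, fields):
    # An empty field falls back to its default value.
    row[name] = type(default[0])(raw) if raw != '' else default[0]
  return row

# First line of train.csv as printed earlier in this notebook.
sample = '6.0009827716399995,False,14,Single(1),40.0,4740473290291881219'
row = check_row(sample)
print(row)
```

If the number of fields and the number of column names disagree, the helper raises instead of silently mislabeling columns.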

## Lab Task 2

Fill out the details of the input function below

In [26]:
## stuff to try out some of the code below:
line_of_text = '5.43659938092,True,12,Single(1),39.0,1451354159195218418'
parsed_line = tf.decode_csv(line_of_text, record_defaults = DEFAULTS, field_delim = ',')
print(parsed_line)
print()

features = dict(zip(CSV_COLUMNS, parsed_line))
label = features.pop(LABEL_COLUMN)
print(features)
print(label)
print()

def change_indent():
  """
  ```javascript
var cell = Jupyter.notebook.get_selected_cell();
var config = cell.config;
var patch = {
      CodeCell:{
        cm_config:{indentUnit:2}
      }
    }
config.update(patch)
  ```
  You can enter the previous snippet in your browser’s 
  JavaScript console once. Then reload the notebook page 
  in your browser. Now, the preferred indent unit should 
  be equal to two spaces. The custom setting persists and 
  you do not need to reissue the patch on new notebooks.
  """
  return 1
  

[<tf.Tensor 'DecodeCSV:0' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:1' shape=() dtype=string>, <tf.Tensor 'DecodeCSV:2' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:3' shape=() dtype=string>, <tf.Tensor 'DecodeCSV:4' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:5' shape=() dtype=string>]

{'gestation_weeks': <tf.Tensor 'DecodeCSV:4' shape=() dtype=float32>, 'is_male': <tf.Tensor 'DecodeCSV:1' shape=() dtype=string>, 'key': <tf.Tensor 'DecodeCSV:5' shape=() dtype=string>, 'mother_age': <tf.Tensor 'DecodeCSV:2' shape=() dtype=float32>, 'plurality': <tf.Tensor 'DecodeCSV:3' shape=() dtype=string>}
Tensor("DecodeCSV:0", shape=(), dtype=float32)



In [27]:
# Create an input function reading a file using the Dataset API
# Then provide the results to the Estimator API
def read_dataset(filename_pattern, mode, batch_size = 512):
  def _input_fn():
    tf.logging.info('Defining input function.')
    def decode_csv(line_of_text):
      #tf.logging.info('Parsing line {}'.format(line_of_text))
      # TODO #1: Use tf.decode_csv to parse the provided line
      parsed_line = tf.decode_csv(line_of_text, record_defaults = DEFAULTS, field_delim = ',')
      # TODO #2: Make a Python dict.  The keys are the column names, the values are from the parsed data
      features = dict(zip(CSV_COLUMNS, parsed_line))
      # TODO #3: Return a tuple of features, label where features is a Python dict and label a float
      label = features.pop(LABEL_COLUMN)
      return features, label

    # TODO #4: Use tf.gfile.Glob to create list of files that match pattern
    file_list = tf.gfile.Glob(filename_pattern)
    # Create dataset from file list
    dataset = (tf.data.TextLineDataset(file_list)  # Read text file
             .map(decode_csv))  # Transform each elem by applying decode_csv fn
    # TODO #5: In training mode, shuffle the dataset and repeat indefinitely
    #                (Look at the API for tf.data.dataset shuffle)
    #          The mode input variable will be tf.estimator.ModeKeys.TRAIN if in training mode
    #          Tell the dataset to provide data in batches of batch_size 
    if (mode == tf.estimator.ModeKeys.TRAIN):
      num_epochs = None  ## train indefinitely
      dataset = dataset.shuffle(buffer_size = 10 * batch_size)
    else:
      num_epochs = 1
    
    dataset = dataset.repeat(num_epochs).batch(batch_size)
    # This will now return batches of features, label
    # * features = dictionary of {<featurename> : <tensor>}
    # * label = <tensor> of one or more labels
    return dataset
  return _input_fn
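
The order of operations in `_input_fn` (shuffle, then repeat, then batch) can be illustrated with plain Python generators. This is a rough sketch of the semantics, not of how `tf.data` is implemented; the helper names `shuffle`, `repeat` and `batch` are stand-ins written for this example.

```python
import itertools
import random

def shuffle(items, buffer_size, rng):
  """Buffered shuffle: keep a buffer and emit a random buffered element."""
  buffer = []
  for item in items:
    buffer.append(item)
    if len(buffer) > buffer_size:
      yield buffer.pop(rng.randrange(len(buffer)))
  while buffer:
    yield buffer.pop(rng.randrange(len(buffer)))

def repeat(make_items, num_epochs):
  """Replay the source num_epochs times, or forever when num_epochs is None."""
  epochs = itertools.count() if num_epochs is None else range(num_epochs)
  for _ in epochs:
    for item in make_items():
      yield item

def batch(items, batch_size):
  """Group consecutive items; the final batch may be smaller."""
  items = iter(items)
  while True:
    chunk = list(itertools.islice(items, batch_size))
    if not chunk:
      return
    yield chunk

rows = list(range(10))
rng = random.Random(0)

# Evaluation: no shuffle, one pass, so the last batch is short.
eval_batches = list(batch(repeat(lambda: rows, 1), 4))

# Training: shuffle, repeat forever, batch; take only the first 5 batches.
train_stream = batch(repeat(lambda: shuffle(rows, 8, rng), None), 4)
train_batches = list(itertools.islice(train_stream, 5))
print(eval_batches)
print(train_batches)
```

Because the shuffle sits before the repeat, each epoch is reshuffled on its own and examples from different epochs are not mixed, although a batch can still straddle an epoch boundary.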

## Lab Task 3

Use the TensorFlow feature column API to define appropriate feature columns for your raw features that come from the CSV.

<b> Bonus: </b> Separate your columns into wide columns (categorical, discrete, etc.) and deep columns (numeric, embedding, etc.)

In [38]:
## define feature columns: wide-deep-model:

def get_wide_deep_features():
  tf.logging.info('Getting feature columns.')
  ## define column types:
  is_male = tf.feature_column.categorical_column_with_vocabulary_list(
    'is_male',
    ['True', 'False', 'Unknown'])
  mother_age = tf.feature_column.numeric_column('mother_age')
  plurality = tf.feature_column.categorical_column_with_vocabulary_list(
    'plurality',
    ['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)', 
     'Quintuplets(5)', 'Multiple(2+)'])
  gestation_weeks = tf.feature_column.numeric_column('gestation_weeks')
  
  ## feature transformations:
  
  ## bucketize mother_age:
  age_buckets = tf.feature_column.bucketized_column(
    mother_age,   ## input: tf.feature_column variable
    boundaries = np.arange(15, 45, 1).tolist())
  gestation_buckets = tf.feature_column.bucketized_column(
    gestation_weeks,
    boundaries = np.arange(17, 47, 1).tolist())
  
  ## define list of "wide columns", i.e., categorical variables
  ## (linear relationships):
  wide = [is_male,
         plurality,
         age_buckets,
         gestation_buckets]
  
  ## feature cross all the wide columns [[?]]
  ## and embed into a lower dimension:
  crossed = tf.feature_column.crossed_column(wide, hash_bucket_size = 20000)
  embed = tf.feature_column.embedding_column(crossed, 3)
  
  ## define list of continuous columns (plus embedded columns);
  ## they have a "deep", complex relationship with the output:
  deep = [mother_age,
         gestation_weeks,
         embed]
  
  return wide, deep

## define feature columns: DNN-model: 
def get_features_dnn():
  tf.logging.info('Getting feature columns (DNN).')
  ## define column types:
  is_male = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(
    'is_male', ['True', 'False', 'Unknown']))
  mother_age = tf.feature_column.numeric_column('mother_age')
  plurality = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(
    'plurality',
    ['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)', 
     'Quintuplets(5)', 'Multiple(2+)']))
  gestation_weeks = tf.feature_column.numeric_column('gestation_weeks')
  ## don't do any bucketing here, just return feature list:
  features = [\
             is_male,
             mother_age,
             plurality,
             gestation_weeks]
  return features
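
To make the two transformations above concrete, here is a small plain-Python sketch of what bucketizing and one-hot (indicator) encoding compute. It assumes each plurality value is its own vocabulary entry, and that an out-of-vocabulary string encodes as all zeros, which to my understanding is the default behavior. `bucket_index` and `one_hot` are hypothetical helpers, not TensorFlow functions.

```python
import bisect

# Same boundaries as np.arange(15, 45, 1).tolist() in get_wide_deep_features.
AGE_BOUNDARIES = list(range(15, 45))

def bucket_index(value, boundaries):
  """Bucket i covers [boundaries[i-1], boundaries[i]); n boundaries give n + 1 buckets."""
  return bisect.bisect_right(boundaries, value)

PLURALITY_VOCAB = ['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)',
                   'Quintuplets(5)', 'Multiple(2+)']

def one_hot(value, vocabulary):
  """One slot per vocabulary entry; an unknown value yields all zeros."""
  return [1.0 if value == entry else 0.0 for entry in vocabulary]

print(bucket_index(14, AGE_BOUNDARIES))   # below the first boundary
print(bucket_index(26, AGE_BOUNDARIES))   # somewhere in the middle
print(one_hot('Twins(2)', PLURALITY_VOCAB))
```

With 30 boundaries there are 31 buckets, so every age below 15 shares one bucket and every age of 44 or more shares another.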

## Lab Task 4

To predict with the TensorFlow model, we also need a serving input function (we'll use this in a later lab). We will want all the inputs from our user.

Verify and change the column names and types here as appropriate. These should match your CSV_COLUMNS, minus the label and key columns, which this serving function does not accept.

In [39]:
# Create serving input function to be able to serve predictions later using provided inputs
def serving_input_fn():
  tf.logging.info('Serving input (function).')
  feature_placeholders = {
    'is_male': tf.placeholder(tf.string, [None]),
    'mother_age': tf.placeholder(tf.float32, [None]),
    'plurality': tf.placeholder(tf.string, [None]),
    'gestation_weeks': tf.placeholder(tf.float32, [None])
  }
  features = {
    key: tf.expand_dims(tensor, -1)
    for key, tensor in feature_placeholders.items()
  }
  input_receiver = tf.estimator.export.ServingInputReceiver(
    features, feature_placeholders)
  return input_receiver
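
The shape handling in `serving_input_fn` can be pictured with ordinary lists. The sketch below is an assumption-laden illustration: the request instances are made up, and `to_columns` and `expand_last_dim` are hypothetical helpers that mimic feeding `[None]`-shaped placeholders and then applying `tf.expand_dims(tensor, -1)`.

```python
def to_columns(instances, feature_names):
  """Turn a list of per-example dicts into one list per feature (shape [batch])."""
  return {name: [inst[name] for inst in instances] for name in feature_names}

def expand_last_dim(column):
  """Mimic tf.expand_dims(tensor, -1) on a 1-D list: [a, b] -> [[a], [b]]."""
  return [[value] for value in column]

FEATURE_NAMES = ['is_male', 'mother_age', 'plurality', 'gestation_weeks']

# Hypothetical request payload; the values are made up for illustration.
instances = [
  {'is_male': 'True', 'mother_age': 26.0,
   'plurality': 'Single(1)', 'gestation_weeks': 39.0},
  {'is_male': 'False', 'mother_age': 29.0,
   'plurality': 'Twins(2)', 'gestation_weeks': 38.0},
]

receiver_tensors = to_columns(instances, FEATURE_NAMES)
features = {name: expand_last_dim(col) for name, col in receiver_tensors.items()}
print(receiver_tensors['mother_age'])
print(features['mother_age'])
```

The placeholders receive one value per example, while the model sees one row per example with a single column, which is the shape the feature columns expect.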

## Lab Task 5

Complete the TODOs in this code:

In [40]:
# Create estimator to train and evaluate
def train_and_evaluate(output_dir):
  tf.logging.info('Entering train and evaluate function.')
  ## create a run configuration:
  EVAL_INTERVAL = 300
  run_config = tf.estimator.RunConfig(
    save_checkpoints_secs = EVAL_INTERVAL,
    keep_checkpoint_max = 3)

  ## TODO #1: Create your model estimator:

  ## Linear regressor (overwritten by the DNN regressor below):
  estimator = tf.estimator.LinearRegressor(
    model_dir = output_dir,
    feature_columns = get_features_dnn(),
    config = run_config
  )
    
  ## DNN regressor:
  estimator = tf.estimator.DNNRegressor(
    model_dir = output_dir,
    feature_columns = get_features_dnn(),
    hidden_units = [64, 32],
    config = run_config
  )

  ## Wide-and-Deep-Model:
  #wide, deep = get_wide_deep_features()
  #estimator = tf.estimator.DNNLinearCombinedRegressor(
  #  model_dir = output_dir,
  #  linear_feature_columns = wide,
  #  dnn_feature_columns = deep,
  #  dnn_hidden_units = [64, 32],
  #  config = run_config
  #)
    
  ## Training specification:
  train_spec = tf.estimator.TrainSpec(
                       # TODO #2: Call read_dataset passing in the training CSV file and the appropriate mode
                       input_fn = read_dataset(    ## returns _input_fn; the Estimator calls it later
                           filename_pattern = 'train.csv',
                           mode = tf.estimator.ModeKeys.TRAIN),  
                       max_steps = TRAIN_STEPS)
  exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)
  eval_spec = tf.estimator.EvalSpec(
                       # TODO #3: Call read_dataset passing in the evaluation CSV file and the appropriate mode
                       input_fn = read_dataset(
                           filename_pattern = 'eval.csv',
                           mode = tf.estimator.ModeKeys.EVAL),
                       steps = None,
                       start_delay_secs = 60, # start evaluating after N seconds
                       throttle_secs = EVAL_INTERVAL,  # evaluate every N seconds
                       exporters = exporter)
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

Finally, train!

In [41]:
# Run the model
shutil.rmtree('babyweight_trained', ignore_errors = True) # start fresh each time
train_and_evaluate('babyweight_trained')

INFO:tensorflow:Entering train and evaluate function.
INFO:tensorflow:Getting feature columns (DNN).
INFO:tensorflow:Using config: {'_task_id': 0, '_session_config': None, '_global_id_in_cluster': 0, '_tf_random_seed': None, '_save_checkpoints_steps': None, '_is_chief': True, '_master': '', '_save_checkpoints_secs': 300, '_save_summary_steps': 100, '_train_distribute': None, '_task_type': 'worker', '_num_ps_replicas': 0, '_evaluation_master': '', '_log_step_count_steps': 100, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa7c5109860>, '_model_dir': 'babyweight_trained', '_keep_checkpoint_max': 3, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_worker_replicas': 1}
INFO:tensorflow:Getting feature columns (DNN).
INFO:tensorflow:Using config: {'_task_id': 0, '_session_config': None, '_global_id_in_cluster': 0, '_tf_random_seed': None, '_save_checkpoints_steps': None, '_is_chief': True, '_master': '', '_save_checkpoints_secs': 300, '_save

When I ran it, the final lines of the output (above) were:
<pre>
INFO:tensorflow:Saving dict for global step 1000: average_loss = 1.2693067, global_step = 1000, loss = 635.9226
INFO:tensorflow:Restoring parameters from babyweight_trained/model.ckpt-1000
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: babyweight_trained/export/exporter/temp-1517899936/saved_model.pb
</pre>
The exporter directory contains the final model. The final average_loss is 1.2693067; for a regressor this is the mean squared error, so the RMSE is its square root, about 1.13.
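
As a quick arithmetic check on the log above, here is a minimal sketch (the variable names are mine). For a regression Estimator, average_loss is a mean squared error, so its square root gives an RMSE in the label's units, which are pounds.

```python
import math

# Value copied from the "Saving dict for global step 1000" log line.
average_loss = 1.2693067

# Root mean squared error in pounds.
rmse = math.sqrt(average_loss)
print('RMSE = {:.4f} pounds'.format(rmse))
```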

<h2> Monitor and experiment with training </h2>

In [16]:
from google.datalab.ml import TensorBoard
TensorBoard().start('./babyweight_trained')

4494

In [17]:
for pid in TensorBoard.list()['pid']:
  TensorBoard().stop(pid)
  print('Stopped TensorBoard with pid {}'.format(pid))

Stopped TensorBoard with pid 4494


Copyright 2017-2018 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License