<h1> 2c. Loading large datasets progressively with the tf.data.Dataset </h1>

In this notebook, we continue reading the same small dataset, but refactor our ML pipeline in two small, but significant, ways:
<ol>
<li> Refactor the input to read data from disk progressively.
<li> Refactor the feature creation so that it is not one-to-one with inputs.
</ol>
<br/>
The Pandas function in the previous notebook read the entire dataset into memory before training could start. On a large dataset, this won't be an option.
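
To make the contrast concrete, here is a minimal standard-library sketch of progressive reading. It is not how tf.data is implemented; it only illustrates the idea of keeping one batch in memory at a time. The helper name `iter_batches` and the demo file are hypothetical.

```python
import csv
import itertools
import os
import tempfile

def iter_batches(path, batch_size):
    """Yield lists of parsed rows, holding at most batch_size rows in memory (hypothetical helper)."""
    with open(path, newline='') as f:
        reader = csv.reader(f)
        while True:
            # islice pulls only the next batch_size rows from the file
            batch = [[float(x) for x in row] for row in itertools.islice(reader, batch_size)]
            if not batch:
                return
            yield batch

# Write a small demo file with 10 rows of three numeric columns
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.csv')
with open(demo_path, 'w', newline='') as f:
    csv.writer(f).writerows([[i, i + 0.5, i * 2.0] for i in range(10)])

batches = list(iter_batches(demo_path, batch_size=4))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)
```

Because the generator only advances when the consumer asks for the next batch, memory use depends on the batch size rather than on the file size.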

## Challenge Exercise

Create a neural network that is capable of finding the volume of a cylinder given the radius of its base (r) and its height (h). Assume that the radius and height of the cylinder are both in the range 0.5 to 2.0. Unlike in the challenge exercise for b_estimator.ipynb, assume that your measurements of r, h and V are all rounded off to the nearest 0.1. Simulate the necessary training dataset. This time, you will need a lot more data to get a good predictor.

Now modify the "noise" so that instead of just rounding off the value, there is up to a 10% error (uniformly distributed) in the measurement followed by rounding off.
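
As a rough sketch of what this noise model does to a single measurement, the following uses only the standard library. The function name `noisy_measurement` and the example values of r and h are assumptions chosen for illustration. Ignoring rounding, a 10% error in both r and h can change the implied volume by up to a factor of 1.1² × 1.1 = 1.331, which is one reason more training data is needed.

```python
import math
import random

def noisy_measurement(value, error=0.1, rng=random):
    """Apply up to `error` relative uniform noise, then round to the nearest 0.1 (hypothetical helper)."""
    noisy = rng.uniform(value * (1 - error), value * (1 + error))
    return round(noisy, 1)

rng = random.Random(0)          # fixed seed so the sketch is repeatable
r_true, h_true = 1.5, 2.0       # illustrative values inside the 0.5 to 2.0 range
v_true = math.pi * r_true ** 2 * h_true

samples = [(noisy_measurement(r_true, rng=rng), noisy_measurement(h_true, rng=rng))
           for _ in range(1000)]
v_from_noisy = [math.pi * r ** 2 * h for r, h in samples]

# Largest relative deviation of the volume implied by the noisy r and h
max_rel_dev = max(abs(v - v_true) / v_true for v in v_from_noisy)
print(round(v_true, 3), round(max_rel_dev, 3))
```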

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np



In [2]:
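# Generate data
def make_some_noise(array, error=0.1):
  """ Draw each value uniformly from within +/- error (relative) of the input value """
  return np.random.uniform(array*(1-error), array*(1+error))

def initialise_data(n_rows, value_range=(5,20), random_seed=696, error=0.1):
  """ Generate random dataset for model training """
  # Note: the exercise text asks for r and h in 0.5 to 2.0; this default uses 5 to 20
  # Seed
  np.random.seed(random_seed)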
  
  # Generate r and h
  frame = pd.DataFrame({
    'r': np.random.uniform(value_range[0], value_range[1], n_rows),
    'h': np.random.uniform(value_range[0], value_range[1], n_rows),
  })
  
  # Compute v
  frame['v'] = np.pi*frame['r']**2 * frame['h']
  
  # Add noise to the r and h measurements (v is computed from the noise-free r and h)
  frame[['r', 'h']] = frame[['r', 'h']].apply(make_some_noise, error=error)
  
  # Round value to the nearest 0.1
  frame = frame.apply(np.round, decimals=1)
  
  return frame

In [3]:
# Test
initialise_data(5)

|   | h | r | v |
|---|---|---|---|
| 0 | 14.6 | 16.3 | 12056.6 |
| 1 | 20.2 | 12.4 | 10810.9 |
| 2 | 9.5 | 6.4 | 1404.1 |
| 3 | 13.4 | 6.3 | 1771.5 |
| 4 | 11.4 | 17.7 | 9628.8 |


In [18]:
# Generate datasets
train_file = './cyl-train.csv'
val_file = './cyl-val.csv'

initialise_data(50000).to_csv(train_file, header=False, index=False)
initialise_data(5000, random_seed=969).to_csv(val_file, header=False, index=False)

In [19]:
class tfUtils:
  def __init__(self, train_file, val_file, num_epochs=50, feature_cols=['h','r'], label='v', batch_size=64, queue_capacity=1000, defaults=[[1.1], [1.1], [1.1]]):
    self.num_epochs = num_epochs
    self.label = label
    self.batch_size = batch_size
    self.queue_capacity = queue_capacity
    self.features = feature_cols
    self.defaults = defaults
    self.train_file = train_file
    self.val_file = val_file
    
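    # Column order must match the CSV files: features first, then the label.
    # [label] keeps a multi-character label intact (list(label) would split it into characters).
    self.columns = feature_cols + [label]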
    
  def decode_csv(self, row):
    columns = tf.decode_csv(row, record_defaults=self.defaults)
    
    # Separate features and label
    features = dict(zip(self.columns, columns))    
    label = features.pop(self.label)
    
    return features, label

  def read_dataset(self, filename, mode, batch_size=1024, header=False):
    # Create list of file names that match "glob" pattern (i.e. data_file_*.csv)
    filenames = tf.data.Dataset.list_files(filename, shuffle=False)
    
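    # Read lines from text files, skipping the first line of each file if it is a header
    # (calling skip(1) on the filenames dataset would drop the first file, not the header row)
    if header:
      textlines = filenames.flat_map(lambda f: tf.data.TextLineDataset(f).skip(1))
    else:
      textlines = filenames.flat_map(tf.data.TextLineDataset)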
    
    # Parse text lines as comma-separated values (CSV)
    dataset = textlines.map(self.decode_csv)

    # Shuffle and repeat indefinitely when training; read the data exactly once otherwise
    if mode == tf.estimator.ModeKeys.TRAIN:
      num_epochs = None
      dataset = dataset.shuffle(buffer_size=10 * batch_size, seed=2)
    else:
      num_epochs = 1 # end-of-input after this

    dataset = dataset.repeat(num_epochs).batch(batch_size)

    return dataset
  
  def get_feature_cols(self):
    return [tf.feature_column.numeric_column(k) for k in self.features]

  def get_train_input_fn(self):
    return self.read_dataset(self.train_file, mode=tf.estimator.ModeKeys.TRAIN)

  def get_eval_input_fn(self):
    return self.read_dataset(self.val_file, mode=tf.estimator.ModeKeys.EVAL)
  
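  def compute_rmse(self, model, file):
    # Evaluate on the given file; average_loss is the mean squared error per example
    metrics = model.evaluate(input_fn=lambda: self.read_dataset(file, mode=tf.estimator.ModeKeys.EVAL))
    print('RMSE = ', np.sqrt(metrics['average_loss']))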
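
The parsing step in `decode_csv` pairs each parsed column with its name and then splits off the label. A plain-Python sketch of that pattern, without TensorFlow, is shown here; `split_features_and_label` is a hypothetical name used only for illustration.

```python
def split_features_and_label(values, feature_names, label_name):
    """Pair values with column names, then separate the label from the features."""
    columns = list(feature_names) + [label_name]
    features = dict(zip(columns, values))
    label = features.pop(label_name)
    return features, label

# One row in the column order shown in the test output above (h, r, v)
features, label = split_features_and_label([14.6, 16.3, 12056.6], ['h', 'r'], 'v')
print(features, label)

# Pitfall: list() on a string splits it into characters,
# so a multi-character label needs [label_name] rather than list(label_name)
single_char, multi_char = list('v'), list('vol')
print(single_char, multi_char)
```

The column names must be listed in the same order as the columns in the CSV files, since the files are written without a header.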

In [20]:
# Initialise
tf_utils = tfUtils(train_file, val_file)

In [34]:
# Set logging settings
tf.logging.set_verbosity(tf.logging.INFO)

# Initialise DNN Regressor
model = tf.estimator.DNNRegressor(
          hidden_units=[3,2,1],
          feature_columns=tf_utils.get_feature_cols()
          )

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_service': None, '_keep_checkpoint_every_n_hours': 10000, '_task_type': 'worker', '_save_checkpoints_steps': None, '_tf_random_seed': None, '_task_id': 0, '_master': '', '_train_distribute': None, '_evaluation_master': '', '_global_id_in_cluster': 0, '_session_config': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f83984fff98>, '_is_chief': True, '_model_dir': '/tmp/tmpuw1l16ym', '_num_worker_replicas': 1, '_log_step_count_steps': 100, '_save_summary_steps': 100, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_save_checkpoints_secs': 600}


In [35]:
# Train model
model.train(input_fn=tf_utils.get_train_input_fn, steps=2000)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmpuw1l16ym/model.ckpt.
INFO:tensorflow:step = 1, loss = 76150220000.0
INFO:tensorflow:global_step/sec: 35.3381
INFO:tensorflow:step = 101, loss = 72862010000.0 (2.836 sec)
INFO:tensorflow:global_step/sec: 35.0956
INFO:tensorflow:step = 201, loss = 80288840000.0 (2.849 sec)
INFO:tensorflow:global_step/sec: 35.0102
INFO:tensorflow:step = 301, loss = 78698410000.0 (2.858 sec)
INFO:tensorflow:global_step/sec: 35.5196
INFO:tensorflow:step = 401, loss = 70392570000.0 (2.814 sec)
INFO:tensorflow:global_step/sec: 35.4902
INFO:tensorflow:step = 501, loss = 76875860000.0 (2.818 sec)
INFO:tensorflow:global_step/sec: 34.6313
INFO:tensorflow:step = 601, loss = 75832730000.0 (2.887 sec)
INFO:tensorflow:global_step/s

<tensorflow.python.estimator.canned.dnn.DNNRegressor at 0x7f83984ff080>

In [36]:
tf_utils.compute_rmse(model, val_file)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-02-13-16:39:32
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpuw1l16ym/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-02-13-16:39:33
INFO:tensorflow:Saving dict for global step 2000: average_loss = 71455350.0, global_step = 2000, loss = 71455360000.0
RMSE =  8453.127
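
The reported numbers can be cross-checked by hand. The sketch below assumes that `average_loss` is the mean squared error per example and that `loss` is the per-batch summed loss averaged over batches, which appears to be how the TF 1.x canned estimators report them. The row count and batch size are taken from the cells above.

```python
import math

average_loss = 71455350.0        # from the evaluation log above
loss = 71455360000.0             # from the evaluation log above
n_rows, batch_size = 5000, 1024  # validation rows and read_dataset's default batch size

rmse = math.sqrt(average_loss)
n_batches = math.ceil(n_rows / batch_size)

# Under the assumption above, loss / average_loss should equal n_rows / n_batches
ratio = round(loss / average_loss)
rows_per_batch_on_average = n_rows // n_batches
print(round(rmse, 3), n_batches, ratio, rows_per_batch_on_average)
```

An RMSE of this size is comparable to the volumes themselves (the sample rows shown earlier range from roughly 1,400 to 12,000), and the training loss in the log barely moves, so the model as configured has not yet learned a useful predictor.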
