![Machine Learning Live](images/title-slide.png "Title")

In [None]:
import datalab.bigquery as bq
import seaborn as sns
import pandas as pd
import numpy as np
import os
import shutil
import tensorflow as tf
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)

# predict an output based on known inputs

* rate/demand forecasts
* probability of things like fraud or conversions
* package delivery times
* medical, eg: mortality rates for risk groups
* insurance risk
* operations/monitoring

## traditional rules

~~~~
                   rules
                     ||
                     \/
             ----------------
 input ====> |  algorithm   |  =====> output
             ----------------
~~~~

## With Machine Learning

~~~~
                    data
                     ||
                     \/
             ----------------
 input ====> |   ML model   |  =====> output
             ----------------
~~~~

# Why Taxi Fares?

The city of [New York published a dataset](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml) with more than 150M taxi rides per year since 2009. This makes it a great dataset for training because:

* one can easily relate to the data (we have all used taxis before)
* simple enough, one table with a handful of columns
* large enough to do relevant machine learning
* there are lots of blogs, notebooks etc from others about all aspects of this data

# NYC Taxi fares

Sounds easy enough to do with rules:

* initial charge \$2.50
* \$0.40 per 1/5 mile
* \$0.40 per 1 minute stopped/slow traffic
* \$1.00 Weekday Surcharge 4pm-8pm
* \$0.50 Night Surcharge 8pm-6am

But to calculate this before a trip you need to already know the route time and distance. We will use this route from LGA to midtown Manhattan as an example and you can see that there are different route options with different estimated times and mileages. These also vary highly with the amount of traffic over the course of a day.
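
As a rough illustration of the rules listed above, here is a minimal sketch of a rule-based fare calculator. The function name and arguments are hypothetical, and it simplifies the meter by treating distance and slow-traffic minutes as independent inputs. Note that it needs exactly the values we do not have before the trip starts.

```python
def metered_fare(distance_miles, slow_minutes, pickup_hour, is_weekday):
    """Estimate a fare from the rate card listed above (simplified sketch)."""
    fare = 2.50                             # initial charge
    fare += 0.40 * (distance_miles * 5)     # 0.40 per 1/5 mile
    fare += 0.40 * slow_minutes             # 0.40 per minute stopped or in slow traffic
    if is_weekday and 16 <= pickup_hour < 20:
        fare += 1.00                        # weekday surcharge, 4pm-8pm
    if pickup_hour >= 20 or pickup_hour < 6:
        fare += 0.50                        # night surcharge, 8pm-6am
    return round(fare, 2)

# Hypothetical trip: 10 miles, 5 slow minutes, picked up at 5pm on a weekday.
example_fare = metered_fare(10, 5, 17, True)
print(example_fare)
```

The arithmetic is trivial once distance and time are known; the hard part is estimating them up front.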

Another problem: fare increases. Hard-coded rates go stale whenever the rate card changes.

![sample route LGA to midtown](images/sample_route.png "Sample Route")

# A quick look at the data

The dataset that we will use is public data and conveniently available at several locations, including a <a href="https://bigquery.cloud.google.com/table/nyc-tlc:yellow.trips">BigQuery public dataset</a>.

Let's write a SQL query to poke around.

In [None]:
%sql --module afewrecords2
SELECT
  pickup_datetime,
  pickup_longitude, pickup_latitude, 
  dropoff_longitude, dropoff_latitude,
  passenger_count,
  trip_distance,
  tolls_amount,
  fare_amount,
  total_amount
FROM
  [nyc-tlc:yellow.trips]
WHERE
  ABS(HASH(pickup_datetime)) % $EVERY_N == 1

In [None]:
trips = bq.Query(afewrecords2, EVERY_N=100000).to_dataframe()
trips[:10]

# What to do with this dataset?

* reporting
* analytics and exploration
* predictive analytics

But first we should do some data wrangling to improve the quality of the dataset we work with.

Let's increase the number of records so that we can do some neat graphs. There is no guarantee about the order in which records are returned, and so no guarantee about which records get returned if we simply increase the LIMIT. To properly sample the dataset, let's use the HASH of the pickup time and return 1 in 100,000 records -- because there are 1 billion records in the data, we should get back approximately 10,000 records if we do this.
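
To see why hashing gives a repeatable sample, here is a small sketch in plain Python. It assumes `hashlib.md5` as a stand-in for BigQuery's `HASH` function, so it will not pick the same rows as the query; the helper name `keep_record` and the generated timestamps are made up for illustration.

```python
import hashlib

def keep_record(pickup_datetime, every_n):
    """Return True for roughly 1 in every_n timestamps, always the same ones."""
    digest = hashlib.md5(pickup_datetime.encode('utf-8')).hexdigest()
    return int(digest, 16) % every_n == 1

# One hour of made-up pickup times, one per second.
timestamps = ['2012-09-05 15:%02d:%02d' % (m, s) for m in range(60) for s in range(60)]
sample = [t for t in timestamps if keep_record(t, 100)]
print(len(timestamps), len(sample))
```

Because the decision depends only on the value being hashed, rerunning the selection returns the same records, which a plain `LIMIT` does not guarantee.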

<h3> Exploring data </h3>

Let's explore this dataset and clean it up as necessary. We'll use the Python Seaborn package to visualize graphs and Pandas to do the slicing and filtering.

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(15, 10)
ax = sns.regplot(x="trip_distance", y="fare_amount", fit_reg=False, ci=None, truncate=True, data=trips)

Hmm ... do you see something wrong with the data that needs addressing?

It appears that we have a lot of invalid data that is being coded as zero distance and some fare amounts that are definitely illegitimate. Let's remove them from our analysis. We can do this by modifying the BigQuery query to keep only trips longer than zero miles and fare amounts that are at least the minimum cab fare (\$2.50) and below \$120.

Note the extra WHERE clauses.

In [None]:
%sql --module afewrecords3
SELECT
  pickup_datetime,
  pickup_longitude, pickup_latitude, 
  dropoff_longitude, dropoff_latitude,
  passenger_count,
  trip_distance,
  tolls_amount,
  fare_amount,
  total_amount
FROM
  [nyc-tlc:yellow.trips]
WHERE
  (ABS(HASH(pickup_datetime)) % $EVERY_N == 1 AND
  trip_distance > 0 AND fare_amount >= 2.5 AND fare_amount <120)

In [None]:
trips = bq.Query(afewrecords3, EVERY_N=100000).to_dataframe()
fig, ax = plt.subplots()
fig.set_size_inches(15, 10)
ax = sns.regplot(x="trip_distance", y="fare_amount", fit_reg=False, ci=None, truncate=True, data=trips)

What's up with the streaks at <span>$</span>45 and \$50?  Those are fixed-amount rides from JFK into anywhere in Manhattan, i.e. to be expected. Let's list the data to make sure the values look reasonable.

Let's examine whether the toll amount is captured in the total amount.

In [None]:
tollrides = trips[trips['tolls_amount'] > 0]
tollrides[tollrides['pickup_datetime'] == '2012-09-05 15:45:00']

Looking at a few samples above, it should be clear that the total amount reflects fare amount, toll and tip somewhat arbitrarily -- this is because when customers pay cash, the tip is not known.  So, we'll use the sum of fare_amount + tolls_amount as what needs to be predicted.  Tips are discretionary and do not have to be included in our fare estimation tool.

Let's also look at the distribution of values within the columns.

In [None]:
trips.describe()

<h3> Quality control and other preprocessing </h3>

We need to do some clean-up of the data:
<ol>
<li>New York city longitudes are around -74 and latitudes are around 41.</li>
<li>We shouldn't have zero passengers.</li>
<li>Fold tolls_amount into fare_amount so that it reflects only the fare plus tolls, and then remove the tolls_amount and total_amount columns.</li>
<li>Before the ride starts, we'll know the pickup and dropoff locations, but not the trip distance (that depends on the route taken), so remove it from the ML dataset</li>
<li>Discard the timestamp</li>
</ol>

We could do preprocessing in SQL, similar to how we removed the zero-distance rides, but just to show you another option, let's do this in Python.  In production, we'll have to carry out the same preprocessing on the real-time input data. 

This sort of preprocessing of input data is quite common in ML, especially if the quality-control is dynamic.

In [None]:
def preprocess(trips_in):
  trips = trips_in.copy(deep=True)
  trips.fare_amount = trips.fare_amount + trips.tolls_amount
  del trips['tolls_amount']
  del trips['total_amount']
  del trips['trip_distance']
  del trips['pickup_datetime']
  qc = np.all([
             trips['pickup_longitude'] > -78,
             trips['pickup_longitude'] < -70,
             trips['dropoff_longitude'] > -78,
             trips['dropoff_longitude'] < -70,
             trips['pickup_latitude'] > 37,
             trips['pickup_latitude'] < 45,
             trips['dropoff_latitude'] > 37,
             trips['dropoff_latitude'] < 45,
             trips['passenger_count'] > 0,
            ], axis=0)
  return trips[qc]

trips = bq.Query(afewrecords3, EVERY_N=10000).to_dataframe()

tripsqc = preprocess(trips)
tripsqc.describe()

# scoring of accuracy

To compare the quality of different models, we need a way to express the overall accuracy of a model. To do that, we run predictions against the same dataset and compare them against the known correct values. A simple way to aggregate all individual deviations or errors is the Root Mean Squared Error (RMSE). The calculation is

![rmse_formula](https://wikimedia.org/api/rest_v1/media/math/render/svg/eeb88fa0f90448e9d1a67cd7a70164f674aeb300 "RMSE formula")

The benefit of RMSE is that it is easy to calculate and the result is expressed in the same unit as the predicted label. Throughout the training phase(s) the model will "fit" against the training set. We can calculate the RMSE of a model against the training set, but it will usually be too optimistic because the model might be overfitting: it might be good at "remembering" certain outcomes in the training data or handling its specific quirks, but then underperform against real-life data.
That's why it is necessary to evaluate the performance of a model against a validation set. More sophisticated ML models and neural networks have many parameters that can be tuned either by an engineer or the training process itself. In those cases, each iteration of a model training would check the results against the validation set, adjust a parameter, then retrain and see if the new parameter value improved or regressed the model's performance.
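
As a worked example of the RMSE calculation, here is a minimal sketch with three made-up fares and predictions. The numbers are purely illustrative and do not come from the taxi data.

```python
import numpy as np

actual = np.array([10.0, 12.0, 8.0])       # known fares (made-up values)
predicted = np.array([9.0, 14.0, 8.0])     # model output (made-up values)

errors = actual - predicted                # individual deviations: 1, -2, 0
rmse = np.sqrt(np.mean(errors ** 2))       # sqrt((1 + 4 + 0) / 3)
print(round(rmse, 3))
```

Squaring makes the single 2-dollar miss count four times as much as the 1-dollar miss, so RMSE penalizes large errors more heavily than a plain average of absolute errors would.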

There are many options for choosing loss functions and they can have a significant impact on the performance of trained ML models. 

<h3> Create ML datasets </h3>

Let's split the QCed data randomly into training and validation sets.

In [None]:
shuffled = tripsqc.sample(frac=1)
trainsize = int(len(shuffled['fare_amount']) * 0.80)
validsize = int(len(shuffled['fare_amount']) * 0.20)


df_train = shuffled.iloc[:trainsize, :]
df_valid = shuffled.iloc[trainsize:(trainsize+validsize), :]

In [None]:
df_train.describe()

In [None]:
df_valid.describe()

Let's write out the two dataframes to appropriately named csv files. We can use these csv files for local training (recall that these files represent only about 1/10,000 of the full dataset) until we get to the point of using Dataflow and Cloud ML.

The training set will be used to fit the model in the training phase.
Validation is for quick evaluation of the trained model, especially useful in comparing different training parameters.
As a final step, it is also best practice to keep a separate test dataset that is used only to compare different models against each other.
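
A three-way split could be sketched as follows. This is an illustration on a toy dataframe, not part of the lab: the 80/10/10 proportions, the fixed seed and the `toy_*` names are assumptions.

```python
import numpy as np
import pandas as pd

# Stand-in for the cleaned trips; only the row count matters here.
toy = pd.DataFrame({'fare_amount': np.arange(100, dtype=float)})

shuffled_toy = toy.sample(frac=1, random_state=42)    # fixed seed for repeatability
n_train = int(len(shuffled_toy) * 0.8)
n_valid = int(len(shuffled_toy) * 0.1)

toy_train = shuffled_toy.iloc[:n_train]
toy_valid = shuffled_toy.iloc[n_train:n_train + n_valid]
toy_test = shuffled_toy.iloc[n_train + n_valid:]       # everything left over
print(len(toy_train), len(toy_valid), len(toy_test))
```

The test rows are touched only once, after all tuning decisions have been made on the validation set.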

In [None]:
def to_csv(df, filename):
  outdf = df.copy(deep=False)
  outdf.loc[:, 'key'] = np.arange(0, len(outdf)) # rownumber as key
  # reorder columns so that target is first column
  cols = outdf.columns.tolist()
  cols.remove('fare_amount')
  cols.insert(0, 'fare_amount')
  print (cols)  # new order of columns
  outdf = outdf[cols]
  outdf.to_csv(filename, header=False, index_label=False, index=False)

! mkdir -p data
to_csv(df_train, 'data/taxi-train.csv')
to_csv(df_valid, 'data/taxi-valid.csv')

In [None]:
!head -10 data/taxi-train.csv

We have 2 .csv files corresponding to training and validation.  The ratio of file sizes corresponds to our split of the data.

Looks good! We now have our ML datasets and are ready to train ML models, validate them and evaluate them.

# rule-based Benchmark

Before we start building actual ML models, it is a good idea to come up with a very simple model and use that as a benchmark.

The simplest model is going to be to derive the trip distance and calculate the mean rate per km (or mile). Then to predict the rate we multiply that average cost per km with the trip distance.
In the absence of a map routing engine, we will have to use the direct line between pickup and dropoff which is obviously flawed but simple and fast to calculate.

In [None]:
def distance_between(lat1, lon1, lat2, lon2):
  # Great-circle distance in km "as the crow flies" (spherical law of cosines).  Taxis can't fly of course.
  # Degrees of arc * 60 = nautical miles; * 1.1515 = statute miles; * 1.609344 = km.
  dist = np.degrees(np.arccos(np.minimum(1,np.sin(np.radians(lat1)) * np.sin(np.radians(lat2)) + np.cos(np.radians(lat1)) * np.cos(np.radians(lat2)) * np.cos(np.radians(lon2 - lon1))))) * 60 * 1.1515 * 1.609344
  return dist

def estimate_distance(df):
  return distance_between(df['pickuplat'], df['pickuplon'], df['dropofflat'], df['dropofflon'])

def compute_rmse(actual, predicted):
  return np.sqrt(np.mean((actual-predicted)**2))

def print_rmse(df, rate, name):
  print ("{1} RMSE = {0}".format(compute_rmse(df['fare_amount'], rate*estimate_distance(df)), name))

FEATURES = ['pickuplon','pickuplat','dropofflon','dropofflat','passengers']
TARGET = 'fare_amount'
columns = list([TARGET])
columns.extend(FEATURES) # in CSV, target is the first column, followed by the features
columns.append('key')
df_train = pd.read_csv('data/taxi-train.csv', header=None, names=columns)
df_valid = pd.read_csv('data/taxi-valid.csv', header=None, names=columns)
rate = df_train['fare_amount'].mean() / estimate_distance(df_train).mean()
print ("Rate = ${0}/km".format(rate))
print_rmse(df_train, rate, 'Train')
#print_rmse(df_valid, rate, 'Valid')  

The simple distance-based rule gives us an RMSE of about <b>$8</b>.  We have to beat this, of course, but you will find that simple rules of thumb like this can be surprisingly difficult to beat.

Let's be ambitious, though, and make our goal to build ML models that have an RMSE of less than $6 on the test set.

# Machine Learning using Tensorflow

TensorFlow is a popular open source machine learning library. Beyond basic algorithms, it supports distributed training of neural networks. We will start with a linear regression model using tf.estimator and evaluate its performance. TensorFlow can run either locally or on a remote compute cluster. We will start with a small dataset (<10k records) so we can do it all in-memory on the VM that this notebook is running on. We will also just pass the data in as-is.

## A quick word about linear regression

The model is a simple linear equation:

y = c + a1*x1 + a2*x2 + ...

And during fitting/training we try out different weights to minimize the overall loss. In the simplest form, all inputs to the equation must be numerical for this to work.
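
To make the equation concrete, here is a small sketch that recovers the weights of a known linear relationship from synthetic data. It uses NumPy's closed-form least-squares solver rather than the iterative training that TensorFlow performs, and the true weights (3, 2, -1) are made up for the example.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=(200, 2))            # two numeric inputs, x1 and x2
y = 3.0 + 2.0 * x[:, 0] - 1.0 * x[:, 1]          # y = c + a1*x1 + a2*x2, no noise

design = np.column_stack([np.ones(len(x)), x])   # column of ones lets us solve for c
weights, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
c, a1, a2 = weights
print(c, a1, a2)
```

With noise-free data the solver recovers the weights almost exactly; on real fares the fit can only find the best linear compromise.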

In [None]:
# In CSV, label is the first column, followed by the features and then the key
CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
FEATURES = CSV_COLUMNS[1:len(CSV_COLUMNS) - 1]
LABEL = CSV_COLUMNS[0]

df_train = pd.read_csv('./data/taxi-train.csv', header = None, names = CSV_COLUMNS)
df_valid = pd.read_csv('./data/taxi-valid.csv', header = None, names = CSV_COLUMNS)

## Input function to read from Pandas Dataframe into tf

In [None]:
def make_input_fn(df, num_epochs):
  return tf.estimator.inputs.pandas_input_fn(
    x = df,
    y = df[LABEL],
    batch_size = 128,
    num_epochs = num_epochs,
    shuffle = True,
    queue_capacity = 1000,
    num_threads = 1
  )

### Create feature columns for estimator

In [None]:
def make_feature_cols():
  input_columns = [tf.feature_column.numeric_column(k) for k in FEATURES]
  return input_columns

<h3> Linear Regression with tf.Estimator framework </h3>

In [None]:
tf.logging.set_verbosity(tf.logging.INFO)

OUTDIR = 'taxi_trained'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time

model = tf.estimator.LinearRegressor(
      feature_columns = make_feature_cols(), model_dir = OUTDIR)

model.train(input_fn = make_input_fn(df_train, num_epochs = 10))

Evaluate on the validation data. Let's see if the number of iterations can help to improve the training. Try out different values and calculate the RMSE for the validation set.

In [None]:
def print_rmse(model, name, df):
  metrics = model.evaluate(input_fn = make_input_fn(df, 1))
  print('RMSE on {} dataset = {}'.format(name, np.sqrt(metrics['average_loss'])))
    
print_rmse(model, 'validation', df_valid)

This is nowhere near our benchmark (RMSE of $8 or so on this data), but it serves to demonstrate what TensorFlow code looks like.  Let's use this model for prediction.

# Deep Neural Network regression

http://playground.tensorflow.org

In [None]:
tf.logging.set_verbosity(tf.logging.INFO)
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
model = tf.estimator.DNNRegressor(hidden_units = [32, 8, 2],
      feature_columns = make_feature_cols(), model_dir = OUTDIR)
model.train(input_fn = make_input_fn(df_train, num_epochs = 20));

print_rmse(model, 'valid', df_valid)

We are not beating our benchmark with either model ... what's up?  Well, we may be using TensorFlow for Machine Learning, but we are not yet using it well.  That's what the rest of this session is about!

RMSE on benchmark dataset is > $10 (results will vary because of random seeds).

This is not only way more than our goal of 6.00, but it doesn't even beat our distance-based rule's RMSE of 8.

Fear not -- you have learned how to write a TensorFlow model, but not yet all the things that you will have to do to make your ML model performant. We will do this in the next chapters. In this chapter though, we will get our TensorFlow model ready for these improvements.

# Improving ML

## better data: Feature engineering

### Day of week

* date is available for training and inference
* exact date (YYYY-MM-DD) is not that useful as we would have different values between training and inference
* but we can extract the day of the week, which likely correlates with traffic
* It's also important to note that this should not be a numerical feature. day=4 is not any better than day=2
* We can code the dow as a categorical feature which will turn one input into 7 distinct features
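
The day-of-week encoding described above could be sketched with pandas as follows. The three timestamps are made up, and the three-letter day names are an assumption chosen for readability.

```python
import pandas as pd

DAYS = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

pickups = pd.to_datetime(pd.Series([
    '2012-09-05 15:45:00', '2012-09-08 02:10:00', '2012-09-10 08:30:00']))

# dt.dayofweek counts from Monday = 0
day_names = pickups.dt.dayofweek.map(lambda d: DAYS[d])

# One column per day; exactly one of the 7 columns is set in each row.
one_hot = pd.get_dummies(pd.Series(pd.Categorical(day_names, categories=DAYS)))
print(one_hot)
```

Declaring all seven categories up front keeps the column layout identical between training and inference, even when a batch happens to contain only a few distinct days.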

### Hour of Day

* the pickup time likely also has an impact on traffic and fares (we know that there are time-based surcharges)
* but we also know that there is likely not much difference between a pickup at 12:30 vs 12:42
* so let's settle on coding the hour of the day as another category
* furthermore we can create a feature-cross between hour-of-day and day-of-week
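
A feature cross of hour-of-day and day-of-week can be pictured as one combined category with 7 x 24 = 168 possible values. The sketch below builds such an id with pandas; the id formula and the sample timestamps are illustrative assumptions, not the encoding used by the estimator.

```python
import pandas as pd

pickups = pd.to_datetime(pd.Series([
    '2012-09-05 15:45:00', '2012-09-08 02:10:00', '2012-09-10 08:30:00']))

hourofday = pickups.dt.hour              # minutes are dropped: 12:30 and 12:42 both map to 12
dayofweek = pickups.dt.dayofweek         # Monday = 0 ... Sunday = 6

# One id per (day, hour) combination, in the range 0..167.
day_hour = dayofweek * 24 + hourofday
print(list(day_hour))
```

The cross lets a model learn, for example, that Saturday 2am behaves differently from Monday 2am, which neither feature could express on its own.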

### distance

* we saw from our initial benchmark that the Euclidean distance is a feature with a strong correlation
* that distance can easily be calculated from the input long/lat values
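
A minimal sketch of such a distance feature, computed directly on the coordinate differences, might look like this. The function name is hypothetical, and the result is in degrees, which is only a rough proxy because a degree of longitude covers less ground than a degree of latitude at New York's latitude.

```python
import numpy as np

def euclidean_degrees(pickuplat, pickuplon, dropofflat, dropofflon):
    """Straight-line distance between two points, measured in degrees."""
    latdiff = pickuplat - dropofflat
    londiff = pickuplon - dropofflon
    return np.sqrt(latdiff ** 2 + londiff ** 2)

dist_deg = euclidean_degrees(40.763492, -73.976055, 40.757328, -74.001891)
print(dist_deg)
```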

### pickup and dropoff

* our dataset has very (too?) accurate location data
* but we know that if pickup and dropoff are within a few city blocks, the fare should be about the same
* an extra challenge we do not address: bridges and tolls
* what we can easily do is to create a grid of buckets over the city's coordinates
* we can make that a tunable parameter, starting with a 10x10 grid
* additionally let's add a feature-cross of pickup and dropoff so that we can group similar trips together
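
The grid idea can be sketched with NumPy as follows. The bounding box (38 to 42 latitude, -76 to -72 longitude) and the helper names are assumptions for illustration; `NBUCKETS` plays the role of the tunable grid size.

```python
import numpy as np

NBUCKETS = 10                                         # tunable grid resolution
LAT_EDGES = np.linspace(38.0, 42.0, NBUCKETS + 1)     # assumed bounding box
LON_EDGES = np.linspace(-76.0, -72.0, NBUCKETS + 1)   # assumed bounding box

def grid_cell(lat, lon):
    """Map a coordinate to a single cell id in the NBUCKETS x NBUCKETS grid."""
    lat_idx = np.clip(np.digitize(lat, LAT_EDGES) - 1, 0, NBUCKETS - 1)
    lon_idx = np.clip(np.digitize(lon, LON_EDGES) - 1, 0, NBUCKETS - 1)
    return int(lat_idx) * NBUCKETS + int(lon_idx)

def trip_cell(pickuplat, pickuplon, dropofflat, dropofflon):
    """Feature cross: one id per (pickup cell, dropoff cell) pair."""
    return grid_cell(pickuplat, pickuplon) * NBUCKETS ** 2 + grid_cell(dropofflat, dropofflon)

pickup_cell = grid_cell(40.763492, -73.976055)
dropoff_cell = grid_cell(40.757328, -74.001891)
crossed = trip_cell(40.763492, -73.976055, 40.757328, -74.001891)
print(pickup_cell, dropoff_cell, crossed)
```

Two pickups a few blocks apart usually land in the same cell, so the model sees them as the same location; a finer grid separates them at the cost of many more categories.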

## hyperparameter tuning

* we already saw that the layout of neurons plays a great role in finding the best DNN
* likewise a different number of buckets for the location grid can have an impact
* we are going to search for the optimal values

## more data: scale to cloud training

* DNNs generally create better predictions the larger the training dataset becomes
* we will perform a few dry-runs locally
* and then increase the amount of training data and fit on a cluster of machines in the cloud

In [None]:
PROJECT = 'rostlab-181304'    # CHANGE THIS
BUCKET = 'rostlab-181304-ml' # REPLACE WITH YOUR BUCKET NAME. Use a regional bucket in the region you selected.
REGION = 'us-central1' # Choose an available region for Cloud MLE from https://cloud.google.com/ml-engine/docs/regions.

In [None]:
# for bash
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.8' 

In [None]:
%%bash
## ensure gcloud is up to date
gcloud components update

gcloud config set project $PROJECT
gcloud config set compute/region $REGION

## ensure we predict locally with our current Python environment
gcloud config set ml_engine/local_python `which python`

## Develop model with new inputs

Download the first shard of the preprocessed data to enable local development. We could create these files ourselves but they are also available on a public bucket.

In [None]:
%%bash
if [ -d sample ]; then
  rm -rf sample
fi
mkdir sample

gsutil cp "gs://$BUCKET/taxifare/taxi_preproc/train.csv-00000-of*" sample/train.csv
gsutil cp "gs://$BUCKET/taxifare/taxi_preproc/valid.csv-00000-of*" sample/valid.csv
wc -l sample/*

We have two new inputs in the INPUT_COLUMNS, three engineered features, and the estimator involves bucketization and feature crosses.

In [None]:
%%bash
grep -A 20 "INPUT_COLUMNS =" taxifare/trainer/model.py

In [None]:
%%bash
grep -A 50 "build_estimator" taxifare/trainer/model.py

In [None]:
%%bash
grep -A 15 "add_engineered(" taxifare/trainer/model.py

In [None]:
%%bash
rm -rf taxifare.tar.gz taxi_trained
export PYTHONPATH=${PYTHONPATH}:${PWD}/taxifare
python -m trainer.task \
  --train_data_paths=${PWD}/sample/train.csv \
  --eval_data_paths=${PWD}/sample/valid.csv  \
  --output_dir=${PWD}/taxi_trained \
  --train_steps=10 \
  --job-dir=/tmp

In [None]:
print('RMSE = {}'.format(np.sqrt(74)))

## Train on cloud

This will take <b> 5-10 minutes </b> even though the prompt immediately returns after the job is submitted. Monitor job progress on the [Cloud Console, in the ML Engine](https://console.cloud.google.com/mlengine) section and wait for the training job to complete.

In [None]:
%%bash
OUTDIR=gs://${BUCKET}/taxifare/taxi_trained
JOBNAME=ml_live_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
  --region=$REGION \
  --module-name=trainer.task \
  --package-path=${PWD}/taxifare/trainer \
  --job-dir=$OUTDIR \
  --staging-bucket=gs://$BUCKET \
  --scale-tier=STANDARD_1 \
  --runtime-version=$TFVERSION \
  -- \
  --train_data_paths="gs://${BUCKET}/taxifare/taxi_preproc/train.csv-00000-of-*" \
  --eval_data_paths="gs://${BUCKET}/taxifare/taxi_preproc/valid.csv-00000-of-*"  \
  --train_steps=5000 \
  --output_dir=$OUTDIR

The RMSE is now 8.33249, an improvement over the 9.3 that we were getting ... of course, we won't know until we train/validate on a larger dataset. Still, this is promising. But before we do that, let's do hyper-parameter tuning.

<b>Use the Cloud Console link to monitor the job and do NOT proceed until the job is done.</b>

# Hyper-parameter tune

Training a DNN supports a number of parameters such as the number of hidden layers, training batch size etc. These have a significant impact on a model's performance. Hypertuning is the process of iteratively trying out several options to come up with the optimal values.

## Command-line parameters to task.py

Note the command-line parameters to task.py.  These are the things that could be hypertuned if we wanted.

In [None]:
!grep -A 2 add_argument taxifare/trainer/task.py

## Evaluation metric

We add a special evaluation metric. It could be any objective function we want.

In [None]:
!grep -A 5 get_eval_metrics taxifare/trainer/model.py

## Add trial id to not overwrite existing results

In [None]:
!grep -A 5 "trial" taxifare/trainer/task.py

## Create hyper-parameter configuration

The file specifies the search region in parameter space.  Cloud MLE carries out a smart search algorithm within these constraints (i.e. it does not try out every single value).

In [None]:
%%writefile hyperparam.yaml
trainingInput:
  scaleTier: STANDARD_1
  hyperparameters:
    goal: MINIMIZE
    maxTrials: 30
    maxParallelTrials: 3
    hyperparameterMetricTag: rmse
    params:
    - parameterName: train_batch_size
      type: INTEGER
      minValue: 64
      maxValue: 512
      scaleType: UNIT_LOG_SCALE
    - parameterName: nbuckets
      type: INTEGER
      minValue: 10
      maxValue: 20
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: hidden_units
      type: CATEGORICAL
      categoricalValues: ["128 32", "256 128 16", "64 64 64 8"]       
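
To get a feel for what searching this space means, here is a sketch of a plain random search over the same three parameters. Random search is swapped in for simplicity and is not the algorithm the cloud service uses. The `fake_rmse` function is a hypothetical placeholder for a real training run, so its numbers are not results.

```python
import math
import random

def sample_trial(rng):
    """Draw one candidate from the search space described in hyperparam.yaml."""
    # Log scale: sample the exponent uniformly, then round to an integer.
    log_batch = rng.uniform(math.log(64), math.log(512))
    return {
        'train_batch_size': int(round(math.exp(log_batch))),
        'nbuckets': rng.randint(10, 20),                  # inclusive on both ends
        'hidden_units': rng.choice(['128 32', '256 128 16', '64 64 64 8']),
    }

def fake_rmse(params):
    """Hypothetical stand-in for training and evaluating a model. NOT a real result."""
    return 5.0 + abs(params['nbuckets'] - 16) * 0.1 + 64.0 / params['train_batch_size']

rng = random.Random(0)
trials = [sample_trial(rng) for _ in range(30)]           # maxTrials: 30
best = min(trials, key=fake_rmse)                         # goal: MINIMIZE
print(best)
```

Each trial here costs nothing, but in the real job every trial is a full training run, which is why the number of trials and the degree of parallelism are capped in the configuration.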

## Run the HP training job
We just add the new hyperparameter config to the existing job submission.

In [None]:
%%bash
OUTDIR=gs://${BUCKET}/taxifare/taxi_hypertrain
JOBNAME=ml_live_hp_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ai-platform jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/taxifare/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=STANDARD_1 \
   --runtime-version=$TFVERSION \
   --config=hyperparam.yaml \
   -- \
   --train_data_paths="gs://${BUCKET}/taxifare/taxi_preproc/train.csv-00000-of-*" \
   --eval_data_paths="gs://${BUCKET}/taxifare/taxi_preproc/valid.csv-00000-of-*"  \
   --output_dir=$OUTDIR \
   --train_steps=2500

Based on this we come up with these values for our training parameters:

* train_batch_size: 512
* nbuckets: 16
* hidden_units: "64 64 64 8"


# Run with optimal hyperparameters

In [None]:
%%bash

OUTDIR=gs://${BUCKET}/taxifare/feateng
JOBNAME=mllive_$(date -u +%y%m%d_%H%M%S)
TIER=STANDARD_1 
echo $OUTDIR $REGION $JOBNAME
#gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/taxifare/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=$TIER \
   --runtime-version=$TFVERSION \
   -- \
   --train_data_paths="gs://${BUCKET}/taxifare/taxi_preproc/train.csv-00001-of-*" \
   --eval_data_paths="gs://${BUCKET}/taxifare/taxi_preproc/valid.csv-00000-of-*"  \
   --output_dir=$OUTDIR \
   --train_steps=8000 \
   --train_batch_size=512 --nbuckets=16 --hidden_units="64 64 64 8"

This yields an RMSE of about \$5, which beats our first benchmark. But let's try this on a larger dataset and see how far we can push it.

# Run on large dataset with 2M records

In [None]:
%%bash

echo "WARNING -- this uses significant resources and is optional. Remove this line to run the block."; exit 0

OUTDIR=gs://${BUCKET}/taxifare/feateng2m
JOBNAME=mllivexl_$(date -u +%y%m%d_%H%M%S)
TIER=STANDARD_1 
echo $OUTDIR $REGION $JOBNAME
# only remove the outdir if you don't want to resume a previous run
#gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/taxifare/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=$TIER \
   --runtime-version=$TFVERSION \
   -- \
   --train_data_paths="gs://${BUCKET}/taxifare/taxi_preproc/train*" \
   --eval_data_paths="gs://${BUCKET}/taxifare/taxi_preproc/valid*"  \
   --output_dir=$OUTDIR \
   --train_steps=1600000 \
   --train_batch_size=512 --nbuckets=16 --hidden_units="64 64 64 8"

### Start Tensorboard

In [None]:
from google.datalab.ml import TensorBoard
OUTDIR='gs://{0}/taxifare/feateng2m'.format(BUCKET)
print(OUTDIR)
TensorBoard().start(OUTDIR)

### Stop Tensorboard

In [None]:
pids_df = TensorBoard.list()
if not pids_df.empty:
    for pid in pids_df['pid']:
        TensorBoard().stop(pid)
        print('Stopped TensorBoard with pid {}'.format(pid))

The RMSE after training on the 2-million-row dataset is below \$4.  This graph shows the improvements so far ...

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame({'Lab' : pd.Series(['1a', '2-3', '4a', '4b', '4c']),
              'Method' : pd.Series(['Heuristic Benchmark', 'tf.learn', '+Feature Eng.', '+ Hyperparam', '+ 2m rows']),
              'RMSE': pd.Series([8.026, 9.4, 8.3, 5.0, 3.03]) })

ax = sns.barplot(data = df, x = 'Method', y = 'RMSE')
ax.set_ylabel('RMSE (dollars)')
ax.set_xlabel('Labs/Methods')
plt.plot(np.linspace(-20, 120, 1000), [5] * 1000, 'b');

## Use model for prediction

Finally, let's use the trained model to predict the taxi fare to get me from my hotel to the Javits Center for Oracle Code NYC. First, we write the trip data to a file, then we run the prediction locally using the exported model.

In [None]:
%%writefile /tmp/test.json
{"dayofweek": "Tue", "hourofday": 10, "pickuplon": -73.976055, "pickuplat": 40.763492, "dropofflon": -74.001891, "dropofflat": 40.757328, "passengers": 2}

In [None]:
%%bash
model_dir=$(gsutil ls gs://${BUCKET}/taxifare/feateng2m/export/exporter | tail -1)
gcloud ml-engine local predict \
  --model-dir=${model_dir} \
  --json-instances=/tmp/test.json

# References

This notebook is based on labs from an excellent [Coursera class](https://www.coursera.org/learn/serverless-machine-learning-gcp) that is a great starting point to learn about machine learning with TensorFlow in the cloud.

Copyright 2016 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License