# Scaling up ML using Cloud ML Engine

<li>In this notebook, we take a previously developed TensorFlow model that predicts taxi fares and package it so that it can run on Cloud ML Engine (Cloud MLE).</li>
<li>The notebook illustrates *how* to package a TensorFlow model so that it runs within Google Cloud ML.</li>
<li>Running in the cloud gives speed, because you can choose the number of CPUs to run on, as opposed to running in Datalab on a single CPU.</li>
<li>The estimator code has been moved into a single file, model.py; the functions defined in it are called from task.py, which Cloud ML Engine runs.</li>

## Environment variables for project and bucket

Note that:
<ol>
<li> Your project ID is the *unique* string that identifies your project (not the project name). You can find it on the Home page of the GCP Console dashboard. My dashboard reads: <b>Project ID:</b> nyc-taxi-fare-project </li>
<li> Cloud training often involves saving and restoring model files, so you need a Cloud Storage bucket. Create the bucket from the GCP Console, because the console dynamically checks whether the bucket name you want is available. </li>
</ol>
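
Before trying a bucket name in the console, a quick syntactic check can catch obvious mistakes. The sketch below is a minimal example that covers only a subset of the Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores and dots; starting and ending with a letter or digit). The helper name `looks_like_valid_bucket_name` is hypothetical, and passing this check says nothing about whether the name is still available.

```python
import re

# Subset of the Cloud Storage bucket naming rules (assumption: dotted names
# longer than 63 characters and reserved prefixes are not handled here).
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Return True if `name` passes a basic syntactic check (hypothetical helper)."""
    return _BUCKET_RE.fullmatch(name) is not None


print(looks_like_valid_bucket_name("nyc_taxi_fare_cloud_run"))  # True
print(looks_like_valid_bucket_name("My-Bucket"))                # False (uppercase letters)
```

Availability is still decided by GCP when the bucket is created.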



In [1]:
import os
PROJECT = 'nyc-taxi-fare-project' # REPLACE WITH YOUR PROJECT ID
BUCKET = 'nyc_taxi_fare_cloud_run' # REPLACE WITH YOUR BUCKET NAME
REGION = 'us-east1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1

In [2]:
# For Python Code
# Model Info
MODEL_NAME = 'taxifare'
# Model Version
MODEL_VERSION = 'v1'
# Training Directory name
TRAINING_DIR = 'taxi_trained'

In [3]:
# For Bash Code
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['MODEL_NAME'] = MODEL_NAME
os.environ['MODEL_VERSION'] = MODEL_VERSION
os.environ['TRAINING_DIR'] = TRAINING_DIR 
os.environ['TFVERSION'] = '1.8'  # Tensorflow version

In [4]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


### Create the bucket that stores the model and training data for Cloud Machine Learning Engine

In [None]:
#%%bash
# The bucket needs to exist for the gsutil commands in next cell to work
#gsutil mb -p ${PROJECT} gs://${BUCKET}

### Enable the Cloud Machine Learning Engine API

The next command calls the Cloud Machine Learning Engine API. For it to work, you must first enable the API in the Cloud Console UI.

The cell then allows the Cloud ML Engine service account to read/write to the bucket containing the training data.
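
The bash cell that follows pipes the JSON returned by the `getConfig` call into a Python one-liner that reads the `serviceAccount` field. A minimal sketch of that parsing step, written so that it runs under Python 3, is shown here. The function name `extract_service_account` and the sample payload are made up for illustration; only the `serviceAccount` key comes from the notebook.

```python
import json


def extract_service_account(response_text):
    """Pull the service account email out of a getConfig JSON response.

    Hypothetical helper: raises RuntimeError with the raw payload when the
    field is missing, which is what you would expect to see if the API has
    not been enabled.
    """
    config = json.loads(response_text)
    if "serviceAccount" not in config:
        raise RuntimeError("No 'serviceAccount' in response: {}".format(response_text))
    return config["serviceAccount"]


# Made-up payload for illustration only.
sample = '{"serviceAccount": "service-123@example.invalid"}'
print(extract_service_account(sample))  # service-123@example.invalid
```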

In [5]:
%%bash
# This command will fail if the Cloud Machine Learning Engine API has not been enabled (see the note above).
echo "Getting the service account email associated with the Cloud Machine Learning Engine API"

AUTH_TOKEN=$(gcloud auth print-access-token)
SVC_ACCOUNT=$(curl -X GET -H "Content-Type: application/json" \
    -H "Authorization: Bearer $AUTH_TOKEN" \
    https://ml.googleapis.com/v1/projects/${PROJECT}:getConfig \
    | python -c "import json; import sys; response = json.load(sys.stdin); \
    print(response['serviceAccount'])")  # If this command fails, the Cloud Machine Learning Engine API has not been enabled above.

echo "Authorizing the Cloud ML Service account $SVC_ACCOUNT to access files in $BUCKET"
gsutil -m defacl ch -u $SVC_ACCOUNT:R gs://$BUCKET   
gsutil -m acl ch -u $SVC_ACCOUNT:R -r gs://$BUCKET   # error message (if bucket is empty) can be ignored.  
gsutil -m acl ch -u $SVC_ACCOUNT:W gs://$BUCKET      

Getting the service account email associated with the Cloud Machine Learning Engine API
Authorizing the Cloud ML Service account service-884408627146@cloud-ml.google.com.iam.gserviceaccount.com to access files in nyc_taxi_fare_cloud_run


No changes to gs://nyc_taxi_fare_cloud_run/
No changes to gs://nyc_taxi_fare_cloud_run/taxifare/smallinput/taxi-test.csv
No changes to gs://nyc_taxi_fare_cloud_run/taxifare/smallinput/taxi-train.csv
No changes to gs://nyc_taxi_fare_cloud_run/taxifare/smallinput/taxi-valid.csv
Updated ACL on gs://nyc_taxi_fare_cloud_run/taxifare/smallinput/taxi_trained/
Updated ACL on gs://nyc_taxi_fare_cloud_run/taxifare/smallinput/taxi_trained/checkpoint
Updated ACL on gs://nyc_taxi_fare_cloud_run/taxifare/smallinput/taxi_trained/eval/events.out.tfevents.1535997996.cmle-training-17516292832578914890
Updated ACL o

## Packaging up the code

Take your code and put it into a standard Python package structure. <a href="taxifare/trainer/model.py">model.py</a> and <a href="taxifare/trainer/task.py">task.py</a> contain the TensorFlow code from earlier (explore the <a href="taxifare/trainer/">directory structure</a>).
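
As a rough illustration of what "standard Python package structure" means here, the sketch below checks that a trainer directory contains `__init__.py`, `model.py` and `task.py`, the three source files listed by the `find` cell. The helper `missing_package_files` is hypothetical and the demo runs against a temporary directory rather than the real `taxifare/trainer`; it assumes Python 3.

```python
import os
import tempfile

# Files the trainer package is expected to contain (taken from the listing in this notebook).
REQUIRED_FILES = ("__init__.py", "model.py", "task.py")


def missing_package_files(package_dir):
    """Return the required files that are absent from `package_dir` (hypothetical helper)."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(package_dir, name))]


# Demo on a throwaway directory that deliberately lacks task.py.
with tempfile.TemporaryDirectory() as root:
    trainer = os.path.join(root, "taxifare", "trainer")
    os.makedirs(trainer)
    for name in ("__init__.py", "model.py"):
        open(os.path.join(trainer, name), "w").close()
    print(missing_package_files(trainer))  # ['task.py']
```

Without `__init__.py`, `python -m trainer.task` may not treat `trainer` as a regular package, so it is worth checking before submitting a job.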

In [6]:
%%bash
find ${MODEL_NAME}

taxifare
taxifare/trainer
taxifare/trainer/model.py
taxifare/trainer/task.py
taxifare/trainer/model.pyc
taxifare/trainer/__init__.pyc
taxifare/trainer/__init__.py
taxifare/.ipynb_checkpoints


In [None]:
%%bash
cat ${MODEL_NAME}/trainer/model.py

In [None]:
%%bash
pwd

## Find absolute paths to your data

Note the absolute paths below. In Datalab, /content is mapped to the location the home icon takes you to.

In [7]:
%%bash
echo "Working Directory: ${PWD}"
echo "Head of taxi-train.csv"
head -1 $PWD/taxi-train.csv
echo "Head of taxi-valid.csv"
head -1 $PWD/taxi-valid.csv

Working Directory: /content/datalab/NYC_cloud
Head of taxi-train.csv
8.5,Fri,0,-73.989012,40.763585,-74.003615,40.740253,1,notneeded
Head of taxi-valid.csv
2.5,Fri,0,-73.991437,40.717318,-73.993938,40.660867,1,notneeded


## Running the Python module from the command-line
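
The local run in this section passes several flags to `trainer.task`. The real task.py is not reproduced in this notebook, so the following is only a sketch of how such flags might be parsed with `argparse`. The flag names are taken from the command used in this section; the types, defaults and the `build_parser` function are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the flags used in this notebook (assumed types and defaults)."""
    parser = argparse.ArgumentParser(description="Sketch of a trainer entry point")
    parser.add_argument("--train_data_paths", required=True,
                        help="Path or glob for training CSV files")
    parser.add_argument("--eval_data_paths", required=True,
                        help="Path or glob for evaluation CSV files")
    parser.add_argument("--output_dir", required=True,
                        help="Where checkpoints and the exported model are written")
    parser.add_argument("--train_steps", type=int, default=1000,
                        help="Number of training steps")
    # Cloud ML Engine passes --job-dir; argparse exposes it as args.job_dir.
    parser.add_argument("--job-dir", default="./tmp",
                        help="Job directory supplied by the service")
    return parser


# Parse an explicit list so the example does not depend on sys.argv.
args = build_parser().parse_args([
    "--train_data_paths=taxi-train.csv",
    "--eval_data_paths=taxi-valid.csv",
    "--output_dir=taxi_trained",
    "--train_steps=1000",
    "--job-dir=./tmp",
])
print(args.train_steps, args.job_dir)  # 1000 ./tmp
```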

#### Clean model training dir/output dir

In [13]:
%%bash
# Remove the training directory so that each training run starts fresh. This must be done before
# TensorBoard is started.

rm -rf $PWD/${TRAINING_DIR}

#### Monitor using Tensorboard

In [None]:
from google.datalab.ml import TensorBoard
TensorBoard().start('taxi_trained')

Open question: do stale .pyc files need to be deleted when switching between Python 2 and Python 3?

In [17]:
%%bash
ls -lrt /content/datalab/NYC_cloud/taxi-train.csv

-rw-r--r-- 1 root root 254427262 Sep  3 16:37 /content/datalab/NYC_cloud/taxi-train.csv


In [None]:
%%bash
# Set up PYTHONPATH so that Python can find the trainer package; task.py drives model.py
export PYTHONPATH=${PYTHONPATH}:${PWD}/${MODEL_NAME}
# Currently set for Python 2.  To run with Python 3:
#    1.  Replace 'python' with 'python3' in the following command
#    2.  Edit trainer/task.py to use the proper module import method
python -m trainer.task \
    --train_data_paths=${PWD}/taxi-train.csv \
    --eval_data_paths=${PWD}/taxi-valid.csv \
    --output_dir=${PWD}/${TRAINING_DIR} \
    --train_steps=1000 \
    --job-dir=./tmp

In [None]:
%%bash
ls $PWD/${TRAINING_DIR}/export/exporter/

In [None]:
%%writefile ./test.json
{"pickuplon": -73.885262,"pickuplat": 40.773008,"dropofflon": -73.987232,"dropofflat": 40.732403,"passengers": 2}

In [None]:
%%bash
# This model dir is the model exported after training and is used for prediction
#
# For Python 2 this is sufficient.  A method for Python 3 still needs to be determined.
# Does not work for Python 3.  TODO:     --pythonVersion=3.5 \
#
model_dir=$(ls ${PWD}/${TRAINING_DIR}/export/exporter | tail -1)
# predict using the trained model
gcloud ml-engine local predict \
    --model-dir=${PWD}/${TRAINING_DIR}/export/exporter/${model_dir} \
    --json-instances=./test.json

#### Stop Tensorboard
The training directory will be deleted. Stop the existing TensorBoard before removing the directory it's using.

In [None]:
pids_df = TensorBoard.list()
if not pids_df.empty:
    for pid in pids_df['pid']:
        TensorBoard().stop(pid)
        print('Stopped TensorBoard with pid {}'.format(pid))

## Submit training job using gcloud

First copy the training data to the cloud.  Then, launch a training job.

After you submit the job, go to the cloud console (http://console.cloud.google.com) and select <b>Machine Learning | Jobs</b> to monitor progress.  
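
The submission cell in this section derives a timestamped job name and a Cloud Storage output directory in bash. The same derivation can be sketched in Python, which is handy if you want to log or reuse these values from a Python cell. The function names `make_job_name` and `make_output_dir` are hypothetical; the naming pattern and path layout mirror the bash cell, and the code assumes Python 3.

```python
from datetime import datetime, timezone


def make_job_name(model_name, now=None):
    """Mirror of ${MODEL_NAME}_$(date -u +%y%m%d_%H%M%S) (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    return "{}_{}".format(model_name, now.strftime("%y%m%d_%H%M%S"))


def make_output_dir(bucket, model_name, training_dir):
    """Mirror of gs://${BUCKET}/${MODEL_NAME}/smallinput/${TRAINING_DIR}."""
    return "gs://{}/{}/smallinput/{}".format(bucket, model_name, training_dir)


# A fixed timestamp keeps the example reproducible.
fixed = datetime(2018, 9, 4, 1, 25, 12, tzinfo=timezone.utc)
print(make_job_name("taxifare", fixed))
# taxifare_180904_012512
print(make_output_dir("nyc_taxi_fare_cloud_run", "taxifare", "taxi_trained"))
# gs://nyc_taxi_fare_cloud_run/taxifare/smallinput/taxi_trained
```

Because the job name embeds a UTC timestamp down to the second, resubmitting produces a fresh job ID each time.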


In [20]:
%%bash
# Clear the Cloud Storage destination and copy the CSV files to the Cloud Storage bucket
# Run this once; after the data has been copied, don't run it again
echo $BUCKET
gsutil -m rm -rf gs://${BUCKET}/${MODEL_NAME}/smallinput/
gsutil -m cp ${PWD}/*.csv gs://${BUCKET}/${MODEL_NAME}/smallinput/

nyc_taxi_fare_cloud_run


CommandException: 1 files/objects could not be removed.
Copying file:///content/datalab/NYC_cloud/taxi-train.csv [Content-Type=text/csv]...
Copying file:///content/datalab/NYC_cloud/taxi-valid.csv [Content-Type=text/csv]...
Copying file:///content/datalab/NYC_cloud/taxi-test.csv [Content-Type=text/csv]...
/ [0/3 files][    0.0 B/346.8 MiB]   0% Done
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objec

In [8]:
%%bash
OUTDIR=gs://${BUCKET}/${MODEL_NAME}/smallinput/${TRAINING_DIR}
JOBNAME=${MODEL_NAME}_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
# Clear the Cloud Storage Bucket used for the training job
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/${MODEL_NAME}/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC \
   --runtime-version=$TFVERSION \
   -- \
   --train_data_paths="gs://${BUCKET}/${MODEL_NAME}/smallinput/taxi-train*" \
   --eval_data_paths="gs://${BUCKET}/${MODEL_NAME}/smallinput/taxi-valid*"  \
   --output_dir=$OUTDIR \
   --train_steps=350000

gs://nyc_taxi_fare_cloud_run/taxifare/smallinput/taxi_trained us-east1 taxifare_180904_012512
jobId: taxifare_180904_012512
state: QUEUED


Removing gs://nyc_taxi_fare_cloud_run/taxifare/smallinput/taxi_trained/#1535998864194457...
Removing gs://nyc_taxi_fare_cloud_run/taxifare/smallinput/taxi_trained/export/#1535998000123150...
Removing gs://nyc_taxi_fare_cloud_run/taxifare/smallinput/taxi_trained/checkpoint#1535998866836066...
Removing gs://nyc_taxi_fare_cloud_run/taxifare/smallinput/taxi_trained/eval/#1535997995887338...
Removing gs://nyc_taxi_fare_cloud_run/taxifare/smallinput/taxi_trained/export/exporter/#1535998000420752...
Removing gs://nyc_taxi_fare_cloud_run/taxifare/smallinput/taxi_trained/eval/events.out.tfevents.1535997996.cmle-training-17516292832578914890#1535998978894209...
Removing gs://nyc_taxi_fare_cloud_run/taxifare/smallinput/taxi_trained/export/exporter/1535997998/#1535998007956245...
Removing gs://nyc_taxi_fare_cloud_run/taxifare/smallinput/taxi_trained/export/exporter/1535997998/saved_model.pb#1535998008357731...
Removing gs://nyc_taxi_fare_cloud_run/taxifare/smallinput/taxi_trained/export/exporter/1

### Progress can be monitored using TensorBoard to check the loss, RMSE on validation data, etc.

<li>The TensorBoard output has been uploaded to GitHub as PDF files.</li>
<li>Validation RMSE here is again around 4.75, since we have only rerun the same data on Cloud ML Engine.</li>
<li>This was an exercise in leveraging the power of Cloud ML Engine to get results faster.</li>
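
For reference, RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual fares. The sketch below is a plain-Python illustration with made-up numbers; it is not the metric code used by the estimator, and the `rmse` function is a hypothetical helper.

```python
import math


def rmse(predictions, labels):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("predictions and labels must be non-empty and the same length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up fares: the errors are 3 and -4, so RMSE = sqrt((9 + 16) / 2), about 3.54.
print(rmse([10.0, 6.0], [7.0, 10.0]))
```

An RMSE of about 4.75 therefore means the predicted fare is typically off by a few dollars, with large misses weighted more heavily than small ones.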