# Introducing AI Platform Training Service
**Learning Objectives:**
  - Learn how to make code compatible with AI Platform Training Service
  - Train your model using cloud infrastructure via AI Platform Training Service
  - Deploy your model behind a production grade REST API using AI Platform Training Service

## Introduction

In this notebook we'll make the jump from training and predicting locally to doing both in the cloud. We'll take advantage of Google Cloud's [AI Platform Training Service](https://cloud.google.com/ai-platform/).

AI Platform Training Service is a managed service that allows the training and deployment of ML models without having to provision or maintain servers. The infrastructure is handled seamlessly by the managed service for us.

In [None]:
# Uncomment and run if you need to update your Google SDK
# !sudo apt-get update && sudo apt-get --only-upgrade install google-cloud-sdk

## Make code compatible with AI Platform Training Service
In order to make our code compatible with AI Platform Training Service we need to make the following changes:

1. Upload data to Google Cloud Storage 
2. Move code into a Python package
3. Modify code to read data from and write checkpoint files to GCS 

### Upload data to Google Cloud Storage (GCS)

Cloud services don't have access to our local files, so we need to upload them to a location the Cloud servers can read from. In this case we'll use GCS.

Specify your project name and bucket name in the cell below.

In [9]:
PROJECT = "qwiklabs-gcp-ml-49b827b781ab"  # Replace with your PROJECT
BUCKET = "qwiklabs-gcp-ml-49b827b781ab"  # Replace with your BUCKET
REGION = "us-central1"            # Choose an available region for AI Platform Training Service
TFVERSION = "1.14"                # TF version for AI Platform Training Service to use

Jupyter allows the substitution of Python variables into bash commands when using the `!<cmd>` format.
This is also possible with the `%%bash` magic, but it requires an [additional parameter](https://stackoverflow.com/questions/19579546/can-i-access-python-variables-within-a-bash-or-script-ipython-notebook-c).
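As a rough mental model (an illustration only, not how IPython implements it), the `{VAR}` substitution behaves much like Python's `str.format` applied to the command before it is handed to the shell. The bucket name below is a made-up placeholder.

```python
# Illustrative only: mimic how `!gsutil mb -l {REGION} gs://{BUCKET}` is expanded.
# The values are hypothetical placeholders, not real resources.
REGION = "us-central1"
BUCKET = "my-example-bucket"

template = "gsutil mb -l {REGION} gs://{BUCKET}"
command = template.format(REGION=REGION, BUCKET=BUCKET)
print(command)
```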

In [10]:
!gcloud config set project {PROJECT}
!gsutil mb -l {REGION} gs://{BUCKET}
!gsutil -m cp *.csv gs://{BUCKET}/taxifare_20191123_1/smallinput/

Updated property [core/project].
Creating gs://qwiklabs-gcp-ml-49b827b781ab/...
ServiceException: 409 Bucket qwiklabs-gcp-ml-49b827b781ab already exists.
CommandException: No URLs matched: *.csv
CommandException: 1 file/object could not be transferred.


### Move code into a python package

When you execute an AI Platform Training Service training job, the service zips up your code and ships it to the Cloud so it can be run on Cloud infrastructure. To make this possible, AI Platform Training Service requires your code to be a Python package.

A Python package is simply a collection of one or more `.py` files along with an `__init__.py` file to identify the containing directory as a package. The `__init__.py` sometimes contains initialization code but for our purposes an empty file suffices.
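To make the role of `__init__.py` concrete, here is a minimal sketch that builds a throwaway package in a temporary directory and imports a module from it. The names `demo_pkg` and `hello` are hypothetical and exist only for this illustration.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package in a temporary directory (hypothetical names).
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg")
os.mkdir(pkg_dir)

# An empty __init__.py marks the directory as a package.
open(os.path.join(pkg_dir, "__init__.py"), "w").close()

# A module inside the package.
with open(os.path.join(pkg_dir, "model.py"), "w") as f:
    f.write("def hello():\n    return 'hello from demo_pkg.model'\n")

# Make the parent directory importable, then import by dotted name.
sys.path.insert(0, root)
importlib.invalidate_caches()
demo_model = importlib.import_module("demo_pkg.model")
print(demo_model.hello())
```

The dotted name `demo_pkg.model` has the same form as the `taxifaremodel.task` module name used later in this notebook.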

#### Create Package Directory and \_\_init\_\_.py

The bash command `touch` creates an empty file in the specified location.

In [1]:
%%bash
mkdir taxifaremodel
touch taxifaremodel/__init__.py

#### Paste existing code into model.py

A Python package requires our code to be in a `.py` file, as opposed to notebook cells. So we simply copy and paste our existing code from the previous notebook into a single file.

The %%writefile magic writes the contents of its cell to disk with the specified name.

#### **Exercise 1**

In the cell below, write the content of the `model.py` to the file `taxifaremodel/model.py`. This will allow us to package the model we developed in the previous labs so that we can deploy it to AI Platform Training Service. You'll also need to reuse the input functions and the `EvalSpec`, `TrainSpec`, `RunConfig`, etc. that we implemented in the previous labs.

Complete all the TODOs in the cell below by copy/pasting the code we developed in the previous labs. This will write all the necessary components we developed in our notebook to a single `model.py` file. 

Once we have the code running well locally, we will execute the next cells to train and deploy your packaged model to AI Platform Training Service.

In [13]:
%%writefile taxifaremodel/model.py
import tensorflow as tf
import shutil

CSV_COLUMN_NAMES = ["fare_amount","dayofweek","hourofday","pickuplon","pickuplat","dropofflon","dropofflat"]
CSV_DEFAULTS = [[0.0],[1],[0],[-74.0], [40.0], [-74.0], [40.7]]
FEATURE_NAMES = CSV_COLUMN_NAMES[1:] # all but first column


def parse_row(row):
    fields = tf.decode_csv(records = row, record_defaults = CSV_DEFAULTS)
    features = dict(zip(CSV_COLUMN_NAMES, fields))
    label = features.pop("fare_amount")
    return features, label

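def read_dataset(csv_path):
    # csv_path may be a single file or a file name pattern (e.g. taxi-train-*.csv)
    dataset = tf.data.Dataset.list_files(file_pattern = csv_path)
    dataset = dataset.flat_map(map_func = lambda filename: tf.data.TextLineDataset(filenames = filename).skip(count = 1)) # skip header of each file
    dataset = dataset.map(map_func = parse_row)
    return dataset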

def train_input_fn(csv_path, batch_size = 128):
    dataset = read_dataset(csv_path)
    dataset = dataset.shuffle(buffer_size = 1000).repeat(count = None).batch(batch_size = batch_size)
    return dataset

def eval_input_fn(csv_path, batch_size = 128):
    dataset = read_dataset(csv_path)
    dataset = dataset.batch(batch_size = batch_size)
    return dataset
  
def serving_input_receiver_fn():
    receiver_tensors = {
        'dayofweek' : tf.placeholder(shape=[None], dtype=tf.int32),
        'hourofday' : tf.placeholder(shape=[None], dtype=tf.int32),
        'pickuplon' : tf.placeholder(shape=[None], dtype=tf.float32),
        'pickuplat' : tf.placeholder(shape=[None], dtype=tf.float32),
        'dropofflon': tf.placeholder(shape=[None], dtype=tf.float32),
        'dropofflat': tf.placeholder(shape=[None], dtype=tf.float32),
        }
    features = receiver_tensors
    return tf.estimator.export.ServingInputReceiver(features = features, receiver_tensors = receiver_tensors)
    
def my_rmse(labels, predictions):  
    pred_values = tf.squeeze(input=predictions["predictions"])
    return {
        "rmse": tf.metrics.root_mean_squared_error(labels=labels,predictions=pred_values)
    }

def create_model(model_dir, train_steps):
    feature_cols = [tf.feature_column.numeric_column(key = k) for k in FEATURE_NAMES]
    
    myopt = tf.train.AdamOptimizer(learning_rate=0.01)
    config = tf.estimator.RunConfig(tf_random_seed = 1,
                                    save_summary_steps=10,
                                    save_checkpoints_steps = max(10, train_steps // 10),
                                    model_dir = model_dir)
    model = tf.estimator.DNNRegressor(model_dir=model_dir,
                                      hidden_units=[10, 10],
                                      feature_columns=feature_cols,
                                      activation_fn=tf.nn.relu,
                                      optimizer = myopt,
                                      config = config)    
    return model

def train_and_evaluate(params):
    OUTDIR = params["output_dir"]
    TRAIN_DATA_PATH = params["train_data_path"]
    EVAL_DATA_PATH = params["eval_data_path"]
    TRAIN_STEPS = params["train_steps"]

    model = create_model(OUTDIR, TRAIN_STEPS)
    model = tf.contrib.estimator.add_metrics(estimator = model, metric_fn = my_rmse)  

    train_spec = tf.estimator.TrainSpec(
                     input_fn = lambda: train_input_fn(TRAIN_DATA_PATH),
                     max_steps = TRAIN_STEPS)
    
    exporter = tf.estimator.FinalExporter(
               name='exporter',
               serving_input_receiver_fn = serving_input_receiver_fn)

    eval_spec = tf.estimator.EvalSpec(
                    input_fn = lambda: eval_input_fn(EVAL_DATA_PATH),
                    steps=None,
                    exporters=exporter,
                    start_delay_secs=1,
                    throttle_secs=1)

    tf.logging.set_verbosity(tf.logging.INFO) 
    shutil.rmtree(path = OUTDIR, ignore_errors = True)

    tf.estimator.train_and_evaluate(estimator = model, train_spec = train_spec, eval_spec = eval_spec)


Overwriting taxifaremodel/model.py


### Modify code to read data from and write checkpoint files to GCS 

If you look closely above, you'll notice two changes to the code:

1. The input function now supports reading a list of files matching a file name pattern instead of just a single CSV
  - This is useful because large datasets tend to exist in shards.
2. The train and evaluate portion is wrapped in a function that takes a parameter dictionary as an argument.
  - This is useful because the output directory, data paths and number of train steps will be different depending on whether we're training locally or in the cloud. Parametrizing allows us to use the same code for both.
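To illustrate the idea of a file name pattern matching several shards, here is a small sketch using the standard library's `glob` on local files. The shard names are hypothetical, and this is only an analogy: TensorFlow has its own pattern-matching helpers that also understand `gs://` paths.

```python
import glob
import os
import tempfile

# Create three hypothetical shard files in a temporary directory.
data_dir = tempfile.mkdtemp()
for i in range(3):
    path = os.path.join(data_dir, "taxi-train-{:05d}-of-00003.csv".format(i))
    with open(path, "w") as f:
        f.write("fare_amount,dayofweek\n")

# One pattern string expands to every shard.
pattern = os.path.join(data_dir, "taxi-train-*.csv")
shards = sorted(glob.glob(pattern))
print([os.path.basename(p) for p in shards])
```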

We specify these parameters at run time via the command line, which means we need to add code to parse command line parameters and invoke `train_and_evaluate()` with those params. This is the job of the `task.py` file.

Exposing parameters to the command line also allows us to use AI Platform Training Service's automatic hyperparameter tuning feature which we'll cover in a future lesson.
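The following is a minimal sketch of how `argparse` turns command line flags into a plain dictionary that a function like `train_and_evaluate()` could consume. Passing a list to `parse_args()` stands in for a real command line; the argument values are made up for the example.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--train_data_path", required=True)
parser.add_argument("--train_steps", type=int, default=1000)
parser.add_argument("--job-dir")  # argparse stores this under the key "job_dir"

# A list of strings stands in for the real command line.
args = parser.parse_args(["--train_data_path", "taxi-train.csv", "--train_steps", "5"])
params = vars(args)  # same mapping as args.__dict__
print(params)
```

Note that `argparse` converts the dash in `--job-dir` to an underscore, and that an optional flag that is not supplied shows up as `None`.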

#### **Exercise 2**

Add two additional command line parameter parsers to the list we've started below. You should add code to parse command line parameters for the `output_dir` and the `job-dir`. Look at the examples below to make sure you have the correct format, including a `help` description and `required` specification.

In [14]:
%%writefile taxifaremodel/task.py
import argparse
import json
import os

from . import model

if __name__ == "__main__":
    
    parser = argparse.ArgumentParser()
    
    parser.add_argument(
        "--train_data_path",
        help = "GCS or local path to training data",
        required = True
    )
    parser.add_argument(
        "--train_steps",
        help = "Steps to run the training job for (default: 1000)",
        type = int,
        default = 1000
    )
    parser.add_argument(
        "--eval_data_path",
        help = "GCS or local path to evaluation data",
        required = True
    )   
    parser.add_argument(
        "--output_dir",
        help = "GCS location to write checkpoints and export models",
        required = True
    )
    parser.add_argument(
        "--job-dir",
        help = "This is not used by our model, but it is required by gcloud",
    )
    args = parser.parse_args().__dict__

    model.train_and_evaluate(args)

Overwriting taxifaremodel/task.py


## Train using AI Platform Training Service (local)

AI Platform Training Service comes with a local test tool ([`gcloud ai-platform local train`](https://cloud.google.com/sdk/gcloud/reference/ml-engine/local/train)) to ensure we've packaged our code correctly. It's best to first run that for a few steps before trying a Cloud job.

The arguments before `-- \` are for AI Platform Training Service
- package-path: specifies the location of the Python package
- module-name: specifies which `.py` file should be run within the package. `task.py` is our entry point so we specify that

The arguments after `-- \` are sent to our `task.py`.
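Conceptually, the bare `--` acts as a separator between two argument lists. The sketch below shows that idea with a hypothetical helper, `split_args`; it is an illustration of the convention, not how `gcloud` is implemented.

```python
def split_args(argv):
    """Split an argument list at the first bare "--" separator."""
    if "--" in argv:
        idx = argv.index("--")
        return argv[:idx], argv[idx + 1:]
    return argv, []

argv = [
    "--package-path=taxifaremodel",
    "--module-name=taxifaremodel.task",
    "--",
    "--train_data_path=taxi-train.csv",
    "--train_steps=1",
]
service_args, user_args = split_args(argv)
print("for gcloud: ", service_args)
print("for task.py:", user_args)
```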

In [15]:
%%time
!gcloud ai-platform local train \
    --package-path=taxifaremodel \
    --module-name=taxifaremodel.task \
    -- \
    --train_data_path=taxi-train.csv \
    --eval_data_path=taxi-valid.csv  \
    --train_steps=1 \
    --output_dir=taxi_trained 

2019-11-23 08:40:50.397564: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 10 or save_checkpoints_secs None.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) con

## Train using AI Platform Training Service (Cloud)

To submit to the Cloud we use [`gcloud ai-platform jobs submit training [jobname]`](https://cloud.google.com/sdk/gcloud/reference/ml-engine/jobs/submit/training) and simply specify some additional parameters for AI Platform Training Service:
- jobname: A unique identifier for the Cloud job. We usually append system time to ensure uniqueness
- job-dir: A GCS location to upload the Python package to
- runtime-version: Version of TF to use. Defaults to 1.0 if not specified
- python-version: Version of Python to use. Defaults to 2.7 if not specified
- region: Cloud region to train in. See [here](https://cloud.google.com/ml-engine/docs/tensorflow/regions) for supported AI Platform Training Service regions

Below the `-- \`, note how we've changed our `task.py` args to be GCS locations.
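The shell expression `$(date -u +%y%m%d_%H%M%S)` used in the job name can be reproduced in Python. This is a small sketch with a hypothetical helper, `make_job_name`, in case you want to build the name in a Python cell instead.

```python
from datetime import datetime, timezone

def make_job_name(prefix="taxifare"):
    """Return the prefix followed by a UTC timestamp (yymmdd_HHMMSS)."""
    stamp = datetime.now(timezone.utc).strftime("%y%m%d_%H%M%S")
    return "{}_{}".format(prefix, stamp)

job_name = make_job_name()
print(job_name)
```

Because the timestamp has one-second resolution, two jobs submitted within the same second would still collide.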

In [16]:
OUTDIR = "gs://{}/taxifare_20191123_1/trained_small".format(BUCKET)

In [17]:
!gsutil -m rm -rf {OUTDIR} # start fresh each time
!gcloud ai-platform jobs submit training taxifare_$(date -u +%y%m%d_%H%M%S) \
    --package-path=taxifaremodel \
    --module-name=taxifaremodel.task \
    --job-dir=gs://{BUCKET}/taxifare_20191123_1 \
    --python-version=3.5 \
    --runtime-version={TFVERSION} \
    --region={REGION} \
    -- \
    --train_data_path=gs://{BUCKET}/taxifare_20191123_1/smallinput/taxi-train.csv \
    --eval_data_path=gs://{BUCKET}/taxifare_20191123_1/smallinput/taxi-valid.csv  \
    --train_steps=1000 \
    --output_dir={OUTDIR}

Removing gs://qwiklabs-gcp-ml-49b827b781ab/taxifare_20191123_1/trained_small/#1574497471346494...
Removing gs://qwiklabs-gcp-ml-49b827b781ab/taxifare_20191123_1/trained_small/checkpoint#1574497479116868...
Removing gs://qwiklabs-gcp-ml-49b827b781ab/taxifare_20191123_1/trained_small/events.out.tfevents.1574497471.cmle-training-10121353322291343068#1574497472170743...
Removing gs://qwiklabs-gcp-ml-49b827b781ab/taxifare_20191123_1/trained_small/graph.pbtxt#1574497473707675...
Removing gs://qwiklabs-gcp-ml-49b827b781ab/taxifare_20191123_1/trained_small/model.ckpt-0.data-00000-of-00002#1574497477292434...
Removing gs://qwiklabs-gcp-ml-49b827b781ab/taxifare_20191123_1/trained_small/model.ckpt-0.meta#1574497479940992...
Removing gs://qwiklabs-gcp-ml-49b827b781ab/taxifare_20191123_1/trained_small/model.ckpt-0.index#1574497477837261...
Removing gs://qwiklabs-gcp-ml-49b827b781ab/taxifare_20191123_1/trained_small/model.ckpt-0.data-00001-of-00002#1574497476905370...
/ [8/8 objects] 100% Done      

You can track your job and view logs using [cloud console](https://console.cloud.google.com/mlengine/jobs). It will take 5-10 minutes to complete. **Wait until the job finishes before moving on.**

## Deploy model

Now let's take our exported SavedModel and deploy it behind a REST API. To do so we'll use AI Platform Training Service's managed TF Serving feature which auto-scales based on load.

In [18]:
!gsutil ls gs://{BUCKET}/taxifare_20191123_1/trained_small/export/exporter

gs://qwiklabs-gcp-ml-49b827b781ab/taxifare_20191123_1/trained_small/export/exporter/
gs://qwiklabs-gcp-ml-49b827b781ab/taxifare_20191123_1/trained_small/export/exporter/1574498631/


AI Platform Training Service uses a model versioning system. First you create a model folder, and within the folder you create versions of the model. 

Note: You will see an error below if the model folder already exists; it is safe to ignore.

In [19]:
VERSION='v1'
!gcloud ai-platform models create taxifare_20191123_1 --regions us-central1
!gcloud ai-platform versions delete {VERSION} --model taxifare_20191123_1 --quiet
!gcloud ai-platform versions create {VERSION} --model taxifare_20191123_1 \
    --origin $(gsutil ls gs://{BUCKET}/taxifare_20191123_1/trained_small/export/exporter | tail -1) \
    --python-version=3.5 \
    --runtime-version {TFVERSION}

Created ml engine model [projects/qwiklabs-gcp-ml-49b827b781ab/models/taxifare_20191123_1].
ERROR: (gcloud.ai-platform.versions.delete) NOT_FOUND: Field: name Error: The specified model version was not found.
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: The specified model version was not found.
    field: name
Creating version (this might take a few minutes)......done.                    
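The `--origin` argument above relies on `gsutil ls ... | tail -1` returning the most recent export, which works because each export directory is named with a timestamp. A more explicit way to pick the latest export could look like the sketch below. The helper `latest_export`, the bucket name and the second timestamp are all hypothetical.

```python
def latest_export(paths):
    """Return the export directory with the largest numeric timestamp."""
    # Keep only entries whose last path component is all digits
    # (this skips the parent ".../exporter/" directory itself).
    candidates = [p for p in paths if p.rstrip("/").split("/")[-1].isdigit()]
    return max(candidates, key=lambda p: int(p.rstrip("/").split("/")[-1]))

# Hypothetical listing; the bucket name and the second timestamp are made up.
listing = [
    "gs://my-example-bucket/trained_small/export/exporter/",
    "gs://my-example-bucket/trained_small/export/exporter/1574498631/",
    "gs://my-example-bucket/trained_small/export/exporter/1574499999/",
]
print(latest_export(listing))
```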


## Online prediction

Now that we have deployed our model behind a production grade REST API, we can invoke it remotely. 

We could invoke it directly by calling the REST API with an HTTP POST request (see the [reference docs](https://cloud.google.com/ml-engine/reference/rest/v1/projects/predict)); however, AI Platform Training Service provides an easy way to invoke it via the command line.

### Invoke prediction REST API via command line
First we write our prediction requests to a file in JSON format.

In [20]:
%%writefile ./test.json
{"dayofweek": 1, "hourofday": 0, "pickuplon": -73.885262, "pickuplat": 40.773008, "dropofflon": -73.987232, "dropofflat": 40.732403}

Writing ./test.json
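The file above holds a single instance. To send several instances in one request, the `--json-instances` file uses one JSON object per line. The sketch below writes such a file from Python. The second ride and the file name `test_multi.json` are made up for illustration, and the file is written to a temporary directory.

```python
import json
import os
import tempfile

# Hypothetical rides; field names match the serving input function above.
instances = [
    {"dayofweek": 1, "hourofday": 0, "pickuplon": -73.885262, "pickuplat": 40.773008,
     "dropofflon": -73.987232, "dropofflat": 40.732403},
    {"dayofweek": 5, "hourofday": 17, "pickuplon": -73.98, "pickuplat": 40.75,
     "dropofflon": -73.95, "dropofflat": 40.78},
]

# One JSON object per line (newline-delimited JSON).
body = "\n".join(json.dumps(inst) for inst in instances) + "\n"
out_path = os.path.join(tempfile.mkdtemp(), "test_multi.json")
with open(out_path, "w") as f:
    f.write(body)
print(body)
```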


Then we use [`gcloud ai-platform predict`](https://cloud.google.com/sdk/gcloud/reference/ml-engine/predict) and specify the model name and location of the json file. Since we don't explicitly specify `--version`, the default model version will be used. 

Since we only have one version it is already the default, but if we had multiple model versions we could designate the default using [`gcloud ai-platform versions set-default`](https://cloud.google.com/sdk/gcloud/reference/ml-engine/versions/set-default) or using [cloud console](https://console.cloud.google.com/mlengine/models).

In [23]:
!gcloud ai-platform predict --model=taxifare --json-instances=./test.json

PREDICTIONS
[7.731191635131836]


In [24]:
!gcloud ai-platform predict --model=taxifare_20191123_1 --json-instances=./test.json

PREDICTIONS
[8.367416381835938]


### Invoke prediction REST API via python

#### **Exercise 3**

In the cell below, use the Google Python client library to query the model you just deployed on AI Platform Training Service. Find the estimated taxi fare for a ride with the following properties
- ride occurs on Monday
- at 8:00 am
- pick up at (40.773, -73.885)
- drop off at (40.732, -73.987)

Have a look at this post and examples on ["Using the Python Client Library"](https://cloud.google.com/ml-engine/docs/tensorflow/python-client-library) and ["Getting Online Predictions"](https://cloud.google.com/ml-engine/docs/tensorflow/online-predict) from Google Cloud.

In [30]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

VERSION='v1'

credentials = GoogleCredentials.get_application_default()
api = discovery.build("ml", "v1", credentials = credentials,
            discoveryServiceUrl = "https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json")



request_data = {"instances":
    [
        {
         "dayofweek": 1, 
         "hourofday": 0, 
         "pickuplon": -73.885262, 
         "pickuplat": 40.773008, 
         "dropofflon": -73.987232, 
         "dropofflat": 40.732403
        }
    ]
}

#parent = "projects/{}/models/taxifare_20191123_1".format(PROJECT) # use default version
parent = "projects/{}/models/taxifare_20191123_1/versions/{}".format(PROJECT,VERSION) # specify a specific version

response = api.projects().predict(body = request_data, name = parent).execute()
print("response = {0}".format(response))

response = {'predictions': [{'predictions': [8.367416381835938]}]}
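The response is a nested structure: a list with one entry per instance, each holding a one-element `predictions` list. A small sketch of pulling out the fare values, using the response shape printed above:

```python
# Response shape copied from the output above.
response = {"predictions": [{"predictions": [8.367416381835938]}]}

# One entry per instance; each entry holds a one-element list with the fare.
fares = [p["predictions"][0] for p in response["predictions"]]
for fare in fares:
    print("Estimated fare: ${:.2f}".format(fare))
```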


## Challenge exercise

Modify your solution to the challenge exercise in e_traineval.ipynb appropriately. Make sure that you implement training and deployment. Increase the size of your dataset by 10x since you are running on the cloud. Does your accuracy improve?

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License