# Training a model in the cloud

This notebook contains commands and instructions to train models on Google's Cloud ML.

Table of contents:

- [1 Project setup](#1-Project-setup)
- [2 Training a model](#2-Training-a-model)
- [3 Publishing a model](#3-Publishing-a-model)
- [A References](#A-References)

# 1 Project setup

## 1.1 Creating a project

For ["Tensorflow Basics" Workshop (AMLD)](https://www.appliedmldays.org/workshop_sessions/tensorflow-basics.1) participants:

1. You should have a properly configured project registered to your account.
2. Navigate to https://console.cloud.google.com/cloud-resource-manager and select the project `amld-tf-?` by clicking on it (this keeps the project selected when you follow the other links further down).

For everybody else:

1. Open https://console.cloud.google.com/cloud-resource-manager, click the "+ CREATE PROJECT" button, and follow the instructions to create a new project.
2. [Enable Billing](https://cloud.google.com/billing/docs/how-to/modify-project#enable_billing_for_a_project) for the new project. If you register a payment method for the first time, you will get a free credit of 300 USD. Note that a small amount (~$2) might be charged to your registered credit card, but this transaction will immediately be reversed.
3. [Enable Cloud ML API](https://console.cloud.google.com/apis/library) for the new project: search for "ML", click the API card, and then click the "enable" button.


## 1.2 Setting up variables

Open https://console.cloud.google.com/cloudshell. All of the following commands
(in `codefont`) need to be entered in Cloud Shell. Note that you need to
redefine the variables if you open a new tab.

Set an environment variable to match your project ID (the "project=" parameter in the URL from the step above):

`PROJECT_ID=<YOUR PROJECT ID>`

Define a couple more variables and set default project:

```shell
MODELS_BUCKET="gs://${PROJECT_ID}-models"
LOCATION=europe-west1
gcloud config set project ${PROJECT_ID}
```

Don't forget to enter these commands in every Cloud Shell instance (if you are using multiple tabs)!
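Because the variables are lost whenever a new tab is opened, a small check can save some debugging time. The following is a minimal sketch, not part of the workshop repository: `check_cloud_vars` is a hypothetical helper name, and it only tests the three variables defined in this section.

```shell
# Hypothetical helper (not part of the workshop repository): reports which of
# the variables from this section are missing in the current shell.
check_cloud_vars() {
  missing=0
  for name in PROJECT_ID MODELS_BUCKET LOCATION; do
    # Look up the variable whose name is stored in $name (POSIX-compatible).
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "Missing variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

Running `check_cloud_vars` in a fresh tab prints one line per missing variable and returns a non-zero status, so it can be chained in front of other commands with `&&`.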

## 1.3 Initializing the project

Create the storage bucket where our models will be stored (view them with https://console.cloud.google.com/storage/browser):

`gsutil mb -l ${LOCATION} ${MODELS_BUCKET}`

Download this repository and navigate to the `cloud` directory:

`git clone https://github.com/tensorflow/workshops.git && cd workshops/extras/amld/cloud`


# 2 Training a model

To train a new model, you need to issue one long command that references
a Python module (in this case `./quickdraw_rnn/task.py`) containing a
`tf.contrib.learn.Experiment`
([source](https://github.com/tensorflow/workshops/tree/master/extras/amld/cloud/quickdraw_rnn)).
The Cloud ML infrastructure will then take care of running parameter servers,
masters and workers for distributed training of the model.

## 2.1 Start a training job

The following command starts a job on Cloud ML and sets a couple of
[configuration parameters](https://cloud.google.com/ml-engine/docs/training-overview#job_configuration_parameters):

```shell
DATASET='zoo_img'
MODEL='quickdraw_cnn'
INFO='2k'
JOB_NAME="${MODEL}_${DATASET%_*}_${INFO}_$(date +%Y%m%d_%H%M%S)"
JOB_DIR="${MODELS_BUCKET}/${JOB_NAME}"

DATA="gs://amld-tf-data/${DATASET}"
gcloud ml-engine jobs submit training "${JOB_NAME}" \
    --package-path "${MODEL}" \
    --module-name "${MODEL}.task" \
    --staging-bucket "${MODELS_BUCKET}" \
    --job-dir "${JOB_DIR}" \
    --runtime-version 1.4 \
    --region ${LOCATION} \
    --config config/config.yaml \
    -- \
    --data_dir "$DATA" \
    --output_dir "$JOB_DIR" \
    --train_steps 2000
```
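The job name is assembled with shell parameter expansion, which can be hard to read at first glance. The sketch below shows how the pieces combine; the fixed `TIMESTAMP` value is an assumption that stands in for the output of `date +%Y%m%d_%H%M%S`, so that the result is reproducible.

```shell
DATASET='zoo_img'
MODEL='quickdraw_cnn'
INFO='2k'

# ${DATASET%_*} removes the shortest suffix matching "_*",
# so "zoo_img" becomes "zoo".
echo "${DATASET%_*}"

# Assumption: a fixed timestamp replaces $(date +%Y%m%d_%H%M%S) for illustration.
TIMESTAMP='20180101_120000'
JOB_NAME="${MODEL}_${DATASET%_*}_${INFO}_${TIMESTAMP}"
echo "${JOB_NAME}"
# prints:
#   zoo
#   quickdraw_cnn_zoo_2k_20180101_120000
```

Since the timestamp changes every second, each submission gets a distinct job name and therefore its own `JOB_DIR` inside the models bucket.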







## 2.2 Monitor training jobs

You can see your current training jobs with

`gcloud ml-engine jobs list`

or by visiting https://console.cloud.google.com/mlengine/jobs.
It usually takes a couple of minutes to set up the jobs on Cloud
ML before they appear in the web UI.
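Instead of re-running the list command by hand, the state of a single job can be polled. The following is a rough sketch under a few assumptions: `is_terminal_state` and `wait_for_job` are hypothetical helper names, the 60-second interval is arbitrary, and the job state is read with `gcloud ml-engine jobs describe` and the `value(state)` format.

```shell
# Returns success (0) if the given job state means the job has stopped.
is_terminal_state() {
  case "$1" in
    SUCCEEDED|FAILED|CANCELLED) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical polling loop: prints the state of job "$1" once a minute until
# it reaches a terminal state. Requires gcloud and a configured project.
wait_for_job() {
  while true; do
    state=$(gcloud ml-engine jobs describe "$1" --format 'value(state)') || return 1
    echo "Job $1 is ${state}"
    if is_terminal_state "${state}"; then
      break
    fi
    sleep 60
  done
}
```

With the variables from section 2.1 still defined, this would be called as `wait_for_job "${JOB_NAME}"`.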

You can also visualize the training and evaluation statistics of your
models using TensorBoard. If TensorBoard doesn't update your training
stats, you might have to restart it:

`tensorboard --port 8080 --logdir "${MODELS_BUCKET}"`

You can then open TensorBoard in your browser by clicking on the
top right browser icon in the header of Cloud Shell and selecting
"Preview on port 8080".

# 3 Publishing a model

The trained model will also be "exported" into the `export/Servo`
directory. This exported model (in a subdirectory named after the
seconds since epoch when the model was exported) will contain all
necessary information, including the graph, the variables, and the
"signature" that defines the input and output tensor shapes.
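Because the export subdirectories are named after epoch seconds, the most recent export is the one that sorts last. The sketch below illustrates this on a made-up listing: `pick_latest_export` is a hypothetical helper, the `gs://example-bucket/...` paths are placeholders, and the plain `sort` assumes that all timestamps have the same number of digits.

```shell
# Hypothetical helper: reads a directory listing on stdin and prints the
# entry with the largest timestamp. Entries that do not end in a numeric
# path component (such as the parent directory itself) are ignored.
pick_latest_export() {
  grep -E '/[0-9]+/?$' | sort | tail -n 1
}

# Placeholder listing, shaped like the output of a bucket listing command.
LISTING='gs://example-bucket/job/export/Servo/
gs://example-bucket/job/export/Servo/1516700000/
gs://example-bucket/job/export/Servo/1516800000/'

printf '%s\n' "${LISTING}" | pick_latest_export
# prints: gs://example-bucket/job/export/Servo/1516800000/
```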

"Publishing" a model basically means copying one of these exported
models for prediction and giving it a label.

You can do this either via the
[web interface (ML Engine/Models)](https://console.cloud.google.com/mlengine/models)
or by issuing the following Cloud Shell commands:

```shell
MODEL_NAME="${MODEL}_${DATASET%_*}"
VERSION_NAME=v1

gcloud ml-engine models create --regions "${LOCATION}" "${MODEL_NAME}"
ORIGIN=$(gsutil ls "${JOB_DIR}/export/Servo" | tail -1)
gcloud ml-engine versions create \
    --origin "${ORIGIN}" \
    --model "${MODEL_NAME}" \
    "${VERSION_NAME}"
gcloud ml-engine versions set-default --model "${MODEL_NAME}" "${VERSION_NAME}"
```







You can then request online predictions from the published model; see https://cloud.google.com/ml-engine/docs/online-predict.

# A References

- https://cloud.google.com/ml-engine/docs/distributed-tensorflow-mnist-cloud-datalab: Describes the same approach as this notebook.
- https://cloud.google.com/solutions/running-distributed-tensorflow-on-compute-engine: Describes how to run distributed TensorFlow in a virtual machine.
- https://www.tensorflow.org/deploy/distributed: Explains how TensorFlow distributes training across multiple machines.