# Classify Images via [Transfer Learning](https://en.wikipedia.org/wiki/Transfer_learning) 

<font color="red"><b>This is NOT an Official Google Product and is only for education!!!</b></font>
<br><br>
[Google Cloud Vision API](https://cloud.google.com/vision/) is a popular service that allows users to classify images into categories, appropriate for multiple common use cases across several industries. For those users whose category requirements map to the pre-built, pre-trained machine-learning model reflected in the API, this approach is ideal. However, other users have more specialized requirements — for example, to identify specific products and soft goods in mobile-phone photos, or to detect nuanced differences between particular animal species in wildlife photography. For them, it can be more efficient to train and serve a new image model using [Google Cloud Machine Learning](https://cloud.google.com/products/machine-learning/) (Cloud ML), the managed service for building and running machine-learning models at scale using the open source [TensorFlow](https://www.tensorflow.org/) deep-learning framework.

The goal of this lab is to build a simple TensorFlow model via [transfer learning](https://en.wikipedia.org/wiki/Transfer_learning) in Cloud ML that identifies the flower type (daisy, dandelion, roses, sunflowers, or tulips) from a small set of labeled flower images. This dataset was selected for ease of explanation only; we have successfully used the same implementation on several proprietary datasets, covering cases such as interior-design classification (e.g., carpet vs. hardwood floor) and animated-character classification. The code can easily be adapted to run on different datasets.

### Set Up Environment
Let's set up our environment.

In [None]:
import os
import time
PROJECT = os.popen("gcloud config list --format 'value(core.project)' 2>/dev/null").read().rstrip('\n')
AUTH_NAME = os.popen("gcloud auth list --filter=status:ACTIVE --format='value(account)'").read().split('@')[0].rstrip('\n')
REGION = 'us-central1'
TIME = str(int(time.time()))
GCS_BUCKET = 'gs://ml-workshop-transfer-learning-' + TIME + '-' + AUTH_NAME
GCS_PATH = GCS_BUCKET + "/" + TIME
JOBNAME = 'ml_workshop_' + TIME

# Set Env Variables
os.environ['GCS_BUCKET'] = GCS_BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION
os.environ['GCS_PATH'] = GCS_PATH
os.environ['JOBNAME'] = JOBNAME

### Prepare data
gs://cloud-ml-data is a public storage bucket on [Google Cloud Storage](https://cloud.google.com/storage/) that hosts the training and test data for this transfer learning exercise. We will leave the actual flower images in gs://cloud-ml-data and copy only the labeled CSV files used for training and evaluation. The code block below creates a bucket and copies the labeled CSV files and the label dictionary from gs://cloud-ml-data to our bucket.

In [None]:
!gsutil mb $GCS_BUCKET
!gsutil cp -r gs://cloud-ml-data/img/flower_photos/train_set.csv $GCS_BUCKET/img/flower_photos/
!gsutil cp -r gs://cloud-ml-data/img/flower_photos/eval_set.csv $GCS_BUCKET/img/flower_photos/
!gsutil cp -r gs://cloud-ml-data/img/flower_photos/dict.txt $GCS_BUCKET/img/flower_photos/

### Understand our Data
Both train_set.csv and eval_set.csv hold labeled data in the following format: one image URI and its label per line. We will use this labeled dataset to teach our model what a sunflower, rose, tulip, daisy, and dandelion look like.
<pre>
gs://cloud-ml-data/img/flower_photos/dandelion/17388674711_6dca8a2e8b_n.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/sunflowers/9555824387_32b151e9b0_m.jpg,sunflowers
gs://cloud-ml-data/img/flower_photos/daisy/14523675369_97c31d0b5b.jpg,daisy
gs://cloud-ml-data/img/flower_photos/roses/512578026_f6e6f2ad26.jpg,roses
gs://cloud-ml-data/img/flower_photos/tulips/497305666_b5d4348826_n.jpg,tulips...
</pre>
We also need a text file containing all the labels (dict.txt), which is used to map labels sequentially to the IDs used internally. In this case, daisy becomes ID 0 and tulips becomes ID 4.
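
The label-to-ID mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the lab's own code: the order of the three middle labels in `DICT_TXT` is an assumption (the text only fixes daisy as 0 and tulips as 4), and `build_label_index` and `parse_rows` are hypothetical helper names.

```python
import csv
import io

# Assumed contents of dict.txt: one label per line. Only the first and last
# positions are stated in the text; check your own copy for the rest.
DICT_TXT = "daisy\ndandelion\nroses\nsunflowers\ntulips\n"

# Two rows in the "image_uri,label" format shown above.
CSV_ROWS = (
    "gs://cloud-ml-data/img/flower_photos/daisy/14523675369_97c31d0b5b.jpg,daisy\n"
    "gs://cloud-ml-data/img/flower_photos/tulips/497305666_b5d4348826_n.jpg,tulips\n"
)


def build_label_index(dict_text):
    """Map each label to its zero-based line position in the dictionary file."""
    labels = [line.strip() for line in dict_text.splitlines() if line.strip()]
    return {label: index for index, label in enumerate(labels)}


def parse_rows(csv_text, label_index):
    """Return (image_uri, label_id) pairs for every row of the labeled CSV."""
    reader = csv.reader(io.StringIO(csv_text))
    return [(uri, label_index[label]) for uri, label in reader]


label_index = build_label_index(DICT_TXT)
examples = parse_rows(CSV_ROWS, label_index)
for uri, label_id in examples:
    print(label_id, uri)
```

Because the ID is simply the line position, reordering dict.txt changes every ID, so the same file should be used for preprocessing and for interpreting predictions.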

In [None]:
!gsutil cat -r 0-85  $GCS_BUCKET/img/flower_photos/eval_set.csv
!gsutil cat $GCS_BUCKET/img/flower_photos/dict.txt

### Training Data vs. Test Data
We have randomly split the data into two files, train_set.csv and eval_set.csv, with 90% of the data for training and 10% for evaluation, respectively. Read more on training vs. test sets [here](https://en.wikipedia.org/wiki/Training,_test,_and_validation_sets).
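
For readers adapting this lab to their own data, a 90/10 random split like the one described above might look as follows. This is only a sketch: the lab's CSV files were already split for you, and `split_rows`, the seed, and the placeholder URIs are assumptions made for illustration.

```python
import random


def split_rows(rows, eval_fraction=0.1, seed=0):
    """Shuffle rows with a fixed seed and split them into (train, eval) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    eval_count = int(round(len(shuffled) * eval_fraction))
    return shuffled[eval_count:], shuffled[:eval_count]


# Hypothetical rows in the "image_uri,label" format; the URIs are placeholders.
rows = ["gs://example-bucket/img_%03d.jpg,daisy" % i for i in range(200)]
train_rows, eval_rows = split_rows(rows)
print(len(train_rows), len(eval_rows))
```

Fixing the seed keeps the split reproducible, so an image never moves between the training and evaluation sets from one run to the next.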

In [None]:
!gsutil cat $GCS_BUCKET/img/flower_photos/train_set.csv | wc -l
!gsutil cat $GCS_BUCKET/img/flower_photos/eval_set.csv | wc -l

### Preprocess Data
We start with a set of labeled images in a Google Cloud Storage bucket and preprocess them to extract the image features from the bottleneck layer (typically the penultimate layer) of the Inception network. Although processing images in this manner can be reasonably expensive, each image can be processed independently and in parallel, making this task a great candidate for [Cloud Dataflow](https://cloud.google.com/dataflow/).

We process each image to produce its feature representation (also known as an embedding) in the form of a k-dimensional vector of floats (in our case, 2,048 dimensions). The preprocessing includes converting the image format, resizing the image, and running the converted image through a pre-trained model to get the embedding. The final output is written to the directory specified by --output_path.

To measure the benefit of parallelizing preprocessing on Google Cloud, we ran this preprocessing on 1 million sample images from the Open Images Dataset. Preprocessing 1 million images takes several days locally, but less than 2 hours on the cloud with 100 workers of four cores each.
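
The reason this step parallelizes so well is that no image depends on any other. The sketch below illustrates that shape on a single machine with a thread pool; it is not Cloud Dataflow and does not run Inception. `fake_embedding` is a hypothetical stand-in that derives a short vector from the URI, and the URIs themselves are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

EMBEDDING_SIZE = 8  # the real bottleneck vector has 2,048 floats


def fake_embedding(image_uri):
    """Hypothetical stand-in for 'decode, resize, run through Inception'."""
    digest = hashlib.sha256(image_uri.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:EMBEDDING_SIZE]]


def preprocess_all(image_uris, max_workers=4):
    """Embed every image independently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_embedding, image_uris))


uris = ["gs://example-bucket/flower_%d.jpg" % i for i in range(10)]
embeddings = preprocess_all(uris)
print(len(embeddings), len(embeddings[0]))
```

Since each call touches only its own input, the same per-image function can be handed to many workers without any coordination between them, which is what a Dataflow pipeline exploits at much larger scale.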

### Machine Learning Pipeline 
<p align='left'>We are setting up the following pipeline:</p>

<img src='./images_for_markdown/pipeline.png' width=1000 align=left></img>

<br>
### How to run Preprocessing Job 
Each (uri, label_ids, embedding) tuple is converted to one tensorflow.Example; many tensorflow.Example records make up one TFRecord file.

  The output proto contains 'label', 'image_uri' and 'embedding'.
  The 'embedding' is calculated by feeding the image into the input layer of the image
  neural network and reading the output of the bottleneck layer of the network.

  The commands below show how you would run this at scale for all the flower images we have.
  <pre>
  !python trainer/preprocess.py \
  --input_dict $GCS_BUCKET/img/flower_photos/dict.txt \
  --input_path $GCS_BUCKET/img/flower_photos/eval_set.csv \
  --output_path $GCS_PATH/preproc/eval \
  --num_workers 10 \
  --cloud
  </pre>
  
  <pre>
  !python trainer/preprocess.py \
  --input_dict $GCS_BUCKET/img/flower_photos/dict.txt \
  --input_path $GCS_BUCKET/img/flower_photos/train_set.csv \
  --output_path $GCS_PATH/preproc/train \
  --num_workers 10 \
  --cloud
  </pre>
  
  To save cost and time, we won't run the full job today. Instead, we will run a much smaller job just to get a feel for it. Run the following job and check its status [here](https://console.cloud.google.com/dataflow?).

In [None]:
!gsutil cp dataflow_slim.csv $GCS_PATH/slim_job/input/dataflow_slim.csv

In [None]:
!python trainer/preprocess_fast.py \
  --input_dict $GCS_BUCKET/img/flower_photos/dict.txt \
  --input_path $GCS_PATH/slim_job/input/dataflow_slim.csv \
  --output_path $GCS_PATH/slim_job/output \
  --cloud

### Preprocessed Images: Copy the TFRecords Below (Avoids 50 Min. of Processing Time)

In [None]:
!gsutil -m cp gs://lytx-experiment/1512512700/preproc/* $GCS_PATH/preproc/

### Training
Once we've preprocessed the data, we can train a simple classifier. The network comprises a single fully connected layer with ReLU activations, with one output for each label in the dictionary, replacing the original output layer. The final output is computed using the softmax function.
<br><br>
<img src='./images_for_markdown/incept_v3.png' align='center' width=600>
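
To make the classifier head concrete, here is a minimal pure-Python sketch of one forward pass from an embedding to per-label probabilities. It reflects one plausible reading of the description above (a fully connected layer with ReLU followed by one output per label and a softmax) and is not the trainer's actual code. The layer sizes, the random weights, and the label order are assumptions chosen so the example stays small.

```python
import math
import random

LABELS = ["daisy", "dandelion", "roses", "sunflowers", "tulips"]  # assumed order
EMBEDDING_SIZE = 16  # 2,048 in the lab; shrunk so the sketch stays readable
HIDDEN_SIZE = 8      # hypothetical width for the fully connected layer


def dense(inputs, weights, biases):
    """Compute weights @ inputs + biases for a single example."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Clamp negative activations to zero."""
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols):
    """Untrained placeholder weights; a real model learns these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
hidden_w = random_matrix(rng, HIDDEN_SIZE, EMBEDDING_SIZE)
output_w = random_matrix(rng, len(LABELS), HIDDEN_SIZE)
embedding = [rng.random() for _ in range(EMBEDDING_SIZE)]

hidden = relu(dense(embedding, hidden_w, [0.0] * HIDDEN_SIZE))
probabilities = softmax(dense(hidden, output_w, [0.0] * len(LABELS)))
print(dict(zip(LABELS, probabilities)))
```

Only these small layers are trained; the Inception layers that produced the embedding stay frozen, which is why training on precomputed embeddings is fast.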

### Launch a Training Job
Below we launch a remote distributed training job on Cloud Machine Learning Engine (Cloud ML). To learn more about how to run the job, see this [link](https://cloud.google.com/ml-engine/docs/training-overview). You can check your job status [here](https://console.cloud.google.com/mlengine/jobs).

In [None]:
%%bash
gcloud ml-engine jobs submit training $JOBNAME \
  --stream-logs \
  --module-name=trainer.task \
  --package-path=./trainer \
  --staging-bucket=$GCS_BUCKET \
  --region=us-central1 \
  --runtime-version=1.0 \
  --scale-tier=BASIC \
  -- \
  --output_path=$GCS_PATH/$JOBNAME/output \
  --eval_data_paths=$GCS_PATH/preproc/eval* \
  --train_data_paths=$GCS_PATH/preproc/train*

### TensorBoard: View Our Training Progress

In [None]:
from google.datalab.ml import TensorBoard

# Point TensorBoard at the training job's output directory.
TENSORBOARD_PATH = GCS_PATH + "/" + JOBNAME + "/output"
print(TENSORBOARD_PATH)
TensorBoard().start(TENSORBOARD_PATH)

### Run an Inference Call

In [None]:
%%bash
MODEL_NAME="ml_workshop"
MODEL_VERSION="v2"

#ADD YOUR MODEL PATH Below. EXPLORE CLOUD STORAGE BUCKET TO SEE WHERE THE MODEL IS CREATED. HINT $GCS_PATH/$JOBNAME/output/model 
MODEL_LOCATION='gs://ml-workshop-transfer-learning-1522282044-155051428461-compute/1522282044/ml_workshop_1522282044/output/model' 
#ADD YOUR MODEL PATH ABOVE. EXPLORE CLOUD STORAGE BUCKET FOR THE SAME 

gcloud ml-engine models create ${MODEL_NAME} --regions us-central1
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} 

In [None]:
%%bash
# Each command writes one JSON instance per line. print() and .decode() work on Python 2 and 3,
# and plain > / >> keep error messages out of request.json.
python -c 'import base64, sys, json; img = base64.b64encode(open(sys.argv[1], "rb").read()).decode("ascii"); print(json.dumps({"key":"0", "image_bytes": {"b64": img}}))' ./test_images/daisy.jpg > request.json
python -c 'import base64, sys, json; img = base64.b64encode(open(sys.argv[1], "rb").read()).decode("ascii"); print(json.dumps({"key":"1", "image_bytes": {"b64": img}}))' ./test_images/rose.jpg >> request.json
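
The request file built above holds one JSON object per line, each with a key and the base64-encoded image bytes. The same structure can be produced by a small Python 3 function, sketched here with placeholder bytes instead of real JPEG files; `make_instance` and `to_json_lines` are hypothetical helper names.

```python
import base64
import json


def make_instance(key, image_bytes):
    """Build one prediction instance with the image encoded as base64 text."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"key": key, "image_bytes": {"b64": encoded}}


def to_json_lines(instances):
    """Serialize instances as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(instance) for instance in instances) + "\n"


# Placeholder bytes stand in for the contents of real JPEG files.
fake_images = [b"\xff\xd8fake-daisy", b"\xff\xd8fake-rose"]
instances = [make_instance(str(i), data) for i, data in enumerate(fake_images)]
request_body = to_json_lines(instances)
print(request_body)
```

The key is echoed back with each prediction, which makes it possible to match results to inputs when several images are sent in one request.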

In [None]:
%%bash
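# Each %%bash cell runs in its own shell, so MODEL_NAME must be set again here.
MODEL_NAME="ml_workshop"
gcloud ml-engine predict --model $MODEL_NAME --json-instances request.json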

In [None]:
!gsutil cat $GCS_BUCKET/img/flower_photos/dict.txt
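
The dictionary printed above is what turns a predicted class index back into a flower name. As a rough sketch, assuming the prediction response carries a list of per-class scores in dict.txt order (the exact response fields are not shown in this lab, so treat the shape as an assumption), the lookup could be written as follows. The scores and the label order here are made up for illustration.

```python
LABELS = ["daisy", "dandelion", "roses", "sunflowers", "tulips"]  # assumed order


def top_label(scores, labels):
    """Return the (label, score) pair with the highest score."""
    if not scores:
        raise ValueError("scores must not be empty")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best_index], scores[best_index]


# Hypothetical scores for one image, not real model output.
example_scores = [0.91, 0.02, 0.03, 0.01, 0.03]
label, score = top_label(example_scores, LABELS)
print(label, score)
```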