### About this notebook
Besides Cloud ML's built-in preprocessing features, sometimes you want to preprocess your data with your own pipeline. This sample includes a custom preprocessing pipeline that uses Dataflow and a pretrained image model to preprocess JPEG images. It extracts features and turns them into a format that is accepted by the Cloud ML service.

Since Dataflow currently has a limitation that dependent Python files are not copied to workers, we have to invoke the Python file with a bash command. You can view the code by running the following cell.

In [None]:
%load preprocess.py

This image identification sample uses custom preprocessing. It calls a Dataflow pipeline to extract features from JPEG images and write them in a data format that is accepted by Cloud ML.

There are a few libraries that the preprocessor depends on.

In [None]:
%%bash
apt-get install -y libjpeg-dev python-imaging
pip install Pillow
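
Before launching the pipeline, it can help to confirm that the Python dependencies are importable in the notebook environment. The following is a minimal sketch using only the standard library. It assumes a Python 3 kernel and that Pillow is importable as `PIL`; the helper name `missing_modules` is hypothetical and not part of this sample.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: Pillow is the only Python package checked here,
# and it is imported under the name `PIL`.
print(missing_modules(["PIL"]))
```

An empty list suggests the import should succeed; any name that is printed still needs to be installed.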

### Copy Source Data
Create a Cloud Storage bucket to hold the training data.

In [25]:
# Some code to determine a unique bucket name for the purposes of the sample
from gcp.context import Context

CLOUD_PROJECT = Context.default().project_id
ml_bucket_name = CLOUD_PROJECT + '-mldata'
ml_bucket_path = 'gs://' + ml_bucket_name

INPUT_DIR = ml_bucket_path + '/sampledata/ml/image/input/'
OUTPUT_DIR = ml_bucket_path + '/sampledata/ml/image/output/'
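
Note that `INPUT_DIR` and `OUTPUT_DIR` end with a trailing slash, so appending a further segment such as `/staging` yields a path containing `//`. As a sketch only, a small helper can join path segments without duplicating separators. The name `gcs_join` is hypothetical (it is not part of the `gcp` library), and the bucket name in the example is a placeholder.

```python
def gcs_join(base, *parts):
    """Join Cloud Storage path segments with single '/' separators."""
    # Strip slashes at the seams so 'a/' + '/b' becomes 'a/b'.
    # Only trailing slashes are removed from `base`, so 'gs://' stays intact.
    segments = [base.rstrip('/')]
    segments += [p.strip('/') for p in parts if p.strip('/')]
    return '/'.join(segments)


# Placeholder bucket path, for illustration only.
example_output = 'gs://my-project-mldata/sampledata/ml/image/output/'
print(gcs_join(example_output, 'staging'))
# prints gs://my-project-mldata/sampledata/ml/image/output/staging
```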

In [26]:
%%storage create --bucket $ml_bucket_path

Copy image source data to your project.

In [27]:
%%bash -s "$INPUT_DIR"
gsutil -m -q cp -r gs://cloud-datalab/sampledata/ml/image/* $1

### Preprocessing

Run the preprocessing pipeline with the following cell. While it is running, you can go to the Developer Console and watch the Dataflow job progress.
Please ignore the error output "This account is not whitelisted to run Python-based pipelines...". This message is shown even if the job completes successfully.

In [30]:
%%bash -s "$INPUT_DIR" "$OUTPUT_DIR" "$CLOUD_PROJECT"
python preprocess.py \
  --input_data_location $1 \
  --output $2 \
  --job_name cloud-ml-sample-image-classification \
  --project $3 \
  --staging_location "$2/staging" \
  --temp_location "$2/temp" \
  --runner BlockingDataflowPipelineRunner \
  --num_workers 10

ERROR:root:
*************************************************************
This account is not whitelisted to run Python-based pipelines using the Google Cloud Dataflow service. Make sure that your project is whitelisted before submitting your job. 
Please see documentation for getting more information on getting your project whitelisted.
*************************************************************
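
If you prefer to assemble the same invocation in Python (for example, to vary the number of workers), the argument list can be built as shown here. This is only a sketch: the helper name `build_preprocess_args` is hypothetical, the flags simply mirror the bash cell above, the bucket and project names are placeholders, and the command is printed rather than executed.

```python
def build_preprocess_args(input_dir, output_dir, project, num_workers=10):
    """Build the argument list for preprocess.py, mirroring the bash cell."""
    # Unlike the bash cell, strip the trailing slash before appending
    # subdirectories so the staging and temp paths do not contain '//'.
    base = output_dir.rstrip('/')
    return [
        'python', 'preprocess.py',
        '--input_data_location', input_dir,
        '--output', output_dir,
        '--job_name', 'cloud-ml-sample-image-classification',
        '--project', project,
        '--staging_location', base + '/staging',
        '--temp_location', base + '/temp',
        '--runner', 'BlockingDataflowPipelineRunner',
        '--num_workers', str(num_workers),
    ]


# Placeholder values, for illustration only.
args = build_preprocess_args(
    'gs://example-bucket/input/',
    'gs://example-bucket/output/',
    'example-project',
)
print(' '.join(args))
```

The resulting list could then be passed to `subprocess.check_call(args)` to run the script from Python.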



Now let's browse the extracted features.

In [24]:
%%bash -s "$OUTPUT_DIR"
gsutil list -r "$1data"

gs://cloud-ml-users-mldata/sampledata/ml/image/output/data/:
gs://cloud-ml-users-mldata/sampledata/ml/image/output/data/
gs://cloud-ml-users-mldata/sampledata/ml/image/output/data/data.test.json-00000-of-00010
gs://cloud-ml-users-mldata/sampledata/ml/image/output/data/data.test.json-00001-of-00010
gs://cloud-ml-users-mldata/sampledata/ml/image/output/data/data.test.json-00002-of-00010
gs://cloud-ml-users-mldata/sampledata/ml/image/output/data/data.test.json-00003-of-00010
gs://cloud-ml-users-mldata/sampledata/ml/image/output/data/data.test.json-00004-of-00010
gs://cloud-ml-users-mldata/sampledata/ml/image/output/data/data.test.json-00005-of-00010
gs://cloud-ml-users-mldata/sampledata/ml/image/output/data/data.test.json-00006-of-00010
gs://cloud-ml-users-mldata/sampledata/ml/image/output/data/data.test.json-00007-of-00010
gs://cloud-ml-users-mldata/sampledata/ml/image/output/data/data.test.json-00008-of-00010
gs://cloud-ml-users-mldata/sampledata/ml/image/output/data/data.test.json-0000