<h1> Feature Engineering </h1>

In this notebook, you will learn how to incorporate feature engineering into your pipeline by:
<ul>
<li> Working with feature columns </li>
<li> Adding feature crosses in TensorFlow </li>
<li> Using a wide-and-deep model </li>
</ul>
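As a rough illustration of what a feature cross does, the sketch below buckets a latitude and a longitude into equal-width bins and combines the two bucket indices into a single categorical id. It is plain Python rather than the TensorFlow feature-column API, and the coordinate ranges, the bin count, and the helper names `make_boundaries`, `bucketize`, and `cross` are assumptions made for this example, not code from the `taxifare` trainer.

```python
from bisect import bisect_right


def make_boundaries(low, high, nbuckets):
    """Return the nbuckets - 1 interior boundaries of equal-width bins on [low, high]."""
    width = (high - low) / float(nbuckets)
    return [low + i * width for i in range(1, nbuckets)]


def bucketize(value, boundaries):
    """Map a continuous value to a bucket index in [0, len(boundaries)]."""
    return bisect_right(boundaries, value)


def cross(lat_bucket, lon_bucket, nbuckets):
    """Combine two bucket indices into one id in [0, nbuckets * nbuckets)."""
    return lat_bucket * nbuckets + lon_bucket


NBUCKETS = 4  # hypothetical value, kept small so the grid is easy to inspect
lat_bounds = make_boundaries(38.0, 42.0, NBUCKETS)    # assumed latitude range
lon_bounds = make_boundaries(-76.0, -72.0, NBUCKETS)  # assumed longitude range

lat_b = bucketize(40.75, lat_bounds)
lon_b = bucketize(-73.98, lon_bounds)
cell_id = cross(lat_b, lon_b, NBUCKETS)
print('lat bucket {}, lon bucket {}, crossed id {}'.format(lat_b, lon_b, cell_id))
```

The crossed id lets a linear model learn a separate weight for each grid cell instead of one weight per latitude band and one per longitude band, which is the motivation for crossing the two features.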

In [None]:
import tensorflow as tf
import apache_beam as beam
import shutil
print(tf.__version__)

<h2>Environment variables for project and bucket </h2>

<ol>
<li> Your project ID is the <i>unique</i> string that identifies your project (not the project name). You can find it on the Home page of the GCP Console dashboard. My dashboard reads: <b>Project ID:</b> cloud-training-demos </li>
<li> Cloud training often involves saving and restoring model files, so you should <b>create a single-region bucket</b>. If you don't have a bucket already, create one from the GCP Console, which checks dynamically whether the bucket name you want is available. </li>
</ol>
<b>Change the cell below</b> to reflect your Project ID and bucket name.


In [None]:
import os
PROJECT = 'cpb100-151023'    # CHANGE THIS
BUCKET = 'drfib-usc1' # REPLACE WITH YOUR BUCKET NAME. Use a regional bucket in the region you selected.
REGION = 'us-central1' # Choose an available region for Cloud MLE from https://cloud.google.com/ml-engine/docs/regions.

In [None]:
# for bash
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.8' 

# Make the Cloud SDK (gcloud, gsutil) run under a Python 2 interpreter
os.environ['CLOUDSDK_PYTHON'] = 'python2'

<h2>Train locally</h2>

In [None]:
%%bash
rm -rf taxifare.tar.gz taxi_trained
export PYTHONPATH=${PYTHONPATH}:${PWD}/taxifare
python -m trainer.task \
  --train_data_paths=${PWD}/taxifare/preproc/20k/train.csv \
  --eval_data_paths=${PWD}/taxifare/preproc/20k/valid.csv  \
  --output_dir=${PWD}/taxi_trained \
  --train_steps=10 \
  --job-dir=/tmp

<h2>Test the training package locally with gcloud ml-engine</h2>

In [None]:
%%bash

OUTDIR=gs://${BUCKET}/taxifare/feateng20k_local
JOBNAME=feateng20k_local
echo $OUTDIR $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine local train \
   --module-name=trainer.task \
   --package-path=${PWD}/taxifare/trainer \
   --job-dir=$OUTDIR \
   -- \
   --train_data_paths="${PWD}/taxifare/preproc/20k/train*" \
   --eval_data_paths="${PWD}/taxifare/preproc/20k/valid*"  \
   --output_dir=$OUTDIR \
   --train_steps=377 \
   --train_batch_size=128 --nbuckets=21 --hidden_units="144 89 55"
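The flags after the bare `--` in the command above are forwarded to `trainer.task`. The sketch below shows one way such a module could parse them with `argparse`. It is a hypothetical stand-in, not the actual `taxifare/trainer/task.py`: the default values and the helper `parse_hidden_units` are assumptions, and only the flag names are taken from the commands in this notebook.

```python
import argparse


def parse_hidden_units(text):
    """Turn a space-separated string such as "144 89 55" into [144, 89, 55]."""
    return [int(n) for n in text.split()]


def build_parser():
    """Build a parser for the flags used in this notebook (defaults are placeholders)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_data_paths', required=True)
    parser.add_argument('--eval_data_paths', required=True)
    parser.add_argument('--output_dir', required=True)
    parser.add_argument('--train_steps', type=int, default=10)
    parser.add_argument('--train_batch_size', type=int, default=128)
    parser.add_argument('--nbuckets', type=int, default=10)
    parser.add_argument('--hidden_units', type=parse_hidden_units,
                        default=[64, 32])
    # argparse stores --job-dir under the attribute name job_dir.
    parser.add_argument('--job-dir', default='/tmp')
    return parser


# Parse an explicit list so the example does not depend on sys.argv.
args = build_parser().parse_args([
    '--train_data_paths=taxifare/preproc/20k/train*',
    '--eval_data_paths=taxifare/preproc/20k/valid*',
    '--output_dir=taxi_trained',
    '--train_steps=377',
    '--train_batch_size=128',
    '--nbuckets=21',
    '--hidden_units=144 89 55',
])
print(args.hidden_units, args.nbuckets, args.train_steps)
```

Passing `--hidden_units` as one quoted string keeps the shell command short; the parser then splits it into one integer per hidden layer.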

## Train on cloud

Training on the 5-million-row dataset takes about <b>2 hours</b> on the PREMIUM_1 scale tier.


In [None]:
%%bash
TS=$(date -u +%y%m%d_%H%M%S)
OUTDIR=gs://${BUCKET}/taxifare/feateng5m_$TS
JOBNAME=feateng5m_$TS
TIER=PREMIUM_1 
echo $OUTDIR $REGION $JOBNAME
gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/taxifare/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=$TIER \
   --runtime-version=$TFVERSION \
   -- \
   --train_data_paths="gs://${BUCKET}/taxifare/preproc/5m/train*" \
   --eval_data_paths="gs://${BUCKET}/taxifare/preproc/5m/valid*"  \
   --output_dir=$OUTDIR \
   --train_steps=3524578 \
   --train_batch_size=128 --nbuckets=21 --hidden_units="144 89 55"
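To get a feel for what the `--train_steps` and `--train_batch_size` values mean, the short calculation below converts them into the number of examples processed and the approximate number of passes over the data. The row counts of 20,000 and 5,000,000 are assumptions inferred from the `20k` and `5m` directory names; the actual training splits may be smaller, and the function names are made up for this example.

```python
def examples_seen(train_steps, batch_size):
    """Total number of examples processed: one batch per training step."""
    return train_steps * batch_size


def approx_epochs(train_steps, batch_size, num_rows):
    """Approximate number of passes over a dataset of num_rows examples."""
    return examples_seen(train_steps, batch_size) / float(num_rows)


# Local run: 377 steps of 128 examples on the assumed 20,000-row sample.
local_epochs = approx_epochs(377, 128, 20000)

# Cloud run: 3,524,578 steps of 128 examples on the assumed 5,000,000 rows.
cloud_epochs = approx_epochs(3524578, 128, 5000000)

print('local: {:.1f} passes over the data'.format(local_epochs))
print('cloud: {:.1f} passes over the data'.format(cloud_epochs))
```

Under those assumptions, the local run covers the sample a little more than twice, while the cloud run makes roughly 90 passes over the larger dataset.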

### Start TensorBoard

In [None]:
from google.datalab.ml import TensorBoard
# Point OUTDIR at the output directory of the training run you want to inspect.
OUTDIR='gs://{0}/taxifare/feateng2m'.format(BUCKET)
print(OUTDIR)
TensorBoard().start(OUTDIR)

### Stop TensorBoard

In [None]:
pids_df = TensorBoard.list()
if not pids_df.empty:
    for pid in pids_df['pid']:
        TensorBoard().stop(pid)
        print('Stopped TensorBoard with pid {}'.format(pid))

Copyright 2016 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License