# ML with Structured Data using Google Cloud

This tutorial is adapted from [this tutorial](https://docs.google.com/presentation/d/e/2PACX-1vR-d6ztE9pkRS1L0pKInaaGMsBf7d_bMETr3Mx0uFYng2Y22zexg0ZaPRWbmmc497EMBeRgg8xvLLfI/pub?start=false&loop=false&delayms=3000&slide=id.g3444070087_0_0) on end-to-end ML with TensorFlow on GCP, created by **Lak Lakshmanan**, which includes the original [codelabs](https://codelabs.developers.google.com/codelabs/end-to-end-ml/#0). It extends the original by covering Facets, BQML, TFT (tf.Transform), and TFMA (TensorFlow Model Analysis).

This notebook illustrates:

1. Exploring a BigQuery dataset using Datalab & Facets
2. Linear Regression with BQML
3. Creating datasets for Machine Learning using Dataflow & tf.Transform
4. Creating a model using the high-level Estimator API
5. Evaluating model quality using TensorFlow Model Analysis
6. Training & Tuning using Cloud ML Engine
7. Deploying the model
8. Predicting with the model

### Housekeeping 

In [None]:
!pip list | grep 'tensor'

In [None]:
import os

BUCKET = 'agravat-demo'
PROJECT = 'agravat-demo'
REGION = 'us-central1'

os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

gcs_data_dir = 'gs://{0}/data/babyweight/'.format(BUCKET)
gcs_model_dir = 'gs://{0}/models/babyweight/'.format(BUCKET)

local_data_dir = 'data/babyweight'
local_models_dir= 'models/babyweight'

In [None]:
%%bash

gsutil -m rm -rf gs://$BUCKET/models/babyweight/*
gsutil -m rm -rf gs://$BUCKET/data/babyweight/big_data/*

## 1. Exploring data in BigQuery

The dataset is natality data (records of births in the US). The goal is to predict a baby's weight from a number of factors about the pregnancy and the baby's mother. Later, the data will be split into training and eval datasets, using a hash of the year-month to decide which split each record belongs to.
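
To illustrate why hashing the year-month gives a repeatable split, here is a minimal plain-Python sketch. It is not the query this notebook uses: the helper name `split_for`, the choice of `hashlib.md5`, and the 25% eval fraction are all assumptions made for illustration.

```python
import hashlib


def split_for(year, month, eval_fraction=0.25, num_buckets=100):
    """Return 'train' or 'eval' for a record, based only on its year-month.

    Hypothetical helper: md5 is used as a stand-in hash because it is stable
    across runs and machines (unlike Python's built-in hash()).
    """
    key = '{}{:02d}'.format(year, month).encode('utf-8')
    bucket = int(hashlib.md5(key).hexdigest(), 16) % num_buckets
    return 'eval' if bucket < eval_fraction * num_buckets else 'train'


# Every record from the same year-month lands in the same split,
# no matter how often or in what order the data is processed.
splits = {(y, m): split_for(y, m) for y in range(2001, 2009) for m in range(1, 13)}
```

Because the split depends only on the year-month, rerunning the pipeline never moves a record between training and eval. The resulting proportions are only approximate, since whole months are assigned together.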

In [None]:
%%bq query --name data

SELECT
  CAST(mother_race AS string) race_index,
  AVG(weight_pounds) avg_weight,
  COUNT(weight_pounds) instance_Count
FROM
  `publicdata.samples.natality`
WHERE 
    year > 2000
AND weight_pounds > 0
AND mother_age > 0
AND plurality > 0
AND gestation_weeks > 0
AND month > 0
AND mother_race is not null
GROUP BY
  mother_race
ORDER BY
  avg_weight DESC

### Visualise with Datalab commands 
http://googledatalab.github.io/pydatalab/google.datalab%20Commands.html

In [None]:
%chart columns --data data --fields race_index,avg_weight
title: Mother Race Index vs Average Baby Weight
height: 400
width: 900
hAxis:
  title: Race Index
vAxis:
  title: Average Weight

### Fetch data from BigQuery as a pandas dataframe

In [None]:
data_size = 10000

In [None]:
%sql --module query 

SELECT
  ROUND(weight_pounds,1) AS weight_pounds,
  is_male,
  mother_age,
  mother_race,
  plurality,
  gestation_weeks,
  mother_married,
  cigarette_use,
  alcohol_use
FROM
  `publicdata.samples.natality`
WHERE 
        year > 2000
    AND weight_pounds > 0
    AND mother_age > 0
    AND plurality > 0
    AND gestation_weeks > 0
    AND month > 0
    AND mother_race IS NOT NULL
LIMIT $DATA_SIZE

In [None]:
import datalab.bigquery as bq
import sys
data = bq.Query(query, DATA_SIZE = data_size).to_dataframe(dialect='standard')
print('Row count:{}'.format(len(data)))
data.head(5)

In [None]:
data.describe()

### Explore & Visualise

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [None]:
plt.close('all')
#plt.figure(figsize=(45, 25))
plt.figure(figsize=(30, 15))

# Baby Weight Distribution
plt.subplot(2,3,1)
plt.title("Baby Weight Histogram")
plt.hist(data.weight_pounds, bins=150)
#plt.axis([0, 50, 0, 3500])
plt.xlabel("Baby Weight Ranges")
plt.ylabel("Frequency")
# ---------------------------

# Mother Age vs Baby Weight
plt.subplot(2,3,2)
plt.title("Mother Age vs Baby Weight")
plt.scatter(data.mother_age,data.weight_pounds)
plt.xlabel("Mother Age")
plt.ylabel("Baby Weight")
# ---------------------------

# Gestation Weeks vs Baby Weight
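plt.subplot(2,3,3)
plt.title("Gestation Weeks vs Baby Weight")
# Fit a straight line (degree-1 polynomial) to show the overall trend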
fit = np.polyfit(data.gestation_weeks,data.weight_pounds, deg=1)
plt.plot(data.gestation_weeks, fit[0] * data.gestation_weeks + fit[1], color='red')
plt.scatter(data.gestation_weeks, data.weight_pounds)
plt.xlabel("Gestation Weeks")
plt.ylabel("Baby Weight")

#---------------------------

# Is Male vs Baby Weight Boxplot
plt.subplot(2,3,4)
plt.title("Is Male vs Baby Weight")

is_male_values = list(data.is_male.value_counts().index.values)
# One array of weights per distinct is_male value
is_male_data = [data.weight_pounds[data.is_male == i].values for i in is_male_values]

plt.boxplot(is_male_data)
plt.axis([0, 3, 4, 11])
plt.xlabel("Is Male")
plt.ylabel("Baby Weight")
# ---------------------------

# Mother Race vs Baby Weight Boxplot
plt.subplot(2,3,5)
plt.title("Mother Race vs Baby Weight")

race_values = list(data.mother_race.value_counts().index.values)
# One array of weights per distinct mother_race value
race_data = [data.weight_pounds[data.mother_race == i].values for i in race_values]

plt.boxplot(race_data)
plt.axis([0, 16, 4, 11])
plt.xlabel("Mother Race")
plt.ylabel("Baby Weight")

# # ---------------------------

plt.subplot(2,3,6)
plt.title("Alcohol & Cigarette Use")

alch_use_values = list(data.alcohol_use.value_counts().index.values)
cig_use_values = list(data.cigarette_use.value_counts().index.values)

use_data = []
labels = []

for i in alch_use_values:
    for j in cig_use_values:
        condition = (data.alcohol_use == i) & (data.cigarette_use == j)
        values = data.weight_pounds[condition].values
        # Add the label only when the slice exists, so labels and slices stay aligned
        if len(values) > 0:
            labels.append('alch-use:{} & cig-use:{}'.format(i, j))
            use_data.append(len(values))

plt.pie(use_data)
plt.legend(labels, loc="lower center")

plt.show()

## Visualise Dataset using Facets - Big Picture
visit: https://research.google.com/bigpicture/

* Use Stacked with categorical features to inspect their distribution. If used with numerical features, Facets will bucketise them.
* Use Scatter with numerical values to test correlations.
* Use Facets to slice and dice (vertically and horizontally).
* Use colour with the target feature.
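
To make "bucketise" concrete, here is a minimal sketch of equal-width bucketing in plain Python. Facets Dive does its own binning internally; the equal-width scheme, the helper name `bucketise`, and the sample ages below are assumptions for illustration only.

```python
from collections import Counter


def bucketise(values, num_buckets):
    """Assign each value to one of num_buckets equal-width buckets (0-based)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0 for _ in values]
    width = float(hi - lo) / num_buckets
    # min() keeps the maximum value inside the last bucket
    return [min(int((v - lo) / width), num_buckets - 1) for v in values]


mother_ages = [18, 22, 25, 31, 34, 40, 42]  # made-up sample values
buckets = bucketise(mother_ages, 3)
counts = Counter(buckets)  # how many values fall in each bucket
```

Once a numerical feature is reduced to a few buckets like this, it can be stacked and counted in the same way as a categorical feature.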

In [None]:
from IPython.display import display, HTML

jsonstr = data.to_json(orient='records')

#HTML_TEMPLATE = """<link rel="import" href="/nbextensions/facets-dist/facets-jupyter.html">
HTML_TEMPLATE = """<link rel="import" href="/nbextensions/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=jsonstr)
#display(HTML(html))

# Write the page to disk so it can be embedded in an iframe in the next cell
with open("babyweight-facest.html", "w") as html_file:
    html_file.write(html)

In [None]:
%%HTML
<iframe width="100%" height="600" src="babyweight-facest.html"></iframe>