# Week 2 - End-to-End Machine Learning with TensorFlow on GCP

URL : https://www.coursera.org/learn/end-to-end-ml-tensorflow-gcp/home/week/2

# 1. Creating a dataset

## Building an ML model involves:
1. Create the dataset
2. Build the model
3. Operationalize the model

## What makes a feature "good"?

1. Be related to the objective
2. Be known at prediction-time
3. Be numeric with meaningful magnitude
4. Have enough examples
5. Bring human insight to problem

## The simplest option is to sample rows randomly
- Each data point is a birth record from the natality dataset
- Random sampling eliminates potential biases due to order of the training examples but ...

## Also ... what about triplets?
- 3 rows with essentially the same data!
- How can we make this data unique?
- How can we solve this?

## Solution: Split a dataset into training/validation using hashing and modulo operators
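
A minimal sketch of the idea in plain Python, assuming the split key is a (year, month) pair as in the lab query further down. `hashlib.md5` stands in for BigQuery's `FARM_FINGERPRINT` (the two produce different hash values), and `assign_split` is a hypothetical helper name. Because the bucket depends only on the key, rows that share a key (for example, triplets born in the same month) always land in the same split, and rerunning the code gives the same assignment.

```python
import hashlib

def assign_split(year, month, num_buckets=4, train_buckets=3):
    """Deterministically map a (year, month) key to 'train' or 'eval'."""
    key = f"{year}{month}".encode("utf-8")
    # md5 is stable across runs and machines, unlike Python's built-in hash().
    bucket = int(hashlib.md5(key).hexdigest(), 16) % num_buckets
    return "train" if bucket < train_buckets else "eval"

# Three near-identical rows (same year and month) plus two others.
records = [(2001, 1), (2001, 1), (2001, 1), (2003, 7), (2005, 12)]
splits = [assign_split(year, month) for year, month in records]
print(splits)
```
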
## Developing the ML model software on the entire dataset can be expensive; you want to develop on a smaller sample
- Develop your TensorFlow code on a small subset of data, then scale it out to the cloud

## Solution: Sampling the split so that we have a small dataset to develop our code on
- RAND() => random
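
As a rough illustration, the sketch below applies a random filter on top of a hash-based split, mirroring the `MOD(hashmonth, 4) < 3 AND RAND() < 0.0005` pattern used in the lab query. The names `in_train_split` and `sample_train`, the 10% sampling rate, and the seeded `random.Random` are assumptions made so the example is small and repeatable. BigQuery's `RAND()` takes no seed, so the sampled rows differ between runs, while the split each row belongs to does not.

```python
import hashlib
import random

def in_train_split(key, num_buckets=4, train_buckets=3):
    """Hash-and-modulo test: does this key belong to the training split?"""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets < train_buckets

def sample_train(keys, fraction, rng):
    """Keep a random fraction of the rows that fall in the training split."""
    return [k for k in keys if in_train_split(k) and rng.random() < fraction]

keys = list(range(10_000))  # stand-ins for row keys such as hashmonth
full_train = [k for k in keys if in_train_split(k)]
small_train = sample_train(keys, fraction=0.1, rng=random.Random(0))
print(len(full_train), len(small_train))
```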

amenable: 1. obedient; receptive to (something)   2. able to be handled (in a particular way)

# Module Quiz
1. True or False - In ML, you could train using all your data and decide not to hold out a test set and still get a good model
 > <font color='red'>True</font>, False
2. What are the benefits of using the hashing and modulo operators for creating ML datasets?
> 1. <font color='red'>It allows you to create datasets in a repeatable manner.</font>
> 2. It is more computationally efficient than using the rand() function.
> 3. It provides the best performing split for training and evaluation.
> 4. None of the above.

hold out: 1. to last (especially in a difficult situation)   2. to resist (in a difficult situation)

Module Quiz

1. Numeric
2. 3 above


# 2. Hands-on Lab 2

## Lab 2: Create a sample dataset

### What you learn

In this lab, you will learn how to:

- Sample a BigQuery dataset to create datasets for ML
- Preprocess data using Pandas

deploy: 1. to position (troops, weapons)   2. to use effectively

```python

# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
REGION = 'us-central1'

import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi

# Create SQL query using natality data after the year 2000
from google.cloud import bigquery
query = """
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks,
  ABS(FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING)))) AS hashmonth
FROM
  publicdata.samples.natality
WHERE year > 2000
"""

df = bigquery.Client().query("SELECT hashmonth, COUNT(weight_pounds) AS num_babies FROM (" + query + ") GROUP BY hashmonth").to_dataframe()
df.head()

df.shape

trainQuery = "SELECT * FROM ("+query+") WHERE MOD(hashmonth, 4) < 3 AND RAND() < 0.0005"
evalQuery = "SELECT * FROM ("+query+") WHERE MOD(hashmonth, 4) = 3 AND RAND() < 0.0005"
traindf = bigquery.Client().query(trainQuery).to_dataframe()
evaldf = bigquery.Client().query(evalQuery).to_dataframe()
print(len(traindf), len(evaldf))

import pandas as pd
def preprocess(df):
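  # Keep only rows where every field is present and positive;
  # copy so the assignment below does not write into a view
  df = df[df.weight_pounds > 0]
  df = df[df.mother_age > 0]
  df = df[df.gestation_weeks > 0]
  df = df[df.plurality > 0].copy()
  
  # Replace the numeric plurality with a descriptive string label
  twins_etc = dict(zip([1,2,3,4,5],['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)', 'Quintuplets(5)']))
  df['plurality'] = df['plurality'].replace(twins_etc)
  
  # Duplicate every row with less specific values: is_male becomes
  # 'Unknown' and any non-single plurality becomes 'Multiple(2+)'
  nous = df.copy(deep=True)
  nous.loc[nous['plurality'] != 'Single(1)', 'plurality'] = 'Multiple(2+)'
  nous['is_male'] = 'Unknown'
  
  return pd.concat([df, nous])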
  
traindf = preprocess(traindf)
evaldf = preprocess(evaldf)
traindf.head()

traindf.to_csv('train.csv', index=False, header=False)
evaldf.to_csv('eval.csv', index=False, header=False)

%%bash
wc -l *.csv
head *.csv
tail *.csv
```
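
To make the effect of `preprocess` easier to see, here is a sketch of the same row-doubling step on plain Python dicts rather than a DataFrame. The helper name `augment` and the sample rows are hypothetical. Each input row yields two output rows: the original with a descriptive plurality label, and a copy in which `is_male` is 'Unknown' and any non-single plurality is collapsed to 'Multiple(2+)'.

```python
PLURALITY_LABELS = {1: 'Single(1)', 2: 'Twins(2)', 3: 'Triplets(3)',
                    4: 'Quadruplets(4)', 5: 'Quintuplets(5)'}

def augment(rows):
    """Return the labeled rows followed by their less-specific copies."""
    labeled = [dict(row, plurality=PLURALITY_LABELS[row['plurality']])
               for row in rows]
    masked = []
    for row in labeled:
        copy = dict(row, is_male='Unknown')
        if copy['plurality'] != 'Single(1)':
            copy['plurality'] = 'Multiple(2+)'
        masked.append(copy)
    return labeled + masked

rows = [
    {'weight_pounds': 7.5, 'is_male': True, 'plurality': 1},
    {'weight_pounds': 5.1, 'is_male': False, 'plurality': 2},
]
for row in augment(rows):
    print(row)
```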

## Lab 2: demo and review



# 3. Build the model

### TensorFlow is an open-source, high-performance library for numerical computation that uses directed graphs
- Nodes represent mathematical operations
- Edges represent arrays of data
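
A toy sketch of this idea in plain Python. It does not use TensorFlow, and the `Node` class and the `constant`, `elementwise` and `run` functions are hypothetical names. Building the graph only records which operation each node performs and which nodes feed it; values flow along the edges only when the graph is evaluated.

```python
import operator

class Node:
    """A graph node: an operation plus the upstream nodes that feed it."""
    def __init__(self, op, *inputs):
        self.op = op          # callable applied to the input values
        self.inputs = inputs  # incoming edges

def constant(value):
    """A node with no inputs that always produces the same array."""
    return Node(lambda: value)

def run(node):
    """Evaluate a node by first evaluating everything upstream of it."""
    return node.op(*(run(n) for n in node.inputs))

def elementwise(fn):
    """Lift a scalar function to work on two equal-length lists."""
    return lambda a, b: [fn(x, y) for x, y in zip(a, b)]

# Build the graph for (a + b) * b; nothing is computed yet.
a = constant([1.0, 2.0, 3.0])
b = constant([4.0, 5.0, 6.0])
total = Node(elementwise(operator.add), a, b)
product = Node(elementwise(operator.mul), total, b)

result = run(product)
print(result)
```
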
### A tensor is an N-dimensional array of data
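
For illustration, nested Python lists can stand in for tensors of increasing rank. The `shape` helper below is a hypothetical function, not part of TensorFlow, and it assumes the nesting is rectangular and non-empty.

```python
def shape(data):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(data, list):
        dims.append(len(data))
        data = data[0]
    return tuple(dims)

scalar = 3.0                                  # rank 0
vector = [1.0, 2.0, 3.0]                      # rank 1
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # rank 2
cube = [matrix, matrix]                       # rank 3

for name, value in [('scalar', scalar), ('vector', vector),
                    ('matrix', matrix), ('cube', cube)]:
    dims = shape(value)
    print(name, 'rank', len(dims), 'shape', dims)
```
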
### TensorFlow toolkit hierarchy
### Working with Estimator API
- Set up machine learning model
 1. Regression or classification?
 2. What is the label?
 3. What are the features?
- Carry out ML steps
 1. Train the model
 2. Evaluate the model
 3. Predict with the model

Square footage => My model => Price
### Structure of an Estimator API ML model
### Encoding categorical data to supply to a DNN
- 1a. If you know the complete vocabulary beforehand:
```python
tf.feature_column.categorical_column_with_vocabulary_list('zipcode',vocabulary_list = ['83452','72345','87654','98723','23451'])
```
- 1b. If your data is already indexed; i.e., has integers in [0, N):

```python
tf.feature_column.categorical_column_with_identity('stateId',num_buckets=50)
```

- 2. To pass in a categorical column into a DNN, one option is to one-hot encode it

```python
tf.feature_column.indicator_column(my_categorical_column)
```
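
What one-hot encoding produces can be sketched without TensorFlow. The helper `one_hot` below is hypothetical, and returning an all-zero vector for a value outside the vocabulary is a choice made for this sketch, not a statement about how `indicator_column` behaves.

```python
def one_hot(value, vocabulary):
    """Encode value as a list with a single 1.0 at its vocabulary index."""
    encoded = [0.0] * len(vocabulary)
    if value in vocabulary:
        encoded[vocabulary.index(value)] = 1.0
    return encoded

zipcodes = ['83452', '72345', '87654', '98723', '23451']
print(one_hot('87654', zipcodes))  # third vocabulary entry
print(one_hot('00000', zipcodes))  # not in the vocabulary
```
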
### To read CSV files, create a TextLineDataset giving it a function to decode the CSV into features, labels

- `dataset = tf.data.TextLineDataset(filename).map(decode_csv)`, where `decode_csv` is the decoding function
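
A rough stand-in for what such a decode function does, using the standard `csv` module instead of `tf.decode_csv`. The column names and defaults follow the lab code later in these notes; the sample line is made up. Empty fields fall back to the column default, and numeric columns are converted to float.

```python
import csv

CSV_COLUMNS = ['weight_pounds', 'is_male', 'mother_age',
               'plurality', 'gestation_weeks', 'key']
LABEL_COLUMN = 'weight_pounds'
DEFAULTS = [0.0, 'null', 0.0, 'null', 0.0, 'nokey']

def decode_csv(line_of_text):
    """Parse one CSV line into a (features dict, label) pair."""
    fields = next(csv.reader([line_of_text]))
    values = []
    for field, default in zip(fields, DEFAULTS):
        if field == '':
            values.append(default)           # missing value: use the default
        elif isinstance(default, float):
            values.append(float(field))      # numeric column
        else:
            values.append(field)             # string column
    features = dict(zip(CSV_COLUMNS, values))
    label = features.pop(LABEL_COLUMN)
    return features, label

features, label = decode_csv('7.5,True,28,Single(1),39,1234567')
print(label, features)
```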

### Shuffling is important for distributed training
```python
dataset = dataset.shuffle(buffer_size = 10*batch_size)
dataset = dataset.repeat(num_epochs).batch(batch_size)
dataset.make_one_shot_iterator().get_next()
```
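
The effect of `shuffle`, `repeat` and `batch` can be approximated with a few generators. This is a simplified sketch, not the `tf.data` implementation: `shuffle_buffer`, `repeat_epochs` and `batch` are hypothetical helpers. The shuffle holds at most `buffer_size` items and emits a random one whenever the buffer overflows, so the result is only approximately random when the buffer is smaller than the dataset.

```python
import random

def shuffle_buffer(items, buffer_size, rng):
    """Yield items in approximately random order using a bounded buffer."""
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) > buffer_size:
            # Emit a random buffered element to make room for the next one.
            yield buffer.pop(rng.randrange(len(buffer)))
    rng.shuffle(buffer)
    yield from buffer  # drain what is left at the end of the input

def repeat_epochs(make_epoch, num_epochs):
    """Chain num_epochs passes over the data, reshuffled on each pass."""
    for _ in range(num_epochs):
        yield from make_epoch()

def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current

rng = random.Random(0)
data = list(range(10))
batch_size = 4
stream = repeat_epochs(
    lambda: shuffle_buffer(data, buffer_size=10 * batch_size, rng=rng),
    num_epochs=2)
batches = list(batch(stream, batch_size))
print(batches)
```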

### Estimator API comes with a method that handles distributed training and evaluation
```python
estimator = tf.estimator.LinearRegressor(model_dir=output_dir,
                                         feature_columns=feature_cols)
tf.estimator.train_and_evaluate(estimator,
                                train_spec,
                                eval_spec)
```
1. Distribute the graph
2. Share variables
3. Evaluate occasionally
4. Handle machine failures
5. Create checkpoint files
6. Recover from failures
7. Save summaries for TensorBoard

### TrainSpec consists of the things that used to be passed into the train() method
```python
train_spec = tf.estimator.TrainSpec(input_fn = read_dataset('gs://.../train*',
                                                            mode = tf.estimator.ModeKeys.TRAIN),
                                    max_steps=num_train_steps)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
    
```

### EvalSpec controls the evaluation and the checkpointing of the model because they happen at the same time
```python
exporter=...
eval_spec = tf.estimator.EvalSpec(input_fn=read_dataset('gs://.../valid*',
                                                        mode=tf.estimator.ModeKeys.EVAL),
                                  steps=None,
                                  start_delay_secs=60,
                                  throttle_secs=600,
                                  exporters=exporter)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```

# 4. Hands-on Lab 3

```python
import shutil
import numpy as np
import tensorflow as tf

# Determine CSV, label, and key columns
CSV_COLUMNS = 'weight_pounds,is_male,mother_age,plurality,gestation_weeks,key'.split(',')
LABEL_COLUMN = 'weight_pounds'
KEY_COLUMN = 'key'

# Set default values for each CSV column
DEFAULTS = [[0.0], ['null'], [0.0], ['null'], [0.0], ['nokey']]
TRAIN_STEPS = 1000

# Create an input function reading a file using the Dataset API
# Then provide the results to the Estimator API
def read_dataset(filename_pattern, mode, batch_size = 512):
  def _input_fn():
    def decode_csv(line_of_text):
      # TODO #1: Use tf.decode_csv to parse the provided line
      columns = tf.decode_csv(line_of_text, record_defaults=DEFAULTS)      
      # TODO #2: Make a Python dict.  The keys are the column names, the values are from the parsed data
      features = dict(zip(CSV_COLUMNS, columns))      
      # TODO #3: Return a tuple of features, label where features is a Python dict and label a float
      label = features.pop(LABEL_COLUMN)      
      return features, label
    
    # TODO #4: Use tf.gfile.Glob to create list of files that match pattern
    file_list = tf.gfile.Glob(filename_pattern)
    print(file_list)
    # Create dataset from file list
    dataset = (tf.data.TextLineDataset(file_list)  # Read text file
                 .map(decode_csv))  # Transform each elem by applying decode_csv fn
    
    # TODO #5: In training mode, shuffle the dataset and repeat indefinitely
    #                (Look at the API for tf.data.dataset shuffle)
    #          The mode input variable will be tf.estimator.ModeKeys.TRAIN if in training mode
    #          Tell the dataset to provide data in batches of batch_size 
    if mode == tf.estimator.ModeKeys.TRAIN:
        num_epochs = None
        dataset = dataset.shuffle(buffer_size = 10*batch_size)
    else:
        num_epochs = 1
    
    dataset = dataset.repeat(num_epochs).batch(batch_size)
    # This will now return batches of features, label
    return dataset
  return _input_fn

# Define feature columns
def get_categorical(name, value):
    return tf.feature_column.indicator_column(tf.feature_column.categorical_column_with_vocabulary_list(name, value))

def get_cols():
    return [
            get_categorical('is_male', ['True','False','Unknown']),
            tf.feature_column.numeric_column('mother_age'),
            get_categorical('plurality',['Single(1)', 'Twins(2)', 'Triplets(3)',
                       'Quadruplets(4)', 'Quintuplets(5)','Multiple(2+)']),            
            tf.feature_column.numeric_column('gestation_weeks')            
           ]

# Create serving input function to be able to serve predictions later using provided inputs
def serving_input_fn():
    feature_placeholders = {
        'is_male': tf.placeholder(tf.string, [None]),
        'mother_age': tf.placeholder(tf.float32, [None]),
        'plurality': tf.placeholder(tf.string, [None]),
        'gestation_weeks': tf.placeholder(tf.float32, [None])
    }
    features = {
        key: tf.expand_dims(tensor, -1)
        for key, tensor in feature_placeholders.items()
    }
    return tf.estimator.export.ServingInputReceiver(features, feature_placeholders)

# Create estimator to train and evaluate
def train_and_evaluate(output_dir):
  EVAL_INTERVAL = 300
  run_config = tf.estimator.RunConfig(save_checkpoints_secs = EVAL_INTERVAL,
                                      keep_checkpoint_max = 3)
  # TODO #1: Create your estimator
  estimator = tf.estimator.DNNRegressor(
                       model_dir = output_dir,
                       feature_columns = get_cols(),
                       hidden_units = [64, 32],
                       config = run_config)
  train_spec = tf.estimator.TrainSpec(
                       # TODO #2: Call read_dataset passing in the training CSV file and the appropriate mode
                       input_fn = read_dataset('train.csv', mode = tf.estimator.ModeKeys.TRAIN),
                       max_steps = TRAIN_STEPS)
  exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)
  eval_spec = tf.estimator.EvalSpec(
                       # TODO #3: Call read_dataset passing in the evaluation CSV file and the appropriate mode
                       input_fn = read_dataset('eval.csv', mode = tf.estimator.ModeKeys.EVAL),
                       steps = None,
                       start_delay_secs = 60, # start evaluating after N seconds
                       throttle_secs = EVAL_INTERVAL,  # evaluate every N seconds
                       exporters = exporter)
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
    
# Run the model
shutil.rmtree('babyweight_trained', ignore_errors = True) # start fresh each time
tf.summary.FileWriterCache.clear() # ensure filewriter cache is clear for TensorBoard events file
train_and_evaluate('babyweight_trained')
```