### Gcloud Intro (Datalab, ML Engine, ML APIs)
by: Ian Myjer

## Setup Stuff (very simplified version of what Zach covered)

#### download data file to local
https://www.usaspending.gov/#/download_center/award_data_archive

#### create gcloud storage bucket (performed locally) 
`gsutil mb -c regional -l us-east1 gs://ian-sandbox`

#### move files to bucket (performed locally)
`gsutil cp *all_Contracts* gs://ian-sandbox`
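
If you would rather do the copy from Python than from `gsutil`, here is a minimal sketch. It assumes the `google-cloud-storage` client library is installed and that credentials are already set up; the helper names `plan_uploads` and `upload_all` are made up for illustration.

```python
import glob
import os


def plan_uploads(pattern, bucket_name):
    """Pair each local file matching `pattern` with its destination gs:// URI."""
    paths = sorted(glob.glob(pattern))
    return [
        (path, "gs://{}/{}".format(bucket_name, os.path.basename(path)))
        for path in paths
    ]


def upload_all(pattern, bucket_name):
    """Upload every matching file to the bucket root (needs GCP credentials)."""
    # Imported lazily so plan_uploads() works without the client library.
    from google.cloud import storage

    bucket = storage.Client().bucket(bucket_name)
    for local_path, uri in plan_uploads(pattern, bucket_name):
        bucket.blob(os.path.basename(local_path)).upload_from_filename(local_path)
        print("uploaded {} -> {}".format(local_path, uri))
```

Calling `upload_all("*all_Contracts*", "ian-sandbox")` should do roughly what the `gsutil cp` line does; `plan_uploads` lets you check what would be copied first.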

#### get data into BigQuery
1. Pin GCP project to BigQuery
2. Create dataset
3. Create a table within the dataset and let BigQuery auto-detect the schema (field names and types)
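
The table-creation step can also be scripted. This is a sketch only: it assumes a reasonably recent `google-cloud-bigquery` client (one that accepts a `project.dataset.table` string as the destination), that the dataset already exists, and that the CSV has a single header row. The function names are hypothetical.

```python
def table_id(project, dataset, table):
    """Build the fully qualified `project.dataset.table` identifier."""
    return "{}.{}.{}".format(project, dataset, table)


def load_csv_with_autodetect(uri, project, dataset, table):
    """Load a CSV from Cloud Storage and let BigQuery guess the schema."""
    # Imported lazily so table_id() works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.CSV
    job_config.skip_leading_rows = 1  # assumption: one header row
    job_config.autodetect = True      # same idea as the auto-detect box in the UI

    destination = table_id(project, dataset, table)
    job = client.load_table_from_uri(uri, destination, job_config=job_config)
    job.result()  # block until the load job finishes (raises on failure)
    return client.get_table(destination).num_rows
```

`uri` is the `gs://` path of a file uploaded earlier; the dataset and table names are whatever you created in steps 2 and 3.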

### Or do it the cool/more scalable way (https://github.com/RZachLamberty/usaspending)
1. Download data somewhere
2. Upload to GCP cloud storage
3. Use Dataprep to parse one file and get the schema (or come up with schema on your own...)
4. Use Dataflow to parse the remaining files and put them into BigQuery
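
For the "come up with schema on your own" option in step 3, a crude version can be done with the standard library. This sketch is not what Dataprep does and is not taken from the linked repo; it samples rows and picks `INTEGER`, `FLOAT`, or `STRING` per column, in the JSON field layout BigQuery schema files use.

```python
import csv


def guess_type(values):
    """Return the narrowest of INTEGER/FLOAT/STRING that fits all non-empty values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "STRING"
    for bq_type, cast in (("INTEGER", int), ("FLOAT", float)):
        try:
            for v in non_empty:
                cast(v)
            return bq_type
        except ValueError:
            continue  # at least one value did not parse; try the next type
    return "STRING"


def infer_schema(csv_file, sample_rows=1000):
    """Guess a BigQuery-style schema from the header and the first rows."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = [[] for _ in header]
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break
        for col, value in zip(columns, row):
            col.append(value.strip())
    return [
        {"name": name, "type": guess_type(col), "mode": "NULLABLE"}
        for name, col in zip(header, columns)
    ]
```

Dates, booleans, and columns whose first 1000 rows are unrepresentative will be guessed wrong, so treat the result as a starting point to edit by hand.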


## Datalab

#### what is it?
Jupyter notebook hosted on GCP (no extra charge for Datalab itself; you still pay for the Compute Engine VM it runs on)    
More technical: it's a Docker image of Jupyter and some other Python stuff running on a GCP Compute Engine instance that Google spins up for you (https://github.com/googledatalab/datalab/tree/master/containers/datalab)

#### How to use it?
From the gcloud shell, create a Datalab instance:
`datalab create gcp-example --project ian-sandbox-221014 --zone us-east1-b`

#### How to manage it?
`datalab list`   
`datalab stop gcp-example`    
`datalab connect gcp-example` (Can be used if network connection is lost or to restart/connect to datalab if it has been stopped)    
`datalab delete gcp-example`    

#### Why use datalab? 
1. Free/easy development environment management 
2. If you're using BigQuery, Bigtable, etc., Datalab can connect to them easily
3. Integrated with ML Engine for easy scaling and access to TensorFlow hardware (GPUs/TPUs)
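
To make point 2 concrete, here is roughly what pulling BigQuery data into a notebook cell can look like. It is a sketch that assumes the `google-cloud-bigquery` and `pandas` packages are available in the notebook environment; the table and column names you pass in are your own, and `top_n_query` / `run_query` are hypothetical helpers.

```python
def top_n_query(table, group_col, n=10):
    """Build a simple 'most common values' query.

    Plain string formatting: fine for a notebook, not for untrusted input.
    """
    return (
        "SELECT {col}, COUNT(*) AS n_rows\n"
        "FROM `{table}`\n"
        "GROUP BY {col}\n"
        "ORDER BY n_rows DESC\n"
        "LIMIT {n}"
    ).format(col=group_col, table=table, n=int(n))


def run_query(sql, project):
    """Run the query and return the result as a pandas DataFrame."""
    # Imported lazily so top_n_query() works without the client library.
    from google.cloud import bigquery

    return bigquery.Client(project=project).query(sql).to_dataframe()
```

Inside Datalab the credentials come from the VM, so `run_query(top_n_query(...), "ian-sandbox-221014")` should work without any extra auth setup.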

#### Why not use datalab?
1. Kind of annoying to spin up/spin down
2. Less flexible than maintaining your own development environment (e.g., harder to use typical developer tools like `git`)
3. No obvious way to use `R`


## ML Engine

#### What is it?
1. Auto-scaling model training engine
2. Model deployment, management, and version control framework

#### How to use it?
1. Use the GCP console to look at ML jobs and managed models
2. Use the terminal or a Datalab instance to kick off training jobs or deploy models
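
As an illustration of kicking off a training job from a notebook or script, the sketch below assembles a `gcloud ml-engine jobs submit training` call. It assumes the Cloud SDK is on the path and that your SDK version still ships the `ml-engine` command group; the package layout, job name, and runtime version are whatever your project uses, and the helper names are hypothetical.

```python
import subprocess


def training_job_command(job_name, package_path, module_name, job_dir,
                         region, runtime_version, scale_tier="BASIC"):
    """Return the argument list for submitting a training job."""
    return [
        "gcloud", "ml-engine", "jobs", "submit", "training", job_name,
        "--package-path", package_path,   # local directory with the trainer package
        "--module-name", module_name,     # e.g. the package's task module
        "--job-dir", job_dir,             # gs:// path for checkpoints and exports
        "--region", region,
        "--runtime-version", runtime_version,
        "--scale-tier", scale_tier,       # controls how many workers you get
    ]


def submit(command):
    """Run the command and raise if gcloud exits with an error."""
    subprocess.run(command, check=True)
```

Building the list separately from running it makes it easy to print and sanity-check the command before spending money on workers.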

#### Why use it?
1. Automatically scale the workers required for training models on large datasets
2. Access to hardware with GPU/TPUs (https://cloud.google.com/ml-engine/docs/tensorflow/using-tpus)
3. Easy model versioning
4. Model deployment
5. Model monitoring?

#### Why not use it?
1. Exposing the model as a REST API to users outside of GCP seems like it would take extra effort (but probably not as much as deploying it manually with a Flask app or similar)
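
For a sense of what calling a deployed model over REST involves, here is a sketch that builds (but does not send) an online prediction request. It assumes the caller already has an OAuth access token, for example from `gcloud auth print-access-token`; handing tokens to users outside GCP is exactly the extra effort mentioned above. `predict_request` is a hypothetical helper.

```python
import json
import urllib.request


def predict_request(project, model, instances, access_token, version=None):
    """Build a POST request for ML Engine online prediction."""
    name = "projects/{}/models/{}".format(project, model)
    if version:
        # Without a version, the model's default version is used.
        name += "/versions/{}".format(version)
    url = "https://ml.googleapis.com/v1/{}:predict".format(name)
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": "Bearer {}".format(access_token),
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it is one more line, `urllib.request.urlopen(request)`, and the response body is JSON with the predictions.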

#### FAQs
1. What version of Python/TensorFlow is used once the model is deployed?     
    1a. Notes: https://cloud.google.com/ml-engine/docs/tensorflow/versioning    
    1b. Documentation: https://cloud.google.com/ml-engine/docs/tensorflow/runtime-version-list
2. Is it possible to train/deploy on Cloud ML using anything besides TensorFlow?     
    2a. Yes, later versions (Version > 1.0) of the Cloud ML runtime support TensorFlow, scikit-learn, and XGBoost (https://cloud.google.com/ml-engine/docs/scikit/quickstart)    
3. Basic steps for deploying a model    
    3a. Train a model somewhere and do all that magic stuff to make it good    
    3b. Save the model to a file (file type depends on the framework used)    
    3c. Use the `gcloud` command line tool to deploy the model from the file and version it    
4. How does scoring a model work if input data needs cleaning?    
    4a. Just clean the data before it's scored, bro    
    4b. If the model is deployed using TensorFlow, a `serving_input_fn` can be added to the model object to handle some manipulation of scoring inputs (https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/cloudmle/taxifare/trainer/model.py#L92)
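
Steps 3c and 4a can be sketched in a few lines of Python. The first helper assembles the two `gcloud ml-engine` calls for creating a model and a version; the second is a made-up example of cleaning a record before scoring. Both names are hypothetical, the `--framework` value is passed straight through to gcloud, and the cleaning rules are assumptions rather than anything your model requires.

```python
def deploy_commands(model, version, origin, region, runtime_version,
                    framework="tensorflow"):
    """Return the gcloud calls that register a model and deploy one version."""
    create_model = [
        "gcloud", "ml-engine", "models", "create", model,
        "--regions", region,
    ]
    create_version = [
        "gcloud", "ml-engine", "versions", "create", version,
        "--model", model,
        "--origin", origin,  # gs:// directory holding the saved model file
        "--runtime-version", runtime_version,
        "--framework", framework,
    ]
    return [create_model, create_version]


def clean_instance(raw):
    """Normalise one record: lower-case keys, trim text, parse numbers."""
    cleaned = {}
    for key, value in raw.items():
        if isinstance(value, str):
            value = value.strip()
            if value == "":
                value = None  # treat blanks as missing
            else:
                try:
                    value = float(value)
                except ValueError:
                    pass  # leave non-numeric strings as text
        cleaned[key.strip().lower()] = value
    return cleaned
```

Each command list can be run with `subprocess.run(cmd, check=True)`; cleaned records go into the `instances` list of the prediction request.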

## ML APIs

#### what is it?
1. Pre-trained models created by Google for common tasks (Vision, Text, Audio), exposed as APIs
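
As one example of what using a pre-trained model looks like, the sketch below builds the JSON body for a Vision API label-detection call. It only constructs the payload; authentication (an API key or OAuth token) and the HTTP call itself are left out, and `label_detection_body` is a hypothetical helper name.

```python
import base64

# REST endpoint the body below would be POSTed to.
VISION_URL = "https://vision.googleapis.com/v1/images:annotate"


def label_detection_body(image_bytes, max_results=5):
    """Build an images:annotate request asking for labels on one image."""
    return {
        "requests": [
            {
                # Image bytes are sent inline as base64 text.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "LABEL_DETECTION", "maxResults": max_results}],
            }
        ]
    }
```

Read an image with `open(path, "rb").read()`, pass the bytes in, and POST the result as JSON to `VISION_URL`; no training required.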

#### why is it?
1. Because the future is now
2. Because Google has more data than you
3. Because you were never that good at data science anyway
4. Because f*** it man, I'm tired of training
5. None of the above, I will provide my own transportation to and from the gala

#### why not use it? 
1. If you have sensitive data, it's not clear that sending it to these services would be kosher (check your data-handling requirements first)

