# Using custom containers with AI Platform Training

**Learning Objectives:**
1. Learn how to create a training and a validation split with BigQuery
1. Learn how to wrap a machine learning model into a Docker container and train it on CAIP
1. Learn how to use the hyperparameter tuning engine on GCP to find the best hyperparameters
1. Learn how to deploy a trained machine learning model on GCP as a REST API and query it

In this lab, you develop a training application, package it as a Docker image, and run it on **AI Platform Training**. The application trains a multi-class classification model that predicts the type of forest cover from cartographic data. The [dataset](../../../datasets/covertype/README.md) used in the lab is based on the **Covertype Data Set** from the UCI Machine Learning Repository.

The training code uses `scikit-learn` for data pre-processing and modeling. The code has been instrumented using the `hypertune` package so it can be used with **AI Platform** hyperparameter tuning.


In [1]:
import json
import os
import numpy as np
import pandas as pd
import pickle
import uuid
import time
import tempfile

from googleapiclient import discovery
from googleapiclient import errors

from google.cloud import bigquery
from jinja2 import Template
from kfp.components import func_to_container_op
from typing import NamedTuple

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

## Configure environment settings

Set location paths, connection strings, and other environment settings. Make sure to update `REGION` and `ARTIFACT_STORE` with the settings reflecting your lab environment.

- `REGION` - the compute region for AI Platform Training and Prediction
- `ARTIFACT_STORE` - the GCS bucket created during installation of AI Platform Pipelines. The bucket name starts with the `hostedkfp-default-` prefix.

In [8]:
!gsutil ls

gs://artifacts.sk-kfp.appspot.com/
gs://automl-ctype/
gs://sk-kfp-kubeflowpipelines-default/
gs://sk-kfp-kubeflowpipelines-eu/
gs://sk-kfp_cloudbuild/


In [9]:
REGION = 'europe-west1'
#ARTIFACT_STORE = 'gs://sk-kfp-kubeflowpipelines-default'
ARTIFACT_STORE = 'gs://automl-ctype'  # no trailing slash: the paths below add their own '/'

PROJECT_ID = !(gcloud config get-value core/project)
PROJECT_ID = PROJECT_ID[0]
DATA_ROOT='{}/data'.format(ARTIFACT_STORE)
JOB_DIR_ROOT='{}/jobs'.format(ARTIFACT_STORE)
ALL_FILE_PATH='{}/{}/{}'.format(DATA_ROOT, 'all', 'dataset.csv')
TRAINING_FILE_PATH='{}/{}/{}'.format(DATA_ROOT, 'training', 'dataset.csv')
VALIDATION_FILE_PATH='{}/{}/{}'.format(DATA_ROOT, 'validation', 'dataset.csv')
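
The path constants above are built with plain string formatting, so a trailing slash on `ARTIFACT_STORE` would produce a double slash (for example `gs://bucket//data`), which Cloud Storage treats as a different object name. As a minimal sketch, a small helper could normalize the separators. `gcs_join` is a hypothetical name introduced here for illustration; it is not part of the lab code or of any Google library.

```python
def gcs_join(root: str, *parts: str) -> str:
    """Join a gs:// root and path segments with exactly one '/' between them."""
    # Drop trailing slashes from the root and surrounding slashes from each
    # segment, skipping segments that are empty after stripping.
    cleaned = [root.rstrip('/')] + [p.strip('/') for p in parts if p.strip('/')]
    return '/'.join(cleaned)


# Both spellings of the bucket root yield the same object path.
print(gcs_join('gs://automl-ctype/', 'data', 'all', 'dataset.csv'))
print(gcs_join('gs://automl-ctype', 'data', 'all', 'dataset.csv'))
```

With such a helper, `ALL_FILE_PATH` could be written as `gcs_join(ARTIFACT_STORE, 'data', 'all', 'dataset.csv')` regardless of how the bucket root is spelled.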

## Explore the Covertype dataset 

In [4]:
%%bigquery
SELECT *
FROM `covertype_dataset.covertype`

| | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | Wilderness_Area | Soil_Type | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2085 | 256 | 18 | 150 | 27 | 738 | 176 | 248 | 208 | 914 | Cache | C2702 | 5 |
| 1 | 2125 | 256 | 20 | 30 | 12 | 871 | 169 | 248 | 215 | 300 | Cache | C2702 | 2 |
| 2 | 2146 | 256 | 34 | 150 | 62 | 1253 | 122 | 237 | 239 | 511 | Cache | C2702 | 2 |
| 3 | 2186 | 256 | 38 | 210 | 102 | 1294 | 109 | 232 | 244 | 552 | Cache | C2702 | 2 |
| 4 | 2831 | 256 | 25 | 277 | 183 | 1706 | 153 | 246 | 225 | 1485 | Commanche | C2705 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99995 | 3136 | 254 | 12 | 319 | 60 | 5734 | 193 | 248 | 193 | 2467 | Rawah | C7746 | 1 |
| 99996 | 3242 | 254 | 12 | 636 | 148 | 3551 | 193 | 248 | 193 | 2010 | Commanche | C7757 | 0 |
| 99997 | 2071 | 255 | 12 | 234 | 63 | 342 | 192 | 247 | 193 | 247 | Cache | C2706 | 2 |
| 99998 | 3248 | 255 | 12 | 730 | 113 | 725 | 192 | 247 | 193 | 2724 | Commanche | C7756 | 1 |


## Copy table

Use BigQuery to import the table and save it to Cloud Storage (GCS).


In [13]:
%%bash

export PROJECT_ID=$(gcloud config get-value core/project)

DATASET_LOCATION=US
DATASET_ID=covertype_dataset_us
TABLE_ID=covertype
DATA_SOURCE=gs://workshop-datasets/covertype/small/dataset.csv
SCHEMA=Elevation:INTEGER,\
Aspect:INTEGER,\
Slope:INTEGER,\
Horizontal_Distance_To_Hydrology:INTEGER,\
Vertical_Distance_To_Hydrology:INTEGER,\
Horizontal_Distance_To_Roadways:INTEGER,\
Hillshade_9am:INTEGER,\
Hillshade_Noon:INTEGER,\
Hillshade_3pm:INTEGER,\
Horizontal_Distance_To_Fire_Points:INTEGER,\
Wilderness_Area:STRING,\
Soil_Type:STRING,\
Cover_Type:INTEGER

bq --location=$DATASET_LOCATION --project_id=$PROJECT_ID mk --dataset $DATASET_ID

bq --project_id=$PROJECT_ID --dataset_id=$DATASET_ID load \
--source_format=CSV \
--skip_leading_rows=1 \
--replace \
$TABLE_ID \
$DATA_SOURCE \
$SCHEMA

Dataset 'sk-kfp:covertype_dataset_us' successfully created.


Waiting on bqjob_r435c92e72e755fdb_00000172cda68c91_1 ... (3s) Current status: DONE   
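
The inline schema passed to `bq load` in the cell above is a comma-separated list of `name:TYPE` pairs. As a minimal sketch, the same string could be generated in Python from a list of columns, which keeps the column list in one place if it is needed elsewhere in the notebook. `build_bq_schema` is a hypothetical helper introduced for illustration; it is not part of the `bq` tool or the lab code.

```python
# Column names and types copied from the SCHEMA variable in the bash cell.
COLUMNS = [
    ('Elevation', 'INTEGER'),
    ('Aspect', 'INTEGER'),
    ('Slope', 'INTEGER'),
    ('Horizontal_Distance_To_Hydrology', 'INTEGER'),
    ('Vertical_Distance_To_Hydrology', 'INTEGER'),
    ('Horizontal_Distance_To_Roadways', 'INTEGER'),
    ('Hillshade_9am', 'INTEGER'),
    ('Hillshade_Noon', 'INTEGER'),
    ('Hillshade_3pm', 'INTEGER'),
    ('Horizontal_Distance_To_Fire_Points', 'INTEGER'),
    ('Wilderness_Area', 'STRING'),
    ('Soil_Type', 'STRING'),
    ('Cover_Type', 'INTEGER'),
]


def build_bq_schema(columns):
    """Render (name, type) pairs as an inline schema string: 'a:T1,b:T2,...'."""
    return ','.join(f'{name}:{col_type}' for name, col_type in columns)


SCHEMA = build_bq_schema(COLUMNS)
print(SCHEMA)
```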


In [10]:
!bq query \
-n 0 \
--destination_table covertype_dataset.covertype_table \
--replace \
--use_legacy_sql=false \
'SELECT * \
FROM `covertype_dataset.covertype`' 

Waiting on bqjob_r53b04737c6ca90d6_00000172cda02a00_1 ... (1s) Current status: DONE   


In [11]:
!bq extract \
--destination_format CSV \
covertype_dataset.covertype_table \
$ALL_FILE_PATH

Waiting on bqjob_r3b02f449059f2c9b_00000172cda04774_1 ... (0s) Current status: DONE   
BigQuery error in extract operation: Error processing job 'sk-
kfp:bqjob_r3b02f449059f2c9b_00000172cda04774_1': Cannot read and write in
different locations: source: EU, destination: us-central1
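
The extract fails because the source table and the destination bucket are in different locations: the error reports the source as `EU` and the destination as `us-central1`. The sketch below shows one way the two locations could be looked up before running `bq extract`. It assumes the `google-cloud-bigquery` and `google-cloud-storage` client libraries are available; `locations_match` is a hypothetical helper, and its case-insensitive equality check is a simplification that does not encode BigQuery's full colocation rules.

```python
def locations_match(bq_client, gcs_client, dataset_id, bucket_name):
    """Return (dataset_location, bucket_location, same) for a dataset and a bucket.

    bq_client is expected to behave like google.cloud.bigquery.Client and
    gcs_client like google.cloud.storage.Client.
    """
    dataset_location = bq_client.get_dataset(dataset_id).location
    bucket_location = gcs_client.get_bucket(bucket_name).location
    # Simplified check: treat the locations as compatible only if they are equal.
    same = dataset_location.lower() == bucket_location.lower()
    return dataset_location, bucket_location, same


# Usage with real clients (requires credentials, so it is left commented out):
# from google.cloud import bigquery, storage
# print(locations_match(bigquery.Client(), storage.Client(),
#                       'covertype_dataset', 'automl-ctype'))
```

If the locations differ, the usual options are to export to a bucket in the dataset's location or to work from a copy of the dataset that lives in the bucket's location.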


In [15]:
%%bigquery

# Keep only the highest-scoring predicted class for each evaluated example.
WITH scores AS (
  SELECT
    b.Cover_Type,
    (
      SELECT pct.tables
      FROM UNNEST(b.predicted_Cover_Type) AS pct
      ORDER BY pct.tables.score DESC
      LIMIT 1
    ) AS tables
  FROM `sk-kfp.export_evaluated_examples_ctype_20200619064107_2020_06_19T22_18_02_997Z.evaluated_examples` AS b
),

# Overall accuracy (disabled):
# accuracy AS (
#   SELECT 100 - (COUNTIF(Cover_Type != tables.value) / COUNT(*)) * 100 AS accuracy_percentage
#   FROM scores),

# Per-class example count and accuracy.
grp_accuracy AS (
  SELECT
    Cover_Type,
    COUNT(*) AS count,
    ROUND(100 - (100 * COUNTIF(Cover_Type != tables.value) / COUNT(*)), 2) AS accuracy_percentage
  FROM scores
  GROUP BY Cover_Type
)

SELECT * FROM grp_accuracy;

# Class distribution in the source table (disabled):
# SELECT Cover_Type, COUNT(*) AS count
# FROM `covertype_dataset_us.covertype` AS a
# GROUP BY Cover_Type;




| | Cover_Type | count | accuracy_percentage |
|---|---|---|---|
| 0 | 1 | 4905 | 95.56 |
| 1 | 2 | 625 | 94.56 |
| 2 | 3 | 50 | 78.0 |
| 3 | 5 | 299 | 83.61 |
| 4 | 0 | 3535 | 92.14 |
| 5 | 6 | 352 | 90.91 |
| 6 | 4 | 155 | 75.48 |
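
The query above reports, for each `Cover_Type`, the number of evaluated examples and the percentage whose top-scoring prediction matches the label. As a minimal sketch, the same per-class calculation can be written in plain Python for label and prediction lists that are already in memory. `per_class_accuracy` is a hypothetical helper, and the sample values are made up for illustration rather than taken from the lab data.

```python
from collections import Counter


def per_class_accuracy(labels, predictions):
    """Return {label: (count, accuracy_percentage)} with 2-decimal rounding."""
    totals = Counter(labels)
    # Count only the examples whose prediction equals the true label.
    correct = Counter(l for l, p in zip(labels, predictions) if l == p)
    return {
        label: (n, round(100 * correct[label] / n, 2))
        for label, n in totals.items()
    }


# Tiny made-up example, not data from the lab.
labels = [0, 0, 1, 1, 1, 2]
predictions = [0, 1, 1, 1, 0, 2]
print(per_class_accuracy(labels, predictions))
```

Reporting accuracy per class rather than as a single number makes weak spots visible: in the table above, the classes with the fewest examples (3 and 4) also have the lowest accuracy.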
