# Train and deploy TF model

> Train a model with the Covertype dataset

**TODO:**
* update [this example](https://github.com/GoogleCloudPlatform/mlops-on-gcp/blob/master/skew_detection/01_covertype_training_serving.ipynb) to use the latest Vertex AI Model Monitoring service

## Load env config
* use the prefix from 00-env-setup

In [1]:
# naming convention for all cloud resources
VERSION        = "v1"              # TODO
PREFIX         = f'ra-vmm-{VERSION}'   # TODO

print(f"PREFIX = {PREFIX}")

PREFIX = ra-vmm-v1


**run the next cell to populate env vars**

In [2]:
# GCP project (read from the active gcloud config)
GCP_PROJECTS             = !gcloud config get-value project
PROJECT_ID               = GCP_PROJECTS[0]

# GCS bucket and paths
BUCKET_NAME              = f'{PREFIX}-{PROJECT_ID}-bucket'
BUCKET_URI               = f'gs://{BUCKET_NAME}'

config = !gsutil cat {BUCKET_URI}/config/notebook_env.py
print(config.n)
exec(config.n)


PROJECT_ID               = "hybrid-vertex"
PROJECT_NUM              = "934903580331"
LOCATION                 = "us-central1"

REGION                   = "us-central1"
BQ_LOCATION              = "US"

VERTEX_SA                = "934903580331-compute@developer.gserviceaccount.com"

PREFIX                   = "ra-vmm-v1"
VERSION                  = "v1"

BUCKET_NAME              = "ra-vmm-v1-hybrid-vertex-bucket"
BUCKET_URI               = "gs://ra-vmm-v1-hybrid-vertex-bucket"
DATA_GCS_PREFIX          = "data"
DATA_PATH                = "gs://ra-vmm-v1-hybrid-vertex-bucket/data"

REPOSITORY               = "mm-ctp-ra-vmm-v1"
TRAIN_IMAGE_NAME         = "train-ctp-v1"
REMOTE_IMAGE_NAME        = "us-central1-docker.pkg.dev/hybrid-vertex/mm-ctp-ra-vmm-v1/train-ctp-v1"



## Imports

In [3]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

In [4]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
import pandas as pd
from google.cloud import bigquery

print(f"TF version: {tf.__version__}")
print(f"bigquery version: {bigquery.__version__}")

TF version: 2.13.0
bigquery version: 3.11.4


## Define constants

You can change the default values for the following constants:

In [6]:
LOCAL_WORKSPACE = './workspace'
LOCAL_DATA_DIR = os.path.join(LOCAL_WORKSPACE, 'data')

MODEL_NAME = 'covertype_classifier'
VERSION_NAME = VERSION 
TRAINING_DIR = os.path.join(LOCAL_WORKSPACE, 'training')
MODEL_DIR = os.path.join(TRAINING_DIR, 'exported_model')

print(f"LOCAL_DATA_DIR = {LOCAL_DATA_DIR}")
print(f"TRAINING_DIR   = {TRAINING_DIR}")
print(f"MODEL_DIR      = {MODEL_DIR}")

LOCAL_DATA_DIR = ./workspace/data
TRAINING_DIR   = ./workspace/training
MODEL_DIR      = ./workspace/training/exported_model


## Create a local workspace

In [7]:
if tf.io.gfile.exists(LOCAL_WORKSPACE):
    print("Removing previous workspace artifacts...")
    tf.io.gfile.rmtree(LOCAL_WORKSPACE)

print("Creating a new workspace...")
tf.io.gfile.makedirs(LOCAL_WORKSPACE)
tf.io.gfile.makedirs(LOCAL_DATA_DIR)

print("Workspace created.")

Creating a new workspace...
Workspace created.


# 1. Preparing the dataset and defining the metadata
The data in this tutorial is based on the [covertype](https://archive.ics.uci.edu/ml/datasets/covertype) dataset from the UCI Machine Learning Repository. The notebook uses a version of the dataset that has been preprocessed, split, and uploaded to a public Cloud Storage bucket at the following location:

`gs://workshop-datasets/covertype`

For more information, see [Cover Type Dataset](https://github.com/GoogleCloudPlatform/mlops-on-gcp/tree/master/datasets/covertype).

The task in this tutorial is to predict the forest cover type from cartographic variables only. The aim is to build and deploy a minimal model that showcases the request-response logging capabilities of AI Platform Prediction. These logs can then be analyzed to detect data skew.
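
To make the idea of skew detection concrete, here is a minimal sketch that compares the distribution of one categorical feature in training data against logged serving requests. It uses the L-infinity distance (the largest per-category difference in relative frequency), which is one common choice for categorical features. The sample values and the `SKEW_THRESHOLD` are hypothetical and are not taken from this notebook.

```python
from collections import Counter

def category_distribution(values, vocabulary):
    """Return the relative frequency of each vocabulary entry in `values`."""
    counts = Counter(values)
    total = len(values)
    return {v: (counts[v] / total if total else 0.0) for v in vocabulary}

def l_infinity_distance(baseline, serving, vocabulary):
    """Largest absolute difference in relative frequency over the vocabulary."""
    p = category_distribution(baseline, vocabulary)
    q = category_distribution(serving, vocabulary)
    return max(abs(p[v] - q[v]) for v in vocabulary)

# Hypothetical samples of the Wilderness_Area feature (illustrative values only).
vocabulary = ['Cache', 'Commanche', 'Neota', 'Rawah']
training_values = ['Rawah'] * 5 + ['Commanche'] * 4 + ['Cache'] * 1
serving_values = ['Rawah'] * 2 + ['Commanche'] * 4 + ['Neota'] * 4

skew = l_infinity_distance(training_values, serving_values, vocabulary)
SKEW_THRESHOLD = 0.3  # assumed alerting threshold, not taken from the notebook
print(f"L-infinity distance = {skew:.2f}, skew detected = {skew > SKEW_THRESHOLD}")
```

In this made-up sample, `Neota` never appears in training but makes up 40% of serving traffic, so it is the category that drives the distance.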

## Download the data

In [8]:
LOCAL_TRAIN_DATA = os.path.join(LOCAL_DATA_DIR, 'train.csv') 
LOCAL_EVAL_DATA = os.path.join(LOCAL_DATA_DIR, 'eval.csv')

print(f"LOCAL_TRAIN_DATA = {LOCAL_TRAIN_DATA}")
print(f"LOCAL_EVAL_DATA  = {LOCAL_EVAL_DATA}")

LOCAL_TRAIN_DATA = ./workspace/data/train.csv
LOCAL_EVAL_DATA  = ./workspace/data/eval.csv


In [9]:
!gsutil cp gs://workshop-datasets/covertype/data_validation/training/dataset.csv {LOCAL_TRAIN_DATA}
!gsutil cp gs://workshop-datasets/covertype/data_validation/evaluation/dataset.csv {LOCAL_EVAL_DATA}
!wc -l {LOCAL_TRAIN_DATA}

AccessDeniedException: 403 934903580331-compute@developer.gserviceaccount.com does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist).
AccessDeniedException: 403 934903580331-compute@developer.gserviceaccount.com does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist).
wc: ./workspace/data/train.csv: No such file or directory
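
The output above shows both copies failing with a 403, so neither CSV exists locally and later cells that read them will fail. As a minimal sketch, a guard like the following could be run before continuing. The helper `missing_files` is a hypothetical name, and the paths simply repeat the values printed for `LOCAL_TRAIN_DATA` and `LOCAL_EVAL_DATA`.

```python
import os

def missing_files(paths):
    """Return the subset of `paths` that do not exist on the local disk."""
    return [path for path in paths if not os.path.exists(path)]

# Paths match LOCAL_TRAIN_DATA and LOCAL_EVAL_DATA as printed earlier in the notebook.
expected = ['./workspace/data/train.csv', './workspace/data/eval.csv']
missing = missing_files(expected)
if missing:
    print(f"Missing local files, check the gsutil output: {missing}")
else:
    print("All dataset files are present.")
```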


View a sample of the downloaded data:

In [None]:
pd.read_csv(LOCAL_TRAIN_DATA).head().T

## Define the metadata
The following code shows the metadata of the dataset, which is used to create the data input function, the feature columns, and the serving function.

In [None]:
HEADER = ['Elevation', 'Aspect', 'Slope','Horizontal_Distance_To_Hydrology',
          'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
          'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
          'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area', 'Soil_Type',
          'Cover_Type']

TARGET_FEATURE_NAME = 'Cover_Type'

TARGET_FEATURE_LABELS = ['0', '1', '2', '3', '4', '5', '6']

NUMERIC_FEATURE_NAMES = ['Aspect', 'Elevation', 'Hillshade_3pm', 
                         'Hillshade_9am', 'Hillshade_Noon', 
                         'Horizontal_Distance_To_Fire_Points',
                         'Horizontal_Distance_To_Hydrology',
                         'Horizontal_Distance_To_Roadways','Slope',
                         'Vertical_Distance_To_Hydrology']

CATEGORICAL_FEATURES_WITH_VOCABULARY = {
    'Soil_Type': ['2702', '2703', '2704', '2705', '2706', '2717', '3501', '3502', 
                  '4201', '4703', '4704', '4744', '4758', '5101', '6101', '6102', 
                  '6731', '7101', '7102', '7103', '7201', '7202', '7700', '7701', 
                  '7702', '7709', '7710', '7745', '7746', '7755', '7756', '7757', 
                  '7790', '8703', '8707', '8708', '8771', '8772', '8776'], 
    'Wilderness_Area': ['Cache', 'Commanche', 'Neota', 'Rawah']
}

FEATURE_NAMES = list(CATEGORICAL_FEATURES_WITH_VOCABULARY.keys()) + NUMERIC_FEATURE_NAMES

HEADER_DEFAULTS = [[0] if feature_name in NUMERIC_FEATURE_NAMES + [TARGET_FEATURE_NAME] else ['NA'] 
                   for feature_name in HEADER]

NUM_CLASSES = len(TARGET_FEATURE_LABELS)

print(f"FEATURE_NAMES   = {FEATURE_NAMES}")
print(f"HEADER_DEFAULTS = {HEADER_DEFAULTS}")
print(f"NUM_CLASSES     = {NUM_CLASSES}")
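
The role of `HEADER_DEFAULTS` is easier to see with a small example. In a CSV reader that accepts per-column defaults (TensorFlow's `tf.io.decode_csv` takes them as `record_defaults`), each default both fixes the column type and supplies the value for empty fields. The sketch below imitates that behavior with the standard library only. It assumes a shortened header and made-up rows, and it is not the input function used by this notebook.

```python
import csv
import io

# Shortened, hypothetical header for illustration; the notebook's HEADER has 13 columns.
header = ['Elevation', 'Slope', 'Wilderness_Area', 'Cover_Type']
numeric = ['Elevation', 'Slope']
target = 'Cover_Type'

# Same rule as HEADER_DEFAULTS: [0] for numeric columns and the target, ['NA'] otherwise.
defaults = [[0] if name in numeric + [target] else ['NA'] for name in header]

def parse_row(fields):
    """Cast each field to the type of its default; use the default when the field is empty."""
    row = {}
    for name, default, field in zip(header, defaults, fields):
        if field == '':
            row[name] = default[0]
        else:
            row[name] = type(default[0])(field)
    return row

# Made-up sample rows; the second row has an empty Wilderness_Area field.
sample = io.StringIO("2596,3,Rawah,4\n2804,9,,1\n")
rows = [parse_row(fields) for fields in csv.reader(sample)]

# Split each parsed row into a feature dict and its label.
features, labels = [], []
for row in rows:
    labels.append(row.pop(target))
    features.append(row)

print(features)
print(labels)
```

The integer default `0` makes the numeric columns and the target parse as integers, while the `'NA'` default keeps categorical columns as strings and fills in missing values.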

In [None]:
config = f'''
USER_AGE_LOOKUP       = {USER_AGE_LOOKUP}
USER_AGE_DIM          = {USER_AGE_DIM}

USER_OCC_LOOKUP       = {USER_OCC_LOOKUP}
USER_OCC_DIM          = {USER_OCC_DIM}

MOVIE_GEN_LOOKUP      = {MOVIE_GEN_LOOKUP}
MOVIE_GEN_DIM         = {MOVIE_GEN_DIM}

MOVIELENS_NUM_MOVIES  = {MOVIELENS_NUM_MOVIES}
MOVIELENS_NUM_USERS   = {MOVIELENS_NUM_USERS}
'''
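# TODO - cleanup: this cell appears to be carried over from a MovieLens example;
# none of the names it references are defined in the cells above, so it raises NameError as written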
with open(f'{REPO_DOCKER_PATH_PREFIX}/{RL_SUB_DIR}/data_config.py', 'w') as f:
    f.write(config)