# Prepare Scripts for Model Deployment on the Google Cloud
**Author:** Robert Smith  
**Date:** 06-21-2020

This final notebook covers the last (and most challenging) stage of the CRISP-DM lifecycle: model deployment. To deploy a model using the AI Platform, we'll create a few files:  

> * **preprocess.py:** A Python module that contains a data pre-processing class to transform the raw data into a form suitable for a scikit-learn pipeline.  
> * **log_model_v1.pkl:** A pickled scikit-learn machine learning pipeline that models data prepared by the `preprocess` module created above.  
> * **preprocessor.pkl:** A pickled instance of the `HeartDiseaseTransformer` preprocessor class from `preprocess.py`.  
> * **predictor.py:** A Python module that uses the preprocessor and scikit-learn machine learning pipeline as inputs to transform and predict on new observations.  
> * **heart_disease_classification-0.1.tar.gz:** A source distribution containing the `preprocess.py` and `predictor.py` scripts.  

After all of these files are created, we'll run shell commands from within Jupyter by prefixing them with a bang (`!`) to push the source distribution and the pickle files to the Google Cloud Platform.

As a final step, we'll use the command line and Google's Python API client to pass the deployed model a couple of patient observations for which we'd like to estimate the probability of heart disease.

**NOTE:** You'll need to have your GCP account, billing, service account, APIs, and project set up, and the Google Cloud SDK installed and initialized, prior to deploying your model.

## Create Preprocessor Module
First, we'll create a data pre-processor class that handles a couple of pre-processing steps outside of scikit-learn: translating a few features from numeric codes into categorical labels (which we'll subsequently one-hot encode in a pipeline) and transforming the numeric target into a binary indicator.

In [1]:
%%writefile preprocess.py 

import numpy as np
import pandas as pd

class HeartDiseaseTransformer():
    def __init__(self):
        self._cp_dict = {1: "typical angina",
                         2: "atypical angina",
                         3: "non-anginal pain", 
                         4: "asymptomatic"}
        
        self._restecg_dict = {0: "normal", 
                              1: "wave abnormality", 
                              2: "ventricular hypertrophy"}
        
        self._thal_dict = {3 : "normal",
                           6 : "fixed defect",
                           7 : "reversable defect"}
    
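    def preprocess_X(self, data):
        # Work on a copy so the caller's DataFrame is left untouched
        data = data.copy()
        # Assign the result back; calling replace(..., inplace=True) on a
        # selected column is not guaranteed to modify the parent DataFrame
        data["cp"] = data["cp"].replace(self._cp_dict)
        data["restecg"] = data["restecg"].replace(self._restecg_dict)
        data["thal"] = data["thal"].replace(self._thal_dict)

        return data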
    
    def preprocess_y(self, data):
        data = (data > 0).astype(int)
        
        return data

Writing preprocess.py
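
To make the two transformations concrete, here is a minimal plain-Python sketch of the same recoding applied to a single record, without pandas. The `recode_record` and `binarize_target` helpers are hypothetical names used only for illustration; the mappings are copied from the class above.

```python
# Mappings copied from HeartDiseaseTransformer
CP_LABELS = {1: "typical angina", 2: "atypical angina",
             3: "non-anginal pain", 4: "asymptomatic"}
RESTECG_LABELS = {0: "normal", 1: "wave abnormality", 2: "ventricular hypertrophy"}
THAL_LABELS = {3: "normal", 6: "fixed defect", 7: "reversable defect"}


def recode_record(record):
    """Return a copy of `record` with numeric codes replaced by labels (hypothetical helper)."""
    recoded = dict(record)
    for column, labels in (("cp", CP_LABELS),
                           ("restecg", RESTECG_LABELS),
                           ("thal", THAL_LABELS)):
        # Codes without a mapping are left unchanged
        recoded[column] = labels.get(recoded[column], recoded[column])
    return recoded


def binarize_target(value):
    """Collapse the numeric target into 0 (no disease) or 1 (disease present)."""
    return int(value > 0)


example = {"cp": 1.0, "restecg": 2.0, "thal": 6.0}
print(recode_record(example))
print([binarize_target(t) for t in [0, 1, 3]])
```

The float code `1.0` finds the integer key `1` because equal numbers hash equally in Python, which is why the observations passed later as floats still map to the right labels.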


## Train & Pickle Model & Preprocessor
Using the pre-processing module created above, this section re-runs the modeling approach from the previous notebook and pickles both the resulting model and the instantiated pre-processing class.

In [2]:
# Regular EDA and plotting libraries
import numpy as np
import pandas as pd

# Data Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PowerTransformer

# Models from Scikit-Learn
from sklearn.linear_model import LogisticRegression

# Other functions needed from Scikit-Learn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

# Import class just written
from preprocess import HeartDiseaseTransformer

# Export final model
import pickle

# Import raw data
col_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
             "thalach", "exang", "oldpeak", "slope","ca", "thal", "target"]

df = pd.read_csv("../data/processed.cleveland.data", names = col_names, na_values = "?")

# Split data into X and y
X = df.drop("target", axis = 1)
y = df["target"]

# Use pre-processor
hdt = HeartDiseaseTransformer()
X = hdt.preprocess_X(X)
y = hdt.preprocess_y(y)

# Split into train & test
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.2,
                                                    random_state = 123)

# Create Pipeline & Train Model
# Let's split up our features into three different groups that will undergo separate transformations:
cat_vars = ["cp","restecg","thal"]
num_vars = ["age", "trestbps", "chol", "thalach", "oldpeak", "slope","ca"]
bin_vars = ["sex", "fbs", "exang"]


cat_transformer = Pipeline(steps = [("impute", SimpleImputer(strategy = "most_frequent")),
                                    ("ohe", OneHotEncoder())])
                                   

num_transformer = Pipeline(steps = [("impute", SimpleImputer(strategy = "most_frequent")),
                                    ("scaler", PowerTransformer())]) # Here we are using a power-transform instead.

bin_transformer = Pipeline(steps = [("impute", SimpleImputer(strategy = "most_frequent"))])

preprocessor_pt = ColumnTransformer(transformers = [('cat', cat_transformer, cat_vars),
                                                  ('num', num_transformer, num_vars),
                                                  ('bin', bin_transformer, bin_vars)],
                                  remainder = "drop")

log_model_pipeline = Pipeline(steps = [
    ("preprocessing", preprocessor_pt),
    ("model", LogisticRegression())])

# Create a hyperparameter grid for Logistic Regression
np.random.seed(42)
param_grid = {"model__penalty": ["l2", "l1"],
                "model__C": np.logspace(-4, 4, 30), 
                "model__solver" : ["liblinear"]}


# Fit grid hyperparameter search model
# Note: the `iid` argument was removed in scikit-learn 0.24; drop it when running a newer version
gs_log_model = GridSearchCV(log_model_pipeline, param_grid, cv = 10, iid = False)
gs_log_model.fit(X_train, y_train)

# Save Model
with open('../model/log_model_v1.pkl', 'wb') as model_file:
    pickle.dump(gs_log_model, model_file)
    
# Save pre-processor
with open ('../model/preprocessor.pkl', 'wb') as preprocessor_file:
    pickle.dump(hdt, preprocessor_file)
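
The predictor module written in the next section loads both pickle files from a single directory, so a quick check that they exist can save a failed deployment. The sketch below is one way to do that; `missing_artifacts` is a hypothetical helper, and the file names are the ones used in the cell above.

```python
import os
import tempfile

EXPECTED_ARTIFACTS = ("log_model_v1.pkl", "preprocessor.pkl")


def missing_artifacts(model_dir, expected=EXPECTED_ARTIFACTS):
    """Return the expected file names that are not present in `model_dir`."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(model_dir, name))]


# Demonstration on a throwaway directory rather than the real ../model/ folder
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "preprocessor.pkl"), "wb").close()
    demo_missing = missing_artifacts(demo_dir)

print(demo_missing)
```

In the notebook itself, `missing_artifacts("../model/")` should return an empty list once both files have been written.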


## Write Predictor Module
The AI Platform also expects a predictor module for deploying a custom prediction routine. This module contains a class that instantiates the pre-processor and scikit-learn machine learning pipeline and uses these together to predict the probability of heart disease on new, unseen observations.

In [3]:
%%writefile predictor.py 

import os
import pickle
import numpy as np
import pandas as pd

class HeartDiseasePredictor(object):
    def __init__(self, model, preprocessor):
        self._model = model
        self._preprocessor = preprocessor

    def predict(self, instances, **kwargs):
        instances = pd.DataFrame(instances, 
             columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
             "thalach", "exang", "oldpeak", "slope","ca", "thal"])
        preprocessed_inputs = self._preprocessor.preprocess_X(instances)
        outputs = self._model.predict_proba(preprocessed_inputs)
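        # Format with fixed precision so float artifacts (e.g. 28.999999999999996) never reach the output
        return ["{:.1f}%".format(np.round(p[1], 2) * 100) for p in outputs]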

    @classmethod
    def from_path(cls, model_dir):
        model_path = os.path.join(model_dir, 'log_model_v1.pkl')
        with open(model_path, 'rb') as f:
            model = pickle.load(f)

        preprocessor_path = os.path.join(model_dir, 'preprocessor.pkl')
        with open(preprocessor_path, 'rb') as f:
            preprocessor = pickle.load(f)
        
        return cls(model, preprocessor)

Writing predictor.py


Test out the new module by passing it a couple of observations:

In [4]:
from predictor import HeartDiseasePredictor
instances = [
    {
      "age": 63.0,
      "sex": 1.0,
      "cp": 1.0,
      "trestbps": 145.0,
      "chol": 233.0,
      "fbs": 1.0,
      "restecg": 2.0,
      "thalach": 150.0,
      "exang": 0.0,
      "oldpeak": 2.3,
      "slope": 3.0,
      "ca": 0.0,
      "thal": 6.0
    },
        {
      "age": 63.0,
      "sex": 0.0,
      "cp": 1.0,
      "trestbps": 145.0,
      "chol": 233.0,
      "fbs": 1.0,
      "restecg": 2.0,
      "thalach": 150.0,
      "exang": 0.0,
      "oldpeak": 2.3,
      "slope": 3.0,
      "ca": 0.0,
      "thal": 6.0
    }
  ]
hdp = HeartDiseasePredictor.from_path("../model/")
hdp.predict(instances)

['31.0%', '15.0%']

Great! The module returns the heart disease probability for each observation as a formatted string. The second observation has the same clinical measurements as the first observation except for sex. According to the model, a woman has a 16 percentage point lower risk of heart disease (15% versus 31%) than a man with the same clinical measurements.
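
The gap between the two predictions can be read either as an absolute difference (percentage points) or as a relative one, and the two are easy to confuse. The short sketch below works through both from the strings returned above; `percent_to_float` is a hypothetical helper introduced only for this illustration.

```python
def percent_to_float(text):
    """Convert a string such as '31.0%' to the float 31.0."""
    return float(text.rstrip("%"))


predictions = ["31.0%", "15.0%"]  # output of the cell above
male, female = (percent_to_float(p) for p in predictions)

absolute_gap = male - female               # in percentage points
relative_gap = 100 * absolute_gap / male   # in percent of the male estimate

print(f"{absolute_gap:.1f} percentage points, {relative_gap:.1f}% relative")
```

The absolute gap is 16 percentage points, while the relative reduction is roughly 52% of the male estimate.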

The last thing that needs to be done before deployment is to package the `preprocess.py` and `predictor.py` into a source distribution.

In [5]:
%%writefile setup.py 

from setuptools import setup

setup(
    name='heart_disease_classification',
    version='0.1',
    scripts=['predictor.py', 'preprocess.py'])

Writing setup.py


Create the source distribution:

In [6]:
! python setup.py sdist --formats=gztar

running sdist
running egg_info
creating heart_disease_classification.egg-info
writing heart_disease_classification.egg-info/PKG-INFO
writing dependency_links to heart_disease_classification.egg-info/dependency_links.txt
writing top-level names to heart_disease_classification.egg-info/top_level.txt
writing manifest file 'heart_disease_classification.egg-info/SOURCES.txt'
reading manifest file 'heart_disease_classification.egg-info/SOURCES.txt'
writing manifest file 'heart_disease_classification.egg-info/SOURCES.txt'

running check


creating heart_disease_classification-0.1
creating heart_disease_classification-0.1/heart_disease_classification.egg-info
copying files to heart_disease_classification-0.1...
copying predictor.py -> heart_disease_classification-0.1
copying preprocess.py -> heart_disease_classification-0.1
copying setup.py -> heart_disease_classification-0.1
copying heart_disease_classification.egg-info/PKG-INFO -> heart_disease_classification-0.1/heart_dis
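
Before uploading, it can be worth confirming that both scripts actually made it into the archive. The following is a minimal sketch using the standard library's `tarfile` module; `sdist_members` is a hypothetical helper, and the demonstration builds a tiny stand-in archive so that it runs without the real `dist/` folder.

```python
import io
import os
import tarfile
import tempfile


def sdist_members(archive_path):
    """Return the base names of all regular files in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(os.path.basename(m.name)
                      for m in archive.getmembers() if m.isfile())


# Demonstration on a tiny stand-in archive; for the real check, point
# sdist_members at ./dist/heart_disease_classification-0.1.tar.gz
with tempfile.TemporaryDirectory() as tmp:
    demo_archive = os.path.join(tmp, "demo-0.1.tar.gz")
    with tarfile.open(demo_archive, "w:gz") as archive:
        for name in ("predictor.py", "preprocess.py"):
            payload = b"# placeholder\n"
            info = tarfile.TarInfo(name="demo-0.1/" + name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    demo_names = sdist_members(demo_archive)

print(demo_names)
```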

## Deploy Custom Prediction Routine to the Cloud
Now we are ready to deploy the model pipeline, including the data pre-processor, to the cloud! We'll first make sure our GCP environment has the right service account credentials. Download the key file (in JSON format) for a GCP service account that has access to your model, and provide its file path:  
**Example:**  

`%env GOOGLE_APPLICATION_CREDENTIALS <path/to/credentials.json>`

In [7]:
%env GOOGLE_APPLICATION_CREDENTIALS /Users/RobertSmith/data-science-projects/gcp-key/heart_disease_classification_key.json

env: GOOGLE_APPLICATION_CREDENTIALS=/Users/RobertSmith/data-science-projects/gcp-key/heart_disease_classification_key.json
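
A mistyped path here tends to surface only later as an authentication error, so a small sanity check can help. The sketch below assumes that a service account key file is a JSON object whose `type` field is `service_account`; `looks_like_service_account_key` is a hypothetical helper, and the demonstration uses a dummy file rather than real credentials.

```python
import json
import os
import tempfile


def looks_like_service_account_key(path):
    """Return True if `path` is a readable JSON file whose "type" is "service_account"."""
    try:
        with open(path) as key_file:
            key = json.load(key_file)
    except (OSError, ValueError):
        # Missing or unreadable file, or content that is not valid JSON
        return False
    return isinstance(key, dict) and key.get("type") == "service_account"


# Demonstration with a dummy file; no real credentials are involved
with tempfile.TemporaryDirectory() as tmp:
    dummy_path = os.path.join(tmp, "dummy_key.json")
    with open(dummy_path, "w") as f:
        json.dump({"type": "service_account", "project_id": "example"}, f)
    demo_ok = looks_like_service_account_key(dummy_path)
    demo_missing = looks_like_service_account_key(os.path.join(tmp, "absent.json"))

print(demo_ok, demo_missing)
```

In the notebook, the same function could be called with `os.environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")`.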


Next, we'll want to set the GCP parameters needed to deploy the model.

In [8]:
PROJECT_ID = "heart-disease-classification" # Name of project in GCP
REGION = "us-central1" # Where we will create our storage bucket and deploy our model
BUCKET_NAME = "heart-disease-classification" # Name of the storage bucket where we will store our code and model files
MODEL_NAME = 'HeartDiseasePredictor' # Name of the model to be deployed
VERSION_NAME = 'v1' # Version of the MODEL_NAME
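
These parameters are combined into a handful of paths and resource names in the cells that follow. As a quick illustration, the sketch below assembles them in one place; the variable names `model_origin`, `package_uri`, and `version_resource` are introduced here for clarity and are not used elsewhere in the notebook.

```python
PROJECT_ID = "heart-disease-classification"
BUCKET_NAME = "heart-disease-classification"
MODEL_NAME = "HeartDiseasePredictor"
VERSION_NAME = "v1"

# Cloud Storage folder holding the two pickle files (passed to --origin)
model_origin = f"gs://{BUCKET_NAME}/model/"

# Source distribution with predictor.py and preprocess.py (passed to --package-uris)
package_uri = f"gs://{BUCKET_NAME}/heart_disease_classification-0.1.tar.gz"

# Fully qualified version name used by the Python API client
version_resource = f"projects/{PROJECT_ID}/models/{MODEL_NAME}/versions/{VERSION_NAME}"

print(model_origin)
print(package_uri)
print(version_resource)
```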

In [9]:
# Set our environment to the right project...
! gcloud config set project $PROJECT_ID

Updated property [core/project].


In [10]:
# Create Bucket
! gsutil mb -l $REGION gs://$BUCKET_NAME

Creating gs://heart-disease-classification/...


In [11]:
# Copy our pickled model and data pre-processor to the newly created storage bucket in the model folder
! gsutil cp ../model/log_model_v1.pkl ../model/preprocessor.pkl gs://$BUCKET_NAME/model/

Copying file://../model/log_model_v1.pkl [Content-Type=application/octet-stream]...
Copying file://../model/preprocessor.pkl [Content-Type=application/octet-stream]...
- [2 files][ 27.6 KiB/ 27.6 KiB]                                                
Operation completed over 2 objects/27.6 KiB.                                     


In [12]:
# Copy our source distribution to the storage bucket
! gsutil cp ./dist/heart_disease_classification-0.1.tar.gz gs://$BUCKET_NAME/

Copying file://./dist/heart_disease_classification-0.1.tar.gz [Content-Type=application/x-tar]...
/ [1 files][  1.4 KiB/  1.4 KiB]                                                
Operation completed over 1 objects/1.4 KiB.                                      


In [13]:
# Create a model in the region chosen above
! gcloud ai-platform models create $MODEL_NAME \
  --regions $REGION

Created ml engine model [projects/heart-disease-classification/models/HeartDiseasePredictor].


In [14]:
# Create a model version whose runtime matches the local data science environment. It is important that the
# package versions installed locally are also available on the AI Platform runtime.

# --quiet suppresses interactive prompts, so the beta component is installed automatically if it is missing
! gcloud --quiet beta ai-platform versions create $VERSION_NAME \
  --model $MODEL_NAME \
  --runtime-version 1.15 \
  --python-version 3.7 \
  --origin gs://$BUCKET_NAME/model/ \
  --package-uris gs://$BUCKET_NAME/heart_disease_classification-0.1.tar.gz \
  --prediction-class predictor.HeartDiseasePredictor

Creating version (this might take a few minutes)......done.                    
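
Since mismatched package versions are a common cause of unpickling failures, it helps to record what is installed locally and compare it against the documented package list for the chosen runtime version. The sketch below is one way to collect the local side; `installed_versions` is a hypothetical helper, and `importlib.metadata` requires Python 3.8 or later (on 3.7, the `importlib_metadata` backport offers the same interface).

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


local_versions = installed_versions(["scikit-learn", "pandas", "numpy"])
for name, version in local_versions.items():
    print(f"{name}: {version}")
```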


## Make Model Predictions
We can query the deployed model in two different ways:
* Via the command line
* Via Google's Python API client

### Command Line Predictions

In [15]:
# Path to two clinical observations
INPUT_DATA_FILE = "../data/observations_for_prediction.json"
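
The contents of this file are not shown in the notebook. The `--json-request` flag expects the full request body, a JSON object with an `instances` list, so a file along the following lines should work. This is an assumption about the file's layout rather than a copy of the original, and the sketch writes to a temporary directory so that it does not overwrite anything.

```python
import json
import os
import tempfile

observation = {
    "age": 63.0, "sex": 1.0, "cp": 1.0, "trestbps": 145.0, "chol": 233.0,
    "fbs": 1.0, "restecg": 2.0, "thalach": 150.0, "exang": 0.0,
    "oldpeak": 2.3, "slope": 3.0, "ca": 0.0, "thal": 6.0,
}
# Second observation: identical except for sex
request_body = {"instances": [observation, {**observation, "sex": 0.0}]}

# Written to a temporary directory here; the notebook's real file lives at
# ../data/observations_for_prediction.json
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "observations_for_prediction.json")
    with open(demo_path, "w") as f:
        json.dump(request_body, f, indent=2)
    with open(demo_path) as f:
        reloaded = json.load(f)

print(len(reloaded["instances"]))
```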

In [16]:
! gcloud ai-platform predict --model $MODEL_NAME  \
                   --version $VERSION_NAME \
                   --json-request $INPUT_DATA_FILE \
                   --format text

predictions[0]: 31.0%
predictions[1]: 15.0%


### Google API Client

In [17]:
import googleapiclient.discovery

instances = [
    {
      "age" : 63.0,
      "sex" : 1.0,
      "cp" : 1.0,
      "trestbps" : 145.0,
      "chol" : 233.0,
      "fbs" : 1.0,
      "restecg" : 2.0,
      "thalach" : 150.0,
      "exang" : 0.0,
      "oldpeak" : 2.3,
      "slope" : 3.0,
      "ca" : 0.0,
      "thal" : 6.0
    },
    {
      "age" : 63.0,
      "sex" : 0.0,
      "cp" : 1.0,
      "trestbps" : 145.0,
      "chol" : 233.0,
      "fbs" : 1.0,
      "restecg" : 2.0,
      "thalach" : 150.0,
      "exang" : 0.0,
      "oldpeak" : 2.3,
      "slope" : 3.0,
      "ca" : 0.0,
      "thal" : 6.0
    }
  ]

service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}/versions/{}'.format(PROJECT_ID, MODEL_NAME, VERSION_NAME)

response = service.projects().predict(
    name=name,
    body={'instances': instances}
).execute()

if 'error' in response:
    raise RuntimeError(response['error'])
else:
    print(response['predictions'])

['31.0%', '15.0%']


## Tear-down AI Platform Resources
To avoid incurring ongoing charges, the following commands delete the model version, the model itself, and the Cloud Storage bucket. They are commented out here; uncomment them to run the tear-down.

In [18]:
# Delete version resource
#! gcloud ai-platform versions delete $VERSION_NAME --quiet --model $MODEL_NAME 

# Delete model resource
#! gcloud ai-platform models delete $MODEL_NAME --quiet

# Delete Cloud Storage objects that were created
#! gsutil -m rm -r gs://$BUCKET_NAME