<a href="https://colab.research.google.com/github/superwise-ai/quickstart/blob/main/getting_started/vertex.ipynb#offline=true&sandboxMode=true" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 Getting started with Superwise.ai on GCP Vertex AI

In this notebook, we will demonstrate how to integrate a Vertex AI-based development workflow with Superwise.ai.

**Part I** of this notebook walks you through building a classical model that predicts Titanic passenger survival, using scikit-learn on Vertex AI.

It is based on the [GCP tutorial for building custom models on Vertex AI](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/official/custom/custom-tabular-bq-managed-dataset.ipynb).

**Part II** of this notebook walks you through setting up Superwise.ai to start tracking your model, by registering the model and providing a baseline for its behavior.

**Part III** will demonstrate how to send new predictions from your model to Superwise.ai, simulating a post-deployment scenario.

At this point, you should be able to start seeing insights from Superwise.ai in the web portal.

## 📌 Prerequisites

1. A Superwise.ai account that lets you log in and view insights, with the Superwise SDK installed
2. A set of API keys for sending data to Superwise.ai
3. Permissions to create models, training jobs, and inference endpoints in Vertex AI
4. Permission for Superwise.ai to access your GCS bucket (this requirement is soon to be removed)

Note: this notebook works best when run from within a Vertex AI notebook instance.

In [None]:
%pip install -U superwise

## 🏗️ Part I - building a Vertex AI Model to predict the survival chances of the Titanic passengers

This is a classical SVM model, over a publicly available dataset.

This guide is based on the best practices from [Vertex AI's example for building a Scikit-Learn model.](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/official/custom/custom-tabular-bq-managed-dataset.ipynb)

### 🔧 Setup

Install the latest version of the Vertex AI SDK for Python.

In [None]:
import os

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# Google Cloud Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_GOOGLE_CLOUD_NOTEBOOK:
    USER_FLAG = "--user"

In [None]:
! pip install {USER_FLAG} --upgrade google-cloud-aiplatform

Install the latest version of the *google-cloud-storage* library as well.

In [None]:
! pip install {USER_FLAG} -U google-cloud-storage

Install the latest version of the *google-cloud-bigquery* library as well.

In [None]:
! pip install {USER_FLAG} -U google-cloud-bigquery

### Restart the kernel

Once you've installed everything, you need to restart the notebook kernel so it can find the packages.

In [None]:
import os

# Automatically restart kernel after installs
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### Set your project ID

**If you don't know your project ID**, you might be able to get your project ID using `gcloud`.

In [None]:
import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null
# gcloud prints nothing when no project is configured, so guard the lookup
PROJECT_ID = shell_output[0] if shell_output else ""
print("Project ID: ", PROJECT_ID)

Otherwise, set your project ID here.

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

### 🛂 Authenticate your Google Cloud account

**If you are using Google Cloud Notebooks**, your environment is already
authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions
when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

1. In the Cloud Console, go to the [**Create service account key**
   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and
   click **Create**.

4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type "Vertex AI"
into the filter box, and select
   **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

5. Click **Create**. A JSON file that contains your key downloads to your
local environment.

6. Enter the path to your service account key as the
`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.

In [None]:
import os
import sys

# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# If on Google Cloud Notebooks, then don't execute this code
if not IS_GOOGLE_CLOUD_NOTEBOOK:
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

When you submit a training job using the Cloud SDK, you upload a Python package
containing your training code to a Cloud Storage bucket. Vertex AI runs
the code from this package. In this tutorial, Vertex AI also saves the
trained model that results from your job in the same bucket. Using this model artifact, you can then
create Vertex AI model and endpoint resources in order to serve
online predictions.

Set the name of your Cloud Storage bucket below. It must be unique across all
Cloud Storage buckets.

You may also change the `REGION` variable, which is used for operations
throughout the rest of this notebook. Make sure to [choose a region where Vertex AI services are
available](https://cloud.google.com/vertex-ai/docs/general/locations#available_regions). You may
not use a Multi-Regional Storage bucket for training with Vertex AI.
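
Because bucket creation fails if the name is malformed, it can help to check the name before calling `gsutil`. The sketch below encodes a simplified subset of the Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit; no `goog` prefix and no `google` substring). `is_plausible_bucket_name` is a hypothetical helper, not part of any Google SDK, and it cannot tell whether a name is already taken.

```python
import re

# Simplified pattern: 3-63 characters, lowercase letters, digits, '.', '_' and '-',
# starting and ending with a letter or digit. Longer dotted names are not handled.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def is_plausible_bucket_name(name: str) -> bool:
    """Return True if a bare bucket name (without the gs:// prefix) looks valid."""
    if name.startswith("goog") or "google" in name:
        return False
    return _BUCKET_NAME_RE.fullmatch(name) is not None


print(is_plausible_bucket_name("my-project-superwise-vertex-demo-bucket"))
print(is_plausible_bucket_name("gs://my-bucket"))
```

A name that passes this check can still be rejected by Cloud Storage, for example when another project already owns it.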

In [None]:
BUCKET_NAME = f"{PROJECT_ID}-superwise-vertex-demo-bucket"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}

In [None]:
# BUCKET_NAME is kept without the gs:// prefix; the prefix is added wherever a URI is needed
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = PROJECT_ID + "-aip-" + TIMESTAMP

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION gs://$BUCKET_NAME

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al gs://$BUCKET_NAME

### ➕ Import Vertex SDK for Python

Import the Vertex SDK for Python into your Python environment and initialize it.

In [None]:
import os
import sys

from google.cloud import aiplatform
from google.cloud.aiplatform import gapic as aip

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

## Set up variables

Next, set up some variables used throughout the tutorial.

### 🔧 Set pre-built containers

Vertex AI provides pre-built containers to run training and prediction.

For the latest list, see [Pre-built containers for training](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers) and [Pre-built containers for prediction](https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers).

In [None]:
TRAIN_VERSION = "scikit-learn-cpu.0-23"
DEPLOY_VERSION = "sklearn-cpu.0-23"

TRAIN_IMAGE = "us-docker.pkg.dev/vertex-ai/training/{}:latest".format(TRAIN_VERSION)
DEPLOY_IMAGE = "us-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(DEPLOY_VERSION)

print("Training:", TRAIN_IMAGE)
print("Deployment:", DEPLOY_IMAGE)

Training: us-docker.pkg.dev/vertex-ai/training/scikit-learn-cpu.0-23:latest
Deployment: us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-23:latest


### Set machine types

Next, set the machine types to use for training and prediction.

- Set the variables `TRAIN_COMPUTE` and `DEPLOY_COMPUTE` to configure your compute resources for training and prediction.
 - `machine type`
     - `n1-standard`: 3.75 GB of memory per vCPU
     - `n1-highmem`: 6.5 GB of memory per vCPU
     - `n1-highcpu`: 0.9 GB of memory per vCPU
 - `vCPUs`: number of \[2, 4, 8, 16, 32, 64, 96 \]

*Note: The following is not supported for training:*

 - `standard`: 2 vCPUs
 - `highcpu`: 2, 4 and 8 vCPUs

*Note: You may also use n2 and e2 machine types for training and deployment, but they do not support GPUs*.

Learn [which machine types are available for training](https://cloud.google.com/vertex-ai/docs/training/configure-compute) and [which machine types are available for prediction](https://cloud.google.com/vertex-ai/docs/predictions/configure-compute).
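
The rules listed above can be sketched as a small helper that builds the machine type string and rejects the combinations described as unsupported. `compose_machine_type` is a hypothetical name, and mapping the `standard` and `highcpu` notes to the `n1-standard` and `n1-highcpu` families is an assumption. The linked documentation remains the authoritative list.

```python
# vCPU counts listed in this notebook
SUPPORTED_VCPUS = (2, 4, 8, 16, 32, 64, 96)

# Combinations the note above lists as not supported for training
# (assumed here to refer to the n1 families)
UNSUPPORTED_FOR_TRAINING = {
    "n1-standard": {2},
    "n1-highcpu": {2, 4, 8},
}


def compose_machine_type(family: str, vcpus: int, for_training: bool = False) -> str:
    """Build a machine type such as 'n1-standard-4', validating the vCPU count."""
    if vcpus not in SUPPORTED_VCPUS:
        raise ValueError(f"vCPU count must be one of {SUPPORTED_VCPUS}, got {vcpus}")
    if for_training and vcpus in UNSUPPORTED_FOR_TRAINING.get(family, set()):
        raise ValueError(f"{family}-{vcpus} is listed as unsupported for training")
    return f"{family}-{vcpus}"


print(compose_machine_type("n1-standard", 4, for_training=True))
print(compose_machine_type("n1-standard", 2))
```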

In [None]:
MACHINE_TYPE = "n1-standard"

VCPU = "4"
TRAIN_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Train machine type", TRAIN_COMPUTE)

MACHINE_TYPE = "n1-standard"

VCPU = "2"
DEPLOY_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Deploy machine type", DEPLOY_COMPUTE)

Train machine type n1-standard-4
Deploy machine type n1-standard-2


### Prepare the data

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
BINARY_FEATURES = [
    'sex']

# List all column names for numeric features
NUMERIC_FEATURES = [
    'age',
    'fare']

# List all column names for categorical features
CATEGORICAL_FEATURES = [
    'pclass',
    'embarked',
    'home_dest',
    'parch',
    'sibsp']

LABEL = ['survived']

ALL_COLUMNS = BINARY_FEATURES+NUMERIC_FEATURES+CATEGORICAL_FEATURES+LABEL

In [None]:
# download the dataset
df = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
df = df.rename(columns={"home.dest" : "home_dest"})
df = df[ALL_COLUMNS]

In [None]:
df

Unnamed: 0,sex,age,fare,pclass,embarked,home_dest,parch,sibsp,survived
0,female,29,211.3375,1,S,"St Louis, MO",0,0,1
1,male,0.9167,151.55,1,S,"Montreal, PQ / Chesterville, ON",2,1,1
2,female,2,151.55,1,S,"Montreal, PQ / Chesterville, ON",2,1,0
3,male,30,151.55,1,S,"Montreal, PQ / Chesterville, ON",2,1,0
4,female,25,151.55,1,S,"Montreal, PQ / Chesterville, ON",2,1,0
...,...,...,...,...,...,...,...,...,...
1304,female,14.5,14.4542,3,C,?,0,1,0
1305,female,?,14.4542,3,C,?,0,1,0
1306,male,26.5,7.225,3,C,?,0,0,0
1307,male,27,7.225,3,C,?,0,0,0


In [None]:
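def clean_missing_numerics(df: pd.DataFrame, numeric_columns):
    '''
    Replaces invalid values in the numeric columns with the column mean

            Parameters:
                    df (pandas.DataFrame): The pandas DataFrame to clean
                    numeric_columns (List[str]): Names of the DataFrame columns that are numeric
            Returns:
                    pandas.DataFrame: a copy of the DataFrame with the numeric columns fixed
    '''

    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Non-numeric placeholders such as '?' become NaN
    for n in numeric_columns:
        df[n] = pd.to_numeric(df[n], errors='coerce')

    # numeric_only=True keeps df.mean() from failing on the string columns
    df = df.fillna(df.mean(numeric_only=True))

    return df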

In [None]:
df = clean_missing_numerics(df, NUMERIC_FEATURES)
# add a record_id column, using the dataframe's natural index. This is needed for training so that later we can send the ID as part of the prediction payload
df = df.reset_index().rename(columns = {'index': 'record_id'})
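
To see what the cleaning step does, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and the column names mirror the Titanic features. The `?` placeholders in the numeric columns are coerced to NaN and then replaced with the column mean, while string columns are left untouched.

```python
import pandas as pd

# Toy data with '?' placeholders, similar to the raw Titanic CSV
toy = pd.DataFrame({
    "age": ["29", "?", "31"],
    "fare": ["10.0", "20.0", "?"],
    "embarked": ["S", "C", "?"],
})

# Coerce the numeric columns; anything unparsable becomes NaN
for col in ["age", "fare"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Fill NaN values with the mean of each numeric column
toy = toy.fillna(toy.mean(numeric_only=True))

print(toy)
```

Note that the `?` in `embarked` survives, because only NaN values in numeric columns are filled.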


### Train/Test split and store as CSV files in the bucket

In [None]:

X = df.drop(columns="survived")
y = df["survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

train = X_train.copy()
train["survived"] = y_train

test = X_test.copy()
test["survived"] = y_test


In [None]:
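# index=False avoids writing the DataFrame index as an extra unnamed column
train.to_csv(f"gs://{BUCKET_NAME}/data/titanic_train.csv", index=False)
test.to_csv(f"gs://{BUCKET_NAME}/data/titanic_test.csv", index=False)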

## Prepare the training code package

For this tutorial, we will wrap our training script in a package.
This package can be run locally or installed inside the training container when running on the Vertex AI training machine.

In [None]:
! mkdir -p titanic/trainer

In [None]:
%%writefile titanic/setup.py

from setuptools import find_packages
from setuptools import setup

REQUIRED_PACKAGES = [
    'gcsfs==0.7.1',
    'dask[dataframe]==2021.2.0',
    'google-cloud-bigquery-storage==1.0.0',
    'six==1.15.0'
]

setup(
    name='trainer',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(), # Automatically find packages within this directory or below.
    include_package_data=True, # if packages include any data files, those will be packed together.
    description='Classification training titanic survivors prediction model'
)

Overwriting titanic/setup.py


In [None]:
! touch titanic/trainer/__init__.py

In [None]:
%%writefile titanic/trainer/task.py

from google.cloud import bigquery, bigquery_storage, storage
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import classification_report, f1_score
from typing import Union, List
import os, logging, json, pickle, argparse
import dask.dataframe as dd
import pandas as pd
import numpy as np

# feature selection.  The FEATURE list defines what features are needed from the training data.
# as well as the types of those features. We will perform different feature engineering depending on the type

# List all column names for binary features: 0,1 or True,False or Male,Female etc
BINARY_FEATURES = [
    'sex']

# List all column names for numeric features
NUMERIC_FEATURES = [
    'age',
    'fare']

# List all column names for categorical features
CATEGORICAL_FEATURES = [
    'pclass',
    'embarked',
    'home_dest',
    'parch',
    'sibsp']

# ID column - needed to support predict() over numpy arrays 
ID = ['record_id']

ALL_COLUMNS = ID + BINARY_FEATURES+NUMERIC_FEATURES+CATEGORICAL_FEATURES 

# define the column name for label
LABEL = 'survived'


# Define the index position of each feature. This is needed for processing a
# numpy array (instead of pandas) which has no column names.
BINARY_FEATURES_IDX = list(range(1,len(BINARY_FEATURES)+1))
NUMERIC_FEATURES_IDX = list(range(len(BINARY_FEATURES)+1, len(BINARY_FEATURES)+len(NUMERIC_FEATURES)+1))
CATEGORICAL_FEATURES_IDX = list(range(len(BINARY_FEATURES+NUMERIC_FEATURES)+1, len(ALL_COLUMNS)))


def load_data_from_gcs(data_gcs_path: str) -> pd.DataFrame:
    '''
    Loads data from Google Cloud Storage (GCS) to a dataframe

            Parameters:
                    data_gcs_path (str): gs path for the location of the data. Wildcards are also supported. i.e gs://example_bucket/data/training-*.csv

            Returns:
                    pandas.DataFrame: a dataframe with the data from GCP loaded
    '''

    # dask supports wildcards, so several files can be read at once; .compute() then materializes a pandas DataFrame.
    # The dtype override for 'TotalCharges' is a leftover from the dataset this helper was first written for.
    # The Titanic data has no such column, so the override should have no effect here.
    logging.info("reading gs data: {}".format(data_gcs_path))
    return dd.read_csv(data_gcs_path, dtype={'TotalCharges': 'object'}).compute()


def load_data_from_bq(bq_uri: str) -> pd.DataFrame:
    '''
    Loads data from BigQuery table (BQ) to a dataframe

            Parameters:
                    bq_uri (str): bq table uri. i.e: bq://example_project.example_dataset.example_table
            Returns:
                    pandas.DataFrame: a dataframe with the data from GCP loaded
    '''
    if not bq_uri.startswith('bq://'):
        raise Exception("uri is not a BQ uri. It should be bq://project_id.dataset.table")
    logging.info("reading bq data: {}".format(bq_uri))
    project,dataset,table =  bq_uri.split(".")
    bqclient = bigquery.Client(project=project[5:])
    bqstorageclient = bigquery_storage.BigQueryReadClient()
    query_string = """
    SELECT * from {ds}.{tbl}
    """.format(ds=dataset, tbl=table)

    return (
        bqclient.query(query_string)
            .result()
            .to_dataframe(bqstorage_client=bqstorageclient)
    )

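def clean_missing_numerics(df: pd.DataFrame, numeric_columns):
    '''
    Replaces invalid values in the numeric columns with the column mean

            Parameters:
                    df (pandas.DataFrame): The pandas DataFrame to clean
                    numeric_columns (List[str]): Names of the DataFrame columns that are numeric
            Returns:
                    pandas.DataFrame: a copy of the DataFrame with the numeric columns fixed
    '''

    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Non-numeric placeholders such as '?' become NaN
    for n in numeric_columns:
        df[n] = pd.to_numeric(df[n], errors='coerce')

    # numeric_only=True keeps df.mean() from failing on the string columns
    df = df.fillna(df.mean(numeric_only=True))

    return df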

def data_selection(df: pd.DataFrame, selected_columns: List[str], label_column: str) -> (pd.DataFrame, pd.Series):
    '''
    From a dataframe it creates a new dataframe with only selected columns and returns it.
    Additionally it splits the label column into a pandas Series.

            Parameters:
                    df (pandas.DataFrame): The Pandas Dataframe to drop columns and extract label
                    selected_columns (List[str]): List of strings with the selected columns. i,e ['col_1', 'col_2', ..., 'col_n' ]
                    label_column (str): The name of the label column

            Returns:
                    tuple(pandas.DataFrame, pandas.Series): Tuple with the new pandas DataFrame containing only selected columns and the label pandas Series
    '''
    # We create a series with the prediction label
    labels = df[label_column]

    data = df.loc[:, selected_columns]


    return data, labels

def pipeline_builder(params_svm: dict, bin_ftr_idx: List[int], num_ftr_idx: List[int], cat_ftr_idx: List[int]) -> Pipeline:
    '''
    Builds a sklearn pipeline with preprocessing and model configuration.
    Preprocessing steps are:
        * OrdinalEncoder - used for binary features
        * StandardScaler - used for numerical features
        * OneHotEncoder - used for categorical features
    Model used is SVC

            Parameters:
                    params_svm (dict): Parameters for the sklearn.svm.SVC classifier
                    bin_ftr_idx (List[int]): List of ints that mark the column indexes with binary columns. i.e [0, 2, ... , X ]
                    num_ftr_idx (List[int]): List of ints that mark the column indexes with numerical columns. i.e [6, 3, ... , X ]
                    cat_ftr_idx (List[int]): List of ints that mark the column indexes with categorical columns. i.e [5, 10, ... , X ]

            Returns:
                     Pipeline: sklearn.pipeline.Pipeline with preprocessing and model training
    '''

    # Defining a preprocessing step for our pipeline.
    # it specifies how the features are going to be transformed
    preprocessor = ColumnTransformer(
        transformers=[
            ('bin', OrdinalEncoder(), bin_ftr_idx),
            ('num', StandardScaler(), num_ftr_idx),
            ('cat', OneHotEncoder(handle_unknown='ignore'),  cat_ftr_idx)], remainder='drop', n_jobs=-1)


    # We now create a full pipeline, for preprocessing and training.
    # for training we use an SVM classifier; the kernel comes from params_svm (linear by default)

    clf = SVC()
    clf.set_params(**params_svm)

    return Pipeline(steps=[ ('preprocessor', preprocessor),
                            ('classifier', clf)])

def train_pipeline(clf: Pipeline, X: Union[pd.DataFrame, np.ndarray], y: Union[pd.DataFrame, np.ndarray]) -> float:
    '''
    Trains a sklearn pipeline by fitting training data and labels, and returns the cross-validation score

            Parameters:
                    clf (sklearn.pipeline.Pipeline): the Pipeline object to fit the data
                    X: (pd.DataFrame OR np.ndarray): Training vectors of shape n_samples x n_features, where n_samples is the number of samples and n_features is the number of features.
                    y: (pd.DataFrame OR np.ndarray): Labels of shape n_samples. Order should match Training Vectors X

            Returns:
                    score (float): Mean score over all cross-validation folds. cross_val_score is called without a scoring argument, so this is the classifier's default metric (accuracy), not F1
    '''
    # run cross validation to get training score. we can use this score to optimise training
    score = cross_val_score(clf, X, y, cv=10, n_jobs=-1).mean()

    # Now we fit all our data to the classifier.
    clf.fit(X, y)

    return score

def process_gcs_uri(uri: str) -> (str, str, str, str):
    '''
    Receives a Google Cloud Storage (GCS) uri and breaks it down to the scheme, bucket, path and file

            Parameters:
                    uri (str): GCS uri

            Returns:
                    scheme (str): uri scheme
                    bucket (str): uri bucket
                    path (str): uri path
                    file (str): uri file
    '''
    url_arr = uri.split("/")
    if "." not in url_arr[-1]:
        file = ""
    else:
        file = url_arr.pop()
    scheme = url_arr[0]
    bucket = url_arr[2]
    path = "/".join(url_arr[3:])
    path = path[:-1] if path.endswith("/") else path

    return scheme, bucket, path, file

def pipeline_export_gcs(fitted_pipeline: Pipeline, model_dir: str) -> str:
    '''
    Exports trained pipeline to GCS

            Parameters:
                    fitted_pipeline (sklearn.pipeline.Pipeline): the Pipeline object with data already fitted (trained pipeline object)
                    model_dir (str): GCS path to store the trained pipeline. i.e gs://example_bucket/training-job
            Returns:
                    export_path (str): Model GCS location
    '''
    scheme, bucket, path, file = process_gcs_uri(model_dir)
    if scheme != "gs:":
        raise ValueError("URI scheme must be gs")

    # Upload the model to GCS
    b = storage.Client().bucket(bucket)
    export_path = os.path.join(path, 'model.pkl')
    blob = b.blob(export_path)

    blob.upload_from_string(pickle.dumps(fitted_pipeline))
    return scheme + "//" + os.path.join(bucket, export_path)


def prepare_report(cv_score: float, model_params: dict, classification_report: str, columns: List[str], example_data: np.ndarray) -> str:
    '''
    Prepares a training report in Text

            Parameters:
                    cv_score (float): score of the training job during cross validation of training data
                    model_params (dict): dictionary containing the parameters the model was trained with
                    classification_report (str): Model classification report with test data
                    columns (List[str]): List of columns that were used in training.
                    example_data (np.array): Sample of data (2-3 rows are enough). This is used to include what the prediction payload should look like for the model
            Returns:
                    report (str): Full report in text
    '''

    buffer_example_data = '['
    for r in example_data:
        buffer_example_data+='['
        for c in r:
            if(isinstance(c,str)):
                buffer_example_data+="'"+c+"', "
            else:
                buffer_example_data+=str(c)+", "
        buffer_example_data= buffer_example_data[:-2]+"], \n"
    buffer_example_data= buffer_example_data[:-3]+"]"

    report = """
Training Job Report    
    
Cross Validation Score: {cv_score}

Training Model Parameters: {model_params}
    
Test Data Classification Report:
{classification_report}

Example of data array for prediction:

Order of columns:
{columns}

Example for clf.predict()
{predict_example}


Example of GCP API request body:
{{
    "instances": {json_example}
}}

""".format(
        cv_score=cv_score,
        model_params=json.dumps(model_params),
        classification_report=classification_report,
        columns = columns,
        predict_example = buffer_example_data,
        json_example = json.dumps(example_data.tolist()))

    return report


def report_export_gcs(report: str, report_dir: str) -> str:
    '''
    Exports training job report to GCS

            Parameters:
                    report (str): Full report in text to send to GCS
                    report_dir (str): GCS path to store the report. i.e gs://example_bucket/training-job
            Returns:
                    export_path (str): Report GCS location
    '''
    scheme, bucket, path, file = process_gcs_uri(report_dir)
    if scheme != "gs:":
        raise ValueError("URI scheme must be gs")

    # Upload the report to GCS
    b = storage.Client().bucket(bucket)

    export_path = os.path.join(path, 'report.txt')
    blob = b.blob(export_path)

    blob.upload_from_string(report)

    return scheme + "//" + os.path.join(bucket, export_path)



# Define all the command line arguments your model can accept for training
if __name__ == '__main__':

    parser = argparse.ArgumentParser()
    # Input Arguments

    parser.add_argument(
        '--model_param_kernel',
        help = 'SVC model parameter- kernel',
        choices=['linear', 'poly', 'rbf', 'sigmoid', 'precomputed'],
        type = str,
        default = 'linear'
    )

    parser.add_argument(
        '--model_param_degree',
        help = 'SVC model parameter- Degree. Only applies for poly kernel',
        type = int,
        default = 3
    )

    parser.add_argument(
        '--model_param_C',
        help = 'SVC model parameter- C (regularization)',
        type = float,
        default = 1.0
    )

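    parser.add_argument(
        '--model_param_probability',
        help = 'Whether to enable probability estimates (true/false)',
        # argparse's type=bool treats any non-empty string, including "False", as True,
        # so the value is parsed explicitly
        type = lambda v: str(v).lower() in ('true', '1', 'yes'),
        default = True
    )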


    '''
    Vertex AI automatically populates a set of environment variables in the container that executes
    your training job. Those variables include:
        * AIP_MODEL_DIR - Directory selected as model dir
        * AIP_DATA_FORMAT - Type of dataset selected for training (can be csv or bigquery)

    Vertex AI will automatically split the selected dataset into training, validation and testing,
    and 3 more environment variables will reflect the location of the data:
        * AIP_TRAINING_DATA_URI - URI of Training data
        * AIP_VALIDATION_DATA_URI - URI of Validation data
        * AIP_TEST_DATA_URI - URI of Test data

    Notice that those environment variables are defaults. If the user provides a value using a CLI argument,
    the environment variable will be ignored. If the user does not provide a CLI argument,
    the program will use the environment variable if it exists; otherwise the value is left empty.
    '''
    parser.add_argument(
        '--model_dir',
        help = 'Directory to output model and artifacts',
        type = str,
        default = os.environ['AIP_MODEL_DIR'] if 'AIP_MODEL_DIR' in os.environ else ""
    )
    parser.add_argument(
        '--data_format',
        choices=['csv', 'bigquery'],
        help = 'format of data uri csv for gs:// paths and bigquery for project.dataset.table formats',
        type = str,
        default =  os.environ['AIP_DATA_FORMAT'] if 'AIP_DATA_FORMAT' in os.environ else "csv"
    )
    parser.add_argument(
        '--training_data_uri',
        help = 'location of training data in either gs:// uri or bigquery uri',
        type = str,
        default =  os.environ['AIP_TRAINING_DATA_URI'] if 'AIP_TRAINING_DATA_URI' in os.environ else ""
    )
    parser.add_argument(
        '--validation_data_uri',
        help = 'location of validation data in either gs:// uri or bigquery uri',
        type = str,
        default =  os.environ['AIP_VALIDATION_DATA_URI'] if 'AIP_VALIDATION_DATA_URI' in os.environ else ""
    )
    parser.add_argument(
        '--test_data_uri',
        help = 'location of test data in either gs:// uri or bigquery uri',
        type = str,
        default =  os.environ['AIP_TEST_DATA_URI'] if 'AIP_TEST_DATA_URI' in os.environ else ""
    )

    parser.add_argument("-v", "--verbose", help="increase output verbosity",
                        action="store_true")



    args = parser.parse_args()
    arguments = args.__dict__


    if args.verbose:
        logging.basicConfig(level=logging.INFO)


    logging.info('Model artifacts will be exported here: {}'.format(arguments['model_dir']))
    logging.info('Data format: {}'.format(arguments["data_format"]))
    logging.info('Training data uri: {}'.format(arguments['training_data_uri']) )
    logging.info('Validation data uri: {}'.format(arguments['validation_data_uri']))
    logging.info('Test data uri: {}'.format(arguments['test_data_uri']))


    '''
    We have 2 different ways to load our data to pandas. One is from cloud storage by loading csv files and
    the other is by connecting to BigQuery. Vertex AI supports both, and
    depending on the dataset provided, the code below selects the appropriate loading method.
    '''
    logging.info('Loading {} data'.format(arguments["data_format"]))
    if(arguments['data_format']=='csv'):
        df_train = load_data_from_gcs(arguments['training_data_uri'])
        df_test = load_data_from_gcs(arguments['test_data_uri'])
        df_valid = load_data_from_gcs(arguments['validation_data_uri'])
    elif(arguments['data_format']=='bigquery'):
        print(arguments['training_data_uri'])
        df_train = load_data_from_bq(arguments['training_data_uri'])
        df_test = load_data_from_bq(arguments['test_data_uri'])
        df_valid = load_data_from_bq(arguments['validation_data_uri'])
    else:
        raise ValueError("Invalid data type ")

    # as we will be using cross validation, we will have just a training set and a single test set.
    # we will merge the test and validation sets to achieve an 80%-20% split
    df_test = pd.concat([df_test,df_valid])

    logging.info('Defining model parameters')
    model_params = dict()
    model_params['kernel'] = arguments['model_param_kernel']
    model_params['degree'] = arguments['model_param_degree']
    model_params['C'] = arguments['model_param_C']
    model_params['probability'] = arguments['model_param_probability']

    df_train = clean_missing_numerics(df_train, NUMERIC_FEATURES)
    df_test = clean_missing_numerics(df_test, NUMERIC_FEATURES)


    logging.info('Running feature selection')
    X_train, y_train = data_selection(df_train, ALL_COLUMNS, LABEL)
    X_test, y_test = data_selection(df_test, ALL_COLUMNS, LABEL)

    logging.info('Training pipelines in CV')
    clf = pipeline_builder(model_params, BINARY_FEATURES_IDX, NUMERIC_FEATURES_IDX, CATEGORICAL_FEATURES_IDX)

    cv_score = train_pipeline(clf, X_train, y_train)



    logging.info('Export trained pipeline and report')
    pipeline_export_gcs(clf, arguments['model_dir'])

    y_pred = clf.predict(X_test)


    test_score = f1_score(y_test, y_pred, average='weighted')


    logging.info('f1score: '+ str(test_score))

    report = prepare_report(cv_score,
                            model_params,
                            classification_report(y_test,y_pred),
                            ALL_COLUMNS,
                            X_test.to_numpy()[0:2])

    report_export_gcs(report, arguments['model_dir'])


    logging.info('Training job completed. Exiting...')
    

Overwriting titanic/trainer/task.py
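
The training script addresses features by column position, because the deployed model receives plain arrays without column names. A minimal sketch of how those positions fall out of the column order is shown below, with `record_id` at position 0. The `index_ranges` helper is hypothetical and only illustrates the arithmetic behind the `*_FEATURES_IDX` constants.

```python
ID = ["record_id"]
BINARY_FEATURES = ["sex"]
NUMERIC_FEATURES = ["age", "fare"]
CATEGORICAL_FEATURES = ["pclass", "embarked", "home_dest", "parch", "sibsp"]

ALL_COLUMNS = ID + BINARY_FEATURES + NUMERIC_FEATURES + CATEGORICAL_FEATURES


def index_ranges(id_columns, *groups):
    """Return one list of column positions per feature group, laid out after the ID columns."""
    ranges = []
    start = len(id_columns)
    for group in groups:
        ranges.append(list(range(start, start + len(group))))
        start += len(group)
    return ranges


bin_idx, num_idx, cat_idx = index_ranges(
    ID, BINARY_FEATURES, NUMERIC_FEATURES, CATEGORICAL_FEATURES
)
print(bin_idx, num_idx, cat_idx)
```

Because the ID column occupies position 0 and is not listed in any group, the `ColumnTransformer` in the script drops it through `remainder='drop'`.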


### Install the training package locally

In [None]:
! cd titanic && python setup.py install

### 🏃 Train the model locally



In [None]:
CMDARGS = [
    "--model_param_kernel=linear",
    "--data_format=csv",
    f"--training_data_uri=gs://{BUCKET_NAME}/data/titanic_train.csv",
    f"--test_data_uri=gs://{BUCKET_NAME}/data/titanic_test.csv",
    f"--validation_data_uri=gs://{BUCKET_NAME}/data/titanic_test.csv",
]

In [None]:
#add a specific path to write the model file to the args
CMDARGS_LOCAL = " ".join(CMDARGS + [f"--model_dir=gs://{BUCKET_NAME}/titanic/trial"])
            

In [None]:
%run titanic/trainer/task.py $CMDARGS_LOCAL

In [None]:
# create a package and upload it to the cloud bucket
! cd titanic && python setup.py sdist

In [None]:
PACKAGE_URI = f"gs://{BUCKET_NAME}/training/trainer-0.1.tar.gz"

In [None]:
! gsutil cp titanic/dist/trainer-0.1.tar.gz $PACKAGE_URI

Copying file://titanic/dist/trainer-0.1.tar.gz [Content-Type=application/x-tar]...
/ [1 files][  6.3 KiB/  6.3 KiB]                                                
Operation completed over 1 objects/6.3 KiB.                                      


## 🏃 Train and deploy the model on Vertex AI

Define your custom `TrainingPipeline` on Vertex AI.

Use the `CustomPythonPackageTrainingJob` class to define the `TrainingPipeline`. The class takes the following parameters:

- `display_name`: The user-defined name of this training pipeline.
- `python_package_gcs_uri`: The Cloud Storage location of the training package uploaded above.
- `python_module_name`: The module inside the package to run as the training entry point.
- `container_uri`: The URI of the training container image.
- `model_serving_container_image_uri`: The URI of a container that can serve predictions for your model, either a pre-built container or a custom container.

Use the `run` function to start training. The function takes the following parameters:

- `args`: The command line arguments to be passed to the Python script.
- `replica_count`: The number of worker replicas.
- `model_display_name`: The display name of the `Model` if the script produces a managed `Model`.
- `machine_type`: The type of machine to use for training.
- `accelerator_type`: The hardware accelerator type.
- `accelerator_count`: The number of accelerators to attach to a worker replica.

The `run` function creates a training pipeline that trains and creates a `Model` object. After the training pipeline completes, the `run` function returns the `Model` object.

In [None]:
# Create a custom package-based training job
JOB_NAME = 'superwise_vertex_demo_job'
MODEL_DISPLAY_NAME = "superwise_vertex_titanic"


job = aiplatform.CustomPythonPackageTrainingJob(display_name=JOB_NAME, 
                                                python_package_gcs_uri=PACKAGE_URI, 
                                                python_module_name='trainer.task', 
                                                container_uri=TRAIN_IMAGE, 
                                                model_serving_container_image_uri=DEPLOY_IMAGE, 
                                                )


from datetime import datetime

model = job.run(
    model_display_name=MODEL_DISPLAY_NAME,
    args=CMDARGS,
    replica_count=1,
    machine_type=TRAIN_COMPUTE,
    accelerator_count=0)

### 🚀 Deploy the model

Before you use your model to make predictions, you must deploy it to an `Endpoint`. You can do this by calling the `deploy` function on the `Model` resource. This will do two things:

1. Create an `Endpoint` resource for deploying the `Model` resource to.
2. Deploy the `Model` resource to the `Endpoint` resource.


The function takes the following parameters:

- `deployed_model_display_name`: A human readable name for the deployed model.
- `traffic_split`: Percent of traffic at the endpoint that goes to this model, which is specified as a dictionary of one or more key/value pairs.
   - If only one model, then specify `{ "0": 100 }`, where "0" refers to this model being uploaded and 100 means 100% of the traffic.
   - If there are existing models on the endpoint, for which the traffic will be split, then use `model_id` to specify `{ "0": percent, model_id: percent, ... }`, where `model_id` is the ID of an existing `DeployedModel` on the endpoint. The percentages must add up to 100.
- `machine_type`: The type of machine to use for serving predictions.
- `accelerator_type`: The hardware accelerator type.
- `accelerator_count`: The number of accelerators to attach to each serving replica.
- `min_replica_count`: The minimum number of compute instances to provision.
- `max_replica_count`: The maximum number of compute instances to scale to. In this tutorial, only one instance is provisioned.

### Traffic split

The `traffic_split` parameter is specified as a Python dictionary. You can deploy more than one instance of your model to an endpoint, and then set the percentage of traffic that goes to each instance.

You can use a traffic split to introduce a new model gradually into production. For example, if you had one existing model in production with 100% of the traffic, you could deploy a new model to the same endpoint, direct 10% of traffic to it, and reduce the original model's traffic to 90%. This allows you to monitor the new model's performance while minimizing disruption to the majority of users.
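
The 90/10 rollout described above can be sketched as a small helper that builds the `traffic_split` dictionary. This is an illustrative sketch and not part of the Vertex AI SDK: the function `build_traffic_split` and the deployed model ID `"1234567890"` are hypothetical. Only the `"0"` key convention and the rule that percentages add up to 100 come from the description above.

```python
def build_traffic_split(new_model_percent, existing_model_percents=None):
    """Build a traffic_split dict where "0" denotes the model being deployed.

    existing_model_percents maps DeployedModel IDs already on the endpoint
    to the share of traffic each one should keep.
    """
    split = {"0": new_model_percent}
    split.update(existing_model_percents or {})

    if any(percent < 0 for percent in split.values()):
        raise ValueError("Traffic percentages must be non-negative")
    total = sum(split.values())
    if total != 100:
        raise ValueError(f"Traffic percentages must add up to 100, got {total}")
    return split


# Single model on the endpoint: it receives all of the traffic.
single = build_traffic_split(100)

# Gradual rollout: 10% to the new model, 90% stays on a hypothetical existing one.
canary = build_traffic_split(10, {"1234567890": 90})
print(single, canary)
```

Validating the dictionary before calling `deploy` surfaces a bad split locally instead of as a failed deployment request.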

### Compute instance scaling

You can specify a single instance (or node) to serve your online prediction requests. This tutorial uses a single node, so the variables `MIN_NODES` and `MAX_NODES` are both set to `1`.

If you want to use multiple nodes to serve your online prediction requests, set `MAX_NODES` to the maximum number of nodes you want to use. Vertex AI autoscales the number of nodes used to serve your predictions, up to the maximum number you set. Refer to the [pricing page](https://cloud.google.com/vertex-ai/pricing#prediction-prices) to understand the costs of autoscaling with multiple nodes.

### Endpoint

The method will block until the model is deployed and eventually return an `Endpoint` object. If this is the first time a model is deployed to the endpoint, it may take a few additional minutes to complete provisioning of resources.

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
DEPLOYED_NAME = MODEL_DISPLAY_NAME + "-" + TIMESTAMP
TRAFFIC_SPLIT = {"0": 100}

MIN_NODES = 1
MAX_NODES = 1


endpoint = model.deploy(
    deployed_model_display_name=DEPLOYED_NAME,
    traffic_split=TRAFFIC_SPLIT,
    machine_type=DEPLOY_COMPUTE,
    accelerator_count=0,
    min_replica_count=MIN_NODES,
    max_replica_count=MAX_NODES,
)

## Make an online prediction request

Send an online prediction request to your deployed model.

### Send the prediction request

Now that you have test data, you can use it to send a prediction request. Use the `Endpoint` object's `predict` function, which takes the following parameters:

- `instances`: A list of instances for prediction. Each instance is an array of values. 

**Note**: The first column for each instance needs to be the record_id. We are sending this to the prediction API in order to associate it with the prediction outputs on the server side.

The `predict` function returns a `Prediction` object whose `predictions` attribute is a list, where each element corresponds to an instance in the request.
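
To make the instance layout concrete, here is a minimal sketch that turns row dictionaries into the list-of-arrays shape with `record_id` in the first position. The helper `to_instances` is hypothetical, and the column order is assumed from the DataFrames displayed later in this notebook. The two sample rows are taken from the baseline table shown in Part II.

```python
# Column order assumed from this notebook's data; record_id must come first.
FEATURE_ORDER = ["record_id", "sex", "age", "fare", "pclass",
                 "embarked", "home_dest", "parch", "sibsp"]


def to_instances(rows, column_order=FEATURE_ORDER):
    """Convert a list of row dicts into a list of value arrays."""
    if column_order[0] != "record_id":
        raise ValueError("record_id must be the first column")
    return [[row[col] for col in column_order] for row in rows]


rows = [
    {"sex": "male", "record_id": 677, "age": 26.0, "fare": 7.8958, "pclass": 3,
     "embarked": "S", "home_dest": "Bulgaria Chicago, IL", "parch": 0, "sibsp": 0},
    {"sex": "female", "record_id": 534, "age": 19.0, "fare": 26.0, "pclass": 2,
     "embarked": "S", "home_dest": "Worcester, England", "parch": 0, "sibsp": 0},
]

example_instances = to_instances(rows)
print(example_instances[0])
```

A list shaped like `example_instances` is what the `instances` parameter expects.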

In [None]:
instances = train.to_numpy().tolist()
predictions = endpoint.predict(instances=instances)

In [None]:
# np.int was deprecated in NumPy 1.20; the built-in int is the drop-in replacement
y_predicted = np.asarray(predictions.predictions, dtype=int)
y_predicted


array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0,
       1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,

In [None]:
correct = sum(y_predicted == np.array(y_train))
total = len(y_predicted)
print(
    f"Correct predictions = {correct}, Total predictions = {total}, Accuracy = {correct/total}"
)

Correct predictions = 782, Total predictions = 916, Accuracy = 0.8537117903930131


## 📈 Part II - Setup Superwise.ai to track your model


### 🔧 Setup
1. Install the Superwise Python package from pip
2. Set environment variables with the API keys
3. Create a Superwise client

In [None]:
# Log in to the Superwise.ai portal and click on your account icon.
# Click "personal tokens" -> "generate tokens" and paste the values below
%env SUPERWISE_CLIENT_ID=REPLACE_WITH_YOUR_CLIENTID
%env SUPERWISE_SECRET=REPLACE_WITH_YOUR_SECRET

In [None]:
from superwise import Superwise
from superwise.models.model import Model
from superwise.models.version import Version
from superwise.models.data_entity import DataEntity
from superwise.resources.superwise_enums import FeatureType, DataEntityRole
from superwise.controller.infer import infer_dtype

sw = Superwise()

### Create a Superwise *Model*

A *Model* represents a domain problem.
In our case, the model is to predict the survival chances of the Titanic passengers.

Over time, we may develop and deploy different ML models that attempt to solve this problem.

In Superwise.ai terminology, each specific ML model we wish to track is called a *Version*. 
There may be multiple *Versions* belonging to a *Model* being tracked at any point in time (e.g. new models in shadow mode, or A/B tests)

In [None]:
# Create the Model entity
titanic_model = Model(
    name="Superwise-vertex-titanic-model",
    description="Predicting Titanic passengers' survival probability"
)

my_model = sw.model.create(titanic_model)
print(my_model.id)

53


### Create a *Baseline* for our deployed model

We've just deployed a model to Vertex AI, and wish to start tracking it.
To analyze the model's performance over time, we need to set up a *Baseline* for the model's behavior.

It's common practice to use the training or test data (both features and predictions) as the baseline, as they represent the state which we consider stable and validated.

Later, when the model performs predictions in production, we can compare the data and prediction behavior to the baseline, and detect drift.
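
As a rough illustration of what comparing production behavior to a baseline can mean, the sketch below computes the population stability index (PSI) between two sets of binary predictions. PSI is a generic drift measure chosen for this example. It is an assumption here and does not describe how Superwise.ai detects drift. The helper names and sample values are hypothetical.

```python
import math
from collections import Counter


def class_shares(values, classes, eps=1e-6):
    """Return the share of each class, floored at eps to avoid log(0)."""
    counts = Counter(values)
    total = len(values)
    return {c: max(counts.get(c, 0) / total, eps) for c in classes}


def population_stability_index(baseline, production):
    """PSI = sum((p - b) * ln(p / b)) over the classes seen in either sample."""
    classes = set(baseline) | set(production)
    b = class_shares(baseline, classes)
    p = class_shares(production, classes)
    return sum((p[c] - b[c]) * math.log(p[c] / b[c]) for c in classes)


baseline_preds = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # 30% positive
stable_preds = [0, 1, 0, 0, 0, 1, 0, 1, 0, 0]    # 30% positive
shifted_preds = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]   # 70% positive

psi_stable = population_stability_index(baseline_preds, stable_preds)
psi_shifted = population_stability_index(baseline_preds, shifted_preds)
print(psi_stable, psi_shifted)
```

A value of zero means the two distributions match, and the value grows as the production distribution moves away from the baseline.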

The baseline data includes:

1. Features
2. Labels
3. Model predictions
4. Timestamp of inference
5. Id for each record (later used to correlate predictions with labels)

In [None]:
# add the prediction value, a timestamp and the label to the training features

baseline_data = X_train.assign(
    prediction=predictions.predictions,
    ts=pd.Timestamp.now(),
    survived=np.array(y_train)
)

In [None]:
baseline_data

Unnamed: 0,record_id,sex,age,fare,pclass,embarked,home_dest,parch,sibsp,prediction,ts,survived
0,1214,male,29.881135,8.6625,3,S,?,0,0,0.0,2021-12-22 18:49:50.620685,0
1,677,male,26.000000,7.8958,3,S,"Bulgaria Chicago, IL",0,0,0.0,2021-12-22 18:49:50.620685,0
2,534,female,19.000000,26.0000,2,S,"Worcester, England",0,0,1.0,2021-12-22 18:49:50.620685,1
3,1174,female,29.881135,69.5500,3,S,?,2,8,0.0,2021-12-22 18:49:50.620685,0
4,864,female,28.000000,7.7750,3,S,?,0,0,1.0,2021-12-22 18:49:50.620685,0
...,...,...,...,...,...,...,...,...,...,...,...,...
911,1095,female,29.881135,7.6292,3,Q,?,0,0,1.0,2021-12-22 18:49:50.620685,0
912,1130,female,18.000000,7.7750,3,S,?,0,0,1.0,2021-12-22 18:49:50.620685,0
913,1294,male,28.500000,16.1000,3,S,?,0,0,0.0,2021-12-22 18:49:50.620685,0
914,860,female,26.000000,7.9250,3,S,?,0,0,1.0,2021-12-22 18:49:50.620685,1


### Create a *Schema* object that describes the format and semantics of our Baseline data

The Schema object helps Superwise.ai interpret our data, for example, understand which column represents predictions and which represents the labels.

You can let Superwise infer the data types, or you can verify and edit them manually.
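
To give a feel for what type inference involves, here is a deliberately naive sketch that guesses a type label per column from its values. It is not the logic of Superwise's `infer_dtype`: the function `guess_dtype`, its rules, and the sample values are assumptions made for illustration, and it only reuses the four type labels that appear in this notebook's output.

```python
from datetime import datetime
from numbers import Number


def guess_dtype(values):
    """Guess a type label for one column using simple value-based rules."""
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, datetime) for v in present):
        return "Timestamp"
    if present and all(isinstance(v, Number) for v in present):
        # Columns holding only 0/1 are treated as Boolean, everything else as Numeric.
        return "Boolean" if set(present) <= {0, 1} else "Numeric"
    return "Categorical"


columns = {
    "sex": ["male", "female", "female"],
    "age": [26.0, 19.0, 28.0],
    "pclass": [3, 2, 3],
    "prediction": [0.0, 1.0, 1.0],
    "ts": [datetime(2021, 12, 22, 18, 49, 50)] * 3,
}

guessed = {name: guess_dtype(values) for name, values in columns.items()}
print(guessed)
```

A value-based rule like this labels `pclass` as `Numeric` because it cannot know that the integers are class codes, which is why reviewing the inferred types manually is worthwhile.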

In [None]:
entities_dtypes = infer_dtype(df=baseline_data)
entities_dtypes

{'record_id': 'Numeric',
 'sex': 'Categorical',
 'age': 'Numeric',
 'fare': 'Numeric',
 'pclass': 'Numeric',
 'embarked': 'Categorical',
 'home_dest': 'Categorical',
 'parch': 'Numeric',
 'sibsp': 'Numeric',
 'prediction': 'Boolean',
 'ts': 'Timestamp',
 'survived': 'Boolean'}

In [None]:
entities_dtypes['pclass'] = 'Categorical'
entities_dtypes['parch'] = 'Categorical'
entities_dtypes['sibsp'] = 'Categorical'

In [None]:
entities_collection = sw.data_entity.summarise(
    data=baseline_data,
    entities_dtypes=entities_dtypes,
    specific_roles = {
      'record_id': DataEntityRole.ID,
      'ts': DataEntityRole.TIMESTAMP,
      'prediction': DataEntityRole.PREDICTION_VALUE,
      'survived': DataEntityRole.LABEL
    }
)

Here are the schema's main properties (roles, types, feature importance and descriptive statistics):

In [None]:
properties = [entity.get_properties() for entity in entities_collection]

pd.DataFrame(properties)[['name', 'type', 'role', 'feature_importance', 'summary']]

Unnamed: 0,name,type,role,feature_importance,summary
0,record_id,Numeric,id,0.0,"{'statistics': {'missing_values': 0.0, 'outlie..."
1,sex,Categorical,feature,74.11,"{'statistics': {'missing_values': 0.0, 'new_va..."
2,age,Numeric,feature,5.14,"{'statistics': {'missing_values': 0.0, 'outlie..."
3,fare,Numeric,feature,6.09,"{'statistics': {'missing_values': 0.0, 'outlie..."
4,pclass,Categorical,feature,3.01,"{'statistics': {'missing_values': 0.0, 'new_va..."
5,embarked,Categorical,feature,1.9,"{'statistics': {'missing_values': 0.0, 'new_va..."
6,home_dest,Categorical,feature,0.0,{'statistics': {'missing_values': 0.0}}
7,parch,Categorical,feature,3.12,"{'statistics': {'missing_values': 0.0, 'new_va..."
8,sibsp,Categorical,feature,6.64,"{'statistics': {'missing_values': 0.0, 'new_va..."
9,prediction,Boolean,prediction value,0.0,"{'statistics': {'missing_values': 0.0, 'new_va..."


### Create a *Version* object

As explained above, a *Version* represents a concrete ML model we are tracking.

A *Version* solves a *Model*

A *Version* has a *Baseline*

In [None]:
titanic_version = Version(
    model_id=my_model.id,
    name="1.0",
    data_entities=entities_collection,
)

my_version = sw.version.create(titanic_version)

In [None]:
sw.version.activate(my_version.id)

<Response [204]>

## 🩺 Part III - monitoring ongoing predictions

Now that we have a *Version* of the model set up with a *Baseline*, we can start sending ongoing model predictions to Superwise to monitor the model's performance in a production setting.

For this demo, we will treat the Test split of the data as our "ongoing predictions".


In [None]:
predictions = endpoint.predict(instances=X_test.to_numpy().tolist())

In [None]:
ongoing_predictions = X_test.copy()
ongoing_predictions['prediction'] = list(predictions.predictions)
ongoing_predictions['ts'] = pd.Timestamp.now()
ongoing_predictions

Unnamed: 0,record_id,sex,age,fare,pclass,embarked,home_dest,parch,sibsp,prediction,ts
0,1148,male,35.000000,7.1250,3,S,?,0,0,0.0,2021-12-22 19:03:22.415855
1,1049,male,20.000000,15.7417,3,C,?,1,1,0.0,2021-12-22 19:03:22.415855
2,982,male,29.881135,7.8958,3,S,?,0,0,0.0,2021-12-22 19:03:22.415855
3,808,male,29.881135,8.0500,3,S,"Bridgwater, Somerset, England",0,0,0.0,2021-12-22 19:03:22.415855
4,1195,male,29.881135,7.7500,3,Q,?,0,0,0.0,2021-12-22 19:03:22.415855
...,...,...,...,...,...,...,...,...,...,...,...
388,325,male,30.000000,13.0000,2,S,"Bryn Mawr, PA, USA",0,0,0.0,2021-12-22 19:03:22.415855
389,919,male,18.500000,7.2292,3,C,?,0,0,0.0,2021-12-22 19:03:22.415855
390,532,male,41.000000,13.0000,2,S,?,0,0,0.0,2021-12-22 19:03:22.415855
391,1159,female,29.881135,8.0500,3,S,?,0,0,1.0,2021-12-22 19:03:22.415855


In [None]:
transaction_id = sw.transaction.log_records(
    model_id=my_model.id,
    version_id=my_version.name,
    records=ongoing_predictions.to_dict(orient='records')
)
print(transaction_id)

{'transaction_id': '0a50ac7a-635a-11ec-99d4-5acaded3d43d'}


In [None]:
# Fetch the transaction by its ID and check its processing status
transaction = sw.transaction.get(transaction_id=transaction_id['transaction_id'])
transaction.get_properties()['status']

'Passed'

### Optional - report ongoing labels to Superwise.ai

In some cases, our system is able to gather "ground truth" labels for its predictions.
Often, this happens later on, after the prediction was already given.

By sending these labels to Superwise.ai, we add another important layer of data to our monitoring solution.

For the purpose of this demo, we can use the test set's labels as the ground truth, simulating a label we collected in production.
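
The role of `record_id` in this step can be sketched without any SDK calls: labels that arrive later are matched to earlier predictions by ID. The function `match_labels` is hypothetical and only illustrates the join. It is not a description of how Superwise.ai correlates records internally. The sample values are taken from the tables in this notebook, with one label withheld to simulate a delay.

```python
def match_labels(predictions, labels):
    """Pair each prediction with its label by record_id.

    Returns (matched, unlabeled_ids): matched is a list of
    (record_id, prediction, label) tuples, and unlabeled_ids lists the
    predictions whose label has not arrived yet.
    """
    label_by_id = {row["record_id"]: row["survived"] for row in labels}
    matched, unlabeled_ids = [], []
    for row in predictions:
        rid = row["record_id"]
        if rid in label_by_id:
            matched.append((rid, row["prediction"], label_by_id[rid]))
        else:
            unlabeled_ids.append(rid)
    return matched, unlabeled_ids


logged_predictions = [
    {"record_id": 1148, "prediction": 0.0},
    {"record_id": 1049, "prediction": 0.0},
    {"record_id": 982, "prediction": 0.0},
    {"record_id": 808, "prediction": 0.0},
]
late_labels = [
    {"record_id": 1148, "survived": 0},
    {"record_id": 1049, "survived": 1},
    {"record_id": 982, "survived": 0},
]

matched, unlabeled_ids = match_labels(logged_predictions, late_labels)
labeled_accuracy = sum(pred == label for _, pred, label in matched) / len(matched)
print(matched, unlabeled_ids, labeled_accuracy)
```

Because the join key is the ID alone, labels can be reported long after the prediction was logged, as long as both sides use the same `record_id` values.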


In [None]:
# Note: we provide the column names we declared in the Schema object, 
# so that Superwise.ai will be able to interpret the data

ground_truth = pd.DataFrame(data=test, columns=['record_id', 'survived'])
ground_truth

Unnamed: 0,record_id,survived
1148,1148,0
1049,1049,1
982,982,0
808,808,0
1195,1195,0
...,...,...
325,325,0
919,919,0
532,532,0
1159,1159,1


In [None]:
transaction_id = sw.transaction.log_records(
    model_id=my_model.id,
    records=ground_truth.to_dict(orient='records')
)
print(transaction_id)

{'transaction_id': '651bee30-635a-11ec-ad4d-426f0aedd514'}


In [None]:
# Fetch the transaction by its ID and check its processing status
transaction = sw.transaction.get(transaction_id=transaction_id['transaction_id'])
transaction.get_properties()['status']

'Passed'

## Undeploy the model

To undeploy your `Model` resource from the serving `Endpoint` resource, use the endpoint's `undeploy` method with the following parameter:

- `deployed_model_id`: The model deployment identifier returned by the endpoint service when the `Model` resource was deployed. You can retrieve the deployed models using the endpoint's `deployed_models` property.

Since this is the only deployed model on the `Endpoint` resource, you can omit `traffic_split`.

In [None]:
deployed_model_id = endpoint.list_models()[0].id
endpoint.undeploy(deployed_model_id=deployed_model_id)

## 🗑️ Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

- Training Job
- Model
- Endpoint
- Cloud Storage Bucket

In [None]:
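delete_training_job = True
delete_model = True
delete_endpoint = True

# Warning: Setting this to true will delete everything in your bucket
delete_bucket = False

# Delete the training job
if delete_training_job:
    job.delete()

# Delete the model
if delete_model:
    model.delete()

# Delete the endpoint
if delete_endpoint:
    endpoint.delete()

# Delete the Cloud Storage bucket and its contents
if delete_bucket:
    ! gsutil -m rm -r gs://$BUCKET_NAME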