# Model Packaging Example

## Before Everything

### Install `snowflake-ml-python` locally

Until `snowflake-ml-python` is publicly available, you have to install it from a wheel file. Once it is released, you can install it like any other package with pip or conda.

In [None]:
%pip install snowflake_ml_python-0.3.2-py3-none-any.whl

Note: A pure pip environment or an empty conda environment is recommended when you try this. If you need to install SnowML into a conda environment that already contains packages, first install all of its requirements, then install `snowflake-ml-python` with the `--no-deps` flag.

If you plan to go through the **Use with customize model** part of this notebook, you will also need tensorflow and transformers, which can be installed with the following command.

In [None]:
%pip install snowflake_ml_python-0.3.2-py3-none-any.whl[tensorflow] transformers==4.24.0

### Setup Notebook

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# Scale cell width with the browser window to accommodate .show() commands for wider tables.
from IPython.display import display, HTML

display(HTML("<style>.container { width:100% !important; }</style>"))

### Start Snowpark Session

To avoid exposing credentials in GitHub, we use a small utility, `SnowflakeLoginOptions`. It allows you to store your default credentials in `~/.snowsql/config` in the following format:
```
[connections]
accountname = <string>   # Account identifier to connect to Snowflake.
username = <string>      # User name in the account. Optional.
password = <string>      # User password. Optional.
dbname = <string>        # Default database. Optional.
schemaname = <string>    # Default schema. Optional.
warehousename = <string> # Default warehouse. Optional.
#rolename = <string>      # Default role. Optional.
#authenticator = <string> # Authenticator: 'snowflake', 'externalbrowser', etc
```
See the [SnowSQL documentation on default connection settings](https://docs.snowflake.com/en/user-guide/snowsql-start.html#configuring-default-connection-settings) for more details.
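For illustration, a file in this format can be read with Python's standard `configparser`. This is only a sketch of the file layout, not a description of how `SnowflakeLoginOptions` is implemented; the helper name `read_connections` and the sample values are hypothetical.

```python
import configparser


def read_connections(config_text: str) -> dict:
    """Parse the [connections] section of a SnowSQL-style config into a dict.

    Hypothetical helper for illustration; not part of snowflake-ml-python.
    """
    # Treat " # ..." after a value as an inline comment, as in the format above.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(config_text)
    return dict(parser["connections"])


# Placeholder values; lines starting with "#" are ignored as comments.
sample = """
[connections]
accountname = my_account   # placeholder value
username = my_user
#rolename = my_role
"""

params = read_connections(sample)
print(sorted(params))
```

Because inline comments are enabled in this sketch, a value that itself contains ` #` would be cut short, so it is only suitable for simple values.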

In [None]:
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
from snowflake.snowpark import Session

session = Session.builder.configs(SnowflakeLoginOptions()).create()

### Make `snowflake-ml-python` available to your deployed models

Since `snowflake-ml-python` is not yet available in the Anaconda channel, we have to import it manually so that it can be used when the model is deployed to Snowflake. To avoid uploading it again and again, we can set up a temporary stage and upload the wheel file there.

In [None]:
SNOW_ML_WHEEL_LOCAL_PATH = "~/snowml/bazel-bin/snowflake/ml/snowflake_ml_python-0.3.3-py3-none-any.whl"

In [None]:
import os
from typing import Optional

def upload_snowml_to_tmp_stage(session: Session, wheel_path: str, stage_name: Optional[str] = None) -> str:
    """Upload the SnowML wheel file to a stage.

    Args:
        session: Snowpark session.
        wheel_path: Path to the local SnowML wheel file.
        stage_name: Stage to upload to. Defaults to the session stage.

    Returns:
        The stage path to the uploaded wheel file.
    """
    if stage_name is None:
        stage_name = session.get_session_stage()
    _ = session.file.put(wheel_path, stage_name, auto_compress=False, overwrite=True)
    whl_filename = os.path.basename(wheel_path)
    return f"{stage_name}/{whl_filename}"

In [None]:
SNOW_ML_WHEEL_STAGE_PATH = upload_snowml_to_tmp_stage(session, SNOW_ML_WHEEL_LOCAL_PATH)
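
The returned path is simply the stage name joined with the wheel's file name. A minimal sketch of that composition, runnable without a Snowflake session, is shown below; `stage_path_for`, the example wheel path, and the stage name `@my_stage` are hypothetical and used only for illustration.

```python
import os


def stage_path_for(wheel_path: str, stage_name: str) -> str:
    """Return the stage path a wheel would have after upload.

    Mirrors the return expression of upload_snowml_to_tmp_stage,
    without contacting Snowflake.
    """
    return f"{stage_name}/{os.path.basename(wheel_path)}"


# Hypothetical local path and stage name.
example_path = stage_path_for("~/snowml/dist/example-0.0.0-py3-none-any.whl", "@my_stage")
print(example_path)
```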

### Open/Create Model Registry

A model registry needs to be created before it can be used. Creating it adds a new database to the current account, so the active role needs permission to create a database. After the first creation, the model registry can be opened without creating it again.

In [None]:
REGISTRY_DATABASE_NAME = "TEMP"
REGISTRY_SCHEMA_NAME = "WZHAO"

In [None]:
from snowflake.ml.registry import model_registry
model_registry.create_model_registry(session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME)
registry = model_registry.ModelRegistry(session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME)

## Use with scikit-learn model

### Train A Small Scikit-learn Model

The cell below trains a small model for demonstration purposes. The nature of the model does not matter; it is used purely to demonstrate the Model Packaging and Registry.

In [None]:
from sklearn import svm
from sklearn.datasets import load_digits

digits = load_digits()
target_digit = 6
num_training_examples = 10
svc_gamma = 0.001
svc_C = 10.0

clf = svm.SVC(gamma=svc_gamma, C=svc_C, probability=True)


def one_vs_all(dataset, digit):
    return [x == digit for x in dataset]


# Train a classifier using num_training_examples and use the last 100 examples for test.
train_features = digits.data[:num_training_examples]
train_labels = one_vs_all(digits.target[:num_training_examples], target_digit)
clf.fit(train_features, train_labels)

test_features = digits.data[-100:]
test_labels = one_vs_all(digits.target[-100:], target_digit)
prediction = clf.predict(test_features)

In [None]:
print(prediction[:10])

`SVC` has multiple methods, for example `predict_proba`.

In [None]:
prediction_proba = clf.predict_proba(test_features)
print(prediction_proba[:10])

### Register Model

The call to `log_model` executes a few steps:
1. The given model object is serialized and uploaded to a stage.
1. An entry in the Model Registry is created for the model, referencing the model stage location.
1. Additional metadata is updated for the model as provided in the call.

For the serialization to work, the model object needs to be serializable in Python.

Also, you have to provide sample input data so that the model signature can be inferred for you, or you can specify the model signature manually.
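
One quick local check for the serialization requirement is a round trip through Python's `pickle` module. This is a sketch under the assumption that a pickle round trip is a reasonable proxy; the notebook does not say which serializer the registry uses, and `is_picklable` is a hypothetical helper.

```python
import pickle


def is_picklable(obj: object) -> bool:
    """Return True if obj survives a pickle dump/load round trip."""
    try:
        pickle.loads(pickle.dumps(obj))
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# Plain data structures can be pickled.
print(is_picklable({"gamma": 0.001, "C": 10.0}))
# Lambdas are pickled by reference and cannot be looked up by name.
print(is_picklable(lambda x: x))
```

With the classifier trained above, the same check would be `is_picklable(clf)`.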

In [None]:
SVC_MODEL_NAME="SIMPLE_SVC_MODEL"
SVC_MODEL_VERSION="2"

In [None]:
# A name and model tags can be added to the model at registration time.
model_id = registry.log_model(
    model_name=SVC_MODEL_NAME,
    model_version=SVC_MODEL_VERSION,
    model=clf,
    tags={"stage": "testing", "classifier_type": "svm.SVC", "svc_gamma": svc_gamma, "svc_C": svc_C},
    sample_input_data=test_features[:10],
)

# The object API can be used to reference a model after creation.
model = model_registry.ModelReference(registry=registry, model_name=SVC_MODEL_NAME, model_version=SVC_MODEL_VERSION)
print("Registered new model:", model_id)

### Load Model

We can also restore the model we saved to the registry and load it back into the local context to make predictions.

In [None]:
import numpy as np

registry = model_registry.ModelRegistry(
    session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME
)
model = model_registry.ModelReference(registry=registry, model_name=SVC_MODEL_NAME, model_version=SVC_MODEL_VERSION)
restored_clf = model.load_model()

restored_prediction = restored_clf.predict(test_features)

print("Original prediction:", prediction[:10])
print("Restored prediction:", restored_prediction[:10])

print("Result comparison:", np.array_equal(prediction, restored_prediction))

In [None]:
restored_prediction_proba = restored_clf.predict_proba(test_features)

print("Original prediction:", prediction_proba[:10])
print("Restored prediction:", restored_prediction_proba[:10])

print("Result comparison:", np.array_equal(prediction_proba, restored_prediction_proba))

### Deploy Model and Batch Inference

We can also deploy the model we saved in the registry to a warehouse and run predictions there.

Although the model may contain multiple methods, every deployment can only have one target method, and you need to specify that when you deploy the model.

Also, since `snowflake-ml-python` is not yet available in the Anaconda channel, we have to import it manually through the deployment options. This will no longer be required once the package is in the Snowflake Anaconda channel.
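
Conceptually, each deployment binds one name to one method of the model. The sketch below illustrates that idea in plain Python; `ToyModel` and `bind_target_method` are hypothetical and do not reflect how `deploy` is implemented.

```python
from typing import Callable, List


class ToyModel:
    """Stand-in for a model exposing more than one inference method."""

    def predict(self, rows: List[float]) -> List[bool]:
        return [x > 0.5 for x in rows]

    def predict_proba(self, rows: List[float]) -> List[float]:
        # Clamp each value into the [0, 1] range.
        return [min(max(x, 0.0), 1.0) for x in rows]


def bind_target_method(model: object, target_method: str) -> Callable:
    """Look up a single method by name, failing early if it does not exist."""
    method = getattr(model, target_method, None)
    if not callable(method):
        raise ValueError(f"Model has no callable method {target_method!r}")
    return method


# Two deployments of the same model, each bound to exactly one method.
toy = ToyModel()
deployments = {
    "toy_predict": bind_target_method(toy, "predict"),
    "toy_predict_proba": bind_target_method(toy, "predict_proba"),
}
print(deployments["toy_predict"]([0.2, 0.9]))
print(deployments["toy_predict_proba"]([0.2, 1.5]))
```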

In [None]:
registry = model_registry.ModelRegistry(
    session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME
)
model = model_registry.ModelReference(registry=registry, model_name=SVC_MODEL_NAME, model_version=SVC_MODEL_VERSION)
model.deploy(
    deployment_name="svc_model_predict",
    target_method="predict",
    options={"_snowml_wheel_path": SNOW_ML_WHEEL_STAGE_PATH},
)

In [None]:
remote_prediction = model.predict(deployment_name="svc_model_predict", data=test_features)

print("Remote prediction:", remote_prediction[:10])

print("Result comparison:", np.array_equal(prediction, remote_prediction["feature_0"].values))

We can also deploy another method to the warehouse.

In [None]:
model.deploy(
    deployment_name="svc_model_predict_proba",
    target_method="predict_proba",
    options={"_snowml_wheel_path": SNOW_ML_WHEEL_STAGE_PATH},
)

In [None]:
remote_prediction_proba = model.predict(deployment_name="svc_model_predict_proba", data=test_features)

print("Remote prediction:", remote_prediction_proba[:10])

print("Result comparison:", np.array_equal(prediction_proba, remote_prediction_proba.values))
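
`np.array_equal` requires exact equality, which can be stricter than needed for floating-point probabilities computed in a different environment. A tolerance-based check is one alternative. The sketch below uses only the standard library; `rows_close` and its default tolerance are hypothetical choices, and with NumPy arrays `np.allclose` serves the same purpose.

```python
import math
from typing import Sequence


def rows_close(
    a: Sequence[Sequence[float]],
    b: Sequence[Sequence[float]],
    abs_tol: float = 1e-9,
) -> bool:
    """Return True if two 2-D tables have the same shape and element-wise close values."""
    if len(a) != len(b):
        return False
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            return False
        if not all(math.isclose(x, y, abs_tol=abs_tol) for x, y in zip(row_a, row_b)):
            return False
    return True


# Made-up probabilities that differ only by a tiny floating-point offset.
local = [[0.9, 0.1], [0.25, 0.75]]
remote = [[0.9, 0.1 + 1e-12], [0.25, 0.75]]
print(local == remote)            # exact comparison
print(rows_close(local, remote))  # tolerance-based comparison
```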

## Use with customize model

A custom model can do much more than what is shown above.

### Download a GPT-2 model

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

### Store GPT-2 Model components locally

In [None]:
ARTIFACTS_DIR = "/tmp/gpt-2/"

In [None]:
import os

os.makedirs(os.path.join(ARTIFACTS_DIR, "model"), exist_ok=True)
os.makedirs(os.path.join(ARTIFACTS_DIR, "tokenizer"), exist_ok=True)

model.save_pretrained(os.path.join(ARTIFACTS_DIR, "model"))
tokenizer.save_pretrained(os.path.join(ARTIFACTS_DIR, "tokenizer"))

### Create a custom model using GPT-2

In [None]:
from snowflake.ml.model import custom_model
import pandas as pd


class GPT2Model(custom_model.CustomModel):
    def __init__(self, context: custom_model.ModelContext) -> None:
        super().__init__(context)

        self.model = AutoModelForCausalLM.from_pretrained(self.context.path("model"))
        self.tokenizer = AutoTokenizer.from_pretrained(self.context.path("tokenizer"))

    @custom_model.inference_api
    def predict(self, X: pd.DataFrame) -> pd.DataFrame:
        def _generate(input_text: str) -> str:
            input_ids = self.tokenizer.encode(input_text, return_tensors="pt")

            output = self.model.generate(input_ids, max_length=50, do_sample=True, top_p=0.95, top_k=60)
            generated_text = self.tokenizer.decode(output[0], skip_special_tokens=True)

            return generated_text

        res_df = pd.DataFrame({"output": pd.Series.apply(X["input"], _generate)})
        return res_df

In [None]:
gpt_model = GPT2Model(custom_model.ModelContext(models={}, artifacts={
    "model":os.path.join(ARTIFACTS_DIR, "model"),
    "tokenizer":os.path.join(ARTIFACTS_DIR, "tokenizer")
}))

gpt_model.predict(pd.DataFrame({"input":["Hello, are you GPT?"]}))

### Register the custom model

This section shows how to specify dependencies and the model signature manually.

In [None]:
GPT2_MODEL_NAME = "GPT2_MODEL"
GPT2_MODEL_VERSION = "2"

In [None]:
from snowflake.ml.model import model_signature

model_id_gpt = registry.log_model(
    model_name=GPT2_MODEL_NAME,
    model_version=GPT2_MODEL_VERSION,
    model=gpt_model,
    conda_dependencies=["tensorflow", "transformers"],
    signatures={
        "predict": model_signature.ModelSignature(
            inputs=[model_signature.FeatureSpec(name="input", dtype=model_signature.DataType.STRING)],
            outputs=[model_signature.FeatureSpec(name="output", dtype=model_signature.DataType.STRING)],
        )
    },
)

gpt_model = model_registry.ModelReference(registry=registry, model_name=GPT2_MODEL_NAME, model_version=GPT2_MODEL_VERSION)
print("Registered new model:", model_id_gpt)

### Deploy the model and predict

`relax_version` is an option that lets the deployer try to relax the version specifications when the initial attempt to
resolve the dependencies in the Snowflake Anaconda channel fails.
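
As an illustration of what relaxing a version specification could mean, the sketch below drops exact `==` pins from requirement strings. The notebook does not describe the deployer's actual relaxation strategy, so this is an assumption, and `relax_pins` is a hypothetical helper.

```python
import re
from typing import List

# Matches an exact pin such as "==4.24.0" at the end of a requirement string.
_EXACT_PIN = re.compile(r"\s*==\s*[\w.*+!-]+\s*$")


def relax_pins(requirements: List[str]) -> List[str]:
    """Remove exact version pins, leaving other specifiers untouched."""
    return [_EXACT_PIN.sub("", req) for req in requirements]


relaxed = relax_pins(["transformers==4.24.0", "tensorflow", "xgboost>=1.7"])
print(relaxed)
```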

In [None]:
registry = model_registry.ModelRegistry(
    session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME
)
gpt_model = model_registry.ModelReference(
    registry=registry,
    model_name=GPT2_MODEL_NAME,
    model_version=GPT2_MODEL_VERSION,
)
gpt_model.deploy(
    deployment_name="gpt_model_predict",
    target_method="predict",
    options={"relax_version": True, "_snowml_wheel_path": SNOW_ML_WHEEL_STAGE_PATH},
)

In [None]:
res = gpt_model.predict(deployment_name="gpt_model_predict", data=pd.DataFrame({"input":["Hello, are you GPT?"]}))

In [None]:
print(res)

## Use with XGBoost Model, Snowpark DataFrame and permanent deployment

### Prepare a stage for permanent UDF deployment

A non-temporary Snowflake internal stage is required to permanently deploy a model as a UDF. For now we have to create it manually, but it will eventually be managed by the model registry.

In [None]:
PERMANENT_UDF_STAGE_NAME = "SNOWML_MODEL_UDF_DEPLOYMENT"

In [None]:
session.sql(f"CREATE OR REPLACE STAGE {PERMANENT_UDF_STAGE_NAME}").collect()
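
Note that `CREATE OR REPLACE STAGE` recreates the stage, so rerunning the cell discards files already uploaded to it. If that is not desired, `CREATE STAGE IF NOT EXISTS` is an alternative. The sketch below only builds the SQL string; `create_stage_sql` is a hypothetical helper, and its name check covers simple unquoted identifiers only.

```python
import re

# Unquoted identifier: starts with a letter or underscore, then letters, digits, "_" or "$".
_UNQUOTED_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def create_stage_sql(stage_name: str, replace: bool = False) -> str:
    """Build a CREATE STAGE statement for a simple, unquoted stage name."""
    if not _UNQUOTED_IDENTIFIER.fullmatch(stage_name):
        raise ValueError(f"Unsupported stage name: {stage_name!r}")
    if replace:
        return f"CREATE OR REPLACE STAGE {stage_name}"
    return f"CREATE STAGE IF NOT EXISTS {stage_name}"


print(create_stage_sql("SNOWML_MODEL_UDF_DEPLOYMENT"))
```

In the notebook, the resulting string could be passed to `session.sql(...).collect()` in place of the f-string above.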

To make the deployment permanent, every dependency must be put into a permanent stage as well. This will no longer be necessary once `snowflake-ml-python` is available in the Snowflake Anaconda channel.

In [None]:
SNOW_ML_WHEEL_STAGE_PATH = upload_snowml_to_tmp_stage(session, SNOW_ML_WHEEL_LOCAL_PATH, f"@{PERMANENT_UDF_STAGE_NAME}")

### Prepare dataset

In [None]:
from sklearn.datasets import fetch_kddcup99

DATA_TABLE_NAME = "KDDCUP99_DATASET"

kddcup99_data = fetch_kddcup99(as_frame=True)
kddcup99_sp_df = session.create_dataframe(kddcup99_data.frame)
kddcup99_sp_df.write.mode("overwrite").save_as_table(DATA_TABLE_NAME)

### Preprocessing Dataset

In [None]:
from snowflake.ml.preprocessing import one_hot_encoder, ordinal_encoder, standard_scaler
import snowflake.snowpark.functions as F

quote_fn = lambda x: f'"{x}"'

ONE_HOT_ENCODE_COL_NAMES = ["protocol_type", "service", "flag"]
ORDINAL_ENCODE_COL_NAMES = ["labels"]
STANDARD_SCALER_COL_NAMES = [
    "duration",
    "src_bytes",
    "dst_bytes",
    "wrong_fragment",
    "urgent",
    "hot",
    "num_failed_logins",
    "num_compromised",
    "num_root",
    "num_file_creations",
    "num_shells",
    "num_access_files",
    "num_outbound_cmds",
    "count",
    "srv_count",
    "dst_host_count",
    "dst_host_srv_count",
]

TRAIN_SIZE_K = 0.2
kddcup99_data = session.table(DATA_TABLE_NAME)
kddcup99_data = kddcup99_data.with_columns(
    list(map(quote_fn, ONE_HOT_ENCODE_COL_NAMES + ORDINAL_ENCODE_COL_NAMES)),
    [
        F.to_char(col_name, "utf-8")
        for col_name in list(map(quote_fn, ONE_HOT_ENCODE_COL_NAMES + ORDINAL_ENCODE_COL_NAMES))
    ],
)
kddcup99_sp_df_train, kddcup99_sp_df_test = tuple(
    kddcup99_data.random_split([TRAIN_SIZE_K, 1 - TRAIN_SIZE_K], seed=2568)
)

ft_one_hot_encoder = one_hot_encoder.OneHotEncoder(
    handle_unknown="ignore",
    input_cols=list(map(quote_fn, ONE_HOT_ENCODE_COL_NAMES)),
    output_cols=ONE_HOT_ENCODE_COL_NAMES,
    drop_input_cols=True,
)
ft_one_hot_encoder = ft_one_hot_encoder.fit(kddcup99_sp_df_train)
kddcup99_sp_df_train = ft_one_hot_encoder.transform(kddcup99_sp_df_train)
kddcup99_sp_df_test = ft_one_hot_encoder.transform(kddcup99_sp_df_test)

ft_ordinal_encoder = ordinal_encoder.OrdinalEncoder(
    input_cols=list(map(quote_fn, ORDINAL_ENCODE_COL_NAMES)),
    output_cols=list(map(quote_fn, ORDINAL_ENCODE_COL_NAMES)),
    drop_input_cols=True,
)
ft_ordinal_encoder = ft_ordinal_encoder.fit(kddcup99_sp_df_train)
kddcup99_sp_df_train = ft_ordinal_encoder.transform(kddcup99_sp_df_train)
kddcup99_sp_df_test = ft_ordinal_encoder.transform(kddcup99_sp_df_test)

ft_standard_scaler = standard_scaler.StandardScaler(
    input_cols=list(map(quote_fn, STANDARD_SCALER_COL_NAMES)),
    output_cols=list(map(quote_fn, STANDARD_SCALER_COL_NAMES)),
    drop_input_cols=True,
)
ft_standard_scaler = ft_standard_scaler.fit(kddcup99_sp_df_train)
kddcup99_sp_df_train = ft_standard_scaler.transform(kddcup99_sp_df_train)
kddcup99_sp_df_test = ft_standard_scaler.transform(kddcup99_sp_df_test)
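
The `quote_fn` helper in the cell above wraps column names in double quotes. Snowflake resolves unquoted identifiers as uppercase, so lowercase column names such as `protocol_type` need to be quoted to be matched exactly. A slightly more defensive variant, shown here as a sketch, also escapes embedded double quotes by doubling them; `quote_identifier` is a hypothetical name.

```python
def quote_identifier(name: str) -> str:
    """Double-quote a column name, doubling any embedded double quotes."""
    escaped = name.replace('"', '""')
    return f'"{escaped}"'


print(quote_identifier("protocol_type"))
print(quote_identifier('odd"name'))
```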


### Train an XGBoost model

In [None]:
XGB_MODEL_NAME = "XGB_MODEL_KDDCUP99"
XGB_MODEL_VERSION = "2"

In [None]:
import xgboost

regressor = xgboost.XGBClassifier(objective="multi:softprob", n_estimators=500, reg_lambda=1, gamma=0, max_depth=5)
kddcup99_pd_df_train = kddcup99_sp_df_train.to_pandas()
regressor.fit(
    kddcup99_pd_df_train.drop(columns=["labels"]),
    kddcup99_pd_df_train["labels"],
)

### Log the model

In [None]:
from snowflake.ml.model import model_signature

registry = model_registry.ModelRegistry(
    session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME
)
# A name and model tags can be added to the model at registration time.
model_id_xgb = registry.log_model(
    model_name=XGB_MODEL_NAME,
    model_version=XGB_MODEL_VERSION,
    model=regressor,
    sample_input_data=kddcup99_sp_df_train.drop('"labels"'),
)

# The object API can be used to reference a model after creation.
xgb_model = model_registry.ModelReference(registry=registry, model_name=XGB_MODEL_NAME, model_version=XGB_MODEL_VERSION)
print("Registered new model:", model_id_xgb)

### Deploy the model permanently

In [None]:
registry = model_registry.ModelRegistry(
    session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME
)
xgb_model = model_registry.ModelReference(
    registry=registry,
    model_name=XGB_MODEL_NAME,
    model_version=XGB_MODEL_VERSION,
)
xgb_model.deploy(
    deployment_name="xgb_model_predict",
    target_method="predict",
    options={
        "relax_version": True,
        "permanent_udf_stage_location": f"@{PERMANENT_UDF_STAGE_NAME}",
        "_snowml_wheel_path": SNOW_ML_WHEEL_STAGE_PATH,
    },
)

### Predict with Snowpark DataFrame

In [None]:
sp_res = xgb_model.predict(deployment_name="xgb_model_predict", data=kddcup99_sp_df_test)
sp_res.show()

### Prepare another SQL connection

In [None]:
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
from snowflake.snowpark import Session

another_session = Session.builder.configs(SnowflakeLoginOptions()).create()

### Call the deployed permanent UDF

In [None]:
registry._session = another_session  # Workaround: the registry does not yet manage permanent deployments.
xgb_model = model_registry.ModelReference(
    registry=registry,
    model_name=XGB_MODEL_NAME,
    model_version=XGB_MODEL_VERSION,
)
sp_res = xgb_model.predict(
    deployment_name="xgb_model_predict", data=another_session.create_dataframe(kddcup99_sp_df_test.to_pandas())
)
sp_res.show()

### Remove the deployed UDF

Eventually, this will be done by calling `delete_deployment` on the registry. For now, drop the function manually.

In [None]:
session.sql("DROP FUNCTION xgb_model_predict(object)").collect()