# Using MODEL via Registry in Snowflake


## Before Everything


### Snowflake-ML-Python Installation


- Please refer to our [landing page](https://docs.snowflake.com/en/developer-guide/snowpark-ml/index) to install the latest version of `snowflake-ml-python`.


### Setup Notebook


In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from IPython.display import display, HTML

display(HTML("<style>.container { width:100% !important; }</style>"))

### Start Snowpark Session

To avoid exposing credentials in GitHub, we use a small utility, `SnowflakeLoginOptions`. It lets you store your default credentials in `~/.snowsql/config` in the following format:

```
[connections]
accountname = <string>   # Account identifier to connect to Snowflake.
username = <string>      # User name in the account. Optional.
password = <string>      # User password. Optional.
dbname = <string>        # Default database. Optional.
schemaname = <string>    # Default schema. Optional.
warehousename = <string> # Default warehouse. Optional.
#rolename = <string>      # Default role. Optional.
#authenticator = <string> # Authenticator: 'snowflake', 'externalbrowser', etc
```

See the [SnowSQL documentation](https://docs.snowflake.com/en/user-guide/snowsql-start.html#configuring-default-connection-settings) for more details.
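
As a rough illustration of the file layout above (not of how `SnowflakeLoginOptions` is implemented, which this notebook does not show), the `[connections]` section can be read with Python's standard `configparser`. The sample values and the helper name `read_connection_options` are hypothetical.

```python
import configparser

# Hypothetical sample contents for ~/.snowsql/config; all values are placeholders.
SAMPLE_CONFIG = """
[connections]
accountname = my_account
username = my_user
dbname = MY_REGISTRY
schemaname = PUBLIC
#rolename = my_role
"""


def read_connection_options(text: str) -> dict:
    """Parse the [connections] section into a plain dict of strings."""
    # Treat " # ..." after a value as a comment, as in the format shown above.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(text)
    return dict(parser["connections"])


options = read_connection_options(SAMPLE_CONFIG)
print(options)
```

Commented-out keys such as `rolename` are skipped, so only the uncommented settings end up in the resulting dictionary.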


In [None]:
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
from snowflake.snowpark import Session

session = Session.builder.configs(SnowflakeLoginOptions()).create()

### Open A Registry


To start, we need to open a registry, either in a given **pre-created** database and schema or in the schema your session is actively using.


In [None]:
REGISTRY_DATABASE_NAME = "MY_REGISTRY"
REGISTRY_SCHEMA_NAME = "PUBLIC"

In [None]:
from snowflake.ml.registry import registry

reg = registry.Registry(session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME)

## Walkthrough Registry with a Small Model


### Train a small model


The cell below trains a small model for demonstration purposes. The nature of the model does not matter; it is used purely to demonstrate the Registry.


In [None]:
from sklearn import svm, datasets

digits = datasets.load_digits()
target_digit = 6
num_training_examples = 10
svc_gamma = 0.001
svc_C = 10.0

clf = svm.SVC(gamma=svc_gamma, C=svc_C, probability=True)


def one_vs_all(dataset, digit):
    return [x == digit for x in dataset]


# Train a classifier on the first num_training_examples examples and use the last 100 examples for testing.
train_features = digits.data[:num_training_examples]
train_labels = one_vs_all(digits.target[:num_training_examples], target_digit)
clf.fit(train_features, train_labels)

test_features = digits.data[-100:]
test_labels = one_vs_all(digits.target[-100:], target_digit)
prediction = clf.predict(test_features)

### Log the model


To keep the model for future use, we need to log it. We pass a model name and a version name to the following API, and a SQL MODEL object is created on your behalf.


In [None]:
model_name = "my_model"
version_name = "v1"

In [None]:
mv = reg.log_model(clf, model_name=model_name, version_name=version_name, sample_input_data=train_features)

### Run the model


Once logged, the model is ready to run in Snowflake on a warehouse!


In [None]:
remote_prediction = mv.run(test_features, function_name="predict")

In [None]:
import numpy as np

print("Remote prediction:", remote_prediction[:10])

print("Result comparison:", np.array_equal(prediction, remote_prediction["output_feature_0"].values))

All methods available in the original model can be run.


In [None]:
mv.show_functions()

In [None]:
remote_prediction_proba = mv.run(test_features, function_name="predict_proba")

In [None]:
prediction_proba = clf.predict_proba(test_features)

print("Remote prediction:", remote_prediction_proba[:10])

print("Result comparison:", np.allclose(prediction_proba, remote_prediction_proba.values))
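
The two comparisons above use different checks: `np.array_equal` for the boolean class predictions and `np.allclose` for the probabilities. A small sketch of why a tolerance-based check is the safer choice for floating-point values, using made-up numbers rather than model output:

```python
import numpy as np

# Made-up values: the same probabilities computed along two different paths.
local = np.array([0.1 + 0.2, 0.7])
remote = np.array([0.3, 0.7])

exact_match = np.array_equal(local, remote)  # element-by-element equality
close_match = np.allclose(local, remote)     # equality within a small tolerance

print("Exact:", exact_match)
print("Close:", close_match)
```

Class labels are discrete, so exact equality is appropriate for `predict`; probabilities may differ in the last few bits after a round trip, so a tolerance is used for `predict_proba`.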

### Get the model and version


After the model is logged, besides using the returned object, you can use other APIs to get an object for operating on the model or on a model version.


In [None]:
m = reg.get_model(model_name)

In [None]:
mv = m.version(version_name)

### Show and List models and versions

In [None]:
reg.show_models()

In [None]:
reg.models()

In [None]:
m.show_versions()

In [None]:
m.versions()

### Model Description


You can set the description of a model or of a specific version of the model. Descriptions are backed by the COMMENT feature in SQL.


In [None]:
m.description = "This is my model."
print(m.description)

In [None]:
reg.show_models()

In [None]:
mv.description = "This is the first version of my model."
print(mv.description)

In [None]:
m.show_versions()

### Model Metrics


Metrics are a type of metadata annotation that can be associated with a version of a model stored in the Registry. Metrics often take the form of scalars, but we also support more complex objects such as arrays or dictionaries, as long as they are JSON serializable. In the examples below, we add scalars, a dictionary, and a 2-dimensional NumPy array (converted to a nested list) as metrics.


In [None]:
from sklearn import metrics

test_accuracy = metrics.accuracy_score(test_labels, prediction)
print("Model test accuracy:", test_accuracy)

test_confusion_matrix = metrics.confusion_matrix(test_labels, prediction)
print("Confusion matrix:", test_confusion_matrix)

In [None]:
mv.set_metric(metric_name="test_accuracy", value=test_accuracy)

mv.set_metric(metric_name="num_training_examples", value=num_training_examples)

mv.set_metric(metric_name="dataset_test", value={"accuracy": test_accuracy})

mv.set_metric(metric_name="confusion_matrix", value=test_confusion_matrix.tolist())
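
The `.tolist()` call on the confusion matrix above is there because a NumPy array is not serializable by the standard `json` module, while a nested list is. A minimal sketch with a made-up matrix; whether the Registry uses `json` internally is an assumption here.

```python
import json

import numpy as np

# Made-up 2x2 confusion matrix standing in for test_confusion_matrix.
matrix = np.array([[90, 1], [2, 7]])

try:
    json.dumps(matrix)
    array_is_serializable = True
except TypeError:
    # The standard encoder does not know how to encode an ndarray.
    array_is_serializable = False

# Converting to nested Python lists makes the value serializable.
encoded = json.dumps(matrix.tolist())
decoded = json.loads(encoded)

print(array_is_serializable, decoded)
```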

In [None]:
mv.get_metric(metric_name="confusion_matrix")

In [None]:
mv.delete_metric(metric_name="confusion_matrix")

In [None]:
mv.show_metrics()

In [None]:
m.show_versions()

### Default version


You can set a default version of a model.


In [None]:
m.default = version_name

In [None]:
m.default

### TAG management

You can set a Snowflake TAG on a model. The TAG must be created before you set it. Below, we show how to use a TAG to label the live version of the model.

Note: When a tag is not set on a model, `get_tag` returns `None`.
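
One way to pre-create the tag is a `CREATE TAG` statement run through the session. The sketch below only builds the statement; the helper name `create_tag_sql` is hypothetical, and running the statement requires a role that is allowed to create tags in the target schema.

```python
def create_tag_sql(tag_name: str) -> str:
    """Build a statement that creates the tag only if it is missing."""
    if not tag_name.replace("_", "").isalnum():
        # Reject names that would need quoting; kept strict for this sketch.
        raise ValueError(f"unsupported tag name: {tag_name!r}")
    return f"CREATE TAG IF NOT EXISTS {tag_name}"


statement = create_tag_sql("live_version")
print(statement)

# With an open Snowpark session, the statement could be run as:
# session.sql(statement).collect()
```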

In [None]:
m.get_tag("live_version")

In [None]:
m.set_tag("live_version", version_name)

In [None]:
m.get_tag("live_version")

In [None]:
m.show_tags()

In [None]:
m.unset_tag("live_version")

In [None]:
m.show_tags()

### Delete model


In [None]:
reg.delete_model(model_name)

In [None]:
reg.show_models()

## Use with Snowpark ML Modeling Model and Snowpark DataFrame


### Prepare Dataset


In [None]:
DATA_TABLE_NAME = "KDDCUP99_DATASET"

kddcup99_data = datasets.fetch_kddcup99(as_frame=True)
kddcup99_sp_df = session.create_dataframe(kddcup99_data.frame)
kddcup99_sp_df.write.mode("overwrite").save_as_table(DATA_TABLE_NAME)

In [None]:
from snowflake.ml.modeling.preprocessing import one_hot_encoder, ordinal_encoder, standard_scaler
from snowflake.ml.modeling.pipeline import pipeline
from snowflake.ml.modeling.xgboost import xgb_classifier
import snowflake.snowpark.functions as F

quote_fn = lambda x: f'"{x}"'

ONE_HOT_ENCODE_COL_NAMES = ["protocol_type", "service", "flag"]
ORDINAL_ENCODE_COL_NAMES = ["labels"]
STANDARD_SCALER_COL_NAMES = [
    "duration",
    "src_bytes",
    "dst_bytes",
    "wrong_fragment",
    "urgent",
    "hot",
    "num_failed_logins",
    "num_compromised",
    "num_root",
    "num_file_creations",
    "num_shells",
    "num_access_files",
    "num_outbound_cmds",
    "count",
    "srv_count",
    "dst_host_count",
    "dst_host_srv_count",
]

TRAIN_SIZE_K = 0.2
kddcup99_data = session.table(DATA_TABLE_NAME)
kddcup99_data = kddcup99_data.with_columns(
    list(map(quote_fn, ONE_HOT_ENCODE_COL_NAMES + ORDINAL_ENCODE_COL_NAMES)),
    [
        F.to_char(col_name, "utf-8")
        for col_name in list(map(quote_fn, ONE_HOT_ENCODE_COL_NAMES + ORDINAL_ENCODE_COL_NAMES))
    ],
)
kddcup99_sp_df_train, kddcup99_sp_df_test = tuple(
    kddcup99_data.random_split([TRAIN_SIZE_K, 1 - TRAIN_SIZE_K], seed=2568)
)

pipe = pipeline.Pipeline(
    steps=[
        (
            "OHEHOT",
            one_hot_encoder.OneHotEncoder(
                handle_unknown="ignore",
                input_cols=list(map(quote_fn, ONE_HOT_ENCODE_COL_NAMES)),
                output_cols=ONE_HOT_ENCODE_COL_NAMES,
                drop_input_cols=True,
            ),
        ),
        (
            "ORDINAL",
            ordinal_encoder.OrdinalEncoder(
                input_cols=list(map(quote_fn, ORDINAL_ENCODE_COL_NAMES)),
                output_cols=['"encoded_labels"'],
                drop_input_cols=True,
            ),
        ),
        (
            "STD",
            standard_scaler.StandardScaler(
                input_cols=list(map(quote_fn, STANDARD_SCALER_COL_NAMES)),
                output_cols=list(map(quote_fn, STANDARD_SCALER_COL_NAMES)),
                drop_input_cols=True,
            ),
        ),
        ("CLASSIFIER", xgb_classifier.XGBClassifier(label_cols=['"encoded_labels"'])),
    ]
)
pipe.fit(kddcup99_sp_df_train)

In [None]:
model_name = "pipeline_model"
version_name = "v2"

In [None]:
mv = reg.log_model(pipe, model_name=model_name, version_name=version_name)

In [None]:
mv.run(kddcup99_sp_df_test, function_name="predict").show()

## Use with a custom model

### Download a GPT-2 model

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

### Store GPT-2 Model components locally

In [None]:
ARTIFACTS_DIR = "/tmp/gpt-2/"

In [None]:
import os

os.makedirs(os.path.join(ARTIFACTS_DIR, "model"), exist_ok=True)
os.makedirs(os.path.join(ARTIFACTS_DIR, "tokenizer"), exist_ok=True)

model.save_pretrained(os.path.join(ARTIFACTS_DIR, "model"))
tokenizer.save_pretrained(os.path.join(ARTIFACTS_DIR, "tokenizer"))

### Create a custom model using GPT-2

In [None]:
from snowflake.ml.model import custom_model
import pandas as pd


class GPT2Model(custom_model.CustomModel):
    def __init__(self, context: custom_model.ModelContext) -> None:
        super().__init__(context)

        self.model = AutoModelForCausalLM.from_pretrained(self.context.path("model"))
        self.tokenizer = AutoTokenizer.from_pretrained(self.context.path("tokenizer"))

    @custom_model.inference_api
    def predict(self, X: pd.DataFrame) -> pd.DataFrame:
        def _generate(input_text: str) -> str:
            input_ids = self.tokenizer.encode(input_text, return_tensors="pt")

            output = self.model.generate(input_ids, max_length=50, do_sample=True, top_p=0.95, top_k=60)
            generated_text = self.tokenizer.decode(output[0], skip_special_tokens=True)

            return generated_text

        res_df = pd.DataFrame({"output": X["input"].apply(_generate)})
        return res_df

In [None]:
gpt_model = GPT2Model(
    custom_model.ModelContext(
        models={},
        artifacts={
            "model": os.path.join(ARTIFACTS_DIR, "model"),
            "tokenizer": os.path.join(ARTIFACTS_DIR, "tokenizer"),
        },
    )
)

gpt_model.predict(pd.DataFrame({"input": ["Hello, are you GPT?"]}))
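
The `predict` method above follows a simple contract: it takes a DataFrame with an `input` column and returns a DataFrame with an `output` column of the same length. The sketch below checks that contract with a hypothetical stand-in generator (`_fake_generate`) so it can run without downloading GPT-2; it does not reproduce the model's actual output.

```python
import pandas as pd


def _fake_generate(input_text: str) -> str:
    """Hypothetical stand-in for GPT-2 generation: echoes the prompt."""
    return input_text + " ..."


def predict(X: pd.DataFrame) -> pd.DataFrame:
    # Apply the generator to each row of the "input" column.
    return pd.DataFrame({"output": X["input"].apply(_fake_generate)})


result = predict(pd.DataFrame({"input": ["Hello, are you GPT?", "Hi"]}))
print(result)
```

Testing the wrapper logic this way keeps DataFrame handling separate from model loading, which is the slow part of the real class.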

### Register the custom model

This example shows how to specify dependencies and the model signature manually.

In [None]:
model_name = "gpt2_medium"
version_name = "v1"

In [None]:
from snowflake.ml.model import model_signature

mv = reg.log_model(
    gpt_model,
    model_name=model_name,
    version_name=version_name,
    conda_dependencies=["pytorch", "transformers"],
    signatures={
        "predict": model_signature.ModelSignature(
            inputs=[model_signature.FeatureSpec(name="input", dtype=model_signature.DataType.STRING)],
            outputs=[model_signature.FeatureSpec(name="output", dtype=model_signature.DataType.STRING)],
        )
    },
)

In [None]:
mv.run(pd.DataFrame({"input": ["Hello, are you GPT?"]}))

## Use with Sentence Transformer model

In [None]:
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer('distilbert-base-nli-mean-tokens')

In [None]:
input_data = [
    "This is the first sentence.",
    "Here's another sentence for testing.",
    "The quick brown fox jumps over the lazy dog.",
    "I love coding and programming.",
    "Machine learning is an exciting field.",
    "Python is a popular programming language.",
    "I enjoy working with data.",
    "Deep learning models are powerful.",
    "Natural language processing is fascinating.",
    "I want to improve my NLP skills.",
]

In [None]:
mv = reg.log_model(embed_model, model_name="sentence_transformer", sample_input_data=input_data)

In [None]:
mv.run(input_data)