# Part 1. Load, Prep, Train, Register, Deploy and Scale in 50 Lines of Code

In this lab you will learn how to:

1. Create a session for Snowpark with Snowflake
2. Create a DB, Warehouse and Model Registry
3. Prep Data using the highly parallelisable vectorised UDTF functionality
4. Build/train a regression model with Snowpark ML
5. Register your model in the Model Registry
6. Run the model

All of this fits in 50 lines of code, not counting the library imports.

## Prerequisites:
In a terminal, run:

conda env create -f conda_env.yml

conda activate snowpark-ml-hol

jupyter lab

The last command starts JupyterLab. You can run the notebook elsewhere (for example, in VS Code), but JupyterLab is an easy option.

# 1.0 Imports
TO DO: just run the following cell

In [1]:
import json
import pandas as pd
from snowflake.snowpark.session import Session
import snowflake.snowpark.functions as F
from snowflake.snowpark.types import PandasDataFrameType, IntegerType, StringType, FloatType, DateType
from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.ml.modeling.linear_model import LinearRegression
from snowflake.ml.registry import registry
from snowflake.ml._internal.utils import identifier

# 1.1 Reading Snowflake Connection Details, create a Session

In [3]:
with open("/Users/mitaylor/Documents/creds/creds.json") as f:  # point this at your own credentials file
    snowflake_connection_cfg = json.load(f)
session = Session.builder.configs(snowflake_connection_cfg).create()
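
The credentials file is not shown in this lab. As a rough sketch, it is a flat JSON object of Snowpark connection parameters. The key set below (`account`, `user`, `password`) is an assumed minimum, and `missing_connection_keys` is a hypothetical helper, not part of Snowpark. Checking the dictionary before calling `Session.builder` gives a clearer error than a failed login.

```python
# Assumed minimal key set; your setup may need others such as
# "role", "warehouse", "database", "schema" or "authenticator".
REQUIRED_KEYS = {"account", "user", "password"}


def missing_connection_keys(cfg):
    """Return the required connection keys that are absent from cfg, sorted."""
    return sorted(REQUIRED_KEYS - cfg.keys())


# Placeholder values only; replace them with your own account details.
example_cfg = {
    "account": "<your_account_identifier>",
    "user": "<your_user>",
    "role": "<your_role>",
}

missing = missing_connection_keys(example_cfg)
if missing:
    print("Add these keys to the credentials file:", missing)
```

An empty list means the dictionary has every key in the assumed set and can be passed to `Session.builder.configs(...)`.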

# 1.2 Specify Your Database and Create a Virtual Warehouse

In [4]:
session.sql("CREATE OR REPLACE DATABASE MODEL_REGISTRY").collect()
session.sql("CREATE OR REPLACE SCHEMA PUBLIC").collect()
REGISTRY_DATABASE_NAME = "MODEL_REGISTRY"
REGISTRY_SCHEMA_NAME = "PUBLIC"
native_registry = registry.Registry(session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME)
session.sql("CREATE OR REPLACE DATABASE HOL_DEMO").collect()
session.sql("CREATE OR REPLACE WAREHOUSE ASYNC_WH WITH WAREHOUSE_SIZE='MEDIUM' WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'").collect()

[Row(status='Warehouse ASYNC_WH successfully created.')]

In [5]:
# This data load is deliberately lo-fi. Other import methods scale better, but this compact approach is fine for this task.
session.write_pandas(pd.read_csv("test.csv"), table_name='FS_DATASET', auto_create_table=True, overwrite=True)

<snowflake.snowpark.table.Table at 0x10f4dadd0>

# 1.3 Get Your Data (Prepped)

In [6]:
sdf = session.table("FS_DATASET")
sdf = sdf.select(F.to_date(F.col('DATE')).as_('DATE'), "OPEN", "HIGH", "LOW", "CLOSE", "SYMBOL")

In [7]:
class ML_Prep:
    def end_partition(self, df):
        df.columns = ['_DATE', "_OPEN", "_HIGH", "_LOW", "_CLOSE", "_SYMBOL"]
        for i in range(1,6):
            df["_CLOSE-" + str(i)] = df["_CLOSE"].shift(i).bfill()
        yield df

ML_Prep.end_partition._sf_vectorized_input = pd.DataFrame

ml_prep_udtf = session.udtf.register(
    ML_Prep, # the class
    name="ml_prep_udtf",
    input_types=[PandasDataFrameType([DateType(), FloatType(), FloatType(), FloatType(), FloatType(), StringType()])], 
    output_schema=PandasDataFrameType([DateType(), FloatType(), FloatType(), FloatType(), FloatType(), StringType(), FloatType(), FloatType(), FloatType(), FloatType(), FloatType()],
                                      ['DATE', "OPEN", "HIGH", "LOW", "CLOSE", "SYMBOL", "CLOSE_M1", "CLOSE_M2", "CLOSE_M3", "CLOSE_M4", "CLOSE_M5"]),
    packages=["snowflake-snowpark-python", 'pandas'])  


In [8]:
sdf_prepped = sdf.select(ml_prep_udtf(*["DATE", "OPEN", "HIGH", "LOW", "CLOSE", "SYMBOL"]).over(partition_by=['SYMBOL']))
sdf_prepped.limit(10).to_pandas()
sdf_prepped.write.save_as_table("ML_PREPPED", mode="overwrite")
sdf[['SYMBOL']].distinct().to_pandas()

| | SYMBOL |
|---|---|
| 0 | IBM |
| 1 | AMZN |
| 2 | FDS |
| 3 | META |
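
To see what `ML_Prep.end_partition` does to each symbol's partition, the same shift-and-backfill step can be tried locally on a small pandas DataFrame. This is only a sketch. `add_close_lags` is a hypothetical helper, the sample prices are made up, and the column names follow the `CLOSE_M1` to `CLOSE_M5` names from the UDTF output schema.

```python
import pandas as pd


def add_close_lags(df, n_lags=5):
    """Add CLOSE_M1..CLOSE_M{n_lags}: the close from 1..n_lags rows earlier.

    The first rows have no earlier close, so shift() leaves NaN there and
    bfill() replaces it with the first available value.
    """
    out = df.copy()  # leave the caller's frame untouched
    for i in range(1, n_lags + 1):
        out[f"CLOSE_M{i}"] = out["CLOSE"].shift(i).bfill()
    return out


# Made-up closing prices, already sorted by date.
sample = pd.DataFrame({"CLOSE": [10.0, 11.0, 12.0, 13.0]})
lagged = add_close_lags(sample, n_lags=2)
print(lagged)
```

Because the gaps are backfilled, the earliest rows of each partition repeat the first close instead of holding real history, which is worth remembering when reading predictions for those dates.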


# 1.4.1 Choose Your Symbol, Train/Test Split and Model

In [19]:
sdf_prepped_filt = sdf_prepped.filter((F.col("SYMBOL") == "IBM"))
sdf_filt_train, sdf_filt_test = sdf_prepped_filt.filter((F.col("DATE") <= '2021-01-01')), sdf_prepped_filt.filter((F.col("DATE") > '2021-01-01')) 
regressor = LinearRegression
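
The split above is by date rather than random: rows on or before the cutoff go to training and later rows go to testing, so the test set only contains dates the model has not seen. A plain-Python sketch of the same rule follows, with a hypothetical `split_by_date` helper and made-up rows.

```python
from datetime import date


def split_by_date(rows, cutoff):
    """Rows dated on or before the cutoff train the model; later rows test it."""
    train = [r for r in rows if r["DATE"] <= cutoff]
    test = [r for r in rows if r["DATE"] > cutoff]
    return train, test


# Made-up rows for illustration only.
rows = [
    {"DATE": date(2020, 12, 31), "CLOSE": 1.0},
    {"DATE": date(2021, 1, 1), "CLOSE": 2.0},
    {"DATE": date(2021, 1, 4), "CLOSE": 3.0},
]

train_rows, test_rows = split_by_date(rows, date(2021, 1, 1))
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

The two filters are complementary, so every row lands in exactly one of the two sets.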

# 1.4.2 Train Your Model

In [20]:
regressor = regressor(input_cols=["CLOSE_M1", "CLOSE_M2", "CLOSE_M3", "CLOSE_M4", "CLOSE_M5"], 
                         label_cols=["CLOSE"],
                         output_cols=["CLOSE_PREDICT"]) 
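# Note: this fits on every IBM row, including dates after 2021-01-01. Pass sdf_filt_train instead to keep the test period out of training.
regressor.fit(sdf_prepped_filt)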

<snowflake.ml.modeling.linear_model.linear_regression.LinearRegression at 0x182ba5a10>

# 1.5 Register Your Model

In [21]:
MODEL_NAME = "LR_TEST_MODEL"
MODEL_VERSION = "v2"

model = native_registry.log_model(
    model_name=MODEL_NAME,
    version_name=MODEL_VERSION,
    model=regressor,
)

# 1.6 Run Your Model

In [27]:
model.run(sdf_filt_test, function_name="predict").limit(5).to_pandas()

| | DATE | OPEN | HIGH | LOW | CLOSE | SYMBOL | CLOSE_M1 | CLOSE_M2 | CLOSE_M3 | CLOSE_M4 | CLOSE_M5 | CLOSE_PREDICT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-04 | 133.520004 | 133.610001 | 126.760002 | 129.410004 | IBM | 132.690002 | 133.720001 | 134.869995 | 136.690002 | 131.970001 | 133.065836 |
| 1 | 2021-01-05 | 128.889999 | 131.740005 | 128.429993 | 131.009995 | IBM | 129.410004 | 132.690002 | 133.720001 | 134.869995 | 136.690002 | 130.259945 |
| 2 | 2021-01-06 | 127.720001 | 131.050003 | 126.379997 | 126.599998 | IBM | 131.009995 | 129.410004 | 132.690002 | 133.720001 | 134.869995 | 131.12131 |
| 3 | 2021-01-07 | 128.360001 | 131.630005 | 127.860001 | 130.919998 | IBM | 126.599998 | 131.009995 | 129.410004 | 132.690002 | 133.720001 | 127.630447 |
| 4 | 2021-01-08 | 132.429993 | 132.630005 | 130.229996 | 132.050003 | IBM | 130.919998 | 126.599998 | 131.009995 | 129.410004 | 132.690002 | 130.722491 |
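
One quick way to read the output above is to compare `CLOSE` with `CLOSE_PREDICT` row by row. The sketch below copies the five displayed rows and computes their mean absolute error. `mean_absolute_error` here is a small hypothetical helper written for this example, and five rows are far too few for a real evaluation.

```python
# (CLOSE, CLOSE_PREDICT) pairs copied from the five rows shown above.
rows = [
    (129.410004, 133.065836),
    (131.009995, 130.259945),
    (126.599998, 131.12131),
    (130.919998, 127.630447),
    (132.050003, 130.722491),
]


def mean_absolute_error(pairs):
    """Average of |actual - predicted| over (actual, predicted) pairs."""
    return sum(abs(actual - predicted) for actual, predicted in pairs) / len(pairs)


mae = mean_absolute_error(rows)
print(f"MAE over the five displayed rows: {mae:.4f}")
```

For a fuller picture, the same calculation could be run over the whole test DataFrame instead of the first five rows.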


# 1.7 Or Pull Your Model From The Registry

In [26]:
model_ = native_registry.get_model(MODEL_NAME).version(MODEL_VERSION)
model_.run(sdf_filt_test, function_name="predict").limit(5).to_pandas()

| | DATE | OPEN | HIGH | LOW | CLOSE | SYMBOL | CLOSE_M1 | CLOSE_M2 | CLOSE_M3 | CLOSE_M4 | CLOSE_M5 | CLOSE_PREDICT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-04 | 133.520004 | 133.610001 | 126.760002 | 129.410004 | IBM | 132.690002 | 133.720001 | 134.869995 | 136.690002 | 131.970001 | 133.065836 |
| 1 | 2021-01-05 | 128.889999 | 131.740005 | 128.429993 | 131.009995 | IBM | 129.410004 | 132.690002 | 133.720001 | 134.869995 | 136.690002 | 130.259945 |
| 2 | 2021-01-06 | 127.720001 | 131.050003 | 126.379997 | 126.599998 | IBM | 131.009995 | 129.410004 | 132.690002 | 133.720001 | 134.869995 | 131.12131 |
| 3 | 2021-01-07 | 128.360001 | 131.630005 | 127.860001 | 130.919998 | IBM | 126.599998 | 131.009995 | 129.410004 | 132.690002 | 133.720001 | 127.630447 |
| 4 | 2021-01-08 | 132.429993 | 132.630005 | 130.229996 | 132.050003 | IBM | 130.919998 | 126.599998 | 131.009995 | 129.410004 | 132.690002 | 130.722491 |
