# Part 1. From Zero to Snowflake in 50 Lines of Code

In this lab you will learn how to:

1. Create a session for Snowpark with Snowflake
2. Create a DB, Warehouse and Model Registry
3. Prep Data using the highly parallelisable vectorised UDTF functionality
4. Build/train a regression model with Snowpark ML
5. Register your model in the Model Registry
6. Deploy the model
7. Run the model

All this in 50 lines of code (not counting the library imports). Note: there are some TODOs along the way for you to complete.

## Prerequisites:
In a terminal please run:

conda env create -f environment.yml
 
conda activate snowpark-ml-hol
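
Once the environment is active, a quick sanity check can confirm that the packages this lab imports are discoverable. The sketch below is illustrative only: `missing_packages` is a hypothetical helper, not part of the lab, and the module list simply mirrors the imports cell that follows.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be located in the active environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. `snowflake`) is absent or malformed
            found = False
        if not found:
            missing.append(name)
    return missing

# Module names taken from the imports cell of this lab
required = ["numpy", "pandas", "snowflake.snowpark", "snowflake.ml"]
print("Missing packages:", missing_packages(required) or "none")
```

If anything is reported as missing, re-creating the conda environment from `environment.yml` is the first thing to try.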

In [1]:
import json
import numpy as np
import pandas as pd
from snowflake.snowpark.session import Session
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T
from snowflake.snowpark.types import PandasDataFrameType, IntegerType, StringType, FloatType, DateType
from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.ml.modeling.linear_model import LinearRegression
from snowflake.ml.registry import model_registry
from snowflake.ml._internal.utils import identifier

# 1.1 Reading Snowflake Connection Details, create a Session

TO DO: 

1. Create a JSON file with your credentials in the format below

{
"account": "your_account_name", 
"user": "your_user_name",
"password": "insert_your_pwd_here",
"role": "ACCOUNTADMIN"
}

2. Update the file path in the cell below to point at your JSON file

In [2]:
# Load the connection details from the JSON file, closing the file afterwards
with open("/Users/mitaylor/Documents/creds/creds.json") as f: # <--- 2. Update here
    snowflake_connection_cfg = json.load(f)
session = Session.builder.configs(snowflake_connection_cfg).create()
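
A missing or misspelled key in the credentials file tends to surface only as a connection error, so it can help to validate the file before building the session. The following is a minimal sketch under stated assumptions: `load_connection_config` and `REQUIRED_KEYS` are hypothetical names, and the required keys are simply the four shown in the JSON template above.

```python
import json
import os
import tempfile

# The four keys from the credentials template in this section
REQUIRED_KEYS = ("account", "user", "password", "role")

def load_connection_config(path):
    """Read a JSON credentials file and check that every required key has a value."""
    with open(path) as f:
        cfg = json.load(f)
    missing = [k for k in REQUIRED_KEYS if not cfg.get(k)]
    if missing:
        raise ValueError(f"Credentials file is missing: {', '.join(missing)}")
    return cfg

# Demonstration with a throwaway file; in the lab you would pass your own path
example = {"account": "your_account_name", "user": "your_user_name",
           "password": "insert_your_pwd_here", "role": "ACCOUNTADMIN"}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(example, tmp)
cfg = load_connection_config(tmp.name)
os.remove(tmp.name)
print(sorted(cfg))
```

The returned dictionary can be passed to `Session.builder.configs(...)` exactly as in the cell above.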

# 1.2 Specify Your Database and Create a Virtual Warehouse

Snowflake separates compute from storage, so we need both a database and a warehouse (the compute environment) to run this lab. We might as well create a model registry at the same time.

In [12]:
session.sql("USE DATABASE HOL_DEMO").collect()
session.sql("CREATE OR REPLACE WAREHOUSE ASYNC_WH WITH WAREHOUSE_SIZE='MEDIUM' WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'").collect()
REGISTRY_DATABASE_NAME = "MODEL_REGISTRY"
REGISTRY_SCHEMA_NAME = "PUBLIC"
model_registry.create_model_registry(session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME)
registry = model_registry.ModelRegistry(session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME)

The `snowflake.ml.registry.model_registry.ModelRegistry` has been deprecated starting from version 1.2.0.
It will stay in the Private Preview phase. For future implementations, kindly utilize `snowflake.ml.registry.Registry`,
except when specifically required. The old model registry will be removed once all its primary functionalities are
fully integrated into the new registry.
        
  registry = model_registry.ModelRegistry(session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME)
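
The warning above points to `snowflake.ml.registry.Registry` as the replacement. The sketch below shows roughly how the register-and-run steps of this lab might look with that class. It is not part of the lab and rests on several assumptions: `log_and_run_with_new_registry` is a hypothetical wrapper, the target database and schema already exist, and the fitted model can supply its own signature (as fitted Snowpark ML models generally do). As far as the public documentation describes, inference in that API is invoked on the logged model version rather than through a separate `deploy` call.

```python
def log_and_run_with_new_registry(session, model, test_df, model_name, version_name,
                                  database_name="MODEL_REGISTRY", schema_name="PUBLIC"):
    """Sketch: register a fitted model with snowflake.ml.registry.Registry and run it."""
    # Imported lazily so this definition loads even where snowflake-ml-python is absent
    from snowflake.ml.registry import Registry

    registry = Registry(session=session, database_name=database_name, schema_name=schema_name)
    model_version = registry.log_model(
        model,
        model_name=model_name,
        version_name=version_name,
    )
    # Run the model's predict method against a Snowpark (or pandas) DataFrame
    return model_version.run(test_df, function_name="predict")
```

The rest of this lab continues with the older `model_registry.ModelRegistry` object created in the cell above.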


### EXTRA BIT, WHILE WE DECIDE ON DATA SHARES, PRE BUILT OR EVEN THIS CSV

In [13]:
df = pd.read_csv("test.csv")
session.write_pandas(df, table_name='FS_DATASET', auto_create_table=True, overwrite=True)

<snowflake.snowpark.table.Table at 0x1982e2990>

# 1.3 Get Your Data (Prepped)
In this case we're going to make a really simple lagging feature transformation for our time series dataset. There is nothing for you to do but run the cells, but note that ANY pandas-based manipulation could be performed here.

In [40]:
sdf = session.table("FS_DATASET")
sdf = sdf.select(F.to_date(F.col('DATE')).as_('DATE'), "OPEN", "HIGH", "LOW", "CLOSE", "SYMBOL").drop_duplicates(['DATE', 'SYMBOL'])

In [49]:
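class ML_Prep:
    # Called once per partition (here: once per SYMBOL) with the partition as a pandas DataFrame
    def end_partition(self, df):
        df.columns = ['_DATE', "_OPEN", "_HIGH", "_LOW", "_CLOSE", "_SYMBOL"]
        # Add five lagged closing prices; gaps at the start are filled with the partition mean.
        # Note: shift() follows the order in which rows arrive; nothing here sorts by _DATE.
        for i in range(1,6):
            df["_CLOSE-" + str(i)] = df["_CLOSE"].shift(i).fillna(df["_CLOSE"].mean())
        yield df

# Mark end_partition as vectorised so it receives a pandas DataFrame
ML_Prep.end_partition._sf_vectorized_input = pd.DataFrame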

ml_prep_udtf = session.udtf.register(
    ML_Prep, # the class
    input_types=[PandasDataFrameType([DateType(), FloatType(), FloatType(), FloatType(), FloatType(), StringType()])], 
    # 11 output columns: the 6 input columns plus 5 lag features (one type per column name)
    output_schema=PandasDataFrameType([DateType(), FloatType(), FloatType(), FloatType(), FloatType(), StringType(),FloatType(),FloatType(),FloatType(),FloatType(),FloatType()],
                                      ['DATE', "OPEN", "HIGH", "LOW", "CLOSE", "SYMBOL", "CLOSE_M1", "CLOSE_M2", "CLOSE_M3", "CLOSE_M4", "CLOSE_M5"]),
    packages=["snowflake-snowpark-python", 'pandas'])  


In [50]:
sdf_prepped = sdf.select(ml_prep_udtf(*["DATE", "OPEN", "HIGH", "LOW", "CLOSE", "SYMBOL"]).over(partition_by=['SYMBOL']))
sdf_prepped.limit(10).to_pandas()

Unnamed: 0,DATE,OPEN,HIGH,LOW,CLOSE,SYMBOL,CLOSE_M1,CLOSE_M2,CLOSE_M3,CLOSE_M4,CLOSE_M5
0,2020-06-19,88.660004,89.139999,86.287498,87.43,AMZN,140.808131,140.808131,140.808131,140.808131,140.808131
1,2020-05-29,79.8125,80.287498,79.1175,79.485001,AMZN,87.43,140.808131,140.808131,140.808131,140.808131
2,2020-06-01,79.4375,80.587502,79.302498,80.462502,AMZN,79.485001,87.43,140.808131,140.808131,140.808131
3,2020-06-02,80.1875,80.860001,79.732498,80.834999,AMZN,80.462502,79.485001,87.43,140.808131,140.808131
4,2020-06-03,81.165001,81.550003,80.574997,81.279999,AMZN,80.834999,80.462502,79.485001,87.43,140.808131
5,2020-06-04,81.097504,81.404999,80.195,80.580002,AMZN,81.279999,80.834999,80.462502,79.485001,87.43
6,2020-06-05,80.837502,82.9375,80.807503,82.875,AMZN,80.580002,81.279999,80.834999,80.462502,79.485001
7,2020-06-08,82.5625,83.400002,81.830002,83.364998,AMZN,82.875,80.580002,81.279999,80.834999,80.462502
8,2020-06-09,83.035004,86.402496,83.002502,85.997498,AMZN,83.364998,82.875,80.580002,81.279999,80.834999
9,2020-06-10,86.974998,88.692497,86.522499,88.209999,AMZN,85.997498,83.364998,82.875,80.580002,81.279999
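
The lag columns above come from `shift(i)` followed by `fillna` with the partition's mean close. The local pandas sketch below reproduces that logic on a toy frame so it can be inspected without a Snowflake session. `add_close_lags` is a hypothetical helper, and it sorts by `DATE` before shifting, which is an assumption added here rather than something the UDTF in this lab does.

```python
import pandas as pd

def add_close_lags(df, n_lags=5):
    """Add CLOSE_M1..CLOSE_M<n_lags>, filling the leading gaps with the mean close."""
    # Sorting first is an assumption of this sketch; it makes each lag refer to an earlier date
    out = df.sort_values("DATE").reset_index(drop=True)
    mean_close = out["CLOSE"].mean()
    for i in range(1, n_lags + 1):
        out[f"CLOSE_M{i}"] = out["CLOSE"].shift(i).fillna(mean_close)
    return out

# Toy data, deliberately out of date order
toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2020-06-03", "2020-06-01", "2020-06-02"]),
    "CLOSE": [30.0, 10.0, 20.0],
})
lagged = add_close_lags(toy, n_lags=2)
print(lagged)
```

Filling with the mean keeps every row usable for training, at the cost of the first few rows of each symbol carrying a placeholder rather than a real lagged price.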


In [22]:
sdf[['SYMBOL']].distinct().to_pandas()

Unnamed: 0,SYMBOL
0,AAPL
1,AMZN
2,FDS
3,IBM
4,META
5,MSFT


# 1.4.1 Choose Your Symbol, Train/Test Split and Model

We've got our data ready, but we need to make a few selections before we build our models

TO DO:
1. Choose the Symbol you want to build a model for
2. Pick the date range for your train/test split
3. Pick the type of regression model you want

In [52]:
sdf_prepped_filt = sdf_prepped.filter((F.col("SYMBOL") == 'AAPL'))
sdf_filt_train, sdf_filt_test = sdf_prepped_filt.filter((F.col("DATE") <= '2022-01-01')), sdf_prepped_filt.filter((F.col("DATE") > '2022-01-01'))
regressor = LinearRegression 

#sdf_prepped_filt = sdf_prepped.filter((F.col("SYMBOL") == "")) # <---- update 1.
#sdf_filt_train, sdf_filt_test = sdf_prepped_filt.filter((F.col("DATE") <= '')), sdf_prepped_filt.filter((F.col("DATE") > '')) # <---- update 2.
#regressor = # <---- update 3. hint: take a look at our imports cell
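
The split above is chronological: everything up to the cutoff date trains the model and everything after it tests the model, so the model never sees prices from the period it is evaluated on. A local pandas sketch of the same idea follows; `split_by_date` and the toy data are hypothetical and only illustrate the filter logic.

```python
import pandas as pd

def split_by_date(df, cutoff, date_col="DATE"):
    """Rows dated on or before `cutoff` go to train; later rows go to test."""
    cutoff_ts = pd.Timestamp(cutoff)
    train = df[df[date_col] <= cutoff_ts]
    test = df[df[date_col] > cutoff_ts]
    return train, test

toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2021-12-31", "2022-01-01", "2022-01-03", "2022-01-04"]),
    "CLOSE": [1.0, 2.0, 3.0, 4.0],
})
train, test = split_by_date(toy, "2022-01-01")
# A chronological split should leave no overlap between the two sets
assert train["DATE"].max() < test["DATE"].min()
print(len(train), len(test))
```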

# 1.4.2 Train Your Model

Our model is almost ready to be trained, but we need to choose our inputs, targets, and outputs. We could also go off-piste and alter the model's (hyper)parameters here (see https://docs.snowflake.com/en/developer-guide/snowpark-ml/reference/latest/api/modeling/snowflake.ml.modeling.linear_model.LinearRegression).

TO DO:
1. Select your input columns
2. Select your target (label) column
3. Choose your output column name

In [53]:
regressor = regressor(input_cols=["CLOSE_M1", "CLOSE_M2", "CLOSE_M3", "CLOSE_M4", "CLOSE_M5"],
                         label_cols=["CLOSE"],
                         output_cols=["CLOSE_PREDICT"])
regressor.fit(sdf_filt_train)

#regressor = regressor(input_cols=[], # <---- update 1.
#                         label_cols=[], # <---- update 2.
#                         output_cols=[]) # <---- update 3.
#regressor.fit(sdf_filt_train) # fit on the training split only, so the test split stays unseen

<snowflake.ml.modeling.linear_model.linear_regression.LinearRegression at 0x198d8bc90>
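
When filling in the column choices it is easy to mistype a name or to include the label among the inputs. The sketch below builds the lag column names programmatically and runs two basic consistency checks. Both `lag_feature_names` and `check_model_columns` are hypothetical helpers written for illustration; they are not Snowpark ML functions.

```python
def lag_feature_names(n_lags=5, prefix="CLOSE_M"):
    """Build the lag column names produced in section 1.3, e.g. CLOSE_M1..CLOSE_M5."""
    return [f"{prefix}{i}" for i in range(1, n_lags + 1)]

def check_model_columns(available, input_cols, label_cols):
    """Raise ValueError if a chosen column is absent or the label is also an input."""
    unknown = [c for c in input_cols + label_cols if c not in available]
    if unknown:
        raise ValueError(f"Columns not in the data: {unknown}")
    leaked = set(input_cols) & set(label_cols)
    if leaked:
        raise ValueError(f"Label used as an input: {sorted(leaked)}")

# Column names of the prepped dataset from section 1.3
available = ["DATE", "OPEN", "HIGH", "LOW", "CLOSE", "SYMBOL"] + lag_feature_names()
input_cols = lag_feature_names()
check_model_columns(available, input_cols, ["CLOSE"])
print(input_cols)
```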

# 1.5 Register Your Model

Let's assume we love the first model. It's time to register it.

TO DO:
1. Choose a model name
2. Choose a model version (note that the combination of name and version needs to be unique)

In [54]:
MODEL_NAME = "SIMPLE_XGB_MODEL"
MODEL_VERSION = "v12"
model = registry.log_model(model_name=MODEL_NAME,
                           model_version=MODEL_VERSION,
                           model=regressor,
                           tags={"stage": "testing", "classifier_type": "xgb"})

#MODEL_NAME = # <---- update 1.
#MODEL_VERSION = # <---- update 2.
#model = registry.log_model(model_name=MODEL_NAME,
#                           model_version=MODEL_VERSION,
#                           model=regressor,
#                           tags={"stage": "testing", "classifier_type": "xgb"},
#                           sample_input_data=sdf_prepped_filt.limit(10).to_pandas()[["CLOSE_M1", "CLOSE_M2", "CLOSE_M3", "CLOSE_M4", "CLOSE_M5"]],)
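
Because the name and version pair must be unique, re-running the registration cell with the same values will clash with the earlier entry. One possible way around this, sketched below, is to derive the version from the current UTC time. `timestamped_version` is a hypothetical helper, and the result is only unique to the second.

```python
import re
from datetime import datetime, timezone

def timestamped_version(prefix="v", now=None):
    """Return a version string of the form v<YYYYMMDD>_<HHMMSS> based on UTC time."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}{now.strftime('%Y%m%d_%H%M%S')}"

MODEL_VERSION = timestamped_version()
# The string contains only letters, digits, and an underscore
assert re.fullmatch(r"v\d{8}_\d{6}", MODEL_VERSION)
print(MODEL_VERSION)
```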

# 1.6 Deploy Your Model

Time to deploy the model...

In [55]:
model.deploy(deployment_name="model_predict_v6",
             target_method="predict",
             permanent=True,
             options={"relax_version": True})

{'name': 'MODEL_REGISTRY.PUBLIC.model_predict_v6',
 'platform': <TargetPlatform.WAREHOUSE: 'warehouse'>,
 'target_method': 'predict',
 'signature': ModelSignature(
                     inputs=[
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M1'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M2'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M3'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M4'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M5')
                     ],
                     outputs=[
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M1'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M2'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M3'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M4'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M5'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_PREDICT')
                     ]
                 ),
 'options': {'relax_version': True,
  'perman

# 1.7 Run Your Model

We're at the finish line!

In [56]:
model.predict(deployment_name="model_predict_v6", data=sdf_filt_test).limit(20).to_pandas()

Unnamed: 0,DATE,OPEN,HIGH,LOW,CLOSE,SYMBOL,CLOSE_M1,CLOSE_M2,CLOSE_M3,CLOSE_M4,CLOSE_M5,CLOSE_PREDICT
0,2022-02-02,174.75,175.880005,173.330002,175.839996,AAPL,121.099998,124.400002,116.970001,114.970001,115.080002,120.274025
1,2022-01-11,172.320007,175.179993,170.820007,175.080002,AAPL,175.839996,121.099998,124.400002,116.970001,114.970001,161.462255
2,2022-01-12,176.119995,177.179993,174.820007,175.529999,AAPL,175.080002,175.839996,121.099998,124.400002,116.970001,166.748144
3,2022-01-13,175.779999,176.619995,171.789993,172.190002,AAPL,175.529999,175.080002,175.839996,121.099998,124.400002,169.459198
4,2022-01-14,171.339996,173.779999,171.089996,173.070007,AAPL,172.190002,175.529999,175.080002,175.839996,121.099998,166.802102
5,2022-01-18,171.509995,172.539993,169.410004,169.800003,AAPL,173.070007,172.190002,175.529999,175.080002,175.839996,164.599309
6,2022-01-19,170.0,171.080002,165.940002,166.229996,AAPL,169.800003,173.070007,172.190002,175.529999,175.080002,162.100583
7,2022-01-20,166.979996,169.679993,164.179993,164.509995,AAPL,166.229996,169.800003,173.070007,172.190002,175.529999,159.089257
8,2022-01-21,164.419998,166.330002,162.300003,162.410004,AAPL,164.509995,166.229996,169.800003,173.070007,172.190002,157.375848
9,2022-01-24,160.020004,162.300003,154.699997,161.619995,AAPL,162.410004,164.509995,166.229996,169.800003,173.070007,155.398778
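
With `CLOSE` and `CLOSE_PREDICT` side by side, a quick way to gauge the gap between them is the mean absolute error. The sketch below computes it in plain Python for three rows copied from the output above; `mean_absolute_error` is a hypothetical helper, and three rows are far too few to judge the model.

```python
# (CLOSE, CLOSE_PREDICT) pairs copied from rows 1-3 of the output above
rows = [
    (175.080002, 161.462255),
    (175.529999, 166.748144),
    (172.190002, 169.459198),
]

def mean_absolute_error(pairs):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(actual - predicted) for actual, predicted in pairs) / len(pairs)

mae = mean_absolute_error(rows)
print(f"MAE over {len(rows)} rows: {mae:.4f}")
```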


In [None]:
model.predict(deployment_name="model_predict_v6", data=sdf_filt_test).write.save_as_table("ML_PREDICT", mode="overwrite")