# From Zero to Snowflake in 50 Lines of Code

We're going to prep data, build and train a regression model, register it, and deploy it in fewer than 50 lines of code. Watch out for the TO DOs: you have to update a few things along the way.

In [10]:
import json
import numpy as np
import pandas as pd
from snowflake.snowpark.session import Session
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T
from snowflake.snowpark.types import PandasDataFrameType, IntegerType, StringType, FloatType, DateType
from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.ml.modeling.linear_model import LinearRegression
from snowflake.ml.modeling.lightgbm import LGBMRegressor
from snowflake.ml.registry import model_registry
from snowflake.ml._internal.utils import identifier

# 1. Read Snowflake Connection Details and Create a Session

TO DO: 

1. Create a JSON file with your credentials, in this shape:

{
"account": "your_account_name", 
"user": "your_user_name",
"password": "insert_your_pwd_here",
"role": "ACCOUNTADMIN"
}

2. Update the file location in the cell below

In [2]:
with open("/Users/mitaylor/Documents/creds/creds.json") as f: # <--- 2. Update here
    snowflake_connection_cfg = json.load(f)
session = Session.builder.configs(snowflake_connection_cfg).create()
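
Checking the credentials file for the expected keys before connecting tends to give a clearer error than a failed login. The sketch below is one way to do that; `load_connection_config` and `REQUIRED_KEYS` are hypothetical names, not part of Snowpark, and the key list simply mirrors the JSON example above.

```python
import json
from pathlib import Path

# Keys taken from the JSON example in this section; adjust to your setup.
REQUIRED_KEYS = ("account", "user", "password", "role")


def load_connection_config(path):
    """Load a connection dict from a JSON file and check that the expected keys exist."""
    with Path(path).expanduser().open() as f:
        cfg = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in cfg]
    if missing:
        raise KeyError(f"credentials file is missing: {', '.join(missing)}")
    return cfg
```

The returned dict can be passed to `Session.builder.configs(...)` in the same way as in the cell above.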

# 2. Specify Your Database and Create a Virtual Warehouse

Snowflake separates compute from storage, so we need a database AND a warehouse (the compute environment) to run everything on. We might as well create a model registry at the same time.

In [16]:
session.sql("USE DATABASE HOL_DEMO").collect()
session.sql("CREATE OR REPLACE WAREHOUSE ASYNC_WH WITH WAREHOUSE_SIZE='MEDIUM' WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'").collect()
REGISTRY_DATABASE_NAME = "MODEL_REGISTRY"
REGISTRY_SCHEMA_NAME = "PUBLIC"
model_registry.create_model_registry(session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME)
registry = model_registry.ModelRegistry(session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME)

The `snowflake.ml.registry.model_registry.ModelRegistry` has been deprecated starting from version 1.2.0.
It will stay in the Private Preview phase. For future implementations, kindly utilize `snowflake.ml.registry.Registry`,
except when specifically required. The old model registry will be removed once all its primary functionalities are
fully integrated into the new registry.
        
  registry = model_registry.ModelRegistry(session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME)
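
The deprecation notice above points to `snowflake.ml.registry.Registry`. A rough sketch of a log-and-predict flow with that class follows, assuming snowflake-ml-python 1.2.0 or later and that the target database and schema already exist. The helper name `log_and_run_with_new_registry` is hypothetical, and the call signatures should be checked against the version you have installed.

```python
def log_and_run_with_new_registry(session, model, sample_df, model_name, version_name,
                                  database_name="MODEL_REGISTRY", schema_name="PUBLIC"):
    """Log a fitted model with the newer Registry class and run its predict method."""
    # Imported lazily so the helper can be defined without the package installed.
    from snowflake.ml.registry import Registry

    registry = Registry(session=session, database_name=database_name, schema_name=schema_name)
    # log_model returns a model version object for this (name, version) pair.
    model_version = registry.log_model(
        model,
        model_name=model_name,
        version_name=version_name,
        sample_input_data=sample_df,
    )
    # Run inference through the logged model's "predict" function.
    return model_version.run(sample_df, function_name="predict")
```

The rest of this walkthrough keeps using the older `model_registry.ModelRegistry` object created in the cell above.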


### Extra Bit: Load the Sample Data (Interim, While We Decide on Shares, Pre-Built Data, or This)

In [17]:
df = pd.read_csv("test.csv")
session.write_pandas(df, table_name='FS_DATASET', auto_create_table=True, overwrite=True)

<snowflake.snowpark.table.Table at 0x180507090>

# 3. Get Your Data (Prepped)
In this case we're going to make a really simple lagging feature transformation for our time series dataset. There is nothing for you to do but run the cells, but note that ANY pandas-based manipulation could be performed here.

In [18]:
sdf = session.table("FS_DATASET")
sdf = sdf.select(F.to_date(F.col('DATE')).as_('DATE'), "SYMBOL", "CLOSE").drop_duplicates(['DATE', 'SYMBOL'])

In [29]:
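class ML_Prep:
    def end_partition(self, df):
        # df holds every row of one SYMBOL partition, with columns in (DATE, SYMBOL, CLOSE) order
        df.columns = ['_DATE', '_SYMBOL', '_CLOSE']
        mean_close = df['_CLOSE'].mean()
        for i in range(1,6):
            # Pad the first i rows with the partition mean; rows from i onward keep their own CLOSE value
            df["_CLOSE-" + str(i)] = [mean_close]*i + list(df['_CLOSE'])[i:]
        # The returned columns are named by output_schema at registration, not by these labels
        yield df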

ML_Prep.end_partition._sf_vectorized_input = pd.DataFrame

ml_prep_udtf = session.udtf.register(
    ML_Prep, # the class
    input_types=[PandasDataFrameType([DateType(), StringType(), FloatType()])], 
    output_schema=PandasDataFrameType([DateType(),StringType(),FloatType(),FloatType(),FloatType(),FloatType(),FloatType(),FloatType()],
                                      ["DATE", "SYMBOL", "CLOSE", "CLOSE_M1", "CLOSE_M2", "CLOSE_M3", "CLOSE_M4", "CLOSE_M5"]),
    packages=["snowflake-snowpark-python", 'pandas'])  
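
A conventional lag feature shifts the series so that row t carries the value from row t-i. For comparison with the mean-padding used in the cell above, here is a minimal pandas sketch of that variant. `add_lag_features` is a hypothetical helper, and sorting by date and filling the leading gaps with the mean are assumptions made for the illustration.

```python
import pandas as pd


def add_lag_features(df, value_col="CLOSE", date_col="DATE", n_lags=5):
    """Return a copy sorted by date with <value_col>_M1..M<n_lags> lag columns."""
    out = df.sort_values(date_col).reset_index(drop=True)
    fill = out[value_col].mean()
    for i in range(1, n_lags + 1):
        # shift(i) moves each value i rows later; the first i rows have no history,
        # so they are filled with the series mean.
        out[f"{value_col}_M{i}"] = out[value_col].shift(i).fillna(fill)
    return out


# Four closing prices taken from the OTRK rows shown later in this notebook.
prices = pd.DataFrame({
    "DATE": pd.to_datetime(["2022-10-05", "2022-10-06", "2022-10-07", "2022-10-10"]),
    "CLOSE": [0.48, 0.48, 0.46, 0.46],
})
lagged = add_lag_features(prices, n_lags=2)
```

The same function body could sit inside `end_partition` if shifted lags are what you want.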


In [30]:
sdf_prepped = sdf.select(ml_prep_udtf(*['DATE', 'SYMBOL', 'CLOSE']).over(partition_by=['SYMBOL']))
sdf_prepped.limit(5).to_pandas()

| | DATE | SYMBOL | CLOSE | CLOSE_M1 | CLOSE_M2 | CLOSE_M3 | CLOSE_M4 | CLOSE_M5 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2022-09-29 | OTRK | 0.51 | 0.521154 | 0.521154 | 0.521154 | 0.521154 | 0.521154 |
| 1 | 2022-10-17 | OTRK | 0.41 | 0.41 | 0.521154 | 0.521154 | 0.521154 | 0.521154 |
| 2 | 2022-10-14 | OTRK | 0.4 | 0.4 | 0.4 | 0.521154 | 0.521154 | 0.521154 |
| 3 | 2022-10-13 | OTRK | 0.41 | 0.41 | 0.41 | 0.41 | 0.521154 | 0.521154 |
| 4 | 2022-10-12 | OTRK | 0.43 | 0.43 | 0.43 | 0.43 | 0.43 | 0.521154 |


In [31]:
sdf[['SYMBOL']].distinct().to_pandas()

| | SYMBOL |
|---|---|
| 0 | VZ |
| 1 | GOOG |
| 2 | TGVC |
| 3 | OTRK |


# 4. Choose Your Model

We've got our data ready, but we need to make a few selections before we build our models

TO DO:
1. Choose the Symbol you want to build a model for
2. Pick the date range for your train/test split
3. Pick the type of regression model you want

In [22]:
sdf_prepped_filt = sdf_prepped.filter((F.col("SYMBOL") == 'OTRK'))
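# Split the prepped, filtered frame (not the raw sdf) so both halves carry the lag columns
sdf_filt_train, sdf_filt_test = sdf_prepped_filt.filter((F.col("DATE") <= '2022-01-01')), sdf_prepped_filt.filter((F.col("DATE") > '2022-01-01'))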
regressor = LinearRegression 

#sdf_prepped_filt = sdf_prepped.filter((F.col("SYMBOL") == "")) # <---- update 1.
#sdf_filt_train, sdf_filt_test = sdf_prepped_filt.filter((F.col("DATE") <= '')), sdf_prepped_filt.filter((F.col("DATE") > '')) # <---- update 2.
#regressor = # <---- update 3. hint: take a look at our imports cell
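
For time series, splitting on a cutoff date keeps the test rows strictly later than the training rows. The pandas sketch below shows the same idea as the Snowpark filters, on a small local frame. `split_by_date` is a hypothetical helper and the cutoff is only an example.

```python
import pandas as pd


def split_by_date(df, cutoff, date_col="DATE"):
    """Return (train, test): rows on or before the cutoff, and rows after it."""
    dates = pd.to_datetime(df[date_col])
    cutoff = pd.Timestamp(cutoff)
    return df[dates <= cutoff], df[dates > cutoff]


# Four closing prices taken from the OTRK rows shown later in this notebook.
prices = pd.DataFrame({
    "DATE": ["2022-10-05", "2022-10-06", "2022-10-07", "2022-10-10"],
    "CLOSE": [0.48, 0.48, 0.46, 0.46],
})
train, test = split_by_date(prices, "2022-10-06")
```

Whichever cutoff you pick, check that both sides of the split actually contain rows for your chosen symbol.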

# 5. Train Your Model

Our model is almost ready to be trained, but first we need to choose our inputs, targets, and outputs. We could also go off piste and alter the model (hyper)parameters here; see the LinearRegression reference (https://docs.snowflake.com/en/developer-guide/snowpark-ml/reference/latest/api/modeling/snowflake.ml.modeling.linear_model.LinearRegression).

TO DO:
1. Select your input columns
2. Select your target(label) column
3. Choose your output column name

In [23]:
regressor = regressor(input_cols=["CLOSE_M1", "CLOSE_M2", "CLOSE_M3", "CLOSE_M4", "CLOSE_M5"],
                         label_cols=["CLOSE"],
                         output_cols=["CLOSE_PREDICT"])
regressor.fit(sdf_prepped_filt)

#regressor = regressor(input_cols=[], # <---- update 1.
#                         label_cols=[], # <---- update 2.
#                         output_cols=[]) # <---- update 3.
#regressor.fit(sdf_prepped_filt)

<snowflake.ml.modeling.linear_model.linear_regression.LinearRegression at 0x184962ad0>
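
To make the three choices concrete: the input columns form the feature matrix, the label column is the target, and the output column holds the prediction. The numpy sketch below fits an ordinary least squares model on made-up toy data. It illustrates those roles only and is not a description of how `snowflake.ml` trains the model internally; `fit_linear` and `predict_linear` are hypothetical names.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns (intercept, weights)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]


def predict_linear(intercept, weights, X):
    """Apply the fitted model to new input rows."""
    return intercept + np.asarray(X, dtype=float) @ weights


# Toy data (made up for illustration): label = 1 + 2*x1 - x2
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 3.0, 0.0, 2.0, 4.0]
intercept, weights = fit_linear(X, y)
```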

# 6. Register Your Model

Let's assume we love the first model. It's time to register it.

TO DO:
1. Choose a model name
2. Choose a model version (note the combo of name and version needs to be unique)

In [25]:
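# The name and tags below mention XGB; adjust them if you picked a different regressor (this run used LinearRegression)
MODEL_NAME = "SIMPLE_XGB_MODEL"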
MODEL_VERSION = "v6"
model = registry.log_model(model_name=MODEL_NAME,
                           model_version=MODEL_VERSION,
                           model=regressor,
                           tags={"stage": "testing", "classifier_type": "xgb"},
                           sample_input_data=sdf_prepped_filt.limit(10).to_pandas()[["CLOSE_M1", "CLOSE_M2", "CLOSE_M3", "CLOSE_M4", "CLOSE_M5"]],)

#MODEL_NAME = # <---- update 1.
#MODEL_VERSION = # <---- update 2.
#model = registry.log_model(model_name=MODEL_NAME,
#                           model_version=MODEL_VERSION,
#                           model=regressor,
#                           tags={"stage": "testing", "classifier_type": "xgb"},
#                           sample_input_data=sdf_prepped_filt.limit(10).to_pandas()[["CLOSE_M1", "CLOSE_M2", "CLOSE_M3", "CLOSE_M4", "CLOSE_M5"]],)
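
Because the name and version pair must be unique, re-running the registration cell needs a fresh version string each time. One possible convention is to bump a `v<number>` suffix, sketched below. `next_version` is a hypothetical helper, and how you collect the versions already in use is left open.

```python
import re


def next_version(existing_versions):
    """Return 'v<N+1>', where N is the highest 'v<number>' already in use."""
    numbers = [
        int(match.group(1))
        for version in existing_versions
        if (match := re.fullmatch(r"v(\d+)", version))
    ]
    # Strings that do not look like 'v<number>' are ignored; start at v1 if none match.
    return f"v{max(numbers, default=0) + 1}"
```

For example, with `["v1", "v6"]` already taken this returns `"v7"`.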


# 7. Deploy Your Model

Time to deploy the model...

In [26]:
model.deploy(deployment_name="model_predict",
             target_method="predict",
             permanent=True,
             options={"relax_version": True})



{'name': 'MODEL_REGISTRY.PUBLIC.model_predict',
 'platform': <TargetPlatform.WAREHOUSE: 'warehouse'>,
 'target_method': 'predict',
 'signature': ModelSignature(
                     inputs=[
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M1'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M2'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M3'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M4'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M5')
                     ],
                     outputs=[
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M1'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M2'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M3'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M4'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_M5'),
                         FeatureSpec(dtype=DataType.DOUBLE, name='CLOSE_PREDICT')
                     ]
                 ),
 'options': {'relax_version': True,
  'permanent

# 8. Run Your Model

We're at the finish line!

In [28]:
model.predict(deployment_name="model_predict", data=sdf_prepped_filt).limit(10).to_pandas()



| | DATE | SYMBOL | CLOSE | CLOSE_M1 | CLOSE_M2 | CLOSE_M3 | CLOSE_M4 | CLOSE_M5 | CLOSE_PREDICT |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-09-29 | OTRK | 0.51 | 0.521154 | 0.521154 | 0.521154 | 0.521154 | 0.521154 | 0.520614 |
| 1 | 2022-10-17 | OTRK | 0.41 | 0.41 | 0.521154 | 0.521154 | 0.521154 | 0.521154 | 0.41 |
| 2 | 2022-10-14 | OTRK | 0.4 | 0.4 | 0.4 | 0.521154 | 0.521154 | 0.521154 | 0.4 |
| 3 | 2022-10-13 | OTRK | 0.41 | 0.41 | 0.41 | 0.41 | 0.521154 | 0.521154 | 0.41 |
| 4 | 2022-10-12 | OTRK | 0.43 | 0.43 | 0.43 | 0.43 | 0.43 | 0.521154 | 0.43 |
| 5 | 2022-10-11 | OTRK | 0.44 | 0.44 | 0.44 | 0.44 | 0.44 | 0.44 | 0.439331 |
| 6 | 2022-10-10 | OTRK | 0.46 | 0.46 | 0.46 | 0.46 | 0.46 | 0.46 | 0.459363 |
| 7 | 2022-10-07 | OTRK | 0.46 | 0.46 | 0.46 | 0.46 | 0.46 | 0.46 | 0.459363 |
| 8 | 2022-10-06 | OTRK | 0.48 | 0.48 | 0.48 | 0.48 | 0.48 | 0.48 | 0.479395 |
| 9 | 2022-10-05 | OTRK | 0.48 | 0.48 | 0.48 | 0.48 | 0.48 | 0.48 | 0.479395 |
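
A quick way to summarise predictions like these is an error metric between `CLOSE` and `CLOSE_PREDICT`. The plain-Python sketch below computes mean absolute error and root mean squared error over the ten rows shown above; `mae` and `rmse` are hypothetical helper names. Since the model was fit on the same frame it is predicting on, these would be in-sample figures rather than an estimate of forecasting accuracy.

```python
import math

# (CLOSE, CLOSE_PREDICT) pairs copied from the ten rows shown above.
pairs = [
    (0.51, 0.520614), (0.41, 0.41), (0.4, 0.4), (0.41, 0.41), (0.43, 0.43),
    (0.44, 0.439331), (0.46, 0.459363), (0.46, 0.459363),
    (0.48, 0.479395), (0.48, 0.479395),
]


def mae(rows):
    """Mean absolute error between actual and predicted values."""
    return sum(abs(actual - predicted) for actual, predicted in rows) / len(rows)


def rmse(rows):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(sum((actual - predicted) ** 2 for actual, predicted in rows) / len(rows))


print(f"MAE:  {mae(pairs):.6f}")
print(f"RMSE: {rmse(pairs):.6f}")
```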
