# From Zero to Snowflake in 50 Lines of Code

We're going to prep data, build and train a regression model, register it, and deploy it in fewer than 50 lines of code. Watch out for the TO DOs: you have to update a few things along the way.

In [54]:
import json
import numpy as np
import pandas as pd
from snowflake.snowpark.session import Session
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T
from snowflake.snowpark.types import PandasDataFrameType, IntegerType, StringType, FloatType, DateType
from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.ml.modeling.linear_model import LinearRegression
from snowflake.ml.modeling.lightgbm import LGBMRegressor
from snowflake.ml.registry import model_registry

In [31]:
from snowflake.snowpark.functions import sproc, col
sdf = session.sql("select * FROM DATA_LAKE_TRADE_DATA_MT.PUBLIC.TRADE")


In [35]:
session.sql("USE DATABASE HOL_DEMO").collect()
sdf_filtered = sdf.filter((col("SYMBOL") == 'TGVC') | (col("SYMBOL") == 'GOOG') | (col("SYMBOL") == 'OTRK') | (col("SYMBOL") == 'VZ'))
sdf_filtered.write.save_as_table("FS_DATASET", mode="overwrite")
sdf_filtered.limit(5).to_pandas()

| | DATE | SYMBOL | EXCHANGE | ACTION | CLOSE | NUM_SHARES | CASH | TRADER | PM |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-08-05 | GOOG | NASDAQ | hold | 57.62 | 0.0 | 0.0 | charles | warren |
| 1 | 2019-08-06 | GOOG | NASDAQ | hold | 58.5 | 0.0 | 0.0 | charles | warren |
| 2 | 2019-08-07 | GOOG | NASDAQ | hold | 58.7 | 0.0 | 0.0 | charles | warren |
| 3 | 2019-08-08 | GOOG | NASDAQ | hold | 60.24 | 0.0 | 0.0 | charles | warren |
| 4 | 2019-08-09 | GOOG | NASDAQ | hold | 59.4 | 0.0 | 0.0 | charles | warren |


# 1. Read Your Snowflake Connection Details and Create a Session

TO DO: 

1. Create a JSON with your credentials and update the cell below

{
"account": "your_account_name", 
"user": "your_user_name",
"password": "insert_your_pwd_here",
"role": "ACCOUNTADMIN"
}

2. Update the path to your credentials file in the cell below

In [20]:
with open("/Users/mitaylor/Documents/creds/creds.json") as f: # <--- 2. Update here
    snowflake_connection_cfg = json.load(f)
session = Session.builder.configs(snowflake_connection_cfg).create()
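A typo in the credentials file tends to surface as a confusing connection error, so it can help to check the keys before building the session. Here's a minimal sketch, assuming the four keys from the JSON template above are the ones you need; `load_connection_config` and `REQUIRED_KEYS` are hypothetical names, not part of Snowpark, and the demo writes placeholder values to a throwaway file.

```python
import json
import os
import tempfile

# Keys taken from the JSON template in this section (an assumption about
# what your connection needs; adjust if you authenticate differently).
REQUIRED_KEYS = {"account", "user", "password", "role"}


def load_connection_config(path):
    """Read a credentials JSON file and check that the expected keys exist."""
    with open(path, encoding="utf-8") as fh:
        cfg = json.load(fh)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise KeyError(f"missing connection keys: {sorted(missing)}")
    return cfg


# Demo with a temporary file holding placeholder values (not real credentials).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "creds.json")
    with open(demo_path, "w", encoding="utf-8") as fh:
        json.dump(
            {
                "account": "your_account_name",
                "user": "your_user_name",
                "password": "insert_your_pwd_here",
                "role": "ACCOUNTADMIN",
            },
            fh,
        )
    demo_cfg = load_connection_config(demo_path)

print(sorted(demo_cfg))
```

The returned dictionary is what you would hand to `Session.builder.configs(...)` in the cell above.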

# 2. Specify Your Database and Create a Virtual Warehouse

Snowflake separates compute from storage, so we need both a database and a warehouse (the compute environment) to run everything on. We might as well create a model registry at the same time.

In [34]:
session.sql("USE DATABASE HOL_DEMO").collect()
session.sql("CREATE OR REPLACE WAREHOUSE ASYNC_WH WITH WAREHOUSE_SIZE='MEDIUM' WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'").collect()
REGISTRY_DATABASE_NAME = "MODEL_REGISTRY"
REGISTRY_SCHEMA_NAME = "PUBLIC"
model_registry.create_model_registry(session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME)
registry = model_registry.ModelRegistry(session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME)

[Row(status='Warehouse ASYNC_WH successfully created.')]
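`CREATE OR REPLACE WAREHOUSE` recreates the warehouse every time the cell runs. If you'd rather leave an existing warehouse alone, `CREATE WAREHOUSE IF NOT EXISTS` is one alternative. The sketch below only builds the SQL string; `create_warehouse_sql` is a hypothetical helper and its name checks are deliberately conservative assumptions, not Snowflake's full identifier rules.

```python
import re


def create_warehouse_sql(name, size="MEDIUM", snowpark_optimized=True):
    """Build a CREATE WAREHOUSE IF NOT EXISTS statement as a string."""
    # Conservative check for an unquoted identifier (assumption: no quoted names).
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_$]*", name):
        raise ValueError(f"unexpected warehouse name: {name!r}")
    # Sizes are interpolated into a quoted literal, so keep them simple tokens.
    if not re.fullmatch(r"[A-Za-z0-9-]+", size):
        raise ValueError(f"unexpected warehouse size: {size!r}")
    stmt = f"CREATE WAREHOUSE IF NOT EXISTS {name} WITH WAREHOUSE_SIZE = '{size}'"
    if snowpark_optimized:
        stmt += " WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'"
    return stmt


stmt = create_warehouse_sql("ASYNC_WH")
print(stmt)
# In the notebook this would be run as: session.sql(stmt).collect()
```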

# 3. Get Your Data (Prepped)
In this case we're going to make a really simple lagging feature transformation for our time series dataset. There's nothing for you to do but run the cells, but note that ANY pandas-based manipulation could be performed here.

In [48]:
sdf = session.table("FS_DATASET").drop_duplicates(['DATE', 'SYMBOL'])

In [49]:
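class ML_Prep:
    def end_partition(self, df):
        # df holds every row of one SYMBOL partition as a pandas DataFrame
        df.columns = ['_DATE', '_SYMBOL', '_CLOSE']
        mean_close = df['_CLOSE'].mean()
        for i in range(1,6):
            # the first i rows get the partition mean; rows from position i onward keep their own _CLOSE value
            df["_CLOSE-" + str(i)] = [mean_close]*i + list(df['_CLOSE'])[i:]
        yield df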

ML_Prep.end_partition._sf_vectorized_input = pd.DataFrame

ml_prep_udtf = session.udtf.register(
    ML_Prep, # the class
    input_types=[PandasDataFrameType([DateType()] + # DATE
                                     [StringType()] + # SYMBOL
                                     [FloatType()] # CLOSE
                                    )], 
    output_schema=PandasDataFrameType([DateType(),StringType(),FloatType(),FloatType(),FloatType(),FloatType(),FloatType(),FloatType()],
                                      ["DATE", "SYMBOL", "CLOSE", "CLOSE_M1", "CLOSE_M2", "CLOSE_M3", "CLOSE_M4", "CLOSE_M5"]),
    packages=["snowflake-snowpark-python", 'pandas'])  


The version of package 'pandas' in the local environment is 1.5.0, which does not fit the criteria for the requirement 'pandas'. Your UDF might not work when the package version is different between the server and your local environment.


In [50]:
sdf_prepped = sdf.select(ml_prep_udtf(*['DATE', 'SYMBOL', 'CLOSE']).over(partition_by=['SYMBOL']))
sdf_prepped.limit(5).to_pandas()

| | DATE | SYMBOL | CLOSE | CLOSE_M1 | CLOSE_M2 | CLOSE_M3 | CLOSE_M4 | CLOSE_M5 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2022-10-17 | OTRK | 0.41 | 0.521154 | 0.521154 | 0.521154 | 0.521154 | 0.521154 |
| 1 | 2022-09-26 | OTRK | 0.49 | 0.49 | 0.521154 | 0.521154 | 0.521154 | 0.521154 |
| 2 | 2022-10-06 | OTRK | 0.48 | 0.48 | 0.48 | 0.521154 | 0.521154 | 0.521154 |
| 3 | 2022-10-12 | OTRK | 0.43 | 0.43 | 0.43 | 0.43 | 0.521154 | 0.521154 |
| 4 | 2022-10-03 | OTRK | 0.47 | 0.47 | 0.47 | 0.47 | 0.47 | 0.521154 |
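Two things stand out in the output above: the rows within a partition are not in date order, and after the first row CLOSE_M1 simply equals CLOSE, because the `[i:]` slice keeps each value on its own row instead of shifting it. If you want features that really hold the previous days' prices, one option is to sort by date and shift the values down. Here's a minimal pure-Python sketch of that idea, using three of the OTRK rows shown above; `lagged` is a hypothetical helper, and padding with the mean just mirrors the cell above.

```python
from datetime import date


def lagged(values, lag, fill):
    """Shift values down by `lag` positions, padding the top with `fill`."""
    n = len(values)
    k = min(lag, n)  # never pad more rows than exist
    return [fill] * k + values[: n - k]


# (DATE, CLOSE) pairs taken from the OTRK rows in the output above.
rows = [
    (date(2022, 10, 6), 0.48),
    (date(2022, 9, 26), 0.49),
    (date(2022, 10, 3), 0.47),
]
rows.sort(key=lambda r: r[0])    # oldest first, so "previous row" means "previous day"
closes = [c for _, c in rows]    # [0.49, 0.47, 0.48]
mean_close = sum(closes) / len(closes)

close_m1 = lagged(closes, 1, mean_close)  # previous close on each row
close_m2 = lagged(closes, 2, mean_close)  # close from two rows earlier

for (d, c), m1, m2 in zip(rows, close_m1, close_m2):
    print(d, c, m1, m2)
```

Inside `end_partition`, the same effect could come from sorting the DataFrame by date and replacing the `[i:]` slice with one that drops the last `i` values.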


In [53]:
sdf[['SYMBOL']].distinct().to_pandas()

| | SYMBOL |
|---|---|
| 0 | TGVC |
| 1 | OTRK |
| 2 | VZ |
| 3 | GOOG |


# 4. Choose Your Model

We've got our data ready, but we need to make a few selections before we build our models

TO DO:
1. Choose the Symbol you want to build a model for
2. Pick the date range for your train/test split
3. Pick the type of regression model you want

In [58]:
#sdf_prepped_filt = sdf_prepped.filter((col("SYMBOL") == 'OTRK'))
#sdf_filt_train, sdf_filt_test = sdf_prepped_filt.filter((col("DATE") <= '')), sdf_prepped_filt.filter((col("DATE") > ''))
#regressor = XGBRegressor 


sdf_prepped_filt = sdf_prepped.filter((col("SYMBOL") == )) # <---- update 1.
#sdf_filt_train, sdf_filt_test = sdf_prepped_filt.filter((col("DATE") <= '')), sdf_prepped_filt.filter((col("DATE") > '')) # <---- update 2.
regressor = # <---- update 3. Hint: look at our imports cell
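For time series, the train/test split is usually made on a cutoff date rather than at random, so the model is tested on data that comes after everything it was trained on. The commented-out filter lines do this with `<=` and `>` on DATE. Here's the same logic as a plain-Python sketch on the GOOG rows shown earlier; `split_by_date` is a hypothetical helper and the cutoff is an arbitrary pick for illustration.

```python
from datetime import date


def split_by_date(rows, cutoff, date_key="DATE"):
    """Rows on or before the cutoff go to train; later rows go to test."""
    train = [r for r in rows if r[date_key] <= cutoff]
    test = [r for r in rows if r[date_key] > cutoff]
    return train, test


# GOOG rows from the preview table near the top of the notebook.
goog = [
    {"DATE": date(2019, 8, 5), "CLOSE": 57.62},
    {"DATE": date(2019, 8, 6), "CLOSE": 58.5},
    {"DATE": date(2019, 8, 7), "CLOSE": 58.7},
    {"DATE": date(2019, 8, 8), "CLOSE": 60.24},
    {"DATE": date(2019, 8, 9), "CLOSE": 59.4},
]

train_rows, test_rows = split_by_date(goog, cutoff=date(2019, 8, 7))
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one of the two sets, which is the property the `<=` / `>` pair gives you in the Snowpark filters too.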

# 5. Train Your Model

Our model is almost ready to be trained, but we need to choose our inputs, targets, and outputs. We could go off-piste and alter model (hyper)parameters here too (see https://docs.snowflake.com/en/developer-guide/snowpark-ml/reference/latest/api/modeling/snowflake.ml.modeling.linear_model.LinearRegression).

TO DO:
1. Select your input columns
2. Select your target (label) column
3. Choose your output column name

In [60]:
#regressor = regressor(input_cols=["CLOSE_M1", "CLOSE_M2", "CLOSE_M3", "CLOSE_M4", "CLOSE_M5"],
#                         label_cols=["CLOSE"],
#                         output_cols=["CLOSE_PREDICT"])
#regressor.fit(sdf_prepped_filt)

regressor = regressor(input_cols=[], # <---- update 1.
                         label_cols=[], # <---- update 2.
                         output_cols=[]) # <---- update 3.
regressor.fit(sdf_prepped_filt)

The version of package 'xgboost' in the local environment is 1.7.6, which does not fit the criteria for the requirement 'xgboost==1.7.3'. Your UDF might not work when the package version is different between the server and your local environment.


  If you are loading a serialized model (like pickle in Python, RDS in R) generated by
  older XGBoost, please export the model by calling `Booster.save_model` from that version
  first, then load it back in current version. See:

    https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html

  for more details about differences between saving model and serializing.
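The three column arguments are just lists of names, so they can be generated instead of typed out, and it's easy to add a guard against accidentally feeding the label in as an input. Here's a minimal sketch, assuming the `CLOSE_M1` to `CLOSE_M5` naming from the prep step; `build_column_kwargs` is a hypothetical helper, not part of Snowpark ML.

```python
def build_column_kwargs(n_lags=5, label="CLOSE", output="CLOSE_PREDICT"):
    """Return the input/label/output column lists as keyword arguments."""
    input_cols = [f"{label}_M{i}" for i in range(1, n_lags + 1)]
    # The label and the output must not also appear among the inputs.
    if label in input_cols or output in input_cols:
        raise ValueError("label/output column overlaps with the input columns")
    return {
        "input_cols": input_cols,
        "label_cols": [label],
        "output_cols": [output],
    }


column_kwargs = build_column_kwargs()
print(column_kwargs)
# In the notebook the dict could be unpacked into the constructor,
# e.g. regressor(**column_kwargs), in place of the three hand-written lists.
```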



# 6. Register Your Model

Let's assume we love the first model. It's time to register it.

TO DO:
1. Choose a model name
2. Choose a model version (note that the combination of name and version needs to be unique)

In [None]:
#MODEL_NAME = "SIMPLE_XGB_MODEL"
#MODEL_VERSION = "v2"
#model = registry.log_model(
#    model_name=MODEL_NAME,
#    model_version=MODEL_VERSION,
#    model=regressor,
#    tags={"stage": "testing", "classifier_type": "xgb"},
#    sample_input_data=sdf_prepped_filt.limit(10).to_pandas()[["CLOSE_M1", "CLOSE_M2", "CLOSE_M3", "CLOSE_M4", "CLOSE_M5"]],
#)

MODEL_NAME = # <---- update 1.
MODEL_VERSION = # <---- update 2.
model = registry.log_model(
    model_name=MODEL_NAME,
    model_version=MODEL_VERSION,
    model=regressor,
    tags={"stage": "testing", "classifier_type": "xgb"},
    sample_input_data=sdf_prepped_filt.limit(10).to_pandas()[["CLOSE_M1", "CLOSE_M2", "CLOSE_M3", "CLOSE_M4", "CLOSE_M5"]],
)
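Since the name and version pair has to be unique, re-running the cell with the same values won't work. One way to sidestep that is to derive the next version from the ones already used. Here's a minimal sketch, assuming versions follow the `v<number>` pattern from the commented example; `next_version` is a hypothetical helper, and how you collect the existing versions from the registry is left out.

```python
import re


def next_version(existing):
    """Return the next 'v<number>' label after the highest one in `existing`."""
    numbers = [
        int(m.group(1))
        for v in existing
        if (m := re.fullmatch(r"v(\d+)", v))  # ignore labels that don't match
    ]
    return f"v{max(numbers, default=0) + 1}"


print(next_version(["v1", "v2"]))  # the next label after v2
print(next_version([]))            # nothing registered yet
```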

# 7. Deploy Your Model

Time to deploy the model...

In [None]:
model.deploy(deployment_name="model_predict",
             target_method="predict",
             permanent=True,
             options={"relax_version": True})

# 8. Run Your Model

We're at the finish line!

In [None]:
result = regressor.predict(sdf_prepped_filt)
result[['DATE',"CLOSE","CLOSE_PREDICT"]].limit(15).to_pandas()
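Eyeballing CLOSE next to CLOSE_PREDICT only goes so far, so a single error number is handy. Here's a minimal sketch of mean absolute error and root mean squared error in plain Python. The two lists are toy values for illustration, not model output; in the notebook you would pull the two columns out of `result.to_pandas()`. Bear in mind that scoring on the same rows the model was fitted on tends to flatter it.

```python
import math


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Toy values only (illustrative, not taken from the model).
toy_actual = [1.0, 2.0, 3.0]
toy_predicted = [1.0, 2.0, 5.0]

print(mae(toy_actual, toy_predicted))
print(rmse(toy_actual, toy_predicted))
```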

In [63]:
sdf.to_pandas().to_csv("test.csv")