# Part 1. From Zero to Snowflake in 50 Lines of Code

In this lab you will learn how to:

1. Create a session for Snowpark with Snowflake
2. Create a DB, Warehouse and Model Registry
3. Prep Data using the highly parallelisable vectorised UDTF functionality
4. Build/train a regression model with Snowpark ML
5. Register your model in the Model Registry
6. Run the model

All this in 50 lines of code (not counting the library imports). Note: there are some TODOs along the way for you to update.

## Prerequisites:
In a terminal please run:

conda env create -f conda_env.yml
 
conda activate snowpark-ml-hol

jupyter lab  # starts JupyterLab

In [1]:
import json
import pandas as pd
from snowflake.snowpark.session import Session
import snowflake.snowpark.functions as F
from snowflake.snowpark.types import PandasDataFrameType, IntegerType, StringType, FloatType, DateType
from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.ml.modeling.linear_model import LinearRegression
from snowflake.ml.registry import registry
from snowflake.ml._internal.utils import identifier

# 1.1 Read Snowflake Connection Details and Create a Session


In [2]:
snowflake_connection_cfg = json.loads(open("/Users/mitaylor/Documents/creds/creds_sf_azure.json").read())  # TODO: update this path to point at your own credentials file
session = Session.builder.configs(snowflake_connection_cfg).create()
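
The connection file is a plain JSON object whose keys are handed to `Session.builder.configs`. A minimal sketch of loading such a file and failing early when keys are missing is shown below. The helper name `load_connection_config` and the required keys (`account`, `user`, `password`) are assumptions for illustration; your setup may use a different authentication method and therefore different keys.

```python
import json
import os
import tempfile

# Assumed minimum set of keys; adjust to match your authentication method.
REQUIRED_KEYS = ("account", "user", "password")

def load_connection_config(path, required=REQUIRED_KEYS):
    """Load a JSON connection file and fail early if expected keys are missing."""
    with open(path) as f:
        cfg = json.load(f)
    missing = [k for k in required if k not in cfg]
    if missing:
        raise KeyError(f"connection file {path} is missing keys: {missing}")
    return cfg

# Demo with a throwaway file and placeholder values (not real credentials).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "creds.json")
    with open(demo_path, "w") as f:
        json.dump({"account": "my_account", "user": "my_user", "password": "***"}, f)
    demo_cfg = load_connection_config(demo_path)

print(sorted(demo_cfg))
```

The returned dict can be passed to `Session.builder.configs(...)` in the same way as `snowflake_connection_cfg` in the cell above.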

# 1.2 Specify Your Database and Create a Virtual Warehouse

Snowflake separates compute from storage, so we need a database AND a warehouse (compute environment) to run our code on. We might as well create a model registry at the same time.

In [3]:
session.sql("CREATE OR REPLACE DATABASE HOL_DEMO").collect()
session.sql("CREATE OR REPLACE WAREHOUSE ASYNC_WH WITH WAREHOUSE_SIZE='MEDIUM' WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'").collect()
REGISTRY_DATABASE_NAME = "MODEL_REGISTRY"
REGISTRY_SCHEMA_NAME = "PUBLIC"
native_registry = registry.Registry(session=session, database_name=REGISTRY_DATABASE_NAME, schema_name=REGISTRY_SCHEMA_NAME)

### Interim step: load the dataset from this CSV (pending a decision on data shares or a pre-built dataset)

In [5]:
session.write_pandas(pd.read_csv("test.csv"), table_name='FS_DATASET', auto_create_table=True, overwrite=True)

<snowflake.snowpark.table.Table at 0x17eb88350>

# 1.3 Get Your Data (Prepped)
In this case we're going to make a really simple lagging feature transformation for our time series dataset. There is nothing for you to do but run the cells, but note that ANY pandas-based manipulation could be performed here.

In [6]:
sdf = session.table("FS_DATASET")
sdf = sdf.select(F.to_date(F.col('DATE')).as_('DATE'), "OPEN", "HIGH", "LOW", "CLOSE", "SYMBOL")

In [7]:
class ML_Prep:
    def end_partition(self, df):
        df.columns = ['_DATE', "_OPEN", "_HIGH", "_LOW", "_CLOSE", "_SYMBOL"]
        for i in range(1,6):
            df["_CLOSE-" + str(i)] = df["_CLOSE"].shift(i).bfill()
        yield df

ML_Prep.end_partition._sf_vectorized_input = pd.DataFrame

ml_prep_udtf = session.udtf.register(
    ML_Prep, # the class
    name="ml_prep_udtf",
    input_types=[PandasDataFrameType([DateType(), FloatType(), FloatType(), FloatType(), FloatType(), StringType()])], 
    output_schema=PandasDataFrameType([DateType(), FloatType(), FloatType(), FloatType(), FloatType(), StringType(), FloatType(), FloatType(), FloatType(), FloatType(), FloatType()],
                                      ['DATE', "OPEN", "HIGH", "LOW", "CLOSE", "SYMBOL", "CLOSE_M1", "CLOSE_M2", "CLOSE_M3", "CLOSE_M4", "CLOSE_M5"]),
    packages=["snowflake-snowpark-python", 'pandas'])  


In [8]:
sdf_prepped = sdf.select(ml_prep_udtf(*["DATE", "OPEN", "HIGH", "LOW", "CLOSE", "SYMBOL"]).over(partition_by=['SYMBOL']))
sdf_prepped.limit(10).to_pandas()
sdf_prepped.write.save_as_table("ML_PREPPED", mode="overwrite")
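
To make the transformation concrete, here is a plain-Python sketch of what `shift(i).bfill()` computes once the data has been partitioned by symbol. It is an illustration only, not the UDTF itself: the function names and the prices are made up, and it assumes the rows of each symbol arrive in date order with no missing values.

```python
from collections import defaultdict

def lag_with_backfill(values, k):
    """Plain-Python equivalent of pandas `Series.shift(k).bfill()` for data without gaps."""
    if k >= len(values):
        # shift() leaves nothing to back-fill from, so every entry stays missing.
        return [None] * len(values)
    # Row i takes the value from k rows earlier; the first k rows reuse the first value.
    return [values[max(i - k, 0)] for i in range(len(values))]

def add_close_lags(rows, n_lags=2):
    """Group (symbol, close) rows by symbol, then attach CLOSE_M1..CLOSE_Mn to each row."""
    by_symbol = defaultdict(list)
    for symbol, close in rows:
        by_symbol[symbol].append(close)
    out = []
    for symbol, closes in by_symbol.items():
        lags = [lag_with_backfill(closes, k) for k in range(1, n_lags + 1)]
        for i, close in enumerate(closes):
            out.append((symbol, close, *[lag[i] for lag in lags]))
    return out

# Made-up prices for two hypothetical symbols, already in date order.
rows = [("AAA", 10.0), ("BBB", 50.0), ("AAA", 11.0), ("BBB", 51.0), ("AAA", 12.0)]
for row in add_close_lags(rows):
    print(row)
```

Note that the first rows of each partition carry back-filled values rather than true lags, so the earliest few training rows per symbol are slightly artificial.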

In [9]:
sdf[['SYMBOL']].distinct().to_pandas()

|   | SYMBOL |
|---|--------|
| 0 | IBM |
| 1 | AMZN |
| 2 | FDS |
| 3 | META |


# 1.4.1 Choose Your Symbol, Train/Test Split and Model

We've got our data ready, but we need to make a few selections before we build our models

In [10]:
sdf_prepped_filt = sdf_prepped.filter((F.col("SYMBOL") == 'IBM'))
sdf_filt_train, sdf_filt_test = sdf_prepped_filt.filter((F.col("DATE") <= '2022-01-01')), sdf_prepped_filt.filter((F.col("DATE") > '2022-01-01'))
regressor = LinearRegression
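
The split above is chronological rather than random: every row dated on or before the cutoff is used for training, and every later row is held out for testing. A minimal sketch of the same rule on in-memory data follows; the helper `chronological_split` and the sample rows are hypothetical and exist only to show where the boundary date ends up.

```python
from datetime import date

def chronological_split(rows, cutoff):
    """Split (date, value) rows: dates on or before `cutoff` train, later dates test."""
    train = [r for r in rows if r[0] <= cutoff]
    test = [r for r in rows if r[0] > cutoff]
    return train, test

# Made-up rows for illustration only.
rows = [
    (date(2021, 12, 30), 1.0),
    (date(2021, 12, 31), 2.0),
    (date(2022, 1, 1), 3.0),   # the cutoff date itself lands in the training set
    (date(2022, 1, 3), 4.0),
]
train, test = chronological_split(rows, cutoff=date(2022, 1, 1))
print(len(train), len(test))
```

Splitting by date keeps the test period strictly after the training period, which matters for time series because a random split would let the model see the future.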

# 1.4.2 Train Your Model

Our model is almost ready to be trained, but first we need to choose our input, label (target) and output columns. We could go off piste and alter the model (hyper)parameters here too; see https://docs.snowflake.com/en/developer-guide/snowpark-ml/reference/latest/api/modeling/snowflake.ml.modeling.linear_model.LinearRegression for the available options.

In [11]:
regressor = regressor(input_cols=["CLOSE_M1", "CLOSE_M2", "CLOSE_M3", "CLOSE_M4", "CLOSE_M5"],
                         label_cols=["CLOSE"],
                         output_cols=["CLOSE_PREDICT"])
regressor.fit(sdf_filt_train)

<snowflake.ml.modeling.linear_model.linear_regression.LinearRegression at 0x17ed3c090>

# 1.5 Register Your Model

Let's assume we love the first model. It's time to register it.

In [13]:
MODEL_NAME = "REGRESSION_IBM"
MODEL_VERSION = "v01"

model = native_registry.log_model(
    model_name=MODEL_NAME,
    version_name=MODEL_VERSION,
    model=regressor,
)

# 1.6 Run Your Model

We're at the finish line!

In [14]:
model.run(sdf_filt_test, function_name="predict").limit(5).to_pandas()

|   | DATE | OPEN | HIGH | LOW | CLOSE | SYMBOL | CLOSE_M1 | CLOSE_M2 | CLOSE_M3 | CLOSE_M4 | CLOSE_M5 | CLOSE_PREDICT |
|---|------|------|------|-----|-------|--------|----------|----------|----------|----------|----------|---------------|
| 0 | 2022-01-03 | 177.830002 | 182.880005 | 177.710007 | 182.009995 | IBM | 177.570007 | 178.199997 | 179.380005 | 179.289993 | 180.330002 | 174.761702 |
| 1 | 2022-01-04 | 182.630005 | 182.940002 | 179.119995 | 179.699997 | IBM | 182.009995 | 177.570007 | 178.199997 | 179.380005 | 179.289993 | 178.16986 |
| 2 | 2022-01-05 | 179.610001 | 180.169998 | 174.639999 | 174.919998 | IBM | 179.699997 | 182.009995 | 177.570007 | 178.199997 | 179.380005 | 177.002433 |
| 3 | 2022-01-06 | 172.699997 | 175.300003 | 171.639999 | 172.0 | IBM | 174.919998 | 179.699997 | 182.009995 | 177.570007 | 178.199997 | 172.985956 |
| 4 | 2022-01-07 | 172.889999 | 174.139999 | 171.029999 | 172.169998 | IBM | 172.0 | 174.919998 | 179.699997 | 182.009995 | 177.570007 | 169.914255 |
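
As a quick sanity check on these predictions, the sketch below computes the mean absolute error (MAE) for the five rows shown above and compares it with a naive baseline that simply predicts the previous day's close (`CLOSE_M1`). The values are copied from the output; the baseline comparison is an addition for illustration, and five rows are far too few to draw conclusions about model quality.

```python
# (CLOSE, CLOSE_M1, CLOSE_PREDICT) copied from the five rows shown above.
rows = [
    (182.009995, 177.570007, 174.761702),
    (179.699997, 182.009995, 178.16986),
    (174.919998, 179.699997, 177.002433),
    (172.0, 174.919998, 172.985956),
    (172.169998, 172.0, 169.914255),
]

def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [r[0] for r in rows]
mae_model = mean_absolute_error(actual, [r[2] for r in rows])
# Naive baseline: predict today's close as yesterday's close (CLOSE_M1).
mae_naive = mean_absolute_error(actual, [r[1] for r in rows])

print(f"model MAE: {mae_model:.2f}, naive MAE: {mae_naive:.2f}")
# prints: model MAE: 2.82, naive MAE: 2.92
```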


## Make sure you run the next cell: the ML_PREDICT table it writes is needed for the next Part

In [15]:
model_ = native_registry.get_model(MODEL_NAME).version(MODEL_VERSION)
model_.run(sdf_filt_test, function_name="predict").write.save_as_table("ML_PREDICT", mode="overwrite")

In [16]:
native_registry.get_model(MODEL_NAME).show_versions()

|   | created_on | name | comment | database_name | schema_name | module_name | is_default_version | functions | metadata | user_data |
|---|------------|------|---------|---------------|-------------|-------------|--------------------|-----------|----------|-----------|
| 0 | 2024-01-30 11:49:08.362000-08:00 | V13 |  | MODEL_REGISTRY | PUBLIC | REGRESSION_IBM | True | `["PREDICT"]` | `{}` | `{"snowpark_ml_data":{"functions":[{"name":"PRE...` |
| 1 | 2024-01-30 11:52:40.772000-08:00 | V14 |  | MODEL_REGISTRY | PUBLIC | REGRESSION_IBM | False | `["PREDICT"]` | `{}` | `{"snowpark_ml_data":{"functions":[{"name":"PRE...` |
| 2 | 2024-01-31 03:49:40.169000-08:00 | V15 |  | MODEL_REGISTRY | PUBLIC | REGRESSION_IBM | False | `["PREDICT"]` | `{}` | `{"snowpark_ml_data":{"functions":[{"name":"PRE...` |
| 3 | 2024-01-31 11:05:24.661000-08:00 | V01 |  | MODEL_REGISTRY | PUBLIC | REGRESSION_IBM | False | `["PREDICT"]` | `{}` | `{"snowpark_ml_data":{"functions":[{"name":"PRE...` |
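
Notice that the version registered as `v01` is listed as `V01`. This appears consistent with Snowflake's general identifier rule: unquoted identifiers are stored in upper case, while double-quoted identifiers keep their case. A minimal sketch of that rule follows; `resolve_identifier` is a hypothetical helper written for illustration, not a function from the Snowflake libraries, and it ignores edge cases such as escaped quotes.

```python
def resolve_identifier(name):
    """Return the name as stored: unquoted names fold to upper case, quoted names keep case."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        return name[1:-1]
    return name.upper()

print(resolve_identifier("v01"))    # V01
print(resolve_identifier('"v01"'))  # v01
```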
