### Prerequisites

[Anaconda](https://www.anaconda.com/) installed

[Python 3.9](https://www.python.org/downloads/) installed

Note that you will create a Python 3.9 environment in the Python Environment step below.

A Snowflake account with [Anaconda Packages enabled by ORGADMIN](https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-packages.html#using-third-party-packages-from-anaconda). If you do not have a Snowflake account, you can register for a free trial account.

A Snowflake account login with a role that has the ability to create database, schema, tables, stages, user-defined functions, and stored procedures. 

If not, you will need to register for a [free trial](https://signup.snowflake.com/) or use a different role.

### Python Environment
To create our Python environment, we run the following in our terminal:

conda create --name ai-demo --override-channels -c https://repo.anaconda.com/pkgs/snowflake python=3.9 
conda activate ai-demo
conda install -c https://repo.anaconda.com/pkgs/snowflake snowflake-connector-python snowflake-snowpark-python snowflake snowflake-ml-python xgboost cachetools joblib
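
After the install finishes, a quick check like the sketch below can confirm which packages are visible in the active environment. It uses only the standard library. The package list is an assumption based on the install command above, and `installed_versions` is a hypothetical helper name, so adjust both to match your setup.

```python
from importlib import metadata

# Distribution names assumed from the conda install command above
PACKAGES = [
    "snowflake-connector-python",
    "snowflake-snowpark-python",
    "snowflake-ml-python",
    "xgboost",
    "cachetools",
    "joblib",
]

def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Note that `os` and `json` ship with Python itself, so they never need to be installed with conda.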

In [None]:
### Load Libraries
# Snowpark
import snowflake.snowpark as snp
from snowflake.snowpark import functions as F
from snowflake.snowpark.functions import udf, col, lag, lit, trunc, to_date, replace, last_day, mean, median
from snowflake.snowpark.session import Session
from snowflake.snowpark.types import *
from snowflake.snowpark.version import VERSION
import snowflake.connector

# Snowpark ML
# https://docs.snowflake.com/en/developer-guide/snowpark-ml/snowpark-ml-modeling
import snowflake.ml.modeling.preprocessing as snowml
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.metrics.correlation import correlation
from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.ml.modeling.model_selection import GridSearchCV
from snowflake.ml.registry import model_registry
from snowflake.ml._internal.utils import identifier
from snowflake.ml.modeling.metrics import mean_absolute_percentage_error


# data sci libraries
import pandas as pd
import os
import json
import numpy as np
import joblib
import cachetools


# warning suppression
import warnings; warnings.simplefilter('ignore')

In [None]:
### Enter Creds, Connect to Snowflake and Create Session

# Read credentials
with open('creds.json') as f:
    connection_parameters = json.load(f)
# Connect to a snowflake session
session = Session.builder.configs(connection_parameters).create()
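
The cell above expects a `creds.json` file next to the notebook. Below is a minimal sketch of what that file might contain, plus a check you could run before connecting. The key names are standard Snowpark connection parameters, but the placeholder values, the `REQUIRED_KEYS` tuple, and the `missing_keys` helper are hypothetical, and your account may need different keys (for example, an authenticator instead of a password).

```python
import json

# Placeholder values only; replace with your own account details
example_creds = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "role": "<role>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}

# Assumed minimum for password authentication
REQUIRED_KEYS = ("account", "user", "password")

def missing_keys(params, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in params."""
    return [key for key in required if not params.get(key)]

# Round-trip through JSON to mimic reading creds.json from disk
loaded = json.loads(json.dumps(example_creds))
print(missing_keys(loaded))
```

Keeping credentials in a separate file like this also makes it easy to exclude them from version control.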

In [None]:
# Confirm the current session settings
print(session.get_current_database())
print(session.get_current_schema())
print(session.get_current_warehouse())

# What version of snowpark are we running?
snowpark_version = VERSION

print('Snowpark for Python version : {}.{}.{}'.format(snowpark_version[0],snowpark_version[1],snowpark_version[2]))

### Press Conference Data 

The notebook we just ran in the Snowflake cloud saved our scored FOMC press conferences to our database.
The next section walks through how to take a CSV file from GitHub and save it to Snowflake.

### Upload data to Snowflake

If you're running this document from GitHub, you will need to load data to your Snowflake account. These two CSV files are available in the GitHub repository here:

https://github.com/sfc-gh-jregenstein/aifutureoffinance

I downloaded them to my local FOMC-ROBERTA folder and here's how to access them and load to Snowflake. 

In [None]:
pc_data = pd.read_csv("FOMC-ROBERTA/pc_data_for_csv.csv")

pc_data.head(5)

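pc_data = pc_data[['DATE', 'SENTENCE', 'HAWKISH']]

# overwrite must be a boolean; auto_create_table creates the table if it does not already exist
session.write_pandas(pc_data, "PRESS_CONFERENCE_SCORED_UPPER_DATA", auto_create_table=True, overwrite=True)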

### Transform and Create Features

From here, we're going to do some wrangling on this data that is different from how the paper treats it - which is the power of being able to pull models off of Hugging Face, deploy them locally and then use the resulting data in our pipelines. We get to be creative. 

Here I'm going to calculate the monthly change in hawkishness for Fed sentiment, so that we can eventually pass that change as a feature to our model for predicting changes in inflation. The wrangling below happens in the cloud, not on my desktop. 

In [None]:
# PC Scored Data

date_win = snp.Window.order_by('DATE')

fed_percent_sent = (
    session.table("PRESS_CONFERENCE_SCORED_UPPER_DATA")
    .group_by('DATE')
    .agg(
        median('HAWKISH').alias('MEDIAN_SENTIMENT'),
        mean('HAWKISH').alias('MEAN_SENTIMENT'),
    )
    .sort('DATE')
    # Mean sentiment of the previous press conference
    .with_column('LAG_FOMC_MEAN_SENTIMENT', lag('MEAN_SENTIMENT', offset=1).over(date_win))
    # Percent change in mean sentiment between consecutive press conferences
    .with_column('FED_CHANGE', (col('MEAN_SENTIMENT') - col('LAG_FOMC_MEAN_SENTIMENT')) / col('LAG_FOMC_MEAN_SENTIMENT'))
    # Month-end date, used as the join key against the CPI data
    .with_column('FRED_DATE', last_day('DATE'))
    .select('FRED_DATE', 'FED_CHANGE')
)

fed_percent_sent.show(5)
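
To make the arithmetic concrete, here is a small local sketch of the same calculation in plain Python: average the scores per date, then take the percent change against the previous date. The rows are made-up numbers, not real FOMC scores, and `sentiment_change` is a hypothetical helper, not part of Snowpark.

```python
from collections import defaultdict
from statistics import mean

# Made-up (date, hawkish score) rows standing in for the scored sentences
rows = [
    ("2023-01-31", 0.2), ("2023-01-31", 0.4),
    ("2023-03-31", 0.6), ("2023-03-31", 0.6),
    ("2023-05-31", 0.3), ("2023-05-31", 0.6),
]

def sentiment_change(rows):
    """Mean score per date, then percent change versus the previous date."""
    by_date = defaultdict(list)
    for date, score in rows:
        by_date[date].append(score)

    changes = []
    previous = None
    for date in sorted(by_date):  # ISO-formatted dates sort chronologically
        current = mean(by_date[date])
        # The first date has no predecessor, mirroring the NULL produced by lag()
        change = None if previous is None else (current - previous) / previous
        changes.append((date, change))
        previous = current
    return changes

for date, change in sentiment_change(rows):
    print(date, change)
```

The first date always comes back empty because there is nothing to compare it with, which is why the later join drops rows with missing values.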

### Inflation Data

We're going to explore whether that change in Fed sentiment helps us model changes in inflation. We have only about 60 observations here but our goal is to explore this process and how to create new features and input them into a model. First let's load up our inflation data. This is CPI data, readily available on FRED or in the Snowflake marketplace via Cybersyn. 

In [None]:
cpi_data = session.table("CPI_DATA")


Next we join our Fed sentiment data with our Inflation data, aligning the change in Fed sentiment with the change in inflation. In practice we would want to explore many different lags.

In [None]:
date_win = snp.Window.order_by('DATE')

fed_percent_sent.show(5)

cpi_fed = (
    cpi_data
    # CPI level 12 rows earlier (12 months, assuming one row per month)
    .with_column('LAG_CPI_12', lag('VALUE', offset=12).over(date_win))
    # Year-over-year percent change in CPI
    .with_column('CPI_CHANGE', (col('VALUE') - col('LAG_CPI_12')) / col('LAG_CPI_12'))
    .join(fed_percent_sent, cpi_data.col('DATE') == fed_percent_sent.col('FRED_DATE'))
    .select('DATE', 'FED_CHANGE', 'CPI_CHANGE')
    .dropna()
)

cpi_fed.show(5)
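
The 12-period change and the idea of trying different lags can be sketched locally in plain Python. Everything below is illustrative: the series are made up, and `pct_change`, `shift`, and `aligned_pairs` are hypothetical helpers that mimic what `lag(...).over(...)` and `dropna()` do in the Snowpark code.

```python
def pct_change(values, periods):
    """Percent change versus the value `periods` steps earlier (None where unavailable)."""
    out = []
    for i, value in enumerate(values):
        if i < periods:
            out.append(None)
        else:
            base = values[i - periods]
            out.append((value - base) / base)
    return out

def shift(values, lag):
    """Delay a series by `lag` steps, padding the start with None."""
    return ([None] * lag + list(values))[:len(values)]

def aligned_pairs(feature, target, lag):
    """Pair the feature from `lag` steps ago with the current target, dropping gaps."""
    return [
        (f, t)
        for f, t in zip(shift(feature, lag), target)
        if f is not None and t is not None
    ]

# Made-up monthly series: 14 months of a CPI-like index and a sentiment change
cpi = [100.0 + i for i in range(14)]
fed_change = [0.01 * i for i in range(14)]

cpi_change = pct_change(cpi, periods=12)  # year-over-year change

for lag in (0, 1, 2):
    print(lag, aligned_pairs(fed_change, cpi_change, lag))
```

Each lag gives a different pairing of sentiment and inflation, and each pairing could be fed to the model as a separate candidate feature.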

### XGBoost

Import our press conference data 


Build a simple XGBoost Regression model
What's happening here?

The model.fit() function creates a temporary stored procedure in the background, which means that model training is a single-node operation. Be sure to use a Snowpark-optimized warehouse if you need more memory. We are just using an XS Standard Virtual Warehouse here, which we created at the beginning of this quickstart.
The model.predict() function creates a temporary vectorized UDF in the background, which means the input DataFrame is batched as pandas DataFrames and inference is parallelized across the batches of data.
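
The batching idea behind a vectorized UDF can be illustrated locally. The sketch below is not Snowflake's implementation; it only shows that scoring the data in independent batches and stitching the results together gives the same answer as scoring everything at once. `toy_predict` is a made-up stand-in for a trained model, and the batch size is arbitrary.

```python
def toy_predict(batch):
    """Stand-in model: a fixed linear rule applied to a batch of feature values."""
    return [0.5 * x + 0.01 for x in batch]

def batches(values, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]

def predict_in_batches(values, batch_size):
    """Score each batch independently and stitch the results back together in order."""
    predictions = []
    for batch in batches(values, batch_size):
        predictions.extend(toy_predict(batch))
    return predictions

features = [0.1, -0.2, 0.05, 0.3, -0.1]
print(predict_in_batches(features, batch_size=2) == toy_predict(features))
```

Because each batch is scored on its own, the batches can be handed to different workers, which is what makes inference easy to parallelize.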



In [None]:
session.use_warehouse('snowpark_opt_wh')

# Split the data into train and test sets
fomc_train_df, fomc_test_df = cpi_fed.random_split(weights=[0.7, 0.3], seed=0)

fomc_train_df.show(5)

# Define the XGBRegressor
regressor = XGBRegressor(
    input_cols=['FED_CHANGE'],
    label_cols=['CPI_CHANGE'],
    output_cols=['PREDICTED_CPI_CHANGE']
)

# Train
regressor.fit(fomc_train_df)

# Predict
result = regressor.predict(fomc_test_df)

result.to_pandas()


In [None]:
mape = mean_absolute_percentage_error(df=result, 
                                        y_true_col_names="CPI_CHANGE", 
                                        y_pred_col_names="PREDICTED_CPI_CHANGE")

result.select("CPI_CHANGE", "PREDICTED_CPI_CHANGE").show()
print(f"Mean absolute percentage error: {mape}")
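
For reference, mean absolute percentage error is the average of |actual - predicted| / |actual| across observations, reported as a fraction rather than a percentage. The sketch below works through that formula on made-up numbers; `mape_by_hand` is a hypothetical helper, and unlike library implementations it makes no attempt to handle actual values of zero.

```python
def mape_by_hand(y_true, y_pred):
    """Mean of |actual - predicted| / |actual| across observations."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up CPI changes: the per-row errors are 25%, 0%, and 50%
actual = [0.04, 0.05, 0.02]
predicted = [0.03, 0.05, 0.03]

print(f"MAPE: {mape_by_hand(actual, predicted):.2f}")
```

Because the error is scaled by the actual value, small CPI changes close to zero can inflate this metric considerably.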

In [None]:



import seaborn as sns
import matplotlib.pyplot as plt
# Plot actual vs predicted 
g = sns.relplot(data=result["CPI_CHANGE", "PREDICTED_CPI_CHANGE"].to_pandas().astype("float64"), x="CPI_CHANGE", y="PREDICTED_CPI_CHANGE", kind="scatter")
g.ax.axline((0,0), slope=1, color="r")

plt.show()



In [None]:
### Grid Search


grid_search = GridSearchCV(
    estimator=XGBRegressor(),
    param_grid={
        "n_estimators":[100, 200, 300, 400, 500],
        "learning_rate":[0.1, 0.2, 0.3, 0.4, 0.5],
    },
    n_jobs = -1,
    scoring="neg_mean_absolute_percentage_error",
    input_cols=['FED_CHANGE'],
    label_cols=['CPI_CHANGE'],
    output_cols=['PREDICTED_CPI_CHANGE']
)

# Train
grid_search.fit(fomc_train_df)


# Predict
result = grid_search.predict(fomc_test_df)

# Analyze results
mape = mean_absolute_percentage_error(df=result, 
                                        y_true_col_names="CPI_CHANGE", 
                                        y_pred_col_names="PREDICTED_CPI_CHANGE")

result.select("CPI_CHANGE", "PREDICTED_CPI_CHANGE").show()
print(f"Mean absolute percentage error: {mape}")


# Analyze grid search results
gs_results = grid_search.to_sklearn().cv_results_
n_estimators_val = []
learning_rate_val = []

for param_dict in gs_results["params"]:
    n_estimators_val.append(param_dict["n_estimators"])
    learning_rate_val.append(param_dict["learning_rate"])
mape_val = gs_results["mean_test_score"]*-1

gs_results_df = pd.DataFrame(data={
    "n_estimators":n_estimators_val,
    "learning_rate":learning_rate_val,
    "mape":mape_val})



# Let's save our optimal model first and its metadata
optimal_model = grid_search.to_sklearn().best_estimator_
optimal_n_estimators = grid_search.to_sklearn().best_estimator_.n_estimators
optimal_learning_rate = grid_search.to_sklearn().best_estimator_.learning_rate

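# Take the scalar MAPE for the best parameter combination (.loc alone returns a Series)
optimal_mape = float(gs_results_df.loc[(gs_results_df['n_estimators']==optimal_n_estimators) &
                                       (gs_results_df['learning_rate']==optimal_learning_rate), 'mape'].values[0])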
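
Two details of the grid search are easy to miss: every combination of the listed parameter values is tried, and the `neg_mean_absolute_percentage_error` scores are negated errors, so the best candidate has the highest score and the lowest MAPE. The plain-Python sketch below illustrates both with a smaller grid and made-up scores; `expand_grid` and `best_by_mape` are hypothetical helpers, not part of Snowpark ML or scikit-learn.

```python
from itertools import product

def expand_grid(param_grid):
    """List every combination of parameter values as a dict."""
    names = list(param_grid)
    return [
        dict(zip(names, combo))
        for combo in product(*(param_grid[name] for name in names))
    ]

def best_by_mape(candidates, neg_mape_scores):
    """Flip the sign of the scores and return the candidate with the lowest MAPE."""
    mape_values = [-score for score in neg_mape_scores]
    best_index = min(range(len(candidates)), key=mape_values.__getitem__)
    return candidates[best_index], mape_values[best_index]

# Smaller grid than the one above, with made-up scores for illustration
grid = {"n_estimators": [100, 200], "learning_rate": [0.1, 0.2]}
candidates = expand_grid(grid)
made_up_scores = [-0.40, -0.25, -0.35, -0.30]

best_params, best_mape = best_by_mape(candidates, made_up_scores)
print(best_params, best_mape)
```

With five values for each of the two parameters, the grid in the cell above produces 25 candidates, each of which is cross-validated.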


In [None]:
### Model Registry



# Get sample input data to pass into the registry logging function
# Use only the feature column the model was trained on (CPI_CHANGE is the label)
X = fomc_train_df.select('FED_CHANGE').limit(10)

db = identifier._get_unescaped_name(session.get_current_database())
schema = identifier._get_unescaped_name(session.get_current_schema())

# Define model name and version
model_name = "cpi_fed_model"
model_version = 1

# Create a registry and log the model
registry = model_registry.ModelRegistry(session=session, database_name=db, schema_name=schema, create_if_not_exists=True)

registry.log_model(
    model_name=model_name,
    model_version=model_version,
    model=optimal_model,
    sample_input_data=X,
    options={"embed_local_ml_library": True, # This option is enabled to pull latest dev code changes.
             "relax": True} # relax dependencies
)

# Add evaluation metric
registry.set_metric(model_name=model_name, model_version=model_version, metric_name="mean_abs_pct_err", metric_value=optimal_mape)



# Let's confirm it was added
registry.list_models().to_pandas()


# Pick a deployment name and deploy
model_deployment_name = model_name + f"{model_version}" + "_UDF"

registry.deploy(model_name=model_name,
                model_version=model_version,
                deployment_name=model_deployment_name, 
                target_method="predict", 
                permanent=True, 
                options={"relax_version": True})


# Let's confirm it was added
registry.list_deployments(model_name, model_version).to_pandas()


### Use our Model for Inference


# We can always get a reference to our registered model using this call
model_ref = model_registry.ModelReference(registry=registry, model_name=model_name, model_version=model_version)

# We can then use the deployed model to perform inference
result_sdf = model_ref.predict(deployment_name=model_deployment_name, data=fomc_test_df)
result_sdf.rename(F.col('"output_feature_0"'),"PREDICTED_VALUE").show()


session.close()
