# Snowpark Introduction
This notebook introduces several key features of Snowpark while training a machine learning model to predict diabetes readmission.

* Establish secure connection to Snowflake
* Load features and target from Snowflake table into Snowpark DataFrame
* Prepare features for model training
* Train ML model using Snowpark ML distributed processing
* Save the model to the Snowflake Model Registry
* Run model predictions inside Snowflake

## 1. Setup Environment

In [None]:
# Snowflake connector
from snowflake import connector
#from snowflake.ml.utils import connection_params

# Snowpark for Python
from snowflake.snowpark.session import Session
from snowflake.snowpark.types import Variant
from snowflake.snowpark.version import VERSION

# Snowpark ML
from snowflake.ml.modeling.compose import ColumnTransformer
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.preprocessing import PolynomialFeatures, StandardScaler
from snowflake.ml.modeling.preprocessing import OrdinalEncoder
from snowflake.ml.modeling.impute import SimpleImputer
from snowflake.ml.modeling.model_selection import GridSearchCV

from snowflake.ml.modeling.ensemble import RandomForestClassifier
from snowflake.ml.modeling.xgboost import XGBClassifier

#from sklearn import datasets
import sklearn 

# Misc
import pandas as pd
import json
import logging 
logger = logging.getLogger("snowflake.snowpark.session")
logger.setLevel(logging.ERROR)

## Establish Secure Connection to Snowflake

Using the Snowpark Python API, it’s quick and easy to establish a secure connection between Snowflake and the notebook.
 *Note: This notebook connects with a username and password; other connection options include MFA, OAuth, Okta, and SSO.*

I like to store my credentials in creds.json so they aren't in the notebook.
The file should look like this:
```
{
    "account": "awb99999",
    "user": "your_user_name",
    "password": "your_password",
    "warehouse": "your_warehouse"
}
```

In [None]:
with open('../../creds.json') as f:
    data = json.load(f)
    USERNAME = data['user']
    PASSWORD = data['password']
    SF_ACCOUNT = data['account']
    SF_WH = data['warehouse']

CONNECTION_PARAMETERS = {
   "account": SF_ACCOUNT,
   "user": USERNAME,
   "password": PASSWORD,
   "warehouse": SF_WH,  # loaded from creds.json above; previously unused
}

session = Session.builder.configs(CONNECTION_PARAMETERS).create()
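As a hedged sketch, the credential loading above could be wrapped in a small helper that fails early when a key is missing. The function name `load_connection_parameters` and the required/optional key groupings are assumptions made for illustration; they are not part of Snowpark. The returned dict is meant to be passed to `Session.builder.configs(...)`.

```python
import json
from pathlib import Path

# Hypothetical groupings: keys this notebook relies on vs. keys that may be present
REQUIRED_KEYS = ("account", "user", "password")
OPTIONAL_KEYS = ("warehouse", "role", "database", "schema")


def load_connection_parameters(path):
    """Read a creds.json-style file and return a dict of connection parameters."""
    creds = json.loads(Path(path).read_text())

    # Fail early with a clear message instead of a late authentication error
    missing = [key for key in REQUIRED_KEYS if not creds.get(key)]
    if missing:
        raise KeyError(f"Missing required credential keys: {missing}")

    params = {key: creds[key] for key in REQUIRED_KEYS}
    # Only pass optional settings that are actually present in the file
    params.update({key: creds[key] for key in OPTIONAL_KEYS if creds.get(key)})
    return params
```

With this helper, the session could be created with `Session.builder.configs(load_connection_parameters('../../creds.json')).create()`.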

Verify that everything is connected.

In [None]:
snowflake_environment = session.sql('select current_user(), current_version()').collect()
snowpark_version = VERSION

# Current Environment Details
print('User                        : {}'.format(snowflake_environment[0][0]))
print('Role                        : {}'.format(session.get_current_role()))
print('Database                    : {}'.format(session.get_current_database()))
print('Schema                      : {}'.format(session.get_current_schema()))
print('Warehouse                   : {}'.format(session.get_current_warehouse()))
print('Snowflake version           : {}'.format(snowflake_environment[0][1]))
print('Snowpark for Python version : {}.{}.{}'.format(snowpark_version[0],snowpark_version[1],snowpark_version[2]))

If you need to add custom packages with specific versions, you can run the following.

In [None]:
#session.custom_package_usage_config = {"enabled": True}
#session.add_packages("numpy", "pandas==1.5.3","scikit-learn==1.3.0","xgboost")

## 2. Load Data in Snowflake 

Let's load the data (100k rows and 32 columns) and convert all the column names to uppercase. Column names that aren't case sensitive are easier to work with.

In [None]:
df_clean = pd.read_csv('diabetes_clean.csv')
df_clean.columns = df_clean.columns.str.upper()
print (df_clean.shape)
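Uppercasing alone does not make every name usable without quotes: a name such as `GLIPIZIDE-METFORMIN` contains a hyphen, so it still has to be double-quoted in SQL. The sketch below shows one optional way to detect and clean such names. The helpers `needs_quoting` and `to_snowflake_names` are hypothetical and are not applied anywhere in this notebook, which keeps the original column names.

```python
import re

# Unquoted Snowflake identifiers start with a letter or underscore and contain
# only letters, digits, underscores, and dollar signs.
_UNQUOTED = re.compile(r"^[A-Z_][A-Z0-9_$]*$")


def needs_quoting(name):
    """True if an (already uppercased) name must be double-quoted in SQL."""
    return _UNQUOTED.match(name) is None


def to_snowflake_names(columns):
    """Uppercase names and replace disallowed characters with underscores."""
    cleaned = []
    for col in columns:
        name = re.sub(r"[^A-Z0-9_$]", "_", col.strip().upper())
        # Prefix an underscore if the first character is not a letter or underscore
        if not re.match(r"[A-Z_]", name):
            name = "_" + name
        cleaned.append(name)
    return cleaned


print(to_snowflake_names(["race", "glipizide-metformin"]))
```

If you do rename columns this way, any SQL written against the table must use the new names.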

Let's create a Snowpark dataframe and split the data for test/train. This operation is done inside Snowflake and not in your local environment. We will also save this as a table so we don't ever have to manually upload this dataset again.

PRO TIP: Snowpark carries the schema of the pandas DataFrame over to Snowflake. Change the schema either before importing or after the table has landed in Snowflake.

In [None]:

input_df = session.create_dataframe(df_clean)
train_df, test_df = input_df.random_split(weights=[0.8, 0.2], seed=0)
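To build intuition for what a weighted random split does, here is a conceptual pure-Python sketch. It is an illustration only and not Snowpark's actual implementation; the function below is a local stand-in that happens to share the name `random_split`. Each row draws a uniform random number and lands in the partition whose cumulative weight range contains it, so the resulting sizes are only approximately 80/20.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one partition with probability proportional to its weight."""
    total = float(sum(weights))

    # Cumulative upper bounds in [0, 1], e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)  # seeded generator makes the split reproducible
    parts = [[] for _ in weights]
    for row in rows:
        r = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if r < upper:
                parts[i].append(row)
                break
    return parts


train_rows, test_rows = random_split(range(1000), weights=[0.8, 0.2], seed=0)
print(len(train_rows), len(test_rows))
```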

In [None]:
train_df.write.mode('overwrite').save_as_table('DIAB_TRAIN')
test_df.write.mode('overwrite').save_as_table('DIAB_TEST')

In [None]:
train_df = session.table("DIAB_TRAIN")
test_df = session.table("DIAB_TEST")

Snowpark dataframes have lots of operations that can be performed on them. Here I am printing out the column names before we start the feature engineering.

In [None]:
# Categorical columns in the local pandas DataFrame
print(df_clean.select_dtypes(include=['category', 'object']).columns.tolist())
# All column names in the Snowpark DataFrame
train_df.columns

## 3. Distributed Feature Engineering

These operations run inside the Snowflake warehouse, which provides improved performance and scalability through distributed execution of these scikit-learn preprocessing functions. This notebook uses a SMALL warehouse, but you can always move up to larger ones, including Snowpark-optimized warehouses (16x the memory per node of a standard warehouse), e.g., `session.sql("create or replace warehouse snowpark_opt_wh with warehouse_size = 'MEDIUM' warehouse_type = 'SNOWPARK-OPTIMIZED'").collect()`

In [None]:
session.sql("ALTER WAREHOUSE RAJIV SET WAREHOUSE_SIZE = 'SMALL'").collect()

Feature engineering plus a scikit-learn-style random forest

In [None]:
 ## Distributed Preprocessing - 25X to 50X faster
numeric_features = ['TIME_IN_HOSPITAL', 'NUM_LAB_PROCEDURES', 'NUM_PROCEDURES',
       'NUM_MEDICATIONS', 'NUMBER_OUTPATIENT', 'NUMBER_EMERGENCY',
       'NUMBER_INPATIENT', 'NUMBER_DIAGNOSES', 'CHANGE', 'DIABETESMED']
numeric_transformer = Pipeline(steps=[('poly',PolynomialFeatures(degree = 2)),('scaler', StandardScaler())])

#categorical_cols = ['RACE', 'GENDER','AGE', 'MEDICAL_SPECIALTY','DIAG_1','DIAG_2','DIAG_3','DISCHARGE_DISPOSITION_ID','ADMISSION_SOURCE_ID','ADMISSION_TYPE_ID']
categorical_cols = ['RACE', 'GENDER','AGE', 'MEDICAL_SPECIALTY','DIAG_1','DIAG_2','DIAG_3']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # Step renamed from 'onehot': this is an ordinal encoder, not a one-hot encoder
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value',unknown_value=-99999))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_cols)
        ])

pipeline = Pipeline(steps=[('preprocessor', preprocessor),('classifier', RandomForestClassifier())])
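To see how much the numeric pipeline widens the feature set, the sketch below enumerates the degree-2 polynomial terms for the ten numeric features. It assumes the scikit-learn defaults for `PolynomialFeatures` (a bias term is included and squared terms are not excluded); the helper `polynomial_terms` is hypothetical and only mirrors the counting, not the actual transformation.

```python
from itertools import combinations_with_replacement
from math import comb


def polynomial_terms(features, degree=2):
    """List every monomial up to `degree`, including the constant (bias) term."""
    terms = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(features, d):
            terms.append(" * ".join(combo) if combo else "1")
    return terms


numeric_features = ['TIME_IN_HOSPITAL', 'NUM_LAB_PROCEDURES', 'NUM_PROCEDURES',
                    'NUM_MEDICATIONS', 'NUMBER_OUTPATIENT', 'NUMBER_EMERGENCY',
                    'NUMBER_INPATIENT', 'NUMBER_DIAGNOSES', 'CHANGE', 'DIABETESMED']

terms = polynomial_terms(numeric_features, degree=2)
# 1 bias + 10 linear + 55 quadratic terms, which equals C(10 + 2, 2)
print(len(terms), comb(len(numeric_features) + 2, 2))
```

Under those assumptions, ten numeric inputs become 66 columns before scaling, which is worth keeping in mind when sizing the warehouse.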

## 4. Distributed Training

These operations run inside the Snowflake warehouse, which provides improved performance and scalability through distributed execution of the scikit-learn preprocessing functions and of model training (a random forest here, XGBoost later, and many other types of models).

In [None]:
 ## Distributed HyperParameter Optimization
hyper_param = dict(
        classifier__max_depth=[5,10,30],
       # min_samples_leaf=[1,3,10],
      #  min_samples_split=[1.0, 3,10],
        classifier__n_estimators = [20,50,200]
    )

model = GridSearchCV(
    estimator=pipeline,
    param_grid=hyper_param,
    cv=5,
    input_cols=numeric_features + categorical_cols,
    label_cols=['READMITTED'],
    output_cols=["READMITTED_PRED"],
    verbose=2
)

# Fit and Score - Takes 2 minutes
model.fit(train_df)
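The cost of a grid search grows with the number of parameter combinations times the number of folds. The sketch below expands the same grid used above with `itertools.product` to count the candidates and the cross-validation fits; `expand_grid` is a hypothetical helper written for this illustration.

```python
from itertools import product

hyper_param = {
    "classifier__max_depth": [5, 10, 30],
    "classifier__n_estimators": [20, 50, 200],
}


def expand_grid(param_grid):
    """Return one dict per combination of parameter values."""
    keys = sorted(param_grid)
    return [dict(zip(keys, values))
            for values in product(*(param_grid[k] for k in keys))]


candidates = expand_grid(hyper_param)
cv = 5
n_fits = len(candidates) * cv  # one fit per candidate per fold

print(f"{len(candidates)} candidates x {cv} folds = {n_fits} fits")
```

If the search refits the best candidate on the full training set afterwards (the scikit-learn default), that adds one more fit on top of this count.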

## 5. Model Evaluation
Look at the results of the model. `cv_results` is a dictionary in which each key is a string describing one of the metrics or parameters, and the corresponding value is an array with one entry per combination of parameters.

In [None]:
cv_results = model.to_sklearn().cv_results_

# cv_results is a dictionary, where each key is a string describing one of the metrics or parameters,
# and the corresponding value is an array with one entry per combination of parameters

# To print the parameter setting and the corresponding mean test score, for example:
for i in range(len(cv_results['params'])):
    print(f"Parameters: {cv_results['params'][i]}")
    print(f"Mean Test Score: {cv_results['mean_test_score'][i]}")
    print()
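Rather than scanning the printed list by eye, the candidates can be ranked by mean test score. The sketch below works on any dictionary shaped like `cv_results`; the helper `top_candidates` is hypothetical, and the scores in `example` are placeholder values for illustration only, not results from this notebook.

```python
def top_candidates(cv_results, n=3):
    """Return the n (params, mean_test_score) pairs with the highest mean test score."""
    pairs = zip(cv_results["params"], cv_results["mean_test_score"])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


# Placeholder values for illustration only; NOT results from this notebook
example = {
    "params": [
        {"classifier__max_depth": 5},
        {"classifier__max_depth": 10},
        {"classifier__max_depth": 30},
    ],
    "mean_test_score": [0.61, 0.64, 0.62],
}

for params, score in top_candidates(example, n=2):
    print(params, score)
```

In the notebook itself, you would call `top_candidates(cv_results)` on the dictionary extracted above.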

Look at the accuracy of the model

In [None]:
train_score = model.score(train_df)
test_score = model.score(test_df)

# Accuracy on train and test datasets
print(f"Accuracy on Train : {train_score}")
print(f"Accuracy on Test  : {test_score}")

Dig into results a bit more

In [None]:
testproba = model.predict_proba(test_df)
testproba.write.save_as_table(table_name='DIABETES_TEST_PROBA', mode='overwrite')

testpreds = model.predict(test_df)
testpreds.write.save_as_table(table_name='DIABETES_TEST_SCORED', mode='overwrite')
testpreds.show()

Use the metrics from Snowpark ML so the calculation is done inside Snowflake.

In [None]:
from snowflake.ml.modeling.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
print('Accuracy:', accuracy_score(df=testpreds, y_true_col_names='READMITTED', y_pred_col_names='READMITTED_PRED'))
print('Precision:', precision_score(df=testpreds, y_true_col_names='READMITTED', y_pred_col_names='READMITTED_PRED'))
print('Recall:', recall_score(df=testpreds, y_true_col_names='READMITTED', y_pred_col_names='READMITTED_PRED'))
print('F1:', f1_score(df=testpreds, y_true_col_names='READMITTED', y_pred_col_names='READMITTED_PRED'))
##AUC needs probabilities
print('AUC:', roc_auc_score(df=testproba, y_true_col_names='READMITTED', y_score_col_names='"predict_proba_1"'))
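For reference, here is a minimal pure-Python sketch of how accuracy, precision, recall, and F1 relate to the four confusion-matrix counts for a binary label. The function `binary_metrics` and the toy labels are assumptions for illustration; the Snowpark ML functions above compute the same quantities inside Snowflake.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```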

In [None]:
# Obtaining and plotting a simple confusion matrix
import seaborn as sns
from snowflake.ml.modeling.metrics import confusion_matrix
cf_matrix = confusion_matrix(df=testpreds, y_true_col_name='READMITTED', y_pred_col_name='READMITTED_PRED')
sns.heatmap(cf_matrix, annot=True, fmt='.0f', cmap='Blues')

## 6. Let's do this all over with XGBoost

In [None]:
from snowflake.ml.modeling.xgboost import XGBClassifier

In [None]:
#session.sql("create or replace warehouse snowpark_opt_wh with warehouse_size = 'MEDIUM' warehouse_type = 'SNOWPARK-OPTIMIZED'").collect()

In [None]:
pipeline = Pipeline(steps=[('preprocessor', preprocessor),('classifier', XGBClassifier())])
 ## Distributed HyperParameter Optimization
hyper_param = dict(
        classifier__max_depth=[2,4],
       # min_samples_leaf=[1,3,10],
      #  min_samples_split=[1.0,3,10],
        classifier__n_estimators = [20,50,200]
    )

xg_model = GridSearchCV(
    estimator=pipeline,
    param_grid=hyper_param,
    cv=5,
    input_cols=numeric_features + categorical_cols,
    label_cols=['READMITTED'],
    output_cols=["READMITTED_PRED"],
    scoring="roc_auc",
    verbose=2
)

# Fit and Score
xg_model.fit(train_df)
##Takes 80 seconds

In [None]:
cv_results = xg_model.to_sklearn().cv_results_

for i in range(len(cv_results['params'])):
    print(f"Parameters: {cv_results['params'][i]}")
    print(f"Mean Test Score: {cv_results['mean_test_score'][i]}")
    print()

In [None]:
testproba = xg_model.predict_proba(test_df)
print('AUC:', roc_auc_score(df=testproba, y_true_col_names='READMITTED', y_score_col_names='"predict_proba_1"'))
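AUC is computed from probabilities rather than hard predictions because it measures ranking: it is the chance that a randomly chosen positive row receives a higher score than a randomly chosen negative row, with ties counting as half. The sketch below computes it by brute force over all positive/negative pairs. The function `pairwise_auc` and the toy scores are assumptions for illustration, and this quadratic approach is only sensible for small inputs.

```python
def pairwise_auc(y_true, y_score, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == positive]
    neg = [s for t, s in zip(y_true, y_score) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative row")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data for illustration only
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```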

## 7. Let's Store the XGBoost Model in the Model Registry and Run Predictions Inside Snowflake (Python/SQL)

Connect to the registry

In [None]:
from snowflake.ml.registry import Registry

reg = Registry(session=session, database_name="RAJIV", schema_name="PUBLIC")

In [None]:
model_ref = reg.log_model(
    model_name="Diabetes_XGBooster",
    version_name="v7",    
    model=xg_model,
    conda_dependencies=["scikit-learn","xgboost"],
    sample_input_data=train_df,
)

In [None]:
reg.show_models()

Let's retrieve the model from the registry

In [None]:
reg_model = reg.get_model("DIABETES_XGBOOSTER").version("v7")

Let's do predictions inside the warehouse

In [None]:
remote_prediction = reg_model.run(test_df, function_name='predict_proba')
remote_prediction.show()

If you look in the activity view, you can find the SQL behind this call; running that SQL directly will be a bit faster. The SQL command below displays the result from a Snowpark DataFrame. You could use `collect` to pull the results into your local session.

In [None]:
results = session.sql("""SELECT "ADMISSION_TYPE_ID", "DISCHARGE_DISPOSITION_ID", "ADMISSION_SOURCE_ID", "MAX_GLU_SERUM", "A1CRESULT", "ACETOHEXAMIDE", "TOLBUTAMIDE", "TROGLITAZONE", "EXAMIDE", "CITOGLIPTON", "GLIPIZIDE-METFORMIN", "GLIMEPIRIDE-PIOGLITAZONE", "METFORMIN-ROSIGLITAZONE", "METFORMIN-PIOGLITAZONE", "READMITTED",  CAST ("TMP_RESULT"['TIME_IN_HOSPITAL'] AS BYTEINT) AS "TIME_IN_HOSPITAL",  CAST ("TMP_RESULT"['NUM_LAB_PROCEDURES'] AS SMALLINT) AS "NUM_LAB_PROCEDURES",  CAST ("TMP_RESULT"['NUM_PROCEDURES'] AS BYTEINT) AS "NUM_PROCEDURES",  CAST ("TMP_RESULT"['NUM_MEDICATIONS'] AS BYTEINT) AS "NUM_MEDICATIONS",  CAST ("TMP_RESULT"['NUMBER_OUTPATIENT'] AS BYTEINT) AS "NUMBER_OUTPATIENT",  CAST ("TMP_RESULT"['NUMBER_EMERGENCY'] AS BYTEINT) AS "NUMBER_EMERGENCY",  CAST ("TMP_RESULT"['NUMBER_INPATIENT'] AS BYTEINT) AS "NUMBER_INPATIENT",  CAST ("TMP_RESULT"['NUMBER_DIAGNOSES'] AS BYTEINT) AS "NUMBER_DIAGNOSES",  CAST ("TMP_RESULT"['CHANGE'] AS BYTEINT) AS "CHANGE",  CAST ("TMP_RESULT"['DIABETESMED'] AS BYTEINT) AS "DIABETESMED",  CAST ("TMP_RESULT"['RACE'] AS STRING) AS "RACE",  CAST ("TMP_RESULT"['GENDER'] AS STRING) AS "GENDER",  CAST ("TMP_RESULT"['AGE'] AS STRING) AS "AGE",  CAST ("TMP_RESULT"['MEDICAL_SPECIALTY'] AS STRING) AS "MEDICAL_SPECIALTY",  CAST ("TMP_RESULT"['DIAG_1'] AS STRING) AS "DIAG_1",  CAST ("TMP_RESULT"['DIAG_2'] AS STRING) AS "DIAG_2",  CAST ("TMP_RESULT"['DIAG_3'] AS STRING) AS "DIAG_3",  CAST ("TMP_RESULT"['predict_proba_0'] AS DOUBLE) AS "predict_proba_0",  CAST ("TMP_RESULT"['predict_proba_1'] AS DOUBLE) AS "predict_proba_1" 
FROM (
    WITH MODEL_VERSION_ALIAS AS MODEL RAJIV.PUBLIC.DIABETES_XGBOOSTER VERSION V7
    SELECT *,
        MODEL_VERSION_ALIAS!PREDICT_PROBA(TIME_IN_HOSPITAL, NUM_LAB_PROCEDURES, NUM_PROCEDURES, NUM_MEDICATIONS, NUMBER_OUTPATIENT, NUMBER_EMERGENCY, NUMBER_INPATIENT, NUMBER_DIAGNOSES, CHANGE, DIABETESMED, RACE, GENDER, AGE, MEDICAL_SPECIALTY, DIAG_1, DIAG_2, DIAG_3) AS TMP_RESULT
    FROM (SELECT * FROM RAJIV.PUBLIC.DIAB_TEST ))""")
# Keep the DataFrame in `results`; chaining .show() onto the assignment would store None
results.show()

Optimize inference by shuffling the rows with `ORDER BY RANDOM()`, which may spread the data more evenly across the warehouse nodes.

In [None]:
results = session.sql("""SELECT "ADMISSION_TYPE_ID", "DISCHARGE_DISPOSITION_ID", "ADMISSION_SOURCE_ID", "MAX_GLU_SERUM", "A1CRESULT", "ACETOHEXAMIDE", "TOLBUTAMIDE", "TROGLITAZONE", "EXAMIDE", "CITOGLIPTON", "GLIPIZIDE-METFORMIN", "GLIMEPIRIDE-PIOGLITAZONE", "METFORMIN-ROSIGLITAZONE", "METFORMIN-PIOGLITAZONE", "READMITTED",  CAST ("TMP_RESULT"['TIME_IN_HOSPITAL'] AS BYTEINT) AS "TIME_IN_HOSPITAL",  CAST ("TMP_RESULT"['NUM_LAB_PROCEDURES'] AS SMALLINT) AS "NUM_LAB_PROCEDURES",  CAST ("TMP_RESULT"['NUM_PROCEDURES'] AS BYTEINT) AS "NUM_PROCEDURES",  CAST ("TMP_RESULT"['NUM_MEDICATIONS'] AS BYTEINT) AS "NUM_MEDICATIONS",  CAST ("TMP_RESULT"['NUMBER_OUTPATIENT'] AS BYTEINT) AS "NUMBER_OUTPATIENT",  CAST ("TMP_RESULT"['NUMBER_EMERGENCY'] AS BYTEINT) AS "NUMBER_EMERGENCY",  CAST ("TMP_RESULT"['NUMBER_INPATIENT'] AS BYTEINT) AS "NUMBER_INPATIENT",  CAST ("TMP_RESULT"['NUMBER_DIAGNOSES'] AS BYTEINT) AS "NUMBER_DIAGNOSES",  CAST ("TMP_RESULT"['CHANGE'] AS BYTEINT) AS "CHANGE",  CAST ("TMP_RESULT"['DIABETESMED'] AS BYTEINT) AS "DIABETESMED",  CAST ("TMP_RESULT"['RACE'] AS STRING) AS "RACE",  CAST ("TMP_RESULT"['GENDER'] AS STRING) AS "GENDER",  CAST ("TMP_RESULT"['AGE'] AS STRING) AS "AGE",  CAST ("TMP_RESULT"['MEDICAL_SPECIALTY'] AS STRING) AS "MEDICAL_SPECIALTY",  CAST ("TMP_RESULT"['DIAG_1'] AS STRING) AS "DIAG_1",  CAST ("TMP_RESULT"['DIAG_2'] AS STRING) AS "DIAG_2",  CAST ("TMP_RESULT"['DIAG_3'] AS STRING) AS "DIAG_3",  CAST ("TMP_RESULT"['predict_proba_0'] AS DOUBLE) AS "predict_proba_0",  CAST ("TMP_RESULT"['predict_proba_1'] AS DOUBLE) AS "predict_proba_1" 
FROM (
    WITH MODEL_VERSION_ALIAS AS MODEL RAJIV.PUBLIC.DIABETES_XGBOOSTER VERSION V7
    SELECT *,
        MODEL_VERSION_ALIAS!PREDICT_PROBA(TIME_IN_HOSPITAL, NUM_LAB_PROCEDURES, NUM_PROCEDURES, NUM_MEDICATIONS, NUMBER_OUTPATIENT, NUMBER_EMERGENCY, NUMBER_INPATIENT, NUMBER_DIAGNOSES, CHANGE, DIABETESMED, RACE, GENDER, AGE, MEDICAL_SPECIALTY, DIAG_1, DIAG_2, DIAG_3) AS TMP_RESULT
    FROM (SELECT * FROM RAJIV.PUBLIC.DIAB_TEST ORDER BY RANDOM()))""")
# Keep the DataFrame in `results`; chaining .show() onto the assignment would store None
results.show()
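The inner part of the queries above follows a fixed pattern: alias a model version, then call one of its methods on the feature columns. As a hedged sketch, that pattern can be generated from the feature list instead of being pasted by hand. The helper `build_predict_sql` is hypothetical; it does no identifier quoting or sanitization, so it assumes trusted, unquoted names.

```python
def build_predict_sql(model_fqn, version, method, feature_cols, source_table):
    """Build the model-method call in the same shape as the SQL shown above."""
    args = ", ".join(feature_cols)
    return (
        f"WITH MODEL_VERSION_ALIAS AS MODEL {model_fqn} VERSION {version}\n"
        f"SELECT *, MODEL_VERSION_ALIAS!{method}({args}) AS TMP_RESULT\n"
        f"FROM {source_table}"
    )


sql = build_predict_sql(
    model_fqn="RAJIV.PUBLIC.DIABETES_XGBOOSTER",
    version="V7",
    method="PREDICT_PROBA",
    feature_cols=["TIME_IN_HOSPITAL", "NUM_LAB_PROCEDURES", "RACE"],  # shortened list
    source_table="RAJIV.PUBLIC.DIAB_TEST",
)
print(sql)
```

In the notebook you would pass `numeric_features + categorical_cols` as `feature_cols` and hand the resulting string to `session.sql(...)`. The result is a single `TMP_RESULT` column that still needs to be unpacked, as the outer `CAST` expressions in the full queries do.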

### Save Trained Model to Snowflake Stage

Let's take the model object, save it locally, and also save a copy of the model in a Snowflake stage.

In [None]:
import os
from joblib import dump

## Create a stage to store the model
session.sql('CREATE OR REPLACE STAGE raj_models').show()

# Extract SKLearn object 
sk_model = xg_model.to_sklearn()

#Using current file path, you can modify with another directory
model_filename = 'model.joblib'
model_file = os.path.join(os.getcwd(), model_filename)

dump(sk_model, model_file)

session.file.put(model_file, "@raj_models", overwrite=True)

To see the models in the stage, use the `LIST` command:

In [None]:
session.sql('LIST @raj_models').show()