# Step 1.3 Train and Stage

We now train the model using the dataset we loaded in the previous steps.

As with any AutoML tool, we leave it to AutoGluon to find the right algorithm,
tune its parameters, and save the best models to a local directory. Once the process
completes, we stage these models in Snowflake for future predictions.

Although this entire operation could have been done using Pandas, I wanted to demonstrate
how it can be done using Snowpark.


In [36]:
from IPython.display import display, HTML, Image , Markdown
from snowflake.snowpark.session import Session
from snowflake.snowpark.types import * 
from snowflake.snowpark.functions import *
import configparser
import json  # used below to parse the Snowflake connection file

PROJECT_HOME_DIR = '../../..'
CONFIG_FL = f'{PROJECT_HOME_DIR}/config.ini'
LOCAL_TEMP_DIR = f'{PROJECT_HOME_DIR}/temp'

%run ./scripts/notebook_helpers.py

In [37]:
# Initialization
set_cell_background('#EAE3D2')

config = configparser.ConfigParser()
sflk_session = None

print(" Initialize Snowpark session")
with open(CONFIG_FL) as f:
    config.read(CONFIG_FL)
    snow_conn_flpath =  f"{PROJECT_HOME_DIR}/{config['DEFAULT']['connection_fl']}"
    
    # ------------
    # Connect to snowflake
    with open(snow_conn_flpath) as conn_f:
        snow_conn_info = json.load(conn_f)
        sflk_session = Session.builder.configs(snow_conn_info).create()

if sflk_session is None:
    raise RuntimeError(f'Unable to connect to Snowflake. Validate connection information in file: {CONFIG_FL}')

df = sflk_session.sql('select current_warehouse(), current_user(), current_role();').to_pandas()
display(df)

 Initialize Snowpark session


Unnamed: 0,CURRENT_WAREHOUSE(),CURRENT_USER(),CURRENT_ROLE()
0,LAB_WH,VSEKAR,DEV_BLOGGER


---
## Train Model


In [38]:
# Perform model training
set_cell_background('#EAE3D2')

import pandas as pd

target_db = config['DEFAULT']['db']
target_schema = config['DEFAULT']['sch']
stage = config['DEFAULT']['stage']
imputed_table = config['DEFAULT']['sensor_measurements_imputed_table']

print(f' Creating table sample to be used for training from table: {imputed_table} ...')
imputed_table_pddf = sflk_session.table(f'{target_db}.{target_schema}.{imputed_table}') 

# Project only columns that are of value
imputed_table_pddf = imputed_table_pddf.select(col('CO2') ,col('HUMIDITY') ,col('LIGHT') ,col('TEMPERATURE') ,col('OCCUPIED'))

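# Sample the data. Despite its name, SAMPLE_PERCENTAGE is a fraction (0.1 = 10%).
# Note: SAMPLE_SEED is defined but not passed to sample(), so the sample is not reproducible.
SAMPLE_SEED = 73
SAMPLE_PERCENTAGE = 0.1
imputed_table_sample_pddf = imputed_table_pddf.sample(SAMPLE_PERCENTAGE)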

# Convert to pandas
imputed_table_sample_pddf = imputed_table_sample_pddf.to_pandas()
print(f' Sample record count: {len(imputed_table_sample_pddf)}')

display(imputed_table_sample_pddf[0:3])

 Creating table sample to be used for training from table: sensor_measurements_imputed ...
 Sample record count: 26046


Unnamed: 0,CO2,HUMIDITY,LIGHT,TEMPERATURE,OCCUPIED
0,566.0,48.12,94.0,24.36,0
1,576.0,48.12,94.0,24.36,0
2,560.0,48.12,94.0,24.36,0


In [39]:
# Train and test data split
set_cell_background('#EAE3D2')

print(' Split the dataset into train and test')

from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(imputed_table_sample_pddf, test_size=0.2)

print(f' Training data size: {len(train_df)}')
print(f' Testing data size: {len(test_df)}')

 Split the dataset into train and test
 Training data size: 20836
 Testing data size: 5210


In [40]:
set_cell_background('#EAE3D2')

from autogluon.tabular import TabularDataset, TabularPredictor

print(" Train ... ")

model_dir_path=f'{LOCAL_TEMP_DIR}/model'

# The Label column that indicates the class to predict
label_col = 'OCCUPIED'

 Train ... 



Model training is driven from this Snowpark client session and executes outside of the secure Snowflake environment.
In other words, I did not run the training in a stored procedure. The reason is:
- __Secured managed space:__ As part of training, AutoGluon tends to download various algorithms and packages
    from the internet (for example PyTorch, fastai, and CatBoost, to name a few). Hence a network connection
    is needed to achieve this.
    
    Python stored procedures run in a secure environment with no network connectivity to the outside world. Hence
    an implementation using a stored procedure is not possible.

In [41]:
set_cell_background('#EAE3D2')

train_ds = TabularDataset(train_df)
predictor = TabularPredictor(label=label_col ,path=model_dir_path ,problem_type='multiclass').fit(train_data=train_ds)

Beginning AutoGluon training ...
AutoGluon will save models to "../../../temp/model/"
AutoGluon Version:  0.5.0
Python Version:     3.8.13
Operating System:   Darwin
Train Data Rows:    20836
Train Data Columns: 4
Label Column: OCCUPIED
Preprocessing data ...
Train Data Class Count: 3
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    1207.44 MB
	Train Data (Original)  Memory Usage: 0.67 MB (0.1% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('float', []) : 4 | ['CO2', 'HUMIDITY', 'LIGHT', 'TEMPERATUR

#### Analyze predictions

In [42]:
set_cell_background('#EAE3D2')

print(" AutoGluon infers problem type is: ", predictor.problem_type)
print(" AutoGluon identified the following types of features:")
print(predictor.feature_metadata)

print(' List of models determined:')
predictor.get_model_names()

# print(' Fit summary')
# results = predictor.fit_summary(show_plot=True)
# results

 AutoGluon infers problem type is:  multiclass
 AutoGluon identified the following types of features:
('float', []) : 4 | ['CO2', 'HUMIDITY', 'LIGHT', 'TEMPERATURE']
 List of models determined:


['KNeighborsUnif',
 'KNeighborsDist',
 'NeuralNetFastAI',
 'RandomForestGini',
 'RandomForestEntr',
 'CatBoost',
 'ExtraTreesGini',
 'ExtraTreesEntr',
 'NeuralNetTorch',
 'WeightedEnsemble_L2']

---
### Local Testing

We test locally, using the test data. We also try to understand which model was chosen as the best.

In [43]:
set_cell_background('#EAE3D2')

test_data = test_df

y_test = test_df[label_col]  # values to predict
test_data_nolab = test_df.drop(columns=[label_col])  # delete label column to prove we're not cheating
test_data_nolab.head()

Unnamed: 0,CO2,HUMIDITY,LIGHT,TEMPERATURE
9219,473.0,60.77,37.0,22.72
10235,487.0,59.28,4.0,23.39
15378,472.0,56.63,19.0,23.3
6964,476.0,57.53,4.0,23.14
4238,464.0,52.05,5.0,23.89


In [44]:
# Run predictions (Testing)
set_cell_background('#EAE3D2')

predictor = TabularPredictor.load(model_dir_path)  # unnecessary, just demonstrates how to load previously-trained predictor from file

y_pred = predictor.predict(test_data_nolab)

# Below for a specific model class predictions
# y_pred = predictor.predict(test_data_nolab, model='NeuralNetFastAI')

print("Predictions:  \n", y_pred)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

Evaluation: accuracy on test data: 0.9761996161228407
Evaluations on test data:
{
    "accuracy": 0.9761996161228407,
    "balanced_accuracy": 0.9228678874650637,
    "mcc": 0.8595610132036822
}


Predictions:  
 9219     0
10235    0
15378    0
6964     0
4238     0
        ..
33       0
5433     0
12221    0
2158     0
25883    0
Name: OCCUPIED, Length: 5210, dtype: int8
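
The gap between accuracy (about 0.976) and balanced accuracy (about 0.923) above suggests that the errors are concentrated in a less frequent class. As a rough illustration of how the three reported metrics relate, here is a minimal pure-Python sketch for the two-class case. The function name `binary_metrics` and the toy labels are assumptions made for illustration; AutoGluon computes its own metrics internally (and this run was trained as multiclass), so this is not its implementation.

```python
from math import sqrt

def binary_metrics(y_true, y_pred):
    """Accuracy, balanced accuracy and MCC for labels in {0, 1} (illustrative helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Recall per class; a class with no samples contributes 0.0 here
    recall_pos = tp / (tp + fn) if (tp + fn) else 0.0
    recall_neg = tn / (tn + fp) if (tn + fp) else 0.0
    balanced_accuracy = (recall_pos + recall_neg) / 2
    # Matthews correlation coefficient; defined as 0.0 when the denominator is zero
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "balanced_accuracy": balanced_accuracy, "mcc": mcc}

# Toy data: 8 unoccupied rows, 2 occupied rows, one occupied row missed
toy_true = [0] * 8 + [1] * 2
toy_pred = [0] * 8 + [1, 0]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```

On this toy data accuracy is 0.9 while balanced accuracy is 0.75, because the single miss costs half of the minority-class recall.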


In [45]:
# Show the model leaderboard
set_cell_background('#EAE3D2')

display_code(p_title='LeaderBoard' ,p_background_color='honeydew'
       ,p_code=f'''
           The leaderboard lists the models that were trained along with their scores. Without
           us configuring anything, AutoGluon does the heavy lifting of training across multiple
           algorithms, determining the best model, and fine-tuning it.
       ''')

predictor.leaderboard(test_data, silent=True)
# predictor.get_model_names()

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,CatBoost,0.9762,0.981766,0.00715,0.003405,6.664951,0.00715,0.003405,6.664951,1,True,6
1,WeightedEnsemble_L2,0.9762,0.981766,0.008943,0.003674,6.837957,0.001793,0.000269,0.173006,2,True,10
2,RandomForestGini,0.974856,0.980806,0.07041,0.054479,0.535308,0.07041,0.054479,0.535308,1,True,4
3,ExtraTreesGini,0.974856,0.980326,0.076176,0.053885,0.435315,0.076176,0.053885,0.435315,1,True,7
4,ExtraTreesEntr,0.974664,0.980326,0.07422,0.054442,0.429055,0.07422,0.054442,0.429055,1,True,8
5,RandomForestEntr,0.97428,0.980806,0.069455,0.053326,0.512568,0.069455,0.053326,0.512568,1,True,5
6,NeuralNetTorch,0.966795,0.976008,0.039586,0.016401,33.392139,0.039586,0.016401,33.392139,1,True,9
7,KNeighborsDist,0.962956,0.963052,0.013381,0.006718,0.010291,0.013381,0.006718,0.010291,1,True,2
8,NeuralNetFastAI,0.960269,0.963532,0.049616,0.018089,15.453886,0.049616,0.018089,15.453886,1,True,3
9,KNeighborsUnif,0.958349,0.961612,0.011861,0.008377,0.011818,0.011861,0.008377,0.011818,1,True,1
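
CatBoost and WeightedEnsemble_L2 tie on `score_test` in the table above, so a tie-breaker such as prediction latency can help when deciding which model to serve. A minimal sketch, using three rows copied from the leaderboard; the `pick_model` helper and the tie-breaking rule are assumptions for illustration, not something AutoGluon applies on its own.

```python
# Rows copied from the leaderboard above: (model, score_test, pred_time_test)
leaderboard_rows = [
    ("WeightedEnsemble_L2", 0.9762, 0.008943),
    ("RandomForestGini", 0.974856, 0.07041),
    ("CatBoost", 0.9762, 0.00715),
]

def pick_model(rows):
    """Return the model with the highest test score, breaking ties by lower latency."""
    best = min(rows, key=lambda row: (-row[1], row[2]))
    return best[0]

chosen = pick_model(leaderboard_rows)
print(chosen)
```

The chosen name could then be passed as the `model` argument of `predictor.predict`, as in the commented-out line of the earlier prediction cell.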


In [46]:
set_cell_background('#EAE3D2')

import pandas as pd

predicted_df  = pd.Series(y_pred, name='predicted_class')
df = test_df.merge(predicted_df,left_index=True, right_index=True)

# Flag rows where the prediction differs from the actual label
df['not_match'] = df['OCCUPIED'] != df['predicted_class']

print(' Sample records of predicted class vs actual value (OCCUPIED)')
display(df[0:3])

 Sample records of predicted class vs actual value (OCCUPIED)


Unnamed: 0,CO2,HUMIDITY,LIGHT,TEMPERATURE,OCCUPIED,predicted_class,not_match
9219,473.0,60.77,37.0,22.72,0,0,False
10235,487.0,59.28,4.0,23.39,0,0,False
15378,472.0,56.63,19.0,23.3,0,0,False


In [47]:
set_cell_background('#EAE3D2')

notmatched_df = df[df.not_match]
notmatched_total = len(notmatched_df)
total_count = len(df)
percentageof_mismatch = (notmatched_total*100)/total_count
print(f'No of records that did not match: {notmatched_total} / {total_count} :: {percentageof_mismatch}%')

display(notmatched_df[0:5])

No of records that did not match: 124 / 5210 :: 2.380038387715931%


Unnamed: 0,CO2,HUMIDITY,LIGHT,TEMPERATURE,OCCUPIED,predicted_class,not_match
14120,483.0,54.76,20.0,22.66,1,0,True
1861,467.0,49.23,126.0,24.75,1,0,True
14067,476.0,54.76,20.0,22.66,1,0,True
13963,460.0,53.16,18.0,22.59,1,0,True
8611,476.0,59.65,18.0,23.36,1,0,True
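
All five mismatches displayed above are occupied rows (`OCCUPIED` = 1) predicted as 0. To check whether that pattern holds across all mismatches, one could count the (actual, predicted) pairs. A minimal sketch on toy data; the `count_mismatches` helper and the toy values are hypothetical.

```python
from collections import Counter

def count_mismatches(actual, predicted):
    """Count (actual, predicted) pairs for rows where the two differ."""
    return Counter((a, p) for a, p in zip(actual, predicted) if a != p)

# Toy labels standing in for df['OCCUPIED'] and df['predicted_class']
toy_actual = [0, 1, 1, 0, 1, 0]
toy_predicted = [0, 0, 1, 1, 0, 0]
mismatch_counts = count_mismatches(toy_actual, toy_predicted)
print(mismatch_counts.most_common())
```

With the notebook's dataframe, the same call would take `df['OCCUPIED']` and `df['predicted_class']` as its two arguments.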


---
## Package and Stage model

Now that training is done, we save the model into an archive and upload it to
the internal stage.

In [48]:
# Package and upload model to stage
set_cell_background('#EAE3D2')

import tarfile
import os.path
from pathlib import Path

# Create a local directory to store data, library etc..
model_lib_dir = f'''{LOCAL_TEMP_DIR}/model_lib'''
Path(model_lib_dir).mkdir(parents=True, exist_ok=True)
    
model_packed_filename=f'''{model_lib_dir}/{config['DEFAULT']['ag_model_archive']}'''
print(f' Packaging the model as: {model_packed_filename}')
with tarfile.open(model_packed_filename, "w:gz") as tar:
    tar.add(model_dir_path, arcname=os.path.basename(model_dir_path))

# Upload the packaged model to stage
stage_models_dir = config['DEFAULT']['stage_models_dir']
upload_locallibraries_to_p_stage(sflk_session ,model_lib_dir ,target_db ,target_schema ,stage ,stage_models_dir)

 Packaging the model as: ../../../temp/model_lib/room_occupancy_autogluon_model.tar.gz
 Uploading library to stage: sflk_autogluon_db.public.lib_data_stg 
    ../../../temp/model_lib/room_occupancy_autogluon_model.tar.gz => @lib_data_stg/ag_models/
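
Before relying on the staged archive, it can be worth confirming that it unpacks to the directory layout expected at load time. A minimal sketch using only the standard library; the temporary directory and the `placeholder.txt` file stand in for the real model directory and are purely illustrative.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as work_dir:
    # Stand-in for the AutoGluon model directory (hypothetical content)
    model_dir = os.path.join(work_dir, "model")
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "placeholder.txt"), "w") as fh:
        fh.write("model artifact placeholder")

    # Package it the same way as the cell above: one top-level folder named after the directory
    archive_path = os.path.join(work_dir, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_dir, arcname=os.path.basename(model_dir))

    # List the archive members to confirm the layout
    with tarfile.open(archive_path, "r:gz") as tar:
        member_names = sorted(tar.getnames())

print(member_names)
```

Because `arcname` is set to the directory's base name, the archive contains a single top-level `model` folder rather than the full local path.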


In [49]:
set_cell_background('#EAE3D2')

print(' List stage directory !!')
sflk_session.sql(f'alter stage {target_db}.{target_schema}.{stage} refresh; ').collect()

df = sflk_session.sql(f'select RELATIVE_PATH from directory(@{target_db}.{target_schema}.{stage}); ').to_pandas()
display(df)

 List stage directory !!


Unnamed: 0,RELATIVE_PATH
0,ag_models/room_occupancy_autogluon_model.tar.gz
1,libs/autogluon-0.5.2-py3-none-any.whl
2,libs/autogluon.common-0.5.2-py3-none-any.whl
3,libs/autogluon.core-0.5.2-py3-none-any.whl
4,libs/autogluon.extra-0.3.1-py3-none-any.whl
5,libs/autogluon.features-0.5.2-py3-none-any.whl
6,libs/autogluon.multimodal-0.5.2-py3-none-any.whl
7,libs/autogluon.tabular-0.5.2-py3-none-any.whl


---
### Close out

With that, we have finished this section of the demo setup.

In [50]:
sflk_session.close()
print('Finished!!!')

Finished!!!
