# Step 1.4 Define UDF and Predict

Now that AutoGluon has built the model, we walk through the steps to deploy this model
as a UDF. The UDF can then be called on an input dataset to predict the label for unseen
data.

In [1]:
from IPython.display import display, HTML, Image , Markdown
from snowflake.snowpark.session import Session
from snowflake.snowpark.types import * 
from snowflake.snowpark.functions import *
import configparser
import json  # needed below to parse the Snowflake connection file

PROJECT_HOME_DIR = '../../..'
CONFIG_FL = f'{PROJECT_HOME_DIR}/config.ini'
LOCAL_TEMP_DIR = f'{PROJECT_HOME_DIR}/temp'

%run ./scripts/notebook_helpers.py

In [2]:
# Initialization
set_cell_background('#EAE3D2')

config = configparser.ConfigParser()
sflk_session = None

print(" Initialize Snowpark session")
with open(CONFIG_FL) as f:
    config.read(CONFIG_FL)
    snow_conn_flpath =  f"{PROJECT_HOME_DIR}/{config['DEFAULT']['connection_fl']}"
    
    # ------------
    # Connect to snowflake
    with open(snow_conn_flpath) as conn_f:
        snow_conn_info = json.load(conn_f)
        sflk_session = Session.builder.configs(snow_conn_info).create()

if sflk_session is None:
    # Raising a bare string is a TypeError in Python 3; raise a real exception instead
    raise RuntimeError(f'Unable to connect to snowflake. Validate connection information in file: {CONFIG_FL} ')

df = sflk_session.sql('select current_warehouse(), current_user(), current_role();').to_pandas()
display(df)

 Initialize Snowpark session


| |CURRENT_WAREHOUSE()|CURRENT_USER()|CURRENT_ROLE()|
|---|---|---|---|
|0|LAB_WH|VSEKAR|DEV_BLOGGER|


---
## Define Prediction UDF

We want to run predictions (inference) natively within Snowflake, using the models trained by AutoGluon.
The steps below therefore define the UDF (predict_occupancy).

In [3]:
# Define prediction udf
set_cell_background('#EAE3D2')

from snowflake.snowpark.functions import pandas_udf

target_db = config['DEFAULT']['db']
target_schema = config['DEFAULT']['sch']
stage = config['DEFAULT']['stage']
stage_lib_dir = config['DEFAULT']['stage_lib_dir']
stage_models_dir = config['DEFAULT']['stage_models_dir']


imports_to_fn = [
    f'''@{target_db}.{target_schema}.{stage}/{stage_lib_dir}autogluon.core-0.5.2-py3-none-any.whl'''
    ,f'''@{target_db}.{target_schema}.{stage}/{stage_lib_dir}autogluon.common-0.5.2-py3-none-any.whl'''
    ,f'''@{target_db}.{target_schema}.{stage}/{stage_lib_dir}autogluon.extra-0.3.1-py3-none-any.whl'''
    ,f'''@{target_db}.{target_schema}.{stage}/{stage_lib_dir}autogluon.features-0.5.2-py3-none-any.whl'''
    ,f'''@{target_db}.{target_schema}.{stage}/{stage_lib_dir}autogluon.tabular-0.5.2-py3-none-any.whl'''
    ,f'''@{target_db}.{target_schema}.{stage}/{stage_lib_dir}autogluon-0.5.2-py3-none-any.whl'''

    ,f'''@{target_db}.{target_schema}.{stage}/{stage_models_dir}{config['DEFAULT']['ag_model_archive']}'''
]

predict_room_occupancy_udf = sflk_session.udf.register_from_file(
    file_path=f'{PROJECT_HOME_DIR}/src/main/python/predict_room_occupancy.py'
    
    ,func_name='predict_occupancy'
    ,return_type=IntegerType() 
    ,input_types=[StringType() ,FloatType() ,FloatType() ,FloatType() ,FloatType() ,FloatType()] #model_touse ,CO2 ,HUMIDITY ,LIGHT ,TEMPERATURE ,PIR

    ,name=f'{target_db}.{target_schema}.predict_occupancy'
    ,is_permanent = True ,replace = True
    ,stage_location=f'@{target_db}.{target_schema}.{stage}/fnlib'
    ,imports = imports_to_fn
    ,packages = ['snowflake-snowpark-python' ,'requests' ,'tqdm' ,'scipy' 
        ,'scikit-learn' ,'boto3' ,'networkx' ,'pandas' ,'numpy']
)
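The import paths in the cell above join `stage_lib_dir` (or `stage_models_dir`) directly to the file name, so the config values appear to need a trailing slash. The following is a minimal sketch of an assumed `config.ini` layout and the resulting stage paths. Only the key names come from the notebook; every value (`DEMO_DB`, `PUBLIC`, `LIB_STG`, `lib/`, `ag_models.zip`, and so on) and the helper `stage_path` are hypothetical placeholders.

```python
import configparser

# Hypothetical config.ini content; only the key names are taken from the notebook.
SAMPLE_CONFIG = """
[DEFAULT]
connection_fl = snowflake_connection.json
db = DEMO_DB
sch = PUBLIC
stage = LIB_STG
stage_lib_dir = lib/
stage_models_dir = models/
ag_model_archive = ag_models.zip
"""

def stage_path(cfg, directory_key, file_name):
    """Build a fully qualified stage path, tolerating a missing trailing slash."""
    d = cfg['DEFAULT']
    directory = d[directory_key].rstrip('/') + '/'
    return f"@{d['db']}.{d['sch']}.{d['stage']}/{directory}{file_name}"

sample_cfg = configparser.ConfigParser()
sample_cfg.read_string(SAMPLE_CONFIG)

wheel_path = stage_path(sample_cfg, 'stage_lib_dir',
                        'autogluon.core-0.5.2-py3-none-any.whl')
model_path = stage_path(sample_cfg, 'stage_models_dir',
                        sample_cfg['DEFAULT']['ag_model_archive'])
print(wheel_path)
print(model_path)
```

Normalizing the slash inside a helper is one possible design choice; the notebook itself relies on the config values already ending in `/`.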

---
## Prediction/Inference

The final step demonstrates inference using the UDF defined above. The SQL below invokes the UDF,
passing in the feature columns. The UDF is implemented as a vectorized UDF, so the columns are combined into a dataframe
and sent to the UDF in batches.

We can ask AutoGluon to use a specific model for the inference. However, we cannot pass the model name as a separate scalar parameter,
because a vectorized UDF that takes a pandas dataframe as input cannot take additional parameters. Instead, we
put the name into a column ('MODEL_TOUSE') and pass that column to the UDF.

__Reference:__
- Doc: [Using Vectorized UDFs via the Python UDF Batch API](https://docs.snowflake.com/en/developer-guide/snowpark/python/creating-udfs.html#using-vectorized-udfs-via-the-python-udf-batch-api)
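
The handler file `predict_room_occupancy.py` is not shown in this notebook, so the following is only a rough sketch of how a batch handler could read the model name from a column. It runs on plain pandas: `ThresholdModel`, `MODELS`, and `predict_occupancy_sketch` are hypothetical stand-ins, and a real handler would load the AutoGluon predictor from the imported model archive instead. It also assumes the batch dataframe arrives with positional column labels (0 to 5) in the order of `input_types`.

```python
import pandas as pd

# Stand-in for a trained model (hypothetical); not an AutoGluon predictor.
class ThresholdModel:
    def __init__(self, pir_threshold):
        self.pir_threshold = pir_threshold

    def predict(self, features: pd.DataFrame) -> pd.Series:
        # Flag a room as occupied when PIR exceeds the threshold
        return (features['PIR'] > self.pir_threshold).astype(int)

# Hypothetical registry keyed by the model name passed in the first column.
MODELS = {
    'KNeighborsUnif': ThresholdModel(1.0),
    'ExtraTreesGini': ThresholdModel(5.0),
}
FEATURES = ['CO2', 'HUMIDITY', 'LIGHT', 'TEMPERATURE', 'PIR']

def predict_occupancy_sketch(batch: pd.DataFrame) -> pd.Series:
    """Batch handler: column 0 carries the model name, columns 1..5 the features."""
    batch = batch.copy()
    batch.columns = ['MODEL_TOUSE'] + FEATURES  # assumed positional input
    model_name = batch['MODEL_TOUSE'].iloc[0]   # same value on every row
    return MODELS[model_name].predict(batch[FEATURES])

# Two made-up rows; the LIGHT values are illustrative only.
demo_batch = pd.DataFrame([
    ['KNeighborsUnif', 680.0, 54.48, 120.0, 24.34, 23.0],
    ['KNeighborsUnif', 410.0, 50.10, 0.0, 23.90, 0.0],
])
predictions = predict_occupancy_sketch(demo_batch)
print(predictions.tolist())
```

Because the model name is repeated on every row, reading it from the first row of the batch is enough; this only holds when a query uses a single model name.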


In [15]:
set_cell_background('#EAE3D2')

display_code(p_title='Queries' ,p_background_color='honeydew'
       ,p_code=f'''
<b>Q: Can I use any of the models trained by AutoGluon?</b>

A: Unfortunately <strong>NO</strong>. Not all models/algorithms can be used. Third-party libraries (ex: autogluon.core-0.5.2-py3-none-any.whl)
can be extracted and imported only as long as they contain no native components/libraries. <i><ins>CatBoost & NeuralNetFastAI</ins></i> are examples of algorithms that 
cannot be used.

<i>CatBoost</i> requires a native library, '_catboost.so', which cannot be loaded. <i>NeuralNetFastAI</i>
requires FastAI, which depends on MatPlotLib. MatPlotLib uses a native library, hence it can't be loaded.

Some algorithms cannot be used currently for other reasons, for ex: <i><ins>NeuralNetTorch</ins></i>. It needs PyTorch 1.12, the version
used by AutoGluon, rather than version 1.10 from the Snowflake Anaconda channel. The PyTorch library is 750MB+ in size, so extracting it
runs out of disk space. Currently the temp folder, where the libraries are extracted locally, is limited to 500MB.

<b>Q: Which models currently work via the UDF?</b>

A:  KNeighborsUnif ,KNeighborsDist ,ExtraTreesGini ,ExtraTreesEntr ,RandomForestGini ,RandomForestEntr

<b>Q: What are the various models that AutoGluon currently supports?</b>

A: Refer to doc <a href='https://auto.gluon.ai/stable/api/autogluon.tabular.models.html'>autogluon.tabular.models </a>

''')
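
The two constraints described in the Q&A above (no native libraries, and an extracted size under the temp folder limit) can be checked before a wheel is uploaded to the stage. The following is a minimal sketch using only the standard library. The helper names `inspect_wheel` and `is_udf_friendly` are hypothetical, the two wheels are tiny fakes built on the fly, and the 500MB figure is simply the limit quoted above.

```python
import os
import tempfile
import zipfile

NATIVE_SUFFIXES = ('.so', '.pyd', '.dylib')
TEMP_LIMIT_BYTES = 500 * 1024 * 1024  # limit quoted in the Q&A above

def inspect_wheel(wheel_path):
    """Return (native_files, uncompressed_bytes); a wheel is a zip archive."""
    with zipfile.ZipFile(wheel_path) as whl:
        infos = whl.infolist()
    native = [i.filename for i in infos if i.filename.endswith(NATIVE_SUFFIXES)]
    size = sum(i.file_size for i in infos)
    return native, size

def is_udf_friendly(wheel_path, limit=TEMP_LIMIT_BYTES):
    """True when the wheel has no native files and fits within the limit."""
    native, size = inspect_wheel(wheel_path)
    return not native and size <= limit

# Build two tiny fake wheels so the sketch runs without downloading anything.
work_dir = tempfile.mkdtemp()
pure_whl = os.path.join(work_dir, 'purepkg-0.1-py3-none-any.whl')
native_whl = os.path.join(work_dir, 'nativepkg-0.1-cp38-cp38-linux_x86_64.whl')
with zipfile.ZipFile(pure_whl, 'w') as z:
    z.writestr('purepkg/__init__.py', 'VALUE = 1\n')
with zipfile.ZipFile(native_whl, 'w') as z:
    z.writestr('nativepkg/__init__.py', '')
    z.writestr('nativepkg/_catboost.so', b'\x00' * 16)

print(is_udf_friendly(pure_whl))    # pure-Python wheel
print(is_udf_friendly(native_whl))  # wheel containing a .so file
```

A check like this only looks at file suffixes and sizes; a wheel that passes it may still pull in a dependency with native code, as the MatPlotLib case above illustrates.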


In [7]:
set_cell_background('#EAE3D2')

df = sflk_session.sql(
        f''' 
        with base as (
                select * from {target_db}.{target_schema}.sensor_measurements_imputed
                limit 10000
        )
        select 
                CO2 ,HUMIDITY ,TEMPERATURE ,PIR
                ,'KNeighborsUnif' as MODEL_TOUSE
                ,{target_db}.{target_schema}.predict_occupancy("MODEL_TOUSE" ,"CO2" ,"HUMIDITY" ,"LIGHT" ,"TEMPERATURE" ,"PIR") as pred_val
                
        from base
        where PIR > 1
        ''')

df.show(max_width=1000)

----------------------------------------------------------------------------
|"CO2"  |"HUMIDITY"  |"TEMPERATURE"  |"PIR"  |"MODEL_TOUSE"   |"PRED_VAL"  |
----------------------------------------------------------------------------
|680.0  |54.48       |24.34          |23.0   |KNeighborsUnif  |1           |
|680.0  |54.48       |24.34          |23.0   |KNeighborsUnif  |1           |
|686.0  |54.48       |24.34          |22.0   |KNeighborsUnif  |1           |
|686.0  |54.48       |24.34          |22.0   |KNeighborsUnif  |1           |
|681.0  |54.48       |24.34          |21.0   |KNeighborsUnif  |1           |
|684.0  |54.48       |24.34          |21.0   |KNeighborsUnif  |1           |
|678.0  |54.48       |24.34          |21.0   |KNeighborsUnif  |1           |
|675.0  |54.48       |24.34          |21.0   |KNeighborsUnif  |1           |
|678.0  |54.48       |24.34          |21.0   |KNeighborsUnif  |1           |
|670.0  |54.48       |24.34          |21.0   |KNeighborsUnif  |1           |

-------------------------

### Close out

With that, we have finished this section of the demo setup.

In [None]:
sflk_session.close()
print('Finished!!!')