# Snowflake with Ray AIR
By using the Ray Snowflake connector to read data into Ray Datasets and write it back out, you can use all of the capabilities of Ray AIR to build end-to-end machine learning applications.

## Snowflake with Ray AIR and LightGBM
For this example we will show how to train and tune a [distributed LightGBM](https://docs.ray.io/en/master/ray-air/examples/lightgbm_example.html) model with Ray AIR using Snowflake data. We will then show how to score data with the trained model and push the scored data back into another Snowflake table.

### Set up the connector
The first step is to get a dictionary of connection properties.

In [1]:
import os

# Load connection properties from SNOWFLAKE_* environment variables,
# e.g. SNOWFLAKE_USER -> user
PREFIX = 'SNOWFLAKE_'
env_connect_props = {
    key[len(PREFIX):].lower(): value
    for key, value in os.environ.items() if key.startswith(PREFIX)
}

# add sample db and schema to connect props
connect_props = {
    **env_connect_props,
    'database':'SNOWFLAKE_SAMPLE_DATA',
    'schema':'TPCH_SF10',
    'warehouse':'COMPUTE_WH'
}
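
The prefix handling above can be checked without touching the real environment. The following is a minimal sketch, assuming a hypothetical helper named `props_from_env` and made-up variable values; it only illustrates how prefixed variables map to property names and is not part of the connector.

```python
from typing import Mapping


def props_from_env(environ: Mapping[str, str], prefix: str = "SNOWFLAKE_") -> dict:
    """Collect variables that start with `prefix`, strip it, and lowercase the rest."""
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


# A fake environment for illustration; real code would pass os.environ.
fake_env = {
    "SNOWFLAKE_USER": "alice",
    "SNOWFLAKE_ACCOUNT": "xy12345",
    "MY_SNOWFLAKE_NOTES": "ignored",  # contains the prefix, but not at the start
    "PATH": "/usr/bin",
}
props = props_from_env(fake_env)
print(props)
```

Matching on the start of the key, rather than anywhere in it, keeps unrelated variables out of the connection properties.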

### Training and Tuning
A typical training or tuning workload will have the following logic when working with tabular data in Snowflake:

![Ray Train with Snowflake](images/snowflake_train_with_air.png)

#### Step 1: Stage data in Snowflake
When working with databases, it is best to take advantage of the database's native join and aggregation features before ingesting data into Ray Datasets. Ray Datasets is designed to power machine learning workflows and does not provide some typical analytics capabilities, such as large joins. For these reasons, as a first step, the data required for training the model is formulated into a single query that runs within Snowflake before it is read with the Ray Snowflake connector. This query could also be materialized into a staging table if the data needs to be used repeatedly.

The code below defines a query that builds a dataset of customer returns from several Snowflake sample tables. We will use this data throughout the training, tuning, and scoring process. The query is read with the `read_snowflake` method in the next step.

> Note: The dataset size is kept small to keep execution times short. To try a larger dataset, increase `SIZE` and make sure the cluster is large enough.

In [2]:
SRC = 'SNOWFLAKE_SAMPLE_DATA.TPCDS_SF10TCL'
SIZE = 1000
query = f"""
    WITH cstmrs as (
        SELECT 
            c_customer_sk as c_customer_sk, 
            c_current_cdemo_sk as c_current_cdemo_sk
        FROM {SRC}.customer LIMIT {SIZE}),
    sales as (
        SELECT 
            c_customer_sk, 
            COUNT(c_customer_sk) as n_sales 
            FROM cstmrs JOIN {SRC}.store_sales ON c_customer_sk = ss_customer_sk
        GROUP BY c_customer_sk),
                    
    rtrns as (
        SELECT 
            c_customer_sk, 
            COUNT(c_customer_sk) as n_returns 
            FROM cstmrs JOIN {SRC}.store_returns ON c_customer_sk = sr_customer_sk
        GROUP BY c_customer_sk)
                        
    SELECT
        cstmrs.c_customer_sk as customer_sk,
        ZEROIFNULL(n_sales) as n_sales,
        ZEROIFNULL(n_returns) as n_returns,
        IFF(n_sales is null or n_sales = 0 or n_returns is null, 0, n_returns/n_sales) as return_probability,
        demos.* 
    FROM cstmrs 
    JOIN {SRC}.customer_demographics as demos ON cstmrs.c_current_cdemo_sk = demos.cd_demo_sk
    LEFT OUTER JOIN sales on cstmrs.c_customer_sk = sales.c_customer_sk
    LEFT OUTER JOIN rtrns on cstmrs.c_customer_sk = rtrns.c_customer_sk
"""
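
If the staged data is needed repeatedly, one option is to wrap the query in a `CREATE OR REPLACE TABLE ... AS` statement. The following is a minimal sketch: the helper `staging_table_sql` and the table name `RAY_SAMPLE.PUBLIC.CUSTOMER_RETURNS_STAGE` are hypothetical, and the short example query stands in for the full query above. The target must be a database and schema you are allowed to write to.

```python
def staging_table_sql(table: str, select_sql: str) -> str:
    """Wrap a SELECT statement in CREATE OR REPLACE TABLE ... AS."""
    # Drop surrounding whitespace and any trailing semicolon before wrapping.
    body = select_sql.strip().rstrip(";")
    return f"CREATE OR REPLACE TABLE {table} AS\n{body}"


# Stand-in query for illustration only.
example_query = "SELECT c_customer_sk FROM customer LIMIT 10;"
ddl = staging_table_sql("RAY_SAMPLE.PUBLIC.CUSTOMER_RETURNS_STAGE", example_query)
print(ddl)
```

The resulting statement would be executed once in Snowflake, after which training and scoring jobs can read the staging table directly instead of re-running the joins.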

#### Step 2: Read data into Ray
Now that we have a query, we can read it into a Ray dataset.
> Note: Only the first partition is read into the dataset. When training starts, additional partitions are pulled into the dataset as needed.

In [3]:
from ray.data import read_snowflake
ds = read_snowflake(connect_props, query=query)
    
ds.limit(10).to_pandas()



| | CUSTOMER_SK | N_SALES | N_RETURNS | RETURN_PROBABILITY | CD_DEMO_SK | CD_GENDER | CD_MARITAL_STATUS | CD_EDUCATION_STATUS | CD_PURCHASE_ESTIMATE | CD_CREDIT_RATING | CD_DEP_COUNT | CD_DEP_EMPLOYED_COUNT | CD_DEP_COLLEGE_COUNT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12370725 | 393 | 57 | 0.145038 | 644828 | F | W | Advanced Degree | 6000 | Good | 3 | 2 | 2 |
| 1 | 12363711 | 351 | 23 | 0.065527 | 149925 | M | D | Advanced Degree | 1000 | Unknown | 5 | 3 | 0 |
| 2 | 12365584 | 892 | 67 | 0.075112 | 949318 | F | W | 4 yr Degree | 1000 | High Risk | 1 | 3 | 3 |
| 3 | 12366529 | 393 | 41 | 0.104326 | 1822552 | F | M | 2 yr Degree | 8500 | Low Risk | 3 | 4 | 6 |
| 4 | 12368417 | 372 | 32 | 0.086022 | 182245 | M | D | 2 yr Degree | 2000 | High Risk | 4 | 4 | 0 |
| 5 | 12362216 | 362 | 36 | 0.099448 | 1891186 | F | D | Unknown | 8500 | High Risk | 1 | 6 | 6 |
| 6 | 12361970 | 370 | 50 | 0.135135 | 1144947 | M | W | College | 8500 | Low Risk | 1 | 1 | 4 |
| 7 | 12364249 | 378 | 49 | 0.12963 | 847827 | M | W | Advanced Degree | 6000 | Low Risk | 4 | 0 | 3 |
| 8 | 12367535 | 383 | 43 | 0.112272 | 222398 | F | W | Primary | 9000 | High Risk | 4 | 5 | 0 |
| 9 | 12364574 | 369 | 60 | 0.162602 | 422118 | F | W | Secondary | 5500 | Low Risk | 5 | 3 | 1 |
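
The `RETURN_PROBABILITY` column comes from the `IFF(...)` expression in the query. As a rough Python equivalent, assuming SQL `NULL` is represented as `None` and using the hypothetical function name `return_probability`, the logic can be sketched and checked against the first row of the sample output (393 sales, 57 returns):

```python
from typing import Optional


def return_probability(n_sales: Optional[int], n_returns: Optional[int]) -> float:
    """Mirror IFF(n_sales is null or n_sales = 0 or n_returns is null, 0, n_returns/n_sales)."""
    if n_sales is None or n_sales == 0 or n_returns is None:
        return 0.0
    return n_returns / n_sales


first_row = round(return_probability(393, 57), 6)   # 0.145038, as in row 0
no_history = return_probability(None, None)         # customers with no sales get 0.0
print(first_row, no_history)
```

The guard conditions matter because the sales and returns counts come from `LEFT OUTER JOIN`s, so customers without any sales or returns have `NULL` counts.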


#### Step 3: Train
Now that the data is read into a Ray dataset, we can use it to train or tune a LightGBM model.

##### Prepare the data
After reading the data, we need to do some simple manipulations to drop columns and split the data into training and test sets.

In [6]:
DROP_COLUMNS = ['N_SALES', 'N_RETURNS', 'CD_DEMO_SK']

ds = ds.fully_executed().drop_columns(DROP_COLUMNS).repartition(100)
train_dataset, valid_dataset = ds.train_test_split(test_size=0.3)

ValueError: The size in bytes of the block must be known: (ObjectRef(00ffffffffffffffffffffffffffffffffffffff0400000005000000), BlockMetadata(num_rows=1209, size_bytes=None, schema=PandasBlockSchema(names=['CUSTOMER_SK', 'N_SALES', 'N_RETURNS', 'RETURN_PROBABILITY', 'CD_DEMO_SK', 'CD_GENDER', 'CD_MARITAL_STATUS', 'CD_EDUCATION_STATUS', 'CD_PURCHASE_ESTIMATE', 'CD_CREDIT_RATING', 'CD_DEP_COUNT', 'CD_DEP_EMPLOYED_COUNT', 'CD_DEP_COLLEGE_COUNT'], types=[dtype('int64'), dtype('int64'), dtype('int64'), dtype('O'), dtype('int64'), dtype('O'), dtype('O'), dtype('O'), dtype('int64'), dtype('O'), dtype('int64'), dtype('int64'), dtype('int64')]), input_files=[], exec_stats=None))

##### Create preprocessors
In Ray AIR, all trainers, tuners, and predictors accept preprocessors. Preprocessors help featurize data by providing common operations such as one-hot encoding, categorizing, and scaling. For more on the available preprocessors, read the [Ray AIR docs](https://docs.ray.io/en/latest/ray-air/package-ref.html#preprocessor). The code below uses a chain of preprocessors. The `BatchMapper` drops the ID column so it won't be used in training. The `Categorizer` categorizes columns, and the `StandardScaler` scales columns. All of the preprocessing logic only modifies the data as it is passed into the training algorithm; the underlying dataset remains the same.

In [None]:
from ray.data.preprocessors import Chain, BatchMapper, Categorizer, StandardScaler

ID_COLUMN = 'CUSTOMER_SK'
CATEGORICAL_COLUMNS = ['CD_GENDER', 'CD_MARITAL_STATUS', 'CD_EDUCATION_STATUS', 'CD_CREDIT_RATING']
SCALAR_COLUMNS = ['CD_PURCHASE_ESTIMATE', 'CD_DEP_COUNT', 'CD_DEP_EMPLOYED_COUNT', 'CD_DEP_COLLEGE_COUNT']

# Drop the ID column, categorize the categorical columns, and scale the numeric columns.
# Categorizing lets LightGBM use its built-in categorical feature support.
preprocessor = Chain(
    BatchMapper(lambda df: df.drop(ID_COLUMN, axis=1)),
    Categorizer(CATEGORICAL_COLUMNS), 
    StandardScaler(columns=SCALAR_COLUMNS)
)
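
To make the three steps concrete, here is a plain-Python sketch of what the chain does conceptually. It is not Ray's implementation: the rows are made up, the helper names `categorize` and `standard_scale` are hypothetical, integer codes stand in for a categorical dtype, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def categorize(values):
    """Map each distinct value to an integer code (codes follow sorted order)."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


def standard_scale(values):
    """Shift to zero mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up rows using a few of the column names from the dataset.
rows = [
    {"CUSTOMER_SK": 1, "CD_GENDER": "F", "CD_DEP_COUNT": 3},
    {"CUSTOMER_SK": 2, "CD_GENDER": "M", "CD_DEP_COUNT": 5},
    {"CUSTOMER_SK": 3, "CD_GENDER": "F", "CD_DEP_COUNT": 1},
]

# Step 1: drop the ID column (the role of the BatchMapper).
features = [{k: v for k, v in row.items() if k != "CUSTOMER_SK"} for row in rows]
# Step 2: encode the categorical column (the role of the Categorizer).
gender_codes = categorize([r["CD_GENDER"] for r in features])
# Step 3: scale the numeric column (the role of the StandardScaler).
dep_scaled = standard_scale([r["CD_DEP_COUNT"] for r in features])

print(gender_codes, dep_scaled)
```

Note that the scaling step in this sketch would divide by zero for a constant column; a real preprocessor has to handle that case.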

##### Configure scaling
Training requires compute infrastructure, and specifying the right type and amount helps optimize training time and cost. When starting out, it is best to begin with a small dataset and a small amount of compute to get things working, and then scale up data and compute together. Below we create a `ScalingConfig` that provides 10 workers for distributed training. This will likely keep training on a single instance. We also don't request GPUs.

In [None]:
from ray.air.config import ScalingConfig

scaling_config = ScalingConfig(num_workers=10, use_gpu=False)

##### Create a trainer
Now that we have everything required, we can create a trainer. In Ray AIR, the logic to create and fit a trainer is very similar across algorithms. The main differences are in the parameters passed to the algorithm. This makes it easy to swap out algorithms: for example, swapping LightGBM for XGBoost, or even PyTorch tabular, will typically take just a few lines of code.

In [None]:
from ray.train.lightgbm import LightGBMTrainer

TARGET_COLUMN = 'RETURN_PROBABILITY'

# LightGBM specific params
params = {
    "objective": "regression",
    "metric": ["rmse", "mae"],
}

trainer = LightGBMTrainer(
    scaling_config=scaling_config,
    label_column=TARGET_COLUMN,
    params=params,
    datasets={"train": train_dataset, "valid": valid_dataset},
    preprocessor=preprocessor,
    num_boost_round=10
)
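
The `metric` entry in `params` asks LightGBM to report root mean squared error (RMSE) and mean absolute error (MAE). As a small worked example of what these two numbers measure, the sketch below computes both by hand; the target and prediction values are made up for illustration and are not output from this model.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: the average size of the error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up return probabilities and predictions.
actual = [0.10, 0.20, 0.30]
predicted = [0.10, 0.25, 0.20]

print(f"rmse={rmse(actual, predicted):.4f} mae={mae(actual, predicted):.4f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same data, and the gap between the two grows when a few predictions are far off.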

##### Fit the model
Now that the trainer is defined, all that is required is to call `fit` to begin the training process. The `fit` method returns a result object that contains the model checkpoint as well as the training metrics.

In [None]:
result = trainer.fit()

### Score a model
Once there is a trained model, we can use it to score data. The flows for training and scoring are similar in that data is staged in Snowflake and read into a Ray dataset with the connector. Once the data is read in, the previously created model checkpoint can be used to create a batch predictor for scoring. Scored data can then be written back into Snowflake with the connector.

The typical logical flow for batch scoring in Snowflake with Ray AIR is the following:

![Snowflake batch scoring](images/snowflake_score_with_air.png)

#### Steps 1-2: Stage and read data
Since the data has already been staged and loaded, we don't need any extra code to do that now. Typically, you will have one script for training and another for scoring, and they will run independently. The staging and loading of data should be separated into a shared script that both workflows can use.

#### Step 3: Score the data
The previously trained checkpoint can be used to create a predictor. This predictor already contains the preprocessors used to train the model. All that is needed is to drop the target column before feeding the data into the model, to simulate a real dataset where we don't know the results.

> Note: Typically, model checkpoints are stored in a model registry such as Weights & Biases or MLflow, or in an object store like S3. Checkpoints are written and read using the [checkpoint API](https://docs.ray.io/en/latest/ray-air/package-ref.html#ray.air.checkpoint.Checkpoint).

In [None]:
from ray.train.batch_predictor import BatchPredictor
from ray.train.lightgbm import LightGBMPredictor

predictor = BatchPredictor.from_checkpoint(
    result.checkpoint, LightGBMPredictor
)

test_dataset = valid_dataset.drop_columns([TARGET_COLUMN])
predictions = predictor.predict(test_dataset, keep_columns=[ID_COLUMN])
predictions.limit(10).to_pandas()

#### Step 4: Write data to Snowflake
Now that we have the predictions, we can write them into a Snowflake table. First, we need to create a destination database.

In [None]:
from ray.data.datasource import SnowflakeConnector

# get new connect properties for the new database
write_connect_props = {
    **connect_props, 
    'database':'RAY_SAMPLE', 
    'schema':'PUBLIC'
}

# create destination database
with SnowflakeConnector(**write_connect_props) as con:
    con.query('CREATE DATABASE IF NOT EXISTS RAY_SAMPLE')


# write the predictions
predictions.write_snowflake(
    write_connect_props, 
    table='PREDICTIONS',
    autocreate=True
)

# read the predictions back
read_snowflake(
    write_connect_props, 
    table='PREDICTIONS'
).limit(3).to_pandas()
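
The `write_connect_props` dictionary in the cell above relies on a property of Python dict literals: when a key appears more than once, the last value wins. The sketch below shows this with made-up property values, so the override of `database` and `schema` can be verified without a Snowflake connection.

```python
# Made-up base properties for illustration.
base_props = {
    "user": "alice",
    "database": "SNOWFLAKE_SAMPLE_DATA",
    "schema": "TPCH_SF10",
}

# Later keys override the unpacked ones; base_props itself is left unchanged.
write_props = {**base_props, "database": "RAY_SAMPLE", "schema": "PUBLIC"}

print(write_props)
```

Building a new dictionary rather than mutating the original keeps the read and write connections independent, which is useful when one script reads from the sample data and writes to a different database.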