### Ray AI Runtime (AIR)
For this tutorial we will rely on `ray[air]`, which can be installed using `pip install -U "ray[air]"`.

The first step in the training tutorial is to start a Ray cluster.

In [11]:
import ray
# check whether a cluster is already running and shut it down
if ray.is_initialized():
    ray.shutdown()
# start a new cluster
ray.init()

2023-05-26 15:30:19,885	INFO worker.py:1616 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8266


Python version: 3.9.6
Ray version: 2.4.0
Dashboard: http://127.0.0.1:8266


In [2]:
import time
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

### California Housing dataset:
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing

--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

    The target variable is the median house value for California districts,
    expressed in hundreds of thousands of dollars ($100,000).

In [3]:
from sklearn.datasets import fetch_california_housing
# return covariates and target variables
X_att, Y_target = fetch_california_housing(return_X_y=True, as_frame=True)
# split the dataset
X_train, X_test, Y_train, Y_test = train_test_split(X_att, Y_target, test_size=0.3, random_state=64)
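
The sizes of the two splits can be worked out by hand from the 20640 instances and `test_size=0.3`. The sketch below assumes that `train_test_split` rounds the test count up and assigns the remaining rows to the training set when only `test_size` is given:

```python
import math

n_samples = 20640  # number of instances in the California Housing dataset
test_size = 0.3    # fraction passed to train_test_split above

# Assumption: the test count is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("train rows:", n_train)
print("test rows:", n_test)
```

Comparing these numbers with `X_train.shape` and `X_test.shape` is a quick way to confirm the split behaved as expected.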

Define a sequential training function to serve as a baseline for comparison.

In [4]:
def train_model_seq(train_att, test_att, train_target, test_target, n_estimators):
    # test_att and test_target are accepted but unused: only the fit is timed
    start_time = time.time()
    en_model = RandomForestRegressor(n_estimators=n_estimators, random_state=64)
    en_model.fit(train_att, train_target)
    end_time = time.time()
    delta_time = end_time - start_time
    print('Training time: {} seconds'.format(delta_time))

    return delta_time

In [5]:
%%time
for n in range(0, 201, 10):
    delta_time = train_model_seq(X_train, X_test, Y_train, Y_test, 1+n)

Training time: 0.07712388038635254 seconds
Training time: 0.6658492088317871 seconds
Training time: 1.409588098526001 seconds
Training time: 2.013491153717041 seconds
Training time: 2.5449321269989014 seconds
Training time: 3.1334309577941895 seconds
Training time: 3.750880002975464 seconds
Training time: 4.3573079109191895 seconds
Training time: 5.020026922225952 seconds
Training time: 5.497156858444214 seconds
Training time: 6.2171571254730225 seconds
Training time: 6.885180234909058 seconds
Training time: 7.553805112838745 seconds
Training time: 8.064624786376953 seconds
Training time: 8.695956945419312 seconds
Training time: 9.073594808578491 seconds
Training time: 10.076745986938477 seconds
Training time: 10.38935923576355 seconds
Training time: 11.079370021820068 seconds
Training time: 11.526580095291138 seconds
Training time: 12.132888793945312 seconds
CPU times: user 2min 9s, sys: 3.29 s, total: 2min 13s
Wall time: 2min 10s
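
As a sanity check, the per-run times printed above should add up to roughly the reported wall time, because the runs execute one after another. The sketch below uses the printed values rounded to three decimals; the per-tree figure is only a rough estimate taken from the largest run:

```python
# Per-run training times from the output above, rounded to three decimals (seconds).
seq_times = [
    0.077, 0.666, 1.410, 2.013, 2.545, 3.133, 3.751,
    4.357, 5.020, 5.497, 6.217, 6.885, 7.554, 8.065,
    8.696, 9.074, 10.077, 10.389, 11.079, 11.527, 12.133,
]

# Number of trees used in each run: 1, 11, 21, ..., 201.
n_trees = [1 + n for n in range(0, 201, 10)]

total = sum(seq_times)                   # time spent fitting across all runs
per_tree = seq_times[-1] / n_trees[-1]   # rough cost per tree in the largest run

print(f"total fit time: {total:.1f} s")
print(f"per tree at {n_trees[-1]} trees: {per_tree:.3f} s")
```

The total comes to about 130 s, which agrees with the 2min 10s wall time, and the cost grows roughly linearly with the number of trees.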


Turn the function into a Ray remote task with the `@ray.remote` decorator, so that calls to it run on the cluster:

In [14]:
@ray.remote
def train_model_par(train_att, test_att, train_target, test_target, n_estimators):
    # same body as train_model_seq; test_att and test_target are unused
    start_time = time.time()
    en_model = RandomForestRegressor(n_estimators=n_estimators, random_state=64)
    en_model.fit(train_att, train_target)
    end_time = time.time()
    delta_time = end_time - start_time
    print('Training time: {} seconds'.format(delta_time))

    return delta_time

Put the data in the cluster's object store using `ray.put()`, which returns an object reference for each variable:

In [13]:
X_train_ray = ray.put(X_train)
X_test_ray = ray.put(X_test)
Y_train_ray = ray.put(Y_train)
Y_test_ray = ray.put(Y_test)

The function is now invoked as `my_function.remote()`, which returns an object reference immediately; the actual output is retrieved with `ray.get()`.

In [15]:
%%time
timer = []
for n in range(0, 201, 10):
    delta_time = train_model_par.remote(X_train_ray, X_test_ray, Y_train_ray, Y_test_ray, 1+n)
    timer.append(delta_time)
ray.get(timer)

[2m[36m(train_model_par pid=94671)[0m Training time: 0.08847880363464355 seconds
[2m[36m(train_model_par pid=94668)[0m Training time: 5.974763870239258 seconds[32m [repeated 6x across cluster][0m
[2m[36m(train_model_par pid=94666)[0m Training time: 10.023831129074097 seconds[32m [repeated 4x across cluster][0m
[2m[36m(train_model_par pid=94665)[0m Training time: 12.563910007476807 seconds[32m [repeated 3x across cluster][0m
[2m[36m(train_model_par pid=94671)[0m Training time: 13.83492398262024 seconds[32m [repeated 3x across cluster][0m
[2m[36m(train_model_par pid=94664)[0m Training time: 14.029475927352905 seconds[32m [repeated 3x across cluster][0m
CPU times: user 328 ms, sys: 826 ms, total: 1.15 s
Wall time: 31.4 s


[0.08847880363464355,
 1.048469066619873,
 1.97599196434021,
 3.1143927574157715,
 4.0829551219940186,
 4.971190929412842,
 5.974763870239258,
 7.563481092453003,
 8.508296966552734,
 9.423293113708496,
 10.023831129074097,
 10.981528997421265,
 11.80382490158081,
 12.563910007476807,
 13.429285049438477,
 13.544419050216675,
 13.83492398262024,
 14.053867101669312,
 14.233737230300903,
 14.029475927352905,
 14.16966986656189]
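
The two runs can be compared with some simple arithmetic. The sketch below uses the wall times reported by `%%time` (2min 10s sequential, 31.4 s parallel) and the returned task times rounded to three decimals. The explanation for the slower individual tasks is an assumption, not something measured here:

```python
# Task times returned by ray.get(), rounded to three decimals (seconds).
par_times = [
    0.088, 1.048, 1.976, 3.114, 4.083, 4.971, 5.975,
    7.563, 8.508, 9.423, 10.024, 10.982, 11.804, 12.564,
    13.429, 13.544, 13.835, 14.054, 14.234, 14.029, 14.170,
]

seq_wall = 130.0    # sequential wall time: 2min 10s
par_wall = 31.4     # parallel wall time
seq_201 = 12.133    # sequential fit time for 201 trees

speedup = seq_wall / par_wall                # overall gain from running on Ray
avg_concurrency = sum(par_times) / par_wall  # average number of tasks running at once
slowdown_201 = par_times[-1] / seq_201       # how much slower the largest fit became

print(f"speedup: {speedup:.2f}x")
print(f"average concurrency: {avg_concurrency:.1f}")
print(f"201-tree fit slowdown: {slowdown_201:.2f}x")
```

The overall speedup is about 4x, while roughly six tasks were running at any moment. The gap between the two numbers comes from each fit taking longer when it shares the machine with others, plausibly because the workers compete for the same CPU cores.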

After finishing the calculation, shut down the cluster using:

In [17]:
ray.shutdown()