# Ray Tune Example

#### Examples from: 

* https://docs.ray.io/en/latest/tune/examples/tune-sklearn.html
* https://docs.ray.io/en/latest/tune/index.html


In [1]:
import os
from collections import Counter
import platform
import time
import numpy as np

import ray
from ray import tune

from tune_sklearn import TuneGridSearchCV

from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV


Package pickle5 becomes unnecessary in Python 3.8 and above. Its presence may confuse libraries including Ray. Please uninstall the package.


If we have a Ray cluster to connect to, connect to it; otherwise, rely on Ray to manage the distribution of work locally.

In [2]:
os.environ['RAY_CLUSTER']

'ray-cluster-eerlands-40redhat-2ecom-ray-head'

In [3]:
ray.init('ray://{ray_head}:10001'.format(ray_head=os.environ['RAY_CLUSTER']))

ClientContext(dashboard_url='10.129.4.133:8265', python_version='3.8.12', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', protocol_version='2022-03-16', _num_clients=1, _context_to_restore=<ray.util.client._ClientContext object at 0x7f805af11d90>)
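The cells above index `os.environ['RAY_CLUSTER']` directly, which raises a `KeyError` when the variable is unset, so the local fallback described in the text never happens. Here is a minimal sketch of that fallback, assuming the same environment variable and client port 10001 used above. `ray_address` and `connect` are hypothetical helper names, not part of Ray.

```python
import os


def ray_address(env=None, port=10001):
    """Return a Ray client address, or None when RAY_CLUSTER is not set."""
    env = os.environ if env is None else env
    head = env.get("RAY_CLUSTER")
    return f"ray://{head}:{port}" if head else None


def connect():
    """Connect to the remote cluster if one is configured, else start Ray locally."""
    import ray  # imported lazily so ray_address() works without Ray installed

    address = ray_address()
    # Calling ray.init() without an address starts a local Ray instance.
    return ray.init(address=address) if address else ray.init()
```

Calling `connect()` in place of the `ray.init(...)` cell should behave the same when `RAY_CLUSTER` is set.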

In [4]:
# Create dataset
X, y = make_classification(
    n_samples=11000,
    n_features=1000,
    n_informative=50,
    n_redundant=0,
    n_classes=10,
    class_sep=2.5,
)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=1000)

As a baseline against which to measure any improvement, let's run SGD once with a single set of parameters and see how long it takes.

In [5]:
%%time
clf = SGDClassifier(alpha=1, epsilon=0.1)
clf.fit(x_train, y_train)
first_guess_score = clf.score(x_test, y_test)
first_guess_score

CPU times: user 1.59 s, sys: 7.93 ms, total: 1.6 s
Wall time: 1.59 s


0.869

Note that the objective function below captures our training and testing data in its closure, so the data travels with the function whenever it is serialized and shipped to a worker. Inefficient!

In [6]:
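def objective(config, checkpoint_dir=None):
    # Train and score one SGD classifier for the hyperparameters in `config`.
    # x_train, y_train, x_test and y_test come from the enclosing notebook scope.
    clf = SGDClassifier(alpha=config["alpha"], epsilon=config["epsilon"])
    clf.fit(x_train, y_train)
    score = clf.score(x_test, y_test)
    return {"score": score}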
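One way to avoid carrying the data in the closure is `tune.with_parameters`, which stores large objects in Ray's object store and passes them to the trainable as keyword arguments. The following is an untested sketch under that assumption. `objective_with_data` and `run_search` are hypothetical names, and whether this helps over a Ray client connection has not been measured here.

```python
def objective_with_data(config, x_train=None, y_train=None,
                        x_test=None, y_test=None):
    """Same objective, but the data arrives as arguments, not via the closure."""
    from sklearn.linear_model import SGDClassifier  # lazy import, see note below

    clf = SGDClassifier(alpha=config["alpha"], epsilon=config["epsilon"])
    clf.fit(x_train, y_train)
    return {"score": clf.score(x_test, y_test)}


def run_search(search_space, x_train, y_train, x_test, y_test):
    """Run the search with the datasets bound through tune.with_parameters."""
    from ray import tune  # lazy import, see note below

    trainable = tune.with_parameters(
        objective_with_data,
        x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test,
    )
    return tune.run(trainable, config=search_space)
```

The imports sit inside the functions only so that the definitions load in an environment without Ray or scikit-learn; in this notebook they are already imported at the top.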

Define our new search space for the SGD (Stochastic Gradient Descent) algorithm.

In [7]:
search_space = {
    "alpha": tune.grid_search([1e-4, 1e-1, 1]),
    "epsilon": tune.grid_search([0.01, 0.1]),
}

It looks like it takes about 1.6 seconds to initialize the SGD classifier, train it, and evaluate it. Given that our search space contains 6 different combinations (3 values of alpha times 2 values of epsilon), we should expect a serial execution (slowest) to take about 10 seconds, and a fully parallelized implementation (fastest) to take about 1.6 seconds. Let's see.
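The arithmetic behind that estimate can be sketched in plain Python. This uses the 1.59 s wall time measured for the baseline fit and ignores Ray's startup and scheduling overhead, so it is a lower bound rather than a prediction. `estimated_seconds` is a hypothetical helper.

```python
import itertools
import math

grid = {"alpha": [1e-4, 1e-1, 1], "epsilon": [0.01, 0.1]}

# Every combination of the grid values, one dict per trial.
trials = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

seconds_per_trial = 1.59  # wall time of the single baseline fit above


def estimated_seconds(n_trials, n_parallel, per_trial):
    """Trials run in waves of n_parallel; each wave lasts one trial."""
    return math.ceil(n_trials / n_parallel) * per_trial


print(len(trials))                                            # number of trials
print(estimated_seconds(len(trials), 1, seconds_per_trial))   # fully serial
print(estimated_seconds(len(trials), 6, seconds_per_trial))   # fully parallel
```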

In [8]:
%%time
analysis = tune.run(objective,
                    config=search_space,
                    resources_per_trial={'gpu': 1})

[2m[36m(run pid=1517)[0m 2022-08-16 23:28:32,202	ERROR syncer.py:147 -- Log sync requires rsync to be installed.


[2m[36m(run pid=1517)[0m == Status ==
[2m[36m(run pid=1517)[0m Current time: 2022-08-16 23:28:34 (running for 00:00:08.92)
[2m[36m(run pid=1517)[0m Memory usage on this node: 6.4/30.9 GiB
[2m[36m(run pid=1517)[0m Using FIFO scheduling algorithm.
[2m[36m(run pid=1517)[0m Resources requested: 1.0/4 CPUs, 1.0/1 GPUs, 0.0/4.2 GiB heap, 0.0/1.72 GiB objects
[2m[36m(run pid=1517)[0m Result logdir: /home/ray_results/objective_2022-08-16_23-28-25
[2m[36m(run pid=1517)[0m Number of trials: 6/6 (5 PENDING, 1 RUNNING)
[2m[36m(run pid=1517)[0m +-----------------------+----------+-------------------+---------+-----------+
[2m[36m(run pid=1517)[0m | Trial name            | status   | loc               |   alpha |   epsilon |
[2m[36m(run pid=1517)[0m |-----------------------+----------+-------------------+---------+-----------|
[2m[36m(run pid=1517)[0m | objective_20004_00000 | RUNNING  | 10.129.4.133:1713 |  0.0001 |      0.01 |
[2m[36m(run pid=1517)[0m | objectiv

[2m[36m(run pid=1517)[0m 2022-08-16 23:29:42,908	INFO tune.py:747 -- Total run time: 79.01 seconds (77.14 seconds for the tuning loop).


In [9]:
best = analysis.get_best_config(metric="score", mode="max")
best

{'alpha': 0.1, 'epsilon': 0.01}

In [10]:
%%time
clf = SGDClassifier(**best)
clf.fit(x_train, y_train)
clf.score(x_test, y_test)

CPU times: user 2.81 s, sys: 27.2 ms, total: 2.83 s
Wall time: 2.84 s


0.875

In [11]:
first_guess_score

0.869