# An Overview of Ray

![Ray Layers](https://raw.githubusercontent.com/maxpumperla/learning_ray/main/notebooks/images/chapter_01/ray_layers.png)

## A distributed computing framework
At its core, Ray is a distributed computing framework.
We'll provide you with just the basic terminology here and talk about Ray's architecture.
In short, Ray sets up and manages clusters of computers so that you can run distributed tasks on them.
A Ray cluster consists of nodes that are connected to each other via a network.
You program against the so-called _driver_, the program root, which lives on the _head node_.
The driver can run _jobs_, that is, collections of tasks that are run on the nodes in the cluster.
Specifically, the individual tasks of a job are run on _worker_ processes on _worker nodes_.

![Ray cluster](https://raw.githubusercontent.com/maxpumperla/learning_ray/main/notebooks/images/chapter_01/simple_cluster.png)

What's interesting is that a Ray cluster can also be a _local cluster_, i.e. a cluster
consisting just of your own computer.
In this case, there's just one node, namely the head node, which has the driver process and some worker processes.
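
As a rough single-machine analogy (this is not Ray code), the sketch below uses Python's standard `concurrent.futures` module to mimic the roles just described: the main program acts as the driver, a thread pool stands in for the worker processes, and the list of submitted calls plays the part of a job. The function name `square` and the pool size are assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    # One "task": a small unit of work a worker can execute independently.
    return x * x


# The main program plays the role of the driver;
# the pool plays the role of the workers (threads here, processes in Ray).
with ThreadPoolExecutor(max_workers=4) as workers:
    # The "job": a collection of tasks handed to the workers.
    futures = [workers.submit(square, i) for i in range(8)]
    # Collect the results in submission order.
    results = [future.result() for future in futures]

print(results)
```

Ray applies the same submit-and-collect pattern, but across processes and, on a real cluster, across machines.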

With that knowledge at hand, it's time to get your hands dirty and run your first local Ray cluster.
Installing Ray on any of the major operating systems should work seamlessly using `pip`:

```
pip install "ray[rllib, tune, serve]"
```

With a simple `pip install ray` you would have installed just the very basics of Ray.
Since we want to explore some advanced features, we installed the "extras" `rllib`, `tune`, and `serve`,
which we'll discuss in a bit.
Depending on your system configuration you may not need the quotation marks in the above installation command.

Next, go ahead and start a Python session.
You could use the `ipython` interpreter, which I find to be the most suitable environment
for following along with simple examples.
The choice is up to you, but in any case please remember to use Python version `3.7` or later.
In your Python session you can now easily import and initialize Ray as follows:

In [6]:
# tag::init[]

import ray
ray.init()
# end::init[]

2022-10-06 21:51:25,124	INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265


- Python version: 3.9.13
- Ray version: 2.0.0
- Dashboard: http://127.0.0.1:8265


## Data Processing with Ray Data
The first high-level library of Ray we talk about is called "Ray Data".
This library contains a data structure aptly called `Dataset`, a multitude of connectors for loading data from
various formats and systems, an API for transforming such datasets, a way to build data processing pipelines
with them, and many integrations with other data processing frameworks.
The `Dataset` abstraction builds on the powerful [Arrow framework](https://arrow.apache.org/).

To use Ray Data, you need to install Arrow for Python, for instance by running `pip install pyarrow`.
We'll now discuss a simple example that creates a distributed `Dataset` on your local Ray cluster from a Python
data structure. Specifically, you'll create a dataset from a list of `10000` Python dictionaries, each containing
a string `name` and an integer-valued `data`:

In [23]:
# tag::ray_data_load[]
import ray

items = [{"name": str(i), "data": i} for i in range(10000)]
ds = ray.data.from_items(items)   # <1>
ds.show(5)  # <2>
# end::ray_data_load[]

In [7]:
ds

Dataset(num_blocks=200, num_rows=10000, schema={name: string, data: int64})

Great, now you have some distributed rows, but what can you do with that data?
The `Dataset` API bets heavily on functional programming, as it is very well suited for data transformations.
Even though Python 3 made a point of hiding some of its functional programming capabilities, you're probably
familiar with functionality such as `map`, `filter` and others.
If not, it's easy enough to pick up.
`map` takes each element of your dataset and transforms it into something else, in parallel.
`filter` removes data points according to a boolean filter function.
And the slightly more elaborate `flat_map` first maps values similarly to `map`, but then also "flattens" the result.
For instance, if `map` would produce a list of lists, `flat_map` would flatten out the nested lists and give
you just a list.
Equipped with these three functional API calls, let's see how easily you can transform your dataset `ds`:

In [8]:
# tag::ray_data_transform[]
squares = ds.map(lambda x: x["data"] ** 2)  # <1>

evens = squares.filter(lambda x: x % 2 == 0)  # <2>
evens.count()

cubes = evens.flat_map(lambda x: [x, x**3])  # <3> [0, 0], [4, 64], ... flattened into one list
sample = cubes.take(10)  # <4>
print(sample)
# end::ray_data_transform[]

Map: 100%|██████████| 200/200 [00:02<00:00, 97.24it/s] 
Filter: 100%|██████████| 200/200 [00:00<00:00, 884.95it/s]
Flat_Map: 100%|██████████| 200/200 [00:00<00:00, 782.39it/s]

[0, 0, 4, 64, 16, 4096, 36, 46656, 64, 262144]
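
To make the semantics of the three calls concrete, here is a plain-Python sketch that applies the same logic to an ordinary list, without Ray and without any parallelism. It assumes the same `items` as above and is meant only to show what `map`, `filter`, and `flat_map` compute, not how `Dataset` executes them.

```python
from itertools import chain

items = [{"name": str(i), "data": i} for i in range(10000)]

# map: transform every row into exactly one new value.
squares = [row["data"] ** 2 for row in items]

# filter: keep only the values for which the predicate is true.
evens = [x for x in squares if x % 2 == 0]

# flat_map: map every value to a list, then flatten the lists into one sequence.
cubes = list(chain.from_iterable([x, x**3] for x in evens))

print(len(evens))
print(cubes[:10])
```

The first ten values agree with the sample printed by `cubes.take(10)`, and the lengths agree with the row counts that Ray reports for `evens` and `cubes`.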





In [9]:
evens.count()

5000

In [13]:
evens

Dataset(num_blocks=200, num_rows=5000, schema=<class 'int'>)

In [19]:
it = evens.iter_batches()

In [20]:
next(it)

[0,
 4,
 16,
 36,
 64,
 100,
 144,
 196,
 256,
 324,
 400,
 484,
 576,
 676,
 784,
 900,
 1024,
 1156,
 1296,
 1444,
 1600,
 1764,
 1936,
 2116,
 2304,
 2500,
 2704,
 2916,
 3136,
 3364,
 3600,
 3844,
 4096,
 4356,
 4624,
 4900,
 5184,
 5476,
 5776,
 6084,
 6400,
 6724,
 7056,
 7396,
 7744,
 8100,
 8464,
 8836,
 9216,
 9604,
 10000,
 10404,
 10816,
 11236,
 11664,
 12100,
 12544,
 12996,
 13456,
 13924,
 14400,
 14884,
 15376,
 15876,
 16384,
 16900,
 17424,
 17956,
 18496,
 19044,
 19600,
 20164,
 20736,
 21316,
 21904,
 22500,
 23104,
 23716,
 24336,
 24964,
 25600,
 26244,
 26896,
 27556,
 28224,
 28900,
 29584,
 30276,
 30976,
 31684,
 32400,
 33124,
 33856,
 34596,
 35344,
 36100,
 36864,
 37636,
 38416,
 39204,
 40000,
 40804,
 41616,
 42436,
 43264,
 44100,
 44944,
 45796,
 46656,
 47524,
 48400,
 49284,
 50176,
 51076,
 51984,
 52900,
 53824,
 54756,
 55696,
 56644,
 57600,
 58564,
 59536,
 60516,
 61504,
 62500,
 63504,
 64516,
 65536,
 66564,
 67600,
 68644,
 69696,
 70756,
 



In [14]:
for i in evens.iter_rows():
    print(i)

0
4
16
36
64
100
144
196
256
324
400
484
576
676
784
900
1024
1156
1296
1444
1600
1764
1936
2116
2304
2500
2704
2916
3136
3364
3600
3844
4096
4356
4624
4900
5184
5476
5776
6084
6400
6724
7056
7396
7744
8100
8464
8836
9216
9604
10000
10404
10816
11236
11664
12100
12544
12996
13456
13924
14400
14884
15376
15876
16384
16900
17424
17956
18496
19044
19600
20164
20736
21316
21904
22500
23104
23716
24336
24964
25600
26244
26896
27556
28224
28900
29584
30276
30976
31684
32400
33124
33856
34596
35344
36100
36864
37636
38416
39204
40000
40804
41616
42436
43264
44100
44944
45796
46656
47524
48400
49284
50176
51076
51984
52900
53824
54756
55696
56644
57600
58564
59536
60516
61504
62500
63504
64516
65536
66564
67600
68644
69696
70756
71824
72900
73984
75076
76176
77284
78400
79524
80656
81796
82944
84100
85264
86436
87616
88804
90000
91204
92416
93636
94864
96100
97344
98596
99856
101124
102400
103684
104976
106276
107584
108900
110224
111556
112896
114244
115600
116964
118336
119716
121104
122500


In [12]:
evens.get_internal_block_refs()

[ObjectRef(1e5457bc5d10671cffffffffffffffffffffffff0100000001000000),
 ObjectRef(028103e7e38c170dffffffffffffffffffffffff0100000001000000),
 ObjectRef(6c684b42308fa701ffffffffffffffffffffffff0100000001000000),
 ObjectRef(3b59e15142311122ffffffffffffffffffffffff0100000001000000),
 ObjectRef(0c386fece30bcf61ffffffffffffffffffffffff0100000001000000),
 ObjectRef(82c8912bcb250859ffffffffffffffffffffffff0100000001000000),
 ObjectRef(24bf3f460bb41f9bffffffffffffffffffffffff0100000001000000),
 ObjectRef(a9589cda3d189945ffffffffffffffffffffffff0100000001000000),
 ObjectRef(e4a0fd93eeb6baf3ffffffffffffffffffffffff0100000001000000),
 ObjectRef(2118fdb22e54b668ffffffffffffffffffffffff0100000001000000),
 ObjectRef(3017bfc9a25f186bffffffffffffffffffffffff0100000001000000),
 ObjectRef(fdd830281c6b2054ffffffffffffffffffffffff0100000001000000),
 ObjectRef(8711593c5c76fd9bffffffffffffffffffffffff0100000001000000),
 ObjectRef(63f04e8cda0f26e9ffffffffffffffffffffffff0100000001000000),
 ObjectRef(7c418a877

In [9]:
cubes

Dataset(num_blocks=200, num_rows=10000, schema=<class 'int'>)

The drawback of `Dataset` transformations is that each step gets executed synchronously.
In the above example this is a non-issue, but for complex tasks that e.g. mix reading files and processing data,
you want an execution that can overlap the individual tasks.
`DatasetPipeline` does exactly that.
Let's rewrite the last example into a pipeline.

In [11]:
# tag::ray_data_pipeline[]
pipe = ds.window()  # <1>
result = pipe\
    .map(lambda x: x["data"] ** 2)\
    .filter(lambda x: x % 2 == 0)\
    .flat_map(lambda x: [x, x**3])  # <2>
result.show(10)
# end::ray_data_pipeline[]

2022-10-04 18:43:49,704	INFO dataset.py:3276 -- Created DatasetPipeline with 20 windows: 7390b min, 8000b max, 7944b mean
2022-10-04 18:43:49,706	INFO dataset.py:3286 -- Blocks per window: 10 min, 10 max, 10 mean
2022-10-04 18:43:49,721	INFO dataset.py:3325 -- ✔️  This pipeline's windows likely fit in object store memory without spilling.
Stage 1:   5%|▌         | 1/20 [00:00<00:03,  6.06it/s]
Stage 0:  10%|█         | 2/20 [00:00<00:01, 11.56it/s]

0
0
4
64
16
4096
36
46656
64
262144





In [7]:
pipe

DatasetPipeline(num_windows=20, num_stages=1)

In [10]:
result

DatasetPipeline(num_windows=20, num_stages=4)
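
The idea of windows flowing through stages can be roughly illustrated in plain Python with generators. This is only an analogy under simplifying assumptions: the helper names `windows` and `run_stages` are hypothetical, the window size of 500 rows is chosen so that 10000 rows split into 20 windows as in the log above, and the sketch runs lazily on a single thread rather than overlapping stages the way a `DatasetPipeline` does.

```python
from itertools import islice

items = [{"name": str(i), "data": i} for i in range(10000)]


def windows(rows, size):
    # Split the rows into consecutive windows of at most `size` rows.
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def run_stages(window):
    # Apply the map, filter, and flat_map stages to a single window.
    for row in window:
        square = row["data"] ** 2      # map
        if square % 2 == 0:            # filter
            yield square               # flat_map emits two values per input
            yield square ** 3


num_windows = sum(1 for _ in windows(items, 500))

# Windows are processed one after another, and only as far as the consumer asks.
pipeline = (value for window in windows(items, 500) for value in run_stages(window))
first_ten = list(islice(pipeline, 10))

print(num_windows)
print(first_ten)
```

Because the generator is consumed lazily, only the first window is touched to produce these ten values. That is the single-threaded counterpart of not having to finish one stage over the whole dataset before the next stage can begin.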

## Distributed training with Ray Train

Ray RLlib is dedicated to reinforcement learning, but what do you do if you need to train models for
other types of machine learning, like supervised learning?
You can use another Ray library for distributed training in this case, called _Ray Train_.
At this point, we haven't built up enough knowledge of frameworks such as `TensorFlow` to give you a
concrete and informative example for Ray Train.
It also doesn't make sense right now to dive into deep learning or explain what Train is, for that matter.
We'll discuss this in chapter 6.
But we can at least roughly sketch what a distributed training "wrapper" for an ML model would look like.
A schematic procedure for running distributed deep learning with Ray Train looks as follows.

In [12]:
# tag::ray_train_sketch[]
from ray.train import Trainer


def training_function():  # <1>
    pass


trainer = Trainer(backend="torch", num_workers=4)  # <2>
trainer.start()

results = trainer.run(training_function)  # <3>
trainer.shutdown()
# end::ray_train_sketch[]

2022-10-04 18:55:38,854	INFO trainer.py:160 -- GPUs are detected in your Ray cluster, but GPU training is not enabled for Ray Train. To enable GPU training, make sure to set `use_gpu` to True when instantiating your Trainer.
2022-10-04 18:55:38,855	INFO trainer.py:247 -- Trainer logs will be logged in: /home/serendipita/ray_results/train_2022-10-04_18-55-38
2022-10-04 18:55:43,610	INFO trainer.py:253 -- Run results will be logged in: /home/serendipita/ray_results/train_2022-10-04_18-55-38/run_001
[2m[36m(RayTrainWorker pid=15880)[0m 2022-10-04 18:55:43,558	INFO config.py:71 -- Setting up process group for: env:// [rank=0, world_size=4]


## Hyperparameter Tuning with Ray Tune
Naming things is hard, but the Ray team hit the spot with _Ray Tune_, which you can use to tune all
sorts of parameters.
Specifically, it was built to find good hyperparameters for machine learning models.
The typical setup is as follows:

- You want to run an extremely computationally expensive training function. In ML it's not uncommon
  to run training procedures that take days, if not weeks, but let's say you're dealing with just a couple of minutes.
- As a result of training, you compute a so-called objective function. Usually you either want to maximize
  your gains or minimize your losses in terms of performance of your experiment.
- The tricky bit is that your training function might depend on certain parameters,
  hyperparameters, that influence the value of your objective function.
- You may have a hunch what individual hyperparameters should be, but tuning them all can be difficult.
  Even if you can restrict these parameters to a sensible range, it's usually prohibitive to test a wide
  range of combinations. Your training function is simply too expensive.

What can you do to efficiently sample hyperparameters and get "good enough" results on your objective?
The field concerned with solving this problem is called _hyperparameter optimization_ (HPO), and Ray Tune has
an enormous suite of algorithms for tackling it.
Let's look at a first example of Ray Tune used for the situation we just explained.
The focus is yet again on Ray and its API, and not on a specific ML task (which we simply simulate for now).

In [2]:
# tag::ray_tune[]
from ray import tune
import math
import time


def training_function(config):  # <1>
    x, y = config["x"], config["y"]
    time.sleep(10)
    score = objective(x, y)
    tune.report(score=score)  # <2>


def objective(x, y):
    return math.sqrt((x**2 + y**2)/2)  # <3>


result = tune.run(  # <4>
    training_function,
    config={
        "x": tune.grid_search([-1, -.5, 0, .5, 1]),  # <5>
        "y": tune.grid_search([-1, -.5, 0, .5, 1])
    })

print(result.get_best_config(metric="score", mode="min"))
# end::ray_tune[]


| Trial name | status | loc | x | y | iter | total time (s) | score |
|---|---|---|---|---|---|---|---|
| training_function_c047f_00000 | TERMINATED | 192.168.43.126:8425 | -1.0 | -1.0 | 1 | 10.0501 | 1.0 |
| training_function_c047f_00001 | TERMINATED | 192.168.43.126:8465 | -0.5 | -1.0 | 1 | 10.0523 | 0.790569 |
| training_function_c047f_00002 | TERMINATED | 192.168.43.126:8467 | 0.0 | -1.0 | 1 | 10.0564 | 0.707107 |
| training_function_c047f_00003 | TERMINATED | 192.168.43.126:8469 | 0.5 | -1.0 | 1 | 10.0478 | 0.790569 |
| training_function_c047f_00004 | TERMINATED | 192.168.43.126:8471 | 1.0 | -1.0 | 1 | 10.0468 | 1.0 |
| training_function_c047f_00005 | TERMINATED | 192.168.43.126:8473 | -1.0 | -0.5 | 1 | 10.0477 | 0.790569 |
| training_function_c047f_00006 | TERMINATED | 192.168.43.126:8475 | -0.5 | -0.5 | 1 | 10.0502 | 0.5 |
| training_function_c047f_00007 | TERMINATED | 192.168.43.126:8477 | 0.0 | -0.5 | 1 | 10.0482 | 0.353553 |
| training_function_c047f_00008 | TERMINATED | 192.168.43.126:8478 | 0.5 | -0.5 | 1 | 10.056 | 0.5 |
| training_function_c047f_00009 | TERMINATED | 192.168.43.126:8480 | 1.0 | -0.5 | 1 | 10.0526 | 0.790569 |


Result for training_function_c047f_00000:
  date: 2022-10-04_18-03-39
  done: false
  experiment_id: ac4d03a18b5444ad8f642b618f1f53ef
  hostname: serendipita-IdeaPad-L340-15IRH-Gaming
  iterations_since_restore: 1
  node_ip: 192.168.43.126
  pid: 8425
  score: 1.0
  time_since_restore: 10.050139665603638
  time_this_iter_s: 10.050139665603638
  time_total_s: 10.050139665603638
  timestamp: 1664924619
  timesteps_since_restore: 0
  training_iteration: 1
  trial_id: c047f_00000
  warmup_time: 0.005379438400268555
  
Result for training_function_c047f_00000:
  date: 2022-10-04_18-03-39
  done: true
  experiment_id: ac4d03a18b5444ad8f642b618f1f53ef
  experiment_tag: 0_x=-1,y=-1
  hostname: serendipita-IdeaPad-L340-15IRH-Gaming
  iterations_since_restore: 1
  node_ip: 192.168.43.126
  pid: 8425
  score: 1.0
  time_since_restore: 10.050139665603638
  time_this_iter_s: 10.050139665603638
  time_total_s: 10.050139665603638
  timestamp: 1664924619
  timesteps_since_restore: 0
  training_iterati

2022-10-04 18:03:59,951	INFO tune.py:758 -- Total run time: 34.95 seconds (32.87 seconds for the tuning loop).


{'x': 0, 'y': 0}
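
The grid search above is small enough to check by hand. The following sketch uses only the standard library and runs sequentially without Ray Tune. It enumerates the same 25 combinations of `x` and `y`, evaluates the same `objective`, and picks the configuration with the lowest score. The `trials` list is a hypothetical stand-in for Tune's bookkeeping, and the `time.sleep` call is left out since it only simulates an expensive training run.

```python
import itertools
import math


def objective(x, y):
    return math.sqrt((x**2 + y**2) / 2)


grid = [-1, -.5, 0, .5, 1]

# One trial per (x, y) combination: 5 * 5 = 25 trials in total.
trials = [
    {"x": x, "y": y, "score": objective(x, y)}
    for x, y in itertools.product(grid, grid)
]

# mode="min": the best trial is the one with the smallest score.
best = min(trials, key=lambda trial: trial["score"])
best_config = {"x": best["x"], "y": best["y"]}

print(len(trials))
print(best_config)
```

With a ten-second training function, running these 25 trials one after another would take over four minutes. Tune reports a total run time of roughly 35 seconds above because it runs several trials concurrently.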


## Model Serving with Ray Serve

The last of Ray's high-level libraries we'll discuss specializes in model serving and is simply called _Ray Serve_.
To see an example of it in action, you need a trained ML model to serve.
Luckily, nowadays you can find many interesting models on the internet that have already been trained for you.
For instance, _Hugging Face_ has a variety of models available for you to download directly in Python.
The model we'll use is a language model called _GPT-2_ that takes text as input and produces text to
continue or complete the input.
For example, you can prompt it with a question and GPT-2 will try to complete it.

Serving such a model is a good way to make it accessible.
You may not know how to load and run a TensorFlow model on your computer, but you do know how
to ask a question in plain English.
Model serving hides the implementation details of a solution and lets users focus on providing
inputs and understanding outputs of a model.

To proceed, make sure to run `pip install transformers` to install the Hugging Face library
that has the model we want to use.
With that we can now import and start an instance of Ray's `serve` library, load and deploy a GPT-2
model and ask it for the meaning of life, like so:

In [1]:
# tag::ray_serve[]
from ray import serve
from transformers import pipeline
import requests

serve.start()  # <1>

@serve.deployment  # <2>
def model(request):
    language_model = pipeline("text-generation", model="gpt2")  # <3>
    query = request.query_params["query"]
    return language_model(query, max_length=100)  # <4>

model.deploy()  # <5>

query = "What's the meaning of life?"
response = requests.get(f"http://localhost:8000/model?query={query}")  # <6>
print(response.text)
# end::ray_serve[]

2022-10-06 17:04:38,977	INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(ServeController pid=29746) INFO 2022-10-06 17:04:47,900 controller 29746 http_state.py:129 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:PiDWcX:SERVE_PROXY_ACTOR-78eb6f0256b1a08cea03cec7693ff052ce9c33e6f156319135f3a69c' on node '78eb6f0256b1a08cea03cec7693ff052ce9c33e6f156319135f3a69c' listening on '127.0.0.1:8000'
(HTTPProxyActor pid=29791) INFO:     Started server process [29791]
(ServeController pid=29746) INFO 2022-10-06 17:04:51,058 controller 29746 deployment_state.py:1232 - Adding 1 replicas to deployment 'model'.
(ServeReplica:model pid=29832) Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{"generated_text": "What's the meaning of life? Is it a place that we take for granted? Because it's our country or our people? The Bible is all about God's word. His word is for us, so take heed. If you love Him sincerely, He's just going to work on you for the good of your nation. Because He's our savior, so please do too.\n\nAdvertisement\n\nAnd the Bible is not about us. It's about Jesus, the Savior of the world"}]
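
To see what "hiding the implementation behind an HTTP endpoint" amounts to, here is a minimal sketch that uses only Python's standard library instead of Ray Serve. Everything in it is an assumption made for illustration: `fake_model` is a hypothetical stand-in that merely echoes the prompt, `ModelHandler` is a made-up handler name, and the server runs on a port picked by the operating system. It also shows the prompt being URL-encoded with `urlencode` before it is sent.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlencode, urlparse


def fake_model(query):
    # Hypothetical stand-in for a real language model: it only echoes the prompt.
    return [{"generated_text": f"{query} (model output would go here)"}]


class ModelHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Read the "query" parameter from the URL, much like request.query_params.
        params = parse_qs(urlparse(self.path).query)
        query = params.get("query", [""])[0]
        body = json.dumps(fake_model(query)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), ModelHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client only needs to know the URL and the name of the query parameter.
path = "/model?" + urlencode({"query": "What's the meaning of life?"})
connection = HTTPConnection("127.0.0.1", server.server_port)
connection.request("GET", path)
answer = json.loads(connection.getresponse().read())
connection.close()

server.shutdown()
server.server_close()
print(answer[0]["generated_text"])
```

The client code never touches the model itself. Swapping `fake_model` for a real one would not change how the endpoint is called, which is the property a serving library gives you, along with scaling and deployment management.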


[2m[36m(HTTPProxyActor pid=29791)[0m INFO 2022-10-06 17:05:17,429 http_proxy 192.168.43.126 http_proxy.py:315 - GET /model 200 19332.7ms
[2m[36m(ServeReplica:model pid=29832)[0m INFO 2022-10-06 17:05:17,426 model model#xwtbOB replica.py:482 - HANDLE __call__ OK 19320.4ms


In [3]:
import ray
ray.shutdown()