In [1]:
import ray

In [None]:
# %pip install ray pandas pyarrow tqdm


In [2]:
def hi():
    import os
    import socket
    return f"Running on {socket.gethostname()} in pid {os.getpid()}"
hi()

'Running on DESKTOP-URMGSA4 in pid 25024'

You can use the ray.remote decorator to create a remote function. Calling remote
functions is a bit different from calling local ones and is done by calling .remote on
the function. Ray will immediately return a future when you call a remote function
instead of blocking for the result. You can use ray.get to get the values returned in
those futures.

In [3]:
@ray.remote
def remote_hi():
    import os
    import socket
    return f"Running on {socket.gethostname()} in pid {os.getpid()}"
future = remote_hi.remote()
ray.get(future)

2023-10-15 21:59:14,696	INFO worker.py:1642 -- Started a local Ray instance.


'Running on DESKTOP-URMGSA4 in pid 28186'

When you run these two examples, you'll see that the first executes in the same
process as the notebook, while Ray schedules the second in a separate worker process.
In the output above, the local call reports pid 25024 and the remote call reports
pid 28186; the host name and process IDs will differ on your machine.

## Sleepy task
An easy (although artificial) way to understand how remote futures can help is to
make an intentionally slow function (in our case, slow_task) and compare how long
Python takes to run it through regular function calls versus Ray remote calls.

In [4]:
import timeit
def slow_task(x):
    import time
    time.sleep(2)  # Do something sciency/business
    return x

@ray.remote
def remote_task(x):
    return slow_task(x)

things = range(20)
# map is lazy, so nothing runs until list() consumes each iterator below
very_slow_result = map(slow_task, things)
slowish_result = map(lambda x: remote_task.remote(x), things)

# Sequential: each call blocks for two seconds
slow_time = timeit.timeit(lambda: list(very_slow_result), number=1)
# Parallel: list() submits every task first, then ray.get waits for all of them
fast_time = timeit.timeit(lambda: list(ray.get(list(slowish_result))), number=1)

print(f"In sequence {slow_time}, in parallel {fast_time}")

In sequence 40.037900334998994, in parallel 5.43759268100257


When you run this code, you'll see that Ray remote functions let your code execute multiple calls at the same time. While you can do this without Ray by using multiprocessing, Ray handles all of the details for you and can also eventually scale up to multiple machines.
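For comparison, the standard-library route mentioned above might look like the following sketch. It uses `concurrent.futures.ThreadPoolExecutor` rather than `multiprocessing`, which is enough here only because the task just sleeps; CPU-bound work would need a process pool instead. The names `short_task` and `run_in_pool`, the 0.2-second sleep, and the pool size are illustrative assumptions, not part of the original notebook.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def short_task(x):
    # Stand-in for slow_task with a shorter sleep (assumed value)
    time.sleep(0.2)
    return x


def run_in_pool(items, workers=10):
    # Threads suffice here because sleeping releases the GIL
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(short_task, items))


start = time.perf_counter()
pool_result = run_in_pool(range(10))
pool_time = time.perf_counter() - start
print(f"Thread pool finished {len(pool_result)} tasks in {pool_time:.2f}s")
```

Unlike Ray, this approach stays on a single machine, and you size and manage the pool yourself.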

## Nested and chained tasks
Ray is notable in the distributed processing world for allowing nested and chained
tasks. Launching more tasks inside other tasks can make certain kinds of recursive
algorithms easier to implement.
One of the more straightforward examples using nested tasks is a web crawler. In the
web crawler, each page we visit can launch multiple additional visits to the links on
that page.

In [5]:
@ray.remote
def crawl(url, depth=0, maxdepth=1, maxlinks=4):
    import requests
    from bs4 import BeautifulSoup
    links = []
    link_futures = []
    try:
        f = requests.get(url)
        links += [(url, f.text)]
        if depth > maxdepth:
            return links  # base case
        soup = BeautifulSoup(f.text, 'html.parser')
        c = 0
        for link in soup.find_all('a'):
            try:
                c = c + 1
                # Launch a nested task for each link on the page
                link_futures += [crawl.remote(
                    link["href"], depth=(depth + 1),
                    maxdepth=maxdepth, maxlinks=maxlinks)]
                # Don't branch too much; we're still in local mode and the web is big
                if c > maxlinks:
                    break
            except KeyError:
                pass  # Skip anchors without an href attribute
        # Block until the nested crawls finish, then merge their results
        for r in ray.get(link_futures):
            links += r
        return links
    except requests.exceptions.InvalidSchema:
        return []  # Skip nonweb links
    except requests.exceptions.MissingSchema:
        return []  # Skip nonweb links
#ray.get(crawl.remote("http://holdenkarau.com/"))

## Data Hello World
Ray has a somewhat limited dataset API for working with structured data.
Apache Arrow powers Ray’s Datasets API. Arrow is a column-oriented, language-independent format with some popular operations. Many popular tools support
Arrow, allowing easy transfer between them (such as Spark, Ray, Dask, and
TensorFlow).
Ray only recently added keyed aggregations on datasets with version 1.9. The most
popular distributed data example is a word count, which requires aggregates. Instead
of using these, we can perform embarrassingly parallel tasks, such as map
transformations, by constructing a dataset of web pages.

In [6]:
urls = ray.data.from_items([
 "https://github.com/scalingpythonml/scalingpythonml",
 "https://github.com/ray-project/ray"])
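def fetch_page(row):
    import requests
    # In Ray 2.5 and later, from_items wraps each plain value in a dict under
    # the "item" key, and map must return a dict as well
    f = requests.get(row["item"])
    return {"url": row["item"], "page": f.text}
pages = urls.map(fetch_page)
# Look at a page to make sure it worked
# pages.take(1)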

If you want a full-featured DataFrame API, you can convert your Ray dataset into
a Dask DataFrame with `to_dask()`.

## Actor Hello World
One of the unique parts of Ray is its emphasis on actors. Actors give you tools to
manage the execution state, which is one of the more challenging parts of scaling
systems. Actors send and receive messages, updating their state in response. These
messages can come from other actors, programs, or your main execution thread with
the Ray client.

For every actor, Ray starts a dedicated process. Each actor has a mailbox of messages
waiting to be processed. When you call an actor, Ray adds a message to the
corresponding mailbox, which allows Ray to serialize message processing, thus avoiding
expensive distributed locks. Actors can return values in response to messages, so
when you send a message to an actor, Ray immediately returns a future so you can
fetch the value when the actor is done processing your message.

Ray actors are created and called similarly to remote functions but use Python
classes, which gives the actor a place to store state. You can see this in action by
modifying the classic “Hello World” example to greet you in sequence.

In [7]:
@ray.remote
class HelloWorld(object):
    def __init__(self):
        self.value = 0
    def greet(self):
        self.value += 1
        return f"Hi user #{self.value}"
# Make an instance of the actor
hello_actor = HelloWorld.remote()
# Call the actor
print(ray.get(hello_actor.greet.remote()))
print(ray.get(hello_actor.greet.remote()))

Hi user #1
Hi user #2


Actors are still more expensive than lock-free remote functions, which can be scaled horizontally. For example,
lots of workers calling the same actor to update model weights will still be slower than embarrassingly
parallel operations.