# A Guided Tour of Ray Core: Multiprocessing Pool

[*Distributed multiprocessing.Pool*](https://docs.ray.io/en/latest/multiprocessing.html) makes it easy to scale existing Python applications that use [`multiprocessing.Pool`](https://docs.python.org/3/library/multiprocessing.html) from a single node to a cluster. Ray supports the `multiprocessing.Pool` API by running the work on Ray *actors*, each scheduled on a [worker node](https://docs.ray.io/en/latest/ray-core/actors.html#faq-actors-workers-and-resources), instead of on local processes.

<img src="../images/dist_multi_pool.png" width="70%" height="35%">

First, let's import the modules we need. Ray itself is started a little further down.

In [3]:
import multiprocessing as mp
import time
import logging
import ray

## Multiprocessing Pool example

The following is a simple Python function with a slight delay added (to make it behave like a more complex calculation)...

In [4]:
# this could be some complicated and compute intensive task
def func(x):
    time.sleep(1.5)
    return x ** 2

Next, start a Ray cluster so that we can use Ray's drop-in replacement for [multiprocessing pool](https://docs.ray.io/en/latest/multiprocessing.html). Here the cluster is started on Spark with `setup_ray_cluster`:

In [5]:
from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster

setup_ray_cluster(
  num_worker_nodes=2,
  num_cpus_per_node=4,
  collect_log_to_path="/dbfs/path/to/ray_collected_logs"
)

RayContext(dashboard_url='127.0.0.1:8265', python_version='3.8.12', ray_version='1.12.0', ray_commit='f18fc31c7562990955556899090f8e8656b48d2d', address_info={'node_ip_address': '127.0.0.1', 'raylet_ip_address': '127.0.0.1', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-05-20_07-56-36_887152_13113/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-05-20_07-56-36_887152_13113/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2022-05-20_07-56-36_887152_13113', 'metrics_export_port': 63731, 'gcs_address': '127.0.0.1:54809', 'address': '127.0.0.1:54809', 'node_id': '5a401e65ceb43b90f3a9616ac0e574886b2f324a66722aa487d583ba'})

Now we'll create a *Pool* and distribute its tasks across a cluster (or across the available cores on a laptop):

In [6]:
%%time

from ray.util.multiprocessing import Pool

pool = Pool()

for result in pool.map(func, range(10)):
    print(result)

0
1
4
9
16
25
36
49
64
81
CPU times: user 118 ms, sys: 53.1 ms, total: 172 ms
Wall time: 3.66 s


The distributed version trades some added overhead for the ability to scale out horizontally across a cluster. The benefit becomes more pronounced as each call gets more computationally expensive.
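To get a feel for that trade-off without a cluster, here is a minimal timing sketch. It is not Ray code: it uses the standard library's `multiprocessing.pool.ThreadPool` as a stand-in, since it exposes the same `map` interface, and `slow_square` and `timed` are hypothetical helpers with a much shorter delay than `func`. The absolute numbers will differ from the Ray run above.

```python
import time
from multiprocessing.pool import ThreadPool  # stdlib stand-in with the same Pool.map API


def slow_square(x, delay=0.05):
    # Hypothetical, shortened version of func(): sleep, then square
    time.sleep(delay)
    return x ** 2


def timed(fn):
    # Return (result, elapsed seconds) for a zero-argument callable
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


inputs = range(10)

# Serial baseline: one call after another
serial_results, serial_secs = timed(lambda: [slow_square(x) for x in inputs])

# Pooled version: five workers share the ten calls
pool = ThreadPool(5)
try:
    pooled_results, pooled_secs = timed(lambda: pool.map(slow_square, inputs))
finally:
    pool.terminate()

print(f"serial: {serial_secs:.2f} s, pooled: {pooled_secs:.2f} s")
```

Both versions return the same list; only the wall time changes. Swapping the import for `ray.util.multiprocessing.Pool` is the step that would move the work onto a cluster.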

In [7]:
pool.terminate()

Let's define a compute-intensive function that does some matrix computation. Think of it as a stand-in for a compute-intensive task such as a massive tensor transformation.

In [8]:
def task(n):
    # Simulate a long, compute-intensive task
    # TODO: do some matrix computation
    # and return the results
    return
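As one possible direction for the stub above, here is a small pure-Python sketch. The name `matrix_task` and the choice of computation (build an `n x n` matrix, multiply it by itself, return the trace) are assumptions made for illustration; the notebook leaves the actual body of `task()` open.

```python
def matrix_task(n):
    # Hypothetical stand-in for task(): build an n x n matrix with
    # entries a[i][j] = i + j, multiply it by itself, and return the
    # trace of the product
    a = [[i + j for j in range(n)] for i in range(n)]
    product = [
        [sum(a[i][k] * a[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]
    return sum(product[i][i] for i in range(n))


traces = [matrix_task(n) for n in (1, 2, 3)]
print(traces)  # [0, 6, 48]
```

A function like this takes a single argument and returns a plain value, which is the shape `pool.map` expects. The work grows roughly with the cube of `n`, so larger inputs make each call noticeably more expensive.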

Define a Ray remote task that runs `func()` across a pool of actors on the cluster (you can swap in `task()` once you have implemented it). The remote task creates a pool of Ray actors, each scheduled on a cluster worker.


In [9]:
@ray.remote
def launch_long_running_tasks(num_pool):
    # Do the work and collect the results
    # (a real job might also update a database here)
    # Create a pool of num_pool actors
    pool = Pool(num_pool)
    results = []
    # range(1, 50, 10) yields five inputs: 1, 11, 21, 31, 41
    for result in pool.map(func, range(1, 50, 10)):
        results.append(result)
        
    # Done so terminate pool
    pool.terminate()
    
    return results
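The create, map, terminate sequence can be made more robust by terminating the pool in a `finally` block, so the workers are released even if a call fails. The sketch below shows the pattern with the standard library's `ThreadPool` standing in for Ray's `Pool`; `square` and `run_batch` are hypothetical names.

```python
from multiprocessing.pool import ThreadPool  # stdlib stand-in for ray.util.multiprocessing.Pool


def square(x):
    return x ** 2


def run_batch(num_workers):
    pool = ThreadPool(num_workers)
    try:
        # range(1, 50, 10) yields five inputs: 1, 11, 21, 31, 41
        return pool.map(square, range(1, 50, 10))
    finally:
        # Runs even if map() raises, so workers are never leaked
        pool.terminate()


batch = run_batch(5)
print(batch)  # [1, 121, 441, 961, 1681]
```

`map` returns the results in input order, so the list can be returned directly without an explicit accumulation loop.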

### Create an Actor-like supervisor that launches these remote tasks


In [10]:
@ray.remote
class LaunchDistributedTasks:
    def __init__(self, limit=5):
        self._limit = limit

    def launch(self):
        # launch the remote task
        return launch_long_running_tasks.remote(self._limit)

### Launch our supervisor

In [11]:
hdl = LaunchDistributedTasks.remote(5)
print("Launched remote jobs")

Launched remote jobs


### Fetch the results of the remote jobs

In [12]:
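# launch() itself returns an ObjectRef to the remote task, so resolve twice:
# the inner ray.get returns that ObjectRef, the outer ray.get returns the list
values = ray.get(ray.get(hdl.launch.remote()))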
print(f" list of results :{values}")
print(f" Total results: {len(values)}")

 list of results :[1, 121, 441, 961, 1681]
 Total results: 5
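The two nested `ray.get` calls are needed because the supervisor hands back a handle to the work rather than the result itself. The same idea can be sketched with the standard library's `concurrent.futures`, used here only as an analogy: the executor names and the `launch` and `long_running` helpers are hypothetical, and a `Future` plays the role of a Ray `ObjectRef`.

```python
from concurrent.futures import ThreadPoolExecutor

workers = ThreadPoolExecutor(max_workers=2)     # plays the role of the remote task
supervisor = ThreadPoolExecutor(max_workers=1)  # plays the role of the actor


def long_running():
    # Same inputs as the notebook: 1, 11, 21, 31, 41
    return [x ** 2 for x in range(1, 50, 10)]


def launch():
    # Returns a handle to the work, not the work's result
    return workers.submit(long_running)


outer = supervisor.submit(launch)  # handle to the launch() call
inner = outer.result()             # first resolve: another handle
nested_values = inner.result()     # second resolve: the actual list
print(nested_values)  # [1, 121, 441, 961, 1681]

supervisor.shutdown()
workers.shutdown()
```

Resolving only once would leave you holding the inner handle, which is why a single `ray.get` on the actor call is not enough in the cell above.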


Finally, shut down Ray.

In [13]:
shutdown_ray_cluster()

### Exercises
1. Can you turn `task()` into a more complicated function?
2. Use `task()` in `pool.map(task, ....)`

### Homework
1. Write a Python `multiprocessing.Pool` version of `task()` and compare its timings with the Ray distributed `multiprocessing.Pool` version.
2. Do you see a difference in timings?
3. Write a distributed crawler that downloads GIFs or PDFs.