# Introduction to Ray
---
(*Suggested Time To Complete: 30 minutes*)

Welcome, we're glad to have you along! This module serves as an interactive introduction to Ray, a flexible distributed computing framework built for Python with data science and machine learning practitioners in mind. Before we jump into the structure of this tutorial, let us first unpack the context of where we are coming from, along with the motivation for learning Ray.

![Main Map](../_static/assets/Introduction_to_Ray/map.png)

*Figure 1*

**Context**

Today's artificial intelligence (AI) applications require enormous amounts of data to be trained on and machine learning (ML) models tend to grow over time. From consumer-facing products like recommendation systems and photo editing software to enterprise-level use cases like reducing downtime in manufacturing and order-fulfillment optimization, ML systems have become so complex and infrastructure intensive that developers have no option but to distribute execution across multiple machines. However, distributed computing is hard. It requires specialized knowledge about orchestrating clusters of computers together to efficiently schedule tasks and must provide features like fault tolerance when a component fails, high availability to minimize service interruption, and autoscaling to reduce waste.

As a data scientist, machine learning practitioner, developer, or engineer, your contribution may center on building data processing pipelines, training complicated models, running efficient hyperparameter experiments, creating simulations of agents, and/or serving your application to users. In each case, you need a distributed system to support the task, but you don't want to learn a different programming language or toss out your existing toolbox. This is where Ray comes in.

**What is Ray?**

Ray is an open source, distributed execution framework that allows you to scale AI and machine learning workloads. Our goal is to keep things simple (enabled by a concise core API) so that you can parallelize Python programs on your laptop, cluster, cloud, or even on-premises with minimal code changes. Ray automatically handles all aspects of distributed execution including orchestration, scheduling, fault tolerance, and autoscaling so that you can scale your apps without becoming a distributed systems expert. With a rich ecosystem of libraries and integrations with many important data science tools, Ray lowers the effort needed to scale compute-intensive workloads.

**Notebook Outline**

This first notebook is part of a series where we will discuss the three major **layers** that comprise Ray, namely its core engine, high-level libraries, and ecosystem of integrations. In this first notebook, we will cover:

- Introduction to Ray
- Part One: Ray Core
    - Ray Core Key Concepts
    - Example: Cross-Validation on Housing Data
        - Sequential Implementation
        - Distributed Implementation with Ray
    - Summary
- Homework
- Next Steps

**Prerequisites**

To gain the most from this notebook, it helps if you have a working knowledge of Python as well as previous experience with machine learning. The ideal learner has minimal familiarity with Ray and is interested in leveraging Ray's simple distributed computing framework to scale AI and Python workloads.

**Learning Goals**

Upon completion of this module, you will have an intuition for:
1. An Overview of Ray
2. Ray Core Key Concepts
3. Ray Core Key API Elements
4. How to Navigate Next Steps to Start Scaling Workloads

# Part One: Ray Core
---

![Ray Core](../_static/assets/Introduction_to_Ray/Ray_Core.png)

*Figure 2*

Ray Core is a low-level, distributed computing framework for Python with a concise core API, and you can think of it as the foundation that Ray's data science libraries (Ray AIR) and third-party integrations (Ray Ecosystem) are built on. This simple and general-purpose Python library enables every developer to easily build scalable, distributed systems that run on your laptop, cluster, cloud or Kubernetes.

A key strength lies in Ray Core's simple primitives: Tasks, Actors, and Objects.

**Tasks:** Ray enables you to designate functions to be executed asynchronously on separate Python workers. These asynchronous Ray functions, called *tasks*, can specify their resource requirements in terms of CPUs, GPUs, and custom resources, which the cluster scheduler uses to distribute tasks for parallelized execution.

**Actors:** What tasks are to functions, actors are to classes. An actor is a stateful worker and methods of the actor are scheduled on that specific worker and can access and mutate the state of that worker. Like tasks, actors support CPU, GPU, and custom resource requirements.

**Objects:** In Ray, tasks and actors create and compute on objects, and we refer to these objects as *remote objects* because they can be stored anywhere in a Ray cluster. We use *object references* to refer to them, and they are cached in Ray's distributed shared memory *object store*.

Ray sets up and manages clusters of computers so that you can run distributed tasks on them. Let's take some time to practice the key concepts with a hands-on example for each.

### 1. Introduction to Remote Functions
---
Let us begin with something that we know, a basic Python function, and use it as context for understanding remote functions. Say we have the following:

```python
import time

# A simple Python function that sleeps for one second, then returns its input
def quick_nap(x):
  time.sleep(1)
  return x
```
To call this function, you might run  `quick_nap(2)` in which case you would wait a second before getting `2` as your output.

Now, taking the same function, but adding a `@ray.remote` decorator to it, we have:

```python
import time
import ray

# A Ray remote function: the decorator turns quick_nap into a task
@ray.remote
def quick_nap(x):
  time.sleep(1)
  return x
```
This is a Ray remote function. To run it, append `.remote()` to the function call, i.e. `quick_nap.remote(2)`. Instead of returning `2` after one second like the previous function, this call immediately returns an `ObjectRef` (an object reference; older Ray releases call it an `ObjectID`, and this notebook uses both names interchangeably). For example:

```
ObjectRef(e0dc174c83599034ffffffffffffffffffffffff0100000001000000)
```
This `ObjectRef` is a *promise* of future work: the task itself runs in the background on a worker. To access the expected output, call `ray.get()` on the `ObjectRef`. Try it out in the coding exercise below!


**Coding Exercise**

In the coding cell below, print both the `ObjectID` and the result when `x = 2` for this remote function.

In [None]:
import ray
import time
import random

if ray.is_initialized():
    ray.shutdown()

ray.init()

def regular_quick_nap(x):
  time.sleep(1)
  return x

@ray.remote
def remote_quick_nap(x):
  time.sleep(1)
  return x

obj_id = remote_quick_nap.remote(2)
print("Object ID = " + str(obj_id))
print("Result = " + str(ray.get(obj_id)))

**Discussion Question**

Now, the obvious question arises: *why* do we take our ordinary Python function and turn it into a remote function? To see this, consider (and experiment below with) what happens when we call each function in a loop:

```python
for i in range(4):
  regular_quick_nap(i)

for i in range(4):
  remote_quick_nap.remote(i)
```

- About how long do you expect the regular `for` loop to run (in seconds)?
- Do you expect the remote `for` loop to run faster or slower relative to the regular loop?

**Serial vs. Parallel Processes**

Python scripts by default execute code in a serial manner. This means that if you perform a calculation inside of a `for` loop, each iteration must finish performing the calculation before the next iteration can start. This means that even if your computer has multiple cores, the script can only take advantage of at most a single core.

By using Ray, we can transform `for` loops into code that takes advantage of all the cores on our machine by giving each core a set of loop iterations to work on. In other words, we can run the iterations of the loop in *parallel*.

In [None]:
%%time
# Serial Implementation (the %%time cell magic must be the first line of the cell)
results = []

for i in range(4):
  results.append(regular_quick_nap(i))

print(results)

**Coding Exercise**

Let's now perform the same computation asynchronously. Notice that we create a list of object IDs, call `remote_quick_nap.remote(i)` inside the loop and append each returned ID to that list, and then call `ray.get()` on the whole list outside the loop. With enough free cores, the four naps overlap and the runtime drops from about four seconds to roughly one.

In [None]:
# Parallel Implementation
start_time = time.time()

obj_ids = []

for i in range(4):
  obj_ids.append(remote_quick_nap.remote(i))
parallel_results = ray.get(obj_ids)

print(parallel_results)

end_time = time.time()
duration = end_time - start_time

print(duration)

### 2. Remote Class as a Stateful Actor

In this example, we want to keep track of who invoked a particular method, which could be a use case for telemetry data. We will use an actor to track invocations of one of its methods: each actor instance records which callers invoked it and how many times.

In [None]:
CALLERS = ["A", "B", "C"]

@ray.remote
class MethodStateCounter:
    def __init__(self):
        self.invokers = {"A": 0, "B": 0, "C": 0}
    
    def invoke(self, name):
        # pretend to do some work here
        time.sleep(0.5)
        # update times invoked
        self.invokers[name] += 1
        # return the state of that invoker
        return self.invokers[name]
        
    def get_invoker_state(self, name):
        # return the state of the named invoker
        return self.invokers[name]
    
    def get_all_invoker_state(self):
        # return the state of all invokers
        return self.invokers

In [None]:
# Create an instance of our Actor 
worker_invoker = MethodStateCounter.remote()
worker_invoker

Iterate and call the `invoke()` method by random callers and keep track of who called it.

In [None]:
for _ in range(10):
    name = random.choice(CALLERS)
    worker_invoker.invoke.remote(name)

Invoke the method as a random caller and fetch that caller's invocation count.

In [None]:
for _ in range(5): 
    random_name_invoker = random.choice(CALLERS)
    times_invoked = ray.get(worker_invoker.invoke.remote(random_name_invoker))
    print(f"Named caller: {random_name_invoker} called {times_invoked}")

Fetch the count of all callers.

In [None]:
print(ray.get(worker_invoker.get_all_invoker_state.remote()))

Note: We didn't have to reason about where and how the actors are scheduled, and we did not have to worry about socket connections or the IP addresses where these actors reside. All of that is abstracted away from us. All we did was write Python code and use Ray Core APIs to convert our class into a distributed stateful service.

### 3. Housing Prices with `sklearn`
***
Now that we've warmed-up with some isolated examples using tasks and actors, let's take a look at how we can apply Ray Core's flexible and simple API to scale a bare bones version of a common ML task: cross-validation.

Here, we have the [California Housing](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) dataset with 20,640 samples and eight numeric features, `[MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude]`; the target is the median house value (`MedHouseVal`). Given that we want to use a linear regression model, we want to assess how well its results generalize to an unseen, independent dataset, say, new housing data coming in this year. To do this, we use cross-validation, a model validation technique that repeatedly resamples the data into training and testing portions and fits the model on each split. After these trials, we average the scores to estimate the model's predictive performance.

However, training the same model multiple times on different subsets of a dataset can take a long time, especially if you're working with a much more complex model and larger dataset. Pictured below in Figure 3, the sequential approach trains each model one after another in a series.

![Sequential Timeline](../_static/assets/Introduction_to_Ray/Sequential_Timeline.png)

*Figure 3*

In this example, we will first implement the sequential approach, then improve it by distributing training with Ray Core, and finally compare the code differences to highlight how minimal the changes are.

#### Import Relevant Packages

In [None]:
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

num_trials = 100
X, y = fetch_california_housing(return_X_y=True, as_frame=True)

#### Train 100 Models Sequentially
Here, we will define a function that randomly splits our housing dataset into training and testing subsets (in the style of Monte Carlo cross-validation: within each round the split is drawn without replacement, but the same sample may land in the test set in more than one round). `sequential()` then fits a model, generates predictions, and returns the R-squared score (1 is a perfect fit, 0 is no better than always predicting the mean, and negative values are worse than that).

Then, we'll train 100 models on these random splits, one after another, and finally print out the average of the rounds.
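
For intuition about the score that `sequential()` returns, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is plain Python with made-up numbers; `r2_by_hand` is an illustrative helper, not part of scikit-learn, whose `metrics.r2_score` is what this notebook actually uses.

```python
def r2_by_hand(y_true, y_pred):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2_by_hand(y_true, y_true)                 # predictions match exactly
baseline = r2_by_hand(y_true, [6.0, 6.0, 6.0, 6.0])  # always predict the mean
worse = r2_by_hand(y_true, [9.0, 7.0, 5.0, 3.0])     # predictions in reverse order

print(perfect, baseline, worse)  # 1.0 0.0 -3.0
```

The third case shows that the score is not bounded below by zero: a model that does worse than predicting the mean gets a negative value.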

In [None]:
%%time

def sequential():
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    model = LinearRegression()

    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    r2 = metrics.r2_score(y_test, predictions)

    return r2

errors_seq = []

for i in range(num_trials):
    errors_seq.append(sequential())

print(sum(errors_seq) / num_trials)


#### From Sequential -> Parallel

We just trained 100 linear regression models in series and averaged their R-squared values in about 1 second. Let's now leverage Ray to train these models in parallel (where multiple tasks may run simultaneously) and compare the runtime. In Figure 4, you can see how the scheduler assigns a task to each available worker (in this timeline, we chose `n = 4` workers). Scheduling is not free: communicating with workers and managing the cluster adds nontrivial overhead, which matters most when individual tasks are short.

![Distributed_Timeline](../_static/assets/Introduction_to_Ray/Distributed_Timeline.png)

*Figure 4*

With just a few code changes, we will modify our existing Python program to distribute it among *n* number of workers. Of course, this is a lightweight example, but it's illustrative of the kind of user experience you get with Ray Core's lean API.

Notice in Figure 5 that we need to use four API calls:

1. `ray.init()` - initialize a Ray context
2. `@ray.remote` - a decorator that turns functions into tasks and classes into actors
3. `.remote()` - postfix to every remote function call, remote class instantiation, or remote class method invocation; function and method calls return an `ObjectRef` (an object reference, called `ObjectID` in older Ray releases) associated with the work to be done, while class instantiation returns an actor handle
4. `ray.get()` - returns an object or list of objects from an object reference or list of object references

You may notice that instead of storing the result of `distributed.remote()` directly in a list of scores, we store it in a list called `obj_ids`. Once you call a Ray remote function, it immediately returns an `ObjectRef`. This reference is a *promise* of future work: the task is delegated to a worker and executes in the background, and to access the expected output you need to call `ray.get()` on the reference.

![Housing Diff](../_static/assets/Introduction_to_Ray/Housing_Diff.png)

*Figure 5*

And with just a few lines of difference, we're able to parallelize training without having to concern ourselves with orchestration, fault tolerance, autoscaling, or anything else that requires specialized knowledge of distributed systems.

#### Train 100 Models in Parallel with Ray
To start, we'll import Ray (check out our [installation instructions](https://docs.ray.io/en/latest/ray-overview/installation.html)) and start a Ray cluster on our local machine that can use all the cores available on the computer as workers. We call `ray.is_initialized()` to check for an already-running Ray instance and shut it down first, so that only one is active.

In [None]:
import ray

if ray.is_initialized():
    ray.shutdown()

ray.init()

As illustrated above in Figure 5, we will:
1. Add the decorator `@ray.remote` to our function `distributed()` to specify that it is a task to be run remotely. 
2. Then we call that function as `distributed.remote()` in the `for` loop to append to our list of object ids. 
3. Finally, we fetch the results outside of the loop to access the final list of scores (so as not to *block* the asynchronous launching of remote tasks) and print out the average R-squared value.

In [None]:
%%time

@ray.remote
def distributed():
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    model = LinearRegression()

    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    r2 = metrics.r2_score(y_test, predictions)

    return r2

obj_ids = []

for i in range(num_trials):
    obj_ids.append(distributed.remote())

errors_dist = ray.get(obj_ids)

print(sum(errors_dist) / num_trials)

And now you've done it! You have distributed the training of 100 models in a very thorough round of cross-validation on our California Housing dataset. Compare the runtime for each method of training 100 models. Is this what you expected?

### Summary
1. Introduction to Ray Core and Most Popular Workloads
2. Key Concepts of Ray Core
3. Sequential -> Distributed Training of 100 Models

#### [Key Concepts](https://docs.ray.io/en/latest/ray-core/key-concepts.html)
- Tasks
- Actors
- Objects

#### [Key API Elements in This Section](https://docs.ray.io/en/latest/ray-core/package-ref.html#python-api)
- `ray.init()`
- `@ray.remote`
- `.remote()`
- `ray.get()`
- `ray.put()`

#### Next
Now that we've covered the core engine, let's go up one layer of abstraction in the next notebook and look at a suite of data science libraries built on top of Ray Core to target specific machine learning workloads!

# Homework
---
If you would like to practice your new skills further with some in-depth examples beyond the embedded coding exercises, take a look at this list of suggested problems:
- [Read About Debugging and Profiling on Ray](https://docs.ray.io/en/latest/ray-core/troubleshooting.html)
    - Dig into how to observe Ray work by visualizing tasks in the Ray timeline, profiling using Python's cProfile, understanding crashes and suboptimal performance, and more in this user guide.
- [Distribute a Classical Algorithm with Ray](https://github.com/ray-project/hackathon5-algo)
    - In this exercise, go to the GitHub repo linked above for details on choosing a classic algorithm implemented in Python, editing the implementation to parallelize the work with Ray, and comparing your results against the sequential implementation.


# Next Steps
---
Congratulations! You have completed your first tutorial on an Introduction to Ray and Ray Core! We introduced the three layers of Ray: Core, AIR, and the Ecosystem. In this notebook, we explored Ray Core's key concepts of tasks, actors, and objects along with key API elements through examples. In the next module, we will talk about Ray AI Runtime, a set of native libraries built on top of Ray Core specialized for machine learning workloads.

From here, you can learn and get more involved with our active community of developers and researchers by checking out the following resources:
- [Ray's "Getting Started" Guides](https://docs.ray.io/en/latest/ray-overview/index.html): A collection of QuickStart Guides for every library including installation walkthrough, examples, blogs, talks, and more!
- [Official Ray Website](https://www.ray.io/): Browse the ecosystem and use this site as a hub to get the information that you need to get going and building with Ray.
- [Join the Community on Slack](https://forms.gle/9TSdDYUgxYs8SA9e8): Find friends to discuss your new learnings in our Slack space.
- [Use the Discussion Board](https://discuss.ray.io/): Ask questions, follow topics, and view announcements on this community forum.
- [Join a Meetup Group](https://www.meetup.com/Bay-Area-Ray-Meetup/): Tune in on meet-ups to listen to compelling talks, get to know other users, and meet the team behind Ray.
- [Open an Issue](https://github.com/ray-project/ray/issues/new/choose): Ray is constantly evolving to improve developer experience. Submit feature requests, bug-reports, and get help via GitHub issues.