## Data Pipeline Observability (Ray Data)

#### Run a simple Data Pipeline

💻 **Local Environment**: If you are running locally, change the value of **default_cluster_storage** to a path in your environment.

Let's start with an example pipeline that we will use to demonstrate Ray Data's observability tools.

In [None]:
%%writefile simple_pipeline.py
import ray
import time
import pyarrow.fs as fs

default_cluster_storage = "/mnt/cluster_storage/observed_data/"

# Source data:
# s3://anyscale-public-materials/nyc-taxi-cab/yellow_tripdata_2011-05.parquet
s3_fs = fs.S3FileSystem(anonymous=True)
ds = ray.data.read_parquet(
    "s3://anyscale-public-materials/nyc-taxi-cab/yellow_tripdata_2011-05.parquet",
    filesystem=s3_fs
)

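def slow_adjust_total_amount(batch):
    # Sleep on every batch to deliberately slow this step down,
    # so there is time to watch the pipeline in the dashboard.
    time.sleep(10)
    # Subtract the tip from the total to get the adjusted amount.
    batch["adjusted_total_amount"] = batch["total_amount"] - batch["tip_amount"]
    return batch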

ds = ds.map_batches(slow_adjust_total_amount)
ds.write_parquet(default_cluster_storage)

print("Done!")

Let's now execute the pipeline from the terminal.

In [None]:
# copy and run this inside a terminal
# !python simple_pipeline.py

To view the running job, open the Ray Dashboard and click the **Jobs** tab.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/experian/jobs_tab_v2.png" width="800">

Let's click on the job to see its overview.

On the job overview page, let's click on the **Ray Data Overview** section to see our dataset.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/experian/data_overview_collapsed_v2.png" width="800">

The dataset currently has the auto-generated name "dataset_{index}".

Let's expand the dataset to see the operators/stages in the pipeline.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/experian/data_overview_expanded_v2.png" width="800">


For each operator, you can view:
- the number of output blocks
- the state

These are the two key pieces of information that you can use to monitor your pipeline.

#### Ray Data Logs

Next, let's look at the Ray Data logs, which are printed in our terminal and stored on disk.

The logs can help us find the operators that are backpressured, i.e. operators that can't add more tasks to process their inputs.

Here is an explanation of the Ray Data logs:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/experian/progress_bar_annotated.png" width="800">
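
To build intuition for what "backpressured" means, here is a minimal toy simulation in plain Python. It is not Ray Data's scheduler: the function `simulate_backpressure` and its parameters (`max_buffered`, `consumer_every`) are hypothetical names made up for this sketch. An upstream step tries to emit one block per tick into a bounded buffer, while a slow downstream step removes one block every few ticks. Whenever the buffer is full, the upstream step cannot submit more work, which is counted as a backpressured tick.

```python
from collections import deque


def simulate_backpressure(num_blocks, max_buffered, consumer_every):
    """Toy model of backpressure (an illustration, not Ray Data internals).

    The upstream step emits one block per tick unless the buffer is full.
    The downstream step removes one block every `consumer_every` ticks.
    Returns (total_ticks, backpressured_ticks).
    """
    buffer = deque()
    produced = 0
    consumed = 0
    backpressured_ticks = 0
    tick = 0

    while consumed < num_blocks:
        tick += 1

        # Upstream: only submit a new block if the buffer has room.
        if produced < num_blocks:
            if len(buffer) < max_buffered:
                buffer.append(produced)
                produced += 1
            else:
                # Buffer is full: the upstream step has to wait.
                backpressured_ticks += 1

        # Downstream: a slow consumer that only runs every few ticks.
        if tick % consumer_every == 0 and buffer:
            buffer.popleft()
            consumed += 1

    return tick, backpressured_ticks


if __name__ == "__main__":
    slow = simulate_backpressure(num_blocks=4, max_buffered=2, consumer_every=3)
    fast = simulate_backpressure(num_blocks=4, max_buffered=2, consumer_every=1)
    print("slow downstream (ticks, backpressured ticks):", slow)
    print("fast downstream (ticks, backpressured ticks):", fast)
```

With a slow downstream step the upstream step spends some ticks waiting, while with a fast one it never does. In the real pipeline above, the `time.sleep(10)` in `slow_adjust_total_amount` plays the role of the slow downstream step.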


#### Ray Data Metrics

<!-- Similar to overview + logs but timeseries along with more internal metrics. -->

Now let's take a look at the Ray Data Dashboard.

The Ray Data Dashboard provides a more detailed view of the pipeline, including a timeseries view of the dataset operators and metrics.

To open the Ray Data Dashboard, go to the **Metrics** tab of the Ray Dashboard, click the "Open in Grafana" dropdown, and select the Ray Data Dashboard.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/experian/metrics-grafana-access.png">

You should see a dashboard similar to the one below:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/experian/data-dashboard-outlook.png" width="800">

Here we can see throughput metrics for each operator in the pipeline (e.g. the number of blocks, rows, or bytes processed per second).

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/experian/data-dashboard-througput.png" width="800">
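
As a rough illustration of how a per-second throughput can be derived from a cumulative counter, the sketch below computes rates between consecutive samples. This only shows the arithmetic; it is not the query the dashboard actually runs, and the helper `rates_per_second` and the sample values are made up for this example.

```python
def rates_per_second(samples):
    """Compute per-second rates from cumulative counter samples.

    `samples` is a list of (timestamp_seconds, cumulative_count) tuples,
    sorted by timestamp. Returns one rate per pair of consecutive samples.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            # Skip duplicate or out-of-order timestamps.
            continue
        rates.append((c1 - c0) / dt)
    return rates


if __name__ == "__main__":
    # Hypothetical "rows output" counter for one operator, sampled every 10 s.
    rows_output = [(0, 0), (10, 5000), (20, 5000), (30, 20000)]
    print(rates_per_second(rows_output))
```

A rate of zero between two samples means the operator produced no output during that interval, which is worth cross-checking against the operator state and the backpressure information in the logs.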


#### Ray Workloads Data Dashboard

🚀 **Anyscale Platform**: This view is only available on Anyscale.

From the Ray Workloads Data dashboard, you can view the detailed execution status of a Ray Data pipeline (throughput, resources, etc.):

<div style="display: flex; gap: 20px; margin: 20px 0;">
<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/3-Ray-Data/ray_data_1.png" width="80%" loading="lazy">
</div>

<div style="display: flex; gap: 20px; margin: 20px 0;">
<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/3-Ray-Data/ray_data_2.png" width="80%" loading="lazy">
</div>

You can also inspect the details of any operator in the pipeline:
<div style="display: flex; gap: 20px; margin: 20px 0;">
<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-observability/3-Ray-Data/ray_data_3.png" width="80%" loading="lazy">
</div>