# Introduction to Ray Data: Industry Landscape
© 2025, Anyscale. All Rights Reserved





This document surveys the industry landscape that Ray Data operates in.
<b>Here is the roadmap:</b>
<ul> 
    <li> The Data Layer </li>
    <li> The Compute Layer </li>
    <li> The Orchestration Layer </li>
    <li> Commercial Distributed Computing Platforms </li>
    <li> Distributed Computing execution models </li>
</ul>

# The data layer

This section provides an overview of the data layer, including its components and commonly used tools. Data storage patterns have evolved alongside changing business needs and advances in technology. Below is a breakdown of these patterns and their applications.


## Databases

- **Purpose**: Built for handling transactional data.
- **Optimization**: Designed for online transaction processing (OLTP).
- **Key characteristics**:
  - Handles frequent, small-scale, atomic transactions.
  - Ensures data consistency and integrity using **ACID properties**: atomicity, consistency, isolation, and durability.
- **Examples**: MySQL, PostgreSQL, MongoDB.
- **Specialized Databases**:
  - **Vector Databases**: Databases that are optimized for storing and querying vector data.
    - **Examples**: Pinecone, Zilliz, ChromaDB, Weaviate.
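
The atomicity and consistency guarantees described above can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper, and the balances are hypothetical and chosen only for illustration; SQLite stands in here for a production OLTP database.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the debit; the credit was rolled back too.
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )

ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the second call, the credit to `bob` is applied first and the debit from `alice` then violates the `CHECK` constraint. The rollback undoes the partial credit, so the balances stay at `alice=70` and `bob=80`.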

## Data warehouses

- **Purpose**: Designed for analyzing large datasets.
- **Optimization**: Suited for online analytical processing (OLAP).
- **Key characteristics**:
  - Stores vast amounts of structured and historical data optimized for analytics.
  - Uses indexing and sorting to accelerate query performance.
  - Often employs proprietary storage formats (e.g., Snowflake’s columnar format) for efficiency.
  - Supports SQL-based querying and analytical functions for business intelligence.
  - Relies on ETL (extract, transform, load) or ELT (extract, load, transform) pipelines to preprocess and structure data.
- **Examples**: Amazon Redshift, Google BigQuery, Snowflake.
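
As a rough illustration of the ETL pattern mentioned above, the sketch below extracts rows from a raw CSV string, drops malformed records, loads the result into a table, and runs an analytical aggregate. The `orders` schema and sample data are hypothetical, and SQLite is only a small stand-in for a real warehouse engine.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the last row has a malformed amount.
RAW_CSV = """order_id,region,amount_usd
1,us-west,120.50
2,us-east,80.00
3,us-west,42.25
4,eu-central,not_a_number
"""

def extract(text):
    """Extract: parse raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cast types and drop rows that fail validation."""
    clean = []
    for row in rows:
        try:
            clean.append(
                (int(row["order_id"]), row["region"], float(row["amount_usd"]))
            )
        except ValueError:
            continue  # skip malformed records
    return clean

def load(conn, rows):
    """Load: write the structured rows into the analytical table."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, region TEXT, amount_usd REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# OLAP-style query: aggregate over the whole table rather than touch one row.
revenue_by_region = conn.execute(
    "SELECT region, SUM(amount_usd) FROM orders GROUP BY region ORDER BY region"
).fetchall()
```

An ELT pipeline would reorder the steps: load the raw rows first, then run the transformation inside the warehouse, typically as SQL.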


## Data lakes

- **Purpose**: Store large volumes of raw, semi-structured, or unstructured data.
- **Key characteristics**:
  - Maintains raw data in its native format without requiring prior transformation.
  - Supports diverse data types (e.g., text, images, videos, logs).
  - Ideal for machine learning, big data analytics, and scenarios requiring data exploration.
  - Lacks built-in mechanisms for enforcing data quality, transactional consistency, or indexing.


### Structure of a data lake

Data lakes are organized into layers that enable efficient storage and processing:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/cko-2025-q1/storage-layer.png" width="500">

1. **Storage layer**: The foundation of the data lake, implemented using scalable systems like AWS S3, Cloudflare R2, or distributed file systems such as Ceph and Lustre.
2. **File formats**: Stores data in multiple formats, including Parquet, CSV, JSON, JPEG, and Protocol Buffers, to support varied workloads.
3. **Table format layer**: Adds features like schema management, ACID transactions, and indexing to provide database-like capabilities within the data lake. Examples of table formats include:
   - **Apache Iceberg**: Focuses on versioning, schema evolution, and large-scale analytics.
   - **Delta Lake**: Built with ACID transactions and data quality controls for analytics and machine learning.
   - **Apache Hudi**: Offers incremental processing capabilities and efficient updates for data lakes.
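
To make the role of the table format layer more concrete, here is a toy sketch of the general idea: immutable data files plus a snapshot log with an atomically updated pointer. `ToyTable` and its on-disk layout are hypothetical and far simpler than Iceberg, Delta Lake, or Hudi; the sketch only shows how versioned metadata can give atomic commits and reads of older versions on top of plain files.

```python
import json
import os
import tempfile

class ToyTable:
    """Hypothetical minimal table format: data files plus versioned snapshots."""

    def __init__(self, root):
        self.root = root
        os.makedirs(os.path.join(root, "data"), exist_ok=True)
        os.makedirs(os.path.join(root, "metadata"), exist_ok=True)
        self.pointer = os.path.join(root, "metadata", "CURRENT")

    def _snapshot_path(self, version):
        return os.path.join(self.root, "metadata", f"v{version}.json")

    def current_version(self):
        if not os.path.exists(self.pointer):
            return 0  # empty table
        with open(self.pointer) as f:
            return int(f.read())

    def files(self, version=None):
        """List the data files that belong to a given snapshot."""
        version = self.current_version() if version is None else version
        if version == 0:
            return []
        with open(self._snapshot_path(version)) as f:
            return json.load(f)["files"]

    def append(self, rows):
        """Write a new data file, then commit a new snapshot that includes it."""
        version = self.current_version() + 1
        data_file = os.path.join("data", f"part-{version}.jsonl")
        with open(os.path.join(self.root, data_file), "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
        snapshot = {"version": version, "files": self.files() + [data_file]}
        with open(self._snapshot_path(version), "w") as f:
            json.dump(snapshot, f)
        # Readers only see the new snapshot once the pointer is swapped.
        tmp = self.pointer + ".tmp"
        with open(tmp, "w") as f:
            f.write(str(version))
        os.replace(tmp, self.pointer)  # atomic rename
        return version

    def scan(self, version=None):
        """Read all rows visible in a snapshot (latest by default)."""
        rows = []
        for rel_path in self.files(version):
            with open(os.path.join(self.root, rel_path)) as f:
                rows.extend(json.loads(line) for line in f)
        return rows

table = ToyTable(tempfile.mkdtemp())
v1 = table.append([{"id": 1}, {"id": 2}])
v2 = table.append([{"id": 3}])
latest = table.scan()             # rows from both commits
as_of_v1 = table.scan(version=1)  # only the rows from the first commit
```

Real table formats add schema tracking, concurrent-writer conflict handling, and file-level statistics on top of this basic snapshot idea.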

### Lakehouses

Lakehouses build on the foundation of data lakes, addressing their limitations while incorporating features of data warehouses.

- **Purpose**: Provide a unified platform for analytics and machine learning workloads, with transactional (ACID) guarantees on the underlying data.
- **Key characteristics**:
  - Combines the scalability and flexibility of data lakes with the structure and performance of data warehouses.
  - Implements ACID transactions, indexing, and schema enforcement directly on lake-stored data.
  - Built on open table formats (e.g., Apache Iceberg, Delta Lake, Apache Hudi) for robust querying and data management.
  - Bridges the gap between data engineering and machine learning workflows, enabling seamless data sharing.
  - Eliminates the need to move data between separate lake and warehouse systems, reducing complexity and cost.

- **Examples**: Databricks Lakehouse Platform, Delta Lake, and other systems leveraging modern table formats.


## In-memory data formats

### Apache Arrow

Apache Arrow is a language-independent, columnar in-memory data format with accompanying libraries, designed to speed up analytical workflows across languages and platforms.

- **Purpose**: Provides a standard for in-memory data that minimizes serialization costs and maximizes interoperability.
- **Language bindings**:
  - Examples: PyArrow for Python, Arrow-rs for Rust.
- **Key capabilities**:
  - **Zero-copy data sharing**: Supports efficient shared memory access and RPC-based data transfers.
  - **File format support**: Reads and writes formats like CSV, Apache ORC, and Apache Parquet.
  - **In-memory analytics**: Facilitates high-speed operations using the Arrow Table abstraction for query processing and computation.
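
The zero-copy idea can be sketched with the Python standard library alone. This example does not use Arrow itself: a typed `array` stands in for one contiguous column buffer, and `memoryview` stands in for a consumer that shares that buffer instead of copying it. The `prices` column is hypothetical.

```python
from array import array

# One contiguous, typed buffer per column (a columnar layout).
prices = array("d", [10.0, 20.0, 30.0, 40.0])

view = memoryview(prices)  # a view over the same buffer, no copy
window = view[1:3]         # slicing a memoryview is also zero-copy

copied = prices[1:3]       # slicing the array allocates a new buffer

prices[1] = 99.0           # mutate the original column in place

window_values = window.tolist()  # reflects the mutation: shared memory
copied_values = copied.tolist()  # unchanged: independent copy
```

Because `window` shares memory with `prices`, it observes the update, while `copied` does not. Arrow applies the same principle across processes and languages by standardizing the buffer layout.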



## The Compute Layer

Here is one way in which compute can be categorized:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/cko-2025-q1/compute-layer-overview.png" width="500">

### Compute by Function

- **Data Engineering**: Transforming a dataset (X) into a new dataset (X') through various operations.
- **Analytics**: Using datasets to create visualizations, dashboards, and reports.
- **Machine Learning and AI**: Training models using the data to produce predictive models.

### Data Engineering Compute

If we want to drill deeper, here is how the data engineering compute space is categorized:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/cko-2025-q1/compute-layer-data-eng.png" width="500">

### Machine Learning and AI Compute

Similarly, here is how the machine learning and AI compute space is categorized:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/cko-2025-q1/compute-layer-data-eng-v2.png" width="500">


## The Orchestration Layer

To orchestrate between different stages of compute, we usually use a workflow engine or orchestration platform.

Here are some of the most popular ones:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/cko-2025-q1/orchestration-layer-v2.png" width="500">

Orchestration platforms usually differ in the following ways:
- **Language**: Some are natively written in Python, others are not.
- **Ease of use**: Some have a rigid DAG/DSL interface, others have a more flexible API-based interface for creating tasks.
- **Tilt towards ML vs Data Engineering**: Some are more focused on ML, some are more focused on Data Engineering.
  - Some have native dataset concepts meant for data engineering whereas others have native workflow concepts meant for ML.
- **Cost**: Some are cheaper than others depending on the commercial offering.
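
At its core, an orchestrator runs tasks in dependency order. The sketch below shows that idea using the standard library's `graphlib`. The task names and the `run_task` placeholder are hypothetical; real orchestration platforms add scheduling, retries, and distributed execution on top.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "train": {"transform"},
    "report": {"transform"},
    "publish": {"train", "report"},
}

def run_task(name, results):
    """Placeholder for real work; records that the task finished."""
    results[name] = f"{name}:done"

def run_pipeline(dag):
    order, results = [], {}
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    while sorter.is_active():
        # Every task returned here has all of its dependencies satisfied,
        # so a real orchestrator could run this group in parallel.
        for task in sorted(sorter.get_ready()):
            run_task(task, results)
            order.append(task)
            sorter.done(task)
    return order, results

order, results = run_pipeline(dag)
```

Here `report` and `train` become ready at the same time once `transform` finishes, which is where an orchestrator would fan out work across compute stages.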

## Distributed Computing Frameworks

Distributed computing frameworks enable scalable data processing and analysis by dividing workloads across multiple machines. These frameworks are often categorized based on their primary use case, such as stream processing, batch processing, or general-purpose computing.

### Streaming Applications

Streaming frameworks process data as it arrives in real-time or near-real-time. They are essential for applications requiring immediate insights, such as fraud detection, monitoring, and recommendation systems.

- **Apache Kafka** (JVM-Based):
  - Distributed event streaming platform.
  - Fault-tolerant and scalable messaging.
  - Widely used for real-time data pipelines and log aggregation.
  - Python integration relies on external libraries like `confluent-kafka`.

- **Apache Flink** (JVM-Based):
  - Unified engine for stream and batch processing.
  - Features low-latency, event-time processing, and stateful computation.
  - Commonly used for real-time dashboards, fraud detection, and IoT analytics.
  - PyFlink API provides Python support but has fewer features compared to Java/Scala.
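
Event-time processing and stateful computation can be sketched in a few lines of plain Python. The example below is not the Flink or Kafka API: it is a simplified model that assigns each event to a tumbling window based on the event's own timestamp and keeps a running count per key and window. The event data and window size are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 60

def tumbling_window_counts(events, window_seconds=WINDOW_SECONDS):
    """Count events per (key, window) using each event's own timestamp."""
    state = defaultdict(int)  # keyed state maintained across events
    for event_time, key in events:
        # Event time, not arrival order, decides which window an event joins.
        window_start = (event_time // window_seconds) * window_seconds
        state[(key, window_start)] += 1
    return dict(state)

# (event_time_seconds, card_id) pairs, arriving out of order.
events = [
    (5, "card-a"),
    (62, "card-a"),
    (58, "card-a"),  # late arrival, still belongs to the first window
    (61, "card-b"),
    (119, "card-a"),
]
counts = tumbling_window_counts(events)
```

A production streaming engine additionally needs watermarks to decide when a window is complete, plus checkpointing so the state survives failures.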

### Batch Processing

Batch processing frameworks handle large data volumes in discrete chunks, often for ETL workflows, batch inference, and last-mile data processing.

- **Apache Hadoop** (JVM-Based):
  - Pioneer in distributed storage (HDFS) and batch processing (MapReduce).
  - Reliable for large-scale, fault-tolerant data processing.
  - Historically used for batch ETL and analytics.

- **Apache Spark** (JVM-Based):
  - Unified engine for batch and streaming workloads.
  - Features in-memory computation, scalable processing, and rich APIs.
  - Popular for data transformation, machine learning pipelines, and analytics.
  - Python integration via PySpark can introduce overhead due to JVM interaction.
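
The MapReduce model behind Hadoop can be sketched as three phases: map, shuffle, and reduce. The example below is a single-process illustration in which threads stand in for cluster workers; the function names and input lines are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in this chunk of lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(lines, num_chunks=2):
    # Split the input into chunks, as a cluster would split files into blocks.
    chunks = [lines[i::num_chunks] for i in range(num_chunks)]
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))

lines = ["ray data ray", "spark data", "ray serve"]
counts = word_count(lines)
```

In a real cluster, the shuffle step moves data between machines over the network and is often the most expensive part of a batch job.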

### General-Purpose Distributed Computing

These frameworks are designed for diverse tasks, including machine learning, reinforcement learning, and data processing.

- **Dask**:
  - Python-native framework for parallel and distributed computing.
  - Frequently used in scientific computing and dataframe-based workflows.
  - Scales efficiently from single machines to distributed clusters.

- **Ray**:
  - Flexible platform for scalable distributed applications.
  - Rich in high-level libraries for:
    - Reinforcement Learning (Ray RLlib)
    - Distributed Data processing (Ray Data)
    - Distributed Training (Ray Train)
    - Distributed Hyperparameter Tuning (Ray Tune)
    - Distributed Serving (Ray Serve)
  - Seamless integration across the Ray ecosystem to build end-to-end data pipelines.
  - Ideal for Python-centric teams needing high-performance distributed computing.


### Challenges with JVM

JVM-based frameworks like Spark and Flink have historically dominated the distributed computing landscape. However, they present several challenges when driven from Python.

Here is a diagram of the data flow in Spark:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/cko-2025-q1/data-flow-jvm.png" width="500">
   
Here are the issues highlighted in the diagram:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/cko-2025-q1/data-flow-jvm-issues.png" width="500">

1. **Painful local development UX**:  
   Setting up a local development environment is difficult given the complexity of the dependencies.

2. **Inscrutable error traces between Python and JVM**:  
   Some tracebacks are not helpful for debugging because failures can occur across the language boundary (e.g., socket errors, JVM crashes vs. Python application crashes).

3. **Data/Memory Overhead**:  
   The onus is on the user to properly type and design their UDFs and to minimize the data/memory overhead in serializing and deserializing data between Python and JVM.

By contrast, Python-native frameworks like Ray and Dask avoid the Python-to-JVM boundary entirely, which removes this class of serialization overhead and cross-language debugging and aligns better with modern data science workflows.
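
The serialization overhead in item 3 can be illustrated with a toy model. The sketch below does not reproduce how PySpark moves data: `Boundary` is a hypothetical stand-in that uses `pickle` to mimic any process or language boundary, and it simply counts how often data crosses it for a row-at-a-time UDF versus a batched UDF.

```python
import pickle

class Boundary:
    """Hypothetical stand-in for a process or language boundary."""

    def __init__(self):
        self.crossings = 0

    def send(self, obj):
        # Everything that crosses is serialized and then deserialized.
        self.crossings += 1
        return pickle.loads(pickle.dumps(obj))

def double(x):
    return x * 2

def row_at_a_time(values, boundary):
    """Each value crosses the boundary twice: argument in, result out."""
    out = []
    for value in values:
        arg = boundary.send(value)
        out.append(boundary.send(double(arg)))
    return out

def batched(values, boundary):
    """The whole batch crosses the boundary twice in total."""
    batch = boundary.send(values)
    return boundary.send([double(v) for v in batch])

values = list(range(1000))
row_boundary, batch_boundary = Boundary(), Boundary()
row_result = row_at_a_time(values, row_boundary)
batch_result = batched(values, batch_boundary)
```

Both versions compute the same result, but the row-at-a-time version crosses the boundary 2,000 times versus 2 for the batched version. This is why UDF design and typing matter so much when a boundary sits on the data path.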

## Data Processing with Ray Data

### What is Ray Data?

Ray Data is a distributed data processing library designed for high-performance workloads.

- Initially developed as a last-mile data processing solution to seamlessly integrate with model training workflows.
- Facilitates efficient execution of GPU-intensive batch inference tasks.
- Currently being enhanced to support structured data processing, including advanced functionalities for operations such as joins and groupby.

### Why Ray Data?
* Ray Data is natively designed to support:
    - Heterogeneous computational workloads (for example, mixing CPU and GPU stages)
    - Passing data dependencies via a distributed in-memory object store
* Ray's support for stateful computation through Actors is a core feature. In contrast:
    - Spark lacks native support for stateful computations
    - Dask documents stateful capabilities but does not guarantee execution.
* Ray Data's seamless integration with the broader Ray ecosystem (including Train, Tune, Core, and Serve) reduces integration engineering, which often incurs higher costs and performance impacts than application engineering.
* Ray's resource tagging, accounting, and scalability capabilities are more sophisticated than those of other tools.

### When to use Ray Core over Ray Data?
If a user:
- is an expert in Ray Core
- knows their data distribution very well
- has very complex data processing logic

then Ray Data may not be a win: such a user can optimize the workflow directly in Ray Core, at the cost of implementing the complex logic needed to handle object store backpressure themselves.
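
To give a sense of what handling backpressure involves, here is a toy sketch using a bounded in-process queue. It is not how Ray's object store or Ray Data is implemented: the `run_pipeline` helper, the capacity, and the squaring workload are hypothetical, and threads stand in for distributed producers and consumers.

```python
import queue
import threading

def run_pipeline(num_items, capacity):
    """Producer/consumer pair connected by a bounded buffer."""
    buffer = queue.Queue(maxsize=capacity)  # put() blocks when the buffer is full
    results = []
    max_depth = 0

    def producer():
        nonlocal max_depth
        for i in range(num_items):
            buffer.put(i)  # backpressure: waits while the consumer is behind
            max_depth = max(max_depth, buffer.qsize())
        buffer.put(None)   # sentinel: no more items

    def consumer():
        while True:
            item = buffer.get()
            if item is None:
                break
            results.append(item * item)

    threads = [
        threading.Thread(target=producer),
        threading.Thread(target=consumer),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, max_depth

results, max_depth = run_pipeline(num_items=100, capacity=4)
```

The producer can never run more than `capacity` items ahead of the consumer, which bounds memory use. A hand-written Ray Core pipeline needs an equivalent mechanism so that fast upstream tasks do not fill the object store faster than downstream tasks drain it.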

### On Ray Data vs Spark
On the positive side:
- Ray Data handles heterogeneous compute more gracefully than Spark, where GPU support was added after the fact.
- Its Python-native API and integration with the end-to-end Ray and ML ecosystem are a big win.

On the negative side:
- Ray Data is still too immature to compete with Spark on SQL-like operations.
- Spark is a much more mature product with a much larger community and ecosystem.

## Ray Serve

### What is Ray Serve?

Ray Serve is a framework for building distributed ML inference services.

#### Data flow

Here is a diagram of the request lifecycle in Ray Serve:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/geotab/request_lifecycle.jpg" width="800">

When an HTTP or gRPC request is sent to the corresponding HTTP or gRPC proxy, the following happens:

1. The request is received and parsed.
2. Ray Serve looks up the correct deployment associated with the HTTP URL path or application name metadata. Serve places the request in a queue.
3. For each request in a deployment's queue, an available replica is looked up and the request is sent to it.
4. If no replicas are available (that is, more than `max_ongoing_requests` requests are outstanding at each replica), the request is left in the queue until a replica becomes available. 
5. Each replica maintains a queue of requests and executes requests one at a time, possibly using asyncio to process them concurrently.
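
Steps 3 and 4 above can be modeled with a small `asyncio` sketch. This is a toy model, not Ray Serve's implementation: the `Replica` and `Router` classes, the least-loaded selection rule, and the simulated work are assumptions made for illustration. Only the `max_ongoing_requests` name comes from the text.

```python
import asyncio

class Replica:
    def __init__(self, name):
        self.name = name
        self.ongoing = 0  # requests currently assigned to this replica
        self.peak = 0     # highest concurrency observed

    async def handle(self, request_id):
        await asyncio.sleep(0.01)  # simulated model inference
        return f"{self.name}:{request_id}"

class Router:
    def __init__(self, replicas, max_ongoing_requests):
        self.replicas = replicas
        self.max_ongoing = max_ongoing_requests
        self.slot_freed = asyncio.Condition()

    def _available(self):
        return [r for r in self.replicas if r.ongoing < self.max_ongoing]

    async def route(self, request_id):
        async with self.slot_freed:
            # Step 4: wait in the queue until some replica has a free slot.
            await self.slot_freed.wait_for(lambda: bool(self._available()))
            # Step 3: pick an available replica (least loaded, by assumption).
            replica = min(self._available(), key=lambda r: r.ongoing)
            replica.ongoing += 1
            replica.peak = max(replica.peak, replica.ongoing)
        try:
            return await replica.handle(request_id)
        finally:
            async with self.slot_freed:
                replica.ongoing -= 1
                self.slot_freed.notify_all()  # wake queued requests

async def main():
    replicas = [Replica("replica-0"), Replica("replica-1")]
    router = Router(replicas, max_ongoing_requests=2)
    responses = await asyncio.gather(*(router.route(i) for i in range(10)))
    return replicas, responses

replicas, responses = asyncio.run(main())
```

With two replicas and a cap of two ongoing requests each, at most four of the ten requests run at once, and the rest wait until a slot frees up.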

### Why Ray Serve?

Ray Serve scales services and is a good choice because it:
* allows for an intuitive approach to autoscaling based on request load.
* has integrations with tools like FastAPI to make it easier to develop and document APIs.
* allows for easy composition of a complex DAG of models and data processing steps.
* provides support for both gRPC and HTTP protocols.

### Ray Serve vs Ray Data

Rules of thumb:
- When dealing with continuous/streaming applications where low latency is critical, Ray Serve is a good choice.
- Otherwise, if data can be batched and processed at longer time intervals, Ray Data is a good choice to maximize throughput.

In terms of implementation:
- Ray Data implements complex logic to handle object store backpressure and perform dynamic resource allocation.
- Ray Serve, by contrast, relies on simple logic to batch and queue requests and allocates resources statically.

