# Introduction to Ray Data: Industry Landscape
© 2025, Anyscale. All Rights Reserved





This document provides an overview of the industry landscape.
<b> Here is the roadmap: </b>
<ul> 
    <li> The Data Layer </li>
    <li> The Compute Layer </li>
    <li> The Orchestration Layer </li>
    <li> Commercial Distributed Computing Platforms </li>
    <li> Distributed Computing Execution Models </li>
</ul>

# The data layer

This section provides an overview of the data layer, including its components and commonly used tools. Data storage patterns have evolved along with business needs and advances in technology. Below is a breakdown of these patterns and their applications.


## Databases

- **Purpose**: Built for handling transactional data.
- **Optimization**: Designed for online transaction processing (OLTP).
- **Key characteristics**:
  - Handles frequent, small-scale, atomic transactions.
  - Ensures data consistency and integrity using **ACID properties**: atomicity, consistency, isolation, and durability.
- **Examples**: MySQL, PostgreSQL, MongoDB.
- **Specialized Databases**:
  - **Vector Databases**: Databases optimized for storing high-dimensional vector embeddings and querying them by similarity (nearest-neighbor search).
    - **Examples**: Pinecone, Zilliz, ChromaDB, Weaviate.
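
To make the ACID properties above concrete, here is a minimal sketch of an atomic transfer using Python's built-in `sqlite3` module. The `accounts` table, the account names, and the `transfer` helper are hypothetical and exist only for illustration; production OLTP systems such as PostgreSQL or MySQL offer the same guarantee through their own drivers.

```python
import sqlite3

# In-memory SQLite database; the schema and account names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one atomic transaction."""
    try:
        # `with conn` commits on success and rolls back if an exception is raised.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft, so the earlier
        # credit to dst is rolled back as well (atomicity).
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


ok = transfer(conn, "alice", "bob", 30)         # succeeds
rejected = transfer(conn, "alice", "bob", 500)  # would overdraw alice
print(ok, rejected, balances(conn))
```

The second transfer credits `bob` before the debit fails, yet the final balances show no trace of that credit: either both updates apply or neither does.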

## Data warehouses

- **Purpose**: Designed for analyzing large datasets.
- **Optimization**: Suited for online analytical processing (OLAP).
- **Key characteristics**:
  - Stores vast amounts of structured and historical data optimized for analytics.
  - Uses indexing and sorting to accelerate query performance.
  - Often employs proprietary storage formats (e.g., Snowflake’s columnar format) for efficiency.
  - Supports SQL-based querying and analytical functions for business intelligence.
  - Relies on ETL (extract, transform, load) or ELT (extract, load, transform) pipelines to preprocess and structure data.
- **Examples**: Amazon Redshift, Google BigQuery, Snowflake.
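
The ETL pattern and the kind of aggregate query a warehouse serves can be sketched with the standard library alone. This is only an illustration: SQLite is a row-oriented embedded database, not a data warehouse, and the `raw_orders` records, the `transform` helper, and the `orders` table are hypothetical.

```python
import sqlite3

# Extract: hypothetical raw order events from an operational system.
raw_orders = [
    {"order_id": 1, "region": "us-west", "amount": "19.99", "ts": "2025-01-03"},
    {"order_id": 2, "region": "US-West", "amount": "5.00", "ts": "2025-01-17"},
    {"order_id": 3, "region": "eu-central", "amount": "42.50", "ts": "2025-02-01"},
    {"order_id": 4, "region": "eu-central", "amount": "bad", "ts": "2025-02-09"},
]


def transform(record):
    """Normalize one record; return None if the amount cannot be parsed."""
    try:
        cents = round(float(record["amount"]) * 100)
    except ValueError:
        return None
    # Lowercase the region and keep only the YYYY-MM part of the timestamp.
    return (record["order_id"], record["region"].lower(), record["ts"][:7], cents)


# Transform: clean the records and drop the invalid one.
rows = [row for row in map(transform, raw_orders) if row is not None]

# Load: write the structured rows into the analytical table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders "
    "(order_id INTEGER, region TEXT, month TEXT, amount_cents INTEGER)"
)
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
warehouse.commit()

# Analyze: an OLAP-style aggregate that scans many rows and returns a few.
revenue_by_region = warehouse.execute(
    "SELECT region, COUNT(*), SUM(amount_cents) "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(revenue_by_region)
```

In an ELT pipeline the same cleanup would run inside the warehouse after loading the raw records, typically as SQL.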


## Data lakes

- **Purpose**: Store large volumes of raw, semi-structured, or unstructured data.
- **Key characteristics**:
  - Maintains raw data in its native format without requiring prior transformation.
  - Supports diverse data types (e.g., text, images, videos, logs).
  - Ideal for machine learning, big data analytics, and scenarios requiring data exploration.
  - Lacks built-in mechanisms for enforcing data quality, transactional consistency, or indexing.
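
The "native format, no prior transformation" idea is often called schema-on-read. The sketch below imitates it with a temporary local directory standing in for object storage. The file names, the record fields, and the `read_any` helper are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_raw_files(root):
    """Land hypothetical raw files under `root` in their native formats."""
    (root / "logs").mkdir()
    (root / "exports").mkdir()
    (root / "logs" / "events.jsonl").write_text(
        '{"user": "a", "action": "click"}\n'
        '{"user": "b", "action": "view", "duration_ms": 120}\n'
    )
    (root / "exports" / "users.csv").write_text("user,country\na,US\nb,DE\n")


def read_any(path):
    """Schema-on-read: pick a parser by file extension at read time."""
    if path.suffix == ".jsonl":
        with path.open() as f:
            return [json.loads(line) for line in f if line.strip()]
    if path.suffix == ".csv":
        with path.open(newline="") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"no reader registered for {path.suffix}")


with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    write_raw_files(lake)
    # Interpret every file only when it is read; nothing was validated on write.
    catalog = {
        p.relative_to(lake).as_posix(): read_any(p)
        for p in sorted(lake.rglob("*"))
        if p.is_file()
    }

print(catalog)
```

Note that the two JSON records do not share the same fields and nothing rejected them on write, which is the missing data-quality enforcement described above.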


### Structure of a data lake

Data lakes are organized into layers that enable efficient storage and processing:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/cko-2025-q1/storage-layer.png" width="500">

1. **Storage layer**: The foundation of the data lake, implemented using scalable object stores like Amazon S3 and Cloudflare R2, or distributed file systems such as Ceph and Lustre.
2. **File formats**: Stores data in multiple formats, including Parquet, CSV, JSON, JPEG, and Protocol Buffers, to support varied workloads.
3. **Table format layer**: Adds features like schema management, ACID transactions, and indexing to provide database-like capabilities within the data lake. Examples of table formats include:
   - **Apache Iceberg**: Focuses on versioning, schema evolution, and large-scale analytics.
   - **Delta Lake**: Built with ACID transactions and data quality controls for analytics and machine learning.
   - **Apache Hudi**: Offers incremental processing capabilities and efficient updates for data lakes.
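
To show what a table format layer contributes on top of plain files, here is a deliberately simplified toy. It is not how Iceberg, Delta Lake, or Hudi are implemented; the `ToyTable` class, its directory layout, and its method names are all hypothetical. It only illustrates three ideas: immutable data files, numbered snapshot metadata, and schema enforcement on write.

```python
import json
import tempfile
from pathlib import Path


class ToyTable:
    """Toy table format: immutable data files plus numbered snapshot files."""

    def __init__(self, root, schema):
        self.root = Path(root)
        self.schema = schema  # e.g. {"id": int, "name": str}
        (self.root / "data").mkdir(parents=True, exist_ok=True)
        (self.root / "meta").mkdir(parents=True, exist_ok=True)

    def _latest_version(self):
        versions = [int(p.stem) for p in (self.root / "meta").glob("*.json")]
        return max(versions, default=0)

    def _files(self, version):
        """Return the list of data files that make up a given snapshot."""
        if version == 0:
            return []
        meta = self.root / "meta" / f"{version:08d}.json"
        return json.loads(meta.read_text())["files"]

    def append(self, rows):
        # Schema enforcement: reject rows before anything is written.
        for row in rows:
            same_keys = set(row) == set(self.schema)
            if not same_keys or not all(
                isinstance(row[k], t) for k, t in self.schema.items()
            ):
                raise ValueError(f"row does not match schema: {row}")
        current = self._latest_version()
        new_version = current + 1
        data_file = f"data/part-{new_version:08d}.json"
        (self.root / data_file).write_text(json.dumps(rows))
        snapshot = {"files": self._files(current) + [data_file]}
        # Commit: mode "x" fails if this version file already exists,
        # so two writers cannot both claim the same version.
        with open(self.root / "meta" / f"{new_version:08d}.json", "x") as f:
            json.dump(snapshot, f)
        return new_version

    def read(self, version=None):
        """Read the latest snapshot, or an older one ("time travel")."""
        version = self._latest_version() if version is None else version
        rows = []
        for name in self._files(version):
            rows.extend(json.loads((self.root / name).read_text()))
        return rows


with tempfile.TemporaryDirectory() as tmp:
    table = ToyTable(tmp, schema={"id": int, "name": str})
    v1 = table.append([{"id": 1, "name": "a"}])
    v2 = table.append([{"id": 2, "name": "b"}])
    try:
        table.append([{"id": "3", "name": "c"}])  # wrong type for "id"
        schema_rejected = False
    except ValueError:
        schema_rejected = True
    latest = table.read()
    as_of_v1 = table.read(version=v1)

print(v1, v2, schema_rejected, latest, as_of_v1)
```

Readers always resolve a complete snapshot file, so they never observe a half-written append. Real table formats add much more on top of this idea, such as column statistics, partitioning, deletes, and compaction.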

### Lakehouses

Lakehouses build on the foundation of data lakes, addressing their limitations while incorporating features of data warehouses.

- **Purpose**: Provide a unified platform for analytics (BI and SQL) and machine learning workloads on the same data, with transactional guarantees.
- **Key characteristics**:
  - Combines the scalability and flexibility of data lakes with the structure and performance of data warehouses.
  - Implements ACID transactions, indexing, and schema enforcement directly on lake-stored data.
  - Built on open table formats (e.g., Apache Iceberg, Delta Lake, Apache Hudi) for robust querying and data management.
  - Bridges the gap between data engineering and machine learning workflows, enabling seamless data sharing.
  - Eliminates the need to move data between separate lake and warehouse systems, reducing complexity and cost.

- **Examples**: Databricks Lakehouse Platform and other systems built on open table formats such as Delta Lake.


## In-memory data formats

### Apache Arrow

Apache Arrow is a language-independent, columnar in-memory data format with accompanying libraries, designed to speed up analytical workflows across languages and platforms.

- **Purpose**: Provides a standard for in-memory data that minimizes serialization costs and maximizes interoperability.
- **Language bindings**:
  - Examples: PyArrow for Python, arrow-rs for Rust.
- **Key capabilities**:
  - **Zero-copy data sharing**: Supports efficient shared memory access and RPC-based data transfers (Arrow Flight).
  - **File format support**: Reads and writes formats like CSV, Apache ORC, and Apache Parquet.
  - **In-memory analytics**: Facilitates high-speed operations using the Arrow Table abstraction for query processing and computation.

