## Distributed Computing Frameworks

Distributed computing frameworks enable scalable data processing and analysis by dividing workloads across multiple machines. These frameworks are often categorized based on their primary use case, such as stream processing, batch processing, or general-purpose computing.

### Streaming Applications

Streaming frameworks process data as it arrives, in real time or near real time. They are essential for applications requiring immediate insights, such as fraud detection, monitoring, and recommendation systems.

- **Apache Kafka** (JVM-Based):
  - Distributed event streaming platform.
  - Fault-tolerant and scalable messaging.
  - Widely used for real-time data pipelines and log aggregation.
  - Python integration relies on external libraries like `confluent-kafka`.

- **Apache Flink** (JVM-Based):
  - Unified engine for stream and batch processing.
  - Features low-latency, event-time processing, and stateful computation.
  - Commonly used for real-time dashboards, fraud detection, and IoT analytics.
  - The PyFlink API provides Python support but exposes fewer features than the Java/Scala APIs.
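
To make "event-time processing" and "stateful computation" more concrete, here is a minimal single-process sketch in plain Python. It is not Flink or PyFlink code: the function name `tumbling_window_counts`, the 60-second window, and the sample events are all assumptions made for illustration. Each event is assigned to a window by its own timestamp rather than by arrival order, and a dictionary stands in for keyed state.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical window size for this sketch


def tumbling_window_counts(events, window_seconds=WINDOW_SECONDS):
    """Count events per (key, window) using each event's own timestamp.

    `events` is an iterable of (event_time_seconds, key) pairs that may
    arrive out of order. The dict below plays the role of keyed state.
    """
    state = defaultdict(int)
    for event_time, key in events:
        # Assign the event to a window by event time, not arrival time.
        window_start = (event_time // window_seconds) * window_seconds
        state[(key, window_start)] += 1
    return dict(state)


# Events arrive out of order: the t=59 event shows up after t=61.
events = [(5, "card-1"), (61, "card-1"), (59, "card-1"), (130, "card-2")]
counts = tumbling_window_counts(events)
```

A real streaming engine also has to decide when a window is complete (for example, with watermarks) and has to checkpoint state for fault tolerance; this sketch leaves both out.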

### Batch Processing

Batch processing frameworks handle large data volumes in discrete chunks, often for ETL workflows, batch inference, and last-mile data processing.

- **Apache Hadoop** (JVM-Based):
  - Pioneer in distributed storage (HDFS) and batch processing (MapReduce).
  - Reliable for large-scale, fault-tolerant data processing.
  - Historically used for batch ETL and analytics.

- **Apache Spark** (JVM-Based):
  - Unified engine for batch and streaming workloads.
  - Features in-memory computation, scalable processing, and rich APIs.
  - Popular for data transformation, machine learning pipelines, and analytics.
  - Python integration via PySpark can introduce overhead due to JVM interaction.
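
The MapReduce model mentioned under Hadoop can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop code; the function names and the sample input are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


result = word_count(["spark and hadoop", "Hadoop and HDFS"])
```

In a distributed setting, the map and reduce calls would run on different machines and the shuffle would move data across the network; here everything runs in one process to keep the structure visible.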

### General-Purpose Distributed Computing

These frameworks are designed for diverse tasks, including machine learning, reinforcement learning, and data processing.

- **Dask**:
  - Python-native framework for parallel and distributed computing.
  - Frequently used in scientific computing and dataframe-based workflows.
  - Scales efficiently from single machines to distributed clusters.

- **Ray**:
  - Flexible platform for scalable distributed applications.
  - Rich in high-level libraries for:
    - Reinforcement Learning (Ray RLlib)
    - Distributed Data Processing (Ray Data)
    - Distributed Training (Ray Train)
    - Distributed Hyperparameter Tuning (Ray Tune)
    - Distributed Serving (Ray Serve)
  - Libraries integrate with one another across the Ray ecosystem, so end-to-end data pipelines can be built in one framework.
  - Ideal for Python-centric teams needing high-performance distributed computing.
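
A common pattern in general-purpose frameworks is to submit many independent function calls as tasks, receive futures, and gather the results. The sketch below shows that pattern on a single machine using Python's standard-library `concurrent.futures`. It is a stand-in only, not Dask or Ray code, and `score_partition`, `run_parallel`, and the sample partitions are hypothetical names and data.

```python
from concurrent.futures import ThreadPoolExecutor


def score_partition(partition):
    """Hypothetical per-partition work: sum of squares."""
    return sum(x * x for x in partition)


def run_parallel(partitions, max_workers=4):
    """Submit one task per partition, then gather results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(score_partition, p) for p in partitions]
        return [f.result() for f in futures]


partitions = [[1, 2, 3], [4, 5], [6]]
partial_sums = run_parallel(partitions)
total = sum(partial_sums)
```

A distributed framework applies the same submit-and-gather idea but schedules the tasks across the processes and machines of a cluster instead of threads in one interpreter.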


### Challenges with the JVM

JVM-based frameworks like Spark and Flink have historically dominated the distributed computing landscape. However, they present several challenges for Python users.

The following diagram shows the data flow in Spark:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/cko-2025-q1/data-flow-jvm.png" width="500">
   
Here are the issues highlighted in the diagram:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/cko-2025-q1/data-flow-jvm-issues.png" width="500">

1. **Painful local development UX**:  
   Setting up a local development environment is difficult because of the complexity of the dependencies.

2. **Inscrutable error traces between Python and JVM**:  
   Some tracebacks are unhelpful for debugging because failures can originate on either side of the language boundary (e.g., socket errors or JVM crashes versus Python application crashes).

3. **Data/memory overhead**:  
   The onus is on the user to type and design UDFs (user-defined functions) properly and to minimize the data and memory overhead of serializing and deserializing data between Python and the JVM.
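
The serialization cost in the third point can be illustrated with a rough sketch. It uses Python's `pickle` module purely as a stand-in for whatever wire format crosses a language boundary, and it compares serializing every row separately with serializing one batch. The function names and the synthetic rows are assumptions for illustration, not measurements of any real framework.

```python
import pickle


def per_row_bytes(rows):
    """Serialize every row on its own, as a row-at-a-time boundary might."""
    return sum(len(pickle.dumps(row)) for row in rows)


def batched_bytes(rows):
    """Serialize all rows in a single payload."""
    return len(pickle.dumps(rows))


# Synthetic (int, float) rows, used only to compare payload sizes.
rows = [(i, float(i)) for i in range(10_000)]
row_total = per_row_bytes(rows)
batch_total = batched_bytes(rows)
overhead_ratio = row_total / batch_total
```

Each separately serialized row repeats the format's fixed header, so `row_total` comes out larger than `batch_total`. This is one reason batching data across a boundary tends to reduce overhead compared with calling a UDF row by row.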

By contrast, frameworks like Ray and Dask do not run on the JVM, so they avoid the Python-to-JVM boundary and its overhead entirely and align more closely with Python-based data science workflows.