<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/ingestion/parallel_execution_ingestion_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Distributed Ingestion Pipeline with Ray

In this notebook, we demonstrate how to execute ingestion pipelines using Ray.

In [None]:
%pip install llama-index-ingestion-ray llama-index-embeddings-huggingface

Start a new cluster, or connect to an existing one. See the [Ray configuration documentation](https://docs.ray.io/en/latest/ray-core/configure.html) for details about Ray cluster configurations.

In [None]:
import ray

ray.init()

### Load data

For this notebook, we'll load the `PatronusAIFinanceBenchDataset` llama-dataset from [LlamaHub](https://llamahub.ai).

In [None]:
!llamaindex-cli download-llamadataset PatronusAIFinanceBenchDataset --download-dir ./data

In [None]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data()

### Define the RayIngestionPipeline

First, we define our transformations. Each `TransformComponent` object is wrapped in a `RayTransformComponent`, which encapsulates the transformation logic within stateful [Ray Actors](https://docs.ray.io/en/latest/ray-core/actors.html). All the transformation logic is executed with [Ray Data](https://docs.ray.io/en/latest/data/data.html). For details on configuring hardware requirements and actor pool strategies, see the [`ray.data.Dataset.map_batches` documentation](https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html).
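The stateful-actor idea can be illustrated without Ray. The following is a minimal plain-Python sketch, not the library's implementation: `UppercaseTransform`, `StatefulWorker`, `iter_batches`, and `map_batches` are hypothetical names invented for this illustration. It shows why stateful workers help: the transform is constructed once per worker and then reused for every batch.

```python
from typing import Any, Iterator, List


class UppercaseTransform:
    """Hypothetical stand-in for a transform with costly setup."""

    instances_created = 0  # Counts how many times setup ran.

    def __init__(self, suffix: str = "") -> None:
        # Expensive setup (for example, loading a model) would go here.
        UppercaseTransform.instances_created += 1
        self.suffix = suffix

    def __call__(self, items: List[str]) -> List[str]:
        return [item.upper() + self.suffix for item in items]


class StatefulWorker:
    """Hypothetical worker: builds the transform once, reuses it per batch."""

    def __init__(self, transform_class: type, **transform_kwargs: Any) -> None:
        self.transform = transform_class(**transform_kwargs)

    def __call__(self, batch: List[str]) -> List[str]:
        return self.transform(batch)


def iter_batches(items: List[Any], batch_size: int) -> Iterator[List[Any]]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def map_batches(items: List[str], worker: StatefulWorker, batch_size: int) -> List[str]:
    """Apply the worker to each batch and flatten the results."""
    output: List[str] = []
    for batch in iter_batches(items, batch_size):
        output.extend(worker(batch))
    return output


worker = StatefulWorker(UppercaseTransform, suffix="!")
result = map_batches(["a", "b", "c", "d", "e"], worker, batch_size=2)
print(result)  # ['A!', 'B!', 'C!', 'D!', 'E!']
print(UppercaseTransform.instances_created)  # 1
```

In the actual pipeline, Ray Data takes the place of the sequential `map_batches` loop and runs many such workers in parallel as actors.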

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.ingestion.ray import RayTransformComponent

transformations = [
    RayTransformComponent(
        transform_class=SentenceSplitter,
        chunk_size=1024,
        chunk_overlap=20,
        map_batches_kwargs={
            "batch_size": 100,  # Number of rows per batch
            "num_cpus": 1,  # Request 1 CPU per actor
            "compute": ray.data.ActorPoolStrategy(
                size=20
            ),  # Fixed Pool of 20 actors
        },
    ),
    RayTransformComponent(
        transform_class=HuggingFaceEmbedding,
        model_name="BAAI/bge-small-en-v1.5",
        map_batches_kwargs={
            "batch_size": 100,
            # Fractional GPU usage: each actor reserves 25% of a GPU.
            # With 1 physical GPU, up to 4 actors can run concurrently.
            # With 4 physical GPUs, up to 16 actors can run concurrently.
            "num_gpus": 0.25,
        },
    ),
]
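The fractional GPU comments above come down to simple arithmetic. The sketch below is a back-of-the-envelope upper bound, not Ray's scheduler: `max_gpu_actors` is a hypothetical helper, and it assumes that each fractional request must fit on a single physical GPU and that CPU and memory are not the limiting resources.

```python
def max_gpu_actors(physical_gpus: int, num_gpus_per_actor: float) -> int:
    """Upper bound on concurrent actors given a fractional GPU request.

    Assumption: a fractional request cannot span two GPUs, so the bound is
    (actors that fit on one GPU) * (number of GPUs).
    """
    if physical_gpus < 0:
        raise ValueError("physical_gpus must be non-negative")
    if not 0 < num_gpus_per_actor <= 1:
        raise ValueError("num_gpus_per_actor must be in the range (0, 1]")
    # The small epsilon guards against floating-point rounding in the division.
    actors_per_gpu = int(1 / num_gpus_per_actor + 1e-9)
    return actors_per_gpu * physical_gpus


print(max_gpu_actors(1, 0.25))  # 4
print(max_gpu_actors(4, 0.25))  # 16
print(max_gpu_actors(4, 0.3))   # 12: only 3 actors fit on each GPU
```

The number of actors that actually run also depends on the compute strategy passed in `map_batches_kwargs` and on the other resources available in the cluster.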

Then, we create the ingestion pipeline.

In [None]:
from llama_index.ingestion.ray import RayIngestionPipeline

pipeline = RayIngestionPipeline(transformations=transformations)

### Run the Pipeline

Finally, we run the pipeline on our Ray cluster.

In [None]:
nodes = pipeline.run(documents=documents)

2026-01-02 19:45:57,691	INFO logging.py:397 -- Registered dataset logger for dataset dataset_8_0
2026-01-02 19:45:57,692	INFO logging.py:405 -- dataset_8_0 registers for logging while another dataset dataset_2_0 is also logging. For performance reasons, we will not log to the dataset dataset_8_0 until it is the only active dataset.
2026-01-02 19:45:57,694	INFO streaming_executor.py:178 -- Starting execution of Dataset dataset_8_0. Full logs are in /tmp/ray/session_2026-01-02_19-32-39_779796_94512/logs/ray-data
2026-01-02 19:45:57,694	INFO streaming_executor.py:179 -- Execution plan of Dataset dataset_8_0: InputDataBuffer[Input] -> ActorPoolMapOperator[MapBatches(TransformActor)] -> ActorPoolMapOperator[MapBatches(TransformActor)]
2026-01-02 19:45:58,296	INFO progress_bar.py:213 -- === Ray Data Progress {MapBatches(TransformActor)} ===
2026-01-02 19:45:58,297	INFO progress_bar.py:215 -- MapBatches(TransformActor): Tasks: 0; Actors: 20 (running=0, restarting=0, pending=20); Queued blocks