# Parallelizing Ingestion Pipeline

In this notebook, we demonstrate how to execute ingestion pipelines using parallel processes.

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
import cProfile, pstats
from pstats import SortKey

### Load data

For this notebook, we'll load the `PatronusAIFinanceBenchDataset` llama-dataset from [llamahub](https://llamahub.ai).

In [None]:
!llamaindex-cli download-llamadataset PatronusAIFinanceBenchDataset --download-dir ./data

In [None]:
from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data()

### Define our IngestionPipeline

In [None]:
from llama_index import Document
from llama_index.embeddings import OpenAIEmbedding
from llama_index.text_splitter import SentenceSplitter
from llama_index.extractors import TitleExtractor
from llama_index.ingestion import IngestionPipeline

# create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=20),
        TitleExtractor(),
        OpenAIEmbedding(),
    ]
)

# we'll be measuring performance with timeit and cProfile, so disable the
# cache to make every run do the full amount of work
pipeline.disable_cache = True

### Parallel Execution

A single run. Setting `num_workers` to a value greater than 1 will invoke parallel execution.

In [None]:
nodes = pipeline.run(documents=documents, num_workers=4)

100%|██████████| 5/5 [00:01<00:00,  4.48it/s]
100%|██████████| 5/5 [00:01<00:00,  3.37it/s]
100%|██████████| 5/5 [00:01<00:00,  2.76it/s]
100%|██████████| 5/5 [00:01<00:00,  2.57it/s]
100%|██████████| 1/1 [00:01<00:00,  1.33s/it]


In [None]:
len(nodes)

1161

In [None]:
%timeit pipeline.run(documents=documents, num_workers=4)

100%|██████████| 5/5 [00:01<00:00,  4.34it/s]
100%|██████████| 5/5 [00:01<00:00,  3.92it/s]
100%|██████████| 5/5 [00:01<00:00,  3.59it/s]
100%|██████████| 5/5 [00:01<00:00,  2.93it/s]
100%|██████████| 1/1 [00:01<00:00,  1.20s/it]
100%|██████████| 5/5 [00:01<00:00,  3.56it/s]
100%|██████████| 5/5 [00:01<00:00,  3.42it/s]
100%|██████████| 5/5 [00:01<00:00,  2.89it/s]
100%|██████████| 5/5 [00:02<00:00,  1.83it/s]
100%|██████████| 1/1 [00:01<00:00,  1.12s/it]
100%|██████████| 5/5 [00:01<00:00,  3.99it/s]
100%|██████████| 5/5 [00:01<00:00,  3.28it/s]
100%|██████████| 5/5 [00:01<00:00,  2.80it/s]
100%|██████████| 5/5 [00:02<00:00,  2.39it/s]
100%|██████████| 1/1 [00:01<00:00,  1.38s/it]
100%|██████████| 5/5 [00:00<00:00,  5.39it/s]
100%|██████████| 5/5 [00:01<00:00,  3.59it/s]
100%|██████████| 5/5 [00:01<00:00,  3.73it/s]
100%|██████████| 5/5 [00:01<00:00,  2.75it/s]
100%|██████████| 1/1 [00:01<00:00,  1.25s/it]
100%|██████████| 5/5 [00:01<00:00,  4.39it/s]
100%|██████████| 5/5 [00:01<00:00,

12.4 s ± 2.89 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
cProfile.run(
    "pipeline.run(documents=documents, num_workers=4)",
    "newstats",
)
p = pstats.Stats("newstats")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

100%|██████████| 5/5 [00:01<00:00,  4.51it/s]
100%|██████████| 5/5 [00:01<00:00,  3.38it/s]
100%|██████████| 5/5 [00:01<00:00,  3.26it/s]
100%|██████████| 5/5 [00:01<00:00,  2.96it/s]
100%|██████████| 1/1 [00:01<00:00,  1.03s/it]


Tue Jan  9 01:48:45 2024    newstats

         2054 function calls in 10.402 seconds

   Ordered by: cumulative time
   List reduced from 211 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   10.402   10.402 {built-in method builtins.exec}
        1    0.010    0.010   10.402   10.402 <string>:1(<module>)
        1    0.000    0.000   10.392   10.392 pipeline.py:353(run)
       12    0.000    0.000   10.358    0.863 threading.py:589(wait)
       12    0.000    0.000   10.358    0.863 threading.py:288(wait)
       75   10.358    0.138   10.358    0.138 {method 'acquire' of '_thread.lock' objects}
        1    0.000    0.000   10.356   10.356 pool.py:369(starmap)
        1    0.000    0.000   10.356   10.356 pool.py:767(get)
        1    0.000    0.000   10.356   10.356 pool.py:764(wait)
        1    0.000    0.000    0.028    0.028 context.py:115(Pool)
        1    0.000    0.000    0.028    0.028 pool.py

<pstats.Stats at 0x16bf2df60>
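
The profile above attributes almost all of the cumulative time to `pool.py` (`starmap`) and to lock waits, which suggests the parent process mostly waits while worker processes do the work. The following is a minimal sketch of that fan-out pattern, assuming the inputs are split into batches that are mapped over a process pool. The names `make_batches`, `transform_batch`, and `run_in_parallel` are hypothetical and are not part of the llama_index API; the actual pipeline internals may differ.

```python
from multiprocessing import Pool
from typing import List, Sequence


def make_batches(items: Sequence[str], num_batches: int) -> List[List[str]]:
    """Split items into at most num_batches contiguous, near-equal batches."""
    if num_batches < 1:
        raise ValueError("num_batches must be >= 1")
    size = max(1, -(-len(items) // num_batches))  # ceiling division
    return [list(items[i : i + size]) for i in range(0, len(items), size)]


def transform_batch(batch: List[str], suffix: str) -> List[str]:
    """Hypothetical stand-in for running the transformations on one batch."""
    return [text.upper() + suffix for text in batch]


def run_in_parallel(items: Sequence[str], num_workers: int) -> List[str]:
    """Fan batches out to worker processes, then flatten results in order."""
    batches = make_batches(items, num_workers)
    with Pool(processes=num_workers) as pool:
        # starmap blocks the parent until every batch has been processed,
        # which is why the parent's profile is dominated by waiting.
        results = pool.starmap(transform_batch, [(b, "!") for b in batches])
    return [out for batch_out in results for out in batch_out]
```

When calling `run_in_parallel` from a script, place the call under an `if __name__ == "__main__":` guard so that it also works on platforms that start worker processes by spawning a fresh interpreter.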

### Sequential Execution

By default, `num_workers` is set to `None`, which invokes sequential execution.

In [None]:
nodes = pipeline.run(documents=documents)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  2.74it/s]


In [None]:
len(nodes)

1161

In [None]:
%timeit pipeline.run(documents=documents)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.23it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.48it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  2.62it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.33it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  2.59it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  2.97it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:03<00:00,  1.63it/s]
100%|███████████████████████████████████████████

21.3 s ± 2.2 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
cProfile.run("pipeline.run(documents=documents)", "oldstats")
p = pstats.Stats("oldstats")
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats(15)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  2.95it/s]


Tue Jan  9 01:52:26 2024    oldstats

         1024732 function calls (989764 primitive calls) in 26.372 seconds

   Ordered by: cumulative time
   List reduced from 1236 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   26.373   26.373 {built-in method builtins.exec}
        1    0.021    0.021   26.373   26.373 <string>:1(<module>)
        1    0.000    0.000   26.353   26.353 pipeline.py:353(run)
        1    0.000    0.000   26.353   26.353 pipeline.py:51(run_transformations)
        1    0.004    0.004   21.593   21.593 base.py:334(__call__)
        1    0.001    0.001   21.571   21.571 base.py:234(get_text_embedding_batch)
       12    0.000    0.000   21.567    1.797 openai.py:377(_get_text_embeddings)
       12    0.000    0.000   21.567    1.797 __init__.py:287(wrapped_f)
       12    0.001    0.000   21.567    1.797 __init__.py:369(__call__)
       12    0.000    0.000   21.565    1.797 openai.

<pstats.Stats at 0x16ba77970>
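
Note that `cProfile` only records calls made in the process it runs in, so the parallel profile shows the parent waiting on the pool, while the sequential profile shows the embedding calls themselves. To compare the two dumps side by side, a small helper like the one below can read the total recorded time from each file. This is a sketch: `profile_totals` is a hypothetical helper name, and it relies on the `total_tt` attribute of `pstats.Stats`.

```python
import pstats
from typing import Dict


def profile_totals(stats_paths: Dict[str, str]) -> Dict[str, float]:
    """Map each label to the total seconds recorded in its cProfile dump."""
    return {
        label: pstats.Stats(path).total_tt
        for label, path in stats_paths.items()
    }
```

With the files written by the cells above, `profile_totals({"parallel": "newstats", "sequential": "oldstats"})` would return the totals that `print_stats` reports in each header line.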

### In Conclusion

The `%timeit` results above show that with just 4 workers, mean run time dropped from about 21.3 s to about 12.4 s, a reduction of roughly 42% (about a 1.7x speedup) when using parallel execution.
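
To make the comparison concrete, the sketch below computes the speedup and the relative time reduction from the two `%timeit` means reported above (21.3 s sequential, 12.4 s with 4 workers). The function names are illustrative only.

```python
def speedup(sequential_s: float, parallel_s: float) -> float:
    """How many times faster the parallel run is than the sequential one."""
    return sequential_s / parallel_s


def time_reduction_pct(sequential_s: float, parallel_s: float) -> float:
    """Percentage of wall-clock time saved by the parallel run."""
    return 100.0 * (sequential_s - parallel_s) / sequential_s


sequential_mean_s = 21.3  # %timeit mean, default (sequential) run
parallel_mean_s = 12.4  # %timeit mean, num_workers=4

print(f"speedup: {speedup(sequential_mean_s, parallel_mean_s):.2f}x")
# speedup: 1.72x
print(f"time reduction: {time_reduction_pct(sequential_mean_s, parallel_mean_s):.1f}%")
# time reduction: 41.8%
```

Both `%timeit` means carry a standard deviation of more than 2 s, so these figures are approximate.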