# Batch Example
This example follows the guide: https://www.notion.so/llamaindex/Batch-APIs-1dedb4b7d41a80bd84b5f79e86ff4ec4

## Overview
This example creates a parsing batch from PDF documents taken from a [Kaggle dataset](https://www.kaggle.com/datasets/manisha717/dataset-of-pdf-files?resource=download). The steps are:
1. [Set up API keys](#Setup-API-Keys)
    - `organization_id` & `project_id`
1. [Set up an S3 bucket with the uploaded documents](#Setup-S3-Bucket)
    - example S3 bucket: [llamacloud-demo-batch-example](https://us-east-1.console.aws.amazon.com/s3/buckets/llamacloud-demo-batch-example?region=us-east-1&bucketType=general&tab=objects)
1. [Create a `data source` integration in LlamaCloud](#Create-Data-Source)
1. [Send a batch request](#Send-a-Batch-Request)
1. [Check batch status](#Check-Batch-Status)
1. [Get parse results](#Get-Parse-Results)

### Setup

In [1]:
!pip install llama-index
!pip install llama-index-core


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [2]:
import json

def pretty(data):
    json_string = json.dumps(data, indent=4)
    print(json_string)
    

## Setup API Keys

Here we set the `LLAMA_CLOUD_API_KEY` used to authenticate with LlamaCloud, together with the AWS credentials that LlamaCloud will use to read the S3 bucket.

In [None]:
import os
# API access to LlamaCloud
os.environ["LLAMA_CLOUD_API_KEY"] = "<LLAMA_CLOUD_API_KEY>"  # Get your API key from https://cloud.llamaindex.ai/
# AWS credentials of the IAM user described under "Create Access Key"
os.environ["AWS_ACCESS_KEY_ID"] = "<AWS_ACCESS_KEY_ID>"  # Access key ID from the AWS IAM console
os.environ["AWS_SECRET_ACCESS_KEY"] = "<AWS_SECRET_ACCESS_KEY>"  # Secret access key from the AWS IAM console
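A forgotten placeholder such as `<LLAMA_CLOUD_API_KEY>` only shows up later as an authentication error. The following is a minimal sketch of an early check; `require_env` is a hypothetical helper name, not part of `llama-cloud`, and it assumes placeholders keep the `<...>` shape used in this notebook.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable, or raise if it is unset or still a placeholder."""
    value = os.environ.get(name, "")
    # Assumption: placeholders look like "<LLAMA_CLOUD_API_KEY>", as in this notebook.
    if not value or (value.startswith("<") and value.endswith(">")):
        raise RuntimeError(f"{name} is not set; replace the placeholder with a real value")
    return value
```

For example, `require_env("LLAMA_CLOUD_API_KEY")` could be called before the client is constructed, so a missing key fails immediately with a readable message.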

In [24]:
# llama-parse is async-first; running async code in a notebook requires nest_asyncio
import nest_asyncio
nest_asyncio.apply()

In [25]:
from llama_cloud.client import LlamaCloud

client = LlamaCloud(token=os.environ["LLAMA_CLOUD_API_KEY"])

Set up request headers for calls made directly over HTTP, without the `llama-cloud` client.

In [26]:
import requests

LLAMA_CLOUD_URL = "https://api.cloud.llamaindex.ai"

HEADERS = {
    "Authorization": f"Bearer {os.environ['LLAMA_CLOUD_API_KEY']}"
}

### Set Organization ID

In [27]:
client.organizations.list_organizations()

[Organization(id='5c5008d4-2bc6-4095-8e20-75721c5a0edf', created_at=datetime.datetime(2024, 9, 30, 8, 40, 28, 288200, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 3, 31, 0, 26, 44, 877279, tzinfo=datetime.timezone.utc), name='CEMEX-Dev-01', stripe_customer_id=None, parse_plan_level=<ParsePlanLevel.PREMIUM: 'PREMIUM'>),
 Organization(id='9e0e9e99-3f95-47d6-9a84-6141f18e4eaa', created_at=datetime.datetime(2025, 3, 31, 22, 50, 8, 793156, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 3, 31, 22, 50, 8, 793156, tzinfo=datetime.timezone.utc), name='Default Org', stripe_customer_id=None, parse_plan_level=<ParsePlanLevel.DEFAULT: 'DEFAULT'>)]

In [28]:
ORGANIZATION_ID = "9e0e9e99-3f95-47d6-9a84-6141f18e4eaa"

### Set Project ID

In [31]:
client.projects.list_projects(organization_id=ORGANIZATION_ID)


[Project(name='Default', id='bfe80772-7d84-4cc8-a29b-79273595c444', created_at=datetime.datetime(2025, 3, 31, 22, 50, 8, 912836, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 3, 31, 22, 50, 8, 912836, tzinfo=datetime.timezone.utc), ad_hoc_eval_dataset_id=None, organization_id='9e0e9e99-3f95-47d6-9a84-6141f18e4eaa', is_default=True),
 Project(name='test-batch-mode', id='800fa8e1-dc6c-4609-b889-8dbb06b40ff1', created_at=datetime.datetime(2025, 4, 30, 19, 4, 27, 716831, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 4, 30, 19, 4, 27, 716831, tzinfo=datetime.timezone.utc), ad_hoc_eval_dataset_id=None, organization_id='9e0e9e99-3f95-47d6-9a84-6141f18e4eaa', is_default=False)]

In [32]:
PROJECT_ID = "800fa8e1-dc6c-4609-b889-8dbb06b40ff1"
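Copying IDs by hand from the listing output is easy to get wrong. As a sketch, the ID can instead be looked up by name. `find_id_by_name` is a hypothetical helper; it relies only on the `.id` and `.name` attributes visible in the outputs above, and the demo uses stand-in objects rather than real API responses.

```python
from types import SimpleNamespace


def find_id_by_name(items, name):
    """Return the id of the single item whose .name equals `name`."""
    matches = [item.id for item in items if item.name == name]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one item named {name!r}, found {len(matches)}")
    return matches[0]


# Stand-ins for the objects returned by list_projects(); only .id and .name are used.
demo_projects = [
    SimpleNamespace(id="id-default", name="Default"),
    SimpleNamespace(id="id-batch", name="test-batch-mode"),
]
print(find_id_by_name(demo_projects, "test-batch-mode"))  # prints id-batch
```

In the notebook this would be used as `find_id_by_name(client.projects.list_projects(organization_id=ORGANIZATION_ID), "test-batch-mode")`, and likewise for organizations.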

## Setup S3 Bucket

Create an S3 bucket and upload the documents you'd like to batch parse.

Example of a pdf dataset from [kaggle](https://www.kaggle.com/datasets/manisha717/dataset-of-pdf-files?resource=download) can be found in [llamacloud-demo-batch-example](https://us-east-1.console.aws.amazon.com/s3/buckets/llamacloud-demo-batch-example?region=us-east-1&bucketType=general&tab=objects).

_Note: you can create `data source`s with a path prefix. This lets you keep multiple folders in one S3 bucket and connect each `data source` to exactly the folder you want._

### Create Access Key

LlamaCloud needs read access to the S3 bucket in order to parse the documents. Create a new IAM user with the following policy, replacing `your-bucket-name` with the name of your bucket:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "LLamaCloudPermissions",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::your-bucket-name",
                "arn:aws:s3:::your-bucket-name/*"
            ]
        }
    ]
}
```

For more information, see [AWS Required Permissions](https://docs.cloud.llamaindex.ai/llamacloud/integrations/data_sources/s3#aws-required-permissions).
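To avoid editing the bucket name in two places by hand, the policy can be generated. This is a small sketch that reproduces the policy document shown above for a given bucket; `build_s3_read_policy` is a hypothetical helper name.

```python
import json


def build_s3_read_policy(bucket_name: str) -> dict:
    """Build the read-only policy shown above for one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "LLamaCloudPermissions",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",    # the bucket itself (for ListBucket)
                    f"arn:aws:s3:::{bucket_name}/*",  # the objects in it (for GetObject)
                ],
            }
        ],
    }


print(json.dumps(build_s3_read_policy("llamacloud-demo-batch-example"), indent=4))
```

The printed JSON can be pasted into the IAM policy editor.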

## Create Data Source

In [34]:
from llama_cloud.types import CloudS3DataSource

S3_BUCKET_NAME = "llamacloud-demo-batch-example"
S3_PREFIX = "single/"

data_source = {
    'name': 'test-batch-example-s3-data-source-5',
    'source_type': 'S3', 
    'component': CloudS3DataSource(
        bucket=S3_BUCKET_NAME,
        prefix=S3_PREFIX,
        aws_access_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_access_secret=os.environ["AWS_SECRET_ACCESS_KEY"]
    )
}

response = client.data_sources.create_data_source(
    project_id=PROJECT_ID,
    organization_id=ORGANIZATION_ID,
    request=data_source,
)

DATA_SOURCE_ID = response.id

print(f"Created data source with ID: {DATA_SOURCE_ID}")


Created data source with ID: f19f6958-68fc-45d7-b99a-c6120564aa2a


In [35]:
client.data_sources.get_data_source(DATA_SOURCE_ID)

DataSource(id='f19f6958-68fc-45d7-b99a-c6120564aa2a', created_at=datetime.datetime(2025, 5, 1, 16, 49, 9, 741051, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2025, 5, 1, 16, 49, 9, 741051), name='test-batch-example-s3-data-source-5', source_type=<ConfigurableDataSourceNames.S_3: 'S3'>, custom_metadata=None, component={'supports_access_control': False, 'bucket': 'llamacloud-demo-batch-example', 'prefix': 'single/', 'aws_access_id': 'AKIAYEKP5E2PCWW5UEG2', 'aws_access_secret': '**********', 's3_endpoint_url': None, 'class_name': 'CloudS3DataSource'}, version_metadata=None, project_id='800fa8e1-dc6c-4609-b889-8dbb06b40ff1')

## Send a Batch Request

The Batch API is currently in beta. The `llama-cloud` client supports it starting from version `0.1.20`.
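The request body used in this section has six top-level fields: `project_id`, `external_id`, `tool`, `tool_data`, `input_type`, and `input_id`. The sketch below builds that envelope separately from the long `tool_data` dictionary. `build_parse_batch_body` is a hypothetical helper. Passing a shortened `tool_data` assumes the API fills in defaults for omitted options; the response in this section includes a field (`compact_markdown_table`) that the request did not send, which suggests this, but it is not confirmed here.

```python
def build_parse_batch_body(project_id, external_id, data_source_id, tool_data=None):
    """Build the JSON body for a parse batch that reads from a data source."""
    if tool_data is None:
        # Assumption: omitted parse options fall back to server-side defaults.
        tool_data = {"languages": ["en"]}
    return {
        "project_id": project_id,
        "external_id": external_id,
        "tool": "parse",
        "tool_data": dict(tool_data),  # copy so the caller's dict is not shared
        "input_type": "datasource",
        "input_id": data_source_id,
    }


# Placeholder IDs for illustration only.
demo_body = build_parse_batch_body("my-project-id", "my-external-id", "my-data-source-id")
print(demo_body["tool_data"])  # prints {'languages': ['en']}
```

If the shortened form is rejected, the full `tool_data` dictionary from the request cell can be passed as the fourth argument unchanged.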

### Send Request

In [36]:
route = "/api/v1/beta/batches"

external_id = "single-demo-5"
project_id = PROJECT_ID
data_source_id = DATA_SOURCE_ID

body = {
  "project_id": project_id,
  "external_id": external_id,
  "tool": "parse",
  "tool_data": {
    "model": "",
    "preset": "",
    "bbox_top": None,
    "auto_mode": False,
    "bbox_left": None,
    "fast_mode": False,
    "input_url": None,
    "languages": ["en"],
    "max_pages": None,
    "bbox_right": None,
    "gpt4o_mode": False,
    "http_proxy": None,
    "parse_mode": None,
    "project_id": None,
    "bbox_bottom": None,
    "disable_ocr": False,
    "page_prefix": "",
    "page_suffix": "",
    "user_prompt": None,
    "webhook_url": "",
    "bounding_box": "",
    "do_not_cache": False,
    "premium_mode": False,
    "target_pages": "",
    "gpt4o_api_key": "",
    "s3_input_path": "",
    "system_prompt": None,
    "annotate_links": False,
    "extract_charts": False,
    "extract_layout": False,
    "page_separator": None,
    "continuous_mode": False,
    "input_s3_region": "",
    "take_screenshot": False,
    "azure_openai_key": None,
    "invalidate_cache": False,
    "output_s3_region": "",
    "structured_output": False,
    "max_pages_enforced": None,
    "skip_diagonal_text": False,
    "adaptive_long_table": False,
    "parsing_instruction": "",
    "system_prompt_append": None,
    "azure_openai_endpoint": None,
    "do_not_unroll_columns": False,
    "guess_xlsx_sheet_name": False,
    "output_tables_as_HTML": False,
    "s3_output_path_prefix": "",
    "strict_mode_image_ocr": False,
    "disable_reconstruction": False,
    "formatting_instruction": None,
    "job_timeout_in_seconds": None,
    "output_pdf_of_document": False,
    "strict_mode_buggy_font": False,
    "azure_openai_api_version": None,
    "disable_image_extraction": False,
    "is_formatting_instruction": True,
    "vendor_multimodal_api_key": "",
    "html_remove_fixed_elements": False,
    "internal_is_screenshot_job": False,
    "strict_mode_reconstruction": False,
    "use_vendor_multimodal_model": False,
    "azure_openai_deployment_name": None,
    "strict_mode_image_extraction": False,
    "vendor_multimodal_model_name": "",
    "content_guideline_instruction": None,
    "structured_output_json_schema": None,
    "html_make_all_elements_visible": False,
    "spreadsheet_extract_sub_tables": False,
    "html_remove_navigation_elements": False,
    "auto_mode_trigger_on_text_in_page": None,
    "auto_mode_trigger_on_image_in_page": False,
    "auto_mode_trigger_on_table_in_page": False,
    "structured_output_json_schema_name": None,
    "auto_mode_trigger_on_regexp_in_page": None,
    "complemental_formatting_instruction": None,
    "preserve_layout_alignment_across_pages": False,
    "job_timeout_extra_time_per_page_in_seconds": None,
    "ignore_document_elements_for_layout_detection": False
  },
  "input_type": "datasource",
  "input_id": data_source_id,
}

response = requests.post(LLAMA_CLOUD_URL + route, json=body, headers=HEADERS)
response.raise_for_status()  # fail early on authentication or validation errors
batch = response.json()
pretty(batch)

BATCH_ID = batch["id"]
print("BATCH_ID:", BATCH_ID)


{
    "tool": "parse",
    "tool_data": {
        "languages": [
            "en"
        ],
        "parsing_instruction": "",
        "disable_ocr": false,
        "annotate_links": false,
        "adaptive_long_table": false,
        "compact_markdown_table": false,
        "disable_reconstruction": false,
        "disable_image_extraction": false,
        "invalidate_cache": false,
        "output_pdf_of_document": false,
        "do_not_cache": false,
        "fast_mode": false,
        "skip_diagonal_text": false,
        "preserve_layout_alignment_across_pages": false,
        "gpt4o_mode": false,
        "gpt4o_api_key": "",
        "do_not_unroll_columns": false,
        "extract_layout": false,
        "html_make_all_elements_visible": false,
        "html_remove_navigation_elements": false,
        "html_remove_fixed_elements": false,
        "guess_xlsx_sheet_name": false,
        "page_separator": null,
        "bounding_box": "",
        "bbox_top": null,
        "bbox_ri

## Check Batch Status

In [47]:
batch_id = BATCH_ID

response = requests.get(LLAMA_CLOUD_URL + "/api/v1/beta/batches/" + batch_id, headers=HEADERS)
response.raise_for_status()  # fail early if the batch cannot be fetched
pretty(response.json())

{
    "batch": {
        "tool": "parse",
        "tool_data": {
            "languages": [
                "en"
            ],
            "parsing_instruction": "",
            "disable_ocr": false,
            "annotate_links": false,
            "adaptive_long_table": false,
            "compact_markdown_table": false,
            "disable_reconstruction": false,
            "disable_image_extraction": false,
            "invalidate_cache": false,
            "output_pdf_of_document": false,
            "do_not_cache": false,
            "fast_mode": false,
            "skip_diagonal_text": false,
            "preserve_layout_alignment_across_pages": false,
            "gpt4o_mode": false,
            "gpt4o_api_key": "",
            "do_not_unroll_columns": false,
            "extract_layout": false,
            "html_make_all_elements_visible": false,
            "html_remove_navigation_elements": false,
            "html_remove_fixed_elements": false,
            "guess_xlsx_she

In [48]:
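BATCH_ITEMS = response.json()["batch_items"]

# Tally batch items by the status of their parse task.
# Note: the match statement requires Python 3.10 or newer.
success_count = 0
error_count = 0
pending_count = 0
no_task_count = 0
for item in BATCH_ITEMS:
    # Items with no task attached are counted separately
    if item["task"] is None:
        no_task_count += 1
        continue

    match item["task"]["status"]:
        case "SUCCESS":
            success_count += 1
        case "ERROR":
            error_count += 1
        case "PENDING":
            pending_count += 1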

print("no task count:", no_task_count)
print("error count:", error_count)
print("pending count:", pending_count)
print("success count:", success_count)

no task count: 0
error count: 0
pending count: 0
success count: 1
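For larger batches the status check has to be repeated until nothing is pending. The following is a minimal polling sketch. The function names and the `NO_TASK` label are hypothetical, and it assumes an item is still in flight while it has no task or its status is `PENDING`; any other status values the API may return are simply counted as finished. The fetch step is passed in as a callable so the loop can be tried without network access.

```python
import time
from collections import Counter


def summarize_batch_items(batch_items):
    """Count items by task status; items without a task are counted as 'NO_TASK'."""
    counts = Counter()
    for item in batch_items:
        task = item.get("task")
        counts["NO_TASK" if task is None else task["status"]] += 1
    return counts


def is_finished(counts):
    # Assumption: an empty batch, a missing task, or PENDING all mean "not done yet".
    return sum(counts.values()) > 0 and counts["NO_TASK"] == 0 and counts["PENDING"] == 0


def wait_for_batch(fetch_items, poll_seconds=10.0, timeout_seconds=600.0, sleep=time.sleep):
    """Call fetch_items() until every item is finished, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        counts = summarize_batch_items(fetch_items())
        if is_finished(counts):
            return counts
        if time.monotonic() >= deadline:
            raise TimeoutError(f"batch still running after {timeout_seconds}s: {dict(counts)}")
        sleep(poll_seconds)
```

In this notebook, `fetch_items` could be `lambda: requests.get(LLAMA_CLOUD_URL + "/api/v1/beta/batches/" + BATCH_ID, headers=HEADERS).json()["batch_items"]`.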


## Get Parse Results
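
Results are fetched per item: each batch item whose task succeeded carries a `data_path`, and the JSON served at that path contains a `pages` list with a `page` number and a `text` field per page (both visible in the cell output in this section). The helpers below are a sketch for handling more than the first item; their names are hypothetical, and the demo data is made up for illustration rather than taken from the API.

```python
def successful_data_paths(batch_items):
    """Return the data_path of every item whose task has status SUCCESS."""
    paths = []
    for item in batch_items:
        task = item.get("task")
        if task is not None and task.get("status") == "SUCCESS":
            paths.append(task["data_path"])
    return paths


def pages_to_text(result, separator="\n\n"):
    """Join the text of all pages of one parse result, ordered by page number."""
    pages = sorted(result["pages"], key=lambda page: page["page"])
    return separator.join(page["text"] for page in pages)


# Made-up items and result, shaped like the responses shown in this notebook.
demo_items = [
    {"task": {"status": "SUCCESS", "data_path": "/demo/result/1"}},
    {"task": {"status": "ERROR"}},
    {"task": None},
]
demo_result = {"pages": [{"page": 2, "text": "second"}, {"page": 1, "text": "first"}]}

print(successful_data_paths(demo_items))  # prints ['/demo/result/1']
print(pages_to_text(demo_result, separator=" | "))  # prints first | second
```

Each returned path can be fetched with `requests.get(LLAMA_CLOUD_URL + path, headers=HEADERS).json()` and passed to `pages_to_text`.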

In [46]:

items = BATCH_ITEMS

item = items[0]
result = item["task"]["data_path"]

pretty(requests.get(LLAMA_CLOUD_URL+result, headers=HEADERS).json())

{
    "pages": [
        {
            "page": 1,
            "text": "THE CENTRE FOR HUM ANITARIAN DATA\nS\u00c9RIE DE NOTES D\u2019ORIENTATION\nL A RESPONSABILIT\u00c9 DES DONN\u00c9ES DANS L\u2019AC TION HUMANITAIRE\nL A RESPONSABILIT\u00c9 DES DONN\u00c9ES DANS LES\nPARTENARIATS PUBLIC-PRIV\u00c9\n   POINTS CL\u00c9S :\n    \u2022  Les organisations humanitaires et le secteur priv\u00e9 collaborent r\u00e9guli\u00e8rement sur des initiatives\n       li\u00e9es aux technologies de l\u2019information et des communications (TIC) et li\u00e9es aux donn\u00e9es. Les types\n       les plus courants de tels partenariats public-priv\u00e9 (PPP) dans ce domaine comprennent (i) des\n       contributions financi\u00e8res, (ii) la fourniture de technologies, (iii) un soutien technique en nature, (iv)\n       le d\u00e9veloppement technologique conjoint et (v) le partage de donn\u00e9es et la collaboration.\n    \u2022  La Responsabilit\u00e9 des donn\u00e9es est la gestion s\u00e9curis\u00e9e,