# Document Splitting with LlamaCloud

This notebook demonstrates how to use the LlamaCloud **Split** API to automatically segment a concatenated PDF into logical document sections based on content categories.

## Use Case

When dealing with large PDFs that contain multiple distinct documents or sections (e.g., a bundle of research papers, a collection of reports), you often need to split them into individual segments. The Split API uses AI to:

1. Analyze each page's content
2. Classify pages into user-defined categories
3. Group consecutive pages of the same category into segments
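
The grouping in step 3 can be pictured with a small, purely local sketch. This is not the Split API's implementation: the function name `group_pages` and its input (one category label per page, pages numbered from 1) are assumptions made for illustration.

```python
from itertools import groupby


def group_pages(page_categories):
    """Group consecutive pages that share a category into segments.

    `page_categories` is a hypothetical list with one category label per
    page, in page order. Pages are numbered from 1.
    """
    segments = []
    page_number = 1
    for category, run in groupby(page_categories):
        run_length = len(list(run))
        pages = list(range(page_number, page_number + run_length))
        segments.append({"category": category, "pages": pages})
        page_number += run_length
    return segments


# Two essay pages followed by three research-paper pages
labels = ["essay", "essay", "research_paper", "research_paper", "research_paper"]
for segment in group_pages(labels):
    print(segment)
```

This naive grouping would merge two adjacent documents of the same category into one segment. The results later in this notebook show the API returning two separate `research_paper` segments, so the service evidently does more than label-based grouping.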

## Example Document

We'll use a PDF containing three concatenated documents:
- **Alan Turing's essay** "Intelligent Machinery, A Heretical Theory" (an essay)
- **ImageNet paper** (a research paper)
- **"Attention Is All You Need"** paper (a research paper)

We'll split this into segments categorized as either `essay` or `research_paper`.


## Setup


In [None]:
# Install required packages
%pip install llama-cloud python-dotenv requests

Note: you may need to restart the kernel to use updated packages.


In [None]:
import os
import time
import requests
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Configuration
LLAMA_CLOUD_API_KEY = os.environ.get("LLAMA_CLOUD_API_KEY", "llx-...")
BASE_URL = os.environ.get("LLAMA_CLOUD_BASE_URL", "https://api.cloud.llamaindex.ai")
PROJECT_ID = os.environ.get("LLAMA_CLOUD_PROJECT_ID", None)

# Headers for API requests
headers = {
    "Authorization": f"Bearer {LLAMA_CLOUD_API_KEY}",
    "Content-Type": "application/json",
}

print(f"✅ API configured with base URL: {BASE_URL}")
print(f"✅ Project ID: {PROJECT_ID or 'using default project'}")

✅ API configured with base URL: https://api.cloud.llamaindex.ai
✅ Project ID: using default project
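
Because the configuration falls back to the placeholder `llx-...` when `LLAMA_CLOUD_API_KEY` is unset, a quick local check can catch a missing key before any request is sent. The helper name `require_api_key` is hypothetical, and treating a value that ends in `...` as a placeholder is an assumption of this sketch.

```python
def require_api_key(api_key):
    """Return the key unchanged, or raise if it is empty or a placeholder."""
    if not api_key or api_key.endswith("..."):
        raise ValueError(
            "LLAMA_CLOUD_API_KEY is not set; add it to your environment or .env file"
        )
    return api_key


# Example: a placeholder value is rejected before any network call
try:
    require_api_key("llx-...")
except ValueError as exc:
    print(f"Configuration problem: {exc}")
```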


## Step 1: Upload the PDF File

First, we'll upload our concatenated PDF to LlamaCloud using the Files API. This can be done using the `llama-cloud` SDK.


In [None]:
from llama_cloud.client import LlamaCloud

# Initialize the client
client = LlamaCloud(token=LLAMA_CLOUD_API_KEY, base_url=BASE_URL)

# Path to the PDF file
pdf_path = "./data/turing+imagenet+attention.pdf"

# Upload the file
print(f"📤 Uploading {pdf_path}...")

with open(pdf_path, "rb") as f:
    uploaded_file = client.files.upload_file(upload_file=f, project_id=PROJECT_ID)

file_id = uploaded_file.id
print("✅ File uploaded successfully!")
print(f"   File name: {uploaded_file.name}")

📤 Uploading ./data/turing+imagenet+attention.pdf...
✅ File uploaded successfully!
   File name: turing+imagenet+attention.pdf


## Step 2: Create a Split Job

Now we'll create a split job using the Split API. Since the Split API is in beta and not yet available in the SDK, we'll use raw HTTP requests.

We define two categories:
- **essay**: For philosophical or reflective writing
- **research_paper**: For formal academic documents with methodology and citations


In [None]:
# Define the split job request
split_request = {
    "document_input": {
        "type": "file_id",  # only file_id is supported for now
        "value": file_id,
    },
    "categories": [
        {
            "name": "essay",
            "description": "A philosophical or reflective piece of writing that presents personal viewpoints, arguments, or thoughts on a topic without strict formal structure",
        },
        {
            "name": "research_paper",
            "description": "A formal academic document presenting original research, methodology, experiments, results, and conclusions with citations and references",
        },
    ],
}

# Create the split job
print("🔄 Creating split job...")
response = requests.post(
    f"{BASE_URL}/api/v1/beta/split/jobs",
    params={"project_id": PROJECT_ID},
    headers=headers,
    json=split_request,
)
response.raise_for_status()

split_job = response.json()
job_id = split_job["id"]

print("✅ Split job created!")
print(f"   Job ID: {job_id}")
print(f"   Status: {split_job['status']}")
print(f"   Categories: {[c['name'] for c in split_job['categories']]}")

🔄 Creating split job...
✅ Split job created!
   Job ID: spl-zsssb632a742aikliu96pqkb56t5
   Status: pending
   Categories: ['essay', 'research_paper']


## Step 3: Poll for Job Completion

The split job runs asynchronously. We'll poll the job status until it completes.


In [None]:
def poll_split_job(job_id: str, max_wait_seconds: int = 180, poll_interval: int = 5):
    """
    Poll a split job until it reaches a terminal state.

    Args:
        job_id: The split job ID
        max_wait_seconds: Maximum time to wait for completion
        poll_interval: Seconds between poll attempts

    Returns:
        The completed job response
    """
    start_time = time.time()

    while (time.time() - start_time) < max_wait_seconds:
        response = requests.get(
            f"{BASE_URL}/api/v1/beta/split/jobs/{job_id}",
            params={"project_id": PROJECT_ID},
            headers=headers,
        )
        response.raise_for_status()
        job = response.json()

        status = job["status"]
        elapsed = int(time.time() - start_time)
        print(f"   Status: {status} (elapsed: {elapsed}s)")

        if status in ["completed", "failed"]:
            return job

        time.sleep(poll_interval)

    raise TimeoutError(f"Job did not complete within {max_wait_seconds} seconds")


print("⏳ Waiting for split job to complete...")
completed_job = poll_split_job(job_id)

if completed_job["status"] == "completed":
    print("\n✅ Split job completed successfully!")
else:
    print(
        f"\n❌ Split job failed: {completed_job.get('error_message', 'Unknown error')}"
    )

⏳ Waiting for split job to complete...
   Status: processing (elapsed: 0s)
   Status: processing (elapsed: 5s)
   Status: processing (elapsed: 11s)
   Status: completed (elapsed: 16s)

✅ Split job completed successfully!
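
For longer documents, a fixed five-second interval may poll more often than necessary. The sketch below is one possible variation that doubles the delay between attempts. `poll_with_backoff` and its `fetch_job` parameter are hypothetical names, and the terminal states are the two checked by `poll_split_job` above. Passing the fetch function in keeps the loop independent of the HTTP layer, which also makes it easy to try out with canned responses.

```python
import time


def poll_with_backoff(fetch_job, max_wait_seconds=180, initial_delay=1.0,
                      max_delay=15.0, sleep=time.sleep):
    """Call `fetch_job()` until the job is terminal, doubling the delay each time.

    `fetch_job` is any zero-argument callable that returns the job as a dict
    with a "status" key.
    """
    deadline = time.monotonic() + max_wait_seconds
    delay = initial_delay
    while True:
        job = fetch_job()
        if job["status"] in ("completed", "failed"):
            return job
        # Stop if the next sleep would run past the deadline
        if time.monotonic() + delay > deadline:
            raise TimeoutError(
                f"Job did not complete within {max_wait_seconds} seconds"
            )
        sleep(delay)
        delay = min(delay * 2, max_delay)


# Demonstration with canned responses instead of real HTTP calls
responses = iter(
    [{"status": "pending"}, {"status": "processing"}, {"status": "completed"}]
)
final_job = poll_with_backoff(lambda: next(responses), sleep=lambda seconds: None)
print(final_job["status"])
```

In this notebook, `fetch_job` could wrap the same `requests.get` call that `poll_split_job` makes.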


## Step 4: Analyze the Results

Let's examine the split results to see how the document was segmented.


In [None]:
# Get the segments from the result (guard against a missing or null "result")
segments = (completed_job.get("result") or {}).get("segments", [])

print("📊 Split Results Summary")
print("=" * 50)
print(f"Total segments found: {len(segments)}")
print()

# Count by category
category_counts = {}
for segment in segments:
    cat = segment["category"]
    category_counts[cat] = category_counts.get(cat, 0) + 1

print("Segments by category:")
for cat, count in category_counts.items():
    print(f"   • {cat}: {count} segment(s)")

📊 Split Results Summary
Total segments found: 3

Segments by category:
   • essay: 1 segment(s)
   • research_paper: 2 segment(s)


In [None]:
# Display detailed segment information
print("\n📄 Segment Details")
print("=" * 50)

for i, segment in enumerate(segments, 1):
    category = segment["category"]
    pages = segment["pages"]
    confidence = segment["confidence_category"]

    # Format page range
    if len(pages) == 1:
        page_range = f"Page {pages[0]}"
    else:
        page_range = f"Pages {min(pages)}-{max(pages)}"

    print(f"\nSegment {i}:")
    print(f"   Category: {category}")
    print(f"   {page_range} ({len(pages)} page{'s' if len(pages) > 1 else ''})")
    print(f"   Confidence: {confidence}")


📄 Segment Details

Segment 1:
   Category: essay
   Pages 1-4 (4 pages)
   Confidence: high

Segment 2:
   Category: research_paper
   Pages 5-13 (9 pages)
   Confidence: high

Segment 3:
   Category: research_paper
   Pages 14-24 (11 pages)
   Confidence: high
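
The display code above prints `min(pages)` to `max(pages)`, which is accurate only when a segment's pages are contiguous. If that cannot be guaranteed, a small formatter can collapse any list of page numbers into explicit ranges. `format_page_ranges` is a hypothetical helper, and it assumes `pages` is a list of integers.

```python
def format_page_ranges(pages):
    """Collapse page numbers into a string such as '1-3, 5'."""
    ranges = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1][1] + 1:
            ranges[-1][1] = page  # extend the current run
        else:
            ranges.append([page, page])  # start a new run
    return ", ".join(
        str(start) if start == end else f"{start}-{end}" for start, end in ranges
    )


print(format_page_ranges([1, 2, 3, 4]))
print(format_page_ranges([5, 6, 9]))
```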


## Expected Results

Based on our test document, we expect:
- **1 essay segment**: Alan Turing's "Intelligent Machinery, A Heretical Theory"
- **2 research paper segments**: ImageNet paper and "Attention Is All You Need" paper

The pages should be grouped consecutively, with no overlap between segments.


In [None]:
# Verify no page overlap
all_pages = []
for segment in segments:
    all_pages.extend(segment["pages"])

unique_pages = set(all_pages)

print("\n✅ Validation")
print("=" * 50)
print(f"Total pages assigned: {len(all_pages)}")
print(f"Unique pages: {len(unique_pages)}")

if len(all_pages) == len(unique_pages):
    print("✅ No page overlap detected - each page belongs to exactly one segment")
else:
    print(
        f"⚠️  Page overlap detected - {len(all_pages) - len(unique_pages)} duplicate assignments"
    )


✅ Validation
Total pages assigned: 24
Unique pages: 24
✅ No page overlap detected - each page belongs to exactly one segment
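
The check above catches duplicate assignments but not pages that were left out. A complementary sketch reports both, assuming pages are numbered from 1 and the document's total page count is known. `check_coverage` is a hypothetical helper, and the sample data is synthetic.

```python
from collections import Counter


def check_coverage(segments, total_pages):
    """Return (missing, duplicated) page numbers for a list of segments.

    Assumes each segment is a dict with a "pages" list of 1-based integers.
    """
    counts = Counter(page for segment in segments for page in segment["pages"])
    missing = [page for page in range(1, total_pages + 1) if page not in counts]
    duplicated = sorted(page for page, count in counts.items() if count > 1)
    return missing, duplicated


# Small synthetic example: page 3 is unassigned and page 2 appears twice
sample_segments = [{"pages": [1, 2]}, {"pages": [2, 4]}]
missing, duplicated = check_coverage(sample_segments, total_pages=4)
print(f"Missing: {missing}, duplicated: {duplicated}")
```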


## Using `allow_uncategorized` Strategy

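You can also use the `allow_uncategorized` splitting strategy, which captures pages that don't match any defined category. The cell below only builds the request body; it does not submit a job. To run it, post the body to the same endpoint used in Step 2.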


In [None]:
# Example with allow_uncategorized strategy
split_request_uncategorized = {
    "document_input": {"type": "file_id", "value": file_id},
    "categories": [
        {
            "name": "essay",
            "description": "A philosophical or reflective piece of writing that presents personal viewpoints, arguments, or thoughts on a topic",
        }
        # Note: We only define 'essay' category
        # Research papers will be classified as 'uncategorized'
    ],
    "splitting_strategy": {"allow_uncategorized": True},
}

print("📝 With allow_uncategorized=True and only 'essay' category defined,")
print("   pages that don't match 'essay' will be grouped as 'uncategorized'.")

📝 With allow_uncategorized=True and only 'essay' category defined,
   pages that don't match 'essay' will be grouped as 'uncategorized'.
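
Both request bodies in this notebook share the same shape, so one small helper can produce either. `build_split_request` is a hypothetical name. The field names are the ones used in the cells above, and no other fields are assumed.

```python
def build_split_request(file_id, categories, allow_uncategorized=False):
    """Build a split job request body from a {name: description} mapping."""
    request = {
        "document_input": {"type": "file_id", "value": file_id},
        "categories": [
            {"name": name, "description": description}
            for name, description in categories.items()
        ],
    }
    # Only include the strategy when it is explicitly enabled, matching the
    # first request in this notebook, which omits the field entirely.
    if allow_uncategorized:
        request["splitting_strategy"] = {"allow_uncategorized": True}
    return request


example = build_split_request(
    "file-id-placeholder",  # hypothetical ID; use the uploaded file's ID
    {"essay": "A philosophical or reflective piece of writing"},
    allow_uncategorized=True,
)
print(example)
```

The resulting dictionary can be passed as `json=` to the same `requests.post` call used in Step 2.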


## Conclusion

The LlamaCloud Split API segments concatenated documents automatically by classifying pages into content categories that you define. This is useful for:

- **Document processing pipelines**: Automatically separate bundled documents before further processing
- **Content organization**: Categorize and organize mixed document collections
- **Information extraction**: Identify different document types within a single file

### Key Features

- **AI-powered classification**: Uses LLMs to understand page content and assign categories
- **Flexible categories**: Define any categories relevant to your use case
- **Confidence scoring**: Each segment includes a `confidence_category` value (for example, `high`)
- **Page-level granularity**: Results include exact page numbers for each segment

### API Reference

- **Create Split Job**: `POST /api/v1/beta/split/jobs`
- **Get Split Job**: `GET /api/v1/beta/split/jobs/{job_id}`
- **List Split Jobs**: `GET /api/v1/beta/split/jobs`
