# Document Classification + Extraction Workflow with LlamaCloud + LlamaIndex Workflows

<a href="https://colab.research.google.com/github/run-llama/llama_cloud_services/blob/main/examples/misc/parse_classify_extract_workflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows a multi-step agentic document workflow that uses the **parsing**, **classification**, and **extraction** modules in LlamaCloud, orchestrated through **LlamaIndex Workflows**. The workflow takes a complex input document, parses it into clean markdown, classifies it by subtype, and extracts data according to the schema specified for that subtype. This lets you process documents of different types within a single workflow instead of manually separating them beforehand.

This notebook uses the following modules:
1. **Parse (LlamaParse)** - Extract and convert documents to markdown
2. **Classify** - Categorize documents based on their content
3. **Extract (LlamaExtract)** - Extract structured data using the markdown as input via SourceText
4. **LlamaIndex Workflows** - Event-driven orchestration of the parse, classify and extract steps

The workflow is implemented as a proper LlamaIndex Workflow with separate steps for parsing, classification, and extraction, connected by typed events. This provides modularity, observability, and type safety.

## Setup and Installation

In [None]:
# Install required packages
%pip install llama-cloud-services
%pip install python-dotenv

In [None]:
import os
import nest_asyncio
from dotenv import load_dotenv

# Load environment variables
load_dotenv()
nest_asyncio.apply()

# Set up API key
# os.environ["LLAMA_CLOUD_API_KEY"] = ""  # set your key here

# Set up base URL
# os.environ["LLAMA_CLOUD_BASE_URL"] = "https://api.cloud.eu.llamaindex.ai/"  # update if necessary

print("✅ API key configured")

✅ API key configured
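
The cell above prints the confirmation whether or not a key was actually loaded. As a minimal sketch, a small guard can fail early with a clear message instead. The helper name `require_env` is hypothetical and not part of any LlamaCloud package.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper, not a LlamaCloud API.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or set os.environ[{name!r}]."
        )
    return value


# Usage in the notebook, once the key is configured:
# api_key = require_env("LLAMA_CLOUD_API_KEY")
```

Calling the guard right after `load_dotenv()` surfaces a missing key before any parsing job is submitted.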


## Download Sample Documents

Let's download some sample documents to work with:

In [None]:
import requests

# Create directory for sample documents
os.makedirs("sample_docs", exist_ok=True)

# Download sample documents
docs_to_download = {
    "financial_report.pdf": "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/10k/uber_2021.pdf",
    "technical_spec.pdf": "https://www.ti.com/lit/ds/symlink/lm317.pdf",
}

for filename, url in docs_to_download.items():
    filepath = f"sample_docs/{filename}"
    if not os.path.exists(filepath):
        print(f"Downloading {filename}...")
        response = requests.get(url)
        if response.status_code == 200:
            with open(filepath, "wb") as f:
                f.write(response.content)
            print(f"✅ Downloaded {filename}")
        else:
            print(f"❌ Failed to download {filename}")
    else:
        print(f"📁 {filename} already exists")

print("\n📂 Sample documents ready!")

Downloading financial_report.pdf...
✅ Downloaded financial_report.pdf
📁 technical_spec.pdf already exists

📂 Sample documents ready!
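
A `200` status code does not guarantee that the response body is a PDF; a server can return an HTML error page with the same status. The sketch below is a quick heuristic check on the downloaded file, using the hypothetical helper name `looks_like_pdf`. It only inspects the first five bytes, so treat it as a sanity check rather than full validation.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file exists and starts with the PDF marker `%PDF-`.

    Heuristic only: it does not verify that the rest of the file is valid.
    """
    file = Path(path)
    if not file.is_file():
        return False
    with file.open("rb") as f:
        return f.read(5) == b"%PDF-"


# Usage in the notebook:
# for filename in docs_to_download:
#     print(filename, looks_like_pdf(f"sample_docs/{filename}"))
```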


## Phase 1: Document Parsing

First, let's parse our documents using LlamaParse to extract clean markdown content.

In [None]:
from llama_cloud_services.parse.base import LlamaParse
from llama_cloud_services.parse.utils import ResultType

# Initialize the parser
parser = LlamaParse(
    result_type=ResultType.MD,  # Get markdown output
    verbose=True,
    language="en",
    # Premium mode for better accuracy
    premium_mode=True,
    # Extract tables as HTML for better structure
    output_tables_as_HTML=True,
)

print("🔄 Parsing documents...")

# Parse the financial report
financial_result = await parser.aparse("sample_docs/financial_report.pdf")
print(f"✅ Parsed financial report (Job ID: {financial_result.job_id})")

# Parse the technical specification
technical_result = await parser.aparse("sample_docs/technical_spec.pdf")
print(f"✅ Parsed technical spec (Job ID: {technical_result.job_id})")

print("\n📄 Parsing complete!")

🔄 Parsing documents...
Started parsing the file under job_id 530c187a-bd2d-4eea-b38d-9e5738eab465
.✅ Parsed financial report (Job ID: 530c187a-bd2d-4eea-b38d-9e5738eab465)
Started parsing the file under job_id a6e27710-776b-4445-8b94-8d75959ff5db
✅ Parsed technical spec (Job ID: a6e27710-776b-4445-8b94-8d75959ff5db)

📄 Parsing complete!


### Extract Markdown Content

Now let's get the markdown content from our parsed documents:

In [None]:
# Get markdown content from parsed documents
financial_markdown = await financial_result.aget_markdown()
technical_markdown = await technical_result.aget_markdown()

print("📋 Financial Report Markdown (first 500 chars):")
print(financial_markdown[:500])
print("...\n")

print("📋 Technical Spec Markdown (first 500 chars):")
print(technical_markdown[:500])
print("...\n")

print(f"📏 Financial report markdown length: {len(financial_markdown)} characters")
print(f"📏 Technical spec markdown length: {len(technical_markdown)} characters")

document_texts = [financial_markdown, technical_markdown]

📋 Financial Report Markdown (first 500 chars):


# UNITED STATES
# SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549

## FORM 10-K

(Mark One)

☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year ended December 31, 2021
OR
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from_____ to _____
Commission File Number: 001-38902

# UBER TECHNOLOGIES, INC.
(Exact name of registrant as specified in its charter)

Delaware
...

📋 Technical Spec Markdown (first 500 chars):


LM317
SLVS044Z – SEPTEMBER 1997 – REVISED APRIL 2025

# LM317 3-Pin Adjustable Regulator

## 1 Features

- Output voltage range:
  – Adjustable: 1.25V to 37V
- Output current: 1.5A
- Line regulation: 0.01%/V (typ)
- Load regulation: 0.1% (typ)
- Internal short-circuit current limiting
- Thermal overload protection
- Output safe-area compensation (new chip)
- PSRR: 80dB at 120Hz for CADJ = 10μF (ne

## Phase 2: Document Classification

Next, let's classify our documents based on their content using the ClassifyClient.

In [None]:
from llama_cloud_services.beta.classifier.client import ClassifyClient
from llama_cloud.types import ClassifierRule

# Initialize the classify client
api_key = os.environ["LLAMA_CLOUD_API_KEY"]
classify_client = ClassifyClient.from_api_key(api_key)

print("🏷️  Setting up document classification...")

# Define classification rules
classification_rules = [
    ClassifierRule(
        type="financial_document",
        description="Documents containing financial data, revenue, expenses, SEC filings, or financial statements",
    ),
    ClassifierRule(
        type="technical_specification",
        description="Technical datasheets, component specifications, engineering documents, or technical manuals",
    ),
    ClassifierRule(
        type="general_document",
        description="General business documents, contracts, or other unspecified document types",
    ),
]

print(f"📝 Created {len(classification_rules)} classification rules")

🏷️  Setting up document classification...
📝 Created 3 classification rules


### Try Classification Independently

Let's test the classification on one of our parsed documents to see how it works:


In [None]:
import tempfile
from pathlib import Path

# Let's classify the financial document
print("🔍 Classifying financial document...")
print(f"   Document length: {len(financial_markdown):,} characters\n")

# Write to temp file for classification
with tempfile.NamedTemporaryFile(
    mode="w", suffix=".md", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(financial_markdown)
    temp_financial_path = Path(tmp.name)

# Classify the document
financial_classification = await classify_client.aclassify_file_path(
    rules=classification_rules, file_input_path=str(temp_financial_path)
)

doc_type = financial_classification.items[0].result.type
confidence = financial_classification.items[0].result.confidence
reasoning = financial_classification.items[0].result.reasoning

print("✅ Classification Result:")
print(f"   Type: {doc_type}")
print(f"   Confidence: {confidence:.2%}")
print(
    f"   Reasoning: {reasoning[:200]}..."
    if reasoning and len(reasoning) > 200
    else f"   Reasoning: {reasoning}"
)

print("\n" + "=" * 70)

🔍 Classifying financial document...
   Document length: 1,338,499 characters

✅ Classification Result:
   Type: financial_document
   Confidence: 100.00%
   Reasoning: This document is a Form 10-K, which is an annual report required by the U.S. Securities and Exchange Commission (SEC) for publicly traded companies. It contains financial data, information about the c...
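
The classification cell creates its temporary file with `delete=False` and never removes it, so each run leaves a markdown file behind. One way to handle this, sketched here with a hypothetical `markdown_tempfile` context manager, is to delete the file once classification has finished.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def markdown_tempfile(text: str) -> Iterator[Path]:
    """Write `text` to a temporary .md file and delete the file on exit.

    `markdown_tempfile` is a hypothetical helper, not a LlamaCloud API.
    """
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".md", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(text)
        path = Path(tmp.name)
    try:
        yield path
    finally:
        # Remove the file even if the body of the `with` block raised.
        path.unlink(missing_ok=True)


# Usage in the notebook:
# with markdown_tempfile(financial_markdown) as path:
#     financial_classification = await classify_client.aclassify_file_path(
#         rules=classification_rules, file_input_path=str(path)
#     )
```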



## Phase 3: Structured Data Extraction using SourceText

Now comes the key part: using the markdown content as input for structured data extraction via `SourceText`.

In [None]:
from llama_cloud_services.extract.extract import LlamaExtract, SourceText
from pydantic import BaseModel, Field
from typing import List, Optional

# Initialize LlamaExtract
llama_extract = LlamaExtract(api_key=api_key, verbose=True)

print("⚙️  LlamaExtract initialized")

⚙️  LlamaExtract initialized


### Define Extraction Schemas

Let's define different schemas for different document types:

In [None]:
# Schema for financial documents
class FinancialMetrics(BaseModel):
    company_name: str = Field(description="Name of the company")
    document_type: str = Field(
        description="Type of financial document (10-K, 10-Q, annual report, etc.)"
    )
    fiscal_year: int = Field(description="Fiscal year of the report")
    revenue_2021: str = Field(description="Total revenue in 2021")
    net_income_2021: str = Field(description="Net income in 2021")
    key_business_segments: List[str] = Field(
        default=[], description="Main business segments or divisions"
    )
    risk_factors: List[str] = Field(
        default=[], description="Key risk factors mentioned"
    )


# Schema for technical specifications
class VoltageRange(BaseModel):
    min_voltage: Optional[float] = Field(description="Minimum voltage")
    max_voltage: Optional[float] = Field(description="Maximum voltage")
    unit: str = Field(default="V", description="Voltage unit")


class TechnicalSpec(BaseModel):
    component_name: str = Field(description="Name of the technical component")
    manufacturer: Optional[str] = Field(description="Manufacturer name")
    part_number: Optional[str] = Field(description="Part or model number")
    description: str = Field(description="Brief description of the component")
    operating_voltage: Optional[VoltageRange] = Field(
        description="Operating voltage range"
    )
    maximum_current: Optional[float] = Field(
        description="Maximum current rating in amperes"
    )
    key_features: List[str] = Field(
        default=[], description="Key features and capabilities"
    )
    applications: List[str] = Field(default=[], description="Typical applications")


print("📋 Extraction schemas defined")

📋 Extraction schemas defined


## Building the Complete Workflow

Now that we've seen parsing and classification on their own and defined the extraction schemas, let's build a complete three-step workflow (Parse → Classify → Extract) using LlamaIndex Workflows. We define the workflow structure here and run it on both sample documents further below.

### Install Workflows Package

First, let's install the LlamaIndex workflows package:


In [None]:
%pip install llama-index-workflows llama-index-utils-workflow

## Define the Workflow

Let's restructure the document processing into a proper LlamaIndex Workflow with separate classification and extraction steps:


In [None]:
import tempfile
from pathlib import Path
from llama_cloud import ExtractConfig
from workflows import Workflow, step, Context
from workflows.events import Event, StartEvent, StopEvent


# Define workflow events
class ParseEvent(Event):
    """Event emitted after parsing"""

    file_path: str
    markdown_content: str
    job_id: str


class ClassifyEvent(Event):
    """Event emitted after classification"""

    markdown_content: str
    temp_path: str
    doc_type: str
    confidence: float


class ExtractEvent(Event):
    """Event emitted after extraction"""

    doc_type: str
    confidence: float
    extracted_data: dict
    markdown_length: int
    temp_path: str
    markdown_sample: str


class DocumentWorkflow(Workflow):
    """
    Complete document processing workflow: Parse → Classify → Extract
    """

    def __init__(
        self,
        parser,
        classify_client,
        classification_rules,
        llama_extract,
        financial_schema,
        technical_schema,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.parser = parser
        self.classify_client = classify_client
        self.classification_rules = classification_rules
        self.llama_extract = llama_extract
        self.financial_schema = financial_schema
        self.technical_schema = technical_schema

    @step
    async def parse_document(self, ctx: Context, ev: StartEvent) -> ParseEvent:
        """
        Step 1: Parse the document to extract markdown
        """
        file_path = ev.file_path
        print(f"📄 Step 1: Parsing document: {file_path}...")

        # Parse the document
        parse_result = await self.parser.aparse(file_path)
        markdown_content = await parse_result.aget_markdown()
        job_id = parse_result.job_id

        print(f"   ✅ Parsed successfully (Job ID: {job_id})")
        print(f"   📝 Extracted {len(markdown_content):,} characters")

        # Write event to stream for monitoring
        parse_event = ParseEvent(
            file_path=file_path,
            markdown_content=markdown_content,
            job_id=job_id,
        )
        ctx.write_event_to_stream(parse_event)

        return parse_event

    @step
    async def classify_document(self, ctx: Context, ev: ParseEvent) -> ClassifyEvent:
        """
        Step 2: Classify the document based on its content
        """
        markdown_content = ev.markdown_content
        print("🏷️  Step 2: Classifying document...")

        # Write markdown to temp file for classification
        with tempfile.NamedTemporaryFile(
            mode="w", suffix=".md", delete=False, encoding="utf-8"
        ) as tmp:
            tmp.write(markdown_content)
            temp_path = Path(tmp.name)

        # Classify the document
        classification = await self.classify_client.aclassify_file_path(
            rules=self.classification_rules, file_input_path=str(temp_path)
        )
        doc_type = classification.items[0].result.type
        confidence = classification.items[0].result.confidence

        print(f"   ✅ Classified as: {doc_type} (confidence: {confidence:.2f})")

        # Write event to stream for monitoring
        classify_event = ClassifyEvent(
            markdown_content=markdown_content,
            temp_path=str(temp_path),
            doc_type=doc_type,
            confidence=confidence,
        )
        ctx.write_event_to_stream(classify_event)

        return classify_event

    @step
    async def extract_data(self, ctx: Context, ev: ClassifyEvent) -> ExtractEvent:
        """
        Step 3: Extract structured data based on classification
        """
        print("🔍 Step 3: Extracting structured data using SourceText...")

        # Choose schema based on classification
        if "financial" in ev.doc_type.lower():
            schema = self.financial_schema
            print("   📊 Using FinancialMetrics schema")
        elif "technical" in ev.doc_type.lower():
            schema = self.technical_schema
            print("   🔧 Using TechnicalSpec schema")
        else:
            schema = self.financial_schema  # Default fallback
            print("   📊 Using default FinancialMetrics schema")

        # Create SourceText from markdown content
        source_text = SourceText(
            text_content=ev.markdown_content,
            filename=f"{os.path.basename(ev.temp_path)}_markdown.md",
        )

        # Configure extraction
        extract_config = ExtractConfig(
            extraction_mode="BALANCED",
        )

        # Perform extraction
        extraction_result = self.llama_extract.extract(
            data_schema=schema, config=extract_config, files=source_text
        )

        print("   ✅ Extraction complete!")

        # Create markdown sample
        markdown_sample = (
            ev.markdown_content[:200] + "..."
            if len(ev.markdown_content) > 200
            else ev.markdown_content
        )

        extract_event = ExtractEvent(
            doc_type=ev.doc_type,
            confidence=ev.confidence,
            extracted_data=extraction_result.data,
            markdown_length=len(ev.markdown_content),
            temp_path=ev.temp_path,
            markdown_sample=markdown_sample,
        )
        ctx.write_event_to_stream(extract_event)

        return extract_event

    @step
    async def finalize_results(self, ctx: Context, ev: ExtractEvent) -> StopEvent:
        """
        Step 4: Finalize and return results
        """
        result = {
            "file_path": ev.temp_path,
            "markdown_length": ev.markdown_length,
            "classification": ev.doc_type,
            "confidence": ev.confidence,
            "extracted_data": ev.extracted_data,
            "markdown_sample": ev.markdown_sample,
        }

        return StopEvent(result=result)


print("🔧 Workflow defined!")

🔧 Workflow defined!


### Workflow Structure

The workflow consists of four steps connected by typed events:

```
┌─────────────┐
│ StartEvent  │  (file_path)
└──────┬──────┘
       │
       ▼
┌──────────────────┐
│ parse_document   │  Step 1: Parse PDF to markdown
└──────┬───────────┘
       │
       ▼
┌─────────────┐
│  ParseEvent │  (markdown_content, job_id)
└──────┬──────┘
       │
       ▼
┌─────────────────────┐
│ classify_document   │  Step 2: Classification
└──────┬──────────────┘
       │
       ▼
┌──────────────┐
│ ClassifyEvent│  (doc_type, confidence, markdown_content)
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ extract_data │  Step 3: Extraction with schema selection
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ ExtractEvent │  (extracted_data, doc_type, confidence)
└──────┬───────┘
       │
       ▼
┌──────────────────┐
│ finalize_results │  Step 4: Format and return results
└──────┬───────────┘
       │
       ▼
┌─────────────┐
│  StopEvent  │  (final result dictionary)
└─────────────┘
```

**Key Features:**
- **Step 1 (parse_document)**: Takes a file path and parses the document into clean markdown
- **Step 2 (classify_document)**: Takes markdown content and classifies it into document types
- **Step 3 (extract_data)**: Selects appropriate schema based on classification and extracts structured data
- **Step 4 (finalize_results)**: Packages all results into final output format
- Events are written to the stream for real-time monitoring
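
The `extract_data` step picks a schema by substring matching on the classified type and falls back to the financial schema for anything else, including `general_document`. A possible alternative, shown as a sketch only, is an explicit registry keyed by the rule types, plus a confidence threshold. The function name `select_schema`, the stub classes, and the `0.5` threshold are all assumptions for illustration; in the notebook the registry would hold `FinancialMetrics` and `TechnicalSpec`.

```python
from typing import Dict, Optional


class FinancialStub:
    """Stand-in for the FinancialMetrics Pydantic model."""


class TechnicalStub:
    """Stand-in for the TechnicalSpec Pydantic model."""


def select_schema(
    doc_type: Optional[str],
    confidence: float,
    registry: Dict[str, type],
    min_confidence: float = 0.5,  # assumed threshold, tune for your data
) -> Optional[type]:
    """Return the schema registered for `doc_type`.

    Returns None when the type is unknown or the classifier's confidence
    is below `min_confidence`, so the caller can skip or flag the document
    instead of extracting with the wrong schema.
    """
    if doc_type is None or confidence < min_confidence:
        return None
    return registry.get(doc_type)


registry = {
    "financial_document": FinancialStub,
    "technical_specification": TechnicalStub,
}
```

With this layout, adding a new document type means adding one classification rule and one registry entry, without touching the step's control flow.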


## Visualize the Workflow

Let's visualize the workflow structure to see the flow of events:


In [None]:
# Initialize the workflow
workflow = DocumentWorkflow(
    parser=parser,
    classify_client=classify_client,
    classification_rules=classification_rules,
    llama_extract=llama_extract,
    financial_schema=FinancialMetrics,
    technical_schema=TechnicalSpec,
    timeout=300,
    verbose=True,
)

In [None]:
# Draw the workflow visualization
from llama_index.utils.workflow import draw_all_possible_flows

draw_all_possible_flows(
    workflow,
    filename="document_workflow.html",
)

document_workflow.html


The workflow has been visualized and saved to `document_workflow.html`. You can open this file in a browser to see the interactive workflow diagram.


The workflow visualization shows:
1. **StartEvent** → **parse_document** step
2. **ParseEvent** → **classify_document** step
3. **ClassifyEvent** → **extract_data** step  
4. **ExtractEvent** → **finalize_results** step
5. **StopEvent** (final output)

Each step is connected by typed events, allowing for clean separation of concerns and easy monitoring of the workflow execution.
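
To make the idea of routing by event type concrete, here is a toy, standard-library-only sketch. It is not how LlamaIndex Workflows is implemented; the names `Start`, `Parsed`, `Stop`, `build_routes`, and `run` are invented for this illustration. Each step declares its input event in a type annotation, and the runner looks up the next step from the type of the event it just received.

```python
import asyncio
from dataclasses import dataclass
from typing import get_type_hints


@dataclass
class Start:
    file_path: str


@dataclass
class Parsed:
    markdown: str


@dataclass
class Stop:
    result: dict


async def parse(ev: Start) -> Parsed:
    # Stand-in for the real parsing step.
    return Parsed(markdown=f"# contents of {ev.file_path}")


async def finalize(ev: Parsed) -> Stop:
    return Stop(result={"markdown_length": len(ev.markdown)})


def build_routes(*steps):
    """Map each step's annotated input event type to the step function."""
    return {get_type_hints(fn)["ev"]: fn for fn in steps}


async def run(start, routes):
    """Pass events from step to step until a Stop event is produced."""
    event = start
    while not isinstance(event, Stop):
        event = await routes[type(event)](event)
    return event.result
```

In a notebook, `await run(Start("a.pdf"), build_routes(parse, finalize))` walks the chain `Start` → `parse` → `Parsed` → `finalize` → `Stop` and returns the final dictionary.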


## Run the Workflow on Both Documents

Now let's run the workflow on both documents and monitor the events:


In [None]:
# Process both documents through the workflow
results = []

# Define the document files to process
document_files = [
    "sample_docs/financial_report.pdf",
    "sample_docs/technical_spec.pdf",
]

for i, file_path in enumerate(document_files, 1):
    print(f"\n{'=' * 70}")
    print(f"🚀 Processing Document {i}: {file_path}")
    print(f"{'=' * 70}\n")

    try:
        # Run the workflow
        handler = workflow.run(file_path=file_path)

        # Monitor events as they are emitted
        async for event in handler.stream_events():
            if isinstance(event, ParseEvent):
                print(
                    f"📄 Parse Event: Extracted {len(event.markdown_content):,} characters"
                )
            elif isinstance(event, ClassifyEvent):
                print(
                    f"📊 Classification Event: {event.doc_type} ({event.confidence:.2f})"
                )
            elif isinstance(event, ExtractEvent):
                print(
                    f"✅ Extraction Event: {len(event.extracted_data)} fields extracted"
                )

        # Get final result
        result = await handler
        results.append(result)

        print(f"\n✅ Document {i} processed successfully!")

    except Exception as e:
        print(f"❌ Error processing document {i}: {str(e)}")
        import traceback

        traceback.print_exc()

print(f"\n\n📋 Processed {len(results)} documents successfully!")


🚀 Processing Document 1: sample_docs/financial_report.pdf

Running step parse_document
📄 Step 1: Parsing document: sample_docs/financial_report.pdf...
Started parsing the file under job_id bb53c6bf-79cc-4f63-9c97-16983d59f29d
.   ✅ Parsed successfully (Job ID: bb53c6bf-79cc-4f63-9c97-16983d59f29d)
   📝 Extracted 1,338,499 characters
Step parse_document produced event ParseEvent
📄 Parse Event: Extracted 1,338,499 characters
Running step classify_document
🏷️  Step 2: Classifying document...
   ✅ Classified as: financial_document (confidence: 1.00)
Step classify_document produced event ClassifyEvent
📊 Classification Event: financial_document (1.00)
Running step extract_data
🔍 Step 3: Extracting structured data using SourceText...
   📊 Using FinancialMetrics schema
..   ✅ Extraction complete!
Step extract_data produced event ExtractEvent
Running step finalize_results
Step finalize_results produced event StopEvent
✅ Extraction Event: 7 fields extracted

✅ Document 1 processed successfully!
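
The loop above processes documents one at a time. If throughput matters, the runs could be overlapped with a bounded level of concurrency. The sketch below is generic `asyncio` code with a hypothetical `run_all` helper; it assumes that several runs of the same workflow instance may safely overlap, which you should verify for your setup.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Optional, Tuple

Outcome = Tuple[str, Optional[Any], Optional[Exception]]


async def run_all(
    run_one: Callable[[str], Awaitable[Any]],
    file_paths: List[str],
    max_concurrency: int = 2,
) -> List[Outcome]:
    """Run `run_one` on every path, at most `max_concurrency` at a time.

    Returns (path, result, error) tuples in input order; exactly one of
    result and error is set for each failed run (result is None then).
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(path: str) -> Outcome:
        async with sem:
            try:
                return path, await run_one(path), None
            except Exception as exc:  # keep going if one document fails
                return path, None, exc

    return await asyncio.gather(*(guarded(p) for p in file_paths))


# Usage in the notebook (assumes the handler returned by workflow.run is awaitable,
# as in the cell above):
# outcomes = await run_all(lambda p: workflow.run(file_path=p), document_files)
```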

## Final Results Summary


In [None]:
print("📈 COMPLETE WORKFLOW RESULTS SUMMARY")
print("=" * 70)

for i, result in enumerate(results, 1):
    print(f"\n📄 Document {i}: {os.path.basename(result['file_path'])}")
    print(
        f"   📊 Classification: {result['classification']} (confidence: {result['confidence']:.2f})"
    )
    print(f"   📝 Markdown length: {result['markdown_length']:,} characters")
    print(f"   📋 Markdown sample: {result['markdown_sample'][:100]}...")
    print(f"   🎯 Extracted fields: {len(result['extracted_data'])} fields")

    # Print all key–value pairs
    extracted = result["extracted_data"]
    for key, value in extracted.items():
        print(f"   • {key}: {value}")

print("\n✨ Workflow completed successfully!")

📈 COMPLETE WORKFLOW RESULTS SUMMARY

📄 Document 1: tmpuyxzpd3x.md
   📊 Classification: financial_document (confidence: 1.00)
   📝 Markdown length: 1,338,499 characters
   📋 Markdown sample: 

# UNITED STATES
# SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549

## FORM 10-K

(Mark O...
   🎯 Extracted fields: 7 fields
   • company_name: Uber Technologies, Inc.
   • document_type: Annual Report on Form 10-K
   • fiscal_year: 2021
   • revenue_2021: $17,455 and $21,764
   • net_income_2021: $(496) to (700)
   • key_business_segments: ['Borrower and the Restricted Subsidiaries', 'Holdings', 'Guarantors', 'Material Domestic Subsidiaries', 'Material Foreign Subsidiaries']
   • risk_factors: ['Indemnification obligations of the borrower for losses, claims, damages, liabilities, and out-of-pocket expenses incurred by agents, lenders, arrangers, and related parties in connection with the agreement or loans, except in certain cases such as gross negligence, bad faith, willful misconduct, 
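
Long fields such as `risk_factors` are hard to read when printed inline. As a small sketch, the result dictionaries can be written to a JSON file for inspection. The helper name `save_results` and the output filename are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def save_results(results: List[Dict[str, Any]], path: str) -> Path:
    """Write workflow results to a JSON file and return its path.

    `default=str` converts values that are not JSON-serializable
    (for example Path objects) to strings instead of raising.
    """
    out = Path(path)
    out.write_text(
        json.dumps(results, indent=2, ensure_ascii=False, default=str),
        encoding="utf-8",
    )
    return out


# Usage in the notebook:
# save_results(results, "workflow_results.json")
```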

## Conclusion

This notebook shows how to build an end-to-end **Parse → Classify → Extract** document workflow using LlamaCloud and LlamaIndex Workflows. It combines three core building blocks: **parsing**, **classification**, and schema-driven **extraction**.

### Main Components:

1. **LlamaParse** (`llama_cloud_services.parse.base.LlamaParse`):
   - Converts documents to clean, structured markdown
   - Preserves document structure and formatting
   - Handles various file types (PDF, DOCX, etc.)

2. **ClassifyClient** (`llama_cloud_services.beta.classifier.client.ClassifyClient`):
   - Automatically categorizes documents based on content
   - Uses customizable rules for classification
   - Provides confidence scores for classifications

3. **LlamaExtract with SourceText** (`llama_cloud_services.extract.extract.LlamaExtract`, `SourceText`):
   - Extracts structured data using custom Pydantic schemas
   - Accepts either the file directly (in which case parsing happens under the hood) or already-parsed text wrapped in a **SourceText** object (as in this example)

**Benefits of an end-to-end workflow**: Running Classify → Extract, rather than Extract alone, lets you handle documents of different types, each with its own expected schema, within the same workflow. You don't have to sort the documents beforehand and run a separate extraction on each subset.