# ⚠️ Important Notice

This notebook (and repository) is deprecated.

For the latest Python examples, please refer to the `llama-cloud-services` repository examples:
https://github.com/run-llama/llama_cloud_services/tree/main/examples

---

## Invoice Enrichment with Spend Category and Cost Center Workflow

<a href="https://colab.research.google.com/github/run-llama/llamacloud-demo/blob/main/examples/document_workflows/invoice_sku_product_catalog_matching/invoice_spend_costcentre.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates an automated workflow for enriching invoice line items with appropriate spend categories and cost centers. The system helps standardize financial reporting and automates cost allocation across departments.

![End To End Process](E2Eprocess.png)

#### The Challenge: Automating Spend Category and Cost Center Enrichment

Organizations face two key challenges when processing invoices:

1. **Spend Category Mapping**: Invoice line items come with varying descriptions that need to be mapped to standardized spend categories for consistent financial reporting. For example, "Dell Precision 5680 Workstation" and "HP ProBook Laptop" should both map to the "IT Hardware" category.

2. **Cost Center Attribution**: Each spend category needs to be mapped to an appropriate cost center based on historical patterns and organizational rules. The system uses embedding similarity to compare against historical cost center allocations. If the similarity score of the best match is at least 0.85, the historical cost center is used. Otherwise, the item is assigned to "FINANCE" so that uncertain cases are verified manually.
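
The routing rule for cost centers can be sketched in a few lines. This is a minimal illustration rather than the notebook's own code: `route_cost_center`, `REVIEW_QUEUE`, and the parameter names are hypothetical, and the `>=` comparison is chosen to match the check in the `enrich_line_items` step of this notebook.

```python
from typing import Optional

REVIEW_QUEUE = "FINANCE"  # label this notebook uses for manual review


def route_cost_center(
    matched_cost_center: Optional[str],
    similarity: Optional[float],
    threshold: float = 0.85,
) -> str:
    """Return the historical cost center for confident matches, else the review queue."""
    # No match at all, or a match without a score, always goes to review.
    if matched_cost_center is None or similarity is None:
        return REVIEW_QUEUE
    return matched_cost_center if similarity >= threshold else REVIEW_QUEUE


print(route_cost_center("IT-INFRA-001", 0.92))  # IT-INFRA-001
print(route_cost_center("IT-INFRA-001", 0.61))  # FINANCE
```

Keeping the threshold as a parameter makes it easy to tighten or relax the rule without touching the routing logic.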

#### The Documents We Work With

1. **Invoice Data**: (`invoice.pdf`)
   * Contains detailed line items for various purchases
   * Each line includes:
      * Line item ID
      * Description
      * Quantity
      * Unit price
      * Total amount
   * Contains various types of purchases (hardware, software, services)

2. **Spend Categories**: (`spend_categories.json`)
   * Contains standardized spend category definitions:
      * Category name
      * Department
      * Description
      * Keywords and patterns
   * Maps descriptions to standard expense categories

3. **Cost Center Historical Data**: (`cost_centre_historical_data.csv`)
   * Contains cost center allocation patterns:
      * Cost center IDs
      * Departments
      * Item descriptions
      * Historical allocations
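
Before building the indices, it can help to check that the two lookup files have the fields the loaders read. The sketch below is an assumption-laden example: the key and column names (`id`, `spend_category`, `department`, `description`, `Item Description`, `Cost Center`, `Department`) are the ones accessed by the index-building functions in this notebook, while the sample values and the helper names are made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample records; only the keys and columns mirror what the loaders read.
spend_categories = [
    {
        "id": 1,
        "spend_category": "IT Hardware",
        "department": "IT",
        "description": "Workstations, laptops, servers and peripherals",
    }
]

cost_center_csv = io.StringIO(
    "id,Item Description,Cost Center,Department\n"
    "1,Dell Precision workstation,IT-INFRA-001,IT\n"
)

REQUIRED_JSON_KEYS = {"id", "spend_category", "department", "description"}
REQUIRED_CSV_COLUMNS = {"id", "Item Description", "Cost Center", "Department"}


def missing_json_keys(records):
    """Return the required keys that are absent from at least one record."""
    missing = set()
    for record in records:
        missing |= REQUIRED_JSON_KEYS - set(record)
    return missing


def missing_csv_columns(handle):
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(handle)
    return REQUIRED_CSV_COLUMNS - set(reader.fieldnames or [])


print(json.dumps(spend_categories[0], indent=2))
print(missing_json_keys(spend_categories))   # empty set when all keys are present
print(missing_csv_columns(cost_center_csv))  # empty set when all columns are present
```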

#### Enrichment Process

1. Parse the invoice using LlamaParse to extract structured data
2. For each line item:
   * Query the spend category index to determine expense classification
   * Query the cost center index to determine appropriate cost allocation
3. Generate enriched invoice with categorized line items

#### Value Added

The automated classification helps standardize financial reporting by:
* Consistently categorizing similar expenses across different vendors
* Automating cost center allocation based on historical patterns
* Reducing manual classification effort and errors

Example transformations:

Raw line item:
```
"Dell Precision 5680 Workstation with 64GB RAM"
```

Gets enriched with:
```
Spend Category: IT Hardware
Cost Center: IT-INFRA-001
```

#### Implementation

The workflow uses:
* LlamaParse for PDF invoice parsing.
* LLMs for structured data extraction from the invoice.
* Embedding-based similarity search for spend category matching.
* Embedding-based similarity search for cost center attribution using historical data.
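
To make "similarity search" concrete without calling an embedding API, the sketch below scores a query against candidate descriptions using cosine similarity. It is a toy stand-in: word counts replace real embeddings, so the scores are not comparable to what `VectorStoreIndex` returns, and `bag_of_words`, `cosine_similarity`, `best_match`, and the sample categories are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase token counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query: str, candidates: dict) -> tuple:
    """Return the (label, score) whose description is most similar to the query."""
    query_vec = bag_of_words(query)
    scored = {
        label: cosine_similarity(query_vec, bag_of_words(description))
        for label, description in candidates.items()
    }
    label = max(scored, key=scored.get)
    return label, scored[label]


categories = {  # hypothetical category descriptions
    "IT Hardware": "laptop workstation server desktop hardware",
    "Software Licensing": "software license subscription renewal",
}
print(best_match("Dell Precision 5680 Workstation", categories))
```

With real embeddings, "Workstation" and "Laptop" would score as similar even without a shared token, which is why the notebook uses an embedding-backed index rather than keyword overlap.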

#### Input/Output Files

**Inputs:**
* `invoice.pdf`: Invoice document to process
* `spend_categories.json`: Spend category definitions
* `cost_centre_historical_data.csv`: Historical cost center allocations

**Output:**
* `invoice_enriched.json`: Invoice data enriched with spend categories and cost center allocations
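
Once the output file exists, line items routed to manual review can be listed with a short helper. The sketch below works on a trimmed inline sample in the shape of `invoice_enriched.json`, with values taken from this notebook's output; `items_needing_review` is a hypothetical helper name.

```python
import json

# Trimmed sample in the shape of invoice_enriched.json.
sample = """
{
  "invoice_number": "INV-2024-1234",
  "line_items": [
    {"id": "1", "spend_category": "Enterprise Hardware", "cost_center": "FINANCE"},
    {"id": "2", "spend_category": "Enterprise Software Licensing", "cost_center": "IT-001"}
  ]
}
"""


def items_needing_review(invoice: dict, review_label: str = "FINANCE") -> list:
    """Return the IDs of line items that were routed to manual review."""
    return [
        item["id"]
        for item in invoice.get("line_items", [])
        if item.get("cost_center") == review_label
    ]


invoice = json.loads(sample)
print(items_needing_review(invoice))  # ['1']
```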

#### Installation

In [1]:
!pip install llama-parse llama-index


#### Setup API Keys

In [2]:
import os

os.environ['LLAMA_CLOUD_API_KEY'] = 'llx-..' # Get your API Key from https://cloud.llamaindex.ai/
os.environ['OPENAI_API_KEY'] = 'sk-...' # Get your API Key from https://platform.openai.com/

#### Imports

In [3]:
from llama_index.core.workflow import (
    Event,
    StartEvent,
    StopEvent,
    Context,
    Workflow,
    step,
)
from llama_index.core import VectorStoreIndex
from llama_index.core.llms import LLM
from llama_index.core.prompts import ChatPromptTemplate
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import TextNode
from llama_index.llms.openai import OpenAI
from llama_parse import LlamaParse
from pydantic import BaseModel, Field
from typing import List, Optional
from pathlib import Path
import json
import logging
import pandas as pd

_logger = logging.getLogger(__name__)
_logger.setLevel(logging.INFO)

#### Data Models

Here, we define the Pydantic models that are essential for enriching invoice data with spend categories and cost center information. Pydantic models are used because they provide strict type checking and data validation while ensuring consistent data structures throughout the enrichment workflow.

`LineItem`: Represents a single line item from an invoice, capturing both original and enriched data:
* Unique line item identifier
* Original item description
* Quantity and pricing information
* Enriched spend category (optional)
* Enriched cost center attribution (optional)

`InvoiceInput`: Represents the initial extracted invoice data with:
* Invoice identifier and date
* Vendor information (name and address)
* Collection of line items
* Financial totals (subtotal, tax, total amount)

`EnrichedInvoice`: Complete representation of an invoice after spend category and cost center enrichment:
* All original invoice information
* Enriched line items with categorization
* Total amounts (subtotal, tax, total)
* Processing date (`processed_date`, defaults to today's date)

This model hierarchy ensures:
* Clean separation between raw and enriched data
* Consistent data validation throughout the pipeline
* Optional fields for enriched attributes that may require manual review
* Proper handling of financial calculations and date information
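
`LineItem` keeps `unit_price` and `total_amount` as strings, exactly as they appear on the invoice. If downstream code needs to do arithmetic on them, one option is to convert them to `Decimal` first. The sketch below is a minimal example under the assumption that amounts use a `$` prefix and `,` thousands separators, as in this notebook's sample invoice; `parse_amount` and `line_total_matches` are hypothetical helpers, not part of the workflow.

```python
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Convert a string such as '$3,250.00' to Decimal (assumes '$' and ',' formatting)."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def line_total_matches(quantity: int, unit_price: str, total_amount: str) -> bool:
    """Check that quantity x unit price equals the stated line total."""
    return quantity * parse_amount(unit_price) == parse_amount(total_amount)


print(parse_amount("$3,250.00"))                          # 3250.00
print(line_total_matches(10, "$3,250.00", "$32,500.00"))  # True
```

`Decimal` avoids the rounding surprises that binary floats can introduce when summing currency values.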

In [4]:
from pydantic import BaseModel, Field
from typing import List, Optional
from datetime import date
from decimal import Decimal

class LineItem(BaseModel):
    """
    Represents a single line item from an invoice with enrichment data.
    Each line item contains original invoice details and enriched categorization.
    """
    id: str = Field(
        description="Unique identifier for the line item"
    )
    description: str = Field(
        description="Original description of the item from the invoice"
    )
    quantity: int = Field(
        description="Number of units ordered",
        gt=0  # Ensures quantity is greater than 0
    )
    unit_price: str = Field(
        description="Price per unit as it appears on invoice"
    )
    total_amount: str = Field(
        description="Total amount for this line item"
    )
    spend_category: Optional[str] = Field(
        None,
        description="Enriched spend category based on item description matching"
    )
    cost_center: Optional[str] = Field(
        None,
        description="Assigned cost center based on historical data matching (similarity threshold: 0.85)"
    )

class InvoiceInput(BaseModel):
    """
    Raw invoice data model used after initial document extraction.
    Contains all basic invoice information and line items before enrichment.
    """
    invoice_number: str = Field(
        description="Unique identifier for the invoice (e.g., 'INV-2024-001')"
    )
    invoice_date: date = Field(
        description="Date of invoice issuance in YYYY-MM-DD format"
    )
    vendor_name: str = Field(
        description="Name of the vendor issuing the invoice"
    )
    vendor_address: str = Field(
        description="Complete address of the vendor"
    )
    line_items: List[LineItem] = Field(
        description="List of all line items in the invoice"
    )
    subtotal: Optional[float] = Field(
        None,
        description="Sum of all line items before tax"
    )
    tax: Optional[float] = Field(
        None,
        description="Total tax amount applied to the invoice"
    )
    total_amount: Optional[float] = Field(
        None,
        description="Final invoice amount including tax"
    )

class EnrichedInvoice(BaseModel):
    """
    Final enriched invoice model after spend category and cost center processing.
    Contains all original invoice data plus enrichment information and processing metadata.
    """
    invoice_number: str = Field(
        description="Unique identifier for the invoice"
    )
    invoice_date: date = Field(
        description="Date of invoice issuance"
    )
    vendor_name: str = Field(
        description="Name of the vendor"
    )
    vendor_address: str = Field(
        description="Complete vendor address"
    )
    line_items: List[LineItem] = Field(
        description="List of enriched line items with categorization"
    )
    subtotal: Optional[float] = Field(
        None,
        description="Invoice subtotal before tax"
    )
    tax: Optional[float] = Field(
        None,
        description="Total tax amount"
    )
    total_amount: Optional[float] = Field(
        None,
        description="Final invoice amount"
    )
    processed_date: date = Field(
        default_factory=date.today,
        description="Date when the invoice was processed and enriched"
    )

#### Event Models for Workflow

* `InvoiceOutputEvent`: Carries structured invoice data after initial parsing
* `EnrichmentOutputEvent`: Carries enriched invoice data after category and cost center matching
* `LogEvent`: Handles logging and progress updates during processing

These models form a pipeline where:
1. Raw invoice PDF is parsed into structured `InvoiceInput` data
2. Each line item is enriched through two stages:
   - Matched against spend categories
   - Attributed to cost centers based on historical patterns
3. Results are combined into enriched line items
4. Final output is collected in an `EnrichedInvoice`

##### Example Flow:

```
Raw Invoice Line:
Description: "Dell Precision 5680 Workstation with 64GB RAM", Qty: 10, Price: $3,250.00
       ↓
InvoiceInput Line Item:
{
    "id": "LINE-001",
    "description": "Dell Precision 5680 Workstation with 64GB RAM",
    "quantity": 10,
    "unit_price": "$3,250.00",
    "total_amount": "$32,500.00"
}
       ↓
Spend Category Matching:
Found category "IT Hardware" (based on description matching)
       ↓
Cost Center Attribution:
Found historical match "IT-INFRA-001" (similarity score: 0.92)
       ↓
Enriched Line Item:
{
    "id": "LINE-001",
    "description": "Dell Precision 5680 Workstation with 64GB RAM",
    "quantity": 10,
    "unit_price": "$3,250.00",
    "total_amount": "$32,500.00",
    "spend_category": "IT Hardware",
    "cost_center": "IT-INFRA-001"
}
```

Note: If the cost center matching confidence is below 0.85, the `cost_center` field will be set to "FINANCE" for manual review.

This workflow ensures:
- Consistent extraction of invoice details
- Standardized spend categorization and cost center attribution
- Automated handling of high-confidence matches
- Clear flagging of cases needing manual review

In [5]:
class InvoiceOutputEvent(Event):
    invoice_data: InvoiceInput

class EnrichmentOutputEvent(Event):
    enriched_invoice: EnrichedInvoice

class LogEvent(Event):
    msg: str
    delta: bool = False

#### Prompt
Here we define the prompt used to extract information from the invoice.

In [6]:
INVOICE_EXTRACT_PROMPT = """
You are given invoice data below. Extract the relevant information into the defined schema.
{invoice_data}
"""

#### Creating Indices for Spend Categories and Cost Centers

Before we can enrich invoice line items, we need to create two searchable indices: one for spend categories and another for cost center attribution. These indices enable semantic matching of line item descriptions to appropriate categories and cost centers.

##### Spend Category Index

This index is created from structured spend category data (`spend_categories.json`) that defines standard expense classifications. The function creates a vector store index that combines:

1. Category descriptions
2. Department information
3. Spend category names

*Example:* When searching for "Dell Precision 5680 Workstation", the index helps find the appropriate spend category (e.g., "IT Hardware") based on semantic similarity rather than requiring exact matches.

##### Cost Center Historical Index

This index is built from historical cost center allocation data (`cost_centre_historical_data.csv`). It creates embeddings from past line item descriptions and their associated cost centers, enabling the system to learn from previous allocation patterns.

*Example:* If a new purchase matches a historical IT equipment purchase allocated to "IT-INFRA-001" with a similarity score of at least 0.85, it is automatically assigned the same cost center.

In [7]:
def create_spend_category_index(spend_categories_info) -> VectorStoreIndex:
    """Create index from spend category data."""
    nodes = []
    for data in spend_categories_info:
        metadata = {
                  "spend_category": data['spend_category'],
                  "department": data['department']
                  }

        node = TextNode(text=data['description'], id_=str(data['id']), metadata = metadata)
        nodes.append(node)
    return VectorStoreIndex(nodes)

def create_cost_center_index(csv_path: str) -> VectorStoreIndex:
    """Create index from cost center CSV data."""
    df = pd.read_csv(csv_path)
    nodes = []
    
    for _, row in df.iterrows():
        node = TextNode(
            text=row['Item Description'],
            id_=str(row['id']),
            metadata={
                "cost_center_id": row['Cost Center'],
                "department": row['Department']
            }
        )
        nodes.append(node)

    return VectorStoreIndex(nodes)

def load_spend_categories(json_path: str):
    """Load spend categories from JSON file."""
    with open(json_path) as f:
        return json.load(f)

#### Main Workflow Implementation

Here we implement the workflow for invoice enrichment with spend categories and cost center attribution.

**a) Parse Invoice (First Step)**
* Triggered by: Initial StartEvent containing invoice path
* Reads invoice PDF using LlamaParse
* Uses LLM with a structured prompt to extract invoice data
* Outputs: InvoiceOutputEvent containing structured invoice data and line items

**b) Enrich Line Items (Second Step)**
* Triggered by: InvoiceOutputEvent
* For each line item:
   * Queries spend category index using description
   * Determines appropriate spend category
   * Queries cost center index using the matched spend category
   * If the top cost center match has a similarity score of at least 0.85:
      * Assigns matched cost center
   * Else:
      * Assigns "FINANCE" for manual review
* Outputs: EnrichmentOutputEvent with categorized and attributed line items

**c) Save Enriched Output (Final Step)**
* Triggered by: EnrichmentOutputEvent
* Combines all enriched line items
* Creates standardized JSON output with:
   * Original invoice details
   * Enriched categorizations
   * Cost center assignments
   * Processing metadata
* Outputs: StopEvent with final results

The event flow looks like this:
```
StartEvent 
  → InvoiceOutputEvent 
    → EnrichmentOutputEvent 
      → StopEvent
```

Throughout the workflow:
1. LogEvents provide detailed progress updates
2. Context (ctx) maintains access to both indices:
   * Spend category index for classification
   * Cost center index for attribution
3. Events carry structured data between steps
4. Each step runs asynchronously (@step decorator)
5. A similarity threshold of 0.85 gates automatic cost center attribution

The workflow uses two types of vector similarity matching:
1. Spend Category matching to standardize expense classifications
2. Cost center matching based on historical patterns
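
The event chain described above can be mimicked with plain `asyncio` to show how each step is selected by the type of event it receives. This is a toy dispatcher written for illustration only; it is not how LlamaIndex implements `Workflow`, and all class and function names in it (`Start`, `Parsed`, `Enriched`, `Stop`, `HANDLERS`, `run`) are hypothetical stand-ins.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Start:
    invoice_path: str


@dataclass
class Parsed:
    descriptions: list


@dataclass
class Enriched:
    rows: list


@dataclass
class Stop:
    result: list


async def parse(ev: Start) -> Parsed:
    # Stand-in for LlamaParse plus LLM extraction.
    return Parsed(descriptions=[f"item from {ev.invoice_path}"])


async def enrich(ev: Parsed) -> Enriched:
    # Stand-in for the spend category and cost center lookups.
    return Enriched(rows=[(d, "IT Hardware", "FINANCE") for d in ev.descriptions])


async def save(ev: Enriched) -> Stop:
    # Stand-in for writing invoice_enriched.json.
    return Stop(result=ev.rows)


# Each handler is chosen by the type of the event it accepts.
HANDLERS = {Start: parse, Parsed: enrich, Enriched: save}


async def run(invoice_path: str) -> list:
    event = Start(invoice_path)
    trace = [type(event).__name__]
    while not isinstance(event, Stop):
        event = await HANDLERS[type(event)](event)
        trace.append(type(event).__name__)
    print(" -> ".join(trace))
    return event.result


result = asyncio.run(run("invoice.pdf"))
print(result)
```

Inside a notebook, where an event loop is already running, `await run("invoice.pdf")` would replace the `asyncio.run(...)` call.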


In [8]:
class InvoiceEnrichmentWorkflow(Workflow):
    """Workflow for processing and enriching invoices with spend categories and cost centers."""

    def __init__(
        self,
        parser: LlamaParse,
        spend_cat_retriever: BaseRetriever,
        cost_center_retriever: BaseRetriever,
        llm: LLM | None = None,
        output_dir: str = "data_out",
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self.parser = parser
        self.spend_cat_retriever = spend_cat_retriever
        self.cost_center_retriever = cost_center_retriever
        self.llm = llm or OpenAI(model="gpt-4")
        
        # Create output directory
        out_path = Path(output_dir) / "workflow_output"
        out_path.mkdir(parents=True, exist_ok=True)
        self.output_dir = out_path

    @step
    async def parse_invoice(
        self, ctx: Context, ev: StartEvent
    ) -> InvoiceOutputEvent:
        """Extract structured data from invoice."""
        if self._verbose:
            ctx.write_event_to_stream(LogEvent(msg=">> Parsing invoice"))
            
        # Parse invoice
        docs = await self.parser.aload_data(ev.invoice_path)
        invoice_text = "\n".join([d.get_content(metadata_mode="all") for d in docs])

        # Create extraction prompt
        prompt = ChatPromptTemplate.from_messages([
            ("system", "You are an assistant that extracts structured data from the invoice."),
            ("user", INVOICE_EXTRACT_PROMPT)
        ])
        
        # Extract using LLM
        invoice_data = await self.llm.astructured_predict(
            InvoiceInput,
            prompt=prompt,
            invoice_data=invoice_text
        )

        if self._verbose:
            ctx.write_event_to_stream(
                LogEvent(msg=f">> Extracted invoice data: {invoice_data.dict()}")
            )

        return InvoiceOutputEvent(invoice_data=invoice_data)

    @step 
    async def enrich_line_items(
        self, ctx: Context, ev: InvoiceOutputEvent
    ) -> EnrichmentOutputEvent:
        """Enrich line items with spend categories and cost centers."""
        if self._verbose:
            ctx.write_event_to_stream(LogEvent(msg=">> Enriching line items"))

        enriched_items = []
        for item in ev.invoice_data.line_items:
            # Query spend category
            spend_cat_docs = self.spend_cat_retriever.retrieve(
                f"Find matching spend category for: {item.description}"
            )
            spend_category = spend_cat_docs[0].metadata["spend_category"] if spend_cat_docs else None

            # Query cost center
            cost_center_docs = self.cost_center_retriever.retrieve(
                f"Find cost center responsible for spend category: {spend_category}"
            )
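            # Fall back to "FINANCE" when nothing is retrieved or the match is weak
            top_match = cost_center_docs[0] if cost_center_docs else None
            if top_match is not None and (top_match.score or 0.0) >= 0.85:
                cost_center = top_match.metadata["cost_center_id"]
            else:
                cost_center = "FINANCE"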

            # Create enriched line item
            enriched_item = LineItem(
                id=item.id,
                description=item.description,
                quantity=item.quantity,
                unit_price=item.unit_price,
                spend_category=spend_category,
                cost_center=cost_center,
                total_amount=item.total_amount
            )
            enriched_items.append(enriched_item)

        # Create enriched invoice
        enriched_invoice = EnrichedInvoice(
            invoice_number=ev.invoice_data.invoice_number,
            invoice_date=ev.invoice_data.invoice_date,
            vendor_name=ev.invoice_data.vendor_name,
            vendor_address=ev.invoice_data.vendor_address,
            line_items=enriched_items,
            subtotal=ev.invoice_data.subtotal,
            tax=ev.invoice_data.tax,
            total_amount=ev.invoice_data.total_amount
        )

        if self._verbose:
            ctx.write_event_to_stream(
                LogEvent(msg=f">> Enriched invoice data: {enriched_invoice.dict()}")
            )

        return EnrichmentOutputEvent(enriched_invoice=enriched_invoice)

    @step
    async def save_output(
        self, ctx: Context, ev: EnrichmentOutputEvent
    ) -> StopEvent:
        """Save enriched invoice to file."""
        output_path = self.output_dir / "invoice_enriched.json"
        with open(output_path, "w") as f:
            json.dump(ev.enriched_invoice.dict(), f, indent=2, default=str)

        return StopEvent(result=ev.enriched_invoice)

#### Create Workflow

Here we initialize and configure all components needed for the invoice enrichment workflow.

Note: The default workflow timeout is 10 seconds. Parsing and the LLM calls often take longer than that, so we extend the timeout to 300 seconds.

1. Initialize Core Components:

In [9]:
# Initialize LlamaParse for PDF extraction
parser = LlamaParse(result_type="markdown")

# Load spend category data and create index
spend_categories_info = load_spend_categories("spend_categories.json")
spend_cat_index = create_spend_category_index(spend_categories_info)

# Create cost center index from historical data
cost_center_index = create_cost_center_index("cost_centre_historical_data.csv")

# Create retrievers with appropriate settings
spend_cat_retriever = spend_cat_index.as_retriever(similarity_top_k=1)
cost_center_retriever = cost_center_index.as_retriever(similarity_top_k=1)

2. Initialize and Configure Workflow:

In [10]:
workflow = InvoiceEnrichmentWorkflow(
    parser=parser,
    spend_cat_retriever=spend_cat_retriever,
    cost_center_retriever=cost_center_retriever,
    llm=OpenAI(model="gpt-4o"),
    verbose=True,
    timeout=300
)

3. Visualize workflow


In [11]:
from llama_index.utils.workflow import draw_all_possible_flows

draw_all_possible_flows(InvoiceEnrichmentWorkflow, filename="invoice_enrichment_spend_cost_centre_categorization.html")

<class 'NoneType'>
<class '__main__.EnrichmentOutputEvent'>
<class '__main__.InvoiceOutputEvent'>
<class 'llama_index.core.workflow.events.StopEvent'>
invoice_enrichment_spend_cost_centre_categorization.html


![Visualize Workflow](workflow_visualization.png)

4. Run the workflow

Here we run the workflow and check the final result.

In [12]:
# Process invoice
async def process_invoice(invoice_path: str):
    handler = workflow.run(invoice_path=invoice_path)
    async for event in handler.stream_events():
        if hasattr(event, 'msg'):
            print(event.msg)
    
    response = await handler
    return response

invoice_path = "invoice.pdf"
enriched_invoice = await process_invoice(invoice_path)
print(f"Processed invoice saved to: {workflow.output_dir}")

Running step parse_invoice
>> Parsing invoice
Started parsing the file under job_id 191c2a4a-d53a-4005-9708-068a1416fc61
Step parse_invoice produced event InvoiceOutputEvent
>> Extracted invoice data: {'invoice_number': 'INV-2024-1234', 'invoice_date': datetime.date(2024, 1, 15), 'vendor_name': 'TechPro Solutions Inc.', 'vendor_address': '123 Business Street\nCorporate City, BZ 12345', 'line_items': [{'id': '1', 'description': 'Dell Precision 5680 Workstation with 64GB RAM, 2TB SSD, NVIDIA RTX 4000 Graphics, 3-year ProSupport Plus warranty and accidental damage protection', 'quantity': 10, 'unit_price': '$3,250.00', 'total_amount': '$32,500.00', 'spend_category': None, 'cost_center': None}, {'id': '2', 'description': 'Microsoft Office 365 E5 licenses annual renewal including advanced security features and Teams Phone System (enterprise-wide deployment)', 'quantity': 500, 'unit_price': '$250.00', 'total_amount': '$125,000.00', 'spend_category': None, 'cost_center': None}, {'id': '3', 'd

#### Check Enriched Invoice

In [13]:
with open("data_out/workflow_output/invoice_enriched.json", "r") as f:
    data = f.read()

print(data)

{
  "invoice_number": "INV-2024-1234",
  "invoice_date": "2024-01-15",
  "vendor_name": "TechPro Solutions Inc.",
  "vendor_address": "123 Business Street\nCorporate City, BZ 12345",
  "line_items": [
    {
      "id": "1",
      "description": "Dell Precision 5680 Workstation with 64GB RAM, 2TB SSD, NVIDIA RTX 4000 Graphics, 3-year ProSupport Plus warranty and accidental damage protection",
      "quantity": 10,
      "unit_price": "$3,250.00",
      "total_amount": "$32,500.00",
      "spend_category": "Enterprise Hardware",
      "cost_center": "FINANCE"
    },
    {
      "id": "2",
      "description": "Microsoft Office 365 E5 licenses annual renewal including advanced security features and Teams Phone System (enterprise-wide deployment)",
      "quantity": 500,
      "unit_price": "$250.00",
      "total_amount": "$125,000.00",
      "spend_category": "Enterprise Software Licensing",
      "cost_center": "IT-001"
    },
    {
      "id": "3",
      "description": "Custom software