[![Labellerr](https://storage.googleapis.com/labellerr-cdn/%200%20Labellerr%20template/notebook.webp)](https://www.labellerr.com)

# **Vision OCR Agent**

---

[![labellerr](https://img.shields.io/badge/Labellerr-BLOG-black.svg)](https://www.labellerr.com/blog/<BLOG_NAME>)
[![Youtube](https://img.shields.io/badge/Labellerr-YouTube-b31b1b.svg)](https://www.youtube.com/@Labellerr)
[![Github](https://img.shields.io/badge/Labellerr-GitHub-green.svg)](https://github.com/Labellerr/Hands-On-Learning-in-Computer-Vision)

## Overview
This notebook implements an OCR (Optical Character Recognition) system specifically designed for processing invoices. The system uses Mistral AI for OCR and text extraction, and Google's Gemini for intelligent chat interactions about the extracted invoice data.

## Dependencies
The following packages are required:
- `python-dotenv`: For environment variable management
- `mistralai`: For OCR and text extraction
- `pillow`: For image processing
- `crewai`: For agent-based interactions
- `google-generativeai`: For Gemini AI integration (never imported directly in this notebook; Gemini is reached through CrewAI's `LLM` class)

To install them, uncomment the `%pip` line in the cell below and run it.

In [10]:
# Install Dependencies
# %pip install python-dotenv mistralai pillow crewai google-generativeai
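
As a quick sanity check after installation, the sketch below reports which modules cannot be located. It is an illustration only: `find_missing_modules` and `REQUIRED_MODULES` are hypothetical names that do not appear in the notebook, and the list uses import names, which differ from the pip package names above (for example, `pillow` is imported as `PIL`).

```python
import importlib.util

# Import names, not pip package names (assumption: these match the
# imports used later in this notebook).
REQUIRED_MODULES = ["dotenv", "mistralai", "PIL", "crewai"]


def find_missing_modules(module_names):
    """Return the subset of module_names that cannot be located."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package is absent or a module has no spec
            found = False
        if not found:
            missing.append(name)
    return missing


print("Missing modules:", find_missing_modules(REQUIRED_MODULES) or "none")
```

An empty result only means the modules can be found; it does not verify their versions.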

## Library Imports
This cell imports all necessary Python libraries:
- `os`: For file and environment operations
- `json`: For handling JSON data structures
- `base64`: For encoding image data
- `Path`: For cross-platform path handling
- `load_dotenv`: For loading environment variables
- `Mistral`: For OCR capabilities
- `Agent, Task, Crew, LLM`: For CrewAI framework
- `PIL.Image`: For image processing operations

In [12]:
# Import Libraries
import os
import json
import base64
from pathlib import Path
from dotenv import load_dotenv
from mistralai import Mistral
from crewai import Agent, Task, Crew, LLM
from PIL import Image


## API Client Initialization
This cell sets up the necessary API clients:
1. Loads environment variables from a `.env` file
2. Initializes the Mistral client for OCR functionality
   - Requires `MISTRAL_API_KEY` in environment variables
3. Sets up Gemini for chat capabilities
   - Requires `GEMINI_API_KEY` in environment variables

Create a `.env` file containing both API keys before running this cell.

In [13]:
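# Initialize API Clients
# Load environment variables from .env
load_dotenv()

# Read both keys and fail early with a clear message if either is missing
# (assigning None to os.environ would otherwise raise a TypeError)
mistral_api_key = os.getenv("MISTRAL_API_KEY")
gemini_api_key = os.getenv("GEMINI_API_KEY")
if not mistral_api_key or not gemini_api_key:
    raise RuntimeError("Set MISTRAL_API_KEY and GEMINI_API_KEY in your .env file")

# Initialize Mistral for OCR
mistral_client = Mistral(api_key=mistral_api_key)

# Expose the Gemini key through the environment for Chat
os.environ["GEMINI_API_KEY"] = gemini_api_key

print("✅ Mistral client initialized (for OCR)")
print("✅ Gemini API key loaded (for Chat)")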


✅ Mistral client initialized (for OCR)
✅ Gemini API key loaded (for Chat)


## OCR Functions

### 1. extract_text_from_invoice(image_path: str) -> str
Extracts raw text from an invoice image using Mistral's OCR capabilities:
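- Recognizes `.jpg`, `.jpeg`, `.png`, and `.webp` files; any other extension is sent with the `image/jpeg` media type
- Encodes the image as a base64 data URL
- Sends it to Mistral's `pixtral-large-latest` model through the chat completion API
- Returns the extracted text, or an empty string on error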

### 2. convert_invoice_to_json(extracted_text: str) -> dict
Converts extracted text into structured JSON format:
- Uses Mistral's `mistral-small-latest` model
- Extracts key invoice information including:
  - Invoice details (number, date)
  - Vendor information
  - Customer information
  - Line items with pricing
  - Totals and tax information
- Returns the parsed JSON as a dict, or an empty dict on error
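
The image-encoding step described for `extract_text_from_invoice` can be tried in isolation, without calling any API. The following is a minimal sketch: `build_data_url` and `MEDIA_TYPES` are hypothetical names introduced here for illustration, mirroring the suffix-to-media-type mapping the notebook uses.

```python
import base64
from pathlib import Path

# Extensions mapped explicitly; anything else falls back to JPEG
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
}


def build_data_url(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data URL for a vision model."""
    suffix = Path(filename).suffix.lower()
    media_type = MEDIA_TYPES.get(suffix, "image/jpeg")
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{media_type};base64,{encoded}"


print(build_data_url(b"abc", "scan.PNG"))
# -> data:image/png;base64,YWJj
```

Because of the silent JPEG fallback, a file with an unsupported extension is still sent to the model. Rejecting unknown suffixes up front may be preferable in production.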

In [14]:
# OCR Functions (Using Mistral)
def extract_text_from_invoice(image_path: str) -> str:
    """Extract text from invoice image using Mistral OCR"""
    
    print(f"\n{'='*60}")
    print(f"📄 Extracting text from: {image_path}")
    print(f"{'='*60}\n")
    
    try:
        with open(image_path, "rb") as image_file:
            image_data = base64.standard_b64encode(image_file.read()).decode("utf-8")
        
        suffix = Path(image_path).suffix.lower()
        media_type_map = {
            ".jpg": "image/jpeg",
            ".jpeg": "image/jpeg",
            ".png": "image/png",
            ".webp": "image/webp"
        }
        media_type = media_type_map.get(suffix, "image/jpeg")
        
        response = mistral_client.chat.complete(
            model="pixtral-large-latest",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": f"data:{media_type};base64,{image_data}"},
                    {"type": "text", "text": "Extract ALL text from this invoice. Return only the text."}
                ]
            }]
        )
        
        extracted_text = response.choices[0].message.content
        
        print("✅ Text Extracted!\n")
        print(f"Extracted Text:\n{extracted_text}\n")
        print(f"{'='*60}\n")
        
        return extracted_text
        
    except Exception as e:
        print(f"❌ OCR Error: {str(e)}")
        return ""


def convert_invoice_to_json(extracted_text: str) -> dict:
    """Convert extracted invoice text to structured JSON"""
    
    print(f"\n{'='*60}")
    print(f"🔧 Converting to JSON...")
    print(f"{'='*60}\n")
    
    try:
        response = mistral_client.chat.complete(
            model="mistral-small-latest",
            messages=[{
                "role": "user",
                "content": f"""Extract invoice information as JSON with these fields:
- invoice_number, date, vendor_name, vendor_address
- customer_name, customer_address
- items (list with description, quantity, unit_price, total)
- subtotal, tax, total_amount, currency

Text: {extracted_text}

Return ONLY valid JSON."""
            }]
        )
        
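        # Trim whitespace first so a leading newline does not hide the fence
        json_text = response.choices[0].message.content.strip()
        
        # Remove a Markdown code fence (``` or ```json) if the model added one
        if json_text.startswith("```"):
            json_text = json_text.split("```")[1]
            if json_text.startswith("json"):
                json_text = json_text[4:]
        json_text = json_text.strip()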
        
        invoice_json = json.loads(json_text)
        
        print("✅ JSON Created!\n")
        print(json.dumps(invoice_json, indent=2))
        print(f"\n{'='*60}\n")
        
        return invoice_json
        
    except Exception as e:
        print(f"❌ JSON Error: {str(e)}")
        return {}
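
Model replies do not always arrive as a bare JSON object: they may be wrapped in a code fence or surrounded by a sentence of prose. The sketch below shows one tolerant way to parse such replies. It is an alternative to the fence-stripping logic above, not the notebook's own method, and `parse_llm_json` is a hypothetical helper name.

```python
import json


def parse_llm_json(reply: str) -> dict:
    """Parse a JSON object from a model reply, tolerating fences and stray prose."""
    text = reply.strip()
    # Keep only the span from the first "{" to the last "}"; this drops
    # Markdown fences and any sentence the model added around the JSON.
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        return {}
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return {}
    # Only a JSON object is a valid invoice; reject lists and scalars
    return parsed if isinstance(parsed, dict) else {}


print(parse_llm_json('```json\n{"total_amount": 40.7}\n```'))
# -> {'total_amount': 40.7}
```

This approach assumes the reply contains a single top-level JSON object. A reply with several separate objects would fail to parse and yield an empty dict.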



## Main Invoice Processing Pipeline
The `process_invoice(image_path: str) -> dict` function orchestrates the complete invoice processing workflow:

1. **Input Validation**
   - Checks if the input image file exists
   
2. **Text Extraction**
   - Calls `extract_text_from_invoice()` to perform OCR
   - Stops if no text was extracted
   
3. **JSON Conversion**
   - Calls `convert_invoice_to_json()` to structure the data
   - Stops if the result is empty

Each stage prints progress messages, and the function returns an empty dict as soon as any stage fails.

In [15]:
# Main Invoice Processing
def process_invoice(image_path: str) -> dict:
    """Complete invoice processing pipeline"""
    
    print(f"\n{'#'*60}")
    print(f"🚀 PROCESSING INVOICE")
    print(f"{'#'*60}\n")
    
    if not os.path.exists(image_path):
        print(f"❌ File not found: {image_path}")
        return {}
    
    extracted_text = extract_text_from_invoice(image_path)
    if not extracted_text:
        return {}
    
    invoice_json = convert_invoice_to_json(extracted_text)
    if not invoice_json:
        return {}
    
    print(f"\n{'#'*60}")
    print(f"✅ INVOICE PROCESSED!")
    print(f"{'#'*60}\n")
    
    return invoice_json
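
The pipeline only checks that the parsed result is non-empty. A stricter check could confirm that the fields requested in the JSON prompt are actually present. The following is a minimal sketch under that assumption: `missing_invoice_fields` and `REQUIRED_FIELDS` are hypothetical names, and the sample values are taken from the invoice output shown in this notebook.

```python
# Field names requested in the JSON-conversion prompt
REQUIRED_FIELDS = [
    "invoice_number", "date", "vendor_name", "vendor_address",
    "customer_name", "customer_address", "items",
    "subtotal", "tax", "total_amount", "currency",
]


def missing_invoice_fields(invoice: dict) -> list:
    """List the requested fields that are absent or null in the parsed invoice."""
    return [field for field in REQUIRED_FIELDS if invoice.get(field) is None]


# Partial invoice, for illustration only
partial = {
    "invoice_number": "GB2201OU15874W",
    "date": "08 March 2025",
    "vendor_name": "MUNRO&CO LTD",
    "total_amount": 40.7,
    "currency": "GBP",
}
print(missing_invoice_fields(partial))
```

A caller could log the missing fields or retry the conversion instead of passing incomplete data to the chat stage.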



## CrewAI and Gemini Integration

This section sets up the AI-powered chat capability using the CrewAI framework and Google's Gemini model:

### Components
1. **Gemini LLM Instance**
   - Uses `gemini-2.0-flash-exp` model
   - Temperature set to 0.7 for balanced creativity
   
2. **Invoice Analysis Agent**
   - Role: `Invoice Data Analyst`
   - Goal: Answer questions about invoice data accurately and helpfully
   - Backstory: an expert financial analyst who helps users with:
     - Understanding invoice details
     - Pattern identification
     - Total calculations
     - Billing insights

In [16]:
# CrewAI Setup with Gemini for Chat
# Create Gemini LLM instance
gemini_llm = LLM(
    model="gemini/gemini-2.0-flash-exp",
    api_key=gemini_api_key,
    temperature=0.7
)

# Create Invoice Analysis Agent
invoice_analyst = Agent(
    role='Invoice Data Analyst',
    goal='Answer questions about invoice data accurately and helpfully',
    backstory="""You are an expert financial analyst specializing in invoice analysis. 
    You help users understand invoice details, identify patterns, calculate totals, 
    and provide insights about billing information.""",
    llm=gemini_llm,
    verbose=True,
    allow_delegation=False
)

print("✅ Gemini-powered Invoice Analyst Agent created!")


✅ Gemini-powered Invoice Analyst Agent created!


## Chat Functionality Implementation

The `chat_with_invoice(invoice_data: dict, user_question: str) -> str` function implements the interactive chat feature:

### Process Flow
1. **Task Creation**
   - Creates an analysis task with the user's question
   - Includes complete invoice data context
   
2. **Crew Setup**
   - Initializes a CrewAI instance
   - Configures with single agent setup
   - Disables memory and caching for fresh responses
   
3. **Execution**
   - Processes the task through the Gemini-powered agent
   - Formats and returns the response
   - Catches any exception and returns its message as the answer string

In [17]:
# Chat Function with CrewAI
def chat_with_invoice(invoice_data: dict, user_question: str) -> str:
    """Chat with invoice data using Gemini via CrewAI"""
    
    print(f"\n{'='*60}")
    print(f"💬 Question: {user_question}")
    print(f"{'='*60}\n")
    
    # Create analysis task
    analysis_task = Task(
        description=f"""Answer this question about the invoice data: {user_question}

Invoice Data:
{json.dumps(invoice_data, indent=2)}

Provide a clear, accurate answer based on the invoice data.""",
        agent=invoice_analyst,
        expected_output="Clear answer to the user's question about the invoice"
    )
    
    # Create crew
    crew = Crew(
        agents=[invoice_analyst],
        tasks=[analysis_task],
        verbose=False,
        memory=False,
        cache=False
    )
    
    try:
        result = crew.kickoff()
        answer = str(result)
        
        print(f"🤖 Answer:\n{answer}\n")
        print(f"{'='*60}\n")
        
        return answer
        
    except Exception as e:
        error_msg = f"Error: {str(e)}"
        print(f"❌ {error_msg}\n")
        return error_msg
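
The task-creation step amounts to combining the user's question with the serialized invoice into one prompt string. The sketch below isolates that step so it can be inspected without calling Gemini. `build_task_description` is a hypothetical helper introduced for illustration, and the sample vendor name comes from the output shown in this notebook.

```python
import json


def build_task_description(invoice_data: dict, user_question: str) -> str:
    """Combine the question and the invoice JSON into one prompt string."""
    invoice_json = json.dumps(invoice_data, indent=2)
    return (
        f"Answer this question about the invoice data: {user_question}\n\n"
        f"Invoice Data:\n{invoice_json}\n\n"
        "Provide a clear, accurate answer based on the invoice data."
    )


print(build_task_description({"vendor_name": "MUNRO&CO LTD"}, "Who is the vendor?"))
```

Because the whole invoice is embedded in every prompt, prompt size grows with the number of line items.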


## Interactive Chat Session

The `interactive_chat(invoice_data: dict)` function provides a user-friendly interface for continuous interaction with the invoice data:

### Features
- Continuous question-answer loop
- Simple command interface
  - 'exit', 'quit', or 'q' to end session
  - Empty input handling
- Clear session boundaries with visual separators
- Informative user prompts and instructions

This interactive mode allows users to explore invoice data through natural language questions.

In [18]:
# Interactive Chat Loop
def interactive_chat(invoice_data: dict):
    """Interactive chat session with invoice data"""
    
    print(f"\n{'#'*60}")
    print(f"💬 INVOICE CHAT SESSION")
    print(f"{'#'*60}")
    print("Ask questions about the invoice data.")
    print("Type 'exit', 'quit', or 'q' to end the session.")
    print(f"{'#'*60}\n")
    
    while True:
        user_question = input("You: ").strip()
        
        if user_question.lower() in ['exit', 'quit', 'q']:
            print("\n👋 Chat session ended!")
            break
        
        if not user_question:
            continue
        
        chat_with_invoice(invoice_data, user_question)
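
The command handling in this loop can be exercised without a keyboard or an API call by injecting the input source and the answering function. The following is a minimal sketch of that idea: `run_chat_loop` and `EXIT_COMMANDS` are hypothetical names, and the lambda stands in for `chat_with_invoice`.

```python
from typing import Callable, List

EXIT_COMMANDS = {"exit", "quit", "q"}


def run_chat_loop(read_input: Callable[[], str],
                  answer: Callable[[str], str]) -> List[str]:
    """Read questions until an exit command; return the answers produced."""
    answers = []
    while True:
        question = read_input().strip()
        if question.lower() in EXIT_COMMANDS:
            break
        if not question:
            continue  # Ignore empty input and prompt again
        answers.append(answer(question))
    return answers


# Scripted session: one question, one empty line, then an exit command
scripted = iter(["Who is the vendor?", "", "  QUIT  "])
result = run_chat_loop(lambda: next(scripted), lambda q: f"(answer to: {q})")
print(result)
# -> ['(answer to: Who is the vendor?)']
```

In the notebook itself, `read_input` would correspond to `lambda: input("You: ")` and `answer` to a call to `chat_with_invoice` with the invoice data bound.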



In [None]:
# Run Complete Pipeline
if __name__ == "__main__":
    
    # Step 1: Process invoice image
    invoice_image = "sample_invoice_2.webp"
    
    print("STEP 1: Processing Invoice Image\n")
    invoice_data = process_invoice(invoice_image)
    
    if not invoice_data:
        print("❌ Failed to process invoice")
    else:
        print("\n📊 Invoice Data Ready!")
        
        # Step 2: Example questions
        print("\n" + "="*60)
        print("STEP 2: Chatting with Invoice Data (Examples)")
        print("="*60 + "\n")
        
        example_questions = [
            "What is the total amount of this invoice?",
            "Who is the vendor?",
            "List all the items in the invoice",
            "When was this invoice issued?",
            "Calculate the tax percentage"
        ]
        
        for question in example_questions:
            chat_with_invoice(invoice_data, question)
        
        # # Step 3: Interactive chat (optional)
        # print("\n" + "="*60)
        # print("STEP 3: Interactive Chat Mode")
        # print("="*60 + "\n")
        
        # start_chat = input("Start interactive chat? (y/n): ").strip().lower()
        # if start_chat == 'y':
        #     interactive_chat(invoice_data)


STEP 1: Processing Invoice Image


############################################################
🚀 PROCESSING INVOICE
############################################################


📄 Extracting text from: sample_invoice_2.webp

✅ Text Extracted!

Extracted Text:
```
Amazon

Paid
Payment reference ID847O/Oz0ho2G03Oqxw4W
Sold by MUNRO&CO LTD
VAT # GB369728108

Invoice date / Delivery date 08 March 2025
Invoice # GB2201OU15874W
Total payable £40.70

LOUIE HILL
10 BALDOCK STREET
NEWTON REGIS, B79 8BD
GB

For customer support visit www.amazon.co.uk/contact-us

Billing address    Delivery address    Sold by
LOUIE HILL         LOUIE HILL          MUNRO&CO LTD
10 BALDOCK STREET  10 BALDOCK STREET   9 Gander Green Crescent
NEWTON REGIS, B79 8BD                NEWTON REGIS, B79 8BD  HAMPTON, Middlesex, TW12 2FA
GB                                   GB                        GB
VAT # GB369728147

Order information
Order date 07 March 2025
Order # 026-8796475-8888339

Invoice details
Description    

🤖 Answer:
The total amount of this invoice is 40.7 GBP.



💬 Question: Who is the vendor?



🤖 Answer:
MUNRO&CO LTD



💬 Question: List all the items in the invoice



🤖 Answer:
The items in the invoice are:
- Oral-B Pro 2500 3D White Electric Rechargeable Toothbrush with Travel Case Powered by Braun - Pink - Ships with a UK 2 pin plug
- Shipping Charges



💬 Question: When was this invoice issued?



🤖 Answer:
The invoice was issued on 08 March 2025.



💬 Question: Calculate the tax percentage



🤖 Answer:
The tax percentage is 20.0%.




## Main Execution Pipeline

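The final code cell (`Run Complete Pipeline`) contains the main execution logic. Its `if __name__ == "__main__":` guard is always true inside a notebook, so the cell runs whenever it is executed: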

### Pipeline Steps
1. **Invoice Processing**
   - Processes a sample invoice image (`sample_invoice_2.webp`)
   - Creates structured JSON data
   
2. **Example Questions**
   - Demonstrates system capabilities with preset questions:
     - Total amount query
     - Vendor identification
     - Item listing
     - Date information
     - Tax calculations
   
3. **Interactive Mode (Optional)**
   - Offers user-initiated interactive session
   - Currently commented out for demonstration purposes
   - Can be uncommented for full interactive experience

Note: `sample_invoice_2.webp` must exist in the notebook's working directory before running. Otherwise `process_invoice` reports the file as missing and returns an empty dict.