In [1]:
# %pip install retab

In [1]:
# Draft an initial schema (the same schema as in the previous notebook)

from pydantic import BaseModel

class Invoice(BaseModel):
    date: str
    invoice_number: str
    total: str
    status: str
    customer: str
    customer_address: str
    customer_email: str
    customer_phone: str
    customer_website: str

## **CONSENSUS**

The `Consensus` parameter (`n_consensus` in the extraction call) provides a simple way to reduce variability in LLM outputs and improve the **quality** of structured data, with accuracy scores that can exceed 98%.

We provide a quick, type-safe wrapper around the OpenAI Chat Completions and Responses endpoints that gives you automatic consensus reconciliation.

**For more information, check our [Documentation about k-LLMs Consensus](https://docs.retab.com/core-concepts/k-LLMs_Consensus) (see also our [k-LLMs library](https://www.retab.com/k-llms)).**
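
To make the idea of consensus reconciliation concrete, here is a minimal sketch that majority-votes several extraction results field by field. It is an illustration only, not Retab's actual reconciliation algorithm: the `reconcile` helper and the sample values are hypothetical, and the likelihood is simply assumed to be the share of runs that agree with the winning value.

```python
from collections import Counter

def reconcile(candidates):
    """Majority-vote k extraction results field by field (hypothetical helper).

    Returns (consensus, likelihoods), where each likelihood is the share
    of candidates that agree with the winning value.
    """
    consensus, likelihoods = {}, {}
    for field in candidates[0]:
        votes = Counter(candidate[field] for candidate in candidates)
        value, count = votes.most_common(1)[0]
        consensus[field] = value
        likelihoods[field] = count / len(candidates)
    return consensus, likelihoods

# Five made-up extractions of the same invoice, standing in for n_consensus=5.
runs = [
    {"invoice_number": "INV-001", "status": "Paid"},
    {"invoice_number": "INV-001", "status": "Paid"},
    {"invoice_number": "INV-001", "status": ""},
    {"invoice_number": "INV-001", "status": "Paid"},
    {"invoice_number": "INV-001", "status": "Paid"},
]

consensus, likelihoods = reconcile(runs)
print(consensus)
print(likelihoods)
```

Under this simplified scheme, a field on which four of five runs agree gets a likelihood of 0.8, which is the kind of signal the rest of this notebook relies on.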

In [4]:
# Run the first extraction with Consensus, using the schema that is not optimized yet

from dotenv import load_dotenv
from retab import Retab
import json

load_dotenv() # We advise you to create a .env file containing your RETAB_API_KEY=sk_retab_***

client = Retab()

response = client.documents.extract(
    documents=["../assets/docs/invoice.jpeg"],
    model="gemini-2.5-flash",          # or any model your plan supports
    json_schema=Invoice.model_json_schema(),
    modality="native", 
    temperature=0.5,
    n_consensus=5                 # Consensus parameter      
)

print(json.dumps(response.likelihoods, indent=2))

{
  "date": 1.0,
  "invoice_number": 1.0,
  "total": 1.0,
  "status": 1.0,
  "customer": 1.0,
  "customer_address": 1.0,
  "customer_email": 1.0,
  "customer_phone": 1.0,
  "customer_website": 1.0
}


Enabling `Consensus` gives the user confidence scores (likelihoods) for each extracted field.

This parameter provides two advantages:

- running k-LLMs consensus makes extractions more precise (by aggregating the results of a "*k-LLMs council*").

- running k-LLMs consensus **helps the user spot low-confidence fields, making it easier to improve the schema and prompt to get higher accuracy**.
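
As a rough sketch of the second point, the helper below lists the fields whose likelihood falls under a threshold. The function name, the 0.9 threshold, and the sample values are assumptions made for illustration; the dictionary only mimics the shape of `response.likelihoods`.

```python
def low_confidence_fields(likelihoods, threshold=0.9):
    """Return the fields whose likelihood is below the threshold, lowest first."""
    flagged = {name: score for name, score in likelihoods.items() if score < threshold}
    return dict(sorted(flagged.items(), key=lambda item: item[1]))

# Illustrative values in the shape of response.likelihoods (not a real run).
sample_likelihoods = {
    "date": 1.0,
    "total": 1.0,
    "status": 0.8,
    "customer": 0.6,
}

flagged = low_confidence_fields(sample_likelihoods)
print(flagged)
```

The flagged fields are natural candidates for a better description, an enum, or a reasoning prompt.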

**For more information on `X-ReasoningPrompt`, check our [Documentation here](https://docs.retab.com/core-concepts/Reasoning).**

## **REASONING**

In [None]:
# As the "status" field reached a confidence score below 90% in the previous extraction,
# we add Reasoning and see what performance this simple addition achieves.

from pydantic import Field

class Invoice_with_reasoning(BaseModel):
    date: str
    invoice_number: str
    total: str
    
    status: str = Field(
        ...,
        description="Invoice Status, either Blank, Paid or Unpaid.",
        # Reasoning prompt for this field
        json_schema_extra={
            "X-ReasoningPrompt": "If the Status is not specified, make it explicit that it is blank. Otherwise, use the provided status making sure it is either Paid or Unpaid.",
        },
    )

    customer: str
    customer_address: str
    customer_email: str
    customer_phone: str
    customer_website: str



response = client.documents.extract(
    documents=["../assets/docs/invoice.jpeg"],
    model="gemini-2.5-flash",          
    json_schema=Invoice_with_reasoning.model_json_schema(),
    temperature=0.5,              
    modality="native",      
    n_consensus=5                 
)

print(json.dumps(response.likelihoods, indent=2))

{
  "date": 1.0,
  "invoice_number": 1.0,
  "total": 1.0,
  "status": 1.0,
  "customer": 0.6,
  "customer_address": 1.0,
  "customer_email": 1.0,
  "customer_phone": 1.0,
  "customer_website": 0.6
}


We can see here that **adding `Reasoning` on the "status" field brought its likelihood to 1.0 (100% confidence), compared to 80% before.**
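
Since every run samples several extractions at `temperature=0.5`, likelihoods on other fields can also move from one run to the next. The sketch below compares two likelihood dictionaries and reports the fields that changed. The `likelihood_changes` helper is hypothetical; the values are copied from the two outputs printed earlier in this notebook.

```python
def likelihood_changes(before, after):
    """Return {field: (before, after)} for fields whose likelihood changed."""
    return {
        field: (before[field], after[field])
        for field in before
        if field in after and before[field] != after[field]
    }

# Output printed for the first extraction (schema without reasoning).
baseline = {
    "date": 1.0, "invoice_number": 1.0, "total": 1.0, "status": 1.0,
    "customer": 1.0, "customer_address": 1.0, "customer_email": 1.0,
    "customer_phone": 1.0, "customer_website": 1.0,
}

# Output printed for the extraction with the reasoning prompt.
with_reasoning = {
    "date": 1.0, "invoice_number": 1.0, "total": 1.0, "status": 1.0,
    "customer": 0.6, "customer_address": 1.0, "customer_email": 1.0,
    "customer_phone": 1.0, "customer_website": 0.6,
}

changes = likelihood_changes(baseline, with_reasoning)
print(changes)
```

Comparing runs this way helps separate the effect of a schema change from ordinary run-to-run variation.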

# **ANNEXES**

In [None]:
# We could instead have updated the schema and iterated to check whether the extraction precision improved.
# Adding Reasoning saves time during this iteration phase.

from pydantic import BaseModel, Field
from enum import Enum

class StatusEnum(str, Enum):
    Blank  = "Blank"
    Paid   = "Paid"
    Unpaid = "Unpaid"

class Invoice_v2(BaseModel):
    date: str
    invoice_number: str
    total: str

    # Improvement on this field
    status: StatusEnum = Field(
        default=StatusEnum.Blank,
        description="Invoice status; Blank when no status appears on the document.",  # a description improves precision
    )

    customer: str
    customer_address: str
    customer_email: str
    customer_phone: str
    customer_website: str

# Evaluate the precision of the new schema
response = client.documents.extract(
    documents=["../assets/docs/invoice.jpeg"],
    model="gemini-2.5-flash",          # or any model your plan supports
    json_schema=Invoice_v2.model_json_schema(),
    temperature=0.5,              # temperature must be set when using consensus
    modality="native",
    n_consensus=5
)

print(json.dumps(response.likelihoods, indent=2))

{
  "date": 1.0,
  "invoice_number": 1.0,
  "total": 1.0,
  "status": 0.8,
  "customer": 0.6,
  "customer_address": 1.0,
  "customer_email": 1.0,
  "customer_phone": 1.0,
  "customer_website": 1.0
}


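We improve the likelihood on the `status` field from 0.6 to 1.0! Note that likelihoods vary between runs at `temperature=0.5`: the output printed above shows `status` at 0.8.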
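
The enum restricts `status` to three allowed values. The sketch below shows, in plain Python and without any Retab call, how such a restriction behaves when raw strings are mapped onto it. The `parse_status` helper is hypothetical, and falling back to `Blank` for unknown values is an assumption that mirrors the `default=StatusEnum.Blank` in the schema.

```python
from enum import Enum

class StatusEnum(str, Enum):
    Blank = "Blank"
    Paid = "Paid"
    Unpaid = "Unpaid"

def parse_status(raw):
    """Map a raw string to StatusEnum, falling back to Blank (hypothetical helper)."""
    try:
        # Normalize whitespace and casing, e.g. " paid " becomes "Paid".
        return StatusEnum((raw or "").strip().capitalize())
    except ValueError:
        # Anything outside the three allowed values is treated as Blank.
        return StatusEnum.Blank

print(parse_status(" paid "))
print(parse_status("UNPAID"))
print(parse_status(""))
```

Constraining the value space this way leaves fewer distinct answers for the consensus runs to disagree on.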