# 02 - Receipt Extraction

This notebook demonstrates the receipt extraction pipeline using Google Gemini 2.5 Flash.

## Pipeline Overview

```
PDF → Gemini 2.5 Flash → JSON → Pydantic Validation → Nominatim Geocoding → SQLite
```

## Contents
1. Setup - Load environment, connect to DB
2. Single Receipt Demo - Process one PDF step by step
3. LLM Response Analysis - Show raw JSON, validate with Pydantic
4. Geocoding Demo - Address → Coordinates → Zone
5. Client Matching Demo - Duplicate detection
6. Batch Processing - Process all receipts
7. Results Summary - Plotly visualizations
8. Folium Map - Extracted delivery locations

In [1]:
# Standard library imports
import json
import os
import sys
from pathlib import Path

# Add src to path for imports
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# External imports
import folium
import pandas as pd
import plotly.express as px
from dotenv import load_dotenv

# Local imports
from src.database import (
    ClientModel,
    DatabaseManager,
    GeocodingCacheModel,
    OrderModel,
    ProcessedReceiptModel,
)
from src.extraction import (
    extract_from_pdf,
    find_client_by_name,
    find_client_by_tax_id,
    process_all_receipts,
    get_processing_summary,
)
from src.geo import geocode_address

# Load environment variables
load_dotenv(project_root / ".env")

print("Imports successful!")
print(f"GEMINI_API_KEY configured: {'Yes' if os.getenv('GEMINI_API_KEY') else 'No - Please set in .env'}")

Imports successful!
GEMINI_API_KEY configured: Yes


## 1. Setup

Connect to the database and verify existing data.

In [2]:
# Define paths
DATA_DIR = project_root / "data"
RECEIPTS_DIR = DATA_DIR / "raw" / "receipts"
DB_PATH = DATA_DIR / "processed" / "delivery.db"

# Initialize database manager
db = DatabaseManager(DB_PATH)

# Check existing data
with db.get_session() as session:
    client_count = session.query(ClientModel).count()
    order_count = session.query(OrderModel).count()
    processed_count = session.query(ProcessedReceiptModel).count()
    cache_count = session.query(GeocodingCacheModel).count()

print(f"Database: {DB_PATH}")
print(f"Existing clients: {client_count}")
print(f"Existing orders: {order_count}")
print(f"Processed receipts: {processed_count}")
print(f"Geocoding cache entries: {cache_count}")

Database: c:\Users\Santi\Desktop\CV\portafolio\Eco-Bags-Delivery-Optimizer\data\processed\delivery.db
Existing clients: 31
Existing orders: 37
Processed receipts: 0
Geocoding cache entries: 0


In [3]:
# List available receipts
if RECEIPTS_DIR.exists():
    pdf_files = list(RECEIPTS_DIR.glob("*.pdf"))
    print(f"Found {len(pdf_files)} PDF files in {RECEIPTS_DIR}:")
    for pdf in pdf_files:
        print(f"  - {pdf.name}")
else:
    print(f"Receipts directory not found: {RECEIPTS_DIR}")
    print("Creating directory...")
    RECEIPTS_DIR.mkdir(parents=True, exist_ok=True)
    pdf_files = []

Found 4 PDF files in c:\Users\Santi\Desktop\CV\portafolio\Eco-Bags-Delivery-Optimizer\data\raw\receipts:
  - Invoice_Coffee_Shop.pdf
  - Invoice_FMartínez.pdf
  - Invoice_Toy_Store.pdf
  - Receipt_El_Gaucho.pdf


## 2. Single Receipt Demo

Process one PDF step by step to understand the extraction pipeline.

In [4]:
# Select a receipt to process
if pdf_files:
    sample_pdf = pdf_files[0]
    print(f"Selected: {sample_pdf.name}")
    print(f"Full path: {sample_pdf}")
else:
    print("No PDF files found. Please add some receipts to data/raw/receipts/")
    sample_pdf = None

Selected: Invoice_Coffee_Shop.pdf
Full path: c:\Users\Santi\Desktop\CV\portafolio\Eco-Bags-Delivery-Optimizer\data\raw\receipts\Invoice_Coffee_Shop.pdf


In [5]:
# Step 1: Extract data from PDF using Gemini
if sample_pdf:
    print("Extracting data from PDF using Gemini 2.5 Flash...")
    extraction = extract_from_pdf(sample_pdf)
    print(f"\nExtraction confidence: {extraction.extraction_confidence:.2f}")
    print(f"Requires review: {extraction.requires_review}")
else:
    extraction = None
    print("Skipping extraction - no PDF available")

Extracting data from PDF using Gemini 2.5 Flash...

Extraction confidence: 0.95
Requires review: False


In [6]:
# Load geographic reference data
zones_file = project_root / 'data' / 'geo' / 'zones.json'
localities_file = project_root / 'data' / 'geo' / 'localities.json'

if zones_file.exists() and localities_file.exists():
    with open(zones_file, "r", encoding="utf-8") as f:
        zones_data = json.load(f)
    with open(localities_file, "r", encoding="utf-8") as f:
        localities_data = json.load(f)
    print("Geographic reference data loaded successfully")
else:
    print("Geographic reference files not found")

Geographic reference data loaded successfully


## 3. LLM Response Analysis

Examine the extracted data structure and validate with Pydantic.

In [7]:
# Display extracted client information
if extraction:
    print("=== Extracted Client ===")
    print(f"Business Name: {extraction.client.business_name}")
    print(f"Tax ID (CUIT): {extraction.client.tax_id}")
    print(f"Delivery Address: {extraction.client.delivery_address}")
    
    print("\n=== Document Info ===")
    print(f"Document Number: {extraction.document.document_number}")
    print(f"Issue Date: {extraction.document.issue_date}")
else:
    print("No extraction data available")

=== Extracted Client ===
Business Name: O'connor Coffee Shop
Tax ID (CUIT): 20-432.135.67-9
Delivery Address: Av. Regimiento de Patricios 1030, CABA

=== Document Info ===
Document Number: 00002289
Issue Date: 2026-01-10


In [8]:
# Display extracted items
if extraction and extraction.items:
    print("=== Extracted Items ===")
    items_data = []
    for i, item in enumerate(extraction.items, 1):
        items_data.append({
            "#": i,
            "Raw Type": item.bag_type_raw,
            "Normalized": item.bag_type_normalized.value,
            "Quantity": item.quantity_packs,
        })
    
    items_df = pd.DataFrame(items_data)
    display(items_df)
    
    print("\n=== Totals ===")
    # Compare against None so a legitimate value of 0 is not reported as missing
    print(f"Total Amount: ${extraction.totals.total_amount}" if extraction.totals.total_amount is not None else "Total Amount: Not specified")
    print(f"Total Packs: {extraction.totals.total_packs}" if extraction.totals.total_packs is not None else "Total Packs: Not specified")
else:
    print("No items extracted")

=== Extracted Items ===


| | # | Raw Type | Normalized | Quantity |
|---|---|----------|------------|----------|
| 0 | 1 | Bolsa mediana color café | medium | 160 |
| 1 | 2 | Bolsa mediana color blanca | medium | 160 |



=== Totals ===
Total Amount: $29040.0
Total Packs: 320


In [9]:
# Display extraction notes and full JSON
if extraction:
    print("=== Extraction Notes ===")
    print(extraction.extraction_notes or "No notes")
    
    print("\n=== Full Extraction JSON ===")
    print(json.dumps(extraction.model_dump(), indent=2, default=str))

=== Extraction Notes ===
All requested data was found clearly and unambiguously in the document.

=== Full Extraction JSON ===
{
  "extraction_confidence": 0.95,
  "client": {
    "business_name": "O'connor Coffee Shop",
    "tax_id": "20-432.135.67-9",
    "delivery_address": "Av. Regimiento de Patricios 1030, CABA"
  },
  "document": {
    "issue_date": "2026-01-10",
    "document_number": "00002289"
  },
  "items": [
    {
      "bag_type_raw": "Bolsa mediana color caf\u00e9",
      "bag_type_normalized": "medium",
      "quantity_packs": 160
    },
    {
      "bag_type_raw": "Bolsa mediana color blanca",
      "bag_type_normalized": "medium",
      "quantity_packs": 160
    }
  ],
  "totals": {
    "total_amount": 29040.0,
    "total_packs": 320
  },
  "extraction_notes": "All requested data was found clearly and unambiguously in the document.",
  "requires_review": false
}
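
One check the JSON above makes possible is comparing the reported totals against the line items. The following is a minimal sketch using only the standard library; `totals_match_items` is a hypothetical helper for illustration and is not part of `src.extraction`, whose Pydantic schemas may or may not perform this check.

```python
import json

# Payload trimmed from the extraction output above to the fields used here.
raw_json = """
{
  "items": [
    {"bag_type_normalized": "medium", "quantity_packs": 160},
    {"bag_type_normalized": "medium", "quantity_packs": 160}
  ],
  "totals": {"total_amount": 29040.0, "total_packs": 320}
}
"""


def totals_match_items(payload: dict) -> bool:
    """Return True when total_packs is absent or equals the sum of item quantities."""
    reported = payload.get("totals", {}).get("total_packs")
    if reported is None:
        return True  # nothing to cross-check
    items_sum = sum(item["quantity_packs"] for item in payload.get("items", []))
    return reported == items_sum


payload = json.loads(raw_json)
print(totals_match_items(payload))  # True: 160 + 160 == 320
```

A mismatch would be a reasonable trigger for setting `requires_review`, although that rule is an assumption and not something this notebook confirms.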


## 4. Geocoding Demo

Demonstrate the geocoding pipeline: Address → Coordinates → Zone.

In [10]:
# Test geocoding with the extracted address
if extraction and extraction.client.delivery_address:
    address = extraction.client.delivery_address
    print(f"Geocoding address: {address}")
    print("\nCalling Nominatim API...")
    
    with db.get_session() as session:
        geocoding_result = geocode_address(address, session)
    
    print("\n=== Geocoding Result ===")
    print(f"Success: {geocoding_result.success}")
    print(f"Latitude: {geocoding_result.latitude}")
    print(f"Longitude: {geocoding_result.longitude}")
    print(f"Locality: {geocoding_result.locality}")
    print(f"Zone ID: {geocoding_result.zone_id}")
    print(f"Confidence: {geocoding_result.confidence.value}")
else:
    print("No address to geocode")
    geocoding_result = None

Geocoding address: Av. Regimiento de Patricios 1030, CABA

Calling Nominatim API...

=== Geocoding Result ===
Success: True
Latitude: -34.6394927
Longitude: -58.3692052
Locality: La Boca
Zone ID: CABA
Confidence: medium
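
`geocode_address` receives a database session, which is how results end up in the geocoding cache table. Caching matters because Nominatim's public endpoint is rate limited (its usage policy asks for at most one request per second). The sketch below shows the general cache-aside pattern with a plain dictionary standing in for the table; `normalize_key`, `cached_geocode` and `fake_lookup` are hypothetical names, and the real logic in `src.geo` may differ.

```python
from typing import Callable, Dict, Optional, Tuple

Coordinates = Tuple[float, float]


def normalize_key(address: str) -> str:
    """Collapse case and whitespace so trivially different spellings share one entry."""
    return " ".join(address.lower().split())


def cached_geocode(
    address: str,
    cache: Dict[str, Coordinates],
    lookup: Callable[[str], Optional[Coordinates]],
) -> Optional[Coordinates]:
    """Return cached coordinates when available, otherwise call `lookup` and store the result."""
    key = normalize_key(address)
    if key in cache:
        return cache[key]
    result = lookup(address)
    if result is not None:  # failures are not cached, so they can be retried later
        cache[key] = result
    return result


calls = []


def fake_lookup(address: str) -> Optional[Coordinates]:
    """Stand-in for the remote call; returns the coordinates shown in the output above."""
    calls.append(address)
    return (-34.6394927, -58.3692052)


cache: Dict[str, Coordinates] = {}
first = cached_geocode("Av. Regimiento de Patricios 1030, CABA", cache, fake_lookup)
second = cached_geocode("  av. regimiento de patricios 1030,  CABA ", cache, fake_lookup)
print(len(calls))  # 1: the second request was served from the cache
```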


In [11]:
# Test geocoding with sample addresses

# Clear the geocoding cache so the lookups below query Nominatim instead of reusing cached rows
with db.get_session() as session:
    session.query(GeocodingCacheModel).delete()
    session.commit()
    print("Cleared geocoding cache to test fresh results\n")

sample_addresses = [
    "Av. Santa Fe 3200, Palermo, Buenos Aires",  # Palermo - correct address
    "José Hernández 3302, Tortuguitas, CP: 1667",  # Tortuguitas - test postal code cleaning & auto-add
    "Calle Florida 100, CABA, Argentina",  # CABA - Florida street
]

print("=== Sample Address Geocoding ===")
geocoding_results = []

with db.get_session() as session:
    for addr in sample_addresses:
        result = geocode_address(addr, session)
        geocoding_results.append({
            "Address": addr[:45] + "..." if len(addr) > 45 else addr,
            "Success": result.success,
            "Locality": result.locality,
            "Zone": result.zone_id,
            "Lat": round(result.latitude, 4) if result.latitude else None,
            "Lon": round(result.longitude, 4) if result.longitude else None,
        })

print("\n=== Results Summary ===")
geocoding_df = pd.DataFrame(geocoding_results)
display(geocoding_df)

Cleared geocoding cache to test fresh results

=== Sample Address Geocoding ===

=== Results Summary ===


| | Address | Success | Locality | Zone | Lat | Lon |
|---|---------|---------|----------|------|-----|-----|
| 0 | Av. Santa Fe 3200, Palermo, Buenos Aires | True | Palermo | CABA | -34.5765 | -58.4315 |
| 1 | José Hernández 3302, Tortuguitas, CP: 1667 | True | Tortuguitas | NORTH_ZONE | -34.472 | -58.76 |
| 2 | Calle Florida 100, CABA, Argentina | True | Retiro | CABA | -34.5956 | -58.3751 |
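
The Tortuguitas sample is annotated as a test of postal code cleaning. How `geocode_address` cleans addresses internally is not shown in this notebook; the sketch below is one possible approach, assuming the classic four-digit postal code written after a `CP` marker. `strip_postal_code` is a hypothetical helper.

```python
import re

# Matches fragments such as ", CP: 1667" or " C.P. 1667" (four-digit codes only).
_POSTAL_CODE = re.compile(r",?\s*\bC\.?P\.?\s*:?\s*\d{4}\b", flags=re.IGNORECASE)


def strip_postal_code(address: str) -> str:
    """Remove a 'CP: NNNN' fragment and tidy leftover spaces and commas."""
    cleaned = _POSTAL_CODE.sub("", address)
    cleaned = re.sub(r"\s{2,}", " ", cleaned)  # collapse doubled spaces
    return cleaned.strip(" ,")


print(strip_postal_code("José Hernández 3302, Tortuguitas, CP: 1667"))
# José Hernández 3302, Tortuguitas
```

Addresses without a postal code fragment pass through unchanged, so the helper can be applied to every address before the lookup.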


In [12]:
# More sample addresses with different zones
sample_addresses = [
    "Av. Santa Fe 3200, Palermo, Buenos Aires, Argentina",
    "Av. Corrientes 3200, Almagro, Buenos Aires, Argentina",
    "Av. Cabildo 2000, Belgrano, Buenos Aires, Argentina",
    "Av. San Martin 456, Quilmes, Buenos Aires, Argentina",
    "Florida 100, San Nicolas, Buenos Aires, Argentina",
]

print("=== Extended Sample Address Geocoding ===")
geocoding_results = []

with db.get_session() as session:
    for addr in sample_addresses:
        result = geocode_address(addr, session)
        geocoding_results.append({
            "Address": addr[:45] + "..." if len(addr) > 45 else addr,
            "Success": result.success,
            "Locality": result.locality,
            "Zone": result.zone_id,
            "Lat": round(result.latitude, 4) if result.latitude else None,
            "Lon": round(result.longitude, 4) if result.longitude else None,
        })

geocoding_df = pd.DataFrame(geocoding_results)
display(geocoding_df)

=== Extended Sample Address Geocoding ===


| | Address | Success | Locality | Zone | Lat | Lon |
|---|---------|---------|----------|------|-----|-----|
| 0 | Av. Santa Fe 3200, Palermo, Buenos Aires, Arg... | True | Palermo | CABA | -34.5765 | -58.4315 |
| 1 | Av. Corrientes 3200, Almagro, Buenos Aires, A... | True | Almagro | CABA | -34.6023 | -58.4294 |
| 2 | Av. Cabildo 2000, Belgrano, Buenos Aires, Arg... | True | Belgrano | CABA | -34.5632 | -58.456 |
| 3 | Av. San Martin 456, Quilmes, Buenos Aires, Ar... | True | Parque Bernal | SOUTH_ZONE | -34.7098 | -58.2805 |
| 4 | Florida 100, San Nicolas, Buenos Aires, Argen... | True | San Nicolás | CABA | -34.6004 | -58.3754 |


## 5. Client Matching Demo

Demonstrate how the system finds existing clients or creates new ones.

In [13]:
# Show existing clients in database
with db.get_session() as session:
    clients_df = pd.read_sql(
        "SELECT client_id, business_name, tax_id, zone_id, is_new_client FROM clients LIMIT 10",
        session.bind
    )

print("=== Existing Clients (sample) ===")
display(clients_df)

=== Existing Clients (sample) ===


| | client_id | business_name | tax_id | zone_id | is_new_client |
|---|-----------|---------------|--------|---------|---------------|
| 0 | CLI-GAUCHO01 | Mayorista El Gaucho | 30-12345678-9 | NORTH_ZONE | 0 |
| 1 | CLI-242C86DA | Distribuidora La Victoria | 20-36774184-0 | NORTH_ZONE | 0 |
| 2 | CLI-A3A3042F | Almacen El Buen Precio | 33-10520043-1 | SOUTH_ZONE | 0 |
| 3 | CLI-04ECF774 | Fiambreria La Esquina | 20-86903824-7 | WEST_ZONE | 0 |
| 4 | CLI-7108B9BC | Autoservicio El Trebol | 20-17271004-4 | WEST_ZONE | 1 |
| 5 | CLI-4C481DBB | Distribuidora del Sur | 23-92024137-8 | WEST_ZONE | 0 |
| 6 | CLI-AA9D1B54 | Autoservicio Mi Barrio | 20-88401920-5 | CABA | 1 |
| 7 | CLI-51F9B575 | Mayorista La Union | 20-38645714-3 | SOUTH_ZONE | 0 |
| 8 | CLI-9A1253D6 | Supermercado Los Amigos | 33-67530849-7 | SOUTH_ZONE | 1 |
| 9 | CLI-2B7FC9A7 | Comercial Rivadavia | 27-85224161-5 | SOUTH_ZONE | 0 |


In [14]:
# Demonstrate client matching logic

# Get a sample client to test matching
with db.get_session() as session:
    sample_client = session.query(ClientModel).first()
    
    if sample_client:
        print(f"Testing with existing client: {sample_client.business_name}")
        print(f"Tax ID: {sample_client.tax_id}")
        
        # Test tax_id match
        found_by_tax = find_client_by_tax_id(sample_client.tax_id, session)
        print(f"\nFound by Tax ID: {found_by_tax.client_id if found_by_tax else 'Not found'}")
        
        # Test name match
        found_by_name = find_client_by_name(sample_client.business_name, session)
        print(f"Found by Name: {found_by_name.client_id if found_by_name else 'Not found'}")
        
        # Test non-existent client
        not_found = find_client_by_tax_id("99-99999999-9", session)
        print(f"\nNon-existent Tax ID search: {not_found if not_found else 'Not found (expected)'}")

Testing with existing client: Mayorista El Gaucho
Tax ID: 30-12345678-9

Found by Tax ID: CLI-GAUCHO01
Found by Name: CLI-GAUCHO01

Non-existent Tax ID search: Not found (expected)
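
Matching only works if both sides are written the same way. The coffee shop receipt earlier in this notebook printed its CUIT as `20-432.135.67-9`, while the database stores IDs in the form `30-12345678-9`. The sketch below shows one way to bring tax IDs and business names into a canonical form before comparing them. Both helpers are hypothetical; whether `find_client_by_tax_id` and `find_client_by_name` normalize their inputs this way is an assumption. Check digit validation is deliberately left out.

```python
import re
import unicodedata
from typing import Optional


def normalize_cuit(raw: str) -> Optional[str]:
    """Reduce a CUIT to its 11 digits and format it as XX-XXXXXXXX-X, or return None."""
    digits = re.sub(r"\D", "", raw)  # drop dashes, dots and spaces
    if len(digits) != 11:
        return None
    return f"{digits[:2]}-{digits[2:10]}-{digits[10]}"


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and collapse whitespace for name comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.casefold().split())


print(normalize_cuit("20-432.135.67-9"))  # 20-43213567-9
print(normalize_name("Almacén  El Buen Precio") == normalize_name("Almacen El Buen Precio"))  # True
```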


## 6. Batch Processing

Process all receipts in the receipts folder.

In [15]:
# Process all receipts

if pdf_files:
    print(f"Processing {len(pdf_files)} receipts...")
    print("=" * 50)
    
    try:
        with db.get_session() as session:
            results = process_all_receipts(RECEIPTS_DIR, session)
        
        print("\n" + "=" * 50)
        print(f"Processing complete! Processed {len(results)} receipts.")
    except Exception as e:
        print(f"Error during processing: {type(e).__name__}: {e}")
        import traceback
        traceback.print_exc()
        results = []
else:
    print("No receipts to process.")
    results = []

Processing 4 receipts...
Processing: Invoice_Coffee_Shop.pdf
  ✓ - Confidence: 0.98 | New client created
Processing: Invoice_FMartínez.pdf
  ✓ - Confidence: 0.95 | New client created
Processing: Invoice_Toy_Store.pdf
  ✓ - Confidence: 0.90 | New client created
Processing: Receipt_El_Gaucho.pdf
  ✓ - Confidence: 0.95 | Existing client matched

Processing complete! Processed 4 receipts.


In [16]:
# Display processing summary
if results:
    summary = get_processing_summary(results)
    
    print("=== Processing Summary ===")
    for key, value in summary.items():
        print(f"{key.replace('_', ' ').title()}: {value}")

=== Processing Summary ===
Total Receipts: 4
Successful: 4
Duplicates: 0
Failed: 0
New Clients Created: 3
Average Confidence: 0.94
Requires Review: 1
Total Processing Time Seconds: 62.94
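
The summary figures can be reproduced from per-receipt outcomes. The sketch below aggregates the four receipts of this run using the confidence, client status and timing reported per file; `ReceiptOutcome` and `summarize` are hypothetical stand-ins, and the actual `get_processing_summary` may count duplicates or failures differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ReceiptOutcome:
    """Hypothetical, simplified stand-in for one processing result."""
    success: bool
    is_duplicate: bool
    client_is_new: bool
    confidence: Optional[float]
    seconds: float


def summarize(outcomes: List[ReceiptOutcome]) -> Dict[str, object]:
    """Aggregate counts, mean confidence and total time over a batch."""
    confidences = [o.confidence for o in outcomes if o.confidence is not None]
    return {
        "total_receipts": len(outcomes),
        "successful": sum(o.success for o in outcomes),
        "duplicates": sum(o.is_duplicate for o in outcomes),
        "failed": sum(not o.success for o in outcomes),
        "new_clients_created": sum(o.client_is_new for o in outcomes),
        "average_confidence": sum(confidences) / len(confidences) if confidences else None,
        "total_processing_time_seconds": round(sum(o.seconds for o in outcomes), 2),
    }


# Values taken from this batch run: three new clients, one existing client.
outcomes = [
    ReceiptOutcome(True, False, True, 0.98, 16.70),
    ReceiptOutcome(True, False, True, 0.95, 14.99),
    ReceiptOutcome(True, False, True, 0.90, 16.94),
    ReceiptOutcome(True, False, False, 0.95, 14.31),
]
summary_sketch = summarize(outcomes)
print(summary_sketch["total_processing_time_seconds"])  # 62.94
```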


In [17]:
# Display individual results
if results:
    results_data = []
    for r in results:
        results_data.append({
            "File": Path(r.receipt_path).name,
            "Success": "✓" if r.success else "✗",
            "Duplicate": "Yes" if r.order_is_duplicate else "No",
            "New Client": "Yes" if r.client_is_new else "No",
            "Confidence": f"{r.extraction.extraction_confidence:.2f}" if r.extraction else "N/A",
            "Order ID": r.order_id or "N/A",
            "Time (s)": f"{r.processing_time_seconds:.2f}",
        })
    
    results_df = pd.DataFrame(results_data)
    display(results_df)

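| | File | Success | Duplicate | New Client | Confidence | Order ID | Time (s) |
|---|------|---------|-----------|------------|------------|----------|----------|
| 0 | Invoice_Coffee_Shop.pdf | ✓ | No | Yes | 0.98 | ORD-ED01AC3D | 16.70 |
| 1 | Invoice_FMartínez.pdf | ✓ | No | Yes | 0.95 | ORD-B6FBFFE5 | 14.99 |
| 2 | Invoice_Toy_Store.pdf | ✓ | No | Yes | 0.90 | ORD-CE98BB79 | 16.94 |
| 3 | Receipt_El_Gaucho.pdf | ✓ | No | No | 0.95 | ORD-C4A1485D | 14.31 |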


## 7. Results Summary

Visualize the extraction results with Plotly charts.

In [18]:
# Load updated data from database
with db.get_session() as session:
    all_orders_df = pd.read_sql("SELECT * FROM orders", session.bind)
    all_clients_df = pd.read_sql("SELECT * FROM clients", session.bind)
    processed_df = pd.read_sql("SELECT * FROM processed_receipts", session.bind)
    geocache_df = pd.read_sql("SELECT * FROM geocoding_cache", session.bind)

print(f"Total orders: {len(all_orders_df)}")
print(f"Total clients: {len(all_clients_df)}")
print(f"Processed receipts: {len(processed_df)}")
print(f"Geocoding cache entries: {len(geocache_df)}")

Total orders: 41
Total clients: 34
Processed receipts: 4
Geocoding cache entries: 12


In [19]:
# Chart 1: Extraction Confidence Distribution
if len(processed_df) > 0:
    fig = px.histogram(
        processed_df,
        x="extraction_confidence",
        nbins=10,
        title="Extraction Confidence Distribution",
        labels={"extraction_confidence": "Confidence Score", "count": "Number of Receipts"},
        color_discrete_sequence=["#3498db"]
    )
    fig.add_vline(x=0.7, line_dash="dash", line_color="red", 
                  annotation_text="Review Threshold (0.7)")
    fig.update_layout(xaxis_title="Confidence Score", yaxis_title="Count")
    fig.show()
else:
    print("No processed receipts to visualize")
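
The histogram marks 0.7 as the review threshold. The batch summary nevertheless reports one receipt requiring review even though all four confidences are above 0.7, which suggests the flag is not derived from the score alone. A minimal sketch of a rule consistent with that observation, assuming the model's own `requires_review` flag is combined with the threshold (`needs_review` is a hypothetical helper):

```python
REVIEW_THRESHOLD = 0.7  # same value as the dashed line in the histogram


def needs_review(
    confidence: float,
    model_flag: bool = False,
    threshold: float = REVIEW_THRESHOLD,
) -> bool:
    """Flag a receipt when the model asked for review or the score is below the threshold."""
    return model_flag or confidence < threshold


print(needs_review(0.95))                   # False
print(needs_review(0.65))                   # True
print(needs_review(0.90, model_flag=True))  # True
```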

In [20]:
# Chart 2: Orders by Zone
if len(all_orders_df) > 0:
    # Load zone colors
    with open(project_root / "data" / "geo" / "zones.json", "r", encoding="utf-8") as f:
        zones_colors = json.load(f)
    zone_color_map = {zone_id: info["color"] for zone_id, info in zones_colors.items()}
    
    orders_by_zone = all_orders_df.groupby("delivery_zone_id").size().reset_index(name="count")
    
    fig = px.bar(
        orders_by_zone,
        x="delivery_zone_id",
        y="count",
        color="delivery_zone_id",
        color_discrete_map=zone_color_map,
        title="All Orders by Zone",
        labels={"delivery_zone_id": "Zone", "count": "Number of Orders"},
        text="count"
    )
    fig.update_traces(textposition="outside")
    fig.update_layout(showlegend=False)
    fig.show()

In [21]:
# Chart 3: New vs Existing Clients
if len(all_clients_df) > 0:
    client_types = all_clients_df.groupby("is_new_client").size().reset_index(name="count")
    client_types["type"] = client_types["is_new_client"].map({1: "New", 0: "Existing"})
    
    fig = px.pie(
        client_types,
        values="count",
        names="type",
        title="Client Distribution: New vs Existing",
        color_discrete_sequence=["#2ecc71", "#3498db"],
        hole=0.4
    )
    fig.update_traces(textposition="inside", textinfo="percent+label")
    fig.show()

In [22]:
# Chart 4: Geocoding Confidence
if len(geocache_df) > 0:
    geo_confidence = geocache_df.groupby("confidence").size().reset_index(name="count")
    
    fig = px.bar(
        geo_confidence,
        x="confidence",
        y="count",
        title="Geocoding Confidence Levels",
        labels={"confidence": "Confidence Level", "count": "Count"},
        color="confidence",
        color_discrete_map={"high": "#2ecc71", "medium": "#f1c40f", "low": "#e74c3c"}
    )
    fig.show()
else:
    print("No geocoding cache entries to visualize")

In [23]:
# Chart 5: Processing Results Summary (if we have results)
if results:
    status_data = {
        "Status": ["Successful", "Duplicates", "Failed"],
        "Count": [
            summary["successful"],
            summary["duplicates"],
            summary["failed"]
        ]
    }
    
    fig = px.bar(
        status_data,
        x="Status",
        y="Count",
        title="Receipt Processing Results",
        color="Status",
        color_discrete_map={
            "Successful": "#2ecc71",
            "Duplicates": "#f1c40f",
            "Failed": "#e74c3c"
        },
        text="Count"
    )
    fig.update_traces(textposition="outside")
    fig.update_layout(showlegend=False)
    fig.show()

## 8. Folium Map

Display extracted delivery locations on an interactive map.

In [24]:
# Create map with all order delivery locations
DEPOT_LAT=-34.73231090267173
DEPOT_LON=-58.295889556357935
# Load zone colors
with open(project_root / "data" / "geo" / "zones.json", "r", encoding="utf-8") as f:
    zones_colors = json.load(f)
zone_color_map = {zone_id: info["color"] for zone_id, info in zones_colors.items()}

# Create map
m = folium.Map(
    location=[DEPOT_LAT, DEPOT_LON],
    zoom_start=11,
    tiles="cartodbpositron"
)

# Add depot marker
folium.Marker(
    location=[DEPOT_LAT, DEPOT_LON],
    popup="<b>Eco-Bags Factory</b><br>Depot Location",
    tooltip="Factory Depot",
    icon=folium.Icon(color="black", icon="industry", prefix="fa")
).add_to(m)

# Add order delivery locations with client names
pending_orders = all_orders_df[all_orders_df["status"] == "pending"]
# Merge with clients to get business names
orders_with_clients = pending_orders.merge(
    all_clients_df[["client_id", "business_name"]],
    on="client_id",
    how="left"
)

for _, order in orders_with_clients.iterrows():
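    # Missing coordinates load from SQL as NaN, which is truthy, so test with pd.notna
    if pd.notna(order["delivery_latitude"]) and pd.notna(order["delivery_longitude"]):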
        zone_color = zone_color_map.get(order["delivery_zone_id"], "#808080")
        
        popup_html = f"""
        <b>Order: {order['order_id']}</b><br>
        <b>Client: {order['business_name']}</b><br>
        Zone: {order['delivery_zone_id']}<br>
        Packs: {order['quantity_packs']}<br>
        Pallets: {order['total_pallets']}<br>
        Status: {order['status']}
        """
        
        folium.CircleMarker(
            location=[order["delivery_latitude"], order["delivery_longitude"]],
            radius=8,
            popup=folium.Popup(popup_html, max_width=200),
            tooltip=order["order_id"],
            color=zone_color,
            fill=True,
            fill_color=zone_color,
            fill_opacity=0.7,
            weight=2
        ).add_to(m)

# Add legend
legend_html = """
<div style="position: fixed; bottom: 50px; left: 50px; z-index: 1000; 
            background-color: white; padding: 10px; border-radius: 5px;
            border: 2px solid grey; font-size: 12px;">
    <b>Zones</b><br>
    <i style="background: #FF6B6B; width: 12px; height: 12px; display: inline-block; border-radius: 50%;"></i> CABA<br>
    <i style="background: #4ECDC4; width: 12px; height: 12px; display: inline-block; border-radius: 50%;"></i> North Zone<br>
    <i style="background: #45B7D1; width: 12px; height: 12px; display: inline-block; border-radius: 50%;"></i> South Zone<br>
    <i style="background: #96CEB4; width: 12px; height: 12px; display: inline-block; border-radius: 50%;"></i> West Zone<br>
    <br><b>Markers</b><br>
    <i class="fa fa-industry" style="color: black;"></i> Factory Depot
</div>
"""
m.get_root().html.add_child(folium.Element(legend_html))

# Display map
m

In [25]:
# Save map
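map_output_path = project_root / "output" / "maps" / "delivery_locations.html"
map_output_path.parent.mkdir(parents=True, exist_ok=True)  # create output/maps if it does not exist yet
m.save(str(map_output_path))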
print(f"Map saved to: {map_output_path}")

Map saved to: c:\Users\Santi\Desktop\CV\portafolio\Eco-Bags-Delivery-Optimizer\output\maps\delivery_locations.html


## Summary

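This notebook demonstrated the receipt extraction pipeline end to end: Gemini 2.5 Flash reads each PDF, Pydantic validates the result, Nominatim geocodes the delivery address, and SQLite stores the order.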

### Key Achievements
- ✅ Configured Gemini 2.5 Flash for structured data extraction
- ✅ Extracted order data from PDF receipts with variable formats
- ✅ Matched extracted clients against existing database records
- ✅ Validated extracted orders with Pydantic schemas and stored them in SQLite
- ✅ Cached geocoding results for performance

### Pipeline Status

| Phase | Notebook | Status |
|-------|----------|--------|
| **Phase 1** | 01_base_data_setup | ✅ Complete |
| **Phase 2** | 02_receipt_extraction | ✅ Complete |
| **Phase 3** | 03_priority_score | ✅ Complete |
| **Phase 4** | 04_order_selector | ✅ Complete |
| **Phase 5** | 05_route_optimizer | ✅ Complete |

In [26]:
# Final database statistics
print("=" * 50)
print("DATABASE SUMMARY AFTER EXTRACTION")
print("=" * 50)

with db.get_session() as session:
    stats = {
        "Clients": session.query(ClientModel).count(),
        "New Clients": session.query(ClientModel).filter(ClientModel.is_new_client == True).count(),
        "Orders": session.query(OrderModel).count(),
        "Pending Orders": session.query(OrderModel).filter(OrderModel.status == "pending").count(),
        "Processed Receipts": session.query(ProcessedReceiptModel).count(),
        "Geocoding Cache Entries": session.query(GeocodingCacheModel).count(),
    }

for key, value in stats.items():
    print(f"{key}: {value}")

print("=" * 50)
print("Phase 2 complete!")

DATABASE SUMMARY AFTER EXTRACTION
Clients: 34
New Clients: 11
Orders: 41
Pending Orders: 26
Processed Receipts: 4
Geocoding Cache Entries: 12
Phase 2 complete!
