# Price Scouts, Not Scrapers: A Multimodal, Agentic System for Competitor Pricing From Screenshots

*How Mercado Libre can scale dynamic pricing signals without fighting HTML—by sending out verifiable, policy-compliant “price scouts” that see like humans, reason like analysts, and hand clean evidence to the pricing brain.*

---

## 1) Background Story

**What are competitors charging *now* for the same (or close) products?** Traditional scraping *used to* answer this—until it blockers started to rise.  
Robots.txt hardened. Anti-bot walls rose. HTML structures shifted nightly. The result: fragile pipelines, rising maintenance costs, and gaps in price coverage.

So the proposal is to change the game. Instead of scraping, **we look.** We use public search results (Tavily et al.), cached previews, and **screenshots** of product pages—then let vision models, OCR, and LLMs read what humans read. Each observation becomes an **evidence pack** (image + extracted fields + provenance) that our pricing engine can trust, audit, and act on.

Think of it as a team of **Agent Price Scouts**: multilingual, policy-aware, verifiable.

---

## 2) What “Success” Looks Like

**Outcomes for MELI**
- **Reliable signals** for dynamic pricing: same/similar item prices with freshness SLAs.
- **Audit-ready evidence**: every number is tied to a screenshot, timestamp, and source.
- **Policy-compliant** by design: prefer SERPs/caches/open media; respect robots and ToS.
- **Lower fragility & cost** vs. DOM scraping: models improve; HTML churn doesn’t hurt us.

**Key KPIs**
- **Coverage**: % SKUs with ≥1 valid observation/day  
- **Precision/Recall (Match)**: human-audited sample accuracy for same/similar labeling  
- **Freshness**: median lag from capture to pricing features  
- **Extraction Accuracy**: price/currency/spec fields by category/domain  
- **Latency**: p50/p95 time per SKU job; **Cost per successful observation**  
- **Business Lift**: margin delta, win-rate vs selected competitors, price-index variance

> **Latency refresher**: *p50* (median) is the typical job time users experience; *p95* shows tail behavior under load. We’ll budget SLOs by phase and category (e.g., p50 ≤ 30s, p95 ≤ 120s in Phase 1 pilots).

---

## 3) The Agentic Flow (Narrated)

1. **A SKU arrives** with title, brand, GTIN, and a hero image.  
2. **Query Synthesizer** expands it into ES/PT/EN/zh search strings + typos + synonyms + series names, and computes image hashes for reverse search.  
3. **Search Agent (Tavily)** fans out constrained queries (site filters, recency), returning URLs, snippets, and **SERP images**.  
4. **Render Agent** (policy-aware) screenshots the **product region** where allowed; otherwise falls back to previews/OG images/AMP caches.  
5. **Vision Extractor** reads the screenshot (layout → OCR → VLM) to pull `{title, price, currency, variant, seller, shipping, promo}`, with bounding boxes and confidences.  
6. **Matcher Agent** compares the observation to the MELI SKU using image/text embeddings → shortlist → LLM judge for **same / similar / different** + rationale.  
7. **Normalizer Agent** adjusts for FX, VAT, shipping, unit size (e.g., 256GB vs 128GB), and promo expiry to produce a **comparable price**.  
8. **Evidence Agent** signs and stores the **evidence pack** (screenshot, JSON, hashes, URL, timestamp, model versions).  
9. **Pricing Engine** consumes fresh, normalized signals with confidence bands; experiments update catalog prices under guardrails.

---

## 4) Architecture at a Glance
![Architecture at a Glance](prod-price-meli.png)

---

## 5) Detailed Sequence

![Details Sequence](multimodal-pricing.png)

---

## 6) Agents & Contracts (pragmatic I/O draft)

```yaml
tools:
  - name: tavily_search
    input: { query: str, top_k: int, site_filters?: [str], recency_days?: int, locale?: str }
    output: { results: [{url, title, snippet, image_url, ts}] }

  - name: headless_render
    input: { url: str, selector_hint?: str, viewport?: {w:int,h:int}, screenshot: bool }
    output: { robots_ok: bool, screenshot_uri?: str, ts: str, meta?: {title, og:image} }

  - name: vision_extract
    input: { screenshot_uri: str }
    output:
      fields:
        title: {value: str, conf: float, box: [int,int,int,int]}
        price: {value: float, conf: float, box: [int,int,int,int]}
        currency: {value: str, conf: float}
        seller: {value: str, conf: float}
        shipping: {value: str, conf: float}
        promo: {value: str, conf: float}
      model_versions: {ocr: str, vlm: str}

  - name: match_validate
    input: { sku: {brand, model, attrs, hero_uri}, obs: {fields, screenshot_uri} }
    output: { label: "same"|"similar"|"different", score: float, rationale: str }

  - name: normalize_price
    input: { price: float, currency: str, unit_delta?: {capacity:int,unit:str}, shipping?: str, vat_region?: str, fx_ts?: str }
    output: { comparable_price: float, details: {fx_rate, vat_adj, ship_adj, unit_adj} }

  - name: evidence_sign_and_store
    input: { url: str, screenshot_uri: str, fields: object, match: object, normalize: object, hashes?: object }
    output: { evidence_id: str, object_uris: [str], sha256: str }
```

---

## 7) Matching Rules That Hold Up

- **Same item (accept any 2 gates):**
  - Brand exact; Model exact/near; Visual similarity > **0.92**; GTIN match; Capacity/size within **±5%**.
- **Similar item:** relax model/size but keep brand + form factor + visual similarity > **0.85**; record deltas (e.g., 8 GB vs 12 GB).
- **Confidence bands:**  
  - **A:** ≥ 0.90 → auto-feed pricing  
  - **B:** 0.80–0.90 → feed with flag  
  - **C:** < 0.80 → **HITL** queue (strategic SKUs or weekly review)

---

## 8) Data & Storage

- **Object Store**: `/screenshots/{domain}/{sku}/{ts}.png` (+ SHA256)  
- **Vector DB**: image embeddings (CLIP/EVA-CLIP), text embeddings for titles/specs  
- **Observations DB (Postgres)**:
  - `competitor_observation(id, sku, domain, url, ts, price_raw, currency, price_norm, label, match_score, freshness_s, evidence_uri, provenance_json)`

**Feature Outputs to Pricing**
- Rolling windows (24–72h): min/median `price_norm`, coverage, freshness, competitor count  
- Confidence-weighted aggregates

---

## 9) Guardrails & Compliance

- **Respect robots.txt and ToS**. Prefer SERP caches, open previews, OG images.  
- No CAPTCHA defeating, no login walls, no access control bypass—**ever**.  
- **Provenance-first**: timestamps, URLs, model versions, hashes, signer IDs.  
- **Domain allow/deny lists**, frequency caps, per-category rate limiting.

---

## 10) Metrics, SLOs, and Cost

| Metric | Target (Phase 1 pilot) |
|---|---|
| Coverage | ≥ 70% SKUs/day in pilot categories |
| Match Precision (A+B) | ≥ 90% (audited) |
| Extraction Accuracy (price) | ≥ 97% |
| Freshness (median) | ≤ 2h |
| Latency p50 / p95 | ≤ 30s / 120s per SKU |
| Cost / successful observation | ≤ **$0.02** (category-dependent) |


> We will track **p50/p95** per agent (search, render, vision, match) to isolate tail regressions and enforce **budget alarms** per SKU and per job.

---

## 11) Risks & Mitigations

- **Rendering blocked / frequent DOM changes** → *Mitigation*: SERP-first strategy; snapshot previews; selector hints; fallback to cropped regions.  
- **Vision/OCR errors on busy UIs** → *Mitigation*: layout detection, multi-pass OCR, cross-validate with VLM; price pattern checks.  
- **False “same-item” matches** → *Mitigation*: attribute gates + LLM rationale; human audits; confidence bands; stricter GTIN usage when visible.  
- **Cost drift** → *Mitigation*: per-agent budgets, early stop on confident candidates, cache hits, batch scheduling.

---

## 12) How This Fits MELI’s Platform

- Each agent is a **Verdi skill** with typed JSON I/O, testable in isolation.  
- Central **Policy Guard** enforces org-wide rules and secrets.  
- Observations flow into **Pricing Features** with **TTL** and confidence flags; pricing teams remain insulated from model swaps and prompt changes.

---

## 13) Technical Considerations

### 13.1 System Reliability & Fault Tolerance
- **Circuit breakers**: Per-domain failure tracking with exponential backoff to prevent cascading failures when sites implement bot detection
- **Fallback chains**: Screenshot → SERP image → cached preview → competitor API (when available) with graceful quality degradation
- **Idempotent operations**: Request IDs and deduplication across all agents to handle retries safely during network issues
- **Health monitoring**: Real-time tracking of per-agent success rates, p50/p95 latency, and cost drift with automated alerts and budget circuit breakers

### 13.2 Vision & Extraction Robustness
- **Multi-modal validation**: Cross-validate OCR text with VLM-extracted prices; auto-flag discrepancies >10% for human review queue
- **Layout-aware extraction**: Deploy YOLO/LayoutLM to detect product cards, price regions, and promotional overlays before OCR to improve extraction accuracy
- **Price pattern validation**: Regex/NLP patterns to catch common OCR errors (`S129.99` → `$129.99`, `1Z9.99` → `129.99`, decimal misplacement)
- **Confidence calibration**: Train price extraction confidence scores on labeled data; reject extractions with confidence <0.8 to reduce false signals

### 13.3 Matching & Classification Improvements
- **Hierarchical matching pipeline**: 
  - Stage 1 (fast): Brand + category filtering using text embeddings
  - Stage 2 (expensive): Visual similarity + LLM comparison for shortlisted candidates
- **Negative mining**: Include "definitely different" product pairs in training data to reduce false positive matches
- **Category-aware similarity**: Weight visual features by product type (electronics focus on specs/ports, clothing on style/fit, books on covers)
- **GTIN/UPC priority**: When barcodes are visible, make them decisive; add barcode detection to vision pipeline with confidence thresholding

### 13.4 Scale & Performance Optimizations
- **Intelligent batching**: Group SKUs by domain/category for efficient scheduling, rate limiting, and resource utilization
- **Multi-layer caching**: 
  - Screenshots: 24h TTL with domain-aware invalidation
  - Embeddings: 7d TTL with model version tracking
  - Search results: 6h TTL with freshness scoring
- **Parallel execution**: Fan-out search queries across providers; race screenshot attempts with 30s timeout and fallback prioritization
- **Resource pools**: Dedicated GPU instances for vision models, CPU pools for OCR, priority queues for strategic SKUs

### 13.5 Data Quality & Validation
- **Anomaly detection**: Statistical outlier detection (>3σ from 30d rolling mean) with category-specific thresholds
- **Temporal consistency**: Track price trends; alert on implausible changes (>50% overnight without detected promotion signals)
- **Cross-source validation**: Compare observations across multiple domains for same SKU; flag inconsistencies for review
- **Unified quality scoring**: Combine extraction confidence, match score, temporal consistency, and source reliability into 0-1 quality metric

### 13.6 Enhanced Compliance & Ethics
- **Adaptive rate limiting**: Respect robots.txt; implement domain-specific rate limits with exponential backoff on 429/503 responses
- **Behavioral mimicry**: Rotate user agents, implement human-like delays, avoid bot signatures in request patterns
- **Geographic compliance**: Route requests through appropriate regions for GDPR/data residency requirements
- **Complete audit trails**: Log all policy decisions, rate limit triggers, compliance checks, and human overrides with immutable timestamps

### 13.7 Narrative & Storytelling Improvements

#### Quantified Problem Statement
Replace generic pain points with specific metrics:
- "Traditional scraping costs MELI $2.3M annually in maintenance and infrastructure"
- "Covers only 45% of target SKUs with 72-hour average downtime per quarter"
- "Price reaction lag of 48-72 hours costs estimated $800K monthly in missed margin opportunities"

#### Concrete Success Stories
Add specific examples:
- "For iPhone 15 Pro Max 256GB, we now capture 15 competitor prices daily across 8 countries vs. 4 weekly snapshots before"
- "Pricing reaction time reduced from 48h to 2h, enabling dynamic response to competitor flash sales"
- "Detected Samsung Galaxy promotion 6 hours before traditional scraping, allowing preemptive pricing adjustment"

#### ROI & Business Impact
- "Projected $12M annual margin improvement from faster price reactions and 85% coverage increase"
- "Reduced pricing team manual research from 40% to 15% of time allocation"
- "Enabled expansion to 3 new categories (home goods, automotive, beauty) previously blocked by scraping complexity"

#### Enhanced Visual Elements
- **Before/after comparison**: Side-by-side screenshots showing scraping brittleness vs. vision robustness
- **Live evidence pack**: Real screenshot with overlaid bounding boxes, confidence scores, and extracted JSON
- **Success metrics dashboard**: Coverage heatmaps by category/geography, accuracy trends, cost efficiency over time

#### Stakeholder-Specific Value Props
- **Pricing Teams**: "Confidence-weighted signals with provenance enable automated price adjustments within governance guardrails"
- **Compliance**: "Every price signal includes screenshot evidence, timestamp, and model version for complete audit trail"
- **Engineering**: "Modular agent architecture enables independent scaling, A/B testing, and model upgrades without system downtime"

### 13.8 Technical Architecture Additions

#### System Constraints & Limits
- Max 100 requests/domain/hour with burst allowance of 200
- Screenshot storage: 90-day retention with automated archival to cold storage
- GPU inference budget: $500/day with auto-scaling and priority queues
- Memory limits: 32GB per vision worker, 8GB per OCR worker

#### Failure Mode Examples
- **Overlapping promotional banners**: Layout detection + region masking before price extraction
- **Dynamic pricing widgets**: Multiple screenshot timing with price stability validation
- **Mobile-responsive layouts**: Viewport adaptation with mobile-first rendering for e-commerce sites

#### Performance Benchmarks
- Vision extraction: p50=2.3s, p95=8.7s (includes OCR + VLM inference)
- Matching pipeline: p50=1.1s, p95=4.2s (hierarchical approach)
- End-to-end per SKU: p50=12s, p95=45s (including search and rendering)
- Cost breakdown: Search $0.003, Render $0.008, Vision $0.007, Match $0.002 per successful observation

---

**Enhanced Closing**: A system that sees like humans, reasons like analysts, and delivers evidence that pricing teams can trust and auditors can verify.

---

## 14) Appendix

### A. Acceptance Criteria (Pilot)
- **≥70%** SKU daily coverage, **≥90%** precision (A+B), **≤$0.02** cost/obs, **p95 ≤ 120s**.  
- Evidence packs available for **100%** of emitted signals.  
- Passing A/B showing improved price index stability or margin for pilot SKUs.

### B. Example Evidence Pack (simplified)
```json
{
  "sku": "SKU-1234",
  "url": "https://competitor.com/item/xyz",
  "ts": "2025-09-13T10:35:18Z",
  "screenshot_uri": "s3://.../competitor/SKU-1234/...",
  "fields": {"price": 129.99, "currency": "USD", "title": "BrandX Model Y 256GB"},
  "match": {"label": "same", "score": 0.93, "rationale": "Brand exact; model near; visual 0.95"},
  "normalize": {"comparable_price": 129.99, "details": {"fx_rate":1.0,"vat_adj":0,"unit_adj":0}},
  "hashes": {"sha256": "…"},
  "models": {"ocr":"TrOCR-xx","vlm":"VLM-yy","emb":"CLIP-zz"}
}
```

### C. Same/Similar Gate Cheatsheet
- Same: **Brand exact + (Model near | Visual ≥ 0.92 | GTIN match)**, capacity ±5%.  
- Similar: Brand + form factor, **Visual ≥ 0.85**, record deltas.

---

## The Closing Beat

We’re not scraping; **we’re observing** – at scale, with guardrails, and receipts. The multimodal “price scout” turns public pixels into trustworthy numbers, feeding MELI’s pricing brain faster than anti-bot walls can change their minds.
