# 05 - Grounding DINO: Zero-Shot Auto-Annotation

Interactive demo of automatic annotation driven by natural-language prompts.

## What is Grounding DINO?
- A model that detects objects from **text descriptions**
- Needs no task-specific training or fine-tuning (zero-shot)
- Well suited to bootstrapping datasets quickly

## Contents
1. Installing and loading the model
2. Detection with a simple prompt
3. Experimenting with different prompts
4. Generating YOLO annotations
5. Comparison: YOLO-World vs Grounding DINO
6. Tips for better results
7. Cleanup

In [None]:
import sys
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image
import torch

PROJECT_ROOT = Path('..').resolve()
sys.path.insert(0, str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / 'data'

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## 1. Load Grounding DINO

We use the Hugging Face implementation, which is easy to install.

In [None]:
# Install if missing (uncomment if needed)
# !pip install transformers

from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

MODEL_NAME = "IDEA-Research/grounding-dino-tiny"

print(f"Loading model: {MODEL_NAME}")
print("(The first run downloads ~400MB)\n")

processor = AutoProcessor.from_pretrained(MODEL_NAME)
model = AutoModelForZeroShotObjectDetection.from_pretrained(MODEL_NAME)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

print(f"✅ Model loaded on: {device}")

## 2. Detection with a Simple Prompt

Let's try a sample image with the prompt for the VR pillars.

In [None]:
def detect_objects(image_path, prompt, threshold=0.3):
    """
    Detect objects in an image from a text prompt.

    Args:
        image_path: Path to the image
        prompt: Description of the object (e.g. "yellow and black checkered box")
        threshold: Minimum confidence (0.0-1.0), used as both box and text threshold

    Returns:
        boxes: Array of [x1, y1, x2, y2] pixel coordinates
        scores: Confidence score per box
        image: The loaded PIL image (RGB)
    """
    # Load the image
    image = Image.open(image_path).convert("RGB")
    w, h = image.size

    # IMPORTANT: Grounding DINO expects the prompt to end with a period
    prompt_with_dot = prompt if prompt.endswith(".") else prompt + "."

    # Preprocess
    inputs = processor(images=image, text=prompt_with_dot, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Post-process
    results = processor.post_process_grounded_object_detection(
        outputs,
        inputs["input_ids"],
        box_threshold=threshold,
        text_threshold=threshold,
        target_sizes=[(h, w)]
    )[0]
    
    return results["boxes"].cpu().numpy(), results["scores"].cpu().numpy(), image

print("detect_objects() defined ✅")
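
Grounding DINO prompts are conventionally lowercase and period-terminated, and several object phrases can share one prompt when separated by periods. The following is a minimal sketch of a helper that normalizes a list of phrases this way. `build_prompt` is a hypothetical name introduced for illustration; it is not part of this notebook or of `transformers`.

```python
def build_prompt(phrases):
    """Join object phrases into one lowercase, period-terminated prompt.

    Hypothetical helper for illustration: strips whitespace and trailing
    periods, lowercases each phrase, and drops empty entries.
    """
    cleaned = [p.strip().rstrip(".").strip().lower() for p in phrases]
    cleaned = [p for p in cleaned if p]
    if not cleaned:
        raise ValueError("at least one non-empty phrase is required")
    return ". ".join(cleaned) + "."


single = build_prompt(["Yellow and black checkered box"])
multi = build_prompt(["Box.", " warning pillar "])
print(single)
print(multi)
```

The result can be passed as `prompt` to `detect_objects()`. Note that `detect_objects()` as written returns only boxes and scores, so with a multi-phrase prompt the detections for different phrases cannot be told apart.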

In [None]:
def visualize_detections(image, boxes, scores, title="Detections"):
    """
    Draw the detections on top of the image.
    """
    fig, ax = plt.subplots(1, 1, figsize=(12, 8))
    ax.imshow(image)

    for box, score in zip(boxes, scores):
        x1, y1, x2, y2 = box
        w, h = x2 - x1, y2 - y1

        # Bounding rectangle
        rect = patches.Rectangle(
            (x1, y1), w, h,
            linewidth=2, edgecolor='lime', facecolor='none'
        )
        ax.add_patch(rect)

        # Confidence label
        ax.text(x1, y1 - 5, f"{score:.2f}", color='white', fontsize=10,
               fontweight='bold', bbox=dict(boxstyle='round', facecolor='lime', alpha=0.8))

    ax.set_title(f"{title} - {len(boxes)} detections")
    ax.axis('off')
    plt.tight_layout()
    plt.show()

    return len(boxes)

print("visualize_detections() defined ✅")

In [None]:
# Look for a sample image (sorted so the same image is picked on every run)
sample_images = sorted((DATA_DIR / 'dataset' / 'val' / 'images').glob('*.jpg'))
if not sample_images:
    sample_images = sorted((DATA_DIR / 'video_frames').glob('*.jpg'))

if sample_images:
    test_image = sample_images[0]
    print(f"Test image: {test_image.name}")

    # Detect with the prompt
    PROMPT = "yellow and black checkered box"
    THRESHOLD = 0.3
    print(f"Prompt: '{PROMPT}'")
    print(f"Threshold: {THRESHOLD}\n")

    boxes, scores, image = detect_objects(test_image, PROMPT, threshold=THRESHOLD)
    visualize_detections(image, boxes, scores, title=f"Prompt: '{PROMPT}'")

    # Show statistics
    if len(scores) > 0:
        print(f"\nStatistics:")
        print(f"  Detections: {len(scores)}")
        print(f"  Min confidence: {scores.min():.3f}")
        print(f"  Max confidence: {scores.max():.3f}")
        print(f"  Mean confidence: {scores.mean():.3f}")
else:
    print("⚠️ No sample images found")

## 3. Experimenting with Different Prompts

The prompt is **key** to good results. Let's try some variations.

In [None]:
# Different prompts to try
PROMPTS_TO_TEST = [
    "yellow and black checkered box",
    "checkered pillar",
    "yellow black striped cube",
    "warning pillar",
    "box",  # Very generic
]

if sample_images:
    test_image = sample_images[0]
    
    print("Comparing prompts:\n")
    print(f"{'Prompt':<40} {'Detections':>12} {'Mean Conf.':>12}")
    print("="*66)
    
    results_by_prompt = {}
    
    for prompt in PROMPTS_TO_TEST:
        boxes, scores, _ = detect_objects(test_image, prompt, threshold=0.25)
        
        n_det = len(scores)
        mean_conf = scores.mean() if n_det > 0 else 0
        
        results_by_prompt[prompt] = {'count': n_det, 'mean_conf': mean_conf}
        print(f"{prompt:<40} {n_det:>12} {mean_conf:>12.3f}")
    
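    # Pick a "best" prompt heuristically: most detections, ties broken by mean confidence.
    # A generic prompt can win on count alone, so confirm visually before trusting it.
    best = max(results_by_prompt.items(), key=lambda x: (x[1]['count'], x[1]['mean_conf']))
    print(f"\n✅ Best prompt (by detection count): '{best[0]}'")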

In [None]:
# Visualize the specific prompt vs a generic one
if sample_images:
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    prompts_compare = ["yellow and black checkered box", "box"]
    
    for ax, prompt in zip(axes, prompts_compare):
        boxes, scores, image = detect_objects(test_image, prompt, threshold=0.25)
        
        ax.imshow(image)
        for box, score in zip(boxes, scores):
            x1, y1, x2, y2 = box
            rect = patches.Rectangle(
                (x1, y1), x2-x1, y2-y1,
                linewidth=2, edgecolor='lime', facecolor='none'
            )
            ax.add_patch(rect)
            ax.text(x1, y1-5, f"{score:.2f}", color='white', fontsize=9,
                   bbox=dict(facecolor='lime', alpha=0.7))
        
        ax.set_title(f"'{prompt}'\n{len(boxes)} detections")
        ax.axis('off')
    
    plt.suptitle("Specific vs Generic", fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

## 4. Generating YOLO Annotations

Convert the detections to YOLO format for training.

In [None]:
def boxes_to_yolo(boxes, img_width, img_height, class_id=0):
    """
    Convert [x1, y1, x2, y2] pixel boxes to normalized YOLO format.

    YOLO format: class x_center y_center width height (all in [0, 1])
    """
    yolo_lines = []

    for box in boxes:
        x1, y1, x2, y2 = box

        # Clamp the corners to the image so the normalized box stays inside [0, 1]
        x1, x2 = max(0, min(img_width, x1)), max(0, min(img_width, x2))
        y1, y2 = max(0, min(img_height, y1)), max(0, min(img_height, y2))

        # Center
        x_center = (x1 + x2) / 2 / img_width
        y_center = (y1 + y2) / 2 / img_height

        # Size
        width = (x2 - x1) / img_width
        height = (y2 - y1) / img_height
        
        yolo_lines.append(f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}")
    
    return yolo_lines

# Example
if sample_images:
    boxes, scores, image = detect_objects(test_image, "yellow and black checkered box", 0.3)
    w, h = image.size

    yolo_lines = boxes_to_yolo(boxes, w, h, class_id=0)

    print("Generated YOLO format:")
    print("-" * 60)
    for line in yolo_lines:
        print(line)
    print("-" * 60)
    print(f"\nSave to: {test_image.stem}.txt")
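
To sanity-check the conversion, it can help to map a label line back to pixel coordinates and compare it with the original box. The sketch below does that for a hand-computed example. `yolo_to_xyxy` is a hypothetical helper, not part of the notebook, and the image size and box are made-up values.

```python
def yolo_to_xyxy(line, img_width, img_height):
    """Parse one YOLO label line into (class_id, [x1, y1, x2, y2]) in pixels."""
    class_id, xc, yc, w, h = line.split()
    xc, w = float(xc) * img_width, float(w) * img_width
    yc, h = float(yc) * img_height, float(h) * img_height
    return int(class_id), [xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2]


# Illustrative values: the box [100, 50, 300, 150] in a 640x480 image
# has center (200, 100) and size 200x100, which normalizes to this line.
line = "0 0.312500 0.208333 0.312500 0.208333"
class_id, box = yolo_to_xyxy(line, 640, 480)
print(class_id, [round(v, 1) for v in box])
```

Because the label stores six decimals, the recovered corners match the original box only up to a small rounding error.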

In [None]:
def annotate_batch(image_paths, prompt, output_dir, threshold=0.3, class_id=0):
    """
    Annotate a batch of images and save the YOLO labels.
    """
    from tqdm.notebook import tqdm

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    stats = {'total': 0, 'with_detections': 0, 'total_boxes': 0}

    for img_path in tqdm(image_paths, desc="Annotating"):
        img_path = Path(img_path)  # Accept plain strings as well as Path objects
        try:
            boxes, scores, image = detect_objects(img_path, prompt, threshold)
            w, h = image.size

            yolo_lines = boxes_to_yolo(boxes, w, h, class_id)

            # Save the label file (empty when there are no detections)
            label_path = output_dir / f"{img_path.stem}.txt"
            with open(label_path, 'w') as f:
                f.write('\n'.join(yolo_lines))

            stats['total'] += 1
            if len(boxes) > 0:
                stats['with_detections'] += 1
                stats['total_boxes'] += len(boxes)

        except Exception as e:
            print(f"Error in {img_path.name}: {e}")

    return stats

print("annotate_batch() defined ✅")

In [None]:
# Demo: annotate a few images
if sample_images:
    demo_images = sample_images[:5]  # Only 5 for the demo
    output_demo = PROJECT_ROOT / 'data' / 'demo_labels'

    print(f"Annotating {len(demo_images)} demo images...\n")

    stats = annotate_batch(
        demo_images,
        prompt="yellow and black checkered box",
        output_dir=output_demo,
        threshold=0.3
    )

    print(f"\n✅ Results:")
    print(f"   Images processed: {stats['total']}")
    print(f"   With detections: {stats['with_detections']}")
    print(f"   Total boxes: {stats['total_boxes']}")
    # Guard against division by zero if every image failed
    if stats['total'] > 0:
        print(f"   Average per image: {stats['total_boxes']/stats['total']:.1f}")
    print(f"\n   Labels saved to: {output_demo}")
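
Before training on auto-generated labels, a quick structural check of each file can catch malformed lines early. The sketch below assumes the five-field, normalized format described above. `validate_yolo_label` is a hypothetical helper, not part of the notebook or of any YOLO tooling.

```python
from pathlib import Path


def validate_yolo_label(path):
    """Return a list of problems found in one YOLO label file (empty if none)."""
    problems = []
    text = Path(path).read_text()
    for n, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # Blank lines (or an empty file) mean "no objects"
        parts = line.split()
        if len(parts) != 5:
            problems.append(f"line {n}: expected 5 fields, got {len(parts)}")
            continue
        try:
            int(parts[0])
            values = [float(v) for v in parts[1:]]
        except ValueError:
            problems.append(f"line {n}: non-numeric field")
            continue
        if any(not 0.0 <= v <= 1.0 for v in values):
            problems.append(f"line {n}: coordinate outside [0, 1]")
    return problems
```

It could be run over every `.txt` file in the output folder, for example with `Path(output_dir).glob('*.txt')`, printing only the files that report problems.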

## 5. Comparison: YOLO-World vs Grounding DINO

Why did we choose Grounding DINO?

In [None]:
# Try to load YOLO-World for comparison
try:
    from ultralytics import YOLO

    print("Loading YOLO-World...")
    yolo_world = YOLO('yolov8s-worldv2.pt')
    yolo_world.set_classes(["yellow and black checkered box"])

    HAS_YOLO_WORLD = True
    print("✅ YOLO-World loaded")
except Exception as e:
    HAS_YOLO_WORLD = False
    print(f"⚠️ YOLO-World not available: {e}")
    print("   The comparison will only show previously recorded results")

In [None]:
# Comparison table of previously recorded results (from DIARY.md)
import pandas as pd

comparison_data = {
    'Model': ['YOLO-World', 'Grounding DINO'],
    'Prompt': ['yellow black checkered cube', 'yellow and black checkered box'],
    'Coverage': ['8.5%', '100%'],
    'Detections (47 imgs)': [6, 259],
    'Confidence': ['0.05-0.09', '0.25-0.69'],
    'Speed': ['~25 img/s', '~2 img/s'],
}

df_comparison = pd.DataFrame(comparison_data)
print("\n" + "="*70)
print("COMPARISON: YOLO-World vs Grounding DINO")
print("="*70)
display(df_comparison)

print("\n💡 Conclusion:")
print("   - YOLO-World is faster but rarely detects our VR-specific objects")
print("   - Grounding DINO yields ~43x more detections (259 vs 6) for our use case")
print("   - Likely cause: CLIP (the basis of YOLO-World) was trained on real photos, not VR")
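
The "43x" figure is the ratio of total detections in the table. The arithmetic below reproduces it from the recorded numbers. The per-model image counts are inferred from the reported coverage percentages rather than stated in the source, so treat them as an assumption.

```python
n_images = 47
detections = {"YOLO-World": 6, "Grounding DINO": 259}
coverage = {"YOLO-World": 0.085, "Grounding DINO": 1.0}  # Reported coverage

# Ratio of total detections between the two models
ratio = detections["Grounding DINO"] / detections["YOLO-World"]

# Images with at least one detection, inferred from the coverage percentages
images_covered = {m: round(c * n_images) for m, c in coverage.items()}

# Mean number of boxes per image over the whole 47-image set
boxes_per_image = {m: d / n_images for m, d in detections.items()}

print(f"Detection ratio: {ratio:.1f}x")
print(f"Images covered: {images_covered}")
print({m: round(v, 2) for m, v in boxes_per_image.items()})
```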

In [None]:
# Comparison chart
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Detections
ax1 = axes[0]
models = ['YOLO-World', 'Grounding DINO']
detections = [6, 259]
colors = ['#e74c3c', '#2ecc71']
bars = ax1.bar(models, detections, color=colors, edgecolor='black')
ax1.set_ylabel('Detections')
ax1.set_title('Total Detections (47 images)')
for bar, det in zip(bars, detections):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5,
             str(det), ha='center', fontweight='bold')

# Coverage
ax2 = axes[1]
coverage = [8.5, 100]
bars = ax2.bar(models, coverage, color=colors, edgecolor='black')
ax2.set_ylabel('Coverage (%)')
ax2.set_title('Images with Detections')
ax2.set_ylim(0, 110)
for bar, cov in zip(bars, coverage):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 2,
             f"{cov}%", ha='center', fontweight='bold')

plt.suptitle('Why Grounding DINO?', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 6. Tips for Better Results

### Prompt Guide

| Type | Example | When to use |
|------|---------|-------------|
| Descriptive | "yellow and black checkered box" | Object with a specific pattern |
| Material | "wooden crate" | Distinctive texture |
| Function | "warning pillar" | Object with a well-known purpose |
| Color | "red cube" | Color is the main distinguishing feature |

### Threshold Tuning

| Threshold | Behavior |
|-----------|----------|
| 0.1 - 0.2 | Many detections, possible false positives |
| 0.25 - 0.35 | **Recommended** for initial annotation |
| 0.4 - 0.5 | More conservative, may miss objects |
| 0.6+ | Very strict, only high-confidence detections |
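
One way to explore these ranges cheaply is to run detection once at a low threshold and then filter the returned scores, rather than re-running the model for each value. The sketch below shows that filtering step on made-up boxes and scores. `filter_by_score` is a hypothetical helper. It filters on the box score only, so counts may differ slightly from a fresh run in which the text threshold also changes.

```python
import numpy as np


def filter_by_score(boxes, scores, min_score):
    """Keep only the detections whose score is at least min_score."""
    keep = scores >= min_score
    return boxes[keep], scores[keep]


# Made-up detections for illustration (not real model output)
boxes = np.array([
    [10, 10, 50, 50],
    [20, 30, 80, 90],
    [5, 5, 15, 15],
    [100, 100, 200, 220],
], dtype=float)
scores = np.array([0.62, 0.31, 0.18, 0.27])

# Number of detections that survive each candidate threshold
counts = {t: len(filter_by_score(boxes, scores, t)[1]) for t in (0.15, 0.25, 0.35, 0.50)}
print(counts)
```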

In [None]:
# Effect of the threshold
if sample_images:
    thresholds = [0.15, 0.25, 0.35, 0.50]
    
    fig, axes = plt.subplots(1, 4, figsize=(16, 4))
    
    for ax, thresh in zip(axes, thresholds):
        boxes, scores, image = detect_objects(test_image, "yellow and black checkered box", thresh)
        
        ax.imshow(image)
        for box in boxes:
            x1, y1, x2, y2 = box
            rect = patches.Rectangle(
                (x1, y1), x2-x1, y2-y1,
                linewidth=2, edgecolor='lime', facecolor='none'
            )
            ax.add_patch(rect)
        
        ax.set_title(f"Threshold: {thresh}\n{len(boxes)} detections")
        ax.axis('off')
    
    plt.suptitle('Effect of the Threshold on Detections', fontsize=12, fontweight='bold')
    plt.tight_layout()
    plt.show()

## 7. Cleanup

Delete the demo files if they were created.

In [None]:
# Clean up the demo folder (optional)
import shutil

demo_dir = PROJECT_ROOT / 'data' / 'demo_labels'
if demo_dir.exists():
    response = input("Delete the demo_labels folder? (y/n): ")
    if response.strip().lower() == 'y':
        shutil.rmtree(demo_dir)
        print("✅ demo_labels folder deleted")
    else:
        print("Folder kept")

---

## Next Steps

1. **Use the full script** to annotate an entire dataset:
   ```bash
   python scripts/auto_annotate_grounding_dino.py \
       --source data/video_frames/ \
       --prompt "yellow and black checkered box" \
       --output data/dataset_auto/ \
       --threshold 0.3
   ```

2. **Review the annotations** with the Annotation Reviewer in the GUI:
   ```bash
   python app.py
   # Tab: Annotations
   ```

3. **Train a model** on the new dataset:
   ```bash
   python scripts/train.py
   ```