### 🦖 Grounding DINO – Object Detection Through Language

Grounding DINO is an **open-vocabulary object detection model** that locates and labels objects in an image from free-form natural language. Unlike traditional object detectors trained on fixed label sets, Grounding DINO lets you name **any objects you want to detect** in a text prompt.

This notebook launches a **Gradio app** where you can upload an image and enter a prompt describing the objects you're interested in — for example:  
```
cat. dog. person.
```

👉 **Important:** Separate your target labels with **periods (`.`)**.  
The model will use this prompt to find and draw boxes around the matching objects in your image.
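
If your labels live in a Python list rather than in typed text, the period-separated format can be built programmatically. The following is a minimal sketch: `build_prompt` is a hypothetical helper introduced here for illustration, not part of Grounding DINO or `transformers`.

```python
def build_prompt(labels):
    """Join labels into a lowercase, period-separated prompt (hypothetical helper)."""
    cleaned = []
    for label in labels:
        # Drop surrounding whitespace and any periods the caller already added
        label = label.strip().strip(".").strip().lower()
        if label:
            cleaned.append(label)
    return " ".join(f"{label}." for label in cleaned)


print(build_prompt(["Cat", "dog ", "Remote Control."]))  # cat. dog. remote control.
```

The resulting string can be pasted into the prompt box as-is.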

Try uploading your own photos or test images, and experiment with both simple and unusual prompts. You can detect everyday objects — or test how the model handles rare or unexpected terms!

In [None]:
import torch
from PIL import Image, ImageDraw
import gradio as gr
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

# Load model & processor
model_id = "IDEA-Research/grounding-dino-base"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

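def detect_and_draw(image, prompt):
    """Run Grounding DINO on `image` and draw a box for each detected label."""
    if image is None or not prompt or not prompt.strip():
        return image  # Nothing to detect without both an image and a prompt

    # Grounding DINO expects lowercase labels, each terminated by a period
    prompt = prompt.strip().lower()
    if not prompt.endswith("."):
        prompt += "."  # Add dot if missing

    image = image.convert("RGB")  # convert() returns a copy, so the upload itself is not drawn on
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)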

    with torch.no_grad():
        outputs = model(**inputs)

    results = processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        box_threshold=0.4,
        text_threshold=0.3,
        target_sizes=[image.size[::-1]]  # PIL size is (width, height); post-processing expects (height, width)
    )[0]

    draw = ImageDraw.Draw(image)
    for box, label, score in zip(results["boxes"], results["labels"], results["scores"]):
        box = box.tolist()  # [xmin, ymin, xmax, ymax] in pixel coordinates
        draw.rectangle(box, outline="red", width=3)
        draw.text((box[0], box[1]), f"{label} ({score:.2f})", fill="white")

    return image

# Add explanation under prompt input
label_prompt = gr.Textbox(
    label="Text Prompt",
    placeholder="e.g., a cat. a remote control.",
    info="⚠️ Grounding DINO expects lowercase labels that each end with a period (e.g., 'a cat. a remote control.'). The app lowercases the prompt and appends a missing final period for you."
)

# Launch the interface (share=True also creates a temporary public link)
gr.Interface(
    fn=detect_and_draw,
    inputs=[
        gr.Image(type="pil", label="Upload an Image"),
        label_prompt
    ],
    outputs=gr.Image(type="pil", label="Detected Image"),
    title="Grounding DINO Object Detector",
    description="Detect objects using natural-language prompts. Separate object names with periods; the prompt is lowercased automatically."
).launch(server_name="0.0.0.0", server_port=8080, share=True)
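
To inspect detections as text instead of drawn boxes, the per-image `results` returned inside `detect_and_draw` can be turned into readable lines. This is only a sketch under stated assumptions: `summarize_detections` and its `min_score` parameter are hypothetical names, and the sample values below are made up so the snippet runs without the model.

```python
def summarize_detections(boxes, labels, scores, min_score=0.0):
    """Return one text line per detection, highest score first (hypothetical helper)."""
    rows = [
        (float(score), label, [round(float(v)) for v in box])
        for box, label, score in zip(boxes, labels, scores)
        if float(score) >= min_score
    ]
    rows.sort(key=lambda row: row[0], reverse=True)
    return [f"{label}: {score:.2f} at {box}" for score, label, box in rows]


# Plain lists stand in for the tensors in `results`; these values are invented
lines = summarize_detections(
    boxes=[[10.2, 20.7, 110.0, 220.4], [5.0, 5.0, 50.0, 60.0]],
    labels=["a cat", "a remote control"],
    scores=[0.45, 0.82],
    min_score=0.4,
)
print("\n".join(lines))
```

With real model output, one option is to pass `results["boxes"].tolist()`, `results["labels"]`, and `results["scores"].tolist()`; each box is `[xmin, ymin, xmax, ymax]` in pixels.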