[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1omtfPT5WueHxXk_sqO452eV0-Ex76K1U)

Author:
- **Safouane El Ghazouali**
- Ph.D. in AI
- Senior data scientist and researcher at TOELT LLC
- Lecturer at HSLU

---

# Hands-on: GroundingDINO for Zero-Shot Object Detection

Welcome to this hands-on notebook on zero-shot object detection with GroundingDINO. GroundingDINO pairs the DINO transformer detector with GLIP-style grounded pre-training, so it can detect objects described by free-form text prompts. This enables open-vocabulary detection without training on the target classes.

![GroundingDINO Example](https://huggingface.co/IDEA-Research/grounding-dino-base/resolve/main/demo_image.jpg)

### Why Use GroundingDINO?
- **Zero-Shot Detection**: Detect arbitrary objects via text prompts like "red car. person in blue shirt".
- **Open-Vocabulary**: Handles unseen classes without labeled data.
- **Flexibility**: Useful for custom detection in robotics, surveillance, etc.
- **Hugging Face Integration**: Easy loading via the Transformers library.

### What You'll Learn
- Loading the GroundingDINO model from Hugging Face.
- Performing inference on single images from URLs.
- Processing outputs: bounding boxes, confidence scores, and matched phrases.
- Visualizing detections with boxes and labels.
- Adjusting thresholds and using multiple prompts.
- Batch processing and analysis.

# 🧰 Environment Setup

Install Transformers and dependencies for image processing and visualization.

In [None]:
!pip install -q transformers requests pillow opencv-python matplotlib

### Import Libraries

Import modules for model loading, image handling, and drawing.

In [None]:
import requests
import torch
from PIL import Image
from io import BytesIO
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
import cv2
import matplotlib.pyplot as plt
import numpy as np

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using device: {device}')

# 📚 Understanding GroundingDINO

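GroundingDINO detects objects by grounding text descriptions to image regions. Key points:
- Prompts: Lowercase phrases, each ending with '.' (e.g., "cat. remote control.").
- Outputs: Bounding boxes, confidence scores, and matched phrases.
- Thresholds: `box_threshold` (0.35 in this notebook) drops low-confidence boxes, and `text_threshold` (0.25 in this notebook) controls which prompt tokens are kept in each box's label.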
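
The prompt convention above can be applied automatically when class names come from a list. The helper below is a minimal sketch; `build_prompt` is a hypothetical name introduced here for illustration and is not part of Transformers.

```python
def build_prompt(class_names):
    """Join class names into one lowercase prompt, each phrase ending with '.'."""
    phrases = []
    for name in class_names:
        # Normalize case and whitespace, and drop any trailing period
        phrase = name.strip().lower().rstrip(".").strip()
        if phrase:  # Skip empty entries
            phrases.append(phrase + ".")
    return " ".join(phrases)


prompt = build_prompt(["Cat", "Remote Control ", "sofa."])
print(prompt)  # cat. remote control. sofa.
```

The resulting string can be passed as the `text` argument of the processor.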

Reference: [Hugging Face Model Card](https://huggingface.co/IDEA-Research/grounding-dino-base)

# 📦 Loading the Model

Load the processor and model from Hugging Face.

In [None]:
model_id = "IDEA-Research/grounding-dino-base"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
print('GroundingDINO loaded!')

# 🖼️ Zero-Shot Detection on a Single Image

Download an image from a URL, prepare a prompt, and run detection.

In [None]:
# Sample image URL (COCO example with cats and remotes)
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
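response = requests.get(image_url, timeout=30)
response.raise_for_status()  # Fail early if the download did not succeed
image = Image.open(BytesIO(response.content)).convert("RGB")  # Ensure 3 channels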

# Display original image
plt.imshow(image)
plt.title('Sample Image')
plt.axis('off')
plt.show()

# Text prompt (lowercase, each phrase ends with '.')
text = "a cat. a remote control."

# Prepare inputs
inputs = processor(images=image, text=text, return_tensors="pt").to(device)

# Inference
with torch.no_grad():
    outputs = model(**inputs)

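# Post-process (some newer Transformers releases name the first threshold `threshold`
# instead of `box_threshold`; check the docs for your installed version)
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,   # Minimum confidence score for a box to be kept
    text_threshold=0.25,  # Minimum token score for a word to appear in the label
    target_sizes=[image.size[::-1]]  # PIL gives (width, height); (height, width) is expected
)[0]  # Take first (only) result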

print('Detection Results:')
print(results)
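
The `target_sizes` argument is what turns the model's boxes into pixel coordinates. The sketch below illustrates that conversion in plain Python, assuming the raw boxes are normalized `(center_x, center_y, width, height)` values. `to_absolute_xyxy` is a hypothetical helper and the numbers are made up; the actual conversion happens inside the processor.

```python
def to_absolute_xyxy(box_cxcywh, image_size_wh):
    """Convert a normalized (cx, cy, w, h) box to pixel [x_min, y_min, x_max, y_max]."""
    width, height = image_size_wh  # PIL's Image.size is (width, height)
    cx, cy, w, h = box_cxcywh
    return [
        (cx - w / 2) * width,   # x_min
        (cy - h / 2) * height,  # y_min
        (cx + w / 2) * width,   # x_max
        (cy + h / 2) * height,  # y_max
    ]


image_size = (640, 480)          # Hypothetical image: 640 px wide, 480 px tall
target_size = image_size[::-1]   # Reversed to (height, width) for target_sizes
box = to_absolute_xyxy((0.5, 0.5, 0.25, 0.5), image_size)
print(target_size)  # (480, 640)
print(box)          # [240.0, 120.0, 400.0, 360.0]
```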

# 🎨 Visualizing Detections

Draw bounding boxes and labels on the image.

In [None]:
# Convert PIL to OpenCV
img_cv = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)

# Draw boxes
for box, score, label in zip(results['boxes'], results['scores'], results['labels']):
    box = [int(coord) for coord in box.tolist()]
    cv2.rectangle(img_cv, (box[0], box[1]), (box[2], box[3]), (0, 255, 0), 2)
    # Use a separate name so the prompt variable `text` is not overwritten
    caption = f'{label}: {score.item():.2f}'
    # Keep the caption inside the image when the box touches the top edge
    cv2.putText(img_cv, caption, (box[0], max(box[1] - 10, 15)), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

# Display
plt.imshow(cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB))
plt.title('Detected Objects')
plt.axis('off')
plt.show()

# Explanation
# - boxes: [x_min, y_min, x_max, y_max] in pixels
# - scores: Confidence scores in [0, 1] (sigmoid of the logits)
# - labels: Matched phrases from prompt

# ⚙️ Adjusting Thresholds

Tune `box_threshold` and `text_threshold` to filter detections. Post-processing can be rerun on the same `outputs`, so the model does not need to run again.

In [None]:
# Re-post-process with higher thresholds
results_high = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.5,
    text_threshold=0.4,
    target_sizes=[image.size[::-1]]
)[0]

print('High Threshold Results:')
print(results_high)
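
To get a feel for how the box threshold changes the number of detections, the filtering step can be mimicked on a plain list. This is a sketch under stated assumptions: the detections are made-up values, and `filter_by_score` is a hypothetical helper. It only imitates the box threshold; the text threshold affects which words end up in each label and is not modeled here.

```python
def filter_by_score(detections, min_score):
    """Keep detections whose score is strictly above min_score."""
    return [det for det in detections if det["score"] > min_score]


# Made-up detections for illustration only (not real model output)
detections = [
    {"label": "a cat", "score": 0.82},
    {"label": "a cat", "score": 0.55},
    {"label": "a remote control", "score": 0.41},
    {"label": "a remote control", "score": 0.36},
]

for threshold in (0.35, 0.5, 0.9):
    kept = filter_by_score(detections, threshold)
    print(f"threshold={threshold}: {len(kept)} detections kept")
```

With these sample values, raising the threshold from 0.35 to 0.5 removes both remote controls, which illustrates how a stricter threshold can drop true objects along with false positives.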

# 🔄 Using Multiple Prompts

Test different prompts on the same image.

In [None]:
text2 = "furniture. animal. electronic device."
inputs2 = processor(images=image, text=text2, return_tensors="pt").to(device)

with torch.no_grad():
    outputs2 = model(**inputs2)

results2 = processor.post_process_grounded_object_detection(
    outputs2,
    inputs2.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]]
)[0]

print('Alternative Prompt Results:')
print(results2)
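
One way to compare two prompts on the same image is to pair up boxes that overlap strongly and look at which labels each prompt assigned. The sketch below uses intersection over union (IoU) for the pairing. The helper names, the 0.5 IoU cutoff, and the sample boxes are all assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x_min, y_min, x_max, y_max] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def match_labels(dets_a, dets_b, min_iou=0.5):
    """Pair each detection in dets_a with its best-overlapping detection in dets_b."""
    pairs = []
    for det in dets_a:
        best = max(dets_b, key=lambda other: iou(det["box"], other["box"]), default=None)
        if best is not None and iou(det["box"], best["box"]) >= min_iou:
            pairs.append((det["label"], best["label"]))
    return pairs


# Made-up detections from two prompts on the same image
specific = [
    {"label": "a cat", "box": [0, 0, 100, 100]},
    {"label": "a remote control", "box": [200, 200, 260, 230]},
]
generic = [
    {"label": "animal", "box": [10, 0, 100, 100]},
    {"label": "furniture", "box": [0, 300, 600, 480]},
]

print(match_labels(specific, generic))  # [('a cat', 'animal')]
```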

# 📸 Batch Processing Multiple Images

Process a list of image URLs one at a time with the same prompt.

In [None]:
# List of URLs
image_urls = [
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    "http://images.cocodataset.org/val2017/000000281759.jpg"  # Another COCO image
]

text = "person. car. dog."

for url in image_urls:
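    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail early if the download did not succeed
    img = Image.open(BytesIO(response.content)).convert("RGB")  # Ensure 3 channels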
    inputs_batch = processor(images=img, text=text, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs_batch = model(**inputs_batch)
    
    results_batch = processor.post_process_grounded_object_detection(
        outputs_batch,
        inputs_batch.input_ids,
        box_threshold=0.35,
        text_threshold=0.25,
        target_sizes=[img.size[::-1]]
    )[0]
    
    print(f'Results for {url}:')
    print(results_batch)

    # Visualize (similar to above)
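
For the analysis part of batch processing, a simple summary is to count how often each label appears per image and overall. The sketch below works on plain lists of labels. The file names and labels are made up, and `count_labels` is a hypothetical helper; in the loop above, the labels would come from `results_batch['labels']`.

```python
from collections import Counter


def count_labels(labels_per_image):
    """Return per-image label counts and the totals across all images."""
    per_image = {name: Counter(labels) for name, labels in labels_per_image.items()}
    totals = sum(per_image.values(), Counter())
    return per_image, totals


# Made-up labels for two images
labels_per_image = {
    "image_a.jpg": ["person", "person", "dog"],
    "image_b.jpg": ["car", "person"],
}

per_image, totals = count_labels(labels_per_image)
for name, counts in per_image.items():
    print(name, dict(counts))
print("total", dict(totals))
```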

# 🧠 Interpreting Results

Outputs include the filtered boxes, confidence scores (sigmoid probabilities between 0 and 1, not raw logits), and the phrases matched from the prompt. Higher thresholds reduce false positives but may miss true detections.
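
The relation between raw logits and the reported score can be sketched in a few lines. This example assumes that a box's score is the largest per-token sigmoid probability, which is how the Transformers post-processing appears to compute it. The logit values are hypothetical.

```python
import math


def sigmoid(x):
    """Map a raw logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def box_score(token_logits):
    """Score of one box: the highest sigmoid probability over the prompt tokens."""
    return max(sigmoid(value) for value in token_logits)


token_logits = [-4.0, 0.0, 2.0]  # Hypothetical logits of one box against three tokens
score = box_score(token_logits)
print(f"{score:.3f}")   # 0.881
print(score > 0.35)     # True: this box would pass a box threshold of 0.35
```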

# 💡 Student Task

1. Test on a custom image URL with your own prompt.
2. Experiment with thresholds to balance precision/recall.
3. Use complex prompts (e.g., "red apple. green bottle.").
4. Batch process 3+ URLs and visualize.
5. Compare detections across different prompts on the same image.

In [None]:
# Your code here