[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1dTHQAnkWkbD68_iPiNT0A2Qs9zBEu2Es)

Author: **Safouane El Ghazouali**
- Ph.D. in AI
- Senior data scientist and researcher at TOELT LLC
- Lecturer at HSLU

---

# Hands-on: Prompt Engineering for CLIP and GroundingDINO

Welcome to this hands-on notebook on prompt engineering. We explore practices for writing prompts for zero-shot classification with CLIP and zero-shot object detection with GroundingDINO, improving results without any additional training.

![Prompt Engineering](https://www.promptingguide.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fprompt.3c3c4c4f.png&w=1920&q=75)

### Why Prompt Engineering?
- **Optimization**: Well-crafted prompts can significantly boost accuracy in zero-shot tasks.
- **Flexibility**: Adapt models to specific domains or nuances via text alone.
- **Efficiency**: No need for retraining; iterate on prompts for quick improvements.
- **Best Practices**: For CLIP, use descriptive templates and prompt ensembles. For GroundingDINO, use lowercase, period-separated phrases and tune the detection thresholds.

### What You'll Learn
- Crafting effective prompts for CLIP classification.
- Optimizing detection prompts for GroundingDINO.
- Testing variations on images from URLs.
- Analyzing impacts on results.
- Applying to small datasets like CIFAR10/COCO samples.

# 🧰 Environment Setup

Install required libraries for CLIP, GroundingDINO, and utilities.

In [None]:
!pip install -q git+https://github.com/openai/CLIP.git transformers requests pillow opencv-python matplotlib numpy

### Import Libraries

In [None]:
import torch
import clip
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
import requests
from PIL import Image
from io import BytesIO
import cv2
import matplotlib.pyplot as plt
import numpy as np

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using device: {device}')

# 📝 Prompt Engineering for CLIP (Classification)

Best Practices:
- Use descriptive templates such as "a photo of a {class}" instead of the bare class name.
- Ensemble multiple prompts for robustness (e.g., average the similarities across templates).
- Add context that matches the image domain, e.g., "a low-quality photo of a {class}" for blurry or low-resolution images.
- Avoid ambiguous class names; be specific to reduce misclassifications.
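
The template advice above can be sketched as a small helper that fills a template with each class name. `with_article` and `build_prompts` are hypothetical names introduced here (they are not part of the CLIP package), and the a/an choice is only a rough vowel heuristic.

```python
def with_article(noun: str) -> str:
    """Prefix a noun with 'a' or 'an' using a simple vowel heuristic."""
    article = 'an' if noun[:1].lower() in 'aeiou' else 'a'
    return f'{article} {noun}'


def build_prompts(classes, template='a photo of {}'):
    """Fill one template with every class name (article included)."""
    return [template.format(with_article(c)) for c in classes]


prompts = build_prompts(['cat', 'dog', 'airplane'])
print(prompts)
```

The resulting list can be passed to `clip.tokenize` in the same way as the prompt lists used later in this notebook.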

Load CLIP model.

In [None]:
clip_model, clip_preprocess = clip.load('ViT-B/32', device=device)
print('CLIP loaded!')

### Single Image with Basic vs. Engineered Prompts

Test prompt variations on an image.

In [None]:
# Image URL (e.g., a cat)
url = 'https://images.unsplash.com/photo-1541963463532-daf8c885265c'
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early if the download did not succeed
img = Image.open(BytesIO(response.content)).convert('RGB')
img_preprocessed = clip_preprocess(img).unsqueeze(0).to(device)

plt.imshow(img)
plt.title('Sample Image')
plt.axis('off')
plt.show()

# Classes
classes = ['cat', 'dog', 'car']

# Basic prompt
basic_prompts = [f'{c}' for c in classes]
text_basic = clip.tokenize(basic_prompts).to(device)

# Engineered prompt
eng_prompts = [f'a high-quality photo of a {c} in natural light' for c in classes]
text_eng = clip.tokenize(eng_prompts).to(device)

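# Inference function
def clip_predict(text_inputs):
    """Return softmax probabilities over the prompts for the global image."""
    with torch.no_grad():
        image_features = clip_model.encode_image(img_preprocessed).float()
        text_features = clip_model.encode_text(text_inputs).float()
        # L2-normalize so the dot product is a cosine similarity
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        # Scale by 100 before the softmax, as in the CLIP README example
        similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
    return similarity[0].cpu().numpy()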

# Results
basic_probs = clip_predict(text_basic)
eng_probs = clip_predict(text_eng)

print('Basic Prompt Probabilities:')
for c, p in zip(classes, basic_probs):
    print(f'{c}: {p*100:.2f}%')

print('\nEngineered Prompt Probabilities:')
for c, p in zip(classes, eng_probs):
    print(f'{c}: {p*100:.2f}%')
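
The factor of `100.0` in `clip_predict` sharpens the softmax: cosine similarities between CLIP embeddings usually differ by small amounts, so without scaling the probabilities stay close to uniform. The sketch below illustrates this in plain Python. The similarity values are toy numbers chosen for illustration, not model output, and `softmax` is a helper defined here.

```python
import math


def softmax(values, scale=1.0):
    """Softmax over scaled values, shifted by the max for numerical stability."""
    scaled = [scale * v for v in values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy cosine similarities for three classes (illustrative, not measured).
toy_sims = [0.27, 0.22, 0.18]

for scale in (1.0, 100.0):
    probs = softmax(toy_sims, scale)
    print(scale, [round(p, 3) for p in probs])
```

With a scale of 1 the three classes remain nearly tied, while a scale of 100 turns the same 0.05 gap into a near-certain top prediction. Keep this in mind when comparing percentages across prompt variants.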

### Prompt Ensemble

Average multiple templates for better robustness.

In [None]:
templates = [
    'a photo of a {}',
    'an image of a {}',
    'a picture of the {}'
]

ensemble_probs = np.zeros(len(classes))
for temp in templates:
    text_ens = clip.tokenize([temp.format(c) for c in classes]).to(device)
    ensemble_probs += clip_predict(text_ens)

ensemble_probs /= len(templates)

print('Ensemble Probabilities:')
for c, p in zip(classes, ensemble_probs):
    print(f'{c}: {p*100:.2f}%')
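
The cell above averages the per-template probabilities. A common alternative is to ensemble in embedding space: normalize the text embedding of each template, average them per class, and re-normalize, so that each class ends up with a single vector. The sketch below shows only the arithmetic, using plain lists as stand-ins for `encode_text` outputs. The function names and toy vectors are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def ensemble_embedding(template_vecs):
    """Average unit-normalized per-template vectors, then re-normalize."""
    unit = [l2_normalize(v) for v in template_vecs]
    dim = len(unit[0])
    mean = [sum(v[i] for v in unit) / len(unit) for i in range(dim)]
    return l2_normalize(mean)


# Toy 3-d vectors standing in for two template embeddings of one class.
toy_vecs = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
class_vec = ensemble_embedding(toy_vecs)
print(class_vec)
```

Normalizing before averaging keeps a template with a large embedding norm from dominating the class vector.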

# 📝 Prompt Engineering for GroundingDINO (Detection)

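Best Practices:
- Write phrases in lowercase and end each one with a period, which also separates it from the next phrase (e.g., "cat. remote.").
- Be specific: "orange cat. tv remote." tends to match better than generic nouns.
- Adjust the box and text thresholds to the prompt; longer, more specific prompts may need different values.
- Use descriptive text for open-set detection, i.e., to find objects outside a fixed label set.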
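
The formatting rules above can be applied automatically when the phrases come from a list. The helper below is a minimal sketch. `build_dino_prompt` is a hypothetical name introduced here, not a GroundingDINO or `transformers` function.

```python
def build_dino_prompt(phrases):
    """Lowercase each phrase, drop stray periods, join with '. ' and end with '.'."""
    cleaned = [p.strip().strip('.').strip().lower() for p in phrases]
    cleaned = [p for p in cleaned if p]  # skip empty entries
    return '. '.join(cleaned) + '.' if cleaned else ''


print(build_dino_prompt(['Orange Cat', 'TV remote.']))
```

The returned string has the same shape as the `basic_prompt` and `eng_prompt` values used in the detection cells.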

Load GroundingDINO.

In [None]:
model_id = 'IDEA-Research/grounding-dino-base'
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
print('GroundingDINO loaded!')

### Single Image with Basic vs. Engineered Prompts

In [None]:
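# Image URL
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'  # Cats and remotes
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early if the download did not succeed
img = Image.open(BytesIO(response.content)).convert('RGB')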

plt.imshow(img)
plt.title('Sample Image')
plt.axis('off')
plt.show()

# Basic prompt
basic_prompt = 'cat. remote.'

# Engineered prompt
eng_prompt = 'orange cat. brown cat. black remote. white remote.'

# Function to detect and visualize
def dino_detect(prompt, box_th=0.35, text_th=0.25):
    """Run GroundingDINO on the global `img`, draw the detections, and return them."""
    inputs = processor(images=img, text=prompt, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # target_sizes expects (height, width); PIL's img.size is (width, height).
    # Keyword names may differ between transformers releases; check the docs of
    # the installed version if `box_threshold` is rejected.
    results = processor.post_process_grounded_object_detection(
        outputs, inputs.input_ids, box_threshold=box_th, text_threshold=text_th, target_sizes=[img.size[::-1]]
    )[0]

    img_cv = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
    for box, score, label in zip(results['boxes'], results['scores'], results['labels']):
        x1, y1, x2, y2 = [int(c) for c in box.tolist()]
        cv2.rectangle(img_cv, (x1, y1), (x2, y2), (0, 255, 0), 2)
        # Clamp the label position so it stays visible for boxes at the top edge
        cv2.putText(img_cv, f'{label}: {score.item():.2f}', (x1, max(y1 - 10, 12)), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

    plt.imshow(cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB))
    plt.title(f'Detections with Prompt: {prompt}')
    plt.axis('off')
    plt.show()

    return results

# Basic
print('Basic Prompt Results:')
basic_results = dino_detect(basic_prompt)
print(basic_results)

# Engineered
print('\nEngineered Prompt Results:')
eng_results = dino_detect(eng_prompt)
print(eng_results)
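
Printing the raw result dictionaries makes the two prompts hard to compare. One option is to reduce each result to a count and best score per label, as sketched below. `summarize_detections` is a hypothetical helper, and the labels and scores are toy values for illustration, not real model output.

```python
from collections import defaultdict


def summarize_detections(labels, scores):
    """Return {label: (count, best_score)} for parallel label/score sequences."""
    grouped = defaultdict(list)
    for label, score in zip(labels, scores):
        grouped[label].append(float(score))
    return {label: (len(vals), max(vals)) for label, vals in grouped.items()}


# Toy values (illustrative only).
toy_labels = ['cat', 'cat', 'remote']
toy_scores = [0.48, 0.41, 0.39]
print(summarize_detections(toy_labels, toy_scores))
```

In this notebook the helper could be called as `summarize_detections(basic_results['labels'], basic_results['scores'].tolist())`, assuming the result keys shown in `dino_detect`.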

### Adjusting Thresholds with Prompts

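Lowering the thresholds yields more detections (higher recall, but more false positives); raising them keeps only confident boxes (higher precision, but more misses).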

In [None]:
print('Low Threshold Results:')
low_results = dino_detect(eng_prompt, box_th=0.2, text_th=0.15)
print(low_results)
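
The trade-off can be seen by counting how many candidate boxes survive at each threshold. The sketch below uses toy confidence scores (not model output) and a hypothetical helper, `count_kept`. It simplifies the real post-processing to a single "score greater than threshold" test.

```python
def count_kept(scores, thresholds):
    """For each threshold, count how many scores exceed it."""
    return {th: sum(s > th for s in scores) for th in thresholds}


# Toy box confidence scores (illustrative only).
toy_scores = [0.62, 0.44, 0.31, 0.22, 0.18]
print(count_kept(toy_scores, [0.2, 0.35, 0.5]))
```

Sweeping a few values like this before settling on `box_th` and `text_th` shows how quickly detections appear or disappear as the threshold moves.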

# 🧠 Interpreting Results

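Engineered prompts often yield more precise matches, but this should be verified on each image rather than assumed. Averaging over templates in CLIP reduces sensitivity to the wording of any single prompt. For GroundingDINO, specific descriptors and period-separated phrases tend to improve grounding.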

# 💡 Student Task

1. Test custom prompts on a new image URL for CLIP.
2. Add more templates to the CLIP ensemble.
3. For GroundingDINO, try prompts with colors and other attributes.
4. Compare basic vs. engineered prompts on a CIFAR10 sample.
5. Adjust the thresholds and observe the precision/recall trade-offs.

In [None]:
# Your code here