# Image Recognition & Generation with AI (Hugging Face — Offline)

This notebook uses **Hugging Face** models and covers:
- Setup & installation
- Image classification (ViT)
- Object detection (DETR)
- Image generation (Stable Diffusion via `diffusers`)
- Image-to-image (img2img)

Notes:
- Downloading the models requires an internet connection the first time; after that they are cached locally and can be used offline.
- GPU is strongly recommended for Stable Diffusion. CPU will work for smaller demos (but slowly).
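
Once the models are cached, the Hugging Face libraries can be asked not to touch the network. The sketch below assumes your installed versions honor the `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` environment variables. The helper name `enable_offline_mode` is ours, not part of any library. Set the variables before importing `transformers` or `diffusers`.

```python
import os

def enable_offline_mode():
    """Ask Hugging Face libraries to rely on the local cache only (hypothetical helper)."""
    os.environ["HF_HUB_OFFLINE"] = "1"        # read by huggingface_hub
    os.environ["TRANSFORMERS_OFFLINE"] = "1"  # read by transformers
    # Return the values so the caller can confirm what was set
    return {k: os.environ[k] for k in ("HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE")}

print(enable_offline_mode())
```

Passing `local_files_only=True` to `from_pretrained` has a similar effect for a single call.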


In [None]:
# Install required libraries
# Run this cell once in your environment. In Colab or local notebooks, prefix with `!` to run shell commands. If you already have some libraries installed you can skip them.

# Uncomment and run if needed:
# !pip install --upgrade pip
# !pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu117  # or CPU-only wheel if no GPU
# !pip install transformers[torch] timm pillow
!pip install diffusers transformers accelerate safetensors scipy ftfy supervision inference
# !pip install -U "git+https://github.com/huggingface/transformers"

# If using a local machine without CUDA, install CPU-only PyTorch as per https://pytorch.org


In [None]:
from PIL import Image
import requests
from io import BytesIO
import torch
from torchvision import transforms
from IPython.display import display

# small helper to show images inline
def show_pil(img, title=None):
    if title:
        print(title)
    display(img)

# helper to download example images
def load_image_from_url(url):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # fail early on 404s instead of a cryptic PIL error
    return Image.open(BytesIO(resp.content)).convert("RGB")

# device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
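
For offline use it can help to accept local file paths as well as URLs. The helper below is a sketch, not part of the notebook's original code: `is_remote` and `load_image` are hypothetical names, and Pillow and `requests` are imported lazily so the URL check works on its own.

```python
from urllib.parse import urlparse

def is_remote(source):
    """Return True when `source` looks like an http(s) URL rather than a local path."""
    return urlparse(str(source)).scheme in ("http", "https")

def load_image(source, timeout=30):
    """Load an RGB PIL image from a URL or from a local file path."""
    from io import BytesIO
    from PIL import Image  # lazy import: only needed when an image is actually loaded
    if is_remote(source):
        import requests
        resp = requests.get(source, timeout=timeout)
        resp.raise_for_status()  # surface HTTP errors before handing bytes to PIL
        return Image.open(BytesIO(resp.content)).convert("RGB")
    return Image.open(source).convert("RGB")

print(is_remote("https://example.com/cat.jpg"), is_remote("images/cat.jpg"))
```

Anywhere a cell says "replace with local path if desired", a helper like this could stand in for the URL-only loader.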


## Dataset -> ViT -> Fine-tune

## 1) Image Classification using ViT (Hugging Face `transformers`)

We'll use a pre-trained Vision Transformer (ViT) model and its associated preprocessor, which resizes and normalizes each image into the tensor format the model expects.

Model: `google/vit-base-patch16-224` (image classification head trained on ImageNet)
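
The model name encodes the input geometry: `patch16` means 16x16-pixel patches and `224` means 224x224-pixel inputs. The arithmetic below shows how many tokens the transformer sees, assuming the standard ViT layout with one extra classification token. The function name is ours.

```python
def vit_sequence_length(image_size=224, patch_size=16):
    """Return (num_patches, num_tokens) for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # Standard ViT prepends one [CLS] token whose output feeds the classifier head
    return num_patches, num_patches + 1

print(vit_sequence_length())  # (196, 197)
```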


In [None]:
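from transformers import ViTImageProcessor, ViTForImageClassification

# load image processor and model
# (ViTImageProcessor replaces the deprecated ViTFeatureExtractor; the variable name is kept for the cells below)
feat_extractor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')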
model_vit = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224').to(device)
model_vit.eval()

# Example: classify an image
img_url = 'https://t3.ftcdn.net/jpg/02/36/99/22/360_F_236992283_sNOxCVQeFLd5pdqaKGh8DRGMZy7P4XKm.jpg'  # replace with local path if desired
img = load_image_from_url(img_url)
show_pil(img, 'Input image')

# preprocess
inputs = feat_extractor(images=img, return_tensors='pt').to(device)

# predict
with torch.no_grad():
    outputs = model_vit(**inputs)
    logits = outputs.logits
    probs = logits.softmax(dim=-1)
    top5 = torch.topk(probs, k=5)

# decode labels
id2label = model_vit.config.id2label
for score, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{id2label[int(idx.item())]}: {float(score):.4f}")


In [None]:
# Example: classify an image
img_url = 'https://media.istockphoto.com/id/1098182434/photo/young-cat-scottish-straight.jpg?s=612x612&w=0&k=20&c=WP-SVdLfKH7nDV5FvXN8flUbo9CI0xtE775wm-eegE0='  # replace with local path if desired
img = load_image_from_url(img_url)
show_pil(img, 'Input image')

# preprocess
inputs = feat_extractor(images=img, return_tensors='pt').to(device)

# predict
with torch.no_grad():
    outputs = model_vit(**inputs)
    logits = outputs.logits
    probs = logits.softmax(dim=-1)
    top5 = torch.topk(probs, k=5)

# decode labels
id2label = model_vit.config.id2label
for score, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{id2label[int(idx.item())]}: {float(score):.4f}")

## 2) Object Detection using DETR (Hugging Face `transformers`)

Model: `facebook/detr-resnet-50`

This demonstrates bounding box detection and class labels.
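
DETR-style models predict boxes as normalized `(center_x, center_y, width, height)`. The post-processing step then converts them to absolute pixel corners `(x0, y0, x1, y1)`, which is the format the plotting code works with. The function below is a pure-Python sketch of that conversion for illustration, not the library's own implementation.

```python
def cxcywh_to_xyxy(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to absolute (x0, y0, x1, y1) pixels."""
    cx, cy, w, h = box
    x0 = (cx - w / 2) * image_width
    y0 = (cy - h / 2) * image_height
    x1 = (cx + w / 2) * image_width
    y1 = (cy + h / 2) * image_height
    return x0, y0, x1, y1

# A box centered in a 640x480 image, a quarter of its width and half of its height
print(cxcywh_to_xyxy((0.5, 0.5, 0.25, 0.5), 640, 480))  # (240.0, 120.0, 400.0, 360.0)
```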


In [None]:
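from transformers import DetrImageProcessor, DetrForObjectDetection
import matplotlib.pyplot as plt
import matplotlib.patches as patches

# DetrImageProcessor replaces the deprecated DetrFeatureExtractor; the variable name is kept for the cells below
feat_det = DetrImageProcessor.from_pretrained('facebook/detr-resnet-50')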
model_detr = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50').to(device)
model_detr.eval()

# Load example image
img_url = 'https://images.unsplash.com/photo-1518791841217-8f162f1e1131'
img = load_image_from_url(img_url)
show_pil(img, 'Object detection input')

# prepare
inputs = feat_det(images=img, return_tensors='pt').to(device)
with torch.no_grad():
    outputs = model_detr(**inputs)

# post-process
target_sizes = torch.tensor([img.size[::-1]])  # (height, width)
results = feat_det.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.7)[0]


# plot
fig, ax = plt.subplots(1, figsize=(12,8))
ax.imshow(img)
for score, label, box in zip(results['scores'], results['labels'], results['boxes']):
    box = [round(i, 2) for i in box.tolist()]
    x0, y0, x1, y1 = box
    w, h = x1-x0, y1-y0
    rect = patches.Rectangle((x0, y0), w, h, linewidth=2, edgecolor='r', facecolor='none')
    ax.add_patch(rect)
    class_name = model_detr.config.id2label[int(label.item())]
    ax.text(x0, y0, f"{class_name}: {score:.2f}", bbox=dict(facecolor='yellow', alpha=0.5))

plt.axis('off')
plt.show()


In [None]:
# Load example image
img_url = 'https://petstrainingandboarding.com.au/wp-content/uploads/2017/05/15994040_ml.jpg'
img = load_image_from_url(img_url)
show_pil(img, 'Object detection input')

# prepare
inputs = feat_det(images=img, return_tensors='pt').to(device)
with torch.no_grad():
    outputs = model_detr(**inputs)

# post-process
target_sizes = torch.tensor([img.size[::-1]])  # (height, width)
results = feat_det.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.7)[0]


# plot
fig, ax = plt.subplots(1, figsize=(12,8))
ax.imshow(img)
for score, label, box in zip(results['scores'], results['labels'], results['boxes']):
    box = [round(i, 2) for i in box.tolist()]
    x0, y0, x1, y1 = box
    w, h = x1-x0, y1-y0
    rect = patches.Rectangle((x0, y0), w, h, linewidth=2, edgecolor='r', facecolor='none')
    ax.add_patch(rect)
    class_name = model_detr.config.id2label[int(label.item())]
    ax.text(x0, y0, f"{class_name}: {score:.2f}", bbox=dict(facecolor='yellow', alpha=0.5))

plt.axis('off')
plt.show()

In [None]:
import os
import supervision as sv
from inference import get_model
from PIL import Image
from io import BytesIO
import requests

url = "https://media.roboflow.com/dog.jpeg"
image = Image.open(BytesIO(requests.get(url).content))

model = get_model("rfdetr-base")

predictions = model.infer(image, confidence=0.5)[0]

detections = sv.Detections.from_inference(predictions)

labels = [prediction.class_name for prediction in predictions.predictions]

annotated_image = image.copy()
annotated_image = sv.BoxAnnotator(color=sv.ColorPalette.ROBOFLOW).annotate(annotated_image, detections)
annotated_image = sv.LabelAnnotator(color=sv.ColorPalette.ROBOFLOW).annotate(annotated_image, detections, labels)

In [None]:
sv.plot_image(annotated_image)

### Model capability (SOTA) | Latency | Model | Cost

In [None]:
url = "https://petstrainingandboarding.com.au/wp-content/uploads/2017/05/15994040_ml.jpg"
image = Image.open(BytesIO(requests.get(url).content))

model = get_model("rfdetr-base")

predictions = model.infer(image, confidence=0.5)[0]

detections = sv.Detections.from_inference(predictions)

labels = [prediction.class_name for prediction in predictions.predictions]

annotated_image = image.copy()
annotated_image = sv.BoxAnnotator(color=sv.ColorPalette.ROBOFLOW).annotate(annotated_image, detections)
annotated_image = sv.LabelAnnotator(color=sv.ColorPalette.ROBOFLOW).annotate(annotated_image, detections, labels)

In [None]:
sv.plot_image(annotated_image)


## 3) Image Generation with Stable Diffusion (`diffusers`)

We'll use the `diffusers` pipeline. Model example: `runwayml/stable-diffusion-v1-5` or `stabilityai/stable-diffusion-2-1`.

**Warning:** These models are large (~4-8GB). If you're on a CPU-only machine, generation will be slow. Consider smaller models or check `diffusers` for tiny checkpoints.
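
Loading weights in half precision roughly halves the memory they occupy, which is why `float16` is the usual choice on a GPU. The estimate below is back-of-the-envelope arithmetic only. The parameter count is a hypothetical round number, not the size of any specific checkpoint, and activations and framework overhead are ignored.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weights_size_gb(num_params, dtype="float32"):
    """Approximate memory needed for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

HYPOTHETICAL_PARAMS = 1_000_000_000  # illustrative only, not a real model's count

for dtype in ("float32", "float16"):
    print(dtype, weights_size_gb(HYPOTHETICAL_PARAMS, dtype), "GB")
```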


In [None]:
from diffusers import StableDiffusionPipeline

sd_model_id = 'runwayml/stable-diffusion-v1-5'

# If you have a GPU, use torch_dtype=torch.float16 to save memory
pipe = StableDiffusionPipeline.from_pretrained(sd_model_id, torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32)
pipe = pipe.to(device)

# safety: disable NSFW checker to avoid blocking in teaching demo (only for trusted local environment)
try:
    pipe.safety_checker = None
except Exception:
    pass

# Simple prompt-based generation
prompt = "A watercolor painting of a small island with a single tree, high detail"
# no torch.autocast wrapper needed: the pipeline already runs in the dtype chosen at load time
image = pipe(prompt, guidance_scale=7.5, num_inference_steps=25).images[0]

show_pil(image, 'Generated image')


## 4) Image-to-image (img2img)

Take an existing image and generate a variation. `StableDiffusionPipeline` is text-to-image only, so we wrap the components that are already loaded in a `StableDiffusionImg2ImgPipeline`, which accepts an initial image and a `strength`.


In [None]:
from diffusers import StableDiffusionImg2ImgPipeline

# reuse the weights loaded above instead of downloading them again
pipe_img2img = StableDiffusionImg2ImgPipeline(**pipe.components)

init_img_url = 'https://images.unsplash.com/photo-1595598239736-223ad4fe7da5?fm=jpg&q=60&w=3000&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxzZWFyY2h8Mnx8Z3JlZW4lMjBhcmVhfGVufDB8fDB8fHww'
init_img = load_image_from_url(init_img_url).resize((512,512))
show_pil(init_img, 'init image')

prompt = "A dreamy oil painting version of this scene, soft brush strokes"
# strength in [0, 1]: higher values drift further from the initial image
out = pipe_img2img(prompt=prompt, image=init_img, strength=0.7, guidance_scale=7.5, num_inference_steps=25)
img2 = out.images[0]

show_pil(img2, 'img2img result')
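
In img2img, `strength` also controls how much of the denoising schedule actually runs. The initial image is noised part of the way, and only the remaining steps are executed. The sketch below assumes the usual `diffusers` convention of truncating `num_inference_steps * strength` to an integer. The function name is ours.

```python
def img2img_denoising_steps(num_inference_steps, strength):
    """Estimate how many denoising steps run for a given img2img strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Assumed convention: truncate the product, never exceeding the requested steps
    return min(int(num_inference_steps * strength), num_inference_steps)

print(img2img_denoising_steps(25, 0.7))  # 17
```

With the settings used above (25 steps, strength 0.7), roughly 17 steps would run, so lowering `strength` both preserves more of the input and shortens generation.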


## 5) Zero-shot image classification using CLIP

We can use CLIP to compute similarity between images and text labels (useful for custom classes).
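
Conceptually, zero-shot classification scores each label by the cosine similarity between the image embedding and the label's text embedding, then applies a softmax over the scaled scores. The pure-Python sketch below uses toy 2-D vectors in place of real CLIP embeddings. The `logit_scale` of 100.0 is an illustrative temperature, not a value read from the model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(image_vec, text_vecs, logit_scale=100.0):
    """Turn image-text similarities into a probability per candidate label."""
    logits = [logit_scale * cosine(image_vec, t) for t in text_vecs]
    return softmax(logits)

labels = ["dog", "cat"]
image_vec = [0.9, 0.1]                 # toy image embedding, closer to "dog"
text_vecs = [[1.0, 0.0], [0.0, 1.0]]   # toy text embeddings, one per label
for label, p in zip(labels, zero_shot_probs(image_vec, text_vecs)):
    print(f"{label}: {p:.4f}")
```

Because the softmax runs only over the candidate labels, the probabilities are relative to that list: adding or removing a label changes every score.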


In [None]:
from transformers import CLIPProcessor, CLIPModel

clip_model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32').to(device)
clip_processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

# sample image
img_url = 'https://images.unsplash.com/photo-1518791841217-8f162f1e1131'
img = load_image_from_url(img_url)

# labels
candidate_labels = ["person", "car", "dog", "cat", "bicycle", "tree"]
inputs = clip_processor(text=candidate_labels, images=img, return_tensors='pt', padding=True).to(device)
with torch.no_grad():
    outputs = clip_model(**inputs)
    logits_per_image = outputs.logits_per_image # this is the image-text similarity score
    probs = logits_per_image.softmax(dim=1)

for label, p in zip(candidate_labels, probs[0]):
    print(f"{label}: {float(p):.4f}")


## Tips for teaching & running locally

- **Model caching**: First run requires downloading models. Ask students to run `huggingface-cli login` if they hit rate limits for some models.
- **Smaller alternatives**: For classrooms without GPU, use smaller models (e.g., use `torchvision` pretrained models like mobilenet_v2 for fast classification).
- **Speed & memory**: For Stable Diffusion, try swapping in `EulerAncestralDiscreteScheduler` and using fewer `num_inference_steps` (10–25) for speed; `pipe.enable_attention_slicing()` can reduce peak memory.
- **Safety & licenses**: Remind students to check model licenses and avoid generating unsafe content.


## Next steps / Exercises for students

1. Fine-tune ViT on a tiny custom dataset (use `datasets` and `Trainer`).
2. Replace Stable Diffusion with a lightweight diffusion model for faster generation.
3. Build a small web app using Gradio to wrap the classifier + generator.

---

# End of notebook
