In [None]:
%pip install ultralytics

# Exploring Pretrained Models from PyTorch Hub

This notebook explores pretrained computer vision models available through PyTorch Hub and torchvision. We'll cover:

- **Object detection models** (SSD, YOLOv5)
- **Instance segmentation models** (Mask R-CNN)

## What is PyTorch Hub?

PyTorch Hub is a pre-trained model repository designed to facilitate research reproducibility and enable easy access to state-of-the-art models. Models are loaded directly from GitHub repositories with a single line of code.
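
The repository string passed to `torch.hub.load` has the form `owner/repo`, with an optional `:ref` suffix naming a branch or tag, as in the `NVIDIA/DeepLearningExamples:torchhub` string used later in this notebook. The sketch below only illustrates that format. `split_hub_spec` is a hypothetical helper written for this explanation and is not part of the `torch.hub` API.

```python
def split_hub_spec(spec):
    """Split 'owner/repo[:ref]' into (owner, repo, ref); ref is None when omitted.

    Hypothetical helper for illustration only; torch.hub does its own parsing.
    """
    repo_part, _, ref = spec.partition(":")
    owner, sep, repo = repo_part.partition("/")
    if not sep or not owner or not repo:
        raise ValueError(f"expected 'owner/repo[:ref]', got {spec!r}")
    return owner, repo, ref or None


print(split_hub_spec("NVIDIA/DeepLearningExamples:torchhub"))
print(split_hub_spec("ultralytics/yolov5"))
```

When the ref is left out, as in `ultralytics/yolov5`, `torch.hub` resolves a default branch itself, so the downloaded code can change as that branch moves. Pinning a ref is the safer choice when reproducibility matters.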

## Datasets:

- **ImageNet1k**: 1.2M images, 1000 classes (ILSVRC 2012)
- **ImageNet21k**: 14M images, 21,841 classes (full ImageNet dataset)
- **COCO**: Common Objects in Context, 330K images, 80 object categories (detection & segmentation)
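
COCO's 80 categories use category IDs that run from 1 to 90 with gaps, which is why label lists indexed by category ID (such as the one returned by `get_coco_labels` in this notebook) contain placeholder entries. The sketch below uses the first 14 entries of that list to show the sparse and contiguous indexing schemes side by side. Which scheme a given model emits should be checked in its documentation rather than assumed.

```python
# First 14 entries of the category-ID-indexed list used later in this notebook.
# Index 12 is an unused COCO category ID, hence the 'N/A' placeholder.
sparse_labels = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
]

# Contiguous scheme: same order, placeholders removed, so ids have no gaps.
contiguous_labels = [name for name in sparse_labels if name != 'N/A']

# Map each sparse category ID to its contiguous id (background stays at 0).
sparse_to_contiguous = {
    sparse_id: contiguous_labels.index(name)
    for sparse_id, name in enumerate(sparse_labels)
    if name != 'N/A'
}

print(sparse_labels[13], contiguous_labels[12])  # both are 'stop sign'
print(sparse_to_contiguous[13])                  # 12
```

Looking up a contiguous id in the sparse list (or the reverse) shifts every label after the first gap, so the label list has to match the model's indexing scheme.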

## Setup and Imports

In [None]:
%matplotlib inline

import torch
import torchvision
import torchvision.transforms as transforms
from PIL import Image, ImageDraw, ImageFont
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib.patches import Polygon
import numpy as np
import requests
from io import BytesIO
from scipy.ndimage import zoom
import cv2
import warnings
warnings.filterwarnings('ignore')

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## Utility Functions

In [None]:
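def load_image_from_url(url, timeout=30):
    """
    Load an image from a URL and return it as an RGB PIL image.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # Fail on HTTP errors instead of on image decoding
    img = Image.open(BytesIO(response.content)).convert('RGB')
    return img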


def display_image(img, title="Image", figsize=(8, 8)):
    """
    Display an image with matplotlib.
    """
    plt.figure(figsize=figsize)
    plt.imshow(img)
    plt.axis('off')
    plt.title(title, fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()


def get_imagenet_labels():
    """
    Load ImageNet class labels.
    """
    url = "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"
    response = requests.get(url)
    labels = response.text.strip().split('\n')
    return labels


def get_coco_labels():
    """
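    COCO class labels indexed by COCO category ID (the indexing used by
    torchvision detection models). The list has 91 entries: '__background__'
    at index 0, the 80 real categories, and 'N/A' placeholders for the
    category IDs that COCO leaves unused.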
    """
    return [
        '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
        'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
        'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
        'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
        'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
        'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
        'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
        'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
        'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
        'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
        'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
        'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
    ]


def count_parameters(model):
    """
    Count total and trainable parameters in a model.
    """
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    print(f"Total parameters: {total_params:,}")
    print(f"Trainable parameters: {trainable_params:,}")
    print(f"Size (MB): {total_params * 4 / (1024**2):.2f}")  # Assuming float32

    return total_params, trainable_params

---
# Part 2: Object Detection Models

Object detection involves both locating objects (bounding boxes) and classifying them.
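
A single detection can be represented as a box, a class id, and a confidence score. The sketch below assumes boxes in normalized `[xmin, ymin, xmax, ymax]` form (the form the SSD cells later in this notebook work with) and converts them to pixel coordinates. The `Detection` class and its field names are illustrative and are not an API of any library used here.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object; box is normalized [xmin, ymin, xmax, ymax] in 0..1."""
    box: tuple
    class_id: int
    score: float

    def to_pixels(self, img_width, img_height):
        """Return (x, y, width, height) in pixels for an image of the given size."""
        xmin, ymin, xmax, ymax = self.box
        x = xmin * img_width
        y = ymin * img_height
        return x, y, xmax * img_width - x, ymax * img_height - y


# Toy values chosen for illustration, not real model output.
det = Detection(box=(0.25, 0.5, 0.75, 1.0), class_id=1, score=0.9)
print(det.to_pixels(img_width=400, img_height=200))  # (100.0, 100.0, 200.0, 100.0)
```

The `(x, y, width, height)` form is what `matplotlib.patches.Rectangle` expects, which is why the visualization code in this notebook computes width and height from the corner coordinates.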

## 2.1 SSD (Single Shot MultiBox Detector)

### Overview
SSD is a fast object detection method that discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location.
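
To make "default boxes over different aspect ratios and scales per feature map location" concrete, here is a minimal sketch that follows the box shapes described in the SSD paper for a single square feature map: each cell gets one box per aspect ratio plus one extra square box at an intermediate scale. The function name, its arguments, and the example scales are choices made for this sketch and do not mirror the code of the NVIDIA implementation.

```python
import math


def default_boxes(fmap_size, scale, next_scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return default boxes (cx, cy, w, h), normalized to 0..1, for one feature map."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Box centers sit in the middle of each feature-map cell
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                # Width grows and height shrinks with the aspect ratio
                boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
            # Extra square box at the geometric mean of this scale and the next
            extra = math.sqrt(scale * next_scale)
            boxes.append((cx, cy, extra, extra))
    return boxes


boxes = default_boxes(fmap_size=3, scale=0.2, next_scale=0.34)
print(len(boxes))  # 36 = 3 * 3 cells * (3 aspect ratios + 1 extra box)
```

The detector predicts class scores and coordinate offsets relative to every one of these boxes, which is how it covers many object shapes and sizes in a single forward pass.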

### Key Features:
- Single-stage detector (no region proposals)
- Multi-scale feature maps for detection
- Fast inference speed
- The original paper uses a VGG16 backbone; the NVIDIA implementation loaded in this notebook replaces it with ResNet-50
- Trained on COCO dataset (80 classes)

**Paper:** *SSD: Single Shot MultiBox Detector*  
Liu, W., et al. (2016)  
[ECCV 2016 Paper](https://arxiv.org/abs/1512.02325)

### Architecture:
![SSD Architecture](https://miro.medium.com/v2/resize:fit:1400/1*51joMGlhxvftTxGtA4lA7Q.png)

In [None]:
# Load SSD model from torch.hub
print("Loading SSD300 (NVIDIA implementation, ResNet-50 backbone)...")
model_ssd = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_ssd', pretrained=True)
model_ssd = model_ssd.to(device)
model_ssd.eval()

print("\nSSD model loaded successfully!")
print("Model trained on COCO dataset (80 object classes)")

In [None]:
# Load image for object detection
detection_url = "https://raw.githubusercontent.com/pytorch/hub/master/images/dog.jpg"
img_detect = load_image_from_url(detection_url)
display_image(img_detect, "Image for Object Detection")

In [None]:
# Prepare image for SSD
def prepare_ssd_input(image):
    transform = transforms.Compose([
        transforms.Resize((300, 300)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    return transform(image).unsqueeze(0)

ssd_input = prepare_ssd_input(img_detect).to(device)

# Get predictions
utils_ssd = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_ssd_processing_utils')

with torch.no_grad():
    detections = model_ssd(ssd_input)

# Process detections
results_per_input = utils_ssd.decode_results(detections)
best_results_per_input = [utils_ssd.pick_best(results, 0.4) for results in results_per_input]

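# Get COCO labels
coco_labels = get_coco_labels()
# The NVIDIA SSD predicts contiguous class ids (1..80), while get_coco_labels() is
# indexed by sparse COCO category ID, so drop the 'N/A' placeholders for SSD lookups.
ssd_labels = [name for name in coco_labels if name != 'N/A']

print("\n" + "="*60)
print("SSD DETECTIONS (Confidence > 40%)")
print("="*60)
for image_idx in range(len(best_results_per_input)):
    bboxes, classes, confidences = best_results_per_input[image_idx]
    for bbox, cls, conf in zip(bboxes, classes, confidences):
        print(f"Class: {ssd_labels[cls]:20s} | Confidence: {conf:.2%} | BBox: {bbox}")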
print("="*60)

In [None]:
# Visualize SSD detections
def visualize_detections(image, bboxes, classes, confidences, labels, title="Detections"):
    """
    Visualize object detection results.
    """
    fig, ax = plt.subplots(1, figsize=(12, 9))
    ax.imshow(image)

    # Generate colors
    np.random.seed(42)
    colors = np.random.rand(len(labels), 3)

    img_width, img_height = image.size

    for bbox, cls, conf in zip(bboxes, classes, confidences):
        # SSD outputs normalized coordinates [xmin, ymin, xmax, ymax]
        xmin = bbox[0] * img_width
        ymin = bbox[1] * img_height
        xmax = bbox[2] * img_width
        ymax = bbox[3] * img_height

        width = xmax - xmin
        height = ymax - ymin

        # Draw rectangle
        rect = patches.Rectangle((xmin, ymin), width, height,
                                linewidth=2, edgecolor=colors[cls], facecolor='none')
        ax.add_patch(rect)

        # Add label
        label_text = f"{labels[cls]}: {conf:.2f}"
        ax.text(xmin, ymin - 5, label_text,
               bbox=dict(boxstyle='round', facecolor=colors[cls], alpha=0.7),
               fontsize=10, color='white', weight='bold')

    ax.axis('off')
    plt.title(title, fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

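# Visualize
# SSD class ids are contiguous (1..80), so look names up in a list without 'N/A' placeholders
ssd_labels = [name for name in coco_labels if name != 'N/A']
for image_idx in range(len(best_results_per_input)):
    bboxes, classes, confidences = best_results_per_input[image_idx]
    visualize_detections(img_detect, bboxes, classes, confidences,
                        ssd_labels, "SSD Object Detection Results")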

## 2.2 YOLOv5 (You Only Look Once v5)

### Overview
YOLOv5 is one of the most popular object detection models, known for its speed and accuracy. It's widely used in real-time applications.

### Key Features:
- Single-stage detector
- Extremely fast (up to 140 FPS on GPU)
- Multiple model sizes (nano, small, medium, large, xlarge)
- CSPDarknet53 backbone
- Trained on COCO dataset
- Easy to use and deploy
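
YOLOv5 models loaded from the hub accept PIL images directly and handle resizing internally. A simplified sketch of the letterbox idea behind that step (aspect-preserving resize plus padding) is shown below, together with the inverse mapping that takes a box back to original pixel coordinates. The function names and the fixed square target are assumptions for this sketch. The hub wrapper's actual preprocessing differs in details such as how much padding it adds.

```python
def letterbox_geometry(img_width, img_height, target=640):
    """Return (scale, new_width, new_height, pad_x, pad_y) for a square target.

    The image is scaled so its longer side equals `target`, then padded equally
    on both sides of the shorter dimension.
    """
    scale = target / max(img_width, img_height)
    new_width = round(img_width * scale)
    new_height = round(img_height * scale)
    pad_x = (target - new_width) / 2
    pad_y = (target - new_height) / 2
    return scale, new_width, new_height, pad_x, pad_y


def unletterbox_box(box, scale, pad_x, pad_y):
    """Map an (xmin, ymin, xmax, ymax) box from letterboxed to original pixels."""
    xmin, ymin, xmax, ymax = box
    return ((xmin - pad_x) / scale, (ymin - pad_y) / scale,
            (xmax - pad_x) / scale, (ymax - pad_y) / scale)


scale, new_w, new_h, pad_x, pad_y = letterbox_geometry(1280, 720)
print(scale, new_w, new_h, pad_x, pad_y)  # 0.5 640 360 0.0 140.0
```

Because the model undoes this mapping before returning results, the coordinates in `results.pandas().xyxy[0]` are already in the original image's pixel space.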

**Repository:** [Ultralytics YOLOv5](https://github.com/ultralytics/yolov5)

### Architecture:
![YOLOv5 Architecture](https://user-images.githubusercontent.com/26456083/86477109-5a7ca780-bd7a-11ea-9cb7-48d9fd6848e7.jpg)

In [None]:
# Load YOLOv5 from torch.hub
print("Loading YOLOv5s (small) model...")
model_yolo = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
model_yolo = model_yolo.to(device)
model_yolo.eval()

print("\nYOLOv5s model loaded successfully!")
print(f"Number of classes: {len(model_yolo.names)}")
print("Model size: YOLOv5s (small)")

In [None]:
# Run inference with YOLOv5 (it handles preprocessing internally)
results = model_yolo(img_detect)

# Print results
print("\n" + "="*60)
print("YOLOv5 DETECTIONS")
print("="*60)
print(results.pandas().xyxy[0])  # Pandas DataFrame format
print("="*60)

In [None]:
# Visualize YOLOv5 results (built-in visualization)
results.show()  # This will display the image with bounding boxes

# Alternative: render as matplotlib
plt.figure(figsize=(12, 9))
plt.imshow(np.array(results.render()[0]))
plt.axis('off')
plt.title('YOLOv5 Object Detection Results', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Test YOLOv5 on a more complex scene
complex_scene_url = "https://images.unsplash.com/photo-1721910256794-fb3a7896ba45?fm=jpg&q=60&w=3000&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1yZWxhdGVkfDE5fHx8ZW58MHx8fHx8"
img_complex = load_image_from_url(complex_scene_url)
display_image(img_complex, "Complex Scene for Detection")

# Run detection
results_complex = model_yolo(img_complex)

# Visualize
plt.figure(figsize=(14, 10))
plt.imshow(np.array(results_complex.render()[0]))
plt.axis('off')
plt.title('YOLOv5 Detection on Complex Scene', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Print detections
print("\nDetected objects:")
print(results_complex.pandas().xyxy[0][['name', 'confidence']])

In [None]:
results_complex.show()

---
# Part 3: Instance Segmentation

## 3.1 Mask R-CNN ResNet50 FPN v2

### Overview
Mask R-CNN extends Faster R-CNN by adding a branch for predicting segmentation masks on each Region of Interest (RoI). It performs:
1. Object detection (bounding boxes + class labels)
2. Instance segmentation (pixel-level masks for each object)
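
The torchvision Mask R-CNN returns one soft mask per instance, with per-pixel values between 0 and 1, and the visualization code later in this notebook binarizes each mask at 0.5. The sketch below applies that threshold to a tiny hand-written mask and derives the pixel area and tight bounding box. The mask values and the helper name are made up for illustration.

```python
def mask_stats(soft_mask, threshold=0.5):
    """Binarize a 2D soft mask and return (binary_mask, area, bbox).

    bbox is (xmin, ymin, xmax, ymax) in pixel indices, or None for an empty mask.
    """
    binary = [[1 if value > threshold else 0 for value in row] for row in soft_mask]
    ys = [y for y, row in enumerate(binary) if any(row)]
    xs = [x for row in binary for x, value in enumerate(row) if value]
    area = sum(sum(row) for row in binary)
    bbox = (min(xs), min(ys), max(xs), max(ys)) if area else None
    return binary, area, bbox


# Toy 4x4 soft mask, not real model output.
soft_mask = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.1],
    [0.0, 0.7, 0.6, 0.2],
    [0.0, 0.1, 0.3, 0.1],
]
binary, area, bbox = mask_stats(soft_mask)
print(area, bbox)  # 4 (1, 1, 2, 2)
```

Raising the threshold gives tighter, more conservative masks, while lowering it includes more uncertain boundary pixels.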

### Key Features:
- Two-stage detector
- ResNet50 backbone with FPN (Feature Pyramid Network)
- Predicts masks at pixel level
- State of the art for instance segmentation when published (2017), and still a common baseline
- Trained on COCO dataset

**Paper:** *Mask R-CNN*  
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017)  
[ICCV 2017 Paper](https://arxiv.org/abs/1703.06870)

### Architecture:
![Mask R-CNN Architecture](https://cdn.prod.website-files.com/680a070c3b99253410dd3df5/68ee302b0bd7d85155faad63_684d854d2575826aacdbf2a1_67ed516491886a7a596e2fda_67dd6e0e3d3570fece529542_rcnn_fig4.webp)

In [None]:
# Load Mask R-CNN ResNet50 FPN v2
print("Loading Mask R-CNN ResNet50 FPN v2...")
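# `pretrained=True` is deprecated since torchvision 0.13; `weights="DEFAULT"` loads the current COCO weights
model_maskrcnn = torchvision.models.detection.maskrcnn_resnet50_fpn_v2(weights="DEFAULT")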
model_maskrcnn = model_maskrcnn.to(device)
model_maskrcnn.eval()

print("\n" + "="*60)
print("MASK R-CNN RESNET50 FPN V2")
print("="*60)
print("Model loaded successfully!")
print("Trained on COCO dataset (80 object classes)")
print("\nParameter count:")
count_parameters(model_maskrcnn)

In [None]:
%matplotlib inline
# Prepare image for Mask R-CNN
transform_maskrcnn = transforms.Compose([
    transforms.ToTensor(),
])

# Load image for segmentation
segmentation_url = "https://raw.githubusercontent.com/pytorch/hub/master/images/dog.jpg"
img_segment = load_image_from_url(segmentation_url)
display_image(img_segment, "Image for Instance Segmentation")

# Transform image
img_tensor = transform_maskrcnn(img_segment).to(device)

# Run inference
with torch.no_grad():
    predictions = model_maskrcnn([img_tensor])[0]

# Print predictions
print("\n" + "="*60)
print("MASK R-CNN PREDICTIONS")
print("="*60)
print(f"Number of detected instances: {len(predictions['labels'])}")
print("\nTop detections (confidence > 0.5):")

for i, (label, score, box) in enumerate(zip(predictions['labels'], predictions['scores'], predictions['boxes'])):
    if score > 0.5:
        print(f"{i+1}. {coco_labels[label]:20s} | Confidence: {score:.2%} | Box: [{box[0]:.1f}, {box[1]:.1f}, {box[2]:.1f}, {box[3]:.1f}]")
print("="*60)

In [None]:
# Visualize instance segmentation results with polygon contours
def visualize_instance_segmentation(image, predictions, labels, threshold=0.5):
    """
    Visualize instance segmentation with polygon masks and bounding boxes.
    Converts masks to polygon contours for cleaner visualization.
    """
    # Filter predictions by threshold first
    mask_indices = [i for i, score in enumerate(predictions['scores']) if score > threshold]

    if len(mask_indices) == 0:
        print(f"No detections found with confidence > {threshold}")
        return

    print(f"Visualizing {len(mask_indices)} detections with confidence > {threshold}...")

    # Convert image to numpy array
    if isinstance(image, Image.Image):
        img_array = np.array(image)
    else:
        img_array = image

    # Create figure with 3 subplots
    fig, axes = plt.subplots(1, 3, figsize=(24, 8))

    # Subplot 1: Original image
    axes[0].imshow(img_array)
    axes[0].set_title('Original Image', fontsize=16, fontweight='bold')
    axes[0].axis('off')

    # Subplot 2: Image with bounding boxes
    axes[1].imshow(img_array)
    axes[1].set_title(f'Bounding Boxes ({len(mask_indices)} detections)', fontsize=16, fontweight='bold')
    axes[1].axis('off')

    # Subplot 3: Image with transparent polygon masks
    axes[2].imshow(img_array)
    axes[2].set_title('Instance Segmentation Masks', fontsize=16, fontweight='bold')
    axes[2].axis('off')

    # Generate random colors for each unique class
    np.random.seed(42)
    max_label = max([predictions['labels'][idx].item() for idx in mask_indices])
    colors = np.random.rand(max_label + 1, 3)

    # Process each detection
    for idx in mask_indices:
        label = predictions['labels'][idx].item()
        score = predictions['scores'][idx].item()
        box = predictions['boxes'][idx].cpu().numpy()
        mask = predictions['masks'][idx, 0].cpu().numpy()

        # Get box coordinates
        xmin, ymin, xmax, ymax = box
        width = xmax - xmin
        height = ymax - ymin

        # Draw bounding box on subplot 2
        rect = patches.Rectangle((xmin, ymin), width, height,
                                linewidth=3, edgecolor=colors[label], facecolor='none')
        axes[1].add_patch(rect)

        # Add label text on subplot 2
        label_text = f"{labels[label]}: {score:.2f}"
        axes[1].text(xmin, ymin - 10, label_text,
                   bbox=dict(boxstyle='round', facecolor=colors[label], alpha=0.9),
                   fontsize=12, color='white', weight='bold')

        # Convert mask to polygon contours for subplot 3
        # Resize mask to image dimensions if needed
        if mask.shape != img_array.shape[:2]:
            zoom_factors = (img_array.shape[0] / mask.shape[0],
                          img_array.shape[1] / mask.shape[1])
            mask = zoom(mask, zoom_factors, order=1)

        # Create binary mask
        mask_binary = (mask > 0.5).astype(np.uint8)

        # Find contours using OpenCV
        contours, _ = cv2.findContours(mask_binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

        # Draw each contour as a polygon
        for contour in contours:
            # Simplify contour to reduce points (optional, for smoother appearance)
            epsilon = 0.005 * cv2.arcLength(contour, True)
            approx = cv2.approxPolyDP(contour, epsilon, True)

            # Convert contour to polygon coordinates
            if len(approx) >= 3:  # Need at least 3 points for a polygon
                polygon_coords = approx.reshape(-1, 2)

                # Create matplotlib Polygon patch with transparency
                poly = Polygon(polygon_coords,
                             facecolor=colors[label],
                             edgecolor=colors[label],
                             alpha=0.5,  # 50% transparency
                             linewidth=2)
                axes[2].add_patch(poly)

        # Add bounding box on subplot 3
        rect3 = patches.Rectangle((xmin, ymin), width, height,
                                 linewidth=2, edgecolor=colors[label], facecolor='none')
        axes[2].add_patch(rect3)

        # Add label text on subplot 3
        axes[2].text(xmin, ymin - 10, label_text,
                    bbox=dict(boxstyle='round', facecolor=colors[label], alpha=0.9),
                    fontsize=12, color='white', weight='bold')

    plt.tight_layout()
    plt.show()

    # Print detection summary
    print(f"\n{'='*70}")
    print(f"DETECTED {len(mask_indices)} INSTANCES (confidence > {threshold})")
    print(f"{'='*70}")
    for idx in mask_indices:
        label_id = predictions['labels'][idx].item()
        score = predictions['scores'][idx].item()
        box = predictions['boxes'][idx].cpu().numpy()
        print(f"  [{score:.2%}] {labels[label_id]:20s} @ [{box[0]:.0f}, {box[1]:.0f}, {box[2]:.0f}, {box[3]:.0f}]")
    print(f"{'='*70}\n")

# Visualize results
visualize_instance_segmentation(img_segment, predictions, coco_labels, threshold=0.5)

In [None]:
# Test on a scene with multiple objects
multi_object_url = "https://images.unsplash.com/photo-1511688878353-3a2f5be94cd7?w=800"
img_multi = load_image_from_url(multi_object_url)
display_image(img_multi, "Multi-object Scene for Segmentation")

# Transform and predict
img_multi_tensor = transform_maskrcnn(img_multi).to(device)

with torch.no_grad():
    predictions_multi = model_maskrcnn([img_multi_tensor])[0]

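# Use one threshold for both the printed count and the visualization
multi_threshold = 0.7
num_confident = int((predictions_multi['scores'] > multi_threshold).sum())
print(f"\nDetected {num_confident} objects with confidence > {multi_threshold}")

# Visualize
visualize_instance_segmentation(img_multi, predictions_multi, coco_labels, threshold=multi_threshold)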

---
# Summary and Key Takeaways

## What We Learned:

### 1. Object Detection Models
- **SSD**: Single-stage detector, good balance of speed and accuracy
- **YOLOv5**: The fastest model in this notebook, well suited to real-time applications
- Both trained on **COCO dataset** with 80 object categories
- Single-stage detectors are faster but may be less accurate than two-stage methods

### 2. Instance Segmentation
- **Mask R-CNN**: Extends object detection with pixel-level segmentation masks
- Two-stage approach: first detects objects, then segments them
- More computationally expensive but provides detailed object boundaries
- Essential for applications requiring precise object localization

## PyTorch Hub Benefits:
1. **Easy model loading**: Single line of code to load pretrained models
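2. **Reproducibility**: Pinning a branch or tag in the repository string (e.g. `NVIDIA/DeepLearningExamples:torchhub`) keeps the model code consistent across runs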
3. **Community models**: Access to models from various repositories
4. **No manual downloads**: Automatic model weight downloading

## Model Trade-offs:
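- **Speed**: YOLOv5 > SSD > Mask R-CNN (typically, for the variants used here)
- **Accuracy**: Mask R-CNN > YOLOv5 > SSD (roughly, by reported COCO box mAP for the variants used here; this varies with model size and input resolution)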
- **Detail**: Mask R-CNN provides pixel-level masks, others only bounding boxes
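
The speed ordering above is worth measuring on your own hardware. Below is a minimal timing helper, assuming the model call is wrapped in a zero-argument function. `time_inference` and its parameters are names invented for this sketch. On a GPU, the wrapped function should also call `torch.cuda.synchronize()` before returning, because CUDA kernels run asynchronously and would otherwise be under-counted.

```python
import time


def time_inference(run_once, n_warmup=2, n_runs=10):
    """Return the mean wall-clock seconds per call of `run_once`."""
    for _ in range(n_warmup):       # Warm-up calls are not timed
        run_once()
    start = time.perf_counter()
    for _ in range(n_runs):
        run_once()
    return (time.perf_counter() - start) / n_runs


# Stand-in workload so the sketch runs without a model.
mean_seconds = time_inference(lambda: sum(range(10_000)))
print(f"{mean_seconds * 1000:.3f} ms per call")
```

For the models in this notebook, `run_once` could be something like `lambda: model_yolo(img_detect)`, with the same image passed to each model so the comparison is fair.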

## Practical Applications:
- **Object Detection**: Autonomous driving, surveillance, retail analytics, crowd counting
- **Instance Segmentation**: Medical imaging, robotics, augmented reality, precise object manipulation

## Next Steps:
1. Fine-tune models on custom datasets
2. Experiment with different YOLOv5 variants (n, s, m, l, x)
3. Deploy models for production use
4. Optimize for mobile/edge devices (ONNX, TensorRT)
5. Explore more recent architectures (YOLOv8, YOLOv9, DETR, Swin Transformer)

---
## Additional Resources

### Papers:
1. **SSD**: Liu et al., "SSD: Single Shot MultiBox Detector" (ECCV 2016) - [arxiv.org/abs/1512.02325](https://arxiv.org/abs/1512.02325)
2. **YOLOv5**: Ultralytics - [GitHub Repository](https://github.com/ultralytics/yolov5)
3. **Mask R-CNN**: He et al., "Mask R-CNN" (ICCV 2017) - [arxiv.org/abs/1703.06870](https://arxiv.org/abs/1703.06870)

### Datasets:
- [COCO Dataset](https://cocodataset.org/) - 80 object categories for detection and segmentation
- [ImageNet](https://www.image-net.org/) - 1000 classes (ImageNet1k) or 21,841 classes (ImageNet21k)
- [Papers With Code - Datasets](https://paperswithcode.com/datasets)

### Documentation:
- [PyTorch Hub](https://pytorch.org/hub/) - Pre-trained model repository
- [TorchVision Models](https://pytorch.org/vision/stable/models.html) - Official torchvision models
- [Ultralytics YOLOv5 Docs](https://docs.ultralytics.com/yolov5/)
- [PyTorch Tutorials](https://pytorch.org/tutorials/)

### Model Repositories:
- [NVIDIA Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples) - SSD and other optimized models
- [Ultralytics YOLOv5](https://github.com/ultralytics/yolov5) - Most popular YOLO implementation
- [Detectron2](https://github.com/facebookresearch/detectron2) - Facebook's detection and segmentation platform

### Further Exploration:
- Try different YOLOv5 model sizes: `yolov5n`, `yolov5s`, `yolov5m`, `yolov5l`, `yolov5x`
- Explore YOLOv8 and YOLOv9 (newer YOLO versions that report improved performance)
- Learn about semantic segmentation (FCN, DeepLab, SegFormer)
- Experiment with model quantization and pruning for deployment
- Convert models to ONNX format for cross-platform inference
- Use TensorRT for optimized GPU inference