# Object Detection with SSD on the SVHN Dataset

In this tutorial, you will learn to:
- Apply object detection to the Street View House Numbers (SVHN) dataset.
- Get started with SSD using `horch`.

Run the following cell to load the packages and dependencies that are going to be useful for your journey!

In [None]:
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader

from horch import cuda
from horch.datasets import train_test_split, CocoDetection, Fullset

from horch.train import Trainer, Save
from horch.train.metrics import TrainLoss, CocoAveragePrecision, COCOEval
from horch.train.lr_scheduler import CosineAnnealingWarmRestarts

from horch.transforms import Compose, ToTensor
from horch.transforms.detection import Resize, ToPercentCoords, SSDTransform
from horch.transforms.detection.functional import to_absolute_coords

from horch.detection import generate_mlvl_anchors, misc_target_collate, draw_bboxes, find_priors_coco
from horch.detection.one import MatchAnchors, AnchorBasedInference, MultiBoxLoss

from horch.models.utils import summary
from horch.models.detection.backbone import ShuffleNetV2
from horch.models.detection.enhance import FPN
from horch.models.detection.head import SSDHead
from horch.models.detection import OneStageDetector

%matplotlib inline
%load_ext autoreload
%autoreload 2

## 1 - Dataset

SVHN is a real-world image dataset for developing machine learning and object recognition algorithms. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images. 

Dataset Overview:
- 10 classes, 1 for each digit. Digit '1' has label 1, '9' has label 9 and '0' has label 10.
- 33402 images and 73257 digits for training, 13068 images and 26032 digits for testing

Let's load the train set from the prepared COCO format data and check some samples from it with `draw_bboxes` from `horch.detection`.

In [None]:
image_dir = "./SVHN/train"
ann_file = "./SVHN/annotations/train.json"
ds = CocoDetection(image_dir, ann_file)
print("The dataset has %d samples in total." % len(ds))

Because digit '1' has label 1, '9' has label 9, '0' has label 10, and no digit has label 0, we set the argument `categories` to `' 1234567890'`. Notice that the first character of `categories` is a blank, which stands in for the unused label 0.
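
To make the mapping concrete, here is a minimal sketch of how such a string can be indexed by label. It assumes that the label is used directly as an index into `categories`; the helper `label_to_char` is hypothetical and not part of `horch`.

```python
# Index i of the string is the character shown for label i.
categories = " 1234567890"

def label_to_char(label):
    # Hypothetical helper: label 0 is unused in SVHN,
    # so it maps to the blank placeholder.
    return categories[label]

for label in (1, 9, 10):
    print(label, "->", repr(label_to_char(label)))
```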

In [None]:
i = 0
img, anns = ds[i]
print("The size of image of sample #%d is %s" % (i, img.size))
print("Annotations: %s" % anns)
fig, ax = draw_bboxes(img, anns, categories=' 1234567890')

In [None]:
i = 1
img, anns = ds[i]
print("The size of image of sample #%d is %s" % (i, img.size))
print("Annotations: %s" % anns)
fig, ax = draw_bboxes(img, anns, categories=' 1234567890')

In [None]:
i = 2
img, anns = ds[i]
print("The size of image of sample #%d is %s" % (i, img.size))
print("Annotations: %s" % anns)
fig, ax = draw_bboxes(img, anns, categories=' 1234567890')

In [None]:
i = 2579
img, anns = ds[i]
print("The size of image of sample #%d is %s" % (i, img.size))
print("Annotations: %s" % anns)
fig, ax = draw_bboxes(img, anns, categories=' 1234567890')

The images vary widely in size (741x350, 354x173, 199x83 and 52x23) but have similar aspect ratios (roughly 2:1). The bounding boxes of the digits also vary in size, share a similar tall, thin aspect ratio, and sit very close to each other.
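
As a quick check of the "roughly 2:1" observation, the width-to-height ratio of the four image sizes quoted above can be computed directly:

```python
# Image sizes (width, height) of the four samples shown above.
sizes = [(741, 350), (354, 173), (199, 83), (52, 23)]
ratios = [w / h for w, h in sizes]

for (w, h), r in zip(sizes, ratios):
    print("%dx%d -> %.2f" % (w, h, r))
```

All four ratios fall between 2.0 and 2.5, which is why a single 2:1 target size is a reasonable choice.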

After inspection, we decide to resize every image to 192x96 and detect on feature levels 3, 4 and 5 (strides 8, 16 and 32). Because the bounding boxes have similar aspect ratios, we assign only 3 anchors (priors) to each feature level (9 in total). We could define them by hand, but it is more precise and convenient to find appropriate anchors with `find_priors_coco` from `horch.detection`.
`find_priors_coco` finds appropriate anchors using k-means (as described in YOLOv2). Its outputs are in percent coordinates, so we scale them back to absolute coordinates.
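
For intuition, the following is a minimal sketch of YOLOv2-style k-means over box widths and heights, where the closest cluster is the one with the highest IoU (equivalently, the smallest `1 - IoU` distance). It is not the implementation of `find_priors_coco`; the function names, the deterministic initialisation and the toy boxes are all assumptions made for illustration.

```python
def wh_iou(a, b):
    # IoU of two boxes aligned at the same corner,
    # so only width and height matter.
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_priors(boxes, k, iters=50):
    # Deterministic init (an assumption): k boxes evenly spaced by area.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    step = len(ordered) / k
    centers = [ordered[int(step * i + step / 2)] for i in range(k)]
    for _ in range(iters):
        # Assign every box to the center with the highest IoU.
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda j: wh_iou(b, centers[j]))
            clusters[best].append(b)
        # Move every center to the mean width/height of its members.
        new_centers = []
        for c, members in zip(centers, clusters):
            if members:
                w = sum(m[0] for m in members) / len(members)
                h = sum(m[1] for m in members) / len(members)
                new_centers.append((w, h))
            else:
                new_centers.append(c)  # keep an empty cluster where it was
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])

def mean_best_iou(boxes, centers):
    # Average, over all boxes, of the IoU with the best-fitting center.
    return sum(max(wh_iou(b, c) for c in centers) for b in boxes) / len(boxes)

# Toy (width, height) pairs in percent coordinates; made-up values.
boxes = [(0.05, 0.30), (0.06, 0.32), (0.10, 0.50), (0.11, 0.55),
         (0.20, 0.80), (0.22, 0.85), (0.05, 0.28), (0.10, 0.52)]
priors = kmeans_priors(boxes, k=3)
print(priors)
print("mean best IoU: %.3f" % mean_best_iou(boxes, priors))
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, which is the motivation given in YOLOv2.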

In [None]:
width = 192
height = 96
levels = (3, 4, 5)
priors_per_level = 3
num_classes = 11  # 10 digits + 1 background
strides = [2 ** l for l in levels]
num_levels = len(levels)

priors = find_priors_coco(ds, k=num_levels * priors_per_level, verbose=False)
print("Generated priors (percent):")
print(priors)
anchor_sizes = priors.view(num_levels, priors_per_level, 2) * torch.tensor([width, height], dtype=torch.float32)
print("Anchors (3 levels, 3 per level, absolute):")
print(anchor_sizes)

This function finds very good anchors: the mean of the maximum IoU between anchors and bounding boxes is higher than 0.8. Finally, we use them to generate uniformly distributed anchors for every feature level.

In [None]:
anchors = generate_mlvl_anchors((width, height), strides, anchor_sizes)
for a in anchors:
    print(a.shape)
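
To see how many anchors this produces, here is a minimal sketch that lays one anchor center in the middle of each feature-map cell and counts the anchors per level. The cell-center placement and the helper `anchor_centers` are assumptions; `generate_mlvl_anchors` may use a different convention or output layout.

```python
width, height = 192, 96
levels = (3, 4, 5)
priors_per_level = 3

def anchor_centers(width, height, stride):
    # One center per feature-map cell, placed at the middle of the cell.
    cols, rows = width // stride, height // stride
    return [((x + 0.5) * stride, (y + 0.5) * stride)
            for y in range(rows) for x in range(cols)]

total = 0
for level in levels:
    stride = 2 ** level
    centers = anchor_centers(width, height, stride)
    count = len(centers) * priors_per_level
    total += count
    print("level %d: stride %2d, grid %dx%d, %d anchors"
          % (level, stride, width // stride, height // stride, count))
print("total anchors:", total)
```

With a 192x96 input, the three levels have 24x12, 12x6 and 6x3 cells, so most anchors live on the finest level.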

Then we define the transforms for training and testing. The `train_transform` resizes a sample to the expected size, converts the bounding box coordinates to percent, converts the image to a tensor, and finally matches each ground truth box to the anchor with the best IoU and matches anchors to any ground truth with IoU higher than a threshold (0.5).
For simplicity, we skip data augmentation, train on only a small subset, and evaluate on that same subset to check how high an AP our model can achieve.
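
The two matching rules can be sketched in plain Python as follows. This is only an illustration of the idea, not the code of `MatchAnchors`; the function names, the corner-format boxes and the toy values are assumptions.

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match_anchors(anchors, gt_boxes, pos_thresh=0.5):
    # match[i] is the index of the ground truth assigned to anchor i,
    # or -1 if the anchor is treated as background.
    match = [-1] * len(anchors)
    # Threshold rule: an anchor takes its best ground truth if IoU > pos_thresh.
    for i, a in enumerate(anchors):
        ious = [iou(a, g) for g in gt_boxes]
        j = max(range(len(gt_boxes)), key=lambda k: ious[k])
        if ious[j] > pos_thresh:
            match[i] = j
    # Best-anchor rule: every ground truth claims its best anchor,
    # even if that IoU is below the threshold.
    for j, g in enumerate(gt_boxes):
        best = max(range(len(anchors)), key=lambda i: iou(anchors[i], g))
        match[best] = j
    return match

anchors = [(0, 0, 10, 20), (8, 0, 18, 20), (30, 0, 40, 20), (60, 0, 70, 20)]
gt_boxes = [(1, 0, 11, 20), (33, 0, 47, 20)]
print(match_anchors(anchors, gt_boxes))
```

In this toy case the second ground truth overlaps its best anchor with an IoU below 0.5, so only the best-anchor rule keeps it from being ignored during training.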

In [None]:
train_transform = Compose([
    Resize((width, height)),
    ToPercentCoords(),
    ToTensor(),
    MatchAnchors(anchors, pos_thresh=0.5),
])

test_transform = Compose([
    Resize((width, height)),
    ToPercentCoords(),
    ToTensor(),
])

rest, ds_small = train_test_split(ds, test_ratio=0.0005)
print("The small subset has %d samples." % len(ds_small))

ds_train = Fullset(ds_small, train_transform)
ds_val = Fullset(ds_small, test_transform)

## 2 - Model

Now it's time to define our SSD model, which is very simple using `horch`. Unlike the original SSD, we use the lightweight ShuffleNetV2 as our backbone rather than VGG. Compared to VGG16, ShuffleNetV2 has 60x fewer parameters (2.3M vs 138M) and similar classification performance (69.4% vs 70.5%).

Let's see how it works.
<img src="model.png">
<img src="get_loc_cls_preds.png">

Then we create ShuffleNetV2 from `horch.models.detection.backbone`. Note that the cell below passes `pretrained=False`; set it to `True` to start from the weights pretrained on ImageNet. `horch.models.detection.backbone` provides many kinds of backbones for different usage scenarios, e.g. ResNet, Darknet and ShuffleNetV2.

In [None]:
backbone = ShuffleNetV2(feature_levels=levels, pretrained=False)

We can show a summary of it with `summary` from `horch.models.utils`.

In [None]:
summary(backbone, (3, height, width))

Check the output channels of the backbone.

In [None]:
cs = backbone(torch.randn(1,3,height,width))
for c in cs:
    print(c.shape)

All backbones from `horch` have the attribute `out_channels` that returns the output channels of the given feature levels.

In [None]:
backbone.out_channels

Then we define the head of SSD and check its outputs. Notice that heads in `horch` accept a variable number of arguments, so we pass the feature maps with argument unpacking (`*cs`).
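
As a reminder of how this works in Python, here is a tiny sketch with a stand-in function. `toy_head` is hypothetical and only mimics the variadic signature; the list lengths reuse the channel counts that appear in the cell below.

```python
def toy_head(*features):
    # Each feature level arrives as a separate positional argument.
    return [len(f) for f in features]

# Stand-ins for the three feature maps returned by the backbone.
cs = [[0] * 116, [0] * 232, [0] * 1024]

# `*cs` spreads the list into three positional arguments.
print(toy_head(*cs))
print(toy_head(*cs) == toy_head(cs[0], cs[1], cs[2]))
```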

In [None]:
head = SSDHead(priors_per_level, num_classes, backbone.out_channels)
summary(head, [(116, 12, 24), (232, 6, 12), (1024, 3, 6)])

In [None]:
loc_p, cls_p = head(*cs)
print(loc_p.shape)
print(cls_p.shape)

Next we define inference using `AnchorBasedInference` from `horch.detection.one`. It is suitable for nearly all one-stage detection models (SSD, DSSD, FPN, RetinaNet) and can be easily extended to other models (Faster R-CNN, RefineDet).

`AnchorBasedInference` infers detections from `loc_p` and `cls_p` based on `anchors`, then filters out detections whose confidence is lower than `conf_threshold` (0.1 here) and removes highly overlapping detections using non-max suppression with the IoU threshold `iou_threshold` (0.45 here).
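
The filtering and suppression steps can be sketched as a greedy, class-agnostic NMS. This is a simplified illustration rather than the code inside `AnchorBasedInference`; the function names and toy detections are assumptions, and the default thresholds mirror the values passed in the cell below.

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(dets, conf_threshold=0.1, iou_threshold=0.45):
    # dets: list of (score, box). Drop low-confidence detections first,
    # then visit the rest from highest to lowest score.
    dets = sorted((d for d in dets if d[0] >= conf_threshold),
                  key=lambda d: d[0], reverse=True)
    keep = []
    for score, box in dets:
        # Keep a box only if it does not overlap a kept box too much.
        if all(iou(box, kept) <= iou_threshold for _, kept in keep):
            keep.append((score, box))
    return keep

dets = [(0.9, (0, 0, 10, 20)), (0.8, (1, 0, 11, 20)),
        (0.6, (30, 0, 40, 20)), (0.05, (60, 0, 70, 20))]
for score, box in nms(dets):
    print(score, box)
```

A lower `iou_threshold` suppresses more aggressively, and a higher `conf_threshold` discards more detections before suppression even starts.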

In [None]:
inference = AnchorBasedInference(cuda(anchors), conf_threshold=0.1, iou_threshold=0.45, nms='nms')

Try it! Because we have not trained our model yet, the detection results are random. Notice that the results are also in COCO format, but with percent coordinates.

Notice: `inference` uses the GPU by default, but our model is on the CPU now. We must explicitly move the outputs of the model to the GPU before passing them to `inference`.

In [None]:
inference(cuda(loc_p), cuda(cls_p))
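
Percent coordinates are easy to map back to pixels. A minimal sketch, assuming a COCO-style `[x, y, w, h]` box whose values are fractions of the image width and height; `to_absolute` is a hypothetical helper and not the `to_absolute_coords` function used later.

```python
def to_absolute(bbox, size):
    # bbox: [x, y, w, h] as fractions of the image size (0..1).
    # size: (width, height) of the image in pixels.
    width, height = size
    x, y, w, h = bbox
    return [x * width, y * height, w * width, h * height]

print(to_absolute([0.25, 0.5, 0.125, 0.5], (200, 100)))
```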

Now we compose them into SSD, which is rather simple.

In [None]:
class SSD(nn.Module):
    def __init__(self, backbone, head, inference):
        super().__init__()
        self.backbone = backbone
        self.head = head
        self._inference = inference

    def forward(self, x):
        cs = self.backbone(x)
        loc_p, cls_p = self.head(*cs)
        return loc_p, cls_p

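    def inference(self, x):
        # Remember the current mode so that inference does not silently
        # switch a model that was in eval mode back to train mode.
        was_training = self.training
        self.eval()
        with torch.no_grad():
            loc_p, cls_p = self.forward(x)
            dets = self._inference(loc_p, cls_p)
        self.train(was_training)
        return dets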


In [None]:
net = SSD(backbone, head, inference)

## 3 - Training

Now, let's see how to train our model.

We use `MultiBoxLoss` from `horch.detection.one` as our loss function. Like `AnchorBasedInference`, it is suitable for nearly all one-stage detection models (SSD, DSSD, FPN, RetinaNet) and can be easily extended to other models (Faster R-CNN, RefineDet).

<img src="train.png">

`Adam` is the default optimizer in most cases: it converges quickly and gives acceptable performance. `SGD` with carefully selected hyperparameters gives better performance but needs more time to tune and train.

`CosineAnnealingWarmRestarts` decays the learning rate with cosine annealing, restarting it periodically.
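
The schedule can be sketched with the SGDR formula that PyTorch's scheduler of the same name uses when every cycle has the same length. Whether `horch` follows it exactly is an assumption, as are the `eta_min` parameter and counting `T_0` in epochs; `cosine_warm_restart_lr` is a hypothetical helper.

```python
import math

def cosine_warm_restart_lr(epoch, base_lr=0.001, T_0=100, eta_min=0.0):
    # Position inside the current cycle; the cycle restarts every T_0 epochs.
    t_cur = epoch % T_0
    cosine = (1 + math.cos(math.pi * t_cur / T_0)) / 2
    return eta_min + (base_lr - eta_min) * cosine

for epoch in (0, 25, 50, 75, 99, 100):
    print(epoch, "%.6f" % cosine_warm_restart_lr(epoch))
```

The rate starts at `base_lr`, falls to half of it in the middle of a cycle, approaches `eta_min` at the end, and jumps back to `base_lr` at the restart.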

<img src="cosinelr.png">

In [None]:
criterion = MultiBoxLoss(p=1)  # p is the probability of printing the loss
optimizer = Adam(filter(lambda x: x.requires_grad,
                        net.parameters()), lr=0.001, weight_decay=1e-5)
lr_scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=100)

We use 'loss' as the train-time metric and 'AP' as the test-time metric. Finally, we combine everything into a `Trainer` and set "./checkpoints" as the path for saving the trained model.

In [None]:
metrics = {
    'loss': TrainLoss(),
}

test_metrics = {
    "AP": COCOEval(ds_small.dataset.to_coco(ds_small.indices))
}

trainer = Trainer(net, criterion, optimizer, lr_scheduler,
                  metrics=metrics, evaluate_metrics=test_metrics,
                  save_path="./checkpoints", name="SSD-SVHN")

Define `train_loader` for training and `val_loader` for evaluation. We use a small batch size because our dataset is only a small subset of the original dataset. The `misc_target_collate` collate function in `val_loader` is necessary for evaluating object detection models.

In [None]:
train_loader = DataLoader(
    ds_train, batch_size=2, shuffle=True)
val_loader = DataLoader(
    ds_val, batch_size=64, collate_fn=misc_target_collate)

We train for 100 epochs, evaluate every 10 epochs and save the model with the highest AP after 80 epochs. Open `localhost:6006` to watch the metrics live in TensorBoard.

In [None]:
trainer.fit(train_loader, 100, val_loader=(val_loader, 10), save=Save.ByMetric("val_AP", patience=80))

In theory, our model should be able to overfit the small subset and achieve an AP near 1.0. However, even after training for 100 epochs, it has not reached a good result.

The reason is that our model uses Batch Normalization by default, which works very poorly when the batch size is small.
Try using Group Normalization in the backbone and head, then train a new model. The model should then converge very quickly and achieve an AP near 1.0.

Hint: set the argument `norm_layer` to `gn`.
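
To see why batch size matters, the sketch below contrasts the two normalizations on toy data in plain Python. Batch Normalization (in training mode) computes statistics per channel across the whole batch, so a sample's output depends on its batch mates; Group Normalization computes them per sample over groups of channels. The learnable scale and shift are omitted, and all names and values are made up for illustration.

```python
def normalize(values, eps=1e-5):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

def group_norm(sample, num_groups):
    # sample: list of channels, each a flat list of activations.
    # Statistics come from one sample only, one set per channel group.
    per_group = len(sample) // num_groups
    out = []
    for g in range(num_groups):
        group = sample[g * per_group:(g + 1) * per_group]
        n = len(group[0])
        flat = normalize([v for channel in group for v in channel])
        out.extend(flat[i * n:(i + 1) * n] for i in range(per_group))
    return out

def batch_norm(batch):
    # Statistics per channel, computed across every sample in the batch.
    out = [[None] * len(batch[0]) for _ in batch]
    for c in range(len(batch[0])):
        n = len(batch[0][c])
        flat = normalize([v for sample in batch for v in sample[c]])
        for i in range(len(batch)):
            out[i][c] = flat[i * n:(i + 1) * n]
    return out

# Three samples with 4 channels and 2 activations per channel.
a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
b = [[10.0, 0.0], [0.0, 10.0], [2.0, 2.5], [9.0, 1.0]]
c = [[-5.0, 5.0], [1.0, 1.5], [0.0, 3.0], [4.0, 4.0]]

# Group Norm only ever looks at `a`, so its output is fixed.
gn_a = group_norm(a, num_groups=2)
# Batch Norm output for `a` changes when its batch mate changes.
bn_ab = batch_norm([a, b])[0]
bn_ac = batch_norm([a, c])[0]
print("BN output for a depends on the batch:", bn_ab != bn_ac)
```

With a batch size of 2, the batch statistics are estimated from very few values and fluctuate from step to step, which is the instability the hint above avoids.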

After trying Group Normalization and training a very good model on the small subset, let's see how it performs at inference time.

In [None]:
img, anns = ds_small[0]

In [None]:
draw_bboxes(img, anns, categories=" 1234567890")
print("The ground truth is:")

In [None]:
x = test_transform(img, anns)[0]
x = cuda(x[None])

dets = net.inference(x)[0]
dets = to_absolute_coords(dets, img.size)

draw_bboxes(img, dets, categories=" 1234567890")
print("Our detections are:")

- If you see some obviously wrong detections, set `conf_threshold` of `inference` to a higher value (e.g. 0.3).
- If you feel that non-max suppression didn't work well, set `iou_threshold` of `inference` to a lower value (e.g. 0.25).

In [None]:
net._inference.iou_threshold = 0.25
net._inference.conf_threshold = 0.3

In [None]:
x = test_transform(img, anns)[0]
x = cuda(x[None])

dets = net.inference(x)[0]
dets = to_absolute_coords(dets, img.size)

draw_bboxes(img, dets, categories=" 1234567890")
print("Our detections are:")

## What's Next

- **B** Train a model on the full SVHN dataset and get AP higher than 0.375.
- **A** Add FPN (`horch.models.detection.enhance.FPN`) to the model. (Hint: place it between the backbone and the head.)
- **A** Try to replace the backbone with MobileNetV2. It is more powerful than ShuffleNetV2 and just as fast.
- **S** Train a model on the full SVHN dataset and get AP around 0.45 with fewer than 4M parameters. You may need `horch.models.detection.RefineDet` or `horch.models.detection.FCOS`.
- **S** Train a model on your own dataset.