# Object category detection practical

This is an [Oxford Visual Geometry Group](http://www.robots.ox.ac.uk/~vgg) computer vision practical, authored by [Andrea Vedaldi](http://www.robots.ox.ac.uk/~vedaldi/) and Andrew Zisserman (Release 2018a).

![cover](data/cover.jpg)

The goal of *object category detection* is to identify and localize objects of a given type in an image. Example applications include detecting pedestrians, cars, or traffic signs in street scenes, objects of interest such as tools or animals in web images, or particular features in medical images.

Given an object type, such as *people*, a *detector* receives as input an image and produces as output zero, one, or more bounding boxes around each occurrence of the object in the image. The key challenge is that the detector needs to find objects regardless of their location and scale in the image, as well as pose and other variation factors, such as clothing, illumination, occlusions, etc.

This practical explores basic techniques in visual object detection, focusing on *image based models*. The appearance of image patches containing objects is learned using statistical analysis. Then, in order to detect objects in an image, the statistical model is applied to image windows extracted at all possible scales and locations, in order to identify which ones, if any, contain the object.

In more detail, the practical explores the following topics: (i) using HOG features to describe image regions; (ii) building a HOG-based sliding-window detector to localize objects in images; (iii) working with multiple scales and multiple object occurrences; (iv) using a linear support vector machine to learn the appearance of objects; (v) evaluating an object detector in terms of average precision; (vi) learning an object detector using hard negative mining.

$$
\newcommand{\bx}{\mathbf{x}}
\newcommand{\by}{\mathbf{y}}
\newcommand{\bz}{\mathbf{z}}
\newcommand{\bw}{\mathbf{w}}
\newcommand{\bp}{\mathbf{p}}
\newcommand{\cP}{\mathcal{P}}
\newcommand{\cN}{\mathcal{N}}
\newcommand{\vv}{\operatorname{vec}}
$$

In [None]:
%%javascript
MathJax.Hub.Config({
    TeX: { equationNumbers: { autoNumber: "AMS" } }
});

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

<a id="part1"></a>

## Part 1: Detection fundamentals

As a running example, we consider the problem of street sign detection, using the data from the [German Traffic Sign Detection Benchmark](http://benchmark.ini.rub.de/?section=gtsdb&subsection=news). This data consists of a number of example traffic images, as well as a number of larger test images containing one or more traffic signs at different sizes and locations. It also comes with *ground truth* annotation, i.e. with specified bounding boxes and sign labels for each sign occurrence, which is required to evaluate the quality of the detector.

In this part we will build a basic sliding-window object detector based on HOG features. Follow the steps below.

### Step 1.0: Loading the training data

First, we will load a Python data structure containing the data for the practical. To this end, we use the `lab.load_data` function of the `lab` package and store the result in a dictionary `imdb`. This dictionary has fields:

* `imdb['train']['images']`: a list of train image names.
* `imdb['train']['boxes']`: a $N\times 4$ tensor of object bounding boxes, in the form $[x_\text{min},y_\text{min},x_\text{max},y_\text{max}]$.
* `imdb['train']['box_images']`: for each bounding box, the name of the image containing it.
* `imdb['train']['box_labels']`: for each bounding box, the object label.
* `imdb['train']['box_patches']`: a $N \times 3 \times 64 \times 64$ tensor of image patches, one for each training object. Patches are in RGB format.

An analogous set of fields `imdb['val']` is provided for the validation data. Familiarise yourself with the contents of these variables.

> **Task:** Run the following code to load the data structure.

In [None]:
# Load the data.
import lab
import torch
import copy
torch.set_grad_enabled(False)

imdb = lab.load_data('mandatory') ;

> **Question:** Why are there both an `imdb['train']['images']` and an `imdb['train']['box_images']` field?

### Step 1.1: Visualize the training image patches

Next, we use the function `lab.imarraysc` to display the image patches stored in `imdb['train']['box_patches']`. The function `lab.imarraysc` takes as input a $N\times3\times H\times W$ tensor of $N$ RGB images of size $H\times W$.

> **Task:** Run the following code to visualize the training image patches.

In [None]:
from matplotlib import pyplot as plt

# Plot the training patches.
plt.figure(1, figsize=(12,12))
lab.imarraysc(imdb['train']['box_patches'])
plt.title('Training patches')

We also plot the mean of these patches using the function `lab.imsc`, which instead takes as input a single $3\times H\times W$ image.

> **Task:** Run the following code to visualize the mean training image.

In [None]:
# Plot the average patch.
plt.figure(2,figsize=(8,8))
lab.imsc(imdb['train']['box_patches'].mean(0))
plt.title('Training patches mean') ;

> **Question:** what can you deduce about the object variability from the average image?

> **Question:** most boxes extend slightly around the object extent. Why do you think this may be valuable in learning a detector?

### Step 1.2: Extract HOG features from the training patches

Object detectors usually work on top of a layer of low-level features. We will use a *Convolutional Neural Network* to extract such features. However, instead of learning it from data, as is customary, we take a shortcut and create a *handcrafted* network that computes HOG (*Histogram of Oriented Gradients*) features. In this manner, we can still make efficient use of the underlying PyTorch library.

In order to learn a model of the object, we start by extracting features from the image patches corresponding to the available training examples. HOG is computed by the `lab.HOGNet()` neural network. This network takes as input a grayscale or RGB image (or a batch of $N$ images), represented as an $N \times C \times H \times W$ PyTorch tensor, where $C$ is either 1 or 3. The output is an $N \times 27 \times H/\text{cell-size} \times W/\text{cell-size}$ tensor, where `cell_size` is 8 by default.
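
As a quick sanity check on these shapes, the following sketch computes the expected size of the HOG output from the input size. It is not part of the `lab` package: the helper name `hog_output_shape` is hypothetical, and it assumes that the spatial size is divided by `cell_size` and rounded down, which may differ from how `lab.HOGNet` treats image borders.

```python
def hog_output_shape(batch, height, width, cell_size=8, channels=27):
    """Expected (N, C, H', W') shape of a HOG map.

    Assumption: each spatial dimension is divided by cell_size and
    rounded down; the real extractor may handle borders differently.
    """
    return (batch, channels, height // cell_size, width // cell_size)


# A batch of ten 64 x 64 training patches.
print(hog_output_shape(10, 64, 64))   # (10, 27, 8, 8)

# A hypothetical 600 x 800 (height x width) test image.
print(hog_output_shape(1, 600, 800))  # (1, 27, 75, 100)
```

Under this assumption, each $64 \times 64$ training patch maps to an $8 \times 8$ grid of 27-dimensional HOG cells.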

> **Task:** Run the following code to extract the HOG representation of the training patches.

In [None]:
# Get the HOG extractor neural network.
hog_extractor = lab.HOGNet()

# Get the HOG representation of the training patches.
hog_train = hog_extractor(imdb['train']['box_patches'])

<a id="part1.3"></a>

### Step 1.3: Learn a simple HOG template model 

A very basic object model can be obtained by averaging the features of the example objects, then subtracting the mean feature value, as follows.

> **Task:** Run the following code to train a basic model for a target class.

In [None]:
# Average the HOG of the corresponding training images
w = torch.mean(hog_train, 0, keepdim=True)
w = w - w.mean()
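
To see what subtracting the mean buys us, here is a toy sketch in plain Python that uses short feature vectors in place of HOG tensors. The names `mean_template` and `score` are illustrative only and the numbers are made up. Because the resulting template sums to zero, a perfectly flat feature vector receives a score of zero.

```python
def mean_template(features):
    """Average equal-length feature vectors, then subtract the global mean."""
    n = len(features)
    dim = len(features[0])
    avg = [sum(f[d] for f in features) / n for d in range(dim)]
    offset = sum(avg) / dim
    return [a - offset for a in avg]


def score(template, feature):
    """Inner product between a template and a feature vector."""
    return sum(t * f for t, f in zip(template, feature))


# Two made-up "positive" examples with energy in the first two dimensions.
examples = [[1.0, 3.0, 0.0, 0.0],
            [3.0, 1.0, 0.0, 0.0]]
w_toy = mean_template(examples)

print(w_toy)                               # [1.0, 1.0, -1.0, -1.0]
print(score(w_toy, [5.0, 5.0, 5.0, 5.0]))  # 0.0, a flat input scores zero
print(score(w_toy, [2.0, 2.0, 0.0, 0.0]))  # 4.0, an object-like input scores high
```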

The model can be visualized by *rendering* `w` as if it were a HOG feature array. This can be done using the `hog_extractor.to_image` method. This method takes a $N\times 27 \times H\times W$ HOG tensor and returns a corresponding $N\times 1\times H_r\times W_r$ tensor of rendered HOG images.

> **Task:** Run the following code to visualize the model `w`.

In [None]:
# Render the HOG descriptor
wim = hog_extractor.to_image(w)

# Plot it
plt.figure(figsize=(6,6))
lab.imsc(wim[0]) ;

Spend some time to study this plot and make sure you understand what is visualized.

> **Question:** Can you make sense of the resulting plot?

### Step 1.4: Apply the model to a test image

The model is matched to a test image by: (i) extracting the HOG features of the image and (ii) convolving the model over the resulting feature map. This can be conveniently achieved with PyTorch's convolution operator `nn.Conv2d`, by wrapping the model in a corresponding convolutional layer.

> **Task:** Run the following code to apply the learned model to an image.

In [None]:
import torch
import torch.nn as nn

# Wrap the model w in a convolutional layer.
model = nn.Conv2d(27, 1, w.shape[2:], bias=False)
model.weight.data = w
print("Model parameters:", model)

# Extract the HOG representation of a test image.
im = lab.imread('data/mandatory.jpg')
hog = hog_extractor(im)

# Apply the model.
scores = model(hog)

> **Tasks:** 
> 1. Work out the dimensions of the `scores` array. Then, check your result against the dimensions of the array computed by PyTorch.
> 2. Visualize the image `im` and the `scores` array using the provided example code below. Does the result match your expectations?

In [None]:
# Visualize the image.
plt.figure(1, figsize=(8,8))
lab.imsc(im[0])
plt.title('Input image')

# Visualize its HOG representation.
plt.figure(2, figsize=(8,8))
lab.imsc(hog_extractor.to_image(hog)[0])
plt.title('HOG representation')

# Visualize the scores as a heatmap.
plt.figure(3, figsize=(8,8))
lab.imsc(scores[0])
plt.title('Scores') ;
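
The size of the score map follows from how a convolutional layer without padding slides the template over the feature map: a template of $h \times w$ cells fits at $(H-h+1) \times (W-w+1)$ positions in an $H \times W$ map. The sketch below illustrates this on a tiny single-channel example in plain Python. The function `cross_correlate` is an illustrative stand-in, not the PyTorch implementation, and the numbers are made up.

```python
def cross_correlate(feature_map, template):
    """'Valid' 2-D cross-correlation of single-channel lists of lists."""
    H, W = len(feature_map), len(feature_map[0])
    th, tw = len(template), len(template[0])
    out = []
    for v in range(H - th + 1):
        row = []
        for u in range(W - tw + 1):
            # Inner product between the template and the window at (v, u).
            row.append(sum(feature_map[v + i][u + j] * template[i][j]
                           for i in range(th) for j in range(tw)))
        out.append(row)
    return out


# A 4 x 5 toy feature map containing a 2 x 2 blob, and a 2 x 2 template.
fmap = [[0, 0, 0, 0, 0],
        [0, 1, 1, 0, 0],
        [0, 1, 1, 0, 0],
        [0, 0, 0, 0, 0]]
tmpl = [[1, 1],
        [1, 1]]

scores_toy = cross_correlate(fmap, tmpl)
print(len(scores_toy), len(scores_toy[0]))  # 3 4
print(scores_toy[1][1])                     # 4, the peak where the blob is
```

With several channels, as in the 27-channel HOG case, the products are additionally summed over channels, but the spatial size of the output is unchanged.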

### Step 1.5: Extract the top detection

Now that the model has been applied to the image, we have a response map `scores`. To extract a detection from this, we (i) find which image boxes correspond to each entry in the `scores` tensor and (ii) find the box that has the maximum score.

The first step is done by the `lab.boxes_for_scores` function, which takes as input a $1\times H\times W$ score map and returns as output a $4\times H\times W$ tensor `boxes` where `boxes[:,v_,u_]` is a vector $[u_0,v_0,u_1,v_1]$ representing the bounding box in image space corresponding to score value `scores[0,v_,u_]`. Here $(u_0,v_0)$ is the upper-left corner of the box and $(u_1,v_1)$ the lower-right.

The second step is done by rearranging `boxes` as a $N \times 4$ tensor representing a list of $N=HW$ boxes, one per row, rearranging `scores` as a vector of $N$ elements, finding the maximum element of the latter, and using the corresponding index to pick a box from `boxes`.

> **Question:** Inspect the code below. Why is `cell_size` involved in the calculations performed by `boxes_for_scores`?

> **Task.** Run the following code to find the maximally-scored box and plot a corresponding bounding box.

In [None]:
# Get the box coordinates for each score map location as a 4 x H x W tensor, where
# scores[0] is a 1 x H x W tensor.
boxes = lab.boxes_for_scores(model, scores[0])

# Convert `boxes` into a N x 4 list.
boxes = boxes.reshape(4,-1).permute(1,0)

# Do the same for `scores` and pick the maximum.
best, best_index = torch.max(scores[0].reshape(-1), 0)
box = boxes[best_index]

# Plot the results.
plt.figure(1, figsize=(10,10))
lab.imsc(im[0])
lab.plot_box(box)
plt.title(f'Detected object (score: {best:.3f})');

> **Question:** Use the example code to plot the image and overlay the bounding box of the detected object. Did it work as expected?

> **[Optional] How the score map is converted into boxes.** The code of `lab.boxes_for_scores` can be inspected [here](lab.py).
> The maximum is found in two steps, by first maximizing scores along dimension 2 (height) and then dimension 1 (width), obtaining indices `u` and `v`.
>
> `u` and `v` are in units of HOG cells. We convert these into pixel coordinates by multiplying by the HOG `cell_size`, which is also the downsampling factor applied by the `HOGNet` feature extractor.
>
> The size of the model template in number of HOG cells can be found as the size of the kernel in `model` (note that the kernel is square in this case, so PyTorch only saves one dimension). This is then converted to pixels by multiplying by `cell_size` as before.
>
> In this manner, we can deduce the coordinates of the upper-left and bottom-right corners of the object bounding box and plot it.
>
> **Note:** the bounding box encloses exactly all the pixels of the HOG template. In MATLAB, pixel centers have integer coordinates and pixel borders are at a distance $\pm1/2$.
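
The conversion from a score-map index to a pixel box can be sketched in a few lines of plain Python. This is a simplified illustration, not the code of `lab.boxes_for_scores`: the helper names are hypothetical, the template is assumed to be square, and any half-pixel offset used by the real function is ignored.

```python
def box_for_cell(u, v, template_size, cell_size=8):
    """Map a score-map index (u, v) to [x_min, y_min, x_max, y_max] in pixels.

    Assumes the upper-left template cell sits at HOG cell (u, v) and that
    the template spans template_size x template_size cells.
    """
    x0 = u * cell_size
    y0 = v * cell_size
    side = template_size * cell_size
    return [x0, y0, x0 + side, y0 + side]


def argmax_2d(score_map):
    """Return (best_score, v, u) for a score map given as a list of rows."""
    return max((s, v, u)
               for v, row in enumerate(score_map)
               for u, s in enumerate(row))


# Made-up 2 x 3 score map; the peak is at row v=1, column u=1.
toy_scores = [[0.1, 0.3, 0.2],
              [0.0, 0.9, 0.4]]
best, v, u = argmax_2d(toy_scores)
print(best, v, u)                              # 0.9 1 1
print(box_for_cell(u, v, template_size=8))     # [8, 8, 72, 72]
```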


## Part 2: Multiple scales and learning with an SVM

In this second part, we will: (i) extend the detector to search objects at multiple scales and (ii) learn a better model using a support vector machine. Let's start by loading a subset of the data targeting a particular class of road signs.

> **Task:** Run the following code to reload the data structure keeping only the "mandatory" signs. The "mandatory" class is simply the union of all mandatory traffic signs.

In [None]:
imdb = lab.load_data(meta_class='mandatory')

### Step 2.1: Multi-scale detection

First, we demonstrate that our vanilla detector lacks scale invariance: if the image is presented at different resolutions, then different results are obtained.

> **Task:** Run the following code to perform *single-scale detection* on three different versions of the same image, differing by size.

In [None]:
from PIL import Image

for t, s in enumerate([1.5, 1, .6]):
    # Load a target image in PIL format.
    image = Image.open('data/mandatory.jpg')
    
    # Change the resolution of the image.
    image = image.resize([int(s*x) for x in image.size], Image.LANCZOS)

    # Run the detector as shown above.
    model = nn.Conv2d(27, 1, w.shape[2:], bias=False)
    model.weight.data = w
    hog = hog_extractor(lab.pil_to_torch(image))
    scores = model(hog)
    boxes = lab.boxes_for_scores(model, scores[0])
    boxes = boxes.reshape(4,-1).permute(1,0)
    best, best_index = torch.max(scores[0].reshape(-1), 0)
    box = boxes[best_index]

    # Plot the results for this resolution.
    plt.figure(t, figsize=(10,10))
    plt.imshow(image)
    lab.plot_box(box)
    plt.title(f'Detected object (score: {best:.3f})') ;

> **Question.** Does the result look right?

This is a practical issue, as objects appear in images at sizes different from that of the learned template. In order to find objects of all sizes, we scale the image up and down and search for the object over and over again.

We solve this issue by running **detection at multiple scales**. The set of searched scales $s_0,s_1,\dots$ is defined logarithmically as $s_i = 2^{\frac{i}{S} - o}$ where $S$ is the number of octave subdivisions and $o$ the minimum octave.

> **Task**: Run the following code to define the set of searched scales.

In [None]:
import numpy as np

# Scale space configuration
min_octave = -1
max_octave = 3
num_octave_subdivisions = 3
scales = 2**np.linspace(min_octave, max_octave, num_octave_subdivisions * (max_octave - min_octave + 1))

# Print the scale values
print('Scales: ', ', '.join([f"{x:.2f}" for x in scales]))
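
For reference, the same list of scales can be reproduced without NumPy. The sketch below mirrors the behaviour of `np.linspace`, which spaces the exponents evenly between `min_octave` and `max_octave` with both endpoints included. Note that, with these settings, consecutive exponents differ by $4/14 \approx 0.286$, slightly less than one third of an octave. The variable names ending in `_toy` are illustrative only.

```python
min_octave, max_octave, num_octave_subdivisions = -1, 3, 3

# Number of scales, as in the cell above: 3 * (3 - (-1) + 1) = 15.
num_scales = num_octave_subdivisions * (max_octave - min_octave + 1)

# Evenly spaced exponents with both endpoints included.
step = (max_octave - min_octave) / (num_scales - 1)
exponents_toy = [min_octave + i * step for i in range(num_scales)]
scales_toy = [2 ** e for e in exponents_toy]

print(num_scales)                 # 15
print(f"{step:.3f}")              # 0.286
print(', '.join(f"{s:.2f}" for s in scales_toy))
```

The smallest scale is $2^{-1} = 0.5$ and the largest is $2^{3} = 8$.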

The lab code provides a member function `HOGNet.detect_at_multiple_scales` that implements multi-scale detection using these scales.

> **Question:** [Open](lab.py) and study the `HOGNet.detect_at_multiple_scales` function. Convince yourself that it is the same code as before, but operated after rescaling the image a number of times.

Next, we use `hog_extractor.detect_at_multiple_scales` to perform multi-scale detection. The code below applies it to the three image sizes tested above.

> **Task:** Run the following code to perform *multi-scale detection* on three different versions of the same image, differing by size.

In [None]:
for t, s in enumerate([1.5, 1, .6]):
    # Load a target image in PIL format.
    image = Image.open('data/mandatory.jpg')
    
    # Change the resolution of the image.
    image = image.resize([int(s*x) for x in image.size], Image.LANCZOS)

    # Get all boxes and scores
    boxes, scores, _ = hog_extractor.detect_at_multiple_scales(w, scales, image)

    # Pick the best box
    best, best_index = torch.max(scores, 0)
    box = boxes[best_index]

    # Plot the results for this resolution.
    plt.figure(t, figsize=(10,10))
    plt.imshow(image)
    lab.plot_box(box)
    plt.title(f'Detected object (score: {best:.3f})') ;

> **Question:** Compare this result with the one obtained above. What is the difference?

### Step 2.2: Collect positive and negative training data

The model learned so far is too weak to work well. It is now time to use an SVM to learn a better one. In order to do so, we need to prepare suitable data. We already have positive examples (features extracted from object patches).

In order to collect negative examples (features extracted from non-object patches), we loop through a number of training images and sample patches uniformly. For speed, we only scan a subset of the images.

> **Task:** Run the following code to extract positive and negative HOG vectors.

> **Question:** How many negative examples are we collecting?

In [None]:
# Collect positive training data
pos = hog_train

# Collect negative training data
neg = []
num_training_images = 5
for image_path in imdb['train']['images'][:num_training_images]:
    print(f"Processing image {image_path}")
    image = Image.open(image_path)    
    hog = hog_extractor(lab.pil_to_torch(image))    
    patches = nn.Unfold(w.shape[2:], stride=5)(hog)
    n = patches.shape[1]
    patches = patches[:, :, :-1:n//100].permute(0,2,1)
    patches = patches.reshape(patches.shape[1], *w.shape[1:])
    neg.append(patches)
    
neg = torch.cat(neg, 0)
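
To get a feeling for the numbers involved, the following sketch counts sliding-window positions in plain Python. It assumes a hypothetical $100 \times 170$ HOG map (roughly an $800 \times 1360$ pixel image with `cell_size` 8) and an $8 \times 8$ template. The window count uses the standard formula for an unfold operation without padding. The subsampling step mirrors the slice `[:-1:n//100]` in the cell above, where `n = patches.shape[1]` is the flattened patch dimension $27 \cdot 8 \cdot 8$ of the unfolded tensor.

```python
def num_windows(size, kernel, stride):
    """Number of sliding-window positions along one axis (no padding)."""
    return (size - kernel) // stride + 1


# Hypothetical HOG map size and template size, in cells.
hog_h, hog_w = 100, 170
kernel, stride = 8, 5

rows = num_windows(hog_h, kernel, stride)
cols = num_windows(hog_w, kernel, stride)
total = rows * cols
print(rows, cols, total)          # 19 33 627

# Subsampling as in the slice [:-1:n//100] applied to the window axis.
n = 27 * kernel * kernel          # flattened patch dimension, 1728
slice_step = n // 100             # 17
kept = len(range(0, total - 1, slice_step))
print(kept)                       # 37 windows kept for this image
```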

### Step 2.3: Learn a model with an SVM

Now that we have the data, we can learn an SVM model. To this end we will use the `lab.svm_sdca()` function. This function requires the data to be in a $N \times D$ matrix, where $D$ is the feature dimension and $N$ the number of training points. We also need a vector of binary labels, +1 for positive points and -1 for negative ones:

> **Task:** Run the following code to prepare the data for the SVM.

In [None]:
# Pack the data into a matrix with one datum per row
x = torch.cat((pos, neg), 0).reshape(len(pos) + len(neg), -1)

# Obtain the corresponding label vector
y = torch.cat((torch.ones(len(pos)), -torch.ones(len(neg))), 0)

print(f"The shape of x is {list(x.shape)}")
print(f"The shape of y is {list(y.shape)}")

Finally, we need to set the parameter $\lambda$ of the SVM solver. For reasons that will become clearer later, we use instead the equivalent $C$ parameter. Learning the SVM is then a one-liner.
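
The relation between the two parameters can be derived as follows, assuming that `lab.svm_sdca` minimizes the standard regularized hinge-loss objective (an assumption, since the solver code is not reproduced here) and ignoring the bias term. Multiplying the $\lambda$-form of the objective by $1/\lambda$ does not change its minimizer and gives the $C$-form:

$$
\min_{\bw}\ \frac{\lambda}{2}\|\bw\|^2 + \frac{1}{N}\sum_{i=1}^N \max\{0,\, 1 - y_i \langle \bw, \bx_i \rangle\}
\quad\Longleftrightarrow\quad
\min_{\bw}\ \frac{1}{2}\|\bw\|^2 + C\sum_{i=1}^N \max\{0,\, 1 - y_i \langle \bw, \bx_i \rangle\},
\qquad
\lambda = \frac{1}{CN}.
$$

Here $N$ is the total number of training points, positive and negative, so that $\lambda$ shrinks as more examples are added while $C$ stays fixed.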

> **Task.** Run the following code to train the SVM.

In [None]:
# Set the SVM parameter.
C = 10
lam = 1 / (C * (len(pos) + len(neg)))

# Run the SVM.
w, b = lab.svm_sdca(x, y, lam=lam)
w = w.reshape(1, *pos[0].shape)

# Plot the learned model
wim = hog_extractor.to_image(w)
plt.figure(figsize=(6, 6))
lab.imsc(wim[0]) ;

> **Question:** Visualize the learned model `w` using the supplied code. Does it differ from the naive model learned before? How?

### Step 2.4: Evaluate the learned model

> **Task:** Use the `detect_at_multiple_scales` function seen above to evaluate the new SVM-based model.

In [None]:
for t, s in enumerate([1, .75, .5]):
    # Load a target image in PIL format.
    image = Image.open('data/mandatory.jpg')
    
    # Change the resolution of the image.
    image = image.resize([int(s*x) for x in image.size], Image.LANCZOS)

    # Get all boxes and scores
    boxes, scores, _ = hog_extractor.detect_at_multiple_scales(w, scales, image)

    # Pick the best box
    best, best_index = torch.max(scores, 0)
    box = boxes[best_index]

    # Plot the results.
    plt.figure(t, figsize=(10,10))
    plt.imshow(image)
    lab.plot_box(box)
    plt.title(f'Detected object (score: {best:.3f})') ;

> **Question:** Does the learned model perform better than the naive average?

> **Task:** Try different images. Does this detector work all the time? If not, what types of mistakes do you see? Are these mistakes reasonable?

<a id="part3"></a>
## Part 3: Multiple objects and evaluation

### Step 3.1: Multiple detections

Detecting at multiple scales is insufficient: we must also allow for more than one object occurrence in the image. In order to do so, rather than looking for the best box in the lot, we select several using *non-maximum suppression* (NMS).

The algorithm is simple: start from the highest-scoring detection, then remove any other detection whose overlap with it is greater than a threshold, and repeat with the highest-scoring detection that remains. The overlap metric, which is also used later to compare a candidate detection to a ground truth bounding box, is defined as the *ratio of the area of the intersection over the area of the union* of the two bounding boxes:
$$
\operatorname{overlap}(A,B) = \frac{|A\cap B|}{|A \cup B|}.
$$

This algorithm is implemented by the function `nms()` below, which returns a boolean vector `retain` of detections to preserve.
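
As an illustration, here is a minimal sketch of the overlap measure and of greedy NMS in plain Python. It is not the code of `lab.nms`: the function names, the list-based box format $[x_\text{min},y_\text{min},x_\text{max},y_\text{max}]$, and the threshold value of 0.25 are assumptions made for this example.

```python
def box_area(b):
    """Area of a box [x_min, y_min, x_max, y_max]; zero if degenerate."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def overlap(a, b):
    """Intersection-over-union of two boxes."""
    inter = box_area([max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])])
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, threshold=0.25):
    """Return a boolean list marking the detections to retain."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    retain = [False] * len(boxes)
    kept = []
    for i in order:
        # Keep box i only if it does not overlap a better, already kept box.
        if all(overlap(boxes[i], boxes[j]) <= threshold for j in kept):
            kept.append(i)
            retain[i] = True
    return retain


toy_boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
toy_scores = [0.9, 0.8, 0.7]
print(greedy_nms(toy_boxes, toy_scores))  # [True, False, True]
```

The second box overlaps the first by $81/119 \approx 0.68$ and is therefore suppressed, whereas the third box is disjoint from the others and survives.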

> **Tasks:**
> 1. Study the `lab.nms()` [function](lab.py) and make sure you understand how it works.
> 2. Study the `lab.topn()` [function](lab.py) and make sure you understand how it works.
> 3. Run the following code to obtain the top non-maxima suppressed boxes.

> **Question:** After non-maxima suppression we still get a significant number of boxes. Why do we want to return so many responses? In practice, it is unlikely that more than a handful of object occurrences may be contained in any given image...

In [None]:
# Load a target image in PIL format.
image = Image.open('data/signs-sample-image.jpg')

# Get all boxes and scores.
boxes, scores, _ = hog_extractor.detect_at_multiple_scales(w, scales, image)

# Get the top boxes.
boxes, scores, _ = lab.topn(boxes, scores, 100)

# Apply non-maxima suppression.
boxes, scores, _ = lab.nms(boxes, scores)

# Plot the results.
plt.figure(1, figsize=(10,10))
plt.imshow(image)
for box in boxes:
    lab.plot_box(box)
plt.title(f'Showing {len(boxes)} boxes') ;

### Step 3.2: Detector evaluation

We are now going to look at properly evaluating our detector. We use the [PASCAL VOC criterion](http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2012/devkit_doc.pdf), computing *Average Precision (AP)*. Consider a test image containing a number of ground truth object occurrences $(g_1,\dots,g_m)$ and a list $(b_1,s_1),\dots,(b_n,s_n)$ of candidate detections $b_i$ with score $s_i$. The following algorithm converts this data into a list of labels and scores $(s_i,y_i)$ that can be used to compute a precision-recall curve, for example using the `lab.pr()` function. The algorithm, implemented by `lab.eval_detections()`, is as follows:

0. Prerequisite: Assume that the candidate detections $(b_i,s_i)$ are sorted by decreasing score $s_i$.
1. Assign each candidate detection $(b_i,s_i)$ a true or false label $y_i \in \{+1,-1\}$. To do so, for each candidate detection in order:
    1. If there is a matching ground truth detection $g_j$ ($\operatorname{overlap}(b_i,g_j)$ larger than 50%), the candidate detection is considered positive ($y_i=+1$). Furthermore, the ground truth detection is *removed from the list* and not considered further.
    2. Otherwise, the candidate detection is negative ($y_i=-1$).
2. Add each ground truth object $g_j$ that is still unassigned to the list of candidates as pair $(g_j, -\infty)$ with label $y_j=+1$.
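
The matching procedure above can be sketched in plain Python as follows. This is an illustration rather than the code of `lab.eval_detections`: the function names are hypothetical, and when several ground truth boxes match a detection the sketch picks the one with the largest overlap, which is an assumption. Instead of appending the unassigned ground truth objects with score $-\infty$, the sketch simply returns their count.

```python
def overlap(a, b):
    """Intersection-over-union of two boxes [x_min, y_min, x_max, y_max]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def label_detections(gt_boxes, det_boxes, det_scores, min_overlap=0.5):
    """Return (labels, misses): +1/-1 per detection and unmatched gt count."""
    order = sorted(range(len(det_boxes)),
                   key=lambda i: det_scores[i], reverse=True)
    unmatched = list(range(len(gt_boxes)))
    labels = [-1] * len(det_boxes)
    for i in order:
        best_j, best_ov = None, min_overlap
        for j in unmatched:
            ov = overlap(det_boxes[i], gt_boxes[j])
            if ov > best_ov:
                best_j, best_ov = j, ov
        if best_j is not None:
            labels[i] = +1
            unmatched.remove(best_j)   # each gt box can be matched only once
    return labels, len(unmatched)


gt = [[0, 0, 10, 10], [50, 50, 60, 60]]
dets = [[0, 0, 10, 10], [1, 0, 11, 10], [100, 100, 110, 110]]
det_scores = [0.9, 0.8, 0.1]
print(label_detections(gt, dets, det_scores))  # ([1, -1, -1], 1)
```

In this toy example the second detection also covers the first object, but that ground truth box has already been claimed by a higher-scoring detection, so the duplicate counts as a false positive. The second object is never detected and is reported as a miss.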

> **Questions:**
> * Why are ground truth detections removed after being matched?
> * What happens if an object is detected twice?
> * Can you explain why unassigned ground-truth objects are added to the list of candidates with $-\infty$ score?

In order to apply this algorithm, we first need to find the ground truth bounding boxes in the test image:

In [None]:
# Pick a validation image.
image = imdb['val']['images'][1]
pil_image = Image.open(image)

# Pick all the gt boxes in the selected image.
sel = [i for i, box_image in enumerate(imdb['val']['box_images']) if box_image == image]
gt_boxes = imdb['val']['boxes'][sel]

# Plot the images and its boxes.
plt.figure(1, figsize=(8,8))
plt.imshow(pil_image)
for box in gt_boxes:
    lab.plot_box(box, color='y')
plt.title(f'Found {len(gt_boxes)} boxes in image {image}.') ;

Next, we run the detector code as before to extract the signs.

> **Task:** Use the code below to evaluate the detector on one image. Look carefully at the output and convince yourself that it makes sense.

In [None]:
# Run the detector
boxes, scores, _ = hog_extractor.detect_at_multiple_scales(w, scales, pil_image)
boxes, scores, _ = lab.topn(boxes, scores, 100)
boxes, scores, _ = lab.nms(boxes, scores)

Next, we can evaluate the detections by matching them to the ground truth.

> **Task:** Use the code below to compare the detections to the ground truth.

In [None]:
# Evaluate the detector and plot the results.
plt.figure(1, figsize=(12,12))
plt.imshow(pil_image)
results = lab.eval_detections(gt_boxes, boxes, plot=True)
plt.title('yellow: gt, red: false detections, green: true detections') ;

Finally, we are ready to plot the PR curve and compute AP.

> **Task:** Use the code below to compute and plot the PR curve.

> **Question:** There are a large number of errors in each image. Should you worry? In what manner is the PR curve affected? How would you eliminate the vast majority of those in practice?

In [None]:
plt.figure(figsize=(5,5))
_, _, ap = lab.pr(results['labels'], scores, misses=results['misses'])
plt.title(f'Average precision: {ap*100:.2f}%') ;
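
For intuition, the sketch below computes a precision-recall curve and one common definition of average precision (the mean of the precision values measured at the rank of each true positive, with missed objects contributing zero). It is written in plain Python and is not the code of `lab.pr`, which may use a different AP variant. The function name `pr_curve` and the toy numbers are assumptions.

```python
def pr_curve(labels, scores, misses=0):
    """Return (precision, recall, ap) for detections labeled +1 or -1.

    `misses` is the number of ground truth objects that were never detected.
    """
    ranked = [y for _, y in sorted(zip(scores, labels),
                                   key=lambda p: p[0], reverse=True)]
    num_pos = sum(1 for y in ranked if y > 0) + misses
    precision, recall, tp = [], [], 0
    for rank, y in enumerate(ranked, start=1):
        if y > 0:
            tp += 1
        precision.append(tp / rank)
        recall.append(tp / num_pos if num_pos else 0.0)
    hits = [p for p, y in zip(precision, ranked) if y > 0]
    ap = sum(hits) / num_pos if num_pos else 0.0
    return precision, recall, ap


# Four detections, two of them correct, plus one object that was missed.
toy_labels = [+1, -1, +1, -1]
toy_scores = [0.9, 0.8, 0.7, 0.6]
precision, recall, ap = pr_curve(toy_labels, toy_scores, misses=1)
print([round(p, 3) for p in precision])  # [1.0, 0.5, 0.667, 0.5]
print([round(r, 3) for r in recall])     # [0.333, 0.333, 0.667, 0.667]
print(round(ap, 4))                      # 0.5556
```

Because of the missed object, recall never reaches 1 in this example, which caps the achievable AP.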

### Step 3.3: Evaluation on multiple images

Evaluation is typically done on multiple images rather than just one. This is implemented by the `lab.evaluate_model` function.

In [None]:
# Evaluate the model on the first 20 validation images.
results = lab.evaluate_model(imdb, hog_extractor, w, scales=scales, subset=('val',0,20))

# Plot the resulting AP.
plt.figure(1, figsize=(5,5))
_, _, ap = lab.pr(results['labels'], results['scores'], misses=results['misses'])
plt.title(f'Average precision: {ap*100:.2f}%') ;

> **Task:** [Open](lab.py) `lab.evaluate_model` and make sure you understand the main steps of the evaluation procedure.

## Part 4: Hard negative mining

This part explores more advanced learning methods. So far, the SVM has been learned using a small and randomly sampled number of negative examples. However, in principle, every single patch that does not contain the object can be considered as a negative sample. These are of course too many to be used in practice; unfortunately, random sampling is ineffective as the most interesting (confusing) negative samples are a very small and special subset of all the possible ones.

*Hard negative mining* is a simple technique that allows finding a small set of key negative examples. The idea is simple: we start by initializing the model randomly, and then we alternate between evaluating the model on the training data to find erroneous responses and adding the corresponding examples to the training set.

### Step 4.1: Train with hard negative mining

The function `collect_hard_negatives` is called by the previously-defined function `evaluate_model` after the detector is run on each image to extract the HOG descriptors of the hardest negative patches. These patches are the ones that are highly scored by the model *and* incorrect according to the evaluation procedure. Recall that the result of the evaluation is a label which is +1 for good detections and -1 for incorrect ones. Collectively, these labels are stored in a tensor `labels` of $N$ elements for a $N\times 4$ tensor `boxes` of detections.

`boxes` is expressed in pixels, whereas we need to obtain the corresponding HOG descriptors. In order to do so, the coordinates of each box are reverse-mapped to a certain scale pyramid level, and then the corresponding HOG patch is extracted.
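
The selection rule itself is simple and can be sketched in plain Python. The snippet only illustrates which detections count as hard negatives. It does not reproduce `collect_hard_negatives`, which additionally maps boxes back to HOG patches. The function name and the cap `max_count` are hypothetical.

```python
def select_hard_negatives(scores, labels, max_count=2):
    """Indices of the highest-scoring detections labeled as incorrect (-1)."""
    wrong = [i for i, y in enumerate(labels) if y < 0]
    wrong.sort(key=lambda i: scores[i], reverse=True)
    return wrong[:max_count]


# Made-up detector output: one correct detection and four false alarms.
toy_scores = [2.1, 1.7, 0.3, -0.5, 1.9]
toy_labels = [+1, -1, -1, -1, -1]
print(select_hard_negatives(toy_scores, toy_labels))  # [4, 1]
```

The false alarms with scores 1.9 and 1.7 are the most confusing for the current model, so they are the ones worth adding to the negative set.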

> **Task:** [Open](lab.py) the `collect_hard_negatives` function, inspect the code, and make sure you understand how it works.

Next, we repeat SVM training, as seen above, a number of times, progressively increasing the size of the `neg` array containing the negative samples. This is updated using the output of `evaluate_model`, which in turn calls `collect_hard_negatives` discussed above.

In [None]:
batch_size = 8

# Collect positive training data.
pos = hog_train

def train_model(pos, imdb):
    # Initialize the model randomly.
    w = torch.randn((1, *pos[0].shape), dtype=torch.float32)

    n = len(imdb['train']['images'])
    for t in range(0, n, batch_size):
        tp = min(t + batch_size, n)
        print(f'Iteration {t}: running the model on images {t}-{tp-1}') ;

        # Evaluate the model on the training data.
        results = lab.evaluate_model(imdb, hog_extractor, w, scales=scales, subset=('train',t,tp), collect_negatives=True)

        # Collect more negative training patches.
        if t == 0:
            neg = torch.cat(results['negatives'], 0)
        else:
            neg = torch.cat((neg, *results['negatives']), 0)
        print(f"Iteration {t}: collected {len(pos)} positive and {len(neg)} negative patches")

        # Learn the SVM.
        C = 10
        lam = 1 / (C * (len(pos) + len(neg)))
        x = torch.cat((pos, neg), 0).reshape(len(pos) + len(neg), -1)
        y = torch.cat((torch.ones(len(pos)), -torch.ones(len(neg))), 0)

        plt.figure(3*t+1, figsize=(5, 5))
        w, b = lab.svm_sdca(x, y, lam=lam)
        w = w.reshape(1, *pos[0].shape)

        # Plot the learned model.
        wim = hog_extractor.to_image(w)
        plt.figure(3*t+2, figsize=(5, 5))
        lab.imsc(wim[0])
    return w

w = train_model(pos, imdb)

### Step 4.2: Evaluate the model trained using hard negative mining

> **Task:** Use the following code to evaluate AP on the first 20 validation images.

In [None]:
# Evaluate the model on the first 20 validation images.
results = lab.evaluate_model(imdb, hog_extractor, w, scales=scales, subset=('val',0,20))

plt.figure(1, figsize=(6,6))
_, _, ap = lab.pr(results['labels'], results['scores'], misses=results['misses'])
plt.title(f'Average precision: {ap*100:.2f}%') ;

> **Task:** Use the following code to run the model on a validation image and visualize the results.

In [None]:
# Pick a validation image.
image = imdb['val']['images'][8]
pil_image = Image.open(image)

# Run the detector.
boxes, scores, hogs = hog_extractor.detect_at_multiple_scales(w, scales, pil_image)
boxes, scores, _ = lab.topn(boxes, scores, 100)
boxes, scores, _ = lab.nms(boxes, scores)

# Evaluate the detector and plot the results.
sel = [i for i, box_image in enumerate(imdb['val']['box_images']) if box_image == image]
gt_boxes = imdb['val']['boxes'][sel]
plt.figure(1, figsize=(15,15))
plt.imshow(pil_image)
lab.eval_detections(gt_boxes, boxes, plot=True)
plt.title('yellow: gt, green: true detections, red: false detections') ;

## Part 5: Train your own object detector

**Skip on fast track**

In this last part, you will learn your own object detector. To this end, open and look at `exercise5.m`. You will need to prepare the following data:

### Step 5.1: Preparing the training data

* A folder `data/pos` containing files `image1.jpg`, `image2.jpg`, ..., each containing a single cropped occurrence of the target object. These crops can be of any size, but should be roughly square.
* A folder `data/neg` containing images `image1.jpg`, `image2.jpg`, ..., that *do not* contain the target object at all.
* A test image `data/test.jpg` containing the target object. This should not be one of the training images.

> **Task:** Understand the limitations of this simple detector and choose a target object that has a good chance of being learnable. 

**Hint:** Note in particular that object instances must be similar and roughly aligned. If your object is not symmetric, consider choosing instances that face a particular direction (e.g. left-facing horse head).

### Step 5.2: Learn the model

> **Task:** Use the code below to learn the model.

In [None]:
def load_custom_data(hog_extractor):
    from glob import glob

    # Load the positive crops and resize them to 64 x 64 pixels.
    pos = []
    for f in glob('data/pos/*.jpg'):
        pil_image = Image.open(f)
        scaled_image = pil_image.resize([64, 64])
        pos.append(lab.pil_to_torch(scaled_image))
    pos = hog_extractor(torch.cat(pos, 0))

    # The training set contains only object-free images, hence no boxes.
    imdb = {
        'images': glob('data/neg/*.jpg'),
        'boxes' : torch.Tensor(0,4),
        'box_images' : [],
        'box_labels' : torch.Tensor(0,1),
        'box_patches' : []
    }

    return {'train': imdb}, pos

imdb, pos = load_custom_data(hog_extractor)
w = train_model(pos, imdb)

### Step 5.3: Test the model

Use the code supplied in `example5.m` to evaluate the SVM model on a test image and visualize the result as in [Step 2.1](#stage2.1).

> **Task:** Make sure you get sensible results. Go back to step 5.1 if needed and adjust your data.

**Hint:** For debugging purposes, try using one of your training images as test. Does it work at least in this case?

### Step 5.4: Detecting symmetric objects with multiple aspects

The basic detectors you have learned so far are *not* invariant to effects such as object deformations, out-of-plane rotations, and partial occlusions that affect most natural objects. Handling these effects requires additional sophistication, including deformable templates and mixtures of multiple templates.

In particular, many objects in nature are symmetric and, as such, their images appear flipped when the objects are seen from the left or the right direction (consider for example a face). This can be handled by a pair of symmetric HOG templates. In this part we will explore this option.

> **Task:** Using the procedure above, train a HOG template `w` for a symmetric object facing in one specific direction. For example, train a left-facing horse head detector.

> **Task:** Collect test images containing the object facing in both directions. Run your detector and convince yourself that it works well only for the direction it was trained for.

HOG features have a well defined structure that makes it possible to predict how the features transform when the underlying image is flipped. The transformation is in fact a simple *permutation* of the HOG elements. For a given spatial cell, HOG has 31 dimensions. The following code permutes the dimension to flip the cell around the vertical axis:

    perm = vl_hog('permutation') ;
    hog_flipped = hog(perm) ;

Note that this permutation applies to a *single* HOG cell. However, the template is a $H \times W \times 31$ dimensional array of HOG cells.

> **Task:** Given a `hog` array of dimension $H \times W \times 31$, write MATLAB code to obtain the flipped feature array `hog_flipped`.

**Hint:** Recall that the first dimension spans the vertical axis, the second dimension the horizontal axis, and the third dimension feature channels. `perm` should be applied to the last dimension. Do you need to permute anything else?
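
The structure of the operation can be illustrated with a toy example in plain Python, using nested lists in $H \times W \times C$ layout. Everything here is hypothetical: the three-channel cells and the permutation `toy_perm` are made up for illustration and are not the actual HOG permutation, which should be obtained from the HOG implementation in use.

```python
def flip_features(features, perm):
    """Mirror an H x W x C nested-list feature array left to right.

    The cells of each row are visited in reverse order (spatial flip) and the
    channels inside each cell are reordered by `perm` (orientation flip).
    """
    return [[[cell[p] for p in perm] for cell in reversed(row)]
            for row in features]


# Hypothetical permutation for 3 channels: channel 0 is unchanged by
# mirroring, while channels 1 and 2 swap roles.
toy_perm = [0, 2, 1]

# A 1 x 2 x 3 toy feature array: one row with two cells of three channels.
toy_features = [[[1, 2, 3], [4, 5, 6]]]

flipped = flip_features(toy_features, toy_perm)
print(flipped)                                   # [[[4, 6, 5], [1, 3, 2]]]
print(flip_features(flipped, toy_perm) == toy_features)  # True
```

Flipping twice returns the original array in this example because the toy permutation is its own inverse, which is a useful check to run on any real implementation as well.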

Now let us apply flipping to the model trained earlier:

> **Task:** Let `w` be the model you trained before. Use the procedure to flip HOG to generate `w_flipped`. Then visualize both `w` and `w_flipped` as done in [Step 1.3](#part1.3). Convince yourself that flipping was successful.

We have now two models, `w` and `w_flipped`, one for each view of the object.

> **Task:** Run both models in turn on the same image, obtaining two lists of bounding boxes. Find a way to merge the two lists and visualise the top detections. Convince yourself that you can now detect objects facing either way.

**Hint:** Recall how redundant detections can be removed using non-maximum suppression.

**Congratulations: This concludes the practical!**

## History

* Used in the Oxford AIMS CDT, 2014-19