## 7. Networks for computer vision

In [12]:
"""
    Initialization
"""


import torch
from torch import Tensor
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, utils
from torch.autograd import Variable
import torchvision
import PIL

### 1. Task and data-set
- Tasks:
    - Classification
    - Object detection
    - Semantic or instance segmentation
    - Other (tracking in videos, camera pose estimation, body pose estimation, 3d reconstruction, denoising, super-resolution, auto-captioning, synthesis,...)
- Small scale classification data-sets:
    - MNIST and Fashion-MNIST: 10 classes, 50000 train iamges, 1000 test images, 28 $\times$ 28 grayscale
    - CIFAR10 (10 classes) and CIFAR100 (5 $\times$ 20 super classes): 50000 train images, 10000 test images, 32 $\times$ 32 RGB
    - PASCAL VOC 2012: 20 classes, 11530 training + validation images
    - ImageNet (image-net.org): 14197122 images
    - ImageNet Large Scale Visual Recognition Challenge 2012: 1000 classes, 1200000 training images, 50000 validation images
    - Cityscapes (cityscapes-dataset.com): 30 classes, 5000 images with fine annotations, 20000 images with coarse annotations

### 2. Task and performance measure
- Image classification:
    1. Task: predicting its class
    2. Performance measure:
        - The **error rate** $P(\,f(X) \neq y)$ or **accuracy** $P(\,f(X) = y)$
        - The **balanced error rate** (BER) $\frac{1}{C} \sum^{C}_{y = 1} P(\,f(X) \neq y\,|\,Y = y)$
        - In two-class case, define **True Positive** and **False Positive** (idea algorithms have TP $\approx$ 1 and FP $\approx$ 0) $\rightarrow$ Area under **TP** vs **FP** curve is **Receiver operating characteristic** (ROC)
            <img width=40% src="images/7-1.png">
        - Another curve is **Precision** vs **Recall** $\rightarrow$ Area under **Precision** vs **Recall** curve is **Average precicion**
            > True Positive: $P(\,f(X) = 1\,|\,Y = 1)$  
            > False Positive: $P(\,f(X) = 1\,|\,Y = 1)$  
            > Presision: $P(\,f(X) = 1\,|\,Y = 1)$  
            > True Positive: $P(\,f(X) = 1\,|\,Y = 1)$  

- Object detection:
    1. Task: predicting classes and locations of targets in an image, output is a series of bounding boxes, each with a class label
    2. Performance measure: consider predicted bounding box $B'$ and annotated bounding box $B$, we always want the **Intersetion over Union (IoU)** is large enough
        <img width=40% src="images/7-2.png">
- Image segmentation:
    1. Task: labeling individual pixels with the class of the object or the instance it belongs to
    2. Performance measure: (classification) **Segmentation accuracy (SA)**
        $$SA = \frac{n}{n + e}$$
    > $n$: number of pixels of the right class predicted  
    > $e$: number of pixels erroneously labeled

### 3. Image classification, standard convnets
- The most standard networks for image classification:
    1. LeNet
        - LeNet5: 10 classes, input $1 \times 28 \times 28$
        <img width=60% src="images/7-3.png">

    2. AlexNet
        - 1000 classes, input $3 \times 224 \times 224$
        - Use **Data augmentation** during training to reduce over-fitting
        <img width=60% src="images/7-4.png">

    3. VGGNet
        - 1000 classes, input $3 \times 224 \times 224$
        - 16 convolutional layers + 3 fully connected layers
        <img width=60% src="images/7-5.png">
        <img width=60% src="images/7-6.png">


In [4]:
"""
    Example of Image classification
"""

def image_squaring(raw_image):
    raw_image_size = raw_image.size
    if raw_image_size[0] == raw_image_size[1]:
        return raw_image
    new_image_size = (max(raw_image_size), max(raw_image_size))
    new_image = PIL.Image.new("RGB", new_image_size)
    new_image.paste(raw_image, (int((new_image_size[0] - raw_image_size[0])/2),
                                  int((new_image_size[1] - raw_image_size[1])/2)))
    return new_image

def image_resizing(raw_image, size):
    raw_image.thumbnail(size, PIL.Image.ANTIALIAS)
    return raw_image

def image_preprocessing(raw_image, size):
    return image_resizing(image_squaring(raw_image), size)

# Load and nomalize the image
raw_image = PIL.Image.open('data/images/image4.jpg')
img = torchvision.transforms.ToTensor()(image_preprocessing(raw_image, (244, 244)))
img = img.view(1, img.size(0), img.size(1), img.size(2))
img = 0.5 + 0.5 + (img - img.mean()) / img.std()
print(img.size())

# Load an already trained network and compute its prediction
alexnet = torchvision.models.alexnet(pretrained = True)
alexnet.eval()

output = alexnet(Variable(img))

# Print the classes
scores, indexes = output.data.view(-1).sort(descending = True)
class_names = eval(open('model/imagenet1000_clsid_to_human.txt', 'r').read())

for k in range(15):
    print('#{:d} ({:.02f}) {:s}'.format(k, scores[k], class_names[int(indexes[k])]))

torch.Size([1, 3, 244, 244])
#0 (17.98) trolleybus, trolley coach, trackless trolley
#1 (16.24) minibus
#2 (15.64) passenger car, coach, carriage
#3 (14.38) fire engine, fire truck
#4 (14.30) streetcar, tram, tramcar, trolley, trolley car
#5 (13.73) electric locomotive
#6 (12.39) recreational vehicle, RV, R.V.
#7 (11.99) harvester, reaper
#8 (10.84) trailer truck, tractor trailer, trucking rig, rig, articulated lorry, semi
#9 (10.83) minivan
#10 (10.29) moving van
#11 (10.15) tow truck, tow car, wrecker
#12 (9.96) amphibian, amphibious vehicle
#13 (9.95) ambulance
#14 (9.77) school bus


### 4. Fully convolutional networks
- Transform a series of layers from a standard convnets to fully convolutional (convolutionize) 

|||
|---|---|
|<img src="images/7-7.png">|<img src="images/7-8.png">|

- Pratical consequence:
    - Re-use classification networks for **dense prediction** without re-training
    - Blur the conceptual boundary between "features" and "classifier"

### 5. Image classification
#### a. Network in network
- Re-interpret a convolution filter as a one-layer perceptron and extened it with an "MLP convolution" to omprove the capacity vs parameter ratio
    <img width=60% src="images/7-9.png">
    <img width=80% src="images/7-10.png">
- "Auxiliary classifiers" help the propagation of the gradient in the early layers $\rightarrow$ increase performance by the idea that early layers already encode informative and invariant features $\rightarrow$ GoogLeNet has 12 times less parameter than AlexNet but more accurate in ILSVRC14

#### b. Residual networks
<img width=40% src="images/7-11.png">

#### c. Summary
- Standard ones are **extensions of LeNet5**
- Everybody loves **ReLU**
- State-of-the-art networks have **100s of channels** and **10s of layers**
- Networks sould be **fully convolutional**
- **Pass-through connections** allow deeper "residual" nets
- **Bottelneck local structures** and **Aggregated pathways** reduce the number of parameters

<img width=80% src="images/7-12.png">

### 6. Object detection
- Simplest strategy is to classify local regions, at multiple scales and locations (kinda brute-force @\_@!) $\rightarrow$ cost increases with prediction accuracy
- The above strategy is mitigated by Sermanet by adding a regression part to predict object's bounding box
<img width=30% src="images/7-13.png">
- Example of bounding boxes produced by the regression network mentioned above
<img width=60% src="images/7-14.png">
    $\rightarrow$ Combining multiple boxes is done with an *ad hoc* greedy algorithm
- AlexNet approach: relying on **region proposals**
    - Generate thousands of proposal bounding boxes with a non-CNN "objectness" approach
    - Feed to an AlexNet-like network sub-images cropped and warp from the input image to detect
    $\rightarrow$ Suffer from the cost of the region proposal computation, which is non-convolutional and non-GPUified
- Ren with "Faster R-CNN" improve AlexNet by replacing the region proposal algorithm with a convolutional processing similar to Overfeat
- Most famous algorithm is **"You Only Look Once"** (YOLO, Redmon). Mechanism:
    <img width=80% src="images/7-15.png">
    <img width=60% src="images/7-16.png">

### 7. Semantic segmentation
<img width=60% src="images/7-17.png">
- Historical approach: define a measure of similarity between pixels, and to cluster gourps of similar pixels (poorly performance)
- Deep-learning approach: re-casts semantic segmentation as pixel classification, and re-uses networks trained for image classification by making them fully convolutional
    <img width=60% src="images/7-18.png">

### 8. `torch.utils.data.DataLoader`
- Large sets do not fit in memory, and samples have to be constanly loaded during training $\rightarrow$ `torch.utils.data.DataLoader`

In [14]:
"""
    Example of 'torch.utils.data.DataLoader'
"""
train_transforms = transforms.Compose(
        [
            transforms.RandomCrop(28, padding = 3),
            transforms.ToTensor(),
            transforms.Normalize(mean = (33.32, ), std = (78.56, ))
        ]
)

train_loader = DataLoader(
    datasets.MNIST(root = './data', train = True, download = True,
                  transform = train_transforms),
    batch_size = 100,
    num_workders = 4,
    shuffle = True,
    pin_memory = torch.cuda.is_available()
)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!


TypeError: __init__() got an unexpected keyword argument 'num_workders'

- Given this 'train_loader', we can now rewrite our training procedure with a loop over the mini-batches
    - Before:

    ```python
    if torch.cuda.is_available():
        input, target = input.cuda(), target.cuda()

    input, target = Varible(input), Variable(target)

    for e in range(nb_epochs):
            output = model(input)
            loss = criterion(output, target)
            model.zero_grad()
            loss.backward()
            optimizer.step()
    ```
    ---  

    - After:

    ```python
    for e in range(nb_epochs):
        for input, target in iter(train_loader):
            if torch.cuda.is_available():
                input, target = input.cuda(), target.cuda()

            input, target = Varible(input), Variable(target)

            output = model(input)
            loss = criterion(output, target)
            model.zero_grad()
            loss.backward()
            optimizer.step()
    ```