# Automatic Image Processing

Classic approach
- Aim? => Image enhancement
- How? => by using various filters 

ML-based approaches
- Aim? => Image classification and recognition 
- How?
    * Feature extraction & ML algorithms
        - Features: Haar, HOG, SIFT, SURF, LBP
        - ML algorithms: kNN, SVM, Decision trees and AdaBoost
    * Feature learning & ML algorithms
        - Use ANN for both feature extraction and ML



Various recognition tasks:
- image classification <img src="images/imgClassification.png" alt="classification" width="300"/>

- object detection <img src="images/others.png" alt="detection" width="400"/>


## Image classification  <img src="images/imgClassif.png" alt="classification" width="400"/>

Input
- A set of labeled images (for training)
- A set of n images (for testing, without labels)

Output
- A label for each input image

Evaluation 
- Datasets with an image classification task
    * MNIST
    * CIFAR
    * Pascal VOC http://host.robots.ox.ac.uk/pascal/VOC/
        - 2005 – image classification task (4 classes, 1578 images, 2209 objects)
        - 2006 – image classification task (10 classes, 2618 images, 4754 objects)
        - …
        - 2012 – image classification task (20 classes, 11 530 images, 27 450 objects, 6929 segmentations)

    * ImageNet http://www.image-net.org/
        - 2010 – image classification task only (1000 classes; ImageNet as a whole lists 14,197,122 images)
        - 2011, … - other tasks (localisation, segmentation, detection)

Metrics 
- Accuracy
- Precision
- Recall
- AUC
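The first three metrics can be computed directly from the predicted and true labels. The sketch below is a minimal illustration for a binary problem; the function name `classification_metrics` and the toy label lists are assumptions made for this example, not part of any particular library.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision and recall, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Toy labels (illustrative only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

For a multi-class task the same counts are usually computed per class and then averaged.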

How?
- Features + ML algorithm
    * Features: histograms, HOG, Bag of words, …
    * ML algorithm: Decision trees, SVM, ANN
- ML algorithm (that processes the raw images)
    * kNN
    * ANN 



### Image classification = ML algorithm

A single ANN extracts the features and solves the classification problem

Common architectures:
- Classical CNNs
    * LeNet (1998)
        - proposed by Yann LeCun, 1998 - [paper](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf)
        - architecture: 2 x (a conv layer + a pool layer), followed by FC layers
    * AlexNet (2012) – first Deep CNN
        - proposed by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton, 2012 [paper](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)
        - architecture:  More conv layers + more pool layers
    * ZfNet (2013)
        - proposed by Matthew Zeiler and Rob Fergus, 2013 [paper](https://arxiv.org/pdf/1311.2901.pdf)
        - architecture: AlexNet + optimisation of hyper-parameters
- Modern CNNs
    * VGG (2014)
        - proposed by Karen Simonyan and Andrew Zisserman, 2014 [paper](https://arxiv.org/pdf/1409.1556.pdf)
        - architecture: 16 Conv/FC layers (FC -> a lot more memory; they can be eliminated)

    * NiN (2014)
    * GoogLeNet (2014)
        - proposed by Christian Szegedy et al., 2014 [paper](https://arxiv.org/pdf/1409.4842.pdf)
        - architecture:
            * Inception Module that dramatically reduced the number of parameters in the network (AlexNet 60M, GoogLeNet 4M) [details](https://arxiv.org/pdf/1602.07261.pdf)
            * uses Average Pooling instead of Fully Connected layers at the top of the ConvNet => eliminating parameters
    * MobileNet (2017)
    * ResNet (2015)
        - proposed by Kaiming He et al., 2015 [paper](https://arxiv.org/pdf/1512.03385.pdf)
        - architecture:
            * skip connections 
            * batch normalization









# LeNet (1998)

Parent
- Yann LeCun (NYU), MNIST data
- LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. “Gradient-Based Learning Applied to Document Recognition.” In Proceedings of the IEEE, 2278–2324 http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf

Flow:
- Input -> 2 x \[Conv -> Pool\] -> FC -> FC -> softmax -> Output (10 classes)

 <img src="images/leNet.png" alt="classification" width="600"/>


Input:
- Grayscale image 28 x 28

Activation:
- Tanh

    * 0 centered: on avg., values -> 0 => derivative -> 1

- Sigmoid

    * 0.5 centered: on avg., values -> 0.5 => derivative -> ~0.25 < 1

<img src="images/activations.png" alt="classification" width="800"/>

- Issues :
    
    * vanishing gradient problem (VGP = the gradient of the activation becomes negligible)
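The derivative values quoted above, and the way they compound during backpropagation, can be checked numerically. This is a rough sketch: the depth of 10 layers and the saturated input value 3.0 are illustrative assumptions, and the chain ignores the weight terms that also enter the real gradient.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2


# Slope at the centre of each activation
tanh_slope = d_tanh(0.0)      # 1.0
sigm_slope = d_sigmoid(0.0)   # 0.25

# Backpropagation multiplies one local slope per layer
depth = 10                            # assumed depth, for illustration
tanh_chain = tanh_slope ** depth      # stays at 1.0 while the units are centred
sigm_chain = sigm_slope ** depth      # 0.25 ** 10, already below 1e-6
saturated_chain = d_tanh(3.0) ** depth  # tanh far from 0: the gradient collapses

print(tanh_chain, sigm_chain, saturated_chain)
```

Even tanh suffers from the problem once its inputs move away from 0, which is why the issue is listed for both activations.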


Filters (#filters(size, padding, stride))
- 6(5 x 5, 2, 1), 16(5 x 5, 0, 1)

Pooling
- Avg-pooling 2x2, stride 2
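With the filter and pooling settings listed above, the spatial size of the feature maps can be traced layer by layer using the usual formula `(n + 2 * padding - kernel) // stride + 1`. The helper name `out_size` and the layer labels are assumptions made for this sketch.

```python
def out_size(n, kernel, padding, stride):
    """Spatial output size of a conv or pool layer applied to an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1


# (name, kernel, padding, stride), following the settings listed above
layers = [
    ("conv1", 5, 2, 1),
    ("pool1", 2, 0, 2),
    ("conv2", 5, 0, 1),
    ("pool2", 2, 0, 2),
]

size = 28          # grayscale 28 x 28 input
sizes = [size]
for name, kernel, padding, stride in layers:
    size = out_size(size, kernel, padding, stride)
    sizes.append(size)
    print(f"{name}: {size} x {size}")

# sizes == [28, 28, 14, 10, 5]
```

The last pooling layer therefore hands 16 feature maps of 5 x 5 to the first FC layer.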

Loss
- Softmax (cross-entropy)

#parameters
- 60 000
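The figure of roughly 60 000 parameters can be reproduced by counting weights and biases per layer. The notes do not list the FC widths; the sketch below assumes the commonly used LeNet-5 widths 120, 84 and 10, and a 16 x 5 x 5 input to the first FC layer.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus one bias per output channel."""
    return kernel * kernel * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


counts = {
    "conv1": conv_params(5, 1, 6),        # 6 filters of 5 x 5 on 1 channel
    "conv2": conv_params(5, 6, 16),       # 16 filters of 5 x 5 on 6 channels
    "fc1": fc_params(16 * 5 * 5, 120),    # assumed width 120
    "fc2": fc_params(120, 84),            # assumed width 84
    "out": fc_params(84, 10),             # 10 classes
}
total = sum(counts.values())
print(counts, total)   # total = 61706
```

Almost 80% of the parameters sit in the first FC layer, a pattern that reappears in AlexNet and VGG.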


# AlexNet (2012)

Parents
- Alex Krizhevsky et al., ImageNet data
- Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks.” In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, 1097–1105. NIPS’12. Lake Tahoe, Nevada [link](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)
- Winner of ILSVRC 2012 (5-6 days for training - GTX 580 GPUs)

Flow:
- Input -> 2 x \[Conv -> Pool -> Norm\] -> 3 x Conv -> Pool -> 3 x \[FC -> DropOut\] -> Softmax -> Output (1000 classes)

<img src="images/alexNet.png" alt="classification" width="800"/>

Input
- RGB image 224 x 224 x 3

Activation
- ReLU, sigmoid
- Conv -> ReLU
    * Advantages
        - Feature sparsity
        - Reducing VGP
    * Drawbacks
        - Dying ReLU problem: output(node) < 0 => derivative  = 0 => weights are not changed / trained
        - Conv layers are more affected by VGP
- FC -> tanh
    * FC layers are less affected by VGP

<img src="images/activationsReLu.png" alt="classification" width="800"/>

Filters (#filters(size, padding, stride))
- 96(11 x 11, 0, 4), 256(5 x 5, 2, 1), 3 x [384(3 x 3, 1, 1)]

Pooling
- Max-pooling 3x3, stride 2

Normalisation layers
- normalize the activations of each node by subtracting the mean and dividing by the standard deviation, both estimated from the statistics of the current minibatch
- batch normalisation (BN) is typically applied after the convolution and before the nonlinear activation function
- applied on each channel / feature map 
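The normalisation step described above can be sketched for the activations of one node (or one channel) over a minibatch. The learnable scale `gamma`, shift `beta` and the small constant `eps` are assumptions added here because they appear in most batch-normalisation formulations; the notes only mention the mean and standard deviation.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one node's activations over a minibatch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


# Activations of a single node for a minibatch of 4 samples (toy values)
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)

new_mean = sum(normalized) / len(normalized)
new_var = sum((v - new_mean) ** 2 for v in normalized) / len(normalized)
print(normalized, new_mean, new_var)
```

After the transformation the minibatch has mean close to 0 and variance close to 1, whatever the original scale of the activations.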

<img src="images/batchNorm.png" alt="classification" width="400"/>


Dropout layers
- Help in removing complex co-adaptations (reducing overfitting)
    * #training samples > 10 * # parameters 
- net is more robust to noise 

Loss
- Cross-entropy loss
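As a small illustration of the loss, the sketch below applies softmax to a vector of class scores and takes the negative log-probability of the correct class. The function names and the toy scores are assumptions for this example.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])


probs = softmax([2.0, 1.0, 0.1])
loss_confident = cross_entropy([10.0, 0.0, 0.0], target=0)  # confident and right: near 0
loss_wrong = cross_entropy([10.0, 0.0, 0.0], target=1)      # confident and wrong: about 10
loss_uniform = cross_entropy([0.0] * 1000, target=7)        # no information: log(1000)
print(probs, loss_confident, loss_wrong, loss_uniform)
```

An untrained 1000-class network is therefore expected to start near a loss of log(1000), about 6.9, which is a handy sanity check.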

#parameters
- 60 000 000



# ZfNet (2013)

Parents
- Rob Fergus, Matthew D. Zeiler (NYU) -> ClarifAI, CIFAR-10
- Zeiler, Matthew D., and Rob Fergus. 2013. “Visualizing and Understanding Convolutional Networks.” CoRR [paper](http://arxiv.org/abs/1311.2901)
- Winner of ILSVRC 2013

Flow 
- AlexNet improved based on visualisation of the feature maps

<img src="images/zfNet.png" alt="classification" width="800"/>


Input
- RGB image 224 x 224 x 3

Activation
- ReLU, tanh

Filters (#filters(size, padding, stride))
- 96(7 x 7, 0, 2), 256(5 x 5, 2, 1), 512(3 x 3, 1, 1), 1024(3 x 3, 1, 1), 512(3 x 3, 1, 1)

Pooling
- Max-pooling 3x3, stride 2

Loss
- Cross-entropy 

#parameters
- TBA


# VGG (2014)

Parents
- Simonyan and Zisserman (Visual Geometry Group, Oxford)
- Simonyan, Karen, and Andrew Zisserman. 2014. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” [paper](http://arxiv.org/abs/1409.1556)
- Deeper is better!
- Winner of ILSVRC 2014 (localisation), 2nd place (classification)

Flow 
- Input -> 2 x \[Conv -> Pool\] -> 3 x \[Conv -> Conv -> Pool\] -> FC -> FC -> FC -> Softmax -> Output (1000 classes) => VGG-11
- Input -> 2 x \[Conv -> Conv -> Pool\] -> 3 x \[Conv -> Conv -> Conv -> Pool\] -> FC -> FC -> FC -> Softmax -> Output  => VGG-16
- Input -> 2 x \[Conv -> Conv -> Pool\] -> 3 x \[Conv -> Conv -> Conv -> Conv -> Pool\] -> FC -> FC -> FC -> Softmax -> Output => VGG-19

<img src="images/vgg.png" alt="classification" width="800"/>

Input
- RGB image 224 x 224 x 3

Activation
- ReLU, tanh

Blocks = patterns of layers (Conv + ReLU + Pool)
- Stacked smaller filters -> complex features learnt at a lower cost
- \[Conv64(3 x 3, 1, 1) -> ReLU -> Max-pooling 2x2, stride 2\]
- \[Conv128(3 x 3, 1, 1) -> ReLU -> Max-pooling 2x2, stride 2\]
- \[2 x Conv256(3 x 3, 1, 1) -> ReLU -> Max-pooling 2x2, stride 2\]
- \[2 x Conv512(3 x 3, 1, 1) -> ReLU -> Max-pooling 2x2, stride 2\]
- \[2 x Conv512(3 x 3, 1, 1) -> ReLU -> Max-pooling 2x2, stride 2\]

Filters 
- Stacks of smaller filters
    * 3 Conv(3x3,0,1) <=> 1 Conv(7 x 7, 0, 1)
    * Deeper => more non-linearities 
    * Fewer parameters ($3 \cdot 3^2 \cdot \#Channels^2 = 27 \cdot \#Channels^2$ vs. $1 \cdot 7^2 \cdot \#Channels^2 = 49 \cdot \#Channels^2$)
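The equivalence between three stacked 3 x 3 filters and one 7 x 7 filter, and the saving in weights, can be verified with a short calculation. The sketch assumes stride 1, the same number of input and output channels in every layer, and ignores biases; the channel count 256 is an arbitrary illustrative value.

```python
def receptive_field(num_layers, kernel):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(num_layers, kernel, channels):
    """Weights of `num_layers` conv layers with `channels` inputs and outputs each."""
    return num_layers * kernel * kernel * channels * channels


channels = 256                         # illustrative value
rf_stack = receptive_field(3, 3)       # three 3 x 3 layers -> 7
rf_single = receptive_field(1, 7)      # one 7 x 7 layer    -> 7
w_stack = conv_weights(3, 3, channels)     # 27 * channels ** 2
w_single = conv_weights(1, 7, channels)    # 49 * channels ** 2
print(rf_stack, rf_single, w_stack, w_single)
```

Both options see the same 7 x 7 window, but the stack uses 27/49 of the weights and applies three non-linearities instead of one.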

Loss
- Cross-entropy loss

#parameters
- 138 000 000 (VGG16) – a large part of them are used in the final 3 Fully Connected layers
- Most of them disappear when the first FC layer is replaced by AvgPooling (as in InceptionNet, ResNet)


# GoogLeNet / InceptionNet (2014)

Parents
- Szegedy, Christian, et al. 2014. “Going Deeper with Convolutions.” [CoRR abs/1409.4842](https://arxiv.org/pdf/1409.4842.pdf)
- Inception-V4: Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. 2016. “Inception-V4, Inception-Resnet and the Impact of Residual Connections on Learning.” [CoRR abs/1602.07261](http://arxiv.org/abs/1602.07261)
- Top performance in ILSVRC 2014 (winner of classification)
- Name inspired by the movie Inception (“We Need To Go Deeper”)

Flow 
- Input -> Conv -> Pool -> Conv -> Conv -> Pool -> 2 x InceptionBlock -> Pool -> 5 x InceptionBlock -> Pool -> 2 x InceptionBlock -> GlobalPool -> FC -> Output (1000 classes) 

<img src="images/inception.png" alt="classification" width="800"/>


Particularities
- FC layers are replaced by AvgPooling => reducing #parameters
- Repeated use of the inception block => multiple filters of different sizes.
- Naïve Inception block/module <img src="images/naiveInceptionModule.png" alt="classification" width="400"/>
    * More Conv Ops:
        - [1x1 conv, 128] 28x28x128x1x1x256
        - [3x3 conv, 192] 28x28x192x3x3x256
        - [5x5 conv, 96] 28x28x96x5x5x256
    * Total: 854M ops, 592 K param

- Reduced Inception module <img src="images/1to1conv.png" alt="classification" width="400"/>
    * Use 1x1 conv to reduce feature depth
        - [1x1 conv, 64] 28x28x64x1x1x256
        - [1x1 conv, 64] 28x28x64x1x1x256
        - [1x1 conv, 128] 28x28x128x1x1x256
        - [3x3 conv, 192] 28x28x192x3x3x64
        - [5x5 conv, 96] 28x28x96x5x5x64
        - [1x1 conv, 64] 28x28x64x1x1x256
    * Total: ≈271M ops (sum of the six products above), 376 K param

- InceptionNet = stack of Inception modules with dimension reduction <img src="images/inceptionModule.png" alt="classification" width="400"/>
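The operation counts listed for the naive module are products of the form height x width x #filters x kernel x kernel x input depth, so they can be re-derived in a few lines. The sketch below recomputes the naive total and then shows, for the 3 x 3 branch only, what the 1 x 1 bottleneck saves. The helper name `conv_ops` is an assumption made for this example.

```python
def conv_ops(height, width, c_out, kernel, c_in):
    """Multiplications needed by one conv layer on a height x width feature map."""
    return height * width * c_out * kernel * kernel * c_in


H = W = 28
C_IN = 256   # depth of the feature map entering the module

# Naive module: every filter works on the full 256-channel input
naive = {
    "1x1, 128": conv_ops(H, W, 128, 1, C_IN),
    "3x3, 192": conv_ops(H, W, 192, 3, C_IN),
    "5x5, 96": conv_ops(H, W, 96, 5, C_IN),
}
naive_total = sum(naive.values())   # 854,196,224, i.e. about 854M

# 3x3 branch with a 1x1 bottleneck that first reduces the depth to 64
bottleneck_3x3 = conv_ops(H, W, 64, 1, C_IN) + conv_ops(H, W, 192, 3, 64)

print(naive_total, naive["3x3, 192"], bottleneck_3x3)
```

On this branch the bottleneck cuts the cost from about 347M to about 100M multiplications, even though it adds an extra layer.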


<img src="images/inceptionNet.png" alt="classification" width="800"/>

Input
- RGB image 224 x 224 x 3

Activation
- ReLU, tanh

Loss
- Softmax (cross-entropy)

#parameters
- 5 000 000 parameters (12x less than AlexNet)

Versions (v1, v2, v3, v4)
- More templates, but the same 3 main properties are kept:
- Multiple branches
- Shortcuts (1x1, concate.)
- Bottleneck

<img src="images/inceptionVersions.png" alt="classification" width="800"/>


# ResNet (2015) - Residual Neural Networks
Parents
- He et al. (Microsoft)
- He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015a. “Deep Residual Learning for Image Recognition.” [CoRR paper](http://arxiv.org/abs/1512.03385)
- Winner ILSVRC 2015

Flow 
- 152 layers (following VGG design)

<img src="images/resnet.png" alt="classification" width="800"/>


Particularities
- Batch normalisation after each convolution and before activation 
- Ultra deep networks with residual connections
    * Why?
        - error information propagating back gets more and more diffused by the time it reaches the initial layers => the weights in the initial few layers are not updated optimally => higher errors for deeper nets <img src="images/residual.png" alt="classification" width="800"/>

    * How?
        - Skip connections <img src="images/skipConnections.png" alt="classification" width="800"/>

- A new layer = residual block (skip / shortcut connections) 
    * <img src="images/residualBlock.png" alt="classification" width="400"/>
    * 2 x (3 x 3 conv -> batch norm -> ReLU)
    * Periodically, double #filters and downsample spatially using stride 2 (/2 in each dimension)
    * ResNet ~ ensemble of shallow networks (paths of different lengths)
    * Help gradients to penetrate deeper into the network 

- Batch normalisation after each convolution and before activation <img src="images/batchNorm.png" alt="classification" width="400"/>

    * <img src="images/batchNormalisation.png" alt="classification" width="800"/>

    * Solves the problem of internal covariate shift (the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change)
    * Stabilises and accelerates the convergence
    * Group normalisation 
    * No dropout layers 


- Large learning rate (initial) 0.1
    * Divided by 10 when the validation error plateaus 

- Additional conv layer at the beginning

- No FC layers at the end
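Why skip connections help gradients reach the early layers can be seen in a toy calculation. A residual block computes `F(x) + x`, so its local derivative is `1 + F'(x)` rather than `F'(x)`. The sketch assumes a scalar "layer" with a constant local derivative of 0.1 and a depth of 20; both values are illustrative and not taken from the paper.

```python
def layer(x):
    """Toy layer whose local derivative is 0.1 everywhere (illustrative assumption)."""
    return 0.1 * x


def residual_block(x, f):
    """y = F(x) + x: the identity path carries the input around F."""
    return f(x) + x


depth = 20
local_slope = 0.1

plain_grad = 1.0      # gradient through a plain stack: product of F'(x)
residual_grad = 1.0   # gradient through residual blocks: product of (1 + F'(x))
for _ in range(depth):
    plain_grad *= local_slope
    residual_grad *= 1.0 + local_slope

print(residual_block(2.0, layer), plain_grad, residual_grad)
```

In the plain stack the gradient shrinks to about 1e-20 after 20 layers, while the identity path of the residual blocks keeps it of order 1.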


Input
- RGB image 224 x 224 x 3

Activation
- ReLU, tanh

Loss
- Softmax (cross-entropy)

#parameters
- TBA

# ResNeXt (2017)

Hybridisation of Inception and ResNet [paper](https://arxiv.org/pdf/1611.05431.pdf)

<img src="images/resnext.png" alt="classification" width="800"/>

Particularities
- Shortcut
- Bottleneck <img src="images/uniformMultibranch.png" alt="classification" width="800"/>
- Multi-branch 
    * concatenation and addition are interchangeable => a general property of deep CNNs
    * uniform multi-branching can be done by group-conv

<img src="images/uniformMultibranchGrouped.png" alt="classification" width="800"/>



# Other architectures

- Inception-ResNet (2016)
- DenseNet
    * Remember Taylor expansions for functions

<img src="images/denseNet.png" alt="classification" width="800"/>

- Xception, MobileNet (Google)
    * Depthwise convolutions (grouped conv with #groups = #channels)

- others 


# Review

- Complexity

- Forward pass time and power consumption 

<img src="images/review.png" alt="classification" width="800"/>




# Reducing overfitting

Possible solutions: 
- increasing the amount of training data
    * Artificially expanding the training data
    * e.g. rotations, adding noise
- reduce the size of the network
    * Not recommended 
- regularization techniques 
    * effects:
        - the network prefers to learn small weights, all other things being equal. Large weights will only be allowed if they considerably improve the first part of the cost function
        - a way of compromising between finding small weights and minimizing the original cost function (when λ is small we prefer to minimize the original cost function, but when λ is large we prefer small weights)

        - Give importance to all features

        > $X = [1,1,1,1]$
        
        > $W_1 = [1, 0, 0, 0]$
        
        > $W_2 = [0.25, 0.25, 0.25, 0.25]$
        
        > 
        
        > $W_1^T X  = W_2^T X = 1$
        
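        > $L_1(W_1) = 1 + 0 + 0 + 0 = 1$
        
        > $L_1(W_2) = 0.25 + 0.25 + 0.25 + 0.25 = 1$
        
        > $L_2(W_1) = 1^2 + 0^2 + 0^2 + 0^2 = 1$
        
        > $L_2(W_2) = 4 \cdot 0.25^2 = 0.25$ => the L2 penalty prefers the spread-out weights $W_2$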

    * Methods 
        - L1 regularisation – add the sum of the absolute values of the weights $C = C_0 +  \frac{\lambda}{n} \sum{|w|}$
            * the weights shrink by a constant amount toward 0
            * sparsity (feature selection – more weights are 0)
        - weight decay (L2 regularization) - add an extra term to the cost function, the sum of the squares of all the weights in the network scaled by $\frac{\lambda}{2n}$: $C = C_0 +  \frac{\lambda}{2n} \sum{w^2}$ 
            * the weights shrink by an amount which is proportional to w
        - elastic net regularisation
            * $\lambda_1 |w| + \lambda_2 w^2$
        - Max norm constraints (clipping)
        - Dropout -  modify the network itself (see [link](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf))
            * Some neurons are temporarily deleted
            * propagate the input and backpropagate the result through the modified network
            * update the appropriate weights and biases. 
            * repeat the process, first restoring the dropout neurons, then choosing a new random subset of hidden neurons to delete
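The dropout steps above can be sketched for a single hidden layer. Note one deliberate difference from the description: the sketch uses "inverted" dropout, which rescales the surviving activations by `1 / (1 - p_drop)` during training so that nothing has to change at test time. That rescaling, the function name and the drop probability 0.5 are assumptions made for this example.

```python
import random


def dropout(activations, p_drop, rng):
    """Zero each activation with probability p_drop and rescale the survivors."""
    keep = 1.0 - p_drop
    mask = [1.0 if rng.random() < keep else 0.0 for _ in activations]
    dropped = [a * m / keep for a, m in zip(activations, mask)]
    return dropped, mask


rng = random.Random(0)           # fixed seed so the run is repeatable
hidden = [1.0] * 1000            # toy activations of one hidden layer

# One training step: a random subset of neurons is temporarily deleted
dropped, mask = dropout(hidden, p_drop=0.5, rng=rng)
kept_fraction = sum(mask) / len(mask)

# Next step: all neurons are restored and a new random subset is chosen
dropped_next, mask_next = dropout(hidden, p_drop=0.5, rng=rng)
print(kept_fraction, mask[:10], mask_next[:10])
```

In a real network the same mask is used in the forward and the backward pass of a step, so only the weights of the surviving neurons are updated.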

- initialisation of weights
    * Pitfall: all zero initialization
    * Small random numbers 
    > $W = 0.01 * random(D,H)$
    * Calibrating the variances with $\frac{1}{\sqrt{\#Inputs}}$
    > $w = \frac{random(\#Inputs)}{\sqrt{\#Inputs}}$
    * Sparse initialization
    * Initializing the biases
    * In practice: $w = random(\#Inputs) * \sqrt{\frac{2.0}{\#Inputs}}$
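The "in practice" rule can be sketched as follows, assuming that `random` in the formula denotes samples from a standard normal distribution. The layer sizes (400 inputs, 100 units) and the function name `he_init` are illustrative assumptions.

```python
import math
import random


def he_init(n_inputs, n_outputs, rng):
    """Gaussian weights scaled by sqrt(2 / #inputs), one row per output unit."""
    scale = math.sqrt(2.0 / n_inputs)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_inputs)]
            for _ in range(n_outputs)]


rng = random.Random(0)           # fixed seed so the run is repeatable
n_inputs = 400
weights = he_init(n_inputs, 100, rng)

# Empirical check: the weights should have mean ~0 and variance ~2 / #inputs
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
print(mean, variance, 2.0 / n_inputs)
```

The factor 2 compensates for ReLU zeroing roughly half of the activations, which keeps the variance of the signal from shrinking layer after layer.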


    



Other theoretical details:

* Bengio’s papers: [paper 1](https://arxiv.org/pdf/1206.5533v2.pdf) and [paper 2](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf)

* Snoek’s [paper](http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf)

* Goodfellow's [chapter](https://www.deeplearningbook.org/contents/guidelines.html)

* Deisenroth's [book](https://d1wqtxts1xzle7.cloudfront.net/61538438/mml-book20191217-47161-13am889.pdf?1576576797=&response-content-disposition=inline%3B+filename%3DMATHEMATICS_FOR_MACHINE_LEARNING.pdf&Expires=1604243049&Signature=e2vXIp11Ww6zcLQtOJ2hypxbwrxR9FWfKa1sPHLoXrP86SJsUyvyU~H2bGFV2Y5sjSGl1IrnU5axULlg9LshykFuQ2EQSj1Cizn6vpd0O-Aoe6~0gSb6Pmv4YkcCUAXPDHrktMwF4p5qmw6g0uiLuhuRl1SKqfgO5L2fge6P0UlaSKrbM5QZ6YQDguz4MW2bhRzrGufIwrpycuSr1lTRTcTmhCRieeNrgtFWKcXuKmBtFIrRoiahT0mZXQzoDaaxdj~U6W6BtXCWguvpBhEH1aZJ6RHsrS~MI2S9Pe6UcSAsNyK6I0We-8qJdod9aXfbNW1wnGky5C4ng5GVaCuw2A__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA)

* Hands-On ML [book](https://github.com/ageron/handson-ml)


Implementations:
* [repo](https://github.com/rasbt/deeplearning-models)
* pre-trained models 
