# Image Modeling

<div class="slide-title">

# Image Modeling
    
</div>

Notes: A specific type of neuron was developed for working with image data - a convolutional neuron.

### Warm-up 🌶️
Spend some time looking at this explanation of [Image Kernels](https://setosa.io/ev/image-kernels/) (feel free to play around with all the settings!) and discuss the following questions:

+ What does the number at each pixel location in a greyscale image mean?
+ What is a kernel?
+ How is a kernel applied to an image?
+ How does changing the numbers in the kernel affect the output image?
+ What features of an image can we highlight using a kernel?


## Motivation

### Image Data

<div class="group">
    <div class="text_70">
        
* There is possibly no other data that has increased as much over the last decade - and filled our everyday life - as image data
        
    </div>
    <div class="images_30">

    </div>
</div>

* In a single day on Instagram 1.3 billion photos and videos are uploaded (2023)

* We want to process such data for many reasons
    * Image classification and captioning
    * Toxic content detection
    * Image quality improvements, etc.



Notes: Image data is everywhere and it invites an automatic procedure for analysing this data, hence image modelling.

## Introduction

### What does image data look like?

<div class="group">
    <div class="text_70">
        
* Image data has height, width and depth
    * Height and width are given in pixels
    * Depth is the number of channels (3 for **R**ed**G**reen**B**lue, 1 for **B**lack**W**hite)
        
    </div>
    <div class="images_30">
        <img src="../images/image_modelling/image_example.png">     
    </div>
</div>

Notes: Image data is encoded using 3 channels (we can think of it as a 3 dimensional picture) - Red, Green and Blue. On each layer a pixel defines the strength of the red, green, blue colour. Satellite images can have around 9 layers (infrared etc.)

* Each pixel corresponds to an integer between 0 and 255


* Example: A **1280 x 720** px color image has **1280 x 720 x 3** = **2,764,800** values ≈ **2.76** MB (at one byte per value)
    * That is a lot to digest for an ML model using thousands of images for training and maybe hundreds in a batch
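
The arithmetic above can be checked in a few lines. This is a minimal sketch that assumes one byte per value (8-bit integers); the batch size of 100 is a hypothetical number chosen only for illustration.

```python
# Size of one 1280 x 720 RGB image, assuming 8-bit values (1 byte each).
width, height, channels = 1280, 720, 3

values_per_image = width * height * channels
megabytes_per_image = values_per_image / 1_000_000  # 1 byte per value

# Hypothetical batch size, chosen only for illustration.
batch_size = 100
megabytes_per_batch = megabytes_per_image * batch_size

print(values_per_image)               # 2764800
print(round(megabytes_per_image, 2))  # 2.76
print(round(megabytes_per_batch, 1))  # 276.5
```

Models usually work with 32-bit floats instead of 8-bit integers, which would multiply these numbers by four.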



Notes: The 3 channels and the large amount of information conveyed in each image is different from tabular data we have dealt with before. Image data is unstructured by its nature and that is very much different from structured, tabular data.

### Unstructured data and locality

<div class="group">
    <div class="text_70">  

* In contrast to structured tabular data, images regularly show transformations of their local features
    * Different angles, rotations, flipping, scaling, translations
* Humans recognise the same object naturally despite these transformations
* For an ML model this is a challenge
    * Transformation invariance is desired
        
    </div>
    <div class="images_30">
        <img src="../images/image_modelling/fish.png" width="90%">     
    </div>
</div>

Notes: “Transformation invariance” - seeing the same object from different angles is easy for human beings, but needs to be constructed for Machine Learning. Shearing

### Principle of locality

<div class="group">
    <div class="text_70"> 
        
* Nearby pixels often show the same or a similar color
* As pixels get further apart, this similarity decreases or even breaks down
* Locality describes this phenomenon:
    * Locally pixels are correlated
        
    </div>
    <div class="images_30">
        <img src="../images/image_modelling/img_p11_2.png">     
    </div>
</div>


Notes: Locally pixels are correlated and globally they are not. Shearing

### Structured and unstructured data

<div class="group">
    <div class="text"> 
        
* With structured data we would simply compare two feature vectors
    * Example: Structured truck data
        
        
</div>
    <div class="images">
        <img src="../images/image_modelling/img_p13_3.png">  
        

        
| Height (in m) | Weight (in t) |
|---------------|---------------|
| 2.420 | 11 | 
| 2.849 | 28 |        
| 2.975 | 50 |        

[Source](https://www.thelocal.de/20110217/33185)
        
</div>
</div>

Notes: Comparing 3 observations of the trucks is a specific case of measuring distance in vector space (Height being 1 dimension of a vector and Weight being the other). There are different norms (distance metrics) we could use to compute the distance.

### Structured and unstructured data

<div class="group">
    <div class="text"> 
        
* With structured data we would simply compare two feature vectors
    * Example: Structured truck data
        
Comparing the squared Euclidean distances from the first truck:

$(2.849-2.420)^{2}+(28-11)^{2}=289.18$ \
$(2.975-2.420)^{2}+(50-11)^{2}=1,521.31$
    
        
        
        
</div>
    <div class="images">
        <img src="../images/image_modelling/img_p13_3.png">  
        

        
| Height (in m) | Weight (in t) |
|---------------|---------------|
| 2.420 | 11 | 
| 2.849 | 28 |        
| 2.975 | 50 |        

[Source](https://www.thelocal.de/20110217/33185)
        
</div>
</div>
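
The comparison of the truck feature vectors can be reproduced with a short sketch. The helper name `squared_euclidean` is an assumption made for this example; the numbers come from the table on the slide.

```python
# Truck features as (height in m, weight) vectors, taken from the table.
trucks = [(2.420, 11), (2.849, 28), (2.975, 50)]


def squared_euclidean(a, b):
    """Sum of squared feature differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


d_12 = squared_euclidean(trucks[0], trucks[1])
d_13 = squared_euclidean(trucks[0], trucks[2])

print(round(d_12, 2))  # 289.18
print(round(d_13, 2))  # 1521.31
print(d_12 < d_13)     # True: truck 2 is closer to truck 1 than truck 3 is
```

The weight difference dominates both distances because the two features live on very different scales.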

Notes: In this case we have decided to use the (squared) Euclidean distance, and we can see that the difference between Truck 1 and Truck 2 is smaller than the difference between Truck 1 and Truck 3.

### Structured and unstructured data

<div class="group">
    <div class="text"> 
        
* With structured data we would simply compare two feature vectors
    * Example: Structured truck data

* This does not work for image data when we compare the flattened (unrolled) vectors of all pixels
    * The reason is locality (more generally: transformations)

        
</div>
    <div class="images">
        <img src="../images/image_modelling/img_p13_3.png"> 
        <img src="../images/image_modelling/fishes.png" width="80%">        
        
</div>
</div>
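
Why pixel-wise distances fail can be seen on a toy example. The sketch below uses hypothetical 4 x 4 greyscale images (not the fish pictures from the slide): a vertical bar, the same bar shifted by one pixel, and an empty image.

```python
def flatten(image):
    """Unroll a 2D image (list of rows) into one long pixel vector."""
    return [pixel for row in image for pixel in row]


def squared_distance(a, b):
    """Squared Euclidean distance between two pixel vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy 4 x 4 greyscale images (hypothetical data for illustration).
bar = [[0, 255, 0, 0] for _ in range(4)]          # vertical bar
bar_shifted = [[0, 0, 255, 0] for _ in range(4)]  # same bar, one pixel to the right
blank = [[0, 0, 0, 0] for _ in range(4)]          # no object at all

d_shift = squared_distance(flatten(bar), flatten(bar_shifted))
d_blank = squared_distance(flatten(bar), flatten(blank))

print(d_shift)  # 520200
print(d_blank)  # 260100
```

Measured this way, the shifted bar is further away from the original than an empty image, although a human would call the two bars the same object.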

Notes: If we tried to apply the same procedure to images by converting data into a vector representation we would conclude that the 2 images we see are very different from each other. This kind of distance measurement doesn’t work on images.

### What is ideal feature information?



<div class="group">
    <div class="text_70"> 
        
* The primary information for identifying objects in an image lies in the relations between pixels
    * Pixels distributed in a certain way identify a certain object
    * Colour composition plays a crucial role in certain areas
* But how can we extract features that contain this information from an image?
    * This is a demanding task
* How do we humans identify a fish as a fish?
        
[Source](https://www.sciencedirect.com/science/article/pii/S0031320316303582?via%3Dihub)
        
</div>
    <div class="images_30">
        <img src="../images/image_modelling/img_p16_3.png" width=300> 
        <img src="../images/image_modelling/img_p16_4.png" width=300>
        <img src="../images/image_modelling/pixles.gif">
        
</div>
</div>

Notes: The randomly distributed pixels we see do not contain any meaningful object. The distribution of the pixels of real objects in nature is very different from randomly distributed pixels. What we recognise is a distribution of pixels that we call the shape. In the 2nd image on the right, the ML model gives more weight to the pixels that form the edges, which in turn helps in recognising objects.

### Image feature engineering

<div class="group">
    <div class="text_70"> 
        
* Earlier approaches to image processing relied heavily on feature engineering by hand
    * Using edges to identify shape: edges are sharp changes in color
* Some popular image features developed at that time
    * HOG (Histogram of Oriented Gradients) - looking at changes in pixels
    * SIFT (Scale-Invariant Feature Transform)
    * SURF (Speeded-Up Robust Features)
        
</div>
    <div class="images_30">
        <img src="../images/image_modelling/img_p17_2.png"> 
        
</div>
</div>

Notes: Detecting the shape means focusing on the edges and earlier methods relied on feature engineering by hand to detect the edges. We won’t be focusing on feature engineering by hand because this method proved to be not very successful.

### The ideal feature information

* As an ideal, we want the model itself to find the best features to fulfill its task
* How can a model map the high-dimensional image space to a lower dimensional subspace?
    * In which each image is represented by a feature vector that behaves like structured data
* In the following we try to find a model that can achieve this



Notes: Moving from a high-dimensional to a lower dimensional subspace allows us to separate between different classes.

## Linear models for image modeling


### How to prepare the image data for a linear model?

<div class="group">
    <div class="text"> 
        
* Given an image with height and width:
    * We could flatten each 2-dimensional channel into a 1-dimensional vector
    * Then we simply concatenate the flattened channels
* Given the label to be one of [T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot]:
    * The label is represented by an integer between 0 and 9
        
    </div>
    <div class="images">
        <img src="../images/image_modelling/flattend_2.png"> 
    </div>
</div>
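
The flattening step can be sketched as follows. The function name `flatten_image` and the 2 x 2 toy image are assumptions for illustration; real pipelines would typically use an array library for this.

```python
def flatten_image(image):
    """Flatten a height x width x channels nested list into one vector.

    Each channel is unrolled row by row, then the channels are concatenated.
    """
    height = len(image)
    width = len(image[0])
    channels = len(image[0][0])
    vector = []
    for c in range(channels):
        for i in range(height):
            for j in range(width):
                vector.append(image[i][j][c])
    return vector


# Hypothetical 2 x 2 RGB image: red, green, blue and white pixel.
image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]

vector = flatten_image(image)
print(len(vector))  # 12
print(vector)       # [255, 0, 0, 255, 0, 255, 0, 255, 0, 0, 255, 255]
```

A 28 x 28 greyscale image flattened this way gives a vector with 784 entries, one input feature per pixel.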

Notes: The Fashion-MNIST dataset has long been used as a benchmark for image classification. It contains greyscale 28 x 28 pixel pictures of different clothing items that are labelled with their class.

### Logistic regression

<div class="group">
    <div class="text"> 
        
* Logistic regression works on image data
    * On the Fashion-MNIST dataset it reaches an accuracy of 86%
    * That leaves an error rate of roughly 14%, which could become costly in business

* Can we improve modeling these image data?
        
    </div>
    <div class="images">
        <img src="../images/image_modelling/flattend.png"> 
    </div>
</div>


Notes: Input layer + Output-Layer with 10 neurons, sigmoid activation and cross-entropy loss. Output layer acts as a LogReg per neuron as they have sigmoid activation. For MNIST this error is not too large, but given how constricted and simple this dataset is this level of accuracy is not enough. In case of medical images for example it would be too low

### Drawbacks of linear models

* Remember the TensorFlow playground
    * Linear models perform poorly on non-linear data
    * Complex structures in data are hard to capture for linear boundaries

Notes: Image data is inherently non-linear - edges and borders mark differences between objects, and the colour changes from one object to another are sudden and not linearly predictable.

* We need models capable of complex relationships between pixels
    * They must be able to approximate non-linear functions of high complexity
    * Such models must be able to do the feature engineering themselves


[TensorFlow Playground](https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=spiral&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=10&networkShape=4,1&seed=0.52914&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false)

## Deep Neural Networks for image modeling


### Deep Neural Networks perform better on image data

<div class="group">
    <div class="text"> 
        
* Remember
    * Non-linear activation functions enable networks to model nonlinear functions
    * The hierarchy of neurons across layers makes complex feature extraction possible

    </div>
    <div class="images">
        <img src="../images/image_modelling/dnn.png"> 
    </div>
</div>

Notes: Hierarchy of neurons and the resulting feature extraction enables us to deal with complex data

### Deep Neural Networks perform better on image data
 
<div class="group">
    <div class="text"> 
        
* Remember
    * Non-linear activation functions enable networks to model nonlinear functions
    * The hierarchy of neurons across layers makes complex feature extraction possible

* A Deep Neural Network on Fashion-MNIST reaches an accuracy of 91%
        
    </div>
    <div class="images">
        <img src="../images/image_modelling/dnn.png"> 
    </div>
</div>


Notes: 91% accuracy is an improvement, but considering how simple the Fashion-MNIST dataset is, it is still too low.

### Drawbacks of Deep Neural Networks in image modeling

<div class="group">
    <div class="text_70"> 

* Theoretically we can build a very deep network and approximate any function to classify images (universal approximation theorem), but
    * Very deep networks have **high risk of overfitting**
        
    </div>
    <div class="images_30">
        <img src="../images/image_modelling/dnn_small.png"> 
    </div>
</div>

Notes: Bias reduced, but higher variance; Size of the network can become too large to train - even for a single GPU (memory); Order of pixels is highly relevant for us - randomly shuffling looks like noise to us

### Drawbacks of Deep Neural Networks in image modeling

<div class="group">
    <div class="text_70"> 

* Theoretically we can build a very deep network and approximate any function to classify images (universal approximation theorem), but
    * Very deep networks have **high risk of overfitting**
        
    * Real-world images have way more pixels 
        
    </div>
    <div class="images_30">
        <img src="../images/image_modelling/dnn_small.png"> 
    </div>
</div>

Notes: Bias reduced, but higher variance; Size of the network can become too large to train - even for a single GPU (memory); Order of pixels is highly relevant for us - randomly shuffling looks like noise to us

### Drawbacks of Deep Neural Networks in image modeling

<div class="group">
    <div class="text_70"> 

* Theoretically we can build a very deep network and approximate any function to classify images (universal approximation theorem), but
    * Very deep networks have **high risk of overfitting**
        
    * Real-world images have way more pixels 
        
    * DNNs are not transformation invariant, and they do not exploit the spatial order of the pixels either
        * i.e. the neighbourhood relationship between pixels does not count
        
    </div>
    <div class="images_30">
        <img src="../images/image_modelling/dnn_small.png"> 
    </div>
</div>

Notes: Training a Deep Neural Network involves a lot of randomness, and a fully connected network disregards the order of the pixels. In the case of images the order of pixels matters. Bias reduced, but higher variance; Size of the network can become too large to train - even for a single GPU (memory); Order of pixels is highly relevant for us - randomly shuffling looks like noise to us

### ImageNet Competition


<div class="group">
    <div class="text_70"> 

* ImageNet is the largest image recognition competition
    * ~1.2M images and 1,000 categories
* Until 2011, hand-crafted feature engineering dominated the approaches of the winning teams
    * The Xerox XRCE team won by using Fisher Vectors (multi-dimensional SIFT) and SVMs
* In 2012, AlexNet improved the results significantly and introduced implicit feature extraction
    * The authors used a {convolutional neural network (CNN)}
        
    </div>
    <div class="images_30">
        <img src="../images/image_modelling/cnn.png"> 
    </div>
</div>

Notes: Convolutional layers were a major breakthrough in image classification.

## Convolutional layers

### Introduction

<div class="group">
    <div class="text"> 

* A convolutional neural network relies on two special types of layers:
    * Convolutional layers
    * Pooling layers
* In the following we will consider these layers in detail
        
    </div>
    <div class="images">
        <img src="../images/image_modelling/cnn_detail.png"> 
    </div>
</div>


Notes: AT&T 1995 split into NCR, Bell Labs, AT&T Technologies. Patent (2007) to NCT and Scientists to Bell Labs.

### Convolution as a grouping of information

<div class="group">
    <div class="text"> 

* We found out that the local relationships between pixels in an image contain important information  
    * How can this relationship be captured?
* Convolutional layers use a special mathematical method to capture these relationships: a **{convolution}**
    * Intuitively, they take snapshots from all regions of an image and apply filters to them
        
    </div>
    <div class="images">
        <img src="../images/image_modelling/convolution.png"> 
    </div>
</div>

Notes: The aim is to extract information from local regions. Convolutional layers take snapshots of images. A convolution takes a function and slides it over another function, integrating (infinitely summing) both as it moves along. In this way the form of the function is captured - in the middle image the result peaks at the point where the 2 functions cover each other. In a way it serves as a filter that helps detect a certain shape, something that can be very helpful in an image. Convolution: How is the shape of a function modified by another? Shaded region: Product of f and g as a function of t. Three carts with doses and 5 patients.

### How does a convolution layer work?


<div class="group">
    <div class="text_70"> 
        <img src="../images/image_modelling/convolution_process.png">    
    </div>
    <div class="images_30">
    </div>
</div>

Notes: You slide the Kernel (filter) over the Image and perform cross-correlation as demonstrated in the slide. The result of 1 (in orange) is like a snapshot of the region. Convolution is a historical name that is used, but in reality we use cross-correlation.

### How does a convolution layer work?

<div class="group">
    <div class="text_70"> 
        <img src="../images/image_modelling/convolution_process.png">    
    </div>
    <div class="images_30">
<br>
        

$$ y_{i,j}=\sum_{k}\,\sum_{l}\,w_{k l}\,x_{i+k,j+l}$$
        
</div>
</div>
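
The formula can be translated directly into code. This is a minimal sketch in plain Python (no padding, stride of 1); the function name `cross_correlate` and the toy values are assumptions, and deep learning frameworks use far more efficient implementations.

```python
def cross_correlate(image, kernel):
    """Compute y[i][j] = sum_k sum_l w[k][l] * x[i + k][j + l]."""
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    output = []
    # Slide the kernel over every position where it fully fits on the image.
    for i in range(n_rows - k_rows + 1):
        row = []
        for j in range(n_cols - k_cols + 1):
            total = 0
            for k in range(k_rows):
                for l in range(k_cols):
                    total += kernel[k][l] * image[i + k][j + l]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4 x 4 image and 2 x 2 kernel.
image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0],
    [0, 1],
]

result = cross_correlate(image, kernel)
print(result)  # [[2, 4, 6], [0, 2, 4], [6, 0, 2]]
```

Each output value summarises one 2 x 2 region of the image, which is the "snapshot" idea from the slide.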

Notes: In reality we use cross-correlation rather than convolution. i, j - indices of the image rows and columns; k, l - indices of the kernel rows and columns. The kernel is moved by 1 pixel each time.

### Edge detection

<div class="group">
    <div class="text"> 

* A filter can detect different features in an image
    * Edges
    * Sharpening (edge detection + original image)
    * Blurring
* Here we only look at edge detection to get an intuition for how filtering works

<img src="../images/image_modelling/calculation.png"  width=600> 
    </div>
    <div class="images">
        <img src="../images/image_modelling/fish_calculation.png" width=500 > 
    </div>
</div>
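
Edge detection can be tried out on a tiny example. The kernels below are common Prewitt-style edge filters and are an assumption for this sketch; they are not necessarily the kernels shown in the figure. The toy image has a dark left half and a bright right half.

```python
def cross_correlate(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    return [
        [
            sum(
                kernel[k][l] * image[i + k][j + l]
                for k in range(k_rows)
                for l in range(k_cols)
            )
            for j in range(len(image[0]) - k_cols + 1)
        ]
        for i in range(len(image) - k_rows + 1)
    ]


# Toy image: dark on the left, bright on the right (a vertical edge).
image = [[0, 0, 0, 10, 10, 10] for _ in range(4)]

# Prewitt-style kernels (assumed for illustration).
vertical_edge_kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
horizontal_edge_kernel = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

vertical_edges = cross_correlate(image, vertical_edge_kernel)
horizontal_edges = cross_correlate(image, horizontal_edge_kernel)

print(vertical_edges)    # [[0, 30, 30, 0], [0, 30, 30, 0]]
print(horizontal_edges)  # [[0, 0, 0, 0], [0, 0, 0, 0]]
```

The vertical-edge filter responds strongly where the brightness changes from left to right, while the horizontal-edge filter stays silent because nothing changes from top to bottom.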


Notes: Using different filters allows us to emphasise various types of elements in the image - in the example above the vertical and horizontal features are emphasised. The output of a filter gives a signal where a certain type of feature in the image might be.

### Major idea of a convolutional layer

<div class="group">
    <div class="text"> 

* Filters can be constructed by hand
    * Why not learn the filter values {parameters} during training?
* A convolutional layer contains many filters
    * Each filter forms during training
* Each filter is of the same size and can process groups of nearby pixels
    * It looks at relationships between pixels {locality}
        
    </div>
    <div class="images">
        <img src="../images/image_modelling/convolutional_layer.png" width=550> 
    </div>
</div>



Notes: A convolutional layer learns the filters in order to fulfil the task in the best possible way. The number of filters defines how many channels there are in the feature map.

### Reduction of dimensions

<div class="group">
    <div class="text"> 

* Convolutional layers reduce the dimension of the image
    * A 28 x 28 picture filtered by a 3 x 3 kernel shrinks to a dimension of 26 x 26
* This limits the number of convolutional layers applied to an image
    * At each layer the image shrinks
* Formula for the resulting image dimension:
        
$$m=n-k+1$$
$$28-3+1=26$$
    
</div>
    <div class="images">
        <img src="../images/image_modelling/convolutional_gif.gif">
        
[Source](https://www.freecodecamp.org/news/an-intuitive-guide-to-convolutional-neural-networks-260c2de0a050/#:~:text=Left:%20the%20filter%20slides%20over%20the%20input.%20Right:%20the%20result)
</div>
</div>


Notes: Shrinking the image at each convolutional step means that eventually we will not be able to apply filters anymore. Hence it won’t be possible to build deep convolutional neural networks on an original image and we would need to use “padding”. n - side length (in pixels) of the original image; k - side length of the kernel; m - side length of the new image.

### How can we build deep convolutional networks?

* To avoid shrinking the image too quickly **{padding}** is applied



Notes: “Same” padding means same output size as image size; the pixel is always in the center. “Valid” padding is actually “no padding” - image gets fully covered.

<div class="group">
    <div class="text"> 

* The image is ‘framed’ by zeros
     * A 28 x 28 image with a padding of 1 has dimension 30 x 30
     * With a kernel of 3 x 3 this results again in a 28 x 28 image
* Formula for the resulting image size

$$m=n+2p-k+1$$
$$28+2\cdot1-3+1=28$$
        
</div>
    <div class="images">
        <img src="../images/image_modelling/kernel.png">
    </div>
</div>
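
Zero padding itself is simple to write down. The sketch below uses a hypothetical helper `zero_pad` on nested lists; frameworks offer this as a layer option (usually called "same" padding) rather than as a separate step.

```python
def zero_pad(image, p):
    """Frame a 2D image (list of rows) with p rows/columns of zeros."""
    width = len(image[0])
    border_row = [0] * (width + 2 * p)
    padded = [border_row[:] for _ in range(p)]
    padded += [[0] * p + list(row) + [0] * p for row in image]
    padded += [border_row[:] for _ in range(p)]
    return padded


# A 28 x 28 image filled with ones, padded by 1 pixel on each side.
image = [[1] * 28 for _ in range(28)]
padded = zero_pad(image, p=1)

n = len(padded)
k = 3
print(n, len(padded[0]))  # 30 30
print(n - k + 1)          # 28: a 3 x 3 kernel restores the original size
```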

Notes: n - side length (in pixels) of the original image; k - side length of the kernel; m - side length of the new image; p - number of pixels of padding added on each side. “Same” padding means same output size as image size; the pixel is always in the center. “Valid” padding is actually “no padding” - image gets fully covered.

### Making great strides

<div class="group">
    <div class="text_70"> 

* Stride is the step size of the filter {kernel} by which it moves along the image
* Sometimes a larger stride is desired
    * It reduces large image sizes to optimize memory usage and reduces computational costs
    * It reduces the risk of overfitting
* Formula for the resulting image dimension (stride of 2):

$$m=\big\lfloor{\frac{n+2p-k}{s}}+1\big\rfloor$$
$$m=\big\lfloor{\frac{28+2\cdot0-3}{2}}+1\big\rfloor=13$$

</div>
    <div class="images_30">
        <img src="../images/image_modelling/stride_2.png">
    </div>
</div>
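
The size formulas from the last slides fit into one small helper. The function name `conv_output_size` is an assumption for this sketch; integer division implements the floor in the formula.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output side length m = floor((n + 2p - k) / s) + 1.

    n: input side length, k: kernel side length,
    p: padding per side, s: stride.
    """
    return (n + 2 * p - k) // s + 1


print(conv_output_size(28, 3))        # 26: no padding, stride 1
print(conv_output_size(28, 3, p=1))   # 28: padding of 1 keeps the size
print(conv_output_size(28, 3, s=2))   # 13: stride of 2 roughly halves the size
```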

Notes: Stride size sets by how many pixels we move the filter each time. n - side length (in pixels) of the original image; k - side length of the kernel; m - side length of the new image; p - number of pixels of padding added on each side; s - stride size.

### What happens during training?

* The filters {kernels} of a convolutional layer are variable
    * They get learned during training of the network
* The network thereby learns the filters best suited to fulfil its task

<center>
    <img src="../images/image_modelling/feature_map.png" width=1000>
</center>

Notes: The kernels contain weights that are being trained and form the filters that are meaningful for the task. The size of the filter is a hyperparameter that depends on the specific domain. The stride is usually not learnt, but is set as a hyperparameter that depends on the specific domain. Padding with 0 (and not some other value) ensures that we don’t introduce a signal that was not in the image. Images are typically converted to the same dimension, but a convolutional layer is not restricted by the dimensions, as it can slide over an image of any size (as long as the image is bigger than the filter).

### Stationarity principle

* There are features in an image that are essential and repeat themselves at different locations
* Statistical signals are uniformly distributed across the image:
    * These features therefore have to be detected at every location
* This justifies {parameter sharing} which makes CNNs very parameter-efficient 


Notes: Stationarity means that filters can be reused - i.e. vertical features can be detected by the same filter in different areas of the image. In other words, a convolutional neural network can learn certain parameters in one area of the image and apply them to another area.

### Working through the channels

<div class="group">
    <div class="text"> 

* Color images have three channels
    * Convolutions take place across channels
    * I.e. a filter {kernel} is actually a cube
        
    </div>
    <div class="images">
        <img src="../images/image_modelling/rgb.jpeg">
    </div>
</div>

Notes: The filter is like a cube that is applied to all channels at once and a single feature layer is output

### Working through the channels

<div class="group">
    <div class="text"> 

* Color images have three channels
    * Convolutions take place across channels
    * I.e. a filter {kernel} is actually a cube
* Per convolutional layer there are usually many filters applied
    * A filter per feature
    * I.e. the output is actually again multi-channel
    * Each channel of the output is also called a {feature map}
        
    </div>
    <div class="images">
        <img src="../images/image_modelling/rgb_2.png">
    </div>
</div>

Notes: With 2 filters 2 Feature Maps will be created.

### Parameter sharing is caring

* Remember the large image example (1280px x 720px):
    * 27.6 million parameters for a dense layer with 10 neurons
    * 110 MB in memory (at 4 bytes per parameter)



Notes: Training the network above is possible, but will take a long time and will not result in a well performing network.

* Consider instead a convolutional layer with 64 filters of size 3x3 applied to a colour image
    * 3x3x3x64 + 64= 1792 parameters!
    * 7 KiB in memory
    
* With parameter sharing we can train
    * faster
    * deeper networks

Notes: Because image data is stationary parameter sharing works. Otherwise the number of parameters and the large size of the input would mean that convolutional neural networks would take forever to train and would not work in practice.

## Pooling layers

Notes: Pooling layers are not trained, they just aggregate

### What is pooling?

* Pooling is aggregating features from {feature maps}
    * We pool over a certain window: we apply a summary of nearby pixels
    * The window is then moved across a feature map
* Pooling reduces a feature map in size:
    * A 2x2 pooling layer with a stride of 1 reduces a 26 x 26 feature map to dimension 25 x 25
* Formula for the resulting dimension:

$$m=\big\lfloor{\frac{n+2p-k}{s}}+1\big\rfloor$$

$$m=\big\lfloor{\frac{26+2\cdot0-2}{1}}+1\big\rfloor=25$$

In practice pooling layers often use a stride equal to the kernel size, e.g. a 2x2 kernel with a stride of 2, which halves a 26 x 26 feature map to 13 x 13.
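
Max pooling can be sketched in a few lines of plain Python. The function name `max_pool` and the 4 x 4 feature map are assumptions for illustration.

```python
def max_pool(feature_map, size=2, stride=2):
    """Slide a size x size window over the feature map and keep the maximum."""
    n_rows, n_cols = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, n_rows - size + 1, stride):
        row = []
        for j in range(0, n_cols - size + 1, stride):
            window = [
                feature_map[i + a][j + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4 x 4 feature map.
feature_map = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]

pooled = max_pool(feature_map, size=2, stride=2)
pooled_stride_1 = max_pool(feature_map, size=2, stride=1)

print(pooled)                # [[6, 5], [7, 9]]
print(len(pooled_stride_1))  # 3: with stride 1 the side length only shrinks by 1
```

There is nothing to train here: the layer has no weights and only aggregates the values it sees.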

Notes: Pooling applies an aggregation (sum, max or similar) rather than a convolution. One of the tasks of pooling is to compress information and strengthen it. Max pooling has fallen out of favour (e.g. ResNet)

### How does a Pooling layer work?

<center>
    <img src="../images/image_modelling/max_pooling.png" width=1000>
</center>

Notes: In this example we use a max pooling layer and its effect is to strengthen the signal that comes in. In the example above no matter where the 6 appeared it was strengthened and passed through. Over the time of applying more and more layers we lose the signal and pooling enables us to strengthen the signal.

### Why pooling?

<div class="group">
    <div class="text"> 
        
Pooling has two major effects:
* It makes the network invariant to translations
        
    </div>
    <div class="images">
        <img src="../images/image_modelling/pooling_translation.png" width=300 height=300>
    </div>
</div>

[source](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)


Notes:  Pooling layers are also useful in achieving translation invariance in the feature maps. This means that the position of an object in the image does not affect the classification result, as the same features are detected regardless of the position of the object.

### Why pooling?

<div class="group">
    <div class="text"> 
        
Pooling has two major effects:
* It makes the network invariant to translations
  
* It strengthens the signal while it is running through the network
    </div>
    <div class="images">
        <img src="../images/image_modelling/pooling_translation.png" width=300 height=300>
        <img src="../images/image_modelling/network.png" width=600>
    </div>
</div>


### Compositionality principle

* Any image is compositional:
    * Features compose the image in a hierarchical manner
* This justifies the use of multiple layers
    * Each layer composes features of the prior layer to a more complex one


### Building the network

* Stack layers together with a softmax or sigmoid layer at the end
* Train with gradient descent to optimize parameters

<center>
    <img src="../images/image_modelling/building_network.png">
</center>    
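
To get a feeling for how the feature maps change through a stack of layers, the sizes can be traced by hand. The architecture below (two convolution and pooling blocks with 32 and 64 filters on a 28 x 28 greyscale image) is a hypothetical example and not necessarily the network shown in the figure.

```python
def output_size(n, k, p=0, s=1):
    """Side length after a convolution or pooling layer."""
    return (n + 2 * p - k) // s + 1


# (name, kernel size, stride, number of filters or None for pooling)
layers = [
    ("conv 3x3, 32 filters", 3, 1, 32),
    ("max pool 2x2", 2, 2, None),
    ("conv 3x3, 64 filters", 3, 1, 64),
    ("max pool 2x2", 2, 2, None),
]

size, channels = 28, 1  # assumed input: 28 x 28 greyscale image
shapes = []
for name, kernel, stride, n_filters in layers:
    size = output_size(size, kernel, s=stride)
    if n_filters is not None:
        channels = n_filters  # one feature map per filter
    shapes.append((size, size, channels))
    print(f"{name:22s} -> {size} x {size} x {channels}")

# The final feature maps are flattened and fed into the softmax layer.
flattened = size * size * channels
print(flattened)  # 1600
```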

Notes: Stacking multiple convolutional and pooling layers makes it possible to gradually represent more and more complex features. Training is done with gradient descent and backpropagation as with any other neural network.

### Feature extraction

<div class="group">
    <div class="text"> 
        
* Hierarchical feature extraction
    * One pixel in a deeper layer is connected to many pixels in a shallower layer {sparse interactions}
        
    </div>
    <div class="images">
        <img src="../images/image_modelling/layer.png">
    </div>
</div>

Notes: Each layer is a snapshot of the layer before, which is similar to how biological vision in mammals is organised. Large receptive fields.

* Feature complexity grows in each layer
    * First layers learn simple structures, like edges and corners
    * Later layers learn complex structures like eyes, nose, etc.
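
How many input pixels one value in a deeper layer "sees" (its receptive field) can be computed layer by layer. This is a sketch using the standard recurrence for stacked convolution and pooling layers without dilation; the function name and the example stacks are assumptions.

```python
def receptive_field(layers):
    """Receptive field (side length in input pixels) of one output value.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    field, jump = 1, 1  # jump = distance in input pixels between neighbouring outputs
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


print(receptive_field([(3, 1)]))                  # 3: one 3x3 convolution
print(receptive_field([(3, 1), (3, 1)]))          # 5: two stacked 3x3 convolutions
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: conv, 2x2 pooling, conv
```

Deeper layers therefore combine information from ever larger image regions, which is what allows them to represent more complex structures.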

Notes: large receptive fields

### Feature Extraction

<center>
    <img src="../images/image_modelling/feature_extraction.png" width=1300>
</center>    

Notes: It is possible to create visual maps of the feature maps and visualise what the network learns. If the performance of the network is not good it is possible to inspect the feature maps and understand where and what is going wrong.

## Data augmentation and transfer learning


### Challenges with convolutional neural networks

* Deep convolutional neural networks are data hungry
    * Imagine all the features it has to learn from images
* Data scarcity is then a challenge
    * There could be too little data, low-quality data or too few variations
* Two ideas have emerged in the deep learning community to overcome these problems:
    * Data augmentation
    * Transfer learning



Notes: Convolutional neural networks are very data hungry - require large datasets to be trained on. In addition many examples need to be provided for the networks to be robust. Collecting so much data is expensive and difficult.

### Data augmentation

<div class="group">
    <div class="text"> 

* Idea:
    * Randomly transform images during training
        * Translate
        * Rotate
        * Zoom
        * Flip
        * ...
* Enlarges the data set by a significant factor
* Makes the network more robust
        
    </div>
    <div class="images">
        <img src="../images/image_modelling/data_augmentation.png">
    </div>
</div>
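
A very small version of random augmentation might look like this. The helper names and the choice of transformations (horizontal flip and a shift of at most one pixel) are assumptions for this sketch; image libraries and deep learning frameworks provide ready-made, more complete augmentation tools.

```python
import random


def horizontal_flip(image):
    """Mirror a 2D image (list of rows) from left to right."""
    return [row[::-1] for row in image]


def translate_right(image, shift):
    """Shift the image to the right, filling the vacated pixels with zeros."""
    if shift == 0:
        return [list(row) for row in image]
    return [[0] * shift + list(row[:-shift]) for row in image]


def augment(image, rng):
    """Apply a random flip and a random shift of 0 or 1 pixel."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return translate_right(image, rng.randint(0, 1))


image = [
    [1, 2, 3],
    [4, 5, 6],
]

print(horizontal_flip(image))     # [[3, 2, 1], [6, 5, 4]]
print(translate_right(image, 1))  # [[0, 1, 2], [0, 4, 5]]

# Each call produces a randomly transformed copy; the label stays the same.
rng = random.Random(0)
augmented = [augment(image, rng) for _ in range(4)]
```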

Notes: Applying a random transformation to each picture during training significantly increases the size of the dataset without the need to create new labels (which is time consuming and difficult). However, it is important to have an initial dataset of good quality - this concerns not just the resolution: the dataset should also follow the same distribution of objects as the data you are planning to predict on.

### Transfer learning

<div class="group">
    <div class="text"> 

* Human beings learn tasks by transferring knowledge from other tasks they learned before
    * Can ANNs do the same?
* They can - given some conditions
    * Same input distribution for both tasks
    * Well-learned representations on the first task 
        
    </div>
    <div class="images">
        <img src="../images/image_modelling/transfer_learning.png">
    </div>
</div>

Notes: The inspiration comes from biological brains. A simple idea: we have a model that is able to classify fishes; can the same model be used to classify insects? Yes, but only under the conditions above. If the distributions of the input data are similar, the network should be able to detect similar features in both. A representation captures “what makes a fish a fish”, for example as a feature map or a dense vector. If the model has learned good representations of the training dataset and can classify its images, there is a chance we can apply it to a different dataset.

### How does transfer learning work?

* A model is trained on a large amount of data for one task
    * For example classifying fishes

<div class="group">
    <div class="text"> 

* Then the upper layers are replaced by a classifier for the second task
    * For example classifying hymenoptera (wasps, bees, etc.)
* This classifier for the second task gets trained
    * Usually significantly less data is needed for this task
        
    </div>
    <div class="images">
        <img src="../images/image_modelling/transfer_learning2.png">
    </div>
</div>
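
The mechanism above can be sketched in plain NumPy. This is a toy illustration under loud assumptions, not a real pre-trained network: `W_base` is a random stand-in for lower layers that would normally come from training on the first task, and `extract_features`, `train_head` and the synthetic second-task data are hypothetical names introduced only for this example. The point is the mechanism: the lower layers stay frozen and only the new classifier is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: random stand-in for the lower layers of a pre-trained network.
# In practice these weights would come from training on the first task.
W_base = rng.normal(size=(64, 16)) / np.sqrt(64)


def extract_features(x):
    """Frozen lower layers: a fixed linear map followed by ReLU."""
    return np.maximum(x @ W_base, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_head(features, labels, lr=0.1, epochs=200):
    """Train a new logistic-regression classifier on top of the frozen features."""
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(epochs):
        # Gradient of the log-loss with respect to the logits
        error = sigmoid(features @ w + b) - labels
        w -= lr * features.T @ error / len(labels)
        b -= lr * error.mean()
    return w, b


# Synthetic, small data set for the second task (40 flattened 8x8 "images")
x_new = rng.normal(size=(40, 64))
y_new = (x_new[:, 0] > 0).astype(float)

features = extract_features(x_new)  # W_base is only read, never updated
w_head, b_head = train_head(features, y_new)
```

Only `w_head` and `b_head` are learned here, which is why the second task usually gets by with far less data. With a real framework the same idea amounts to loading a pre-trained model, marking its lower layers as non-trainable and attaching a new output layer.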

Notes: The upper layers (those closer to the output) determine the high-level features and the final classes. You can fine-tune the network for a new task by training those layers on different output classes. Because you are relying on someone else’s work, an error analysis of the downloaded model may be required.

### Where do we get pre-trained models?

* Popular ML frameworks offer pre-trained models
    * [TensorFlow Hub](https://www.tensorflow.org/hub)
    * [Keras Applications](https://keras.io/api/applications/)
    * [PyTorch Torchvision Models](https://pytorch.org/vision/stable/models.html)
* GitHub:
    * [Pretrained Models for PyTorch](https://github.com/Cadene/pretrained-models.pytorch)
    * [ONNX](https://github.com/onnx/models)
    * [Deepset.ai](https://github.com/deepset-ai/FARM)
* [Awesome Opensource List](https://awesomeopensource.com/projects/pretrained-models)



Notes: Keras Applications allows much more control over the individual layers.

## Conclusion

### Image Modeling

* Image data is **unstructured** and regularly shows **transformations**
    * Principles of locality, stationarity and compositionality

* Image models are able to offer **translation invariance**

* Special layers in image models are:
    * **Convolutional layers**: Extract features using filters
    * **Pooling layers**: Strengthen filter signals and make the net more translation invariant

* Data **augmentation** helps to increase the data size and robustness of the net

* Transfer learning makes it possible to reuse **pre-trained models** from one task for a second one
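
The translation invariance attributed to pooling layers above can be checked on a toy feature map. The sketch below assumes non-overlapping 2x2 max pooling on a 2D array with even height and width; `max_pool_2x2` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling on a 2D feature map with even height and width."""
    h, w = feature_map.shape
    # Split the map into (h/2 x w/2) blocks of size 2x2 and take the max of each block
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))


feature_map = np.zeros((4, 4))
feature_map[0, 0] = 1.0  # a single strong filter response

# Shift by one pixel: the response stays inside the same pooling window
shifted_inside = np.roll(feature_map, shift=1, axis=1)
# Shift by two pixels: the response moves into the neighbouring window
shifted_across = np.roll(feature_map, shift=2, axis=1)

same_after_small_shift = np.array_equal(
    max_pool_2x2(feature_map), max_pool_2x2(shifted_inside)
)
same_after_large_shift = np.array_equal(
    max_pool_2x2(feature_map), max_pool_2x2(shifted_across)
)
```

The pooled output is unchanged by the small shift but not by the larger one, so a single pooling layer only provides invariance to small, local translations. Stacking several convolution and pooling stages extends this to larger shifts.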

## Resources

- Géron, A. (2019), Chapter 14: Deep Computer Vision Using Convolutional Neural Networks
- Li, F.-F. and Johnson, J. (2017), Convolutional Neural Networks for Visual Recognition (much broader, lectures 1-5)

Image Feature Engineering:
- Dalal, N. and Triggs, B. (2005), “Histograms of Oriented Gradients for Human Detection”, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), pp. 886–893
- Lowe, D. (1999), “Object Recognition from Local Scale-Invariant Features”, Proceedings of the International Conference on Computer Vision (ICCV '99), Vol. 2, pp. 1150–1157
- Lowe, D. (2004), “Distinctive Image Features from Scale-Invariant Keypoints”, International Journal of Computer Vision, Vol. 60, No. 2, pp. 91–110
- Bay, H. et al. (2006), “SURF: Speeded Up Robust Features”, European Conference on Computer Vision 2006

Drawbacks of deep neural networks:
- Hornik (1991), “Approximation Capabilities of Multilayer Feedforward Networks”
- Csáji (2001), “Approximation with Artificial Neural Networks”

ImageNet Competition:
- Perronnin and Sánchez (2011), “Compressed Fisher Vectors for LSVRC”
- Krizhevsky et al. (2012), “ImageNet Classification with Deep Convolutional Neural Networks”

Stride:
- Springenberg et al. (2015), “Striving for Simplicity: The All Convolutional Net”
- Kong and Lucey (2017), “Take it in your stride: Do we need striding in CNNs?”

Data Augmentation:
- LeCun et al. (1998), “Gradient-Based Learning Applied to Document Recognition”
- Krizhevsky et al. (2012), “ImageNet Classification with Deep Convolutional Neural Networks”
- Perez, L. and Wang, J. (2017), “The Effectiveness of Data Augmentation in Image Classification Using Deep Learning”

Transfer Learning:
- Caruana, R. (1995), “Learning Many Different Tasks at the Same Time with Backpropagation”
- Yosinski, J. et al. (2014), “How transferable are features in deep neural networks?”
- Weiss, K. et al. (2016), “A survey of transfer learning”

Feature Extraction:
- Garcia, D. et al. (2018), “On the Behavior of Convolutional Nets for Feature Extraction”
- Zeiler, M.D. and Fergus, R. (2014), “Visualizing and Understanding Convolutional Networks”
- Yosinski, J. et al. (2015), “Understanding Neural Networks Through Deep Visualization”