# Computer Vision

## Applications :
 1. Image Classification : Decide whether an image belongs to a given category, e.g. is it a cat (0/1)?
 2. Object Detection : Locate the objects present in a given image.
 3. Neural Style Transfer : Given two images, repaint the content of picture 1 in the style of the 2nd picture.
 <img src="./images/eg_1.png" height="50%" width="50%">
 
## Problems :
 * Image size can be an issue when developing a model. A fully connected network works fine with a 64x64 image, but for a high-quality 1000x1000 image the number of input features is 1000x1000x3 = 3 million. With, say, 1000 neurons in layer 1, the weight matrix has shape (1000, 3 million), i.e. 3 billion parameters. A model of that size is hard to train, because it is difficult to collect enough data to avoid overfitting.
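
The parameter count above can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name `dense_layer_weights` is hypothetical, and biases are ignored, as in the text.

```python
def dense_layer_weights(height, width, channels, hidden_units):
    """Number of weights in a fully connected layer fed with a flattened image."""
    input_features = height * width * channels
    return input_features * hidden_units


small = dense_layer_weights(64, 64, 3, 1000)      # 64x64 RGB image
large = dense_layer_weights(1000, 1000, 3, 1000)  # 1000x1000 RGB image

print(f"64x64x3     -> {small:,} weights")
print(f"1000x1000x3 -> {large:,} weights")
```
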
 
## Understanding How it works :
 1. Consider the example of an edge detection system: given a photo, we try to detect the edges of the objects in it.
 2. First we find the vertical edges in the image, and then the horizontal edges.
 <img src="./images/edge_1.png" height="50%" width="50%">
 3. **Vertical Edge Detection :**
   * Given a grayscale image of dimension, say, 6x6x1 (1 because it is grayscale), we use a ***kernel or filter*** of dimension, say, ***3x3***.
   * The filter is used to compute another matrix, here of dimension 4x4, by applying an operation called ***convolution***, denoted by '*'.
   * The first element of the output matrix is computed by taking the top-left 3x3 subset of the 6x6 matrix, multiplying each of its elements with the corresponding element of the kernel/filter, and adding the 9 products to get a single number.
<img src="./images/vertical_1.png" height="50%" width="50%"></img>
   * For the second element, at position (0,1), slide the kernel horizontally by one step and repeat the above procedure.
 <img src="./images/vertical_2.png" height="50%" width="50%">
   * Repeat the above 2 steps to calculate the remaining elements of the row.
   * For the element at position (1,0), i.e. one row down, move the kernel down by 1 step (leaving out row 1 of the image) and repeat the same process.
 <img src="./images/vertical_3.png" height="50%" width="50%">
   * Repeating the above steps gives us the 4x4 output matrix.
 4. Different libraries expose the convolution operation under different names, e.g. **plain Python (a hand-written helper such as conv_forward), TensorFlow : tf.nn.conv2d, Keras : Conv2D**
 5. **How does it work ?**
   * Consider an input image that is bright on the left half and dark on the right half, and a kernel that is bright in the left column, gray in the middle column and dark in the right column. When the convolution is performed, the output matrix is gray on the left and right sides and bright in the middle. This bright band marks the vertical line in the original image that separates the light region from the dark region. (In the image, 10 -> bright pixel intensity and 0 -> dark pixel intensity; in the kernel, 1 -> bright, 0 -> gray, -1 -> dark.)
<img src="./images/vertical_4.png" height="50%" width="50%">
   * The detected edge looks rather wide (the middle of the output) because the image we are using is very small; with a bigger image the band becomes thin relative to the image size.
 6. If the image transitions from light to dark, the middle of the output matrix contains 30s; if we invert the transition (dark to light), we get -30s instead. Thus the convolution operation also captures the direction of the intensity change.
 <img src="./images/vertical_5.png" height="50%" width="50%"> 
 7. Similarly, for **Horizontal Edge Detection** we can use a different kernel/filter, obtained by rotating the kernel used for vertical edge detection by $90^\circ$.
<img src="./images/hz_1.png" height="10%" width="10%">
 8. For more complex edge detection with both horizontal and vertical edges, consider an image split into 4 regions: the 1st region is light, the 2nd dark, the 3rd dark and the 4th light. The output matrix then contains different values, such as ***30 -> transition from light to dark, 10 -> middle region where the window covers a mix of light and dark pixels, -30 -> transition from dark to light.***
<img src="./images/cpx_1.png" height="50%" width="50%"> 
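
The sliding-window procedure described in this section can be sketched in NumPy as follows. This is an illustrative implementation under simple assumptions (stride 1, no padding, a single grayscale channel); the function name `conv2d_valid` and the toy images are hypothetical, not taken from any library.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the element-wise products."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            window = image[r:r + f_h, c:c + f_w]   # current f x f subset of the image
            out[r, c] = np.sum(window * kernel)    # multiply and add -> one number
    return out


# 6x6 image: bright (10) left half, dark (0) right half, and its mirror image
light_to_dark = np.array([[10, 10, 10, 0, 0, 0]] * 6)
dark_to_light = light_to_dark[:, ::-1]

vertical_kernel = np.array([[1, 0, -1],
                            [1, 0, -1],
                            [1, 0, -1]])
# Rotating the vertical kernel by 90 degrees (clockwise) gives the horizontal kernel
horizontal_kernel = np.rot90(vertical_kernel, k=-1)

print(conv2d_valid(light_to_dark, vertical_kernel))    # middle columns are 30
print(conv2d_valid(dark_to_light, vertical_kernel))    # middle columns are -30
print(conv2d_valid(light_to_dark, horizontal_kernel))  # all zeros: no horizontal edge
```

The horizontal kernel gives an all-zero response on this image because every row of the image is identical, so there is no top-to-bottom change to detect.
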
## Kernel Choices :
 1. The kernel/filter can be of various types, each with its own benefits: some put more weight on the middle row (**Sobel filter**, with values 1, 2, 1), others use different values altogether (**Scharr filter**, with values 3, 10, 3).
<img src="./images/kernel_1.png" height="50%" width="50%"> 
 2. ***Learning Kernels :*** Best practice is to learn the 9 values of the filter via backpropagation and then use the learned filter for the required task. This automatically chooses a filter that best captures the statistics of the data, and rather than being limited to horizontal/vertical filters it can detect other orientations as well.
<img src="./images/kernel_2.png" height="50%" width="50%">
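
The idea of learning the 9 filter values can be illustrated with a toy experiment. The sketch below is not a full backpropagation implementation: it assumes a single linear 3x3 filter, synthetic random images, a mean-squared-error loss and plain gradient descent with a hand-picked learning rate. The helper `extract_windows` is hypothetical. A Sobel vertical-edge filter is used as the hidden target that the learned filter should recover.

```python
import numpy as np


def extract_windows(image, f):
    """Return every f x f window of a 2D image as a flattened row (stride 1, no padding)."""
    n_h, n_w = image.shape
    rows = [image[r:r + f, c:c + f].ravel()
            for r in range(n_h - f + 1)
            for c in range(n_w - f + 1)]
    return np.array(rows)


rng = np.random.default_rng(0)

# Hidden "true" filter that the model should rediscover (Sobel vertical-edge filter)
sobel = np.array([[1.0, 0.0, -1.0],
                  [2.0, 0.0, -2.0],
                  [1.0, 0.0, -1.0]])

# Synthetic training data: random 6x6 images and their filtered outputs
images = rng.standard_normal((20, 6, 6))
windows = np.concatenate([extract_windows(img, 3) for img in images])  # shape (320, 9)
targets = windows @ sobel.ravel()                                      # shape (320,)

weights = np.zeros(9)   # the 9 learnable filter values
learning_rate = 0.1     # assumed value, chosen for this toy problem

for _ in range(500):
    errors = windows @ weights - targets
    # Gradient of the mean squared error with respect to the 9 weights
    gradient = 2.0 * (windows.T @ errors) / len(targets)
    weights -= learning_rate * gradient

learned_kernel = weights.reshape(3, 3)
print(np.round(learned_kernel, 3))
```

In a real ConvNet the filter sits inside a deeper network and the gradient arrives through backpropagation, but the update of the filter values follows the same pattern.
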

## Padding :
 1. One issue with convolution is that the output shrinks with every layer, so a deep NN ends up with a very small image (***shrinking output***). E.g. 6x6 * 3x3 -> 4x4, and so on. In general: **nxn * fxf -> (n-f+1)x(n-f+1)**
 2. Another problem is that pixels in the middle of the image are used in many more windows than pixels at the edges (a corner pixel is used only once), which results in ***losing valuable information present along the edges***.
 3. To overcome these issues one can use ***Padding***, i.e. add a border of extra pixels (usually zeros) around the image. A padding of 1 adds 2 extra rows and 2 extra columns, so that the original edge pixels become inner pixels.
 4. Thus a padding of 1 turns the 6x6 image into 8x8, and 8x8 * 3x3 -> 6x6, which preserves the original dimension of the image. In general: **(n+2p)x(n+2p) * fxf -> (n+2p-f+1)x(n+2p-f+1)**, where p is the amount of padding.
<img src="./images/padding_1.png" height="50%" width="50%">  
 5. **Types of Padding :**
   * **Valid :** nxn * fxf -> (n-f+1) x (n-f+1), i.e. no padding
   * **Same :** Pad so that the o/p is the same size as the i/p. This requires ***n+2p-f+1 = n***, which gives $p=\frac{f-1}{2}$. For p to be an integer, f must be odd, which is one reason kernels usually have odd dimensions, i.e. 1x1 | 3x3 | 5x5 etc.
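
The padding formulas can be tried out with `np.pad`. This is a small sketch assuming zero padding and stride 1; the helper names `same_padding` and `output_size` are hypothetical.

```python
import numpy as np


def same_padding(f):
    """Padding p that keeps the output size equal to the input size (stride 1, odd f)."""
    if f % 2 == 0:
        raise ValueError("p = (f - 1) / 2 is only an integer for odd filter sizes")
    return (f - 1) // 2


def output_size(n, f, p=0):
    """Output side length for an n x n input, f x f filter, padding p, stride 1."""
    return n + 2 * p - f + 1


image = np.arange(36).reshape(6, 6)
p = same_padding(3)

# Add a border of p zero pixels on every side of the image
padded = np.pad(image, pad_width=p, mode="constant", constant_values=0)

print(padded.shape)           # padded image is 8x8
print(output_size(6, 3))      # "valid" convolution
print(output_size(6, 3, p))   # "same" convolution
```
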
   
## Strided Convolution :
 1. Instead of taking 1 step in the horizontal/vertical direction, with **stride = s** the filter moves s steps at a time in both directions.
 2. The output size is then given by :
   <br>**nxn * fxf -> $\lfloor{\frac{n+2p-f}{s}+1}\rfloor$ x $\lfloor{\frac{n+2p-f}{s}+1}\rfloor$**, where s : stride and p : padding. If the result is a fraction, it is rounded down (floor operation).
<img src="./images/stride_1.png" height="50%" width="50%"> 
 3. Note : In mathematics, convolution means first flipping the filter matrix horizontally and vertically and then performing the multiply-and-add steps. What we do in deep learning skips the flip and is strictly called **Cross-Correlation**, but by convention many still use the term **Convolution**.
$$ \begin{bmatrix}
3 & 4 & 5\\
1 & 0 & 2\\
-1 & 9 & 7
\end{bmatrix} \rightarrow
\begin{bmatrix}
7 & 9 & -1\\
2 & 0 & 1\\
5 & 4 & 3
\end{bmatrix}$$
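
As a rough illustration of this section, the sketch below computes the strided output size, applies a strided cross-correlation, and shows the kernel flip that separates mathematical convolution from cross-correlation. The function names and the 7x7 toy image are assumptions made for the example; no padding is applied.

```python
import numpy as np


def conv_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s + 1), written with integer floor division."""
    return (n + 2 * p - f) // s + 1


def cross_correlate2d(image, kernel, stride=1):
    """Strided sliding-window multiply-and-add without flipping the kernel."""
    n_h, n_w = image.shape
    f = kernel.shape[0]
    out_h = conv_output_size(n_h, f, s=stride)
    out_w = conv_output_size(n_w, f, s=stride)
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            top, left = r * stride, c * stride   # the window jumps `stride` pixels at a time
            out[r, c] = np.sum(image[top:top + f, left:left + f] * kernel)
    return out


kernel = np.array([[3, 4, 5],
                   [1, 0, 2],
                   [-1, 9, 7]])
flipped = kernel[::-1, ::-1]   # flip vertically and horizontally

image = np.arange(49).reshape(7, 7)

deep_learning_conv = cross_correlate2d(image, kernel, stride=2)    # what ConvNets compute
mathematical_conv = cross_correlate2d(image, flipped, stride=2)    # textbook convolution

print(conv_output_size(7, 3, p=0, s=2))   # 3
print(deep_learning_conv.shape)
print(flipped)
```
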

## Principle behind Convolution over RGB image :
<img src="./images/conv_1.png" height="50%" width="50%"> 
 1. Consider an input image of dimension, say, 6x6x3 (3 for the RGB channels). The filter or kernel then also has 3 channels, e.g. 3x3x3, with one 3x3 slice each for R, G and B.
 2. The output is evaluated the same way as for the single-channel matrix: take 9 numbers from the Red channel, 9 from the Green channel and 9 from the Blue channel, multiply them with the corresponding 27 numbers of the filter, and add the products to get a single number.
 3. The benefit of a 3x3x3 filter is that, to detect vertical lines only in the Red channel, one can set appropriate values in the Red slice of the filter and 0 in the other slices; to detect vertical lines in all channels, one sets the appropriate values in all three slices.
<img src="./images/conv_rgb_1.png" height="50%" width="50%"> 
 4. The output of a single filter is still a 2D matrix.
 5. **Multiple filters :**
  * If we want to detect more than one feature (say vertical lines and horizontal lines) in the same image, a different filter can be applied to the same image for each feature.
  * The outputs of the individual filters are then stacked together to form the final output volume.
<img src="./images/conv_rgb_2.png" height="60%" width="60%"> 
  * Generalizing Dimensional Operation : **(n x n x $n_c$) * (f x f x $n_c$) -> (n-f+1) x (n-f+1) x ($n_c'$)** , where $n_c$ : # Channels, $n_c'$ : # filters(number of features extracted) 
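
A convolution over a multi-channel image with several filters might be sketched like this. The layout `(f, f, n_c, n_filters)` for the stacked filters and the function name `conv_volume` are assumptions for the example; stride is 1 and there is no padding.

```python
import numpy as np


def conv_volume(image, filters):
    """image: (n, n, n_c); filters: (f, f, n_c, n_filters) -> (n-f+1, n-f+1, n_filters)."""
    n_h, n_w, n_c = image.shape
    f, _, n_c_filter, n_filters = filters.shape
    assert n_c == n_c_filter, "filter depth must match the number of image channels"
    out = np.zeros((n_h - f + 1, n_w - f + 1, n_filters))
    for k in range(n_filters):
        for r in range(out.shape[0]):
            for c in range(out.shape[1]):
                window = image[r:r + f, c:c + f, :]                  # f x f x n_c values
                out[r, c, k] = np.sum(window * filters[:, :, :, k])  # multiply and add
    return out


# 6x6 RGB image with the same light-to-dark vertical edge in every channel
channel = np.tile(np.array([10, 10, 10, 0, 0, 0]), (6, 1))
image = np.stack([channel] * 3, axis=-1)                  # shape (6, 6, 3)

vertical = np.array([[1, 0, -1]] * 3)
red_only = np.zeros((3, 3, 3))
red_only[:, :, 0] = vertical                              # look at the Red channel only
all_channels = np.stack([vertical] * 3, axis=-1)          # look at all three channels
filters = np.stack([red_only, all_channels], axis=-1)     # shape (3, 3, 3, 2)

out = conv_volume(image, filters)
print(out.shape)       # (4, 4, 2): one 4x4 map per filter
print(out[0, :, 0])    # response of the Red-only filter along the first row
print(out[0, :, 1])    # response of the all-channel filter along the first row
```
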
  
## Understanding 1 Layered CNN :
 1. Generally, forward propagation in a NN consists of the following steps :
\begin{align}
z^{[1]} & = W^{[1]}a^{[0]} + b^{[1]}\\
a^{[1]} & = g(z^{[1]})\\
\end{align}
 2. Similarly, the input matrix of size, say, 6x6x3 plays the role of $a^{[0]}$ and the filters play the role of **$W^{[1]}$**. If we extract 2 features (i.e. use 2 filters), the output is 4x4x2. One bias per filter is added to the convolution output via broadcasting in Python, and the activation function g is then applied element-wise.
<img src="./images/eg_layer1.png" height="60%" width="60%">
 3. **Notations(for layer l) :**
   * **Input :** $n_h^{[l-1]} \times n_w^{[l-1]} \times n_c^{[l-1]}$, where $n_h$ : Height, $n_w$ : Width, $n_c$ : Number of Channels
   * **Output :** $n_h^{[l]} \times n_w^{[l]} \times n_c^{[l]}$, where $n_h^{[l]}=\left\lfloor\frac{n_h^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1\right\rfloor$, same for $n_w^{[l]}$
   * $f^{[l]}$ = filter size | $p^{[l]}$ = Padding | $s^{[l]}$ = Stride | $n_c^{[l]}$ = Number of features/filters
   * **Each filter :** $f^{[l]} \times f^{[l]} \times n_c^{[l-1]}$
   * **Activations :** $a^{[l]}$ has shape $n_h^{[l]} \times n_w^{[l]} \times n_c^{[l]}$, and for m examples $A^{[l]}$ has shape $m \times n_h^{[l]} \times n_w^{[l]} \times n_c^{[l]}$
   * **Bias :** $n_c^{[l]}$ values, one per filter
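
The notation above can be turned into a small shape calculator. This is only a sketch: the function name `conv_layer_shapes` and the dictionary keys are hypothetical, and the `weights` entry assumes that the $n_c^{[l]}$ filters of a layer are stacked along a fourth axis.

```python
def conv_layer_shapes(n_h_prev, n_w_prev, n_c_prev, f, p, s, n_c, m=1):
    """Shapes of the tensors involved in convolution layer l."""
    n_h = (n_h_prev + 2 * p - f) // s + 1   # floor((n + 2p - f) / s + 1)
    n_w = (n_w_prev + 2 * p - f) // s + 1
    return {
        "filter": (f, f, n_c_prev),                # one filter spans all input channels
        "weights": (f, f, n_c_prev, n_c),          # assumption: filters stacked on a 4th axis
        "bias": (n_c,),                            # one bias per filter
        "activation": (n_h, n_w, n_c),             # a[l] for a single example
        "batch_activations": (m, n_h, n_w, n_c),   # A[l] for m examples
    }


# The 6x6x3 input with 2 filters of size 3x3 from the example above, for 10 images
shapes = conv_layer_shapes(6, 6, 3, f=3, p=0, s=1, n_c=2, m=10)
for name, shape in shapes.items():
    print(f"{name:18s} {shape}")
```
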
   
## Working of a ConvNet :
<img src='./images/conv_net_1.png' height="75%" width="75%">
 1. Consider an input image of dimension 39x39x3 (this is $a^{[0]}$), convolved with filters of size $f^{[1]}$ = 3, padding $p^{[1]}$ = 0 and stride $s^{[1]}$ = 1, where the number of filters $n_c^{[1]}$ is 10.
 
 2. Using the formula from the section above, the output of layer 1, i.e. $a^{[1]}$, is 37x37x10.
 3. Following layer 1 the second layer contains :
   * $f^{[2]}$:5 , $s^{[2]}$:2 , $p^{[2]}$:0 with $n_c^{[2]}$:20
   * Output of this layer would be : $a^{[2]}$:17x17x20
 4. Following layer 2 the $3^{rd}$ layer contains :
   * $f^{[3]}$:5 , $s^{[3]}$:2 , $p^{[3]}$:0 with $n_c^{[3]}$:40
   * Output of this layer would be : $a^{[3]}$:7x7x40
 5. The output of the last layer is flattened into 7x7x40 = 1960 features and fed into a NN whose last layer is a softmax / logistic function that predicts the output.
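
The layer sizes in this walk-through can be reproduced by applying the output-size formula layer by layer. The sketch below assumes square inputs and uses a hypothetical helper `conv_output_size`; the layer settings are the ones listed above.

```python
def conv_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s + 1) for a square input."""
    return (n + 2 * p - f) // s + 1


# (filter size, stride, padding, number of filters) for the three layers above
layers = [
    {"f": 3, "s": 1, "p": 0, "n_c": 10},
    {"f": 5, "s": 2, "p": 0, "n_c": 20},
    {"f": 5, "s": 2, "p": 0, "n_c": 40},
]

size, channels = 39, 3              # input image a[0] is 39x39x3
shapes = [(size, size, channels)]
for layer in layers:
    size = conv_output_size(size, layer["f"], layer["p"], layer["s"])
    channels = layer["n_c"]
    shapes.append((size, size, channels))

flat_features = size * size * channels   # length of the flattened vector

for index, shape in enumerate(shapes):
    print(f"a[{index}]: {shape}")
print("flattened features:", flat_features)
```
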

## Types of Layers in ConvNet :
 * A ConvNet basically contains 3 types of layers :
   1. Convolution layer
   2. Pooling layer
   3. Fully connected layer
 
## Pooling Layer :
 1. Summarizes the features coming from the convolution layer into a smaller, higher-level representation by keeping either the max value or the average value of each region.
 2. **Working of MaxPooling:**
<img src='./images/pool_1.png' height="50%" width="50%"></img>
   * Given an output of 5x5x$n_c^{[l-1]}$ from the convolution layer, suppose we use a filter size of 3 and a stride of 2. Just like convolution, pooling first selects the top-left 3x3 subset of the input matrix.
   
   * With MaxPooling, the max value of this 3x3 matrix is placed in the o/p matrix at position (0,0).
   * Since the stride is 2, the window then hops 2 steps horizontally and again selects the max value of the 3x3 matrix. (The size of the output matrix follows from the same formula as before, $\lfloor\frac{n+2p-f}{s}+1\rfloor$, which gives 2x2 here.)
   * Repeat until all the elements in this channel have been covered.
   * The same process is applied independently to every channel of the input, so the output has the same number of channels as the input.
 3. **Intuition behind it :**
   * The main idea is that, for a given region/quadrant, pooling checks whether a particular feature is present. A high value in the quadrant means the feature was detected there, and max pooling preserves it; a low value indicates that the feature is not present in that quadrant.
 4. **There are no parameters in a pooling layer that can be trained.** Its only hyperparameters are the **filter size f and the stride s**.
 5. Commonly used values for these hyperparameters are **(f,s): (2,2) | (3,2)**
 6. **Average Pooling**
   * Works similarly to MaxPooling, but instead of the max value it takes the average of all the values present in the window.
   * Less frequently used as compared to MaxPooling.
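
Both pooling variants can be sketched with one small NumPy function. The name `pool2d`, the `mode` argument and the 5x5 input values are made up for illustration; no padding is used, and every channel is pooled independently.

```python
import numpy as np


def pool2d(volume, f=2, s=2, mode="max"):
    """Pool an (n_h, n_w, n_c) volume with an f x f window and stride s."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    reduce_fn = np.max if mode == "max" else np.mean
    n_h, n_w, n_c = volume.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    out = np.zeros((out_h, out_w, n_c))
    for r in range(out_h):
        for c in range(out_w):
            window = volume[r * s:r * s + f, c * s:c * s + f, :]
            out[r, c, :] = reduce_fn(window, axis=(0, 1))   # one value per channel
    return out


# Arbitrary 5x5 single-channel input, pooled with f = 3 and s = 2
channel = np.array([[1, 3, 2, 1, 3],
                    [2, 9, 1, 1, 5],
                    [1, 3, 2, 3, 2],
                    [8, 3, 5, 1, 0],
                    [5, 6, 1, 2, 9]])
volume = channel[:, :, np.newaxis]   # shape (5, 5, 1)

max_pooled = pool2d(volume, f=3, s=2, mode="max")
avg_pooled = pool2d(volume, f=3, s=2, mode="average")

print(max_pooled[:, :, 0])
print(avg_pooled[:, :, 0])
```

The function has nothing to learn: `f` and `s` are fixed hyperparameters, which matches the point that pooling layers have no trainable parameters.
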

## Why Convolution :
<img src='./images/conv_2.png' height="25%" width="25%"></img>
 1. Take, say, a small image of dimension 32x32x3, which has 3,072 inputs. Suppose a layer with 6 filters of size 5x5 produces an output of 28x28x6, i.e. 4,704 outputs.
 2. Connecting these two layers with a traditional fully connected NN would require a weight matrix of shape (4704, 3072), i.e. about 14 million parameters to train, even for this small image. That would be computationally very expensive.
<img src='./images/conv_3.png' height="60%" width="60%"></img>
 3. In contrast, a convolution layer offers :
   * **Parameter sharing :** A feature detector (i.e. filter) that is useful in one part of the image can be useful in another part of the image.
   * **Sparsity of Connections :** In each layer, each output value depends only on a small number of inputs.
   * **Fewer parameters :** Because of parameter sharing, the number of parameters depends only on the filters, not on the image size. A single 3x3 filter has just 9 weights to train; in this case, 6 filters of size 5x5x3 with one bias each give (5x5x3 + 1) x 6 = 456 parameters.
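
The comparison can be made concrete with a short calculation. The sketch assumes the 32x32x3 input and the 28x28x6 output from above, counts one 5x5x3 filter plus one bias per output feature for the convolution layer, and ignores biases for the fully connected layer.

```python
# Fully connected: every output is connected to every input
dense_inputs = 32 * 32 * 3                     # 3,072 input values
dense_outputs = 28 * 28 * 6                    # 4,704 output values
dense_weights = dense_inputs * dense_outputs

# Convolution: the same small filters are reused at every image position
single_filter_weights = 3 * 3                  # one 3x3 single-channel filter
conv_params = (5 * 5 * 3 + 1) * 6              # six 5x5x3 filters, one bias each

print(f"fully connected weights : {dense_weights:,}")
print(f"one 3x3 filter          : {single_filter_weights}")
print(f"conv layer parameters   : {conv_params}")
print(f"ratio                   : {dense_weights / conv_params:,.0f}x")
```
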