### What is computer vision?
Computer Vision is a field of Artificial Intelligence, closely tied to [Deep Learning](https://github.com/letspython3x/Books/blob/master/Deep%20Learning%20with%20Python.pdf), in which humans teach computers to see and interpret the world around them (see also [here](https://livebook.manning.com/book/grokking-deep-learning-for-computer-vision/chapter-1/66)).


Image Classification is one of the most fundamental tasks in computer vision. It has revolutionized and propelled technological advancements in the most prominent fields, including the automobile industry, healthcare, manufacturing, and more.

### How does image classification work?
Image Classification (often referred to as Image Recognition) is the task of assigning one label (single-label classification) or several labels (multi-label classification) to a given image.

### Basic steps in computer vision:


### Input Data:
   - Loading images or videos (image frames)
   
### Preprocessing: 

   - Normalization, rescaling pixel values, resizing, color transformation, one-hot encoding, etc.
    
### Feature Extraction: 
   - Finding unique characteristic/features of an image
    
### Modeling
   - Learn from the extracted features to predict and classify objects (source: [here](https://github.com/moelgendy/deep_learning_for_vision_systems)).

## How do computers "see" images?
When we look at an image, we see objects, landscapes, colors, and so on. That is not the case for computers. To a computer, a grayscale image is just a 2D array of numbers, one per pixel, where each pixel value is an intensity stored either as an integer from 0 to 255 or as a float scaled to the range 0 to 1.


<div>
<img src="images/cv1.png" width="600"/>
</div>

Fig.1 Source: [Computer vision: The power of seeing and interpreting images](https://datascience.aero/computer-vision-seeing-interpreting-images/)

>In grayscale images, each pixel represents the intensity of only one color, whereas in the standard RGB system, color images have three channels (red, green, and blue). In other words, color images are represented by three matrices: one represents the intensity of red in the pixel, one represents green, and one represents blue ([source](https://livebook.manning.com/book/grokking-deep-learning-for-computer-vision/chapter-1/125)).

<div>
<img src="images/cv2.png" width="600"/>
</div>
<br>

Fig.2:

>As you can see in figure above, the color image is composed of three channels: red, green, and blue. Now the question is, how do computers see this image? Again, they see the matrix, unlike grayscale images, where we had only one channel. In this case, we will have three matrices stacked on top of each other; that’s why it’s a 3D matrix. The dimensionality of 700 × 700 color images is (700, 700, 3). Let’s say the first matrix represents the red channel; then each element of that matrix represents an intensity of red color in that pixel, and likewise with green and blue. Each pixel in a color image has three numbers (0 to 255) associated with it. These numbers represent intensity of red, green, and blue color in that particular pixel. <br>
If we take the pixel $F(0,0)$ as an example, we will see that it represents the top-left pixel of the image of green grass. When we view this pixel in the color image, it looks like the figure below, which shows some shades of the color green and their RGB values ([source](https://livebook.manning.com/book/grokking-deep-learning-for-computer-vision/chapter-1/)).
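
To make the array view of an image concrete, here is a minimal NumPy sketch. The arrays are synthetic (no real photo is loaded), and the green shade painted at the top-left pixel is an arbitrary value chosen for illustration.

```python
import numpy as np

# Hypothetical 700 x 700 images filled with synthetic values (not real photos).
gray = np.zeros((700, 700), dtype=np.uint8)       # one channel: a 2D matrix
color = np.zeros((700, 700, 3), dtype=np.uint8)   # three stacked channels: a 3D array

# Paint the top-left pixel F(0, 0) a shade of green; (R, G, B) chosen for illustration.
color[0, 0] = (34, 139, 34)

print(gray.shape)    # (700, 700)
print(color.shape)   # (700, 700, 3)
print(color[0, 0])   # the three intensities of the top-left pixel

# Split the color image into its red, green, and blue matrices.
red, green, blue = color[..., 0], color[..., 1], color[..., 2]

# Rescale 0-255 integers to floats in the range 0 to 1.
color_scaled = color.astype(np.float32) / 255.0
```

Each channel is itself a 700 × 700 matrix, which is why a color image has one more dimension than a grayscale one.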

<div>
    <img src="images/cv3.png" width="300"/>
</div>
<div>
    <img src="images/cv17.png" width="300"/>
</div>

Fig.3:  `RGB` color combination formula


# 2. Image Preprocessing
Any adjustments that you need to apply to your dataset are part of preprocessing: normalization, standardization, one-hot encoding, and so on. The good news is that, unlike traditional machine learning, DL algorithms require minimal data preprocessing because neural networks do most of the heavy lifting in processing an image and extracting features.

- Image processing could involve simple tasks like image resizing. In order to feed a dataset of images to a convolutional network, the images all have to be the same size.
- Converting color images to grayscale to reduce computation complexity

- Standardizing images: images must be preprocessed and scaled to have identical widths and heights before being fed to the learning algorithm.

- Data augmentation: another common preprocessing technique involves augmenting the existing dataset with modified versions of the existing images. Scaling, rotations, and other affine transformations are typically used to enlarge your dataset and expose the neural network to a wide variety of variations of your images. This makes it more likely that your model will recognize objects when they appear in any form and shape.
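
The preprocessing steps listed above can be sketched with plain NumPy. This is only an illustration under stated assumptions: the image batch is random data, the helper names (`normalize`, `to_grayscale`, `resize_nearest`, `one_hot`, `flip_horizontal`) are hypothetical, and a real project would more likely rely on an image library for resizing and augmentation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical batch: 4 RGB images of 32 x 32 pixels with integer class labels.
images = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
labels = np.array([0, 2, 1, 2])

def normalize(batch):
    """Rescale 0-255 pixel values to floats in [0, 1]."""
    return batch.astype(np.float32) / 255.0

def to_grayscale(batch):
    """Weighted sum of the R, G, B channels (ITU-R BT.601 luma weights)."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return batch @ weights

def resize_nearest(image, height, width):
    """Nearest-neighbor resize by sampling rows and columns."""
    rows = np.arange(height) * image.shape[0] // height
    cols = np.arange(width) * image.shape[1] // width
    return image[rows][:, cols]

def one_hot(y, num_classes):
    """Turn integer labels into one-hot rows."""
    return np.eye(num_classes, dtype=np.float32)[y]

def flip_horizontal(batch):
    """A simple augmentation: mirror each image left to right."""
    return batch[:, :, ::-1, :]

x = normalize(images)
gray = to_grayscale(x)                          # shape (4, 32, 32)
small = resize_nearest(images[0], 16, 16)       # shape (16, 16, 3)
y = one_hot(labels, num_classes=3)              # shape (4, 3)
augmented = np.concatenate([x, flip_horizontal(x)], axis=0)  # dataset doubled
```

Flipping is only one of many possible augmentations; rotations and scaling follow the same idea of adding modified copies to the dataset.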

<div>
<img src="images/cnn16.png" width="600"/>
</div>


Fig.6. [source](https://www.kaggle.com/kedarsai/cifar-10-88-accuracy-using-keras)


# Building Neural Networks

### Let's start with building an Artificial Neural Network (ANN)
>In figure below, we can see an analogy between biological neurons and artificial systems. Both contain a main processing element, a `neuron`, with input signals $(x_1, x_2, ..., x_n)$ and an output (Mohammed 8).

<div>
<img src="images/cv15.png" width="400"/> 
</div>
   


Fig.ANN1: Single Neuron - Single layer Network (`Perceptron`)


<div>
<img src="images/cv16.png" width="600"/> 
</div>


Fig.ANN2: Multilayer neural network

Deep learning involves layers of neurons in a network, also called a `Multilayer perceptron`. <br>
Fig.7 (for both `ANN1` and `ANN2` above): source (Mohammed 9).

* **An ANN imitates how information is processed in the human brain. When many neurons (`perceptrons`, in an ANN) are stacked in layers and connected together, they form a multilayer neural network; learning with such networks is called `deep learning`.** 

# So, what is a `Perceptron`? 
* **Let's zoom in to a `Multilayer perceptron (MLP)` above.**

### The perceptron is the fundamental building block of `Neural Networks` in Deep Learning. 
- If we really want to know how a neural network works, we should first look closely at how a perceptron works.<br>
According to the [Wikipedia](https://en.wikipedia.org/wiki/Perceptron) definition, a `Perceptron` is an algorithm for learning a binary classifier called a `threshold function`: a function that maps its input $X$ (a real-valued vector) to an output value $f(X)$ (a single binary value):

$$
f(X) = \left\{
\begin{array}{ll}
      1 & \text{if} \quad W \cdot X + b > 0, \\
      0 & \text{otherwise}
\end{array} \right.
$$

where $W$ is a vector of real-valued weights, $W \cdot X$ is the dot product $\sum_{i=1}^{m}w_ix_i$, $m$ is the number of inputs to the perceptron, and $b$ is the bias. The bias shifts the decision boundary away from the origin and does not depend on any input value. <br>
The value of $f(X)$ ($0$ or $1$) is used to classify $X$ as either a positive or a negative instance, in the case of a binary classification problem.

In the context of neural networks, a perceptron is an `artificial neuron` using the Heaviside step function as the activation function. The perceptron algorithm is also termed the single-layer perceptron, to distinguish it from a multilayer perceptron, which is a misnomer for a more complicated neural network. As a linear classifier, the single-layer perceptron is the simplest feedforward neural network (source: [Wikipedia](https://en.wikipedia.org/wiki/Perceptron)).
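
As a small illustration of the threshold function defined above, here is a sketch in plain Python. The weights, bias, and inputs are illustrative assumptions, not values taken from the cited sources.

```python
def perceptron(x, w, b):
    """Return 1 if the weighted sum w . x + b is positive, otherwise 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

# Illustrative values (assumptions): two inputs, weights [3, -2], bias 1.
w, b = [3, -2], 1
print(perceptron([2, 1], w, b))    # 3*2 - 2*1 + 1 = 5 > 0, so the output is 1
print(perceptron([-1, 2], w, b))   # 3*(-1) - 2*2 + 1 = -6, so the output is 0
```

The `if z > 0` test plays the role of the Heaviside step activation.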

## Notation:
The dot product of two vectors `A` and `B`, <br>
$A = [a_1, a_2, ..., a_n]$ and $B = [b_1, b_2, ..., b_n]$, is given by:<br>
$$
A \cdot B = \sum_{i=1}^{n}a_i b_i
$$

which is simply the matrix product $A^T B$ when $A$ and $B$ are written as column vectors (source: [Dot product in matrix notation](https://mathinsight.org/dot_product_matrix_notation)).

Now, getting back to our perceptron concept, assume we have an input vector `X` and a weight vector `W`, where $X = [x_1, x_2]$ and $W = [w_1, w_2]$. For the sake of simplicity, let's assume $w_1 = 3$, $w_2 = -2$, and let $w_0 = 1$ be the bias weight.

>You might get the impression that neural networks only understand the most useful features, but that’s not entirely true. Neural networks scoop up all the features available and give them random weights. During the training process, the neural network adjusts these weights to reflect their importance and how they should impact the output prediction. The patterns with the highest appearance frequency will have higher weights and are considered more useful features. Features with the lowest weights will have very little impact on the output (Mohammed 32).

<div>
<img src="images/cv18.png" width="600"/> 
</div>

   Fig.8: Source from (Mohammed 40)

>In both artificial and biological neural networks, a neuron does not just output the bare input it receives. Instead, there is one more step, called an activation function; this is the decision-making unit of the brain. In ANNs, the activation function takes the same weighted sum input from before ($z = \sum x_i \cdot w_i + b$) and activates (fires) the neuron if the weighted sum is higher than a certain `threshold`. This activation happens based on the activation function calculations (Mohammed 42). <br>


### As we can see, a perceptron consists of 4 parts:
1. Input values or one input layer
1. Weights and bias
1. Net sum
1. Activation function

According to `Mohammed`, the perceptron's learning logic goes like this:<br>
>1. The neuron calculates the weighted sum and applies the activation function to make a prediction $\hat y$. This is called the `feedforward process`: 

$$\hat y = \text{activation}\left(\sum x_i \cdot w_i + b\right)$$ <br>

>2. It compares the output prediction with the correct label to calculate the error:<br> $\text{error} = y - \hat y$. <br>


>3. It then updates the weight. If the prediction is too high, it adjusts the weight to make a lower prediction the next time, and vice versa. <br>


>4. Repeat! <br>
>This process is repeated many times, and the neuron continues to update the weights to improve its predictions until step 2 produces a very small error (close to zero), which means the neuron’s prediction is very close to the correct value. At this point, we can stop the training and save the weight values that yielded the best results to apply to future cases where the outcome is unknown.
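
The four steps above can be sketched in a few lines of plain Python. The quoted text does not spell out the exact update formula, so this sketch assumes the classic perceptron rule (add `lr * error * x` to the weights); the AND dataset, the learning rate, and the function names are also assumptions made for illustration.

```python
def predict(x, w, b):
    """Step 1 (feedforward): weighted sum followed by a step activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

def train(samples, labels, lr=1.0, epochs=50):
    """Perceptron learning rule: nudge the weights by lr * error * input."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            error = y - predict(x, w, b)      # step 2: compare with the label
            if error != 0:
                mistakes += 1
                # step 3: raise or lower the weights depending on the error sign
                w = [wi + lr * error * xi for wi, xi in zip(w, x)]
                b += lr * error
        if mistakes == 0:                     # step 4: repeat until no errors remain
            break
    return w, b

# Toy dataset (an assumption for illustration): the logical AND of two inputs.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
w, b = train(samples, labels)
predictions = [predict(x, w, b) for x in samples]
print(predictions)
```

Because AND is linearly separable, the loop stops after a handful of passes; for data that is not linearly separable, a single perceptron would keep making mistakes until `epochs` runs out.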

## Activation Functions:
**a) Hidden layer activation functions:** <br> 

*Example 1*: <br>
*Sigmoid Activation Function*

$$
\sigma(z) = \frac{1}{1+e^{-z}}
$$


<div>
<img src="images/graph1.png" width="400"/> 
</div>

*Graph1. Sigmoid function (source: DSIR-111 presentation slide)*

*Example 2:* <br>
ReLU (Rectified Linear Unit): ReLU is often the preferred activation function for hidden layers. <br>

$$
\text{ReLU}(z) = \max(0, z)
$$

<div>
<img src="images/graph2.png" width="400"/> 
</div>

*Graph2. ReLU function (source: DSIR-111 presentation slide)*

**b) Output layer activation functions**

*Example*: <br>
 **Softmax**: for multiclass classification <br>
$$
 \sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}
$$

<div>
<img src="images/graph3.png" width="600"/> 
</div>

*Graph3. SoftMax function (source: DSIR-111 presentation slide)*

<div>
<img src="images/graph4.png" width="400"/> 
</div>

Graph4. Softmax function explained ([source](https://towardsdatascience.com/softmax-activation-function-explained-a7e1bc3ad60)).

According to Wikipedia, softmax applies the standard exponential function to each element $z_i$ of the input vector $z$ and normalizes these values by dividing by the sum of all these exponentials; this normalization ensures that the components of the output vector $\sigma(z)$ sum to 1 (source: [Softmax function](https://en.wikipedia.org/wiki/Softmax_function)).

>Activations in the hidden layer provide a transformation that allows the neural net to learn more complex relationships as calculations propagate through the network (Source: DSIR-111 Classnote).
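
The three activation functions in this section are short enough to write out directly. The following is a minimal NumPy sketch; the input vector is an arbitrary example, and subtracting the maximum inside `softmax` is a common numerical-stability trick rather than part of the formula itself.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Keep positive values and replace negative values with 0."""
    return np.maximum(0.0, z)

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged and avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])   # arbitrary example scores
print(sigmoid(z))
print(relu(z))
print(softmax(z), softmax(z).sum())
```

Sigmoid and ReLU act on each element independently, while softmax couples all elements through the shared denominator.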

# Loss Functions

**Loss** is a measure of a model's performance: the lower, the better. During training, the model aims to reach the lowest loss possible. We use `crossentropy` for classification problems ([source](https://towardsdatascience.com/cross-entropy-for-classification-d98e7f974451)).

## Examples:
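a) For Binary Classification: <br>

$$
LOSS = -\frac{1}{m}\sum_{i=1}^{m}\left(y_i \log(\hat y_i) + (1-y_i) \log(1 - \hat y_i)\right)
$$

b) For Multi-class Classification:


$$
LOSS = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_{ik} \log(\hat y_{ik})
$$

where $m$ is the number of samples, $K$ is the number of classes, $y_{ik}$ is 1 if sample $i$ belongs to class $k$ (and 0 otherwise), and $\hat y_{ik}$ is the predicted probability of class $k$ for sample $i$.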

Source: [Most Common Loss Functions in Machine Learning](https://towardsdatascience.com/most-common-loss-functions-in-machine-learning-c7212a99dae0)
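
Both cross-entropy losses can be checked numerically with a short NumPy sketch. The labels and predicted probabilities below are made-up values, the multi-class version assumes one-hot encoded labels, and the `eps` clipping is an added safeguard against `log(0)`.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Average binary cross-entropy over m samples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def categorical_cross_entropy(y, y_hat, eps=1e-12):
    """Average cross-entropy for one-hot labels y and predicted probabilities y_hat."""
    y_hat = np.clip(y_hat, eps, 1.0)
    return float(-np.mean(np.sum(y * np.log(y_hat), axis=1)))

# Made-up binary example: two samples.
y_bin = np.array([1.0, 0.0])
p_bin = np.array([0.9, 0.1])
bce = binary_cross_entropy(y_bin, p_bin)

# Made-up 3-class example: two samples with one-hot labels.
y_multi = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
p_multi = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1]])
cce = categorical_cross_entropy(y_multi, p_multi)

print(bce, cce)
```

In both cases, only the probability assigned to the correct class contributes to a sample's loss, so more confident correct predictions give a lower value.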

### Getting back to the perceptron: 
 **Note that the input layer is given a bias, `b`, by introducing an extra input node that always has the value 1.**

<div>
<img src="images/graph5.png" width="600"/> 
</div>

*Graph drawn by author*

Simplifying our example, let's assume $X = [x_1, x_2]$, $W = [3, -2]$, and bias $b = 1$.<br>
Then $\hat y = \sigma(b + X^TW) = \sigma(1 + 3x_1 - 2x_2)$.


<div>
<img src="images/eqn1.png" width="400"/> 
</div>

*Picture by author*

* Now, let's draw the equation of the hyperplane (a line in 2D):

<div>
<img src="images/graph6.png" width="400"/> 
</div>

*Graph drawn by author*

* The hyperplane (drawn as a broken line) corresponds to the decision line that the neural network uses to classify a given input from the ($x_1, x_2$) plane.

Example: Assume we have an input $X = [-1, 2]$. Substituting $x_1 = -1$ and $x_2 = 2$ into the previous equation gives $\sigma (1 + 3x_1 - 2x_2) = \sigma(-6)$. With `sigmoid` as the activation function, we have: 

$$
\sigma(-6) = \frac{1}{1+e^{-(-6)}}
 = \frac{1}{1+e^6} \approx 0.0025
$$

Now, since $\hat y = \sigma (1 + 3x_1 - 2x_2) \approx 0.0025 < 0.5$, the input $X$ is assigned to the negative class: it lies to the left of our linear classifier (the broken line in the $(x_1,x_2)$ plane above).
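
The arithmetic of this example is easy to verify in plain Python. The sketch below reuses the weights $[3, -2]$, the bias $1$, and the input $[-1, 2]$ from the text; the 0.5 cutoff for turning the sigmoid output into a class label is the usual convention and is assumed here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = [3, -2], 1          # weights and bias from the example
x = [-1, 2]                # input point

z = b + sum(wi * xi for wi, xi in zip(w, x))   # 1 + 3*(-1) - 2*2
y_hat = sigmoid(z)
label = 1 if y_hat >= 0.5 else 0               # assumed 0.5 decision threshold

print(z, round(y_hat, 4), label)               # -6 0.0025 0
```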


<div>
<img src="images/graph1.png" width="600"/> 
</div>

*Graph1. Sigmoid function (source from DSIR-111 presentation slide)*

* With this basic concept of the perceptron, we can intuitively see that a multilayer perceptron is just a stack of single-layer perceptrons, and hence an ANN.

<div>
<img src="images/cv20.png" width="600"/> 
</div>
    

*Fig.9: source: [Multi-Layer Neural Networks with Sigmoid Function](https://towardsdatascience.com/multi-layer-neural-networks-with-sigmoid-function-deep-learning-for-rookies-2-bf464f09eb7f)*

- Now, if we want to define a multi-output NN, we can simply add another perceptron to the picture above, so that instead of one perceptron we have two, and so on. Here is an example of a multi-output perceptron. Note that the perceptrons are stacked and there are two outputs ([source](https://vitalflux.com/how-do-we-build-deep-neural-network-using-perceptron/)).

<div>
<img src="images/cv21.png" width="600"/> 
</div>

<div>
<img src="images/cv22.png" width="600"/> 
</div>

*Fig.10 Source: [Data Analytics](https://vitalflux.com/how-do-we-build-deep-neural-network-using-perceptron/).*
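
The stacking idea described above maps naturally onto matrix multiplication: each column of a weight matrix holds the weights of one perceptron. Here is a minimal NumPy sketch of a forward pass; the layer sizes, the random weights, and the `dense` helper are assumptions for illustration, and no training is performed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, W, b):
    """One layer of perceptrons: each column of W holds one neuron's weights."""
    return sigmoid(x @ W + b)

rng = np.random.default_rng(0)
x = np.array([-1.0, 2.0])                    # two input features

# A single perceptron is a layer with one column: weights [3, -2], bias 1.
single = dense(x, np.array([[3.0], [-2.0]]), np.array([1.0]))

# Hidden layer with 3 neurons and output layer with 2 neurons (sizes are assumptions).
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)

hidden = dense(x, W1, b1)                    # 3 activations
outputs = dense(hidden, W2, b2)              # 2 outputs, one per output perceptron
print(single, hidden.shape, outputs.shape)
```

Adding another output perceptron only means adding one more column to `W2` and one more entry to `b2`.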

* Activation and loss functions are covered in the `Activation Functions` and `Loss Functions` sections above.

# Feature Extraction

The entire DL model works around the idea of extracting useful features that clearly define the objects in the image. Machine learning models are only as good as the features you provide. That means coming up with good features is an important job in building ML models.

>`DEFINITION`: <br>
A feature in machine learning is an individual measurable property or characteristic of an observed phenomenon. Features are the input that you feed to your ML model to output a prediction or classification. Suppose you want to predict the price of a house: your input features (properties) might include `square_foot`, `number_of_rooms`, `bathrooms`, and `so on`, and the model will output the predicted price based on the values of your features. Selecting good features that clearly distinguish your objects increases the predictive power of ML algorithms. <br> - In Computer Vision, a feature is a measurable piece of data in your image that is unique to that specific object. It may be a distinct color or a specific shape such as a line, edge, or image segment. A good feature is used to distinguish objects from one another (Mohammed).

**FEATURE GENERALIZABILITY**: A very important characteristic of a feature is repeatability. But what makes a good feature for object recognition? 
* Identifiable

* Easily tracked and compared

* Consistent across different scales, lighting conditions, and viewing angles

* Still visible in noisy images or when only part of an object is visible

## Extracting features
I would like to start with an example from the book `Deep Learning for Vision Systems` by Mohammed. <br> 
Suppose we have a database of U.S. presidents and we want to build a classification pipeline that tells us which president an image shows. We feed the image on the left-hand side of the figure below to our model, and we want it to output the probability that the image shows each of the presidents in the dataset.
To classify these images correctly, though, our pipeline needs to be able to tell what is actually unique about a picture of Abraham Lincoln versus a picture of any other president, such as George Washington, Jefferson, or Obama.

* Remember, **Features make pictures unique**. <br>
Let's identify high-level key features in the human, auto, and house image categories: 



<div>
<img src="images/cv6.png" width="600"/>
</div>

*Fig.12: Source from [Convolutional Neural Networks](http://introtodeeplearning.com)*

<div>
<img src="images/cv5.png" width="600"/>
</div>

*Fig.11: Source from [Convolutional Neural Networks](http://introtodeeplearning.com)*

- This way computers classify images by assigning the corresponding probabilities based on features of pictures.

## Convolution Layers
Now, suppose each feature is like a mini image; it's a patch. It's also a small 2D array of values and we'll use `filters` to pick up on the features.

>Convolution Layer:<br>
The convolution layer is where we pass a filter over an image and do some calculation at each step. Specifically, we take pixels that are close to one another, then summarize them with one number. The goal of the convolution layer is to identify important features in our images, like edges.
Source: [Here](https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/)


Let's apply a $3 \times 3$ `edge-detection` filter, which amplifies edges, to the image below:

$$\begin{bmatrix} 0 & -1 & 0 \\ -1 & 4 & -1 \\ 0 & -1 & 0\end{bmatrix}$$ 

When this `kernel` is convolved with the input image, say $F(x,y)$, it creates a new convolved image (a feature map) that amplifies the edges (see the first image below). Zooming in, the second image shows a small piece of the image and how the convolution operation is applied to get the new pixel value.

![](images/cv7.png)

![](images/cv8.png)

`Fig.conv: Applying Filter - source from (Mohammed 109-110)`

>Other filters can be applied to detect different types of features. For example, some filters detect `horizontal edges`, others detect `vertical edges`, still others detect more complex shapes like corners, and so on. The point is that these filters, when applied in the convolutional layers, yield feature-learning behavior: first they learn simple features like edges and straight lines, and later layers learn more complex features.
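
To see the edge-detection kernel in action, here is a minimal NumPy sketch that slides a $3 \times 3$ filter over a small image. The two 5 × 5 test images are synthetic, the `convolve2d` helper is a hypothetical name, and the sketch uses no padding and a stride of 1. Like most CNN libraries it does not flip the kernel; for this symmetric kernel the result is the same either way.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Multiply element-wise and summarize the patch with one number.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

edge_kernel = np.array([[0, -1, 0],
                        [-1, 4, -1],
                        [0, -1, 0]])

# Synthetic images: a flat gray patch and a patch with a vertical edge.
flat = np.full((5, 5), 10.0)
step = np.zeros((5, 5))
step[:, 3:] = 10.0

print(convolve2d(flat, edge_kernel))   # all zeros: no edges in a flat region
print(convolve2d(step, edge_kernel))   # nonzero only next to the edge
```

The kernel's entries sum to zero, which is why uniform regions produce no response and only intensity changes show up in the feature map.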

Here are the three elements that enter into the convolution operation:

* Input image
* Feature detector, also called `kernel` or `filter` (the terms are used interchangeably)
* Feature map


<div>
<img src="images/cv9.png" width="500"/>
</div>



Fig.15: [source](https://www.superdatascience.com/blogs/the-ultimate-guide-to-convolutional-neural-networks-cnn)

The example we gave above is a very simplified one, though. In reality, convolutional neural networks develop multiple feature detectors and use them to develop several feature maps which are referred to as convolutional layers.
Through training, the network determines what features it finds important in order for it to be able to scan images and categorize them more accurately.


![](https://media3.giphy.com/media/i4NjAwytgIRDW/200.webp?cid=ecf05e471vftp51bx55s3lbh1el698xc1bv7l7rhy0igcpz3&rid=200.webp&ct=g)

### For a 3D array convolution 


<div>
<img src="img/conv.gif" width="500"/>
</div>

Source for the two `gif` images: [A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)


<div>
<img src="img/cnn15.png" width="500"/> 
</div>

Fig.16: [Source](https://pylessons.com/Logistic-Regression-part2)

- Putting all together:<br>
The term **`convolution`** refers to the mathematical combination of two functions to produce a third function. It merges two sets of information. In the case of a CNN, the convolution is performed on the input data with the use of a filter or kernel (these terms are used interchangeably) to then produce a feature map.

>We perform a series of `convolution + pooling operations, followed by a number of fully connected layers`. If we are performing multiclass classification, the output is softmax. [fig.*source*](https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2) <br>
 

<div>
<img src="images/architecture.png" width="500"/> 
</div>

<div>
<img src="images/CNN_architecture.png" width="500"/> 
</div>

<div>
<img src="images/CNN_from_Scratch.png" width="500"/> 
</div>

*Fig.17: CNN Architecture ([Source](https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html)).*


## Pooling Layer

>It is common to periodically insert a Pooling layer in-between successive Conv layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting. The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation. The most common form is a pooling layer with filters of size 2x2.

    

<div>
<img src="images/cv14.png" width="500"/> 
</div>
    

Fig.18 [Source](https://cs231n.github.io/convolutional-networks/#conv/)

>Pooling layer downsamples the volume spatially, independently in each depth slice of the input volume. Left: In this example, the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into output volume of size [112x112x64]. Notice that the volume depth is preserved. Right: The most common downsampling operation is max, giving rise to max pooling, here shown with a stride of 2. That is, each max is taken over 4 numbers (little 2x2 square) (Source: [Convolutional Neural Networks for Visual recognition](https://cs231n.github.io/convolutional-networks/#conv/)).
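
Max pooling with a 2x2 filter and stride 2 can be sketched in NumPy with a reshape. The `max_pool_2x2` helper is a hypothetical name, it assumes an input of shape (height, width, depth) with even height and width, and the 4 × 4 input values are chosen only for illustration.

```python
import numpy as np

def max_pool_2x2(volume):
    """2x2 max pooling with stride 2 on an array of shape (H, W, C)."""
    h, w, c = volume.shape
    assert h % 2 == 0 and w % 2 == 0, "this sketch assumes even height and width"
    # Group pixels into non-overlapping 2x2 blocks, then take the max of each block.
    blocks = volume.reshape(h // 2, 2, w // 2, 2, c)
    return blocks.max(axis=(1, 3))

# A single 4x4 depth slice (illustrative values).
single = np.array([[1, 1, 2, 4],
                   [5, 6, 7, 8],
                   [3, 2, 1, 0],
                   [1, 2, 3, 4]])[..., np.newaxis]
pooled = max_pool_2x2(single)[..., 0]
print(pooled)                            # [[6 8]
                                         #  [3 4]]

# The depth is preserved while height and width are halved.
volume = np.zeros((224, 224, 64))
print(max_pool_2x2(volume).shape)        # (112, 112, 64)
```

Each output value is the maximum over 4 input numbers, so the layer has no parameters to learn.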

* We'll get back to the [code notebook](https://github.com/sthirpa/Data_Scince_Immersive-at-General-Assembly-/blob/Hirpa/CIFAR-10-SH.ipynb) for the implementation of this theory