### Getting back to the perceptron: 
**Note that the input layer is given a bias, `b`, by introducing an extra input node that always has a value of 1.**

<div>
<img src="images/graph5.png" width="600"/> 
</div>

*Graph drawn by author*

Simplifying our example, let's assume $X = [x_1, x_2]$, $W = [3, -2]$, and $b = 1$.<br>
Then our $\hat y = \sigma(b + X^TW)$.
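A minimal NumPy sketch of the bias note above: prepending a constant input of 1 to $X$ and the bias $b$ to $W$ gives exactly the same pre-activation $b + X^TW$. The helper names and the test input $X = [-1, 2]$ (borrowed from the example below) are illustrative assumptions, not code from the original notebook.

```python
import numpy as np

W = np.array([3.0, -2.0])  # weights from the example
b = 1.0                    # bias (value taken from the worked example)

def pre_activation(x, w, bias):
    """Compute z = b + x^T w directly (hypothetical helper)."""
    return bias + x @ w

def pre_activation_augmented(x, w, bias):
    """Same z, treating the bias as the weight of an extra input fixed at 1."""
    x_aug = np.concatenate(([1.0], x))    # [1, x1, x2]
    w_aug = np.concatenate(([bias], w))   # [b, w1, w2]
    return x_aug @ w_aug

x = np.array([-1.0, 2.0])
print(pre_activation(x, W, b))            # -6.0
print(pre_activation_augmented(x, W, b))  # -6.0
```

Folding the bias into the weight vector this way is why the diagram can draw $b$ as just another input node.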


<div>
<img src="images/eqn1.png" width="400"/> 
</div>

*Picture by author*

* Now, let's draw the hyperplane $b + X^TW = 0$, i.e. $1 + 3x_1 - 2x_2 = 0$ (a line in 2D):

<div>
<img src="images/graph6.png" width="400"/> 
</div>

*Graph drawn by author*

* The hyperplane (drawn as a broken line) is the decision boundary the neural network uses to classify a given input from the $(x_1, x_2)$ plane.

Example: assume we have the input $X = [-1, 2]$ and that our activation function is `sigmoid`. Substituting $x_1 = -1$ and $x_2 = 2$ into the previous equation gives $\sigma(1 + 3x_1 - 2x_2) = \sigma(1 - 3 - 4) = \sigma(-6)$, so:

$$
\sigma(z) = \frac{1}{1+e^{-z}} \quad\Rightarrow\quad \sigma(-6) = \frac{1}{1+e^{6}} \approx 0.0025
$$

Since $\hat y = \sigma(1 + 3x_1 - 2x_2) \approx 0.0025 < 0.5$, thresholding the `sigmoid` output at 0.5 assigns our $X$ to the left side of our linear classifier (the broken line in the $(x_1, x_2)$ plane above).
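The worked example can be reproduced in a few lines. This is a sketch assuming NumPy and the 0.5 decision threshold used above; `perceptron_predict` is a hypothetical helper, not a library function.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_predict(x, w, b, threshold=0.5):
    """Return (probability, class label) for a single sigmoid perceptron."""
    y_hat = sigmoid(b + x @ w)
    return y_hat, int(y_hat >= threshold)

w = np.array([3.0, -2.0])
b = 1.0
x = np.array([-1.0, 2.0])

y_hat, label = perceptron_predict(x, w, b)
print(round(y_hat, 4), label)  # 0.0025 0
```

A label of 0 corresponds to the side of the broken line where $1 + 3x_1 - 2x_2 < 0$.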


<div>
<img src="images/graph1.png" width="600"/> 
</div>

*Graph1. Sigmoid function (source from DSIR-111 presentation slide)*

* With this basic concept of the perceptron, we can see intuitively that a multilayer perceptron is just a stack of single-layer perceptrons, and stacking them is what gives us the ANN.
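To make the stacking idea concrete, here is a sketch of a two-layer forward pass, assuming NumPy. The weights are made-up values for illustration only (the first hidden perceptron reuses $W = [3, -2]$, $b = 1$ from the example); in a real network they are learned during training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, W, b):
    """One layer of sigmoid perceptrons: each column of W is one perceptron."""
    return sigmoid(x @ W + b)

# Illustrative (made-up) weights: 2 inputs -> 3 hidden perceptrons -> 1 output
W1 = np.array([[3.0, -1.0, 0.5],
               [-2.0, 2.0, 1.0]])      # shape (2, 3)
b1 = np.array([1.0, 0.0, -0.5])        # shape (3,)
W2 = np.array([[1.0], [-1.5], [2.0]])  # shape (3, 1)
b2 = np.array([0.2])                   # shape (1,)

x = np.array([-1.0, 2.0])
h = dense(x, W1, b1)      # outputs of the first layer become inputs to the next
y_hat = dense(h, W2, b2)  # final network output
print(h.shape, y_hat.shape)  # (3,) (1,)
```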

<div>
<img src="images/cv20.png" width="600"/> 
</div>
    

*Fig.9: source: [Multi-Layer Neural Networks with Sigmoid Function](https://towardsdatascience.com/multi-layer-neural-networks-with-sigmoid-function-deep-learning-for-rookies-2-bf464f09eb7f)*

- Now, if we want to define a multi-output NN, we can simply add another perceptron to the picture above: instead of one perceptron we have two, and so on. Here is an example of a multi-output perceptron. Note that the perceptrons are stacked and there are two outputs ([source](https://vitalflux.com/how-do-we-build-deep-neural-network-using-perceptron/)).
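A sketch (assuming NumPy, with illustrative weights) showing that adding a second perceptron on the same input is equivalent to stacking the two weight vectors as columns of one weight matrix:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-1.0, 2.0])

# Two perceptrons sharing the same input (illustrative weights)
w_a, b_a = np.array([3.0, -2.0]), 1.0
w_b, b_b = np.array([-1.0, 0.5]), 0.0

# Computing each perceptron separately...
y_separate = np.array([sigmoid(b_a + x @ w_a), sigmoid(b_b + x @ w_b)])

# ...is the same as stacking their weights as columns of one matrix
W = np.column_stack([w_a, w_b])  # shape (2, 2): one column per output
b = np.array([b_a, b_b])
y_stacked = sigmoid(x @ W + b)

print(np.allclose(y_separate, y_stacked))  # True
```

Writing the layer as one matrix product lets all outputs be computed at once instead of looping over perceptrons.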

<div>
<img src="images/cv21.png" width="600"/> 
</div>

<div>
<img src="images/cv22.png" width="600"/> 
</div>

*Fig.10 Source: [Data Analytics](https://vitalflux.com/how-do-we-build-deep-neural-network-using-perceptron/).*

* Activation and Loss Functions are discussed in the next subtopic.

# Feature Extraction

The entire DL model works around the idea of extracting useful features that clearly define the objects in the image. Machine learning models are only as good as the features you provide. That means coming up with good features is an important job in building ML models.

>`DEFINITION`: <br>
A feature in machine learning is an individual measurable property or characteristic of an observed phenomenon. Features are the input you feed to your ML model so that it outputs a prediction or classification. Suppose you want to predict the price of a house: your input features (properties) might include `square_foot`, `number_of_rooms`, `bathrooms`, and so on, and the model will output the predicted price based on the values of those features. Selecting good features that clearly distinguish your objects increases the predictive power of ML algorithms. <br> - In computer vision, a feature is a measurable piece of data in your image that is unique to a specific object. It may be a distinct color or a specific shape such as a line, edge, or image segment. A good feature is used to distinguish objects from one another (Mohammed).

**FEATURE GENERALIZABILITY**: A very important characteristic of a feature is repeatability. But what makes a good feature for object recognition? A good feature is:
* Identifiable

* Easily tracked and compared

* Consistent across different scales, lighting conditions, and viewing angles

* Still visible in noisy images or when only part of an object is visible

## Extracting features
I would like to start with an example from the book `Deep Learning for Vision Systems` by Mohammed. <br> 
Suppose we have a database of U.S. presidents and we want to build a classification pipeline that tells us which president an image shows. We feed the image on the left-hand side (`fig.7` below) to our model, and we want it to output the probability that the image shows each of the presidents in the dataset.
To classify these images correctly, though, our pipeline needs to be able to tell what is actually unique about a picture of Abraham Lincoln versus a picture of any other president, such as George Washington, Jefferson, or Obama.

* Remember, **Features make pictures unique**. <br>
Let's identify high-level key features in the human, auto, and house image categories: 



<div>
<img src="images/cv6.png" width="600"/>
</div>

*Fig.12: Source from [Convolutional Neural Networks](http://introtodeeplearning.com)*

<div>
<img src="images/cv5.png" width="600"/>
</div>

*Fig.11: Source from [Convolutional Neural Networks](http://introtodeeplearning.com)*

- In this way, computers classify images by assigning class probabilities based on the features of the pictures.

## Convolution Layers
Now, suppose each feature is like a mini image, a patch: a small 2D array of values. We'll use `filters` to pick up on these features.

>Convolution Layer:<br>
The convolution layer is where we pass a filter over an image and do some calculation at each step. Specifically, we take pixels that are close to one another, then summarize them with one number. The goal of the convolution layer is to identify important features in our images, like edges.
Source: [Here](https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/)


Let's apply a $3 \times 3$ `edge-detection` filter, which amplifies edges, to the image below. When this

$$\begin{bmatrix} 0 & -1 & 0 \\ -1 & 4 & -1 \\ 0 & -1 & 0\end{bmatrix}$$ 

`kernel` is convolved with the input image, say $F(x,y)$, it creates a new convolved image (a feature map) that amplifies the edges (see `Fig.9` below). Zooming in, we see `Fig.10`, where a small piece of the image shows how the convolution operation is applied to get each new pixel value.

![](images/cv7.png)

![](images/cv8.png)

`Fig.conv: Applying Filter - source from (Mohammed 109-110)`
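The sliding-window computation in the figures can be sketched directly. This assumes NumPy, no padding, and a stride of 1; the toy image is made-up data (dark in columns 0-1, bright in columns 2-4) so that the only response comes from the vertical edge. `conv2d_valid` is a hypothetical helper name.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the
    element-wise products at each position. Like most CNN libraries this is
    technically cross-correlation; for this symmetric kernel it is identical
    to true convolution."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

edge_kernel = np.array([[ 0, -1,  0],
                        [-1,  4, -1],
                        [ 0, -1,  0]], dtype=float)

# Toy 5x5 "image" with a vertical edge between columns 1 and 2 (made-up data)
image = np.zeros((5, 5))
image[:, 2:] = 1.0

feature_map = conv2d_valid(image, edge_kernel)
print(feature_map.shape)  # (3, 3); every row is [-1, 1, 0]
```

The pixels on either side of the edge produce nonzero values, while the pixel inside the flat bright region produces 0, which is the edge-amplifying behavior described above.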

>Other filters can be applied to detect different types of features. For example, some filters detect `horizontal edges`, others detect `vertical edges`, still others detect more complex shapes like corners, and so on. The point is that these filters, when applied in the convolutional layers, yield feature-learning behavior: first they learn simple features like edges and straight lines, and later layers learn more complex features.

Here are the three elements that enter into the convolution operation:

* Input image
* Feature detector (also called a `kernel` or `filter`; the terms are used interchangeably)
* Feature map


<div>
<img src="images/cv9.png" width="500"/>
</div>



Fig.15: [source](https://www.superdatascience.com/blogs/the-ultimate-guide-to-convolutional-neural-networks-cnn)

The example above is a very simplified one, though. In reality, a convolutional neural network uses multiple feature detectors and applies them to produce several feature maps, which together form the output of a convolutional layer.
Through training, the network learns which features are important for scanning images and categorizing them accurately.
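In code, using several detectors means running the same sliding-window operation once per filter and stacking the results along a new depth axis. A sketch assuming NumPy >= 1.20 (for `sliding_window_view`); the two hand-picked vertical/horizontal edge filters stand in for filters a real network would learn, and the toy image is made-up data.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid(image, kernel):
    """Valid (no padding, stride 1) sliding-window sum of products."""
    windows = sliding_window_view(image, kernel.shape)  # (out_h, out_w, kh, kw)
    return np.einsum("ijkl,kl->ij", windows, kernel)

# Hand-picked stand-ins for learned filters
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)
horizontal_edge = vertical_edge.T

# Toy 5x5 image: dark in columns 0-1, bright in columns 2-4
image = np.zeros((5, 5))
image[:, 2:] = 1.0

filters = [vertical_edge, horizontal_edge]
# One feature map per filter, stacked along a new last (depth) axis
feature_maps = np.stack([conv2d_valid(image, k) for k in filters], axis=-1)
print(feature_maps.shape)  # (3, 3, 2)
```

Only the vertical-edge map responds here, because the toy image contains no horizontal edges.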


![](https://media3.giphy.com/media/i4NjAwytgIRDW/200.webp?cid=ecf05e471vftp51bx55s3lbh1el698xc1bv7l7rhy0igcpz3&rid=200.webp&ct=g)

### For a 3D array convolution 


<div>
<img src="img/conv.gif" width="500"/>
</div>

Source for the two `gif` images: [A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)
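For a 3D input such as an RGB image (height × width × 3 channels), the filter has the same depth as the input, and each sliding position still produces a single number: the products are summed over height, width, and all channels. A sketch assuming NumPy, random toy data, and a hypothetical helper `conv3d_single_filter`:

```python
import numpy as np

def conv3d_single_filter(volume, kernel, bias=0.0):
    """Convolve an (H, W, C) input with one (kh, kw, C) filter.
    The filter spans all C channels, so each position yields ONE number:
    the sum of element-wise products over height, width, and depth, plus a bias."""
    H, W, C = volume.shape
    kh, kw, kc = kernel.shape
    if kc != C:
        raise ValueError("filter depth must match input depth")
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(volume[i:i + kh, j:j + kw, :] * kernel) + bias
    return out

rng = np.random.default_rng(0)
rgb = rng.random((5, 5, 3))               # toy 5x5 RGB input (random data)
kernel = rng.standard_normal((3, 3, 3))   # one 3x3x3 filter (random stand-in for learned weights)
fmap = conv3d_single_filter(rgb, kernel, bias=1.0)
print(fmap.shape)  # (3, 3)
```

Applying $k$ such filters and stacking their outputs gives a feature-map volume of depth $k$.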


<div>
<img src="img/cnn15.png" width="500"/> 
</div>

Fig.16: [Source](https://pylessons.com/Logistic-Regression-part2)

- Putting it all together:<br>
The term **`convolution`** refers to the mathematical combination of two functions to produce a third function. It merges two sets of information. In the case of a CNN, the convolution is performed on the input data with the use of a filter or kernel (these terms are used interchangeably) to then produce a feature map.
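For reference, the first line below is the textbook definition of the convolution of two functions $f$ and $g$. The second line is the discrete 2D form commonly used in CNN layers, written for an input image $F(x,y)$ and a kernel $K$ producing a feature map $G$ ($K$ and $G$ are notation introduced here). Strictly speaking, the second form is cross-correlation: true convolution would use $F(x - m, y - n)$, i.e. a flipped kernel. Because the kernel weights are learned, the distinction does not matter in practice, and for a symmetric kernel like the edge detector above the two coincide.

$$
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau
$$

$$
G(x, y) = \sum_{m}\sum_{n} F(x + m,\, y + n)\, K(m, n)
$$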

>We perform a series of `convolution + pooling operations, followed by a number of fully connected layers`. If we are performing multiclass classification, the output layer uses softmax. [fig.*source*](https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2) <br>
 

<div>
<img src="images/architecture.png" width="500"/> 
</div>

<div>
<img src="images/CNN_architecture.png" width="500"/> 
</div>

<div>
<img src="images/CNN_from_Scratch.png" width="500"/> 
</div>

*Fig.17: CNN Architecture ([Source](https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html)).*
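The last step of this pipeline, the softmax output for multiclass classification, converts the final layer's raw scores into class probabilities. A minimal sketch assuming NumPy; the three scores are made-up numbers, not outputs of a trained model.

```python
import numpy as np

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1.
    Subtracting the max first avoids overflow in exp without changing the result."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical scores from the final fully connected layer for 3 classes
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs.round(3))         # [0.659 0.242 0.099]
print(int(np.argmax(probs)))  # 0
```

The predicted class is the one with the highest probability, which is the same class that had the highest raw score.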


## Pooling Layer

>It is common to periodically insert a Pooling layer in-between successive Conv layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting. The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation. The most common form is a pooling layer with filters of size 2x2.

    

<div>
<img src="images/cv14.png" width="500"/> 
</div>
    

Fig.18 [Source](https://cs231n.github.io/convolutional-networks/#conv/)

>Pooling layer downsamples the volume spatially, independently in each depth slice of the input volume. Left: In this example, the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into output volume of size [112x112x64]. Notice that the volume depth is preserved. Right: The most common downsampling operation is max, giving rise to max pooling, here shown with a stride of 2. That is, each max is taken over 4 numbers (little 2x2 square) (Source: [Convolutional Neural Networks for Visual recognition](https://cs231n.github.io/convolutional-networks/#conv/)).
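The 2x2, stride-2 max pooling described above can be sketched with a reshape: split each depth slice into non-overlapping 2x2 blocks and keep the maximum of each. This assumes NumPy, a channels-last $(H, W, D)$ layout, and even $H$ and $W$; `max_pool_2x2` is a hypothetical helper name, and the 4x4 slice is made-up data.

```python
import numpy as np

def max_pool_2x2(volume):
    """2x2 max pooling with stride 2 on an (H, W, D) volume.
    Each depth slice is pooled independently, so depth D is preserved.
    Assumes H and W are even."""
    H, W, D = volume.shape
    # Group pixels into non-overlapping 2x2 blocks, then take the max of each block
    blocks = volume.reshape(H // 2, 2, W // 2, 2, D)
    return blocks.max(axis=(1, 3))

x = np.random.default_rng(0).random((224, 224, 64))
print(max_pool_2x2(x).shape)  # (112, 112, 64)

# A single 4x4 depth slice: each output is the max of one 2x2 square
slice_ = np.array([[1, 1, 2, 4],
                   [5, 6, 7, 8],
                   [3, 2, 1, 0],
                   [1, 2, 3, 4]], dtype=float)[:, :, None]
pooled = max_pool_2x2(slice_)[:, :, 0]  # values: [[6, 8], [3, 4]]
```

Because the max is taken per depth slice, the depth (64 here) is unchanged, matching the [224x224x64] to [112x112x64] example.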

* We'll get back to the [code notebook](https://github.com/sthirpa/Data_Scince_Immersive-at-General-Assembly-/blob/Hirpa/CIFAR-10-SH.ipynb) for the implementation of this theory