# Introduction to Deep Neural Networks

## 1. Binary Classification

### 1.1. The Problem

Consider a basic image classification problem in which we train a supervised machine learning model to identify whether an image contains a dog. We are given a set of labelled training images: an image is labelled 1 if a dog is present and 0 if it is absent.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/1.png" width=600 align="center">

### 1.2. Notations

So our target variable, y, takes one of two values: 0 or 1. Now suppose each image is 64 by 64 pixels in size. Since red, green, and blue are the basic colors, and other colors are generated by combining them, each pixel carries three values, giving 64×64×3 pixel values per image. We store all of this information in a 12288 by 1 vector *x* (64×64×3 = 12288). Say we have m such pictures.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/2.png" width=600 align="center">
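As a small illustration of the notation, the sketch below flattens one image into the column vector *x*. The random array is a hypothetical stand-in for real pixel data, and NumPy is an assumed choice of library.

```python
import numpy as np

# Hypothetical stand-in for one 64x64 RGB image; real pixels would be loaded from a file.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3))

# Unroll height x width x channels into a single column vector x.
x = image.reshape(-1, 1)

print(x.shape)  # (12288, 1)
```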

### 1.3. Linear vs Logistic Regression

Linear regression does not work well for classification problems because the target is categorical. In addition, the values predicted by linear regression may fall outside the range of possible values. Logistic regression, on the other hand, works much better for binary classification. It is built for predicting categorical outcomes, and its output always lies between 0 and 1.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/4.png" width=600 align="center">

### 1.4. Approach: Logistic Regression

At the end of the training process we want a model f(*x*) that maps an *x* not used for training to an appropriate y_hat. With logistic regression, that function is σ(*w.T* *x* + b). Just as linear regression requires us to determine a slope and an intercept, logistic regression requires us to determine *w* and b. The sigmoid function (s-curve) is defined as σ(z) = 1/(1+*e*^(-z)). Note that *w* must be a vector of the same dimension as *x*. Multiplying *w.T* by *x* gives a real number, to which we add b. Finally, we take the sigmoid of the resulting number.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/3.png" width=600 align="center">
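The mapping from *x* to y_hat can be sketched in a few lines. The function names `sigmoid` and `predict_proba` and the all-zero starting values are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic (s-curve) function 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Return y_hat = sigmoid(w.T x + b) for a single column vector x."""
    z = (w.T @ x).item() + b  # w.T @ x is a 1x1 array; .item() extracts the number
    return sigmoid(z)

n = 12288                 # same dimension as the flattened image
w = np.zeros((n, 1))      # hypothetical untrained weights
b = 0.0
x = np.ones((n, 1))       # hypothetical input

y_hat = predict_proba(w, b, x)
print(y_hat)  # 0.5, because z = 0 when w and b are zero
```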

### 1.5. Cost Function

We want to determine *w* and b such that, for most inputs *x*, the predicted value y_hat matches the actual value *y*. To do that, we need a cost/loss function whose value we minimize. *L(y_hat, y)* measures the cost for a single data point, and *J(w,b)* is the average of those costs over the m data points. Many functions of y_hat and *y* could serve as *L*, and we can define our own as well. It can be as simple as |y_hat-y|. For machine learning algorithms, however, we use a cost function for which the average cost, plotted against the estimated parameters (here, *w* and b), has a global minimum. For logistic regression, binary cross entropy (shown below) is used.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/5.png" width=600 align="center">
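A minimal sketch of binary cross entropy and the average cost follows, using the standard form L = -(y log(y_hat) + (1-y) log(1-y_hat)). The clipping constant `eps` and the sample values are assumptions added to keep the example numerically safe and concrete.

```python
import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Loss L(y_hat, y) for each data point."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def average_cost(y_hat, y):
    """J: the mean of the per-example losses over m data points."""
    return float(np.mean(binary_cross_entropy(y_hat, y)))

y = np.array([1, 0, 1, 0])               # hypothetical labels
y_hat = np.array([0.9, 0.1, 0.2, 0.8])   # hypothetical predictions

losses = binary_cross_entropy(y_hat, y)
print(losses)                 # confident correct predictions cost little, wrong ones cost a lot
print(average_cost(y_hat, y))
```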

### 1.6. Gradient Descent Algorithm

As noted earlier, we use a cost function for which the average cost, plotted against the estimated parameters, has a global minimum (a bowl shape). The idea is to estimate the parameters *w* and *b* such that the average cost is at its minimum. Although *w* has a very high dimension, we cannot plot that many dimensions, so imagine it is one-dimensional. Notice the convex curve below, obtained by plotting the average cost against w.

To find a minimum or maximum, the mathematical procedure is to calculate the first derivative of *J(w)* with respect to w and set it to zero. However, since w is a high-dimensional vector, this becomes difficult to solve analytically. Instead of trying to calculate the exact value of w, we can estimate it. The estimate may not be exact, but it is good enough for the task at hand and costs less to compute. The algorithm used to estimate the parameter values that minimize the average cost is known as the gradient descent algorithm. Its maximization counterpart is sometimes described as hill climbing.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/6.png" width=600 align="center">

In the figure below, the gradient is negative to the left of the minimum point and positive to the right of it. We start with a random value of *w* and use it to calculate *J(w)*. Next, we calculate the first derivative at that point. If the slope is negative, we are to the left of the minimum and need a higher *w*; if it is positive, we are to the right and need a lower *w*.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/7.png" width=600 align="center">

The process is repeated until convergence. Note that w is not increased or decreased arbitrarily. At each step, α times the first derivative of *J* at the current point is subtracted from w. The value α is known as the learning rate, and its appropriate value depends on the problem at hand. We use hyperparameter tuning along with domain knowledge to find an appropriate α.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/8.png" width=600 align="center">

Note that α should be neither too high nor too low. If α is too high, we may keep overshooting the minimum point; if α is too low, the process may take a long time to converge. For a high-dimensional vector *w*, either case leads to high computational cost.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/9.png" width=600 align="center">
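To make the update rule concrete, here is a sketch of gradient descent on a toy one-dimensional cost J(w) = (w - 3)^2, whose minimum is at w = 3. The toy cost, the starting point, and the two learning rates are assumptions chosen for illustration; they are not taken from the figures.

```python
def gradient_descent(alpha, w=0.0, steps=50):
    """Minimize the toy cost J(w) = (w - 3)^2 by repeated updates w := w - alpha * dJ/dw."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # first derivative of J at the current w
        w = w - alpha * grad    # negative slope raises w, positive slope lowers it
    return w

w_good = gradient_descent(alpha=0.1)            # settles close to the minimum at 3
w_bad = gradient_descent(alpha=1.1, steps=20)   # overshoots further on every step

print(w_good)
print(w_bad)
```

With the larger learning rate each step lands further from the minimum than the last, which is the overshooting behaviour described above.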

The procedure we followed for *w* is applied in parallel to estimate *b*. Since *b* is just a single number, updating it adds little to the cost of the process. Finally, we use the estimated *w* and *b* to calculate y_hat.

### 1.7. Summary: Logistic Regression

Let us summarize the steps we followed:

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/10.png" width=600 align="center">

## 2. Computational Graphs and Chain rule

### 2.1. Introduction to Computational Graphs

In simple terms, a computational graph is a graphical representation of the steps followed to calculate the value of a function. We will explore the idea using an example.

Consider a function *J(a, b, c)* defined as 5(ab+c). 
- The first step will be to calculate the product of a and b. Let's call it u.

- Next step will be to calculate sum of u and c. We will call it v.

- Finally, we will multiply v with 5. We will call it J.

Notice the computational graph drawn below. We have to move from left to right on the graph to calculate final value J.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/11.png" width=600 align="center">
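The left-to-right pass through the graph can be sketched as follows. The input values a = 5, b = 3, c = 2 are assumptions picked only to give concrete numbers.

```python
def forward(a, b, c):
    """Evaluate J(a, b, c) = 5(ab + c) one graph node at a time."""
    u = a * b   # first node: product of a and b
    v = u + c   # second node: sum of u and c
    J = 5 * v   # final node: multiply v by 5
    return u, v, J

u, v, J = forward(a=5, b=3, c=2)
print(u, v, J)  # 15 17 85
```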

### 2.2. Chain rule with Computational Graph

Now imagine we have to calculate the partial derivatives of *J(a,b,c)* with respect to a, b and c. We are familiar with the procedure using the chain rule. For a, with u = ab and v = u + c as defined above, the chain rule gives ∂J/∂a = (∂J/∂v)(∂v/∂u)(∂u/∂a).

Now we will link partial derivatives to the computational graph. The idea is to move from right to left across the graph and, at each step, calculate the partial derivative of the variable computed at that node with respect to the variable computed one node earlier. Finally, we multiply all of those partial derivatives.

The steps for calculating partial derivative of *J* with respect to *a* are as follows:

- First calculate partial derivative of *J* with respect to *v*
- Second, calculate partial derivative of *v* with respect to *u*
- Next, calculate partial derivative of *u* with respect to *a*
- Finally, we multiply all three derivatives to find partial derivative of *J* with respect to *a*

The process for calculating partial derivative of *J* with respect to *b* and *c* is shown in figure below.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/12.png" width=600 align="center">
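The right-to-left pass for the same function J = 5(ab + c) can be sketched as below. The input values are the same illustrative assumptions (a = 5, b = 3, c = 2), and the variable names are chosen here for readability.

```python
def backward(a, b, c):
    """Partial derivatives of J = 5(ab + c), built up node by node from right to left."""
    dJ_dv = 5.0   # J = 5v
    dv_du = 1.0   # v = u + c
    dv_dc = 1.0   # v = u + c
    du_da = b     # u = ab
    du_db = a     # u = ab

    # Chain rule: multiply the local derivatives along each path back to the input.
    dJ_da = dJ_dv * dv_du * du_da
    dJ_db = dJ_dv * dv_du * du_db
    dJ_dc = dJ_dv * dv_dc
    return dJ_da, dJ_db, dJ_dc

grads = backward(a=5.0, b=3.0, c=2.0)
print(grads)  # (15.0, 25.0, 5.0)
```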

### 2.3. Logistic Regression with Computational Graph

We will now form a computational graph from the steps of the logistic regression problem we solved above. To simplify the problem, assume we have only one pair *x* and *y*, i.e. the number of images, m, equals 1. Also, instead of a high-dimensional vector, imagine *x* is only two-dimensional. Consequently, *w* will also be two-dimensional. Since we have only one data point, the average cost, *J*, is the same as the cost for that data point, *L*.

Our aim with logistic regression was to determine values of w and b such that the average cost is at its minimum. To do that, we followed these steps:

- We started by assuming values of w and b
- Second, we calculated the dot product of w.T and x and added b to it. For a two-dimensional vector x, the result is w1x1+w2x2+b. We will call it *z*
- Next, we calculated the sigmoid of z as our predicted value, y_hat. We will call it a for now
- Then, we calculated the average cost, *J*. For a single image, *J* is the same as *L*

The computational graph for the above steps is shown below. We need to move from left to right. This step, in neural networks, is known as forward propagation.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/13.png" width=600 align="center">

Now we need to use the gradient descent algorithm to update the values of w and b. To do that, we have to calculate the partial derivatives of *J(w, b)* with respect to w and b. For a single image, *J* is the same as *L*.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/14.png" width=600 align="center">

To calculate partial derivative of *L* with respect to *w* and *b* we have to move from right to left across the computational graph.

- We begin by calculating partial derivative of *L* with respect to a
- Next, we calculate partial derivative of *a* with respect to z
- Then, we calculate partial derivative of *z* with respect to w1, w2 and b

The steps are shown below. This step, in neural networks, is known as backward propagation or backpropagation.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/15.png" width=600 align="center">
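Forward and backward propagation for one two-dimensional example can be sketched together. The derivatives used are the standard ones for sigmoid with binary cross entropy; the function name, the dictionary keys, and the input values are assumptions for illustration.

```python
import math

def forward_backward(w1, w2, b, x1, x2, y):
    """One forward pass and one backward pass for a single training example."""
    # Forward propagation (left to right).
    z = w1 * x1 + w2 * x2 + b
    a = 1.0 / (1.0 + math.exp(-z))                       # y_hat
    L = -(y * math.log(a) + (1 - y) * math.log(1 - a))   # binary cross entropy

    # Backward propagation (right to left).
    dL_da = -y / a + (1 - y) / (1 - a)
    da_dz = a * (1 - a)
    dL_dz = dL_da * da_dz        # algebraically equal to a - y
    return {"L": L, "dw1": dL_dz * x1, "dw2": dL_dz * x2, "db": dL_dz}

# Hypothetical starting point: all parameters zero, one positive example.
out = forward_backward(w1=0.0, w2=0.0, b=0.0, x1=1.0, x2=2.0, y=1)
print(out)
```

The gradients `dw1`, `dw2` and `db` are what the gradient descent update then scales by the learning rate and subtracts from the parameters.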

### 2.4. Summary: Training w and b with one example

Let us summarize the steps.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/16.png" width=600 align="center">

### 2.5. Learning from m training examples

We don't actually learn from a single example. We train our algorithm on a number of examples. So we will scale our computational graph for m training examples.

In the graph below, each *x* represents the data for one image. The sigmoid function, and consequently the cost function *L*, is calculated for each image, and finally the average cost is calculated. We have to perform the backpropagation calculation for each branch, and the whole process is repeated a number of times. Note that m can be very large, even greater than a million, so all of these computations can be very expensive.

Notice that w and b are the same across every branch, and the branches are not connected until we reach the calculation of *J*. So the calculations in the branches are independent of each other and can run in parallel. We can take advantage of this and make the process far more efficient using the idea of vectorization.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/17.png" width=600 align="center">

### 2.6. Vectorization

We create a matrix, X, and stack all the x vectors in it as columns. So we have a matrix of size 12288 by m (64×64×3 = 12288). Note that we don't need to change the shape of w for the calculation. w is of size 12288 by 1, so w.T is of size 1 by 12288, and w.T and X are still compatible for a dot product. Finally, we add b to every entry of the product. The result is a matrix, z, of size 1 by m.

Using vectorization, we can replace a number of parallel calculations with one dot product. In Python, it can be done with a single line of code. We can then take the sigmoid of z and compute the average cost.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/18.png" width=600 align="center">

If we try to do the same thing in Python without vectorization, it will require a lot more calculation steps, as shown below.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/19.png" width=600 align="center">
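The sketch below contrasts the two approaches on random stand-in data. It is independent of the code in the figures; the array sizes, the random values, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 12288, 5                          # features per image, number of images (kept small)
X = rng.random((n, m))                   # hypothetical images stacked as columns
w = rng.standard_normal((n, 1)) * 0.01   # hypothetical weights
b = 0.5

# Vectorized: one line computes z for every image; b is broadcast across the row.
Z_vec = w.T @ X + b                      # shape (1, m)

# Non-vectorized: explicit loops over images and over features.
Z_loop = np.zeros((1, m))
for i in range(m):
    for j in range(n):
        Z_loop[0, i] += w[j, 0] * X[j, i]
    Z_loop[0, i] += b

print(Z_vec.shape)                  # (1, 5)
print(np.allclose(Z_vec, Z_loop))   # True
```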

Forward propagation is followed by backward propagation, and the cycle is repeated a number of times. The code for the process using vectorization is shown below. Here the cycle is repeated 1000 times.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/20.png" width=600 align="center">
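For readers who prefer text to a screenshot, here is a minimal sketch of such a vectorized training loop. It is not a transcription of the figure: the toy two-feature dataset, the zero initialization, the learning rate of 0.1, and the helper names are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def average_cost(w, b, X, Y, eps=1e-12):
    """Average binary cross entropy over all m examples."""
    A = np.clip(sigmoid(w.T @ X + b), eps, 1 - eps)
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

def train(X, Y, alpha=0.1, iterations=1000):
    """Vectorized gradient descent for logistic regression."""
    n, m = X.shape
    w, b = np.zeros((n, 1)), 0.0
    for _ in range(iterations):
        A = sigmoid(w.T @ X + b)    # forward propagation, shape (1, m)
        dZ = A - Y                  # dL/dz for every example at once
        dw = (X @ dZ.T) / m         # average gradient for w, shape (n, 1)
        db = float(np.sum(dZ)) / m  # average gradient for b
        w = w - alpha * dw          # gradient descent updates
        b = b - alpha * db
    return w, b

# Hypothetical toy data: two features, label 1 when x1 + x2 > 0.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, -1)

cost_before = average_cost(np.zeros((2, 1)), 0.0, X, Y)
w, b = train(X, Y)
cost_after = average_cost(w, b, X, Y)
print(cost_before, cost_after)  # the cost should drop after training
```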

### 2.7. Logistic Regression - Implementing

Once we have calculated appropriate w and b, we have our model. We can now take a new input *x*, calculate z, and apply the sigmoid function. Generally, if the sigmoid function returns a value greater than or equal to 0.5, the prediction is 1, and if it returns a value lower than 0.5, the prediction is 0. This gives us our prediction for the new case: whether the image contains a dog (1) or not (0).

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/21.png" width=600 align="center">
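The thresholding step can be sketched as follows. The two-dimensional weights and the three new inputs are hypothetical values chosen so the result is easy to check by hand.

```python
import numpy as np

def predict(w, b, X, threshold=0.5):
    """Return 1 where sigmoid(w.T X + b) >= threshold, otherwise 0."""
    A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))
    return (A >= threshold).astype(int)

w = np.array([[1.0], [-1.0]])        # hypothetical trained weights
b = 0.0
X_new = np.array([[2.0, 0.0, -1.0],  # three new inputs stored as columns
                  [1.0, 0.0, 3.0]])

# z = [1, 0, -4], so the sigmoid is about [0.73, 0.5, 0.018].
predictions = predict(w, b, X_new)
print(predictions)  # [[1 1 0]]
```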

Notice that the above computational graph has a linear part and a non-linear, or activation (a), part. This combination of a linear and a non-linear step is called one neuron. One neuron has limited capability and is often not sufficient to map out complex decision boundaries, especially in higher dimensions. Thus the need for deep neural networks arises.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/22.png" width=600 align="center">

### 2.8. Problem with one neuron

The figure below shows the decision boundary produced by one neuron for a set of two-dimensional data points. The cycle was repeated 221 times with a learning rate of 0.03 and sigmoid as the activation function. Note that both the test and training losses are zero. One neuron was able to find a good decision boundary.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/23.png" width=600 align="center">

The figure below shows the decision boundary produced by one neuron for another set of two-dimensional data points. The cycle was repeated 202 times with a learning rate of 0.03 and sigmoid as the activation function. Note that both the test and training losses are above 0.5. One neuron was not able to find a good decision boundary.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/24.png" width=600 align="center">

## 3. Neural Networks

### 3.1. Neural Network Terminology

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/25.png" width=600 align="center">

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/26.png" width=600 align="center">

### 3.2. Training a Neural Network

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/27.png" width=600 align="center">

### 3.3. Forward Propagation

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/28.png" width=600 align="center">

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/29.png" width=600 align="center">

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/30.png" width=600 align="center">

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/31.png" width=600 align="center">

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/32.png" width=600 align="center">

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/33.png" width=600 align="center">

### 3.4. Forward Propagation - m training examples

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/34.png" width=600 align="center">

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/35.png" width=600 align="center">

### 3.5. Vectorizing Forward Propagation

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/36.png" width=600 align="center">

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/37.png" width=600 align="center">

### 3.6. Summary: Forward Propagation

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/38.png" width=600 align="center">

### 3.7. Activation Functions

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/39.png" width=600 align="center">

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/40.png" width=600 align="center">

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/41.png" width=600 align="center">

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/52.png" width=600 align="center">

### 3.8. Backpropagation - Revisiting Logistic Regression

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/42.png" width=600 align="center">

### 3.9. Backpropagation for Neural Networks

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/43.png" width=600 align="center">

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/44.png" width=600 align="center">

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/45.png" width=600 align="center">

### 3.10. Summary for Learning

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/46.png" width=600 align="center">

### 3.11. Prediction

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/47.png" width=600 align="center">

### 3.12. Initializing of weights

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/48.png" width=600 align="center">

### 3.13. Summary

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/49.png" width=600 align="center">

### 3.14. Applications

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/50.png" width=600 align="center">

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/51.png" width=600 align="center">