# Introduction to Deep Neural Networks

## 1. Binary Classification

### 1.1. The Problem

Consider a fundamental image classification problem: we want to train a supervised machine learning model to identify whether an image contains a dog. We are given a set of labeled images for training. If a dog is present in the image, its label is 1; if there is no dog, its label is 0.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/1.png" width=600 align="center">

### 1.2. Notations

So our target variable, y, takes one of two values, 0 or 1. Now suppose each image is 64 by 64 pixels in size. Since red, green, and blue are the primary colors, and every other color can be generated by combining them, each pixel carries three values, giving 64 × 64 × 3 = 12288 numbers per image. We put all of these numbers into a single vector x. Say we have "m" such images.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/2.png" width=600 align="center">
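As a rough illustration of this notation, the sketch below flattens one image into the vector x. It uses NumPy; the random array is only a hypothetical stand-in for real pixel data, and the name `image` is an assumption made for this example.

```python
import numpy as np

# Hypothetical stand-in for one 64x64 RGB image (pixel values 0-255).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3))

# Flatten all pixel values into a single column vector x.
x = image.reshape(-1, 1)

print(x.shape)  # (12288, 1)
```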

### 1.3. Linear vs. Logistic Regression

Linear regression does not work well for classification problems, since the target variable is categorical. Besides, values predicted by linear regression may fall outside the range of possible values. Logistic regression, on the other hand, works much better for a binary classification problem. It predicts a probability, so its outputs always lie between zero and one.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/4.png" width=600 align="center">

### 1.4. Approach: Logistic Regression

At the end of the model training process, we want a model f(x) that maps an x not used for training to an appropriate y_hat. For logistic regression, the function is σ(w.T x + b). Just as linear regression requires us to determine appropriate slope and intercept values, logistic regression requires us to determine w and b. The sigmoid function (s-curve) is defined as σ(z) = 1/(1+e^-z). Note that w must be a vector of the same dimension as x. When we multiply w.T and x, we get a real number, and then we add b to that number. Finally, we calculate the sigmoid of the result.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/3.png" width=600 align="center">
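The prediction step can be sketched in a few lines of NumPy. This is only an illustration: the function name `predict_proba` and the tiny four-dimensional `w` and `x` are assumptions chosen to keep the example readable.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid (s-curve): 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Compute y_hat = sigmoid(w.T x + b) for column vectors w and x."""
    z = (w.T @ x).item() + b  # w.T x is a single real number
    return sigmoid(z)

# Hypothetical 4-dimensional example; a real image vector would have 12288 entries.
w = np.zeros((4, 1))
b = 0.0
x = np.ones((4, 1))

y_hat = predict_proba(w, b, x)
print(y_hat)  # 0.5, since z = 0
```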

### 1.5. Cost Function

We want to determine w and b in such a way that, for new images, our predicted value, y_hat, matches the actual value, y, for most of the images. To do that, we need a cost function whose value we minimize. J(w,b) is the average of the costs obtained from the per-example cost function L(y_hat, y) over "m" data points. The cost for one data point, L, can be any of several functions of y_hat and y, and we can define our own as well. It can be as simple as |y_hat-y|. But for machine learning algorithms, we prefer a cost function whose average, when plotted against the estimated parameters (here, w and b), has a single global minimum. For logistic regression, we use binary cross-entropy as the cost function (shown below).

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/5.png" width=600 align="center">
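For readers who prefer code to formulas, here is a minimal sketch of binary cross-entropy, assuming its standard form L(y_hat, y) = -(y log(y_hat) + (1 - y) log(1 - y_hat)). The function names are hypothetical.

```python
import math

def binary_cross_entropy(y_hat, y):
    """Cost L for one data point; y is 0 or 1 and y_hat is strictly between 0 and 1."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def average_cost(y_hats, ys):
    """J: the average of L over all m data points."""
    losses = [binary_cross_entropy(p, t) for p, t in zip(y_hats, ys)]
    return sum(losses) / len(losses)

# A confident correct prediction costs little; a confident wrong one costs a lot.
low = binary_cross_entropy(0.9, 1)
high = binary_cross_entropy(0.1, 1)
```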

### 1.6. Gradient Descent Algorithm

As I said earlier, we use a cost function whose average, when plotted against the estimated parameters, has a global minimum (bowl shape). The idea is to estimate the parameters w and b such that the average cost is minimized. Vector w has a very high dimension, and we cannot plot that many dimensions. For simplicity, imagine it is one dimensional. Notice the convex curve below, obtained by plotting the average cost against w.

Now, to find the maxima/minima, the mathematical procedure would be to calculate the partial derivative of J(w) with respect to w and set it to zero. However, since w is a high dimensional vector, it is difficult to solve analytically. Instead of trying to calculate the exact value of w, we can estimate it. The estimate may not be exact, but it is good enough for the task at hand and costs less to compute. The algorithm used to estimate the parameter values that minimize the average cost is known as the Gradient Descent algorithm. It is also known as steepest descent.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/6.png" width=600 align="center">

Points to the left of the minimum have a negative gradient, and points to the right of the minimum have a positive gradient. We start with a random value of w. Using that value, we calculate J(w). Next, we calculate the first derivative at that point. If the slope is negative, we need a higher w; if it is positive, we need a lower w.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/7.png" width=600 align="center">

Note that w is not increased or decreased randomly. At every step, we subtract α times the first derivative of J at the current point from w. The value α is known as the learning rate, and its value depends on the task at hand. We use hyperparameter tuning along with domain knowledge to find an appropriate α. We repeat the complete cycle until convergence.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/8.png" width=600 align="center">
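The update rule can be tried out on a toy problem. The sketch below assumes a hypothetical one-dimensional cost J(w) = (w - 3)^2, not the logistic regression cost, simply because its minimum (w = 3) is known in advance.

```python
def gradient_descent(grad, w0, alpha, steps):
    """Repeat the update w := w - alpha * dJ/dw for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

def grad_J(w):
    """Derivative of the toy cost J(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

# Start to the left of the minimum: the slope is negative, so w increases.
w_est = gradient_descent(grad_J, w0=-4.0, alpha=0.1, steps=100)
print(w_est)  # very close to 3
```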

Note that α can be neither too high nor too low. If α is too high, we may keep overshooting the minimum point; if α is too low, convergence may take a long time. Either way, a poorly chosen α for a high dimensional vector w leads to high computational cost.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/9.png" width=600 align="center">
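To see the effect of α in numbers, the sketch below runs 50 update steps on a hypothetical toy cost J(w) = w^2 (minimum at w = 0) with three learning rates. The specific values of α are assumptions picked only to show the three behaviours.

```python
def run(alpha, w0=1.0, steps=50):
    """Gradient descent on the toy cost J(w) = w^2, whose derivative is 2w."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2.0 * w
    return w

too_high = run(alpha=1.1)     # every step overshoots and lands further away
too_low = run(alpha=0.001)    # barely moves in 50 steps
reasonable = run(alpha=0.1)   # ends up very close to the minimum at 0
```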

The vector w is called the weight vector, and the number b is called the bias. We estimate b in parallel with w, following the same procedure. We add b to make our function more flexible. Finally, we use the estimated w and b to calculate y_hat.

### 1.7. Summary: Logistic Regression

Let us summarize the steps we followed:

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/10.png" width=600 align="center">

## 2. Computational Graphs and Chain rule

### 2.1. Introduction to Computational Graphs

A computational graph is a graphical representation of steps followed to calculate the value of a function. We will explore this idea using an example.

Consider a function J(a, b, c) defined as 5(ab+c).

- The first step will be to calculate the product of a and b. Let's call it u.
- The next step will be to calculate the sum of u and c. We will call it v.
- Finally, we will multiply v with 5. We will call it J.

Notice the computational graph drawn below. We have to move from left to right on the graph to calculate the final value J.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/11.png" width=600 align="center">
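The left-to-right pass through this graph translates directly into code. The function name `forward` and the input values are assumptions made for this example.

```python
def forward(a, b, c):
    """Evaluate J(a, b, c) = 5(ab + c) one node at a time."""
    u = a * b   # first node: product of a and b
    v = u + c   # second node: sum of u and c
    J = 5 * v   # final node
    return u, v, J

u, v, J = forward(2, 3, 4)
print(u, v, J)  # 6 10 50
```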

### 2.2. Chain rule with Computational Graph

Suppose we have to calculate the partial derivatives of J(a,b,c) with respect to a, b and c. We are familiar with the procedure using the chain rule. The chain rule for the partial derivative of J with respect to a is ∂J/∂a = (∂J/∂v)(∂v/∂u)(∂u/∂a), where u and v are the intermediate values defined above.

We will now link partial derivatives to the computational graph. The idea is to move from right to left across the graph and, at each step, calculate the partial derivative of the variable computed at the later stage with respect to the variable computed at the stage just before it. Finally, we multiply all those partial derivatives.

The steps for calculating partial derivative of J with respect to a are as follows:

- First, calculate the partial derivative of J with respect to v
- Second, calculate the partial derivative of v with respect to u
- Next, calculate the partial derivative of u with respect to a
- Finally, we multiply all three results to find the partial derivative of J with respect to a

The process for calculating the partial derivative of J with respect to b and c is shown in the figure below.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/12.png" width=600 align="center">
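The right-to-left pass can be written out explicitly for J = 5(ab + c), with u = ab and v = u + c. In this sketch the local derivatives are worked out by hand; the function name `gradients` is hypothetical.

```python
def gradients(a, b, c):
    """Partial derivatives of J = 5(ab + c) via the chain rule."""
    # Local derivatives at each node, moving right to left.
    dJ_dv = 5    # J = 5v
    dv_du = 1    # v = u + c
    dv_dc = 1
    du_da = b    # u = ab
    du_db = a

    # Multiply the local derivatives along each path.
    dJ_da = dJ_dv * dv_du * du_da
    dJ_db = dJ_dv * dv_du * du_db
    dJ_dc = dJ_dv * dv_dc
    return dJ_da, dJ_db, dJ_dc

print(gradients(2, 3, 4))  # (15, 10, 5)
```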

### 2.3. Logistic Regression with Computational Graph

We will now form a computational graph from the steps of the logistic regression problem we solved above. To simplify the problem, assume we have only one *x* and *y*, i.e. the number of images, m, equals 1. Also, instead of being a high dimensional vector, imagine *x* is only two dimensional. Consequently, *w* will also be only two dimensional. Since we have only one data point, the average cost function, *J*, will be the same as the cost for one data point, *L*.

Our aim with logistic regression was to determine values of w and b such that the average cost is minimized. To do that, we followed these steps:

- We started by assuming values for w and b
- Second, we calculated the dot product of w.T and x and added b to it. For a two dimensional vector x, the result is w1x1+w2x2+b. We will call it *z*
- Next, we calculated the sigmoid of z as our predicted value, y_hat. We will call it *a* for now
- Then, we calculated the average cost function, *J*. For a single image, *J* is the same as *L*

The computational graph for the above steps is shown below. We need to move from left to right. This step, in neural networks, is known as forward propagation.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/13.png" width=600 align="center">

Now, we need to use the Gradient Descent algorithm to update the values of w and b. To do that, we have to calculate the partial derivatives of *J(w, b)* with respect to w and b. For a single image, *J* is the same as *L*.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/14.png" width=600 align="center">

To calculate the partial derivatives of *L* with respect to *w* and *b*, we have to move from right to left across the computational graph.

- We begin by calculating the partial derivative of *L* with respect to *a*
- Next, we calculate the partial derivative of *a* with respect to *z*
- Then, we calculate the partial derivatives of *z* with respect to w1, w2 and b

The steps are shown below. This step, in neural networks, is known as backward propagation or backpropagation.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/15.png" width=600 align="center">
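Both passes for a single two-dimensional example fit in one short function. This is a sketch that assumes binary cross-entropy as *L* and sigmoid as the activation; the function name `forward_backward` and the input values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_backward(w1, w2, b, x1, x2, y):
    """One forward and one backward pass for a single example."""
    # Forward propagation (left to right).
    z = w1 * x1 + w2 * x2 + b
    a = sigmoid(z)
    L = -(y * math.log(a) + (1 - y) * math.log(1 - a))

    # Backward propagation (right to left).
    dL_da = -y / a + (1 - y) / (1 - a)
    da_dz = a * (1 - a)
    dL_dz = dL_da * da_dz          # simplifies to a - y
    dL_dw1 = dL_dz * x1            # dz/dw1 = x1
    dL_dw2 = dL_dz * x2            # dz/dw2 = x2
    dL_db = dL_dz                  # dz/db = 1
    return L, (dL_dw1, dL_dw2, dL_db)

L, grads = forward_backward(w1=0.0, w2=0.0, b=0.0, x1=1.0, x2=2.0, y=1)
```

With all parameters at zero, a is 0.5, so dL/dz is a - y = -0.5 and the gradients follow by multiplying with x1, x2 and 1.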

### 2.4. Summary: Training w and b with one example

Let us summarize the steps.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/16.png" width=600 align="center">

### 2.5. Learning from m training examples

We don't actually learn from a single example. We train our algorithm on a number of examples. So we will scale our computational graph for m training examples.

In the graph below, each *x* represents one image. The sigmoid function, and consequently the cost function, *L*, is calculated for each image, and finally the average cost is calculated. We have to perform the backpropagation calculation for each branch. The process is repeated a number of times. Understand that m can be very large, even greater than a million, and all these computations can be very expensive. So, it is a big problem.

Notice that w and b are the same across all branches, and the branches are not connected until we reach the calculation of *J*. So, a lot of parallel calculations are involved. We can take advantage of these parallel computations and make the process far more efficient using the idea of vectorization.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/17.png" width=600 align="center">

### 2.6. Vectorization

We will create a matrix, X, and stack all the x vectors in it column-wise. The convention is to represent a vector with a lowercase letter and a matrix with a capital letter. We now have a matrix of size 12288 by m (64 × 64 × 3 = 12288). Note that we don't need to change the shape of w for the calculation. w is of size 12288 by 1, and therefore w.T is of size 1 by 12288, so w.T and X are still compatible for a dot product. Finally, we add b to the product. The result is a matrix, z, of size 1 by m.

Using vectorization, we can replace a number of parallel calculations with one dot product. In Python, it can be done with a single line of code. We can then take the sigmoid of z and compute the average cost.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/18.png" width=600 align="center">

If we try to do the same thing in Python without vectorization, it requires a lot more code and computation, as shown below.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/19.png" width=600 align="center">
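The two approaches can be compared side by side in NumPy. This is a sketch with hypothetical random data and only m = 5 examples so that the explicit loops finish quickly; it is not the code from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, m = 12288, 5                            # 64 * 64 * 3 features, 5 examples
X = rng.random((n_x, m))                     # one image per column
w = rng.standard_normal((n_x, 1)) * 0.01
b = 0.1

# Vectorized: one dot product; b is broadcast across all m columns.
Z_vec = w.T @ X + b                          # shape (1, m)

# Non-vectorized: explicit loops over examples and features.
Z_loop = np.zeros((1, m))
for i in range(m):
    z = 0.0
    for j in range(n_x):
        z += w[j, 0] * X[j, i]
    Z_loop[0, i] = z + b

print(np.allclose(Z_vec, Z_loop))
```

Both versions produce the same 1 by m result; the vectorized line simply hands the loops to optimized library code.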

Forward propagation is followed by backward propagation, and the cycle is repeated a number of times. The code for the process using vectorization is shown below. The cycle has been repeated 1000 times.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/20.png" width=600 align="center">
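For reference, a text version of such a training loop might look like the sketch below. It assumes binary cross-entropy cost and sigmoid activation; the tiny two-dimensional dataset, the learning rate of 0.1 and the function name `train` are assumptions for illustration, not values taken from the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.1, iterations=1000):
    """Vectorized logistic regression trained with gradient descent."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(iterations):
        # Forward propagation for all m examples at once.
        A = sigmoid(w.T @ X + b)      # shape (1, m)
        # Backward propagation.
        dZ = A - Y                    # dL/dz for every example
        dw = (X @ dZ.T) / m           # average gradient for w
        db = np.sum(dZ) / m           # average gradient for b
        # Gradient descent update.
        w -= alpha * dw
        b -= alpha * db
    return w, b

# Hypothetical toy data: four 2-D points, one per column.
X = np.array([[-2.0, -1.0, 1.0, 2.0],
              [-2.0, -1.5, 1.5, 2.0]])
Y = np.array([[0, 0, 1, 1]])

w, b = train(X, Y)
```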

### 2.7. Logistic Regression - Implementing

After we have calculated appropriate w and b, we have our model. We can now take a new input *x*, calculate z, and then apply the sigmoid function. Generally, if the sigmoid function returns a value greater than or equal to 0.5, the prediction is taken as 1, and if it returns a value lower than 0.5, the prediction is taken as 0. This gives us our prediction for the new case: whether the image contains a dog (1) or not (0).

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/21.png" width=600 align="center">
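The thresholding rule is simple enough to write down directly. In this sketch the weights, bias and inputs are hypothetical two-dimensional values, and the function name `predict` is an assumption.

```python
import math

def predict(w, b, x, threshold=0.5):
    """Return 1 if sigmoid(w.T x + b) >= threshold, otherwise 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    a = 1.0 / (1.0 + math.exp(-z))
    return 1 if a >= threshold else 0

# Hypothetical trained parameters.
w, b = [1.0, -1.0], 0.0

print(predict(w, b, [2.0, 1.0]))  # z = 1, sigmoid above 0.5, prediction 1
print(predict(w, b, [1.0, 2.0]))  # z = -1, sigmoid below 0.5, prediction 0
```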

Notice that the above computational graph has a linear part and a non-linear, or activation (a), part. This combination of a linear and a non-linear step is called one neuron. The decision boundary created by one neuron is linear and often not sufficient to map out complex decision boundaries, especially in higher dimensions. Thus, the need for deep neural networks arises.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/22.png" width=600 align="center">

### 2.8. Problem with one neuron

The figure below shows the decision boundary produced by one neuron for a set of two dimensional data points. The cycle was repeated 221 times with a learning rate of 0.03 and sigmoid as the activation function. Note that both test and training loss are zero. One neuron was able to find a good decision boundary.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/23.png" width=600 align="center">

The figure below shows the decision boundary produced by one neuron for another set of two dimensional data points. The cycle was repeated 202 times with a learning rate of 0.03 and sigmoid as the activation function. Note that both test and training loss are above 0.5. One neuron was not able to find a good decision boundary.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/24.png" width=600 align="center">

## 3. Neural Networks

### 3.1. Neural Network Terminology

A neural network is a collection of neurons. In a neural network, instead of representing a neuron as a combination of a linear and a non-linear part, we represent it by a circle. Note that the difference is in representation only; the underlying mathematics remains the same.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/25.png" width=600 align="center">

The figure below shows a two-layer neural network. In the first layer (hidden layer) we have four neurons, and in the second layer (output layer) we have a single neuron. We also have layer 0, the input layer, representing the inputs. All layers except the input and output layers are referred to as hidden layers.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/26.png" width=600 align="center">

### 3.2. Training a Neural Network

Let's review the steps for a single neuron.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/27.png" width=600 align="center">

The same steps are repeated for deep neural networks but instead of one, we have many neurons.

### 3.3. Forward Propagation

We will take the input and hidden layer from the above neural network. To simplify the problem, we will again assume we have only one *x* and *y*, i.e. the number of images, m, equals 1. Also, instead of being a high dimensional vector, imagine *x* is only three dimensional. Consequently, *w* will also be only three dimensional. Since we have only one data point, the average cost function, *J*, will be the same as the cost for one data point, *L*.

Notice that instead of passing *x* to a single neuron, we are passing it to four neurons. The idea is to pass the data vector to different neurons with different weight vectors and biases. A common way of restating the idea is to say that each line in a neural network has a different weight. The outputs of the first hidden layer act as inputs for the next hidden layer or the output layer, and are also weighted.

Consider only the first neuron in the hidden layer. We have our unit neuron with a linear (z1) and a non-linear (a1) part. Since the weight vectors and biases of different neurons are different, we denote this neuron's weight vector and bias by w1 and b1. Consequently, z becomes z1 and a becomes a1.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/28.png" width=600 align="center">

Also, since we have multiple layers (here, one hidden layer and one output layer), we add [1] in the superscript to denote layer 1.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/29.png" width=600 align="center">

For layer one, we have four weight vectors and biases, one for each neuron. Note that we are passing the same image to all four neurons, so the *x* vector is the same for each neuron. This results in parallel computation. To make the computations more efficient, we will use vectorization.

We will stack the transposes of the weight vectors row-wise so that they form a matrix of size 4 by 3, represented by *W*. The [1] in the superscript denotes layer 1. Similarly, bold **z** and **b** represent the vectors obtained after stacking row-wise.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/30.png" width=600 align="center">

Let's talk a bit about the dimensions of each term above. For this particular example, the dimension of **z** is (4,1), **W** is (4,3), **x** is (3,1), and **b** is (4,1).

To generalize the dimensions, say the dimension of our input layer is represented by n with superscript [0], and the number of neurons in the first layer is represented by n with superscript [1]. Then the respective dimensions of **z**, **W**, **x** and **b** are as shown below.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/31.png" width=600 align="center">

When we use vectorization, instead of calculating the sigmoid of each individual z, we can take the sigmoid of the whole vector **z**.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/32.png" width=600 align="center">
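The shapes discussed above are easy to check in NumPy. In this sketch the variable names `W1`, `b1`, `z1` and `a1` stand for W[1], b[1], z[1] and a[1]; the small random initial values are an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n0, n1 = 3, 4                                # input size n[0], neurons in layer 1 n[1]

W1 = rng.standard_normal((n1, n0)) * 0.01    # transposed weight vectors stacked row-wise
b1 = np.zeros((n1, 1))
x = rng.random((n0, 1))                      # one three-dimensional input

z1 = W1 @ x + b1                             # linear part for all four neurons at once
a1 = 1.0 / (1.0 + np.exp(-z1))               # element-wise sigmoid of the vector z

print(z1.shape, a1.shape)                    # (4, 1) (4, 1)
```

Row k of `W1` holds the transposed weight vector of neuron k, so each entry of `z1` is that neuron's own w.T x + b.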

Now, every layer has a different **W** and **b**, distinguished by the notation in the superscript. The output from the first layer, ***a***, is passed to the second layer. For every layer except the first, we can generalize the equation as shown below.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/33.png" width=600 align="center">
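Treating the input x as a[0], the same two lines can be repeated for every layer. The sketch below assumes sigmoid is used in every layer; the function name `forward` and the list-of-pairs representation of the parameters are choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers, g=sigmoid):
    """Apply z[l] = W[l] a[l-1] + b[l] and a[l] = g(z[l]) layer by layer."""
    a = x                        # a[0] is the input
    for W, b in layers:
        z = W @ a + b
        a = g(z)
    return a

rng = np.random.default_rng(0)
layers = [
    (rng.standard_normal((4, 3)) * 0.01, np.zeros((4, 1))),  # layer 1: 3 inputs, 4 neurons
    (rng.standard_normal((1, 4)) * 0.01, np.zeros((1, 1))),  # layer 2: 4 inputs, 1 neuron
]
x = rng.random((3, 1))
y_hat = forward(x, layers)
```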

### 3.4. Forward Propagation - m training examples

To repeat, we don't actually learn from a single example. We train our algorithm on a number of examples. So we will scale our computational graph for m training examples.

Everything we did for one image will now be represented by a single green box: passing the components of the image vector to four neurons, calculating z[1] and a[1], sending a[1] as input to the output layer, and then calculating z[2] and a[2]. The input to the box is an image vector and the output is a prediction.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/34.png" width=600 align="center">

For m data points, we will have m such boxes. We have to calculate the loss for each image and then calculate the average loss, J. We again have parallel computations, and the amount of computation grows with every neuron we add; therefore, we need vectorization.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/35.png" width=600 align="center">

### 3.5. Vectorizing Forward Propagation

We will start by stacking the image data vectors column-wise into a matrix X. The size of the matrix is 3 by m. We already have a matrix W[1]. W[1]X plus the bias matrix results in a matrix Z[1] of size 4 by m, where each column holds the result of the linear computation for one image. We then apply an activation function to the matrix. Finally, we have a matrix A[1], where each column holds the result of the non-linear computation for one image.

In the slide below, superscript [1] denotes layer 1 and superscript (1) denotes the first image. Computations are shown for only one image vector. Also, W[1]X as drawn does not show the actual shape of the matrix. It is drawn this way to show that W[1] is multiplied with each image vector separately, column by column. The shape of W[1]X is the same as the shape of Z[1].

The A[1] matrix acts as the input matrix for the output layer. Each column of A[1] holds the outputs obtained from the neurons for one image, similar to how each column in X holds the pixel values for one image. The sequence of steps is the same as the vectorized process for a unit neuron discussed above.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/36.png" width=600 align="center">

The first graph below shows forward propagation computational graph before vectorization of m data points. The second graph is a simplified computational graph for forward propagation of a two layered neural network. 

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/37.png" width=600 align="center">

### 3.6. Forward Propagation in Python

Part of the Python code for a two-layer neural network is shown below. The first set of code is used if we are not using vectorization; the second set is used if we have vectorized our input data.

Note that the second set is easier to write and much more efficient computationally.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/38.png" width=600 align="center">
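A text version of the same comparison is sketched below, assuming sigmoid in both layers and the 3-4-1 network used throughout this section. The random data and parameter values are hypothetical, and the code is an illustration rather than a transcription of the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n0, n1, n2, m = 3, 4, 1, 5
X = rng.random((n0, m))                      # one example per column
W1 = rng.standard_normal((n1, n0)) * 0.01
b1 = np.zeros((n1, 1))
W2 = rng.standard_normal((n2, n1)) * 0.01
b2 = np.zeros((n2, 1))

# Without vectorization over examples: one image at a time.
A2_loop = np.zeros((n2, m))
for i in range(m):
    x = X[:, i:i + 1]                        # keep x as a (3, 1) column
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    A2_loop[:, i:i + 1] = a2

# With vectorization: all m images at once.
A1 = sigmoid(W1 @ X + b1)                    # shape (4, m); b1 is broadcast
A2 = sigmoid(W2 @ A1 + b2)                   # shape (1, m)
```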

### 3.7. Activation Functions

Until now, in this blog, we have used the sigmoid function as our activation function. But does it always need to be sigmoid? No. For neural networks, we have many other non-linear functions to choose from.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/39.png" width=600 align="center">

Some of the common activation functions are shown below. 

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/40.png" width=600 align="center">

Equations of common activation functions and their derivatives are shown below.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/52.png" width=600 align="center">
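In code, an activation function and its derivative are usually written as a pair. The sketch below assumes that sigmoid, tanh and ReLU are among the functions meant here; the derivative of ReLU at exactly zero is set to 0 by convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):
    return np.tanh(z)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    # 1 where z > 0, otherwise 0 (including at z == 0, by convention).
    return (z > 0).astype(float)

z = np.array([-1.0, 0.0, 2.0])
print(relu(z), relu_prime(z))
```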


Note that activation functions can vary with each layer. Activation functions, in general, are represented by *g*.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/41.png" width=600 align="center">

### 3.8. Backpropagation - Revisiting Logistic Regression

During backward propagation for logistic regression, we calculated the partial derivatives of the average cost with respect to the weights and the bias, and used them to update the weights and the bias.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/42.png" width=600 align="center">

### 3.9. Backpropagation for Neural Networks

In the case of neural networks, we don't have just w and b; instead, we have W[1], b[1] and W[2], b[2]. So we need to compute the partial derivative of the average cost, J, with respect to all four parameters. However, the process is the same.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/43.png" width=600 align="center">

First, we calculate the partial derivative of J with respect to A[2] and then with respect to Z[2]. Since W[2] and b[2] are linked to Z[2], we calculate the partial derivatives of Z[2] with respect to W[2] and b[2] at this stage.



<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/44.png" width=600 align="center">

Now, for the first layer, we need to compute the partial derivative of Z[2] with respect to A[1]. Then we calculate the partial derivative of A[1] with respect to Z[1]. Since W[1] and b[1] are linked to Z[1], we calculate the partial derivatives of Z[1] with respect to W[1] and b[1].

Finally, we update W[1], b[1] and W[2], b[2]. The cycle is repeated until a sufficiently low average cost is achieved.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/45.png" width=600 align="center">
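Put together, the backward pass for the two-layer network might look like the sketch below. It assumes sigmoid activations in both layers and binary cross-entropy as the cost, which lets dJ/dZ[2] simplify to (A[2] - Y)/m. The function names, random data and learning rate are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    W1, b1, W2, b2 = params
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    return A1, A2

def cost(params, X, Y):
    """Average binary cross-entropy J over the m examples."""
    _, A2 = forward(params, X)
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

def backward(params, X, Y):
    """Gradients of J with respect to W1, b1, W2 and b2."""
    W1, b1, W2, b2 = params
    m = X.shape[1]
    A1, A2 = forward(params, X)
    dZ2 = A2 - Y                                    # layer 2
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dA1 = W2.T @ dZ2                                # pass the error back to layer 1
    dZ1 = dA1 * A1 * (1 - A1)                       # sigmoid derivative
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

def gradient_descent_step(params, grads, alpha):
    return tuple(p - alpha * g for p, g in zip(params, grads))

# Hypothetical data: 5 three-dimensional examples.
rng = np.random.default_rng(0)
X = rng.random((3, 5))
Y = np.array([[0, 1, 1, 0, 1]])
params = (rng.standard_normal((4, 3)) * 0.5, np.zeros((4, 1)),
          rng.standard_normal((1, 4)) * 0.5, np.zeros((1, 1)))

grads = backward(params, X, Y)
new_params = gradient_descent_step(params, grads, alpha=0.1)
```

Repeating `backward` and `gradient_descent_step` in a loop gives the full training cycle.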

### 3.10. Summary for Learning

Let's summarize the steps.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/46.png" width=600 align="center">

### 3.11. Prediction

Once we have our optimum W[1], b[1] and W[2], b[2], we can begin predicting. We input a new x and pass it through layers 1 and 2 to get our final prediction.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/47.png" width=600 align="center">

### 3.12. Initialization of Weights

Let's talk a little about initializing W[1], b[1] and W[2], b[2]. Can we initialize them with any numbers?

Say our image vector is only two dimensional and we have to pass it through only two neurons. Consequently, W[1] will have dimension (2, 2). Say we initialize W[1] as a null matrix and b[1] as a null vector. In that case, irrespective of the value of x, z[1] will be a null vector. If we choose sigmoid as the activation function, a[1] will be a vector with all elements equal to 0.5.

During backpropagation, we need to calculate partial derivatives. Those values will also be symmetric. No matter how many iterations we run, we will not be able to break the symmetry, i.e. the neurons in a layer will always end up with identical weights. This, in most cases, will not result in the minimum value of the average cost. So, generally, we need to initialize W randomly and make sure it is not a null matrix.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/48.png" width=600 align="center">
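The symmetry problem can be observed in a small experiment. The sketch below trains a 2-2-1 sigmoid network twice, once from all-zero parameters and once from random weights; the toy data, the learning rate and the function name `train_two_layer` are assumptions made for this demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_two_layer(W1, b1, W2, b2, X, Y, alpha=0.5, iterations=100):
    """Plain gradient descent on a two-layer sigmoid network."""
    m = X.shape[1]
    for _ in range(iterations):
        A1 = sigmoid(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        dZ2 = A2 - Y
        dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
        W2 = W2 - alpha * (dZ2 @ A1.T) / m
        b2 = b2 - alpha * np.sum(dZ2, axis=1, keepdims=True) / m
        W1 = W1 - alpha * (dZ1 @ X.T) / m
        b1 = b1 - alpha * np.sum(dZ1, axis=1, keepdims=True) / m
    return W1, b1, W2, b2

# Hypothetical toy data: four 2-D points, one per column.
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
Y = np.array([[0, 0, 0, 1]])

# Zero initialization: both hidden neurons start out identical.
W1_zero, _, _, _ = train_two_layer(np.zeros((2, 2)), np.zeros((2, 1)),
                                   np.zeros((1, 2)), np.zeros((1, 1)), X, Y)

# Random initialization: the hidden neurons start out different.
rng = np.random.default_rng(0)
W1_rand, _, _, _ = train_two_layer(rng.standard_normal((2, 2)), np.zeros((2, 1)),
                                   rng.standard_normal((1, 2)), np.zeros((1, 1)), X, Y)

print(W1_zero)  # both rows are the same: the two neurons learned identical weights
print(W1_rand)  # the rows differ
```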

## 4. Summary

In summary, for a machine learning problem, we are trying to learn an f(x) that can take new input and give us appropriate predictions. Logistic regression is one of the methods used for the task, but it is a rather simple model and cannot form complex decision boundaries. Thus, we need neural networks.

If we decide to use sigmoid as our activation function, a unit neuron is conceptually the same as logistic regression. To generate complex decision boundaries, we use a number of neurons arranged in a number of layers. In essence, though, we perform the same calculations we did with logistic regression, only many times over.

The mathematical equations for logistic regression and neural networks are shown below, assuming we use sigmoid as the activation function.

<img src="https://raw.githubusercontent.com/ujwal-sah/blogs/master/Introduction%20to%20Deep%20Neural%20Networks/49.png" width=600 align="center">

<hr>

Hope you liked this blog. I know it was really long, but I wanted to start at the beginning. Note that the slides are taken from a presentation on Deep Neural Networks by Dr. Remun Koirala of CEA Leti, France.