# Table of Contents
* [1) Motivations](#1%29-Motivations)
    * [1) Non-linear Hypothesis](#1%29-Non-linear-Hypothesis)
	* [2) Neurons and the Brain](#2%29-Neurons-and-the-Brain)
* [2) Neural Networks](#2%29-Neural-Networks)
    * [1) Model Representation I](#1%29-Model-Representation-I)
	* [2) Model Representation II](#2%29-Model-Representation-II)
* [3) Applications](#3%29-Applications)
	* [1) Examples and Intuitions I](#1%29-Examples-and-Intuitions-I)
	* [2) Examples and Intuitions II](#2%29-Examples-and-Intuitions-II)
    * [3) Multiclass Classification](#3%29-Multiclass-Classification)

# 1) Motivations

## 1) Non-linear Hypothesis

I'd like to tell you about a learning algorithm called a **Neural Network**.

Neural networks are actually a pretty old idea that had fallen out of favor for a while. But today, they are the state-of-the-art technique for many different machine learning problems.

Consider the supervised learning classification problem below. If you want to apply logistic regression to this problem, one thing you could do is apply logistic regression with a lot of nonlinear features. Here g, as usual, is the sigmoid function, and if you include enough polynomial terms, maybe you can get a hypothesis that separates the positive and negative examples. This particular method works well when you have only, say, two features, x1 and x2, because you can then include all the polynomial terms of x1 and x2.

However, if we have 100 features and include all the quadratic terms, we end up with about five thousand features. Asymptotically, the number of quadratic features grows roughly as $O(n^{2})$. And 5000 features already seems like a lot. If you were to also include the cubic, or third-order, terms such as $x_{1}x_{2}x_{3}$, their number grows as $O(n^{3})$, and you end up with on the order of 170,000 such cubic features.

With n = 100 there are about 170,000 cubic features because the number of cubic terms grows roughly as $n^{3}/6$. See this discussion:
https://www.coursera.org/learn/machine-learning/lecture/OAOhO/non-linear-hypotheses/discussions/qY4rhkX8EeWySA6VF2_0Lw
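
As a quick sanity check on these counts, here is a minimal sketch that counts the distinct monomials of a given degree in n features. It assumes that repeated factors such as $x_{1}^{2}$ are included (combinations with repetition); the function name `count_monomials` is illustrative and not from the lecture.

```python
from math import comb


def count_monomials(n_features: int, degree: int) -> int:
    """Number of distinct products of `degree` features, repeats allowed.

    Choosing `degree` items from `n_features` with repetition gives
    C(n_features + degree - 1, degree).
    """
    return comb(n_features + degree - 1, degree)


n = 100
print(count_monomials(n, 2))  # quadratic terms: 5050
print(count_monomials(n, 3))  # cubic terms: 171700
print(round(n ** 3 / 6))      # the n^3/6 approximation: 166667
```

Both the exact count and the $n^{3}/6$ approximation land close to the "about 170,000" figure quoted above.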

<img src="images/lec8_pic01.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/OAOhO/non-linear-hypotheses) 5:07*

<!--TEASER_END-->

For many machine learning problems, n will be pretty large. Here's an example. Let's consider the problem of computer vision. And suppose you want to use machine learning to train a classifier to examine an image and tell us whether or not the image is a car. To understand why computer vision is hard let's zoom into a small part of the image like that area where the little red rectangle is. It turns out that where you and I see a car, the computer sees this matrix, or this grid, of pixel intensity values that tells us the brightness of each pixel in the image.

<img src="images/lec8_pic02.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/OAOhO/non-linear-hypotheses) 5:15*

<!--TEASER_END-->

Concretely, when we use machine learning to build a car detector, we come up with a labeled training set, with a few labeled examples of cars and a few labeled examples of things that are not cars. We then give our training set to the learning algorithm to train a classifier, and then we may test it by showing it a new image and asking, "What is this new thing?" Hopefully it will recognize that it is a car.

<img src="images/lec8_pic03.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/OAOhO/non-linear-hypotheses) 6:15*

<!--TEASER_END-->

To understand why we need nonlinear hypotheses, let's take a look at some of the images of cars and maybe non-cars that we might feed to our learning algorithm. 

Let's pick a couple of pixel locations in our images, pixel 1 and pixel 2, and plot this car at a point determined by the intensities of pixel 1 and pixel 2. Now let's take a different example of a car and look at the same two pixel locations. That image has a different intensity for pixel 1 and a different intensity for pixel 2, so it ends up at a different location on the figure. And then let's plot some negative examples as well.

<img src="images/lec8_pic04.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/OAOhO/non-linear-hypotheses) 7:00*

<!--TEASER_END-->

And if we do this for more and more examples, using pluses to denote cars and minuses to denote non-cars, what we'll find is that the cars and non-cars end up lying in different regions of the space. What we need, therefore, is some sort of non-linear hypothesis to separate the two classes.

What is the dimension of the feature space? Suppose we were to use just 50 by 50 pixel images, so our images are pretty small, just 50 pixels on a side. Then we would have 2500 pixels, so n equals 2500 if we use grayscale intensity values. If we were using RGB images with separate red, green and blue values, we would have n equals 7500.

So, if we were to try to learn a nonlinear hypothesis by including all the quadratic features, that is, all the terms of the form $(x_{i} \times x_{j})$, then with the 2500 pixels we would end up with a total of about three million features.

And that's just too large to be reasonable; the computation would be very expensive to find and to represent all of these three million features per training example. 

<img src="images/lec8_pic05.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/OAOhO/non-linear-hypotheses) 8:30*

<!--TEASER_END-->

<img src="images/lec8_pic06.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/OAOhO/non-linear-hypotheses) 8:54*

<!--TEASER_END-->

- Explanation of the quiz:
https://www.coursera.org/learn/machine-learning/lecture/OAOhO/non-linear-hypotheses/discussions/TtqcbxzGEeWmISIAC9QOog

Perhaps a simpler case will make things clearer. Let's say we use a 1x2 grid of pixels for our classification problem instead of a 100x100 grid.

```
Sample Pixels (1x2)
+----+----+
| x1 | x2 |
+----+----+
```

Imagine when plotting our training set, we noticed that it can't be separated easily with a linear model, so we choose to add polynomial terms to better fit the data.

Let's say we decide to construct our polynomials by including all of the pixel intensities, and all possible products of two intensities that can be formed from them.

Since our matrix is small, let's enumerate them:

$$ x_{1}, x_{2}, x_{1}^{2}, x_{2}^{2}, x_{1}\cdot x_{2}, x_{2}\cdot x_{1}$$

After inspecting the above features, we can see that a pattern emerges. The first two terms, **group 1**, are features consisting only of their pixel intensity. The following two terms after that, **group 2**, are features consisting of the square of their intensity. The last two terms, **group 3**, are the products of all the combinations of pairwise (two) pixel intensities.

- group 1: $x_{1}, x_{2}$
- group 2: $x_{1}^{2}, x_{2}^{2}$
- group 3: $x_{1}\cdot x_{2}, x_{2}\cdot x_{1}$

But wait, there is a problem. If you look at the group 3 terms in the sequence ($x_{1}\cdot x_{2}$ and $x_{2}\cdot x_{1}$) you'll notice that they are equal. Remember our housing example. Imagine having two features, x1 = square footage and x2 = square footage, for the same house. That doesn't make any sense! So we need to get rid of the duplicate feature, let's say arbitrarily $x_{2}\cdot x_{1}$. Now we can rewrite the list of group 3 features as:

- group 3: $x_{1}\cdot x_{2}$

We count the features in all three groups and get 5.

But this is a toy example. Let's derive a generic formula for calculating the number of features for an $m \times n$ grid of pixels. Let's use our original groups of features as a starting point, and as a first, naive guess count $m\cdot n$ features in each group:

$$\text{size}(\text{group 1}) + \text{size}(\text{group 2}) + \text{size}(\text{group 3}) = m\cdot n+m\cdot n+m\cdot n=3m\cdot n$$

Ah! But we had to get rid of the duplicate product in group 3.

So to properly count the features in group 3 we need a way to count all unique pairwise products in the matrix. This can be done with the binomial coefficient, which counts the unique subgroups of size k that can be drawn from a group of at least k items. So the number of features in group 3 is $C(m\cdot n,2)$.

So our generic formula would be:

$$m\cdot n+m\cdot n+C(m\cdot n,2)=2m\cdot n+C(m\cdot n,2)$$

Let's use it to calculate the number of features in our toy example:

$$2\cdot 1\cdot 2+C(1\cdot 2,2)=4+1=5$$

That's it!
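
The formula above can be checked with a short sketch. The function name `count_grid_features` is hypothetical; `m` and `n` are the grid's height and width, as in the derivation, and a single grayscale intensity per pixel is assumed.

```python
from math import comb


def count_grid_features(m: int, n: int) -> int:
    """Linear terms + squared terms + unique pairwise products for an m x n grid."""
    pixels = m * n
    return 2 * pixels + comb(pixels, 2)


print(count_grid_features(1, 2))      # toy example: 5
print(count_grid_features(50, 50))    # 50x50 grayscale image: 3128750
print(count_grid_features(100, 100))  # 100x100 grayscale image: 50015000
```

The 50x50 result agrees with the roughly three million features mentioned earlier for 2500 pixels.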

## 2) Neurons and the Brain

- **The origins of Neural Networks** were as algorithms that try to mimic the brain. There was a sense that if we want to build learning systems, why not mimic perhaps the most amazing learning machine we know about, which is the brain?
- Neural Networks came to be very widely used throughout the 1980s and 1990s, and for various reasons their popularity diminished in the late 90s.
- But more recently, Neural Networks have had a major resurgence. One of the reasons is that Neural Networks are a computationally somewhat more expensive algorithm, so it was only relatively recently that computers became fast enough to really run large-scale Neural Networks. Because of that, as well as a few other technical reasons which we'll talk about later, modern Neural Networks are the state-of-the-art technique for many applications.

<img src="images/lec8_pic07.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/IPmzw/neurons-and-the-brain) 1:40*

<!--TEASER_END-->

The brain can learn to see and process images, learn to hear, learn to process our sense of touch, learn to do math, learn to do calculus; the brain does so many different and amazing things. It seems like if you want to mimic the brain, you have to write lots of different pieces of software to mimic all of these different fascinating, amazing things that the brain does.

There is a fascinating hypothesis that the way the brain does all of these different things is not with something like a thousand different programs, but instead with just <u>**a single learning algorithm**</u>.

This is just a hypothesis but let me share with you some of the evidence for this. This part of the brain, that little red part of the brain, is your auditory cortex and the way you're understanding my voice now is your ear is taking the sound signal and routing the sound signal to your auditory cortex and that's what's allowing you to understand my words. Neuroscientists have done the following fascinating experiments where you cut the wire from the ears to the auditory cortex and you re-wire, in this case an animal's brain, so that the signal from the eyes to the optic nerve eventually gets routed to the auditory cortex. If you do this it turns out, the auditory cortex will learn to see. And this is in every single sense of the word see as we know it. 

<img src="images/lec8_pic08.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/IPmzw/neurons-and-the-brain) 3:18*

<!--TEASER_END-->

Here's another example. That red piece of brain tissue is your **somatosensory cortex**. That's how you process your sense of touch. If you do a similar re-wiring process, then the somatosensory cortex will learn to see. This and other similar experiments are called **neuro-rewiring experiments**.

There's this sense that if the same piece of physical brain tissue can process sight or sound or touch, then maybe there is one learning algorithm that can process sight or sound or touch. And instead of needing to implement a thousand different programs or a thousand different algorithms to do the thousand wonderful things that the brain does, maybe what we need to do is figure out some approximation to whatever the brain's learning algorithm is and implement that, and let it learn by itself how to process these different types of data, as the brain does.

<img src="images/lec8_pic09.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/IPmzw/neurons-and-the-brain) 3:30*

<!--TEASER_END-->

Here are a few more examples. 
- On the upper left is an example of learning to see with your tongue. This is actually a system called BrainPort, undergoing FDA trials now to help blind people see. The way it works is that you strap a grayscale camera to your forehead, facing forward, that takes a low-resolution grayscale image of what's in front of you. You then run a wire to an array of electrodes that you place on your tongue, so that each pixel gets mapped to a location on your tongue, where maybe a high voltage corresponds to a dark pixel and a low voltage corresponds to a bright pixel. With this sort of system, you and I would be able to learn to see, in tens of minutes, with our tongues.
- Here's a second example: human echolocation, or human sonar. There are two ways you can do this: you can either snap your fingers or click your tongue. There are blind people today who are actually being trained in schools to do this and to learn to interpret the pattern of sounds bouncing off their environment. That's sonar.
- A third example is the haptic belt: if you strap a ring of buzzers around your waist and always have the northmost one buzzing, you can give a human a sense of direction, similar to how birds can sense where north is.
- And, as a somewhat bizarre example, if you plug a third eye into a frog, the frog will learn to use that eye as well.

<img src="images/lec8_pic10.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/IPmzw/neurons-and-the-brain) 6:28*

<!--TEASER_END-->

# 2) Neural Networks

## 1) Model Representation I

I want to start telling you about how we represent neural networks, or how we represent our hypothesis or how we represent our model when using neural networks.

Neural networks were developed as simulating neurons, or networks of neurons, in the brain. So, to explain the hypothesis representation, let's start by looking at what a single neuron in the brain looks like. Your brain and mine are jam-packed full of neurons like these, and neurons are cells in the brain. There are two things to draw attention to.

First, the neuron has a cell body and a number of input wires, which are called dendrites. Second, a neuron also has an output wire called an axon, which it uses to send signals to other neurons. So, at a simplistic level, a neuron is a computational unit that gets a number of inputs through its input wires, does some computation, and then sends outputs via its axon to other nodes, or other neurons, in the brain.

<img src="images/lec8_pic11.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/ka3jK/model-representation-i) 1:00*

<!--TEASER_END-->

Here's an illustration of a group of neurons. Neurons communicate with each other with little pulses of electricity, also called spikes. So here is one neuron, and if it wants to send a message, it sends a little pulse of electricity via its axon to some different neuron. Here, this axon, that is, this output wire, connects to the dendrites of this second neuron over here, which then accepts the incoming message and does some computation. It may, in turn, decide to send out its own message on its axon to other neurons, and this is the process by which all human thought happens: neurons doing computations and passing messages to other neurons as a result of the inputs they've received.

<img src="images/lec8_pic12.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/ka3jK/model-representation-i) 2:30*

<!--TEASER_END-->

In a neural network, or rather, in an artificial neural network that we implement on a computer, we're going to use a very simple model of what a neuron does: we're going to model a neuron as just a logistic unit.

Think of the yellow circle as playing a role analogous to the body of a neuron. We then feed the neuron a few inputs through its dendrites, or input wires. The neuron does some computation and outputs some value on its output wire, which in the biological neuron is the axon.

The diagram represents a computation of $h_{\theta}(x) = \dfrac{1}{1 + e^{-\theta^{T}x}}$, where, as usual, x is our feature vector and theta is our parameter vector.

So this is a very simple model of the computations that the neuron does, where it gets a number of inputs, x1, x2, x3, and it outputs some computed value. Sometimes, when it's useful to do so, I'll draw an extra node for x0. This x0 is sometimes called the bias unit or the bias neuron, but because x0 is always equal to 1, sometimes I'll draw it and sometimes I won't, depending on whatever is more notationally convenient for that example.

Finally, one last bit of terminology: when we talk about neural networks, sometimes we'll say that this is a neuron, or an artificial neuron, with a sigmoid or logistic activation function. Activation function is just another term for that non-linearity: $g(z) = \dfrac{1}{1 + e^{-z}}$

In neural network literature you might sometimes hear people talk about the weights of a model. Weights means exactly the same thing as parameters of a model, so:
- **theta: parameters of a model, or weights of a model**
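
To make the logistic unit concrete, here is a minimal Python sketch of a single neuron computing $g(\theta^{T}x)$. The weights and inputs are made-up values chosen only for illustration, and the names `sigmoid` and `neuron_output` are not from the lecture.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(theta: list[float], x: list[float]) -> float:
    """Compute h_theta(x) = g(theta^T x); x[0] is the bias input x0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


# Hypothetical weights and inputs, chosen only for illustration.
theta = [-1.0, 2.0, 0.5, -0.5]
x = [1.0, 0.3, 0.8, 0.1]  # x0 = 1 is the bias unit
print(neuron_output(theta, x))
```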

This little diagram represents a single neuron.

<img src="images/lec8_pic13.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/ka3jK/model-representation-i) 5:20*

<!--TEASER_END-->

A neural network is just a group of these different neurons strung together. We have:
- inputs: x1, x2, x3, and sometimes you can draw the extra node x0 and sometimes not.
- neurons: a1, a2, a3, and you can also add the extra bias unit a0; the value of the bias unit a0 is always 1.
- a third and final layer with a node that outputs the value that the hypothesis h(x) computes.

**Terminology:**
- first layer: input layer
- final layer: output layer, because that layer has a neuron that outputs the final value computed by the hypothesis.
- layer 2 in between: hidden layer. The idea is that you get to see the inputs and outputs, whereas the hidden layer holds values you don't get to observe in the training set. It's not x, and it's not y, and so we call those hidden.

<img src="images/lec8_pic14.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/ka3jK/model-representation-i) 6:45*

<!--TEASER_END-->

To explain these specific computations represented by a neural network, here's a little bit more notation. 
- $\large a_{i}^{(j)}$: "activation" of unit i in layer j. So $a_{1}^{(2)}$ is the activation of the first unit in layer two, in our hidden layer.
- $\large\Theta^{(j)}$: matrix of weights controlling function mapping from layer j to layer j + 1, maybe the first layer to the second layer, or from the second layer to the third layer. 

So here are the computations that are represented by this diagram. This first hidden unit has its value computed as follows: $a_{1}^{(2)}$ is equal to the sigmoid function, also called the sigmoid or logistic activation function, applied to a linear combination of the inputs:

$$a_{1}^{(2)} = g(\Theta_{10}^{(1)}x_{0} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2} + \Theta_{13}^{(1)}x_{3})$$

Similarly, the second and third hidden units have their activation values computed by the analogous formulas, using the second and third rows of $\Theta^{(1)}$.

We have 3 input units and 3 hidden units, so $\Theta^{(1)}$, the matrix of parameters governing the mapping from our three input units to the three hidden units, is going to be a 3x4 matrix. $\Theta^{(1)}$ is a 3x4 matrix because:
- Let 'n' be the number of features in the training set.
- Let 'h' be the number of units in the hidden layer.
- The size of $\Theta^{(1)}$ will be: [h x (n + 1)], where the + 1 refers to the bias unit.

Generally, **if a network has $s_{j}$ units in layer j and $s_{j + 1}$ units in layer j + 1, then the matrix $\Theta^{(j)}$, which governs the function mapping from layer j to layer j + 1, will have dimension $s_{j + 1} \times (s_{j} + 1)$.** Be careful about this notation: the first factor is s with subscript j + 1, and the second is $s_{j}$ plus 1.
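
The dimension rule can be written as a tiny helper. This is only a sketch; the name `theta_shape` and its argument names are assumptions made for illustration.

```python
def theta_shape(units_in: int, units_out: int) -> tuple[int, int]:
    """Shape of Theta^(j) mapping a layer with `units_in` units to one with `units_out`.

    The extra column accounts for the bias unit of the source layer.
    """
    return (units_out, units_in + 1)


print(theta_shape(3, 3))  # Theta^(1) in the network above: (3, 4)
print(theta_shape(3, 1))  # Theta^(2), hidden layer to single output: (1, 4)
print(theta_shape(2, 4))  # e.g. s_j = 2 and s_{j+1} = 4: (4, 3)
```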

**Finally, in the output layer**, we have one more unit, which computes $h_{\Theta}(x)$. Notice the $\Theta^{(2)}$ here: theta with superscript two is the matrix of parameters, or the matrix of weights, that controls the function mapping from the hidden units, that is, the layer two units, to the one layer three unit, that is, the output unit.

To summarize, what we've done is shown how a picture like this one defines an artificial neural network, which in turn defines a function h that maps from input values x to, hopefully, some space of predictions y. These hypotheses are parameterized by parameters denoted with a capital theta, so that, as we vary theta, we get different hypotheses, that is, different functions that map x to y.

<img src="images/lec8_pic15.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/ka3jK/model-representation-i) 11:40*

<!--TEASER_END-->

<img src="images/lec8_pic16.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/ka3jK/model-representation-i) 11:42*

<!--TEASER_END-->

## 2) Model Representation II

In this section, I'd like to show you how to carry out that computation efficiently, that is, show you a vectorized implementation. Second, and more importantly, I want to start giving you intuition about why these neural network representations might be a good idea and how they can help us learn complex nonlinear hypotheses.

Consider this neural network. Previously we said that the sequence of steps we need in order to compute the output of a hypothesis is given by the equations on the left, where we compute the activation values of the three hidden units and then use those to compute the final output of our hypothesis h(x).

Now, I'm going to define a few extra terms. The underlined term, the linear combination inside g, will be $\large z_{1}^{(2)}$, so:
$$\large a_{1}^{(2)} = g(z_{1}^{(2)})$$
**The superscript 2** in parentheses means that these are values associated with layer 2, that is, with the hidden layer in the neural network.

Similarly, we have:
$$\large a_{2}^{(2)} = g(z_{2}^{(2)})$$
$$\large a_{3}^{(2)} = g(z_{3}^{(2)})$$

Now look at this block of numbers: $(\Theta_{10}^{(1)}x_{0} + \Theta_{11}^{(1)}x_{1} + \Theta_{12}^{(1)}x_{2} + \Theta_{13}^{(1)}x_{3})$. You may notice that it corresponds suspiciously closely to a matrix-vector operation, namely the matrix-vector multiplication of $\Theta^{(1)}$ times the vector x. **Using this observation we're going to be able to vectorize this computation of the neural network.**

Concretely, let's define 
- the feature vector x as usual to be the vector of 
$\begin{bmatrix} 
x_{0}
\\ x_{1}
\\ x_{2}
\\ x_{3}
\end{bmatrix}$ where x0 as usual is always equal to 1
- $z^{(2)}$ to be the vector of these z-values 
$\begin{bmatrix} 
z_{1}^{(2)}
\\ z_{2}^{(2)}
\\ z_{3}^{(2)}
\end{bmatrix}$, and $z^{(2)}$ is a three dimensional vector.

We can now vectorize the computation of $a_{1}^{(2)}, a_{2}^{(2)}, a_{3}^{(2)}$ in two steps:
- $z^{(2)} = \Theta^{(1)}x$, and that would give us this vector $z^{(2)}$.
- $a^{(2)} = g(z^{(2)})$, where $a^{(2)}$ and $g(z^{(2)})$ are both 3-dimensional vectors. So the activation g applies the sigmoid function element-wise to each of the elements of $z^{(2)}$.

And by the way, to make our notation a little more consistent with what we'll do later: in the input layer we have the inputs x, but we can also think of these as the activations of the first layer. So, if I define
- $a^{(1)} = x$, where $a^{(1)}$ is a vector

I can now write $z^{(2)} = \Theta^{(1)}a^{(1)}$. With this, we now have the vector $a^{(2)} = 
\begin{bmatrix} 
a_{1}^{(2)}
\\ a_{2}^{(2)}
\\ a_{3}^{(2)}
\end{bmatrix}$. But I need one more value: I also want $a_{0}^{(2)}$, which corresponds to the bias unit in the hidden layer that feeds into the output layer.

What we're going to do is add an extra $a_{0}^{(2)} = 1$ to the vector $a^{(2)}$, and after taking this step $a^{(2)}$ is a four-dimensional feature vector ($a^{(2)} \in \mathbb{R}^{4}$).

To compute the actual output value of our hypothesis, we then simply need to compute $z^{(3)}$, where
$$z^{(3)} = \Theta_{10}^{(2)}a_{0}^{(2)} + \Theta_{11}^{(2)}a_{1}^{(2)} + \Theta_{12}^{(2)}a_{2}^{(2)} + \Theta_{13}^{(2)}a_{3}^{(2)}$$
or simply 
$$z^{(3)} = \Theta^{(2)}a^{(2)}$$

And finally, my hypothesis output h(x) is $a^{(3)}$, the activation of my one and only unit in the output layer, which is just a real number:
$$h_{\Theta}(x) = a^{(3)} = g(z^{(3)})$$

This process of computing h(x) is called **forward propagation**. It is called that because we start off with the activations of the input units, then forward-propagate them to the hidden layer and compute the activations of the hidden layer, and then forward-propagate those to compute the activations of the output layer. What we just did is work out a vectorized implementation of this procedure.

So, if you implement it using the equations that we have on the right, these give you an efficient way of computing h(x). This forward propagation view also helps us to understand what Neural Networks might be doing and why they might help us to learn interesting nonlinear hypotheses.
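
The vectorized steps above can be sketched in a few lines of Python. This is a minimal illustration using plain lists, with made-up weights; in practice a numerical library would typically do the matrix-vector products. The helper names `mat_vec` and `forward_propagate` are assumptions, not part of the lecture.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic activation g(z)."""
    return 1.0 / (1.0 + math.exp(-z))


def mat_vec(theta, a):
    """Matrix-vector product Theta * a, with Theta given as a list of rows."""
    return [sum(w * ai for w, ai in zip(row, a)) for row in theta]


def forward_propagate(theta1, theta2, features):
    """Return h_Theta(x) for a 3-layer network with one output unit."""
    a1 = [1.0] + list(features)            # a^(1) = x, with bias x0 = 1
    z2 = mat_vec(theta1, a1)               # z^(2) = Theta^(1) a^(1)
    a2 = [1.0] + [sigmoid(z) for z in z2]  # a^(2) = g(z^(2)), plus a0^(2) = 1
    z3 = mat_vec(theta2, a2)               # z^(3) = Theta^(2) a^(2)
    a3 = [sigmoid(z) for z in z3]          # a^(3) = g(z^(3))
    return a3[0]


# Hypothetical weights: Theta^(1) is 3x4 and Theta^(2) is 1x4.
theta1 = [[0.1, 0.2, -0.3, 0.4],
          [-0.5, 0.6, 0.7, -0.8],
          [0.9, -1.0, 0.2, 0.3]]
theta2 = [[0.5, -1.0, 1.5, 2.0]]
print(forward_propagate(theta1, theta2, [1.0, 0.0, 2.0]))
```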

<img src="images/lec8_pic17.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/Hw3VK/model-representation-ii) 5:45*

<!--TEASER_END-->

Consider the following neural network, and let's say I cover up the left part of this picture for now. What's left in this picture looks a lot like logistic regression: we're using that node, which is just a logistic regression unit, to make a prediction h(x).

What the hypothesis is outputting is:
$$h_{\Theta}(x) = g(\Theta_{0}a_{0} + \Theta_{1}a_{1} + \Theta_{2}a_{2} + \Theta_{3}a_{3})$$
where the values a1, a2, a3 are those given by these three hidden units.

Now, to be consistent with my earlier notation, we need to fill in the superscript 2, and I also have the index 1 there because I have only one output unit:
$$h_{\Theta}(x) = g(\Theta_{10}^{(2)}a_{0}^{(2)} + \Theta_{11}^{(2)}a_{1}^{(2)} + \Theta_{12}^{(2)}a_{2}^{(2)} + \Theta_{13}^{(2)}a_{3}^{(2)})$$

This looks like the standard logistic regression model, except that I now have a capital theta instead of lower case theta. And what this is doing is just logistic regression. But the features fed into logistic regression are these values computed by the hidden layer. 

What this neural network is doing is just like logistic regression, except that rather than using the original features $x_{1},x_{2},x_{3}$ it is using these new features $a_{1}^{(2)}, a_{2}^{(2)}, a_{3}^{(2)}$.

And the cool thing about this is that the features a1, a2, a3 are themselves learned as functions of the input. Concretely, the function mapping from layer 1 to layer 2 is determined by some other set of parameters, $\Theta^{(1)}$. So it's as if the neural network, instead of being constrained to feed the features x1, x2, x3 to logistic regression, gets to learn its own features, a1, a2, a3, to feed into the logistic regression. As you can imagine, depending on what parameters it chooses for $\Theta^{(1)}$, it can learn some pretty interesting and complex features, and therefore you can end up with a better hypothesis than if you were constrained to use the raw features x1, x2, x3, or if you were constrained to, say, choose the polynomial terms $x_{1}x_{2}$, $x_{2}x_{3}$, and so on. Instead, this algorithm has the flexibility to try to learn whatever features it wants, using these a1, a2, a3 to feed into this last unit, which is essentially a logistic regression.

<img src="images/lec8_pic18.png">
<img src="images/lec8_pic19.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/Hw3VK/model-representation-ii) 9:20*

<!--TEASER_END-->

You can have neural networks with other types of diagrams as well, and the way that neural networks are connected, that's called **the architecture**. So the term architecture refers to how the different neurons are connected to each other. 

This is an example of a different neural network architecture.
- the second layer has 3 hidden units that compute some complex function of the input layer.
- the third layer can take the second layer's features and compute even more complex features, so that by the time you get to the output layer, layer four, you can have even more complex features than what you were able to compute in layer three, and so get very interesting nonlinear hypotheses.

In a network like this,
- layer one: input layer
- layer four: output layer
- layer 2 and 3: there are two hidden layers
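
Forward propagation extends to deeper architectures by repeating the same step once per weight matrix. The sketch below assumes layer sizes 3, 3, 2 and 1 (the size of the third layer is a guess for illustration) and uses all-zero weights purely to keep the example simple.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic activation g(z)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward_propagate(thetas, features):
    """Propagate activations through every layer; returns the output activations."""
    activations = list(features)
    for theta in thetas:
        with_bias = [1.0] + activations  # prepend the bias unit a0 = 1
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, with_bias)))
            for row in theta
        ]
    return activations


# Assumed layer sizes 3 -> 3 -> 2 -> 1, so the shapes are 3x4, 2x4 and 1x3.
# All weights are zero purely to keep the illustration simple.
layer_sizes = [3, 3, 2, 1]
thetas = [
    [[0.0] * (n_in + 1) for _ in range(n_out)]
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
print(forward_propagate(thetas, [0.2, 0.4, 0.6]))  # [0.5] because every z is 0
```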

<img src="images/lec8_pic20.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/Hw3VK/model-representation-ii) 10:30*

<!--TEASER_END-->

<img src="images/lec8_pic21.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/Hw3VK/model-representation-ii) 10:51*

<!--TEASER_END-->

To summarize, you now know how the forward propagation step in a neural network works: you start from the activations of the input layer and forward-propagate them to the first hidden layer, then the second hidden layer, and finally the output layer. You also saw how we can vectorize that computation.

# 3) Applications

## 1) Examples and Intuitions I

Consider the following problem where we have features $x_{1}$ and $x_{2}$ that are binary values. So, $x_{1}$ and $x_{2}$ can each take on only one of two possible values. We only have two positive examples and two negative examples, you can think of this as a simplified version of a more complex learning problem where we may have a bunch of positive examples in the upper right and lower left and a bunch of negative examples denoted by the circles. 

**What we'd like to do is learn a non-linear decision boundary that separates the positive and negative examples using a neural network.**

Concretely, what this is really computing is the label y = $x_{1}$ XOR $x_{2}$:
- **y = $x_{1}$ XOR $x_{2}$**, which is true only if exactly one of $x_{1}$ or $x_{2}$ is equal to 1

It turns out that this specific example works out a little bit better if we use the XNOR example instead:
- **y = $x_{1}$ XNOR $x_{2}$ = NOT($x_{1}$ XOR $x_{2}$)**, which means we have positive examples, y equals 1, if either both are true or both are false, and y equals 0 if only one of them is true. We're going to figure out if we can get a neural network to fit this sort of training set.

In order to build up to a network that fits the XNOR example we're going to start with a slightly simpler one and show a network that fits the AND function. 

<img src="images/lec8_pic22.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/rBZmG/examples-and-intuitions-i) 1:55*

<!--TEASER_END-->

Let's say we have 
- input $x_{1}$ and $x_{2}$ that are again binaries so $x_{1}$, $x_{2} \in {0,1}$
- y = $x_{1}$ AND $x_{2}$. This is a logical AND.

**Can we get a one-unit network to compute this logical AND function?**

In order to do so, I'm going to actually draw in the bias unit, which is the (+1) unit. Then let me assign some values to the weights:
- $x_{0}$ (+1 unit): -30, meaning a parameter value of -30 on the edge from the $x_{0}$ unit
- $x_{1}$: +20, meaning a parameter value of +20 that multiplies $x_{1}$
- $x_{2}$: +20, meaning a parameter value of +20 that multiplies $x_{2}$

Based on this, we have the hypothesis:
$$h_{\Theta}(x) = g(-30 + 20x_{1} + 20x_{2})$$

In the notation used earlier, these weights are the parameters:
- $\Theta_{10}^{(1)}$: -30
- $\Theta_{11}^{(1)}$: +20
- $\Theta_{12}^{(1)}$: +20

It's just easier to think about them as being associated with the edges of the network.

Just to remind you, here is the sigmoid activation function g(z): it starts from 0, rises smoothly, crosses 0.5 at z = 0, and then asymptotes at 1. To give you some landmarks:
- if the horizontal axis value z is equal to 4.6 then the sigmoid function is equal to 0.99. This is very close to 1 
- if z = -4.6 then the sigmoid function there is 0.01 which is very close to 0.
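The landmarks above are easy to check numerically. The snippet below is a minimal sketch; the function name `sigmoid` is just an illustrative choice and does not come from the lecture.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# Evaluate g at the landmark points mentioned in the text.
for z in (-4.6, 0.0, 4.6):
    print(f"g({z:+.1f}) = {sigmoid(z):.2f}")
```

Rounded to two decimals, this gives 0.01, 0.50 and 0.99, matching the landmarks.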

Let's look at the four possible input values for $x_{1}$ and $x_{2}$ and look at what the hypothesis will output in each case.

| $x_{1}$ | $x_{2}$ | $h_{\Theta}(x)$ |
| :-------------: |:-------------:| :-----:|
| 0 | 0 | $g(-30 + 20(0) + 20(0)) = g(-30) \approx 0$ |
| 0 | 1 | $g(-10) \approx 0$ |
| 1 | 0 | $g(-10) \approx 0$ |
| 1 | 1 | $g(10) \approx 1$ |

So, by writing out our little truth table like this, we manage to figure out what logical function our neural network computes: the output is close to 1 only when both inputs are 1, so $h_{\Theta}(x) \approx x_{1}$ AND $x_{2}$.
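The truth table can also be reproduced in a few lines of Python. This is only a sketch under the weights given above (-30, +20, +20); the helper names `unit` and `AND_THETA` are hypothetical, chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def unit(theta, x1, x2):
    """One sigmoid unit; theta = (bias weight, weight on x1, weight on x2)."""
    return sigmoid(theta[0] + theta[1] * x1 + theta[2] * x2)


AND_THETA = (-30, 20, 20)  # weights from the AND example

# Walk through the four rows of the truth table.
for x1 in (0, 1):
    for x2 in (0, 1):
        h = unit(AND_THETA, x1, x2)
        print(x1, x2, f"{h:.5f}", round(h))
```

Swapping in a different weight triple, for instance (-10, 20, 20), lets you explore which logical function other settings compute.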


<img src="images/lec8_pic23.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/rBZmG/examples-and-intuitions-i) 5:55*

<!--TEASER_END-->

**The example below computes the OR function**, using the weights -10, +20, +20:
$$h_{\Theta}(x) = g(-10 + 20x_{1} + 20x_{2})$$

| $x_{1}$ | $x_{2}$ | $h_{\Theta}(x)$ |
| :-------------: |:-------------:| :-----:|
| 0 | 0 | $g(-10) \approx 0$ |
| 0 | 1 | $g(10) \approx 1$ |
| 1 | 0 | $g(10) \approx 1$ |
| 1 | 1 | $g(30) \approx 1$ |

<img src="images/lec8_pic24.png">
<img src="images/lec8_pic25.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/rBZmG/examples-and-intuitions-i) 6:15*

<!--TEASER_END-->

## 2) Examples and Intuitions II

I'd like to keep working through our example to show how a Neural Network can compute complex nonlinear hypotheses.

In the last video we saw how a Neural Network can be used to compute the functions **$x_{1}$ AND $x_{2}$**, and the function **$x_{1}$ OR $x_{2}$** when $x_{1}$ and $x_{2}$ are binary, that is, when they take on the values 0 or 1. We can also have a network to compute **negation**, that is to compute the function NOT $x_{1}$.

Let me just write down the weights associated with this network. We have only one input feature $x_{1}$ in this case and the bias unit +1. And if I associate these with the weights +10 (for the bias unit) and -20 (for $x_{1}$), then my hypothesis is computing this:
$$h_{\Theta}(x) = g(10 - 20x_{1})$$

| $x_{1}$ | $h_{\Theta}(x)$ |
| :-------------: |:-------------:|
| 0 | $g(10) \approx 1$ |
| 1 | $g(-10) \approx 0$ |

And if you look at what these values are, that's essentially the NOT $x_{1}$ function. 


**To include negations, the general idea is to put a large negative weight in front of the variable you want to negate.**

For example, -20 multiplied by $x_{1}$ and that's the general idea of how you end up negating $x_{1}$. So if you want to compute: **(NOT $x_{1}$) AND (NOT $x_{2}$)**, part of that will probably be putting large negative weights in front of $x_{1}$ and $x_{2}$. So this logical function: **(NOT $x_{1}$) AND (NOT $x_{2}$)**, is going to be equal to 1 if and only if

$$x_{1} = x_{2} = 0$$
 
The logical function above **(NOT $x_{1}$) AND (NOT $x_{2}$)** means that:
- (NOT $x_{1}$) means $x_{1}$ must be 0 
- (NOT $x_{2}$) means $x_{2}$ must be 0 

So this logical function is equal to 1 if and only if both $x_{1}$ and $x_{2}$ are equal to 0 and hopefully you should be able to figure out how to make a small neural network to compute this logical function as well.
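One possible answer can be sketched in Python. It assumes the weights 10, -20, -20, which are the values these notes use later for the second hidden unit of the XNOR network; the function names `not_unit` and `not_and_not_unit` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def not_unit(x1):
    """NOT x1: bias weight +10, weight -20 on x1."""
    return sigmoid(10 - 20 * x1)


def not_and_not_unit(x1, x2):
    """(NOT x1) AND (NOT x2): bias weight +10, weight -20 on each input."""
    return sigmoid(10 - 20 * x1 - 20 * x2)


print([round(not_unit(x1)) for x1 in (0, 1)])
print([round(not_and_not_unit(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)])
```

The large negative weights pull the pre-activation well below zero as soon as any input is 1, so only the input (0, 0) leaves the bias of +10 in charge.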



<img src="images/lec8_pic26.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/solUx/examples-and-intuitions-ii) 2:25*

<!--TEASER_END-->

<img src="images/lec8_pic27.png">
<img src="images/lec8_pic28.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/solUx/examples-and-intuitions-ii) 2:31*

<!--TEASER_END-->

Now, taking the three pieces that we have put together as networks:
- $x_{1}$ AND $x_{2}$
- (NOT $x_{1}$) AND (NOT $x_{2}$)
- $x_{1}$ OR $x_{2}$

we should be able to put these three pieces together to compute this **$x_{1}$ XNOR $x_{2}$** function.

And just to remind you, the **XNOR** function that we want to compute has positive examples (y equals 1) when either both inputs are true or both are false, and y equals 0 if only one of them is true. So clearly this will need a nonlinear decision boundary in order to separate the positive and negative examples.

<img src="images/lec8_pic29.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/solUx/examples-and-intuitions-ii)*

<!--TEASER_END-->

**Let's draw the network.**

- <u>First hidden unit</u>: I'm going to take my input +1, $x_{1}$, $x_{2}$ and create my **first hidden unit** here. I'm gonna call this $a_{1}^{(2)}$ because that's my first hidden unit. And I'm gonna copy the weights over from the red network, which is the $x_{1}$ AND $x_{2}$ function: -30, 20, 20.
- <u>Second hidden unit</u>: let's call this $a_{2}^{(2)}$, that is the second hidden unit of layer two. I'm going to copy over the cyan network, which is the (NOT $x_{1}$) AND (NOT $x_{2}$) function, so I'm gonna have the weights: 10, -20, -20.

- Let's pull some of the truth table values, listing the outputs for the inputs ($x_{1}$, $x_{2}$) = (0,0), (0,1), (1,0), (1,1):
    - For the red network, which computes $x_{1}$ AND $x_{2}$, this will be approximately: [0 0 0 1]
    - For the cyan network, which computes (NOT $x_{1}$) AND (NOT $x_{2}$), this will be approximately: [1 0 0 0]

- <u>Output node</u>: my output unit is $a_{1}^{(3)}$, this is what will output $h_{\Theta}(x)$. I'm going to copy over the green network for that, and also need to add the +1 bias unit in the second layer. So I'm gonna have the weights: -10, 20, 20 for the OR function.

**Let's fill in the truth table entries**

| $x_{1}$ | $x_{2}$ | $a_{1}^{(2)}$ | $a_{2}^{(2)}$ | $h_{\Theta}(x) $  |
| :-------------: |:-------------:| :-----:| :-----:|:-------------:|
| 0 | 0 | 0 | 1 | 1  |
| 0 | 1 | 0 | 0 | 0  |
| 1 | 0 | 0 | 0 | 0  |
| 1 | 1 | 1 | 0 | 1  |

And thus $h_{\Theta}(x)$ is equal to 1 when either both $x_{1}$ and $x_{2}$ are zero or when $x_{1}$ and $x_{2}$ are both 1. With this neural network, which has an input layer, one hidden layer, and one output layer, we end up with a nonlinear decision boundary that computes this XNOR function.
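The whole two-layer network can be checked with a short forward-propagation sketch. The weights are the ones copied over above; the names `THETA1`, `THETA2`, `layer` and `xnor` are illustrative assumptions, and plain Python lists stand in for matrices.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# Each row holds (bias weight, weight on 1st input, weight on 2nd input).
THETA1 = [
    [-30, 20, 20],   # a_1^(2): x1 AND x2
    [10, -20, -20],  # a_2^(2): (NOT x1) AND (NOT x2)
]
THETA2 = [
    [-10, 20, 20],   # a_1^(3): OR of the two hidden units
]


def layer(theta, activations):
    """Prepend the +1 bias unit, then apply one layer of sigmoid units."""
    a = [1.0] + list(activations)
    return [sigmoid(sum(w * v for w, v in zip(row, a))) for row in theta]


def xnor(x1, x2):
    """Forward propagate: input layer -> hidden layer -> output layer."""
    hidden = layer(THETA1, [x1, x2])
    output = layer(THETA2, hidden)
    return output[0]


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))
```

Rounding the output reproduces the last column of the truth table: 1, 0, 0, 1.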

A more general intuition is that in the input layer, we just have our raw inputs. Then we have a hidden layer, which computes some slightly more complex functions of the inputs. And then by adding yet another layer we end up with an even more complex nonlinear function. And this is a sort of intuition about why neural networks can compute pretty complicated functions.

<img src="images/lec8_pic30.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/solUx/examples-and-intuitions-ii) 5:46*

<!--TEASER_END-->

When you have multiple layers, the second layer computes relatively simple functions of the inputs. The third layer can build on that to compute even more complex functions, and then the layer after that can compute even more complex functions.

<img src="images/lec8_pic31.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/solUx/examples-and-intuitions-ii) 6:12*

<!--TEASER_END-->

Below is an example of an application of a Neural Network that captures this intuition of the deeper layers computing more complex features.

This is research by Yann LeCun, in which he used a neural network to do handwritten digit recognition. For more information, please take a look at LeNet:

- http://yann.lecun.com/exdb/lenet/
- http://deeplearning.net/tutorial/lenet.html
- http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/slides/DeepLearning.pdf

<img src="images/lec8_pic32.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/solUx/examples-and-intuitions-ii) 7:50*

<!--TEASER_END-->

## 3) Multiclass Classification

In this video, I want to tell you about **how to use neural networks to do multiclass classification** where we may have more than one category that we're trying to distinguish amongst. 

In the last part of the last video, where we had the handwritten digit recognition problem, that was actually a multiclass classification problem because there were ten possible categories for recognizing the digits from 0 through 9.

**The way we do multiclass classification in a neural network is essentially an extension of the one versus all method.**

So, let's say that we have a computer vision example, and we're trying to recognize four categories of objects and given an image we want to decide if it is a pedestrian, a car, a motorcycle or a truck. If that's the case, what we would do is we would build a neural network with four output units so that our neural network now outputs a vector of four numbers. Example:
- first output unit: pedestrian, yes or no?
- second output unit: car, yes or no?

And thus, when the image is of:
- a pedestrian, we would ideally want the network to output: [1 0 0 0]
- a car, we want it to output: [0 1 0 0]

So this is just like the "one versus all" method that we talked about when we were describing logistic regression, and here we have essentially four logistic regression classifiers, each of which is trying to recognize one of the four classes that we want to distinguish amongst. 

<img src="images/lec8_pic33.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/solUx/examples-and-intuitions-ii) 2:02*

<!--TEASER_END-->

Here's our neural network with four output units and those are what we want $h_{\Theta}(x) $ to be when we have the different images, and the way we're going to represent the training set in these settings is as follows.

So, when we have a training set with different images, previously we had written out the labels as y being an integer from 1, 2, 3 or 4. Instead of representing y this way, we're going to instead represent y as follows:

$y^{(i)}$ is one of [1 0 0 0], [0 1 0 0], [0 0 1 0], [0 0 0 1]

And so one training example will be one pair $(x^{(i)}, y^{(i)})$ where $x^{(i)}$ is an image of one of the four objects and $y^{(i)}$ will be one of these vectors. 

And hopefully, we can find a way to get our Neural Networks to output some value, so that:
$h_{\Theta}(x^{(i)}) \approx y^{(i)}$. In our example, both of these are going to be four-dimensional vectors, since we have four classes. So, that's how you get a neural network to do multiclass classification.
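The label representation described above can be sketched as follows. The class order (pedestrian, car, motorcycle, truck) follows the order in which the text lists the categories, which is an assumption for the last two. Picking the class with the largest output is a common convention rather than something stated in the lecture, and the helper names `one_hot` and `predict` are hypothetical.

```python
CLASSES = ["pedestrian", "car", "motorcycle", "truck"]  # assumed ordering


def one_hot(label, num_classes):
    """Turn an integer label in 1..num_classes into a 0/1 vector."""
    return [1 if k == label else 0 for k in range(1, num_classes + 1)]


def predict(h):
    """Return the 1-based index of the largest network output (assumed rule)."""
    return max(range(len(h)), key=lambda k: h[k]) + 1


# Labels y = 1, 2, 3, 4 become the four target vectors.
for label, name in enumerate(CLASSES, start=1):
    print(name, one_hot(label, len(CLASSES)))

# A made-up network output that is closest to the "car" vector.
h = [0.10, 0.80, 0.05, 0.20]
print(CLASSES[predict(h) - 1])
```

The output vector `h` here is invented purely for illustration; a trained network would produce it by forward propagation.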

<img src="images/lec8_pic34.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/solUx/examples-and-intuitions-ii) 3:20*

<!--TEASER_END-->

Generally, **if a network has $s_{j}$ units in layer j and $s_{j + 1}$ units in layer j + 1, then the matrix $\Theta^{(j)}$, which governs the function mapping from layer j to layer j + 1, will have the dimension $s_{j + 1} \times (s_{j} + 1)$.**

So, in the quiz below (with j = 2):
- $\Theta^{(2)}$ maps from Layer 2 to Layer 3.
- hidden layer $s_{j}$: 5 units
- output layer $s_{j + 1}$: 10 units (10 classes)

$\Theta^{(2)}$ will have $s_{j + 1} \times (s_{j} + 1) = 10 \times (5 + 1) = 10 \times 6 = 60$ elements.
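The dimension rule is simple enough to express as a small helper. This is only an illustrative sketch; the function name `theta_shape` is hypothetical.

```python
def theta_shape(s_j, s_next):
    """Shape of Theta^(j): s_(j+1) rows, s_j + 1 columns (the +1 is the bias unit)."""
    return (s_next, s_j + 1)


# Quiz setting: 5 hidden units feeding 10 output units.
rows, cols = theta_shape(5, 10)
print((rows, cols), rows * cols)  # (10, 6) 60
```

Applied to the XNOR network from the previous section (2 inputs, 2 hidden units, 1 output), the same rule gives a 2 x 3 matrix for the first layer and a 1 x 3 matrix for the second.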

<img src="images/lec8_pic35.png">

*Screenshot taken from [Coursera](https://www.coursera.org/learn/machine-learning/lecture/solUx/examples-and-intuitions-ii) 3:31*

<!--TEASER_END-->