(NNB)=
# 2 Neural Networks Basics

Set up a machine learning problem with a neural network mindset and use vectorization to speed up your models.

**Learning Objectives**

- Build a logistic regression model structured as a shallow neural network.

- Build the general architecture of a learning algorithm, including parameter initialization, cost function and gradient calculation, and optimization implementation (gradient descent).

- Implement computationally efficient and highly vectorized versions of models.

- Compute derivatives for logistic regression, using a backpropagation mindset.

- Use NumPy functions and NumPy matrix/vector operations.

- Work with IPython (Jupyter) notebooks.

- Implement vectorization across multiple training examples.

- Explain the concept of broadcasting.

---

## Logistic Regression as a Neural Network

### Binary Classification

[Video](https://youtu.be/kbrSXl43iPM)

Logistic regression is an algorithm for binary classification. So let's start by setting up the problem. 

Here's an example of a binary classification problem. You might have an image as input and want to output a label that recognizes the image as either a cat, in which case you output 1, or not a cat, in which case you output 0. We're going to use $y$ to denote the output label. Let's look at how an image is represented in a computer. 

To store an image your computer stores three separate matrices corresponding to the red, green, and blue color channels of this image.

```{figure} images/2-1.png
---
height: 300px
name: 2-1
---
```

So if your input image is 64 pixels by 64 pixels ($64\times64$), then you would have **three** 64 by 64 matrices corresponding to the red, green, and blue pixel intensity values of your image, which can be represented as $64\times64\times3$. To keep the example small, I drew these as much smaller matrices, so the ones in the figure are actually 5 by 4 rather than 64 by 64 ($5\times4\times3$). 

To turn these pixel intensity values into a feature vector, we ***unroll*** all of them into an input feature vector $x$. For this image, $x$ is defined as follows.

$$
 x = \left[
\begin{matrix}
255 \\
231 \\
42 \\
\vdots \\
124 \\
255 \\
134 \\
202 \\
\vdots \\
94 \\
255 \\
134 \\
93 \\
\vdots \\
142
\end{matrix}
\right], \quad y = 1
$$

- We take all the pixel values 255, 231, and so on until we've listed all the red pixels. 

- Then we continue with 255, 134, and so on for the green and blue pixels, until we get one long feature vector listing all the red, green, and blue pixel intensity values of this image. 

If this image is a 64 by 64 image, the total dimension of this vector $x$ will be $64\times64\times3 = 12{,}288$, because that's how many numbers there are in all of these matrices. We use $n_x = 12{,}288$ to represent the dimension of the input features $x$. Sometimes, for brevity, I will also just use lowercase $n$ to represent the dimension of this input feature vector. 
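The unrolling step can be sketched in NumPy. This is only an illustration: the random `image` array is a hypothetical stand-in for a real picture, and it assumes the common (height, width, channels) layout, so the channel axis is moved to the front to list all red values first, then green, then blue.

```python
import numpy as np

# Hypothetical stand-in for a 64x64 RGB image: shape (height, width, channels).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3))

# Move the channel axis first so the vector lists all red values,
# then all green values, then all blue values, as in the text.
x = image.transpose(2, 0, 1).reshape(-1, 1)

n_x = x.shape[0]
print(x.shape)  # (12288, 1)
```

A plain `image.reshape(-1, 1)` would also give a 12,288-dimensional vector, just with the channels interleaved; either ordering works as long as it is applied consistently to every image.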

So in binary classification, our goal is to learn a classifier that takes as input an image represented by this feature vector $x$ and predicts whether the corresponding label $y$ is 1 or 0, that is, whether this is a cat image or a non-cat image.

Let's now lay out some of the notation that we'll use throughout the rest of this book.

- A single training example is represented by a pair $(x,y)$, where $x$ is an $n_x$-dimensional feature vector $(x \in  \mathbb{R}^{n_x})$ and $y$, the label, is either 0 or 1 $(y \in \{0,1\})$.

- $m$ training examples:  $\{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)}) \}$.

- $m_{\text{train}}$ denotes the number of training samples; $m_{\text{test}}$ denotes the number of test samples.

- $X \in \mathbb{R}^{n_x \times m}$ is the input matrix. In Python, `X.shape` is $(n_x,m)$, that is, $n_x$ rows and $m$ columns.

- $x^{(i)} \in \mathbb{R}^{n_x}$ is the $i^{th}$ example represented as a column vector.

$$X= \left[
\begin{matrix}
\vdots & \vdots & \cdots & \vdots \\
x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\
\vdots & \vdots & \cdots & \vdots 
\end{matrix}
\right]
$$

- $Y \in \mathbb{R}^{1 \times m}$ is the label matrix. In Python, `Y.shape` is $(1,m)$. 

$$Y=[y^{(1)}, y^{(2)}, \cdots, y^{(m)}]$$

- $y^{(i)}$ is the output label for the $i^{th}$ example.
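As a small illustration of this notation, the sketch below stacks a few examples as columns of `X` and lays the labels out as a row `Y`. The toy values and the names `examples` and `labels` are made up for the example.

```python
import numpy as np

# Hypothetical toy data: m = 3 examples, each with n_x = 4 features.
examples = [np.array([1.0, 2.0, 3.0, 4.0]),
            np.array([5.0, 6.0, 7.0, 8.0]),
            np.array([9.0, 10.0, 11.0, 12.0])]
labels = [1, 0, 1]

X = np.column_stack(examples)        # each example becomes one column
Y = np.array(labels).reshape(1, -1)  # labels as a 1 x m row

print(X.shape)  # (4, 3), i.e. (n_x, m)
print(Y.shape)  # (1, 3), i.e. (1, m)
```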



### Logistic Regression

Logistic regression is a learning algorithm used in supervised learning problems where the output labels $y$ are all either zero or one. The goal of logistic regression is to minimize the error between its predictions and the training data.

[Video](https://youtu.be/4u0_TJhNGY8)

In this section, we will go over the logistic regression model for these binary classification problems.

Example: cat vs. non-cat

- Given an input feature vector $x$, perhaps corresponding to an image that you want to recognize as either a cat picture or not a cat picture, you want an algorithm that can output a prediction, which we will call y hat $(\hat{y})$, your estimate of $y$. More formally, you want $\hat{y}$ to be the ***probability*** that $y$ is equal to one given the input features $x$, that is, $\hat{y} = \mathrm{P}(y=1\mid x)$. In other words, if $x$ is a picture, you want $\hat{y}$ to tell you the chance that this is a cat picture.

$$\text{Given } x, \quad \hat{y} = \mathrm{P}(y=1\mid x), \quad \text{where } 0 \leq \hat{y} \leq 1$$

The notation used in logistic regression is:

- The input feature vector: $x \in \mathbb{R}^{n_x}$, where $n_x$ is the number of features. That is, $x$ is an $n_x$-dimensional vector.

- The training label: $y \in \{0,1\}$.

- The weights: $w \in \mathbb{R}^{n_x}$. $w$ is also an $n_x$-dimensional vector.

- The bias (sometimes called the threshold): $b \in \mathbb{R}$

- Summary: the parameters of logistic regression are $w$, which is an $n_x$-dimensional vector $(w \in \mathbb{R}^{n_x})$, together with $b$, which is just a real number $(b \in \mathbb{R})$. 

- So given an input $x$ and the parameters $w$ and $b$, how do we generate the output $\hat{y}$?  

$$\hat{y} = \text{sigmoid} (z) = \sigma(z) = \sigma(w^T x + b)$$ (2-1)


```{figure} images/2-2-sigmoid.png
---
height: 200px
name: 2-2
---
```

```{admonition} Why are we using the sigmoid function here?
:class: important
$(w^T x + b)$ is a linear function $(ax+b)$ and can take any real value, but we are looking for a probability, which must lie between 0 and 1, so the sigmoid function is applied. Its output always lies strictly between 0 and 1, as shown in the graph above.
```

This is what the sigmoid function looks like.  

$$\sigma(z) = \dfrac{1}{1+e^{(-z)}}$$ (2-2)

- If $z$ is a very large positive number, then $\sigma(z)$ will be close to 1.

- If $z$ is a very large negative number, then $\sigma(z)$ will be close to 0.

- If $z = 0$, then  $\sigma(z) = 0.5$
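A minimal sketch of equations (2-1) and (2-2) in NumPy follows. The function name `sigmoid` and the values of `w`, `b`, and `x` are illustrative assumptions, chosen only to show the three limiting behaviours listed above.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and input with n_x = 3 features.
w = np.array([[0.5], [-1.0], [2.0]])
b = 0.1
x = np.array([[1.0], [2.0], [0.5]])

y_hat = sigmoid(np.dot(w.T, x) + b)  # shape (1, 1)

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```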


So when you implement logistic regression, your job is to try to learn parameters $w$ and $b$ so that $\hat{y}$ becomes a good estimate of the chance of $y$ being equal to one. 

```{admonition} Practice Quiz
:class: tip
Q: **What are the parameters of logistic regression?**  </br>
A. $w$, an identity vector, and $b$, a real number.  </br>
B. $w$, an $n_x$ dimensional vector, and $b$, a real number.  </br>
C. $w$ and $b$, both $n_x$ dimensional vectors.  </br>
D. $w$ and $b$, both real numbers. 
```

### Logistic Regression Cost Function

[Video](https://youtu.be/9K62xu8MKMk)

In the previous section, you saw the logistic regression model. To train the parameters $w$ and $b$ of the logistic regression model, you need to define a cost function. Let's take a look at it. 

To learn parameters for your model, you're given a training set of $m$ training examples, and it seems natural to want parameters $w$ and $b$ such that, at least on the training set, the predictions $\hat{y}^{(i)}$ are close to the true labels $y^{(i)}$. 

- Given $m$ examples: $\{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)}) \}$, we want $\hat{y}^{(i)} \approx y^{(i)}$.

- For each training example, we use superscripts in parentheses to index into the different training examples. Your prediction on training example $i$ is: 

$$\hat{y}^{(i)} = \sigma(w^T x^{(i)} + b), \quad \text{where } \sigma(z^{(i)}) = \dfrac{1}{1+e^{(-z^{(i)})}}$$  (2-3)

- the $i$-th example: $x^{(i)}$, $y^{(i)}$, $z^{(i)}$.


Now let's see what loss function, or error function, we can use to measure how well our algorithm is doing. 


**Loss (error) function:**

$$\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = - \Big( y^{(i)} \  \text{log}(\hat{y}^{(i)}) + (1-y^{(i)}) \ \text{log}(1-\hat{y}^{(i)}) \Big)$$

$$\mathcal{L}(\hat{y}, y) = - \Big( y \  \text{log}(\hat{y}) + (1-y) \ \text{log}(1-\hat{y}) \Big)$$ (2-4)


The function $\mathcal{L}$ is called the ***loss function***. It is a function we need to define to measure how good our output $\hat{y}$ is when the true label is $y$. The loss function measures the discrepancy between the prediction $\hat{y}$ and the desired output $y$. In other words, the loss function computes the error for a single training example.

Keep in mind that if we were using squared error, we would want the squared error to be as small as possible. With this logistic regression loss function, we also want the loss to be as small as possible. To understand why this makes sense, let's look at the two cases:

- If $y = 1$, then $\mathcal{L}(\hat{y}, y) = - \text{log}(\hat{y})$. You want $- \text{log}(\hat{y})$ to be as small as possible, which means you want $\text{log}(\hat{y})$ to be as big as possible, and that means you want $\hat{y}$ to be large. But because $\hat{y}$ is the output of the sigmoid function, it can never be bigger than one. So if $y = 1$, you want $\hat{y}$ to be as **close to one** as possible.

- If $y = 0$, then $\mathcal{L}(\hat{y}, y) = - \text{log}(1-\hat{y})$. If your learning procedure tries to make the loss small, it wants $\text{log}(1-\hat{y})$ to be large. By similar reasoning, this loss function is trying to make $\hat{y}$ as small as possible, and because $0 \leq \hat{y} \leq 1$, this means that if $y = 0$, your loss function will push the parameters to make $\hat{y}$ as **close to zero** as possible.
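The two cases can be checked numerically. The sketch below assumes a helper named `loss` (not part of the original notes) that implements equation (2-4) for a single example, and compares a confident correct prediction with a confident wrong one.

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for a single example, as in equation (2-4)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# True label y = 1: a confident correct prediction costs little,
# a confident wrong prediction costs a lot.
good = loss(0.9, 1)
bad = loss(0.1, 1)
print(good, bad)

# True label y = 0: the situation is mirrored.
print(loss(0.1, 0), loss(0.9, 0))
```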

We just gave a somewhat informal justification for this particular loss function. A later optional section gives a more formal justification for why, in logistic regression, we like to use a loss function of this particular form. 

```{admonition} Why not use the squared error, $\mathcal{L}(\hat{y}, y)= \dfrac{1}{2}(\hat{y}-y)^2$, as the loss function?
:class: important
It turns out that you could do this, but in logistic regression people don't usually do it, because when you come to learn the parameters, the optimization problem becomes non-convex. You end up with an optimization problem with multiple local optima, so gradient descent may not find the global optimum. 

Squared error seems like a reasonable choice, except that it makes gradient descent not work well. So in logistic regression we define a different loss function that plays a similar role to squared error but gives us an optimization problem that is convex.
```

The loss function was defined with respect to a single training example. It measures how well you're doing on a single training example. I'm now going to define something called the cost function, which measures how well you're doing on the entire training set. 

**Cost function:**

The cost function is the average of the loss function over the entire training set. We are going to find the parameters $w$ and $b$ that minimize the overall cost function.

$$
\begin{aligned}
\boldsymbol{J}(w,b) 
&= \dfrac{1}{m} \sum_{i=1}^{m}  \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) \\
&= - \dfrac{1}{m} \sum_{i=1}^{m} \Big[ \Big( y^{(i)} \  \text{log}(\hat{y}^{(i)}) + (1-y^{(i)}) \ \text{log}(1-\hat{y}^{(i)}) \Big) \Big]
\end{aligned}$$ (2-5)

So the cost function $\boldsymbol{J}$, which is applied to your parameters $w$ and $b$, is the average, that is, one over $m$ times the sum, of the loss function applied to each of the training examples in turn.

The terminology I'm going to use is that the loss function is applied to just a single training example (see equation {eq}`2-4`), and the cost function is the cost of your parameters. So in training your logistic regression model, we're going to try to find parameters $w$ and $b$ that minimize the overall cost function $\boldsymbol{J}$ in equation {eq}`2-5`. 
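As a rough sketch of equation (2-5), the cost can be computed for all examples at once with NumPy. The function name `cost` and the example values in `A` (predictions) and `Y` (labels) are assumptions made for illustration.

```python
import numpy as np

def cost(A, Y):
    """Average cross-entropy over m examples; A and Y have shape (1, m)."""
    m = Y.shape[1]
    losses = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return np.sum(losses) / m

# Hypothetical predictions and labels for m = 4 examples.
A = np.array([[0.9, 0.2, 0.7, 0.4]])
Y = np.array([[1, 0, 1, 0]])

J = cost(A, Y)
print(J)
```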


```{admonition} Practice Quiz
:class: tip
Q: **What is the difference between the cost function and the loss function for logistic regression?**  </br>
A. The loss function computes the error for a single training example; the cost function is the average of the loss functions of the entire training set.  </br>
B. The cost function computes the error for a single training example; the loss function is the average of the cost functions of the entire training set.  </br>
C. They are different names for the same function.  </br>
```


### Explanation of Logistic Regression Cost Function (Optional)

[Video](https://youtu.be/2vzqUcivE0A)

This is an optional section. In this part, we will have a quick justification for why we like to use that cost function for logistic regression. 

#### Logistic regression cost function

In logistic regression, 

$$\hat{y} = \sigma(w^T x + b), \quad \text{where } \sigma (z) = \dfrac{1}{1+e^{-z}}$$

The interpretation is $\hat{y} = \mathrm{P}(y=1\mid x)$. So we want our algorithm to output $\hat{y}$ as the chance that $y = 1$ for a given set of input features $x$. Another way to say this is that if $y$ is equal to 1, then the chance of $y$ given $x$ is equal to $\hat{y}$, and conversely, if $y$ is equal to 0, then the chance of $y$ given $x$ is $1-\hat{y}$.

$$
\begin{aligned}
\text{If } y = 1: \qquad \mathrm{P}(y\mid x) &= \hat{y} \\
\text{If } y = 0: \qquad \mathrm{P}(y\mid x) &= 1-\hat{y} 
\end{aligned}
$$

So if $\hat{y}$ is the chance that $y = 1$, then $1-\hat{y}$ is the chance that $y = 0$. 

What I'm going to do is take these two equations, which define $\mathrm{P}(y \mid x)$ for the two cases $y = 0$ and $y = 1$, and summarize them in a single equation. Note that $y$ has to be either $0$ or $1$, because in binary classification these are the only two possible cases. The two equations can be summarized as follows:

$$\mathrm{P}(y \mid x) = \hat{y}^y \  (1-\hat{y})^{(1-y)}$$
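Substituting the two possible labels is a quick way to check that this single expression reproduces both cases:

$$
\begin{aligned}
y = 1: \qquad \hat{y}^{1} \ (1-\hat{y})^{0} &= \hat{y} \\
y = 0: \qquad \hat{y}^{0} \ (1-\hat{y})^{1} &= 1-\hat{y}
\end{aligned}
$$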

Now, because the $\text{log}$ function is strictly monotonically increasing, maximizing $\text{log} \ \mathrm{P}(y \mid x)$ gives the same result as maximizing $\mathrm{P}(y \mid x)$. So:

$$
\begin{aligned}
\text{log} \ \mathrm{P}(y \mid x)
&=  y \  \text{log}(\hat{y}) + (1-y) \ \text{log}(1-\hat{y})  \\
&=  - \mathcal{L}(\hat{y}, y)
\end{aligned}
$$

So this is actually the negative of the loss function that we defined previously. There is a negative sign because when you train a learning algorithm you usually want to make probabilities large, whereas in logistic regression we express this as minimizing a loss function. So minimizing the loss corresponds to maximizing the log of the probability.

So this is what the loss function on a single example looks like. How about the overall cost function on the entire training set of $m$ examples? Let's figure that out. 

#### Cost on $m$ examples

Consider the probability of all the labels in the training set. If you assume that the training examples are drawn independently, or IID (independent and identically distributed), then the probability of all the labels is the product of the per-example probabilities.

$$\mathrm{P}(\text{labels in training set}) = \prod_{i=1}^{m} \mathrm{P}(y^{(i)} \mid x^{(i)})$$

If you want to carry out maximum likelihood estimation, then you want to find the parameters that maximize the chance of your observations in the training set. But maximizing this is the same as maximizing its log, so we take logs of both sides:

$$
\begin{aligned}
\text{log} \ \mathrm{P}
&=  \sum_{i=1}^{m} \text{log} \  \mathrm{P}(y^{(i)} \mid x^{(i)})\\
&=  - \sum_{i=1}^{m} \ \mathcal{L}(\hat{y}^{(i)}, y^{(i)})
\end{aligned}
$$

In statistics, there's a principle called maximum likelihood estimation, which just means choosing the parameters that maximize this $\text{log} \ \mathrm{P}$.

This justifies the cost we had for logistic regression, $\boldsymbol{J}(w,b)$:

$$\boldsymbol{J}(w,b) = \dfrac{1}{m} \sum_{i=1}^{m}  \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = - \dfrac{1}{m} \text{log} \ \mathrm{P}$$

Because we now want to minimize the cost instead of maximizing the likelihood, we flip the sign, which removes the minus sign in front of the sum of losses. Finally, for convenience, to keep our quantities on a better scale, we add an extra $\frac{1}{m}$ scaling factor. To summarize, by minimizing this cost function $\boldsymbol{J}(w,b)$ we're really carrying out maximum likelihood estimation with the logistic regression model, under the assumption that our training examples are IID (independent and identically distributed).
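The relationship $\boldsymbol{J}(w,b) = -\frac{1}{m} \text{log} \ \mathrm{P}$ can be checked numerically. This is a small sketch with made-up predictions `A` and labels `Y`; it computes the likelihood as a product of per-example probabilities and compares its scaled negative log with the averaged loss.

```python
import numpy as np

# Hypothetical predictions and labels for m = 4 examples.
A = np.array([0.9, 0.2, 0.7, 0.4])
Y = np.array([1, 0, 1, 0])
m = Y.size

# Likelihood of the observed labels under the IID assumption.
per_example = A**Y * (1 - A)**(1 - Y)
likelihood = np.prod(per_example)

# Cost function from equation (2-5).
J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))

print(J, -np.log(likelihood) / m)  # the two values agree up to rounding
```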

I hope this gives you a sense of why we use the cost function we do for logistic regression. 

### Gradient Descent

[Video](https://youtu.be/Nuxx65l5pUI)



```{admonition} Practice Quiz
:class: tip
**True or False**: A convex function always has multiple local optima.  __________  </br>
```

### Derivatives

[Video](https://youtu.be/Mz4VYF4CxGg)


### More Derivative Examples

[Video](https://youtu.be/6PSzTakWAO0)

1. **The derivative just means the slope of a function.** 
    - The slope of a function can be different at different points on the function.
    - In our first example, where $f(a) = 3a$ is a straight line, the derivative was the same everywhere: it was three everywhere. For other functions like $f(a) = a^2$ or $f(a) = \text{log}(a)$, the slope varies, so the derivative can be different at different points on the curve. 

2. If you want to look up the derivative of a function, you can flip open your calculus textbook or look it up on Wikipedia, and you will often find a formula for the slope of these functions at different points. 
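One way to see the "slope at a point" idea is to estimate it numerically by nudging $a$ slightly. The helper `slope` below is an illustrative assumption (a central-difference estimate), not something defined in the lectures.

```python
import math

def slope(f, a, h=1e-6):
    """Central-difference estimate of the derivative of f at a."""
    return (f(a + h) - f(a - h)) / (2 * h)

print(slope(lambda a: 3 * a, 2.0))   # about 3, and the same at every a
print(slope(lambda a: a ** 2, 2.0))  # about 4, since d/da a^2 = 2a
print(slope(lambda a: a ** 2, 5.0))  # about 10
print(slope(math.log, 2.0))          # about 0.5, since d/da log(a) = 1/a
```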


### Computation Graph

[Video](https://youtu.be/mS-SIq7ENuQ)

```{admonition} Practice Quiz
:class: tip
Q: **One step of ________ propagation on a computation graph yields derivative of final output variable.**  </br>
A. Backward  </br>
B. Forward </br>
```


### Derivatives with a Computation Graph

[Video](https://youtu.be/VTlwSyuVIXU)


### Logistic Regression Gradient Descent

[Video](https://youtu.be/hZqWRCRu9N0)

$$
\begin{aligned}
z &= w^T x + b \\
\hat{y} &= a = \sigma(z) = \dfrac{1}{1+e^{-z}} \\
\mathcal{L}(a, y) &= - \Big( y \  \text{log}(a) + (1-y) \ \text{log}(1-a) \Big)
\end{aligned}$$ (2-6)

```{figure} images/2-3.png
---
height: 120px
name: 2-3
---
```

```{note} 
In code, the naming convention `dvar` denotes the derivative of the final output variable (here the loss $\mathcal{L}$) with respect to the intermediate quantity `var`.
```
**Derivatives:**

`da` $ = \dfrac{\mathrm{d}\mathcal{L}}{\mathrm{d}a} = -\dfrac{y}{a} + \dfrac{1-y}{1-a}$ 

Using the chain rule to calculate `dz`

`dz` $ = \dfrac{\mathrm{d}\mathcal{L}}{\mathrm{d}z} = \dfrac{\mathrm{d}\mathcal{L}}{\mathrm{d}a} \times  \dfrac{\mathrm{d}a}{\mathrm{d}z}$ 

$$
\begin{aligned}
\dfrac{\mathrm{d}a}{\mathrm{d}z} &=  \dfrac{\mathrm{d}}{\mathrm{d}z} \sigma(z)  \\
&= \dfrac{\mathrm{d}}{\mathrm{d}z} \Bigg( \dfrac{1}{1+e^{-z}} \Bigg) \\
&= \dfrac{e^{-z}}{(1+e^{-z})^2}
\end{aligned}$$ (2-7)

From equation $a = \dfrac{1}{1+e^{-z}}$, we can derive:

$$\begin{aligned}
\dfrac{\mathrm{d}a}{\mathrm{d}z} &= \dfrac{e^{-z}}{(1+e^{-z})^2} \\
&= \dfrac{1}{1+e^{-z}} \cdot \Big(1- \dfrac{1}{1+e^{-z}} \Big) \\
&= a(1-a)
\end{aligned}$$


`dz` $ = \dfrac{\mathrm{d}\mathcal{L}}{\mathrm{d}a} \times  \dfrac{\mathrm{d}a}{\mathrm{d}z} = \Bigg( -\dfrac{y}{a} + \dfrac{1-y}{1-a}\Bigg) \times a(1-a) = a-y$ 
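The simplification to $a-y$ can be spelled out by multiplying through by $a(1-a)$:

$$
\begin{aligned}
\Big( -\dfrac{y}{a} + \dfrac{1-y}{1-a} \Big) \, a(1-a)
&= -y(1-a) + (1-y)a \\
&= -y + ay + a - ay \\
&= a - y
\end{aligned}
$$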


`dw1` $ = \dfrac{\mathrm{d}\mathcal{L}}{\mathrm{d}w_1} = \dfrac{\mathrm{d}\mathcal{L}}{\mathrm{d}a} \times \dfrac{\mathrm{d}a}{\mathrm{d}z} \times \dfrac{\mathrm{d}z}{\mathrm{d}w_1} = (a-y)x_1$ 

`dw2` $ = \dfrac{\mathrm{d}\mathcal{L}}{\mathrm{d}w_2} = \dfrac{\mathrm{d}\mathcal{L}}{\mathrm{d}a} \times \dfrac{\mathrm{d}a}{\mathrm{d}z} \times \dfrac{\mathrm{d}z}{\mathrm{d}w_2} = (a-y)x_2$ 

`db` $ = \dfrac{\mathrm{d}\mathcal{L}}{\mathrm{d}b} = \dfrac{\mathrm{d}\mathcal{L}}{\mathrm{d}a} \times \dfrac{\mathrm{d}a}{\mathrm{d}z} \times \dfrac{\mathrm{d}z}{\mathrm{d}b} = a-y$ 
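These formulas can be sanity-checked against a numerical derivative. The sketch below uses made-up values for a single example with two features; the helper `forward_loss` is a hypothetical name for the forward computation in equation (2-6).

```python
import numpy as np

def forward_loss(w1, w2, b, x1, x2, y):
    """Loss of equation (2-6) for one example with two features."""
    z = w1 * x1 + w2 * x2 + b
    a = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a)), a

# Hypothetical parameter values and a single training example.
w1, w2, b = 0.3, -0.2, 0.1
x1, x2, y = 1.5, 2.0, 1

_, a = forward_loss(w1, w2, b, x1, x2, y)
dz = a - y
dw1, dw2, db = x1 * dz, x2 * dz, dz

# Finite-difference estimate of dL/dw1 for comparison.
h = 1e-6
numeric_dw1 = (forward_loss(w1 + h, w2, b, x1, x2, y)[0]
               - forward_loss(w1 - h, w2, b, x1, x2, y)[0]) / (2 * h)
print(dw1, numeric_dw1)  # the two values should agree closely
```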

### Gradient Descent on m Examples

[Video](https://youtu.be/4ckUNzRGgDI)


## Python and Vectorization

### Vectorization

[Video](https://youtu.be/vMFJQbMIaQc)

Vectorization is basically the art of getting rid of explicit `for` loops in your code. In the deep learning era, especially in deep learning in practice, you often find yourself training on relatively large data sets, because that's when deep learning algorithms tend to shine. So it's important that your code runs very quickly, because otherwise, when training on a big data set, your code might take a long time to run and you just find yourself waiting a very long time to get the result. So in the deep learning era, I think the ability to perform vectorization has become a key skill. Let's start with an example.

In logistic regression, you need to compute:

$$z = w^T x + b \quad \text{where } w \in \mathbb{R}^{n_x},  x \in  \mathbb{R}^{n_x}$$

Here $w$ is a column vector and $x$ is also a column vector. They may be very large vectors if you have a lot of features. So $w$ and $x$ are both $n_x$-dimensional vectors. 

To compute $w^T x$ with a ***non-vectorized*** implementation, you would use a `for` loop, something like:

```python
z = 0
for i in range(n_x):
    z += w[i] * x[i]
z += b
```

That's a non-vectorized implementation. Then you find that that's going to be really slow. In contrast, a ***vectorized*** implementation would just compute $w$ transpose $x$ directly:

```python
z = np.dot(w,x) + b
```
And you find that this is much faster. Let's actually illustrate this with a little demo.

```python
import numpy as np

a = np.array([1,2,3,4])
print(a)
```

```text
[1 2 3 4]
```

```python
import time

a = np.random.rand(1000000)
b = np.random.rand(1000000)

tic = time.time()
c = np.dot(a,b)
toc = time.time()

print("c = " + str(c))
print("Vectorized version: " + str(1000*(toc-tic)) + "ms")

c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()

print("c = " + str(c))
print("Non-Vectorized version: " + str(1000*(toc-tic)) + "ms")
```

```text
c = 250053.1011983981
Vectorized version: 0.5631446838378906ms
c = 250053.101198407
Non-Vectorized version: 158.25510025024414ms
```


In both cases, the vectorized and the non-vectorized versions computed the same value up to floating-point rounding, about 250053.10. In this run, the vectorized version took about 0.56 milliseconds, while the explicit `for` loop took about 158 milliseconds, so the non-vectorized version took roughly 280 times longer. The exact numbers vary from run to run and from machine to machine, but the point stands: if you remember to vectorize your code, it can run hundreds of times faster.

When you scale up, that can be the difference between your code taking maybe one minute to run and taking, say, five hours. When you are implementing deep learning algorithms, vectorizing your code gets you results back much faster. 

Some of you might have heard that a lot of scalable deep learning implementations are done on a GPU, or graphics processing unit. But all the demos I did just now in the Jupyter notebook were actually on the CPU. It turns out that both GPUs and CPUs have parallelization instructions, sometimes called SIMD instructions, which stands for single instruction, multiple data. What this basically means is that if you use built-in functions such as `np.dot()`, or other functions that don't require you to explicitly implement a `for` loop, Python's NumPy can take much better advantage of parallelism to do your computations much faster. This is true for computations on both CPUs and GPUs. GPUs are remarkably good at these SIMD calculations, but CPUs are actually not too bad at them either, just not as good as GPUs. 

You're seeing how vectorization can significantly speed up your code. The rule of thumb to remember is ***whenever possible, avoid using explicit for loops***.

### More Vectorization Examples

[Video](https://youtu.be/QLyqDRZrNm0)

```{admonition} Neural Network programming guideline
:class: hint
Whenever possible, avoid explicit `for` loops.
```
It's not always possible to avoid `for` loops entirely, but when you can use a built-in function or find some other way to compute whatever you need, you'll often go faster than with an explicit `for` loop. 

Here is another example. If you want to compute a vector $u$ as the product of a matrix $A$ and another vector $v$, then the definition of matrix multiplication is:

$$
\begin{aligned}
u &= Av \\
u_i &= \sum_j A_{ij} v_j
\end{aligned}$$ 

Non-vectorized in coding:

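```python
n, k = A.shape          # A has n rows and k columns; v has k entries
u = np.zeros((n, 1))
for i in range(n):
    for j in range(k):
        u[i] += A[i][j] * v[j]   # accumulate A_ij * v_j into u_i
```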

Vectorized in coding:

```python
u = np.dot(A,v)
```
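To check that the two versions agree, here is a small self-contained comparison. The sizes and the random `A` and `v` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 4))   # hypothetical 3 x 4 matrix
v = rng.random((4, 1))   # hypothetical 4 x 1 column vector

# Explicit double loop, following u_i = sum_j A_ij v_j.
u_loop = np.zeros((3, 1))
for i in range(3):
    for j in range(4):
        u_loop[i, 0] += A[i, j] * v[j, 0]

# Vectorized equivalent.
u_vec = np.dot(A, v)

print(np.allclose(u_loop, u_vec))  # True
```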

#### Vectors and matrix valued functions

Say you need to apply the exponential operation on every element of matrix/vector.

$$v = \left[
\begin{matrix}
v_1 \\
\vdots \\
v_n
\end{matrix}
\right], \quad
u = \left[
\begin{matrix}
e^{v_1} \\
\vdots \\
e^{v_n} 
\end{matrix}
\right]
$$

Non-vectorized in coding:

```python
import math
import numpy as np

u = np.zeros((n, 1))
for i in range(n):
    u[i] = math.exp(v[i])
```

Vectorized in coding:

```python
import numpy as np
u = np.exp(v)
```
Notice that, whereas previously you had an explicit `for` loop, here a single line of code, with $v$ as the input vector and $u$ as the output vector, gets rid of it, and the vectorized implementation will be much faster than the one needing an explicit `for` loop. 

In fact, the NumPy library has many such vector-valued functions. 

```python
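import numpy as np

np.log(v)         # element-wise natural logarithm
np.abs(v)         # element-wise absolute value
np.maximum(v, 0)  # element-wise maximum of each entry and 0
v**2              # element-wise square
1/v               # element-wise reciprocal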
```

So, whenever you are tempted to write a `for` loop take a look, and see if there's a way to call a `NumPy` built-in function to do it without that `for` loop.

#### Gradient descent implementation

So, let's take all of these learnings and apply it to our logistic regression gradient descent implementation, and see if we can at least get rid of one of the two `for` loops we had. So here's our code for computing the derivatives for logistic regression, and we had two `for` loops. 

```python
import numpy as np

# Assumes X has shape (n_x, m), Y has shape (1, m), w has shape (n_x, 1), b is a scalar.
n_x, m = X.shape
J, db = 0.0, 0.0
dw = [0.0] * n_x  # separate accumulators dw1, dw2, ..., one per feature

for i in range(m):  # first loop: over training examples
    z_i = np.dot(w[:, 0], X[:, i]) + b
    a_i = 1 / (1 + np.exp(-z_i))
    J += -(Y[0, i] * np.log(a_i) + (1 - Y[0, i]) * np.log(1 - a_i))
    dz_i = a_i - Y[0, i]

    for j in range(n_x):  # second loop: over features
        dw[j] += X[j, i] * dz_i
    db += dz_i

J /= m
db /= m
dw = [dw_j / m for dw_j in dw]
```
So the way we'll do so is that instead of explicitly initializing `dw1`, `dw2`, and so on to zeros, we're going to get rid of this and instead make dw a vector. 

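```python
import numpy as np

# Assumes X has shape (n_x, m), Y has shape (1, m), w has shape (n_x, 1), b is a scalar.
n_x, m = X.shape
J, db = 0.0, 0.0
dw = np.zeros((n_x, 1))

for i in range(m):  # the only remaining loop: over training examples
    x_i = X[:, i].reshape(n_x, 1)
    z_i = np.dot(w.T, x_i).item() + b
    a_i = 1 / (1 + np.exp(-z_i))
    J += -(Y[0, i] * np.log(a_i) + (1 - Y[0, i]) * np.log(1 - a_i))
    dz_i = a_i - Y[0, i]

    dw += x_i * dz_i  # vector update replaces the loop over features
    db += dz_i

J /= m
dw /= m
db /= m
```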
So now we've gone from having two `for` loops to just one `for` loop. We still have this one `for` loop that loops over the individual training examples.

I hope this section gives you a sense of vectorization. By getting rid of one `for` loop, your code will already run faster. But it turns out we can do even better. The next section shows how to vectorize logistic regression even further, with a pretty surprising result: without any `for` loop over the training examples, you can write code that processes the entire training set pretty much all at the same time.


### Vectorizing Logistic Regression

[Video](https://youtu.be/A9Ag0PtZDLA)

To compute these predictions on our $m$ training examples, there is a way to do so, without needing an explicit `for` loop.

- First, remember that we defined a matrix capital X to be your training inputs, stacked together in different columns like this. 

$$X= \left[
\begin{matrix}
\vdots & \vdots & \cdots & \vdots \\
x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\
\vdots & \vdots & \cdots & \vdots 
\end{matrix}
\right]
$$

So, this is an $n_x \times m$ dimensional matrix. Now, the first thing I want to do is show how you can compute $z^{(1)}$, $z^{(2)}$, $z^{(3)}$ and so on, all in one step, in fact with one line of code. I'm going to construct a $1 \times m$ matrix, which is really a row vector, that holds $z^{(1)}$, $z^{(2)}$, and so on up to $z^{(m)}$, all computed at the same time. 

$$Z = [z^{(1)}, z^{(2)}, \cdots, z^{(m)}] = w^T X+ [b, b, \cdots, b] = [w^T x^{(1)} + b, w^T x^{(2)} + b, \cdots, w^T x^{(m)} + b]$$

The $w^T$ will be a row vector. $[b, b, \cdots, b]$ is a $1 \times m$ row vector. So you end up with another $1 \times m$ vector. 

Just as $X$ was obtained by taking your training examples, the lowercase $x$'s, and stacking them horizontally next to each other, capital $Z$ is defined by taking the lowercase $z$'s and stacking them horizontally.

To implement this, the NumPy command is:

```python
Z = np.dot(w.T, X) + b
```
Now there is a subtlety in Python: here `b` is a real number, or, if you like, a $1 \times 1$ matrix. But when you add a real number to this row vector, Python automatically takes the real number `b` and expands it out to a $1 \times m$ row vector (i.e. $[b, b, \cdots, b]$). In case this operation seems a little mysterious, it is called **broadcasting** in Python.
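A short sketch makes the broadcasting step concrete. The sizes and random values are arbitrary assumptions; the point is that adding the scalar `b` gives the same result as adding an explicit $1 \times m$ row filled with `b`.

```python
import numpy as np

# Hypothetical sizes: n_x = 3 features, m = 4 examples.
rng = np.random.default_rng(0)
w = rng.random((3, 1))
X = rng.random((3, 4))
b = 0.5

Z = np.dot(w.T, X) + b                            # b is broadcast across columns
Z_explicit = np.dot(w.T, X) + np.full((1, 4), b)  # b written out as [b, b, b, b]

print(Z.shape)                     # (1, 4)
print(np.allclose(Z, Z_explicit))  # True
```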

- Second, what we would like to do next, is find a way to compute $a^{(1)}$, $a^{(2)}$ and so on to $a^{(m)}$, all at the same time, and just as stacking lowercase $x$'s resulted in capital $X$ and stacking horizontally lowercase $z$'s resulted in capital $Z$, stacking lower case $a$, is going to result in a new variable, which we are going to define as capital $A$.

$$A = [a^{(1)}, a^{(2)}, \cdots, a^{(m)}] = \sigma(Z)$$

And in the program assignment, you see how to implement a vector valued sigmoid function, so that the sigmoid function, inputs this capital $Z$ as a variable and very efficiently outputs capital $A$. So you see the details of that in the programming assignment. 

What we've seen in this section is that instead of needing to loop over $m$ training examples to compute lowercase $z$ and lowercase $a$ one at a time, you can use one line of code to compute all the $z$'s at the same time, and then one more line of code, with an appropriate implementation of the sigmoid function $\sigma$, to compute all the lowercase $a$'s at the same time. This is how you implement a vectorized version of forward propagation for all $m$ training examples at once.

So to summarize, you've just seen how you can use vectorization to very efficiently compute all of the activations, all the lowercase $a$'s at the same time. Next, it turns out, you can also use vectorization very efficiently to compute the backward propagation, to compute the gradients.


### Vectorizing Logistic Regression's Gradient Output

[Video](https://youtu.be/Y5LsQZY-0QM)

In the previous section, you learned how to use vectorization to compute the predictions for all $m$ training examples at the same time. In this section, you will learn how to use vectorization to also perform the gradient computations for all $m$ training examples, again all at the same time. Then, at the end of this part, we will put it all together and show how you can derive a very efficient implementation of logistic regression.

$$\mathrm{d}Z = [\mathrm{d}z^{(1)}, \mathrm{d}z^{(2)}, \cdots, \mathrm{d}z^{(m)}] = [a^{(1)} - y^{(1)}, a^{(2)} - y^{(2)}, \cdots, a^{(m)} - y^{(m)}]$$

It's a $1 \times m$ matrix, or alternatively an $m$-dimensional row vector.

$$A = [a^{(1)}, a^{(2)}, \cdots, a^{(m)}]$$

$$Y = [y^{(1)}, y^{(2)}, \cdots, y^{(m)}]$$

So, based on these definitions, maybe you can see for yourself that $\mathrm{d}Z$ can be computed as just $A-Y$.

$$\mathrm{d}Z = A-Y = [a^{(1)} - y^{(1)}, a^{(2)} - y^{(2)}, \cdots, a^{(m)} - y^{(m)}]$$

So, with just one line of code, you can compute all of this at the same time. 

Now, in the previous implementation, we've gotten rid of one `for` loop already but we still had this second `for` loop over training examples. 

$$
\begin{aligned}
\mathrm{d}w &= 0 \\
\mathrm{d}w &\ += x^{(1)}\mathrm{d}z^{(1)} \\
\mathrm{d}w &\ += x^{(2)}\mathrm{d}z^{(2)} \\
& \vdots \\
\mathrm{d}w  &\ /= m
\end{aligned} \ \qquad  \qquad 
\begin{aligned}
\mathrm{d}b &= 0 \\
\mathrm{d}b &\ += \mathrm{d}z^{(1)} \\
\mathrm{d}b &\ += \mathrm{d}z^{(2)} \\
& \vdots \\
\mathrm{d}b  &\ /= m
\end{aligned}$$


So that's what we had in the previous implementation. We'd already got rid of one `for` loop, so at least `dw` is now a vector and we weren't separately updating `dw1`, `dw2` and so on. But we still had the `for` loop over the $m$ examples in the training set. So, let's take these operations and vectorize them.

$$\mathrm{d}b = \dfrac{1}{m} \sum_{i=1}^{m} \mathrm{d}z^{(i)}$$

```python
db = 1/m * np.sum(dZ)
```

$$
\begin{aligned}
\mathrm{d}w 
&= \dfrac{1}{m} X (\mathrm{d}Z)^T \\
&= \dfrac{1}{m} \left[
\begin{matrix}
\vdots & \vdots & \cdots & \vdots \\
x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\
\vdots & \vdots & \cdots & \vdots 
\end{matrix}
\right] \left[
\begin{matrix}
\mathrm{d}z^{(1)} \\
\vdots \\
\mathrm{d}z^{(m)}
\end{matrix}
\right] \\
&= \dfrac{1}{m} [x^{(1)}\mathrm{d}z^{(1)} + \cdots + x^{(m)} \mathrm{d}z^{(m)}]
\end{aligned}$$


```python
dw = 1/m * np.dot(X, dZ.T)
```
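
As a sanity check on the two vectorized formulas above, the sketch below compares them with the explicit loop over examples. The sizes and random values are made up for illustration, and `dZ` here is just random data standing in for $A - Y$.

```python
import numpy as np

# Hypothetical small problem: n_x = 2 features, m = 5 examples.
rng = np.random.default_rng(1)
n_x, m = 2, 5
X = rng.standard_normal((n_x, m))
dZ = rng.standard_normal((1, m))        # stands in for A - Y

# Explicit loop over the m training examples.
dw_loop = np.zeros((n_x, 1))
db_loop = 0.0
for i in range(m):
    dw_loop += X[:, i:i+1] * dZ[0, i]   # x^(i) * dz^(i), kept as a column
    db_loop += dZ[0, i]
dw_loop /= m
db_loop /= m

# Vectorized versions from the text.
dw = 1/m * np.dot(X, dZ.T)
db = 1/m * np.sum(dZ)

print(np.allclose(dw, dw_loop), np.isclose(db, db_loop))   # True True
```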

So now, let's put it all together into how you would actually implement logistic regression. This is our original, highly inefficient, non-vectorized implementation.

```python
import numpy as np

# Assumes w has shape (2, 1), X has shape (2, m), Y has shape (1, m), b is a scalar.
J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0

for i in range(m):
    z = np.dot(w[:, 0], X[:, i]) + b          # z = w^T x + b for example i
    a = 1 / (1 + np.exp(-z))                  # sigmoid activation
    J += -(Y[0, i] * np.log(a) + (1 - Y[0, i]) * np.log(1 - a))
    dz = a - Y[0, i]
    dw1 += X[0, i] * dz
    dw2 += X[1, i] * dz
    db += dz

J /= m; dw1 /= m; dw2 /= m; db /= m
```

But now, here is the whole vectorized process, without either of the two explicit `for` loops.

```python 
import numpy as np

Z = np.dot(w.T, X) + b
A = sigmoid(Z)
dZ = A - Y
dw = 1/m * np.dot(X, dZ.T)
db = 1/m * np.sum(dZ)
```
So, you've just done forward propagation and back propagation, computing the predictions and the derivatives on all $m$ training examples without using a `for` loop. The gradient descent update would then be:

```python 
# alpha is the learning rate
w = w - alpha * dw
b = b - alpha * db
```

You have just implemented a **single iteration** of gradient descent for logistic regression.

:::{warning} 
We said before that you should get rid of explicit `for` loops whenever you can. However, if you want to implement multiple iterations of gradient descent, then you still need a `for` loop over the number of iterations. So, if you want a thousand iterations of gradient descent, you still need a `for` loop over the iteration number. That outermost `for` loop is one I don't think there is any way to get rid of.

```python
for iteration in range(1000):
    Z = np.dot(w.T, X) + b
    A = sigmoid(Z)
    dZ = A - Y
    dw = 1/m * np.dot(X, dZ.T)
    db = 1/m * np.sum(dZ)

    w = w - alpha * dw
    b = b - alpha * db
```
:::
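
To see the pieces above run end to end, here is a minimal self-contained sketch. The function name `train_logistic_regression`, the zero initialization, the learning rate of `0.1` and the synthetic data are all assumptions made for this example rather than anything prescribed by the course.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def train_logistic_regression(X, Y, alpha=0.1, num_iterations=1000):
    """Batch gradient descent; X is (n_x, m), Y is (1, m) with 0/1 labels."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(num_iterations):
        # Forward propagation over all m examples at once.
        Z = np.dot(w.T, X) + b
        A = sigmoid(Z)
        # Backward propagation.
        dZ = A - Y
        dw = 1/m * np.dot(X, dZ.T)
        db = 1/m * np.sum(dZ)
        # Parameter update.
        w = w - alpha * dw
        b = b - alpha * db
    return w, b

# Synthetic, linearly separable toy data (made up for this example).
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)

w, b = train_logistic_regression(X, Y)
predictions = (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(float)
accuracy = np.mean(predictions == Y)
print(accuracy)
```

The only explicit loop left is the one over iterations; everything inside it operates on all $m$ examples at once.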

So, that's it: you now have a highly vectorized and highly efficient implementation of gradient descent for logistic regression. There is just one more detail I want to talk about, which is that in our description here I briefly alluded to a technique called broadcasting. Broadcasting turns out to be a technique that Python and NumPy allow you to use to make certain parts of your code much more efficient as well.


### Broadcasting in Python

[Video](https://youtu.be/hnBXZO5JHUc)

In the previous section, we noticed that broadcasting is another technique that you can use to make your Python code run faster. In this section, let's delve into how broadcasting in Python actually works. Let's explore broadcasting with an example. 

In this matrix, it shows the number of calories from carbohydrates, proteins, and fats in 100 grams of four different foods:

```{figure} images/2-4.png
---
height: 100px
name: 2-4
---
```
So for example, 100 grams of apples has 56 calories from carbs, and much less from proteins and fats. In contrast, 100 grams of beef has 104 calories from protein and 135 calories from fat.

Now, let's say your goal is to calculate the percentage of calories from carbs, proteins and fats for each of the four foods. So, for example, if you look at this column and add up the numbers in that column you get that 100 grams of apple has 56 plus 1.2 plus 1.8 so that's 59 calories. And so as a percentage the percentage of calories from carbohydrates in an apple would be 56 over 59, that's about 94.9\%. So most of the calories in an apple come from carbs, whereas in contrast, most of the calories of beef come from protein and fat and so on. So the calculation you want is really to sum up each of the four columns of this matrix to get the total number of calories in 100 grams of apples, beef, eggs, and potatoes. And then to divide throughout the matrix, so as to get the percentage of calories from carbs, proteins and fats for each of the four foods. So the question is, can you do this without an explicit for-loop?

What I'm going to do is show you how you can set this matrix equal to a three by four matrix $A$.

- And then with one line of Python code we're going to sum down the columns. So we're going to get four numbers corresponding to the total number of calories in these four different types of foods, 100 grams of these four different types of foods. 

- Using a second line of Python code to divide each of the four columns by their corresponding sum.

```python
import numpy as np

A = np.array([[56.0, 0.0, 4.4, 68.0],
              [1.2, 104.0, 52.0, 8.0],
              [1.8, 135.0, 99.0, 0.9]])

print(A)

cal = A.sum(axis = 0)
print(cal)

percentage = 100 * A / cal.reshape(1,4)
print(percentage)
```

```
[[ 56.    0.    4.4  68. ]
 [  1.2 104.   52.    8. ]
 [  1.8 135.   99.    0.9]]
[ 59.  239.  155.4  76.9]
[[94.91525424  0.          2.83140283 88.42652796]
 [ 2.03389831 43.51464435 33.46203346 10.40312094]
 [ 3.05084746 56.48535565 63.70656371  1.17035111]]
```


- To add a bit of detail, the parameter `axis = 0` means that you want Python to sum vertically. Axis 0 runs vertically, whereas axis 1 runs horizontally, so you would write `axis = 1` to sum horizontally instead.

- So this is a three by four matrix divided by a one by four matrix. Technically, after the first line of code, `cal` has shape `(4,)`, which broadcasts against the three by four matrix in the same way a one by four matrix would, so you don't strictly need to call `reshape` here; it's a little bit redundant. But when I'm writing Python code and I'm not entirely sure of the dimensions of a matrix, I often just call `reshape` to make sure that it's the column vector or row vector I want it to be. The `reshape` command runs in constant time. It's an order one operation that's very cheap to call. So don't be shy about using it to make sure that your matrices are the size you need them to be.
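
The two notes above can be checked with a short sketch. It reuses the calorie matrix from the example; the variable names `col_sums`, `row_sums` and `col_sums_2d` are chosen here for illustration. The `keepdims=True` argument is one alternative to calling `reshape`, since it keeps the summed axis as a dimension of size one.

```python
import numpy as np

A = np.array([[56.0, 0.0, 4.4, 68.0],
              [1.2, 104.0, 52.0, 8.0],
              [1.8, 135.0, 99.0, 0.9]])

col_sums = A.sum(axis=0)                     # sum down each column, shape (4,)
row_sums = A.sum(axis=1)                     # sum across each row, shape (3,)
col_sums_2d = A.sum(axis=0, keepdims=True)   # shape (1, 4), no reshape needed

print(col_sums.shape, row_sums.shape, col_sums_2d.shape)   # (4,) (3,) (1, 4)

percentage = 100 * A / col_sums_2d
print(percentage.sum(axis=0))                # each column adds up to about 100
```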

#### Broadcasting examples

Now, let's explain in greater detail how this type of operation works. We had a three by four matrix and we divided it by a one by four matrix. So, how can you divide a three by four matrix by a one by four matrix? Or by one by four vector? Let's go through a few more examples of broadcasting. 

1. If you take a 4 by 1 vector and add it to a number, Python will take this number and auto-expand it into a four by one vector as well, as follows. So the vector $[1, 2, 3, 4]^T$ plus the number 100 ends up as the vector on the right: you're adding 100 to every element. This type of broadcasting works with both column vectors and row vectors, and in fact we used it in an earlier section, where the constant being added to a vector was the parameter $b$ in logistic regression.

$$\left[
\begin{matrix}
1 \\
2 \\
3 \\
4
\end{matrix}
\right] + 100 = 
\left[
\begin{matrix}
1 \\
2 \\
3 \\
4
\end{matrix}
\right]  + 
\left[
\begin{matrix}
100 \\
100 \\
100 \\
100
\end{matrix}
\right] =
\left[
\begin{matrix}
101 \\
102 \\
103 \\
104
\end{matrix}
\right] $$


2.  Here's another example. Let's say you have a two by three matrix and you add it to this one by three matrix. The general case would be an $(m,n)$ matrix added to a $(1,n)$ matrix. What Python will do is copy the $(1,n)$ matrix $m$ times to turn it into an $m$ by $n$ matrix. So in this example it copies the one by three matrix twice, turning it into a two by three matrix as well, and then adds the two, giving the sum on the right. You've added 100 to the first column, 200 to the second column, and 300 to the third column. This is basically what we did in the calorie example above, except that there we used a division operation instead of an addition operation.

$$\left[
\begin{matrix}
1  &  2  & 3 \\
4  &  5  & 6
\end{matrix}
\right] + 
\left[
\begin{matrix}
100 &  200  & 300
\end{matrix}
\right] = 
\left[
\begin{matrix}
1  &  2  & 3 \\
4  &  5  & 6
\end{matrix}
\right]  + 
\left[
\begin{matrix}
100 &  200  & 300 \\
100 &  200  & 300
\end{matrix}
\right]   =
\left[
\begin{matrix}
101 &  202  & 303 \\
104 &  205  & 306
\end{matrix}
\right] $$

3. One last example: you have an $(m,n)$ matrix and you add it to an $(m,1)$ vector, or $(m,1)$ matrix. Python then copies the vector $n$ times horizontally, so you end up with an $(m,n)$ matrix. Here, as you can imagine, it is copied horizontally three times, and then the two matrices are added. So we've added 100 to the first row and 200 to the second row.

$$\left[
\begin{matrix}
1  &  2  & 3 \\
4  &  5  & 6
\end{matrix}
\right] + 
\left[
\begin{matrix}
100 \\ 
200
\end{matrix}
\right] = 
\left[
\begin{matrix}
1  &  2  & 3 \\
4  &  5  & 6
\end{matrix}
\right]  + 
\left[
\begin{matrix}
100 &  100  & 100 \\
200 &  200  & 200
\end{matrix}
\right]   =
\left[
\begin{matrix}
101 &  102  & 103 \\
204 &  205  & 206
\end{matrix}
\right] $$
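
The three examples above can be reproduced directly in NumPy. The variable names in this sketch are chosen only for illustration.

```python
import numpy as np

# Example 1: (4, 1) column vector plus a scalar.
col = np.array([[1], [2], [3], [4]])
r1 = col + 100            # column with entries 101, 102, 103, 104

# Example 2: (2, 3) matrix plus a (1, 3) row vector.
M = np.array([[1, 2, 3],
              [4, 5, 6]])
row = np.array([[100, 200, 300]])
r2 = M + row              # rows [101, 202, 303] and [104, 205, 306]

# Example 3: (2, 3) matrix plus a (2, 1) column vector.
col2 = np.array([[100], [200]])
r3 = M + col2             # rows [101, 102, 103] and [204, 205, 206]

print(r1.shape, r2.shape, r3.shape)   # (4, 1) (2, 3) (2, 3)
```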

#### General Principle

1. Here's the more general principle of broadcasting in Python. If you have an $(m,n)$ matrix and you add, subtract, multiply or divide it by a $(1,n)$ matrix, then Python will copy the $(1,n)$ matrix $m$ times into an $(m,n)$ matrix and then apply the addition, subtraction, multiplication or division element-wise.

   If, conversely, you were to take the $(m,n)$ matrix and add, subtract, multiply or divide by an $(m,1)$ matrix, then Python would copy it $n$ times, turning it into an $(m,n)$ matrix, and then apply the operation element-wise.

2. One more form of broadcasting: if you have an $(m,1)$ matrix, which is really a column vector like $[1,2,3]^T$, and you add, subtract, multiply or divide by a real number (or, if you like, a $(1,1)$ matrix), such as that vector plus 100, then you end up copying this real number $m$ times until you get another $(m,1)$ matrix. You then perform the operation, addition in this example, element-wise. Something similar also works for row vectors.

$$\left[
\begin{matrix}
1 \\
2 \\
3 
\end{matrix}
\right] + 100 = 
\left[
\begin{matrix}
101 \\
102 \\
103 
\end{matrix}
\right] $$

$$\left[
\begin{matrix}
1  &  2  & 3 
\end{matrix}
\right] + 100 = 
\left[
\begin{matrix}
101 &  102  & 103 
\end{matrix}
\right] $$
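
The "copy, then apply element-wise" description can be checked by making the copies explicitly with `np.tile` and comparing against the broadcast result. This is only a sketch of the mental model, with values made up for illustration; NumPy does not need to materialize these copies in memory when it broadcasts.

```python
import numpy as np

M = np.arange(1.0, 7.0).reshape(2, 3)       # (m, n) = (2, 3)
row = np.array([[10.0, 20.0, 30.0]])        # (1, n)
col = np.array([[2.0], [4.0]])              # (m, 1)

# Broadcasting behaves as if the smaller operand were copied first.
row_copied = np.tile(row, (2, 1))           # (1, 3) stacked m = 2 times -> (2, 3)
col_copied = np.tile(col, (1, 3))           # (2, 1) repeated n = 3 times -> (2, 3)

print(np.array_equal(M * row, M * row_copied))   # True
print(np.array_equal(M / col, M / col_copied))   # True
```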

So, that was broadcasting in Python. I hope that when you do the programming homework, broadcasting will not only make your code run faster, but also help you get what you want done with fewer lines of code.

### A Note on Python/Numpy Vectors

[Video](https://youtu.be/Yx8LWTEKaxg)

[Quick tour of Jupyter/iPython Notebooks](https://youtu.be/TPRB5n9ckQs)

The ability of Python to let you use broadcasting operations, and more generally the great flexibility of the Python/NumPy programming language, is both a strength and a weakness of the language.

- It's a strength because it makes the language expressive: this flexibility lets you get a lot done even with just a single line of code.

- But it's also a weakness because, with broadcasting and this great amount of flexibility, it's sometimes possible to introduce very subtle or very strange-looking bugs if you're not familiar with all of the intricacies of how features like broadcasting work.  
	- For example, if you take a column vector and add it to a row vector, you would expect it to throw a dimension mismatch or type error or something. But you might actually get back a matrix as the sum of a row vector and a column vector.
	
There is an internal logic to these strange effects of Python. But I've seen students who aren't familiar with Python run into very strange, very hard-to-find bugs. So in this section I'll share a couple of tips and tricks that have been very useful for me in eliminating or simplifying all the strange-looking bugs in my own code. I hope that with these tips and tricks, you'll also be able to write bug-free Python and NumPy code much more easily.
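
The row-plus-column case mentioned above is easy to reproduce. In this small sketch (values made up for illustration), no error is raised; both operands are broadcast to a common $(2, 3)$ shape instead.

```python
import numpy as np

row = np.array([[1, 2, 3]])        # shape (1, 3)
col = np.array([[10], [20]])       # shape (2, 1)

total = row + col                  # no error: both are broadcast to (2, 3)
print(total.shape)                 # (2, 3)
print(total.tolist())              # [[11, 12, 13], [21, 22, 23]]
```

If the intent was to add two vectors of the same length element-wise, a result like this is a sign that one of them has the wrong orientation.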

```python
import numpy as np

a = np.random.rand(5)
print(a)
```

```
[0.63483229 0.20965002 0.8200479  0.88321784 0.31629843]
```

```python
print(a.shape)
```

```
(5,)
```


From the output, the shape of `a` is $(5,)$. This is called a rank 1 array in Python, and it's neither a row vector nor a column vector. This leads it to have some slightly non-intuitive effects. For example:

```python
print(a.T)
```

```
[0.63483229 0.20965002 0.8200479  0.88321784 0.31629843]
```

```python
print(np.dot(a,a.T))
```

```
1.9995621740520717
```


If I print `a` transpose, it ends up looking the same as `a`. And if I print the product of `a` and `a.T`, you might expect the outer product, which should give you a matrix. But the result above shows that you instead get back a number. So what I would recommend is that when you're coding neural networks, you just not use data structures where the shape is $(5,)$ or $(n,)$, that is, rank 1 arrays. Instead, you can set `a` to be $(5,1)$ like below:

```python
a = np.random.rand(5,1)
print(a)
```

```
[[0.84647718]
 [0.13185321]
 [0.60876327]
 [0.99692069]
 [0.71809643]]
```

```python
print(a.T)
```

```
[[0.84647718 0.13185321 0.60876327 0.99692069 0.71809643]]
```


This commits `a` to be a $(5,1)$ column vector. Now `a` transpose is a row vector. Notice one subtle difference: in this data structure, there are two square brackets when we print `a.T`, whereas previously there was one. That's the difference between a true 1 by 5 matrix and one of these rank 1 arrays.

And if you print the product between `a` and `a.T`, then this gives you the outer product of a vector:

```python
print(np.dot(a,a.T))
```

```
[[0.71652362 0.11161073 0.51530422 0.84387061 0.60785225]
 [0.11161073 0.01738527 0.08026739 0.13144719 0.09468332]
 [0.51530422 0.08026739 0.37059272 0.6068887  0.43715074]
 [0.84387061 0.13144719 0.6068887  0.99385085 0.71588519]
 [0.60785225 0.09468332 0.43715074 0.71588519 0.51566249]]
```


:::{caution}
The first command we ran above created a data structure whose `a.shape` was this funny thing $(5,)$, which is called a rank 1 array. It is a very funny data structure: it doesn't behave consistently as either a row vector or a column vector, which makes some of its effects nonintuitive. So what I recommend is that when you're doing your programming exercises, or in fact when you're implementing logistic regression or neural networks, you just do not use these rank 1 arrays.

Instead, if every time you create an array you commit to making it either a column vector (for example a $(5,1)$ vector) or a row vector, then the behavior of your vectors will be easier to understand.

```python
a = np.random.randn(5)     # rank 1 array, shape (5,): DO NOT USE

a = np.random.randn(5, 1)  # column vector, shape (5, 1)
```

One more thing that I do a lot in my code is, if I'm not entirely sure what the dimension of one of my vectors is, I'll often throw in an assertion statement like this,

```python
assert a.shape == (5, 1)
```

to make sure, in this case, that this is a $(5,1)$ vector, that is, a column vector. These assertions are really inexpensive to execute, and they also serve as documentation for your code. So don't hesitate to throw in assertion statements like this whenever you feel like it.

And then finally, if for some reason you do end up with a rank 1 array, you can reshape it.
:::
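
A minimal sketch of that last tip, combined with the assertion habit: a rank 1 array is reshaped into an explicit column or row vector. The names `a_col` and `a_row` are chosen here only for illustration.

```python
import numpy as np

a = np.random.rand(5)              # rank 1 array, shape (5,)
assert a.shape == (5,)

a_col = a.reshape(5, 1)            # commit to a column vector
a_row = a.reshape(1, 5)            # or commit to a row vector
assert a_col.shape == (5, 1)
assert a_row.shape == (1, 5)

# With explicit shapes, the two products are unambiguous.
print(np.dot(a_row, a_col).shape)  # (1, 1), the inner product
print(np.dot(a_col, a_row).shape)  # (5, 5), the outer product
```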





## Quiz

1. In logistic regression given $x$, and parameters $w \in \mathbb{R}^{n_x}, \ b \in \mathbb{R}$. How do we generate the output $\hat{y}$?

    A. $\sigma(wx)$

    B. $\text{tanh}(wx+b)$

    C. $\sigma(wx + b)$

    D. $wx + b$

2.  In logistic regression given the input $x$ and parameters $w \in \mathbb{R}^{n_x}, \ b \in \mathbb{R}$. Which of the following best expresses what we want $\hat{y}$ to tell us?

    A. $\sigma(wx)$

    B. $\mathrm{P}(y=1 \mid x)$

    C. $\mathrm{P}(y=\hat{y} \mid x)$

    D. $\sigma(wx + b)$

3.  Which of these is the "Logistic Loss"?

    A. $\mathcal{L}^{(i)}(\hat{y}^{(i)}, y^{(i)}) = \lvert y^{(i)} - \hat{y}^{(i)} \rvert$

    B. $\mathcal{L}^{(i)}(\hat{y}^{(i)}, y^{(i)}) = \text{max}(0, y^{(i)} - \hat{y}^{(i)})$

    C. $\mathcal{L}^{(i)}(\hat{y}^{(i)}, y^{(i)}) = - \Big( y^{(i)} \  \text{log}(\hat{y}^{(i)}) + (1-y^{(i)}) \ \text{log}(1-\hat{y}^{(i)}) \Big)$

    D. $\mathcal{L}^{(i)}(\hat{y}^{(i)}, y^{(i)}) = \lvert y^{(i)} - \hat{y}^{(i)} \rvert^{2}$

4. Suppose that $\hat{y} = 0.5$ and $y = 0$. What is the value of the "Logistic Loss"? Choose the best option.

    A. $+ \infty$

    B. $0.5$

    C. $0.693$

    D. $\mathcal{L}(\hat{y}, y) = - \Big( y \  \text{log}(\hat{y}) + (1-y) \ \text{log}(1-\hat{y}) \Big)$

5.  Suppose $\mathbf{x}$ is a $(8, 1)$ array. Which of the following is a valid reshape?

    A. `x.reshape(1, 4, 3)`

    B. `x.reshape(2, 2, 2)`

    C. `x.reshape(-1, 3)`

    D. `x.reshape(2, 4, 4)`

6.  Consider the Numpy array $\mathbf{x}$: `x = np.array([[[1],[2]],[[3],[4]]])`. What is the shape of $\mathbf{x}$?

    A. $(4,)$

    B. $(1, 2, 2)$

    C. $(2, 2, 1)$

    D. $(2, 2)$

7.  Consider the following random arrays $a$, $b$ and $c$:
    ```python
    a = np.random.randn(3, 4)  # a.shape = (3, 4)
    b = np.random.randn(1, 4)  # b.shape = (1, 4)
    c = a + b
    ```
    What will be the shape of $c$?

    A. The computation cannot happen because it is not possible to broadcast more than one dimension.

    B. c.shape = $(3, 1)$

    C. c.shape = $(1, 4)$

    D. c.shape = $(3, 4)$
 
8.  Consider the following random arrays $a$, $b$ and $c$:
    ```python
    a = np.random.randn(2, 3)  # a.shape = (2, 3)
    b = np.random.randn(2, 1)  # b.shape = (2, 1)
    c = a + b
    ```
    What will be the shape of $c$?

    A. The computation cannot happen because the sizes do not match. It's going to be "Error"!

    B. c.shape = $(2, 1)$

    C. c.shape = $(2, 3)$

    D. c.shape = $(3, 2)$
    
9.  Consider the following random arrays $a$, $b$ and $c$:
    ```python
    a = np.random.randn(4, 3)  # a.shape = (4, 3)
    b = np.random.randn(1, 3)  # b.shape = (1, 3)
    c = a * b
    ```
    What will be the shape of $c$?

    A. The computation cannot happen because the sizes do not match. 

    B. c.shape = $(1, 3)$

    C. The computation cannot happen because it is not possible to broadcast more than one dimension.

    D. c.shape = $(4, 3)$    
    
10.  Suppose you have $n_x$ input features per example. Recall that $\boldsymbol{X} = [x^{(1)} \  x^{(2)} \  \cdots \  x^{(m)}]$. What is the dimension of $\boldsymbol{X}$?

      A. $(m, 1)$

      B. $(m, n_x)$

      C. $(1, m)$

      D. $(n_x, m)$    

    
11.  Suppose our input batch consists of 8 grayscale images, each of dimension $8\times8$. We reshape these images into feature column vectors $x^{(j)}$. Remember that $\boldsymbol{X} = [x^{(1)} \  x^{(2)} \  \cdots \  x^{(8)}]$. What is the dimension of $\boldsymbol{X}$?

      A. $(512, 1)$

      B. $(8, 64)$

      C. $(64, 8)$

      D. $(8, 8, 8)$    

12. Consider the following array:
    ```python
    a = np.array([[2,1], [1,3]])
    ```
    What is the result of `a*a`?

    A. The computation cannot happen because the sizes do not match. It's going to be "Error"!

    B. $\begin{pmatrix} 4 & 2 \\ 6 &6\\ \end{pmatrix}$

    C. $\begin{pmatrix} 4 & 1 \\ 1 &9\\ \end{pmatrix}$

    D. $\begin{pmatrix} 5 & 5 \\ 5 &10\\ \end{pmatrix}$
    
13. Consider the following code snippet:
    ```python
    a.shape = (3, 4)
    b.shape = (4, 1)
    
    for i in range(3):
         for j in range(4):
              c[i][j] = a[i][j] * b[j]
    ```
    How do you vectorize this?

    A. `c = a * b`

    B. `c = a.T * b`

    C. `c = a * b.T`

    D. `c = np.dot(a, b)`

14. Consider the following code snippet:
    ```python
    a.shape = (4, 3)
    b.shape = (4, 1)
    
    for i in range(3):
         for j in range(4):
              c[i][j] = a[j][i] + b[j]
    ```
    How do you vectorize this?

    A. `c = a + b.T`

    B. `c = a.T + b`

    C. `c = a.T + b.T`

    D. `c = a + b`
    
15. Consider the following code:
    ```python
    a = np.random.randn(3, 3) 
    b = np.random.randn(3, 1)
    
    c = a * b
    ```
    What will be $c$?

    A. It will lead to an error since you cannot use `*` to operate on these two matrices. You need to instead use `np.dot(a, b)`. 

    B. This will multiply a $3\times3$ matrix a with a $3\times1$ vector, thus resulting in a $3\times1$ vector. That is, c.shape = $(3, 1)$.

    C. This will invoke broadcasting, so `b` is copied three times to become $(3, 3)$, and `*` is an element-wise product so c.shape will be $(3, 3)$.

    D. This will invoke broadcasting, so `b` is copied three times to become $(3, 3)$, and `*` invokes a matrix multiplication operation of two $3\times3$ matrices so c.shape will be $(3, 3)$.

16. Consider the code snippet:
    ```python
    a.shape = (3, 3) 
    b.shape = (3, 3)
    
    c = a ** 2 + b.T ** 2
    ```
    Which of the following gives an equivalent output for $c$?

    A.

    ```python
    for i in range(3):
        for j in range(3):
            c[i][j] = a[i][j]**2 + b[i][j]**2
    ```

    B.

    ```python
    for i in range(3):
        c[i] = a[i]**2 + b[i]**2
    ```

    C.

    ```python
    for i in range(3):
        for j in range(3):
            c[i][j] = a[i][j]**2 + b[j][i]**2
    ```

    D. The computation cannot happen because the sizes do not match. It's going to be "Error"!

17. Consider the following computational graph:
    
     ```{figure} images/2-e17.png
     ---
     height: 200px
     name: 2-e17
     ---
     ```
    
    What is the output of $J$?

    A. $(a + c)(b - 1)$ 

    B. $ab + bc + ac$

    C. $(c - 1)(a + c)$ 
    
    D. $(a - 1)(b + c)$
    
18. Consider the following computational graph:
    
     ```{figure} images/2-e18.png
     ---
     height: 200px
     name: 2-e18
     ---
     ```
    
    What is the output of $J$?

    A. $J = a\times b + b \times c + a \times c $ 

    B. $J = (c - 1)(b + a)$ 

    C. $J = (a - 1)(b + c)$ 
    
    D. $J = (b - 1)(c + a)$  
   

:::{admonition} Click here for answers!
:class: tip, dropdown

1. C </br>

    C. Yes. In logistic regression we use a linear function $wx + b$ followed by the sigmoid function $\sigma$ to get an output, referred to as $\hat{y}$, such that $0 < \hat{y} < 1$. </br>

2. B </br>

    B. Yes. We want the output $\hat{y}$ to tell us the probability that $y=1$ given $x$. </br>

3. C </br>

4. C </br>

5. B </br>

    B. Yes. This uses $2\times2\times2 = 8$ entries. </br>
    
6. C </br>

    C. Yes. This array has two rows and in each row it has 2 arrays of $1\times1$. </br>
    
7. D </br>

    D. Yes. Broadcasting is used, so row `b` is copied 3 times so it can be summed to each row of `a`. </br>

8. C </br>

    C. Yes. This is broadcasting.  `b` (column vector) is copied 3 times so that it can be summed to each column of `a`. </br>
    
9. D </br>

    D. Yes. Broadcasting is invoked, so row `b` is multiplied element-wise with each row of `a` to create `c`. </br>

10. D </br>

11. C </br>

      C. Yes. After converting the $8\times8$ gray scale images to a column vector we get a vector of size 64, thus $\boldsymbol{X}$ has dimension $(64, 8)$. </br>
      
12. C </br>

      C. Yes. Recall that * indicates element-wise multiplication. </br>
      
13. C </br>

      C. Yes. `b.T` gives a row vector with shape $(1, 4)$. The result of `c` is equivalent to broadcasting `a*b.T`. </br>

14. C </br>

      C. Yes. `a[j][i]` being used for `c[i][j]` indicates we are using `a.T`, and the element in row `j` of `b` is used in column `j` of `c`, thus we are using `b.T`. </br>

15. C </br>

16. C </br>

      C. Yes. Notice that to operate with `b.T` we need to use `b[j][i]`. </br>

17. A </br>

18. C </br>
:::