# Vectorization

Vectorization lets us avoid explicit for-loops like the ones we encountered in the previous chapter. Let's look at what vectorization means in the context of logistic regression.

`numpy` uses many Single Instruction Multiple Data (SIMD) operations, which allow it to parallelize these arithmetic operations.
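As a small illustration, the sketch below computes the same dot product twice: once with an explicit Python loop and once with `np.dot`. The vector length and the random seed are arbitrary choices made for this example.

```python
import numpy as np

# Arbitrary example data (assumed for illustration only).
rng = np.random.default_rng(0)
u = rng.standard_normal(1000)
v = rng.standard_normal(1000)

# Explicit for-loop: one multiply-add per Python iteration.
loop_result = 0.0
for i in range(len(u)):
    loop_result += u[i] * v[i]

# Vectorized: a single call that numpy executes in compiled code.
vec_result = np.dot(u, v)

# Both approaches agree up to floating-point rounding.
print(np.isclose(loop_result, vec_result))
```

The two results match; the vectorized call simply hands the whole loop to numpy instead of the Python interpreter.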

## Goals

1. Avoid explicit for loops
2. Use `numpy` built-in functions as much as possible

## `numpy` for Logistic Regression Derivatives

Recall from the previous example:

$$z = w_1 x_1 + w_2 x_2 + b$$
$$\hat{y} = a = \sigma(z)$$
$$\mathcal{L}(\hat{y}, y) = - \big(y\ \log\hat{y} + (1 - y)\ \log(1 - \hat{y}) \big)$$

After taking the partial derivatives, we can write the gradient descent update for each parameter. Gradient descent moves each parameter *against* its gradient, so the update subtracts the scaled derivative. For a single data point $i$:

$$w_1 := w_1 - \alpha \frac{\delta \mathcal{L}(a, y^{(i)})}{\delta w_1}$$
$$w_2 := w_2 - \alpha \frac{\delta \mathcal{L}(a, y^{(i)})}{\delta w_2}$$
$$b := b - \alpha \frac{\delta \mathcal{L}(a, y^{(i)})}{\delta b}$$

In practice we sum these derivatives over all data points and divide each sum by $m$, the total number of training examples, before applying the update. Here is a brief overview of how one iteration computes the averaged cost and gradients.

```
J = 0, dw1 = 0, dw2 = 0, db = 0
for i = 1 to m:
    z[i] = w^T x[i] + b
    a[i] = sigma(z[i])
    J += loss(a[i], y[i])
    dz[i] = a[i] - y[i]
    dw1 += x_1[i] * dz[i]
    dw2 += x_2[i] * dz[i]
    db += dz[i]
J = J/m, dw1 = dw1/m, dw2 = dw2/m, db = db/m
```
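For reference, here is a minimal runnable sketch of the loop-based computation in Python. It uses `dz = a - y`, which is the derivative of the cross-entropy loss with respect to `z` for a sigmoid output. The function name `gradients_loop` and the toy data are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def gradients_loop(x1, x2, y, w1, w2, b):
    """Averaged cost and gradients, computed with an explicit loop."""
    m = len(y)
    J = dw1 = dw2 = db = 0.0
    for i in range(m):
        z = w1 * x1[i] + w2 * x2[i] + b
        a = sigmoid(z)
        J += -(y[i] * np.log(a) + (1 - y[i]) * np.log(1 - a))
        dz = a - y[i]          # dL/dz for sigmoid + cross-entropy
        dw1 += x1[i] * dz
        dw2 += x2[i] * dz
        db += dz
    return J / m, dw1 / m, dw2 / m, db / m

# Toy data: three examples with two features each (hypothetical values).
x1 = np.array([0.5, -1.0, 2.0])
x2 = np.array([1.0, 0.0, -0.5])
y = np.array([1, 0, 1])

J, dw1, dw2, db = gradients_loop(x1, x2, y, w1=0.0, w2=0.0, b=0.0)
print(J, dw1, dw2, db)
```

With all parameters at zero every prediction is 0.5, so the cost equals $\log 2$.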

The first step in vectorizing this computation is to store `dw1` and `dw2` in a single vector. (Recall that `dw1` and `dw2` are shorthand for $\frac{\delta \mathcal{L}(a, y^{(i)})}{\delta w_1}$ and $\frac{\delta \mathcal{L}(a, y^{(i)})}{\delta w_2}$ respectively.) We replace their initialization with:

```
dw = np.zeros((n_x, 1))
```

Now we can replace the separate `dw1` and `dw2` updates with a single one:

```
dw += x[i] * dz[i]
```

Vectorizing collapses the per-weight increments into one step, no matter how many features `n_x` there are. This is only the most basic example.
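A sketch of this intermediate version, with the loop over features gone but the loop over examples still in place, might look like the following. The function name `gradients_one_loop`, the column-per-example layout of `X`, and the toy data are assumptions for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_one_loop(X, Y, w, b):
    """X has shape (n_x, m), Y has shape (1, m), w has shape (n_x, 1)."""
    n_x, m = X.shape
    dw = np.zeros((n_x, 1))
    db = 0.0
    for i in range(m):
        x_i = X[:, i].reshape(n_x, 1)    # i-th example as a column vector
        z_i = (w.T @ x_i).item() + b
        a_i = sigmoid(z_i)
        dz_i = a_i - Y[0, i]
        dw += x_i * dz_i                 # updates every weight at once
        db += dz_i
    return dw / m, db / m

# Hypothetical toy data: two features, three examples.
X = np.array([[0.5, -1.0, 2.0],
              [1.0, 0.0, -0.5]])
Y = np.array([[1, 0, 1]])
w = np.zeros((2, 1))

dw, db = gradients_one_loop(X, Y, w, 0.0)
print(dw.ravel(), db)
```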

# Zero `for` loops

Let's take a look at logistic regression without a single for loop.

## Forward Propagation

Rather than performing forward propagation on one training example at a time, we can vectorize it across the whole training set.

Since

$$X = \Big[x^{(1)}, x^{(2)}, x^{(3)}, ... , x^{(m)} \Big]$$

We can define

$$Z = \Big[z^{(1)}, z^{(2)}, z^{(3)}, ...,  z^{(m)} \Big] = w^T X + \Big[b, b, b, ..., b \Big]$$

How does this work? We know that $w^T \in \mathbb{R}^{1 \times n}$ and $X \in \mathbb{R}^{n \times m}$. Therefore, the matrix product of the two yields a row vector in $\mathbb{R}^{1 \times m}$. Adding the row vector filled with our bias value $b$ then gives the final $Z \in \mathbb{R}^{1 \times m}$.

In `numpy`, this is done with:

```python
Z = np.dot(w.T, X) + b
```

Note that if `b` is a *real number* (a scalar), `numpy` expands it into a $1 \times m$ row vector via a mechanism known as **broadcasting**. This will be covered in more detail later.
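To make the shapes concrete, here is a small sketch with made-up values for `w`, `X`, and `b`. It checks that adding the scalar `b` gives the same result as adding an explicit row vector filled with `b`.

```python
import numpy as np

# Hypothetical values: n = 2 features, m = 3 examples.
w = np.array([[0.1], [-0.2]])            # shape (2, 1)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # shape (2, 3), one example per column
b = 0.5                                  # scalar bias

# Scalar b is broadcast across all m columns.
Z = np.dot(w.T, X) + b

# Equivalent computation with the bias written out as a (1, m) row vector.
Z_explicit = np.dot(w.T, X) + np.full((1, X.shape[1]), b)

print(Z.shape)                    # (1, 3)
print(np.allclose(Z, Z_explicit))
```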

Then we can implement (in the programming assignment) an efficient way to calculate the activations $A = \sigma(Z)$ for all examples in a single operation as well.
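One possible way to do this, shown here only as a sketch and not necessarily the form used in the assignment, relies on `np.exp` operating element-wise, so the sigmoid is applied to every entry of `Z` at once. The values in `Z` are made up for the example.

```python
import numpy as np

def sigmoid(Z):
    """Element-wise logistic function; works for scalars and arrays."""
    return 1.0 / (1.0 + np.exp(-Z))

# Hypothetical pre-activations for m = 3 examples.
Z = np.array([[-2.0, 0.0, 2.0]])

A = sigmoid(Z)     # same shape as Z, one activation per example
print(A.shape)
```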

## Backward Propagation

First we take a look at how to calculate `dz`. We observe that `dz` is simply `A - Y`.

$$A = \Big[ a^{(1)}, a^{(2)}, a^{(3)}, ..., a^{(m)} \Big]$$
$$Y = \Big[ y^{(1)}, y^{(2)}, y^{(3)}, ..., y^{(m)} \Big]$$

Thus we can easily calculate `dz` for all training examples in a single operation without a for-loop.
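In `numpy` this is a single element-wise subtraction. The sketch below uses made-up activations and labels, and the variable name `dZ` is chosen here to denote the row vector of all `dz` values.

```python
import numpy as np

# Hypothetical activations and labels for m = 3 examples.
A = np.array([[0.9, 0.2, 0.6]])
Y = np.array([[1, 0, 1]])

# One subtraction yields dz for every training example.
dZ = A - Y
print(dZ.shape)
```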

Now let's take a look at how to calculate `db`. From the derivation in our previous notebook, `db` is simply the average of `dz`. Recall:

$$\frac{\delta \mathcal{L}(a, y)}{\delta b} = \frac{\delta \mathcal{L}(a, y)}{\delta z}$$

The `db` over all training examples is simply the sum of the individual `dz` values divided by `m`.

Therefore, we can compute it in a single operation in `numpy` as well.

```python
db = 1/m * np.sum(dz)
```

The last operation that we can vectorize for backpropagation is our computation of `dw`. Recall earlier that we combined both `dw1` and `dw2` into a single vector. For computation of `dw` using vectorized operations, we can perform the following:

$$dw = \frac{1}{m} X (dz)^T$$

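This works because the product $X (dz)^T$ multiplies each column $x^{(i)}$ of $X$ by its corresponding $dz^{(i)}$ and sums the results over all $m$ examples, which is exactly what the loop accumulated one example at a time. This step can be done in Python: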

```python
dw = 1/m * np.dot(X, dz.T)
```
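Putting the pieces together, one iteration without any explicit for-loop could be sketched as follows. The function name `gradients_vectorized`, the toy data, and the learning rate `alpha = 0.1` are assumptions made for this example.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def gradients_vectorized(X, Y, w, b):
    """X: (n_x, m), Y: (1, m), w: (n_x, 1), b: scalar."""
    m = X.shape[1]
    Z = np.dot(w.T, X) + b                  # forward pass, shape (1, m)
    A = sigmoid(Z)
    J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                              # shape (1, m)
    dw = np.dot(X, dZ.T) / m                # shape (n_x, 1)
    db = np.sum(dZ) / m
    return J, dw, db

# Hypothetical toy data: two features, three examples.
X = np.array([[0.5, -1.0, 2.0],
              [1.0, 0.0, -0.5]])
Y = np.array([[1, 0, 1]])
w = np.zeros((2, 1))
b = 0.0

J, dw, db = gradients_vectorized(X, Y, w, b)

# One gradient descent step (alpha is an assumed learning rate).
alpha = 0.1
w = w - alpha * dw
b = b - alpha * db
print(J, w.ravel(), b)
```

Repeating this over many iterations still needs an outer loop over iterations, but no loop over training examples or features remains.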