Skip to content
177 changes: 133 additions & 44 deletions Python/Module3_IntroducingNumpy/Broadcasting.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ jupyter:
extension: .md
format_name: markdown
format_version: '1.2'
jupytext_version: 1.3.0rc1
jupytext_version: 1.5.0
kernelspec:
display_name: Python 3
language: python
Expand Down Expand Up @@ -486,16 +486,16 @@ Performing this computation using for-loops proceeds as follows:
def pairwise_dists_looped(x, y):
""" Computing pairwise distances using for-loops

Parameters
----------
x : numpy.ndarray, shape=(M, D)
y : numpy.ndarray, shape=(N, D)
Parameters
----------
x : numpy.ndarray, shape=(M, D)
y : numpy.ndarray, shape=(N, D)

Returns
-------
numpy.ndarray, shape=(M, N)
The Euclidean distance between each pair of
rows between `x` and `y`."""
Returns
-------
numpy.ndarray, shape=(M, N)
The Euclidean distance between each pair of
rows between `x` and `y`."""
# `dists[i, j]` will store the Euclidean
# distance between `x[i]` and `y[j]`
dists = np.empty((5, 6))
Expand All @@ -514,6 +514,16 @@ def pairwise_dists_looped(x, y):
```

Be sure to step through this code and see that `dists` stores each pair of Euclidean distances between the rows of `x` and `y`.

```python
# produces a shape-(5, 6) result
>>> pairwise_dists_looped(x, y)
array([[ 3.678 , 8.4524, 10.3057, 7.3711, 6.2152, 5.5548],
[10.1457, 5.8793, 2.9274, 4.1114, 3.9098, 5.2259],
[ 7.3219, 0.8439, 6.8734, 4.5687, 7.3283, 4.8216],
[10.339 , 7.032 , 7.4745, 7.0633, 3.5999, 4.0107],
[ 8.2878, 3.5468, 6.336 , 4.9014, 4.1858, 2.0257]])
```
<!-- #endregion -->

<!-- #region -->
Expand Down Expand Up @@ -545,23 +555,31 @@ Voilà! We have produced the distances in a vectorized way. Let's write this out
def pairwise_dists_crude(x, y):
""" Computing pairwise distances using vectorization.

This method uses memory-inefficient broadcasting.
Parameters
----------
x : numpy.ndarray, shape=(M, D)
y : numpy.ndarray, shape=(N, D)
Returns
-------
numpy.ndarray, shape=(M, N)
The Euclidean distance between each pair of
rows between `x` and `y`."""
This method uses memory-inefficient broadcasting.

Parameters
----------
x : numpy.ndarray, shape=(M, D)
y : numpy.ndarray, shape=(N, D)

Returns
-------
numpy.ndarray, shape=(M, N)
The Euclidean distance between each pair of
rows between `x` and `y`."""
# The use of `np.newaxis` here is equivalent to our
# use of the `reshape` function
return np.sqrt(np.sum((x[:, np.newaxis] - y[np.newaxis])**2, axis=2))
```

```python
# produces a shape-(5, 6) result
>>> pairwise_dists_crude(x, y)
array([[ 3.678 , 8.4524, 10.3057, 7.3711, 6.2152, 5.5548],
[10.1457, 5.8793, 2.9274, 4.1114, 3.9098, 5.2259],
[ 7.3219, 0.8439, 6.8734, 4.5687, 7.3283, 4.8216],
[10.339 , 7.032 , 7.4745, 7.0633, 3.5999, 4.0107],
[ 8.2878, 3.5468, 6.336 , 4.9014, 4.1858, 2.0257]])
```
Regrettably, there is a glaring issue with the vectorized computation that we just performed. Consider the largest sized array that is created in the for-loop computation, compared to that of this vectorized computation. The for-loop version need only create a shape-$(M, N)$ array, whereas the vectorized computation creates an intermediate array (i.e. `diffs`) of shape-$(M, N, D)$. This intermediate array is even created in the one-line version of the code. This will create a massive array if $D$ is a large number!

Suppose, for instance, that you are finding the Euclidean between pairs of RGB images that each have a resolution of $32 \times 32$ (in order to see if the images resemble one another). Thus in this scenario, each image is comprised of $D = 32 \times 32 \times 3 = 3072$ numbers ($32^2$ pixels, and each pixel has 3 values: a red, blue, and green-color value). Computing all the distances between a stack of 5000 images with a stack of 100 images would form an intermediate array of shape-$(5000, 100, 3072)$. Even though this large array only exists temporarily, it would have to consume over 6GB of RAM! The for-loop version requires $\frac{1}{3072}$ as much memory (about 2MB).
Expand Down Expand Up @@ -611,7 +629,7 @@ Thus the third term in our equation, $-2\sum_{i=0}^{D-1}{x_{i} y_{i}}$, for all
```python
# computing the third term in the distance
# equation, for all pairs of rows
>>> x_y_prod = -2 * np.matmul(x, y.T) # `np.dot` can also be used to the same effect
>>> x_y_prod = 2 * np.matmul(x, y.T) # `np.dot` can also be used to the same effect
>>> x_y_prod.shape
(5, 6)
```
Expand All @@ -620,34 +638,106 @@ Having accounted for all three terms, we can finally compute the Euclidean dista

```python
# computing all the distances
>>> dists = np.sqrt(x_y_sqrd + x_y_prod)
>>> dists = np.sqrt(x_y_sqrd - x_y_prod)
>>> dists.shape
(5, 6)
```
In total, we have successfully used vectorization to compute the all pairs of distances, while only requiring an array of shape-$(M, N)$ to do so! This is the memory-efficient, vectorized form - the stuff that dreams are made of. Let's write the function that performs this computation in full.

#### A Subtle Issue with Floating-point Precision

There is one more important and very subtle detail that we have to deal with.
In terms of pure mathematics, `x_y_sqrd - x_y_prod` must be a strictly non-negative value (i.e. its smallest possible value is $0$), since it is equivalent to

\begin{equation}
\sum_{i=0}^{D-1}{(x_{i} - y_{i})^2}
\end{equation}

That being said, are working with floating-point numbers, which do not always behave exactly like rational numbers when we do arithmetic with them.
Indeed, we saw earlier that the [quirks of floating-point arithmetic](https://www.pythonlikeyoumeanit.com/Module2_EssentialsOfPython/Basic_Objects.html#Understanding-Numerical-Precision) can lead to surprising results.
Here, the strange behavior is that `x_y_sqrd - x_y_prod` **can produce negative numbers**!

Let's see this in action.
We'll take the following shape-(2, 3) array

```python
def pairwise_dists(x, y):
""" Computing pairwise distances using memory-efficient
vectorization.
# A carefully-selected input that will trigger surprising
# floating-point precision issues in our distance calculation
>>> x = np.array([[4.700867387959219, 4.700867387959219, 4.700867387959219],
... [4.700867387959219, 4.700867387959219, 4.700867387959219]])
```

Parameters
----------
x : numpy.ndarray, shape=(M, D)
y : numpy.ndarray, shape=(N, D)
And compute the square distance between all combinations of rows `x`; the result should simply be a shape-(2, 2) array of zeros.

Returns
-------
numpy.ndarray, shape=(M, N)
The Euclidean distance between each pair of
rows between `x` and `y`."""
dists = -2 * np.matmul(x, y.T)
dists += np.sum(x**2, axis=1)[:, np.newaxis]
dists += np.sum(y**2, axis=1)
return np.sqrt(dists)
```python
# The square-distances should be exactly zero, but they
# will instead be (*very* small) negative numbers!
>>> sqr_dists = -2 * np.matmul(x, x.T) # `x_x_prod`
>>> sqr_dists += np.sum(x**2, axis=1)[:, np.newaxis]
>>> sqr_dists += np.sum(x**2, axis=1)
>>> sqr_dists
array([[-2.842170943040401e-14, -2.842170943040401e-14],
[-2.842170943040401e-14, -2.842170943040401e-14]])
```

These values are *very* close to being zero, so it is not as if the magnitude of our result is wildly off, but the critical issue here is that the quirks of floating-point arithmetic produced (very small) negative numbers.
We are going to be in for a rude awakening when we take the square-root of these values to get our final distances:

```python
# Taking the square-root of negative floats will produce NaNs
>>> np.sqrt(sqr_dists)
array([[nan, nan],
[nan, nan]])
```

`nan` stands for "Not a Number", which is to say that the square-root of a negative number cannot produce a real-valued float (the result will be an [imaginary number](https://www.pythonlikeyoumeanit.com/Module2_EssentialsOfPython/Basic_Objects.html#Complex-Numbers), but NumPy will not swap out real numbers for imaginary ones unless we explicitly permit it to).

The solution to this problem comes from the realization that this occurrence is truly an edge-case, and that it will only manifest when the result *ought* to have been zero and indeed is very close to zero.
Thus we can [clip](https://numpy.org/doc/stable/reference/generated/numpy.clip.html) `sqr_dists` such that any value that falls below zero gets set to zero and all other values will be left alone:

```python
# "Clipping" all negative entries so that they are set to 0
>>> np.sqrt(np.clip(sqr_dists, a_min=0, a_max=None))
array([[0., 0.],
[0., 0.]])
```
<!-- #endregion -->

<!-- #region -->
#### The Final Answer, At Last!

In total, we have successfully used vectorization to compute all the pairs of distances, while only requiring an array of shape-$(M, N)$ to do so! This is the memory-efficient, vectorized form – the stuff that dreams are made of.
Let's write the function that performs this computation in full.

```python
def pairwise_dists(x, y):
""" Computing pairwise distances using memory-efficient
vectorization.

Parameters
----------
x : numpy.ndarray, shape=(M, D)
y : numpy.ndarray, shape=(N, D)

Returns
-------
numpy.ndarray, shape=(M, N)
The Euclidean distance between each pair of
rows between `x` and `y`."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want to leave the end-quote on the same line or a newline? not sure what the consistent thing is off the top of my head

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think same line is what black does. So I'll go with it.

sqr_dists = -2 * np.matmul(x, y.T)
sqr_dists += np.sum(x**2, axis=1)[:, np.newaxis]
sqr_dists += np.sum(y**2, axis=1)
return np.sqrt(np.clip(sqr_dists, a_min=0, a_max=None))
```

```python
# produces a shape-(5, 6) result
>>> pairwise_dists(x, y)
array([[ 3.678 , 8.4524, 10.3057, 7.3711, 6.2152, 5.5548],
[10.1457, 5.8793, 2.9274, 4.1114, 3.9098, 5.2259],
[ 7.3219, 0.8439, 6.8734, 4.5687, 7.3283, 4.8216],
[10.339 , 7.032 , 7.4745, 7.0633, 3.5999, 4.0107],
[ 8.2878, 3.5468, 6.336 , 4.9014, 4.1858, 2.0257]])
```
<!-- #endregion -->

<div class="alert alert-info">
Expand Down Expand Up @@ -676,7 +766,7 @@ Use the function [numpy.allclose](https://numpy.org/doc/stable/reference/generat

## Reading Comprehension Solutions

<!-- #region -->

**Broadcast Compatibility: Solution**

1\. Incompatible
Expand All @@ -688,7 +778,6 @@ Use the function [numpy.allclose](https://numpy.org/doc/stable/reference/generat
4\. `9 x 2 x 5`

5\. Incompatible
<!-- #endregion -->

<!-- #region -->
**Basic Broadcasting: Solution**
Expand Down