diff --git a/Python/Module3_IntroducingNumpy/Broadcasting.md b/Python/Module3_IntroducingNumpy/Broadcasting.md
index 572c2bd7..3baae4d5 100644
--- a/Python/Module3_IntroducingNumpy/Broadcasting.md
+++ b/Python/Module3_IntroducingNumpy/Broadcasting.md
@@ -5,7 +5,7 @@ jupyter:
      extension: .md
      format_name: markdown
      format_version: '1.2'
-      jupytext_version: 1.3.0rc1
+      jupytext_version: 1.5.0
  kernelspec:
    display_name: Python 3
    language: python
@@ -486,16 +486,16 @@ Performing this computation using for-loops proceeds as follows:
def pairwise_dists_looped(x, y):
    """ Computing pairwise distances using for-loops

-        Parameters
-        ----------
-        x : numpy.ndarray, shape=(M, D)
-        y : numpy.ndarray, shape=(N, D)
+    Parameters
+    ----------
+    x : numpy.ndarray, shape=(M, D)
+    y : numpy.ndarray, shape=(N, D)

-        Returns
-        -------
-        numpy.ndarray, shape=(M, N)
-            The Euclidean distance between each pair of
-            rows between `x` and `y`."""
+    Returns
+    -------
+    numpy.ndarray, shape=(M, N)
+        The Euclidean distance between each pair of
+        rows between `x` and `y`."""
    # `dists[i, j]` will store the Euclidean
    # distance between `x[i]` and `y[j]`
    dists = np.empty((5, 6))
@@ -514,6 +514,16 @@ def pairwise_dists_looped(x, y):
```

Be sure to step through this code and see that `dists` stores each pair of Euclidean distances between the rows of `x` and `y`.
+
+```python
+# produces a shape-(5, 6) result
+>>> pairwise_dists_looped(x, y)
+array([[ 3.678 ,  8.4524, 10.3057,  7.3711,  6.2152,  5.5548],
+       [10.1457,  5.8793,  2.9274,  4.1114,  3.9098,  5.2259],
+       [ 7.3219,  0.8439,  6.8734,  4.5687,  7.3283,  4.8216],
+       [10.339 ,  7.032 ,  7.4745,  7.0633,  3.5999,  4.0107],
+       [ 8.2878,  3.5468,  6.336 ,  4.9014,  4.1858,  2.0257]])
+```
@@ -545,23 +555,31 @@ Voilà! We have produced the distances in a vectorized way. Let's write this out
def pairwise_dists_crude(x, y):
    """ Computing pairwise distances using vectorization.

-        This method uses memory-inefficient broadcasting.
-
-        Parameters
-        ----------
-        x : numpy.ndarray, shape=(M, D)
-        y : numpy.ndarray, shape=(N, D)
-
-        Returns
-        -------
-        numpy.ndarray, shape=(M, N)
-            The Euclidean distance between each pair of
-            rows between `x` and `y`."""
+    This method uses memory-inefficient broadcasting.
+
+    Parameters
+    ----------
+    x : numpy.ndarray, shape=(M, D)
+    y : numpy.ndarray, shape=(N, D)
+
+    Returns
+    -------
+    numpy.ndarray, shape=(M, N)
+        The Euclidean distance between each pair of
+        rows between `x` and `y`."""
    # The use of `np.newaxis` here is equivalent to our
    # use of the `reshape` function
    return np.sqrt(np.sum((x[:, np.newaxis] - y[np.newaxis])**2, axis=2))
```
-
+```python
+# produces a shape-(5, 6) result
+>>> pairwise_dists_crude(x, y)
+array([[ 3.678 ,  8.4524, 10.3057,  7.3711,  6.2152,  5.5548],
+       [10.1457,  5.8793,  2.9274,  4.1114,  3.9098,  5.2259],
+       [ 7.3219,  0.8439,  6.8734,  4.5687,  7.3283,  4.8216],
+       [10.339 ,  7.032 ,  7.4745,  7.0633,  3.5999,  4.0107],
+       [ 8.2878,  3.5468,  6.336 ,  4.9014,  4.1858,  2.0257]])
+```

Regrettably, there is a glaring issue with the vectorized computation that we just performed. Consider the largest-sized array that is created in the for-loop computation, compared to that of this vectorized computation. The for-loop version need only create a shape-$(M, N)$ array, whereas the vectorized computation creates an intermediate array (i.e. `diffs`) of shape-$(M, N, D)$. This intermediate array is even created in the one-line version of the code. This will create a massive array if $D$ is a large number!
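To put some quick numbers to this, here is an illustrative sketch that uses the `nbytes` attribute of NumPy's arrays, which reports the number of bytes consumed by an array's data. (The names `a` and `b` and the sizes `M`, `N`, and `D` here are hypothetical, chosen simply for demonstration; they are distinct from the `x` and `y` used throughout this section.)

```python
# An illustrative sketch: comparing the memory consumed by the
# broadcasted shape-(M, N, D) intermediate array against the
# memory needed for the final shape-(M, N) result.
>>> import numpy as np

# hypothetical sizes, chosen only for this demonstration
>>> M, N, D = 100, 200, 3072
>>> a = np.random.rand(M, D)
>>> b = np.random.rand(N, D)

# the intermediate array formed by the broadcasted subtraction
>>> (a[:, np.newaxis] - b[np.newaxis]).nbytes / 1e6  # memory, in MB
491.52

# the shape-(M, N) result is a factor of D (here, 3072x) smaller
>>> np.empty((M, N)).nbytes / 1e6
0.16
```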
Suppose, for instance, that you are finding the Euclidean distance between pairs of RGB images that each have a resolution of $32 \times 32$ (in order to see if the images resemble one another). Thus, in this scenario, each image is comprised of $D = 32 \times 32 \times 3 = 3072$ numbers ($32^2$ pixels, and each pixel has 3 values: a red, a blue, and a green color-value). Computing all the distances between a stack of 5000 images and a stack of 100 images would form an intermediate array of shape-$(5000, 100, 3072)$. Even though this large array only exists temporarily, it would have to consume over 12GB of RAM, given NumPy's default 64-bit floats! The for-loop version requires $\frac{1}{3072}$ as much memory (about 4MB).

@@ -620,34 +638,106 @@ Having accounted for all three terms, we can finally compute the Euclidean dista

```python
# computing all the distances
>>> dists = np.sqrt(x_y_sqrd + x_y_prod)
>>> dists.shape
(5, 6)
```

-In total, we have successfully used vectorization to compute the all pairs of distances, while only requiring an array of shape-$(M, N)$ to do so! This is the memory-efficient, vectorized form - the stuff that dreams are made of. Let's write the function that performs this computation in full.
+
+#### A Subtle Issue with Floating-point Precision
+
+There is one more important and very subtle detail that we have to deal with.
+In terms of pure mathematics, `x_y_sqrd + x_y_prod` must be a strictly non-negative value (i.e. its smallest possible value is $0$), since it is equivalent to
+
+\begin{equation}
+\sum_{i=0}^{D-1}{(x_{i} - y_{i})^2}
+\end{equation}
+
+That being said, we are working with floating-point numbers, which do not always behave exactly like rational numbers when we do arithmetic with them.
+Indeed, we saw earlier that the [quirks of floating-point arithmetic](https://www.pythonlikeyoumeanit.com/Module2_EssentialsOfPython/Basic_Objects.html#Understanding-Numerical-Precision) can lead to surprising results.
+Here, the strange behavior is that `x_y_sqrd + x_y_prod` **can produce negative numbers**!
+
+Let's see this in action.
+We'll take the following shape-(2, 3) array:

```python
-def pairwise_dists(x, y):
-    """ Computing pairwise distances using memory-efficient
-        vectorization.
+# A carefully-selected input that will trigger surprising
+# floating-point precision issues in our distance calculation
+>>> x = np.array([[4.700867387959219, 4.700867387959219, 4.700867387959219],
+...               [4.700867387959219, 4.700867387959219, 4.700867387959219]])
+```

-        Parameters
-        ----------
-        x : numpy.ndarray, shape=(M, D)
-        y : numpy.ndarray, shape=(N, D)
+And compute the squared distances between all combinations of the rows of `x`; the result should simply be a shape-(2, 2) array of zeros.

-        Returns
-        -------
-        numpy.ndarray, shape=(M, N)
-            The Euclidean distance between each pair of
-            rows between `x` and `y`."""
-    dists = -2 * np.matmul(x, y.T)
-    dists += np.sum(x**2, axis=1)[:, np.newaxis]
-    dists += np.sum(y**2, axis=1)
-    return np.sqrt(dists)
+```python
+# The square-distances should be exactly zero, but they
+# will instead be (*very* small) negative numbers!
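+# (An aside on scale: each term below is roughly ±66.3 or ±132.6,
+# while 64-bit floats carry only ~16 significant digits, so the
+# algebraic cancellation to exactly 0 can miss by an absolute
+# error on the order of 1e-14.)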
+>>> sqr_dists = -2 * np.matmul(x, x.T)  # `x_x_prod`
+>>> sqr_dists += np.sum(x**2, axis=1)[:, np.newaxis]
+>>> sqr_dists += np.sum(x**2, axis=1)
+>>> sqr_dists
+array([[-2.842170943040401e-14, -2.842170943040401e-14],
+       [-2.842170943040401e-14, -2.842170943040401e-14]])
```

+These values are *very* close to being zero, so it is not as if the magnitude of our result is wildly off, but the critical issue here is that the quirks of floating-point arithmetic produced (very small) negative numbers.
+We are going to be in for a rude awakening when we take the square-root of these values to get our final distances:
+
+```python
+# Taking the square-root of negative floats will produce NaNs
+>>> np.sqrt(sqr_dists)
+array([[nan, nan],
+       [nan, nan]])
+```
+
+`nan` stands for "Not a Number", which is to say that the square-root of a negative number cannot produce a real-valued float (the result will be an [imaginary number](https://www.pythonlikeyoumeanit.com/Module2_EssentialsOfPython/Basic_Objects.html#Complex-Numbers), but NumPy will not swap out real numbers for imaginary ones unless we explicitly permit it to).
+
+The solution to this problem comes from the realization that this occurrence is truly an edge-case, and that it will only manifest when the result *ought* to have been zero and indeed is very close to zero.
+Thus we can [clip](https://numpy.org/doc/stable/reference/generated/numpy.clip.html) `sqr_dists` such that any value that falls below zero gets set to zero and all other values will be left alone:
+
+```python
+# "Clipping" all negative entries so that they are set to 0
+>>> np.sqrt(np.clip(sqr_dists, a_min=0, a_max=None))
+array([[0., 0.],
+       [0., 0.]])
+```
+
+#### The Final Answer, At Last!
+
+In total, we have successfully used vectorization to compute all the pairs of distances, while only requiring an array of shape-$(M, N)$ to do so! This is the memory-efficient, vectorized form - the stuff that dreams are made of.
+Let's write the function that performs this computation in full.
+
+```python
+def pairwise_dists(x, y):
+    """ Computing pairwise distances using memory-efficient
+    vectorization.
+
+    Parameters
+    ----------
+    x : numpy.ndarray, shape=(M, D)
+    y : numpy.ndarray, shape=(N, D)
+
+    Returns
+    -------
+    numpy.ndarray, shape=(M, N)
+        The Euclidean distance between each pair of
+        rows between `x` and `y`."""
+    sqr_dists = -2 * np.matmul(x, y.T)
+    sqr_dists += np.sum(x**2, axis=1)[:, np.newaxis]
+    sqr_dists += np.sum(y**2, axis=1)
+    return np.sqrt(np.clip(sqr_dists, a_min=0, a_max=None))
+```
+
+```python
+# produces a shape-(5, 6) result
+>>> pairwise_dists(x, y)
+array([[ 3.678 ,  8.4524, 10.3057,  7.3711,  6.2152,  5.5548],
+       [10.1457,  5.8793,  2.9274,  4.1114,  3.9098,  5.2259],
+       [ 7.3219,  0.8439,  6.8734,  4.5687,  7.3283,  4.8216],
+       [10.339 ,  7.032 ,  7.4745,  7.0633,  3.5999,  4.0107],
+       [ 8.2878,  3.5468,  6.336 ,  4.9014,  4.1858,  2.0257]])
+```
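+As a final sanity check, we can confirm that all three of our implementations agree with one another. (A quick sketch: this assumes that `x` and `y` once again refer to the original shape-(5, D) and shape-(6, D) arrays used throughout this section - recall that `x` was temporarily reassigned to a shape-(2, 3) array in the floating-point demonstration above - and that `pairwise_dists_looped` and `pairwise_dists_crude` are still defined.)
+
+```python
+# the three implementations should agree to within
+# floating-point round-off
+>>> np.allclose(pairwise_dists(x, y), pairwise_dists_looped(x, y))
+True
+>>> np.allclose(pairwise_dists(x, y), pairwise_dists_crude(x, y))
+True
+```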