# NumPy 2D arrays: Conditional filtering and summary statistics

This notebook covers common manipulations of NumPy 2D arrays: sorting, conditional filtering, summary statistics, and copying slices safely.

## Setup

Import NumPy under its conventional alias, `np`.

In [1]:
import numpy as np

## Sorting NumPy 2D Arrays

NumPy can sort rows or columns of a 2D array using `np.sort()`. Note that the `np.sort()` function does not change the original array given, so it's best to assign the new sorted array to a variable for later access. Calling the `sort()` method directly on the array with `<arr>.sort()` will change the original array variable in place.
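
Here is a minimal sketch of that difference. The array `demo` and the names `sorted_copy` and `result` are made up for illustration and are not used elsewhere in this notebook.

```python
import numpy as np

demo = np.array([[3, 1, 2],
                 [6, 5, 4]])

# np.sort() returns a new, sorted array and leaves `demo` untouched
sorted_copy = np.sort(demo)
print(demo)         # still in its original order
print(sorted_copy)  # each row sorted

# The .sort() method sorts in place and returns None
result = demo.sort()
print(result)       # None
print(demo)         # now sorted along the last axis
```

A common slip is writing `demo = demo.sort()`, which replaces the array with `None`.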

Use the `axis` argument of `np.sort()` to choose whether values are sorted within rows or within columns. For 2D arrays it takes a value of `0` or `1`. I find it a bit unintuitive, but here's what happens for different values of the `axis` argument:

- `axis=0` → sort the values within each column (moving "across" axis 0, which is the row axis)

- `axis=1` → sort the values within each row (moving "across" axis 1, which is the column axis)

- The default is `axis=-1` (the last axis), which for a 2D array is the same as `axis=1`

Let's look at a couple of examples.

In [2]:
# Initialize a 2D array of size 3x3
arr = np.array([
    [9, 2, 7],
    [4, 6, 1],
    [8, 3, 5]
])

arr

array([[9, 2, 7],
       [4, 6, 1],
       [8, 3, 5]])

### Sort each row

In [3]:
row_sorted = np.sort(arr, axis=1)
row_sorted

array([[2, 7, 9],
       [1, 4, 6],
       [3, 5, 8]])

### Sort each column

In [4]:
col_sorted = np.sort(arr, axis=0)
col_sorted

array([[4, 2, 1],
       [8, 3, 5],
       [9, 6, 7]])
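
As a quick check of the default `axis=-1` behavior, the sketch below compares it with `axis=1` and also shows `axis=None`, which sorts the flattened array. The name `grid` is a stand-in copy of the example array so that `arr` is left alone.

```python
import numpy as np

grid = np.array([
    [9, 2, 7],
    [4, 6, 1],
    [8, 3, 5]
])

# For a 2D array the last axis is axis 1, so the default sorts within each row
default_sorted = np.sort(grid)
print(np.array_equal(default_sorted, np.sort(grid, axis=1)))  # True

# axis=None flattens the array first, then sorts all elements together
flat_sorted = np.sort(grid, axis=None)
print(flat_sorted)  # [1 2 3 4 5 6 7 8 9]
```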

## Conditional Filtering in 2D Arrays

Boolean masks work on 2D arrays the same way as on 1D arrays, selecting the elements for which the condition evaluates to `True`.

But since a 2D array contains both rows and columns, we have to think about how we want to filter: by rows, by columns, or across all elements.

### Conditional filtering of elements
To apply a condition to every element of the 2D array, regardless of rows and columns, index the array with a boolean mask that has the same shape as the array.

Note that the result is a "flattened" 1D array, since otherwise it's unclear how to maintain the shape of the original 2D array with just some elements selected.

#### Example: get all elements < 5

In [5]:
arr[arr < 5]

array([2, 4, 1, 3])
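
Two variations that may be useful here, offered as a sketch rather than as part of the original walkthrough: combining conditions with `&`, and using `np.where` when you want to keep the 2D shape by filling the non-matching positions with a placeholder. The placeholder value `0` and the names `grid`, `between`, and `kept_shape` are arbitrary choices for illustration.

```python
import numpy as np

grid = np.array([
    [9, 2, 7],
    [4, 6, 1],
    [8, 3, 5]
])

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
between = grid[(grid > 2) & (grid < 7)]
print(between)  # [4 6 3 5]

# np.where keeps the 2D shape: matching elements are kept, the rest become 0
kept_shape = np.where(grid < 5, grid, 0)
print(kept_shape)
# [[0 2 0]
#  [4 0 1]
#  [0 3 0]]
```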

### Advanced: Conditional Filtering of Rows
Let's say you have a more complicated use case: you'd like to select all rows that match certain criteria, such as rows whose element in the first column is greater than 5. Let's walk through how to do this.

In [6]:
arr

array([[9, 2, 7],
       [4, 6, 1],
       [8, 3, 5]])

We'll construct a boolean mask from the first element of every row. To do that, we first select those elements, which together make up the first column.

In [7]:
arr[:, 0]  # Select the first column across all rows; the result is [9, 4, 8]

array([9, 4, 8])

Now we can create a boolean mask across that selection.

In [8]:
mask = arr[:, 0] > 5  # True for each row whose first-column value is greater than 5
mask

array([ True, False,  True])

Finally, we apply the mask to the rows, since we want rows where the first column is > 5.

In [9]:
arr[mask, :] # This will apply the mask to the rows and select all the columns

array([[9, 2, 7],
       [8, 3, 5]])

It's also possible to put it together skipping the mask, but it starts to look a bit complicated.

In [None]:
arr[arr[:, 0] > 5] # Select rows where the first column's value is greater than 5

The above masking can be a little confusing, but remember that in the indexing expression `arr[a, b]`, `a` always refers to the rows and `b` to the columns (RC).
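
As a small sanity check, the sketch below confirms that the two-step version and the one-line version select the same rows. It uses `grid` as a stand-in copy of the example array, and `row_mask`, `explicit`, and `shorthand` are illustrative names.

```python
import numpy as np

grid = np.array([
    [9, 2, 7],
    [4, 6, 1],
    [8, 3, 5]
])

row_mask = grid[:, 0] > 5         # one boolean per row
explicit = grid[row_mask, :]      # mask in the row position, all columns
shorthand = grid[grid[:, 0] > 5]  # the trailing ", :" is implied

print(np.array_equal(explicit, shorthand))  # True
print(shorthand)
# [[9 2 7]
#  [8 3 5]]
```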


### Advanced: Conditional Filtering of Columns
The same conditional filtering can be accomplished across columns. Let's work through an example of selecting columns where the first row is greater than 5.

First, we'll create a mask for the first row being greater than 5.

In [10]:
mask = arr[0,:] > 5
mask

array([ True, False,  True])

We need to select the **columns**, so the boolean mask must be placed in the **column position** of the indexing brackets.

In [11]:
arr[:, mask]

array([[9, 7],
       [4, 1],
       [8, 5]])

Or skipping the mask:

In [None]:
arr[:, arr[0, :] > 5]  # Select columns where the first row's value is greater than 5
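
Because the example array is square (3x3), a row mask and a column mask happen to have the same length, which can hide mistakes. The sketch below uses a hypothetical 2x4 array named `wide` to show that each mask must match the length of the axis it is applied to.

```python
import numpy as np

wide = np.array([
    [9, 2, 7, 1],
    [4, 6, 1, 8]
])  # shape (2, 4)

row_mask = wide[:, 0] > 5  # length 2: one entry per row
col_mask = wide[0, :] > 5  # length 4: one entry per column

print(wide[row_mask, :])   # [[9 2 7 1]]
print(wide[:, col_mask])
# [[9 7]
#  [4 1]]

# A mask in the wrong position fails because the lengths do not match
try:
    wide[col_mask, :]
except IndexError as err:
    print("IndexError:", err)
```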

## Summary Statistics Across Axes

When calculating summary statistics across 2D arrays, you have to think about whether you want to calculate the mean, for example, of each row, each column, or across all elements in the array.

To specify this, NumPy summary functions accept an `axis` argument. Again, I find it counterintuitive, but this specifies the axis across which you calculate the statistic. Axis 0 is across rows and axis 1 is across columns.

- `axis=0` → Calculate the summary statistic for each column ("across" all the rows)

- `axis=1` → Calculate the summary statistic for each row ("across" all the columns)
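
One way to build intuition, shown here as a sketch with a made-up 2x3 array named `wide`: the axis you pass is the one that gets collapsed, so the result has one value for each position along the other axis.

```python
import numpy as np

wide = np.array([
    [1, 2, 3],
    [4, 5, 6]
])  # shape (2, 3): 2 rows, 3 columns

col_means = wide.mean(axis=0)  # collapses the row axis: one value per column
row_means = wide.mean(axis=1)  # collapses the column axis: one value per row

print(col_means, col_means.shape)  # [2.5 3.5 4.5] (3,)
print(row_means, row_means.shape)  # [2. 5.] (2,)
```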

### Summary statistics for each column in an array
An `axis` setting of 0 means the metric is computed across the rows (down each column), i.e., there will be one summary statistic output for each column.

In [None]:
print(arr.mean(axis=0))  # mean of each column
print(arr.std(axis=0))   # std dev of each column

### Summary statistics for each row in an array
An `axis` setting of 1 means the metric is computed across the columns (along each row), i.e., there will be one summary statistic output for each row.

In [None]:
print(arr.mean(axis=1))  # mean of each row
print(arr.std(axis=1))   # std dev of each row

### All rows and columns
To compute a summary statistic across all elements in an array regardless of rows or columns, provide `None` for the `axis` argument. This is also the default, so omitting `axis` gives the same result.

In [None]:
print(arr.mean(axis=None))
print(arr.std(axis=None))
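
The sketch below checks that omitting `axis` matches `axis=None`. It also points out one detail worth knowing: `std()` uses `ddof=0` by default, which gives the population standard deviation, and passing `ddof=1` gives the sample standard deviation instead. The name `grid` is a stand-in copy of the example array.

```python
import numpy as np

grid = np.array([
    [9, 2, 7],
    [4, 6, 1],
    [8, 3, 5]
])

overall_mean = grid.mean()  # same as grid.mean(axis=None)
print(overall_mean)         # 5.0
print(overall_mean == grid.mean(axis=None))  # True

# Default ddof=0 divides by N; ddof=1 divides by N - 1
pop_std = grid.std()
sample_std = grid.std(ddof=1)
print(pop_std, sample_std)
```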

## Slicing & Copying: Avoiding Side Effects

Now, an important practical note to avoid unexpected behaviors ("side effects"). Slicing a NumPy array does NOT create a copy. The slice is a view that still refers to the original array's data, meaning any changes made to the slice also affect the original array.

To avoid this behavior, it's best to run the `copy()` method on the slice.

In [None]:
sub = arr[0:2, 0:2]  # top-left 2x2 slice
sub[:] = 999         # Make all elements equal to 999.

In [None]:
arr

Note that the original has now been changed! This is often not ideal. To avoid this behavior, it's good practice to make a copy by calling the `.copy()` method on the extracted slice of the array.

In [12]:
# Re-initialize a 2D array of size 3x3
arr = np.array([
    [9, 2, 7],
    [4, 6, 1],
    [8, 3, 5]
])

arr

array([[9, 2, 7],
       [4, 6, 1],
       [8, 3, 5]])

In [13]:
sub_copy = arr[0:2, 0:2].copy()
sub_copy[:] = 999    # safe
print(sub_copy)
arr                  # unchanged

[[999 999]
 [999 999]]


array([[9, 2, 7],
       [4, 6, 1],
       [8, 3, 5]])
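
If you're ever unsure whether you are holding a view or a copy, `np.shares_memory()` can tell you. The sketch below uses `grid` as a stand-in copy of the example array. It also shows that boolean-mask selection, unlike basic slicing, returns a new array.

```python
import numpy as np

grid = np.array([
    [9, 2, 7],
    [4, 6, 1],
    [8, 3, 5]
])

view = grid[0:2, 0:2]           # basic slice: a view on grid's data
copied = grid[0:2, 0:2].copy()  # explicit copy: independent data
masked = grid[grid[:, 0] > 5]   # boolean mask: also a new array

print(np.shares_memory(grid, view))    # True
print(np.shares_memory(grid, copied))  # False
print(np.shares_memory(grid, masked))  # False

# Writing through the view changes grid; writing to the copy does not
copied[:] = 0
view[0, 0] = -1
print(grid[0, 0])  # -1
```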

## Practice Exercise 1

Given the 2D array below, of size 2 x 5 (2 rows and 5 columns):

- Select the rows where the third column values are greater than 3

- Select the columns where the first row element is greater than the second row element (Hard)

- Select rows where the mean > 4

In [None]:
# Use np.array so that boolean masks and .mean() work on it
sample_arr = np.array([
    [4, 6, 3, 2, 1],
    [5, 4, 8, 9, 10]
])
sample_arr