# Filtering data

Manually selecting data from an array by giving indices or ranges works if you want contiguous chunks of data, but sometimes you want to grab arbitrary data from an array.

Let's explore the options for more advanced indexing of arrays.

In [1]:
import numpy as np

We'll use the following array in our examples. To make it easier to understand what's going on, I've set the value of the digit after the decimal place to match the index (i.e. the value at index <u>3</u> is 10.<u>3</u>):

In [2]:
a = np.array([6.0, 2.1, 8.2, 10.3, 5.4, 1.5, 7.6, 3.7, 9.8])

## Select by index

We've already seen that you can access a single element by giving its index:

In [3]:
a[5]

1.5

and that we can also pass a slice (just like in a Python list):

In [4]:
a[5:7]

array([1.5, 7.6])

To start to get more advanced than what Python lists support, we can also provide a list of element indices that we want. So, if we want elements `5` and `6` then we can pass in a list of those indices:

In [5]:
wanted_elements = [5, 6]
a[wanted_elements]

array([1.5, 7.6])

So this gave us back the item at index `5` and the item at index `6`.

The indices we pass in do not have to be contiguous; they can be from anywhere in the array:

In [6]:
wanted_elements = [4, 7]
a[wanted_elements]

array([5.4, 3.7])

They can be in any order:

In [7]:
wanted_elements = [7, 4]
a[wanted_elements]

array([3.7, 5.4])

We can even repeat them, and the matching value is repeated in the output:

In [8]:
wanted_elements = [4, 4, 7, 4]
a[wanted_elements]

array([5.4, 5.4, 3.7, 5.4])

In all these cases we have made a separate variable to hold the indices and then passed in that variable. We can do this in one step too:

In [9]:
a[[4, 4, 7, 4]]

array([5.4, 5.4, 3.7, 5.4])

The outer `[]` are the "index the array `a`" syntax, and the inner `[]` are the "make a list of these numbers" syntax.
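
One consequence worth knowing is that indexing with a list gives back a new array holding copies of the selected values, whereas a slice gives a view onto the original data. The following is only a minimal sketch of that behaviour; the names `demo`, `picked` and `window` are chosen here for illustration and a separate array is used so that `a` is left unchanged:

```python
import numpy as np

demo = np.array([6.0, 2.1, 8.2, 10.3, 5.4, 1.5, 7.6, 3.7, 9.8])

# Indexing with a list of indices copies the selected values into a new array
picked = demo[[4, 7]]
picked[0] = -1.0  # only changes the copy

# A slice is a view, so writing to it changes the original array
window = demo[5:7]
window[0] = -2.0

print(demo[4])  # still 5.4
print(demo[5])  # now -2.0
```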

### Exercise 8
> Using this indexing syntax, extract the following results:
> 
> 1. `[6. , 8.2, 5.4, 7.6, 9.8]`
> 2. `[ 6. ,  2.1,  8.2, 10.3]`
> 3. `[9.8, 1.5, 3.7, 5.4]`


In [10]:
# Write your exercise code here.


[<small>answer</small>](../solutions/filtering_index.ipynb)

## Filter by boolean

Let's say that we want all the values in the array that are larger than 4.0. We could do this by manually finding all those indices which match and constructing a list of them to use as a selector:

In [11]:
a = np.array([6.0, 2.1, 8.2, 10.3, 5.4, 1.5, 7.6, 3.7, 9.8])

In [12]:
large_indices = [0, 2, 3, 4, 6, 8]
a[large_indices]

array([ 6. ,  8.2, 10.3,  5.4,  7.6,  9.8])

Or, we can use a list of boolean values, where we set those elements we want to extract to `True` and those we want to be rid of to `False`:

In [13]:
mask = [True, False, True, True, True, False, True, False, True]
a[mask]

array([ 6. ,  8.2, 10.3,  5.4,  7.6,  9.8])

A sequence of `True` and `False` values used to select elements like this is referred to as a *boolean array*, or a *mask*.

With a larger array it would be tedious to create this list by hand. Luckily NumPy provides us with a way of constructing these automatically. If we want a boolean array matching the values which are greater than 4, we can use the same sort of syntax we used for multiplication, but use `>` instead:

In [14]:
a > 4

array([ True, False,  True,  True,  True, False,  True, False,  True])

Or, diagrammatically (using ■ for `True` and □ for `False`):

![](../assets/diag_04-1.png) 

This mask can be saved to a variable and passed in as an index:

In [15]:
mask = a > 4
a[mask]

array([ 6. ,  8.2, 10.3,  5.4,  7.6,  9.8])

Or, in one line:

In [16]:
a[a > 4]

array([ 6. ,  8.2, 10.3,  5.4,  7.6,  9.8])

which can be read as "select from `a` the elements where `a` is greater than 4".

![](../assets/diag_04-2.png) 
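
Because a mask is an ordinary NumPy array of booleans, it can be inspected before it is used. The following sketch reuses the name `mask` from the example above and uses NumPy's `count_nonzero` function to count how many elements would be selected:

```python
import numpy as np

a = np.array([6.0, 2.1, 8.2, 10.3, 5.4, 1.5, 7.6, 3.7, 9.8])

mask = a > 4

print(mask.dtype)              # bool
print(mask.shape == a.shape)   # True: one boolean per element of `a`
print(np.count_nonzero(mask))  # 6 values are greater than 4
print(len(a[mask]))            # 6, the same as the number of True entries
```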

### Exercise 9

> Extract all the values in `a` which are:
> 1. Less than 5
> 2. Equal to 6.0
> 3. Greater than or equal to 8.2 and less than 10
>    - Hint: You can combine multiple conditions that must all be fulfilled with `(condition 1) & (condition 2)`


In [17]:
# Write your exercise code here.


[<small>answer</small>](../solutions/filtering_compare.ipynb)

## Setting from filters

At the beginning of the course, we set a single value in an array with:
```python
a[4] = 99.4
```

We can also use any filter on the left-hand side of an assignment to set values. For example, if we wanted to set all values greater than 4 to be 0 we can do:

In [18]:
a[a > 4] = 0
a

array([0. , 2.1, 0. , 0. , 0. , 1.5, 0. , 3.7, 0. ])

![](../assets/diag_04-3.png) 
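
The value on the right-hand side does not have to be `0`, and in-place operators work on a filtered selection too. As a rough sketch, using a separate array `b` (a name chosen here for illustration) so that `a` is left untouched, we can cap values at a limit or update only the selected elements:

```python
import numpy as np

b = np.array([6.0, 2.1, 8.2, 10.3, 5.4, 1.5, 7.6, 3.7, 9.8])

# Cap every value above 8 at exactly 8
b[b > 8] = 8.0
print(b)

# Double only the values that are below 3, leaving the others alone
b[b < 3] *= 2
print(b)
```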

## Filtering multi-dimensional data

In [19]:
grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(grid)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [20]:
grid[grid > 4]

array([5, 6, 7, 8, 9])

![](../assets/diag_04-4.png) 

In this case, it has lost the information about *where* the numbers have come from. The dimensionality has been lost.
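
If the positions matter, they can be recovered separately from the mask itself. One possible approach, sketched here with NumPy's `nonzero` function (the names `rows` and `cols` are illustrative), returns the row and column index of every `True` entry, in the same order as the filtered values:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# One index array per dimension, listing where the condition holds
rows, cols = np.nonzero(grid > 4)

for r, c, value in zip(rows, cols, grid[grid > 4]):
    print(f"grid[{r}, {c}] = {value}")
```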

NumPy hasn't much choice but to flatten the result: if it were to return an array with the same shape as the original `grid`, what should it put in the spaces that we've filtered out?

```
[[? ? ?]
 [? 5 6]
 [7 8 9]]
```

You might think that it could fill those with `0` or `-1`, but either could very easily cause a problem in the code that follows. NumPy doesn't take a stance on this, as silently choosing a fill value would be dangerous.

In your code, you know what you're doing with your data, so it's ok for you to decide on a case-by-case basis. If you decide that you want to keep the original shape, but replace any filtered-out values with a `0` then you can use NumPy's [`where`](https://numpy.org/doc/stable/reference/generated/numpy.where.html) function. It takes three arguments:
1. the boolean array selecting the values,
2. an array or value to use in the spots you're keeping, and
3. an array or value to use in the spots you're filtering out.

So, in the case where we want to replace any values less than or equal to 4 with `0`, we can use:

In [21]:
np.where(grid > 4, grid, 0)

array([[0, 0, 0],
       [0, 5, 6],
       [7, 8, 9]])

![](../assets/diag_04-5.png) 
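
The second and third arguments can be any value or array that fits the shape of the mask. As an illustrative sketch (the choice of `np.nan` as a marker is an assumption made for this example, and the names `marked` and `flipped` are hypothetical), filtered-out spots can be marked as "not a number" instead of `0`. Note that the result becomes a floating-point array, since `nan` cannot be stored as an integer:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Use a single value for the filtered-out spots
marked = np.where(grid > 4, grid, np.nan)
print(marked)
print(marked.dtype)  # float64

# The third argument can also be an array, here the negated values
flipped = np.where(grid > 4, grid, -grid)
print(flipped)
```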

Note that this has not affected the original array:

In [22]:
grid

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

One final way to reduce your data while keeping the dimensionality is to use [masked arrays](https://numpy.org/doc/stable/reference/maskedarray.html). This is useful in situations where you have *missing data*. The advantage of masked arrays is that operations like averaging are not affected by the cells that are masked out. The downside is that for much larger arrays they will use more memory and can be slower to operate on. Note that in a masked array `True` marks a cell as masked out, so the condition we pass is the inverse of the filter we used before:

In [23]:
masked_grid = np.ma.masked_array(grid, grid <= 4)
print(masked_grid)

[[-- -- --]
 [-- 5 6]
 [7 8 9]]


![](../assets/diag_04-6.png) 

In [24]:
np.mean(masked_grid)

7.0
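
To see why this matters, compare it with the zero-filled array from `np.where`. In this sketch (the name `zero_filled` is illustrative), the zeros count towards the average while the masked cells do not. The `count` and `filled` methods of a masked array report how many cells are unmasked and convert back to a plain array:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

zero_filled = np.where(grid > 4, grid, 0)
masked_grid = np.ma.masked_array(grid, grid <= 4)

print(np.mean(zero_filled))  # 35 / 9, roughly 3.89: the zeros drag the mean down
print(np.mean(masked_grid))  # 35 / 5 = 7.0: only the unmasked cells are averaged

print(masked_grid.count())    # 5 unmasked cells
print(masked_grid.filled(0))  # plain array with masked cells replaced by 0
```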

### Exercise 10

> We continue with the same file that we used before to grab the "temperature" array, but this time we work with the `rain` array. The `rain` data set represents the forecast of metres of rainfall per day in an area and is again based on the data from [ECMWF](https://apps.ecmwf.int/codes/grib/param-db/?id=228). It is two-dimensional with axes of latitude and longitude. We will also load some regional masks with:
> 
> ```python
> with np.load("../assets/weather_data.npz") as weather:
>     rain = weather["rain"]
>     uk_mask = weather["uk"]
>     irl_mask = weather["ireland"]
>     spain_mask = weather["spain"]
> ```
> 
> - Calculate the mean, max and min of the entire 2D `rain` data set.
> - Look at the `uk_mask` array, including its `dtype` and `shape`
> - Create a figure of the `rain` data and overlay the three regional masks
>   - Hint: After creating the rain map, you can overlay more data with e.g. `ax.contour(...)`
>   - Hint: Have a look at the [contour plot function documentation](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.contour.html) and add each mask with a different color to the rain map
> - Filter the `rain` data set to contain only those values from within the UK.
>   - Does `[]` indexing, `np.where` or `masked_arrays` make the most sense for this task?
> - Calculate the mean of the masked data
> - Do the same with Ireland and Spain and compare the numbers
> - Do you see any problems with this relatively simple comparison of the mean rainfall in the three countries?


In [25]:
# Write your exercise code here.


[<small>answer</small>](../solutions/filtering_weather.ipynb)

## [[Previous: Multi-dimensional arrays](./03-multi-dimensional_arrays.ipynb)] | [[Next: Combining arrays](./05-combining_arrays.ipynb)]