# Data Science Operations: Filter, Order, Aggregate

That wraps up a section that was heavy in theory but a little light on practical, real-world examples. In this section, you’ll work through some examples of real, useful data science operations: filtering, sorting, and aggregating data.

## Indexing

Indexing uses many of the same idioms that normal Python code uses. You can use positive or negative indices to index from the front or back of the array. You can use a colon (`:`) to specify “the rest” or “all,” and you can even use two colons to skip elements as with regular Python lists.
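As a quick illustration of those shared idioms, here’s a minimal sketch on a one-dimensional array. The name `digits` and the other variable names are made up for this example:

```python
import numpy as np

# Hypothetical one-dimensional array, used only for illustration
digits = np.arange(10)  # the integers 0 through 9

first = digits[0]          # a positive index counts from the front
last = digits[-1]          # a negative index counts from the back
rest = digits[7:]          # a colon means "the rest": elements 7, 8, 9
every_third = digits[::3]  # two colons add a step: elements 0, 3, 6, 9
```

Each of these expressions would work the same way on a plain Python list.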

Here’s the difference: NumPy arrays use commas between axes, so you can index multiple axes in one set of square brackets. An example is the easiest way to show this off. It’s time to confirm [Dürer’s magic square](https://en.wikipedia.org/wiki/Magic_square#Albrecht_D%C3%BCrer's_magic_square)!

The number square below has some amazing properties. If you add up any of the rows, columns, or diagonals, then you’ll get the same number, 34. That’s also what you’ll get if you add up each of the four quadrants, the center four squares, the four corner squares, or the four corner squares of any of the contained 3 × 3 grids. You’re going to prove it!

> **Fun fact**: In the bottom row, the numbers 15 and 14 are in the middle, representing the year that Dürer created this square. The numbers 1 and 4 are also in that row, representing the first and fourth letters of the alphabet, A and D, which are the initials of the square’s creator, Albrecht Dürer!

In [187]:
import numpy as np

square = np.array([
    [16, 3, 2, 13],
    [5, 10, 11, 8],
    [9, 6, 7, 12],
    [4, 15, 14, 1]
])

In [191]:
mylist = [
    [16, 3, 2, 13],
    [5, 10, 11, 8],
    [9, 6, 7, 12],
    [4, 15, 14, 1]
]

In [188]:
square.shape

(4, 4)

In [201]:
square[:, 0]

array([16,  5,  9,  4])

In [202]:
for i in range(4):
    assert square[:, i].sum() == 34
    assert square[i, :].sum() == 34

In [203]:
assert square[:2, :2].sum() == 34

In [204]:
assert square[2:, :2].sum() == 34

In [205]:
assert square[:2, 2:].sum() == 34

In [206]:
assert square[2:, 2:].sum() == 34

Inside the for loop, you verify that all the rows and all the columns add up to 34. After that, using selective indexing, you verify that each of the quadrants also adds up to 34.
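The remaining properties listed earlier (diagonals, center squares, and corners) can be checked the same way. The following is one possible sketch, not the only way to write it. It repeats the `square` definition so it runs on its own, and it uses `np.trace()` and `np.fliplr()`, which don’t appear elsewhere in this section:

```python
import numpy as np

square = np.array([
    [16, 3, 2, 13],
    [5, 10, 11, 8],
    [9, 6, 7, 12],
    [4, 15, 14, 1]
])

# Both diagonals: np.trace() sums the main diagonal, and flipping the
# array left to right turns the anti-diagonal into the main diagonal
assert np.trace(square) == 34
assert np.trace(np.fliplr(square)) == 34

# The center four squares
assert square[1:3, 1:3].sum() == 34

# The four corners: a step of 3 picks rows and columns 0 and 3
assert square[::3, ::3].sum() == 34

# The corners of each contained 3 x 3 grid: a step of 2 inside a
# three-element window picks its first and last row and column
for r in (0, 1):
    for c in (0, 1):
        assert square[r:r + 3:2, c:c + 3:2].sum() == 34
```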

One last thing to note is that you can add up all the elements of any array globally with `.sum()`, as in `square.sum()`. This method can also take an `axis` argument to sum along a single axis instead.

In [23]:
square.sum()

136
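To illustrate the `axis` argument, here’s a short sketch that repeats the `square` definition so it runs on its own. The names `column_sums` and `row_sums` are chosen just for this example:

```python
import numpy as np

square = np.array([
    [16, 3, 2, 13],
    [5, 10, 11, 8],
    [9, 6, 7, 12],
    [4, 15, 14, 1]
])

# axis=0 collapses the rows, leaving one total per column
column_sums = square.sum(axis=0)

# axis=1 collapses the columns, leaving one total per row
row_sums = square.sum(axis=1)
```

Because this is a magic square, both results hold four copies of 34.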

## Masking and Filtering

Index-based selection is great, but what if you want to filter your data based on more complicated nonuniform or nonsequential criteria? This is where the concept of a **mask** comes into play.

A mask is an array that has the exact same shape as your data, but instead of your values, it holds Boolean values: either `True` or `False`. You can use this mask array to index into your data array in nonlinear and complex ways. Indexing with it returns all of the elements where the mask holds `True`.

Here’s an example showing the process, first in slow motion and then how it’s typically done, all in one line:

In [227]:
# slow motion mode
import numpy as np

numbers = np.linspace(5, 50, 24, dtype=int).reshape(4, -1)

numbers

array([[ 5,  6,  8, 10, 12, 14],
       [16, 18, 20, 22, 24, 26],
       [28, 30, 32, 34, 36, 38],
       [40, 42, 44, 46, 48, 50]])

In [25]:
mask = numbers % 4 == 0

mask

array([[False, False,  True, False,  True, False],
       [ True, False,  True, False,  True, False],
       [ True, False,  True, False,  True, False],
       [ True, False,  True, False,  True, False]])

In [236]:
numbers[mask]

array([ 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48])

In [241]:
# how it's typically done
by_four = numbers[numbers % 4 == 0]

by_four

array([ 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48])

You’ll see an explanation of the new array creation tricks in the first input in a moment, but for now, focus on the meat of the example. These are the important parts:

- **`mask = numbers % 4 == 0`** creates the mask by performing a **vectorized Boolean computation**, taking each element and checking to see if it divides evenly by four. This returns a mask array of the same shape with the element-wise results of the computation.
- **`numbers[mask]`** uses this mask to index into the original numbers array. This causes the array to lose its original shape, reducing it to one dimension, but you still get the data you’re looking for.
- **`by_four = numbers[numbers % 4 == 0]`** provides a more traditional, idiomatic masked selection that you might see in the wild, with an anonymous filtering array created inline, inside the selection brackets. This syntax is similar to usage in the R programming language.
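A mask is useful beyond selection, too. As a hedged sketch that goes slightly past what the example above covers, you can count matches by summing the mask, and you can keep the original shape with `np.where()`. The names `match_count` and `kept_in_place` are made up for this illustration:

```python
import numpy as np

numbers = np.linspace(5, 50, 24, dtype=int).reshape(4, -1)
mask = numbers % 4 == 0

# True counts as 1 and False as 0, so summing the mask counts the matches
match_count = mask.sum()

# np.where() keeps the shape: matching values stay put and the
# others are replaced with 0 (an arbitrary filler for this sketch)
kept_in_place = np.where(mask, numbers, 0)
```

Here `kept_in_place` still has the shape `(4, 6)`, unlike `numbers[mask]`, which is flattened to one dimension.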

Coming back to `numbers = np.linspace(5, 50, 24, dtype=int).reshape(4, -1)`, you encounter three new concepts:

- Using [`np.linspace()`](https://realpython.com/np-linspace-numpy/) to generate an evenly spaced array
- Setting the `dtype` of an output
- Reshaping an array with `-1`

[`np.linspace()`](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html) generates n numbers evenly distributed between a minimum and a maximum, which is useful for evenly distributed sampling in scientific plotting.

Because of the particular calculation in this example, it makes life easier to have integers in the `numbers` array. But because the range from 5 to 50 doesn’t divide evenly into the 23 steps between 24 values, the resulting numbers would be floating-point numbers. You specify a `dtype` of `int` to force the function to round down and give you whole integers. You’ll see a more detailed discussion of data types later on.
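Here’s a small sketch of both ideas. The variable names are assumptions made for this illustration, and the `0` to `1` range is just a convenient example:

```python
import numpy as np

# Five evenly spaced values from 0 to 1, with both endpoints included
samples = np.linspace(0, 1, 5)

# Without dtype=int, the values between 5 and 50 come out as floats
as_float = np.linspace(5, 50, 24)

# With dtype=int, the fractional parts are dropped
as_int = np.linspace(5, 50, 24, dtype=int)
```

The second value of `as_float` is just under 7, which is why the second value of `as_int` is 6 rather than 7.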

Finally, `array.reshape()` can take `-1` as one of its dimension sizes. That signifies that NumPy should just figure out how big that particular axis needs to be based on the size of the other axes. In this case, with 24 values and a size of `4` in axis 0, axis 1 ends up with a size of `6`.
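The `-1` can stand in for any one axis, as long as the total number of elements divides evenly. The sketch below uses a made-up array named `flat` to show this, along with the error you get when the sizes don’t fit:

```python
import numpy as np

flat = np.arange(24)  # 24 values, as in the example above

four_rows = flat.reshape(4, -1)   # NumPy infers 6 columns
three_cols = flat.reshape(-1, 3)  # NumPy infers 8 rows

# 24 values can't be split into 5 equal rows, so this raises ValueError
try:
    flat.reshape(5, -1)
except ValueError:
    reshape_failed = True
else:
    reshape_failed = False
```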

Here’s one more example to show off the power of masked filtering. The [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) is a probability distribution in which roughly 95.45% of values occur within two standard deviations of the mean.

You can verify that with a little help from NumPy’s random module for generating random values:

In [28]:
import numpy as np

from numpy.random import default_rng

rng = default_rng()

values = rng.standard_normal(10000)

values[:5]

array([ 0.2354462 ,  0.8596593 ,  1.95996126,  1.21886792, -1.057614  ])

In [29]:
std = values.std()

std

1.0075881741494412

In [30]:
filtered = values[(values > -2 * std) & (values < 2 * std)]

filtered.size

9537

In [31]:
values.size

10000

In [32]:
filtered.size / values.size

0.9537

Here you use a potentially strange-looking syntax to combine filter conditions: a **binary `&` operator**. Why would that be the case? It’s because NumPy designates `&` and `|` as the vectorized, element-wise operators to combine Booleans. If you try `A and B` instead, then you’ll get a `ValueError` saying that the truth value of an array with more than one element is ambiguous, because `and` operates on the truth value of the whole array, not element by element.
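The following sketch shows the difference on a tiny, made-up array. The values and variable names are chosen only for illustration:

```python
import numpy as np

# Hypothetical sample values
values = np.array([-3.0, -1.0, 0.5, 2.5])

# Parentheses are required because & binds more tightly than < and >
in_range = (values > -2) & (values < 2)
out_of_range = (values <= -2) | (values >= 2)

# Python's `and` asks for a single truth value for each whole array,
# which NumPy refuses to guess for arrays with more than one element
try:
    (values > -2) and (values < 2)
except ValueError as error:
    message = str(error)
```

Here `in_range` marks only the two middle values, and `out_of_range` marks only the first and last.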

## Transposing, Sorting, and Concatenating

Other manipulations, while not quite as common as indexing or filtering, can also be very handy depending on the situation you’re in. You’ll see a few examples in this section.

Here’s **transposing** an array:

In [267]:
import numpy as np

a = np.array([
    [1, 2],
    [3, 4],
    [5, 6],
])

In [272]:
a.T

array([[1, 3, 5],
       [2, 4, 6]])

In [273]:
a.transpose()

array([[1, 3, 5],
       [2, 4, 6]])

When you calculate the transpose of an array, the row and column indices of every element are switched. The item at `[2, 0]` in `a`, for example, ends up at `[0, 2]` in the transpose. You can also use `a.T` as an alias for `a.transpose()`.
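A few assertions make the index swap concrete. This sketch repeats the definition of `a` so it runs on its own:

```python
import numpy as np

a = np.array([
    [1, 2],
    [3, 4],
    [5, 6],
])

# The shape flips from 3 rows by 2 columns to 2 rows by 3 columns
assert a.shape == (3, 2)
assert a.T.shape == (2, 3)

# The element in row 2, column 0 moves to row 0, column 2
assert a[2, 0] == a.T[0, 2] == 5

# Both spellings give the same result
assert np.array_equal(a.T, a.transpose())
```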

The following code block shows sorting, but you’ll also see a more powerful sorting technique in the coming section on structured data:

In [275]:
import numpy as np

data = np.array([
    [7, 1, 4],
    [8, 6, 5],
    [1, 2, 3]
])

np.sort(data)

array([[1, 4, 7],
       [5, 6, 8],
       [1, 2, 3]])

In [276]:
np.sort(data, axis=None)

array([1, 1, 2, 3, 4, 5, 6, 7, 8])

In [277]:
np.sort(data, axis=0)

array([[1, 1, 3],
       [7, 2, 4],
       [8, 6, 5]])

Omitting the `axis` argument automatically selects the last and innermost dimension, which is the rows in this example. Using `None` flattens the array and performs a global sort. Otherwise, you can specify which axis you want. In the output of `np.sort(data, axis=0)`, each column of the array still has all of its elements, but they have been sorted low-to-high inside that column.
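One detail the examples above don’t show is that `np.sort()` returns a sorted copy, while the array’s own `.sort()` method sorts in place. Here’s a minimal sketch of the difference, where `original` and `sorted_rows` are names made up for this illustration:

```python
import numpy as np

data = np.array([
    [7, 1, 4],
    [8, 6, 5],
    [1, 2, 3]
])
original = data.copy()

# np.sort() returns a new array and leaves data untouched
sorted_rows = np.sort(data)
assert np.array_equal(data, original)

# The .sort() method rearranges data itself, here column by column
data.sort(axis=0)
assert not np.array_equal(data, original)
```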

Finally, here’s an example of **concatenation**. While there’s a `np.concatenate()` function, there are also a number of helper functions that are sometimes easier to read.

Here are some examples:

In [279]:
import numpy as np

a = np.array([
    [4, 8],
    [6, 1]
])

b = np.array([
    [3, 5],
    [7, 2],
])


In [281]:
a

array([[4, 8],
       [6, 1]])

In [282]:
b

array([[3, 5],
       [7, 2]])

In [283]:
np.hstack((a, b))

array([[4, 8, 3, 5],
       [6, 1, 7, 2]])

In [284]:
np.vstack((b, a))

array([[3, 5],
       [7, 2],
       [4, 8],
       [6, 1]])

In [285]:
np.concatenate((a, b))

array([[4, 8],
       [6, 1],
       [3, 5],
       [7, 2]])

In [286]:
np.concatenate((a, b), axis=0)

array([[4, 8],
       [6, 1],
       [3, 5],
       [7, 2]])

In [287]:
np.concatenate((a, b), axis=1)

array([[4, 8, 3, 5],
       [6, 1, 7, 2]])

In [288]:
np.concatenate((a, b), axis=None)

array([4, 8, 6, 1, 3, 5, 7, 2])

`np.hstack((a, b))` and `np.vstack((b, a))` show the slightly more intuitive functions `hstack()` and `vstack()`. `np.concatenate((a, b))` and `np.concatenate((a, b), axis=None)` show the more generic `concatenate()`, first without an `axis` argument and then with `axis=None`. This flattening behavior is similar in form to what you just saw with `sort()`.

One important stumbling block to note is that all these functions take a tuple of arrays as their first argument rather than a variable number of arguments as you might expect. You can tell because there’s an extra pair of parentheses.
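Here’s a short sketch of that stumbling block. It repeats the definitions of `a` and `b` so it runs on its own, and the other names are made up for this illustration:

```python
import numpy as np

a = np.array([
    [4, 8],
    [6, 1]
])

b = np.array([
    [3, 5],
    [7, 2],
])

# One argument: a tuple (or list) holding the arrays to join
from_tuple = np.hstack((a, b))
from_list = np.hstack([a, b])

# Passing the arrays as separate arguments raises TypeError
try:
    np.hstack(a, b)
except TypeError:
    separate_args_failed = True
else:
    separate_args_failed = False
```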

## Aggregating

Your last stop on this tour of functionality before diving into some more advanced topics and examples is **aggregation**. You’ve already seen quite a few aggregating methods, including `.sum()`, `.max()`, `.mean()`, and `.std()`. You can reference NumPy’s larger library of [functions](https://numpy.org/doc/stable/reference/routines.html) to see more. Many of the mathematical and statistical functions use aggregation to help you reduce the number of dimensions in your data.
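As a brief sketch of how aggregation reduces dimensions, the example below reuses the small `data` array from the sorting example. The result names are made up for this illustration:

```python
import numpy as np

data = np.array([
    [7, 1, 4],
    [8, 6, 5],
    [1, 2, 3]
])

# With no axis, the whole array collapses to a single value
overall_max = data.max()

# With an axis, that one dimension is collapsed and the rest remain
column_max = data.max(axis=0)  # one maximum per column
row_mean = data.mean(axis=1)   # one mean per row
```

The two-dimensional `data` becomes a one-dimensional array when you aggregate along one axis, and a single scalar when you aggregate over everything.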