<img src="../../../images/banners/data_processing.png" width="600"/>

# <img src="../../../images/logos/python.png" width="23"/> Data Science Operations: Indexing, Filter, Order, Aggregate 


## <img src="../../../images/logos/toc.png" width="20"/> Table of Contents 
* [Indexing](#indexing)
    * [Slicing and striding](#slicing_and_striding)
    * [Dimensional indexing tools](#dimensional_indexing_tools)
    * [Advanced indexing](#advanced_indexing)
        * [Integer array indexing](#integer_array_indexing)
        * [Boolean array indexing](#boolean_array_indexing)

---

In [5]:
import numpy as np

In this section, you’ll work through some examples of real, useful data science operations: filtering, sorting, and aggregating data.

<a class="anchor" id="indexing"></a>


## Indexing

ndarrays can be indexed using the standard Python `x[obj]` syntax, where `x` is the array and `obj` the selection. There are different kinds of indexing available depending on `obj`:

- Basic indexing
- Advanced indexing
- Field access.

Here’s how this differs from indexing nested Python lists: NumPy arrays use commas between axes, so you can index multiple axes in one set of square brackets.

> Note that in Python, `x[(exp1, exp2, ..., expN)]` is equivalent to `x[exp1, exp2, ..., expN]`; the latter is just syntactic sugar for the former. This will come in handy later in Advanced Indexing.
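
As a quick check of the note above, here is a minimal sketch. The array `x` and the variable names are made up for illustration only.

```python
import numpy as np

# Hypothetical 2-D array, used only for illustration
x = np.arange(12).reshape(3, 4)

# One set of brackets with commas between the axes...
single = x[1, 2]

# ...is the same as indexing with an explicit tuple
index = (1, 2)
from_tuple = x[index]

# A nested Python list needs one pair of brackets per level
nested = x.tolist()
from_list = nested[1][2]

print(single, from_tuple, from_list)  # 6 6 6
```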

<a class="anchor" id="slicing_and_striding"></a>

### Slicing and striding

Basic slicing extends Python’s basic concept of slicing to N dimensions. Basic slicing occurs when `obj` is a [slice](https://docs.python.org/3/library/functions.html#slice) object (constructed by `start:stop:step` notation inside of brackets), an integer, or a tuple of slice objects and integers. [Ellipsis](https://docs.python.org/3/library/constants.html#Ellipsis) and [newaxis](https://numpy.org/doc/stable/reference/constants.html#numpy.newaxis) objects can be interspersed with these as well.

Arrays generated by basic slicing are always views of the original array.

> NumPy slicing creates a [view](https://numpy.org/doc/stable/glossary.html#term-view) instead of a copy as in the case of built-in Python sequences such as string, tuple and list. Care must be taken when extracting a small portion from a large array which becomes useless after the extraction, because the small portion extracted contains a reference to the large original array whose memory will not be released until all arrays derived from it are garbage-collected. In such cases an explicit `copy()` is recommended.
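
The view-versus-copy behavior can be sketched as follows. The array names (`big`, `view`, `independent`) are illustrative assumptions, not part of the original material.

```python
import numpy as np

big = np.arange(1_000_000)

view = big[:3]                # basic slicing: a view onto big's memory
independent = big[:3].copy()  # explicit copy: owns its own memory

print(np.shares_memory(big, view))         # True
print(np.shares_memory(big, independent))  # False

# Writing through the view changes the original array
view[0] = -1
print(big[0])          # -1
print(independent[0])  # 0, the copy is unaffected

# The view keeps a reference to the array it was sliced from,
# which is why the large array cannot be freed while the view lives
print(view.base is big)          # True
print(independent.base is None)  # True
```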

The standard rules of sequence slicing apply to basic slicing on a per-dimension basis (including using a step index). Some useful concepts to remember include:

- The basic slice syntax is `i:j:k` where `i` is the starting index, `j` is the stopping index, and `k` is the step.

In [28]:
a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
a[1:7:2]

array([1, 3, 5])

- An integer, `i`, returns the same values as `i:i+1` except the dimensionality of the returned object is reduced by 1. In particular, a selection tuple with the p-th element an integer (and all other entries `:`) returns the corresponding sub-array with dimension `N - 1`. If `N = 1`, then the returned object is an array scalar. These objects are explained in the NumPy reference documentation on scalars.

In [24]:
a = np.random.randint(10, size=(3, 4))

In [25]:
a[0].shape

(4,)

In [26]:
a[0:1].shape

(1, 4)

- You may use slicing to set values in the array, but (unlike lists) you can never grow the array. The value to be set in `x[obj] = value` must be broadcastable to the shape of `x[obj]`.

In [29]:
l = [1, 2, 3]

In [30]:
l[0:1] = [4, 5, 6]

In [32]:
l

[4, 5, 6, 2, 3]

In [33]:
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [34]:
# np.array([2, 3]) has shape (2, ) and 
# is not broadcastable to (5, )
a[:5] = np.array([2, 3])

ValueError: could not broadcast input array from shape (2,) into shape (5,)

In [38]:
np.array([2, 3]).shape, a[:5].shape

((2,), (5,))

In [39]:
# 8 is broadcastable!
a[:5] = 8

In [40]:
a

array([8, 8, 8, 8, 8, 5, 6, 7, 8, 9])

- A slicing tuple can always be constructed as `obj` and used in the `x[obj]` notation. Slice objects can be used in the construction in place of the `start:stop:step` notation. For example, `x[1:10:5, ::-1]` can also be implemented as `obj = (slice(1, 10, 5), slice(None, None, -1)); x[obj]`. This can be useful for constructing generic code that works on arrays of arbitrary dimensions. See [Dealing with variable numbers of indices within programs](https://numpy.org/doc/stable/user/basics.indexing.html#dealing-with-variable-indices) for more information.
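
To make the last point concrete, here is a minimal sketch of dimension-agnostic code built from slice objects. The helper `take_every_other` and the arrays `x` and `y` are hypothetical names introduced only for this example.

```python
import numpy as np

def take_every_other(x, axis):
    """Return every second element of x along one axis (hypothetical helper)."""
    # Start with a full slice (":") for every dimension...
    index = [slice(None)] * x.ndim
    # ...then replace the entry for the chosen axis with "::2"
    index[axis] = slice(None, None, 2)
    # Index with a tuple so that NumPy treats this as basic slicing
    return x[tuple(index)]

x = np.arange(24).reshape(2, 3, 4)
print(take_every_other(x, axis=2).shape)  # (2, 3, 2)
print(np.array_equal(take_every_other(x, axis=1), x[:, ::2, :]))  # True

# The explicit tuple from the text gives the same result as the bracket syntax
y = np.arange(60).reshape(12, 5)
obj = (slice(1, 10, 5), slice(None, None, -1))
print(np.array_equal(y[1:10:5, ::-1], y[obj]))  # True
```

Because the index is assembled at run time, the same helper works for arrays with any number of dimensions.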

<a class="anchor" id="dimensional_indexing_tools"></a>

### Dimensional indexing tools

There are some tools to facilitate the easy matching of array shapes with expressions and in assignments.

[Ellipsis](https://docs.python.org/3/library/constants.html#Ellipsis) expands to the number of `:` objects needed for the selection tuple to index all dimensions. In most cases, this means that the length of the expanded selection tuple is `x.ndim`. There may only be a single ellipsis present. For example:

In [54]:
a = np.random.randint(10, size=(2, 3, 4, 5))

In [55]:
a.ndim

4

In [58]:
b = a[:, :, :, 3]

In [59]:
b.shape

(2, 3, 4)

In [60]:
c = a[..., 3]

In [61]:
c.shape

(2, 3, 4)

In [62]:
(c == b).all()

True

In [65]:
a[Ellipsis, 3]

array([[[4, 4, 3, 9],
        [8, 6, 3, 1],
        [5, 7, 2, 8]],

       [[0, 8, 2, 5],
        [5, 1, 0, 3],
        [7, 9, 9, 0]]])

In [69]:
(a[...] == a).all()

True

How `Ellipsis` is interpreted is entirely up to whichever object implements `__getitem__` and receives it, but its main (and intended) use is in NumPy, a third-party library that adds a multidimensional array type. Because an array can have more than one dimension, slicing becomes more complex than a single start and stop index; it is useful to be able to slice along several dimensions at once.

Extending this further, `Ellipsis` acts as a placeholder for all the array dimensions that you don't specify explicitly. Think of it as a full slice `:` for every dimension in the gap where it is placed. For a 3-D array, `a[..., 0]` is the same as `a[:, :, 0]`, and for a 4-D array it is the same as `a[:, :, :, 0]`. Similarly, `a[0, ..., 0]` is `a[0, :, :, 0]` for a 4-D array (with however many colons in the middle are needed to make up the full number of dimensions in the array).
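
The following sketch illustrates both points. `ShowIndex` is a toy class invented for this example (it is not part of NumPy), and `a3` and `a4` are arbitrary arrays.

```python
import numpy as np

class ShowIndex:
    """Toy class (not part of NumPy) that returns whatever index it receives."""
    def __getitem__(self, index):
        return index

probe = ShowIndex()
# Python simply hands the Ellipsis object to __getitem__
print(probe[..., 0])     # (Ellipsis, 0)
print(probe[0, ..., 0])  # (0, Ellipsis, 0)

a3 = np.arange(24).reshape(2, 3, 4)
a4 = np.arange(120).reshape(2, 3, 4, 5)

# NumPy expands the ellipsis to as many full slices as needed
print(np.array_equal(a3[..., 0], a3[:, :, 0]))        # True
print(np.array_equal(a4[..., 0], a4[:, :, :, 0]))     # True
print(np.array_equal(a4[0, ..., 0], a4[0, :, :, 0]))  # True
```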

In [66]:
...

Ellipsis

> Just in case you're curious: it's also used in the standard-library typing module: e.g. `Callable[..., int]` to indicate a callable that returns an `int` without specifying the signature, or `Tuple[str, ...]` to indicate a variable-length homogeneous tuple of strings.

Each [`newaxis`](https://numpy.org/doc/stable/reference/constants.html#numpy.newaxis) object in the selection tuple serves to expand the dimensions of the resulting selection by one unit-length dimension. The added dimension is the position of the `newaxis` object in the selection tuple. `newaxis` is an alias for `None`, and `None` can be used in place of this with the same result.

In [71]:
a[None, ..., np.newaxis].shape

(1, 2, 3, 4, 5, 1)

This can be handy to combine two arrays in a way that otherwise would require explicit reshaping operations. For example:

In [72]:
a = np.arange(5)

In [73]:
a

array([0, 1, 2, 3, 4])

In [77]:
a[:, np.newaxis] + a[np.newaxis, :]

array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6],
       [3, 4, 5, 6, 7],
       [4, 5, 6, 7, 8]])

The easiest way to show off indexing and slicing is with an example. It’s time to confirm [Dürer’s magic square](https://en.wikipedia.org/wiki/Magic_square#Albrecht_D%C3%BCrer's_magic_square)!

The number square below has some amazing properties. If you add up any of the rows, columns, or diagonals, then you’ll get the same number, 34. That’s also what you’ll get if you add up each of the four quadrants, the center four squares, the four corner squares, or the four corner squares of any of the contained 3 × 3 grids. You’re going to prove it!

> **Fun fact**: In the bottom row, the numbers 15 and 14 are in the middle, representing the year that Dürer created this square. The numbers 1 and 4 are also in that row, representing the first and fourth letters of the alphabet, A and D, which are the initials of the square’s creator, Albrecht Dürer!

In [1]:
import numpy as np

square = np.array([
    [16,  3,  2, 13],
    [ 5, 10, 11,  8],
    [ 9,  6,  7, 12],
    [ 4, 15, 14,  1],
])

In [2]:
mylist = [
    [16, 3, 2, 13],
    [5, 10, 11, 8],
    [9, 6, 7, 12],
    [4, 15, 14, 1]
]

In [4]:
square.shape

(4, 4)

In [5]:
square[:, 0]

array([16,  5,  9,  4])

In [6]:
for i in range(4):
    assert square[:, i].sum() == 34
    assert square[i, :].sum() == 34

In [7]:
assert square[:2, :2].sum() == 34

In [8]:
assert square[2:, :2].sum() == 34

In [9]:
assert square[:2, 2:].sum() == 34

In [10]:
assert square[2:, 2:].sum() == 34

Inside the for loop, you verify that all the rows and all the columns add up to 34. After that, using selective indexing, you verify that each of the quadrants also adds up to 34.

One last thing to note is that you’re able to take the sum of any array to add up all of its elements globally with `square.sum()`. This method can also take an axis argument to do an axis-wise summing instead.

In [11]:
square.sum()

136
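
The remaining properties of the magic square can be checked in the same spirit. This is a sketch that repeats the `square` definition so it runs on its own; the variable names and the use of `np.trace()` and `np.fliplr()` for the diagonals are choices made for this example.

```python
import numpy as np

square = np.array([
    [16,  3,  2, 13],
    [ 5, 10, 11,  8],
    [ 9,  6,  7, 12],
    [ 4, 15, 14,  1],
])

# Axis-wise sums: axis=0 collapses the rows, axis=1 collapses the columns
column_sums = square.sum(axis=0)
row_sums = square.sum(axis=1)
print(column_sums)  # [34 34 34 34]
print(row_sums)     # [34 34 34 34]

# Both diagonals
main_diagonal = np.trace(square)             # 16 + 10 + 7 + 1
anti_diagonal = np.trace(np.fliplr(square))  # 13 + 11 + 6 + 4

# Center four squares and the four corners, using slicing and striding
center = square[1:3, 1:3].sum()   # 10 + 11 + 6 + 7
corners = square[::3, ::3].sum()  # 16 + 13 + 4 + 1

# Corners of each contained 3 x 3 grid: a stride of 2 inside a 3-wide window
grid_corners = [
    square[r:r + 3:2, c:c + 3:2].sum()
    for r in range(2)
    for c in range(2)
]
print(main_diagonal, anti_diagonal, center, corners, grid_corners)
```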

<a class="anchor" id="advanced_indexing"></a>

### Advanced indexing

Advanced indexing is triggered when the selection object, `obj`, is a non-tuple sequence object, an `ndarray` (of data type integer or bool), or a tuple with at least one sequence object or `ndarray` (of data type integer or bool). There are two types of advanced indexing:

- Integer
- Boolean.

> **Note:** Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).

> **Note:** The definition of advanced indexing means that `x[(1, 2, 3),]` is fundamentally different from `x[(1, 2, 3)]`. The latter is equivalent to `x[1, 2, 3]`, which will trigger basic selection, while the former will trigger advanced indexing. Be sure to understand why this occurs.
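
A small sketch makes the difference in the note visible. The 3-D array `x` is an arbitrary example chosen so that the indices 1, 2, and 3 are valid along every axis.

```python
import numpy as np

x = np.arange(64).reshape(4, 4, 4)  # hypothetical 3-D array

# x[(1, 2, 3)] is x[1, 2, 3]: one integer per axis, so basic indexing
single = x[(1, 2, 3)]
print(single)  # 27

# x[(1, 2, 3),] is a one-element tuple holding a sequence, so the
# sequence selects rows 1, 2 and 3 along axis 0: advanced indexing
picked = x[(1, 2, 3),]
print(picked.shape)  # (3, 4, 4)

# It behaves like indexing with a list along the first axis
print(np.array_equal(picked, x[[1, 2, 3]]))  # True
```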

<a class="anchor" id="integer_array_indexing"></a>

#### Integer array indexing

Integer array indexing allows selection of arbitrary items in the array based on their N-dimensional index. Each integer array represents a number of indices into that dimension.

Negative values are permitted in the index arrays and work as they do with single indices or slices:

In [85]:
a = np.arange(10, 1, -1)

In [86]:
a.shape

(9,)

In [95]:
b = a[np.array([3, 3, 1, 8])]

In [97]:
# Advanced Indexing results in a copy, not a view
np.may_share_memory(a, b)

False

In [91]:
a[[3, 3, 1, 8]]

array([7, 7, 9, 2])

In [93]:
# ((3, 3, 1, 8), ) is a tuple with one sequence object --> Advanced Indexing
a[(3, 3, 1, 8), ] 

array([7, 7, 9, 2])

In [88]:
# Negative values work as before
a[np.array([3, 3, -3, 8])]

array([7, 7, 4, 2])

In [89]:
# Note that this will raise error as a is 1 dimensional
a[3, 3, 1, 8]  # (3, 3, 1, 8) is a tuple -> Basic indexing

IndexError: too many indices for array: array is 1-dimensional, but 4 were indexed
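
With more than one dimension, you can pass one index array per axis, and NumPy pairs them up element by element. The following is a minimal sketch with a made-up `grid` array; the names `rows`, `cols`, and `picked` are illustrative.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

rows = np.array([0, 1, 2])
cols = np.array([0, 1, -1])

# The index arrays are paired element by element:
# (0, 0), (1, 1) and (2, -1)
picked = grid[rows, cols]
print(picked)  # [ 0  5 11]

# The result is a copy, so changing it leaves grid untouched
picked[0] = 99
print(grid[0, 0])  # 0
```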

<a class="anchor" id="boolean_array_indexing"></a>

#### Boolean array indexing

This advanced indexing occurs when `obj` is an array object of Boolean type, such as may be returned from comparison operators.

If `obj.ndim == x.ndim`, `x[obj]` returns a 1-dimensional array filled with the elements of `x` corresponding to the `True` values of `obj`. The search order will be row-major, C-style. In current NumPy releases, an `IndexError` is raised if the shape of `obj` does not match the corresponding dimensions of `x`, regardless of whether the mismatched entries are `True` or `False`. Older releases instead treated an `obj` that was smaller than `x` as if it were padded with `False`.

In [138]:
a = np.random.randint(10, size=(3, 4))

In [139]:
a

array([[9, 3, 0, 6],
       [4, 2, 3, 0],
       [7, 4, 7, 8]])

In [140]:
a[:, [True, False, True, False]]

array([[9, 0],
       [4, 3],
       [7, 7]])

In [141]:
a[[True, False, True], :]

array([[9, 3, 0, 6],
       [7, 4, 7, 8]])
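
The shape rules for Boolean indices can be sketched as follows. The arrays are made up for illustration, and the error behavior shown is what recent NumPy releases do as far as I know; very old releases may behave differently.

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])

mask = np.array([[True, False, True],
                 [False, True, False]])

# Same number of dimensions: the result is 1-D, in row-major order
print(x[mask])  # [1 3 5]

# A mask whose length doesn't match the indexed axis is rejected
short_mask = np.array([True])
try:
    x[short_mask]
    mismatch_error = None
except IndexError as exc:
    mismatch_error = exc
    print("IndexError:", exc)
```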

A common use case for this is filtering for desired element values. For example, one may wish to select all entries from an array which are not NaN:

In [116]:
x = np.array([[1., 2.], [np.nan, 3.], [np.nan, np.nan]])

In [123]:
np.isnan(x).nonzero()

(array([1, 2, 2]), array([0, 0, 1]))

In [124]:
x[~np.isnan(x)]

array([1., 2., 3.])

In [127]:
x[(~np.isnan(x)).nonzero()]

array([1., 2., 3.])

In [131]:
(~np.isnan(x)).nonzero()

(array([0, 0, 1]), array([0, 1, 1]))

Or you may wish to add a constant to all negative elements:

In [145]:
x = np.array([1., -1., -2., 3])
x[x < 0] += 20

In [146]:
x

array([ 1., 19., 18.,  3.])

In general if an index includes a Boolean array, the result will be identical to inserting `obj.nonzero()` into the same position and using the integer array indexing mechanism described above. `x[ind_1, boolean_array, ind_2]` is equivalent to `x[(ind_1,) + boolean_array.nonzero() + (ind_2,)]`.
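
The equivalence above can be checked directly. This is a small sketch; the array `x` and the concrete values of `ind_1`, `boolean_array`, and `ind_2` are chosen arbitrarily.

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)

ind_1 = 1
boolean_array = np.array([True, False, True])  # matches axis 1 (length 3)
ind_2 = 2

direct = x[ind_1, boolean_array, ind_2]
via_nonzero = x[(ind_1,) + boolean_array.nonzero() + (ind_2,)]

print(boolean_array.nonzero())  # (array([0, 2]),)
print(direct)                   # [14 22]
print(np.array_equal(direct, via_nonzero))  # True
```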

This is where the concept of a **mask** comes into play.

A mask is an array that has the exact same shape as your data, but instead of your values, it holds Boolean values: either `True` or `False`. You can use this mask array to index into your data array in nonlinear and complex ways. It will return all of the elements where the Boolean array has a True value.

Here’s an example showing the process, first in slow motion and then how it’s typically done, all in one line:

In [119]:
numbers = np.linspace(5, 50, 24, dtype=int).reshape(4, -1)

numbers

array([[ 5,  6,  8, 10, 12, 14],
       [16, 18, 20, 22, 24, 26],
       [28, 30, 32, 34, 36, 38],
       [40, 42, 44, 46, 48, 50]])

In [120]:
mask = numbers % 4 == 0

mask

array([[False, False,  True, False,  True, False],
       [ True, False,  True, False,  True, False],
       [ True, False,  True, False,  True, False],
       [ True, False,  True, False,  True, False]])

In [122]:
numbers[mask]

array([ 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48])

In [104]:
np.may_share_memory(numbers[mask], numbers)

False

In [21]:
# how it's typically done
by_four = numbers[numbers % 4 == 0]

by_four

array([ 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48])

You’ll see an explanation of the new array creation tricks in the first input (`In [119]`) in a moment, but for now, focus on the meat of the example. These are the important parts:

- **`mask = numbers % 4 == 0`** creates the mask by performing a **vectorized Boolean computation**, taking each element and checking to see if it divides evenly by four. This returns a mask array of the same shape with the element-wise results of the computation.
- **`numbers[mask]`** uses this mask to index into the original numbers array. This causes the array to lose its original shape, reducing it to one dimension, but you still get the data you’re looking for.
- **`by_four = numbers[numbers % 4 == 0]`** provides a more traditional, idiomatic masked selection that you might see in the wild, with an anonymous filtering array created inline, inside the selection brackets. This syntax is similar to usage in the R programming language.

Coming back to `numbers = np.linspace(5, 50, 24, dtype=int).reshape(4, -1)`, you encounter three new concepts:

- Using [`np.linspace()`](https://realpython.com/np-linspace-numpy/) to generate an evenly spaced array
- Setting the `dtype` of an output
- Reshaping an array with `-1`

[`np.linspace()`](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html) generates `n` numbers evenly distributed between a minimum and a maximum, which is useful for evenly distributed sampling in scientific plotting.

Because of the particular calculation in this example, it makes life easier to have integers in the `numbers` array. But because the range from 5 to 50 doesn’t split into 23 equal whole-number steps (24 points means 23 intervals), the resulting numbers would be floating-point numbers. You specify a `dtype` of `int` to force the function to round down and give you whole integers. You’ll see a more detailed discussion of data types later on.

Finally, `array.reshape()` can take `-1` as one of its dimension sizes. That signifies that NumPy should just figure out how big that particular axis needs to be based on the size of the other axes. In this case, with 24 values and a size of `4` in axis 0, axis 1 ends up with a size of `6`.
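
Here is a short sketch of these three ideas side by side. The variable names `as_float` and `as_int` are illustrative.

```python
import numpy as np

# Without dtype, np.linspace() returns floating-point values
as_float = np.linspace(5, 50, 24)
print(as_float.dtype)  # float64
print(as_float[:3])    # 5.0 followed by non-integer values

# With dtype=int, the fractional parts are dropped
as_int = np.linspace(5, 50, 24, dtype=int)
print(as_int[:3])  # [5 6 8]

# -1 lets NumPy infer the remaining axis from the total size: 24 / 4 = 6
print(as_int.reshape(4, -1).shape)  # (4, 6)
print(as_int.reshape(-1, 8).shape)  # (3, 8)
```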

Here’s one more example to show off the power of masked filtering. The [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) is a probability distribution in which roughly 95.45% of values occur within two standard deviations of the mean.

You can verify that with a little help from NumPy’s random module for generating random values:

In [22]:
import numpy as np

from numpy.random import default_rng

rng = default_rng()

values = rng.standard_normal(10000)

values[:5]

array([-1.39451419, -2.23846695,  0.05942294, -1.13963306,  1.81834453])

In [23]:
std = values.std()

std

1.0114056607166995

In [24]:
filtered = values[(values > -2 * std) & (values < 2 * std)]

filtered.size

9545

In [25]:
values.size

10000

In [26]:
filtered.size / values.size

0.9545

Here you use a potentially strange-looking syntax to combine filter conditions: the **binary `&` operator**. Why would that be the case? It’s because NumPy designates `&` and `|` as the vectorized, element-wise operators to combine Booleans. If you try `A and B` instead, then you’ll get a `ValueError` saying that the truth value of an array with more than one element is ambiguous, because `and` operates on the truth value of the whole array, not element by element. The parentheses around each comparison are required because `&` binds more tightly than `<` and `>`.
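
The difference between `&` and `and` can be sketched on a tiny array. The data below is made up for illustration.

```python
import numpy as np

values = np.array([-3.0, -1.0, 0.5, 2.5])

# Element-wise AND of two Boolean arrays
inside = (values > -2) & (values < 2)
print(inside)  # [False  True  True False]

# Element-wise OR works the same way
outside = (values <= -2) | (values >= 2)
print(values[outside])  # [-3.   2.5]

# Python's `and` needs a single truth value for the whole array, which fails
try:
    (values > -2) and (values < 2)
    and_error = None
except ValueError as exc:
    and_error = exc
    print("ValueError:", exc)
```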