In [2]:
import numpy as np

# Slicing and filtering arrays

## $ \S 1 $ Slicing $ 1D $ arrays

NumPy arrays can be sliced in a similar way to lists, by using the `:`
operator. More precisely, in the case of $ 1D $ arrays, the syntax is
`<array>[<start>:<stop>:<step>]`, where the element at index `<stop>` is
excluded and each of the three parts may be omitted.
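
As a quick illustration of this syntax, here is a minimal sketch on a small array `x` (a name chosen only for this example, so that it does not interfere with the exercise below):

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50, 60])

first_three = x[:3]    # start omitted: begins at index 0
middle = x[1:5:2]      # indices 1 and 3 (index 5 is excluded)
last_two = x[-2:]      # negative indices count from the end
reversed_x = x[::-1]   # a negative step walks backward

print(first_three)     # [10 20 30]
print(middle)          # [20 40]
print(last_two)        # [50 60]
print(reversed_x)      # [60 50 40 30 20 10]
```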

__Exercise:__ Let $ \mathbf a = (0, 1, \cdots, 10) $.

(a) Instantiate this array using `arange`, or using `linspace` followed by `a.astype(int)` (note that `astype` returns a new array rather than modifying `a` in place).

(b) Take a slice of $ \mathbf a $ resulting in $ (0, 1, \cdots, 4) $.

(c) Slice $ \mathbf a $ to obtain the array $ (5, 6, \cdots, 10) $.

(d) Construct a slice to retrieve the subarray $ (3, 5, 7, 9) $.

(e) Take a full slice of $ \mathbf a $, call it $ \mathbf b $, and modify its
$ 0 $-th element. Is $ \mathbf a $ affected?

📝 _Slicing an array only creates a view on the original array, not a copy_.
This behavior is by design, for efficiency reasons. However, this means that
modifying the data in the slice will also modify the original array. We will
discuss later how to create independent copies. (Note in contrast that slices of
Python lists _are_ independent from their originals.)
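
The difference between views and copies can be checked directly. The sketch below uses `np.shares_memory` to test whether two arrays overlap in memory, and the `copy` method as one possible way of obtaining an independent array; the variable names are chosen only for this illustration.

```python
import numpy as np

a = np.arange(5)

view = a[1:4]                  # a slice: shares its data with a
view[0] = 100                  # this also changes a[1]

independent = a[1:4].copy()    # an explicit copy owns its data
independent[1] = -1            # a is unaffected

lst = list(range(5))
lst_slice = lst[1:4]           # slicing a list builds a new list
lst_slice[0] = 100             # lst is unaffected

print(a.tolist())                        # [0, 100, 2, 3, 4]
print(np.shares_memory(a, view))         # True
print(np.shares_memory(a, independent))  # False
print(lst)                               # [0, 1, 2, 3, 4]
```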

We can also extract multiple elements of an array by listing their specific indices as follows:

In [None]:
b = np.arange(-9, 10, 2)  # Odd numbers between -9 and 9
print(b)
print(b[[0, 2, 5]])  # Elements at positions 0, 2, and 5


[-9 -7 -5 -3 -1  1  3  5  7  9]
[-9 -5  1]
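
Unlike a slice, indexing with a list of positions returns a new array rather than a view. The following sketch, with the illustrative name `picked`, also shows that the indices may appear in any order and may be repeated:

```python
import numpy as np

b = np.arange(-9, 10, 2)

picked = b[[5, 0, 0]]   # any order, repetitions allowed
print(picked)           # [ 1 -9 -9]

picked[0] = 999         # modifies only the new array
print(b[5])             # 1
print(np.shares_memory(b, picked))  # False
```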


## $ \S 2 $ Slicing general arrays

Slicing becomes more interesting with higher-dimensional arrays. For a $ 2D $
array, in principle we need to specify slices for both dimensions, separated by
a comma. If we use a single slice, then we are indexing into full rows.

In [None]:
M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
print(M, '\n')

# Accessing a specific row:
print(M[1, :], '\n')  # Row of index 1 (the second row)

# Accessing a specific column:
print(M[:, 2], '\n')  # Column of index 2 (the third column)

# Sub-array slicing:
print(M[0:2, 1:3], '\n')  # Top right 2x2 sub-array

# If we use only one slice, then entire rows are extracted:
print(M[0:2])  # First 2 rows


[[1 2 3]
 [4 5 6]
 [7 8 9]] 

[4 5 6] 

[3 6 9] 

[[2 3]
 [5 6]] 

[[1 2 3]
 [4 5 6]]


📝 If instead of using a slice we specify a single index for some dimension,
then that dimension "collapses". In particular, the resulting array will have a
smaller number of dimensions. This is illustrated in the following example:

In [None]:
M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
print(M, '\n')

# Extracting row 1 through a double slice:
print(M[1:2, :], '\n')  # The result is still a 2D array

# Extracting row 1 by indexing into it:
print(M[1, :], '\n')  # The result is a 1D array

[[1 2 3]
 [4 5 6]
 [7 8 9]] 

[[4 5 6]] 

[4 5 6] 
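
One way to confirm this "collapse" is to inspect the `shape` and `ndim` attributes of the results, as in the following sketch:

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

row_2d = M[1:2, :]   # slice along both dimensions
row_1d = M[1, :]     # single index along the first dimension
entry = M[1, 2]      # single index along both dimensions

print(row_2d.shape, row_2d.ndim)  # (1, 3) 2
print(row_1d.shape, row_1d.ndim)  # (3,) 1
print(np.ndim(entry))             # 0: both dimensions have collapsed
```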



Finally, just as for $ 1D $ arrays, instead of using a slice, we can also
specify a list of the indices that we want to index into, along any dimension.
This allows us to work with more complicated sets of indices.

In [None]:
# A 3x5 array whose entries are uniformly random integers between 0 and 9 (inclusive):
R = np.random.randint(0, 10, (3, 5)) 
print(R, '\n')

# Extracting its columns of indices 0, 1 and 3:
S = R[:, [0, 1, 3]]
print(S)

[[3 0 1 6 8]
 [8 4 8 3 8]
 [3 4 3 8 7]] 

[[3 0 6]
 [8 4 3]
 [3 4 8]]


__Exercise:__ Create a $ 2D $ array of shape $ (4, 4) $ filled with your
favorite integers and print it for reference.

(a) Extract the last row to produce a $ 1D $ array.

(b) Extract the column of index $ 3 $, as a $ 2D $ array.

(c) Extract a $ 2\times 2 $ sub-array from the center of this array.

(d) Extract the $ 3 \times 2 $ lower-left corner of the array.

(e) Extract the subarray consisting of columns indexed by $ 1 $ and $ 3 $ in two
different ways: by specifying these columns explicitly, and by using a slice
with step size $ 2 $.

📝 All of the principles we have seen extend to arrays of higher dimensions. You
can specify one `<start>:<stop>:<step>` slice per dimension, separating them by
commas.
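
For instance, here is a minimal sketch with a $ 3D $ array `T` (a hypothetical example array, built with `arange` and `reshape`) using one slice or index per dimension:

```python
import numpy as np

# A 3D array of shape (2, 3, 4): 2 blocks, each with 3 rows and 4 columns
T = np.arange(24).reshape(2, 3, 4)

sub = T[:, 1:3, ::2]   # both blocks, rows 1 and 2, columns 0 and 2
print(sub.shape)       # (2, 2, 2)

plane = T[0]           # a single index collapses the first dimension
print(plane.shape)     # (3, 4)

col = T[1, :, 3]       # block 1, all rows, column 3
print(col)             # [15 19 23]
```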

## $ \S 3 $ Boolean indexing

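__Boolean indexing__ is a powerful feature that allows us to select elements
from an array using an array of boolean values of the same shape (or, as we
will see below, one matching only the first dimension, to select entire rows).
The elements kept are those at positions where the boolean array is `True`.
This method is particularly useful for filtering elements based on some
condition. Here's an example: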

In [8]:
v = np.array([1, 2, 3, 4, 5])
print("original array:", v)

# Create a boolean index array:
filter = v > 2
print("filter: ", filter)

# Use boolean indexing to select elements:
selected_elements = v[filter]
print("selected elements:", selected_elements)

original array: [1 2 3 4 5]
filter:  [False False  True  True  True]
selected elements: [3 4 5]


In NumPy, and Python in general, the boolean value `True` is treated as
equivalent to $ 1 $ in numerical contexts, and `False` is equivalent to
$ 0 $. Thus, in the preceding example, we can find the number of elements
greater than $ 2 $ by taking the sum of the entries in the filter
(we will return to the `np.sum` function later):

In [9]:
print("# of elements greater than two:", np.sum(filter))

# of elements greater than two: 3
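
The same idea gives the proportion of elements satisfying a condition: since `True` counts as $ 1 $, the mean of a boolean array is the fraction of `True` entries. The sketch below also uses `np.count_nonzero` as an alternative way of counting; the name `mask` is chosen only for this illustration.

```python
import numpy as np

v = np.array([1, 2, 3, 4, 5])
mask = v > 2

count = np.count_nonzero(mask)   # number of True entries
fraction = np.mean(mask)         # fraction of True entries

print(count)      # 3
print(fraction)   # 0.6
```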


Here's a more interesting example. Consider a workplace where employees log
their daily hours, and suppose we want to retrieve the instances where
one of the employees worked overtime. The information is stored in two arrays:
* A $ 1D $ array `names` with one entry per employee per week (four employees
  over two weeks, hence eight entries).
* A $ 2D $ array `hours` whose $ i $-th row holds the hours worked by the employee
  `names[i]` during the corresponding week, with one column per working day.

In [10]:
# Names of employees:
names = np.array(["Alice", "Bob", "Carol", "Dave", "Alice", "Carol", "Dave", "Bob"])

# Array of hours worked each day for two weeks:
hours = np.array([
    [9, 8, 10, 8, 7],  # Alice, week 1
    [7, 8, 9, 8, 6],   # Bob, week 1
    [10, 7, 12, 8, 9], # Carol, week 1
    [6, 5, 7, 6, 8],   # Dave, week 1
    [8, 9, 7, 10, 8],  # Alice, week 2
    [8, 9, 10, 8, 10], # Carol, week 2
    [8, 9, 7, 8, 7],   # Dave, week 2
    [7, 8, 6, 5, 9]    # Bob, week 2
])

We can extract all of the hours worked by Carol during these two weeks by:
* Creating a boolean filter `names == "Carol"`.
* Using this to index into `hours`, more precisely its rows.

In [None]:
# Filter to select rows for a specific employee, Carol:
mask = (names == "Carol")
hours_Carol = hours[mask]
print(hours_Carol)

[[10  7 12  8  9]
 [ 8  9 10  8 10]]


Now we can use another filter to obtain the daily hours for which Carol worked
overtime, here taken to mean more than $ 8 $ hours in a day. Note that the
result is a $ 1D $ array:

In [None]:
overtime_Carol = hours_Carol[hours_Carol > 8]
print(overtime_Carol)

[10 12  9  9 10 10]
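
Because the result is flattened to $ 1D $, the information about when the overtime occurred is lost. One possible way to recover it, sketched below, is `np.nonzero`, which returns the row and column indices of the `True` entries of a boolean array. The names `week_idx` and `day_idx` are chosen only for this illustration.

```python
import numpy as np

# Carol's hours, as extracted above (rows: weeks, columns: days)
hours_Carol = np.array([[10, 7, 12, 8, 9],
                        [8, 9, 10, 8, 10]])

# Row (week) and column (day) indices where the condition holds
week_idx, day_idx = np.nonzero(hours_Carol > 8)

print(week_idx)   # [0 0 0 1 1 1]
print(day_idx)    # [0 2 4 1 2 4]

# Indexing with both index arrays pairs them up entry by entry
print(hours_Carol[week_idx, day_idx])   # [10 12  9  9 10 10]
```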


📝 We can craft more complex filters by using the usual boolean operators
__and__, __or__ and __negation__. However, their Python versions `and`, `or` and
`not` do not work with boolean arrays. Instead, we should use `&`, `|` and `~`,
respectively. The __exclusive or__ (__xor__) operator is denoted by `^`.
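
Here is a short sketch of these operators on a small array `v` (unrelated to the exercise data). Note the parentheses around each comparison: `&`, `|` and `^` bind more tightly than comparison operators such as `>` and `<`, so omitting them changes the meaning of the expression.

```python
import numpy as np

v = np.arange(10)

between = v[(v > 2) & (v < 7)]            # and
outside = v[(v <= 2) | (v >= 7)]          # or
odd = v[~(v % 2 == 0)]                    # negation
exactly_one = v[(v > 2) ^ (v % 2 == 0)]   # xor: exactly one condition holds

print(between)      # [3 4 5 6]
print(outside)      # [0 1 2 7 8 9]
print(odd)          # [1 3 5 7 9]
print(exactly_one)  # [0 2 3 5 7 9]

# The Python keyword `and` fails on arrays with more than one element:
try:
    v[(v > 2) and (v < 7)]
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```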

__Exercise:__ Referring to the preceding example, extract:

(a) The overtime hours worked by Alice or by Bob.

(b) The numbers of hours between $ 7 $ and $ 9 $ worked by anyone except Dave.

__Exercise:__ You're provided below with a dataset `rainfall_data` of daily
rainfall measurements (in millimeters) in a city for four months (June to
September), stored as a $ 2D $ NumPy array.  Each row represents a month, and each
column represents a day.

(a) Compute the average rainfall and the standard deviation of the sample
across the entire period using the functions `np.mean` and `np.std`.

(b) Identify days with above-average rainfall for each month, and calculate the
percentage of such days in each month. _Hint:_ Use a for loop to iterate over
each month/row of the dataset. Use a Boolean filter to find which days
had greater than average rainfall and take the sum of the `True` values
as discussed above to compute the percentages. 

In [None]:
np.random.seed(123456789)

# Sample mean and standard deviation for each month (in mm):
mean_rainfall = [1.5, 2.7, 2.7, 0.9]  # Average daily rainfall (mm) for June, July, August, September
std_deviation = [1.5, 2, 2.5, 1.5]  # Variability in daily rainfall (mm)

# Generating the 2D array of rainfall data:
months = ["June", "July", "August", "September"]
days_in_month = 30
rainfall_data = np.zeros((4, days_in_month))

for i in range(4):
    # Simulate daily rainfall using a normal distribution:
    rainfall_data[i, :] = np.clip(np.random.normal(mean_rainfall[i],
                                  std_deviation[i], days_in_month), 0, None)
# Round to 2 decimal digits:
rainfall_data = np.round(rainfall_data, 2)

In [None]:
print(rainfall_data)

[[4.82 4.69 4.26 1.62 2.79 0.26 3.24 3.56 2.91 2.77 2.29 0.65 2.75 1.85
  0.23 0.   1.55 2.07 0.3  0.25 0.   1.78 1.42 3.09 2.37 1.02 1.1  1.98
  3.56 1.19]
 [3.25 3.03 1.83 2.89 0.   3.86 9.27 3.12 1.66 6.75 3.34 1.62 1.77 3.44
  2.05 4.97 0.   0.19 2.66 4.4  1.84 2.44 0.76 0.   1.29 1.43 2.95 4.09
  4.81 3.46]
 [3.49 1.97 4.84 1.42 0.52 3.14 1.3  1.48 1.29 0.71 4.79 1.14 1.37 0.43
  6.15 2.31 1.46 4.01 3.85 0.56 0.65 4.01 4.39 1.43 2.33 6.08 7.46 3.95
  7.37 2.54]
 [0.09 3.55 0.37 0.   1.54 1.46 0.61 0.   1.08 2.33 0.19 0.   0.   1.42
  0.   2.57 1.65 2.28 0.48 0.48 4.24 0.48 0.   1.08 1.62 0.   2.26 2.73
  0.   1.73]]
