# Boolean indexing

Let's start with boolean indexing on a numpy array. What does this mean?

Well, let's say that you have a set of numbers and you want to identify all those whose value is larger than 4. You could write a `for` loop (which is often quite slow to run in Python), or you could use "boolean indexing". This goes something like:

In [1]:
import numpy as np
# Here's a for loop showing what we're doing
some_numbers = [1,10,2,20,3,30,4,40,5,50,6,60]
large_nums = []

for num in some_numbers:
    if num > 4:
        large_nums.append(num)

# But we could also do this using boolean indexing:

some_numbers = np.array(some_numbers)
large_nums = some_numbers[some_numbers > 4]
print(large_nums)


[10 20 30 40  5 50  6 60]


Let's try to break down what this is doing. Boolean indexing does not work on plain Python lists, so if you have a list, first convert it to a numpy array by doing `some_numbers = np.array(some_numbers)`.

Then we do `some_numbers > 4`. This creates a boolean array (an array whose entries are either `True` or `False`).

We then use this boolean array to index `some_numbers`. This extracts every entry for which the boolean array is `True`. The boolean array must have the same length as the array being indexed.

So for example

`np.array([1,4,9,16])[np.array([True,False,False,True])]`

will return the first and fourth entries of this array (1 and 16) in a new array.

In [2]:
np.array([1,4,9,16])[np.array([True,False,False,True])]

array([ 1, 16])
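As a small sketch of the two points above (a comparison produces a boolean array, and that array must match the length of the array it indexes), the following can be run on its own. The names `values`, `mask` and `bad_mask` are made up for this illustration, and the behaviour on a length mismatch assumes a reasonably recent version of numpy.

```python
import numpy as np

values = np.array([1, 4, 9, 16])
mask = values > 5              # the comparison is applied element by element

print(mask)                    # boolean array with the same shape as `values`
print(mask.dtype)              # the dtype is bool
print(values[mask])            # only the entries where the mask is True

# A mask of the wrong length is rejected rather than silently truncated.
bad_mask = np.array([True, False])
try:
    values[bad_mask]
    mismatch_rejected = False
except IndexError as err:
    mismatch_rejected = True
    print("IndexError:", err)
```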

We can split this into two lines if we also want to store the Boolean array.

In [3]:
some_numbers = np.array([1,10,2,20,3,30,4,40,5,50,6,60])
logic_array = some_numbers > 4
print(logic_array)
large_nums = some_numbers[logic_array]
print(large_nums)

[False  True False  True False  True False  True  True  True  True  True]
[10 20 30 40  5 50  6 60]


We can also create a boolean array first and set its values later:

In [4]:
# Who is teaching each lecture in this course?
ian = np.array([0, 5, 16, 17, 19, 20, 21, 23])
laura = np.array([1, 3, 4, 7, 8, 11, 13, 18, 22])
gareth = np.array([2, 6, 9, 10, 12, 14, 15])

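# dtype=bool makes this a genuine boolean array (np.zeros defaults to floats)
bool_ian = np.zeros(24, dtype=bool)
bool_ian[ian] = True

bool_ian

array([ True, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False,  True,  True,
       False,  True,  True,  True, False,  True])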
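Once each lecturer has a mask like this, the masks can be combined with the element-wise operators `|` (or), `&` (and) and `~` (not). The sketch below assumes each mask is created with `dtype=bool`; the helper `make_mask` and the name `n_lectures` are hypothetical and exist only for this example.

```python
import numpy as np

n_lectures = 24
ian = np.array([0, 5, 16, 17, 19, 20, 21, 23])
laura = np.array([1, 3, 4, 7, 8, 11, 13, 18, 22])
gareth = np.array([2, 6, 9, 10, 12, 14, 15])

def make_mask(indices, size):
    """Hypothetical helper: boolean array that is True at the given positions."""
    mask = np.zeros(size, dtype=bool)   # start with every entry False
    mask[indices] = True
    return mask

bool_ian = make_mask(ian, n_lectures)
bool_laura = make_mask(laura, n_lectures)
bool_gareth = make_mask(gareth, n_lectures)

# Is every lecture covered by at least one lecturer?
covered = bool_ian | bool_laura | bool_gareth
print(covered.all())

# Is any lecture assigned to both Ian and Laura?
print((bool_ian & bool_laura).any())

# np.flatnonzero turns a mask back into integer positions.
print(np.flatnonzero(bool_ian))     # lectures taught by Ian
print(np.flatnonzero(~bool_ian))    # lectures not taught by Ian
```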

## EXERCISE

Let's practice this.

* Use boolean indexing to extract all values in some_numbers where the number is smaller than 15
* Use boolean indexing to extract all values in some_numbers where $\mathrm{num}^2 - 15$ is larger than 100

In [5]:
some_numbers = np.array([1,10,2,20,3,30,4,40,5,50,6,60])

# COMPLETE BELOW

print(some_numbers[some_numbers < 15])
print(some_numbers[(some_numbers **  2 - 15) > 100])



[ 1 10  2  3  4  5  6]
[20 30 40 50 60]
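For a rough sense of how boolean indexing compares with an explicit loop, the sketch below runs both on a larger array and times them with `timeit`. The array size, the threshold, the seed and the function names `with_loop` and `with_mask` are arbitrary choices for this illustration, and the timings will vary from machine to machine, so treat the numbers as indicative only.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)                    # fixed seed, arbitrary choice
big_array = rng.integers(0, 100, size=100_000)    # integers from 0 to 99

def with_loop(values, threshold):
    """Select entries above the threshold one at a time."""
    kept = []
    for value in values:
        if value > threshold:
            kept.append(value)
    return np.array(kept)

def with_mask(values, threshold):
    """Select entries above the threshold with boolean indexing."""
    return values[values > threshold]

# Both approaches should select the same entries in the same order.
same_result = np.array_equal(with_loop(big_array, 50), with_mask(big_array, 50))
print("Same result:", same_result)

loop_time = timeit.timeit(lambda: with_loop(big_array, 50), number=5)
mask_time = timeit.timeit(lambda: with_mask(big_array, 50), number=5)
print(f"loop: {loop_time:.4f} s, mask: {mask_time:.4f} s")
```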


How is this better than using a `for` loop? It is significantly faster, so if you are dealing with very large arrays, you might want this. But another huge advantage to this comes when you're dealing with multiple arrays of related data. Let's illustrate this using a pandas DataFrame.

In [6]:
import numpy as np
import pandas as pd
def func_makedata(a):
    x1 = a**2
    y1 = np.cos(a)
    z1 = 3*a**2 
    return x1, y1, z1

aa = np.linspace(0.,10.,50)
x1, y1, z1 = func_makedata(aa)

data_dict = {'aa': aa, 'x1': x1, 'y1': y1, 'z1': z1}
pd_dataframe = pd.DataFrame(data_dict)
pd_dataframe

|   | aa | x1 | y1 | z1 |
|---|----------|----------|-----------|-----------|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.204082 | 0.041649 | 0.979248 | 0.124948 |
| 2 | 0.408163 | 0.166597 | 0.917851 | 0.499792 |
| 3 | 0.612245 | 0.374844 | 0.81836 | 1.124531 |
| 4 | 0.816327 | 0.666389 | 0.684902 | 1.999167 |
| 5 | 1.020408 | 1.041233 | 0.523018 | 3.123698 |
| 6 | 1.22449 | 1.499375 | 0.339426 | 4.498126 |
| 7 | 1.428571 | 2.040816 | 0.141746 | 6.122449 |
| 8 | 1.632653 | 2.665556 | -0.061817 | 7.996668 |
| 9 | 1.836735 | 3.373594 | -0.262815 | 10.120783 |


This data structure contains 4 arrays of data, but these are all related to each other (in particular x1, y1 and z1 are all functions of the first array, aa). So how could we extract all values of x1 for which y1 is larger than 0.5? This could be done with a `for` loop, but the following approach is *significantly* faster and makes the code much more compact.

In [7]:
# Create the boolean array
large_y1_logic = pd_dataframe['y1'] > 0.5
# Access corresponding values of x1
x1_values = pd_dataframe['x1'][large_y1_logic]
# Compare this against the table above: pandas also retains the index labels!
print(x1_values)

# Similarly we can quickly obtain the corresponding values of aa or z1
aa_values = pd_dataframe['aa'][large_y1_logic]
z1_values = pd_dataframe['z1'][large_y1_logic]

0      0.000000
1      0.041649
2      0.166597
3      0.374844
4      0.666389
5      1.041233
26    28.154935
27    30.362349
28    32.653061
29    35.027072
30    37.484382
31    40.024990
32    42.648896
33    45.356102
34    48.146606
35    51.020408
Name: x1, dtype: float64
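The same boolean Series can also be applied to the whole DataFrame at once rather than one column at a time. The sketch below rebuilds the same data so that it runs on its own; `selected_rows` and `selected_cols` are names made up for this example.

```python
import numpy as np
import pandas as pd

# Rebuild the same data as above so this block is self-contained.
aa = np.linspace(0., 10., 50)
pd_dataframe = pd.DataFrame({'aa': aa, 'x1': aa**2, 'y1': np.cos(aa), 'z1': 3*aa**2})

large_y1_logic = pd_dataframe['y1'] > 0.5

# Indexing the DataFrame with the mask keeps every column of the matching rows.
selected_rows = pd_dataframe[large_y1_logic]

# .loc takes the mask for the rows and a list of labels for the columns.
selected_cols = pd_dataframe.loc[large_y1_logic, ['aa', 'x1', 'z1']]

print(selected_rows.shape)
print(selected_cols.head())
```

Using `.loc` with a mask and a list of column labels avoids repeating the same indexing step for `aa`, `x1` and `z1` separately.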


With these tools you **never need to use Excel again**. You now have access to a much more powerful, faster, and more flexible method to slice and dice data and to graphically display the output however you want! We'll learn more about this below.

## EXERCISES

Let's try a few more exercises on our `pd_dataframe` dataset:

* Extract all values of `z1` for which the corresponding value of $\mathrm{x1} - 15$ is larger than 20.
* Extract all values of `x1` for which the corresponding value of $\mathrm{aa} * (\mathrm{x1} - 50) * \mathrm{y1}$ is larger than 20

In [8]:
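# aa, x1, y1 and z1 are the numpy arrays used to build pd_dataframe,
# so indexing them directly gives the same values as indexing its columns.
print(z1[x1 - 15 > 20])
print(x1[aa * (x1-50) * y1 > 20])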

[105.08121616 112.45314452 120.07496876 127.94668888 136.06830487
 144.43981674 153.06122449 161.93252811 171.05372761 180.42482299
 190.04581424 199.91670137 210.03748438 220.40816327 231.02873803
 241.89920866 253.01957518 264.38983757 276.00999584 287.88004998
 300.        ]
[ 3.37359434  4.16493128  5.03956685  5.99750104  7.03873386  8.16326531
  9.37109538 10.66222407 12.0366514  13.49437734 15.03540192 16.65972511
 18.36734694 20.15826739]


Let's add one more layer of complexity to this. What if we want to access all values of `z1` for which $\mathrm{y1} > 0.5$ *and* $\mathrm{x1} < 10$? One could take a nested approach (which might actually be the most computationally efficient way of doing this, but isn't the most elegant):

In [9]:
y1_logic = pd_dataframe['y1'] > 0.5
reduced_x1 = pd_dataframe['x1'][y1_logic]
reduced_z1 = pd_dataframe['z1'][y1_logic]

final_z1_values = reduced_z1[reduced_x1 < 10]

print(final_z1_values)

0    0.000000
1    0.124948
2    0.499792
3    1.124531
4    1.999167
5    3.123698
Name: z1, dtype: float64


However, we can combine this logic together into one array. Numpy supplies a `logical_and` function for this, but the simplest way to construct a single logic array satisfying two conditions is:

In [10]:
# WARNING: Note the brackets here! `&` is evaluated before `>`, so make sure you don't forget them!
y1_and_x1_logic = (pd_dataframe['y1'] > 0.5) & (pd_dataframe['x1'] < 10)
final_z1_values = pd_dataframe['z1'][y1_and_x1_logic]
print(final_z1_values)

0    0.000000
1    0.124948
2    0.499792
3    1.124531
4    1.999167
5    3.123698
Name: z1, dtype: float64
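As a quick check that `&` and `np.logical_and` agree, and to show the companion operators `|` (or) and `~` (not), here is a minimal sketch. It rebuilds the same data so that it runs on its own; the names `df`, `cond_y1`, `cond_x1` and so on are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the same data as above so this block is self-contained.
aa = np.linspace(0., 10., 50)
df = pd.DataFrame({'aa': aa, 'x1': aa**2, 'y1': np.cos(aa), 'z1': 3*aa**2})

cond_y1 = df['y1'] > 0.5
cond_x1 = df['x1'] < 10

both = cond_y1 & cond_x1                    # True where both conditions hold
both_np = np.logical_and(cond_y1, cond_x1)  # the same thing via numpy
either = cond_y1 | cond_x1                  # True where at least one holds
not_y1 = ~cond_y1                           # True where y1 <= 0.5

print((both == both_np).all())              # do the two forms of "and" agree?
print(both.sum(), either.sum(), not_y1.sum())   # number of True entries in each
```

Calling `.sum()` on a boolean array or Series counts its `True` entries, which is a convenient way to see how many rows a condition selects.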


## EXERCISE

* Find all values of `aa` for which `abs(x1 - 50)` is bigger than 20, `arccos(y1) > 45 degrees` and `z1**0.5 > 4`

In [11]:
all_logic = (abs(x1-50) > 20) & (np.arccos(y1) > np.pi / 4) & (z1**0.5 > 4)

print(aa[all_logic])

[ 2.44897959  2.65306122  2.85714286  3.06122449  3.26530612  3.46938776
  3.67346939  3.87755102  4.08163265  4.28571429  4.48979592  4.69387755
  4.89795918  5.10204082  5.30612245  8.36734694  8.57142857  8.7755102
  8.97959184  9.18367347  9.3877551   9.59183673  9.79591837 10.        ]
