# Agenda: Mask indexes

1. Comparisons and broadcasting
2. Boolean series -- what can we do with it?
3. Combining these ideas into a "mask index"
4. Complex comparisons with "and" and "or" (or, *not* with them)

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
s = Series([10, 20, 30, 40, 50, 60],
           index=list('abcdef'))
s

a    10
b    20
c    30
d    40
e    50
f    60
dtype: int64

In [4]:
# I can retrieve an element of s with either .loc or .iloc
# .loc means that we want to use the index

s.loc['b']

20

In [5]:
s.loc['e']

50

In [6]:
# .iloc goes based on the position, starting with 0
s.iloc[3]

40

In [7]:
s.iloc[5]

60

In [8]:
# Inside of the [], I can put a list of indexes/positions that I want
# This is known as "fancy indexing"
s.loc[['b', 'e']]

b    20
e    50
dtype: int64

In [9]:
s.iloc[[3, 5]]

d    40
f    60
dtype: int64

# Using booleans

We can retrieve elements in a completely different way, too -- we can pass a list (or series)
of boolean values (True/False). This list/series needs to be the same length as the one against
which we're running it. Wherever there is a True value, we get the value from the original series. Wherever
there is a False value, it's ignored.

In [10]:
s.loc[ [True, False, True, False, True, False] ]

a    10
c    30
e    50
dtype: int64

In [11]:
s.loc[ [False, False, True, True, False, True]]

c    30
d    40
f    60
dtype: int64

Boolean indexing is a great technique. But you can't seriously imagine typing True and False a large number of times to get the values that you want. That's OK -- we're going to automate that. We're going to ask Pandas to generate such a series of boolean values, and then we'll use that series to get the ones that we want.

This is known as "mask indexing," or also as "boolean indexing."

# The idea is:

1. Pass, inside of `[]`, a list/series of booleans
2. Wherever there is a True value, we get the original value
3. Wherever there is a False value, the original is ignored.

You cannot end up with a larger series than the original, and the idea is to get a smaller one.

How can we create this sort of boolean series automatically?

We'll use Python's comparison operators.

In [12]:
# remember broadcasting?

s

a    10
b    20
c    30
d    40
e    50
f    60
dtype: int64

In [13]:
s + 3   # the + operator broadcasts its functionality with 3 to each element of s -- we get a new series back

a    13
b    23
c    33
d    43
e    53
f    63
dtype: int64

In [14]:
s ** 2

a     100
b     400
c     900
d    1600
e    2500
f    3600
dtype: int64

In [15]:
s - 8.2

a     1.8
b    11.8
c    21.8
d    31.8
e    41.8
f    51.8
dtype: float64

In [16]:
# what happens if we broadcast a comparison operator?

s > 30

a    False
b    False
c    False
d     True
e     True
f     True
dtype: bool

In [19]:
# this is how a mask index looks:
# typically (but not always!) the same variable will be inside of the [] and also outside of it
# the inside one is generating a boolean series
# the outside one is applying that boolean series, so that we can filter the values

# read this as: First generate a boolean series, where s>30. Then apply that series to s.

s.loc[ s>30 ]

d    40
e    50
f    60
dtype: int64

In [20]:
s.mean()

35.0

In [21]:
s > s.mean()

a    False
b    False
c    False
d     True
e     True
f     True
dtype: bool

In [23]:
# show me all elements of s greater than s's mean

s.loc[ s>s.mean()]

d    40
e    50
f    60
dtype: int64

In [24]:
z = Series([True, False, True, False, True, False])
s.loc[z]

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
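
Why did that fail? The boolean series `z` got the default index (0 through 5), while `s` is indexed with the letters a through f. When the mask is a series, Pandas lines it up with `s` by index, not by position. Here is a minimal sketch of two ways around it, using the same `s` as above; the names `by_index`, `flags`, and `by_position` are just for illustration.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50, 60],
           index=list('abcdef'))

# Option 1: give the boolean series the same index as s
z = Series([True, False, True, False, True, False],
           index=s.index)
by_index = s.loc[z]

# Option 2: pass a plain list, which has no index to align
flags = [True, False, True, False, True, False]
by_position = s.loc[flags]

print(by_index)
print(by_position)
```

In practice this rarely comes up, because a mask built from a comparison on `s` (such as `s > 30`) already carries the index of `s`.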

# Exercise: Find extreme temperatures

1. Create two series, `highs` and `lows`, containing the forecast high and low temps for the coming 10 days.
2. Find all high temps greater than the mean.
3. Are any high temps greater than the mean + 1 standard deviation?
4. Calculate the difference in temp between highs and lows.
5. Show all days in which the difference was greater than the median difference.

In [29]:
highs = Series([26, 24, 23, 23, 27, 28, 26, 22])
lows = Series([14, 14, 16, 13, 13, 15, 17, 12])

In [30]:
# high temps greater than the mean

highs.loc[highs > highs.mean()]

0    26
4    27
5    28
6    26
dtype: int64

In [32]:
# are any greater than mean + 1 std?

highs.loc[highs > highs.mean() + highs.std()]

5    28
dtype: int64

In [34]:
diffs = highs - lows
diffs

0    12
1    10
2     7
3    10
4    14
5    13
6     9
7    10
dtype: int64

In [36]:
diffs.loc[diffs > diffs.median()]

0    12
4    14
5    13
dtype: int64

In [37]:
# Find high temps on the days when the diff is > median

highs.loc[ diffs > diffs.median()]

0    26
4    27
5    28
dtype: int64

# Mask index across series

We can create a boolean series based on one query, on one series, and apply it to another.

But, as I keep saying, the length and index must match 100%. If they don't, you'll get errors.
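
As a small sketch of the same idea, here we build the boolean series from `lows` and apply it to `highs`. The data is the same as in the exercise; the names `cold_nights` and `highs_on_cold_days` are made up for this example.

```python
from pandas import Series

highs = Series([26, 24, 23, 23, 27, 28, 26, 22])
lows = Series([14, 14, 16, 13, 13, 15, 17, 12])

# The boolean series is generated from lows ...
cold_nights = lows < lows.mean()

# ... and applied to highs: high temps on days with a below-average low
highs_on_cold_days = highs.loc[cold_nights]
print(highs_on_cold_days)
```

This works because both series share the default index, 0 through 7.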

# Combine conditionals

In Python, we're used to being able to say "and" and "or" on our conditions. That way, we can build up more sophisticated conditions and decision making.

Can we do that in Pandas? 

In [39]:
np.random.seed(0)   # this makes it consistent across platforms
s = Series(np.random.randint(0, 1000, 10),
           index=list('abcdefghij'))
s

a    684
b    559
c    629
d    192
e    835
f    763
g    707
h    359
i      9
j    723
dtype: int64

In [40]:
# let's find all elements of s that are odd

s%2  # find the remainder after dividing by 2 -- evens will have a 0, and odds will have a 1

a    0
b    1
c    1
d    0
e    1
f    1
g    1
h    1
i    1
j    1
dtype: int64

In [41]:
# what if I do this:

s.loc[s%2]

KeyError: "None of [Index([0, 1, 1, 0, 1, 1, 1, 1, 1, 1], dtype='int64')] are in the [index]"

In [43]:
s.iloc[s%2]  # this is a fancy index in which all elements are either position 0 or position 1 -- not helpful to anyone!

a    684
b    559
b    559
a    684
b    559
b    559
b    559
b    559
b    559
b    559
dtype: int64

In [44]:
# if I really want to get only the odd numbers, I need to use a comparison that returns a boolean

s.loc[ s%2==1 ]

b    559
c    629
e    835
f    763
g    707
h    359
i      9
j    723
dtype: int64

In [45]:
s

a    684
b    559
c    629
d    192
e    835
f    763
g    707
h    359
i      9
j    723
dtype: int64

In [46]:
# let's find, in s, all numbers that are odd *and* greater than the mean

s.loc[ s%2==1  and   s>s.mean()  ]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

# Python and boolean context

When we use `if` in Python, `if` looks to its right and expects to get a boolean value. If the value to its right is *not* boolean (True/False), it asks the value what its boolean equivalent is. Everything (just about) in Python says that its boolean equivalent is True, except for:

- None
- 0
- False
- Anything empty -- empty string, empty list, empty dict, etc.
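
One quick way to check this for yourself is to call `bool` on a few values. The sample values in `falsy` and `truthy` below are my own picks, chosen to match the list above.

```python
# bool() shows what `if` would see for each value
falsy = [None, 0, False, '', [], {}]
truthy = [1, -1, 'a', [0], {'k': None}]

for value in falsy + truthy:
    print(repr(value), bool(value))
```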

In [47]:
while True:
    name = input('Enter your name: ').strip()

    if not name:    # this means: If I got the empty string
        break

    print(f'Hello, {name}!')

Enter your name:  asdfadfas


Hello, asdfadfas!


Enter your name:  asdfasfasfasfa


Hello, asdfasfasfasfa!


Enter your name:  


It turns out that `and` and `or` do the same thing: They expect to see boolean values on their left and right. Any non-boolean value there is asked to convert itself.

This is where things go wrong with Pandas: A series is a collection of values, and we vectorize everything. Asking a whole series for one True/False answer is ambiguous: Should it be True if any element is True, or only if all of them are? Pandas refuses to guess, and raises a `ValueError` whenever a series is used in a boolean context, whatever its length.

Here, we got the message because we tried to use "and".
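
We can see the same error without `and`, just by asking a series for its boolean value directly. This is a small sketch; `mask` and `outcome` are names I made up for the demonstration. The `.any()` and `.all()` methods are two of the alternatives that the error message suggests.

```python
from pandas import Series

mask = Series([True, False, True])

try:
    bool(mask)              # what `if`, `and` and `or` do behind the scenes
    outcome = 'converted'
except ValueError as err:
    outcome = 'ValueError'
    print(err)

# Say explicitly which single answer we want
print(mask.any())   # is at least one element True?
print(mask.all())   # are all elements True?
```

Neither `.any()` nor `.all()` helps with a mask index, though, because each returns a single value rather than a boolean series.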

If we cannot use "and", then what do we want to do?

Our goal is basically to take two boolean series (one on the left, one on the right) and get a new boolean series back -- one where the value is True if both original values were True, and False in all other cases.

Pandas takes advantage of "operator overloading" in Python, meaning that certain symbols can be turned into method calls. Pandas uses the `&` for "and" across boolean series, and `|` for "or" across boolean series.

In [49]:
s.loc[ (s%2==1)  &   (s>s.mean())  ]

b    559
c    629
e    835
f    763
g    707
j    723
dtype: int64

# Using "and" and "or" in our queries

1. First set up two queries, each of which returns a boolean series. The series must have the same index and length.
2. Put these two queries inside of parentheses
3. Put `&` (and) or `|` (or) between the two sets of parentheses
4. Around this, put `s.loc` and `[]`.
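
The steps above can be sketched with `|` as well. To keep the example repeatable, I've typed in the values of `s` from the earlier output rather than generating them again. The `~` operator (which flips every True/False) isn't covered above, and the names `odd_or_big` and `evens` are just for illustration.

```python
from pandas import Series

s = Series([684, 559, 629, 192, 835, 763, 707, 359, 9, 723],
           index=list('abcdefghij'))

# "or": values that are odd, or greater than the mean, or both
odd_or_big = s.loc[(s % 2 == 1) | (s > s.mean())]

# "not": ~ inverts the boolean series, so this selects the even values
evens = s.loc[~(s % 2 == 1)]

print(odd_or_big)
print(evens)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `==` and `>`. Without them, Python would try to evaluate `1 | s` first, which is not what we want.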

# Exercise: More complex comparisons

1. Create a series of 20 random ints between 0 and 1,000. The index should contain unique letters (a-t). You can generate these random numbers with `np.random.randint(0, 1000, 20)`.
2. Find all of the values that are < mean - 1 std.
3. Find all of the values that are > mean + 1 std.
4. Find all values that are *either* < mean - 1 std *or* > mean + 1 std.
5. Find even numbers greater than the mean.
6. Find even numbers greater than the mean, and also odd numbers less than the mean.