# Agenda: Mask indexes

- Comparisons
- Broadcasts and comparisons
- Using that to filter our series with a "boolean index" or a "mask index"
- Complex comparisons with "and" and "or"

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
s = Series([10, 20, 30, 40, 50, 60, 70],
           index=list('abcdefg'))

In [3]:
s

a    10
b    20
c    30
d    40
e    50
f    60
g    70
dtype: int64

In [4]:
# I can retrieve any element of the series with either .loc (based on the index) or .iloc (based on the position)

In [5]:
s.loc['d']

40

In [6]:
s.iloc[4]

50

In [7]:
# Inside of the [], I can put a list of locations that I want to retrieve
# this is known as "fancy indexing"

s.loc[['a', 'd']]

a    10
d    40
dtype: int64

In [8]:
s.iloc[[2, 5]]

c    30
f    60
dtype: int64

In [9]:
# there is another way that we can retrieve values, though
# we can pass a list of boolean values (True and False)

s.loc[ [True, False, False, True, True, False, True] ]

a    10
d    40
e    50
g    70
dtype: int64

# Boolean/mask index

The idea here is:
- Pass, inside of `[]`, a list of booleans
- Wherever there is a True value, we get the value from the original series
- Wherever there is a False value, the original value is ignored

This is used all of the time, but you will almost never actually be typing True and False into square brackets.

That's because we can ask Pandas to create the booleans for us automatically.

How? With comparison operators.

In [10]:
s > 30    # this is a comparison operation, broadcast across all values of s

a    False
b    False
c    False
d     True
e     True
f     True
g     True
dtype: bool
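To see what "broadcast" means here, the following sketch (my own illustration, reusing the series `s` from above) does the comparison one value at a time in plain Python and then lets Pandas do it across the whole series. The names `by_hand` and `broadcast` are made up for this example.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50, 60, 70],
           index=list('abcdefg'))

# the comparison, done one value at a time in plain Python
by_hand = [value > 30 for value in s]

# the same comparison, broadcast by Pandas across the whole series
broadcast = s > 30

print(by_hand)              # a plain list of booleans, with no index
print(broadcast.tolist())   # the booleans from the boolean series, as a list
```

The two lists should contain the same booleans; the difference is that the broadcast version keeps the index of `s`, which is what lets us use it as a mask.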

In [12]:
# I can take this boolean series and use it as a mask index with .loc

s.loc[ s > 30]   # only have [] once here, because we're getting a series back from s>30

# say this as: show the values of s where s > 30

d    40
e    50
f    60
g    70
dtype: int64

# How to read a mask index expression

- First, look at the stuff inside of the `[]`. What expression is there, and what does it return?
- Next, think of it as an existing boolean series
- Then apply that boolean series to the series on the outside
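The steps above can be sketched by splitting the one-line expression into two lines. This is only an illustration of the reading order, and the variable names `mask` and `result` are my own.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50, 60, 70],
           index=list('abcdefg'))

# step 1: evaluate the expression inside the [] on its own
mask = s > 30          # a boolean series with the same index as s

# step 2: apply that boolean series to the series on the outside
result = s.loc[mask]

print(result)
```

Writing it this way gives the same result as `s.loc[s > 30]`; the one-line form just skips the intermediate variable.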

In [13]:
# Let's find all of the values that are greater than the mean

s > s.mean()

a    False
b    False
c    False
d    False
e     True
f     True
g     True
dtype: bool

In [15]:
# this gives us all of the values (and indexes) in s
# where the value is > the mean of the series

s.loc[ s>s.mean() ]

e    50
f    60
g    70
dtype: int64

In [16]:
# 1 and 0 are integers, not booleans, so .loc treats them as index labels

s.loc[ [1, 0, 1, 0, 1, 1, 0] ]

KeyError: "None of [Index([1, 0, 1, 0, 1, 1, 0], dtype='int64')] are in the [index]"
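The `KeyError` happens because `.loc` sees integers and tries to look them up as index labels. If your data really does arrive as 1s and 0s, one possible fix is to convert it to booleans first. This is a minimal sketch, assuming the flags are in a list that I'm calling `flags` (a made-up name).

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50, 60, 70],
           index=list('abcdefg'))

flags = [1, 0, 1, 0, 1, 1, 0]    # hypothetical 1/0 flags, not booleans

# give the flags the same index as s, then convert 1/0 into True/False
mask = Series(flags, index=s.index).astype(bool)

selected = s.loc[mask]
print(selected)
```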

# Exercise: Finding extreme temperatures

1. Create two series, `highs` and `lows`, which contain the forecast high and low temps for the coming days (the solution below uses 8 days).
2. Find all high temps that are greater than the mean.
3. Are any high temps greater than the mean + 1 standard deviation?
4. Calculate the difference in temperature for each day in the data set.
5. Show all days on which the difference was greater than the median difference.

In [32]:
highs = Series([23, 22, 25, 26, 24, 26, 27, 26],
               index='Tue Wed Thu Fri Sat Sun Mon Tue'.split())
lows = Series([14, 15, 13, 14, 16, 15, 14, 14],
              index='Tue Wed Thu Fri Sat Sun Mon Tue'.split())


In [33]:
highs.mean()

24.875

In [34]:
highs > highs.mean()

Tue    False
Wed    False
Thu     True
Fri     True
Sat    False
Sun     True
Mon     True
Tue     True
dtype: bool

In [35]:
highs.loc[highs > highs.mean()]

Thu    25
Fri    26
Sun    26
Mon    27
Tue    26
dtype: int64

In [36]:
# show me high temps
# where the temp is higher than the mean

highs.loc[highs > highs.mean()]

Thu    25
Fri    26
Sun    26
Mon    27
Tue    26
dtype: int64

In [37]:
# Are any high temps greater than the mean + 1 standard deviation?

highs > (highs.mean() + highs.std())

Tue    False
Wed    False
Thu    False
Fri    False
Sat    False
Sun    False
Mon     True
Tue    False
dtype: bool

In [38]:
highs.loc[highs > (highs.mean() + highs.std())]

Mon    27
dtype: int64
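The exercise asked a yes-or-no question ("are any...?"). If you only want that answer, and not the matching rows, you could call `.any()` on the boolean series. This is a small sketch of that idea, reusing `highs` from the exercise; the name `mask` is mine.

```python
from pandas import Series

highs = Series([23, 22, 25, 26, 24, 26, 27, 26],
               index='Tue Wed Thu Fri Sat Sun Mon Tue'.split())

mask = highs > (highs.mean() + highs.std())

# .any() is True if at least one value in the boolean series is True
print(mask.any())

# True counts as 1, so .sum() tells us how many days passed the test
print(mask.sum())
```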

In [39]:
# spread out across numerous lines

(
    highs
    .loc[highs >
         (highs.mean() + highs.std())]
)

Mon    27
dtype: int64

In [40]:
# Calculate the difference in temperature for each day in the data set.

diffs = highs - lows

In [41]:
diffs

Tue     9
Wed     7
Thu    12
Fri    12
Sat     8
Sun    11
Mon    13
Tue    12
dtype: int64

In [43]:
# Show all days on which the difference was greater than the median difference.

diffs.loc[diffs > diffs.median()]

Thu    12
Fri    12
Mon    13
Tue    12
dtype: int64

In [45]:
# we can create a boolean series with one series
# and then apply it on another series

# for example: Show me the values of "highs" where "lows" is less than the mean

highs.loc[lows < lows.mean()]

Tue    23
Thu    25
Fri    26
Mon    27
Tue    26
dtype: int64

# Mask index across series

You can, as we've seen, create a mask index with one series and then apply it to another series. Wherever we got a True value back from the comparison, the value in the applied series will make it through, and be visible.

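Make sure, however, that the indexes match up! `.loc` lines up a boolean series with the target series by index label, not by position. A plain list of booleans has no index, so there the length has to match instead.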

In [51]:
m = Series([True, False, True, False, True, True, False, False])

In [52]:
highs.loc[m]

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

In [53]:
m.index = highs.index

In [56]:
highs.loc[m]

Tue    23
Thu    25
Sat    24
Sun    26
dtype: int64

In [55]:
m.index

Index(['Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon', 'Tue'], dtype='object')
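Reassigning `m.index` is one way to fix the mismatch. Two related points are sketched here with a smaller series of my own (the names `temps`, `shuffled`, and `positional` are made up for this example): when the mask has the same labels in a different order, `.loc` matches them up by label; and when the booleans are really meant by position, converting the mask to a NumPy array drops its index.

```python
from pandas import Series

temps = Series([23, 22, 25, 26], index=['Tue', 'Wed', 'Thu', 'Fri'])

# same labels as temps, but in a different order
shuffled = Series([True, False, False, True],
                  index=['Fri', 'Thu', 'Wed', 'Tue'])

# .loc matches the mask to temps by label: Tue and Fri are True
by_label = temps.loc[shuffled]
print(by_label)

# a mask with the default 0, 1, 2, 3 index; its labels don't match temps
positional = Series([True, False, True, False])

# converting to a NumPy array drops the index, so it is applied by position
by_position = temps.loc[positional.to_numpy()]
print(by_position)
```

The positional form is convenient, but it gives up the safety check that the labels agree, so it is probably best kept for cases where you know the two series are in the same order.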

# How can we combine conditionals with "and" and "or"?

In [57]:
s

a    10
b    20
c    30
d    40
e    50
f    60
g    70
dtype: int64

In [58]:
# I want all elements of s that are odd

s.loc[s%2]   # s%2 doesn't return booleans! It returns a series of integers, which .loc then tries to look up as index labels

KeyError: "None of [Index([0, 0, 0, 0, 0, 0, 0], dtype='int64')] are in the [index]"

In [62]:
# we need to get back a boolean series!
# We get that (inside of the []), and then apply it to s with ".loc"

s = Series([10, 15, 20, 25, 30], index=list('abcde'))
s.loc[s%2==1]

b    15
d    25
dtype: int64

In [63]:
s.loc[s%2 == 0]

a    10
c    20
e    30
dtype: int64

In [64]:
# How can I find all of those values that are even *and* greater than the mean?

s.loc[s%2==0] and s.loc[s > s.mean()]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

# What's going on?

In Python, when we use `if`, `if` looks to its right, and sees if the expression is True or False.

If it's neither True nor False, `if` asks the object what it's closer to. In other words, it tries to coerce the value into a boolean. In such cases, everything in Python is considered True except for:

- None
- 0
- False
- Anything empty -- '', [], (), {}

It's normal in Python code to take advantage of this.

It turns out that "and" and "or" also put their operands into a boolean context. Whatever is before "and" is asked for its boolean value first, and that answer decides whether whatever is after "and" is even looked at.

A Series is one of the very few types of value I know of that refuses to give a boolean value in such contexts. (A NumPy array with more than one element does the same.)

For any Python expressions X and Y, if you say

    X and Y

Python needs to know whether X is true or false. In other words, the above code starts by running

    bool(X)

If that returns False, then the result of the whole expression is X, and Y is never checked. Otherwise, the result is Y.

In almost all cases, every Python object knows how to return a boolean value for itself. In the specific case of a Pandas series, there is no such conversion, because it would be strange and ambiguous. Instead of misleading people some of the time, the decision was made to have a series raise an exception if you try to force it into a boolean context.

All this happens because of "and"!
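Here is a small sketch of that behavior with ordinary Python values, followed by a series. The values are arbitrary, and `error_message` is just a name I'm using to hold the exception text.

```python
from pandas import Series

# "and" returns its first operand if that operand is false-like...
print([] and 'hello')        # the empty list comes back

# ...and otherwise returns its second operand
print([10] and 'hello')      # 'hello' comes back

# a series refuses to answer the true-or-false question at all
try:
    bool(Series([10, 20, 30]))
except ValueError as e:
    error_message = str(e)
    print(error_message)
```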

How can we get around this?

The answer: Don't use "and". Instead, use the symbol `&`.

The `&` is supported in Python for bitwise "and" operations. Pandas uses this operator, overloading its meaning for series objects. The same is true for `|`, which is "or" for series.

If you run 

    s1 & s2

where both s1 and s2 are boolean series, you'll get back a new boolean series, with the same (shared) index, in which the value is True if both inputs were True. Otherwise (if either input was False), the value is False.
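A tiny truth table makes this concrete. The two series here are made up for illustration, and cover all four combinations of True and False.

```python
from pandas import Series

s1 = Series([True, True, False, False], index=list('wxyz'))
s2 = Series([True, False, True, False], index=list('wxyz'))

both = s1 & s2      # True only where s1 and s2 are both True
either = s1 | s2    # True where at least one of s1, s2 is True

print(both)
print(either)
```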

In [70]:
# here is how it has to look in the end:
# - parentheses around each expression
# - & between them
# - Each expression returns a boolean series

(s%2==0) & (s > s.mean())

a    False
b    False
c    False
d    False
e     True
dtype: bool

In [71]:
s.loc[(s%2==0) &       # values that are even
      (s > s.mean())]  # values that are > mean

e    30
dtype: int64
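Why the parentheses? In Python, `&` binds more tightly than `==` and `>`, so leaving them out changes how the expression is grouped. The sketch below shows the failure, and then one way (my own suggestion, not the only one) to keep things readable: give each boolean series a name. The names `failed`, `is_even`, and `above_mean` are made up for this example.

```python
from pandas import Series

s = Series([10, 15, 20, 25, 30], index=list('abcde'))

# without parentheses, Python groups this as  s%2 == (0 & s) > s.mean()
# which is not the comparison we meant, and it raises an exception
try:
    s.loc[s%2 == 0 & s > s.mean()]
    failed = False
except (ValueError, TypeError):
    failed = True
print(failed)

# naming each mask sidesteps the grouping problem
is_even = s%2 == 0
above_mean = s > s.mean()

print(s.loc[is_even & above_mean])
```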

In [67]:
(s.loc[s%2==0])

a    10
c    20
e    30
dtype: int64

In [72]:
# lists -- if a list is non-empty, it's True in a boolean context

bool([10, 20, 30])

True

In [73]:
bool([])

False

In [74]:
bool(Series([10, 20, 30]))

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [75]:
# | is the "or" operator; it works the same way, but gives True where either input is True

# here's how I get all of the values in s that are *either* even *or* greater than the mean

s.loc[(s%2==0) |       # values that are even
      (s > s.mean())]  # values that are > mean

a    10
c    20
d    25
e    30
dtype: int64