# Agenda: Mask indexes

1. Comparisons and broadcasting
2. Boolean series -- what can we do with it?
3. Combining these ideas into a "mask index"
4. Complex comparisons with "and" and "or" (or, *not* with them)

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
s = Series([10, 20, 30, 40, 50, 60],
           index=list('abcdef'))
s

a    10
b    20
c    30
d    40
e    50
f    60
dtype: int64

In [4]:
# I can retrieve an element of s with either .loc or .iloc
# .loc means that we want to use the index

s.loc['b']

20

In [5]:
s.loc['e']

50

In [6]:
# .iloc selects by position, starting at 0
s.iloc[3]

40

In [7]:
s.iloc[5]

60

In [8]:
# Inside of the [], I can put a list of indexes/positions that I want
# This is known as "fancy indexing"
s.loc[['b', 'e']]

b    20
e    50
dtype: int64

In [9]:
s.iloc[[3, 5]]

d    40
f    60
dtype: int64

# Using booleans

We can retrieve elements in a completely different way, too -- we can pass a list (or series)
of boolean values (True/False). A list needs to be the same length as the series we're applying it to;
a boolean series also needs an index that lines up with that series' index. Wherever there is a True value,
we get the value from the original series. Wherever there is a False value, it's ignored.

In [10]:
s.loc[ [True, False, True, False, True, False] ]

a    10
c    30
e    50
dtype: int64

In [11]:
s.loc[ [False, False, True, True, False, True]]

c    30
d    40
f    60
dtype: int64

Boolean indexing is a great technique. But you can't seriously imagine typing True and False a large number of times to get the values that you want. That's OK -- we're going to automate that. We'll ask Pandas to generate a series of boolean values, and then we'll use that series to select the values that we want.

This is known as "mask indexing" or "boolean indexing."

# The idea is:

1. Pass, inside of `[]`, a list/series of booleans.
2. Wherever there is a True value, we get the original value.
3. Wherever there is a False value, the original is ignored.

You cannot end up with a larger series than the original, and the idea is to get a smaller one.
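
As a quick check of that last point, here is a minimal sketch using the same `s` as above (the names `mask` and `selected` are just for illustration). The result has one element per True in the mask, so it can never be longer than the original:

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50, 60],
           index=list('abcdef'))

# a hand-written mask, one boolean per element of s
mask = [True, False, True, False, True, False]

selected = s.loc[mask]

# the result has exactly as many elements as there are True values
print(len(selected), sum(mask))

# the original series is untouched
print(len(s))
```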

How can we create this sort of boolean series automatically?

We'll use Python's comparison operators.

In [12]:
# remember broadcasting?

s

a    10
b    20
c    30
d    40
e    50
f    60
dtype: int64

In [13]:
# broadcasting: "+ 3" is applied to each element of s, and we get a new series back
s + 3

a    13
b    23
c    33
d    43
e    53
f    63
dtype: int64

In [14]:
s ** 2

a     100
b     400
c     900
d    1600
e    2500
f    3600
dtype: int64

In [15]:
s - 8.2

a     1.8
b    11.8
c    21.8
d    31.8
e    41.8
f    51.8
dtype: float64

In [16]:
# what happens if we broadcast a comparison operator?

s > 30

a    False
b    False
c    False
d     True
e     True
f     True
dtype: bool
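
The other comparison operators broadcast in the same way. The sketch below rebuilds `s` so that it runs on its own; the variable names are illustrative only. Since True counts as 1, calling `.sum()` on a boolean series is a handy way to see how many elements matched:

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50, 60],
           index=list('abcdef'))

at_most_30 = s <= 30    # True for a, b, c
exactly_30 = s == 30    # True only for c
not_30 = s != 30        # True for everything except c

# count the True values in each boolean series
print(at_most_30.sum(), exactly_30.sum(), not_30.sum())
```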

In [19]:
# this is how a mask index looks:
# typically (but not always!) the same variable will be inside of the [] and also outside of it
# the inside one is generating a boolean series
# the outside one is applying that boolean series, so that we can filter the values

# read this as: First generate a boolean series, where s>30. Then apply that series to s.

s.loc[ s>30 ]

d    40
e    50
f    60
dtype: int64
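
The comment above says the variable inside the `[]` is "typically (but not always!)" the same as the one outside. Here is a minimal sketch of the "not always" case. The second series `t` and its values are made up for illustration; what matters is that it shares `s`'s index:

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50, 60],
           index=list('abcdef'))

# hypothetical second series, with the same index as s
t = Series([5, 95, 15, 85, 25, 75],
           index=list('abcdef'))

mask = t > 50         # the boolean series is generated from t...
result = s.loc[mask]  # ...but applied to s
print(result)
```

Here the mask is True at `b`, `d`, and `f`, so those are the elements of `s` that we get back.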

In [20]:
s.mean()

35.0

In [21]:
s > s.mean()

a    False
b    False
c    False
d     True
e     True
f     True
dtype: bool

In [23]:
# show me all elements of s greater than s's mean

s.loc[ s>s.mean()]

d    40
e    50
f    60
dtype: int64

In [24]:
# z gets the default index (0 through 5), which doesn't match s's index ('a' through 'f')
z = Series([True, False, True, False, True, False])
s.loc[z]

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
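
Pandas lines up a boolean series with the original by index, not by position, which is why `z` is rejected. Here is a sketch of two possible ways around it (the name `z_aligned` is illustrative): give the boolean series the same index as `s`, or strip the index off and pass the bare values.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50, 60],
           index=list('abcdef'))

z = Series([True, False, True, False, True, False])

# Option 1: build the boolean series with s's index
z_aligned = Series([True, False, True, False, True, False],
                   index=s.index)
print(s.loc[z_aligned])

# Option 2: pass only the values, so there's no index to align
print(s.loc[z.to_numpy()])
```

A mask created with a comparison such as `s > 30` never runs into this problem, because it inherits its index from `s`.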

# Exercise: Find extreme temperatures

1. Create two series, `highs` and `lows`, containing the forecast high and low temps for the coming 10 days.
2. Find all high temps greater than the mean.
3. Are any high temps greater than the mean + 1 standard deviation?
4. Calculate the difference in temp between highs and lows.
5. Show all days in which the difference was greater than the median difference.