# Agenda: Mask indexes

- Comparisons
- Broadcasts and comparisons
- Using that to filter our series with a "boolean index" or a "mask index"
- Complex comparisons with "and" and "or"

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
s = Series([10, 20, 30, 40, 50, 60, 70],
           index=list('abcdefg'))

In [3]:
s

a    10
b    20
c    30
d    40
e    50
f    60
g    70
dtype: int64

In [4]:
# I can retrieve any element of the series with either .loc (based on the index) or .iloc (based on the position)

In [5]:
s.loc['d']

40

In [6]:
s.iloc[4]

50

In [7]:
# Inside of the [], I can put a list of locations that I want to retrieve
# this is known as "fancy indexing"

s.loc[['a', 'd']]

a    10
d    40
dtype: int64

In [8]:
s.iloc[[2, 5]]

c    30
f    60
dtype: int64

In [9]:
# there is another way that we can retrieve values, though
# we can pass a list of boolean values (True and False)

s.loc[ [True, False, False, True, True, False, True] ]

a    10
d    40
e    50
g    70
dtype: int64

# Boolean/mask index

The idea here is:
- Pass, inside of `[]`, a list of booleans
- Wherever there is a True value, we get the value from the original series
- Wherever there is a False value, the original value is ignored

Mask indexes are used all of the time, but you will almost never type True and False into square brackets yourself.

That's because we can ask Pandas to create the list of booleans for us automatically.

How? With comparison operators.

In [10]:
s > 30    # this is a comparison operation, broadcast across all values of s

a    False
b    False
c    False
d     True
e     True
f     True
g     True
dtype: bool
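The other comparison operators broadcast in the same way as `>`. A minimal sketch, rebuilding `s` so the block runs on its own; the names `eq`, `ne`, and `le` are only for illustration:

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50, 60, 70],
           index=list('abcdefg'))

# Each comparison is applied to every value of s,
# and each returns a boolean series with the same index as s
eq = s == 30    # True only where the value equals 30
ne = s != 30    # True everywhere except where the value equals 30
le = s <= 30    # True where the value is 30 or less

print(eq)
print(ne)
print(le)
```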

In [12]:
# I can take this boolean series and use it as a mask index with .loc

s.loc[ s > 30]   # only one pair of [] here, because s > 30 already returns a boolean series

# say this as: show the values of s where s > 30

d    40
e    50
f    60
g    70
dtype: int64

# How to read a mask index expression

- First, look at the stuff inside of the `[]`. What expression is there, and what does it return?
- Next, think of it as an existing boolean series
- Then apply that boolean series to the series on the outside
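The three steps above can be made explicit by giving the boolean series its own name before applying it. This is only a sketch: the names `mask` and `result` are illustrative, and `s` is rebuilt so the block is self-contained.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50, 60, 70],
           index=list('abcdefg'))

# Step 1: evaluate the expression inside the [] on its own
mask = s > 30

# Step 2: the result is an ordinary boolean series with the same index as s
print(mask.dtype)

# Step 3: apply that boolean series to the series on the outside
result = s.loc[mask]
print(result)
```

Writing `s.loc[s > 30]` on one line does exactly the same thing; naming the mask just makes it easier to inspect.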

In [13]:
# Let's find all of the values that are greater than the mean

s > s.mean()

a    False
b    False
c    False
d    False
e     True
f     True
g     True
dtype: bool

In [15]:
# this gives us all of the values (and indexes) in s
# where the value is > the mean of the series

s.loc[ s>s.mean() ]

e    50
f    60
g    70
dtype: int64

In [16]:
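# 1 and 0 are not treated as True and False here:
# .loc looks them up as index labels, and s has no such labels

s.loc[ [1, 0, 1, 0, 1, 1, 0] ]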

KeyError: "None of [Index([1, 0, 1, 0, 1, 1, 0], dtype='int64')] are in the [index]"
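If the flags really do arrive as 1s and 0s, one way around this error is to convert them to booleans before passing them to `.loc`. A minimal sketch, assuming the flags are in the same order as the series; `flags`, `mask`, and `picked` are illustrative names:

```python
import numpy as np
from pandas import Series

s = Series([10, 20, 30, 40, 50, 60, 70],
           index=list('abcdefg'))

flags = [1, 0, 1, 0, 1, 1, 0]

# Convert the integers to real booleans: 1 becomes True, 0 becomes False
mask = np.array(flags, dtype=bool)

# A boolean array of the same length as s works as a mask index
picked = s.loc[mask]
print(picked)
```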

# Exercise: Finding extreme temperatures

1. Create two series, `highs` and `lows`, which contain the forecast high and low temps for the coming 10 days.
2. Find all high temps that are greater than the mean.
3. Are any high temps greater than the mean + 1 standard deviation?
4. Calculate the difference in temperature for each day in the data set.
5. Show all days on which the difference was greater than the median difference.

In [18]:
highs = Series([23, 22, 25, 26, 24, 26, 27, 26],
               index='Tue Wed Thu Fri Sat Sun Mon Tue'.split())
lows = Series([14, 15, 13, 14, 16, 15, 14, 14],
              index='Tue Wed Thu Fri Sat Sun Mon Tue'.split())


In [19]:
highs.mean()

24.875

In [20]:
highs > highs.mean()

Tue    False
Wed    False
Thu     True
Fri     True
Sat    False
Sun     True
Mon     True
Tue     True
dtype: bool

In [21]:
highs.loc[highs > highs.mean()]

Thu    25
Fri    26
Sun    26
Mon    27
Tue    26
dtype: int64
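Steps 3 through 5 of the exercise follow the same pattern. What follows is one possible sketch, not the only solution; the names `threshold`, `above`, `diffs`, and `big_diffs` are illustrative, and the two series are rebuilt (with the second one named `lows`, as the exercise asks) so the block runs on its own.

```python
from pandas import Series

days = 'Tue Wed Thu Fri Sat Sun Mon Tue'.split()
highs = Series([23, 22, 25, 26, 24, 26, 27, 26], index=days)
lows = Series([14, 15, 13, 14, 16, 15, 14, 14], index=days)

# Step 3: high temps greater than the mean + 1 standard deviation
threshold = highs.mean() + highs.std()
above = highs.loc[highs > threshold]
print(above)

# Step 4: difference between high and low for each day.
# The label 'Tue' appears twice; this subtraction still pairs values
# one to one because both series have identical indexes.
diffs = highs - lows
print(diffs)

# Step 5: days on which the difference was greater than the median difference
big_diffs = diffs.loc[diffs > diffs.median()]
print(big_diffs)
```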
