# Agenda: Mask Index

- Comparisons
- Broadcasts and comparisons
- Using that to filter our series with a "boolean index" or a "mask index"
- Complex comparisons with "and" and "or"

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [7]:
s = Series([10,20,30,40,50,60,70], index=list('abcdefg'))

In [9]:
s.loc['d']

40

In [10]:
s.iloc[4]

50

In [11]:
#inside of the [], I can put a list of locations that I want to retrieve

s.loc[['a','d']]

a    10
d    40
dtype: int64

In [12]:
s.iloc[[2,5]]

c    30
f    60
dtype: int64

In [13]:
#there is another way to retrieve values: we can pass a list of boolean values

s.loc[[True,False,False,True,True,False,True]]

a    10
d    40
e    50
g    70
dtype: int64

# Boolean/mask index

The idea here is: 
- Pass, inside of [], a list of booleans
- Wherever there is a True value, we get the value from the original series
- Wherever there is a False value, the original value is ignored

In [14]:
s > 30 #This is a comparison operation, broadcast across all values of s

a    False
b    False
c    False
d     True
e     True
f     True
g     True
dtype: bool
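
To make the idea of a broadcast concrete, here is a small sketch that compares `s > 30` with the same comparison done one value at a time. The name `manual` is introduced only for this illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60, 70], index=list('abcdefg'))

# The broadcast version: one comparison applied to every value of s
mask = s > 30

# The same thing done by hand, one value at a time
manual = pd.Series([bool(v > 30) for v in s], index=s.index)

# Both approaches should produce the same boolean series
print(mask.equals(manual))
```

The broadcast form is shorter, and it keeps the index of `s` without any extra work.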

In [16]:
# I can take this boolean and use it as a mask index with .loc

s.loc[s>30] #only have [] once here, because we're getting a series back from s>30

#say this as: show the values of s where s>30

d    40
e    50
f    60
g    70
dtype: int64

# How to read a mask index expression

- First, look at the stuff inside of the []. What expression is there, and what does it return?
- Next, think of it as an existing boolean series
- Then apply that boolean series to the series on the outside
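
The reading steps above can be sketched by splitting the expression in two. The variable names `mask` and `result` are just for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60, 70], index=list('abcdefg'))

# Step 1: evaluate the expression inside the [] on its own
mask = s > 30          # a boolean series with the same index as s

# Step 2: apply that boolean series to the series on the outside
result = s.loc[mask]   # keeps only the rows where mask is True

print(mask.dtype)
print(result)
```

Writing `s.loc[s > 30]` on one line does exactly these two steps, just without naming the intermediate series.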

In [18]:
#Let's find all of the values that are greater than the mean

s>s.mean()

a    False
b    False
c    False
d    False
e     True
f     True
g     True
dtype: bool

In [19]:
#this gives us all of the values where the value is greater than the mean of the series

s.loc[s>s.mean()]

e    50
f    60
g    70
dtype: int64

In [21]:
#this does not work with the integers 1 and 0: .loc treats them as index labels, not as booleans

s.loc[[1,0,1,0,1,1,0]]

KeyError: "None of [Index([1, 0, 1, 0, 1, 1, 0], dtype='int64')] are in the [index]"
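
If the data really does arrive as 1s and 0s, one way around the error is to turn the values into real booleans first. This is a minimal sketch; the list `flags` is a made-up example.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60, 70], index=list('abcdefg'))

flags = [1, 0, 1, 0, 1, 1, 0]        # hypothetical 1/0 flags

# Convert each integer into a real Python boolean
mask = [bool(f) for f in flags]

# Now .loc sees booleans, so it treats the list as a mask index
picked = s.loc[mask]
print(picked)
```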

# Exercise: Finding extreme temperatures

1. Create two series, 'highs' and 'lows', which contain the forecast high and low temps for the coming 10 days
2. Find all high temps that are greater than the mean
3. Are any high temps greater than the mean + 1 standard deviation?
4. Calculate the difference between the high and low temperatures for each of these days.
5. Show all days on which the difference was greater than the median difference.

In [48]:
h_temp = Series([73,66,65,72,66,68,69,76,73,75],
                 index='Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu'.split())
l_temp = Series([54,58,57,57,56,55,55,54,54,52], 
                index='Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu'.split())

In [49]:
h_temp.loc[h_temp>h_temp.mean()]

Tue    73
Fri    72
Tue    76
Wed    73
Thu    75
dtype: int64

In [50]:
h_temp.loc[h_temp>(h_temp.mean()+h_temp.std())]

Tue    76
Thu    75
dtype: int64

In [51]:
diff = h_temp - l_temp
diff

Tue    19
Wed     8
Thu     8
Fri    15
Sat    10
Sun    13
Mon    14
Tue    22
Wed    19
Thu    23
dtype: int64

In [52]:
diff.loc[diff>diff.median()]

Tue    19
Fri    15
Tue    22
Wed    19
Thu    23
dtype: int64

In [53]:
#we can create a boolean series with one series
# and we then apply it on another series

#for example: Show me values of "highs" where "lows" is less than the mean

h_temp.loc[l_temp < l_temp.mean()]

Tue    73
Sun    68
Mon    69
Tue    76
Wed    73
Thu    75
dtype: int64

# Mask index across series

You can create a mask index with one series and then apply it to another series. Wherever we got a True value back from the comparison, the value in the applied series will make it through.

In [55]:
m = Series([True,False,True,False,True,True,False,True,False, True])

m.index = h_temp.index

In [56]:
h_temp.loc[m]

Tue    73
Thu    65
Sat    66
Sun    68
Tue    76
Thu    75
dtype: int64

In [57]:
m.index

Index(['Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu'], dtype='object')
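
In the examples above, the mask and the series it is applied to share the same index. As a sketch of why the index matters: when the mask is a series with the same labels in a different order, the pandas versions I know of match the mask to the data by label, not by position. The values here are hypothetical.

```python
import pandas as pd

# Hypothetical values; the two series list the same days in a different order
highs = pd.Series([73, 66, 65], index=['Mon', 'Tue', 'Wed'])
lows = pd.Series([58, 54, 57], index=['Tue', 'Mon', 'Wed'])

mask = lows < 56          # True only for the label 'Mon'
result = highs.loc[mask]  # matched on the label 'Mon', not on position

print(result)
```

If the match were by position, we would have gotten Tuesday's high, because the True value is in the second slot of `mask`.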

# How can we combine conditionals with "and" and "or"?

In [58]:
s

a    10
b    20
c    30
d    40
e    50
f    60
g    70
dtype: int64

In [60]:
# I want all elements of s that are odd

s.loc[s%2] #s%2 doesn't return booleans! It returns a series of integers, which .loc treats as index labels

KeyError: "None of [Index([0, 0, 0, 0, 0, 0, 0], dtype='int64')] are in the [index]"

In [63]:
# we need to get back a boolean series!
# we get that inside of [], and then apply it to s with ".loc"

s = Series([10,15,20,25,30], index=list('abcde'))

s.loc[s%2==1]

b    15
d    25
dtype: int64

In [64]:
#How can I find all of those values that are even *and* greater than the mean?

s.loc[s%2==0] and s.loc[s>s.mean()]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

# What's going on?

In Python, when we use 'if', 'if' looks to its right and checks whether the expression is True or False.

If it's neither True nor False, 'if' asks the object what it's closer to. In other words, it tries to coerce the value into a boolean. In such cases, everything in Python is considered True except for:

- None
- 0
- False
- Anything empty -- '',(),[],{}

A Series is one of the few types of values that refuse to give a boolean value in such contexts. (A NumPy array with more than one element behaves the same way.)
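
Here is a short sketch of those rules, using `bool()` to ask each value what it's closer to. The sample values are just for illustration.

```python
import pandas as pd

# Each of these is considered False in a boolean context
falsy = [bool(v) for v in [None, 0, False, '', (), [], {}]]

# Non-empty and non-zero values are considered True
truthy = [bool(v) for v in [1, 'a', [0], (None,)]]

# A series with several values refuses to answer
s = pd.Series([10, 15, 20])
message = None
try:
    bool(s)
except ValueError as err:
    message = str(err)

print(falsy)
print(truthy)
print(message)
```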

# What's happening?

For any Python expressions X and Y, if you say

    X and Y

Python needs to know whether X is true or false before it can decide what to do with Y. In other words, evaluating the above code starts with

    bool(X)

and Y is only evaluated if that result is True.

Almost every Python object knows how to return a boolean value for itself. In the specific case of a Pandas series, there is no such conversion, because it would be strange and ambiguous. Instead of misleading people some of the time, the decision was made to have a series raise an exception if you try to force it into a boolean context.

All this happens because of "and"!
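
As a small sketch of how "and" behaves, first with ordinary Python values and then with a series on its left side. The variable names are only for illustration.

```python
import pandas as pd

# With ordinary objects, "and" returns one of its operands
first = [] and [1, 2]      # [] is falsy, so the result is []
second = [3] and [1, 2]    # [3] is truthy, so the result is [1, 2]
third = 0 and (1 / 0)      # 0 is falsy, so 1 / 0 is never evaluated

# With a series on the left, "and" cannot decide, so we get an exception
s = pd.Series([10, 15, 20, 25, 30], index=list('abcde'))
error = None
try:
    (s % 2 == 0) and (s > s.mean())
except ValueError as err:
    error = err

print(first, second, third)
print(type(error).__name__)
```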

How can we get around this?

Don't use "and". Instead, use the symbol '&'.

If you run 

    s1 & s2

where both s1 and s2 are boolean series, you'll get back a new boolean series, with the same (shared) index, in which the value is True if both inputs were True. The value is False otherwise, that is, if either input was False.
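
A minimal sketch of '&' on two small boolean series, along with its relatives '|' (for "or") and '~' (for "not"). The series `s1` and `s2` are made up to cover all four combinations.

```python
import pandas as pd

s1 = pd.Series([True, True, False, False], index=list('abcd'))
s2 = pd.Series([True, False, True, False], index=list('abcd'))

both = s1 & s2      # True only where both inputs are True
either = s1 | s2    # True where at least one input is True
negated = ~s1       # flips each value of s1

print(both)
print(either)
print(negated)
```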

In [75]:
#How can I find all of those values that are even *and* greater than the mean?
# - Parentheses around each expression
# - & between them
# - Each expression returns a boolean series

(s%2==0) & (s>s.mean())

a    False
b    False
c    False
d    False
e     True
dtype: bool

In [70]:
s.loc[(s%2==0) &       #values that are even
      (s>s.mean())]    # values that are > mean

e    30
dtype: int64
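
Why do we need the parentheses around each expression? Here is a sketch of what happens without them. In Python, '&' binds more tightly than '==' and '>', so the unparenthesized version is read very differently from what we meant.

```python
import pandas as pd

s = pd.Series([10, 15, 20, 25, 30], index=list('abcde'))

# Without parentheses, Python reads this as the chained comparison
#   s % 2 == (0 & s) > s.mean()
# and a chained comparison puts a series into a boolean context
error = None
try:
    s.loc[s % 2 == 0 & s > s.mean()]
except ValueError as err:
    error = err

# With parentheses, each comparison is evaluated first, then combined
result = s.loc[(s % 2 == 0) & (s > s.mean())]

print(type(error).__name__)
print(result)
```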

In [73]:
(s.loc[s%2==0])

a    10
c    20
e    30
dtype: int64

# Exercise: More complex comparisons

1. Create a series of 20 random integers from 0 up to (but not including) 1000. The index should contain unique letters (a-t). The values can be generated with 'np.random.randint(0,1000,20)'.
2. Find all of the values that are < mean - 1std
3. Find all of the values that are > mean + 1std
4. Find all of the values that are *either* <mean- 1std or >mean+1std
5. Find even numbers that are also greater than the mean
6. Find even numbers that are greater than mean, and also odd numbers that are less than the mean

In [92]:
np.random.seed(0)
rser = Series(np.random.randint(0,1000,20), index = list('abcdefghijklmnopqrst'))
rser

a    684
b    559
c    629
d    192
e    835
f    763
g    707
h    359
i      9
j    723
k    277
l    754
m    804
n    599
o     70
p    472
q    600
r    396
s    314
t    705
dtype: int64

In [93]:
rser.mean()

522.55

In [99]:
#Find all of the values that are < mean - 1 std.

low_outlier = rser.loc[(rser < (rser.mean()-rser.std()))]
low_outlier

d    192
i      9
o     70
dtype: int64

In [100]:
#Find all of the values that are > mean + 1std.

high_outlier = rser.loc[(rser > (rser.mean()+rser.std()))]
high_outlier

e    835
m    804
dtype: int64

In [101]:
#Find all of the values that are either < mean - 1 std or > mean + 1std.

all_outlier = rser.loc[(rser > (rser.mean()+rser.std())) | 
                       (rser < (rser.mean()-rser.std()))]
all_outlier

d    192
e    835
i      9
m    804
o     70
dtype: int64

In [102]:
#find numbers that are both even and greater than the mean

even_high = rser.loc[(rser%2==0) & (rser>rser.mean())]
even_high

a    684
l    754
m    804
q    600
dtype: int64

In [103]:
#find even numbers bigger than the mean and also odd numbers that are less than the mean

bigevensmallodd = rser.loc[((rser%2==0) & (rser>rser.mean())) | ((rser%2!=0) & (rser<rser.mean()))]
bigevensmallodd

a    684
h    359
i      9
k    277
l    754
m    804
q    600
dtype: int64
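
The long expressions above can get hard to read. One option, sketched here, is to give each boolean series a name and then combine the names. The values are copied from the output of `rser` above so that the example doesn't depend on the random number generator; the names `is_low`, `is_high`, and `is_even` are my own.

```python
import pandas as pd

values = [684, 559, 629, 192, 835, 763, 707, 359, 9, 723,
          277, 754, 804, 599, 70, 472, 600, 396, 314, 705]
rser = pd.Series(values, index=list('abcdefghijklmnopqrst'))

# Calculate the mean and standard deviation once
mean, std = rser.mean(), rser.std()

# Name each boolean series
is_low = rser < mean - std
is_high = rser > mean + std
is_even = rser % 2 == 0

# Combine the named masks with | (or), & (and), and ~ (not)
all_outlier = rser.loc[is_low | is_high]
big_even_small_odd = rser.loc[(is_even & (rser > mean)) |
                              (~is_even & (rser < mean))]

print(all_outlier)
print(big_even_small_odd)
```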

In [105]:
df = pd.read_csv('../data/taxi.csv') #load entire data frame of 10,000 NYC taxi rides

In [106]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [108]:
#show me the mean trip_distance for all taxi rides

df['trip_distance'].mean()

3.1585108510851083

In [110]:
#show me the total_amount column wherever trip_distance is > its mean

df['total_amount'].loc[df['trip_distance']>df['trip_distance'].mean()]

7       73.84
20      19.55
24      26.75
39      24.96
43      33.35
        ...  
9992    23.30
9993    18.30
9995    20.30
9996    22.30
9998    26.75
Name: total_amount, Length: 2597, dtype: float64

In [111]:
#what are the min,max,etc. total_amount values wherever trip_distance is > mean?

df['total_amount'].loc[df['trip_distance']>df['trip_distance'].mean()].describe()

count    2597.000000
mean       34.802360
std        18.715147
min         3.300000
25%        21.300000
50%        29.300000
75%        44.810000
max       252.350000
Name: total_amount, dtype: float64

In [112]:
(

df['total_amount']  #get the total amount
    .loc[df['trip_distance']>df['trip_distance'].mean()]  #only keep rows where trip_distance is > mean
    .describe()  #describe the rows that remain

)

count    2597.000000
mean       34.802360
std        18.715147
min         3.300000
25%        21.300000
50%        29.300000
75%        44.810000
max       252.350000
Name: total_amount, dtype: float64
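
Another way to write the same kind of query is to give `.loc` on the data frame both a row selector (the mask) and a column name. This sketch uses a tiny data frame built from the first five rows shown by `df.head()` above, since it can't assume that the CSV file is available.

```python
import pandas as pd

# Two columns from the first five rows of the taxi data shown above
df = pd.DataFrame({
    'trip_distance': [1.63, 0.46, 0.87, 2.13, 1.40],
    'total_amount': [17.8, 8.3, 11.0, 17.16, 10.3],
})

# The mask: rows where trip_distance is greater than its mean
long_trip = df['trip_distance'] > df['trip_distance'].mean()

# Style used above: pick the column, then filter it with the mask
by_column_first = df['total_amount'].loc[long_trip]

# Alternative: give .loc the rows (mask) and the column together
by_rows_and_column = df.loc[long_trip, 'total_amount']

print(by_rows_and_column)
```

Both forms return the same series here. Which one to use is mostly a matter of taste.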