# Boolean indexing
This is the third post in the series on indexing and selecting data in pandas. If you haven't read the others yet, see [the first post](https://www.wrighters.io/2020/12/26/indexing-and-selecting-in-pandas-part-1/), which covers the basics of selecting by index label or relative numerical position, and [the second post](https://www.wrighters.io/2020/12/29/indexing-and-selecting-in-pandas-slicing/), which talks about slicing. In this post, I'm going to talk about boolean indexing, the way I most often select subsets of data when I work with pandas.


## What is boolean indexing?
For those familiar with NumPy, this may already be second nature, but for beginners it is not so obvious. Boolean indexing works by passing a boolean vector into the indexing operator (```[]```) of an array, which returns the elements at the positions where the vector is ```True```.

One thing to note: this boolean vector needs to be the same length as the array dimension being indexed.

Let's look at an example.

In [74]:
import pandas as pd
import numpy as np

a = np.arange(5)
a

array([0, 1, 2, 3, 4])

Now we can select the first, second and last elements of our array using a list of array indices.

In [75]:
a[[0, 1, 4]]

array([0, 1, 4])

Boolean indexing can do the same, by creating a boolean array of the same size as the original array, with elements 0, 1 and 4 set to ```True```, all others ```False```.

In [76]:
mask = np.array([True, True, False, False, True])
a[mask]

array([0, 1, 4])
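
To make the length requirement concrete, here is a minimal sketch of what happens when the mask is too short. The name short_mask is made up for this illustration; the point is only that NumPy raises an error instead of guessing how to stretch the mask.

```python
import numpy as np

a = np.arange(5)
short_mask = np.array([True, False, True])  # only 3 entries for a 5-element array

try:
    a[short_mask]
    error = None
except IndexError as exc:
    error = exc  # NumPy refuses to index with a mask of the wrong length

print(type(error).__name__)
```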

## Boolean operators
So now we know how to index our array with a single boolean array. But building that array by hand is a pain, so what you will usually end up doing is applying operations to the original array that return a boolean array themselves.

For example, to select all elements less than 3:

In [77]:
a[a < 3]

array([0, 1, 2])

or all even elements:

In [78]:
a[a % 2 == 0]

array([0, 2, 4])

And we can combine these expressions, ANDing them with ```&``` or ORing them with ```|```. With these operators, we can select the same elements from our first example.

In [79]:
a[(a < 2) | (a >= 4)]

array([0, 1, 4])

Another very helpful operator is the inverse, or not, operator (```~```). Remember to watch your parentheses.

In [80]:
a[~((a < 2) | (a >= 4))]

array([2, 3])
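
As a small cross-check of the operators above, the sketch below builds the same masks in two other ways: with the NumPy function np.logical_or, and by rewriting the negated expression with De Morgan's law. The variable names are only for illustration.

```python
import numpy as np

a = np.arange(5)

with_operators = a[(a < 2) | (a >= 4)]
with_functions = a[np.logical_or(a < 2, a >= 4)]  # same mask, spelled as a function call

negated = a[~((a < 2) | (a >= 4))]
de_morgan = a[(a >= 2) & (a < 4)]                 # NOT (A OR B) == (NOT A) AND (NOT B)

print(with_operators, with_functions, negated, de_morgan)
```

Which spelling to use is mostly a matter of taste; the operator form is what you'll see most often in pandas code.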

## On to pandas
In pandas, boolean indexing works pretty much like in NumPy, especially in a ```Series```. You pass in a vector the same length as the ```Series```. Note that this vector doesn't need an index of its own (a list or NumPy array works), but if you pass a ```Series```, its index has to match. A common way to build the vector is to apply expressions to the original series.

### Series

In [81]:
s = pd.Series(np.arange(5), index=list("abcde"))
s

a    0
b    1
c    2
d    3
e    4
dtype: int64

In [82]:
s[[True, True, False, False, True]]            # this vector is just a list of boolean values

a    0
b    1
e    4
dtype: int64

In [83]:
s[np.array([True, True, False, False, True])]  # this vector is a NumPy array of boolean values

a    0
b    1
e    4
dtype: int64

But, since our index is not the default (i.e. not a ```RangeIndex```), if we use another ```Series``` of the same length, it will not work. It needs a matching index.

In [84]:
try:
    s[pd.Series([True, True, False, False, True])]
except Exception as ie:
    print(ie)
    
    
s[pd.Series([True, True, False, False, True], index=list("abcde"))]

Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).


a    0
b    1
e    4
dtype: int64
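
To see what "matching index" means in practice, here is a hedged sketch with a mask that has the same labels as the data but in a different order. The name shuffled_mask is hypothetical. pandas lines the mask up by label before selecting, so the order of the mask doesn't matter as long as every label is present.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=list("abcde"))

# same labels as s, but listed in reverse order
shuffled_mask = pd.Series([True, False, False, True, True], index=list("edcba"))

selected = s[shuffled_mask]  # the mask is aligned with s by label first
print(selected)
```

Here the mask marks labels e, b and a, so those are the rows that come back, in the order they appear in the original series.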

But instead of making a new ```Series```, we'll just base all of our expressions on our source data ```Series``` or ```DataFrame```, so they'll share an index.

In [85]:
# just like before with NumPy
s[(s < 2) | (s > 3)]

a    0
b    1
e    4
dtype: int64

Note that you need to surround each expression with parentheses, because ```|``` and ```&``` bind more tightly than the comparison operators. Without them, the example above would be parsed as ```s < (2 | s) > 3```, a chained comparison that asks Python for the truth value of a whole ```Series```. You'll realize you're forgetting parentheses when you get complaints about the truth value of a series being ambiguous. See?

In [86]:
s[s < 2 | s > 3]    

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
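
The precedence problem is easier to see with a plain integer first. This is a minimal sketch (the variable names are made up) that compares the unparenthesized and parenthesized forms for a single value, then catches the error the same expression raises on a series.

```python
import numpy as np
import pandas as pd

x = 0
unparenthesized = x < 2 | x > 3    # parsed as x < (2 | x) > 3, i.e. 0 < 2 > 3
parenthesized = (x < 2) | (x > 3)  # the comparison we actually meant

s = pd.Series(np.arange(5), index=list("abcde"))
try:
    s[s < 2 | s > 3]
    message = None
except ValueError as exc:
    message = str(exc)             # the chained comparison needs bool(Series)

print(unparenthesized, parenthesized)
print(message)
```

With a scalar, the missing parentheses silently give a different answer; with a series, pandas at least tells you something is wrong.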

### DataFrame
We can also do boolean indexing on ```DataFrames```. A popular way to create the boolean vector is to use one or more of the columns of the ```DataFrame```.

In [87]:
df = pd.DataFrame({'x': np.arange(5), 'y': np.arange(5, 10)})
df[df['x'] < 3]

   x  y
0  0  5
1  1  6
2  2  7


You can also supply multiple conditions, just like before with ```Series```. (Remember those parentheses!)

In [88]:
df[(df['x'] < 3) & (df['y'] > 5)]

   x  y
1  1  6
2  2  7


### ```.loc```, ```.iloc```, and ```[]```
If you remember from previous posts, pandas has three primary ways to index the containers. The indexing operator (```[]```) is sort of a hybrid of using the index labels or location based offsets. ```.loc``` is meant for using the index labels, ```.iloc``` is for integer based indexing.  The good news is that all of them accept boolean arrays, and return subsets of the underlying container.

In [104]:
mask = (df['x'] < 3) & (df['y'] > 5)
mask

0    False
1     True
2     True
3    False
4    False
dtype: bool

In [91]:
display(df[mask])
display(df.loc[mask])

   x  y
1  1  6
2  2  7


   x  y
1  1  6
2  2  7
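
One practical difference between the two is that ```.loc``` also accepts a column selector next to the row mask. The following is a small sketch using the same data; the names y_only and both are just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(5), 'y': np.arange(5, 10)})
mask = (df['x'] < 3) & (df['y'] > 5)

y_only = df.loc[mask, 'y']       # rows where mask is True, column 'y' -> a Series
both = df.loc[mask, ['x', 'y']]  # same rows, explicit list of columns -> a DataFrame

print(y_only)
print(both)
```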


Note that ```.iloc``` is a little different from the others: if you pass in this mask, you'll get an exception.

In [92]:
try:
    df.iloc[mask]
except NotImplementedError as nie:
    print(nie)

iLocation based boolean indexing on an integer type is not available


This is by design: ```.iloc``` is only intended to take positional arguments, and our mask is a ```Series``` with an index, so it is rejected. You can still use a boolean vector; just pass in the values without the index.

In [93]:
df.iloc[mask.to_numpy()]
# or
df.iloc[mask.values]

   x  y
1  1  6
2  2  7
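
Dropping the index is not just a formality. Once the mask is a bare array, it is applied by position rather than by label, which matters if the rows have been reordered. Here is a hedged sketch of the difference; the names reordered, by_label and by_position are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(5), 'y': np.arange(5, 10)})
mask = (df['x'] < 3) & (df['y'] > 5)              # True for labels 1 and 2

reordered = df.sort_values('y', ascending=False)  # index is now 4, 3, 2, 1, 0

by_label = reordered.loc[mask]                    # mask is aligned by label
by_position = reordered.iloc[mask.to_numpy()]     # bare values are applied by position

print(by_label)
print(by_position)
```

With the reordered frame, the label-based selection still returns the rows labeled 1 and 2, while the positional one returns whatever happens to sit in the second and third positions.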


## Examples!
I think one of the most helpful things when thinking about boolean indexing is to see some examples. You are only limited by what you can express by grouping together any combination of expressions on your data. You do this by carefully grouping your boolean expressions and using parentheses wisely. It can also help to break the problem into pieces as you work on it.

In this series I've been grabbing data from the [Chicago Data Portal](https://data.cityofchicago.org). This time, I thought the [list of lobbyists](https://data.cityofchicago.org/Ethics/Lobbyist-Data-Lobbyists/tq3e-t5yq) might be interesting. Due to the need for lobbyists to re-register every year, there's some repeating data. Let's take a look.


In [94]:
# you should be able to grab this dataset as an unauthenticated user, but you can be rate limited
lbys = pd.read_json("https://data.cityofchicago.org/resource/tq3e-t5yq.json")

In [95]:
lbys.dtypes

year               int64
lobbyist_id        int64
salutation        object
first_name        object
last_name         object
address_1         object
city              object
state             object
zip               object
country           object
email             object
phone             object
fax               object
employer_id        int64
employer_name     object
created_date      object
middle_initial    object
address_2         object
suffix            object
dtype: object

In [96]:
lbys['created_date'] = pd.to_datetime(lbys['created_date'])

# I'll drop the personally identifiable data, just to be nice
lbys = lbys.drop(['email', 'phone', 'fax', 'last_name'], axis=1)

lbys.head(3)

In terms of examples, there's not really too much complexity to deal with, but here are a few to give you an idea what boolean indexing looks like.

In [99]:
lbys[lbys['year'] == 2020]  # all lobbyists registered in 2020
lbys[(lbys['year'] == 2020) & (lbys['city'] == 'CHICAGO')] # lobbyists registered in 2020 from Chicago


# let's get the most popular employer for 2020
pop_emp_id = lbys[lbys['year'] == 2020].groupby('employer_id').count().sort_values(by='lobbyist_id', ascending=False).index[0]

# who works for them?
lbys[(lbys['employer_id'] == pop_emp_id) & (lbys['year'] == 2020)]

Unnamed: 0,year,lobbyist_id,salutation,first_name,address_1,city,state,zip,country,employer_id,employer_name,created_date,middle_initial,address_2,suffix
619,2020,24484,,BRIAN,1330 W FULTON ST,CHICAGO,IL,60607,United States,726370277,"STERLING BAY, LLC AND ITS AFFILIATES",2020-01-15,,STE 800,
654,2020,24106,,HOWARD,1330 W. FULTON ST,CHICAGO,IL,60607,United States,726370277,"STERLING BAY, LLC AND ITS AFFILIATES",2020-01-17,,SUITE 800,
982,2020,23828,MS.,SHELLY,1330 W. FULTON ST,CHICAGO,IL,60607,United States,726370277,"STERLING BAY, LLC AND ITS AFFILIATES",2020-01-15,,SUITE 800,
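
If you want to try the "most popular employer" pattern without hitting the data portal, here is a minimal sketch on a made-up table. The toy values are invented, and it uses value_counts with idxmax as one possible alternative to the groupby chain above; it is not the only way to do it.

```python
import pandas as pd

# toy stand-in for the lobbyist data; all values are invented
toy = pd.DataFrame({
    "year": [2019, 2020, 2020, 2020, 2020],
    "employer_id": [10, 10, 20, 10, 30],
    "lobbyist_id": [1, 2, 3, 4, 5],
})

in_2020 = toy["year"] == 2020

# count registrations per employer among the 2020 rows, keep the most frequent id
top_employer = toy.loc[in_2020, "employer_id"].value_counts().idxmax()

# who works for them?
staff = toy[in_2020 & (toy["employer_id"] == top_employer)]
print(top_employer)
print(staff)
```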


If we only want to deal with 2020 data, we can just make a new smaller ```DataFrame``` with that data.

In [100]:
lbys = lbys[lbys['year'] == 2020]

### Boolean indexing with ```isin```
A helpful method that is often paired with boolean indexing is ```Series.isin```. It returns a boolean vector that is ```True``` for every row whose value matches one of the elements passed in.

In [101]:
display(lbys['state'].head())
display(lbys['state'].isin(['IL']).head())
lbys[lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN'])] # lobbyists from bordering states

0     IL
4     IL
22    IL
27    PA
32    IL
Name: state, dtype: object

0      True
4      True
22     True
27    False
32     True
Name: state, dtype: bool

Unnamed: 0,year,lobbyist_id,salutation,first_name,address_1,city,state,zip,country,employer_id,employer_name,created_date,middle_initial,address_2,suffix
164,2020,24879,,DAN,1340 RUSSELL CAVE RD,LEXINGTON,KY,405053114,United States,3439332326,"GALLS, LLC",2020-06-15,,,
791,2020,15543,MR.,LORENZO,501 N BROADWAY,ST. LOUIS,MO,63102,United States,2465452260,"STIFEL, NICOLAUS & COMPANY, INC.",2020-01-10,,,


I'll wrap it up with a slightly more complicated expression.

In [102]:
lbys[
    ~(lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN'])) & # lobbyists NOT from bordering states
    (lbys['state'] != 'IL') &                               # and NOT from IL
    (lbys['created_date'] >= '2020-07-01')                  # created in the last half of the year
    ]

Unnamed: 0,year,lobbyist_id,salutation,first_name,address_1,city,state,zip,country,employer_id,employer_name,created_date,middle_initial,address_2,suffix
27,2020,24967,,JON,7450 TILGHMAN ST.,ALLENTOWN,PA,18106,United States,3314122666,CLEAR CHANNEL AIRPORTS,2020-10-21,,SUITE 104,
227,2020,24909,MS.,LAKEITHA,"1201 F STREET NW, SUITE 1000",WASHINGTON,DC,20004,United States,1059549017,RAI SERVICES,2020-07-14,,,
700,2020,24925,,TAMI,205 S. FRONT STREET,MARQUETTE,MI,49855,United States,3989938709,THALES CONSULTING INC,2020-08-06,,,
862,2020,24969,MR.,ALEX,201 WEST ST,ANNAPOLIS,MD,21401,United States,1710218143,"REALTERM US, INC",2020-08-31,,,


I also find it helpful to sometimes create a variable for storing the mask. So for the above example, instead of having to parse the entire expression when reading the code, it can be helpful to have expressive variable names for the parts of the indexing expression.

In [103]:
non_bordering = ~(lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN']))
non_illinois = (lbys['state'] != 'IL')

# more readable maybe?
lbys[non_bordering & non_illinois & (lbys['created_date'] >= '2020-07-01')]

Unnamed: 0,year,lobbyist_id,salutation,first_name,address_1,city,state,zip,country,employer_id,employer_name,created_date,middle_initial,address_2,suffix
27,2020,24967,,JON,7450 TILGHMAN ST.,ALLENTOWN,PA,18106,United States,3314122666,CLEAR CHANNEL AIRPORTS,2020-10-21,,SUITE 104,
227,2020,24909,MS.,LAKEITHA,"1201 F STREET NW, SUITE 1000",WASHINGTON,DC,20004,United States,1059549017,RAI SERVICES,2020-07-14,,,
700,2020,24925,,TAMI,205 S. FRONT STREET,MARQUETTE,MI,49855,United States,3989938709,THALES CONSULTING INC,2020-08-06,,,
862,2020,24969,MR.,ALEX,201 WEST ST,ANNAPOLIS,MD,21401,United States,1710218143,"REALTERM US, INC",2020-08-31,,,


Often when building a complex expression, it helps to build it in pieces. Assigning parts of the mask to variables can make a complicated expression much easier to read, at the cost of extra variables to deal with. In general, I use variables for parts of the mask that I have to reuse multiple times, but if a part is only used once, I write the entire expression in place.
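
Here is a small sketch of that reuse on invented data. The mask out_of_region is built once from two named pieces and then used three times: to count rows and to split them by date. All names and values are made up for the example.

```python
import pandas as pd

# toy data; states and dates are invented
toy = pd.DataFrame({
    "state": ["IL", "KY", "PA", "DC", "MI"],
    "created_date": pd.to_datetime(
        ["2020-08-01", "2020-09-15", "2020-10-21", "2020-01-14", "2020-08-06"]
    ),
})

non_bordering = ~toy["state"].isin(["WI", "IA", "MO", "KY", "IN"])
non_illinois = toy["state"] != "IL"
second_half = toy["created_date"] >= pd.Timestamp("2020-07-01")

out_of_region = non_bordering & non_illinois  # built once, reused below

how_many = int(out_of_region.sum())           # True counts as 1
late = toy[out_of_region & second_half]
early = toy[out_of_region & ~second_half]

print(how_many)
print(late)
print(early)
```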

Boolean indexing is really quite simple, but powerful. I'll be looking at a few other ways to select data in pandas in upcoming posts.