# Data Filtering with pandas

Filtering data is one of the most common and useful techniques in data analysis.

Below is a crash course on the basics of filtering with the [pandas](https://pandas.pydata.org/) data analysis library. Along the way, we'll pull back the curtains on the inner workings of filtering a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame), the workhorse of the `pandas` library.

We'll use this toy data set to demonstrate concepts and get some practice.

In [None]:
data = [
    {'name': 'John Doe', 'state': 'NY', 'salary': 50000, 'birthday': '1997-06-25'},
    {'name': 'Jane Smith', 'state': 'CA', 'salary': 65000, 'birthday': '1997-02-20'},
    {'name': 'Michael Johnson', 'state': 'IL', 'salary': 40000, 'birthday': '1994-07-12'},
    {'name': 'Emily Davis', 'state': 'IL', 'salary': 180000, 'birthday': '1998-12-09'},
    {'name': 'David Wilson', 'state': 'CA', 'salary': 60000, 'birthday': '2003-07-03'},
    {'name': 'Sarah Brown', 'state': 'PA', 'salary': 52000, 'birthday': '2003-04-22'},
    {'name': 'Alex Martinez', 'state': 'NY', 'salary': 85000, 'birthday': '2005-12-15'},
    {'name': 'Maria Garcia', 'state': 'NY', 'salary': 160000, 'birthday': '2010-08-11'},
    {'name': 'James Lee', 'state': 'IL', 'salary': 80000, 'birthday': '1992-10-30'},
    {'name': 'Linda Harris', 'state': 'CA', 'salary': 100000, 'birthday': '2001-12-22'}
]

## Create a DataFrame
Let's create a DataFrame using our toy data.

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(data)
df

## Basic filtering

The standard way to filter a DataFrame involves supplying a Boolean "expression" that evaluates to True or False for each value (or row) in question.

Let's say we wanted to find all the California residents in the data.

Here's an example of the common style of filtering you'll encounter in `pandas`.

In [None]:
# Kinda hard to read, no?

df[df['state'] == 'CA']  # Is that a DataFrame *inside* a DataFrame?!?

The above uses the standard *square-bracket notation* to filter the DataFrame for rows where the `state` is equal to `CA`.

All those square brackets are a bit cumbersome to read, but it's quite a common syntax.

In cases where the column does not contain spaces or other troublesome characters, you can improve the readability of your code by using [dotted-attribute notation](classes_and_oop/README.ipynb) to access the DataFrame's column in a filter.


In [None]:
df[df.state == 'CA'] # Easier on the eyes, no?
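
When does dotted-attribute notation *not* work? Here's a minimal sketch, assuming a made-up mini-DataFrame called `demo` (it is not part of our toy data). It covers two troublesome cases: a column name with a space, and a column name that collides with a built-in DataFrame method.

```python
import pandas as pd

# Hypothetical mini-DataFrame, invented for this illustration only.
demo = pd.DataFrame({
    'home state': ['NY', 'CA', 'IL'],  # column name contains a space
    'count': [3, 7, 5],                # same name as the DataFrame.count() method
})

# A space in the name can't be written with a dot,
# so square brackets are the only option.
ca_rows = demo[demo['home state'] == 'CA']
print(ca_rows)

# demo.count refers to the built-in method, not the column.
dotted_is_method = callable(demo.count)
print(dotted_is_method)

# Square brackets always hand back the column itself.
big_counts = demo[demo['count'] > 4]
print(big_counts)
```

When in doubt, square brackets are the safe choice. Dotted notation is a readability shortcut for well-behaved column names.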

## But what exactly *is* a filter, anyway?

The human-friendly version of the last filter above (using `df.state == 'CA'`) is still, arguably, quite confusing.

And you might reasonably ask: Why do we have to pass the original DataFrame (as part of an expression) *back into the same DataFrame* to get the desired result?

That seems quite strange, no?

We can't say why any developer chooses to design syntax a particular way: just be thankful `pandas` exists!

But we can help you see *how filtering works* on a deeper level.

**The trick to understanding filter syntax is to _create the filter before using it_.**

So instead of this:

```python
df[df.state == 'CA']
```

You can do this:

```python
ca_filter = df.state == 'CA'
df[ca_filter]
```

## "Truthiness tables"

Let's walk through the latter approach, using a few common strategies along the way to gain some insight into the nature of filtering.

Again, the most critical technique involves executing a filter **outside the context of the DataFrame being filtered.**

In [None]:
df.state == 'CA'

Interesting? Indeed! 

But what the heck is going on here?

We see a numeric column on the left side of the output, and values of `True` or `False` on the right.

Print the `df` DataFrame below and see if you can figure out what the above output is all about.

In [None]:
df

Did you guess that the numbers on the left side of our "naked" filter expression (`df.state == 'CA'`) represent the row number, or in `pandas` lingo, the `index`? 

And the values on the right are the result of the comparison? 

In other words, the filter output is telling us that:

- row 0 evaluates to `False` because the `state` value is *not* equal to `CA`
- row 1 evaluates to `True` because the `state` *is equal* to `CA`
- ...and so on for all the rows in the DataFrame...

So the filter expression produces a sort of ["truthiness" table](https://www.cc.com/video/63ite2/the-colbert-report-the-word-truthiness) (please feel free to enjoy this deep cut. We'll wait for you).

And by passing this "table" back into the DataFrame, we're able to select only the rows where the condition is true.
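
If it helps to see the "table" idea laid out, here's a small sketch that lines up the `True`/`False` values against the column they were computed from. It uses a trimmed copy of the toy data stored in a separate variable (`demo`, a name chosen for this example) so that `df` is left alone.

```python
import pandas as pd

# Trimmed copy of the toy data, kept under a different name than `df`.
demo = pd.DataFrame([
    {'name': 'John Doe', 'state': 'NY'},
    {'name': 'Jane Smith', 'state': 'CA'},
    {'name': 'Michael Johnson', 'state': 'IL'},
    {'name': 'David Wilson', 'state': 'CA'},
])

mask = demo.state == 'CA'

# Put the comparison result next to the values it was computed from.
side_by_side = pd.DataFrame({'state': demo.state, 'is_ca': mask})
print(side_by_side)

# Passing the mask back in keeps only the rows marked True.
kept = demo[mask]
print(kept.index.tolist())  # [1, 3]
```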

But is the output of the filter really a "table"? What is the actual data type returned by the filter expression?

Let's find out by storing it in a variable and using the built-in `type` function to get a handle on things.

In [None]:
ca_filter = df.state == 'CA'
type(ca_filter)

AHA!! We can see that the filter expression produces a [pandas.Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html), a more highfalutin cousin of the humble `list` data type in Python.

So we have a handle on the nature of what a filter actually is.
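
A few more ways to poke at the filter, sketched on a made-up four-row DataFrame (`demo` is a hypothetical name, not our toy data). The idea is that the filter is a Series of Boolean values that shares its index with the DataFrame it came from, and that summing it counts the matching rows.

```python
import pandas as pd

# Hypothetical mini data for illustration.
demo = pd.DataFrame({'state': ['NY', 'CA', 'IL', 'CA']})
mask = demo.state == 'CA'

# The filter is a Series...
print(isinstance(mask, pd.Series))

# ...holding Boolean values...
print(pd.api.types.is_bool_dtype(mask))

# ...with the same index labels as the DataFrame it came from.
print(mask.index.equals(demo.index))

# True counts as 1 and False as 0, so the sum is the number of matches.
print(mask.sum())  # 2
```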

Now we can apply it to our original DataFrame and get back the same result as earlier.

In [None]:
# Isn't this so much more readable? And still accurate!
df[ca_filter]

## Complex filters

The strategy of preparing filters *before applying them* really shines as filtering logic grows more complex.

In [None]:
# Find all people in CA with a salary less than $100,000
df[(df.state == 'CA') & (df.salary < 100_000)]

In [None]:
# Same as above, but as separate steps. A bit more readable?

# 1. Create the filter
ca_less_than_100k = (df.state == 'CA') & (df.salary < 100_000)

# 2. Apply the filter
df[ca_less_than_100k]

Above, we introduced a few new bits of syntax:

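- The ampersand `&` combines filters with `AND` logic: a row must satisfy both conditions to be selected
- The parentheses `()` around each condition are required, because Python evaluates `&` before comparisons such as `==` and `<`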
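
Curious what happens if you forget the parentheses? Here's a small sketch on made-up data (`demo` is a hypothetical name). The exact error type and message may differ between `pandas` versions, so the example catches both likely candidates rather than assuming one.

```python
import pandas as pd

# Hypothetical mini data for illustration.
demo = pd.DataFrame({
    'state': ['CA', 'CA', 'NY'],
    'salary': [65_000, 100_000, 50_000],
})

# With parentheses: each comparison runs first, then & combines the results.
both = (demo.state == 'CA') & (demo.salary < 100_000)
print(demo[both])

# Without parentheses, Python tries to evaluate 'CA' & demo.salary first,
# which is not a meaningful operation, so an error is raised.
try:
    demo.state == 'CA' & demo.salary < 100_000
    raised = False
except (TypeError, ValueError) as err:
    raised = True
    print(type(err).__name__)
```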

Note that you can also use the pipe character (`|`) to create `OR` filters. 

Here's one that finds all people in California OR those named `Maria Garcia`.

In [None]:
my_filter = (df.state == 'CA') | (df.name == 'Maria Garcia')
df[my_filter]

## Exercise

Try answering the questions below by creating your own filters:

* How many people live in Illinois?
* How many people earn \$100,000 or more?
* How many people live in New York AND make more than \$60,000?
* How many people live in CA OR earn less than \$80,000?

In [None]:
# Here's some scratch space for you to work

## Date filtering

You may have noticed we've been avoiding dates in our filters. 

That's not because `pandas` is unable to filter by date. We were just trying to minimize distractions until you got a handle on the basics of filtering.

But you're ready now, right? AWESOME. Here we go...

When dealing with dates, you want to make sure that the column containing a date is truly a `datetime` column, as opposed to a text representation of a date.

Let's check our original DataFrame to see what we're working with.

In [None]:
df.info()

In the arcane world of `pandas`, the `birthday` column's `Dtype` (aka data type), is something called an *object*.

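That's how `pandas` typically labels a column of text (strings), though strictly speaking an `object` column can hold any kind of Python object.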

To ensure our date filtering works correctly, let's update `birthday` to make it a proper date column.

In [None]:
df['birthday'] = pd.to_datetime(df.birthday)

In [None]:
df.info()

You should now see `datetime64[ns]` as the data type for `birthday`, which is what we were after.

> The `64` stands for 64-bit integer and is an artifact of the data type's [origins in numpy](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.datetime64). And the `[ns]` stands for nanosecond, a much more precise time format than what you get with the standard Python `datetime` object.
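
Why fuss over the data type at all? Here's a sketch of what can go wrong when dates are left as text. It assumes some made-up dates written as `MM/DD/YYYY` (a format our toy data does *not* use), where comparing text character by character gives a different answer than comparing real dates.

```python
import pandas as pd

# Hypothetical dates stored as MM/DD/YYYY text, invented for this example.
as_text = pd.Series(['12/09/1998', '02/20/1997', '07/03/2003'])

# Text is compared character by character, so '07/03/2003' counts as
# "less than" '10/01/1998' even though 2003 comes after 1998.
text_mask = as_text < '10/01/1998'
print(text_mask.tolist())  # [False, True, True]

# After conversion, the comparison is chronological.
as_dates = pd.to_datetime(as_text, format='%m/%d/%Y')
date_mask = as_dates < '1998-10-01'
print(date_mask.tolist())  # [False, True, False]
```

Text in the `YYYY-MM-DD` format happens to sort in date order, but a real date column does not depend on that coincidence.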

Okay, let's create a few basic date filters. 

First, it's worth noting that you can use the string representation of a date in the `YYYY-MM-DD` format to filter on date columns.

In [None]:
# People born since 2010

zygotes = df.birthday >= '2010-01-01'

df[zygotes]

Or you could use a proper `datetime` object to accomplish the same.

In [None]:
from datetime import datetime

zygotes_dt_filter = df.birthday >= datetime(2010, 1, 1)

df[zygotes_dt_filter]

You'll probably agree the string format (`YYYY-MM-DD`) is easier on the eyes, so we'll stick with that from here on out.
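
If you'd like some reassurance that the styles are interchangeable, here's a quick sketch that builds the same filter a few ways and checks that the results match. The `birthdays` Series is a three-value stand-in for the `birthday` column, and `pd.Timestamp` is a third option (the `pandas` flavor of a datetime) that we haven't used above.

```python
from datetime import datetime

import pandas as pd

# Three birthdays borrowed from the toy data, converted to real dates.
birthdays = pd.to_datetime(pd.Series(['1997-06-25', '2010-08-11', '2005-12-15']))

from_string = birthdays >= '2010-01-01'
from_datetime = birthdays >= datetime(2010, 1, 1)
from_timestamp = birthdays >= pd.Timestamp('2010-01-01')

# All three produce the same True/False values.
print(from_string.equals(from_datetime))
print(from_string.equals(from_timestamp))
```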

Here's a more complex filter with a date.

In [None]:
# People born before 2000, earning less than $100K

pre2000_and_less_than_100k = (df.birthday < '2000-01-01') & (df.salary < 100_000)

df[pre2000_and_less_than_100k]

### Exercises

Now it's your turn to practice. Try creating filters to answer the following questions.

- How many people born since 1990 earn less than \$70,000?
- How many people were born in the 1990s? *Hint: You'll need to combine two date filters.*
- Which people born in the 1990s earn more than \$70,000? *Hint: You'll need to combine the result from the last question with another filter for salary. This one will get gnarly!*


In [None]:
# Here's some scratch space for you to work

## Summary and what's next

Pythonistas (and coders in general) are allergic to keystrokes.

Once they're comfortable with a language, they tend to craft code incantations that are quite terse. This is normally a good thing. It saves on typing and mental processing when reading code. But it can require a fair degree of fluency with the nuances of a given language.

Alas, in the context of `pandas` filtering, common practice can produce fairly inscrutable code, especially to the eyes of someone new to Python and `pandas`.

Just remember: If you find yourself staring at a gnarly filter crammed into a DataFrame, you can deconstruct the filter and save it in one or more separate variables as a preliminary step. 

Unraveling the code in this way can help illuminate the inner workings of complex filtering logic.
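
Here's one way that unraveling might look, sketched on a four-row slice of the toy data stored in a separate variable (`demo`). The filter itself is invented for this example: Illinois residents who either earn more than \$150,000 or were born before 1993.

```python
import pandas as pd

# A slice of the toy data, kept under a different name than `df`.
demo = pd.DataFrame([
    {'name': 'Michael Johnson', 'state': 'IL', 'salary': 40000, 'birthday': '1994-07-12'},
    {'name': 'Emily Davis', 'state': 'IL', 'salary': 180000, 'birthday': '1998-12-09'},
    {'name': 'James Lee', 'state': 'IL', 'salary': 80000, 'birthday': '1992-10-30'},
    {'name': 'Linda Harris', 'state': 'CA', 'salary': 100000, 'birthday': '2001-12-22'},
])
demo['birthday'] = pd.to_datetime(demo.birthday)

# The terse version: everything crammed into the square brackets.
terse = demo[(demo.state == 'IL') & ((demo.salary > 150_000) | (demo.birthday < '1993-01-01'))]

# The unraveled version: one named filter per idea, combined at the end.
in_illinois = demo.state == 'IL'
high_earner = demo.salary > 150_000
born_before_1993 = demo.birthday < '1993-01-01'

wanted = in_illinois & (high_earner | born_before_1993)
unraveled = demo[wanted]

print(unraveled.name.tolist())  # ['Emily Davis', 'James Lee']
```

Both versions select the same rows. The second just gives each piece of logic a name you can inspect on its own.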

One final important note: `pandas` offers quite a few other ways to filter data. If you'd like to learn more, check out:

- [DataFrame.filter](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.filter.html) (despite the name, this selects rows or columns by their *labels*, not by the values they contain)
- [where method](https://pandas.pydata.org/docs/user_guide/indexing.html#the-where-method-and-masking)
- [query method](https://pandas.pydata.org/docs/user_guide/indexing.html#the-query-method)
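
For a quick taste of the three, here's a minimal sketch on made-up three-row data (`demo` is a hypothetical name). It only scratches the surface of each method, so treat it as a starting point for the docs linked above rather than a full tour.

```python
import pandas as pd

# Hypothetical mini data for illustration.
demo = pd.DataFrame({
    'name': ['Jane Smith', 'Emily Davis', 'Linda Harris'],
    'state': ['CA', 'IL', 'CA'],
    'salary': [65_000, 180_000, 100_000],
})

mask = (demo.state == 'CA') & (demo.salary < 100_000)

# query: the same condition, written as a string expression.
via_query = demo.query("state == 'CA' and salary < 100000")
print(via_query)

# where: keeps every row, but blanks out (NaN) the rows where the mask is False.
via_where = demo.where(mask)
print(via_where)

# filter: picks columns (or rows) by label, not by the values inside them.
only_some_columns = demo.filter(items=['name', 'salary'])
print(only_some_columns)
```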

Happy filtering!