# Filtering

Earlier, we mentioned that most of our notebooks, and even some of our individual cells, follow a fairly standard pattern:

1. Read Data
2. Clean Data
3. **Filter Data**
4. Process Data
5. Output Data

This article will be about filtering data.

Once you have a good dataset in memory, you're going to want to work with that dataset.

Sometimes, though, there is more data in the dataset than you really need. Perhaps you only want to work on a small subset, a few weeks of data for instance. Maybe you want to throw out any rows that you know have nothing to do with your problem, based on a particular column's value.

As a result, you will want to do some filtering to get to just the data you need.

This phase of the task should be done as early as possible. The less data in your DataFrame, the less work it takes to transform and process it, so it's always a good idea to filter the data down to a smaller working set as you go. This may take more than one filter phase, and may require some calculations to work out what to filter out, but as a general rule of thumb, once you have a clean, typed DataFrame, try to shrink it down to just what you need.
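As a rough sketch of filtering early, the snippet below narrows a dataset to a three-week window and a single category before any heavier processing. The `sales` frame and its `date`, `region`, and `amount` columns are hypothetical, invented only for this illustration.

```python
import pandas as pd

# Hypothetical daily sales data; the column names are assumptions for illustration.
sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "region": ["north", "south", "east"] * 30,
    "amount": range(90),
})

# Keep only three weeks of data for one region before doing anything expensive.
start, end = pd.Timestamp("2024-02-01"), pd.Timestamp("2024-02-21")
in_window = sales["date"].between(start, end)  # inclusive on both ends by default
recent_north = sales[in_window & (sales["region"] == "north")]

print(len(sales), "->", len(recent_north))
```

Every later step now runs over `recent_north` rather than the full `sales` frame.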

## How Filtering Works

Let's look at how pandas filtering works internally.

When you do filtering or boolean indexing in pandas, you're specifying conditions to select specific rows (or columns) in the DataFrame (or Series). Here's a general overview of how this works:

#### Step 1: Evaluation of the Boolean Condition

When you do something like:

```python
df[df['Age'] > 30]
```
Under the hood, pandas applies the `>` comparison element-wise to the `Age` Series. The result is a Boolean Series with the same index as the DataFrame `df`.
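To make the intermediate result visible, here is a minimal sketch that builds the condition on its own and inspects it. The three-row frame is a made-up example, and `mask` is just a name chosen for the illustration.

```python
import pandas as pd

# Small hypothetical frame, used only to make the mask visible.
df = pd.DataFrame({"Name": ["John", "Anna", "Peter"], "Age": [28, 24, 35]})

mask = df["Age"] > 30   # element-wise comparison, no rows selected yet
print(mask.dtype)       # bool
print(mask.tolist())    # [False, False, True]
print(mask.index.equals(df.index))  # True
```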

#### Step 2: Masking with the Boolean Series

This Boolean series is then used to mask the original DataFrame. It returns only the rows where the corresponding value in the Boolean series is True. Any False values result in those rows being dropped from the subset.

#### Step 3: Return Subset DataFrame

The end result is a new DataFrame that includes only the rows that satisfied the condition (or conditions, when you combine several with `&` or `|`). The original DataFrame is not modified. If you assign the result back to the same variable name, that name now refers to the filtered DataFrame instead.
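The following sketch, using a small made-up frame, shows that the filtered result is a separate object and that the surviving rows keep their original index labels. The call to `reset_index(drop=True)` is optional and only included to show one common follow-up.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Anna", "Peter"], "Age": [28, 24, 35]})

over_30 = df[df["Age"] > 30]   # new DataFrame holding only the matching rows

print(len(df))                 # 3, the original still has every row
print(len(over_30))            # 1
print(over_30.index.tolist())  # [2], original index labels are kept

# Rebinding the name is what "replaces" the original in your notebook.
df = df[df["Age"] > 30].reset_index(drop=True)
print(df.index.tolist())       # [0]
```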

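For complex conditions, pandas uses the bitwise operators (`&` for and, `|` for or, `~` for not) instead of the usual Python logical operators (`and`, `or`, `not`). The logical operators try to reduce a whole Series to a single true or false value, which pandas refuses to do and reports as a `ValueError`. When using the bitwise operators, each condition has to be enclosed in parentheses, because bitwise operators have higher precedence than comparison operators in Python.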
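A short sketch of the parenthesized form, on a made-up four-row frame. The commented-out line at the end illustrates the precedence problem and is left disabled because it typically raises an error.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [28, 24, 35, 32],
    "City": ["New York", "Paris", "Berlin", "London"],
})

# Parentheses make each comparison run before &, | and ~ combine the masks.
over_30_not_berlin = df[(df["Age"] > 30) & ~(df["City"] == "Berlin")]
young_or_paris = df[(df["Age"] < 25) | (df["City"] == "Paris")]

print(over_30_not_berlin["City"].tolist())  # ['London']
print(young_or_paris["City"].tolist())      # ['Paris']

# Without parentheses, Python evaluates `30 & df["City"]` first, which is not
# the intended condition and typically fails:
# df[df["Age"] > 30 & df["City"] == "Berlin"]
```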

Methods like `isin` and `str.contains` work the same way: pandas first evaluates the method to produce a Boolean Series marking which rows pass the condition, and then the same masking process happens.
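As an illustration, the sketch below calls both methods without filtering, so the Boolean Series they return can be inspected and combined. The frame is made up, and the `case=False` and `na=False` arguments are optional choices for this example: the first ignores letter case and the second treats missing names as non-matches.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Anna", "Peter"],
    "City": ["New York", "Paris", "Berlin"],
})

city_mask = df["City"].isin(["New York", "Berlin"])
name_mask = df["Name"].str.contains("a", case=False, na=False)

print(city_mask.tolist())  # [True, False, True]
print(name_mask.tolist())  # [False, True, False]

# Being ordinary Boolean Series, they combine like any other condition.
listed_city_without_a = df[city_mask & ~name_mask]
```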

It's also worth noting that filtering in pandas is efficient, especially for large datasets, because pandas uses a vectorized approach (operating on whole arrays at once) for these computations. This generally makes pandas filtering faster than an equivalent Python loop or list comprehension.
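The sketch below contrasts the vectorized form with a row-by-row equivalent on synthetic data and checks that both select the same rows. The frame size and random seed are arbitrary assumptions. It does not include timings, since those vary by machine.

```python
import numpy as np
import pandas as pd

# Synthetic data; the size and seed are arbitrary choices for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Age": rng.integers(18, 80, size=100_000)})

# Vectorized: one comparison over the whole column.
vectorized = df[df["Age"] > 30]

# Row-by-row equivalent: same result, but a Python-level loop over every value.
keep = [bool(age > 30) for age in df["Age"]]
looped = df[keep]

print(vectorized.equals(looped))
```

In a notebook, prefixing each approach with `%timeit` is one way to compare them on your own data.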

## Examples

Below are some examples of how to filter a pandas DataFrame.

Example 1: Filtering a DataFrame based on one condition

```python
import pandas as pd

# Assume we have the following DataFrame
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'James'],
    'Age': [28, 24, 35, 32, 30],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Sydney']
}
df = pd.DataFrame(data)

# Filter the DataFrame to only include rows where Age > 30
filtered_df = df[df['Age'] > 30]
filtered_df
```

|   | Name | Age | City |
|---|------|-----|------|
| 2 | Peter | 35 | Berlin |
| 3 | Linda | 32 | London |


Example 2: Filtering a DataFrame based on multiple conditions

```python
# Filter the DataFrame to include rows where Age > 30 and City is 'Berlin'
filtered_df = df[(df['Age'] > 30) & (df['City'] == 'Berlin')]
filtered_df
```

|   | Name | Age | City |
|---|------|-----|------|
| 2 | Peter | 35 | Berlin |


Example 3: Filtering a DataFrame using the isin function

```python
# Filter the DataFrame to include rows where City is either 'New York' or 'Berlin'
cities = ['New York', 'Berlin']
filtered_df = df[df['City'].isin(cities)]
filtered_df
```

|   | Name | Age | City |
|---|------|-----|------|
| 0 | John | 28 | New York |
| 2 | Peter | 35 | Berlin |


Example 4: Filtering a DataFrame using the str.contains function (useful for string matching)

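```python
# Filter the DataFrame to include rows where Name contains the substring 'a'
# (the match is case-sensitive, so the capital 'A' in 'Anna' does not count)
filtered_df = df[df['Name'].str.contains('a')]
filtered_df
```

|   | Name | Age | City |
|---|------|-----|------|
| 1 | Anna | 24 | Paris |
| 3 | Linda | 32 | London |
| 4 | James | 30 | Sydney |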


In practice, you may find that filtering occurs in many steps. You may do some basic filtering, then run some tasks to generate synthetic columns, for instance, and then filter some more. The basic point is that, for a variety of reasons, it's a good idea to shrink your dataset down as small as it can get before you start doing your really complex work.
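A multi-step filter of that kind might look like the sketch below. The `orders` frame, its columns, and the threshold of 30 are all hypothetical, chosen only to show the filter, derive, filter again shape.

```python
import pandas as pd

# Hypothetical order data; column names are assumptions for illustration.
orders = pd.DataFrame({
    "country": ["US", "US", "FR", "US", "DE"],
    "quantity": [2, 10, 5, 1, 7],
    "unit_price": [5.0, 3.0, 4.0, 50.0, 2.0],
})

# Stage 1: cheap filter on an existing column.
# .copy() gives an independent frame that is safe to add columns to.
us_orders = orders[orders["country"] == "US"].copy()

# Stage 2: derive a synthetic column on the smaller frame only.
us_orders["total"] = us_orders["quantity"] * us_orders["unit_price"]

# Stage 3: filter again, this time on the derived column.
large_us_orders = us_orders[us_orders["total"] >= 30]
print(large_us_orders)
```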

Doing your complex work over fewer rows is going to be faster. In the world of iterative, interactive exploration that we're in, a tight, fast retry loop, where you can rerun your experiments and attempts as quickly as possible, is very helpful.
