# Filtering rows 2: Using `filter` and the Expression API

By the end of this lecture you will be able to:
- apply conditions with the `filter` method
- add a row number column
- partition a `DataFrame`

The `filter` method is our first example of the *Expression API*.

_**Learning to use the *Expression API* is the most important step to writing high performance queries in Polars**_


In [None]:
import polars as pl

In [None]:
csv_file = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csv_file)
df.head(3)

## Applying conditions with `filter`

We use the `filter` method to filter rows according to a condition.

> In Pandas we often use a boolean mask to filter rows but in Polars we use `filter`. Note also that the `filter` method in Polars is quite different from the filter method in Pandas.

We first use an *expression* in the `filter` method before we examine the syntax in more detail.

In this example we keep all rows for first-class passengers

In [None]:
(
    df
    .filter(
        pl.col('Pclass') == 1
    )
    .head(2)
)

## Syntax of `filter`
Inside the `filter` method we pass an _**expression**_ and apply a Boolean condition to it:

`pl.col('Pclass') == 1`

This expression has two parts:
- `pl.col('Pclass')` expression selects the `Pclass` column from `df`
- `== 1` applies a Boolean condition to this expression

In this example we choose all rows where the number of parents & children (`Parch`) is greater than 1

In [None]:
(
    df
    .filter(
        pl.col('Parch') > 1
    )
    .head(2)
)

As well as the mathematical operators such as `==`, `>`, `<` there are corresponding text operators (such as `eq`, `gt`, `lt`) that some people find more readable

In [None]:
(
    df
    .filter(
        pl.col('Parch').gt(1)
    )
    .select("PassengerId","Parch","SibSp")
    .head(5)
)

You can see the full set of operators here: https://pola-rs.github.io/polars/py-polars/html/reference/expressions/operators.html

We can make a filter condition based on two expressions (i.e. comparing data in one column to another) rather than one expression and a constant. In this example we find rows where the number of parents & children (`Parch`) is greater than the number of siblings (`SibSp`)

In [None]:
(
    df
    .filter(
        pl.col('Parch').gt(pl.col("SibSp"))
    )
    .select("PassengerId","Parch","SibSp")
    .head(5)
)

To save a bit of typing we can also apply a filter to a column by passing the column name directly as a keyword argument

In [None]:
(
    df
    .filter(
        Parch = 3,
    )
    .select("PassengerId","Parch","SibSp")
    .head(5)
)

This approach only works for equality conditions (i.e. not for >,< etc). 

Why does this simple approach only work for equalities? Because in this approach Polars takes advantage of Python keyword arguments: we call `filter` with a keyword argument named `Parch` set to 3, which Polars internally converts to `pl.col("Parch") == 3`. Python's keyword argument syntax only allows `=`, so there is no way to write `>` or `<` in this form

### Conditions based on row numbers with `filter`

We can add an explicit row number column using `with_row_index` on a `DataFrame`

In [None]:
df = pl.read_csv(csv_file)
df = df.with_row_index(name='index')
df.head(3)

We can then use `filter` to apply a condition based on row number

In [None]:
(
    df
    .filter(
        pl.col('index') < 4
    )
)

However, a simpler way to do this is with `slice`

In [None]:
(
    df
    .slice(0,4)
)

### Filtering on a Boolean column
We can filter for `True` values on a Boolean column by passing the column as an expression to `filter` without a condition

In [None]:
(
    df
    .with_columns(
        less_than_30 = pl.col("Age") < 30
    )
    .filter(
        pl.col("less_than_30")
    )
    .head(2)
)

We can negate a filter condition with `~`

In [None]:
(
    df
    .with_columns(
        less_than_30 = pl.col("Age") < 30
    )
    .filter(
        ~pl.col("less_than_30")
    )
    .head(2)
)

or with the `not_` expression

In [None]:
(
    df
    .with_columns(
        less_than_30 = pl.col("Age") < 30
    )
    .filter(
        pl.col("less_than_30").not_()
    )
    .head(2)
)

## Partitioning a `DataFrame`
In some cases we want to get the different subsets of the `DataFrame` that result from a single condition. 

We can partition a `DataFrame` into sub-`DataFrames` with the `partition_by` method.

In this example we partition by the `Pclass` column

In [None]:
df_pclass_dict = (
    df
    .partition_by(by=["Pclass"],as_dict=True)
)

The output is a Python `dict` mapping from the unique values in `Pclass` to the sub-`DataFrame` for each class. This partition requires copying the data in `df` to new sub-`DataFrames`.

Note that the keys of this `dict` are always tuples even if there is just one element in the tuple for each key

In [None]:
df_pclass_dict.keys()

Note that if we don't pass the `as_dict=True` argument we instead get a Python `list` of sub-`DataFrames`.

We can get the rows with first class passengers from this `dict` (note the `,` which turns `1` into the tuple `(1,)`)

In [None]:
df_pclass_dict[1,].head(2)

## Filter in lazy mode
We create a `LazyFrame` by scanning the CSV and adding a `filter` operation

In [None]:
(
    pl.scan_csv(csv_file)
    .filter(pl.col("Age") > 30)
)

When we print the optimized plan we see the `filter` operation is part of the `SELECTION`. This query optimisation is called **predicate pushdown**. With predicate pushdown Polars tries to apply a `filter` as early as possible in a query plan to reduce the amount of data that must be processed

In [None]:
print(
    pl.scan_csv(csv_file)
    .filter(pl.col("Age") > 30)
    .explain()
)

When we apply a `filter` to a CSV on our local machine, as in this query, the query optimisation will not have much impact: Polars just reads the CSV, makes a `DataFrame` in memory and then filters the `DataFrame`. The result would probably be similar to doing the query in eager mode.

However, if we are reading a file from cloud storage then Polars tries to apply the condition in `SELECTION` in the cloud storage and so reduces the amount of data that must be transferred across the network. The transfer across the network is typically the slowest and most expensive part of the query.



If we set `streaming=True` in `explain` we see that the `filter` operation is inside the 
```
--- STREAMING
--- END STREAMING
```
part of the query plan - this means that Polars can do this filter operation in streaming mode if we evaluate the lazy query with `.collect(streaming=True)`

In [None]:
print(
    pl.scan_csv(csv_file)
    .filter(pl.col("Age") > 30)
    .explain(streaming=True)
)

# Exercises
In the exercises you will develop your understanding of
- using the `filter` method
- adding a row number column
- partitioning a `DataFrame`

### Exercise 1 
Select all rows where `Age` is greater than 30

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

Select all rows where `Embarked` is equal to "C" - using the keyword approach

Select all rows where `Embarked` is equal to "C" - use `pl.col` with the text operator rather than the mathematical operator this time

Select all rows where `Embarked` is **not** equal to "C" 

### Exercise 2 

In this exercise we filter on row numbers.

First add a row number column

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

Continue by selecting the first 5 rows using `filter` on the row number column

### Exercise 3
Partition the `DataFrame` by the `Survived` and `Pclass` columns as a `dict` (you may want to check the API docs for help: https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.partition_by.html#polars.DataFrame.partition_by)

In [None]:
survived_pclass_dict = (
    pl.read_csv(csv_file)
    <blank>
)

Return the sub-`DataFrame` with the passengers who did not survive from the third class

### Exercise 4
In this exercise we load data from the Spotify charts

In [None]:
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv)
spotify_df.head()

Filter the `DataFrame` to find all rows with artist Post Malone

In [None]:
(
    spotify_df
    <blank>
)

## Solutions

### Solution to Exercise 1
Select all rows with `Age` greater than 30

In [None]:
(
    pl.read_csv(csv_file)
    .filter(pl.col('Age') > 30)
    .head(3)
)

Select all rows where `Embarked` is equal to "C" - using the keyword approach

In [None]:
(
    pl.read_csv(csv_file)
    .filter(Embarked = "C")
    .head(3)
)

Select all rows where `Embarked` is equal to "C" - use `pl.col` with the text operator rather than the mathematical operator this time

In [None]:
(
    pl.read_csv(csv_file)
    .filter(pl.col("Embarked").eq("C"))
    .head(3)
)

Select all rows where `Embarked` is **not** equal to "C" 

In [None]:
(
    pl.read_csv(csv_file)
    .filter(~pl.col("Embarked").eq("C"))
    .head(3)
)

### Solution to Exercise 2
Add a row number column

In [None]:
(
    pl.read_csv(csv_file)
    .with_row_index("row_nr")
)

Continue by selecting the first 5 rows using `filter` on the row number column

In [None]:
(
    pl.read_csv(csv_file)
    .with_row_index("row_nr")
    .filter(pl.col("row_nr")<5)
)

### Solution to Exercise 3
Partition the `DataFrame` by the `Survived` and `Pclass` columns as a `dict`

In [None]:
survived_pclass_dict = (
    pl.read_csv(csv_file)
    .partition_by("Survived","Pclass",as_dict=True)
)

In [None]:
survived_pclass_dict.keys()

Return the sub-`DataFrame` with the passengers who did not survive from the third class

In [None]:
(
    survived_pclass_dict[(0,3)]
    .head(2)
)

### Solution to Exercise 4
In this exercise we load data from the Spotify charts in a compressed CSV

In [None]:
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv)
spotify_df.head()

Filter the `DataFrame` to find all rows with artist Post Malone

In [None]:
(
    spotify_df
    .filter(
        pl.col("artist") == "Post Malone"
    )
)