# Filtering rows 3: multiple filter conditions
By the end of this lecture you will be able to:
- use multiple AND conditions in `filter`
- use multiple OR conditions in `filter`
- optimise multiple conditions in lazy mode

In [None]:
import polars as pl

In [None]:
csv_file = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csv_file)

## Multiple conditions

### Apply `AND` conditions

We can apply `AND` conditions, where every condition must be met, in a number of ways.

The first way is **chaining** multiple calls to `filter`.

In this example we keep all first class passengers that are over 70.

In [None]:
(
    df
    .filter(
        pl.col('Pclass') == 1
    )
    .filter(
        pl.col('Age') > 70
    )
    .head(3)
)

In eager mode chaining is inefficient: each call to `filter` runs straight away, so Polars makes a separate pass over the rows and builds an intermediate `DataFrame` for every call. It is better to combine everything into a single condition.

One way to do this is to **combine** multiple `AND` conditions in a single `filter` call using `&`. Each condition must be wrapped in parentheses.

In [None]:
(
    df
    .filter(
        (pl.col('Age') > 70) & (pl.col('Pclass') == 1)
    )
    .head(2)
)

There is a less verbose way to do this by passing the predicates as a comma-separated list of expressions

In [None]:
(
    df
    .filter(
        pl.col("Pclass") == 1,
        pl.col("Age") > 70
    )
    .head(2)
)

If we are applying multiple *equality* conditions we can do this with keywords (note the single `=` in this format)

In [None]:
(
    df
    .filter(
        Pclass = 1,
        Age = 70
    )
)

### Apply an AND condition using `pl.all_horizontal`
Specifying multiple conditions in chained `filter` calls or using `&` is fine when we have a small number of conditions to apply. However, we can use the `pl.all_horizontal` function when we want to apply an AND condition on many columns.

> The method above with a comma-separated list of conditions is equivalent to combining the conditions with `pl.all_horizontal`

In this example we:
- first call `pl.all().is_not_null()` to create a Boolean `DataFrame` where each cell is `True` if the underlying value is not `null`
- then call `pl.all_horizontal` to find rows where all values are `True` (i.e. all values are not `null`)

In [None]:
(
    df
    .filter(
        pl.all_horizontal(
            pl.all().is_not_null()
        )
    )
    .head(2)
)

### Apply `AND` condition on a range

We use `is_between` to apply a condition on a range. In this case we are looking for values **greater than or equal to** 10 and **less than or equal to** 13.

In [None]:
(
    df
    .filter(
        pl.col("Age").is_between(10,13)
    )
    .head(2)
)

We use the `closed` argument to specify if we want the range to be open, closed on both sides or open on the left or right. The default is for the range to be closed (with a value of `"both"`). 

In this example we are looking for values from 10 to 13 exclusive of the boundaries

In [None]:
(
    df
    .filter(
        pl.col("Age").is_between(10,13,closed="none")
    )
    .head(2)
)

### Apply `OR` conditions

We can apply an OR filter using the pipe `|` operator.

In this example we look for rows where the passenger is over 70 OR the passenger is in first class

In [None]:
(
    df
    .filter(
        (pl.col('Age') > 70) | (pl.col('Pclass') == 1)
    )
    .head(2)
)

One kind of OR condition is when we want to check whether the value in a column is equal to any value in a `list`. We can do this with `is_in`.

In [None]:
(
    df
    .filter(
        pl.col('Pclass').is_in([2,3])
    )
    .head(3)
)

### Multiple conditions in lazy mode
In *lazy mode* if we chain multiple `filter` calls then the query optimiser combines these into a *single condition*, which is shown under `SELECTION` in the query plan.

In this example we filter for first class passengers over the age of 70.

In [None]:
df = (
    pl.scan_csv(csv_file)
    .filter(
        pl.col('Pclass')==1
    )
    .filter(
        (pl.col('Age') > 70)
    )
)
print(df.explain())

In the query plan we see that the conditions are combined into a single condition by the query optimiser.

## Exercises
In the exercises you will develop your understanding of:
- applying multiple AND conditions
- applying multiple OR conditions

### Exercise 1 
Filter the `DataFrame` to find rows where `Age` is between 30 and 50 (including the lower bound but excluding the upper bound) and the passenger is in 2nd class. Do this in eager mode in a single pass through the `DataFrame`.

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

Do this again combining the range condition with the keyword approach for the 2nd class condition - does the order you pass the conditions matter?

### Exercise 2
Return all the rows of the `DataFrame` where at least one column on the row is `null` (excluding the `Cabin` column with many `null` values)

In [None]:
(
    pl.read_csv(csv_file)
    .drop("Cabin")
    <blank>
)

### Exercise 3
Create a `DataFrame` where the passengers got on in Cherbourg ("C") or Southampton ("S") using the pipe operator

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

Do this again using the `is_in` approach

### Exercise 4
Load the Spotify CSV data into a `DataFrame`

In [None]:
pl.Config.set_fmt_str_lengths(30)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv)
spotify_df.head()

Find all rows where the number of streams is greater than 10 million and the trend is "NEW_ENTRY"  

Find the rows where the artist is either Drake or Ed Sheeran and the rank is less than (better than) 5

## Solutions
### Solution to Exercise 1
Create a `DataFrame` where `Age` is between 30 and 50 (including the lower bound but excluding the upper bound) and the passenger is in 2nd class. Do this in eager mode in a single pass through the `DataFrame`.

In [None]:
(
    pl.read_csv(csv_file)
    .filter(
        (pl.col('Age').is_between(30,50,closed="left")) & (pl.col('Pclass')==2)
    )
    .head()
)

Do this again combining the range condition with the keyword approach for the 2nd class condition - does the order you pass the conditions matter?

In [None]:
(
    pl.read_csv(csv_file)
    .filter(
        pl.col('Age').is_between(30,50,closed="left"),
        Pclass=2,
    )
    .head()
)

The order matters: in a Python function call, keyword arguments must come after all positional arguments, so `Pclass=2` has to follow the `is_between` expression.
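This ordering rule comes from Python itself rather than from Polars. As a small sketch, the reversed call is held in a string (the `bad_call` name is just for illustration) and passed to `compile`, so the error can be caught and printed instead of stopping the notebook.

```python
# The reversed call is kept as a string so the error can be caught and shown.
# Python rejects it while parsing, before `df` or `pl` are ever looked up.
bad_call = "df.filter(Pclass=2, pl.col('Age').is_between(30, 50, closed='left'))"

try:
    compile(bad_call, "<example>", "eval")
    message = None
except SyntaxError as err:
    # Python reports that a positional argument follows a keyword argument
    message = err.msg

print(message)
```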

### Solution to Exercise 2
Return all the rows of the `DataFrame` where at least one column on the row is `null` (excluding the `Cabin` column with many `null` values)

In [None]:
(
    pl.read_csv(csv_file)
    .drop("Cabin")
    .filter(
        pl.any_horizontal(
            pl.all().is_null()
        )
    )
    .head()
)

### Solution to Exercise 3
Create a `DataFrame` where the passengers got on in Cherbourg ("C") or Southampton ("S") using the pipe operator

In [None]:
(
    pl.read_csv(csv_file)
    .filter(
        (pl.col('Embarked') == "C") | (pl.col('Embarked') == "S")
    )
    .head(3)
)

Do this again using the `is_in` approach

In [None]:
(
    pl.read_csv(csv_file)
    .filter(
        pl.col('Embarked').is_in(["C","S"])
    )
    .head()
)

### Solution to Exercise 4
Load the Spotify CSV data into a `DataFrame`

In [None]:
pl.Config.set_fmt_str_lengths(30)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv)
spotify_df.head()

Find all rows where the number of streams is greater than 10 million and the trend is "NEW_ENTRY"  

In [None]:
(
    spotify_df
    .filter(
        pl.col("streams")>1e7,
        trend = "NEW_ENTRY"
    )
)

Find the rows where the artist is either Drake or Ed Sheeran and the rank is less than (better than) 5

In [None]:
(
    spotify_df
    .filter(
        # Both conditions in one call, so eager mode makes a single pass
        pl.col("artist").is_in(["Drake","Ed Sheeran"]),
        pl.col("rank") < 5,
    )
)