# DataFrames III - Filtering
- To **filter** means to select a subset of `DataFrame` rows that satisfy a condition.

## Introducing the Dataset
- The `coffee_sales` CSV stores transactions for a fictional coffee shop.
- Polars will import the `time_of_purchase` column as strings.
- Pass `try_parse_dates=True` to `read_csv` so that Polars attempts to parse date-like string columns as datetimes.

## The filter Method
- The `filter` method extracts a subset of `DataFrame` rows that satisfy a condition.
- The `filter` method accepts a Boolean expression (one that produces a column of Booleans).
- The `filter` method will keep the rows with a value of `true` and exclude the rows with a value of `false`.
- Polars supports the standard comparison operators like `==`, `!=`, `>`, `<`, and more.
- Boolean comparisons on `null` will evaluate to `null`.

- The `==` operator compares the equality of every row value with a constant.
- The `eq` method is equivalent to `==`.

- Pass the Boolean expression to the `filter` method.

### Further Reading
- https://docs.pola.rs/user-guide/getting-started/#filter
- https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#filter
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.filter.html

## Filtering with Mathematical Operators

- The `!=` operator compares the inequality of every column value with a constant.
- The `ne` method is the equivalent method.

- Polars supports the standard mathematical operators for comparisons:
    - `<` or `lt` for less than
    - `<=` or `le` for less than or equal to
    - `>` or `gt` for greater than
    - `>=` or `ge` for greater than or equal to

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.eq.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.ne.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.lt.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.le.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.gt.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.ge.html

## Filtering with Missing Values
- The `filter` method drops any row whose condition evaluates to `null`, so rows with missing values are excluded by default.
- Polars has a family of comparison methods ending in `_missing` that treat `null` as a regular, comparable value.
- For example, the `eq_missing` method evaluates a `null` compared with a `null` as `true` instead of `null`.

- We can pass in `None` to compare equality with `null`/missing values...

- ...but the more idiomatic solution is the `is_null` method.
- The `is_null` method returns a Boolean column where `true` marks a missing value.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.null_count.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.ne_missing.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.eq_missing.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_null.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_not_null.html

## Filtering with Boolean Columns
- Polars prints its Booleans in lowercase (`true`, `false`).
- Python's Booleans have a capital first letter (`True`, `False`).
- We _could_ compare every value in a Boolean column to `True`.
- But a cleaner approach is to pass the Boolean column directly to the `filter` method.

## Applying And Logic
- The `&` operator applies AND logic between two Boolean expressions.
- Both conditions must be met for the evaluation to be true.
- The AND (`&`) logic is called "conjunction" in mathematics.

- Wrap each comparison in parentheses. Python's `&` binds more tightly than comparison operators like `==` and `>`, so without parentheses Python groups the operands incorrectly.
- The following example extracts rows with a `milk_type` of "Soy" and a `size` greater than 18.

- Comparison methods such as `eq` and `gt` make the extra parentheses unnecessary.
- Alternatively, assign the expressions to variables.

- The `filter` method supports any number of expressions.

- As an alternative to the `&` symbol, we can also pass sequential Boolean expressions to the `filter` method.

- The `and_` method is equivalent to the `&` operator.
- Polars chose these method names to avoid conflicts with Python's built-in keywords (`and`, `or`, `not`).

- These methods on expressions produce new expressions.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.and_.html

## Keyword Argument Filtering
- The `filter` method supports keyword arguments.
- Pass the column name as the parameter and the comparison value as the argument.
- Keyword argument filtering only works for equality comparisons.
- The Polars team generally advises against this filtering approach. Use expressions.

- The following example filters rows that have a `location` of "Uptown" and a `barista` of "Danielle".

- Boolean columns require an explicit `True` argument.
- The following example adds an additional criterion of a `True` in the `iced` column.

## Applying Or Logic
- The `|` operator applies OR logic between two Boolean expressions.
- At least one condition must be true for the final evaluation to be true.
- The OR (`|`) logic is called "disjunction" in mathematics.

- The next example filters rows where `price` is less than or equal to 3 or `location` is equal to "Downtown".

- The `le` method is equivalent to the `<=` symbol.
- Comparison methods such as `le` and `eq` make the extra parentheses unnecessary.

- The `or_` method is the equivalent of the `|` operator.
- We build up a complex expression that captures the same either/or logic.
- The Polars team chose this awkward name intentionally to avoid conflict and confusion with Python's `or` keyword.

- To simplify code, assign expressions to descriptive variables.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.or_.html

## Operator Precedence
- Python evaluates `&` before `|`, so Polars prioritizes the AND (`&`) operator over the OR (`|`) operator.

- Let's start by defining some Boolean expressions.

- How does Polars interpret the filter below?
- Option A ❌: Extract rows that are (almond milk OR iced) AND cheap
- Option B ✅: Extract rows that are almond milk OR (iced AND cheap)
- The `&` wins out, so Polars groups `is_iced & is_cheap` together.

- Use parentheses to group together the related expressions.
- Use parentheses even if Polars' default inference is correct.

- The placement of parentheses switches up the result.
- The next example extracts rows that are (almond milk OR iced) AND cheap.

## Exclusive OR
- The exclusive OR operator (`^`) requires that exactly one condition be met.
- If one condition is true, the other condition must be false.
- If both conditions are true (or both are false), the `^` expression evaluates to false.
- The method equivalent is `xor`.

- Let's filter for the rows with either a location of "Uptown" or a milk of "Oat".
- Polars will _exclude_ a row if it has both a location of "Uptown" and a milk of "Oat".

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.xor.html

## Filtering for Unique and Duplicate Values

- The `is_unique` method identifies if a row value is unique (only occurs once in the column).
- The `is_duplicated` method identifies if a row value is a duplicate. First occurrences count as duplicates.

- The `is_first_distinct` method identifies if a row holds the first occurrence of a value, regardless of whether it is a duplicate or not.
- The `is_last_distinct` method identifies if a row holds the last occurrence of a value, regardless of whether it is a duplicate or not.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_unique.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_duplicated.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_first_distinct.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_last_distinct.html

## Filtering with Datetimes
- The `filter` method supports filtering date, time, and datetime columns.

- Let's find all the coffees purchased on or after June 1st, 2025.
- Polars is stricter than pandas. It will not compare a datetime column with a date written as a plain string.

- To filter datetimes, we need to provide a valid date/datetime object from Python or Polars.
- Polars has a `date` function which has the same API as Python's `datetime.date` (written as `dt.date` here, assuming `import datetime as dt`): year, month, and day.
- The `date` function returns an expression representing a date in time.
- `pl.date` is different from `pl.Date` which is the `Date` type.
- When using a date on a datetime column, Polars assumes a time of midnight (00:00:00).

- The `dt.datetime` constructor accepts the hours, minutes, and seconds after the year, month, and day.
- Here, we filter for the coffees purchased after June 1st, 2025 12:30:45pm.
- Polars has a complementary `pl.datetime` constructor.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.date.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.datetime.html

## The is_between Method

- Let's say that we're targeting all transactions with a price between 3 and 4 dollars.
- When checking for inclusion in a range, one option is to declare each bound as a separate condition.

- The `is_between` method returns a Boolean expression that indicates whether a row value falls within a range.
- Both the lower bound and upper bound are inclusive.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_between.html

## The is_in Method
- The `is_in` method validates the inclusion of a row value within a collection of values.

- Say we want to extract rows where the `coffee_name` is either "Americano" or "Espresso".
- We can apply the `|` operator or the `or_` method to check for multiple independent expressions.

- The `is_in` method accepts a collection (list, tuple, etc) or another expression.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_in.html

## The remove Method
- The `filter` method includes rows that satisfy an expression.
- The `remove` method _excludes_ rows that satisfy an expression.
- The `remove` method performs the inverse logic of the `filter` method.

- Say we want to extract any rows where the `size` value is not equal to 12.

- Alternatively, we could `remove` any rows where the `size` value is equal to 12.

- We can validate the equality of `DataFrames` with the `equals` method.
- The `equals` method returns `True` if two `DataFrames` are equal (same columns, rows, values, etc).

- If you use `&` or provide multiple arguments, Polars excludes a row only when all of the conditions are met.
- Polars will remove rows where the `size` is 12 and the `barista` is either "Christopher" or "Erica".
- Polars will keep rows with a size of 12 and a barista who is not Christopher or Erica.
- Polars will keep rows with a barista of Christopher or Erica and a size not equal to 12.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.remove.html

## Negation with the Tilde Symbol
- The tilde symbol (`~`) inverts a Boolean expression's values.
- A true becomes a false, and a false becomes a true.
- The method equivalent is `not_`. Python has a reserved `not` keyword.

- Let's extract the rows with coffee prices less than `$3` or greater than `$4`.

- Alternatively, we can target all rows with prices that don't fall between `$3` and `$4`.

- Use variables to provide context for what an expression represents.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.not_.html

## When, Then, Otherwise
- The `pl.when` function starts a conditional expression.
- The `then` method assigns a value based on the condition being met.
- The `otherwise` method provides an optional fallback value if the condition is not met.

- Say we wanted to provide a description of every coffee based on the milk type.
- If the milk is "Oat", the description should be `"Creamy"`. Otherwise, fallback to `"Unknown"`.
- The `pl.lit` function returns an expression that represents a literal (i.e., fixed, constant) value.
- Without a fallback, Polars falls back to a missing value (`null`) for any non-matching value.

- Use the `otherwise` method to provide a fallback to use instead of `null`.

- We can chain multiple invocations of the `when` method in sequence.
- Each `when` method call creates a separate condition to match.
- The design is similar to a `switch case` statement in various programming languages.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/basic-operations/#conditionals
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.when.html

## Partitioning DataFrames
- Say we wanted to extract a `DataFrame` for every unique coffee (`"Flat White"`, `"Latte"`).

- Use the `unique` method to return a column of unique values from a target column.
- Writing 12 `filter` statements is verbose and not future-proof.

- The `partition_by` method uses a column's unique values to return a list of `DataFrames`.
- The `partition_by` method accepts a string with the column's name (an expression is invalid).
- The word **partition** means to "divide into parts".
- If we have 12 different coffees, then `partition_by` will return a list of 12 `DataFrames`.

- The list isn't particularly helpful because the `DataFrames` are accessible by index position.
- Polars will store the `DataFrames` in the order that it encounters the unique values in the original `DataFrame`.
- In this example, "Flat White" comes first, then "Latte", then "Macchiato", etc.
- Setting the `as_dict` parameter to `True` returns a dictionary where the keys are tuples with the unique column values and the values are the `DataFrames`.

- Use square brackets to index into the dictionary and provide a tuple key to access the corresponding `DataFrame`.
- A tuple only requires a comma but is usually written in parentheses.

- The reason that the dictionary stores tuple keys is to support multi-key partitioning.
- Let's instead split the `DataFrame` by every combination of coffee and milk.

- Pass a tuple with both the coffee and milk entry to extract the target `DataFrame`.
- A single value (e.g., "Americano") is not a valid key and will raise a `KeyError`.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.partition_by.html