# DataFrames III - Filtering
- To **filter** means to select a subset of `DataFrame` rows that satisfy a condition.

In [1]:
import polars as pl

## Introducing the Dataset
- The `coffee_sales` CSV stores transactions for a fictional coffee shop.
- By default, Polars imports the `time_of_purchase` column as strings.
- Set `try_parse_dates=True` so that Polars attempts to parse date-like string columns as dates or datetimes.

In [2]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True).sort("time_of_purchase")
coffees.head(5)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,True,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,True,"""Christopher""",4.29
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,False,"""Jeffrey""",5.85
2025-01-01 10:02:41,"""Midtown West""",,"""Soy""",12,False,"""Jeffrey""",5.05
2025-01-01 15:07:27,"""SoHo""","""Flat White""","""Oat""",12,False,"""Danielle""",3.74


## The filter Method
- The `filter` method extracts a subset of `DataFrame` rows that satisfy a condition.
- The `filter` method accepts a Boolean expression (one that produces a column of Booleans).
- The `filter` method will keep the rows with a value of `true` and exclude the rows with a value of `false`.
- Polars supports the standard comparison operators like `==`, `!=`, `>`, `<`, and more.
- A comparison against `null` evaluates to `null`, and `filter` drops the rows whose condition is `null`.

In [3]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True).sort("time_of_purchase")
coffees.head(2)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,True,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,True,"""Christopher""",4.29


- The `==` operator compares every value in a column with a constant and produces a Boolean for each row.
- The `eq` method is equivalent to `==`.

In [4]:
5 == 10

False

In [5]:
coffees.select(
    pl.col("coffee_name"),
    (pl.col("coffee_name") == "Flat White").alias("is_flat_white"),
    pl.col("coffee_name").eq("Flat White").alias("is_flat_white_eq"),
)

is_flat_white = pl.col("coffee_name").eq("Flat White").alias("is_flat_white_eq")

- Pass the Boolean expression to the `filter` method.

In [6]:
coffees.filter(pl.col("coffee_name") == "Flat White")
coffees.filter(pl.col("coffee_name").eq("Flat White"))
coffees.filter(is_flat_white)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29
2025-01-01 15:07:27,"""SoHo""","""Flat White""","""Oat""",12,false,"""Danielle""",3.74
2025-01-02 01:50:26,"""Uptown""","""Flat White""","""Oat""",12,true,"""Joshua""",2.6
2025-01-03 10:57:47,"""Midtown West""","""Flat White""","""Soy""",12,true,"""Anthony""",3.69
2025-01-07 23:54:11,"""Downtown""","""Flat White""","""2%""",16,false,"""Jill""",5.08
…,…,…,…,…,…,…,…
2025-06-01 03:42:07,"""Midtown West""","""Flat White""","""Oat""",16,false,"""Erica""",6.21
2025-06-02 13:54:05,"""Downtown""","""Flat White""","""Oat""",12,false,"""Erica""",5.76
2025-06-06 11:23:23,"""East Village""","""Flat White""","""Coconut""",16,false,"""Angel""",5.23
2025-06-06 14:36:47,"""SoHo""","""Flat White""","""Oat""",20,false,"""Danielle""",3.47


In [7]:
coffees.filter(pl.col("location") == "SoHo")
coffees.filter(pl.col("location").eq("SoHo"))

# String comparisons are exact: a different case or a trailing space matches no rows.
coffees.filter(pl.col("location").eq("Soho"))
coffees.filter(pl.col("location").eq("SoHo "))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64


### Further Reading
- https://docs.pola.rs/user-guide/getting-started/#filter
- https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#filter
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.filter.html

## Filtering with Mathematical Operators

In [8]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True).sort("time_of_purchase")
coffees.head(2)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,True,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,True,"""Christopher""",4.29


- The `!=` operator checks every column value for inequality with a constant.
- The `ne` method is equivalent to `!=`.

In [9]:
coffees.filter(pl.col("milk_type") != "Oat")
coffees.filter(pl.col("milk_type").ne("Oat"))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85
2025-01-01 10:02:41,"""Midtown West""",,"""Soy""",12,false,"""Jeffrey""",5.05
2025-01-01 16:31:26,"""Midtown West""","""Americano""","""2%""",20,false,"""Erica""",4.83
2025-01-02 03:28:39,"""East Village""","""Espresso""","""2%""",20,true,"""Joshua""",4.79
…,…,…,…,…,…,…,…
2025-06-07 04:13:09,"""Midtown West""","""Espresso""","""Almond""",12,true,"""Joshua""",4.54
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87


- Polars supports the standard mathematical operators for comparisons:
    - `<` or `lt` for less than
    - `<=` or `le` for less than or equal to
    - `>` or `gt` for greater than
    - `>=` or `ge` for greater than or equal to

In [10]:
coffees.filter(pl.col("price") > 5)
coffees.filter(pl.col("price").gt(5))

coffees.filter(pl.col("price") < 5)
coffees.filter(pl.col("price").lt(5))

coffees.filter(pl.col("price") >= 5)
coffees.filter(pl.col("price").ge(5))

coffees.filter(pl.col("price") <= 5)
coffees.filter(pl.col("price").le(5))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29
2025-01-01 15:07:27,"""SoHo""","""Flat White""","""Oat""",12,false,"""Danielle""",3.74
2025-01-01 16:31:26,"""Midtown West""","""Americano""","""2%""",20,false,"""Erica""",4.83
2025-01-02 01:50:26,"""Uptown""","""Flat White""","""Oat""",12,true,"""Joshua""",2.6
…,…,…,…,…,…,…,…
2025-06-06 22:28:52,"""Uptown""","""Cappuccino""","""Soy""",16,true,"""Danielle""",3.33
2025-06-07 04:13:09,"""Midtown West""","""Espresso""","""Almond""",12,true,"""Joshua""",4.54
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.eq.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.ne.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.lt.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.le.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.gt.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.ge.html

## Filtering with Missing Values
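- The `filter` method drops rows whose condition evaluates to `null`, so a comparison such as `ne` silently excludes rows with missing values.
- Polars has comparison methods ending in `_missing` (`eq_missing`, `ne_missing`) that treat `null` as a regular value to compare.
- For example, `ne_missing("Mocha")` evaluates to `true` for a `null` coffee name, so `filter` keeps that row.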

In [11]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True).sort("time_of_purchase")
coffees.head(2)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,True,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,True,"""Christopher""",4.29


In [12]:
coffees.null_count()

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
u32,u32,u32,u32,u32,u32,u32,u32
0,0,6,0,0,0,0,0


In [13]:
coffees.filter(pl.col("coffee_name").ne("Mocha"))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29
2025-01-01 15:07:27,"""SoHo""","""Flat White""","""Oat""",12,false,"""Danielle""",3.74
2025-01-01 16:31:26,"""Midtown West""","""Americano""","""2%""",20,false,"""Erica""",4.83
2025-01-02 01:50:26,"""Uptown""","""Flat White""","""Oat""",12,true,"""Joshua""",2.6
…,…,…,…,…,…,…,…
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87


In [14]:
coffees.filter(pl.col("coffee_name").ne_missing("Mocha"))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29
2025-01-01 10:02:41,"""Midtown West""",,"""Soy""",12,false,"""Jeffrey""",5.05
2025-01-01 15:07:27,"""SoHo""","""Flat White""","""Oat""",12,false,"""Danielle""",3.74
2025-01-01 16:31:26,"""Midtown West""","""Americano""","""2%""",20,false,"""Erica""",4.83
…,…,…,…,…,…,…,…
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87


- Passing `None` to `eq` matches no rows because the comparison evaluates to `null` for every row. Passing `None` to `eq_missing` does match the `null`/missing values...

In [15]:
coffees.filter(pl.col("coffee_name").eq(None))
coffees.filter(pl.col("coffee_name").eq_missing(None))


time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 10:02:41,"""Midtown West""",,"""Soy""",12,False,"""Jeffrey""",5.05
2025-01-07 00:36:36,"""Downtown""",,"""Oat""",16,False,"""Christopher""",3.92
2025-01-29 15:43:56,"""Downtown""",,"""Oat""",16,True,"""Jill""",6.09
2025-02-22 23:58:14,"""SoHo""",,"""2%""",20,True,"""Robert""",2.72
2025-04-09 22:00:09,"""Downtown""",,"""Whole""",16,True,"""Angel""",2.8
2025-05-10 16:28:05,"""SoHo""",,"""2%""",20,False,"""Erica""",4.4


- ...but the more idiomatic solution is the `is_null` method.
- The `is_null` method returns a Boolean expression that is true wherever a value is missing.

In [16]:
coffees.filter(pl.col("coffee_name").is_null())

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 10:02:41,"""Midtown West""",,"""Soy""",12,False,"""Jeffrey""",5.05
2025-01-07 00:36:36,"""Downtown""",,"""Oat""",16,False,"""Christopher""",3.92
2025-01-29 15:43:56,"""Downtown""",,"""Oat""",16,True,"""Jill""",6.09
2025-02-22 23:58:14,"""SoHo""",,"""2%""",20,True,"""Robert""",2.72
2025-04-09 22:00:09,"""Downtown""",,"""Whole""",16,True,"""Angel""",2.8
2025-05-10 16:28:05,"""SoHo""",,"""2%""",20,False,"""Erica""",4.4


In [17]:
coffees.filter(pl.col("coffee_name").ne_missing(None))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85
2025-01-01 15:07:27,"""SoHo""","""Flat White""","""Oat""",12,false,"""Danielle""",3.74
2025-01-01 16:31:26,"""Midtown West""","""Americano""","""2%""",20,false,"""Erica""",4.83
…,…,…,…,…,…,…,…
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87


In [18]:
coffees.filter(pl.col("coffee_name").is_not_null())

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85
2025-01-01 15:07:27,"""SoHo""","""Flat White""","""Oat""",12,false,"""Danielle""",3.74
2025-01-01 16:31:26,"""Midtown West""","""Americano""","""2%""",20,false,"""Erica""",4.83
…,…,…,…,…,…,…,…
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.null_count.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.ne_missing.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.eq_missing.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_null.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_not_null.html

## Filtering with Boolean Columns
- Polars prints its Booleans in lowercase (`true`, `false`).
- Python's Booleans have a capital first letter (`True`, `False`).
- We _could_ compare every value in a Boolean column to `True`.
- But a cleaner approach is to pass the Boolean column directly to the `filter` method.

In [19]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True).sort("time_of_purchase")
coffees.head()

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,True,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,True,"""Christopher""",4.29
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,False,"""Jeffrey""",5.85
2025-01-01 10:02:41,"""Midtown West""",,"""Soy""",12,False,"""Jeffrey""",5.05
2025-01-01 15:07:27,"""SoHo""","""Flat White""","""Oat""",12,False,"""Danielle""",3.74


In [20]:
coffees.filter(pl.col("iced") == True)
coffees.filter(pl.col("iced").eq(True))

coffees.filter(pl.col("iced"))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29
2025-01-02 01:50:26,"""Uptown""","""Flat White""","""Oat""",12,true,"""Joshua""",2.6
2025-01-02 03:28:39,"""East Village""","""Espresso""","""2%""",20,true,"""Joshua""",4.79
2025-01-02 06:17:11,"""SoHo""","""Mocha""","""Coconut""",20,true,"""Joshua""",5.67
…,…,…,…,…,…,…,…
2025-06-05 07:39:21,"""Midtown West""","""Americano""","""Coconut""",12,true,"""Anthony""",4.79
2025-06-06 10:01:56,"""Uptown""","""Americano""","""2%""",16,true,"""Jill""",5.59
2025-06-06 19:12:31,"""SoHo""","""Latte""","""Coconut""",16,true,"""Erica""",4.96
2025-06-06 22:28:52,"""Uptown""","""Cappuccino""","""Soy""",16,true,"""Danielle""",3.33


## Applying AND Logic
- The `&` operator applies AND logic between two Boolean expressions.
- Both conditions must be met for the evaluation to be true.
- The AND (`&`) logic is called "conjunction" in mathematics.

In [21]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True).sort("time_of_purchase")
coffees.head(2)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,True,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,True,"""Christopher""",4.29


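- Wrap each comparison in parentheses. Python's `&` operator binds more tightly than comparison operators such as `==` and `>`, so without parentheses the expression is grouped incorrectly.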
- The following example extracts rows with a `milk_type` of "Soy" and a `size` greater than 18.

In [22]:
coffees.filter((pl.col("milk_type") == "Soy") & (pl.col("size") > 18))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-04 18:19:57,"""Midtown West""","""Espresso""","""Soy""",20,false,"""Patricia""",5.79
2025-01-04 18:46:05,"""Uptown""","""Espresso""","""Soy""",20,false,"""Joshua""",2.76
2025-01-10 19:45:31,"""East Village""","""Americano""","""Soy""",20,false,"""Jill""",4.87
2025-01-14 18:13:09,"""SoHo""","""Flat White""","""Soy""",20,true,"""Christopher""",2.75
…,…,…,…,…,…,…,…
2025-05-20 16:47:12,"""Downtown""","""Macchiato""","""Soy""",20,true,"""Jeffrey""",5.93
2025-05-25 20:18:39,"""Midtown West""","""Latte""","""Soy""",20,true,"""Jeffrey""",6.3
2025-05-26 12:49:07,"""Downtown""","""Espresso""","""Soy""",20,true,"""Christopher""",6.49
2025-05-29 17:02:49,"""Downtown""","""Americano""","""Soy""",20,true,"""Christopher""",4.16


- Comparison methods such as `eq` and `gt` make the extra parentheses unnecessary.
- Alternatively, assign the expressions to variables.

In [23]:
coffees.filter(pl.col("milk_type").eq("Soy") & pl.col("size").gt(18))

with_soy_milk = pl.col("milk_type") == "Soy"
with_size_greater_than_18 = pl.col("size") > 18
coffees.filter(with_soy_milk & with_size_greater_than_18)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-04 18:19:57,"""Midtown West""","""Espresso""","""Soy""",20,false,"""Patricia""",5.79
2025-01-04 18:46:05,"""Uptown""","""Espresso""","""Soy""",20,false,"""Joshua""",2.76
2025-01-10 19:45:31,"""East Village""","""Americano""","""Soy""",20,false,"""Jill""",4.87
2025-01-14 18:13:09,"""SoHo""","""Flat White""","""Soy""",20,true,"""Christopher""",2.75
…,…,…,…,…,…,…,…
2025-05-20 16:47:12,"""Downtown""","""Macchiato""","""Soy""",20,true,"""Jeffrey""",5.93
2025-05-25 20:18:39,"""Midtown West""","""Latte""","""Soy""",20,true,"""Jeffrey""",6.3
2025-05-26 12:49:07,"""Downtown""","""Espresso""","""Soy""",20,true,"""Christopher""",6.49
2025-05-29 17:02:49,"""Downtown""","""Americano""","""Soy""",20,true,"""Christopher""",4.16


- We can combine any number of expressions with `&`.

In [24]:
coffees.filter((pl.col("milk_type") == "Soy") & (pl.col("size") > 18) & pl.col("iced"))

coffees.filter(pl.col("milk_type").eq("Soy") & pl.col("size").gt(18) & pl.col("iced"))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-14 18:13:09,"""SoHo""","""Flat White""","""Soy""",20,true,"""Christopher""",2.75
2025-01-21 15:54:37,"""Downtown""","""Cappuccino""","""Soy""",20,true,"""Erica""",6.16
2025-02-17 16:05:25,"""Uptown""","""Latte""","""Soy""",20,true,"""Jill""",2.59
2025-02-25 08:08:15,"""SoHo""","""Flat White""","""Soy""",20,true,"""Danielle""",6.31
…,…,…,…,…,…,…,…
2025-05-20 16:47:12,"""Downtown""","""Macchiato""","""Soy""",20,true,"""Jeffrey""",5.93
2025-05-25 20:18:39,"""Midtown West""","""Latte""","""Soy""",20,true,"""Jeffrey""",6.3
2025-05-26 12:49:07,"""Downtown""","""Espresso""","""Soy""",20,true,"""Christopher""",6.49
2025-05-29 17:02:49,"""Downtown""","""Americano""","""Soy""",20,true,"""Christopher""",4.16


- As an alternative to the `&` symbol, we can also pass sequential Boolean expressions to the `filter` method.

In [25]:
coffees.filter(pl.col("milk_type").eq("Soy"), pl.col("size").gt(18), pl.col("iced"))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-14 18:13:09,"""SoHo""","""Flat White""","""Soy""",20,true,"""Christopher""",2.75
2025-01-21 15:54:37,"""Downtown""","""Cappuccino""","""Soy""",20,true,"""Erica""",6.16
2025-02-17 16:05:25,"""Uptown""","""Latte""","""Soy""",20,true,"""Jill""",2.59
2025-02-25 08:08:15,"""SoHo""","""Flat White""","""Soy""",20,true,"""Danielle""",6.31
…,…,…,…,…,…,…,…
2025-05-20 16:47:12,"""Downtown""","""Macchiato""","""Soy""",20,true,"""Jeffrey""",5.93
2025-05-25 20:18:39,"""Midtown West""","""Latte""","""Soy""",20,true,"""Jeffrey""",6.3
2025-05-26 12:49:07,"""Downtown""","""Espresso""","""Soy""",20,true,"""Christopher""",6.49
2025-05-29 17:02:49,"""Downtown""","""Americano""","""Soy""",20,true,"""Christopher""",4.16


- The `and_` method is equivalent to the `&` operator.
- The trailing underscore in these method names avoids a conflict with Python's reserved keywords (`and`, `or`, `not`).

In [26]:
coffees.filter(
    pl.col("milk_type").eq("Soy").and_(pl.col("size").gt(18)).and_(pl.col("iced"))
)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-14 18:13:09,"""SoHo""","""Flat White""","""Soy""",20,true,"""Christopher""",2.75
2025-01-21 15:54:37,"""Downtown""","""Cappuccino""","""Soy""",20,true,"""Erica""",6.16
2025-02-17 16:05:25,"""Uptown""","""Latte""","""Soy""",20,true,"""Jill""",2.59
2025-02-25 08:08:15,"""SoHo""","""Flat White""","""Soy""",20,true,"""Danielle""",6.31
…,…,…,…,…,…,…,…
2025-05-20 16:47:12,"""Downtown""","""Macchiato""","""Soy""",20,true,"""Jeffrey""",5.93
2025-05-25 20:18:39,"""Midtown West""","""Latte""","""Soy""",20,true,"""Jeffrey""",6.3
2025-05-26 12:49:07,"""Downtown""","""Espresso""","""Soy""",20,true,"""Christopher""",6.49
2025-05-29 17:02:49,"""Downtown""","""Americano""","""Soy""",20,true,"""Christopher""",4.16


- These methods on expressions produce new expressions.

In [27]:
pl.col("milk_type").eq("Soy").and_(pl.col("size").gt(18)).and_(pl.col("iced"))

In [28]:
type(pl.col("milk_type").eq("Soy").and_(pl.col("size").gt(18)).and_(pl.col("iced")))

polars.expr.expr.Expr

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.and_.html

## Keyword Argument Filtering
- The `filter` method supports keyword arguments.
- Pass the column name as the parameter and the comparison value as the argument.
- Keyword argument filtering only works for equality comparisons.
- The Polars team generally advises against this filtering approach. Use expressions.

In [29]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True).sort("time_of_purchase")
coffees.head(2)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,True,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,True,"""Christopher""",4.29


In [30]:
coffees.filter(location="Uptown")
coffees.filter(pl.col("location") == "Uptown")

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29
2025-01-02 01:50:26,"""Uptown""","""Flat White""","""Oat""",12,true,"""Joshua""",2.6
2025-01-02 12:27:29,"""Uptown""","""Macchiato""","""Coconut""",16,true,"""Joshua""",3.89
2025-01-02 23:18:27,"""Uptown""","""Americano""","""Coconut""",20,false,"""Anthony""",4.08
2025-01-04 08:21:33,"""Uptown""","""Americano""","""Oat""",16,true,"""Joshua""",4.52
…,…,…,…,…,…,…,…
2025-06-02 08:35:29,"""Uptown""","""Espresso""","""Oat""",16,false,"""Angel""",3.81
2025-06-04 10:34:43,"""Uptown""","""Espresso""","""Soy""",20,true,"""Jeffrey""",5.73
2025-06-06 10:01:56,"""Uptown""","""Americano""","""2%""",16,true,"""Jill""",5.59
2025-06-06 22:28:52,"""Uptown""","""Cappuccino""","""Soy""",16,true,"""Danielle""",3.33


- The following example filters rows that have a `location` of "Uptown" and a `barista` of "Danielle".

In [31]:
coffees.filter(location="Uptown", barista="Danielle")

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-08 19:36:59,"""Uptown""","""Espresso""","""Soy""",12,false,"""Danielle""",4.57
2025-01-23 05:45:42,"""Uptown""","""Americano""","""Coconut""",20,false,"""Danielle""",2.77
2025-01-23 09:31:53,"""Uptown""","""Mocha""","""Oat""",12,false,"""Danielle""",3.33
2025-01-26 17:17:32,"""Uptown""","""Mocha""","""Whole""",20,false,"""Danielle""",6.3
2025-02-07 06:02:22,"""Uptown""","""Mocha""","""Oat""",16,false,"""Danielle""",3.69
…,…,…,…,…,…,…,…
2025-05-10 21:39:44,"""Uptown""","""Americano""","""2%""",16,true,"""Danielle""",5.04
2025-05-20 16:03:34,"""Uptown""","""Americano""","""2%""",12,false,"""Danielle""",2.66
2025-05-22 20:24:05,"""Uptown""","""Flat White""","""Almond""",20,true,"""Danielle""",5.45
2025-05-28 19:35:15,"""Uptown""","""Flat White""","""Almond""",20,true,"""Danielle""",4.98


- Boolean columns require an explicit `True` argument.
- The following example adds an additional criterion of `True` in the `iced` column.

In [32]:
coffees.filter(location="Uptown", barista="Danielle", iced=True)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-04-16 06:48:46,"""Uptown""","""Americano""","""Soy""",20,True,"""Danielle""",4.74
2025-05-07 04:40:51,"""Uptown""","""Flat White""","""Oat""",12,True,"""Danielle""",5.15
2025-05-10 21:39:44,"""Uptown""","""Americano""","""2%""",16,True,"""Danielle""",5.04
2025-05-22 20:24:05,"""Uptown""","""Flat White""","""Almond""",20,True,"""Danielle""",5.45
2025-05-28 19:35:15,"""Uptown""","""Flat White""","""Almond""",20,True,"""Danielle""",4.98
2025-06-06 22:28:52,"""Uptown""","""Cappuccino""","""Soy""",16,True,"""Danielle""",3.33


## Applying OR Logic
- The `|` operator applies OR logic between two Boolean expressions.
- At least one condition must be true for the final evaluation to be true.
- The OR (`|`) logic is called "disjunction" in mathematics.

In [33]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True).sort("time_of_purchase")
coffees.head(2)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,True,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,True,"""Christopher""",4.29


- The next example filters rows where `price` is less than or equal to 3 or `location` is equal to "Downtown".

In [34]:
coffees.filter((pl.col("price") <= 3) | (pl.col("location") == "Downtown"))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85
2025-01-02 01:50:26,"""Uptown""","""Flat White""","""Oat""",12,true,"""Joshua""",2.6
2025-01-02 21:03:50,"""Downtown""","""Latte""","""Soy""",16,false,"""Anthony""",2.53
2025-01-03 05:30:19,"""SoHo""","""Macchiato""","""Oat""",12,true,"""Christopher""",2.9
2025-01-04 18:46:05,"""Uptown""","""Espresso""","""Soy""",20,false,"""Joshua""",2.76
…,…,…,…,…,…,…,…
2025-06-05 18:41:12,"""Downtown""","""Macchiato""","""2%""",12,false,"""Robert""",5.49
2025-06-06 07:47:07,"""Downtown""","""Americano""","""Coconut""",12,false,"""Jeffrey""",2.8
2025-06-06 21:39:23,"""Downtown""","""Macchiato""","""Whole""",12,false,"""Robert""",6.26
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49


- The `le` method is equivalent to the `<=` symbol.
- Methods allow for the omission of parentheses.

In [35]:
coffees.filter(pl.col("price").le(3) | pl.col("location").eq("Downtown"))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85
2025-01-02 01:50:26,"""Uptown""","""Flat White""","""Oat""",12,true,"""Joshua""",2.6
2025-01-02 21:03:50,"""Downtown""","""Latte""","""Soy""",16,false,"""Anthony""",2.53
2025-01-03 05:30:19,"""SoHo""","""Macchiato""","""Oat""",12,true,"""Christopher""",2.9
2025-01-04 18:46:05,"""Uptown""","""Espresso""","""Soy""",20,false,"""Joshua""",2.76
…,…,…,…,…,…,…,…
2025-06-05 18:41:12,"""Downtown""","""Macchiato""","""2%""",12,false,"""Robert""",5.49
2025-06-06 07:47:07,"""Downtown""","""Americano""","""Coconut""",12,false,"""Jeffrey""",2.8
2025-06-06 21:39:23,"""Downtown""","""Macchiato""","""Whole""",12,false,"""Robert""",6.26
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49


- The `or_` method is the equivalent of the `|` operator.
- We build up a complex expression that captures the same either/or logic.
- The Polars team chose this awkward name intentionally to avoid conflict/confusion with Python's `or` keyword.

In [36]:
coffees.filter(pl.col("price").le(3).or_(pl.col("location").eq("Downtown")))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85
2025-01-02 01:50:26,"""Uptown""","""Flat White""","""Oat""",12,true,"""Joshua""",2.6
2025-01-02 21:03:50,"""Downtown""","""Latte""","""Soy""",16,false,"""Anthony""",2.53
2025-01-03 05:30:19,"""SoHo""","""Macchiato""","""Oat""",12,true,"""Christopher""",2.9
2025-01-04 18:46:05,"""Uptown""","""Espresso""","""Soy""",20,false,"""Joshua""",2.76
…,…,…,…,…,…,…,…
2025-06-05 18:41:12,"""Downtown""","""Macchiato""","""2%""",12,false,"""Robert""",5.49
2025-06-06 07:47:07,"""Downtown""","""Americano""","""Coconut""",12,false,"""Jeffrey""",2.8
2025-06-06 21:39:23,"""Downtown""","""Macchiato""","""Whole""",12,false,"""Robert""",6.26
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49


- To simplify code, assign expressions to descriptive variables.

In [37]:
cheap_coffee = pl.col("price") <= 3
in_downtown = pl.col("location") == "Downtown"
coffees.filter(cheap_coffee | in_downtown)
coffees.filter(cheap_coffee.or_(in_downtown))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85
2025-01-02 01:50:26,"""Uptown""","""Flat White""","""Oat""",12,true,"""Joshua""",2.6
2025-01-02 21:03:50,"""Downtown""","""Latte""","""Soy""",16,false,"""Anthony""",2.53
2025-01-03 05:30:19,"""SoHo""","""Macchiato""","""Oat""",12,true,"""Christopher""",2.9
2025-01-04 18:46:05,"""Uptown""","""Espresso""","""Soy""",20,false,"""Joshua""",2.76
…,…,…,…,…,…,…,…
2025-06-05 18:41:12,"""Downtown""","""Macchiato""","""2%""",12,false,"""Robert""",5.49
2025-06-06 07:47:07,"""Downtown""","""Americano""","""Coconut""",12,false,"""Jeffrey""",2.8
2025-06-06 21:39:23,"""Downtown""","""Macchiato""","""Whole""",12,false,"""Robert""",6.26
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.or_.html

## Operator Precedence
- Python evaluates the AND (`&`) operator before the OR (`|`) operator, and Polars expressions inherit that precedence.

In [38]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True).sort("time_of_purchase")
coffees.head(2)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,True,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,True,"""Christopher""",4.29


- Let's start by defining some Boolean expressions.

In [39]:
is_almond_milk = pl.col("milk_type").eq("Almond")
is_iced = pl.col("iced")
is_cheap = pl.col("price").le(4)

- How does Polars interpret the filter below?
- Option A ❌: Extract rows that are (almond milk OR iced) AND cheap
- Option B ✅: Extract rows that are almond milk OR (iced AND cheap)
- The `&` wins out, so Polars groups `is_iced & is_cheap` together.

In [40]:
coffees.filter(is_almond_milk | is_iced & is_cheap)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85
2025-01-02 01:50:26,"""Uptown""","""Flat White""","""Oat""",12,true,"""Joshua""",2.6
2025-01-02 12:27:29,"""Uptown""","""Macchiato""","""Coconut""",16,true,"""Joshua""",3.89
2025-01-03 05:30:19,"""SoHo""","""Macchiato""","""Oat""",12,true,"""Christopher""",2.9
…,…,…,…,…,…,…,…
2025-06-06 22:28:52,"""Uptown""","""Cappuccino""","""Soy""",16,true,"""Danielle""",3.33
2025-06-07 04:13:09,"""Midtown West""","""Espresso""","""Almond""",12,true,"""Joshua""",4.54
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87


- Use parentheses to group together the related expressions.
- Use parentheses even when the default precedence already gives the intended result; they make the grouping explicit.

In [41]:
coffees.filter(is_almond_milk | (is_iced & is_cheap))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85
2025-01-02 01:50:26,"""Uptown""","""Flat White""","""Oat""",12,true,"""Joshua""",2.6
2025-01-02 12:27:29,"""Uptown""","""Macchiato""","""Coconut""",16,true,"""Joshua""",3.89
2025-01-03 05:30:19,"""SoHo""","""Macchiato""","""Oat""",12,true,"""Christopher""",2.9
…,…,…,…,…,…,…,…
2025-06-06 22:28:52,"""Uptown""","""Cappuccino""","""Soy""",16,true,"""Danielle""",3.33
2025-06-07 04:13:09,"""Midtown West""","""Espresso""","""Almond""",12,true,"""Joshua""",4.54
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87


- The placement of parentheses switches up the result.
- The next example extracts rows that are (almond milk OR iced) AND cheap.

In [42]:
coffees.filter((is_almond_milk | is_iced) & is_cheap)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-02 01:50:26,"""Uptown""","""Flat White""","""Oat""",12,true,"""Joshua""",2.6
2025-01-02 12:27:29,"""Uptown""","""Macchiato""","""Coconut""",16,true,"""Joshua""",3.89
2025-01-03 05:30:19,"""SoHo""","""Macchiato""","""Oat""",12,true,"""Christopher""",2.9
2025-01-03 10:57:47,"""Midtown West""","""Flat White""","""Soy""",12,true,"""Anthony""",3.69
…,…,…,…,…,…,…,…
2025-06-04 14:11:55,"""Downtown""","""Espresso""","""Coconut""",20,true,"""Joshua""",2.98
2025-06-04 14:41:15,"""East Village""","""Macchiato""","""Coconut""",12,true,"""Christopher""",2.86
2025-06-06 22:28:52,"""Uptown""","""Cappuccino""","""Soy""",16,true,"""Danielle""",3.33
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87


## Exclusive OR
- The exclusive OR operator (`^`) requires that exactly one condition be met.
- If one condition is true, the other condition must be false.
- If both conditions are true, the `^` expression evaluates to false.
- The method equivalent is `xor`.

In [43]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True).sort("time_of_purchase")
coffees.head(2)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,True,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,True,"""Christopher""",4.29


- Let's filter for the rows with either a location of "Uptown" or a milk of "Oat".
- Polars will _exclude_ a row if it has both a location of "Uptown" and a milk of "Oat".

In [44]:
uptown = pl.col("location") == "Uptown"
oat_milk = pl.col("milk_type") == "Oat"

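coffees.filter(uptown | oat_milk)  # inclusive OR: also keeps Uptown rows with Oat milk
coffees.filter(uptown ^ oat_milk)  # exclusive OR: drops rows that match both
coffees.filter(uptown.xor(oat_milk))  # method form of ^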

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 15:07:27,"""SoHo""","""Flat White""","""Oat""",12,false,"""Danielle""",3.74
2025-01-02 12:27:29,"""Uptown""","""Macchiato""","""Coconut""",16,true,"""Joshua""",3.89
2025-01-02 23:18:27,"""Uptown""","""Americano""","""Coconut""",20,false,"""Anthony""",4.08
2025-01-03 05:30:19,"""SoHo""","""Macchiato""","""Oat""",12,true,"""Christopher""",2.9
2025-01-04 18:46:05,"""Uptown""","""Espresso""","""Soy""",20,false,"""Joshua""",2.76
…,…,…,…,…,…,…,…
2025-06-06 14:36:47,"""SoHo""","""Flat White""","""Oat""",20,false,"""Danielle""",3.47
2025-06-06 14:51:19,"""East Village""","""Flat White""","""Oat""",20,false,"""Erica""",4.75
2025-06-06 22:28:52,"""Uptown""","""Cappuccino""","""Soy""",16,true,"""Danielle""",3.33
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.xor.html

## Filtering for Unique and Duplicate Values

In [45]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True).sort("time_of_purchase")
coffees.head(2)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,True,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,True,"""Christopher""",4.29


- The `is_unique` method identifies if a row value is unique (only occurs once in the column).
- The `is_duplicated` method identifies if a row value is a duplicate. Every occurrence of a repeated value counts as a duplicate, including the first.

In [46]:
coffees.filter(pl.col("coffee_name").is_unique())

coffees.select(pl.col("coffee_name").value_counts(sort=True))

coffee_name
struct[2]
"{""Flat White"",155}"
"{""Macchiato"",152}"
"{""Espresso"",151}"
"{""Latte"",139}"
"{""Mocha"",134}"
…
"{null,6}"
"{""Frappe"",1}"
"{""Affogato"",1}"
"{""Red eye"",1}"


In [47]:
coffees.filter(pl.col("coffee_name").is_duplicated())

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85
2025-01-01 10:02:41,"""Midtown West""",,"""Soy""",12,false,"""Jeffrey""",5.05
2025-01-01 15:07:27,"""SoHo""","""Flat White""","""Oat""",12,false,"""Danielle""",3.74
…,…,…,…,…,…,…,…
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87


- The `is_first_distinct` method identifies if a row holds the first occurrence of a value, regardless of whether it is a duplicate or not.
- The `is_last_distinct` method identifies if a row holds the last occurrence of a value, regardless of whether it is a duplicate or not.

In [48]:
coffees.filter(pl.col("coffee_name").is_first_distinct())

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85
2025-01-01 10:02:41,"""Midtown West""",,"""Soy""",12,false,"""Jeffrey""",5.05
2025-01-01 16:31:26,"""Midtown West""","""Americano""","""2%""",20,false,"""Erica""",4.83
…,…,…,…,…,…,…,…
2025-01-05 14:17:03,"""Downtown""","""Cappuccino""","""Oat""",12,false,"""Erica""",3.71
2025-01-27 01:50:31,"""SoHo""","""Frappe""","""Soy""",16,true,"""Jeffrey""",4.3
2025-02-07 09:30:28,"""East Village""","""Affogato""","""2%""",16,false,"""Christopher""",2.98
2025-04-15 13:22:05,"""Uptown""","""Red eye""","""2%""",12,false,"""Anthony""",4.71


In [49]:
coffees.filter(pl.col("coffee_name").is_last_distinct())

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-27 01:50:31,"""SoHo""","""Frappe""","""Soy""",16,true,"""Jeffrey""",4.3
2025-02-07 09:30:28,"""East Village""","""Affogato""","""2%""",16,false,"""Christopher""",2.98
2025-04-15 13:22:05,"""Uptown""","""Red eye""","""2%""",12,false,"""Anthony""",4.71
2025-04-26 01:07:54,"""Midtown West""","""Cortado""","""Whole""",20,true,"""Joshua""",5.58
2025-05-10 16:28:05,"""SoHo""",,"""2%""",20,false,"""Erica""",4.4
…,…,…,…,…,…,…,…
2025-06-06 22:28:52,"""Uptown""","""Cappuccino""","""Soy""",16,true,"""Danielle""",3.33
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_unique.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_duplicated.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_first_distinct.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_last_distinct.html

## Filtering with Datetimes
- The `filter` method supports filtering date, time, and datetime columns.

In [50]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True).sort("time_of_purchase")
coffees.head(2)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,True,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,True,"""Christopher""",4.29


- Let's find all the coffees purchased on or after June 1st, 2025.
- Polars is stricter than pandas. It will not compare a datetime column with a date string.

In [51]:
# coffees.filter(pl.col("time_of_purchase") >= "2025-06-01")

- To filter datetimes, we need to provide a valid date/datetime object from Python or Polars.
- Polars has a `date` function that accepts the same arguments as Python's `dt.date` (year, month, and day).
- The `date` function returns an expression representing a date.
- `pl.date` is different from `pl.Date` which is the `Date` type.
- When using a date on a datetime column, Polars assumes a time of midnight (00:00:00).

In [52]:
import datetime as dt

In [53]:
coffees.filter(pl.col("time_of_purchase") >= dt.date(2025, 6, 1))
coffees.filter(pl.col("time_of_purchase").ge(dt.date(2025, 6, 1)))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-06-01 00:13:38,"""Midtown West""","""Mocha""","""2%""",12,false,"""Patricia""",3.03
2025-06-01 00:46:22,"""SoHo""","""Macchiato""","""Oat""",12,false,"""Danielle""",3.67
2025-06-01 03:42:07,"""Midtown West""","""Flat White""","""Oat""",16,false,"""Erica""",6.21
2025-06-01 04:39:44,"""Downtown""","""Mocha""","""Oat""",20,false,"""Erica""",5.66
2025-06-01 11:36:18,"""Midtown West""","""Cappuccino""","""Whole""",12,true,"""Angel""",4.51
…,…,…,…,…,…,…,…
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87


In [54]:
type(pl.date(2025, 6, 1))

polars.expr.expr.Expr

In [55]:
coffees.filter(pl.col("time_of_purchase") >= pl.date(2025, 6, 1))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-06-01 00:13:38,"""Midtown West""","""Mocha""","""2%""",12,false,"""Patricia""",3.03
2025-06-01 00:46:22,"""SoHo""","""Macchiato""","""Oat""",12,false,"""Danielle""",3.67
2025-06-01 03:42:07,"""Midtown West""","""Flat White""","""Oat""",16,false,"""Erica""",6.21
2025-06-01 04:39:44,"""Downtown""","""Mocha""","""Oat""",20,false,"""Erica""",5.66
2025-06-01 11:36:18,"""Midtown West""","""Cappuccino""","""Whole""",12,true,"""Angel""",4.51
…,…,…,…,…,…,…,…
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87


- The `dt.datetime` constructor accepts the hours, minutes, and seconds after the year, month, and day.
- Here, we filter for the coffees purchased at or after 12:30:45 PM on June 1st, 2025.
- Polars has a complementary `pl.datetime` constructor.

In [56]:
coffees.filter(pl.col("time_of_purchase") >= dt.datetime(2025, 6, 1, 12, 30, 45))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-06-01 16:48:35,"""SoHo""","""Cappuccino""","""Almond""",20,false,"""Anthony""",4.68
2025-06-01 22:11:07,"""Downtown""","""Macchiato""","""Whole""",16,false,"""Joshua""",2.86
2025-06-01 23:20:42,"""Downtown""","""Latte""","""Almond""",16,true,"""Christopher""",4.86
2025-06-02 02:43:45,"""East Village""","""Mocha""","""Soy""",16,true,"""Patricia""",5.81
2025-06-02 06:02:01,"""Midtown West""","""Espresso""","""Coconut""",20,true,"""Danielle""",6.13
…,…,…,…,…,…,…,…
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87


In [57]:
coffees.filter(pl.col("time_of_purchase") >= pl.datetime(2025, 6, 1, 12, 30, 45))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-06-01 16:48:35,"""SoHo""","""Cappuccino""","""Almond""",20,false,"""Anthony""",4.68
2025-06-01 22:11:07,"""Downtown""","""Macchiato""","""Whole""",16,false,"""Joshua""",2.86
2025-06-01 23:20:42,"""Downtown""","""Latte""","""Almond""",16,true,"""Christopher""",4.86
2025-06-02 02:43:45,"""East Village""","""Mocha""","""Soy""",16,true,"""Patricia""",5.81
2025-06-02 06:02:01,"""Midtown West""","""Espresso""","""Coconut""",20,true,"""Danielle""",6.13
…,…,…,…,…,…,…,…
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.date.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.datetime.html

## The is_between Method

In [58]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True).sort("time_of_purchase")
coffees.head(2)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,True,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,True,"""Christopher""",4.29


- Let's say that we're targeting all transactions with a price between 3 and 4 dollars.
- When checking for inclusion in a range, one option is to declare each bound as a separate condition.

In [59]:
coffees.filter(pl.col("price").ge(3) & pl.col("price").le(4))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-01 15:07:27,"""SoHo""","""Flat White""","""Oat""",12,false,"""Danielle""",3.74
2025-01-02 12:27:29,"""Uptown""","""Macchiato""","""Coconut""",16,true,"""Joshua""",3.89
2025-01-03 10:57:47,"""Midtown West""","""Flat White""","""Soy""",12,true,"""Anthony""",3.69
2025-01-05 14:17:03,"""Downtown""","""Cappuccino""","""Oat""",12,false,"""Erica""",3.71
…,…,…,…,…,…,…,…
2025-06-05 11:01:53,"""Downtown""","""Latte""","""Oat""",20,false,"""Anthony""",3.56
2025-06-06 03:25:24,"""Midtown West""","""Latte""","""Soy""",12,false,"""Robert""",3.07
2025-06-06 14:36:47,"""SoHo""","""Flat White""","""Oat""",20,false,"""Danielle""",3.47
2025-06-06 22:28:52,"""Uptown""","""Cappuccino""","""Soy""",16,true,"""Danielle""",3.33


- The `is_between` method returns a Boolean expression that indicates whether a row value falls within a range.
- Both the lower bound and upper bound are inclusive.

In [60]:
coffees.filter(pl.col("price").is_between(3, 4))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-01 15:07:27,"""SoHo""","""Flat White""","""Oat""",12,false,"""Danielle""",3.74
2025-01-02 12:27:29,"""Uptown""","""Macchiato""","""Coconut""",16,true,"""Joshua""",3.89
2025-01-03 10:57:47,"""Midtown West""","""Flat White""","""Soy""",12,true,"""Anthony""",3.69
2025-01-05 14:17:03,"""Downtown""","""Cappuccino""","""Oat""",12,false,"""Erica""",3.71
…,…,…,…,…,…,…,…
2025-06-05 11:01:53,"""Downtown""","""Latte""","""Oat""",20,false,"""Anthony""",3.56
2025-06-06 03:25:24,"""Midtown West""","""Latte""","""Soy""",12,false,"""Robert""",3.07
2025-06-06 14:36:47,"""SoHo""","""Flat White""","""Oat""",20,false,"""Danielle""",3.47
2025-06-06 22:28:52,"""Uptown""","""Cappuccino""","""Soy""",16,true,"""Danielle""",3.33


In [61]:
coffees.filter(pl.col("size").is_between(12, 16))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85
2025-01-01 10:02:41,"""Midtown West""",,"""Soy""",12,false,"""Jeffrey""",5.05
2025-01-01 15:07:27,"""SoHo""","""Flat White""","""Oat""",12,false,"""Danielle""",3.74
2025-01-02 01:50:26,"""Uptown""","""Flat White""","""Oat""",12,true,"""Joshua""",2.6
2025-01-02 09:30:42,"""East Village""","""Macchiato""","""Whole""",12,false,"""Robert""",6.24
…,…,…,…,…,…,…,…
2025-06-06 22:28:52,"""Uptown""","""Cappuccino""","""Soy""",16,true,"""Danielle""",3.33
2025-06-07 04:13:09,"""Midtown West""","""Espresso""","""Almond""",12,true,"""Joshua""",4.54
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17


In [62]:
coffees.filter(
    pl.col("time_of_purchase").is_between(pl.date(2025, 2, 1), pl.date(2025, 3, 1))
)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-02-01 00:03:58,"""Midtown West""","""Latte""","""Whole""",20,false,"""Patricia""",3.13
2025-02-01 01:34:12,"""Downtown""","""Flat White""","""Soy""",20,false,"""Christopher""",4.15
2025-02-01 08:52:23,"""East Village""","""Macchiato""","""Almond""",12,false,"""Erica""",5.6
2025-02-01 09:01:35,"""Uptown""","""Mocha""","""Whole""",12,false,"""Angel""",5.12
2025-02-01 13:18:47,"""Midtown West""","""Cappuccino""","""Almond""",12,false,"""Patricia""",4.47
…,…,…,…,…,…,…,…
2025-02-28 06:05:15,"""East Village""","""Latte""","""Whole""",16,false,"""Joshua""",4.76
2025-02-28 07:36:11,"""Midtown West""","""Espresso""","""Whole""",16,true,"""Anthony""",3.68
2025-02-28 10:31:41,"""Downtown""","""Macchiato""","""Oat""",16,true,"""Christopher""",3.29
2025-02-28 13:28:37,"""SoHo""","""Cappuccino""","""Coconut""",20,true,"""Patricia""",5.38


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_between.html

## The is_in Method
- The `is_in` method checks whether each row value appears in a collection of values.

In [63]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True).sort("time_of_purchase")
coffees.head(2)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,True,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,True,"""Christopher""",4.29


- Say we want to extract rows where the `coffee_name` is either "Americano" or "Espresso".
- We can apply the `|` operator or the `or_` method to combine multiple independent expressions.

In [64]:
coffees.filter(
    pl.col("coffee_name").eq("Americano") | pl.col("coffee_name").eq("Espresso")
)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 16:31:26,"""Midtown West""","""Americano""","""2%""",20,false,"""Erica""",4.83
2025-01-02 03:28:39,"""East Village""","""Espresso""","""2%""",20,true,"""Joshua""",4.79
2025-01-02 23:18:27,"""Uptown""","""Americano""","""Coconut""",20,false,"""Anthony""",4.08
2025-01-03 20:46:35,"""Midtown West""","""Espresso""","""Coconut""",20,false,"""Anthony""",4.81
2025-01-04 08:21:33,"""Uptown""","""Americano""","""Oat""",16,true,"""Joshua""",4.52
…,…,…,…,…,…,…,…
2025-06-06 17:29:25,"""Midtown West""","""Americano""","""2%""",12,false,"""Jeffrey""",5.96
2025-06-07 04:13:09,"""Midtown West""","""Espresso""","""Almond""",12,true,"""Joshua""",4.54
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17


- The `is_in` method accepts a collection (list, tuple, etc.) or another expression.

In [65]:
coffees.filter(pl.col("coffee_name").is_in(["Americano", "Espresso", "Latte"]))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-01 16:31:26,"""Midtown West""","""Americano""","""2%""",20,false,"""Erica""",4.83
2025-01-02 03:28:39,"""East Village""","""Espresso""","""2%""",20,true,"""Joshua""",4.79
2025-01-02 19:24:47,"""Midtown West""","""Latte""","""Soy""",12,false,"""Jill""",5.69
2025-01-02 21:03:50,"""Downtown""","""Latte""","""Soy""",16,false,"""Anthony""",2.53
…,…,…,…,…,…,…,…
2025-06-07 04:13:09,"""Midtown West""","""Espresso""","""Almond""",12,true,"""Joshua""",4.54
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.is_in.html

## The remove Method
- The `filter` method includes rows that satisfy an expression.
- The `remove` method _excludes_ rows that satisfy an expression.
- The `remove` method performs the inverse logic of the `filter` method.

In [66]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True).sort("time_of_purchase")
coffees.head(2)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,True,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,True,"""Christopher""",4.29


- Say we want to extract any rows where the `size` value is not equal to 12.

In [67]:
coffees.filter(pl.col("size") != 12)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29
2025-01-01 16:31:26,"""Midtown West""","""Americano""","""2%""",20,false,"""Erica""",4.83
2025-01-02 03:28:39,"""East Village""","""Espresso""","""2%""",20,true,"""Joshua""",4.79
2025-01-02 06:17:11,"""SoHo""","""Mocha""","""Coconut""",20,true,"""Joshua""",5.67
…,…,…,…,…,…,…,…
2025-06-06 19:12:31,"""SoHo""","""Latte""","""Coconut""",16,true,"""Erica""",4.96
2025-06-06 22:28:52,"""Uptown""","""Cappuccino""","""Soy""",16,true,"""Danielle""",3.33
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17


- Alternatively, we could `remove` any rows where the `size` value is equal to 12.

In [68]:
coffees.remove(pl.col("size") == 12)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29
2025-01-01 16:31:26,"""Midtown West""","""Americano""","""2%""",20,false,"""Erica""",4.83
2025-01-02 03:28:39,"""East Village""","""Espresso""","""2%""",20,true,"""Joshua""",4.79
2025-01-02 06:17:11,"""SoHo""","""Mocha""","""Coconut""",20,true,"""Joshua""",5.67
…,…,…,…,…,…,…,…
2025-06-06 19:12:31,"""SoHo""","""Latte""","""Coconut""",16,true,"""Erica""",4.96
2025-06-06 22:28:52,"""Uptown""","""Cappuccino""","""Soy""",16,true,"""Danielle""",3.33
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17


- We can validate the equality of `DataFrames` with the `equals` method.
- The `equals` method returns `True` if two `DataFrames` are equal (same columns, rows, values, etc).

In [69]:
coffees.filter(pl.col("size") != 12).equals(coffees.remove(pl.col("size") == 12))

True

- If you use `&` or provide multiple arguments, Polars will exclude a row only when all of the conditions are met.
- Polars will remove rows where the `size` is 12 and the `barista` is either "Christopher" or "Erica".
- Polars will keep rows with a size of 12 and a barista who is not Christopher or Erica.
- Polars will keep rows with a barista of Christopher or Erica and a size not equal to 12.
- If you use `|`, Polars will exclude a row when either condition is met. The output below comes from that final call.

In [70]:
coffees.remove(
    (pl.col("size") == 12) & pl.col("barista").is_in(["Christopher", "Erica"])
)

coffees.remove(pl.col("size") == 12, pl.col("barista").is_in(["Christopher", "Erica"]))

coffees.remove(
    (pl.col("size") == 12) | pl.col("barista").is_in(["Christopher", "Erica"])
)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-02 03:28:39,"""East Village""","""Espresso""","""2%""",20,true,"""Joshua""",4.79
2025-01-02 06:17:11,"""SoHo""","""Mocha""","""Coconut""",20,true,"""Joshua""",5.67
2025-01-02 12:27:29,"""Uptown""","""Macchiato""","""Coconut""",16,true,"""Joshua""",3.89
2025-01-02 21:03:50,"""Downtown""","""Latte""","""Soy""",16,false,"""Anthony""",2.53
2025-01-02 23:18:27,"""Uptown""","""Americano""","""Coconut""",20,false,"""Anthony""",4.08
…,…,…,…,…,…,…,…
2025-06-06 15:50:35,"""SoHo""","""Americano""","""2%""",16,false,"""Jeffrey""",4.16
2025-06-06 22:28:52,"""Uptown""","""Cappuccino""","""Soy""",16,true,"""Danielle""",3.33
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.remove.html

## Negation with the Tilde Symbol
- The tilde symbol (`~`) inverts a Boolean expression's values.
- A true becomes a false, and a false becomes a true.
- The method equivalent is `not_`. The trailing underscore avoids a clash with Python's reserved `not` keyword.

In [71]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True).sort("time_of_purchase")
coffees.head(2)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,True,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,True,"""Christopher""",4.29


In [72]:
coffees.select(
    pl.col("barista"),
    pl.col("barista").eq("Christopher").alias("is_chris"),
    ~pl.col("barista").eq("Christopher").alias("is_not_chris"),
)

barista,is_chris,is_not_chris
str,bool,bool
"""Christopher""",true,false
"""Christopher""",true,false
"""Jeffrey""",false,true
"""Jeffrey""",false,true
"""Danielle""",false,true
…,…,…
"""Danielle""",false,true
"""Angel""",false,true
"""Danielle""",false,true
"""Danielle""",false,true


In [73]:
coffees.select(
    pl.col("barista"),
    pl.col("barista").eq("Christopher").alias("is_chris"),
    pl.col("barista").eq("Christopher").not_().alias("is_not_chris"),
)

barista,is_chris,is_not_chris
str,bool,bool
"""Christopher""",true,false
"""Christopher""",true,false
"""Jeffrey""",false,true
"""Jeffrey""",false,true
"""Danielle""",false,true
…,…,…
"""Danielle""",false,true
"""Angel""",false,true
"""Danielle""",false,true
"""Danielle""",false,true


- Let's extract the rows with coffee prices less than `$3` or greater than `$4`.

In [74]:
coffees.filter(pl.col("price").lt(3) | pl.col("price").gt(4))

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85
2025-01-01 10:02:41,"""Midtown West""",,"""Soy""",12,false,"""Jeffrey""",5.05
2025-01-01 16:31:26,"""Midtown West""","""Americano""","""2%""",20,false,"""Erica""",4.83
2025-01-02 01:50:26,"""Uptown""","""Flat White""","""Oat""",12,true,"""Joshua""",2.6
…,…,…,…,…,…,…,…
2025-06-07 04:13:09,"""Midtown West""","""Espresso""","""Almond""",12,true,"""Joshua""",4.54
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17


- Alternatively, we can target all rows with prices that don't fall between `$3` and `$4`.

In [75]:
# Both lines are equivalent; the notebook displays the result of the last one.
coffees.filter(~pl.col("price").is_between(3, 4))
coffees.filter(pl.col("price").is_between(3, 4).not_())

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85
2025-01-01 10:02:41,"""Midtown West""",,"""Soy""",12,false,"""Jeffrey""",5.05
2025-01-01 16:31:26,"""Midtown West""","""Americano""","""2%""",20,false,"""Erica""",4.83
2025-01-02 01:50:26,"""Uptown""","""Flat White""","""Oat""",12,true,"""Joshua""",2.6
…,…,…,…,…,…,…,…
2025-06-07 04:13:09,"""Midtown West""","""Espresso""","""Almond""",12,true,"""Joshua""",4.54
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17


- Use variables to provide context for what an expression represents.

In [76]:
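three_dollar_coffees = pl.col("price").is_between(3, 4)

# Two equivalent ways to negate; the second assignment overwrites the first.
other_priced_coffees = ~three_dollar_coffees
other_priced_coffees = three_dollar_coffees.not_()

coffees.filter(three_dollar_coffees)
coffees.filter(other_priced_coffees)  # only this last result is displayed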

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85
2025-01-01 10:02:41,"""Midtown West""",,"""Soy""",12,false,"""Jeffrey""",5.05
2025-01-01 16:31:26,"""Midtown West""","""Americano""","""2%""",20,false,"""Erica""",4.83
2025-01-02 01:50:26,"""Uptown""","""Flat White""","""Oat""",12,true,"""Joshua""",2.6
…,…,…,…,…,…,…,…
2025-06-07 04:13:09,"""Midtown West""","""Espresso""","""Almond""",12,true,"""Joshua""",4.54
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.not_.html

## When, Then, Otherwise
- The `when` method creates a conditional expression.
- The `then` method assigns a value based on the condition being met.
- The `otherwise` method provides an optional fallback value if the condition is not met.

In [77]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True).sort("time_of_purchase")
coffees.head(2)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,True,"""Christopher""",3.1
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,True,"""Christopher""",4.29


- Say we wanted to provide a description of every coffee based on the milk type.
- If the milk is `"Oat"`, the description should be `"Creamy"`. Otherwise, fall back to `"Delicious"`.
- The `pl.lit` function returns an expression that represents a literal (i.e., fixed, constant) value.
- Without a fallback, Polars assigns a missing value (`null`) to every row that does not match the condition.

In [78]:
coffees.with_columns(
    pl.when(pl.col("milk_type") == "Oat").then(pl.lit("Creamy")).alias("description")
)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price,description
datetime[μs],str,str,str,i64,bool,str,f64,str
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1,
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29,"""Creamy"""
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85,
2025-01-01 10:02:41,"""Midtown West""",,"""Soy""",12,false,"""Jeffrey""",5.05,
2025-01-01 15:07:27,"""SoHo""","""Flat White""","""Oat""",12,false,"""Danielle""",3.74,"""Creamy"""
…,…,…,…,…,…,…,…,…
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49,"""Creamy"""
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82,
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17,
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87,


- Use the `otherwise` method to provide a fallback to use instead of `null`.

In [79]:
coffees.with_columns(
    pl.when(pl.col("milk_type") == "Oat")
    .then(pl.lit("Creamy"))
    .otherwise(pl.lit("Delicious"))
    .alias("description")
)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price,description
datetime[μs],str,str,str,i64,bool,str,f64,str
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1,"""Delicious"""
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29,"""Creamy"""
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85,"""Delicious"""
2025-01-01 10:02:41,"""Midtown West""",,"""Soy""",12,false,"""Jeffrey""",5.05,"""Delicious"""
2025-01-01 15:07:27,"""SoHo""","""Flat White""","""Oat""",12,false,"""Danielle""",3.74,"""Creamy"""
…,…,…,…,…,…,…,…,…
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49,"""Creamy"""
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82,"""Delicious"""
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17,"""Delicious"""
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87,"""Delicious"""


- We can chain multiple invocations of the `when` method in sequence.
- Each `when` method call creates a separate condition to match.
- Polars checks the conditions in order and uses the first one that matches, similar to an `if`/`elif`/`else` chain or a `switch` statement in other programming languages.

In [80]:
coffees.with_columns(
    pl.when(pl.col("milk_type") == "Oat")
    .then(pl.lit("Creamy"))
    .when(pl.col("milk_type") == "Almond")
    .then(pl.lit("Watery"))
    .otherwise(pl.lit("Delicious"))
    .alias("description")
)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price,description
datetime[μs],str,str,str,i64,bool,str,f64,str
2025-01-01 01:53:30,"""Midtown West""","""Latte""","""Soy""",20,true,"""Christopher""",3.1,"""Delicious"""
2025-01-01 02:39:54,"""Uptown""","""Flat White""","""Oat""",20,true,"""Christopher""",4.29,"""Creamy"""
2025-01-01 07:52:54,"""Downtown""","""Mocha""","""Almond""",12,false,"""Jeffrey""",5.85,"""Watery"""
2025-01-01 10:02:41,"""Midtown West""",,"""Soy""",12,false,"""Jeffrey""",5.05,"""Delicious"""
2025-01-01 15:07:27,"""SoHo""","""Flat White""","""Oat""",12,false,"""Danielle""",3.74,"""Creamy"""
…,…,…,…,…,…,…,…,…
2025-06-07 10:21:56,"""Downtown""","""Macchiato""","""Oat""",20,false,"""Danielle""",6.49,"""Creamy"""
2025-06-07 10:28:12,"""East Village""","""Americano""","""Coconut""",12,false,"""Angel""",4.82,"""Delicious"""
2025-06-08 06:12:20,"""East Village""","""Espresso""","""Almond""",16,false,"""Danielle""",5.17,"""Watery"""
2025-06-08 10:56:38,"""East Village""","""Latte""","""Almond""",12,false,"""Danielle""",2.87,"""Watery"""


### Further Reading
- https://docs.pola.rs/user-guide/expressions/basic-operations/#conditionals
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.when.html

## Partitioning DataFrames
- Say we wanted to extract a separate `DataFrame` for every unique coffee (`"Flat White"`, `"Latte"`, and so on).

In [81]:
coffees = pl.read_csv("coffee_sales.csv", try_parse_dates=True)
coffees.head(3)

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-02-12 10:55:36,"""Downtown""","""Flat White""","""Oat""",12,True,"""Danielle""",5.47
2025-02-15 04:16:42,"""SoHo""","""Flat White""","""Oat""",12,True,"""Robert""",2.85
2025-04-09 03:07:34,"""Midtown West""","""Latte""","""Almond""",20,True,"""Jeffrey""",4.52


- Use the `unique` method to return a column of unique values from a target column.
- Writing a separate `filter` statement for each of the 12 coffees is verbose and not future-proof: every new coffee would need another statement.

In [82]:
coffees.select(pl.col("coffee_name").unique())

coffee_name
str
"""Americano"""
"""Frappe"""
"""Cortado"""
"""Affogato"""
"""Flat White"""
…
""
"""Mocha"""
"""Espresso"""
"""Red eye"""


- The `partition_by` method uses a column's unique values to return a list of `DataFrames`.
- The `partition_by` method accepts a string with the column's name (an expression is invalid).
- The word **partition** means to "divide into parts".
- If we have 12 different coffees, then `partition_by` will return a list of 12 `DataFrames`.

In [83]:
coffees.partition_by("coffee_name")

[shape: (155, 8)
 ┌─────────────────────┬──────────────┬─────────────┬───────────┬──────┬───────┬──────────┬───────┐
 │ time_of_purchase    ┆ location     ┆ coffee_name ┆ milk_type ┆ size ┆ iced  ┆ barista  ┆ price │
 │ ---                 ┆ ---          ┆ ---         ┆ ---       ┆ ---  ┆ ---   ┆ ---      ┆ ---   │
 │ datetime[μs]        ┆ str          ┆ str         ┆ str       ┆ i64  ┆ bool  ┆ str      ┆ f64   │
 ╞═════════════════════╪══════════════╪═════════════╪═══════════╪══════╪═══════╪══════════╪═══════╡
 │ 2025-02-12 10:55:36 ┆ Downtown     ┆ Flat White  ┆ Oat       ┆ 12   ┆ true  ┆ Danielle ┆ 5.47  │
 │ 2025-02-15 04:16:42 ┆ SoHo         ┆ Flat White  ┆ Oat       ┆ 12   ┆ true  ┆ Robert   ┆ 2.85  │
 │ 2025-04-27 21:36:38 ┆ SoHo         ┆ Flat White  ┆ Coconut   ┆ 20   ┆ false ┆ Robert   ┆ 4.18  │
 │ 2025-05-07 04:40:51 ┆ Uptown       ┆ Flat White  ┆ Oat       ┆ 12   ┆ true  ┆ Danielle ┆ 5.15  │
 │ 2025-02-12 19:19:39 ┆ Midtown West ┆ Flat White  ┆ Whole     ┆ 16   ┆ false ┆ Da

- The list isn't particularly helpful because the `DataFrames` are accessible only by index position, so we would need to know where each coffee landed.
- Polars will store the `DataFrames` in the order that it encounters the unique values in the original `DataFrame`.
- In this example, "Flat White" comes first, then "Latte", then "Macchiato", etc.
- The `as_dict` parameter returns a dictionary where the keys are tuples with the unique column values and the values are the `DataFrames`.

In [84]:
coffees_dict = coffees.partition_by("coffee_name", as_dict=True)

- Use square brackets to index into the dictionary and provide a tuple key to access the corresponding `DataFrame`.
- The comma is what makes a tuple, not the parentheses, so a one-element tuple needs a trailing comma: `("Mocha",)`.

In [85]:
coffees_dict[("Mocha",)]

coffees_dict[("Affogato",)]

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-02-07 09:30:28,"""East Village""","""Affogato""","""2%""",16,False,"""Christopher""",2.98


- The reason that the dictionary stores tuple keys is to support multi-key partitioning.
- Let's instead split the `DataFrame` by every combination of coffee and milk.

In [86]:
coffee_and_milk = coffees.partition_by(["coffee_name", "milk_type"], as_dict=True)

- Pass a tuple with both the coffee and milk entry to extract the target `DataFrame`.
- A single value (e.g., `"Americano"`) will raise a `KeyError` because every key is a two-element tuple.

In [87]:
coffee_and_milk[("Americano", "Soy")]

time_of_purchase,location,coffee_name,milk_type,size,iced,barista,price
datetime[μs],str,str,str,i64,bool,str,f64
2025-05-29 06:08:25,"""Downtown""","""Americano""","""Soy""",12,false,"""Christopher""",2.98
2025-05-04 10:30:15,"""SoHo""","""Americano""","""Soy""",12,true,"""Jeffrey""",3.52
2025-02-12 15:28:51,"""Midtown West""","""Americano""","""Soy""",20,false,"""Anthony""",5.15
2025-05-08 06:33:56,"""SoHo""","""Americano""","""Soy""",16,true,"""Jeffrey""",5.26
2025-03-18 19:46:41,"""East Village""","""Americano""","""Soy""",16,false,"""Robert""",3.88
…,…,…,…,…,…,…,…
2025-05-08 15:34:51,"""SoHo""","""Americano""","""Soy""",20,false,"""Erica""",3.53
2025-03-01 20:35:16,"""SoHo""","""Americano""","""Soy""",20,true,"""Angel""",3.74
2025-06-04 03:04:30,"""SoHo""","""Americano""","""Soy""",16,false,"""Christopher""",2.85
2025-04-14 10:46:22,"""Downtown""","""Americano""","""Soy""",16,true,"""Patricia""",2.68


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.partition_by.html