In this video, we’ll explore how Polars lets you write SQL-style analytics directly in Python, with expressions like
pl.col(), pl.when(), and group_by().agg() that read cleanly, scale to millions of rows, and in many cases run faster than pandas or Spark.


Goals

Introduce Polars expressions
Write SQL-like queries
Demonstrate LazyFrame workflows
Cover advanced topics: window functions, joins, pivots, exploding lists, conditional logic, and custom UDFs
Focus on practical examples and exercises that can be adapted to real datasets


In [0]:
# %% [markdown]
# Polars Expressions for Analytics (SQL-like style in Python)

# Notebook overview
# This notebook is an extensive hands-on guide to using **Polars expressions** — the powerful, lazy, SQL-like expression system in Polars — for analytics in Python.

# Goals:
# - Introduce Polars expressions and the core building blocks (`pl.col`, `pl.lit`, `pl.when`, arithmetic, functions)
# - Show how to write SQL-like queries using expressions (`select`, `with_columns`, `filter`, `group_by`, `agg`)
# - Demonstrate LazyFrame workflows for performant pipelines
# - Cover advanced topics: window functions, joins, pivots, exploding lists, conditional logic, and custom UDFs
# - Provide practical examples and exercises you can adapt to real datasets

# Requirements:
# - polars >= 1.0 (the examples use `group_by`, `pl.len`, `pivot(on=...)` and `unpivot`; older versions use different names)
# - pandas (for comparison examples, optional)

In [0]:
pip install polars

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
import polars as pl
import numpy as np
import datetime as dt

In [0]:
# %% [markdown]
# 1) Quick primer: DataFrame vs LazyFrame
#
# `pl.DataFrame` evaluates eagerly (similar to pandas). `pl.LazyFrame` builds a query plan and evaluates on `collect()`.
# Expressions are the building blocks used by both, but they shine with LazyFrame because of query optimization.

# Create a small sample dataset to use throughout the notebook

In [0]:
np.random.seed(42)

n = 20
cats = ["A", "B", "C"]

df = pl.DataFrame({
    "id": np.arange(1, n + 1),
    "category": np.random.choice(cats, size=n),
    "value": np.random.normal(loc=50, scale=15, size=n).round(2),
    "timestamp": [dt.datetime(2025, 1, 1) + dt.timedelta(days=int(x)) for x in np.random.randint(0, 30, size=n)]
})

# add a list column to demo explode

df = df.with_columns([
    pl.int_ranges(1, 1 + (pl.col("id") % 4)).alias("small_list")
])


2) Expression basics
- `pl.col("colname")` — references a column
- `pl.lit(value)` — literal value
- `expr.alias("new_name")` — rename expression
- `expr1 + expr2`, `expr * 2`, etc. — arithmetic supported
- Vectorized functions exist in `pl.col(...).sum()`, `pl.col(...).mean()`, string functions with `.str.*`, datetime with `.dt.*`, etc.

In [0]:
(df
 .select([
     pl.col("id"),
     pl.col("value"),
     (pl.col("value") * 1.1).alias("value_plus_10pct"),
     pl.col("category").alias("cat")
 ])
)

id,value,value_plus_10pct,cat
i64,f64,f64,str
1,43.05,47.355,"""C"""
2,43.01,47.311,"""A"""
3,53.63,58.993,"""C"""
4,21.3,23.43,"""C"""
5,24.13,26.543,"""A"""
…,…,…,…
16,51.66,56.826,"""A"""
17,32.74,36.014,"""B"""
18,55.64,61.204,"""B"""
19,40.99,45.089,"""B"""


Expressions can be composed: most methods called on `pl.col()` (for example `.round(0)`, `.floor()`, or `.mean()`) return a new expression, so calls can be chained.

In [0]:
# Aggregating expressions: each one reduces the column to a single value
(df
 .select([
     pl.col("value").mean().alias("mean_value"),
     pl.col("value").median().alias("median_value"),
     pl.col("value").std().alias("std_value")
 ])
)

mean_value,median_value,std_value
f64,f64,f64
42.406,42.42,12.268675


3) `select` and `with_columns`
 - `select` produces a new DataFrame with only the provided expressions (can be used with both eager & lazy)
 - `with_columns` adds or replaces one or more columns and keeps all the others
 - `with_column` (singular) has been removed from current Polars; pass a single expression to `with_columns` instead

In [0]:
# Add computed columns using expressions
(df
 .with_columns([
     (pl.col("value") / pl.col("value").sum()).alias("value_share"),
     (pl.col("value").rank("dense")).alias("dense_rank")
 ])
 .select(["id", "category", "value", "value_share", "dense_rank"])
)

id,category,value,value_share,dense_rank
i64,str,f64,f64,u32
1,"""C""",43.05,0.050759,12
2,"""A""",43.01,0.050712,11
3,"""C""",53.63,0.063234,17
4,"""C""",21.3,0.025114,1
5,"""A""",24.13,0.028451,2
…,…,…,…,…
16,"""A""",51.66,0.060911,16
17,"""B""",32.74,0.038603,5
18,"""B""",55.64,0.065604,19
19,"""B""",40.99,0.04833,8


4) Filtering (`filter`) and boolean logic
Expressions can be used inside `filter`.

In [0]:
# Filter rows where value > mean
mean_val = df.select(pl.col("value").mean()).item()

high = df.filter(pl.col("value") > mean_val)
print("Mean value:", mean_val)
print(high)

Mean value: 42.406
shape: (10, 5)
┌─────┬──────────┬───────┬─────────────────────┬────────────┐
│ id  ┆ category ┆ value ┆ timestamp           ┆ small_list │
│ --- ┆ ---      ┆ ---   ┆ ---                 ┆ ---        │
│ i64 ┆ str      ┆ f64   ┆ datetime[μs]        ┆ list[i64]  │
╞═════╪══════════╪═══════╪═════════════════════╪════════════╡
│ 1   ┆ C        ┆ 43.05 ┆ 2025-01-20 00:00:00 ┆ [1]        │
│ 2   ┆ A        ┆ 43.01 ┆ 2025-01-28 00:00:00 ┆ [1, 2]     │
│ 3   ┆ C        ┆ 53.63 ┆ 2025-01-15 00:00:00 ┆ [1, 2, 3]  │
│ 8   ┆ B        ┆ 54.71 ┆ 2025-01-08 00:00:00 ┆ []         │
│ 11  ┆ C        ┆ 71.98 ┆ 2025-01-14 00:00:00 ┆ [1, 2, 3]  │
│ 12  ┆ C        ┆ 46.61 ┆ 2025-01-17 00:00:00 ┆ []         │
│ 13  ┆ A        ┆ 51.01 ┆ 2025-01-04 00:00:00 ┆ [1]        │
│ 16  ┆ A        ┆ 51.66 ┆ 2025-01-04 00:00:00 ┆ []         │
│ 18  ┆ B        ┆ 55.64 ┆ 2025-01-30 00:00:00 ┆ [1, 2]     │
│ 20  ┆ B        ┆ 45.62 ┆ 2025-01-22 00:00:00 ┆ []         │
└─────┴──────────┴───────┴─────────────────────┴────────────┘

In [0]:
df.filter((pl.col("category") == "A") & (pl.col("value") > 60))

id,category,value,timestamp,small_list
i64,str,f64,datetime[μs],list[i64]


5) GroupBy & Aggregations (SQL `GROUP BY`)
 - `group_by` accepts column(s) and then `.agg()` with expressions
 - Inside `.agg()`, use aggregations such as `.mean()` or `.sum()`; a bare `pl.col(...)` collects each group's values into a list

In [0]:
(df
 .group_by("category")
 .agg([
     pl.len().alias("n"),
     pl.col("value").mean().alias("avg_value"),
     pl.col("value").median().alias("median_value"),
     pl.col("value").std().alias("std_value"),
     pl.col("id").max().alias("max_id")
 ])
)

category,n,avg_value,median_value,std_value,max_id
str,u32,f64,f64,f64,i64
"""A""",5,42.276,43.01,11.12044,16
"""C""",9,40.578889,36.38,15.421072,14
"""B""",6,45.255,43.725,8.7627,20


Grouping by multiple columns and naming the aggregated outputs with `.alias` are both straightforward:

In [0]:
# add another grouping column for demonstration

(df
 .with_columns((pl.col("timestamp").dt.weekday().alias("weekday")))
 .group_by(["category", "weekday"])
 .agg([
     pl.len().alias("n"),
     pl.col("value").mean().alias("avg_value")
 ])
 .sort(["category", "weekday"])
)

category,weekday,n,avg_value
str,i8,u32,f64
"""A""",2,2,33.57
"""A""",6,2,51.335
"""A""",7,1,41.57
"""B""",1,1,40.99
"""B""",3,3,47.386667
…,…,…,…
"""C""",1,1,43.05
"""C""",2,2,46.64
"""C""",3,3,41.606667
"""C""",5,2,37.715


6) Window functions and rolling aggregations
 - Use `.over()` to apply an expression as a window function
 - Use `rolling` (formerly `groupby_rolling`) or `group_by_dynamic` (formerly `groupby_dynamic`) for time-based windows

In [0]:
# Window functions: partition by category, then compute each row's dense rank and the per-category mean of 'value'

df_with_rank = (
    df
    .with_columns([
        pl.col("value").rank("dense").over("category").alias("rank_within_cat"),
        pl.col("value").mean().over("category").alias("mean_within_cat")
    ])
)

df_with_rank

id,category,value,timestamp,small_list,rank_within_cat,mean_within_cat
i64,str,f64,datetime[μs],list[i64],u32,f64
1,"""C""",43.05,2025-01-20 00:00:00,[1],6,40.578889
2,"""A""",43.01,2025-01-28 00:00:00,"[1, 2]",3,42.276
3,"""C""",53.63,2025-01-15 00:00:00,"[1, 2, 3]",8,40.578889
4,"""C""",21.3,2025-01-28 00:00:00,[],1,40.578889
5,"""A""",24.13,2025-01-07 00:00:00,[1],1,42.276
…,…,…,…,…,…,…
16,"""A""",51.66,2025-01-04 00:00:00,[],5,42.276
17,"""B""",32.74,2025-01-02 00:00:00,[1],1,45.255
18,"""B""",55.64,2025-01-30 00:00:00,"[1, 2]",6,45.255
19,"""B""",40.99,2025-01-06 00:00:00,"[1, 2, 3]",2,45.255


Rolling / time-based: convert to a LazyFrame, sort by the time column, and use `group_by_dynamic` for fixed time windows.

In [0]:
lf = (
    df.lazy()
    .with_columns(pl.col("timestamp").cast(pl.Datetime))
    .sort("timestamp")
)

result = (
    lf
    .group_by_dynamic(
        index_column="timestamp",
        every="7d",
        period="7d",
        offset="0d",
    )
    .agg([
        pl.len().alias("n"),
        pl.col("value").mean().alias("avg_value")
    ])
    .collect()
)

print(result)

shape: (5, 3)
┌─────────────────────┬─────┬───────────┐
│ timestamp           ┆ n   ┆ avg_value │
│ ---                 ┆ --- ┆ ---       │
│ datetime[μs]        ┆ u32 ┆ f64       │
╞═════════════════════╪═════╪═══════════╡
│ 2025-01-02 00:00:00 ┆ 8   ┆ 40.73625  │
│ 2025-01-09 00:00:00 ┆ 4   ┆ 50.89     │
│ 2025-01-16 00:00:00 ┆ 4   ┆ 40.9775   │
│ 2025-01-23 00:00:00 ┆ 3   ┆ 33.04     │
│ 2025-01-30 00:00:00 ┆ 1   ┆ 55.64     │
└─────────────────────┴─────┴───────────┘


7) Joins and concatenation
 - `join` behaves SQL-like; specify `how` and `on`/`left_on`/`right_on`
 - Use `hstack` / `vstack` for concatenation

In [0]:
# Create a lookup table for category metadata

cat_meta = pl.DataFrame({
    "category": ["A", "B", "C"],
    "description": ["Group A", "Group B", "Group C"],
    "weight": [1.0, 1.2, 0.9]
})

# join
(df.join(cat_meta, on="category", how="left")).select(["id", "category", "description", "value"])


id,category,description,value
i64,str,str,f64
1,"""C""","""Group C""",43.05
2,"""A""","""Group A""",43.01
3,"""C""","""Group C""",53.63
4,"""C""","""Group C""",21.3
5,"""A""","""Group A""",24.13
…,…,…,…
16,"""A""","""Group A""",51.66
17,"""B""","""Group B""",32.74
18,"""B""","""Group B""",55.64
19,"""B""","""Group B""",40.99


8) Pivot & unpivot (melt)
 - `pivot` is useful for turning group-by aggregations into wide form
 - `unpivot` (called `melt` in older Polars versions and in pandas) goes from wide to long


In [0]:
# pivot: average value per category for each weekday (wide table)
(
    df
    .with_columns(pl.col("timestamp").dt.weekday().alias("weekday"))
    .pivot(
        values="value",
        index="weekday",
        on="category",
        aggregate_function="mean"   # or "sum", "count", etc.
    )
)

weekday,C,A,B
i8,f64,f64,f64
1,43.05,,40.99
2,46.64,33.57,
3,41.606667,,47.386667
7,,41.57,
5,37.715,,
6,28.63,51.335,
4,,,44.19


In [0]:
# Unpivot (melt) example: make the wide table long

wide = pl.DataFrame({"id": [1,2,3], "A": [10,20,30], "B": [5,6,7]})
wide.unpivot(index=["id"], on=["A","B"], variable_name="category", value_name="value")


id,category,value
i64,str,i64
1,"""A""",10
2,"""A""",20
3,"""A""",30
1,"""B""",5
2,"""B""",6
3,"""B""",7


9) Working with lists and `explode`
- Polars has native list-columns and many list functions. Avoid Python loops; use expressions.

In [0]:
# Explode the list into separate rows
exploded = df.explode("small_list")
print(exploded)

shape: (35, 5)
┌─────┬──────────┬───────┬─────────────────────┬────────────┐
│ id  ┆ category ┆ value ┆ timestamp           ┆ small_list │
│ --- ┆ ---      ┆ ---   ┆ ---                 ┆ ---        │
│ i64 ┆ str      ┆ f64   ┆ datetime[μs]        ┆ i64        │
╞═════╪══════════╪═══════╪═════════════════════╪════════════╡
│ 1   ┆ C        ┆ 43.05 ┆ 2025-01-20 00:00:00 ┆ 1          │
│ 2   ┆ A        ┆ 43.01 ┆ 2025-01-28 00:00:00 ┆ 1          │
│ 2   ┆ A        ┆ 43.01 ┆ 2025-01-28 00:00:00 ┆ 2          │
│ 3   ┆ C        ┆ 53.63 ┆ 2025-01-15 00:00:00 ┆ 1          │
│ 3   ┆ C        ┆ 53.63 ┆ 2025-01-15 00:00:00 ┆ 2          │
│ …   ┆ …        ┆ …     ┆ …                   ┆ …          │
│ 18  ┆ B        ┆ 55.64 ┆ 2025-01-30 00:00:00 ┆ 2          │
│ 19  ┆ B        ┆ 40.99 ┆ 2025-01-06 00:00:00 ┆ 1          │
│ 19  ┆ B        ┆ 40.99 ┆ 2025-01-06 00:00:00 ┆ 2          │
│ 19  ┆ B        ┆ 40.99 ┆ 2025-01-06 00:00:00 ┆ 3          │
│ 20  ┆ B        ┆ 45.62 ┆ 2025-01-22 00:00:00 ┆ null       │
└─────┴──────────┴───────┴─────────────────────┴────────────┘

10) Conditional logic: `when` / `otherwise`
- Use `pl.when(condition).then(expr).otherwise(expr)` to vectorize `CASE WHEN` logic.

In [0]:
(df
 .with_columns([
     pl.when(pl.col("value") > 60).then(pl.lit("high"))
       .when(pl.col("value") > 40).then(pl.lit("medium"))
       .otherwise(pl.lit("low")).alias("value_bucket")
 ])
)

id,category,value,timestamp,small_list,value_bucket
i64,str,f64,datetime[μs],list[i64],str
1,"""C""",43.05,2025-01-20 00:00:00,[1],"""medium"""
2,"""A""",43.01,2025-01-28 00:00:00,"[1, 2]","""medium"""
3,"""C""",53.63,2025-01-15 00:00:00,"[1, 2, 3]","""medium"""
4,"""C""",21.3,2025-01-28 00:00:00,[],"""low"""
5,"""A""",24.13,2025-01-07 00:00:00,[1],"""low"""
…,…,…,…,…,…
16,"""A""",51.66,2025-01-04 00:00:00,[],"""medium"""
17,"""B""",32.74,2025-01-02 00:00:00,[1],"""low"""
18,"""B""",55.64,2025-01-30 00:00:00,"[1, 2]","""medium"""
19,"""B""",40.99,2025-01-06 00:00:00,"[1, 2, 3]","""medium"""


11) LazyFrame best practices and query optimization
 - Prefer LazyFrame for larger pipelines: `df.lazy()` -> expressions -> `collect()`
 - Batch expressions into one `with_columns` call instead of chaining many separate calls; expressions in a single call can run in parallel
 - Avoid `.map_elements` (formerly `.apply`) and other Python-level UDFs when possible; use vectorized expressions
 - Use `select` early to prune unneeded columns
 - `explain()` on a LazyFrame shows the query plan

In [0]:
total_value = df["value"].sum()
mean_value = df["value"].mean()

lazy = (
    df.lazy()
    .filter(pl.col("value") > 20)
    .with_columns([
        (pl.col("value") - mean_value).alias("diff_from_mean"),
        (pl.col("value") / total_value).alias("share"),
    ])
    .group_by("category")
    .agg([
        pl.len().alias("n"),
        pl.col("share").sum().alias("sum_share"),
    ])
)

print(lazy.collect())


shape: (3, 3)
┌──────────┬─────┬───────────┐
│ category ┆ n   ┆ sum_share │
│ ---      ┆ --- ┆ ---       │
│ str      ┆ u32 ┆ f64       │
╞══════════╪═════╪═══════════╡
│ C        ┆ 9   ┆ 0.430611  │
│ B        ┆ 6   ┆ 0.320155  │
│ A        ┆ 5   ┆ 0.249234  │
└──────────┴─────┴───────────┘


 12) Tips: avoiding `apply`, vectorized alternatives, performance tricks
 - Use built-in expressions: `.str.*` for strings, `.dt.*` for datetimes, `.list.*` for lists
 - Use `.replace` or `.replace_strict` for mapping/lookup replacement (these supersede the older `.map_dict`)
 - Convert expensive Python-level loops to expressions; fall back to `.map_batches` (whole-column UDF) or `.map_elements` (per-value UDF) only where necessary

In [0]:
# Example: a Python-level UDF works but is discouraged

def python_transform(x):
    return x * 2

# map_elements (formerly `apply`, which no longer exists on Expr) calls the
# Python function once per value: slow, and opaque to the query optimizer
slow = df.with_columns([
    pl.col("value").map_elements(python_transform, return_dtype=pl.Float64).alias("bad_double")
])

# Better: use a native expression
(df.with_columns([(pl.col("value") * 2).alias("better_double")]))


id,category,value,timestamp,small_list,better_double
i64,str,f64,datetime[μs],list[i64],f64
1,"""C""",43.05,2025-01-20 00:00:00,[1],86.1
2,"""A""",43.01,2025-01-28 00:00:00,"[1, 2]",86.02
3,"""C""",53.63,2025-01-15 00:00:00,"[1, 2, 3]",107.26
4,"""C""",21.3,2025-01-28 00:00:00,[],42.6
5,"""A""",24.13,2025-01-07 00:00:00,[1],48.26
…,…,…,…,…,…
16,"""A""",51.66,2025-01-04 00:00:00,[],103.32
17,"""B""",32.74,2025-01-02 00:00:00,[1],65.48
18,"""B""",55.64,2025-01-30 00:00:00,"[1, 2]",111.28
19,"""B""",40.99,2025-01-06 00:00:00,"[1, 2, 3]",81.98


13) More advanced examples
 - compute top-N per group
 - compute percentiles using expressions
 - using `.join` with expression-built keys

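 Top-N per group: e.g. the 2 lowest values per category (`rank` is ascending by default; pass `descending=True` for the 2 highest)
 Approach: compute a rank over the partition, then filter on it (dense ranks can return more than N rows when values tie)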

In [0]:
ranked = (
    df
    .with_columns(
        pl.col("value").rank("dense").over("category").alias("rank_in_cat")
    )
    .filter(pl.col("rank_in_cat") <= 2)
    .sort(["category", "rank_in_cat"], descending=[False, False])
)

print(ranked)


shape: (6, 6)
┌─────┬──────────┬───────┬─────────────────────┬────────────┬─────────────┐
│ id  ┆ category ┆ value ┆ timestamp           ┆ small_list ┆ rank_in_cat │
│ --- ┆ ---      ┆ ---   ┆ ---                 ┆ ---        ┆ ---         │
│ i64 ┆ str      ┆ f64   ┆ datetime[μs]        ┆ list[i64]  ┆ u32         │
╞═════╪══════════╪═══════╪═════════════════════╪════════════╪═════════════╡
│ 5   ┆ A        ┆ 24.13 ┆ 2025-01-07 00:00:00 ┆ [1]        ┆ 1           │
│ 6   ┆ A        ┆ 41.57 ┆ 2025-01-12 00:00:00 ┆ [1, 2]     ┆ 2           │
│ 17  ┆ B        ┆ 32.74 ┆ 2025-01-02 00:00:00 ┆ [1]        ┆ 1           │
│ 19  ┆ B        ┆ 40.99 ┆ 2025-01-06 00:00:00 ┆ [1, 2, 3]  ┆ 2           │
│ 4   ┆ C        ┆ 21.3  ┆ 2025-01-28 00:00:00 ┆ []         ┆ 1           │
│ 14  ┆ C        ┆ 28.63 ┆ 2025-01-18 00:00:00 ┆ [1, 2]     ┆ 2           │
└─────┴──────────┴───────┴─────────────────────┴────────────┴─────────────┘


 Compute percentiles per group (e.g. 75th percentile) using `quantile`

In [0]:
(df
 .group_by("category")
 .agg([
     pl.col("value").quantile(0.75).alias("p75"),
     pl.col("value").quantile(0.5).alias("median"),
 ])
)


category,p75,median
str,f64,f64
"""A""",51.01,43.01
"""B""",54.71,45.62
"""C""",46.61,36.38


Writing SQL-like code with expressions
 - Many users think in SQL. Polars expressions let you write the same ideas but in Python.
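 - Example: the SQL query `SELECT category, AVG(value) AS avg_value FROM df WHERE value > 20 GROUP BY category HAVING AVG(value) > 45 ORDER BY avg_value DESC LIMIT 10` maps onto the expression pipeline in the next cell: `WHERE` becomes the first `filter`, `HAVING` becomes a `filter` after `agg`.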


In [0]:
(
    df.lazy()
    .filter(pl.col("value") > 20)
    .group_by("category")
    .agg(pl.col("value").mean().alias("avg_value"))
    .filter(pl.col("avg_value") > 45)
    .sort("avg_value", descending=True)
    .limit(10)
    .collect()
)


category,avg_value
str,f64
"""B""",45.255
