# Selectors
- The `polars` library includes a `selectors` submodule.
- `selectors` helps create more complex expressions targeting specific columns.
- The common alias for `selectors` is `cs`.

In [1]:
import polars as pl
import polars.selectors as cs

## Introducing the Dataset
- The `spotify` dataset is a collection of popular tracks on the music streaming service Spotify.

In [2]:
music = pl.read_csv("spotify_top_1000_tracks.csv", try_parse_dates=True)
music.head(3)

track_name,artist,album_name,release_date,popularity,duration_min
str,str,str,date,i64,f64
"""Goo Goo Muck""","""The Cramps""","""Psychedelic Jungle""",1981-01-01,67,3.100633
"""What Is Love - 7"" Mix""","""Haddaway""","""What Is Love""",1992-01-18,66,4.506217
"""Wannabe""","""Spice Girls""","""Spice""",1996-01-01,82,2.883767


- The `pl.col` function is pretty flexible by itself.
- It can target one column or multiple columns. It accepts a variety of inputs.

In [3]:
music.select(pl.col("track_name"), pl.col("artist"))
music.select(pl.col("track_name", "artist"))
music.select(pl.col(["track_name", "artist"]))
music.select(pl.all())

track_name,artist,album_name,release_date,popularity,duration_min
str,str,str,date,i64,f64
"""Goo Goo Muck""","""The Cramps""","""Psychedelic Jungle""",1981-01-01,67,3.100633
"""What Is Love - 7"" Mix""","""Haddaway""","""What Is Love""",1992-01-18,66,4.506217
"""Wannabe""","""Spice Girls""","""Spice""",1996-01-01,82,2.883767
"""Roses Are Red - Original Versi…","""Aqua""","""Aquarium (Special Edition)""",1997-01-01,60,3.76555
"""Lollipop (Candyman)""","""Aqua""","""Aquarium (Special Edition)""",1997-01-01,61,3.610433
…,…,…,…,…,…
"""Boots On The Ground""","""Dillin Hoox""","""Boots On The Ground""",2025-03-28,44,2.569517
"""The Monster""","""Gaullin""","""The Monster""",2025-03-28,56,2.290317
"""ALONE""","""NEWER""","""ALONE""",2025-04-04,32,3.09375
"""Where Are You Now""","""Nick Giardino""","""Where Are You Now""",2025-04-04,37,2.354833


- We can pass a data type to `col` to target columns by data type.

In [4]:
music.select(pl.col(pl.String))

track_name,artist,album_name
str,str,str
"""Goo Goo Muck""","""The Cramps""","""Psychedelic Jungle"""
"""What Is Love - 7"" Mix""","""Haddaway""","""What Is Love"""
"""Wannabe""","""Spice Girls""","""Spice"""
"""Roses Are Red - Original Versi…","""Aqua""","""Aquarium (Special Edition)"""
"""Lollipop (Candyman)""","""Aqua""","""Aquarium (Special Edition)"""
…,…,…
"""Boots On The Ground""","""Dillin Hoox""","""Boots On The Ground"""
"""The Monster""","""Gaullin""","""The Monster"""
"""ALONE""","""NEWER""","""ALONE"""
"""Where Are You Now""","""Nick Giardino""","""Where Are You Now"""


- We can pass a regular expression to `col` to target columns by pattern match.
- For `pl.col` to treat a string as a regular expression, the string must start with `^` and end with `$`.
- The `^` and `$` anchors mark the beginning and end of the target string.
- The `.` regex symbol matches any character.
- The `+` symbol means 1 or more of the preceding element, so `.+` matches 1 or more of any character.
- The regex below evaluates to "look for 1 or more of any character, then the text `name` before the end of the string".
- The regex matches the two columns that end with `name`.
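The pattern can also be checked against the column names outside of Polars. The sketch below uses Python's built-in `re` module as a stand-in for the regex engine Polars uses; for a pattern this simple the two should agree, but that is an assumption rather than something the dataset guarantees.

```python
import re

# Column names copied from the dataset preview above.
columns = [
    "track_name", "artist", "album_name",
    "release_date", "popularity", "duration_min",
]

# Same pattern as the Polars example; Python's `re` is only a stand-in here.
pattern = re.compile(r"^.+name$")

# Keep the names that the pattern matches from start to end.
matching = [name for name in columns if pattern.search(name)]
print(matching)  # ['track_name', 'album_name']

# `.+` needs at least one character before "name",
# so a column called just "name" would not match.
print(pattern.search("name"))  # None
```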

In [5]:
music.select(pl.col(r"^.+name$"))

track_name,album_name
str,str
"""Goo Goo Muck""","""Psychedelic Jungle"""
"""What Is Love - 7"" Mix""","""What Is Love"""
"""Wannabe""","""Spice"""
"""Roses Are Red - Original Versi…","""Aquarium (Special Edition)"""
"""Lollipop (Candyman)""","""Aquarium (Special Edition)"""
…,…
"""Boots On The Ground""","""Boots On The Ground"""
"""The Monster""","""The Monster"""
"""ALONE""","""ALONE"""
"""Where Are You Now""","""Where Are You Now"""


- The `pl.exclude` function creates an expression that rejects columns.
- We can target a column by name, by type, by regular expression, and more.

In [6]:
music.select(pl.exclude("release_date"))
music.select(pl.exclude(pl.Int64))
music.select(pl.exclude(r"^.+name$"))

artist,release_date,popularity,duration_min
str,date,i64,f64
"""The Cramps""",1981-01-01,67,3.100633
"""Haddaway""",1992-01-18,66,4.506217
"""Spice Girls""",1996-01-01,82,2.883767
"""Aqua""",1997-01-01,60,3.76555
"""Aqua""",1997-01-01,61,3.610433
…,…,…,…
"""Dillin Hoox""",2025-03-28,44,2.569517
"""Gaullin""",2025-03-28,56,2.290317
"""NEWER""",2025-04-04,32,3.09375
"""Nick Giardino""",2025-04-04,37,2.354833


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/col.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.all.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.exclude.html

## Introducing Selectors
- The `cs` (column selectors) module has 30+ functions for specifying columns.
- Selectors can be applied anywhere expressions are expected (`select`, `filter`, `group_by`, etc.).

In [7]:
music = pl.read_csv("spotify_top_1000_tracks.csv", try_parse_dates=True)
music.head(2)

track_name,artist,album_name,release_date,popularity,duration_min
str,str,str,date,i64,f64
"""Goo Goo Muck""","""The Cramps""","""Psychedelic Jungle""",1981-01-01,67,3.100633
"""What Is Love - 7"" Mix""","""Haddaway""","""What Is Love""",1992-01-18,66,4.506217


- The `cs.by_name` selector targets a column by name.

In [8]:
type(cs.by_name("track_name"))

polars.selectors.Selector

In [9]:
music.select(cs.by_name("track_name"))

track_name
str
"""Goo Goo Muck"""
"""What Is Love - 7"" Mix"""
"""Wannabe"""
"""Roses Are Red - Original Versi…"
"""Lollipop (Candyman)"""
…
"""Boots On The Ground"""
"""The Monster"""
"""ALONE"""
"""Where Are You Now"""


- Selectors like `by_name` are not particularly helpful. Their syntax is more verbose compared to `pl.col`.
- Other selectors can simplify targeting columns. For example, the `ends_with` selector is easier to read than a regex passed to `pl.col`.
- The next example uses `cs.ends_with` to target all columns that end with `"name"`.

In [10]:
music.select(cs.ends_with("name"))

track_name,album_name
str,str
"""Goo Goo Muck""","""Psychedelic Jungle"""
"""What Is Love - 7"" Mix""","""What Is Love"""
"""Wannabe""","""Spice"""
"""Roses Are Red - Original Versi…","""Aquarium (Special Edition)"""
"""Lollipop (Candyman)""","""Aquarium (Special Edition)"""
…,…
"""Boots On The Ground""","""Boots On The Ground"""
"""The Monster""","""The Monster"""
"""ALONE""","""ALONE"""
"""Where Are You Now""","""Where Are You Now"""


- The complementary `cs.starts_with` selector targets columns that start with a prefix.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/expression-expansion/#more-flexible-column-selections
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.by_name
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.starts_with
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.ends_with

## Selecting by Data Type
- There are special selectors for targeting columns by data types.
- For example, `cs.numeric` will target all numeric columns (integers and floating-points).

In [11]:
music = pl.read_csv("spotify_top_1000_tracks.csv", try_parse_dates=True)
music.head(2)

track_name,artist,album_name,release_date,popularity,duration_min
str,str,str,date,i64,f64
"""Goo Goo Muck""","""The Cramps""","""Psychedelic Jungle""",1981-01-01,67,3.100633
"""What Is Love - 7"" Mix""","""Haddaway""","""What Is Love""",1992-01-18,66,4.506217


- `pl.col` can target columns by one or more data types, but we have to explicitly write out each type.
- Polars has 5 signed integer types (`Int8`, `Int16`, `Int32`, `Int64`, `Int128`) plus unsigned counterparts such as `UInt8` and `UInt64`.
- The `cs.integer` selector targets all integer columns irrespective of their exact integer type.
- The `cs.float` selector targets all floating-point columns.

In [12]:
music.select(cs.integer())
music.select(cs.float())

duration_min
f64
3.100633
4.506217
2.883767
3.76555
3.610433
…
2.569517
2.290317
3.09375
2.354833


- Or perhaps we just want to target all numeric columns irrespective of their data type.

In [13]:
music.select(cs.numeric())

popularity,duration_min
i64,f64
67,3.100633
66,4.506217
82,2.883767
60,3.76555
61,3.610433
…,…
44,2.569517
56,2.290317
32,3.09375
37,2.354833


- The `cs.date`, `cs.time`, and `cs.datetime` selectors target date, time, and datetime columns.

In [14]:
music.select(cs.date())
music.select(cs.time())
music.select(cs.datetime())

- The `cs.temporal` selector targets any date/time/datetime columns.

In [15]:
music.select(cs.temporal())

release_date
date
1981-01-01
1992-01-18
1996-01-01
1997-01-01
1997-01-01
…
2025-03-28
2025-03-28
2025-04-04
2025-04-04


- The `cs.alpha` selector looks at column names rather than data types. It targets columns whose names contain only alphabetic characters (a-z).
- The selector skips columns whose names contain underscores.
- The `cs.alphanumeric` selector targets columns whose names contain only alphabetic characters or digits.

In [16]:
music.select(cs.alpha())
music.select(cs.alphanumeric())

artist,popularity
str,i64
"""The Cramps""",67
"""Haddaway""",66
"""Spice Girls""",82
"""Aqua""",60
"""Aqua""",61
…,…
"""Dillin Hoox""",44
"""Gaullin""",56
"""NEWER""",32
"""Nick Giardino""",37


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.integer
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.float
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.numeric
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.date
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.time
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.temporal
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.alpha
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.alphanumeric

## Selecting by Column Position

In [17]:
music = pl.read_csv("spotify_top_1000_tracks.csv", try_parse_dates=True)
music.head()

track_name,artist,album_name,release_date,popularity,duration_min
str,str,str,date,i64,f64
"""Goo Goo Muck""","""The Cramps""","""Psychedelic Jungle""",1981-01-01,67,3.100633
"""What Is Love - 7"" Mix""","""Haddaway""","""What Is Love""",1992-01-18,66,4.506217
"""Wannabe""","""Spice Girls""","""Spice""",1996-01-01,82,2.883767
"""Roses Are Red - Original Versi…","""Aqua""","""Aquarium (Special Edition)""",1997-01-01,60,3.76555
"""Lollipop (Candyman)""","""Aqua""","""Aquarium (Special Edition)""",1997-01-01,61,3.610433


- The `cs.by_index` selector targets columns by index position.
- Index positions start counting at 0. The first column is index 0, the second column is index 1, and so on.

In [18]:
music.select(cs.by_index(0))

track_name
str
"""Goo Goo Muck"""
"""What Is Love - 7"" Mix"""
"""Wannabe"""
"""Roses Are Red - Original Versi…"
"""Lollipop (Candyman)"""
…
"""Boots On The Ground"""
"""The Monster"""
"""ALONE"""
"""Where Are You Now"""


- Passing an out-of-range index will lead to a `ColumnNotFoundError`.

In [19]:
# music.select(cs.by_index(100))

- Pass multiple values to target multiple columns by index.

In [20]:
music.select(cs.by_index(0, 2, 4))
music.select(cs.by_index([0, 2, 4]))

track_name,album_name,popularity
str,str,i64
"""Goo Goo Muck""","""Psychedelic Jungle""",67
"""What Is Love - 7"" Mix""","""What Is Love""",66
"""Wannabe""","""Spice""",82
"""Roses Are Red - Original Versi…","""Aquarium (Special Edition)""",60
"""Lollipop (Candyman)""","""Aquarium (Special Edition)""",61
…,…,…
"""Boots On The Ground""","""Boots On The Ground""",44
"""The Monster""","""The Monster""",56
"""ALONE""","""ALONE""",32
"""Where Are You Now""","""Where Are You Now""",37


- Negative values will extract from the end of the `DataFrame`.
- -1 pulls the last column, -3 pulls the third-to-last column and so on.

In [21]:
music.select(cs.by_index(-1, -3))

duration_min,release_date
f64,date
3.100633,1981-01-01
4.506217,1992-01-18
2.883767,1996-01-01
3.76555,1997-01-01
3.610433,1997-01-01
…,…
2.569517,2025-03-28
2.290317,2025-03-28
3.09375,2025-04-04
2.354833,2025-04-04
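Negative positions count backwards from the end in the same way as negative indexes on a Python list. The sketch below uses a plain list of the dataset's column names as a stand-in for the `DataFrame`, so it shows the counting rule rather than Polars itself.

```python
# Column names in the order they appear in the dataset.
columns = [
    "track_name", "artist", "album_name",
    "release_date", "popularity", "duration_min",
]

# -1 is the last entry, -3 is the third-to-last entry.
last = columns[-1]
third_to_last = columns[-3]
print(last)           # duration_min
print(third_to_last)  # release_date

# A negative index n refers to the same entry as len(columns) + n.
equivalent_position = len(columns) + (-3)
print(equivalent_position)           # 3
print(columns[equivalent_position])  # release_date
```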


- We can mix and match positive and negative values.
- The following targets the last column (-1), the third-to-last column (-3), and the third column (2).
- Every column name must be unique so Polars will raise an exception if we target the same column twice.

In [22]:
music.select(cs.by_index(-1, -3, 2))

duration_min,release_date,album_name
f64,date,str
3.100633,1981-01-01,"""Psychedelic Jungle"""
4.506217,1992-01-18,"""What Is Love"""
2.883767,1996-01-01,"""Spice"""
3.76555,1997-01-01,"""Aquarium (Special Edition)"""
3.610433,1997-01-01,"""Aquarium (Special Edition)"""
…,…,…
2.569517,2025-03-28,"""Boots On The Ground"""
2.290317,2025-03-28,"""The Monster"""
3.09375,2025-04-04,"""ALONE"""
2.354833,2025-04-04,"""Where Are You Now"""


- The `first` and `last` selectors target the first and last columns.
- The `cs.first` selector is equivalent to `cs.by_index(0)`.
- The `cs.last` selector is equivalent to `cs.by_index(-1)`.

In [23]:
music.select(cs.first())
music.select(cs.last())

duration_min
f64
3.100633
4.506217
2.883767
3.76555
3.610433
…
2.569517
2.290317
3.09375
2.354833


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.by_index
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.first
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.last

## Set Operations with Selectors
- Selectors from `cs` support common set operations (union, intersection, difference, symmetric difference, and complement).
- A set is an unordered collection of unique values. Python has a `set` type.
- Set operations refer to various comparison operations between two sets.
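Python's built-in `set` type uses the same operator symbols, which makes it a convenient way to preview each operation. The sketch below works on plain sets of column names (chosen by hand for illustration), not on Polars selectors.

```python
# Names of the string columns in the dataset.
string_columns = {"track_name", "artist", "album_name"}

# Names of the columns that contain the substring "ar".
contains_ar = {"artist", "popularity"}

union = string_columns | contains_ar                 # in either set
intersection = string_columns & contains_ar          # in both sets
difference = string_columns - contains_ar            # in the first set only
symmetric_difference = string_columns ^ contains_ar  # in exactly one set

print(sorted(union))                 # ['album_name', 'artist', 'popularity', 'track_name']
print(sorted(intersection))          # ['artist']
print(sorted(difference))            # ['album_name', 'track_name']
print(sorted(symmetric_difference))  # ['album_name', 'popularity', 'track_name']
```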

In [24]:
music = pl.read_csv("spotify_top_1000_tracks.csv", try_parse_dates=True)
music.head(2)

track_name,artist,album_name,release_date,popularity,duration_min
str,str,str,date,i64,f64
"""Goo Goo Muck""","""The Cramps""","""Psychedelic Jungle""",1981-01-01,67,3.100633
"""What Is Love - 7"" Mix""","""Haddaway""","""What Is Love""",1992-01-18,66,4.506217


### Union
- We can use symbols to combine selectors. Different symbols apply different set operations.
- The `|` symbol creates a union (either/or) operation between the selectors.
- The union is the combination of two sets' values regardless of whether the value exists in one set or both.
- The next example targets columns that end with `"name"` or store temporal/datetime data (or both).

In [25]:
music.select(cs.ends_with("name") | cs.temporal())

track_name,album_name,release_date
str,str,date
"""Goo Goo Muck""","""Psychedelic Jungle""",1981-01-01
"""What Is Love - 7"" Mix""","""What Is Love""",1992-01-18
"""Wannabe""","""Spice""",1996-01-01
"""Roses Are Red - Original Versi…","""Aquarium (Special Edition)""",1997-01-01
"""Lollipop (Candyman)""","""Aquarium (Special Edition)""",1997-01-01
…,…,…
"""Boots On The Ground""","""Boots On The Ground""",2025-03-28
"""The Monster""","""The Monster""",2025-03-28
"""ALONE""","""ALONE""",2025-04-04
"""Where Are You Now""","""Where Are You Now""",2025-04-04


### Intersection
- The `&` symbol performs an intersection (AND). The value must exist in both sets.
- The `cs.contains` selector checks if the column name contains a substring.
- The code below targets columns that hold string values AND whose names contain `"ar"`.
- The selector excludes the `popularity` column. Its name contains `"ar"` but it is not a string column.

In [26]:
music.select(cs.string() & cs.contains("ar"))

artist
str
"""The Cramps"""
"""Haddaway"""
"""Spice Girls"""
"""Aqua"""
"""Aqua"""
…
"""Dillin Hoox"""
"""Gaullin"""
"""NEWER"""
"""Nick Giardino"""


### Difference
- The `-` symbol calculates the difference between two sets.
- A difference operation removes entries from one set if they are found in the other set.
- Think of it as "subtracting" values from the set.
- The following example selects string columns _except for_ those whose names contain `art`.

In [27]:
music.select(cs.string() - cs.contains("art"))

track_name,album_name
str,str
"""Goo Goo Muck""","""Psychedelic Jungle"""
"""What Is Love - 7"" Mix""","""What Is Love"""
"""Wannabe""","""Spice"""
"""Roses Are Red - Original Versi…","""Aquarium (Special Edition)"""
"""Lollipop (Candyman)""","""Aquarium (Special Edition)"""
…,…
"""Boots On The Ground""","""Boots On The Ground"""
"""The Monster""","""The Monster"""
"""ALONE""","""ALONE"""
"""Where Are You Now""","""Where Are You Now"""


### Symmetric Difference/Exclusive Or
- The `^` (exclusive or symbol) selects columns that satisfy either one condition or the other but _not_ both.
- Equivalently, the `^` targets values that exist in the first set or the second set but not both sets.
- The next example selects columns that either hold strings or have names containing the text `name`, but not both.
- Polars excludes `track_name` and `album_name` because they are string columns that contain `"name"`.

In [28]:
music.select(cs.string() ^ cs.contains("name"))

artist
str
"""The Cramps"""
"""Haddaway"""
"""Spice Girls"""
"""Aqua"""
"""Aqua"""
…
"""Dillin Hoox"""
"""Gaullin"""
"""NEWER"""
"""Nick Giardino"""


### Exclusion
- The `~` (tilde) symbol performs exclusion/negation.
- The selector targets the inverse of the results set.
- For example, `~cs.temporal()` will select all non-temporal columns.

In [29]:
music.select(~cs.temporal())

track_name,artist,album_name,popularity,duration_min
str,str,str,i64,f64
"""Goo Goo Muck""","""The Cramps""","""Psychedelic Jungle""",67,3.100633
"""What Is Love - 7"" Mix""","""Haddaway""","""What Is Love""",66,4.506217
"""Wannabe""","""Spice Girls""","""Spice""",82,2.883767
"""Roses Are Red - Original Versi…","""Aqua""","""Aquarium (Special Edition)""",60,3.76555
"""Lollipop (Candyman)""","""Aqua""","""Aquarium (Special Edition)""",61,3.610433
…,…,…,…,…
"""Boots On The Ground""","""Dillin Hoox""","""Boots On The Ground""",44,2.569517
"""The Monster""","""Gaullin""","""The Monster""",56,2.290317
"""ALONE""","""NEWER""","""ALONE""",32,3.09375
"""Where Are You Now""","""Nick Giardino""","""Where Are You Now""",37,2.354833


- We can accomplish the same result with the `cs.exclude` function.
- The `cs.exclude` function accepts other selectors. The `pl.exclude` function will not work here.
- The next example excludes all temporal columns (date/time/datetime).

In [30]:
music.select(cs.exclude(cs.temporal()))

track_name,artist,album_name,popularity,duration_min
str,str,str,i64,f64
"""Goo Goo Muck""","""The Cramps""","""Psychedelic Jungle""",67,3.100633
"""What Is Love - 7"" Mix""","""Haddaway""","""What Is Love""",66,4.506217
"""Wannabe""","""Spice Girls""","""Spice""",82,2.883767
"""Roses Are Red - Original Versi…","""Aqua""","""Aquarium (Special Edition)""",60,3.76555
"""Lollipop (Candyman)""","""Aqua""","""Aquarium (Special Edition)""",61,3.610433
…,…,…,…,…
"""Boots On The Ground""","""Dillin Hoox""","""Boots On The Ground""",44,2.569517
"""The Monster""","""Gaullin""","""The Monster""",56,2.290317
"""ALONE""","""NEWER""","""ALONE""",32,3.09375
"""Where Are You Now""","""Nick Giardino""","""Where Are You Now""",37,2.354833


- Selectors can utilize other selectors.
- The next example excludes columns that fall in the set of string columns that do not contain `"art"`.
- `track_name` and `album_name` are string columns that do not contain `"art"`.
- Polars thus excludes `track_name` and `album_name` from the `DataFrame`.

In [31]:
music.select(cs.exclude(cs.string() - cs.contains("art")))

artist,release_date,popularity,duration_min
str,date,i64,f64
"""The Cramps""",1981-01-01,67,3.100633
"""Haddaway""",1992-01-18,66,4.506217
"""Spice Girls""",1996-01-01,82,2.883767
"""Aqua""",1997-01-01,60,3.76555
"""Aqua""",1997-01-01,61,3.610433
…,…,…,…
"""Dillin Hoox""",2025-03-28,44,2.569517
"""Gaullin""",2025-03-28,56,2.290317
"""NEWER""",2025-04-04,32,3.09375
"""Nick Giardino""",2025-04-04,37,2.354833


### Further Reading
- https://docs.pola.rs/user-guide/expressions/expression-expansion/#combining-selectors-with-set-operations
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.string
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.contains
- https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.temporal

## Complete List of Selectors
- Docs: https://docs.pola.rs/user-guide/expressions/expression-expansion/#complete-reference