# DataFrames II

In [1]:
import polars as pl

## The fill_null Method
- Polars uses `null` to represent a missing value.

In [2]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(3)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01


- The `fill_null` method replaces missing values with a replacement that we specify.
- We can provide a constant value, a fill strategy, or a Polars expression.

In [3]:
employees.select(
    pl.col("department"),
    pl.col("department").fill_null("Intern").alias("department2"),
)

department,department2
str,str
"""CEO""","""CEO"""
"""Operations""","""Operations"""
,"""Intern"""
"""HR""","""HR"""
,"""Intern"""
…,…
,"""Intern"""
"""Operations""","""Operations"""
,"""Intern"""
"""Finance""","""Finance"""


- The `strategy` parameter selects a fill algorithm. `"forward"` copies the previous non-null value and `"backward"` copies the next non-null value.

In [4]:
employees.head(6).select(
    pl.col("department"),
    pl.col("department").fill_null(strategy="forward").alias("department_forward"),
    pl.col("department").fill_null(strategy="backward").alias("department_backward"),
)

department,department_forward,department_backward
str,str,str
"""CEO""","""CEO""","""CEO"""
"""Operations""","""Operations""","""Operations"""
,"""Operations""","""HR"""
"""HR""","""HR""","""HR"""
,"""HR""","""Marketing"""
"""Marketing""","""Marketing""","""Marketing"""


- Let's select the first 5 rows, then add a `title` column to the `DataFrame`.
- We can pass a column expression to `fill_null` to replace each missing `department` value with the `title` from the same row.

In [5]:
employees.head().with_columns(
    pl.Series(
        "title",
        [
            "CEO",
            "Warehouse Manager",
            "Frontend Developer",
            "Acquisition Lead",
            "Backend Developer",
        ],
    )
).with_columns(
    pl.col("department").fill_null(pl.col("title")).alias("department_updated")
)

name,department,email,salary,years_at_company,start_date,title,department_updated
str,str,str,i64,i64,date,str,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14,"""CEO""","""CEO"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13,"""Warehouse Manager""","""Operations"""
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01,"""Frontend Developer""","""Frontend Developer"""
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,2019-11-25,"""Acquisition Lead""","""HR"""
"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,2018-02-14,"""Backend Developer""","""Backend Developer"""


### Further Reading
- https://docs.pola.rs/user-guide/expressions/missing-data/#filling-missing-data
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.fill_null.html

## Interpolation
- The `interpolate` method replaces missing values using linear interpolation by default.
- Interpolation draws a straight line between two values and fills in the gaps along that line.

In [6]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(3)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01


In [7]:
employees.head(6).select(
    pl.col("name"), pl.Series("bonus", [10000, None, 20000, None, None, 50000])
).with_columns(pl.col("bonus").interpolate().alias("bonus_updated"))

name,bonus,bonus_updated
str,i64,f64
"""Nicholas Maldonado""",10000.0,10000.0
"""Michael Fletcher""",,15000.0
"""Jeffrey Tanner""",20000.0,20000.0
"""Diana Weaver""",,30000.0
"""Sierra Ross""",,40000.0
"""Melissa Page""",50000.0,50000.0


### Further Reading
- https://docs.pola.rs/user-guide/expressions/missing-data/#fill-with-interpolation
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.interpolate.html

## Dropping Missing Data
- The `drop_nulls` method on an expression removes `null` values from the target column.
- If there are missing values, the new column will be shorter than the original one.
- The `with_columns` method attaches new columns to the end of the `DataFrame` and expects columns of equal length, so it raises an error when given the shortened column.

In [8]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(3)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01


In [9]:
employees.select(pl.col("department").drop_nulls())

# employees.with_columns(pl.col("department").drop_nulls())

department
str
"""CEO"""
"""Operations"""
"""HR"""
"""Marketing"""
"""Marketing"""
…
"""Operations"""
"""Finance"""
"""HR"""
"""Operations"""


- The `drop_nulls` method on a `DataFrame` removes rows that have one or more `null` values.
- The `subset` parameter limits the columns that Polars uses to identify `null` values.

In [10]:
employees = employees.head().with_columns(pl.lit(None).alias("email"))
employees

name,department,email,salary,years_at_company,start_date
str,str,null,i64,i64,date
"""Nicholas Maldonado""","""CEO""",,250000,9,2016-07-14
"""Michael Fletcher""","""Operations""",,96540,9,2016-02-13
"""Jeffrey Tanner""",,,126489,10,2015-03-01
"""Diana Weaver""","""HR""",,84672,5,2019-11-25
"""Sierra Ross""",,,148601,7,2018-02-14


In [11]:
employees.drop_nulls()

employees.drop_nulls(subset=["department"])

employees.drop_nulls(subset=["department", "email"])

name,department,email,salary,years_at_company,start_date
str,str,null,i64,i64,date


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.drop_nulls.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.drop_nulls.html

## Sorting by Single Column
- Sorting changes the order of rows based on one or more columns' values.
- The number of rows in the `DataFrame` remains the same in the sorted result.

In [12]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(3)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01


- The default sort order is ascending (smallest to largest, alphabetical, earliest to latest).
- Pandas has an `ascending` parameter. Polars has a `descending` parameter, which defaults to `False`.
- An ascending sort orders dates from earliest to latest.

In [13]:
employees.sort("start_date")

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Terry Walls""",,"""terry.walls@polars.io""",89421,10,2014-07-24
"""Jeremy Harris""","""HR""","""jeremy.harris@polars.io""",74442,10,2014-07-25
"""Kylie Clarke""","""HR""","""kylie.clarke@polars.io""",66997,10,2014-08-01
"""Keith Gross""","""Marketing""","""keith.gross@polars.io""",128607,10,2014-08-11
"""Paul Guerrero""","""Engineering""","""paul.guerrero@polars.io""",184788,10,2014-08-15
…,…,…,…,…,…
"""Reginald Wallace""","""Finance""","""reginald.wallace@polars.io""",142361,0,2025-06-29
"""Isaiah Smith""","""HR""","""isaiah.smith@polars.io""",86023,0,2025-06-29
"""Carrie Montoya""","""Engineering""","""carrie.montoya@polars.io""",145307,0,2025-07-04
"""Andrew Lowery""","""Operations""","""andrew.lowery@polars.io""",88076,0,2025-07-14


- Pass `True` to the `descending` parameter to sort in descending order.

In [14]:
employees.sort(pl.col("salary"), descending=True)
employees.sort(pl.col("name"), descending=True)
employees.sort(pl.col("start_date"), descending=True)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Sarah Haney""","""Engineering""","""sarah.haney@polars.io""",191943,0,2025-07-16
"""Andrew Lowery""","""Operations""","""andrew.lowery@polars.io""",88076,0,2025-07-14
"""Carrie Montoya""","""Engineering""","""carrie.montoya@polars.io""",145307,0,2025-07-04
"""Reginald Wallace""","""Finance""","""reginald.wallace@polars.io""",142361,0,2025-06-29
"""Isaiah Smith""","""HR""","""isaiah.smith@polars.io""",86023,0,2025-06-29
…,…,…,…,…,…
"""Paul Guerrero""","""Engineering""","""paul.guerrero@polars.io""",184788,10,2014-08-15
"""Keith Gross""","""Marketing""","""keith.gross@polars.io""",128607,10,2014-08-11
"""Kylie Clarke""","""HR""","""kylie.clarke@polars.io""",66997,10,2014-08-01
"""Jeremy Harris""","""HR""","""jeremy.harris@polars.io""",74442,10,2014-07-25


- The `nulls_last` parameter can force `null` (missing) values to the end of the sort.

In [15]:
employees.sort("department")

employees.sort("department", nulls_last=True)

employees.sort("department", nulls_last=True, descending=True)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Brandi Medina""","""Sales""","""brandi.medina@polars.io""",81482,10,2015-02-06
"""Ashley Black""","""Sales""","""ashley.black@polars.io""",58143,10,2015-05-25
"""Ashley Parsons""","""Sales""","""ashley.parsons@polars.io""",73527,1,2024-05-23
"""Rachel Hogan""","""Sales""","""rachel.hogan@polars.io""",87529,3,2021-07-30
"""Alisha Lewis""","""Sales""","""alisha.lewis@polars.io""",60832,1,2023-09-27
…,…,…,…,…,…
"""Amber Smith""",,"""amber.smith@polars.io""",88525,0,2024-08-09
"""Jennifer Murphy""",,"""jennifer.murphy@polars.io""",79626,1,2024-07-13
"""James Bryant""",,"""james.bryant@polars.io""",85285,9,2016-05-09
"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,2025-02-12


- Polars sorts uppercase letters before lowercase ones.

In [16]:
pl.Series("fruits", ["Apple", "bananas", "Pear"]).sort()

fruits
str
"""Apple"""
"""Pear"""
"""bananas"""


- We can also sort an individual column in an expression.
- Say we want to redistribute the existing salaries across the employees in sorted order.
- The original connection between a row and its salary will be lost.

In [17]:
employees.with_columns(pl.col("salary").sort())

employees.with_columns(pl.col("salary").sort(descending=True))

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",199503,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",199381,10,2015-03-01
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",199260,5,2019-11-25
"""Sierra Ross""",,"""sierra.ross@polars.io""",199257,7,2018-02-14
…,…,…,…,…,…
"""James Bryant""",,"""james.bryant@polars.io""",55304,9,2016-05-09
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",55242,6,2019-02-20
"""Katie Clay""",,"""katie.clay@polars.io""",55078,0,2025-02-12
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",55012,4,2020-11-07


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.sort.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.sort.html

## Sorting by Multiple Columns I
- Pass multiple strings to the `sort` method to sort by multiple columns.
- Polars will apply a uniform ascending sort to each column by default.

In [18]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(3)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01


- Polars places `null` values first by default. With `nulls_last=True`, Polars sorts by department, then by name within each department, and places rows with a `null` department at the end.

In [19]:
employees.sort("department", "name", nulls_last=True)

employees.sort("department", "salary", nulls_last=True)

employees.sort(["department", "name"], nulls_last=True)

employees.sort(pl.col("department"), pl.col("name"), nulls_last=True)

employees.sort(pl.col("department", "name"), nulls_last=True)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Amanda Carter""","""Engineering""","""amanda.carter@polars.io""",175965,10,2015-02-04
"""Amanda Meyer""","""Engineering""","""amanda.meyer@polars.io""",128286,1,2024-02-27
"""Amanda Rodriguez""","""Engineering""","""amanda.rodriguez@polars.io""",127603,0,2024-11-27
"""Amy Pham""","""Engineering""","""amy.pham@polars.io""",175067,7,2017-08-22
…,…,…,…,…,…
"""Troy Allen""",,"""troy.allen@polars.io""",55432,8,2017-05-08
"""Valerie Rivera""",,"""valerie.rivera@polars.io""",73413,4,2021-03-08
"""Veronica Gutierrez""",,"""veronica.gutierrez@polars.io""",65010,6,2019-02-09
"""Wendy Thomas""",,"""wendy.thomas@polars.io""",66870,4,2021-05-29


## Sorting by Multiple Columns II
- Polars sorts each column in ascending order by default.
- Pass the `descending` parameter a list of Booleans to customize sort order by column.
- The number of entries in the list must match the number of columns we sort by.

In [20]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(3)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01


In [21]:
employees.sort("department", "name")

employees.sort("department", "name", descending=[False, False])

employees.sort("department", "name", descending=[True, True])

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Yolanda Chen""",,"""yolanda.chen@polars.io""",68038,9,2015-09-28
"""Wendy Thomas""",,"""wendy.thomas@polars.io""",66870,4,2021-05-29
"""Veronica Gutierrez""",,"""veronica.gutierrez@polars.io""",65010,6,2019-02-09
"""Valerie Rivera""",,"""valerie.rivera@polars.io""",73413,4,2021-03-08
"""Troy Allen""",,"""troy.allen@polars.io""",55432,8,2017-05-08
…,…,…,…,…,…
"""Amy Pham""","""Engineering""","""amy.pham@polars.io""",175067,7,2017-08-22
"""Amanda Rodriguez""","""Engineering""","""amanda.rodriguez@polars.io""",127603,0,2024-11-27
"""Amanda Meyer""","""Engineering""","""amanda.meyer@polars.io""",128286,1,2024-02-27
"""Amanda Carter""","""Engineering""","""amanda.carter@polars.io""",175965,10,2015-02-04


- Provide both `True` and `False` to sort in different order per column.
- `[True, False]` sorts `department` in descending order and `name` in ascending order within each department.
- `[False, True]` sorts `department` in ascending order and `name` in descending order within each department.

In [22]:
employees.sort("department", "name", descending=[True, False])

employees.sort("department", "name", descending=[False, True])

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Yolanda Chen""",,"""yolanda.chen@polars.io""",68038,9,2015-09-28
"""Wendy Thomas""",,"""wendy.thomas@polars.io""",66870,4,2021-05-29
"""Veronica Gutierrez""",,"""veronica.gutierrez@polars.io""",65010,6,2019-02-09
"""Valerie Rivera""",,"""valerie.rivera@polars.io""",73413,4,2021-03-08
"""Troy Allen""",,"""troy.allen@polars.io""",55432,8,2017-05-08
…,…,…,…,…,…
"""Alisha Lewis""","""Sales""","""alisha.lewis@polars.io""",60832,1,2023-09-27
"""Alicia Jones""","""Sales""","""alicia.jones@polars.io""",80066,0,2025-02-10
"""Aimee Reeves""","""Sales""","""aimee.reeves@polars.io""",64971,0,2024-10-16
"""Adam Harris""","""Sales""","""adam.harris@polars.io""",87220,5,2020-03-16


## Characters vs Bytes

In [23]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(3)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01


- Earlier, we introduced the `name` attribute/namespace for methods that deal with column names.
- As always, the methods return a new expression that we can apply in a specific Polars context.

In [24]:
employees.select(pl.all().name.to_uppercase()).head(1)

NAME,DEPARTMENT,EMAIL,SALARY,YEARS_AT_COMPANY,START_DATE
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14


- The `str` namespace contains methods for string manipulations.
- The `to_lowercase` and `to_uppercase` methods convert the values in a column to lowercase or uppercase.

In [25]:
employees.select(
    pl.col("name").str.to_lowercase().alias("lower"),
    pl.col("name").str.to_uppercase().alias("upper"),
)

lower,upper
str,str
"""nicholas maldonado""","""NICHOLAS MALDONADO"""
"""michael fletcher""","""MICHAEL FLETCHER"""
"""jeffrey tanner""","""JEFFREY TANNER"""
"""diana weaver""","""DIANA WEAVER"""
"""sierra ross""","""SIERRA ROSS"""
…,…
"""james bryant""","""JAMES BRYANT"""
"""patricia vazquez""","""PATRICIA VAZQUEZ"""
"""katie clay""","""KATIE CLAY"""
"""monique swanson""","""MONIQUE SWANSON"""


- The `str.len_chars` method returns a count of characters in a string.
- The complementary `str.len_bytes` method counts the bytes in a string.
- Polars stores strings as UTF-8, in which one English alphabetic character occupies one byte in memory.
- The 1-to-1 relationship does not always hold. An emoji like 🍕 has 1 character but occupies 4 bytes.
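- The same distinction can be checked in plain Python, without Polars, by comparing the length of a string with the length of its UTF-8 encoding. The sample strings are illustrative.

```python
# "caf\u00e9" ends with a single accented character (é).
samples = ["pizza", "caf\u00e9", "🍕"]

# Map each string to (character count, UTF-8 byte count).
lengths = {text: (len(text), len(text.encode("utf-8"))) for text in samples}

for text, (chars, utf8_bytes) in lengths.items():
    print(text, chars, utf8_bytes)
# pizza 5 5
# café 4 5
# 🍕 1 4
```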

In [26]:
pl.DataFrame({"foods": ["pizza", "🍕"]}).with_columns(
    pl.col("foods").str.len_chars().alias("length_in_chars"),
    pl.col("foods").str.len_bytes().alias("length_in_bytes"),
)

foods,length_in_chars,length_in_bytes
str,u32,u32
"""pizza""",5,5
"""🍕""",1,4


### Further Reading
- https://docs.pola.rs/user-guide/expressions/strings/#the-string-namespace
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.to_lowercase.html
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.to_uppercase.html
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.len_bytes.html
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.len_chars.html

## Sorting with Expressions
- We can pass an expression to the `sort` method to customize the sort.

In [27]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(3)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01


In [28]:
employees.select(pl.col("name").str.len_chars())

name_lengths = pl.col("name").str.len_chars()
name_lengths

- We can use the expression as the basis of a custom sort.
- Let's sort the rows based on the lengths of the employees' names.
- Custom expressions can be combined with plain column expressions.

In [29]:
employees.sort(name_lengths)
employees.sort(name_lengths, descending=True)

employees.sort("department", name_lengths, nulls_last=True)

employees.sort("department", name_lengths, descending=[False, True])

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Mrs. Hannah Copeland""",,"""mrs..copeland@polars.io""",65097,3,2022-01-24
"""Stephanie Contreras""",,"""stephanie.contreras@polars.io""",78467,1,2023-07-20
"""Christopher Blevins""",,"""christopher.blevins@polars.io""",73644,8,2016-09-23
"""Mr. Robert Castillo""",,"""mr..castillo@polars.io""",74553,6,2019-06-21
"""Veronica Gutierrez""",,"""veronica.gutierrez@polars.io""",65010,6,2019-02-09
…,…,…,…,…,…
"""Lisa Cline""","""Sales""","""lisa.cline@polars.io""",85057,7,2017-11-05
"""John Lucas""","""Sales""","""john.lucas@polars.io""",69149,8,2017-05-30
"""Joy Baker""","""Sales""","""joy.baker@polars.io""",72449,6,2019-02-03
"""Kari Diaz""","""Sales""","""kari.diaz@polars.io""",86009,9,2016-04-23


## The top_k and bottom_k Methods
- The `top_k` method extracts a specified number of the greatest/maximum values.
- The `bottom_k` method extracts a specified number of the smallest/minimum values.

In [30]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(3)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01


In [31]:
employees.sort("salary", descending=True).head(3)

employees.select(pl.col("salary").top_k(3))

salary
i64
250000
199503
199381


In [32]:
employees.select(
    pl.col("salary").top_k(3).alias("top_3_salaries"),
    pl.col("salary").bottom_k(3).alias("bottom_3_salaries"),
)

top_3_salaries,bottom_3_salaries
i64,i64
250000,55011
199503,55012
199381,55078


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.top_k.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.bottom_k.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.top_k.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.bottom_k.html

## The rank Method
- The `rank` method assigns each row value a position in line based on its order.
- By default, the smallest value has a ranking of #1.

In [33]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(3)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01


- Let's rank the employees by salary. The employee with the greatest salary should be 1.
- Pass the `descending` parameter a value of `True` to rank from largest to smallest.
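- Multiple occurrences of the same value will share a rank. With the default `method="average"`, tied values receive the average of the positions they occupy, so the result is a float column.
- The ranking then continues from the next position after the tied group.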

In [34]:
employees.with_columns(
    pl.col("salary").rank(descending=True).cast(pl.UInt16).alias("salary_rank")
).sort("salary_rank")

name,department,email,salary,years_at_company,start_date,salary_rank
str,str,str,i64,i64,date,u16
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14,1
"""Jack Tanner""","""Engineering""","""jack.tanner@polars.io""",199503,1,2024-02-23,2
"""Shawn Gray""","""Engineering""","""shawn.gray@polars.io""",199381,2,2022-12-15,3
"""Shannon Klein""","""Finance""","""shannon.klein@polars.io""",199260,7,2017-07-19,4
"""James Wilson""","""Finance""","""james.wilson@polars.io""",199257,2,2023-01-06,5
…,…,…,…,…,…,…
"""Brenda Lopez""",,"""brenda.lopez@polars.io""",55304,4,2021-04-19,996
"""Cristina Williams""",,"""cristina.williams@polars.io""",55242,8,2016-07-21,997
"""Aaron Morgan""",,"""aaron.morgan@polars.io""",55078,10,2014-08-21,998
"""Joseph Lopez""",,"""joseph.lopez@polars.io""",55012,2,2022-08-13,999


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.rank.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cast.html

## The shuffle Method
- The `shuffle` method randomizes the order of elements in a column.

In [35]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(3)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01


In [36]:
one_thousand_bools = [True] * 250 + [False] * 750
one_thousand_bools


In [37]:
employees.with_columns(pl.Series("getting_promotion", one_thousand_bools)).with_columns(
    pl.col("getting_promotion").shuffle()
)

name,department,email,salary,years_at_company,start_date,getting_promotion
str,str,str,i64,i64,date,bool
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14,false
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13,false
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01,false
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,2019-11-25,false
"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,2018-02-14,true
…,…,…,…,…,…,…
"""James Bryant""",,"""james.bryant@polars.io""",85285,9,2016-05-09,false
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",92190,6,2019-02-20,false
"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,2025-02-12,false
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",196704,4,2020-11-07,false


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.shuffle.html

## Counting and Extracting Unique Values
- The `n_unique` method counts the exact number of unique values in a column.
- Polars includes a `null` (missing) value in the count.
- The `approx_n_unique` method returns an approximate count of unique values.

In [38]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(3)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01


In [39]:
employees.select(pl.col("department").n_unique())

employees.select(pl.col("department", "email").n_unique())

employees.select(pl.all().n_unique())

employees.select(pl.n_unique("department", "email"))

department,email
u32,u32
8,989


- For small datasets, the `approx_n_unique` method will likely return a perfect value.
- For larger datasets, the `approx_n_unique` method may be slightly off but will run faster.
- The more unique values that a column holds, the greater the chance that `approx_n_unique` will be off.
- The `approx_n_unique` method does not support datetime columns.

In [40]:
employees.select(pl.col("department", "email").approx_n_unique())

employees.select(pl.approx_n_unique("department", "email"))

department,email
u32,u32
8,993


- The `unique` method returns the unique values from the specified column.
- Each unique value is listed once.

In [41]:
employees.select(pl.col("department").unique())

# employees.select(pl.col("department", "email").unique())

department
str
"""HR"""
""
"""CEO"""
"""Engineering"""
"""Marketing"""
"""Sales"""
"""Operations"""
"""Finance"""


### Further Reading
- https://docs.pola.rs/user-guide/expressions/basic-operations/#counting-unique-values
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.unique.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.n_unique.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.approx_n_unique.html

## The value_counts Method
- The `value_counts` method counts the number of occurrences of each unique value.
- The `value_counts` method returns a column of structs.
- A struct is a data structure that is comparable to a Python dictionary. It consists of key-value pairs.
- Each row's struct holds two pieces of data, the `department` name and its count.

In [42]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13


In [43]:
employees.select(pl.col("department").value_counts())

department
struct[2]
"{""CEO"",1}"
"{""HR"",150}"
"{null,155}"
"{""Operations"",136}"
"{""Marketing"",146}"
"{""Engineering"",139}"
"{""Finance"",144}"
"{""Sales"",129}"


- The `unnest` method on a `DataFrame` extracts each row's struct's values into separate columns.
- The struct's key-value pairs are "nested" within the struct -- this "unnests" them.

In [44]:
employees.select(pl.col("department").value_counts()).unnest(pl.col("department"))

department,count
str,u32
"""Engineering""",139
"""Operations""",136
"""Finance""",144
"""Sales""",129
"""Marketing""",146
,155
"""HR""",150
"""CEO""",1


- Pass `True` to the `sort` parameter to order the results from the most frequently occurring value to the least.

In [45]:
employees.select(pl.col("department").value_counts(sort=True))

department
struct[2]
"{null,155}"
"{""HR"",150}"
"{""Marketing"",146}"
"{""Finance"",144}"
"{""Engineering"",139}"
"{""Operations"",136}"
"{""Sales"",129}"
"{""CEO"",1}"


- Set `normalize` to `True` to see the relative proportion of each unique value.

In [46]:
employees.select(pl.col("department").value_counts(sort=True, normalize=True))

department
struct[2]
"{null,0.155}"
"{""HR"",0.15}"
"{""Marketing"",0.146}"
"{""Finance"",0.144}"
"{""Engineering"",0.139}"
"{""Operations"",0.136}"
"{""Sales"",0.129}"
"{""CEO"",0.001}"


- Notice the column name is `proportion` instead of `count`.
- The column name comes directly from the key inside the struct.

In [47]:
employees.select(pl.col("department").value_counts(sort=True, normalize=True)).unnest(
    pl.col("department")
)

department,proportion
str,f64
,0.155
"""HR""",0.15
"""Marketing""",0.146
"""Finance""",0.144
"""Engineering""",0.139
"""Operations""",0.136
"""Sales""",0.129
"""CEO""",0.001


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.value_counts.html