# DataFrames I

In [1]:
import polars as pl

## Intro to DataFrames
- A `DataFrame` is a 2-dimensional table consisting of rows and columns.
- A `DataFrame` is a collection of `Series` glued together.
- The columns in a `DataFrame` can be of different data types but must be of the same length.
- The data within any single `Series` must be homogeneous.

![DataFrame data structure](images/DataFrame.png)

## Create a DataFrame from Scratch
- The `pl.DataFrame` class constructor accepts a variety of inputs.
- One option is a dictionary that maps column names to column values.
- Pass a list for each column's values. The lengths of the lists must be equal.

In [2]:
pl.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "credit_score": [3.4, 6.8, 9.2, 13.3, 18.9],
    }
)

id,credit_score
i64,f64
1,3.4
2,6.8
3,9.2
4,13.3
5,18.9


- The output includes the shape/dimensions of the `DataFrame` (height x width).
- Polars also prints the data type of each column.

### Further Reading
- https://docs.pola.rs/user-guide/concepts/data-types-and-structures/#dataframe
- https://docs.pola.rs/api/python/stable/reference/dataframe/index.html

## Read a DataFrame from CSV
- The `pl.read_csv` function imports a CSV file as a `DataFrame`.
- The function's parameters can customize details like which columns to target, how many rows to include, what values constitute a missing value, and more.
- Polars will output 10 `DataFrame` rows by default: the first 5 and the last 5, separated by a gap.
- Polars uses an ellipsis (`…`) to mark the gap in the data. The data is still present; it's just not printed.

In [3]:
employees = pl.read_csv("employees.csv")
employees

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,"""2015-03-01"""
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,"""2019-11-25"""
"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,"""2018-02-14"""
…,…,…,…,…,…
"""James Bryant""",,"""james.bryant@polars.io""",85285,9,"""2016-05-09"""
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",92190,6,"""2019-02-20"""
"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,"""2025-02-12"""
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",196704,4,"""2020-11-07"""


- A **method** is a command, while an attribute is a piece of information.
- The `head` method returns rows from the start of the `DataFrame`.
- The `limit` method is an alias for `head`.
- The `tail` method returns rows from the end of the `DataFrame`.

In [4]:
employees.head(7)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,"""2015-03-01"""
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,"""2019-11-25"""
"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,"""2018-02-14"""
"""Melissa Page""","""Marketing""","""melissa.page@polars.io""",114120,9,"""2015-12-28"""
"""Martin Adams""",,"""martin.adams@polars.io""",61705,0,"""2024-10-28"""


In [5]:
employees.limit(4)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,"""2015-03-01"""
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,"""2019-11-25"""


In [6]:
employees.tail(9)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Peter Green""","""Operations""","""peter.green@polars.io""",93617,4,"""2020-11-16"""
"""Patricia Duncan""","""Finance""","""patricia.duncan@polars.io""",188852,5,"""2020-06-14"""
"""Jennifer Murphy""",,"""jennifer.murphy@polars.io""",79626,1,"""2024-07-13"""
"""Rachel Walker""","""HR""","""rachel.walker@polars.io""",77912,6,"""2019-04-20"""
"""James Bryant""",,"""james.bryant@polars.io""",85285,9,"""2016-05-09"""
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",92190,6,"""2019-02-20"""
"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,"""2025-02-12"""
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",196704,4,"""2020-11-07"""
"""Christopher Lynch""",,"""christopher.lynch@polars.io""",85372,3,"""2022-07-04"""


- An **attribute** is a piece of data that lives on the object.
- Invoking a method requires parentheses. Accessing an attribute requires only a dot and the attribute name.
- The `columns` attribute provides a list of the column names.
- The `dtypes` attribute provides a list of the columns' data types.

In [7]:
employees.columns

['name', 'department', 'email', 'salary', 'years_at_company', 'start_date']

In [8]:
employees.dtypes

[String, String, String, Int64, Int64, String]

- A **schema** is a "representation of a plan in an outline or model".
- The `schema` attribute returns a `Schema` object that connects each column to its type.
- Printing the `Schema` instance displays the relationships between columns and data types.
- The display format of data types may differ between the schema and the printed `DataFrame`.

In [9]:
employees.schema

Schema([('name', String),
        ('department', String),
        ('email', String),
        ('salary', Int64),
        ('years_at_company', Int64),
        ('start_date', String)])

- The `DataFrame` outputs its dimensions (height x width) when printed.
- The height is the number of rows, and the width is the number of columns.
- The `height` and `width` attributes provide the same shape information.
- The `shape` attribute is a tuple with the dimensions.

In [10]:
employees.height

1000

In [11]:
employees.width

6

In [12]:
employees.shape

(1000, 6)

In [13]:
employees

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,"""2015-03-01"""
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,"""2019-11-25"""
"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,"""2018-02-14"""
…,…,…,…,…,…
"""James Bryant""",,"""james.bryant@polars.io""",85285,9,"""2016-05-09"""
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",92190,6,"""2019-02-20"""
"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,"""2025-02-12"""
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",196704,4,"""2020-11-07"""


### Further Reading
- https://docs.pola.rs/user-guide/concepts/data-types-and-structures/#head
- https://docs.pola.rs/user-guide/concepts/data-types-and-structures/#tail
- https://docs.pola.rs/user-guide/concepts/data-types-and-structures/#schema
- https://docs.pola.rs/api/python/stable/reference/api/polars.read_csv.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.columns.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.dtypes.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.schema.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.height.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.width.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.shape.html

## No Index, No Problem
- Unlike Pandas, Polars DataFrames do not create an ascending numeric index starting from 0.
- The `with_row_index` method will add an `index` column at the start of the `DataFrame`.

In [14]:
employees = pl.read_csv("employees.csv")
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""


In [15]:
employees.with_row_index()

index,name,department,email,salary,years_at_company,start_date
u32,str,str,str,i64,i64,str
0,"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
1,"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""
2,"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,"""2015-03-01"""
3,"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,"""2019-11-25"""
4,"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,"""2018-02-14"""
…,…,…,…,…,…,…
995,"""James Bryant""",,"""james.bryant@polars.io""",85285,9,"""2016-05-09"""
996,"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",92190,6,"""2019-02-20"""
997,"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,"""2025-02-12"""
998,"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",196704,4,"""2020-11-07"""


- The `offset` parameter sets the starting number.
- Use 1 for an index that starts from 1.

In [16]:
employees.with_row_index(offset=1).head()

employees.with_row_index(offset=58).head()

index,name,department,email,salary,years_at_company,start_date
u32,str,str,str,i64,i64,str
58,"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
59,"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""
60,"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,"""2015-03-01"""
61,"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,"""2019-11-25"""
62,"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,"""2018-02-14"""


- The `name` parameter sets the index column name.

In [17]:
employees.with_row_index(offset=58, name="id").head()

id,name,department,email,salary,years_at_company,start_date
u32,str,str,str,i64,i64,str
58,"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
59,"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""
60,"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,"""2015-03-01"""
61,"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,"""2019-11-25"""
62,"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,"""2018-02-14"""


- The `read_csv` function also supports `row_index_name` and `row_index_offset` parameters for the same result.

In [18]:
pl.read_csv("employees.csv", row_index_name="id", row_index_offset=1)

id,name,department,email,salary,years_at_company,start_date
u32,str,str,str,i64,i64,str
1,"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
2,"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""
3,"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,"""2015-03-01"""
4,"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,"""2019-11-25"""
5,"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,"""2018-02-14"""
…,…,…,…,…,…,…
996,"""James Bryant""",,"""james.bryant@polars.io""",85285,9,"""2016-05-09"""
997,"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",92190,6,"""2019-02-20"""
998,"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,"""2025-02-12"""
999,"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",196704,4,"""2020-11-07"""


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.with_row_index.html

## Intro to Expressions
- An **expression** is an instruction for how to target or transform data.
- An **expression** is a future computation. It's an evaluation that only has significance when combined with a function/method.

In [19]:
employees = pl.read_csv("employees.csv")
employees

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,"""2015-03-01"""
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,"""2019-11-25"""
"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,"""2018-02-14"""
…,…,…,…,…,…
"""James Bryant""",,"""james.bryant@polars.io""",85285,9,"""2016-05-09"""
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",92190,6,"""2019-02-20"""
"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,"""2025-02-12"""
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",196704,4,"""2020-11-07"""


- The `pl.col` function creates an expression that targets a column.
- The expression has no association with a specific `DataFrame`.
- We use the terminology "lazy" because the expression is a building block for a future operation.
- The current expression's logic is: Locate a `years_at_company` column.

In [20]:
pl.col("years_at_company")

In [21]:
type(pl.col("years_at_company"))

polars.expr.expr.Expr

- Expressions have methods. Each method returns a new expression with an expanded set of instructions.
- Let's create an expression that targets a column named `years_at_company` and calculates the average of its values.
- The expression is lazy. It has no knowledge of a specific `DataFrame`. It also doesn't know the data type of its column.
- The current expression's logic is: Locate the `years_at_company` column, take the average of its values.

In [22]:
pl.col("years_at_company").mean()

In [23]:
type(pl.col("years_at_company").mean())

polars.expr.expr.Expr

### Further Reading
- https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#expressions
- https://docs.pola.rs/api/python/stable/reference/expressions/col.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.mean.html

## The select Method I
- An expression is a lazy transformation that does not execute until it is used in a specific _context_.
- A **context** is a situation that forces the computation.
- Most Polars work consists of combining expressions with methods.

In [24]:
employees = pl.read_csv("employees.csv")
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""


- The `select` method executes one or more expressions.
- Polars gathers the results of the expression executions in a new `DataFrame`.
- Each `select` method expression creates a new `DataFrame` column.
- The simplest example of an expression is selecting a column. There are no additional transformations but the _selection_ of a column constitutes a step.
- If the expression targets an existing column, Polars will keep the column's name in the new `DataFrame`.

In [25]:
employees.select(pl.col("years_at_company"))

years_at_company
i64
9
9
10
5
7
…
9
6
0
4


- The `mean` method transforms an expression to calculate the average of a column.
- In this scenario, the `select` method returns a `DataFrame` with a single row holding the average.
- The `years_at_company` column stores integers, but the average calculation produces a float.

In [26]:
employees.select(pl.col("years_at_company").mean())

years_at_company
f64
5.141


In [27]:
employees.select(pl.col("years_at_company"), pl.col("name"))

employees.select(pl.col("name"), pl.col("years_at_company"))

name,years_at_company
str,i64
"""Nicholas Maldonado""",9
"""Michael Fletcher""",9
"""Jeffrey Tanner""",10
"""Diana Weaver""",5
"""Sierra Ross""",7
…,…
"""James Bryant""",9
"""Patricia Vazquez""",6
"""Katie Clay""",0
"""Monique Swanson""",4


### Further Reading
- https://docs.pola.rs/user-guide/getting-started/#select
- https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#contexts
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.select.html

## Renaming Columns
- Invoke the `alias` method on an expression to rename the targeted column.
- The `alias` method is another example of a method that returns a new expression from an existing expression.
- The current expression's logic becomes: Locate a `years_at_company` column, calculate the average of its values, rename the column to `years`.

In [28]:
employees = pl.read_csv("employees.csv")
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""


In [29]:
employees.select(pl.col("years_at_company").mean().alias("average"))

average
f64
5.141


- Alternatively, pass the `select` method a keyword argument to set the name of the column.
- With either syntax option, Polars returns a new `DataFrame`.

In [30]:
employees.select(average=pl.col("years_at_company").mean())

average
f64
5.141


- Polars will raise an error if multiple columns have the same name.

In [31]:
employees.select(
    pl.col("years_at_company").mean().alias("average"),
    pl.col("years_at_company").sum().alias("sum"),
)

average,sum
f64,i64
5.141,5141


In [32]:
employees.select(
    pl.col("years_at_company").mean().alias("year_average"),
    pl.col("salary").mean().alias("salary_average"),
)

year_average,salary_average
f64,f64
5.141,109790.73


In [33]:
employees.select(
    salary_average=pl.col("salary").mean(), salary_sum=pl.col("salary").sum()
)

salary_average,salary_sum
f64,i64
109790.73,109790730


- The `select` method is non-mutating.
- The method returns a new `DataFrame`; it does _not_ modify the existing `DataFrame`.
- The inconsistency between copies versus in-place mutations was a pain point in Pandas.
- To keep the changes, assign the resulting `DataFrame` to a variable.

In [34]:
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.mean.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.sum.html

## The select Method II
- The `select` method supports a variety of syntax options.
- Polars recommends targeting every column with an individual `pl.col` expression.
- The separation of columns makes it easier to build different expressions for each column.

In [35]:
employees = pl.read_csv("employees.csv")
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""


- The `select` method accepts sequential string arguments or a list of the columns' names.
- Expression expansion is a shorthand notation that applies the same transformation to multiple columns.
- Polars will utilize whatever optimization enables calculations to execute in parallel.
- For example, Polars will convert sequential string arguments to `select` into expressions.

In [36]:
employees.select(["salary", "name"])

employees.select(
    pl.col("salary"),
    pl.col("name"),
)

salary,name
i64,str
250000,"""Nicholas Maldonado"""
96540,"""Michael Fletcher"""
126489,"""Jeffrey Tanner"""
84672,"""Diana Weaver"""
148601,"""Sierra Ross"""
…,…
85285,"""James Bryant"""
92190,"""Patricia Vazquez"""
87151,"""Katie Clay"""
196704,"""Monique Swanson"""


- The `pl.col` function creates an expression, a building block for future transformations.
- The `pl.col` function also supports sequential arguments as well as a list input.
- The `pl.col([])` syntax works well when applying consistent operations to the list's columns.

In [37]:
employees.select(pl.col(["name", "salary"]))

name,salary
str,i64
"""Nicholas Maldonado""",250000
"""Michael Fletcher""",96540
"""Jeffrey Tanner""",126489
"""Diana Weaver""",84672
"""Sierra Ross""",148601
…,…
"""James Bryant""",85285
"""Patricia Vazquez""",92190
"""Katie Clay""",87151
"""Monique Swanson""",196704


- Polars can perform aggregate operations on every targeted column in an expression.
- The following example calculates the largest value within both the `years_at_company` and `salary` columns.
- Polars aggregates each column independently, so the two values need not come from the same employee.

In [38]:
employees.select(pl.col(["years_at_company", "salary"]).max())

years_at_company,salary
i64,i64
10,250000


- Polars supports multiple expression arguments to `select`.
- This syntax is the most flexible because code can apply different logic to different columns.

In [39]:
employees.select(pl.col("name"), pl.col("salary")).head()

name,salary
str,i64
"""Nicholas Maldonado""",250000
"""Michael Fletcher""",96540
"""Jeffrey Tanner""",126489
"""Diana Weaver""",84672
"""Sierra Ross""",148601


- The next example extracts the greatest name (last alphabetical value) and the smallest salary.
- Polars applies different operations to different columns.

In [40]:
employees.select(pl.col("name").max(), pl.col("salary").min()).head()

name,salary
str,i64
"""Zachary Woods""",55011


- Polars also supports a list of expressions as an argument to `select`.
- This syntax can be helpful when constructing dynamic lists.

In [41]:
employees.select([pl.col("name"), pl.col("salary")])

name,salary
str,i64
"""Nicholas Maldonado""",250000
"""Michael Fletcher""",96540
"""Jeffrey Tanner""",126489
"""Diana Weaver""",84672
"""Sierra Ross""",148601
…,…
"""James Bryant""",85285
"""Patricia Vazquez""",92190
"""Katie Clay""",87151
"""Monique Swanson""",196704


In [42]:
columns = ["salary", "years_at_company", "name"]
employees.select([pl.col(column).max() for column in columns])

salary,years_at_company,name
i64,i64,str
250000,10,"""Zachary Woods"""


### Further Reading
- https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#expression-expansion
- https://docs.pola.rs/user-guide/expressions/expression-expansion/#function-col
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.select.html

## The select Method III: Targeting by Data Type
- The `pl.col` function also accepts Polars data types.
- All Polars data types are available at the top-level `pl` namespace.
- The `select` method will extract all columns of that data type.

In [43]:
employees = pl.read_csv("employees.csv")
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""


In [44]:
employees.select(pl.col(pl.String))

employees.select(pl.col(pl.Int64))

salary,years_at_company
i64,i64
250000,9
96540,9
126489,10
84672,5
148601,7
…,…
85285,9
92190,6
87151,0
196704,4


- The `col` function supports multiple types.
- Data types must be exact. `pl.Int32` will not match a `pl.Int64` column.

In [45]:
employees.select(pl.col(pl.String, pl.Int64))

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,"""2015-03-01"""
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,"""2019-11-25"""
"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,"""2018-02-14"""
…,…,…,…,…,…
"""James Bryant""",,"""james.bryant@polars.io""",85285,9,"""2016-05-09"""
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",92190,6,"""2019-02-20"""
"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,"""2025-02-12"""
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",196704,4,"""2020-11-07"""


## Expressions as Building Blocks
- Expressions are lazy steps to apply to a future computation.
- Expressions are reusable building blocks that exist independently of a specific `DataFrame`.
- Let's assign an expression to a variable and then reuse it across different `DataFrame`s.
- Like `employees.csv`, the `todos.csv` dataset has a `start_date` column.

In [46]:
employees = pl.read_csv("employees.csv")
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""


In [47]:
todos = pl.read_csv("todos.csv")
todos

todos,start_date
str,str
"""Pay mortgage""","""2026-12-31"""
"""Exercise""","""2026-10-21"""
"""Study Polars""","""2026-01-01"""


In [48]:
start_date = pl.col("start_date").alias("date")
start_date

In [49]:
employees.select(start_date).head()

date
str
"""2016-07-14"""
"""2016-02-13"""
"""2015-03-01"""
"""2019-11-25"""
"""2018-02-14"""


In [50]:
todos.select(start_date)

date
str
"""2026-12-31"""
"""2026-10-21"""
"""2026-01-01"""


- An expression does not care about the column's data type.
- The `read_csv` function supports a `try_parse_dates` parameter that attempts to convert date-like string columns to temporal types.
- With it enabled, Polars reads the `start_date` column as a `date` column rather than a string column.
- The `start_date` expression continues to work.

In [51]:
todos = pl.read_csv("todos.csv", try_parse_dates=True)
todos

todos,start_date
str,date
"""Pay mortgage""",2026-12-31
"""Exercise""",2026-10-21
"""Study Polars""",2026-01-01


In [52]:
todos.select(start_date)

date
date
2026-12-31
2026-10-21
2026-01-01


## Expressions that Count Values
- Expressions support methods that create new expressions with expanded instructions.
- The `len` method counts all values in the column.
- The `count` method counts only present values.
- The `null_count` method counts null values.
- The methods return a `u32` (unsigned integer) because the count must be 0 or positive.

In [53]:
employees = pl.read_csv("employees.csv")
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""


In [54]:
employees.select(
    pl.col("department").len().alias("dep_length"),
    pl.col("department").count().alias("dep_count"),
    pl.col("department").null_count().alias("dep_null_count"),
)

dep_length,dep_count,dep_null_count
u32,u32,u32
1000,845,155


In [55]:
department = pl.col("department")

employees.select(
    department.len().alias("dep_length"),
    department.count().alias("dep_count"),
    department.null_count().alias("dep_null_count"),
)

dep_length,dep_count,dep_null_count
u32,u32,u32
1000,845,155


- The `describe` method returns a `DataFrame` of statistics for columns.
- We can invoke it on `employees` but it's not particularly helpful for non-numeric columns.
- Let's use `select` to target the two numeric columns, then invoke `describe` on the new `DataFrame`.

In [56]:
employees.describe()

employees.select(pl.col("salary"), pl.col("years_at_company")).describe()

employees.select(pl.col("salary", "years_at_company")).describe()

statistic,salary,years_at_company
str,f64,f64
"""count""",1000.0,1000.0
"""null_count""",0.0,0.0
"""mean""",109790.73,5.141
"""std""",40031.948121,3.241238
"""min""",55011.0,0.0
"""25%""",77675.0,2.0
"""50%""",96767.0,5.0
"""75%""",135842.0,8.0
"""max""",250000.0,10.0


- Another way to solve this problem is targeting the columns by their shared `pl.Int64` data type.

In [57]:
employees.select(pl.col(pl.Int64)).describe()

statistic,salary,years_at_company
str,f64,f64
"""count""",1000.0,1000.0
"""null_count""",0.0,0.0
"""mean""",109790.73,5.141
"""std""",40031.948121,3.241238
"""min""",55011.0,0.0
"""25%""",77675.0,2.0
"""50%""",96767.0,5.0
"""75%""",135842.0,8.0
"""max""",250000.0,10.0


### Further Reading
- https://docs.pola.rs/user-guide/concepts/data-types-and-structures/#describe
- https://docs.pola.rs/user-guide/transformations/time-series/parsing/#parsing-dates-from-a-file
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.len.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.count.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.null_count.html

## Extracting One or More Rows
- Unlike Pandas, Polars does not maintain an index (a unique identifier for each row).
- Polars stores its data in columnar format. It is optimized for column operations.

In [58]:
employees = pl.read_csv("employees.csv")
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""


- We can still extract a row's values by its index position.
- Polars gathers the row's values from every column.
- The `row` method returns a tuple with a single row's values.
- A tuple is an immutable, ordered collection of values.

In [59]:
employees.row(0)
employees.row(1)
employees.row(100)

('Nathan Blanchard',
 'Marketing',
 'nathan.blanchard@polars.io',
 96290,
 7,
 '2018-01-22')

- The `slice` method extracts multiple rows by their index positions.
- The first argument is the starting row index.
- The second argument is the number of rows to extract.
- The next example starts at index 1 (row 2) and extracts 4 rows.
- The rows have index positions 1, 2, 3, and 4 in `employees`.

In [60]:
employees.slice(1, 4)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,"""2015-03-01"""
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,"""2019-11-25"""
"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,"""2018-02-14"""


- The first argument supports negative values.
- A negative value starts relative to the end of the `DataFrame`.
- Example: `(-8, 3)` starts 8 rows from the end of the `DataFrame` and extracts 3 rows.

In [61]:
employees.slice(-8, 3)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Patricia Duncan""","""Finance""","""patricia.duncan@polars.io""",188852,5,"""2020-06-14"""
"""Jennifer Murphy""",,"""jennifer.murphy@polars.io""",79626,1,"""2024-07-13"""
"""Rachel Walker""","""HR""","""rachel.walker@polars.io""",77912,6,"""2019-04-20"""


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.row.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.slice.html

## List Slicing Syntax
- Polars supports the list slicing syntax from Python.
- This approach is generally discouraged by the Polars team. Prefer `slice` and method-based approaches.

In [62]:
employees = pl.read_csv("employees.csv")
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""


- A single value targets a row by index position.

In [63]:
employees[5]

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Melissa Page""","""Marketing""","""melissa.page@polars.io""",114120,9,"""2015-12-28"""


- The list slicing syntax targets multiple rows by index position.
- The ending position is exclusive; the row at that index will be excluded.
- Subtract the first index from the second index to calculate the number of rows.

In [64]:
employees[5:8]

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Melissa Page""","""Marketing""","""melissa.page@polars.io""",114120,9,"""2015-12-28"""
"""Martin Adams""",,"""martin.adams@polars.io""",61705,0,"""2024-10-28"""
"""Jessica Hill""","""Marketing""","""jessica.hill@polars.io""",135265,5,"""2019-10-04"""


- Omit the value before the colon to extract from the start of the `DataFrame`.

In [65]:
employees[0:4]

employees[:4]

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,"""2015-03-01"""
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,"""2019-11-25"""


- Polars supports negative values in either or both positions.

In [66]:
employees[-10:-7]

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Tina Miller""","""Finance""","""tina.miller@polars.io""",196962,9,"""2016-06-04"""
"""Peter Green""","""Operations""","""peter.green@polars.io""",93617,4,"""2020-11-16"""
"""Patricia Duncan""","""Finance""","""patricia.duncan@polars.io""",188852,5,"""2020-06-14"""


- Omit the value after the colon to extract to the end of the `DataFrame`.

In [67]:
employees[-2:]

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",196704,4,"""2020-11-07"""
"""Christopher Lynch""",,"""christopher.lynch@polars.io""",85372,3,"""2022-07-04"""


## Expressions that Target Row Values

In [68]:
employees = pl.read_csv("employees.csv")
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""


- Invoke the `get` method on an expression to target a value in a specific column.

In [69]:
employees.select(pl.col("name").get(499))

name
str
"""Rachel Morris"""


In [70]:
employees.select(pl.col("name").get(499), pl.col("email").get(499))

name,email
str,str
"""Rachel Morris""","""rachel.morris@polars.io"""


- This is one example where multiple expressions create more verbose code.
- We can pass multiple values to `pl.col` to have the expression target multiple columns.

In [71]:
employees.select(pl.col("name", "email").get(499))

name,email
str,str
"""Rachel Morris""","""rachel.morris@polars.io"""


- Passing `pl.col` a complete list of columns will quickly become verbose.
- The `pl.all` helper function returns an expression that targets all columns.

In [72]:
employees.select(pl.all().get(499))

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Rachel Morris""","""Finance""","""rachel.morris@polars.io""",168722,6,"""2018-08-28"""


- Passing an asterisk (`*`) to `pl.col` targets all columns as well.

In [73]:
employees.select(pl.col("*").get(499))

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Rachel Morris""","""Finance""","""rachel.morris@polars.io""",168722,6,"""2018-08-28"""


### Further Reading
- https://docs.pola.rs/user-guide/expressions/expression-expansion/#selecting-all-columns
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.get.html

## Extracting a Single Value from DataFrame

In [74]:
employees = pl.read_csv("employees.csv")
employees.head(3)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,"""2015-03-01"""


- The `item` method extracts a single cell value by its row and column index position.
- The first argument is the row index, the second argument is the column index.
- Both the row and column index start counting from 0.

In [75]:
employees.item(0, 0)
employees.item(row=0, column=0)

'Nicholas Maldonado'

In [76]:
employees.item(0, 1)

'CEO'

In [77]:
employees.item(343, 3)

148647

- Be careful. A null/missing value produces no cell output in a notebook.
- Wrap the value in Python's `print` function to display `None`.
- Python's `None` object represents an absent/missing value.

In [78]:
employees.item(2, 1)

In [79]:
print(employees.item(2, 1))

None


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.item.html

## The gather and gather_every Methods
- The `gather` method extracts multiple rows by index position.
- Pass the method a list with the index positions.

In [80]:
employees = pl.read_csv("employees.csv")
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""


- The next example uses `pl.all` to target all columns, then extracts the rows at index positions 0, 100, and 200.

In [81]:
employees.select(pl.all().gather([0, 100, 200]))

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Nathan Blanchard""","""Marketing""","""nathan.blanchard@polars.io""",96290,7,"""2018-01-22"""
"""George Thomas""",,"""george.thomas@polars.io""",55386,10,"""2014-09-06"""


- The `gather_every` method extracts rows using a gap/interval.
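- The next example targets every second row (index positions 0, 2, 4, etc), then every third row (index positions 0, 3, 6, etc). The notebook displays only the result of the last call.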

In [82]:
employees.select(pl.all().gather_every(2))

employees.select(pl.all().gather_every(3))

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,"""2019-11-25"""
"""Martin Adams""",,"""martin.adams@polars.io""",61705,0,"""2024-10-28"""
"""Gregory Ward""",,"""gregory.ward@polars.io""",134169,7,"""2018-06-16"""
"""Gregory Flores""","""Marketing""","""gregory.flores@polars.io""",138482,10,"""2015-02-17"""
…,…,…,…,…,…
"""Deborah Hancock""","""Marketing""","""deborah.hancock@polars.io""",138183,7,"""2017-12-18"""
"""Tina Miller""","""Finance""","""tina.miller@polars.io""",196962,9,"""2016-06-04"""
"""Jennifer Murphy""",,"""jennifer.murphy@polars.io""",79626,1,"""2024-07-13"""
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",92190,6,"""2019-02-20"""


- The `gather_every` method is also available on the `DataFrame`.

In [83]:
employees.gather_every(n=2, offset=1)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,"""2019-11-25"""
"""Melissa Page""","""Marketing""","""melissa.page@polars.io""",114120,9,"""2015-12-28"""
"""Jessica Hill""","""Marketing""","""jessica.hill@polars.io""",135265,5,"""2019-10-04"""
"""Gregory Ward""",,"""gregory.ward@polars.io""",134169,7,"""2018-06-16"""
…,…,…,…,…,…
"""Peter Green""","""Operations""","""peter.green@polars.io""",93617,4,"""2020-11-16"""
"""Jennifer Murphy""",,"""jennifer.murphy@polars.io""",79626,1,"""2024-07-13"""
"""James Bryant""",,"""james.bryant@polars.io""",85285,9,"""2016-05-09"""
"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,"""2025-02-12"""


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.gather.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.gather_every.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.gather_every.html

## Extracting a Random Set of Values

In [84]:
employees = pl.read_csv("employees.csv")
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""


- The `sample` method extracts a number of random rows from the `DataFrame`.
- The new `DataFrame` may contain the rows in a different order than they appear in the original `DataFrame`.

In [85]:
employees.sample(n=3)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Brian Brady""","""Sales""","""brian.brady@polars.io""",85471,10,"""2015-07-15"""
"""Terry Bonilla""","""Marketing""","""terry.bonilla@polars.io""",83437,8,"""2017-04-19"""
"""Jeffrey Houston""","""HR""","""jeffrey.houston@polars.io""",60299,8,"""2017-03-20"""


- The `fraction` parameter extracts a portion/percentage of the original rows.
- The next example extracts 3.5% of the original rows.

In [86]:
employees.sample(fraction=0.035)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Anthony Grant""","""Marketing""","""anthony.grant@polars.io""",138355,6,"""2019-01-11"""
"""Christine Snyder""","""Finance""","""christine.snyder@polars.io""",158005,10,"""2015-04-07"""
"""Kyle Mccall MD""",,"""kyle.md@polars.io""",64581,7,"""2018-01-16"""
"""Christopher Lopez""","""Finance""","""christopher.lopez@polars.io""",187070,1,"""2024-05-26"""
"""Cristina Williams""",,"""cristina.williams@polars.io""",55242,8,"""2016-07-21"""
…,…,…,…,…,…
"""Martha Chase""","""Sales""","""martha.chase@polars.io""",81018,8,"""2017-06-06"""
"""Elizabeth Rivera""","""Marketing""","""elizabeth.rivera@polars.io""",127606,0,"""2025-04-26"""
"""Chelsea Williamson""","""Engineering""","""chelsea.williamson@polars.io""",189501,10,"""2015-06-28"""
"""Amber Harris""","""HR""","""amber.harris@polars.io""",78162,0,"""2025-06-14"""


### Further Reading
- https://docs.pola.rs/user-guide/concepts/data-types-and-structures/#sample
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.sample.html

## Casting Columns to Different Types
- The `cast` method converts a column's values into another data type.
- Pass the `cast` method a built-in Polars data type. For example, `pl.Float64` will attempt to convert to 64-bit floats.
- Polars is strict by default; it will raise an error if a single value cannot be converted.

In [87]:
employees = pl.read_csv("employees.csv")
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,str
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,"""2016-07-14"""
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,"""2016-02-13"""


- When applying an expression, Polars will keep the original column name by default.
- Polars will raise an error if multiple columns in a new `DataFrame` have the same name.
- The `select` method thus requires each expression to produce a column with a unique name.

In [88]:
employees.select(
    pl.col("salary"), pl.col("salary").cast(pl.Float64).alias("salary_as_float")
)

salary,salary_as_float
i64,f64
250000,250000.0
96540,96540.0
126489,126489.0
84672,84672.0
148601,148601.0
…,…
85285,85285.0
92190,92190.0
87151,87151.0
196704,196704.0


- The `Int8` type supports the range of values from -128 to 127. Nobody will work 127 years for the company!
- The `UInt8` type supports the range of values from 0 to 255.


In [89]:
employees.select(pl.col("years_at_company").cast(pl.Int8))
employees.select(pl.col("years_at_company").cast(pl.UInt8))

years_at_company
u8
9
9
10
5
7
…
9
6
0
4


- We can cast multiple columns to new types in a single `select` call.

In [90]:
employees.select(
    pl.col("years_at_company").cast(pl.UInt8), pl.col("start_date").cast(pl.Date)
)

years_at_company,start_date
u8,date
9,2016-07-14
9,2016-02-13
10,2015-03-01
5,2019-11-25
7,2018-02-14
…,…
9,2016-05-09
6,2019-02-20
0,2025-02-12
4,2020-11-07


- Recall that `pl.col` can create an expression targeting multiple columns.
- Pass the `pl.col` function the data type to target.

In [91]:
employees.select(pl.col(pl.Int64).cast(pl.Float64))

salary,years_at_company
f64,f64
250000.0,9.0
96540.0,9.0
126489.0,10.0
84672.0,5.0
148601.0,7.0
…,…
85285.0,9.0
92190.0,6.0
87151.0,0.0
196704.0,4.0


### Further Reading
- https://docs.pola.rs/user-guide/expressions/casting/#basic-example
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cast.html

## Customizing the DataFrame Schema
- The `schema_overrides` parameter overwrites Polars' inferred data type for specified columns.
- The parameter accepts a dictionary that maps column names to desired data types.

In [92]:
employees = pl.read_csv(
    "employees.csv",
    schema_overrides={"years_at_company": pl.Int8, "start_date": pl.Date},
)
employees.head(1)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i8,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14


- The `schema` parameter is also available but it requires the explicit provision of every column.
- Polars will raise a `ComputeError` if a column is missing.

In [93]:
employees = pl.read_csv(
    "employees.csv",
    schema={
        "name": pl.String,
        "department": pl.String,
        "email": pl.String,
        "salary": pl.Int32,
        "years_at_company": pl.Int8,
        "start_date": pl.Date,
    },
)
employees.head(1)

name,department,email,salary,years_at_company,start_date
str,str,str,i32,i8,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14


- The `schema` attribute prints out the current schema of the `DataFrame`.

In [94]:
employees.schema

Schema([('name', String),
        ('department', String),
        ('email', String),
        ('salary', Int32),
        ('years_at_company', Int8),
        ('start_date', Date)])

- An alternative to specifying `pl.Date` or `pl.Datetime` types is using the `try_parse_dates` parameter.
- The `try_parse_dates` parameter attempts to parse strings as dates/datetimes (if Polars can figure out the format).
- The parameter accepts a Boolean.

In [95]:
pl.read_csv("employees.csv", try_parse_dates=True)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,2019-11-25
"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,2018-02-14
…,…,…,…,…,…
"""James Bryant""",,"""james.bryant@polars.io""",85285,9,2016-05-09
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",92190,6,2019-02-20
"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,2025-02-12
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",196704,4,2020-11-07


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/datatypes.html

## Renaming Columns
- The `select` method can rename columns in the new `DataFrame` but we have to specify all columns we want to include.

In [96]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13


In [97]:
employees.select(pl.col("name").alias("first_name"))

first_name
str
"""Nicholas Maldonado"""
"""Michael Fletcher"""
"""Jeffrey Tanner"""
"""Diana Weaver"""
"""Sierra Ross"""
…
"""James Bryant"""
"""Patricia Vazquez"""
"""Katie Clay"""
"""Monique Swanson"""


- The `rename` method accepts a dictionary where each key is an existing column name and each value is its new name.
- It renames the specified columns and leaves the other columns unchanged.

In [98]:
employees.rename({"years_at_company": "years", "salary": "pay"})

name,department,email,pay,years,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,2019-11-25
"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,2018-02-14
…,…,…,…,…,…
"""James Bryant""",,"""james.bryant@polars.io""",85285,9,2016-05-09
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",92190,6,2019-02-20
"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,2025-02-12
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",196704,4,2020-11-07


- The `read_csv` function includes a `new_columns` parameter that accepts a partial or complete list of column names in order.
- The `employees.csv` example below passes a complete list that changes the first, second, and last column names.

In [99]:
employees = pl.read_csv(
    "employees.csv",
    try_parse_dates=True,
    new_columns=[
        "employee_name",
        "dept",
        "email",
        "salary",
        "years_at_company",
        "start_day",
    ],
)
employees.head(2)

employee_name,dept,email,salary,years_at_company,start_day
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.rename.html

## The name Attribute
- Polars nests additional expression methods under attributes/namespaces.
- The `name` attribute/namespace holds methods for adjusting column names.
- The `pl.all` function creates an expression that targets all columns.
- The `name.to_uppercase` method converts all column names to uppercase.
- The `name.to_lowercase` method converts all column names to lowercase.

In [100]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13


In [101]:
employees.select(pl.all().name.to_uppercase())

NAME,DEPARTMENT,EMAIL,SALARY,YEARS_AT_COMPANY,START_DATE
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,2019-11-25
"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,2018-02-14
…,…,…,…,…,…
"""James Bryant""",,"""james.bryant@polars.io""",85285,9,2016-05-09
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",92190,6,2019-02-20
"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,2025-02-12
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",196704,4,2020-11-07


In [102]:
employees.select(pl.all().name.to_lowercase())

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,2019-11-25
"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,2018-02-14
…,…,…,…,…,…
"""James Bryant""",,"""james.bryant@polars.io""",85285,9,2016-05-09
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",92190,6,2019-02-20
"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,2025-02-12
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",196704,4,2020-11-07


- The `name.prefix` method concatenates a consistent piece of text before each column name.
- The `name.suffix` method concatenates a consistent piece of text after each column name.

In [103]:
employees.select(pl.all().name.prefix("emp_"))

emp_name,emp_department,emp_email,emp_salary,emp_years_at_company,emp_start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,2019-11-25
"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,2018-02-14
…,…,…,…,…,…
"""James Bryant""",,"""james.bryant@polars.io""",85285,9,2016-05-09
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",92190,6,2019-02-20
"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,2025-02-12
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",196704,4,2020-11-07


In [104]:
employees.select(pl.all().name.suffix("_value"))

name_value,department_value,email_value,salary_value,years_at_company_value,start_date_value
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,2019-11-25
"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,2018-02-14
…,…,…,…,…,…
"""James Bryant""",,"""james.bryant@polars.io""",85285,9,2016-05-09
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",92190,6,2019-02-20
"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,2025-02-12
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",196704,4,2020-11-07


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.name.to_uppercase.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.name.to_lowercase.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.name.prefix.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.name.suffix.html

## Dropping Columns
- The `drop` method removes one or more columns from a `DataFrame`.
- Using `select` to target only the columns you'd like to keep is another valid strategy.

In [105]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13


In [106]:
employees.drop(pl.col("start_date"))
employees.drop("start_date")

employees.drop(pl.col("start_date"), pl.col("department"))
employees.drop(pl.col("start_date", "department"))

employees.drop("start_date", "department")
employees.drop(pl.col(["start_date", "department"]))

name,email,salary,years_at_company
str,str,i64,i64
"""Nicholas Maldonado""","""nicholas.maldonado@polars.io""",250000,9
"""Michael Fletcher""","""michael.fletcher@polars.io""",96540,9
"""Jeffrey Tanner""","""jeffrey.tanner@polars.io""",126489,10
"""Diana Weaver""","""diana.weaver@polars.io""",84672,5
"""Sierra Ross""","""sierra.ross@polars.io""",148601,7
…,…,…,…
"""James Bryant""","""james.bryant@polars.io""",85285,9
"""Patricia Vazquez""","""patricia.vazquez@polars.io""",92190,6
"""Katie Clay""","""katie.clay@polars.io""",87151,0
"""Monique Swanson""","""monique.swanson@polars.io""",196704,4


- A nonexistent column will trigger an error.

In [107]:
employees.drop("start_date", "department")

employees.drop(pl.col("start_date"), pl.col("department"))

name,email,salary,years_at_company
str,str,i64,i64
"""Nicholas Maldonado""","""nicholas.maldonado@polars.io""",250000,9
"""Michael Fletcher""","""michael.fletcher@polars.io""",96540,9
"""Jeffrey Tanner""","""jeffrey.tanner@polars.io""",126489,10
"""Diana Weaver""","""diana.weaver@polars.io""",84672,5
"""Sierra Ross""","""sierra.ross@polars.io""",148601,7
…,…,…,…
"""James Bryant""","""james.bryant@polars.io""",85285,9
"""Patricia Vazquez""","""patricia.vazquez@polars.io""",92190,6
"""Katie Clay""","""katie.clay@polars.io""",87151,0
"""Monique Swanson""","""monique.swanson@polars.io""",196704,4


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.drop.html

## Replacing Values
- The `replace` method swaps one value with another.
- The method accepts a list of old values and a list of new values to replace them with.
- Polars matches the values based on shared index positions.
- The method also accepts a dictionary of mappings.

In [108]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13


- The code below replaces `HR` with `Human Resources` and `Engineering` with `Tech`.

In [109]:
employees.select(
    pl.col("department").replace(
        old=["HR", "Engineering"], new=["Human Resources", "Tech"]
    )
)

department
str
"""CEO"""
"""Operations"""
""
"""Human Resources"""
""
…
""
"""Operations"""
""
"""Finance"""


In [110]:
employees.select(
    pl.col("department").replace({"HR": "Human Resources", "Engineering": "Nerds"})
)

department
str
"""CEO"""
"""Operations"""
""
"""Human Resources"""
""
…
""
"""Operations"""
""
"""Finance"""


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.replace.html

## Mathematical Operations I
- Polars supports common mathematical operations on `DataFrames` and `Series`.
- Most mathematical symbols have complementary method equivalents.

In [111]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13


- Operations on `null` (missing) values will produce `null` values.
- If using the `alias` method, wrap the full expression in parentheses.

In [112]:
df1 = employees.select(
    pl.col("salary"),
    addition=pl.col("salary") + 10000,
    subtraction=pl.col("salary") - 12345,
    multiplication=pl.col("salary") * 2,
    division=pl.col("salary") / 10,
    floor_division=pl.col("salary") // 10,
    years=pl.col("years_at_company"),
    exponentiation=pl.col("years_at_company") ** 3,
    remainder=pl.col("years_at_company") % 3,
)
df1.head(3)

salary,addition,subtraction,multiplication,division,floor_division,years,exponentiation,remainder
i64,i64,i64,i64,f64,i64,i64,i64,i64
250000,260000,237655,500000,25000.0,25000,9,729,0
96540,106540,84195,193080,9654.0,9654,9,729,0
126489,136489,114144,252978,12648.9,12648,10,1000,1


- Polars provides method equivalents for the mathematical operators.

In [113]:
df2 = employees.select(
    pl.col("salary"),
    addition=pl.col("salary").add(10000),
    subtraction=pl.col("salary").sub(12345),
    multiplication=pl.col("salary").mul(2),
    division=pl.col("salary").truediv(10),
    floor_division=pl.col("salary").floordiv(10),
    years=pl.col("years_at_company"),
    exponentiation=pl.col("years_at_company").pow(3),
    remainder=pl.col("years_at_company").mod(3),
)
df2.head(3)

salary,addition,subtraction,multiplication,division,floor_division,years,exponentiation,remainder
i64,i64,i64,i64,f64,i64,i64,i64,i64
250000,260000,237655,500000,25000.0,25000,9,729,0
96540,106540,84195,193080,9654.0,9654,9,729,0
126489,136489,114144,252978,12648.9,12648,10,1000,1


- The `equals` method compares the equality of two `DataFrames`.
- All row values and column names must be equal.

In [114]:
df1.equals(df2)

True

### Further Reading
- https://docs.pola.rs/user-guide/expressions/basic-operations/#basic-arithmetic
- https://docs.pola.rs/user-guide/expressions/basic-operations/#comparisons
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.add.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.sub.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.mul.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.truediv.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.floordiv.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.pow.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.mod.html

## Mathematical Operations II
- An expression can reference multiple columns.
- The expression below multiplies the `years_at_company` column values by the `salary` column values.

In [115]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13


In [116]:
employees.select(
    pl.col("salary"),
    pl.col("years_at_company"),
    (pl.col("salary") * pl.col("years_at_company")).alias("total_spend"),
)

salary,years_at_company,total_spend
i64,i64,i64
250000,9,2250000
96540,9,868860
126489,10,1264890
84672,5,423360
148601,7,1040207
…,…,…
85285,9,767565
92190,6,553140
87151,0,0
196704,4,786816


- Polars will gracefully handle type conversion in arithmetic operations if necessary.
- For example, multiplying an integer by a float will produce a float.

In [117]:
employees.select(pl.col("salary"), (pl.col("salary") * 1.05).alias("salary next year"))

salary,salary next year
i64,f64
250000,262500.0
96540,101367.0
126489,132813.45
84672,88905.6
148601,156031.05
…,…
85285,89549.25
92190,96799.5
87151,91508.55
196704,206539.2


## Cumulative Mathematical Operations
- Polars includes cumulative methods that compute a running value for each row based on the rows up to and including it.

In [118]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(4)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,2019-11-25


- The `cum_sum` method returns the cumulative sum up to and including that row.
- The `cum_count` method returns the running count of non-null values up to and including that row.
- The `cum_max` method returns the largest value encountered so far.
- The `cum_min` method returns the smallest value encountered so far.
- The `cum_prod` method returns the cumulative product up to and including that row.
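- The `pct_change` method returns the fractional change between the current row and the previous one. Multiply the result by 100 to get a percentage. The first row has no previous value, so its result is `null`.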

In [119]:
employees.select(
    pl.col("years_at_company"),
    cum_sum=pl.col("years_at_company").cum_sum(),
    cum_count=pl.col("years_at_company").cum_count(),
    cum_max=pl.col("years_at_company").cum_max(),
    cum_min=pl.col("years_at_company").cum_min(),
    cum_prod=pl.col("years_at_company").cum_prod(),
    pct_change=pl.col("years_at_company").pct_change() * 100,
)

years_at_company,cum_sum,cum_count,cum_max,cum_min,cum_prod,pct_change
i64,i64,u32,i64,i64,i64,f64
9,9,1,9,9,9,
9,18,2,9,9,81,0.0
10,28,3,10,9,810,11.111111
5,33,4,10,5,4050,-50.0
7,40,5,10,5,28350,40.0
…,…,…,…,…,…,…
9,5128,996,10,0,0,50.0
6,5134,997,10,0,0,-33.333333
0,5134,998,10,0,0,-100.0
4,5138,999,10,0,0,inf


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cum_sum.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cum_count.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cum_max.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cum_min.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cum_prod.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.pct_change.html

## The with_columns Method
- The `with_columns` method creates a new `DataFrame` with all existing columns and the new columns from the expressions.
- The `with_columns` method adds the columns to the right end of the existing `DataFrame`.

In [120]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13


In [121]:
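# select returns only the new column.
employees.select(next_year_salary=pl.col("salary") * 1.05)

# with_columns keeps all existing columns and appends the new one.
employees.with_columns(next_year_salary=pl.col("salary") * 1.05)

# Same result as the previous line, naming the column with alias instead of a keyword argument.
employees.with_columns((pl.col("salary") * 1.05).alias("next_year_salary"))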

name,department,email,salary,years_at_company,start_date,next_year_salary
str,str,str,i64,i64,date,f64
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14,262500.0
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13,101367.0
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01,132813.45
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,2019-11-25,88905.6
"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,2018-02-14,156031.05
…,…,…,…,…,…,…
"""James Bryant""",,"""james.bryant@polars.io""",85285,9,2016-05-09,89549.25
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",92190,6,2019-02-20,96799.5
"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,2025-02-12,91508.55
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",196704,4,2020-11-07,206539.2


- The `with_columns` method overwrites the original column if a new column has the same name.

In [122]:
employees.with_columns(pl.col("salary") * 1.05)

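# By contrast, the select call below raises an error because both expressions output a column named "salary".
# employees.select(pl.col("salary"), pl.col("salary") * 1.05)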

name,department,email,salary,years_at_company,start_date
str,str,str,f64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",262500.0,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",101367.0,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",132813.45,10,2015-03-01
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",88905.6,5,2019-11-25
"""Sierra Ross""",,"""sierra.ross@polars.io""",156031.05,7,2018-02-14
…,…,…,…,…,…
"""James Bryant""",,"""james.bryant@polars.io""",89549.25,9,2016-05-09
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",96799.5,6,2019-02-20
"""Katie Clay""",,"""katie.clay@polars.io""",91508.55,0,2025-02-12
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",206539.2,4,2020-11-07


- Polars prefers to execute expressions independently in parallel.
- An expression thus cannot reference a column that another expression creates in the same call.
- Chain multiple `with_columns` methods to keep all existing columns, then create new columns based on generated ones.
- Technically, Python's walrus operator (`:=`) enables this behavior but `with_columns` is easier to reason about.

In [123]:
employees.with_columns(new_salary=pl.col("salary") * 1.05).with_columns(
    new_salary_per_week=pl.col("new_salary") / 52
)

name,department,email,salary,years_at_company,start_date,new_salary,new_salary_per_week
str,str,str,i64,i64,date,f64,f64
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14,262500.0,5048.076923
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13,101367.0,1949.365385
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01,132813.45,2554.104808
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,2019-11-25,88905.6,1709.723077
"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,2018-02-14,156031.05,3000.597115
…,…,…,…,…,…,…,…
"""James Bryant""",,"""james.bryant@polars.io""",85285,9,2016-05-09,89549.25,1722.100962
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",92190,6,2019-02-20,96799.5,1861.528846
"""Katie Clay""",,"""katie.clay@polars.io""",87151,0,2025-02-12,91508.55,1759.779808
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",196704,4,2020-11-07,206539.2,3971.907692


### Further Reading
- https://docs.pola.rs/user-guide/getting-started/#with_columns
- https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#with_columns
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.with_columns.html

## The all and exclude Functions
- The `pl.all` function returns an expression that targets all columns.
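- The `with_columns` method behaves like `df.select(pl.all(), new_column_expressions)` when the new columns have new names. Unlike `select`, `with_columns` overwrites an existing column that has the same name.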

In [124]:
employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(2)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13


In [125]:
employees.select(pl.all(), next_year_salary=pl.col("salary") * 1.05).head()

name,department,email,salary,years_at_company,start_date,next_year_salary
str,str,str,i64,i64,date,f64
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14,262500.0
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13,101367.0
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01,132813.45
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",84672,5,2019-11-25,88905.6
"""Sierra Ross""",,"""sierra.ross@polars.io""",148601,7,2018-02-14,156031.05


- The `pl.exclude` function returns an expression that targets all columns _except_ for the specified ones.
- The first call in the next example targets all columns except for `start_date`. The second call, whose output is shown, excludes both `start_date` and `email`.

In [126]:
employees.select(pl.exclude("start_date")).head()

employees.select(pl.exclude("start_date", "email")).head()

name,department,salary,years_at_company
str,str,i64,i64
"""Nicholas Maldonado""","""CEO""",250000,9
"""Michael Fletcher""","""Operations""",96540,9
"""Jeffrey Tanner""",,126489,10
"""Diana Weaver""","""HR""",84672,5
"""Sierra Ross""",,148601,7


- We can target columns of a specific data type in a `select` call.
- We can also exclude columns of a data type with the `pl.exclude` function.

In [127]:
employees.select(pl.exclude(pl.String)).head()

salary,years_at_company,start_date
i64,i64,date
250000,9,2016-07-14
96540,9,2016-02-13
126489,10,2015-03-01
84672,5,2019-11-25
148601,7,2018-02-14


### Further Reading
- https://docs.pola.rs/user-guide/expressions/expression-expansion/#selecting-all-columns
- https://docs.pola.rs/user-guide/expressions/expression-expansion/#excluding-columns
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.all.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.exclude.html