## Explore [_'Polars'_](https://pola.rs)

### Install
```bash
uv add polars
```

___N.B.:___ 
- _Data can be represented in either of the following formats:_
    - _[`DataFrame`](https://docs.pola.rs/api/python/stable/reference/dataframe/index.html) - Two-dimensional data structure representing data as a table with rows and columns._
    - _[`LazyFrame`](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html) - Representation of a Lazy computation graph/query against a DataFrame. This allows for whole-query optimisation in addition to parallelism, and is the preferred (and highest-performance) mode of operation for polars._
- _For I/O, the lazy API uses functions prefixed with `scan` (read) and `LazyFrame` methods prefixed with `sink` (write), e.g. `scan_csv`, `scan_delta`, `scan_parquet`, `sink_csv`, `sink_parquet`, etc._
- _`collect()` must be called on a `LazyFrame` to execute the query and materialise the result as a `DataFrame`._
- _More information on `Lazy API` can be found [here](https://docs.pola.rs/user-guide/concepts/lazy-api/) and [here](https://docs.pola.rs/user-guide/lazy/)._


### Load libraries

In [1]:
import os
import polars as pl
import kagglehub

### Download and load the (large) dataset

[Kaggle - Latest Data Science Job Salaries 2020 - 2025](https://www.kaggle.com/datasets/saurabhbadole/latest-data-science-job-salaries-2024/data)

In [53]:
# Download latest version of the dataset.
path_data = kagglehub.dataset_download("saurabhbadole/latest-data-science-job-salaries-2024")
print(f"Path to dataset files: {path_data}")

Downloading from https://www.kaggle.com/api/v1/datasets/download/saurabhbadole/latest-data-science-job-salaries-2024?dataset_version_number=3...


100%|██████████| 1.48M/1.48M [00:00<00:00, 5.53MB/s]

Extracting files...
Path to dataset files:  /Users/shaz/.cache/kagglehub/datasets/saurabhbadole/latest-data-science-job-salaries-2024/versions/3





In [None]:
# Get the 'csv' filename of the dataset.
# path_data = '/Users/shaz/.cache/kagglehub/datasets/saurabhbadole/latest-data-science-job-salaries-2024/versions/3'
filenames = os.listdir(path_data)
filename_csv = [file for file in filenames if file.endswith('.csv')][0]

# Load the dataset into a Polars DataFrame.
df_salary = pl.read_csv(source=os.path.join(path_data, filename_csv))
print(f"Shape of the DataFrame: {df_salary.shape}\n")

# Display the first few rows of the DataFrame.
# display(df_salary.head())
print(df_salary.head())

Shape of the DataFrame: (93597, 11)

shape: (5, 11)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ work_year ┆ experienc ┆ employmen ┆ job_title ┆ … ┆ employee_ ┆ remote_ra ┆ company_l ┆ company_ │
│ ---       ┆ e_level   ┆ t_type    ┆ ---       ┆   ┆ residence ┆ tio       ┆ ocation   ┆ size     │
│ i64       ┆ ---       ┆ ---       ┆ str       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆ str       ┆ str       ┆           ┆   ┆ str       ┆ i64       ┆ str       ┆ str      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ US        ┆ 0         ┆ US        ┆ M        │
│           ┆           ┆           ┆ Scientist ┆   ┆           ┆           ┆           ┆          │
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ US        ┆ 0         ┆ US        ┆ M        │
│           ┆           ┆           ┆ S

### (OPTIONAL) Save and load the data in _'parquet'_ format locally

- normal/regular method

In [3]:
# Define filename for 'parquet' file.
filename_parquet = f"df_{filename_csv.split('.')[0].lower()}.parquet"

# Save the data.
df_salary.write_parquet(file=filename_parquet)

# Read the data back from the 'parquet' file.
df_salary = pl.read_parquet(source=filename_parquet)
print(f"Shape of the `DataFrame`: {df_salary.shape}\n")

Shape of the `DataFrame`: (93597, 11)



- `lazy` method

In [4]:
# Define filename for `lazy` method.
filename_lazy_parquet = f"lf_{filename_csv.split('.')[0].lower()}.parquet"

# Convert the DataFrame to a LazyFrame and write it to a Parquet file.
df_salary_lazy = df_salary.lazy()
df_salary_lazy.sink_parquet(path=filename_lazy_parquet)

# Scan the Parquet file lazily; this returns a LazyFrame.
df_salary_lazy = pl.scan_parquet(source=filename_lazy_parquet)

# Materialise the LazyFrame as a regular DataFrame.
df_salary = df_salary_lazy.collect()
print(f"`type` of 'df_salary': {type(df_salary).__module__}.{type(df_salary).__name__}")
print(f"`type` of 'df_salary_lazy': {type(df_salary_lazy).__module__}.{type(df_salary_lazy).__name__}\n")

# Display the first few rows of the DataFrame.
print(df_salary.head())

`type` of 'df_salary': polars.dataframe.frame.DataFrame
`type` of 'df_salary_lazy': polars.lazyframe.frame.LazyFrame

shape: (5, 11)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ work_year ┆ experienc ┆ employmen ┆ job_title ┆ … ┆ employee_ ┆ remote_ra ┆ company_l ┆ company_ │
│ ---       ┆ e_level   ┆ t_type    ┆ ---       ┆   ┆ residence ┆ tio       ┆ ocation   ┆ size     │
│ i64       ┆ ---       ┆ ---       ┆ str       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆ str       ┆ str       ┆           ┆   ┆ str       ┆ i64       ┆ str       ┆ str      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ US        ┆ 0         ┆ US        ┆ M        │
│           ┆           ┆           ┆ Scientist ┆   ┆           ┆           ┆           ┆          │
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ US        ┆ 0

### [Inspecting the `DataFrame`](https://docs.pola.rs/user-guide/concepts/data-types-and-structures/#inspecting-a-dataframe)

- #### `head`

    The function `head` shows the first 5 rows of a `DataFrame` by default. This can be overridden by specifying the desired number of rows.

In [6]:
print(f"'default' usage of `head` method:\n{df_salary.head()}\n")

print(f"'custom' usage of `head` method:\n{df_salary.head(10)}\n")

'default' usage of `head` method:
shape: (5, 11)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ work_year ┆ experienc ┆ employmen ┆ job_title ┆ … ┆ employee_ ┆ remote_ra ┆ company_l ┆ company_ │
│ ---       ┆ e_level   ┆ t_type    ┆ ---       ┆   ┆ residence ┆ tio       ┆ ocation   ┆ size     │
│ i64       ┆ ---       ┆ ---       ┆ str       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆ str       ┆ str       ┆           ┆   ┆ str       ┆ i64       ┆ str       ┆ str      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ US        ┆ 0         ┆ US        ┆ M        │
│           ┆           ┆           ┆ Scientist ┆   ┆           ┆           ┆           ┆          │
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ US        ┆ 0         ┆ US        ┆ M        │
│           ┆           ┆           ┆ Scie

- #### `tail`

    The function `tail` shows the last 5 rows of a `DataFrame` by default, mirroring `head`. The default can be overridden by specifying the desired number of rows.

In [7]:
print(f"'default' usage of `tail` method:\n{df_salary.tail()}\n")

print(f"'custom' usage of `tail` method:\n{df_salary.tail(10)}\n")

'default' usage of `tail` method:
shape: (5, 11)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ work_year ┆ experienc ┆ employmen ┆ job_title ┆ … ┆ employee_ ┆ remote_ra ┆ company_l ┆ company_ │
│ ---       ┆ e_level   ┆ t_type    ┆ ---       ┆   ┆ residence ┆ tio       ┆ ocation   ┆ size     │
│ i64       ┆ ---       ┆ ---       ┆ str       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆ str       ┆ str       ┆           ┆   ┆ str       ┆ i64       ┆ str       ┆ str      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 2020      ┆ SE        ┆ FT        ┆ Data      ┆ … ┆ US        ┆ 100       ┆ US        ┆ L        │
│           ┆           ┆           ┆ Scientist ┆   ┆           ┆           ┆           ┆          │
│ 2021      ┆ MI        ┆ FT        ┆ Principal ┆ … ┆ US        ┆ 100       ┆ US        ┆ L        │
│           ┆           ┆           ┆ Data

- #### `describe`

    `describe` computes and displays the summary statistics for all columns of the `DataFrame`.

In [8]:
print(df_salary.describe())

print(df_salary.describe().to_dict())

shape: (9, 12)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ statistic ┆ work_year ┆ experienc ┆ employmen ┆ … ┆ employee_ ┆ remote_ra ┆ company_l ┆ company_ │
│ ---       ┆ ---       ┆ e_level   ┆ t_type    ┆   ┆ residence ┆ tio       ┆ ocation   ┆ size     │
│ str       ┆ f64       ┆ ---       ┆ ---       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆           ┆ str       ┆ str       ┆   ┆ str       ┆ f64       ┆ str       ┆ str      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ count     ┆ 93597.0   ┆ 93597     ┆ 93597     ┆ … ┆ 93597     ┆ 93597.0   ┆ 93597     ┆ 93597    │
│ null_coun ┆ 0.0       ┆ 0         ┆ 0         ┆ … ┆ 0         ┆ 0.0       ┆ 0         ┆ 0        │
│ t         ┆           ┆           ┆           ┆   ┆           ┆           ┆           ┆          │
│ mean      ┆ 2024.0864 ┆ null      ┆ null      ┆ … ┆ null      ┆ 21.455816 

- #### `glimpse`

    The function `glimpse` also shows the values of the first few rows of a `DataFrame`, but formats the output differently from `head`. Each line of the output corresponds to a single column, making it easier to inspect wide dataframes.

In [9]:
print(df_salary.glimpse(max_items_per_column=5, return_as_string=True))

Rows: 93597
Columns: 11
$ work_year          <i64> 2025, 2025, 2025, 2025, 2025
$ experience_level   <str> 'MI', 'MI', 'SE', 'SE', 'MI'
$ employment_type    <str> 'FT', 'FT', 'FT', 'FT', 'FT'
$ job_title          <str> 'Research Scientist', 'Research Scientist', 'Research Scientist', 'Research Scientist', 'AI Engineer'
$ salary             <i64> 208000, 147000, 173000, 117000, 100000
$ salary_currency    <str> 'USD', 'USD', 'USD', 'USD', 'USD'
$ salary_in_usd      <i64> 208000, 147000, 173000, 117000, 100000
$ employee_residence <str> 'US', 'US', 'US', 'US', 'US'
$ remote_ratio       <i64> 0, 0, 0, 0, 100
$ company_location   <str> 'US', 'US', 'US', 'US', 'US'
$ company_size       <str> 'M', 'M', 'M', 'M', 'M'



- #### `sample`

    The `sample` function retrieves an arbitrary number of randomly selected rows from the `DataFrame`. 
    
    _N.B.: The rows are not necessarily returned in the same order as they appear in the `DataFrame`._


In [10]:
print(df_salary.sample(n=5, with_replacement=False, seed=14))

shape: (5, 11)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ work_year ┆ experienc ┆ employmen ┆ job_title ┆ … ┆ employee_ ┆ remote_ra ┆ company_l ┆ company_ │
│ ---       ┆ e_level   ┆ t_type    ┆ ---       ┆   ┆ residence ┆ tio       ┆ ocation   ┆ size     │
│ i64       ┆ ---       ┆ ---       ┆ str       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆ str       ┆ str       ┆           ┆   ┆ str       ┆ i64       ┆ str       ┆ str      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 2025      ┆ MI        ┆ FT        ┆ Power BI  ┆ … ┆ US        ┆ 100       ┆ US        ┆ M        │
│           ┆           ┆           ┆ Developer ┆   ┆           ┆           ┆           ┆          │
│ 2025      ┆ MI        ┆ FT        ┆ Associate ┆ … ┆ US        ┆ 100       ┆ US        ┆ M        │
│ 2025      ┆ SE        ┆ FT        ┆ Machine   ┆ … ┆ US        ┆ 0         

- #### `schema`

    For `DataFrame`, the `schema` is a mapping of column or series names to the data types of those same columns or series.

    Much like with series, Polars infers the schema of a `DataFrame` when it is created, but the inference can be overridden by specifying the schema explicitly.

    In Python, an explicit schema can be specified by using a dictionary to map column names to data types. Use the value `None` for any column whose inferred type you do not wish to override.

    ```python
    df = pl.DataFrame(
        data={
            "name": ["Alice", "Ben", "Chloe", "Daniel"],
            "age": [27, 39, 41, 43],
            },
        schema={"name": None, "age": pl.Int8},
    )
    print(df)
    ```

    The parameter `schema_overrides` tends to be more convenient when only some columns need to override the inference. Columns that are not listed keep their inferred data types.

    ```python
    df = pl.DataFrame(
        data={
            "name": ["Alice", "Ben", "Chloe", "Daniel"],
            "age": [27, 39, 41, 43],
            },
        schema_overrides={"age": pl.UInt8},
    )
    print(df)
    ```

In [11]:
print(df_salary.schema)

Schema([('work_year', Int64), ('experience_level', String), ('employment_type', String), ('job_title', String), ('salary', Int64), ('salary_currency', String), ('salary_in_usd', Int64), ('employee_residence', String), ('remote_ratio', Int64), ('company_location', String), ('company_size', String)])


In [22]:
# Define the numeric and categorical columns.
cols_numeric = ['work_year', 'salary', 'salary_in_usd', 'remote_ratio',]
cols_categorical = ['experience_level', 'employment_type', 'job_title', 'salary_currency', 'employee_residence', 'company_location', 'company_size']

# Define the new schema for the `DataFrame`.
schema_new = {col: pl.Categorical for col in cols_categorical}
schema_new.update({
    'work_year': pl.Int16,
    'salary_in_usd': pl.Float32,
    'remote_ratio': pl.UInt8,
    }
)
print(f"New schema definition: \n{schema_new}\n")

# Apply the new schema to the DataFrame.
df_salary = df_salary.with_columns(
    [pl.col(col_name).cast(data_type) for col_name, data_type in schema_new.items()]
)
print(f"`schema` of the `DataFrame` after applying the new schema definition: \n{df_salary.schema}\n")

New schema definition: 
{'experience_level': Categorical, 'employment_type': Categorical, 'job_title': Categorical, 'salary_currency': Categorical, 'employee_residence': Categorical, 'company_location': Categorical, 'company_size': Categorical, 'work_year': Int16, 'salary_in_usd': Float32, 'remote_ratio': UInt8}

`schema` of the `DataFrame` after applying the new schema definition: 
Schema([('work_year', Int16), ('experience_level', Categorical(ordering='physical')), ('employment_type', Categorical(ordering='physical')), ('job_title', Categorical(ordering='physical')), ('salary', Int64), ('salary_currency', Categorical(ordering='physical')), ('salary_in_usd', Float32), ('employee_residence', Categorical(ordering='physical')), ('remote_ratio', UInt8), ('company_location', Categorical(ordering='physical')), ('company_size', Categorical(ordering='physical'))])



### [Expressions and contexts](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/)

Polars has developed its own `Domain Specific Language` (DSL) for transforming data. The language is very easy to use and allows for complex queries that remain human readable. Expressions and contexts are very important in achieving this readability while also allowing the Polars query engine to optimize your queries to make them run as fast as possible.

- #### Expressions

    In Polars, an `expression` is a lazy representation of a data transformation. Expressions are modular and flexible. Hence, they can be used as building blocks to build more complex expressions.

    ```python
    pl.col("weight") / (pl.col("height") ** 2)
    ```

    The code above expresses an abstract computation. It can be saved in a variable, manipulated further, or simply printed.

    ```python
    bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
    print(bmi_expr)

    # [(col("weight")) / (col("height").pow([dyn int: 2]))]
    ```

    Since expressions are lazy, no computations have taken place yet. That's why `contexts` are needed.

In [26]:
# Convert USD to EUR using a fixed, illustrative factor of 0.91.
expr_salary_in_euro = pl.col('salary_in_usd') * 0.91

print(f"Expression for `salary_in_euro`: \n{expr_salary_in_euro}\n")
print(f"Expression for `salary_in_euro` with `alias`: \n{expr_salary_in_euro.alias('salary_in_euro')}\n")

Expression for `salary_in_euro`: 
[(col("salary_in_usd")) * (dyn float: 0.91)]

Expression for `salary_in_euro` with `alias`: 
[(col("salary_in_usd")) * (dyn float: 0.91)].alias("salary_in_euro")



- #### Contexts

    Polars expressions need a context in which they are executed to produce a result. Depending on the context it is used in, the same Polars expression can produce different results. The four most common contexts that Polars provides are as follows:

    - `select`

        The selection context `select` applies expressions over columns. The context `select` may produce new columns that are aggregations, combinations of other columns, or literals.

        The expressions in a context `select` must produce series that are all the same length or they must produce a scalar. Scalars will be broadcast to match the length of the remaining series.

        Note that broadcasting can also occur within expressions.

        The context `select` is very flexible and powerful and allows you to evaluate arbitrary expressions independent of, and in parallel to, each other.

In [46]:
df_select = df_salary.select(
    pl.col('work_year'),
    pl.col('salary_in_usd'),
    pl.col('remote_ratio'),
    expr_salary_in_euro.alias('salary_in_euro'),
    mean_salary_in_euro = expr_salary_in_euro.mean(),
    stddev_salary_in_euro = expr_salary_in_euro.std(),
    cv_salary_in_euro = expr_salary_in_euro.std() / expr_salary_in_euro.mean(),
)
print(f"Shape of `df_salary`: {df_salary.shape}\n")
print(f"Shape of `df_select`: {df_select.shape}\n")
print(df_select.head())

# df_select = df_salary.select(
#     [
#         pl.col('work_year'),
#         pl.col('salary_in_usd'),
#     ]
# )

Shape of `df_salary`: (93597, 11)

Shape of `df_select`: (93597, 7)

shape: (5, 7)
┌───────────┬──────────────┬──────────────┬──────────────┬─────────────┬─────────────┬─────────────┐
│ work_year ┆ salary_in_us ┆ remote_ratio ┆ salary_in_eu ┆ mean_salary ┆ stddev_sala ┆ cv_salary_i │
│ ---       ┆ d            ┆ ---          ┆ ro           ┆ _in_euro    ┆ ry_in_euro  ┆ n_euro      │
│ i16       ┆ ---          ┆ u8           ┆ ---          ┆ ---         ┆ ---         ┆ ---         │
│           ┆ f32          ┆              ┆ f32          ┆ f32         ┆ f32         ┆ f32         │
╞═══════════╪══════════════╪══════════════╪══════════════╪═════════════╪═════════════╪═════════════╡
│ 2025      ┆ 208000.0     ┆ 0            ┆ 189280.0     ┆ 143368.4062 ┆ 67020.69531 ┆ 0.467472    │
│           ┆              ┆              ┆              ┆ 5           ┆ 2           ┆             │
│ 2025      ┆ 147000.0     ┆ 0            ┆ 133770.0     ┆ 143368.4062 ┆ 67020.69531 ┆ 0.467472    │
│       

In [45]:
df_select = df_salary.select(
    expr_salary_in_euro.alias('salary_in_euro'),
)
print(f"Shape of `df_salary`: {df_salary.shape}\n")
print(f"Shape of `df_select`: {df_select.shape}\n")
print(df_select.head())

Shape of `df_salary`: (93597, 11)

Shape of `df_select`: (93597, 1)

shape: (5, 1)
┌────────────────┐
│ salary_in_euro │
│ ---            │
│ f32            │
╞════════════════╡
│ 189280.0       │
│ 133770.0       │
│ 157430.0       │
│ 106470.0       │
│ 91000.0        │
└────────────────┘


- #### Contexts (cont.)

    - `with_columns`

        The context `with_columns` is very similar to the context `select`. The main difference is that `with_columns` creates a new dataframe containing the columns of the original dataframe plus the new columns defined by its input expressions, whereas `select` only includes the columns produced by its input expressions.

        Because of this difference, the expressions used in `with_columns` must produce series that have the same length as the original columns in the dataframe, whereas the expressions in `select` only need to produce series whose lengths match one another.

In [44]:
df_withcolumns = df_salary.with_columns(
    expr_salary_in_euro.alias('salary_in_euro'),
)
print(f"Shape of `df_salary`: {df_salary.shape}\n")
print(f"Shape of `df_withcolumns`: {df_withcolumns.shape}\n")
print(df_withcolumns.head())

Shape of `df_salary`: (93597, 11)

Shape of `df_withcolumns`: (93597, 12)

shape: (5, 12)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ work_year ┆ experienc ┆ employmen ┆ job_title ┆ … ┆ remote_ra ┆ company_l ┆ company_s ┆ salary_i │
│ ---       ┆ e_level   ┆ t_type    ┆ ---       ┆   ┆ tio       ┆ ocation   ┆ ize       ┆ n_euro   │
│ i16       ┆ ---       ┆ ---       ┆ cat       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆ cat       ┆ cat       ┆           ┆   ┆ u8        ┆ cat       ┆ cat       ┆ f32      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ 0         ┆ US        ┆ M         ┆ 189280.0 │
│           ┆           ┆           ┆ Scientist ┆   ┆           ┆           ┆           ┆          │
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ 0         ┆ US        ┆ M         ┆ 133770.0 │
│

- #### Contexts (cont.)

    - `filter`

        The context `filter` filters the rows of a dataframe based on one or more expressions that evaluate to the Boolean data type.


In [None]:
df_filter = df_salary.filter(
    (pl.col('work_year') > 2020) & (pl.col('salary_in_usd') > 100000)
)
# df_filter = df_salary.filter(
#     (pl.col('work_year') > 2020),
#     (pl.col('salary_in_usd') > 100000)
# )
print(f"Shape of `df_salary` (before filtering): {df_salary.shape}\n")
print(f"Shape of `df_filter` (after filtering): {df_filter.shape}\n")
print(df_filter.head())

Shape of `df_salary` (before filtering): (93597, 11)

Shape of `df_filter` (after filtering): (72337, 11)

shape: (5, 11)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ work_year ┆ experienc ┆ employmen ┆ job_title ┆ … ┆ employee_ ┆ remote_ra ┆ company_l ┆ company_ │
│ ---       ┆ e_level   ┆ t_type    ┆ ---       ┆   ┆ residence ┆ tio       ┆ ocation   ┆ size     │
│ i16       ┆ ---       ┆ ---       ┆ cat       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆ cat       ┆ cat       ┆           ┆   ┆ cat       ┆ u8        ┆ cat       ┆ cat      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ US        ┆ 0         ┆ US        ┆ M        │
│           ┆           ┆           ┆ Scientist ┆   ┆           ┆           ┆           ┆          │
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ US        ┆ 0   

In [47]:
df_filter = df_salary.filter(
    (pl.col('work_year') > 2020),
    (pl.col('salary_in_usd') > 100000)
)
print(f"Shape of `df_salary` (before filtering): {df_salary.shape}\n")
print(f"Shape of `df_filter` (after filtering): {df_filter.shape}\n")
print(df_filter.head())

Shape of `df_salary` (before filtering): (93597, 11)

Shape of `df_filter` (after filtering): (72337, 11)

shape: (5, 11)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ work_year ┆ experienc ┆ employmen ┆ job_title ┆ … ┆ employee_ ┆ remote_ra ┆ company_l ┆ company_ │
│ ---       ┆ e_level   ┆ t_type    ┆ ---       ┆   ┆ residence ┆ tio       ┆ ocation   ┆ size     │
│ i16       ┆ ---       ┆ ---       ┆ cat       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆ cat       ┆ cat       ┆           ┆   ┆ cat       ┆ u8        ┆ cat       ┆ cat      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ US        ┆ 0         ┆ US        ┆ M        │
│           ┆           ┆           ┆ Scientist ┆   ┆           ┆           ┆           ┆          │
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ US        ┆ 0   

- #### Contexts (cont.)

    - `group_by` and `agg` (aggregations)

        In the context `group_by`, rows are grouped according to the unique values of the grouping expressions. You can then apply expressions to the resulting groups, which may be of variable lengths. An expression can be used to compute the groupings dynamically.

        After using `group_by` we use `agg` to apply aggregating expressions to the groups. We can specify as many grouping expressions as we'd like and the context `group_by` will group the rows according to the distinct values across the expressions specified. 

        The resulting dataframe contains one column per grouping expression on the left, followed by as many columns as needed to hold the results of the aggregating expressions. In turn, we can specify as many aggregating expressions as we want.

        See also [`group_by_dynamic`](https://docs.pola.rs/user-guide/transformations/time-series/rolling/#using-expressions-in-group_by_dynamic) and [`rolling`](https://docs.pola.rs/user-guide/transformations/time-series/rolling/#grouping-by-rolling-windows) for other grouping contexts.

In [60]:
df_groupby = df_salary.group_by(
    [
        'work_year', # pl.col('work_year'),
        'remote_ratio', # pl.col('remote_ratio'),
        (pl.col('salary_in_usd')<100000).alias('is_salary_less_than_100k'),
    ],
    maintain_order=True,
).agg(
    [
        pl.col('salary_in_usd').mean().alias('mean_salary_in_usd'),
        pl.col('salary_in_usd').std().alias('stddev_salary_in_usd'),
        pl.col('salary_in_usd').median().alias('median_salary_in_usd'),
        pl.col('salary_in_usd').min().alias('min_salary_in_usd'),
        pl.col('salary_in_usd').max().alias('max_salary_in_usd'),
    ]
)
print(f"Shape of `df_salary`: {df_salary.shape}\n")
print(f"Shape of `df_groupby`: {df_groupby.shape}\n")
print(df_groupby.head())

Shape of `df_salary`: (93597, 11)

Shape of `df_groupby`: (36, 8)

shape: (5, 8)
┌───────────┬────────────┬────────────┬────────────┬───────────┬───────────┬───────────┬───────────┐
│ work_year ┆ remote_rat ┆ is_salary_ ┆ mean_salar ┆ stddev_sa ┆ median_sa ┆ min_salar ┆ max_salar │
│ ---       ┆ io         ┆ less_than_ ┆ y_in_usd   ┆ lary_in_u ┆ lary_in_u ┆ y_in_usd  ┆ y_in_usd  │
│ i16       ┆ ---        ┆ 100k       ┆ ---        ┆ sd        ┆ sd        ┆ ---       ┆ ---       │
│           ┆ u8         ┆ ---        ┆ f32        ┆ ---       ┆ ---       ┆ f32       ┆ f32       │
│           ┆            ┆ bool       ┆            ┆ f32       ┆ f32       ┆           ┆           │
╞═══════════╪════════════╪════════════╪════════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 2025      ┆ 0          ┆ false      ┆ 183402.937 ┆ 70081.929 ┆ 168900.0  ┆ 100000.0  ┆ 793136.0  │
│           ┆            ┆            ┆ 5          ┆ 688       ┆           ┆           ┆           │
│ 2025    

In [54]:
df_salary.head()

| work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i16 | cat | cat | cat | i64 | cat | f32 | cat | u8 | cat | cat |
| 2025 | "MI" | "FT" | "Research Scientist" | 208000 | "USD" | 208000.0 | "US" | 0 | "US" | "M" |
| 2025 | "MI" | "FT" | "Research Scientist" | 147000 | "USD" | 147000.0 | "US" | 0 | "US" | "M" |
| 2025 | "SE" | "FT" | "Research Scientist" | 173000 | "USD" | 173000.0 | "US" | 0 | "US" | "M" |
| 2025 | "SE" | "FT" | "Research Scientist" | 117000 | "USD" | 117000.0 | "US" | 0 | "US" | "M" |
| 2025 | "MI" | "FT" | "AI Engineer" | 100000 | "USD" | 100000.0 | "US" | 100 | "US" | "M" |


### [Advanced Expressions](https://docs.pola.rs/user-guide/expressions/)

In [None]:
# TODO: Yet to be documented.

### [Transformation](https://docs.pola.rs/user-guide/transformations/)

- `Joins`


### Transformation (cont.)

- `Concatenation`


### Transformation (cont.)

- `Pivot`


### Transformation (cont.)

- `Unpivot`