## Explore [_'Polars'_](https://pola.rs)

### Install.
```bash
uv add polars
```

___N.B.:___ 
- _Data can be represented in either of the following formats:_
    - _[`DataFrame`](https://docs.pola.rs/api/python/stable/reference/dataframe/index.html) - Two-dimensional data structure representing data as a table with rows and columns._
    - _[`LazyFrame`](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html) - Representation of a Lazy computation graph/query against a DataFrame. This allows for whole-query optimisation in addition to parallelism, and is the preferred (and highest-performance) mode of operation for polars._
- _For I/O, the lazy API uses `scan_*` functions to read data into a `LazyFrame` and `sink_*` methods to write a `LazyFrame` out, e.g. `scan_csv`, `scan_delta`, `scan_parquet`, `sink_csv`, `sink_parquet`, etc._
- _A query on a `LazyFrame` is not executed until `collect()` is called; `collect()` runs the query and returns the result as a `DataFrame`._
- _More information on `Lazy API` can be found [here](https://docs.pola.rs/user-guide/concepts/lazy-api/) and [here](https://docs.pola.rs/user-guide/lazy/)._


### Load libraries.

In [54]:
import os
import polars as pl
import kagglehub

### Download and load the (large) dataset.

[Kaggle - Latest Data Science Job Salaries 2020 - 2025](https://www.kaggle.com/datasets/saurabhbadole/latest-data-science-job-salaries-2024/data)

In [None]:
# Download latest version of the dataset.
path_data = kagglehub.dataset_download("saurabhbadole/latest-data-science-job-salaries-2024")
print(f"Path to dataset files: {path_data}")

Downloading from https://www.kaggle.com/api/v1/datasets/download/saurabhbadole/latest-data-science-job-salaries-2024?dataset_version_number=3...


100%|██████████| 1.48M/1.48M [00:00<00:00, 5.53MB/s]

Extracting files...
Path to dataset files:  /Users/shaz/.cache/kagglehub/datasets/saurabhbadole/latest-data-science-job-salaries-2024/versions/3





In [65]:
# Get the 'csv' filename of the dataset.
# path_data = '/Users/shaz/.cache/kagglehub/datasets/saurabhbadole/latest-data-science-job-salaries-2024/versions/3'
filenames = os.listdir(path_data)
filename_csv = [file for file in filenames if file.endswith('.csv')][0]

# Load the dataset into a Polars DataFrame.
df_salary = pl.read_csv(source=os.path.join(path_data, filename_csv))
print(f"Shape of the DataFrame: {df_salary.shape}\n")

# Display the first few rows of the DataFrame.
print(df_salary.head())

Shape of the DataFrame: (93597, 11)

shape: (5, 11)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ work_year ┆ experienc ┆ employmen ┆ job_title ┆ … ┆ employee_ ┆ remote_ra ┆ company_l ┆ company_ │
│ ---       ┆ e_level   ┆ t_type    ┆ ---       ┆   ┆ residence ┆ tio       ┆ ocation   ┆ size     │
│ i64       ┆ ---       ┆ ---       ┆ str       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆ str       ┆ str       ┆           ┆   ┆ str       ┆ i64       ┆ str       ┆ str      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ US        ┆ 0         ┆ US        ┆ M        │
│           ┆           ┆           ┆ Scientist ┆   ┆           ┆           ┆           ┆          │
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ US        ┆ 0         ┆ US        ┆ M        │
│           ┆           ┆           ┆ S

### (OPTIONAL) Save and load the data in _'parquet'_ format locally.

- normal/regular method

In [69]:
filename_csv
filename_csv.split('.')[0].lower()

f"df_{filename_csv.split('.')[0].lower()}.parquet"

'df_datascience_salaries_2025.parquet'

In [71]:
# Define filename for 'parquet' file.
filename_parquet = f"df_{filename_csv.split('.')[0].lower()}.parquet"

# Save the data.
df_salary.write_parquet(file=filename_parquet)

# Read the data back from the 'parquet' file.
df_salary = pl.read_parquet(source=filename_parquet)
print(f"Shape of the `DataFrame`: {df_salary.shape}\n")

Shape of the `DataFrame`: (93597, 11)



- `lazy` method

In [72]:
# Define filename for `lazy` method.
filename_lazy_parquet = f"lf_{filename_csv.split('.')[0].lower()}.parquet"

# Convert the DataFrame to a LazyFrame and save it as a Parquet file.
df_salary_lazy = df_salary.lazy()
df_salary_lazy.sink_parquet(path=filename_lazy_parquet)

# Lazily scan the Parquet file into a LazyFrame.
df_salary_lazy = pl.scan_parquet(source=filename_lazy_parquet)

# Execute the lazy query and materialise it as a regular DataFrame.
df_salary = df_salary_lazy.collect()
print(f"`type` of 'df_salary': {type(df_salary).__module__}.{type(df_salary).__name__}")
print(f"`type` of 'df_salary_lazy': {type(df_salary_lazy).__module__}.{type(df_salary_lazy).__name__}\n")

# Display the first few rows of the DataFrame.
print(df_salary.head())

`type` of 'df_salary': polars.dataframe.frame.DataFrame
`type` of 'df_salary_lazy': polars.lazyframe.frame.LazyFrame

shape: (5, 11)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ work_year ┆ experienc ┆ employmen ┆ job_title ┆ … ┆ employee_ ┆ remote_ra ┆ company_l ┆ company_ │
│ ---       ┆ e_level   ┆ t_type    ┆ ---       ┆   ┆ residence ┆ tio       ┆ ocation   ┆ size     │
│ i64       ┆ ---       ┆ ---       ┆ str       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆ str       ┆ str       ┆           ┆   ┆ str       ┆ i64       ┆ str       ┆ str      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ US        ┆ 0         ┆ US        ┆ M        │
│           ┆           ┆           ┆ Scientist ┆   ┆           ┆           ┆           ┆          │
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ US        ┆ 0

### Inspecting the `DataFrame`.

- #### `head`

    The `head` method shows the first 5 rows of a `DataFrame` by default. This can be overridden by specifying the desired number of rows.

In [73]:
print(f"'default' usage of `head` method:\n{df_salary.head()}\n")

print(f"'custom' usage of `head` method:\n{df_salary.head(10)}\n")

'default' usage of `head` method:
shape: (5, 11)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ work_year ┆ experienc ┆ employmen ┆ job_title ┆ … ┆ employee_ ┆ remote_ra ┆ company_l ┆ company_ │
│ ---       ┆ e_level   ┆ t_type    ┆ ---       ┆   ┆ residence ┆ tio       ┆ ocation   ┆ size     │
│ i64       ┆ ---       ┆ ---       ┆ str       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆ str       ┆ str       ┆           ┆   ┆ str       ┆ i64       ┆ str       ┆ str      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ US        ┆ 0         ┆ US        ┆ M        │
│           ┆           ┆           ┆ Scientist ┆   ┆           ┆           ┆           ┆          │
│ 2025      ┆ MI        ┆ FT        ┆ Research  ┆ … ┆ US        ┆ 0         ┆ US        ┆ M        │
│           ┆           ┆           ┆ Scie

- #### `tail`

    The `tail` method is the counterpart of `head`: it shows the last 5 rows of a `DataFrame` by default. The default can be overridden by specifying the desired number of rows.

In [74]:
print(f"'default' usage of `tail` method:\n{df_salary.tail()}\n")

print(f"'custom' usage of `tail` method:\n{df_salary.tail(10)}\n")

'default' usage of `tail` method:
shape: (5, 11)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ work_year ┆ experienc ┆ employmen ┆ job_title ┆ … ┆ employee_ ┆ remote_ra ┆ company_l ┆ company_ │
│ ---       ┆ e_level   ┆ t_type    ┆ ---       ┆   ┆ residence ┆ tio       ┆ ocation   ┆ size     │
│ i64       ┆ ---       ┆ ---       ┆ str       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆ str       ┆ str       ┆           ┆   ┆ str       ┆ i64       ┆ str       ┆ str      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 2020      ┆ SE        ┆ FT        ┆ Data      ┆ … ┆ US        ┆ 100       ┆ US        ┆ L        │
│           ┆           ┆           ┆ Scientist ┆   ┆           ┆           ┆           ┆          │
│ 2021      ┆ MI        ┆ FT        ┆ Principal ┆ … ┆ US        ┆ 100       ┆ US        ┆ L        │
│           ┆           ┆           ┆ Data

- #### `describe`

    `describe` computes summary statistics for every column of the `DataFrame` and returns them as a new `DataFrame`.

In [75]:
print(df_salary.describe())

print(df_salary.describe().to_dict())

shape: (9, 12)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ statistic ┆ work_year ┆ experienc ┆ employmen ┆ … ┆ employee_ ┆ remote_ra ┆ company_l ┆ company_ │
│ ---       ┆ ---       ┆ e_level   ┆ t_type    ┆   ┆ residence ┆ tio       ┆ ocation   ┆ size     │
│ str       ┆ f64       ┆ ---       ┆ ---       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆           ┆ str       ┆ str       ┆   ┆ str       ┆ f64       ┆ str       ┆ str      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ count     ┆ 93597.0   ┆ 93597     ┆ 93597     ┆ … ┆ 93597     ┆ 93597.0   ┆ 93597     ┆ 93597    │
│ null_coun ┆ 0.0       ┆ 0         ┆ 0         ┆ … ┆ 0         ┆ 0.0       ┆ 0         ┆ 0        │
│ t         ┆           ┆           ┆           ┆   ┆           ┆           ┆           ┆          │
│ mean      ┆ 2024.0864 ┆ null      ┆ null      ┆ … ┆ null      ┆ 21.455816 

- #### `glimpse`

    `glimpse` also shows the values of the first few rows of a `DataFrame`, but formats the output differently from `head`: each line of the output corresponds to a single column, making it easier to inspect wide DataFrames.

In [76]:
print(df_salary.glimpse(max_items_per_column=5, return_as_string=True))

Rows: 93597
Columns: 11
$ work_year          <i64> 2025, 2025, 2025, 2025, 2025
$ experience_level   <str> 'MI', 'MI', 'SE', 'SE', 'MI'
$ employment_type    <str> 'FT', 'FT', 'FT', 'FT', 'FT'
$ job_title          <str> 'Research Scientist', 'Research Scientist', 'Research Scientist', 'Research Scientist', 'AI Engineer'
$ salary             <i64> 208000, 147000, 173000, 117000, 100000
$ salary_currency    <str> 'USD', 'USD', 'USD', 'USD', 'USD'
$ salary_in_usd      <i64> 208000, 147000, 173000, 117000, 100000
$ employee_residence <str> 'US', 'US', 'US', 'US', 'US'
$ remote_ratio       <i64> 0, 0, 0, 0, 100
$ company_location   <str> 'US', 'US', 'US', 'US', 'US'
$ company_size       <str> 'M', 'M', 'M', 'M', 'M'



- #### `sample`

    The `sample` method returns a requested number of randomly selected rows from the `DataFrame`.
    
    _N.B.: The rows are not necessarily returned in the same order as they appear in the `DataFrame`._


In [77]:
print(df_salary.sample(n=5, with_replacement=False, seed=14))

shape: (5, 11)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ work_year ┆ experienc ┆ employmen ┆ job_title ┆ … ┆ employee_ ┆ remote_ra ┆ company_l ┆ company_ │
│ ---       ┆ e_level   ┆ t_type    ┆ ---       ┆   ┆ residence ┆ tio       ┆ ocation   ┆ size     │
│ i64       ┆ ---       ┆ ---       ┆ str       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
│           ┆ str       ┆ str       ┆           ┆   ┆ str       ┆ i64       ┆ str       ┆ str      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ 2025      ┆ MI        ┆ FT        ┆ Power BI  ┆ … ┆ US        ┆ 100       ┆ US        ┆ M        │
│           ┆           ┆           ┆ Developer ┆   ┆           ┆           ┆           ┆          │
│ 2025      ┆ MI        ┆ FT        ┆ Associate ┆ … ┆ US        ┆ 100       ┆ US        ┆ M        │
│ 2025      ┆ SE        ┆ FT        ┆ Machine   ┆ … ┆ US        ┆ 0         

- #### `schema`

    The `schema` of a `DataFrame` is a mapping from each column name to that column's data type.

In [78]:
print(df_salary.schema)

Schema([('work_year', Int64), ('experience_level', String), ('employment_type', String), ('job_title', String), ('salary', Int64), ('salary_currency', String), ('salary_in_usd', Int64), ('employee_residence', String), ('remote_ratio', Int64), ('company_location', String), ('company_size', String)])


- #### `schema` (cont.)

    Much like with series, Polars infers the schema of a `DataFrame` when it is created. The `schema` parameter can be used to override this inference if needed.
    
    In Python, an explicit schema can be specified with a dictionary that maps column names to data types. Use the value `None` for any column whose inferred type should be kept.

In [None]:
df = pl.DataFrame(
    {
        "name": ["Alice", "Ben", "Chloe", "Daniel"],
        "age": [27, 39, 41, 43],
    },
    schema={"name": None, "age": pl.Int8},
)

print(df)

- #### `schema` (cont.)

    The parameter `schema_overrides` tends to be more convenient when only some columns need a specific type. Columns that are left out of it keep their inferred types.

In [None]:
df = pl.DataFrame(
    {
        "name": ["Alice", "Ben", "Chloe", "Daniel"],
        "age": [27, 39, 41, 43],
    },
    schema_overrides={"age": pl.UInt8},
)

print(df)

### Expressions and contexts.