# Getting Started with Polars: A Hands-On Introduction

Author: Zach Wong

## Purpose 

This notebook is designed to give a high-level, practical introduction to using Polars - a **fast, efficient DataFrame library** built for modern data workflows. 

This can be helpful if you're coming from tools like Stata or pandas and need to work with **high-volume datasets** where **performance, memory efficiency, and speed matter**.

Feel free to make a copy, modify, and experiment with the code as you go.

*This notebook was last updated 4/30/2025 using Polars v1.28.1. Behavior may vary slightly with future versions.*

## ⚠️ **Note**

This notebook is not intended to replace the official Polars documentation. Rather, consider it a quick-start guide that gets you up and running so you can experiment with Polars in real time as you dive deeper into the official documentation.

While examples of Polars syntax are included, **Polars is always subject to change**. The current user guide and API reference are available on the Polars website, and links to specific concepts appear throughout this notebook:

**User guide**: https://docs.pola.rs/

**(Python) API reference**: https://docs.pola.rs/api/python/stable/reference/index.html

To the extent that there are differences in syntax between what is written in this notebook and what is written in the official documentation, always defer to the official documentation.


## 0. Setup

In [6]:
# Install Polars and PyArrow — pip is Python’s default package manager.
!pip install polars pyarrow

# Limit Polars to 16 threads (set this before importing Polars so the thread pool picks it up)
import os
os.environ['POLARS_MAX_THREADS'] = '16'  # Prevent CPU overload, especially in shared environments

# Import Polars and confirm setup
import polars as pl

# Check Polars version and thread setting
print(f"Polars version: {pl.__version__}")
print(f"Using {os.environ.get('POLARS_MAX_THREADS')} threads")

Polars version: 1.28.1
Using 16 threads


## 1. Polars vs Stata Data Processing Benefits (High-Level)

| Feature                 | Stata                                                        | Polars                                                         |
|-------------------------|--------------------------------------------------------------|----------------------------------------------------------------|
| Memory Use              | Higher: entire dataset loaded into RAM                       | Lower: columnar format with lazy execution and streaming       |
| Performance             | Slower: single-threaded in most editions (Stata/MP is multicore) | Higher: multithreaded, built in Rust                       |
| Big Data Handling       | Limited: can crash or slow down on large datasets            | Robust: supports streaming for out-of-core processing          |
| Extensibility Ecosystem | Limited: scripting in Stata language only                    | Flexible: full Python + access to the broader Python ecosystem |

Read more about Polars Features and Benefits here: https://docs.pola.rs/

## 2. Polars Eager DataFrames vs LazyFrames

Polars operates with two primary DataFrame types: DataFrame for eager execution and LazyFrame for lazy execution.

* **DataFrame (Eager)**: Executes operations immediately, similar to pandas or Stata.

* **LazyFrame (Lazy)**: Builds a query plan and defers execution until explicitly triggered, allowing for more efficient, optimized data processing—especially useful with large datasets.

**As a general rule**: use LazyFrame when working with large datasets or in memory-constrained environments (such as shared VMs). Use DataFrame for smaller datasets or when working on a local machine with dedicated resources.

This approach ensures you get the best performance and resource efficiency based on your computing environment.

| Aspect                    | DataFrame (Eager)                                                            | LazyFrame (Lazy)                                                                                               |
|---------------------------|------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|
| Ideal Use Case            | Quick exploration, small datasets, ad hoc workflows                          | Pipelines, large-scale data, low-memory/shared environments                                                    |
| Execution                 | Immediate: each operation runs line by line                                  | Deferred: builds a plan, runs only on .collect() or a sink_*() call                                            |
| Performance               | Faster for small data                                                        | More efficient for large data and multi-step pipelines                                                         |
| Memory Usage              | Higher: all data + intermediate steps held in memory                         | Lower: optimized plan loads only needed rows/columns at execution                                              |
| Parallelism               | Multithreaded, but each operation is optimized in isolation                  | Multithreaded, with whole-query optimization and optional streaming                                            |
| Debuggability             | Easier: inspect output step-by-step                                          | Harder: use explain() to inspect the lazy execution plan                                                       |
| Streaming Mode Compatible | No                                                                           | Yes: enables out-of-core, batch processing                                                                     |
| Read Operations           | read_csv(), read_parquet(): reads full file into memory                      | scan_csv(), scan_parquet(): builds a lazy scan plan (no immediate read)                                        |
| Write Operations          | write_csv(), write_parquet(): writes an in-memory DataFrame immediately      | sink_csv(), sink_parquet(): runs the lazy plan and streams the result to disk, without a separate .collect()   |


### 2a. Basic Example using Eager DataFrames

Like Stata, operations on a Polars DataFrame (eager mode) execute immediately, line by line.
This means that:

* Each transformation is applied **as soon as it's called**.

* The full DataFrame and any intermediate results are **held in memory**.

* This approach is **intuitive but can be memory-intensive** for large datasets.

In [7]:
# Load the CSV into a Polars DataFrame (eager execution)
df = pl.read_csv("Electric_Vehicle_Population_Data.csv")

# Filter rows where Electric Range < 100 and select relevant columns
df = df.filter(pl.col("Electric Range") < 100).select([
    pl.col("VIN (1-10)"),
    pl.col("Electric Range")
])

# Add a new column: natural log of Electric Range
df = df.with_columns(
    (pl.col("Electric Range").log()).alias("log_electric_range")
)

# Print the first 5 rows of the result
print(df.head())

# Export the filtered DataFrame to a CSV file
export_fp = "filtered_ev_data_from_eager.csv"
df.write_csv(export_fp)
print(f"Exported DataFrame to: {export_fp}")

shape: (5, 3)
┌────────────┬────────────────┬────────────────────┐
│ VIN (1-10) ┆ Electric Range ┆ log_electric_range │
│ ---        ┆ ---            ┆ ---                │
│ str        ┆ i64            ┆ f64                │
╞════════════╪════════════════╪════════════════════╡
│ 1C4JJXP68P ┆ 21             ┆ 3.044522           │
│ 1N4AZ0CP8E ┆ 84             ┆ 4.430817           │
│ 7SAYGDEE4S ┆ 0              ┆ -inf               │
│ KNDJX3AE9G ┆ 93             ┆ 4.532599           │
│ JTDKARFP9H ┆ 25             ┆ 3.218876           │
└────────────┴────────────────┴────────────────────┘
Exported DataFrame to: filtered_ev_data_from_eager.csv


Read more about Polars' DataFrames here: https://docs.pola.rs/user-guide/concepts/data-types-and-structures/#dataframe

### 2b. Basic Example using LazyFrames

Unlike Stata, operations on a Polars LazyFrame do not execute immediately.
Instead:

* Each line adds a step to a **deferred execution plan**.

* The actual computation only runs when **.collect() is called**.

* This allows Polars to optimize the entire query and **significantly reduce memory usage**, since data isn't fully loaded or processed until necessary.

In [8]:
# Create a LazyFrame by scanning the CSV (no data is loaded yet)
lf = pl.scan_csv("Electric_Vehicle_Population_Data.csv")  # Note: "scan_csv", not "read_csv"

# Build the query: filter rows where Electric Range < 100 and select relevant columns
lf = lf.filter(pl.col("Electric Range") < 100).select([
    pl.col("VIN (1-10)"),
    pl.col("Electric Range")
])

# Add a new column: natural log of Electric Range
lf = lf.with_columns(
    (pl.col("Electric Range").log()).alias("log_electric_range")
)

# Calling lf.head() on a LazyFrame returns another LazyFrame, not data, so printing it shows the query plan instead of rows.
print(lf.head())

naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

SLICE[offset: 0, len: 5]
   WITH_COLUMNS:
   [col("Electric Range").log().alias("log_electric_range")] 
    SELECT [col("VIN (1-10)"), col("Electric Range")]
    FROM
      FILTER [(col("Electric Range")) < (100)]
      FROM
        Csv SCAN [Electric_Vehicle_Population_Data.csv]
        PROJECT */17 COLUMNS


Until this point, "lf" is just a deferred execution plan - no data has been read or processed yet.
When .collect() is called, Polars analyzes, optimizes, and then executes the entire plan, returning a materialized DataFrame held in memory.

In [9]:
# Materialize the LazyFrame into an eager DataFrame
# This triggers execution of the full query plan and loads the result into memory
df = lf.collect()

# Display the first 5 rows
print(df.head())

shape: (5, 3)
┌────────────┬────────────────┬────────────────────┐
│ VIN (1-10) ┆ Electric Range ┆ log_electric_range │
│ ---        ┆ ---            ┆ ---                │
│ str        ┆ i64            ┆ f64                │
╞════════════╪════════════════╪════════════════════╡
│ 1C4JJXP68P ┆ 21             ┆ 3.044522           │
│ 1N4AZ0CP8E ┆ 84             ┆ 4.430817           │
│ 7SAYGDEE4S ┆ 0              ┆ -inf               │
│ KNDJX3AE9G ┆ 93             ┆ 4.532599           │
│ JTDKARFP9H ┆ 25             ┆ 3.218876           │
└────────────┴────────────────┴────────────────────┘


Read more about Polars' LazyFrames here: https://docs.pola.rs/user-guide/concepts/lazy-api/

## 3. Intro to Streaming Mode

One of Polars’ key advantages over Stata is its **LazyFrame streaming capability**, which is especially useful when working with datasets too large to fit entirely in RAM - common in shared VM environments.

Unlike Stata, which requires full materialization of the dataset in memory, Polars can **stream data in batches**, processing it in **parallel across threads**. This means:

* Only a **portion of the data** is loaded into memory at any given time.

* Memory usage stays **low and predictable**.

* **Resources are preserved** for other users and processes on shared infrastructure.

This makes Polars highly suitable for large-scale data exports, transformations, or aggregations without overwhelming system memory.

In [10]:
# Create a LazyFrame (no data is read yet)
lf = pl.scan_csv("Electric_Vehicle_Population_Data.csv")

# Data operations here...
# (You can add filters, selects, transformations, etc. before collecting)

# Option 1: execute the query plan using the streaming engine (for large datasets)
# Data is processed in batches, but the final result is still returned as an in-memory DataFrame
df = lf.collect(engine="streaming")

# Option 2: write the query result straight to a CSV file, batch by batch
# sink_csv() runs the plan itself, so no .collect() is needed and the full result is not held in memory at once
# The batch_size parameter controls how many rows each thread processes at a time
lf.sink_csv("filtered_ev_data_from_lazy_streaming.csv", batch_size=10)

**Note**: Streaming mode is best used with operations that support it, like filters and column selection. More complex operations (e.g. joins or sorts) may fall back to non-streaming execution.

Read more about streaming capabilities in the Polars User Guide here: https://docs.pola.rs/user-guide/concepts/_streaming/

## 4. Why Parquet > CSV?

While the examples in this notebook use CSV files for simplicity, **Parquet is the preferred file format** when working with Polars due to its performance and efficiency advantages.

**In summary, Parquet files are well-optimized for Polars’ lazy engine, offering advantages in speed, memory efficiency, and scalability. For large data pipelines, Polars strongly recommends using Parquet over formats like CSV or Excel.**

| Feature               | CSV                                | Parquet                                      |
|-----------------------|------------------------------------|----------------------------------------------|
| File size             | Large: text-based with no built-in compression  | Small: compressed binary format |
| Read speed            | Slower: parsed line-by-line          | Faster: columnar format allows efficient parallel reads |
| Write speed           | Moderate: text formatting adds overhead | Faster: writes binary blocks with less overhead |
| Columnar Access       | Inefficient - must read full rows | Efficient - reads only necessary columns        |
| Polars-native support | Slower: line-based parsing         | Optimized: works seamlessly with Polars' lazy and streaming engine |

Read more about Polars' compatibility with Parquet files (and other formats) here: https://docs.pola.rs/user-guide/io/parquet/