# Intro to [polars](https://pola.rs)

A brief introduction to the incredible `polars` dataframe library.

![polars logo](https://raw.githubusercontent.com/pola-rs/polars-static/master/banner/polars_github_banner.svg)

Created by: [Ryan Parker](https://github.com/rparkr), on `2024-08-15`.

# Data analysis in Python
As an interpreted language with an easy-to-read syntax, Python is fantastic for data analysis, where rapid iteration enables exploration and accelerates development.

Since its development began in 2008, [pandas](https://pandas.pydata.org/docs/) has been the de facto standard for data analysis in Python, but in recent years other libraries have emerged that offer distinct advantages. Some of those include:
- [cuDF](https://docs.rapids.ai/api/cudf/stable/): GPU-accelerated dataframe operations with pandas API support
- [modin](https://modin.readthedocs.io/en/stable/): pandas API running on distributed compute using [Ray](https://www.ray.io/) or [Dask](https://www.dask.org/) as a backend
- [ibis](https://ibis-project.org/): dataframe library supporting dozens of backends (including pandas, polars, DuckDB, and many SQL databases)
- [DuckDB](https://duckdb.org/): in-process database engine for running SQL queries on local or remote data
- [temporian](https://temporian.readthedocs.io/en/stable/): efficient data processing for timeseries data
- [polars](https://pola.rs/): ultra-fast dataframe library written in Rust
- and others...

# Polars advantages
- Easy to use
- Parallelized across all CPU cores
- Zero required dependencies
- Built on the Apache Arrow in-memory data format: enables zero-copy interoperability with other libraries (e.g., DuckDB, Snowflake)
- Handles datasets larger than RAM
- Powerful query optimizer
- Works with scikit-learn, thanks to the [Dataframe Interchange Protocol](https://data-apis.org/dataframe-protocol/latest/)
- <img src="https://www.rust-lang.org/static/images/rust-logo-blk.svg" width="20"> written in [Rust](https://rust-lang.org), a compiled language that has experienced rapid adoption since its first stable release in 2015 thanks to its C/C++ performance, concurrency, and memory safety

# Key concepts
> Polars uses the Apache Arrow data format, which is column-oriented. The primary data structures for polars are Series and DataFrames, similar to pandas.

- Apache Arrow supports many useful data types (many more than those which are supported by NumPy), so you can perform fast, vectorized operations on all kinds of data (nested JSON `structs`, strings, datetimes, etc.)

## Contexts
In Polars, a _context_ refers to the data available to operate on.

The primary contexts are:

**Selection**:
- `.select()`: choose a subset of columns and perform operations on them
- `.with_columns()`: add to the columns already available

**Filtering**:
- `.filter()`: filter the data using boolean conditions on row values

**Aggregation**:
- `.group_by().agg()`: perform aggregations on groups of values

## Expressions
_Expressions_ describe the operations Polars performs on columns, such as:
- `.sum()`
- `.len()`
- `.mean().over()...`
- `when().then().otherwise()`
- `.str.replace()`

## Lazy vs. Eager mode
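- `scan_csv()` vs. `read_csv()`: `scan_csv()` is lazy, returning a `LazyFrame` that does no work until you call `.collect()`; `read_csv()` is eager, loading the whole file into a `DataFrame` right away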

### Recommendation: use Lazy mode
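- In Lazy mode, Polars will optimize the query plan before running it (for example, by pushing filters and column selections down to the file scan so that less data is read)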

# Plugin ecosystem
You can create custom expressions to use in Polars, which will also be vectorized and run in parallel like standard Polars expressions. If there's an operation you'd like to run on your data, chances are someone has already implemented it and it's just a `pip install` away. Here are [some examples](https://docs.pola.rs/user-guide/expressions/plugins/#community-plugins)...

### [`polars_ds`](https://github.com/abstractqqq/polars_ds_extension): Polars extension for data science tasks
> - A combination of functions and operations from scikit-learn, SciPy, and edit distance
> - Polars is the only dependency (unless you want to create plots; that adds Plotly as a dependency)
> - Can create bar plots within dataframe outputs (HTML `__repr__` in a notebook) -- like sparklines, and similar to what is available in pandas advanced dataframe styling options


### [`polars_distance`](https://github.com/ion-elgreco/polars-distance): distance calculations (e.g., word similarity) in polars
> also includes haversine distance (lat/lon), cosine similarity, etc.

### [`polars_reverse_geocode`](https://github.com/MarcoGorelli/polars-reverse-geocode): offline reverse geocoding
> find a city based on provided lat/lon; using an offline lookup table

### Tutorial: [how to create a polars plugin](https://marcogorelli.github.io/polars-plugins-tutorial/)

# Final thoughts

## Upgrade weekly
⭐ Polars development [advances rapidly](https://github.com/pola-rs/polars/releases), so I recommend upgrading often (weekly) to get the latest features.

## Try it out
The best way to learn is by doing. Try using Polars any time you create a new notebook or start a new project.

# Resources
- [Polars user guide](https://docs.pola.rs/): fantastic guide for learning Polars, with helpful explanations
- [Coming from `pandas`](https://docs.pola.rs/user-guide/migration/pandas/): are you familiar with `pandas` and want to learn the differences you'll notice when switching to polars? This guide translates common concepts to help you.
  - [This series of articles from 2022](https://kevinheavey.github.io/modern-polars/) demonstrates some operations in pandas and polars, side-by-side. _Polars development advances rapidly, so many of the details covered in that series have since changed. Still, it will help you get a general feel for the flow of using Polars compared to pandas._
- [Polars Python API](https://docs.pola.rs/api/python/stable/reference/index.html): detailed info on every expression, method, and function in Polars. I recommend browsing this list to get a feel for what Polars can do.