A simple benchmark for Python DataFrame frameworks

The benchmark uses the data from the NYC Taxi and Limousine Commission (TLC) dataset with billions of rows available.

Although all frameworks can read a CSV file, a more optimized approach is to use a binary version, namely Parquet. Therefore, in the current version of the benchmark all implementations are using Parquet to load data except (Datatable, Sdc and Cylon) for each the method didn't exist/work.

Current frameworks:

Pandas The python data analysis library.
Intel® Scalable Dataframe Compiler Extension of Numba that enables compilation of Pandas* operations. It automatically vectorizes and parallelizes the code.
Bodo New approach to HPC-style parallel computing. Able to run distributed using MPI. Needs license.
Vaex High performance Python library for lazy Out-of-Core DataFrames.
Intel® Distribution of Modin A performant, parallel, and distributed dataframe system.
Dask A Dask DataFrame is a large parallel DataFrame composed of many smaller Pandas DataFrames, split along the index. These Pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster.
Koalas Implements the pandas DataFrame API on top of Apache Spark and thus able to run distributed.
Rapids Open GPU Data Science. Able to run distributed with multiple GPUs when combined with Dask.
Polars Polars is a fast DataFrames library implemented in Rust using Apache Arrow Columnar Format as memory model.
Cylon Is a data engineering toolkit designed to work with AI/ML systems and integrate with data processing systems. Able to run distributed using MPI.

Requirements

The benchmark assumes the following packages are install:

Python
PowerJoular. PowerJoular requires root/sudo access on the latest Linux kernels (5.10 and newer): sudo powerjoular. So if needed add sudo in run.sh and setup passwordless sudo.
JDK for Koalas

Operations

The benchmark currently support the following common operations of dataframes:

join: join with a very small dataframe with description of payment type.
mean: mean of one column.
sum: sum of several columns.
groupby: mean of a given column grouped by other column.
multiple_groupby: mean of a given column grouped by other two columns.
unique_rows: Counts of unique values of a given column.
filter: The filter operation filter the records to the ones that received a tip between $1 – 5 dollars, filtering down to 36% of the original data.

Instructions for setup

All frameworks are loaded in a independent Conda environment apart from X that does not support Conda environment and thus we use regular Python virtual environment.

Thus the setup script creates an environment for each framework and also downloads the NYC TLC data to local parquet folders, for different dataset sizes.

setup.sh

Instructions for running

Thus the run script uses the environments created for each framework by the setup script and the local data files download by the setup script, either to CSV or parquet files.

For each dataset size and framework, the script monitors the energy consumption using PowerJoular and store all results in a CSV file, one per dataset size.

run.sh

Plotting results

The plot script generates the graphics in PNG and PDF format, taking as input the CSV files with results, one per dataset size.

plot.sh

Feedback

Updated source and an issue tracker are available at GitHub.

Your feedback is welcome.

Contact

Ricardo Vilaca (rmvilaca@di.uminho.pt)

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
envs		envs
src		src
README.md		README.md
plot.sh		plot.sh
run.sh		run.sh
setup.sh		setup.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

envs

envs

src

src

README.md

README.md

plot.sh

plot.sh

run.sh

run.sh

setup.sh

setup.sh

Repository files navigation

A simple benchmark for Python DataFrame frameworks

Requirements

Operations

Instructions for setup

Instructions for running

Plotting results

Feedback

Contact

About

Releases

Packages

Languages

rmpvilaca/PDFPEEB

Folders and files

Latest commit

History

Repository files navigation

A simple benchmark for Python DataFrame frameworks

Requirements

Operations

Instructions for setup

Instructions for running

Plotting results

Feedback

Contact

About

Resources

Stars

Watchers

Forks

Languages