Skip to content

CSV processing benchmarks for different open source technologies

Notifications You must be signed in to change notification settings

ritchie46/bench_csv

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Python benchmarks to process a csv file

To run the benchmarks install Pixi, clone this repository and from inside the repository directory run:

gzip -d data.csv.gz
pixi install
pixi run bench

The results in my machine:

Description File / Function Time (seconds)
Pure Python looping with csv module using int types pure_python_int 3.4547557830810547
Pure Python looping with csv module using float types pure_python_float 3.8738009929656982
pandas with C engine pandas_c 1.50089430809021
pandas with Python engine pandas_python 8.328583478927612
pandas with PyArrow engine and NumPy dtypes pandas_pyarrow 0.31276631355285645
pandas with PyArrow engine and PyArrow dtypes pandas_pyarrow_arrow 0.29172492027282715
Polars in lazy mode polars_lazy 0.1676042079925537
Polars in streaming mode polars_streaming 0.11536002159118652
DuckDB with SQL API duckdb_sql 0.10763740539550781
DataFusion with SQL API datafusion_sql 0.0019359588623046875
NumPy with loadtxt function numpy_loadtxt 1.8354885578155518

The exact version of each library can be seen in the pixi.toml file. Note that DuckDB seems to package for conda-forge later, so the benchmarks use DuckDB 0.9 while 0.10 seems to be available in other package managers.

About

CSV processing benchmarks for different open source technologies

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%