Python benchmarks to process a CSV file

To run the benchmarks, install Pixi, clone this repository, and run the following from inside the repository directory:

gzip -d data.csv.gz
pixi install
pixi run bench

The results on my machine:

| Description | File / Function | Time (seconds) |
| --- | --- | --- |
| Pure Python looping with csv module using int types | pure_python_int | 3.4547557830810547 |
| Pure Python looping with csv module using float types | pure_python_float | 3.8738009929656982 |
| pandas with C engine | pandas_c | 1.50089430809021 |
| pandas with Python engine | pandas_python | 8.328583478927612 |
| pandas with PyArrow engine and NumPy dtypes | pandas_pyarrow | 0.31276631355285645 |
| pandas with PyArrow engine and PyArrow dtypes | pandas_pyarrow_arrow | 0.29172492027282715 |
| Polars in lazy mode | polars_lazy | 0.1676042079925537 |
| Polars in streaming mode | polars_streaming | 0.11536002159118652 |
| DuckDB with SQL API | duckdb_sql | 0.10763740539550781 |
| DataFusion with SQL API | datafusion_sql | 0.0019359588623046875 |
| NumPy with loadtxt function | numpy_loadtxt | 1.8354885578155518 |
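
For readers who want a feel for what the two pure Python rows compare, here is a minimal sketch using only the standard library. It is not the code from this repository: the workload (summing one column), the column layout, the in-memory sample data, and the helper names `sum_column` and `timed` are all assumptions made for illustration.

```python
import csv
import io
import time


def sum_column(lines, column, convert):
    """Sum one column of a CSV, converting each field with `convert`."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    total = 0
    for row in reader:
        total += convert(row[column])
    return total


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Hypothetical stand-in data; the real benchmarks read data.csv instead.
sample = "a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))

int_total, int_seconds = timed(sum_column, io.StringIO(sample), 1, int)
float_total, float_seconds = timed(sum_column, io.StringIO(sample), 1, float)
print(f"int:   {int_total} in {int_seconds:.6f}s")
print(f"float: {float_total} in {float_seconds:.6f}s")
```

A single timed call like this is noisy on small inputs, so the numbers it prints are not comparable to the table above.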

The exact version of each library can be seen in the pixi.toml file. Note that new DuckDB releases seem to reach conda-forge later than other package managers, so the benchmarks use DuckDB 0.9 even though 0.10 appears to be available elsewhere.
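
To confirm which versions actually ended up in an environment, something like the following sketch could be run inside it (for example through `pixi run python`). The distribution names in `PACKAGES` are assumptions based on the libraries named in the table, and `installed_versions` is a hypothetical helper, not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names; adjust them to match pixi.toml if they differ.
PACKAGES = ["pandas", "pyarrow", "polars", "duckdb", "datafusion", "numpy"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```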

