In [1]:
import io
import glob
import pandas as pd
import pyarrow.parquet as pq

We load the data into RAM first so that the benchmarks profile only the actual parsing code and not the disk cache.

### 1. Read a single file that just fits in memory to do a Data Science task

In [2]:
with open('../data/yellow_tripdata_2016-01.csv', 'rb') as f:
    csv_bytes = f.read()
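
The read-once pattern above can be tried on a small scale. The following is a minimal sketch, assuming a temporary file with made-up contents (the column names `a` and `b` and the helper `parse_rows` are illustrative, not part of the benchmark). It shows that once the bytes are in memory, every run can wrap them in a fresh `io.BytesIO` without touching the disk again.

```python
import csv
import io
import os
import tempfile

# Hypothetical stand-in for the real CSV file; the contents are made up.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b"a,b\n1,2\n3,4\n")
    path = tmp.name

# Read the file once; afterwards only the in-memory copy is used.
with open(path, 'rb') as f:
    small_csv_bytes = f.read()
os.remove(path)


def parse_rows(raw):
    """Parse CSV bytes from a fresh in-memory buffer."""
    buf = io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8', newline='')
    return list(csv.reader(buf))


# Every call starts at position 0 of a new buffer, as each timing run does.
first_run = parse_rows(small_csv_bytes)
second_run = parse_rows(small_csv_bytes)
print(first_run)
```

Creating a new `io.BytesIO` per run matters because a buffer that has already been read to the end would yield no data on the next pass.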

All CSV variants used in the benchmarks share some settings, so we pass them to `pandas.read_csv` through a common helper. Setting them explicitly is important for performance.

In [3]:
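def read_nyc_csv(filename_or_buf, **kwargs):
    # Shared settings for all benchmark CSVs; extra options come via kwargs.
    df = pd.read_csv(
        filename_or_buf,
        parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
        index_col=False,
        infer_datetime_format=True,
        **kwargs,
    )
    # Return the frame so the helper is also usable outside the benchmark.
    return df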
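
To illustrate what these settings do, here is a small sketch on a made-up two-row sample that reuses the two timestamp column names from the helper above (the `trip_distance` values are invented for the example). It leaves out `infer_datetime_format`, which recent pandas releases have deprecated, and only checks that `parse_dates` yields real datetime columns.

```python
import io

import pandas as pd

# Made-up sample; only the two timestamp column names come from the helper.
sample = io.StringIO(
    "tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance\n"
    "2016-01-01 00:00:00,2016-01-01 00:10:00,1.5\n"
    "2016-01-01 00:05:00,2016-01-01 00:25:00,3.2\n"
)

sample_df = pd.read_csv(
    sample,
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    index_col=False,  # do not use the first column as the index
)

# Both columns are datetimes, so durations can be computed directly.
duration = sample_df["tpep_dropoff_datetime"] - sample_df["tpep_pickup_datetime"]
print(sample_df.dtypes)
print(duration)
```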

In the original CSV, the `store_and_fwd_flag` column stores its flags as `Y` and `N`. We can load it as a boolean column instead. Because pandas does not recognise these strings as booleans by default, we have to declare which values count as true and which as false.

In [4]:
%%timeit
read_nyc_csv(
    io.BytesIO(csv_bytes),
    dtype={'store_and_fwd_flag': 'bool'},
    true_values=['Y'],
    false_values=['N']
)

41 s ± 190 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
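
The effect of these options is easier to see on a tiny input. The sketch below uses a made-up three-row sample (the `trip_id` column is hypothetical) and passes the same `dtype`, `true_values` and `false_values` arguments as the cell above.

```python
import io

import pandas as pd

# Made-up sample; `trip_id` is a hypothetical column added for illustration.
sample = io.StringIO("trip_id,store_and_fwd_flag\n1,Y\n2,N\n3,N\n")

flags_df = pd.read_csv(
    sample,
    dtype={'store_and_fwd_flag': 'bool'},
    true_values=['Y'],   # strings to interpret as True
    false_values=['N'],  # strings to interpret as False
)

print(flags_df['store_and_fwd_flag'].dtype)
print(flags_df['store_and_fwd_flag'].tolist())
```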


Python and `pandas.read_csv` cannot handle a binary in-memory input larger than 2.1 GiB. For this file we therefore read directly from disk, so the benchmark also includes the overhead of the OS filesystem cache.

In [5]:
%%timeit
with open('../data/str_yellow_tripdata_2016-01.csv', 'rb') as f:
    read_nyc_csv(f)

52.2 s ± 250 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [6]:
with open('../data/cat_yellow_tripdata_2016-01.csv', 'rb') as f:
    csv_bytes = f.read()

In [7]:
%%timeit
read_nyc_csv(io.BytesIO(csv_bytes), dtype={'str': 'category'})

44.1 s ± 184 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
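
The cell above asks pandas to load the `str` column as a categorical. As a rough illustration of what that changes, the sketch below reads a made-up sample twice, once with default settings and once with `dtype={'str': 'category'}`, and compares the memory footprint. The sample values are invented and the sizes depend on the pandas version, so no concrete numbers are claimed here.

```python
import io

import pandas as pd

# Made-up sample: 1000 rows drawn from only two distinct string values.
text = "str\n" + "\n".join(["alpha", "beta"] * 500) + "\n"

plain_df = pd.read_csv(io.StringIO(text))
cat_df = pd.read_csv(io.StringIO(text), dtype={'str': 'category'})

# deep=True also counts the memory held by the string values themselves.
plain_bytes = plain_df['str'].memory_usage(index=False, deep=True)
cat_bytes = cat_df['str'].memory_usage(index=False, deep=True)

print(cat_df['str'].dtype)
print(plain_bytes, cat_bytes)
```

A categorical stores each distinct value once plus a small integer code per row, which is why it pays off for columns with few unique values.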


### 2. Read multiple files (e.g. for online algorithms)

In [8]:
files = glob.glob('../data/yellow_tripdata_2016-*.csv')

In [9]:
%%timeit
for f in files:
    read_nyc_csv(
        f,
        dtype={'store_and_fwd_flag': 'bool'},
        true_values=['Y'],
        false_values=['N'],
    )
    # Here you would normally update e.g. your online algorithm.
    # We skip any work here as we only want to measure I/O time.

4min 15s ± 2.98 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
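
The loop above deliberately skips the update step. As a minimal sketch of what such an update could look like, the example below keeps a running mean over several inputs, holding only one of them in memory at a time. The in-memory "files" and their `trip_distance` values are made up, and the running mean is just one simple choice of online algorithm, not the one used in the benchmark.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for the monthly files; values are made up.
fake_files = [
    "trip_distance\n1.0\n2.0\n",
    "trip_distance\n3.0\n",
    "trip_distance\n4.0\n5.0\n6.0\n",
]

count = 0
total = 0.0
for text in fake_files:
    chunk = pd.read_csv(io.StringIO(text))
    # Online update: fold this file into the running state, then discard it.
    count += len(chunk)
    total += chunk["trip_distance"].sum()

running_mean = total / count
print(running_mean)
```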


### 3. Store/Checkpoint your current state

We load the input for the write benchmarks from the Parquet files so that every benchmark starts from the same in-memory data.

In [10]:
df = pd.read_parquet('../data/yellow_tripdata_2016-01.parquet', engine='pyarrow')

In [11]:
%%timeit
df.to_csv();

3min 4s ± 262 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
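
Called without a path, `DataFrame.to_csv` returns the CSV as a string instead of writing a file, so the timing above covers serialization but no disk writes. A small sketch on a made-up two-row frame (the names `small_df` and `csv_text` are illustrative):

```python
import pandas as pd

# Made-up frame standing in for the taxi data.
small_df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# With no path argument, the CSV text is returned rather than written to disk.
csv_text = small_df.to_csv()

print(type(csv_text).__name__)
print(csv_text)
```

The first column of the output is the index, which `to_csv` writes by default.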


In [12]:
df = pd.read_parquet('../data/str_yellow_tripdata_2016-01.parquet')

In [13]:
%%timeit
df.to_csv();

3min 19s ± 761 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [19]:
df = pq.read_table('../data/cat_yellow_tripdata_2016-01.parquet').to_pandas(categories=["str"])

In [20]:
%%timeit
df.to_csv();

3min 12s ± 3.12 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
