benchmarks: add dataset_serialize_benchmark.py #124
@@ -0,0 +1,272 @@
import itertools
import logging
import os
import shutil
import subprocess
import sys
import time
import uuid

import conbench.runner
import pyarrow
import pyarrow.dataset as ds

from benchmarks import _benchmark

log = logging.getLogger(__name__)

# All benchmark scenarios will write below OUTPUT_DIR_PREFIX (a directory
# under /dev/shm). That directory tree is removed upon completion (not
# necessarily in case of error, though).
OUTPUT_DIR_PREFIX = os.path.join("/dev/shm/", "bench-" + str(uuid.uuid4())[:8])

@conbench.runner.register_benchmark
class DatasetSerializeBenchmark(_benchmark.Benchmark):
    """
    This benchmark is supposed to measure the time it takes to write data from
    memory (from a pyarrow Table) to a tmpfs file system, given a specific
    serialization format (parquet, arrow, ...).

    To make this benchmark agnostic to disk read performance on the input side
    of things, the data is read fully into memory before starting (and timing)
    the benchmark function. That is (believed to be) achieved with:

        data = source_dataset.to_table(filter=somefilter, ...)
Review comment: If we are performing this outside of the measurements, then the filter is also not being measured, right? I'm all for keeping the code, though, because I think it would make sense to eventually expand this benchmark into measuring an end-to-end read - filter - write workload.

Reply: Right! The following benchmark times read-from-disk-backed-fs-and-then-filter: […] And this one does, too: […] Although I would think that in both cases the filtering does not dominate the time; instead it's the network-attached disk (see below). For focusing a benchmark on filter performance, the data should be in memory already, using e.g. tmpfs (…).

Reply: Yeah, interesting thought. Maybe we want this kind of test/benchmark in the future. As described in #123, in fact, here I started with that. And then I wondered what the use is. I called this flow "RDFSW": […] and I found that the signal on the final two stages was weak / non-existent, because of the read-from-network-attached-disk on the left-hand side of the flow. That part is both slow and volatile compared to the other steps. If we do this, we should read from tmpfs (which limits dataset size, of course). That's a great discussion for subsequent work!
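For later reference, here is a rough sketch (not part of this PR) of how the read/filter stage and the serialize/write stage of such an end-to-end flow could be timed separately; the function and parameter names below are made up for illustration:

```python
import time

import pyarrow.dataset as ds


def time_stages(source_ds: ds.Dataset, flt, out_dir: str, fmt: str):
    """Time read+filter and serialize+write separately (illustrative only)."""
    t0 = time.monotonic()
    # read-from-disk -> deserialize -> filter, fully materialized in memory
    table = source_ds.to_table(filter=flt)
    t_read_filter = time.monotonic() - t0

    t0 = time.monotonic()
    # serialize -> write-to-filesystem in one go
    ds.write_dataset(data=table, base_dir=out_dir, format=fmt)
    t_write = time.monotonic() - t0
    return t_read_filter, t_write
```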

    After data of interest has been read into memory, the following call is
    used for both serialization and writing-to-filesystem in one go:

        pyarrow.dataset.write_dataset(format=someformat, ...)

    That operation is timed (and the duration is the major output of this
    benchmark).

    The data is written to `/dev/shm` (available on all Linux systems). This is
    a file system backed by RAM (tmpfs). The assumption is that writing to
    tmpfs is fast (so fast that benchmark duration is significantly affected by
    the serialization itself), and stable (so that its performance is ~constant
    across runs on the same machine).

    This benchmark does not resolve how much time goes into the CPU work for
    serialization vs. the system calls for writing to tmpfs (that would be a
    different, also interesting, question to answer, and is maybe more of a
    task for profiling).

    There are two dimensions that are varied:

    - serialization format
    - amount of the data being written, as set by a filter on the input
Review comment: Ah, I see, the filter is the means by which we are varying the data size.

Reply: Right. It's a little indirect, and we don't get to see how big the data actually was. I think this can/should be improved in the future.

Reply: Update: with the recent commits I changed the approach.

    A note about /dev/shm: it's of great value because

    - unprivileged users can write to it
    - the `base_dir` arg to pyarrow.dataset.write_dataset() requires a path to
      a directory. That is, one cannot inject a memory-backed Python file
      object (a strategy that's elsewhere often used to simulate writing to an
      actual file)
    - it is not available on macOS, which is why we skip this benchmark there

    """

    name = "dataset-serialize"

    arguments = ["source"]

    sources = [
        "nyctaxi_multi_parquet_s3",
        "nyctaxi_multi_ipc_s3",
        # "chi_traffic_2020_Q1",
Review comment: With the […] for the […]: [error output elided]

Reply: @jonkeane should we submit this as an issue somewhere?

Reply: Looking at that error, there's an issue converting a struct to a text representation for the csv format. There might be an issue in apache/arrow for that already (though I suspect that faithful writing of structs to csvs is a feature that apache/arrow will punt on, or possibly already did punt on).
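Not part of the PR, but if the struct-to-text conversion is what blocks the csv case for this source, one conceivable workaround (a sketch only, untested against this dataset) would be to flatten struct columns before handing the table to the csv writer:

```python
import pyarrow as pa


def flatten_struct_columns(table: pa.Table) -> pa.Table:
    # Repeatedly flatten so that nested structs are also expanded into one
    # column per leaf field; list-of-struct columns are not handled here.
    while any(pa.types.is_struct(f.type) for f in table.schema):
        table = table.flatten()
    return table
```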
    ]

    sources_test = [
        "nyctaxi_multi_parquet_s3_sample",
        "nyctaxi_multi_ipc_s3_sample",
        "chi_traffic_sample",
    ]

    _params = {
Review comment: There are a number of other parameters that are likely worth exploring here.

Reply: Great input! I think writing-with-compression should be its own rather laser-focused benchmark, very similar to what's in here so far (certainly with writing to RAM!). It's a great candidate for extending this very benchmark later by one more dimension.

Reply: About partitioning: maybe this makes most sense when writing many files of considerable size. I don't know yet how compatible these ideas are: i) write to a potentially small tmpfs and ii) exercise quite a bit of partitioning machinery. They probably are compatible. But yeah, let's think about that in an independent effort.
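To make that direction concrete, a minimal sketch of how compression and partitioning could later become extra case dimensions; the codec, column name, and output path below are made-up examples, not part of this PR:

```python
import pyarrow as pa
import pyarrow.dataset as ds


def write_with_compression_and_partitioning(table: pa.Table, out_dir: str):
    # Parquet-specific write options; zstd is just one of several codecs.
    opts = ds.ParquetFileFormat().make_write_options(compression="zstd")
    ds.write_dataset(
        data=table,
        base_dir=out_dir,
        format="parquet",
        file_options=opts,
        # Hive-style directory partitioning on a (hypothetical) string column.
        partitioning=ds.partitioning(
            pa.schema([("payment_type", pa.string())]), flavor="hive"
        ),
    )
```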
        "selectivity": ("1pc", "10pc"),
Review comment: In the end we'd also like to measure 100%; is that not included to keep the runtime in check?

Reply: I am conservative here to keep /dev/shm usage in check. On my machine I have a 16 GB /dev/shm, and during development I managed to fill it and almost crashed my desktop environment. I will play a little more with that and maybe update the PR again to add a variant that writes more data, maybe not 100 %, but more than 10 % :).

Reply: Added commits. For […] I think we should not grow beyond that.

Reply: Update: I have adopted the […].

Review comment: Percent (I'm assuming […]
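A simple guard along these lines could check the free space on /dev/shm before a large case is run; this is only a sketch (the headroom factor is arbitrary), not something the PR implements:

```python
import shutil


def shm_can_hold(estimated_bytes: int, path: str = "/dev/shm") -> bool:
    """Return True if the tmpfs at `path` has enough free space for the
    expected output, with some headroom against filling it up."""
    free = shutil.disk_usage(path).free
    return estimated_bytes * 2 < free  # factor 2: arbitrary safety margin
```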
        "format": (
            "parquet",
            "arrow",
            "ipc",
            "feather",
Review comment: arrow == feather, at least for feather v2, and I don't think we care much about v1 anymore. There's been some effort now to just start calling everything "arrow files", so we can probably just drop feather here.

Review comment: Wait, is ipc also the same thing? I don't know that much about it, but it may be...

Reply: Great input. Interesting! Hmm, the timings that I saw suggest that this equality isn't quite right. It appears like […] (see https://arrow.apache.org/docs/python/feather.html). So, I will remove either […]. The docs for write_dataset() suggest that all three are the same/comparable. However, there probably is a difference (as indicated by measurement results); I think it might have to do with compression. For […] Maybe. I think in this case it makes sense to leave in both, kind of covering the 'default' settings.
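One way to settle whether the format strings really produce identical output would be to write the same in-memory table once per format string and compare the resulting directory sizes. A rough sketch (not part of the PR; names and the base path are illustrative):

```python
import os

import pyarrow as pa
import pyarrow.dataset as ds


def dir_size(path: str) -> int:
    # Recursive on-disk size of everything below `path`, in bytes.
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, names in os.walk(path)
        for name in names
    )


def compare_format_sizes(table: pa.Table, base: str):
    # Write the same table once per format and report output sizes, to see
    # whether e.g. "arrow" and "feather" really produce identical bytes.
    for fmt in ("arrow", "ipc", "feather"):
        out = os.path.join(base, fmt)
        os.makedirs(out)
        ds.write_dataset(data=table, base_dir=out, format=fmt)
        print(fmt, dir_size(out))
```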
            "csv",
        ),
    }

    valid_cases = [tuple(_params.keys())] + list(
        itertools.product(*[v for v in _params.values()])
    )

    filters = {
        "nyctaxi_multi_parquet_s3": {
            "1pc": ds.field("pickup_longitude") < -74.013451,  # 561384
            "10pc": ds.field("pickup_longitude") < -74.002055,  # 5615432
            "100pc": None,  # 56154689
        },
Review comment: If we need to recalculate any of these for some reason (say we decide we're interested in 20% or whatever at some point), how are these values calculated? Can we store that somewhere?

Reply: I started building this module by copying everything from […]. I do not know, but I suppose the original author applied some manual approximation, tweaking the threshold until reaching roundabout the desired number of rows. I thought about simply taking a number of rows from the head, tail, or middle, but then I realized that the filter gives a set of rows that might be a more interesting data selection, obtained deterministically (as you say, just doing a random subset of fixed size isn't going to be comparable). I fully agree that the […]. I will see if it's super easy to change this into using either head or tail with desired row counts, without affecting the timing of the benchmark much.
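If these thresholds ever need to be recomputed, something along the following lines could derive them from a column quantile for a target selectivity. This is only a sketch (the function name is made up), and it requires reading the column into memory:

```python
import pyarrow.compute as pc
import pyarrow.dataset as ds


def threshold_for_selectivity(dataset: ds.Dataset, column: str, fraction: float):
    """Derive a threshold so that roughly `fraction` of the rows satisfy
    `column < threshold`."""
    col = dataset.to_table(columns=[column])[column]
    return pc.quantile(col, q=fraction)[0].as_py()


# usage sketch: ds.field("pickup_longitude") < threshold_for_selectivity(
#     source_ds, "pickup_longitude", 0.01)
```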
        "nyctaxi_multi_ipc_s3": {
            "1pc": ds.field("pickup_longitude") < -74.014053,  # 596165
            "10pc": ds.field("pickup_longitude") < -74.002708,  # 5962204
            "100pc": None,  # 59616487
        },
        "chi_traffic_2020_Q1": {
            "1pc": ds.field("END_LONGITUDE") < -87.807262,  # 124530
            "10pc": ds.field("END_LONGITUDE") < -87.7624,  # 1307565
            "100pc": None,  # 13038291
        },
        **dict.fromkeys(
            ["nyctaxi_multi_parquet_s3_sample", "nyctaxi_multi_ipc_s3_sample"],
            {
                "1pc": ds.field("pickup_longitude") < -74.0124,  # 20
                "10pc": ds.field("pickup_longitude") < -74.00172,  # 200
                "100pc": None,  # 2000
            },
        ),
        "chi_traffic_sample": {
            "1pc": ds.field("END_LONGITUDE") < -87.80726,  # 10
            "10pc": ds.field("END_LONGITUDE") < -87.76148,  # 100
            "100pc": None,  # 1000
        },
    }

    _case_tmpdir_mapping = {}

    def _create_tmpdir_in_ramdisk(self, case: tuple):
        # Build simple prefix string for specific test case to facilitate
        # correlating directory names to test cases.
        pfx = "-".join(c.lower()[:9] for c in case)
        dirpath = os.path.join(OUTPUT_DIR_PREFIX, pfx + "-" + str(uuid.uuid4()))

        self._case_tmpdir_mapping[tuple(case)] = dirpath

        os.makedirs(dirpath, exist_ok=False)
        return dirpath

    def _get_dataset_for_source(self, source) -> ds.Dataset:
        """Helper to construct a Dataset object."""

        return pyarrow.dataset.dataset(
            source.source_paths,
            schema=pyarrow.dataset.dataset(
                source.source_paths[0], format=source.format_str
            ).schema,
            format=source.format_str,
        )

    def _report_dirsize_and_wipe(self, dirpath: str):
        """
        This module already has a dependency on Linux so we can just as well
        spawn `du` for correct recursive directory size reporting."""

        ducmd = ["du", "-sh", dirpath]
        p = subprocess.run(ducmd, capture_output=True)
        log.info("stdout of %s: %s", ducmd, p.stdout.decode("utf-8").split()[0])
        if p.returncode != 0:
            log.info("stderr of %s: %s", ducmd, p.stderr)
        log.info("removing directory: %s", dirpath)
        shutil.rmtree(dirpath)

    def run(self, source, case=None, **kwargs):

        if not os.path.exists("/dev/shm"):
            sys.exit("/dev/shm not found but required (not available on Darwin). Exit.")

        cases = self.get_cases(case, kwargs)

        for source in self.get_sources(source):

            log.info("source %s: download, if required", source.name)
            source.download_source_if_not_exists()
            tags = self.get_tags(kwargs, source)

            t0 = time.monotonic()
            source_ds = self._get_dataset_for_source(source)
            log.info(
                "constructed Dataset object for source in %.4f s", time.monotonic() - t0
            )

            for case in cases:

                log.info("case %s: create directory", case)
                dirpath = self._create_tmpdir_in_ramdisk(case)
                log.info("directory created, path: %s", dirpath)

                yield self.benchmark(
                    f=self._get_benchmark_function(
                        case, source.name, source_ds, dirpath
                    ),
                    extra_tags=tags,
                    options=kwargs,
                    case=case,
                )

                # Free up memory in the RAM disk (tmpfs), assuming that we're
                # otherwise getting close to filling it (depending on the
                # machine this is executed on, a single test might easily
                # occupy 10 % or more of this tmpfs). Note that what
                # accumulated in `dirpath` is the result of potentially
                # multiple iterations.
                self._report_dirsize_and_wipe(dirpath)

        # Finally, remove the outermost directory. It should have no contents
        # by now, but if an individual benchmark iteration was Ctrl+C'd then
        # this might still do useful cleanup.
        self._report_dirsize_and_wipe(OUTPUT_DIR_PREFIX)

    def _get_benchmark_function(
        self, case, source_name: str, source_ds: ds.Dataset, dirpath: str
    ):

        (selectivity, serialization_format) = case

        # Option A: read-from-disk -> deserialize -> filter -> into memory
        # before timing serialize -> write-to-tmpfs
        t0 = time.monotonic()
        data = source_ds.to_table(filter=self.filters[source_name][selectivity])
        log.info("read source dataset into memory in %.4f s", time.monotonic() - t0)

        # Option B (thrown away, but kept for posterity): use a Scanner()
        # object to transparently filter the source dataset upon consumption,
        # in which case what's timed is read-from-disk -> deserialize ->
        # filter -> serialize -> write-to-tmpfs
        #
        # Note(JP): I have confirmed that for the data used in this benchmark
        # this option is dominated by read-from-disk to the extent that no
        # useful signal is generated for the write phase, at least for some
        # serialization formats.
        #
        # data = pyarrow.dataset.Scanner.from_dataset(
        #     source_ds, filter=self.filters[source_name][selectivity])
Review comment: I kept this because it's somewhat non-obvious from the pyarrow docs that this […]. I think I'd like to add a benchmark that does the complete round trip in a child process, where the metric we would care about is the maximum memory usage of that child process, confirming that this really is operating in a stream-like fashion: that at any given time the process only holds a fraction of the complete data set in memory.
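A rough sketch of how that child-process measurement could look; everything below is illustrative (names are invented), and note that on Linux `ru_maxrss` is reported in KiB:

```python
import multiprocessing
import resource

import pyarrow.dataset as ds


def _scanner_roundtrip(src_paths, fmt, flt, out_dir):
    # Stream-like read -> filter -> write, never materializing the full table.
    source = ds.dataset(src_paths, format=fmt)
    scanner = ds.Scanner.from_dataset(source, filter=flt)
    ds.write_dataset(data=scanner, base_dir=out_dir, format=fmt)


def peak_child_rss_kib(src_paths, fmt, flt, out_dir) -> int:
    """Run the round trip in a child process and report its peak RSS."""
    p = multiprocessing.Process(
        target=_scanner_roundtrip, args=(src_paths, fmt, flt, out_dir)
    )
    p.start()
    p.join()
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
```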

        def benchfunc():
            # This is a hack to make each iteration work in a separate
            # directory (otherwise some write operations would error out
            # saying that the target directory is not empty). With `benchrun`
Review comment: If you don't like this, there is an […].

Reply: Oh, cool. Thanks for pointing that out! Documented here: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html. The current approach creates a fresh directory for each iteration and should therefore be 'fastest'. Deletion might take valuable time (which would be timed by the benchmark wrapper), and is therefore not a good option. It might be that […]
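The elided argument is presumably write_dataset()'s `existing_data_behavior` parameter (an assumption based on the linked docs page, available in reasonably recent pyarrow). A minimal sketch of the alternative it would enable, reusing one target directory across iterations:

```python
import pyarrow.dataset as ds


def write_into_shared_dir(data, fmt: str, base_dir: str):
    # "overwrite_or_ignore" lets repeated iterations reuse the same target
    # directory instead of erroring out because it is not empty.
    ds.write_dataset(
        data=data,
        base_dir=base_dir,
        format=fmt,
        existing_data_behavior="overwrite_or_ignore",
    )
```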
            # it will be easier to cleanly hook into doing resource
            # management before and after an iteration w/o affecting the
            # timing measurement. Assume that creating a directory does not
            # significantly add to the duration of the actual payload
            # function.
            dp = os.path.join(dirpath, str(uuid.uuid4())[:8])
            os.makedirs(dp)

            # When dimensioning of benchmark parameters and execution
            # environment are not adjusted to each other, tmpfs quickly gets
            # full. In that case writing might fail with
            #
            #   File "pyarrow/_dataset.pyx", line 2859, in
            #     pyarrow._dataset._filesystemdataset_write
            #   File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
            #   OSError: [Errno 28] Error writing bytes to file.
            #     Detail: [errno 28] No space left on device

            pyarrow.dataset.write_dataset(
                data=data, format=serialization_format, base_dir=dp
            )

            # The benchmark function returns `None` for now. If we need
            # deeper inspection into the result maybe iterate on that.

        return benchfunc
Review comment: Thank you! I've had a good amount of issues with isort; I wonder if we should pin a version to hopefully get more consistency.

Reply: In this case the difference between environments was after all how I chose to integrate conbench (not from PyPI, but from my local checkout).