-
Notifications
You must be signed in to change notification settings - Fork 20
Make FileReader generic to Page and Row reader #136
Conversation
I've also parametrized FileReader with PagerReader and RowGroupReader, which avoids boxing and some dynamic dispatching. This is an intermediate step to make FileReader generic on any `Read`. More context on: sunchao#135
Thanks @gnieto . My concern is that this makes the API more verbose and more confusing. For instance, user of The concern about dynamic dispatching is valid though - is it possible that we can have some micro-benchmark to show the actual benefit before making the decision? |
Hi, sorry for the late reply. I've written a benchmark (which, to be honest, I do not know if it's representative) to check how the dynamic dispatch affects. The benchmark is the following one: #[bench]
fn bench_get_column_reader(bench: &mut Bencher) {
use ::parquet::file::reader::FileReader;
let file = ::std::fs::File::open("data/10k-v2.parquet").unwrap();
let fr = ::parquet::file::reader::SerializedFileReader::new(file).unwrap();
let metadata = fr.metadata();
let group_amount = metadata.num_row_groups();
bench.iter(|| {
fr.get_row_group(0).unwrap();
})
}
#[bench]
fn bench_get_column_reader_rand(bench: &mut Bencher) {
use ::parquet::file::reader::FileReader;
let file = ::std::fs::File::open("data/10k-v2.parquet").unwrap();
let fr = ::parquet::file::reader::SerializedFileReader::new(file).unwrap();
let metadata = fr.metadata();
let group_amount = metadata.num_row_groups();
bench.iter(|| {
let rg = ::rand::random::<usize>() % group_amount;
fr.get_row_group(rg).unwrap();
})
} I've ran the benchmarks 3 times. On master, the results are the following: test bench_get_column_reader ... bench: 267 ns/iter (+/- 1)
test bench_get_column_reader ... bench: 274 ns/iter (+/- 8)
test bench_get_column_reader ... bench: 267 ns/iter (+/- 2)
test bench_get_column_reader_rand ... bench: 293 ns/iter (+/- 5)
test bench_get_column_reader_rand ... bench: 297 ns/iter (+/- 7)
test bench_get_column_reader_rand ... bench: 292 ns/iter (+/- 3) On this branch: test bench_get_column_reader ... bench: 251 ns/iter (+/- 3)
test bench_get_column_reader ... bench: 252 ns/iter (+/- 4)
test bench_get_column_reader ... bench: 253 ns/iter (+/- 1)
test bench_get_column_reader_rand ... bench: 274 ns/iter (+/- 3)
test bench_get_column_reader_rand ... bench: 275 ns/iter (+/- 19)
test bench_get_column_reader_rand ... bench: 274 ns/iter (+/- 13) I got the feeling that this is bound by IO, but it seems a little bit better w/o dynamic dispatch. Even that, I fully agree that the API gets worse and I doubt that the change is worth. But, on top of https://github.com/sunchao/parquet-rs/pull/135/files, maybe But I'm happy to close this PRs if you think that the API gets too much complicated. |
@gnieto thanks for doing this! yeah I'm not fully convinced that the perf gain is worthing changing the API... I do think though that we should replace hardcoded |
I was thinking if it is worth making this PR a part of overall performance improvement work that we may need to do and discussing directions. |
Yes we definitely need to address the perf issue. From the benchmark I’m
not sure if this is the bottleneck though. Maybe we can keep this in mind
and come back later if necessary, after we have more findings?
…On Wed, Aug 8, 2018 at 1:09 AM Ivan ***@***.***> wrote:
I was thinking if it is worth making this PR a part of overall performance
improvement work that we may need to do and discussing directions.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#136 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAe7N2obVSJRYCCXkm29va6SHqVo_F5Bks5uOpyigaJpZM4VlCu9>
.
|
@sunchao Ok, let me close this and rework the previous PR. I will try to minimize the changes, but I think that I won't be able to avoid adding some generic parameters over |
I've also parametrized FileReader with PagerReader and
RowGroupReader, which avoids boxing and some dynamic dispatching.
This is an intermediate step to make FileReader generic on any
Read
.More context on: #135