Skip to content
This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

Make FileReader generic to Page and Row reader #136

Closed
wants to merge 1 commit into from

Conversation

gnieto
Copy link
Contributor

@gnieto gnieto commented Jul 28, 2018

I've also parametrized FileReader with PagerReader and
RowGroupReader, which avoids boxing and some dynamic dispatching.

This is an intermediate step to make FileReader generic on any Read.
More context on: #135

I've also parametrized FileReader with PagerReader and
RowGroupReader, which avoids boxing and some dynamic dispatching.

This is an intermediate step to make FileReader generic on any `Read`.
More context on: sunchao#135
@coveralls
Copy link

Pull Request Test Coverage Report for Build 568

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 95.45%

Totals Coverage Status
Change from base Build 564: 0.0%
Covered Lines: 11349
Relevant Lines: 11890

💛 - Coveralls

@sunchao
Copy link
Owner

sunchao commented Jul 29, 2018

Thanks @gnieto . My concern is that this makes the API more verbose and more confusing. For instance, user of RowIter now needs to know what is the type for PageReader and RowGroupReader, which they shouldn't (not sure if associated types help here). Making RowGroupReader sized also bring some restrictions.

The concern about dynamic dispatching is valid though - is it possible that we can have some micro-benchmark to show the actual benefit before making the decision?

@gnieto
Copy link
Contributor Author

gnieto commented Aug 6, 2018

Hi, sorry for the late reply.

I've written a benchmark (which, to be honest, I do not know if it's representative) to check how the dynamic dispatch affects.

The benchmark is the following one:

#[bench]
fn bench_get_column_reader(bench: &mut Bencher) {
  use ::parquet::file::reader::FileReader;

  let file = ::std::fs::File::open("data/10k-v2.parquet").unwrap();
  let fr = ::parquet::file::reader::SerializedFileReader::new(file).unwrap();

  let metadata = fr.metadata();
  let group_amount = metadata.num_row_groups();


  bench.iter(|| {
    fr.get_row_group(0).unwrap();
  })
}

#[bench]
fn bench_get_column_reader_rand(bench: &mut Bencher) {
  use ::parquet::file::reader::FileReader;

  let file = ::std::fs::File::open("data/10k-v2.parquet").unwrap();
  let fr = ::parquet::file::reader::SerializedFileReader::new(file).unwrap();

  let metadata = fr.metadata();
  let group_amount = metadata.num_row_groups();

  bench.iter(|| {
    let rg = ::rand::random::<usize>() % group_amount;
    fr.get_row_group(rg).unwrap();
  })
}

I've ran the benchmarks 3 times.

On master, the results are the following:

test bench_get_column_reader      ... bench:         267 ns/iter (+/- 1)
test bench_get_column_reader      ... bench:         274 ns/iter (+/- 8)
test bench_get_column_reader      ... bench:         267 ns/iter (+/- 2)


test bench_get_column_reader_rand ... bench:         293 ns/iter (+/- 5)
test bench_get_column_reader_rand ... bench:         297 ns/iter (+/- 7)
test bench_get_column_reader_rand ... bench:         292 ns/iter (+/- 3)

On this branch:

test bench_get_column_reader      ... bench:         251 ns/iter (+/- 3)
test bench_get_column_reader      ... bench:         252 ns/iter (+/- 4)
test bench_get_column_reader      ... bench:         253 ns/iter (+/- 1)

test bench_get_column_reader_rand ... bench:         274 ns/iter (+/- 3)
test bench_get_column_reader_rand ... bench:         275 ns/iter (+/- 19)
test bench_get_column_reader_rand ... bench:         274 ns/iter (+/- 13)

I got the feeling that this is bound by IO, but it seems a little bit better w/o dynamic dispatch.


Even that, I fully agree that the API gets worse and I doubt that the change is worth. But, on top of https://github.com/sunchao/parquet-rs/pull/135/files, maybe PageReader and/or RowGroupReader can be removed and make SerializedPageReader and SerializedRowGroupReader generic to Read.

But I'm happy to close this PRs if you think that the API gets too much complicated.

@sunchao
Copy link
Owner

sunchao commented Aug 8, 2018

@gnieto thanks for doing this! yeah I'm not fully convinced that the perf gain is worthing changing the API... I do think though that we should replace hardcoded File with something more generic such as ParquetReader so that in future we can support more types of data sources such as HDFS. Is it possible to do that without making the API change?

@sadikovi
Copy link
Collaborator

sadikovi commented Aug 8, 2018

I was thinking if it is worth making this PR a part of overall performance improvement work that we may need to do and discussing directions.

@sunchao
Copy link
Owner

sunchao commented Aug 8, 2018 via email

@gnieto
Copy link
Contributor Author

gnieto commented Aug 8, 2018

@sunchao Ok, let me close this and rework the previous PR. I will try to minimize the changes, but I think that I won't be able to avoid adding some generic parameters over Read.

@gnieto gnieto closed this Aug 8, 2018
@gnieto gnieto deleted the genricReaders branch August 8, 2018 20:33
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants