Skip to content
This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

Improve performance of Row API by storing TypePtr instead of String #204

Closed
wants to merge 3 commits into from

Conversation

sadikovi
Copy link
Collaborator

@sadikovi sadikovi commented Dec 8, 2018

This PR updates Row struct to store Vec<(TypePtr, Field)> instead of Vec<(String, Field)>. This improves time when reading Parquet files.

For this I had to change record reader to pass type pointer instead of field name. This unfortunately makes it difficult to test, so I added a macro to generate dummy type with a field name.

Benchmark results:

Before

test record_reader_10k_collect             ... bench:  41,689,686 ns/iter (+/- 33,883,503) = 16 MB/s
test record_reader_stock_simulated_collect ... bench: 335,444,189 ns/iter (+/- 36,688,424) = 3 MB/s
test record_reader_stock_simulated_column  ... bench:  14,664,590 ns/iter (+/- 1,444,286) = 87 MB/s

After

test record_reader_10k_collect             ... bench:  26,658,590 ns/iter (+/- 1,773,580) = 25 MB/s
test record_reader_stock_simulated_collect ... bench: 140,238,171 ns/iter (+/- 6,796,855) = 9 MB/s
test record_reader_stock_simulated_column  ... bench:  15,275,906 ns/iter (+/- 349,117) = 84 MB/s

@sadikovi sadikovi requested a review from sunchao December 8, 2018 22:14
pub fn make_row(fields: Vec<(TypePtr, Field)>) -> Row { Row { fields } }

impl PartialEq for Row {
fn eq(&self, other: &Row) -> bool {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Equality is only used in tests, normally user would not run that in the production code.

@@ -706,6 +724,20 @@ mod tests {
}};
}

// NOTE:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should refactor our tests, there is a lot of code duplication here and there.

@sadikovi sadikovi changed the title Update Row API to store TypePtr instead of String Improve performance of Row API by storing TypePtr instead of String Dec 10, 2018
@sadikovi
Copy link
Collaborator Author

Fails to run $ rustup component add rustfmt-preview, I tried triggering build over the weekend and today - still the same error.

@sunchao
Copy link
Owner

sunchao commented Dec 11, 2018

Yes, there are some issue in the rustup toolchain right now. See the reddit thread. Let me see if we can use a nightly on a particular date to fix this issue.

@coveralls
Copy link

Pull Request Test Coverage Report for Build 706

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 99 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.007%) to 95.776%

Files with Coverage Reduction New Missed Lines %
record/api.rs 27 96.4%
record/reader.rs 72 87.4%
Totals Coverage Status
Change from base Build 705: -0.007%
Covered Lines: 13175
Relevant Lines: 13756

💛 - Coveralls

@sunchao
Copy link
Owner

sunchao commented Dec 18, 2018

Sorry @sadikovi I forgot about this PR. Since parquet-rs has merged into Apache Arrow, could you file a JIRA and submit a PR for that? Thanks.

@sadikovi
Copy link
Collaborator Author

Yes, no problem. I think it would be good to mark this repo as moved or something similar.

@sadikovi sadikovi closed this Dec 18, 2018
@sadikovi sadikovi deleted the update-record-api branch December 18, 2018 09:23
@sunchao
Copy link
Owner

sunchao commented Dec 19, 2018

Yes. Will update the README for the merge.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants