This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

Add page writer #133

Merged
merged 9 commits into from
Jul 23, 2018

sadikovi
Collaborator

This PR adds:

  • PageWriter interface and serialised implementation.
  • CompressedPage in addition to Page. CompressedPage is a wrapper that stores a compressed buffer together with its uncompressed length. A Page, when created, is always assumed to hold an uncompressed buffer; a CompressedPage instead stores compressed data internally.
  • Changed SerializedPageReader to take T: Read instead of file source. This allows me to test a page roundtrip.
  • Added tests.
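
To illustrate why taking `T: Read` instead of a file source makes a roundtrip test possible, here is a minimal sketch. All names (`SimplePage`, `SimplePageReader`, `write_page`, `roundtrip`) are hypothetical, and the length-prefixed framing is an assumption made for the example; it is not the Parquet page format and not the code added in this PR.

```rust
use std::io::{self, Cursor, Read, Write};

/// Hypothetical, simplified page: just a byte buffer.
#[derive(Debug, PartialEq)]
struct SimplePage {
    buf: Vec<u8>,
}

/// Writes a page as a 4-byte little-endian length followed by the bytes.
/// This framing is an assumption for the sketch, not the Parquet format.
fn write_page<W: Write>(sink: &mut W, page: &SimplePage) -> io::Result<()> {
    let len = page.buf.len() as u32;
    sink.write_all(&len.to_le_bytes())?;
    sink.write_all(&page.buf)
}

/// Reader generic over any `T: Read`, so a test can pass an in-memory cursor
/// instead of a file.
struct SimplePageReader<T: Read> {
    source: T,
}

impl<T: Read> SimplePageReader<T> {
    fn new(source: T) -> Self {
        SimplePageReader { source }
    }

    /// Returns the next page, or `None` once no further length prefix can be read.
    fn next_page(&mut self) -> io::Result<Option<SimplePage>> {
        let mut len_bytes = [0u8; 4];
        match self.source.read_exact(&mut len_bytes) {
            Ok(()) => {}
            Err(ref e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
            Err(e) => return Err(e),
        }
        let len = u32::from_le_bytes(len_bytes) as usize;
        let mut buf = vec![0u8; len];
        self.source.read_exact(&mut buf)?;
        Ok(Some(SimplePage { buf }))
    }
}

/// Writes all pages into a `Vec<u8>` and reads them back through a `Cursor`.
fn roundtrip(pages: &[SimplePage]) -> io::Result<Vec<SimplePage>> {
    let mut sink = Vec::new();
    for page in pages {
        write_page(&mut sink, page)?;
    }
    let mut reader = SimplePageReader::new(Cursor::new(sink));
    let mut out = Vec::new();
    while let Some(page) = reader.next_page()? {
        out.push(page);
    }
    Ok(out)
}

fn main() -> io::Result<()> {
    let pages = vec![
        SimplePage { buf: vec![1, 2, 3] },
        SimplePage { buf: vec![] },
    ];
    let read_back = roundtrip(&pages)?;
    assert_eq!(read_back, pages);
    Ok(())
}
```

Because the reader only requires `Read`, the same type would work over a `File` in production code and over a `Cursor<Vec<u8>>` in tests.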

@sadikovi
Collaborator Author

@sunchao I implemented page writer. Let me know if I need to add more tests, I feel like the currently added tests may not cover all of the cases. Thanks!

@coveralls
coveralls commented Jul 20, 2018

Pull Request Test Coverage Report for Build 560

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 5 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.05%) to 95.442%

Files with Coverage Reduction | New Missed Lines | %
encodings/encoding.rs | 1 | 94.8%
column/page.rs | 4 | 96.3%
Totals (Coverage Status):
Change from base Build 558: 0.05%
Covered Lines: 11348
Relevant Lines: 11890

💛 - Coveralls

}

/// Returns underlying page with potentially compressed buffer.
pub fn get_compressed_page(&self) -> &Page {
Owner

nit: could we change this to compressed_page? - just to conform with other method names.

Collaborator Author

Will do, thanks.

fn close(&mut self) -> Result<()>;

/// Returns dictionary page offset in bytes, if set.
#[inline]
Owner

Is it necessary to put #[inline] here? Maybe just on the actual methods.

Collaborator Author

Will do, thanks.

fn write_metadata(&mut self, metadata: &ColumnChunkMetaData) -> Result<()>;

/// Closes resources and flushes underlying sink.
fn close(&mut self) -> Result<()>;
Owner

If the PageWriter is not supposed to be used after close() is called, can we make close consume self?

Collaborator Author

Well, I thought about it, but this method is part of trait PageWriter, so I cannot make it close(self).
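
For reference, one pattern sometimes used when a trait method should consume the writer while the trait is still used as a trait object is a `self: Box<Self>` receiver. The sketch below uses a hypothetical `Sink` trait and `CountingSink` type, neither of which comes from this PR; whether the approach suits `PageWriter` depends on how the column writer ends up owning it.

```rust
/// Hypothetical trait for illustration; this is not the crate's `PageWriter`.
trait Sink {
    /// Records some bytes.
    fn write(&mut self, bytes: &[u8]);

    /// Taking `self: Box<Self>` consumes the boxed writer and remains
    /// callable on a `Box<dyn Sink>` trait object.
    fn close(self: Box<Self>) -> usize;
}

/// Hypothetical implementation that only counts the bytes it receives.
struct CountingSink {
    written: usize,
}

impl Sink for CountingSink {
    fn write(&mut self, bytes: &[u8]) {
        self.written += bytes.len();
    }

    fn close(self: Box<Self>) -> usize {
        self.written
    }
}

/// Writes every chunk, then closes the sink and returns the byte count.
fn write_and_close(mut sink: Box<dyn Sink>, chunks: &[&str]) -> usize {
    for chunk in chunks {
        sink.write(chunk.as_bytes());
    }
    // `sink` is moved here; any later use would be rejected by the compiler.
    sink.close()
}

fn main() {
    let sink = Box::new(CountingSink { written: 0 });
    let total = write_and_close(sink, &["ab", "cde"]);
    assert_eq!(total, 5);
}
```

The trade-off is that callers must hold the writer in a `Box`, which may or may not match the ownership model the column writer needs.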

}
}

/// Serializes page header into Thrift.
Owner

nit: can we specify what this method returns? same for below.

Collaborator Author

Will do, thanks.

match page_type {
    PageType::DATA_PAGE | PageType::DATA_PAGE_V2 => {
        if self.data_page_offset.is_none() {
            self.data_page_offset = Some(start_pos);
Owner

Here we record the data_page_offset (and also dictionary_page_offset) before writing the page header, is this correct?

Collaborator Author

sadikovi, Jul 23, 2018

Yes. When we write data pages, we record the offset only for the first data page written. The same applies to the dictionary page, but we will write only one of those. Normally, the column writer would write either DICTIONARY_PAGE, DATA_PAGE, ..., DATA_PAGE or DATA_PAGE, ..., DATA_PAGE.

Owner

Got it. I misunderstood: I thought the offset should be the start of the actual data, but it should be the start of the page header. Looks good now.
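
The behaviour discussed in this thread (record the position before the page header is written, and only for the first page of each kind) can be sketched as follows. `OffsetTrackingWriter`, `Kind`, and the raw header/body byte slices are hypothetical stand-ins; the PR itself serializes Thrift page headers, which this sketch does not attempt.

```rust
use std::io::{self, Write};

/// Hypothetical page kinds, standing in for the Parquet page types.
enum Kind {
    Dictionary,
    Data,
}

/// Hypothetical writer that tracks its position and the first offsets seen.
struct OffsetTrackingWriter<W: Write> {
    sink: W,
    pos: u64,
    dictionary_page_offset: Option<u64>,
    data_page_offset: Option<u64>,
}

impl<W: Write> OffsetTrackingWriter<W> {
    fn new(sink: W) -> Self {
        OffsetTrackingWriter {
            sink,
            pos: 0,
            dictionary_page_offset: None,
            data_page_offset: None,
        }
    }

    /// Writes header then body; the recorded offset points at the header start.
    fn write_page(&mut self, kind: Kind, header: &[u8], body: &[u8]) -> io::Result<()> {
        // Position before anything for this page has been written.
        let start_pos = self.pos;
        match kind {
            Kind::Data => {
                // Only the first data page sets the offset.
                if self.data_page_offset.is_none() {
                    self.data_page_offset = Some(start_pos);
                }
            }
            Kind::Dictionary => {
                if self.dictionary_page_offset.is_none() {
                    self.dictionary_page_offset = Some(start_pos);
                }
            }
        }
        self.sink.write_all(header)?;
        self.sink.write_all(body)?;
        self.pos += (header.len() + body.len()) as u64;
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut writer = OffsetTrackingWriter::new(Vec::new());
    writer.write_page(Kind::Dictionary, &[0u8; 3], &[0u8; 5])?;
    writer.write_page(Kind::Data, &[0u8; 2], &[0u8; 4])?;
    writer.write_page(Kind::Data, &[0u8; 2], &[0u8; 4])?;
    // The dictionary page starts at 0; the first data page starts after its 8 bytes.
    assert_eq!(writer.dictionary_page_offset, Some(0));
    assert_eq!(writer.data_page_offset, Some(8));
    Ok(())
}
```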


/// Serializes column chunk into Thrift.
#[inline]
fn serialize_column_chunk(&mut self, chunk: parquet::ColumnChunk) -> Result<()> {
Owner

Just for my understanding, this will only be called once per PageWriter instance, is that right?

Collaborator Author

Yes, it will only be called when we finalise the column writer. I know it is a bit confusing; it could be part of a column writer.

@sadikovi
Collaborator Author

@sunchao can you have a look again? I addressed your comments and added more docs describing each method.

I was wondering whether the PageWriter trait needs improvement; it might be a bit difficult to reason about the API without a column writer. Let me know what you think.

Thanks!

Owner

sunchao left a comment

This LGTM. My only concern is that PageWriter overlaps a little with the column chunk writer, but we can revisit this after the latter is implemented.

@sadikovi
Collaborator Author

Yes, you are right, it does. We should be able to refactor both when working on column chunk writer.

@sunchao sunchao merged commit 3f70b0f into sunchao:master Jul 23, 2018
@sunchao
Owner

sunchao commented Jul 23, 2018

Merged. Thanks @sadikovi !

@sadikovi
Collaborator Author

Thanks @sunchao!

@sadikovi sadikovi deleted the add-page-writer branch July 23, 2018 20:58
3 participants