-
Notifications
You must be signed in to change notification settings - Fork 20
Conversation
@sunchao I implemented page writer. Let me know if I need to add more tests, I feel like the currently added tests may not cover all of the cases. Thanks! |
Pull Request Test Coverage Report for Build 560
💛 - Coveralls |
src/column/page.rs
Outdated
} | ||
|
||
/// Returns underlying page with potentially compressed buffer. | ||
pub fn get_compressed_page(&self) -> &Page { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could we change this to compressed_page
? - just to conform with other method names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do, thanks.
src/column/page.rs
Outdated
fn close(&mut self) -> Result<()>; | ||
|
||
/// Returns dictionary page offset in bytes, if set. | ||
#[inline] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to put #[inline]
here? maybe just on the actual methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do, thanks.
src/column/page.rs
Outdated
fn write_metadata(&mut self, metadata: &ColumnChunkMetaData) -> Result<()>; | ||
|
||
/// Closes resources and flushes underlying sink. | ||
fn close(&mut self) -> Result<()>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the PageWriter
is not supposed to be used after close()
is called, can we make close
consume self
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I thought about it, but this method is part of trait PageWriter
, so I cannot make it close(self)
.
} | ||
} | ||
|
||
/// Serializes page header into Thrift. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we specify what this method returns? same for below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do, thanks.
match page_type { | ||
PageType::DATA_PAGE | PageType::DATA_PAGE_V2 => { | ||
if self.data_page_offset.is_none() { | ||
self.data_page_offset = Some(start_pos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we record the data_page_offset
(and also dictionary_page_offset
) before writing the page header, is this correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, every time we write a data page, we record the offset for the first data page we write. The same applies to dictionary page, but we will write only one of those. Normally, in column writer we would write either DICTIONARY_PAGE, DATA_PAGE, ..., DATA_PAGE
or DATA_PAGE, ..., DATA_PAGE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I misunderstood - I though the offset should be the start of the actual data, but it should be the start of the page header. Looks good now.
src/file/writer.rs
Outdated
|
||
/// Serializes column chunk into Thrift. | ||
#[inline] | ||
fn serialize_column_chunk(&mut self, chunk: parquet::ColumnChunk) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my understanding, this will only be called once per PageWriter instance, is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it will only be called when we finalise column writer. I know it is a bit confusing, it could be part of a column writer.
@sunchao can you have a look again? I addressed your comments and added more docs describing each method. I was thinking if Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM. My only concern is that PageWriter
overlaps a little with the column chunk writer, but we can revisit this after the latter is implemented.
Yes, you are right, it does. We should be able to refactor both when working on column chunk writer. |
Merged. Thanks @sadikovi ! |
Thanks @sunchao! |
This PR adds:
PageWriter
interface and serialised implementation.CompressedPage
in addition toPage
. Compressed page is a wrapper that allows us to store compressed buffer + uncompressed length. Page, when created, is always assumed to have uncompressed buffer. This is not the case for compressed page - internally we store compressed data.SerializedPageReader
to takeT: Read
instead of file source. This allows me to test a page roundtrip.