Skip to content
This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

Add type-safe accessors to Row #85

Closed
andygrove opened this issue Apr 13, 2018 · 9 comments
Closed

Add type-safe accessors to Row #85

andygrove opened this issue Apr 13, 2018 · 9 comments

Comments

@andygrove
Copy link
Contributor

When processing parquet files I would like the ability to get values like this:

let a = row.get_bool().unwrap();
let a = row.get_float().unwrap();
...

Currently I have to write pattern matching each time I want to extract a value.

@andygrove
Copy link
Contributor Author

PR: #86

@sunchao
Copy link
Owner

sunchao commented Apr 14, 2018

It seems a little bit strange that we need to cast an entire row to a bool or float, IMHO it's more common to treat row as a struct type and get the ith element from the struct as a particular type. If that's the case, maybe we just need to add getXXX(index: usize) -> T .

@andygrove
Copy link
Contributor Author

I agree. I can make this change but then each accessor will now need to perform two matches - one to check that the Row is a Group and then another match for the contained Row. This is in addition to creating a Vec for each row being read. This is quite expensive.

I think a better approach would be for the row reader to return a trait for reading each row that has accessor methods that can get the value from each column without having to build the intermediate vec. I don't know how big a change that would be though.

@andygrove
Copy link
Contributor Author

I pushed changes to the PR to add this:

trait RowAccessor {
  fn get_bool(&self, i: usize) -> Result<bool>;
  fn get_byte(&self, i: usize) -> Result<i8>;
  fn get_short(&self, i: usize) -> Result<i16>;
  fn get_int(&self, i: usize) -> Result<i32>;
  fn get_long(&self, i: usize) -> Result<i64>;
  fn get_float(&self, i: usize) -> Result<f32>;
  fn get_double(&self, i: usize) -> Result<f64>;
  fn get_string(&self, i: usize) -> Result<&String>;
  fn get_bytes(&self, i: usize) -> Result<&ByteArray>;
  fn get_timestamp(&self, i: usize) -> Result<u64>;
  fn get_group(&self, i: usize) -> Result<&Vec<(String,Row)>>;
  fn get_list(&self, i: usize) -> Result<&Vec<(Row)>>;
  fn get_map(&self, i: usize) -> Result<&Vec<(Row,Row)>>;
}

I then implemented this for Row. Let me know if you like this and I'll finish adding the tests.

@andygrove
Copy link
Contributor Author

I figure with this approach, it will be possible later to have get_row_reader() return an iterator over this perhaps.

@sunchao
Copy link
Owner

sunchao commented Apr 14, 2018

I agree. I can make this change but then each accessor will now need to perform two matches - one to check that the Row is a Group and then another match for the contained Row. This is in addition to creating a Vec for each row being read. This is quite expensive.

I wonder if it makes sense to break the current Row into two parts: a top level Row type that contains a list of (fieldName, fieldValue) pairs, and another RowField type which replaces the current Row. The top level API will still remain the same. We can then implement various getXXX methods on Row, and avoid exposing the internal RowField type.

@sadikovi : would like to know your thought on this.

@sadikovi
Copy link
Collaborator

It is a good idea, plus, I think it should have been named RowField in the first place:). I like the idea of accessing by field name and by position.

The implementation might take some time though, so I would suggest we go with what @andygrove has done, since it is already really good, and would unblock him.

@sunchao
Copy link
Owner

sunchao commented Apr 16, 2018

I've opened #89 with the refactoring. Can you take a look? After that, #86 can be resolved in a better way I think.

@sunchao
Copy link
Owner

sunchao commented Apr 22, 2018

Closing this since #86 is merged.

@sunchao sunchao closed this as completed Apr 22, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants