Add support for reading columns as Apache Arrow arrays #79

Open
andygrove opened this issue Apr 6, 2018 · 14 comments

Comments

@andygrove (Contributor)

This is going to be easy. I have some ugly prototype code working already.

    let path = Path::new(&args[1]);
    let file = File::open(&path).unwrap();
    let parquet_reader = SerializedFileReader::new(file).unwrap();

    let row_group_reader = parquet_reader.get_row_group(0).unwrap();

    for i in 0..row_group_reader.num_columns() {
        match row_group_reader.get_column_reader(i) {
            Ok(ColumnReader::Int32ColumnReader(ref mut r)) => {
                // allocate an aligned buffer big enough for one batch of i32 values
                let batch_size = 1024;
                let sz = mem::size_of::<i32>();
                let p = memory::allocate_aligned((batch_size * sz) as i64).unwrap();
                let ptr_i32 = unsafe { mem::transmute::<*const u8, *mut i32>(p) };
                let mut buf = unsafe { slice::from_raw_parts_mut(ptr_i32, batch_size) };

                // read up to batch_size values from the column into the buffer
                match r.read_batch(batch_size, None, None, &mut buf) {
                    Ok((count, _)) => {
                        // hand the raw buffer to Arrow without copying
                        let arrow_buffer = Buffer::from_raw_parts(ptr_i32, count as i32);
                        let arrow_array = Array::from(arrow_buffer);

                        match arrow_array.data() {
                            ArrayData::Int32(b) => {
                                println!("len: {}", b.len());
                                println!("data: {:?}", b.iter().collect::<Vec<i32>>());
                            }
                            _ => println!("wrong type"),
                        }
                    }
                    _ => println!("error"),
                }
            }
            _ => println!("column type not supported"),
        }
    }

Now we need to come up with a real design and I probably need to add more helper methods to Arrow to make this easier.

@andygrove (Contributor, Author)

Here is a cleaned-up version:

                let mut builder: Builder<i32> = Builder::with_capacity(batch_size);
                match r.read_batch(batch_size, None, None, builder.slice_mut(0, batch_size)) {
                    Ok((count, _)) => {
                        // shrink to the number of values actually read, then freeze
                        builder.set_len(count);
                        let arrow_array = Array::from(builder.finish());
                        match arrow_array.data() {
                            ArrayData::Int32(b) => {
                                println!("len: {}", b.len());
                                println!("data: {:?}", b.iter().collect::<Vec<i32>>());
                            }
                            _ => println!("wrong type"),
                        }
                    }
                    _ => println!("error"),
                }
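
The `Builder` here owns the buffer, so no manual allocation is needed. A minimal plain-Rust sketch of the same fill-then-freeze pattern (illustrative only, not the actual arrow API) would be:

    // Illustrative only: a typed, growable buffer that a column reader can
    // fill directly and then freeze into an immutable array without copying.
    struct I32Builder {
        data: Vec<i32>,
    }

    impl I32Builder {
        fn with_capacity(capacity: usize) -> Self {
            I32Builder { data: vec![0; capacity] }
        }

        /// Mutable view for a reader to write decoded values into.
        fn slice_mut(&mut self, start: usize, len: usize) -> &mut [i32] {
            &mut self.data[start..start + len]
        }

        /// Shrink to the number of values actually read.
        fn set_len(&mut self, len: usize) {
            self.data.truncate(len);
        }

        /// Freeze into an immutable buffer.
        fn finish(self) -> Box<[i32]> {
            self.data.into_boxed_slice()
        }
    }

    fn main() {
        let mut builder = I32Builder::with_capacity(1024);
        // Pretend a column reader wrote three values into the buffer.
        builder.slice_mut(0, 3).copy_from_slice(&[1, 2, 3]);
        builder.set_len(3);
        let array = builder.finish();
        println!("len: {}, data: {:?}", array.len(), array);
    }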

@andygrove (Contributor, Author) commented Apr 7, 2018

I've made good progress integrating parquet-rs with DataFusion. I have examples like this working:

    let mut ctx = ExecutionContext::local();

    let df = ctx.load_parquet("test/data/uk_cities.parquet/part-00000-bdf0c245-d300-4b28-bfdd-0f1f9cb898c4-c000.snappy.parquet").unwrap();

    ctx.register("uk_cities", df);

    // define the SQL statement
    let sql = "SELECT lat, lng FROM uk_cities";

    // create a data frame
    let df = ctx.sql(&sql).unwrap();

    df.show(10);

It only works for int32 and f32 columns so far, though.
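
Supporting the other types is mostly a matter of handling the remaining ColumnReader variants. A sketch of the dispatch (the variant names are from the parquet crate, but this is illustrative, not the actual DataFusion code):

    use parquet::column::reader::ColumnReader;

    // Illustrative dispatch; each arm would build the corresponding Arrow
    // array the same way the int32 path does above.
    fn column_type_name(reader: &ColumnReader) -> &'static str {
        match reader {
            ColumnReader::Int32ColumnReader(_) => "int32",
            ColumnReader::Int64ColumnReader(_) => "int64",
            ColumnReader::FloatColumnReader(_) => "f32",
            ColumnReader::DoubleColumnReader(_) => "f64",
            _ => "not yet supported",
        }
    }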

@sunchao (Owner) commented Apr 7, 2018

Nice progress @andygrove! Really glad you can now read Parquet using SQL.

@sadikovi (Collaborator) commented Apr 8, 2018

Looks great! I am curious how Arrow handles/maps optional or repeated fields (when definition and/or repetition levels exist).
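
My rough mental model for the optional case: each value's definition level, compared to the column's max definition level, becomes one bit in an Arrow-style validity bitmap. A self-contained sketch of that mapping for a flat optional column (not code from either project):

    // Sketch: def_level == max_def_level means the value is present;
    // anything lower means it is null at some level of nesting.
    fn def_levels_to_validity(def_levels: &[i16], max_def_level: i16) -> Vec<u8> {
        let mut bitmap = vec![0u8; (def_levels.len() + 7) / 8];
        for (i, &level) in def_levels.iter().enumerate() {
            if level == max_def_level {
                bitmap[i / 8] |= 1 << (i % 8); // set bit: value is valid
            }
        }
        bitmap
    }

    fn main() {
        // Three rows of an optional int32 column: value, null, value.
        let validity = def_levels_to_validity(&[1, 0, 1], 1);
        println!("validity bitmap: {:08b}", validity[0]); // 00000101
    }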

@andygrove (Contributor, Author)

I have merged the current Parquet support to master in DataFusion, and I have added examples for both the DataFrame and SQL APIs.

https://github.com/datafusion-rs/datafusion-rs/tree/master/examples

This is very rough code and I won't be promoting the fact that Parquet support is there until this is a little more complete and tested.

@liurenjie1024 (Contributor)

@andygrove @sunchao Are you guys working on this? I'm working on an implementation that takes the C++ version as a reference.

@sunchao (Owner) commented Oct 10, 2018

@liurenjie1024 are you working on the arrow part or the parquet-rs part? Some more work in the arrow repo needs to be done so that parquet-rs can read into the Arrow format. I'm working (slowly) on that part.

@liurenjie1024 (Contributor) commented Oct 10, 2018 via email

@nevi-me commented Oct 31, 2018

This is unrelated, but I've seen that CSV readers are being implemented in Arrow (I think Python and C++, and there's a Go one in an open PR).

BurntSushi's rust-csv got me wondering whether it'd be possible to implement a native CSV to Arrow reader, which I think would also extend nicely to Parquet.

@sunchao @andygrove, do you know if there's been discussion of something like this? Also, would some codegen be required? It's something I'd love to try to contribute to in the coming weeks.
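
To make that concrete, here's a rough sketch of reading CSV columns into plain vectors with the csv crate; the Vecs are stand-ins for what would be Arrow array builders (illustrative only):

    use csv::ReaderBuilder;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let data = "lat,lng\n51.5,-0.1\n53.4,-2.2\n";
        let mut reader = ReaderBuilder::new().from_reader(data.as_bytes());

        // One Vec per column; a real reader would use typed Arrow builders.
        let mut lat: Vec<f32> = Vec::new();
        let mut lng: Vec<f32> = Vec::new();
        for record in reader.records() {
            let record = record?;
            lat.push(record[0].parse()?);
            lng.push(record[1].parse()?);
        }
        println!("lat: {:?}, lng: {:?}", lat, lng);
        Ok(())
    }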

@sunchao (Owner) commented Oct 31, 2018

> BurntSushi's rust-csv got me wondering whether it'd be possible to implement a native CSV to Arrow reader, which I think would also extend nicely to Parquet.

Yes, it is certainly doable, but it needs to be implemented in the arrow repo. Some pieces may still be missing, and you're welcome to work on the arrow repo :)

Also, I'm not sure how this extends to Parquet. Can you elaborate?

@nevi-me commented Oct 31, 2018

Yes, I've been following the Rust impl in Arrow. When I'm ready, I'll ask about it on the mailing list before opening a JIRA (I didn't see one).

The extension to Parquet was more conceptual than anything: if I can read a CSV into Arrow, I'd be able to get CSV -> Parquet working. I understand that CSV is a "simpler" format due to its flat nature.

Also, I haven't checked whether datafusion-rs supports CSV and Parquet as sinks, so that one would be able to write create table my_table<parquet> as select * from input<csv>.
A lot of my curiosity comes from having to watch paint dry at work while I use Spark, so I've been thinking about what I'd need to replace some of my workflow with Rust.

@sunchao (Owner) commented Oct 31, 2018

> Yes, I've been following the Rust impl in Arrow. When I'm ready, I'll ask about it on the mailing list before opening a JIRA (I didn't see one).

Yes, feel free to create a JIRA :)

> The extension to Parquet was more conceptual than anything: if I can read a CSV into Arrow, I'd be able to get CSV -> Parquet working. I understand that CSV is a "simpler" format due to its flat nature.

I see. Yes, I think we have discussed creating a CSV -> Parquet converter multiple times (cc @sadikovi). It would be convenient; we should certainly do it.

@liurenjie1024 (Contributor)

I'm working on an Arrow reader implementation and have finished the first step, converting a Parquet schema to an Arrow schema, in this PR. Please help review it.
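
The core of that step is mapping Parquet types to Arrow types. A simplified, self-contained sketch of the primitive-type mapping (not the PR's actual code):

    #[derive(Debug)]
    enum PhysicalType { Boolean, Int32, Int64, Float, Double, ByteArray }

    #[derive(Debug)]
    enum ArrowType { Boolean, Int32, Int64, Float32, Float64, Utf8 }

    fn to_arrow(t: &PhysicalType) -> ArrowType {
        match t {
            PhysicalType::Boolean => ArrowType::Boolean,
            PhysicalType::Int32 => ArrowType::Int32,
            PhysicalType::Int64 => ArrowType::Int64,
            PhysicalType::Float => ArrowType::Float32,
            PhysicalType::Double => ArrowType::Float64,
            // BYTE_ARRAY maps to Utf8 only when annotated as a UTF8 string;
            // otherwise it would map to a binary type.
            PhysicalType::ByteArray => ArrowType::Utf8,
        }
    }

    fn main() {
        println!("DOUBLE -> {:?}", to_arrow(&PhysicalType::Double));
    }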

@sunchao (Owner) commented Nov 7, 2018

The umbrella task is #186. Please watch that issue for progress on this matter.
