This repository has been archived by the owner on Dec 31, 2022. It is now read-only.

Cool! #4

Open
paleolimbot opened this issue Dec 8, 2022 · 18 comments

Comments

@paleolimbot

Just a note that this is a really cool project!

I don't know how useful it is here, but I've been working on a 'nanoarrow' R package that provides conversion to and from Arrow C data interface structures from R data frames for the full spectrum of Arrow and R types: https://github.com/apache/arrow-nanoarrow/tree/main/r . I don't know if polars supports input/output of Arrow C data interface, but if it does, nanoarrow could potentially be useful!

@sorhawell
Member

Hi Paleolimbot :)
I will take a look at nanoarrow next week.
Polars uses 'arrow2', which is a Rust implementation of Arrow. What would be a good use case, do you think? Maybe to read Arrow formats directly into a data.frame without an intermediary polars DataFrame?

@paleolimbot
Author

The use-case I imagine may help you (or may not!) would be that you'd get as.data.frame(a_polars_data_frame) and as_polars(some_data_frame) for free! It looks like arrow2 has the C Data interface implemented ( https://pola-rs.github.io/polars/arrow2/ffi/index.html ). My Rust coding is not good but I keep meaning to give it a try! I'll work up a PR to see what I can come up with.

@sorhawell
Member

sorhawell commented Dec 19, 2022

I'm very open to trying out and exposing other conversions from polars/arrow to data.frame. Having rpolars include the nanoarrow dist seems workable. py-polars also has, here and there, two different "arrow-transfer" implementations, and it is possible to switch between them.

I just started learning Rust last March, so feel free to write the Rust part of the PR in any style you like; the idioms can be sorted out later.

The current implementation (Robj -> polars_series; polars_series -> Robj) is just what I could come up with (relying heavily on extendr and rust-polars), and this implementation requires re-allocation.

extendr has enabled use of ALTREP vectors, which allow zero-copy immutable views. Another extendr contributor made a proof of concept of it. I have not gotten around to implementing any of it.

Some zero-copy view might bring a smoother user experience, such that the rpolars 'DataFrame' could become a subclass of R "data.frame", just like "data.table" is a subclass thereof.

@paleolimbot
Author

Cool cool!

I suppose you could use 'nanoarrow' the C library, although 'nanoarrow' the R package is what I had in mind.

I think the path would be something like:

At this point, you'd need to figure out how to get the ArrowArray pointer address into R. I'd have to do some reading up on Rust's ffi interface to learn how to do that (or maybe you know already). Then you can use:

# remotes::install_github("apache/arrow-nanoarrow/r")
array <- nanoarrow::nanoarrow_allocate_array()
nanoarrow::nanoarrow_pointer_move("the address of the Polars exported arrow array", array)
as.vector(array)
# or
arrow::as_arrow_array(array) # (which would be zero copy!)

The as.vector() implementation for nanoarrow's array can do zero-copy ALTREP but it's not implemented everywhere. Currently it's only implemented for character vectors since those are expensive to convert to R land.

For data frames as a whole, it looks like those are represented by something like list(chunked_arrays)? If you can figure out how to convert them into an ArrowArrayStream (maybe using this? https://pola-rs.github.io/polars/arrow2/ffi/fn.export_iterator.html ), you can do:

stream <- nanoarrow::nanoarrow_allocate_array_stream()
nanoarrow::nanoarrow_pointer_move("the address of the exported arrow array stream", stream)
as.data.frame(stream)
# or
arrow::as_record_batch_reader(stream) # (again, zero copy!)

I'm not sure I have time to get into this over the holidays but I do promise to have a look at some point! Polars is awesome, and it would be great to get the zero-copy thing working for interchange with nanoarrow/the Arrow R package/DuckDB/other Arrow-based things.

@sorhawell
Member

sorhawell commented Dec 20, 2022

py-polars has this method to export Arrow arrays. I have adapted the method to just print the raw ptrs instead in this branch (link fixed). I will provide a link to R installation binaries soon.
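
For illustration, here is a minimal Rust sketch of what handing a raw pointer across as a printed address involves. It is not the rpolars or py-polars code: `Exported`, `leak_and_format()`, `reclaim()`, and `round_trip_length()` are hypothetical names, and the struct is only a stand-in for an exported ArrowArray.

```rust
/// Hypothetical payload standing in for an exported ArrowArray struct.
struct Exported {
    length: i64,
}

/// Move the value to the heap, give up Rust ownership, and return the
/// address as a decimal string (the form a caller could print or pass on).
fn leak_and_format(value: Exported) -> String {
    let raw: *mut Exported = Box::into_raw(Box::new(value));
    (raw as usize).to_string()
}

/// Rebuild ownership from a printed address.
///
/// Safety: `addr` must come from `leak_and_format` and must be reclaimed
/// at most once, otherwise this is a use-after-free or double free.
unsafe fn reclaim(addr: &str) -> Option<Box<Exported>> {
    let raw = addr.parse::<usize>().ok()? as *mut Exported;
    Some(unsafe { Box::from_raw(raw) })
}

/// Round trip: export the address as text, then take the value back.
fn round_trip_length(n: i64) -> i64 {
    let addr = leak_and_format(Exported { length: n });
    let boxed = unsafe { reclaim(&addr) }.expect("address should parse");
    boxed.length // the Box is freed when it goes out of scope here
}

fn main() {
    println!("length after round trip: {}", round_trip_length(150));
}
```

The decimal string plays the same role as the printed address used in the R snippets in this thread; whoever rebuilds the pointer becomes responsible for freeing it exactly once.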

@sorhawell
Member

link to release

@sorhawell
Member

sorhawell commented Dec 21, 2022

something works, but something with the schemas...

library(nanoarrow)
library(rpolars)
library(xptr)


df = pl$DataFrame(iris)
arrays_copied_to_here_secure_lifetime = .pr$DataFrame$print_raw_arrow_pts(df)
arr_ptr = xptr::new_xptr("140699483402752") # write e.g. the first column's array ptr here

array <- nanoarrow::nanoarrow_allocate_array()
nanoarrow::nanoarrow_pointer_move(arr_ptr, array)
array # something worked: the array knows the vector has length 150

>
><nanoarrow_array <unknown schema>[150]>
 >$ length    : int 150
 >$ null_count: int 0
 >$ offset    : int 0
 >$ buffers   :List of 2
  >..$ :<nanoarrow_buffer_unknown[NA b] at 0x0>
  >..$ :<nanoarrow_buffer_unknown[NA b] at 0x7ff71cd62e00>
 >$ children  : NULL
 >$ dictionary: NULL
 
 
 

as.vector(array) #does not work yet, error in infer_nanoarrow_ptype(x) : 
# `schema` argument that does not inherit from 'nanoarrow_schema'


@paleolimbot
Author

I forgot a step! You have to attach the data type or else there's no way to know how long each of the buffers is supposed to be: https://github.com/apache/arrow-nanoarrow/blob/main/r/R/pkg-arrow.R#L122

@paleolimbot
Author

For lifecycle, the C data interface is supposed to be self contained (i.e., as soon as you call nanoarrow_pointer_move(), you shouldn't need to keep any other objects in scope to keep the array valid)

@sorhawell
Member

It seems the py-polars internal functions that I used clone before export, such that what pyarrow receives is owned by itself.
If I didn't return the ArrowArray in a thin wrapper, it would get dropped and the pointer would be invalid. See all this as a clumsy "hello world" attempt. If the pointers are fully usable from nanoarrow, I will try to improve it.
All the polars structs are behind something like an Arc+Cow (atomic reference counting + clone-on-write), which makes cloning close to free. I don't know if that is an Arrow feature; if so, it might be very cheap to clone the Arrow array. I guess the best way is to link the pointers to the reference counting to guarantee their lifetimes. When an R user drops the pointer object, the "destructor"/drop implementation can decrease the reference count again, and the polars array will be dropped if no other R object has references to it.
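
The reference-counting idea can be sketched in plain Rust with `std::sync::Arc`. This is only an illustration under assumptions: `Buffer`, `ExportedHandle`, and `share_and_release()` are hypothetical names, not polars or extendr types.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a column buffer (not a polars type).
struct Buffer {
    values: Vec<f64>,
}

/// Hypothetical handle that an R external pointer could own.
/// Dropping it releases exactly one reference to the shared buffer.
struct ExportedHandle {
    buffer: Arc<Buffer>,
}

/// Returns the reference counts before export, after export, and after the
/// handle is dropped, plus whether both sides saw the same allocation.
fn share_and_release() -> (usize, usize, usize, bool) {
    let owner = Arc::new(Buffer { values: vec![5.1, 4.9, 4.7] });
    let before = Arc::strong_count(&owner);

    // Cloning the Arc copies a pointer and bumps the count; the data is not copied.
    let handle = ExportedHandle { buffer: Arc::clone(&owner) };
    let during = Arc::strong_count(&owner);
    let same_memory = std::ptr::eq(owner.values.as_ptr(), handle.buffer.values.as_ptr());

    // What a finalizer on the R side would trigger: drop the handle.
    drop(handle);
    let after = Arc::strong_count(&owner);

    (before, during, after, same_memory)
}

fn main() {
    let (before, during, after, same_memory) = share_and_release();
    println!("counts: {before} -> {during} -> {after}, same memory: {same_memory}");
}
```

The buffer itself is freed only when the last `Arc` goes away, which is the lifetime guarantee described above.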

I have made a new Nanoarrow Release 3 hours ago

I can move the ptr and nanoarrow recognizes the length of the array. But I could not get as.vector() to work due to some schema thing.

@sorhawell
Member

I will try to import the schema as in your link.

@sorhawell
Member

install.packages( # or similar for Windows or Linux
  "https://github.com/rpolars/rpolars/releases/download/nanoarrow_fix_life/rpolars__x86_64-apple-darwin17.0.tgz",
 repos= NULL 
 )


library(arrow) #loading arrow first, otherwise some dependency loop-error with nanoarrow
arrow_datatype = arrow::infer_type(double())  #because first column of iris is double

library(nanoarrow)
nanoarrow_schema = nanoarrow::as_nanoarrow_schema(arrow_datatype)

library(rpolars)
library(xptr)
df = pl$DataFrame(iris)
ptrs_printed = capture.output({
  arrays_copied_to_here_secure_lifetime = .pr$DataFrame$print_raw_arrow_pts(df)
})

#auto debug print, dunno about these ptrs
arrays_copied_to_here_secure_lifetime$print()

# make a pointer to the first array of the polars DataFrame containing iris
arr_ptr = xptr::new_xptr(strsplit(ptrs_printed,":")[[2]][3]) # e.g. the first column's array ptr

#make array from ptr
array <- nanoarrow::nanoarrow_allocate_array()
nanoarrow::nanoarrow_pointer_move(arr_ptr, array)
nanoarrow:::nanoarrow_array_set_schema(array, nanoarrow_schema)


print(as.vector(array)) #tadaa
.Internal(inspect(as.vector(array)))
> @7f9ede38e400 14 REALSXP g0c7 [] (len=150, tl=0) 5.1,4.9,4.7,4.6,5,...

@sorhawell
Member

Can nanoarrow read schemas from a raw schema ptr also?

@sorhawell
Member

What are the ownership rules for nanoarrow ptrs in conjunction with R?

For polars it is the combination of immutability and reference counting, that makes rust memory management work well together with the R or python garbage collector.

  • What if rpolars yields a ptr to an object that is then dropped? If nanoarrow read from a dead address, would that be a hard crash of the R session?

I guess it would be better to pass a smart pointer which nanoarrow can take ownership of. As long as the smart pointer is not deleted, the ArrowArray is guaranteed to persist. If nanoarrow drops the smart pointer and there are no other references to the array, it will be dropped by polars as usual.

I'm quite ignorant of whether memory safety is baked into ArrowArrays themselves or is the responsibility of the implementation.

@paleolimbot
Author

Cool! I will look into the dependency loop thing, that sounds like it's my fault.

The C Data interface has pretty strict rules for lifecycle management of its structs ( https://arrow.apache.org/docs/format/CDataInterface.html#release-callback-semantics-for-producers )...I would be surprised if Rust deviated from that in any way. Basically, as soon as you get that pointer, it's up to nanoarrow to manage the lifecycle of the object. Once you do get a valid nanoarrow array, you can call the release callback manually using nanoarrow_pointer_release() or wait for the garbage collector to do it for you.
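
To make the release-callback rule concrete, here is a hedged Rust sketch. The `ArrowArray` struct mirrors the field layout given in the C Data Interface specification; everything else (`PrivateData`, `export_f64()`, `move_array()`, `demo()`) is a hypothetical illustration, not arrow2, polars, or nanoarrow code.

```rust
use std::ffi::c_void;
use std::ptr;
use std::sync::Arc;

/// Field layout of `struct ArrowArray` from the C Data Interface spec.
#[allow(dead_code)]
#[repr(C)]
struct ArrowArray {
    length: i64,
    null_count: i64,
    offset: i64,
    n_buffers: i64,
    n_children: i64,
    buffers: *mut *const c_void,
    children: *mut *mut ArrowArray,
    dictionary: *mut ArrowArray,
    release: Option<unsafe extern "C" fn(*mut ArrowArray)>,
    private_data: *mut c_void,
}

/// Hypothetical producer-side state kept alive until `release` is called.
struct PrivateData {
    _values: Arc<Vec<f64>>, // one reference to the shared data
    buffer_ptrs: [*const c_void; 2], // [validity (absent), data]
}

/// Release callback: frees producer state and marks the struct released.
unsafe extern "C" fn release_array(array: *mut ArrowArray) {
    unsafe {
        let array = &mut *array;
        drop(Box::from_raw(array.private_data as *mut PrivateData));
        array.private_data = ptr::null_mut();
        array.buffers = ptr::null_mut();
        array.release = None;
    }
}

/// Export without copying: the struct only points at the shared values.
fn export_f64(values: Arc<Vec<f64>>) -> ArrowArray {
    let length = values.len() as i64;
    let data_ptr = values.as_ptr() as *const c_void;
    let mut private = Box::new(PrivateData {
        _values: values,
        buffer_ptrs: [ptr::null(), data_ptr],
    });
    let buffers = private.buffer_ptrs.as_mut_ptr();
    ArrowArray {
        length,
        null_count: 0,
        offset: 0,
        n_buffers: 2,
        n_children: 0,
        buffers,
        children: ptr::null_mut(),
        dictionary: ptr::null_mut(),
        release: Some(release_array),
        private_data: Box::into_raw(private) as *mut c_void,
    }
}

/// Consumer-side move: bitwise copy, then mark the source as released.
fn move_array(src: &mut ArrowArray) -> ArrowArray {
    let moved = unsafe { ptr::read(src as *const ArrowArray) };
    src.release = None;
    moved
}

/// Returns (count after export, count after move, source released?, count after release).
fn demo() -> (usize, usize, bool, usize) {
    let values = Arc::new(vec![5.1, 4.9, 4.7]);
    let mut exported = export_f64(Arc::clone(&values));
    let after_export = Arc::strong_count(&values);

    let mut consumer = move_array(&mut exported);
    let after_move = Arc::strong_count(&values);
    let source_released = exported.release.is_none();

    // The consumer now owns the single exported reference and releases it.
    if let Some(release) = consumer.release {
        unsafe { release(&mut consumer) };
    }
    (after_export, after_move, source_released, Arc::strong_count(&values))
}

fn main() {
    println!("{:?}", demo());
}
```

In this sketch the move transfers responsibility for calling `release` without touching the data, and the producer's temporary struct is inert afterwards.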

If for some reason you do need to keep some other object in scope for the lifecycle of the array pointer, there's a mechanism for that, too, I just didn't expose it at the R level. For now you can do something like attr(the_array, "keep me alive") <- the_object_you_need_to_be_alive.

@sorhawell
Member

I tinkered a bit more with exporting the ArrowArray and pointers and monitoring memory usage. The exported array and ptrs are not a reallocation but point to the same arrays that polars is using. So they are just smart pointers to memory owned by Rust. The ArrowArray objects (arrays_copied_to_here_secure_lifetime) could, as you suggest, be an attribute. When the ArrowArray objects are dropped, rust-polars will update the reference counts, and that in turn can trigger R to garbage-collect the arrays if the count is zero.

It seems nanoarrow can create an array pointing directly into the polars memory.
Would it be possible for nanoarrow to not take ownership but only allow immutable read-only views?

A reallocation does occur when calling as.vector(array). I guess it could be as.altrep.vector instead for the supported types.

install.packages( # or similar for Windows or Linux
  "https://github.com/rpolars/rpolars/releases/download/nanoarrow_fix_life/rpolars__x86_64-apple-darwin17.0.tgz",
  repos= NULL 
)


library(arrow) #loading arrow first, otherwise some dependency loop-error with nanoarrow
arrow_datatype = arrow::infer_type(double())  # because the column created below is double

library(nanoarrow)
nanoarrow_schema = nanoarrow::as_nanoarrow_schema(arrow_datatype)

library(rpolars)
library(xptr)
df = pl$DataFrame(rep(1,1E8)) # allocate an 800 MB Series

# exposing as ArrowArray and array ptrs did not increase memory usage, suggesting
# that no reallocation was made
ptrs_printed = capture.output({ 
  arrays_copied_to_here_secure_lifetime = .pr$DataFrame$print_raw_arrow_pts(df)
})

#auto debug print, dunno about these ptrs
arrays_copied_to_here_secure_lifetime$print()

# make a pointer to the first array of the polars DataFrame
arr_ptr = xptr::new_xptr(strsplit(ptrs_printed,":")[[2]][3]) # e.g. the first column's array ptr

#make array from ptr
array <- nanoarrow::nanoarrow_allocate_array()
nanoarrow::nanoarrow_pointer_move(arr_ptr, array)
nanoarrow:::nanoarrow_array_set_schema(array, nanoarrow_schema)


vec = as.vector(array) #a new allocation happens here

@sorhawell
Member

The interface between rpolars and nanoarrow could, as you suggested, also be at the R level. rpolars can have functions to export to a nanoarrow-readable format, and a user could use the nanoarrow package thereafter.

Inspired by data.table, I avoid letting rpolars depend on any R packages at installation time and during run time. rpolars could suggest nanoarrow in order to test that the bridge functionality works. It would be the responsibility of the user to also install a compatible version of nanoarrow alongside rpolars.

@paleolimbot
Author

Again, this is so cool and I'm excited to play with this a little bit. I'm mostly away-from-keyboard during the holidays but should have time this week to properly read the code for this branch.

> It seems nanoarrow can create an array pointing directly into the polars memory.
> Would it be possible for nanoarrow to not take ownership but only allow immutable read-only views?

I think this is functionally what is supposed to happen although the terminology might be slightly different because it's C. You won't be able to avoid pointers into Polars memory for the array data itself (or else you may as well copy the data, which is what we're trying to avoid here I think). R provides a number of mechanisms to make sure this data is not mutated (although because it's C, it's possible to ignore all of them for those who choose to ignore the safeguards).

For the ArrowArray wrapper around it, I believe the Rust-allocated structure is supposed to be temporary (and to no longer serve any purpose after nanoarrow_pointer_move() is called). Maybe the best way to put it is that nanoarrow doesn't own any Polars memory; it just owns one reference to it. This week I will get a dev version of rpolars set up and see if I can verify/properly explain what's going on here.

> Inspired by data.table, I avoid letting rpolars depend on any R packages at installation time and during run time. rpolars could suggest nanoarrow in order to test that the bridge functionality works. It would be the responsibility of the user to also install a compatible version of nanoarrow alongside rpolars.

I wrote nanoarrow to make this easy to do (hopefully)...there are some S3 generics you can implement (as_nanoarrow_array(), as_nanoarrow_array_stream(), as_nanoarrow_schema(), and infer_nanoarrow_schema()) with a Suggests dependency. In that case you'll have to implement the R vector <-> Polars conversions yourself (maybe this isn't that hard). I'm sure you already know that nanoarrow is zero-dependency and is written entirely in C to minimize build time.


2 participants