Skip to content
This repository has been archived by the owner on May 29, 2024. It is now read-only.

arrow support #10

Closed
wants to merge 6 commits into from
Closed

arrow support #10

wants to merge 6 commits into from

Conversation

jfb-h
Copy link
Contributor

@jfb-h jfb-h commented Dec 1, 2023

Hi @tecosaur, I gave the arrow support a shot and here's what I got.

I tested the loader on a simple test dataset and it seems to work for read(dataset("test"), Any) and read(dataset("test"), DataFrame). I also implemented the writer but did not test it yet because I didn't know how, maybe you could take a look at that?

Closes tecosaur/DataToolkit.jl#30

@tecosaur
Copy link
Owner

tecosaur commented Dec 4, 2023

At a glance, this looks good @jfb-h! Thanks for giving this a shot.

Let me know how you found the process of figuring this out.

Before merging this, I do have a few comments:

  1. You have a bunch of lines like convert::Bool = @getparam loader."convert"::Bool true where there convert::Bool type assertion isn't doing anything. The assertion made by @getparam is known to the compiler, and that macro exists because when the type assertion doesn't hold it produces a more informative/helpful error.
  2. Arrow.Table should probably be listed in supported_types
  3. In the docstring you say "By default this driver supports parsing to three data types" but then only list two types
  4. The docstring doesn't mention the write arguments (if you don't want to mention them all, you can just say we "mirror the keyword arguments that Arrow.write accepts" or similar.

You can test writing to a dataset with write(dataset("name"), value).

Hope that helps 🙂

@jfb-h
Copy link
Contributor Author

jfb-h commented Dec 4, 2023

Thanks!

  1. I had just copied what it looked like for the delim loader but I have now removed the type assertions.
  2. done
  3. done
  4. I have added a comment to the docstring. However, in the specification I didn't add all arguments yet, because I wasn't sure if/how reading more complex arguments from the loader spec are supported (e.g. a dict with metadata for colmetadata).

I tried to test the writer with write(dataset("name"), value) but get the following error:

julia> write(dataset("test"), df)
ERROR: TransformerError: There are no writers for "test" that can work with DataFrame
Stacktrace:
 [1] write(dataset::DataSet, info::DataFrame)
   @ DataToolkitBase ~/.julia/packages/DataToolkitBase/JUjFu/src/interaction/externals.jl:387
 [2] top-level scope
   @ REPL[10]:1

I tried to add a writer spec by including

    [[test.writer]]
    driver = "arrow"

in the Data.toml, but that didn't do the trick. The docs/tutorials don't seem to contain any info regarding writer specification, so maybe that's another thing that could be useful to add to the docs.

I'd be happy to contribute a short writeup of my view of the involved steps once this is done, which maybe could inform a short tutorial for the docs.

@tecosaur
Copy link
Owner

tecosaur commented Dec 4, 2023

Thanks for the changes :)

  1. Ah, delim was written pre-@getparam and then lazily updated, so it could probably do with cleaning up
  2. Thanks!
  3. Thanks!
  4. Hmm, I'll take a second look at this, and maybe try testing on my machine.

I'd be happy to contribute a short writeup of my view of the involved steps once this is done, which maybe could inform a short tutorial for the docs.

❤️

@tecosaur
Copy link
Owner

Sorry for the delay, getting back to this now :)

I'm looking into the writing, thing I've found a small error in Base. I should have a fix in the next day or few.

tecosaur added a commit that referenced this pull request Jan 23, 2024
@tecosaur
Copy link
Owner

Well, turns out it was easier than I thought: the blocking bug is fixed in tecosaur/DataToolkitCore.jl@c05ed5c and I've merged this PR with some modifications in 5e544cf 🙂

@tecosaur
Copy link
Owner

Perhaps it's a bit late after the fact, but if you have any thoughts on what documentation I could add to make it easier for other people to contribute/write their own data loaders, that would be great 😃

@jfb-h
Copy link
Contributor Author

jfb-h commented Jan 25, 2024

I'd be happy to! I'll try to write something up this or next week 🙂

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request: Arrow loader
2 participants