Add Apache Arrow dataset support #36
Conversation
So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the
Still a bit of a WIP, need to do the following:
@yongtang, I have the tests passing locally but ran into a couple of issues you might already be aware of.

First, because of boost I had to use an updated version of bazel, 0.17.2, to build. The version in the docker image is 0.15.0 and I'm not sure if there is a way to specify a different one. I ended up manually installing it to get things running.

Second, right now my test has a dependency on pyarrow, and one of the datasets defined here also uses it. If we don't want to make it a hard dependency, it could be optional for the functionality that requires it and when running the tests. What do you think?

Third, I upgraded the Arrow build to use version 0.10.0. I hope that doesn't interfere with the parquet reader. We might also want to think about using 0.11.1, which is the current latest, or even 0.12.0, which is due out in early Jan.

Fourth, I had to bring over the Flatbuffer build files from tensorflow. I'm fairly new to Bazel, so I did what I could to get things working, but please let me know if anything can be improved :)
Opened an issue in tensorflow/tensorflow#24523 to capture the bazel version issue.
Force-pushed 28ea481 to 36972c7.
CLAs look good, thanks!
WORKSPACE
@@ -76,11 +79,11 @@ http_archive(
 http_archive(
     name = "arrow",
     urls = [
-        "https://mirror.bazel.build/github.com/apache/arrow/archive/apache-arrow-0.9.0.tar.gz",
-        "https://github.com/apache/arrow/archive/apache-arrow-0.9.0.tar.gz",
+        "https://mirror.bazel.build/github.com/apache/arrow/archive/apache-arrow-0.10.0.tar.gz",
@yongtang I'm going to try and update this to use Arrow version 0.11.1. I think it shouldn't affect anything in the Parquet Dataset, but I'm not totally sure, so I will verify first. What are your thoughts on this?
Platform-specific socket implementation has been done following the client in the Ignite Dataset, but currently only Unix is enabled in the build until we can build/test on Windows. cc @dmitrievanthony
@BryanCutler Actually flatbuffers already supports Bazel builds. I played with flatbuffers and have the following two commits based on PR google/flatbuffers#5061: 5084922...yongtang:218644bd0fb1de06410932c1858b90a7ea5480bd

If you pick up the above two commits and apply them to your PR, I think the build will be successful. (You may also need to rebase with master and strip the first two commits in your PR.) I tried locally with the two commits + your PR, and the test works:
Actually, with some slight tweaking, it is much simpler to just include flatbuffers directly from bazel by using the most recent master branch. The following two commits will be much easier to include:

I also created a PR in google/flatbuffers#5104 to fix some of the issues, but the above two commits should be all we need to build arrow support in tensorflow-io.
Cool, thanks @yongtang, I'll give it a shot!
Force-pushed 1a0beb8 to 759f169.
Force-pushed 759f169 to d09bba7.
Ok, I believe I have the Arrow build worked out to use Flatbuffers directly from Bazel. I'll work on finishing up the remaining todos this week.
Force-pushed 4e60378 to 3839a7a.
CLAs look good, thanks!
@BryanCutler Some of the Travis CI failures are related to the verbose level of bazel. I think passing:
to bazel in
Ok, thanks, I'll try that out.
@yongtang, the last update was able to pass tests in Travis for Python 3.5, yay! But the other versions failed because the log length was exceeded. I removed the
Removing WIP. I was hoping to get the boolean type working, but that will require a bit more work and I can address it as a follow-up. I think this is ready for review now @yongtang, if you could please take a look, thanks!
Also cc @dmitrievanthony and @terrytangyuan if you are able to review, that would be great. Thanks!
curr_array_values_ < 0 ? TensorShape({})
                       : TensorShape({curr_array_values_}));

auto values = array.data()->buffers[1];
Sorry, I'm not so familiar with Arrow, but why is buffers[1] here?
No prob, it's not clear in this context. For primitive types, the first buffer is a validity bitmap to indicate NULL values, and the second buffer is the data values.

There is a check to make sure the NULL count is zero here: https://github.com/tensorflow/io/pull/36/files#diff-42f74bbc07801dbac60e26f2d9fd6f70R44, so we don't care about that first buffer (for now at least).

I will make this a static const int VALUE_BUFFER = 1 and add a note to make it more clear.
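For reference, a minimal pyarrow sketch (not part of the PR) showing the buffer layout being described here, where the first buffer is the validity bitmap and the second holds the raw values:

```python
import pyarrow as pa

# Build a primitive Arrow array with no nulls.
arr = pa.array([1, 2, 3], type=pa.int32())

buffers = arr.buffers()
# buffers[0] is the validity bitmap (it may be None when there are no nulls);
# buffers[1] holds the raw little-endian int32 values.
print(arr.null_count)            # 0, so the validity bitmap can be ignored
print(buffers[1].to_pybytes())   # b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'
```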
Good work! This is great. Added some comments. In addition, though I could look at how it works in the tests, it might be better to add a README with some small examples for each dataset.
arrow::Status ArrowStreamClient::Read(int64_t nbytes,
                                      int64_t* bytes_read,
                                      void* out) {
  // TODO: look into why 0 bytes are requested
nit: 0 byte is requested
I think it is correct to use zero as a plural and say "0 bytes"
def is_float(dtype):
  return dtype == dtypes.float16 or \
         dtype == dtypes.float32 or \
         dtype == dtypes.float64
this can be simplified with dtype in [dtypes.float16, dtypes.float32, dtypes.float64]
yes, that's much better!
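A small sketch of the suggested simplification; the dtypes import path is an assumption about what the test module already uses:

```python
from tensorflow.python.framework import dtypes


def is_float(dtype):
  # Membership test instead of chained `or` comparisons.
  return dtype in (dtypes.float16, dtypes.float32, dtypes.float64)


assert is_float(dtypes.float32)
assert not is_float(dtypes.int32)
```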
for i, col in enumerate(dataset.columns):
  if case_data.output_shapes[col].ndims == 0:
    if is_float(case_data.output_types[col]):
      self.assertAlmostEqual(value[i], case_data.data[col][row], 4)
Can the 4 here be inferred from the data itself instead of being hard-coded?
I put 4 because it's really comparing 1 == ~1, 2 == ~2, etc. from the test data. So from the current test data, we don't require it to be that precise. I can try the default value of 7 digits, but I think it might be too much trouble to try to infer a value and make this function completely generic.
It seems 7 decimal places is too much and causes a failure. I think it would be possible to figure out how many places to check like you mentioned, but I don't know if we really need to be that clever here. What do you think?
Don't worry about this then. Thanks for the efforts!
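For reference, a tiny standalone sketch (made-up numbers, not the PR's test data) of how the places argument behaves in this discussion:

```python
import unittest


class PrecisionExample(unittest.TestCase):
  def test_places(self):
    # Rounds the difference to 4 decimal places: this passes.
    self.assertAlmostEqual(1.0, 1.00004, 4)
    # The default of 7 places would flag the same difference.
    with self.assertRaises(AssertionError):
      self.assertAlmostEqual(1.0, 1.00004, 7)


if __name__ == "__main__":
  unittest.main()
```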
elif case_data.output_shapes[col].ndims == 1:
  if is_float(case_data.output_types[col]):
    for j, v in enumerate(value[i]):
      self.assertAlmostEqual(v, case_data.data[col][row][j], 4)
Same here. Could move this into a separate variable inferred from data
f = tempfile.NamedTemporaryFile(delete=False)
write_feather(df, f)
f.close()
Use a context manager here instead?
Yup, that should work
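A possible shape for the context-manager version, sketched with a placeholder DataFrame rather than the PR's actual test data:

```python
import tempfile

import pandas as pd
from pyarrow.feather import write_feather

df = pd.DataFrame({"a": [1.0, 2.0, 3.0]})

# NamedTemporaryFile as a context manager; delete=False keeps the file
# around so the dataset can read it back by name afterwards.
with tempfile.NamedTemporaryFile(delete=False) as f:
  write_feather(df, f)
filename = f.name
```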
filenames, dtype=dtypes.string, name="filenames")

def _as_variant_tensor(self):
  return arrow_ops.\
Consider removing the \ here and adding a newline after the opening parenthesis instead. Similar in other places.
will do
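A generic sketch of the style point, using a toy expression rather than the actual arrow_ops call:

```python
# With a backslash continuation (the form being discouraged):
value = 1 + 2 + \
        3

# Breaking the line inside parentheses instead (the suggested form):
value = (1 + 2 +
         3)

assert value == 6
```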
Thanks for the review @terrytangyuan! I'll work on an update.
Force-pushed 6cbe7b8 to df105fd.
I'll work on adding some examples to the README.
Force-pushed df105fd to d83619c.
… batches. Define 3 ops to read record batches: 1) from memory, 2) from Feather files, 3) from an input stream/socket
Force-pushed d83619c to b5f9c4a.
Added usage to README and squashed some commits.
Force-pushed b5f9c4a to 5f531e4.
Looks great to me! Would it be better to put the examples in a README under arrow's folder and link to it from the main README? That way the main README stays more concise.
Thanks @terrytangyuan! Yeah, I agree about the README. I'll try adding a link under the Arrow entry in the data source list.
Thanks!
Thanks for all the help with this @terrytangyuan, @yongtang and @yupbank!
Thanks @BryanCutler for the PR 🎉. There is some additional follow-up work that might be needed.

One is the upcoming 1.13 release of TensorFlow itself. The tf.data API had some changes, so the implementations based on 1.12 do not work anymore. A PR to make tensorflow-io work with 1.13 is in #56. I tried to fix the arrow issues in tf 1.13 but was unsuccessful. Maybe you could take a look at some point. (Note this is not high priority, as we will still build against tf 1.12 for now. TF 1.13 likely will only be released in a month or two.)

Another is the R binding of ArrowDataset. It would be nice to have the R binding available like the other ops.
Sure, I'll be glad to help out with building against 1.13 and the upcoming 2.0, and also the R bindings.
Hey... I just ran into it. Is there any reason we didn't consider them?
@yupbank I have looked into it a little. Plasma is kind of a sub-project of Arrow and I haven't been involved much, but I think there is a good possibility these two efforts could work together in the future.
Just curious... why not merge this excellent Bazel support into Arrow's codebase itself?
@blais that's a good idea, I'll run it past the Arrow community and see if there is interest.
BTW, I've updated it for 1.0.1 here: Also, there's a patch needed: somehow #include <snappy.h> needs to become #include "snappy.h" in my build (not sure exactly if it's my setup or what-not). Using bazel 3.4.1.
That's great @blais, would you be able to open a PR so we can upgrade to 1.0.1 here?
This PR adds support for Apache Arrow datasets to interface with a variety of sources that produce in-memory data in Arrow format. Included in this change is a Dataset base layer that will create a TensorFlow Dataset with an iterator over Arrow record batches to produce Tensor values for each column. This base layer is then extended to implement three Dataset Ops that consume Arrow record batches: 1) from memory, 2) from Feather files, and 3) from an input stream/socket.
The design of the Arrow dataset base layer was done to be flexible enough to allow for more Arrow Ops in the future from other sources or language bindings.
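As a rough illustration of how the three datasets might be used, the module path, class names, and constructor arguments below are reconstructions and assumptions, not necessarily the exact API in this PR:

```python
import tensorflow as tf
from tensorflow_io.arrow import (ArrowDataset,
                                 ArrowFeatherDataset,
                                 ArrowStreamDataset)

# 1) From memory, e.g. a pandas DataFrame converted to Arrow record batches.
# dataset = ArrowDataset.from_pandas(df)

# 2) From one or more Feather files on disk.
# dataset = ArrowFeatherDataset(["data.feather"], columns=(0, 1),
#                               output_types=(tf.int32, tf.float32))

# 3) From a host serving Arrow record batches over a stream/socket.
# dataset = ArrowStreamDataset("127.0.0.1:8999", columns=(0, 1),
#                              output_types=(tf.int32, tf.float32))

# Each variant yields one Tensor per selected column for each row, e.g.:
# iterator = dataset.make_one_shot_iterator()
# next_element = iterator.get_next()
```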
This fixes #13