
Add Apache Arrow dataset support #36

Merged
merged 4 commits into tensorflow:master from arrow-dataset-13 on Jan 18, 2019

Conversation

@BryanCutler
Member
commented Dec 21, 2018

This PR adds support for Apache Arrow datasets to interface with a variety of sources that produce in-memory data in Arrow format. Included in this change is a Dataset base layer that will create a TensorFlow Dataset with an iterator over Arrow record batches to produce Tensor values for each column. This base layer is then extended to implement three Dataset Ops that consume Arrow record batches:

  1. From in-memory Python data / Pandas DataFrames
  2. From Arrow Feather files
  3. From an input stream, using a socket client to connect to a server streaming Arrow record batches

The design of the Arrow dataset base layer was done to be flexible enough to allow for more Arrow Ops in the future from other sources or language bindings.

This fixes #13
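A minimal usage sketch for the in-memory case (class and method names taken from the tensorflow_io.arrow module this PR adds; the TF 1.x session/iterator pattern is assumed):

import tensorflow as tf
from tensorflow_io.arrow import ArrowDataset

# Assume `df` is an existing Pandas DataFrame; from_pandas infers the
# output types and shapes from the DataFrame columns.
dataset = ArrowDataset.from_pandas(df)

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
  while True:
    try:
      print(sess.run(next_element))
    except tf.errors.OutOfRangeError:
      break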

@googlebot

So there's good news and bad news.

👍 The good news is that everyone who needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

@BryanCutler
Member Author
commented Dec 21, 2018

Still a bit of a WIP; I need to do the following:

  • Rebase to pick up the build files from Support Apache Parquet format #21
  • Discuss which version of Arrow to use; currently updated to use 0.10.0, will try 0.11.1
    Using version 0.11.1 now
  • Decide what to do about the pyarrow dependency: should it be optional?
    Now it has been made optional
  • Add socket support for Windows. Currently added, but need to enable Windows compilation to test
  • Fix boolean type
    Will address as a follow-up since it will require some additional work
  • Improve test coverage
  • Fix up docs and check style

@BryanCutler
Member Author

@yongtang , I have the tests passing locally but ran into a couple issues you might already be aware of. First, because of boost I had to use an updated version of bazel, 0.17.2, to build it. The version in the docker image is 0.15.0 and I'm not sure if there is a way to specify a different one. I ended up manually installing it to get things running.

Second, right now my test has a dependency on pyarrow, and one of the datasets defined here also uses it. If we don't want to make it a hard dependency here, it could be made optional for the functionality that requires it and when running the tests. What do you think?

Third, I upgraded the Arrow build to use version 0.10.0. I hope that doesn't interfere with the parquet reader. We might also want to think about using 0.11.1, which is the current latest, or even 0.12.0, which is due out in early Jan.

Fourth, I had to bring over the Flatbuffer build files from tensorflow. I'm fairly new to Bazel, so I did what I could to get things working, but please let me know if anything can be improved :)

@yongtang
Member

Opened an issue in tensorflow/tensorflow#24523 to capture the bazel 0.15.0 -> 0.20.0 upgrade for tensorflow/tensorflow:custom-op

@googlebot

CLAs look good, thanks!

WORKSPACE Outdated
@@ -76,11 +79,11 @@ http_archive(
 http_archive(
     name = "arrow",
     urls = [
-        "https://mirror.bazel.build/github.com/apache/arrow/archive/apache-arrow-0.9.0.tar.gz",
-        "https://github.com/apache/arrow/archive/apache-arrow-0.9.0.tar.gz",
+        "https://mirror.bazel.build/github.com/apache/arrow/archive/apache-arrow-0.10.0.tar.gz",
Member Author

@yongtang I'm going to try and update this to use Arrow version 0.11.1. I think it shouldn't affect anything in the Parquet Dataset, but I'm not totally sure, so I will verify first. What are your thoughts on this?

@BryanCutler
Member Author
commented Jan 3, 2019

The platform-specific socket implementation follows the client in the Ignite Dataset, but currently only the Unix build is enabled until we can build and test on Windows. cc @dmitrievanthony

@yongtang
Member
commented Jan 5, 2019

@BryanCutler Actually, flatbuffers already supports Bazel builds, and flatbuffer_cc_library is already supported natively:
google/flatbuffers#5061

The flatbuffer_cc_library support is not in the flatbuffers 1.10.0 release, though.

I played with flatbuffers, and have the following two commits based on PR google/flatbuffers#5061:

5084922...yongtang:218644bd0fb1de06410932c1858b90a7ea5480bd

If you pick up the above two commits and apply them to your PR, I think the build will be successful. (You may also need to rebase with master and strip the first two commits in your PR.)

I tried locally with the two commits + your PR, and the test works:

bazel test --cache_test_results=no -s --verbose_failures //tensorflow_io/arrow:all
...
INFO: Elapsed time: 8.417s, Critical Path: 5.42s
INFO: 3 processes: 3 local.
INFO: Build completed successfully, 4 total actions
//tensorflow_io/arrow:arrow_py_test                                      PASSED in 1.0s
  WARNING: //tensorflow_io/arrow:arrow_py_test: Test execution time (1.0s excluding execution overhead) outside of range for MODERATE tests. Consider setting timeout="short" or size="small".

@yongtang
Member
commented Jan 5, 2019

Actually, with some slight tweaking, it is much simpler to just include flatbuffers directly from bazel by using the most recent master branch.

The following two commits will be much easier to include:
5084922...yongtang:d77bd65

I also created a PR in google/flatbuffers#5104 to fix some of the issues, but the above two commits should be all we need to build arrow support in tensorflow-io.

@BryanCutler
Member Author

Actually, with some slight tweaking, it is much simpler to just include flatbuffers directly from bazel by using the most recent master branch.

Cool, thanks @yongtang , I'll give it a shot!

@googlebot

So there's good news and bad news.

👍 The good news is that everyone who needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

@BryanCutler
Member Author

Ok, I believe I have the Arrow build worked out to use Flatbuffers directly from Bazel. I'll work on finishing up the remaining todos this week.

@googlebot

CLAs look good, thanks!

@yongtang
Member
commented Jan 9, 2019

@BryanCutler Some of the Travis CI failures are related to the verbose level of bazel. I think passing:

bazel test --noshow_progress --noshow_loading_progress 

to bazel in .travis.yml will address it.

@BryanCutler
Member Author

@BryanCutler Some of the Travis CI failures are related to the verbose level of bazel

Ok, thanks, I'll try that out.

@BryanCutler
Member Author

@yongtang, the last update was able to pass tests in Travis for Python 3.5, yay! But the other versions failed because the log length was exceeded. I removed the -s option to see if that helps; I think it was producing a lot of output for each source file during compilation, which became the bulk of the log.

@BryanCutler
Member Author

Seems like that did the trick, all passed! @yongtang, are you ok with removing the -s option from the bazel test command added in 463bbb6?

@BryanCutler BryanCutler changed the title [WIP] Add Apache Arrow dataset support Add Apache Arrow dataset support Jan 10, 2019
@BryanCutler
Member Author

Removing WIP. I was hoping to get the boolean type working, but that will require a bit more work, so I will address it as a follow-up. I think this is ready for review now, @yongtang; if you could please take a look, thanks!

@BryanCutler
Member Author

Also cc @dmitrievanthony and @terrytangyuan if you are able to review, that would be great. Thanks!

curr_array_values_ < 0 ? TensorShape({})
: TensorShape({curr_array_values_}));

auto values = array.data()->buffers[1];
Member

Sorry, I'm not so familiar with arrow, but why is buffers[1] here?

Member Author

No prob, it's not clear in this context. For primitive types, the first buffer is a validity bitmap to indicate NULL values, and the second buffer is the data values.

There is a check to make sure NULL count is zero here https://github.com/tensorflow/io/pull/36/files#diff-42f74bbc07801dbac60e26f2d9fd6f70R44, so we don't care about that first buffer (for now at least)

I will make this a static const int VALUE_BUFFER = 1 and add a note to make it more clear
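For reference, a sketch of how that could look (surrounding code assumed from the diff context above):

// For Arrow primitive arrays, buffers[0] is the validity bitmap marking
// NULL slots and buffers[1] holds the actual data values.
static const int VALUE_BUFFER = 1;
auto values = array.data()->buffers[VALUE_BUFFER];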

Member

@terrytangyuan left a comment

Good work! This is great. Added some comments. In addition, though I could look at how it works in the tests, it might be better to add a README with some small examples for each dataset.

arrow::Status ArrowStreamClient::Read(int64_t nbytes,
                                      int64_t* bytes_read,
                                      void* out) {
  // TODO: look into why 0 bytes are requested
Member

nit: 0 byte is requested

Member Author

I think it is correct to use zero as a plural and say "0 bytes"

def is_float(dtype):
  return dtype == dtypes.float16 or \
         dtype == dtypes.float32 or \
         dtype == dtypes.float64
Member

This can be simplified with dtype in [dtypes.float16, dtypes.float32, dtypes.float64].

Member Author

yes, that's much better!
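For reference, the simplified helper would read:

def is_float(dtype):
  return dtype in [dtypes.float16, dtypes.float32, dtypes.float64]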

for i, col in enumerate(dataset.columns):
  if case_data.output_shapes[col].ndims == 0:
    if is_float(case_data.output_types[col]):
      self.assertAlmostEqual(value[i], case_data.data[col][row], 4)
Member

@terrytangyuan Jan 12, 2019

Can the 4 here be inferred from the data itself instead of being hard-coded?

Member Author

I put 4 because it's really comparing 1 == ~1, 2 == ~2, etc. from the test data, so with the current test data we don't require it to be that precise. I can try the default value of 7 digits, but I think it might be too much trouble to try to infer a value and make this function completely generic.

Member Author

It seems 7 decimal places is too much and causes a failure. I think it would be possible to figure out how many places to check like you mentioned, but I don't know if we really need to be that clever here. What do you think?

Member

Don't worry about this then. Thanks for the efforts!

elif case_data.output_shapes[col].ndims == 1:
  if is_float(case_data.output_types[col]):
    for j, v in enumerate(value[i]):
      self.assertAlmostEqual(v, case_data.data[col][row][j], 4)
Member

Same here. Could move this into a separate variable inferred from data


f = tempfile.NamedTemporaryFile(delete=False)
write_feather(df, f)
f.close()
Member

@terrytangyuan Jan 12, 2019

Use a context manager here instead?

Member Author

Yup, that should work
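i.e., a sketch of the suggested change:

with tempfile.NamedTemporaryFile(delete=False) as f:
  write_feather(df, f)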

      filenames, dtype=dtypes.string, name="filenames")

  def _as_variant_tensor(self):
    return arrow_ops.\
Member

Consider removing the \ here and adding a newline after the opening parenthesis instead. Similar in other places.

Member Author

will do

@BryanCutler
Member Author

Thanks for the review, @terrytangyuan! I'll work on an update.

@BryanCutler
Member Author

I'll work on adding some examples to the README

… batches. Define 3 ops to read record batches: 1) from memory, 2) from Feather files, 3) from an input stream/socket
@BryanCutler
Member Author

Added usage to README and squashed some commits
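For reference, the Feather reader usage looks roughly like this (a sketch; the constructor arguments are assumed from the ArrowFeatherDataset op added in this PR):

import tensorflow as tf
from tensorflow_io.arrow import ArrowFeatherDataset

# Assume 'example.feather' was written from a DataFrame with an int32
# and a float32 column, e.g. via pyarrow's write_feather.
dataset = ArrowFeatherDataset(
    ['example.feather'],
    columns=(0, 1),
    output_types=(tf.int32, tf.float32))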

Member

@terrytangyuan left a comment

Looks great to me! Though, would it be better to put the examples in a README under arrow's folder and link to it from the main README? That way the main README stays more concise.

@BryanCutler
Member Author

Looks great to me! Though, would it be better to put the examples in a README under arrow's folder and link to it from the main README? That way the main README stays more concise.

Thanks @terrytangyuan! Yeah, I agree about the README. I'll try adding a link under the Arrow entry in the data source list.

@terrytangyuan terrytangyuan merged commit e5dc0eb into tensorflow:master Jan 18, 2019
@terrytangyuan
Member

Thanks!

@BryanCutler
Member Author

Thanks for all the help with this @terrytangyuan @yongtang and @yupbank !

@yongtang
Member

Thanks @BryanCutler for the PR 🎉. There is some additional follow-up work that might be needed.

One is the upcoming 1.13 release of TensorFlow itself. tf.data had some API changes, so the implementations based on 1.12 no longer work. A PR to make tensorflow-io work with 1.13 is in #56. I tried to fix the arrow issues in tf 1.13 but was unsuccessful; maybe you could take a look at some point. (Note this is not high priority, as we will still build against tf 1.12 for now; TF 1.13 will likely not be released for another month or two.)

Another is the R binding of ArrowDataset. It would be nice to have the R binding available like the other ops.

@BryanCutler
Member Author

Sure, I'll be glad to help out with building against 1.13 and upcoming 2.0, and also the R bindings.

@yupbank
Member
commented Jan 21, 2019

Hey... I just ran into pyarrow.plasma, and it seems they have a Python API and TensorFlow ops (plasma.tf_plasma_op) to convert tensors to and from an arrow.RecordBatch.

Is there any reason we didn't consider them?

@BryanCutler
Member Author

@yupbank I have looked into pyarrow.plasma, which is an object store, and there are some differences. The plasma-op is designed to transfer tensors to/from the object store, and it does not work with arrow.RecordBatches or columnar data, only individual tensors.

Plasma is kind of a sub-project of Arrow and I haven't been involved much, but I think there is a good possibility these two efforts could work together in the future.

@blais
commented Sep 23, 2020

Just curious... why not merge this excellent Bazel support in Arrow's codebase itself?

@BryanCutler
Member Author

@blais that's a good idea, I'll run it past the Arrow community and see if there is interest.

@BryanCutler BryanCutler deleted the arrow-dataset-13 branch September 24, 2020 23:53
@blais
commented Sep 25, 2020

BTW, I've updated it for 1.0.1 here:
https://github.com/beancount/beancount/blob/master/third_party/proto/arrow.BUILD

Also, there's a patch needed: somehow #include <snappy.h> needs to become #include "snappy.h" in my build (not sure exactly if it's my setup or what-not). I'm using bazel 3.4.1.

@BryanCutler
Member Author

BTW, I've updated it for 1.0.1

That's great @blais , would you be able to open a PR so we can upgrade to 1.0.1 here?
