
Add Apache Arrow dataset support #36

Merged
merged 4 commits into tensorflow:master from arrow-dataset-13 on Jan 18, 2019

Conversation

@BryanCutler
Member
commented Dec 21, 2018

This PR adds support for Apache Arrow datasets to interface with a variety of sources that produce in-memory data in Arrow format. Included in this change is a Dataset base layer that will create a TensorFlow Dataset with an iterator over Arrow record batches to produce Tensor values for each column. This base layer is then extended to implement three Dataset Ops that consume Arrow record batches:

  1. From in-memory Python data / Pandas DataFrames
  2. From Arrow Feather files
  3. From an input stream, using a socket client to connect to a server streaming Arrow record batches

The design of the Arrow dataset base layer was done to be flexible enough to allow for more Arrow Ops in the future from other sources or language bindings.

This fixes #13
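A minimal usage sketch for the in-memory case (class and method names taken from the tensorflow_io.arrow module this PR adds; the TF 1.x session/iterator pattern is assumed):

import tensorflow as tf
from tensorflow_io.arrow import ArrowDataset

# Assume `df` is an existing Pandas DataFrame; from_pandas infers the
# output types and shapes from the DataFrame columns.
dataset = ArrowDataset.from_pandas(df)

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
  while True:
    try:
      print(sess.run(next_element))
    except tf.errors.OutOfRangeError:
      break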

@googlebot

So there's good news and bad news.

👍 The good news is that everyone who needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

@BryanCutler
Member Author
commented Dec 21, 2018

Still a bit of a WIP; I need to do the following:

  • Rebase to pick up the build files from Support Apache Parquet format #21
  • Discuss which version of Arrow to use; currently updated to use 0.10.0, will try 0.11.1
    Using version 0.11.1 now
  • Decide what to do about the pyarrow dependency: should it be optional?
    Now it has been made optional
  • Add socket support for Windows. Currently added, but need to enable Windows compilation to test
  • Fix boolean type
    Will address as a follow-up since it will require some additional work
  • Improve test coverage
  • Fix up docs and check style

@BryanCutler
Member Author

@yongtang , I have the tests passing locally but ran into a couple issues you might already be aware of. First, because of boost I had to use an updated version of bazel, 0.17.2, to build it. The version in the docker image is 0.15.0 and I'm not sure if there is a way to specify a different one. I ended up manually installing it to get things running.

Second, right now my test has a dependency on pyarrow, and one of the datasets defined here also uses it. If we don't want to make it a hard dependency here, it could be made optional for the functionality that requires it and when running the tests. What do you think?

Third, I upgraded the Arrow build to use version 0.10.0. I hope that doesn't interfere with the parquet reader. We might also want to think about using 0.11.1, which is the current latest, or even 0.12.0, which is due out in early Jan.

Fourth, I had to bring over the Flatbuffer build files from tensorflow. I'm fairly new to Bazel, so I did what I could to get things working, but please let me know if anything can be improved :)

@yongtang
Member

Opened an issue in tensorflow/tensorflow#24523 to capture the bazel 0.15.0 -> 0.20.0 upgrade for tensorflow/tensorflow:custom-op

@googlebot

CLAs look good, thanks!

WORKSPACE Outdated
@@ -76,11 +79,11 @@ http_archive(
 http_archive(
     name = "arrow",
     urls = [
-        "https://mirror.bazel.build/github.com/apache/arrow/archive/apache-arrow-0.9.0.tar.gz",
-        "https://github.com/apache/arrow/archive/apache-arrow-0.9.0.tar.gz",
+        "https://mirror.bazel.build/github.com/apache/arrow/archive/apache-arrow-0.10.0.tar.gz",
Member Author

@yongtang I'm going to try and update this to use Arrow version 0.11.1. I think it shouldn't affect anything in the Parquet Dataset, but I'm not totally sure, so I will verify first. What are your thoughts on this?

@BryanCutler
Member Author
commented Jan 3, 2019

The platform-specific socket implementation follows the client in the Ignite Dataset, but currently only the Unix build is enabled until we can build and test on Windows. cc @dmitrievanthony

@yongtang
Member
commented Jan 5, 2019

@BryanCutler Actually, flatbuffers already supports Bazel builds, and flatbuffer_cc_library is already supported natively:
google/flatbuffers#5061

The flatbuffer_cc_library support is not in the flatbuffers 1.10.0 release, though.

I played with flatbuffers, and have the following two commits based on PR google/flatbuffers#5061:

5084922...yongtang:218644bd0fb1de06410932c1858b90a7ea5480bd

If you pick up the above two commits and apply them to your PR, I think the build will be successful. (You may also need to rebase with master and strip the first two commits in your PR.)

I tried locally with the two commits + your PR, and the test works:

bazel test --cache_test_results=no -s --verbose_failures //tensorflow_io/arrow:all
...
INFO: Elapsed time: 8.417s, Critical Path: 5.42s
INFO: 3 processes: 3 local.
INFO: Build completed successfully, 4 total actions
//tensorflow_io/arrow:arrow_py_test                                      PASSED in 1.0s
  WARNING: //tensorflow_io/arrow:arrow_py_test: Test execution time (1.0s excluding execution overhead) outside of range for MODERATE tests. Consider setting timeout="short" or size="small".

@yongtang
Member
commented Jan 5, 2019

Actually, with some slight tweaking, it is much simpler to just include flatbuffers directly from bazel by using the most recent master branch.

The following two commits will be much easier to include:
5084922...yongtang:d77bd65

I also created a PR in google/flatbuffers#5104 to fix some of the issues, but the above two commits should be all we need to build arrow support in tensorflow-io.

@BryanCutler
Member Author

Actually, with some slight tweaking, it is much simpler to just include flatbuffers directly from bazel by using the most recent master branch.

Cool, thanks @yongtang , I'll give it a shot!

@googlebot

So there's good news and bad news.

👍 The good news is that everyone who needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

@BryanCutler
Member Author

Ok, I believe I have the Arrow build worked out to use Flatbuffers directly from Bazel. I'll work on finishing up the remaining todos this week.

@googlebot

CLAs look good, thanks!

@yongtang
Member
commented Jan 9, 2019

@BryanCutler Some of the Travis CI failures are related to the verbose level of bazel. I think passing:

bazel test --noshow_progress --noshow_loading_progress 

to bazel in .travis.yml will address it.

@BryanCutler
Member Author

@BryanCutler Some of the Travis CI failures are related to the verbose level of bazel

Ok, thanks, I'll try that out.

@BryanCutler
Member Author

@yongtang, the last update was able to pass tests in Travis for Python 3.5, yay! But the other versions failed because the log length was exceeded. I removed the -s option to see if that helps; I think it was producing a lot of output for each source file during compilation, which became the bulk of the log.

@BryanCutler
Member Author

Seems like that did the trick, all passed! @yongtang, are you ok with removing the -s option from the bazel test command added in 463bbb6?

@BryanCutler BryanCutler changed the title [WIP] Add Apache Arrow dataset support Add Apache Arrow dataset support Jan 10, 2019
@BryanCutler
Member Author

Removing WIP. I was hoping to get the boolean type working, but that will require a bit more work, so I will address it as a follow-up. I think this is ready for review now, @yongtang; if you could please take a look, thanks!

@BryanCutler
Member Author

Also cc @dmitrievanthony and @terrytangyuan if you are able to review, that would be great. Thanks!

curr_array_values_ < 0 ? TensorShape({})
: TensorShape({curr_array_values_}));

auto values = array.data()->buffers[1];
Member

Sorry, I'm not so familiar with arrow, but why is buffers[1] here?

Member Author

No prob, it's not clear in this context. For primitive types, the first buffer is a validity bitmap to indicate NULL values, and the second buffer is the data values.

There is a check to make sure NULL count is zero here https://github.com/tensorflow/io/pull/36/files#diff-42f74bbc07801dbac60e26f2d9fd6f70R44, so we don't care about that first buffer (for now at least)

I will make this a static const int VALUE_BUFFER = 1 and add a note to make it more clear
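For reference, a sketch of how that could look (surrounding code assumed from the diff context above):

// For Arrow primitive arrays, buffers[0] is the validity bitmap marking
// NULL slots and buffers[1] holds the actual data values.
static const int VALUE_BUFFER = 1;
auto values = array.data()->buffers[VALUE_BUFFER];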

Member

@terrytangyuan left a comment

Good work! This is great. Added some comments. In addition, though I could look at how it works in the tests, it might be better to add a README with some small examples for each dataset.

arrow::Status ArrowStreamClient::Read(int64_t nbytes,
                                      int64_t* bytes_read,
                                      void* out) {
  // TODO: look into why 0 bytes are requested
Member

nit: 0 byte is requested

Member Author

I think it is correct to use zero as a plural and say "0 bytes"

def is_float(dtype):
  return dtype == dtypes.float16 or \
         dtype == dtypes.float32 or \
         dtype == dtypes.float64
Member

This can be simplified with dtype in [dtypes.float16, dtypes.float32, dtypes.float64].

Member Author

yes, that's much better!
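For reference, the simplified helper would read:

def is_float(dtype):
  return dtype in [dtypes.float16, dtypes.float32, dtypes.float64]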

for i, col in enumerate(dataset.columns):
  if case_data.output_shapes[col].ndims == 0:
    if is_float(case_data.output_types[col]):
      self.assertAlmostEqual(value[i], case_data.data[col][row], 4)
Member

@terrytangyuan Jan 12, 2019

Can the 4 here be inferred from the data itself instead of being hard-coded?

Member Author

I put 4 because it's really comparing 1 == ~1, 2 == ~2, etc. from the test data, so with the current test data we don't require it to be that precise. I can try the default value of 7 digits, but I think it might be too much trouble to try to infer a value and make this function completely generic.

Member Author

It seems 7 decimal places is too much and causes a failure. I think it would be possible to figure out how many places to check like you mentioned, but I don't know if we really need to be that clever here. What do you think?

Member

Don't worry about this then. Thanks for the efforts!

elif case_data.output_shapes[col].ndims == 1:
  if is_float(case_data.output_types[col]):
    for j, v in enumerate(value[i]):
      self.assertAlmostEqual(v, case_data.data[col][row][j], 4)
Member

Same here. Could move this into a separate variable inferred from data


f = tempfile.NamedTemporaryFile(delete=False)
write_feather(df, f)
f.close()
Member

@terrytangyuan Jan 12, 2019

Use a context manager here instead?

Member Author

Yup, that should work
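i.e., a sketch of the suggested change:

with tempfile.NamedTemporaryFile(delete=False) as f:
  write_feather(df, f)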

      filenames, dtype=dtypes.string, name="filenames")

  def _as_variant_tensor(self):
    return arrow_ops.\
Member

Consider removing the \ here and adding a newline after the opening parenthesis instead. Similar in other places.

Member Author

will do

@BryanCutler
Member Author

Thanks for the review, @terrytangyuan! I'll work on an update.

@BryanCutler
Member Author

I'll work on adding some examples to the README

… batches. Define 3 ops to read record batches: 1) from memory, 2) from Feather files, 3) from an input stream/socket
@BryanCutler
Member Author

Added usage to README and squashed some commits
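For reference, the Feather reader usage looks roughly like this (a sketch; the constructor arguments are assumed from the ArrowFeatherDataset op added in this PR):

import tensorflow as tf
from tensorflow_io.arrow import ArrowFeatherDataset

# Assume 'example.feather' was written from a DataFrame with an int32
# and a float32 column, e.g. via pyarrow's write_feather.
dataset = ArrowFeatherDataset(
    ['example.feather'],
    columns=(0, 1),
    output_types=(tf.int32, tf.float32))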

Member

@terrytangyuan left a comment

Looks great to me! Though, would it be better to put the examples in a README under arrow's folder and link to it from the main README? That way the main README stays more concise.

@BryanCutler
Member Author

Looks great to me! Though, would it be better to put the examples in a README under arrow's folder and link to it from the main README? That way the main README stays more concise.

Thanks @terrytangyuan! Yeah, I agree about the README. I'll try adding a link under the Arrow entry in the data source list.

@terrytangyuan terrytangyuan merged commit e5dc0eb into tensorflow:master Jan 18, 2019
@terrytangyuan
Member

Thanks!

@BryanCutler
Member Author

Thanks for all the help with this @terrytangyuan @yongtang and @yupbank !

@yongtang
Member

Thanks @BryanCutler for the PR 🎉. There is some additional follow-up work that might be needed.

One is the upcoming 1.13 release of TensorFlow itself. tf.data had some API changes, so the implementations based on 1.12 no longer work. A PR to make tensorflow-io work with 1.13 is in #56. I tried to fix the arrow issues in tf 1.13 but was unsuccessful; maybe you could take a look at some point. (Note this is not high priority, as we will still build against tf 1.12 for now; TF 1.13 will likely not be released for another month or two.)

Another is the R binding of ArrowDataset. It would be nice to have the R binding available like the other ops.

@BryanCutler
Member Author

Sure, I'll be glad to help out with building against 1.13 and upcoming 2.0, and also the R bindings.

@yupbank
Member
commented Jan 21, 2019

Hey... I just ran into pyarrow.plasma, and it seems they have a Python API and TensorFlow ops (plasma.tf_plasma_op) to convert tensors to and from an arrow.RecordBatch.

Is there any reason we didn't consider them?

@BryanCutler
Member Author

@yupbank I have looked into pyarrow.plasma, which is an object store, and there are some differences. The plasma-op is designed to transfer tensors to/from the object store, and it does not work with arrow.RecordBatches or columnar data, only individual tensors.

Plasma is kind of a sub-project of Arrow and I haven't been involved much, but I think there is a good possibility these two efforts could work together in the future.

@blais
commented Sep 23, 2020

Just curious... why not merge this excellent Bazel support in Arrow's codebase itself?

@BryanCutler
Member Author

@blais that's a good idea, I'll run it past the Arrow community and see if there is interest.

@BryanCutler BryanCutler deleted the arrow-dataset-13 branch September 24, 2020 23:53
@blais
commented Sep 25, 2020

BTW, I've updated it for 1.0.1 here:
https://github.com/beancount/beancount/blob/master/third_party/proto/arrow.BUILD

Also, there's a patch needed: somehow #include <snappy.h> needs to become #include "snappy.h" in my build (not sure exactly if it's my setup or what-not). I'm using bazel 3.4.1.

@BryanCutler
Member Author

BTW, I've updated it for 1.0.1

That's great @blais , would you be able to open a PR so we can upgrade to 1.0.1 here?
