
feat: support Bigtable dataset #1578

Merged (11 commits, Jan 27, 2022)

Conversation

@dopiera (Contributor) commented Dec 3, 2021

This PR adds the ability to read tensors from Bigtable.

The design is heavily inspired by the BigQuery implementation and shares some code with https://github.com/Unoperate/pytorch-cbt.

It supports both sequential (sorted) reads and parallel reads (in no particular row order).
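The parallel read relies on Bigtable's SampleRowKeys call, which returns boundary keys that cut the table into roughly equal row ranges. A minimal Python sketch of that splitting idea (illustrative only; the PR's actual implementation lives in the dataset kernels, and the helper names here are invented):

```python
# Illustrative sketch of splitting a Bigtable table into row ranges
# for parallel workers, based on sampled boundary keys. Consecutive
# pairs of boundaries form contiguous half-open ranges that workers
# can read independently, in no particular overall row order.

def ranges_from_samples(sample_keys):
    """Turn sampled boundary keys into (start, end) half-open ranges."""
    boundaries = [b""] + list(sample_keys) + [None]  # None = end of table
    return [
        (boundaries[i], boundaries[i + 1])
        for i in range(len(boundaries) - 1)
    ]

def assign_round_robin(ranges, num_workers):
    """Distribute row ranges across workers round-robin."""
    buckets = [[] for _ in range(num_workers)]
    for i, rng in enumerate(ranges):
        buckets[i % num_workers].append(rng)
    return buckets

samples = [b"row-250", b"row-500", b"row-750"]
ranges = ranges_from_samples(samples)
workers = assign_round_robin(ranges, 2)
```

Each worker then issues its own ReadRows calls over its assigned ranges, which is why parallel reads cannot guarantee a sorted result.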


@yongtang (Member) commented Dec 6, 2021

@dopiera Thanks for the PR! I think tensorflow_io/bigtable.py may need to be moved to a subdirectory, as we don't want to expose a top-level module. Can you also take a look at the test failures?

@kboroszko (Contributor)

@yongtang I can be the point of contact for the sake of this PR. Thank you for your comments. We placed tensorflow_io/bigtable.py in the top-level folder following what was already done for bigquery; all it does is expose classes from tensorflow_io/python/ops/bigtable. Where would you like me to move it?
As for the tests, the failing step, "Setup 3.9 macOS", does not run any new code. It failed because Apache Pulsar didn't start, so it looks like a flaky test.

@yongtang (Member) commented Dec 9, 2021

Thanks @kboroszko, bigquery was kept at the top level mostly for backward-compatibility reasons (so as not to break existing users). This actually caused issues in docs generation, and special handling had to be used (see https://github.com/tensorflow/io/blob/master/tensorflow_io/python/api/__init__.py#L28-L32).

We do want to eventually group bigquery with other google cloud related APIs, and have it placed in the same way as other standard APIs (through https://github.com/tensorflow/io/blob/master/tensorflow_io/python/api/__init__.py#L22-L26).

For bigtable, since it is also part of Google Cloud, I think tfio.gcloud.XXX might be better (or tfio.google.XXX if you prefer)?

(We can alias the APIs of tfio.bigquery.XXX to tfio.gcloud.XXX (or tfio.google.XXX) so that bigquery and bigtable will show up within the same group.)

In order to expose tfio.gcloud, you can:

  1. Create an io/tensorflow_io/python/api/gcloud.py in a similar way to io/tensorflow_io/python/api/audio.py, and import the symbols you want to expose.
  2. Add a line from tensorflow_io.python.api import gcloud in https://github.com/tensorflow/io/blob/master/tensorflow_io/python/api/__init__.py#L22

That will make sure the docs generation follows the same pattern and the API can show up correctly on tensorflow.org.

@kboroszko (Contributor) commented Dec 9, 2021

@yongtang I would like to avoid putting everything in a gcloud module in the same namespace, as it can quickly become very confusing. For instance, both bigtable and bigquery have tables, rows, etc., so we could end up having several "Table" classes in the gcloud module.

To avoid the confusion, I could place everything in tensorflow_io.gcloud and then into separate submodules. However, then you would have to prepend everything with tfio.gcloud.bigtable.XXX, which is pretty long.

Also, since the gcloud module would itself be imported into tensorflow_io by tensorflow_io/__init__.py, you cannot alias imports, because Python doesn't see the gcloud module as part of the tensorflow_io package. You cannot import it in any sensible way:

import tensorflow_io.gcloud.bigtable as bt
or
from tensorflow_io.gcloud.bigtable import BigtableClient

It's pretty painful to explicitly write the whole path every time.

Another argument for keeping them separate is that they don't share code and are independent technologies, so apart from being created by Google, I don't see a reason to keep them together.

How about just naming it bigtable?

@yongtang (Member) commented Dec 9, 2021

For instance both bigtable and bigquery have tables, rows, etc. so we could end up having several "Table" classes in the gcloud module.

Thanks for the explanation @kboroszko. This is a fair point.

The reason we are gradually phasing out top-level namespaces is that, in the past, the number of top-level namespaces exploded and we had to scale back.

Assuming we are not going to see many more services like bigquery and bigtable in the future, it should be fine to place API as part of tfio.bigtable.

In that case, we still want tfio.bigtable to be exposed in the standard way:

  1. Create an io/tensorflow_io/python/api/bigtable.py in a similar way to io/tensorflow_io/python/api/audio.py, and import the symbols you want to expose.
  2. Add a line from tensorflow_io.python.api import bigtable in https://github.com/tensorflow/io/blob/master/tensorflow_io/python/api/__init__.py#L22

Can you take a look and see if the above will resolve the issues?
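For reference, step 1 could look roughly like the following sketch. The source module path and class name are assumptions for illustration; the real pattern to follow is audio.py:

```python
# io/tensorflow_io/python/api/bigtable.py (hypothetical sketch).
# Like audio.py, this file only re-exports the public symbols so the
# docs generator picks them up as tfio.bigtable.*. The import path and
# symbol name below are assumed, not taken from the actual PR.
from tensorflow_io.python.ops.bigtable.bigtable_dataset_ops import (
    BigtableClient,
)
```

Step 2 then just makes the api package import this module so it is registered alongside the other standard APIs.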

@kboroszko (Contributor)

@yongtang I can see the build failed on the "Bazel Windows" step. It looks like it has been failing for many other builds from different PRs, and even from master recently, when it tries to upload artifacts. Any idea what might be the cause?

@yongtang (Member)

@kboroszko I think the issue is likely caused by the Windows artifact being too large to upload to GitHub Actions' storage. I have created PR #1582 to remove unneeded files from the artifact before upload.

@kboroszko (Contributor)

@yongtang That's great, thank you for the heads-up. As soon as it's merged, I will update this PR so it includes the fix.

@yongtang (Member)

@kboroszko As PR #1582 has been merged, can you update the PR? I believe the build will pass after that.

Commits pushed:

  - Implements reading from bigtable in a synchronous manner.
  - In this PR we make the read methods accept a row_set, reading only the rows specified by the user.
  - We also add a parallel read that leverages the sample_row_keys method to split work among workers.
  - This PR adds support for Bigtable version filters.
  - moved bigtable to tensorflow_io.python.api
@kboroszko (Contributor)

@yongtang It's done. Fingers crossed! 🤞

@yongtang (Member)

It looks like all the tests hang. I don't know if this is a GitHub CI issue or a code issue. I will re-trigger the tests to give them another run.

@yongtang (Member)

@kboroszko Looks like the tests just hang (and time out after 6 hours). I think this might be related to the code change. Can you take a look?

@kboroszko (Contributor)

Hi @yongtang ,

All current builds are failing due to the freetype repository being unavailable and its mirrors not working.

Apart from that, I have added pytest-timeout, and it looks like it's tests/test_io_layer.py::test_io_layer[text] that hangs.
I failed to replicate this error when I ran the tests locally.
I did manage to replicate it in GitHub CI in our forked repository; however, I did not find a cause for that test to hang.

Do you have any idea what might be the cause?

@yongtang (Member)

@dopiera If this is the only one, we can disable the test for now. Can you add @pytest.mark.skip(reason="TODO") to the test so that it is skipped?
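The suggested marker would look like this (the test body here is a placeholder, not the real test_io_layer code):

```python
import pytest

# Hypothetical stand-in for tests/test_io_layer.py::test_io_layer[text].
# @pytest.mark.skip makes pytest report the test as skipped instead of
# executing (and hanging on) its body.
@pytest.mark.skip(reason="TODO")
def test_io_layer_text():
    raise RuntimeError("never runs while the skip marker is present")
```

pytest records the marker on the function, so the skip (and its reason) shows up in the test report.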

Commits pushed:

  - changed path to bigtable emulator and cbt in tests
  - moved arguments' initializations to the body of the function in bigtable_ops.py
  - fixed interleaveFromRange of column filters when using only one column
@kboroszko (Contributor)

@yongtang It turned out to be a problem with xdist using fork that was causing our tests to hang. It was quite hard to find, but it's fixed now.

Python initializes default arguments at the start of the program, and when xdist forks the process in order to run the tests in parallel, the whole thing hangs. We fixed it by initializing the arguments to their default values in the body of the function, so they are created after the fork. The tests pass in CI now.
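A small self-contained illustration of the underlying Python behavior (function names invented for the demo): default argument values are evaluated once, when `def` executes at import time, so any object built there predates a later fork; a None sentinel defers construction to call time.

```python
CREATED = []

def make_resource():
    """Stand-in for an object (e.g. a filter or client) that is unsafe
    to create before a fork."""
    CREATED.append("created")
    return object()

# Problematic: the default is built immediately, when `def` executes
# (i.e. at import time), and the single instance is shared by all calls.
def read_rows_bad(resource=make_resource()):
    return resource

# The fix, as sketched from the commit message (not the literal
# bigtable_ops.py code): construct the default inside the body, so it
# is created per call, after any fork.
def read_rows_good(resource=None):
    if resource is None:
        resource = make_resource()
    return resource
```

With the sentinel form, every call after a fork builds a fresh object in the child process instead of reusing state captured before the fork.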

@yongtang (Member)

@kboroszko Looks like the Linux tests pass, though there might be some issues on macOS. Can you disable the macOS tests with @pytest.mark.skipif for now, so the CI build will pass?
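Unlike a bare skip, @pytest.mark.skipif only skips when its condition holds, so the same test keeps running on the Linux workers (the test name and reason below are placeholders):

```python
import sys

import pytest

# Skips only when running on macOS (sys.platform == "darwin");
# the test still executes on Linux and Windows CI workers.
@pytest.mark.skipif(sys.platform == "darwin", reason="hangs on macOS CI")
def test_bigtable_placeholder():
    assert True
```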

@kboroszko (Contributor)

@yongtang I have disabled the tests on macOS.

@yongtang (Member) left a review comment

LGTM, thanks for the great effort to make it work!

@kboroszko (Contributor)

@yongtang thank you too! I would merge it, but I get a message:

Only those with write access to this repository can merge pull requests.

So I guess you'll have to do it.

@yongtang merged commit bfa2e89 into tensorflow:master on Jan 27, 2022
@pierreoberholzer
@yongtang Does this feature make #1284 redundant?

@yongtang (Member)

@pierreoberholzer Thanks for the reminder! The issue has been closed now.
