
feat: support Bigtable dataset #1578

Merged (11 commits, Jan 27, 2022)

Conversation

@dopiera (Contributor) commented Dec 3, 2021

This PR adds the ability to read tensors from Bigtable.

The design is heavily inspired by the BigQuery implementation and shares some code with https://github.com/Unoperate/pytorch-cbt.

It supports both sequential (sorted) reads and parallel reads (in no particular row order).
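The parallel read relies on Bigtable's SampleRowKeys call, which returns boundary keys that cut the table into roughly equal row ranges. A minimal Python sketch of that splitting idea (illustrative only; the PR's actual implementation lives in the dataset kernels, and the helper names here are invented):

```python
# Illustrative sketch of splitting a Bigtable table into row ranges
# for parallel workers, based on sampled boundary keys. Consecutive
# pairs of boundaries form contiguous half-open ranges that workers
# can read independently, in no particular overall row order.

def ranges_from_samples(sample_keys):
    """Turn sampled boundary keys into (start, end) half-open ranges."""
    boundaries = [b""] + list(sample_keys) + [None]  # None = end of table
    return [
        (boundaries[i], boundaries[i + 1])
        for i in range(len(boundaries) - 1)
    ]

def assign_round_robin(ranges, num_workers):
    """Distribute row ranges across workers round-robin."""
    buckets = [[] for _ in range(num_workers)]
    for i, rng in enumerate(ranges):
        buckets[i % num_workers].append(rng)
    return buckets

samples = [b"row-250", b"row-500", b"row-750"]
ranges = ranges_from_samples(samples)
workers = assign_round_robin(ranges, 2)
```

Each worker then issues its own ReadRows calls over its assigned ranges, which is why parallel reads cannot guarantee a sorted result.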


@yongtang (Member) commented Dec 6, 2021

@dopiera Thanks for the PR! I think tensorflow_io/bigtable.py may need to be moved to a subdirectory, as we don't want to expose a top-level module. Can you also take a look at the test failures?

@kboroszko (Contributor)

@yongtang I can be the point of contact for the sake of this PR. Thank you for your comments. We placed tensorflow_io/bigtable.py in the top-level folder following what was already done for bigquery; all it does is expose classes from tensorflow_io/python/ops/bigtable. Where would you like me to move it?
As for the tests, the failing step, "Setup 3.9 macOS", does not run any new code. It failed because Apache Pulsar didn't start, so it looks like a flaky test.

@yongtang (Member) commented Dec 9, 2021

Thanks @kboroszko, bigquery was kept at the top level mostly for backward-compatibility reasons (so as not to break existing users). This actually caused issues in docs generation, and special handling had to be used (see https://github.com/tensorflow/io/blob/master/tensorflow_io/python/api/__init__.py#L28-L32).

We do want to eventually group bigquery with other google cloud related APIs, and have it placed in the same way as other standard APIs (through https://github.com/tensorflow/io/blob/master/tensorflow_io/python/api/__init__.py#L22-L26).

For bigtable, since it is also part of Google Cloud, I think tfio.gcloud.XXX might be better (or tfio.google.XXX if you prefer)?

(We can alias the APIs of tfio.bigquery.XXX to tfio.gcloud.XXX (or tfio.google.XXX) so that bigquery and bigtable will show up within the same group.)

In order to expose tfio.gcloud, you can:

  1. Create an io/tensorflow_io/python/api/gcloud.py in a similar way to io/tensorflow_io/python/api/audio.py, and import the symbols you want to expose.
  2. Add a line from tensorflow_io.python.api import gcloud in https://github.com/tensorflow/io/blob/master/tensorflow_io/python/api/__init__.py#L22

That will make sure the docs generation follows the same pattern and the API can show up correctly on tensorflow.org.

@kboroszko (Contributor) commented Dec 9, 2021

@yongtang I would like to avoid putting everything in a gcloud module in the same namespace, as it can quickly become very confusing. For instance, both bigtable and bigquery have tables, rows, etc., so we could end up having several "Table" classes in the gcloud module.

To avoid the confusion, I could place everything in tensorflow_io.gcloud and then into separate submodules. However, then you would have to prepend everything with tfio.gcloud.bigtable.XXX, which is pretty long.

Also, since the gcloud module would itself be imported into tensorflow_io by tensorflow_io/__init__.py, you cannot alias imports, because Python doesn't see the gcloud module as part of the tensorflow_io package. You cannot import it in any sensible way:

import tensorflow_io.gcloud.bigtable as bt
or
from tensorflow_io.gcloud.bigtable import BigtableClient

It's pretty painful to explicitly write the whole path every time.

Another argument for keeping them separate is that they don't share code and are independent technologies, so apart from being created by Google, I don't see a reason to keep them together.

How about just naming it bigtable?

@yongtang (Member) commented Dec 9, 2021

For instance both bigtable and bigquery have tables, rows, etc. so we could end up having several "Table" classes in the gcloud module.

Thanks for the explanation @kboroszko. This is a fair point.

The reason we are gradually phasing out top-level namespaces is that, in the past, the number of top-level namespaces exploded and we had to scale back.

Assuming we are not going to see many more services like bigquery and bigtable in the future, it should be fine to place API as part of tfio.bigtable.

In that case, we still want tfio.bigtable to be exposed in the standard way:

  1. Create an io/tensorflow_io/python/api/bigtable.py in a similar way to io/tensorflow_io/python/api/audio.py, and import the symbols you want to expose.
  2. Add a line from tensorflow_io.python.api import bigtable in https://github.com/tensorflow/io/blob/master/tensorflow_io/python/api/__init__.py#L22

Can you take a look and see if the above will resolve the issues?
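For reference, step 1 could look roughly like the following sketch. The source module path and class name are assumptions for illustration; the real pattern to follow is audio.py:

```python
# io/tensorflow_io/python/api/bigtable.py (hypothetical sketch).
# Like audio.py, this file only re-exports the public symbols so the
# docs generator picks them up as tfio.bigtable.*. The import path and
# symbol name below are assumed, not taken from the actual PR.
from tensorflow_io.python.ops.bigtable.bigtable_dataset_ops import (
    BigtableClient,
)
```

Step 2 then just makes the api package import this module so it is registered alongside the other standard APIs.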

@kboroszko (Contributor)

@yongtang I can see the build failed on the "Bazel Windows" step. It looks like it has been failing for many other builds from different PRs, and even from master recently, when it tries to upload artifacts. Any idea what might be the cause?

@yongtang (Member)

@kboroszko I think the issue is likely caused by the Windows artifact being too large to upload to GitHub Actions' storage. I have created PR #1582 to remove unneeded files from the artifact before upload.

@kboroszko (Contributor)

@yongtang That's great, thank you for the heads-up. As soon as it's merged, I will update this PR so it includes the fix.

@yongtang (Member)

@kboroszko As PR #1582 has been merged, can you update the PR? I believe the build will pass after that.

Commits pushed:

  - Implements reading from bigtable in a synchronous manner.
  - In this PR we make the read methods accept a row_set, reading only the rows specified by the user.
  - We also add a parallel read that leverages the sample_row_keys method to split work among workers.
  - This PR adds support for Bigtable version filters.
  - moved bigtable to tensorflow_io.python.api
@kboroszko (Contributor)

@yongtang It's done. Fingers crossed! 🤞

@yongtang (Member)

It looks like all the tests hang. I don't know if this is a GitHub CI issue or a code issue. I will re-trigger the tests to give them another run.

@yongtang (Member)

@kboroszko Looks like the tests just hang (and time out after 6 hours). I think this might be related to the code change. Can you take a look?

@kboroszko (Contributor)

Hi @yongtang ,

All current builds are failing due to the freetype repository being unavailable and its mirrors not working.

Apart from that, I have added pytest-timeout, and it looks like it's tests/test_io_layer.py::test_io_layer[text] that hangs.
I failed to replicate this error when I ran the tests locally.
I did manage to replicate it in GitHub CI in our forked repository; however, I did not find a cause for that test to hang.

Do you have any idea what might be the cause?

@yongtang (Member)

@dopiera If this is the only one, we can disable the test for now. Can you add @pytest.mark.skip(reason="TODO") to the test so that it is skipped?
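The suggested marker would look like this (the test body here is a placeholder, not the real test_io_layer code):

```python
import pytest

# Hypothetical stand-in for tests/test_io_layer.py::test_io_layer[text].
# @pytest.mark.skip makes pytest report the test as skipped instead of
# executing (and hanging on) its body.
@pytest.mark.skip(reason="TODO")
def test_io_layer_text():
    raise RuntimeError("never runs while the skip marker is present")
```

pytest records the marker on the function, so the skip (and its reason) shows up in the test report.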

Commits pushed:

  - changed path to bigtable emulator and cbt in tests
  - moved arguments' initializations to the body of the function in bigtable_ops.py
  - fixed interleaveFromRange of column filters when using only one column
@kboroszko (Contributor)

@yongtang It turned out to be a problem with xdist using fork that was causing our tests to hang. It was quite hard to find, but it's fixed now.

Python initializes default arguments at the start of the program, and when xdist forks the process in order to run the tests in parallel, the whole thing hangs. We fixed it by initializing the arguments to their default values in the body of the function, so they are created after the fork. The tests pass in CI now.
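A small self-contained illustration of the underlying Python behavior (function names invented for the demo): default argument values are evaluated once, when `def` executes at import time, so any object built there predates a later fork; a None sentinel defers construction to call time.

```python
CREATED = []

def make_resource():
    """Stand-in for an object (e.g. a filter or client) that is unsafe
    to create before a fork."""
    CREATED.append("created")
    return object()

# Problematic: the default is built immediately, when `def` executes
# (i.e. at import time), and the single instance is shared by all calls.
def read_rows_bad(resource=make_resource()):
    return resource

# The fix, as sketched from the commit message (not the literal
# bigtable_ops.py code): construct the default inside the body, so it
# is created per call, after any fork.
def read_rows_good(resource=None):
    if resource is None:
        resource = make_resource()
    return resource
```

With the sentinel form, every call after a fork builds a fresh object in the child process instead of reusing state captured before the fork.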

@yongtang (Member)

@kboroszko Looks like the Linux tests pass, though there might be some issues on macOS. Can you disable the macOS tests with @pytest.mark.skipif for now, so the CI build will pass?
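Unlike a bare skip, @pytest.mark.skipif only skips when its condition holds, so the same test keeps running on the Linux workers (the test name and reason below are placeholders):

```python
import sys

import pytest

# Skips only when running on macOS (sys.platform == "darwin");
# the test still executes on Linux and Windows CI workers.
@pytest.mark.skipif(sys.platform == "darwin", reason="hangs on macOS CI")
def test_bigtable_placeholder():
    assert True
```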

@kboroszko (Contributor)

@yongtang I have disabled the tests on macOS.

@yongtang (Member) left a review comment

LGTM, thanks for the great effort to make it work!

@kboroszko (Contributor)

@yongtang thank you too! I would merge it, but I get a message:

Only those with write access to this repository can merge pull requests.

So I guess you'll have to do it.

@yongtang merged commit bfa2e89 into tensorflow:master on Jan 27, 2022
@pierreoberholzer
@yongtang Does this feature make #1284 redundant?

@yongtang (Member)

@pierreoberholzer Thanks for the reminder! The issue has been closed now.
