
Add NumpyDataset support #407

Merged
merged 3 commits into tensorflow:master on Jan 12, 2020

Conversation

yongtang
Member

@yongtang yongtang commented Aug 3, 2019

This PR adds support for reading npy and npz files, as well as reading numpy arrays directly from the Python process, related to #68.

Signed-off-by: Yong Tang yong.tang.github@outlook.com
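For context, the two file formats the reader handles can be produced with numpy itself; a small sketch (file names hypothetical):

```python
import os
import tempfile

import numpy as np

d = tempfile.mkdtemp()
npy_path = os.path.join(d, "single.npy")
npz_path = os.path.join(d, "bundle.npz")

a = np.arange(6, dtype=np.int64).reshape(2, 3)
np.save(npy_path, a)              # one array per .npy file
np.savez(npz_path, x=a, y=a * 2)  # named arrays in a zip of .npy files

loaded = np.load(npy_path)
bundle = np.load(npz_path)
print(loaded.shape, sorted(bundle.files))  # (2, 3) ['x', 'y']
```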

@yongtang
Member Author

yongtang commented Aug 3, 2019

Actually, reading numpy arrays from the Python process seems straightforward, since numpy's memory layout is consistent. Will update shortly.

@yongtang yongtang changed the title Add read_numpy, list_numpy_arrays, NumpyDataset support Add NumpyDataset support Aug 3, 2019
@yongtang
Member Author

yongtang commented Aug 3, 2019

@terrytangyuan @BryanCutler I just realized that in case tensorflow's kernel is local, the kernel and Python will be in the same process. That means the memory address space is directly accessible.

I think that will save us a lot of effort. Maybe this could apply to Arrow Batch/Stream as well?

In case tensorflow's session is not local, if both are on the same machine we could still access the memory from the other process.
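The in-process idea can be sketched in pure Python: the base address exposed by `__array_interface__` is the same pointer a local kernel could read. Here `ctypes` stands in for the C++ side.

```python
import ctypes

import numpy as np

a = np.arange(4, dtype=np.float32)

# (address, read_only) pair describing the underlying buffer:
address, read_only = a.__array_interface__['data']

# Simulate an in-process kernel reading that memory directly:
view = (ctypes.c_float * a.size).from_address(address)
print(list(view))  # [0.0, 1.0, 2.0, 3.0]
```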

Member

@terrytangyuan terrytangyuan left a comment

This is great. Thanks!

Member

@BryanCutler BryanCutler left a comment

Very nice! LGTM

@BryanCutler
Member

I just realized that in case tensorflow's kernel is local, the kernel and Python will be in the same process. That means the memory address space is directly accessible.

I think that will save us a lot of effort. Maybe this could apply to Arrow Batch/Stream as well?

I have actually been working on this; I will post a WIP PR to discuss further.

else:
    filename = ""
    arrayname = ""
    address, _ = array.__array_interface__['data']
Member

One concern I had with sharing memory addresses between Python and C++ is that it's possible for the numpy array to go out of scope and get cleaned up, while the Dataset could still exist and hold an address that is no longer valid. If you store a reference to the data in the dataset, I think that should be enough. WDYT @yongtang ?
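A minimal sketch of the hazard and the fix (names hypothetical): as long as something on the Python side keeps a reference, the address handed to C++ stays valid.

```python
import gc
import weakref

import numpy as np

a = np.arange(3, dtype=np.float64)
address, _ = a.__array_interface__['data']  # raw pointer handed to C++
ref = weakref.ref(a)

holder = a   # what the Dataset should keep: a reference to the data
del a
gc.collect()

# Because `holder` still references the array, the buffer behind
# `address` has not been freed:
print(ref() is not None)  # True
```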

@yongtang
Member Author

yongtang commented Aug 6, 2019

@BryanCutler Thanks! I updated the PR to keep a reference to the data.

    start = 0
    stop = array.shape[0]
else:
    self._array_data = array.data
Member

I believe array.data is a memoryview; does that increment the refcount of the object?
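For what it's worth, a memoryview does keep its exporter alive while the view exists; a quick check:

```python
import gc
import sys
import weakref

import numpy as np

a = np.arange(5)
before = sys.getrefcount(a)
mv = memoryview(a)           # exporting the buffer holds a reference
after = sys.getrefcount(a)
print(after > before)        # True

r = weakref.ref(a)
del a
gc.collect()
print(r() is not None)       # True: the memoryview keeps the array alive
mv.release()
print(r() is None)           # True: releasing the view frees the array
```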

@yongtang
Member Author

yongtang commented Aug 6, 2019

Thanks @BryanCutler. I updated with np.array(array, copy=False). I think this should solve the issue.

    start = 0
    stop = array.shape[0]
else:
    self._array_holder = np.array(array, copy=False)
Member

This looks fine, but could you just do self._array_holder = array?
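For reference, a sketch showing that both approaches pin the same buffer: for a plain ndarray, np.array(..., copy=False) behaves like np.asarray and returns a view over (or simply the same) memory, so holding the array itself is equivalent for lifetime purposes.

```python
import numpy as np

a = np.arange(4)

# No copy is needed here, so this shares a's buffer:
holder = np.array(a, copy=False)
print(np.shares_memory(a, holder))  # True

# Simply keeping the array is an equally valid anchor:
holder2 = a
print(holder2 is a)  # True
```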

@yongtang yongtang changed the title Add NumpyDataset support [WIP] Add NumpyDataset support Nov 1, 2019
@yongtang
Member Author

yongtang commented Nov 1, 2019

Changed to work-in-progress, as I think there could be some further improvement if we come up with a better alignment/memory pool that fits into TF.

@terrytangyuan
Member

@yongtang Any update on this? Perhaps we could get this in first and then improve later in separate PRs?

@lxiongh

lxiongh commented Dec 19, 2019

Is there any plan to support non-eager mode?

@yongtang
Member Author

@lxiongh Non-eager mode should not be very difficult to support, as the data type is visible from the Python side.

@yongtang
Member Author

@lxiongh @terrytangyuan It may take several days to update this PR, as quite a few things have changed in tensorflow-io. I will see if I can find some time to get this in this year; otherwise it will likely be updated in January.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
@yongtang yongtang changed the title [WIP] Add NumpyDataset support Add NumpyDataset support Jan 11, 2020
@yongtang
Member Author

The PR has been updated and everything should work. Graph mode for NumpyDataset is not supported yet; once this PR is merged I will have a follow-up PR to add graph mode support (graph mode will require the user to provide the dtype beforehand).

/cc @terrytangyuan

@terrytangyuan terrytangyuan merged commit 942a209 into tensorflow:master Jan 12, 2020
@terrytangyuan
Member

Great! Thanks for the efforts!

@yongtang yongtang deleted the numpy branch January 12, 2020 12:56
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
* Add NumpyDataset support

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Add NumpyDataset for file support

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Pylint fix

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>