
Add NumpyDataset support #407

Merged
merged 3 commits into tensorflow:master on Jan 12, 2020

Conversation

yongtang
Member

@yongtang yongtang commented Aug 3, 2019

This PR adds support for reading npy and npz files, as well as reading numpy arrays directly from the Python process, related to #68.

Signed-off-by: Yong Tang yong.tang.github@outlook.com
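For context, the two file formats the reader handles can be produced with numpy itself; a small sketch (file names hypothetical):

```python
import os
import tempfile

import numpy as np

d = tempfile.mkdtemp()
npy_path = os.path.join(d, "single.npy")
npz_path = os.path.join(d, "bundle.npz")

a = np.arange(6, dtype=np.int64).reshape(2, 3)
np.save(npy_path, a)              # one array per .npy file
np.savez(npz_path, x=a, y=a * 2)  # named arrays in a zip of .npy files

loaded = np.load(npy_path)
bundle = np.load(npz_path)
print(loaded.shape, sorted(bundle.files))  # (2, 3) ['x', 'y']
```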

@yongtang
Member Author

yongtang commented Aug 3, 2019

Actually, reading numpy arrays from the Python process seems straightforward, since numpy's memory layout is consistent. Will update shortly.

@yongtang yongtang changed the title Add read_numpy, list_numpy_arrays, NumpyDataset support Add NumpyDataset support Aug 3, 2019
@yongtang
Member Author

yongtang commented Aug 3, 2019

@terrytangyuan @BryanCutler I just realized that in case tensorflow's kernel is local, the kernel and Python will be in the same process. That means the memory address space is directly accessible.

I think that will save us a lot of effort. Maybe this could apply to Arrow Batch/Stream as well?

In case tensorflow's session is not local, if both are on the same machine we could still access the memory from the other process.
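The in-process idea can be sketched in pure Python: the base address exposed by `__array_interface__` is the same pointer a local kernel could read. Here `ctypes` stands in for the C++ side.

```python
import ctypes

import numpy as np

a = np.arange(4, dtype=np.float32)

# (address, read_only) pair describing the underlying buffer:
address, read_only = a.__array_interface__['data']

# Simulate an in-process kernel reading that memory directly:
view = (ctypes.c_float * a.size).from_address(address)
print(list(view))  # [0.0, 1.0, 2.0, 3.0]
```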

Member

@terrytangyuan terrytangyuan left a comment

This is great. Thanks!

Member

@BryanCutler BryanCutler left a comment

Very nice! LGTM

@BryanCutler
Member

I just realized that in case tensorflow's kernel is local, the kernel and Python will be in the same process. That means the memory address space is directly accessible.

I think that will save us a lot of effort. Maybe this could apply to Arrow Batch/Stream as well?

I have actually been working on this; I will post a WIP PR to discuss further.

else:
    filename = ""
    arrayname = ""
    address, _ = array.__array_interface__['data']
Member

One concern I had with sharing memory addresses between Python and C++ is that it's possible for the numpy array to go out of scope and get cleaned up, while the Dataset could still exist and hold an address that is no longer valid. If you store a reference to the data in the dataset, I think that should be enough. WDYT @yongtang ?
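A minimal sketch of the hazard and the fix (names hypothetical): as long as something on the Python side keeps a reference, the address handed to C++ stays valid.

```python
import gc
import weakref

import numpy as np

a = np.arange(3, dtype=np.float64)
address, _ = a.__array_interface__['data']  # raw pointer handed to C++
ref = weakref.ref(a)

holder = a   # what the Dataset should keep: a reference to the data
del a
gc.collect()

# Because `holder` still references the array, the buffer behind
# `address` has not been freed:
print(ref() is not None)  # True
```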

@yongtang
Member Author

yongtang commented Aug 6, 2019

@BryanCutler Thanks! I updated the PR to keep a reference to the data.

    start = 0
    stop = array.shape[0]
else:
    self._array_data = array.data
Member

I believe array.data is a memoryview; does that increment the refcount of the object?
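For what it's worth, a memoryview does keep its exporter alive while the view exists; a quick check:

```python
import gc
import sys
import weakref

import numpy as np

a = np.arange(5)
before = sys.getrefcount(a)
mv = memoryview(a)           # exporting the buffer holds a reference
after = sys.getrefcount(a)
print(after > before)        # True

r = weakref.ref(a)
del a
gc.collect()
print(r() is not None)       # True: the memoryview keeps the array alive
mv.release()
print(r() is None)           # True: releasing the view frees the array
```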

@yongtang
Member Author

yongtang commented Aug 6, 2019

Thanks @BryanCutler. I updated with np.array(array, copy=False). I think this should solve the issue.

    start = 0
    stop = array.shape[0]
else:
    self._array_holder = np.array(array, copy=False)
Member

This looks fine, but could you just do self._array_holder = array?
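For reference, a sketch showing that both approaches pin the same buffer: for a plain ndarray, np.array(..., copy=False) behaves like np.asarray and returns a view over (or simply the same) memory, so holding the array itself is equivalent for lifetime purposes.

```python
import numpy as np

a = np.arange(4)

# No copy is needed here, so this shares a's buffer:
holder = np.array(a, copy=False)
print(np.shares_memory(a, holder))  # True

# Simply keeping the array is an equally valid anchor:
holder2 = a
print(holder2 is a)  # True
```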

@yongtang yongtang changed the title Add NumpyDataset support [WIP] Add NumpyDataset support Nov 1, 2019
@yongtang
Member Author

yongtang commented Nov 1, 2019

Changed to work-in-progress, as I think there could be some further improvement if we come up with a better alignment/memory pool that fits into TF.

@terrytangyuan
Member

@yongtang Any update on this? Perhaps we could get this in first and then improve later in separate PRs?

@lxiongh

lxiongh commented Dec 19, 2019

Is there any plan to support non-eager mode?

@yongtang
Member Author

@lxiongh Non-eager mode should not be very difficult to support, as the data type is visible from the Python side.

@yongtang
Member Author

@lxiongh @terrytangyuan It may take several days to update this PR, as quite a few things have changed in tensorflow-io. I will see if I can find some time to get this in this year; otherwise it will likely be updated in January.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
@yongtang yongtang changed the title [WIP] Add NumpyDataset support Add NumpyDataset support Jan 11, 2020
@yongtang
Member Author

The PR has been updated and everything should work. Graph mode for NumpyDataset is not supported yet; once this PR is merged I will have a follow-up PR to add graph mode support (graph mode will require the user to provide the dtype beforehand).

/cc @terrytangyuan

@terrytangyuan terrytangyuan merged commit 942a209 into tensorflow:master Jan 12, 2020
@terrytangyuan
Member

Great! Thanks for the efforts!

@yongtang yongtang deleted the numpy branch January 12, 2020 12:56
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
* Add NumpyDataset support

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Add NumpyDataset for file support

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Pylint fix

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>