Add NumpyDataset support #407
Actually, reading a numpy array from the Python process seems to be straightforward, as numpy's memory layout is consistent. Will update shortly.
@terrytangyuan @BryanCutler I just realized that when TensorFlow's kernel is local, the kernel and Python run in the same process, so the memory address space is accessible. That will save us a lot of effort, I think. Maybe this could be applied to Arrow Batch/Stream as well? When TensorFlow's session is not local but both are on the same machine, we could still access the memory from another process.
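To make the same-process idea concrete, here is a minimal sketch that reads an array's bytes back through nothing but its raw address. It is not the code in this PR: `read_through_address` is a hypothetical helper, it assumes a C-contiguous array, and it uses `ctypes` on the Python side where the real kernel would do the equivalent in C++.

```python
import ctypes

import numpy as np


def read_through_address(array):
    """Hypothetical helper: rebuild an array's contents from its raw address."""
    if not array.flags["C_CONTIGUOUS"]:
        raise ValueError("this sketch assumes a C-contiguous array")
    # __array_interface__['data'] is a (address, read_only) tuple.
    address, _ = array.__array_interface__["data"]
    # Map the raw bytes at that address; this does NOT keep `array` alive.
    raw = (ctypes.c_char * array.nbytes).from_address(address)
    view = np.frombuffer(raw, dtype=array.dtype).reshape(array.shape)
    # Copy so the result stays valid even if the source is freed later.
    return view.copy()


source = np.arange(6, dtype=np.float32).reshape(2, 3)
result = read_through_address(source)
```

The copy at the end is deliberate: the address alone carries no ownership, so anything that keeps using it has to make sure the source array outlives the reader.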
This is great. Thanks!
Very nice! LGTM
I have actually been working on this, will post a WIP PR to discuss more.
else:
  filename = ""
  arrayname = ""
  address, _ = array.__array_interface__['data']
One concern I have with sharing memory addresses between Python and C++ is that the numpy array could go out of scope and get cleaned up while the Dataset still exists and holds an address that is no longer valid. If you store a reference to the data in the dataset, I think that should be enough. WDYT @yongtang?
@BryanCutler Thanks! Updated the PR with a reference to the data added.
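A small sketch of the lifetime concern above, under stated assumptions: both classes are hypothetical stand-ins rather than the PR's `NumpyDataset`, and the behavior relies on CPython's reference counting. A weak reference is used as a probe to see whether the array is still alive once the caller drops it.

```python
import gc
import weakref

import numpy as np


class AddressOnlyDataset:
    """Hypothetical: remembers only the raw address, not the array."""

    def __init__(self, array):
        self.address, _ = array.__array_interface__["data"]


class HoldingDataset:
    """Hypothetical: also stores a reference so the buffer stays valid."""

    def __init__(self, array):
        self.address, _ = array.__array_interface__["data"]
        self._array_holder = array


def array_survives(dataset_cls):
    array = np.arange(4, dtype=np.int64)
    probe = weakref.ref(array)      # does not keep the array alive
    dataset = dataset_cls(array)    # stays alive until the function returns
    del array                       # the caller's only reference goes away
    gc.collect()
    return probe() is not None


address_only_alive = array_survives(AddressOnlyDataset)
holding_alive = array_survives(HoldingDataset)
```

With the address-only variant the array is collected while the dataset still holds its address, which is exactly the dangling-pointer case; the holding variant keeps it alive for as long as the dataset exists.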
  start = 0
  stop = array.shape[0]
else:
  self._array_data = array.data
I believe array.data is a memoryview, does that increment the refcount of the object?
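One way to check this question empirically is sketched below. It assumes Python 3 and CPython; the function name is made up for illustration. Instead of comparing raw refcount numbers, it drops the array and asks whether a weak reference to it is still alive while only the memoryview remains.

```python
import gc
import weakref

import numpy as np


def memoryview_keeps_array_alive():
    array = np.arange(4, dtype=np.int64)
    probe = weakref.ref(array)
    view = array.data                 # a memoryview in Python 3
    exporter_is_array = view.obj is array
    del array                         # only `view` can reference the array now
    gc.collect()
    still_alive = probe() is not None
    return type(view).__name__, exporter_is_array, still_alive, view.nbytes


kind, exporter_is_array, still_alive, nbytes = memoryview_keeps_array_alive()
```

In this sketch the memoryview's `obj` attribute is the array itself, so holding the view does keep the underlying buffer alive.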
Thanks @BryanCutler. I updated with
  start = 0
  stop = array.shape[0]
else:
  self._array_holder = np.array(array, copy=False)
This looks fine, but could you just do self._array_holder = array?
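For comparison, a quick sketch of the two spellings, assuming the input is already a plain `np.ndarray` (the variable names and `same_buffer` helper are illustrative only). For such an input no copy is needed, so both holders should point at the same buffer.

```python
import numpy as np


def same_buffer(a, b):
    """Compare the raw data addresses of two arrays."""
    return a.__array_interface__["data"][0] == b.__array_interface__["data"][0]


array = np.arange(4, dtype=np.int64)

holder_via_constructor = np.array(array, copy=False)  # spelling used in the diff
holder_via_assignment = array                         # spelling suggested in review

constructor_shares = same_buffer(holder_via_constructor, array)
assignment_shares = same_buffer(holder_via_assignment, array)
```

The two spellings can differ for inputs that are not already ndarrays (lists, for example), where `np.array` has to build a new array and the address would belong to that new object rather than to the caller's data.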
Changed to work-in-progress, as I think there might be some further improvement if we could come up with a better alignment/memory pool that fits into TF.
@yongtang Any update on this? Perhaps we could get this in first and then improve later in separate PRs?
Is there any plan to support non-eager mode?
@lxiongh Non-eager mode should not be very difficult to support, as the data type is visible from the Python side.
@lxiongh @terrytangyuan It may take several days to update this PR, as quite a few things have been updated in tensorflow-io. I will see if I can find some time to get this one in this year. Otherwise it will likely be updated in January.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
The PR has been updated and everything should work. Graph mode for NumpyDataset is not supported yet. Once this PR is merged I will have a follow-up PR to add graph mode support (graph mode will require the user to provide the dtype beforehand). /cc @terrytangyuan
Great! Thanks for the efforts!
* Add NumpyDataset support
  Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
* Add NumpyDataset for file support
  Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
* Pylint fix
  Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
This PR adds support for reading npy and npz files, as well as reading a numpy array from the Python process. Related to #68.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
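For readers unfamiliar with the two file formats mentioned in the description, here is a minimal round trip using only NumPy's own `np.save`, `np.savez`, and `np.load`. It does not use the dataset API added by this PR; the file names, array names, and the `roundtrip` helper are arbitrary choices for the sketch.

```python
import os
import tempfile

import numpy as np


def roundtrip(directory):
    x = np.arange(6, dtype=np.float32).reshape(2, 3)
    y = np.arange(2, dtype=np.int64)

    npy_path = os.path.join(directory, "x.npy")
    npz_path = os.path.join(directory, "xy.npz")

    np.save(npy_path, x)           # .npy holds a single array
    np.savez(npz_path, x=x, y=y)   # .npz is a zip archive of named .npy entries

    loaded = np.load(npy_path)
    with np.load(npz_path) as archive:
        names = sorted(archive.files)
        arrays = {name: archive[name] for name in names}
    return x, y, loaded, names, arrays


with tempfile.TemporaryDirectory() as tmp:
    x, y, loaded, names, arrays = roundtrip(tmp)
```

An npz file carries several arrays addressed by name, which is presumably why the earlier diff excerpt tracks both a `filename` and an `arrayname`.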