Apache Ignite Dataset #22210
Conversation
…e variables to satisfy code style, use pointers instead of references.
Hello @mrry, @martinwicke, @perfinion. Sorry for the confusion, I've recreated #21853 here. The current state is:
Regarding Windows builds, I see the same issues I had and asked about here: https://groups.google.com/a/tensorflow.org/forum/#!topic/build/ePYss0Kxcu4. It looks like we need to add copt=-DWIN32_LEAN_AND_MEAN on the CI server for Windows builds. Guys, let me ask you to continue the review.
Note that many of the style/performance/correctness comments apply generally across the code, but I've only commented at the first instance.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Apache Ignite is a memory-centric distributed database, caching, and
Style nit: all module, class, and function docstrings should begin with a one-line summary.
I updated docs and added a short description in the first line.
limitations under the License.
==============================================================================*/

#include "ignite_binary_object_parser.h"
Style nit: use the full absolute path to included headers in the same project.
Done.
Status BinaryObjectParser::Parse(uint8_t** ptr,
                                 std::vector<Tensor>* out_tensors,
                                 std::vector<int32_t>* types) {
The `types` argument seems to be unused, except in what seems to be a recursive call to this method. Delete it if it's unused.
Yes, excellent point, I completely forgot about it. The reason for `types` to be here is to build a schema of every object that we receive and check it. I added this logic today, so it affects this code and the `CheckTypes` method in `IgniteDatasetIterator`. Please take a look.
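For readers following the thread, the idea behind keeping `types` can be sketched in Python: record the type ids of the first object as a schema, then verify every subsequent object against it. The names below are illustrative, not the actual `CheckTypes` implementation, which lives in the C++ iterator:

```python
def check_types(schema, obj):
    """Verify that an object's type ids match the previously seen schema.

    obj is a list of (type_id, value) pairs; schema is a list of type ids,
    or None if no object has been seen yet.
    """
    actual = [type_id for type_id, _ in obj]
    if schema is None:
        return actual  # the first object defines the schema
    if actual != schema:
        raise ValueError("Object schema mismatch: expected %s, got %s"
                         % (schema, actual))
    return schema

schema = None
for obj in [[(9, "abc"), (3, 42)], [(9, "def"), (3, 7)]]:
    schema = check_types(schema, obj)
print(schema)  # [9, 3]
```

A mismatched object (for example, fields in a different order) would raise `ValueError` instead of silently producing misaligned tensors.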
==============================================================================*/

#include <vector>
#include "tensorflow/core/framework/dataset.h"
This header seems to be unused. If you depend on other headers indirectly, include them directly here or (preferably) in the `.cc` file.
Updated; it looks like the following two headers are enough:
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/lib/core/status.h"
int32_t length = *((int32_t*)*ptr);
*ptr += 4;
Tensor tensor(cpu_allocator(), DT_STRING, {});
tensor.scalar<std::string>()() = std::string((char*)*ptr, length);
Use `tensorflow::string` instead of `std::string` everywhere. Some platforms use a different string implementation.
Done.
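For context on the C++ snippet being reviewed above: it reads a 4-byte length prefix and then that many bytes as a string, advancing the pointer past both. A minimal Python sketch of the same wire-format step, assuming a little-endian length prefix (the helper name is illustrative):

```python
import struct

def read_string(buf, offset):
    """Read a 4-byte little-endian length, then that many bytes, from buf.

    Returns the decoded string and the new offset, mirroring how the
    C++ parser advances its pointer after each field.
    """
    (length,) = struct.unpack_from("<i", buf, offset)
    offset += 4
    value = buf[offset:offset + length].decode("utf-8")
    offset += length
    return value, offset

# Two length-prefixed strings packed back to back:
payload = struct.pack("<i", 5) + b"hello" + struct.pack("<i", 2) + b"hi"
s1, off = read_string(payload, 0)
s2, off = read_string(payload, off)
print(s1, s2)  # hello hi
```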
""" | ||
if self.fields is None: | ||
object_type = types[self.type_id] | ||
if object_type is not None: |
`object_type` can never be `None`, because `None` is not a value in `types`.
Good point, I added a check like `if self.type_id in types:`. Done.
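The distinction matters because indexing a dict with a missing key raises `KeyError`, whereas a membership test falls through cleanly; a `None` sentinel only works if `None` can actually appear as a value. A minimal sketch (the `types` registry below is illustrative):

```python
types = {1: "BYTE", 9: "STRING"}  # illustrative type-id registry

def lookup(type_id):
    # Membership test instead of relying on a None sentinel:
    if type_id in types:
        return types[type_id]
    return None  # unknown type id

print(lookup(9), lookup(42))  # STRING None
```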
if is_array:
  return tensor_shape.TensorShape([None])
return tensor_shape.TensorShape([])
raise Exception("Unsupported type [type_id=%d]" % self.type_id)
Use `ValueError` instead of `Exception`.
Replaced (where it's appropriate).
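The convention here is that `ValueError` signals an invalid argument value, and it lets callers catch the specific failure instead of a bare `Exception`. A minimal sketch mirroring the snippet under review (the set of supported ids is illustrative):

```python
SUPPORTED_TYPE_IDS = {1, 2, 3}  # illustrative

def check_type(type_id):
    # Raise ValueError, not Exception, for an unsupported argument value:
    if type_id not in SUPPORTED_TYPE_IDS:
        raise ValueError("Unsupported type [type_id=%d]" % type_id)
    return type_id

try:
    check_type(99)
except ValueError as e:
    print(e)  # Unsupported type [type_id=99]
```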
return self.to_flat_rec([])

def to_permutation(self):
  """Returns a permutation that should be applied to order object leafs."""
Typo: s/leafs/leaves/
Thanks, fixed.
name="page_size") | ||
self.username = ops.convert_to_tensor("" if username is None else username,\ | ||
dtype=dtypes.string, name="username") | ||
self.password = ops.convert_to_tensor("" if password is None else password,\ |
This (and `self.cert_password`) will encode potentially sensitive information in the `GraphDef` and send it over insecure channels. Is it necessary to include this information here?
It's not necessary, but it's the simplest way to start using it. We provide two ways to specify these parameters: via parameters of the dataset, or via environment variables on the nodes where the dataset will actually be instantiated. As far as I understand, in the second case sensitive information won't be included in the `GraphDef` or passed over insecure channels.
Perhaps we should only support the environment variable, in that case? I'm concerned that novice users won't understand the distinction, take the simple route, and leak sensitive information.
However, security isn't my domain, so @martinwicke can you please make a determination here?
""" | ||
super(IgniteDataset, self).__init__() | ||
|
||
with IgniteClient(host, port, username, password, certfile, keyfile,\ |
There is no need for a line continuation character here (or in the cases below) because it is implied by the open parentheses.
Thanks, fixed.
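For reference, Python treats everything inside an open parenthesis, bracket, or brace as one logical line, so a trailing backslash is redundant there. A small sketch (the `convert` function below is an illustrative stand-in, not the real `ops.convert_to_tensor`):

```python
def convert(value, dtype=None, name=None):
    """Illustrative stand-in for a multi-argument API call."""
    return (value, dtype, name)

username = None

# The open parenthesis implies continuation, so the call can span
# several lines without any backslash characters:
result = convert("" if username is None else username,
                 dtype="string",
                 name="username")
print(result)  # ('', 'string', 'username')
```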
Hi @mrry. Thank you for the very detailed and deep review, I really appreciate it. I think I addressed all your comments today, so could you please have a look at my changes?
Meanwhile, @martinwicke, @perfinion, could you please rerun the tests? I made a lot of changes, and it would be great to check the CI statuses.
@dmitrievanthony The test failure is a clang-format issue. However, different clang-format versions may generate different outputs. I think you could use the clang-format bundled with Ubuntu 16.04 to get output that matches the CI. If you have Docker installed on your machine, you could run clang-format inside an Ubuntu 16.04 container to format ignite_byte_swapper.h consistently with the CI.
Hi @mrry. It's been 6 days since I fixed all review comments. Could you please take a look?
public:
 ByteSwapper(bool big_endian) {
   int x = 1;
   bool is_little_endian = (*(char *)&x == 1);
Use the formulation here to work out the endianness statically:
constexpr bool kLittleEndian = __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__;
I updated it in a similar way.
name="page_size") | ||
self.username = ops.convert_to_tensor("" if username is None else username,\ | ||
dtype=dtypes.string, name="username") | ||
self.password = ops.convert_to_tensor("" if password is None else password,\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should only support the environment variable, in that case? I'm concerned that novice users won't understand the distinction, take the simple route, and leak sensitive information.
However, security isn't my domain, so @martinwicke can you please make a determination here?
Regarding the security question, @mrry, @martinwicke: I think we can add a warning in case the user specifies sensitive data via parameters. What do you think?
I think we should make it hard for credentials to end up in files or on the network in the clear. Therefore, I would prefer if we insist here that all credentials are present on the executing machine already, and we make it impossible to use credentials that are stored in the graph. I think environment variables would work fine in this case; can we restrict it to that for all sensitive information?
I agree, it's reasonable, @mrry, @martinwicke. I updated the code so that sensitive information is not encoded in the graph. Be aware that it's still used in Python for initial access (which is required to get the data schema), but after that only environment variables are used.
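The approach described here can be sketched as follows: credentials are read from environment variables on the machine where the op actually executes, so they are resolved at run time and never serialized into the `GraphDef`. The variable names below are illustrative, not necessarily the ones the module uses:

```python
import os

def get_credentials():
    """Read Ignite credentials from the local environment.

    Because the values are resolved at execution time on the worker,
    they never appear in the serialized graph.
    """
    username = os.environ.get("IGNITE_USERNAME", "")
    password = os.environ.get("IGNITE_PASSWORD", "")
    return username, password

# Example: set the variables and read them back on the executing machine.
os.environ["IGNITE_USERNAME"] = "ignite"
os.environ["IGNITE_PASSWORD"] = "secret"
print(get_credentials())  # ('ignite', 'secret')
```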
Do we have any open questions, @mrry?
Let's go ahead and merge this.
Guys, it looks like I introduced a bug on Windows with my latest change. Also, as we found out previously, there was a bug in master that leads to the Windows GPU build failure. See #22210 (comment).
Should I merge master into my branch? Or is it fine that the Windows GPU build is broken for a reason unrelated to my code?
@dmitrievanthony The Windows GPU fix has been submitted internally, so the build should pass. Let me help rerun the tests.
Ok, @mrry, @martinwicke, @yongtang, all tests I expected to pass actually passed. XLA fails, but it fails all the time for reasons that are not related to my code. Derek, sorry for bothering you again, but we need your approval again. After that this PR can be merged, right?
Re-approving. There are a few more internal steps required before the PR is merged, but you shouldn't need to do anything else from this point.
Hi, @mrry, @martinwicke. I'm not sure how long it takes to pass all internal steps, but it looks like it takes time. Meanwhile, as far as I can see, several conflicts have appeared. Shall I fix them?
No need to update the branch. We're in the process of getting it to merge internally. FYI: You should check the diff between what we end up merging and the original PR. Our internal checks seem to be more picky about style and other issues than the presubmit. It has also been necessary to disable the SSL tests and remove the checked-in private key file.
PiperOrigin-RevId: 215258743
This is a proposal to add `IgniteDataset`, which allows working with Apache Ignite. Apache Ignite is a memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads, delivering in-memory speeds at petabyte scale. This proposal is a part of a more global initiative to implement so-called "TensorFlow on Apache Ignite" (IGNITE-8335, Design Document).
The integration is based on the Apache Ignite Binary Client Protocol and TensorFlow's tf.data.Dataset. You can find more information about supported features in the README.md of this module.
Tests have also been added. They use Docker to hide configuration complexity, so the implemented functionality can be tested quite simply by a manual run.
It's a copy of #21853, because the previous request was accidentally closed when I cleaned up merge commits from the branch.