Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix] fix bugs when loading KV files and more. Also make horovod saving function is callable independent bug not patched origin save function. #364

Merged
merged 8 commits into from
Oct 28, 2023

Conversation

MoFHeka
Copy link
Contributor

@MoFHeka MoFHeka commented Oct 18, 2023

Description

Functional Fix

Fix the default save function for tensorflow will not be patched at this time, as this can lead to unexpected errors.
Now using de.keras.models.de_hvd_save_model to replace tf.keras.models.save_model.

Bug Fix

  1. Fix load too many value from DE values files when using TF dataset op, which could cause insert error(shape inconformity).
  2. Fix fatal error: When restoring from kv files using insert_or_assign in CPU hash table, the recovered data will be the same first data in vector. Fixes TFRA load embedding weights with wrong value when use CPU table #363
  3. Fix TrainableWrapper and DEResourceVariable should not be save or restore parameter when using tf.train.Saver.
  4. Fix TypeVar was used incorrectly and Callable was used instead.

New Feature

  1. [feat] Add the DEHvdSaver class, which is similar to tf.train.Saver and is used to save DE KV files with different rank when using horovod all2all training.
  2. [feat] Add support to tf.train.Checkpoint and tf.train.CheckpointManager when using HvdAllToAllEmbedding by calling de.train.DEHvdCheckpoint.

New Document

[doc] Add documentation for HvdAllToAllEmbedding.

Type of change

  • Bug fix
  • New Tutorial
  • Updated or additional documentation
  • Additional Testing
  • New Feature

Checklist:

  • I've properly formatted my code according to the guidelines
    • By running yapf
    • By running clang-format
  • This PR addresses an already submitted issue for TensorFlow Recommenders-Addons
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

How Has This Been Tested?

Run the test.
But the fatal error in CPU table only happened in template <class K, class V, size_t DIM>, and I don't know how to trigger it in a small case.

rhdong
rhdong previously approved these changes Oct 19, 2023
Copy link
Member

@rhdong rhdong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@MoFHeka MoFHeka changed the title [Fix] fix 2 bug when loading KV files. Also make horovod saving function is callable independent bug not patched origin save function. [Fix] fix bugs when loading KV files. Also make horovod saving function is callable independent bug not patched origin save function. Oct 19, 2023
Copy link
Contributor

@fuhailin fuhailin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the perspective of software engineering, sincerely and strongly recommend that one commit fix only one minor problem to avoid introducing unexpected bugs

@@ -233,8 +233,7 @@ def _insert_de_shard_from_file_system(
_values_tensor_dataset = readers.FixedLengthRecordDataset(
shard_values_file_list,
record_bytes=_value_dtype.size * value_dim,
buffer_size=buffer_size * value_dim).padded_batch(insert_num_once *
value_dim,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why here need to delete value_dim?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_values_tensor_dataset record_bytes is already _value_dtype.size * value_dim, if we padded_batch the bytes, they could be the size _value_dtype.size * value_dim * insert_num_once * value_dim. When they are reshaped, their shape would not compatible with _keys_tensor_dataset.

@MoFHeka MoFHeka changed the title [Fix] fix bugs when loading KV files. Also make horovod saving function is callable independent bug not patched origin save function. [Fix] fix bugs when loading KV files and more. Also make horovod saving function is callable independent bug not patched origin save function. Oct 26, 2023
@MoFHeka MoFHeka force-pushed the master-dev branch 3 times, most recently from 5a4dfb4 to f4404dc Compare October 26, 2023 18:05
… this time, as this can lead to unexpected errors.

Now using de.keras.models.de_hvd_save_model to replace tf.keras.models.save_model.
…n, the recovered data will be the same first data in vector.
…nd is used to save DE KV files with different rank when using horovod all2all training.
…ger when using HvdAllToAllEmbedding by calling de.train.DEHvdCheckpoint.
Copy link
Member

@rhdong rhdong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rhdong rhdong merged commit 6f7bbb8 into tensorflow:master Oct 28, 2023
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TFRA load embedding weights with wrong value when use CPU table
3 participants