[Feat]Add the features for expanding and shrinking the number of tables in distributed training by independently saving files. #305

Merged
merged 9 commits into from
Mar 10, 2023

Contributor

@MoFHeka MoFHeka commented Mar 9, 2023

Description

Add features for expanding and shrinking the number of tables in distributed training by saving files independently.
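To illustrate the idea of resizing through independently saved files, here is a minimal sketch. It is not the TFRA implementation: the file naming, the JSON format, the `key % world_size` partitioning rule, and the helper names `save_shard`, `load_for_rank`, and `reshard_demo` are all assumptions made for illustration. Each worker writes its own file, and on restore a worker scans every file and keeps only the keys it owns under the new worker count.

```python
import json
import os
import tempfile


def save_shard(directory, rank, table):
    """Write one worker's key/value pairs to its own file (hypothetical format)."""
    path = os.path.join(directory, f"shard_{rank}.json")
    with open(path, "w") as f:
        # JSON object keys must be strings, so integer keys are converted.
        json.dump({str(k): v for k, v in table.items()}, f)


def load_for_rank(directory, rank, world_size):
    """Scan every saved shard and keep only the keys this worker now owns."""
    table = {}
    for name in sorted(os.listdir(directory)):
        if not name.startswith("shard_"):
            continue
        with open(os.path.join(directory, name)) as f:
            for k, v in json.load(f).items():
                key = int(k)
                # Assumed partitioning rule: a key belongs to rank key % world_size.
                if key % world_size == rank:
                    table[key] = v
    return table


def reshard_demo():
    """Save with 2 workers, then restore with 3 workers."""
    data = {k: [float(k)] for k in range(10)}
    with tempfile.TemporaryDirectory() as d:
        for rank in range(2):
            save_shard(d, rank, {k: v for k, v in data.items() if k % 2 == rank})
        return [load_for_rank(d, r, 3) for r in range(3)]
```

Because every file is self-contained, the number of readers does not have to match the number of writers; the cost in this sketch is that each restoring worker reads all files.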

Also improve the performance of CPU table by using std::copy_n.

Also make the generation of _DEFAULT_CUDA_COMPUTE_CAPABILITIES more compatible and concise in build_deps/toolchains/gpu/cuda_configure.bzl.

Also add compatibility with TF 2.9, which passes the parameter validate_shape to _init_from_args.
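One common way to stay compatible when a base class starts forwarding a new keyword is to accept extra keyword arguments in the override. The sketch below shows that pattern only; `BaseVariable` and `WrapperSketch` are hypothetical stand-ins, not TensorFlow or TFRA classes, and the actual change in this PR may differ.

```python
class BaseVariable:
    """Hypothetical stand-in for a framework base class (not TensorFlow)."""

    def __init__(self, initial_value, **kwargs):
        # A newer framework version may forward extra keywords such as
        # validate_shape to the subclass hook below.
        self._init_from_args(initial_value=initial_value, **kwargs)

    def _init_from_args(self, initial_value=None, name=None, validate_shape=True):
        self.value = initial_value
        self.name = name


class WrapperSketch(BaseVariable):
    """Hypothetical subclass whose override tolerates new keywords."""

    def _init_from_args(self, initial_value=None, name=None, **kwargs):
        # Accept validate_shape (or any later keyword) instead of raising
        # TypeError when the caller starts passing it.
        self.extra_init_kwargs = dict(kwargs)
        self.value = initial_value
        self.name = name
```

With this shape, `WrapperSketch([1, 2], name="w", validate_shape=False)` works whether or not the caller supplies `validate_shape`.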

Also fix the RedisTableOfTensors node missing its user-defined name.

Also fix the 'checkpoint' parameter not being passed through when using DE BasicEmbedding.

Also make the setuptools packaging compatible with 'find_namespace_packages', because 'find_packages' has been deprecated.

Type of change

  • Bug fix
  • New Tutorial
  • Updated or additional documentation
  • Additional Testing
  • New Feature

Checklist:

  • I've properly formatted my code according to the guidelines
    • By running yapf
    • By running clang-format
  • This PR addresses an already submitted issue for TensorFlow Recommenders-Addons
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

How Has This Been Tested?

Read the doc (docs/api_docs/tfra/dynamic_embedding/FileSystemSaver.md) and run the new tests.

@MoFHeka MoFHeka requested a review from rhdong as a code owner March 9, 2023 14:42
@MoFHeka MoFHeka requested a review from Lifann March 9, 2023 14:42
…les in distributed training by independently saving files.

Users can use DE filesystem KV files without any code changes: simply use the SavedModel/checkpoint API to save and restore DE parameters.
A better implementation for TFRA training in Horovod.
…pe to _init_from_args

[fix] Compatible with TF 2.9 function read_value_no_copy.
…, because 'find_packages' has been deprecated.
Member

@rhdong rhdong left a comment

LGTM

@rhdong rhdong merged commit 373c729 into tensorflow:master Mar 10, 2023