Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ParameterServerStrategy worker removing code change #49726

Open
wojciechp6 opened this issue May 26, 2021 · 2 comments
Open

ParameterServerStrategy worker removing code change #49726

wojciechp6 opened this issue May 26, 2021 · 2 comments
Assignees
Labels
comp:dist-strat Distribution Strategy related issues stat:awaiting tensorflower Status - Awaiting response from tensorflower TF 2.4 for issues related to TF 2.4 type:feature Feature requests

Comments

@wojciechp6
Copy link

wojciechp6 commented May 26, 2021

Please make sure that this is a feature request. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:feature_template

System information

  • TensorFlow version (you are using): 2.4.1
  • Are you willing to contribute it (Yes/No): No

Hi, I'm working on dynamic worker adding and removing for ParameterServerStrategy. I noticed that one change in Worker class is necessary to remove worker while training.
The change is here:
cluster_coordinator.py

class Worker:
  [...]
  def _process_queue(self):
    """Function running in a worker thread to process closure queues."""
    self._maybe_delay()
    while self._should_worker_thread_run:
      closure = self._cluster._closure_queue.get()  # pylint: disable=protected-access
      if not self._should_worker_thread_run:
        closure.mark_cancelled()
        return
      if closure is None:
        return
      self._process_closure(closure)
      # To properly stop the worker and preemption threads, it is important that
      # `ClusterCoordinator` object is not held onto so its `__del__` can be
      # called. By removing the reference to the `closure` that has already been
      # processed, we ensure that the `closure` object is released, while
      # getting the next `closure` at above `self._cluster._closure_queue.get()`
      # call.
      del closure
  [...]

Could someone commit it? I'm not familiar with tensorflow contribute requirements, so I think somebody else could do this faster. Thanks in advance.

@wojciechp6 wojciechp6 added the type:feature Feature requests label May 26, 2021
@UsharaniPagadala UsharaniPagadala added comp:dist-strat Distribution Strategy related issues TF 2.4 for issues related to TF 2.4 labels May 26, 2021
@Saduf2019 Saduf2019 assigned ymodak and unassigned Saduf2019 May 30, 2021
@ymodak ymodak added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jun 3, 2021
@yuefengz
Copy link
Contributor

yuefengz commented Jun 7, 2021

Looks like you want to add the if block containing closure.mark_cancelled()? Is this the only change? Do you want to send a PR? I am happy to change it as well.

@wojciechp6
Copy link
Author

wojciechp6 commented Jun 8, 2021

Hi! Yes, the if block is the only change. I'll be happy if you could do it for me. The contribution polices are too complex to merge such small change. Greetings!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:dist-strat Distribution Strategy related issues stat:awaiting tensorflower Status - Awaiting response from tensorflower TF 2.4 for issues related to TF 2.4 type:feature Feature requests
Projects
None yet
Development

No branches or pull requests

5 participants