[tune] fix Tensorboard file descriptor leak #12425

richardliaw · 2020-11-25T21:56:49Z

Why are these changes needed?

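We have a TensorBoard file-descriptor leak: the same trial directories accumulate several open events.out.tfevents files, as the list of files open in the process shows: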

 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00005_5_a=-1_2020-11-25_12-55-11/events.out.tfevents.1606337713.Richards-MBP.attlocal.net', fd=92),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00007_7_a=-1_2020-11-25_12-55-11/events.out.tfevents.1606337713.Richards-MBP.attlocal.net', fd=94),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00003_3_a=-1_2020-11-25_12-55-10/events.out.tfevents.1606337714.Richards-MBP.attlocal.net', fd=96),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00000_0_a=1_2020-11-25_12-55-10/events.out.tfevents.1606337714.Richards-MBP.attlocal.net', fd=97),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00001_1_a=-1_2020-11-25_12-55-10/events.out.tfevents.1606337714.Richards-MBP.attlocal.net', fd=98),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00002_2_a=1_2020-11-25_12-55-10/events.out.tfevents.1606337715.Richards-MBP.attlocal.net', fd=99),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00003_3_a=-1_2020-11-25_12-55-10/events.out.tfevents.1606337715.Richards-MBP.attlocal.net', fd=100),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00004_4_a=1_2020-11-25_12-55-11/events.out.tfevents.1606337715.Richards-MBP.attlocal.net', fd=101),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00005_5_a=-1_2020-11-25_12-55-11/events.out.tfevents.1606337715.Richards-MBP.attlocal.net', fd=102),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00006_6_a=1_2020-11-25_12-55-11/events.out.tfevents.1606337715.Richards-MBP.attlocal.net', fd=103),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00007_7_a=-1_2020-11-25_12-55-11/events.out.tfevents.1606337715.Richards-MBP.attlocal.net', fd=104),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00000_0_a=1_2020-11-25_12-55-10/events.out.tfevents.1606337716.Richards-MBP.attlocal.net', fd=105),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00001_1_a=-1_2020-11-25_12-55-10/events.out.tfevents.1606337716.Richards-MBP.attlocal.net', fd=106),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00002_2_a=1_2020-11-25_12-55-10/events.out.tfevents.1606337716.Richards-MBP.attlocal.net', fd=107),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00003_3_a=-1_2020-11-25_12-55-10/events.out.tfevents.1606337716.Richards-MBP.attlocal.net', fd=108),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00004_4_a=1_2020-11-25_12-55-11/events.out.tfevents.1606337716.Richards-MBP.attlocal.net', fd=109),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00005_5_a=-1_2020-11-25_12-55-11/events.out.tfevents.1606337717.Richards-MBP.attlocal.net', fd=110),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00006_6_a=1_2020-11-25_12-55-11/events.out.tfevents.1606337717.Richards-MBP.attlocal.net', fd=111),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00007_7_a=-1_2020-11-25_12-55-11/events.out.tfevents.1606337717.Richards-MBP.attlocal.net', fd=112),
 popenfile(path='/Users/rliaw/ray_results/ray_demo/MyTrainable_81eaf_00000_0_a=1_2020-11-25_12-55-10/events.out.tfevents.1606337717.Richards-MBP.attlocal.net', fd=113)]

when running:


        pbt = PopulationBasedTraining(
            time_attr="training_iteration",
            metric="metric",
            mode="max",
            perturbation_interval=1,
            quantile_fraction=0.5,
            hyperparam_mutations={"b": [-1]},
        )

        tune.run(
            MyTrainable,
            name="ray_demo",
            scheduler=pbt,
            stop={"training_iteration": 20},
            num_samples=8,
            checkpoint_freq=1,
            keep_checkpoints_num=1,
            verbose=False,
            fail_fast=True,
            config={"a": tune.sample_from(lambda _: param_a())},
            trial_executor=CustomExecutor(
                queue_trials=False, reuse_actors=False),
        )
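
A minimal sketch of how a leak like this can be observed, under stated assumptions: the listing above looks like psutil's popenfile tuples, but this sketch only uses the standard library and counts entries in the per-process descriptor directory on a POSIX system. The names count_open_fds and leaky_writers are hypothetical and are not part of Tune.

```python
import os
import tempfile

# Assumption: a POSIX system that exposes open descriptors as a directory
# (/proc/self/fd on Linux, /dev/fd on macOS).
FD_DIR = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"


def count_open_fds():
    """Count file descriptors currently open in this process.

    The listing itself briefly opens one descriptor, so only differences
    between two calls are meaningful.
    """
    return len(os.listdir(FD_DIR))


def leaky_writers(directory, n):
    """Hypothetical stand-in for event writers that are never closed."""
    return [
        open(os.path.join(directory, f"events.out.{i}"), "w") for i in range(n)
    ]


with tempfile.TemporaryDirectory() as tmp:
    before = count_open_fds()
    handles = leaky_writers(tmp, 5)   # descriptors stay open here
    during = count_open_fds()
    for handle in handles:            # the fix: close each writer explicitly
        handle.close()
    after = count_open_fds()

leaked = during - before      # descriptors held while writers are open
remaining = after - before    # descriptors still held after closing
print(leaked, remaining)
```

If the count keeps growing across training iterations, as in the listing above, some writer is being re-created without the previous one being closed.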

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

amogkam

LGTM. There are just some extra comments to be removed.

amogkam · 2020-12-03T00:57:03Z

python/ray/tune/tests/test_trial_scheduler_pbt.py

+    def testFileFree(self):
+        class MyTrainable(Trainable):
+            def setup(self, config):
+                # Make sure this is large enough so ray uses object store


Remove this comment?

great catch

amogkam · 2020-12-03T00:57:23Z

python/ray/tune/tests/test_trial_scheduler_pbt.py

+                self.verbose = verbose
+
+            def on_trial_result(self, *args, **kwargs):
+                # assert len(ray.objects()) <= 10


great catch

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

richardliaw · 2020-12-03T01:10:22Z

@amogkam thanks for the fast review!

richardliaw added 7 commits November 25, 2020 12:39

add-gc-fix

0389b39

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

fd

f8a52c9

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

fix

34d8034

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

p

10088e7

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

Merge branch 'master' into file-descriptor-tune

eb434f1

fix

2cef946

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

fix-restores

3e56988

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

richardliaw marked this pull request as ready for review December 3, 2020 00:54

richardliaw requested review from krfricke and amogkam December 3, 2020 00:54

amogkam approved these changes Dec 3, 2020


richardliaw added 2 commits December 2, 2020 17:04

fix

fdd48ca

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

extra-psutil

2c4590c

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

richardliaw mentioned this pull request Dec 3, 2020

[tune] Fix file descriptor leak by syncer #12590

Merged

6 tasks

richardliaw added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") Dec 3, 2020

richardliaw changed the title from "[tune] file descriptor leak" to "[tune] fix Tensorboard file descriptor leak" Dec 3, 2020

krfricke approved these changes Dec 3, 2020


richardliaw merged commit 7c58a85 into ray-project:master Dec 3, 2020

richardliaw deleted the file-descriptor-tune branch December 3, 2020 08:06
