System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.5 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
- Ray installed from (source or binary): source
- Ray version: 0.6.5
- Python version: Python 3.6.5
- Exact command to reproduce:
cd ray/python/ray/tune/examples
python mnist_pytorch.py
PyTorch version: 1.0.0
or
cd ray/python/ray/tune/examples
python tune_mnist_keras.py
TF version: 1.12.0
Keras version: 2.2.4
Describe the problem
Without any modifications, I built Ray from source and tried to run the provided Tune examples directly, but most of the examples seem to fail with the following message:
Destroying actor for trial xxxx. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
By the way, the machine has a GPU; the CUDA version is:
Cuda compilation tools, release 9.0, V9.0.176
However, after adding `reuse_actors=True`, the same message still appears (see the sketch below for how I passed it).
Since the trials stop suddenly without any error or exception, could you please take a look? @richardliaw @robertnishihara Thanks!
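For reference, a minimal sketch of how `reuse_actors=True` was passed, assuming this Ray version's `tune.run()` entry point; the `train_fn` below and its grid are placeholders rather than the example's real MNIST training code:

```python
# Minimal repro sketch (Ray 0.6.x function-based Tune API).
# train_fn and its dummy metric are stand-ins for the example's MNIST
# training function; only the reuse_actors=True flag matters here.
import random

import ray
from ray import tune


def train_fn(config, reporter):
    # Placeholder training loop: the real example reads config["lr"] and
    # config["momentum"]; here we just report a fake accuracy a few times.
    for step in range(5):
        reporter(mean_accuracy=random.random(), timesteps_total=step)


if __name__ == "__main__":
    ray.init()
    tune.run(
        train_fn,
        name="mnist_pytorch_repro",
        num_samples=2,
        reuse_actors=True,  # suggested by the log message; trials still stop
        config={
            "lr": tune.grid_search([0.01, 0.1]),
            "momentum": tune.grid_search([0.2, 0.8]),
        },
    )
```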
Source code / logs
python mnist_pytorch.py
2019-03-23 23:54:34,913 WARNING worker.py:1406 -- WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or `ray[debug]`) to enable monitoring of worker processes.
2019-03-23 23:54:34,914 INFO node.py:423 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-03-23_23-54-34_52746/logs.
2019-03-23 23:54:35,021 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:24948 to respond...
2019-03-23 23:54:35,130 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:39939 to respond...
2019-03-23 23:54:35,132 INFO services.py:760 -- Starting Redis shard with 10.0 GB max memory.
2019-03-23 23:54:35,147 WARNING services.py:1236 -- Warning: Capping object memory store to 20.0GB. To increase this further, specify `object_store_memory` when calling ray.init() or ray start.
2019-03-23 23:54:35,148 INFO services.py:1384 -- Starting the Plasma object store with 20.0 GB memory using /dev/shm.
2019-03-23 23:54:35,793 INFO tune.py:60 -- Tip: to resume incomplete experiments, pass resume='prompt' or resume=True to run()
2019-03-23 23:54:35,796 INFO tune.py:211 -- Starting a new experiment.
2019-03-23 23:54:37,283 WARNING util.py:62 -- The `start_trial` operation took 1.3957560062408447 seconds to complete, which may be a performance bottleneck.
2019-03-23 23:54:58,442 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_0_lr=0.081371,momentum=0.40185. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:54:58,754 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_3_lr=0.010086,momentum=0.41713. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:54:59,133 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_1_lr=0.028139,momentum=0.40255. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:54:59,160 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_7_lr=0.030289,momentum=0.55615. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:54:59,299 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_5_lr=0.08914,momentum=0.18464. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:54:59,449 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_6_lr=0.066883,momentum=0.68077. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:55:00,221 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_4_lr=0.059111,momentum=0.82238. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:55:00,525 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_2_lr=0.063279,momentum=0.43368. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:55:21,020 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_9_lr=0.084676,momentum=0.45356. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:55:21,150 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_8_lr=0.051943,momentum=0.6297. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
python tune_mnist_keras.py
(pid=57890) 60000 train samples
(pid=57890) 10000 test samples
(pid=57881) x_train shape: (60000, 28, 28, 1)
(pid=57881) 60000 train samples
(pid=57881) 10000 test samples
(pid=57899) x_train shape: (60000, 28, 28, 1)
(pid=57899) 60000 train samples
(pid=57899) 10000 test samples
(pid=57916) x_train shape: (60000, 28, 28, 1)
(pid=57916) 60000 train samples
(pid=57916) 10000 test samples
(pid=57913) x_train shape: (60000, 28, 28, 1)
(pid=57913) 60000 train samples
(pid=57913) 10000 test samples
(pid=57910) x_train shape: (60000, 28, 28, 1)
(pid=57910) 60000 train samples
(pid=57910) 10000 test samples
2019-03-24 00:09:22,154 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_3_dropout1=0.41208,hidden=53,lr=0.0045996,momentum=0.29457. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:09:23,633 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_9_dropout1=0.78277,hidden=424,lr=0.085855,momentum=0.11821. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:09:28,650 WARNING util.py:62 -- The `experiment_checkpoint` operation took 0.14834022521972656 seconds to complete, which may be a performance bottleneck.
2019-03-24 00:09:36,315 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_1_dropout1=0.77148,hidden=307,lr=0.084435,momentum=0.87804. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:09:37,978 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_4_dropout1=0.71993,hidden=442,lr=0.014533,momentum=0.65771. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:10:18,199 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_6_dropout1=0.72255,hidden=446,lr=0.086364,momentum=0.86826. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:10:44,899 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_2_dropout1=0.73158,hidden=107,lr=0.087594,momentum=0.5979. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:10:48,515 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_0_dropout1=0.2571,hidden=236,lr=0.0083709,momentum=0.47214. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:10:51,434 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_7_dropout1=0.47593,hidden=218,lr=0.067242,momentum=0.85505. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:10:54,745 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_8_dropout1=0.47459,hidden=383,lr=0.094025,momentum=0.39063. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:10:56,552 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_5_dropout1=0.5431,hidden=429,lr=0.031262,momentum=0.61523. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.