System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.5 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
- Ray installed from (source or binary): source
- Ray version: 0.6.5
- Python version: Python 3.6.5
- Exact command to reproduce:
cd ray/python/ray/tune/examples
python mnist_pytorch.py
PyTorch version: 1.0.0
or
cd ray/python/ray/tune/examples
python tune_mnist_keras.py
TF version: 1.12.0
Keras version: 2.2.4
Describe the problem
Without any modifications, I built Ray from source and tried to run the provided Tune examples directly, but most of the examples seem to fail with the following message:
Destroying actor for trial xxxx. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
By the way, the machine has a GPU; the CUDA version is:
Cuda compilation tools, release 9.0, V9.0.176
However, after adding `reuse_actors=True`, the same message still appears (see the sketch below for how I passed it).
Since the trials stop suddenly without any error or exception, could you please take a look? @richardliaw @robertnishihara Thanks!
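For reference, a minimal sketch of how `reuse_actors=True` was passed, assuming this Ray version's `tune.run()` entry point; the `train_fn` below and its grid are placeholders rather than the example's real MNIST training code:

```python
# Minimal repro sketch (Ray 0.6.x function-based Tune API).
# train_fn and its dummy metric are stand-ins for the example's MNIST
# training function; only the reuse_actors=True flag matters here.
import random

import ray
from ray import tune


def train_fn(config, reporter):
    # Placeholder training loop: the real example reads config["lr"] and
    # config["momentum"]; here we just report a fake accuracy a few times.
    for step in range(5):
        reporter(mean_accuracy=random.random(), timesteps_total=step)


if __name__ == "__main__":
    ray.init()
    tune.run(
        train_fn,
        name="mnist_pytorch_repro",
        num_samples=2,
        reuse_actors=True,  # suggested by the log message; trials still stop
        config={
            "lr": tune.grid_search([0.01, 0.1]),
            "momentum": tune.grid_search([0.2, 0.8]),
        },
    )
```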
Source code / logs
python mnist_pytorch.py
2019-03-23 23:54:34,913 WARNING worker.py:1406 -- WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or `ray[debug]`) to enable monitoring of worker processes.
2019-03-23 23:54:34,914 INFO node.py:423 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-03-23_23-54-34_52746/logs.
2019-03-23 23:54:35,021 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:24948 to respond...
2019-03-23 23:54:35,130 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:39939 to respond...
2019-03-23 23:54:35,132 INFO services.py:760 -- Starting Redis shard with 10.0 GB max memory.
2019-03-23 23:54:35,147 WARNING services.py:1236 -- Warning: Capping object memory store to 20.0GB. To increase this further, specify `object_store_memory` when calling ray.init() or ray start.
2019-03-23 23:54:35,148 INFO services.py:1384 -- Starting the Plasma object store with 20.0 GB memory using /dev/shm.
2019-03-23 23:54:35,793 INFO tune.py:60 -- Tip: to resume incomplete experiments, pass resume='prompt' or resume=True to run()
2019-03-23 23:54:35,796 INFO tune.py:211 -- Starting a new experiment.
2019-03-23 23:54:37,283 WARNING util.py:62 -- The `start_trial` operation took 1.3957560062408447 seconds to complete, which may be a performance bottleneck.
2019-03-23 23:54:58,442 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_0_lr=0.081371,momentum=0.40185. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:54:58,754 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_3_lr=0.010086,momentum=0.41713. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:54:59,133 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_1_lr=0.028139,momentum=0.40255. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:54:59,160 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_7_lr=0.030289,momentum=0.55615. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:54:59,299 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_5_lr=0.08914,momentum=0.18464. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:54:59,449 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_6_lr=0.066883,momentum=0.68077. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:55:00,221 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_4_lr=0.059111,momentum=0.82238. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:55:00,525 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_2_lr=0.063279,momentum=0.43368. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:55:21,020 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_9_lr=0.084676,momentum=0.45356. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-23 23:55:21,150 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_8_lr=0.051943,momentum=0.6297. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
python tune_mnist_keras.py
(pid=57890) 60000 train samples
(pid=57890) 10000 test samples
(pid=57881) x_train shape: (60000, 28, 28, 1)
(pid=57881) 60000 train samples
(pid=57881) 10000 test samples
(pid=57899) x_train shape: (60000, 28, 28, 1)
(pid=57899) 60000 train samples
(pid=57899) 10000 test samples
(pid=57916) x_train shape: (60000, 28, 28, 1)
(pid=57916) 60000 train samples
(pid=57916) 10000 test samples
(pid=57913) x_train shape: (60000, 28, 28, 1)
(pid=57913) 60000 train samples
(pid=57913) 10000 test samples
(pid=57910) x_train shape: (60000, 28, 28, 1)
(pid=57910) 60000 train samples
(pid=57910) 10000 test samples
2019-03-24 00:09:22,154 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_3_dropout1=0.41208,hidden=53,lr=0.0045996,momentum=0.29457. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:09:23,633 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_9_dropout1=0.78277,hidden=424,lr=0.085855,momentum=0.11821. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:09:28,650 WARNING util.py:62 -- The `experiment_checkpoint` operation took 0.14834022521972656 seconds to complete, which may be a performance bottleneck.
2019-03-24 00:09:36,315 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_1_dropout1=0.77148,hidden=307,lr=0.084435,momentum=0.87804. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:09:37,978 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_4_dropout1=0.71993,hidden=442,lr=0.014533,momentum=0.65771. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:10:18,199 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_6_dropout1=0.72255,hidden=446,lr=0.086364,momentum=0.86826. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:10:44,899 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_2_dropout1=0.73158,hidden=107,lr=0.087594,momentum=0.5979. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:10:48,515 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_0_dropout1=0.2571,hidden=236,lr=0.0083709,momentum=0.47214. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:10:51,434 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_7_dropout1=0.47593,hidden=218,lr=0.067242,momentum=0.85505. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:10:54,745 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_8_dropout1=0.47459,hidden=383,lr=0.094025,momentum=0.39063. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
2019-03-24 00:10:56,552 INFO ray_trial_executor.py:178 -- Destroying actor for trial TRAIN_FN_5_dropout1=0.5431,hidden=429,lr=0.031262,momentum=0.61523. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.