EC2+GPU nvidia-docker not found #69

Closed
leduckhc opened this issue Jan 17, 2017 · 4 comments

leduckhc commented Jan 17, 2017

Hi, I received an error while running on EC2 with a GPU in us-west-1 using ami-7ea27a1e:
/var/lib/cloud/instance/scripts/part-001: line 47: nvidia-docker: command not found
Running in the stubbed mode with

run_experiment_lite(
    algo.train(),
    exp_prefix="space-invaders",
    exp_name="dqn",
    # Number of parallel workers for sampling
    n_parallel=8,
    # Keep the snapshot parameters for every iteration ("last" would keep only the final one)
    snapshot_mode="all",
    mode="ec2",
    use_gpu=True,
    # Specifies the seed for the experiment. If this is not provided, a random seed
    # will be used
    seed=1,
    # plot=True,
    sync_s3_pkl=True,
    periodic_sync=True,
    periodic_sync_interval=120,
)

After checking the AWS EC2 web console I found the instance terminated, so I opened stdout.log in the S3 bucket and found these lines:

sync initiated
log sync initiated
docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container did not start before the specified timeout".
upload: ../../../home/ubuntu/user_data.log to s3://rllab-rocky/experiments/first-exp/first_exp_2016_10_05_17_52_21_0001/stdout.log
{
    "return": "true"
}
{
    "return": "true"
}
start: Job is already running: docker
Using default tag: latest
latest: Pulling from dementrock/rllab3-shared
...
/var/lib/cloud/instance/scripts/part-001: line 47: nvidia-docker: command not found
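
The log suggests the startup script calls nvidia-docker on an image that does not have it installed. A minimal sketch of a guard that would fail earlier with a clearer message is below. It is not rllab's launch code: `select_docker_command` is a hypothetical helper, and the check would have to run on the instance itself rather than on the machine submitting the job.

```python
import shutil


def select_docker_command(use_gpu, which=shutil.which):
    """Return the docker executable to use (hypothetical helper).

    `which` is injectable so the lookup can be replaced in tests;
    by default it searches PATH the same way the shell would.
    """
    if not use_gpu:
        return "docker"
    if which("nvidia-docker") is None:
        # Same condition that produced "nvidia-docker: command not found"
        raise RuntimeError(
            "use_gpu=True but nvidia-docker is not on PATH; "
            "this image may be CPU-only"
        )
    return "nvidia-docker"
```

With a guard like this the job would stop with an explicit message instead of the instance terminating after a bare "command not found".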

leduckhc commented Jan 17, 2017

Another issue relates to the Docker images dementrock/rllab3-shared, and probably dementrock/rllab3-shared-gpu as well, which don't include the latest gym, gym==0.7.2. Would you please consider updating the images on Docker Hub and the AWS AMIs as well?

Traceback (most recent call last):
  File "/root/code/rllab/scripts/run_experiment_lite.py", line 136, in <module>
    run_experiment(sys.argv)
  File "/root/code/rllab/scripts/run_experiment_lite.py", line 98, in run_experiment
    logger.log_parameters_lite(params_log_file, args)
  File "/root/code/rllab/rllab/misc/logger.py", line 310, in log_parameters_lite
    stub_method = pickle.loads(base64.b64decode(args.args_data))
  File "/root/code/rllab/experiments/dqn/dqn.py", line 8, in <module>
    from experiments.dqn.atari_env_wrapper import AtariEnvWrapper
  File "/root/code/rllab/experiments/dqn/atari_env_wrapper.py", line 7, in <module>
    from rllab.envs.gym_env import GymEnv
  File "/root/code/rllab/rllab/envs/gym_env.py", line 5, in <module>
    from gym.monitoring import monitor_manager
ImportError: cannot import name 'monitor_manager'
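
An ImportError like this usually points to a version mismatch between the code and the package installed in the image. A rough sketch of a check that could be run inside the container to confirm which gym it ships is below. The helper names are made up for illustration, and it assumes Python 3.8+ for `importlib.metadata`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_mismatch(package, expected):
    """Return None when the pinned version is installed, else a short report."""
    found = installed_version(package)
    if found == expected:
        return None
    return f"{package}: expected {expected}, found {found or 'not installed'}"


if __name__ == "__main__":
    # gym==0.7.2 is the version mentioned in this comment
    print(version_mismatch("gym", "0.7.2") or "gym version OK")
```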

@dementrock
Member

Yes, that AMI was meant to work with CPU machines only. I will roll out a list of new AMIs / docker images soon.

@leduckhc
Author

Thank you in advance.

@dementrock
Member

I have updated all AMIs. Please find the new AMI ids here: https://github.com/openai/rllab/blob/master/scripts/setup_ec2_for_rllab.py#L50

jonashen pushed a commit to jonashen/rllab that referenced this issue May 29, 2018
Implement std_share_network architecture in the GaussianMLP*
classes of TensorFlow.

The std_share_network creates a single neural net with output
length of 2 * action_dimension. The first half output params
are the means params, and the second half params are the
log_std params.

As a result, the GaussianMLP* classes can support a new
architecture.

Refer to: rll#44