How to update version of Ray on a cluster? #246

Closed · pcmoritz opened this issue Feb 4, 2017 · 4 comments

pcmoritz (Contributor) commented Feb 4, 2017

Someone I chatted with wants to do the following: update an existing Ray cluster with a bunch of nodes to use a newer version of Ray. Right now, if the cluster is large, the best way to do it seems to be to create an AMI with the new version and restart all the instances. Is there a better way (one possibility: provide an update-ray.sh for pssh)?

robertnishihara (Collaborator) commented Feb 4, 2017

For now, I think using pssh is the way to go. Assuming you have a workers.txt with all of the node IP addresses (other than the head node), you can do the following. The instructions are similar to the ones in this file https://github.com/ray-project/ray/blob/master/doc/using-ray-on-a-large-cluster.md and maybe should be added to that file.

1. Stop, update, and start Ray on the head node:
# Stop Ray
ray/scripts/stop_ray.sh

# Update Ray
cd ~/ray/python
git pull
python setup.py install

# Start Ray
cd ~
ray/scripts/start_ray.sh --head --num-workers=10 --redis-port=6379

2. Then make a script to run via parallel-ssh on all of the other nodes, e.g., script.sh:

export PATH=/home/ubuntu/anaconda2/bin/:$PATH
ray/scripts/stop_ray.sh
cd ~/ray/python
git pull
python setup.py install
cd ~
ray/scripts/start_ray.sh --num-workers=10 --redis-address=<head-node-ip>:6379

3. Then run it via parallel-ssh:

parallel-ssh -h workers.txt -P -I < script.sh

This assumes you have a workers.txt file containing the (private) IP addresses of all the nodes other than the head node and can ssh to the other nodes from the head node.
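
For reference, workers.txt is just one private IP address per line (the addresses below are placeholders), and a quick parallel-ssh check can confirm connectivity before running the update:

workers.txt (example addresses, substitute your own):

172.31.1.10
172.31.1.11
172.31.1.12

Sanity check from the head node; every worker should report its hostname:

parallel-ssh -h workers.txt -i hostname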

robertnishihara changed the title from "update ray on a cluster" to "How to update version of Ray on a cluster?" on Feb 4, 2017

robertnishihara (Collaborator) commented Feb 5, 2017

I've been trying to use a script for doing the initial installation of Ray so you don't need to create an AMI, but I haven't quite gotten it to work.

initial_setup.sh

sudo apt-get update
sudo apt-get install -y cmake build-essential autoconf curl libtool libboost-all-dev unzip emacs

wget https://repo.continuum.io/archive/Anaconda2-4.3.0-Linux-x86_64.sh -O ~/anaconda.sh
bash ~/anaconda.sh -b -p $HOME/anaconda
export PATH="$HOME/anaconda/bin:$PATH"
echo 'export PATH="$HOME/anaconda/bin:$PATH"' >> ~/.bashrc

git clone https://github.com/ray-project/ray.git
cd ray/python
python setup.py install

conda install -y libgcc

pip install numpy cloudpickle funcsigs colorama psutil redis

Run it via parallel-ssh:

parallel-ssh -h workers.txt -P -I -t 0 < initial_setup.sh

The -t 0 is to prevent it from timing out and dying.

It currently dies with messages like

[FAILURE] 172.31.1.198 Exited with error code 127

FYI @jssmith.

My guess is that the command git clone https://github.com/ray-project/ray.git (or one of the others) is failing when the script is run a second time, e.g., with the error fatal: destination path 'ray' already exists and is not an empty directory. So we need to make sure that the failure of one command doesn't prevent the others from succeeding.
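
One way to make the script tolerate being rerun (just a sketch, untested on a real cluster) is to make the clone step idempotent, pulling instead of cloning when the repository already exists:

# Idempotent clone step, assuming the repository lives at ~/ray
if [ -d "$HOME/ray" ]; then
    (cd "$HOME/ray" && git pull)
else
    git clone https://github.com/ray-project/ray.git "$HOME/ray"
fi
cd "$HOME/ray/python" && python setup.py install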

jssmith (Contributor) commented Feb 6, 2017

A few responses here:

On Updating

Instructions for updating are basically right. The steps should be 1/ shut down Ray on all nodes, 2/ write the update script, 3/ run the update script on the head node, 4/ run the update script in parallel on the other nodes, 5/ start up Ray on all nodes. I agree that we should add these instructions to https://github.com/ray-project/ray/blob/master/doc/using-ray-on-a-large-cluster.md
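
Spelled out as a sketch (using the update-ray.sh name suggested above; Ray is assumed to already be stopped everywhere, with the repository layout from the earlier instructions):

# update-ray.sh -- update only, no stop/start
export PATH=/home/ubuntu/anaconda2/bin/:$PATH
cd ~/ray/python
git pull
python setup.py install

Run it on the head node first, then on the workers, and only then start Ray everywhere:

bash update-ray.sh
parallel-ssh -h workers.txt -P -I -t 0 < update-ray.sh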

On AMI

This should work. Some suggestions on bug fixes, then I'll comment on whether it is a good idea. You can use the -o and -e options on parallel-ssh to redirect the standard output and standard error from each host to a file. This should help in debugging. Also, as written I think the script will just keep running even if individual commands result in errors (newline is like ;, not like &&). It really would be best to have a script that won't have any errors along the way, though. That way we can just confirm that no error output was generated after running parallel-ssh and have confidence that the installation was successful.
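
A minimal sketch of what that could look like (the log directory names are arbitrary examples): put set -e at the top of the script so it aborts at the first failing command, and collect per-host output with -o/-e:

# First line of script.sh / initial_setup.sh: abort on the first error
set -e

# On the head node: collect per-host stdout/stderr for debugging
mkdir -p /tmp/pssh-out /tmp/pssh-err
parallel-ssh -h workers.txt -t 0 -o /tmp/pssh-out -e /tmp/pssh-err -I < script.sh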

I still tend to prefer steering users toward creating AMIs, but this is worth considering. For one thing, the user doesn't have to get a setup script running smoothly end to end. If there is a need for libraries that require license approval, large files, etc., it may be easier to just do it once, by hand, and then clone the result of that work. The larger worry I have is that whenever there are external dependencies, e.g., downloading Anaconda or other packages, speed and success become variable factors. Unless one has a good way to verify the success of the installation on each machine, this is a risky way to go. Note that these risks usually scale with the number of machines, so for small clusters the AMI may have less value, but as you get to larger installations it becomes increasingly useful to bring up all of the machines in a well-defined state.

robertnishihara (Collaborator) commented:

Instructions for updating the version of Ray using parallel-ssh have been added. #256
