Starting processes by hand. #108

Closed
robertnishihara opened this issue Dec 9, 2016 · 11 comments

robertnishihara commented Dec 9, 2016

To debug crashes, it is often useful to start each of the processes (the plasma store, the plasma manager, the local scheduler, the global scheduler, redis, etc) by hand. This way, some of the processes can be started in gdb, and logging messages from the different processes are easier to read.

Currently, this can be done as follows (run all of these commands in separate terminal windows).

  1. Start Redis
rm dump.rdb; ./src/common/thirdparty/redis/src/redis-server --loadmodule src/common/redis_module/ray_redis_module.so
  2. Start the global scheduler (and pass in the Redis address)
./src/global_scheduler/build/global_scheduler -r 127.0.0.1:6379
  3. Start the plasma store
src/plasma/build/plasma_store -s /tmp/s1 -m 1000000000
  4. Start the plasma manager
src/plasma/build/plasma_manager -s /tmp/s1 -m /tmp/m1 -r 127.0.0.1:6379 -h 127.0.0.1 -p 23894
  5. Start the local scheduler
src/photon/build/photon_scheduler -s /tmp/sched1 -p /tmp/s1 -m /tmp/m1 -r 127.0.0.1:6379 -a 127.0.0.1:23894 -h 127.0.0.1
  6. Modify start_ray_local in lib/python/ray/services.py to be something like this.
def start_ray_local(node_ip_address="127.0.0.1", num_workers=0, num_local_schedulers=1, worker_path=None):
  if worker_path is None:
    worker_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "workers/default_worker.py")
  redis_port = 6379
  redis_address = address(node_ip_address, redis_port)
  object_store_names = ["/tmp/s1"]
  object_store_manager_names = ["/tmp/m1"]
  local_scheduler_names = ["/tmp/sched1"]
  address_info = {"node_ip_address": node_ip_address,
                  "redis_port": redis_port,
                  "object_store_names": object_store_names,
                  "object_store_manager_names": object_store_manager_names,
                  "local_scheduler_names": local_scheduler_names}
  # Start the workers.
  for i in range(num_workers):
    start_worker(address_info["node_ip_address"],
                 address_info["object_store_names"][i % num_local_schedulers],
                 address_info["object_store_manager_names"][i % num_local_schedulers],
                 address_info["local_scheduler_names"][i % num_local_schedulers],
                 redis_port,
                 worker_path,
                 cleanup=True)
  return address_info
  7. Then start Python, and run
import ray
ray.init(start_ray_local=True, num_workers=2)

Then try running some commands and see which processes crash! Processes can be started in gdb as well (or lldb on Mac).
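
As a rough illustration of how one process at a time might be wrapped in gdb, here is a minimal stdlib sketch; the launch helper is hypothetical and not part of Ray:

```python
import shlex
import subprocess

def launch(cmd, use_gdb=False):
    """Start a command, optionally under gdb so a crash drops into a debugger.

    Hypothetical helper for illustration; not part of Ray.
    """
    args = shlex.split(cmd)
    if use_gdb:
        # "gdb --args <binary> <flags...>" runs the binary under the debugger;
        # type "run" at the (gdb) prompt, and "bt" after a crash for a backtrace.
        args = ["gdb", "--args"] + args
    return subprocess.Popen(args)

# Example (shown without gdb, so it runs anywhere):
proc = launch("echo plasma_store -s /tmp/s1")
proc.wait()
```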

@robertnishihara

Closing for now.

@robertnishihara

Updated instructions

Start Redis

rm dump.rdb; ./src/common/thirdparty/redis/src/redis-server --loadmodule src/common/redis_module/ray_redis_module.so

Start the global scheduler (and pass in the Redis address)

./src/global_scheduler/build/global_scheduler -r 127.0.0.1:6379

Start the plasma store

src/plasma/build/plasma_store -s /tmp/s1 -m 1000000000

Start the plasma manager

src/plasma/build/plasma_manager -s /tmp/s1 -m /tmp/m1 -r 127.0.0.1:6379 -h 127.0.0.1 -p 23894

Start the local scheduler

src/photon/build/photon_scheduler -s /tmp/sched1 -p /tmp/s1 -m /tmp/m1 -r 127.0.0.1:6379 -a 127.0.0.1:23894 -h 127.0.0.1

Start a worker (or run this multiple times to start multiple workers).

python lib/python/ray/workers/default_worker.py --redis-address 127.0.0.1:6379 --object-store-name /tmp/s1 --object-store-manager-name /tmp/m1 --local-scheduler-name /tmp/sched1 --node-ip-address 127.0.0.1

Connect a driver (run this in a Python interpreter).

import ray

address_info = {
  "node_ip_address": "127.0.0.1",
  "redis_address": "127.0.0.1:6379",
  "store_socket_name": "/tmp/s1",
  "manager_socket_name": "/tmp/m1",
  "local_scheduler_socket_name": "/tmp/sched1"}

ray.connect(address_info, mode=ray.SCRIPT_MODE)

Some other useful things:

  • Start a redis client with redis-cli and run monitor to monitor all of the commands going through the redis server.

@robertnishihara

Updating this again, since the instructions have changed.

Start Redis

rm dump.rdb; ./python/core/src/common/thirdparty/redis/src/redis-server --loadmodule python/core/src/common/redis_module/libray_redis_module.so

Start the global scheduler (and pass in the Redis address)

./python/core/src/global_scheduler/global_scheduler -r 127.0.0.1:6379

Start the plasma store

./python/core/src/plasma/plasma_store -s /tmp/s1 -m 1000000000

Start the plasma manager

./python/core/src/plasma/plasma_manager -s /tmp/s1 -m /tmp/m1 -r 127.0.0.1:6379 -h 127.0.0.1 -p 23894

Start the local scheduler

./python/core/src/photon/photon_scheduler -s /tmp/sched1 -p /tmp/s1 -m /tmp/m1 -r 127.0.0.1:6379 -a 127.0.0.1:23894 -h 127.0.0.1

Start a worker (or run this multiple times to start multiple workers).

python python/ray/workers/default_worker.py --redis-address 127.0.0.1:6379 --object-store-name /tmp/s1 --object-store-manager-name /tmp/m1 --local-scheduler-name /tmp/sched1 --node-ip-address 127.0.0.1

Connect a driver (run this in a Python interpreter).

import ray

address_info = {
  "node_ip_address": "127.0.0.1",
  "redis_address": "127.0.0.1:6379",
  "store_socket_name": "/tmp/s1",
  "manager_socket_name": "/tmp/m1",
  "local_scheduler_socket_name": "/tmp/sched1"}

ray.connect(address_info, mode=ray.SCRIPT_MODE)

Some other useful things:

  • Start a redis client with redis-cli and run monitor to monitor all of the commands going through the redis server.


robertnishihara commented Apr 10, 2017

Updated instructions.

Start Redis

rm dump.rdb; ./python/ray/core/src/common/thirdparty/redis/src/redis-server \
    --loadmodule python/ray/core/src/common/redis_module/libray_redis_module.so

Start the global scheduler (and pass in the Redis address)

./python/ray/core/src/global_scheduler/global_scheduler \
    -r 127.0.0.1:6379 \
    -h 127.0.0.1

Start the plasma store

./python/ray/core/src/plasma/plasma_store \
    -s /tmp/s1 \
    -m 1000000000

Start the plasma manager

./python/ray/core/src/plasma/plasma_manager \
    -s /tmp/s1 \
    -m /tmp/m1 \
    -r 127.0.0.1:6379 \
    -h 127.0.0.1 \
    -p 23894

Start the local scheduler

# Without ability to start new workers.
./python/ray/core/src/local_scheduler/local_scheduler \
    -s /tmp/sched1 \
    -p /tmp/s1 \
    -m /tmp/m1 \
    -r 127.0.0.1:6379 \
    -a 127.0.0.1:23894 \
    -h 127.0.0.1

# With ability to start new workers.
./python/ray/core/src/local_scheduler/local_scheduler \
    -s /tmp/sched1 \
    -p /tmp/s1 \
    -m /tmp/m1 \
    -r 127.0.0.1:6379 \
    -a 127.0.0.1:23894 \
    -h 127.0.0.1 \
    -w "python python/ray/workers/default_worker.py --redis-address 127.0.0.1:6379 --object-store-name /tmp/s1 --object-store-manager-name /tmp/m1 --local-scheduler-name /tmp/sched1 --node-ip-address 127.0.0.1"

Start a worker (or run this multiple times to start multiple workers).

python python/ray/workers/default_worker.py \
    --redis-address 127.0.0.1:6379 \
    --object-store-name /tmp/s1 \
    --object-store-manager-name /tmp/m1 \
    --local-scheduler-name /tmp/sched1 \
    --node-ip-address 127.0.0.1

Connect a driver (run this in a Python interpreter).

import ray

address_info = {
  "node_ip_address": "127.0.0.1",
  "redis_address": "127.0.0.1:6379",
  "store_socket_name": "/tmp/s1",
  "manager_socket_name": "/tmp/m1",
  "local_scheduler_socket_name": "/tmp/sched1"}

ray.connect(address_info, mode=ray.SCRIPT_MODE)

Some other useful things:

  • Start a redis client with redis-cli and run monitor to monitor all of the commands going through the redis server.


robertnishihara commented Jun 14, 2017

More recent instructions.

Note: Throughout, you will need to replace <head-node-ip> with the actual IP address of the head node.

The processes on the head node can be started as follows.

  1. Run the following in Python to start the Redis shards.

    import ray
    
    ray.services.start_redis("<head-node-ip>",
                             port=6379,
                             num_redis_shards=1,
                             redirect_output=False,
                             cleanup=False)
  2. Start a global scheduler.

    ./python/ray/core/src/global_scheduler/global_scheduler \
        -r <head-node-ip>:6379 \
        -h <head-node-ip>
    
  3. Start a plasma store

    ./python/ray/core/src/plasma/plasma_store \
        -s /tmp/s1 \
        -m 1000000000
    
  4. Start a plasma manager

    ./python/ray/core/src/plasma/plasma_manager \
        -s /tmp/s1 \
        -m /tmp/m1 \
        -r <head-node-ip>:6379 \
        -h <head-node-ip> \
        -p 23894
    
  5. Start the local scheduler

    # Without ability to start new workers.
    ./python/ray/core/src/local_scheduler/local_scheduler \
        -s /tmp/sched1 \
        -p /tmp/s1 \
        -m /tmp/m1 \
        -r <head-node-ip>:6379 \
        -a <head-node-ip>:23894 \
        -h <head-node-ip> \
        -c 16,0
    
    # With ability to start new workers.
    ./python/ray/core/src/local_scheduler/local_scheduler \
        -s /tmp/sched1 \
        -p /tmp/s1 \
        -m /tmp/m1 \
        -r <head-node-ip>:6379 \
        -a <head-node-ip>:23894 \
        -h <head-node-ip> \
        -c 16,0 \
        -w "python python/ray/workers/default_worker.py --redis-address <head-node-ip>:6379 --object-store-name /tmp/s1 --object-store-manager-name /tmp/m1 --local-scheduler-name /tmp/sched1 --node-ip-address <head-node-ip>"
    
  6. Start a worker (or run this multiple times to start multiple workers).

     python python/ray/workers/default_worker.py \
         --redis-address <head-node-ip>:6379 \
         --object-store-name /tmp/s1 \
         --object-store-manager-name /tmp/m1 \
         --local-scheduler-name /tmp/sched1 \
         --node-ip-address <head-node-ip>
    
  7. Connect a driver (run this in a Python interpreter).

    import ray
    ray.init(redis_address="<head-node-ip>:6379")

After that, it should be possible to start up Ray on other nodes as follows.

ray start --redis-address=<head-node-ip>:6379
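
Before running ray start on another node, it can save time to confirm that the head node's Redis port is actually reachable from that node. A minimal pre-flight sketch; the redis_reachable helper is hypothetical and not part of Ray's API:

```python
import socket

def redis_reachable(address, timeout=1.0):
    """Return True if a TCP connection to "host:port" succeeds.

    Hypothetical pre-flight helper; not part of Ray's API.
    """
    host, port = address.rsplit(":", 1)
    try:
        # Only checks TCP connectivity, not that Redis itself is healthy.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. check redis_reachable("<head-node-ip>:6379") before running `ray start`.
```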

@robertnishihara

If you wish to start the Redis servers by hand instead of calling ray.services.start_redis as in the previous comment, you can do the following (this starts one primary Redis server and one additional Redis shard). Note that the value <head-node-ip> will need to be replaced.

./python/ray/core/src/common/thirdparty/redis/src/redis-server \
    --loglevel warning \
    --loadmodule ./python/ray/core/src/common/redis_module/libray_redis_module.so \
    --port 6379
./python/ray/core/src/common/thirdparty/redis/src/redis-server \
    --loglevel warning \
    --loadmodule ./python/ray/core/src/common/redis_module/libray_redis_module.so \
    --port 6380
./python/ray/core/src/common/thirdparty/redis/src/redis-cli -p 6379 set NumRedisShards 1
./python/ray/core/src/common/thirdparty/redis/src/redis-cli -p 6379 rpush RedisShards <head-node-ip>:6380

./python/ray/core/src/common/thirdparty/redis/src/redis-cli -p 6379 config set notify-keyspace-events Kl
./python/ray/core/src/common/thirdparty/redis/src/redis-cli -p 6380 config set notify-keyspace-events Kl

./python/ray/core/src/common/thirdparty/redis/src/redis-cli -p 6379 config set protected-mode no
./python/ray/core/src/common/thirdparty/redis/src/redis-cli -p 6380 config set protected-mode no

./python/ray/core/src/common/thirdparty/redis/src/redis-cli -p 6379 config set client-output-buffer-limit "normal 0 0 0 slave 268435456 67108864 60 pubsub 134217728 134217728 60"
./python/ray/core/src/common/thirdparty/redis/src/redis-cli -p 6380 config set client-output-buffer-limit "normal 0 0 0 slave 268435456 67108864 60 pubsub 134217728 134217728 60"
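
For reference, the byte counts in the client-output-buffer-limit strings above decode to round MiB figures (a quick arithmetic check):

```python
MiB = 2 ** 20

# slave class: 256 MiB hard limit, 64 MiB soft limit over 60 seconds
assert 268435456 == 256 * MiB
assert 67108864 == 64 * MiB

# pubsub class: 128 MiB hard and soft limits over 60 seconds
assert 134217728 == 128 * MiB
```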

@pcmoritz

A cool variant of this I've just successfully used:

Edit services.py to output the command that was supposed to be run instead of actually running it, and put a

import IPython
IPython.embed()

after it. Then you can start Ray normally, and when the interpreter hits the IPython.embed() call, start the program by hand (possibly in gdb). It works beautifully!

Here is the diff:

--- a/python/ray/local_scheduler/local_scheduler_services.py
+++ b/python/ray/local_scheduler/local_scheduler_services.py
@@ -117,6 +117,10 @@ def start_local_scheduler(plasma_store_name,
                                stdout=stdout_file, stderr=stderr_file)
         time.sleep(1.0)
     else:
-        pid = subprocess.Popen(command, stdout=stdout_file, stderr=stderr_file)
+        print(" ".join(command))
+        import IPython
+        IPython.embed()
+        pid = None
+        # pid = subprocess.Popen(command, stdout=stdout_file, stderr=stderr_file)
         time.sleep(0.1)
     return local_scheduler_name, pid
diff --git a/python/ray/services.py b/python/ray/services.py
index a37c16a..e6f3ede 100644
--- a/python/ray/services.py
+++ b/python/ray/services.py
@@ -555,7 +555,7 @@ def start_local_scheduler(redis_address,
         stderr_file=stderr_file,
         static_resource_list=[num_cpus, num_gpus],
         num_workers=num_workers)
-    if cleanup:
+    if cleanup and p:
         all_processes[PROCESS_TYPE_LOCAL_SCHEDULER].append(p)
     record_log_files_in_redis(redis_address, node_ip_address,
                               [stdout_file, stderr_file])

-- Philipp.
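
The same technique can be sketched generically: print the command instead of launching it, so it can be started by hand (possibly under gdb) in another terminal. The start_process helper below is hypothetical, a stand-in for the subprocess.Popen call in services code:

```python
import subprocess

def start_process(command, dry_run=True):
    """With dry_run=True, print the command for manual launching and return
    None; with dry_run=False, launch it normally.

    Hypothetical sketch of the print-instead-of-run technique; `command`
    is an argv list.
    """
    if dry_run:
        print(" ".join(command))
        return None
    return subprocess.Popen(command)
```

The caller then needs to tolerate a None process handle, which is what the `if cleanup and p:` change in the diff accomplishes.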

@danielsuo

I've been using tmux panes to debug. I wrote a script here that may be helpful to others. Originally I had this in tmuxinator, but wanted to reduce the number of dependencies.

It's a little finicky since it doesn't really respect dependencies, but it's less painful than starting all of the processes by hand.

@robertnishihara

@danielsuo that looks like a step in the right direction! I think the most useful thing would be the ability to start any of the Ray processes within tmux (and also within gdb), e.g., by integrating something like this into services.py.
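
One way that integration could look: build a tmux invocation around the original command before handing it to subprocess. A minimal sketch; the wrap_in_tmux helper is hypothetical, not part of services.py:

```python
import shlex

def wrap_in_tmux(command, window_name):
    """Return an argv list that opens `command` in a new tmux window of the
    current session.

    Hypothetical helper for illustration; `command` is an argv list.
    """
    # tmux new-window takes the command as a single shell string, so quote
    # each argument before joining.
    quoted = " ".join(shlex.quote(part) for part in command)
    return ["tmux", "new-window", "-n", window_name, quoted]

# e.g. subprocess.Popen(wrap_in_tmux(command, "local_scheduler"))
```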


danielsuo commented Aug 11, 2018 via email

@javierabosch2

A tutorial on this would be great!
