[Object Store] Push Manager: round for object manager client and FIFO for object #34269

Catch-Bull · 2023-04-11T15:03:57Z

Why are these changes needed?

This PR makes PushManager rotate among target nodes (round-robin) while pushing the objects destined for each target node in FIFO order. It has two goals:

- Reduce unnecessary data transfer. Once the arguments that an earlier task depends on have been pulled locally, the Raylet allocates resources for that task. Subsequent tasks that are still pulling their arguments may then be spilled to other nodes, so pushing objects in FIFO order reduces unnecessary data transmission.
- Reduce wasted loop iterations. When many objects need to be pushed, ScheduleRemainingPushes keeps trying to push objects that no longer have any chunks remaining, which makes its internal loop more expensive. The test.py script below illustrates this.
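
The scheduling order described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the C++ code in this PR: the class name RoundRobinFifoPusher, its methods, and the max_chunks budget are hypothetical names chosen for illustration. It shows that each loop iteration sends exactly one chunk, because finished objects leave their queue immediately.

```python
from collections import OrderedDict, deque


class RoundRobinFifoPusher:
    """Toy model (hypothetical): rotate across nodes, FIFO within each node."""

    def __init__(self):
        # node_id -> deque of [object_id, next_chunk, num_chunks].
        # The OrderedDict order is the rotation order of the nodes.
        self._queues = OrderedDict()

    def start_push(self, node_id, object_id, num_chunks):
        if num_chunks <= 0:
            raise ValueError("num_chunks must be positive")
        self._queues.setdefault(node_id, deque()).append([object_id, 0, num_chunks])

    def schedule(self, max_chunks):
        """Return up to max_chunks (node_id, object_id, chunk_index) tuples."""
        sent = []
        while self._queues and len(sent) < max_chunks:
            # Take the node at the front of the rotation.
            node_id, queue = next(iter(self._queues.items()))
            entry = queue[0]  # FIFO: always the oldest unfinished object
            sent.append((node_id, entry[0], entry[1]))
            entry[1] += 1
            if entry[1] == entry[2]:
                queue.popleft()  # a finished object is never visited again
            if queue:
                self._queues.move_to_end(node_id)  # rotate to the next node
            else:
                del self._queues[node_id]  # nothing left for this node
        return sent


pusher = RoundRobinFifoPusher()
pusher.start_push("A", "a1", 2)
pusher.start_push("A", "a2", 1)
pusher.start_push("B", "b1", 1)
print(pusher.schedule(max_chunks=10))
# [('A', 'a1', 0), ('B', 'b1', 0), ('A', 'a1', 1), ('A', 'a2', 0)]
```

Node A's second object (a2) is not started until a1 is fully sent, while node B still gets its turn between A's chunks.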

test.py:

import ray
import numpy as np
import time
import sys
from tqdm import tqdm
from ray.cluster_utils import Cluster

SYSTEM_CONFIG = {
    "object_spilling_threshold": 1.0,
    # disable unlimited
    "oom_grace_period_s": 3600,
    # force argument to be put into object store
    "max_direct_call_object_size": 512,
}

cluster = Cluster()

cluster.add_node(
    object_store_memory=1 * 1024 ** 3,
    _system_config=SYSTEM_CONFIG,
)
ray.init(address="auto")

for _ in range(int(sys.argv[1])):
    cluster.add_node(
        object_store_memory=1 * 1024 ** 3,
        resources={"remote_node": 8},
        num_cpus=8,
    )

def get_small_object():
    # 1KB
    return np.random.rand(1, 1024 // 8)

TASK_NUMBER = 100
ARG_NUMBER = 1000

@ray.remote(resources={"remote_node": 1})
def get_sum(ans, *args):
    for i in args:
        ans -= i.sum()
    return abs(ans) < 1e-6

all_args = []
all_sums = []
for _ in tqdm(range(TASK_NUMBER * ARG_NUMBER)):
    data = get_small_object()
    all_args.append(ray.put(data))
    all_sums.append(data.sum())

st1 = time.time()
tasks = []
for index in tqdm(range(TASK_NUMBER)):
    ans = sum(all_sums[index * ARG_NUMBER:(index+1)*ARG_NUMBER])
    args = all_args[index * ARG_NUMBER:(index+1)*ARG_NUMBER]
    tasks.append(get_sum.remote(ans, *args))
st2 = time.time()
print(ray.get(tasks))
st3 = time.time()
print("submit cost time:", st2 - st1)
print("total time:", st3 - st1)
ray.shutdown()

test cmd (the argument "1" is the number of remote nodes):

ray stop -f && python test.py 1

result:
- ray 2.4.0:
  - 1 node : cost 31.97s
  - 2 nodes : cost 30.77s
  - 3 nodes : cost 36.80s
- round_node_and_FIFO_object:
  - 1 node : cost 25.21s
  - 2 nodes : cost 26.95s
  - 3 nodes : cost 26.62s

We can also calculate the proportion of effective loop iterations, i.e. send_chunk_num_ / all_loop_num_. Result:
- ray-2.4.0:
  - driver node loop summary:
    - all loop number: 3.88261e+07
    - send chunk number: 100000
    - send_chunk_num_ / all_loop_num_: 0.00257559
- round_node_and_FIFO_object:
  - driver node loop summary:
    - all loop number: 101717
    - send chunk number: 100900
    - send_chunk_num_ / all_loop_num_: 0.991968
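
As a quick check of the ratios reported above, the following sketch recomputes send_chunk_num_ / all_loop_num_ from the two loop summaries. The dictionary layout and the helper name effective_ratio are illustrative assumptions; only the counter values come from the results above.

```python
# Counters copied from the loop summaries above.
runs = {
    "ray-2.4.0": {"all_loops": 3.88261e7, "sent_chunks": 100000},
    "round_node_and_FIFO_object": {"all_loops": 101717, "sent_chunks": 100900},
}


def effective_ratio(sent_chunks, all_loops):
    """Fraction of loop iterations that actually sent a chunk."""
    return sent_chunks / all_loops


ratios = {
    name: effective_ratio(c["sent_chunks"], c["all_loops"])
    for name, c in runs.items()
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.6f}")
```

With ray-2.4.0 fewer than 1% of the iterations send a chunk, while with this branch nearly every iteration does.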

Related issue number

Closes #34270

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Catch-Bull mentioned this pull request Apr 11, 2023

[Core][Object Store] Push Manager: round for object manager client and FIFO for object #34270

Open

Catch-Bull added 5 commits May 10, 2023 16:58

save

0b3c4e3

save

1e86200

save

b27fa26

save

af33618

save

87ef3dd

Catch-Bull force-pushed the round_node_and_FIFO_object branch from e1f1c5f to 87ef3dd Compare May 10, 2023 17:29

fix lint

bc4f6b3

Catch-Bull requested review from jjyao and scv119 May 10, 2023 17:39

Catch-Bull marked this pull request as ready for review May 10, 2023 17:39

Catch-Bull changed the title ~~[WIP]Push Manager: round for object manager client and FIFO for object~~ [Object Store] Push Manager: round for object manager client and FIFO for object May 10, 2023

save

322732d

Catch-Bull closed this Jun 2, 2023
