[Object Store] Push Manager: round for object manager client and FIFO for object #34269

Closed

Contributor

@Catch-Bull Catch-Bull commented Apr 11, 2023

Why are these changes needed?

This PR makes PushManager rotate (round-robin) among target nodes while pushing the objects destined for each target node in FIFO order. It has two goals:

  1. Once the arguments that an earlier task depends on have been pulled locally, the Raylet allocates resources for that task. Later tasks that are still pulling their arguments may then be spilled out, so pushing objects in FIFO order reduces unnecessary data transfer.

  2. When many objects need to be pushed, ScheduleRemainingPushes repeatedly tries to push objects that no longer have any chunks remaining, which makes its inner loop much more expensive. The following example illustrates this.

  • test.py:
import ray
import numpy as np
import time
import sys
from tqdm import tqdm
from ray.cluster_utils import Cluster

SYSTEM_CONFIG = {
    "object_spilling_threshold": 1.0,
    # disable unlimited
    "oom_grace_period_s": 3600,
    # force argument to be put into object store
    "max_direct_call_object_size": 512,
}

cluster = Cluster()

cluster.add_node(
    object_store_memory=1 * 1024 ** 3,
    _system_config=SYSTEM_CONFIG,
)
ray.init(address="auto")

for _ in range(int(sys.argv[1])):
    cluster.add_node(
        object_store_memory=1 * 1024 ** 3,
        resources={"remote_node": 8},
        num_cpus=8,
    )

def get_small_object():
    # 128 float64 values = 1 KiB of data, above max_direct_call_object_size
    return np.random.rand(1, 1024 // 8)

TASK_NUMBER = 100
ARG_NUMBER = 1000

@ray.remote(resources={"remote_node": 1})
def get_sum(ans, *args):
    for i in args:
        ans -= i.sum()
    return abs(ans) < 1e-6

all_args = []
all_sums = []
for _ in tqdm(range(TASK_NUMBER * ARG_NUMBER)):
    data = get_small_object()
    all_args.append(ray.put(data))
    all_sums.append(data.sum())

st1 = time.time()
tasks = []
for index in tqdm(range(TASK_NUMBER)):
    ans = sum(all_sums[index * ARG_NUMBER:(index+1)*ARG_NUMBER])
    args = all_args[index * ARG_NUMBER:(index+1)*ARG_NUMBER]
    tasks.append(get_sum.remote(ans, *args))
st2 = time.time()
print(ray.get(tasks))
st3 = time.time()
print("submit cost time:", st2 - st1)
print("total time:", st3 - st1)
ray.shutdown()
  • test cmd (the argument "1" is the number of remote nodes):
ray stop -f && python test.py 1
  • result:
    • ray 2.4.0:
      • 1 node : cost 31.97s
      • 2 nodes : cost 30.77s
      • 3 nodes : cost 36.80s
    • round_node_and_FIFO_object:
      • 1 node : cost 25.21s
      • 2 nodes : cost 26.95s
      • 3 nodes : cost 26.62s
  • Fraction of effective loop iterations in ScheduleRemainingPushes, i.e. iterations that actually send a chunk (send_chunk_num_ / all_loop_num_):
    • ray-2.4.0:
      • driver node loop summary:
      • all loop number: 3.88261e+07
      • send chunk number: 100000
      • send_chunk_num_ / all_loop_num_: 0.00257559
    • round_node_and_FIFO_object:
      • driver node loop summary:
      • all loop number: 101717
      • send chunk number: 100900
      • send_chunk_num_ / all_loop_num_: 0.991968
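
The scheduling order described in this PR (round-robin across target nodes, FIFO across the objects for each node) can be sketched as a small Python model. This is only an illustration under stated assumptions, not the C++ code in this PR: the function name schedule_pushes and the input shape are hypothetical, every object is assumed to have at least one chunk, and the limit on chunks in flight is left out. In this model each loop iteration sends exactly one chunk and a finished object is never visited again, which is the property behind the near-1.0 ratio reported above.

```python
from collections import deque


def schedule_pushes(queues):
    """Yield (node_id, object_id, chunk_index) in send order.

    Hypothetical model: `queues` maps node_id -> list of
    (object_id, num_chunks) in arrival order, with num_chunks >= 1.
    """
    # One FIFO per target node; each entry is [object_id, next_chunk, num_chunks].
    fifos = {
        node: deque([obj, 0, num_chunks] for obj, num_chunks in objs)
        for node, objs in queues.items()
    }
    # Ring of nodes that still have something to send.
    ring = deque(node for node, fifo in fifos.items() if fifo)
    while ring:
        node = ring.popleft()
        head = fifos[node][0]  # only the oldest object of this node is served
        yield node, head[0], head[1]
        head[1] += 1
        if head[1] == head[2]:
            fifos[node].popleft()  # object finished; it is never visited again
        if fifos[node]:
            ring.append(node)  # node goes to the back of the ring


order = list(schedule_pushes({"A": [("a1", 2), ("a2", 1)], "B": [("b1", 1)]}))
print(order)
# [('A', 'a1', 0), ('B', 'b1', 0), ('A', 'a1', 1), ('A', 'a2', 0)]
```

Node A's second object a2 is not started until a1 is fully sent, while node B still gets its turn in between.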

Related issue number

Closes #34270

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Catch-Bull Catch-Bull requested review from jjyao and scv119 May 10, 2023 17:39
@Catch-Bull Catch-Bull marked this pull request as ready for review May 10, 2023 17:39
@Catch-Bull Catch-Bull changed the title [WIP]Push Manager: round for object manager client and FIFO for object [Object Store] Push Manager: round for object manager client and FIFO for object May 10, 2023
@Catch-Bull Catch-Bull closed this Jun 2, 2023
Successfully merging this pull request may close these issues.

[Core][Object Store] Push Manager: round for object manager client and FIFO for object