[core] microbenchmark regression #40606

Closed
rickyyx opened this issue Oct 23, 2023 · 3 comments
Labels: bug, core, P0, ray 2.8, release-blocker

Comments


rickyyx commented Oct 23, 2023

What happened + What you expected to happen

1_n_actor_calls_async is probably RPC-related, but it never recovers. There seems to be a ~10% regression.
image

multi_client_tasks_async is similar to 1_n_actor_calls_async: it never recovers from the dip.

image

1_n_async_actor_calls_async looks like a regression.

image

single_client_tasks_sync seems like a real regression, even though it is not significant.

image
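
For reference, a figure like the "~10% regression" above is a relative throughput change against a pre-dip baseline. Below is a minimal sketch of that arithmetic, assuming higher numbers are better. The function name and the sample values are hypothetical and are not read from the benchmark dashboard.

```python
def relative_change(baseline: float, current: float) -> float:
    """Return the fractional change of `current` relative to `baseline`.

    A negative result means throughput dropped, i.e. a regression
    when higher is better.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline


# Hypothetical throughput values (calls per second), for illustration only.
baseline_throughput = 8000.0
current_throughput = 7200.0

change = relative_change(baseline_throughput, current_throughput)
print(f"{change:+.1%}")
```

Comparing against a baseline averaged over several pre-dip runs, rather than a single run, would make the number less sensitive to run-to-run noise.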

Versions / Dependencies

NA

Reproduction script

NA

Issue Severity

None

rickyyx added the bug, release-blocker, P0, triage, core, and ray 2.8 labels Oct 23, 2023
rickyyx changed the title from "[core] 1_n_actor_calls_async regression" to "[core] microbenchmark regression" Oct 23, 2023
rickyyx changed the title from "[core] microbenchmark regression" to "[core] 1_n_actor_calls_async regression" Oct 23, 2023
rickyyx changed the title from "[core] 1_n_actor_calls_async regression" to "[core] microbenchmark regression" Oct 23, 2023
rickyyx self-assigned this Oct 23, 2023

rickyyx commented Oct 24, 2023

The drop happens between 64c25cf and 2913e9b (from 9.7 -> 9.11). Commits in that range:

2913e9b [train] Legacy interface cleanup (air.Checkpoint, LegacyExperimentAnalysis) (#39289)
fddde50 [train] remove _max_cpu_fraction_per_node (#39412)
7da798f [Doc] Add an ad for Ray Summit 2023 (#39404)
daf0dc1 polish observability (o11y) docs (#39069)
41cb273 [2.7] Cleanup all LightningTrainer Mentions in Ray Doc (#39406)
5867f32 [Data] Unpin pyarrow from test-requirements (#39290)
275fad8 jail //python/ray/data:test_streaming_executor (#39423)
fb4dd92 [train] update Train API references & annotations (#39294)
b6edccf [core] Fix performance regression in single_client_tasks_and_get_batch (#39362)
0f5b6f5 Update usage_pb2.py (#39425)
0e77916 [release byod] fix ml requirements file selection (#39353)
364df49 Update metrics.md (#38512)
c2d6f54 [Dashboard/Client] Update docs to Reflect Best Practices (#39403)
b1356d7 [Core] Merge Driver/Job's runtime environment when it conflicts (#39208)
07d6e67 [Core] Upgrade grpc from 1.46.6 to 1.57.0 (#39210)
449afc9 [Telemetry] Add Telemetry for Ray Train Utilities (#39363)
3e8a1dc [tune] Make Trainable.save/restore developer APIs (#39391)
3e7c8af [Serve][Doc] Add handle instruction to send multiplex request (#39274)
fecca87 [Doc] Add vSphere cluster configuration reference with examples (#39379)
1491937 [ci] marking serve:test_websockets as failing (#39375)
c324f38 [RLlib] Fix DDPG learning/release tests and make MARWIL CI test criterium more difficult. (#39386)
f33b8eb [train] Fix issues in migration of tune_cifar_torch_pbt_example (#39158)
8b7fcd7 [Core] Allow to rate limit the max # of workers concurrently started (#39253)
6523d94 skip gcs-ha-e2e-2 for now (#39359)
77412ab [civ2][serve/1] migrate other variances of serve tests to civ02 (#39045)

I guess the grpc upgrade (07d6e67, #39210) is the culprit here?
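
The narrowing step above (scanning the commit range for core changes, then for grpc) can be sketched as a small filter over `git log --oneline`-style lines. This is only an illustration: `filter_commits` is a hypothetical helper, and `COMMITS` holds a handful of lines copied from the list above, not the full range.

```python
import re

# A few commit lines copied from the range listed above (not the full list).
COMMITS = """\
b6edccf [core] Fix performance regression in single_client_tasks_and_get_batch (#39362)
b1356d7 [Core] Merge Driver/Job's runtime environment when it conflicts (#39208)
07d6e67 [Core] Upgrade grpc from 1.46.6 to 1.57.0 (#39210)
8b7fcd7 [Core] Allow to rate limit the max # of workers concurrently started (#39253)
fddde50 [train] remove _max_cpu_fraction_per_node (#39412)
7da798f [Doc] Add an ad for Ray Summit 2023 (#39404)
"""


def filter_commits(log_text, tag=None, keyword=None):
    """Return (sha, subject) pairs matching a leading [tag] and/or a keyword.

    Both checks are case-insensitive; a filter left as None is skipped.
    """
    results = []
    for line in log_text.splitlines():
        sha, _, subject = line.strip().partition(" ")
        if not subject:
            continue  # blank or malformed line
        match = re.match(r"\[([^\]]+)\]", subject)
        commit_tag = match.group(1).lower() if match else None
        if tag is not None and commit_tag != tag.lower():
            continue
        if keyword is not None and keyword.lower() not in subject.lower():
            continue
        results.append((sha, subject))
    return results


core_commits = filter_commits(COMMITS, tag="core")
grpc_commits = filter_commits(COMMITS, keyword="grpc")
for sha, subject in grpc_commits:
    print(sha, subject)
```

A subject-line filter only produces candidates; confirming the culprit would still need a benchmark run on either side of the suspect commit.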


rickyyx commented Oct 24, 2023

The increase (bump) happens between e446573 and 035224b (9.13 -> 9.14). Commits in that range:

035224b525 [Serve] Make sure Ray installs uvloop to be used as event-loop implementation for asyncio (#39336)
4ed4b52 [docs][train]Make Train example titles, heading more consistent (#39606)
77b4cb9 [core] add grpc opencensus plugin (#39082)
1225d52 [docs][clusters] Change title of RayService doc to Deploy Ray Serve Apps (#39641)
b00d029 [Doc] Fix Title of the Transformers GLUE example (#39605)
5cd72b9 [tune/docs] typo in suggestion.rst (#39262)
7414a8d Downgrade grpc from 1.57.0 to 1.50.2 (#39575)
8094bda [release][core][autoscaler] Autoscaler e2e release tests [1/x] (#39046)
1da1834 [docs] Update KubeRay Ingress Docs (#39635)
3e49f5d [data][tests-only] Fix nightly microbenchmark for read_images #39609
ed32450 [data] store bytes spilled/restored after plan execution (#39361)
8d80377 [RLlib] Fixed 'rollout_fragment_length' in pong-example by setting it to 'auto'. (#39552)

So the downgrade doesn't fix everything, I guess?
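
One way to make "doesn't fix everything" concrete is to compare the level after the downgrade against the level before the upgrade, within a noise tolerance. The sketch below assumes higher values are better. `has_recovered`, the 2% tolerance, and the sample numbers are all hypothetical and are not taken from the plots.

```python
from statistics import median


def has_recovered(before, after, tolerance=0.02):
    """Return True if the median of `after` is back within `tolerance`
    of the median of `before` (or above it).

    Medians are used so a single noisy run does not decide the outcome.
    """
    if not before or not after:
        raise ValueError("both samples must be non-empty")
    baseline = median(before)
    return median(after) >= baseline * (1 - tolerance)


# Hypothetical benchmark results, for illustration only.
pre_upgrade = [100.0, 101.0, 99.5]
post_downgrade = [95.0, 94.5, 95.5]

print(has_recovered(pre_upgrade, post_downgrade))
```

The tolerance should be chosen from the benchmark's observed run-to-run variance; a fixed 2% is just a placeholder here.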

hora-anyscale removed the triage label Oct 24, 2023

rickyyx commented Oct 25, 2023

No longer a regression after infra fixes: #40571 (comment)

rickyyx closed this as completed Oct 25, 2023