Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Upgrade grpc from 1.46.6 to 1.57.0 #39210

Merged
merged 41 commits into from
Sep 7, 2023
Merged

Conversation

jjyao
Copy link
Contributor

@jjyao jjyao commented Sep 1, 2023

Why are these changes needed?

1.46.6 has a memory leak and 1.57.0 fixes it.

grpc: 1.46.6 -> 1.57.0
protobuf: v19.4 -> v23.4
absl: 2022062.1 -> 20230125.3
boringssl: b9232f9e27e5668bc0414879dcdedb2a59ea75f2 -> 342e805bc1f5dfdd650e3f031686d6c939b095d9

For a serve workload, the memory usage before:

Screenshot 2023-08-31 at 1 42 26 PM
Screenshot 2023-08-31 at 2 00 59 PM
Screenshot 2023-08-31 at 2 13 46 PM

after:
Screenshot 2023-09-01 at 7 40 03 AM
Screenshot 2023-09-01 at 7 40 26 AM
Screenshot 2023-09-01 at 7 41 09 AM

NOTE
For now, we check in the auto generated proto source files for python since we need to use an old version of protoc to support python 3.7. Once we deprecate 3.7, we can undo this part.

Related issue number

Closes #38591

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@@ -54,7 +54,8 @@ class LocalModeTaskSubmitter : public TaskSubmitter {

absl::Mutex actor_contexts_mutex_;

std::unordered_map<std::string, ActorID> named_actors_ GUARDED_BY(named_actors_mutex_);
std::unordered_map<std::string, ActorID> named_actors_
ABSL_GUARDED_BY(named_actors_mutex_);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Breaking change from absl:

The legacy spellings of the thread annotation macros/functions (e.g. GUARDED_BY()) [have been removed by default](https://github.com/abseil/abseil-cpp/commit/6acb60c161f1203e6eca929b87f2041da7714bfe) in favor of the ABSL_ prefixed versions (e.g. ABSL_GUARDED_BY()) due to clashes with other libraries. The compatibility macro ABSL_LEGACY_THREAD_ANNOTATIONS can be defined on the compile command-line to temporarily restore these spellings, but this compatibility macro will be removed in the future.

@rkooo567 rkooo567 self-assigned this Sep 1, 2023
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao requested review from richardliaw, edoakes and a team as code owners September 1, 2023 16:53
python/setup.py Outdated
@@ -349,7 +349,7 @@ def get_packages(self):
"numpy >= 1.16; python_version < '3.9'",
"numpy >= 1.19.3; python_version >= '3.9'",
"packaging",
"protobuf >= 3.15.3, != 3.19.5",
"protobuf >= 3.20.0",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@edoakes
Copy link
Contributor

edoakes commented Sep 1, 2023

@jjyao can you post screenshots of the before and after memory usage for the serve workload?

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@rkooo567
Copy link
Contributor

rkooo567 commented Sep 2, 2023

Note: all the core tests are failing





ImportError: cannot import name 'builder' from 'google.protobuf.internal' (/opt/miniconda/lib/python3.8/site-packages/google/protobuf/internal/__init__.py)
 

<br class="Apple-interchange-newline">

@@ -8,6 +8,11 @@ build:windows --action_env=PATH
# For --compilation_mode=dbg, consider enabling checks in the standard library as well (below).
build --compilation_mode=opt
# Using C++ 17 on all platforms.
build:linux --host_cxxopt="-std=c++17"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed to compile upgraded absil.

strip_prefix = "protobuf-2c5fa078d8e86e5f4bd34e6f4c9ea9e8d7d4d44a",
urls = [
# https://github.com/protocolbuffers/protobuf/commits/v23.4
"https://github.com/protocolbuffers/protobuf/archive/2c5fa078d8e86e5f4bd34e6f4c9ea9e8d7d4d44a.tar.gz",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean we will use the lower version? (3.19 -> 2.3?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not 2.3 but 23 so 19 -> 23

@@ -279,11 +289,11 @@ def ray_deps_setup():
# https://github.com/grpc/grpc/blob/1ff1feaa83e071d87c07827b0a317ffac673794f/bazel/grpc_deps.bzl#L189
# Ensure this rule matches the rule used by grpc's bazel/grpc_deps.bzl
name = "boringssl",
sha256 = "534fa658bd845fd974b50b10f444d392dfd0d93768c4a51b61263fd37d851c40",
strip_prefix = "boringssl-b9232f9e27e5668bc0414879dcdedb2a59ea75f2",
sha256 = "0675a4f86ce5e959703425d6f9063eaadf6b61b7f3399e77a154c0e85bad46b1",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how did we choose this? is it for a specific version (add a comment for the version?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# Ensure this rule matches the rule used by grpc's bazel/grpc_deps.bzl

patches = [
"@com_github_grpc_grpc//third_party:protobuf.patch",
],
patch_args = ["-p1"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you comment why this patch is needed from code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# This is copied from grpc's bazel/grpc_deps.bzlc

@@ -8,8 +8,8 @@ def gen_java_deps():
"com.github.java-json-tools:json-schema-validator:2.2.14",
"com.google.code.gson:gson:2.9.1",
"com.google.guava:guava:32.0.1-jre",
"com.google.protobuf:protobuf-java:3.19.6",
"com.google.protobuf:protobuf-java-util:3.19.6",
"com.google.protobuf:protobuf-java:3.23.4",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe not necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed since we upgraded protoc.

@@ -30,6 +30,8 @@ diff --git bazel/cython_library.bzl bazel/cython_library.bzl
+ srcs = [stem + ".cpp"] + cc_kwargs.pop("srcs", []),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you tell me why we need these changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing patches, I don't know why we need it in the first place.

@@ -5,7 +5,7 @@ index 3bb5962..706b9b4 100644
@@ -25,7 +25,7 @@ namespace opencensus {
namespace exporters {
namespace stats {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are there logic changes in this file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No logic change, only GUARDED_BY -> ABSL_GUARDED_BY

@@ -1,12 +0,0 @@
diff --git third_party/py/python_configure.bzl third_party/py/python_configure.bzl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we delete a file? Seems like it is not used?

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
BUILD.bazel Outdated
"@io_opencensus_proto//opencensus/proto/metrics/v1:metrics.proto",
"@io_opencensus_proto//opencensus/proto/resource/v1:resource.proto",
] + glob([
"src/ray/protobuf/**/*.proto",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel not all proto is needed actually. For example, not all proto is compiled with py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I just keep it simple to compile everything, it should be no harm.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

codeowner stamp for lint change

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if tests pass from my end (please get @iycheng;s approval before merging it!). It seems like it has 13K lines of changes, is this expected?

@fishbone
Copy link
Contributor

fishbone commented Sep 6, 2023

@rkooo567 I think this is unfortunate :(

Let's clean this part with the highest priority after the summit @jjyao

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@aslonnie
Copy link
Collaborator

aslonnie commented Sep 6, 2023

It seems like it has 13K lines of changes, is this expected?

I think a lot of it is checking in generated pb files.

@jjyao
Copy link
Contributor Author

jjyao commented Sep 7, 2023

@rkooo567
Copy link
Contributor

rkooo567 commented Sep 7, 2023

@jjyao I think the core nightly test failure is fixed by 313a923. You can double check it by downloading a log file from gcs_server.out and check if the check failure is from the autoscaler manager

Can you also post the performance difference from the master here just in case to see if there's a big change?

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao
Copy link
Contributor Author

jjyao commented Sep 7, 2023

Microbenchmark:



single_client_get_calls_Plasma_Store = [10093.560227255326, 79.94432751246542]
--
  | single_client_put_calls_Plasma_Store = [5471.864933158648, 77.70833287530638]
  | multi_client_put_calls_Plasma_Store = [12589.685634208958, 301.70029150589943]
  | single_client_put_gigabytes = [18.872557806854374, 4.919710267611302]
  | single_client_tasks_and_get_batch = [9.987917616953759, 0.22729131561338817]
  | multi_client_put_gigabytes = [37.13926701774409, 2.090084721396276]
  | single_client_get_object_containing_10k_refs = [14.861631575625921, 0.2784344123799417]
  | single_client_wait_1k_refs = [5.186563025540161, 0.17960575727533448]
  | single_client_tasks_sync = [1153.9544609729708, 12.884716556349781]
  | single_client_tasks_async = [10847.68019096285, 140.62998721211127]
  | multi_client_tasks_async = [33906.66399830253, 1032.6713613141028]
  | 1_1_actor_calls_sync = [2276.6743562161837, 46.55844259513228]
  | 1_1_actor_calls_async = [7999.982926145655, 177.79202152826016]
  | 1_1_actor_calls_concurrent = [4914.962538438439, 205.359840538267]
  | 1_n_actor_calls_async = [10837.931244604233, 181.07739735803023]
  | n_n_actor_calls_async = [31542.914183828845, 544.3859776512819]
  | n_n_actor_calls_with_arg_async = [2985.308087160443, 20.223620426350703]
  | 1_1_async_actor_calls_sync = [1369.8267490983353, 23.629841247319142]
  | 1_1_async_actor_calls_async = [2906.732546793207, 106.32202389863562]
  | 1_1_async_actor_calls_with_args_async = [2029.1677566872995, 93.01617892866881]
  | 1_n_async_actor_calls_async = [9490.779932946629, 70.19529690530324]
  | n_n_async_actor_calls_async = [25918.686289834965, 312.93067513177465]
  | placement_group_create/removal = [895.525259837609, 7.231674604368287]
  | client__get_calls = [1202.9198053395726, 38.9750139739174]
  | client__put_calls = [857.4076319486344, 15.278249054607969]
  | client__put_gigabytes = [0.07557427646923043, 0.0015435192016046153]
  | client__tasks_and_put_batch = [11657.759680450217, 166.3039789077301]
  | client__1_1_actor_calls_sync = [546.5581484960015, 19.95411267369889]
  | client__1_1_actor_calls_async = [1060.8005790473765, 47.92145784771617]
  | client__1_1_actor_calls_concurrent = [949.0861020977036, 19.170808568330436]
  | client__tasks_and_get_batch = [1.0392002851143174, 0.020889671274458985]

Seems ok.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Copy link
Contributor

@rickyyx rickyyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stamping for test_state_api_log being flaky for known problems. #39410

@jjyao jjyao merged commit 07d6e67 into ray-project:master Sep 7, 2023
73 of 111 checks passed
@jjyao jjyao deleted the jjyao/upgrade branch September 7, 2023 23:03
jjyao added a commit that referenced this pull request Sep 7, 2023
1.46.6 has a memory leak and 1.57.0 fixes it.

grpc: 1.46.6 -> 1.57.0
protobuf: v19.4 -> v23.4
absl: 2022062.1 -> 20230125.3
boringssl: b9232f9e27e5668bc0414879dcdedb2a59ea75f2 -> 342e805bc1f5dfdd650e3f031686d6c939b095d9

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
harborn pushed a commit to harborn/ray that referenced this pull request Sep 8, 2023
1.46.6 has a memory leak and 1.57.0 fixes it.

grpc: 1.46.6 -> 1.57.0
protobuf: v19.4 -> v23.4
absl: 2022062.1 -> 20230125.3
boringssl: b9232f9e27e5668bc0414879dcdedb2a59ea75f2 -> 342e805bc1f5dfdd650e3f031686d6c939b095d9

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
GeneDer pushed a commit that referenced this pull request Sep 9, 2023
* [Core] Upgrade grpc from 1.46.6 to 1.57.0 (#39210)

1.46.6 has a memory leak and 1.57.0 fixes it.

grpc: 1.46.6 -> 1.57.0
protobuf: v19.4 -> v23.4
absl: 2022062.1 -> 20230125.3
boringssl: b9232f9e27e5668bc0414879dcdedb2a59ea75f2 -> 342e805bc1f5dfdd650e3f031686d6c939b095d9

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>

* fix

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* up

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

* fix

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

---------

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
jimthompson5802 pushed a commit to jimthompson5802/ray that referenced this pull request Sep 12, 2023
1.46.6 has a memory leak and 1.57.0 fixes it.

grpc: 1.46.6 -> 1.57.0
protobuf: v19.4 -> v23.4
absl: 2022062.1 -> 20230125.3
boringssl: b9232f9e27e5668bc0414879dcdedb2a59ea75f2 -> 342e805bc1f5dfdd650e3f031686d6c939b095d9

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
1.46.6 has a memory leak and 1.57.0 fixes it.

grpc: 1.46.6 -> 1.57.0
protobuf: v19.4 -> v23.4
absl: 2022062.1 -> 20230125.3
boringssl: b9232f9e27e5668bc0414879dcdedb2a59ea75f2 -> 342e805bc1f5dfdd650e3f031686d6c939b095d9

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Signed-off-by: Victor <vctr.y.m@example.com>
jjyao added a commit that referenced this pull request Oct 26, 2023
Revert back to the old way (before #39210) of auto-generating protobuf and grpc code so we don't need to manually generate and check them in. Also decouple the protobuf version we use in c++ and the protobuf version we use to auto generate python and java code (the protobuf version to auto generate python code affects the python protobuf library version that Ray can support) so that they can be upgraded independently.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[core] Possible memory leak in async actor implementation
6 participants