Metrics export #2

rkooo567 · 2020-03-24T20:20:59Z

Why are these changes needed?

Related issue number

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://ray.readthedocs.io/en/latest/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested (please justify below)

…roject#22317)

Improve observability for general objects and lineage reconstruction by adding a "Status" field to `ray memory`. The value of the field can be:

```
// The task is waiting for its dependencies to be created.
WAITING_FOR_DEPENDENCIES = 1;
// All dependencies have been created and the task is scheduled to execute.
SCHEDULED = 2;
// The task finished successfully.
FINISHED = 3;
```

In addition, tasks that failed or that needed to be re-executed due to lineage reconstruction will have a field listing the attempt number. Example output:

```
IP Address | PID | Type | Call Site | Status | Size | Reference Type | Object Ref
192.168.4.22 | 279475 | Driver | (task call) ... | Attempt #2: FINISHED | 10000254.0 B | LOCAL_REFERENCE | c2668a65bda616c1ffffffffffffffffffffffff0100000001000000
```
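As a rough illustration of how the Status column could be rendered, here is a minimal Python sketch. The enum values are the three quoted in the commit message; `TaskStatus` and `format_status` are hypothetical names made up for this example and are not taken from Ray's code.

```python
from enum import IntEnum
from typing import Optional


class TaskStatus(IntEnum):
    """The three status values quoted in the commit message."""

    WAITING_FOR_DEPENDENCIES = 1
    SCHEDULED = 2
    FINISHED = 3


def format_status(status: TaskStatus, attempt: Optional[int] = None) -> str:
    """Render the Status column (hypothetical helper, not Ray's API).

    When an attempt number is supplied (a failed or re-executed task),
    it is prefixed in the same shape as the example output.
    """
    if attempt is None:
        return status.name
    return f"Attempt #{attempt}: {status.name}"
```

Calling `format_status(TaskStatus.FINISHED, attempt=2)` yields the same string as the Status cell in the example row.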

…ray-project#23821)

This PR refactors `LazyBlockList` in service of out-of-band serialization (see [mono-PR](ray-project#22616)) and is a precursor to an execution plan refactor (PR #2) and adding the actual out-of-band serialization APIs (PR #3). The following is included in this refactor:

1. `ReadTask`s are now a first-class concept, replacing calls;
2. read stage progress tracking is consolidated into `LazyBlockList._get_blocks_with_metadata()` and more of the read task complexity, e.g. the read remote function, was pushed into `LazyBlockList` to make `ray.data.read_datasource()` simpler;
3. we are a bit smarter with how we progressively launch tasks and fetch and cache metadata, including fetching the metadata for read tasks in `.iter_blocks_with_metadata()` instead of relying on the pre-read task metadata (which will be less accurate), and we also fix some small bugs in the lazy ramp-up around progressive metadata fetching.

(1) is the most important item for supporting out-of-band serialization and fundamentally changes the `LazyBlockList` data model. This is required since we need to be able to reference the underlying read tasks when rewriting read stages during optimization and when serializing the lineage of the Dataset. See the [mono-PR](ray-project#22616) for more context.

Other changes:

1. Changed stats actor to a global named actor singleton in order to obviate the need for serializing the actor handle with the Dataset stats; without this, we were encountering serialization failures.
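To make the "read tasks as a first-class concept" idea more concrete, the following is a toy Python sketch, not Ray's implementation. The classes here only borrow the names `ReadTask`, `LazyBlockList` and `iter_blocks_with_metadata` from the description; their fields, the `BlockMetadata` shape and the `num_computed` helper are assumptions made for illustration, and everything runs locally rather than as remote tasks.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional, Tuple


@dataclass
class BlockMetadata:
    # None means the row count is not known before the read runs.
    num_rows: Optional[int]


@dataclass
class ReadTask:
    """Hypothetical stand-in: a deferred read plus its pre-read metadata."""

    read_fn: Callable[[], List[int]]
    estimated_metadata: BlockMetadata


class LazyBlockList:
    """Toy version that stores the read tasks and runs each at most once."""

    def __init__(self, tasks: List[ReadTask]) -> None:
        # Keeping the tasks themselves (not just their results) is what
        # lets later code inspect or rewrite them.
        self.tasks = tasks
        self._cache: List[Optional[Tuple[List[int], BlockMetadata]]] = [
            None
        ] * len(tasks)

    def num_computed(self) -> int:
        """Number of read tasks that have been executed so far."""
        return sum(entry is not None for entry in self._cache)

    def iter_blocks_with_metadata(
        self,
    ) -> Iterator[Tuple[List[int], BlockMetadata]]:
        for i, task in enumerate(self.tasks):
            if self._cache[i] is None:
                block = task.read_fn()
                # Metadata comes from the block actually read, not from
                # the (possibly inaccurate) pre-read estimate.
                self._cache[i] = (block, BlockMetadata(num_rows=len(block)))
            yield self._cache[i]
```

Because the iterator is lazy, consuming only the first element executes only the first read task; the remaining tasks stay unevaluated but are still available on `tasks`.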

We encountered a SIGSEGV when running the Python test `python/ray/tests/test_failure_2.py::test_list_named_actors_timeout`. The stack is:

```
#0 0x00007fffed30f393 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /lib64/libstdc++.so.6
#1 0x00007fffee707649 in ray::RayLog::GetLoggerName() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#2 0x00007fffee70aa90 in ray::SpdLogMessage::Flush() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#3 0x00007fffee70af28 in ray::RayLog::~RayLog() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#4 0x00007fffee2b570d in ray::asio::testing::(anonymous namespace)::DelayManager::Init() [clone .constprop.0] () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#5 0x00007fffedd0d95a in _GLOBAL__sub_I_asio_chaos.cc () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#6 0x00007ffff7fe282a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#7 0x00007ffff7fe2931 in _dl_init () from /lib64/ld-linux-x86-64.so.2
#8 0x00007ffff7fe674c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#9 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#10 0x00007ffff7fe5ffe in _dl_open () from /lib64/ld-linux-x86-64.so.2
#11 0x00007ffff7d5f39c in dlopen_doit () from /lib64/libdl.so.2
#12 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#13 0x00007ffff7b82f13 in _dl_catch_error () from /lib64/libc.so.6
#14 0x00007ffff7d5fb09 in _dlerror_run () from /lib64/libdl.so.2
#15 0x00007ffff7d5f42a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#16 0x00007fffef04d330 in py_dl_open (self=<optimized out>, args=<optimized out>) at /tmp/python-build.20220507135524.257789/Python-3.7.11/Modules/_ctypes/callproc.c:1369
```

The root cause is that when loading `_raylet.so`, `static DelayManager _delay_manager` is initialized and `RAY_LOG(ERROR) << "RAY_testing_asio_delay_us is set to " << delay_env;` is executed. However, the static variables declared in `logging.cc` are not initialized yet (in this case, `std::string RayLog::logger_name_ = "ray_log_sink"`). It's better not to rely on the initialization order of static variables in different compilation units because that order is not guaranteed.

I propose to change all `RAY_LOG`s to `std::cerr` in `DelayManager::Init()`.

The crash happens in Ant's internal codebase. Not sure why this test case passes in the community version though.

BTW, I've tried different approaches:

1. Using a static local variable in `get_delay_us` and removing the global variable. This doesn't work because `init()` needs to access the variable as well.
2. Defining the global variable as type `std::unique_ptr<DelayManager>` and initializing it in `get_delay_us`. This works but it requires a lock to be thread-safe.
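The second approach (create the manager on first use, guarded by a lock) can be sketched as follows. This is a Python analogue written only to show the pattern; the real code is C++, and Python has no cross-translation-unit static initialization order problem. The `DelayManager` class body and `get_delay_manager` are hypothetical, and the environment variable is treated as an opaque string because its format is not described here.

```python
import os
import sys
import threading
from typing import Optional


class DelayManager:
    """Hypothetical Python stand-in for the C++ DelayManager."""

    def __init__(self) -> None:
        # Assumption: the value is kept as an opaque string; parsing it
        # is out of scope for this sketch.
        self.delay_env = os.environ.get("RAY_testing_asio_delay_us", "")
        if self.delay_env:
            # Write to stderr directly instead of going through a logging
            # layer that may not be set up yet, mirroring the proposal to
            # use std::cerr in DelayManager::Init().
            print(
                f"RAY_testing_asio_delay_us is set to {self.delay_env}",
                file=sys.stderr,
            )


_manager: Optional[DelayManager] = None
_manager_lock = threading.Lock()


def get_delay_manager() -> DelayManager:
    """Create the manager on first use; the lock makes this thread-safe."""
    global _manager
    with _manager_lock:
        if _manager is None:
            _manager = DelayManager()
        return _manager
```

The cost noted in the commit message is visible here: every call takes the lock, which is the price of deferring construction until first use.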

Why are these changes needed?

Right now the theory is as follows. The pubsub io service is created and run inside the GcsServer. That means that if the pubsub io service is accessed after the GcsServer is GC'ed, it will segfault.

Right now, upon teardown, when we call rpc::DrainAndResetExecutor, this will recreate the Executor thread pool.

Upon teardown, if the following sequence happens, it will segfault: DrainAndResetExecutor -> GcsServer's internal pubsub posts new SendReply to the newly created thread pool -> GcsServer.reset -> pubsub io service GC'ed -> SendReply invoked from the newly created thread pool.

NOTE: the segfault is from the pubsub service if you see the failure

#2 0x7f92034d9129 in ray::rpc::ServerCallImpl<ray::rpc::InternalPubSubGcsServiceHandler, ray::rpc::GcsSubscriberPollRequest, ray::rpc::GcsSubscriberPollReply>::HandleRequestImpl()::'lambda'(ray::Status, std::__1::function<void ()>, std::__1::function<void ()>)::operator()(ray::Status, std::__1::function<void ()>, std::__1::function<void ()>) const::'lambda'()::operator()() const /proc/self/cwd/bazel-out/k8-opt/bin/_virtual_includes/grpc_common_lib/ray/rpc/server_call.h:212:48

As a fix, I only drain the thread pool, and then reset it after all operations are fully cleaned up (only from tests). I think there's no need to reset for regular process termination such as raylet, GCS, and core workers.

Related issue number

Closes ray-project#34344

Signed-off-by: SangBin Cho <rkooo567@gmail.com>

rkooo567 and others added 30 commits February 9, 2020 17:47

Ingest server can can data as is

8ddc342

PoC done.

458a8e3

Clean up

dcf020d

Final clean up before PR

6a4bc20

Fixed code based on code review

86bde5d

Fixed based on code review

aaf3cb1

Merge branch 'master' into hosted_dashboard

082da95

bug fix

7d25376

Prototype exporter completed

b31e886

Implemented end to end flow

bb2da83

Clean up

2bfc766

Fix bug

b480609

Formatting.

c99112e

Grafana integration done

5993327

Added a todo comment to implement frontend ui

9c48eff

Merge pull request #2 from anyscale/hd-e2e-impl

01d32c2

[Hosted Dashboard] End to end flow

Formatting + removed unnecessary files

c6ddc24

Merge pull request #3 from anyscale/grafana_integration

57c5e7f

[Hosted Dashboard] Running Grafana instances with pre-configured dashboards.

S3 writer

834a807

formatting

87319a1

Merge pull request #4 from anyscale/persist_data

a73b950

[Hosted Dashboard] Persist data to S3

Merge branch 'master' into hosted_dashboard

1e79e78

Formatting

0c25b4e

Hosted Dashboard Initiation Flow

ece32f3

Grafana integration

74db3b9

Add iFrame file

28bc3e1

Update grafana to {node,worker}x{cpu, memory}

08527d4

Merge pull request #6 from anyscale/hdash-grafana

2ec86df

Grafana Integration

Merge pull request #5 from anyscale/hdash-initiation-flow

3cf7e23

Hosted Dashboard Initiation Flow

Merge branch 'master' into hosted_dashboard

9734be0

rkooo567 added 17 commits March 19, 2020 10:02

Basic refactoring.

1bb527f

Delete undeleted frontend code.

17be980

Refactoring with api changes.

a715478

Use click for a main function.

d607144

Refactoring interface location.

df053a2

Fixed merge conflict.

648ae48

Last refactoring.

1021b94

Formatting.

ad959e4

Minor fix.

6a287ba

Addressed code review.

c1f3bf5

Add tests and restructure code to be unit testable.

680e0a0

Addressed code review.

c8a071a

Formatting.

91a80ee

Merge branch 'master' into refactoring

46bd23c

Formatting.

15acc00

Minor fix.

a51ef00

Formatting.

c1fdeeb

rkooo567 merged commit dd78c70 into master2 Mar 24, 2020

rkooo567 pushed a commit that referenced this pull request Dec 2, 2020

[RLlib] Attention Net prep PR #2: Smaller cleanups. (ray-project#12449)

3ad9365

rkooo567 pushed a commit that referenced this pull request Jan 4, 2021

[RLlib] JAXPolicy prep PR #2 (move get_activation_fn (backward-compat…

8726521

…ibly), minor fixes and preparations). (ray-project#13091)

rkooo567 pushed a commit that referenced this pull request Oct 8, 2021

[Object manager] don't abort entire pull request on race condition fr…

b066627

…om concurrent chunk receive - #2 (ray-project#19216)

ericl pushed a commit that referenced this pull request Jan 10, 2022

[RLlib] [MultiAgentEnv Refactor #2] Change space types for BaseEnvs…

39f8072

… and `MultiAgentEnvs` (ray-project#21063)

rkooo567 pushed a commit that referenced this pull request Jan 30, 2022

[RLlib] Preparatory PR for multi-agent, multi-GPU learning agent (alp…

ee41800

…ha-star style) #2. (ray-project#21649)

rkooo567 pushed a commit that referenced this pull request Aug 10, 2022

[serve] Integrate and Document Bring-Your-Own Gradio Applications (#2… (

419ba8e

ray-project#27560)

rkooo567 pushed a commit that referenced this pull request Feb 23, 2023

[data] Streaming executor fixes #2 (ray-project#32759)

4c6d75b

rkooo567 pushed a commit that referenced this pull request Dec 4, 2023

[RLlib] New ConnectorV2 API #2: SingleAgentEpisode enhancements. (ray…

d6d2dee

…-project#41075)
