[GCS]Report job error to gcs instead of direct publishing #14617

Merged 2 commits on Mar 12, 2021

Conversation

WangTaoTheTonic
Contributor

@WangTaoTheTonic WangTaoTheTonic commented Mar 11, 2021

Why are these changes needed?

In the raylet, the Redis client's asynchronous context is used only for ReportJobError in the GCS client. If job errors are reported over RPC instead, each raylet needs one fewer Redis connection (at the cost of one extra hop).

Related issue number

Part of #14463; the first step in reducing Redis connections on the raylet side.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@WangTaoTheTonic WangTaoTheTonic changed the title [GCS]Report job error to gcs instead of direct publishing [WIP][GCS]Report job error to gcs instead of direct publishing Mar 11, 2021
@rkooo567 rkooo567 self-assigned this Mar 11, 2021
@WangTaoTheTonic WangTaoTheTonic changed the title [WIP][GCS]Report job error to gcs instead of direct publishing [GCS]Report job error to gcs instead of direct publishing Mar 12, 2021
@rkooo567 rkooo567 merged commit 3402b17 into ray-project:master Mar 12, 2021
ERROR_INFO_CHANNEL, job_id.Hex(), data_ptr->SerializeAsString(), callback);
RAY_LOG(DEBUG) << "Finished publishing job error, job id = " << job_id;
return status;
rpc::ReportJobErrorRequest request;
Contributor

I have one question. What does the job error look like? Who is the subscriber, and are subscribers still getting the data?

Contributor Author

We rarely publish job error data, and only the dashboard subscribes to it, to show what happened to a failed job.
Previously, publishers published job error messages to Redis directly and subscribers received them from there. After this change, publishers send the message to GCS first and GCS publishes it. The change is transparent to subscribers.
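
For readers following along, the before/after flow described above can be sketched in a few lines of C++. This is only an illustrative model, not Ray's actual code: `FakePubSub`, `FakeGcsServer`, `ReportJobErrorDirect`, `ReportJobErrorViaGcs`, and the channel string `"ERROR_INFO"` are all hypothetical names, and a plain function call stands in for the RPC. It is meant to show why subscribers should see no difference: in both paths the same payload ends up published on the same channel.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Placeholder channel name (assumption; the real value lives in Ray's source).
const char kErrorInfoChannel[] = "ERROR_INFO";

// Hypothetical in-memory stand-in for a Redis pub/sub connection.
class FakePubSub {
 public:
  using Callback =
      std::function<void(const std::string &id, const std::string &data)>;

  // Register a callback for every message published on `channel`.
  void Subscribe(const std::string &channel, Callback cb) {
    subscribers_[channel].push_back(std::move(cb));
  }

  // Deliver (id, data) to every subscriber of `channel`.
  void Publish(const std::string &channel, const std::string &id,
               const std::string &data) {
    auto it = subscribers_.find(channel);
    if (it == subscribers_.end()) return;
    for (const auto &cb : it->second) cb(id, data);
  }

 private:
  std::map<std::string, std::vector<Callback>> subscribers_;
};

// Hypothetical GCS-side handler: receives a report and publishes it.
class FakeGcsServer {
 public:
  explicit FakeGcsServer(FakePubSub &pubsub) : pubsub_(pubsub) {}

  void HandleReportJobError(const std::string &job_id_hex,
                            const std::string &serialized_error) {
    pubsub_.Publish(kErrorInfoChannel, job_id_hex, serialized_error);
  }

 private:
  FakePubSub &pubsub_;
};

// Old path (sketch): the reporter needs its own pub/sub connection.
void ReportJobErrorDirect(FakePubSub &pubsub, const std::string &job_id_hex,
                          const std::string &serialized_error) {
  pubsub.Publish(kErrorInfoChannel, job_id_hex, serialized_error);
}

// New path (sketch): the reporter only talks to GCS. The direct call below
// stands in for the RPC, which is the "one extra hop".
void ReportJobErrorViaGcs(FakeGcsServer &gcs, const std::string &job_id_hex,
                          const std::string &serialized_error) {
  gcs.HandleReportJobError(job_id_hex, serialized_error);
}
```

To try it, subscribe a callback on `kErrorInfoChannel`, call both report functions with the same arguments, and compare what the callback receives. Note that in the second path the reporter never touches `FakePubSub`, which mirrors the connection the raylet no longer has to hold.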

Member

One use case is that if you start a Python driver in a terminal, you can see the error messages in the driver output even though the errors are generated in another process or even on another node.

3 participants