Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Observability] Extend sys_status and sys_invocation_state with endpoint_id #913

Merged
merged 1 commit into from
Dec 1, 2023

Conversation

AhmedSoliman
Copy link
Contributor

@AhmedSoliman AhmedSoliman commented Nov 28, 2023

[Observability] Extend sys_status and sys_invocation_state with endpoint_id

Summary:
This introduces two new columns in sys_invocation_state:

  • next_retry_at Points to the next retry timestamp if a retry is scheduled for an invocation.
  • last_attempt_endpoint_id The endpoint ID that was used for the last invocation attempt.

Also, a new column was added in sys_status:

  • pinned_endpoint_id Indicates the endpoint ID that this invocation is pinned to. If set, the invocation must continue to use this endpoint id even if other endpoints were installed that claim to host higher service revisions.

Stack created with Sapling. Best reviewed with ReviewStack.

Copy link

github-actions bot commented Nov 28, 2023

Test Results

  98 files  +  98    98 suites  +98   10m 34s ⏱️ + 10m 34s
  87 tests +  87    87 ✔️ +  87  0 💤 ±0  0 ±0 
220 runs  +220  220 ✔️ +220  0 💤 ±0  0 ±0 

Results for commit 41fefb4. ± Comparison against base commit 2519b5d.

♻️ This comment has been updated with latest results.

Copy link
Contributor

@igalshilman igalshilman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @AhmedSoliman, I've looked at the datafusion side of things, looks 💯 !
If @slinkydeveloper can give a quick 👀 into the invoker and approves then lets merge.

@@ -25,6 +25,8 @@ pub struct InvocationStatusReportInner {
pub start_count: usize,
pub last_start_at: SystemTime,
pub last_retry_attempt_failure: Option<InvocationErrorReport>,
pub next_retry_at: Option<SystemTime>,
pub last_attempt_endpoint_id: Option<EndpointId>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why you would add the last_attempt_endpoint_id both in the sys_invocation_state and in sys_status. For this field, the source of truth is sys_status anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok i see. You can track the case where you tried to use the endpooint_id but has not been committed yet. makes sense. Perhaps let's add a comment about that somewhere in the schema of the sys_invocation_state table, to avoid me making the same comment in 3 months :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. I'll add a comment on the schema.

Comment on lines +510 to +519
self.status_store.on_endpoint_chosen(
&partition,
&full_invocation_id,
endpoint_id.clone(),
);
// If we think this selected endpoint has been freshly picked, otherwise
// we assume that we have stored it previously.
if has_changed {
ism.notify_chosen_endpoint(endpoint_id);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a unit test for this has_changed behavior?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So next time i stumble on it i can figure it out from the tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good idea but I'm not sure if this is going to be an easy to slot in existing tests. I'm a bit swamped and it'll be hard now to add a new test only for this case tbh.

…int_id

Summary:
This introduces two new columns in `sys_invocation_state`:
  - `next_retry_at` Points to the next retry timestamp if a retry is scheduled for an invocation.
  - `last_attempt_endpoint_id` The endpoint ID that was used for the last invocation attempt.

Also, a new column was added in `sys_status`:
  - `pinned_endpoint_id` Indicates the endpoint ID that this invocation is _pinned_ to. If set, the invocation must continue to use this endpoint id even if other endpoints were installed that claim to host higher service revisions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants