Skip to content

improvement(staging-mode): Clarify Squash/Rollback behavior for ContainerMessageType.GC #24774

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 6, 2025

Conversation

markfields
Copy link
Member

@markfields markfields commented Jun 6, 2025

Description

GC ops can be made while in Staging Mode, and we need to be very clear with how they're handled.

  • Squashing: They don't need to be squashed, no personal data to protect
  • Rollback: GC ops submitted while in Staging Mode will be dropped if staged changes are discarded. (these are all "Tombstone Loaded" ops, which prevent data loss in case of invalid GC data)
    • We will log a new error event GC_OpDiscarded to highlight this case
    • We are also adding a prop inStagingMode to all ContainerRuntime logs, including GC_Tombstone_DataStore_Requested etc, to further aid investigations
    • Note: In our largest 1P customer depending on GC, there are no outstanding root causes leading to Tombstone Load events (where we depend on the GC op to avoid data loss). So we should not see this at all until something else regresses in GC, at which point the new info logged will come to bear in the investigation.

Also refactor the Resubmit code so it's clear that being able to support squash is an exceptional case - most op types don't need to deal with it.

Reviewer Guidance

This was pulled out of #24729 since it's really a separate change.

@Copilot Copilot AI review requested due to automatic review settings June 6, 2025 17:37
@markfields markfields requested review from Abe27342 and removed request for Copilot June 6, 2025 17:37
@github-actions github-actions bot added base: main PRs targeted against main branch area: runtime Runtime related issues labels Jun 6, 2025
@markfields markfields requested a review from agarwal-navin June 6, 2025 17:37
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR clarifies how GC operations behave in Staging Mode and refactors the resubmit logic to treat squash support as an exceptional case.

  • Enhance IPendingMessage with clearer metadata fields for op submission and sequencing.
  • Add explanatory comments around GC tombstone telemetry and staging behavior.
  • Refactor ContainerRuntime to introduce isSummarizerClient, wire it through lifecycle paths, and separate squash vs. non-squash resubmit flows.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
packages/runtime/container-runtime/src/pendingStateManager.ts Added localOpMetadata, opMetadata, and sequenceNumber fields with comments.
packages/runtime/container-runtime/src/gc/garbageCollection.ts Expanded comment to list the telemetry events preceding tombstone load errors.
packages/runtime/container-runtime/src/containerRuntime.ts Introduced isSummarizerClient; updated monitoring context; refactored reSubmitBatch, reSubmitWithSquashing, and reSubmit to distinguish squash paths and GC/DocumentSchemaChange handling.
Comments suppressed due to low confidence (4)

packages/runtime/container-runtime/src/pendingStateManager.ts:57

  • The localOpMetadata field is typed as unknown. Consider defining and using a more specific interface to improve type safety and documentation of expected metadata shape.
localOpMetadata: unknown;

packages/runtime/container-runtime/src/gc/garbageCollection.ts:1048

  • [nitpick] The comment starts with lowercase "i.e."; for consistency and readability, consider capitalizing the first letter or rephrasing to begin a new sentence.
// i.e. this will be preceded by one of these telemetry events;

packages/runtime/container-runtime/src/containerRuntime.ts:1189

  • Public or private properties injected in multiple lifecycle paths benefit from a doc comment explaining their purpose, especially isSummarizerClient since it alters teardown and resubmit behavior.
private readonly isSummarizerClient: boolean;

packages/runtime/container-runtime/src/containerRuntime.ts:4680

  • The new squash vs. non-squash code paths (reSubmitWithSquashing vs. reSubmit) introduce conditional behavior for GC and schema ops. Ensure unit tests cover each branch, including GC being forwarded or dropped under squash.
const resubmitFn = squash

@markfields markfields merged commit 4e504a1 into microsoft:main Jun 6, 2025
33 checks passed
@markfields markfields deleted the sm/squash-in-cr branch June 6, 2025 19:19
* for correlation to detect container forking.
*/
private reSubmitBatch(
batch: PendingMessageResubmitData[],
{ batchId, staged, squash }: PendingBatchResubmitMetadata,
): void {
assert(
this._summarizer === undefined,
0x8f2 /* Summarizer never reconnects so should never resubmit */,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come you added a hex code here? Is this copilot generated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a move from LHS line 4687

// Just drop it, but log an error, this is not expected and not ideal, but not critical failure either.
// Currently the only expected type here is TombstoneLoaded, which will have been preceded by one of these events as well:
// GC_Tombstone_DataStore_Requested, GC_Tombstone_SubDataStore_Requested, GC_Tombstone_Blob_Requested
this.mc.logger.sendErrorEvent({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to log more details here such as the id of the object, pkg, etc? That will help debugging if and when we see this log. As of now, debugging this will be very hard.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'll be ok because there will be a corresponding GC_Tombstone_DataStore_Requested (or other object) event with more info

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: runtime Runtime related issues base: main PRs targeted against main branch Feature_StagingMode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants