Skip to content

nvme: improve shadow doorbell support #1582

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

jstarks
Copy link
Member

@jstarks jstarks commented Jun 22, 2025

Currently, shadow doorbells are implemented in a way where there are essentially two doorbell values: the shadow one and the "real" one. Observed changes to the shadow value are reflected in the real value but not vice versa. This causes problems when guests do not use, or only inconsistently use, the shadow doorbells for a given queue.

The spec is somewhat ambiguous on this front, but it's clear from guest behavior that the expectation is that the shadow doorbells and the real doorbells remain in perfect sync, so that the actual doorbell value can be read from the shadow doorbell. To support this, use the shadow doorbell memory as the single source of truth for the doorbell register value.

In addition to these changes, simplify the EventIdx handling to match the strategy we use for virtio and vmbus devices--update EventIdx (thus letting the guest know it needs to write the real doorbell on the next request) only when polling for work and the queue is currently empty. This is easier to reason about than the existing heuristic approach.

This does eliminate an optimization where we currently avoid updating EventIdx when there are IOs in flight (thus increasing IO latency but reducing the intercept rate), but a future change can easily add this behavior back by avoiding polling (and instead just probing) a queue when more than the desired number of IOs are in flight.

These changes are an improvement on their own, but they are also prep work for supporting multiple submission queues per completion queue, which is required by the NVMe spec and leveraged by Windows guests.

Currently, shadow doorbells are implemented in a way where there are
essentially two doorbell values: the shadow one and the "real" one.
Observed changes to the shadow value are reflected in the real value
but not vice versa. This causes problems when guests do not use, or
only inconsistently use, the shadow doorbells for a given queue.

The spec is somewhat ambiguous on this front, but it's clear from
guest behavior that the expectation is that the shadow doorbells and
the real doorbells remain in perfect sync, so that the actual doorbell
value can be read from the shadow doorbell. To support this, use the
shadow doorbell memory as the single source of truth for the doorbell
register value.

In addition to these changes, simplify the EventIdx handling to match
the strategy we use for virtio and vmbus devices--update EventIdx
(thus letting the guest know it needs to write the real doorbell on the
next request) only when polling for work and the queue is currently
empty. This is easier to reason about than the existing heuristic
approach.

This does eliminate an optimization where we currently avoid updating
EventIdx when there are IOs in flight (thus increasing IO latency but
reducing the intercept rate), but a future change can easily add this
behavior back by avoiding polling (and instead just probing) a queue
when more than the desired number of IOs are in flight.

These changes are an improvement on their own, but they are also prep
work for supporting multiple submission queues per completion queue,
which is required by the NVMe spec and leveraged by Windows guests.
@jstarks jstarks requested review from a team as code owners June 22, 2025 09:24
@jstarks jstarks requested a review from Copilot June 23, 2025 12:01
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR overhauls shadow doorbell handling by switching to a unified DoorbellMemory, ensures real and shadow doorbells stay in sync, and simplifies event index updates. It also prepares for multi-queue support by centralizing doorbell state and streamlining queue APIs.

  • Migrate from per-queue DoorbellRegister/ShadowDoorbell to a shared DoorbellMemory protected by RwLock.
  • Refactor I/O worker to use poll_fn for readiness and remove the previous advance_evt_idx heuristic.
  • Redesign admin handler’s doorbell buffer config, adding supports_shadow_doorbells and consolidating validation.
  • Update coordinator logic to route all doorbell writes through the shared memory.
  • Adjust tests to use the new DOORBELL_BUFFER_BASE/EVT_IDX_BUFFER_BASE constants directly.

Reviewed Changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
vm/devices/storage/nvme/src/workers/io.rs Swapped to DoorbellMemory, removed ShadowDoorbell, and updated queue polling/event index logic.
vm/devices/storage/nvme/src/workers/coordinator.rs Replaced Vec<DoorbellRegister> with shared DoorbellMemory and simplified doorbell writes.
vm/devices/storage/nvme/src/workers/admin.rs Refactored admin queue creation, doorbell-buffer config, and added supports_shadow_doorbells.
vm/devices/storage/nvme/src/tests/shadow_doorbell_tests.rs Updated shadow doorbell tests to read/write directly to buffer constants.
vm/devices/storage/nvme/Cargo.toml Removed unneeded event-listener workspace entry.

} else {
tracelimit::warn_ratelimited!(index, value, "unknown doorbell");
}
self.doorbells.read().write(index, value);
Copy link
Preview

Copilot AI Jun 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using read() to acquire a read lock when mutating DoorbellMemory may allow concurrent writes without proper synchronization. Consider using self.doorbells.write() to obtain a write lock before calling write(index, value).

Suggested change
self.doorbells.read().write(index, value);
self.doorbells.write().write(index, value);

Copilot uses AI. Check for mistakes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't tell why you need a RwLock here. This is the only instance I can find where the nvme emulator accesses this doorbell. read() is not protecting against concurrent access (which I presume is fine, since subsequent updates to the same doorbell index will be strictly more accurate). What is this protecting against?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's protecting against changes to the backing GuestMemory and offsets (i.e., installation of the shadow doorbell buffers), as well as changes to the wakers.

@mattkur mattkur requested review from eric135 and gurasinghMS June 23, 2025 21:39
}

pub fn write(&self, db_id: u16, value: u32) {
if db_id as usize >= self.wakers.len() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be an assert statement like in read()?

}
listener.await;
fn write_event_idx(&self, db_id: u16, val: u32) {
if let Err(err) = self.mem.write_plain(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check for db_id value?

Copy link
Contributor

@mattkur mattkur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making this change, John!

.inspect_err(|err| {
tracelimit::error_ratelimited!(
error = err as &dyn std::error::Error,
"failed to read doorbell memory"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add some identifier so that it's possible to debug which nvme device is seeing this error (here and in other places).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants