-
Notifications
You must be signed in to change notification settings - Fork 132
nvme: improve shadow doorbell support #1582
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Currently, shadow doorbells are implemented in a way where there are essentially two doorbell values: the shadow one and the "real" one. Observed changes to the shadow value are reflected in the real value but not vice versa. This causes problems when guests do not use, or only inconsistently use, the shadow doorbells for a given queue. The spec is somewhat ambiguous on this front, but it's clear from guest behavior that the expectation is that the shadow doorbells and the real doorbells remain in perfect sync, so that the actual doorbell value can be read from the shadow doorbell. To support this, use the shadow doorbell memory as the single source of truth for the doorbell register value. In addition to these changes, simplify the EventIdx handling to match the strategy we use for virtio and vmbus devices--update EventIdx (thus letting the guest know it needs to write the real doorbell on the next request) only when polling for work and the queue is currently empty. This is easier to reason about than the existing heuristic approach. This does eliminate an optimization where we currently avoid updating EventIdx when there are IOs in flight (thus increasing IO latency but reducing the intercept rate), but a future change can easily add this behavior back by avoiding polling (and instead just probing) a queue when more than the desired number of IOs are in flight. These changes are an improvement on their own, but they are also prep work for supporting multiple submission queues per completion queue, which is required by the NVMe spec and leveraged by Windows guests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR overhauls shadow doorbell handling by switching to a unified DoorbellMemory
, ensures real and shadow doorbells stay in sync, and simplifies event index updates. It also prepares for multi-queue support by centralizing doorbell state and streamlining queue APIs.
- Migrate from per-queue
DoorbellRegister
/ShadowDoorbell
to a sharedDoorbellMemory
protected byRwLock
. - Refactor I/O worker to use
poll_fn
for readiness and remove the previousadvance_evt_idx
heuristic. - Redesign admin handler’s doorbell buffer config, adding
supports_shadow_doorbells
and consolidating validation. - Update coordinator logic to route all doorbell writes through the shared memory.
- Adjust tests to use the new
DOORBELL_BUFFER_BASE
/EVT_IDX_BUFFER_BASE
constants directly.
Reviewed Changes
Copilot reviewed 5 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
File | Description |
---|---|
vm/devices/storage/nvme/src/workers/io.rs | Swapped to DoorbellMemory , removed ShadowDoorbell , and updated queue polling/event index logic. |
vm/devices/storage/nvme/src/workers/coordinator.rs | Replaced Vec<DoorbellRegister> with shared DoorbellMemory and simplified doorbell writes. |
vm/devices/storage/nvme/src/workers/admin.rs | Refactored admin queue creation, doorbell-buffer config, and added supports_shadow_doorbells . |
vm/devices/storage/nvme/src/tests/shadow_doorbell_tests.rs | Updated shadow doorbell tests to read/write directly to buffer constants. |
vm/devices/storage/nvme/Cargo.toml | Removed unneeded event-listener workspace entry. |
} else { | ||
tracelimit::warn_ratelimited!(index, value, "unknown doorbell"); | ||
} | ||
self.doorbells.read().write(index, value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using read()
to acquire a read lock when mutating DoorbellMemory
may allow concurrent writes without proper synchronization. Consider using self.doorbells.write()
to obtain a write lock before calling write(index, value)
.
self.doorbells.read().write(index, value); | |
self.doorbells.write().write(index, value); |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't tell why you need a RwLock
here. This is the only instance I can find where the nvme emulator accesses this doorbell. read()
is not protecting against concurrent access (which I presume is fine, since subsequent updates to the same doorbell index will be strictly more accurate). What is this protecting against?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's protecting against changes to the backing GuestMemory
and offsets (i.e., installation of the shadow doorbell buffers), as well as changes to the wakers.
vm/devices/storage/nvme/src/queue.rs
Outdated
} | ||
|
||
pub fn write(&self, db_id: u16, value: u32) { | ||
if db_id as usize >= self.wakers.len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be an assert statement like in read()
?
} | ||
listener.await; | ||
fn write_event_idx(&self, db_id: u16, val: u32) { | ||
if let Err(err) = self.mem.write_plain( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check for db_id
value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making this change, John!
.inspect_err(|err| { | ||
tracelimit::error_ratelimited!( | ||
error = err as &dyn std::error::Error, | ||
"failed to read doorbell memory" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add some identifier so that it's possible to debug which nvme device is seeing this error (here and in other places).
Currently, shadow doorbells are implemented in a way where there are essentially two doorbell values: the shadow one and the "real" one. Observed changes to the shadow value are reflected in the real value but not vice versa. This causes problems when guests do not use, or only inconsistently use, the shadow doorbells for a given queue.
The spec is somewhat ambiguous on this front, but it's clear from guest behavior that the expectation is that the shadow doorbells and the real doorbells remain in perfect sync, so that the actual doorbell value can be read from the shadow doorbell. To support this, use the shadow doorbell memory as the single source of truth for the doorbell register value.
In addition to these changes, simplify the EventIdx handling to match the strategy we use for virtio and vmbus devices--update EventIdx (thus letting the guest know it needs to write the real doorbell on the next request) only when polling for work and the queue is currently empty. This is easier to reason about than the existing heuristic approach.
This does eliminate an optimization where we currently avoid updating EventIdx when there are IOs in flight (thus increasing IO latency but reducing the intercept rate), but a future change can easily add this behavior back by avoiding polling (and instead just probing) a queue when more than the desired number of IOs are in flight.
These changes are an improvement on their own, but they are also prep work for supporting multiple submission queues per completion queue, which is required by the NVMe spec and leveraged by Windows guests.