REQ-NF-PERF-NDIS-001: Packet Forwarding Performance <1µs #46

@zarfld

Requirement Statement

ID: REQ-NF-PERF-NDIS-001
Type: Non-Functional Requirement
Priority: Critical
Phase: Phase 02 - Requirements Analysis & Specification

The Intel AVB Filter Driver shall achieve packet forwarding latency <1µs (microsecond) in the NDIS filter fast path to meet AVB/TSN timing requirements for Class A traffic (125µs end-to-end latency budget).


Traceability

Traces to: #31 (StR-NDIS-FILTER-001: NDIS Filter Driver Implementation)

Architecture Decisions

  • Refined by: #121 (ADR-PERF-001: NDIS Fast Path Optimization)

Quality Scenarios

Test Cases


Rationale

Problem: AVB Class A traffic requires <125µs end-to-end latency (2ms observation window). Filter driver adds latency to packet path.

IEEE 802.1BA Budget:

  • NIC ingress: 20µs
  • NDIS filter processing: <1µs ← This requirement
  • Stack processing: 50µs
  • NIC egress: 20µs
  • Network transit: 34µs
  • Total: ~125µs

Failure Impact: If filter exceeds 1µs, Class A traffic violates latency guarantees → audio/video quality degradation.
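
The budget arithmetic above can be checked mechanically. The following is a minimal sketch in plain user-mode C, not driver code; the table and helper names (`k_budget`, `budget_total_us`) are hypothetical, and the values are copied from the list above with the filter entered at its 1 µs upper bound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One stage of the end-to-end latency budget (hypothetical type). */
typedef struct {
    const char *stage;
    uint32_t budget_us;
} budget_entry_t;

/* Values copied from the budget list; the filter is entered at its
 * 1 us upper bound. */
static const budget_entry_t k_budget[] = {
    { "NIC ingress",            20 },
    { "NDIS filter processing",  1 },
    { "Stack processing",       50 },
    { "NIC egress",             20 },
    { "Network transit",        34 },
};

/* Sum the per-stage budgets in microseconds. */
static uint32_t budget_total_us(const budget_entry_t *b, size_t n) {
    assert(b != NULL);
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += b[i].budget_us;
    }
    return total;
}
```

Summing the five entries gives 125, so any filter latency above 1 µs has to be taken from another stage's allowance.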


Detailed Requirements

PERF-NDIS-001.1: Fast Path Bypass for AVB Traffic

Filter shall detect AVB packets in FilterReceiveNetBufferLists and forward without deep inspection:

Fast Path Criteria:

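BOOLEAN IsAvbFastPath(PNET_BUFFER_LIST Nbl) {
    // EtherType, VlanPresent, and Pcp are parsed from the Ethernet header
    // of the first NET_BUFFER in Nbl (parsing omitted here)

    // Check EtherType == 0x22F0 (IEEE 1722 AVTP)
    if (EtherType == 0x22F0) return TRUE;
    
    // Check VLAN PCP == 6 or 7 (internetwork control / network control)
    if (VlanPresent && (Pcp == 6 || Pcp == 7)) return TRUE;
    
    return FALSE;
}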

Fast Path Actions:

  1. Tag NBL with AVB marker (OOB data)
  2. Forward immediately to next driver
  3. No packet inspection
  4. No buffer copying
  5. No IOCTL synchronization

Slow Path (non-AVB):

  • Full packet inspection allowed
  • Statistics collection
  • Logging/tracing
  • IOCTL operations

Latency Target: Fast path <500ns (allows 500ns margin for NDIS overhead)
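
The fast path criteria can be made concrete on a raw frame buffer. This is a minimal user-mode sketch, assuming the Ethernet header is contiguous in memory; `is_avb_fast_path` and `read_be16` are hypothetical helpers, not NDIS APIs. Checking the EtherType that follows an 802.1Q tag is an assumption of this sketch rather than something the requirement states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_VLAN 0x8100u  /* IEEE 802.1Q C-VLAN tag */
#define ETHERTYPE_AVTP 0x22F0u  /* IEEE 1722 AVTP */

/* Read a big-endian (network order) 16-bit field. */
static uint16_t read_be16(const uint8_t *p) {
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Hypothetical helper: classify a contiguous Ethernet frame.
 * Returns true when the frame matches the fast path criteria. */
static bool is_avb_fast_path(const uint8_t *frame, size_t len) {
    assert(frame != NULL);
    if (len < 14) return false;               /* too short for an Ethernet header */

    uint16_t ether_type = read_be16(frame + 12);
    if (ether_type == ETHERTYPE_VLAN) {
        if (len < 18) return false;           /* truncated VLAN tag */
        uint16_t tci = read_be16(frame + 14); /* PCP(3) | DEI(1) | VID(12) */
        uint8_t pcp = (uint8_t)(tci >> 13);
        if (pcp == 6 || pcp == 7) return true;
        ether_type = read_be16(frame + 16);   /* EtherType after the tag */
    }
    return ether_type == ETHERTYPE_AVTP;
}
```

The EtherType is assembled byte by byte because it is big-endian on the wire; a direct 16-bit load on a little-endian x64 host would yield 0xF022 for an AVTP frame.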


PERF-NDIS-001.2: Zero-Copy Packet Forwarding

Filter shall use NDIS zero-copy APIs (no packet buffer allocation):

Receive Path:

VOID FilterReceiveNetBufferLists(
    NDIS_HANDLE FilterModuleContext,
    PNET_BUFFER_LIST NetBufferLists,
    NDIS_PORT_NUMBER PortNumber,
    ULONG NumberOfNetBufferLists,
    ULONG ReceiveFlags
) {
    PFILTER_ADAPTER_CONTEXT ctx = (PFILTER_ADAPTER_CONTEXT)FilterModuleContext;
    
    // Fast path: Forward original NBLs without cloning.
    // Note: NetBufferLists can be a chain; this sketch classifies the chain by its first NBL only.
    if (IsAvbFastPath(NetBufferLists)) {
        TagAvbTraffic(NetBufferLists);  // <100ns: Set OOB flag
        
        NdisFIndicateReceiveNetBufferLists(
            ctx->FilterHandle,
            NetBufferLists,  // Original NBLs (no clone)
            PortNumber,
            NumberOfNetBufferLists,
            ReceiveFlags
        );
        return;
    }
    
    // Slow path: Inspect/modify if needed
    // ...
}

Send Path:

VOID FilterSendNetBufferLists(
    NDIS_HANDLE FilterModuleContext,
    PNET_BUFFER_LIST NetBufferLists,
    NDIS_PORT_NUMBER PortNumber,
    ULONG SendFlags
) {
    PFILTER_ADAPTER_CONTEXT ctx = (PFILTER_ADAPTER_CONTEXT)FilterModuleContext;
    
    // Fast path: Forward without modification
    if (IsAvbFastPath(NetBufferLists)) {
        NdisFSendNetBufferLists(
            ctx->FilterHandle,
            NetBufferLists,  // Original NBLs
            PortNumber,
            SendFlags
        );
        return;
    }
    
    // Slow path: Inspect/modify if needed
    // ...
}

Latency Impact:

  • Zero-copy: <100ns overhead (just function call + flag check)
  • Clone NBL: ~2-5µs (unacceptable for AVB)

PERF-NDIS-001.3: Lock-Free Packet Statistics

Filter shall use atomic operations for statistics (no spinlocks in fast path):

Statistics Structure:

typedef struct _FILTER_STATS {
    volatile LONG64 RxPackets;      // Atomic increment
    volatile LONG64 TxPackets;
    volatile LONG64 RxAvbPackets;   // AVB-specific counters
    volatile LONG64 TxAvbPackets;
    volatile LONG64 RxBytes;        // Atomic add
    volatile LONG64 TxBytes;
} FILTER_STATS;

Atomic Update (fast path):

// Good: Lock-free increment (a single LOCK-prefixed instruction on x64; on the order of tens of cycles when uncontended)
InterlockedIncrement64(&ctx->Stats.RxAvbPackets);

// Bad: Spinlock (50-200 CPU cycles, cache line contention)
KeAcquireSpinLockAtDpcLevel(&ctx->StatsLock);
ctx->Stats.RxAvbPackets++;
KeReleaseSpinLockFromDpcLevel(&ctx->StatsLock);

Latency Impact:

  • Atomic operations: <10ns
  • Spinlock acquisition: 50-200ns (unacceptable in fast path)

Query Statistics (slow path IOCTL):

// Read atomic counters (no locks needed)
Stats->RxPackets = InterlockedCompareExchange64(&ctx->Stats.RxPackets, 0, 0);  // Atomic read
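
The same lock-free pattern can be tried outside the kernel with C11 atomics. This is a user-mode analogue only, not the driver's code: `filter_stats_t`, `stats_on_receive`, and `stats_query` are hypothetical names, and `atomic_fetch_add_explicit` stands in for `InterlockedIncrement64`/`InterlockedAdd64`. Whether 64-bit atomics are truly lock-free depends on the target; they are on x64.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* User-mode analogue of two FILTER_STATS counters using C11 atomics. */
typedef struct {
    _Atomic int64_t rx_avb_packets;
    _Atomic int64_t rx_bytes;
} filter_stats_t;

/* Fast path: two atomic read-modify-write operations, no lock taken. */
static void stats_on_receive(filter_stats_t *s, int64_t frame_bytes) {
    assert(frame_bytes >= 0);
    atomic_fetch_add_explicit(&s->rx_avb_packets, 1, memory_order_relaxed);
    atomic_fetch_add_explicit(&s->rx_bytes, frame_bytes, memory_order_relaxed);
}

/* Slow path: each counter is read atomically, but the pair is not a
 * consistent snapshot; a concurrent receive may land between the loads. */
static void stats_query(filter_stats_t *s, int64_t *packets, int64_t *bytes) {
    *packets = atomic_load_explicit(&s->rx_avb_packets, memory_order_relaxed);
    *bytes   = atomic_load_explicit(&s->rx_bytes, memory_order_relaxed);
}
```

Relaxed ordering is enough here because the counters are independent tallies and nothing else is published through them. A reader that needs packets and bytes to agree exactly would need a different scheme.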

PERF-NDIS-001.4: CPU Cache Optimization

Filter shall align critical data structures to cache line boundaries:

Cache-Aligned Context:

typedef struct DECLSPEC_CACHEALIGN _FILTER_ADAPTER_CONTEXT {
    // Hot path, read-mostly: first cache line (64 bytes)
    NDIS_HANDLE FilterHandle;        // Offset 0
    PHARDWARE_OPS HwOps;             // Offset 8
    PVOID HwContext;                 // Offset 16
    
    // Hot path, written per packet: own cache line.
    // FILTER_STATS is 6 x 8 = 48 bytes; at offset 24 it would end at
    // offset 72 and straddle the 64-byte boundary.
    DECLSPEC_CACHEALIGN FILTER_STATS Stats;            // Offset 64
    
    // Cold path: separate cache lines
    DECLSPEC_CACHEALIGN NDIS_SPIN_LOCK Lock;           // Offset 128
    DECLSPEC_CACHEALIGN DEVICE_CONTEXT DeviceContext;  // Offset 192
    // ...
} FILTER_ADAPTER_CONTEXT;

Compiler Directive:

#define DECLSPEC_CACHEALIGN __declspec(align(64))  // x64 cache line size

Latency Impact:

  • Cache-aligned: <5ns access (L1 cache hit)
  • Unaligned: 50-200ns (cache line split, false sharing)
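
Layout assumptions like these are easy to get wrong by hand, so it can help to have the compiler check them. Below is a minimal sketch in portable C11 using `alignas`, `offsetof`, and `static_assert`. The type names are hypothetical, the 64-byte line size is an assumption, and giving the counters their own line is a choice of this sketch; a driver build would use `DECLSPEC_CACHEALIGN` and `C_ASSERT` in the same way.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size; typical for x64 */

typedef struct {
    volatile int64_t rx_packets, tx_packets;
    volatile int64_t rx_avb_packets, tx_avb_packets;
    volatile int64_t rx_bytes, tx_bytes;
} stats_t;  /* 6 x 8 = 48 bytes */

typedef struct {
    /* Read-mostly hot fields share the first line. */
    alignas(CACHE_LINE) void *filter_handle;
    void *hw_ops;
    void *hw_context;
    /* Per-packet counters get their own line to avoid false sharing. */
    alignas(CACHE_LINE) stats_t stats;
    /* Cold data starts on a third line (placeholder for a lock). */
    alignas(CACHE_LINE) int lock_placeholder;
} adapter_ctx_t;

/* Compile-time checks: the build fails if the layout drifts. */
static_assert(sizeof(stats_t) == 48, "six 64-bit counters");
static_assert(sizeof(stats_t) <= CACHE_LINE, "stats fit in one line");
static_assert(offsetof(adapter_ctx_t, stats) == CACHE_LINE,
              "stats start on the second line");
static_assert(offsetof(adapter_ctx_t, lock_placeholder) == 2 * CACHE_LINE,
              "cold data starts on the third line");
```

The asserts cost nothing at run time and catch the case where a new field pushes the counters across a line boundary.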

PERF-NDIS-001.5: Inline Critical Functions

Filter shall inline fast path functions to reduce call overhead:

Inline Directives:

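__forceinline BOOLEAN IsAvbFastPath(PNET_BUFFER_LIST Nbl) {
    // Map the 14-byte Ethernet header of the first NET_BUFFER
    PUCHAR hdr = (PUCHAR)NdisGetDataBuffer(NET_BUFFER_LIST_FIRST_NB(Nbl), 14, NULL, 1, 0);
    if (hdr == NULL) return FALSE;  // Header not contiguous: take the slow path
    
    // EtherType is big-endian on the wire; assemble bytes 12-13 explicitly
    USHORT etherType = (USHORT)((hdr[12] << 8) | hdr[13]);
    return (etherType == 0x22F0);  // Single comparison, no function call
}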

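__forceinline VOID TagAvbTraffic(PNET_BUFFER_LIST Nbl) {
    // Caution: NDIS also uses NetBufferListFilteringInfo for VMQ filter/queue IDs;
    // confirm the slot is free on the target configuration before overwriting it
    NET_BUFFER_LIST_INFO(Nbl, NetBufferListFilteringInfo) = (PVOID)AVB_MARKER;
}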

Compiler Optimization:

<PropertyGroup>
  <WholeProgramOptimization>true</WholeProgramOptimization>
</PropertyGroup>
<ItemDefinitionGroup>
  <ClCompile>
    <Optimization>MaxSpeed</Optimization>
    <InlineFunctionExpansion>AnySuitable</InlineFunctionExpansion>
    <FavorSizeOrSpeed>Speed</FavorSizeOrSpeed>
  </ClCompile>
  <Link>
    <LinkTimeCodeGeneration>UseLinkTimeCodeGeneration</LinkTimeCodeGeneration>
  </Link>
</ItemDefinitionGroup>

Latency Impact:

  • Inline: 0ns (code embedded in caller)
  • Function call: 5-20ns (stack frame, return address, branch prediction)

PERF-NDIS-001.6: Prefetch Packet Headers

Filter shall prefetch packet headers to reduce cache misses:

Prefetch Directive:

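VOID FilterReceiveNetBufferLists(...) {
    PNET_BUFFER_LIST nbl = NetBufferLists;
    
    while (nbl) {
        PNET_BUFFER_LIST next = NET_BUFFER_LIST_NEXT_NBL(nbl);
        
        // Prefetch the next packet's header while the current one is processed;
        // prefetching data that is used immediately hides no latency.
        if (next) {
            // NdisGetDataBuffer only maps the header; it returns NULL if the
            // requested bytes are not contiguous (Storage is NULL here)
            PVOID nextData = NdisGetDataBuffer(NET_BUFFER_LIST_FIRST_NB(next), 14, NULL, 1, 0);
            if (nextData) _mm_prefetch((const char*)nextData, _MM_HINT_T0);  // Prefetch to L1 cache
        }
        
        // Process current packet (header prefetched on the previous iteration)
        if (IsAvbFastPath(nbl)) {
            // ...
        }
        
        nbl = next;
    }
}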

Latency Impact:

  • With prefetch: <10ns header access (L1 cache)
  • Without prefetch: 50-200ns (L3 cache or RAM)

PERF-NDIS-001.7: DPC Processing for Receive

Filter shall process receive packets at DISPATCH_LEVEL (no IRQL transitions):

NDIS Receive Callback (called at IRQL <= DISPATCH_LEVEL, typically from a DPC):

VOID FilterReceiveNetBufferLists(
    NDIS_HANDLE FilterModuleContext,
    PNET_BUFFER_LIST NetBufferLists,
    NDIS_PORT_NUMBER PortNumber,
    ULONG NumberOfNetBufferLists,
    ULONG ReceiveFlags
) {
    // Usually already at DISPATCH_LEVEL (DPC context); NDIS then sets
    // NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL in ReceiveFlags.
    // No IRQL transition needed.
    
    ASSERT(KeGetCurrentIrql() <= DISPATCH_LEVEL);
    
    // Fast path processing
    // ...
}

Latency Impact:

  • DISPATCH_LEVEL: No IRQL transition overhead
  • PASSIVE → DISPATCH: 200-500ns per transition

PERF-NDIS-001.8: Avoid Memory Allocation in Fast Path

Filter shall pre-allocate all resources during initialization:

Pre-Allocated Pools:

typedef struct _FILTER_ADAPTER_CONTEXT {
    NDIS_HANDLE NblPoolHandle;   // Pre-allocated NBL pool (slow path only)
    NDIS_HANDLE NbPoolHandle;    // Pre-allocated NB pool
    NDIS_HANDLE BufferPool;      // Pre-allocated buffer pool
} FILTER_ADAPTER_CONTEXT;

Initialization:

NTSTATUS AllocateFilterPools(PFILTER_ADAPTER_CONTEXT ctx) {
    NET_BUFFER_LIST_POOL_PARAMETERS nblParams = {0};
    nblParams.Header.Type = NDIS_OBJECT_TYPE_DEFAULT;
    nblParams.Header.Revision = NET_BUFFER_LIST_POOL_PARAMETERS_REVISION_1;
    nblParams.Header.Size = NDIS_SIZEOF_NET_BUFFER_LIST_POOL_PARAMETERS_REVISION_1;
    nblParams.ProtocolId = NDIS_PROTOCOL_ID_DEFAULT;
    nblParams.ContextSize = 0;
    nblParams.fAllocateNetBuffer = FALSE;
    nblParams.PoolTag = 'LBVA';  // Reads as 'AVBL' in pool dumps
    
    ctx->NblPoolHandle = NdisAllocateNetBufferListPool(ctx->FilterHandle, &nblParams);
    if (!ctx->NblPoolHandle) return STATUS_INSUFFICIENT_RESOURCES;
    
    // Pre-allocate 64 NBLs (reused from pool)
    // ...
    
    return STATUS_SUCCESS;
}

Slow Path Allocation (the fast path allocates nothing):

// Good: Reuse from pre-allocated pool (200-500ns)
PNET_BUFFER_LIST clone = NdisAllocateNetBufferList(ctx->NblPoolHandle, ...);

// Bad: Allocate dynamically (2-10µs, unacceptable)
PNET_BUFFER_LIST clone = ExAllocatePoolWithTag(NonPagedPool, sizeof(NET_BUFFER_LIST), 'LBVA');

Latency Impact:

  • Pool allocation: 200-500ns
  • Dynamic allocation: 2-10µs (10-20x slower)
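
The pre-allocation idea itself is independent of NDIS. As a rough illustration, here is a fixed-size free-list pool in user-mode C. All names (`fixed_pool_t`, `pool_get`, `pool_put`) and the 128-byte payload are hypothetical. The sketch is single-threaded; a driver would rely on the NDIS pool APIs or add its own synchronization.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 64  /* matches the 64 pre-allocated NBLs mentioned above */

typedef struct pool_item {
    struct pool_item *next;      /* free-list link */
    unsigned char payload[128];  /* hypothetical per-item storage */
} pool_item_t;

typedef struct {
    pool_item_t items[POOL_SIZE];  /* storage reserved once, up front */
    pool_item_t *free_list;
    size_t free_count;
} fixed_pool_t;

/* Initialization time: thread every item onto the free list. */
static void pool_init(fixed_pool_t *p) {
    p->free_list = NULL;
    for (size_t i = 0; i < POOL_SIZE; i++) {
        p->items[i].next = p->free_list;
        p->free_list = &p->items[i];
    }
    p->free_count = POOL_SIZE;
}

/* Packet time: O(1) pop; never calls the system allocator. */
static pool_item_t *pool_get(fixed_pool_t *p) {
    pool_item_t *it = p->free_list;
    if (it == NULL) return NULL;  /* exhausted: caller must handle */
    p->free_list = it->next;
    p->free_count--;
    return it;
}

/* Return an item to the pool for reuse. */
static void pool_put(fixed_pool_t *p, pool_item_t *it) {
    assert(it >= p->items && it < p->items + POOL_SIZE);
    it->next = p->free_list;
    p->free_list = it;
    p->free_count++;
}
```

Both `pool_get` and `pool_put` are a handful of pointer operations, which is why pool reuse is so much cheaper than a general-purpose allocation.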

PERF-NDIS-001.9: Minimize Conditional Branches

Filter shall optimize branch prediction for fast path:

Branch Optimization:

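// Compiler hint for branch prediction. MSVC (the WDK compiler) has no
// __builtin_expect, so the macros fall back to the plain condition there.
#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#else
#define UNLIKELY(x) (x)
#define LIKELY(x)   (x)
#endif

// Good: Predict AVB traffic is rare (fall-through)
if (UNLIKELY(IsAvbFastPath(nbl))) {  // Hint: unlikely branch
    FastPathForward(nbl);
    return;
}

// Default path: Non-AVB processing
// ...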

Latency Impact:

  • Correct prediction: <1ns (pipelined)
  • Misprediction: 10-20ns (pipeline flush)

Error Scenarios

ES-PERF-NDIS-001: Fast Path Latency Exceeded

Condition: Packet forwarding takes >1µs (measured via TSC)
NTSTATUS: STATUS_TIMEOUT (0x00000102)
Recovery: Log event; switch to slow path for diagnostics
User Impact: AVB latency budget violated → audio/video glitches
Prevention: Performance profiling during development
Event ID: 17101 (Warning: Fast path latency exceeded)
Test: Timestamp packet entry/exit; verify <1µs
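
Turning a pair of TSC readings into nanoseconds needs the TSC frequency, which this requirement does not specify. A minimal sketch of the conversion follows, with hypothetical helper names and the frequency passed in by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FAST_PATH_LIMIT_NS 1000u  /* 1 us limit from this requirement */

/* Convert a TSC delta to nanoseconds. The delta is split into whole
 * seconds and a remainder so the multiplication does not overflow
 * 64 bits for realistic frequencies (below about 18 GHz). */
static uint64_t tsc_delta_to_ns(uint64_t cycles, uint64_t tsc_hz) {
    assert(tsc_hz > 0);
    uint64_t secs = cycles / tsc_hz;
    uint64_t rem  = cycles % tsc_hz;
    return secs * 1000000000ull + (rem * 1000000000ull) / tsc_hz;
}

/* True when the entry/exit timestamps span more than the 1 us limit. */
static bool fast_path_latency_exceeded(uint64_t tsc_entry, uint64_t tsc_exit,
                                       uint64_t tsc_hz) {
    return tsc_delta_to_ns(tsc_exit - tsc_entry, tsc_hz) > FAST_PATH_LIMIT_NS;
}
```

At an assumed 3 GHz TSC, the 1 µs limit corresponds to 3000 cycles, so the timestamping overhead itself is a visible fraction of what is being measured.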

ES-PERF-NDIS-002: Memory Allocation in Fast Path

Condition: Fast path attempts dynamic allocation (detected via verifier)
NTSTATUS: STATUS_UNSUCCESSFUL (0xC0000001)
Recovery: Use pre-allocated pool instead
User Impact: Latency spike → potential packet drop
Prevention: Driver Verifier with low resource simulation
Event ID: 17102 (Error: Unexpected allocation in fast path)
Test: Enable Driver Verifier; verify no allocations during receive

ES-PERF-NDIS-003: Spinlock Contention in Fast Path

Condition: Multiple CPUs contend for statistics lock
NTSTATUS: N/A (performance degradation)
Recovery: Replace spinlock with atomic operations
User Impact: Latency increases 50-200ns per packet
Prevention: Lock-free atomic counters
Event ID: 17103 (Warning: Spinlock contention detected)
Test: Multi-core stress test; monitor lock wait time

ES-PERF-NDIS-004: Cache Line False Sharing

Condition: Multiple CPUs modify adjacent variables (same cache line)
NTSTATUS: N/A (performance degradation)
Recovery: Align hot variables to separate cache lines
User Impact: Latency increases 50-200ns (cache coherency traffic)
Prevention: DECLSPEC_CACHEALIGN on hot structures
Event ID: 17104 (Warning: Cache line contention detected)
Test: CPU cache profiler (Intel VTune); check false sharing

ES-PERF-NDIS-005: Packet Cloning in Fast Path

Condition: Fast path incorrectly clones NBLs instead of forwarding
NTSTATUS: N/A (performance degradation)
Recovery: Remove cloning; use zero-copy forward
User Impact: Latency increases 2-5µs per packet
Prevention: Code review; static analysis
Event ID: 17105 (Error: Unnecessary NBL cloning detected)
Test: ETW tracing; verify NdisAllocateCloneNetBufferList not called

ES-PERF-NDIS-006: IRQL Transition in Fast Path

Condition: Fast path raises or lowers IRQL (KeRaiseIrql/KeLowerIrql)
NTSTATUS: N/A (performance degradation)
Recovery: Keep processing at DISPATCH_LEVEL
User Impact: Latency increases 200-500ns per transition
Prevention: IRQL assertions in fast path
Event ID: 17106 (Error: IRQL transition in fast path)
Test: Instrument KeRaiseIrql; verify not called during receive

ES-PERF-NDIS-007: Branch Misprediction Penalty

Condition: Fast path has unpredictable branches (50/50 split)
NTSTATUS: N/A (performance degradation)
Recovery: Reorganize code for fall-through common case
User Impact: Latency increases 10-20ns per misprediction
Prevention: Profile branch statistics; optimize hot path
Event ID: 17107 (Info: Branch misprediction rate high)
Test: CPU performance counters; measure misprediction rate

ES-PERF-NDIS-008: TLB Miss in Packet Access

Condition: Address translation for the packet buffer's page is not cached in the TLB
NTSTATUS: N/A (performance degradation)
Recovery: Prefetch packet data; use large pages if possible
User Impact: Latency increases 50-200ns on TLB miss
Prevention: Buffer placement is decided by the miniport and NDIS; the filter driver cannot control it
Event ID: 17108 (Info: TLB miss rate elevated)
Test: CPU performance counters (PMU); measure DTLB misses

ES-PERF-NDIS-009: Excessive Function Call Depth

Condition: Fast path has deep call stack (>5 levels)
NTSTATUS: N/A (performance degradation)
Recovery: Inline critical functions
User Impact: Latency increases 5-20ns per call
Prevention: __forceinline on hot path functions
Event ID: 17109 (Warning: Deep call stack in fast path)
Test: Call stack profiler; verify <3 levels in fast path

ES-PERF-NDIS-010: Non-Temporal Store Pollution

Condition: Fast path writes pollute CPU cache
NTSTATUS: N/A (performance degradation)
Recovery: Use non-temporal stores for large buffers
User Impact: Latency increases 20-100ns (cache eviction)
Prevention: _mm_stream_si64 for non-temporal writes
Event ID: 17110 (Info: Cache pollution detected)
Test: Cache profiler; measure eviction rate


Performance Metrics

PM-PERF-NDIS-001: Fast Path Latency (Target)

Target: <1µs (1000ns) packet forwarding
Measurement: RDTSC timestamp at entry/exit of FilterReceiveNetBufferLists
Threshold: 95th percentile <1µs, 99th percentile <1.5µs
Test: High-rate AVB traffic (10,000 pps); measure per-packet latency
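
The percentile thresholds can be evaluated offline from recorded per-packet latencies. The sketch below assumes samples already converted to nanoseconds and uses the nearest-rank definition of a percentile; that definition and the function names are assumptions, since the requirement does not say how percentiles are computed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for uint64_t, written to avoid overflow. */
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile: the smallest sample with at least pct% of
 * all samples at or below it. Sorts the array in place. */
static uint64_t percentile_ns(uint64_t *samples, size_t n, unsigned pct) {
    assert(n > 0 && pct >= 1 && pct <= 100);
    qsort(samples, n, sizeof samples[0], cmp_u64);
    size_t rank = (n * pct + 99) / 100;  /* ceil(n * pct / 100) */
    return samples[rank - 1];
}

/* PM-PERF-NDIS-001 thresholds: p95 < 1000 ns and p99 < 1500 ns. */
static bool meets_latency_threshold(uint64_t *samples, size_t n) {
    return percentile_ns(samples, n, 95) < 1000 &&
           percentile_ns(samples, n, 99) < 1500;
}
```

With 100 samples, a single 5 µs outlier still passes because it sits above the 99th percentile, while five or more samples at or above 1 µs fail the 95th percentile check.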

PM-PERF-NDIS-002: Fast Path Throughput

Target: 1 Gbps line rate (AVB traffic)
Measurement: Measure packets/second with 1500-byte frames
Threshold: >80,000 pps (1 Gbps / 12,000 bits per packet)
Test: Packet generator; saturate filter with AVB traffic
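
The 80,000 pps threshold follows from simple division, and it survives once on-wire overhead is counted. A small sketch of that arithmetic follows; `max_pps` is a hypothetical helper, and the 38-byte overhead assumes an untagged frame (preamble, header, FCS, and inter-frame gap).

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame bytes on the wire beyond the MAC payload, untagged frame:
 * preamble+SFD (8) + Ethernet header (14) + FCS (4) + inter-frame gap (12). */
#define WIRE_OVERHEAD_BYTES 38u

/* Maximum frames per second for a link rate and payload size. */
static uint64_t max_pps(uint64_t link_bps, uint32_t payload_bytes,
                        uint32_t overhead_bytes) {
    uint64_t bits_per_frame = 8ull * ((uint64_t)payload_bytes + overhead_bytes);
    assert(bits_per_frame > 0);
    return link_bps / bits_per_frame;  /* integer division rounds down */
}
```

For 1 Gbps and 1500 bytes, `max_pps(1000000000, 1500, 0)` gives 83,333 (the figure implied by 12,000 bits per packet) and `max_pps(1000000000, 1500, WIRE_OVERHEAD_BYTES)` gives 81,274, so both readings stay above the threshold.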

PM-PERF-NDIS-003: CPU Utilization

Target: <5% CPU per 1 Gbps AVB traffic
Measurement: Windows Performance Monitor (% Processor Time)
Threshold: <5% CPU on 4-core system
Test: Sustained 1 Gbps AVB traffic; measure driver CPU time

PM-PERF-NDIS-004: Cache Miss Rate

Target: <1% L1 cache miss rate in fast path
Measurement: CPU performance counters (PEBS/PMU)
Threshold: <1% L1 miss rate
Test: Intel VTune cache analysis during packet processing

PM-PERF-NDIS-005: Branch Misprediction Rate

Target: <0.5% branch mispredictions in fast path
Measurement: CPU performance counters (branch-misses event)
Threshold: <0.5% of all branches
Test: perf stat -e branch-misses (Linux) or VTune (Windows)

PM-PERF-NDIS-006: Memory Allocations

Target: Zero allocations in fast path
Measurement: Driver Verifier pool allocation tracking
Threshold: 0 allocations during packet forwarding
Test: Driver Verifier + ETW tracing; verify no ExAllocatePool calls

PM-PERF-NDIS-007: Spinlock Wait Time

Target: 0ns (no spinlocks in fast path)
Measurement: ETW kernel tracing (SpinlockAcquire events)
Threshold: 0 spinlock acquisitions in fast path
Test: Concurrency stress test; verify no spinlock events

PM-PERF-NDIS-008: Atomic Operation Latency

Target: <10ns per InterlockedIncrement64
Measurement: RDTSC micro-benchmark
Threshold: <10ns average
Test: Tight loop of InterlockedIncrement64; measure overhead

PM-PERF-NDIS-009: Packet Drop Rate

Target: 0% packet drops under load
Measurement: NDIS statistics (IfOutDiscards, IfInDiscards)
Threshold: 0 drops at 1 Gbps sustained
Test: 24-hour stress test at line rate; verify no drops


Acceptance Criteria (Gherkin Format)

Feature: Packet Forwarding Performance <1µs
  As an AVB application
  I need minimal filter latency
  So that Class A traffic meets <125µs end-to-end latency

  Scenario: Fast path latency under 1µs
    Given AVB packet stream at 10,000 pps
    When filter receives packets
    Then 95% of packets forwarded in <1µs
    And 99% of packets forwarded in <1.5µs

  Scenario: Zero-copy forwarding
    Given AVB packet with EtherType 0x22F0
    When filter processes packet
    Then original NBL forwarded (no clone)
    And no memory allocation occurs
    And latency <500ns

  Scenario: Lock-free statistics
    Given 4 CPUs processing packets concurrently
    When updating packet counters
    Then atomic operations used (no spinlocks)
    And no cache line contention
    And latency <10ns per counter update

  Scenario: Line rate throughput
    Given 1 Gbps AVB traffic (80,000 pps)
    When sustaining load for 1 hour
    Then 0% packet drops
    And CPU utilization <5%
    And latency remains <1µs

  Scenario: Cache-optimized data structures
    Given hot path variables placed on cache-line-aligned boundaries
    When accessing FilterHandle, HwOps, Stats
    Then all accesses hit L1 cache (<5ns)
    And no cache line splits
    And no false sharing between CPUs

Dependencies

Prerequisites:


Effort Estimation

Complexity: High (requires low-level optimization)
Estimated Effort: 40 hours (optimize + profile + test)


Status: Draft
Created: 2025-12-09
Enhanced: 2025-12-10 (Added 10 error scenarios, 9 performance metrics, Event IDs 17101-17110)
