bug: OLR crashes reading online redo with stale/long-lived sequence

## Summary

OLR fails with `ERROR 50051 did not find lwn at offset` when reading online redo logs under heavy write load.

## Root Cause Analysis

### Theory 1: Stale blocks beyond Oracle write position — DISPROVED

Hex-dump analysis of both XE (10 MB) and RAC (1 GB) redo files shows Oracle does NOT stamp new sequence numbers on old blocks. Old blocks keep their original sequence headers, and `checkBlockHeader()` correctly returns EMPTY for them. The error offset (8.7 MB) was well within the valid data range of the current sequence.

### Theory 2: Read-while-write race on LWN headers — LIKELY

The Parser tracks LWN (Logical Write Number) groups using a chain:
1. Read LWN header at `lwnEndBlock` (Parser.cpp:1400-1406)
2. Extract `lwnSize` from offset 28 in the block (4-byte uint32)
3. Set `lwnEndBlock = currentBlock + lwnSize` for the next LWN
4. If `lwnSize` is corrupted, `lwnEndBlock` points to a non-LWN block → ERROR 50051

For online redo, Oracle is actively writing blocks. If OLR reads a block that Oracle is in the process of writing, the `lwnSize` field (or the `vld` flag) could be partially written, causing OLR to miscalculate the next LWN position.

This explains:
- **Hard to reproduce** — requires reading the exact block Oracle is writing at that moment
- **Only on online redo** — archived redo is fully written, no race
- **More likely under heavy load** — more blocks being written per second
- **The stale thread correlation was coincidental** — heavy DML on one node causes rapid redo writes, increasing the chance of a read-while-write collision

## Hex-Dump Evidence

**Oracle XE (10 MB redo, group 1, seq 142):**
```
Block 2 – 18148:    seq 142 (current) — 8.86 MB of real data
Block 18149 – 20218: seq 140 (old)     — correctly detected as EMPTY by OLR
Block 20219 – 20476: seq 106 (older)   — correctly detected as EMPTY by OLR
```

**Oracle RAC (1 GB redo, group 8, seq 1522):**
```
Block 2 – 1452006:     seq 1522 (current) — 709 MB of real data
Block 1452007 – 1462108: seq 1520 (old)    — correctly detected as EMPTY
Block 1462109 – 1477044: seq 1516 (older)  — correctly detected as EMPTY
Block 1477045+:          seq 0 (zeroes)    — correctly detected as EMPTY
```

## Potential Fix

The Parser should handle `(vld & 0x04) == 0` at a valid block (correct sequence, valid CRC) as "block not yet fully written" rather than a fatal error — retry/wait instead of throwing ERROR 50051. This is similar to how the Reader handles EMPTY blocks.

## Reproduction

Could not reproduce in 20+ automated attempts. The race window is very small. May need to add intentional delays in the Reader to widen the race window for testing.

## Workaround

`ALTER SYSTEM SWITCH ALL LOGFILE` before starting OLR forces fresh redo files, avoiding any partially-written blocks from a previous write session.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: OLR crashes reading online redo with stale/long-lived sequence #8

Summary

Root Cause Analysis

Theory 1: Stale blocks beyond Oracle write position — DISPROVED

Theory 2: Read-while-write race on LWN headers — LIKELY

Hex-Dump Evidence

Potential Fix

Reproduction

Workaround

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

bug: OLR crashes reading online redo with stale/long-lived sequence #8

Description

Summary

Root Cause Analysis

Theory 1: Stale blocks beyond Oracle write position — DISPROVED

Theory 2: Read-while-write race on LWN headers — LIKELY

Hex-Dump Evidence

Potential Fix

Reproduction

Workaround

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions