Skip to content

bug: OLR crashes reading online redo with stale/long-lived sequence #8

@rophy

Description

@rophy

Summary

OLR fails with ERROR 50051 did not find lwn at offset when reading online redo logs under heavy write load.

Root Cause Analysis

Theory 1: Stale blocks beyond Oracle write position — DISPROVED

Hex-dump analysis of both XE (10 MB) and RAC (1 GB) redo files shows Oracle does NOT stamp new sequence numbers on old blocks. Old blocks keep their original sequence headers, and checkBlockHeader() correctly returns EMPTY for them. The error offset (8.7 MB) was well within the valid data range of the current sequence.

Theory 2: Read-while-write race on LWN headers — LIKELY

The Parser tracks LWN (Logical Write Number) groups using a chain:

  1. Read LWN header at lwnEndBlock (Parser.cpp:1400-1406)
  2. Extract lwnSize from offset 28 in the block (4-byte uint32)
  3. Set lwnEndBlock = currentBlock + lwnSize for the next LWN
  4. If lwnSize is corrupted, lwnEndBlock points to a non-LWN block → ERROR 50051

For online redo, Oracle is actively writing blocks. If OLR reads a block that Oracle is in the process of writing, the lwnSize field (or the vld flag) could be partially written, causing OLR to miscalculate the next LWN position.

This explains:

  • Hard to reproduce — requires reading the exact block Oracle is writing at that moment
  • Only on online redo — archived redo is fully written, no race
  • More likely under heavy load — more blocks being written per second
  • The stale thread correlation was coincidental — heavy DML on one node causes rapid redo writes, increasing the chance of a read-while-write collision

Hex-Dump Evidence

Oracle XE (10 MB redo, group 1, seq 142):

Block 2 – 18148:    seq 142 (current) — 8.86 MB of real data
Block 18149 – 20218: seq 140 (old)     — correctly detected as EMPTY by OLR
Block 20219 – 20476: seq 106 (older)   — correctly detected as EMPTY by OLR

Oracle RAC (1 GB redo, group 8, seq 1522):

Block 2 – 1452006:     seq 1522 (current) — 709 MB of real data
Block 1452007 – 1462108: seq 1520 (old)    — correctly detected as EMPTY
Block 1462109 – 1477044: seq 1516 (older)  — correctly detected as EMPTY
Block 1477045+:          seq 0 (zeroes)    — correctly detected as EMPTY

Potential Fix

The Parser should handle (vld & 0x04) == 0 at a valid block (correct sequence, valid CRC) as "block not yet fully written" rather than a fatal error — retry/wait instead of throwing ERROR 50051. This is similar to how the Reader handles EMPTY blocks.

Reproduction

Could not reproduce in 20+ automated attempts. The race window is very small. May need to add intentional delays in the Reader to widen the race window for testing.

Workaround

ALTER SYSTEM SWITCH ALL LOGFILE before starting OLR forces fresh redo files, avoiding any partially-written blocks from a previous write session.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions