Troubleshooting data corruption [DRAFT]

If you've encountered an error message like the following:

'fsck' found errors on device /dev/longhorn/pvc-6288f5ea-5eea-4524-a84f-afa14b85780d but could not correct them.

Then the data on the volume is corrupted.
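To spot affected volumes in a larger log, the message above can be matched with a regular expression. The following is a minimal sketch: the function name find_corrupted_volumes is hypothetical, and it assumes you have already collected the relevant log text yourself and that the message wording matches the example above exactly.

```python
import re

# Matches the fsck message shown above and captures the device path
# and the Longhorn volume name (the last path component).
FSCK_ERROR = re.compile(
    r"'fsck' found errors on device (?P<device>/dev/longhorn/(?P<volume>\S+)) "
    r"but could not correct them"
)


def find_corrupted_volumes(log_text):
    """Return the volume names found in fsck failure messages, in log order.

    Hypothetical helper: log_text is any string you collected yourself.
    """
    return [m.group("volume") for m in FSCK_ERROR.finditer(log_text)]
```

The returned name is the same <volume_name> that appears under /dev/longhorn/.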

Underlying disk went bad

You can determine whether the error was caused by one of the underlying disks going bad by following the steps in https://github.com/yasker/longhorn/wiki/Identify-the-corrupted-replica-%5BDRAFT%5D . Those steps should tell you which replica went bad.

If most of the replicas on the disk went bad, the disk is no longer reliable and should be replaced.

If only one replica on the disk went bad, it may be a case of bit rot. In this case, removing that replica is sufficient.
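The two disk-level cases above can be written down as a small decision helper. This is only a sketch under stated assumptions: the function name diagnose_disk and its inputs are hypothetical, "most" is interpreted as more than half, and the "inconclusive" result for counts in between is an addition that the text above does not cover.

```python
def diagnose_disk(total_replicas, bad_replicas):
    """Suggest an action for one disk from its replica counts.

    total_replicas: number of replicas stored on the disk (hypothetical input).
    bad_replicas: how many of them were identified as corrupted.
    """
    if total_replicas <= 0 or not 0 <= bad_replicas <= total_replicas:
        raise ValueError("invalid replica counts")
    if bad_replicas == 0:
        return "no bad replica on this disk"
    # Assumption: a single bad replica is always treated as possible bit rot,
    # even when it is the only replica on the disk.
    if bad_replicas == 1:
        return "possible bit rot: remove the bad replica"
    # Assumption: "most" means strictly more than half.
    if bad_replicas * 2 > total_replicas:
        return "disk unreliable: replace the disk"
    # Not covered by the guidance above; flagged for manual investigation.
    return "inconclusive: investigate the disk further"
```

For example, diagnose_disk(5, 4) suggests replacing the disk, while diagnose_disk(5, 1) suggests removing the single bad replica.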

If all the replicas are identical, then we need to recover the volume using the snapshots.

The likely cause in this case is that the bad data was written by the workload the volume was attached to. To revert to a previous snapshot:

1. Attach the volume in maintenance mode to any node.
2. Revert to a snapshot. Start with the latest one.
3. Detach the volume from maintenance mode.
4. Re-attach the volume to a node you have access to.
5. Mount the volume from /dev/longhorn/<volume_name> and check the volume content.
6. If the volume content is still incorrect, repeat from step 1, reverting to the next older snapshot.
7. Once you find a usable snapshot, take a new snapshot from there and resume using the volume as normal.
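The procedure above amounts to a newest-first search over the snapshots. A minimal sketch of that loop follows. The names find_usable_snapshot and revert_and_check are hypothetical: revert_and_check stands in for the manual attach, revert, detach, re-attach, mount, and inspect cycle, and no Longhorn API is called here.

```python
def find_usable_snapshot(snapshots, revert_and_check):
    """Try snapshots from newest to oldest and return the first usable one.

    snapshots: snapshot names ordered newest first (assumed ordering).
    revert_and_check: hypothetical callable that performs the manual cycle
        for one snapshot and returns True if the volume content is good.
    """
    for name in snapshots:
        if revert_and_check(name):
            return name
    # No snapshot was usable; the remaining option is restoring a backup.
    return None
```

Stopping at the first usable snapshot keeps as much recent data as possible, which is why the search starts from the latest one.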

If all of the methods above fail, use a backup to recover the volume.