What problem are we solving?
In our test environment (3 masters, 3 filers, 9 volume servers), many "volume is read only" errors were reported over a long period of time. We found that the free disk space on one volume server node had dropped to minfreeSpace, so all volumes on that node became "read only".
Analysis
In theory, the error should last at most 5s (the heartbeat interval). The logs show, however, that for a long period of time the master kept allocating fids to the "read only" volume. The master's writable list and the actual volume state were inconsistent.
The inconsistency arises during the vacuum process. Suppose volume1 has replicas on node1, node2, and node3, and only the replica on node1 satisfies the GarbageRatio >= garbageThreshold condition, while the replicas on node2 and node3 do not.
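The per-replica condition can be illustrated with a minimal sketch. The `ReplicaStats` type, the `NodesToVacuum` function, and the numbers below are hypothetical and chosen only for illustration; they are not the project's actual types or values.

```go
package main

import "fmt"

// ReplicaStats is a hypothetical record of one replica's garbage ratio.
type ReplicaStats struct {
	Node         string
	GarbageRatio float64
}

// NodesToVacuum returns the nodes whose replica satisfies
// GarbageRatio >= garbageThreshold. Each replica is checked independently,
// so the result may cover only some of the replicas of a volume.
func NodesToVacuum(replicas []ReplicaStats, garbageThreshold float64) []string {
	var nodes []string
	for _, r := range replicas {
		if r.GarbageRatio >= garbageThreshold {
			nodes = append(nodes, r.Node)
		}
	}
	return nodes
}

func main() {
	// Assumed example values: only node1 exceeds the threshold.
	volume1 := []ReplicaStats{
		{Node: "node1", GarbageRatio: 0.4},
		{Node: "node2", GarbageRatio: 0.1},
		{Node: "node3", GarbageRatio: 0.2},
	}
	fmt.Println(NodesToVacuum(volume1, 0.3)) // prints [node1]
}
```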
In that case, only the replica on node1 is vacuumed. At commit time, the vacuum has released disk space on node1, so volume1 on node1 is no longer "read only", and the master sets volume1 to writable. But the disks of node2 and node3 are still full, so volume1 should not be writable. The master's writables information is now incorrect.
How are we solving the problem?
Check all replicas during vacuum commit, and set volume1 to writable only when none of its replicas is "read only".
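A minimal sketch of this check, assuming a hypothetical `Replica` type and `AllReplicasWritable` helper (neither name is taken from the actual change):

```go
package main

import "fmt"

// Replica is a hypothetical stand-in for one copy of a volume on a node.
type Replica struct {
	Node     string
	ReadOnly bool
}

// AllReplicasWritable reports whether the volume may be marked writable:
// it must have at least one replica, and no replica may be read-only.
func AllReplicasWritable(replicas []Replica) bool {
	if len(replicas) == 0 {
		return false
	}
	for _, r := range replicas {
		if r.ReadOnly {
			return false
		}
	}
	return true
}

func main() {
	// State after vacuum commit in the scenario above: node1 has free
	// space again, while node2 and node3 are still full.
	volume1 := []Replica{
		{Node: "node1", ReadOnly: false},
		{Node: "node2", ReadOnly: true},
		{Node: "node3", ReadOnly: true},
	}
	fmt.Println(AllReplicasWritable(volume1)) // prints false
}
```

With this check, volume1 stays out of the writable list until node2 and node3 also report that their replicas are no longer "read only".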
How is the PR tested?
Checks