re-balancing chunks bug #237

Open
shtephan opened this issue Sep 27, 2019 · 12 comments

@shtephan commented Sep 27, 2019

Re-replication for re-balancing (space utilization) does not check whether the destination chunk server already stores the chunk. If the chunk initially had 2 replicas, one on the re-balance source chunk server and one on the re-replication destination chunk server, then after the re-balancing the file will have only one replica, on the destination server.
I can provide additional logs from the environment I'm using to prove this if needed.

@mikeov (Contributor) commented Nov 18, 2019

I'd guess the implicit assumption here is that file replication is greater than 1, possibly 2. If replication is 1 then, if I read the description correctly, the system works as expected; if replication is greater than 1, it does not. Though in both cases re-replication would not occur; only the replication check might need to delete extra replicas when the actual replication is greater than the target.
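
To make that check concrete, the decision described here amounts to comparing the actual replica count with the file's target replication (a minimal sketch with made-up names, not the actual LayoutManager code):

    // Minimal sketch of the replica-count check described above, with
    // made-up names; not the actual LayoutManager code.
    enum class ReplicationAction { kNone, kReReplicate, kDeleteExtraReplicas };

    ReplicationAction CheckReplication(int actualReplicas, int targetReplication) {
        if (actualReplicas < targetReplication) {
            return ReplicationAction::kReReplicate;         // under-replicated
        }
        if (actualReplicas > targetReplication) {
            return ReplicationAction::kDeleteExtraReplicas; // over-replicated
        }
        return ReplicationAction::kNone;                    // at target
    }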

@shtephan (Author) commented Nov 18, 2019

Our problem is that we lose redundancy in the case described above for files with replication > 1:
after the re-balancing process ends, a file with replication=2 may have only one replica (because the re-balancing tries to place the re-balanced chunk on top of another existing replica).
So, my opinion is that when a chunk is re-balanced, the servers already holding the chunk must be excluded from the list of possible destinations for the chunk currently targeted for re-balancing.
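
For illustration, the proposed exclusion could look roughly like this (a minimal sketch with hypothetical names such as ChunkServerPtr and FilterRebalanceCandidates; not the actual LayoutManager interface):

    // Sketch of the proposed fix: when choosing a re-balance destination,
    // drop any candidate server that already holds a replica of the chunk.
    #include <algorithm>
    #include <vector>

    struct ChunkServer;                         // opaque here
    using ChunkServerPtr = ChunkServer*;

    struct ChunkInfo {
        std::vector<ChunkServerPtr> replicas;   // servers currently holding the chunk
    };

    std::vector<ChunkServerPtr> FilterRebalanceCandidates(
        const ChunkInfo& chunk,
        const std::vector<ChunkServerPtr>& candidates)
    {
        std::vector<ChunkServerPtr> result;
        for (ChunkServerPtr srv : candidates) {
            const bool alreadyHasReplica = std::find(
                chunk.replicas.begin(), chunk.replicas.end(), srv)
                != chunk.replicas.end();
            if (!alreadyHasReplica) {
                result.push_back(srv);          // only servers without a replica remain
            }
        }
        return result;
    }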

@shtephan (Author)

A similar issue happens for evacuation. We evacuated our entire cluster to another one that temporarily joined in order to upgrade disks, and we lost several hundred files just from the move.
Maybe someone can have a look at the selection algorithm for the target server when a chunk is evacuated.

@botsbreeder

I will add some more details to the issue above:
We were evacuating around a hundred disks, and chunks were re-replicating. The problem is that after replicating a chunk, the metaserver decided which replica to drop based on free space. The disks we were evacuating were getting more and more free space, so this is what was happening:
copy chunk from evacuating disk to new location -> delete chunk from new location (based on the free-space criterion)
The files were RS 6,3 encoded, which means replication is 1, so after the chunk is deleted there is no copy; the only way to restore it is RS recovery, but we were losing chunks faster than recovery could replace them.
So in a short period we were left with RS encoded files having many "holes": in this case ~20% of chunks were lost, and in many places we lost more than 3 chunks in a 9-chunk group, so RS was not able to recover.
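
To spell out the RS 6,3 failure condition above (a sketch of the arithmetic only, not QFS code): each group has 6 data chunks plus 3 recovery chunks, so a group survives at most 3 lost chunks.

    // Sketch of the Reed-Solomon 6+3 recoverability rule described above:
    // a 9-chunk group (6 data + 3 recovery) tolerates at most 3 lost chunks.
    #include <iostream>

    constexpr int kDataStripes = 6;
    constexpr int kRecoveryStripes = 3;

    bool IsGroupRecoverable(int lostChunksInGroup) {
        return lostChunksInGroup <= kRecoveryStripes;
    }

    int main() {
        for (int lost = 0; lost <= kDataStripes + kRecoveryStripes; ++lost) {
            std::cout << "lost " << lost << " of 9 chunks: "
                      << (IsGroupRecoverable(lost) ? "recoverable" : "data loss")
                      << "\n";
        }
        return 0;
    }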

@mikeov (Contributor) commented May 31, 2020

In general it might not be trivial to interpret trace messages and draw the correct conclusion. If I remember right, there are multiple reasons for a chunk deletion to be issued and for the corresponding trace messages to appear in the meta server debug trace log.

For example, chunk replica deletion could be issued due to a replication or recovery failure. In such a case the deletion is issued in order to ensure that a possibly partial replica gets deleted.

I’d recommend starting by inspecting the chunk server log, looking for “media / disk” IO failures. Inspecting the chunk and meta server counters available in the meta server web UI might also help to diagnose the problem.

I do not recall encountering a problem similar to the one described here in the last few years, either on any of the actively used file systems with a few petabytes per day of read and write load (append with replication 2 and RS 6+3), or in end-to-end testing with and without failure injection.

It is possible that the problem is due to the specifics of the system configuration or use pattern.

For example, presently the file system has no continuous background verification mechanism that verifies all chunk replicas. The existing “chunk directory” / “disk” health check only periodically creates, writes, reads back, and deletes a small file. It is conceivable that the lack of such a mechanism might manifest itself as noticeable replication / recovery failures due to latent, undetected media read failures.
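
For reference, the periodic directory probe described above amounts to something like the following (a simplified sketch, not the actual QFS chunk server code; note that it exercises the disk but never reads back existing chunk replicas):

    // Simplified sketch of a create / write / read-back / delete health
    // probe for a chunk directory, as described above.
    #include <cstdio>
    #include <cstring>
    #include <string>

    bool ProbeChunkDirectory(const std::string& dir) {
        const std::string path = dir + "/.healthcheck.tmp";
        const char payload[] = "health-check";
        char buf[sizeof(payload)] = {0};

        FILE* f = std::fopen(path.c_str(), "wb");
        if (!f || std::fwrite(payload, 1, sizeof(payload), f) != sizeof(payload)) {
            if (f) std::fclose(f);
            return false;                       // write failure => directory unhealthy
        }
        std::fclose(f);

        f = std::fopen(path.c_str(), "rb");
        const bool readOk = f &&
            std::fread(buf, 1, sizeof(buf), f) == sizeof(buf) &&
            std::memcmp(buf, payload, sizeof(payload)) == 0;
        if (f) std::fclose(f);

        std::remove(path.c_str());
        return readOk;                          // existing chunk replicas are never verified here
    }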

Another possibility that comes to mind is that the replication / recovery parameters do not match the HW configuration’s capabilities / IO bandwidth, resulting in IO failures / timeouts. Though in such a case, after replication / recovery activity stops, recent enough code should be able to re-discover replicas that became unavailable due to IO load.

If the problem is due to a meta server replica / layout management bug, then perhaps the ideal way would be to try to reproduce it in a simple / small (3 chunk server nodes, for example) / controlled test setup. Existing test scripts can be used to create such a setup, for example src/test-scripts/run_endurance_mc.sh.

@avodaniel

I created a separate repository based on 2.2.1 (https://github.com/hagrid-the-developer/qfs/) that tries to simulate the issue and also to fix it (but probably only partly).

It contains a Dockerfile that runs the build and part of the tests. It runs recoverytest.sh with 3 chunk servers, not 2. Finally it runs test-lost-chunks.sh, which:

  • Creates 25 files.
  • Runs evacuation of 127.0.0.1:20400
  • Waits some time (100s)
  • Prints files in chunk directories.

I tried to run docker build . -t qfstest:v2.2.1 with commit avodaniel/qfs@3528d7e, which tries to fix the issue, disabled.
I was investigating chunk 131217 from the file abc-010.xyz. The related logs are in the subdirectory failure (https://github.com/hagrid-the-developer/qfs/tree/fix-rebalancing-issue/failure); from failure/metaserver-drf.err:

      07-29-2020 17:51:39.752 INFO - (LayoutManager.cc:11084) starting re-replication: chunk: 131217 from: 127.0.0.1 20400 to: 127.0.0.1 20401 reason: evacuation
      ...
      07-29-2020 17:51:39.774 INFO - (LayoutManager.cc:11920) replication done: chunk: 131217 version: 1 status: 0 server: 127.0.0.1 20401 OK replications in flight: 9
      ...
      07-29-2020 17:51:39.774 DEBUG - (LayoutManager.cc:11503) re-replicate: chunk: <32,131217> version: 1 offset: 0 eof: 10485760 replicas: 2 retiring: 0 target: 1 rlease: 0 hibernated: 0 needed: -1
      07-29-2020 17:51:39.774 INFO - (LayoutManager.cc:12470) <32,131217> excludes: srv:  other: 3 all: 3 rack: other: 2 all: 2 keeping: 127.0.0.1 20400 20400 0.983188 discarding: 127.0.0.1 20401 20401 0.98691
      ...
      07-29-2020 17:51:39.774 DEBUG - (LayoutManager.cc:3713) -srv: 127.0.0.1 20400 chunk: 131217 version: 1 removed: 1
      ...
      07-29-2020 17:51:39.775 DEBUG - (LayoutManager.cc:3713) -srv: 127.0.0.1 20401 chunk: 131217 version: 1 removed: 1

It seems that the chunk was evacuated, but after the evacuation finished and the chunk was removed from the evacuated server, LayoutManager::CanReplicateChunkNow() was called and decided that the chunk had too many replicas and that the replica in the new destination should be removed, because the evacuated server had a lot of free space. Unfortunately, there is a short time window in which it is difficult to distinguish an evacuated server from a normal server. That's why, in commit avodaniel/qfs@3528d7e, I tried to introduce another set of chunks that should be or are being evacuated but haven't been removed yet.

The issue doesn't happen every time; on my laptop with macOS it happened about 50% of the time.
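
A minimal sketch of the extra bookkeeping described above, with hypothetical names (EvacuationTracker and its methods are not the interface used in avodaniel/qfs@3528d7e): chunks whose source replica still sits on an evacuating server are tracked, so the replica-count check does not treat the freshly created destination replica as surplus.

    // Sketch of a "chunks still pending evacuation removal" set, with
    // hypothetical names; not the actual patch in avodaniel/qfs@3528d7e.
    #include <cstdint>
    #include <set>

    using chunkId_t = int64_t;              // chunk id type, assumed here

    class EvacuationTracker {
    public:
        // Called when re-replication away from an evacuating source starts.
        void MarkPendingEvacuation(chunkId_t chunkId) {
            mPending.insert(chunkId);
        }
        // Called once the replica on the evacuated server has actually been removed.
        void EvacuationRemovalDone(chunkId_t chunkId) {
            mPending.erase(chunkId);
        }
        // The replica-count check would consult this: while the old replica is
        // still pending removal, the chunk must not be treated as over-replicated.
        bool IsPendingEvacuation(chunkId_t chunkId) const {
            return mPending.count(chunkId) != 0;
        }
    private:
        std::set<chunkId_t> mPending;
    };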

@mikeov (Contributor) commented Aug 5, 2020 via email

@avodaniel

@mikeov I tried to apply the change 1b5a3f5 to my branch avodaniel/qfs@b3963db, and it seems that the issue is still present. It also seems that wait is always 0. Related logs from one test run are in https://github.com/hagrid-the-developer/qfs/tree/fix-rebalancing-issue/failure.002; the important part of metaserver-drf.err:

    08-05-2020 20:26:43.563 INFO - (LayoutManager.cc:11084) starting re-replication: chunk: 131217 from: 127.0.0.1 20400 to: 127.0.0.1 20402 reason: evacuation
    ...
    08-05-2020 20:26:43.592 DEBUG - (ChunkServer.cc:2199) 127.0.0.1 20402 cs-reply: -seq: 1805093697390353841 log: 0 0 3891 status: 0  replicate chunk: 131217 version: 1 file: 32 fileSize: -1 path:  recovStripes: 0 seq: 1805093697390353841 from: 127.0.0.1 20400 to: 127.0.0.1 20402
    ...
    08-05-2020 20:26:43.593 INFO - (LayoutManager.cc:11936) replication done: chunk: 131217 version: 1 status: 0 server: 127.0.0.1 20402 OK replications in flight: 9
    08-05-2020 20:26:43.593 DEBUG - (LayoutManager.cc:11517) re-replicate: chunk: <32,131217> version: 1 offset: 0 eof: 10485760 replicas: 2 retiring: 0 target: 1 rlease: 0 wait: 0 hibernated: 0 needed: -1
    08-05-2020 20:26:43.593 INFO - (LayoutManager.cc:12486) <32,131217> excludes: srv:  other: 3 all: 3 rack: other: 3 all: 3 keeping: 127.0.0.1 20400 20400 0.889967 discarding: 127.0.0.1 20402 20402 0.894062
    ...
    08-05-2020 20:26:43.594 DEBUG - (LayoutManager.cc:3741) CLIF done: status: 0  down: 0 log-chunk-in-flight: 127.0.0.1 20400 logseq: 0 0 3954 type: STL chunk: -1 version: 0 remove: 1 chunk-stale-notify: sseq: -1 size: 1 ids: 131217
    08-05-2020 20:26:43.594 DEBUG - (LayoutManager.cc:3713) -srv: 127.0.0.1 20400 chunk: 131217 version: 1 removed: 1
    ...
    08-05-2020 20:26:43.595 DEBUG - (LayoutManager.cc:3741) CLIF done: status: 0  down: 0 log-chunk-in-flight: 127.0.0.1 20402 logseq: 0 0 3957 type: DEL chunk: 131217 version: 0 remove: 1 meta-chunk-delete: chunk: 131217 version: 0 staleId: 0
    08-05-2020 20:26:43.595 DEBUG - (LayoutManager.cc:3713) -srv: 127.0.0.1 20402 chunk: 131217 version: 1 removed: 1

@mikeov (Contributor) commented Aug 12, 2020 via email

@avodaniel

Thanks @mikeov, it seems it really helped. We will do more tests on a larger cluster.

@mikeov (Contributor) commented Aug 14, 2020 via email

@avodaniel

It seems that chunks really don't get lost now. But sometimes evacuation is really slow. Usually all chunks are moved quickly and then it takes a very long time to transfer the last chunk: there are long delays between copies of its pieces (like 5 minutes). And sometimes it takes a long time until the file evacuation is changed to evacuation.done. Is this expected? It is inconvenient but acceptable. We use QFS from master up to and including commit cb3e898.
