chunk allocation with rack awareness #235
Comments
From qfsfsck output:
Rack IDs outside the range 0 to 65535 are considered invalid and are ignored by the chunk placement logic. Presently, only rack IDs specified by the metaServer.rackPrefixes parameter are validated, and an error message is emitted when a rack ID is outside the valid range. If all chunk server rack IDs are outside the valid range, FSCK will report all chunks as having no rack ID assigned. The present design assumes that the number of racks (failure groups) is reasonably small, less than 100 or so. I'd recommend using one chunk server per physical node / host, with an adequate number of network IO ("client") and disk IO threads. By default the number of IO threads is 2 per chunk directory / IO device / "disk".
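The validity rule described above can be sketched as follows. This is not QFS source code, just a minimal illustration of the stated behavior: IDs outside 0 to 65535 are treated as if no rack were assigned.

```python
# Sketch of the rack ID validity rule described above (illustrative,
# not the actual QFS implementation). Valid range per the comment: 0..65535.
MIN_RACK_ID = 0
MAX_RACK_ID = 65535

def effective_rack_id(rack_id):
    """Return the rack ID if it is in the valid range, else None
    (meaning the placement logic treats the server as having no rack)."""
    if MIN_RACK_ID <= rack_id <= MAX_RACK_ID:
        return rack_id
    return None

print(effective_rack_id(42))      # in range: kept as-is
print(effective_rack_id(681000))  # out of range: treated as no rack
```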
Thank you @mikeov, this is very useful! We are going to fix the rack IDs in the chunkserver config and see how the chunk placement goes.
Closing this as resolved.
We have a ~25PB qfs 2.0.0 cluster with rackId configured on the chunkservers. Our physical servers have several disks, so we run multiple chunkservers per physical server. For this reason, each physical server has a unique rackId.
We allocate one primary + one replica for every chunk. The goal is that every chunk should survive a complete failure of any physical server.
But somehow both the primary and the replica chunk end up on the same physical server, i.e. the same rackId.
This is our metaserver config:
and this is a chunkserver config:
So for instance, other chunkservers on this exact same physical server also have the 681000 rackId. So far it has happened several times that a physical server died and we lost chunks, because both copies were on the same physical server, although assigned to different chunkservers within that same physical server. Could you please take a look at our configs and see if we are doing something wrong?
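A plausible explanation, given the valid range quoted earlier in this thread: 681000 exceeds 65535, so every rack ID in this cluster would be invalid, and rack-aware placement would silently degrade to rack-unaware placement. A hypothetical sketch (server names are invented for illustration; the rack IDs are the ones reported above):

```python
# Hypothetical illustration: if every chunk server's rack ID is out of the
# 0..65535 range, the placement logic sees all servers as having "no rack",
# so nothing prevents both replicas from landing on the same physical server.
MIN_RACK_ID = 0
MAX_RACK_ID = 65535

# Invented server names; rack IDs like 681000 are from the report above.
servers = {
    "cs1-on-node-a": 681000,
    "cs2-on-node-a": 681000,
    "cs1-on-node-b": 682000,
}

def placement_rack(rack_id):
    """Rack as seen by placement: None when the ID is out of range."""
    return rack_id if MIN_RACK_ID <= rack_id <= MAX_RACK_ID else None

racks = {name: placement_rack(rid) for name, rid in servers.items()}
# Every server maps to None, i.e. distinct physical servers are
# indistinguishable to rack-aware placement.
print(racks)
```

Under this reading, remapping each physical server to a unique rack ID within 0..65535 would restore the intended "one failure group per physical server" behavior.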