
unsuccessful_pools prevents rebalancing when new nodes are added #18

Closed
jmamma opened this issue Oct 5, 2022 · 3 comments

jmamma commented Oct 5, 2022

I added two new storage nodes to a cluster that had been completely balanced by placementoptimizer.py.

I then ran upmap-remapped.py to set all PGs to active+clean:
https://github.com/HeinleinSupport/cern-ceph-scripts/blob/master/tools/upmap/upmap-remapped.py

I expected subsequent runs of placementoptimizer.py to shift data to the new nodes. Instead, the output looks like this:

[2022-10-05 14:40:45,104]  BAD => osd.350 already has too many of pool=1 (84 >= 79.38328132674108)
[2022-10-05 14:40:45,104] TRY-1 move 1.33d3 osd.154 => osd.154 (1256/1256)
[2022-10-05 14:40:45,104]  BAD move to source OSD makes no sense
[2022-10-05 14:40:45,104] SKIP pg 1.1a18 since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.3a7e since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.7d8 since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.2d59 since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.152d since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.1b63 since pool (1) can't be balanced more
...

The entire pool is blacklisted and no further optimisations are found.

Disabling this check fixes the issue:

if pg_pool in unsuccessful_pools:

I'm wondering: what is the purpose of unsuccessful_pools?
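For context, the check sits near the top of the per-PG loop. My understanding of the flow is roughly the following (my own simplified sketch, helper names are made up, not the actual placementoptimizer.py code):

    # simplified sketch of the skip logic, not the real code
    unsuccessful_pools = set()

    for pg in movement_candidates:            # name made up
        pg_pool = pool_of_pg(pg)              # hypothetical helper

        # once a pool is marked unsuccessful, every remaining pg of it is skipped,
        # which is why the whole pool ends up "blacklisted" after one unplaceable pg
        if pg_pool in unsuccessful_pools:
            print(f"SKIP pg {pg} since pool ({pg_pool}) can't be balanced more")
            continue

        # ... otherwise try to find a target osd for this pg ...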


TheJJ commented Nov 13, 2022

to quote the code comment:

                    # we tried all osds to place this pg,
                    # so the shardsize is just too big
                    # if pg_size_choice is auto, we try to avoid this PG anyway,
                    # but if we still end up here, it means the choices for moves are really
                    # becoming tight.
                    unsuccessful_pools.add(pg_pool)

so we assume that the shardsize is the problem and that's why we couldn't place the pg.
we only end up at that statement when not a single target osd was able to take the pg,
so other pgs of the same pool should also have no chance of getting a valid destination.
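
roughly, the logic around that comment looks like this (a sketch from memory, helper names made up, not the exact code):

    # sketch of how a pool ends up in unsuccessful_pools (simplified, from memory)
    for osd_to in candidate_osds(pg):         # hypothetical helper
        if osd_fits(pg, osd_to):              # usage/limit checks, hypothetical helper
            plan_move(pg, osd_to)
            break
    else:
        # no target osd could take this pg -> its shard is too big,
        # so we give up on the whole pool for this run
        unsuccessful_pools.add(pg_pool)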

i wonder why, in this scenario, disabling that check allows finding a destination for other pgs of the same pool :)

please update to the latest git version and try again.

if it fails, can you send me another state file so i can try to reproduce locally?


TheJJ commented May 8, 2023

Still interested in an analysis? If so, please send a state dump :)


TheJJ commented Oct 27, 2023

I'll close this for now - please notify me if you have another shot to reproduce with a state file.

TheJJ closed this as not planned on Oct 27, 2023.