
unsuccessful_pools prevents rebalancing when new nodes are added #18

Closed
jmamma opened this issue Oct 5, 2022 · 3 comments

jmamma commented Oct 5, 2022

I added two new storage nodes to a cluster that had been completely balanced by placementoptimizer.py.

I then ran upmap-remapped.py to set all PGs to active+clean:
https://github.com/HeinleinSupport/cern-ceph-scripts/blob/master/tools/upmap/upmap-remapped.py

I expected subsequent runs of placementoptimizer.py to shift data to the new nodes. Instead, the output looks like this:

[2022-10-05 14:40:45,104]  BAD => osd.350 already has too many of pool=1 (84 >= 79.38328132674108)
[2022-10-05 14:40:45,104] TRY-1 move 1.33d3 osd.154 => osd.154 (1256/1256)
[2022-10-05 14:40:45,104]  BAD move to source OSD makes no sense
[2022-10-05 14:40:45,104] SKIP pg 1.1a18 since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.3a7e since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.7d8 since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.2d59 since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.152d since pool (1) can't be balanced more
[2022-10-05 14:40:45,104] SKIP pg 1.1b63 since pool (1) can't be balanced more
...

The entire pool is blacklisted and no further optimisations are found.

Disabling this check fixes the issue:

if pg_pool in unsuccessful_pools:

I'm wondering: what is the purpose of unsuccessful_pools?
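For context, the check sits near the top of the per-PG loop. My understanding of the flow is roughly the following (my own simplified sketch, helper names are made up, not the actual placementoptimizer.py code):

    # simplified sketch of the skip logic, not the real code
    unsuccessful_pools = set()

    for pg in movement_candidates:            # name made up
        pg_pool = pool_of_pg(pg)              # hypothetical helper

        # once a pool is marked unsuccessful, every remaining pg of it is skipped,
        # which is why the whole pool ends up "blacklisted" after one unplaceable pg
        if pg_pool in unsuccessful_pools:
            print(f"SKIP pg {pg} since pool ({pg_pool}) can't be balanced more")
            continue

        # ... otherwise try to find a target osd for this pg ...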


TheJJ commented Nov 13, 2022

to quote the code comment:

                    # we tried all osds to place this pg,
                    # so the shardsize is just too big
                    # if pg_size_choice is auto, we try to avoid this PG anyway,
                    # but if we still end up here, it means the choices for moves are really
                    # becoming tight.
                    unsuccessful_pools.add(pg_pool)

so we assume that the shardsize is the problem and that's why we couldn't place the pg.
we only end up at that statement when not a single target osd was able to take the pg,
so other pgs of the same pool should also have no chance of getting a valid destination.
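
roughly, the logic around that comment looks like this (a sketch from memory, helper names made up, not the exact code):

    # sketch of how a pool ends up in unsuccessful_pools (simplified, from memory)
    for osd_to in candidate_osds(pg):         # hypothetical helper
        if osd_fits(pg, osd_to):              # usage/limit checks, hypothetical helper
            plan_move(pg, osd_to)
            break
    else:
        # no target osd could take this pg -> its shard is too big,
        # so we give up on the whole pool for this run
        unsuccessful_pools.add(pg_pool)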

i wonder why, in this scenario, disabling that check allows finding a destination for other pgs of the same pool :)

please update to the latest git version and try again.

if it fails, can you send me another state file so i can try to reproduce locally?


TheJJ commented May 8, 2023

Still interested in an analysis? If so, please send a state dump :)


TheJJ commented Oct 27, 2023

I'll close this for now - please notify me if you have another shot to reproduce with a state file.

TheJJ closed this as not planned on Oct 27, 2023.