Depending on the BLAS you have, there are myriad optimal sizes. [For example](scipy/scipy#3144 (comment)),
the optimal length under one implementation is any of 2^a 3^b 5^c 7^d 11^e 13^f,
where e+f is either 0 or 1, and the other exponents are arbitrary. It's just a
matter of how the implementation chooses to carve up subproblems. Splitting in
halves, which makes powers of 2 fast, is traditional, but it is not the only choice.
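
To make that size rule concrete, here is a rough sketch in plain Python that checks whether a length has the form 2^a 3^b 5^c 7^d 11^e 13^f with e+f at most 1, and searches upward for the next length that does. The names `is_fast_size` and `next_fast_size` are made up for illustration and are not part of any library; the search is brute force, not how a real implementation would pick sizes.

```python
def is_fast_size(n: int) -> bool:
    """True if n == 2^a 3^b 5^c 7^d 11^e 13^f with e + f in {0, 1}."""
    if n < 1:
        return False
    # Strip every factor of 2, 3, 5 and 7 (their exponents are arbitrary).
    for p in (2, 3, 5, 7):
        while n % p == 0:
            n //= p
    # What is left may be nothing, or a single factor of 11 or 13.
    return n in (1, 11, 13)


def next_fast_size(n: int) -> int:
    """Smallest length >= n that satisfies is_fast_size (brute-force search)."""
    n = max(n, 1)
    while not is_fast_size(n):
        n += 1
    return n
```

Under this rule a prime length like 1021 would be padded to 1024, while 143 = 11 * 13 is rejected (e + f = 2) and rounds up to 144. Whether padding actually pays off depends on the implementation you are linked against.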