I recently wrote a parallelized implementation of the Mann-Whitney U test for my own use (gist is here). For the kind of testing we tend to do in scRNA-seq (many features, stored as 2D arrays), the speedup scales roughly with the number of cores you can throw at it. When you're running a lot of tests, this is very nice!
Given that scanpy already depends on numba, this would be a pretty simple thing to add, if you'd like to. Thought I would just point it out!
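
To make the idea concrete, here is a minimal sketch of a row-wise Mann-Whitney U test parallelized over features with numba's `prange`. It is not the code from the gist and not anything in scanpy: the function name `mannwhitneyu_rows` and the `(n_features, n_obs)` array layout are assumptions made for illustration, and the p-value uses the two-sided normal approximation with tie correction and no continuity correction. If numba is not installed, the sketch falls back to a plain serial loop.

```python
import math
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # fallback so the sketch still runs (serially) without numba
    def njit(*args, **kwargs):
        return lambda f: f
    prange = range


@njit(parallel=True)
def mannwhitneyu_rows(x, y):
    """Two-sided Mann-Whitney U test of each row of x against the same row of y.

    x: (n_features, n1) float array, y: (n_features, n2) float array
    (hypothetical layout). Returns the U statistic of x and a p-value per row.
    """
    n_features, n1 = x.shape
    n2 = y.shape[1]
    n = n1 + n2
    u = np.empty(n_features)
    p = np.empty(n_features)
    for f in prange(n_features):  # each feature is an independent test
        pooled = np.empty(n)
        pooled[:n1] = x[f]
        pooled[n1:] = y[f]
        order = np.argsort(pooled)
        rank_sum_x = 0.0
        tie_term = 0.0
        i = 0
        while i < n:
            # find the run [i, j] of tied values in sorted order
            j = i
            while j + 1 < n and pooled[order[j + 1]] == pooled[order[i]]:
                j += 1
            avg_rank = 0.5 * (i + j) + 1.0  # average of 1-based ranks i+1..j+1
            t = j - i + 1
            tie_term += t * t * t - t
            for k in range(i, j + 1):
                if order[k] < n1:  # this observation came from x
                    rank_sum_x += avg_rank
            i = j + 1
        u1 = rank_sum_x - n1 * (n1 + 1) / 2.0
        var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
        u[f] = u1
        if var > 0.0:
            z = (u1 - n1 * n2 / 2.0) / math.sqrt(var)
            p[f] = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
        else:
            p[f] = 1.0  # all values tied: no evidence of a difference
    return u, p
```

Calling `u, p = mannwhitneyu_rows(group_a, group_b)` on two float64 arrays with matching numbers of rows returns one statistic and one p-value per feature. The results should be comparable to looping `scipy.stats.mannwhitneyu(..., use_continuity=False, method="asymptotic")` over rows, which is a reasonable way to sanity-check a sketch like this before relying on it.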