We've come a long way already, but unfortunately we're still only at 100 pps (playouts per second) per thread on 19x19. Judging from the numbers other people get, we should aim to be at least 10x faster.
There are many things to try, but most of them boil down to doing less work. It seems to be especially important to count liberties less often. See for example this thread on the computer-go mailing list.
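To make "doing less work" a bit more concrete, here is a minimal sketch of one possible way to cut down on liberty counting: stop as soon as enough liberties have been found. A playout heuristic usually only needs to know whether a group is in atari or has two liberties, not the exact count. This is not the engine's code. The board representation (a list of strings) and the names `count_liberties`, `neighbours` and `limit` are all assumptions made up for this illustration.

```python
# Illustrative sketch only: the board layout and all names are hypothetical,
# not taken from the engine discussed in this post.
EMPTY = "."


def neighbours(point, size):
    """Yield the on-board orthogonal neighbours of a (row, col) point."""
    row, col = point
    for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        n_row, n_col = row + d_row, col + d_col
        if 0 <= n_row < size and 0 <= n_col < size:
            yield (n_row, n_col)


def count_liberties(board, start, limit=None):
    """Count the liberties of the group containing `start`.

    If `limit` is given, the flood fill stops as soon as `limit` distinct
    liberties have been seen and returns `limit`, so large safe groups
    are not walked in full.
    """
    size = len(board)
    colour = board[start[0]][start[1]]
    if colour == EMPTY:
        raise ValueError("no stone at the starting point")

    seen = {start}      # stones of the group visited so far
    liberties = set()   # distinct empty points adjacent to the group
    stack = [start]
    while stack:
        point = stack.pop()
        for n in neighbours(point, size):
            value = board[n[0]][n[1]]
            if value == EMPTY:
                liberties.add(n)
                if limit is not None and len(liberties) >= limit:
                    return limit  # early exit: we know enough
            elif value == colour and n not in seen:
                seen.add(n)
                stack.append(n)
    return len(liberties)


board = [
    ".....",
    ".XX..",
    ".XO..",
    ".....",
    ".....",
]
full = count_liberties(board, (1, 1))              # walks the whole group
capped = count_liberties(board, (1, 1), limit=3)   # stops at 3 liberties
```

With `limit=3`, the result is less than 3 exactly when the group has one or two liberties, which is the only case where an atari check has to look any closer. Whether this helps in practice depends on how groups are stored. An engine that tracks liberties incrementally would not need a flood fill at all.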
Version 0.3.1 has the following performance characteristics:
Running full_uct_cycle_19x19 through a profiler, it seems that most of the time is spent in fix_atari, both during the playouts and when calculating the priors.
Of course it is, it's reading out ladders.
I will have to take a closer look at the numbers, but it seems that on 19x19 it causes a 4x slowdown, which is massive. I wonder if it's really worth it. I guess I will have to run the benchmarks. :)