The Session class extended ThreadedDict, which in turn extends threading.local. For thread-local objects, __init__ is called once in each thread that uses the object. This caused Session to add multiple processors to the application, one for each thread/request. Fixed by keeping the ThreadedDict as an attribute of Session instead of extending it.
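The per-thread __init__ behaviour can be reproduced in isolation. The following is a minimal sketch, not the actual Session code: InheritingSession, ComposingSession, and touch_from_threads are hypothetical names made up for illustration, and the lists stand in for "adding a processor to the application".

```python
import threading

inherit_inits = []  # one entry per InheritingSession.__init__ call
compose_inits = []  # one entry per ComposingSession.__init__ call


class InheritingSession(threading.local):
    """Hypothetical stand-in for the old design: subclasses threading.local."""

    def __init__(self):
        # Runs on creation and again in every other thread that
        # first touches this object.
        inherit_inits.append(threading.current_thread().name)


class ComposingSession(object):
    """Hypothetical stand-in for the fix: keeps the thread-local as an attribute."""

    def __init__(self):
        # Runs exactly once, in the thread that creates the object.
        compose_inits.append(threading.current_thread().name)
        self._data = threading.local()

    def set(self, key, value):
        setattr(self._data, key, value)


def touch_from_threads(session, touch, n=3):
    """Call touch(session) from n separate threads and wait for them."""
    threads = [threading.Thread(target=touch, args=(session,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


inheriting = InheritingSession()
composing = ComposingSession()
touch_from_threads(inheriting, lambda s: setattr(s, "user", "a"))
touch_from_threads(composing, lambda s: s.set("user", "a"))

print(len(inherit_inits), len(compose_inits))
```

With the inheriting variant, any side effect placed in __init__ is repeated for every new thread, which matches the duplicated processors described above. With composition, the side effect runs once while the data itself stays thread-local.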
Using threading.local instead of managing thread-local state manually improved performance by 50x. Also added a threadlocal implementation for Python 2.3, which doesn't have threading.local.
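For illustration only, a manually managed thread-local object might look like the sketch below. This is not the fallback added in this change; ManualLocal is a hypothetical name, and the code is written for current Python rather than 2.3. It keeps one dictionary per thread behind a lock, which is the kind of bookkeeping threading.local makes unnecessary.

```python
import threading


class ManualLocal(object):
    """Hypothetical hand-rolled thread-local: one attribute dict per thread."""

    def __init__(self):
        # Bypass our own __setattr__ so these live on the instance itself.
        object.__setattr__(self, "_store", {})
        object.__setattr__(self, "_lock", threading.Lock())

    def _current(self):
        # Keyed by Thread object. Entries are never removed in this sketch,
        # so data for finished threads stays in memory.
        key = threading.current_thread()
        with self._lock:
            return self._store.setdefault(key, {})

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. for per-thread names.
        try:
            return self._current()[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self._current()[name] = value

    def __delattr__(self, name):
        try:
            del self._current()[name]
        except KeyError:
            raise AttributeError(name)


local = ManualLocal()
local.user = "main"
seen = {}


def worker():
    # The value set in the main thread is not visible here.
    seen["has_user"] = hasattr(local, "user")
    local.user = "worker"
    seen["user"] = local.user


t = threading.Thread(target=worker)
t.start()
t.join()
```

Every attribute access in this sketch takes a lock and performs a dictionary lookup in Python code, whereas threading.local handles per-thread storage inside the interpreter.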