# Python's Infamous GIL
## removing it :)

Larry Hastings

> refer to `Python's Infamous GIL` (another talk) for a preface to this

# GIL ramifications
* easy to get right
* no deadlocks
* low overhead
  * touched infrequently
* single-threaded code is fast
* if it's i/o bound, everything is OKAY
* if it's CPU bound, you're out of luck
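
A minimal sketch of the CPU-bound case, assuming a stock (GIL) CPython build. `count_down`, `run_serial` and `run_threaded` are made-up names for illustration, not anything from the talk. On a GIL build the threaded run usually takes about as long as the serial one, because only one thread executes bytecode at a time; exact timings vary by machine, so none are shown.

```python
import threading
import time


def count_down(n):
    # Pure-Python CPU-bound loop: the running thread holds the GIL.
    while n > 0:
        n -= 1


def run_serial(n, workers):
    # Do the same work `workers` times, one after another.
    start = time.perf_counter()
    for _ in range(workers):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(n, workers):
    # Do the same work in `workers` threads at once.
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n, workers = 1_000_000, 4
    print(f"serial:   {run_serial(n, workers):.2f}s")
    print(f"threaded: {run_threaded(n, workers):.2f}s")
```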

This worked okay in 1992, but the world has changed

today, even our eyeglasses have gone multicore


python hasn't

# Your computer's resources

Python can use all of them!

(except the 8 cores)


# Previous GILectomy attempts
Python 1.4 (1999)
* no api changes!
* interpreter globals -> struct
* mutex lock on incr/decr
* made your code 4-7x slower!!

> see Beazley on GIL removal and the "patch of lore"

# Technical considerations
* reference counting
  * python counts how many times every object is referred to
  * multiple threads incrementing or decrementing the same ref counter at once is a race condition
  
* globals and static info
  * per thread
  * shared singletons (a string or module, all the small integers)
* C extension parallelism and reentrancy
  * currently python makes sure only one thread is running extension code at a time
* atomicity (of operations)
  * appending to lists and adding to dicts can happen from multiple threads
  * currently the GIL makes each of these atomic, so no thread sees an incomplete state
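
A small sketch of the reference counting point, assuming CPython. `sys.getrefcount` is a real stdlib call; `refcount_delta` and `lost_update` are hypothetical helpers written for these notes. The second function replays one bad interleaving by hand instead of using real threads, so the lost update shows up every time.

```python
import sys


def refcount_delta():
    # Binding a second name to an object adds one reference to it.
    obj = object()
    before = sys.getrefcount(obj)
    alias = obj
    after = sys.getrefcount(obj)
    del alias
    return after - before


def lost_update(start=1):
    # Hand-replayed interleaving of two unsynchronized increments.
    refcnt = start
    seen_by_a = refcnt      # thread A reads the counter
    seen_by_b = refcnt      # thread B reads the same value
    refcnt = seen_by_a + 1  # thread A writes back
    refcnt = seen_by_b + 1  # thread B overwrites: A's increment is lost
    return refcnt


print(refcount_delta())  # 1 on CPython
print(lost_update(1))    # 2, although two increments should give 3
```

With a counter that is too low, the object can be freed while something still points at it, which is why the increments and decrements have to become safe before the GIL can go.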
  
  
# Political considerations
* don't hurt single threaded performance
  * Guido decreed!
* don't break c extensions
  * this would be super painful
* don't make it too complicated
  * the current code is easy to work on
  * complexity discourages contribution
  


# Things that won't work

tracing garbage collection instead of reference counting
 * like Java, Go, D
 * will break c extensions
 * it's complicated

software transactional memory
* early development
* amazing performance
* will break c extensions
* you have to be a genius to understand it
* hard to get it right


# Larry's Suggestions
He already removed it... now he wants it to be faster

* keep reference counting
* atomic incr/decr
  * intel processors are optimized for this
  * 30% slower
* globals and statics
  * PyThreadState variable already has this covered
* shared singletons
  * shared... as they should be (so no problem here)
* c ext parallelism and reentrancy
  * no way to avoid breaking stuff here
  
* atomicity: 
  * locks everywhere!
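
A rough stand-in for the atomic incr/decr idea. Python has no user-level atomic integer, so this sketch uses a `threading.Lock` instead of the CPU's atomic instructions; that swap is an assumption of the example, and `LockedRefcount` and `run` are made-up names. The point is only that a synchronized counter never loses an update.

```python
import threading


class LockedRefcount:
    """Counter whose incr/decr cannot interleave (lock used in place of an atomic op)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def incr(self):
        with self._lock:
            self._value += 1

    def decr(self):
        with self._lock:
            self._value -= 1

    @property
    def value(self):
        with self._lock:
            return self._value


def run(num_threads=4, per_thread=10_000):
    counter = LockedRefcount()

    def worker():
        # Two increments and one decrement per round: net +1.
        for _ in range(per_thread):
            counter.incr()
            counter.incr()
            counter.decr()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```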
  
  
## Lock api
all mutable (as in C mutable) objects need to be locked

* str
  * hash (lazily computed) - this starts as -1... if multiple threads touch it, they'll race
  * same goes for:
    * utf8
    * wstr
    
* userspace locks
  * in linux, there's a `futex` primitive
  * windows: `CRITICAL_SECTION`
  * OS X: `pthread_mutex`
  * other platforms: who knows???
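
A sketch of the lazily computed hash from the `str` bullet, written in Python rather than C. `LazyHashString` is a hypothetical class, and the per-object `threading.Lock` is a placeholder for whichever userspace lock gets chosen. The cached value starts at -1 and is filled in once under the lock, so concurrent callers cannot both decide to compute it.

```python
import threading


class LazyHashString:
    _UNSET = -1  # sentinel meaning "hash not computed yet"

    def __init__(self, text):
        self.text = text
        self._hash = self._UNSET
        self._lock = threading.Lock()
        self.computations = 0  # bookkeeping for the demo only

    def cached_hash(self):
        with self._lock:
            if self._hash == self._UNSET:
                self.computations += 1
                h = hash(self.text)
                # Keep -1 reserved for the sentinel.
                self._hash = -2 if h == -1 else h
            return self._hash


s = LazyHashString("gil")
threads = [threading.Thread(target=s.cached_hash) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(s.computations)  # 1
```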
  
  
## On the Political constraints
* it could be fast, without breaking anything
  * make the "no GIL" version a different build
  * two entry points for C extensions
    * that way, non-GIL is opt-in
  * a single .so?

`PyType_Ready` - let's make this non-optional, to shed the backward compatibility stuff

> See also PEP 489

* Complexity: people have said if it's fast enough that's okay


# how to remove the GIL
0. atomic incr and decr
1. decide on what lock to use
2. lock dictobject (all entry points have to get locks)
3. lock listobject
4. lock 10 freelists (tuples, ints, etc)
5. disable garbage collector, track & untrack
6. murder the GIL (just go comment it out)
7. use TLS (thread-local storage) for the thread state - currently, it is just a single variable
8. fix the tests
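
Steps 2 and 7 above can be sketched in Python, with the caveat that the real work happens in C inside `dictobject` and the interpreter's thread state. `LockedDict`, `thread_state` and `demo` are invented for this example: every entry point of the dict takes the object's own lock, and each thread gets its own state record through `threading.local` instead of sharing one variable.

```python
import threading


class LockedDict:
    """Dict wrapper where every entry point takes the object's own lock."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def add(self, key, amount):
        # Read-modify-write under one lock, so no update is lost.
        with self._lock:
            self._data[key] = self._data.get(key, 0) + amount


# One state record per thread instead of a single shared variable.
_tls = threading.local()


def thread_state():
    state = getattr(_tls, "state", None)
    if state is None:
        state = {"name": threading.current_thread().name}
        _tls.state = state
    return state


def demo(num_threads=4, per_thread=1000):
    shared = LockedDict()

    def worker():
        for _ in range(per_thread):
            shared.add("hits", 1)
        # Store this thread's private state so the caller can inspect it.
        shared.set(threading.current_thread().name, thread_state())

    threads = [
        threading.Thread(target=worker, name=f"worker-{i}")
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared
```

One lock per object keeps unrelated dicts from contending with each other, at the cost of taking a lock on every single access.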

# Language summit benchmarks
~3.5x slower wall time (how long everything took to finish)

~25x slower CPU time (wall time summed over every core in use, so multiply wall clock by the number of busy cores)
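
The relation between the two numbers, as a quick calculation. The core count of 7 is an assumption chosen because it makes the figures line up; the notes do not record how many cores the benchmark used. `cpu_time_slowdown` is a made-up helper, and it assumes the GIL build keeps roughly one core busy.

```python
def cpu_time_slowdown(wall_slowdown, cores_used, baseline_cores=1):
    # CPU time is roughly wall time multiplied by the number of busy cores.
    return wall_slowdown * cores_used / baseline_cores


ASSUMED_CORES = 7  # assumption, not stated in the notes
print(cpu_time_slowdown(3.5, ASSUMED_CORES))  # 24.5
```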

> refer to slides for graphs


  

# Why is it so slow
1. lock contention
2. synchronization and cache misses
   * each time a ref counter is changed, the other cores' cached copy of it is invalidated
   * with no GIL, the code has to be done the "safe" multithreaded way (which is slower)
  
  
# Where to go now in the GILectomy branch?

* reference counting
  * buffered reference counting
    * unsynchronized in a single thread
  * immortal objects
    * no need to change the ref count
  * coalesced reference counting
* thread private locking
  * most of the time, an object never leaves the thread where it was created
    * optimize for this by having each object start out locked to the thread that created it
* garbage collection
  * stop-the-world
  * buffered track/untrack
  
* auto-lock around c extensions
  * like a private GIL just for them
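
One possible reading of buffered reference counting, sketched in Python. This is not the Gilectomy implementation; `BufferedRefcounts` and `demo` are hypothetical. Each thread appends its increments and decrements to a private buffer with no synchronization on the hot path, and the real counts change only when `commit` applies the buffers. The sketch assumes `commit` is called while the worker threads are quiet.

```python
import threading
from collections import Counter


class BufferedRefcounts:
    def __init__(self):
        self.counts = Counter()          # committed counts, touched only by commit()
        self._local = threading.local()  # holds each thread's private buffer
        self._buffers = []               # every buffer ever handed out
        self._buffers_lock = threading.Lock()

    def _buffer(self):
        buf = getattr(self._local, "buf", None)
        if buf is None:
            buf = []
            self._local.buf = buf
            # The lock is taken once per thread, not once per refcount change.
            with self._buffers_lock:
                self._buffers.append(buf)
        return buf

    def incref(self, name):
        self._buffer().append((name, +1))

    def decref(self, name):
        self._buffer().append((name, -1))

    def commit(self):
        # Apply and empty every buffer; call while workers are not running.
        with self._buffers_lock:
            for buf in self._buffers:
                for name, delta in buf:
                    self.counts[name] += delta
                buf.clear()


def demo(num_threads=4, increfs=1000, decrefs=400):
    rc = BufferedRefcounts()

    def worker():
        for _ in range(increfs):
            rc.incref("x")
        for _ in range(decrefs):
            rc.decref("x")

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    pending = rc.counts["x"]  # nothing committed yet
    rc.commit()
    return pending, rc.counts["x"]
```

The trade-off is that counts are stale until the next commit, so an object cannot be freed the instant its last reference goes away.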
  
  
# Final thoughts
This might not actually work

Won't be interesting until adding threads and cores actually makes it faster :)