`concurrent.futures`
=======================
Built-in concurrency made easy
-------------------------------------

### Matteo Guzzo

Python User Group Berlin - Oct 12 2017

# About me <img src="./images/me.jpg" width=100/>

### I'm a freelance data scientist

Passionate about machine learning and software development

Find me @:
- `guzzo.matteo@gmail.com`
- Twitter: `@_teoguso`
- GitHub: `teoguso`

# Prologue

I have **3 months of tweets** (100+ GB) stored on cloud storage that need to be polished and uploaded to the company's database **by next week**.

**Problem**: the data are stored in 15000+ gzipped JSON files, and sequential processing (i.e. using a Python script) would take ~15 days.

# What to do?

<img src="./images/concurrency_lookup.png" width=1024 />

<img src="./images/concurrent_stdlib.png" width=1024 />

<img src="./images/concurrent_stdlib_hl.png" width=1024 />

<img src="./images/concurrent_stdlib_closeup.png" width=720 />

# What is `concurrent.futures`?

- Part of the Python standard library since 3.2
- High-level API for writing concurrent code
- Uses the `threading` and `multiprocessing` modules
  under the hood

# What is `concurrent.futures`?

> The concurrent.futures module provides a high-level interface for asynchronously executing *callables*.  
  The asynchronous execution can be performed with threads, using *ThreadPoolExecutor*, or separate processes,
  using *ProcessPoolExecutor*.
  Both implement the same interface, which is defined by the abstract *Executor* class.
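A minimal sketch of that shared interface. The `word_count` callable and the `count_all` helper are hypothetical names made up for this illustration; only `Executor`, `ThreadPoolExecutor`, and `Executor.map` come from the standard library.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def word_count(text):
    """Hypothetical callable: count whitespace-separated words in a string."""
    return len(text.split())


def count_all(executor_cls, texts, max_workers=4):
    """Run word_count over texts with any Executor subclass."""
    assert issubclass(executor_cls, Executor)
    # The context manager waits for pending work and shuts the pool down on exit.
    with executor_cls(max_workers=max_workers) as executor:
        # Executor.map yields results in the same order as the inputs.
        return list(executor.map(word_count, texts))


counts = count_all(ThreadPoolExecutor, ["a b c", "hello world", ""])
print(counts)
```

Passing `ProcessPoolExecutor` instead should work the same way, provided the callable and its arguments are picklable and the call sits under an `if __name__ == "__main__":` guard on platforms that spawn new processes.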

# Case study

<img src="./images/rect5627.png" width="400"/>

# Case study

- Fetch data

- Process data (--> Produce output)

- Store (Upload?) output
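The three steps above can be sketched roughly as follows. Everything here is a stand-in: `fetch` builds a gzipped JSON payload in memory instead of downloading it, `process` only trims and lowercases text, and the "database" is a plain list. The file names are made up for the example.

```python
import gzip
import json
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(name):
    """Stand-in for a download: build a gzipped JSON payload in memory."""
    records = [{"file": name, "text": "  Hello  "}, {"file": name, "text": "world"}]
    return gzip.compress(json.dumps(records).encode("utf-8"))


def process(blob):
    """Decompress, parse, and tidy the records."""
    records = json.loads(gzip.decompress(blob).decode("utf-8"))
    return [{**r, "text": r["text"].strip().lower()} for r in records]


def handle_file(name):
    """Fetch and process one file; return the cleaned records."""
    return process(fetch(name))


def run(names, max_workers=8):
    stored = []  # stand-in for the database upload
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its file name for bookkeeping.
        futures = {executor.submit(handle_file, n): n for n in names}
        for future in as_completed(futures):
            # result() re-raises any exception raised inside the worker.
            stored.extend(future.result())
    return stored


rows = run(["part-0.json.gz", "part-1.json.gz", "part-2.json.gz"])
print(len(rows))
```

`as_completed` hands back futures in the order they finish, so storing can start while slower files are still being fetched.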

# See accompanying notebook

`visualization.ipynb`

# Summary

<img src="./images/summary.png" width=720/>

# Threads

- Can massively parallelize pure I/O work
- Lightweight and cheap (shared memory)!
- Live in the process where they are spawned (limited by the GIL)
- Not helpful for CPU-bound work
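A small sketch of the I/O case, assuming `time.sleep` is an acceptable stand-in for a blocking network or disk call (it releases the GIL while waiting, as real blocking I/O does). The `fake_io` and `timed` helpers are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(delay):
    """Stand-in for a blocking I/O call."""
    time.sleep(delay)
    return delay


def timed(func):
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


delays = [0.1] * 8


def sequential_run():
    return [fake_io(d) for d in delays]


def threaded_run():
    with ThreadPoolExecutor(max_workers=8) as executor:
        return list(executor.map(fake_io, delays))


sequential, t_seq = timed(sequential_run)
threaded, t_thr = timed(threaded_run)
print(f"sequential: {t_seq:.2f}s, threaded: {t_thr:.2f}s")
```

The threaded run should take roughly as long as a single call, because the eight waits overlap. Replacing `fake_io` with a CPU-heavy loop would not be expected to show a comparable gain in CPython.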

# Processes

- Parallelize CPU and I/O work
- Heavy on memory (each process has a copy of the program)
- Message-passing overhead
- Need serializable objects
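The "serializable" requirement comes from the fact that work sent to a process pool travels through `pickle`. A quick way to check an object in advance, using a hypothetical `is_picklable` helper:

```python
import math
import pickle


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, as process pools require."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# Importable functions are pickled by reference to their module.
print(is_picklable(math.sqrt))
# Lambdas have no importable name, so they cannot be sent to a worker process.
print(is_picklable(lambda x: x * x))
# Plain data structures are fine.
print(is_picklable({"id": 1, "text": "hi"}))
```

In practice this means defining worker functions at module level rather than as lambdas or nested functions.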

# Concurrency - Pros

- Massive speedup possible (depending on available resources)
- Extreme ease of use with `concurrent.futures`

# Concurrency - Cons

- Code has to be deconstructed/restructured
- Complexity increases
- Debugging becomes harder

# Takeaways

# Less is more

Stick with pure sequential Python for as long as you can

and you'll have a happy life :-D

# Don't reinvent the wheel

Dedicated libraries often do a very good job
and there are plenty of them for Python

# Takeaways

- Concurrency is cool but often non-trivial
- Stick with pure sequential Python as long as you can
- Profile your code before doing anything else
- Use the right tool for the job
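One way to follow the profiling advice is the standard library's `cProfile`. The sketch below profiles a made-up `workload` function; both `slow_sum` and `workload` are hypothetical placeholders for real code.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately slow placeholder for real work."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload():
    return [slow_sum(50_000) for _ in range(20)]


profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Collect the report in a string instead of printing straight to stdout.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

If the report shows most of the time spent waiting on I/O, threads may help. If it is spent in pure-Python computation, processes or a compiled extension are more likely candidates.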

# But I really need to parallelize a lot

Then you'll need:

- Dask https://dask.pydata.org  
  (massively distributed computing, out-of-core data processing) 
- PySpark https://spark.apache.org/docs/0.9.0/python-programming-guide.html  
  (do you really have to?)

- Asynchronous code (`asyncio`, `curio`, ...)
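For comparison, a minimal `asyncio` sketch of overlapping waits in a single thread. The `fake_request` coroutine is a hypothetical stand-in that just sleeps; real code would await a network call instead.

```python
import asyncio


async def fake_request(name, delay):
    """Stand-in for an awaitable network call."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def main():
    # gather runs the coroutines concurrently and keeps the argument order.
    return await asyncio.gather(
        fake_request("a", 0.05),
        fake_request("b", 0.01),
        fake_request("c", 0.03),
    )


results = asyncio.run(main())
print(results)
```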

# I really don't want to mess with concurrency, but I need faster code

- Numba (http://numba.pydata.org)

- Cython (http://cython.org/)

# Further reading

[Efficiently Exploiting Multiple Cores with Python](http://python-notes.curiousefficiency.org/en/latest/python3/multicore_python.html)

A more in-depth look at the hows and whys  
of concurrency in Python

# Repeat after me...

`import this`

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
