In [1]:
import os 

os.chdir("../../scripts/threading_examples")


## Thread-based Concurrency

A thread refers to a thread of execution in a computer program. 

Each program is a process and has at least one thread that executes instructions for that process. 

When we run a Python script, it starts an instance of the Python interpreter (a process) that runs our code in the main thread. The main thread is the default thread of a Python process. 

The underlying operating system controls how new threads are created, when threads are executed, and which CPU core executes them.

### Problem

Sometimes we may need to create and start new threads to run additional tasks concurrently, rather than wait until the main thread finishes first. 

### Solution

A task can be run in a new thread by manually creating an instance of the `threading.Thread` class and specifying the function to run in the new thread via the `target` argument:

In [2]:
!python3 manual_thread_creation.py


This is another thread


Running the script above (a Python program, and therefore a process) creates an instance of the `Thread` class configured to run the `task()` function.

The thread is then started by calling the `start()` method on the instance. `start()` itself returns immediately; the main thread blocks only when it waits for the new thread, typically by calling `join()`. Blocked means that execution of the main thread is suspended at that point: the operating system puts the thread to sleep until the new thread terminates, leaving the processor free for other threads.

The new thread executes the `task()` function before terminating. The main thread then continues on and the program ends.
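
The contents of `manual_thread_creation.py` are not shown here. A minimal sketch of what such a script might look like, assuming a `task()` function that prints the message above (the `results` list and the thread name `worker-1` are illustrative additions, not taken from the script):

```python
import threading

results = []  # shared list the new thread appends to (illustrative)

def task():
    # Work that runs in the new thread
    results.append(threading.current_thread().name)
    print("This is another thread")

# Create the thread and point it at task() via the target argument
thread = threading.Thread(target=task, name="worker-1")
thread.start()  # returns immediately; task() now runs in the new thread
thread.join()   # block the main thread until the new thread terminates
print("Main thread continues")
```

Without the call to `join()`, the main thread would carry on immediately after `start()`, and the two `print` calls could appear in either order.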

## Thread Pool

Managing threads manually is inefficient, since frequently creating and destroying many threads is computationally expensive.

Instead, we would prefer to keep worker threads around for reuse if we expect to run many ad hoc tasks throughout our program, which can be achieved using a thread pool.

*A thread pool is a programming pattern for automatically managing a pool of worker threads*. The pool is responsible for a fixed number of threads, where each thread in the pool is called a worker.

* It controls when the threads are created, such as when they are needed.
  
* It controls how many tasks each worker thread can execute before being replaced.
  
* It also controls what the workers should do when they are not being used, such as making them *wait* (i.e., waiting for a signal from another thread) without consuming computational resources.
  
The pool can provide a generic interface for executing ad hoc tasks with a variable number of arguments, much like the `target` argument of the `Thread` class, but it does not require that we choose a thread to run the task, start the thread, or wait for the task to complete.

<p align="center">
  <img width="650" height="400" src="../../doc/images/thread_pool.png">
</p>

### The `concurrent` Package


The `concurrent.futures` module provides a high-level interface for asynchronously executing callables. The `ThreadPoolExecutor` class extends the `Executor` class, and its `submit()` method returns a `Future` object. Note that `Executor` is an abstract class that provides methods to execute calls asynchronously; it should not be used directly, but through its concrete subclasses.

#### The `Executor` Class

The `Executor` class has three methods:

- `submit(fn, /, *args, **kwargs)` – dispatch a function to be executed as `fn(*args, **kwargs)` and return a `Future` object. 
  
- `map(func, *iterables, timeout=None, chunksize=1)` – execute a function asynchronously for each element in an iterable.

  - Similar to `map(func, *iterables)` except:

    - the iterables are collected immediately rather than lazily.

    - `func` is executed asynchronously and several calls to func may be made concurrently.
  
- `shutdown(wait=True, *, cancel_futures=False)` – shut down the executor. Calls to `Executor.submit()` and `Executor.map()` made after shutdown will raise `RuntimeError`.
  
Creating an instance of the `ThreadPoolExecutor` class gives us a concrete `Executor` that is ready to accept tasks through `submit()` and `map()`.
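
The three methods can be sketched together as follows. The `power()` function is a hypothetical task used only for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def power(base, exponent=2):
    # Hypothetical task used only for illustration
    return base ** exponent

pool = ThreadPoolExecutor(max_workers=2)

# submit() schedules one call and returns a Future immediately
future = pool.submit(power, 3, exponent=3)
cube = future.result()  # blocks until the call finishes

# map() schedules one call per element and yields results in input order
squares = list(pool.map(power, [1, 2, 3, 4]))

# shutdown() waits for pending work (wait=True) and releases the threads
pool.shutdown(wait=True)

# Submitting after shutdown raises RuntimeError
try:
    pool.submit(power, 2)
    rejected = False
except RuntimeError:
    rejected = True

print(cube, squares, rejected)
```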


#### The `Future` Class

The `Future` class encapsulates the asynchronous execution of a callable. `Future` instances are created by `Executor.submit()` and should not be created directly except for testing. The `Future` class has four important methods:

- `result(timeout=None)` – return the result of an asynchronous operation. If the call hasn’t yet completed then this method will wait up to `timeout` seconds. If the call hasn’t completed in `timeout` seconds, then a `concurrent.futures.TimeoutError` will be raised. If `timeout` is not specified or `None`, there is no limit to the wait time.
  
- `exception(timeout=None)` – return the exception raised by the asynchronous operation, or `None` if it completed without raising.

- `cancel()` – attempt to cancel the call. If the call is currently being executed or has finished running, it cannot be cancelled and the method returns `False`; otherwise the call is cancelled and the method returns `True`.

- `done()` – return `True` if the call was successfully cancelled or finished running.
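
A minimal sketch of these methods, assuming a pool with a single worker so that the second task is still queued when it is cancelled. `slow_square()` and `fail()` are hypothetical tasks:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_square(x):
    # Hypothetical task that takes a moment to finish
    time.sleep(0.2)
    return x * x

def fail():
    # Hypothetical task that always raises
    raise ValueError("something went wrong")

with ThreadPoolExecutor(max_workers=1) as pool:
    running = pool.submit(slow_square, 4)  # picked up by the only worker
    queued = pool.submit(slow_square, 5)   # waits in the queue behind it

    # cancel() succeeds only for a call that has not started running yet
    cancelled = queued.cancel()

    failing = pool.submit(fail)

    value = running.result(timeout=5)     # blocks until the result is ready
    error = failing.exception(timeout=5)  # returns the exception, does not raise it

print(value, type(error).__name__, cancelled, queued.done())
```

Calling `failing.result()` instead of `failing.exception()` would re-raise the `ValueError` in the calling thread.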

### Configurations

#### Number of Thread Workers

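The number of worker threads is set with the `max_workers` argument. If it is omitted, Python picks a default (in Python 3.8 and later, `min(32, cpu_count + 4)`, where `cpu_count` is the number of CPUs reported for the system). The value in use is stored in the private `_max_workers` attribute, which is an implementation detail and may change between versions.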

In [3]:
# Report the default number of worker threads on our system
from concurrent.futures import ThreadPoolExecutor
# Create a thread pool with the default number of worker threads
# Use the argument 'max_workers' to specify the number of worker threads
pool = ThreadPoolExecutor()
# report the number of worker threads chosen by default
print(pool._max_workers)


12


#### Thread Names

In [4]:
!python3 configure_thread_name.py


python3: can't open file '/Users/kenwu/Desktop/Python/python_automation/concurrency/scripts/threading_examples/configure_thread_name.py': [Errno 2] No such file or directory
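
The `configure_thread_name.py` script could not be found when this cell was run, so its contents are not shown. As a rough sketch of how worker threads are usually named, assuming the script relied on the `thread_name_prefix` argument of `ThreadPoolExecutor` (the prefix `Downloader` and the `report_name()` helper are hypothetical):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def report_name(_):
    # Each task returns the name of the worker thread that ran it
    return threading.current_thread().name

# thread_name_prefix sets the prefix used for worker thread names
with ThreadPoolExecutor(max_workers=2, thread_name_prefix="Downloader") as pool:
    names = list(pool.map(report_name, range(4)))

print(names)
```

Descriptive thread names make log messages and debugger output easier to attribute to a particular pool.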


#### Initializer

We might choose to set an initializer function for worker threads if we would like each thread to set up resources specific to the thread.

If the initializer function takes arguments, they can be passed in via the `initargs` argument to the thread pool, which is a tuple of arguments to pass to the initializer function.

Examples might include a thread-specific log file or a thread-specific connection to a remote resource like a server or database. The resource would then be available to all tasks executed by the thread, rather than being created and discarded or opened and closed for each task.

These thread-specific resources can then be stored where the worker thread can reference them, such as in a global variable or in a thread-local variable. Care must be taken to close these resources correctly once we are finished with the thread pool.

In [5]:
!python3 configure_initializer.py


python3: can't open file '/Users/kenwu/Desktop/Python/python_automation/concurrency/scripts/threading_examples/configure_initializer.py': [Errno 2] No such file or directory
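
The `configure_initializer.py` script could not be found when this cell was run. A minimal sketch of the idea, assuming the per-thread resource is kept in a `threading.local()` object (the names `init_worker`, `local.label`, and the `"conn"` prefix are hypothetical stand-ins for a real resource such as a database connection):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

local = threading.local()  # per-thread storage

def init_worker(prefix):
    # Runs once in each worker thread before it executes any task
    local.label = f"{prefix}-{threading.current_thread().name}"

def task(n):
    # Tasks reuse the resource that the initializer stored for this thread
    return (n, local.label)

with ThreadPoolExecutor(max_workers=2, initializer=init_worker,
                        initargs=("conn",)) as pool:
    results = list(pool.map(task, range(4)))

print(results)
```

Note that `initargs` must be a tuple, so a single argument needs a trailing comma.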


## `ThreadPoolExecutor` Life-Cycle

There are four main steps in the lifecycle of the `ThreadPoolExecutor` class:

- Create: Create the thread pool by calling the constructor `ThreadPoolExecutor()`
- Submit: Submit tasks and get futures.
    - Submit tasks with `map()`
    - Submit tasks with `submit()`
- Wait: Wait and get results as tasks complete (optional)
    - Wait for tasks to complete with the module-level function `wait()`
    - Iterate over futures as they finish with the module-level function `as_completed()`
- Shutdown: Shut down the thread pool.
    - Shutdown manually by calling `shutdown()`
    - Shutdown automatically with the context manager

<p align="center">
  <img width="300" height="400" src="../../doc/images/thread_pool_lifecycle.png">
</p>
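
The four steps can be sketched in a few lines. The `task()` function is a hypothetical placeholder, and the context manager takes care of the shutdown step:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed, wait

def task(n):
    # Hypothetical task: later tasks take slightly longer
    time.sleep(0.05 * n)
    return n * 10

# 1. Create (the context manager performs step 4, shutdown, on exit)
with ThreadPoolExecutor(max_workers=4) as pool:
    # 2. Submit
    futures = [pool.submit(task, n) for n in range(4)]

    # 3a. Wait until all futures finish; returns (done, not_done) sets
    done, not_done = wait(futures)

    # 3b. Or iterate over the futures in the order they complete
    results = sorted(f.result() for f in as_completed(futures))

print(len(done), len(not_done), results)
```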

### Single Threaded

Since each call to the `task(id)` function takes 1 second, calling it four times in sequence takes about 4 seconds.

In [6]:
!python3 single_thread.py


Starting task 0
Finished task 0
Starting task 1
Finished task 1
Starting task 2
Finished task 2
Starting task 3
Finished task 3
It took 4.014967162 second(s) to finish.


### Multi-threaded

In [7]:
# Using map
!python3 multi_thread_map.py


Starting task 0
Starting task 1
Starting task 2
Starting task 3
Finished task 0
Finished task 1
Finished task 2
Finished task 3
It took 1.001994466 second(s) to finish.
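
The contents of `multi_thread_map.py` are not shown. A sketch of what a `map()`-based version might look like, assuming a `task()` function that prints and sleeps (the sleep is shortened to 0.2 seconds here so the example runs quickly):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def task(task_id):
    print(f"Starting task {task_id}")
    time.sleep(0.2)  # stand-in for the 1-second task, shortened
    print(f"Finished task {task_id}")
    return task_id

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in the order of the inputs
    results = list(pool.map(task, range(4)))
elapsed = time.perf_counter() - start

print(f"It took {elapsed:.2f} second(s) to finish.")
```

Because the four calls sleep at the same time, the total is close to the duration of one call rather than the sum of all four.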


In [8]:
# Using submit() and result()
!python3 multi_thread_submit.py


Starting task 0
Starting task 1
Starting task 2Starting task 3

Finished task 0
Finished task 1
Finished task 2
Finished task 3
It took 1.004299569 second(s) to finish.
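
In the output above, two "Starting task" messages share one line because the threads write to standard output at the same time. A sketch of a `submit()`-based version that keeps each message on its own line, assuming a lock around `print` (the `print_lock` name and the returned value are illustrative; the original script is not shown):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

print_lock = threading.Lock()  # serialises access to stdout

def task(task_id):
    with print_lock:
        print(f"Starting task {task_id}")
    time.sleep(0.2)  # stand-in for the 1-second task, shortened
    with print_lock:
        print(f"Finished task {task_id}")
    return task_id * 2

with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns one Future per task
    futures = [pool.submit(task, i) for i in range(4)]
    # result() blocks until the corresponding task has finished
    results = [f.result() for f in futures]

print(results)
```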


### Downloading Images To Disk

The programs below download 20 images from Wikipedia, first sequentially and then using a thread pool:


In [9]:
# Single threaded sequential
!python3 download_seq.py


Python_bivittatus_1701.jpg was downloaded successfully
Python_Regius.jpg was downloaded successfully
Baby_carpet_python_caudal_luring.jpg was downloaded successfully
Rock_python_pratik.JPG was downloaded successfully
Dulip_Wilpattu_Python1.jpg was downloaded successfully
File:Image_created_with_a_mobile_phone.png was downloaded successfully
File:TEIDE.JPG was downloaded successfully
File:Pencil_drawing_of_a_girl_in_ecstasy.jpg was downloaded successfully
File:Cristiano_Ronaldo_2018.jpg was downloaded successfully
File:Ronaldo_-_Manchester_United_vs_Chelsea.jpg was downloaded successfully
File:Ronaldo_in_2018.jpg was downloaded successfully
File:Cristiano_Ronaldo_20120609.jpg was downloaded successfully
File:1_cristiano_ronaldo_2016.jpg was downloaded successfully
File:Contr%C3%B4le_de_Cristiano_Ronaldo.jpg was downloaded successfully
File:ANSI_ISO_C++_WP.jpg was downloaded successfully
File:Python_3._The_standard_type_hierarchy.png was downloaded successfully
File:Python_Powered.png wa

In [10]:
# Remove all downloaded .jpg and .png files (any letter case)
!rm *.[jJ][pP][gG]
!rm *.[pP][nN][gG]


In [11]:
# Concurrent download
!python3 download_multithread.py


File:TEIDE.JPG was downloaded successfully
File:Image_created_with_a_mobile_phone.png was downloaded successfully
File:Pencil_drawing_of_a_girl_in_ecstasy.jpg was downloaded successfully
Python_Regius.jpg was downloaded successfully
Python_bivittatus_1701.jpg was downloaded successfully
Baby_carpet_python_caudal_luring.jpg was downloaded successfully
File:Python_3._The_standard_type_hierarchy.png was downloaded successfully
File:Cristiano_Ronaldo_20120609.jpg was downloaded successfully
File:Ronaldo_-_Manchester_United_vs_Chelsea.jpg was downloaded successfully
File:Python_Powered.png was downloaded successfully
File:Muhammad_Ali_NYWTS.jpg was downloaded successfully
File:Cristiano_Ronaldo_2018.jpg was downloaded successfully
File:Ronaldo_in_2018.jpg was downloaded successfully
Dulip_Wilpattu_Python1.jpg was downloaded successfully
File:JoeEMartinCassiusClay1960.jpg was downloaded successfully
File:Muhammad_Ali_and_Jimmy_Carter.jpg was downloaded successfully
File:ANSI_ISO_C++_WP.jpg w

In [12]:
!rm *.[jJ][pP][gG]
!rm *.[pP][nN][gG]


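The thread pool implementation is much faster: the downloads overlap while each thread waits on the network, which is also why the files finish in a different order than in the sequential run.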
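
Neither download script is shown. A rough sketch of how a concurrent downloader could be structured, assuming `urllib.request.urlopen` for the HTTP request (the function names `download` and `download_all`, the `fetch` parameter, and the example URLs are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from urllib.request import urlopen

def download(url, dest_dir="."):
    # Fetch one URL and save it under the last component of its path
    filename = url.rsplit("/", 1)[-1]
    with urlopen(url, timeout=30) as response:
        data = response.read()
    Path(dest_dir, filename).write_bytes(data)
    return filename

def download_all(urls, fetch=download, max_workers=8):
    # Run fetch(url) for every URL in a thread pool and report each as it completes
    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            try:
                name = future.result()
                print(f"{name} was downloaded successfully")
                finished.append(name)
            except Exception as exc:
                print(f"{futures[future]} failed: {exc}")
    return finished

# Usage (requires network access; the URLs are placeholders):
# download_all(["https://example.org/a.jpg", "https://example.org/b.png"])
```

Passing the fetch function as a parameter keeps the pool logic separate from the network code, which makes it possible to exercise `download_all()` without a connection.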

### Reading Many CSV From Disk

The programs below read 10 CSV files (~8.77 GB on disk) into memory, first sequentially and then using a thread pool:



In [13]:
# Sequential
!python3 read_csv_seq.py


It took  0.00020584100000009542  seconds to finish reading the csv files.


In [14]:
# Concurrent
!python3 read_csv_multithread.py


It took  0.00021250700000008393  seconds to finish reading the csv files.

Both timings are around 0.2 milliseconds, which is far too short for reading 8.77 GB from disk. The runs most likely did not measure the actual reads (for example, the files may have been missing, or the timer may have covered only the setup), so these numbers should not be read as a comparison of the two approaches.


### Types of Tasks

We can use the `ThreadPoolExecutor` when:

- Our tasks can be defined by a pure function that has no state or side effects.

- Our task can fit within a single Python function, likely making it simple and easy to understand.
  
- We need to perform the same task many times with different arguments, i.e. homogeneous tasks.

- We need to apply the same function to each object in a collection in a for-loop.
  
*Thread pools work best when applying the same pure function to a set of different data, i.e. homogeneous tasks, heterogeneous data.*

In addition, we should use threads and the `ThreadPoolExecutor` for IO-bound tasks. An IO-bound task involves reading from or writing to a device, file, or socket connection. The operations involve input and output (IO), and their speed is bound by the device, hard drive, or network connection, which is why these tasks are referred to as IO-bound. In CPython, threads suit such tasks because the global interpreter lock (GIL) is released while a thread waits on IO, so other threads can run in the meantime. Here are some examples:

- Reading or writing a file from the hard drive.
  
- Reading or writing to standard output, input or error (stdin, stdout, stderr).

- Printing a document.

- Downloading or uploading a file.

- Querying a server.

- Querying a database.

- Taking a photo or recording a video.