### This notebook is from a LinkedIn class on Python concurrent programming

### sequential/serial execution
* the program executes a series of instructions sequentially
* only one instruction is executed at any given moment
* the speed of the program is limited by the CPU and how fast it can execute that series of instructions

### parallel programming
* with multiple processors, the instructions can be broken down into independent parts and executed simultaneously by different processors
* components that depend on all of the parts require coordination between those parts
* coordinating the actions adds complexity and overhead, so the speedup does not scale linearly with the number of processors
* parallel execution increases throughput by
  * accomplishing a single task faster
  * accomplishing more tasks in a given time
  * increasing the scale of the problem that can be solved. Big computational tasks rely on parallel programming to save time, and the time saved outweighs the cost of the added hardware
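
The decomposition described above can be sketched in a few lines. This is only an illustration of the structure, not a performance demo: the helper names `chunk` and `parallel_sum` are made up for this example, and threads are used simply because they are easy to start. The partial sums are independent of each other, while the final sum is the component that depends on all parts and therefore has to wait for them.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(data, n_parts):
    """Split data into n_parts contiguous slices of nearly equal size."""
    size, extra = divmod(len(data), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # the first `extra` slices get one additional element
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


def parallel_sum(data, n_workers=4):
    """Sum data by computing independent partial sums, then combining them."""
    parts = chunk(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # independent parts: no partial sum needs any other partial sum
        partial_sums = list(pool.map(sum, parts))
    # coordination step: the final result depends on all of the parts
    return sum(partial_sums)


numbers = list(range(1_000))
print(parallel_sum(numbers))  # 499500
```

Splitting the data, starting the workers, and collecting the results are all extra work that the sequential version does not have, which is one reason the speedup is not linear in the number of workers.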
  
### multiprocessor architectures
* Flynn's taxonomy (4 classes of computer architecture based on the number of concurrent instruction/control streams and the number of data streams)
  * SISD (single instruction single data)
    + sequential computer with a single processor unit
    + one single instruction at any given moment
  * SIMD (single instruction multiple data)  
    + parallel computer with multiple processor units
    + all units execute the same instruction at any given moment, but each can operate on a different data element
    + for example, both processors chop at the same time, one an onion and the other a carrot, and their operations stay in sync
    + suitable for applications that perform the same handful of operations on a massive set of data elements, such as image analysis. Modern computers use GPUs with SIMD instructions to do that
  * MISD (multiple instruction, single data)
    + each processor unit independently executes its own separate series of instructions, but all of them operate on a single stream of data
    + not a commonly used architecture
  * MIMD (multiple instruction, multiple data)
    + multiple processor units. Every processor unit can process a different series of instructions
    + at the same time, each of those processors can be operating on a different set of data
    + most commonly used architecture in Flynn's taxonomy, from multicore PCs to networked clusters in supercomputers
    + separated further into two parallel programming models:
      + SPMD (single program, multiple data)
        + multiple processing units execute a copy of the same single program simultaneously
        + each of them can use different data
        + different from SIMD: in SIMD, processing units execute the same instruction at the same time; in SPMD, processing units just execute the same program
        + the processing units can run asynchronously, and the program usually includes conditional logic so that each task executes only specific parts of the program
        + for example, two processors follow the same recipe, but each can execute different parts of it
        + most common style of parallel programming: a multiprocessor computer executes the same program on a MIMD architecture
      + MPMD (multiple program, multiple data)
        + Each processing unit is processing a different program.
        + processors execute independently, running different programs and possibly using different data (for example, a head/manager node with many worker nodes for functional decomposition)
* another way to categorize computer architectures is by how memory is organized and how the computer accesses data
  + memory usually operates at a speed slower than the processor
  + when one processor is reading or writing a memory element, it can prevent other processors from accessing that same element
  + two main memory architectures for parallel computing
    + shared memory
      + all processors access the same memory through a global address space. Although each processor executes its own instructions independently, if one processor changes a memory location, all the processors will see the change.
      + this doesn't mean all the data is on the same physical device. It could be spread across a cluster of systems. The key is that all processors see everything that happens in the shared memory space.
      + easier to program, since data in shared memory is easy to access
      + difficult to scale, since adding more processors to a shared memory system increases the traffic on the shared memory bus and the cost of maintaining cache coherency through communication between all the parts
      + the programmer is responsible for synchronizing memory accesses to ensure correct behavior
      + shared memory architectures fall into two categories based on how the processors are connected to memory and how fast they can access it
        + uniform memory access (UMA)
          + all processors have equal access to the memory and they can access it equally fast.
          + Symmetric multiprocessing system (SMP) is a typical UMA architecture.
            + two or more identical processors connected to a single shared memory through a system bus (processors connect to cache memory, which connects to the system bus, which connects to main memory; all connections are bi-directional)
            + each processor core of a computer or mobile phone is treated as a separate processor in an SMP architecture
              + each core has its own cache, a small, very fast piece of memory that only it can see. The core uses it to store data it works with frequently.
              + the challenge is that if a processor copies data from shared memory and changes it in its local cache, the change needs to be written back to shared memory before another processor reads the old value. This issue is called cache coherency. It is handled by the hardware in multicore processors.
        + nonuniform memory access (NUMA)
          + physically connects multiple SMP systems (each of which is a UMA-type architecture) together. The access is nonuniform because some processors have quicker access to certain parts of the memory than others. (The SMP systems sit at different positions on the system bus that connects them, and accessing memory over that bus takes longer than accessing the shared memory within the same SMP system.) Overall, every processor can still see everything in memory.
    + distributed memory
      + each processor has its own memory with its own address space, and there is no global address space. All processors are connected through some sort of network (such as Ethernet).
      + each processor operates independently. If it makes changes to its local memory, that change is not automatically reflected in the memory of other processors.
      + it is up to the programmer to explicitly define how and when data is communicated between the nodes (difficult)
      + the advantage of distributed memory is that it is scalable
        + adding more processors to the system also adds more memory. This makes it cost-effective to use commodity, off-the-shelf computers and network equipment to build large distributed memory systems.
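
The SPMD model from the list above can be sketched as follows. This is a minimal illustration under stated assumptions: threads stand in for the processing units, so it runs on shared memory, and the names `program`, `run_spmd`, and `rank` are hypothetical. Every unit runs the same function on its own chunk of data, and conditional logic on the rank decides which unit executes the final part.

```python
import threading


def program(rank, chunks, partial, barrier, output):
    """The one program that every unit runs; rank selects its data and role."""
    # same code, different data: each unit sums only its own chunk
    partial[rank] = sum(chunks[rank])
    # wait until every unit has stored its partial result
    barrier.wait()
    # conditional logic: only the unit with rank 0 runs this part
    if rank == 0:
        output.append(sum(partial))


def run_spmd(chunks):
    """Start one unit per chunk, all executing the same program."""
    n_units = len(chunks)
    partial = [0] * n_units
    output = []
    barrier = threading.Barrier(n_units)
    units = [
        threading.Thread(target=program, args=(rank, chunks, partial, barrier, output))
        for rank in range(n_units)
    ]
    for unit in units:
        unit.start()
    for unit in units:
        unit.join()
    return output[0]


chunks = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(run_spmd(chunks))  # 36
```

Because the units here share one address space, rank 0 can read the other partial results directly. On a distributed memory system, each unit would have to send its partial result over the network explicitly.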

### Threads and processes
* process:
  + when a computer runs an application, that instance of the program executing is referred to as a process
    + includes code, data, and state information
    + independent instance of a running program
    + has its own separate memory address space
    + a computer can have hundreds of processes running at the same time, and the operating system's job is to manage all of them
    + sharing resources between processes requires inter-process communication (IPC)
      + sockets and pipes
      + shared memory
      + remote procedure calls
* within each process, there are one or more smaller sub-elements called threads
  + each thread is an independent path of execution through the program
  + a different sequence of instructions
  + only exists as part of a process (subset of a process)
  + basic unit that the OS manages. The OS schedules threads for execution and allocates processor time to them.
  + threads of the same process share the process's address space, so they can access the same resources and memory, including code, variables, and data, which makes it easy for them to work together
  + sharing resources between processes is not as easy as sharing between threads in the same process
  + threads are lightweight and require less overhead to create and terminate
  + the operating system can switch between threads faster than between processes
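
As a small sketch of the points above, the following shows several threads of one process updating the same variable. The names `counter`, `counter_lock`, `seen_pids`, and `worker` are made up for this example. A lock guards the shared counter because the threads really do touch the same memory.

```python
import os
import threading

counter = 0                      # lives in the process's memory, visible to every thread
counter_lock = threading.Lock()  # protects the shared counter from conflicting updates
seen_pids = set()                # process IDs reported by the threads


def worker(n_increments):
    """Increment the shared counter and record which process this thread runs in."""
    global counter
    for _ in range(n_increments):
        with counter_lock:
            counter += 1
    with counter_lock:
        seen_pids.add(os.getpid())


threads = [threading.Thread(target=worker, args=(1_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)         # 4000
print(len(seen_pids))  # 1
```

All four threads report the same process ID and all of their increments land in the same variable. Separate processes would each get their own copy of `counter` and would need one of the IPC mechanisms listed above to share it.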

### concurrency and parallel execution
* concurrency: the ability of a program to be broken into parts that can run independently of each other. These parts can be executed out of order or partially out of order without affecting the result.
* without multiple processors, independent tasks are executed by switching back and forth between them, with only one task executing at any moment. This may give the illusion of parallel execution, but it is only concurrent execution.
  + with parallel hardware, such as multiple processors, multiple tasks can be executed simultaneously, and then we have parallel execution
* concurrency refers to a program structure that makes it possible to deal with multiple things at once
* parallelism refers to simultaneous execution, actually doing multiple things at once
* concurrent programming is useful for I/O-dependent tasks. While one thread waits for an I/O response, another thread can accept the user's input.
* parallel processing is useful for computationally intensive tasks, such as matrix multiplication
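
The benefit of concurrency for I/O-dependent tasks can be seen with a rough timing sketch. The function `fake_io` is a hypothetical stand-in for a blocking I/O call and simply sleeps; the delays are arbitrary. While one thread waits, the others can make progress, so the concurrent version should take roughly as long as the slowest task rather than the sum of all tasks.

```python
import threading
import time


def fake_io(delay, results, index):
    """Stand-in for a blocking I/O call: the thread just waits, doing no CPU work."""
    time.sleep(delay)
    results[index] = f"response {index}"


def run_sequential(delays):
    """Perform the simulated I/O calls one after another."""
    results = [None] * len(delays)
    start = time.perf_counter()
    for i, delay in enumerate(delays):
        fake_io(delay, results, i)
    return results, time.perf_counter() - start


def run_concurrent(delays):
    """Perform the simulated I/O calls in separate threads that wait at the same time."""
    results = [None] * len(delays)
    threads = [
        threading.Thread(target=fake_io, args=(delay, results, i))
        for i, delay in enumerate(delays)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start


delays = [0.2, 0.2, 0.2, 0.2]
_, t_seq = run_sequential(delays)
_, t_con = run_concurrent(delays)
print(f"sequential: {t_seq:.2f}s, concurrent: {t_con:.2f}s")
```

The exact timings depend on the machine, so no expected output is shown. The same approach would not help a computationally intensive task, because there the threads are not waiting but competing for the processor.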

### concurrent python thread
* using threads to handle concurrent tasks in Python is straightforward
* the Python interpreter (CPython) will not allow threads to execute simultaneously in parallel, because of the GIL (global interpreter lock)
* the global interpreter lock is a mechanism that limits Python to executing only one thread at a time when CPython is used as the interpreter
* the GIL provides a simple way to get thread-safe memory management
* multithreading is still useful for many I/O-bound applications, since a thread releases the GIL while it waits on blocking I/O
* for CPU-bound applications, such as intensive computational tasks, the GIL can negatively impact performance
  + we can implement parallel algorithms as external library functions, written in a language such as C++, and call them from Python functions
  + you can also use the Python multiprocessing package to use multiple processes instead of multiple threads
    + each process has its own separate interpreter with its own GIL, so different processes can execute simultaneously
    + communication between processes is more difficult than between threads
    + creating processes uses more resources than creating multiple threads
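
A minimal sketch of the multiple-process approach for CPU-bound work is shown below. It uses `concurrent.futures.ProcessPoolExecutor` from the standard library, which is built on the multiprocessing package. The function `count_primes` and the chosen limits are made up for illustration; any CPU-heavy function would do.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit):
    """CPU-bound work: count the primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        # n is prime if no d in 2..sqrt(n) divides it
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def main():
    limits = [20_000, 20_000, 20_000, 20_000]
    # each task runs in a separate process with its own interpreter and GIL
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(count_primes, limits))
    print(results)


if __name__ == "__main__":
    # the guard keeps child processes from re-running main() when they import this module
    main()
```

On platforms that start child processes by spawning a fresh interpreter (Windows, for example), the worker function has to be importable by the children, so this sketch is best saved and run as a `.py` script rather than pasted into a notebook cell. Arguments and results are pickled and sent between processes, which is part of the extra communication cost compared with threads.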
    

In [1]:
import os
import threading

# a simple function that wastes CPU cycles forever
def cpu_waster():
    while True:
        pass

# display information about this process
print('\n  Process ID: ', os.getpid())
print('Thread Count: ', threading.active_count())
for thread in threading.enumerate():
    print(thread)

print('\nStarting 12 CPU Wasters...')
for i in range(12):
    threading.Thread(target=cpu_waster).start()

# display information about this process
print('\n  Process ID: ', os.getpid())
print('Thread Count: ', threading.active_count())
for thread in threading.enumerate():
    print(thread)



  Process ID:  10944
Thread Count:  6
<_MainThread(MainThread, started 6764)>
<Thread(IOPub, started daemon 7740)>
<Heartbeat(Heartbeat, started daemon 1356)>
<ControlThread(Control, started daemon 8664)>
<HistorySavingThread(IPythonHistorySavingThread, started 10936)>
<ParentPollerWindows(Thread-4, started daemon 2876)>

Starting 12 CPU Wasters...

  Process ID:  10944
Thread Count:  18
<_MainThread(MainThread, started 6764)>
<Thread(IOPub, started daemon 7740)>
<Heartbeat(Heartbeat, started daemon 1356)>
<ControlThread(Control, started daemon 8664)>
<HistorySavingThread(IPythonHistorySavingThread, started 10936)>
<ParentPollerWindows(Thread-4, started daemon 2876)>
<Thread(Thread-5 (cpu_waster), started 3156)>
<Thread(Thread-6 (cpu_waster), started 10548)>
<Thread(Thread-7 (cpu_waster), started 7372)>
<Thread(Thread-8 (cpu_waster), started 7884)>
<Thread(Thread-9 (cpu_waster), started 6424)>
<Thread(Thread-10 (cpu_waster), started 3352)>
<Thread(Thread-11 (cpu_waster), started 9220)