<center><h2><strong><font color="blue"> Advanced Programming for Data Science (APDS)</font></strong></h2></center>

<center><img alt="" src="images/covers/taudata-cover.jpg"/></center>

<center><h2><strong><font color="blue">APDS-06: Concurrency and Threading for I/O-Bound Tasks</font></strong></h2></center>

<center><h3>(C) Taufik Sutanto</h3></center>

<h2>1. Introduction</h2>

<p>This module provides a comprehensive and in-depth exploration of concurrency and threading within the context of modern data-intensive applications. As data analysts and data scientists increasingly work with streaming data, large datasets, and real-time systems, an understanding of concurrency becomes essential. This week examines theoretical foundations, practical applications, and implementation strategies in Python, with a strong focus on I/O-bound workloads commonly faced in data acquisition, API integration, and network-based tasks.</p>

<p>The Python <code>threading</code> module will serve as the primary implementation tool, supported by the <code>requests</code> library for practical network operations. This module also includes conceptual explanations, diagrams, and examples to prepare students for professional-level data workflow engineering.</p>

<h2>2. Learning Outcomes</h2>

<p>Upon completing this module, students will be able to:</p>

<ul>
  <li>Explain the concept of concurrency and distinguish it from parallelism.</li>
  <li>Describe the characteristics of I/O-bound versus CPU-bound tasks.</li>
  <li>Explain how the Python Global Interpreter Lock (GIL) influences concurrency models.</li>
  <li>Implement multi-threaded solutions for I/O-bound operations using the <code>threading</code> module and <code>requests</code>.</li>
  <li>Understand and apply basic thread management patterns.</li>
</ul>

# Program, Thread, & Process

## Processes in the Operating System

* A **Program** is a **static** entity on our computer.
* When a **program is executed**, it becomes a **process** (sometimes also referred to as a **Task**).
* Thus, a process is a program in execution.
* One program can consist of several processes.
* When multiple processors are available, processes can be executed in parallel.
* If there is only one processor, processes can be run alternately (very quickly) as if they are all running simultaneously.
* A process has (separate resources):
    * Code segment - text section
    * Data - global variables
    * Stack - local variables and function call frames
    * Heap - dynamically allocated variables/classes
    * State - ready, waiting, running.
    * process identifier, priority, etc.

<img alt="" src="images/contoh_proses_di_os.png" />

## Threads in the Operating System

* A Thread is a part (unit of execution) of a process.
* In other words, a thread is a subset of a process.
* A process always begins with a single (primary) thread.
* The primary thread can then create other threads.
* Threads of the same process **share resources**: memory (code, data, heap), open files, etc. Each thread keeps its own stack and execution state.

### Example:
 - On our computer, Microsoft Word and, for instance, the Chrome browser are examples of processes.
 - In Microsoft Word, as we type, MS Word also performs autosave and autocorrect. Typing (editing), autosave, and autocorrect are examples of threads.

<img alt="" src="images/proses_thread.png" />

* image source: https://farhakm.wordpress.com/2015/03/30/process-vs-thread/

<h2>3. Concurrency vs. Parallelism</h2>

<h3>3.1 Key Definitions</h3>

<p><strong>Concurrency</strong> refers to the ability of a system to handle multiple tasks by interleaving their execution. Tasks may appear to run simultaneously, but on a single CPU, they are time-sliced.</p>

<p><strong>Parallelism</strong> involves executing tasks literally at the same time using multiple CPU cores.</p>

<h3>3.2 ASCII Diagram: Concurrency vs. Parallelism</h3>

<pre>
Concurrency (single core, tasks interleaving):

Time →
T1:  |----A---------C---------E-----|
T2:  |-------B---------D---------F--|


Parallelism (multi-core, tasks simultaneously):

Core 1: |----A----|----C----|----E----|
Core 2: |----B----|----D----|----F----|
</pre>


<h3>3.3 Relevance to Data Science</h3>

<ul>
    <li><strong>I/O-heavy operations</strong> such as downloading datasets, calling APIs, web scraping, or performing database queries benefit from concurrency (threading).</li>
    <li><strong>CPU-heavy operations</strong> such as numerical simulation, ML training loops, or statistical modeling benefit from parallel processing (multiprocessing or distributed frameworks).</li>
</ul>

<h2>4. I/O-Bound vs. CPU-Bound Tasks</h2>

<h3>4.1 I/O-Bound Tasks</h3>

<ul>
  <li>Network requests</li>
  <li>Disk operations</li>
  <li>Database calls</li>
  <li>File transfers</li>
</ul>
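
What these tasks have in common is that the program spends most of its time waiting rather than computing. The sketch below illustrates this with `time.sleep` as a stand-in for a real network or disk call; the function names `fake_io_task`, `run_sequential`, and `run_threaded` are hypothetical and chosen only for this illustration.

```python
import threading
import time

def fake_io_task(delay):
    """Stand-in for an I/O call: the thread only waits, as it would for a network reply."""
    time.sleep(delay)

def run_sequential(n_tasks, delay):
    """Run the tasks one after another and return the elapsed time in seconds."""
    start = time.perf_counter()
    for _ in range(n_tasks):
        fake_io_task(delay)
    return time.perf_counter() - start

def run_threaded(n_tasks, delay):
    """Run each task in its own thread and return the elapsed time in seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_io_task, args=(delay,))
               for _ in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every task before stopping the clock
    return time.perf_counter() - start

if __name__ == "__main__":
    seq = run_sequential(5, 0.2)
    thr = run_threaded(5, 0.2)
    print(f"sequential: {seq:.2f} s, threaded: {thr:.2f} s")
```

The sequential version should take roughly the sum of the delays, while the threaded version should take roughly one delay, because the waits overlap.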

<h3>4.2 CPU-Bound Tasks</h3>

<p>These tasks require heavy computation such as numerical simulation, data transformation, and machine learning model training. Threading is generally ineffective for them in CPython because the GIL lets only one thread execute Python bytecode at a time.</p>

# I/O vs. Computation

* In general, processes that require significant computation (and little input/output, I/O) benefit from parallel programming (see the figure below).
* I/O bound: communication via internet, hard disk, printer, etc.
* Processes that require significant computation: Math, Stats, Physics, Machine Learning, AI.

<img alt="" src="images/i-o-computation-process.png" />

* image source: https://realpython.com/python-concurrency/

# Advantages of Threaded Programming

* Multi-threaded programs can run faster because threads overlap their waiting time and, in runtimes without a global lock, can be executed on different CPUs. In CPython the GIL limits the speed-up mainly to I/O-bound work.
* Multi-threaded programs remain responsive to user input.
* All threads in a process can access its global variables.
* A change made to a global variable by one thread is visible to the other threads.
* Threads can also have their own local variables.

<img alt="" src="images/sifat_thread_programming.png" />
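
The points about global and local variables can be sketched as follows. This is a minimal illustration, assuming CPython; the names `shared_results`, `record_square`, and `run_demo` are hypothetical.

```python
import threading

shared_results = []  # global: one list, visible to every thread in the process

def record_square(n):
    square = n * n  # local: each thread has its own copy on its own stack
    # A single list.append call is atomic in CPython, so no lock is used here.
    shared_results.append(square)

def run_demo():
    shared_results.clear()
    threads = [threading.Thread(target=record_square, args=(n,)) for n in range(5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Threads may finish in any order, so sort before returning.
    return sorted(shared_results)

if __name__ == "__main__":
    print(run_demo())
```

Every thread writes into the same global list, while the local variable `square` is never shared between threads.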

<h2>6. The <code>threading</code> Module in Python</h2>

<p>The <code>threading</code> module provides low-level primitives for creating and controlling threads. Its core components include:</p>

<ul>
    <li><code>Thread</code> objects</li>
    <li><code>Lock</code>, <code>RLock</code>, <code>Semaphore</code></li>
    <li><code>Event</code> and <code>Condition</code></li>
    <li>Daemon vs. non-daemon threads</li>
</ul>
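
As an illustration of one of these primitives, the sketch below uses a `Lock` to protect a shared counter. The class `SafeCounter` and the helper `run_counter` are hypothetical names introduced only for this example.

```python
import threading

class SafeCounter:
    """A counter whose increments are protected by a Lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # "value += 1" is a read-modify-write sequence; the lock ensures
        # only one thread performs it at a time.
        with self._lock:
            self.value += 1

def run_counter(n_threads, n_increments):
    """Increment one shared counter from several threads and return the total."""
    counter = SafeCounter()

    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

if __name__ == "__main__":
    print(run_counter(4, 10_000))
```

Using `with self._lock:` acquires the lock on entry and releases it on exit, even if the body raises an exception.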

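# Thread Modules in Python

* `thread`: the low-level Python 2 module, renamed to `_thread` in Python 3 and not recommended for direct use.
* `threading`: the higher-level module used throughout this lesson.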

## Simple Example:
* It is best to run this in a terminal.
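* `if __name__ == "__main__":` is required in multiprocessing code on platforms that start new processes by spawning (e.g., Windows), and it is good practice in any script that uses threading or parallel programming.
 - Explanation of `__main__`: https://www.youtube.com/watch?v=IaKbhwLs0kw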

In [None]:
# Importing Some Python Modules
import warnings; warnings.simplefilter('ignore')
import pandas as pd, numpy as np, seaborn as sns
import matplotlib.pyplot as plt
# Module for this week's lesson:
import threading
import time
from threading import Thread

plt.style.use('bmh'); sns.set()
np.random.seed(420)

In [None]:
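def worker():
    print("Thread is running")

t = threading.Thread(target=worker)  # create the thread; it does not run yet
t.start()                            # begin executing worker() in the new thread
t.join()                             # block until worker() has finished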

<h3>6.2 Thread Lifecycle Diagram</h3>

<pre>           +---------+
           |   New   |
           +----+----+
                |
                v
         +------+------+
         |   Running   |
         +------+------+
                |
                v
         +------+------+
         |  Finished   |
         +-------------+
</pre>
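
The three states in the diagram can be observed with `Thread.is_alive()`. The sketch below is a minimal illustration; `short_task` and `observe_lifecycle` are hypothetical names.

```python
import threading
import time

def short_task():
    time.sleep(0.2)  # keep the thread alive long enough to observe it

def observe_lifecycle():
    """Return is_alive() sampled in the New, Running, and Finished states."""
    t = threading.Thread(target=short_task)
    states = [t.is_alive()]      # New: created but not yet started
    t.start()
    states.append(t.is_alive())  # Running: started and still sleeping
    t.join()
    states.append(t.is_alive())  # Finished: join() returned
    return states

if __name__ == "__main__":
    print(observe_lifecycle())
```

A thread object can be started only once; calling `start()` a second time raises `RuntimeError`.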

## Passing Arguments to Threads

In [None]:
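def greet(name):
    print(f"Hello, {name}")

# args takes a tuple of positional arguments; the trailing comma
# makes ("UIII Students",) a one-element tuple
t = threading.Thread(target=greet, args=("UIII Students",))
t.start()
t.join()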

<h3>6.3 Daemon vs. Non-Daemon Threads</h3>

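<p><strong>Daemon threads</strong> run in the background and are stopped abruptly when the program exits, that is, once all non-daemon threads have finished. A non-daemon thread (the default), by contrast, keeps the program alive until it completes.</p>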
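
A minimal sketch of a daemon thread, assuming a hypothetical background `heartbeat` loop that is controlled by an `Event`:

```python
import threading
import time

def heartbeat(stop_event, interval=0.05):
    """Background loop: runs until stop_event is set or the program exits."""
    # Event.wait(timeout) returns False on timeout and True once the event is set.
    while not stop_event.wait(interval):
        pass  # a real monitor would log or poll something here

def start_heartbeat():
    stop_event = threading.Event()
    t = threading.Thread(target=heartbeat, args=(stop_event,),
                         daemon=True, name="Heartbeat")
    t.start()
    return t, stop_event

if __name__ == "__main__":
    t, stop_event = start_heartbeat()
    print(t.daemon, t.is_alive())
    time.sleep(0.2)
    # No join(): because the thread is a daemon, the interpreter can exit
    # while it is still running.
    print("main thread done")
```

Because daemon threads are stopped abruptly, they are a poor fit for work that must finish cleanly, such as writing a file. Setting `stop_event` and then calling `join()` is a gentler way to shut the loop down.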

In [None]:
# -*- coding: utf-8 -*-
"""
Created on Mon Dec  7 08:59:27 2020
Simple Threading Example in Python 3
@author: Taufik Sutanto
"""
import time
from threading import Thread

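def sleeper(i):
    n_sleep = 3
    print(f"thread {i} sleeps for {n_sleep} seconds")
    time.sleep(n_sleep)  # simulated I/O wait; other threads run meanwhile
    print(f"thread {i} woke up")

if __name__ == "__main__":
    # These are non-daemon threads (the default), so the program stays alive
    # until every sleeper finishes, even though join() is not called here.
    for i in range(10):
        t = Thread(target=sleeper, args=(i,))
        t.start()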

In [None]:
import threading

def print_cube(num):
    print("Cube: {}".format(num * num * num))

def print_square(num):
    print("Square: {}".format(num * num))

if __name__ == "__main__":
    # creating thread
    t1 = threading.Thread(target=print_square, args=(10,))
    t2 = threading.Thread(target=print_cube, args=(10,))
    t1.start()  # starting thread 1
    t2.start()  # starting thread 2

    t1.join()  # wait until thread 1 is completely executed
    t2.join()  # wait until thread 2 is completely executed
    # both threads completely executed
    print("Done!")

<h3>6.5 Thread Management Tips</h3>

<ul>
  <li>Call <code>join()</code> on each thread before using its results, so the main thread waits until the work is complete.</li>
  <li>Use descriptive thread names for debugging (e.g., <code>Thread(name="Downloader-1")</code>).</li>
  <li>Plan work distribution before creating threads.</li>
</ul>
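
The three tips can be combined in one short sketch. The helpers `split_evenly`, `process_chunk`, and `run_workers` are hypothetical names, and counting the items in a chunk stands in for real work such as downloading.

```python
import threading

def split_evenly(items, n_parts):
    """Plan the work up front: deal the items round-robin into n_parts chunks."""
    return [items[i::n_parts] for i in range(n_parts)]

def process_chunk(chunk, results, lock):
    name = threading.current_thread().name  # e.g. "Downloader-1"
    with lock:
        results[name] = len(chunk)  # stand-in for real per-chunk work

def run_workers(items, n_workers):
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=process_chunk, args=(chunk, results, lock),
                         name=f"Downloader-{i}")
        for i, chunk in enumerate(split_evenly(items, n_workers), start=1)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # results are complete only after every join() returns
    return results

if __name__ == "__main__":
    print(run_workers(list(range(10)), 3))
```

The thread names appear in tracebacks and in `logging` output (via `%(threadName)s`), which makes it easier to tell which worker produced a message.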


<h2>7. Case Study: Concurrently Downloading Multiple Files/URLs</h2>

<p>This example demonstrates how concurrency improves performance when fetching multiple resources from the internet.</p>

<h3>7.1 Sequential Version</h3>

# Example of Website Scraping

### Note: Save the code to a file using an editor (e.g., Spyder) and run it in a terminal (e.g., Command Prompt).

In [None]:
import requests
import time

def download_site(url, session):
    with session.get(url) as response:
        print(f"Read {len(response.content)} from {url}")

def download_all_sites(sites):
    with requests.Session() as session:
        for url in sites:
            download_site(url, session)

if __name__ == "__main__":
    sites = [
        "https://www.detik.com",
        "https://kompas.com",
    ] * 80
    start_time = time.time()
    download_all_sites(sites)
    duration = time.time() - start_time
    print(f"Downloaded {len(sites)} in {duration} seconds")

# Code Explanation

* The "sites" variable is a list with 2 URLs, but multiplied by 80, for a total of 160. Recall the properties of list multiplication.
* The `download_site()` function downloads a single URL using the shared session and prints the size of the response.
* `download_all_sites()` creates a "Session" and accesses each URL successively/sequentially.
* Finally, it prints the time required.
* The process is as shown in Figure 1 above.
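
As a reminder of how list multiplication works, here is a small sketch using a factor of 3 instead of 80:

```python
# Multiplying a list by an integer repeats its elements in order.
sites = ["https://www.detik.com", "https://kompas.com"] * 3

print(len(sites))  # 2 URLs repeated 3 times
print(sites)
```

The repeated entries refer to the same string objects, which is harmless here because strings are immutable.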

# Threaded Programming Version

### Same as before, run this in a terminal

In [None]:
import concurrent.futures
import requests
import threading
import time

thread_local = threading.local()

def get_session():
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session

def download_site(url):
    session = get_session()
    with session.get(url) as response:
        print(f"Read {len(response.content)} from {url}")

def download_all_sites(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(download_site, sites)

if __name__ == "__main__":
    sites = [
        "https://detik.com",
        "https://kompas.com",
    ] * 80
    start_time = time.time()
    download_all_sites(sites)
    duration = time.time() - start_time
    print(f"Downloaded {len(sites)} in {duration} seconds")

# Explanation of the Code Above 

* Adding **threading** keeps the general structure the same, with only a few modifications:
* The `download_all_sites` function:
 - **ThreadPoolExecutor** = Thread + Pool + Executor.
 - The **Pool** is a group of worker threads (here `max_workers=5`) that run concurrently.
 - The **Executor** controls how and when each thread in the pool is run.
* The **ThreadPoolExecutor** automatically manages creating, running, and releasing threads.
* The **.map()** method calls `download_site` once for every URL in `sites`, distributing the calls across the threads in the pool.
* `get_session()` uses `threading.local()` to give each thread its own `requests.Session`, because a `Session` object is not guaranteed to be thread-safe.

<img alt="" src="images/threaded_process.png" />

* Image Source: https://realpython.com/python-concurrency/
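
The behavior of `.map()` can be tried without network access. In this sketch, `fake_fetch` is a hypothetical stand-in for `session.get(url)` and the `example.com` URLs are placeholders.

```python
import concurrent.futures
import threading
import time

def fake_fetch(url):
    """Stand-in for a download: wait briefly, then report which thread ran it."""
    time.sleep(0.05)
    return url, threading.current_thread().name

def fetch_all(urls, max_workers=5):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() yields results in the order of the inputs,
        # regardless of which call finishes first.
        return list(executor.map(fake_fetch, urls))

if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(10)]
    for url, thread_name in fetch_all(urls, max_workers=3):
        print(thread_name, url)
```

The output shows at most `max_workers` distinct thread names, because the pool reuses its threads instead of creating one per URL.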

<h2>8. Best Practices for Basic Threading</h2>

<ul>
  <li>Use threading only when tasks involve waiting.</li>
  <li>Avoid sharing unnecessary state between threads.</li>
  <li>Keep thread functions simple and predictable.</li>
  <li>Name threads for observability during debugging.</li>
  <li>Use thread pools for predictable workloads.</li>
</ul>
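
One way to avoid sharing unnecessary state is to pass work and results through `queue.Queue`, which is thread-safe. The sketch below is a minimal illustration: `worker` and `square_all` are hypothetical names, and squaring a number stands in for an I/O operation.

```python
import queue
import threading

def worker(tasks, results):
    """Take items from the task queue until it is empty."""
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return  # no work left: let the thread finish
        results.put((item, item * item))  # stand-in for an I/O operation

def square_all(numbers, n_workers=3):
    tasks, results = queue.Queue(), queue.Queue()
    for n in numbers:
        tasks.put(n)  # all tasks are queued before any thread starts
    threads = [threading.Thread(target=worker, args=(tasks, results),
                                name=f"Worker-{i}")
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # All workers have finished, so the results queue can be drained safely.
    out = {}
    while not results.empty():
        key, value = results.get()
        out[key] = value
    return out

if __name__ == "__main__":
    print(square_all([1, 2, 3, 4]))
```

The worker function touches only the two queues it receives as arguments, so no explicit lock and no global variable is needed.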

# The Python Global Interpreter Lock (GIL)

<h3>5.1 What is the GIL?</h3>

<p>The Global Interpreter Lock (GIL) ensures that only one thread executes Python bytecode at a time, which simplifies memory management in CPython.</p>

<h3>5.2 Why Threading Works for I/O-Bound Tasks</h3>

<pre>
Thread 1:  [Compute][I/O wait.......][Compute]
                          ↓ GIL released
Thread 2:                [Compute][I/O wait.......]
</pre>

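* The GIL is a mutex (lock) that allows only one thread at a time to control the Python interpreter.
* This means only one thread is in the "execution" state at any given time, which is highly detrimental for CPU-bound code on systems with more than one CPU.
* Two different native threads of the same process cannot run Python code at once.
* During blocking I/O (for example, waiting for a network reply or `time.sleep`), the waiting thread releases the GIL so other threads can run, as in the diagram above.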

<img alt="" src="images/python_GIL.jpg" />

* image Source: https://www.slideshare.net/cjgiridhar/pycon11-python-threads-dive-into-gil-9315128

### Illustration of the GIL in Python

In [None]:
# -*- coding: utf-8 -*-
"""
Created on Mon Dec  7 12:59:23 2020
Illustration of the GIL's Effect (single thread)
@author: Taufik Sutanto
"""
# single_threaded.py
import time

COUNT = 50000000

def countdown(n):
    while n>0:
        n -= 1

start = time.time()
countdown(COUNT)
end = time.time()

print('Time taken in seconds -', end - start)

In [None]:
# -*- coding: utf-8 -*-
"""
Created on Mon Dec  7 12:59:23 2020
Illustration of the GIL's Effect (Multi-thread)
Under the GIL this takes approximately the same time as the single-threaded version
@author: Taufik Sutanto
"""
import time
from threading import Thread

COUNT = 50000000

def countdown(n):
    while n>0:
        n -= 1

t1 = Thread(target=countdown, args=(COUNT//2,))
t2 = Thread(target=countdown, args=(COUNT//2,))

start = time.time()
t1.start()
t2.start()
t1.join()
t2.join()
end = time.time()

print('Time taken in seconds -', end - start)

# Solution: Multi-processing vs multi-threading

### Will be discussed in detail in the next lecture, after the Midterm Exam

In [None]:
# -*- coding: utf-8 -*-
"""
Created on Mon Dec  7 13:03:30 2020
Simple multi-processing example
@author: Taufik Sutanto
"""
from multiprocessing import Pool
import time

COUNT = 50000000
def countdown(n):
    while n>0:
        n -= 1

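if __name__ == '__main__':
    pool = Pool(processes=2)   # two worker processes, each with its own interpreter and GIL
    start = time.time()
    r1 = pool.apply_async(countdown, [COUNT//2])
    r2 = pool.apply_async(countdown, [COUNT//2])
    pool.close()               # no more tasks will be submitted
    pool.join()                # wait for both workers to finish
    end = time.time()
    print('Time taken in seconds -', end - start)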

<h2>9. Summary</h2>

<p>This module introduced concurrency fundamentals and basic Python threading as they apply to I/O-bound operations. Students learned how to create threads, manage thread lifecycle, use daemon threads appropriately, and implement simple concurrency patterns for accelerating data acquisition tasks.</p>

<hr>


<center><h2><strong><font color="blue">End of Module</font></strong></h2></center>
<hr>
<center><img alt="" src="images/meme-cartoon/Python_Gil-223x300.jpg" width="480"/></center>