### **ThreadPoolExecutor for Web Scraping**

### What is ThreadPoolExecutor?
`ThreadPoolExecutor` is a Python class in the `concurrent.futures` module that allows you to manage a pool of threads efficiently. It simplifies multithreading by allowing you to run multiple tasks concurrently, making it ideal for I/O-bound tasks like web scraping.

---

### Why Use ThreadPoolExecutor in Web Scraping?
When web scraping, most of the time is spent waiting for server responses (I/O). Using `ThreadPoolExecutor` enables you to:
- Scrape multiple pages concurrently.
- Reduce overall execution time.
- Use system resources more efficiently.
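To make the time saving concrete, here is a minimal sketch that runs the same simulated requests once in a plain loop and once in a thread pool. The helper names (`fake_request`, `run_sequential`, `run_threaded`) and the 0.2 second delay are assumptions chosen for illustration; `time.sleep` stands in for a real network call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(page_id, delay=0.2):
    """Hypothetical stand-in for a network call: sleep to mimic waiting on a server."""
    time.sleep(delay)
    return page_id

def run_sequential(page_ids):
    # One request after another: total time is roughly the sum of all delays
    start = time.perf_counter()
    results = [fake_request(p) for p in page_ids]
    return results, time.perf_counter() - start

def run_threaded(page_ids, max_workers=5):
    # Requests wait at the same time: total time is roughly one delay per batch
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(fake_request, page_ids))
    return results, time.perf_counter() - start

page_ids = list(range(5))
seq_results, seq_time = run_sequential(page_ids)
thr_results, thr_time = run_threaded(page_ids)
print(f"Sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

With five tasks and five workers, the threaded run should take close to a single delay, while the loop takes about five delays. Exact timings vary by machine.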

---

### Basic Example
Here's a simple example of using `ThreadPoolExecutor` to scrape multiple URLs:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def scrape_page(url):
    print(f"Scraping: {url}")
    time.sleep(2)  # Simulates a delay for I/O-bound tasks
    return f"Data from {url}"

urls = [f"https://example.com/page{i}" for i in range(1, 6)]

with ThreadPoolExecutor(max_workers=3) as executor:  # 3 worker threads
    results = list(executor.map(scrape_page, urls))

print("Scraping completed!")
```
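- **`max_workers=3`**: Lets the pool use up to 3 worker threads, so at most 3 URLs are scraped at the same time.
- **`executor.map()`**: Applies the `scrape_page` function to each URL in the list and yields the results in the same order as the input.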
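The ordering behavior of `executor.map()` can be checked with a small sketch. Here each job carries its own delay so that tasks finish out of order; the `jobs` list and the delay values are made up for the demonstration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def scrape_page(job):
    # Each job is a (url, delay) pair; the delay is an illustrative assumption
    url, delay = job
    time.sleep(delay)  # Simulates waiting for the server
    return f"Data from {url}"

# page1 is the slowest, so it finishes last even though it is submitted first
jobs = [
    ("https://example.com/page1", 0.3),
    ("https://example.com/page2", 0.1),
    ("https://example.com/page3", 0.2),
]

with ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in input order, not completion order
    results = list(executor.map(scrape_page, jobs))

for line in results:
    print(line)
```

Because results come back in input order, they can be matched to the original URLs by position.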

---

### Benefits of Using ThreadPoolExecutor
- **Concurrency**: Reduces execution time for I/O-bound tasks.
- **Simple API**: Easy to use compared to manually managing threads.
- **Scalability**: Handles many tasks efficiently by reusing threads.
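The thread reuse mentioned above can be observed by recording which thread runs each task. This is a small sketch with a hypothetical helper, `record_thread`; the number of distinct threads should never exceed `max_workers`.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def record_thread(task_id):
    # Hypothetical helper: report the name of the worker thread running this task
    time.sleep(0.05)  # Simulate a short I/O wait
    return threading.current_thread().name

with ThreadPoolExecutor(max_workers=3) as executor:
    names = list(executor.map(record_thread, range(10)))

unique_threads = set(names)
print(f"{len(names)} tasks ran on {len(unique_threads)} threads")
```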

---

In [1]:
""" 
Objective: Measure execution time with a basic loop as a baseline
"""
import time


# Simulating an I/O-bound scraping with time.sleep()
def io_bound_scraping(scraping_id):
    print(f"scraping {scraping_id} started.")
    time.sleep(2)  # Simulate I/O operation
    print(f"scraping {scraping_id} completed.")

# Using basic loop
def main():
    # TODO: Fill main function to run io_bound_scraping 5 times
    # TODO: (Optional) Estimate the time execution
    for i in range(5):
        try:
            io_bound_scraping(i)
        except Exception as e:
            print(f"Error: {e}")
        
# Call the main function to run the scrapings
if __name__ == "__main__":
    main()


scraping 0 started.
scraping 0 completed.
scraping 1 started.
scraping 1 completed.
scraping 2 started.
scraping 2 completed.
scraping 3 started.
scraping 3 completed.
scraping 4 started.
scraping 4 completed.


In [2]:
""" 
Objective: Compare time execution based on the number of workers
"""
import time
from concurrent.futures import ThreadPoolExecutor

# Simulating an I/O-bound scraping with time.sleep()
def io_bound_scraping(scraping_id):
    print(f"scraping {scraping_id} started.")
    time.sleep(2)  # Simulate I/O operation
    print(f"scraping {scraping_id} completed.")

# Using ThreadPoolExecutor with map for faster execution
def main():
    # TODO: Create a ThreadPoolExecutor with 1 worker thread
    with ThreadPoolExecutor(max_workers=1) as executor:
        # Use map to run the scrapings concurrently
        executor.map(io_bound_scraping, range(1,5))

# Call the main function to run the scrapings
if __name__ == "__main__":
    main()


scraping 1 started.
scraping 1 completed.
scraping 2 started.
scraping 2 completed.
scraping 3 started.
scraping 3 completed.
scraping 4 started.
scraping 4 completed.


In [3]:
""" 
Objective: Compare time execution based on the number of workers
"""
# TODO: Recreate previous code with 2 workers
import time
from concurrent.futures import ThreadPoolExecutor
# Simulating an I/O-bound scraping with time.sleep()
def io_bound_scraping(scraping_id):
    print(f"scraping {scraping_id} started.")
    time.sleep(2)  # Simulate I/O operation
    print(f"scraping {scraping_id} completed.")

# Using ThreadPoolExecutor with map for faster execution
def main():
    # TODO: Create a ThreadPoolExecutor with 2 worker threads
    with ThreadPoolExecutor(max_workers=2) as executor:
        # Use map to run the scrapings concurrently
        executor.map(io_bound_scraping, range(1,5))

# Call the main function to run the scrapings
if __name__ == "__main__":
    main()


scraping 1 started.
scraping 2 started.
scraping 1 completed.
scraping 3 started.
scraping 2 completed.
scraping 4 started.
scraping 3 completed.
scraping 4 completed.


In [4]:
""" 
Objective: Compare time execution based on the number of workers
"""
# TODO: Recreate previous code with 4 workers
import time
from concurrent.futures import ThreadPoolExecutor

# Simulating an I/O-bound scraping with time.sleep()
def io_bound_scraping(scraping_id):
    print(f"scraping {scraping_id} started.")
    time.sleep(2)  # Simulate I/O operation
    print(f"scraping {scraping_id} completed.")

# Using ThreadPoolExecutor with map for faster execution
def main():
    # TODO: Create a ThreadPoolExecutor with 4 worker threads
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Use map to run the scrapings concurrently
        executor.map(io_bound_scraping, range(1,5))

# Call the main function to run the scrapings
if __name__ == "__main__":
    main()


scraping 1 started.scraping 2 started.

scraping 3 started.
scraping 4 started.
scraping 1 completed.scraping 2 completed.

scraping 3 completed.
scraping 4 completed.
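The joined lines in the output above (`scraping 1 started.scraping 2 started.`) most likely appear because `print` writes the message and the trailing newline as separate steps, so another thread can write in between. One common remedy, sketched here under that assumption, is to guard `print` with a `threading.Lock`. The names `print_lock`, `safe_print`, and `log` are hypothetical, and the delay is shortened to 0.1 seconds.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

print_lock = threading.Lock()
log = []  # Keeps every message so the result can be inspected afterwards

def safe_print(message):
    # Only one thread at a time may print, so lines cannot be interleaved
    with print_lock:
        print(message)
        log.append(message)

def io_bound_scraping(scraping_id):
    safe_print(f"scraping {scraping_id} started.")
    time.sleep(0.1)  # Simulate I/O operation
    safe_print(f"scraping {scraping_id} completed.")

with ThreadPoolExecutor(max_workers=4) as executor:
    # Consuming the iterator surfaces any exception raised inside a task
    list(executor.map(io_bound_scraping, range(1, 5)))
```

The order of the lines can still differ between runs; the lock only guarantees that each message is written whole.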


In [6]:
""" 
Objective: Compare time execution based on the number of workers
"""
# TODO: Recreate previous code with 500 workers for 1000 tasks
# TODO: Analyze how your program manages to execute 500 workers at once
import time
from concurrent.futures import ThreadPoolExecutor

# Simulating an I/O-bound scraping with time.sleep()
def io_bound_scraping(scraping_id):
    print(f"scraping {scraping_id} started.")
    time.sleep(2)  # Simulate I/O operation
    print(f"scraping {scraping_id} completed.")

# Using ThreadPoolExecutor with map for faster execution
def main():
    # TODO: Create a ThreadPoolExecutor with 500 worker threads
    with ThreadPoolExecutor(max_workers=500) as executor:
        # Use map to run the scrapings concurrently
        executor.map(io_bound_scraping, range(1, 1001))  # 1000 tasks

# Call the main function to run the scrapings
if __name__ == "__main__":
    main()


scraping 1 started.
scraping 2 started.
scraping 3 started.
scraping 4 started.
scraping 5 started.
scraping 6 started.
scraping 7 started.
scraping 8 started.
scraping 9 started.
scraping 10 started.
scraping 11 started.
scraping 12 started.
scraping 13 started.
scraping 14 started.
scraping 15 started.
scraping 16 started.
scraping 17 started.
scraping 18 started.
scraping 19 started.
scraping 20 started.
scraping 21 started.
scraping 22 started.
scraping 23 started.
scraping 24 started.
scraping 25 started.
scraping 26 started.
scraping 27 started.
scraping 28 started.
scraping 29 started.
scraping 30 started.
scraping 31 started.
scraping 32 started.
scraping 33 started.
scraping 34 started.
scraping 35 started.
scraping 36 started.
scraping 37 started.
scraping 38 started.
scraping 39 started.
scraping 40 started.
scraping 41 started.
scraping 42 started.
scraping 43 started.
scraping 44 started.
scraping 45 started.
scraping 46 started.
scraping 47 started.
scraping 48 started.
... (output truncated)
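For the analysis TODO, one way to see what the pool is doing is to count live threads while tasks are running. The sketch below scales the experiment down to 50 workers and 100 tasks (both numbers are assumptions to keep the demo light) and uses `threading.active_count()`. In CPython, a thread that is sleeping or waiting on I/O releases the interpreter lock, which is why so many workers can wait at the same time without using much CPU.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def io_bound_scraping(scraping_id):
    time.sleep(0.2)  # Simulate I/O operation
    return scraping_id

MAX_WORKERS = 50   # Scaled down from 500 for the demo
NUM_TASKS = 100    # Scaled down from 1000

threads_before = threading.active_count()
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    results_iter = executor.map(io_bound_scraping, range(1, NUM_TASKS + 1))
    # map() submits every task up front; count the threads the pool has started
    pool_threads = threading.active_count() - threads_before
    results = list(results_iter)

print(f"{NUM_TASKS} tasks handled by {pool_threads} pool threads")
```

The pool never starts more threads than `max_workers`; the remaining tasks wait in an internal queue until a worker is free.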

In [None]:
""" 
Objective: Concurrently run function with 2 parameters or more
"""
# TODO: Import necessary package
# TODO: Create a function to simulate I/O bound task with 2 parameters: task_id and delay time
# TODO: Create list of task_id and delay time
# TODO: Run your function with multi-threading by mapping your function with all the parameters
# Simulating an I/O-bound scraping with time.sleep()

import time
from concurrent.futures import ThreadPoolExecutor

def io_bound_scraping(scraping_id, delay):
    print(f"scraping {scraping_id} started.")
    time.sleep(delay)  # Simulate I/O operation
    print(f"scraping {scraping_id} completed.")

# Using ThreadPoolExecutor with map for faster execution
def main():
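    # Each tuple is (scraping_id, delay in seconds)
    list_task = [(1, 2), (2, 3), (3, 1), (4, 4)]
    with ThreadPoolExecutor(max_workers=4) as executor:
        # zip(*list_task) splits the pairs into one tuple of ids and one of delays,
        # which map() passes as the two positional arguments of the function
        executor.map(io_bound_scraping, *zip(*list_task))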


# Call the main function to run the scrapings
if __name__ == "__main__":
    main()


scraping 1 started.
scraping 2 started.
scraping 3 started.
scraping 4 started.
scraping 3 completed.
scraping 1 completed.
scraping 2 completed.
scraping 4 completed.
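The `*zip(*list_task)` idiom in the cell above can be hard to read at first. This short sketch unpacks it step by step; `describe` is a hypothetical function used only to show which arguments each call receives.

```python
from concurrent.futures import ThreadPoolExecutor

def describe(task_id, delay):
    # Hypothetical two-parameter function; returns text instead of sleeping
    return f"task {task_id} waits {delay}s"

list_task = [(1, 2), (2, 3), (3, 1), (4, 4)]

# zip(*list_task) transposes the list of pairs into two parallel tuples
task_ids, delays = zip(*list_task)
print(task_ids)  # (1, 2, 3, 4)
print(delays)    # (2, 3, 1, 4)

with ThreadPoolExecutor(max_workers=4) as executor:
    # map() takes one iterable per parameter and pairs them up element by element
    paired = list(executor.map(describe, task_ids, delays))

print(paired[0])  # task 1 waits 2s
```

Passing `task_ids` and `delays` explicitly is equivalent to `executor.map(describe, *zip(*list_task))`.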


In [None]:
"""
Homework Assignment: Improve the previous code.
Instead of creating a list of delay times, pair each task_id with a constant delay value
using a lambda.
"""

In [None]:
""" 
Objective: Implement multi-threading in web scraping
"""
# TODO: Implement multi-threading on your bookstoscrape project inside new branch
# TODO: Put github url here for grading

https://github.com/rjrizani/data_scraping



In [None]:
""" 
Objective: Implement multi-threading in web scraping
"""
# TODO: Find any news site that you like: Tribun, Detik, BBC, nytimes, etc
# TODO: Extract data from the site in CSV
# TODO: Push on github and put the link here

# News scraping from BBC

https://github.com/rjrizani/BBC

### **Reflection**
Monitor your resource usage (CPU, memory) while running the multi-threaded code. What do you observe?

(answer here)

### **Exploration**
While multi-threading is like adding "more engines", there is often a more efficient approach to improving scraping time. Read up on asynchronous programming and be prepared for the next class.
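As a starting point for that exploration, here is a minimal sketch of the asynchronous style using the standard `asyncio` module. It is only a preview under simple assumptions: `fake_fetch` and `scrape_all` are hypothetical names, and `asyncio.sleep` stands in for a real non-blocking network call.

```python
import asyncio
import time

async def fake_fetch(page_id, delay=0.2):
    # Hypothetical stand-in for a non-blocking request
    await asyncio.sleep(delay)
    return page_id

async def scrape_all(page_ids):
    # gather() runs all coroutines concurrently on a single thread
    return await asyncio.gather(*(fake_fetch(p) for p in page_ids))

start = time.perf_counter()
async_results = asyncio.run(scrape_all(range(5)))
elapsed = time.perf_counter() - start
print(f"Fetched {len(async_results)} pages in {elapsed:.2f}s")
```

Inside a Jupyter notebook an event loop is already running, so `asyncio.run` raises an error there; using `await scrape_all(range(5))` directly in a cell is the usual substitute.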