## Scraping data using multithreading

Created by: [tanyongsheng.net](https://tanyongsheng.net)

----
Introduction:
- Multi-threading is beneficial in situations where the program spends a significant amount of time waiting for external resources, such as I/O operations, rather than performing intense computational tasks.
- When scraping data from multiple web pages, you can use multi-threading to issue HTTP requests for different pages concurrently. This helps reduce the overall time spent waiting for responses, as threads can work on fetching pages simultaneously.
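
The effect described above can be reproduced without any network access. The following is a minimal sketch, assuming that `time.sleep` is an acceptable stand-in for waiting on an HTTP response; the names `fake_fetch`, `run_sequential`, and `run_threaded` are hypothetical and do not appear elsewhere in this notebook.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(delay: float) -> float:
    """Stand-in for an HTTP request: block for `delay` seconds, then return it."""
    time.sleep(delay)
    return delay


def run_sequential(delays):
    """Run the fake requests one after another and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [fake_fetch(d) for d in delays]
    return results, time.perf_counter() - start


def run_threaded(delays, max_workers):
    """Run the fake requests on a thread pool and return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        results = list(ex.map(fake_fetch, delays))
    return results, time.perf_counter() - start


delays = [0.2] * 5  # five simulated requests of 0.2 s each
_, sequential_time = run_sequential(delays)
_, threaded_time = run_threaded(delays, max_workers=5)
print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

While one thread sleeps (or waits on a socket), the others can proceed, so the threaded run should take roughly as long as a single request rather than the sum of all of them.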

References:
1. [MultiThreading in Python | Python Concurrent futures | ThreadPoolExecutor](https://www.youtube.com/watch?v=i0Tey6Gprnc&t=495s)
2. [How to Make Web Scraping Faster - Python Tutorial](https://oxylabs.io/blog/how-to-make-web-scraping-faster)
3. [Comcrawl script which uses multi-threading for scraping](https://github.com/michaelharms/comcrawl/blob/a89236080c5e7f4ce6a2e0d39c5f59671f22181e/comcrawl/utils/search.py#L11)
4. [Multi-threaded web scraping with Python](https://sean.eulenberg.de/posts/2020-05-26-multi-threaded-webscraping-with-python/)


### Background: the scraping task used to demonstrate multi-threading

We make 10 requests to [https://httpbin.org/delay/10](https://httpbin.org/delay/10), an endpoint that waits 10 seconds before responding. We test three approaches:

1. **10 Threads:**
   - Send all 10 requests concurrently, one per thread.
   - Estimate: about 10-11 s (= 10 × 10 s / 10 threads, plus a little overhead).

2. **No Multithreading:**
   - Send the requests one after another in a single thread.
   - Estimate: about 1 min 40 s (= 10 × 10 s).

3. **2 Threads:**
   - Use only 2 threads, so at most two requests are in flight at any time.
   - Estimate: about 50 s (= 10 × 10 s / 2 threads).

We'll compare their execution speeds to understand how multithreading affects the web scraping process.
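
The three estimates follow one rule: with equal delays, the requests are processed in batches of at most one per thread, and each batch takes one delay. Here is a small sketch of that arithmetic, assuming every request takes exactly the same time and ignoring connection and scheduling overhead; `estimate_wall_time` is a hypothetical helper, not part of any library.

```python
import math


def estimate_wall_time(n_requests: int, delay_s: float, n_workers: int) -> float:
    """Idealised wall time: number of batches times the per-request delay."""
    batches = math.ceil(n_requests / n_workers)
    return batches * delay_s


# The three scenarios from this notebook: 10 threads, no threading (1 worker), 2 threads.
for workers in (10, 1, 2):
    print(f"{workers:>2} worker(s): ~{estimate_wall_time(10, 10, workers)} s")
```

The `math.ceil` matters when the number of requests is not a multiple of the number of threads: with 3 threads, the 10 requests would need 4 batches, so the estimate is 40 s rather than 33.3 s.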

### Case 1: Using 10 threads to scrape data

In [1]:
%pip install requests
%pip install tqdm
%pip install pandas


In [2]:
%%time

import concurrent.futures # for multi-threading
import requests # for downloading data
from tqdm import tqdm # for displaying a smart progress meter in loops

urls = ["https://httpbin.org/delay/10"] * 10   ## Note: making 10 requests to this endpoint, 
                                                ## where each request takes 10 seconds to get a response  

session = requests.Session()

def scrape(url):
    response = session.get(url)
    return response.json()

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as ex:
    results = {ex.submit(scrape, url): url for url in urls}
    
    data_list = []

    for result in tqdm(concurrent.futures.as_completed(results)):
        data = result.result() # get the scraped data

        if isinstance(data, list):
            data_list.extend(data)
        else:
            data_list.append(data)
    print(data_list)

10it [00:11,  1.17s/it]

[{'args': {}, 'data': '', 'files': {}, 'form': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.31.0', 'X-Amzn-Trace-Id': 'Root=1-65f6ac17-5821fa4a252e53cb2540b6b5'}, 'origin': '34.168.24.183', 'url': 'https://httpbin.org/delay/10'}, {'args': {}, 'data': '', 'files': {}, 'form': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.31.0', 'X-Amzn-Trace-Id': 'Root=1-65f6ac17-61d05c8243414d33060bd9b4'}, 'origin': '34.168.24.183', 'url': 'https://httpbin.org/delay/10'}, {'args': {}, 'data': '', 'files': {}, 'form': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.31.0', 'X-Amzn-Trace-Id': 'Root=1-65f6ac17-3b21848960b8fa6668a5ebcc'}, 'origin': '34.168.24.183', 'url': 'https://httpbin.org/delay/10'}, {'args': {}, 'data': '', 'files': {}, 'form': {}, 'headers':




In [3]:
import pandas
pandas.DataFrame(data_list)

| | args | data | files | form | headers | origin | url |
|---|---|---|---|---|---|---|---|
| 0 | {} | | {} | {} | `{'Accept': '*/*', 'Accept-Encoding': 'gzip, de...` | 34.168.24.183 | https://httpbin.org/delay/10 |
| 1 | {} | | {} | {} | `{'Accept': '*/*', 'Accept-Encoding': 'gzip, de...` | 34.168.24.183 | https://httpbin.org/delay/10 |
| 2 | {} | | {} | {} | `{'Accept': '*/*', 'Accept-Encoding': 'gzip, de...` | 34.168.24.183 | https://httpbin.org/delay/10 |
| 3 | {} | | {} | {} | `{'Accept': '*/*', 'Accept-Encoding': 'gzip, de...` | 34.168.24.183 | https://httpbin.org/delay/10 |
| 4 | {} | | {} | {} | `{'Accept': '*/*', 'Accept-Encoding': 'gzip, de...` | 34.168.24.183 | https://httpbin.org/delay/10 |
| 5 | {} | | {} | {} | `{'Accept': '*/*', 'Accept-Encoding': 'gzip, de...` | 34.168.24.183 | https://httpbin.org/delay/10 |
| 6 | {} | | {} | {} | `{'Accept': '*/*', 'Accept-Encoding': 'gzip, de...` | 34.168.24.183 | https://httpbin.org/delay/10 |
| 7 | {} | | {} | {} | `{'Accept': '*/*', 'Accept-Encoding': 'gzip, de...` | 34.168.24.183 | https://httpbin.org/delay/10 |
| 8 | {} | | {} | {} | `{'Accept': '*/*', 'Accept-Encoding': 'gzip, de...` | 34.168.24.183 | https://httpbin.org/delay/10 |
| 9 | {} | | {} | {} | `{'Accept': '*/*', 'Accept-Encoding': 'gzip, de...` | 34.168.24.183 | https://httpbin.org/delay/10 |
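
In Case 1, the `results` dictionary maps each future to its URL, but the loop never uses that mapping, and a single failed request would raise from `result.result()` and stop the loop. One possible way to use the mapping is sketched below: failures are recorded per URL instead of aborting the run. This is an illustration under assumptions, not the notebook's method: `fake_scrape`, `scrape_all`, and the `example.com` URLs are hypothetical, and `fake_scrape` stands in for the real `scrape` so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_scrape(url: str) -> dict:
    """Hypothetical stand-in for `scrape`: fails for URLs containing 'bad'."""
    if "bad" in url:
        raise ValueError(f"could not fetch {url}")
    return {"url": url}


def scrape_all(urls, fetch, max_workers=10):
    """Fetch every URL on a thread pool; return (results, [(url, exception), ...])."""
    data_list, errors = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        future_to_url = {ex.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]  # look up which URL this future belongs to
            try:
                data_list.append(future.result())
            except Exception as exc:  # record the failure and keep going
                errors.append((url, exc))
    return data_list, errors


# Hypothetical URLs, used only for illustration.
urls = ["https://example.com/a", "https://example.com/bad", "https://example.com/b"]
data_list, errors = scrape_all(urls, fake_scrape)
print(f"{len(data_list)} succeeded, {len(errors)} failed")
```

Errors are collected in a list of `(url, exception)` pairs rather than a dictionary keyed by URL, because this notebook requests the same URL ten times and a dictionary would keep only one entry.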


### Comparison 1: Scraping without multi-threading

In [4]:
%%time
import requests # for downloading data (tqdm is not needed in this cell)

urls = ["https://httpbin.org/delay/10"] * 10

session = requests.Session()

def scrape(url):
    response = session.get(url)
    return response.json()

data_list = []

for url in urls:
    data = scrape(url)
    if isinstance(data, list):
        data_list.extend(data)
    else:
        data_list.append(data)

print(data_list)

[{'args': {}, 'data': '', 'files': {}, 'form': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.31.0', 'X-Amzn-Trace-Id': 'Root=1-65f6ac22-381fba1d61cda913560a25cc'}, 'origin': '34.168.24.183', 'url': 'https://httpbin.org/delay/10'}, {'args': {}, 'data': '', 'files': {}, 'form': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.31.0', 'X-Amzn-Trace-Id': 'Root=1-65f6ac2d-7fbf312b633797735c30964c'}, 'origin': '34.168.24.183', 'url': 'https://httpbin.org/delay/10'}, {'args': {}, 'data': '', 'files': {}, 'form': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.31.0', 'X-Amzn-Trace-Id': 'Root=1-65f6ac37-41fd0d8465bd9bea2171f95d'}, 'origin': '34.168.24.183', 'url': 'https://httpbin.org/delay/10'}, {'args': {}, 'data': '', 'files': {}, 'form': {}, 'headers':

### Comparison 2: Scraping with fewer threads (e.g., 2 threads)

In [5]:
%%time

import concurrent.futures
import requests

urls = ["https://httpbin.org/delay/10"] * 10
session = requests.Session()

def scrape(url):
    response = session.get(url)
    return response.json()

# scrape data on 2 threads only
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as ex:
    results = {ex.submit(scrape, url): url for url in urls}
    
    data_list = []

    for result in concurrent.futures.as_completed(results):    
        data = result.result()

        if isinstance(data, list):
            data_list.extend(data)
        else:
            data_list.append(data)
    print(data_list)

[{'args': {}, 'data': '', 'files': {}, 'form': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.31.0', 'X-Amzn-Trace-Id': 'Root=1-65f6ac88-4f374748352957433fac5d30'}, 'origin': '34.168.24.183', 'url': 'https://httpbin.org/delay/10'}, {'args': {}, 'data': '', 'files': {}, 'form': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.31.0', 'X-Amzn-Trace-Id': 'Root=1-65f6ac88-7844c93b5e7e96440c3dc66f'}, 'origin': '34.168.24.183', 'url': 'https://httpbin.org/delay/10'}, {'args': {}, 'data': '', 'files': {}, 'form': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.31.0', 'X-Amzn-Trace-Id': 'Root=1-65f6ac92-43bde50b0c5899af5aeac7ef'}, 'origin': '34.168.24.183', 'url': 'https://httpbin.org/delay/10'}, {'args': {}, 'data': '', 'files': {}, 'form': {}, 'headers':

## Conclusion

With 10 threads, the wall time drops to about 12.7s, compared with about 1 min 42s for the run without multithreading. With only 2 threads, the run takes about 51.2s: roughly twice as fast as the single-threaded run, but about four times slower than with 10 threads. For this I/O-bound task, the wall time therefore shrinks roughly in proportion to the number of threads. That holds only up to the number of requests: with 10 requests, more than 10 threads would leave the extra threads idle.

Multi-threading usually does not speed up CPU-bound tasks in CPython, because the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time. For CPU-bound work, consider multiprocessing instead; for I/O-bound work such as scraping, asynchronous approaches (for example, `asyncio`) are an alternative to threads. Finally, check the website's terms of service and make sure your scraping activities comply with them.
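
As a rough illustration of the asynchronous alternative for I/O-bound work, the sketch below runs ten simulated requests on a single thread with `asyncio`. It assumes `asyncio.sleep` as a stand-in for the network wait; `fake_fetch` and `fetch_all` are hypothetical names, and a real scraper would need an asynchronous HTTP client (such as `aiohttp` or `httpx`) in place of the sleep. The semaphore plays the role that `max_workers` plays in the thread pool.

```python
import asyncio
import time


async def fake_fetch(delay: float, limit: asyncio.Semaphore) -> float:
    """Stand-in for an async HTTP request, limited by a shared semaphore."""
    async with limit:  # at most `max_concurrency` coroutines pass this point at once
        await asyncio.sleep(delay)
        return delay


async def fetch_all(delays, max_concurrency: int):
    """Run all fake requests concurrently and return their results in input order."""
    limit = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*(fake_fetch(d, limit) for d in delays))


start = time.perf_counter()
results = asyncio.run(fetch_all([0.1] * 10, max_concurrency=10))
elapsed = time.perf_counter() - start
print(f"{len(results)} simulated requests in {elapsed:.2f}s")
```

This runs as a plain script. Inside a Jupyter notebook an event loop is already running, so `asyncio.run` raises a `RuntimeError` there; writing `await fetch_all(...)` directly in a cell is the usual substitute.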

## Computing environment

In [6]:
%load_ext watermark

%watermark

# print out pypi packages used
%watermark --iversions

# date
%watermark -u -n -t -z

Last updated: 2024-03-17T08:41:32.230769+00:00

Python implementation: CPython
Python version       : 3.10.13
IPython version      : 8.22.2

Compiler    : GCC 11.4.0
OS          : Linux
Release     : 6.1.75-060175-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 16
Architecture: 64bit

pandas  : 2.2.1
requests: 2.31.0

Last updated: Sun Mar 17 2024 08:41:32UTC

