## An overview of pool in multiprocessing
The Pool class belongs to the multiprocessing module in Python, providing a convenient avenue for executing parallel tasks. When you start a Pool, you create a set of worker processes that are ready to perform tasks simultaneously. This capability makes it a pivotal tool for effective multiprocessing.

Let's offer a simplified analogy to understand the Pool concept better: Picture a boss with a team of workers. The boss has a to-do list. Instead of tackling all the tasks single-handedly, the boss divides them among the workers, who then carry them out at the same time. In this case, the boss represents the Pool, and the workers symbolize the worker processes.

Moving back to Python, when you create a Pool, you can set the number of worker processes with the processes argument. If you omit it, Pool chooses a default based on the machine's CPU count. For CPU-bound work, one worker per core is a sensible starting point, since it lets you use the machine's full processing capacity without oversubscribing it.
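As a rough sketch of how the worker count might be chosen, the snippet below caps the number of workers at both the number of tasks and the number of CPUs reported by `os.cpu_count()`. The helper `pick_worker_count` is hypothetical and exists only for illustration; it is not part of the multiprocessing module.

```python
import os
from multiprocessing import Pool


def pick_worker_count(task_count, cpu_count=None):
    """Hypothetical helper: never start more workers than tasks or CPUs."""
    if cpu_count is None:
        # os.cpu_count() may return None when the count cannot be determined.
        cpu_count = os.cpu_count() or 1
    return max(1, min(task_count, cpu_count))


def square(n):
    return n * n


if __name__ == "__main__":
    numbers = [1, 2, 3]
    workers = pick_worker_count(len(numbers))
    with Pool(processes=workers) as pool:
        print(workers, pool.map(square, numbers))
```

Starting more workers than there are tasks only adds process start-up cost, which is why the sketch takes the smaller of the two counts.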

The Pool class incorporates various methods like map, imap, apply, and apply_async, each devised for distinctive scenarios of task distribution and execution. These methods aid in effectively distributing tasks to the worker processes and gathering the results once the tasks are done.

## Iteration methods: map and imap
The Pool class provides two methods, map and imap, for distributing a function call across various input values and collecting the results. These methods facilitate parallel data processing, which greatly enhances the performance of your program when handling large datasets or tasks that require a lot of computation.

The map method applies a function to every item in a provided iterable, such as a list or tuple, and returns a list of results in the same order as the inputs. With Pool.map, this happens in parallel, with each worker process handling part of the data, and the call blocks until every result is ready. The method signature is Pool.map(func, iterable, chunksize=None). The chunksize argument is optional; it sets the approximate size of the chunks the iterable is split into before they are sent to the worker processes.

In [25]:
from multiprocessing import Pool

def square(n):
    return n * n

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16]
        result = pool.map(square, numbers)
        print(result)  # Prints the squares in input order

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 225, 256]
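
To make the effect of chunksize more concrete, here is a minimal sketch. The helper `split_into_chunks` is an illustration written for this example, not Pool's internal code; it only shows roughly how an iterable is batched when a chunk size of 4 is requested.

```python
from itertools import islice
from multiprocessing import Pool


def square(n):
    return n * n


def split_into_chunks(items, chunksize):
    """Illustrative helper (not Pool's internal code): batch items into tuples."""
    iterator = iter(items)
    while True:
        chunk = tuple(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    numbers = list(range(1, 11))
    # Roughly what is shipped to the workers with chunksize=4: three batches.
    print(list(split_into_chunks(numbers, 4)))
    with Pool(processes=4) as pool:
        # The result order matches the input order regardless of chunking.
        print(pool.map(square, numbers, chunksize=4))
```

Larger chunks mean fewer round trips between the parent and the workers, which tends to help when each call is cheap; smaller chunks balance the load better when individual calls vary in duration.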


The imap method resembles map, but instead of building the full list it returns an iterator that yields results lazily, in input order, as each one becomes available. This is advantageous when processing a stream of data or when you wish to start handling results before all tasks are finished. The method signature is Pool.imap(func, iterable, chunksize=1).

In [26]:
from multiprocessing import Pool

def square(n):
    return n * n

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        numbers = [1, 2, 3, 4, 5]
        result_iterator = pool.imap(square, numbers)
        for result in result_iterator:
            print(result)  # Output: 1 4 9 16 25

1
4
9
16
25
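
The ordering behaviour of imap can be sketched as follows. The `slow_square` function and its delays are made up for this example so that later inputs are likely to finish first. `imap_unordered` is a related Pool method not covered above; it is included only for contrast.

```python
import time
from multiprocessing import Pool


def slow_square(n):
    """Square n after a short pause; smaller inputs pause longer."""
    time.sleep(max(0.0, 0.05 * (5 - n)))
    return n * n


if __name__ == "__main__":
    numbers = [1, 2, 3, 4]
    with Pool(processes=4) as pool:
        # imap yields lazily but keeps input order, so the result for 1
        # comes first even if the task for 4 is likely to finish earlier.
        print(list(pool.imap(slow_square, numbers)))
        # imap_unordered yields in completion order, which can differ
        # from run to run; sorting makes the printed output stable.
        print(sorted(pool.imap_unordered(slow_square, numbers)))
```

If the order of results does not matter, the unordered variant can let you start consuming results sooner, since a slow early task does not hold back faster later ones.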


## Application methods: apply and apply_async
The Pool class also offers methods for running a single function call in a worker process. These methods, apply and apply_async, execute a function with the given arguments and hand the call to one of the pool's workers.

The apply method lets you submit a function and its arguments to a worker process within the pool, and it blocks until the result is ready, so on its own it does not run calls in parallel. The method signature reads Pool.apply(func, args=(), kwds={}). Here, func is the function to be executed, args is a tuple of arguments, and kwds is a dictionary of keyword arguments.

In [28]:
from multiprocessing import Pool

def add_numbers(a, b):
    return a + b

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        result = pool.apply(add_numbers, args=(5, 3))
        print(result)  # Output: 8

8


In the code snippet above, the add_numbers function, with its arguments (5, 3), is submitted to a worker process in the pool, which then calculates and returns the result.

The apply_async method works similarly to apply, but operates asynchronously. It returns promptly with an AsyncResult object without waiting for the computation to finish. You can use the get() method on the AsyncResult object to fetch the result once it's ready. The method signature reads Pool.apply_async(func, args=(), kwds={}, callback=None, error_callback=None).

In [29]:
from multiprocessing import Pool

def add_numbers(a, b):
    return a + b

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        async_result = pool.apply_async(add_numbers, args=(5, 3))
        print(async_result.get())  # Output: 8

8
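
Since apply_async returns immediately, several calls can be submitted before any result is collected. The following is a minimal sketch of that pattern, plus the callback and error_callback parameters from the signature above. The helper `collect_sums`, the 10-second timeout, and the sample inputs are assumptions made for illustration.

```python
from multiprocessing import Pool


def add_numbers(a, b):
    return a + b


def collect_sums(pairs, workers=2):
    """Submit every pair up front, then gather results in submission order."""
    with Pool(processes=workers) as pool:
        pending = [pool.apply_async(add_numbers, args=pair) for pair in pairs]
        # get() blocks until that task is done; the timeout is an arbitrary safety net.
        return [task.get(timeout=10) for task in pending]


if __name__ == "__main__":
    print(collect_sums([(1, 2), (3, 4), (5, 6)]))

    results, errors = [], []
    with Pool(processes=2) as pool:
        # Callbacks run in the parent process once a task finishes.
        pool.apply_async(add_numbers, args=(5, 3), callback=results.append)
        # Adding an int and a str raises TypeError in the worker,
        # which is handed to error_callback instead of callback.
        pool.apply_async(add_numbers, args=(5, "3"), error_callback=errors.append)
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for outstanding tasks and their callbacks
    print(results, errors)
```

Calling get() right after each apply_async, as a loop of apply calls would effectively do, serializes the work; submitting everything first is what lets the workers run side by side.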


## Practical implications: Real-world use cases of Pool in multiprocessing
You can use the Pool class in Python's multiprocessing module when you need parallel processing to improve performance and efficiency.

Data Processing: Speed up the analysis of large datasets, like in data mining or statistical analysis tasks.

Image/Video Processing: Accelerate similar computationally intensive tasks, such as image resizing or video transcoding.

Web Scraping: Fetch and process data from multiple web pages simultaneously to hasten data collection.

Simulation and Modeling: Run simulations in areas like computational physics or financial modeling in parallel.

Machine Learning: Train multiple models at once or fine-tune parameters in parallel.

Network Operations: Monitor multiple network endpoints and perform simultaneous network scans for efficient network management.

In [30]:
import cProfile, pstats, io
from pstats import SortKey

profiler = cProfile.Profile()

def fib_list(n):
    if n < 2:
        return n
    sequence = [0, 1]
    profiler.enable()
    for i in range(2, n + 1):
        sequence.append(sequence[i - 1] + sequence[i - 2])
    profiler.disable()
    return sequence[n]

fib_list(300)

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats = stats.sort_stats(SortKey.CUMULATIVE)
stats.print_stats()
print(stream.getvalue())

         300 function calls in 0.000 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      299    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}





In [20]:
!pip3 install tqdm



In [21]:
from tqdm import tqdm

# tqdm_pandas() needs a tqdm class as its first argument and fails without one;
# tqdm.pandas() registers progress_apply on pandas objects directly.
tqdm.pandas()

In [None]:
import pandas as pd

df = pd.DataFrame({
   'Name': ['John', 'Jane', 'Bob', 'Mary', 'Ivan'],
   'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Moscow'],
   'Age': [32, 25, 47, 19, 45],
   'Income': [55000, 72000, 89000, 41000, 45000]
})

In [None]:
def change_row(row):
    row['Name'] = row['Name'].upper()
    row['City'] = row['City'].lower()
    row['Age'] = row['Age'] + 10
    row['Income'] = row['Income'] * 1.1
    return row

df = df.apply(change_row, axis=1)
print(df)

   Name         City  Age   Income
0  JOHN     new york   42  60500.0
1  JANE  los angeles   35  79200.0
2   BOB      chicago   57  97900.0
3  MARY      houston   29  45100.0
4  IVAN       moscow   55  49500.0


In [None]:
def add_tax(row):
    if row['Income'] > 60000:
        tax = row['Income'] * 0.1
    else:
        tax = row['Income'] * 0.05
    return tax

df['Tax'] = df.apply(add_tax, axis=1)
print(df)

   Name         City  Age   Income     Tax
0  JOHN     new york   42  60500.0  6050.0
1  JANE  los angeles   35  79200.0  7920.0
2   BOB      chicago   57  97900.0  9790.0
3  MARY      houston   29  45100.0  2255.0
4  IVAN       moscow   55  49500.0  2475.0


In [None]:
def add_suffix(value, suffix):
    return value + suffix

# Apply the function to each value of a single column;
# extra keyword arguments are passed through to the function
df['Name'] = df['Name'].apply(add_suffix, suffix='_Smith')
print(df)

         Name         City  Age   Income     Tax
0  JOHN_Smith     new york   42  60500.0  6050.0
1  JANE_Smith  los angeles   35  79200.0  7920.0
2   BOB_Smith      chicago   57  97900.0  9790.0
3  MARY_Smith      houston   29  45100.0  2255.0
4  IVAN_Smith       moscow   55  49500.0  2475.0


In [None]:
def add_value(number):
    return number + 100


# Here, axis = 0 by default - thus, the function is applied to each column
df[["Income", "Tax"]] = df[["Income", "Tax"]].apply(add_value)
print(df)

         Name         City  Age   Income     Tax
0  JOHN_Smith     new york   42  60600.0  6150.0
1  JANE_Smith  los angeles   35  79300.0  8020.0
2   BOB_Smith      chicago   57  98000.0  9890.0
3  MARY_Smith      houston   29  45200.0  2355.0
4  IVAN_Smith       moscow   55  49600.0  2575.0


In [None]:
def mean_of_column(col):
    return col.mean()

result = df[['Income', 'Tax']].apply(mean_of_column, result_type='broadcast')

In [None]:
from tqdm import tqdm

tqdm.pandas()

result = df[['Income', 'Tax']].progress_apply(mean_of_column, result_type='broadcast')

# The progress bar counts columns: two columns were selected, so it runs to 2/2.

100%|██████████| 2/2 [00:00<00:00, 4192.21it/s]


In [None]:
# mean() fails on the string columns (Name, City), so keep only the numeric columns first
df.head().select_dtypes(include='number').progress_apply(mean_of_column, result_type='broadcast')