# Real-World Use Case: Multiprocessing for Data Processing

Multiprocessing shines when you have CPU-intensive tasks that can be split into independent pieces. This notebook walks through two practical examples: chunked data transformation and simulated batch image processing.

## What We'll Learn

1. Data Transformation with `ProcessPoolExecutor`
2. Image Processing Simulation
3. Performance Gains, Best Practices, and When Not to Use Multiprocessing

---

## 1. Data Transformation with ProcessPoolExecutor

Split a large list into chunks, process the chunks in parallel worker processes, and compare the timing against a sequential run.

In [None]:
from concurrent.futures import ProcessPoolExecutor
import time

def process_data_chunk(chunk):
    """Simulate CPU-intensive data processing"""
    # Example: Complex mathematical transformations
    result = []
    for num in chunk:
        # Simulate CPU-intensive work
        value = sum(num ** 2 + i for i in range(1000))
        result.append(value)
    return result

# Generate large dataset
data = list(range(10000))

# Split data into chunks
def chunk_list(lst, chunk_size):
    for i in range(0, len(lst), chunk_size):
        yield lst[i:i + chunk_size]

chunks = list(chunk_list(data, 1000))  # 10 chunks of 1000 items

# Sequential processing
print("=== Sequential Processing ===")
start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
sequential_results = []
for chunk in chunks:
    sequential_results.extend(process_data_chunk(chunk))
sequential_time = time.perf_counter() - start
print(f"Time: {sequential_time:.2f} seconds")

# Parallel processing with multiprocessing
print("\n=== Parallel Processing (Multiprocessing) ===")
start = time.perf_counter()
with ProcessPoolExecutor(max_workers=4) as executor:
    parallel_results = []
    # executor.map yields results in input order, so the output matches the sequential run
    for result in executor.map(process_data_chunk, chunks):
        parallel_results.extend(result)
parallel_time = time.perf_counter() - start
print(f"Time: {parallel_time:.2f} seconds")

# A ratio below 1.0 means the parallel run was slower (overhead outweighed the gain)
print(f"\nSpeedup: {sequential_time/parallel_time:.2f}x")
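
The manual chunking above can also be delegated to the executor. The following is a minimal sketch, not the notebook's own method: `executor.map` accepts a `chunksize` argument that batches items before sending them to workers. The names `transform`, `pick_chunksize`, and `run_parallel` are hypothetical, and the "a few chunks per worker" heuristic is an assumption to tune for your workload.

```python
from concurrent.futures import ProcessPoolExecutor
import math

def transform(num):
    """Per-item version of the transformation (hypothetical helper)."""
    return sum(num ** 2 + i for i in range(1000))

def pick_chunksize(n_items, n_workers, chunks_per_worker=4):
    """Assumed heuristic: aim for a few chunks per worker, never below 1."""
    return max(1, math.ceil(n_items / (n_workers * chunks_per_worker)))

def run_parallel(data, n_workers=4):
    """Map transform over data, letting the executor batch the items."""
    chunksize = pick_chunksize(len(data), n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        # chunksize groups items so each inter-process message carries many of them
        return list(executor.map(transform, data, chunksize=chunksize))

if __name__ == "__main__":
    data = list(range(2000))
    results = run_parallel(data)
    print(f"{len(results)} results, first two: {results[:2]}")
```

The `if __name__ == "__main__":` guard matters on platforms that start workers with the spawn method, where each worker re-imports the main module. Results still come back in input order, so no manual flattening is needed.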

---

## 2. Image Processing Simulation

Simulate parallel image processing (resize, filter, etc.). Each simulated image combines a short I/O wait with a CPU-bound loop.

In [None]:
from concurrent.futures import ProcessPoolExecutor
import time

def process_image(image_path):
    """Simulate image processing (resize, filter, compress, etc.)"""
    # Simulate the I/O wait of reading the image from disk
    time.sleep(0.1)
    
    # Simulate CPU-heavy processing (filters, transformations)
    result = 0
    for i in range(1000000):
        result += i ** 2
    
    return f"Processed: {image_path}"

# Simulate batch of images
image_files = [f"image_{i}.jpg" for i in range(20)]

print("Processing 20 images...")
start = time.perf_counter()

# Parallel image processing
with ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_image, image_files))

elapsed = time.perf_counter() - start
print(f"\nAll images processed in {elapsed:.2f} seconds")
# Wall-clock time divided by image count, not the cost of processing a single image
print(f"Average wall-clock time per image: {elapsed/len(image_files):.2f} seconds")
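
In a real batch, some files may fail, and `executor.map` raises the first error when its result is reached, which stops the loop. One possible alternative, sketched here under assumptions, is to use `submit` with `as_completed` so that each result is handled as it finishes and failures are recorded per file. The `.txt` failure rule and the `process_batch` helper are hypothetical and exist only for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def process_image(image_path):
    """Stand-in for real image work; rejects a hypothetical non-image file."""
    if image_path.endswith(".txt"):
        raise ValueError(f"not an image: {image_path}")
    _ = sum(i * i for i in range(100_000))  # simulated CPU-bound work
    return f"Processed: {image_path}"

def process_batch(paths, max_workers=4):
    """Return (successful results, {path: error message}) for a batch."""
    done, failed = [], {}
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its input so failures can be attributed
        futures = {executor.submit(process_image, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                done.append(future.result())
            except ValueError as exc:
                # The worker's exception is re-raised here in the parent
                failed[path] = str(exc)
    return done, failed

if __name__ == "__main__":
    paths = [f"image_{i}.jpg" for i in range(6)] + ["notes.txt"]
    done, failed = process_batch(paths)
    print(f"{len(done)} processed, {len(failed)} failed: {failed}")
```

Unlike `executor.map`, `as_completed` yields futures in completion order, so `done` is not guaranteed to follow the input order.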

---

## Summary

**Key Takeaways:**

1. **Best For CPU-Bound Tasks**:
   - Data transformations
   - Mathematical computations
   - Image/video processing
   - Scientific calculations
   - Machine learning training

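2. **Performance Gains**:
   - Speedup can approach the number of cores for well-chunked, CPU-bound work, but process startup and pickling overhead keep it below linear
   - Roughly 2-4x faster on a typical quad-core machine, depending on the workload
   - Sidesteps Python's GIL: each worker process has its own interpreter and its own GIL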

3. **Best Practices**:
   - Chunk large datasets for efficient distribution
   - Use `ProcessPoolExecutor` for simplicity
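   - Balance chunk size against overhead: too small wastes time on communication, too large leaves workers idle
   - Account for pickling overhead: arguments and return values are serialized between processes
   - For CPU-bound work, avoid more worker processes than CPU cores (`os.cpu_count()` reports the count)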

4. **When NOT to Use**:
   - I/O-bound tasks (use threading or `asyncio` instead)
   - Small, quick tasks (overhead > benefit)
   - Tasks with large data transfer between processes
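
Two of the points above, pickling overhead and worker count, can be checked before committing to a design. This is a rough sketch under assumptions: `payload_bytes` and `default_workers` are hypothetical helpers, the `reserve` parameter is an invented knob, and the size from `pickle.dumps` only approximates what is actually sent to a worker.

```python
import os
import pickle

def payload_bytes(obj):
    """Approximate size in bytes of obj once pickled for transfer."""
    return len(pickle.dumps(obj))

def default_workers(reserve=0):
    """CPU count minus an optional reserve (hypothetical knob), at least 1."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, cores - reserve)

if __name__ == "__main__":
    small = list(range(1_000))
    large = list(range(1_000_000))
    print(f"small payload: {payload_bytes(small):,} bytes")
    print(f"large payload: {payload_bytes(large):,} bytes")
    print(f"suggested max_workers: {default_workers()}")
```

If the payload is large relative to the work done on it, transfer cost can dominate, which is the "large data transfer" case listed above.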

**Real-World Applications:**
- Video encoding/transcoding
- Data science preprocessing
- Batch file processing
- Scientific simulations
- Cryptocurrency mining
- Machine learning model training

Multiprocessing unlocks the full power of modern multi-core CPUs!