# Tutorial: Task Coordination Patterns

**Category**: Concurrency  
**Difficulty**: Advanced  
**Time**: 20-30 minutes

## Overview

Production concurrent systems need two critical capabilities:

1. **Fan-out/fan-in**: Distribute work across multiple workers (fan-out), collect results efficiently (fan-in)
2. **Graceful shutdown**: Stop accepting new work, complete in-flight operations, run cleanup despite cancellation

This tutorial teaches both patterns using lionherd-core's concurrency primitives:

- `TaskGroup` - Structured concurrency for coordinating multiple concurrent operations
- `Queue` - Work distribution with backpressure control
- `shield()` - Protect cleanup operations from cancellation
- `non_cancel_subgroup()` - Distinguish expected cancellations from real errors

**Why These Patterns Matter**:
- **Throughput**: Worker pools overlap I/O waits, so throughput on I/O-bound work scales roughly with the number of workers until a downstream resource saturates
- **Reliability**: Graceful shutdown prevents data loss (incomplete transactions, orphaned resources, cascading failures)
- **Resource Management**: Bounded queues prevent memory exhaustion from fast producers overwhelming slow consumers
- **Observability**: Proper error separation (cancellations vs failures) enables accurate alerting

**What You'll Build**: Production-ready patterns for coordinating concurrent workers with proper lifecycle management, error handling, and graceful shutdown that you can immediately deploy.

## Prerequisites and Setup

**Prior Knowledge**:
- Python async/await fundamentals (asyncio basics, coroutines, await syntax)
- Basic understanding of producer-consumer patterns
- Task groups and structured concurrency concepts (or willingness to learn)

**Required Packages**:
```bash
pip install lionherd-core  # >=0.1.0
```

**Imports**:
```python
# lionherd-core concurrency primitives
from lionherd_core.libs.concurrency import (
    Queue,                    # Bounded async queue with backpressure
    TaskGroup,                # Structured concurrency primitive
    create_task_group,        # TaskGroup factory
    shield,                   # Cancellation protection
    non_cancel_subgroup,      # Exception group filtering
    sleep,                    # Async sleep (checkpoint-aware)
    fail_after,               # Timeout context manager
)

# Standard library
import asyncio
from typing import Any, Callable
```

## API Overview

### TaskGroup - Structured Concurrency

Ensures all concurrent tasks complete or are cancelled together.

```python
async with create_task_group() as tg:
    tg.start_soon(worker_1)  # Pass the coroutine function itself, not a coroutine object
    tg.start_soon(worker_2)
```

**Properties**: Structured lifetime (no orphaned tasks), error propagation (one fails → all cancel), resource safety.

### Queue - Work Distribution with Backpressure

Async FIFO queue with size limits.

```python
queue = Queue.with_maxsize(10)
await queue.put(item)      # Blocks if full
item = await queue.get()   # Blocks if empty
```

**Properties**: Bounded size (prevents memory exhaustion), backpressure (throttles fast producers).
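
The backpressure property can be observed directly. The sketch below uses `asyncio.Queue` from the standard library rather than lionherd-core's `Queue` (an assumption made so the example runs without extra dependencies); `backpressure_demo` is a hypothetical helper name.

```python
import asyncio


async def backpressure_demo(num_items: int = 6, maxsize: int = 2):
    queue = asyncio.Queue(maxsize=maxsize)
    high_water = 0  # Largest queue depth the producer ever observed
    consumed = []

    async def producer():
        nonlocal high_water
        for i in range(num_items):
            await queue.put(i)  # Suspends here whenever the queue is full
            high_water = max(high_water, queue.qsize())
        await queue.put(None)  # Sentinel: no more work

    async def consumer():
        while (item := await queue.get()) is not None:
            await asyncio.sleep(0.01)  # Slow consumer
            consumed.append(item)

    await asyncio.gather(producer(), consumer())
    return high_water, consumed
```

However fast the producer runs, the observed depth never exceeds `maxsize`, because `put()` suspends the producer until the consumer frees a slot.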

### shield() - Cancellation Protection

Protects operations from outer cancellation.

```python
try:
    with fail_after(1.0):
        await long_operation()
except TimeoutError:
    await shield(cleanup)  # Completes despite timeout
```

**Use cases**: Closing connections, flushing buffers, persisting state.
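
The following sketch contrasts shielded and unshielded cleanup when a cancellation arrives while cleanup is already in flight. It uses `asyncio.shield`, whose signature differs from the `shield(func)` form used in this tutorial (it wraps an awaitable and the caller must still await the inner task), so treat it as an illustration of the idea rather than of lionherd-core's API. The helpers `flush`, `stop_service`, and `run_cleanup` are hypothetical.

```python
import asyncio


async def flush(log: list) -> None:
    await asyncio.sleep(0.05)  # Simulated buffer flush
    log.append("flushed")


async def stop_service(log: list, shielded: bool) -> None:
    inner = asyncio.create_task(flush(log))
    try:
        await (asyncio.shield(inner) if shielded else inner)
    except asyncio.CancelledError:
        if shielded:
            await inner  # The shielded task is still running: let it finish
        raise  # Then propagate the cancellation


async def run_cleanup(shielded: bool) -> list:
    log: list = []
    task = asyncio.create_task(stop_service(log, shielded))
    await asyncio.sleep(0.01)  # Cleanup is now in flight
    task.cancel()  # Cancellation arrives mid-cleanup
    try:
        await task
    except asyncio.CancelledError:
        pass
    return log
```

With `shielded=True` the flush completes before the cancellation propagates; with `shielded=False` the flush is interrupted and the log stays empty.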

### non_cancel_subgroup() - Error Filtering

Extracts non-cancellation exceptions from ExceptionGroup.

```python
try:
    async with create_task_group() as tg:
        tg.start_soon(worker)
except ExceptionGroup as eg:
    real_errors = non_cancel_subgroup(eg)
    if real_errors:
        log_errors(real_errors)  # Only real failures
```

**Pattern**: Distinguish "service failed" from "service stopped" during shutdown.

## Pattern 1: Fan-Out/Fan-In with Queue Workers

Distribute work across N workers, collect results. The fundamental pattern for parallel processing.

**Key Concepts**:
- Queue for work distribution (bounded, with backpressure)
- TaskGroup to manage worker lifecycle (structured concurrency)
- Poison pill pattern: sentinel value (object()) signals shutdown
- Result collection: Workers append to shared list

**Pattern Flow**:
1. Create bounded queue and result list
2. Start N workers in TaskGroup
3. Feed tasks to queue (blocks if full - backpressure)
4. Send poison pill to signal completion
5. Workers process until seeing poison pill, then propagate it and exit
6. TaskGroup blocks until all workers complete

In [1]:
from typing import Any

from lionherd_core.libs.concurrency import Queue, create_task_group, sleep


async def fan_out_fan_in(tasks: list, num_workers: int = 3) -> list[Any]:
    """Distribute work to N workers, collect results.

    Args:
        tasks: List of async callables to execute
        num_workers: Number of concurrent workers

    Returns:
        List of results from all tasks

    Example:
        tasks = [lambda url=url: fetch(url) for url in urls]  # Bind url per lambda
        results = await fan_out_fan_in(tasks, num_workers=5)
    """
    work_queue = Queue.with_maxsize(10)  # Bounded queue
    results = []  # Shared result list (safe: workers share one event loop and append never awaits)
    DONE = object()  # Poison pill sentinel

    async def worker(worker_id: int):
        """Pull tasks from queue and process."""
        while True:
            task = await work_queue.get()

            # Check for poison pill
            if task is DONE:
                # Propagate to other workers
                await work_queue.put(DONE)
                print(f"Worker {worker_id} exiting")
                break

            # Process task
            result = await task()
            results.append(result)

    async with create_task_group() as tg:
        # Start workers
        for i in range(num_workers):
            tg.start_soon(worker, i)

        # Feed tasks (blocks if queue full - backpressure)
        for task in tasks:
            await work_queue.put(task)

        # Signal completion
        await work_queue.put(DONE)

    # TaskGroup exit means all workers completed
    return results


# Demo: Process 15 tasks with 3 workers
async def work(i: int) -> str:
    """Simulate CPU/IO work."""
    await sleep(0.05)
    return f"result-{i}"


tasks = [lambda i=i: work(i) for i in range(15)]
results = await fan_out_fan_in(tasks, num_workers=3)

print(f"\nProcessed {len(results)} tasks with 3 workers")
print(f"Results (first 5): {results[:5]}")
print(f"Results (last 5): {results[-5:]}")

Worker 0 exiting
Worker 1 exiting
Worker 2 exiting

Processed 15 tasks with 3 workers
Results (first 5): ['result-0', 'result-1', 'result-2', 'result-3', 'result-4']
Results (last 5): ['result-10', 'result-11', 'result-12', 'result-13', 'result-14']


### Pattern Explanation: Fan-Out/Fan-In

**How It Works**:

1. **Fan-out**: Producer adds tasks to shared queue. Workers pull concurrently. Queue.maxsize provides backpressure.

2. **Processing**: Workers process independently (embarrassingly parallel). Fast workers naturally process more tasks.

3. **Fan-in**: Workers append to a shared list. No lock is needed because all workers run on one event loop and `list.append()` contains no await point. Results arrive in completion order.

4. **Shutdown**: Producer sends the DONE sentinel. Each worker that receives it puts it back for the next worker, then exits. TaskGroup guarantees cleanup.

**Critical Details**:
- **object() sentinel**: Unique, can't be confused with None or results
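- **Poison pill propagation**: All N workers must see the pill. Re-queuing a single pill achieves this without sending N of them (one pill remains in the queue after the last worker exits)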
- **Result ordering**: Completion order, not submission order (use (index, result) tuples if order matters)
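
The `(index, result)` suggestion can be sketched as follows, using `asyncio.Queue` and plain tasks in place of lionherd-core's primitives (an assumption made to keep the example dependency-free). `ordered_fan_in` is a hypothetical helper.

```python
import asyncio


async def ordered_fan_in(jobs, num_workers: int = 3) -> list:
    queue = asyncio.Queue(maxsize=10)
    results = []  # (index, result) pairs in completion order
    DONE = object()

    async def worker():
        while True:
            item = await queue.get()
            if item is DONE:
                await queue.put(DONE)  # Pass the pill on to the next worker
                return
            index, job = item
            results.append((index, await job()))

    workers = [asyncio.create_task(worker()) for _ in range(num_workers)]
    for pair in enumerate(jobs):  # Tag each job with its submission index
        await queue.put(pair)
    await queue.put(DONE)
    await asyncio.gather(*workers)

    # Restore submission order before returning
    return [result for _, result in sorted(results, key=lambda p: p[0])]
```

Sorting by the index restores submission order even when later jobs finish first.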

## Pattern 2: Graceful Shutdown with shield()

Services must complete cleanup operations even when cancelled. Without shield(), a cancellation that arrives during cleanup can interrupt it mid-execution and leave state inconsistent.

**Key Concepts**:
- shield() makes cleanup immune to outer cancellation
- Context manager pattern (__aexit__) guarantees cleanup runs
- CancelledError must be re-raised after cleanup (don't suppress)

**Pattern Flow**:
1. Service starts (acquire resources: connections, files, locks)
2. Service runs (process requests, handle events)
3. Cancellation arrives (SIGTERM, timeout, exception in other tasks)
4. __aexit__ called with exception info
5. shield(cleanup) ensures cleanup completes despite cancellation
6. Resources released, exceptions propagated

In [2]:
from lionherd_core.libs.concurrency import fail_after, shield


class Service:
    """Service with guaranteed cleanup via shield()."""

    def __init__(self, name: str):
        self.name = name
        self.running = False
        self.resource = None  # Simulate resource (connection, file, etc)

    async def start(self):
        """Initialize service - acquire resources."""
        print(f"[{self.name}] Starting")
        await sleep(0.05)  # Simulate startup time
        self.running = True
        self.resource = f"{self.name}-resource"  # Acquire resource
        print(f"[{self.name}] Started (resource: {self.resource})")

    async def stop(self):
        """Cleanup - this MUST complete to avoid resource leaks."""
        print(f"[{self.name}] Stopping (cleanup started)")

        # Simulate cleanup work: close connections, flush buffers, etc
        await sleep(0.15)  # Cleanup takes longer than startup

        # Release resource
        self.resource = None
        self.running = False

        print(f"[{self.name}] Stopped (cleanup complete)")

    async def __aenter__(self):
        """Context manager entry."""
        await self.start()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit - guaranteed cleanup.

        shield() ensures cleanup completes even if:
        - Outer context is cancelled
        - Timeout expires
        - Exception raised in service
        """
        # Shield cleanup from outer cancellation
        await shield(self.stop)

        # Don't suppress exceptions (return False)
        return False


# Demo 1: Normal operation (no cancellation)
print("=== Demo 1: Normal operation ===")
async with Service("api-server") as svc:
    print(f"Service running: {svc.running}")
    await sleep(0.1)
# Output: Start → Stop (cleanup completes)

print(f"\nFinal state: running={svc.running}, resource={svc.resource}\n")


# Demo 2: Cancellation during operation
print("=== Demo 2: Cancelled during operation ===")
try:
    async with Service("worker-pool") as svc:
        print(f"Service running: {svc.running}")
        # Time out while the service body is still running
        with fail_after(0.12):  # Raises TimeoutError after 0.12s
            await sleep(1.0)  # Far longer than the timeout
except Exception as e:
    print(f"Caught: {type(e).__name__}")

print(f"Final state: running={svc.running}, resource={svc.resource}")
# Output: Cleanup completes DESPITE timeout

print("\n=== Demo 3: Exception during operation ===")
try:
    async with Service("metrics") as svc:
        raise ValueError("Simulated failure")
except ValueError as e:
    print(f"Caught: {e}")

print(f"Final state: running={svc.running}, resource={svc.resource}")
# Output: Cleanup completes DESPITE exception

=== Demo 1: Normal operation ===
[api-server] Starting
[api-server] Started (resource: api-server-resource)
Service running: True
[api-server] Stopping (cleanup started)
[api-server] Stopped (cleanup complete)

Final state: running=False, resource=None

=== Demo 2: Cancelled during operation ===
[worker-pool] Starting
[worker-pool] Started (resource: worker-pool-resource)
Service running: True
[worker-pool] Stopping (cleanup started)
[worker-pool] Stopped (cleanup complete)
Caught: TimeoutError
Final state: running=False, resource=None

=== Demo 3: Exception during operation ===
[metrics] Starting
[metrics] Started (resource: metrics-resource)
[metrics] Stopping (cleanup started)
[metrics] Stopped (cleanup complete)
Caught: Simulated failure
Final state: running=False, resource=None


### Pattern Explanation: Graceful Shutdown

**Lifecycle States**: INITIAL → STARTING → RUNNING → STOPPING → STOPPED

**Cancellation Sources**:
- Outer timeout (fail_after, wait_for)
- TaskGroup (one task failed, all cancelled)
- OS signal (SIGTERM, SIGINT)

**shield() Protection**:
- `await shield(self.stop)` creates cancellation-immune scope
- Cleanup completes even if outer context is cancelled
- Only outer cancellation is blocked: a timeout set inside the cleanup itself still fires and propagates

**Resource Guarantees**: Connections closed, files flushed, locks released, state persisted.

**Critical Details**:
- **shield in __aexit__**: Ensures cleanup runs regardless of how context exits
- **return False**: Don't suppress exceptions, propagate after cleanup
- **Idempotency**: Calling stop() twice should be safe
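
A minimal sketch of an idempotent `stop()`, assuming a plain `asyncio.Lock` is acceptable for serializing concurrent callers. The class `IdempotentService` and its `close_calls` counter are hypothetical and exist only to make the behavior checkable.

```python
import asyncio


class IdempotentService:
    def __init__(self) -> None:
        self.running = True
        self.close_calls = 0  # How many times the resource was really closed
        self._stop_lock = asyncio.Lock()

    async def stop(self) -> None:
        async with self._stop_lock:  # Serialize concurrent stop() calls
            if not self.running:
                return  # Already stopped: nothing to do
            await asyncio.sleep(0.01)  # Simulated resource release
            self.close_calls += 1
            self.running = False


async def demo_idempotent_stop() -> IdempotentService:
    svc = IdempotentService()
    await asyncio.gather(svc.stop(), svc.stop())  # Two concurrent callers
    await svc.stop()  # And one late caller
    return svc
```

The lock matters because the release step awaits: without it, two concurrent callers could both pass the `running` check before either finishes.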

## Combined Pattern: Worker Pool with Graceful Shutdown

Real systems need both: distribute work across workers AND shutdown gracefully when cancelled.

**Key Concepts**:
- TaskGroup manages worker lifecycle
- Queue distributes work
- shutdown_event enables cooperative shutdown ("please stop when convenient")
- Cancellation triggers forceful shutdown ("stop now")
- shield() ensures cleanup completes

**Two-Phase Shutdown**:
1. **Cooperative** (Phase 1): Set shutdown_event, workers stop accepting new work
2. **Forceful** (Phase 2): Cancel workers if they don't exit gracefully
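
The two phases can be sketched with standard-library asyncio as shown below. This is an illustration under stated assumptions (plain `asyncio.Task` objects and an `asyncio.Event`), not lionherd-core's mechanism; `two_phase_shutdown` and `demo_two_phase` are hypothetical names.

```python
import asyncio


async def two_phase_shutdown(workers, stop_event, grace: float = 0.2) -> int:
    """Ask workers to stop, then cancel any that outlive the grace period."""
    stop_event.set()  # Phase 1: cooperative
    _done, pending = await asyncio.wait(workers, timeout=grace)
    for task in pending:  # Phase 2: forceful
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return len(pending)  # How many workers had to be cancelled


async def demo_two_phase():
    stop = asyncio.Event()

    async def polite():
        while not stop.is_set():  # Checks the event between units of work
            await asyncio.sleep(0.01)

    async def stubborn():
        await asyncio.sleep(60)  # Never looks at the event

    tasks = [asyncio.create_task(polite()), asyncio.create_task(stubborn())]
    await asyncio.sleep(0.02)  # Let both workers start
    forced = await two_phase_shutdown(tasks, stop, grace=0.1)
    return forced, [t.cancelled() for t in tasks]
```

The key ordering is that the event is set and the grace period elapses while the workers are still alive; only the workers that ignore the event are cancelled.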

---

## Error Handling: non_cancel_subgroup()

ExceptionGroup may contain both expected cancellations and real errors. `non_cancel_subgroup()` filters out cancellations so you can handle real errors separately.

**Why This Matters**:
- During shutdown, cancellations are expected (not errors)
- Real failures need investigation (alerts, logs, error tracking)
- Mixing them creates alert fatigue ("service stopped" != "service failed")

**Patterns to demonstrate**:
1. Combined worker pool with graceful shutdown (next cell)
2. Error filtering with non_cancel_subgroup() (following cell)

In [3]:
import asyncio

from lionherd_core.libs.concurrency import Event, sleep


class WorkerPool:
    """Worker pool with fan-out/fan-in and graceful shutdown.

    Features:
    - Distribute work across N workers (fan-out/fan-in)
    - Cooperative shutdown (shutdown_event)
    - Forceful shutdown (cancellation)
    - Guaranteed cleanup (shield)
    - Error collection (don't crash on individual failures)
    """

    def __init__(self, num_workers: int = 3, queue_size: int = 10):
        self.num_workers = num_workers
        self.work_queue = Queue.with_maxsize(queue_size)
        self.results = []
        self.errors = []
        self.shutdown_event = Event()  # Cooperative shutdown signal

    async def worker(self, worker_id: int):
        """Worker loop - processes until shutdown."""
        print(f"Worker {worker_id} started")

        try:
            while not self.shutdown_event.is_set():
                try:
                    # Blocking get with a short timeout, so shutdown_event is re-checked periodically
                    task = await asyncio.wait_for(
                        self.work_queue.get(),
                        timeout=0.1,  # Check shutdown every 100ms
                    )

                    # Process task
                    result = await task()
                    self.results.append(f"w{worker_id}: {result}")

                except TimeoutError:
                    # No work available, check shutdown_event again
                    continue

                except Exception as e:
                    # Task failed, collect error but continue processing
                    self.errors.append(e)

        except asyncio.CancelledError:
            # Forceful shutdown (Phase 2)
            print(f"Worker {worker_id} cancelled (forceful shutdown)")
            raise

        finally:
            print(f"Worker {worker_id} exiting")

    async def cleanup(self):
        """Drain remaining work, close resources.

        This is shielded from cancellation, so it completes
        even if outer context is cancelled.
        """
        print("\nWorkerPool cleanup: draining remaining work...")

        # Process remaining items in queue (non-blocking)
        drained = 0
        try:
            while True:
                task = self.work_queue.get_nowait()
                result = await task()
                self.results.append(f"cleanup: {result}")
                drained += 1
        except Exception:  # get_nowait() raises once the queue is empty (a failing task also ends the drain here)
            pass

        print(f"WorkerPool cleanup complete: processed {drained} remaining items")

    async def run(self, tasks: list):
        """Run worker pool until cancelled.

        Two-phase shutdown:
        1. Cooperative: Set shutdown_event (workers exit loops)
        2. Forceful: Cancel workers (if they don't exit gracefully)
        """
        try:
            async with create_task_group() as tg:
                # Start workers
                for i in range(self.num_workers):
                    tg.start_soon(self.worker, i)

                # Feed work
                print(f"\nFeeding {len(tasks)} tasks to {self.num_workers} workers...")
                for task in tasks:
                    await self.work_queue.put(task)

                # Wait for external cancellation or completion
                # In production, this would be signal handling or service lifecycle
                await asyncio.sleep(3600)  # Blocks until cancelled

        except asyncio.CancelledError:
            print("\nWorkerPool cancelled, initiating two-phase shutdown...")

            # Phase 1: Cooperative shutdown
            print("Phase 1: Setting shutdown_event (cooperative)")
            self.shutdown_event.set()

            # The task group has already cancelled its workers by this point,
            # so this pause only stands in for a real grace period
            await sleep(0.05)

            # Phase 2: Re-raise so the caller sees the cancellation
            print("Phase 2: Cancelling workers (forceful)")
            raise

        finally:
            # Cleanup completes despite cancellation (shield protects it)
            await shield(self.cleanup)


# Demo: Cancel worker pool mid-execution
async def demo_worker_pool():
    pool = WorkerPool(num_workers=3, queue_size=10)

    async def task(i: int) -> str:
        await sleep(0.1)
        return f"task-{i}"

    tasks = [lambda i=i: task(i) for i in range(20)]

    # Run pool in background
    pool_task = asyncio.create_task(pool.run(tasks))

    # Let it process some work, then cancel
    await sleep(0.35)
    print(f"\n\n>>> Cancelling pool after {len(pool.results)} results...\n")
    pool_task.cancel()

    try:
        await pool_task
    except asyncio.CancelledError:
        print("\n>>> Pool shutdown complete\n")

    print(f"Total processed: {len(pool.results)} results, {len(pool.errors)} errors")
    print(f"Sample results: {pool.results[:5]}")
    if len(pool.results) > 5:
        print(f"More results: {pool.results[-3:]}")


await demo_worker_pool()


Feeding 20 tasks to 3 workers...
Worker 0 started
Worker 1 started
Worker 2 started


>>> Cancelling pool after 9 results...

Worker 0 cancelled (forceful shutdown)
Worker 0 exiting
Worker 2 cancelled (forceful shutdown)
Worker 2 exiting
Worker 1 cancelled (forceful shutdown)
Worker 1 exiting

WorkerPool cancelled, initiating two-phase shutdown...
Phase 1: Setting shutdown_event (cooperative)
Phase 2: Cancelling workers (forceful)

WorkerPool cleanup: draining remaining work...
WorkerPool cleanup complete: processed 8 remaining items

>>> Pool shutdown complete

Total processed: 17 results, 0 errors
Sample results: ['w0: task-0', 'w1: task-1', 'w2: task-2', 'w0: task-3', 'w1: task-4']
More results: ['cleanup: task-17', 'cleanup: task-18', 'cleanup: task-19']


In [4]:
from lionherd_core.libs.concurrency import ExceptionGroup, non_cancel_subgroup, sleep


async def coordinated_workers_with_errors(num_workers: int = 5):
    """Worker pool demonstrating error vs cancellation distinction.

    Scenario:
    - Worker 1: Fails with ValueError (real error)
    - Worker 2: Fails with RuntimeError (real error)
    - Workers 0, 3, 4: Running normally, get cancelled due to worker 1/2 failures

    Expected:
    - TaskGroup filters CancelledErrors automatically (you won't see them in ExceptionGroup)
    - ExceptionGroup contains only the 2 real errors
    - non_cancel_subgroup() is still useful when cancellations ARE included (e.g., manual task spawning)
    - Output shows: "Total exceptions: 2" (both are real errors)

    Note: In other contexts (not TaskGroup), you may see both errors and cancellations.
    This demonstrates the filtering pattern for those cases.
    """

    async def worker(worker_id: int):
        """Worker that may fail or be cancelled."""
        await sleep(0.1)

        # Workers 1-2 fail with real errors
        if worker_id == 1:
            raise ValueError(f"Worker {worker_id}: Database connection failed")
        elif worker_id == 2:
            raise RuntimeError(f"Worker {worker_id}: Out of memory")

        # Other workers would run forever (will be cancelled)
        print(f"Worker {worker_id}: Running normally...")
        await sleep(3600)

    try:
        async with create_task_group() as tg:
            for i in range(num_workers):
                tg.start_soon(worker, i)

            # TaskGroup automatically cancels all tasks when one raises

    except ExceptionGroup as eg:
        print("\n=== ExceptionGroup Analysis ===")
        print(f"Total exceptions: {len(eg.exceptions)}")

        # Show all exceptions
        print("\nAll exceptions:")
        for i, exc in enumerate(eg.exceptions, 1):
            print(f"  {i}. {type(exc).__name__}: {exc}")

        # Filter to real errors only
        real_errors = non_cancel_subgroup(eg)

        if real_errors:
            print("\n=== Real Errors (non-cancellations) ===")
            print(f"Count: {len(real_errors.exceptions)}")
            for i, exc in enumerate(real_errors.exceptions, 1):
                print(f"  {i}. {type(exc).__name__}: {exc}")

            print("\n=== Production Actions ===")
            print("- Log to error tracking (Sentry, DataDog)")
            print("- Alert on-call engineer")
            print("- Increment error metrics")
            print("- Execute error recovery procedures")

            # In production: raise to propagate
            # raise real_errors
        else:
            print("\n=== All Cancellations ===")
            print("All exceptions were cancellations (expected during shutdown)")
            print("No alerts needed - this is normal operation")


await coordinated_workers_with_errors()

# Output shows:
# - 2 total exceptions (both real errors; the TaskGroup already dropped the 3 cancellations)
# - non_cancel_subgroup() returns the same 2 errors
# - Production can handle errors without false alarms from cancellations

Worker 0: Running normally...
Worker 3: Running normally...
Worker 4: Running normally...

=== ExceptionGroup Analysis ===
Total exceptions: 2

All exceptions:
  1. ValueError: Worker 1: Database connection failed
  2. RuntimeError: Worker 2: Out of memory

=== Real Errors (non-cancellations) ===
Count: 2
  1. ValueError: Worker 1: Database connection failed
  2. RuntimeError: Worker 2: Out of memory

=== Production Actions ===
- Log to error tracking (Sentry, DataDog)
- Alert on-call engineer
- Increment error metrics
- Execute error recovery procedures


## Production-Ready Code

Complete implementation combining all patterns. Copy-paste this into your project.

**Features**:
- Fan-out/fan-in worker pool with configurable workers
- Graceful shutdown with two-phase termination
- Error vs cancellation distinction
- Shielded cleanup operations
- Configurable timeouts
- Production error handling and logging points

In [5]:
"""
Production-ready task coordination with graceful shutdown.
Copy this cell into your project and customize.
"""

import asyncio
from collections.abc import Callable
from typing import Any

from lionherd_core.libs.concurrency import (
    Event,
    ExceptionGroup,
    Queue,
    create_task_group,
    non_cancel_subgroup,
    shield,
    sleep,
)


class ProductionWorkerPool:
    """Production-ready worker pool with graceful shutdown.

    Features:
    - Fan-out/fan-in pattern for parallel processing
    - Two-phase shutdown (cooperative + forceful)
    - Shielded cleanup (guaranteed resource release)
    - Error collection (per-task failures don't crash pool)
    - Cancellation filtering (distinguish errors from expected cancellations)

    Usage:
        pool = ProductionWorkerPool(num_workers=10)
        results, errors = await pool.process(tasks, process_func)

    Args:
        num_workers: Number of concurrent workers
        queue_size: Max items in queue (backpressure threshold)
        shutdown_timeout: Seconds to wait for cooperative shutdown
        cleanup_timeout: Max seconds for cleanup operations
    """

    def __init__(
        self,
        num_workers: int = 4,
        queue_size: int = 20,
        shutdown_timeout: float = 5.0,
        cleanup_timeout: float = 10.0,
    ):
        self.num_workers = num_workers
        self.queue_size = queue_size
        self.shutdown_timeout = shutdown_timeout
        self.cleanup_timeout = cleanup_timeout

        self.work_queue: Queue = Queue.with_maxsize(queue_size)
        self.results: list[Any] = []
        self.errors: list[Exception] = []
        self.shutdown_event = Event()
        self.tasks_processed = 0

    async def _worker(self, worker_id: int, process_func: Callable):
        """Worker loop - pulls from queue and processes.

        Error handling:
        - Task failures collected but don't crash worker
        - CancelledError propagated (structured concurrency)
        """
        processed = 0

        try:
            while not self.shutdown_event.is_set():
                try:
                    # Timeout enables periodic shutdown check
                    task = await asyncio.wait_for(
                        self.work_queue.get(),
                        timeout=0.1,
                    )

                    try:
                        result = await process_func(task)
                        self.results.append(result)
                        processed += 1
                        self.tasks_processed += 1
                    except Exception as e:
                        # Collect error but continue processing
                        self.errors.append(e)
                        # Production: logger.error(f"Task failed: {e}", exc_info=True)

                except TimeoutError:
                    # No work available, check shutdown_event
                    continue

        except asyncio.CancelledError:
            # Forceful shutdown - log and propagate
            # Production: logger.warning(f"Worker {worker_id} cancelled forcefully")
            raise

        finally:
            # Production: logger.info(f"Worker {worker_id} exiting (processed {processed})")
            pass

    async def _cleanup(self, process_func: Callable) -> int:
        """Drain remaining queue items.

        This is shielded from cancellation, ensuring remaining
        work is processed before shutdown completes.
        """
        drained = 0

        try:
            while True:
                task = self.work_queue.get_nowait()

                try:
                    result = await process_func(task)
                    self.results.append(result)
                    drained += 1
                except Exception as e:
                    self.errors.append(e)

        except Exception:  # Queue empty
            pass

        # Production: logger.info(f"Cleanup drained {drained} items")
        return drained

    async def process(
        self,
        tasks: list[Any],
        process_func: Callable,
    ) -> tuple[list[Any], list[Exception]]:
        """Process tasks with worker pool.

        Args:
            tasks: List of items to process
            process_func: async function(item) -> result

        Returns:
            (results, errors) tuple

        Example:
            async def process_item(item: dict) -> dict:
                # Your processing logic
                return transformed_item

            pool = ProductionWorkerPool(num_workers=10)
            results, errors = await pool.process(items, process_item)
        """
        # Production: logger.info(f"Starting pool: {self.num_workers} workers, {len(tasks)} tasks")

        try:
            async with create_task_group() as tg:
                # Start workers
                for i in range(self.num_workers):
                    tg.start_soon(self._worker, i, process_func)

                # Feed tasks (blocks if queue full - backpressure)
                for task in tasks:
                    await self.work_queue.put(task)

                # Wait for queue to drain
                while self.work_queue.qsize() > 0:
                    await sleep(0.01)

                # Signal cooperative shutdown
                self.shutdown_event.set()

                # Give workers time to exit gracefully
                # (note: this always waits the full shutdown_timeout)
                await sleep(self.shutdown_timeout)

        except ExceptionGroup as eg:
            # Separate real errors from cancellations
            real_errors = non_cancel_subgroup(eg)

            if real_errors:
                # Production: logger.error(f"Workers failed: {real_errors}")
                for exc in real_errors.exceptions:
                    self.errors.append(exc)
            # else: All cancellations, expected

        finally:
            # Cleanup completes despite cancellation (shield protects it)
            try:
                await asyncio.wait_for(
                    shield(self._cleanup, process_func),
                    timeout=self.cleanup_timeout,
                )
                # Production: logger.info(f"Cleanup complete: {drained} items drained")
            except TimeoutError:
                # Production: logger.error(f"Cleanup timeout after {self.cleanup_timeout}s")
                pass

        return self.results, self.errors


# ============================================================================
# Example Usage
# ============================================================================


async def production_example():
    """Example: Process batch of items with error handling."""

    async def process_item(item: int) -> dict:
        """Simulate processing with occasional failures."""
        await sleep(0.05)

        # Simulate 10% failure rate
        if item % 10 == 0:
            raise ValueError(f"Processing failed for item {item}")

        return {
            "item": item,
            "processed": item * 2,
            "status": "success",
        }

    # Process 50 items with 8 workers
    pool = ProductionWorkerPool(
        num_workers=8,
        queue_size=20,
        shutdown_timeout=2.0,
        cleanup_timeout=5.0,
    )

    items = list(range(50))
    results, errors = await pool.process(items, process_item)

    # Report
    print("\n=== Processing Complete ===")
    print(f"Total tasks: {len(items)}")
    print(f"Successful: {len(results)}")
    print(f"Failed: {len(errors)}")
    print("\nSample successful results:")
    for result in results[:3]:
        print(f"  {result}")
    print("\nSample errors:")
    for error in errors[:3]:
        print(f"  {type(error).__name__}: {error}")


await production_example()


=== Processing Complete ===
Total tasks: 50
Successful: 38
Failed: 5

Sample successful results:
  {'item': 1, 'processed': 2, 'status': 'success'}
  {'item': 2, 'processed': 4, 'status': 'success'}
  {'item': 3, 'processed': 6, 'status': 'success'}

Sample errors:
  ValueError: Processing failed for item 0
  ValueError: Processing failed for item 10
  ValueError: Processing failed for item 20


## Common Patterns and Variations

### Priority Queue

Process high-priority items first:

```python
from lionherd_core.libs.concurrency import PriorityQueue

queue = PriorityQueue.with_maxsize(10)
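await queue.put((1, "urgent"))      # Served second (lower number = higher priority)
await queue.put((5, "normal"))      # Served last
await queue.put((0, "critical"))    # Served first, despite being enqueued last

priority, task = await queue.get()  # (0, "critical"), assuming lowest-number-first ordering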
```

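**Trade-offs**: High-priority items are served first, which suits SLA tiers. In exchange, low-priority items may starve under sustained load, and a heap-backed queue pays O(log n) per put/get where a FIFO queue pays O(1).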
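
The ordering rule can be checked with the standard library's `asyncio.PriorityQueue`, used here as a stand-in on the assumption that lionherd-core's `PriorityQueue` also serves the lowest number first. The sequence counter is an addition of this sketch: it keeps FIFO order among equal priorities and avoids comparing payloads. `drain_in_priority_order` is a hypothetical helper.

```python
import asyncio
import itertools


async def drain_in_priority_order(items) -> list:
    queue = asyncio.PriorityQueue()
    seq = itertools.count()  # Tie-breaker for equal priorities
    for priority, payload in items:
        await queue.put((priority, next(seq), payload))

    served = []
    while not queue.empty():
        _priority, _seq, payload = await queue.get()
        served.append(payload)
    return served
```

Because tuples compare element by element, the counter is consulted only when two priorities are equal, so the payload itself never needs to be orderable.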

---

### Result Streaming

Process results as they arrive:

```python
result_queue = Queue.with_maxsize(10)

async def worker(work_queue, result_queue):
    while True:
        item = await work_queue.get()
        if item is None:  # Shutdown sentinel for workers
            break
        result = await process(item)
        await result_queue.put(result)  # Blocks when the consumer falls behind

async def consumer(result_queue):
    while True:
        result = await result_queue.get()
        # The coordinator must put None here once all workers have exited
        if result is None:
            break
        await handle_result(result)  # Real-time processing
```

**Benefits**: Lower memory (results are handled as they arrive instead of accumulating in a list), real-time updates, and the option to terminate early.
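The streaming snippet leaves open who sends the `None` that stops the consumer. A minimal runnable sketch of one way to wire it up follows, using the standard library's `asyncio.Queue` as a stand-in for lionherd-core's `Queue` (an assumption). The coordinator `stream_results` and the doubling "work" are hypothetical placeholders. The key point is ordering: the consumer's sentinel is sent only after every worker has exited.

```python
import asyncio


async def stream_results(items, num_workers=3):
    """Fan out items to workers and stream results to a single consumer."""
    work_queue = asyncio.Queue(maxsize=10)
    result_queue = asyncio.Queue(maxsize=10)
    handled = []

    async def worker():
        while True:
            item = await work_queue.get()
            if item is None:  # one sentinel per worker
                break
            await asyncio.sleep(0)  # stand-in for real async work
            await result_queue.put(item * 2)

    async def consumer():
        while True:
            result = await result_queue.get()
            if result is None:  # sent once, after every worker has exited
                break
            handled.append(result)  # handle each result as it arrives

    consumer_task = asyncio.create_task(consumer())
    workers = [asyncio.create_task(worker()) for _ in range(num_workers)]

    for item in items:
        await work_queue.put(item)  # blocks when workers fall behind
    for _ in range(num_workers):
        await work_queue.put(None)

    await asyncio.gather(*workers)
    # Only now is it safe to stop the consumer: no worker can still produce.
    await result_queue.put(None)
    await consumer_task
    return handled
```

Because `None` doubles as the sentinel, this sketch cannot carry `None` as a real item or result; a dedicated `object()` sentinel avoids that restriction.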

---

### Timeout-Aware Workers

Bound execution time per task:

```python
from lionherd_core.libs.concurrency import fail_after

async def worker(work_queue, results, errors):
    while True:
        item = await work_queue.get()
        if item is None:
            break

        try:
            with fail_after(5.0):  # Give each task at most 5 seconds
                result = await process(item)
            results.append(result)
        except TimeoutError:
            errors.append((item, "timeout"))
```

**Use cases**: External API calls (prevent hanging), database queries (detect slow queries), file operations (detect stuck I/O).
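For experimenting without lionherd-core installed, the same shape can be sketched with the standard library. This version swaps `fail_after` for `asyncio.wait_for` (a different mechanism, named here plainly), and the names `run_with_timeouts` and `process` are hypothetical. A timed-out item is recorded and the worker moves on to the next one.

```python
import asyncio


async def run_with_timeouts(delays, timeout=0.2):
    """Process each delay as a task; record successes and timeouts separately."""
    work_queue = asyncio.Queue()
    results, errors = [], []

    async def process(delay):
        await asyncio.sleep(delay)  # stand-in for an API call or query
        return delay

    async def worker():
        while True:
            item = await work_queue.get()
            if item is None:
                break
            try:
                # wait_for cancels process() if it runs past the timeout
                result = await asyncio.wait_for(process(item), timeout)
                results.append(result)
            except asyncio.TimeoutError:
                errors.append((item, "timeout"))

    for delay in delays:
        work_queue.put_nowait(delay)
    work_queue.put_nowait(None)  # sentinel for the single worker

    await worker()
    return results, errors
```

A slow item costs at most `timeout` seconds of worker time, so one hung call cannot stall the rest of the queue indefinitely.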

## Summary and Key Takeaways

**What You Accomplished**:
- ✅ Implemented fan-out/fan-in pattern with Queue and TaskGroup
- ✅ Used shield() to guarantee cleanup operations complete
- ✅ Applied non_cancel_subgroup() to distinguish errors from cancellations
- ✅ Combined patterns into production-ready worker pool
- ✅ Learned common variations (priority, streaming, timeouts)

**Key Takeaways**:

1. **TaskGroup provides structured concurrency**
   - All tasks complete or are cancelled together
   - No orphaned tasks, no resource leaks
   - Error in one task cancels all others

2. **Queue enables work distribution with backpressure**
   - Bounded queue prevents memory exhaustion
   - Slow consumers naturally throttle fast producers
   - get_nowait() for non-blocking reads in cleanup

3. **shield() is essential for cleanup**
   - Ensures resources are released despite cancellation
   - Use in `__aexit__` for context managers
   - Add timeout wrapper to prevent infinite cleanup

4. **Poison pill pattern for graceful shutdown**
   - Sentinel value (object()) signals workers to exit
   - Every worker must see a pill: enqueue one per worker (N workers = N pills), or have each exiting worker re-enqueue it for the next
   - Alternative: shutdown_event + timeout on queue.get()

5. **non_cancel_subgroup() for error filtering**
   - Separate expected cancellations from real errors
   - Returns None if all exceptions are cancellations
   - Essential for accurate alerting (no false alarms)
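The alternative named in takeaway 4, a shutdown event combined with a timeout on `queue.get()`, can be sketched as follows. This uses only the standard library (`asyncio.Queue`, `asyncio.Event`, `asyncio.wait_for`) as stand-ins for the lionherd-core primitives, which is an assumption, and the names `event_driven_worker` and `run_until_shutdown` are hypothetical. Workers exit once shutdown has been requested and the queue has stayed empty for one poll interval, so no sentinel values are needed.

```python
import asyncio


async def event_driven_worker(queue, shutdown_event, processed, poll_interval=0.05):
    """Exit when shutdown is requested and the queue has been drained."""
    while True:
        try:
            # Wake up periodically so the shutdown flag is re-checked
            item = await asyncio.wait_for(queue.get(), timeout=poll_interval)
        except asyncio.TimeoutError:
            if shutdown_event.is_set():
                break  # queue stayed empty and shutdown was requested
            continue
        processed.append(item)  # stand-in for real processing


async def run_until_shutdown(items, num_workers=2):
    """Feed items to workers, then signal shutdown and wait for them to drain."""
    queue = asyncio.Queue()
    shutdown_event = asyncio.Event()
    processed = []
    workers = [
        asyncio.create_task(event_driven_worker(queue, shutdown_event, processed))
        for _ in range(num_workers)
    ]
    for item in items:
        await queue.put(item)
    shutdown_event.set()  # stop accepting work; workers drain what is left
    await asyncio.gather(*workers)
    return processed
```

Compared with poison pills, this approach does not depend on the number of workers, at the cost of periodic wake-ups and up to one `poll_interval` of extra shutdown latency.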

**When to Use These Patterns**:

**Fan-out/fan-in**:
- ✅ Batch processing (image processing, data transformation)
- ✅ Parallel API calls (fetch from multiple endpoints)
- ✅ High-throughput pipelines (ETL, data ingestion)
- ❌ Tasks with complex dependencies (use DAG scheduler)
- ❌ Real-time request/response (use connection pooling)

**Graceful shutdown**:
- ✅ Long-running services (databases, message queues)
- ✅ Kubernetes/Docker deployments (SIGTERM signals)
- ✅ Stateful cleanup (flush buffers, close connections, persist state)
- ❌ Short-lived CLI tools (unless managing background tasks)
- ❌ Pure compute workloads (no I/O, no cleanup needed)

**Combined pattern**:
- ✅ Production microservices with background workers
- ✅ Data processing pipelines with failure recovery
- ✅ Multi-tenant systems with resource isolation

## Related Resources

**lionherd-core API Reference**:
- [TaskGroup](../..) - Structured concurrency primitive
- [Queue](../..) - Async FIFO queue with backpressure
- [shield()](../..) - Cancellation protection
- [non_cancel_subgroup()](../..) - Exception filtering

**Related Tutorials**:
- [Parallel Operations with Timeouts](./) - fail_after(), bounded_map()
- [Deadline-Aware Task Queue](./) - Time-bounded processing

**Reference Notebooks**:
- [Concurrency Primitives](../references/concurrency_primitives.ipynb) - Deep dive into Queue, Lock, Event

**External Resources**:
- [Structured Concurrency (Nathaniel J. Smith)](https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/) - Foundational concepts
- [Python asyncio TaskGroups](https://docs.python.org/3/library/asyncio-task.html#task-groups) - Standard library implementation
- [Kubernetes Pod Lifecycle](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination) - SIGTERM handling in production
