# Tutorial: Multi-Stage Data Processing Pipeline

**Category**: ln Utilities  
**Difficulty**: Intermediate  
**Time**: 15-20 minutes

## Problem Statement

Real-world data processing often requires multiple transformation steps: parse raw input, validate and clean, transform values, and deduplicate results. Each stage has different requirements—some need to flatten nested data, others must filter nulls or remove duplicates.

Traditional approaches chain list comprehensions or multiple map operations, resulting in verbose code with intermediate variables. Each stage requires manual null handling, flattening logic, and deduplication checks.

**Why This Matters**:
- **Clarity**: Pipeline intent obscured by implementation details (null checks, flattening loops)
- **Efficiency**: Intermediate list allocations for each transformation stage
- **Maintenance**: Scattered processing logic across multiple comprehensions
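
For comparison, the manual approach described above might look like the following plain-Python sketch. The sample data is made up for illustration, and no lionherd-core calls are involved.

```python
# Hypothetical sample input: CSV-like strings with an empty field and a None row
raw = ["1,2", "3,,2", None]

# Parse: skip None rows, map empty fields to None
parsed = [
    [int(x) if x else None for x in line.split(",")]
    for line in raw
    if line is not None
]

# Flatten one level of nesting
flat = [value for row in parsed for value in row]

# Drop None values
non_null = [value for value in flat if value is not None]

# Deduplicate while preserving first-seen order
seen = set()
unique = []
for value in non_null:
    if value not in seen:
        seen.add(value)
        unique.append(value)

# Only now do we reach the actual transformation
result = [value * 2 for value in unique]
print(result)  # [2, 4, 6]
```

Four of the five steps here are bookkeeping; only the last line expresses what the pipeline is for.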

**What You'll Build**:
A 3-stage data cleaning pipeline using `lcall` that parses CSV data, validates/transforms values, and deduplicates results—using input/output flags for concise, declarative processing.

## Prerequisites

**Prior Knowledge**:
- Python lists and basic transformations
- Understanding of map/filter operations
- Familiarity with None handling

**Required Packages**:
```bash
pip install lionherd-core  # >=0.1.0
```

**Optional Reading**:
- [API Reference: lcall](../../docs/api/ln/list_call.md)
- [Reference Notebook: to_list](../references/ln_to_list.ipynb)

In [1]:
# lionherd-core
from lionherd_core.ln import lcall

## Solution Overview

We'll build a data cleaning pipeline using chained `lcall` operations:

1. **Stage 1 (Parse)**: Parse CSV strings, flatten nested results, remove nulls
2. **Stage 2 (Transform)**: Apply transformations, filter nulls, deduplicate
3. **Stage 3 (Format)**: Format each cleaned value as a string (no flags needed)

**Key lionherd-core Features**:
- `lcall`: Apply function to each element with configurable input/output processing
- `input_*` flags: Process data BEFORE function application (flatten, dropna, unique)
- `output_*` flags: Process results AFTER function application

**Flow**:
```
Raw CSV Strings → lcall(parse, output_flatten, output_dropna) → 
  Parsed Values → lcall(transform, input_dropna, input_unique) → 
    Transformed → lcall(format) → 
      Final Results
```

**Expected Outcome**: Clean, deduplicated data from messy CSV input using declarative processing flags.

### Step 1: Single Stage - Basic Transformation

Start with a simple `lcall` operation: apply a function to each element in a list.

**Key Concept**: `lcall` maps a function over elements—like `map()` but returns a list directly.

In [2]:
# Simple transformation: convert strings to uppercase
words = ["hello", "world", "pipeline"]

result = lcall(words, str.upper)
print(f"Input:  {words}")
print(f"Output: {result}")
# Output: ['HELLO', 'WORLD', 'PIPELINE']

# With additional arguments
numbers = [1, 2, 3, 4, 5]
result = lcall(numbers, pow, 2)  # pow(x, 2) for each x
print(f"\nSquares: {result}")
# Output: [1, 4, 9, 16, 25]

Input:  ['hello', 'world', 'pipeline']
Output: ['HELLO', 'WORLD', 'PIPELINE']

Squares: [1, 4, 9, 16, 25]
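
To make the `map()` comparison concrete, here is what the two calls above correspond to using only built-ins. This is a sketch of the equivalent behavior, not of how `lcall` is implemented.

```python
words = ["hello", "world", "pipeline"]
numbers = [1, 2, 3, 4, 5]

# map() returns a lazy iterator, so it needs an explicit list() call
upper_words = list(map(str.upper, words))

# Extra arguments are applied to every element, as in pow(x, 2)
squares = [pow(x, 2) for x in numbers]

print(upper_words)  # ['HELLO', 'WORLD', 'PIPELINE']
print(squares)      # [1, 4, 9, 16, 25]
```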


### Step 2: Chaining Stages - Pipeline Composition

Chain multiple `lcall` operations: the output of the first stage becomes the input of the second.

**Key Concept**: Pass one `lcall` result directly to the next `lcall`; no intermediate variables are needed.

In [3]:
# Input: sentences to split into words
sentences = ["data processing", "pipeline example"]

# Stage 1: split each sentence
stage1 = lcall(sentences, str.split)
print(f"After split: {stage1}")
# [['data', 'processing'], ['pipeline', 'example']]

# Stage 2: flatten and uppercase each word
# Need to flatten first to get individual words
stage2 = lcall(stage1, str.upper, input_flatten=True)
print(f"After uppercase (flattened): {stage2}")
# ['DATA', 'PROCESSING', 'PIPELINE', 'EXAMPLE']

# Or chain in one expression:
result = lcall(lcall(sentences, str.split, output_flatten=True), str.upper)
print(f"\nChained result: {result}")

After split: [['data', 'processing'], ['pipeline', 'example']]
After uppercase (flattened): ['DATA', 'PROCESSING', 'PIPELINE', 'EXAMPLE']

Chained result: ['DATA', 'PROCESSING', 'PIPELINE', 'EXAMPLE']
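
For reference, the same two-stage chain can be written with the standard library alone. Note that `itertools.chain.from_iterable` flattens exactly one level, which is all this example needs; how deeply the flatten flags recurse is not shown here.

```python
from itertools import chain

sentences = ["data processing", "pipeline example"]

# Stage 1: split each sentence into a list of words
split = [s.split() for s in sentences]

# Flatten one level: [['data', 'processing'], ...] -> ['data', 'processing', ...]
words = list(chain.from_iterable(split))

# Stage 2: uppercase each word
result = [w.upper() for w in words]
print(result)  # ['DATA', 'PROCESSING', 'PIPELINE', 'EXAMPLE']
```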


### Step 3: Input Processing Flags

Use `input_*` flags to process data BEFORE applying the function.

**Key Flags**:
- `input_flatten`: Flatten nested structures before processing
- `input_dropna`: Remove `None` values before processing
- `input_unique`: Deduplicate inputs (requires flatten or dropna)

In [4]:
# Messy data with nesting and nulls
messy_data = [
    [1, 2, None],
    [3, None, 4],
    [2, 5],  # Note: 2 is duplicate
    None,
]

# Without input processing - would need manual handling
# With input processing - automatic cleanup
result = lcall(
    messy_data,
    lambda x: x * 10,
    input_flatten=True,  # Flatten [[1,2], [3,4]] → [1,2,3,4]
    input_dropna=True,  # Remove None values
    input_unique=True,  # Remove duplicates (requires flatten or dropna)
)

print(f"Input:  {messy_data}")
print(f"Output: {result}")
# After flatten:  [1, 2, None, 3, None, 4, 2, 5, None]
# After dropna:   [1, 2, 3, 4, 2, 5]
# After unique:   [1, 2, 3, 4, 5]
# After transform: [10, 20, 30, 40, 50]

Input:  [[1, 2, None], [3, None, 4], [2, 5], None]
Output: [10, 20, 30, 40, 50]
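
The flatten, dropna, unique, transform order traced in the comments above can be reproduced in plain Python. The helpers `flatten` and `preprocess` below are hypothetical names written for this illustration; they are not part of lionherd-core and may differ from its internals.

```python
def flatten(items):
    """Yield leaf values from arbitrarily nested lists (None counts as a leaf)."""
    for item in items:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item


def preprocess(items):
    """Flatten, drop None, then dedupe while keeping first-seen order."""
    flat = flatten(items)
    non_null = (x for x in flat if x is not None)
    # dict.fromkeys keeps insertion order; values must be hashable
    return list(dict.fromkeys(non_null))


messy_data = [[1, 2, None], [3, None, 4], [2, 5], None]

cleaned = preprocess(messy_data)
result = [x * 10 for x in cleaned]

print(cleaned)  # [1, 2, 3, 4, 5]
print(result)   # [10, 20, 30, 40, 50]
```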


### Step 4: Output Processing Flags

Use `output_*` flags to process results AFTER applying the function.

**Key Flags**:
- `output_flatten`: Flatten nested results
- `output_dropna`: Remove `None` from results
- `output_unique`: Deduplicate results (requires flatten or dropna)

In [5]:
# Function that returns nested results or None
def process_number(x):
    if x == 0:
        return None  # Invalid, will be filtered
    if x % 2 == 0:
        return [x, x // 2]  # Even: return [value, half]
    return x  # Odd: return as-is


numbers = [1, 2, 0, 3, 4, 0, 6]

# Without output processing
raw_result = lcall(numbers, process_number)
print(f"Without output flags: {raw_result}")
# [1, [2, 1], None, 3, [4, 2], None, [6, 3]]

# With output processing
clean_result = lcall(
    numbers,
    process_number,
    output_flatten=True,  # Flatten nested lists
    output_dropna=True,  # Remove None values
)
print(f"With output flags:    {clean_result}")
# [1, 2, 1, 3, 4, 2, 6, 3]

Without output flags: [1, [2, 1], None, 3, [4, 2], None, [6, 3]]
With output flags:    [1, 2, 1, 3, 4, 2, 6, 3]
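
As a rough mental model, output processing amounts to a post-pass over the raw results. The sketch below reproduces the cleaned list by hand, then shows what an order-preserving dedupe would add. That last step is an assumption about what `output_unique` would return here, since the cell above does not run it.

```python
def process_number(x):
    if x == 0:
        return None  # Invalid
    if x % 2 == 0:
        return [x, x // 2]  # Even: [value, half]
    return x  # Odd: as-is


numbers = [1, 2, 0, 3, 4, 0, 6]
raw = [process_number(n) for n in numbers]

clean = []
for r in raw:
    if r is None:
        continue  # role of output_dropna
    if isinstance(r, list):
        clean.extend(r)  # role of output_flatten
    else:
        clean.append(r)

# Order-preserving dedupe (assumed behavior, see note above)
deduped = list(dict.fromkeys(clean))

print(clean)    # [1, 2, 1, 3, 4, 2, 6, 3]
print(deduped)  # [1, 2, 3, 4, 6]
```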


### Step 5: Complete 3-Stage Pipeline

Combine everything: chain multiple `lcall` operations with different input/output flags for each stage.

**Real-World Example**: Parse CSV data → validate/transform → deduplicate.

In [6]:
# Messy CSV data from various sources
raw_csv = [
    "1,2,3",
    "4,,5",  # Empty value (becomes None)
    "6,7",
    None,  # Invalid row
    "2,8,9",  # Note: 2 is duplicate
]


# Stage 1: Parse CSV strings to integers
def parse_csv(line):
    """Parse CSV line, return list of ints (None for empty values)."""
    if line is None:
        return None
    return [int(x) if x else None for x in line.split(",")]


stage1 = lcall(
    raw_csv,
    parse_csv,
    input_dropna=True,  # Remove None rows before parsing
    output_flatten=True,  # Flatten [[1,2,3], [4,None,5]] → [1,2,3,4,None,5]
)
print(f"Stage 1 (Parse):     {stage1}")
# [1, 2, 3, 4, None, 5, 6, 7, 2, 8, 9]

# Stage 2: Transform (double values), remove nulls and duplicates
stage2 = lcall(
    stage1,
    lambda x: x * 2,
    input_flatten=True,  # Required for input_unique
    input_dropna=True,  # Remove None values before doubling
    input_unique=True,  # Remove duplicates (2 appears twice)
)
print(f"Stage 2 (Transform): {stage2}")
# Input after dropna+unique: [1, 2, 3, 4, 5, 6, 7, 8, 9]
# After transform: [2, 4, 6, 8, 10, 12, 14, 16, 18]

# Stage 3: Format as strings with prefix
stage3 = lcall(stage2, lambda x: f"val_{x}")
print(f"Stage 3 (Format):    {stage3}")
# ['val_2', 'val_4', 'val_6', 'val_8', 'val_10', 'val_12', 'val_14', 'val_16', 'val_18']

Stage 1 (Parse):     [1, 2, 3, 4, None, 5, 6, 7, 2, 8, 9]
Stage 2 (Transform): [2, 4, 6, 8, 10, 12, 14, 16, 18]
Stage 3 (Format):    ['val_2', 'val_4', 'val_6', 'val_8', 'val_10', 'val_12', 'val_14', 'val_16', 'val_18']
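
If the number of stages grows, the chaining itself can be factored out. The following is a minimal sketch with a hypothetical `run_pipeline` helper, using plain list-to-list functions as stand-ins for the three `lcall` stages above.

```python
from functools import reduce


def run_pipeline(data, stages):
    """Feed data through each stage in order; every stage maps a list to a list."""
    return reduce(lambda acc, stage: stage(acc), stages, data)


def parse(lines):
    # Skip None/empty rows and empty fields
    return [int(x) for line in lines if line for x in line.split(",") if x]


def dedupe_and_double(values):
    # dict.fromkeys removes duplicates while keeping first-seen order
    return [v * 2 for v in dict.fromkeys(values)]


def fmt(values):
    return [f"val_{v}" for v in values]


raw_csv = ["1,2,3", "4,,5", "6,7", None, "2,8,9"]
result = run_pipeline(raw_csv, [parse, dedupe_and_double, fmt])
print(result)
# ['val_2', 'val_4', 'val_6', 'val_8', 'val_10', 'val_12', 'val_14', 'val_16', 'val_18']
```

Keeping each stage as a named function also makes the stages individually testable.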


## Complete Working Example

Here's a copy-paste ready 3-stage pipeline that handles real-world data cleaning.

**Features**:
- ✅ Parse CSV, tolerating `None` rows, empty rows, and empty fields
- ✅ Flatten nested results automatically
- ✅ Remove nulls and duplicates declaratively
- ✅ Transform and format in clear stages

In [7]:
"""
Production-ready 3-stage data pipeline using lcall.

Demonstrates: CSV parsing → validation/transform → formatting
"""
from lionherd_core.ln import lcall

# Sample data: messy CSV from various sources
raw_data = [
    "100,200,300",
    "400,,500",  # Missing value
    "600,700",
    None,  # Invalid row
    "200,800,900",  # Duplicate: 200
    "",  # Empty row
]


# Stage 1: Parse CSV lines
def parse_csv_line(line):
    """Parse CSV, return ints (None for empty values)."""
    if not line:
        return None
    return [int(x) if x else None for x in line.split(",")]


parsed = lcall(
    raw_data,
    parse_csv_line,
    input_dropna=True,  # Drop None rows; the "" row is handled inside parse_csv_line
    output_flatten=True,  # Flatten nested lists
    output_dropna=True,  # Remove None from parsed values
)


# Stage 2: Validate and transform
def validate_and_transform(value):
    """Keep values >= 100, convert to float."""
    return float(value) if value >= 100 else None


transformed = lcall(
    parsed,
    validate_and_transform,
    output_flatten=True,  # Required for output_unique
    output_dropna=True,  # Remove invalid values (< 100)
    output_unique=True,  # Remove duplicates
)


# Stage 3: Format output
def format_currency(value):
    """Format as currency string."""
    return f"${value:,.2f}"


final = lcall(transformed, format_currency)

print("Pipeline Results:")
print(f"  Raw input:   {len(raw_data)} rows")
print(f"  After parse: {len(parsed)} values")
print(f"  After valid: {len(transformed)} values")
print(f"  Final:       {final}")
# Final: ['$100.00', '$200.00', '$300.00', '$400.00', '$500.00', '$600.00', '$700.00', '$800.00', '$900.00']

Pipeline Results:
  Raw input:   6 rows
  After parse: 10 values
  After valid: 9 values
  Final:       ['$100.00', '$200.00', '$300.00', '$400.00', '$500.00', '$600.00', '$700.00', '$800.00', '$900.00']


## Production Considerations

### When to Use Input vs Output Flags

**Input Processing** (`input_*`):
- Use when data needs normalization BEFORE function sees it
- Example: Flatten nested API responses before validation
- Benefit: Function doesn't need to handle nesting/nulls

**Output Processing** (`output_*`):
- Use when function produces nested/nullable results
- Example: Function returns `[value, metadata]` pairs, need flat list
- Benefit: Declarative result transformation

### Performance

- **Chaining overhead**: Each `lcall` call builds an intermediate list
- **For large datasets (>10K items)**: Consider single-pass processing
- **For small-medium (< 10K)**: Clarity matters more than micro-optimization
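
One possible shape for single-pass processing is a generator that parses, validates, and deduplicates each value as it goes, so no intermediate lists are built. This is a plain-Python sketch with a hypothetical `clean_values` function, reusing the sample data from the complete example.

```python
def clean_values(lines):
    """Yield unique floats >= 100 from CSV lines, one value at a time."""
    seen = set()
    for line in lines:
        if not line:  # None or empty row
            continue
        for field in line.split(","):
            if not field:  # Empty field, e.g. "400,,500"
                continue
            value = float(int(field))
            if value < 100 or value in seen:
                continue
            seen.add(value)
            yield value


raw_data = ["100,200,300", "400,,500", "600,700", None, "200,800,900", ""]

final = [f"${v:,.2f}" for v in clean_values(raw_data)]
print(final)
# ['$100.00', '$200.00', '$300.00', '$400.00', '$500.00', '$600.00', '$700.00', '$800.00', '$900.00']
```

The trade-off is that the stage boundaries disappear: parsing, validation, and deduplication are interleaved in one function.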

### Error Handling

```python
# Wrap stages in try/except for production.
# `data`, `parse`, and `transform` stand in for your own input and stage functions.
try:
    stage1 = lcall(data, parse, input_dropna=True)
    stage2 = lcall(stage1, transform, output_dropna=True)
except ValueError as e:
    # Handle parsing errors
    print(f"Pipeline failed: {e}")
```
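
Wrapping whole stages means one bad item fails the entire run. An alternative is to handle errors per item by converting failures to `None`, which a dropna step can then remove. The `skip_on_error` wrapper below is a hypothetical helper, and list comprehensions stand in for the `lcall` call and its `output_dropna` flag.

```python
def skip_on_error(func, *exceptions):
    """Return a version of func that yields None instead of raising the given exceptions."""
    def wrapper(item):
        try:
            return func(item)
        except exceptions:
            return None
    return wrapper


safe_int = skip_on_error(int, ValueError)

raw = ["1", "x", "3"]
results = [safe_int(r) for r in raw]            # stand-in for lcall(raw, safe_int)
parsed = [r for r in results if r is not None]  # role of output_dropna

print(results)  # [1, None, 3]
print(parsed)   # [1, 3]
```

This silently discards bad items, so consider logging inside the wrapper if dropped rows matter.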

### Testing

Test each stage independently:

```python
def test_parse_stage():
    result = lcall(["1,2", "3,4"], parse_csv_line, output_flatten=True)
    assert result == [1, 2, 3, 4]

def test_transform_stage():
    result = lcall([1, None, 2], lambda x: x * 2, input_dropna=True)
    assert result == [2, 4]
```
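
The stage functions themselves are ordinary Python and can be tested without `lcall` at all. A minimal sketch, with `parse_csv_line` and `validate_and_transform` copied from the complete example so that the block runs on its own:

```python
def parse_csv_line(line):
    """Parse CSV, return ints (None for empty values)."""
    if not line:
        return None
    return [int(x) if x else None for x in line.split(",")]


def validate_and_transform(value):
    """Keep values >= 100, convert to float."""
    return float(value) if value >= 100 else None


def test_parse_csv_line():
    assert parse_csv_line("1,,3") == [1, None, 3]
    assert parse_csv_line("") is None
    assert parse_csv_line(None) is None


def test_parse_rejects_non_numeric():
    # int("abc") raises ValueError, which parse_csv_line does not catch
    try:
        parse_csv_line("1,abc")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")


def test_validate_and_transform():
    assert validate_and_transform(100) == 100.0
    assert validate_and_transform(99) is None


for test in (test_parse_csv_line, test_parse_rejects_non_numeric, test_validate_and_transform):
    test()
```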

## Variations

### 1. Conditional Transformation

Apply different logic based on input values:

```python
def conditional_transform(x):
    if x < 100:
        return None  # Filter out
    elif x < 1000:
        return [x, "small"]  # Tag as small
    else:
        return [x, "large"]  # Tag as large

result = lcall(
    [50, 200, 1500],
    conditional_transform,
    output_dropna=True,
    output_flatten=True
)
# [200, 'small', 1500, 'large']
```

### 2. Shared Config via Extra Arguments

Extra positional and keyword arguments are forwarded to every function call:

```python
def scale(value, factor, offset=0):
    return value * factor + offset

result = lcall(
    [1, 2, 3],
    scale,
    10,           # factor=10 passed to each call
    offset=5      # offset=5 as kwarg
)
# [15, 25, 35]  (each value: x*10 + 5)
```
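
The same effect can be achieved by binding the configuration up front with `functools.partial`, which is useful when the configured function is reused across several stages. A standard-library sketch:

```python
from functools import partial


def scale(value, factor, offset=0):
    return value * factor + offset


# Bind the shared configuration once
scale_10_plus_5 = partial(scale, factor=10, offset=5)

result = [scale_10_plus_5(x) for x in [1, 2, 3]]
print(result)  # [15, 25, 35]
```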

### 3. Enum Value Extraction

Process enum values directly:

```python
from enum import Enum

class Status(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"

statuses = [Status.ACTIVE, Status.INACTIVE]
result = lcall(
    statuses,
    str.upper,
    input_use_values=True  # Extract .value before processing
)
# ['ACTIVE', 'INACTIVE']
```
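
For comparison, extracting enum values by hand needs only the standard library. This sketch makes no assumptions about how `input_use_values` works; it shows the result the snippet above is aiming for.

```python
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"


statuses = [Status.ACTIVE, Status.INACTIVE]

# Read .value from each member, then transform
result = [s.value.upper() for s in statuses]
print(result)  # ['ACTIVE', 'INACTIVE']

# Iterating the Enum class itself yields every member in definition order
all_values = [member.value for member in Status]
print(all_values)  # ['active', 'inactive']
```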

## Summary

**What You Accomplished**:
- ✅ Chained multiple `lcall` operations for multi-stage pipelines
- ✅ Used `input_*` flags for pre-processing (flatten, dropna, unique)
- ✅ Used `output_*` flags for post-processing results
- ✅ Built production-ready CSV cleaning pipeline in ~30 lines

**Key Takeaways**:
1. **Declarative processing**: Use flags instead of manual loops for flatten/filter/dedupe
2. **Pipeline composition**: Chain `lcall` outputs as inputs for multi-stage transforms
3. **Input vs Output flags**: Process data before function (`input_*`) or after (`output_*`)
4. **Unique requires flatten/dropna**: Can't deduplicate without normalization

**When to Use This Pattern**:
- ✅ Multi-step data transformations (parse → validate → format)
- ✅ Cleaning messy data with nesting and nulls
- ✅ API response processing with nested/optional fields
- ❌ Single-step transformations (use list comprehension)
- ❌ Very large datasets needing streaming (use generators)

## Related Resources

**lionherd-core API Reference**:
- [lcall](../../docs/api/ln/list_call.md) - Synchronous list processing with input/output flags
- [to_list](../../docs/api/ln/to_list.md) - Underlying list conversion utility

**Reference Notebooks**:
- [to_list Patterns](../references/ln_to_list.ipynb) - Deep dive into flatten/dropna/unique

**External Resources**:
- [Python map() Documentation](https://docs.python.org/3/library/functions.html#map) - Built-in mapping alternative
- [itertools Documentation](https://docs.python.org/3/library/itertools.html) - Standard library iteration tools