# Tutorial: Fuzzy Validation with Custom Parameters

**Category**: ln Utilities  
**Difficulty**: Intermediate  
**Time**: 15-20 minutes

## Problem Statement

When validating data from external sources (LLM outputs, API responses, user uploads), field names often don't match your expected schema exactly. You might receive `"usr_name"` instead of `"user_name"`, or `"email_addr"` instead of `"email"`. Different data sources require different validation strategies—strict internal APIs should fail on unrecognized fields, while lenient external integrations should skip them.

**Why This Matters**:
- **Data Quality**: Rejecting valid data due to minor field name variations wastes information
- **Flexibility**: Different validation contexts (strict vs lenient) require different handling strategies

**What You'll Build**:
A fuzzy validation system using lionherd-core's `FuzzyMatchKeysParams` that handles field name variations and adapts validation strategies based on context.

## Prerequisites

**Prior Knowledge**:
- Python dictionaries and Pydantic models
- Basic understanding of string similarity concepts

**Required Packages**:
```bash
pip install "lionherd-core>=0.1.0"
```

In [1]:
# Third-party
from pydantic import BaseModel

# lionherd-core
from lionherd_core.ln import fuzzy_validate_pydantic
from lionherd_core.ln._fuzzy_match import FuzzyMatchKeysParams

## Solution Overview

We'll implement fuzzy validation with reusable parameter configurations:

1. **Default Behavior**: Use out-of-the-box fuzzy validation
2. **Parameter Objects**: Create `FuzzyMatchKeysParams` for reusable configs
3. **Validation Strategies**: Configure different modes (strict, lenient, fill-missing)

**Key lionherd-core Components**:
- `fuzzy_validate_pydantic()`: Validates data into Pydantic models with fuzzy key matching
- `FuzzyMatchKeysParams`: Reusable parameter object for consistent validation

**Expected Outcome**: Clean, validated data structures from messy inputs with configurable strategies for different data sources.

**Note**: Defaults use Jaro-Winkler similarity (0.85 threshold) and remove unmatched keys automatically.
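
To make the 0.85 default concrete, the sketch below implements the textbook Jaro-Winkler similarity in plain Python and scores a few key pairs of the kind used in this tutorial. It is an illustration only: lionherd-core's own implementation may differ in details such as prefix weight or case handling, and `jaro` and `jaro_winkler` are hypothetical helper names, not library API.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


pairs = [("usr_name", "user_name"), ("loc", "location"), ("extra_field", "email")]
for got, want in pairs:
    print(f"{got!r} vs {want!r}: {jaro_winkler(got, want):.3f}")
```

Under these assumptions `usr_name` scores about 0.97 against `user_name`, and `loc` scores just above 0.85 against `location`, so both would clear the default threshold. The shared-prefix boost is what lifts abbreviations such as `loc` over the line.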

### Step 1: Define Data Model and Test Data

Create a Pydantic model and sample data with common field name variations.

**Key Point**: Real data sources rarely match schemas perfectly—typos, abbreviations, and naming conventions vary.

In [2]:
# Define expected schema
class UserProfile(BaseModel):
    """User profile with strict field names."""

    user_name: str
    email: str
    age: int
    location: str


# Test data with field name variations
messy_data = {"usr_name": "Alice", "email": "alice@example.com", "age": 30, "loc": "NYC"}

print("Expected fields:", list(UserProfile.model_fields.keys()))
print("Actual fields:", list(messy_data.keys()))

Expected fields: ['user_name', 'email', 'age', 'location']
Actual fields: ['usr_name', 'email', 'age', 'loc']
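
Before reaching for fuzzy matching, it can help to see exactly which keys disagree. A minimal sketch using plain set arithmetic on the field names printed above (no library calls involved):

```python
expected = {"user_name", "email", "age", "location"}
actual = {"usr_name", "email", "age", "loc"}

# Expected fields the input does not provide under their exact names.
missing = sorted(expected - actual)
# Input keys the schema does not recognize as-is.
unexpected = sorted(actual - expected)

print("Missing:", missing)
print("Unexpected:", unexpected)
```

Each unexpected key is a candidate for fuzzy matching against one of the missing fields; keys that already match exactly need no work.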


### Step 2: Default Fuzzy Validation

Use default fuzzy validation with no custom parameters. The defaults (Jaro-Winkler similarity, 0.85 threshold, unmatched keys removed) cover common cases such as abbreviations and small typos.

In [3]:
# Enable fuzzy matching with defaults
result = fuzzy_validate_pydantic(
    messy_data,
    UserProfile,
    fuzzy_match=True,
    fuzzy_match_params=None,  # Use defaults
)

print(f"Validated: {result.user_name}, {result.email}, age={result.age}, {result.location}")

Validated: Alice, alice@example.com, age=30, NYC
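
Conceptually, the key-matching stage renames close-enough keys to their expected names and drops the rest before Pydantic sees the data. The sketch below illustrates that idea with the standard library only. It is not the library's code: `remap_keys` and `similarity` are hypothetical names, and `difflib.SequenceMatcher` is swapped in as a stand-in metric, so its scores differ from Jaro-Winkler.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the Jaro-Winkler metric."""
    return SequenceMatcher(None, a, b).ratio()


def remap_keys(data: dict, expected: list[str], threshold: float = 0.85) -> dict:
    """Rename fuzzy-matched keys to their expected names and drop the rest."""
    result = {}
    # Pass 1: exact matches claim their expected key first.
    for key, value in data.items():
        if key in expected:
            result[key] = value
    # Pass 2: each leftover input key takes its best unclaimed expected key.
    for key, value in data.items():
        if key in result:
            continue
        candidates = [name for name in expected if name not in result]
        if not candidates:
            break
        best = max(candidates, key=lambda name: similarity(key, name))
        if similarity(key, best) >= threshold:
            result[best] = value
        # Keys with no candidate above the threshold are dropped.
    return result


expected_fields = ["user_name", "email", "age", "location"]
raw = {"usr_name": "Alice", "emai": "alice@example.com", "age": 30, "extra": "x"}
remapped = remap_keys(raw, expected_fields)
print(remapped)
```

Claiming exact matches first prevents a misspelled key from stealing a field that another key matches exactly. Here `location` stays absent, which is the gap the fill strategies in the next step address.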


### Step 3: Create Reusable Parameter Objects

For repeated validations, create `FuzzyMatchKeysParams` objects. They are callable (Step 4 calls one directly on a dict) and can be shared across any number of validations.

**Pattern**: Define validation strategies as named parameter objects for clarity.

In [4]:
# STRICT: High threshold, raise on unmatched
strict_params = FuzzyMatchKeysParams(
    similarity_threshold=0.9, handle_unmatched="raise", strict=True
)

# LENIENT: Lower threshold, non-strict, fill missing fields with a placeholder
lenient_params = FuzzyMatchKeysParams(
    similarity_threshold=0.7, handle_unmatched="fill", fill_value="Unknown", strict=False
)

# FILL_MISSING: Fill missing fields with defaults
fill_params = FuzzyMatchKeysParams(
    similarity_threshold=0.85, handle_unmatched="fill", fill_value="N/A"
)

print("Created 3 reusable parameter configurations")
print(f"  strict: {strict_params.similarity_threshold} threshold")
print(f"  lenient: {lenient_params.similarity_threshold} threshold")
print(f"  fill: fills missing with '{fill_params.fill_value}'")

Created 3 reusable parameter configurations
  strict: 0.9 threshold
  lenient: 0.7 threshold
  fill: fills missing with 'N/A'


### Step 4: Apply Parameter Objects to Different Data Sources

Demonstrate reusing parameter objects across multiple data sources. Different sources → different configs → consistent behavior per source.

**Key Point**: Parameter objects are immutable (frozen dataclasses) and can be safely reused.
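
What immutability buys you can be shown with a small frozen dataclass. This is a sketch under stated assumptions: `MatchParams` is a hypothetical stand-in, not the lionherd-core class, and the `"remove"` default string is invented for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class MatchParams:  # hypothetical stand-in for a parameter object
    similarity_threshold: float = 0.85
    handle_unmatched: str = "remove"  # assumed value, for illustration only

    def __call__(self, data: dict, expected: list[str]) -> dict:
        """Toy behavior: keep only the keys that match exactly."""
        return {k: v for k, v in data.items() if k in expected}


strict = MatchParams(similarity_threshold=0.9, handle_unmatched="raise")

try:
    strict.similarity_threshold = 0.5  # mutation is rejected
except FrozenInstanceError:
    print("params are read-only")

# Derive a variant instead of mutating the shared object.
relaxed = replace(strict, similarity_threshold=0.75)
print(strict.similarity_threshold, relaxed.similarity_threshold)
```

Because a shared config cannot change underneath its callers, a module-level object can be handed to any number of validation calls without defensive copying.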

In [5]:
# Scenario 1: Internal API (strict)
internal_data = {"user_name": "Eve", "email": "eve@company.com", "age": 28, "location": "Seattle"}

print("Internal API (strict):")
result = fuzzy_validate_pydantic(
    internal_data, UserProfile, fuzzy_match=True, fuzzy_match_params=strict_params
)
print(f"  ✓ Validated: {result.user_name}\n")

# Scenario 2: Third-party API (lenient)
third_party_data = {
    "username": "Frank",
    "emai": "frank@external.com",
    "ag": 45,
    "locaton": "Austin",
    "extra_field": "ignored",
}

print("Third-party API (lenient):")
result = fuzzy_validate_pydantic(
    third_party_data, UserProfile, fuzzy_match=True, fuzzy_match_params=lenient_params
)
print(f"  ✓ Validated: {result.user_name}, age={result.age}")
print("  (extra_field ignored)\n")

# Scenario 3: LLM output (fill missing)
llm_output = {"user_name": "Grace", "email": "grace@ai.com"}

print("LLM output (fill missing):")
filled = fill_params(llm_output, UserProfile.model_fields)
print(f"  ✓ Filled dict: {filled}")
print(f"  Missing fields filled with: '{fill_params.fill_value}'")

Internal API (strict):
  ✓ Validated: Eve

Third-party API (lenient):
  ✓ Validated: Frank, age=45
  (extra_field ignored)

LLM output (fill missing):
  ✓ Filled dict: {'user_name': 'Grace', 'email': 'grace@ai.com', 'location': 'N/A', 'age': 'N/A'}
  Missing fields filled with: 'N/A'


## Complete Working Example

Production-ready validation system handling multiple sources with different strategies.

In [6]:
"""
Production fuzzy validation pipeline.
"""
from typing import Literal

from pydantic import BaseModel

from lionherd_core.ln import fuzzy_validate_pydantic
from lionherd_core.ln._fuzzy_match import FuzzyMatchKeysParams


class UserProfile(BaseModel):
    user_name: str
    email: str
    age: int
    location: str


# Validation configs for different sources
CONFIGS = {
    "internal": FuzzyMatchKeysParams(
        similarity_threshold=0.9, handle_unmatched="raise", strict=True
    ),
    "external": FuzzyMatchKeysParams(
        similarity_threshold=0.75, handle_unmatched="fill", fill_value="Unknown", strict=False
    ),
    "llm": FuzzyMatchKeysParams(
        similarity_threshold=0.85,
        handle_unmatched="fill",
        fill_value="UNKNOWN",
        fill_mapping={"age": 0, "location": "Remote"},
    ),
}


def validate_user(
    data: dict, source: Literal["internal", "external", "llm"]
) -> tuple[bool, UserProfile | str]:
    """Validate with source-appropriate strategy."""
    config = CONFIGS[source]

    try:
        result = fuzzy_validate_pydantic(
            data, UserProfile, fuzzy_match=True, fuzzy_match_params=config
        )
        return (True, result)
    except Exception as e:
        return (False, f"{source} validation failed: {e}")


# Test cases
test_cases = [
    ("internal", {"user_name": "Alice", "email": "alice@co.com", "age": 30, "location": "NYC"}),
    ("external", {"username": "Bob", "emai": "bob@ext.com", "ag": 25, "extra": "data"}),
    ("llm", {"user_name": "Charlie", "email": "charlie@ai.com"}),
]

for source, data in test_cases:
    success, result = validate_user(data, source)
    print(f"[{source.upper()}]")
    if success:
        print(f"  ✓ {result.user_name}, {result.email}, age={result.age}")
    else:
        print(f"  ✗ {result}")

[INTERNAL]
  ✓ Alice, alice@co.com, age=30
[EXTERNAL]
  ✓ Bob, bob@ext.com, age=25
[LLM]
  ✓ Charlie, charlie@ai.com, age=0


## Production Considerations

### Threshold Tuning

- **0.95+**: Near-exact matches (minor typos only)
- **0.85**: Naming conventions (camelCase ↔ snake_case)
- **0.75**: More lenient (recommended range: 0.75-0.85)
- **<0.6**: Too permissive (random matches)
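
One way to choose a threshold is to sweep it over key pairs sampled from your real inputs and see which ones are accepted. The sketch below does this with `difflib.SequenceMatcher` as a stand-in metric (an assumption: it is not the library's Jaro-Winkler); `score` and `accepted` are hypothetical helper names.

```python
from difflib import SequenceMatcher

# (observed key, intended field) pairs; the last one should never match.
pairs = [
    ("usr_name", "user_name"),
    ("emai", "email"),
    ("loc", "location"),
    ("extra", "email"),
]


def score(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def accepted(threshold: float) -> list[str]:
    """Observed keys whose score reaches the threshold."""
    return [got for got, want in pairs if score(got, want) >= threshold]


for threshold in (0.95, 0.85, 0.75, 0.5):
    print(threshold, accepted(threshold))
```

Scores are specific to the metric, so a threshold tuned against one algorithm does not carry over to another. Run the sweep with the same similarity function your validation uses.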

### Error Handling

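```python
import logging

logger = logging.getLogger(__name__)


# NOTE: ValidationError must also be imported; use the exception class
# that fuzzy_validate_pydantic raises in your installed version.
def safe_validate(data: dict, params: FuzzyMatchKeysParams):
    """Return the validated model, or None if validation fails."""
    try:
        return fuzzy_validate_pydantic(
            data, UserProfile,
            fuzzy_match=True,
            fuzzy_match_params=params
        )
    except ValidationError as e:
        logger.error("Validation failed: %s", e)
        return None
```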

### Performance

- Key matching scores every input key against every expected key: O(n×m) similarity computations for n input keys and m expected keys
- Jaro-Winkler is the fastest option (~1-2 ms for 10 fields)
- Total overhead: <5 ms per validation

**Optimization**: Reuse parameter objects (avoid repeated instantiation)
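
The figures above depend on hardware and key lengths, so it is worth measuring on your own data. A minimal timing sketch, assuming `difflib.SequenceMatcher` as a stand-in metric and a hypothetical `match_all` helper that performs the n×m comparisons:

```python
import timeit
from difflib import SequenceMatcher

input_keys = [f"fld_{i}" for i in range(10)]
expected_keys = [f"field_{i}" for i in range(10)]


def match_all(inputs: list[str], expected: list[str]) -> int:
    """Score every input key against every expected key; return the count."""
    comparisons = 0
    for got in inputs:
        for want in expected:
            SequenceMatcher(None, got, want).ratio()
            comparisons += 1
    return comparisons


count = match_all(input_keys, expected_keys)
runs = 20
seconds = timeit.timeit(lambda: match_all(input_keys, expected_keys), number=runs)
print(f"{count} comparisons, {seconds / runs * 1000:.2f} ms per pass")
```

Ten input keys against ten expected keys means 100 similarity computations per validation, which is why the cost grows quickly for wide schemas.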

## Variations

### Custom Fill Mappings

```python
custom_fill = FuzzyMatchKeysParams(
    similarity_threshold=0.85,
    handle_unmatched="fill",
    fill_value="UNKNOWN",
    fill_mapping={
        "age": 0,
        "location": "Remote"
    }
)

# Field-specific defaults override fill_value
```

### Progressive Threshold

```python
def progressive_validate(data: dict):
    # Try strict first
    try:
        return fuzzy_validate_pydantic(
            data, UserProfile,
            fuzzy_match=True,
            fuzzy_match_params=FuzzyMatchKeysParams(
                similarity_threshold=0.95
            )
        )
    except Exception:
        # Fall back to lenient; a failure here propagates to the caller
        return fuzzy_validate_pydantic(
            data, UserProfile,
            fuzzy_match=True,
            fuzzy_match_params=FuzzyMatchKeysParams(
                similarity_threshold=0.75
            )
        )
```

## Summary

**What You Accomplished**:
- ✅ Built fuzzy validation with reusable parameter objects
- ✅ Configured different strategies for different data sources
- ✅ Handled field name variations automatically

**Key Takeaways**:
1. **Parameter objects > inline dicts**: Reusability and type safety
2. **Threshold 0.75-0.85**: Balance between flexibility and correctness
3. **Different sources need different strategies**: Strict internal, lenient external

**When to Use**:
- ✅ LLM outputs with field name variations
- ✅ Third-party APIs with inconsistent naming
- ✅ User-uploaded data with flexible schemas
- ❌ Exact schema enforcement (use standard Pydantic)

## Related Resources

- [fuzzy_validate API](../../docs/api/ln/fuzzy_validate.md)
- [Pydantic Documentation](https://docs.pydantic.dev/)