# Fuzzy Validation - Resilient Data Validation for Lionherd

Fuzzy validation keeps data validation robust in real-world scenarios where input arrives malformed, inconsistently keyed, or wrapped in extra text.

**Core Features:**
- **Fuzzy Parsing**: Extract JSON from markdown, mixed text, or malformed strings
- **Fuzzy Key Matching**: Match similar keys using string similarity algorithms (typos, case differences)
- **Pydantic Integration**: Seamless validation with Pydantic models
- **Lenient/Strict Modes**: Balance between flexibility and validation strictness
- **Handle Unmatched Strategies**: Multiple strategies for dealing with unexpected keys

**Why Fuzzy Validation?**
- LLM outputs are unpredictable (extra keys, typos, markdown wrapping)
- API responses vary (snake_case vs camelCase, missing fields)
- User input is messy (typos, optional fields, extra data)

**Two Main Functions:**
1. `fuzzy_validate_pydantic()` - For Pydantic model validation
2. `fuzzy_validate_mapping()` - For dict validation with expected keys

In [1]:
from pydantic import BaseModel, Field

from lionherd_core.ln import FuzzyMatchKeysParams, fuzzy_validate_mapping, fuzzy_validate_pydantic

## 1. Fuzzy Validation for Pydantic Models

The most common use case: validating LLM or API output into a Pydantic model with fuzzy parsing and key matching.

In [2]:
# Define a Pydantic model
class Person(BaseModel):
    name: str
    age: int
    email: str


# Standard validation - already valid dict
valid_dict = {"name": "Alice", "age": 30, "email": "alice@example.com"}
person = fuzzy_validate_pydantic(valid_dict, Person)
print(f"Name: {person.name}, Age: {person.age}")

Name: Alice, Age: 30


### 1.1 Fuzzy Parsing - Extract JSON from Text

LLMs often wrap JSON in markdown or add commentary. Fuzzy parsing extracts the JSON.

In [3]:
# LLM output with markdown wrapping
llm_output = """
Here's the user information:

```json
{
    "name": "Bob",
    "age": 25,
    "email": "bob@example.com"
}
```

Hope this helps!
"""

# fuzzy_parse=True extracts JSON from markdown
person = fuzzy_validate_pydantic(llm_output, Person, fuzzy_parse=True)
print(f"✓ Parsed from markdown: {person.name}")

✓ Parsed from markdown: Bob
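
To make the extraction step concrete, here is a minimal standard-library sketch of pulling a fenced JSON block out of surrounding commentary. The helper name `extract_json_block` is hypothetical and this is not how `lionherd_core` implements `fuzzy_parse`; it only illustrates the idea under the assumption that the payload sits in the first fenced block.

```python
import json
import re

# Build the fence marker at runtime so this example nests cleanly in markdown.
FENCE = "`" * 3

# Capture everything between the first pair of fences, with an optional "json" tag.
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json_block(text: str) -> dict:
    """Hypothetical helper: parse the first fenced block, or the raw text if none."""
    match = FENCED_BLOCK.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


reply = (
    "Here you go:\n"
    + FENCE + 'json\n{"name": "Bob", "age": 25}\n' + FENCE
    + "\nHope this helps!"
)
print(extract_json_block(reply))
```

A real fuzzy parser would likely also repair malformed JSON (trailing commas, single quotes); this sketch only handles the wrapping.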


### 1.2 Fuzzy Key Matching - Handle Typos and Variations

When an LLM or API returns keys with typos or case differences, fuzzy matching maps them back to the expected field names.

In [4]:
# Data with typos in keys
typo_data = {
    "nam": "Charlie",  # Missing 'e'
    "Age": 35,  # Wrong case
    "e_mail": "charlie@example.com",  # Extra underscore
}

# Without fuzzy matching - would fail validation
try:
    Person.model_validate(typo_data)
except Exception as e:
    print(f"Standard validation fails: {type(e).__name__}")

# With fuzzy matching - succeeds
person = fuzzy_validate_pydantic(
    typo_data, Person, fuzzy_match=True, fuzzy_match_params={"handle_unmatched": "remove"}
)
print(f"✓ Fuzzy matching succeeded: {person.name}, {person.age}")

Standard validation fails: ValidationError
✓ Fuzzy matching succeeded: Charlie, 35
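
The key-correction step can be approximated with the standard library. The sketch below uses `difflib.SequenceMatcher` as a stand-in similarity measure (the library itself defaults to Jaro-Winkler), and `match_keys` is a hypothetical helper, not a `lionherd_core` function. It assumes case-insensitive comparison and greedy matching in input order.

```python
from difflib import SequenceMatcher


def match_keys(data: dict, expected: list[str], threshold: float = 0.7) -> dict:
    """Rename each input key to its most similar expected key, if similar enough."""
    result = {}
    remaining = list(expected)
    for key, value in data.items():
        # Score this key against every expected key that is still unclaimed.
        scored = [
            (SequenceMatcher(None, key.lower(), candidate.lower()).ratio(), candidate)
            for candidate in remaining
        ]
        best_score, best_key = max(scored, default=(0.0, None))
        if best_key is not None and best_score >= threshold:
            result[best_key] = value
            remaining.remove(best_key)
        # Keys below the threshold are dropped, similar in spirit to "remove".
    return result


typo_data = {"nam": "Charlie", "Age": 35, "e_mail": "charlie@example.com"}
print(match_keys(typo_data, ["name", "age", "email"]))
```

Greedy matching is the simplest choice; it can pick a suboptimal pairing when two input keys compete for the same expected key.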


### 1.3 FuzzyMatchKeysParams - Reusable Configuration

For consistent fuzzy matching behavior, create a reusable params object.

In [5]:
# Define fuzzy matching strategy
fuzzy_params = FuzzyMatchKeysParams(
    similarity_algo="jaro_winkler",  # Algorithm for string similarity
    similarity_threshold=0.85,  # 85% similarity required
    handle_unmatched="remove",  # Remove unmatched keys
)

# Use params object for consistent validation
messy_data = {"Name": "David", "AGE": 40, "Email": "david@example.com", "extra_field": "ignored"}

person = fuzzy_validate_pydantic(
    messy_data, Person, fuzzy_match=True, fuzzy_match_params=fuzzy_params
)
print(f"✓ Validated with params: {person.name}, age {person.age}")
print(f"Extra field removed: {not hasattr(person, 'extra_field')}")

✓ Validated with params: David, age 40
Extra field removed: True


## 2. Fuzzy Validation for Mappings

When you don't have a Pydantic model, validate dicts against expected keys directly.

In [6]:
# Define expected keys
expected_keys = ["user_id", "username", "is_active"]

# Input with variations ("active" is too far from "is_active" at this threshold, so it is dropped)
input_data = {"userId": "123", "UserName": "alice", "active": True}

# Fuzzy validate mapping
validated = fuzzy_validate_mapping(
    input_data,
    expected_keys,
    fuzzy_match=True,
    similarity_threshold=0.80,
    handle_unmatched="remove",
)

print(f"Validated keys: {list(validated.keys())}")
print(f"Values: {validated}")

Validated keys: ['user_id', 'username']
Values: {'user_id': '123', 'username': 'alice'}


### 2.1 Handle Unmatched Strategies

Control what happens with keys that don't match expected keys:
- `ignore` - Keep unmatched keys in the output
- `remove` - Drop unmatched keys
- `raise` - Raise an error if any unmatched keys are found
- `fill` - Fill missing expected keys with a default value; unmatched keys are kept
- `force` - Fill missing expected keys and drop unmatched keys

In [7]:
data = {"name": "Alice", "unexpected_key": "value"}
expected = ["name", "age"]

# IGNORE - keep unmatched
result_ignore = fuzzy_validate_mapping(data, expected, handle_unmatched="ignore")
print(f"ignore: {list(result_ignore.keys())}")

# REMOVE - remove unmatched
result_remove = fuzzy_validate_mapping(data, expected, handle_unmatched="remove")
print(f"remove: {list(result_remove.keys())}")

# FILL - fill missing with default
result_fill = fuzzy_validate_mapping(data, expected, handle_unmatched="fill", fill_value=None)
print(f"fill: {result_fill}")

# FORCE - fill missing, remove unmatched
result_force = fuzzy_validate_mapping(data, expected, handle_unmatched="force", fill_value=0)
print(f"force: {result_force}")

# RAISE - error on unmatched
try:
    fuzzy_validate_mapping(data, expected, handle_unmatched="raise", fuzzy_match=False)
except ValueError as e:
    print(f"raise: {e}")

ignore: ['name', 'unexpected_key']
remove: ['name']
fill: {'name': 'Alice', 'age': None, 'unexpected_key': 'value'}
force: {'name': 'Alice', 'age': 0}
raise: Unmatched keys found: {'unexpected_key'}
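
The five strategies reduce to two independent decisions: whether to drop extra keys and whether to fill missing ones. The following sketch reproduces the behavior shown in the outputs above with plain dict operations. `apply_unmatched` is a hypothetical helper written for illustration, with no fuzzy matching involved.

```python
def apply_unmatched(data: dict, expected: list[str], mode: str, fill_value=None) -> dict:
    """Apply one handle_unmatched-style strategy to an already-matched dict."""
    expected_set = set(expected)
    extra = [key for key in data if key not in expected_set]
    missing = [key for key in expected if key not in data]

    if mode == "raise" and extra:
        raise ValueError(f"Unmatched keys found: {sorted(extra)}")

    result = dict(data)
    if mode in ("remove", "force"):
        # Drop keys that are not in the expected list.
        for key in extra:
            del result[key]
    if mode in ("fill", "force"):
        # Add expected keys that the input did not provide.
        for key in missing:
            result[key] = fill_value
    return result


data = {"name": "Alice", "unexpected_key": "value"}
expected = ["name", "age"]
for mode in ("ignore", "remove", "fill", "force"):
    print(mode, apply_unmatched(data, expected, mode, fill_value=0))
```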


### 2.2 Fill Mapping - Custom Values for Missing Keys

Provide custom default values for specific missing keys.

In [8]:
data = {"name": "Bob"}
expected = ["name", "age", "country"]

# Fill missing keys with custom values
result = fuzzy_validate_mapping(
    data,
    expected,
    handle_unmatched="force",
    fill_value="unknown",  # Default for all missing
    fill_mapping={"age": 0, "country": "US"},  # Custom values for specific keys
)

print(f"Result: {result}")
print(f"age filled with custom value: {result['age']}")
print(f"country filled with custom value: {result['country']}")

Result: {'name': 'Bob', 'country': 'US', 'age': 0}
age filled with custom value: 0
country filled with custom value: US
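
In the example above both missing keys appear in `fill_mapping`, so `fill_value` is never used. The sketch below adds a third missing key to show the fallback, assuming per-key values take precedence over the global default, which is what the output above suggests. `fill_missing` and the extra `city` key are hypothetical.

```python
def fill_missing(data: dict, expected: list[str], fill_value=None, fill_mapping=None) -> dict:
    """Fill absent expected keys, preferring a per-key value over the global default."""
    fill_mapping = fill_mapping or {}
    filled = dict(data)
    for key in expected:
        if key not in filled:
            filled[key] = fill_mapping.get(key, fill_value)
    return filled


result = fill_missing(
    {"name": "Bob"},
    ["name", "age", "country", "city"],
    fill_value="unknown",
    fill_mapping={"age": 0, "country": "US"},
)
print(result)
```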


## 3. Strict Mode - Enforce All Expected Keys

Strict mode raises an error if any expected keys are missing after fuzzy matching.

In [9]:
incomplete_data = {"name": "Charlie"}
expected = ["name", "age", "email"]

# Lenient mode (strict=False) - no error
result_lenient = fuzzy_validate_mapping(
    incomplete_data, expected, strict=False, handle_unmatched="remove"
)
print(f"Lenient mode succeeded: {result_lenient}")

# Strict mode - raises error
try:
    fuzzy_validate_mapping(incomplete_data, expected, strict=True, handle_unmatched="remove")
except ValueError as e:
    print(f"✓ Strict mode raised error: {e}")

Lenient mode succeeded: {'name': 'Charlie'}
✓ Strict mode raised error: Missing required keys: {'email', 'age'}
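
Conceptually, strict mode is a completeness check that runs after key matching. A minimal sketch, assuming the check compares the matched result against the expected key list (`check_strict` is a hypothetical helper, not a library function):

```python
def check_strict(result: dict, expected: list[str], strict: bool) -> dict:
    """Raise if strict is set and any expected key is absent from the result."""
    missing = [key for key in expected if key not in result]
    if strict and missing:
        raise ValueError(f"Missing required keys: {sorted(missing)}")
    return result


print(check_strict({"name": "Charlie"}, ["name", "age", "email"], strict=False))
```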


## 4. String Similarity Algorithms

Different algorithms for different use cases:
- `jaro_winkler` (default) - Good for typos, favors prefix matches
- `levenshtein` - Edit distance, good for general typos
- `hamming` - Good for same-length strings
- Custom function - Provide your own `(str, str) -> float` function

In [10]:
data = {"usr_name": "alice"}  # Typo: usr_name vs user_name
expected = ["user_name"]

# Try different algorithms
for algo in ["jaro_winkler", "levenshtein"]:
    result = fuzzy_validate_mapping(
        data, expected, fuzzy_match=True, similarity_algo=algo, similarity_threshold=0.75
    )
    print(f"{algo}: {list(result.keys())}")

jaro_winkler: ['user_name']
levenshtein: ['user_name']
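
To see why `levenshtein` handles this typo while `hamming` is only suited to same-length strings, here is a self-contained sketch of both measures normalized to the 0.0-1.0 range. The normalization (dividing by the longer length, and returning 0.0 for unequal lengths in the Hamming case) is an assumption of this sketch and may differ from what the library does.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def hamming_similarity(a: str, b: str) -> float:
    # Assumption: strings of different length are treated as a non-match.
    if len(a) != len(b):
        return 0.0
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


print(levenshtein_similarity("usr_name", "user_name"))  # one insertion over 9 characters
print(hamming_similarity("usr_name", "user_name"))      # lengths differ
```

A single missing character costs one edit under Levenshtein, but shifts every later position under Hamming, which is why Hamming suits fixed-width codes rather than typo-prone field names.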


## 5. Similarity Threshold - Control Matching Strictness

Adjust threshold (0.0-1.0) to control how strict fuzzy matching is:
- `0.95+` - Very strict, only exact or near-exact matches
- `0.85` (default) - Balanced, catches common typos
- `0.70` - Lenient, matches more variations
- `0.50` - Very lenient, may match unrelated keys

In [11]:
data = {"fullname": "alice"}  # fullname vs full_name
expected = ["full_name"]

# Strict threshold - this pair is similar enough to match even at 0.95
result_strict = fuzzy_validate_mapping(
    data, expected, fuzzy_match=True, similarity_threshold=0.95, handle_unmatched="remove"
)
print(f"Threshold 0.95 (strict): {list(result_strict.keys())}")

# Balanced threshold - matches
result_balanced = fuzzy_validate_mapping(
    data, expected, fuzzy_match=True, similarity_threshold=0.85, handle_unmatched="remove"
)
print(f"Threshold 0.85 (balanced): {list(result_balanced.keys())}")

# Lenient threshold - matches
result_lenient = fuzzy_validate_mapping(
    data, expected, fuzzy_match=True, similarity_threshold=0.70, handle_unmatched="remove"
)
print(f"Threshold 0.70 (lenient): {list(result_lenient.keys())}")

Threshold 0.95 (strict): ['full_name']
Threshold 0.85 (balanced): ['full_name']
Threshold 0.70 (lenient): ['full_name']
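
All three thresholds match here because `fullname` and `full_name` score very high under Jaro-Winkler. The sketch below is a textbook implementation (prefix scale 0.1, prefix capped at 4 characters) written for illustration; the library's own implementation may differ in details such as case handling.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    flags1 = [False] * len1
    flags2 = [False] * len2
    matches = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not flags2[j] and s2[j] == char:
                flags1[i] = flags2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    matched1 = [c for c, flag in zip(s1, flags1) if flag]
    matched2 = [c for c, flag in zip(s2, flags2) if flag]
    transpositions = sum(a != b for a, b in zip(matched1, matched2)) // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    # Boost the score for strings that share a common prefix.
    return base + prefix * prefix_scale * (1 - base)


score = jaro_winkler("fullname", "full_name")
for threshold in (0.95, 0.85, 0.70):
    print(threshold, score >= threshold)
```

Under this implementation the pair scores roughly 0.978, so only a threshold above that value would reject it.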


## 6. Real-World Example: LLM Output Validation

Complete example: LLM returns JSON with typos, extra fields, and markdown wrapping.

In [12]:
# Define expected structure
class TaskOutput(BaseModel):
    task_id: str = Field(description="Unique task identifier")
    status: str = Field(description="Task status")
    priority: int = Field(description="Priority level 1-5")
    assignee: str = Field(description="Assigned user")


# Messy LLM output
llm_response = """
I've analyzed the task and here's the structured output:

```json
{
    "taskId": "TASK-123",
    "Status": "in_progress",
    "Priority": 3,
    "assignee_name": "alice@example.com",
    "extra_metadata": "This field wasn't requested",
    "comments": ["Internal LLM notes"]
}
```

The task looks ready to proceed!
"""

# Validate with fuzzy parsing and matching
task = fuzzy_validate_pydantic(
    llm_response,
    TaskOutput,
    fuzzy_parse=True,  # Extract JSON from markdown
    fuzzy_match=True,  # Match camelCase/snake_case variations
    fuzzy_match_params={
        "similarity_threshold": 0.80,  # Lenient matching
        "handle_unmatched": "remove",  # Remove extra fields
    },
)

print("✓ Successfully validated messy LLM output:")
print(f"  Task ID: {task.task_id}")
print(f"  Status: {task.status}")
print(f"  Priority: {task.priority}")
print(f"  Assignee: {task.assignee}")
print(f"  Extra fields removed: {not hasattr(task, 'extra_metadata')}")

✓ Successfully validated messy LLM output:
  Task ID: TASK-123
  Status: in_progress
  Priority: 3
  Assignee: alice@example.com
  Extra fields removed: True


## 7. Conversion from Various Formats

`fuzzy_validate_mapping()` can convert from strings, objects, and other formats using `to_dict()`.

In [13]:
# JSON string input
json_string = '{"name": "Eve", "age": 28}'
result = fuzzy_validate_mapping(json_string, ["name", "age"])
print(f"From JSON string: {result}")

# Pydantic model input
person_model = Person(name="Frank", age=45, email="frank@example.com")
result = fuzzy_validate_mapping(person_model, ["name", "email"], handle_unmatched="remove")
print(f"From Pydantic model: {result}")

From JSON string: {'name': 'Eve', 'age': 28}
From Pydantic model: {'name': 'Frank', 'email': 'frank@example.com'}
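
The conversion step can be pictured as a small dispatch on input type. The sketch below is an assumption about the general shape of such a step, not the library's `to_dict()`; `coerce_to_dict` and the `FakeModel` stand-in are hypothetical. It relies only on the fact that Pydantic v2 models expose `model_dump()`.

```python
import json
from collections.abc import Mapping


def coerce_to_dict(obj) -> dict:
    """Hypothetical helper: turn a mapping, JSON string, or model-like object into a dict."""
    if isinstance(obj, Mapping):
        return dict(obj)
    if isinstance(obj, str):
        parsed = json.loads(obj)
        if not isinstance(parsed, dict):
            raise ValueError("JSON input must decode to an object")
        return parsed
    if hasattr(obj, "model_dump"):
        # Duck-typed: works for Pydantic v2 models without importing pydantic here.
        return obj.model_dump()
    raise TypeError(f"Cannot convert {type(obj).__name__} to dict")


class FakeModel:
    """Stand-in for a Pydantic model so the sketch runs without pydantic installed."""

    def model_dump(self) -> dict:
        return {"name": "Frank", "age": 45}


print(coerce_to_dict('{"name": "Eve", "age": 28}'))
print(coerce_to_dict(FakeModel()))
```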


## 8. Error Handling and Suppression

Control how conversion errors are handled.

In [14]:
# Invalid input - raises by default
try:
    fuzzy_validate_mapping("not valid json at all", ["key"], suppress_conversion_errors=False)
except ValueError as e:
    print(f"Default behavior - raises: {type(e).__name__}")

# Suppress errors - conversion falls back to an empty dict, which "fill" then populates
result = fuzzy_validate_mapping(
    "not valid json at all",
    ["key"],
    suppress_conversion_errors=True,
    handle_unmatched="fill",
    fill_value=None,
)
print(f"With suppression: {result}")

With suppression: {'key': None}
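
The suppression pattern amounts to catching the conversion failure and continuing with an empty dict, so the fill step still produces every expected key. A minimal sketch with the standard library follows; `to_dict_or_empty` is a hypothetical helper. Note that `json.JSONDecodeError` is a subclass of `ValueError`, which is why catching `ValueError` works for JSON parse failures.

```python
import json


def to_dict_or_empty(text: str, suppress: bool = False) -> dict:
    """Parse JSON text; on failure either re-raise or fall back to an empty dict."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        if suppress:
            return {}
        raise


expected = ["key"]
converted = to_dict_or_empty("not valid json at all", suppress=True)
# Fill every expected key that the (empty) conversion result lacks.
result = {key: converted.get(key) for key in expected}
print(result)
```

Suppression trades visibility for resilience: the caller gets a well-shaped result but can no longer tell a parse failure from genuinely empty input, so it is worth logging the failure before falling back.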


## Summary Checklist

**Fuzzy Validation Features:**
- ✅ Fuzzy JSON extraction from markdown/mixed text
- ✅ Fuzzy key matching with configurable similarity algorithms
- ✅ Pydantic integration with `fuzzy_validate_pydantic()`
- ✅ Dict validation with `fuzzy_validate_mapping()`
- ✅ Multiple handle_unmatched strategies (ignore/remove/raise/fill/force)
- ✅ Strict mode for enforcing required keys
- ✅ Custom fill values and mappings
- ✅ Configurable similarity threshold (0.0-1.0)
- ✅ Reusable configuration with `FuzzyMatchKeysParams`
- ✅ Error suppression for graceful degradation

**When to Use:**
- LLM output validation (unpredictable format, typos, extra commentary)
- API integration (varying field names, optional fields)
- User input processing (typos, missing fields)
- Data migration (schema evolution, field renaming)

**Best Practices:**
- Start with `fuzzy_parse=True, fuzzy_match=True` for LLM outputs
- Use `similarity_threshold=0.85` (default) for balanced matching
- Use `handle_unmatched="remove"` to strip unexpected fields
- Use `strict=True` when all fields are required
- Create reusable `FuzzyMatchKeysParams` for consistent behavior

**Next Steps:**
- See `element.ipynb` for base serialization patterns
- See `node.ipynb` for content validation
- See `flow.ipynb` for workflow validation