# Fuzzy Match Keys - Intelligent Dictionary Key Validation

Fuzzy matching enables robust dictionary validation by correcting typos and variations in key names using string similarity algorithms.

**Core Features:**
- **Multiple Algorithms**: Jaro-Winkler, Levenshtein, Sequence Matcher, Hamming, Cosine
- **Threshold Control**: Configurable similarity threshold (0.0-1.0)
- **Case-Insensitive**: Matching ignores case differences between keys
- **Flexible Handling**: Multiple modes for unmatched keys (ignore, raise, remove, fill, force)
- **Type-Safe Params**: Reusable configuration via `FuzzyMatchKeysParams`

In [1]:
from lionherd_core.ln import FuzzyMatchKeysParams, fuzzy_match_keys

## 1. Basic Fuzzy Matching

Correct common typos and variations using default settings (Jaro-Winkler, 0.85 threshold).

In [2]:
# Define expected schema
expected_keys = ["name", "age", "email", "address"]

# Input with typos
user_data = {
    "nme": "Alice",  # typo: 'nme' → 'name'
    "age": 30,
    "emal": "alice@example.com",  # typo: 'emal' → 'email'
    "addres": "123 Main St",  # typo: 'addres' → 'address'
}

# Fuzzy match corrects keys
corrected = fuzzy_match_keys(user_data, expected_keys)
print("Original keys:", list(user_data.keys()))
print("Corrected keys:", list(corrected.keys()))
print("\nCorrected data:", corrected)

Original keys: ['nme', 'age', 'emal', 'addres']
Corrected keys: ['age', 'address', 'name', 'email']

Corrected data: {'age': 30, 'address': '123 Main St', 'name': 'Alice', 'email': 'alice@example.com'}


## 2. Similarity Algorithms

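Different algorithms have different strengths: Jaro-Winkler rewards a shared prefix, Levenshtein counts single-character insertions, deletions, and substitutions, and Hamming compares equal-length strings position by position. The cell below compares four of them on the same input.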

In [3]:
# Test data with variation
data = {"usr_name": "Bob", "usr_age": 25}
expected = ["user_name", "user_age"]

# Compare algorithms
algorithms = ["jaro_winkler", "levenshtein", "sequence_matcher", "cosine"]

print("Algorithm Comparison (threshold=0.7):\n")
for algo in algorithms:
    result = fuzzy_match_keys(data, expected, similarity_algo=algo, similarity_threshold=0.7)
    print(f"{algo:20s}: {list(result.keys())}")

Algorithm Comparison (threshold=0.7):

jaro_winkler        : ['user_name', 'user_age']
levenshtein         : ['user_name', 'user_age']
sequence_matcher    : ['user_name', 'user_age']
cosine              : ['user_name', 'user_age']
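
To get a feel for why the algorithms agree on this input, the scores can be approximated with the standard library alone. This is a minimal sketch, not the scoring code used by `lionherd_core`: `sequence_ratio` and `levenshtein_ratio` are hypothetical helper names, and the Levenshtein normalization (one minus distance over the longer length) is an assumption that may differ from the library's.

```python
from difflib import SequenceMatcher


def sequence_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] based on difflib's matching blocks."""
    return SequenceMatcher(None, a, b).ratio()


def levenshtein_ratio(a: str, b: str) -> float:
    """1 - (edit distance / longer length); one common normalization (assumed)."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            # deletion, insertion, substitution
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))


for key, target in [("usr_name", "user_name"), ("usr_age", "user_age")]:
    print(
        f"{key:10s} -> {target:10s} "
        f"sequence={sequence_ratio(key, target):.3f} "
        f"levenshtein={levenshtein_ratio(key, target):.3f}"
    )
```

Both pairs differ by a single missing character, so each score lands well above the 0.7 threshold used in the comparison.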


## 3. Threshold Control

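The threshold sets how similar two strings must be to count as a match (0.0 accepts any candidate, 1.0 accepts exact matches only). In the example below, `username` is present exactly, so it claims the match at every threshold; `usernme` and `xyz` are left unmatched and dropped by `handle_unmatched="remove"`.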

In [4]:
# Data with varying typo severity
data = {
    "username": "Alice",  # exact match
    "usernme": "Bob",  # minor typo
    "xyz": "Charlie",  # completely different
}
expected = ["username"]

# Test different thresholds
thresholds = [0.5, 0.7, 0.85, 0.95, 1.0]

print("Threshold Impact:\n")
for threshold in thresholds:
    result = fuzzy_match_keys(
        data, expected, similarity_threshold=threshold, handle_unmatched="remove"
    )
    print(f"threshold={threshold:.2f}: matched keys = {list(result.keys())} (count={len(result)})")

Threshold Impact:

threshold=0.50: matched keys = ['username'] (count=1)
threshold=0.70: matched keys = ['username'] (count=1)
threshold=0.85: matched keys = ['username'] (count=1)
threshold=0.95: matched keys = ['username'] (count=1)
threshold=1.00: matched keys = ['username'] (count=1)
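
To see the threshold change an outcome, it helps to score a typo against the schema when no exact key is present. The sketch below is an illustration only: `best_match` is a hypothetical helper, and it scores with `difflib.SequenceMatcher` rather than the default Jaro-Winkler, so the exact cut-off points will differ from the library's.

```python
from difflib import SequenceMatcher


def best_match(key, candidates, threshold):
    """Return the most similar candidate, or None if none reaches the threshold."""
    scored = [(SequenceMatcher(None, key, c).ratio(), c) for c in candidates]
    score, candidate = max(scored, default=(0.0, None))
    return candidate if candidate is not None and score >= threshold else None


schema = ["username"]
for threshold in (0.5, 0.85, 0.95):
    print(
        f"threshold={threshold:.2f}: "
        f"usernme -> {best_match('usernme', schema, threshold)}, "
        f"xyz -> {best_match('xyz', schema, threshold)}"
    )
```

With this scorer, the minor typo is accepted at moderate thresholds and rejected once the threshold approaches 1.0, while the unrelated key never matches.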


## 4. Handle Unmatched Modes

Control behavior for keys that don't match expected schema:
- **ignore**: Keep unmatched keys as-is (default)
- **raise**: Raise `ValueError` if any unmatched keys are found
- **remove**: Drop unmatched keys from the output
- **fill**: Keep unmatched keys and add missing expected keys with `fill_value`
- **force**: Drop unmatched keys and add missing expected keys with `fill_value`

In [5]:
# Data with extra and missing keys
data = {"name": "Alice", "extra_field": "value"}
expected = ["name", "age", "email"]

# Mode: ignore (default) - keeps everything
result_ignore = fuzzy_match_keys(data, expected, handle_unmatched="ignore")
print("ignore mode:", list(result_ignore.keys()))
print("  → Keeps 'extra_field', doesn't add 'age'/'email'\n")

# Mode: remove - drops unmatched
result_remove = fuzzy_match_keys(data, expected, handle_unmatched="remove")
print("remove mode:", list(result_remove.keys()))
print("  → Drops 'extra_field', doesn't add 'age'/'email'\n")

# Mode: fill - keeps unmatched + fills missing
result_fill = fuzzy_match_keys(data, expected, handle_unmatched="fill", fill_value=None)
print("fill mode:", list(result_fill.keys()))
print("  → Keeps 'extra_field', adds 'age'/'email' with None\n")

# Mode: force - drops unmatched + fills missing
result_force = fuzzy_match_keys(data, expected, handle_unmatched="force", fill_value="<missing>")
print("force mode:", list(result_force.keys()))
print("  → Drops 'extra_field', adds 'age'/'email' with '<missing>'")

ignore mode: ['name', 'extra_field']
  → Keeps 'extra_field', doesn't add 'age'/'email'

remove mode: ['name']
  → Drops 'extra_field', doesn't add 'age'/'email'

fill mode: ['name', 'age', 'email', 'extra_field']
  → Keeps 'extra_field', adds 'age'/'email' with None

force mode: ['name', 'age', 'email']
  → Drops 'extra_field', adds 'age'/'email' with '<missing>'
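
The five modes can be summarized as two independent decisions: whether unmatched keys are kept, and whether missing keys are filled. The following is a simplified sketch that assumes exact matching only; `apply_unmatched_mode` is a hypothetical name, not part of `lionherd_core`.

```python
MODES = ("ignore", "raise", "remove", "fill", "force")


def apply_unmatched_mode(data, expected, mode="ignore", fill_value=None):
    """Exact-match-only sketch of the five unmatched-key modes."""
    if mode not in MODES:
        raise ValueError(f"Unknown mode: {mode!r}")

    matched = {k: v for k, v in data.items() if k in expected}
    unmatched = {k: v for k, v in data.items() if k not in expected}
    missing = [k for k in expected if k not in data]

    if mode == "raise" and unmatched:
        raise ValueError(f"Unmatched keys found: {set(unmatched)}")

    result = dict(matched)
    if mode in ("fill", "force"):  # add missing expected keys
        result.update({k: fill_value for k in missing})
    if mode in ("ignore", "fill"):  # keep unexpected keys
        result.update(unmatched)
    return result


data = {"name": "Alice", "extra_field": "value"}
expected = ["name", "age", "email"]
for mode in ("ignore", "remove", "fill", "force"):
    print(f"{mode:7s}", list(apply_unmatched_mode(data, expected, mode)))
```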


## 5. Custom Fill Values

Provide default values for missing keys with `fill_value` and `fill_mapping`.

In [6]:
# Partial data
data = {"name": "Bob"}
expected = ["name", "age", "email", "active"]

# Global fill_value for all missing keys
result_global = fuzzy_match_keys(data, expected, handle_unmatched="force", fill_value="N/A")
print("Global fill_value='N/A':")
print(result_global)
print()

# Custom values per key via fill_mapping
result_custom = fuzzy_match_keys(
    data,
    expected,
    handle_unmatched="force",
    fill_value="<default>",  # fallback for unmapped keys
    fill_mapping={"age": 0, "email": "unknown@example.com", "active": True},
)
print("Custom fill_mapping with fallback:")
print(result_custom)

Global fill_value='N/A':
{'name': 'Bob', 'age': 'N/A', 'email': 'N/A', 'active': 'N/A'}

Custom fill_mapping with fallback:
{'name': 'Bob', 'age': 0, 'email': 'unknown@example.com', 'active': True}
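
The interplay between `fill_mapping` and `fill_value` amounts to a per-key lookup with a global fallback. A minimal sketch of that idea, assuming the keys have already been corrected (`fill_missing` is a hypothetical helper, not the library's code):

```python
def fill_missing(data, expected, fill_value=None, fill_mapping=None):
    """Add each missing expected key, preferring a per-key default."""
    fill_mapping = fill_mapping or {}
    result = dict(data)  # leave the caller's dict untouched
    for key in expected:
        if key not in result:
            # Per-key default if one is given, otherwise the global fallback
            result[key] = fill_mapping.get(key, fill_value)
    return result


partial = fill_missing(
    {"name": "Bob"},
    ["name", "age", "email", "active"],
    fill_value="<default>",
    fill_mapping={"age": 0, "active": True},
)
print(partial)
```

Here `email` has no entry in `fill_mapping`, so it falls back to `fill_value`.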


## 6. Strict Mode

Require every expected key to be present in the input, either exactly or via a fuzzy match; otherwise a `ValueError` is raised.

In [7]:
# Complete data - strict passes
complete_data = {"name": "Alice", "age": 30, "email": "alice@example.com"}
expected = ["name", "age", "email"]

result = fuzzy_match_keys(complete_data, expected, strict=True)
print("✓ Strict mode passed with complete data")
print(result)
print()

# Incomplete data - strict raises
incomplete_data = {"name": "Bob"}
try:
    fuzzy_match_keys(incomplete_data, expected, strict=True)
except ValueError as e:
    print(f"✗ Strict mode raised: {e}")

✓ Strict mode passed with complete data
{'name': 'Alice', 'age': 30, 'email': 'alice@example.com'}

✗ Strict mode raised: Missing required keys: {'age', 'email'}
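
Conceptually, the strict check is a set difference between the expected keys and the keys available after matching. A rough sketch under that assumption (`check_required` is a hypothetical helper and presumes the keys have already been corrected):

```python
def check_required(data, expected):
    """Raise ValueError if any expected key is absent; otherwise return data."""
    missing = set(expected) - set(data)
    if missing:
        raise ValueError(f"Missing required keys: {missing}")
    return data


expected = ["name", "age", "email"]
check_required({"name": "Alice", "age": 30, "email": "alice@example.com"}, expected)

try:
    check_required({"name": "Bob"}, expected)
except ValueError as e:
    print(f"raised: {e}")
```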


## 7. Raise on Unmatched

Detect unexpected keys in input data to enforce schema compliance.

In [8]:
# Expected schema
expected = ["name", "age", "email"]

# Valid data - no raise
valid_data = {"name": "Alice", "age": 30, "emal": "alice@example.com"}  # typo matches
result = fuzzy_match_keys(valid_data, expected, handle_unmatched="raise")
print("✓ Valid data (with fuzzy match):")
print(result)
print()

# Invalid data - has truly unmatched key
invalid_data = {"name": "Bob", "age": 25, "xyz": "unmatched"}
try:
    fuzzy_match_keys(invalid_data, expected, handle_unmatched="raise")
except ValueError as e:
    print(f"✗ Invalid data raised: {e}")

✓ Valid data (with fuzzy match):
{'name': 'Alice', 'age': 30, 'email': 'alice@example.com'}

✗ Invalid data raised: Unmatched keys found: {'xyz'}


## 8. Disable Fuzzy Matching

Use exact matching only by setting `fuzzy_match=False`.

In [9]:
data = {"name": "Alice", "nme": "Bob", "age": 30}
expected = ["name", "age"]

# With fuzzy matching (default)
fuzzy_result = fuzzy_match_keys(data, expected, fuzzy_match=True)
print("Fuzzy matching enabled:")
print(f"  Keys: {list(fuzzy_result.keys())}")
print(f"  'name' (exact match wins over 'nme'): {fuzzy_result['name']}")
print()

# Without fuzzy matching
exact_result = fuzzy_match_keys(data, expected, fuzzy_match=False)
print("Fuzzy matching disabled (exact only):")
print(f"  Keys: {list(exact_result.keys())}")
print("  'nme' kept as-is (not matched)")

Fuzzy matching enabled:
  Keys: ['name', 'age', 'nme']
  'name' (exact match wins over 'nme'): Alice

Fuzzy matching disabled (exact only):
  Keys: ['name', 'age', 'nme']
  'nme' kept as-is (not matched)


## 9. FuzzyMatchKeysParams - Reusable Configuration

Create reusable matchers with predefined settings using the `FuzzyMatchKeysParams` dataclass.

In [10]:
# Create a strict matcher for API validation
api_validator = FuzzyMatchKeysParams(
    similarity_algo="jaro_winkler",
    similarity_threshold=0.9,  # high threshold
    handle_unmatched="raise",  # reject unknown fields
    strict=True,  # require all fields
)

# Create a lenient matcher for user input
user_input_matcher = FuzzyMatchKeysParams(
    similarity_algo="levenshtein",
    similarity_threshold=0.7,  # more forgiving
    handle_unmatched="force",  # normalize to schema
    fill_value=None,
)

# Use the validators
user_data = {"usrname": "Alice", "age": 30}
expected = ["username", "age", "email"]

# Lenient matcher accepts typos and fills missing
result = user_input_matcher(user_data, expected)
print("Lenient matcher result:")
print(result)
print()

# Strict matcher would reject (uncomment to test)
# try:
#     api_validator(user_data, expected)
# except ValueError as e:
#     print(f"Strict matcher rejected: {e}")

Lenient matcher result:
{'age': 30, 'username': 'Alice', 'email': None}



## 10. Real-World Example: API Response Normalization

Handle variations in third-party API responses.

In [11]:
# Expected schema for user profiles
user_schema = ["id", "username", "email", "created_at", "is_active"]

# Simulated API responses with variations
api_responses = [
    # Response 1: exact match
    {
        "id": 1,
        "username": "alice",
        "email": "alice@example.com",
        "created_at": "2025-01-01",
        "is_active": True,
    },
    # Response 2: naming-convention variations (camelCase, extra underscore)
    {
        "id": 2,
        "user_name": "bob",
        "email": "bob@example.com",
        "createdAt": "2025-01-02",
        "isActive": True,
    },
    # Response 3: typos and missing fields
    {"id": 3, "usrname": "charlie", "emal": "charlie@example.com"},
]

# Normalize all responses
normalizer = FuzzyMatchKeysParams(
    similarity_threshold=0.75,
    handle_unmatched="force",
    fill_value=None,
)

print("Normalized API Responses:\n")
for i, response in enumerate(api_responses, 1):
    normalized = normalizer(response, user_schema)
    print(f"Response {i}:")
    print(f"  Original keys: {list(response.keys())}")
    print(f"  Normalized: {normalized}")
    print()

Normalized API Responses:

Response 1:
  Original keys: ['id', 'username', 'email', 'created_at', 'is_active']
  Normalized: {'id': 1, 'username': 'alice', 'email': 'alice@example.com', 'created_at': '2025-01-01', 'is_active': True}

Response 2:
  Original keys: ['id', 'user_name', 'email', 'createdAt', 'isActive']
  Normalized: {'id': 2, 'email': 'bob@example.com', 'username': 'bob', 'created_at': '2025-01-02', 'is_active': True}

Response 3:
  Original keys: ['id', 'usrname', 'emal']
  Normalized: {'id': 3, 'username': 'charlie', 'email': 'charlie@example.com', 'created_at': None, 'is_active': None}



## 11. Performance: Exact Match First

The function uses a two-pass strategy for efficiency:
1. First pass: exact matches (fast)
2. Second pass: fuzzy matching only for remaining keys

This ensures exact matches are never overridden by fuzzy matches.

In [12]:
# Data with both exact and fuzzy candidates
data = {
    "name": "Exact Match",  # exact
    "nme": "Fuzzy Match",  # similar to 'name' but exact match takes precedence
    "age": 30,  # exact
}
expected = ["name", "age"]

result = fuzzy_match_keys(data, expected)
print("Result:", result)
print("\n✓ 'name' uses exact match value ('Exact Match'), not fuzzy match from 'nme'")
print("✓ 'nme' kept as extra key since 'name' already matched")

Result: {'name': 'Exact Match', 'age': 30, 'nme': 'Fuzzy Match'}

✓ 'name' uses exact match value ('Exact Match'), not fuzzy match from 'nme'
✓ 'nme' kept as extra key since 'name' already matched
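
The two-pass strategy described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `two_pass_match` is a hypothetical name, scoring uses `difflib.SequenceMatcher` instead of Jaro-Winkler, and unmatched keys are simply kept (the `ignore` behaviour).

```python
from difflib import SequenceMatcher


def two_pass_match(data, expected, threshold=0.85):
    """Exact matches first, then fuzzy matches for whatever is left."""
    result = {}
    unclaimed = list(expected)  # expected keys not yet matched
    leftovers = {}

    # Pass 1: exact matches claim their expected key immediately.
    for key, value in data.items():
        if key in unclaimed:
            result[key] = value
            unclaimed.remove(key)
        else:
            leftovers[key] = value

    # Pass 2: fuzzy-match the leftovers against still-unclaimed expected keys.
    for key, value in leftovers.items():
        scored = [(SequenceMatcher(None, key.lower(), e.lower()).ratio(), e) for e in unclaimed]
        score, target = max(scored, default=(0.0, None))
        if target is not None and score >= threshold:
            result[target] = value
            unclaimed.remove(target)
        else:
            result[key] = value  # no acceptable match: keep the key as-is
    return result


print(two_pass_match({"name": "Exact Match", "nme": "Fuzzy Match", "age": 30}, ["name", "age"]))
print(two_pass_match({"nme": "Alice", "age": 30}, ["name", "age"]))
```

Because an expected key is removed from the candidate pool as soon as it is claimed, a fuzzy candidate such as `nme` can never displace an exact `name`.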


## Summary Checklist

**Fuzzy Match Keys Essentials:**
- ✅ Multiple similarity algorithms (Jaro-Winkler, Levenshtein, etc.)
- ✅ Configurable similarity threshold (0.0-1.0)
- ✅ Case-insensitive matching (via underlying algorithms)
- ✅ Five unmatched handling modes (ignore, raise, remove, fill, force)
- ✅ Custom fill values per key via `fill_mapping`
- ✅ Strict mode for required keys validation
- ✅ Two-pass matching (exact first, then fuzzy)
- ✅ Reusable configuration via `FuzzyMatchKeysParams`
- ✅ Type-safe with full validation

**Use Cases:**
- API response normalization
- User input validation
- Config file processing
- Data migration and ETL
- Schema enforcement

**Next Steps:**
- See `string_similarity` for underlying algorithms
- See `Params` for parameter object pattern
- See Pydantic integration for schema validation