# String Similarity - Fuzzy Matching Algorithms

String similarity algorithms enable fuzzy matching: ranking candidate strings by how closely they resemble a query, even when none matches exactly.

**Core Features:**
- **Five Algorithms**: Jaro-Winkler, Levenshtein, SequenceMatcher, Hamming, Cosine
- **Flexible Matching**: Threshold filtering, case sensitivity, most-similar selection
- **Custom Functions**: Bring your own similarity algorithm
- **Practical Use Cases**: Typo correction, duplicate detection, search suggestions

In [1]:
from lionherd_core.libs.string_handlers import (
    SimilarityAlgo,
    cosine_similarity,
    hamming_similarity,
    jaro_winkler_similarity,
    levenshtein_similarity,
    sequence_matcher_similarity,
    string_similarity,
)

## 1. Basic Usage - Finding Similar Strings

In [2]:
# Typo correction example
misspelled = "recieve"
dictionary = ["receive", "retrieve", "believe", "relieve", "achieve"]

# Default: Jaro-Winkler algorithm
matches = string_similarity(misspelled, dictionary)
print(f"Matches for '{misspelled}': {matches}")

# Get only the best match
best = string_similarity(misspelled, dictionary, return_most_similar=True)
print(f"Best match: {best}")

Matches for 'recieve': ['receive', 'relieve', 'retrieve', 'believe', 'achieve']
Best match: receive


## 2. Understanding Each Algorithm

Each algorithm measures a different notion of closeness, so the same pair of strings can score quite differently:

In [3]:
# Compare two strings with all algorithms
s1 = "kitten"
s2 = "sitting"

print(f"Comparing '{s1}' vs '{s2}':\n")

# Jaro-Winkler: Good for short strings, favors prefix matches
jw_score = jaro_winkler_similarity(s1, s2)
print(f"Jaro-Winkler:    {jw_score:.4f}  (favors prefix matches)")

# Levenshtein: Edit distance (insertions/deletions/substitutions)
lev_score = levenshtein_similarity(s1, s2)
print(f"Levenshtein:     {lev_score:.4f}  (minimum edit operations)")

# SequenceMatcher: Python's difflib, good for general text
seq_score = sequence_matcher_similarity(s1, s2)
print(f"SequenceMatcher: {seq_score:.4f}  (contiguous matching blocks)")

# Cosine: Character set overlap (order-independent)
cos_score = cosine_similarity(s1, s2)
print(f"Cosine:          {cos_score:.4f}  (character set overlap)")

Comparing 'kitten' vs 'sitting':

Jaro-Winkler:    0.7460  (favors prefix matches)
Levenshtein:     0.5714  (minimum edit operations)
SequenceMatcher: 0.6154  (contiguous matching blocks)
Cosine:          0.6000  (character set overlap)
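
To make three of these scores less opaque, here is a minimal stdlib-only sketch that reproduces them. It is not the `lionherd_core` implementation. The function names are hypothetical, and the normalizations (edit distance divided by the longer length, cosine over sets of distinct characters) are assumptions chosen because they agree with the numbers printed above.

```python
import math
from difflib import SequenceMatcher


def levenshtein_sim(a: str, b: str) -> float:
    """1 - edit_distance / max_len (this normalization is an assumption)."""
    if not a and not b:
        return 1.0
    # Classic dynamic programming, keeping only the previous row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def charset_cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over sets of distinct characters (order is ignored)."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))


def sequence_matcher_sim(a: str, b: str) -> float:
    """difflib ratio: 2 * matched_chars / total_chars."""
    return SequenceMatcher(None, a, b).ratio()


print(f"{levenshtein_sim('kitten', 'sitting'):.4f}")       # 0.5714 (3 edits / 7)
print(f"{charset_cosine_sim('kitten', 'sitting'):.4f}")    # 0.6000 (3 shared / sqrt(5 * 5))
print(f"{sequence_matcher_sim('kitten', 'sitting'):.4f}")  # 0.6154 (2 * 4 / 13)
```

Because the cosine variant looks only at which characters occur, anagrams score 1.0 under it, which is worth keeping in mind before using it for typo correction.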


In [4]:
# Hamming: Only works for equal-length strings (position-by-position comparison)
equal_length_1 = "karolin"
equal_length_2 = "kathrin"

ham_score = hamming_similarity(equal_length_1, equal_length_2)
print(f"Hamming similarity for '{equal_length_1}' vs '{equal_length_2}': {ham_score:.4f}")
print("(4 of 7 positions match: k, a, i, n; the middle three differ: r/t, o/h, l/r)\n")

# Hamming returns 0.0 for different lengths
different_lengths = hamming_similarity("abc", "abcd")
print(f"Hamming for different lengths: {different_lengths}  (returns 0.0)")

Hamming similarity for 'karolin' vs 'kathrin': 0.5714
(4 of 7 positions match: k, a, i, n; the middle three differ: r/t, o/h, l/r)

Hamming for different lengths: 0.0  (returns 0.0)
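
The position-by-position comparison can be sketched in a few lines. This is an illustrative stand-in, not the library's code: `hamming_sim` is a hypothetical name, and returning 0.0 for two empty strings is an assumption.

```python
def hamming_sim(a: str, b: str) -> float:
    """Fraction of positions holding the same character; 0.0 if lengths differ."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Zero-based positions where the two strings disagree.
mismatches = [(i, x, y) for i, (x, y) in enumerate(zip("karolin", "kathrin")) if x != y]

print(f"{hamming_sim('karolin', 'kathrin'):.4f}")  # 0.5714 (4 / 7)
print(mismatches)                                  # [(2, 'r', 't'), (3, 'o', 'h'), (4, 'l', 'r')]
print(hamming_sim("abc", "abcd"))                  # 0.0
```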


## 3. Algorithm Selection with SimilarityAlgo Enum

In [5]:
# Type-safe algorithm selection
word = "color"
choices = ["colour", "colored", "colors", "colony", "coral"]

print("Different algorithms give different results:\n")

for algo in SimilarityAlgo:
    matches = string_similarity(word, choices, algorithm=algo, threshold=0.6)
    print(f"{algo.value:20s}: {matches}")

Different algorithms give different results:

jaro_winkler        : ['colour', 'colors', 'colored', 'colony', 'coral']
levenshtein         : ['colour', 'colors', 'colored', 'colony']
sequence_matcher    : ['colour', 'colors', 'colored', 'colony', 'coral']
hamming             : None
cosine              : ['colour', 'colors', 'coral', 'colored', 'colony']


## 4. Threshold Filtering

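Use `threshold` (0.0 to 1.0) to drop candidates that score below a minimum similarity. When no candidate qualifies, `string_similarity` returns `None` rather than an empty list: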

In [6]:
word = "python"
candidates = ["python", "pythonic", "jython", "cython", "java", "ruby"]

print(f"Finding matches for '{word}' with different thresholds:\n")

for threshold in [0.5, 0.7, 0.9]:
    matches = string_similarity(word, candidates, threshold=threshold)
    print(f"threshold={threshold}: {matches}")

Finding matches for 'python' with different thresholds:

threshold=0.5: ['python', 'pythonic', 'jython', 'cython']
threshold=0.7: ['python', 'pythonic', 'jython', 'cython']
threshold=0.9: ['python', 'pythonic']
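
Conceptually, threshold filtering is "score every candidate, keep those at or above the cutoff, sort best first". The sketch below shows that pipeline with `difflib` as a stand-in scorer. `rank_matches` is a hypothetical helper, and the inclusive comparison and the index tie-break are assumptions. Because the scorer differs from the Jaro-Winkler default, the cutoffs behave differently: `pythonic` scores about 0.857 here and so drops out at 0.9.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Sequence


def ratio(a: str, b: str) -> float:
    """Stand-in scorer; any (str, str) -> float in [0, 1] would do."""
    return SequenceMatcher(None, a, b).ratio()


def rank_matches(
    word: str,
    candidates: Sequence[str],
    threshold: float = 0.0,
    score_fn: Callable[[str, str], float] = ratio,
    case_sensitive: bool = False,
) -> Optional[List[str]]:
    """Return candidates scoring >= threshold, best first, or None if none qualify."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    query = word if case_sensitive else word.lower()
    scored = []
    for index, candidate in enumerate(candidates):
        target = candidate if case_sensitive else candidate.lower()
        score = score_fn(query, target)
        if score >= threshold:
            scored.append((score, index, candidate))
    # Highest score first; ties fall back to the original candidate order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [candidate for _, _, candidate in scored] or None


candidates = ["python", "pythonic", "jython", "cython", "java", "ruby"]
print(rank_matches("python", candidates, threshold=0.5))  # ['python', 'pythonic', 'jython', 'cython']
print(rank_matches("python", candidates, threshold=0.9))  # ['python']
```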


## 5. Case Sensitivity

In [7]:
word = "Python"
choices = ["python", "PYTHON", "PyThOn", "java"]

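# Case insensitive (default). No threshold is set, so every candidate is returned, best first
matches_insensitive = string_similarity(word, choices, case_sensitive=False)
print(f"Case insensitive: {matches_insensitive}")

# Case sensitive: no candidate equals "Python" exactly, so threshold=0.99 filters them all out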
matches_sensitive = string_similarity(word, choices, case_sensitive=True, threshold=0.99)
print(f"Case sensitive:   {matches_sensitive}")

Case insensitive: ['python', 'PYTHON', 'PyThOn', 'java']
Case sensitive:   None


## 6. Custom Similarity Functions

In [8]:
# Define custom similarity function
def length_based_similarity(s1: str, s2: str) -> float:
    """Similarity based on length difference."""
    if not s1 or not s2:
        return 0.0
    max_len = max(len(s1), len(s2))
    len_diff = abs(len(s1) - len(s2))
    return 1.0 - (len_diff / max_len)


word = "hello"
choices = ["hi", "hola", "hello", "greetings", "hey"]

matches = string_similarity(word, choices, algorithm=length_based_similarity, threshold=0.8)
print(f"Matches by length similarity: {matches}")
print(f"('{word}' is 5 chars, matches have similar length)")

Matches by length similarity: ['hello', 'hola']
('hello' is 5 chars, matches have similar length)
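
Since `algorithm` accepts both an enum member and a plain function, a dispatcher along the following lines is one plausible way to support both. This is a sketch under assumptions, not the library's internals: `Algo`, `_SCORERS`, and `resolve_scorer` are hypothetical names, and only two algorithms are registered to keep it short.

```python
from difflib import SequenceMatcher
from enum import Enum
from typing import Callable, Dict, Union

Scorer = Callable[[str, str], float]


class Algo(str, Enum):
    """Hypothetical stand-in for SimilarityAlgo (two members only)."""

    SEQUENCE_MATCHER = "sequence_matcher"
    HAMMING = "hamming"


def _sequence_matcher(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def _hamming(a: str, b: str) -> float:
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


_SCORERS: Dict[Algo, Scorer] = {
    Algo.SEQUENCE_MATCHER: _sequence_matcher,
    Algo.HAMMING: _hamming,
}


def resolve_scorer(algorithm: Union[Algo, Scorer]) -> Scorer:
    """Map an enum member to its scorer, or pass a custom callable through."""
    if isinstance(algorithm, Algo):
        return _SCORERS[algorithm]
    if callable(algorithm):
        return algorithm
    raise ValueError(f"unsupported algorithm: {algorithm!r}")


def length_based_similarity(s1: str, s2: str) -> float:
    """Same custom function as in the cell above."""
    if not s1 or not s2:
        return 0.0
    return 1.0 - abs(len(s1) - len(s2)) / max(len(s1), len(s2))


print(resolve_scorer(Algo.HAMMING)("karolin", "kathrin"))    # 4 of 7 positions match
print(resolve_scorer(length_based_similarity)("hello", "hola"))  # 0.8
```

Checking for the enum before falling back to `callable()` keeps the two paths unambiguous and lets anything else fail loudly.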


## 7. Practical Example: Command Typo Correction

In [9]:
# Simulate CLI command typo correction
available_commands = [
    "list",
    "get",
    "create",
    "update",
    "delete",
    "search",
    "filter",
    "export",
    "import",
]

typos = ["crate", "delte", "serch", "lst", "updat"]

print("Command typo correction:\n")
for typo in typos:
    suggestion = string_similarity(
        typo, available_commands, algorithm=SimilarityAlgo.JARO_WINKLER, return_most_similar=True
    )
    print(f"'{typo}' → Did you mean '{suggestion}'?")

Command typo correction:

'crate' → Did you mean 'create'?
'delte' → Did you mean 'delete'?
'serch' → Did you mean 'search'?
'lst' → Did you mean 'list'?
'updat' → Did you mean 'update'?
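
One design point for a real CLI: asking only for the most similar command always yields a suggestion, even for input that resembles nothing. A minimum score avoids misleading hints. The sketch below illustrates the idea with the standard library's `difflib.get_close_matches`. `suggest_command` is a hypothetical helper, and the 0.6 cutoff is simply `difflib`'s default, not a value taken from `lionherd_core`.

```python
from difflib import get_close_matches
from typing import Optional, Sequence

COMMANDS = ["list", "get", "create", "update", "delete",
            "search", "filter", "export", "import"]


def suggest_command(
    typo: str,
    commands: Sequence[str] = COMMANDS,
    cutoff: float = 0.6,
) -> Optional[str]:
    """Return the closest command, or None when nothing is similar enough."""
    matches = get_close_matches(typo.lower(), commands, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for typo in ["crate", "delte", "serch", "lst", "updat", "xyzzy"]:
    suggestion = suggest_command(typo)
    if suggestion is None:
        print(f"'{typo}': unknown command")
    else:
        print(f"'{typo}': did you mean '{suggestion}'?")
```

With `string_similarity`, the same effect should be achievable by passing a `threshold` alongside `return_most_similar=True` and handling a `None` result.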


## 8. Practical Example: Duplicate Detection

In [10]:
# Find potential duplicates in a list
records = [
    "John Smith",
    "Jon Smith",
    "Jane Doe",
    "J. Smith",
    "Bob Johnson",
    "Robert Johnson",
]

print("Potential duplicates:\n")
processed = set()

for i, record in enumerate(records):
    if record in processed:
        continue

    # Check against remaining records
    remaining = records[i + 1 :]
    similar = string_similarity(
        record, remaining, algorithm=SimilarityAlgo.LEVENSHTEIN, threshold=0.7
    )

    if similar:
        print(f"{record}")
        for match in similar:
            print(f"  ↳ {match}")
            processed.add(match)
        processed.add(record)

Potential duplicates:

John Smith
  ↳ Jon Smith
  ↳ J. Smith
Bob Johnson
  ↳ Robert Johnson
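
For review workflows it can help to list every candidate pair with its score instead of grouping greedily, so that borderline pairs are visible. A minimal sketch follows, assuming `difflib` as a stand-in scorer and a hypothetical 0.75 cutoff. The cutoff differs from the 0.7 used above because the two scorers are on different scales.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Sequence, Tuple


def similar_pairs(
    records: Sequence[str], threshold: float = 0.75
) -> List[Tuple[str, str, float]]:
    """Return (a, b, score) for each pair scoring >= threshold, best first."""
    pairs = []
    for a, b in combinations(records, 2):  # each unordered pair exactly once
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((a, b, score))
    pairs.sort(key=lambda pair: -pair[2])
    return pairs


records = ["John Smith", "Jon Smith", "Jane Doe",
           "J. Smith", "Bob Johnson", "Robert Johnson"]

for a, b, score in similar_pairs(records):
    print(f"{score:.2f}  {a!r} ~ {b!r}")
```

This compares all n(n-1)/2 pairs, so it suits small lists. Larger datasets usually need a blocking step first, for example comparing only records that share a surname initial.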


## 9. Algorithm Comparison: Prefix vs Edit Distance

In [11]:
# Jaro-Winkler favors prefix matches
word = "micro"
options = ["microscope", "microphone", "macroscope", "telescope"]

print("Jaro-Winkler (favors prefix):")
jw_matches = string_similarity(word, options, algorithm=SimilarityAlgo.JARO_WINKLER, threshold=0.7)
print(f"  {jw_matches}\n")

# Levenshtein considers overall edit distance
print("Levenshtein (edit distance):")
lev_matches = string_similarity(word, options, algorithm=SimilarityAlgo.LEVENSHTEIN, threshold=0.5)
print(f"  {lev_matches}")

Jaro-Winkler (favors prefix):
  ['microscope', 'microphone', 'macroscope']

Levenshtein (edit distance):
  ['microscope', 'microphone']


## 10. Edge Cases and Error Handling

In [12]:
# Empty strings
result = string_similarity("", ["abc", "def"])
print(f"Empty string search: {result}\n")

# No matches
result = string_similarity("xyz", ["abc", "def"], threshold=0.9)
print(f"No matches above threshold: {result}\n")

# Invalid threshold
try:
    string_similarity("test", ["test"], threshold=1.5)
except ValueError as e:
    print(f"Invalid threshold: {e}\n")

# Empty candidate list
try:
    string_similarity("test", [])
except ValueError as e:
    print(f"Empty candidates: {e}")

Empty string search: ['abc', 'def']

No matches above threshold: None

Invalid threshold: threshold must be between 0.0 and 1.0

Empty candidates: correct_words must not be empty


## 11. Performance Considerations

In [13]:
import time

# Generate test data
word = "example"
candidates = [f"example_{i}" for i in range(1000)]

# Benchmark different algorithms
algorithms = [
    SimilarityAlgo.JARO_WINKLER,
    SimilarityAlgo.LEVENSHTEIN,
    SimilarityAlgo.SEQUENCE_MATCHER,
    SimilarityAlgo.COSINE,
]

print(f"Comparing 1 word against {len(candidates)} candidates:\n")

for algo in algorithms:
    start = time.perf_counter()
    matches = string_similarity(word, candidates, algorithm=algo, threshold=0.5)
    elapsed = (time.perf_counter() - start) * 1000  # Convert to ms
    print(f"{algo.value:20s}: {elapsed:6.2f}ms ({len(matches) if matches else 0} matches)")

Comparing 1 word against 1000 candidates:

jaro_winkler        :   3.82ms (1000 matches)
levenshtein         :  16.87ms (1000 matches)
sequence_matcher    :   4.75ms (1000 matches)
cosine              :   1.38ms (1000 matches)
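
The timings above come from a single run on one machine, so treat them as relative rather than absolute. For steadier numbers, `timeit.repeat` with the minimum taken over several runs is a common approach. The sketch below uses two stdlib-only stand-in scorers, not the library functions, so that it runs without `lionherd_core`. `best_time_ms` is a hypothetical helper.

```python
import math
import timeit
from difflib import SequenceMatcher
from typing import Callable, Dict

word = "example"
candidates = [f"example_{i}" for i in range(1000)]


def sequence_matcher(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def charset_cosine(a: str, b: str) -> float:
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))


def best_time_ms(score_fn: Callable[[str, str], float],
                 repeat: int = 3, number: int = 2) -> float:
    """Best-of-`repeat` time in ms for one pass over all candidates."""
    runs = timeit.repeat(
        lambda: [score_fn(word, c) for c in candidates],
        repeat=repeat,
        number=number,
    )
    # The minimum is the run least disturbed by other activity on the machine.
    return min(runs) / number * 1000


timings: Dict[str, float] = {
    fn.__name__: best_time_ms(fn) for fn in (sequence_matcher, charset_cosine)
}
for name, ms in timings.items():
    print(f"{name:20s}: {ms:6.2f}ms")
```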


## 12. Jaro-Winkler Scaling Parameter

In [14]:
# Jaro-Winkler scaling controls prefix bonus (default 0.1, max 0.25)
s1 = "martha"
s2 = "marhta"  # Transposition

print(f"Comparing '{s1}' vs '{s2}' with different scaling:\n")

for scaling in [0.0, 0.1, 0.2, 0.25]:
    score = jaro_winkler_similarity(s1, s2, scaling=scaling)
    print(f"scaling={scaling}: {score:.4f}")

print("\n(Higher scaling = stronger prefix bonus)")

Comparing 'martha' vs 'marhta' with different scaling:

scaling=0.0: 0.9444
scaling=0.1: 0.9611
scaling=0.2: 0.9778
scaling=0.25: 0.9861

(Higher scaling = stronger prefix bonus)
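
The scores above follow from the standard Jaro-Winkler definition: compute the Jaro score, then add `prefix_len * scaling * (1 - jaro)`, where `prefix_len` is the length of the shared prefix capped at 4 characters. The cap explains the 0.25 maximum, since 4 × 0.25 = 1 is the largest bonus factor that keeps the result at or below 1.0. Below is a textbook sketch, not the `lionherd_core` source. The function names are hypothetical, and raising on out-of-range `scaling` is an assumption.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matches within a sliding window, penalizing transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro score plus a bonus for a shared prefix of up to 4 characters."""
    if not 0.0 <= scaling <= 0.25:
        raise ValueError("scaling must be between 0.0 and 0.25")  # assumed behavior
    base = jaro(s1, s2)
    prefix_len = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix_len += 1
    return base + prefix_len * scaling * (1.0 - base)


for scaling in [0.0, 0.1, 0.2, 0.25]:
    print(f"scaling={scaling}: {jaro_winkler('martha', 'marhta', scaling):.4f}")
# scaling=0.0: 0.9444
# scaling=0.1: 0.9611
# scaling=0.2: 0.9778
# scaling=0.25: 0.9861
```

With `scaling=0.0` the result is the plain Jaro score. `kitten` and `sitting` share no prefix, which is why their score in Section 2 (0.7460) carries no bonus at any scaling.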


## Summary Checklist

**String Similarity Essentials:**
- ✅ Five algorithms: Jaro-Winkler, Levenshtein, SequenceMatcher, Hamming, Cosine
- ✅ Threshold filtering (0.0 to 1.0)
- ✅ Case-sensitive and case-insensitive matching
- ✅ Return all matches or most similar only
- ✅ Custom similarity functions supported
- ✅ Type-safe algorithm selection via SimilarityAlgo enum
- ✅ Stable ordering by score and index

**Algorithm Selection Guide:**
- **Jaro-Winkler**: Short strings and typo correction where a shared prefix matters
- **Levenshtein**: General-purpose edit distance; the slowest of the four algorithms benchmarked in Section 11
- **SequenceMatcher**: Python's `difflib`; a good fit for general text comparison
- **Hamming**: Equal-length strings only (different lengths score 0.0); position-by-position comparison
- **Cosine**: Order-independent character-set overlap; the fastest in the Section 11 benchmark

**Common Use Cases:**
- CLI command typo correction
- Duplicate record detection
- Search suggestions
- Data cleaning and normalization