# 11. Spell Checking and Correction

In this notebook, we implement mechanisms for detecting and correcting user typos and spelling errors.

## Techniques Covered:
1. **Minimum Edit Distance (Levenshtein)**: Computing the minimum number of single-character insertions, deletions, and substitutions needed to transform one string into another.
2. **N-gram Overlap**: Finding similar words based on shared character n-grams.
3. **Simple Spelling Corrector**: Suggesting corrections for unknown words.

In [6]:
import os
import glob
from collections import Counter, defaultdict

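## 1. Data Preparation
We need a vocabulary of correctly spelled words. We build it from our document collection and count how often each word occurs, so that frequency can later break ties between candidate corrections.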

In [7]:
def load_vocabulary(data_dir="../data"):
    vocab = Counter()
    for filepath in glob.glob(os.path.join(data_dir, "*.txt")):
        with open(filepath, 'r', encoding='utf-8') as f:
            text = f.read()
            # Whitespace tokenization; punctuation such as the danda (।) stays attached to tokens
            tokens = text.lower().split()
            # Filter purely numeric tokens and update counts
            valid_tokens = [t for t in tokens if not t.isnumeric()]
            vocab.update(valid_tokens)
    return vocab

vocabulary = load_vocabulary()
print(f"Vocabulary size: {len(vocabulary)} unique words")
print("Top 10 words:", vocabulary.most_common(10))

Vocabulary size: 457 unique words
Top 10 words: [('र', 50), ('छ।', 31), ('नेपालको', 15), ('नेपालमा', 15), ('छन्।', 15), ('हो।', 13), ('पनि', 13), ('हुन्।', 10), ('नेपाल', 9), ('नेपाली', 9)]
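
The top-10 list shows tokens such as `छ।` and `हो।` with the danda still attached, because whitespace splitting does not separate punctuation. A minimal sketch of a stricter tokenizer follows. The helper name `tokenize`, the pattern `PUNCT_RE`, and the exact punctuation set are assumptions made for illustration, not part of the notebook above.

```python
import re
from collections import Counter

# Assumption: treat the danda, double danda, and common ASCII punctuation as separators.
PUNCT_RE = re.compile(r"[।॥,.;:!?()\[\]{}]+")

def tokenize(text):
    """Lowercase, replace punctuation with spaces, split, and drop numeric tokens."""
    cleaned = PUNCT_RE.sub(" ", text.lower())
    return [t for t in cleaned.split() if not t.isnumeric()]

sample = "नेपाल सुन्दर छ। काठमाडौं राजधानी हो।"
print(Counter(tokenize(sample)).most_common(3))
```

Swapping such a tokenizer into `load_vocabulary` would merge `छ।` with `छ`, so the vocabulary size and counts would differ from the output shown above.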


## 2. Minimum Edit Distance (Levenshtein)

The edit distance between two strings is the minimum number of single-character operations (insertions, deletions, substitutions) required to change one string into the other.

We compute it with dynamic programming, filling an (m + 1) × (n + 1) table in O(mn) time for strings of lengths m and n.
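
As a sketch of the underlying recurrence, let $D(i, j)$ denote the distance between the first $i$ characters of $s_1$ and the first $j$ characters of $s_2$. The symbols $D$ and $c_{ij}$ are notation introduced here for illustration, with characters indexed from 1:

$$ D(i, 0) = i, \qquad D(0, j) = j $$

$$ D(i, j) = \min\bigl( D(i-1, j) + 1,\; D(i, j-1) + 1,\; D(i-1, j-1) + c_{ij} \bigr), \qquad c_{ij} = \begin{cases} 0 & \text{if } s_1[i] = s_2[j] \\ 1 & \text{otherwise} \end{cases} $$

The three terms correspond to deletion, insertion, and substitution (or a free match when $c_{ij} = 0$), and the final answer is $D(m, n)$.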

In [8]:
def simple_edit_distance(s1, s2):
    m = len(s1)
    n = len(s2)
    
    # Initialize matrix
    # dp[i][j] stores distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
        
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                cost = 0
            else:
                cost = 1 # Substitution cost
            
            dp[i][j] = min(
                dp[i-1][j] + 1,      # Deletion
                dp[i][j-1] + 1,      # Insertion
                dp[i-1][j-1] + cost  # Substitution
            )
            
    return dp[m][n]

# Test
print(f"Distance('kitten', 'sitting'): {simple_edit_distance('kitten', 'sitting')}")
print(f"Distance('nepal', 'nipal'): {simple_edit_distance('nepal', 'nipal')}")

Distance('kitten', 'sitting'): 3
Distance('nepal', 'nipal'): 1
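
For Devanagari text, note that a Python `str` is a sequence of Unicode code points, so the distance counts dependent vowel signs (matras) as separate characters. The sketch below lists the code points of two words to make this visible. The helper name `describe` is hypothetical and used only for illustration.

```python
import unicodedata

def describe(word):
    """Return (code point, Unicode name) pairs for each element of the string."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in word]

for word in ["नेपाल", "नेपल"]:
    print(word, len(word))
    for code, name in describe(word):
        print("   ", code, name)
```

Under this view, `नेपल` differs from `नेपाल` by a single missing vowel sign, which is why one insertion is enough to repair it.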


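## 3. N-gram Based Correction

To narrow down candidate corrections, we can compare character n-grams: words that share many n-grams are likely to be similar. We score the overlap with the Jaccard similarity, where $A$ and $B$ are the sets of character n-grams of the two words:

$$ \mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|} $$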
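
As a worked instance of the formula, the following sketch compares the bigram sets of `नेपल` and `नेपाल`. It assumes each word is padded with a `$` boundary marker on both sides, and the helper name `padded_bigrams` is hypothetical.

```python
def padded_bigrams(word):
    """Return the set of character bigrams of '$' + word + '$'."""
    padded = f"${word}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

a = padded_bigrams("नेपल")
b = padded_bigrams("नेपाल")
shared = a & b      # bigrams present in both words
combined = a | b    # bigrams present in either word

print(len(shared), len(combined), len(shared) / len(combined))
```

The two words share 4 of 7 distinct bigrams, giving a similarity of 4/7, or about 0.571.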

In [9]:
def get_char_ngrams(text, n=2):
    # Pad with '$' so that word-initial and word-final characters form their own n-grams
    padded = f"${text}$"
    return [padded[i:i+n] for i in range(len(padded)-n+1)]

def jaccard_similarity(set1, set2):
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    return intersection / union if union > 0 else 0.0

def get_ngram_suggestions(word, vocab, n=2, top_k=5):
    word_ngrams = set(get_char_ngrams(word, n))
    scores = []
    
    for vocab_word in vocab:
        v_ngrams = set(get_char_ngrams(vocab_word, n))
        score = jaccard_similarity(word_ngrams, v_ngrams)
        if score > 0:
            scores.append((vocab_word, score))
    
    return sorted(scores, key=lambda x: x[1], reverse=True)[:top_k]

# Test N-gram suggestions
typo = "नेपल" # Correct: नेपाल
print(f"Suggestions for '{typo}':")
print(get_ngram_suggestions(typo, vocabulary.keys()))

Suggestions for 'नेपल':
[('नेपाल', 0.5714285714285714), ('नेपाली', 0.3333333333333333), ('नेपालको', 0.3), ('नेपालमा', 0.3), ('नेपालका', 0.3)]
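
The loop in `get_ngram_suggestions` recomputes the n-grams of every vocabulary word on each query. For larger vocabularies, one possible alternative is an inverted index from n-gram to words, so that only words sharing at least one n-gram with the query are scored. The sketch below is an assumption-laden illustration: the names `char_ngrams`, `build_ngram_index`, and `indexed_suggestions`, and the toy vocabulary, are not part of the notebook.

```python
from collections import defaultdict

def char_ngrams(text, n=2):
    """Set of character n-grams of the '$'-padded text."""
    padded = f"${text}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_ngram_index(vocab, n=2):
    """Map each n-gram to the set of vocabulary words containing it."""
    index = defaultdict(set)
    for word in vocab:
        for gram in char_ngrams(word, n):
            index[gram].add(word)
    return index

def indexed_suggestions(word, index, n=2, top_k=5):
    """Score only the words that share at least one n-gram with the query."""
    query = char_ngrams(word, n)
    shared = defaultdict(int)  # candidate -> number of shared n-grams
    for gram in query:
        for cand in index.get(gram, ()):
            shared[cand] += 1
    scored = []
    for cand, inter in shared.items():
        union = len(query) + len(char_ngrams(cand, n)) - inter
        scored.append((cand, inter / union))
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# Toy vocabulary for illustration only
toy_vocab = ["नेपाल", "नेपाली", "खेल", "विकास"]
index = build_ngram_index(toy_vocab)
print(indexed_suggestions("नेपल", index))
```

The index is built once and reused across queries. Words with no n-gram in common with the query never appear as candidates, which matches the `score > 0` filter in the original function.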


## 4. Integrated Spell Checker

We combine these techniques:
1. If the word is in the vocabulary, treat it as correct.
2. If not, retrieve candidates with high n-gram overlap.
3. Rank the candidates by minimum edit distance (primary key, ascending) and corpus frequency (secondary key, descending).

In [10]:
def spell_check(word, vocab_counts, top_k=3):
    if word in vocab_counts:
        return [(word, 0, vocab_counts[word])] # Exact match
    
    # 1. Get candidates using bigrams (coarse filter)
    candidates = [w for w, score in get_ngram_suggestions(word, vocab_counts.keys(), n=2, top_k=20)]
    
    # 2. Refine using Edit Distance
    refined = []
    for cand in candidates:
        dist = simple_edit_distance(word, cand)
        freq = vocab_counts[cand]
        refined.append((cand, dist, freq))
    
    # Sort by: Distance (asc), then Frequency (desc)
    refined.sort(key=lambda x: (x[1], -x[2]))
    
    return refined[:top_k]

# Interactive Test
test_words = ["नेपल", "काठमडौ", "सरकार", "बिकास"]

for w in test_words:
    corrections = spell_check(w, vocabulary)
    print(f"Query: {w:<10} | Suggestions: {corrections}")

Query: नेपल       | Suggestions: [('नेपाल', 1, 9), ('नेपाली', 2, 9), ('खेल', 2, 1)]
Query: काठमडौ     | Suggestions: [('काठमाडौं', 2, 2), ('काम', 3, 1), ('काठमाडौंमा', 4, 1)]
Query: सरकार      | Suggestions: [('सरकारले', 2, 3), ('सरकारको', 2, 2), ('सुधार', 2, 1)]
Query: बिकास      | Suggestions: [('विकास', 1, 1), ('निकाय', 2, 1), ('इतिहास', 3, 4)]
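
Some of the lower-ranked suggestions above sit three or four edits away from the query and are unlikely to be the intended word. One possible refinement is to discard candidates beyond a distance cutoff. The sketch below applies such a cutoff to the suggestion list printed for `बिकास`. The helper name `filter_by_distance` and the default cutoff of 2 are assumptions chosen for illustration.

```python
def filter_by_distance(suggestions, max_distance=2):
    """Keep only (word, distance, frequency) tuples within max_distance edits."""
    return [s for s in suggestions if s[1] <= max_distance]

# Suggestion list copied from the output for the query 'बिकास'
ranked = [("विकास", 1, 1), ("निकाय", 2, 1), ("इतिहास", 3, 4)]

print(filter_by_distance(ranked))                  # cutoff of 2 edits
print(filter_by_distance(ranked, max_distance=1))  # stricter cutoff
```

A suitable cutoff depends on word length and on how noisy the input is, so it is best treated as a tunable parameter.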
