# Baseline models for Spell Correction

### 1. Setup and Data Loading
Let's set up the environment by importing the necessary libraries, then load the training and validation sets that were created and saved in the previous notebook. The `ast` library is used to safely convert the string representation of the `target_nouns` list back into an actual list.
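As a brief illustration of why the conversion is needed: a list column saved to CSV is read back as a plain string. The sketch below uses a made-up cell value (not taken from the dataset) to show what `ast.literal_eval` does with it.

```python
import ast

# Hypothetical cell value, as it would look after a CSV round trip
raw_value = "['paracetamol', 'tablet']"

# literal_eval parses Python literals only; it does not execute arbitrary code
parsed = ast.literal_eval(raw_value)

print(type(raw_value).__name__, "->", type(parsed).__name__)
print(parsed)
```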

In [6]:
import pandas as pd
import numpy as np
from nltk.metrics.distance import edit_distance
from collections import Counter
import ast # To safely evaluate string representations of lists
import re
from tqdm.notebook import tqdm
import spacy

# Initialize tqdm for pandas
tqdm.pandas()

# Load spacy model
nlp = spacy.load("en_core_web_sm")

# Load the preprocessed datasets
try:
    train_df = pd.read_csv('../dataset/train.csv')
    val_df = pd.read_csv('../dataset/validation.csv')
    
    # Safely convert the 'target_nouns' column from string back to a list
    val_df['target_nouns'] = val_df['target_nouns'].apply(ast.literal_eval)
    
    print(f"Training data loaded with {len(train_df)} rows.")
    print(f"Validation data loaded with {len(val_df)} rows.")
except FileNotFoundError:
    print("Error: Make sure 'train.csv' and 'validation.csv' are present.")
    train_df = pd.DataFrame()
    val_df = pd.DataFrame()

# Build the vocabulary from the training data for both models
if not train_df.empty:
    all_correct_words = ' '.join(train_df['correct_cleaned_baseline'].astype(str)).split()
    vocabulary = set(all_correct_words)
    print(f"Vocabulary built with {len(vocabulary)} unique words.")
else:
    vocabulary = set()

Training data loaded with 7000 rows.
Validation data loaded with 1500 rows.
Vocabulary built with 9188 unique words.


### 2. Baseline Model 1: Dictionary & Levenshtein Distance

This model uses a direct dictionary-based approach. For each word in an incorrect sentence, it checks whether the word exists in the vocabulary. If the word is not found, the model searches the vocabulary for the word with the smallest Levenshtein (edit) distance and substitutes it, provided the distance is within a set threshold (e.g., 2).
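The rule described above can be sketched on a toy vocabulary. This is a minimal, self-contained illustration, not the notebook's implementation: the `levenshtein` and `correct_word` helpers and the three-word vocabulary are assumptions introduced here, and ties are broken alphabetically rather than by iteration order.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def correct_word(word, vocab, max_distance=2):
    """Return the closest vocabulary word, or the input if none is close enough."""
    if word in vocab or not vocab:
        return word
    # Sorting makes the tie-break deterministic (alphabetical)
    distance, best = min((levenshtein(word, v), v) for v in sorted(vocab))
    return best if distance <= max_distance else word

toy_vocab = {"aspirin", "ibuprofen", "tablet"}  # invented for illustration
print(correct_word("asprin", toy_vocab))  # one insertion away from "aspirin"
print(correct_word("tablet", toy_vocab))  # already in the vocabulary
print(correct_word("xyzzyq", toy_vocab))  # nothing within distance 2, left unchanged
```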

In [7]:
correction_cache = {}
MAX_EDIT_DISTANCE = 2 # Max distance for a correction

def get_levenshtein_correction(word, vocab):
    """Finds the best correction based on Levenshtein distance."""
    if word in correction_cache:
        return correction_cache[word]
    if word in vocab:
        correction_cache[word] = word
        return word

    # Find the best suggestion
    suggestions = [(edit_distance(word, v_word), v_word) for v_word in vocab if abs(len(word) - len(v_word)) <= MAX_EDIT_DISTANCE]
    
    if not suggestions:
        correction_cache[word] = word # No suggestions found
        return word

    best_suggestion = min(suggestions, key=lambda x: x[0])
    
    # Use suggestion if within threshold
    if best_suggestion[0] <= MAX_EDIT_DISTANCE:
        correction_cache[word] = best_suggestion[1]
        return best_suggestion[1]
    else:
        correction_cache[word] = word
        return word

def baseline_sentence_correction(sentence, vocab):
    """Applies Levenshtein correction to an entire sentence."""
    words = sentence.split()
    corrected_words = [get_levenshtein_correction(word, vocab) for word in words]
    return ' '.join(corrected_words)

# Apply correction to the validation set
if not val_df.empty:
    print("Applying Levenshtein correction to the validation set...")
    val_df['levenshtein_predicted'] = val_df['incorrect_cleaned_baseline'].progress_apply(
        lambda x: baseline_sentence_correction(x, vocabulary)
    )
    print("Levenshtein correction complete.")
    # Display sample results
    display(val_df[['correct_cleaned_baseline', 'levenshtein_predicted']].head())

Applying Levenshtein correction to the validation set...



Levenshtein correction complete.


| | correct_cleaned_baseline | levenshtein_predicted |
| :--- | :--- | :--- |
| 0 | patients are recommended to consult their heal... | patients are recommended to consult their heal... |
| 1 | amydio forte is a powerful medication often us... | imidio forte is a powerful medication often us... |
| 2 | provera 40 contains medroxyprogesterone acetat... | provera-d contains medrexiprogesterone acetate... |
| 3 | ceemi-o is a popular over-the-counter medicati... | imci h is a popular over-the-counter medicatio... |
| 4 | l-arginine sr is a popular supplement known fo... | lrg9-sr is a popular supplement known for its ... |


### 3. Baseline Model 2: N-gram Language Model
The N-gram model extends the Levenshtein approach by adding context. For a misspelled word, it first generates a list of candidates (vocabulary words within the edit-distance threshold). It then scores each candidate by how often it follows the preceding word in the training data and picks the highest-scoring one. Bigram and trigram counts are both built in the cell that follows, but the scoring function uses only the bigram counts.
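To make the context scoring concrete, here is a toy sketch with invented training sentences. The `pick_candidate` helper is hypothetical, and its tie-break (smaller edit distance first, then alphabetical order) is an assumption added for illustration; it is one possible way to choose between candidates whose bigram counts are equal.

```python
from collections import Counter

# Toy training sentences (invented for illustration)
sentences = ["take one tablet daily", "take one tablet at night", "clear the table"]

bigram_counts = Counter()
for s in sentences:
    words = ["<s>"] + s.split() + ["</s>"]
    bigram_counts.update(zip(words, words[1:]))

def pick_candidate(previous_word, candidates):
    """candidates maps each candidate word to its edit distance from the typo.

    The highest bigram count wins; ties go to the smaller edit distance,
    then to alphabetical order.
    """
    return min(
        candidates,
        key=lambda c: (-bigram_counts[(previous_word, c)], candidates[c], c),
    )

# Candidates for the typo "tablt", with hand-computed edit distances
candidates = {"tablet": 1, "table": 1}
print(pick_candidate("one", candidates))  # "one tablet" was seen, "one table" was not
print(pick_candidate("the", candidates))  # "the table" was seen, "the tablet" was not
```

With an unseen context such as `"some"`, every count is zero and the choice falls back to the tie-break, which is where a purely count-based scorer gives no signal.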

In [8]:
# --- Step 1: Build N-gram models from training data ---
print("Building N-gram models...")
bigram_counts = Counter()
trigram_counts = Counter()

for sentence in train_df['correct_cleaned_baseline'].astype(str):  # cast to str, as in the vocabulary step
    words = ['<s>'] + sentence.split() + ['</s>'] # Add start/end tokens
    bigram_counts.update(zip(words, words[1:]))
    trigram_counts.update(zip(words, words[1:], words[2:]))

print("N-gram models built.")

# --- Step 2: Implement the N-gram correction logic ---
def get_ngram_correction(previous_word, current_word, vocab):
    """Generates candidates and scores them using N-gram context."""
    if current_word in vocab:
        return current_word

    # Generate candidates (similar to Levenshtein model)
    suggestions = [v_word for v_word in vocab if edit_distance(current_word, v_word) <= MAX_EDIT_DISTANCE]
    if not suggestions:
        return current_word

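    # Score candidates by raw bigram count with the previous word.
    # Note: if every candidate has a count of 0, the first candidate in
    # iteration order wins, regardless of its edit distance.
    best_candidate = current_word
    max_score = -1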

    for candidate in suggestions:
        # Simple scoring: use the frequency count from our bigram model
        score = bigram_counts.get((previous_word, candidate), 0)
        if score > max_score:
            max_score = score
            best_candidate = candidate
            
    return best_candidate

def ngram_sentence_correction(sentence, vocab):
    """Applies N-gram based correction to an entire sentence."""
    words = ['<s>'] + sentence.split() # Add start token for context
    corrected_words = []
    
    for i in range(1, len(words)):
        previous_word = words[i-1]
        current_word = words[i]
        corrected_word = get_ngram_correction(previous_word, current_word, vocab)
        corrected_words.append(corrected_word)
        words[i] = corrected_word # Update for context of next word
        
    return ' '.join(corrected_words)

# --- Step 3: Apply N-gram correction to the validation set ---
if not val_df.empty:
    print("\nApplying N-gram correction to the validation set...")
    val_df['ngram_predicted'] = val_df['incorrect_cleaned_baseline'].progress_apply(
        lambda x: ngram_sentence_correction(x, vocabulary)
    )
    print("N-gram correction complete.")
    # Display sample results
    display(val_df[['correct_cleaned_baseline', 'ngram_predicted']].head())

Building N-gram models...
N-gram models built.

Applying N-gram correction to the validation set...



N-gram correction complete.


| | correct_cleaned_baseline | ngram_predicted |
| :--- | :--- | :--- |
| 0 | patients are recommended to consult their heal... | patients are recommended to consult their heal... |
| 1 | amydio forte is a powerful medication often us... | imidio forte is a powerful medication often us... |
| 2 | provera 40 contains medroxyprogesterone acetat... | provera-d contains medrexiprogesterone acetate... |
| 3 | ceemi-o is a popular over-the-counter medicati... | some low is a popular over-the-counter medicat... |
| 4 | l-arginine sr is a popular supplement known fo... | lrg9-sr is a popular supplement known for its ... |


### 4. Evaluation
We will now evaluate both models on the four specified metrics.

#### Evaluation Metrics
##### Sentence Accuracy:
- **Significance**: This is the most stringent metric. It measures the percentage of predicted sentences that are a perfect, character-for-character match with the ground truth sentences. A high score indicates the model is capable of producing completely error-free transcriptions.
- **Calculation**: It is calculated by counting the number of predicted sentences that exactly match the correct sentences, then dividing by the total number of sentences.

##### Word Accuracy:
- **Significance**: This metric provides a more granular measure of performance by calculating the percentage of correctly predicted words. It is useful for understanding the overall quality of the transcription, even if entire sentences are not perfect.
- **Calculation**: It is calculated by comparing the predicted and true sentences word by word at matching positions, summing the number of matching words, and dividing by the total number of words in the true sentences.
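The two accuracy definitions above can be checked on a small example. This is a sketch on plain Python lists with invented sentences; the function names are illustrative and the DataFrame-based versions used for the actual evaluation appear later in this notebook.

```python
def sentence_accuracy(true_sentences, predicted_sentences):
    """Percentage of predictions that match the reference exactly."""
    if not true_sentences:
        return 0.0
    matches = sum(t == p for t, p in zip(true_sentences, predicted_sentences))
    return 100 * matches / len(true_sentences)

def word_accuracy(true_sentences, predicted_sentences):
    """Percentage of reference words matched at the same position."""
    total = correct = 0
    for true, pred in zip(true_sentences, predicted_sentences):
        true_words, pred_words = true.split(), pred.split()
        total += len(true_words)
        correct += sum(t == p for t, p in zip(true_words, pred_words))
    return 100 * correct / total if total else 0.0

# Invented examples: the second prediction has one misspelled word
true = ["take one tablet daily", "apply the cream twice"]
pred = ["take one tablet daily", "apply the crem twice"]

print(sentence_accuracy(true, pred))  # 1 of 2 sentences match exactly -> 50.0
print(word_accuracy(true, pred))      # 7 of 8 words match by position -> 87.5
```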


##### Character Error Rate (CER):

- **Significance**: CER measures the dissimilarity between the predicted and true sentences at the character level. It is especially relevant for spell correction tasks because it quantifies how "close" a prediction is to the correct version. A lower CER indicates better performance.
- **Calculation**: CER is computed using the Levenshtein distance, which determines the minimum number of single-character edits (insertions, deletions, or substitutions) needed to change the predicted sentence into the correct one. This total edit distance is then normalized by dividing by the total number of characters in the correct sentences.
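Written as a formula, the calculation described above is a corpus-level ratio. The notation is introduced here for illustration: $s_i$ is the $i$-th correct sentence, $\hat{s}_i$ the corresponding prediction, $d_{\mathrm{lev}}$ the Levenshtein distance, and $|s_i|$ the length of $s_i$ in characters.

$$
\mathrm{CER} = 100 \times \frac{\sum_{i} d_{\mathrm{lev}}(\hat{s}_i, s_i)}{\sum_{i} |s_i|}
$$

As a worked example with a single made-up pair, take $s = \texttt{amoxicillin}$ (11 characters) and $\hat{s} = \texttt{amoxicilin}$. One insertion turns the prediction into the reference, so

$$
\mathrm{CER} = 100 \times \frac{1}{11} \approx 9.09\%
$$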


##### Noun Accuracy:


- **Significance**: As per the assignment's focus, this is a crucial domain-specific metric. It specifically evaluates the model's ability to correctly identify and transcribe nouns, which are often the critical medication names and medical terms in this context.
- **Calculation**: First, nouns are extracted from both the true (target) sentences and the model's predicted sentences using POS tagging. The metric then measures the overlap between these two sets of nouns, calculated as the number of correctly predicted nouns divided by the total number of nouns in the true sentences.
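The overlap calculation can be illustrated without a POS tagger. In the sketch below, both noun sets are invented and simply stand in for the tagger's output; only the set arithmetic is being shown.

```python
# (true_nouns, predicted_nouns) per sentence; invented stand-ins for POS-tagger output
examples = [
    ({"amoxicillin", "infection"}, {"amoxicillin", "infection"}),
    ({"ibuprofen", "pain", "fever"}, {"ibuprofin", "pain", "fever"}),
]

# Count reference nouns that also appear among the predicted nouns
correct = sum(len(true & pred) for true, pred in examples)
total = sum(len(true) for true, _ in examples)
noun_accuracy = 100 * correct / total if total else 0.0

print(f"{correct}/{total} nouns recovered -> {noun_accuracy:.1f}%")
```

Because the comparison is on sets, word order and repeated nouns within a sentence do not affect the score, which makes this metric behave differently from the positional Word Accuracy.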


In [10]:
def calculate_sentence_accuracy(df, true_col, pred_col):
    correct_sentences = (df[true_col] == df[pred_col]).sum()
    total_sentences = len(df)
    return (correct_sentences / total_sentences) * 100

def calculate_word_accuracy(df, true_col, pred_col):
    total_words = 0
    correct_words = 0
    for _, row in df.iterrows():
        true_words = str(row[true_col]).split()
        pred_words = str(row[pred_col]).split()
        total_words += len(true_words)
        for i in range(min(len(true_words), len(pred_words))):
            if true_words[i] == pred_words[i]:
                correct_words += 1
    return (correct_words / total_words) * 100 if total_words > 0 else 0

def calculate_cer(df, true_col, pred_col):
    total_distance = 0
    total_chars = 0
    for _, row in df.iterrows():
        true_sent = str(row[true_col])
        pred_sent = str(row[pred_col])
        total_distance += edit_distance(true_sent, pred_sent)
        total_chars += len(true_sent)
    return (total_distance / total_chars) * 100 if total_chars > 0 else 0

def calculate_noun_accuracy(df, true_nouns_col, pred_sent_col):
    total_true_nouns = 0
    correctly_predicted_nouns = 0
    for _, row in df.iterrows():
        true_nouns = set(row[true_nouns_col])
        pred_doc = nlp(str(row[pred_sent_col]))
        pred_nouns = set([token.text for token in pred_doc if token.pos_ in ('NOUN', 'PROPN')])
        correctly_predicted_nouns += len(true_nouns.intersection(pred_nouns))
        total_true_nouns += len(true_nouns)
    return (correctly_predicted_nouns / total_true_nouns) * 100 if total_true_nouns > 0 else 0

# --- Function to run all evaluations ---
def evaluate_model(df, model_pred_col):
    true_col = 'correct_cleaned_baseline'
    true_nouns_col = 'target_nouns'
    
    sa = calculate_sentence_accuracy(df, true_col, model_pred_col)
    wa = calculate_word_accuracy(df, true_col, model_pred_col)
    cer = calculate_cer(df, true_col, model_pred_col)
    na = calculate_noun_accuracy(df, true_nouns_col, model_pred_col)
    
    return {"Sentence Accuracy": sa, "Word Accuracy": wa, "CER": cer, "Noun Accuracy": na}

# Evaluate Levenshtein Model
levenshtein_results = evaluate_model(val_df, 'levenshtein_predicted')

# Evaluate N-gram Model
ngram_results = evaluate_model(val_df, 'ngram_predicted')

# Display results in a DataFrame
results_df = pd.DataFrame([levenshtein_results, ngram_results], index=['Levenshtein', 'N-gram'])
print("\n--- Model Performance Comparison ---")
display(results_df.style.format("{:.2f}%"))


--- Model Performance Comparison ---


| | Sentence Accuracy | Word Accuracy | CER | Noun Accuracy |
| :--- | :--- | :--- | :--- | :--- |
| Levenshtein | 7.00% | 68.64% | 4.00% | 73.93% |
| N-gram | 6.53% | 68.53% | 4.19% | 73.52% |


### 5. Analysis and Conclusion

#### **Performance Summary**

The evaluation of the baseline models on the validation set produced the following results:

| Model | Sentence Accuracy | Word Accuracy | CER | Noun Accuracy |
| :--- | :--- | :--- | :--- | :--- |
| **Levenshtein** | 7.00% | 68.64% | 4.00% | 73.93% |
| **N-gram** | 6.53% | 68.53% | 4.19% | 73.52% |


#### **Analysis**

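* **Key Finding**: The simpler **Levenshtein model unexpectedly outperformed the N-gram model** on all four metrics, although the margins are small (for example, 7.00% vs. 6.53% Sentence Accuracy). This suggests that the N-gram model's limited context, likely affected by data sparsity, offered no practical advantage. One contributing factor is visible in the scoring code: when no candidate has been seen after the preceding word, every bigram count is 0 and the model keeps the first candidate it encounters rather than the one with the smallest edit distance.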

* **Metric Insights**:
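    * The **Noun Accuracy is higher than the overall Word Accuracy**, which is a positive sign that the models are more effective on the assignment's primary targets (medical nouns). The two metrics are computed differently, however (set overlap per sentence vs. position-by-position matching), so the comparison is indicative rather than exact.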
    * The **very low Sentence Accuracy (~7%)** confirms that achieving perfect sentence correction with these methods is highly challenging.
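    * The **low Character Error Rate (CER)**, despite poor word and sentence scores, indicates that the models can fix small typos but fail on more complex issues such as **segmentation errors** (`health care` vs. `healthcare`), which were a major error type identified in the EDA. Because Word Accuracy compares words position by position, a single split or merged word also misaligns every word after it, which lowers the word-level score far more than the character-level one.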
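The gap between character-level and word-level scores on a segmentation error can be reproduced with one invented sentence pair. This is a minimal sketch with its own small `levenshtein` helper; it is not the evaluation code used for the results above.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Invented pair: the prediction differs only by one extra space
true = "consult your healthcare provider before use"
pred = "consult your health care provider before use"

true_words, pred_words = true.split(), pred.split()

# Positional comparison: everything after the split word is shifted by one
matches = sum(t == p for t, p in zip(true_words, pred_words))
word_acc = 100 * matches / len(true_words)

char_edits = levenshtein(pred, true)
cer = 100 * char_edits / len(true)

print(f"character edits: {char_edits}, CER: {cer:.1f}%")
print(f"words matched by position: {matches}/{len(true_words)}, word accuracy: {word_acc:.1f}%")
```

Only the first two words align by position, so a single inserted space costs most of the word-level score while barely moving the CER.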


#### **Conclusion**

These baseline models establish a performance benchmark, but their limitations are clear. Their inability to handle structural errors such as word segmentation provides a **strong, data-driven justification for progressing to advanced transformer-based models like T5**, as outlined in the assignment. Such sequence-to-sequence architectures condition on the entire sentence and can insert, delete, or merge tokens, so they are in principle better suited to these errors.