# Baseline models for Spell Correction

## 1. Setup and Data Loading
Let's set up the environment by importing the necessary libraries, then load the training and validation sets that were created and saved in the previous notebook. We use the `ast` library to safely convert the string representation of the `target_nouns` list back into an actual list.

In [1]:
import pandas as pd
import numpy as np
from nltk.metrics.distance import edit_distance
from tqdm.notebook import tqdm
import ast # To safely evaluate string representations of lists
from nltk.translate.bleu_score import sentence_bleu
import spacy

# Initialize tqdm for pandas
tqdm.pandas()

# Load the spaCy model for noun extraction during evaluation
nlp = spacy.load("en_core_web_sm")

print("Libraries imported successfully.")

# Load the preprocessed datasets
try:
    train_df = pd.read_csv('../dataset/train.csv')
    val_df = pd.read_csv('../dataset/validation.csv')
except FileNotFoundError:
    print("Error: Make sure 'train.csv' and 'validation.csv' are in the 'dataset' directory.")
    # Create empty dataframes to avoid further errors if files are not found
    train_df = pd.DataFrame()
    val_df = pd.DataFrame()

# Safely convert the 'target_nouns' column from string back to a list
# This is necessary because CSVs don't store list types natively
if 'target_nouns' in val_df.columns:
    val_df['target_nouns'] = val_df['target_nouns'].apply(ast.literal_eval)

print(f"Training data loaded with {len(train_df)} rows.")
print(f"Validation data loaded with {len(val_df)} rows.")

display(val_df.head())

Libraries imported successfully.
Training data loaded with 7000 rows.
Validation data loaded with 1500 rows.


Unnamed: 0,correct sentences,ASR-generated incorrect transcriptions,correct_word_count,incorrect_word_count,error_pairs,error_pairs_diff,correct_cleaned_baseline,incorrect_cleaned_baseline,correct_cleaned_advanced,incorrect_cleaned_advanced,target_nouns
0,Patients are recommended to consult their heal...,Patients are recommended to consult their heal...,14,14,"[('tizarex', 'tzx')]","[('tizarex', 'tzx')]",patients are recommended to consult their heal...,patients are recommended to consult their heal...,patients are recommended to consult their heal...,patients are recommended to consult their heal...,"[patients, healthcare, provider, tizarex, guid..."
1,AMYDIO FORTE is a powerful medication often us...,"Imidio Forte is a powerful medication, often u...",14,14,"[('amydio', 'imidio')]","[('amydio', 'imidio')]",amydio forte is a powerful medication often us...,imidio forte is a powerful medication often us...,amydio forte is a powerful medication often us...,imidio forte is a powerful medication often us...,"[amydio, forte, medication, pain, inflammation]"
2,Provera 40 contains medroxyprogesterone acetat...,Provera-40 contains medrexiprogesterone acetat...,13,12,"[('provera', 'provera40'), ('40', 'contains'),...","[('provera', 'provera-40'), ('medroxyprogester...",provera 40 contains medroxyprogesterone acetat...,provera-40 contains medrexiprogesterone acetat...,provera 40 contains medroxyprogesterone acetat...,provera-40 contains medrexiprogesterone acetat...,"[provera, medroxyprogesterone, acetate, hormon..."
3,CEEMI-O is a popular over-the-counter medicati...,"Simi, oh, is a popular over-the-counter medica...",12,13,"[('ceemio', 'simi'), ('is', 'oh'), ('a', 'is')...","[('ceemi-o', 'simi')]",ceemi-o is a popular over-the-counter medicati...,simi oh is a popular over-the-counter medicati...,ceemi-o is a popular over-the-counter medicati...,simi oh is a popular over-the-counter medicati...,"[ceemi, o, counter, medication, flu, symptoms]"
4,L-arginine SR is a popular supplement known fo...,LRG9-SR is a popular supplement known for its ...,14,13,"[('larginine', 'lrg9sr'), ('sr', 'is'), ('is',...","[('l-arginine', 'lrg9-sr')]",l-arginine sr is a popular supplement known fo...,lrg9-sr is a popular supplement known for its ...,l-arginine sr is a popular supplement known fo...,lrg9-sr is a popular supplement known for its ...,"[l, arginine, sr, supplement, benefits, health]"


## 2. Baseline Model: Dictionary & Levenshtein Distance

Our baseline model will follow a traditional spell-correction approach as outlined in the assignment:

1.  **Build a Dictionary:** We will create a comprehensive vocabulary set from all the words in the `correct_cleaned_baseline` column of our **training data**. This set will serve as our ground truth for "correct" words.
2.  **Correction Logic:** For each word in an incorrect sentence, we will check if it exists in our dictionary.
    *   If it does, we assume it's correct.
    *   If it doesn't, we search our dictionary for the word with the smallest **Levenshtein distance** (edit distance).
    *   If this smallest distance is at or below a set threshold (here, 2), we replace the incorrect word. Otherwise, we leave it unchanged to avoid incorrect "corrections."
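
To make the distance used in step 2 concrete, here is a minimal pure-Python sketch of Levenshtein distance. The notebook itself relies on `nltk.metrics.distance.edit_distance`; the function name `levenshtein` below is only an illustrative stand-in, not the library's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match at no cost)
            ))
        previous = current
    return previous[-1]

print(levenshtein("blod", "blood"))             # 1
print(levenshtein("presure", "pressure"))       # 1
print(levenshtein("lisinopril", "fosinopril"))  # 2
```

With a threshold of 2, all three pairs count as close enough to trigger a replacement, including the last one, where both strings are valid but different drug names.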

#### 2.1 Build the Vocabulary Dictionary

In [16]:
# Create a vocabulary from the correct sentences in the training data
# Using a set provides O(1) average time complexity for lookups, which is very fast.
if not train_df.empty:
    all_correct_words = ' '.join(train_df['correct_cleaned_baseline'].astype(str)).split()
    vocabulary = set(all_correct_words)
    print(f"Vocabulary built with {len(vocabulary)} unique words.")
else:
    vocabulary = set()
    print("Vocabulary is empty as training data could not be loaded.")

# Let's look at a few words from our vocabulary
print("Sample words from vocabulary:", list(vocabulary)[:10])

Vocabulary built with 9188 unique words.
Sample words from vocabulary: ['highlight', 'clohart-a', 'oftec', 'sapsyl-fe', 'desthama', 'caspofungin', 'greatly', 'flagel', 'cenegermin-bkbj', 'breakthrough']


#### 2.2 Implement the Correction Algorithm

In [17]:
# This dictionary will act as a cache to store corrections we've already computed,
# speeding up the process significantly if the same incorrect word appears many times.
correction_cache = {}
MAX_EDIT_DISTANCE = 2 # The maximum distance to consider a word for correction

def get_correction(word, vocab):
    """
    Finds the best correction for a word from the vocabulary based on Levenshtein distance.
    """
    # If the word is already in our cache, return the cached correction
    if word in correction_cache:
        return correction_cache[word]
    
    # If the word is in our vocabulary, it's correct
    if word in vocab:
        correction_cache[word] = word
        return word

    # Find the best suggestion from the vocabulary
    suggestions = [(edit_distance(word, v_word), v_word) for v_word in vocab if abs(len(word) - len(v_word)) <= MAX_EDIT_DISTANCE]
    
    if not suggestions:
        correction_cache[word] = word # No suggestions found
        return word

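    # Get the suggestion with the minimum edit distance.
    # Note: ties are broken by set iteration order, which is not stable across
    # Python sessions, so equally distant candidates may differ between runs.
    best_suggestion = min(suggestions, key=lambda x: x[0])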
    
    # If the best suggestion is within our threshold, use it
    if best_suggestion[0] <= MAX_EDIT_DISTANCE:
        correction_cache[word] = best_suggestion[1]
        return best_suggestion[1]
    else:
        # The closest word is still too different, so assume the original word is correct
        correction_cache[word] = word
        return word

def baseline_sentence_correction(sentence, vocab):
    """
    Applies the correction logic to an entire sentence.
    """
    words = sentence.split()
    corrected_words = [get_correction(word, vocab) for word in words]
    return ' '.join(corrected_words)

# --- Test the function on a sample sentence ---
sample_incorrect_sentence = "take one tablet of lisinopril for high blod presure"
corrected_sentence = baseline_sentence_correction(sample_incorrect_sentence, vocabulary)
print(f"Original:  '{sample_incorrect_sentence}'")
print(f"Corrected: '{corrected_sentence}'")

Original:  'take one tablet of lisinopril for high blod presure'
Corrected: 'take one tablet of fosinopril for high blood pressure'


#### 2.3 Apply Correction to the Validation Set

In [4]:
if not val_df.empty:
    print("Applying baseline correction to the validation set...")
    val_df['baseline_predicted'] = val_df['incorrect_cleaned_baseline'].progress_apply(
        lambda x: baseline_sentence_correction(x, vocabulary)
    )
    print("Correction complete.")
    display(val_df[['incorrect_cleaned_baseline', 'baseline_predicted', 'correct_cleaned_baseline']].head())
else:
    print("Validation DataFrame is empty. Skipping correction.")

Applying baseline correction to the validation set...


Correction complete.


Unnamed: 0,incorrect_cleaned_baseline,baseline_predicted,correct_cleaned_baseline
0,patients are recommended to consult their heal...,patients are recommended to consult their heal...,patients are recommended to consult their heal...
1,imidio forte is a powerful medication often us...,imidio forte is a powerful medication often us...,amydio forte is a powerful medication often us...
2,provera-40 contains medrexiprogesterone acetat...,provera-d contains medrexiprogesterone acetate...,provera 40 contains medroxyprogesterone acetat...
3,simi oh is a popular over-the-counter medicati...,some h is a popular over-the-counter medicatio...,ceemi-o is a popular over-the-counter medicati...
4,lrg9-sr is a popular supplement known for its ...,lrg9-sr is a popular supplement known for its ...,l-arginine sr is a popular supplement known fo...


## **3. Evaluation**

We now evaluate the baseline model using the metrics specified in the assignment.

#### **3.1 Word-Level Accuracy**

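This metric is the percentage of reference words that the prediction reproduces at the same position. Because the comparison is positional, a single inserted or dropped word misaligns every word after it, so the score can understate quality on sentences with segmentation errors.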

In [5]:
def calculate_word_accuracy(df, true_col, pred_col):
    total_words = 0
    correct_words = 0
    for index, row in df.iterrows():
        true_words = str(row[true_col]).split()
        pred_words = str(row[pred_col]).split()
        for i in range(min(len(true_words), len(pred_words))):
            if true_words[i] == pred_words[i]:
                correct_words += 1
        total_words += len(true_words)
    return (correct_words / total_words) * 100 if total_words > 0 else 0

if 'baseline_predicted' in val_df.columns:
    word_acc = calculate_word_accuracy(val_df, 'correct_cleaned_baseline', 'baseline_predicted')
    print(f"Word-Level Accuracy: {word_acc:.2f}%")

Word-Level Accuracy: 68.60%
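
The positional comparison in `calculate_word_accuracy` is sensitive to length mismatches. The sketch below applies the same logic to a single sentence pair; the helper name `positional_word_accuracy` and the shortened sentences (modelled on validation row 2) are illustrative assumptions, not data taken from the evaluation.

```python
def positional_word_accuracy(true_sentence: str, pred_sentence: str) -> float:
    """Percentage of reference words matched at the same index in the prediction."""
    true_words = true_sentence.split()
    pred_words = pred_sentence.split()
    # zip stops at the shorter sentence, like range(min(len(...), len(...)))
    matches = sum(t == p for t, p in zip(true_words, pred_words))
    return 100 * matches / len(true_words) if true_words else 0.0

reference = "provera 40 contains medroxyprogesterone acetate"
prediction = "provera-40 contains medroxyprogesterone acetate"

# One merged token shifts every later word by one position.
print(positional_word_accuracy(reference, prediction))  # 0.0
```

Three of the four predicted words are correct, yet the score is zero because none of them sits at the index the reference expects. An alignment-based measure would avoid this, at the cost of extra complexity.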


#### **3.2 Noun-Specific Accuracy**

This is a crucial metric for this assignment. For each sentence, we extract the nouns and proper nouns from the predicted sentence with spaCy and measure what fraction of the required `target_nouns` appear among them.


In [6]:
def calculate_noun_accuracy(df, true_nouns_col, pred_sent_col):
    total_true_nouns = 0
    correctly_predicted_nouns = 0
    
    for index, row in df.iterrows():
        true_nouns = set(row[true_nouns_col])
        
        # Extract nouns from the predicted sentence
        pred_doc = nlp(str(row[pred_sent_col]))
        pred_nouns = set([token.text for token in pred_doc if token.pos_ in ('NOUN', 'PROPN')])
        
        # Find the intersection of the two sets
        correctly_predicted_nouns += len(true_nouns.intersection(pred_nouns))
        total_true_nouns += len(true_nouns)
        
    return (correctly_predicted_nouns / total_true_nouns) * 100 if total_true_nouns > 0 else 0

if 'baseline_predicted' in val_df.columns:
    noun_acc = calculate_noun_accuracy(val_df, 'target_nouns', 'baseline_predicted')
    print(f"Noun-Specific Accuracy: {noun_acc:.2f}%")

Noun-Specific Accuracy: 73.89%


#### **3.3 BLEU Score**

The BLEU score measures n-gram overlap between the predicted sentence and the reference sentence. We compute it per sentence with smoothing and report the average.

In [9]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction 
if 'baseline_predicted' in val_df.columns:
    bleu_scores = []
    # Instantiate the smoothing function once before the loop
    chencherry = SmoothingFunction()
    
    for index, row in val_df.iterrows():
        reference = [str(row['correct_cleaned_baseline']).split()]
        candidate = str(row['baseline_predicted']).split()
        
        # The candidate sentence must not be empty
        if not candidate:
            bleu_scores.append(0)
            continue
            
        # Use the instantiated smoothing function
        # We use method1 as a simple, standard smoothing technique
        score = sentence_bleu(reference, candidate, smoothing_function=chencherry.method1)
        bleu_scores.append(score)
    
    avg_bleu = np.mean(bleu_scores) * 100 # Often expressed as a percentage
    print(f"Average BLEU Score: {avg_bleu:.2f}")
else:
    print("Column 'baseline_predicted' not found. Skipping BLEU score calculation.")

Average BLEU Score: 79.08


## **4. Analysis and Conclusion**

This section summarizes the findings and analyzes the baseline model's performance.

**Performance Summary:**

| Metric | Score |
| :--- | :--- |
| Word-Level Accuracy | 68.60% |
| Noun-Specific Accuracy | 73.89% |
| Average BLEU Score | 79.08 |

**Analysis:**

The baseline model provides a solid starting point. It is effective at correcting **simple, character-level misspellings** for words that are present in our training vocabulary (e.g., `blod` -> `blood`).

However, it has several significant weaknesses that were predicted during our EDA:

*   **Segmentation Errors:** The model is completely unable to handle segmentation errors. It cannot merge `"health care"` into `"healthcare"` because it processes words one by one.
*   **Out-of-Vocabulary (OOV) Words:** If a misspelled word's correct form was not in our training data, the model cannot correct it.
*   **Context-Blindness:** The model has no understanding of context. It can make a substitution that is close in edit distance but medically incorrect: in the sample sentence in Section 2.2, it replaced `lisinopril` with `fosinopril`, a different drug name.
*   **Computational Cost:** Every out-of-vocabulary word that is not yet cached triggers an edit-distance computation against every vocabulary word of similar length (out of 9,188 words), which is computationally expensive and slow.
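
One possible way to reduce the cost noted in the last bullet is to group the vocabulary by word length once, so that each lookup only visits words that could be within the distance threshold (the edit distance between two words is never smaller than their length difference). This is a sketch under that assumption; `build_length_index` and `length_compatible` are hypothetical helpers that do not exist in the notebook, and the toy vocabulary is made up for illustration.

```python
from collections import defaultdict

def build_length_index(vocab):
    """Group vocabulary words by their length (built once, reused per lookup)."""
    index = defaultdict(list)
    for v_word in vocab:
        index[len(v_word)].append(v_word)
    return index

def length_compatible(word, index, max_distance=2):
    """Return only the vocabulary words whose length is within max_distance."""
    candidates = []
    for length in range(len(word) - max_distance, len(word) + max_distance + 1):
        candidates.extend(index.get(length, []))
    return candidates

toy_vocab = {"blood", "pressure", "tablet", "of", "high"}
index = build_length_index(toy_vocab)
print(sorted(length_compatible("blod", index)))  # ['blood', 'high', 'of', 'tablet']
```

The edit distance would still be computed for each returned candidate, so this only avoids scanning the full vocabulary; it does not change which correction is chosen.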

**Conclusion:** This baseline fulfills the assignment requirements and clearly demonstrates the limitations of traditional, non-contextual spell-correction methods. It sets a clear performance benchmark that we will aim to surpass with our advanced transformer-based models in the next notebook.

## **5. Discussion: Other Baseline Approaches (N-gram Model)**

The assignment also lists N-gram language models as a traditional spell-correction approach. It's important to discuss how this method works and how it differs from our implemented baseline.

**Concept:**

An N-gram language model uses statistical properties of the text to determine the probability of a sequence of words. For spell correction, it leverages this by choosing the correction candidate that forms the most probable sequence with its surrounding words (its context).

**Example Workflow:**

1.  **Input Sentence:** "take one tablet of **licinopril** for high **blod** presure"
2.  **Candidate Generation:** For a misspelled word like `"blod"`, candidates are generated (e.g., "blood", "blond", "bold").
3.  **Contextual Scoring:** The N-gram model evaluates each candidate based on its context. Taking `presure` as already corrected to `pressure`, it would calculate:
    *   `Probability("high blood pressure")`
    *   `Probability("high blond pressure")`
4.  **Selection:** Since "high blood pressure" is a very common trigram in medical and general text, it would receive a much higher probability score. The model would therefore select "blood" as the correct replacement.
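
The workflow above can be sketched with a toy bigram model. Everything here is an illustrative assumption: the four-sentence corpus is made up, add-one (Laplace) smoothing is just one simple choice, and a bigram model stands in for the trigram mentioned in the example. A real model would be trained on the full training set.

```python
from collections import Counter

# Tiny made-up corpus standing in for the training sentences
corpus = [
    "patient has high blood pressure",
    "high blood pressure is common",
    "check blood pressure daily",
    "she has blond hair",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocab_size = len(unigrams)

def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing, so unseen pairs are not zero."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def context_score(left, candidate, right):
    """Score a candidate by how well it fits between its two neighbours."""
    return bigram_prob(left, candidate) * bigram_prob(candidate, right)

candidates = ["blood", "blond", "bold"]
best = max(candidates, key=lambda c: context_score("high", c, "pressure"))
print(best)  # blood
```

All three candidates are within edit distance 2 of `blod`, so the edit-distance baseline has little to separate them; the context score is what singles out `blood`.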

**Advantages over our current baseline:**

*   **Context-Aware:** This is its primary advantage. It can resolve ambiguities where multiple words have a similar edit distance. For example, it could distinguish between correcting "flor" to "floor" or "flour" based on the preceding words ("baking ___" vs. "on the ___").

**Limitations:**

*   **Limited Context Window:** An N-gram model's context is limited to the `n-1` preceding words. It cannot capture long-range dependencies in a sentence.
*   **Data Sparsity:** Many valid word sequences (especially for trigrams or higher) may not appear in the training data, leading to zero probabilities.
*   **Inability to Handle Segmentation:** Like our current baseline, it still processes word-by-word and cannot effectively handle segmentation errors like merging `"health care"` into `"healthcare"`.
