
# NLP Assignment 3 - Misspellings in the context of natural language processing within the medical field

## Mount Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Set Directory

In [5]:
import os
csv_path = '/content/drive/MyDrive/NLP'
os.chdir(csv_path)

## Load and Inspect Dataset

In [6]:
import pandas as pd

df = pd.read_csv("mtsamples.csv")  # load the CSV file

print("shape of dataset:", df.shape)  # report (rows, columns)
df.head()  # preview the first five rows

shape of dataset: (4999, 6)


Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."
3,3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple..."
4,4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo..."


## Select and Sample 10 Transcriptions

In [7]:
df = df[['transcription']].dropna()  # keep only the 'transcription' column and remove rows with missing values
df = df.sample(10, random_state=42).reset_index(drop=True)  # sample 10 transcriptions for this experiment
df.head()  # show preview

Unnamed: 0,transcription
0,"HISTORY OF PRESENT ILLNESS:, The patient is w..."
1,"PREPROCEDURE DIAGNOSIS:, Chest pain secondary..."
2,"HISTORY OF PRESENT ILLNESS: , The patient is a..."
3,"PREOPERATIVE DIAGNOSIS: , End-stage renal dise..."
4,"PREOPERATIVE DIAGNOSIS: , Persistent pneumonia..."


## Clean the Transcriptions

In [8]:
import re

def clean_text(text):
    """Lowercase a transcription and strip everything except letters and spaces."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # keep only letters and whitespace
    text = re.sub(r'\s+', ' ', text)  # normalise whitespace
    return text.strip()

df['clean_transcription'] = df['transcription'].apply(clean_text)  # apply the cleaning function to each transcription

df[['transcription', 'clean_transcription']].head()  # show preview

Unnamed: 0,transcription,clean_transcription
0,"HISTORY OF PRESENT ILLNESS:, The patient is w...",history of present illness the patient is well...
1,"PREPROCEDURE DIAGNOSIS:, Chest pain secondary...",preprocedure diagnosis chest pain secondary to...
2,"HISTORY OF PRESENT ILLNESS: , The patient is a...",history of present illness the patient is a ye...
3,"PREOPERATIVE DIAGNOSIS: , End-stage renal dise...",preoperative diagnosis endstage renal disease ...
4,"PREOPERATIVE DIAGNOSIS: , Persistent pneumonia...",preoperative diagnosis persistent pneumonia ri...
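The preview shows `End-stage` becoming `endstage`: because non-letter characters are deleted rather than replaced, words separated only by punctuation get fused into a single token. A minimal sketch of an alternative, assuming that splitting such tokens is preferable for later dictionary lookup; `clean_text_spaced` is a hypothetical name and is not used elsewhere in this notebook:

```python
import re

def clean_text_spaced(text):
    """Variant of clean_text that replaces non-letters with a space (hypothetical helper)."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)  # substitute a space instead of deleting
    text = re.sub(r'\s+', ' ', text)       # collapse runs of whitespace
    return text.strip()

print(clean_text_spaced("End-stage renal disease.He has DVT,"))
# prints: end stage renal disease he has dvt
```

Whether hyphenated terms should be split or joined is a design choice; splitting keeps each part closer to a general-English dictionary entry.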


## Inject Synthetic Misspellings

In [16]:
import random

def introduce_typos(text, typo_prob=0.1):
    """Inject simple typos by swapping two adjacent characters in randomly chosen words."""
    words = text.split()
    new_words = []
    for word in words:
        if len(word) > 3 and random.random() < typo_prob:
            i = random.randint(0, len(word) - 2)  # left index of the pair to swap
            word = word[:i] + word[i+1] + word[i] + word[i+2:]
        new_words.append(word)
    return ' '.join(new_words)

df['misspelled_transcription'] = df['clean_transcription'].apply(lambda x: introduce_typos(x, typo_prob=0.3))  # apply to clean transcriptions
df[['clean_transcription','misspelled_transcription']].head(10)  # not rendered: only a cell's last expression is displayed

for i in range(5):  # print a few samples
    print(f"\n original: {df['clean_transcription'][i]}")
    print(f"misspelled: {df['misspelled_transcription'][i]}")


 original: history of present illness the patient is well known to me for a history of irondeficiency anemia due to chronic blood loss from colitis we corrected her hematocrit last year with intravenous iv iron ultimately she had a total proctocolectomy done on to treat her colitis her course has been very complicated since then with needing multiple surgeries for removal of hematoma this is partly because she was on anticoagulation for a right arm deep venous thrombosis dvt she had early this year complicated by septic phlebitischart was reviewed and i will not reiterate her complex historyi am asked to see the patient again because of concerns for coagulopathyshe had surgery again last month to evacuate a pelvic hematoma and was found to have vancomycin resistant enterococcus for which she is on multiple antibiotics and followed by infectious disease nowshe is on total parenteral nutrition tpn as welllaboratory data labs today showed a white blood count of hemoglobin hematocrit and 
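The typo injection above draws from Python's global random state without a seed, so a rerun will produce different misspellings and a different BLEU score. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for the experiment; `introduce_typos_seeded` and its `seed` parameter are hypothetical additions, not part of the original cell:

```python
import random

def introduce_typos_seeded(text, typo_prob=0.1, seed=42):
    """Adjacent-character swaps drawn from a private, seeded generator (hypothetical helper)."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    new_words = []
    for word in text.split():
        if len(word) > 3 and rng.random() < typo_prob:
            i = rng.randint(0, len(word) - 2)  # left index of the swapped pair
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        new_words.append(word)
    return ' '.join(new_words)

sample = "history of present illness the patient is well known"
first = introduce_typos_seeded(sample, typo_prob=0.3)
second = introduce_typos_seeded(sample, typo_prob=0.3)
print(first == second)  # identical seed, identical typos
```

Note that swapping two identical neighbours (the `ll` in `illness`, for example) leaves the word unchanged, so the share of words that actually end up misspelled is somewhat below `typo_prob`.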

## Spell Correction with SymSpell

In [17]:

!pip install symspellpy #install SymSpell package

Collecting symspellpy
  Downloading symspellpy-6.9.0-py3-none-any.whl.metadata (3.9 kB)
Collecting editdistpy>=0.1.3 (from symspellpy)
  Downloading editdistpy-0.1.5-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.9 kB)
Downloading symspellpy-6.9.0-py3-none-any.whl (2.6 MB)
Downloading editdistpy-0.1.5-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (144 kB)
Installing collected packages: editdistpy, symspellpy
Successfully installed editdistpy-0.1.5 symspellpy-6.9.0


In [18]:
# Set up SymSpell and load the dictionary
from symspellpy.symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)  # initialise SymSpell with a maximum edit distance of 2

!wget -q https://raw.githubusercontent.com/wolfgarbe/SymSpell/master/SymSpell/frequency_dictionary_en_82_765.txt  # download the English frequency dictionary

dictionary_path = "frequency_dictionary_en_82_765.txt"
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)  # load the dictionary into SymSpell

True
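The typos injected earlier are adjacent-character swaps. Under a Damerau-Levenshtein style distance, which counts a transposition as a single edit, each swap costs 1 and so falls within `max_dictionary_edit_distance=2`. The sketch below is a plain-Python optimal-string-alignment distance written only to illustrate that point; it is not symspellpy's implementation, `osa_distance` is a hypothetical name, and whether symspellpy uses exactly this variant by default should be treated as an assumption.

```python
def osa_distance(a, b):
    """Optimal-string-alignment distance: insert, delete, substitute, adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "ai" <-> "ia"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

for typo, word in [("paitent", "patient"), ("lilness", "illness"), ("diganosis", "diagnosis")]:
    print(typo, word, osa_distance(typo, word))  # each pair is one swap apart
```

A plain Levenshtein distance would score the same swaps as 2, which would sit at the very edge of the configured limit.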

## Correct Misspelled Text Using SymSpell

In [20]:
def correct_text(text):
    """Correct text word by word using SymSpell."""
    corrected = []
    for word in text.split():
        suggestions = sym_spell.lookup(word, Verbosity.CLOSEST, max_edit_distance=2)
        corrected.append(suggestions[0].term if suggestions else word)  # keep the word if nothing is suggested
    return ' '.join(corrected)

df['symspell_corrected'] = df['misspelled_transcription'].apply(correct_text)  # apply spell correction
df[['misspelled_transcription','symspell_corrected']].head()  # show before and after correction

Unnamed: 0,misspelled_transcription,symspell_corrected
0,history of persent lilness the paitent is well...,history of present illness ﻿the patient is wel...
1,preprocedure diagnoiss chest pain secondary to...,preprocedure diagnosis chest pain secondary to...
2,history of present ilnless the patient is a ey...,history of present illness ﻿the patient is a h...
3,preoperative diganosis endstage renal disaese ...,preoperative diagnosis onstage renal disease w...
4,preoperative diagnosis persistent pneumonia ri...,preoperative diagnosis persistent pneumonia ri...
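Rows 0 and 2 of the corrected text contain an invisible character immediately before `the`, which looks like a UTF-8 byte-order mark (U+FEFF). One plausible cause, stated here as an assumption rather than a confirmed diagnosis, is that the downloaded frequency dictionary starts with a BOM, so its first entry is stored as `\ufeffthe` and is returned whenever `the` is looked up. The sketch below reproduces the effect with a small stand-in file and the standard library only; `first_term` and the file contents are made up for illustration.

```python
import os
import tempfile

# Write a tiny stand-in dictionary whose first bytes are a UTF-8 BOM (illustrative data).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as fh:
    fh.write(b"\xef\xbb\xbfthe 100\nof 50\n")
    demo_path = fh.name

def first_term(path, encoding):
    """Return the first whitespace-separated token on the first line of the file."""
    with open(path, encoding=encoding) as handle:
        return handle.readline().split()[0]

plain = first_term(demo_path, "utf-8")         # BOM stays attached to the first term
stripped = first_term(demo_path, "utf-8-sig")  # BOM is consumed by the codec
os.remove(demo_path)

print(repr(plain))     # '\ufeffthe'
print(repr(stripped))  # 'the'
```

If this is indeed the cause, loading the dictionary with a BOM-aware encoding (symspellpy's `load_dictionary` is believed to accept an `encoding` argument, but check the documentation for the installed version) or stripping `\ufeff` from the corrected tokens should remove the mismatch on every occurrence of `the`.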


## Evaluate Accuracy Using BLEU Score

In [21]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smoothie = SmoothingFunction().method1  # smoothing avoids a zero BLEU score when some n-gram order has no matches

def compute_bleu(reference, hypothesis):
    """Sentence-level BLEU of the hypothesis against a single reference, on whitespace tokens."""
    return sentence_bleu([reference.split()], hypothesis.split(), smoothing_function=smoothie)

df['bleu_score'] = df.apply(lambda row: compute_bleu(row['clean_transcription'], row['symspell_corrected']), axis=1)  # apply to all rows

print("Average BLEU Score:", df['bleu_score'].mean())  # show results

df[['clean_transcription', 'symspell_corrected', 'bleu_score']].head()  # show comparison

Average BLEU Score: 0.6498668945583055


Unnamed: 0,clean_transcription,symspell_corrected,bleu_score
0,history of present illness the patient is well...,history of present illness ﻿the patient is wel...,0.776852
1,preprocedure diagnosis chest pain secondary to...,preprocedure diagnosis chest pain secondary to...,0.561517
2,history of present illness the patient is a ye...,history of present illness ﻿the patient is a h...,0.671791
3,preoperative diagnosis endstage renal disease ...,preoperative diagnosis onstage renal disease w...,0.574541
4,preoperative diagnosis persistent pneumonia ri...,preoperative diagnosis persistent pneumonia ri...,0.646287
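BLEU rewards overlapping n-grams, so a single wrong word lowers the score for every n-gram that contains it. Because the correction step works word by word and should therefore keep the token count unchanged, a position-wise token accuracy is a simpler complementary measure. A minimal sketch under that assumption; `token_accuracy` is a hypothetical helper, and the two strings are short excerpts based on row 3 above:

```python
def token_accuracy(reference, hypothesis):
    """Fraction of positions where the hypothesis token equals the reference token."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    if not ref_tokens or len(ref_tokens) != len(hyp_tokens):
        raise ValueError("token sequences must be non-empty and of equal length")
    matches = sum(r == h for r, h in zip(ref_tokens, hyp_tokens))
    return matches / len(ref_tokens)

reference = "preoperative diagnosis endstage renal disease"
corrected = "preoperative diagnosis onstage renal disease"
print(token_accuracy(reference, corrected))  # 4 of 5 tokens match -> 0.8
```

Computing the same measure for `misspelled_transcription` against `clean_transcription` would give a baseline, showing how much of the final score is due to the correction step and how much was never corrupted in the first place.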


## Save Results to CSV

In [22]:
df.to_csv("corrected_transcription_output.csv", index=False)

from google.colab import files
files.download("corrected_transcription_output.csv")
