# Project: Grammar Error Correction

This project aims to develop an automatic grammar correction system using Natural Language Processing (NLP) techniques.


### Train t5-base language model

This section trains a text-to-text model using the T5 architecture as implemented in the HappyTransformer library.
The model is trained on a dataset of sentence pairs: an input sentence containing grammatical errors and its corrected version.

The processed CSV file is produced once the EDA process is complete (in the previous Python file). Only the non-duplicate, error-free rows are kept, and they are split into training and evaluation datasets.
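
As a rough illustration of this kind of row filtering, the sketch below drops pairs with a missing side and exact duplicate pairs. The helper name `filter_pairs` and the sample rows are hypothetical, and "error-free" is assumed here to mean "no missing values"; the actual EDA lives in the previous file and may differ.

```python
def filter_pairs(pairs):
    """Keep only complete, non-duplicate (input, target) pairs, preserving order."""
    seen = set()
    kept = []
    for source, target in pairs:
        # Skip rows where either side is missing or blank.
        if not source or not target or not source.strip() or not target.strip():
            continue
        key = (source.strip(), target.strip())
        # Skip exact duplicates of a pair that was already kept.
        if key in seen:
            continue
        seen.add(key)
        kept.append(key)
    return kept


# Hypothetical sample rows, for illustration only.
rows = [
    ("I goes to the store everyday.", "I go to the store everyday."),
    ("I goes to the store everyday.", "I go to the store everyday."),  # duplicate
    ("She have completed her homework.", ""),                          # missing target
    ("He don't know the answer.", "He doesn't know the answer."),
]
clean_rows = filter_pairs(rows)
print(len(clean_rows))
```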


The model's performance is calculated using BLEU, ROUGE, and F0.5 scores.

In [None]:
!pip install happytransformer
!pip install rouge_score

# T4 GPU

from IPython.display import clear_output
clear_output()

In [None]:
import csv
import pandas as pd
from sklearn.model_selection import train_test_split
from happytransformer import HappyTextToText, TTSettings, TTTrainArgs
import nltk
from rouge_score import rouge_scorer
from sklearn.metrics import fbeta_score
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
happy_tt = HappyTextToText("T5", "t5-base")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

In [None]:
# Load your grammar correction data from CSV
data_df = pd.read_csv("processed_grammar_correction.csv", sep = ",", quotechar='"', escapechar='\\')


input_sentences = data_df.iloc[:, 0].tolist()
correct_sentences = data_df.iloc[:, 1].tolist()


print("Input sentences:", input_sentences[:5])
print("Correct sentences:", correct_sentences[:5])




Input sentences: ['I goes to the store everyday.', 'They was playing soccer last night.', 'She have completed her homework.', "He don't know the answer.", 'The sun rise in the east.']
Correct sentences: ['I go to the store everyday.', 'They were playing soccer last night.', 'She has completed her homework.', "He doesn't know the answer.", 'The sun rises in the east.']


In [None]:

# Split into training and evaluation datasets
input_train, input_eval, target_train, target_eval = train_test_split(
    input_sentences, correct_sentences, test_size=0.2, random_state=42)
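
Evaluation scores are only meaningful if the evaluation sentences stay out of the training data, so a quick overlap check can be worthwhile after splitting. The sketch below is illustrative only: `count_overlap` is a hypothetical helper, and the two lists are toy stand-ins for `input_train` and `input_eval`.

```python
def count_overlap(train_inputs, eval_inputs):
    """Return how many evaluation inputs also appear in the training inputs."""
    train_set = set(train_inputs)
    return sum(1 for sentence in eval_inputs if sentence in train_set)


# Toy stand-ins for input_train and input_eval (hypothetical data).
toy_train = ["I goes to the store everyday.", "They was playing soccer last night."]
toy_eval = ["She have completed her homework.", "They was playing soccer last night."]

overlap = count_overlap(toy_train, toy_eval)
print(f"{overlap} of {len(toy_eval)} evaluation inputs also appear in training")
```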


In [None]:
ttt_args = TTTrainArgs(num_train_epochs=3, batch_size=8, max_input_length=512, max_output_length=512)

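# Save training data to a CSV file.
# NOTE: this writes every row (input_sentences / correct_sentences), not only the
# input_train / target_train split, so the evaluation rows are also seen in training.
# Use input_train and target_train here to keep the evaluation set held out.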
training_df = pd.DataFrame({
    "input": input_sentences,
    "target": correct_sentences
})
training_file = "training_data.csv"
training_df.to_csv(training_file, index=False)

In [None]:

# Train the model on your data
happy_tt.train(training_file, args = ttt_args)


Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/1559 [00:00<?, ? examples/s]

Map:   0%|          | 0/174 [00:00<?, ? examples/s]



Step,Training Loss,Validation Loss
1,1.8639,1.516025
59,0.6608,0.414013
118,0.4545,0.32574
177,0.3507,0.283249
236,0.2884,0.268133
295,0.2248,0.257789
354,0.2283,0.25075
413,0.2175,0.229434
472,0.1808,0.228338
531,0.1837,0.224968


In [None]:
# Save evaluation data to a CSV file
evaluation_df = pd.DataFrame({
    "input": input_eval,
    "target": target_eval
})
evaluation_file = "evaluation_data.csv"
evaluation_df.to_csv(evaluation_file, index=False)

In [None]:
# Let's generate predictions

eval_data_df = pd.read_csv(evaluation_file)

predictions = []
for input_text in eval_data_df['input']:
    result = happy_tt.generate_text(input_text)
    predictions.append(result.text)

In [None]:
# Let's also add predictions to the evaluation DataFrame

eval_data_df['predicted_text'] = predictions

In [None]:
# Let's check if all the 3 columns are present
eval_data_df.head()

Unnamed: 0,input,target,predicted_text
0,Can you help me this problem,Can you help me with this problem?,Can you help me with this problem? I have a pr...
1,The restaurant offers a wide varity of dishes.,The restaurant offers a wide variety of dishes.,The restaurant offers a wide variety of dishes.
2,The small little puppy was playing in the yard.,The small puppy was playing in the yard.,The small puppy was playing in the yard.
3,Download the file and open it,Downloaded the file and opened it,Download the file and open it. Download the file
4,They don’t like ice cream.,They don’t like ice cream.,They don’t like ice cream.


In [None]:
# Let's first compute BLEU score
references = [nltk.word_tokenize(text_1) for text_1 in eval_data_df['target']]
hypotheses = [nltk.word_tokenize(text_2) for text_2 in eval_data_df['predicted_text']]
bleu_score = nltk.translate.bleu_score.corpus_bleu([[ref] for ref in references], hypotheses)

print("Bleu score:", bleu_score)

Bleu score: 0.7344866964617515
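
BLEU is built from clipped n-gram precisions combined with a brevity penalty. To give a feel for the precision part only, the sketch below computes clipped n-gram precision for a single sentence pair with whitespace tokenization. `clipped_precision` is a hypothetical helper, not part of NLTK, and it leaves out the brevity penalty and the averaging over n-gram orders that `corpus_bleu` performs.

```python
from collections import Counter


def clipped_precision(reference_tokens, hypothesis_tokens, n=1):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_counts = ngrams(reference_tokens)
    hyp_counts = ngrams(hypothesis_tokens)
    if not hyp_counts:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


reference = "Can you help me with this problem ?".split()
hypothesis = "Can you help me this problem ?".split()

p1 = clipped_precision(reference, hypothesis, n=1)
p2 = clipped_precision(reference, hypothesis, n=2)
print(p1, p2)
```

In this toy pair every hypothesis word occurs in the reference, so the unigram precision is perfect, while the missing "with" breaks one of the six hypothesis bigrams.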


In [None]:
# Let's now compute ROUGE scores
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
rouge_scores = {
    'rouge1': [],
    'rouge2': [],
    'rougeL': []
}


for reference, hypothesis in zip(eval_data_df['target'], eval_data_df['predicted_text']):
    scores = scorer.score(reference, hypothesis)
    for key in rouge_scores:
        rouge_scores[key].append(scores[key].fmeasure)


# Let's calculate average ROUGE scores
avg_rouge_scores = {key: sum(values) / len(values) for key, values in rouge_scores.items()}
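
For intuition about what the `rouge1` F-measure captures, here is a simplified unigram-overlap version. It is a sketch under stated assumptions: `rouge1_f` is a hypothetical helper that lowercases and splits on whitespace, whereas `rouge_scorer` applies its own tokenization and (with `use_stemmer=True`) stemming, so the values will not match exactly.

```python
from collections import Counter


def rouge1_f(reference, hypothesis):
    """Unigram-overlap F-measure between two sentences (simplified ROUGE-1)."""
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & hyp_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f(
    "The small puppy was playing in the yard.",
    "The small little puppy was playing in the yard.",
)
print(round(score, 4))
```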


In [None]:
# Let's also compute F0.5 score
def compute_fbeta_score(y_true, y_pred, beta=0.5):
    """Corpus-level F-beta over distinct tokens (vocabulary overlap), not edit-based."""
    # Step 1: Tokenize each sentence
    y_true_tokens = [nltk.word_tokenize(sent) for sent in y_true]
    y_pred_tokens = [nltk.word_tokenize(sent) for sent in y_pred]

    # Step 2: Flatten into the set of distinct tokens on each side
    y_true_vocab = {item for sublist in y_true_tokens for item in sublist}
    y_pred_vocab = {item for sublist in y_pred_tokens for item in sublist}

    # Step 3: For every token in the combined vocabulary, mark whether it occurs on each side
    all_tokens = sorted(y_true_vocab | y_pred_vocab)
    y_true_binary = [1 if token in y_true_vocab else 0 for token in all_tokens]
    y_pred_binary = [1 if token in y_pred_vocab else 0 for token in all_tokens]

    return fbeta_score(y_true_binary, y_pred_binary, beta=beta)

f_05Score = compute_fbeta_score(eval_data_df['target'].tolist(), eval_data_df['predicted_text'].tolist())
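
The quantity computed by `compute_fbeta_score` can also be written without scikit-learn: precision is the share of distinct predicted tokens that also occur in the targets, and recall is the share of distinct target tokens that occur in the predictions. The sketch below makes that explicit with sets. `vocab_fbeta` is a hypothetical name, and whitespace splitting stands in for `nltk.word_tokenize`, so results on real data may differ slightly.

```python
def vocab_fbeta(targets, predictions, beta=0.5):
    """F-beta over the sets of distinct tokens in targets and predictions."""
    true_vocab = {tok for sent in targets for tok in sent.split()}
    pred_vocab = {tok for sent in predictions for tok in sent.split()}
    shared = len(true_vocab & pred_vocab)
    if shared == 0:
        return 0.0
    precision = shared / len(pred_vocab)
    recall = shared / len(true_vocab)
    beta_sq = beta ** 2
    # With beta < 1, precision is weighted more heavily than recall.
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


targets = ["They were playing soccer", "She has completed her homework"]
predictions = ["They was playing soccer", "She has completed her homework"]
score = vocab_fbeta(targets, predictions)
print(round(score, 4))
```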


In [None]:
# Let's print BLEU, ROUGE and f0.5 scores
print(f"BLEU score: {bleu_score:.4f}")
for key, value in avg_rouge_scores.items():
    print(f"{key} score: {value:.4f}")
print(f"F0.5 score: {f_05Score:.4f}")

BLEU score: 0.7345
rouge1 score: 0.9063
rouge2 score: 0.8501
rougeL score: 0.9002
F0.5 score: 0.9051


In [None]:
# Save the model to Google Drive (for use in the Streamlit application).
# Assumes Google Drive is already mounted at /content/drive.

import os
model_dir = "/content/drive/My Drive/trained_t5_model"
os.makedirs(model_dir, exist_ok=True)
happy_tt.save(model_dir)

print(f"Model saved to {model_dir}")

Model saved to /content/drive/My Drive/trained_t5_model


Below are real-world examples showing the corrections produced by the model.

In [None]:
beam_settings =  TTSettings(num_beams=5, min_length=1, max_length=20)

In [None]:
example_1 = "This sentences, has bads grammar and spelling!"
actual_result_1 = "This sentence has bad grammar and spelling!"

result_1 = happy_tt.generate_text(example_1, args=beam_settings)
print("Model's output:", result_1.text)

print("Actual output:", actual_result_1)

Model's output: This sentences, have bad grammar and spelling!
Actual output: This sentence has bad grammar and spelling!


In the above example, the model only partially corrected the sentence: it changed "has bads" to "have bad" but left "This sentences," as is, so its output does not match the manual correction.
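
To see exactly which words still differ, the two strings can be compared token by token. The following is a minimal sketch using the standard-library `difflib` module; `token_changes` is a hypothetical helper, and the two strings are copied from the printed output above.

```python
import difflib


def token_changes(source, target):
    """List (operation, source_tokens, target_tokens) for every differing span."""
    src, tgt = source.split(), target.split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    return [
        (tag, src[i1:i2], tgt[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


model_output = "This sentences, have bad grammar and spelling!"
manual_correction = "This sentence has bad grammar and spelling!"

for change in token_changes(model_output, manual_correction):
    print(change)
```

An empty list means the model output and the manual correction are identical at the token level.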

In [None]:
result_2 = happy_tt.generate_text("The small little puppy was playing in the yard.", args=beam_settings)
print("Model's output:", result_2.text)

print("Actual output:", "The small puppy was playing in the yard.")

Model's output: The small puppy was playing in the yard.
Actual output: The small puppy was playing in the yard.


In the above example, the model's output matches the manual correction.

In [None]:
result_3 = happy_tt.generate_text("Computers don work anymore", args=beam_settings)
print("Model's output:", result_3.text)

print("Actual output:", "Computers don't work anymore")

Model's output: Computers don't work anymore
Actual output: Computers don't work anymore


In the above example, the model's output matches the manual correction.