# Project: Grammar Error Correction

This project aims to develop a quick, automatic grammar correction system using Natural Language Processing (NLP) techniques.


#### Train bart-base language model.

This notebook trains a text-to-text model based on the bart-base architecture, using the HappyTransformer package.
The training data consists of sentence pairs: an input sentence containing grammatical errors and its corrected version.

The processed CSV file is produced by the exploratory data analysis (EDA) steps in the preceding Python file. Only the unique, error-free rows are selected and divided into separate training and evaluation datasets.
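
The selection and split described above can be sketched with the standard library alone. This is an illustrative sketch, not the code from the preceding file: `clean_pairs` and `split_pairs` are hypothetical helpers, and "error-free" is assumed here to mean that both fields of a row are non-empty strings.

```python
import random

def clean_pairs(rows):
    """Drop incomplete rows and exact duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for source, target in rows:
        # Assumption: a row is usable when both fields are non-empty strings.
        if not isinstance(source, str) or not isinstance(target, str):
            continue
        source, target = source.strip(), target.strip()
        if not source or not target:
            continue
        pair = (source, target)
        if pair not in seen:
            seen.add(pair)
            cleaned.append(pair)
    return cleaned

def split_pairs(pairs, eval_fraction=0.2, seed=42):
    """Shuffle deterministically, then hold out eval_fraction for evaluation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

# Toy rows for illustration only.
rows = [
    ("I goes to the store.", "I go to the store."),
    ("I goes to the store.", "I go to the store."),  # exact duplicate
    ("She have a cat.", ""),                          # missing target
    ("He don't know.", "He doesn't know."),
    (None, "They were late."),                        # missing source
    ("The sun rise.", "The sun rises."),
]
pairs = clean_pairs(rows)
train_pairs, eval_pairs = split_pairs(pairs)
print(len(pairs), len(train_pairs), len(eval_pairs))
```

With pandas, the same cleaning is usually written as `dropna()` followed by `drop_duplicates()` on the DataFrame.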


The model's performance is measured using BLEU, ROUGE, and F0.5 scores.
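
For reference, F0.5 is the F-beta score with beta set to 0.5, which weights precision more heavily than recall. The symbols $P$ (precision) and $R$ (recall) are introduced here for illustration:

```latex
F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R},
\qquad
F_{0.5} = \frac{1.25\, P R}{0.25\, P + R}
```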

The code below is very similar to that used for training the t5-small model.

In [6]:
!pip install happytransformer
!pip install rouge_score

# T4 GPU

from IPython.display import clear_output
clear_output()

In [7]:
import csv
import pandas as pd
from sklearn.model_selection import train_test_split
from happytransformer import HappyTextToText, TTSettings, TTTrainArgs
import nltk
from rouge_score import rouge_scorer
from sklearn.metrics import fbeta_score
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [8]:
happy_tt = HappyTextToText("BART", "facebook/bart-base")

In [9]:
# Load your grammar correction data from CSV
data_df = pd.read_csv("processed_grammar_correction.csv", sep = ",", quotechar='"', escapechar='\\')


input_sentences = data_df.iloc[:, 0].tolist()
correct_sentences = data_df.iloc[:, 1].tolist()


print("Input sentences:", input_sentences[:5])
print("Correct sentences:", correct_sentences[:5])

Input sentences: ['I goes to the store everyday.', 'They was playing soccer last night.', 'She have completed her homework.', "He don't know the answer.", 'The sun rise in the east.']
Correct sentences: ['I go to the store everyday.', 'They were playing soccer last night.', 'She has completed her homework.', "He doesn't know the answer.", 'The sun rises in the east.']


In [10]:

# Split into training and evaluation datasets
input_train, input_eval, target_train, target_eval = train_test_split(
    input_sentences, correct_sentences, test_size=0.2, random_state=42)


In [11]:
ttt_args = TTTrainArgs(num_train_epochs=3, batch_size=8, max_input_length=512, max_output_length=512)

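# Save training data to a CSV file.
# Note: this writes the full dataset rather than input_train/target_train, so the
# evaluation rows are also seen during training and the scores below are likely optimistic.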
training_df = pd.DataFrame({
    "input": input_sentences,
    "target": correct_sentences
})
training_file = "training_data.csv"
training_df.to_csv(training_file, index=False)
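
Scores computed on rows the model has already seen during training tend to be optimistic, so it can be worth checking that the rows written to the training file do not also appear in the evaluation split. A minimal sketch of such a check, using a hypothetical helper `shared_pairs` and toy data:

```python
def shared_pairs(train_inputs, train_targets, eval_inputs, eval_targets):
    """Return the (input, target) pairs that appear in both splits."""
    train_set = set(zip(train_inputs, train_targets))
    eval_set = set(zip(eval_inputs, eval_targets))
    return train_set & eval_set

# Toy data for illustration only.
all_inputs = ["I goes home.", "She have a cat.", "He don't know."]
all_targets = ["I go home.", "She has a cat.", "He doesn't know."]
held_out_inputs, held_out_targets = all_inputs[2:], all_targets[2:]

# Writing the full dataset to the training file overlaps with the held-out rows.
leaky = shared_pairs(all_inputs, all_targets, held_out_inputs, held_out_targets)
# Writing only the training portion does not.
clean = shared_pairs(all_inputs[:2], all_targets[:2], held_out_inputs, held_out_targets)
print(len(leaky), len(clean))
```

In this notebook the arguments would be the two lists written to `training_data.csv`, followed by `input_eval` and `target_eval`.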

In [12]:

# Train the model on your data
happy_tt.train(training_file, args = ttt_args)


Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/1559 [00:00<?, ? examples/s]

Map:   0%|          | 0/174 [00:00<?, ? examples/s]



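| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 1 | 0.8103 | 1.227144 |
| 59 | 0.5862 | 0.389758 |
| 118 | 0.4313 | 0.315759 |
| 177 | 0.3406 | 0.238711 |
| 236 | 0.2379 | 0.240713 |
| 295 | 0.1864 | 0.239482 |
| 354 | 0.1587 | 0.230264 |
| 413 | 0.1751 | 0.232284 |
| 472 | 0.0997 | 0.228387 |
| 531 | 0.1022 | 0.224264 |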


In [13]:
# Save evaluation data to a CSV file
evaluation_df = pd.DataFrame({
    "input": input_eval,
    "target": target_eval
})
evaluation_file = "evaluation_data.csv"
evaluation_df.to_csv(evaluation_file, index=False)

In [14]:
# Let's generate predictions

eval_data_df = pd.read_csv(evaluation_file)

predictions = []
for input_text in eval_data_df['input']:
    result = happy_tt.generate_text(input_text)
    predictions.append(result.text)

In [15]:
# Let's also add predictions to the evaluation DataFrame

eval_data_df['predicted_text'] = predictions

eval_data_df.head()

|   | input | target | predicted_text |
|---|-------|--------|----------------|
| 0 | Can you help me this problem | Can you help me with this problem? | Can you help me with this problem? |
| 1 | The restaurant offers a wide varity of dishes. | The restaurant offers a wide variety of dishes. | The restaurant offers a wide variety of dishes. |
| 2 | The small little puppy was playing in the yard. | The small puppy was playing in the yard. | The small puppy was playing in the yard. |
| 3 | Download the file and open it | Downloaded the file and opened it | Downloaded the file and opened it. |
| 4 | They don’t like ice cream. | They don’t like ice cream. | They don’t like ice cream. |


In [16]:
# Let's first compute BLEU score
references = [nltk.word_tokenize(text_1) for text_1 in eval_data_df['target']]
hypotheses = [nltk.word_tokenize(text_2) for text_2 in eval_data_df['predicted_text']]
bleu_score = nltk.translate.bleu_score.corpus_bleu([[ref] for ref in references], hypotheses)

print("Bleu score:", bleu_score)

Bleu score: 0.8450379212712446


In [17]:
# Let's now compute ROUGE scores
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
rouge_scores = {
    'rouge1': [],
    'rouge2': [],
    'rougeL': []
}


for reference, hypothesis in zip(eval_data_df['target'], eval_data_df['predicted_text']):
    scores = scorer.score(reference, hypothesis)
    for key in rouge_scores:
        rouge_scores[key].append(scores[key].fmeasure)


# Let's calculate average ROUGE scores
avg_rouge_scores = {key: sum(values) / len(values) for key, values in rouge_scores.items()}


In [18]:
# Let's also compute F0.5 score
def compute_fbeta_score(y_true, y_pred, beta=0.5):
    # Step 1: Tokenize each sentence
    y_true_tokens = [nltk.word_tokenize(sent) for sent in y_true]
    y_pred_tokens = [nltk.word_tokenize(sent) for sent in y_pred]

    # Step 2: Flatten into the set of distinct tokens on each side
    y_true_vocab = {item for sublist in y_true_tokens for item in sublist}
    y_pred_vocab = {item for sublist in y_pred_tokens for item in sublist}

    # Step 3: For every distinct token, mark whether it occurs in the targets / predictions
    all_tokens = list(y_true_vocab | y_pred_vocab)
    y_true_binary = [1 if token in y_true_vocab else 0 for token in all_tokens]
    y_pred_binary = [1 if token in y_pred_vocab else 0 for token in all_tokens]

    # Note: this is an F-beta over the corpus vocabulary, not over individual edits
    return fbeta_score(y_true_binary, y_pred_binary, beta=beta)

f_05Score = compute_fbeta_score(eval_data_df['target'].tolist(), eval_data_df['predicted_text'].tolist())
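
The function above pools all tokens into one vocabulary per side, so token frequency and sentence alignment do not affect the result. As a point of comparison, here is a sketch of a stricter variant that counts overlapping tokens sentence by sentence. It is not the metric reported in this notebook, and it is still not the edit-based F0.5 used in grammar correction benchmarks. `f_beta_overlap` is a hypothetical name, and whitespace splitting stands in for `nltk.word_tokenize` to keep the sketch self-contained.

```python
from collections import Counter

def f_beta_overlap(references, predictions, beta=0.5):
    """Corpus-level F-beta from per-sentence token overlap (multiset counts)."""
    matched = predicted = expected = 0
    for ref, pred in zip(references, predictions):
        # Assumption: whitespace tokenization stands in for nltk.word_tokenize.
        ref_counts = Counter(ref.split())
        pred_counts = Counter(pred.split())
        # Counter intersection keeps the minimum count of each shared token.
        matched += sum((ref_counts & pred_counts).values())
        predicted += sum(pred_counts.values())
        expected += sum(ref_counts.values())
    if matched == 0:
        return 0.0
    precision = matched / predicted
    recall = matched / expected
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Toy sentences for illustration only.
refs = ["She has completed her homework.", "The sun rises in the east."]
preds = ["She has completed her homework.", "The sun rise in the east."]
score = f_beta_overlap(refs, preds)
print(f"F0.5: {score:.4f}")
```

Here 10 of the 11 tokens match on each side, so precision and recall are both 10/11 and the score is the same value.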


In [19]:
# Let's print BLEU, ROUGE and f0.5 scores
print(f"BLEU score: {bleu_score:.4f}")
for key, value in avg_rouge_scores.items():
    print(f"{key} score: {value:.4f}")
print(f"F0.5 score: {f_05Score:.4f}")

BLEU score: 0.8450
rouge1 score: 0.9361
rouge2 score: 0.9041
rougeL score: 0.9355
F0.5 score: 0.9314


In [26]:
# Save the model to Google Drive (for use in the Streamlit application)

import os
model_dir = "/content/drive/My Drive/trained_bart_base"
os.makedirs(model_dir, exist_ok=True)
happy_tt.save(model_dir)

print(f"Model saved to {model_dir}")

Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


Model saved to /content/drive/My Drive/trained_bart_base


In [21]:
beam_settings = TTSettings(num_beams=5, min_length=1, max_length=20)

In [22]:
example_1 = "This sentences, has bads grammar and spelling!"
actual_result_1 = "These sentences have bad grammar and spelling!"

result_1 = happy_tt.generate_text(example_1, args=beam_settings)
print("Model's output:", result_1.text)

print("Actual output:", actual_result_1)

Model's output: This book has bad grammar and spelling!
Actual output: These sentences have bad grammar and spelling!


In [23]:
result_2 = happy_tt.generate_text("The small little puppy was playing in the yard.", args=beam_settings)
print("Model's output:", result_2.text)

print("Actual output:", "The small puppy was playing in the yard.")

Model's output: The small puppy was playing in the yard.
Actual output: The small puppy was playing in the yard.


In [24]:
result_3 = happy_tt.generate_text("Computers don work anymore", args=beam_settings)
print("Model's output:", result_3.text)

print("Actual output:", "Computers don't work anymore")

Model's output: Computers don't work anymore
Actual output: Computers don't work anymore


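In the examples above, the model's output matches the manual correction in two of the three cases. In the first example it replaced "sentences" with "book", which produces a grammatical sentence but changes the meaning.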
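
Spot checks like these can be summarised as an exact-match rate. The sketch below uses a hypothetical helper, `exact_match_rate`, applied to the three outputs shown above; the same function could be applied to the `predicted_text` and `target` columns of the evaluation DataFrame.

```python
def exact_match_rate(predictions, references):
    """Fraction of predictions identical to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(predictions)

# The three model outputs and manual corrections shown above.
predictions = [
    "This book has bad grammar and spelling!",
    "The small puppy was playing in the yard.",
    "Computers don't work anymore",
]
references = [
    "These sentences have bad grammar and spelling!",
    "The small puppy was playing in the yard.",
    "Computers don't work anymore",
]
rate = exact_match_rate(predictions, references)
print(f"Exact match: {rate:.2f}")
```

Exact match is a strict measure: a prediction that differs only in final punctuation counts as a miss, so it complements the overlap-based scores rather than replacing them.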
