# Dataset
https://bekushal.medium.com/cleaned-lang8-dataset-for-grammar-error-detection-79aaa31150aa

https://github.com/christianversloot/machine-learning-articles/blob/main/easy-grammar-error-detection-correction-with-machine-learning.md
https://github.com/pushapgandhi/Grammer_error_correction/blob/main/Final.ipynb

# Task
Given a sentence, determine whether it is grammatically correct. If it is not, produce a corrected version.

## Pipeline
1. Data Collection
2. Preprocessing
    + Tokenize
    + Clean (lowercase and remove punctuation)
3. Model Building
    + LSTM
    + BART or T5
4. Inference
    + Error Detection - the model should output the likelihood that the sentence is grammatically correct
5. Metrics


## Links
- [GitHub code (EDA notebook)](https://github.com/pushapgandhi/Grammer_error_correction/blob/main/EDA.ipynb)

In [1]:
# Install third-party packages (run once per environment).
%pip install jsonlines
%pip install matplotlib
%pip install seaborn
%pip install wordcloud

# Standard library
import os
import re
import warnings
from collections import Counter

# Third-party
import jsonlines
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from tqdm import tqdm
from wordcloud import WordCloud

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')

Note: you may need to restart the kernel to use updated packages.


In [2]:
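# Path to the parallel corpus (local to the author's machine).
path = "/Users/daver/Desktop/College Work/NLP_Lab_Exam_Codes/Applications/resources/Grammar/en2cn-2k.en2nen2cn"

# Read the whole file; the context manager closes the handle afterwards.
with open(path, "r", encoding="UTF-8") as f2:
    lines2 = f2.readlines()

inp2 = []  # raw (noisy) sentences
tgt2 = []  # corrected sentences

# Records are read in groups of three lines: line 0 is the input and
# line 1 is the target; the third line of each group is not used here.
for i in range(2000):
    inp2.append(lines2[i*3])
    tgt2.append(lines2[i*3+1])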
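
The loop above reads the file as consecutive three-line records and hard-codes the record count at 2000. A minimal sketch of the same idea that derives the count from the data instead; `split_triples` and the sample lines are hypothetical, and the placeholder third line is an assumption made only for illustration.

```python
def split_triples(lines):
    """Split a flat list of lines into (inputs, targets).

    Assumes records are groups of three lines: input, target, and a third
    line that is ignored. Trailing lines that do not form a complete
    group are skipped.
    """
    n_records = len(lines) // 3
    inputs = [lines[i * 3].rstrip("\n") for i in range(n_records)]
    targets = [lines[i * 3 + 1].rstrip("\n") for i in range(n_records)]
    return inputs, targets


# Hypothetical sample; the third line of each group is a placeholder.
sample = [
    "We r near coca oredi...\n",
    "We are near Coca already.\n",
    "<unused third line>\n",
    "I'm thai. what do u do?\n",
    "I'm Thai. What do you do?\n",
    "<unused third line>\n",
]
inputs, targets = split_triples(sample)
print(inputs)
print(targets)
```

Unlike the notebook cell, this variant also strips the trailing newline while reading, so `preprocess` would not need to remove it later.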

In [3]:
inp2

['U wan me to "chop" seat 4 u nt?\n',
 'Yup. U reaching. We order some durian pastry already. U come quick.\n',
 'They become more ex oredi... Mine is like 25... So horrible n they did less things than last time...\n',
 "I'm thai. what do u do?\n",
 'Hi! How did your week go? Haven heard from you for some time... Hows everything?\n',
 'Haha... Okay... You going to mail her? Or you want me to reply...\n',
 'Look for it on glass table in front of tv\n',
 'Nah im goin 2 the wrks with j wot bout u?\n',
 'Lea so wanna exchange hp number?\n',
 'Ok see u next time then. Will be back in jul.\n',
 "I'll meet u b4 lec then...\n",
 'Hi! I am Ellen 18 chinese from KL.U?\n',
 'Gelek got my msg? Help my chop seat k...Hehe\n',
 'Plenty of old men...N guys ard 20 plus...Sick lei. Everyday msg...I go hm write profiles 4 u 2 c. Den where u workg now? There? As e recept? Y u littat?\n',
 'Ya...Too heart broken..Scattered everywhere rem?Dats y woke up late...Hehehe no la...Heart broken but no course.Only 

In [4]:
df = pd.DataFrame()
df["input"] = inp2
df["output"] = tgt2
df["y"] = "2"  # the same string "2" for every row

In [5]:
df

Unnamed: 0,input,output,y
0,"U wan me to ""chop"" seat 4 u nt?\n",Do you want me to reserve seat for you or not?\n,2
1,Yup. U reaching. We order some durian pastry a...,Yeap. You reaching? We ordered some Durian pas...,2
2,They become more ex oredi... Mine is like 25.....,They become more expensive already. Mine is li...,2
3,I'm thai. what do u do?\n,I'm Thai. What do you do?\n,2
4,Hi! How did your week go? Haven heard from you...,Hi! How did your week go? Haven't heard from y...,2
...,...,...,...
1995,Hmmm... Thk i usually book on wkends... Depend...,Hmm. I think I usually book on weekends. It de...,2
1996,ask them got any sms messages to gif me lei......,Can you ask them whether they have for any sms...,2
1997,We r near coca oredi...\n,We are near Coca already.\n,2
1998,hall Eleven. Got lectures le mah.èn forget abt...,Hall eleven. Got lectures. And forget about co...,2


In [6]:
def remove_spaces(text):
    '''Remove the spaces left before contractions and punctuation,
    then drop a few stray symbols.
            ca n't ==> can't
            I 'm ===> I'm ...etc
    '''
    text = re.sub(r" '(\w)", r"'\1", text)   # I 'm -> I'm
    text = re.sub(r" ,", ",", text)
    text = re.sub(r" \.+", ".", text)        # " ..." -> "."
    text = re.sub(r" !+", "!", text)
    text = re.sub(r" \?+", "?", text)
    text = re.sub(r" n't", "n't", text)      # ca n't -> can't
    text = re.sub(r"[();_^`/]", "", text)    # drop ( ) ; _ ^ ` /
    return text


#REF : https://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python

def decontract(text):
    '''Expand common English contractions (e.g. I'm -> I am).'''
    # Specific forms come first so the general n't rule does not turn
    # "won't" into "wo not" or "can't" into "ca not".
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'s", " is", text)    # also rewrites possessives (John's -> John is)
    text = re.sub(r"'d", " would", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'t", " not", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"'m", " am", text)
    return text


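def preprocess(text):
    '''Clean one sentence: fix spacing, collapse repeated punctuation,
    expand contractions, strip non-alphanumeric characters, lowercase.'''
    text = re.sub("\n","",text)                # drop newlines
    text = remove_spaces(text)                 # removing unwanted spaces
    text = re.sub(r"\.+",".",text)             # "..." -> "."
    text = re.sub(r"\!+","!",text)             # "!!!" -> "!"
    text = decontract(text)                    # decontraction
    text = re.sub("[^A-Za-z0-9 ]+","",text)    # keep letters, digits, spaces
    text = text.lower()
    return text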
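
The substitutions in `decontract` depend on their order: the specific patterns for "won't" and "can't" have to run before the general `n't` rule. A minimal sketch contrasting the two orders; `expand` and the two rule lists are hypothetical helpers written for this illustration, not part of the notebook.

```python
import re


def expand(text, rules):
    """Apply (pattern, replacement) pairs in the given order."""
    for pattern, repl in rules:
        text = re.sub(pattern, repl, text)
    return text


# Same three rules, two different orders.
specific_first = [(r"won't", "will not"), (r"can't", "can not"), (r"n't", " not")]
general_first = [(r"n't", " not"), (r"won't", "will not"), (r"can't", "can not")]

sentence = "I won't go and I can't stay"
good = expand(sentence, specific_first)
bad = expand(sentence, general_first)
print(good)  # I will not go and I can not stay
print(bad)   # I wo not go and I ca not stay
```

With the general rule first, "won't" has already been rewritten by the time its own pattern runs, so the specific rule never matches.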

In [7]:
df["processed_input"] = df.input.apply(preprocess)
df["processed_output"] = df.output.apply(preprocess)
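
One side effect of the final regex in `preprocess` is that punctuation is deleted rather than replaced, so tokens separated only by punctuation get merged. The sketch below shows this on one of the input sentences and compares it with a hypothetical variant that substitutes a space instead; the variant is an assumption offered for comparison, not what the notebook does.

```python
import re

text = "Hi! I am Ellen 18 chinese from KL.U?"

# Deleting punctuation outright, as the final regex of preprocess does.
deleted = re.sub(r"[^A-Za-z0-9 ]+", "", text).lower()

# Hypothetical variant: replace with a space, then collapse repeated spaces.
spaced = re.sub(r"[^A-Za-z0-9 ]+", " ", text)
spaced = re.sub(r" +", " ", spaced).strip().lower()

print(deleted)  # hi i am ellen 18 chinese from klu
print(spaced)   # hi i am ellen 18 chinese from kl u
```

Which behavior is preferable depends on the downstream tokenizer; the space-based variant keeps "kl" and "u" as separate tokens.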

In [8]:
df.head(5)

Unnamed: 0,input,output,y,processed_input,processed_output
0,"U wan me to ""chop"" seat 4 u nt?\n",Do you want me to reserve seat for you or not?\n,2,u wan me to chop seat 4 u nt,do you want me to reserve seat for you or not
1,Yup. U reaching. We order some durian pastry a...,Yeap. You reaching? We ordered some Durian pas...,2,yup u reaching we order some durian pastry alr...,yeap you reaching we ordered some durian pastr...
2,They become more ex oredi... Mine is like 25.....,They become more expensive already. Mine is li...,2,they become more ex oredi mine is like 25 so h...,they become more expensive already mine is lik...
3,I'm thai. what do u do?\n,I'm Thai. What do you do?\n,2,i am thai what do u do,i am thai what do you do
4,Hi! How did your week go? Haven heard from you...,Hi! How did your week go? Haven't heard from y...,2,hi how did your week go haven heard from you f...,hi how did your week go have not heard from yo...


In [9]:
df.drop(["input","output"], axis = 1, inplace = True) 

In [10]:
df.head(5)

Unnamed: 0,y,processed_input,processed_output
0,2,u wan me to chop seat 4 u nt,do you want me to reserve seat for you or not
1,2,yup u reaching we order some durian pastry alr...,yeap you reaching we ordered some durian pastr...
2,2,they become more ex oredi mine is like 25 so h...,they become more expensive already mine is lik...
3,2,i am thai what do u do,i am thai what do you do
4,2,hi how did your week go haven heard from you f...,hi how did your week go have not heard from yo...


In [11]:
df.shape

(2000, 3)

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   y                 2000 non-null   object
 1   processed_input   2000 non-null   object
 2   processed_output  2000 non-null   object
dtypes: object(3)
memory usage: 47.0+ KB
