# About

----------------

In my [last post](https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/discussion/503453), I talked about how we shouldn't forget the Holistic Scoring Rubric document. The part of that post I find most important is the following:

> - **3**:
    - In general: demonstrates developing mastery.
    - Has **one or more of the following weaknesses** (this is different to scores 4, 5 and 6 - it is explicitly announced as "one or more weaknesses").
        * Point of view exists, but it's hardly critical or without proper examples, reasons or evidence from the source text.
        * Lapses in coherence and not properly organized.
        * Weak coherence and unsmooth development of ideas.
        * OK-ish use of language using sometimes weak vocab or inappropriate words, lacks variety.
        * Substantial grammar errors and spelling errors.

The key phrase is: *"Has **one or more of the following weaknesses**"*

## Experiment

----------------

So I ran an experiment. Maybe I could find the threshold of grammar and spelling errors above which an essay mustn't be scored higher than a 3, because it would then have one of the weaknesses that rule out a better score.

To detect those mistakes, I ran the [Grammarly CoEdIT model](https://huggingface.co/grammarly/coedit-large) sentence by sentence over every essay, about 350k sentences in total:

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = AutoTokenizer.from_pretrained("grammarly/coedit-large")
model = T5ForConditionalGeneration.from_pretrained("grammarly/coedit-large").to(device)

# The instruction prefix tells the model which edit to perform.
input_text = 'Fix grammatical errors in this sentence: When I grow up, I starti to understand what he said is quite right.'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
outputs = model.generate(input_ids, max_length=256)
edited_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(edited_text)
# When I grow up, I start to understand what he said is quite right.
```

I then measured how much each grammar-corrected sentence differs from the original using the Levenshtein distance, summed the distances per essay, normalized the sum by the essay's length in characters, and plotted the result:

![](https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F17421843%2F8841d7ffd7a6ce3a821b0463c11e3339%2Fgrammar_threshold.png?generation=1716191274537630&alt=media)

It can be seen that **a) scores depend clearly on the grammar errors** and **b) there is a threshold, marked by the green line** (also observe that no essay scoring above 3 lies beyond it).
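
The metric behind this plot can be sketched in a few lines. This is only an illustration on the example sentence from above, not the notebook's exact implementation; `edit_distance` and `normalized_distance` are hypothetical helper names, and the sketch normalizes a single sentence, whereas the notebook normalizes per essay.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance using two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def normalized_distance(original: str, corrected: str) -> float:
    """Edits per character of the original text (0.0 for empty input)."""
    if not original:
        return 0.0
    return edit_distance(original, corrected) / len(original)


original = "When I grow up, I starti to understand what he said is quite right."
corrected = "When I grow up, I start to understand what he said is quite right."

print(edit_distance(original, corrected))       # 1: a single deleted character
print(normalized_distance(original, corrected))
```

Dividing by the length keeps long texts from looking worse simply because they contain more characters.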

---------------

## This Notebook

This notebook is just for documentation's sake, so you can retrace what I did and maybe find errors in it. You can also play with the Grammarly-corrected texts yourself and maybe find features of your own.

In [None]:
import pandas as pd
import spacy
import re
import numpy as np
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from difflib import SequenceMatcher

from transformers import AutoTokenizer, T5ForConditionalGeneration

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Model

Create and test the model instance.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("grammarly/coedit-large")
model = T5ForConditionalGeneration.from_pretrained("grammarly/coedit-large").to(device)

In [None]:
input_text = 'Fix grammatical errors in this sentence: When I grow up, I starti to understand what he said is quite right.'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
outputs = model.generate(input_ids, max_length=256)
edited_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(edited_text)


# Work

In [None]:
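# Only the first essay is loaded here, which keeps this run short; drop .head(1) to process the full training set.
train_df = pd.read_csv('/kaggle/input/learning-agency-lab-automated-essay-scoring-2/train.csv').head(1)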
display(train_df.head())
print(len(train_df))

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
def remove_html(x):
    html=re.compile(r'<.*?>')
    return html.sub(r'',x)

def preprocess_text(x):
    # x = x.lower()
    x = remove_html(x)
    # Raw strings avoid invalid-escape warnings for \w
    x = re.sub(r"@\w+", '',x)
    #x = re.sub("'\d+", '',x)
    # x = re.sub("\d+", '',x)
    x = re.sub(r"http\w+", '',x)
    #x = re.sub(r"\s+", " ", x)
    #x = re.sub(r"\.+", ".", x)
    #x = re.sub(r"\,+", ",", x)
    x = x.strip()
    return x

def extract_sentences(text):
    return [sent.text for sent in nlp(text).sents]

def extract_pos(text):
    doc = nlp(text)
    pos = []
    for token in doc:
        pos.append(token.pos_)
    return ' '.join(pos)

def extract_grammar_correction(text):
    input_text = f'Fix grammatical errors in this sentence: {text}'
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
    outputs = model.generate(input_ids, max_length=len(input_ids[0]) + 30)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def levenshtein_distance(token1, token2):
    distances = np.zeros((len(token1) + 1, len(token2) + 1))

    for t1 in range(len(token1) + 1):
        distances[t1][0] = t1

    for t2 in range(len(token2) + 1):
        distances[0][t2] = t2
        
    a = 0
    b = 0
    c = 0
    
    for t1 in range(1, len(token1) + 1):
        for t2 in range(1, len(token2) + 1):
            if (token1[t1-1] == token2[t2-1]):
                distances[t1][t2] = distances[t1 - 1][t2 - 1]
            else:
                a = distances[t1][t2 - 1]
                b = distances[t1 - 1][t2]
                c = distances[t1 - 1][t2 - 1]
                
                if (a <= b and a <= c):
                    distances[t1][t2] = a + 1
                elif (b <= a and b <= c):
                    distances[t1][t2] = b + 1
                else:
                    distances[t1][t2] = c + 1
    return distances[len(token1)][len(token2)]

In [None]:
train_df['sentences'] = train_df['full_text'].apply(lambda f: extract_sentences(preprocess_text(f)))

In [None]:
rows = []
counter = 0

for index, row in train_df.iterrows():
    if(counter % 150 == 0):
        print("Done with: " + str(counter))

    sentences = row['sentences']
    for k in range(0, len(sentences)):
        rows.append({
            'essay_id': row['essay_id'],
            'sentence': sentences[k],
        })
    counter += 1
        
train_df_sentences = pd.DataFrame(rows)
display(train_df_sentences.head())
print(len(train_df_sentences))

In [None]:
train_df_sentences['grammar'] = train_df_sentences['sentence'].apply(extract_grammar_correction)
display(train_df_sentences.head())
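
Correcting one sentence per `generate` call gets slow at around 350k sentences. The following is a hedged sketch of a batched alternative, not what this notebook ran: `chunks`, `correct_all` and `make_coedit_corrector` are hypothetical helpers, the batch size of 32 is an arbitrary assumption, and the length budget is based on the longest prompt in each batch rather than on each sentence individually.

```python
from typing import Callable, Iterator, List, Sequence

PROMPT = "Fix grammatical errors in this sentence: "


def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def correct_all(
    sentences: Sequence[str],
    correct_batch: Callable[[Sequence[str]], List[str]],
    batch_size: int = 32,
) -> List[str]:
    """Apply `correct_batch` chunk by chunk, preserving input order."""
    corrected: List[str] = []
    for batch in chunks(sentences, batch_size):
        corrected.extend(correct_batch(batch))
    return corrected


def make_coedit_corrector(tokenizer, model, device, extra_tokens: int = 30):
    """Build a `correct_batch` function around an already loaded model."""
    def correct_batch(batch: Sequence[str]) -> List[str]:
        prompts = [PROMPT + sentence for sentence in batch]
        encoded = tokenizer(prompts, return_tensors="pt", padding=True).to(device)
        # Length budget: padded length of the longest prompt plus some slack.
        max_length = encoded.input_ids.shape[1] + extra_tokens
        outputs = model.generate(**encoded, max_length=max_length)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return correct_batch
```

With the model from above loaded, usage could look like `correct_all(train_df_sentences['sentence'].tolist(), make_coedit_corrector(tokenizer, model, device))`. Keeping the model call behind a plain callable also makes the chunking easy to test without a GPU.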

**LV distance**

In [None]:
train_df_sentences['lv_dist'] = np.vectorize(levenshtein_distance)(train_df_sentences['sentence'], train_df_sentences['grammar'])
display(train_df_sentences.head()) 

# EDA

In [None]:
train_df_grammarly = pd.read_json('/kaggle/input/train-grammarly-aes/train_grammarly.json')
train_df = pd.read_csv('/kaggle/input/learning-agency-lab-automated-essay-scoring-2/train.csv')

In [None]:
train_df_grammarly.head()

In [None]:
agg_funcs = {
    'sentence': lambda x: ' '.join(x),
    'grammar': lambda x: ' '.join(x),
    'lv_dist': 'sum'
}

In [None]:
train_df_grammarly = train_df_grammarly.groupby('essay_id').agg(agg_funcs).reset_index()
train_df_grammarly.rename(columns={'sentence': 'full_text'}, inplace=True)

In [None]:
final_df = pd.merge(train_df_grammarly, train_df[['essay_id', 'score']], on='essay_id', how='left')

**Store it**

In [None]:
final_df.to_csv('train_grammarly.csv', index=False)

In [None]:
final_df = pd.read_csv('/kaggle/working/train_grammarly.csv')
display(final_df.head())

**Normalize lv_dist**

In [None]:
final_df['normalized_lv_dist'] = final_df['lv_dist'] / final_df['full_text'].apply(lambda f: len(f))

**Count grammar errors raw**

In [None]:
def count_corrections(full_text, grammar):
    # Each opcode is (tag, i1, i2, j1, j2); every non-'equal' opcode is one edited span.
    matcher = SequenceMatcher(None, full_text, grammar)
    corrections = sum(1 for tag, *_ in matcher.get_opcodes() if tag != 'equal')
    return corrections
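
For reference, `SequenceMatcher.get_opcodes()` returns 5-tuples `(tag, i1, i2, j1, j2)`, so the last element of each tuple is an end index into the second string, not an edit count. The small sketch below, on a shortened version of the example sentence, shows two ways to turn the opcodes into a count; which of the two is the better feature is an open choice, not something this notebook settles.

```python
from difflib import SequenceMatcher

original = "I starti to understand."
corrected = "I start to understand."

matcher = SequenceMatcher(None, original, corrected)
opcodes = matcher.get_opcodes()

# Show which slice of the original maps to which slice of the correction.
for tag, i1, i2, j1, j2 in opcodes:
    print(tag, repr(original[i1:i2]), "->", repr(corrected[j1:j2]))

# Number of edited spans: one per non-'equal' opcode.
edited_spans = sum(1 for tag, *_ in opcodes if tag != "equal")

# Number of characters touched: the longer side of each edited span.
edited_chars = sum(
    max(i2 - i1, j2 - j1) for tag, i1, i2, j1, j2 in opcodes if tag != "equal"
)

print(edited_spans, edited_chars)
```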

In [None]:
final_df['corrections'] = final_df.apply(lambda row: count_corrections(row['full_text'], row['grammar']), axis=1)

In [None]:
# Overwrite the raw count with characters per correction (inf when an essay has no corrections).
final_df['corrections'] = final_df['full_text'].apply(lambda f: len(f)) / final_df['corrections']

In [None]:
display(final_df.head())

**Plotting**

In [None]:
plt.figure(figsize=(12, 6))
sns.scatterplot(x='normalized_lv_dist', y='score', data=final_df)
sns.regplot(x='normalized_lv_dist', y='score', data=final_df, scatter=False, color='red')

plt.title('Normalized Levenshtein Distance vs Score')
plt.xlabel('Normalized Levenshtein Distance (normalized_lv_dist)')
plt.ylabel('Score')
plt.ylim(0, 7)
plt.show()

## Where is the Grammar Threshold?

In [None]:
threshold = final_df[final_df['score'] > 3]['normalized_lv_dist'].max()
print(threshold)
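
Because the threshold is defined as the maximum over essays scoring above 3, no such essay can exceed it on the data it was computed from. One way to check whether it generalizes would be to fit it on one split and count violations on another. The sketch below assumes exactly that setup; the helper names are hypothetical and the numbers are made up for illustration only.

```python
from typing import Iterable, Tuple

Essay = Tuple[int, float]  # (score, normalized_lv_dist)


def fit_threshold(essays: Iterable[Essay], cap_score: int = 3) -> float:
    """Largest normalized distance among essays scoring above `cap_score`."""
    return max(dist for score, dist in essays if score > cap_score)


def exceeds_threshold(dist: float, threshold: float) -> bool:
    """True when the distance is larger than the fitted threshold."""
    return dist > threshold


# Made-up numbers, for illustration only.
fit_split = [(2, 0.09), (3, 0.07), (4, 0.04), (5, 0.03), (6, 0.01)]
held_out = [(1, 0.12), (3, 0.05), (4, 0.02), (5, 0.045)]

threshold = fit_threshold(fit_split)

# Held-out essays that score above 3 despite exceeding the threshold.
violations = [
    (score, dist) for score, dist in held_out
    if score > 3 and exceeds_threshold(dist, threshold)
]

print(threshold)   # 0.04
print(violations)  # [(5, 0.045)]
```

Few or no violations on held-out essays would support using the threshold as a hard cap or as a binary feature.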

In [None]:
plt.figure(figsize=(16, 6))
sns.scatterplot(x='normalized_lv_dist', y='score', data=final_df)

# Add vertical line at the threshold
plt.axvline(x=threshold, color='green', linestyle='--', label=f'Threshold = {threshold:.2f}')
plt.title('Normalized Error-Levenshtein Distance vs Score')
plt.xlabel('Normalized Error-Levenshtein Distance')
plt.ylabel('Score')
plt.ylim(0, 7)
plt.legend()
plt.show()