# LB Probe: Is Test Data in the Persuade Corpus?
In this notebook, we probe the LB test data to see whether any test essays are duplicates of essays in the Persuade 2.0 corpus. We do not expect them to be, but we will check. There is a discussion [here][1] and a previous notebook that analyzes the train data [here][2].

As a baseline, we will use the simple notebook [here][3], which predicts scores from essay length alone and achieves `LB=0.703`. If replacing the baseline prediction with the Persuade target whenever we find a match improves our LB score, then we have evidence that some test essays (and their LB targets) are duplicates from the Persuade 2.0 corpus.

[1]: https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/discussion/493962#2760994
[2]: https://www.kaggle.com/code/abdullahmeda/persuade-train-essays-analysis/
[3]: https://www.kaggle.com/code/ianchute/no-model-baseline

# Load Persuade and Train Data
We load the Persuade 2.0 corpus and the competition train data.

In [None]:
import pandas as pd, numpy as np
import matplotlib.pyplot as plt
import cupy as cp

persuade = pd.read_csv('/kaggle/input/persaude-corpus-2/persuade_2.0_human_scores_demo_id_github.csv')
print('Persuade corpus 2.0 shape:', persuade.shape )
persuade.head()

In [None]:
train = pd.read_csv('/kaggle/input/learning-agency-lab-automated-essay-scoring-2/train.csv')
print('Train data shape:', train.shape )
train.head()

# Create CountVectorizer Embeddings
We will use a CountVectorizer with English stop words removed and a maximum vocabulary of 1024 words, then L2-normalize each count vector to create the embeddings.

In [None]:
%%time
from sklearn.feature_extraction.text import CountVectorizer
model = CountVectorizer(stop_words='english',max_features=1024)
p_embed = model.fit_transform(persuade.full_text.values)
p_embed = cp.array(p_embed.toarray())
norm = cp.sqrt( cp.sum(p_embed*p_embed,axis=1, keepdims=True) )
p_embed = p_embed / norm

In [None]:
%%time
train_embed = model.transform(train.full_text.values)
train_embed = cp.array(train_embed.toarray())
norm = cp.sqrt( cp.sum(train_embed*train_embed,axis=1, keepdims=True) )
train_embed = train_embed / norm
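
As a rough illustration of why the count vectors are normalized, here is a minimal sketch in plain Python (no GPU or scikit-learn needed). The toy vocabulary, the whitespace tokenizer, and the helper names `count_vector` and `l2_normalize` are assumptions made for this example and are not part of the notebook. The sketch also guards against an all-zero vector; in the cells above, an essay containing no vocabulary words would have a zero norm and the division would likely produce NaN values.

```python
import math
from collections import Counter

def count_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

vocab = ["cars", "school", "phones"]  # toy vocabulary (assumption)
a = l2_normalize(count_vector("cars cars school", vocab))
b = l2_normalize(count_vector("cars school school", vocab))

# For unit-length vectors, the dot product equals the cosine similarity.
cosine = sum(x * y for x, y in zip(a, b))
print(round(cosine, 3))
```

Once every row has unit length, a single matrix product gives all pairwise cosine similarities, which is what the next section relies on.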

# Cosine Similarity Search
We use CuPy on the GPU to quickly multiply the normalized embedding matrices of Persuade and Train. Because every row has unit length, the resulting dot products are cosine similarities. We keep only the top-1 match for each train essay.

In [None]:
%%time
top1 = cp.dot(p_embed, train_embed.T)
top1 = cp.argmax(top1,axis=0)
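
The product above has one row per Persuade essay and one column per train essay, so taking `argmax` over `axis=0` returns, for each train essay, the row index of its most similar Persuade essay. A minimal plain-Python sketch of the same idea follows; the function name `top1_matches` and the toy vectors are made up for illustration.

```python
def top1_matches(corpus, queries):
    """For each query vector, return the index of the most similar corpus vector."""
    best = []
    for q in queries:
        # Dot product of the query with every corpus vector (cosine if unit length).
        sims = [sum(c_i * q_i for c_i, q_i in zip(c, q)) for c in corpus]
        # Index of the largest similarity; ties resolve to the first index.
        best.append(max(range(len(sims)), key=sims.__getitem__))
    return best

corpus = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]   # toy "Persuade" embeddings
queries = [[0.8, 0.6], [0.0, 1.0]]              # toy "train" embeddings
print(top1_matches(corpus, queries))            # [2, 1]
```

The nearest neighbor is only a candidate: every essay gets a top-1 match even when nothing similar exists, which is why a separate distance check is still needed.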

In [None]:
train['full_text_p'] = ''
train['score_p'] = -1
for k in range(len(train)):
    if k%500==0: print(k,', ',end='')
    train.loc[k,'full_text_p'] = persuade.loc[top1[k].item(),'full_text']
    train.loc[k,'score_p'] = persuade.loc[top1[k].item(),'holistic_essay_score']

# Compute Levenshtein Distance
We will assume a match if `normalized Levenshtein distance < 0.1`. In other words, we tolerate edits to up to 10% of the characters, in case Kaggle changed the essays slightly.

In [None]:
!pip install /kaggle/input/polyleven-whl-files/polyleven-0.8-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl

In [None]:
from polyleven import levenshtein
MATCH_THRESHOLD = 0.1

def levenshtein_distance(row):
    s1 = row.full_text
    s2 = row.full_text_p
    lev = levenshtein(s1.strip(), s2.strip())
    lev = lev/max(len(s1.strip()), len(s2.strip()))
    row['lev'] = lev
    return row    
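
For readers without the `polyleven` wheel, the quantity being computed can be sketched in plain Python. This is a textbook dynamic-programming edit distance written for illustration, not polyleven's implementation, and it is far slower; the names `levenshtein_py` and `normalized_levenshtein` are hypothetical. Unlike the cell above, the sketch returns 0.0 when both strings are empty instead of dividing by zero.

```python
def levenshtein_py(s1, s2):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def normalized_levenshtein(s1, s2):
    """Edit distance divided by the longer string's length, after stripping."""
    s1, s2 = s1.strip(), s2.strip()
    longest = max(len(s1), len(s2))
    return levenshtein_py(s1, s2) / longest if longest else 0.0

print(levenshtein_py("kitten", "sitting"))                  # 3
print(normalized_levenshtein("essay text", "essay test"))   # 0.1
```

A value of 0 means the stripped texts are identical, and values near 1 mean almost every character differs.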

In [None]:
%%time
train = train.apply(levenshtein_distance,axis=1)
train.head()

In [None]:
c = (train.lev<MATCH_THRESHOLD).sum()
print(f'There are {c} out of {len(train)} train essays that match Persuade Corpus 2.0 with threshold {MATCH_THRESHOLD}')

In [None]:
plt.hist( train.lev.values, bins=100)
plt.plot([MATCH_THRESHOLD,MATCH_THRESHOLD],[0,100],'--',color='black',
         label=f'We use\nthreshold\nfor match\n= {MATCH_THRESHOLD}')
plt.ylim((0,100))
plt.legend()
plt.title('Histogram of Normalized Levenshtein\nbetween Persuade data and Train data',size=14)
plt.show()

# EDA Matches
Below we show one match as a sanity check. We could explore more here.

In [None]:
tmp = train.loc[train.lev<MATCH_THRESHOLD]
p = (tmp.score != tmp.score_p).sum()
print(f'Among {c} matches, all targets match except {p}')

In [None]:
row = tmp.sample(1,random_state=42).iloc[0]
print(f"Example essay from train data with score = {row['score']}:\n")
print( row.full_text )

In [None]:
print(f"Example matched essay from Persuade Corpus with Persuade score = {row['score_p']}:\n")
print( row.full_text_p )

# Create Submission CSV Baseline
This baseline solution comes from [here][1] and achieves `LB=0.703`. By having a baseline with known LB score, we will know whether our modifications based on matches with Persuade corpus (in next section) improve or hurt our LB score.

[1]: https://www.kaggle.com/code/ianchute/no-model-baseline

In [None]:
train2 = pd.read_csv("/kaggle/input/learning-agency-lab-automated-essay-scoring-2/train.csv",
                     index_col="essay_id")
test2 = pd.read_csv("/kaggle/input/learning-agency-lab-automated-essay-scoring-2/test.csv",
                    index_col="essay_id")

# Raw strings avoid Python's invalid-escape warning for "\s".
train2["length"] = train2.full_text.str.replace(r"[Ee\s]", "", regex=True).str.len() // 50
test2["length"] = test2.full_text.str.replace(r"[Ee\s]", "", regex=True).str.len() // 50

modes = (train2
    .groupby("length")
    .score
    .agg(lambda s: s.value_counts().keys()[0])
    .sort_index()
    .reindex(range(0, 200))
    .fillna(0)
    .clip(lower=1)
    .cummax()
    .astype(int))

from sklearn.metrics import cohen_kappa_score
qwk = cohen_kappa_score(train2.score, train2.length.map(modes), weights="quadratic")
modes.plot.line(title=f"Baseline Prediction versus Essay Character Length / 50\n(with spaces and letter E's removed)\nBaseline CV QWK Score = {qwk}")
plt.ylabel('Prediction')
plt.xlabel("Essay character length / 50 (with E's and spaces removed)")
plt.show()

print('\nBaseline predictions:')
baseline = test2.length.map(modes).rename("score").reset_index()
baseline.head()
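
To make the baseline logic easier to follow, here is a plain-Python sketch of the same steps: bucket each essay by length, take the most common score per bucket, treat empty buckets as score 1, and apply a running maximum so predictions never decrease with length. The function names and toy essays are assumptions for this example; ties between equally common scores may be broken differently than `value_counts` does in the cell above.

```python
import re
from collections import Counter, defaultdict

def length_bucket(text, width=50):
    """Characters left after removing whitespace and the letters E/e, in steps of `width`."""
    return len(re.sub(r"[Ee\s]", "", text)) // width

def fit_modes(texts, scores, n_buckets=200):
    """Most common score per length bucket, floored at 1 and made non-decreasing."""
    by_bucket = defaultdict(list)
    for text, score in zip(texts, scores):
        by_bucket[length_bucket(text)].append(score)
    modes, running = [], 1  # starting at 1 mirrors fillna(0).clip(lower=1)
    for b in range(n_buckets):
        if b in by_bucket:
            mode = Counter(by_bucket[b]).most_common(1)[0][0]
            running = max(running, mode)  # running maximum, like cummax()
        modes.append(running)
    return modes

texts = ["x" * 40, "x" * 45, "x" * 60, "x" * 120]   # toy essays (assumption)
scores = [2, 2, 1, 4]
print(fit_modes(texts, scores, n_buckets=4))        # [2, 2, 4, 4]
```

In the toy data the second bucket's most common score is 1, but the running maximum lifts it to 2, which shows how the `cummax` step enforces monotonic predictions.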

# Modify Submission CSV using Persuade Matches
We will search the test data for matches with the Persuade 2.0 corpus. Whenever we find a match with `normalized Levenshtein distance < 0.1`, we will use the target label from the Persuade corpus as our prediction. Otherwise, we will use the baseline solution from the previous section (which comes from [here][1] and achieves `LB=0.703`).

[1]: https://www.kaggle.com/code/ianchute/no-model-baseline

In [None]:
test = pd.read_csv('/kaggle/input/learning-agency-lab-automated-essay-scoring-2/test.csv')
print('Test data shape:', test.shape )
test.head()

In [None]:
test_embed = model.transform(test.full_text.values)
test_embed = cp.array(test_embed.toarray())
norm = cp.sqrt( cp.sum(test_embed*test_embed,axis=1, keepdims=True) )
test_embed = test_embed / norm

In [None]:
top1 = cp.dot(p_embed, test_embed.T)
top1 = cp.argmax(top1,axis=0)

In [None]:
test['full_text_p'] = ''
test['score_p'] = -1
test = test.merge(baseline, on='essay_id', how='left').fillna(3)
for k in range(len(test)):
    test.loc[k,'full_text_p'] = persuade.loc[top1[k].item(),'full_text']
    test.loc[k,'score_p'] = persuade.loc[top1[k].item(),'holistic_essay_score']

In [None]:
test = test.apply(levenshtein_distance,axis=1)
test.head()

In [None]:
test.loc[test.lev<MATCH_THRESHOLD,'score'] = test.loc[test.lev<MATCH_THRESHOLD,'score_p']
sub = test[['essay_id','score']].copy()
sub.score = sub.score.astype('int32') # to be safe
sub.to_csv('submission.csv',index=False)
print('Submission shape:',sub.shape)
sub.head()
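
The override rule in the last cell can be summarized with a small plain-Python sketch. The essay ids, scores, and distances below are invented for illustration, and `final_score` is a hypothetical helper rather than a function from the notebook.

```python
MATCH_THRESHOLD = 0.1

def final_score(baseline_score, persuade_score, lev):
    """Use the Persuade label only when the nearest neighbor is a near-duplicate."""
    return int(persuade_score if lev < MATCH_THRESHOLD else baseline_score)

rows = [
    {"essay_id": "a", "baseline": 3, "score_p": 5, "lev": 0.02},  # near-duplicate
    {"essay_id": "b", "baseline": 3, "score_p": 1, "lev": 0.74},  # unrelated essay
]
preds = {r["essay_id"]: final_score(r["baseline"], r["score_p"], r["lev"])
         for r in rows}
print(preds)  # {'a': 5, 'b': 3}
```

If no test essay falls under the threshold, the submission is identical to the baseline and the LB score should stay at the baseline value, so any change in LB score would come only from the overridden rows.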