### ⭐️ What this notebook is about
I've been seeing a discrepancy between CV and LB scores. For example, in my previous [notebook](https://www.kaggle.com/code/emiz6413/cv-0-825-lb-0-803-deberta-v3-small-with-huber-loss) I used a 5-fold `StratifiedKFold` split and obtained CV=0.825 while the LB score was 0.803. As many have reported, CV has been consistently higher than LB (by around 0.02~0.03). <br>
There is [a great discussion](https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/discussion/499959) by @ragnar123 that points out a potential problem with a vanilla stratified k-fold split: the test set may contain essays written for different prompt(s) than those in the training set. <br>
@ragnar123 suggested using `GroupKFold` to separate essays of different prompts between the training and evaluation sets, which should reduce the discrepancy between CV and LB. <br>
This notebook is my attempt to predict each essay's prompt from its text.
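
To make the `GroupKFold` idea concrete, here is a minimal sketch on toy data. The prompt names `A` to `D` and the array sizes are made up for illustration; in practice the groups would be the (known or predicted) prompt of each essay.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy stand-in for the essays: 12 essays spread over 4 hypothetical prompts.
prompts = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"])
X = np.zeros(len(prompts))  # features are irrelevant for splitting

splitter = GroupKFold(n_splits=4)
shared_prompts = []
for train_idx, val_idx in splitter.split(X, groups=prompts):
    # Collect prompts that appear on both sides of the split.
    shared = set(prompts[train_idx]) & set(prompts[val_idx])
    shared_prompts.append(shared)

print(shared_prompts)  # every entry is an empty set
```

Because no prompt is shared between the training and validation sides, each validation fold mimics a test set with unseen prompts.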

### ✒️ Method
1. Extract the essays that also exist in the [persuade 2.0 corpus](https://www.kaggle.com/datasets/nbroad/persaude-corpus-2). This gives 12871 duplicates (intersection set) and 4436 essays that don't exist in persuade (difference set).
2. The persuade corpus contains a `prompt_name` column. We train a prompt classifier on the intersection set. Very high accuracy (>0.99) is easy to achieve even with small models like `deberta-v3-small` trained for 1 epoch.
3. Predict the prompts of the difference set using the classifier trained in step 2, and save the predicted prompts as a CSV file.
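
Step 1 can be sketched on a toy example as follows. The two small tables and their contents are hypothetical; only the column names `essay_id`, `full_text` and `prompt_name` follow the real data. Note that the match is an exact string comparison, and that a text appearing more than once in the corpus would produce duplicated rows after the merge.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
train = pd.DataFrame({
    "essay_id": ["e1", "e2", "e3"],
    "full_text": ["text one", "text two", "text three"],
})
corpus = pd.DataFrame({
    "full_text": ["text one", "text three", "text four"],
    "prompt_name": ["Prompt X", "Prompt Y", "Prompt X"],
})

# Intersection: essays whose text also appears in the corpus (prompt known).
inter = train.merge(corpus, on="full_text", how="inner")
# Difference: the remaining training essays (prompt must be predicted).
diff = train[~train["essay_id"].isin(inter["essay_id"])]

print(sorted(inter["essay_id"]))
print(sorted(diff["essay_id"]))
```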

### 🛑 Credit
I implemented the method mentioned in [the discussion](https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/discussion/499959) by @ragnar123. I highly recommend checking the linked discussion if you haven't.

In [None]:
import os
import copy

import torch
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments, EvalPrediction
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay

In [None]:
train_df = pd.read_csv("/kaggle/input/learning-agency-lab-automated-essay-scoring-2/train.csv")
persuade = pd.read_csv("/kaggle/input/persaude-corpus-2/persuade_2.0_human_scores_demo_id_github.csv")

#### Find the number of training essays $|S_{train}|$

In [None]:
len(train_df)

#### Find intersection $S_{train} \cap S_{persuade}$ and difference $S_{train} \setminus S_{persuade}$

In [None]:
intersection = pd.merge(train_df, persuade, on="full_text", how="inner")[["essay_id", "full_text", "score", "prompt_name"]].reset_index(drop=True)
difference = train_df[~train_df["essay_id"].isin(intersection["essay_id"])].reset_index(drop=True)
print("len(intersection):", len(intersection))
print("len(difference):", len(difference))

### Plot the histogram of prompts colored by score

In [None]:
px.histogram(intersection, x="prompt_name", color="score")

### Split the intersection
I will split the intersection into 2 folds with `StratifiedKFold`, based on the combination of score and prompt.<br>
To do that, I first create a new column called `score_and_prompt`, which is the concatenation of score and prompt, and then stratify on that column.
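
As a quick sanity check of this idea, the sketch below uses a made-up table in which every (score, prompt) pair occurs exactly 4 times. Stratifying on the combined key should then put 2 essays of each pair into each fold. The data and the `random_state` value are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy data: every (score, prompt) pair occurs exactly 4 times.
toy = pd.DataFrame({
    "score": [1, 2] * 8,
    "prompt_name": ["A"] * 8 + ["B"] * 8,
})
toy["score_and_prompt"] = toy["score"].astype(str) + "-" + toy["prompt_name"]

splitter = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
toy["fold"] = -1
splits = splitter.split(np.zeros(len(toy)), toy["score_and_prompt"])
for fold_idx, (_, val_idx) in enumerate(splits):
    toy.loc[val_idx, "fold"] = fold_idx

# Count how many essays of each combination landed in each fold.
counts = toy.groupby(["fold", "score_and_prompt"]).size().unstack(fill_value=0)
print(counts)
```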

In [None]:
intersection["score_and_prompt"] = intersection["score"].astype(str) + "-" + intersection["prompt_name"]
intersection.head()

In [None]:
intersection["fold"] = None
splitter = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
folds = list(splitter.split(X=np.zeros(len(intersection)), y=intersection["score_and_prompt"].values))
for fold_idx, (_, val_idx) in enumerate(folds):
    intersection.loc[val_idx, "fold"] = fold_idx

In [None]:
intersection.head()

### Visualize the split

In [None]:
px.histogram(intersection, x="prompt_name", color="score", facet_row="fold")

### Convert prompt to prompt id
We're going to train a model to classify the prompt. Currently, the target column `prompt_name` is a string, so we need to convert it to an integer id, which we store in a new column named `label`.
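
For reference, this is how pandas category codes behave on a tiny example. The prompt names below are placeholders, not real prompts. For string data the categories are sorted, so the codes follow alphabetical order and can be mapped back to names through `cat.categories`.

```python
import pandas as pd

# Placeholder prompt names, for illustration only.
names = pd.Series(["prompt_b", "prompt_c", "prompt_b", "prompt_a"])
cat = names.astype("category")
codes = cat.cat.codes

# Categories of string data are sorted, so codes follow alphabetical order.
id2name = dict(enumerate(cat.cat.categories))
print(codes.tolist())  # [1, 2, 1, 0]
print(id2name)

# Round trip: codes map back to the original names.
restored = [id2name[c] for c in codes.tolist()]
```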

In [None]:
intersection["label"] = intersection["prompt_name"].astype("category").cat.codes
intersection.head()

In [None]:
prompt_to_id = intersection.drop_duplicates(subset=("prompt_name", "label"))[["prompt_name", "label"]]
label2id = {row["prompt_name"]: row["label"] for _, row in prompt_to_id.iterrows()}
id2label = {row["label"]: row["prompt_name"] for _, row in prompt_to_id.iterrows()}
prompt_to_id

### Instantiate the model & tokenizer
I will use `deberta-v3-xsmall` here for a good trade-off between performance and speed.

In [None]:
checkpoint = "microsoft/deberta-v3-xsmall"

class ModelInit:
    def __init__(self, checkpoint, label2id, id2label):
        self.model = AutoModelForSequenceClassification.from_pretrained(
            checkpoint, 
            num_labels=len(label2id),
            label2id=label2id,
            id2label=id2label,
        )
        self.state_dict = copy.deepcopy(self.model.state_dict())
        
    def __call__(self):
        self.model.load_state_dict(self.state_dict)
        return self.model
    
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model_init = ModelInit(checkpoint, label2id=label2id, id2label=id2label)

### Tokenize the dataset
I will use `max_length=1024` here.

In [None]:
intersection_ds = Dataset.from_pandas(intersection)
diff_ds = Dataset.from_pandas(difference)
intersection_ds = intersection_ds.map(lambda i: tokenizer(i["full_text"], max_length=1024, truncation=True), batched=True)
diff_ds = diff_ds.map(lambda i: tokenizer(i["full_text"], max_length=1024, truncation=True), batched=True)

### Train prompt classifier

In [None]:
def compute_metrics(eval_pred: EvalPrediction) -> dict[str, float]:
    predictions = eval_pred.predictions
    y_true = eval_pred.label_ids
    y_pred = predictions.argmax(-1)
    acc = accuracy_score(y_true, y_pred)
    return {"acc": acc}

In [None]:
args = TrainingArguments(
    output_dir="output",
    report_to="none",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    per_device_eval_batch_size=8,
    learning_rate=1e-5,
    lr_scheduler_type="constant",
    warmup_ratio=0.0,
    num_train_epochs=1,
    weight_decay=0.01,
    optim="adamw_torch",
    fp16=torch.cuda.is_available()
)

In [None]:
predictions = []

for fold_idx in np.unique(intersection_ds["fold"]):
    args.output_dir = os.path.join("output", f"fold_{fold_idx}")
    train_ds = intersection_ds.select([i for i, f in enumerate(intersection_ds["fold"]) if f != fold_idx])
    eval_ds = intersection_ds.select([i for i, f in enumerate(intersection_ds["fold"]) if f == fold_idx])
    trainer = Trainer(
        args=args, 
        model_init=model_init,
        train_dataset=train_ds, 
        eval_dataset=eval_ds, 
        tokenizer=tokenizer, 
        compute_metrics=compute_metrics
    )
    trainer.train()
    # predict on eval dataset to visualize the result
    preds = trainer.predict(eval_ds)
    fig, ax = plt.subplots()
    ConfusionMatrixDisplay.from_predictions(
        y_true=preds.label_ids, 
        y_pred=preds.predictions.argmax(-1),
        ax=ax
    )
    acc = accuracy_score(y_true=preds.label_ids, y_pred=preds.predictions.argmax(-1))
    ax.set_title(f"fold-{fold_idx} acc: {acc:.3f}")
    plt.show()  # render the figure now; fig.show() may only warn on the non-GUI inline backend
    # predict on difference dataset
    test_preds = trainer.predict(diff_ds)
    predictions.append(test_preds.predictions)
    
predictions = np.stack(predictions, axis=0).mean(axis=0)  # average the result of 2 folds

### Save the result
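
The saved predictions come from averaging the logits of the two fold models and taking the argmax. A toy sketch with made-up logits (3 essays, 3 prompt classes) shows the mechanics; note how the second essay, on which the two folds disagree, is decided by the averaged logits.

```python
import numpy as np

# Made-up logits from two fold models: 3 essays, 3 prompt classes.
fold_0 = np.array([[2.0, 0.5, 0.1], [0.2, 1.0, 0.9], [0.1, 0.2, 3.0]])
fold_1 = np.array([[1.5, 0.7, 0.0], [0.1, 0.8, 1.4], [0.0, 0.4, 2.5]])

# Stack to shape (n_folds, n_essays, n_classes) and average over folds.
mean_logits = np.stack([fold_0, fold_1], axis=0).mean(axis=0)
pred_ids = mean_logits.argmax(-1)
print(pred_ids.tolist())  # [0, 2, 2]
```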

In [None]:
essay_id = diff_ds["essay_id"]
prompt_name = [id2label[i] for i in predictions.argmax(-1)]  # convert prompt id back to prompt name
result_df = pd.DataFrame({"essay_id": essay_id, "prompt_name": prompt_name, "predicted": [True] * len(essay_id)})
result_df

In [None]:
intersection_df = intersection[["essay_id", "prompt_name"]].copy()
intersection_df["predicted"] = False
final_df = pd.concat([result_df, intersection_df], ignore_index=True)  # avoid duplicated index values
final_df.to_csv("predicted_prompt.csv", index=False)
final_df