The dataset in this competition is a subset of the dataset from the [Feedback Prize - Evaluating Student Writing](https://www.kaggle.com/competitions/feedback-prize-2021) (2021) competition. It seems likely that the top solutions in this competition will make use of the full 2021 dataset.

As a first pass at using the 2021 dataset, I take the 5-fold DeBERTa v3 model trained in my original [Training notebook](https://www.kaggle.com/code/lextoumbourou/feedback-prize-eda-and-model-training) and make predictions on the entire 2021 dataset. I then average the softmax probabilities for the 2021 data across the folds.
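
As a rough illustration of the fold averaging described above, here is a minimal sketch on toy numbers. The logits are made up for the example and are not real model output; only the softmax-then-mean pattern is the point.

```python
import numpy as np
from scipy.special import softmax

# Hypothetical logits: 2 folds x 3 rows x 3 classes (toy values only).
fold_logits = np.array([
    [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 2.2]],
    [[1.8, 0.7, 0.2], [0.4, 1.1, 0.6], [0.3, 0.1, 1.9]],
])

# Softmax over the class axis for each fold, then average across folds.
fold_probs = softmax(fold_logits, axis=-1)
avg_probs = fold_probs.mean(axis=0)

print(avg_probs.shape)        # one row of class probabilities per input row
print(avg_probs.sum(axis=1))  # each averaged row still sums to 1
```

Averaging probabilities rather than raw logits keeps every fold on the same scale, and each averaged row remains a valid probability distribution.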

I have included a field that represents whether the row is in the 2022 set, so you can plan validation folds accordingly.

# Imports

In [None]:
from tqdm.auto import tqdm
from pathlib import Path
from types import SimpleNamespace
import logging

from datasets import Dataset

import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig, DataCollatorWithPadding
from transformers import TrainingArguments, Trainer
from scipy.special import softmax
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import log_loss

# To work around the aggressive HuggingFace log spam.
logging.disable(logging.WARNING)

# From this Gist: https://gist.github.com/ihoromi4/b681a9088f348942b01711f251e5f964
def seed_everything(seed: int):
    import random, os
    import numpy as np
    import torch
    
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    # Changed from the Gist: benchmark mode can select non-deterministic kernels.
    torch.backends.cudnn.benchmark = False

# Compare Datasets

Let's do a quick comparison of datasets.

In [None]:
import pandas as pd
train_2021_df = pd.read_csv('../input/feedback-prize-2021/train.csv')
total_size_2021 = len(train_2021_df)
num_essays_2021 = train_2021_df.id.nunique()
train_df = pd.read_csv('../input/feedback-prize-eda-and-model-training/train_folds.csv')
total_size_2022 = len(train_df)
num_essays_2022 = train_df.essay_id.nunique()

Firstly, let's compare total number of rows and number of unique essays across datasets.

In [None]:
comp_df = pd.DataFrame(
    [('2021', total_size_2021, num_essays_2021),
    ('2022', total_size_2022, num_essays_2022)],
    columns=['dataset', 'total', 'unique essays']
).set_index('dataset')
comp_df

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18, 5))
for i, field in enumerate(['total', 'unique essays']):
    comp_df[field].plot.bar(title=field, ax=axes[i])
plt.show()

So with **11,403** additional essays and **107,527** total extra rows, last year's competition data seems useful.

Let's also compare `discourse_type` distribution.

In [None]:
fig, axes = plt.subplots(1, 2, sharex='col', sharey='row', figsize=(18, 5))
for i, (year, df) in enumerate([(2021, train_2021_df), (2022, train_df)]):
    df.discourse_type.value_counts(normalize=True).plot.bar(ax=axes[i], title=f'{year}')
fig.suptitle('Discourse Type Distribution Comparison', y=1.08)
plt.show()

Seems very similar indeed!

Let's check to see if all the ids in the 2022 competition set exist in the original.

In [None]:
ids_2022 = set(train_df.essay_id.unique())
ids_2021 = set(train_2021_df.id.unique())

In [None]:
ids_2022 - ids_2021

**All essays in 2022 are also in 2021.**

In [None]:
len(ids_2021 - ids_2022)

11,403 essays exist in 2021 that aren't in 2022.

Let's see an example to ensure they're otherwise identical.

In [None]:
from IPython.display import display, HTML
display(HTML(
    f"""
    <table>
        <tr>
          <th>Essay id <b>007ACE74B050</b> in <b>2021</b> dataset</th>
          <th>Essay id <b>007ACE74B050</b> in <b>2022</b> dataset</th>
        </tr>
        <tr>
          <td>{'<br><br>'.join(list(train_2021_df[train_2021_df.id == '007ACE74B050'].discourse_text.values))}</td>
          <td>{'<br><br>'.join(list(train_df[train_df.essay_id == '007ACE74B050'].discourse_text.values))}</td>
        </tr>
    </table>
    
    """
))

Looks good to me!

I will add an `in_2022` column to the 2021 set that flags every essay also present in the 2022 set, so those rows can be excluded during training.

In [None]:
train_2021_df['in_2022'] = train_2021_df.id.isin(ids_2022)

In [None]:
train_2021_df.in_2022.value_counts()
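
A minimal sketch of how the flag might be used downstream, assuming the goal is to keep pseudo-labelled 2021 rows out of any 2022 validation fold. The toy frame and its ids are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the 2021 frame; ids and flags are made up.
toy_2021_df = pd.DataFrame({
    'id': ['A', 'A', 'B', 'C', 'C'],
    'discourse_text': ['t1', 't2', 't3', 't4', 't5'],
    'in_2022': [True, True, False, False, False],
})

# Keep only essays that never appear in the 2022 set, so the extra rows
# cannot overlap with essays held out for 2022 validation.
extra_df = toy_2021_df[~toy_2021_df['in_2022']].reset_index(drop=True)

print(extra_df['id'].unique())
```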

In [None]:
test_df = pd.read_csv('../input/feedback-prize-effectiveness/test.csv')

# Config

In [None]:
config = SimpleNamespace()

config.n_folds = 5
config.seed = 420
config.lr = 1e-5
config.weight_decay = 0.01
config.epochs = 3
config.batch_size = 16
config.warm_up_ratio = 0.1
config.max_len = 256
config.hidden_dropout_prob = 0.2
config.label_smoothing_factor = 0
config.output_path = Path('./')
config.model_path = Path('../input/feedback-prize-eda-and-model-training')
config.input_path = Path('../input/feedback-prize-effectiveness')

In [None]:
transformers.logging.set_verbosity_error()

seed_everything(config.seed)

# Generate Topics

In [None]:
train_2021_df = train_2021_df.rename(columns={'id': 'essay_id'})

In [None]:
topic_pred_df = pd.read_csv('../input/feedback-topics-identification-with-bertopic/topic_model_feedback.csv')
topic_pred_df = topic_pred_df.drop(columns=['prob'])
topic_pred_df = topic_pred_df.rename(columns={'id': 'essay_id'})

topic_meta_df = pd.read_csv('../input/feedback-topics-identification-with-bertopic/topic_model_metadata.csv')
topic_meta_df = topic_meta_df.rename(columns={'Topic': 'topic', 'Name': 'topic_name'}).drop(columns=['Count'])
topic_meta_df.topic_name = topic_meta_df.topic_name.apply(lambda n: ' '.join(n.split('_')[1:]))

topic_pred_df = topic_pred_df.merge(topic_meta_df, on='topic', how='left')

In [None]:
train_2021_df = train_2021_df.merge(topic_pred_df, on='essay_id', how='left')

# Prepare Data

In [None]:
labels = ['Adequate', 'Effective', 'Ineffective']
tokenizer = AutoTokenizer.from_pretrained(config.model_path / 'fold_0')

def tokenizer_func(x):
    return tokenizer(x["inputs"], get_essay(x['essay_fn']), truncation=True, return_overflowing_tokens=False)

def get_essay(essay_fns):
    essay_cache = {}

    output = []
    for essay_fn in essay_fns:
        if essay_fn not in essay_cache:
            # read_text() closes the file handle once the contents are read.
            essay_txt = Path(essay_fn).read_text()
            essay_cache[essay_fn] = essay_txt
        output.append(essay_cache[essay_fn])

    return output

def add_inputs(df, basepath):
    df['essay_fn'] = basepath + '/' + df.essay_id + '.txt'
    df['inputs'] = df.discourse_type + ' ' + tokenizer.sep_token + ' ' + df.topic_name + ' ' + tokenizer.sep_token + ' ' + df.discourse_text
    return df

train_2021_df = add_inputs(train_2021_df, '../input/feedback-prize-2021/train')
train_df = add_inputs(train_df, str(config.input_path / 'train'))

# Model

In [None]:
import torch
from torch import nn
from transformers import AutoConfig, AutoModelForSequenceClassification
from transformers.models.deberta_v2.modeling_deberta_v2 import ContextPooler
from transformers.models.deberta_v2.modeling_deberta_v2 import StableDropout
from transformers.modeling_outputs import TokenClassifierOutput
from transformers import DebertaV2ForSequenceClassification

def get_dropouts(num, start_prob, increment):
    return [nn.Dropout(start_prob + (increment * i)) for i in range(num)]  

class MeanPooling(nn.Module):
    def __init__(self):
        super(MeanPooling, self).__init__()
        
    def forward(self, last_hidden_state, attention_mask):
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
        sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, 1)
        sum_mask = input_mask_expanded.sum(1)
        sum_mask = torch.clamp(sum_mask, min=1e-9)
        mean_embeddings = sum_embeddings / sum_mask
        return mean_embeddings

class CustomModel(nn.Module):
    def __init__(self, backbone):
        super(CustomModel, self).__init__()
        
        self.model = backbone
        self.config = self.model.config
        self.num_labels = self.config.num_labels

        # self.pooler = ContextPooler(self.config)
        self.pooler = MeanPooling()
        
        self.classifier = nn.Linear(self.config.hidden_size, self.num_labels)
    
        # ModuleList registers the dropouts so model.eval() disables them at inference.
        self.dropouts = nn.ModuleList(get_dropouts(num=5, start_prob=0.1, increment=0.1))
    
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None
    ):
        outputs = self.model.deberta(
            input_ids,
            token_type_ids=token_type_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        
        encoder_layer = outputs[0]
        pooled_output = self.pooler(encoder_layer, attention_mask)
                      
        # Multi-sample dropout.
        num_dps = float(len(self.dropouts))
        for ii, drop in enumerate(self.dropouts):
            if ii == 0:
                logits = (self.classifier(drop(pooled_output)) / num_dps)
            else:
                logits += (self.classifier(drop(pooled_output)) / num_dps)

        loss = None
        if labels is not None:
            loss_fn = nn.CrossEntropyLoss()
            logits = logits.view(-1, self.num_labels)
            loss = loss_fn(logits, labels.view(-1))

        return TokenClassifierOutput(loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions)
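
To make the masked mean pooling used in `MeanPooling` concrete, here is a small NumPy restatement of the same arithmetic on a toy batch. NumPy is used only to keep the sketch lightweight; the values are invented, and the last token is treated as padding.

```python
import numpy as np

# Toy batch: 1 sequence, 3 token positions, hidden size 2.
# The third position stands in for padding and carries junk values.
last_hidden_state = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])

# Broadcast the mask over the hidden dimension, as the module does.
mask = np.broadcast_to(attention_mask[..., None], last_hidden_state.shape).astype(float)

# Sum only the real tokens and divide by how many there are.
summed = (last_hidden_state * mask).sum(axis=1)
counts = np.clip(mask.sum(axis=1), 1e-9, None)
pooled = summed / counts

print(pooled)  # the padded position does not affect the mean
```

The clamp on the token count only guards against division by zero for a fully masked row.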

In [None]:
def get_model():
    model_config = AutoConfig.from_pretrained(config.model_path / 'backbone_config/config.json')
    model = DebertaV2ForSequenceClassification(model_config)
    
    return CustomModel(model)

# Inference

In [None]:
#train_2021_df = train_2021_df.sample(n=100)
#train_df = train_df.sample(n=100)

In [None]:
all_2021_data = np.zeros((config.n_folds, len(train_2021_df), len(labels)))
all_val_preds = []

for fold_num in range(config.n_folds):
    print(f'Do fold {fold_num}')

    tokenizer = AutoTokenizer.from_pretrained(config.model_path / f'fold_{fold_num}')
    tokenizer.model_max_length = config.max_len

    model = get_model()

    state_dict = torch.load(config.model_path / f'fold_{fold_num}/pytorch_model.bin')
    model.load_state_dict(state_dict)  

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding='longest')

    args = TrainingArguments(
        output_dir=config.output_path,
        learning_rate=config.lr,
        lr_scheduler_type='cosine',
        fp16=True,
        evaluation_strategy='epoch',
        per_device_train_batch_size=config.batch_size,
        per_device_eval_batch_size=config.batch_size * 2,
        report_to="none",
        save_strategy='no',
        label_smoothing_factor=config.label_smoothing_factor
    )
    
    trainer = Trainer(
        model,
        args,
        tokenizer=tokenizer,
        data_collator=data_collator
    )

    # Make predictions on the OOF data (to verify my model works okay).
    val_df = train_df.query(f'fold == {fold_num}').reset_index(drop=True)
    val_dataset = Dataset.from_pandas(val_df[['inputs', 'essay_fn']])
    
    print('Predict on 2022 dataset')
    val_tok_dataset = val_dataset.map(tokenizer_func, batched=True, remove_columns=('inputs', 'essay_fn'))
    val_preds = trainer.predict(val_tok_dataset)
    val_preds_softmax = softmax(val_preds.predictions, axis=1)
    val_df[labels] = val_preds_softmax
    all_val_preds.append(val_df)
    
    # Make predictions on 2021 data
    print('Predict on 2021 dataset')
    val_dataset_2021 = Dataset.from_pandas(train_2021_df[['inputs', 'essay_fn']])
    val_tok_dataset_2021 = val_dataset_2021.map(tokenizer_func, batched=True, remove_columns=('inputs', 'essay_fn'))
    outputs_2021 = trainer.predict(val_tok_dataset_2021) 
    softmax_outputs_2021 = softmax(outputs_2021.predictions, axis=1)
    
    all_2021_data[fold_num] = softmax_outputs_2021

# Validate Results

Let's check to see that the model is returning the CV results we expect on this competition's data.

In [None]:
val_preds_df = pd.concat(all_val_preds)

In [None]:
log_loss(val_preds_df['discourse_effectiveness'], val_preds_df[labels])
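
For reference, a minimal sketch of what this metric computes, using made-up targets and probabilities. Passing `labels=` explicitly is an optional extra here, shown because it pins the mapping between class names and probability columns.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss

labels = ['Adequate', 'Effective', 'Ineffective']

# Made-up targets and predicted probabilities for three rows.
toy_df = pd.DataFrame({
    'discourse_effectiveness': ['Adequate', 'Effective', 'Ineffective'],
    'Adequate': [0.7, 0.2, 0.1],
    'Effective': [0.2, 0.7, 0.1],
    'Ineffective': [0.1, 0.1, 0.8],
})

# Probability columns are read in the order given by `labels`.
score = log_loss(toy_df['discourse_effectiveness'], toy_df[labels], labels=labels)

# Same quantity by hand: mean negative log-probability of the true class.
manual = -np.mean(np.log([0.7, 0.7, 0.8]))

print(score, manual)
```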

# Save Results

Before saving the results, I will format them to match the 2022 data.

In [None]:
preds_2021 = np.mean(all_2021_data, axis=0)
train_2021_df_output = train_2021_df.drop(columns=['discourse_start', 'discourse_end', 'discourse_type_num', 'predictionstring', 'inputs'])
train_2021_df_output[labels] = preds_2021
train_2021_df_output['discourse_effectiveness'] = train_2021_df_output[labels].idxmax(axis=1)
train_2021_df_output.to_csv('train_2021_preds.csv', index=False)

In [None]:
train_2021_df_output.head()