<a href="https://colab.research.google.com/github/yuriao/DataScienceProjects/blob/main/commonlit_debertav3_base_distilroberta_base.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


This notebook is based on https://www.kaggle.com/code/ao9mame/commonlit-deberta-with-transformers/notebook with some changes. I have tried to keep it a simple baseline that is quick to train. It fine-tunes two transformer models, microsoft/deberta-base and distilroberta-base.

This notebook is also based on:
- https://www.kaggle.com/code/synful/simple-distilroberta-base-10mins-to-train
- https://towardsdatascience.com/how-to-apply-transformers-to-any-length-of-text-a5601410af7f



In [None]:
from google.colab import drive
drive.mount('commonLit_data')

Drive already mounted at commonLit_data; to attempt to forcibly remount, call drive.mount("commonLit_data", force_remount=True).


In [None]:
!pip install transformers[torch]



In [None]:
!pip install datasets



In [None]:
!pip install sentencepiece



In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os

In [None]:
import re
import transformers
from transformers import AutoModel, AutoTokenizer, AutoConfig, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
from datasets import Dataset as Dataset1
from torch.utils.data import Dataset
from sklearn.metrics import mean_squared_error
import torch
import gc


import warnings
warnings.simplefilter("ignore")

## 1. Load data

In [None]:
DATA_DIR = "/content/commonLit_data/MyDrive/commonLit_data/"

prompts_train = pd.read_csv(DATA_DIR + "prompts_train.csv")
prompts_test = pd.read_csv(DATA_DIR + "prompts_test.csv")
summaries_train = pd.read_csv(DATA_DIR + "summaries_train.csv")
summaries_test = pd.read_csv(DATA_DIR + "summaries_test.csv")
sample_submission = pd.read_csv(DATA_DIR + "sample_submission.csv")


## 2. Remove [\n\r\t] characters from the text

In [None]:
summaries_train["text"] = summaries_train["text"].replace(re.compile(r'[\n\r\t]'), ' ', regex=True)
#prompts_train["prompt_text"] = prompts_train["prompt_text"].replace(re.compile(r'[\n\r\t]'), ' ', regex=True)

summaries_test["text"] = summaries_test["text"].replace(re.compile(r'[\n\r\t]'), ' ', regex=True)
#prompts_test["prompt_text"] = prompts_test["prompt_text"].replace(re.compile(r'[\n\r\t]'), ' ', regex=True)

In [None]:
# merge prompt and summaries
summaries_train = summaries_train.merge(prompts_train, how="left", on="prompt_id")
summaries_test = summaries_test.merge(prompts_test, how="left", on="prompt_id")

In [None]:
max_length=512

The distilled version of RoBERTa (distilroberta-base) is used to shorten training time, alongside DeBERTa base. Other transformer models could also be tried.

In [None]:

pth1='microsoft/deberta-base'
pth2='/content/commonLit_data/MyDrive/commonLit_data/distilroberta-base'

tokenizer1 = AutoTokenizer.from_pretrained(pth1)
tokenizer2 = AutoTokenizer.from_pretrained(pth2)

Use the GPU when one is available. The models themselves are created later, inside the cross-validation loop, as regression models with 2 labels: content and wording.

In [None]:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


Initialize one data collator per tokenizer.

In [None]:
data_collator1 = DataCollatorWithPadding(tokenizer=tokenizer1)
data_collator2 = DataCollatorWithPadding(tokenizer=tokenizer2)
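
Because the tokenizing functions in this notebook use `padding=False`, padding is left to the collator, which pads each batch only up to the longest sequence in that batch. The sketch below mimics that idea in plain Python. `pad_batch` and the `pad_token_id=0` default are hypothetical stand-ins for illustration; the real collator takes the pad id from the tokenizer and also returns tensors.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in one batch to the longest sequence in that batch.

    Hypothetical helper, not the transformers implementation.
    """
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = longest - n_tokens
        # Real tokens keep mask 1, padded positions get mask 0
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
    return batch


# Toy token ids, not produced by a real tokenizer
features = [{"input_ids": [101, 7, 8, 102]}, {"input_ids": [101, 9, 102]}]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to `max_length` keeps short summaries from being padded out to 512 tokens.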

The competition metric is Mean Columnwise Root Mean Squared Error (MCRMSE).

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    rmse = mean_squared_error(labels, predictions, squared=False)
    return {"rmse": rmse}
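
To make the metric concrete, here is a dependency-free sketch of MCRMSE: take the RMSE of each target column separately, then average the column scores. The `mcrmse` function and the toy numbers are mine, for illustration only. As far as I can tell, recent scikit-learn versions compute the same quantity when `mean_squared_error(..., squared=False)` receives two-column input, and newer releases also offer `root_mean_squared_error`, so it is worth checking which behavior your installed version has.

```python
import math


def mcrmse(y_true, y_pred):
    """Mean columnwise RMSE for row-major lists such as [[content, wording], ...]."""
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    column_rmse = []
    for col in range(n_cols):
        squared_errors = [
            (y_true[row][col] - y_pred[row][col]) ** 2 for row in range(n_rows)
        ]
        column_rmse.append(math.sqrt(sum(squared_errors) / n_rows))
    # Average the per-column RMSE values
    return sum(column_rmse) / n_cols


# Toy values, not competition data
y_true = [[0.0, 0.0], [0.0, 0.0]]
y_pred = [[3.0, 2.0], [-3.0, -2.0]]
score = mcrmse(y_true, y_pred)
print(score)  # column RMSEs are 3.0 and 2.0, so the mean is 2.5
```

Note that this differs from the square root of the pooled MSE, which for the same toy values would be sqrt(6.5), roughly 2.55.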

Build new train, validation, and test frames with just the required columns.

In [None]:
train_content_all = summaries_train[["prompt_question","text", "content", "wording"]] # use prompt_question rather than prompt_text; including prompt_text hurt performance


In [None]:
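from sklearn.model_selection import train_test_split
# test_size=0.7 holds out 70% of the labelled rows for the final check and trains on the remaining 30%
train_content, test_content = train_test_split(train_content_all, test_size=0.7, random_state=42)
test1 = test_content[["prompt_question", "text"]]  # held-out inputs without the label columns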

Define tokenizing function for train and test dataset.

In [None]:
def tokenize_function1(examples):
    labels = [examples["content"], examples["wording"]]
    tokenized1 = tokenizer1(examples["text"],
                           examples["prompt_question"],
                           padding=False,
                           truncation=True,
                           max_length=max_length)
    return {
        **tokenized1,
        "labels": labels,
    }


def tokenize_function2(examples):
    labels = [examples["content"], examples["wording"]]
    tokenized2 = tokenizer2(examples["text"],
                           examples["prompt_question"],
                           padding=False,
                           truncation=True,
                           max_length=max_length)
    return {
        **tokenized2,
        "labels": labels,
    }


def tokenize_function_test1(examples):
    tokenized1 = tokenizer1(examples["text"],
                           examples["prompt_question"],
                           padding=False,
                           truncation=True,
                           max_length=max_length)
    return tokenized1

def tokenize_function_test2(examples):
    tokenized2 = tokenizer2(examples["text"],
                           examples["prompt_question"],
                           padding=False,
                           truncation=True,
                           max_length=max_length)
    return tokenized2
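
The four tokenizing functions differ only in which tokenizer they call and whether they attach labels, so they could be generated from one factory. This is just a sketch of that design, not what the notebook runs: `make_tokenize_fn` is a hypothetical helper, and `fake_tokenizer` is a toy stand-in (one id per whitespace-separated word) so the example runs without downloading a model.

```python
def make_tokenize_fn(tokenizer, max_length=512, with_labels=True):
    """Return a per-example function suitable for Dataset.map(..., batched=False)."""
    def tokenize(example):
        encoded = dict(tokenizer(example["text"],
                                 example["prompt_question"],
                                 padding=False,
                                 truncation=True,
                                 max_length=max_length))
        if with_labels:
            # Two regression targets per example: [content, wording]
            encoded["labels"] = [example["content"], example["wording"]]
        return encoded
    return tokenize


def fake_tokenizer(text, text_pair, padding, truncation, max_length):
    # Toy stand-in for a real tokenizer: one id per word, cut at max_length
    words = (text + " " + text_pair).split()
    ids = list(range(len(words)))[:max_length]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


example = {"text": "a short summary", "prompt_question": "what happened",
           "content": 0.2, "wording": -0.1}
train_row = make_tokenize_fn(fake_tokenizer, max_length=4)(example)
test_row = make_tokenize_fn(fake_tokenizer, max_length=4, with_labels=False)(example)
print(train_row)
print(test_row)
```

With the real tokenizers, `make_tokenize_fn(tokenizer1)` would play the role of `tokenize_function1` and `make_tokenize_fn(tokenizer1, with_labels=False)` the role of `tokenize_function_test1`.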

Define the training arguments. Each additional epoch in 'num_train_epochs' adds roughly one minute of training time on a GPU.

In [None]:
from transformers import TrainingArguments

# Hyperparameter settings
training_args = TrainingArguments(
    output_dir="output",             # saving directory
    per_device_train_batch_size=6,   # training batch size
    per_device_eval_batch_size=6,    # validation batch size
    learning_rate=1.5e-5,            # learning rate
    lr_scheduler_type="linear",      # learning-rate scheduler type
    warmup_ratio=0.01,               # fraction of training used for learning-rate warmup
    num_train_epochs=3,              # number of epochs
    save_strategy="epoch",           # when to save checkpoints
    logging_strategy="epoch",        # when to log
    evaluation_strategy="epoch",     # when to evaluate on the validation set
    load_best_model_at_end=True,     # after training, load the model that scored best on the validation set
    metric_for_best_model="rmse",  # metric used to pick the best model
    greater_is_better=False,         # lower is better for MCRMSE, so set False
    fp16=False,                      # automatic mixed precision (set False when running on CPU)
    report_to='none',                # no reporting to WandB
    save_total_limit=1               # number of checkpoints to keep
)
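
To see what `lr_scheduler_type="linear"` with `warmup_ratio=0.01` means in practice, the sketch below works through the arithmetic for one fold. It assumes a single device, no gradient accumulation, and the 1074 training rows reported in the fold 1 log; `linear_schedule_lr` is a hypothetical re-implementation of a linear warmup followed by linear decay, written out only for illustration.

```python
import math


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


train_rows = 1074   # assumption: fold 1 training size from the log
batch_size = 6
epochs = 3
base_lr = 1.5e-5

steps_per_epoch = math.ceil(train_rows / batch_size)   # 179
total_steps = steps_per_epoch * epochs                 # 537
warmup_steps = math.ceil(total_steps * 0.01)           # 6

for step in (0, 3, warmup_steps, total_steps):
    print(step, linear_schedule_lr(step, base_lr, warmup_steps, total_steps))
```

So under these assumptions the warmup covers only the first 6 optimizer steps, and the learning rate then decays linearly over the remaining 531.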



Run the trainer.

In [None]:
from sklearn.model_selection import KFold
n_splits = 2
kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)
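
As a sanity check on the fold sizes, here is a small sketch of how K-fold divides the rows: when the row count does not divide evenly, the first few folds get one extra validation row. The helper `kfold_validation_sizes` is hypothetical, and the 2149 total is an assumption taken from the 1074 + 1075 rows shown in the `Map` progress lines of the training log.

```python
def kfold_validation_sizes(n_samples, n_splits):
    """Validation-fold sizes when n_samples rows are split into n_splits folds."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds take one additional row each
    return [base + 1 if i < extra else base for i in range(n_splits)]


n_samples = 2149  # assumption: 1074 + 1075 rows, as logged during training
two_fold = kfold_validation_sizes(n_samples, 2)   # [1075, 1074]
six_fold = kfold_validation_sizes(n_samples, 6)

for name, sizes in (("2 folds", two_fold), ("6 folds", six_fold)):
    train_sizes = [n_samples - v for v in sizes]
    print(name, "validation:", sizes, "train:", train_sizes)
```

Shuffling changes which rows land in each fold but not the sizes, so these numbers can be compared directly with the per-fold row counts in the log.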

In [None]:
trainer1s=[]
trainer2s=[]
for fold, (train_idx, val_idx) in enumerate(kfold.split(train_content)):
    print(f"Fold {fold + 1}")
    # dataframe to dataset obj
    train_dataset_content = Dataset1.from_pandas(train_content.iloc[train_idx,:], preserve_index=False) # content
    val_dataset_content = Dataset1.from_pandas(train_content.iloc[val_idx,:], preserve_index=False) # content

    # Mapping tokenizing function to the datasets
    train_tokenized_datasets_content1 = train_dataset_content.map(tokenize_function1, batched=False)
    val_tokenized_datasets_content1 = val_dataset_content.map(tokenize_function1, batched=False)

    train_tokenized_datasets_content2 = train_dataset_content.map(tokenize_function2, batched=False)
    val_tokenized_datasets_content2 = val_dataset_content.map(tokenize_function2, batched=False)

    # model for each fold
    model1 = AutoModelForSequenceClassification.from_pretrained(
        pth1,
        num_labels=2,
        problem_type="regression",
    ).to(device)

    model2 = AutoModelForSequenceClassification.from_pretrained(
        pth2,
        num_labels=2,
        problem_type="regression",
    ).to(device)

    trainer1 = Trainer(
        model=model1,
        train_dataset=train_tokenized_datasets_content1,
        eval_dataset=val_tokenized_datasets_content1,
        data_collator=data_collator1,
        args=training_args,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer1
    )

    trainer2 = Trainer(
        model=model2,
        train_dataset=train_tokenized_datasets_content2,
        eval_dataset=val_tokenized_datasets_content2,
        data_collator=data_collator2,
        args=training_args,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer2
    )

    trainer1.train()
    trainer2.train()

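    # Save each model to its own directory so later saves do not overwrite earlier ones
    trainer1.save_model(f"best_model_deberta_fold{fold + 1}")
    trainer1s.append(trainer1)

    trainer2.save_model(f"best_model_distilroberta_fold{fold + 1}")
    trainer2s.append(trainer2)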

Fold 1


Map:   0%|          | 0/1074 [00:00<?, ? examples/s]

Map:   0%|          | 0/1075 [00:00<?, ? examples/s]

Map:   0%|          | 0/1074 [00:00<?, ? examples/s]

Map:   0%|          | 0/1075 [00:00<?, ? examples/s]

Some weights of DebertaForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-base and are newly initialized: ['pooler.dense.weight', 'classifier.bias', 'pooler.dense.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /content/commonLit_data/MyDrive/commonLit_data/distilroberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a DebertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Rmse
1,0.5711,0.360972,0.587606
2,0.3042,0.296326,0.538626
3,0.2193,0.305767,0.548469


You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Rmse
1,0.6532,0.36362,0.595338
2,0.3574,0.343894,0.579787
3,0.2914,0.305783,0.543789


Fold 2


Map:   0%|          | 0/1075 [00:00<?, ? examples/s]

Map:   0%|          | 0/1074 [00:00<?, ? examples/s]

Map:   0%|          | 0/1075 [00:00<?, ? examples/s]

Map:   0%|          | 0/1074 [00:00<?, ? examples/s]

Some weights of DebertaForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-base and are newly initialized: ['pooler.dense.weight', 'classifier.bias', 'pooler.dense.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /content/commonLit_data/MyDrive/commonLit_data/distilroberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Rmse
1,0.5803,0.396351,0.627778
2,0.3034,0.350986,0.587149
3,0.2211,0.337376,0.57663


Epoch,Training Loss,Validation Loss,Rmse
1,0.6566,0.382105,0.612007
2,0.3341,0.32367,0.563833
3,0.2826,0.317403,0.557917


In [None]:
n_splits = 6
kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)

In [None]:

# Retrain distilroberta-base only, now with 6 folds; this replaces the trainer2s list built above
trainer2s=[]
for fold, (train_idx, val_idx) in enumerate(kfold.split(train_content)):
    print(f"Fold {fold + 1}")
    # dataframe to dataset obj
    train_dataset_content = Dataset1.from_pandas(train_content.iloc[train_idx,:], preserve_index=False) # content
    val_dataset_content = Dataset1.from_pandas(train_content.iloc[val_idx,:], preserve_index=False) # content

    # Mapping tokenizing function to the datasets
    train_tokenized_datasets_content2 = train_dataset_content.map(tokenize_function2, batched=False)
    val_tokenized_datasets_content2 = val_dataset_content.map(tokenize_function2, batched=False)

    # model for each fold
    model2 = AutoModelForSequenceClassification.from_pretrained(
        pth2,
        num_labels=2,
        problem_type="regression",
    ).to(device)

    trainer2 = Trainer(
        model=model2,
        train_dataset=train_tokenized_datasets_content2,
        eval_dataset=val_tokenized_datasets_content2,
        data_collator=data_collator2,
        args=training_args,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer2
    )
    trainer2.train()

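    # Save each fold to its own directory so folds do not overwrite each other
    trainer2.save_model(f"best_model_distilroberta_fold{fold + 1}")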
    trainer2s.append(trainer2)

Fold 1


Map:   0%|          | 0/1790 [00:00<?, ? examples/s]

Map:   0%|          | 0/359 [00:00<?, ? examples/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /content/commonLit_data/MyDrive/commonLit_data/distilroberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Rmse
1,0.5212,0.362177,0.594066
2,0.318,0.317474,0.554259
3,0.2584,0.312783,0.553197


Fold 2


Map:   0%|          | 0/1791 [00:00<?, ? examples/s]

Map:   0%|          | 0/358 [00:00<?, ? examples/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /content/commonLit_data/MyDrive/commonLit_data/distilroberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Rmse
1,0.567,0.358923,0.589726
2,0.3225,0.292325,0.530092
3,0.2669,0.270441,0.511901


Fold 3


Map:   0%|          | 0/1791 [00:00<?, ? examples/s]

Map:   0%|          | 0/358 [00:00<?, ? examples/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /content/commonLit_data/MyDrive/commonLit_data/distilroberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Rmse
1,0.5713,0.389116,0.620465
2,0.3349,0.292752,0.533453
3,0.2747,0.279281,0.52109


Fold 4


Map:   0%|          | 0/1791 [00:00<?, ? examples/s]

Map:   0%|          | 0/358 [00:00<?, ? examples/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /content/commonLit_data/MyDrive/commonLit_data/distilroberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Rmse
1,0.5461,0.343404,0.579871
2,0.3407,0.290145,0.531419
3,0.2726,0.286742,0.527353


Fold 5


Map:   0%|          | 0/1791 [00:00<?, ? examples/s]

Map:   0%|          | 0/358 [00:00<?, ? examples/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /content/commonLit_data/MyDrive/commonLit_data/distilroberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Rmse
1,0.5663,0.453307,0.669067
2,0.338,0.339021,0.578956
3,0.2682,0.356125,0.593316


Fold 6


Map:   0%|          | 0/1791 [00:00<?, ? examples/s]

Map:   0%|          | 0/358 [00:00<?, ? examples/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /content/commonLit_data/MyDrive/commonLit_data/distilroberta-base and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Rmse
1,0.564,0.432816,0.648987
2,0.3173,0.321412,0.560031
3,0.2666,0.289327,0.532243


Predict on the held-out split (test_content), which still has labels for scoring.

In [None]:
test_dataset = Dataset1.from_pandas(test1, preserve_index=False)
test_tokenized_dataset1 = test_dataset.map(tokenize_function_test1, batched=False)
test_tokenized_dataset2 = test_dataset.map(tokenize_function_test2, batched=False)

Map:   0%|          | 0/5016 [00:00<?, ? examples/s]

Map:   0%|          | 0/5016 [00:00<?, ? examples/s]

In [None]:
content_list=[]
wording_list=[]
# The loop runs len(trainer1s) times, so only the first len(trainer1s) entries of trainer2s contribute
for i in range(0,len(trainer1s)):
    predictions1=trainer1s[i].predict(test_tokenized_dataset1)
    predictions2=trainer2s[i].predict(test_tokenized_dataset2)
    # Column 0 holds the content prediction, column 1 the wording prediction
    content_list.append(predictions1.predictions[:, 0].tolist())
    wording_list.append(predictions1.predictions[:, 1].tolist())
    content_list.append(predictions2.predictions[:, 0].tolist())
    wording_list.append(predictions2.predictions[:, 1].tolist())

In [None]:
content_pred=np.mean(np.array(content_list).T,axis=1)
wording_pred=np.mean(np.array(wording_list).T,axis=1)
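
The ensemble here is a plain unweighted mean: for each example, average the predictions of all collected models. A dependency-free sketch of the same operation follows; `average_over_models` and the toy numbers are mine, for illustration only.

```python
def average_over_models(per_model_predictions):
    """Average predictions example by example.

    per_model_predictions: one list of predictions per model, all the same length.
    """
    n_models = len(per_model_predictions)
    # zip(*...) groups the predictions that belong to the same example
    return [sum(values) / n_models for values in zip(*per_model_predictions)]


# Toy predictions from two models for three examples
toy_content_list = [
    [0.5, -0.25, 1.0],
    [1.0, -0.75, 0.0],
]
toy_content_pred = average_over_models(toy_content_list)
print(toy_content_pred)
```

A weighted mean would be the natural next step if one model validates consistently better than the other.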

In [None]:
content_pred

array([ 0.6589453 , -0.07093564, -0.3442428 , ...,  0.0821618 ,
        0.99153143, -0.02248432])

In [None]:
# RMSE per target on the held-out split, then their mean (MCRMSE)
content_rmse = mean_squared_error(test_content['content'], content_pred, squared=False)
wording_rmse = mean_squared_error(test_content['wording'], wording_pred, squared=False)
print(content_rmse)
print(wording_rmse)
print(np.mean([content_rmse, wording_rmse]))

0.44483635750304557
0.6130521623644906
0.528944259933768
