<a href="https://colab.research.google.com/github/yuriao/DataScienceProjects/blob/main/commonlit_debertav3_base_distilroberta_base.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


This notebook is based on the notebook https://www.kaggle.com/code/ao9mame/commonlit-deberta-with-transformers/notebook with some changes. I have tried to keep the notebook simple baseline short on training time. Have used a transformer model distilroberta-base.

This notebook is also based on:
- https://www.kaggle.com/code/synful/simple-distilroberta-base-10mins-to-train
- https://towardsdatascience.com/how-to-apply-transformers-to-any-length-of-text-a5601410af7f



In [19]:
from google.colab import drive
drive.mount('commonLit_data')

Drive already mounted at commonLit_data; to attempt to forcibly remount, call drive.mount("commonLit_data", force_remount=True).


In [20]:
!pip install transformers[torch]



In [21]:
!pip install datasets



In [22]:
!pip install sentencepiece



In [23]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory


import os

In [24]:
import re
import transformers
from transformers import AutoModel, AutoTokenizer, AutoConfig, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
from datasets import Dataset as Dataset1
from torch.utils.data import Dataset
from sklearn.metrics import mean_squared_error
import torch
import gc
from transformers import TrainingArguments
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

import warnings
warnings.simplefilter("ignore")

## 1. load data

In [25]:
DATA_DIR = "/content/commonLit_data/MyDrive/commonLit_data/"

prompts_train = pd.read_csv(DATA_DIR + "prompts_train.csv")
prompts_test = pd.read_csv(DATA_DIR + "prompts_test.csv")
summaries_train = pd.read_csv(DATA_DIR + "summaries_train.csv")
summaries_test = pd.read_csv(DATA_DIR + "summaries_test.csv")
sample_submission = pd.read_csv(DATA_DIR + "sample_submission.csv")


## 2. Removing [\n\r\t] characters from the text.

In [26]:
summaries_train["text"] = summaries_train["text"].replace(re.compile(r'[\n\r\t]'), ' ', regex=True)
#prompts_train["prompt_text"] = prompts_train["prompt_text"].replace(re.compile(r'[\n\r\t]'), ' ', regex=True)

summaries_test["text"] = summaries_test["text"].replace(re.compile(r'[\n\r\t]'), ' ', regex=True)
#prompts_test["prompt_text"] = prompts_test["prompt_text"].replace(re.compile(r'[\n\r\t]'), ' ', regex=True)

In [27]:
# merge prompt and summaries
summaries_train = summaries_train.merge(prompts_train, how="left", on="prompt_id")
summaries_test = summaries_test.merge(prompts_test, how="left", on="prompt_id")

Use the following huggingface models

In [28]:
model_name='distilbert'

Using GPU. Model is initiated as a regression model with 2 labels, content and wording.

In [29]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Competition metric is Mean Columnwise Root Mean Squared Error（MCRMSE).

In [30]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    rmse = mean_squared_error(labels, predictions, squared=False)
    return {"rmse": rmse}

Define tokenizing function for train and test dataset.

In [31]:
def tokenize_function_train(examples):
    sep = tokenizer1.sep_token
    labels = [examples["content"], examples["wording"]]
    tokenized = tokenizer1(examples["prompt_title"]+sep+examples["prompt_question"]+sep+examples["text"],padding=True,truncation=True,max_length=512)
    return {**tokenized,"labels": labels}


def tokenize_function_test(examples):
    sep = tokenizer1.sep_token
    tokenized1 = tokenizer1(examples["prompt_title"]+sep+examples["prompt_question"]+sep+examples["text"],padding=True,truncation=True,max_length=512)
    return tokenized1

model training utility function

In [32]:
def huggingface_model_train(model_name,n_splits):

  data_collator1 = DataCollatorWithPadding(tokenizer=tokenizer1)

  # training parameter setting
  training_args = TrainingArguments(
      output_dir="output",             # saving directory
      per_device_train_batch_size=6,   # training batch sz
      per_device_eval_batch_size=6,    # validation batch sz
      learning_rate=1.5e-5,            # learning rate
      lr_scheduler_type="linear",      # learning changing scheduler
      warmup_ratio=0.01,               # 学習率のウォームアップの長さを指定
      num_train_epochs=4,              # エポック数
      save_strategy="epoch",           # チェックポイントの保存タイミング
      logging_strategy="epoch",        # ロギングのタイミング
      evaluation_strategy="epoch",     # 検証セットによる評価のタイミング
      load_best_model_at_end=True,     # 訓練後に開発セットで最良のモデルをロード
      metric_for_best_model="rmse",  # 最良のモデルを決定する評価指標
      greater_is_better=False,         # MCRMSEの場合、低い値が良いのでFalseを設定
      fp16=False,                      # 自動混合精度演算の有効化(CPUを利用する場合、Falseを設定)
      report_to='none',                # WandBへの出力
      save_total_limit=1               # 保存するモデル数
  )

  kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)

  trainers=[]
  for fold, (train_idx, val_idx) in enumerate(kfold.split(train_content)):
      print(f"Fold {fold + 1}")
      # dataframe to dataset obj
      train_dataset_content = Dataset1.from_pandas(train_content.iloc[train_idx,:], preserve_index=False) # content
      val_dataset_content = Dataset1.from_pandas(train_content.iloc[val_idx,:], preserve_index=False) # content

      # Mapping tokenizing function to the datasets
      train_tokenized_datasets_content1 = train_dataset_content.map(tokenize_function_train, batched=False)
      val_tokenized_datasets_content1 = val_dataset_content.map(tokenize_function_train, batched=False)

      # model for each fold
      model = AutoModelForSequenceClassification.from_pretrained(
          model_name,
          num_labels=2,
          problem_type="regression",
      ).to(device)

      trainer = Trainer(
          model=model,
          train_dataset=train_tokenized_datasets_content1,
          eval_dataset=val_tokenized_datasets_content1,
          data_collator=data_collator1,
          args=training_args,
          compute_metrics=compute_metrics,
          tokenizer=tokenizer1
      )

      trainer.train()

      trainer.save_model("best_model")
      trainers.append(trainer)

      return trainers

prediction function

In [33]:
def huggingface_model_predict(test1,trainers):
  test_dataset = Dataset1.from_pandas(test1, preserve_index=False)
  test_tokenized_dataset1 = test_dataset.map(tokenize_function_test, batched=False)

  content_list=[]
  wording_list=[]
  for i in range(0,len(trainers)):
      predictions1=trainers[i].predict(test_tokenized_dataset1)
      content_list.append(predictions1.predictions[:, 0].tolist())
      wording_list.append(predictions1.predictions[:, 1].tolist())

  content_pred=np.mean(np.array(content_list).T,axis=1)
  wording_pred=np.mean(np.array(wording_list).T,axis=1)

  return content_pred,wording_pred

New train/validation/test made with just the required columns.

In [34]:
train_content_all = summaries_train[["prompt_question","prompt_title","text", "content", "wording"]] # use question, not text, prompt_text drag performance
train_content,test_content=train_test_split(train_content_all,test_size=0.7,random_state=42)
test1=test_content[["prompt_question","prompt_title","text"]]


Predicting on test data

In [35]:
model_names=['albert-large-v2']

In [36]:
mcrmse_all=[]
for model_name in model_names:
  tokenizer1 = AutoTokenizer.from_pretrained(model_name)
  trainers=huggingface_model_train(model_name,6)
  content_pred,wording_pred=huggingface_model_predict(test1,trainers)
  mcrmse_all.append([mean_squared_error(content_pred, test_content['content'], squared=False),mean_squared_error(wording_pred, test_content['wording'], squared=False),np.mean([mean_squared_error(content_pred, test_content['content'], squared=False),mean_squared_error(wording_pred, test_content['wording'], squared=False)])])


Fold 1


Map:   0%|          | 0/1790 [00:00<?, ? examples/s]

Map:   0%|          | 0/359 [00:00<?, ? examples/s]

Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-large-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a AlbertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Rmse
1,0.5262,0.39947,0.62361
2,0.3292,0.327295,0.564308
3,0.2526,0.310611,0.54809
4,0.1976,0.292922,0.535072
5,0.1442,0.288434,0.530843
6,0.1024,0.288472,0.531212


Map:   0%|          | 0/5016 [00:00<?, ? examples/s]

In [37]:
print(mcrmse_all)

[[0.4355934107652204, 0.5950492000967131, 0.5153213054309668]]
