<a href="https://www.kaggle.com/code/yaaangzhou/commonlit-roberta-baseline-model?scriptVersionId=142279621" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

**Created by Yang Zhou**

**[CommonLit]Roberta Baseline Model**

**7 Sep 2023**

# <center style="font-family: consolas; font-size: 32px; font-weight: bold;">[CommonLit]Roberta Baseline Model</center>
<p><center style="color:#949494; font-family: consolas; font-size: 20px;">Automatically assess summaries written by students in grades 3-12</center></p>

***

**The goal of this competition is to build a model that automatically scores student summaries, helping teachers and learning platforms give students better feedback on their writing.**

**I tried classical ML models in another [notebook](https://www.kaggle.com/code/yaaangzhou/commonlit-machine-learning-baseline-model), but they clearly did not work well as a baseline, so here I fine-tune a pretrained language model (DistilRoBERTa) for the task instead.**

# 0. Imports

In [1]:
import numpy as np
import pandas as pd

import transformers
from transformers import AutoModel, AutoTokenizer, AutoConfig, AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding
from transformers import TrainingArguments, Trainer

from datasets import Dataset as Dataset_HF
from torch.utils.data import Dataset

import torch
import gc
import re

# Metrics
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

import warnings
warnings.simplefilter("ignore")



# 1. Load Data

In [2]:
data_dir = "/kaggle/input/commonlit-evaluate-student-summaries/"
train_pro = pd.read_csv(data_dir + 'prompts_train.csv')
train_sum = pd.read_csv(data_dir + 'summaries_train.csv')

test_pro = pd.read_csv(data_dir + 'prompts_test.csv')
test_sum = pd.read_csv(data_dir + 'summaries_test.csv')

submission = pd.read_csv(data_dir + 'sample_submission.csv')

In [3]:
train = train_sum.merge(train_pro, how="left", on="prompt_id")
test = test_sum.merge(test_pro, how="left", on="prompt_id")

print("Full train dataset shape is {}".format(train.shape))

Full train dataset shape is (7165, 8)


In [4]:
train.head(3)

| | student_id | prompt_id | text | content | wording | prompt_question | prompt_title | prompt_text |
|---|---|---|---|---|---|---|---|---|
| 0 | 000e8c3c7ddb | 814d6b | The third wave was an experimentto see how peo... | 0.205683 | 0.380538 | Summarize how the Third Wave developed over su... | The Third Wave | Background \r\nThe Third Wave experiment took ... |
| 1 | 0020ae56ffbf | ebad26 | They would rub it up with soda to make the sme... | -0.548304 | 0.506755 | Summarize the various ways the factory would u... | Excerpt from The Jungle | With one member trimming beef in a cannery, an... |
| 2 | 004e978e639e | 3b9047 | In Egypt, there were many occupations and soci... | 3.128928 | 4.231226 | In complete sentences, summarize the structure... | Egyptian Social Structure | Egyptian society was structured like a pyramid... |


# 2. Data Preprocessing

First, we replace newline, carriage-return, and tab characters in the summary text with spaces, because they carry no semantic meaning.

In [5]:
train['text'] = train["text"].replace(re.compile(r'[\n\r\t]'), ' ', regex=True)
test['text'] = test["text"].replace(re.compile(r'[\n\r\t]'), ' ', regex=True)
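
To see what this substitution does to a single string, here is a minimal sketch using only the standard library. The helper name `strip_control_whitespace` and the sample string are illustrative assumptions, not part of the notebook. Each matched character becomes one space, so a `\r\n` pair leaves two consecutive spaces.

```python
import re

# Same character class as the notebook cell above: newline, carriage return, tab.
CONTROL_WHITESPACE = re.compile(r"[\n\r\t]")


def strip_control_whitespace(text: str) -> str:
    """Replace each newline, carriage return, or tab with a single space."""
    return CONTROL_WHITESPACE.sub(" ", text)


sample = "Line one\r\nLine two\tend"  # made-up example text
cleaned = strip_control_whitespace(sample)
print(repr(cleaned))

# Optional extra step (not done in the notebook): collapse runs of spaces.
collapsed = re.sub(r" {2,}", " ", cleaned)
print(repr(collapsed))
```

Whether the doubled spaces matter depends on the tokenizer; the notebook leaves them in place.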

# 3. Set Configuration

In [6]:
class CFG:
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    max_length=512
    hidden_dropout_prob=0.005
    attention_probs_dropout_prob=0.005
    model_name = "distilroberta-base"
    tokenizer = AutoTokenizer.from_pretrained('/kaggle/input/distilroberta-base')
    model = AutoModelForSequenceClassification.from_pretrained('/kaggle/input/distilroberta-base',num_labels=2,problem_type="regression").to(device)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /kaggle/input/distilroberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
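model_config = AutoConfig.from_pretrained('/kaggle/input/distilroberta-base')
# Note: CFG.model above was already built without this config, so these dropout
# settings only take effect if the model is re-created with config=model_config.
model_config.update({
    "hidden_dropout_prob": CFG.hidden_dropout_prob,
    "attention_probs_dropout_prob": CFG.attention_probs_dropout_prob,
    "num_labels": 2,
    "problem_type": "regression",
})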

## Data Collator

In [8]:
data_collator = DataCollatorWithPadding(tokenizer=CFG.tokenizer)

## Evaluation Function

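The competition metric is the mean columnwise root mean squared error (MCRMSE):

$$\mathrm{MCRMSE} = \frac{1}{N_t}\sum_{j=1}^{N_t}\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{ij}-\hat{y}_{ij}\right)^2}$$

where $N_t$ is the number of target columns (here 2: content and wording), $n$ is the number of summaries, and $y_{ij}$ and $\hat{y}_{ij}$ are the true and predicted scores.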

In [9]:
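def caculate_mcrmse(eval_pred):
    # eval_pred holds two arrays of shape (n_samples, 2):
    # column 0 is content, column 1 is wording
    predictions, labels = eval_pred
    squared_errors = np.square(predictions - labels)
    # Average over samples (axis=0) to get one MSE per target column
    mean_squared_errors = np.mean(squared_errors, axis=0)
    rmse = np.sqrt(mean_squared_errors)

    # MCRMSE is the mean of the per-column RMSEs
    mcrmse_value = np.mean(rmse)
    content_rmse = rmse[0]
    wording_rmse = rmse[1]

    return {
        "mcrmse": mcrmse_value,
        "content_rmse": content_rmse,
        "wording_rmse": wording_rmse
    }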
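
As a quick sanity check of the metric, the sketch below applies the same computation to a tiny hand-made example. The function name `mcrmse` and the toy arrays are assumptions for illustration only; column 0 stands in for content and column 1 for wording.

```python
import numpy as np


def mcrmse(predictions: np.ndarray, labels: np.ndarray) -> float:
    """Mean of the per-column RMSEs, computed as in the metric function above."""
    per_column_mse = np.mean(np.square(predictions - labels), axis=0)
    return float(np.mean(np.sqrt(per_column_mse)))


# Toy data: two samples, two targets (content, wording).
toy_predictions = np.array([[1.0, 2.0], [3.0, 4.0]])
toy_labels = np.array([[0.0, 2.0], [3.0, 2.0]])

# Column 0: squared errors 1 and 0, so MSE 0.5 and RMSE sqrt(0.5).
# Column 1: squared errors 0 and 4, so MSE 2.0 and RMSE sqrt(2).
toy_score = mcrmse(toy_predictions, toy_labels)
print(toy_score)
```

The result is the average of `sqrt(0.5)` and `sqrt(2)`, which shows that each target contributes its own RMSE before averaging.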

In [10]:
df_train, df_valid = train_test_split(train, test_size=0.2, random_state=42, stratify=train['prompt_id'])

# 4. Dataset for different targets

In [11]:
df_train.head(3)

| | student_id | prompt_id | text | content | wording | prompt_question | prompt_title | prompt_text |
|---|---|---|---|---|---|---|---|---|
| 3916 | 8a31b8cc1996 | 3b9047 | In the social pyramid of ancient Egypt the pha... | -0.077267 | 0.424365 | In complete sentences, summarize the structure... | Egyptian Social Structure | Egyptian society was structured like a pyramid... |
| 3985 | 8c9411cfc953 | 39c16e | Aristotle claims that an ideal tragedy should ... | 0.55907 | -0.634924 | Summarize at least 3 elements of an ideal trag... | On Tragedy | Chapter 13 \r\nAs the sequel to what has alrea... |
| 1888 | 4387107feb4d | 3b9047 | The ancient Egyptian system of government was ... | 1.376083 | 2.389443 | In complete sentences, summarize the structure... | Egyptian Social Structure | Egyptian society was structured like a pyramid... |


In [12]:
train_content = df_train[["prompt_question", "text", "content", "wording"]]
valid_content = df_valid[["prompt_question", "text", "content", "wording"]]

df_test = test[["prompt_question","text"]]

**Next, we convert the pandas DataFrames into Hugging Face `Dataset` objects.**

In [13]:
train_dataset_content = Dataset_HF.from_pandas(train_content, preserve_index=False) 
valid_dataset_content = Dataset_HF.from_pandas(valid_content, preserve_index=False) 
test_dataset = Dataset_HF.from_pandas(df_test, preserve_index=False) 

# 5. Tokenizer

**Now we tokenize the text, pairing each summary with its prompt question. The training and test sets are handled differently: for the training set, the function returns a dictionary containing the input IDs, attention mask, and labels.**

**For the test set, only the input IDs and attention mask are returned.**

In [14]:
def tokenize_function(examples,dataset='train'):
    if dataset == 'train':
        labels = [examples["content"], examples["wording"]]
        tokenized = CFG.tokenizer(examples["text"],
                                  examples["prompt_question"],
                                  padding=False,
                                  truncation=True,
                                  max_length=CFG.max_length)
        return {**tokenized,"labels": labels}
        
    elif dataset == 'test':
        tokenized = CFG.tokenizer(examples["text"],
                                  examples["prompt_question"],
                                  padding=False,
                                  truncation=True,
                                  max_length=CFG.max_length)
        
        return tokenized
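
The two branches above differ only in whether labels are attached. The sketch below illustrates that structure with a single tokenizer call. It is a minimal sketch under stated assumptions: `toy_tokenizer` is a hypothetical stand-in for `CFG.tokenizer` (it splits on whitespace and uses token lengths as fake IDs), and `encode_example` is an illustrative name, not a function from the notebook or from `transformers`.

```python
from typing import Any, Dict, List


def toy_tokenizer(text: str, text_pair: str, max_length: int) -> Dict[str, List[int]]:
    """Hypothetical stand-in for CFG.tokenizer; the fake ids are token lengths."""
    tokens = (text.split() + ["[SEP]"] + text_pair.split())[:max_length]
    return {
        "input_ids": [len(token) for token in tokens],
        "attention_mask": [1] * len(tokens),
    }


def encode_example(example: Dict[str, Any], max_length: int = 512) -> Dict[str, Any]:
    """Tokenize once; attach labels only when both targets are present."""
    encoded: Dict[str, Any] = toy_tokenizer(
        example["text"], example["prompt_question"], max_length
    )
    if "content" in example and "wording" in example:
        # Same order as the notebook: index 0 is content, index 1 is wording.
        encoded["labels"] = [example["content"], example["wording"]]
    return encoded


# Made-up rows shaped like the training and test data.
train_row = {"text": "They rub it with soda", "prompt_question": "Summarize the ways",
             "content": -0.5, "wording": 0.5}
test_row = {"text": "Example text 1", "prompt_question": "Summarize the ways"}

train_encoded = encode_example(train_row)
test_encoded = encode_example(test_row)
print(sorted(train_encoded))
print(sorted(test_encoded))
```

The label order matters: the metric function reads column 0 as content and column 1 as wording, and the prediction step slices the output the same way.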

In [15]:
train_tokenized_datasets_content = train_dataset_content.map(lambda example: tokenize_function(example, dataset='train'), batched=False)
valid_tokenized_datasets_content = valid_dataset_content.map(lambda example: tokenize_function(example, dataset='train'), batched=False)
test_tokenized_datasets_content = test_dataset.map(lambda example: tokenize_function(example,dataset='test'),batched=False)

  0%|          | 0/5732 [00:00<?, ?ex/s]

  0%|          | 0/1433 [00:00<?, ?ex/s]

  0%|          | 0/4 [00:00<?, ?ex/s]

In [16]:
gc.collect()

174

# 6. Training a model

In [17]:
training_args = TrainingArguments(
    output_dir="output",             
    per_device_train_batch_size=8,   
    per_device_eval_batch_size=4,    
    learning_rate=1.5e-5,            
    lr_scheduler_type="linear",      
    warmup_ratio=0.01,               
    num_train_epochs=15,              
    save_strategy="epoch",           
    logging_strategy="epoch",        
    evaluation_strategy="epoch",    
    load_best_model_at_end=True,     
    metric_for_best_model="mcrmse",  
    greater_is_better=False,         
    fp16=False,                      
    report_to='none',                
    save_total_limit=1               
)

trainer = Trainer(
    model=CFG.model,
    train_dataset=train_tokenized_datasets_content,
    eval_dataset=valid_tokenized_datasets_content,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=caculate_mcrmse,
    tokenizer=CFG.tokenizer
)
trainer.train()

trainer.save_model("best_model")

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Mcrmse | Content Rmse | Wording Rmse |
|---|---|---|---|---|---|
| 1 | 0.4531 | 0.288443 | 0.531323 | 0.452969 | 0.609677 |
| 2 | 0.2762 | 0.35039 | 0.586642 | 0.507647 | 0.665638 |
| 3 | 0.2281 | 0.369965 | 0.600998 | 0.507371 | 0.694625 |
| 4 | 0.1928 | 0.261579 | 0.50836 | 0.452238 | 0.564481 |
| 5 | 0.1673 | 0.249775 | 0.495507 | 0.430333 | 0.560681 |
| 6 | 0.1393 | 0.268534 | 0.514674 | 0.454302 | 0.575046 |
| 7 | 0.1215 | 0.259357 | 0.506212 | 0.450473 | 0.561951 |
| 8 | 0.104 | 0.298844 | 0.545781 | 0.514698 | 0.576865 |
| 9 | 0.0934 | 0.281917 | 0.527285 | 0.464934 | 0.589635 |
| 10 | 0.0815 | 0.292449 | 0.538026 | 0.483466 | 0.592586 |
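
The columns of this log are related, and a short sketch makes the relationship explicit. The validation loss is consistent with a mean squared error averaged over both targets, so it can be rebuilt from the two per-target RMSEs. With `metric_for_best_model="mcrmse"` and `greater_is_better=False`, the checkpoint kept at the end is the one with the lowest MCRMSE. The numbers are copied from the log above; the names `LOG` and `reconstructed_loss` are illustrative assumptions.

```python
# (epoch, validation_loss, mcrmse, content_rmse, wording_rmse), copied from the log above
LOG = [
    (1, 0.288443, 0.531323, 0.452969, 0.609677),
    (2, 0.350390, 0.586642, 0.507647, 0.665638),
    (3, 0.369965, 0.600998, 0.507371, 0.694625),
    (4, 0.261579, 0.508360, 0.452238, 0.564481),
    (5, 0.249775, 0.495507, 0.430333, 0.560681),
    (6, 0.268534, 0.514674, 0.454302, 0.575046),
    (7, 0.259357, 0.506212, 0.450473, 0.561951),
    (8, 0.298844, 0.545781, 0.514698, 0.576865),
    (9, 0.281917, 0.527285, 0.464934, 0.589635),
    (10, 0.292449, 0.538026, 0.483466, 0.592586),
]


def reconstructed_loss(content_rmse: float, wording_rmse: float) -> float:
    """MSE averaged over both targets, rebuilt from the per-target RMSEs."""
    return (content_rmse ** 2 + wording_rmse ** 2) / 2


# Largest difference between the logged loss and the reconstructed one.
max_gap = max(abs(loss - reconstructed_loss(c, w)) for _, loss, _, c, w in LOG)

# Lower MCRMSE is better, so the best epoch is the one that minimizes it.
best_epoch, _, best_mcrmse, _, _ = min(LOG, key=lambda row: row[2])

print(f"max gap between logged and reconstructed loss: {max_gap:.2e}")
print(f"best logged epoch: {best_epoch} (mcrmse={best_mcrmse})")
```

Among the ten epochs logged here, training loss keeps falling while validation MCRMSE bottoms out at epoch 5, which is the usual sign that later epochs are overfitting.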


# 7. Prediction

In [18]:
predictions = trainer.predict(test_tokenized_datasets_content)
predictions

PredictionOutput(predictions=array([[-1.2407573, -1.0679674],
       [-1.2453878, -1.0692157],
       [-1.2433419, -1.0744982],
       [-1.2469695, -1.069482 ]], dtype=float32), label_ids=None, metrics={'test_runtime': 0.0102, 'test_samples_per_second': 391.662, 'test_steps_per_second': 97.915})

In [19]:
content_list = predictions.predictions[:, 0].tolist()
wording_list = predictions.predictions[:, 1].tolist()

In [20]:
df_test

| | prompt_question | text |
|---|---|---|
| 0 | Summarize... | Example text 1 |
| 1 | Summarize... | Example text 2 |
| 2 | Summarize... | Example text 3 |
| 3 | Summarize... | Example text 4 |


In [21]:
submission["content"] = content_list
submission["wording"] = wording_list

submission.to_csv("submission.csv", index=False)
submission.head()

| | student_id | content | wording |
|---|---|---|---|
| 0 | 000000ffffff | -1.240757 | -1.067967 |
| 1 | 111111eeeeee | -1.245388 | -1.069216 |
| 2 | 222222cccccc | -1.243342 | -1.074498 |
| 3 | 333333dddddd | -1.246969 | -1.069482 |
