In [1]:
import warnings
warnings.filterwarnings("ignore")

### Loading and preprocessing the dataset

In [2]:
from datasets import load_dataset 
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_dataset = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)

tokenized_datasets = raw_dataset.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map: 100%|██████████| 1725/1725 [00:00<00:00, 10893.57 examples/s]


### Defining the Trainer Arguments

In [3]:
from transformers import TrainingArguments
from transformers import AutoModelForSequenceClassification

training_args = TrainingArguments("test_trainer")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)  
# `bert-base-uncased` was not pretrained with a sequence-classification head, which is why the warning below appears.

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


_You will notice that, unlike before, you get a warning after instantiating this pretrained model. This is because BERT has not been pretrained on classifying pairs of sentences, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been added instead. The warning indicates that some weights were randomly initialized (the ones for the new head, `classifier.weight` and `classifier.bias`); depending on the library version, it may also report that some weights were not used (the ones corresponding to the dropped pretraining head). It concludes by encouraging you to train the model, which is exactly what we are going to do now._
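To make "a new head" concrete, here is a minimal NumPy sketch of what a two-label classification head computes. It is not the library's code: the pooled output is random stand-in data, and the initialization scheme (small normal weights, zero bias) is an assumption chosen for illustration. Only the shapes follow from `bert-base-uncased` (hidden size 768) and `num_labels=2`.

```python
import numpy as np

HIDDEN_SIZE = 768  # hidden size of bert-base-uncased
NUM_LABELS = 2     # matches num_labels=2 in the cell above

rng = np.random.default_rng(0)

# Stand-ins for the two tensors named in the warning.
# The init scheme below is an assumption for illustration only.
classifier_weight = rng.normal(0.0, 0.02, size=(NUM_LABELS, HIDDEN_SIZE))
classifier_bias = np.zeros(NUM_LABELS)

# Random stand-in for the pretrained body's pooled output (batch of 4 pairs).
pooled_output = rng.normal(size=(4, HIDDEN_SIZE))

# The head is a single linear layer: one logit per label for each example.
logits = pooled_output @ classifier_weight.T + classifier_bias
print(logits.shape)  # (4, 2)
```

Because these weights start out random, the logits carry no information about paraphrase detection until the model is fine-tuned.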

### Why you can skip `data_collator=data_collator`

When you pass a tokenizer to the `Trainer`, it uses `DataCollatorWithPadding` as its default data collator, so each batch is still padded correctly. Passing `data_collator=data_collator` explicitly, as the next cell does, is therefore **redundant** (though harmless): leaving it out produces the same behavior.
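For intuition about what padding per batch means, the following is a simplified plain-Python sketch, not the implementation of `DataCollatorWithPadding`. The helper `pad_batch` is hypothetical, the middle token IDs are made up, and the pad ID of 0 is assumed to match `bert-base-uncased`.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence to the longest one in this batch (hypothetical helper)."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"]
        n_pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch


# Two tokenized examples of different lengths (illustrative token IDs).
features = [
    {"input_ids": [101, 2023, 102]},
    {"input_ids": [101, 2023, 2003, 1037, 102]},
]
batch = pad_batch(features)
print(batch["input_ids"][0])       # [101, 2023, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

Padding to the longest sequence in each batch, rather than to a fixed global length, avoids spending compute on padding tokens in batches of short sentences.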

In [None]:
from transformers import Trainer 

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer
)

trainer.train()

In [9]:
# Predict on the validation set: `predictions` holds the logits, `label_ids` the true labels

predicted = trainer.predict(tokenized_datasets["validation"])
print(predicted.predictions.shape, predicted.label_ids.shape)

100%|██████████| 51/51 [00:21<00:00,  2.41it/s]

(408, 2) (408,)
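The `(408, 2)` array holds one pair of logits per validation example, so a natural next step is to take the argmax over the last axis and compare with the labels. The sketch below uses random stand-in arrays of the same shapes so that it runs on its own; the accuracy it computes is therefore meaningless and is not a result from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins with the same shapes as predicted.predictions / predicted.label_ids.
logits = rng.normal(size=(408, 2))
label_ids = rng.integers(0, 2, size=408)

# Pick the label with the highest logit for each example.
preds = np.argmax(logits, axis=-1)

# Fraction of examples where the predicted label matches the true label.
accuracy = float((preds == label_ids).mean())
print(preds.shape)  # (408,)
```

In the notebook itself, the same two steps would be applied to the real arrays: `np.argmax(predicted.predictions, axis=-1)` compared against `predicted.label_ids`.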



