We will start with the microsoft/deberta-v3-base model and the MRPC subset of GLUE.

In [1]:
# !pip install transformers
# !pip install datasets
# !pip3 install torch torchvision
# ! pip install ipywidgets widgetsnbextension pandas-profiling


## Load Model
Here is the documentation:
https://huggingface.co/transformers/v4.9.1/model_doc/deberta_v2.html
For config:
https://huggingface.co/docs/transformers/v4.41.3/en/main_classes/configuration#transformers.PretrainedConfig



In [60]:
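from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig

# MODEL_NAME must be defined before it is used to load the config.
MODEL_NAME = "microsoft/deberta-v3-base"

# Load the pretrained config and request a two-class output (paraphrase / not paraphrase).
config = AutoConfig.from_pretrained(MODEL_NAME)
config.num_labels = 2

# AutoModelForSequenceClassification adds a classification head on top of the pretrained model.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, config=config)
# use_fast=False loads the slow tokenizer (see the note after this cell).
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)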

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


*2024.06.17* I encountered an issue while loading the tokenizer. This post solved the problem (make sure to restart the kernel): https://discuss.huggingface.co/t/error-with-new-tokenizers-urgent/2847  
The "use_fast" parameter comes from that post.  
Another question is whether to set the number of labels to 1 or 2, since this is a binary classification problem. According to this Stack Overflow thread, 2 is also acceptable:  
https://stackoverflow.com/questions/71768061/huggingface-transformers-classification-using-num-labels-1-vs-2
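
To make the 1-vs-2 question more concrete, here is a small pure-Python sketch (not the Transformers implementation) showing that a two-logit softmax cross-entropy gives the same loss as a single-logit sigmoid binary cross-entropy applied to the difference of the two logits. The function names and the example logits are made up for illustration. One caveat: as far as I understand, Transformers treats `num_labels=1` as a regression problem (MSE loss) unless `problem_type` is set in the config, so `num_labels=2` is the simpler choice here.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy for one example, given a list of per-class logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid_binary_cross_entropy(logit, label):
    """Binary cross-entropy for one example, given a single logit for class 1."""
    return softplus(-logit) if label == 1 else softplus(logit)

two_logits = [0.3, -1.2]                        # hypothetical logits for classes 0 and 1
single_logit = two_logits[1] - two_logits[0]    # equivalent single logit for class 1

for label in (0, 1):
    ce = softmax_cross_entropy(two_logits, label)
    bce = sigmoid_binary_cross_entropy(single_logit, label)
    print(label, round(ce, 6), round(bce, 6))
```

The two columns printed for each label should agree, which is why a two-label head with cross-entropy is a valid way to model a binary task.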

## Load Data

In [53]:
from datasets import load_dataset
datasets = load_dataset("nyu-mll/glue", "mrpc")

In [54]:
# First take a sample of the data
train_dataset = datasets['train']
sample_1 = train_dataset[0]
sample_1

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

### Representation
The first problem I faced is how to represent a paraphrase example: each one consists of two sentences and a label indicating whether they are paraphrases of each other. The label is clearly the output, but what should the input look like? I found a good answer here: https://huggingface.co/transformers/v3.0.2/glossary.html#token-type-ids
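
As a rough illustration of the layout described in that glossary entry, the sketch below builds the BERT-style pair encoding `[CLS] A [SEP] B [SEP]` by hand, with token type id 0 for the first segment and 1 for the second. It uses a toy whitespace split instead of the real subword tokenizer, and `encode_pair` is a hypothetical helper, so the tokens will not match what the DeBERTa tokenizer produces; only the structure is the point.

```python
def encode_pair(tokens_a, tokens_b, cls="[CLS]", sep="[SEP]"):
    """Build a BERT-style sentence-pair input from two token lists."""
    tokens = [cls] + tokens_a + [sep] + tokens_b + [sep]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Every real (non-padding) token gets attention mask 1.
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask

# Toy whitespace "tokenization" (an assumption for illustration only).
sentence1 = "Amrozi accused his brother".split()
sentence2 = "He accused his brother".split()

tokens, token_type_ids, attention_mask = encode_pair(sentence1, sentence2)
for tok, seg in zip(tokens, token_type_ids):
    print(f"{tok:10s} segment={seg}")
```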

In [55]:
# !pip install numpy==1.26.4

*2024.06.17* I encountered a problem: after converting the dataset to torch format, accessing the first item of the train set raised this error:   
ValueError: Unable to avoid copy while creating an array as requested.  
I found a solution on this page  
https://support.gurobi.com/hc/en-us/articles/25787048531601-Compatibility-issues-with-numpy-2-0#:~:text=ValueError%3A%20Unable%20to%20avoid%20copy,4).  
I fixed the problem by downgrading numpy to 1.26.4 (the commented-out install command in the cell above).
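
A small guard like the one below could flag this situation before running the rest of the notebook. It is only a sketch: `may_hit_copy_error` is a hypothetical helper, and the rule it encodes (NumPy major version 2 or higher) is an assumption based on the error above; newer releases of the libraries involved may no longer need the downgrade.

```python
from importlib.metadata import PackageNotFoundError, version

def may_hit_copy_error(numpy_version: str) -> bool:
    """Return True if the version string has a major version of 2 or higher."""
    major = int(numpy_version.split(".")[0])
    return major >= 2

try:
    installed = version("numpy")
except PackageNotFoundError:
    installed = None  # numpy is not installed in this environment

if installed is None:
    print("numpy is not installed")
elif may_hit_copy_error(installed):
    print(f"numpy {installed}: consider `pip install numpy==1.26.4` and restart the kernel")
else:
    print(f"numpy {installed}: no downgrade needed for this issue")
```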

In [56]:
import torch
from torch.utils.data import DataLoader
# Tokenize the entire dataset, padding every sentence pair to a length of 102 tokens.
# I first ran tokenizer() on the train, validation, and test sets separately and saw that the longest pair is 102 tokens.
def tokenize(sample):
    tokenized_dataset = tokenizer(
        sample['sentence1'],
        sample['sentence2'],  
        truncation=True,               # Truncate sequences longer than the model's max length
        padding='max_length',          # Pad to the maximum length
        max_length=102,                # Longest tokenized pair found across the three splits
        return_token_type_ids=True,    # Return token type IDs
        return_attention_mask=True,    # Return attention mask
    )
    return tokenized_dataset

tokenized_datasets = datasets.map(tokenize, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['sentence1','sentence2','idx'])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
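
The length check mentioned in the comments (finding the longest tokenized pair before fixing `max_length`) could look roughly like this. The sketch is written against a generic `encode` callable so it runs without downloading anything; `max_pair_length` and `toy_encode` are hypothetical names, and the toy encoder is only a stand-in for the real tokenizer.

```python
def max_pair_length(pairs, encode):
    """Return the longest encoded length over an iterable of (sentence1, sentence2) pairs."""
    return max(len(encode(s1, s2)) for s1, s2 in pairs)

def toy_encode(s1, s2):
    """Stand-in encoder: whitespace tokens plus three special tokens ([CLS], [SEP], [SEP])."""
    return ["[CLS]"] + s1.split() + ["[SEP]"] + s2.split() + ["[SEP]"]

toy_pairs = [
    ("a b c", "d e"),   # 3 + 2 tokens + 3 specials = 8
    ("a", "b"),         # 1 + 1 tokens + 3 specials = 5
]
longest = max_pair_length(toy_pairs, toy_encode)
print(longest)
```

With the real tokenizer, `encode` could be something like `lambda a, b: tokenizer(a, b)["input_ids"]`, applied to the sentence pairs of each split in turn, taking the maximum over the three results.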

## Train the model

In [61]:
dataloaders = {set_name: DataLoader(dataset, batch_size=2) for set_name, dataset in tokenized_datasets.items()}
# next(iter(dataloaders['train']))

In [62]:
output = None
for batch in dataloaders['train']:
    # Each batch already contains 'labels' (y) plus input_ids, attention_mask, and token_type_ids (x).
    output = model(**batch)
    break

In [63]:
output

SequenceClassifierOutput(loss=tensor(0.6932, grad_fn=<NllLossBackward0>), logits=tensor([[-0.0660, -0.0719],
        [-0.0627, -0.0683]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
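
As a sanity check on this output, the loss can be recomputed by hand from the printed logits. The sketch below assumes the two labels in this first batch are `[1, 0]`: the first comes from the sample shown earlier, while the second is an assumption on my part. With near-equal logits from a freshly initialized head, the mean cross-entropy should land close to ln 2 (about 0.6931), which is what an untrained two-class classifier is expected to give.

```python
import math

# Logits copied from the model output above.
logits = [[-0.0660, -0.0719], [-0.0627, -0.0683]]
# Labels for the first batch: the first is from the sample shown earlier,
# the second is ASSUMED for this illustration.
labels = [1, 0]

def cross_entropy(row, label):
    """Softmax cross-entropy for one example."""
    log_sum_exp = math.log(sum(math.exp(z) for z in row))
    return log_sum_exp - row[label]

# The model reports the mean loss over the batch.
mean_loss = sum(cross_entropy(r, y) for r, y in zip(logits, labels)) / len(labels)
print(round(mean_loss, 4), round(math.log(2), 4))
```

If the recomputed value matches the reported 0.6932, the forward pass and the label column are wired up as intended, and the model simply has not learned anything yet.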