# RTE (Recognizing Textual Entailment) with DeBERTa
## Fine-tuning the classifier head of a DeBERTa model pretrained on MNLI for three-way classification on SNLI
Inspired by the Keras code example [Semantic Similarity with BERT](https://keras.io/examples/nlp/semantic_similarity_with_bert/).

## Setup

In [26]:
# !pip install pandas pytorch-lightning transformers wandb
# !pip install evaluate scikit-learn

In [1]:
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification, 
    TrainingArguments, Trainer
    )
import torchmetrics
# import evaluate
import wandb



## Custom dataset

In [2]:
MAX_LENGTH = 128*2
HUB_MODEL_CHECKPOINT = 'microsoft/deberta-base-mnli'
MODEL_NAME = HUB_MODEL_CHECKPOINT.split("/")[-1]

In [4]:
# tokenizer = AutoTokenizer.from_pretrained(HUB_MODEL_CHECKPOINT)
# print(tokenizer.cls_token_id)
# print(tokenizer.sep_token_id)
# tokenizer('my name is thierry', 'my name is thierry')

In [3]:
def _construct_data_path(mode):
    mode = mode if mode != 'valid' else 'dev'
    return f'SNLI_Corpus/snli_1.0_{mode}.csv'


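def _preprocess(df):
    # Drop rows with missing values and pairs without a gold label ("-").
    df = df.dropna(axis=0)
    # Take an explicit copy so the assignments below modify a real DataFrame
    # rather than a slice (this avoids pandas' SettingWithCopyWarning).
    df = df[df.similarity != "-"].copy()
    # Label ids used in this notebook: contradiction=0, entailment=1, neutral=2.
    df['label'] = df["similarity"].apply(
        lambda x: 0 if x == "contradiction" else 1 if x == "entailment" else 2
        )
    for key in ['sentence1', 'sentence2']:
        df[key] = df[key].astype(str)
    return df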


class SNLIDataset(Dataset):
    def __init__(self, mode, tokenizer_name, nrows=None) -> None:
        self.df = pd.read_csv(_construct_data_path(mode), nrows=nrows)
        self.df = _preprocess(self.df)
        self.sentence_pairs = self.df[['sentence1', 'sentence2']].values
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        sentence_pair = self.sentence_pairs[idx]
        encoded = self.tokenizer(sentence_pair[0],
                                 sentence_pair[1],
                                 padding='max_length',
                                 max_length=MAX_LENGTH, 
                                 return_tensors='pt', 
                                 truncation=True)
        labels = self.df.label.values[idx]
        # Drop the leading batch dimension added by return_tensors='pt'.
        features = {
            feature: encoded[feature].to(torch.int32).squeeze(0)
            for feature in ['input_ids', 'attention_mask', 'token_type_ids']
        }
        features['labels'] = labels
        return features
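
The integer ids assigned in `_preprocess` are this notebook's own convention (contradiction=0, entailment=1, neutral=2). A checkpoint that was already fine-tuned on MNLI carries its own `id2label` table in its config, and the two orderings need not agree. When they differ, the pretrained classifier outputs have to be relearned or permuted. The sketch below compares two mappings and derives the column permutation. `checkpoint_id2label` is a hypothetical stand-in for `model.config.id2label`: its values are illustrative and should be read from the real config rather than taken from here.

```python
# Label ids used by this notebook's preprocessing step.
NOTEBOOK_LABEL2ID = {"contradiction": 0, "entailment": 1, "neutral": 2}


def logit_permutation(checkpoint_id2label, target_label2id):
    """Return perm where perm[target_id] is the checkpoint column for that label."""
    ckpt_label2id = {name.lower(): idx for idx, name in checkpoint_id2label.items()}
    if set(ckpt_label2id) != set(target_label2id):
        raise ValueError("the two mappings do not cover the same labels")
    perm = [None] * len(target_label2id)
    for label, target_id in target_label2id.items():
        perm[target_id] = ckpt_label2id[label]
    return perm


def reorder(logits_row, perm):
    """Reorder one row of checkpoint logits into the target label order."""
    return [logits_row[src] for src in perm]


# Hypothetical mapping, for illustration only (assumption, not read from the model).
checkpoint_id2label = {0: "CONTRADICTION", 1: "NEUTRAL", 2: "ENTAILMENT"}

perm = logit_permutation(checkpoint_id2label, NOTEBOOK_LABEL2ID)
print(perm)                            # [0, 2, 1]
print(reorder([0.1, 0.7, 0.2], perm))  # [0.1, 0.2, 0.7]
```

If the permutation comes out as the identity, the pretrained head can be evaluated on these labels directly; otherwise the head needs to be retrained or its outputs reordered first.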

In [6]:
# train_ds = SNLIDataset('test', tokenizer_name=HUB_MODEL_CHECKPOINT, nrows=1000)
# inputs = train_ds.__getitem__(0)
# inputs

In [7]:
# inputs['input_ids']
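
To make the shape of `inputs['input_ids']` easier to picture, here is a minimal pure-Python sketch of how a sentence pair could be packed into one fixed-length sequence, mirroring `padding='max_length'` and `truncation=True` in `__getitem__`. It is a simplification, not the tokenizer's implementation: the special-token ids (`cls_id`, `sep_id`, `pad_id`) and the toy token ids are assumptions, and the truncation rule (trim the longer sentence first) is only an approximation of the library's behaviour.

```python
def encode_pair(ids_a, ids_b, max_length, cls_id=1, sep_id=2, pad_id=0):
    """Pack two token-id lists as [CLS] a [SEP] b [SEP], then pad to max_length."""
    if max_length < 3:
        raise ValueError("max_length must leave room for the 3 special tokens")
    ids_a, ids_b = list(ids_a), list(ids_b)
    budget = max_length - 3  # tokens left once the special tokens are counted
    # Trim one token at a time from whichever sentence is currently longer.
    while len(ids_a) + len(ids_b) > budget:
        if len(ids_a) > len(ids_b):
            ids_a.pop()
        else:
            ids_b.pop()
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 marks padding
        "token_type_ids": token_type_ids + [0] * n_pad,
    }


example = encode_pair([10, 11, 12], [20, 21], max_length=10)
print(example["input_ids"])       # [1, 10, 11, 12, 2, 20, 21, 2, 0, 0]
print(example["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(example["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
```

Every example ends up with exactly `max_length` positions, which is why the dataset can return tensors of identical shape without a custom collate function.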

## Build model

In [4]:
def get_number_of_trainable_params(model):
    # Count only the parameters that will receive gradients.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

In [5]:
# LOCAL_MODEL_CHECKPOINT = './deberta-base-mnli-finetuned-snli/checkpoint-189'

model = AutoModelForSequenceClassification.from_pretrained(HUB_MODEL_CHECKPOINT)
assert model.num_labels == 3, 'The number of labels should be 3 for an RTE task'
print(f'Original number of trainable params: {get_number_of_trainable_params(model)}')

for name, param in model.named_parameters():
    if not name.startswith('classifier'):
        param.requires_grad = False

print(f'Actual number of trainable params: {get_number_of_trainable_params(model)}')

Some weights of the model checkpoint at microsoft/deberta-base-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Original number of trainable params: 139194627
Actual number of trainable params: 2307
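
The figure of 2307 trainable parameters is what a single linear layer from the encoder's hidden representation to the three classes would have. The arithmetic below checks this, assuming a hidden size of 768 for the base model (an assumption stated here, not printed by the notebook) and a bias term on the classifier.

```python
def linear_param_count(in_features, out_features, bias=True):
    """Number of parameters in a dense layer: weight matrix plus optional bias."""
    return in_features * out_features + (out_features if bias else 0)


HIDDEN_SIZE = 768        # assumption: hidden size of the base-sized encoder
NUM_LABELS = 3           # contradiction, entailment, neutral
TOTAL_PARAMS = 139_194_627  # "Original number of trainable params" from the output above

head_params = linear_param_count(HIDDEN_SIZE, NUM_LABELS)
print(head_params)                 # 2307
print(head_params / TOTAL_PARAMS)  # fraction of the model that is still trainable
```

Only a tiny fraction of the weights is updated, so each optimization step changes the head alone while the encoder acts as a fixed feature extractor.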


## Experiments

In [28]:
# acc = evaluate.load('accuracy')
# acc

In [9]:
TRAIN_SAMPLES = 1000
EVAL_SAMPLES = 100
BATCH_SIZE = 10
MAX_EPOCHS = 3
PROJECT_NAME = f'{MODEL_NAME}-finetuned-snli'

wandb.init(project=PROJECT_NAME)

train_ds = SNLIDataset('train', tokenizer_name=HUB_MODEL_CHECKPOINT, nrows=TRAIN_SAMPLES)
valid_ds = SNLIDataset('valid', tokenizer_name=HUB_MODEL_CHECKPOINT, nrows=EVAL_SAMPLES)

train_args = TrainingArguments(
    output_dir=PROJECT_NAME,
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-4,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=MAX_EPOCHS,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    report_to='wandb'
)

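def compute_metrics(eval_pred):
    # eval_pred unpacks into (logits, labels), both NumPy arrays.
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    # Return a plain Python float rather than a tensor so the value logs cleanly;
    # this also avoids the extra arguments that newer torchmetrics releases
    # require for `functional.accuracy`.
    acc = float((predictions == labels).mean())
    return {'accuracy': acc}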

trainer = Trainer(
    model,
    train_args,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    compute_metrics=compute_metrics,
)
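
`compute_metrics` reduces each row of logits to its highest-scoring class and compares that prediction with the gold label. The same computation can be sketched in plain Python on toy data; the logits and labels below are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a list (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose argmax equals the gold label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty batch")
    correct = sum(argmax(row) == gold for row, gold in zip(logits, labels))
    return correct / len(labels)


# Toy batch of four examples with three class scores each (illustrative values).
toy_logits = [
    [2.0, 0.1, -1.0],  # predicts class 0
    [0.2, 1.5, 0.3],   # predicts class 1
    [0.0, 0.1, 3.0],   # predicts class 2
    [1.0, 0.9, 0.8],   # predicts class 0
]
toy_labels = [0, 1, 2, 1]

print(accuracy(toy_logits, toy_labels))  # 0.75
```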


In [10]:
trainer.train()

***** Running training *****
  Num examples = 998
  Num Epochs = 3
  Instantaneous batch size per device = 10
  Total train batch size (w. parallel, distributed & accumulation) = 10
  Gradient Accumulation steps = 1
  Total optimization steps = 300
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
 33%|███▎      | 100/300 [05:45<11:08,  3.34s/it]***** Running Evaluation *****
  Num examples = 99
  Batch size = 10

 33%|███▎      | 100/300 [06:13<11:08,  3.34s/it]Saving model checkpoint to deberta-base-mnli-finetuned-snli/checkpoint-100
Configuration saved in deberta-base-mnli-finetuned-snli/checkpoint-100/config.json


{'eval_loss': 0.48488369584083557, 'eval_accuracy': 0.808080792427063, 'eval_runtime': 27.9447, 'eval_samples_per_second': 3.543, 'eval_steps_per_second': 0.358, 'epoch': 1.0}


Model weights saved in deberta-base-mnli-finetuned-snli/checkpoint-100/pytorch_model.bin
 67%|██████▋   | 200/300 [11:54<06:13,  3.73s/it]***** Running Evaluation *****
  Num examples = 99
  Batch size = 10

 67%|██████▋   | 200/300 [12:21<06:13,  3.73s/it]Saving model checkpoint to deberta-base-mnli-finetuned-snli/checkpoint-200
Configuration saved in deberta-base-mnli-finetuned-snli/checkpoint-200/config.json


{'eval_loss': 0.2748744785785675, 'eval_accuracy': 0.9292929172515869, 'eval_runtime': 26.7304, 'eval_samples_per_second': 3.704, 'eval_steps_per_second': 0.374, 'epoch': 2.0}


Model weights saved in deberta-base-mnli-finetuned-snli/checkpoint-200/pytorch_model.bin
100%|██████████| 300/300 [18:33<00:00,  3.47s/it]***** Running Evaluation *****
  Num examples = 99
  Batch size = 10

100%|██████████| 300/300 [19:02<00:00,  3.47s/it]Saving model checkpoint to deberta-base-mnli-finetuned-snli/checkpoint-300
Configuration saved in deberta-base-mnli-finetuned-snli/checkpoint-300/config.json


{'eval_loss': 0.2659257650375366, 'eval_accuracy': 0.9191918969154358, 'eval_runtime': 29.4959, 'eval_samples_per_second': 3.356, 'eval_steps_per_second': 0.339, 'epoch': 3.0}


Model weights saved in deberta-base-mnli-finetuned-snli/checkpoint-300/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from deberta-base-mnli-finetuned-snli/checkpoint-200 (score: 0.9292929172515869).
100%|██████████| 300/300 [19:03<00:00,  3.81s/it]

{'train_runtime': 1143.6845, 'train_samples_per_second': 2.618, 'train_steps_per_second': 0.262, 'train_loss': 0.7578939819335937, 'epoch': 3.0}





TrainOutput(global_step=300, training_loss=0.7578939819335937, metrics={'train_runtime': 1143.6845, 'train_samples_per_second': 2.618, 'train_steps_per_second': 0.262, 'train_loss': 0.7578939819335937, 'epoch': 3.0})
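
The numbers in this log can be cross-checked with a little arithmetic: the step count follows from the dataset size and batch size, the reported accuracies correspond to whole numbers of correct predictions out of 99, and the checkpoint that gets reloaded is the one with the highest validation accuracy. The sketch below assumes a single device and no gradient accumulation, as the log reports; the helper name is hypothetical.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Batches per epoch when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)


NUM_TRAIN, NUM_EVAL, BATCH, EPOCHS = 998, 99, 10, 3  # values from the log above

total_steps = steps_per_epoch(NUM_TRAIN, BATCH) * EPOCHS
print(total_steps)  # 300

# eval_accuracy reported at the end of each epoch, keyed by global step.
eval_accuracy = {
    100: 0.808080792427063,
    200: 0.9292929172515869,
    300: 0.9191918969154358,
}

# Each accuracy is (number correct) / 99, up to float32 rounding.
correct_counts = {step: round(acc * NUM_EVAL) for step, acc in eval_accuracy.items()}
print(correct_counts)  # {100: 80, 200: 92, 300: 91}

# load_best_model_at_end with metric_for_best_model='accuracy' keeps the maximum.
best_step = max(eval_accuracy, key=eval_accuracy.get)
print(best_step)  # 200
```

With only 99 validation examples, the gap between epochs 2 and 3 is a single prediction, so the choice of best checkpoint is within noise.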

In [11]:
test_ds = SNLIDataset('test', HUB_MODEL_CHECKPOINT, nrows=None)

trainer.evaluate(test_ds)

{'eval_loss': 0.38426440954208374,
 'eval_accuracy': 0.8617671132087708,
 'eval_runtime': 31438.5074,
 'eval_samples_per_second': 0.312,
 'eval_steps_per_second': 0.031,
 'epoch': 3.0}
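
The test-set metrics also allow a rough sanity check on the size and cost of this evaluation. Because `eval_samples_per_second` is rounded to three decimals, the example count derived from it is only approximate; the sketch treats it as an estimate.

```python
# Values copied from the evaluation output above.
eval_runtime_s = 31438.5074
eval_samples_per_second = 0.312   # rounded in the log, so estimates are approximate
eval_accuracy = 0.8617671132087708

runtime_hours = eval_runtime_s / 3600
approx_examples = eval_runtime_s * eval_samples_per_second
approx_correct = eval_accuracy * approx_examples

print(f"runtime: {runtime_hours:.2f} h")
print(f"examples evaluated: about {approx_examples:.0f}")
print(f"correct predictions: about {approx_correct:.0f}")
```

The test pass took several hours at roughly 0.3 samples per second, about a tenth of the throughput seen during the per-epoch validation runs (around 3.5 samples per second), so the runtime figure says more about the machine's state than about the model.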

In [12]:
wandb.finish()

Run history:

| Metric | History |
| --- | --- |
| eval/accuracy | ▁█▇▄ |
| eval/loss | █▁▁▅ |
| eval/runtime | ▁▁▁█ |
| eval/samples_per_second | ██▇▁ |
| eval/steps_per_second | ██▇▁ |
| train/epoch | ▁▅███ |
| train/global_step | ▁▅███ |
| train/total_flos | ▁ |
| train/train_loss | ▁ |
| train/train_runtime | ▁ |

Run summary:

| Metric | Value |
| --- | --- |
| eval/accuracy | 0.86177 |
| eval/loss | 0.38426 |
| eval/runtime | 31438.5074 |
| eval/samples_per_second | 0.312 |
| eval/steps_per_second | 0.031 |
| train/epoch | 3.0 |
| train/global_step | 300.0 |
| train/total_flos | 458980142515200.0 |
| train/train_loss | 0.75789 |
| train/train_runtime | 1143.6845 |
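
The block-character history rows are compact plots: each logged value is scaled between the minimum and maximum of its series and mapped to one of eight bar heights. The function below is one plausible way to produce such a sparkline. It is an assumption for illustration, not necessarily how W&B renders them, although it reproduces the rows shown for the accuracy and loss values logged in this run.

```python
BLOCKS = "▁▂▃▄▅▆▇█"  # eight bar heights, lowest to highest


def sparkline(values):
    """Map a series to block characters using min-max scaling and rounding."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    if hi == lo:
        return BLOCKS[0] * len(values)  # a flat series has no range to scale
    scale = (len(BLOCKS) - 1) / (hi - lo)
    return "".join(BLOCKS[int((v - lo) * scale + 0.5)] for v in values)


# Three per-epoch validation results followed by the final test-set result.
accuracy_history = [0.808080792427063, 0.9292929172515869,
                    0.9191918969154358, 0.8617671132087708]
loss_history = [0.48488369584083557, 0.2748744785785675,
                0.2659257650375366, 0.38426440954208374]

print(sparkline(accuracy_history))  # ▁█▇▄
print(sparkline(loss_history))      # █▁▁▅
```

Read this way, the last point of each `eval/*` row is the test-set evaluation, which is why accuracy dips and runtime spikes at the end of the history.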
