## Sequence Classification on GLUE

Following this guide from HuggingFace: https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb


The guide covers all 9 GLUE tasks, so some of the code here is task-agnostic: training is driven by the variable 'task'. Generating outputs, however, has to be written by hand with the specific task in mind.

~Samyukt Sriram

In [None]:
#Installing packages

!pip install transformers
!pip install datasets


In [2]:
import scipy
import sklearn
import numpy as np

from transformers import AutoTokenizer, DataCollatorWithPadding, TFAutoModelForSequenceClassification, create_optimizer
from transformers.keras_callbacks import KerasMetricCallback
from tensorflow.keras.callbacks import TensorBoard

import tensorflow as tf
from datasets import load_dataset, load_metric, ReadInstruction

The GLUE Benchmark is a set of 9 sentence-level and sentence-pair tasks; 8 are classification tasks, and STS-B is handled as regression. For reference, here is a quick summary from the guide notebook (link above):

- [CoLA](https://nyu-mll.github.io/CoLA/) (Corpus of Linguistic Acceptability) Determine if a sentence is grammatically correct or not.
- [MNLI](https://arxiv.org/abs/1704.05426) (Multi-Genre Natural Language Inference) Determine if a sentence entails, contradicts or is unrelated to a given hypothesis. (This dataset has two versions, one with the validation and test set coming from the same distribution, another called mismatched where the validation and test use out-of-domain data.)
- [MRPC](https://www.microsoft.com/en-us/download/details.aspx?id=52398) (Microsoft Research Paraphrase Corpus) Determine if two sentences are paraphrases of one another or not.
- [QNLI](https://rajpurkar.github.io/SQuAD-explorer/) (Question-answering Natural Language Inference) Determine if the answer to a question is in the second sentence or not. (This dataset is built from the SQuAD dataset.)
- [QQP](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) (Quora Question Pairs2) Determine if two questions are semantically equivalent or not.
- [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) (Recognizing Textual Entailment) Determine if a sentence entails a given hypothesis or not.
- [SST-2](https://nlp.stanford.edu/sentiment/index.html) (Stanford Sentiment Treebank) Determine if the sentence has a positive or negative sentiment.
- [STS-B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) (Semantic Textual Similarity Benchmark) Determine the similarity of two sentences with a score from 0 to 5.
- [WNLI](https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html) (Winograd Natural Language Inference) Determine if a sentence with an ambiguous pronoun entails the same sentence with that pronoun replaced by a candidate referent. (This dataset is built from the Winograd Schema Challenge dataset.)
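
As a rough cross-reference, the task list can be condensed into a lookup table of output size and reported metric. This is only a sketch: `TASK_SUMMARY` and `is_regression` are hypothetical names, and the values simply mirror the `num_labels` and `metric_name` choices made in the code cells of this notebook (the official GLUE leaderboard reports additional metrics for some tasks).

```python
# Hypothetical summary table: task string -> (number of model outputs, metric name).
# The values mirror the num_labels / metric_name logic used in this notebook.
TASK_SUMMARY = {
    "cola": (2, "matthews_correlation"),
    "mnli": (3, "accuracy"),
    "mnli-mm": (3, "accuracy"),
    "mrpc": (2, "accuracy"),
    "qnli": (2, "accuracy"),
    "qqp": (2, "accuracy"),
    "rte": (2, "accuracy"),
    "sst2": (2, "accuracy"),
    "stsb": (1, "pearson"),
    "wnli": (2, "accuracy"),
}


def is_regression(task):
    """A task with a single output unit is treated as regression (only STS-B here)."""
    return TASK_SUMMARY[task][0] == 1


for name, (n_out, metric_name) in TASK_SUMMARY.items():
    kind = "regression" if is_regression(name) else "classification"
    print(f"{name:8s} outputs={n_out} metric={metric_name} ({kind})")
```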

In [3]:
#GLUE Task strings:

GLUE_TASKS = [
              'cola',
              'mnli',
              'mnli-mm',
              'mrpc',
              'qnli',
              'qqp',
              'rte',
              'sst2',
              'stsb',
              'wnli'
]

In [4]:
#Setting up task and model:

task = 'mnli'
model_checkpoint = 'distilbert-base-uncased' #Make sure the model is compatible with classification tasks
batch_size = 16 #This might need to be tweaked based on task and model.

In [5]:
actual_task = 'mnli' if task == 'mnli-mm' else task #mnli-mm is an exception, all other task strings can be passed as is.

validation_key = (
    'validation_mismatched' if task == 'mnli-mm'
    else 'validation_matched' if task == 'mnli'
    else 'validation'
)


test_key = (
    'test_mismatched' if task == 'mnli-mm'
    else 'test_matched' if task == 'mnli'
    else 'test'
)

#Loading only a fraction of each split (2% of train, 1% of validation and test) to keep the run small
dataset = load_dataset('glue', actual_task, split = {'train':'train[:2%]', f'{validation_key}':f'{validation_key}[:1%]', f'{test_key}':f'{test_key}[:1%]'})
metric = load_metric('glue', actual_task)


In [6]:
print(dataset['train'][5])
dataset

{'premise': "my walkman broke so i'm upset now i just have to turn the stereo up real loud", 'hypothesis': "I'm upset that my walkman broke and now I have to turn the stereo up really loud.", 'label': 0, 'idx': 5}


DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 7854
    })
    validation_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 98
    })
    test_matched: Dataset({
        features: ['premise', 'hypothesis', 'label', 'idx'],
        num_rows: 98
    })
})
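
The row counts above follow from the percent slices passed to `load_dataset`. Below is a small sketch of the arithmetic, assuming percent boundaries are rounded to the closest integer (which is consistent with the counts shown). `pct_to_rows` and `MNLI_SPLIT_SIZES` are hypothetical names; the full split sizes are the ones reported when the MNLI dataset was generated.

```python
def pct_to_rows(num_examples, pct):
    """Rows selected by a 'split[:pct%]' slice, assuming closest rounding."""
    return int(round(num_examples * pct / 100.0))


# Full MNLI split sizes as reported during dataset generation.
MNLI_SPLIT_SIZES = {
    "train": 392702,
    "validation_matched": 9815,
    "test_matched": 9796,
}

print(pct_to_rows(MNLI_SPLIT_SIZES["train"], 2))               # 7854
print(pct_to_rows(MNLI_SPLIT_SIZES["validation_matched"], 1))  # 98
print(pct_to_rows(MNLI_SPLIT_SIZES["test_matched"], 1))        # 98
```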

In [7]:
#Preprocessing

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [8]:
#Dictionary to keep track of task and their corresponding column keys in the dataset

#Most of the tasks have different features, so this becomes helpful

task_to_keys = {
    'cola':('sentence', None),
    'mnli':('premise', 'hypothesis'),
    'mnli-mm':('premise', 'hypothesis'),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

In [9]:
sentence1_key, sentence2_key = task_to_keys[task] #Indexed by task rather than actual_task; 'mnli' and 'mnli-mm' map to the same columns, so it makes no difference.

def preprocess_function(examples):
  if sentence2_key is None:
    return tokenizer(examples[sentence1_key], truncation = True)
  return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation = True)


encoded_dataset = dataset.map(preprocess_function, batched = True)


#For the DataCollator function, we need to specify which columns are tokenized inputs. 
pre_tokenizer_columns = set(dataset['train'].features)
tokenizer_columns = list(set(encoded_dataset['train'].features) - pre_tokenizer_columns)





In [10]:
encoded_dataset['train'].features

{'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'hypothesis': Value(dtype='string', id=None),
 'idx': Value(dtype='int32', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'label': ClassLabel(num_classes=3, names=['entailment', 'neutral', 'contradiction'], id=None),
 'premise': Value(dtype='string', id=None)}

In [11]:
data_collator = DataCollatorWithPadding(tokenizer = tokenizer, return_tensors ='tf')


#The validation split name depends on the task (mnli vs. mnli-mm), so it is recomputed here
validation_key = (
    'validation_mismatched' if task == 'mnli-mm'
    else 'validation_matched' if task == 'mnli'
    else 'validation'
)

tf_train_dataset = encoded_dataset['train'].to_tf_dataset(
    columns = tokenizer_columns,
    label_cols = ['labels'],
    shuffle = True,
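    batch_size = batch_size, #Reuse the batch size set above so it stays consistent with the step count used for the optimizer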
    collate_fn = data_collator,
)

tf_validation_dataset = encoded_dataset[validation_key].to_tf_dataset(
    columns = tokenizer_columns,
    label_cols = ['labels'],
    shuffle = False, #No shuffling for validation: the model does not learn from this split, so shuffling would only add computation.
    batch_size = batch_size,
    collate_fn = data_collator,
)
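
To make the role of the data collator concrete, here is a minimal pure-Python sketch of dynamic padding: each batch is padded only to the length of its longest sequence, and an attention mask marks the real tokens. This is an illustration, not the `DataCollatorWithPadding` implementation; `pad_batch`, the token ids, and the pad id of 0 are assumptions made for the example.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in the batch and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for three sequences of different lengths.
batch = pad_batch([[101, 7592, 102], [101, 2026, 3899, 2003, 102], [101, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to one global maximum length keeps short batches small, which is why `truncation = True` is used at tokenization time while padding is deferred to the collator.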

In [12]:
#Defining Loss and Model

#Number of output units: 3 classes for MNLI, 1 regression output for STS-B, 2 classes otherwise.
num_labels = 3 if task.startswith('mnli') else 1 if task == 'stsb' else 2

#STS-B is a regression task, so it uses mean squared error; every other task is classification.
if task == 'stsb':
  loss = tf.keras.losses.MeanSquaredError()
else:
  loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model = TFAutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels = num_labels)



Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_projector', 'activation_13', 'vocab_layer_norm', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'classifier', 'dropout_19']
You should probably TRAIN this model on a down-stream task to be able to use i
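
For the classification tasks, the loss is a sparse categorical cross-entropy computed directly on logits. The sketch below shows that computation for a single example in plain Python, using the log-sum-exp trick for numerical stability. `sparse_ce_from_logits` is a hypothetical helper written for illustration; the Keras loss additionally averages over the batch.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max before exponentiating for stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


# Uniform logits over 3 classes give a loss of log(3), whatever the label.
print(sparse_ce_from_logits([0.0, 0.0, 0.0], 1))

# A confident correct prediction gives a small loss; a confident wrong one a large loss.
print(sparse_ce_from_logits([5.0, -1.0, -1.0], 0))
print(sparse_ce_from_logits([5.0, -1.0, -1.0], 2))
```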

In [13]:
#Compiling the model

num_epochs = 3
batches_per_epoch = len(encoded_dataset['train']) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)


#create_optimizer() returns an optimizer together with a learning-rate decay schedule (AdamW-style weight decay is applied only when weight_decay_rate > 0)
optimizer, schedule = create_optimizer(
    init_lr = 2e-5, num_warmup_steps = 0, num_train_steps = total_train_steps
)

model.compile(optimizer = optimizer, loss = loss)
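
The step count and the resulting learning-rate schedule can be checked by hand. The sketch below assumes a linear decay from `init_lr` to zero with no warmup, which is my understanding of the default schedule when `num_warmup_steps = 0`; `linear_decay_lr` is a hypothetical helper, not part of the library.

```python
num_train_rows = 7854  # rows in the 2% MNLI training slice
batch_size = 16
num_epochs = 3

batches_per_epoch = num_train_rows // batch_size
total_train_steps = batches_per_epoch * num_epochs
print(batches_per_epoch, total_train_steps)  # 490 1470


def linear_decay_lr(step, init_lr=2e-5, total_steps=total_train_steps):
    """Learning rate at a given step, assuming linear decay to zero and no warmup."""
    step = min(step, total_steps)
    return init_lr * (1.0 - step / total_steps)


for step in (0, total_train_steps // 2, total_train_steps):
    print(step, linear_decay_lr(step))
```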

In [14]:
#KerasMetricCallback computes arbitrary metrics (e.g. BLEU, ROUGE) on the validation set at the end of each epoch.
#The results are added to the Keras logs, so other callbacks such as TensorBoard and EarlyStopping can use them.


metric_name = (
    'pearson' if task == 'stsb'
    else 'matthews_correlation' if task == 'cola'
    else 'accuracy'
)

def compute_metrics(eval_predictions):
  predictions, labels = eval_predictions
  if task != 'stsb':
    predictions = np.argmax(predictions, axis=1)
  else:
    predictions = predictions[:,0]
  return metric.compute(predictions=predictions, references = labels)

metric_callback = KerasMetricCallback(
    metric_fn = compute_metrics, eval_dataset = tf_validation_dataset
)
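
What `compute_metrics` does for a classification task can be sketched without any library: take the argmax of each row of logits and compare it with the reference label. `argmax` and `accuracy_from_logits` are hypothetical helpers, and the logits and labels below are made-up values for illustration only.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four examples over the three MNLI classes.
toy_logits = [
    [2.1, -0.3, -1.0],
    [0.2, 1.5, -0.7],
    [-1.2, 0.1, 0.9],
    [0.4, 0.3, -0.2],
]
toy_labels = [0, 1, 2, 1]
print(accuracy_from_logits(toy_logits, toy_labels))  # {'accuracy': 0.75}
```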

In [15]:
#Training

tensorboard_callback = TensorBoard(log_dir = "./text_classification_model_save/logs")
callbacks = [metric_callback, tensorboard_callback]

model.fit(
    tf_train_dataset,
    validation_data = tf_validation_dataset,
    epochs = num_epochs,
    callbacks = callbacks
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f9e3813d9d0>

In [27]:
#MNLI

#Encode the premise and hypothesis as one sentence pair (batch size 1); the training example is slightly modified.
inputs = tokenizer(
    "my walkman broke so i'm upset now i just have to turn the stereo up real loud",
    "I'm upset that my walkman broke and now I have to turn the stereo up.",
    return_tensors = 'tf',
)
outputs = model(inputs)
logits = outputs[0]
print(logits)

#Logit order: [entailment, neutral, contradiction]. The largest logit is the predicted class; a softmax over the three logits gives class probabilities.

tf.Tensor([[ 2.6447868  -1.6716129  -0.90757364]], shape=(1, 3), dtype=float32)
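
To read the logits printed above as probabilities, apply a softmax across the three classes. The sketch below does this in plain Python on the printed values; `softmax` is a small helper written for this example, and the label order is taken from the dataset's `ClassLabel` feature.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["entailment", "neutral", "contradiction"]
logits = [2.6447868, -1.6716129, -0.90757364]  # values printed by the model above

probs = softmax(logits)
predicted = labels[probs.index(max(probs))]
print(predicted)  # entailment
for name, p in zip(labels, probs):
    print(f"{name}: {p:.4f}")
```

With these logits, entailment receives a probability of about 0.96, so the model is fairly confident in its prediction for this pair.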
