<a href="https://colab.research.google.com/github/qmeng222/transformers-for-NLP/blob/main/Fine_Tuning_Recognizing_Textual_Entailment_(RTE).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About the project:
---
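Fine-tune a pre-trained transformer (DistilBERT) as a text classifier that takes 2 input sentences and predicts whether the first entails the second (the RTE task from the GLUE benchmark).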

# Load the dataset:

In [1]:
# install libraries:
!pip install transformers datasets
# `transformers` library: for working with pre-trained NLP models
# `datasets` library: for working with datasets



In [2]:
from datasets import load_dataset # from `datasets` library, import the `load_dataset` func for downloading datasets
import numpy as np # import `numpy` library for numerical computations in Python

In [3]:
# use the `load_dataset` func to load the RTE (Recognizing Textual Entailment) dataset from the GLUE (General Language Understanding Evaluation) benchmark:
raw_datasets = load_dataset("glue", "rte")
# benchmark is a standardized set of tasks or datasets that are used to evaluate the performance of ML models
# RTE is a task to determine whether one piece of text logically entails another

# Examine the raw dataset:

In [4]:
type(raw_datasets)

datasets.dataset_dict.DatasetDict

In [5]:
raw_datasets.shape

{'train': (2490, 4), 'validation': (277, 4), 'test': (3000, 4)}

In [6]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 2490
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 277
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3000
    })
})

In [7]:
# check the attributes:
raw_datasets['train'].features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['entailment', 'not_entailment'], id=None),
 'idx': Value(dtype='int32', id=None)}

# Examine each feature within the training subset:

In [8]:
raw_datasets['train']['sentence1'][:3]

['No Weapons of Mass Destruction Found in Iraq Yet.',
 'A place of sorrow, after Pope John Paul II died, became a place of celebration, as Roman Catholic faithful gathered in downtown Chicago to mark the installation of new Pope Benedict XVI.',
 'Herceptin was already approved to treat the sickest breast cancer patients, and the company said, Monday, it will discuss with federal regulators the possibility of prescribing the drug for more breast cancer patients.']

👆 A list.

In [9]:
raw_datasets['train']['sentence2'][:3]

['Weapons of Mass Destruction Found in Iraq.',
 'Pope Benedict XVI is the new leader of the Roman Catholic Church.',
 'Herceptin can be used to treat breast cancer.']

In [10]:
raw_datasets['train']['label'][:3]

[1, 0, 0]
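
The integer labels map onto the names listed in the `ClassLabel` feature shown earlier. A minimal sketch of that lookup in plain Python, with the names and ids copied by hand from the outputs above rather than read from the dataset (on the real dataset, `raw_datasets['train'].features['label'].int2str(...)` should give the same mapping):

```python
# Label names copied from the ClassLabel feature printed earlier;
# the list position is the integer id.
label_names = ['entailment', 'not_entailment']

# First three training labels, as printed above.
label_ids = [1, 0, 0]

# Look up each id in the list of names.
decoded = [label_names[i] for i in label_ids]
print(decoded)  # ['not_entailment', 'entailment', 'entailment']
```

So the first pair ("No Weapons ... Yet." vs. "Weapons ... Found in Iraq.") is a `not_entailment` example, while the next two are `entailment`.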

In [11]:
raw_datasets['train']['idx'][3:6]


[3, 4, 5]

# Tokenize:

In [12]:
# import classes from the HF transformers library:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
# `AutoTokenizer` class: automatically load the appropriate tokenizer for a specific pre-trained model
# `AutoModelForSequenceClassification` class: automatically load a pre-trained model suitable for sequence classification based on the provided model identifier
# `Trainer` class: for training and evaluating models
# `TrainingArguments` class: customize the training arguments (hyperparameters etc.) for the training process

In [13]:
# model identifier (specify the name of a pre-trained model):
checkpoint = 'distilbert-base-cased' # or checkpoint = 'bert-base-cased'

In [14]:
# automatically load the appropriate tokenizer:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [15]:
# tokenize the 1st pair of sentences:
tokenizer(
    raw_datasets['train']['sentence1'][0],
    raw_datasets['train']['sentence2'][0]
)

{'input_ids': [101, 1302, 20263, 1104, 8718, 14177, 17993, 17107, 1107, 5008, 6355, 119, 102, 20263, 1104, 8718, 14177, 17993, 17107, 1107, 5008, 119, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [16]:
# assign the result of the last expression (`_` in IPython/Jupyter) to a variable:
result = _

In [17]:
result

{'input_ids': [101, 1302, 20263, 1104, 8718, 14177, 17993, 17107, 1107, 5008, 6355, 119, 102, 20263, 1104, 8718, 14177, 17993, 17107, 1107, 5008, 119, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [18]:
# convert a sequence of token ids back into a human-readable string:
tokenizer.decode(result['input_ids'])

'[CLS] No Weapons of Mass Destruction Found in Iraq Yet. [SEP] Weapons of Mass Destruction Found in Iraq. [SEP]'

👆 The tokenizer adds a [CLS] (classification) token at the beginning of the sequence.

It also adds a [SEP] (separator) token after each sentence, which marks the boundary between the two segments of a sentence pair.
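
To make the layout concrete, the sketch below splits the `input_ids` printed above back into the two sentences using only the special-token ids. It assumes, based on the decoded string, that id 101 is [CLS] and id 102 is [SEP]; `split_segments` is a hypothetical helper written for this illustration, not part of `transformers`.

```python
# Ids of [CLS] and [SEP], inferred from the decoded output above.
CLS_ID, SEP_ID = 101, 102

# The encoding of the first sentence pair, copied from the tokenizer output.
input_ids = [101, 1302, 20263, 1104, 8718, 14177, 17993, 17107, 1107, 5008,
             6355, 119, 102, 20263, 1104, 8718, 14177, 17993, 17107, 1107,
             5008, 119, 102]

def split_segments(ids, sep_id=SEP_ID):
    """Split a pair encoding into per-sentence id lists, dropping special tokens."""
    segments, current = [], []
    for token_id in ids[1:]:      # skip the leading [CLS]
        if token_id == sep_id:    # a [SEP] closes the current segment
            segments.append(current)
            current = []
        else:
            current.append(token_id)
    return segments

first, second = split_segments(input_ids)
print(len(first), len(second))  # 11 9
```

The second segment repeats most ids of the first, which matches the two sentences sharing the words "Weapons of Mass Destruction Found in Iraq".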

# Load the pre-trained model:

In [19]:
# automatically load a pre-trained model suitable for sequence classification based on the provided model identifier:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Evaluate model performance:

In [20]:
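# from `datasets` library, import the `load_metric` func to use pre-defined evaluation metrics for assessing the model performance
# (note: `load_metric` is deprecated in newer `datasets` releases in favor of the separate `evaluate` library):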
from datasets import load_metric

In [21]:
# load the evaluation metric associated with the RTE task from the GLUE benchmark:
metric = load_metric("glue", "rte")
metric


Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(res

In [22]:
# use the `compute` method of a loaded metric for a dummy evaluation:
metric.compute(predictions=[1, 1, 1], references=[1, 0, 0])

{'accuracy': 0.3333333333333333}
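
The same number can be checked by hand: accuracy is the fraction of predictions that equal their reference. A minimal sketch in plain Python, using the dummy inputs from the cell above:

```python
predictions = [1, 1, 1]
references = [1, 0, 0]

# Count the positions where prediction and reference agree.
correct = sum(p == r for p, r in zip(predictions, references))

# Only the first of the three pairs matches, so accuracy is 1/3.
accuracy = correct / len(references)
print(accuracy)
```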

In [23]:
# from the `metrics` module within the `sklearn` library, import the `f1_score` func
# to compute the F1 score for evaluating classification models:
from sklearn.metrics import f1_score

In [24]:
# compute evaluation metrics based on the logits (raw predictions) and true labels:
def compute_metrics(logits_and_labels):
  logits, labels = logits_and_labels # unpack the (logits, labels) pair supplied by the Trainer's evaluation loop
  predictions = np.argmax(logits, axis=-1) # predicted class = index of the largest logit along the last axis
  acc = np.mean(predictions == labels) # accuracy = fraction of predictions that match the labels
  f1 = f1_score(labels, predictions) # binary F1 score; by default sklearn treats label 1 (here 'not_entailment') as the positive class
  return {'accuracy': acc, 'f1': f1} # return a dict containing the computed metrics
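
To see what `compute_metrics` calculates, here is a hand-rolled walk-through on made-up logits. It is a sketch in plain Python, not the NumPy/scikit-learn code above: `toy_logits`, `toy_labels`, and the `argmax` helper are invented for illustration, and label 1 is taken as the positive class for F1.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

# Made-up logits for 4 examples and 2 classes, plus made-up true labels.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]

# Predicted class = position of the largest logit in each row.
preds = [argmax(row) for row in toy_logits]  # [0, 1, 1, 0]

# Accuracy: fraction of predictions that match the labels.
acc = sum(p == y for p, y in zip(preds, toy_labels)) / len(toy_labels)

# F1 for the positive class (label 1), from true/false positives and false negatives.
tp = sum(p == 1 and y == 1 for p, y in zip(preds, toy_labels))
fp = sum(p == 1 and y == 0 for p, y in zip(preds, toy_labels))
fn = sum(p == 0 and y == 1 for p, y in zip(preds, toy_labels))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print({'accuracy': acc, 'f1': round(f1, 4)})  # {'accuracy': 0.75, 'f1': 0.6667}
```

Here one of the two positive predictions is wrong (precision 0.5) while the only true positive is found (recall 1.0), so F1 is lower than accuracy.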

In [25]:
# this function takes a batch of examples
# with `batched=True`, the batch is a dictionary whose keys ('sentence1', 'sentence2', ...) each map to a list of values, one per example
# the lists under 'sentence1' and 'sentence2' hold the sentence pairs to be tokenized
# truncate a pair if it exceeds the maximum sequence length supported by the model
def tokenize_fn(batch):
  return tokenizer(batch['sentence1'], batch['sentence2'], truncation=True)

In [26]:
# apply the `tokenize_fn` function to each batch of examples in the `raw_datasets`:
tokenized_datasets = raw_datasets.map(tokenize_fn, batched=True)
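
The shape of `batch` is easy to get wrong, so here is a small stand-in that runs without downloading anything. It is only a sketch: `toy_tokenize_fn` and `toy_batch` are hypothetical, and word counts stand in for real token ids. The point is that a batched function receives a dict of lists and returns a dict of lists of the same length.

```python
def toy_tokenize_fn(batch):
    # With batched=True, each key maps to a list covering the whole batch.
    pairs = zip(batch['sentence1'], batch['sentence2'])
    # Return one value per example, as new column(s).
    return {'num_words': [len(a.split()) + len(b.split()) for a, b in pairs]}

# A hand-made batch of two sentence pairs (the second pair is invented).
toy_batch = {
    'sentence1': ['No Weapons of Mass Destruction Found in Iraq Yet.', 'A cat sat.'],
    'sentence2': ['Weapons of Mass Destruction Found in Iraq.', 'An animal sat.'],
}

out = toy_tokenize_fn(toy_batch)
print(out)  # {'num_words': [16, 6]}
```

In the real call, the returned keys (`input_ids`, `attention_mask`) are added as new columns next to the original ones.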

# The trainer (for training and evaluation):

In [27]:
!pip install transformers[torch]



In [29]:
# create an instance of the `TrainingArguments` class with specific configuration settings:
training_args = TrainingArguments(
    output_dir='training_dir', # specify the directory where the trained model and associated files will be saved
    evaluation_strategy='epoch', # evaluation (validation) will be performed at the end of each epoch
    save_strategy='epoch', # a checkpoint (a snapshot of the model's parameters, i.e. weights and biases, and other relevant info) will be saved to disk at the end of each epoch
    num_train_epochs=5, # the model will be trained for 5 epochs
    per_device_train_batch_size=16, # how many training examples will be processed in each forward and backward pass
    per_device_eval_batch_size=64, # each eval batch will contain 64 examples
    logging_steps=150, # log the training loss every 150 training steps (batches); an epoch with no logged step shows 'No log' in the results table
)

In [30]:
# configure an instance of the `Trainer` class:
trainer = Trainer(
    model, # the model to be trained or fine-tuned (the model created using `AutoModelForSequenceClassification.from_pretrained()`)
    training_args, # the instance of the TrainingArguments class
    train_dataset=tokenized_datasets["train"], # refer to the `tokenized_datasets`
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer, # the tokenizer loaded to preprocess the input data
    compute_metrics=compute_metrics, # the func for computing evaluation metrics
)

In [31]:
# train the model (forward, backward, update params, logging, eval):
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.6983 | 0.69357 | 0.530686 | 0.644809 |
| 2 | 0.6437 | 0.745733 | 0.501805 | 0.530612 |
| 3 | 0.4018 | 1.009437 | 0.534296 | 0.509506 |
| 4 | 0.1923 | 1.560241 | 0.552347 | 0.474576 |
| 5 | 0.1004 | 1.960213 | 0.534296 | 0.527473 |
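
In these results the training loss keeps falling while the validation loss rises after the first epoch, a pattern usually read as overfitting. One way to inspect this is to pick the "best" epoch under different criteria. The sketch below does so in plain Python with the numbers copied by hand from the results above; it is an illustration, not something the `Trainer` did in this run.

```python
# (epoch, training loss, validation loss, accuracy, F1), copied from the results above.
history = [
    (1, 0.6983, 0.69357, 0.530686, 0.644809),
    (2, 0.6437, 0.745733, 0.501805, 0.530612),
    (3, 0.4018, 1.009437, 0.534296, 0.509506),
    (4, 0.1923, 1.560241, 0.552347, 0.474576),
    (5, 0.1004, 1.960213, 0.534296, 0.527473),
]

# Epoch with the lowest validation loss.
best_by_val_loss = min(history, key=lambda row: row[2])

# Epoch with the highest validation accuracy.
best_by_accuracy = max(history, key=lambda row: row[3])

print(best_by_val_loss[0], best_by_accuracy[0])  # 1 4
```

The two criteria disagree, and all accuracies stay close to 0.5 on this two-class task, so neither choice is clearly better here.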


TrainOutput(global_step=780, training_loss=0.39399332159604783, metrics={'train_runtime': 210.4194, 'train_samples_per_second': 59.168, 'train_steps_per_second': 3.707, 'total_flos': 544524318051096.0, 'train_loss': 0.39399332159604783, 'epoch': 5.0})
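
The `global_step=780` figure can be reproduced from the dataset size and the training arguments. This is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and that the last, smaller batch of each epoch is kept:

```python
import math

num_train_examples = 2490          # size of the train split shown earlier
per_device_train_batch_size = 16   # from TrainingArguments
num_train_epochs = 5               # from TrainingArguments

# Each epoch needs enough batches to cover every example once.
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)

# One optimizer step per batch, repeated for every epoch.
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)  # 156 780
```

Since `logging_steps=150` is slightly below 156 steps per epoch, at least one training-loss value has been logged by the end of every epoch.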

In [32]:
# save the fine-tuned model (weights and config, plus the tokenizer passed to the Trainer) to a specified directory:
trainer.save_model('my_saved_model')

# Transformer pipeline:

In [33]:
# from the HF transformers library, import the `pipeline` func for running inference with a trained model:
from transformers import pipeline

# create a 'text classification' pipeline that uses a previously trained model checkpoint:
p = pipeline(
    'text-classification', # specify the task for the pipeline
    model='my_saved_model', # specify the path of the fine-tuned model saved above
    device=0 # run on the first GPU (cuda:0); use device=-1 to run on the CPU instead
)

In [34]:
# use the text classification pipeline to classify a pair of texts
p({'text': 'I went to the store', 'text_pair': 'I am a bird'})

{'label': 'LABEL_1', 'score': 0.8360171914100647}
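
The label reads `LABEL_1` because no human-readable label names were stored in the model config; index 1 corresponds to `not_entailment` in the dataset's `ClassLabel` feature. The score is, to my understanding, the softmax probability of the winning class. The sketch below shows that computation in plain Python; the logits are hypothetical (the pipeline does not print them), so the resulting score will not match the one above.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the two classes (not taken from the model).
hypothetical_logits = [-0.8, 0.83]

probs = softmax(hypothetical_logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the most likely class

# Label names copied from the dataset's ClassLabel feature.
label_names = ['entailment', 'not_entailment']

print(f'LABEL_{best}', label_names[best], round(probs[best], 3))
```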