## Installing Essential NLP Libraries  📚

These lines of code install essential libraries for working with Natural Language Processing (NLP) tasks:

1. `!pip install transformers`

    This line installs the transformers library, which provides pre-trained NLP models and tools for various NLP tasks.

2. `!pip install datasets`

    This line installs the datasets library, which offers functionalities for loading, processing, and managing various NLP datasets.

3. `!pip install "accelerate>=0.21.0"`

    This line installs the accelerate library (version 0.21.0 or later), which lets the same PyTorch training code run on a CPU, one or several GPUs, or TPUs, with optional mixed precision.

By installing these libraries, you gain a powerful toolkit for various NLP tasks, enabling you to leverage pre-trained models, manage datasets effectively, and accelerate training processes.


In [1]:
!pip install transformers
!pip install datasets
!pip install "accelerate>=0.21.0"

Collecting datasets
  Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.17.1 dill-0.3.8 multiprocess-0.70.16
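
After installation it can help to confirm which versions are actually present, since the cell above asks for `accelerate` 0.21.0 or later. The following is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and not part of any of the three libraries.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the version of each library this notebook depends on.
for name in ("transformers", "datasets", "accelerate"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If any line reports `not installed`, rerunning the install cell and restarting the runtime is usually enough.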


## 2. Setting Up Data and Tokenizer 💻

1. **Importing Essential Libraries:**
    - `from datasets import load_dataset`: This line imports the `load_dataset` function from the `datasets` library, which is used for loading datasets from various sources.
    - `from transformers import AutoTokenizer`: This line imports the `AutoTokenizer` class from the `transformers` library, which is used for tokenizing text, a crucial step in preparing text data for language models.

2. **Loading the Dataset:**
    ```python
    datasets = load_dataset("glue", "mrpc")
    ```
    This line loads the MRPC dataset from the GLUE benchmark, a collection of natural language understanding tasks. It stores the loaded dataset, a `DatasetDict` with train, validation, and test splits, in the `datasets` variable.

    The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.

    https://huggingface.co/datasets/glue/viewer/mrpc

3. **Selecting a Pre-trained Model:**
    ```python
    checkpoint = "bert-base-uncased"
    ```
    This line specifies the pre-trained checkpoint to use: `bert-base-uncased`, the base-size BERT model trained on lowercased English text.

4. **Loading the Tokenizer:**
    ```python
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    ```
    This line downloads the appropriate tokenizer for the chosen model and creates a tokenizer object, ready to split text into tokens that the model can understand.

In essence, this code sets up the foundation for working with a specific NLP dataset and preparing text data for language model training or inference.
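
To make "splitting text into tokens" more concrete: BERT tokenizers lowercase the text (for uncased checkpoints) and then split each word into subword pieces with a greedy longest-match-first search, marking continuation pieces with `##`. The sketch below illustrates that idea only. The five-entry `toy_vocab` and the function names are hypothetical, and the real tokenizer also handles punctuation, accents, and a vocabulary of roughly 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace, then apply WordPiece to each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##izer", "##s", "play", "##ing"}

print(tokenize("Tokenizers playing", toy_vocab))
# ['token', '##izer', '##s', 'play', '##ing']
```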


In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer

datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

type(datasets)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Generating train split:   0%|          | 0/3668 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/408 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1725 [00:00<?, ? examples/s]


datasets.dataset_dict.DatasetDict

To keep training fast, the next cell defines a function named `reduce_dataset` that copies a `DatasetDict` and keeps only the first `target_num_rows` rows of each split. Here every split is cut down to 200 rows.

In [3]:
from datasets import DatasetDict

def reduce_dataset(dataset_dict, target_num_rows):
  """Copies a DatasetDict and reduces the number of rows in each split to a specified amount.

  Args:
    dataset_dict (datasets.DatasetDict): The input dataset dictionary.
    target_num_rows (int): The desired number of rows in each split of the copied dataset.

  Returns:
    datasets.DatasetDict: The copied and reduced dataset dictionary.
  """
  copied_dataset_dict = DatasetDict()
  for split, dataset in dataset_dict.items():
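    # Clamp the count so a split with fewer rows than requested does not raise an error.
    copied_dataset = dataset.select(range(min(target_num_rows, len(dataset))))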
    copied_dataset_dict[split] = copied_dataset

  return copied_dataset_dict

target_num_rows = 200

datasets = reduce_dataset(datasets, target_num_rows)

type(datasets)

datasets.dataset_dict.DatasetDict
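
Taking the first rows of each split is simple, but the subset then depends on how the data happens to be ordered. One possible alternative is to draw a seeded random sample of row indices and pass that list to `select`. The sketch below only builds the index lists in plain Python; `subset_indices` is a hypothetical helper, and the split sizes are the ones reported when the dataset was generated above.

```python
import random


def subset_indices(num_rows, target_num_rows, seed=None):
    """Pick row indices for a reduced split.

    With seed=None the first rows are taken (the behaviour of reduce_dataset);
    otherwise a reproducible random sample is drawn.
    """
    n = min(target_num_rows, num_rows)  # never ask for more rows than exist
    if seed is None:
        return list(range(n))
    return sorted(random.Random(seed).sample(range(num_rows), n))


# Split sizes as reported in the dataset generation output.
split_sizes = {"train": 3668, "validation": 408, "test": 1725}

for split, size in split_sizes.items():
    indices = subset_indices(size, 200, seed=0)
    print(split, len(indices))
```

Under this assumption, `dataset.select(indices)` would replace `dataset.select(range(target_num_rows))` inside the loop.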

## 3. Preprocessing Data Using the Tokenizer 🛠️

1. **Function Definition:**

    ```python
    def tokenize_function(example):
    ```

    This defines a function called `tokenize_function`. Because it is later applied with `batched=True`, `example` is a dictionary of columns holding a batch of samples rather than a single sample.

2. **Tokenization with Truncation:**

    ```python
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
    ```

    This line performs the core work of the function. It calls the `tokenizer` object (created earlier) on two text fields, `example["sentence1"]` and `example["sentence2"]`, and encodes them together as a sentence pair. The `truncation=True` argument shortens any pair that exceeds the model's maximum input length (512 tokens for `bert-base-uncased`).

3. **Applying Tokenization to Entire Dataset:**

    ```python
    tokenized_datasets = datasets.map(tokenize_function, batched=True)
    ```

    This line applies `tokenize_function` to every split of the dataset using `datasets.map`. `batched=True` passes batches of samples (1,000 at a time by default) to the function instead of one sample per call, which is considerably faster. The result, stored in `tokenized_datasets`, keeps the original columns and adds `input_ids`, `token_type_ids`, and `attention_mask`.

In essence, this code snippet tokenizes individual data samples within a dataset, considering potential length constraints, and applies the transformation to the entire dataset efficiently.
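
To show what encoding a sentence pair with truncation produces, here is a minimal sketch in plain Python. It assumes BERT's pair layout, `[CLS] A [SEP] B [SEP]`, and a "longest first" truncation that removes one token at a time from the longer sentence. The function name `encode_pair`, the whitespace splitting, and the tiny `max_length` are illustrative assumptions; the real tokenizer works on subword ids and uses the model's own limit.

```python
def encode_pair(tokens_a, tokens_b, max_length):
    """Build a BERT-style pair encoding, truncating the longer side first."""
    if max_length < 3:
        raise ValueError("max_length must leave room for [CLS] and two [SEP] tokens")
    a, b = list(tokens_a), list(tokens_b)
    budget = max_length - 3  # room left after the three special tokens
    while len(a) + len(b) > budget:
        # Drop the last token of whichever sentence is currently longer
        # (on a tie the second sentence is shortened; this is an assumption).
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP]; segment 1 the rest.
    token_type_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = encode_pair(
    "the cat sat on the mat".split(), "a cat sat".split(), max_length=10
)
print(tokens)
print(token_type_ids)
```

In this toy run the first sentence loses its last two words so that the whole pair fits in 10 positions.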

In [4]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = datasets.map(tokenize_function, batched=True)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

## 4. TrainingArguments Class 🏋️‍♂️

1. **Importing a class:**

    `from transformers import TrainingArguments`

    This line imports the `TrainingArguments` class from the `transformers` library.

2. **Creating an instance:**

    `training_args = TrainingArguments("test-trainer")`

    This line creates an instance of the `TrainingArguments` class and assigns it to the variable `training_args`. The class holds the parameters that control training with the `transformers` library. The only argument passed here is the output directory (`"test-trainer"`), where training outputs such as models and checkpoints are saved; every other parameter keeps its default value, as the printout below shows.

In [7]:
from transformers import TrainingArguments
training_args = TrainingArguments("test-trainer")
training_args

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_le

## 5. Setting Up the Model 🤖

1. **Importing a Class:**

    `from transformers import AutoModelForSequenceClassification`

    This line imports the `AutoModelForSequenceClassification` class from the `transformers` library. This class is designed for sequence classification tasks, which assign a label or category to a sequence of text.

2. **Loading a Pretrained Model and Configuring It:**

    `model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)`

    This line does two important things:

    - **Loads a pre-trained model:** It loads the pre-trained language model from the specified checkpoint. The `from_pretrained` method handles downloading the model configuration and weights, making the model ready to use.
    - **Configures it for the task:** It tailors the model for a sequence classification task with 2 possible labels. The `num_labels=2` argument gives the model's output layer two outputs, one per class of this binary classification problem. This classification layer is not part of the checkpoint, so it is newly initialized and must be trained, which is what the warning printed below points out.
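
As a rough picture of what `num_labels=2` adds, the classification layer can be thought of as a single linear map from BERT's 768-dimensional pooled sentence representation to two scores (logits). The NumPy sketch below mimics only the shapes involved. The random values stand in for untrained weights and for the encoder output, and the variable names and the 0.02 initialization scale are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_labels = 768, 2

# Stand-ins for the newly initialized classifier parameters.
classifier_weight = rng.normal(scale=0.02, size=(num_labels, hidden_size))
classifier_bias = np.zeros(num_labels)

# Stand-in for the pooled encoder output of a batch of 4 sentence pairs.
pooled_output = rng.normal(size=(4, hidden_size))

# One linear layer: (4, 768) @ (768, 2) + (2,) -> (4, 2)
logits = pooled_output @ classifier_weight.T + classifier_bias
print(logits.shape)  # (4, 2)
```

Each row holds one score per label; training adjusts the weights so that the higher score matches the correct label.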

In [11]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
# check the architecture of the model
print(model)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

## 6. Trainer Class

The `Trainer` is created from the following arguments:

1. **model**: This is the model to train, here the `AutoModelForSequenceClassification` instance loaded in the previous step.
2. **training_args**: This is an instance of the `TrainingArguments` class, also from the `transformers` library. It contains the hyperparameters and configuration that control the training process, such as learning rate, batch size, and number of epochs. You typically create this object separately, configuring the desired hyperparameters for your training run.
3. **train_dataset**: This is the training data, taken from `tokenized_datasets` under the key `"train"`. It is a `datasets.Dataset` that has already been preprocessed and tokenized (converted into numerical representations suitable for the model).
4. **eval_dataset**: This is similar to `train_dataset`, but represents the evaluation data used to monitor the model's performance during and after training. It is taken from `tokenized_datasets` under the key `"validation"`.
5. **tokenizer**: This is the same tokenizer that was used to preprocess the datasets. When a tokenizer is passed, the `Trainer` uses it to pad each batch to the length of its longest sample and saves it alongside the model.
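
The tokenized samples have different lengths, so they must be padded before they can be stacked into a batch. The sketch below shows the idea of dynamic padding, where each batch is padded only up to its own longest sample. It is a plain-Python illustration and not the library's collator: `pad_batch` is a hypothetical name, the pad id 0 matches the `padding_idx=0` in the model printout above, and the token ids other than 101 (`[CLS]`) and 102 (`[SEP]`) are made up.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sample in the batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padding positions get mask 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


features = [
    {"input_ids": [101, 7592, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 7592, 2088, 999, 102], "attention_mask": [1, 1, 1, 1, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed global length avoids spending computation on padding tokens when a batch contains only short samples.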


In [14]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
)

In [15]:
trainer.train()


Step,Training Loss


TrainOutput(global_step=75, training_loss=0.4372893778483073, metrics={'train_runtime': 10.8087, 'train_samples_per_second': 55.511, 'train_steps_per_second': 6.939, 'total_flos': 21813550933440.0, 'train_loss': 0.4372893778483073, 'epoch': 3.0})
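
The numbers in this `TrainOutput` can be checked by hand. The sketch below assumes the default batch size of 8 per device on a single device and the 3 epochs reported in the output, with the 200-row training split created earlier; `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps when every batch triggers one update."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


steps = total_training_steps(num_examples=200, batch_size=8, num_epochs=3)
print(steps)  # 75, matching global_step=75

train_runtime = 10.8087  # seconds, copied from the output above
print(round(steps / train_runtime, 3))    # 6.939 steps per second
print(round(200 * 3 / train_runtime, 3))  # 55.511 samples per second
```

With only 75 steps, training never reaches the default logging interval, which would explain why the Step/Training Loss table above is empty.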

In [16]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(200, 2) (200,)


In [17]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)
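
`predictions.predictions` contains raw logits, not probabilities. Taking `argmax` directly on the logits is enough, because softmax preserves the ordering of the scores within each row. A small sketch with made-up logits:

```python
import numpy as np

# Hypothetical logits for three samples and two labels.
logits = np.array([[2.0, -1.0], [0.3, 0.9], [-0.5, -0.2]])

# Numerically stable softmax over the label axis.
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

preds_from_logits = np.argmax(logits, axis=-1)
preds_from_probs = np.argmax(probs, axis=-1)

print(preds_from_logits)  # [0 1 1]
print(np.array_equal(preds_from_logits, preds_from_probs))  # True
```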

In [18]:
!pip install evaluate
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.1 responses-0.18.0


Downloading builder script:   0%|          | 0.00/5.75k [00:00<?, ?B/s]

{'accuracy': 0.735, 'f1': 0.836923076923077}
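
The MRPC metric reports accuracy together with the F1 score of the positive ("paraphrase") class. To make the two numbers less opaque, the sketch below computes both by hand on made-up predictions. `accuracy_and_f1` is a hypothetical helper, and treating label 1 as the positive class is an assumption based on the usual MRPC labelling.

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Accuracy and binary F1 for the positive class, computed from counts."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Made-up predictions and labels, for illustration only.
toy_preds = [1, 1, 0, 1, 0, 1]
toy_refs = [1, 0, 0, 1, 1, 1]

result = accuracy_and_f1(toy_preds, toy_refs)
print(result)
```

Here 4 of 6 predictions are correct, and with 3 true positives, 1 false positive, and 1 false negative, precision and recall are both 0.75, so F1 is 0.75.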

In [19]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [20]:
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [21]:
trainer.train()

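| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----------|
| 1 | No log | 0.648622 | 0.685 | 0.813056 |
| 2 | No log | 0.71143 | 0.72 | 0.830303 |
| 3 | No log | 0.67667 | 0.72 | 0.826087 |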


TrainOutput(global_step=75, training_loss=0.4858089192708333, metrics={'train_runtime': 21.4014, 'train_samples_per_second': 28.036, 'train_steps_per_second': 3.504, 'total_flos': 21903995358720.0, 'train_loss': 0.4858089192708333, 'epoch': 3.0})