# <font color = 'indianred'>**Understanding Inputs/Outputs for BERT** </font>

**Objective:**

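We will look at what BERT expects as input (the token IDs, token type IDs, and attention masks produced by its pre-trained tokenizer) and what it returns as output, and then attach a classification head to those outputs.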

**Key Changes and Innovations from the Previous Notebook:**

1. **Advanced Tokenization Techniques:**
   - We delve deeper into advanced tokenization, using BERT's pre-trained tokenizer. We need to use this tokenizer so that the inputs are compatible with the pre-trained model.
2. **Introduction of Pre-Trained BERT Model:**
   - Unlike the custom model in the first notebook, we now utilize a pre-trained BERT model. This approach allows us to benefit from a model already trained on a vast corpus of text, bringing in rich contextual embeddings. By using a pre-trained model, we can get better accuracy compared to training a model from scratch.



<img src ="https://drive.google.com/uc?export=view&id=1IQgmPzHxbVw3a7EfwWfIGtiPAZY7mAMD" width =800>








# <font color = 'indianred'> **1. Setting up the Environment** </font>



In [None]:
from pathlib import Path
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount("/content/drive")
    !pip install datasets transformers -U -qq


Mounted at /content/drive

<font color = 'indianred'> *Load Libraries* </font>

In [None]:
# standard data science libraries for data handling and visualization
import torch.nn as nn
import torch
import matplotlib.pyplot as plt


# New libraries introduced in this notebook
from datasets import Dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, AutoModel
from transformers import AutoConfig
from transformers import PreTrainedModel, PretrainedConfig
from transformers import DataCollatorWithPadding
from transformers.modeling_outputs import SequenceClassifierOutput


# <font color = 'indianred'>**2. Understanding Tokenization**</font>

Tokenization is the process of dividing a sequence of text into smaller parts called tokens.

*Why do we need tokenization?*

In natural language processing (NLP), this is an essential preprocessing step. NLP models do not accept raw strings directly; instead, these <font color='indianred'><b>models expect the input text to be tokenized and converted into numerical vectors</b>.</font>

*Tokenization Approaches*

Let's explore the three common tokenization strategies - word, character, and subword - and examine their advantages and disadvantages.

1. *Word Tokenization*: It breaks the text into individual words. This is the most intuitive way to split text, but it suffers from out-of-vocabulary (OOV) words (words not found in the training vocabulary) and handles variations of a word rigidly: related forms such as "token" and "tokens" become unrelated vocabulary entries.

2. *Character Tokenization*: It breaks the text into individual characters, which lets it represent any string. This helps with OOV words, rare words, and misspellings. However, individual characters carry little semantic information, so the model has to learn linguistic structures such as words from the data.

3. *Subword Tokenization*: Subword tokenization represents a middle ground between word and character tokenization, dividing text into units that may include whole words or character n-grams. This approach aims to harness the advantages of both character and word tokenization.
   - Rare Words Handling: Breaks down rare, complex, and misspelled words into smaller units, facilitating easier interpretation by the model.
   - Frequent Words Preservation: Retains frequently used words as individual entities, keeping input length manageable.
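
The trade-off between word and character tokenization can be seen on a toy example. This is a minimal sketch in plain Python; the sentence and the three-word vocabulary are assumptions made up for illustration and have nothing to do with BERT's real vocabulary.

```python
sentence = "i like tokenization"
word_vocab = {"i", "like", "tokens"}  # toy vocabulary, assumed for illustration

# Word-level: split on whitespace; unseen words fall back to [UNK].
word_tokens = [w if w in word_vocab else "[UNK]" for w in sentence.split()]

# Character-level: every character is a token, so nothing is out of vocabulary,
# but the sequence becomes much longer.
char_tokens = list(sentence.replace(" ", ""))

print(word_tokens)
print(len(word_tokens), len(char_tokens))
```

The word-level split loses "tokenization" entirely, while the character-level split keeps everything at the cost of a longer sequence. Subword tokenization sits between these two extremes.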


*What is a pre-trained Tokenizer?*

The pre-trained tokenizer is the tokenizer (for a specific model) trained on a large corpus of text and has learned a set of rules for breaking down words and sentences into tokens. Using these rules, it has created a fixed vocabulary. This vocabulary is a mapping between unique tokens (words, subwords, or characters depending on the tokenizer's method) and unique IDs.

The tokenizer uses this fixed vocabulary to tokenize any new data that we pass. For each word, it follows these steps to create subword tokens:

*Longest Match Rule*: The tokenizer looks for the longest matching subword token in its vocabulary. If the whole word is present in the vocabulary, it is not split, and the tokenizer takes the entire word as one token.

*Subword Splitting*: If the whole word is not in the vocabulary, the tokenizer breaks it down into subword tokens. It selects the longest matching subword token from the beginning of the word and assigns it as the first token. Then it looks for the longest matching subword token in the remaining part of the word and assigns it as the next token. This process continues until the entire word is covered by subword tokens.
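
The two rules above can be sketched as a greedy longest-match-first loop. This is an illustrative sketch, not the library's implementation: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions, and the real tokenizer also handles lowercasing, punctuation, and a maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary, assumed for illustration only.
toy_vocab = {"token", "##ization", "##s", "is", "bu", "##an", "##6", "##34", "##2"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("buan6342", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this toy vocabulary, a word that is covered by known pieces is split into them, and a word with no matching piece falls back to the unknown token.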

*Why do we need a pre-trained Tokenizer?*

We aim to fine-tune the pre-trained model (BERT), so we must use the same tokenizer that was used during BERT's original training. BERT was trained with WordPiece tokenization, and each row of its embedding layer corresponds to one token ID in that WordPiece vocabulary. Using the same tokenizer keeps the token IDs aligned with the pre-trained embeddings. A different tokenizer would assign different IDs to the same text, so the model would look up embeddings that do not match the input, and accuracy would suffer. Therefore, to harness BERT's full potential in various NLP tasks, it is vital to use the tokenizer it was trained with.


# <font color = 'indianred'>**3. Load pre-trained Tokenizer**</font>


<img src ="https://drive.google.com/uc?export=view&id=1qH2bkB0or2_KAf84O5y5Y26A1W6ZmWRj" width =800>

image source: https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

In our next step, we will download a pre-trained tokenizer specifically designed to work with BERT. This tokenizer will handle the conversion of our text into a format that BERT can understand.

In [None]:
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


- `checkpoint = "bert-base-uncased"` specifies the pre-trained model we want to use. BERT has various versions, and "bert-base-uncased" refers to the base version trained on uncased English text.

-  `tokenizer = AutoTokenizer.from_pretrained(checkpoint)` downloads and initializes the tokenizer for the specified checkpoint. This method takes care of downloading the required files and setting up the tokenizer with the correct configurations.




<font color = 'indianred'>*Understanding pre-trained Tokenizer*</font>

We will now see how the tokenizer works by feeding it a simple example.

In [None]:
# Define the text and labels
data = {
    'text': [
        "Tokenization is the process of splitting sequence to tokens",
        "I like BUAN6342"
    ],
    'label': [0, 1]
}

# Create a Hugging Face dataset
dataset = Dataset.from_dict(data)

# Display the dataset
print(dataset)


Dataset({
    features: ['text', 'label'],
    num_rows: 2
})


In [None]:
# get the vocab size
print(f'Pretrained tokenizer vocab size {tokenizer.vocab_size}')


Pretrained tokenizer vocab size 30522


- <font color = 'indianblue'>The vocab size for the tokenizer for bert-base-uncased model is 30522.

In [None]:
encoded_text = [tokenizer(text, truncation=True, return_tensors='pt') for text in dataset['text']]

Let us understand the arguments:

1. `padding = True`: This argument tells the tokenizer to pad the input so that all sequences in a batch have the same length, by adding special [PAD] tokens to the shorter ones. We do not pass it in the call above: each sentence is encoded on its own, so there is nothing to pad against. Padding comes into play later, when sequences are combined into a batch.

2. `truncation = True`: This argument instructs the tokenizer to truncate the input text to the maximum length that the model can handle (512 tokens for this checkpoint). If you do not set `truncation=True` and a sequence is longer than the model can take, the tokenizer only emits a warning, and the model fails later when it receives the over-long sequence.

3. `return_tensors = 'pt'`: This argument tells the tokenizer to return the output in PyTorch tensor format. PyTorch tensors are data structures used for efficient numerical computations.

Now let us look at the output of the tokenizer, and try to understand the output


In [None]:
encoded_text


[{'input_ids': tensor([[  101, 19204,  3989,  2003,  1996,  2832,  1997, 14541,  5537,  2000,
          19204,  2015,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])},
 {'input_ids': tensor([[  101,  1045,  2066, 20934,  2319,  2575, 22022,  2475,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}]

- **input_ids**

    Input IDs are numerical identifiers assigned to each token from the input text. The tokenizer maps each word or sub-word to a unique ID from its predefined vocabulary. It tries to match the whole word first. If unsuccessful, it splits the word into sub-words until each piece can be matched in the vocabulary, each represented by its corresponding ID. If a piece can't be found, a special [UNK] token is used. This process creates the `input_ids` list, a numerical representation of the input text.

- **token_type_ids**

    The token type IDs are used when BERT is fed with pairs of sentences or inputs with distinct segments (e.g., Question-Answer pairs). For single-sentence tasks, all token type IDs are typically set to 0. For tasks that require two separate segments of text, such as Question-Answer tasks, the token type IDs distinguish the segments. The first segment (e.g., the question) is assigned a token type ID of 0, and the second segment (e.g., the answer) is assigned a token type ID of 1.

- **attention_mask**

    The attention mask is a binary tensor that has the same length as the tokenized input sequence. It tells the BERT model which tokens should be used and which ones should be ignored during processing. The attention mask is essential when handling sequences with varying lengths. It holds a value of 1 for the tokens that should be attended to and 0 for the tokens that should be ignored (typically the padded tokens). This way, the model knows which tokens are actual input and which ones are just padding.
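
To make the three fields concrete for a sentence pair, here is a small sketch in plain Python that lays out the tokens, `token_type_ids`, and `attention_mask` by hand. The helper `build_pair_inputs` and the example tokens are hypothetical; in practice the tokenizer builds these fields for you.

```python
def build_pair_inputs(tokens_a, tokens_b, max_len):
    """Assemble [CLS] A [SEP] B [SEP], then pad to max_len (toy sketch)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], A and the first [SEP]; segment 1 covers B and the last [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)

    # Padding positions get token type 0 and attention mask 0.
    n_pad = max_len - len(tokens)
    tokens += ["[PAD]"] * n_pad
    token_type_ids += [0] * n_pad
    attention_mask += [0] * n_pad
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = build_pair_inputs(
    ["who", "wrote", "it"], ["the", "author"], max_len=10
)
for row in zip(tokens, token_type_ids, attention_mask):
    print(row)
```

Printing the three lists side by side shows that the segment ID switches from 0 to 1 right after the first `[SEP]`, and that the attention mask drops to 0 only on the `[PAD]` positions.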



In [None]:
# Extracting the tensor and converting it to a list for the first sentence
tokens_first_sentence = tokenizer.convert_ids_to_tokens(encoded_text[0]['input_ids'][0].tolist())

# Extracting the tensor and converting it to a list for the second sentence
tokens_second_sentence = tokenizer.convert_ids_to_tokens(encoded_text[1]['input_ids'][0].tolist())

# Now you should be able to print or process the tokens
print("First sentence tokens:", tokens_first_sentence)
print("Second sentence tokens:", tokens_second_sentence)

First sentence tokens: ['[CLS]', 'token', '##ization', 'is', 'the', 'process', 'of', 'splitting', 'sequence', 'to', 'token', '##s', '[SEP]']
Second sentence tokens: ['[CLS]', 'i', 'like', 'bu', '##an', '##6', '##34', '##2', '[SEP]']



Three things are worth noting in this tokenized sequence:
1. **Special Tokens**: The special `[CLS]` and `[SEP]` tokens have been added at the beginning and end of each sequence. A third special token, `[PAD]`, does not appear here because each sentence was encoded on its own; it is added to the shorter sequence once the sequences are padded to a common length in a batch.
    - **[CLS] Token**: This token stands for "classification" and is placed at the beginning of each input sequence. It is essential for tasks like text classification, where BERT learns to encode the entire sequence's information into the representation of the [CLS] token.

    - **[SEP] Token**: This token stands for "separator" and is used to separate two different sequences in the input. When processing multiple sequences, BERT uses this separator token to distinguish between the end of one sequence and the start of another.

    - **[PAD] Token**: This token stands for "padding" and is used to make input sequences of equal length. BERT processes inputs in batches, and all sequences within a batch need to have the same length. If a sequence is shorter than the maximum length in the batch, it is padded with [PAD] tokens to match the length. In our example, the second sentence is shorter (9 tokens versus 13), so when the two sentences are batched together, four [PAD] tokens are added to the second sentence.

    This is also why the attention_mask of the second sentence has four zeros once it has been padded: they tell the model to ignore the [PAD] tokens.

2. **Lowercasing**: All the tokens have been converted to lowercase. This is a feature of this particular BERT checkpoint (**we have used -uncased version**), which helps standardize the text and ensures that the model treats different cases of the same word equally.

3. **Subword Tokens**: Some words, such as "Tokenization", "tokens", and "BUAN6342", have been split into multiple tokens, indicated by the presence of the `##` prefix. This happens because BERT breaks down less common or longer words into smaller subword tokens that are in its vocabulary. The `##` prefix indicates that a token should be merged with the previous token when converting the tokens back to a string.

The tokenizer offers a convenient method called `convert_tokens_to_string()` that merges the tokens back into text. Let's use this method to convert our tokens into a string representation. Note that the result is lowercased and still contains the special tokens, so it is not identical to the original text.
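
For WordPiece tokens, the merge step can be pictured as joining the tokens with spaces and then gluing every `##` piece onto the token before it. The helper below is a minimal sketch of that idea (the name `merge_wordpieces` is an assumption), not a replacement for the tokenizer's own method.

```python
def merge_wordpieces(tokens):
    """Join tokens with spaces, then glue '##' continuation pieces to the previous token."""
    return " ".join(tokens).replace(" ##", "").strip()


tokens = ['[CLS]', 'i', 'like', 'bu', '##an', '##6', '##34', '##2', '[SEP]']
merged = merge_wordpieces(tokens)
print(merged)
```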






In [None]:
tokenizer.convert_tokens_to_string(tokens_first_sentence)


'[CLS] tokenization is the process of splitting sequence to tokens [SEP]'

In [None]:
tokenizer.convert_tokens_to_string(tokens_second_sentence)


'[CLS] i like buan6342 [SEP]'

In [None]:
special_tokens = tokenizer.all_special_tokens
special_tokens_ids = tokenizer.all_special_ids
print(special_tokens, special_tokens_ids)


['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'] [100, 102, 0, 101, 103]


We have already explained '[SEP]', '[PAD]', '[CLS]' tokens. Now let us look at the other two special tokens.

- [UNK] Token: This token stands for "unknown" and is used to represent words that are not present in the model's vocabulary. During tokenization, if a word in the input sequence is not found in the pre-trained vocabulary, it is replaced with the [UNK] token.

- [MASK] Token: This token is used during pretraining BERT. It is used to mask certain words in the input sequence randomly. During training, BERT tries to predict these masked words based on the context provided by the other words in the sequence. This pretraining process helps BERT capture bidirectional context and understand language more effectively.

# <font color = 'indianred'> **4. Create function for Tokenizer**</font>

In the previous section, we saw how tokenization works. We will now create a function for tokenization and apply it to our dataset (in a full workflow, to the training and validation splits) to generate a tokenized dataset.

**Change from previous section**: When exploring tokenization, we passed `return_tensors='pt'` and discussed `padding=True`. In our tokenization function, we omit both arguments.

**Reason for the change**: Padding and conversion to tensors are handled more efficiently at the batch level during training than at the dataset level. Padding every sequence in the dataset to a uniform length means padding to the longest sequence in the whole dataset, which can add a lot of unnecessary padding when sequence lengths vary widely. Instead, these steps are handled by a data collator (collate function), a concept we touched upon in previous notebooks. The data collator pads each batch to the longest sequence within that specific batch, which reduces the number of padding tokens the model processes in each training step. Conversion to tensors also has to wait until the batch level, because sequences of different lengths cannot be stacked into a single tensor until they have been padded.
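
The saving from batch-level padding can be estimated with simple arithmetic. The sketch below counts padding tokens under both strategies; the list of sequence lengths and the batch size are hypothetical values chosen to include one unusually long example.

```python
lengths = [13, 9, 120, 11, 8, 10]   # hypothetical token counts per example
batch_size = 2

# Dataset-level padding: every example is padded to the global maximum.
global_max = max(lengths)
pad_dataset_level = sum(global_max - n for n in lengths)

# Batch-level (dynamic) padding: each batch is padded to its own maximum.
pad_batch_level = 0
for i in range(0, len(lengths), batch_size):
    batch = lengths[i:i + batch_size]
    pad_batch_level += sum(max(batch) - n for n in batch)

print(pad_dataset_level, pad_batch_level)
```

A single long example inflates the padding of every other example under dataset-level padding, while under batch-level padding it only affects the batch it lands in.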

In [None]:
def tokenize_fn(batch):
    return tokenizer(text = batch["text"], truncation=True)

# def tokenize_fn(batch):
#     return tokenizer(text = batch["text"], truncation=True, padding = True)

<font color = 'indianred'> *Use map function to apply tokenization to all splits*

In [None]:
tokenized_dataset = dataset.map(tokenize_fn, batched=True,)


**Code Explanation**:

- The code takes a dataset (`dataset`), applies the tokenization function (`tokenize_fn`) to each batch of data, and stores the tokenized results in a new dataset (`tokenized_dataset`). The resulting `tokenized_dataset` has the same number of rows as the original dataset, with the tokenizer's outputs added as new columns, so each row is now in a tokenized form suitable for a transformer model.
- The default batch size is 1000.
- Using batched=True in the datasets library streamlines data processing by taking advantage of vectorized operations, leading to faster execution. This approach reduces the overhead from individual function calls and benefits from Hugging Face's tokenizers, which are optimized for batch processing. Additionally, batching can enhance memory use and improve I/O efficiency, especially for large datasets read from disk.
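
What a batched function receives can be illustrated without the `datasets` library. In the sketch below, `map_batched` and `toy_tokenize_fn` are hypothetical stand-ins written for this explanation: the function is called once per slice of rows and receives a dictionary of lists (one list per column), which is the same shape of input that `tokenize_fn` gets when `batched=True`.

```python
def toy_tokenize_fn(batch):
    # `batch` is a dict of lists, one list per column.
    return {"n_words": [len(text.split()) for text in batch["text"]]}


def map_batched(columns, fn, batch_size=1000):
    """Apply fn to slices of the columns and append the new columns (hypothetical helper)."""
    n_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            result.setdefault(name, []).extend(values)
    return result


data = {
    "text": ["Tokenization is the process of splitting sequence to tokens", "I like BUAN6342"],
    "label": [0, 1],
}
mapped = map_batched(data, toy_tokenize_fn, batch_size=1)
print(list(mapped.keys()), mapped["n_words"])
```

The original columns are kept and the new column is appended, which mirrors how the tokenized dataset gains extra columns while retaining `text` and `label`.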



In [None]:
tokenized_dataset


Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 2
})

We can see that the tokenization step has added three new columns `('input_ids', 'token_type_ids', 'attention_mask')` to the dataset.
We no longer need the `text` column as model input, so we will leave it out of the returned columns. We set the dataset format to 'torch' and list only the columns we want, so that indexing the dataset returns those columns as PyTorch tensors, making the output directly compatible with PyTorch models and training routines.

In [None]:
tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

In [None]:
tokenized_dataset[0]

{'label': tensor(0),
 'input_ids': tensor([  101, 19204,  3989,  2003,  1996,  2832,  1997, 14541,  5537,  2000,
         19204,  2015,   102]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}

In [None]:
tokenized_dataset[1]

{'label': tensor(1),
 'input_ids': tensor([  101,  1045,  2066, 20934,  2319,  2575, 22022,  2475,   102]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1])}

In [None]:
tokenized_dataset

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 2
})

Remember, set_format doesn't alter the dataset's storage or remove columns; it only affects how data is retrieved. The dataset retains its comprehensive structure, allowing you to change the format dynamically as needed without losing any data.

In [None]:
print(len(tokenized_dataset["input_ids"][0]))
print(len(tokenized_dataset["input_ids"][1]))

13
9


The varying lengths in the dataset indicate that padding has not been applied yet. Instead of padding the entire dataset, we prefer processing small batches during training. Padding is done selectively for each batch based on the maximum length in the batch. We will discuss this in more detail in a later section of this notebook.

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

The `DataCollatorWithPadding` class dynamically pads the input data to the maximum length in a batch of inputs. This is essential when batching together sequences of different lengths, ensuring that each sequence in the batch has the same length by padding the shorter ones. Here is how `DataCollatorWithPadding` processes the data:

**Padding Input IDs (`input_ids`):**
   - The `input_ids` are sequences of integers that represent the tokenized version of the text data.
   - The `DataCollatorWithPadding` ensures that all `input_ids` in a batch are of the same length by adding padding tokens (usually represented by the ID `0`) to the sequences that are shorter than the longest sequence in the batch.
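
The padding logic can be sketched on plain Python lists. The function `pad_batch` below is a simplified stand-in written for illustration, not the collator's actual code; it assumes a padding ID of `0` and uses the token IDs of our two example sentences.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every example to the longest input_ids in this batch (toy sketch on lists)."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        batch["labels"].append(f["label"])
    return batch


features = [
    {"label": 0,
     "input_ids": [101, 19204, 3989, 2003, 1996, 2832, 1997, 14541, 5537, 2000, 19204, 2015, 102],
     "attention_mask": [1] * 13},
    {"label": 1,
     "input_ids": [101, 1045, 2066, 20934, 2319, 2575, 22022, 2475, 102],
     "attention_mask": [1] * 9},
]
batch = pad_batch(features)
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The shorter sequence gains four padding IDs, and its attention mask gains four matching zeros.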


In [None]:
features = [tokenized_dataset[i] for i in range(2)]
features

[{'label': tensor(0),
  'input_ids': tensor([  101, 19204,  3989,  2003,  1996,  2832,  1997, 14541,  5537,  2000,
          19204,  2015,   102]),
  'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])},
 {'label': tensor(1),
  'input_ids': tensor([  101,  1045,  2066, 20934,  2319,  2575, 22022,  2475,   102]),
  'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1])}]

In [None]:
features = [tokenized_dataset[i] for i in range(2)]
model_input = data_collator(features)

In [None]:
model_input

{'input_ids': tensor([[  101, 19204,  3989,  2003,  1996,  2832,  1997, 14541,  5537,  2000,
         19204,  2015,   102],
        [  101,  1045,  2066, 20934,  2319,  2575, 22022,  2475,   102,     0,
             0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]), 'labels': tensor([0, 1])}

The output is a single batch, represented as a dictionary containing the following key-value pairs:

1. **`labels`**: The labels (targets) for the examples in the batch, as a tensor with one entry per example. The data collator has renamed the `label` column to `labels`.

2. **`input_ids`**: A 2-D tensor of integers with one row per example. Each integer represents a token (word or subword) from the text, as encoded by the tokenizer. This is what the model takes as input. The shorter sequence has been padded with the ID `0` ([PAD]) so that both rows have length 13.

3. **`attention_mask`**: This tensor indicates which tokens in `input_ids` the model should pay attention to. **A value of `1` means that the corresponding token is part of the input and should be considered by the model, while a value of `0` indicates a padding token that should be ignored.**

#  <font color = 'indianred'> **5. Understanding Pre-trained BERT model**</font>






In [None]:
model = AutoModel.from_pretrained(checkpoint)


In [None]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

The code snippet uses the transformers library from Hugging Face to load the pre-trained BERT model without any task-specific head; the head for sequence classification is added in a later section. Here's a brief explanation:

- **Loading the Pretrained Model:**
The AutoModel.from_pretrained() function is used to load a pretrained model with one argument:

  - **checkpoint:** This specifies which pretrained model to load. We pass the same checkpoint that we used for the tokenizer, so the model and the tokenizer both come from the same pretrained model, which ensures compatibility.


<img src ="https://drive.google.com/uc?export=view&id=1qKP3ilQHoSPr1SDMfzB4hMOMEee6jx2F" width =800>

<img src ="https://drive.google.com/uc?export=view&id=1qM3jUSXKKbcEGiUVN-hIFJH6jSUiwpOf" width =800>

In [None]:
model_input

{'input_ids': tensor([[  101, 19204,  3989,  2003,  1996,  2832,  1997, 14541,  5537,  2000,
         19204,  2015,   102],
        [  101,  1045,  2066, 20934,  2319,  2575, 22022,  2475,   102,     0,
             0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]), 'labels': tensor([0, 1])}

In [None]:
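# model output
# use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model_input = model_input.to(device)
model.train()  # training mode: dropout is active, so outputs vary between runs
model_output = model(input_ids=model_input['input_ids'], attention_mask=model_input['attention_mask'])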

In [None]:
# keys in model output
model_output.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

In [None]:
# all the tokens of the input sequence
model_output.last_hidden_state.shape

torch.Size([2, 13, 768])

In [None]:
# [CLS] token representation after the pooler (dense layer + tanh)
model_output.pooler_output.shape


torch.Size([2, 768])

#  <font color = 'indianred'> **6. Custom classification head for BERT model**</font>

In [None]:
class CustomConfig(PretrainedConfig):
    def __init__(self, bert_model, ff_output_dim, n_classes, ff_dropout, cls_only=False, average_all=False, pooler=True, **kwargs):
        super().__init__(**kwargs)
        self.ff_input_dim = bert_model.config.hidden_size  # size of BERT's hidden states
        self.ff_output_dim = ff_output_dim                 # hidden size of the classification head
        self.n_classes = n_classes
        self.encoder = bert_model                          # the pre-trained BERT model is kept on the config
        self.ff_dropout = ff_dropout
        # exactly one of the following three flags selects the sequence representation
        self.cls_only = cls_only
        self.average_all = average_all
        self.pooler = pooler

In [None]:
class BERTClassifier(PreTrainedModel):
    config_class = CustomConfig

    def __init__(self, config):

        super().__init__(config)
        # Ensure exactly one of cls_only, average_all, or pooler is True
        assert (
            sum([config.cls_only, config.average_all, config.pooler]) == 1
        ), "Exactly one of 'cls_only', 'average_all', or 'pooler' must be True"

        self.classification_head = nn.Sequential(
            nn.Linear(config.ff_input_dim, config.ff_output_dim),
            nn.ReLU(),
            nn.Dropout(config.ff_dropout),
            nn.Linear(config.ff_output_dim, config.n_classes),
        )

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.config.encoder(
            input_ids, attention_mask=attention_mask)

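        if self.config.cls_only:
            # hidden state of the [CLS] token (position 0)
            output = outputs.last_hidden_state[:, 0, :]
        elif self.config.average_all:
            # mean over all token positions; note that this includes [PAD] positions
            last_hidden_state = outputs.last_hidden_state
            output = torch.mean(last_hidden_state, dim=1)
        elif self.config.pooler:
            # [CLS] hidden state passed through BERT's pooler (dense layer + tanh)
            output = outputs.pooler_output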


        logits = self.classification_head(output)
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.config.n_classes), labels.view(-1))

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits
        )

In [None]:
my_config = CustomConfig(
    bert_model=model,
    ff_output_dim=256,
    n_classes=2,
    ff_dropout=0.1,
    pooler=True,
    cls_only=False,
    average_all=False,
)

model_pytorch = BERTClassifier(my_config)

model_pytorch=model_pytorch.to(device=0)

In [None]:
model_pytorch_outputs = model_pytorch(**model_input)

In [None]:
model_pytorch_outputs.loss

tensor(0.6614, device='cuda:0', grad_fn=<NllLossBackward0>)

In [None]:
model_pytorch_outputs.logits.shape

torch.Size([2, 2])

In [None]:
model_pytorch_outputs.logits

tensor([[-0.0071, -0.1847],
        [-0.1577, -0.1998]], device='cuda:0', grad_fn=<AddmmBackward0>)
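
The loss reported above can be checked by hand from the printed logits and the labels `[0, 1]`. The sketch below recomputes the mean cross-entropy in plain Python; because the printed logits are rounded to four decimals, the result should agree with the reported loss only approximately.

```python
import math

logits = [[-0.0071, -0.1847], [-0.1577, -0.1998]]  # values printed above
labels = [0, 1]


def cross_entropy(row, target):
    """Negative log of the softmax probability assigned to the target class."""
    log_sum_exp = math.log(sum(math.exp(z) for z in row))
    return log_sum_exp - row[target]


per_example = [cross_entropy(row, y) for row, y in zip(logits, labels)]
loss = sum(per_example) / len(per_example)  # nn.CrossEntropyLoss averages over the batch by default
print(per_example, loss)
```

Both per-example losses are close to ln(2) ≈ 0.693, which is what an untrained two-class head is expected to produce.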

# <font color = 'indianred'> **7. Using AutoModel for SequenceClassification**</font>

In [None]:
auto_model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
auto_model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

<font color = 'indianred'> *AutoConfig for pre-trained Model*</font>

<font color = 'indianred'> *Explanation of Model configuration file*</font>

- A configuration file, in the context of pretrained models like those in the Hugging Face Transformers library, is a vital component that details the model's architecture, hyperparameters, and other essential settings. It serves as a blueprint, guiding how the model is structured and operates.

- Specifically, for models intended for tasks like classification, two critical pieces of information are `id2label` and `label2id`.

- `id2label` is a dictionary mapping numerical IDs to their respective class labels, while `label2id` is its inverse, mapping class labels to their IDs. These mappings are fundamental for translating between human-readable class labels (like "positive" or "negative") and the numerical IDs the model uses internally during training and inference.

- By ensuring that the configuration file contains `id2label` and `label2id`, you guarantee a seamless conversion between model outputs and interpretable class labels. Without them, translating the model's predictions into understandable results can be cumbersome. Adding this information enhances the usability and clarity of the model, especially when deploying it for real-world applications.
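
As a small illustration of how the two mappings are used, the sketch below builds `id2label` and its inverse `label2id`, then turns rows of logits into readable labels. The class names and logit values are example values assumed for this sketch.

```python
class_names = ["neg", "pos"]                       # example label names
id2label = dict(enumerate(class_names))            # ID -> label
label2id = {label: id_ for id_, label in id2label.items()}  # label -> ID

logits = [[-0.0071, -0.1847], [-0.1577, -0.1998]]  # example logits, one row per input
predicted_ids = [row.index(max(row)) for row in logits]     # argmax over the classes
predicted_labels = [id2label[i] for i in predicted_ids]
print(predicted_ids, predicted_labels)
```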


<font color = 'indianred'>*Download config file of pre-trained Model*</font>



In [None]:
config = AutoConfig.from_pretrained(checkpoint)


In [None]:
config

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.45.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

<font color = 'indianred'>*Modify Configuration File*</font>
- We need to modify the configuration file to add the ID-to-label and label-to-ID mappings.
- Adding `id2label` and `label2id` to the configuration file provides a consistent, interpretable, and user-friendly way to handle model outputs.

In [None]:
class_names = ['neg', 'pos']
class_names


['neg', 'pos']

In [None]:
id2label = {}
for id_, label_ in enumerate(class_names):
    id2label[id_] = label_
id2label


{0: 'neg', 1: 'pos'}

Code Explanation:
- First, an empty dictionary, `id2label`, is initialized.
- The `enumerate` function returns both the index (the ID) and the value (the label) of each item in the `class_names` list as you loop through it.
- Within the loop, each numerical ID (`id_`) is used as a key in the `id2label` dictionary, and the corresponding class name (`label_`) from the `class_names` list is assigned as the value for that key. The keys stay integers, as the output `{0: 'neg', 1: 'pos'}` shows.
- Why do the keys appear as strings (`"0"`, `"1"`) when the configuration is printed or saved? The configuration is serialized as JSON, and JSON object keys must be strings, so the integer keys are written out as strings. When a configuration is loaded, Transformers converts the `id2label` keys back to integers, so no manual conversion with `str(id_)` is needed.
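The JSON behaviour behind this can be checked with the standard library alone. The sketch below is independent of Transformers and only illustrates what happens to integer dictionary keys in a JSON round trip.

```python
import json

id2label = {0: "neg", 1: "pos"}

# json.dumps writes integer keys as strings, because JSON object keys must be strings.
serialized = json.dumps({"id2label": id2label})
print(serialized)  # {"id2label": {"0": "neg", "1": "pos"}}

# Loading the JSON back yields string keys ...
loaded = json.loads(serialized)["id2label"]
print(loaded)  # {'0': 'neg', '1': 'pos'}

# ... so an explicit int() conversion is needed to recover the original mapping.
restored = {int(key): value for key, value in loaded.items()}
print(restored == id2label)  # True
```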

In [None]:
label2id = {}
for id_, label_ in enumerate(class_names):
    label2id[label_] = id_
label2id


{'neg': 0, 'pos': 1}
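The two loops above can also be written as dictionary comprehensions. This is an equivalent sketch rather than a change to the notebook's approach; the final check confirms that the two mappings are inverses of each other.

```python
class_names = ["neg", "pos"]

# Same result as the explicit loops: index -> label and label -> index.
id2label = {id_: label_ for id_, label_ in enumerate(class_names)}
label2id = {label_: id_ for id_, label_ in enumerate(class_names)}

# Sanity check: looking up a label's ID and mapping it back returns the label.
round_trip_ok = all(id2label[label2id[name]] == name for name in class_names)
print(id2label, label2id, round_trip_ok)
```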

In [None]:
config.id2label = id2label
config.label2id = label2id


In [None]:
config


BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "neg",
    "1": "pos"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "neg": 0,
    "pos": 1
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.45.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [None]:
auto_model.config = config

In [None]:
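# Move the model to the first GPU (cuda:0); the tensors in model_input must be on the same device
auto_model = auto_model.to(device=0)
# Forward pass; the output contains a loss only when model_input includes labels
auto_model_outputs = auto_model(**model_input)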

In [None]:
auto_model_outputs.keys()

odict_keys(['loss', 'logits'])

In [None]:
auto_model_outputs.logits

tensor([[ 0.1587, -0.4091],
        [ 0.1827, -0.5426]], device='cuda:0', grad_fn=<AddmmBackward0>)

In [None]:
auto_model_outputs.loss

tensor(0.7846, device='cuda:0', grad_fn=<NllLossBackward0>)

<font color = 'indianred'> *Understanding Model Output*</font>

The model output consists of logits and a loss value, indicating the model's predictions and the performance on the input data:

1. **Logits (`auto_model_outputs.logits`):**
   - The logits are the raw, unnormalized scores output by the model's final layer.
   - Each row holds the scores for one input sequence, with one column per class. Here the tensor has shape 2 x 2: two sequences and two classes. Applying an activation function such as softmax turns each row into class probabilities.

2. **Loss (`auto_model_outputs.loss`):**
   - The loss value (here `0.7846`) represents the model's performance on the input data. It quantifies the difference between the model's predictions and the actual labels.
   - A lower loss value indicates better model performance, as it means the model's predictions are closer to the true labels.
   - The loss is used during training to update the model's weights, with the goal of minimizing this value over time.
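To make the link between logits, predictions, and loss concrete, the sketch below recomputes them in plain Python from the logits printed above. The true labels `[0, 1]` are an assumption, since they are not shown in this section; the mean reduction mirrors the default of PyTorch's cross-entropy loss.

```python
import math

# Logits copied from the printed output: one row per sequence, one column per class.
logits = [[0.1587, -0.4091],
          [0.1827, -0.5426]]

# ASSUMPTION: true labels of the two sequences (not shown in this section).
labels = [0, 1]

id2label = {0: "neg", 1: "pos"}

def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    shifted = [x - max(row) for x in row]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

probs = [softmax(row) for row in logits]

# Predicted class = index of the largest probability (same as the largest logit).
predictions = [id2label[row.index(max(row))] for row in probs]
print(predictions)  # ['neg', 'neg']

# Cross-entropy: mean negative log-probability assigned to the true class.
loss = sum(-math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)
print(f"{loss:.4f}")  # 0.7846 under the assumed labels
```

Under the assumed labels the result agrees with the printed loss. As a reference point, a two-class model that assigns probability 0.5 to each class has a loss of ln 2, roughly 0.693.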