In [None]:
print("Hallo")

In this notebook, we will learn how to load pre-trained models from Hugging Face and make inferences using the Pipeline module. Additionally, we will learn how to further train pre-trained LLMs on our own data (self-supervised fine-tuning). By the end, we will have a solid understanding of how to pretrain LLMs and store them to later fine-tune for specific use cases. This will empower you to create powerful and customized natural language processing solutions.

In [None]:
!pip install --user datasets # 2.15.0

In [None]:
import torch
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import AutoConfig, AutoModelForCausalLM, AutoModelForSequenceClassification, BertConfig, BertForMaskedLM, Trainer, TrainingArguments
from transformers import AutoTokenizer, BertTokenizerFast, TextDataset, DataCollatorForLanguageModeling
from transformers import pipeline
from datasets import load_dataset

from tqdm.auto import tqdm
import math
import time
import os


# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

## Pretraining and Self-supervised Learning

Pretraining is a technique used in natural language processing (NLP) to train large language models (LLMs) on a vast corpus of unlabeled text data. The goal is to capture the general patterns and semantic relationships present in natural language, allowing the model to develop a deep understanding of language structure and meaning.

The motivation behind pretraining transformers is to address the limitations of traditional NLP approaches that often require significant amounts of labeled data for each specific task. By leveraging the abundance of unlabeled text data, pretraining enables the model to learn fundamental language skills through self-supervised objectives, facilitating transfer learning.

The pretraining objectives, such as masked language modeling (MLM) and next sentence prediction (NSP), play a crucial role in the success of transformer models. Pretrained models can be further tuned by training them on domain-specific unlabeled data, which is known as self-supervised fine-tuning.

Also, the model can be fine-tuned on specific downstream tasks using labeled data, a process known as supervised fine-tuning, further improving its performance.


**AutoModelForCausalLM:**

- Automatically loads a model designed for causal language modeling (predicting the next token in a sequence).
- OPT-350M is part of Meta’s OPT (Open Pretrained Transformer) family.
- This model has 350 million parameters, making it suitable for text generation tasks without being too large.

**AutoTokenizer:**

- Loads the appropriate tokenizer for the OPT model.
- Tokenizes input text into numbers (token IDs) that the model can understand.

Key Points About OPT-350M:
 - **Causal language model**: Works left-to-right (like GPT), generating text token by token.
 - **Open Pretrained Transformer (OPT)**: Developed by Meta as an open-source alternative to GPT models.
 - **350 million parameters**: Balances performance and speed, suitable for many NLP tasks without requiring extensive computational resources.

In [None]:
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

pipe = pipeline("text-generation", model=model,tokenizer=tokenizer)
print(pipe("This movie was really")[0]["generated_text"])

## Pre-training Objectives

Pre-training objectives are essential components of the transformer pre-training process. These objectives define the tasks that the model is trained on during the pre-training phase, enabling it to learn meaningful contextual representations of language. Three commonly used pre-training objectives are **masked language modeling (MLM)**, **next sentence prediction (NSP)**, and **next token prediction**.

### 1. **Masked Language Modeling (MLM)**:
Masked language modeling involves randomly masking some words in a sentence and training the model to predict the masked words based on the context provided by the surrounding words (i.e., words that appear before and after the masked word). This objective enables the model to learn contextual understanding and fill in missing information, enhancing its ability to comprehend language structure and semantics.

---

### 2. **Next Sentence Prediction (NSP)**:
Next sentence prediction trains the model to determine whether two sentences are consecutive in the original text or randomly chosen from the corpus. This objective helps the model learn sentence-level relationships and understand the coherence between sentences, making it particularly useful for tasks like question answering and text summarization.
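
To make the NSP objective concrete, here is a minimal sketch of how training pairs could be built from an ordered list of sentences. The function name `make_nsp_pairs` and the toy document are illustrative assumptions, and drawing negatives from the same document is a simplification (the original BERT setup samples them from elsewhere in the corpus).

```python
import random

def make_nsp_pairs(sentences, rng):
    """Build (sentence_a, sentence_b, is_next) examples from an ordered list of sentences."""
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the sentence that really follows sentence i
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # Negative example: any sentence other than sentence i and its true successor
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), 0))
    return pairs

# Hypothetical toy document with distinct sentences
document = [
    "The film opens in a small town.",
    "A stranger arrives at night.",
    "Nobody trusts him at first.",
    "By the end he is a hero.",
]

pairs = make_nsp_pairs(document, random.Random(0))
for sentence_a, sentence_b, is_next in pairs:
    print(is_next, "|", sentence_a, "->", sentence_b)
```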

---

### 3. **Next Token Prediction**:
In this objective, the model is trained to predict the next token in a sequence of text. Given a sequence, the model learns to predict the most likely next token based on the preceding context. This approach is commonly used in autoregressive models like GPT, enabling them to generate coherent and contextually relevant text.
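
As a rough illustration of this objective, the sketch below turns one token sequence into (context, target) training pairs. The helper name `next_token_pairs` and the whitespace-split "tokens" are assumptions made for readability; real models operate on token IDs and compute all positions in one pass.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: every prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = ["This", "movie", "was", "really", "good"]
examples = next_token_pairs(tokens)

for context, target in examples:
    # The model only sees the left-hand context when predicting the target
    print(f"{' '.join(context):<25} -> {target}")
```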

---

It's important to note that different pre-trained models may use variations or combinations of these objectives, depending on the specific architecture and training setup. For example:
- **BERT** uses MLM and NSP.
- **GPT** uses next token prediction.
- Some models incorporate hybrid strategies for improved performance on diverse NLP tasks.

## Self-supervised training of a BERT model

Training a BERT (Bidirectional Encoder Representations from Transformers) model is a complex and time-consuming process that requires a large corpus of unlabeled text data and significant computational resources. The simplified exercise below demonstrates the steps involved in pre-training a BERT model using the Masked Language Modeling (MLM) objective.

For this exercise, we'll use the Hugging Face Transformers library, which provides pre-implemented BERT models and tools for pre-training. The steps are:
- Prepare the train dataset
- Train a Tokenizer
- Preprocess the dataset
- Pre-train BERT using an MLM task
- Evaluate the trained model


### Importing required datasets

The WikiText dataset is a widely used benchmark dataset in the field of natural language processing (NLP). The dataset contains a large amount of text extracted from Wikipedia, which is a vast online encyclopedia covering a wide range of topics. The articles in the WikiText dataset are preprocessed to remove formatting, hyperlinks, and other metadata, resulting in a clean text corpus.

The WikiText dataset has 4 different configs, and each is divided into three parts: a training set, a validation set, and a test set. The training set is used for training language models, while the validation and test sets are used for evaluating the performance of the models.
First, let's load the `wikitext-2-raw-v1` config, which we will use for pre-training.

*Note: The original BERT was pretrained on Wikipedia and BookCorpus datasets.*


In [None]:
# Load the datasets
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

In [None]:
print(dataset)

In [None]:
#check a sample record
dataset["train"][400]

In [None]:
# check the number of records in the training split
len(dataset["train"])

In [None]:
# Path to save the datasets to text files
output_file_train = "wikitext_dataset_train.txt"
output_file_test = "wikitext_dataset_test.txt"

# Open the output file in write mode
with open(output_file_train, "w", encoding="utf-8") as f:
    # Iterate over each example in the dataset
    for example in dataset["train"]:
        # Write the example text to the file
        f.write(example["text"] + "\n")

# Open the output file in write mode
with open(output_file_test, "w", encoding="utf-8") as f:
    # Iterate over each example in the dataset
    for example in dataset["test"]:
        # Write the example text to the file
        f.write(example["text"] + "\n")

### Loading the BERT tokenizer from pretrained `bert-base-uncased`

In [None]:
# create a tokenizer from existing one to re-use special tokens
bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

In [None]:
model_name = 'bert-base-uncased'

# is_decoder is a model configuration flag, so it is passed to the model rather than the tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name, is_decoder=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)


### Training a Tokenizer

In the previous cell, we created a tokenizer instance from a pre-trained BERT tokenizer. We can also train the tokenizer on our own dataset. This is especially helpful when using transformers in specialized domains such as medicine, where the vocabulary differs from the general-purpose text that off-the-shelf tokenizers were built on.
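
To see why a domain-specific vocabulary matters, here is a minimal sketch of greedy longest-match-first subword splitting, the scheme BERT's WordPiece tokenizer applies at tokenization time. Both vocabularies and the function name `wordpiece_tokenize` are hypothetical and chosen only for illustration.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into sub-word pieces by repeatedly taking the longest match in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabularies: a general one and one that has seen medical text
general_vocab = {"card", "##io", "##my", "##op", "##athy"}
medical_vocab = general_vocab | {"cardiomyopathy"}

general_pieces = wordpiece_tokenize("cardiomyopathy", general_vocab)
medical_pieces = wordpiece_tokenize("cardiomyopathy", medical_vocab)
print(general_pieces)
print(medical_pieces)
```

With the general vocabulary the word is broken into five fragments, while the domain vocabulary keeps it as a single token, which gives the model one embedding to learn for the concept.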

In [None]:
## create a python generator to dynamically load the data
def batch_iterator(batch_size=10000):
    for i in tqdm(range(0, len(dataset['train']), batch_size)):
        yield dataset['train'][i : i + batch_size]["text"]

## create a tokenizer from existing one to re-use special tokens
bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

## train the tokenizer using our own dataset, It keeps the special tokens from BERT but learns new tokens from your dataset.
bert_tokenizer = bert_tokenizer.train_new_from_iterator(text_iterator=batch_iterator(), vocab_size=30522)

In [None]:
bert_tokenizer.get_vocab()

In [None]:
len(bert_tokenizer.get_vocab())

### Pretraining

In this step, we define the configuration of the BERT model and create the model:
#### Define the BERT Configuration
Here, we define the configuration settings for a BERT model using `BertConfig`. This includes setting various parameters related to the model's architecture:
- **vocab_size=30522**: Specifies the size of the vocabulary. This number should match the vocabulary size used by the tokenizer.
- **hidden_size=768**: Sets the size of the hidden layers.
- **num_hidden_layers=12**: Determines the number of hidden layers in the transformer model.
- **num_attention_heads=12**: Sets the number of attention heads in each attention layer.
- **intermediate_size=3072**: Specifies the size of the "intermediate" (i.e., feed-forward) layer within the transformer.


What is BertConfig?

BertConfig is a class from Hugging Face’s transformers library that allows you to configure the architecture of a BERT model.
It specifies important hyperparameters like the vocabulary size, hidden dimensions, number of layers, and more.

### 1. `vocab_size=len(bert_tokenizer.get_vocab())`
**What it does:**  
Sets the vocabulary size of the model.

**Why it matters:**  
This must match the vocab size of your tokenizer.  
If your tokenizer has a vocabulary of 30,522 tokens, your model must have the same to ensure compatibility during training and inference.

---

### 2. `hidden_size=768`
**What it does:**  
Sets the size of the hidden layers.

**Why it matters:**  
This defines the dimensionality of token embeddings and the output of each encoder layer.  
In the original BERT base model, each token is represented by a 768-dimensional vector.

All of BERT's input embeddings share the same size, `hidden_size`. They are summed, not concatenated:

1. Word Embedding:
Each token in your input text is converted into a word embedding of size hidden_size.
In BERT-base, each token is represented as a 768-dimensional vector.
2. Position Embedding:
Since BERT doesn’t have a built-in sequence structure like RNNs, it adds positional embeddings to capture the order of tokens in a sentence.
Each position in the input sequence is represented by a 768-dimensional vector as well.
3. Segment Embedding:
BERT also adds segment (token type) embeddings to distinguish between the two sentences of a sentence-pair input, as used for Next Sentence Prediction (NSP). For single-sentence inputs, every token gets the same segment.
This is also a 768-dimensional vector.

Input Vector = Word Embedding + Position Embedding + Segment Embedding

All of these are added together element-wise, and the result is a vector of size 768 for each token.
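
The element-wise sum can be sketched with plain Python lists. The embedding values below are made up, and a toy `hidden_size` of 4 stands in for 768.

```python
hidden_size = 4  # toy size for readability; BERT-base uses 768

# Hypothetical embedding values for a single token
word_embedding = [0.2, -0.1, 0.5, 0.0]
position_embedding = [0.1, 0.1, 0.1, 0.1]
segment_embedding = [0.0, 0.3, 0.0, -0.2]

# Element-wise sum: the result keeps the same dimensionality as each input
input_vector = [
    w + p + s
    for w, p, s in zip(word_embedding, position_embedding, segment_embedding)
]

print(len(input_vector), input_vector)
```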

---

### 3. `num_hidden_layers=12`
**What it does:**  
Specifies the number of transformer layers (or encoder layers) in the model.

**Why it matters:**  
BERT-base has 12 layers stacked on top of each other.  
Each layer applies self-attention and feed-forward operations to learn increasingly complex patterns.

---

### 4. `num_attention_heads=12`
**What it does:**  
Sets the number of attention heads in each self-attention layer.

**Why it matters:**  
Each attention head focuses on different parts of the input sequence simultaneously.  
With 12 attention heads, BERT can learn various relationships between tokens in parallel.

What happens inside?

 - Input size: Each token is a vector of size 768.
 - Splitting for each head: The hidden size is split across the heads.
 - In BERT-base, with 12 heads, each head works with a sub-space of size 768 / 12 = 64.
 - Each attention head learns 64-dimensional Q, K, and V representations from the original 768-dimensional input.

Final Multi-Head Attention Output:

 - Each attention head outputs a 64-dimensional vector.
 - All 12 outputs are concatenated back together: 12 × 64 = 768.
 - This result is then projected back into the hidden size of 768.
 - We have 12 sets of Q, K, and V (one set per head).
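
The split-and-concatenate bookkeeping can be checked with a small sketch. The helpers `split_heads` and `merge_heads` are illustrative names, and the sketch slices a plain vector directly, whereas a real attention layer slices the projected Q, K, and V tensors.

```python
hidden_size = 768
num_attention_heads = 12
head_dim = hidden_size // num_attention_heads  # size of each head's sub-space

def split_heads(vector, num_heads):
    """Cut one hidden vector into num_heads equal, contiguous slices."""
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]

def merge_heads(heads):
    """Concatenate the per-head slices back into one hidden-size vector."""
    return [value for head in heads for value in head]

token_vector = list(range(hidden_size))  # stand-in for one token's representation
heads = split_heads(token_vector, num_attention_heads)
merged = merge_heads(heads)

print(head_dim, len(heads), len(heads[0]), len(merged))
```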
---

### 5. `intermediate_size=3072`
**What it does:**  
Sets the size of the intermediate layer in the feed-forward network.

**Why it matters:**  
Each encoder layer in BERT has a feed-forward network with two layers:  
- The first layer expands the hidden size from 768 to 3072.  
- The second layer reduces it back to 768.  

This non-linear transformation helps the model learn complex representations.
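
A minimal sketch of this expand-then-contract pattern is shown below, using toy sizes with the same 1:4 ratio as 768:3072. The random weights and the helper names `linear` and `feed_forward` are assumptions for illustration, not the library's implementation.

```python
import math
import random

def gelu(x):
    """Exact GELU activation: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(vector, weights, bias):
    """Dense layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]

def feed_forward(vector, w1, b1, w2, b2):
    """Expand to the intermediate size, apply GELU, then project back."""
    expanded = [gelu(x) for x in linear(vector, w1, b1)]
    return linear(expanded, w2, b2), len(expanded)

rng = random.Random(0)
hidden_size, intermediate_size = 4, 16  # toy sizes; BERT-base uses 768 and 3072

w1 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)] for _ in range(intermediate_size)]
b1 = [0.0] * intermediate_size
w2 = [[rng.uniform(-0.1, 0.1) for _ in range(intermediate_size)] for _ in range(hidden_size)]
b2 = [0.0] * hidden_size

output, expanded_width = feed_forward([1.0, -0.5, 0.25, 2.0], w1, b1, w2, b2)
print(expanded_width, len(output))
```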


In [None]:
# Define the BERT configuration
config = BertConfig(
    vocab_size=len(bert_tokenizer.get_vocab()),  # Specify the vocabulary size(Make sure this number equals the vocab_size of the tokenizer)
    hidden_size=768,  # Set the hidden size
    num_hidden_layers=12,  # Set the number of layers
    num_attention_heads=12,  # Set the number of attention heads
    intermediate_size=3072,  # Set the intermediate size
)

### Key Components in the BERT Architecture:

#### 1. Input Embedding + Positional Encoding:
- Each token is converted into a vector of size **768** (hidden size).
- Learned position embeddings are added to preserve the order of tokens.

---

#### 2. Multi-Head Attention:
- There are **12 attention heads** in BERT-base.
- Each head attends over the whole sequence in parallel, using its own **Q, K, and V matrices** in a 64-dimensional sub-space.
- The outputs are concatenated to maintain the hidden size of **768**.

---

#### 3. Add & Norm:
- **Residual connections** are added after the attention and feed-forward layers.
- **Layer normalization** is applied to stabilize training.

---

#### 4. Feed-Forward Network (FFN):
- This is where **`intermediate_size=3072`** comes into play.
- The hidden size (**768**) is expanded to **3072**, processed through a non-linear activation (**GELU**), and then reduced back to **768**.

---

#### 5. Output Layer:
- After passing through **12 encoder layers**, BERT produces an output for each token.
- This output can be used for tasks like:
  - **Masked token prediction** (filling in `[MASK]`).
  - **Next sentence prediction**.
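
To get a feel for how these components add up, the sketch below estimates the number of trainable parameters in the embeddings and encoder stack from the configuration values. It assumes the `BertConfig` defaults `max_position_embeddings=512` and `type_vocab_size=2`, counts weights and biases of each dense layer plus the LayerNorm parameters, and leaves out the pooler and the MLM head. The function name `count_bert_parameters` is hypothetical.

```python
def count_bert_parameters(vocab_size=30522, hidden_size=768, num_hidden_layers=12,
                          intermediate_size=3072, max_position_embeddings=512,
                          type_vocab_size=2):
    """Return (embedding params, params per encoder layer, total without pooler/head)."""
    layer_norm = 2 * hidden_size  # scale and shift vectors

    # Word + position + segment embedding tables, followed by one LayerNorm
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size + layer_norm

    # Query, key, value, and output projections (weights + biases)
    attention = 4 * (hidden_size * hidden_size + hidden_size)

    # Feed-forward network: hidden -> intermediate -> hidden
    feed_forward = (hidden_size * intermediate_size + intermediate_size) \
        + (intermediate_size * hidden_size + hidden_size)

    # Each encoder layer has one LayerNorm after attention and one after the FFN
    per_layer = attention + layer_norm + feed_forward + layer_norm
    total = embeddings + num_hidden_layers * per_layer
    return embeddings, per_layer, total

embedding_params, per_layer_params, total_params = count_bert_parameters()
print(f"embeddings: {embedding_params:,}")
print(f"per encoder layer: {per_layer_params:,}")
print(f"embeddings + 12 layers: {total_params:,}")
```

Roughly two thirds of each encoder layer's parameters sit in the feed-forward network, which is why `intermediate_size` has a large effect on model size.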


In [None]:
# Create the BERT model for pre-training
model = BertForMaskedLM(config)

In [None]:
# check model configuration
model

### Define the Training Dataset
Here, we define a training dataset using the `TextDataset` class, which is suited for loading and processing text data for training language models. This setup typically involves a few key parameters:

- **tokenizer=bert_tokenizer**: Specifies the tokenizer to be used. Here, `bert_tokenizer` is an instance of a BERT tokenizer, responsible for converting text into tokens that the model can understand.
- **file_path="wikitext_dataset_train.txt"**: The path to the pre-training data file. This should point to a text file containing the training data.
- **block_size=128**: Sets the desired block size for training. This defines the length of the sequences that the model will be trained on

The `TextDataset` class is designed to take large pieces of text (such as those found in the specified file), tokenize them, and efficiently handle them in manageable blocks of the specified size.
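
The block-splitting idea can be sketched as follows. This is a simplified illustration, assuming that each block reserves two positions for `[CLS]` and `[SEP]` and that a trailing remainder shorter than a full block is dropped; the function name `chunk_into_blocks` and the token IDs are hypothetical.

```python
def chunk_into_blocks(token_ids, block_size, cls_id, sep_id):
    """Cut a long token stream into fixed-size blocks wrapped in special tokens."""
    content_size = block_size - 2  # leave room for [CLS] and [SEP]
    blocks = []
    # Step through the stream; a final partial chunk is skipped
    for start in range(0, len(token_ids) - content_size + 1, content_size):
        chunk = token_ids[start:start + content_size]
        blocks.append([cls_id] + chunk + [sep_id])
    return blocks

token_ids = list(range(1000, 1020))  # 20 hypothetical token IDs
blocks = chunk_into_blocks(token_ids, block_size=8, cls_id=101, sep_id=102)

for block in blocks:
    print(block)
```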



In [None]:
# Prepare the pre-training data as a TextDataset
train_dataset = TextDataset(
    tokenizer=bert_tokenizer,
    file_path="wikitext_dataset_train.txt",  # Path to your pre-training data file
    block_size=128  # Set the desired block size for training
)
test_dataset = TextDataset(
    tokenizer=bert_tokenizer,
    file_path="wikitext_dataset_test.txt",  # Path to the held-out evaluation data file
    block_size=128  # Set the desired block size for training
)

In [None]:
train_dataset[0]

Then, we prepare data for the MLM task (masking random tokens):
### Define the Data Collator for Language Modeling
This line of code sets up a `DataCollatorForLanguageModeling` from the Hugging Face Transformers library. A data collator is used during training to dynamically create batches of data. For language modeling, particularly for models like BERT that use masked language modeling (MLM), this collator prepares training batches by automatically masking tokens according to a specified probability. Here are the details of the parameters used:

- **tokenizer=bert_tokenizer**: Specifies the tokenizer to be used with the data collator. The `bert_tokenizer` is responsible for tokenizing the text and converting it to the format expected by the model.
- **mlm=True**: Indicates that the data collator should mask tokens for masked language modeling training. This parameter being set to `True` configures the collator to randomly mask some of the tokens in the input data, which the model will then attempt to predict.
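- **mlm_probability=0.15**: Sets the probability with which each token is selected for prediction. A probability of 0.15 means that, on average, 15% of the tokens in any sequence are selected; of those, 80% are replaced with the mask token, 10% are replaced with a random token, and 10% are left unchanged.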
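
The masking step can be sketched in plain Python. This is a simplified illustration of the 80% mask / 10% random / 10% unchanged rule that BERT-style masking applies to the selected tokens; it ignores details such as skipping special tokens, and the function name `mask_tokens`, the token IDs, and `mask_id=103` are assumptions.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, rng, mlm_probability=0.15):
    """Return (inputs, labels) for MLM; labels are -100 wherever no prediction is required."""
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 is ignored by the cross-entropy loss
    for i, token in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # token not selected: input unchanged, no loss at this position
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return inputs, labels

token_ids = list(range(1000, 1100))  # 100 hypothetical token IDs
inputs, labels = mask_tokens(token_ids, mask_id=103, vocab_size=30522, rng=random.Random(0))

selected = sum(label != -100 for label in labels)
print(f"{selected} of {len(token_ids)} tokens selected for prediction")
```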


In [None]:
# Prepare the data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_tokenizer, mlm=True, mlm_probability=0.15
)

In [None]:
# check how collator transforms a sample input data record
data_collator([train_dataset[0]])

Now, we train the BERT model using the `Trainer` module (for a complete list of training arguments, check [here](https://huggingface.co/docs/transformers/v4.33.2/en/main_classes/trainer#transformers.TrainingArguments)).
The training arguments below control how the model is trained, evaluated, and saved:

- **output_dir="./trained_model"**: Specifies the directory where the trained model and other output files will be saved.
- **overwrite_output_dir=True**: If set to `True`, this will overwrite the contents of the output directory if it already exists. This is useful when running experiments multiple times.
- **do_eval=True**: Enables evaluation of the model. If `True`, the model will be evaluated at the specified intervals.
- **evaluation_strategy="epoch"**: Defines when the model should be evaluated. Setting this to "epoch" means the model will be evaluated at the end of each epoch.
- **learning_rate=5e-5**: Sets the learning rate for training the model. This is a typical learning rate for fine-tuning BERT-like models.
- **num_train_epochs=10**: Specifies the number of training epochs. Each epoch involves a full pass over the training data.
- **per_device_train_batch_size=2**: Sets the batch size for training on each device. This should be set based on the memory capacity of your hardware.
- **save_total_limit=2**: Limits the total number of model checkpoints to be saved. Only the most recent two checkpoints will be kept.
- **logging_steps=20**: Determines how often to log training information, which can help monitor the training process.
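
To see how these settings interact, the sketch below computes the number of optimizer steps and the learning rate over time. It assumes a single device, no gradient accumulation, and a linear decay schedule without warmup (the `Trainer` default scheduler type is linear); the dataset size of 1,000 blocks and the function name `linear_decay_lr` are hypothetical.

```python
import math

def linear_decay_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at a given step: linear warmup (if any), then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

num_examples = 1000               # hypothetical number of training blocks
per_device_train_batch_size = 2
num_train_epochs = 10

steps_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print("total optimizer steps:", total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, linear_decay_lr(step, total_steps))
```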


In [None]:
'''# Define the training arguments
training_args = TrainingArguments(
    output_dir="./trained_model",  # Specify the output directory for the trained model
    overwrite_output_dir=True,
    do_eval=True,
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    num_train_epochs=10,  # Specify the number of training epochs
    per_device_train_batch_size=2,  # Set the batch size for training
    save_total_limit=2,  # Limit the total number of saved checkpoints
    logging_steps = 20

)

# Instantiate the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Start the pre-training
trainer.train()'''

## Evaluating Model Performance

Let's check the performance of the trained model. Perplexity is commonly used to compare different language models or different configurations of the same model.
After training, perplexity can be calculated on a held-out evaluation dataset to assess the model's performance. The perplexity is calculated by feeding the evaluation dataset through the model and comparing the predicted probabilities of the target tokens with the actual token values that are masked.

A lower perplexity score indicates that the model has a better understanding of the language and is more effective at predicting the masked tokens. It suggests that the model has learned useful representations and can generalize well to unseen data.
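
Perplexity is the exponential of the average negative log-likelihood of the target tokens, which is why the evaluation cell applies `math.exp` to the evaluation loss. A minimal sketch with made-up probabilities (the function name `perplexity` is illustrative):

```python
import math

def perplexity(target_probabilities):
    """exp of the mean negative log-likelihood assigned to the correct tokens."""
    negative_log_likelihoods = [-math.log(p) for p in target_probabilities]
    return math.exp(sum(negative_log_likelihoods) / len(negative_log_likelihoods))

# A model that spreads its probability evenly over 4 options for every target
uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])

# Hypothetical probabilities from a confident model and an unsure model
confident = perplexity([0.9, 0.8, 0.95])
unsure = perplexity([0.2, 0.1, 0.3])

print(uniform_guess, confident, unsure)
```

A perplexity of k can be read as the model being, on average, as uncertain as if it were choosing uniformly among k tokens.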


In [None]:
'''eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")'''

## Loading the saved model
If you want to skip training and load a model that was already trained for 10 epochs, run the following cell:


In [None]:
!wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/BeXRxFT2EyQAmBHvxVaMYQ/bert-scratch-model.pt'
model.load_state_dict(torch.load('bert-scratch-model.pt',map_location=torch.device('cpu')))

In [None]:
# Define the input text with a masked token
text = "This is a [MASK] movie!"

# Create a pipeline for the "fill-mask" task
mask_filler = pipeline("fill-mask", model=model,tokenizer=bert_tokenizer)

# Generate predictions by filling the mask in the input text
results = mask_filler(text) #top_k parameter can be set

# Print the predicted sequences
for result in results:
    print(f"Predicted token: {result['token_str']}, Confidence: {result['score']:.2f}")

In [None]:
# Load the pretrained BERT model and tokenizer
pretrained_model = BertForMaskedLM.from_pretrained('bert-base-uncased')
pretrained_tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Define the input text with a masked token
text = "This is a [MASK] movie!"

# Create the pipeline
mask_filler = pipeline(task='fill-mask', model=pretrained_model,tokenizer=pretrained_tokenizer)

# Perform inference using the pipeline
results = mask_filler(text)
for result in results:
    print(f"Predicted token: {result['token_str']}, Confidence: {result['score']:.2f}")

This pretrained model performs far better than the model trained from scratch for a few epochs on a single dataset. Still, pretrained models cannot be used directly for specific tasks, such as sentiment extraction or sequence classification. This is why supervised fine-tuning methods are introduced.


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset

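# Load the SNLI dataset
snli = load_dataset("stanfordnlp/snli")
# SNLI marks examples without a gold label as -1; drop them so every label is 0, 1, or 2
snli = snli.filter(lambda example: example["label"] != -1)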

# Preprocessing function
def preprocess_function(examples):
  premise = examples["premise"]
  hypothesis = examples["hypothesis"]
  return tokenizer(premise, hypothesis, padding="max_length", truncation=True)

model_name = "bert-base-uncased"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Apply preprocessing to training and validation sets
train_encoded = snli["train"].map(preprocess_function, batched=True)
val_encoded = snli["validation"].map(preprocess_function, batched=True)

# Training function (replace with your training loop)
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",  # Replace with your output directory
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_encoded,
    eval_dataset=val_encoded,
)

trainer.train()

# Evaluation function (replace with your metrics)
from sklearn.metrics import accuracy_score

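# trainer.predict returns a PredictionOutput with predictions, label_ids, and metrics
prediction_output = trainer.predict(val_encoded)
accuracy = accuracy_score(prediction_output.label_ids, prediction_output.predictions.argmax(-1))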
print(f"Accuracy on validation set: {accuracy:.4f}")