### Let’s count the number of parameters in GPT-2-XL.
- You are only allowed to use the variable names defined below to answer the questions, e.g., `n_layers * n_heads` is allowed, while `4 * d_ffn` is not allowed.
- For simplicity, we will **ignore the bias terms** in all questions.

In [None]:
n_layers = 48  # the number of transformer layers (aka. transformer blocks)
n_heads = 25   # the number of attention heads in each layer
d_model = 1600 # the model dimension
d_ffn = 6400   # the FFN (aka. MLP) dimension
d_heads = 64   # the attn head dimension
n_vocab = 50257 # vocabulary size
n_ctx = 1024    # the maximum sequence length the model can process

#### The input embeddings consist of token embeddings and position embeddings. Count the number of parameters in the two embeddings.

In [None]:
token_embeddings = n_vocab*d_model
print(token_embeddings)

80411200


In [None]:
position_embeddings = n_ctx*d_model # Hint: n_ctx
print(position_embeddings)

1638400


#### Q2.1.2 Multi-Headed Attention consists of W_Q, W_K, W_V, and W_O,
- MultiHead(Q, K, V) = Concat(head_1, ..., head_n) W_O
- see more details in https://arxiv.org/pdf/1706.03762
#### Count the number of parameters in them.
- Here we multiply by `n_layers` to get the total number of parameters across layers.

In [None]:
attn_q = n_layers * n_heads * d_model * d_heads
print(attn_q)

122880000


In [None]:
attn_k = n_layers * n_heads * d_model * d_heads
print(attn_k)

122880000


In [None]:
attn_v = n_layers * n_heads * d_model * d_heads
print(attn_v)

122880000


In [None]:
attn_o = n_layers * n_heads * d_heads * d_model
print(attn_o)

122880000
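Because `n_heads * d_heads` equals `d_model` in GPT-2-XL, each of the four projections is effectively a square `d_model × d_model` matrix per layer. A minimal check of that equivalence, reusing the values defined at the top of this section (the names `per_head_form`, `square_form`, and `attn_total` are introduced here for illustration):

```python
n_layers = 48
n_heads = 25
d_model = 1600
d_heads = 64

# The concatenated heads span exactly the model dimension.
assert n_heads * d_heads == d_model

# Per-layer size of one projection (W_Q, W_K, W_V, or W_O), written both ways.
per_head_form = n_heads * d_model * d_heads
square_form = d_model * d_model
assert per_head_form == square_form

# Four projections per layer, summed over all layers.
attn_total = 4 * n_layers * per_head_form
print(per_head_form)  # 2560000
print(attn_total)     # 491520000
```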


#### The feed-forward network (FFN) in each transformer block consists of two layers, ffn1 and ffn2. Count the number of parameters in them, respectively.
- We multiply by `n_layers` to get the total number of parameters across layers.

In [None]:
ffn1 = n_layers * d_model * d_ffn
print(ffn1)

491520000


In [None]:
ffn2 = n_layers * d_ffn * d_model
print(ffn2)

491520000


#### Count the number of parameters in the output embeddings (2 points)

In [None]:
output_embeddings = n_vocab * d_model
print(output_embeddings)

80411200


#### Print out the total number of parameters (1 point).
- We do not double-count output_embeddings because GPT-2 shares the weights of token_embeddings and output_embeddings.
- For simplicity, we ignore the bias terms and layernorms.

In [None]:
n_total = token_embeddings + position_embeddings + attn_q + attn_k + attn_v + attn_o + ffn1 + ffn2
print(f'{n_total/10**9:.3f}B')

1.557B
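For a sense of how little the simplification costs, the sketch below counts the terms left out above. It assumes the standard GPT-2 layout: a bias on each of the four attention projections and on both FFN layers, two LayerNorms per block, and one final LayerNorm, with each LayerNorm carrying a weight and a bias of size `d_model`. The names `bias_per_layer`, `ln_per_layer`, `ln_final`, and `n_ignored` are introduced here for illustration.

```python
n_layers = 48
d_model = 1600
d_ffn = 6400
n_total = 1_556_609_600  # weight-only total computed above

# Assumed layout: biases on W_Q, W_K, W_V, W_O (d_model each),
# on ffn1 (d_ffn), and on ffn2 (d_model).
bias_per_layer = 4 * d_model + d_ffn + d_model

# Two LayerNorms per block, each with a weight and a bias vector.
ln_per_layer = 2 * 2 * d_model

# One final LayerNorm after the last block (assumption stated above).
ln_final = 2 * d_model

n_ignored = n_layers * (bias_per_layer + ln_per_layer) + ln_final
print(n_ignored)                     # 1001600
print(f'{n_ignored / n_total:.3%}')  # 0.064%
print(n_total + n_ignored)           # 1557611200
```

Under these assumptions the ignored terms amount to well under a tenth of a percent of the model, which is why dropping them does not change the rounded total.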


#### The majority of parameters are in the FFN layers! Print the percentage.

In [None]:
print(f'{(ffn1+ffn2)/n_total:.1%}')

63.2%
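The same arithmetic can be extended to show where the remaining parameters sit. This is a small sketch that regroups the counts from this section into three buckets; the dictionary `groups` and its keys are names chosen here for illustration.

```python
n_layers, n_heads = 48, 25
d_model, d_ffn, d_heads = 1600, 6400, 64
n_vocab, n_ctx = 50257, 1024

# Weight-only counts, grouped by component (biases and layernorms ignored).
groups = {
    'embeddings': n_vocab * d_model + n_ctx * d_model,
    'attention': 4 * n_layers * n_heads * d_model * d_heads,
    'ffn': 2 * n_layers * d_model * d_ffn,
}
n_total = sum(groups.values())

# Print each group's parameter count and its share of the total.
for name, count in groups.items():
    print(f'{name:<10} {count:>13,} {count / n_total:6.1%}')
```

Since `d_ffn` is four times `d_model`, the two FFN matrices in a layer hold twice as many weights as the four attention projections combined, which is where the roughly two-to-one split comes from.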


## Understanding different implementations of large language models

### Different LLMs use slightly different architectures.
- The code is modified from Hugging Face's implementations.
- `d_model`: the model dimension
- `d_ffn`: the FFN (aka. MLP) dimension

In [None]:
%%script echo skipping
class GPT2MLP(nn.Module):
    def __init__(self, config, d_model, d_ffn):
        super().__init__()
        self.fn1 = nn.Linear(d_model, d_ffn)
        self.fn2 = nn.Linear(d_ffn, d_model)
        self.act = ACT2FN[config.activation_function]
        self.dropout = nn.Dropout(config.resid_pdrop)

    def forward(self, hidden_states: Optional[Tuple[torch.FloatTensor]]) -> torch.FloatTensor:
        hidden_states = self.fn1(hidden_states)
        hidden_states = self.act(hidden_states)
        hidden_states = self.fn2(hidden_states)
        hidden_states = self.dropout(hidden_states)
        return hidden_states


class GPT2Block(nn.Module):
    def __init__(self, config, d_model, d_ffn, layer_idx=None):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model, eps=config.layer_norm_epsilon)
        self.attn = GPT2_ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)
        self.ln_2 = nn.LayerNorm(d_model, eps=config.layer_norm_epsilon)
        self.mlp = GPT2MLP(config, d_model, d_ffn)

    def forward(
        self,
        hidden_states: Optional[Tuple[torch.FloatTensor]],
        attention_mask: Optional[torch.FloatTensor] = None,
    ):
        residual = hidden_states
        hidden_states = self.ln_1(hidden_states)
        attn_output = self.attn(
            hidden_states,
            attention_mask=attention_mask,
        )
        # residual connection
        hidden_states = attn_output + residual
        residual = hidden_states
        hidden_states = self.ln_2(hidden_states)
        feed_forward_hidden_states = self.mlp(hidden_states)
        # residual connection
        hidden_states = residual + feed_forward_hidden_states

        return hidden_states

skipping


In [None]:
%%script echo skipping
class GPTJBlock(nn.Module):
    def __init__(self, config, d_model, d_ffn, layer_idx=None):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model, eps=config.layer_norm_epsilon)
        self.attn = GPTJ_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)
        self.mlp = GPTJMLP(config, d_model, d_ffn)

    def forward(
        self,
        hidden_states: Optional[torch.FloatTensor],
        attention_mask: Optional[torch.FloatTensor] = None,
    ):
        residual = hidden_states
        hidden_states = self.ln_1(hidden_states)
        attn_output = self.attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
        )

        feed_forward_hidden_states = self.mlp(hidden_states)
        # residual connection
        hidden_states = attn_output + feed_forward_hidden_states + residual

        return hidden_states

skipping


In [None]:
%%script echo skipping
class LlamaMLP(nn.Module):
    def __init__(self, config, d_model, d_ffn):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ffn, bias=config.mlp_bias)
        self.up_proj = nn.Linear(d_model, d_ffn, bias=config.mlp_bias)
        self.down_proj = nn.Linear(d_ffn, d_model, bias=config.mlp_bias)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        # x: [BS, d_model]
        hidden_states = self.act_fn(self.gate_proj(x)) * self.up_proj(x)
        output = self.down_proj(hidden_states)
        return output

skipping


#### What are the differences between `GPTJBlock()` and `GPT2Block()`?
- Hint: (1) When do self.attn() and self.mlp() add to the residual stream, respectively? (2) What's the input of self.mlp()?
- **Ignore** the differences in bias terms, layernorm, dropout, and activation functions.

1) Residual Connections and Computational Flow

GPTJBlock(): Uses a single residual addition at the end of the block, summing the attention output, the MLP output, and the original input. Because attention and MLP both depend only on `ln_1(hidden_states)`, they can be computed in parallel, which can speed up training and inference.

GPT2Block(): Uses two residual additions, one after the attention layer and another after the MLP layer. The MLP cannot start until the attention output has been added to the residual stream, so the two sub-layers run sequentially.

2) Input to MLP Layer

GPTJBlock(): The MLP receives the same input as the attention layer, `ln_1(hidden_states)`. As a result:
 - The MLP and attention layers can be computed in parallel.
 - Within a block, the MLP does not see that block's attention output; it only sees attention outputs from earlier blocks through the residual stream.

GPT2Block(): The MLP input is the residual stream after the attention output has been added (`attn_output + residual`, then normalized by `ln_2`). As a result:
 - The MLP operates on representations that already include the current block's attention output.
 - The MLP must wait for attention to finish, so the two cannot run in parallel.

3) Layer Normalization Placement

GPTJBlock(): Has only one layer normalization, at the beginning of the block. Its output feeds both the attention layer and the MLP, so both receive normalized inputs.

GPT2Block(): Has two layer normalizations, one before each sub-layer (`ln_1` before attention, `ln_2` before the MLP), so the MLP input is re-normalized after the attention output has been added.
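The two data flows can be contrasted with a toy sketch. The functions `ln`, `attn`, and `mlp` below are hypothetical stand-ins that operate on plain Python lists, not the real modules; only the wiring of the residual stream mirrors the two blocks.

```python
def ln(x):
    # Stand-in for LayerNorm: centre the vector (illustrative only).
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def attn(x):
    # Stand-in for self-attention: any function of its input works here.
    return [2.0 * v for v in x]

def mlp(x):
    # Stand-in for the FFN.
    return [v * v for v in x]

def add(a, b):
    # Element-wise addition, used for the residual connections.
    return [u + v for u, v in zip(a, b)]

def gpt2_block(x):
    # Sequential: the MLP reads the stream after attention was added.
    h = add(x, attn(ln(x)))
    return add(h, mlp(ln(h)))

def gptj_block(x):
    # Parallel: attention and MLP both read ln(x); one residual add.
    normed = ln(x)
    return add(add(attn(normed), mlp(normed)), x)

x = [1.0, 2.0, 3.0]
print(gpt2_block(x))  # [8.0, 2.0, 14.0]
print(gptj_block(x))  # [0.0, 2.0, 6.0]
```

In `gptj_block`, the calls to `attn` and `mlp` share one input and do not depend on each other, which is what makes parallel execution possible; in `gpt2_block`, `mlp` cannot be evaluated before `h` exists.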



#### In `LlamaMLP()`, the shape of the input `x` is `[BS, d_model]`. What are the shapes of `self.act_fn(self.gate_proj(x))`, `self.up_proj(x)`, `hidden_states`, and `output`, respectively? Answer with the variable names (4 points).
- The type of the activation function (ReLU, SiLU, or Sigmoid) does not change the answer

#### Your answer here
- The shape of the output of `self.act_fn(self.gate_proj(x))`: [BS, d_ffn]
- The shape of the output of `self.up_proj(x)`: [BS, d_ffn]
- The shape of `hidden_states`: [BS, d_ffn]
- The shape of `output`: [BS, d_model]
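The shapes can be checked with a tiny stand-in for the gated MLP. This sketch uses plain Python lists, made-up sizes (`BS = 2`, `d_model = 3`, `d_ffn = 5`), and all-ones weights stored as `[in, out]` for readability; the helpers `matmul`, `shape`, `silu`, and `ones` are defined here for illustration and are not part of the Llama code.

```python
import math

def matmul(a, b):
    # a: [n, k], b: [k, m] -> [n, m]
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def shape(m):
    return [len(m), len(m[0])]

def silu(m):
    # Element-wise SiLU, v * sigmoid(v); the shape is unchanged.
    return [[v / (1.0 + math.exp(-v)) for v in row] for row in m]

def ones(rows, cols):
    return [[1.0] * cols for _ in range(rows)]

BS, d_model, d_ffn = 2, 3, 5

x = ones(BS, d_model)
w_gate = ones(d_model, d_ffn)
w_up = ones(d_model, d_ffn)
w_down = ones(d_ffn, d_model)

gate = silu(matmul(x, w_gate))           # [BS, d_ffn]
up = matmul(x, w_up)                     # [BS, d_ffn]
hidden_states = [[g * u for g, u in zip(g_row, u_row)]
                 for g_row, u_row in zip(gate, up)]  # element-wise, [BS, d_ffn]
output = matmul(hidden_states, w_down)   # [BS, d_model]

print(shape(gate), shape(up), shape(hidden_states), shape(output))
# [2, 5] [2, 5] [2, 5] [2, 3]
```

The element-wise product requires the gate and up branches to have identical shapes, which is why both projections map `d_model` to `d_ffn`.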

# Finetuning

Finetune the pretrained language model DistilBERT on a sentiment analysis task using the IMDb dataset.

In [None]:
!pip install datasets



In [None]:
#load dataset
# we use imdb dataset to finetune the model. We test the model on imdb and sst2.
from datasets import load_dataset

imdb_dataset = load_dataset('stanfordnlp/imdb')




## 1. Preprocess the data

In [None]:
from sklearn.model_selection import train_test_split

train_df = imdb_dataset['train']
train_x, dev_x, train_y, dev_y = train_test_split(train_df['text'], train_df['label'], test_size=0.1, random_state=42, stratify=train_df['label'])


In [None]:
len(train_x)
len(dev_x)

2500

In [None]:
print(train_x[0])

"Algie, the Miner" is one bad and unfunny silent comedy. The timing of the slapstick is completely off. This is the kind of humor with certain sequences that make you wonder if they're supposed to be funny or not. However, the actual quality of the film is irrelevant. This is mandatory viewing for film buffs mainly because its one of the earliest examples of gay cinema. The main character of Algie is an effeminate guy, acting much like the stereotypical "pansy" common in many early films. The film has the homophobic attitude common of the time. "Algie, the Miner" is pretty awful, but fascinating from a historical viewpoint. (3/10)


## 2. Prepare the data


We use the `Dataset` class from `torch.utils.data` to prepare the data, and the `DataLoader` class to prepare batches for training.

In the `SentimentAnalysisDataset` class, you need to use the DistilBERT tokenizer to tokenize the sentences.

In [None]:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')




In [None]:
from torch.utils.data import Dataset, DataLoader

import torch
torch.manual_seed(42)

class SentimentAnalysisDataset(Dataset):
  #write your code here
  def __init__(self,data, tokenizer, max_len = None):
    self.input = data['input']
    self.labels = data['label']
    self.tokenizer = tokenizer
    self.max_len = max_len
    self.len = len(self.input)
    self.prepare()

  def prepare(self):
    self.encodings = self.tokenizer(self.input,
                                    padding='max_length',
                                    truncation=True,
                                    max_length=self.max_len,
                                    return_tensors='pt')
    self.input_ids = self.encodings['input_ids']
    self.attention_masks = self.encodings['attention_mask']

  def __len__(self):
    return self.len

  def __getitem__(self,idx):
    input_ids = self.input_ids[idx]
    attention_mask = self.attention_masks[idx]
    label = torch.tensor(self.labels[idx], dtype=torch.long)
    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'labels': label
    }

# Example of usage
# GPU usage: because Colab limits GPU time, we do not train on the whole training set; we only use the first 20 samples.
train = {'input':train_x[:20], 'label':train_y[:20]}
# train_df = pd.DataFrame(train).reset_index(drop=True)
train_dataset = SentimentAnalysisDataset(train, tokenizer)

train_dataloader = DataLoader(train_dataset, batch_size = 4, shuffle = True)


In [None]:
print(sum(train_dataset.attention_masks[0]))

tensor(149)


In [None]:
assert sum(train_dataset.attention_masks[0])==149
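The assertion works because the attention mask holds a 1 for every real token (including the special tokens the tokenizer adds) and a 0 for every padding position, so its sum is the unpadded length of the sequence. A minimal sketch of that relationship, using a hypothetical `pad_and_mask` helper and made-up token ids rather than the real tokenizer:

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad token_ids to max_len and build the matching mask."""
    kept = token_ids[:max_len]
    n_pad = max_len - len(kept)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

# Made-up token ids for a 5-token sentence, padded to length 8.
ids, mask = pad_and_mask([11, 42, 7, 99, 12], max_len=8)
print(ids)        # [11, 42, 7, 99, 12, 0, 0, 0]
print(mask)       # [1, 1, 1, 1, 1, 0, 0, 0]
print(sum(mask))  # 5
```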

## Define the model

We use DistilBERT as the base model. We still need a linear layer to map the last hidden state to the class dimension.

The model should have a base model, a linear layer, a dropout layer (0.5), and a softmax function. The forward function goes through all the layers one by one and returns the softmax result.

In [None]:
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
bert_model = AutoModel.from_pretrained("distilbert/distilbert-base-uncased")

class ClassificationModel(nn.Module):
    def __init__(self,base_model,num_classes):
        super().__init__()
        self.base_model = base_model
        self.dropout = nn.Dropout(0.5)
        self.classifier = nn.Linear(self.base_model.config.hidden_size, num_classes)
        self.softmax = nn.Softmax(dim=1)

    def forward(self,input_ids, attention_mask):
        if input_ids.dim() == 1:
          input_ids = input_ids.unsqueeze(0)
          attention_mask = attention_mask.unsqueeze(0)

        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state = outputs.last_hidden_state[:, 0, :]

        dropout = self.dropout(last_hidden_state)
        logits = self.classifier(dropout)

        predicts = self.softmax(logits)

        return predicts

In [None]:
model = ClassificationModel(base_model = bert_model, num_classes = 2 )

In [None]:
input_ids = train_dataset.input_ids[0]
attention_mask = train_dataset.attention_masks[0]
model.eval()
with torch.no_grad():
  predicts = model(input_ids,attention_mask)


In [None]:
predicts.tolist()

[[0.4507104158401489, 0.5492895841598511]]

In [None]:
assert predicts.tolist() == [[0.4507104158401489, 0.5492895841598511]]

## Model finetuning

In this section, you need to implement the training loop and the evaluation loop.

Initialize the optimizer. We use AdamW as the optimizer and cross-entropy as the loss. (3pts)

In [None]:
from torch.optim import AdamW  # PyTorch's AdamW; the one in transformers is deprecated
from torch.nn import CrossEntropyLoss

optimizer = AdamW(model.parameters(), lr=2e-5)
loss_fn = CrossEntropyLoss()
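One point worth keeping in mind when reading the loss values later: PyTorch's `CrossEntropyLoss` applies log-softmax to its input internally, while the model above already returns softmax probabilities. The sketch below, in plain Python with a hypothetical `cross_entropy` helper that mirrors the per-example definition, shows the consequence for two classes: the loss is confined to a narrow band around ln 2 ≈ 0.693 and cannot approach 0.

```python
import math

def cross_entropy(inputs, target):
    # -log(softmax(inputs)[target]), the per-example definition.
    log_norm = math.log(sum(math.exp(v) for v in inputs))
    return log_norm - inputs[target]

# Feeding probabilities (already softmaxed) as if they were logits:
best = cross_entropy([0.0, 1.0], target=1)   # fully confident and right
even = cross_entropy([0.5, 0.5], target=1)   # undecided
worst = cross_entropy([1.0, 0.0], target=1)  # fully confident and wrong

print(f'{best:.4f} {even:.4f} {worst:.4f}')  # 0.3133 0.6931 1.3133
```

A common alternative, not required by this assignment, is to return raw logits from `forward` and apply softmax only when probabilities are needed for prediction.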




In [None]:
dev_df = {'input':dev_x[:20], 'label':dev_y[:20]}
dev_df_dataset = SentimentAnalysisDataset(dev_df, tokenizer)

dev_dataloader = DataLoader(dev_df_dataset, batch_size = 4, shuffle = True)

We implement the loops in PyTorch. For each epoch, we run one training loop and one evaluation loop. At the end of training, we run the model on the test set using the best saved model. For one training step, we run a forward pass through the pretrained model on the given input, then compute the loss and backpropagate. See the instructions in the code block.


In [None]:
import copy
from tqdm import tqdm
epochs = 10 # don't change
device = "cpu"
model.to(device)  # keep the model and the batches on the same device
num_training_steps = epochs * len(train_dataloader)
best_loss = float('inf')
best_model = None
with tqdm(total=num_training_steps, desc='Finetuning:') as pbar:
    for epoch in range(epochs):
        # training loop
        model.train()
        train_loss = 0
        for batch in train_dataloader:
            '''
            Tips:
            1. Put the input and model on the same device
            2. Use the optimizer correctly
            3. Update the train loss. The printed train loss should be train_loss/len(train_dataloader)
            '''
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            optimizer.zero_grad()
            outputs = model(input_ids, attention_mask)

            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
            pbar.update(1)

        # average over batches, as described in tip 3
        train_loss = train_loss / len(train_dataloader)
        print(f'Epoch {epoch}: train loss is {train_loss}')

        # evaluation loop, run once per epoch
        model.eval()
        dev_loss = 0
        with torch.no_grad():
            for batch in dev_dataloader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)
                preds = model(input_ids, attention_mask)
                loss = loss_fn(preds, labels)
                dev_loss += loss.item()

        dev_loss = dev_loss / len(dev_dataloader)
        print(f'Epoch {epoch}: dev loss is {dev_loss}')
        if dev_loss < best_loss:
            # save the checkpoint; deepcopy because state_dict() returns
            # references to tensors that later training steps overwrite
            best_loss = dev_loss
            best_model = copy.deepcopy(model.state_dict())
            print(f'The best loss is {best_loss}. Saving checkpoint!')


Finetuning::  10%|█         | 5/50 [01:11<10:34, 14.10s/it]

Epoch 0: train loss is 3.528764545917511


Finetuning::  20%|██        | 10/50 [02:25<09:49, 14.74s/it]

Epoch 1: train loss is 3.068988800048828


Finetuning::  30%|███       | 15/50 [03:33<08:02, 13.77s/it]

Epoch 2: train loss is 2.912799656391144


Finetuning::  40%|████      | 20/50 [04:41<06:48, 13.61s/it]

Epoch 3: train loss is 3.033766061067581


Finetuning::  50%|█████     | 25/50 [05:49<05:43, 13.72s/it]

Epoch 4: train loss is 2.9004264771938324


Finetuning::  60%|██████    | 30/50 [06:56<04:32, 13.63s/it]

Epoch 5: train loss is 2.732151120901108


Finetuning::  70%|███████   | 35/50 [08:05<03:25, 13.68s/it]

Epoch 6: train loss is 2.578041046857834


Finetuning::  80%|████████  | 40/50 [09:14<02:16, 13.68s/it]

Epoch 7: train loss is 2.224846512079239


Finetuning::  90%|█████████ | 45/50 [10:22<01:07, 13.49s/it]

Epoch 8: train loss is 1.965950846672058


Finetuning:: 100%|██████████| 50/50 [11:30<00:00, 13.66s/it]

Epoch 9: train loss is 1.7839383482933044


Finetuning:: 100%|██████████| 50/50 [11:50<00:00, 14.21s/it]

Epoch 9: dev loss is 0.7333539843559265
The best loss is 0.7333539843559265. Saving checkpoint!



