Importing the necessary libraries

In [1]:
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# Model

This section loads the pre-trained DeBERTa-v3 model for sequence classification and configures it with LoRA (Low-Rank Adaptation) through the PEFT (Parameter-Efficient Fine-Tuning) library. Most of the model is frozen, which reduces the number of trainable parameters and makes testing and experimentation computationally cheaper.

1. **Loading the Pre-trained DeBERTa-v3 Model**:
   - The `AutoModelForSequenceClassification` class from Hugging Face’s `transformers` library is used to load the `microsoft/deberta-v3-base` model.
   - The `num_labels=2` argument configures the model for a binary classification task. In this case, we assume the task is QQP (Quora Question Pairs), which is a common binary classification task.

2. **Freezing the Base Model's Parameters**:
   - All parameters of the base model (`model.base_model.parameters()`) are frozen. For this model class, `base_model` is the DeBERTa encoder, so the pooler and classifier head are not touched by this loop. Freezing the encoder means that only a small portion of the model is trained, which reduces compute cost; adapting only a few key layers is common practice in PEFT.

3. **LoRA Configuration**:
   - **Why Use LoRA?** LoRA is a method to reduce the number of parameters that need to be trained by injecting low-rank matrices into key attention layers of the model.
   - We define the `LoraConfig` as follows:
     - `r=4`: Sets the rank of the LoRA update matrices to 4. Each adapted projection gains two small matrices (4×768 and 768×4), so a low rank keeps the number of added parameters small, which suits lightweight testing.
     - `lora_alpha=16`: The LoRA scaling factor. The low-rank update is multiplied by `lora_alpha / r` (here 16 / 4 = 4), so this setting controls how strongly the adapter influences the output. It does not change the parameter count.
     - `target_modules=['query_proj', 'key_proj', 'value_proj']`: Selects which modules receive LoRA adapters. These names correspond to the query, key, and value projections in DeBERTa's attention mechanism. Note that this line is commented out in the code below, so PEFT's default targets apply (see the next item).
   
4. **LoRA Defaults:**
   - **When `target_modules` Is Not Specified:**
     - If `target_modules` is **not** provided, PEFT falls back to its built-in default for the model architecture. For DeBERTa-v2/v3 this is `query_proj` and `value_proj`, **excluding** `key_proj`.
     - **Why Only `query_proj` and `value_proj` by Default?** This follows the original LoRA paper, which reported that adapting the query and value projections together gave strong results for a fixed parameter budget. Intuitively, `query_proj` shapes how each token attends to the rest of the sequence, and `value_proj` shapes what information is retrieved.
     - **Exclusion of `key_proj`:** `key_proj` also contributes to the attention scores, but leaving it out cuts the number of LoRA parameters by a third relative to adapting all three projections, which is why it is omitted unless requested explicitly.

5. **Including `key_proj` in `target_modules`:**
   - Uncommenting the `target_modules` line in the configuration adds `key_proj` to the adapted modules, which makes it possible to test its effect on model performance. Doing so adds more trainable parameters, which can sometimes improve performance on complex tasks at the cost of additional memory and compute. The outputs shown in this notebook were produced with the line commented out.

6. **LoRA’s Focus on Attention Layers**:
   - LoRA can be applied to any linear layer, but it is most commonly applied to the attention projections, which are key to the transformer's ability to model contextual relationships between tokens in a sequence.
   - **Why Attention Layers?** These layers (particularly `query_proj` and `value_proj`) govern how the model focuses on different parts of the input sequence. Adapting them lets LoRA adjust the model to a new task without retraining the entire network, using only a small number of trainable parameters.

7. **Applying LoRA to the Model**:
   - The `get_peft_model` function wraps the base model and applies LoRA to the targeted attention modules. With `target_modules` commented out, the LoRA layers are injected into `query_proj` and `value_proj`; adding `key_proj` requires uncommenting that line. The rest of the encoder remains frozen.
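
To make the roles of `r` and `lora_alpha` concrete, here is a minimal sketch of a LoRA-adapted linear map in plain Python. It is not PEFT's implementation: the function names (`matvec`, `lora_forward`) and the toy 3×3 sizes are hypothetical and chosen only for illustration. It assumes the standard LoRA formulation, in which the frozen weight `W` is combined with a low-rank update `B A` scaled by `lora_alpha / r`, and `B` starts at zero.

```python
# Illustrative sketch only; these helpers are hypothetical, not PEFT APIs.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Return W x + (lora_alpha / r) * B (A x).

    W: frozen weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    scale = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes: d_in = d_out = 3, r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.5, 0.5, 0.5]]
B_zero = [[0.0], [0.0], [0.0]]     # B is zero at the start, so the update vanishes
B_trained = [[1.0], [0.0], [0.0]]  # a made-up "trained" B
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_zero, x, r=1, lora_alpha=4))     # [1.0, 2.0, 3.0]
print(lora_forward(W, A, B_trained, x, r=1, lora_alpha=4))  # [13.0, 2.0, 3.0]
```

With `B` at zero the adapted layer reproduces the frozen layer exactly, which is why adding LoRA does not change the model's predictions before training starts.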

In [2]:
def create_peft_model():
    """
    Create a DeBERTa-v3 model with LoRA configuration using PEFT.
    """
    # Load the base DeBERTa-v3 model
    model_name = "microsoft/deberta-v3-base"
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=2,  # QQP is a binary classification task
    )

    # Freeze the base model's parameters
    for param in model.base_model.parameters():
        param.requires_grad = False

    # Configure LoRA parameters
    lora_config = LoraConfig(
        r=4,  # low rank keeps the number of added parameters small
        lora_alpha=16,  # scaling factor; the update is scaled by lora_alpha / r = 4
        # Uncomment to also adapt key_proj (the default is query_proj and value_proj):
        # target_modules=['query_proj', 'key_proj', 'value_proj'],
        lora_dropout=0.1,
        bias="none",  # do not train any bias parameters
        task_type="SEQ_CLS",  # Sequence Classification
    )

    # Apply LoRA to the model
    model = get_peft_model(model, lora_config)
    return model

In [3]:
model = create_peft_model()

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
print(model)

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): DebertaV2ForSequenceClassification(
      (deberta): DebertaV2Model(
        (embeddings): DebertaV2Embeddings(
          (word_embeddings): Embedding(128100, 768, padding_idx=0)
          (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True)
          (dropout): StableDropout()
        )
        (encoder): DebertaV2Encoder(
          (layer): ModuleList(
            (0-11): 12 x DebertaV2Layer(
              (attention): DebertaV2Attention(
                (self): DisentangledSelfAttention(
                  (query_proj): lora.Linear(
                    (base_layer): Linear(in_features=768, out_features=768, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=768, out_features=4, bias=False)
                    )


Checking the number of trainable parameters

In [5]:
model.print_trainable_parameters()

trainable params: 148,994 || all params: 184,572,676 || trainable%: 0.0807
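
The reported figure can be reproduced with a short calculation. This is a sketch under stated assumptions: the hidden size (768) and layer count (12) are read from the printed model, each adapted projection is taken to add one `r × 768` and one `768 × r` matrix, and the 2-label classifier head (weight plus bias) is assumed to be counted as trainable already. The helper name `lora_param_count` is hypothetical.

```python
def lora_param_count(hidden, r, num_layers, num_targets):
    """LoRA parameters added across all layers for square hidden x hidden projections."""
    per_module = r * hidden + hidden * r  # matrix A plus matrix B
    return num_layers * num_targets * per_module


hidden, layers, r, num_labels = 768, 12, 4, 2
classifier = hidden * num_labels + num_labels  # weight + bias = 1,538

qv = lora_param_count(hidden, r, layers, num_targets=2)   # query_proj + value_proj
qkv = lora_param_count(hidden, r, layers, num_targets=3)  # also key_proj

print(qv, qv + classifier)    # 147456 148994
print(qkv, qkv + classifier)  # 221184 222722
```

The first total matches the 148,994 printed above, which is consistent with LoRA being applied to `query_proj` and `value_proj` only. Under the same assumptions, adding `key_proj` would raise the count to 222,722.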


As the warning printed when the model was loaded shows, the following parameters were newly initialized rather than loaded from the checkpoint, so they need to be updated during training:

- `classifier.bias`
- `classifier.weight`
- `pooler.dense.bias`
- `pooler.dense.weight`

To ensure that these layers are trainable, we explicitly set their `requires_grad` attribute to `True`:

In [6]:
for param in model.pooler.parameters():
    param.requires_grad = True

for param in model.classifier.parameters():
    param.requires_grad = True

Checking the number of trainable parameters again

In [7]:
model.print_trainable_parameters()

trainable params: 741,124 || all params: 184,572,676 || trainable%: 0.4015
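
A quick arithmetic check of the new figure, assuming the pooler is a single 768→768 dense layer with bias (as the parameter names `pooler.dense.weight` and `pooler.dense.bias` suggest):

```python
hidden, num_labels = 768, 2
before = 148_994                                    # trainable count before unfreezing

pooler_dense = hidden * hidden + hidden             # weight + bias = 590,592
classifier_head = hidden * num_labels + num_labels  # weight + bias = 1,538

print(before + pooler_dense)                    # 739586
print(before + pooler_dense + classifier_head)  # 741124
```

The pooler alone accounts for 739,586. The reported 741,124 is higher by exactly 1,538, the size of one classifier head. One possible explanation, which the notebook output does not confirm, is that PEFT keeps both the original classifier and a trainable copy of it, and the loop over `model.classifier.parameters()` unfreezes both.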


# Dataset

In [8]:
from datasets import load_dataset

In [9]:
def load_and_prepare_dataset():
    """
    Load and preprocess the QQP dataset.
    """
    # Load the QQP dataset from the GLUE benchmark
    ds = load_dataset("glue", "qqp")
    # Filter out examples with missing labels
    ds = ds.filter(lambda example: example["label"] != -1)
    return ds

In [10]:
ds = load_and_prepare_dataset()

In [11]:
ds

DatasetDict({
    train: Dataset({
        features: ['question1', 'question2', 'label', 'idx'],
        num_rows: 363846
    })
    validation: Dataset({
        features: ['question1', 'question2', 'label', 'idx'],
        num_rows: 40430
    })
    test: Dataset({
        features: ['question1', 'question2', 'label', 'idx'],
        num_rows: 0
    })
})

Example of different questions

In [12]:
print(ds["train"][0]["question1"])
print("\n")
print(ds["train"][0]["question2"])

How is the life of a math student? Could you describe your own experiences?


Which level of prepration is enough for the exam jlpt5?


Example of similar questions

In [13]:
print(ds["train"][1]["question1"])
print("\n")
print(ds["train"][1]["question2"])

How do I control my horny emotions?


How do you control your horniness?


## Preprocess data

In [14]:
import torch


def get_device():
    """
    Determine the best available device (CUDA, MPS, or CPU).
    """
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"Using CUDA (GPU): {torch.cuda.get_device_name(0)}")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("Using MPS (Apple Silicon GPU)")
    else:
        device = torch.device("cpu")
        print("Using CPU")
    return device

In [15]:
def preprocess_data(tokenizer, ds, device):
    """
    Tokenize the dataset using the provided tokenizer and move the data to the specified device.
    This function skips empty splits and applies tokenization
    and device movement to non-empty splits.
    """

    def preprocess_function(examples):
        # Tokenize the input question pairs
        tokenized = tokenizer(
            examples["question1"],
            examples["question2"],
            truncation=True,
            padding=True,
            max_length=128,
        )
        # Include the labels in the output dictionary
        tokenized["labels"] = examples["label"]
        return tokenized

    # Apply the preprocessing function to the dataset
    # (this adds the tokenized columns like input_ids)
    tokenized_ds = {}
    for split in ds:
        if len(ds[split]) > 0:
            # Tokenize non-empty splits
            tokenized_ds[split] = ds[split].map(preprocess_function, batched=True)
        else:
            # Skip tokenization for empty splits (e.g., test set)
            tokenized_ds[split] = ds[split]

    # Expose only the model inputs as torch tensors on the target device.
    # Passing `columns` here matters: calling with_format() without it
    # would reset the format so that every column is returned.
    model_columns = ["input_ids", "attention_mask", "labels"]
    for split, dataset in tokenized_ds.items():
        if len(dataset) > 0:  # leave empty splits (e.g., test) untouched
            tokenized_ds[split] = dataset.with_format(
                "torch", columns=model_columns, device=device
            )

    return tokenized_ds
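
To illustrate what `truncation=True`, `padding=True`, and `max_length=128` do together, here is a sketch that works on token counts only. It assumes that `padding=True` pads to the longest sequence in each batch handed to the tokenizer and that truncation caps every sequence at `max_length`; the helper `padded_length` is hypothetical and not part of `transformers`.

```python
def padded_length(token_lengths, max_length=128):
    """Length that every sequence in one tokenizer batch ends up with."""
    truncated = [min(n, max_length) for n in token_lengths]  # truncation=True
    return max(truncated)                                     # padding=True (longest in batch)


print(padded_length([12, 31, 47]))   # 47
print(padded_length([12, 200, 47]))  # 128
```

Because `map(..., batched=True)` calls the tokenizer once per batch of rows (1,000 by default), different batches can end up with different padded lengths, each capped at 128.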

Tokenizer

In [16]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base", use_fast=False)



In [17]:
device = get_device()

Using MPS (Apple Silicon GPU)


In [18]:
tokenized_ds = preprocess_data(tokenizer, ds, device)

In [19]:
tokenized_ds

{'train': Dataset({
     features: ['question1', 'question2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
     num_rows: 363846
 }),
 'validation': Dataset({
     features: ['question1', 'question2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
     num_rows: 40430
 }),
 'test': Dataset({
     features: ['question1', 'question2', 'label', 'idx'],
     num_rows: 0
 })}

### **Cost Analysis for Training on `ml.p3.2xlarge` and `ml.g4dn.xlarge` Instances (Including Spot Instances)**

This analysis estimates the cost of training a model with **741,124 trainable parameters** and **363,846 training rows** for **5 epochs** using AWS SageMaker’s **`ml.p3.2xlarge`** and **`ml.g4dn.xlarge`** instance types. Both **on-demand** and **spot instance** pricing are considered, providing a detailed breakdown of time and cost.

---

### **1. Key Information for the Estimate**

- **Trainable Parameters**: 741,124
- **Training Rows**: 363,846
- **Number of Epochs**: 5
- **Batch Size**: 32
- **Steps per Epoch**: 
  
  $$\text{Steps per Epoch} = \frac{363,846}{32} \approx 11,370 \, \text{steps/epoch}$$
  
- **Total Steps for 5 Epochs**:
  
  $$\text{Total Steps} = 11,370 \times 5 = 56,850 \, \text{steps}$$
  

### **2. Estimating Training Time per Step**

- **`ml.p3.2xlarge` (NVIDIA V100 GPU)**:
  - Estimated time per step: **0.20 seconds**.
- **`ml.g4dn.xlarge` (NVIDIA T4 GPU)**:
  - Estimated time per step: **0.35 seconds**.

### **3. Total Training Time**

#### **`ml.p3.2xlarge`**

$$\text{Total Time} = 56,850 \times 0.20 \, \text{seconds} = 11,370 \, \text{seconds} = \frac{11,370}{3600} \approx 3.16 \, \text{hours}$$

#### **`ml.g4dn.xlarge`**

$$\text{Total Time} = 56,850 \times 0.35 \, \text{seconds} = 19,897.5 \, \text{seconds} = \frac{19,897.5}{3600} \approx 5.53 \, \text{hours}$$
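
The step and time estimates above can be reproduced with a few lines of Python. This is a sketch, with the per-step times taken as the assumed values from Section 2 rather than measurements; the function name `estimate_hours` is hypothetical.

```python
def estimate_hours(rows, batch_size, epochs, seconds_per_step):
    """Return (steps per epoch, total steps, total hours) for a training run."""
    steps_per_epoch = rows // batch_size  # ignores the final partial batch, as the text does
    total_steps = steps_per_epoch * epochs
    hours = total_steps * seconds_per_step / 3600
    return steps_per_epoch, total_steps, hours


for name, seconds_per_step in [("ml.p3.2xlarge", 0.20), ("ml.g4dn.xlarge", 0.35)]:
    steps, total, hours = estimate_hours(363_846, 32, 5, seconds_per_step)
    print(name, steps, total, round(hours, 2))
# ml.p3.2xlarge 11370 56850 3.16
# ml.g4dn.xlarge 11370 56850 5.53
```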

---

### **4. AWS Pricing (On-Demand and Spot Instances)**

- **`ml.p3.2xlarge`**:
  - **On-Demand**: \$3.825 per hour.
  - **Spot Instance**: Typically **70% lower**, approximately \$1.15 per hour.

- **`ml.g4dn.xlarge`**:
  - **On-Demand**: \$0.752 per hour.
  - **Spot Instance**: Typically **70% lower**, approximately \$0.226 per hour.

---

### **5. Cost Calculation**

#### **On-Demand Pricing**

##### **`ml.p3.2xlarge` (On-Demand)**:
- **Training Time**: 3.16 hours.
- **Cost per Hour**: \$3.825.
- **Total Cost** (using the unrounded training time of 3.158 hours):

$$\text{Cost} = 3.158 \times 3.825 \approx \$12.08$$

##### **`ml.g4dn.xlarge` (On-Demand)**:
- **Training Time**: 5.53 hours.
- **Cost per Hour**: \$0.752.
- **Total Cost**: $$\text{Cost} = 5.53 \times 0.752 = \$4.16$$

#### **Spot Instance Pricing**

##### **`ml.p3.2xlarge` (Spot Instance)**:
- **Training Time**: 3.16 hours.
- **Cost per Hour**: \$1.15 (70% discount).
- **Total Cost**: $$\text{Cost} = 3.16 \times 1.15 = \$3.63$$

##### **`ml.g4dn.xlarge` (Spot Instance)**:
- **Training Time**: 5.53 hours.
- **Cost per Hour**: \$0.226 (70% discount).
- **Total Cost**: $$\text{Cost} = 5.53 \times 0.226 = \$1.25$$
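
The four cost figures can be checked with a short script. This sketch uses the rounded training times and hourly rates exactly as listed above; the `jobs` dictionary and the `cost` helper are hypothetical names introduced for illustration.

```python
def cost(hours, rate_per_hour):
    """Total job cost in dollars, rounded to cents."""
    return round(hours * rate_per_hour, 2)


jobs = {
    "ml.p3.2xlarge": {"hours": 3.16, "on_demand": 3.825, "spot": 1.15},
    "ml.g4dn.xlarge": {"hours": 5.53, "on_demand": 0.752, "spot": 0.226},
}

for name, job in jobs.items():
    print(name, cost(job["hours"], job["on_demand"]), cost(job["hours"], job["spot"]))
# ml.p3.2xlarge 12.09 3.63
# ml.g4dn.xlarge 4.16 1.25
```

Rounding at intermediate steps can shift a result by a cent: the `ml.p3.2xlarge` on-demand figure comes out as \$12.09 from the rounded 3.16 hours, versus \$12.08 from the unrounded 3.158 hours.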

---

### **6. Summary of Results**

| **Instance Type**   | **Time per Epoch** | **Total Time (5 epochs)** | **Cost per Hour (On-Demand)** | **Total Cost (On-Demand)** | **Cost per Hour (Spot)** | **Total Cost (Spot)** |
|---------------------|--------------------|---------------------------|-------------------------------|----------------------------|--------------------------|------------------------|
| `ml.p3.2xlarge`     | 0.63 hours         | 3.16 hours                | \$3.825                         | \$12.08                     | \$1.15                    | \$3.63                  |
| `ml.g4dn.xlarge`    | 1.11 hours         | 5.53 hours                | \$0.752                         | \$4.16                      | \$0.226                   | \$1.25                  |

---

### **7. Cost and Performance Comparison**

#### **1. `ml.p3.2xlarge` (NVIDIA V100 GPU)**:
- **On-Demand Cost**: \$12.08
- **Spot Instance Cost**: \$3.63
- **Training Time**: 3.16 hours
- **Best For**: When **speed** is the priority, or when there is a need to train models frequently and quickly. However, the on-demand cost is significantly higher compared to spot instances.

#### **2. `ml.g4dn.xlarge` (NVIDIA T4 GPU)**:
- **On-Demand Cost**: \$4.16
- **Spot Instance Cost**: \$1.25
- **Training Time**: 5.53 hours
- **Best For**: When **cost efficiency** is the main priority, and the additional training time (~2.4 hours longer than the `p3.2xlarge`) is acceptable.

---

### **8. Conclusion**

- **Cost Efficiency**: If the additional training time (~2.4 hours) is acceptable, the **`ml.g4dn.xlarge`** on **spot instances** is the most cost-effective option, at only **\$1.25** for the entire 5-epoch training job.
- **Speed Efficiency**: For faster training, the **`ml.p3.2xlarge`** on **spot instances** completes the job in **3.16 hours** at a cost of **\$3.63**, providing a good balance between speed and cost.