## CA 2, LLMs Spring 2024

- **Name: Sina Tabassi**
- **Student ID: 810199554**

---
#### Your submission should be named using the following format: `CA2_LASTNAME_STUDENTID_soft_prompt.ipynb`.

- There is no penalty for using AI assistance on this homework as long as you fully disclose it in the final cell of this notebook (this includes storing any prompts that you feed to large language models). That said, anyone caught using AI assistance without proper disclosure will receive a zero on the assignment (we have several automatic tools to detect such cases). We're literally allowing you to use it with no limitations, so there is no reason to lie!

---

##### *Academic honesty*

- We will audit the Colab notebooks from a set number of students, chosen at random. The audits will check that the code you wrote actually generates the answers in your notebook. If you turn in correct answers on your notebook without code that actually generates those answers, we will consider this a serious case of cheating.

- We will also run automatic checks of Colab notebooks for plagiarism. Copying code from others is also considered a serious case of cheating.

---

If you have any further questions or concerns, contact the TA via email:
mohammad136631@gmail.com

---

# What are Soft prompts?
Soft prompts are learnable tensors concatenated with the input embeddings that can be optimized to a dataset; the downside is that they aren’t human readable because you aren’t matching these “virtual tokens” to the embeddings of a real word.
<br>
<div>
<img src="https://www.researchgate.net/publication/366062946/figure/fig1/AS:11431281105340756@1670383256990/The-comparison-between-the-previous-T5-prompt-tuning-method-part-a-and-the-introduced.jpg"/>
</div>

Read More:
<br>[Youtube : PEFT and Soft Prompt](https://www.youtube.com/watch?v=8uy_WII76L0)
<br>[Paper: The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)
https://arxiv.org/pdf/2101.00190.pdf
<br>[Paper: Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/pdf/2101.00190.pdf)

# Part 1 (20 Points)
Before diving into the practical applications, let's first ensure your foundational knowledge is solid. Please answer the following questions.

**A) Compare and contrast model tuning and prompt tuning in terms of their effectiveness for specific downstream tasks. (5 Points)**

*Answer:*

Let us compare the efficacy of each approach:
- Model tuning tends to be effective when the downstream task is significantly different from the tasks the model was originally trained on. It allows the model to adapt its representations to better suit the task-specific data. For example, if the downstream task involves sentiment analysis on financial documents, fine-tuning a pre-trained language model on a dataset of financial news articles can lead to improved performance.

- Prompt tuning can be effective when the downstream task shares similarities with the tasks the model was originally trained on. By providing task-specific prompts, the model can leverage its pre-existing knowledge and fine-tune its responses without requiring extensive retraining. For instance, in question-answering tasks, providing well-crafted prompts tailored to the context of the questions can lead to improved accuracy without the need for extensive fine-tuning.


Additionally, according to `The Power of Scale for Parameters-Efficient Prompt Tuning` paper, in big models with over 10^10 parameters, we pretty much get the same results whether we use model-tuning or Prompt tuning. But in smaller models, we can't match the results of these two approaches, and Prompt tuning tends to have lower SuperGLUE scores compared to model-tuning methods. However, as we crank up the model parameters, the difference in their SuperGLUE scores starts to shrink.


**B) Explore the challenges associated with interpreting soft prompts in the continuous embedding space and propose potential solutions. (5 Points)**

*Answer:*

Challenges:

- Soft prompts can be ambiguous and may not provide clear guidance to the model, leading to inconsistent or suboptimal responses.

- The continuous embedding space is high-dimensional and complex, making it difficult to interpret the relationships between soft prompts and model outputs.

- Soft prompts may suffer from semantic drift, where the model's interpretation of the prompt diverges from the intended meaning over time or across different contexts.

- Soft prompts may not generalize well across different domains or datasets, requiring adaptation to new contexts.

Potential Soulutions:

- Incorporate additional context or constraints to disambiguate soft prompts. This could involve providing more specific instructions or examples alongside the soft prompt to guide the model towards the desired interpretation.

- Employ visualization techniques to explore and understand the embedding space. Dimensionality reduction methods such as t-SNE or UMAP can help visualize the relationships between soft prompts and embeddings, providing insights into the model's behavior.

- Continuously monitor and update soft prompts to ensure they remain relevant and effective. Fine-tuning the model on new data or periodically retraining with updated prompts can help mitigate semantic drift and maintain performance.

- Utilize techniques for domain adaptation, such as adversarial training, domain-specific fine-tuning, or data augmentation with domain-relevant examples. This can help the model adapt to new domains while maintaining the effectiveness of soft prompts.

**C) What is the effect of initializing prompts randomly versus initializing them from the vocabulary, and how does this impact the performance of prompt tuning? (5 Points)**

*Answer:*

First, let's compare the impact of each approach:
- When prompts are initialized randomly, they lack any inherent semantic meaning or relevance to the downstream task.
-Initializing prompts from the vocabulary involves selecting words or phrases from the model's vocabulary that are likely to be relevant to the downstream task.

Now, let's examine their influence on the performance of prompt tuning:
- Randomly initialized prompts may lead to suboptimal performance initially since they provide no useful guidance to the model. The model has to learn to associate the randomly initialized prompt with the desired task, which can be challenging and may require more training data and time.
-Initializing prompts from the vocabulary provides the model with a starting point that is more likely to be semantically meaningful and relevant to the task. This can lead to faster convergence and better performance compared to randomly initialized prompts.


Furthermore, as detailed in `The Power of Scale for Parameters-Efficient Prompt Tuning` the utilization of random uniform initialization falls behind compared to initializations employing sampled vocabulary or class label embeddings. However, as the number of parameters escalates to 10^10, this discrepancy diminishes.


**D) How is the optimization process in the prefix tuning(<br>[Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/pdf/2101.00190.pdf)) and Why did they use this technique? (5 Points)**

*Answer:*


In prefix-tuning, we maintain the model's original parameters unchanged while introducing additional vectors to the model, serving as prefixes for the input. These vectors, added individually for each transformer block and known as virtual tokens, are then updated or fine-tuned for a specific task. This methodology, which impacts only a small fraction of the model's parameters, achieves performance on par with full data fine-tuning, surpasses fine-tuning in scenarios with limited data, and exhibits superior generalization to examples featuring unseen topics during training.

There are several advantages to employing this technique:
- Its modular nature allows for seamless adaptation to entirely different tasks simply by replacing prefixes. In contrast, fine-tuning necessitates replacing the model with another instance fine-tuned for the new task, preventing the sharing of the same model across different tasks.
- It is lightweight and demands lower resources compared to full fine-tuning since it entails updating far fewer parameters than fine-tuning.
- The authors applied prefix-tuning on GPT-2 for table-to-text generation and on BART for summarization, demonstrating its effectiveness across diverse natural language generation tasks.
- It enables the formulation of more expressive prompts owing to continuous optimization.

# Part 2 (35 points)

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoModel
from transformers import AdamW
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

## Model Selection & Constants
We will use `bert-fa-base-uncased` as our base model from Hugging Face ([HF_Link](https://huggingface.co/HooshvareLab/bert-fa-base-uncased)). For our tuning, we intend to utilize 20 soft prompt tokens.

In [2]:
class CONFIG:
    seed = 42
    max_len = 128
    train_batch = 16
    valid_batch = 32
    epochs = 10
    n_tokens=20
    learning_rate = 0.01
    model_name = 'HooshvareLab/bert-fa-base-uncased'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

## Dataset

The dataset contains around 7000 Persian sentences and their corresponding polarity, and have been manually classified into 5 categories (i.e. Angry).

### Load Dataset

In [3]:
!pip install gdown -q

!gdown 1BT9G7y5YyyN9nlRzf0iQhtAnIZvftIs5 -O softprompt_dataset.csv

Downloading...
From: https://drive.google.com/uc?id=1BT9G7y5YyyN9nlRzf0iQhtAnIZvftIs5
To: /content/softprompt_dataset.csv
  0% 0.00/1.29M [00:00<?, ?B/s]100% 1.29M/1.29M [00:00<00:00, 164MB/s]


In [4]:
import pandas as pd
file_path = "softprompt_dataset.csv"
df = pd.read_csv(file_path)

### Pre-Processing

In [5]:
%pip install -U clean-text[gpl]
%pip install hazm



In [6]:
import re
from cleantext import clean
from hazm import *

In [7]:
import re
def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

def cleaning(text):
    text = text.strip()

    # regular cleaning
    text = clean(text,
        fix_unicode=True,
        to_ascii=False,
        lower=True,
        no_line_breaks=True,
        no_urls=True,
        no_emails=True,
        no_phone_numbers=True,
        no_numbers=False,
        no_digits=False,
        no_currency_symbols=True,
        no_punct=False,
        replace_with_url="",
        replace_with_email="",
        replace_with_phone_number="",
        replace_with_number="",
        replace_with_digit="0",
        replace_with_currency_symbol="",
    )

    text = cleanhtml(text)

    # normalizing
    #normalizer = hazm.Normalizer()
    #text = normalizer.normalize(text)

    wierd_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u'\U00010000-\U0010ffff'
        u"\u200d"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\u3030"
        u"\ufe0f"
        u"\u2069"
        u"\u2066"
        u"\u2068"
        u"\u2067"
        "]+", flags=re.UNICODE)

    text = wierd_pattern.sub(r'', text)

    # removing extra spaces, hashtags
    text = re.sub("#", "", text)
    text = re.sub("\s+", " ", text)

    return text

In [8]:
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor

tqdm.pandas()

def parallel_apply_with_progress(df, func, n_workers=4):
    with ThreadPoolExecutor(max_workers=n_workers) as executor, tqdm(total=len(df)) as pbar:
        def update(*args):
            pbar.update()

        results = []
        for result in executor.map(func, df['text']):
            results.append(result)
            update()

        df['text'] = pd.Series(results)

    return df

In [9]:
df = parallel_apply_with_progress(df, cleaning)

100%|██████████| 7023/7023 [00:04<00:00, 1533.52it/s]


In [10]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df.index.values,
                                                  df.label.values,
                                                  test_size=0.15,
                                                  random_state=42,
                                                  stratify=df.label.values)

train_df = df.loc[X_train]
validation_df = df.loc[X_val]

In [11]:
possible_labels = df.label.unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
label_dict

{0: 0, 1: 1, 2: 2, -1: 3, -2: 4}

In [12]:
train_df['label'] = train_df.label.replace(label_dict)
validation_df['label'] = validation_df.label.replace(label_dict)

### Create Dataset Class (5 Points)
In this step we will getting our dataset ready for training.

In this part we will define BERT-based dataset class for text classification, with configuration parameters. It preprocesses text data and tokenizes it using the BERT tokenizer.


Complete the preprocessing step in the __getitem__ method by adding padding tokens to 'input_ids' and 'attention_mask',
The count of this pad tokens is the same as `n_tokens`.

In [13]:
class BERTDataset(Dataset):
    def __init__(self,df):
        self.text = df['text'].values
        self.labels = df['label'].values
        self.all_labels = [0, 1, 2, 3, 4]
        self.max_len = CONFIG.max_len
        self.tokenizer = CONFIG.tokenizer
        self.n_tokens=CONFIG.n_tokens

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = self.text[index]
        text = ' '.join(text.split())
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            truncation=True,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            return_token_type_ids=True
        )

        ######### Your code begins #########
        inputs['input_ids'] = inputs['input_ids'] + [self.tokenizer.pad_token_id] * self.n_tokens
        inputs['attention_mask'] = inputs['attention_mask'] + [0] * self.n_tokens
        ######### Your code ends ###########

        labels = self.labels[index]
        label_dict = {label: (label == labels) for label in self.all_labels}
        labels_tensor = torch.tensor([float(label_dict[label]) for label in self.all_labels])
        return {
            'ids': torch.tensor(inputs['input_ids'], dtype=torch.long),
            'mask': torch.tensor(inputs['attention_mask'], dtype=torch.long),
            'label': labels_tensor
        }


In [14]:
train_dataset = BERTDataset(train_df)
validation_dataset = BERTDataset(validation_df)

## Define Prompt Embedding Layer (15 Points)
In this part we will define our prompt layer in `PROMPTEmbedding` module.


<font color='#73FF73'><b>You have to complete</b></font> `initialize_embedding` and  `forward` <font color='#73FF73'><b>functions.</b></font>

In `initialize_embedding` function initialize the learned embeddings based on whether they should be initialized from the vocabulary or randomly within the specified range.

In `forward` function, modify the input_embedding to extract the relevant part based on n_tokens.

Repeat the learned_embedding to match the size of input_embedding.

Concatenate the learned_embedding and input_embedding properly.


In [15]:
class PROMPTEmbedding(nn.Module):
    def __init__(self,
                emb_layer: nn.Embedding,
                n_tokens: int = 20,
                random_range: float = 0.5,
                initialize_from_vocab: bool = True):

      super(PROMPTEmbedding, self).__init__()
      self.emb_layer = emb_layer
      self.n_tokens = n_tokens
      self.learned_embedding = nn.parameter.Parameter(self.initialize_embedding(emb_layer,
                                                                               n_tokens,
                                                                               random_range,
                                                                               initialize_from_vocab))

    def initialize_embedding(self,
                             emb_layer: nn.Embedding,
                             n_tokens: int = 20,
                             random_range: float = 0.5,
                             initialize_from_vocab: bool = True):

      if initialize_from_vocab:
        ######### Your code begins #########
        embedding = emb_layer.weight[:n_tokens].clone().detach()

      else:
        embedding = torch.rand((n_tokens, emb_layer.weight.size(1))) * random_range
        ######### Your code ends ###########
      return embedding


    def forward(self, tokens):
      ######### Your code begins #########
      input_embedding = self.emb_layer(tokens[:, self.n_tokens:])
      learned_embedding = self.learned_embedding.repeat(tokens.size(0), 1, 1)
      joined_embedding = torch.cat((learned_embedding, input_embedding), dim=1)
      ######### Your code ends ###########
      return joined_embedding

## Replace model's embedding layer with our layer (5 Points)

In [16]:
# Define your BERT model
model = AutoModelForSequenceClassification.from_pretrained(CONFIG.model_name, num_labels=5, output_attentions = False,
                                                           output_hidden_states = False).to(CONFIG.device)
######### Your code begins #########
embeddings = PROMPTEmbedding(model.get_input_embeddings(),n_tokens=CONFIG.n_tokens).to(CONFIG.device)
model.set_input_embeddings(embeddings)
######### Your code ends ###########


pytorch_model.bin:   0%|          | 0.00/654M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at HooshvareLab/bert-fa-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Freezing Model Parameters (5 points)
In this part we will freeze entire model except `learned_embedding`

In [17]:
######### Your code begins #########
for name, param in model.named_parameters():
    param.requires_grad = False
    if "learned_embedding" in name:
      param.requires_grad = True
######### Your code ends ###########

## Optimizer


In [18]:
from transformers import AdamW

optimizer = AdamW(model.parameters(), lr=CONFIG.learning_rate)

## Training & Evaluation


### Define dataloaders

In [19]:
train_loader = DataLoader(train_dataset, batch_size=CONFIG.train_batch,
                              num_workers=2, shuffle=True, pin_memory=True)

validation_loader = DataLoader(validation_dataset, batch_size=CONFIG.valid_batch,
                              num_workers=2, shuffle=True, pin_memory=True)

### Define evaluation function

In [20]:
from sklearn.metrics import f1_score

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = np.argmax(labels, axis=1).flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

In [21]:
def evaluate(val_dataloader):

    model.eval()

    loss_val_total = 0
    predictions, true_vals = [], []

    for batch in val_dataloader:


        inputs = {'input_ids':      batch['ids'].to(CONFIG.device),
                  'attention_mask': batch['mask'].to(CONFIG.device),
                  'labels':         batch['label'].to(CONFIG.device),
                 }

        with torch.no_grad():
            outputs = model(**inputs)

        loss = outputs["loss"]
        logits = outputs["logits"]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)

    loss_val_avg = loss_val_total/len(val_dataloader)

    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)

    return loss_val_avg, predictions, true_vals

### Define trainng loop


In [22]:
def train(model, optimizer, train_dataloader, val_dataloader):

    epochs = CONFIG.epochs

    for epoch in tqdm(range(1, epochs+1)):

      model.train()

      loss_train_total = 0

      progress_bar = tqdm(train_loader, desc='Epoch {:1d}'.format(epoch), leave=False, disable=True)

      for batch in progress_bar:

        optimizer.zero_grad()

        inputs = {'input_ids':      batch['ids'].to(CONFIG.device),
                  'attention_mask': batch['mask'].to(CONFIG.device),
                  'labels':         batch['label'].to(CONFIG.device),
                }

        output = model(**inputs)

        loss = output["loss"]
        loss_train_total += loss.item()

        loss.backward()
        optimizer.step()

        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item()/len(batch))})


      tqdm.write(f'\nEpoch {epoch}')
      loss_train_avg = loss_train_total/len(train_loader)
      tqdm.write(f'Training loss: {loss_train_avg}')


      val_loss, predictions, true_vals = evaluate(val_dataloader)
      val_f1 = f1_score_func(predictions, true_vals)
      tqdm.write(f'Validation loss: {val_loss}')
      tqdm.write(f'F1 Score (Weighted): {val_f1}')


### Run

In [23]:
train(model=model, optimizer=optimizer, train_dataloader=train_loader, val_dataloader=validation_loader)

  0%|          | 0/10 [01:34<?, ?it/s]


Epoch 1
Training loss: 0.47132693573752826


 10%|█         | 1/10 [01:43<15:35, 103.95s/it]

Validation loss: 0.457704509749557
F1 Score (Weighted): 0.22338408832786694


 10%|█         | 1/10 [03:24<15:35, 103.95s/it]


Epoch 2
Training loss: 0.46188399793311236


 20%|██        | 2/10 [03:34<14:22, 107.80s/it]

Validation loss: 0.4566034728830511
F1 Score (Weighted): 0.23017403352191465


 20%|██        | 2/10 [05:17<14:22, 107.80s/it]


Epoch 3
Training loss: 0.4599987838198157


 30%|███       | 3/10 [05:28<12:53, 110.45s/it]

Validation loss: 0.4558577338854472
F1 Score (Weighted): 0.2300164086062059


 30%|███       | 3/10 [07:12<12:53, 110.45s/it]


Epoch 4
Training loss: 0.45941353704840104


 40%|████      | 4/10 [07:22<11:12, 112.17s/it]

Validation loss: 0.455285673791712
F1 Score (Weighted): 0.28749387201808313


 40%|████      | 4/10 [09:08<11:12, 112.17s/it]


Epoch 5
Training loss: 0.4578447753095372


 50%|█████     | 5/10 [09:18<09:27, 113.47s/it]

Validation loss: 0.4546360508962111
F1 Score (Weighted): 0.225711626302008


 50%|█████     | 5/10 [11:03<09:27, 113.47s/it]


Epoch 6
Training loss: 0.457851216914182


 60%|██████    | 6/10 [11:14<07:36, 114.20s/it]

Validation loss: 0.45350095178141736
F1 Score (Weighted): 0.2727268796374769


 60%|██████    | 6/10 [12:59<07:36, 114.20s/it]


Epoch 7
Training loss: 0.45811651447877527


 70%|███████   | 7/10 [13:09<05:43, 114.65s/it]

Validation loss: 0.45096983602552704
F1 Score (Weighted): 0.299107584812588


 70%|███████   | 7/10 [14:55<05:43, 114.65s/it]


Epoch 8
Training loss: 0.4562078213149851


 80%|████████  | 8/10 [15:05<03:49, 114.98s/it]

Validation loss: 0.453257166074984
F1 Score (Weighted): 0.21387945384138884


 80%|████████  | 8/10 [16:50<03:49, 114.98s/it]


Epoch 9
Training loss: 0.45561252016434695


 90%|█████████ | 9/10 [17:00<01:55, 115.14s/it]

Validation loss: 0.45107919609907904
F1 Score (Weighted): 0.25153692297102614


 90%|█████████ | 9/10 [18:45<01:55, 115.14s/it]


Epoch 10
Training loss: 0.4548560521181892


100%|██████████| 10/10 [18:56<00:00, 113.61s/it]

Validation loss: 0.45061438824191236
F1 Score (Weighted): 0.2688565013399141





## Using OpenDelta library (5 Points)

In [24]:
!pip install git+https://github.com/thunlp/OpenDelta.git -q

  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m542.0/542.0 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m623.2/623.2 kB[0m [31m13.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m207.3/207.3 kB[0m [31m26.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m226.4/226.4 kB[0m [31m30.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m89.7/89.7 kB[0m [31m10.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━

Use `OpenDelta` library to do the same thing. [link](https://opendelta.readthedocs.io/en/latest/modules/deltas.html)

For hyperparameters, test with `N_SOFT_PROMPT_TOKENS=10` and `N_SOFT_PROMPT_TOKENS=20` and report them.

In [26]:
from opendelta.delta_models.soft_prompt import SoftPromptModel
from transformers import AdamW

model = AutoModelForSequenceClassification.from_pretrained(
    CONFIG.model_name,
    num_labels=5,
    output_attentions=False,
    output_hidden_states=False
    )

model_soft_prompt_10 = SoftPromptModel(
    backbone_model=model,
    soft_token_num=10,
    init_range=0.5,
    token_init=True,
    other_expand_ids={"attention_mask": 1, "token_type_ids": 0},
    modified_modules=["root"],
)

for name, param in model.named_parameters():
    param.requires_grad = False
    if "soft" in name:
      param.requires_grad = True

model = model.to(CONFIG.device)
optimizer = AdamW(model.parameters(), lr=CONFIG.learning_rate)

train(model=model, optimizer=optimizer, train_dataloader=train_loader, val_dataloader=validation_loader)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at HooshvareLab/bert-fa-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/10 [01:37<?, ?it/s]


Epoch 1
Training loss: 0.47055563601580536


 10%|█         | 1/10 [01:46<16:02, 106.95s/it]

Validation loss: 0.45777927835782367
F1 Score (Weighted): 0.210237813239944


 10%|█         | 1/10 [03:29<16:02, 106.95s/it]


Epoch 2
Training loss: 0.45463508877524716


 20%|██        | 2/10 [03:38<14:38, 109.84s/it]

Validation loss: 0.431967222329342
F1 Score (Weighted): 0.3385504417148931


 20%|██        | 2/10 [05:22<14:38, 109.84s/it]


Epoch 3
Training loss: 0.4416408994618584


 30%|███       | 3/10 [05:32<12:59, 111.40s/it]

Validation loss: 0.4221643814534852
F1 Score (Weighted): 0.4219660011359126


 30%|███       | 3/10 [07:16<12:59, 111.40s/it]


Epoch 4
Training loss: 0.43438523083447134


 40%|████      | 4/10 [07:26<11:15, 112.51s/it]

Validation loss: 0.4184220243584026
F1 Score (Weighted): 0.39062596024868373


 40%|████      | 4/10 [09:12<11:15, 112.51s/it]


Epoch 5
Training loss: 0.42939041299934694


 50%|█████     | 5/10 [09:22<09:28, 113.76s/it]

Validation loss: 0.4185719056562944
F1 Score (Weighted): 0.36932117002774456


 50%|█████     | 5/10 [11:08<09:28, 113.76s/it]


Epoch 6
Training loss: 0.42765822743668275


 60%|██████    | 6/10 [11:18<07:37, 114.45s/it]

Validation loss: 0.4079696806994351
F1 Score (Weighted): 0.3931445753551571


 60%|██████    | 6/10 [13:03<07:37, 114.45s/it]


Epoch 7
Training loss: 0.42585558559805314


 70%|███████   | 7/10 [13:13<05:44, 114.87s/it]

Validation loss: 0.40546070355357544
F1 Score (Weighted): 0.386346827057252


 70%|███████   | 7/10 [14:59<05:44, 114.87s/it]


Epoch 8
Training loss: 0.4243841072454809


 80%|████████  | 8/10 [15:09<03:50, 115.22s/it]

Validation loss: 0.39690240133892407
F1 Score (Weighted): 0.4524846149688576


 80%|████████  | 8/10 [16:56<03:50, 115.22s/it]


Epoch 9
Training loss: 0.4225602915739631


 90%|█████████ | 9/10 [17:06<01:55, 115.63s/it]

Validation loss: 0.39909834663073224
F1 Score (Weighted): 0.40306679263020806


 90%|█████████ | 9/10 [18:52<01:55, 115.63s/it]


Epoch 10
Training loss: 0.42106350419674327


100%|██████████| 10/10 [19:02<00:00, 114.27s/it]

Validation loss: 0.3918967906272773
F1 Score (Weighted): 0.4486783285676869





In [27]:
model = AutoModelForSequenceClassification.from_pretrained(
    CONFIG.model_name,
    num_labels=5,
    output_attentions=False,
    output_hidden_states=False
    )

model_soft_prompt_20 = SoftPromptModel(
    backbone_model=model,
    soft_token_num=20,
    init_range=0.5,
    token_init=True,
    other_expand_ids={"attention_mask": 1, "token_type_ids": 0},
    modified_modules=["root"],
)

for name, param in model.named_parameters():
    param.requires_grad = False
    if "soft" in name:
      param.requires_grad = True

model = model.to(CONFIG.device)
optimizer = AdamW(model.parameters(), lr=CONFIG.learning_rate)

train(model=model, optimizer=optimizer, train_dataloader=train_loader, val_dataloader=validation_loader)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at HooshvareLab/bert-fa-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/10 [01:56<?, ?it/s]


Epoch 1
Training loss: 0.47137393447804576


 10%|█         | 1/10 [02:07<19:04, 127.13s/it]

Validation loss: 0.46013085679574445
F1 Score (Weighted): 0.21466794468857534


 10%|█         | 1/10 [04:05<19:04, 127.13s/it]


Epoch 2
Training loss: 0.44795159031363097


 20%|██        | 2/10 [04:15<17:04, 128.04s/it]

Validation loss: 0.4281088917544394
F1 Score (Weighted): 0.3447256082473785


 20%|██        | 2/10 [06:14<17:04, 128.04s/it]


Epoch 3
Training loss: 0.43002303223558924


 30%|███       | 3/10 [06:25<15:01, 128.78s/it]

Validation loss: 0.40940046220114734
F1 Score (Weighted): 0.41848133052028325


 30%|███       | 3/10 [08:24<15:01, 128.78s/it]


Epoch 4
Training loss: 0.42366501067411455


 40%|████      | 4/10 [08:35<12:55, 129.18s/it]

Validation loss: 0.40359138719963306
F1 Score (Weighted): 0.428385608799628


 40%|████      | 4/10 [10:34<12:55, 129.18s/it]


Epoch 5
Training loss: 0.4180602641666637


 50%|█████     | 5/10 [10:45<10:47, 129.52s/it]

Validation loss: 0.40053410963578656
F1 Score (Weighted): 0.4418092744540763


 50%|█████     | 5/10 [12:44<10:47, 129.52s/it]


Epoch 6
Training loss: 0.4141716056647785


 60%|██████    | 6/10 [12:55<08:38, 129.70s/it]

Validation loss: 0.3994808160897457
F1 Score (Weighted): 0.4367551571176264


 60%|██████    | 6/10 [14:54<08:38, 129.70s/it]


Epoch 7
Training loss: 0.41268796501631405


 70%|███████   | 7/10 [15:05<06:29, 129.77s/it]

Validation loss: 0.4001951145403313
F1 Score (Weighted): 0.44152446666056167


 70%|███████   | 7/10 [17:04<06:29, 129.77s/it]


Epoch 8
Training loss: 0.40713458958475346


 80%|████████  | 8/10 [17:14<04:19, 129.69s/it]

Validation loss: 0.39055857152649853
F1 Score (Weighted): 0.46640415465471274


 80%|████████  | 8/10 [19:13<04:19, 129.69s/it]


Epoch 9
Training loss: 0.4046679708887549


 90%|█████████ | 9/10 [19:24<02:09, 129.63s/it]

Validation loss: 0.3996859815987674
F1 Score (Weighted): 0.45092310857887374


 90%|█████████ | 9/10 [21:22<02:09, 129.63s/it]


Epoch 10
Training loss: 0.4026825452711493


100%|██████████| 10/10 [21:33<00:00, 129.35s/it]

Validation loss: 0.3912365021127643
F1 Score (Weighted): 0.481628160468963



