<a href="https://colab.research.google.com/github/tmobley96/AI-SceneGen/blob/main/PEFTTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Setup

In [1]:
!pip install -q bitsandbytes datasets accelerate loralib einops
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [15]:
from huggingface_hub import notebook_login

notebook_login()


In [3]:
!nvidia-smi -L


GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-3e1ad5c3-d2d6-4589-33d2-f2b858e683cb)


### Setup the model

In [4]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
from torch.utils.data import Dataset

model = AutoModelForCausalLM.from_pretrained(
    "cognitivecomputations/dolphin-2_6-phi-2",
    load_in_8bit=True,
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained("cognitivecomputations/dolphin-2_6-phi-2")

# Get the tokenizer's maximum input length; we will need it later.
max_input_size = tokenizer.model_max_length
print("Maximum input size for the model:", max_input_size)

The repository for cognitivecomputations/dolphin-2_6-phi-2 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/cognitivecomputations/dolphin-2_6-phi-2.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Some weights of the model checkpoint at cognitivecomputations/dolphin-2_6-phi-2 were not used when initializing PhiForCausalLM: ['lm_head.linear.lora_B.default.weight', 'lm_head.linear.lora_A.default.weight']
- This IS expected if you are initializing PhiForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing PhiForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Maximum input size for the model: 2048
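
As a rough sanity check on why 8-bit loading matters on a 16 GB GPU, the sketch below estimates weight storage at different precisions. The parameter count is an assumption: it is the total this notebook reports once the adapters are attached, minus the trainable adapter parameters. The estimate ignores activations, gradients, optimizer state, and any layers that 8-bit loading keeps in higher precision, so treat it as a ballpark figure rather than a measurement.

```python
# Rough estimate of weight storage only; activations, gradients, and optimizer
# state are not included.
BASE_PARAMS = 2_795_412_480 - 15_728_640  # assumed: reported total minus LoRA adapter params


def weight_gib(num_params: int, bytes_per_param: int) -> float:
    """Return the storage needed for num_params weights, in GiB."""
    return num_params * bytes_per_param / 2**30


for name, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: {weight_gib(BASE_PARAMS, nbytes):.2f} GiB")
```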


### Freezing the original weights


In [16]:
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)

You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.


# Setting up the LoRA Adapters

In [17]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [18]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=32,  # LoRA rank: inner dimension of the low-rank update matrices
    lora_alpha=16,  # scaling numerator; adapter updates are scaled by lora_alpha / r
    target_modules=["Wqkv", "out_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 15728640 || all params: 2795412480 || trainable%: 0.5626590033682615
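
The reported trainable-parameter count can be reproduced by hand. A LoRA adapter on a linear layer with `in_features` inputs and `out_features` outputs adds `r * (in_features + out_features)` weights. The sketch below assumes Phi-2's published dimensions (hidden size 2560, 32 transformer layers) and that `Wqkv` projects the hidden state to three times its width; none of these values are stated in this notebook, and the helper name is hypothetical.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Weights added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


HIDDEN = 2560    # assumed Phi-2 hidden size
N_LAYERS = 32    # assumed number of transformer layers
R = 32           # LoRA rank from the config above
LORA_ALPHA = 16  # from the config above

per_layer = (
    lora_param_count(HIDDEN, 3 * HIDDEN, R)  # Wqkv: fused query/key/value projection
    + lora_param_count(HIDDEN, HIDDEN, R)    # out_proj: attention output projection
)
total_trainable = per_layer * N_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the adapter output

print("trainable params:", total_trainable)
print("trainable %:", 100 * total_trainable / 2_795_412_480)
print("LoRA scaling:", scaling)
```

Under these assumptions the total matches the 15728640 printed above. With `lora_alpha` smaller than `r`, the adapter update is scaled down by half.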


# Data

First, we import our dataset.

In [19]:
import transformers
from datasets import load_dataset
data = load_dataset("tmobley96/black_mirror_scripts_S1-5")

Let's inspect one entry to confirm the dataset loaded correctly.

In [21]:
data['train'][665]  # an arbitrary index

{'Script ID': '101',
 'Scene': '666',
 'Timestamp': '00:33:14,200 --> 00:33:17,560',
 'Title': 'The National Anthem',
 'Dialogue': 'that it will ensure the safe  release of Princess Susannah.'}

Next, we tokenize the 'Dialogue' field with the model's tokenizer, padding or truncating every entry to a fixed length.

In [22]:
# Load the Dolphin tokenizer
tokenizer = AutoTokenizer.from_pretrained("cognitivecomputations/dolphin-2_6-phi-2")

# Process each 'Dialogue' entry from our dataset: tokenize the text and pad it to a fixed length.
def tokenize_and_pad(text_list, max_length=2048):
    # Tokenize all texts and align the length by padding
    tokenized_outputs = tokenizer(
        text_list,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors="pt"
        )
    return tokenized_outputs


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [23]:
# Example: Processing the dataset
dialogues = [entry['Dialogue'] for entry in data['train']]
processed_data = tokenize_and_pad(dialogues)
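
Padding every line to 2048 tokens is expensive when the inputs are short subtitle lines. The sketch below estimates how much of the padded tensor is padding. The dataset size is an assumption inferred from the step count reported in the Training section, and the average token count per line is a hypothetical placeholder, not a measured value.

```python
NUM_EXAMPLES = 17_467    # assumed: inferred from the reported 13100 training steps
MAX_LENGTH = 2048        # padding length used by tokenize_and_pad
AVG_REAL_TOKENS = 15     # hypothetical average number of tokens per dialogue line
BYTES_PER_TOKEN_ID = 8   # PyTorch returns token ids as int64

padded_tokens = NUM_EXAMPLES * MAX_LENGTH
real_tokens = NUM_EXAMPLES * AVG_REAL_TOKENS
pad_fraction = 1 - real_tokens / padded_tokens
input_ids_mib = padded_tokens * BYTES_PER_TOKEN_ID / 2**20

print(f"padding fraction: {pad_fraction:.1%}")
print(f"input_ids tensor: {input_ids_mib:.0f} MiB")
```

Measuring the real token lengths and choosing a smaller `max_length` would likely shrink both this tensor and the activation memory needed per training step.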


In [24]:
# We have tokenized our data. Now wrap it in a Dataset so the Trainer can consume it.

class BlackMirrorDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings.input_ids)

    def __getitem__(self, idx):
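        # The encodings are already tensors (return_tensors="pt"), so index them
        # directly; re-wrapping with torch.tensor() triggers a copy-construct UserWarning.
        return {key: val[idx] for key, val in self.encodings.items()}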

train_dataset = BlackMirrorDataset(processed_data)


# Training

In [25]:
dataset_size = len(train_dataset)  # number of training examples
num_epochs = 3  # number of passes over the dataset
batch_size = 4  # effective batch size (per-device batch size x gradient accumulation steps)

max_steps = (dataset_size * num_epochs) // batch_size
print("Maximum number of training steps:", max_steps)


Maximum number of training steps: 13100
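
The step count above ties together the dataset size, the number of epochs, and the effective batch size. The sketch below reproduces that arithmetic with the per-device batch size and gradient accumulation written out separately. The dataset size of 17467 is an assumption inferred from the printed result, and `estimate_max_steps` is a hypothetical helper; the Trainer's own rounding may differ slightly.

```python
def estimate_max_steps(dataset_size, num_epochs, per_device_batch_size,
                       grad_accum_steps, num_devices=1):
    """Approximate the number of optimizer steps for a full training run."""
    # One optimizer step consumes this many examples.
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return (dataset_size * num_epochs) // effective_batch


steps = estimate_max_steps(
    dataset_size=17_467,       # assumed: inferred from the printed step count
    num_epochs=3,
    per_device_batch_size=1,
    grad_accum_steps=4,
)
print("Estimated training steps:", steps)
```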


In [30]:

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_dataset,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        warmup_steps=100,
        save_steps=500,
        max_steps=max_steps,  # takes precedence over num_train_epochs
        learning_rate=2e-4,
        fp16=True,
        bf16=False,
        logging_steps=10,
        logging_dir='logs',
        output_dir='outputs',
        remove_unused_columns=False,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
torch.cuda.empty_cache()
model.config.use_cache = False  # the KV cache is incompatible with gradient checkpointing; re-enable it for inference
trainer.train()

  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}


OutOfMemoryError: ignored

In [None]:
try:
    trainer.train()
except IndexError as e:
    print(f"IndexError: {e}")
    print(f"Current batch index: {trainer.state.global_step}")
    # Log any other relevant information here
    raise  # re-raise to halt execution with the original traceback
