<a href="https://colab.research.google.com/github/joonhee0416/CPSC477-Final/blob/main/train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## CPSC 477 Final Project Part 2: Finetuning Mistral 7B-Instruct

Followed this tutorial: https://adithyask.medium.com/a-beginners-guide-to-fine-tuning-mistral-7b-instruct-model-0f39647b20fe

Imported `dataset/train.jsonl`

In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

In [2]:
!pip install -q torch
!pip install -q git+https://github.com/huggingface/transformers  # Hugging Face Transformers: model classes and weight downloads
!pip install -q datasets  # Hugging Face Datasets: loading and manipulating datasets
!pip install -q peft  # Parameter-Efficient Fine-Tuning, used here for QLoRA
!pip install -q bitsandbytes  # model weight quantization
!pip install -q trl  # Transformer Reinforcement Learning; provides SFTTrainer for supervised fine-tuning
!pip install -q wandb -U  # experiment tracking for monitoring metrics during training

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done

In [3]:
import json
import pandas as pd
import torch
from datasets import Dataset, load_dataset
from huggingface_hub import notebook_login
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from trl import SFTTrainer

In [4]:
notebook_login()


In [5]:
train_dataset = load_dataset('csv', data_files='./train.csv', split='train')

In [6]:
print(len(train_dataset[0]["text"]))
train_dataset[0]

9487


{'filename': 'SPG_q4_2020.txt',
 'text': "<s>[INST] You are a financial advisor tasked with creating a short summary of an earnings call transcript. You only want to summarize or re-iterate points that would be relevant, critical, or informational to someone who wants to skim over the important details of a long transcript. Below is an earnings call transcript.\n\nEarnings Call Transcript:\n Also on the call are Brian McDade, chief financial officer; and Adam Reuille, chief accounting officer.\nAs all of us know, 2020 was a difficult year for all of those affected by COVID-19, including our company.\nEven with the unprecedented operating environment, we accomplished a great deal.\nWe earned $9.11 per diluted share and funds from operation for the full year, which includes a $0.06 per share dilution from our recent equity offering in November.\nWe generated over $2.3 billion in operating cash flow.\nWe acquired an 80% interest in the Taubman Realty Group, made strategic investments in s

In [7]:
new_model = "mistralai-Code-Instruct"  # directory name under which the fine-tuned adapter is saved

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = True

# Batch size per GPU for training
per_device_train_batch_size = 1

# Batch size per GPU for evaluation
per_device_eval_batch_size = 1

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to the learning rate).
# Note: the plain "constant" schedule does not apply warmup; "constant_with_warmup" does.
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save a checkpoint every X update steps
save_steps = 25

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on GPU 0
device_map = {"": 0}
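
Two derived quantities follow from the settings above and are worth checking before training. In the standard LoRA formulation the low-rank update is multiplied by `lora_alpha / r`, and the number of examples per optimizer step is the per-device batch size times the accumulation steps times the number of GPUs. The sketch below only does this arithmetic; `num_gpus` is an assumed value that does not appear in the notebook.

```python
# Values copied from the configuration cell above.
lora_r = 64
lora_alpha = 16
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
num_gpus = 1  # assumption: a single Colab GPU

# Standard LoRA scales the low-rank update B @ A by alpha / r.
lora_scaling = lora_alpha / lora_r

# Number of examples contributing to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)

print(f"LoRA scaling factor: {lora_scaling}")
print(f"Effective batch size: {effective_batch_size}")
```

With a scaling factor below 1 the adapter update is damped relative to its raw magnitude. Whether that suits this task is an empirical question the notebook does not explore.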

In [8]:
model_name = "mistralai/Mistral-7B-Instruct-v0.1"

# Load the base model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
)

base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

# Load the Mistral tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [9]:
import torch
torch.cuda.empty_cache()
import gc
gc.collect()

81

In [10]:
eval_prompt = train_dataset[0]["text"].split("#")[0] + "# "
# import random
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

base_model.eval()
with torch.no_grad():
    output_ids = base_model.generate(
        **model_input,
        max_new_tokens=256,
        pad_token_id=tokenizer.eos_token_id,  # the pad token was set to EOS above
    )
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

<s>[INST] You are a financial advisor tasked with creating a short summary of an earnings call transcript. You only want to summarize or re-iterate points that would be relevant, critical, or informational to someone who wants to skim over the important details of a long transcript. Below is an earnings call transcript.

Earnings Call Transcript:
 Also on the call are Brian McDade, chief financial officer; and Adam Reuille, chief accounting officer.
As all of us know, 2020 was a difficult year for all of those affected by COVID-19, including our company.
Even with the unprecedented operating environment, we accomplished a great deal.
We earned $9.11 per diluted share and funds from operation for the full year, which includes a $0.06 per share dilution from our recent equity offering in November.
We generated over $2.3 billion in operating cash flow.
We acquired an 80% interest in the Taubman Realty Group, made strategic investments in several widely recognized retail brands at attracti

In [11]:
print(train_dataset[0]["text"].split("#")[1])
print(len(train_dataset[0]["text"]))

 In 2020, Simon Property Group navigated economic challenges stemming from the pandemic, generating $9.11 per diluted share earnings and $2.3 billion in operating cash flow. During the fourth quarter, FFO reached $2.17 per share, impacted by COVID-19 disruptions, yielding a $0.95 per share loss. Despite these challenges, the company made strategic investments in retail brands, raising $13 billion in debt and equity markets. They completed the Taubman Realty Group acquisition, which boasts a high-performing retail portfolio. Looking ahead to 2021, the company projects FFO growth in the range of $9.50 to $9.75 per share, a 4.3% to 7% increase from 2020. The guidance assumes no further government-mandated shutdowns of domestic properties and includes an estimated $0.15 to $0.20 per share contribution from retailer investments.</s>
9487
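
The two cells above rely on each example being laid out as prompt, then `#`, then the reference summary, ending in `</s>`. That layout is inferred from the `split("#")` calls, not from a documented schema. A minimal sketch of a helper that makes the convention explicit is shown below. The function name and the miniature example are hypothetical.

```python
def split_prompt_and_summary(text: str, separator: str = "#") -> tuple[str, str]:
    """Split a training example into (prompt, reference summary) at the first separator."""
    prompt, found, summary = text.partition(separator)
    if not found:
        raise ValueError("separator not found in example text")
    # Keep the separator on the prompt side, mirroring `split("#")[0] + "# "`.
    return prompt + separator + " ", summary.removesuffix("</s>").strip()


# Hypothetical miniature example in the same layout as the training rows.
example = (
    "<s>[INST] Summarize the call.\n\nEarnings Call Transcript:\n"
    " Revenue rose. # Revenue increased.</s>"
)
prompt, summary = split_prompt_and_summary(example)
print(repr(prompt))
print(repr(summary))
```

Unlike `split("#")[1]`, `partition` keeps everything after the first `#`, so a summary that itself contains `#` is not cut short. A transcript containing `#` before the separator would still break either approach.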


In [12]:
import gc
torch.cuda.empty_cache()
gc.collect()

0

In [13]:
# Set LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=100,  # hard-coded here; takes precedence over num_train_epochs and the max_steps variable above
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Initialize the SFTTrainer for fine-tuning
trainer = SFTTrainer(
    model=base_model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,  # You can specify the maximum sequence length here
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

max_steps is given, it will override any value given in num_train_epochs
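
The warning above means the run length is set by the step count, not by epochs. A rough sketch of what that implies for data coverage follows. The dataset size used here is a made-up number for illustration, not the real row count of `train.csv`.

```python
import math

max_steps = 100  # value passed to TrainingArguments above
per_device_train_batch_size = 1
gradient_accumulation_steps = 1

examples_per_step = per_device_train_batch_size * gradient_accumulation_steps
examples_seen = max_steps * examples_per_step


def steps_per_epoch(dataset_size: int) -> int:
    """Optimizer steps needed for one full pass over the dataset."""
    return math.ceil(dataset_size / examples_per_step)


hypothetical_dataset_size = 400  # assumption for illustration only
fraction_of_epoch = max_steps / steps_per_epoch(hypothetical_dataset_size)

print(f"Examples seen: {examples_seen}")
print(f"Fraction of one epoch (hypothetical size): {fraction_of_epoch:.2f}")
```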


In [14]:
!nvidia-smi

Tue Apr 23 02:13:33 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   52C    P0              29W /  72W |   6485MiB / 23034MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
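
The `nvidia-smi` output above reports 6485MiB in use after loading the 4-bit model. A back-of-envelope estimate of the weight storage alone gives a lower bound to compare against. The parameter count below is an assumed round figure for a 7B-class model, and the estimate leaves out the CUDA context, any layers kept in higher precision, and generation buffers.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed just to store the weights, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


approx_params = 7.2e9  # assumption: rough parameter count for a 7B-class model

for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(approx_params, bits):.1f} GiB")
```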
                                                                    

In [15]:
import gc
import os
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
torch.cuda.empty_cache()
gc.collect()

# Start the training process
trainer.train()

# Save the fine-tuned model
trainer.model.save_pretrained(new_model)

| Step | Training Loss |
|------|---------------|
| 25   | 2.0072        |
| 50   | 1.7867        |
| 75   | 1.8481        |
| 100  | 1.7769        |
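
The logged training loss is a mean token-level cross-entropy, so exponentiating it gives a perplexity, which is sometimes easier to read. The sketch below converts the four logged values. It assumes the loss is in nats, which is the usual convention for PyTorch cross-entropy.

```python
import math

# Training losses copied from the log above (step -> loss).
logged_losses = {25: 2.0072, 50: 1.7867, 75: 1.8481, 100: 1.7769}

# Perplexity is exp(cross-entropy) when the loss is measured in nats.
perplexities = {step: math.exp(loss) for step, loss in logged_losses.items()}

for step, ppl in perplexities.items():
    print(f"step {step:>3}: loss {logged_losses[step]:.4f} -> perplexity {ppl:.2f}")
```

These are training-set numbers averaged over each logging window, so they say little about generalization. No evaluation split is configured in this run.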




In [18]:
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
fine_tuned_model = AutoModelForCausalLM.from_pretrained(
    new_model,
    quantization_config=bnb_config,
    device_map={"": 0}
)
fine_tuned_model.eval()
with torch.no_grad():
    output_ids = fine_tuned_model.generate(
        **model_input,
        max_new_tokens=256,
        pad_token_id=tokenizer.eos_token_id,
    )
generated_summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_summary)


[INST] You are a financial advisor tasked with creating a short summary of an earnings call transcript. You only want to summarize or re-iterate points that would be relevant, critical, or informational to someone who wants to skim over the important details of a long transcript. Below is an earnings call transcript.

Earnings Call Transcript:
 Also on the call are Brian McDade, chief financial officer; and Adam Reuille, chief accounting officer.
As all of us know, 2020 was a difficult year for all of those affected by COVID-19, including our company.
Even with the unprecedented operating environment, we accomplished a great deal.
We earned $9.11 per diluted share and funds from operation for the full year, which includes a $0.06 per share dilution from our recent equity offering in November.
We generated over $2.3 billion in operating cash flow.
We acquired an 80% interest in the Taubman Realty Group, made strategic investments in several widely recognized retail brands at attractive 

In [28]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
torch.cuda.empty_cache()
gc.collect()

# Merge the model with LoRA weights
merged_model = PeftModel.from_pretrained(base_model, new_model)
merged_model = merged_model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")




('merged_model/tokenizer_config.json',
 'merged_model/special_tokens_map.json',
 'merged_model/tokenizer.model',
 'merged_model/added_tokens.json',
 'merged_model/tokenizer.json')

In [29]:
test_dataset = load_dataset('csv', data_files='./test.csv', split='train')


In [40]:
eval_prompt = test_dataset[0]["text"].split("#")[0] + "# "
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
print(tokenizer.padding_side)

input_ids = tokenizer(eval_prompt, return_tensors="pt", truncation=True).input_ids.cuda()
# with torch.inference_mode():
outputs = merged_model.generate(
    input_ids=input_ids,
    pad_token_id=tokenizer.pad_token_id,
    max_new_tokens=10000,
    do_sample=True,
    top_p=0.9,
    temperature=0.5,
)
summary = tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(eval_prompt):]
print(f"\nGenerated summary:\n{summary}")


right


RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16
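
The `RuntimeError` above reports a matrix multiply between a float32 tensor and a bfloat16 tensor. One possible cause, not confirmed in this notebook, is that merging the adapter into the 4-bit base model left some layers in a different dtype from the rest. A quick way to check is to count parameters per dtype. The sketch below does that on a toy `nn.Sequential` standing in for the real model, and `dtype_histogram` is a hypothetical helper name.

```python
from collections import Counter

import torch
from torch import nn


def dtype_histogram(module: nn.Module) -> Counter:
    """Count parameter tensors per dtype so mixed-precision leftovers are easy to spot."""
    return Counter(str(p.dtype) for p in module.parameters())


# Toy stand-in for a merged model: one layer in float32, one in bfloat16.
toy = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2).to(torch.bfloat16))
print(dtype_histogram(toy))

# Casting the whole module to a single dtype removes the mismatch for this toy model.
toy = toy.to(torch.bfloat16)
x = torch.randn(1, 4, dtype=torch.bfloat16)
y = toy(x)
print(dtype_histogram(toy), y.dtype)
```

Casting with `.to(dtype)` is shown here for an ordinary, non-quantized module. A model loaded through bitsandbytes in 4-bit may not allow it. In that case a common alternative is to reload the base model in half precision without quantization and merge the adapter into that copy.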