# **Reproducibility Study for SusGen-GPT**

In this notebook we set up and fine-tune a `Mistral 7B` model in much the same way as the authors of the **SusGen-GPT** paper. We use the 100 compute units (CUs) per month that come with Google Colab Pro, together with its T4 GPU, to work in a low-compute environment (but one that at least has access to a GPU). Given the limited resources, training is sensitive to memory pressure and prone to crashes and out-of-memory errors. Google Colab also does not seem to work well with `bitsandbytes` or `QLoRA` implementations: to get them working you have to be on an earlier Python version (Python 3.10), and Colab does not make it easy to use a version that old, so several finicky workarounds are needed. Nevertheless, this notebook describes, through markdown cells, the process of reproducing this work in a low-compute environment using the authors' code and implementation strategies.
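
Since the setup depends on the Python version and on a GPU being present, a quick check at the start of a session can save time. The snippet below is a minimal sketch using only the standard library; the helper name `describe_environment` is hypothetical, and finding `nvidia-smi` on the `PATH` is only a rough proxy for a usable GPU.

```python
import shutil
import sys


def describe_environment():
    """Collect the interpreter version and whether the NVIDIA driver tool is on PATH."""
    version = sys.version_info
    return {
        "python": f"{version.major}.{version.minor}.{version.micro}",
        # Python 3.10 is the version this notebook reports working with.
        "is_python_3_10": (version.major, version.minor) == (3, 10),
        # nvidia-smi being on PATH is only a rough proxy for a usable GPU.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }


print(describe_environment())
```

If `is_python_3_10` is `False`, the `bitsandbytes` workarounds mentioned above are likely to be needed.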

Below, we mount our Google Drive so we can access the authors' source code that we uploaded and use Colab's T4 GPU.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Navigate to the SusGen repository, wherever that is stored in your Google Drive!

In [None]:
# Navigate to SusGen repo (path will change depending on your setup!)
%cd /content/drive/My Drive/cs421/Reproducibility_Study/SusGen

Below, we install the dependencies and set up the environment using the `requirements.txt` file the authors put together.

In [None]:
# Install dependencies
%env VLLM_INSTALL_PUNICA_KERNELS=1
!pip install -r requirements.txt

Below, we load the `SusGen-10K` dataset.
I added the 10K dataset to the SusGen folder in Google Drive for convenience, but the dataset is publicly available: it can be downloaded from the link below and loaded from wherever is convenient. I also added the link to the `SusGen-30K` dataset in case you do not share my resource limitations and can use the larger dataset.

- Link to `SusGen-10K` dataset: https://huggingface.co/datasets/WHATX/SusGen-30k/blob/main/ablation/SusGen-10k.json

- Link to `SusGen-30K` dataset: https://huggingface.co/datasets/WHATX/SusGen-30k

In [None]:
# Load SusGen-10K dataset
from datasets import load_dataset
dataset = load_dataset('json', data_files='/content/drive/My Drive/cs421/Reproducibility_Study/SusGen/SusGen-10k.json')
dataset
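
Before formatting, it can help to confirm that every record carries the fields the next cell reads (`instruction`, `output`, and the optional `input`). The following is a minimal sketch using only the standard library; the helper `find_invalid_records` and the sample records are hypothetical and are not taken from the SusGen data.

```python
import json

# 'input' is optional in the formatting step, so only these two are required.
REQUIRED_FIELDS = ("instruction", "output")


def find_invalid_records(records):
    """Return the indices of records whose required fields are missing or empty."""
    bad = []
    for index, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                bad.append(index)
                break
    return bad


# Hypothetical records in the same shape as the dataset file.
sample_json = """[
  {"instruction": "Summarize the report.", "input": "Emissions fell 5%.", "output": "Emissions decreased."},
  {"instruction": "Define ESG.", "output": "Environmental, social, and governance."},
  {"instruction": "", "output": "x"},
  {"input": "no instruction here", "output": "y"}
]"""

records = json.loads(sample_json)
print(find_invalid_records(records))  # [2, 3]
```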

Below, we format the dataset into the instruction-response format that the authors describe for training and fine-tuning in the paper. Each example has two parts: the `prompt`, which is the instruction plus the optional input, and the `response`, which is the output.

In [None]:
def format_example(example):
  instruction = example['instruction']
  inp = example.get('input', '')
  # response is the output
  output = example['output']
  if inp and len(inp.strip()) > 0:
    # prompt is instruction + input
    prompt = f"<s>[INST] {instruction}\n{inp} [/INST]"
  else:
    prompt = f"<s>[INST] {instruction} [/INST]"

  return {'prompt':prompt, 'response':output}

# Apply this format to the entire dataset to ensure it is the proper format
processed = dataset.map(format_example)
processed
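
To make the template concrete, here is a small standalone example that repeats `format_example` from the cell above and applies it to two hypothetical records, one with an `input` and one without.

```python
def format_example(example):
    """Build the [INST]-style prompt and response, as in the cell above."""
    instruction = example['instruction']
    inp = example.get('input', '')
    output = example['output']
    if inp and len(inp.strip()) > 0:
        prompt = f"<s>[INST] {instruction}\n{inp} [/INST]"
    else:
        prompt = f"<s>[INST] {instruction} [/INST]"
    return {'prompt': prompt, 'response': output}


# Hypothetical records, for illustration only.
with_input = format_example({
    'instruction': 'Summarize the text.',
    'input': 'Emissions fell 5%.',
    'output': 'Emissions decreased.',
})
without_input = format_example({'instruction': 'Define ESG.', 'output': 'A reporting framework.'})

print(repr(with_input['prompt']))     # '<s>[INST] Summarize the text.\nEmissions fell 5%. [/INST]'
print(repr(without_input['prompt']))  # '<s>[INST] Define ESG. [/INST]'
```

An `input` that is missing, empty, or only whitespace falls through to the shorter template.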

Here, we load the `Mistral 7B` model in 4-bit quantized form using `BitsAndBytesConfig`, which greatly reduces the memory needed to hold its 7 billion parameters. We load the `tokenizer` as well.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "mistralai/Mistral-7B-v0.3"
# Configure bitsandbytes to use for loading the model in quantized form
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
# Load tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
# Load the Mistral 7B model in quantized form
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto"
)
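
As a rough back-of-the-envelope check of why 4-bit loading matters, the sketch below estimates the memory taken by the weights alone. It assumes a round 7 billion parameters and ignores quantization constants, activations, optimizer state, and the CUDA context, so real usage will be higher; `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 7e9  # assumption: round figure for a "7B" model

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"fp16 weights:  {fp16_gib:.2f} GiB")  # about 13.04 GiB
print(f"4-bit weights: {nf4_gib:.2f} GiB")   # about 3.26 GiB
```

Under these assumptions the 4-bit weights take a quarter of the fp16 footprint, which leaves room on a 16 GB-class GPU for activations and adapter training.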

Below, we apply `QLoRA` to the `Mistral 7B` model we loaded above. This lets us fine-tune only the low-rank adapters while keeping the base model weights frozen. We define the `LoraConfig` the same way the authors did to stay in step with their setup.

In [None]:
# Apply QLoRA
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
# Prepare loaded model for quantized training
model = prepare_model_for_kbit_training(model)
# Set up the LoRA config following the authors' implementation
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.1, task_type='CAUSAL_LM')
# Add the trainable LoRA parameters to the model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
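
To get a feel for how small the trainable part is, the sketch below counts LoRA parameters by hand: each adapted weight matrix of shape `d_in x d_out` gains two low-rank factors with `r * (d_in + d_out)` parameters in total. The shapes and layer count are assumptions (adapters on the query and value projections only, with a hidden size of 4096, a value projection of width 1024, and 32 layers); check them against `model.config` and the actual target modules before relying on the number.

```python
def lora_param_count(rank, shapes, num_layers):
    """Trainable LoRA parameters: rank * (d_in + d_out) per adapted matrix, per layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers


# Assumed (d_in, d_out) shapes of the adapted projections; not read from the model.
ASSUMED_SHAPES = [(4096, 4096), (4096, 1024)]
ASSUMED_NUM_LAYERS = 32

total = lora_param_count(rank=16, shapes=ASSUMED_SHAPES, num_layers=ASSUMED_NUM_LAYERS)
scaling = 32 / 16  # lora_alpha / r, the factor applied to the adapter output

print(f"{total:,} trainable parameters")  # 6,815,744 under the assumptions above
print(f"LoRA scaling factor: {scaling}")  # 2.0
```

If the figure printed by `model.print_trainable_parameters()` differs, the target modules or shapes assumed here do not match the loaded model.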


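Below, we tokenize the dataset so we can pass it to the LLM for training. We use our `tokenizer` to convert each prompt together with its response into one token sequence, handling padding and truncation with a maximum length. The data collator defined in the next cell derives the labels from the input ids, so we do not tokenize the labels separately.

In [None]:
# Tokenize dataset
def tokenize(batch):
  # Join each prompt with its response and end with EOS so the causal LM sees the full sequence.
  # DataCollatorForLanguageModeling(mlm=False) copies input_ids into labels, so separately
  # tokenized labels would be overwritten at batch time.
  texts = [p + ' ' + r + tokenizer.eos_token for p, r in zip(batch['prompt'], batch['response'])]
  # The prompts already start with <s>, so do not add special tokens a second time.
  # max_length=1024 keeps the original prompt budget; raise it if responses get cut off and memory allows.
  return tokenizer(texts, padding='max_length', truncation=True, max_length=1024, add_special_tokens=False)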
# Apply tokenization to the whole dataset we processed earlier
tokenized = processed['train'].map(tokenize, batched=True, remove_columns=processed['train'].column_names)
tokenized
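
To illustrate what `padding='max_length'` and `truncation=True` do to a single sequence, here is a toy version on plain lists of token ids. It is only a sketch: it assumes right-side padding (the real tokenizer's `padding_side` may differ), the ids are made up, and `pad_or_truncate` and `mask_padding` are hypothetical helpers. The value `-100` is the index that the loss ignores, which is how padded label positions are excluded from training.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def pad_or_truncate(token_ids, max_length, pad_id):
    """Clip to max_length, then right-pad with pad_id; also return the attention mask."""
    clipped = token_ids[:max_length]
    num_pad = max_length - len(clipped)
    input_ids = clipped + [pad_id] * num_pad
    attention_mask = [1] * len(clipped) + [0] * num_pad
    return input_ids, attention_mask


def mask_padding(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [tid if keep == 1 else IGNORE_INDEX for tid, keep in zip(input_ids, attention_mask)]


short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5, pad_id=0)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4, pad_id=0)

print(short_ids, short_mask)                # [5, 6, 7, 0, 0] [1, 1, 1, 0, 0]
print(mask_padding(short_ids, short_mask))  # [5, 6, 7, -100, -100]
print(long_ids, long_mask)                  # [1, 2, 3, 4] [1, 1, 1, 1]
```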

Below, we set up the training for our model with all the hyperparameter settings we chose (see report for details on choices). We use `Trainer` from the `transformers` library to simplify the pipeline for training, and in the next cell we begin the training by calling `trainer.train()`!

In [None]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
# Set the hyperparameters and arguments for training
training_args = TrainingArguments(
    output_dir="/content/drive/My Drive/cs421/Reproducibility_Study/SusGen/outputs/mistral7b_lora",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    # The T4 has no native bf16 support; if this raises on your GPU, set fp16=True and bf16=False
    fp16=False,
    bf16=True,
    optim="paged_adamw_32bit",
    logging_steps=50,
    save_steps=500,
    save_total_limit=2,
    report_to="none",
)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=collator,
)
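
The batch settings above interact: gradients from 8 micro-batches of 2 examples are accumulated before each optimizer step. The sketch below works through that arithmetic. It assumes exactly 10,000 training examples and a single GPU, and the exact rounding used by `Trainer` can vary between `transformers` versions, so treat the numbers as estimates; `training_schedule` is a hypothetical helper.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum, epochs, save_steps, num_devices=1):
    """Estimate effective batch size, optimizer steps, and step-based checkpoint positions."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    checkpoints = list(range(save_steps, total_steps + 1, save_steps))
    return effective_batch, steps_per_epoch, total_steps, checkpoints


# Assumption: the 10K dataset holds exactly 10,000 examples.
schedule = training_schedule(
    num_examples=10_000, per_device_batch=2, grad_accum=8, epochs=2, save_steps=500
)
print(schedule)  # (16, 625, 1250, [500, 1000])
```

Under these assumptions, `logging_steps=50` gives 25 log entries over the run, and `save_total_limit=2` keeps both step-based checkpoints.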

In [None]:
# Train the model
trainer.train()

Below, we save the trained LoRA adapter and the tokenizer (paths will differ depending on your Drive setup).

In [None]:
# Save LoRA adapter
trainer.model.save_pretrained('/content/drive/My Drive/cs421/Reproducibility_Study/SusGen/outputs/mistral7b_lora_adapter')
tokenizer.save_pretrained('/content/drive/My Drive/cs421/Reproducibility_Study/SusGen/outputs/mistral7b_lora_adapter')

Finally, we load the trained model and run the evaluation code the authors wrote to get the results of our fine-tuned model on their benchmarks.

In [None]:
# Load for evaluation
from peft import PeftModel
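# Reload the base model with the same 4-bit config so it fits in the T4's memory
base = AutoModelForCausalLM.from_pretrained(MODEL_NAME, quantization_config=bnb_config, device_map='auto')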
lora_model = PeftModel.from_pretrained(base, '/content/drive/My Drive/cs421/Reproducibility_Study/SusGen/outputs/mistral7b_lora_adapter')

In [None]:
# Run authors' evaluation script
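# Point at the adapter directory saved above; the quotes are needed because the path contains a space
!python eval/code/eval.py --model-path "/content/drive/My Drive/cs421/Reproducibility_Study/SusGen/outputs/mistral7b_lora_adapter"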