# Overview

The task is to fine-tune Llama 2 on a dataset of my choice.  
Steps: 

Tue:  
- Make notes on Bes' code 
- Wrap up finetuning LLMs notes
- Boot up a Vast AI GPU, set up a link to mlx_eda_repo, and push work from here
- Login to HF
- Download Llama2 
- Think of a dataset to fine-tune on 
- Prepare and save custom dataset to HF from notebook 

Wed + Thu:  
- Do fine-tuning work
    - Use: prompt tuning; adapter layers; PEFT (adapters, LoRA, BitFit, layer freezing); regularisation
    - Devise eval metrics
- Share fine-tuned weights on HF  

Fri:  
- Extend to multiple GPUs
- Extend to RLHF 
- Track progress with W&B

# Bes' code notes

Code [here](https://github.com/besarthoxhaj/lora)
- data.py
    - 2 template inputs
    - Dataset pytorch class, with prompt, tokenize and max_seq_len methods
- eval.py
    - Llama tokenizer grabbed
    - Llama model loaded up 
    - pad token id defined
    - LoRA config setup
    - PEFT Llama model setup 
    - Template and Instruction strings setup
    - torch pipeline setup 
- model.py
    - get_model() function:
        - pretrained model load 
        - prepare for kbit training 
        - LoRA config
        - get_peft_model 
- train.py
    - ddp setup
    - get_model
    - get dataset
    - setup collator 
    - Setup transformers Trainer object
    - Trainer.train()
    
- Task:
    - Fine-tune an LLM on (instruction, input, output) triplets, where sometimes input doesn't exist
- PEFT methods used:
    - prepare_model_for_kbit_training
    - LoRA
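
The two templates and the optional input can be sketched as below. This is only an illustration: the template wording is a hypothetical Alpaca-style layout, not copied from the repo, and `build_prompt` is a name made up here. The names `TEMPLATE_YES_INPUT` and `TEMPLATE_NOT_INPUT` follow the commented-out `prompt` method further down in this notebook.

```python
# Hypothetical Alpaca-style templates; the wording in the actual repo may differ.
TEMPLATE_YES_INPUT = (
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
TEMPLATE_NOT_INPUT = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_prompt(elm):
    """Pick a template depending on whether the input exists, then append the output."""
    has_input = bool(elm.get("input"))
    template = TEMPLATE_YES_INPUT if has_input else TEMPLATE_NOT_INPUT
    # str.format ignores unused keyword arguments, so passing input is always safe.
    prompt = template.format(instruction=elm["instruction"], input=elm.get("input", ""))
    return prompt + elm["output"]


example = {"instruction": "Translate to French", "input": "cat", "output": "chat"}
print(build_prompt(example))
```

The model is then trained on the full string (prompt plus output), so an empty input simply selects the shorter template.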
    

# Finetuning LLMs notes

- Prompt tuning:
    - train 'soft prompts' (learnable embedding vectors) while the model itself stays frozen
    - prepend the soft prompt embeddings to the input token embeddings
- Adapter layers:
    - small trainable modules inserted between the layers of the pre-trained model 
    - freeze the pre-trained layers
- PEFT:
    - LoRA: trainable low-rank matrices whose product is added to the frozen pre-trained weights
    - BitFit: only the bias terms are updated during finetuning
    - Layer freezing: freeze certain layers and train the rest

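LoRA notes:
- Matrix decomposition: the weight update delta(weights) is written as the product of two much smaller matrices. The hypothesis is that the update needed for finetuning has a much lower rank than the full weight matrix we would otherwise train.
- The rank r is the inner dimension shared by the two matrices (one is d x r, the other r x k)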
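
A small sketch of the decomposition, in plain Python so it runs anywhere. It assumes the usual LoRA formulation, W + (alpha / r) * B @ A, and the helper names are made up for illustration. The 4096 x 4096 shape is the size of the attention projections in Llama-2-7B, and r=8, alpha=16 match the config used later in this notebook.

```python
def lora_param_count(d_out, d_in, r):
    """Trainable parameters in the two factors: B is d_out x r, A is r x d_in."""
    return d_out * r + r * d_in


def matmul(x, y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_merge(W, B, A, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(A)
    scale = alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# One 4096 x 4096 projection: full update vs. rank-8 update.
full = 4096 * 4096
low_rank = lora_param_count(4096, 4096, r=8)
print(full, low_rank, full // low_rank)

# Tiny rank-1 example: a 2 x 2 update built from a 2 x 1 and a 1 x 2 matrix.
W = [[1, 0], [0, 1]]
B = [[1], [2]]
A = [[3, 4]]
print(lora_merge(W, B, A, alpha=2))
```

With r=8 each adapted projection trains 65,536 parameters instead of about 16.8 million, a 256x reduction per matrix.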

# Dataset to finetune on:

Dataset of football commentary transcriptions: https://gitlab.com/grounded-sport-convai/goal-baselines/-/tree/main?ref_type=heads

In [None]:
import json

# Path to your JSONL file
jsonl_file_path = 'train.jsonl'

# Path to the output text file
output_file_path = 'captions.txt'

# Open the JSONL file and the output text file
with open(jsonl_file_path, 'r') as jsonl_file, open(output_file_path, 'w') as output_file:
    # Read each line (JSON object) from the JSONL file
    for line in jsonl_file:
        # Parse the line as JSON
        json_object = json.loads(line)
        
        # Check if 'chunks' key exists in the JSON object
        if 'chunks' in json_object:
            # Loop through each chunk
            for chunk in json_object['chunks']:
                # Check if 'caption' key exists in the chunk
                if 'caption' in chunk:
                    # Write the caption to the output file, each on a new line
                    output_file.write(chunk['caption'] + '\n')

print("Captions extracted and saved to", output_file_path)

# Download Llama2

In [None]:
! pip install accelerate

In [None]:
! pip install bitsandbytes

In [None]:
!pip install transformers==4.30

In [None]:
import transformers as t
import peft
import torch
import os


def get_model():
    
    # GPU
    NAME = "NousResearch/Llama-2-7b-hf"
    # CPU
#     NAME = "EleutherAI/pythia-70m"
    
    is_ddp = int(os.environ.get("WORLD_SIZE", 1)) != 1
    device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)} if is_ddp else None
    
    # GPU
    m = t.AutoModelForCausalLM.from_pretrained(NAME, load_in_8bit=True, torch_dtype=torch.float16, device_map=device_map)
    # CPU
#     m = t.AutoModelForCausalLM.from_pretrained(NAME, load_in_8bit=False, torch_dtype=torch.float16, device_map=device_map)
    
    m = peft.prepare_model_for_kbit_training(m)
    
    # GPU/LLama
    config = peft.LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], lora_dropout=0.005, bias="none", task_type="CAUSAL_LM")
    # CPU/Pythia70m
#     config = peft.LoraConfig(r=8, lora_alpha=16, target_modules=["query_key_value"], lora_dropout=0.005, bias="none", task_type="CAUSAL_LM")
    
    m = peft.get_peft_model(m, config)
    return m

# With the GPU settings above this loads Llama-2-7b, not Pythia-70m
model = get_model()
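
The `is_ddp` / `device_map` logic in `get_model()` can be read in isolation as below. This is a sketch with a hypothetical helper name, `ddp_device_map`, taking the environment as a plain dict so it can be tried without launching anything. Launchers such as `torchrun` set `WORLD_SIZE` and `LOCAL_RANK` for each process.

```python
def ddp_device_map(env):
    """Mirror the device_map choice used in get_model().

    env is a dict-like object such as os.environ.
    """
    # WORLD_SIZE is the total number of processes; 1 (or unset) means no DDP.
    is_ddp = int(env.get("WORLD_SIZE", 1)) != 1
    if not is_ddp:
        return None
    # The "" key means "the whole model": put it on this process's own GPU.
    return {"": int(env.get("LOCAL_RANK") or 0)}


print(ddp_device_map({}))
print(ddp_device_map({"WORLD_SIZE": "2", "LOCAL_RANK": "1"}))
```

In a single-process notebook run the function returns `None`, so `from_pretrained` gets no explicit device map. Under DDP each process loads its own full copy of the model onto its local GPU.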

# Upload to HF

Done via the HF website; the dataset is here: https://huggingface.co/datasets/shaheenahmedc/goal_commentary
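
For reference, the same upload could probably be done from the notebook. The sketch below is untested against the real repo: it assumes `captions.txt` from the extraction step, a prior `huggingface-cli login`, and a single `text` column (the column name the tokenizer step below reads). `read_captions` and `push_captions` are hypothetical helper names, and the repo id is passed in rather than assumed.

```python
def read_captions(path):
    """Read one caption per line, dropping blank lines and surrounding whitespace."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def push_captions(path, repo_id):
    """Build a one-column dataset and push it to the Hugging Face Hub.

    Needs network access and a logged-in HF account, so it is not called here.
    """
    import datasets  # imported lazily so read_captions works without it

    ds = datasets.Dataset.from_dict({"text": read_captions(path)})
    ds.push_to_hub(repo_id)
```

Usage would be along the lines of `push_captions("captions.txt", "<user>/<dataset-name>")`.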

# Construct PyTorch Dataset and Dataloader

In [None]:
import datasets as d
captions = d.load_dataset("shaheenahmedc/goal_captions")
len(captions['train'])

In [None]:
from torch.utils.data import Dataset
import transformers as t
import datasets as d


class TrainDataset(Dataset):
    def __init__(self):
        # GPU
        self.tokenizer = t.AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")
        # CPU
#         self.tokenizer = t.AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
        self.tokenizer.pad_token_id = 0
#         self.tokenizer.padding_side = "left"
        self.ds = d.load_dataset("shaheenahmedc/goal_captions")
        self.ds = self.ds["train"]
    #     self.ds = self.ds.map(self.prompt, remove_columns=["instruction", "input", "output"], load_from_cache_file=False, num_proc=8)
        self.ds = self.ds.map(self.tokenize, load_from_cache_file=False, num_proc=8)

    def __len__(self):
        return len(self.ds)

    def __getitem__(self, idx):
#         print (self.ds[idx])
        return self.ds[idx]

    # Unused here: prompt formatting for (instruction, input, output) data.
    # def prompt(self, elm):
    #     TEMPLATE = TEMPLATE_NOT_INPUT if not elm["input"] else TEMPLATE_YES_INPUT
    #     prompt = TEMPLATE.format(instruction=elm["instruction"], input=elm["input"])
    #     prompt = prompt + elm["output"]
    #     return {"prompt": prompt}

    def tokenize(self, caption):
        res = self.tokenizer(caption["text"], add_special_tokens=True)
#         res["input_ids"].append(self.tokenizer.eos_token_id)
#         res["attention_mask"].append(1)
        res["labels"] = res["input_ids"].copy()
        return res

    def max_seq_len(self):
        return max([len(res["input_ids"]) for res in self.ds])

# Training loop

In [None]:
import transformers
# import model
# import data
import os


is_ddp = int(os.environ.get("WORLD_SIZE", 1)) != 1
m = get_model()
ds = TrainDataset()
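# DataCollatorForSeq2Seq pads labels (with -100) as well as input_ids and attention_mask;
# DataCollatorWithPadding would leave the variable-length labels unpadded.
collator = transformers.DataCollatorForSeq2Seq(ds.tokenizer, return_tensors="pt", padding=True)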

print (ds[0])



trainer = transformers.Trainer(
  model=m,
  train_dataset=ds,
  data_collator=collator,
  args=transformers.TrainingArguments(
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=3e-4,
    # GPU: fp16=True; CPU: fp16=False
    fp16=False,
    logging_steps=10,
    optim="adamw_torch",
    evaluation_strategy="no",
    save_strategy="steps",
    eval_steps=None,
    save_steps=200,
    output_dir="./output",
    save_total_limit=3,
    ddp_find_unused_parameters=False if is_ddp else None,
  ),
)


m.config.use_cache = False
trainer.train()
m.save_pretrained("./weights")
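
Each tokenized example carries `labels` of the same length as its `input_ids`, so a batch of captions with different lengths needs all three fields padded to a common length. The sketch below shows what that padding looks like. It is an illustration with a made-up helper, `pad_batch`, not the library's collator. It assumes right padding, pad token id 0 as set in `TrainDataset`, and label padding of -100, the value PyTorch's cross-entropy loss ignores by default.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Right-pad input_ids, attention_mask and labels to the longest example."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 0 in the attention mask tells the model to ignore the padded positions.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        # -100 keeps padded positions out of the loss.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_pad)
    return batch


features = [
    {"input_ids": [1, 5, 6], "attention_mask": [1, 1, 1], "labels": [1, 5, 6]},
    {"input_ids": [1, 7], "attention_mask": [1, 1], "labels": [1, 7]},
]
print(pad_batch(features))
```

The shorter example ends up with `input_ids` `[1, 7, 0]`, mask `[1, 1, 0]` and labels `[1, 7, -100]`, so the padded position contributes nothing to training.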