<a href="https://colab.research.google.com/github/ryderwishart/biblical-machine-learning/blob/main/wip/fine_tune_alpaca.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

In [1]:
!pip install transformers sentencepiece datasets accelerate bitsandbytes

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
from datasets import load_dataset
import torch

# Dataset loading and preprocessing

In [3]:
dataset = load_dataset("tatsu-lab/alpaca")

# Hold out 15% of the data; the remaining 85% is the training set
split_dataset = dataset['train'].train_test_split(test_size=0.15)

# Split the held-out 15% again: 85% of it becomes 'test', 15% becomes 'eval'
split_dataset['test'], split_dataset['eval'] = split_dataset['test'].train_test_split(test_size=0.15).values()


# Print the sizes of the datasets
print(f"Train set size: {len(split_dataset['train'])}")
print(f"Test set size: {len(split_dataset['test'])}")
print(f"Eval set size: {len(split_dataset['eval'])}")



  0%|          | 0/1 [00:00<?, ?it/s]

Train set size: 44201
Test set size: 6630
Eval set size: 1171
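
The printed sizes follow from applying `test_size=0.15` twice. As a quick check, here is a minimal sketch of the arithmetic, assuming the held-out share is rounded up to a whole example (an assumption that matches the sizes printed above). `split_sizes` is a hypothetical helper, not part of `datasets`.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

# Total rows, taken as the sum of the three sizes printed above
total = 44201 + 6630 + 1171

# First split: train vs. held-out; second split: test vs. eval
n_train, n_held_out = split_sizes(total, 0.15)
n_test, n_eval = split_sizes(n_held_out, 0.15)
print(n_train, n_test, n_eval)
```

Note that the 'eval' split ends up at roughly 2% of the data (15% of 15%), which is worth keeping in mind when interpreting evaluation scores.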


In [4]:
split_dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 44201
    })
    test: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 6630
    })
    eval: Dataset({
        features: ['instruction', 'input', 'output', 'text'],
        num_rows: 1171
    })
})

# Choose model and tokenizer

In [5]:
from transformers import LlamaTokenizer, LlamaForCausalLM

tokenizer = LlamaTokenizer.from_pretrained("chainyo/alpaca-lora-7b")
model = LlamaForCausalLM.from_pretrained(
    "chainyo/alpaca-lora-7b",
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)



Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.


Loading checkpoint shards:   0%|          | 0/39 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

In [7]:
def preprocess_data(dataset):
    tokenized_inputs = []
    tokenized_outputs = []
    
    for example in dataset:
        instruction = example["instruction"]
        input_text = example["input"]
        output_text = example["output"]
        
        if input_text:
            input_string = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input_text}

### Response:"""
        else:
            input_string = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""
            
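        # Keep the BOS token on the prompt only; the response gets an explicit EOS
        # so the model can learn where an answer ends.
        tokenized_input = tokenizer.encode(input_string)
        tokenized_output = tokenizer.encode(output_text, add_special_tokens=False) + [tokenizer.eos_token_id]

        tokenized_inputs.append(tokenized_input)
        tokenized_outputs.append(tokenized_output)
        
    return tokenized_inputs, tokenized_outputs

# The 'test' split serves as the validation set during training;
# the 'eval' split is held back for the Evaluation section.
train_inputs, train_outputs = preprocess_data(split_dataset['train'])
val_inputs, val_outputs = preprocess_data(split_dataset['test'])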
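
The two prompt templates are needed again at evaluation time, so it may help to factor them into a function. A minimal sketch: `build_prompt` and the two constant names are hypothetical, and the template text is copied from `preprocess_data` above.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def build_prompt(instruction, input_text=""):
    """Format one example with the same template used for training."""
    if input_text:
        return PROMPT_WITH_INPUT.format(instruction=instruction, input_text=input_text)
    return PROMPT_NO_INPUT.format(instruction=instruction)

print(build_prompt("Name three primary colors."))
```

Using one function for both training and generation keeps the prompt format identical in both places, which matters because the model is sensitive to the exact template.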


In [9]:
import torch
from torch.utils.data import Dataset, DataLoader

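class CustomDataset(Dataset):
    def __init__(self, inputs, outputs):
        self.inputs = inputs
        self.outputs = outputs

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        # Accept token-id lists or (1, seq_len) tensors and flatten both to 1-D
        prompt_ids = torch.as_tensor(self.inputs[idx], dtype=torch.long).flatten()
        answer_ids = torch.as_tensor(self.outputs[idx], dtype=torch.long).flatten()
        input_ids = torch.cat([prompt_ids, answer_ids])
        # Trainer expects a dict; -100 masks the prompt tokens out of the loss
        labels = input_ids.clone()
        labels[: len(prompt_ids)] = -100
        return {"input_ids": input_ids, "attention_mask": torch.ones_like(input_ids), "labels": labels}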

train_dataset = CustomDataset(train_inputs, train_outputs)
val_dataset = CustomDataset(val_inputs, val_outputs)

# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model.to(device)

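from transformers import Trainer, TrainingArguments, default_data_collator

# Note: the model was loaded with load_in_8bit=True. Int8 weights are not trainable
# on their own, so fine-tuning typically needs trainable adapters (e.g. LoRA),
# which this notebook does not set up yet.
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    # Batch size is 1, so examples can be stacked as-is without padding
    data_collator=default_data_collator,
)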

trainer.train()




# Evaluation

Goal: use the 'eval' split to compare each reference answer in the dataset with the answer generated by the fine-tuned model.
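
One way to organise that comparison is to collect (reference, generated) pairs first and score them afterwards. The sketch below is only an assumption about how this could be structured: `collect_eval_pairs` and `generate_fn` are hypothetical names, and `generate_fn` stands in for whatever ends up wrapping the fine-tuned model's generation call.

```python
def collect_eval_pairs(examples, generate_fn):
    """Return a list of (reference_answer, generated_answer) tuples."""
    pairs = []
    for example in examples:
        generated = generate_fn(example["instruction"], example["input"])
        pairs.append((example["output"], generated))
    return pairs

# Stub generator so the sketch runs without a model (hypothetical placeholder)
def echo_generate(instruction, input_text):
    return f"(answer to: {instruction})"

sample = [{"instruction": "Say hi.", "input": "", "output": "Hi!"}]
pairs = collect_eval_pairs(sample, echo_generate)
print(pairs)
```

Keeping generation separate from scoring means the same pairs can be reused for every scoring dimension without running the model again.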

Ryder will help devise answer-quality scoring mechanisms, likely along several dimensions.

Initially, we can compute the cosine distance between the embedding of the generated answer and the embedding of the dataset answer (which, for now, we simply assume is the 'correct' answer).
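
Cosine distance is one minus the cosine similarity of the two embedding vectors. A minimal sketch in plain Python, assuming the embeddings are already available as equal-length sequences of floats. Which embedding model produces them is left open here, and `cosine_distance` is a hypothetical helper name.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(a, b) for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

A distance of 0 means the two embeddings point in the same direction, and larger values mean the answers are less similar in embedding space. Because this measures semantic closeness rather than correctness, it is best treated as a first rough signal.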