# Welcome to the pipeline to fine-tune a basic LLaMA model through SFT and RLHF/RLAIF
- This notebook covers two stages: supervised fine-tuning (of both the base model and the reward model), followed by proximal policy optimization (PPO), which optimizes the fine-tuned model against the reward model

- The notebook was written by Samuel Höglund and Josef Khedri for our bachelor's thesis comparing RLHF and RLAIF
  - For more information about our work, head over to https://huggingface.co/KTH/psychology-alpaca

- The code below uses a fork of a GitHub repository originally created by https://github.com/jackaduma



## Clone repo

In [None]:
!git clone https://github.com/jkhedri/Alpaca-LoRA-RLHF-PyTorch

In [None]:
%cd Alpaca-LoRA-RLHF-PyTorch

In [None]:
!ls

## Install requirements.txt

In [None]:
!pip install -r requirements.txt

## Install the `evaluate` package

In [None]:
!pip install evaluate

## Log in with your Hugging Face token

In [None]:
from huggingface_hub import notebook_login
notebook_login()

## Supervised fine-tuning

Fine-tune the base model

In [None]:
!python supervised_finetune.py --base_model 'decapoda-research/llama-7b-hf' --data_path 'samhog/psychology-6k' --output_dir 'psychology-llama' --num_epochs 3

## Fine-tune the reward model

In [None]:
!python train_reward_model.py --model_name 'decapoda-research/llama-7b-hf' --gradient_accumulation_steps 32 --per_device_train_batch_size 1 --train_subset 1750 --eval_subset 250 --local_rank 0 --bf16 True
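
The batch-related flags in the command above interact, so a quick back-of-the-envelope calculation may help when adjusting them. The sketch below copies the values from the command; `num_devices = 1` is an assumption (a single GPU), and whether a trailing partial batch triggers an optimizer update depends on the trainer, so both bounds are computed.

```python
import math

# Values copied from the train_reward_model.py command above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 32
train_subset = 1750

# Assumption: training runs on a single GPU.
num_devices = 1

# Number of examples that contribute to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Lower bound: only complete accumulated batches count as an update.
full_updates_per_epoch = train_subset // effective_batch_size

# Upper bound: a trailing partial batch also counts as an update.
updates_if_partial_counts = math.ceil(train_subset / effective_batch_size)

print(effective_batch_size, full_updates_per_epoch, updates_if_partial_counts)
# prints: 32 54 55
```

With these settings each update sees 32 examples, so one pass over the 1750-example subset amounts to roughly 54 to 55 updates.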

## Merge adapters

### The merge script requires peft 0.2.0. Run the cell below to switch versions before running the script

In [None]:
# -y skips the confirmation prompt, which would otherwise block the notebook cell
!pip uninstall -y peft
!pip install peft==0.2.0
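
Since this notebook switches peft versions back and forth, it can be worth confirming which version is actually installed before running the merge script. The following is a minimal sketch using only the standard library; `installed_version` and `check_pin` are hypothetical helper names, not part of the repository. Note that it reports what is installed on disk: if peft was already imported in the running kernel, a runtime restart may still be needed for the new version to take effect.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_pin(package, expected):
    """Print and return whether the installed version matches the expected pin."""
    found = installed_version(package)
    if found != expected:
        print(f"{package}: expected {expected}, found {found}")
        return False
    print(f"{package} {found} is installed")
    return True


# The merge step in this notebook expects peft 0.2.0.
check_pin("peft", "0.2.0")
```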

In [None]:
!python merge_peft_adapter.py --model_name "samhog/psychology-llama"

### Install TRL from source

In [None]:
#!pip install trl
!git clone https://github.com/lvwerra/trl.git
%cd trl/
!pip install .

## PPO plug & chug

If you installed peft 0.2.0 for the merge step, switch back to the current version before running PPO.

In [None]:
# -y skips the confirmation prompt, which would otherwise block the notebook cell
!pip uninstall -y peft
!pip install git+https://github.com/huggingface/peft.git

In [None]:
%cd ..

In [None]:
!pip install wandb

## Reminder: replace the Hugging Face repo names in the command below with your own

In [None]:
!python tuning_lm_with_rl.py --model_name 'samhog/psychology-llama-merged' --reward_model_name 'samhog/RLAIF-psychology-alpaca-rm-merged' --log_with='wandb' --adafactor False --tokenizer_name 'decapoda-research/llama-7b-hf' --save_freq 100 --output_max_length 128 --batch_size 8 --gradient_accumulation_steps 8 --batched_gen True --ppo_epochs 1 --seed 0 --learning_rate 1.4e-5 --early_stopping True --output_dir './checkpoints/tuning_llama_rl'
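
For readers new to PPO, the core idea behind the step above is the clipped surrogate objective: the policy update is rewarded for increasing the probability of high-advantage outputs, but only while the probability ratio between the new and old policy stays near 1. The sketch below illustrates that standard formula for a single sample in plain Python. It is not the code that `tuning_lm_with_rl.py` or TRL runs; the function name and the `clip_range=0.2` default are assumptions chosen for illustration.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_range=0.2):
    """Per-sample PPO clipped surrogate objective (to be maximized).

    logp_new / logp_old are log-probabilities of the same action under the
    updated and the pre-update policy; advantage is its estimated advantage.
    """
    # Probability ratio between the new and the old policy.
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio inside [1 - clip_range, 1 + clip_range].
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# Identical policies: the ratio is 1, so the objective equals the advantage.
print(clipped_surrogate(-1.0, -1.0, 2.0))

# Large ratio with positive advantage: the gain is capped at (1 + clip_range).
print(clipped_surrogate(0.0, -1.0, 1.0))
```

The cap removes any incentive to push the ratio far above `1 + clip_range` on a single batch, which is what keeps each update close to the previous policy.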