## AI 322 Reinforcement Learning Mini-Project

### Notebook 02 - Training

Submission by: Rossjyn Fallorina

**Note: All work was done on Google Colab, so all files (data, notebooks, models, etc.) are stored in Google Drive. Some paths in this notebook may not resolve elsewhere, since they point to the Google Drive subdirectories where those files are stored.**

Run the installations below when running on Google Colab.

In [None]:
# !pip install transformers bitsandbytes trl peft
# !pip install git+https://github.com/huggingface/accelerate
# !pip install vllm
# !pip install python-levenshtein
# !pip install tqdm
# !pip install wandb

### Import Libraries

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import math
import time

from tqdm import tqdm
import numpy as np
import pandas as pd

from Levenshtein import distance as levenshtein_distance

import wandb

import torch
from torch.utils.data import DataLoader

from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)
from peft import LoraConfig
from trl import SFTTrainer

from vllm import LLM, SamplingParams

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
from trl.core import respond_to_batch, LengthSampler

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Test vLLM Inference Engine

In [None]:
prompts = [
    "la verti condominium pasay philippines",
    "up diliman quezon",
    "sm megamall mandaluyong"
]

system_message = "Process and clean this address string structure it cleanly. Return the final address string only. Do not add any other information in your response:"
prompts = [f"{system_message} {x}" for x in prompts]

sampling_params = SamplingParams(temperature=0.05, top_p=0.95, max_tokens=100)

In [None]:
loading_start = time.time()
llm = LLM(model="openai-community/gpt2-xl")

print("--- Loading time: %s seconds ---" % (time.time() - loading_start))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


INFO 06-03 05:38:28 config.py:1130] Casting torch.float32 to torch.float16.
INFO 06-03 05:38:28 config.py:1151] Downcasting torch.float32 to torch.float16.
INFO 06-03 05:38:28 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='openai-community/gpt2-large', speculative_config=None, tokenizer='openai-community/gpt2-large', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=openai-community/gpt2-large)
INFO 06-03 05:38:29 weight_utils.py:207] Using model weights format ['*.safetensors']
INFO 06-03 05:38:30 weight_utils.py:250] No model.safeten

In [None]:
generation_time = time.time()
outputs = llm.generate(prompts, sampling_params)
print("--- Generation time: %s seconds ---" % (time.time() - generation_time))

for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)
    print('------')

Processed prompts: 100%|██████████| 3/3 [00:00<00:00,  4.83it/s, Generation Speed: 482.71 toks/s]

--- Generation time: 0.6263687610626221 seconds ---


The following example shows how to use the address string structure to retrieve the address of a condominium unit.

<?php $address = '1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1
------
, up diliman quezon, up diliman quezon, up diliman quezon, up diliman quezon, up diliman quezon, up diliman quezon, up diliman quezon, up diliman quezon, up diliman quezon, up diliman quezon, up diliman quezon, up diliman quezon, up diliman quezon, up diliman quezon, up diliman quezon, up diliman quezon, up diliman
------
.com.

The following example shows how to use the address string structure to retrieve the address of a specific domain.

Example:

$address = "http://www.megamall.com" $address = "http://www.megamall.com" $address = "http://www.megamall.com" $address = "http://www.megamall.com" $address = "http://www.megamall.com
------





### A. PPO Trainer via TRL

In [None]:
# Base model to be fine-tuned: OpenAI GPT2-XL
model_name = "openai-community/gpt2-xl"

#### A-1. Dataset

In [None]:
df_all = pd.read_csv("/content/drive/MyDrive/AIE/AI 322/Mini Project/addresses_all_processed.csv")

df_all = df_all[(df_all["longitude"] != -998)]
df_all = df_all.drop_duplicates(subset='address_raw', keep='first')

df_dataset = df_all[["address_raw", "address_reversed"]]
df_dataset = df_dataset.rename(columns={"address_raw":"prompt", "address_reversed":"correct_response"})
df_dataset['prompt'] = df_dataset['prompt'].apply(lambda x: "Clean this address string to the correct format: " + x)
df_dataset['address_id'] = range(1, len(df_dataset) + 1)
df_dataset = df_dataset[["address_id", "prompt", "correct_response"]].reset_index(drop=True)

In [None]:
df_all

Unnamed: 0,category,address_raw,address_processed,address_reversed,longitude,latitude
0,AI-generated Address,"789 Taft Ave., Barangay Malate, Manila","789 Taft Avenue, Barangay Malate, Manila","Taft Avenue, Barangay 678, Barangay 694, Malat...",120.989656,14.573774
2,AI-generated Address,"Blk 7, Lot 23, Jasmine St., Brgy. San Francisc...","Jasmine Street, San Francisco, General Trias, ...","Jasmine Street, Asian Leaf, San Francisco, Gen...",120.927077,14.297408
3,AI-generated Address,"55 Ipil St., Brgy. Poblacion, Davao City","55 Ipil Street, Barangay Poblacion, Davao City.","Ipil Street, Garden Heights, 19-B Garcia Heigh...",125.606685,7.091886
6,AI-generated Address,"12 J.P. Rizal St., Brgy. Poblacion, Makati City","12 J.P. Rizal Street, Poblacion, Makati City","Jose P. Rizal Avenue, Bel-Air Village Phase I ...",121.031089,14.567564
7,AI-generated Address,"123 Alabang Zapote Road, Alabang, Muntinlupa City","123 Alabang Zapote Road, Alabang, Muntinlupa City","Alabang-Zapote Road, Filinvest City, Muntinlup...",121.042328,14.420601
...,...,...,...,...,...,...
1073,names of residential villages in Laguna Philip...,"Laguna Bel Air, Santa Rosa","Laguna Bel Air, Santa Rosa","Laguna Bel-Air, Pulong Santa Cruz, Santa Rosa,...",121.072548,14.268304
1074,names of residential villages in Laguna Philip...,"San Lorenzo South, Santa Rosa","San Lorenzo South, Santa Rosa","San Lorenzo South Subdivision Phase 1C Annex, ...",121.106103,14.278009
1075,names of residential villages in Laguna Philip...,Dasmarinas Village,Dasmarinas Village,"Dasmariñas Village, District I, Makati, Southe...",121.026198,14.540765
1076,names of residential villages in Laguna Philip...,Montecito Nuvali,Montecito Nuvali,"Montecito Nuvali, Canlubang, Calamba, Laguna, ...",121.066604,14.191009


In [None]:
df_dataset.head()

Unnamed: 0,address_id,prompt,correct_response
0,1,Clean this address string to the correct forma...,"Taft Avenue, Barangay 678, Barangay 694, Malat..."
1,2,Clean this address string to the correct forma...,"Jasmine Street, Asian Leaf, San Francisco, Gen..."
2,3,Clean this address string to the correct forma...,"Ipil Street, Garden Heights, 19-B Garcia Heigh..."
3,4,Clean this address string to the correct forma...,"Jose P. Rizal Avenue, Bel-Air Village Phase I ..."
4,5,Clean this address string to the correct forma...,"Alabang-Zapote Road, Filinvest City, Muntinlup..."


In [None]:
dataset = Dataset.from_pandas(df_dataset)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)
tokenizer.pad_token = tokenizer.eos_token

In [None]:
# Tokenize function
def tokenize_function(examples):
    return tokenizer(examples['prompt'], padding='max_length', truncation=True)

In [None]:
def tokenize(sample):
    # input_size = LengthSampler(25, 40)
    # sample["input_ids"] = tokenizer.encode(sample["prompt"])[: input_size()]
    
    sample["input_ids"] = tokenizer.encode(sample["prompt"])
    sample["query"] = tokenizer.decode(sample["input_ids"])
    return sample

In [None]:
# Tokenize the dataset
tokenized_dataset = dataset.map(tokenize, batched=False)

tokenized_dataset.set_format(type="torch")

Map:   0%|          | 0/886 [00:00<?, ? examples/s]

In [None]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])
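
As a quick illustration of what `collator` returns, the sketch below applies it to two toy samples. The sample values are made up for illustration; only the function body comes from the cell above.

```python
def collator(data):
    # Turn a list of per-sample dicts into one dict of per-key lists.
    return dict((key, [d[key] for d in data]) for key in data[0])

# Hypothetical samples; the real ones hold token ids of varying length.
samples = [
    {"input_ids": [1, 2, 3], "query": "first prompt"},
    {"input_ids": [4, 5], "query": "second prompt"},
]

batch = collator(samples)
print(batch["input_ids"])
print(batch["query"])
```

Because the values are kept as plain lists rather than stacked into one tensor, queries of different lengths can share a batch without padding.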

#### A-2. Reward Function

In [None]:
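def reward_function(llm_response, correct_response):
    # Character-level edit distance between the model output and the reference address
    lev_dist = levenshtein_distance(llm_response, correct_response)

    # An exact match earns the maximum reward of 1; otherwise the reward is -ln(distance),
    # which is 0 at a distance of 1 and grows more negative as the distance increases
    if lev_dist == 0:
        return 1
    else:
        return -math.log(lev_dist)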
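
To make the reward scale concrete, here is a minimal sketch that evaluates the same rule on a few toy strings. The pure-Python `edit_distance` helper is a stand-in (an assumption) for `Levenshtein.distance` so the snippet runs without extra packages, and the example strings are made up.

```python
import math

def edit_distance(a, b):
    # Stand-in for Levenshtein.distance: dynamic-programming edit distance
    # with unit cost for insertions, deletions, and substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def reward_function(llm_response, correct_response):
    # Same rule as the notebook cell: 1 for an exact match, otherwise -ln(distance).
    lev_dist = edit_distance(llm_response, correct_response)
    if lev_dist == 0:
        return 1
    return -math.log(lev_dist)

reference = "Taft Avenue, Malate, Manila"  # hypothetical reference address
for response in [reference, "Taft Avenue, Malate, Manile", "Taft Ave"]:
    print(repr(response), round(reward_function(response, reference), 3))
```

Note the step in the scale: an exact match scores 1, a single-character error scores -ln(1) = 0, and larger distances decrease the reward only logarithmically.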

#### A-3. Training Proper

In [None]:
device = torch.device("cuda")
device

device(type='cuda')

In [None]:
# Get models
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    pretrained_model_name_or_path=model_name,
    device_map={'': 0},
)

model_ref = create_reference_model(model)

In [None]:
# Define hyperparameters
epochs = 3
lr = 1e-5
batch_size = 8
mini_batch_size = 1

In [None]:
# Initialize trainer
ppo_config = PPOConfig(batch_size=batch_size, learning_rate=lr, mini_batch_size=mini_batch_size, log_with="wandb")

In [None]:
wandb.init(
    project="ai322-llm-geocoding",
    config={
        "model_name": model_name,
        "learning_rate": lr,
        "epochs": epochs,
        "batch_size": batch_size,
        "mini_batch_size": mini_batch_size
    },
)

[34m[1mwandb[0m: Currently logged in as: [33mrcfallorina[0m ([33mrossjyn-org[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [None]:
# Create a PPO Trainer
ppo_trainer = PPOTrainer(ppo_config, model, model_ref, tokenizer, tokenized_dataset, data_collator=collator)

In [None]:
output_min_length = 20
output_max_length = 50
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": 25,
    "top_k": 50,
    "top_p": 0.9,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
}

for epoch in range(epochs):
    print(f"Running epoch #{epoch}")
    for _, batch in tqdm(enumerate(ppo_trainer.dataloader)):
        query_tensors = batch["input_ids"]

        #### Get response from gpt2
        response_tensors = []
        for query in query_tensors:
            gen_len = output_length_sampler()
            generation_kwargs["max_new_tokens"] = gen_len
            response = ppo_trainer.generate(query, **generation_kwargs)
            response_tensors.append(response.squeeze()[-gen_len:])

        batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

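        # Look up each query's reference address in df_dataset and score the response.
        # Iterating over the pairs directly keeps one reward per query, even if a query repeats.
        rewards = []
        for query_text, response_text in zip(batch["query"], batch["response"]):
            is_match = df_dataset["prompt"].str.startswith(query_text)
            correct = df_dataset.loc[is_match, "correct_response"].values[0]
            rewards.append(torch.tensor(reward_function(response_text, correct)))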

        #### Run PPO step
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
        ppo_trainer.log_stats(stats, batch, rewards)
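
The reward step in the loop above pairs each generated response with its reference address by matching the query text against the stored prompts. A minimal sketch of that lookup, using plain tuples instead of the DataFrame; the `find_reference` helper and the toy rows are hypothetical, introduced only for illustration:

```python
def find_reference(query, rows):
    # Return the reference of the first row whose prompt starts with the query text.
    for prompt, correct_response in rows:
        if prompt.startswith(query):
            return correct_response
    raise KeyError(f"no prompt starts with {query!r}")

PREFIX = "Clean this address string to the correct format: "
rows = [
    (PREFIX + "Dasmarinas Village", "reference A"),
    (PREFIX + "Dasmarinas Village Makati", "reference B"),
]

print(find_reference(PREFIX + "Dasmarinas Village Makati", rows))
print(find_reference(PREFIX + "Dasmarinas Village", rows))
```

Because the match is a prefix test and the first hit wins, a query that happens to be a prefix of another prompt is resolved by row order. Comparing for exact equality would remove that ambiguity.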

In [None]:
# Save model and tokenizer
model.save_pretrained('/content/drive/MyDrive/AIE/AI 322/Mini Project/gpt2_xl_ppo_model3')
tokenizer.save_pretrained('/content/drive/MyDrive/AIE/AI 322/Mini Project/gpt2_xl_ppo_token3')

#### A-4. Generate sample response from base model

In [None]:
prompts = ["Clean this address string to the correct format: Unit 402, Tower 1, The Residences at Greenbelt, Legazpi Village, Makati City"]

sampling_params = SamplingParams(
    max_tokens = 100,
    top_k = 50,
    top_p = 0.9,
    temperature=0.05
)

In [None]:
loading_start = time.time()
llm = LLM(model="openai-community/gpt2-xl")
print("--- Loading time: %s seconds ---" % (time.time() - loading_start))

INFO 06-04 03:20:12 config.py:1130] Casting torch.float32 to torch.float16.
INFO 06-04 03:20:12 config.py:1151] Downcasting torch.float32 to torch.float16.
INFO 06-04 03:20:12 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='openai-community/gpt2-xl', speculative_config=None, tokenizer='openai-community/gpt2-xl', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=openai-community/gpt2-xl)


INFO 06-04 03:20:15 weight_utils.py:207] Using model weights format ['*.safetensors']


INFO 06-04 03:20:44 weight_utils.py:250] No model.safetensors.index.json found in remote.
INFO 06-04 03:20:48 model_runner.py:146] Loading model weights took 2.9675 GB
INFO 06-04 03:20:49 gpu_executor.py:83] # GPU blocks: 6975, # CPU blocks: 873
INFO 06-04 03:20:52 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-04 03:20:52 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-04 03:21:04 model_runner.py:924] Graph capturing finished in 12 secs.
--- Loading time: 53.92773199081421 seconds ---


In [None]:
generation_time = time.time()
outputs = llm.generate(prompts, sampling_params)
print("--- Generation time: %s seconds ---" % (time.time() - generation_time))

gpt2_xl_responses = []
for output in outputs:
    generated_text = output.outputs[0].text
    gpt2_xl_responses.append(generated_text)
    print(generated_text)
    print('------')

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.20it/s, Generation Speed: 119.86 toks/s]

--- Generation time: 0.8437895774841309 seconds ---
, Metro Manila, Philippines.

The address string is a string of characters that is used to identify a building or a building's location.

The address string is a string of characters that is used to identify a building or a building's location.

The address string is a string of characters that is used to identify a building or a building's location.

The address string is a string of characters that is used to identify a building or a building's location.


------





#### A-5. Generate sample response from fine-tuned model

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "./gpt2_xl_ppo_model2"
prompt = "Clean this address string: la verti condo taft ave pasay philippines"

model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("./gpt2_large_save_token", use_fast=True)
model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

output = model.generate(max_length=50, **model_inputs)

print(tokenizer.decode(output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Clean this address string: la verti condo taft ave pasay philippines perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv perv


#### A-6. Get benchmark reward for base GPT2-XL model

In [None]:
model_name = '/content/drive/MyDrive/AIE/AI 322/Mini Project/gpt2_xl_ppo_model/'
tokenizer_name = '/content/drive/MyDrive/AIE/AI 322/Mini Project/gpt2_xl_ppo_token/'

model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name, device_map="auto",)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, use_fast=True)

gpt2_xl_ft_responses = []
for prompt in prompts:
    # Tokenize inside the loop so each generation uses its own prompt
    model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
    output = model.generate(max_length=50, **model_inputs)
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    gpt2_xl_ft_responses.append(generated_text)
    print(generated_text)
    print('------')

In [None]:
df_samp = pd.read_csv("/content/drive/MyDrive/AIE/AI 322/Mini Project/addresses_all_processed.csv")

df_samp = df_samp[(df_samp["longitude"] != -998)]
df_samp = df_samp.drop_duplicates(subset='address_raw', keep='first')

df_dataset_samp = df_samp[["address_raw", "address_reversed"]]
df_dataset_samp = df_dataset_samp.rename(columns={"address_raw":"prompt", "address_reversed":"correct_response"})
df_dataset_samp['prompt'] = df_dataset_samp['prompt'].apply(lambda x: "Clean this address string to the correct format: " + x)
df_dataset_samp['address_id'] = range(1, len(df_dataset_samp) + 1)
df_dataset_samp = df_dataset_samp[["address_id", "prompt", "correct_response"]].reset_index(drop=True)

df_dataset_samp = df_dataset_samp.sample(n=30, random_state=1)

In [None]:
df_dataset_samp['base_responses'] = gpt2_xl_responses

In [None]:
df_dataset_samp['base_reward'] = df_dataset_samp.apply(lambda row: reward_function(row['base_responses'], row['correct_response']), axis=1)

In [None]:
df_dataset_samp['base_reward'].mean()

-5.589163815445191
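
One way to read this number: when none of the sampled responses is an exact match, every reward is -ln(d), so the mean reward is the negative log of the geometric-mean edit distance. The sketch below inverts that relation. It assumes there were no exact matches among the 30 samples, which the notebook does not confirm.

```python
import math

mean_reward = -5.589163815445191  # value reported above

# mean(-ln d_i) = -ln(geometric_mean(d)), so geometric_mean(d) = exp(-mean_reward).
# Assumption: no response matched exactly (an exact match would contribute +1 instead).
geo_mean_distance = math.exp(-mean_reward)
print(f"Geometric-mean edit distance: {geo_mean_distance:.1f} characters")
```

Under that assumption, the base model's responses sit a few hundred character edits away from the reference addresses in the geometric-mean sense.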