<a href="https://colab.research.google.com/github/winirrr/Thai-qoute-generator/blob/main/Model_testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q peft accelerate bitsandbytes datasets evaluate

In [None]:
import torch
import gc

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
)
from peft import PeftModel, PeftConfig, AutoPeftModelForCausalLM
from datasets import load_dataset

## Evaluation Metric Setup

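This work uses **perplexity** as the evaluation metric: lower values mean the model assigns higher probability to the evaluation texts.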
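
For reference, the standard definition of perplexity for a tokenized sequence $x = (x_1, \dots, x_N)$ under a causal language model $p_\theta$ is the exponential of the average negative log-likelihood per token. The symbols $N$, $x_i$, and $p_\theta$ are notation introduced here, not taken from the notebook:

$$
\mathrm{PPL}(x) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_{<i}\right) \right)
$$

The function in the next cell approximates this over several texts by averaging each text's mean loss, weighted by its token count, and exponentiating the result.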


In [None]:
def compute_perplexity(model, tokenizer, texts):
    """Return the corpus-level perplexity of `model` over a list of strings."""
    model.eval()  # Put the model in evaluation mode (disables dropout)
    total_loss = 0.0
    total_length = 0

    with torch.no_grad():  # No gradients are needed for evaluation
        for text in texts:
            # Tokenize one text at a time; the result has a batch dimension of 1
            inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

            # Move tensors to the same device as the model
            inputs = {k: v.to(model.device) for k, v in inputs.items()}

            # Passing labels makes the model return its mean cross-entropy loss
            outputs = model(**inputs, labels=inputs["input_ids"])
            loss = outputs.loss.item()

            # Weight this text's mean loss by its number of tokens
            num_tokens = inputs["input_ids"].size(1)
            total_loss += loss * num_tokens
            total_length += num_tokens

    # Average the loss over all tokens, then exponentiate to get perplexity
    average_loss = total_loss / total_length
    perplexity = torch.exp(torch.tensor(average_loss)).item()
    return perplexity
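
The aggregation step in `compute_perplexity` can be checked without a model. The sketch below is a minimal illustration, assuming two hypothetical texts with made-up mean losses and token counts; `corpus_perplexity` is a helper name introduced here and is not part of the notebook.

```python
import math


def corpus_perplexity(mean_losses, token_counts):
    """Combine per-text mean losses into one perplexity, weighted by token count."""
    total_loss = sum(loss * n for loss, n in zip(mean_losses, token_counts))
    total_tokens = sum(token_counts)
    return math.exp(total_loss / total_tokens)


# Hypothetical values: mean loss (in nats) and token count for two texts
losses = [2.0, 4.0]
counts = [30, 10]

# Weighted mean loss is (2.0 * 30 + 4.0 * 10) / 40 = 2.5
ppl = corpus_perplexity(losses, counts)
print(ppl)
```

Weighting by token count means longer texts contribute more to the final score than an unweighted mean of the two losses (3.0 here) would allow.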

In [None]:
dataset = load_dataset("text", data_files="https://raw.githubusercontent.com/winirrr/Thai-qoute-generator/main/small_th_qoute.txt", split="train")
sample = dataset.shuffle(seed=42).select(range(10))
sample

Dataset({
    features: ['text'],
    num_rows: 10
})

In [None]:
prompt = sample['text']
# ['คนเดียวที่คิดจะไป คือ คนที่คิดไปกับเราแหละ!']

In [None]:
prompt

['ถ้าคุณเป็นปลากระพง ฉันจะเป็นน้ำที่เต็มไปด้วยความสดชื่นเพื่อให้คุณรู้สึกสบาย',
 'จะมองอะไรนักหนา เธอมีปัญหาหรือว่ามีใจ',
 'เราเป็นคนตรงๆ นะ ตรงนี้ยังว่าง รอคนข้างๆ มาเติมเต็ม',
 'ก่อนเธอจะติดโควิด อยากให้มาติดความน่ารักเราก่อน',
 'ถ้าคุณเป็นหนังสือเรียน ฉันจะอ่านคุณอย่างคับคั่งทุกวัน',
 'ถ้าเธอเป็นหนัง ฉันจะเป็นคนดูที่ไม่กระพริบตา',
 'แม้กระทั่ง Google ยังหาไม่เจอว่าฉันรักเธอมากแค่ไหน',
 'ที่เราตัวหนัก เพราะเราน่ารักไม่เบาไง',
 'ถ้าความน่ารักเป็นธุรกิจ เธอคงเป็นเศรษฐี',
 'ที่เห็นมาทะเลบ่อยๆ เพราะอยากเป็นไข้ที่ชลฯ อยากเป็น คนที่ใช่']

## Model Setup

*   This work uses **scb10x/typhoon-7b** as the base model.
*   The fine-tuned models were trained with PEFT (parameter-efficient fine-tuning).

https://huggingface.co/scb10x/typhoon-7b

https://huggingface.co/docs/peft/en/index

In [None]:
perplex_score = []

### Base Model

In [None]:
base_model_name = "scb10x/typhoon-7b"

# Download the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    base_model_name,
    padding_side="left"
)
# Reuse the EOS token as the padding token
tokenizer.pad_token = tokenizer.eos_token

In [None]:
# Download base model
base_model_name = "scb10x/typhoon-7b"

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    return_dict=True,
    load_in_4bit=True,
    device_map="auto"
)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
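
The warning above recommends passing a `BitsAndBytesConfig` through `quantization_config`. A minimal sketch of that form follows; `load_4bit_model` is a hypothetical helper name, and the results in this notebook were produced with the deprecated `load_in_4bit=True` argument, not with this helper.

```python
def load_4bit_model(model_name):
    """Load a causal LM in 4-bit using an explicit BitsAndBytesConfig."""
    # Imported inside the function so that defining the helper stays cheap;
    # calling it still requires transformers, bitsandbytes, and a GPU runtime.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Only 4-bit loading is requested; every other option keeps its default
    quant_config = BitsAndBytesConfig(load_in_4bit=True)

    return AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",
    )


# Example call (downloads the model, so it is left commented out):
# base_model = load_4bit_model("scb10x/typhoon-7b")
```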


In [None]:
base_perplex = compute_perplexity(model=base_model,
                                  tokenizer=tokenizer,
                                  texts=prompt)
perplex_score.append(base_perplex)
print(base_perplex)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


57.215476989746094


In [None]:
del base_model
gc.collect()
torch.cuda.empty_cache()

### Fine-tuned model 1

In [None]:
fine_tuned_model_1 = "winirrr/typhoon-7b-quote-demoV00"

ft1_model = AutoModelForCausalLM.from_pretrained(
    fine_tuned_model_1,
    return_dict=True,
    load_in_4bit=True,
    device_map="auto"
)

In [None]:
ft1_perplex = compute_perplexity(model=ft1_model,
                                 tokenizer=tokenizer,
                                 texts=prompt)
perplex_score.append(ft1_perplex)
print(ft1_perplex)

14.378568649291992


In [None]:
del ft1_model
gc.collect()
torch.cuda.empty_cache()

### Fine-tuned model 2

In [None]:
fine_tuned_model_2 = "winirrr/typhoon-7b-quote-demoV01"

ft2_model = AutoModelForCausalLM.from_pretrained(
    fine_tuned_model_2,
    return_dict=True,
    load_in_4bit=True,
    device_map="auto"
)

In [None]:
ft2_perplex = compute_perplexity(model=ft2_model,
                                 tokenizer=tokenizer,
                                 texts=prompt)
perplex_score.append(ft2_perplex)
print(ft2_perplex)

1.5938001871109009


In [None]:
del ft2_model
gc.collect()
torch.cuda.empty_cache()

In [None]:
print(perplex_score)

[57.215476989746094, 14.378568649291992, 1.5938001871109009]
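
Because perplexity is the exponential of the mean per-token loss, the scores above can be converted back to losses and compared against the base model. The sketch below only re-uses the three printed values; the names `mean_loss` and `reduction` are introduced here for illustration.

```python
import math

# Scores printed above, in order: base model, fine-tuned 1, fine-tuned 2
perplex_score = [57.215476989746094, 14.378568649291992, 1.5938001871109009]
names = ["base-model", "ft-1", "ft-2"]

# Perplexity is exp(mean loss), so the mean per-token loss is its natural log
mean_loss = {name: math.log(ppl) for name, ppl in zip(names, perplex_score)}

# Relative perplexity reduction of each model versus the base model
base = perplex_score[0]
reduction = {name: 1 - ppl / base for name, ppl in zip(names, perplex_score)}

for name in names:
    print(f"{name}: mean loss = {mean_loss[name]:.3f} nats, "
          f"reduction vs base = {reduction[name]:.1%}")
```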


## Visualize

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(data=perplex_score,
                  index=['base-model', 'ft-1', 'ft-2'],
                  columns=['perplexity'])

In [None]:
df

|            | perplexity |
|------------|------------|
| base-model | 57.215477  |
| ft-1       | 14.378569  |
| ft-2       | 1.593800   |
