# Fine tuning full model distilgpt2

## Cài đặt các thư viện cần thiết

In [1]:
!pip install -q datasets
!pip install -q transformers
!pip install -q trl
!pip install -q -U bitsandbytes

## Tải lên dữ liệu imdb để thực hiện fine tuning model

In [2]:
from datasets import load_dataset
dataset = load_dataset("imdb", split="train[:1%]")

In [3]:
print(dataset[0])

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

## Hàm loại bỏ xuống dòng cho từng mẫu

In [4]:
def preprocessing(batch):
    batch['text'] = [text.replace('\n', '') for text in batch['text']]
    return batch

dataset = dataset.map(preprocessing, batched=True)

In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM

## Khai báo mô hình, tải lên tokenizer tương ứng

In [6]:
model_name = "distilgpt2" # Tên mô hình tiến hành fine tuning
tokenizer=AutoTokenizer.from_pretrained(model_name) # Tải lên tokenizer tương ứng với mô hình từ Hugging Face
model = AutoModelForCausalLM.from_pretrained(model_name) # Tải lên mô hình distilgpt2, dùng cho việc sinh văn bản (AutoModelForCausalLM)

tokenizer.pad_token=tokenizer.eos_token # Thực hiện tokenizer và pad_token = eos_token (từ cuối cùng trong văn bản)

2025-07-11 08:40:59.588611: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1752223259.611560     221 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1752223259.618309     221 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## Thực hiện tokenizer và đệm

In [7]:
def tokenizing_function(examples):
    tokenized = tokenizer(examples['text'], padding="max_length", truncation=True, max_length=128) # Chuyển văn bản thành các vector input_ids

    tokenized['labels'] = tokenized['input_ids'].copy() # Gán cho label chính là vector đầu vào

    return tokenized # trả về vector 

tokenized_data = dataset.map(tokenizing_function, batched=True)

In [8]:
import os
os.environ["WANDB_DISABLED"] = "true"

## Khai báo các tham số cho quá trình huấn luyện

In [9]:
from transformers import TrainingArguments

training_args = TrainingArguments( # Định nghĩa các tham số trong quá trình huấn luyện
    output_dir="./results", # Thư mục lưu mô hình sau mỗi checkpoint
    eval_strategy="epoch", # Thực hiện đánh giá trên tập validation sau mỗi epoch
    per_device_train_batch_size=4, # batch_size khi huấn luyện
    per_device_eval_batch_size=4, # batch_size khi đánh giá trên tập validation
    num_train_epochs=1, # Số lượng epoch để huấn luyện
    logging_dir="./logs", # Thư mục lưu log Tensorboard để xem lại các biểu đồ trong quá trình train
    logging_steps=10,# Ghi log sau mỗi 10 step huấn luyện (hiển thị loss, learning rate...)
    save_total_limit=1 # Chỉ lưu lại 1 checkpoint tại thời điểm tốt nhất
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


## Chia tập dữ liệu train, validation

In [10]:
train_data = tokenized_data.shuffle().select(range(int(0.8 * len(tokenized_data))))
eval_data = tokenized_data.shuffle().select(range(int(0.8 * len(tokenized_data)), len(tokenized_data)))

## Huấn luyện mô hình

In [11]:
from transformers import Trainer
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_data,
    eval_dataset = eval_data
)

In [12]:
trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Epoch,Training Loss,Validation Loss
1,2.0372,3.946259




TrainOutput(global_step=25, training_loss=2.0487976837158204, metrics={'train_runtime': 7.4906, 'train_samples_per_second': 26.7, 'train_steps_per_second': 3.338, 'total_flos': 6532418764800.0, 'train_loss': 2.0487976837158204, 'epoch': 1.0})

## Lưu lại model

In [13]:
model.save_pretrained("./fine-tuned_model")

## Lưu lại tokenizer

In [14]:
tokenizer.save_pretrained("./fine-tuned_model")

('./fine-tuned_model/tokenizer_config.json',
 './fine-tuned_model/special_tokens_map.json',
 './fine-tuned_model/vocab.json',
 './fine-tuned_model/merges.txt',
 './fine-tuned_model/added_tokens.json',
 './fine-tuned_model/tokenizer.json')

## Thực hiện 1 truy vấn với model fine tuning

In [15]:
prompt = "Titanic"

inputs = tokenizer(prompt,  return_tensors="pt").to("cuda")

output = model.generate(inputs['input_ids'], max_length=128)

print(tokenizer.decode(output[0], skip_special_token=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Titanic.






























































































































# Fine tuning model TinyLlama-1.1B-Chat-v1.0 dùng kỹ thuật QLoRa

In [16]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM, # Tự động tải mô hình dạng sinh văn bản
    AutoTokenizer, # Tự động tải tokenizer tương ứng với mô hình
    BitsAndBytesConfig, # cấu hình cho quantization
    HfArgumentParser, # Chưa được dùng trong code
    TrainingArguments, # Cấu hình cho quá trình huấn luyện (batch_size, epoch,...)
    pipeline, # giúp tạo inference pipeline nhanh (chatbot, text generation...)
    logging, # cấu hình logging khi train model
)
from peft import LoraConfig, PeftModel 
# LoraConfig: cấu hình cho fine-tuning LoRA (số rank, alpha, dropout...)
# PeftModel: kết hợp mô hình gốc với trọng số đã fine-tune bằng LoRA
from trl import SFTTrainer, SFTConfig
# SFTTrainer: lớp trainer hỗ trợ huấn luyện Supervised Fine-Tuning (SFT), tích hợp với LoRA
# SFTConfig: cấu hình cho SFTTrainer (batch size, sequence length, optimizer...)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


# Các tham số và cấu hình cho quá trình train

In [17]:
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0" # Tên mô hình được fine tuning
dataset_name = "mlabonne/guanaco-llama2-1k" # Tên dataset dùng cho quá trình fine tuning
new_model = "TinyLlama-1.1B-Chat-finetune" # Tên mô hình sau khi fine tuning


lora_r = 64 # rank của ma trận low-rank
lora_alpha = 16 # Hệ số scale
lora_dropout = 0.1 # Tỷ lệ dropout để tránh hiện tượng overfitting
use_4bit = True # DÙng quantization 4-bit (tiết kiệm ram tối đa)
bnb_4bit_compute_dtype = "float16" # data_type để tính toán
bnb_4bit_quant_type = "nf4" # nf4 chính xác hơn int4
use_nested_quant = False # nested quant không dùng (ít được hỗ trợ)



output_dir = "./results" # Đường dẫn lưu mô hình sau fine tuning
num_train_epochs = 1 # Số lượng epoch để huấn luyện
fp16 = False # tắt hỗ trợ half-precision
bf16 = False # tắt hỗ trợ half-precision
per_device_train_batch_size = 1 # Batch_size khi huấn luyện
per_device_eval_batch_size = 1 # Batch_size khi validation
gradient_accumulation_steps = 1 # 1, tức không cộng dồn gradient
gradient_checkpointing = True # Tiết kiệm RAM bằng cách recompute forward
max_grad_norm = 0.3 # Gradient clipping để tránh exploding gradient
learning_rate = 2e-4 # Learning rate cơ bản cho LoRA
weight_decay = 0.001 # Để tránh overfitting
optim = "paged_adamw_32bit" # Bộ tối ưu hóa hỗ trợ 4bit training
lr_scheduler_type = "cosine" # Dạng decay learning rate
max_steps = -1 # Sử dụng số epoch thay vì số bước
warmup_ratio = 0.03  # Tăng LR dần từ 0 → LR chính trong 3% bước đầu
group_by_length = True # Nhóm mẫu theo độ dài → giảm padding
save_steps = 0 # Không save giữa chừng
logging_steps = 50 # Mỗi 50 step thì log 1 lần
max_seq_length = None # Sẽ dùng chiều dài tối đa của tokenizer
packing = False # Không gộp nhiều sample thành 1 chuỗi dài
device_map = {"": 0} # Chạy trên GPU 0

# Load lên dataset

In [18]:
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")

# Cấu hình cho quantization -> 4bit

In [19]:
# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load lên model và thực hiện quá trình quantization

In [20]:
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)

model.config.use_cache = False
model.config.pretraining_tp = 1

# Load tokenizer tương ứng với mô hình

In [21]:
# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, model_max_length=2048)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Cấu hình cho phần low rank

In [22]:
# Load LoRA configuration (cấu hình)
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Khai báo các tham số và cấu hình cho quá trình huấn luyện

In [24]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    processing_class=tokenizer,
    args= SFTConfig(
        dataset_text_field="text",
        max_seq_length=max_seq_length,
        output_dir=output_dir,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        optim=optim,
        save_steps=save_steps,
        logging_steps=logging_steps,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        fp16=fp16,
        bf16=bf16,
        max_grad_norm=max_grad_norm,
        max_steps=max_steps,
        warmup_ratio=warmup_ratio,
        group_by_length=group_by_length,
        lr_scheduler_type=lr_scheduler_type,
        report_to="tensorboard"
    )
)

average_tokens_across_devices is set to True but it is invalid when world size is1. Turn it to False automatically.
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


# Huấn luyện mô hình

In [25]:
# Train model
trainer.train()

Step,Training Loss
50,0.9272
100,0.8338
150,0.7891
200,0.7905
250,0.7787
300,0.8209
350,0.8476
400,0.8468
450,0.8441
500,0.8319


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


TrainOutput(global_step=500, training_loss=0.8310468673706055, metrics={'train_runtime': 396.5146, 'train_samples_per_second': 2.522, 'train_steps_per_second': 1.261, 'total_flos': 2676813304922112.0, 'train_loss': 0.8310468673706055})

# Lưu lại mô hình sau khi thực hiện fine tuning

In [26]:
# Save trained model
trainer.model.save_pretrained(new_model)

In [36]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our next model
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] What is a large language model? [/INST] Large language models (LLMs) are models that can understand vast amounts of natural language. They are used for tasks such as machine translation, chatbots, and natural language processing (NLP).

LLMs are trained on vast amounts of text data, including millions of sentences and paragraphs from a wide range of sources. These sources are collected through a variety of techniques, including online search, social media, and human-curated content.

One of the key advantages of LLMs is their ability to understand the structure and relationships between different words and phrases in natural language. This helps them to generate accurate translations, generate natural-sounding responses to questions, and generate meaningful responses to natural language prompts.

LLMs are also highly accurate in some specific areas of language understanding, such as paraphrasing, summarizing, and generating responses to


In [28]:
# Empty VRAM
del model
del pipe
del trainer
import gc
gc.collect()
gc.collect()

0

# Load lại model với fp16 và tokenizer

In [29]:
# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [30]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

# Push lên Hugging Face

In [31]:
from huggingface_hub import login
login("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")  # ⚠️ Không an toàn nếu chia sẻ notebook


In [32]:
# !huggingface-cli login

model.push_to_hub("vuxxxan/TinyLlama-1.1B-Chat-finetune", create_pr=True)

tokenizer.push_to_hub("vuxxxan/TinyLlama-1.1B-Chat-finetune",create_pr=True)


Uploading...:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

Uploading...:   0%|          | 0.00/500k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/vuxxxan/TinyLlama-1.1B-Chat-finetune/commit/a401cf6e71d2f265b0033da1fb9f22f3c662c26d', commit_message='Upload tokenizer', commit_description='', oid='a401cf6e71d2f265b0033da1fb9f22f3c662c26d', pr_url='https://huggingface.co/vuxxxan/TinyLlama-1.1B-Chat-finetune/discussions/2', repo_url=RepoUrl('https://huggingface.co/vuxxxan/TinyLlama-1.1B-Chat-finetune', endpoint='https://huggingface.co', repo_type='model', repo_id='vuxxxan/TinyLlama-1.1B-Chat-finetune'), pr_revision='refs/pr/2', pr_num=2)

In [33]:
from huggingface_hub import HfApi

api = HfApi()

api.upload_folder(
    folder_path="TinyLlama-1.1B-Chat-finetune",  # Thư mục chứa mô hình fine-tune
    repo_id="vuxxxan/TinyLlama-1.1B-Chat-finetune",  # Đúng với repo bạn đã tạo trên HF
    repo_type="model"
)

Uploading...:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/vuxxxan/TinyLlama-1.1B-Chat-finetune/commit/4a38ed27b88991e5fffd8d861351bf481645254d', commit_message='Upload folder using huggingface_hub', commit_description='', oid='4a38ed27b88991e5fffd8d861351bf481645254d', pr_url=None, repo_url=RepoUrl('https://huggingface.co/vuxxxan/TinyLlama-1.1B-Chat-finetune', endpoint='https://huggingface.co', repo_type='model', repo_id='vuxxxan/TinyLlama-1.1B-Chat-finetune'), pr_revision=None, pr_num=None)

In [35]:
model = AutoModelForCausalLM.from_pretrained("vuxxxan/TinyLlama-1.1B-Chat-finetune")
tokenizer = AutoTokenizer.from_pretrained("vuxxxan/TinyLlama-1.1B-Chat-finetune")


config.json:   0%|          | 0.00/674 [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/789 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/36.1M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/951 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/437 [00:00<?, ?B/s]

chat_template.jinja:   0%|          | 0.00/410 [00:00<?, ?B/s]