Nama : Ria Kristi
NIM : 21210013


# 1. BUSINESS UNDERSTANDING

Tujuan Bisnis:
- Menyediakan fondasi awal untuk pengembangan chatbot akademik berbasis LLM melalui proses fine-tuning model menggunakan data dari pedoman akademik Unjaya.
- Meningkatkan aksesibilitas dan pemahaman terhadap informasi akademik di lingkungan Universitas Jenderal Achmad Yani Yogyakarta melalui inovasi teknologi AI.
- Mendukung efisiensi penyampaian informasi dengan mengurangi beban kerja staf akademik dalam menjawab pertanyaan berulang seputar pedoman akademik.

Kriteria Sukses:
- Model hasil fine-tuning mampu menjawab pertanyaan tentang pedoman akademik Unjaya secara akurat dan kontekstual.
- Respons yang diberikan oleh model bersifat relevan, informatif, dan sesuai dengan konteks akademik.
- Evaluasi menggunakan BERTScore menunjukkan skor F1 > 0.8 sebagai indikator kualitas respons yang baik.

## Persiapan

### Import Library

In [2]:
# Cek CUDA aktif 
import torch
print(torch.cuda.is_available())  # True
print(torch.cuda.get_device_name(0))  # "NVIDIA GeForce RTX 4060"

True
NVIDIA GeForce RTX 4060


In [7]:
# Menginstal library
!pip install cikit-learn peft datasets transformers trl accelerate bitsandbytes evaluate wandb -q

ERROR: Could not find a version that satisfies the requirement cikit-learn (from versions: none)
ERROR: No matching distribution found for cikit-learn


In [4]:
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit
import json
import os
import torch
import wandb
from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging
)
from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model
)
from trl import SFTTrainer
import evaluate # Mengimpor library evaluate untuk BERTScore


ModuleNotFoundError: No module named 'sklearn'

In [None]:
# Pengaturan logging
logging.set_verbosity_info()

### Login Huggingface dan Wandb

In [None]:
# Login ke Hugging Face Hub
# Login ke Hugging Face Hub.
from huggingface_hub import login
login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
# Login ke Weights & Biases untuk monitoring pelatihan
import wandb
wandb.login()
run = wandb.init(
    project='Fine-tuning-Mistral-7B-Pedoman-Akademik',
    job_type="training",
    notes="Fine-tuning Mistral 7B Instruct v0.3 untuk chatbot pedoman akademik",
    tags=["mistral", "chatbot", "academic", "qlora"]
)

<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mriakrst[0m ([33mriakrst-universitas-jenderal-achmad-yani-yogyakarta[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


# 2. DATA UNDERSTANDING

In [None]:
# Memuat dataset dari file CSV
df = pd.read_csv('Dataset Pedoman Akademik 2024.csv')
df.head()

Unnamed: 0,Instruction,Response,Sumber
0,Apa itu pedoman akademik di Unjaya?,Pedoman akademik adalah jabaran dari kebijakan...,"Kata Pengantar, Hal. 3 (Pedoman Akademik Unjay..."
1,Apa tujuan dari penyusunan pedoman akademik Un...,Tujuannya adalah menjadi panduan menyeluruh ba...,"Kata Pengantar, Hal. 3 (Pedoman Akademik Unjay..."
2,Apa saja yang dicakup dalam pedoman akademik U...,"Pedoman mencakup kebijakan mutu, visi, misi, t...","Kata Pengantar, Hal. 3 (Pedoman Akademik Unjay..."
3,Siapa yang menyusun pedoman akademik Unjaya 2024?,"Tim penyusun terdiri dari Niko Wahyu Nurcahyo,...","Kata Pengantar, Hal. 4 (Pedoman Akademik Unjay..."
4,Siapa Rektor Universitas Jenderal Achmad Yani ...,Rektor Unjaya adalah Prof. Dr.rer.nat.apt. Tri...,Struktur Organisasi Universitas Jenderal Achma...


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 602 entries, 0 to 601
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Instruction  602 non-null    object
 1   Response     602 non-null    object
 2   Sumber       602 non-null    object
dtypes: object(3)
memory usage: 14.2+ KB


# 3. DATA PREPARATION
## 3.1. Split Train dan Eval Set Secara Adil

In [None]:
# Mengkategorikan panjang 'Instruction' untuk stratifikasi.
# Ini memastikan distribusi panjang pertanyaan yang serupa di set train dan eval.
# 'duplicates='drop'' ditambahkan untuk menangani kasus di mana ada nilai kuartil yang sama
# yang dapat menyebabkan kesalahan.
# Kategorisasi panjang instruksi untuk stratifikasi
df['length_bin'] = pd.qcut(
    df['Instruction'].str.len(),
    q=5,
    labels=False,
    duplicates='drop'
)

# Membagi data menjadi 90% train dan 10% eval menggunakan StratifiedShuffleSplit.
# Stratifikasi dilakukan berdasarkan 'length_bin' agar pembagian lebih representatif.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
for train_idx, eval_idx in splitter.split(df, df['length_bin']):
    df_train = df.iloc[train_idx].reset_index(drop=True)
    df_eval = df.iloc[eval_idx].reset_index(drop=True)

# Menghapus kolom bantu 'length_bin'
df_train = df_train.drop(columns=['length_bin'])
df_eval = df_eval.drop(columns=['length_bin'])

print(f"Jumlah data train: {len(df_train)}")
print(f"Jumlah data eval: {len(df_eval)}")

Jumlah data train: 511
Jumlah data eval: 91


## 3.2. Konversi ke Format Chat (ChatML)

In [None]:
# Fungsi untuk mengonversi DataFrame menjadi format JSONL yang sesuai dengan ChatML.
# Setiap baris akan menjadi entri 'messages' dengan peran 'user' dan 'assistant'.
def to_chatml_format(df, filename):
    """
    Konversi DataFrame ke format ChatML yang kompatibel dengan Mistral Instruct
    """
    chat_data = []

    for _, row in df.iterrows():
        # Format pesan untuk Mistral Instruct
        messages = [
            {
                "role": "user",
                "content": row['Instruction']
            },
            {
                "role": "assistant",
                "content": f"{row['Response']}\n\n(Sumber: {row['Sumber']})"
            }
        ]

        chat_data.append({"messages": messages})

    # Simpan ke file JSONL
    with open(filename, 'w', encoding='utf-8') as f:
        for item in chat_data:
            json.dump(item, f, ensure_ascii=False)
            f.write('\n')

    return chat_data

# Konversi dan simpan data
train_jsonl_path = 'train_chatml.jsonl'
eval_jsonl_path = 'eval_chatml.jsonl'

train_chatml = to_chatml_format(df_train, train_jsonl_path)
eval_chatml = to_chatml_format(df_eval, eval_jsonl_path)

print(f"Train data disimpan di: {train_jsonl_path}")
print(f"Eval data disimpan di: {eval_jsonl_path}")

Train data disimpan di: train_chatml.jsonl
Eval data disimpan di: eval_chatml.jsonl


In [None]:
# Load dataset menggunakan Hugging Face datasets
train_dataset = load_dataset('json', data_files=train_jsonl_path, split='train')
eval_dataset = load_dataset('json', data_files=eval_jsonl_path, split='train')

print(f"Train dataset loaded: {len(train_dataset)} samples")
print(f"Eval dataset loaded: {len(eval_dataset)} samples")

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

Train dataset loaded: 511 samples
Eval dataset loaded: 91 samples


In [None]:
# Contoh format data
print("\nContoh format data ChatML:")
print(json.dumps(train_chatml[0], indent=2, ensure_ascii=False))


Contoh format data ChatML:
{
  "messages": [
    {
      "role": "user",
      "content": "Di mana ketentuan lebih lanjut tentang pelaksanaan wisuda diatur di Unjaya?"
    },
    {
      "role": "assistant",
      "content": "Ketentuan lebih lanjut tentang pelaksanaan wisuda ditetapkan melalui Keputusan Rektor.\n\n(Sumber: Bab X Tugas Akhir, Yudisium, Wisuda, Pemberian Gelar, dan Dokumen Lulusan, Pasal 48 Wisuda Ayat (4), Hal. 39 (Pedoman Akademik Unjaya 2024))"
    }
  ]
}



# 4. MODELING

## 4.1 Konfigurasi Model dan Tokenizer

In [None]:
# Konfigurasi QLoRA (4-bit quantization)
# QLoRA memungkinkan fine-tuning model besar dengan memory GPU terbatas
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # Kuantisasi model ke 4-bit (menghemat 75% memory)
    bnb_4bit_quant_type="nf4",           # Normal Float 4: format kuantisasi yang optimal
    bnb_4bit_compute_dtype=torch.bfloat16, # Tipe data untuk komputasi (lebih stabil dari float16)
    bnb_4bit_use_double_quant=False      # Double quantization off (menghemat memory lebih)
)
# Model configuration
model_name = "mistralai/Mistral-7B-Instruct-v0.3"
new_model_name = "riakrst/mistral-7b-pedoman-akademik-unjaya-v1"

print(f"Loading model: {model_name}")

# Load model dengan quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,      # Terapkan konfigurasi QLoRA
    torch_dtype=torch.bfloat16,          # Tipe data model (lebih stabil dari float16)
    device_map="auto",                   # Otomatis distribusi ke GPU yang tersedia
    trust_remote_code=True               # Izinkan eksekusi kode kustom dari model
)

# Disable cache untuk training (akan diaktifkan kembali untuk inference)
model.config.use_cache = False          # Matikan cache untuk menghemat memory saat training
model.config.pretraining_tp = 1         # Tensor parallelism = 1 (menghindari warning)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Setup tokenizer untuk chat format
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Gunakan end-of-sequence sebagai padding
tokenizer.padding_side = "right"              # Padding di sebelah kanan (standar untuk causal LM)

print(f"Model dan tokenizer berhasil dimuat")
print(f"Vocab size: {tokenizer.vocab_size}")

Loading model: mistralai/Mistral-7B-Instruct-v0.3


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/601 [00:00<?, ?B/s]

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/e0bc86c23ce5aae1db576c8cca6f06f1f73af2db/config.json
Model config MistralConfig {
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": null,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.53.1",
  "use_cache": true,
  "vocab_size": 32768
}



model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/e0bc86c23ce5aae1db576c8cca6f06f1f73af2db/model.safetensors.index.json


Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.55G [00:00<?, ?B/s]

Instantiating MistralForCausalLM model under default dtype torch.bfloat16.
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}

target_dtype {target_dtype} is replaced by `CustomDtype.INT4` for 4-bit BnB quantization


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

All model checkpoint weights were used when initializing MistralForCausalLM.

All the weights of MistralForCausalLM were initialized from the model checkpoint at mistralai/Mistral-7B-Instruct-v0.3.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MistralForCausalLM for predictions without further training.


generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/e0bc86c23ce5aae1db576c8cca6f06f1f73af2db/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}



tokenizer_config.json:   0%|          | 0.00/141k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

loading file tokenizer.model from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/e0bc86c23ce5aae1db576c8cca6f06f1f73af2db/tokenizer.model
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/e0bc86c23ce5aae1db576c8cca6f06f1f73af2db/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/e0bc86c23ce5aae1db576c8cca6f06f1f73af2db/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/e0bc86c23ce5aae1db576c8cca6f06f1f73af2db/tokenizer_config.json
loading file chat_template.jinja from cache at None


Model dan tokenizer berhasil dimuat
Vocab size: 32768


## 4.2 Konfigurasi PEFT (LoRA)

In [None]:
# Persiapan model untuk kbit training
model = prepare_model_for_kbit_training(model)

# Konfigurasi LoRA (Low-Rank Adaptation)
# LoRA menambahkan layer kecil yang dapat dilatih tanpa mengubah model asli
peft_config = LoraConfig(
    lora_alpha=16,                       # Skala untuk bobot LoRA (biasanya 16 atau 32)
    lora_dropout=0.1,                    # Dropout untuk mencegah overfitting
    r=64,                                # Rank matriks LoRA (semakin tinggi = lebih ekspresif)
    bias="none",                         # Tidak melatih bias (menghemat parameter)
    task_type="CAUSAL_LM",              # Tipe tugas: Causal Language Modeling
    target_modules=[                     # Layer yang akan ditambahkan LoRA adapter
        "q_proj", "k_proj", "v_proj", "o_proj"       # attention layers
    ]
)

# Apply LoRA
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 54,525,952 || all params: 7,302,549,504 || trainable%: 0.7467



## 4.3. Training arguments

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    # Basic setup
    output_dir="./results-pedoman-akademik",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,

    # Evaluation & saving
    eval_steps=100,
    save_steps=100,
    save_total_limit=1,  # hemat storage

    bf16=True,

    # Logging
    logging_steps=10,

    # Reproducibility
    seed=42,

    # Hugging face hub
    push_to_hub=True,                          # Upload model ke Hugging Face Hub
    hub_model_id=new_model_name,

    # wandb monitoring
    report_to="wandb" if wandb.run else None,
    run_name=f"mistral-7b-pedoman-{wandb.run.id}" if wandb.run else None,
)

PyTorch: setting up devices


## 4.4 Initialize SFT Trainer

In [None]:
# Cek versi library
import trl
print(f"TRL version: {trl.__version__}")

# Cek dokumentasi SFTTrainer
help(SFTTrainer.__init__)

TRL version: 0.19.1
Help on function __init__ in module trl.trainer.sft_trainer:

__init__(self, model: Union[str, torch.nn.modules.module.Module, transformers.modeling_utils.PreTrainedModel], args: Union[trl.trainer.sft_config.SFTConfig, transformers.training_args.TrainingArguments, NoneType] = None, data_collator: Optional[transformers.data.data_collator.DataCollator] = None, train_dataset: Union[datasets.arrow_dataset.Dataset, datasets.iterable_dataset.IterableDataset, NoneType] = None, eval_dataset: Union[datasets.arrow_dataset.Dataset, dict[str, datasets.arrow_dataset.Dataset], NoneType] = None, processing_class: Union[transformers.tokenization_utils_base.PreTrainedTokenizerBase, transformers.image_processing_utils.BaseImageProcessor, transformers.feature_extraction_utils.FeatureExtractionMixin, transformers.processing_utils.ProcessorMixin, NoneType] = None, compute_loss_func: Optional[Callable] = None, compute_metrics: Optional[Callable[[transformers.trainer_utils.EvalPrediction]

In [None]:
# Jika tidak bisa pakai SFT Trainer
#  from transformers import Trainer

# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=eval_dataset,
#     tokenizer=tokenizer
# )


  trainer = Trainer(
Using auto half precision backend
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [None]:
# Menggunakan SFTConfig untuk parameter khusus SFT
from trl import SFTConfig
sft_config = SFTConfig(
    # Training arguments dasar
    output_dir="./results-pedoman-akademik",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,

    # Evaluation & saving
    eval_steps=100,
    save_steps=100,
    save_total_limit=1,

    # Precision
    bf16=True,

    # Logging
    logging_steps=10,

    # Reproducibility
    seed=42,

    # Hugging face hub
    push_to_hub=True,
    hub_model_id=new_model_name,

    # wandb monitoring
    report_to="wandb" if wandb.run else None,
    run_name=f"mistral-7b-pedoman-{wandb.run.id}" if wandb.run else None,

    # SFT specific parameters
    max_seq_length=500,                      # Maksimal panjang sequence
    neftune_noise_alpha=5,                   # NEFTune noise untuk regularization
    dataset_text_field="text",               # Field yang berisi teks untuk training
)

trainer = SFTTrainer(
    model=model,                               # Parameter model
    args=sft_config,                           # Gunakan SFTConfig untuk parameter lengkap
    train_dataset=train_dataset,               # Dataset training
    eval_dataset=eval_dataset,                 # Dataset evaluasi
    processing_class=tokenizer,                # Gunakan processing_class untuk tokenizer
    peft_config=peft_config,                   # PEFT configuration
)

PyTorch: setting up devices
average_tokens_across_devices is True but world size is 1. Setting it to False automatically.


Tokenizing train dataset:   0%|          | 0/511 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/511 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/91 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/91 [00:00<?, ? examples/s]

Using auto half precision backend
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


## 4.5 Training

In [None]:
print("Memulai fine-tuning...")

# Log hyperparameters ke W&B
if wandb.run:
    wandb.config.update({
        "model_name": model_name,
        "dataset_size": len(train_dataset),
        "eval_size": len(eval_dataset),
        "lora_r": peft_config.r,
        "lora_alpha": peft_config.lora_alpha,
        "learning_rate": training_args.learning_rate,
        "batch_size": training_args.per_device_train_batch_size,
        "epochs": training_args.num_train_epochs,
    })

# Training
trainer.train()

🚀 Memulai fine-tuning...


The following columns in the Training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: messages. If messages are not expected by `PeftModelForCausalLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 511
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 8
  Total optimization steps = 64
  Number of trainable parameters = 54,525,952
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args

Step,Training Loss
10,3.5674
20,3.1646
30,2.9825
40,2.7134
50,2.5671
60,2.6633


  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*args, **kwargs)
  return fn(*a

TrainOutput(global_step=64, training_loss=2.9085480123758316, metrics={'train_runtime': 2280.919, 'train_samples_per_second': 0.224, 'train_steps_per_second': 0.028, 'total_flos': 3188158567366656.0, 'train_loss': 2.9085480123758316})

In [None]:
# Simpan model dan tokenizer
print("\nMenyimpan model...")
trainer.save_model()
tokenizer.save_pretrained(new_model_name)


Menyimpan model...


NameError: name 'trainer' is not defined

In [None]:
# # Push to hub
# if training_args.push_to_hub:
#     trainer.push_to_hub()
#     print(f"Model berhasil diupload ke Hugging Face Hub: {new_model_name}")

Saving model checkpoint to ./results-pedoman-akademik
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/e0bc86c23ce5aae1db576c8cca6f06f1f73af2db/config.json
Model config MistralConfig {
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": null,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.53.1",
  "use_cache": true,
  "vocab_size": 32768
}

chat template saved in ./results-pedoman-akademik/chat_template.jinja
tokenizer config file saved in ./results-pe

Model berhasil diupload ke Hugging Face Hub: riakrst/mistral-7b-pedoman-akademik-unjaya-v1


In [None]:
# PROSES MERGE ADAPTER DENGAN BASE MODEL = FULL MODEL
# belum berhasil

# PENTING: Gunakan model dari trainer secara langsung, karena itu sudah menjadi PeftModel
print("Merging adapter dengan base model dari trainer.model...")
# Pastikan 'model' di sini merujuk pada objek model yang Anda lewati ke SFTTrainer
# SFTTrainer akan membungkus model ini dengan PeftModel.
# Jadi, trainer.model adalah objek yang sudah siap untuk di-merge.
merged_model = trainer.model.merge_and_unload()

# 4. Setup nama untuk full model
full_model_name = f"{new_model_name}-merged"
print(f"Nama full model: {full_model_name}")

# 5. Simpan merged model secara lokal
print("Menyimpan merged model secara lokal...")
merged_model.save_pretrained(full_model_name)
tokenizer.save_pretrained(full_model_name)

# 6. Push ke Hugging Face Hub
print("Pushing full model ke Hugging Face Hub...")
try:
    merged_model.push_to_hub(full_model_name)
    tokenizer.push_to_hub(full_model_name)
    print(f"Full model berhasil diupload ke: https://huggingface.co/{full_model_name}")

    # Jika Anda ingin push adapter juga secara terpisah, pastikan sft_config.push_to_hub = True
    # atau panggil push_to_hub() secara manual pada trainer.model (PeftModel)
    if sft_config.push_to_hub:
        trainer.push_to_hub()
        print(f"LoRA adapter berhasil diupload ke: https://huggingface.co/{new_model_name}")
    else:
        print("LoRA adapter tidak diupload otomatis karena sft_config.push_to_hub diset False.")
        print(f"Untuk push adapter secara manual: trainer.model.push_to_hub('{new_model_name}')")

except Exception as e:
    print(f"Error saat push ke Hugging Face: {e}")
    print("Pastikan Anda sudah login ke Hugging Face:")
    print("   huggingface-cli login")

# 7. Bersihkan memory
# Anda bisa hapus trainer dan model terkait untuk membebaskan VRAM
del trainer
del merged_model
torch.cuda.empty_cache()

print("\n" + "="*60)
print("PROSES MERGE DAN UPLOAD SELESAI!")
print("="*60)
print(f"Full model tersedia di: https://huggingface.co/{full_model_name}")
print(f"LoRA adapter (jika di-push) tersedia di: https://huggingface.co/{new_model_name}")
print("\nSekarang Anda bisa menggunakan full model untuk inference:")
print(f"   model = AutoModelForCausalLM.from_pretrained('{full_model_name}')")


MEMULAI PROSES MERGE ADAPTER DENGAN BASE MODEL
Merging adapter dengan base model dari trainer.model...




OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 2.12 MiB is free. Process 2177 has 14.74 GiB memory in use. Of the allocated memory 14.56 GiB is allocated by PyTorch, and 48.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

# 5. EVALUATION
belum dijalankan, terkendala out of memory

In [None]:
# Setup model untuk inference
model.config.use_cache = True
model.eval()

In [None]:
# Load BERTScore metric
bertscore_metric = evaluate.load("bertscore")

# Sampling data untuk evaluasi
sample_size = min(15, len(eval_dataset))
sample_data = eval_dataset.shuffle(seed=42).select(range(sample_size))

# Extract prompts dan references
prompts = [data["messages"][0]["content"] for data in sample_data]
references = [data["messages"][1]["content"] for data in sample_data]

# Setup text generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
    torch_dtype=torch.bfloat16,
    return_full_text=False
)

# Generate responses
print(f"Generating responses untuk {sample_size} sampel...")
generated_texts = []

for i, prompt in enumerate(prompts):
    if i % 10 == 0:
        print(f"Progress: {i}/{len(prompts)}")

    try:
        messages = [{"role": "user", "content": prompt}]

        output = pipe(
            messages,
            max_new_tokens=256,                    # Kurangi token untuk respons lebih fokus
            do_sample=True,
            temperature=0.7,                       # Suhu untuk kreativitas respons
            top_p=0.9,                            # Nucleus sampling (lebih konservatif)
            pad_token_id=tokenizer.eos_token_id
        )

        generated_text = output[0]['generated_text'].strip()
        generated_texts.append(generated_text)

    except Exception as e:
        print(f"Error pada sampel {i}: {e}")
        generated_texts.append("Maaf, terjadi error dalam menghasilkan respons.")


In [None]:
# Hitung BERTScore
print("\n📊 Menghitung BERTScore...")
try:
    bert_results = bertscore_metric.compute(
        predictions=generated_texts,
        references=references,
        lang="id"
    )

    bert_f1_mean = np.mean(bert_results['f1'])
    bert_precision_mean = np.mean(bert_results['precision'])
    bert_recall_mean = np.mean(bert_results['recall'])

    print(f"✓ BERTScore Results:")
    print(f"  - F1 Score: {bert_f1_mean:.4f}")
    print(f"  - Precision: {bert_precision_mean:.4f}")
    print(f"  - Recall: {bert_recall_mean:.4f}")

    # Log to W&B
    if wandb.run:
        wandb.log({
            "eval/bertscore_f1": bert_f1_mean,
            "eval/bertscore_precision": bert_precision_mean,
            "eval/bertscore_recall": bert_recall_mean
        })

except Exception as e:
    print(f"Error dalam evaluasi BERTScore: {e}")
    bert_f1_mean = 0.0

# 6. SIMPLE INFERENCE
belum jadi, masih berantakan

In [None]:
# Contoh inference
test_questions = [
    "Bagaimana cara mengajukan cuti akademik?",
    "Berapa lama masa studi maksimal untuk S1?",
]

print("\nContoh Inference:")
print("=" * 50)

for i, question in enumerate(test_questions, 1):
    try:
        messages = [{"role": "user", "content": question}]

        result = pipe(
            messages,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.7,
            top_p=0.95,
            pad_token_id=tokenizer.eos_token_id
        )

        response = result[0]['generated_text'].strip()

        print(f"\n{i}. Pertanyaan: {question}")
        print(f"   Jawaban: {response}")
        print("-" * 50)

    except Exception as e:
        print(f"Error pada pertanyaan {i}: {e}")



Contoh Inference:
Error pada pertanyaan 1: name 'pipe' is not defined
Error pada pertanyaan 2: name 'pipe' is not defined


In [None]:
# import torch
# from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
# from peft import PeftModel

# def load_finetuned_model():
#     """Load model yang sudah di-fine-tune dari Hugging Face Hub"""

#     model_id = "riakrst/mistral-7b-pedoman-akademik-unjaya-v1"

#     print(f"Loading model: {model_id}")

#     # Load tokenizer
#     tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

#     # Setup tokenizer
#     if tokenizer.pad_token is None:
#         tokenizer.pad_token = tokenizer.eos_token
#     tokenizer.padding_side = "left"  # Untuk inference, padding di kiri lebih baik

#     # Load model
#     model = AutoModelForCausalLM.from_pretrained(
#         model_id,
#         torch_dtype=torch.bfloat16,
#         device_map="auto",
#         trust_remote_code=True
#     )

#     # Enable cache untuk inference
#     model.config.use_cache = True

#     print("Model berhasil dimuat!")
#     return model, tokenizer

# def format_chat_prompt(user_message, system_prompt=None):
#     """Format prompt sesuai dengan format chat yang digunakan saat training"""

#     if system_prompt is None:
#         system_prompt = "Anda adalah asisten AI yang membantu menjawab pertanyaan tentang pedoman akademik Universitas Jenderal Achmad Yani (Unjaya). Berikan jawaban yang akurat, jelas, dan sesuai dengan kebijakan yang berlaku."

#     # Format untuk Mistral-7B-Instruct dengan chat format
#     # Sesuai dengan format dataset training: messages dengan role user dan assistant
#     messages = [
#         {"role": "system", "content": system_prompt},
#         {"role": "user", "content": user_message}
#     ]

#     # Apply chat template (Mistral format)
#     prompt = tokenizer.apply_chat_template(
#         messages,
#         tokenize=False,
#         add_generation_prompt=True
#     )

#     return prompt

# def generate_response(question, model, tokenizer, max_new_tokens=300, temperature=0.7, top_p=0.95):
#     """Generate response dari model untuk pertanyaan yang diberikan"""

#     try:
#         # Format prompt dengan chat template
#         prompt = format_chat_prompt(question)

#         # Tokenize input
#         inputs = tokenizer(
#             prompt,
#             return_tensors="pt",
#             truncation=True,
#             max_length=512,
#             padding=True
#         ).to(model.device)

#         # Generate response
#         with torch.no_grad():
#             outputs = model.generate(
#                 **inputs,
#                 max_new_tokens=max_new_tokens,
#                 do_sample=True,
#                 temperature=temperature,
#                 top_p=top_p,
#                 pad_token_id=tokenizer.eos_token_id,
#                 eos_token_id=tokenizer.eos_token_id,
#                 repetition_penalty=1.1,
#                 num_return_sequences=1,
#             )

#         # Decode response
#         input_length = inputs['input_ids'].shape[1]
#         generated_tokens = outputs[0][input_length:]
#         response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

#         return response.strip()

#     except Exception as e:
#         return f"Error: {str(e)}"

# def generate_response_with_pipeline(question, pipe, max_new_tokens=300, temperature=0.7, top_p=0.95):
#     """Alternative: Generate response menggunakan pipeline"""

#     try:
#         # Format prompt dengan chat template
#         messages = [
#             {"role": "user", "content": question}
#         ]

#         # Generate response
#         result = pipe(
#             messages,
#             max_new_tokens=max_new_tokens,
#             do_sample=True,
#             temperature=temperature,
#             top_p=top_p,
#             pad_token_id=pipe.tokenizer.eos_token_id,
#             eos_token_id=pipe.tokenizer.eos_token_id,
#             repetition_penalty=1.1,
#         )

#         # Extract response
#         if isinstance(result, list) and len(result) > 0:
#             # Ambil bagian assistant dari generated text
#             generated_text = result[0]['generated_text']

#             # Cari respons assistant (setelah role assistant)
#             if isinstance(generated_text, list):
#                 # Cari message dengan role assistant
#                 for msg in generated_text:
#                     if msg.get('role') == 'assistant':
#                         return msg.get('content', '').strip()
#             else:
#                 # Jika string, parsing manual
#                 if 'assistant' in generated_text:
#                     response = generated_text.split('assistant')[-1].strip()
#                     return response
#                 else:
#                     return generated_text.strip()

#         return "Maaf, tidak dapat menghasilkan jawaban."

#     except Exception as e:
#         return f"Error: {str(e)}"

# # Load model dan tokenizer
# model, tokenizer = load_finetuned_model()

# # Setup pipeline (alternatif)
# pipe = pipeline(
#     "text-generation",
#     model=model,
#     tokenizer=tokenizer,
#     torch_dtype=torch.bfloat16,
#     device_map="auto",
# )

# # Test questions sesuai dengan domain pedoman akademik Unjaya
# test_questions = [
#     "Bagaimana cara mengajukan cuti akademik?",
#     "Berapa lama masa studi maksimal untuk S1?",
#     "Bagaimana sistem penilaian di Unjaya?",
# ]

# print("\n" + "="*70)
# print("INFERENCE MODEL MISTRAL-7B PEDOMAN AKADEMIK UNJAYA")
# print("="*70)

# # Method 1: Direct generation
# print("\n[METHOD 1: Direct Generation]")
# print("-" * 50)

# for i, question in enumerate(test_questions[:4], 1):  # Test 4 pertanyaan pertama
#     print(f"\n{i}. Pertanyaan: {question}")
#     print("   Memproses...")

#     response = generate_response(
#         question,
#         model,
#         tokenizer,
#         max_new_tokens=300,
#         temperature=0.7,
#         top_p=0.95
#     )

#     print(f"   Jawaban: {response}")
#     print("-" * 50)

# # Method 2: Pipeline (alternatif)
# print("\n[METHOD 2: Pipeline Generation]")
# print("-" * 50)

# for i, question in enumerate(test_questions[4:6], 5):  # Test 2 pertanyaan berikutnya
#     print(f"\n{i}. Pertanyaan: {question}")
#     print("   Memproses...")

#     response = generate_response_with_pipeline(
#         question,
#         pipe,
#         max_new_tokens=300,
#         temperature=0.7,
#         top_p=0.95
#     )

#     print(f"   Jawaban: {response}")
#     print("-" * 50)

# # Interactive inference function
# def interactive_inference():
#     """Mode interactive untuk testing"""

#     print("\n" + "="*50)
#     print("MODE INTERACTIVE INFERENCE")
#     print("="*50)
#     print("Ketik 'exit' untuk keluar")
#     print("Ketik 'help' untuk bantuan")

#     while True:
#         question = input("\nPertanyaan: ")

#         if question.lower() in ['exit', 'quit', 'keluar']:
#             print("Selesai!")
#             break

#         if question.lower() == 'help':
#             print("\nContoh pertanyaan:")
#             print("- Bagaimana cara mengajukan cuti akademik?")
#             print("- Berapa lama masa studi maksimal untuk S1?")
#             print("- Apa saja persyaratan untuk mengajukan proposal skripsi?")
#             continue

#         if question.strip() == "":
#             continue

#         print("Memproses...")
#         response = generate_response(question, model, tokenizer)
#         print(f"Jawaban: {response}")

# # Function untuk batch inference
# def batch_inference(questions_list):
#     """Batch inference untuk multiple questions"""

#     results = []

#     print(f"\nMemproses {len(questions_list)} pertanyaan...")

#     for i, question in enumerate(questions_list, 1):
#         print(f"[{i}/{len(questions_list)}] {question}")

#         response = generate_response(question, model, tokenizer)
#         results.append({
#             'question': question,
#             'answer': response
#         })

#         print(f"✓ Selesai")

#     return results

# print("\n" + "="*70)
# print("INFERENCE SELESAI!")
# print("="*70)
# print("\nUntuk menggunakan mode interactive, jalankan:")
# print("interactive_inference()")
# print("\nUntuk batch inference, jalankan:")
# print("batch_inference([list_of_questions])")

loading file tokenizer.model
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file chat_template.jinja


Loading model: riakrst/mistral-7b-pedoman-akademik-unjaya-v1


Model config MistralConfig {
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": null,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-06,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.53.1",
  "use_cache": true,
  "vocab_size": 32000
}



OSError: Error no file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory riakrst/mistral-7b-pedoman-akademik-unjaya-v1.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel, PeftConfig # Import PeftModel and PeftConfig

# Load model dari Hugging Face (ini sekarang akan menjadi adapter ID)
adapter_id = "riakrst/mistral-7b-pedoman-akademik-unjaya-v1"

print(f"Loading adapter: {adapter_id}")

# 1. Load the base model first
# You need to know the original base model ID you used for fine-tuning.
# Assuming it was "mistralai/Mistral-7B-Instruct-v0.2" or similar.
# Make sure to replace this with your actual base model ID.
base_model_id = "mistralai/Mistral-7B-Instruct-v0.3" # <--- **REPLACE THIS WITH YOUR ACTUAL BASE MODEL ID**

print(f"Loading base model: {base_model_id}")
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# 2. Load the tokenizer from the adapter ID (it should be the same as the base model's tokenizer)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)


# 3. Load the PEFT adapter weights on top of the base model
print(f"Loading PEFT adapter: {adapter_id}")
model = PeftModel.from_pretrained(model, adapter_id)

# Optional: Merge the LoRA weights into the base model for faster inference if you don't plan further training
# model = model.merge_and_unload() # Uncomment this line if you want to merge for faster inference

# Setup tokenizer
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("Model dan adapter berhasil dimuat!")

def ask_question(question):
    """Fungsi sederhana untuk bertanya ke model"""

    # Format chat sesuai dengan training data
    messages = [
        {"role": "user", "content": question}
    ]

    # Apply chat template
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Tokenize
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=250,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode response (hanya bagian yang di-generate)
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response.strip()

# Test beberapa pertanyaan
test_questions = [
    "Bagaimana cara mengajukan cuti akademik?",
    "Berapa lama masa studi maksimal untuk S1?",
]

print("\n" + "="*50)
print("TEST INFERENCE MODEL")
print("="*50)

for i, question in enumerate(test_questions, 1):
    print(f"\n{i}. Pertanyaan: {question}")
    answer = ask_question(question)
    print(f"   Jawaban: {answer}")
    print("-" * 50)

# Untuk pertanyaan interaktif
print("\nUntuk bertanya secara interaktif:")
print("answer = ask_question('Pertanyaan Anda di sini')")
print("print(answer)")

Loading adapter: riakrst/mistral-7b-pedoman-akademik-unjaya-v1
Loading base model: mistralai/Mistral-7B-Instruct-v0.3


loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/e0bc86c23ce5aae1db576c8cca6f06f1f73af2db/config.json
Model config MistralConfig {
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": null,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.53.1",
  "use_cache": true,
  "vocab_size": 32768
}

loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/e0bc86c23ce5aae1db576c8cca6f06f

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

All model checkpoint weights were used when initializing MistralForCausalLM.

All the weights of MistralForCausalLM were initialized from the model checkpoint at mistralai/Mistral-7B-Instruct-v0.3.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MistralForCausalLM for predictions without further training.
loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/e0bc86c23ce5aae1db576c8cca6f06f1f73af2db/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}

loading file tokenizer.model
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file chat_template.jinja


Loading PEFT adapter: riakrst/mistral-7b-pedoman-akademik-unjaya-v1




KeyError: 'base_model.model.model.model.embed_tokens'

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel, PeftConfig # Import PeftModel and PeftConfig

# Load model dari Hugging Face (ini sekarang akan menjadi adapter ID)
adapter_id = "riakrst/mistral-7b-pedoman-akademik-unjaya-v1"

print(f"Loading adapter: {adapter_id}")

# 1. Load the base model first
# You need to know the original base model ID you used for fine-tuning.
# Assuming it was "mistralai/Mistral-7B-Instruct-v0.2" or similar.
# Make sure to replace this with your actual base model ID.
base_model_id = "mistralai/Mistral-7B-Instruct-v0.3" # <--- **REPLACE THIS WITH YOUR ACTUAL BASE MODEL ID**

print(f"Loading base model: {base_model_id}")
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# 2. Load the tokenizer from the adapter ID (it should be the same as the base model's tokenizer)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)


# 3. Load the PEFT adapter weights on top of the base model
print(f"Loading PEFT adapter: {adapter_id}")
# This step failed in the previous attempt. If it fails again, please check
# the compatibility of your transformers and peft library versions.
model = PeftModel.from_pretrained(model, adapter_id)

# Optional: Merge the LoRA weights into the base model for faster inference if you don't plan further training
# model = model.merge_and_unload() # Uncomment this line if you want to merge for faster inference

# Setup tokenizer
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("Model dan adapter berhasil dimuat!")

def ask_question(question):
    """Fungsi sederhana untuk bertanya ke model"""

    # Format chat sesuai dengan training data
    messages = [
        {"role": "user", "content": question}
    ]

    # Apply chat template
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Tokenize
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=250,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode response (hanya bagian yang di-generate)
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response.strip()

# Test beberapa pertanyaan
test_questions = [
    "Bagaimana cara mengajukan cuti akademik?",
    "Berapa lama masa studi maksimal untuk S1?",
]

print("\n" + "="*50)
print("TEST INFERENCE MODEL")
print("="*50)

for i, question in enumerate(test_questions, 1):
    print(f"\n{i}. Pertanyaan: {question}")
    answer = ask_question(question)
    print(f"   Jawaban: {answer}")
    print("-" * 50)

# Untuk pertanyaan interaktif
print("\nUntuk bertanya secara interaktif:")
print("answer = ask_question('Pertanyaan Anda di sini')")
print("print(answer)")

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/e0bc86c23ce5aae1db576c8cca6f06f1f73af2db/config.json
Model config MistralConfig {
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": null,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.53.1",
  "use_cache": true,
  "vocab_size": 32768
}

loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/e0bc86c23ce5aae1db576c8cca6f06f

Loading adapter: riakrst/mistral-7b-pedoman-akademik-unjaya-v1
Loading base model: mistralai/Mistral-7B-Instruct-v0.3




Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

All model checkpoint weights were used when initializing MistralForCausalLM.

All the weights of MistralForCausalLM were initialized from the model checkpoint at mistralai/Mistral-7B-Instruct-v0.3.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MistralForCausalLM for predictions without further training.
loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.3/snapshots/e0bc86c23ce5aae1db576c8cca6f06f1f73af2db/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}

loading file tokenizer.model
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading file chat_template.jinja


Loading PEFT adapter: riakrst/mistral-7b-pedoman-akademik-unjaya-v1




KeyError: 'base_model.model.model.model.embed_tokens'

In [None]:
# Cleanup
if wandb.run:
    wandb.finish()

In [None]:
# Set logging back to normal
logging.set_verbosity(logging.WARNING)

### Addressing OutOfMemoryError During Model Merging

You encountered an `OutOfMemoryError` when attempting to merge the LoRA adapter with the base model in the previous step. This error indicates that the process requires more GPU memory than is available in your current environment. Merging a large language model like Mistral-7B (even with a LoRA adapter) is a memory-intensive operation as it involves loading the full base model weights into memory along with the adapter weights to perform the mathematical operation of combining them.

**Possible Causes of OOM During Merge:**

*   **Insufficient GPU Memory:** The most common reason. Your GPU simply does not have enough VRAM to hold the base model and adapter simultaneously for the merge operation.
*   **Memory Fragmentation:** Even if the total free memory seems sufficient, it might be fragmented, preventing the allocation of a large contiguous block required for the merge.
*   **Background Processes:** Other processes running on the GPU might be consuming memory.

**Why is Merging Desirable?**

Merging the adapter into the base model results in a single, standard model file that can be loaded directly using `AutoModelForCausalLM.from_pretrained`. This often leads to slightly faster inference compared to loading the base model and adapter separately, and it's a more portable format.

**Alternatives When Merging Fails Due to OOM:**

If you cannot perform the merge due to memory constraints, you have a few options to proceed with inference:

1.  **Inference with Base Model + Adapter (Recommended in this scenario):** This is a standard approach with PEFT. You load the original base model (often in 4-bit or 8-bit quantization to save memory) and then load the PEFT adapter weights on top of it using `PeftModel.from_pretrained(base_model, adapter_id)`. The inference is then performed using this `PeftModel` object. This method is less memory-intensive than merging because the full weights are not explicitly combined into a new, larger model in memory. (Note: You previously encountered a `KeyError` when attempting this method in earlier cells. This suggests there might be a separate compatibility issue with loading the adapter onto the base model in your environment that needs to be resolved first, potentially by adjusting library versions.)

2.  **Merge on Different Hardware:** If you have access to a system with more GPU memory, you could perform the merge there. Once merged, you can upload the full merged model to the Hugging Face Hub and then load that merged model directly on your current Colab instance for inference.

3.  **CPU Merging (Potentially Very Slow):** In some cases, you might be able to perform the merge on the CPU by setting `device_map='cpu'` when loading the base model for merging. However, merging a 7B model on a CPU will be extremely slow and might still require significant RAM.

**Conclusion:**

Given the `OutOfMemoryError`, merging the model on your current Colab GPU is likely not possible. The recommended path forward is to perform inference using the base model with the adapter loaded separately (Alternative 1). However, you need to address the `KeyError` you encountered when trying that method first. Please focus on resolving that `KeyError` by verifying library compatibility as discussed in the previous turn.