# Kanana RAG Fine-tuning Notebook

This notebook fine-tunes the Kanana 8B instruct model on RAG tasks using the Jecheon tourism dataset.

**Model:** kakaocorp/kanana-1.5-8b-instruct-2505

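**Training Data Format:**
```
[Instruction]
당신은 제천시 관광 안내 전문가입니다.
제공된 여러 문서 중에서 질문과 관련된 문서를 찾아, 그 문서의 내용을 바탕으로 정확하고 친절하게 답변해주세요.

Information:
{content1}

Information:
{content2}

Question: {question}

Answer: {answer}
```

`[Instruction]` marks the section and is not emitted literally. The Korean instruction reads: "You are a tourism guide expert for Jecheon City. Among the provided documents, find the one relevant to the question and answer accurately and kindly based on its content."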
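
As a minimal sketch of how one record maps onto this layout: the helper name `build_training_text`, the sample record, and the `</s>` end-of-sequence placeholder are all made up for illustration, and the notebook's own formatting cells remain authoritative.

```python
def build_training_text(instruction, documents, question, answer, eos_token="</s>"):
    """Assemble one training string in the prompt layout described above.

    `eos_token` is a placeholder; the real value comes from the tokenizer.
    """
    # One "Information:" section per document, separated by blank lines
    info_sections = [f"Information:\n{doc['content']}" for doc in documents]
    prompt = instruction + "\n\n" + "\n\n".join(info_sections)
    prompt += f"\n\nQuestion: {question}"
    # The answer follows an "Answer:" cue and ends with the EOS token
    return f"{prompt}\n\nAnswer: {answer}{eos_token}"


sample_text = build_training_text(
    instruction="You are a Jecheon tourism guide.",
    documents=[{"content": "Doc A"}, {"content": "Doc B"}],
    question="Q?",
    answer="A.",
)
print(sample_text)
```

At inference time the same prompt would stop right after `Answer:` so the model continues from the cue it saw during training.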

In [None]:
# Install required packages
!pip install transformers peft datasets wandb bitsandbytes accelerate bert-score

In [None]:
# Check CUDA availability
import os
import torch

os.environ["NVIDIA_VISIBLE_DEVICES"] = "0"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA version: {torch.version.cuda}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

In [None]:
# Login to Weights & Biases
import wandb

# Login to wandb (you'll be prompted for your API key)
wandb.login()

# Initialize wandb project
wandb.init(
    project="kanana-rag-finetuning",
    name="kanana-1.5-8b-instruct-rag",
    config={
        "model": "kakaocorp/kanana-1.5-8b-instruct-2505",
        "task": "RAG fine-tuning",
        "dataset": "Jecheon Tourism"
    }
)

In [None]:
# Load base model and tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "kakaocorp/kanana-1.5-8b-instruct-2505"

print(f"Loading model: {model_name}")
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

print("Model loaded successfully!")

In [None]:
# Load RAG training data
import json
from datasets import Dataset

data_path = "/home/user/goodganglabs/data/processed/training_data.jsonl"

# Load JSONL data
data_list = []
with open(data_path, 'r', encoding='utf-8') as f:
    for line in f:
        data_list.append(json.loads(line))

print(f"Loaded {len(data_list)} training examples")
print("\nFirst example:")
print(json.dumps(data_list[0], ensure_ascii=False, indent=2)[:500])
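
The formatting step that follows reads `documents` (a list of objects with a `content` field), `question`, and `answer` from every record. A hedged sketch of a schema check that could run right after loading is shown here; `find_invalid_records` and the toy records are hypothetical and not part of the project code.

```python
REQUIRED_KEYS = ("documents", "question", "answer")


def find_invalid_records(records):
    """Return (index, reason) pairs for records that do not fit the expected schema."""
    problems = []
    for i, record in enumerate(records):
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((i, f"missing keys: {missing}"))
            continue
        docs = record["documents"]
        if not isinstance(docs, list) or not docs:
            problems.append((i, "documents must be a non-empty list"))
        elif any(not isinstance(d, dict) or "content" not in d for d in docs):
            problems.append((i, "every document needs a 'content' field"))
    return problems


# Toy records for illustration only
example_records = [
    {"documents": [{"content": "text"}], "question": "q", "answer": "a"},
    {"documents": [], "question": "q", "answer": "a"},
    {"question": "q", "answer": "a"},
]
issues = find_invalid_records(example_records)
print(issues)
```

In the notebook this would be called as `find_invalid_records(data_list)`, and an empty result means every record can be formatted.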

In [None]:
# Convert to requested RAG format with instruction
def format_rag_data(example):
    """Convert RAG data to the requested format with instruction:
    [Instruction]
    당신은 제천시 관광 안내 전문가입니다...
    
    Information:
    {content1}
    Information:
    {content2}
    Question: {question}
    """
    instruction = """당신은 제천시 관광 안내 전문가입니다.
제공된 여러 문서 중에서 질문과 관련된 문서를 찾아, 그 문서의 내용을 바탕으로 정확하고 친절하게 답변해주세요.

답변 시 주의사항:
1. 관련 문서의 내용만을 바탕으로 답변하세요
2. 문서에 정보가 없으면 "제공된 정보에는 해당 내용이 없습니다"라고 답변하세요
3. 추측하거나 문서 외부 지식을 사용하지 마세요
4. 간결하고 이해하기 쉽게 답변하세요"""
    
    documents = example['documents']
    question = example['question']
    answer = example['answer']
    
    # Build information sections
    info_sections = []
    for doc in documents:
        info_sections.append(f"Information:\n{doc['content']}")
    
    # Combine: instruction + information sections + question
    prompt = instruction + "\n\n"
    prompt += "\n\n".join(info_sections)
    prompt += f"\n\nQuestion: {question}"
    
    return {
        "prompt": prompt,
        "answer": answer
    }

# Apply formatting
formatted_data = []
for example in data_list:
    formatted_data.append(format_rag_data(example))

print(f"Formatted {len(formatted_data)} examples")
print("\nExample formatted prompt:")
print(formatted_data[0]['prompt'][:500])
print(f"\nAnswer: {formatted_data[0]['answer']}")

In [None]:
# Create training dataset with proper format
def formatting_prompts_func(examples):
    """Format prompts for training"""
    prompts = examples["prompt"]
    answers = examples["answer"]
    
    EOS_TOKEN = tokenizer.eos_token
    
    texts = []
    for prompt, answer in zip(prompts, answers):
        # Combine prompt and answer with EOS token
        text = f"{prompt}\n\nAnswer: {answer}{EOS_TOKEN}"
        texts.append(text)
    
    return {"text": texts}

# Create dataset
dataset = Dataset.from_list(formatted_data)
dataset = dataset.map(formatting_prompts_func, batched=True)

print(f"Dataset size: {len(dataset)}")
print(f"\nDataset features: {dataset.features}")
print(f"\nFirst training example:")
print(dataset[0]['text'][:500])

In [None]:
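# Tokenize dataset
def tokenize_function(examples):
    # Return plain lists; datasets.map stores them without return_tensors
    tokens = tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=2048,
    )
    # Copy input_ids into labels, but set padded positions to -100 so the loss ignores them.
    # The attention mask is used rather than pad_token_id because pad_token == eos_token here,
    # and the real EOS at the end of each answer must still be learned.
    tokens["labels"] = [
        [tok if keep else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(tokens["input_ids"], tokens["attention_mask"])
    ]
    return tokens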

tokenized_dataset = dataset.map(
    tokenize_function, 
    batched=True, 
    remove_columns=["text", "prompt", "answer"]
)

print(f"Tokenized dataset size: {len(tokenized_dataset)}")
print(f"Features: {tokenized_dataset.features}")
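
With `truncation=True` and `max_length=2048`, any text longer than the limit is cut on the right by default, which is where the answer and the EOS token sit. A small sketch for spotting such examples before training follows; `find_overlong_texts` is a hypothetical helper and the whitespace counter is only a stand-in for the real tokenizer.

```python
def find_overlong_texts(texts, count_tokens, max_length=2048):
    """Return indices of texts whose token count exceeds max_length."""
    return [i for i, text in enumerate(texts) if count_tokens(text) > max_length]


# Stand-in tokenizer for illustration only: one token per whitespace-separated word
def whitespace_token_count(text):
    return len(text.split())


toy_texts = ["short example", "word " * 3000, "another short one"]
overlong = find_overlong_texts(toy_texts, whitespace_token_count, max_length=2048)
print(f"{len(overlong)} of {len(toy_texts)} texts would be truncated: {overlong}")
```

In the notebook, the counter would presumably be `lambda t: len(tokenizer(t)["input_ids"])` applied to `dataset["text"]`.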

In [None]:
# Split dataset into train and validation
from datasets import DatasetDict

# Split dataset (80% train, 20% validation)
dataset_split = dataset.train_test_split(test_size=0.2, seed=1234)

train_dataset = dataset_split['train']
val_dataset = dataset_split['test']

print(f"Train size: {len(train_dataset)}")
print(f"Validation size: {len(val_dataset)}")

# Log to wandb
wandb.config.update({
    "train_size": len(train_dataset),
    "val_size": len(val_dataset),
    "train_val_split": "80/20"
})

In [None]:
# Define compute_metrics function with BERTScore for validation
from bert_score import score
import numpy as np

def compute_metrics(eval_preds):
    """
    Compute BERTScore during validation.

    Called automatically by the Trainer after each evaluation pass. The
    predictions are teacher-forced (the most likely token at each position given
    the reference prefix), not free-running generations, so the scores are a
    proxy for generation quality.

    Args:
        eval_preds: EvalPrediction object with predictions and label_ids

    Returns:
        Dict of metric names and values (logged to WandB automatically)
    """
    predictions, labels = eval_preds
    if isinstance(predictions, tuple):
        predictions = predictions[0]

    # The Trainer passes logits (batch, seq, vocab) unless they were already
    # reduced to token ids via preprocess_logits_for_metrics
    if predictions.ndim == 3:
        predictions = predictions.argmax(axis=-1)

    # The output at position i predicts token i + 1, so shift to align
    predictions = predictions[:, :-1]
    labels = labels[:, 1:]

    # Replace ignored positions (-100) with the pad token before decoding
    ignored = labels == -100
    predictions = np.where(ignored | (predictions < 0), tokenizer.pad_token_id, predictions)
    labels = np.where(ignored, tokenizer.pad_token_id, labels)

    # Decode to text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Extract only the answer part (after the last "Answer:")
    def extract_answer(text):
        if "Answer:" in text:
            return text.split("Answer:")[-1].strip()
        return text.strip()

    decoded_preds = [extract_answer(pred) for pred in decoded_preds]
    decoded_labels = [extract_answer(label) for label in decoded_labels]

    # Compute BERTScore (model_type takes precedence over the lang default)
    P, R, F1 = score(
        decoded_preds,
        decoded_labels,
        lang="ko",  # Korean language
        model_type="bert-base-multilingual-cased",
        verbose=False
    )

    # The Trainer prefixes these keys with "eval_", so they appear in WandB as
    # eval_bert_precision, eval_bert_recall, eval_bert_f1
    return {
        "bert_precision": P.mean().item(),
        "bert_recall": R.mean().item(),
        "bert_f1": F1.mean().item(),
    }

print("✓ compute_metrics function defined")
print("  Metrics: BERTScore (Precision, Recall, F1)")
print("  Language: Korean (ko)")
print("  Model: bert-base-multilingual-cased")
print("\n  During validation:")
print("    - eval_loss (automatic)")
print("    - eval_bert_precision")
print("    - eval_bert_recall")
print("    - eval_bert_f1 ← used for best model selection")
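
During evaluation, a causal language model's output at position `i` is its guess for the token at position `i + 1`, so predictions and labels have to be shifted by one before they are compared or decoded. The toy sketch below illustrates that alignment with plain lists; `align_teacher_forced` and the token ids are invented for the example.

```python
IGNORE_INDEX = -100  # label value that the loss ignores


def align_teacher_forced(pred_ids, label_ids):
    """Pair each predicted token with the label it was trying to predict.

    pred_ids[i] is the guess for label_ids[i + 1], so the last prediction and
    the first label are dropped, then positions marked IGNORE_INDEX are skipped.
    """
    shifted_preds = pred_ids[:-1]
    shifted_labels = label_ids[1:]
    return [(p, l) for p, l in zip(shifted_preds, shifted_labels) if l != IGNORE_INDEX]


# Toy sequence: two ignored positions, then four real tokens
labels = [-100, -100, 11, 12, 13, 2]
preds = [0, 11, 12, 99, 2, 7]
pairs = align_teacher_forced(preds, labels)
accuracy = sum(p == l for p, l in pairs) / len(pairs)
print(pairs, accuracy)
```

Without the shift, every prediction would be compared against the token it was conditioned on rather than the one it was predicting.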

In [None]:
# Tokenize both train and validation datasets
print("Tokenizing train and validation datasets...")
print("=" * 60)

# Tokenize training set
tokenized_train = train_dataset.map(
    tokenize_function, 
    batched=True, 
    remove_columns=["text", "prompt", "answer"]
)

# Tokenize validation set
tokenized_val = val_dataset.map(
    tokenize_function, 
    batched=True, 
    remove_columns=["text", "prompt", "answer"]
)

print(f"✓ Tokenized train dataset: {len(tokenized_train)} samples")
print(f"✓ Tokenized validation dataset: {len(tokenized_val)} samples")
print(f"\nTrain features: {tokenized_train.features}")
print(f"Val features: {tokenized_val.features}")

In [None]:
# Configure LoRA for parameter-efficient fine-tuning
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,                    # LoRA rank
    lora_alpha=32,          # LoRA alpha
    lora_dropout=0.1,       # Dropout probability
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj"
    ]
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

# Log LoRA config to wandb
wandb.config.update({
    "lora_r": lora_config.r,
    "lora_alpha": lora_config.lora_alpha,
    "lora_dropout": lora_config.lora_dropout,
    "target_modules": lora_config.target_modules
})
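
For intuition about the number `print_trainable_parameters()` reports: LoRA adds two low-rank matrices per adapted weight, which comes to `r * (d_in + d_out)` parameters. The sketch below works through that arithmetic under assumed Llama-3-8B-style dimensions (hidden size 4096, key/value width 1024, 32 layers). These dimensions are an assumption and may not match Kanana's actual config, so the printed value from the model is authoritative.

```python
def lora_param_count(r, module_shapes, num_layers):
    """LoRA adds A (r x d_in) and B (d_out x r) per adapted weight: r * (d_in + d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in module_shapes.values())
    return per_layer * num_layers


# Assumed (d_in, d_out) shapes; not read from the real model config
assumed_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
}
trainable = lora_param_count(r=8, module_shapes=assumed_shapes, num_layers=32)
scaling = 32 / 8  # lora_alpha / r, the factor applied to the low-rank update
print(f"Estimated trainable parameters: {trainable:,} (scaling = {scaling})")
```

Under these assumptions the adapter is on the order of a few million parameters, a small fraction of the 8B base model.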

In [None]:
# Configure training arguments with validation and BERTScore
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    # Output settings
    output_dir="./outputs/kanana-rag",
    overwrite_output_dir=True,

    # Training hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_steps=10,

    # Optimization
    bf16=True,
    optim="adamw_torch",
    lr_scheduler_type="linear",

    # Logging
    logging_steps=5,
    logging_dir="./logs",
    report_to="wandb",

    # Saving: a checkpoint every 50 steps (save_steps must be a multiple of eval_steps
    # when load_best_model_at_end is set); the best checkpoint is reloaded at the end
    save_strategy="steps",
    save_steps=50,
    save_total_limit=3,
    load_best_model_at_end=True,

    # Validation: select the best checkpoint by BERTScore F1; eval_loss is logged as well.
    # Recent transformers releases name this argument eval_strategy
    # (older releases used evaluation_strategy).
    eval_strategy="steps",
    eval_steps=50,
    metric_for_best_model="eval_bert_f1",
    greater_is_better=True,  # Higher F1 is better

    # Other
    seed=1234,
    data_seed=1234,
    remove_unused_columns=True,
)

print("Training arguments configured with validation:")
print("=" * 60)
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Train batch size: {training_args.per_device_train_batch_size}")
print(f"  Eval batch size: {training_args.per_device_eval_batch_size}")
print(f"  Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"  Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"\n  Validation settings:")
print(f"    Strategy: {training_args.eval_strategy}")
print(f"    Eval steps: {training_args.eval_steps}")
print(f"    Metric for best model: {training_args.metric_for_best_model}")
print(f"    Load best at end: {training_args.load_best_model_at_end}")
print(f"\n  Metrics tracked:")
print(f"    - eval_loss (automatic, cross-entropy)")
print(f"    - eval_bert_precision")
print(f"    - eval_bert_recall")
print(f"    - eval_bert_f1 ← Best model selection criterion")
print("=" * 60)

# Log training config to wandb
wandb.config.update({
    "num_train_epochs": training_args.num_train_epochs,
    "per_device_train_batch_size": training_args.per_device_train_batch_size,
    "per_device_eval_batch_size": training_args.per_device_eval_batch_size,
    "gradient_accumulation_steps": training_args.gradient_accumulation_steps,
    "learning_rate": training_args.learning_rate,
    "weight_decay": training_args.weight_decay,
    "warmup_steps": training_args.warmup_steps,
    "eval_strategy": str(training_args.eval_strategy),
    "eval_steps": training_args.eval_steps,
    "metric_for_best_model": training_args.metric_for_best_model,
})
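
Evaluation and checkpointing both fire every 50 optimizer steps, so it is worth estimating how many optimizer steps the run will have. If the training set is small, the total can fall below 50 and no validation would run at all. The following is a rough estimate only: `estimate_schedule` is a hypothetical helper, the dataset sizes are invented, and the exact rounding of the last partial accumulation window differs between transformers versions.

```python
import math


def estimate_schedule(train_size, per_device_batch_size, grad_accum_steps, epochs, eval_steps):
    """Approximate optimizer steps and evaluation count for a single-GPU run."""
    batches_per_epoch = math.ceil(train_size / per_device_batch_size)
    updates_per_epoch = max(math.ceil(batches_per_epoch / grad_accum_steps), 1)
    total_updates = updates_per_epoch * epochs
    return {
        "updates_per_epoch": updates_per_epoch,
        "total_updates": total_updates,
        "num_evaluations": total_updates // eval_steps,
    }


# Invented dataset sizes, using the hyperparameters configured above
large = estimate_schedule(train_size=800, per_device_batch_size=2,
                          grad_accum_steps=4, epochs=3, eval_steps=50)
small = estimate_schedule(train_size=100, per_device_batch_size=2,
                          grad_accum_steps=4, epochs=3, eval_steps=50)
print(large)
print(small)
```

If the estimate for the real `len(train_dataset)` yields zero evaluations, lowering `eval_steps` and `save_steps` together would be one way to get validation points.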

In [None]:
# Initialize Trainer with validation dataset and compute_metrics
def preprocess_logits_for_metrics(logits, labels):
    """Reduce logits to token ids so evaluation does not accumulate full-vocabulary logits."""
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits.argmax(dim=-1)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)

print("Trainer initialized with validation!")
print("=" * 60)
print(f"  Train dataset: {len(tokenized_train)} samples")
print(f"  Validation dataset: {len(tokenized_val)} samples")
print(f"  Compute metrics: BERTScore enabled")
print(f"\n  During training:")
print(f"    - Every {training_args.eval_steps} steps:")
print(f"      → Run validation on {len(tokenized_val)} samples")
print(f"      → Compute eval_loss")
print(f"      → Compute BERTScore metrics")
print(f"      → Save checkpoint (every {training_args.save_steps} steps)")
print(f"    - At end: Load best checkpoint (highest eval_bert_f1)")
print("=" * 60)

In [None]:
# Start training
print("Starting training...")
print("=" * 50)

trainer_stats = trainer.train()

print("\n" + "=" * 50)
print("Training completed!")
print(f"Training loss: {trainer_stats.training_loss:.4f}")
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f}s")

In [None]:
# Check peak GPU memory usage
peak_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
total_gpu_memory = round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 3)
print(f"\nPeak GPU memory reserved: {peak_gpu_memory} GB")
print(f"Total GPU memory: {total_gpu_memory} GB ({peak_gpu_memory / total_gpu_memory:.1%} reserved at peak)")

# Log final stats to wandb
wandb.log({
    "peak_gpu_memory_gb": peak_gpu_memory,
    "total_gpu_memory_gb": total_gpu_memory
})

In [None]:
# Save the fine-tuned model (save_pretrained on a PEFT model writes only the LoRA adapter weights)
output_dir = "./outputs/kanana-rag-final"

print(f"Saving model to {output_dir}...")
trainer.model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print("Model saved successfully!")
print(f"\nModel location: {output_dir}")

In [None]:
# Test inference with a sample
print("Testing inference...\n")

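# Get a test example (formatted_data[0] may also be part of the training split)
# End the prompt with the same "Answer:" cue used in the training text
test_prompt = formatted_data[0]['prompt'] + "\n\nAnswer:"
expected_answer = formatted_data[0]['answer']

print("Test Prompt:")
print(test_prompt[:300])
print("\n...")

# Tokenize and generate
model.eval()
inputs = tokenizer(test_prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens; slicing the decoded string by
# len(test_prompt) can misalign if decoding does not reproduce the prompt exactly
prompt_length = inputs["input_ids"].shape[1]
generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

print("\nGenerated Response:")
print(generated_text.strip())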

print("\nExpected Answer:")
print(expected_answer)

## Evaluation with Generation-Only Metrics

Now let's evaluate the fine-tuned model using generation-only metrics (ROUGE, BERTScore, Exact Match).

**No document retrieval metrics needed** - we only compare generated answers vs ground truth.
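
The evaluator imported below comes from the project's `src.evaluation.metrics` module, which is not shown here. As a hedged illustration of what the simpler generation-only metrics can look like, the sketch below computes exact match and a unigram-overlap F1 over whitespace tokens. The function names and the whitespace tokenization are assumptions, not that module's implementation, and a Korean morphological tokenizer would give different scores.

```python
from collections import Counter


def normalize(text):
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(text.strip().split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Unigram-overlap F1 over whitespace-separated tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("의림지는  제천에 있습니다 ", "의림지는 제천에 있습니다")
f1 = token_f1("의림지는 제천에 있습니다", "의림지는 제천시에 있습니다")
print(em, round(f1, 3))
```

Exact match is strict for free-form answers, which is one reason to report overlap-based and embedding-based metrics alongside it.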

In [None]:
# Install evaluation dependencies
!pip install -q rouge-score bert-score

In [None]:
# Import evaluation metrics
import sys
sys.path.insert(0, '/home/user/goodganglabs')

from src.evaluation.metrics import create_evaluator

# Create evaluator (no k_values needed for generation-only)
evaluator = create_evaluator()

print("✓ Evaluator initialized")
print("  Available metrics: ROUGE, BERTScore, Exact Match")

In [None]:
# Generate predictions on test set
print("Generating predictions on test set...")
print("=" * 60)

# Use first 30 samples for evaluation.
# NOTE: data_list[:30] most likely overlaps the training split, so these scores are
# optimistic; load a separate test set for reportable numbers.
test_samples = data_list[:30]
predictions = []

model.eval()
with torch.no_grad():
    for i, sample in enumerate(test_samples):
        # Format prompt and end it with the same "Answer:" cue used in training
        formatted = format_rag_data(sample)
        prompt = formatted['prompt'] + "\n\nAnswer:"

        # Generate answer
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=200,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

        # Decode only the newly generated tokens
        prompt_length = inputs["input_ids"].shape[1]
        answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()

        predictions.append({
            'answer': answer
        })
        
        if (i + 1) % 10 == 0:
            print(f"  Generated {i + 1}/{len(test_samples)} predictions...")

print(f"\n✓ Generated {len(predictions)} predictions")

# Show example
print("\n" + "=" * 60)
print("Example prediction:")
print("=" * 60)
print(f"Question: {test_samples[0]['question']}")
print(f"\nGenerated: {predictions[0]['answer'][:200]}...")
print(f"\nReference: {test_samples[0]['answer'][:200]}...")

In [None]:
# Run generation-only evaluation (no docs needed!)
print("\n" + "=" * 60)
print("Running generation-only evaluation...")
print("=" * 60 + "\n")

# Evaluate - only needs 'answer' field in predictions and dataset
results = evaluator.evaluate_generation_only(
    dataset=test_samples,
    model_predictions=predictions
)

# Display results
print(evaluator.format_results_generation_only(results))

# Log to wandb
wandb.log({
    "eval/rouge1": results['rouge1'],
    "eval/rouge2": results['rouge2'],
    "eval/rougeL": results['rougeL'],
    "eval/bert_f1": results['bert_f1'],
    "eval/bert_precision": results['bert_precision'],
    "eval/bert_recall": results['bert_recall'],
    "eval/exact_match": results['exact_match'],
    "eval/num_samples": results['num_samples']
})

print("\n✓ Results logged to WandB")

In [None]:
# Create detailed examples table for wandb
examples_data = []

for i in range(min(5, len(test_samples))):
    examples_data.append([
        test_samples[i]['question'],
        test_samples[i]['answer'],
        predictions[i]['answer'],
        test_samples[i].get('question_type', 'unknown')
    ])

examples_table = wandb.Table(
    columns=["Question", "Reference Answer", "Generated Answer", "Question Type"],
    data=examples_data
)

wandb.log({"eval/detailed_examples": examples_table})

print("✓ Example predictions logged to WandB")
print("\nView your results at: https://wandb.ai")

## Summary

This notebook fine-tunes Kanana 1.5 8B on RAG tasks with **validation and BERTScore tracking**.

### What this notebook does:

1. ✅ **Data Preparation**
   - Loads Kanana 1.5 8B instruct model (kakaocorp/kanana-1.5-8b-instruct-2505)
   - Formats RAG training data with instruction format
   - **Splits dataset: 80% train, 20% validation**

2. ✅ **Training Configuration**
   - Applies LoRA for efficient fine-tuning (r=8, alpha=32)
   - **Validation every 50 steps**
   - **Tracks both eval_loss AND BERTScore metrics**

3. ✅ **Best Model Selection**
   - **Primary criterion: `eval_bert_f1` (BERTScore F1)**
   - Also monitors: `eval_loss`, `eval_bert_precision`, `eval_bert_recall`
   - Automatically loads best checkpoint at end of training

4. ✅ **Experiment Tracking**
   - Weights & Biases integration
   - All metrics logged automatically during training
   - Checkpoints saved every 50 steps (at most 3 kept); the best one by `eval_bert_f1` is reloaded at the end

5. ✅ **Evaluation**
   - Post-training evaluation with ROUGE, BERTScore, Exact Match
   - Comparison examples logged to WandB

### Training Data Format:
```
[Instruction: 제천시 관광 안내 전문가 역할]

Information:
{content1}
Information:
{content2}
Question: {question}

Answer: {answer}
```

### Key Features:
- **System instruction** guides the model to be a Jecheon tourism expert
- **Multi-document format** teaches model to find relevant info
- **Validation with BERTScore** monitors answer quality during training
- **Best model selection** based on semantic similarity (BERTScore F1)
- **Loss also tracked** for analysis in WandB

### Metrics Tracked During Training:
| Metric | Description | Purpose |
|--------|-------------|---------|
| `eval_loss` | Cross-entropy loss | General training health |
| `eval_bert_precision` | BERTScore precision | Answer accuracy |
| `eval_bert_recall` | BERTScore recall | Answer completeness |
| `eval_bert_f1` | **BERTScore F1** | **Best model selection** ⭐ |

### Why BERTScore for best model selection?
- **BERTScore F1** measures semantic similarity between the predicted and reference answers
- **eval_loss** can be low while answers are still poor (the model fits the form, not the content)
- **Both metrics are visible** in WandB, so quality-based selection and loss monitoring can be compared

### Expected Model Behavior:
- Finds the relevant document among those provided and answers from it
- Answers "제공된 정보에는 해당 내용이 없습니다" ("the provided information does not contain this") when no document covers the question
- The instruction discourages hallucination, although the training format alone does not guarantee it

### Outputs:
- LoRA adapter and tokenizer saved to `./outputs/kanana-rag-final`
- Single-sample inference test after training

### Next Steps:
- Run full evaluation on a held-out test set
- Compare baseline vs fine-tuned performance
- Upload model to Hugging Face Hub
- Generate report with metrics and examples