# 🚀 Vi-VQA Training on Google Colab

A notebook for training Qwen3-VL on Google Colab with a free GPU.

**Requirements:**
- Google Colab with a GPU (T4, L4, or A100)
- HuggingFace token
- ~15GB disk space

**Training time:**
- T4: ~30-40 hours (slow, not recommended)
- L4: ~15-20 hours
- A100: ~8-12 hours (Colab Pro+)

## 1. Setup Environment

### Check GPU

In [None]:
# Check GPU
!nvidia-smi

import torch
print(f'\nPyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'CUDA version: {torch.version.cuda}')
    print(f'GPU: {torch.cuda.get_device_name(0)}')
    print(f'Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB')
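
The training command later in this notebook passes `--bf16 true`. bfloat16 is natively supported on NVIDIA GPUs with compute capability 8.0 or higher (Ampere and newer), so it is worth checking which of the GPUs listed above qualify. The following is a minimal sketch; `pick_precision` is a hypothetical helper introduced here for illustration and is not part of the notebook or the training repository.

```python
def pick_precision(capability):
    """Return 'bf16' for compute capability >= 8.0 (Ampere or newer), else 'fp16'."""
    return "bf16" if tuple(capability) >= (8, 0) else "fp16"

# Compute capabilities of the GPUs listed above: T4 = 7.5, L4 = 8.9, A100 = 8.0
for name, cap in [("T4", (7, 5)), ("L4", (8, 9)), ("A100", (8, 0))]:
    print(f"{name}: {pick_precision(cap)}")

# In Colab, pass the real value instead of a hard-coded tuple:
#   pick_precision(torch.cuda.get_device_capability(0))
```

On a T4 this suggests falling back to fp16; whether the training script accepts that switch is something to verify against its own argument list.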

### Clone Repository

In [None]:
# Clone your repo (replace with your actual repo URL)
# Option 1: From GitHub
# !git clone https://github.com/your-username/Vi-VQA.git
# %cd Vi-VQA

# Option 2: Upload files manually or use Google Drive
from google.colab import drive
drive.mount('/content/drive')

# If you have project in Drive:
# %cd /content/drive/MyDrive/Vi-VQA

### Install Dependencies

In [None]:
# Install transformers from source (required for Qwen3-VL)
!pip install -q git+https://github.com/huggingface/transformers

# Install core dependencies
!pip install -q qwen-vl-utils accelerate peft bitsandbytes datasets pillow pyyaml huggingface_hub scipy

# Install flash-attention (optional, speeds up training)
# May take 5-10 minutes to compile; FlashAttention-2 targets Ampere or newer GPUs, so skip this on a T4
!pip install -q flash-attn --no-build-isolation

### Login to HuggingFace

In [None]:
from huggingface_hub import login
from google.colab import userdata

# Option 1: Use Colab secrets (recommended)
# Go to: 🔑 icon on left sidebar → Add secret: HF_TOKEN
try:
    hf_token = userdata.get('HF_TOKEN')
    login(token=hf_token)
    print('✓ Logged in using Colab secret')
except Exception:
    # Option 2: Enter token manually (secret missing, access not granted, or login failed)
    print('Could not log in with the HF_TOKEN secret. Please enter the token manually:')
    login()

## 2. Prepare Dataset

### Create necessary directories

In [None]:
import os

# Create directories
os.makedirs('data', exist_ok=True)
os.makedirs('data/images', exist_ok=True)
os.makedirs('checkpoints', exist_ok=True)
os.makedirs('logs', exist_ok=True)

print('✓ Directories created')

### Load and Process Dataset

In [None]:
from datasets import load_dataset
from PIL import Image
import json
from tqdm.auto import tqdm

print('Loading dataset from HuggingFace...')
dataset = load_dataset('5CD-AI/Viet-ViTextVQA-gemini-VQA', split='train')
print(f'✓ Loaded {len(dataset)} samples')
print(f'Columns: {dataset.column_names}')

In [None]:
# Process dataset to Qwen3-VL format
processed_samples = []
image_folder = 'data/images'

print('Processing dataset...')
for idx in tqdm(range(len(dataset))):
    item = dataset[idx]
    image = item['image']
    conversations = item.get('conversations', [])
    
    if not conversations:
        continue
    
    # Save image
    image_filename = f"image_{item['id']}.jpg"
    image_path = os.path.join(image_folder, image_filename)
    
    if not os.path.exists(image_path):
        try:
            if image.mode != 'RGB':
                image = image.convert('RGB')
            image.save(image_path)
        except Exception:
            # Skip samples whose image cannot be converted or written to disk
            continue
    
    # Process conversations
    current_question = None
    for turn in conversations:
        role = turn.get('role', turn.get('from'))
        content = turn.get('content', turn.get('value'))
        
        if role in ['user', 'human']:
            current_question = content
        elif role in ['assistant', 'gpt'] and current_question:
            processed_samples.append({
                'id': f"{item['id']}_{len(processed_samples)}",
                'image': image_filename,
                'conversations': [
                    {'from': 'human', 'value': f'<image>\n{current_question}'},
                    {'from': 'gpt', 'value': content}
                ]
            })
            current_question = None

print(f'✓ Processed {len(processed_samples)} QA pairs')

# Split dataset: 90% train, 10% validation
import random
random.seed(42)  # For reproducibility

# Shuffle samples
shuffled_samples = processed_samples.copy()
random.shuffle(shuffled_samples)

# Calculate split index
split_idx = int(len(shuffled_samples) * 0.9)
train_samples = shuffled_samples[:split_idx]
val_samples = shuffled_samples[split_idx:]

print(f'\n📊 Dataset Split:')
print(f'  Train: {len(train_samples)} samples ({len(train_samples)/len(processed_samples)*100:.1f}%)')
print(f'  Val:   {len(val_samples)} samples ({len(val_samples)/len(processed_samples)*100:.1f}%)')

# Save train set
with open('data/train.json', 'w', encoding='utf-8') as f:
    json.dump(train_samples, f, ensure_ascii=False, indent=2)
print('✓ Saved train set to data/train.json')

# Save validation set
with open('data/val.json', 'w', encoding='utf-8') as f:
    json.dump(val_samples, f, ensure_ascii=False, indent=2)
print('✓ Saved validation set to data/val.json')
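
The split above shuffles individual QA pairs, so several questions about the same image can end up on both sides of the train/validation boundary. If that overlap matters for your evaluation, one alternative is to split by image instead. The sketch below assumes each sample has an `image` key, as in `processed_samples`; `split_by_image` is a hypothetical helper and the toy data is made up for illustration.

```python
import random
from collections import defaultdict

def split_by_image(samples, val_ratio=0.1, seed=42):
    """Split QA pairs so that all pairs sharing an image land in the same set."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["image"]].append(sample)

    image_names = sorted(groups)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(image_names)

    n_val = max(1, int(len(image_names) * val_ratio))
    val_images = set(image_names[:n_val])

    train = [s for name in image_names if name not in val_images for s in groups[name]]
    val = [s for name in image_names if name in val_images for s in groups[name]]
    return train, val

# Toy data (hypothetical): 10 images with 3 QA pairs each
toy = [{"id": f"{i}_{j}", "image": f"image_{i}.jpg"} for i in range(10) for j in range(3)]
train, val = split_by_image(toy)
print(len(train), len(val))
```

Because the ratio is applied to images rather than to QA pairs, the resulting sample counts only approximate a 90/10 split when images carry different numbers of questions.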

### Check dataset statistics

In [None]:
print(f'Dataset Statistics:')
print(f'  Total samples: {len(processed_samples)}')
print(f'  Train samples: {len(train_samples)}')
print(f'  Val samples: {len(val_samples)}')
print(f'  Images saved: {len(os.listdir("data/images"))}')
print(f'\nFirst training sample:')
print(json.dumps(train_samples[0], ensure_ascii=False, indent=2))
print(f'\nFirst validation sample:')
print(json.dumps(val_samples[0], ensure_ascii=False, indent=2))

## 3. Setup Training

### Clone Qwen-VL-Series-Finetune

In [None]:
if not os.path.exists('Qwen-VL-Series-Finetune'):
    !git clone https://github.com/2U1/Qwen-VL-Series-Finetune.git
    print('✓ Cloned training repository')
else:
    print('✓ Training repository already exists')

### Configure Training Parameters

In [None]:
# Training configuration
MODEL_ID = "Qwen/Qwen3-VL-8B-Instruct"
DATA_PATH = "data/train.json"
VAL_DATA_PATH = "data/val.json"  # Validation set
IMAGE_FOLDER = "data/images"
OUTPUT_DIR = "checkpoints/qwen3vl-vivqa"

# Hyperparameters
NUM_EPOCHS = 2  # Reduced from 3 to avoid overfitting (large dataset)
BATCH_SIZE = 1  # Small for Colab (adjust based on GPU)
GRAD_ACCUM = 16  # Effective batch size = 1 * 16 = 16
LEARNING_RATE = 2e-5
VISION_LR = 2e-6
MERGER_LR = 2e-5

# LoRA config
LORA_RANK = 128
LORA_ALPHA = 256
LORA_DROPOUT = 0.05

# Image resolution
IMAGE_MIN_PIXELS = 256 * 32 * 32  # 262144
IMAGE_MAX_PIXELS = 1280 * 32 * 32  # 1310720

# Evaluation config
EVAL_STRATEGY = "steps"  # Evaluate every N steps
EVAL_STEPS = 500  # Evaluate every 500 steps
SAVE_STRATEGY = "steps"  # Save checkpoints based on steps
SAVE_STEPS = 500  # Save every 500 steps
SAVE_TOTAL_LIMIT = 3  # Keep at most 3 checkpoints on disk (older ones are deleted)
LOAD_BEST_MODEL_AT_END = True  # Load best model when training ends
METRIC_FOR_BEST_MODEL = "eval_loss"  # Use validation loss to select best model

print('Training Configuration:')
print(f'  Model: {MODEL_ID}')
print(f'  Train data: {len(train_samples)} samples')
print(f'  Val data: {len(val_samples)} samples')
print(f'  Batch size: {BATCH_SIZE} × {GRAD_ACCUM} = {BATCH_SIZE * GRAD_ACCUM}')
print(f'  Epochs: {NUM_EPOCHS}')
print(f'  Eval every: {EVAL_STEPS} steps')
print(f'  Output: {OUTPUT_DIR}')
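
To judge whether `EVAL_STEPS = 500` is a sensible interval, it helps to estimate how many optimizer steps the run will take. The sketch below is a rough calculation only: `estimate_steps` is a hypothetical helper, the sample count is invented for illustration, and the exact number the trainer reports may differ slightly depending on how it handles the last partial accumulation window.

```python
import math

def estimate_steps(num_samples, batch_size, grad_accum, epochs, num_gpus=1):
    """Approximate the number of optimizer steps for a training run."""
    effective_batch = batch_size * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * epochs

# Hypothetical sample count, for illustration only; use len(train_samples) in the notebook
total = estimate_steps(num_samples=16_000, batch_size=1, grad_accum=16, epochs=2)
print(f"total optimizer steps: {total}")
print(f"evaluations at EVAL_STEPS=500: {total // 500}")
print(f"warmup steps at warmup_ratio=0.03: {math.ceil(total * 0.03)}")
```

If the estimated total is only a small multiple of `EVAL_STEPS`, lowering the interval gives more validation points and more checkpoints to fall back on.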

## 4. Start Training

⚠️ **Important:** Training takes many hours. Colab may disconnect if the session stays idle for too long.

**Tips to avoid disconnects:**
- Open the browser console (F12) and run: `setInterval(() => { document.querySelector('colab-connect-button').click() }, 60000)`
- Use Colab Pro to avoid timeouts
- Save checkpoints frequently
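
If the session does disconnect, the first thing to find is the most recent checkpoint that was written. The Hugging Face trainer names its checkpoint folders `checkpoint-<step>`; the sketch below picks the highest step under an output directory. `latest_checkpoint` is a hypothetical helper written for this note, and how (or whether) the training script can resume from that folder is not covered here.

```python
import os
import re

def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-<step> folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    found = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            found.append((int(match.group(1)), name))
    if not found:
        return None
    # Compare by step number, not by string, so checkpoint-1500 beats checkpoint-500
    return os.path.join(output_dir, max(found)[1])

print(latest_checkpoint("checkpoints/qwen3vl-vivqa"))
```

Comparing step numbers as integers avoids the common mistake of sorting folder names alphabetically, where `checkpoint-500` would sort after `checkpoint-1500`.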

In [None]:
%cd Qwen-VL-Series-Finetune

# Build training command with validation
train_cmd = f"""
python train.py \
    --model_id {MODEL_ID} \
    --data_path ../{DATA_PATH} \
    --eval_data_path ../{VAL_DATA_PATH} \
    --image_folder ../{IMAGE_FOLDER} \
    --output_dir ../{OUTPUT_DIR} \
    --num_train_epochs {NUM_EPOCHS} \
    --per_device_train_batch_size {BATCH_SIZE} \
    --per_device_eval_batch_size {BATCH_SIZE} \
    --gradient_accumulation_steps {GRAD_ACCUM} \
    --learning_rate {LEARNING_RATE} \
    --vision_lr {VISION_LR} \
    --merger_lr {MERGER_LR} \
    --lora_rank {LORA_RANK} \
    --lora_alpha {LORA_ALPHA} \
    --lora_dropout {LORA_DROPOUT} \
    --num_lora_modules -1 \
    --image_min_pixels {IMAGE_MIN_PIXELS} \
    --image_max_pixels {IMAGE_MAX_PIXELS} \
    --freeze_vision_tower true \
    --freeze_llm false \
    --freeze_merger false \
    --optim adamw_bnb_8bit \
    --weight_decay 0.01 \
    --warmup_ratio 0.03 \
    --bf16 true \
    --gradient_checkpointing true \
    --max_grad_norm 1.0 \
    --dataloader_num_workers 2 \
    --eval_strategy {EVAL_STRATEGY} \
    --eval_steps {EVAL_STEPS} \
    --save_strategy {SAVE_STRATEGY} \
    --save_steps {SAVE_STEPS} \
    --save_total_limit {SAVE_TOTAL_LIMIT} \
    --load_best_model_at_end {str(LOAD_BEST_MODEL_AT_END).lower()} \
    --metric_for_best_model {METRIC_FOR_BEST_MODEL} \
    --greater_is_better false \
    --logging_steps 50 \
    --report_to tensorboard
"""

print('Starting training with validation monitoring...')
print('='*80)
print(f'📊 Validation will run every {EVAL_STEPS} steps')
print('💾 Best checkpoint will be saved based on validation loss')
print('📈 Monitor training progress in TensorBoard (next cell)')
print('='*80)
!{train_cmd}

## 5. Monitor Training

### Load TensorBoard

In [None]:
%load_ext tensorboard
%tensorboard --logdir ../checkpoints/qwen3vl-vivqa

### Check GPU Memory

In [None]:
!nvidia-smi

## 6. Test Inference

### Load Trained Model

In [None]:
%cd ..

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch

# Load model checkpoint (adjust path to your best checkpoint)
CHECKPOINT_PATH = "checkpoints/qwen3vl-vivqa/checkpoint-1500"  # Change this

print(f'Loading model from {CHECKPOINT_PATH}...')
model = Qwen3VLForConditionalGeneration.from_pretrained(
    CHECKPOINT_PATH,
    dtype=torch.bfloat16,
    device_map="auto"
)

processor = AutoProcessor.from_pretrained(CHECKPOINT_PATH)
print('✓ Model loaded!')

### Test on Sample

In [None]:
def ask_question(image_path, question, max_tokens=512):
    """Ask a question about an image using Method 1 (memory efficient)"""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question}
            ]
        }
    ]
    
    # Method 1: Direct tokenization (memory efficient)
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,  # ✅ Tokenize in one step
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt"
    )
    inputs = inputs.to(model.device)
    
    with torch.inference_mode():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True
        )
    
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    
    answer = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]
    
    # Clean up memory
    del inputs
    del generated_ids
    torch.cuda.empty_cache()
    
    return answer

In [None]:
# Test with one sample from the dataset
import matplotlib.pyplot as plt
from PIL import Image

# Load one sample
test_sample = processed_samples[0]
image_path = os.path.join('data/images', test_sample['image'])
question = test_sample['conversations'][0]['value'].replace('<image>\n', '')
ground_truth = test_sample['conversations'][1]['value']

# Display image
img = Image.open(image_path)
plt.figure(figsize=(8, 8))
plt.imshow(img)
plt.axis('off')
plt.title('Test Image')
plt.show()

# Generate answer
print(f'Question: {question}')
print(f'Ground Truth: {ground_truth}')
print('\nGenerating answer...')

prediction = ask_question(image_path, question)
print(f'Prediction: {prediction}')

## 7. Evaluate on Multiple Samples

In [None]:
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

# Load validation set
with open('data/val.json', 'r', encoding='utf-8') as f:
    val_data = json.load(f)

# Evaluate on validation set
# Option 1: Full validation set (takes longer)
# num_eval = len(val_data)
# Option 2: Subset for quick check (faster)
num_eval = min(100, len(val_data))  # Evaluate on first 100 samples

exact_matches = 0
similarities = []

print(f'Evaluating on {num_eval} validation samples...')
print('(This may take a while...)\n')

for i in tqdm(range(num_eval)):
    sample = val_data[i]
    image_path = os.path.join('data/images', sample['image'])
    question = sample['conversations'][0]['value'].replace('<image>\n', '')
    ground_truth = sample['conversations'][1]['value']
    
    try:
        prediction = ask_question(image_path, question)
        
        if prediction.strip() == ground_truth.strip():
            exact_matches += 1
        
        sim = similarity(prediction.lower(), ground_truth.lower())
        similarities.append(sim)
        
        # Print first 3 examples
        if i < 3:
            print(f'\n--- Example {i+1} ---')
            print(f'Q: {question}')
            print(f'GT: {ground_truth}')
            print(f'Pred: {prediction}')
            print(f'Similarity: {sim*100:.1f}%')
    except Exception as e:
        print(f'Error on sample {i}: {e}')
        continue

# Results
num_scored = len(similarities)  # samples that ran without raising an error
print(f'\n{"="*80}')
print('Validation Results')
print(f'{"="*80}')
print(f'Samples evaluated: {num_scored}/{len(val_data)}')
if num_scored:
    print(f'Exact Match: {exact_matches}/{num_scored} ({exact_matches/num_scored*100:.2f}%)')
    print(f'Avg Similarity: {sum(similarities)/num_scored*100:.2f}%')
else:
    print('No samples were evaluated successfully.')
print(f'{"="*80}')
print('\n💡 Tip: Change num_eval to len(val_data) to evaluate full validation set')
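
Exact match on raw strings can undercount correct Vietnamese answers, because the same accented text can be encoded with precomposed characters or with base letters plus combining marks. A possible refinement, sketched here under the assumption that case and extra whitespace should also be ignored, is to normalize both strings before comparing. `normalize_answer` is a hypothetical helper and is not used by the evaluation loop above.

```python
import unicodedata

def normalize_answer(text):
    """Apply Unicode NFC, lowercase, and collapse whitespace before comparison."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.lower().split())

composed = unicodedata.normalize("NFC", "Hà Nội")    # precomposed characters
decomposed = unicodedata.normalize("NFD", composed)  # base letters + combining marks

print(composed == decomposed)
print(normalize_answer(composed) == normalize_answer(decomposed + "  "))
```

To try it, wrap both `prediction` and `ground_truth` in `normalize_answer` before the equality check and the similarity computation.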

## 8. Save Model to Drive

⚠️ **Important:** Save the model to Google Drive so it is not lost when Colab disconnects.

In [None]:
# Copy checkpoint to Google Drive
import shutil

drive_output_dir = '/content/drive/MyDrive/Vi-VQA-Models'
os.makedirs(drive_output_dir, exist_ok=True)

# Copy best checkpoint
print('Copying checkpoint to Google Drive...')
shutil.copytree(
    CHECKPOINT_PATH,
    os.path.join(drive_output_dir, os.path.basename(CHECKPOINT_PATH)),
    dirs_exist_ok=True
)

print(f'✓ Checkpoint saved to {drive_output_dir}')
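
Large copies to a mounted Drive can be interrupted, so it may be worth confirming that every file arrived before the Colab runtime is recycled. The following is a minimal sketch that compares relative paths and file sizes only (it does not hash contents); `list_files` and `copy_is_complete` are hypothetical helpers introduced for this check.

```python
import os

def list_files(root):
    """Map each file's path relative to root to its size in bytes."""
    sizes = {}
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            sizes[os.path.relpath(full_path, root)] = os.path.getsize(full_path)
    return sizes

def copy_is_complete(src, dst):
    """True if src has files and each one exists under dst with the same size."""
    src_files, dst_files = list_files(src), list_files(dst)
    return bool(src_files) and all(
        dst_files.get(path) == size for path, size in src_files.items()
    )

# In the notebook (names taken from the cell above):
#   copy_is_complete(CHECKPOINT_PATH,
#                    os.path.join(drive_output_dir, os.path.basename(CHECKPOINT_PATH)))
```

A size comparison is fast and catches truncated or missing files; a checksum comparison would be stricter but much slower on multi-gigabyte weights.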

## 9. Download Model (Optional)

Use this step if you want to download the model to your local machine.

In [None]:
# Zip checkpoint
!zip -r qwen3vl-vivqa-checkpoint.zip {CHECKPOINT_PATH}

# Download
from google.colab import files
files.download('qwen3vl-vivqa-checkpoint.zip')

## 🎉 Training Complete!

### Next Steps:

1. **Save model to Drive** (cell 8) so it is not lost on disconnect
2. **Evaluate thoroughly** on the validation set
3. **Fine-tune hyperparameters** if needed
4. **Deploy** the model to production

### Tips:

- **OOM Error?** Reduce `BATCH_SIZE` to 1 and increase `GRAD_ACCUM`
- **Training too slow?** Upgrade to Colab Pro (A100)
- **Model not learning?** Check the learning rate; it may be too high or too low
- **Overfitting?** Reduce the number of epochs or add data augmentation

---

**Happy Training! 🚀**