# Liputan6: BERT2GPT Indonesian Summarization Fine-tuning

## 🎯 Objective
Fine-tune the **cahya/bert2gpt-indonesian-summarization** model from Hugging Face for Indonesian text summarization on the Liputan6 dataset.

## 📚 Model Information
- **Model**: BERT-to-GPT2 Encoder-Decoder Architecture
- **Pretrained**: cahya/bert2gpt-indonesian-summarization
- **Source**: https://huggingface.co/cahya/bert2gpt-indonesian-summarization
- **Language**: Indonesian (Bahasa Indonesia)
- **Task**: Abstractive Text Summarization

## 📋 Table of Contents

1. [Setup & Dependencies](#setup)
2. [Load Preprocessed Data](#load-data)
3. [Load BERT2GPT Model](#load-model)
4. [Prepare Dataset for Training](#prepare-dataset)
5. [Training Configuration](#training-config)
6. [Fine-tuning](#fine-tuning)
7. [Evaluation](#evaluation)
8. [Inference & Testing](#inference)
9. [Save Model](#save-model)

<a id="setup"></a>
## 1. Setup & Dependencies

Install and import the required libraries.

In [55]:
# Install dependencies
# Uncomment if these packages are not installed yet
# !pip install -q transformers datasets torch sentencepiece accelerate evaluate rouge-score

In [56]:
import sys
import subprocess

# Auto-install transformers if it is missing
try:
    from transformers import BertTokenizer, EncoderDecoderModel
    print("✓ transformers already installed")
except ImportError:
    print("📦 Installing transformers...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "transformers", "torch", "sentencepiece", "accelerate"])
    from transformers import BertTokenizer, EncoderDecoderModel
    print("✓ transformers installed successfully")

# Install evaluate and rouge_score for metrics
try:
    import evaluate
    print("✓ evaluate already installed")
except ImportError:
    print("📦 Installing evaluate...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "evaluate", "rouge-score"])
    import evaluate
    print("✓ evaluate installed successfully")

✓ transformers already installed
✓ evaluate already installed


In [57]:
# Import libraries
import os
import pandas as pd
import numpy as np
import pickle
import json
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Transformers
from transformers import (
    BertTokenizer,
    EncoderDecoderModel,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)

# Datasets
from datasets import Dataset, DatasetDict

# Evaluation
import evaluate

# PyTorch
import torch

print(f"✓ PyTorch version: {torch.__version__}")
print(f"✓ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✓ CUDA device: {torch.cuda.get_device_name(0)}")

# Set random seed for reproducibility
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print("\n✅ All libraries imported successfully!")

✓ PyTorch version: 2.9.0+cpu
✓ CUDA available: False

✅ All libraries imported successfully!


<a id="load-data"></a>
## 2. Load Preprocessed Data

Load the data that was preprocessed in the previous notebook.

In [58]:
# Define paths
PREPROCESS_DIR = Path("./output/preprocessed")
OUTPUT_DIR = Path("./output/bert2gpt_finetuned")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"📁 Preprocessed data directory: {PREPROCESS_DIR}")
print(f"📁 Output directory: {OUTPUT_DIR}")

# Load config
with open(PREPROCESS_DIR / "config.json", 'r') as f:
    config = json.load(f)

print("\n⚙️  Preprocessing Config:")
for key, value in config.items():
    print(f"  {key}: {value}")

📁 Preprocessed data directory: output\preprocessed
📁 Output directory: output\bert2gpt_finetuned

⚙️  Preprocessing Config:
  train_size: 9694
  val_size: 549
  test_size: 549
  tokenizer_name: indolem/indobert-base-uncased
  use_sample: True
  preprocessing_mode: BERT2GPT-only
  avg_article_tokens: 192
  avg_summary_tokens: 27


In [59]:
# Load preprocessed data (BERT format)
print("📂 Loading preprocessed BERT data...\n")

# Try to load from bert_data.pkl (optimized format)
try:
    with open(PREPROCESS_DIR / "bert_data.pkl", 'rb') as f:
        bert_data = pickle.load(f)
    
    print("✓ Loaded from bert_data.pkl (optimized BERT format)")
    
    # Convert to dataframes
    df_train = pd.DataFrame({
        'clean_article': bert_data['train']['articles'],
        'clean_summary': bert_data['train']['summaries']
    })
    df_val = pd.DataFrame({
        'clean_article': bert_data['val']['articles'],
        'clean_summary': bert_data['val']['summaries']
    })
    df_test = pd.DataFrame({
        'clean_article': bert_data['test']['articles'],
        'clean_summary': bert_data['test']['summaries']
    })
    
except FileNotFoundError:
    print("⚠️  bert_data.pkl not found, loading from CSV files...")
    df_train = pd.read_csv(PREPROCESS_DIR / "train.csv")
    df_val = pd.read_csv(PREPROCESS_DIR / "val.csv")
    df_test = pd.read_csv(PREPROCESS_DIR / "test.csv")
    print("✓ Loaded from CSV files")

print(f"\n✓ Train set: {len(df_train):,} samples")
print(f"✓ Validation set: {len(df_val):,} samples")
print(f"✓ Test set: {len(df_test):,} samples")
print(f"✓ Total: {len(df_train) + len(df_val) + len(df_test):,} samples")

# Display sample
print("\n📋 Sample from training data:")
display(df_train[['clean_article', 'clean_summary']].head(2))

📂 Loading preprocessed BERT data...

✓ Loaded from bert_data.pkl (optimized BERT format)

✓ Train set: 9,694 samples
✓ Validation set: 549 samples
✓ Test set: 549 samples
✓ Total: 10,792 samples

📋 Sample from training data:


| | clean_article | clean_summary |
|---|---|---|
| 0 | Pandeglang Sebuah ledakan keras terjadi di Kam... | Dua orang tewas seketika akibat ledakan dahsya... |
| 1 | Ottawa Setelah keputusan Dewan Keamanan PBB , ... | Kanada menyetujui tindakan DK PBB dan akan iku... |

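The loading cell above expects `bert_data.pkl` to hold one entry per split, each with parallel `articles` and `summaries` lists. The sketch below checks that layout on a toy dictionary before building dataframes. `check_splits` is an illustrative helper name and the toy data is made up; neither is part of the preprocessing notebook.

```python
def check_splits(data, splits=("train", "val", "test")):
    """Return the number of examples per split, raising if a split is misaligned."""
    sizes = {}
    for split in splits:
        articles = data[split]["articles"]
        summaries = data[split]["summaries"]
        # Every article needs exactly one reference summary
        if len(articles) != len(summaries):
            raise ValueError(
                f"{split}: {len(articles)} articles vs {len(summaries)} summaries"
            )
        sizes[split] = len(articles)
    return sizes


# Toy stand-in for the unpickled object (hypothetical content)
toy_data = {
    "train": {"articles": ["artikel a", "artikel b"], "summaries": ["ringkasan a", "ringkasan b"]},
    "val": {"articles": ["artikel c"], "summaries": ["ringkasan c"]},
    "test": {"articles": ["artikel d"], "summaries": ["ringkasan d"]},
}
sizes = check_splits(toy_data)
print(sizes)
```

Running the same check on the real `bert_data` object would be expected to report the split sizes printed above, assuming the pickle follows this layout.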

In [60]:
# Optional: Reduce dataset size for faster training (development mode)
USE_SAMPLE = False  # Set to True to use a small subset for testing
SAMPLE_SIZE = 1000   # Number of training samples for development

if USE_SAMPLE:
    print(f"⚡ Using sample data (up to {SAMPLE_SIZE} train, {SAMPLE_SIZE//5} val/test samples) for faster development\n")
    df_train = df_train.sample(n=min(SAMPLE_SIZE, len(df_train)), random_state=SEED).reset_index(drop=True)
    df_val = df_val.sample(n=min(SAMPLE_SIZE//5, len(df_val)), random_state=SEED).reset_index(drop=True)
    df_test = df_test.sample(n=min(SAMPLE_SIZE//5, len(df_test)), random_state=SEED).reset_index(drop=True)
    
    print(f"✓ Train: {len(df_train)} samples")
    print(f"✓ Val: {len(df_val)} samples")
    print(f"✓ Test: {len(df_test)} samples")
else:
    print("📊 Using FULL dataset for training")

📊 Using FULL dataset for training


<a id="load-model"></a>
## 3. Load BERT2GPT Model

Load the pre-trained model **cahya/bert2gpt-indonesian-summarization** from Hugging Face.

In [61]:
# Model checkpoint
MODEL_CHECKPOINT = "cahya/bert2gpt-indonesian-summarization"

print(f"🤖 Loading model: {MODEL_CHECKPOINT}\n")

# Workaround for transformers 4.57+ checking optional chat_templates directory
# Suppress warnings about missing optional files
import logging
from huggingface_hub import hf_hub_download

# Temporarily reduce logging level
logging.getLogger("transformers").setLevel(logging.ERROR)
logging.getLogger("huggingface_hub").setLevel(logging.ERROR)

print("🔄 Loading BertTokenizer...")
print("ℹ️  Using IndoBERT tokenizer (compatible with BERT2GPT)\n")

# Download only essential tokenizer files (skip optional chat_templates)
try:
    # Download vocab file
    vocab_file = hf_hub_download(
        repo_id="indobenchmark/indobert-base-p1",
        filename="vocab.txt"
    )
    
    # Load from downloaded files
    tokenizer = BertTokenizer(vocab_file=vocab_file)
    print("✓ Tokenizer loaded from IndoBERT vocab")
    
except Exception:
    # Any download or parsing failure falls back to the multilingual BERT tokenizer
    print("⚠️  Fallback: Using BERT-base multilingual")
    tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
    print("✓ Tokenizer loaded successfully")

# Set special tokens (as per official documentation)
tokenizer.bos_token = tokenizer.cls_token
tokenizer.eos_token = tokenizer.sep_token

print("\n✓ Tokenizer configured:")
print(f"  Vocab size: {len(tokenizer)}")
print(f"  BOS token: {tokenizer.bos_token} (ID: {tokenizer.bos_token_id})")
print(f"  EOS token: {tokenizer.eos_token} (ID: {tokenizer.eos_token_id})")
print(f"  PAD token: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")

# Load BERT2GPT model with error handling
print("\n🔄 Loading BERT2GPT model...")
try:
    # Try loading with ignore_mismatched_sizes
    model = EncoderDecoderModel.from_pretrained(
        MODEL_CHECKPOINT,
        ignore_mismatched_sizes=True
    )
    print("✓ Model loaded successfully")
except Exception as e:
    print(f"⚠️  Error loading model: {str(e)[:100]}")
    print("\n🔄 Trying alternative loading method...")
    
    # Download model files individually
    try:
        from huggingface_hub import snapshot_download
        cache_dir = snapshot_download(
            repo_id=MODEL_CHECKPOINT,
            allow_patterns=["*.bin", "*.json", "*.txt"],
            ignore_patterns=["additional_chat_templates/*"]
        )
        model = EncoderDecoderModel.from_pretrained(cache_dir)
        print("✓ Model loaded from downloaded snapshot")
    except Exception as e2:
        print(f"❌ Could not load model: {str(e2)[:100]}")
        raise

# Restore logging
logging.getLogger("transformers").setLevel(logging.WARNING)
logging.getLogger("huggingface_hub").setLevel(logging.WARNING)

# Set special tokens for generation
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

print("\n✓ Model loaded successfully")
print(f"  Encoder: {model.config.encoder.model_type}")
print(f"  Decoder: {model.config.decoder.model_type}")
print(f"  Total parameters: {model.num_parameters():,}")

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
print(f"\n✓ Model moved to: {device}")

🤖 Loading model: cahya/bert2gpt-indonesian-summarization

🔄 Loading BertTokenizer...
ℹ️  Using IndoBERT tokenizer (compatible with BERT2GPT)

✓ Tokenizer loaded from IndoBERT vocab

✓ Tokenizer configured:
  Vocab size: 30521
  BOS token: [CLS] (ID: 2)
  EOS token: [SEP] (ID: 3)
  PAD token: [PAD] (ID: 0)

🔄 Loading BERT2GPT model...
⚠️  Error loading model: Can't load the model for 'cahya/bert2gpt-indonesian-summarization'. If you were trying to load it fr

🔄 Trying alternative loading method...


Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

✓ Model loaded from downloaded snapshot

✓ Model loaded successfully
  Encoder: bert
  Decoder: gpt2
  Total parameters: 263,424,000

✓ Model moved to: cpu


<a id="prepare-dataset"></a>
## 4. Prepare Dataset for Training

Convert the dataframes to Hugging Face Datasets and tokenize them.

In [62]:
# Convert pandas dataframes to Hugging Face datasets
print("🔄 Converting to Hugging Face Dataset format...\n")

# Create dataset dictionary
dataset_dict = DatasetDict({
    'train': Dataset.from_pandas(df_train[['clean_article', 'clean_summary']]),
    'validation': Dataset.from_pandas(df_val[['clean_article', 'clean_summary']]),
    'test': Dataset.from_pandas(df_test[['clean_article', 'clean_summary']])
})

# Rename columns
dataset_dict = dataset_dict.rename_column('clean_article', 'article')
dataset_dict = dataset_dict.rename_column('clean_summary', 'summary')

print("✓ Dataset created:")
print(dataset_dict)

# Display sample
print("\n📋 Sample data:")
print(dataset_dict['train'][0])

🔄 Converting to Hugging Face Dataset format...

✓ Dataset created:
DatasetDict({
    train: Dataset({
        features: ['article', 'summary'],
        num_rows: 9694
    })
    validation: Dataset({
        features: ['article', 'summary'],
        num_rows: 549
    })
    test: Dataset({
        features: ['article', 'summary'],
        num_rows: 549
    })
})

📋 Sample data:
{'article': 'Pandeglang Sebuah ledakan keras terjadi di Kampung Ciruang , Desa Pejamben , Kecamatan Carita , Pandeglang , Banten , Selasa , sekitar pukul 13 . 30 WIB . Pusat ledakan di sebuah gubuk di kampung yang berjarak satu kilometer dari tempat rekreasi pantai Carita . Akibat ledakan dua orang tewas . Keduanya adalah Kobar dan Andri . Begitu dahsyatnya ledakan hingga tubuh mereka terlempar sejauh 10 meter dari tempat pusat ledakan . Selain merenggut korban jiwa ledakan juga membuat Darmin terluka parah . Belum diketahui penyebab ledakan . Namun , di sebuah rumah yang letaknya sekitar 50 meter dari l

In [63]:
# Tokenization parameters
MAX_INPUT_LENGTH = 512   # Maximum article length
MAX_TARGET_LENGTH = 128  # Maximum summary length

print("📏 Tokenization parameters:")
print(f"  Max input length: {MAX_INPUT_LENGTH}")
print(f"  Max target length: {MAX_TARGET_LENGTH}")

def preprocess_function(examples):
    """
    Tokenize articles and summaries for BERT2GPT model
    """
    # Tokenize inputs (articles)
    model_inputs = tokenizer(
        examples['article'],
        max_length=MAX_INPUT_LENGTH,
        truncation=True,
        padding='max_length'
    )
    
    # Tokenize targets (summaries)
    labels = tokenizer(
        examples['summary'],
        max_length=MAX_TARGET_LENGTH,
        truncation=True,
        padding='max_length'
    )
    
    # Replace padding token id with -100 for loss calculation
    labels['input_ids'] = [
        [(label if label != tokenizer.pad_token_id else -100) for label in labels_example]
        for labels_example in labels['input_ids']
    ]
    
    model_inputs['labels'] = labels['input_ids']
    
    return model_inputs

# Apply tokenization
print("\n🔤 Tokenizing datasets...")
tokenized_datasets = dataset_dict.map(
    preprocess_function,
    batched=True,
    remove_columns=['article', 'summary'],
    desc="Tokenizing"
)

print("\n✓ Tokenization complete:")
print(tokenized_datasets)

# Display tokenized sample
print("\n📋 Sample tokenized data:")
sample = tokenized_datasets['train'][0]
print(f"Input IDs shape: {len(sample['input_ids'])}")
print(f"Attention mask shape: {len(sample['attention_mask'])}")
print(f"Labels shape: {len(sample['labels'])}")
print(f"\nFirst 20 input IDs: {sample['input_ids'][:20]}")
print(f"First 20 labels: {sample['labels'][:20]}")

📏 Tokenization parameters:
  Max input length: 512
  Max target length: 128

🔤 Tokenizing datasets...


Tokenizing:   0%|          | 0/9694 [00:00<?, ? examples/s]

Tokenizing:   0%|          | 0/549 [00:00<?, ? examples/s]

Tokenizing:   0%|          | 0/549 [00:00<?, ? examples/s]


✓ Tokenization complete:
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 9694
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 549
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 549
    })
})

📋 Sample tokenized data:
Input IDs shape: 512
Attention mask shape: 512
Labels shape: 128

First 20 input IDs: [2, 21136, 492, 10884, 2086, 597, 26, 4237, 4667, 901, 30468, 1351, 2969, 188, 9, 30468, 2172, 2203, 155, 30468]
First 20 labels: [2, 662, 232, 6193, 11731, 1597, 10884, 9754, 26, 492, 1351, 26, 2172, 2203, 155, 30468, 21136, 30468, 5116, 30470]

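The `-100` replacement in `preprocess_function` can be checked in isolation on a toy batch. The sketch below assumes a pad id of 0, which matches the tokenizer output shown earlier; `mask_padding` is an illustrative name, not a function from the notebook or from transformers. PyTorch's cross-entropy loss ignores targets equal to `-100` by default, so padded positions do not contribute to the training loss.

```python
def mask_padding(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace padding ids with ignore_index so the loss skips those positions."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]


# Two toy label sequences, right-padded with 0 to a common length
toy_labels = [
    [2, 662, 232, 3, 0, 0],
    [2, 10884, 3, 0, 0, 0],
]
masked = mask_padding(toy_labels)
print(masked)
```

Only the padding positions change; the `[CLS]` (2) and `[SEP]` (3) ids stay in the labels.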

<a id="training-config"></a>
## 5. Training Configuration

Set up the training arguments and metrics.

In [64]:
# Load ROUGE metric
rouge_metric = evaluate.load("rouge")

def compute_metrics(eval_pred):
    """
    Compute ROUGE scores for evaluation
    """
    predictions, labels = eval_pred
    
    # Decode predictions
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # Replace -100 in labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Compute ROUGE scores
    result = rouge_metric.compute(
        predictions=decoded_preds,
        references=decoded_labels,
        use_stemmer=True  # Note: rouge_score stems with the English Porter stemmer
    )
    
    # Extract scores
    result = {key: value * 100 for key, value in result.items()}
    
    # Add mean length of predictions
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}

print("✓ Metrics function defined")

✓ Metrics function defined

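To make the ROUGE-1 number reported by `compute_metrics` more concrete, here is a simplified unigram-overlap F1 in plain Python. It is a sketch under stated assumptions (lowercasing, whitespace tokenization, no stemming) and is not the `rouge_score` implementation, so its values will not match the library exactly. `rouge1_f1` is an illustrative name.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "dua orang tewas akibat ledakan",
    "dua orang tewas seketika akibat ledakan dahsyat",
)
print(round(score * 100, 2))
```

In this example all 5 predicted tokens appear in the 7-token reference, giving precision 1.0 and recall 5/7. `compute_metrics` likewise multiplies the library's scores by 100 before reporting them.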

In [65]:
# Training arguments
# ============================================================
# Adjust these parameters based on your hardware and dataset size
# ============================================================

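# Detect whether preprocessing produced a sampled dataset.
# Note: this reads the preprocessing config, not the USE_SAMPLE flag set earlier in this notebook.
IS_SAMPLE = config.get('use_sample', False)

if IS_SAMPLE:
    print("⚡ Using SAMPLE dataset - Optimized hyperparameters")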
    BATCH_SIZE = 4
    GRADIENT_ACCUM_STEPS = 2
    LEARNING_RATE = 5e-5
    NUM_EPOCHS = 3
    WARMUP_STEPS = 100
    SAVE_STEPS = 200
    EVAL_STEPS = 200
    LOGGING_STEPS = 50
else:
    print("📊 Using FULL dataset - Production hyperparameters")
    BATCH_SIZE = 8
    GRADIENT_ACCUM_STEPS = 4
    LEARNING_RATE = 5e-5
    NUM_EPOCHS = 3
    WARMUP_STEPS = 500
    SAVE_STEPS = 500
    EVAL_STEPS = 500
    LOGGING_STEPS = 100

WEIGHT_DECAY = 0.01      # Weight decay

print("\n⚙️  Hyperparameters:")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Gradient accumulation: {GRADIENT_ACCUM_STEPS}")
print(f"  Effective batch size: {BATCH_SIZE * GRADIENT_ACCUM_STEPS}")

# Create training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir=str(OUTPUT_DIR / "checkpoints"),
    eval_strategy="steps", 
    eval_steps=EVAL_STEPS,
    save_steps=SAVE_STEPS,
    logging_steps=LOGGING_STEPS,
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUM_STEPS,
    weight_decay=WEIGHT_DECAY,
    num_train_epochs=NUM_EPOCHS,
    warmup_steps=WARMUP_STEPS,
    predict_with_generate=True,  # Use generate for evaluation
    generation_max_length=MAX_TARGET_LENGTH,
    save_total_limit=3,  # Keep at most 3 checkpoints on disk
    load_best_model_at_end=True,
    metric_for_best_model="rouge1",
    greater_is_better=True,
    fp16=torch.cuda.is_available(),  # Use mixed precision if GPU available
    push_to_hub=False,
    report_to=["tensorboard"],
    seed=SEED,
    # Memory optimization
    gradient_checkpointing=not IS_SAMPLE,
    optim="adamw_torch",
    # Performance optimization
    dataloader_num_workers=0,  # 0 for Windows, 2-4 for Linux
    remove_unused_columns=True,
)

print("\n✓ Training Arguments configured:")
print(f"  Output directory: {training_args.output_dir}")
print(f"  Effective batch size: {BATCH_SIZE * GRADIENT_ACCUM_STEPS}")
print(f"  Number of epochs: {NUM_EPOCHS}")
print(f"  FP16 training: {training_args.fp16}")
print(f"  Gradient checkpointing: {training_args.gradient_checkpointing}")
print(f"  Device: {training_args.device}")

⚡ Using SAMPLE dataset - Optimized hyperparameters

⚙️  Hyperparameters:
  Batch size: 4
  Gradient accumulation: 2
  Effective batch size: 8

✓ Training Arguments configured:
  Output directory: output\bert2gpt_finetuned\checkpoints
  Effective batch size: 8
  Number of epochs: 3
  FP16 training: False
  Gradient checkpointing: False
  Device: cpu

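The batch size, gradient accumulation, and epoch settings together determine how many optimizer steps the run takes, which in turn determines how `warmup_steps`, `eval_steps`, and `save_steps` fall within training. The sketch below estimates that count for the configuration printed above. `total_optimizer_steps` is an illustrative helper, and the rounding is an assumption; the exact count used by the Trainer may differ slightly between transformers versions.

```python
import math


def total_optimizer_steps(num_samples, per_device_batch_size,
                          grad_accum_steps, num_epochs, num_devices=1):
    """Estimate the number of optimizer updates over a full training run."""
    # Mini-batches the dataloader yields per epoch (last batch may be partial)
    batches_per_epoch = math.ceil(num_samples / (per_device_batch_size * num_devices))
    # One optimizer update happens every grad_accum_steps mini-batches
    updates_per_epoch = max(math.ceil(batches_per_epoch / grad_accum_steps), 1)
    return updates_per_epoch * num_epochs


# Values from this notebook: 9,694 training samples, batch 4, accumulation 2, 3 epochs
steps = total_optimizer_steps(9694, 4, 2, 3)
print(steps)  # 3636
```

With about 3,636 updates in total, the 100 warmup steps cover roughly the first 3% of training and evaluation every 200 steps gives about 18 evaluation points.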

In [66]:
# Data collator for seq2seq
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True
)

print("✓ Data collator created")

✓ Data collator created

<a id="fine-tuning"></a>
## 6. Fine-tuning

Fine-tune the model with `Seq2SeqTrainer`.

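The training cell that follows calls `trainer.train()`, so a `Seq2SeqTrainer` has to exist at that point. A minimal sketch of how it could be assembled from the objects defined in the earlier cells is given here. This is an assumption about the setup, not a record of the exact cell used for the run: `build_trainer` is a hypothetical helper, and `processing_class` is the argument name in recent transformers releases (older releases take `tokenizer=` instead).

```python
def build_trainer(model, training_args, tokenized_datasets,
                  data_collator, compute_metrics, tokenizer):
    """Assemble a Seq2SeqTrainer from the notebook's objects (hypothetical helper)."""
    # Imported lazily so the helper can be defined before transformers is loaded
    from transformers import Seq2SeqTrainer

    return Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        # Assumption: recent transformers; older versions expect tokenizer=tokenizer
        processing_class=tokenizer,
    )


# In the notebook this would be called as:
# trainer = build_trainer(model, training_args, tokenized_datasets,
#                         data_collator, compute_metrics, tokenizer)
```

Passing the validation split as `eval_dataset` is what lets `eval_strategy="steps"` and `load_best_model_at_end=True` select a checkpoint by validation ROUGE-1.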
In [68]:
# Start training
print("🚀 Starting fine-tuning...\n")
print("="*60)

train_result = trainer.train()

print("\n" + "="*60)
print("✅ Training complete!\n")

# Save training metrics
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)

print("\n📊 Training Metrics:")
for key, value in metrics.items():
    print(f"  {key}: {value}")

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 3, 'bos_token_id': 2, 'pad_token_id': 0}.


🚀 Starting fine-tuning...



| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 200 | 13.4441 | 6.896866 | 17.5468 | 5.0797 | 14.5508 | 14.5634 | 68.9563 |
| 400 | 12.3482 | 6.494917 | 19.9398 | 6.4864 | 16.7701 | 16.7445 | 64.388 |
| 600 | 11.4265 | 6.121252 | 22.6164 | 8.0668 | 18.6 | 18.5954 | 48.2131 |
| 800 | 10.6301 | 5.945833 | 25.4032 | 9.7908 | 20.6506 | 20.6434 | 62.8616 |
| 1000 | 10.2796 | 5.726552 | 26.7775 | 11.0062 | 21.7067 | 21.717 | 50.6266 |
| 1200 | 10.0003 | 5.6113 | 28.6207 | 12.5093 | 23.4143 | 23.4332 | 49.9727 |
| 1400 | 8.621 | 5.511267 | 29.7873 | 13.2561 | 24.2126 | 24.2138 | 55.5683 |
| 1600 | 8.2703 | 5.292468 | 30.497 | 13.8955 | 24.7285 | 24.727 | 49.1111 |
| 1800 | 8.2855 | 5.29543 | 31.1056 | 14.3093 | 24.9988 | 25.0117 | 73.9854 |
| 2000 | 8.2928 | 5.238785 | 31.2782 | 14.5164 | 25.3245 | 25.3173 | 62.6849 |


There were missing keys in the checkpoint model loaded: ['decoder.lm_head.weight'].



✅ Training complete!

***** train metrics *****
  epoch                    =         3.0
  total_flos               =  16563427GF
  train_loss               =      8.8774
  train_runtime            = 23:28:09.44
  train_samples_per_second =       0.344
  train_steps_per_second   =       0.043

📊 Training Metrics:
  train_runtime: 84489.4454
  train_samples_per_second: 0.344
  train_steps_per_second: 0.043
  total_flos: 1.778484465893376e+16
  train_loss: 8.877429047302313
  epoch: 3.0


<a id="evaluation"></a>
## 7. Evaluation

Evaluate the model on the test set.

In [69]:
# Evaluate on test set
print("🔍 Evaluating on test set...\n")

test_results = trainer.evaluate(
    eval_dataset=tokenized_datasets['test'],
    max_length=MAX_TARGET_LENGTH,
    num_beams=4
)

print("\n✅ Evaluation complete!\n")
print("📊 Test Set Results:")
print("="*60)
for key, value in test_results.items():
    print(f"  {key}: {value}")
print("="*60)

# Save test metrics
trainer.log_metrics("test", test_results)
trainer.save_metrics("test", test_results)

🔍 Evaluating on test set...




✅ Evaluation complete!

📊 Test Set Results:
  eval_loss: 4.762310028076172
  eval_rouge1: 34.1764
  eval_rouge2: 17.3519
  eval_rougeL: 28.1459
  eval_rougeLsum: 28.1901
  eval_gen_len: 61.9199
  eval_runtime: 1260.3225
  eval_samples_per_second: 0.436
  eval_steps_per_second: 0.109
  epoch: 3.0
***** test metrics *****
  epoch                   =        3.0
  eval_gen_len            =    61.9199
  eval_loss               =     4.7623
  eval_rouge1             =    34.1764
  eval_rouge2             =    17.3519
  eval_rougeL             =    28.1459
  eval_rougeLsum          =    28.1901
  eval_runtime            = 0:21:00.32
  eval_samples_per_second =      0.436
  eval_steps_per_second   =      0.109


<a id="inference"></a>
## 8. Inference & Testing

Test the model on a few example articles.

In [70]:
def generate_summary(article_text, num_beams=10, min_length=20, max_length=80):
    """
    Generate summary for a given article using the fine-tuned model
    """
    # Tokenize input
    input_ids = tokenizer.encode(article_text, return_tensors='pt', max_length=MAX_INPUT_LENGTH, truncation=True)
    input_ids = input_ids.to(device)
    
    # Generate summary
    summary_ids = model.generate(
        input_ids,
        min_length=min_length,
        max_length=max_length,
        num_beams=num_beams,
        repetition_penalty=2.5,
        length_penalty=1.0,
        early_stopping=True,
        no_repeat_ngram_size=2,
        use_cache=True,
        do_sample=True,
        temperature=0.8,
        top_k=50,
        top_p=0.95
    )
    
    # Decode summary
    summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    
    return summary_text

print("✓ Summary generation function ready")

✓ Summary generation function ready

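Among the generation settings in `generate_summary`, `no_repeat_ngram_size=2` prevents the decoder from emitting any bigram that already appears in the output. The sketch below checks that property on a list of tokens, which can be handy when inspecting generated summaries. `has_repeated_ngram` is an illustrative helper and not part of transformers; it works on whitespace tokens, whereas the model enforces the constraint on token ids.

```python
def has_repeated_ngram(tokens, n=2):
    """Return True if any n-gram occurs more than once in the token list."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False


repeated = has_repeated_ngram("kode etik dpr dinilai kode etik".split())
clean = has_repeated_ngram("kode etik dpr dinilai tak berarti".split())
print(repeated, clean)  # True False
```

Because the check here is word-level and the model's constraint is subword-level, a decoded summary can occasionally repeat a word pair even though no token-id bigram was repeated.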

In [71]:
# Test with a few examples from the test set
NUM_EXAMPLES = 5

print(f"📝 Testing model with {NUM_EXAMPLES} examples from test set:\n")
print("="*80)

for i in range(NUM_EXAMPLES):
    # Get example
    example = dataset_dict['test'][i]
    article = example['article']
    reference_summary = example['summary']
    
    # Generate summary
    generated_summary = generate_summary(article)
    
    # Display results
    print(f"\n🔹 Example {i+1}:")
    print(f"\nArticle (first 200 chars):\n{article[:200]}...\n")
    print(f"Reference Summary:\n{reference_summary}\n")
    print(f"Generated Summary:\n{generated_summary}\n")
    print("="*80)

📝 Testing model with 5 examples from test set:


🔹 Example 1:

Article (first 200 chars):
Jakarta Pengamat politik Andi Malarangeng , baru-baru ini , menilai pembaharuan kode etik DPR tidak akan berarti banyak dalam mengubah etika berpolitik para anggota Dewan . Sebab , pengaturan masalah ...

Reference Summary:
Kode etik DPR dinilai tak akan berarti banyak dalam mengubah etika berpolitik para anggota Dewan . Dalam penegakan kode etik anggota DPR semestinya juga melibatkan pimpinan Parpol

Generated Summary:
ribuan masalah etika belum ditetapkan dalam undang - undang yang akan disampaikan pekan ini masih terlihat samar. menurut andi menilai sudah seharusnya kode etik tersebut



In [72]:
# Interactive testing: supply your own article
print("🎯 Interactive Testing\n")
print("Enter your article below (or use the provided example):\n")

# Example article
SAMPLE_ARTICLE = """
Jakarta - Presiden Joko Widodo (Jokowi) mengumumkan kebijakan baru terkait pengembangan 
infrastruktur digital di Indonesia. Pemerintah akan mengalokasikan dana triliunan rupiah 
untuk mempercepat pembangunan jaringan internet di daerah terpencil. Langkah ini diharapkan 
dapat mengurangi kesenjangan digital antara kota besar dan daerah pedalaman. Menteri 
Komunikasi dan Informatika menyatakan bahwa program ini akan dimulai tahun depan dengan 
target mencakup 10.000 desa dalam tahap pertama.
""".strip()

# Uncomment the line below for manual input
# ARTICLE_TO_SUMMARIZE = input("Article: ")

# Or use the example
ARTICLE_TO_SUMMARIZE = SAMPLE_ARTICLE

if ARTICLE_TO_SUMMARIZE.strip():
    print(f"\n📄 Original Article:\n{ARTICLE_TO_SUMMARIZE}\n")
    
    # Generate summary
    print("🤖 Generating summary...\n")
    summary = generate_summary(ARTICLE_TO_SUMMARIZE)
    
    print(f"📝 Generated Summary:\n{summary}\n")
    print("\n📊 Statistics:")
    print(f"  Article length: {len(ARTICLE_TO_SUMMARIZE.split())} words")
    print(f"  Summary length: {len(summary.split())} words")
    print(f"  Compression ratio: {len(summary.split()) / len(ARTICLE_TO_SUMMARIZE.split()) * 100:.1f}%")
else:
    print("⚠️  No article provided")

🎯 Interactive Testing

Enter your article below (or use the provided example):


📄 Original Article:
Jakarta - Presiden Joko Widodo (Jokowi) mengumumkan kebijakan baru terkait pengembangan 
infrastruktur digital di Indonesia. Pemerintah akan mengalokasikan dana triliunan rupiah 
untuk mempercepat pembangunan jaringan internet di daerah terpencil. Langkah ini diharapkan 
dapat mengurangi kesenjangan digital antara kota besar dan daerah pedalaman. Menteri 
Komunikasi dan Informatika menyatakan bahwa program ini akan dimulai tahun depan dengan 
target mencakup 10.000 desa dalam tahap pertama.

🤖 Generating summary...

📝 Generated Summary:
menteri komunikasi dan informatika menyatakan bahwa program ini akan dimulai tahun depan dengan target michel 10. 000 desa dalam tahap pertama


📊 Statistics:
  Article length: 62 words
  Summary length: 21 words
  Compression ratio: 33.9%

<a id="save-model"></a>
## 9. Save Model

Save the fine-tuned model for deployment.

In [73]:
# Save final model
FINAL_MODEL_DIR = OUTPUT_DIR / "final_model"
FINAL_MODEL_DIR.mkdir(parents=True, exist_ok=True)

print(f"💾 Saving final model to {FINAL_MODEL_DIR}...\n")

# Save model and tokenizer
trainer.save_model(str(FINAL_MODEL_DIR))
tokenizer.save_pretrained(str(FINAL_MODEL_DIR))

print("✓ Model saved")
print("✓ Tokenizer saved")

# Save training info
training_info = {
    "model_checkpoint": MODEL_CHECKPOINT,
    "num_epochs": NUM_EPOCHS,
    "batch_size": BATCH_SIZE,
    "learning_rate": LEARNING_RATE,
    "max_input_length": MAX_INPUT_LENGTH,
    "max_target_length": MAX_TARGET_LENGTH,
    "train_samples": len(tokenized_datasets['train']),
    "val_samples": len(tokenized_datasets['validation']),
    "test_samples": len(tokenized_datasets['test']),
    "test_metrics": test_results
}

with open(FINAL_MODEL_DIR / "training_info.json", 'w') as f:
    json.dump(training_info, f, indent=2)

print("✓ Training info saved\n")

print("="*60)
print("🎉 MODEL SAVED SUCCESSFULLY!")
print("="*60)
print(f"\n📁 Model location: {FINAL_MODEL_DIR}")
print("\n💡 To load the model later:")
print("\nfrom transformers import BertTokenizer, EncoderDecoderModel")
# Print the path with forward slashes so the snippet can be pasted safely on Windows,
# where "\b" and "\f" inside a quoted path would be read as escape sequences
model_path = FINAL_MODEL_DIR.as_posix()
print(f"\ntokenizer = BertTokenizer.from_pretrained('{model_path}')")
print("tokenizer.bos_token = tokenizer.cls_token")
print("tokenizer.eos_token = tokenizer.sep_token")
print(f"model = EncoderDecoderModel.from_pretrained('{model_path}')")

💾 Saving final model to output\bert2gpt_finetuned\final_model...

✓ Model saved
✓ Tokenizer saved
✓ Training info saved

🎉 MODEL SAVED SUCCESSFULLY!

📁 Model location: output\bert2gpt_finetuned\final_model

💡 To load the model later:

from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained('output/bert2gpt_finetuned/final_model')
tokenizer.bos_token = tokenizer.cls_token
tokenizer.eos_token = tokenizer.sep_token
model = EncoderDecoderModel.from_pretrained('output/bert2gpt_finetuned/final_model')

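The `training_info.json` file written above can be read back later to recover the hyperparameters and test metrics without reloading the model. The sketch below shows one way to do that with the standard library. `load_training_info` and `summarize_rouge` are illustrative helper names, and the demo writes a small stand-in file to a temporary directory rather than touching the real output folder.

```python
import json
import tempfile
from pathlib import Path


def load_training_info(model_dir):
    """Read training_info.json from a saved model directory."""
    info_path = Path(model_dir) / "training_info.json"
    with open(info_path, "r", encoding="utf-8") as f:
        return json.load(f)


def summarize_rouge(info):
    """Keep only the ROUGE entries from the stored test metrics."""
    metrics = info.get("test_metrics", {})
    return {k: v for k, v in metrics.items() if k.startswith("eval_rouge")}


# Demo on a stand-in file with the same layout as the one saved above
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_info = {
        "num_epochs": 3,
        "test_metrics": {"eval_loss": 4.7623, "eval_rouge1": 34.1764, "eval_rouge2": 17.3519},
    }
    (Path(tmp_dir) / "training_info.json").write_text(json.dumps(demo_info), encoding="utf-8")
    rouge = summarize_rouge(load_training_info(tmp_dir))

print(rouge)
```

Pointing `load_training_info` at the real `final_model` directory should return the dictionary saved in the cell above, assuming the file has not been moved.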
---

## 📊 Training Summary

### ✅ Completed Steps:
1. ✓ Loaded preprocessed Liputan6 dataset
2. ✓ Loaded cahya/bert2gpt-indonesian-summarization model
3. ✓ Tokenized dataset for BERT2GPT
4. ✓ Fine-tuned model with Seq2SeqTrainer
5. ✓ Evaluated on test set with ROUGE metrics
6. ✓ Tested inference with sample articles
7. ✓ Saved fine-tuned model for deployment

### 📈 Model Performance:
On the test set the fine-tuned model scored ROUGE-1 34.18, ROUGE-2 17.35, and ROUGE-L 28.15. The full metrics are written by `trainer.save_metrics` to **test_results.json** in the trainer's output directory (`output/bert2gpt_finetuned/checkpoints`) and are also stored under `test_metrics` in `training_info.json`.

### 🚀 Next Steps:
1. Deploy the model to production
2. Integrate it with a web application or API
3. A/B test it against the baseline model
4. Continue fine-tuning on new data

### 📚 References:
- Model: https://huggingface.co/cahya/bert2gpt-indonesian-summarization
- Dataset: Liputan6 Indonesian Summarization
- Framework: Hugging Face Transformers