# ArcFace Training Resume - Fix Overfitting

This notebook resumes training from the current checkpoint with a new config to address overfitting.

**Current status:**
- Train Acc: 90.27%
- Val Acc: 77.50%
- Gap: ~13 percentage points (overfitting)

**Goals:**
- Reduce the gap to <8%
- Raise Val Acc to ~80-82%

## Setup

In [None]:
# Install dependencies
!pip install -q pyyaml tqdm pillow

In [None]:
# Clone the repository (if the code is not already present)
# !git clone https://github.com/your-username/FaceRecognition.git
# %cd FaceRecognition

# Or copy the code from a Kaggle dataset
!cp -r /kaggle/input/your-code-dataset/FaceRecognition .
%cd FaceRecognition

## Copy Checkpoint and History

In [None]:
# Create the output directories
!mkdir -p /kaggle/working/checkpoints/arcface
!mkdir -p /kaggle/working/logs/arcface

# Copy the checkpoint from the input dataset
# Replace 'your-checkpoint-dataset' with the name of your dataset
!cp /kaggle/input/your-checkpoint-dataset/arcface_best.pth /kaggle/working/checkpoints/arcface/
!cp /kaggle/input/your-checkpoint-dataset/training_history.json /kaggle/working/logs/arcface/

# Verify
!ls -lh /kaggle/working/checkpoints/arcface/
!ls -lh /kaggle/working/logs/arcface/
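
Beyond listing the directories, it can be useful to confirm that the copied files are actually usable before starting a long run. The sketch below is one way to do that with the standard library only: it checks that the checkpoint exists and is non-empty, and that the history file parses as JSON and has the top-level `history` key that the plotting cell reads. The function name `check_artifacts` is illustrative and not part of the repository.

```python
import json
from pathlib import Path


def check_artifacts(ckpt_path, history_path):
    """Return a list of problems; an empty list means both files look usable."""
    problems = []
    ckpt = Path(ckpt_path)
    hist = Path(history_path)

    if not ckpt.is_file():
        problems.append(f"missing checkpoint: {ckpt}")
    elif ckpt.stat().st_size == 0:
        problems.append(f"empty checkpoint: {ckpt}")

    if not hist.is_file():
        problems.append(f"missing history: {hist}")
    else:
        try:
            with hist.open("r") as f:
                data = json.load(f)
        except json.JSONDecodeError as exc:
            problems.append(f"history is not valid JSON: {exc}")
        else:
            # The plotting cell indexes history['history'], so require that key.
            if not isinstance(data, dict) or "history" not in data:
                problems.append("history file has no top-level 'history' key")
    return problems


problems = check_artifacts(
    "/kaggle/working/checkpoints/arcface/arcface_best.pth",
    "/kaggle/working/logs/arcface/training_history.json",
)
print(problems or "checkpoint and history look OK")
```

This does not open the `.pth` file itself, so it cannot detect a corrupted checkpoint; it only catches a failed or partial copy.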

## View the New Config

In [None]:
# Print the new config
!cat configs/arcface_resume_fix_overfit.yaml
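
Reading the YAML by eye makes it easy to miss a value that was not changed as intended. As a rough sketch, once both configs are loaded into dicts (for example with `yaml.safe_load`), a small diff helper can list what actually changed. The flat key names below are hypothetical and the real file may nest them differently; the values are the ones listed in the Notes section of this notebook.

```python
def diff_config(old, new):
    """Return {key: (old_value, new_value)} for every key whose value changed."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Hypothetical flat key names; values taken from the "Config Changes" notes.
previous = {"weight_decay": 0.0005, "label_smoothing": 0.0, "margin": 0.25, "lr": 1e-5}
resumed = {"weight_decay": 0.0015, "label_smoothing": 0.1, "margin": 0.35, "lr": 0.002}

for key, (before, after) in diff_config(previous, resumed).items():
    print(f"{key}: {before} -> {after}")
```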

## Resume Training with the New Config

In [None]:
# Resume training with the optimizer and scheduler reset
# This can help the model escape a local minimum

!python models/arcface/resume_arcface_fix_overfit.py \
    --config configs/arcface_resume_fix_overfit.yaml \
    --checkpoint /kaggle/working/checkpoints/arcface/arcface_best.pth \
    --data_dir /kaggle/input/celeba-aligned-balanced \
    --reset_optimizer \
    --reset_scheduler
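
The internals of `resume_arcface_fix_overfit.py` are not shown in this notebook. A plausible reading of the two reset flags is that the model weights are still loaded from the checkpoint, while the saved optimizer and scheduler state is discarded so that both are rebuilt from the new config (new learning rate, new weight decay). The sketch below illustrates that selection logic on a plain dict; the checkpoint key names are assumptions, not the script's confirmed layout.

```python
def select_state_to_restore(checkpoint, reset_optimizer=False, reset_scheduler=False):
    """Pick which parts of a checkpoint dict to restore when resuming.

    Key names are hypothetical; the real checkpoint layout may differ.
    """
    # Model weights are always restored: resuming means continuing from them.
    restore = {"model_state_dict": checkpoint["model_state_dict"]}
    # Optimizer state (e.g. momentum buffers, old LR) is kept only if not reset.
    if not reset_optimizer and "optimizer_state_dict" in checkpoint:
        restore["optimizer_state_dict"] = checkpoint["optimizer_state_dict"]
    # Scheduler state (e.g. position in the LR schedule) is kept only if not reset.
    if not reset_scheduler and "scheduler_state_dict" in checkpoint:
        restore["scheduler_state_dict"] = checkpoint["scheduler_state_dict"]
    return restore


# Toy checkpoint standing in for the real .pth contents.
toy_checkpoint = {
    "model_state_dict": {"w": [0.1, 0.2]},
    "optimizer_state_dict": {"lr": 1e-5},
    "scheduler_state_dict": {"last_epoch": 40},
}
print(sorted(select_state_to_restore(toy_checkpoint, True, True)))
print(sorted(select_state_to_restore(toy_checkpoint)))
```

Under this reading, dropping the optimizer state is what lets the new learning rate and weight decay take effect instead of the values stored in the checkpoint.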

## View Training History

In [None]:
import json
import matplotlib.pyplot as plt

# Load history
with open('/kaggle/working/logs/arcface/training_history.json', 'r') as f:
    history = json.load(f)

# Plot
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Loss
axes[0, 0].plot(history['history']['epoch'], history['history']['train_loss'], label='Train Loss')
axes[0, 0].plot(history['history']['epoch'], history['history']['val_loss'], label='Val Loss')
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].set_title('Loss')
axes[0, 0].legend()
axes[0, 0].grid(True)

# Accuracy
axes[0, 1].plot(history['history']['epoch'], history['history']['train_acc'], label='Train Acc')
axes[0, 1].plot(history['history']['epoch'], history['history']['val_acc'], label='Val Acc')
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('Accuracy (%)')
axes[0, 1].set_title('Accuracy')
axes[0, 1].legend()
axes[0, 1].grid(True)

# Accuracy Gap
gap = [train - val for train, val in zip(history['history']['train_acc'], history['history']['val_acc'])]
axes[1, 0].plot(history['history']['epoch'], gap)
axes[1, 0].set_xlabel('Epoch')
axes[1, 0].set_ylabel('Gap (%)')
axes[1, 0].set_title('Train-Val Accuracy Gap')
axes[1, 0].grid(True)

# Learning Rate
axes[1, 1].plot(history['history']['epoch'], history['history']['learning_rate'])
axes[1, 1].set_xlabel('Epoch')
axes[1, 1].set_ylabel('Learning Rate')
axes[1, 1].set_title('Learning Rate Schedule')
axes[1, 1].set_yscale('log')
axes[1, 1].grid(True)

plt.tight_layout()
plt.savefig('/kaggle/working/training_history.png', dpi=150)
plt.show()

# Print summary
print(f"\n{'='*60}")
print("TRAINING SUMMARY")
print(f"{'='*60}")
print(f"Total epochs: {history['total_epochs']}")
print(f"Best Val Acc: {history['best_val_acc']:.2f}%")
print(f"Best Val Loss: {history['best_val_loss']:.4f}")
print(f"Final Train Acc: {history['history']['train_acc'][-1]:.2f}%")
print(f"Final Val Acc: {history['history']['val_acc'][-1]:.2f}%")
print(f"Final Gap: {history['history']['train_acc'][-1] - history['history']['val_acc'][-1]:.2f}%")
print(f"{'='*60}")
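
The summary above reports the final epoch, but the saved best checkpoint usually corresponds to the epoch with the highest validation accuracy. The following sketch finds that epoch and compares it with this notebook's targets (gap below 8 points, validation accuracy of at least 80%). It assumes the same `epoch`, `train_acc`, and `val_acc` lists used by the plotting code; the function name and the toy data are illustrative only.

```python
def summarize_against_goals(hist, max_gap=8.0, min_val_acc=80.0):
    """Locate the best-validation epoch and compare it with the targets."""
    val = hist["val_acc"]
    train = hist["train_acc"]
    # Index of the epoch with the highest validation accuracy.
    best_idx = max(range(len(val)), key=val.__getitem__)
    gap = train[best_idx] - val[best_idx]
    return {
        "best_epoch": hist["epoch"][best_idx],
        "val_acc": val[best_idx],
        "gap": gap,
        "gap_ok": gap < max_gap,
        "val_ok": val[best_idx] >= min_val_acc,
    }


# Toy data for illustration; not real training results.
toy = {
    "epoch": [1, 2, 3],
    "train_acc": [85.0, 88.0, 90.0],
    "val_acc": [78.0, 81.0, 80.0],
}
print(summarize_against_goals(toy))
```

In this notebook the call would be `summarize_against_goals(history['history'])`.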

## Download Checkpoints

In [None]:
# List checkpoints
!ls -lh /kaggle/working/checkpoints/arcface/

# Create an archive for download
!cd /kaggle/working && tar -czf arcface_checkpoints.tar.gz checkpoints/ logs/
!ls -lh /kaggle/working/arcface_checkpoints.tar.gz

## Notes

### Config Changes
- **Weight Decay**: 0.0005 → 0.0015 (stronger regularization)
- **Label Smoothing**: 0.0 → 0.1 (reduces overconfidence)
- **Margin**: 0.25 → 0.35 (stricter)
- **Learning Rate**: reset from 1e-5 to 0.002
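
To make two of these changes concrete, the sketch below shows their numeric effect. ArcFace replaces the true-class cosine cos(θ) with cos(θ + m), so a larger margin m lowers the true-class score at the same angle and forces embeddings closer to their class center. Label smoothing with ε moves the target from a one-hot vector to (1 − ε) on the true class plus ε spread uniformly over all classes. The angle of 40° and the 4-class setup are arbitrary illustration values, and the smoothing formula is the common uniform variant, which may differ from what the training script uses.

```python
import math


def arcface_target_cosine(theta, margin):
    """Cosine used for the true class after adding the angular margin."""
    return math.cos(theta + margin)


def smoothed_targets(num_classes, target, eps):
    """Uniform label smoothing: (1 - eps) one-hot plus eps / num_classes everywhere."""
    base = eps / num_classes
    return [base + (1.0 - eps) if i == target else base for i in range(num_classes)]


theta = math.radians(40)  # hypothetical angle between embedding and class center
print(f"no margin: cos(theta) = {math.cos(theta):.3f}")
for m in (0.25, 0.35):
    print(f"m = {m}: cos(theta + m) = {arcface_target_cosine(theta, m):.3f}")

print(smoothed_targets(4, target=2, eps=0.1))
```

The plain cos(θ + m) form only behaves monotonically while θ + m stays below π; implementations handle larger angles with a separate rule, which is what options such as `easy_margin` typically control.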

### Expected Results
- Val Acc: 77.5% → 80-82%
- Train-Val Gap: 13% → <8%
- Training time: ~2-3 hours (30 epochs)

### Troubleshooting
If there is no improvement after 10 epochs:
- Raise the LR to 0.003-0.005
- Lower weight_decay to 0.001
- Re-enable easy_margin
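
"No improvement after 10 epochs" can be checked mechanically rather than by eye. A minimal sketch, assuming the baseline is the 77.50% validation accuracy from before the resume and that `val_acc` contains only the resumed epochs (if the history file also holds the earlier epochs, slice them off first). The function name, `patience`, and `min_delta` are illustrative choices, not settings from the config.

```python
def no_improvement(val_acc, baseline, patience=10, min_delta=0.0):
    """True if none of the first `patience` epochs beat `baseline` by more than `min_delta`."""
    if len(val_acc) < patience:
        return False  # not enough epochs yet to decide
    return max(val_acc[:patience]) <= baseline + min_delta


# Toy sequences for illustration; not real training results.
stalled = [77.0, 77.2, 76.9, 77.4, 77.1, 77.3, 77.0, 77.2, 77.5, 77.1]
improving = [77.0, 77.6, 78.1, 78.4, 78.9, 79.2, 79.0, 79.5, 79.8, 80.1]
print(no_improvement(stalled, baseline=77.5))
print(no_improvement(improving, baseline=77.5))
```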