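# 🚀 VisioNova: DeBERTa-v3 AI Text Detection Training

This notebook fine-tunes the `microsoft/deberta-v3-base` model to detect AI-generated text.
As configured here (VS Code Colab extension mode), models and checkpoints are saved to local storage under `/content/VisioNova_Models/`, because `drive.mount` is not supported in that environment. Download the model at the end of the run, or copy it to **Google Drive** if you run the notebook in the browser.

## ⚡ Quick Start
1. **Enable GPU**: Go to `Runtime > Change runtime type > T4 GPU` (or L4/A100 for faster training)
2. **Configure storage**: In the browser you can mount Google Drive; in the VS Code Colab extension the notebook uses `/content` instead
3. **Run all cells**: `Runtime > Run all`
4. **Wait for training** (the notebook estimates ~2-4 hours on a T4 GPU with the full dataset)
5. **Model is saved to**: `/content/VisioNova_Models/` (`MyDrive/VisioNova_Models/` if you copy it to Drive)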

## 📁 Output Directory Structure
```
/content/    (or MyDrive/ if copied to Google Drive)
└── VisioNova_Models/
    └── DeBERTa_v3_YYYYMMDD_HHMMSS/
        ├── config.json
        ├── model.safetensors
        ├── tokenizer.json
        ├── tokenizer_config.json
        ├── special_tokens_map.json
        ├── spm.model
        ├── training_info.json
        └── checkpoints/
            ├── checkpoint-<step>/    (one per epoch, named by global step)
            ├── checkpoint-<step>/
            └── checkpoint-<step>/
```

---

## 📦 1. Install Dependencies

In [5]:
# Install required packages
!pip install -q transformers datasets accelerate scikit-learn sentencepiece safetensors

# Verify installations
import importlib
packages = ['transformers', 'datasets', 'accelerate', 'sklearn', 'sentencepiece']
missing = []
for pkg in packages:
    try:
        importlib.import_module(pkg)
        print(f"✅ {pkg}")
    except ImportError:
        missing.append(pkg)
        print(f"❌ {pkg} - Please reinstall")

# Only report success when every import worked
if missing:
    print(f"\n❌ Missing packages: {missing}")
else:
    print("\n✅ All dependencies installed!")

✅ transformers
✅ datasets
✅ accelerate
✅ sklearn
✅ sentencepiece

✅ All dependencies installed!


## 🔧 2. Storage Setup & Configuration

In [6]:
import os

# ==========================================
# VS CODE COLAB EXTENSION - LOCAL STORAGE
# ==========================================
# Note: drive.mount is not supported in VS Code Colab extension
# Model will be saved locally and can be downloaded at the end

# Use local storage instead of Google Drive
LOCAL_STORAGE = "/content"

print("📁 Using local storage (VS Code Colab extension mode)")
print(f"   Path: {LOCAL_STORAGE}")
print("\n⚠️  Note: Model will be saved locally.")
print("   You'll download it at the end of training.")
print("✅ Storage configured!")

📁 Using local storage (VS Code Colab extension mode)
   Path: /content

⚠️  Note: Model will be saved locally.
   You'll download it at the end of training.
✅ Storage configured!


In [7]:
import os
import torch
import numpy as np
from datetime import datetime

# ==========================================
# GPU CHECK - REQUIRED!
# ==========================================
if not torch.cuda.is_available():
    raise RuntimeError(
        "\n\n🛑 GPU NOT DETECTED! 🛑\n"
        "1. Go to: Runtime > Change runtime type\n"
        "2. Select 'T4 GPU' (or L4/A100) from Hardware Accelerator\n"
        "3. Click Save and re-run this cell\n"
    )
else:
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"✅ GPU Detected: {gpu_name}")
    print(f"   Memory: {gpu_memory:.1f} GB")
    
    # Recommend batch size based on GPU memory
    if gpu_memory >= 40:  # A100
        recommended_batch = 32
    elif gpu_memory >= 20:  # L4/A10
        recommended_batch = 16
    else:  # T4
        recommended_batch = 8
    print(f"   Recommended batch size: {recommended_batch}")

# ==========================================
# CONFIGURATION - MODIFY AS NEEDED
# ==========================================

# Dataset Options:
# - "artem9k/ai-text-detection-pile" (~1.4M samples, recommended for production)
# - "Hello-SimpleAI/HC3" (Human ChatGPT Comparison, ~40k samples)
# - "aadityaubhat/GPT-wiki-intro" (GPT vs Wikipedia, ~150k samples)
DATASET_NAME = "artem9k/ai-text-detection-pile"

# Model
MODEL_ID = "microsoft/deberta-v3-base"

# Training Hyperparameters
EPOCHS = 3              # Number of training epochs (3-5 recommended)
BATCH_SIZE = 8          # Reduce to 4 if you get OOM errors
LEARNING_RATE = 2e-5    # Standard for fine-tuning
MAX_LENGTH = 512        # Max token length (512 for DeBERTa)
EVAL_SPLIT = 0.1        # 10% for validation
WARMUP_RATIO = 0.1      # Warmup steps as ratio of total steps
WEIGHT_DECAY = 0.01     # L2 regularization

# Limit dataset size (set to None for full dataset)
# Recommended: 50000 for quick test, 200000 for moderate, None for full
MAX_SAMPLES = None  # e.g., 50000 for quick test, None for full

# LOCAL paths (VS Code Colab extension doesn't support Google Drive)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
MODEL_SAVE_DIR = f"/content/VisioNova_Models/DeBERTa_v3_{timestamp}"
CHECKPOINT_DIR = f"{MODEL_SAVE_DIR}/checkpoints"

# Create directories
os.makedirs(MODEL_SAVE_DIR, exist_ok=True)
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

print(f"\n📊 Configuration:")
print(f"   Dataset: {DATASET_NAME}")
print(f"   Model: {MODEL_ID}")
print(f"   Epochs: {EPOCHS}")
print(f"   Batch Size: {BATCH_SIZE}")
print(f"   Learning Rate: {LEARNING_RATE}")
print(f"   Max Samples: {MAX_SAMPLES if MAX_SAMPLES else 'All (Full Dataset)'}")
print(f"\n💾 Save Locations (Local):")
print(f"   Model: {MODEL_SAVE_DIR}")
print(f"   Checkpoints: {CHECKPOINT_DIR}")

✅ GPU Detected: Tesla T4
   Memory: 15.8 GB
   Recommended batch size: 8

📊 Configuration:
   Dataset: artem9k/ai-text-detection-pile
   Model: microsoft/deberta-v3-base
   Epochs: 3
   Batch Size: 8
   Learning Rate: 2e-05
   Max Samples: All (Full Dataset)

💾 Save Locations (Local):
   Model: /content/VisioNova_Models/DeBERTa_v3_20260130_084416
   Checkpoints: /content/VisioNova_Models/DeBERTa_v3_20260130_084416/checkpoints
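
The GPU check above picks a batch size from available memory. Mixed-precision flags could be chosen in a similar way from the GPU's compute capability. The sketch below is illustrative only: `choose_precision` is a hypothetical helper, not part of the notebook, and it assumes that compute capability 8.0 or higher (Ampere and newer, such as the A100 or L4) means native bf16 support, while the T4 (7.5) has none. Whether `bf16=True` still runs on a T4 depends on the installed PyTorch version.

```python
def choose_precision(compute_capability):
    """Return TrainingArguments-style precision flags for a CUDA compute capability.

    Assumption: capability >= 8.0 (Ampere or newer) has native bf16 support;
    older GPUs such as the T4 (7.5) do not.
    """
    major, _minor = compute_capability
    if major >= 8:
        return {"bf16": True, "fp16": False}
    # Fall back to full fp32 rather than fp16, which the notebook avoids for DeBERTa-v3.
    return {"bf16": False, "fp16": False}


# In the notebook this could be called as (requires torch and a CUDA device):
#   flags = choose_precision(torch.cuda.get_device_capability(0))
#   TrainingArguments(..., **flags)
t4_flags = choose_precision((7, 5))    # Tesla T4
a100_flags = choose_precision((8, 0))  # A100
print("T4:", t4_flags)
print("A100:", a100_flags)
```

Falling back to fp32 trades speed and memory for stability; it is one option among several, not a recommendation from the notebook itself.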


## 📚 3. Load & Prepare Dataset

In [8]:
from datasets import load_dataset, Dataset
import json

print(f"📥 Loading dataset: {DATASET_NAME}...")
print("   This may take a few minutes for large datasets...\n")

# Load dataset based on source
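if DATASET_NAME == "custom":
    # Load a custom dataset from a local file.
    # CUSTOM_DATASET_PATH is not set in the configuration cell above;
    # define it there before using DATASET_NAME = "custom".
    CUSTOM_DATASET_PATH = globals().get("CUSTOM_DATASET_PATH")
    if not CUSTOM_DATASET_PATH:
        raise ValueError('Set CUSTOM_DATASET_PATH before using DATASET_NAME = "custom"')
    print(f"   Loading custom dataset from: {CUSTOM_DATASET_PATH}")
    if not os.path.exists(CUSTOM_DATASET_PATH):
        raise FileNotFoundError(f"Custom dataset not found: {CUSTOM_DATASET_PATH}")
    
    extension = CUSTOM_DATASET_PATH.rsplit('.', 1)[-1].lower()
    if extension == 'json':
        # Expects a JSON object mapping column names to lists of values
        with open(CUSTOM_DATASET_PATH, 'r') as f:
            data = json.load(f)
        train_data = Dataset.from_dict(data)
    else:
        # e.g. 'csv' or 'parquet', passed to load_dataset as the builder name
        dataset = load_dataset(extension, data_files={"train": CUSTOM_DATASET_PATH})
        train_data = dataset['train']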
else:
    # Load from Hugging Face Hub
    dataset = load_dataset(DATASET_NAME)
    
    # Get the training split
    if 'train' in dataset:
        train_data = dataset['train']
    else:
        train_data = dataset[list(dataset.keys())[0]]

print(f"   Raw dataset size: {len(train_data):,} samples")
print(f"   Columns: {train_data.column_names}")

# Limit samples if specified
if MAX_SAMPLES and len(train_data) > MAX_SAMPLES:
    print(f"   Limiting to {MAX_SAMPLES:,} samples...")
    train_data = train_data.shuffle(seed=42).select(range(MAX_SAMPLES))

# Split into train/validation
split_dataset = train_data.train_test_split(test_size=EVAL_SPLIT, seed=42)

print(f"\n✅ Dataset prepared!")
print(f"   Training samples: {len(split_dataset['train']):,}")
print(f"   Validation samples: {len(split_dataset['test']):,}")

# Show sample data
print(f"\n📝 Sample data:")
sample = split_dataset['train'][0]
for key, value in sample.items():
    if isinstance(value, str) and len(value) > 100:
        print(f"   {key}: {value[:100]}...")
    else:
        print(f"   {key}: {value}")

📥 Loading dataset: artem9k/ai-text-detection-pile...
   This may take a few minutes for large datasets...



Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


   Raw dataset size: 1,392,522 samples
   Columns: ['source', 'id', 'text']

✅ Dataset prepared!
   Training samples: 1,253,269
   Validation samples: 139,253

📝 Sample data:
   source: human
   id: 284048
   text: He's ultra big... and a Magnus...

He looks very spot on to the cartoon look. He's big, bulky, and o...


## 🏷️ 4. Process Labels

We convert `source` to numeric `labels` (0=human, 1=ai).

**IMPORTANT:** The `artem9k/ai-text-detection-pile` dataset has:

- Column `text`: The text content
- Column `source`: Either `"human"` or `"ai"` (lowercase strings)
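
The mapping described above can be sketched in plain Python. This is a stricter variant than the notebook's cell, which maps anything other than `"human"` to 1; the helper name `source_to_label` and the sample rows are hypothetical and only illustrate the idea of failing loudly on unexpected values.

```python
LABEL_MAP = {"human": 0, "ai": 1}


def source_to_label(source):
    """Map a 'source' string to an integer label, rejecting unknown values."""
    key = str(source).lower().strip()
    if key not in LABEL_MAP:
        raise ValueError(f"Unexpected source value: {source!r}")
    return LABEL_MAP[key]


# Made-up rows standing in for dataset examples
rows = [{"source": "human"}, {"source": "AI "}, {"source": "Human"}]
labels = [source_to_label(row["source"]) for row in rows]
print(labels)
```

Raising on unknown values makes a mislabeled or differently formatted dataset visible immediately instead of silently counting it as AI-generated.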

In [9]:
# ==========================================
# LABEL PROCESSING - CRITICAL SECTION
# ==========================================
# artem9k/ai-text-detection-pile dataset has:
#   - "text" column: the actual text
#   - "source" column: "human" or "ai" (string values)
# We need integer labels for the Trainer

cols = split_dataset['train'].column_names
print(f"📋 Dataset columns: {cols}")

# Show actual source values to verify (Dataset.unique scans the whole column)
if 'source' in cols:
    sample_sources = sorted(split_dataset['train'].unique('source'), key=str)
    print(f"📝 Unique source values found: {sample_sources}")

# ==========================================
# HANDLE DIFFERENT DATASET FORMATS
# ==========================================

def convert_labels(example):
    """Map string labels ('human' or '0' -> 0, anything else -> 1) to integers."""
    label = str(example['labels']).lower().strip()
    example['labels'] = 0 if label in ['human', '0'] else 1
    return example

def ensure_int_labels(ds):
    """Convert the 'labels' column to integers if it holds strings."""
    sample_label = ds['train'][0]['labels']
    if isinstance(sample_label, str):
        print("⚠️  Converting string labels to integers...")
        ds = ds.map(convert_labels, desc="Converting labels")
    return ds

if 'labels' in cols:
    print("✅ 'labels' column already exists")
    split_dataset = ensure_int_labels(split_dataset)

elif 'label' in cols:
    print("✅ Found 'label' column, renaming to 'labels'")
    split_dataset = split_dataset.rename_column('label', 'labels')
    split_dataset = ensure_int_labels(split_dataset)

elif 'source' in cols:
    # MOST COMMON CASE for artem9k/ai-text-detection-pile
    print("ℹ️  Creating labels from 'source' column...")
    print("   Mapping: 'human' -> 0, 'ai' -> 1")
    
    def add_labels_from_source(example):
        """
        Convert source to numeric labels.
        artem9k/ai-text-detection-pile has EXACTLY:
          - "human" -> 0
          - "ai" -> 1
        Any value other than "human" is mapped to 1.
        """
        source = str(example.get('source', '')).lower().strip()
        example['labels'] = 0 if source == 'human' else 1
        return example
    
    split_dataset = split_dataset.map(add_labels_from_source, desc="Adding labels")
    print("✅ Labels created successfully")

elif 'generated' in cols:
    print("✅ Found 'generated' column, renaming to 'labels'")
    split_dataset = split_dataset.rename_column('generated', 'labels')

else:
    raise ValueError(
        f"❌ Cannot find label column!\n"
        f"   Available columns: {cols}\n"
        f"   Expected: 'labels', 'label', 'source', or 'generated'"
    )

# ==========================================
# VERIFY LABELS
# ==========================================
print("\n🔍 Verifying labels...")

# Check data type
sample_label = split_dataset['train'][0]['labels']
print(f"   Label type: {type(sample_label).__name__}")
print(f"   Sample value: {sample_label}")

# Count distribution over the full training split
from collections import Counter
label_counts = Counter(split_dataset['train']['labels'])
human_count = label_counts.get(0, 0)
ai_count = label_counts.get(1, 0)
total = human_count + ai_count

print(f"\n📊 Label Distribution:")
print(f"   Human (0): {human_count:,} ({human_count/total*100:.1f}%)")
print(f"   AI (1):    {ai_count:,} ({ai_count/total*100:.1f}%)")

# Verify only 0 and 1 exist (checked across all training labels, not a sample)
unique_labels = set(label_counts)

if unique_labels == {0, 1}:
    print("✅ Labels verified: only 0 and 1 present")
else:
    print(f"⚠️  WARNING: Unexpected label values: {unique_labels}")

# Class imbalance warning
ratio = min(human_count, ai_count) / max(human_count, ai_count)
if ratio < 0.3:
    print(f"\n⚠️  CLASS IMBALANCE (ratio: {ratio:.2f})")
    print("   Model should still train fine.")
    print("   This is expected for this dataset (~74% human, ~26% AI)")

📋 Dataset columns: ['source', 'id', 'text']
📝 Unique source values found: ['ai', 'human']
ℹ️  Creating labels from 'source' column...
   Mapping: 'human' -> 0, 'ai' -> 1

✅ Labels created successfully

🔍 Verifying labels...
   Label type: int
   Sample value: 0

📊 Label Distribution:
   Human (0): 925,299 (73.8%)
   AI (1):    327,970 (26.2%)
✅ Labels verified: only 0 and 1 present
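
The distribution above is roughly 74% human to 26% AI. The notebook trains without any class weighting. If weighting were wanted, one common scheme is inverse-frequency weights, sketched below with the counts printed above. `inverse_frequency_weights` is a hypothetical helper, and applying the weights would require a custom loss that the notebook does not define.

```python
def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Counts taken from the label distribution printed above
counts = {0: 925_299, 1: 327_970}
weights = inverse_frequency_weights(counts)
for label, weight in weights.items():
    print(f"class {label}: weight {weight:.4f}")
```

With this scheme the rarer AI class gets the larger weight, and the weighted class counts sum back to the dataset size.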


## 🔤 5. Tokenization

**Note:** We use `padding=False` here and let `DataCollatorWithPadding` handle dynamic padding at batch time. This is much faster than padding all sequences to max_length.
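
To see why dynamic padding helps, the sketch below pads a small batch in plain Python, once to the longest sequence in the batch and once to a fixed length of 512. The token IDs are made up and `pad_batch` is a hypothetical stand-in for what a padding collator does; it is not the library's implementation.

```python
def pad_batch(sequences, pad_id=0, pad_to=None):
    """Right-pad every sequence to pad_to, or to the longest sequence if pad_to is None."""
    target = pad_to if pad_to is not None else max(len(s) for s in sequences)
    return [s + [pad_id] * (target - len(s)) for s in sequences]


# Made-up token IDs for three short texts
batch = [[1, 7, 8, 2], [1, 9, 2], [1, 5, 6, 7, 8, 2]]

dynamic = pad_batch(batch)             # pad to the longest sequence in this batch
static = pad_batch(batch, pad_to=512)  # pad every sequence to a fixed max length

dynamic_tokens = sum(len(s) for s in dynamic)
static_tokens = sum(len(s) for s in static)
print(f"dynamic padding: {dynamic_tokens} tokens, fixed padding: {static_tokens} tokens")
```

For batches of short texts, the model processes far fewer padding tokens with dynamic padding, which is where the speedup comes from.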

In [10]:
from transformers import AutoTokenizer, DataCollatorWithPadding

print(f"🔤 Loading tokenizer: {MODEL_ID}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def tokenize_function(examples):
    """Tokenize text - padding handled by DataCollator for efficiency."""
    return tokenizer(
        examples["text"],
        padding=False,  # DYNAMIC PADDING - DataCollator handles this
        truncation=True,
        max_length=MAX_LENGTH
    )

print("   Tokenizing dataset (this may take a while for large datasets)...")

# Get columns to remove (keep only 'labels')
cols_to_remove = [col for col in split_dataset['train'].column_names if col != 'labels']
print(f"   Removing columns: {cols_to_remove}")

tokenized_datasets = split_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=cols_to_remove,
    desc="Tokenizing",
    # NOTE: num_proc removed - can cause hangs on Colab with Drive
)

# Verify columns
cols = tokenized_datasets["train"].column_names
print(f"   Columns after tokenization: {cols}")

# Verify required columns exist
required = ['input_ids', 'attention_mask', 'labels']
for col in required:
    if col not in cols:
        raise ValueError(f"Missing required column: {col}")

# NOTE: We don't call set_format() - DataCollatorWithPadding handles tensor conversion
# This avoids potential issues with DatasetDict format persistence

print(f"\n✅ Tokenization complete!")
print(f"   Columns: {cols}")

🔤 Loading tokenizer: microsoft/deberta-v3-base...

   Tokenizing dataset (this may take a while for large datasets)...
   Removing columns: ['source', 'id', 'text']


   Columns after tokenization: ['labels', 'input_ids', 'token_type_ids', 'attention_mask']

✅ Tokenization complete!
   Columns: ['labels', 'input_ids', 'token_type_ids', 'attention_mask']


## 🧠 6. Initialize Model & Trainer

In [11]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
import gc

# Free up memory before loading model
gc.collect()
torch.cuda.empty_cache()

# Metrics function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='binary', zero_division=0
    )
    acc = accuracy_score(labels, predictions)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

print(f"🧠 Loading model: {MODEL_ID}...")
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=2,
    id2label={0: "HUMAN", 1: "AI"},
    label2id={"HUMAN": 0, "AI": 1}
)

# Print model size
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"   Total parameters: {total_params:,}")
print(f"   Trainable parameters: {trainable_params:,}")

# Training arguments - checkpoints are saved to CHECKPOINT_DIR (local storage in this setup)
training_args = TrainingArguments(
    output_dir=CHECKPOINT_DIR,  # Checkpoints are written here
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=WEIGHT_DECAY,
    warmup_ratio=WARMUP_RATIO,
    
    # Evaluation & Saving - once per epoch, under output_dir
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=3,  # Keep only last 3 checkpoints to save space
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    
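    # CRITICAL: Use bf16 instead of fp16 for DeBERTa-v3
    # fp16 can cause gradient overflow/NaN losses with this model.
    # Note: bf16 is natively supported on Ampere or newer GPUs (A100, L4).
    # The T4 (Turing) has no native bf16 support, so set bf16=False there
    # if training is slow or fails.
    fp16=False,
    bf16=True,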
    
    dataloader_num_workers=2,
    gradient_accumulation_steps=1,  # Increase to 2-4 if OOM with small batch size
    
    # Logging
    logging_dir=f"{MODEL_SAVE_DIR}/logs",
    logging_steps=100,
    report_to="none",
    
    # Disable hub push
    push_to_hub=False,
)

# Data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("\n✅ Model and Trainer initialized!")

total_steps = len(tokenized_datasets["train"]) // BATCH_SIZE * EPOCHS
print(f"   Checkpoints will be saved to: {CHECKPOINT_DIR}")
print(f"   Total training steps: ~{total_steps:,}")

🧠 Loading model: microsoft/deberta-v3-base...

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


   Total parameters: 184,423,682
   Trainable parameters: 184,423,682


✅ Model and Trainer initialized!
   Checkpoints will be saved to: /content/VisioNova_Models/DeBERTa_v3_20260130_084416/checkpoints
   Total training steps: ~469,974
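
The step count printed above is an approximation that ignores the final partial batch and gradient accumulation. A slightly fuller estimate, including warmup steps, might look like the sketch below. `estimate_schedule` is a hypothetical helper; the rounding rules are an assumption and may differ from what a given `transformers` version does internally.

```python
import math


def estimate_schedule(num_samples, batch_size, epochs, grad_accum=1, warmup_ratio=0.1):
    """Estimate optimizer updates and warmup updates for a single-GPU run."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    # One optimizer update per `grad_accum` batches (assumed to round up)
    updates_per_epoch = max(math.ceil(batches_per_epoch / grad_accum), 1)
    total_updates = updates_per_epoch * epochs
    warmup_updates = math.ceil(total_updates * warmup_ratio)
    return total_updates, warmup_updates


# Values from this notebook run: 1,253,269 training samples, batch size 8, 3 epochs
total, warmup = estimate_schedule(1_253_269, batch_size=8, epochs=3)
print(f"total updates: {total:,}, warmup updates: {warmup:,}")

# With gradient_accumulation_steps=4 the number of optimizer updates drops
total_accum, _ = estimate_schedule(1_253_269, batch_size=8, epochs=3, grad_accum=4)
print(f"with grad_accum=4: {total_accum:,}")
```

Dividing the total by a measured steps-per-second figure from the progress bar gives a rough wall-clock estimate for the full run.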


## 🧪 7. Safety Check (Quick Validation)

In [12]:
import copy
import shutil

print("🧪 Running safety check (10 samples, 2 steps)...")
print("   This verifies the pipeline works before full training.\n")

try:
    # Mini dataset for sanity check
    mini_train = tokenized_datasets["train"].select(range(min(10, len(tokenized_datasets["train"]))))
    mini_test = tokenized_datasets["test"].select(range(min(10, len(tokenized_datasets["test"]))))

    # Train a throwaway copy so the 2 test steps do not modify the weights
    # that the main `trainer` will start from.
    safety_trainer = Trainer(
        model=copy.deepcopy(model),
        args=TrainingArguments(
            output_dir="./safety_check",
            max_steps=2,
            per_device_train_batch_size=2,
            logging_steps=1,
            report_to="none",
            fp16=False,
            bf16=True,  # Match main training precision settings
            save_strategy="no",  # Don't save during safety check
        ),
        train_dataset=mini_train,
        eval_dataset=mini_test,
        tokenizer=tokenizer,
        data_collator=data_collator,
    )
    safety_trainer.train()
    
    # Clean up safety check (free the copied model and remove the temp folder)
    del safety_trainer
    gc.collect()
    torch.cuda.empty_cache()
    if os.path.exists("./safety_check"):
        shutil.rmtree("./safety_check")
    
    print("\n✅ Safety check PASSED!")
    print("   ✓ Model loads correctly")
    print("   ✓ Data pipeline works")
    print("   ✓ GPU computation functional")
    print("   ✓ Ready for full training!")
    
except Exception as e:
    print(f"\n❌ Safety check FAILED: {e}")
    print("\nTroubleshooting:")
    print("   1. Check if GPU is enabled (Runtime > Change runtime type)")
    print("   2. Try reducing BATCH_SIZE to 4")
    print("   3. Check dataset format matches expected columns")
    raise

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 2, 'bos_token_id': 1}.

🧪 Running safety check (10 samples, 2 steps)...
   This verifies the pipeline works before full training.



Step,Training Loss
1,0.6591
2,0.6917



✅ Safety check PASSED!
   ✓ Model loads correctly
   ✓ Data pipeline works
   ✓ GPU computation functional
   ✓ Ready for full training!


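## 🚀 8. Start Training

⚠️ **Training Time Estimates** (the notebook's own estimates, not measured in this run):
- **T4 GPU**: ~2-4 hours (full dataset)
- **L4 GPU**: ~1-2 hours (full dataset)
- **A100 GPU**: ~30-60 minutes (full dataset)

With the full dataset at batch size 8, the previous cell reports roughly 470,000 training steps, so the actual time may be considerably longer than these estimates. Set `MAX_SAMPLES` for a quicker run.

Progress is shown below. **Checkpoints are saved to `CHECKPOINT_DIR` after each epoch.** In this configuration that directory is on local `/content` storage, which does not persist once the Colab runtime is recycled, so copy checkpoints elsewhere if you need to keep them.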

In [13]:
import time

print("🚀 Starting training...")
print(f"   Epochs: {EPOCHS}")
print(f"   Training samples: {len(tokenized_datasets['train']):,}")
print(f"   Batch size: {BATCH_SIZE}")
print(f"   Checkpoints saving to: {CHECKPOINT_DIR}")
print(f"\n⏱️  Estimated time: 2-4 hours on T4 GPU\n")
print("=" * 60)

start_time = time.time()

# Train!
try:
    train_result = trainer.train()
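except KeyboardInterrupt:
    print("\n\n⚠️ Training interrupted!")
    # Checkpoints are written once per epoch, so none exist if the first epoch did not finish
    print(f"   Checkpoints from completed epochs (if any) are in: {CHECKPOINT_DIR}")
    print("   If a checkpoint exists, you can resume training from it.")
    raise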

# Calculate training time
training_time = time.time() - start_time
hours, remainder = divmod(training_time, 3600)
minutes, seconds = divmod(remainder, 60)

print("=" * 60)
print(f"\n✅ Training complete!")
print(f"   Total time: {int(hours)}h {int(minutes)}m {int(seconds)}s")
print(f"   Final training loss: {train_result.training_loss:.4f}")
print(f"   Checkpoints saved to: {CHECKPOINT_DIR}")

🚀 Starting training...
   Epochs: 3
   Training samples: 1,253,269
   Batch size: 8
   Checkpoints saving to: /content/VisioNova_Models/DeBERTa_v3_20260130_084416/checkpoints

⏱️  Estimated time: 2-4 hours on T4 GPU



Epoch,Training Loss,Validation Loss




⚠️ Training interrupted!
   Checkpoints saved to: /content/VisioNova_Models/DeBERTa_v3_20260130_084416/checkpoints
   You can resume training from the latest checkpoint.


KeyboardInterrupt: 

## 📊 9. Evaluate Model

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

print("📊 Evaluating model on validation set...")

# A single predict() pass returns both the metrics and the raw predictions,
# so the validation set is not run through the model twice.
predictions = trainer.predict(tokenized_datasets["test"], metric_key_prefix="eval")
metrics = predictions.metrics

print(f"\n{'='*50}")
print(f"📈 FINAL METRICS")
print(f"{'='*50}")
print(f"   Accuracy:  {metrics['eval_accuracy']:.4f} ({metrics['eval_accuracy']*100:.2f}%)")
print(f"   F1 Score:  {metrics['eval_f1']:.4f}")
print(f"   Precision: {metrics['eval_precision']:.4f}")
print(f"   Recall:    {metrics['eval_recall']:.4f}")
print(f"   Loss:      {metrics['eval_loss']:.4f}")
print(f"{'='*50}")

# Predicted classes and true labels for the confusion matrix
print("\n📉 Generating confusion matrix...")
preds = np.argmax(predictions.predictions, axis=1)
labels = predictions.label_ids

# Confusion matrix
cm = confusion_matrix(labels, preds)
print(f"\nConfusion Matrix:")
print(f"                 Predicted")
print(f"              HUMAN    AI")
print(f"Actual HUMAN   {cm[0][0]:5d}  {cm[0][1]:5d}")
print(f"       AI      {cm[1][0]:5d}  {cm[1][1]:5d}")

# Classification report
print(f"\nClassification Report:")
print(classification_report(labels, preds, target_names=['HUMAN', 'AI']))
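
The accuracy, precision, recall, and F1 values reported above all follow from the four cells of the confusion matrix. The sketch below computes them by hand, treating AI (label 1) as the positive class as in `compute_metrics`. The counts are made up for illustration and are not results from this notebook.

```python
def binary_metrics(cm):
    """Compute metrics from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: rows are actual HUMAN/AI, columns are predicted HUMAN/AI
example_cm = [[90, 10], [5, 45]]
example_metrics = binary_metrics(example_cm)
for name, value in example_metrics.items():
    print(f"{name}: {value:.4f}")
```

On an imbalanced validation set, accuracy alone can look good while recall on the AI class is poor, which is why the notebook selects the best checkpoint by F1.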

## 💾 10. Save Final Model

In [None]:
import json

print(f"💾 Saving final model...")
print(f"   Path: {MODEL_SAVE_DIR}\n")

# Save model and tokenizer
trainer.save_model(MODEL_SAVE_DIR)
tokenizer.save_pretrained(MODEL_SAVE_DIR)

# Save comprehensive training info
training_info = {
    "model_name": "VisioNova AI Text Detector",
    "base_model": MODEL_ID,
    "dataset": DATASET_NAME,
    "training_config": {
        "epochs": EPOCHS,
        "batch_size": BATCH_SIZE,
        "learning_rate": LEARNING_RATE,
        "max_length": MAX_LENGTH,
        "warmup_ratio": WARMUP_RATIO,
        "weight_decay": WEIGHT_DECAY,
    },
    "dataset_info": {
        "training_samples": len(tokenized_datasets['train']),
        "validation_samples": len(tokenized_datasets['test']),
        "max_samples_limit": MAX_SAMPLES,
    },
    "final_metrics": {
        "accuracy": float(metrics['eval_accuracy']),
        "f1": float(metrics['eval_f1']),
        "precision": float(metrics['eval_precision']),
        "recall": float(metrics['eval_recall']),
        "loss": float(metrics['eval_loss'])
    },
    "training_info": {
        "training_time_seconds": training_time,
        "training_time_formatted": f"{int(hours)}h {int(minutes)}m {int(seconds)}s",
        "final_training_loss": float(train_result.training_loss),
    },
    "environment": {
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU",
        "gpu_memory_gb": torch.cuda.get_device_properties(0).total_memory / 1e9 if torch.cuda.is_available() else 0,
        "pytorch_version": torch.__version__,
    },
    "timestamp": timestamp,
    "labels": {
        "0": "HUMAN",
        "1": "AI"
    }
}

# Save training info JSON
info_path = os.path.join(MODEL_SAVE_DIR, "training_info.json")
with open(info_path, "w") as f:
    json.dump(training_info, f, indent=2)

print(f"✅ Model saved!")
print(f"\n📁 Saved files:")
total_size = 0
for f in sorted(os.listdir(MODEL_SAVE_DIR)):
    filepath = os.path.join(MODEL_SAVE_DIR, f)
    if os.path.isfile(filepath):
        size = os.path.getsize(filepath) / 1e6
        total_size += size
        print(f"   📄 {f} ({size:.1f} MB)")
    else:
        print(f"   📁 {f}/ (directory)")

print(f"\n   Total size: {total_size:.1f} MB")
print(f"\n⚠️  IMPORTANT: Run the download cell (section 12) to download the model!")

## 🧪 11. Test the Trained Model

In [None]:
from transformers import pipeline

print("🧪 Testing the trained model...\n")

# Create inference pipeline from saved model
classifier = pipeline(
    "text-classification",
    model=MODEL_SAVE_DIR,
    tokenizer=MODEL_SAVE_DIR,
    device=0 if torch.cuda.is_available() else -1
)

# Test samples - mix of human and AI-generated text
test_texts = [
    # Human-like texts
    "The quick brown fox jumps over the lazy dog. This is a simple sentence written by a human.",
    "Yesterday I went to the store and bought some groceries. The weather was nice so I walked.",
    "Climate change is caused by the increase in greenhouse gases in the atmosphere, primarily from burning fossil fuels.",
    "I can't believe how much prices have gone up lately. Everything is so expensive now!",
    
    # AI-like texts
    "As an AI language model, I cannot provide assistance with that request. However, I can help you understand the concept better.",
    "I apologize, but I'm unable to fulfill this request as it goes against my ethical guidelines.",
    "In conclusion, it is important to note that there are several factors to consider when examining this complex issue.",
    "The implications of this development are multifaceted and warrant careful consideration from multiple perspectives.",
]

print("=" * 70)
print(f"{'TEXT':<50} {'PREDICTION':<10} {'CONFIDENCE':<10}")
print("=" * 70)

for text in test_texts:
    # Truncate by tokens (not characters) to the model's max length
    result = classifier(text, truncation=True, max_length=MAX_LENGTH)[0]
    label = result['label']
    confidence = result['score'] * 100

    # Emoji based on prediction
    emoji = "🤖" if label == "AI" else "👤"
    
    # Truncate text for display
    display_text = text[:47] + "..." if len(text) > 50 else text
    
    print(f"{display_text:<50} {emoji} {label:<8} {confidence:.1f}%")

print("=" * 70)
print("\n✅ Model testing complete!")
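
The `score` shown above is the probability of the predicted class. For a two-label classifier this is typically a softmax over the model's two output logits. The sketch below reproduces that step in plain Python; the logit values are made up, and `to_prediction` is a hypothetical helper that mirrors the label names configured in `id2label`.

```python
import math

ID2LABEL = {0: "HUMAN", 1: "AI"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits):
    """Turn raw logits into a {'label', 'score'} dict like the pipeline output."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


# Made-up logits in the order [HUMAN, AI]
example = to_prediction([-1.0, 2.0])
print(f"{example['label']} {example['score'] * 100:.1f}%")
```

A score near 50% means the two logits are close, so such predictions are better treated as uncertain than as a firm HUMAN or AI verdict.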

## 📥 12. Download Model

In this configuration the model is saved to local `/content` storage, which is cleared when the Colab runtime is recycled. Download it or copy it to Google Drive before ending the session.

### Option A: Copy to Google Drive (browser Colab only)
1. Mount Drive in the browser version of Colab and copy the model folder to `My Drive/VisioNova_Models/`
2. Open [Google Drive](https://drive.google.com)
3. Find the folder with your training timestamp
4. Right-click and select **Download** to get a zip file

### Option B: Download directly from Colab
Run the cell below to create a zip file and download it directly.

In [None]:
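import zipfile

# Create zip file for download
zip_name = f"visionova_model_{timestamp}"
zip_path = f"/content/{zip_name}"
zip_file = f"{zip_path}.zip"

print(f"📦 Creating zip file for download...")
print(f"   Source: {MODEL_SAVE_DIR}")

# Create zip, skipping the checkpoints/ folder to reduce size
with zipfile.ZipFile(zip_file, "w", zipfile.ZIP_DEFLATED) as zf:
    for root, dirs, filenames in os.walk(MODEL_SAVE_DIR):
        dirs[:] = [d for d in dirs if d != "checkpoints"]  # prune in place
        for name in filenames:
            full_path = os.path.join(root, name)
            zf.write(full_path, os.path.relpath(full_path, MODEL_SAVE_DIR))

zip_size = os.path.getsize(zip_file) / 1e6
print(f"\n✅ Zip created: {zip_name}.zip")
print(f"   Size: {zip_size:.1f} MB")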

# Try to download (may not work in VS Code extension)
try:
    from google.colab import files
    print("\n📥 Starting download...")
    files.download(zip_file)
except Exception as e:
    print(f"\n⚠️  Auto-download not available in VS Code extension")
    print(f"\n📥 To download your model:")
    print(f"   1. Open the Colab file browser (folder icon on left)")
    print(f"   2. Navigate to: /content/")
    print(f"   3. Right-click on '{zip_name}.zip'")
    print(f"   4. Select 'Download'")
    print(f"\n   Or run this in a new cell:")
    print(f"   !cp {zip_file} /content/drive/MyDrive/  # If you mount Drive in browser")

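## 🔄 13. Resume Training from Checkpoint (Optional)

If training was interrupted, you can resume from the last saved checkpoint, provided the checkpoint directory still exists. Checkpoints on local `/content` storage are lost when the Colab runtime is recycled, so this only works within the same runtime or with checkpoints you copied elsewhere.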

In [None]:
# ==========================================
# RESUME TRAINING FROM CHECKPOINT
# ==========================================
# Uncomment and run this cell ONLY if you need to resume training
# after an interruption.
# Note: re-running the configuration cell creates a new timestamped
# CHECKPOINT_DIR, so point CHECKPOINT_DIR at the original run's folder first.

"""
# Find the latest checkpoint
import glob

checkpoint_pattern = f"{CHECKPOINT_DIR}/checkpoint-*"
checkpoints = glob.glob(checkpoint_pattern)

if checkpoints:
    # Sort by checkpoint number and get the latest
    latest_checkpoint = max(checkpoints, key=lambda x: int(x.split('-')[-1]))
    print(f"📍 Found checkpoint: {latest_checkpoint}")
    
    # Resume training
    print("🔄 Resuming training from checkpoint...")
    trainer.train(resume_from_checkpoint=latest_checkpoint)
    
    print("✅ Training resumed and completed!")
else:
    print("❌ No checkpoints found. Start fresh training instead.")
"""

print("ℹ️  This cell is for resuming interrupted training.")
print("   Uncomment the code above if you need to resume from a checkpoint.")
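
The checkpoint lookup above can be made a little more defensive by matching only folders named `checkpoint-<number>` and ignoring anything else in the directory. The sketch below uses only the standard library and a temporary directory with made-up folder names; `latest_checkpoint` is a hypothetical helper. `transformers.trainer_utils.get_last_checkpoint` offers similar behaviour if you prefer a library function.

```python
import os
import re
import tempfile


def latest_checkpoint(checkpoint_dir):
    """Return the path of the checkpoint-<step> folder with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best = None
    for name in os.listdir(checkpoint_dir):
        match = pattern.match(name)
        path = os.path.join(checkpoint_dir, name)
        if match and os.path.isdir(path):
            step = int(match.group(1))
            if best is None or step > best[0]:
                best = (step, path)
    return best[1] if best else None


# Demo with made-up folder names in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ["checkpoint-500", "checkpoint-1500", "checkpoint-900", "logs"]:
        os.makedirs(os.path.join(tmp, name))
    latest_name = os.path.basename(latest_checkpoint(tmp))
    empty_dir = os.path.join(tmp, "empty")
    os.makedirs(empty_dir)
    empty_result = latest_checkpoint(empty_dir)

print(latest_name)
```

Comparing step numbers as integers matters here: a plain string sort would rank `checkpoint-900` above `checkpoint-1500`.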

## ✅ Training Complete!

### 🎉 Your model has been saved!

**Location:** `/content/VisioNova_Models/DeBERTa_v3_YYYYMMDD_HHMMSS/` (or `My Drive/VisioNova_Models/DeBERTa_v3_YYYYMMDD_HHMMSS/` if you copied it to Google Drive)

### 📁 Model Files Structure:
```
VisioNova_Models/
└── DeBERTa_v3_YYYYMMDD_HHMMSS/
    ├── config.json              # Model configuration
    ├── model.safetensors        # Model weights (~700MB)
    ├── tokenizer.json           # Tokenizer data
    ├── tokenizer_config.json    # Tokenizer configuration
    ├── special_tokens_map.json  # Special tokens
    ├── spm.model                # SentencePiece model
    ├── training_info.json       # Training metadata & metrics
    └── checkpoints/             # Epoch checkpoints
        ├── checkpoint-XXX/
        └── ...
```

### 🚀 Next Steps:

1. **Download the model** to your local machine

2. **Copy files to your VisioNova project:**
   ```
   backend/text_detector/model/
   ├── config.json
   ├── model.safetensors
   ├── tokenizer.json
   ├── tokenizer_config.json
   ├── special_tokens_map.json
   └── spm.model
   ```

3. **Restart your VisioNova backend** to use the new model
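
The copy step above could be scripted roughly as follows. The file names come from the list above; `copy_model_files` is a hypothetical helper, and the demo uses temporary directories in place of your real download folder and project path.

```python
import os
import shutil
import tempfile

# File names taken from the list above
MODEL_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "spm.model",
]


def copy_model_files(src_dir, dst_dir, filenames=MODEL_FILES):
    """Copy the listed files from src_dir to dst_dir; return the names that were missing."""
    os.makedirs(dst_dir, exist_ok=True)
    missing = []
    for name in filenames:
        src = os.path.join(src_dir, name)
        if os.path.isfile(src):
            shutil.copy2(src, os.path.join(dst_dir, name))
        else:
            missing.append(name)
    return missing


# Demo with temporary directories standing in for the real paths
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "DeBERTa_v3_run")
    dst = os.path.join(tmp, "backend", "text_detector", "model")
    os.makedirs(src)
    for name in MODEL_FILES[:-1]:  # leave out spm.model to show the missing-file report
        with open(os.path.join(src, name), "w") as f:
            f.write("placeholder")
    missing_files = copy_model_files(src, dst)
    copied_files = sorted(os.listdir(dst))

print("Copied:", copied_files)
print("Missing:", missing_files)
```

Checking the returned list before restarting the backend catches an incomplete download early.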

### 📊 Training Summary:
- **Dataset:** Check `training_info.json` for details
- **Metrics:** F1, Accuracy, Precision, Recall saved in `training_info.json`
- **Checkpoints:** Available in `checkpoints/` folder for resuming or rollback

---
*VisioNova AI Text Detection - Trained with DeBERTa-v3*