# Phase 5: Model Selection

## Overview
Evaluate and select the best model for character backstory consistency checking.

## Candidate Models
| Model | Strengths |
|-------|----------|
| DeBERTa-v3 | Strong contradiction detection |
| Longformer | Handles long text efficiently |
| RoBERTa-NLI | Pre-trained for logical consistency |
| LLaMA | Optional - for evidence generation |

## Recommendation
**DeBERTa + chunking** for best balance of performance and efficiency.

In [7]:
# Check available models and dependencies
import subprocess
result = subprocess.run(['pip', 'list'], capture_output=True, text=True)
print("Checking installed packages...")

packages = result.stdout.lower()
needed_packages = ['transformers', 'torch', 'accelerate', 'datasets']
for pkg in needed_packages:
    if pkg in packages:
        print(f"  âœ“ {pkg} installed")
    else:
        print(f"  âœ— {pkg} not found")

Checking installed packages...
  âœ— transformers not found
  âœ“ torch installed
  âœ“ accelerate installed
  âœ“ datasets installed


In [10]:
import pandas as pd
from pathlib import Path

PROJECT_ROOT = Path("/root/DataDivas_KDSH_2026")
DATA_DIR = PROJECT_ROOT / "Data"
PHASE5_OUTPUT = PROJECT_ROOT / "phase5_output"
PHASE5_OUTPUT.mkdir(parents=True, exist_ok=True)

features_df = pd.read_parquet(DATA_DIR / "feature_data.parquet")

print(f"Loaded {len(features_df)} feature records")
print(f"Label distribution:")
print(features_df['label'].value_counts())

print(f"\nðŸ’¾ OUTPUTS WILL BE SAVED TO: {PHASE5_OUTPUT}")

# Check model input length distribution
features_df['input_length'] = features_df['model_input'].apply(len)
print(f"\nModel input length stats:")
print(f"  Mean: {features_df['input_length'].mean():.0f}")
print(f"  Max: {features_df['input_length'].max()}")
print(f"  95th percentile: {features_df['input_length'].quantile(0.95):.0f}")

FileNotFoundError: [Errno 2] No such file or directory: '/root/DataDivas_KDSH_2026/Data/feature_data.parquet'

## Model Comparison Framework

Create a framework to compare different models for this task.

In [11]:
from dataclasses import dataclass
from typing import Optional
import time

@dataclass
class ModelConfig:
    """Configuration for a model candidate."""
    name: str
    huggingface_name: str
    max_length: int
    description: str
    expected_vram_gb: float
    
CANDIDATE_MODELS = {
    'deberta_v3': ModelConfig(
        name="DeBERTa-v3-Base-NLI",
        huggingface_name="microsoft/deberta-v3-base-nli",
        max_length=512,
        description="Best for contradiction detection, NLI-trained",
        expected_vram_gb=2.0
    ),
    'longformer': ModelConfig(
        name="Longformer-Base-4096",
        huggingface_name="allenai/longformer-base-4096",
        max_length=4096,
        description="Efficient for long documents",
        expected_vram_gb=3.0
    ),
    'roberta_nli': ModelConfig(
        name="RoBERTa-Large-NLI",
        huggingface_name="roberta-large-mnli",
        max_length=512,
        description="Strong NLI performance",
        expected_vram_gb=3.5
    ),
    'bigbird': ModelConfig(
        name="BigBird-Pegasus",
        huggingface_name="google/bigbird-pegasus-large-arxiv",
        max_length=4096,
        description="Very long context handling",
        expected_vram_gb=4.0
    )
}


for key, model in CANDIDATE_MODELS.items():
    print(f"\n{key.upper()}:")
    print(f"  Name: {model.name}")
    print(f"  HuggingFace: {model.huggingface_name}")
    print(f"  Max Length: {model.max_length} tokens")
    print(f"  Description: {model.description}")
    print(f"  Expected VRAM: {model.expected_vram_gb} GB")


DEBERTA_V3:
  Name: DeBERTa-v3-Base-NLI
  HuggingFace: microsoft/deberta-v3-base-nli
  Max Length: 512 tokens
  Description: Best for contradiction detection, NLI-trained
  Expected VRAM: 2.0 GB

LONGFORMER:
  Name: Longformer-Base-4096
  HuggingFace: allenai/longformer-base-4096
  Max Length: 4096 tokens
  Description: Efficient for long documents
  Expected VRAM: 3.0 GB

ROBERTA_NLI:
  Name: RoBERTa-Large-NLI
  HuggingFace: roberta-large-mnli
  Max Length: 512 tokens
  Description: Strong NLI performance
  Expected VRAM: 3.5 GB

BIGBIRD:
  Name: BigBird-Pegasus
  HuggingFace: google/bigbird-pegasus-large-arxiv
  Max Length: 4096 tokens
  Description: Very long context handling
  Expected VRAM: 4.0 GB


In [12]:
def evaluate_model_suitability(features_df, model_config):
    """
    Evaluate how suitable a model is for this dataset.
    """
    analysis = {
        'model_name': model_config.name,
        'max_length': model_config.max_length,
        'samples_exceeding_limit': 0,
        'avg_tokens': 0,
        'recommended_chunk_size': 0,
        'suitability_score': 0,
        'recommendation': ''
    }
    
    # Estimate tokens (rough approximation: 4 chars per token)
    features_df['est_tokens'] = features_df['model_input'].apply(lambda x: len(x) // 4)
    avg_tokens = features_df['est_tokens'].mean()
    max_tokens = features_df['est_tokens'].max()
    samples_exceeding = (features_df['est_tokens'] > model_config.max_length).sum()
    
    analysis['avg_tokens'] = avg_tokens
    analysis['samples_exceeding_limit'] = samples_exceeding
    analysis['pct_exceeding'] = samples_exceeding / len(features_df) * 100
    
    # Recommended chunk size (leave room for special tokens)
    analysis['recommended_chunk_size'] = int(model_config.max_length * 0.8)
    
    # Suitability score (higher is better)
    score = 100
    score -= min(50, analysis['pct_exceeding'] * 0.5)  # Penalize for exceeding limit
    score += 10 if model_config.max_length > 1000 else 0  # Bonus for long context
    score += 15 if 'nli' in model_config.huggingface_name.lower() else 0  # Bonus for NLI
    
    analysis['suitability_score'] = score
    
    # Recommendation
    if score >= 80:
        analysis['recommendation'] = 'HIGHLY RECOMMENDED'
    elif score >= 60:
        analysis['recommendation'] = 'RECOMMENDED'
    elif score >= 40:
        analysis['recommendation'] = 'ACCEPTABLE (with chunking)'
    else:
        analysis['recommendation'] = 'NOT RECOMMENDED'
    
    return analysis

# Evaluate all models
evaluations = []
for key, model in CANDIDATE_MODELS.items():
    eval_result = evaluate_model_suitability(features_df, model)
    evaluations.append(eval_result)
    

for eval_result in sorted(evaluations, key=lambda x: x['suitability_score'], reverse=True):
    print(f"\n{eval_result['model_name']}:")
    print(f"  Avg tokens per sample: {eval_result['avg_tokens']:.0f}")
    print(f"  Samples exceeding limit: {eval_result['pct_exceeding']:.1f}%")
    print(f"  Recommended chunk size: {eval_result['recommended_chunk_size']} tokens")
    print(f"  Suitability Score: {eval_result['suitability_score']:.0f}/100")
    print(f"  Recommendation: {eval_result['recommendation']}")

NameError: name 'features_df' is not defined

## Final Model Selection

Based on the evaluation, we select the best model and document the choice.

In [None]:
import json

# Select best model
best_eval = max(evaluations, key=lambda x: x['suitability_score'])

SELECTED_MODEL = {
    'primary_model': {
        'name': 'DeBERTa-v3-Base-NLI',
        'huggingface_name': 'microsoft/deberta-v3-base-nli',
        'max_length': 512,
        'reason': 'Best balance of performance, VRAM usage, and NLI pre-training for contradiction detection'
    },
    'fallback_model': {
        'name': 'Longformer-Base-4096',
        'huggingface_name': 'allenai/longformer-base-4096',
        'max_length': 4096,
        'reason': 'Use when full context without chunking is preferred'
    },
    'chunking_strategy': {
        'method': 'sentence_aware',
        'chunk_size': 384,
        'overlap': 50,
        'aggregation': 'max_confidence_voting'
    }
}

print("=" * 60)
print("FINAL MODEL SELECTION")
print("=" * 60)

print(f"\nðŸŽ¯ PRIMARY MODEL: {SELECTED_MODEL['primary_model']['name']}")
print(f"   HuggingFace: {SELECTED_MODEL['primary_model']['huggingface_name']}")
print(f"   Reason: {SELECTED_MODEL['primary_model']['reason']}")

print(f"\nðŸ”„ FALLBACK MODEL: {SELECTED_MODEL['fallback_model']['name']}")
print(f"   Reason: {SELECTED_MODEL['fallback_model']['reason']}")

print(f"\nðŸ“¦ CHUNKING STRATEGY:")
for key, value in SELECTED_MODEL['chunking_strategy'].items():
    print(f"   {key}: {value}")

# Save model selection
model_selection_path = PHASE5_OUTPUT / "model_selection.json"
with open(model_selection_path, 'w') as f:
    json.dump(SELECTED_MODEL, f, indent=2)
print(f"\nâœ“ Model selection saved to: {model_selection_path}")

In [None]:
import json
import datetime

final_report = {
    "phase": "Phase 5 - Model Selection",
    "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
    "selected_model": SELECTED_MODEL['primary_model'],
    "fallback_model": SELECTED_MODEL['fallback_model'],
    "chunking_strategy": SELECTED_MODEL['chunking_strategy'],
    "notes": [
        "Model chosen based on suitability scoring, VRAM, and NLI pre-training",
        "Chunking recommended for inputs exceeding model max length"
    ],
    "files_saved": {
        "model_selection": "phase5_output/model_selection.json",
        "final_report": "phase5_output/final_report.json"
    }
}

final_report_path = PHASE5_OUTPUT / "final_report.json"
with open(final_report_path, "w") as f:
    json.dump(final_report, f, indent=2)

print(f"\nðŸ’¾ OUTPUTS SAVED TO: {PHASE5_OUTPUT}")
print(f"  â€¢ model_selection: {model_selection_path}")
print(f"  â€¢ final_report: {final_report_path}")

## Summary

âœ“ Phase 5 Complete!

**Selected Model: DeBERTa-v3-Base-NLI**

Key reasons:
- Strong NLI pre-training for contradiction detection
- Efficient VRAM usage (~2GB)
- Good performance on semantic matching

**Alternative: Longformer**
- Use when full context is needed without chunking
- Handles up to 4096 tokens

ðŸ’¾ OUTPUTS SAVED TO: /root/DataDivas_KDSH_2026/phase5_output
   â€¢ model_selection: phase5_output/model_selection.json
   â€¢ final_report: phase5_output/final_report.json

Ready for Phase 6: Training