A progressive training pipeline for scientific reasoning, modeled after OpenAI's FrontierScience benchmark. It trains language models through three stages: scientific discovery, rubric-based verification, and knowledge-graph self-improvement.
```
Stage 1: Discovery         Stage 2: Verification        Stage 3: Graph Improvement
┌─────────────────┐        ┌─────────────────────┐       ┌──────────────────────┐
│ Propose novel   │        │ Rubric-based eval   │       │ Agent co-evolves     │
│ hypotheses,     │───────>│ with discovery +    │──────>│ with its knowledge   │
│ mechanisms,     │        │ verification        │       │ graph via 7 action   │
│ experiments     │        │ curriculum          │       │ types + 5-stage gate │
└─────────────────┘        └─────────────────────┘       └──────────────────────┘
5 tasks                    11 task types                 7 action types
RL + SFT                   Alternating curriculum        Three-layer arch
```
Key innovation: Each stage builds on the previous checkpoint. The model progressively learns to propose ideas (Stage 1), evaluate them rigorously (Stage 2), and then improve the knowledge graph it was trained on (Stage 3).
```shell
# Clone and install
git clone https://github.com/YOUR_USERNAME/frontierscience-training.git
cd frontierscience-training
pip install -e ".[dev]"

# Set up credentials
cp .env.example .env
# Edit .env with your API keys

# Run tests (no external dependencies needed)
pytest tests/ -v

# Start training (requires Tinker SDK)
python -m src.discovery.train --config configs/discovery.yaml --mode sft
```

The model learns to propose novel scientific ideas grounded in a knowledge graph.
5 Discovery Tasks: hypothesis proposal, mechanism proposal, experiment design, open question elaboration, rival hypothesis generation
Training: SFT on 2,690 samples, then GRPO-style RL with rewards for structure, coherence, grounding, novelty, and informativeness.
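The five RL reward components could be combined as a weighted sum per sample; the sketch below is illustrative only — the weights, scorer names, and aggregation are assumptions, not the repo's actual reward code.

```python
# Hypothetical composite reward for Stage 1 RL. The five axes mirror the
# reward components listed above; the weights are made-up placeholders.
WEIGHTS = {
    "structure": 0.2,
    "coherence": 0.2,
    "grounding": 0.25,
    "novelty": 0.2,
    "informativeness": 0.15,
}

def composite_reward(scores: dict[str, float]) -> float:
    """Combine per-axis scores in [0, 1] into a single scalar reward."""
    assert set(scores) == set(WEIGHTS), "expect one score per reward axis"
    return sum(WEIGHTS[axis] * scores[axis] for axis in WEIGHTS)

reward = composite_reward({
    "structure": 1.0,        # e.g. output parsed into the expected sections
    "coherence": 0.8,
    "grounding": 0.9,        # e.g. claims supported by knowledge-graph nodes
    "novelty": 0.6,
    "informativeness": 0.7,
})  # ≈ 0.81
```

A weighted sum keeps the reward bounded in [0, 1] whenever each axis score is, which plays nicely with GRPO-style advantage normalization.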
```shell
python -m src.discovery.train --config configs/discovery.yaml --mode rl --batches 25
```

Alternating curriculum between discovery and verification modes, inspired by the FrontierScience 10-point rubric system.
6 Discovery Rubric Axes: testability, coherence, novelty value, assumption clarity, pressure point relevance, comparative advantage
11 Verification Tasks: consistency check, evidence sufficiency, contradiction audit, confidence calibration, and more.
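A rubric verdict over the six discovery axes might look like the sketch below. The 0–10 scale echoes the FrontierScience 10-point rubric, but the mean-based aggregation and the pass threshold are assumptions for illustration.

```python
# Illustrative rubric scorer -- axis names come from the six discovery
# rubric axes above; aggregation and threshold are hypothetical.
DISCOVERY_AXES = (
    "testability", "coherence", "novelty_value",
    "assumption_clarity", "pressure_point_relevance", "comparative_advantage",
)

def rubric_verdict(axis_scores: dict[str, int], pass_threshold: float = 7.0):
    """Average 0-10 axis scores; accept if the mean clears the threshold."""
    assert set(axis_scores) == set(DISCOVERY_AXES)
    assert all(0 <= s <= 10 for s in axis_scores.values())
    mean = sum(axis_scores.values()) / len(axis_scores)
    return mean, mean >= pass_threshold

mean, accepted = rubric_verdict({
    "testability": 8, "coherence": 9, "novelty_value": 6,
    "assumption_clarity": 7, "pressure_point_relevance": 8,
    "comparative_advantage": 7,
})  # mean 7.5, accepted
```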
```shell
python -m src.verification.train --config configs/verification.yaml --batches 150
```

The model proposes modifications to its own knowledge graph, evaluated through a 5-stage gating pipeline.
7 Action Types: propose_edge, add_node, update_node, split_node, merge_nodes, retype_node, deprecate_node
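A graph-edit proposal for one of these action types might be represented as a small structured record; the field names below (`source`, `target`, `relationship`, `evidence`) are hypothetical, chosen only to make the action concrete.

```python
# Hypothetical shape of a single graph-edit action. The seven action
# types come from the list above; all field names are illustrative.
ACTION_TYPES = {
    "propose_edge", "add_node", "update_node", "split_node",
    "merge_nodes", "retype_node", "deprecate_node",
}

action = {
    "type": "propose_edge",
    "source": "node:crispr_cas9",       # hypothetical node IDs
    "target": "node:dna_repair",
    "relationship": "MODULATES",        # hypothetical relationship type
    "evidence": "supporting passage id the validators can check",
}

assert action["type"] in ACTION_TYPES
```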
5-Stage Validation:
- Schema Compliance (hard gate)
- Semantic Plausibility (embedding similarity >= 0.60)
- Graded Hallucination Detection (fabricated/misattributed/speculative)
- Structural Validation (orphan, cycle, duplicate checks)
- Ensemble Validator with frozen critic
Three-Layer Architecture: Speculative (in-memory) -> Probationary (Neo4j, isolated) -> Main Graph
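The five validation stages above behave like a short-circuiting pipeline: the first failing gate rejects the proposal with a reason. The sketch below is a simplification — each check here is a stand-in predicate (the real semantic stage compares embeddings against the 0.60 threshold, the real ensemble stage queries a frozen critic).

```python
# Simplified 5-stage gating pipeline. Stage names match the validation
# stages above; the checks themselves are illustrative stand-ins.
def schema_compliance(p):      return "type" in p and "payload" in p   # hard gate
def semantic_plausibility(p):  return p.get("similarity", 0.0) >= 0.60
def hallucination_check(p):    return p.get("grade") not in {"fabricated", "misattributed"}
def structural_validation(p):  return not p.get("creates_cycle", False)
def ensemble_validator(p):     return p.get("critic_score", 0.0) >= 0.5  # hypothetical cutoff

GATES = [
    ("schema_compliance", schema_compliance),
    ("semantic_plausibility", semantic_plausibility),
    ("hallucination_check", hallucination_check),
    ("structural_validation", structural_validation),
    ("ensemble_validator", ensemble_validator),
]

def validate(proposal: dict):
    """Return (accepted, failed_stage); short-circuits on first failure."""
    for name, gate in GATES:
        if not gate(proposal):
            return False, name
    return True, None

ok, failed = validate({
    "type": "propose_edge", "payload": {},
    "similarity": 0.72, "grade": "supported",
    "creates_cycle": False, "critic_score": 0.8,
})  # accepted
```

Ordering the gates from cheapest (schema) to most expensive (ensemble critic) means most bad proposals are rejected before any model call is needed.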
```shell
python -m src.graph_improvement.train \
    --config configs/graph_improvement.yaml \
    --mode rl --batches 100
```

| Stage | Key Metric | Achieved |
|---|---|---|
| Discovery | Validity Rate | 98.9% |
| Verification (Discovery) | Acceptance Rate | 98% |
| Verification (Verification) | Pass Rate | 60% |
| Graph Improvement | Final Reward | 0.814 |
| Graph Improvement | Parse Rate | 98.6% |
| Graph Improvement | Validation Rate | 90.8% |
- Python >= 3.10
- Tinker SDK for model training
- Neo4j >= 5.0 (optional, for graph-aware features)
- OpenAI API key (for embedding generation)
- Anthropic API key (optional, for target regeneration)
See docs/SETUP.md for detailed setup instructions.
- Setup Guide - Environment, API keys, Neo4j
- Architecture - Pipeline design, reward systems, rubric philosophy
- Training Guide - Per-stage commands and expected metrics
- FrontierScience Alignment - How this maps to the FrontierScience benchmark
- Data Validation - Anti-patterns and validation rules
- Graph Schema - Full knowledge graph schema (28 node types, 35 relationships)
```
frontierscience-training/
├── src/
│   ├── common/             # Shared utilities (embeddings, guards, validation)
│   ├── discovery/          # Stage 1: Scientific Discovery
│   ├── verification/       # Stage 2: Rubric-Based Verification
│   └── graph_improvement/  # Stage 3: Graph Self-Improvement
├── configs/                # YAML configs per stage
├── data/                   # Training data (100MB total)
├── tests/                  # Test suite (82 tests)
└── docs/                   # Documentation
```
Apache 2.0. See LICENSE.