Official implementation and dataset for the paper: "Evaluating Temporal Consistency in Multi-Turn Language Models" (ACL 2026).
As Large Language Models (LLMs) are increasingly deployed in interactive settings, they must maintain Temporal Scope Stability—the ability to preserve, override, or transfer time-scoped factual context across dialogue turns.
ChronoScope is a large-scale diagnostic benchmark designed to test this ability. It consists of over 1.4 million deterministically generated question chains grounded in Wikidata, spanning domains such as politics, sports, and business.

This repository provides:

- Benchmark: A comprehensive dataset for multi-turn temporal question answering.
- Metric Framework: New metrics for evaluating Strict Chain Consistency and Temporal Drift.
- Analysis: Extensive evaluation showing that SOTA models (GPT-4o, Gemini 1.5 Pro, Llama 3) frequently drift toward present-day assumptions, even when a historical context was correctly established.
ChronoScope organizes interactions into 11 distinct chain families that isolate specific patterns of temporal reasoning. Representative families include:
| Chain Family | Description |
|---|---|
| Carryover | Inheriting scope from a previous turn (e.g., "Who was CEO in 1998?" → "Who was the CFO?"). |
| Scope Switch | Explicitly overriding a previous temporal frame with a new one. |
| Cross-Entity | Shifting focus to a related entity while maintaining the same year. |
| Multi-Turn | Longer chains (3-6 turns) testing cumulative stability. |
| Temporal Narrative | Simulating chronological story-tracking across multiple years. |
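For concreteness, a single Carryover chain could be represented roughly as follows. This is a hypothetical sketch for illustration; the field names are not the exact dataset schema (see `data/` for the real loaders):

```python
# Hypothetical sketch of one Carryover chain; see data/ for the real schema.
chain = {
    "chain_id": "carryover-000001",   # illustrative ID format
    "family": "carryover",
    "turns": [
        {
            "question": "Who was the CEO of Apple in 1998?",
            "gold_answer": "Steve Jobs",
            "scope_year": 1998,  # temporal scope stated explicitly
        },
        {
            "question": "Who was the CFO?",
            "gold_answer": "Fred Anderson",
            "scope_year": 1998,  # inherited from turn 1, never restated
        },
    ],
}
```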
We evaluate models under three distinct context settings:
- Gold Context (Oracle): Prior assistant responses are replaced with gold answers to isolate reasoning from error propagation.
- Self-Conditioned: The model relies on its own previous answers, reflecting realistic interactive usage.
- Questions Only: Each question is answered in isolation, providing a context-free baseline that separates explicit temporal cues from those implied by dialogue history.
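As a concrete illustration, the sketch below shows how a chain might be turned into chat-style messages under each setting. It is a minimal sketch assuming an OpenAI-style message format and the hypothetical chain schema shown above, not the repository's exact code:

```python
# Minimal sketch of the three context settings, assuming chat-style messages
# and the hypothetical chain schema sketched above.
def build_messages(chain, setting, model_answers=None):
    turns = chain["turns"]
    if setting == "questions_only":
        # Baseline: the final question is asked with no dialogue history.
        return [{"role": "user", "content": turns[-1]["question"]}]

    messages = []
    for i, turn in enumerate(turns):
        messages.append({"role": "user", "content": turn["question"]})
        if i < len(turns) - 1:  # every turn except the one being asked
            if setting == "gold_context":
                # Oracle: prior assistant turns are replaced with gold answers.
                messages.append({"role": "assistant", "content": turn["gold_answer"]})
            else:  # "self_conditioned"
                # Realistic: the model is conditioned on its own earlier answers.
                messages.append({"role": "assistant", "content": model_answers[i]})
    return messages
```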
Our findings reveal a significant gap between models' static knowledge and their ability to maintain temporal scope:
- Temporal Drift: Models exhibit a strong bias toward the "present," often forgetting the temporal constraints established in Turn 1 by the time they reach Turn 3.
- Consistency Collapse: In self-conditioned settings, chain-level accuracy drops significantly compared to turn-level accuracy.
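To make the two notions of accuracy concrete, here is a minimal sketch of turn-level accuracy, StrictChain@1, and a simple drift rate. The data shapes, the `is_correct` scorer (e.g., normalized exact match), and the `present_day_answer` field are assumptions, not the exact implementation in `src/metrics.py`:

```python
# Minimal metric sketches; see src/metrics.py for the real implementation.
# Each chain is assumed to carry parallel lists of turns and model predictions.

def turn_accuracy(chains, is_correct):
    """Fraction of individual turns answered correctly."""
    results = [
        is_correct(pred, turn["gold_answer"])
        for chain in chains
        for pred, turn in zip(chain["predictions"], chain["turns"])
    ]
    return sum(results) / len(results)

def strict_chain_at_1(chains, is_correct):
    """Fraction of chains in which *every* turn is answered correctly,
    so a single wrong turn fails the whole chain."""
    return sum(
        all(
            is_correct(pred, turn["gold_answer"])
            for pred, turn in zip(chain["predictions"], chain["turns"])
        )
        for chain in chains
    ) / len(chains)

def temporal_drift_rate(chains, is_correct):
    """Among incorrect turns, how often the prediction matches the
    present-day answer instead of the time-scoped gold answer
    (assumes a hypothetical `present_day_answer` field)."""
    drifted = wrong = 0
    for chain in chains:
        for pred, turn in zip(chain["predictions"], chain["turns"]):
            if not is_correct(pred, turn["gold_answer"]):
                wrong += 1
                if is_correct(pred, turn.get("present_day_answer", "")):
                    drifted += 1
    return drifted / wrong if wrong else 0.0
```

Because one wrong turn fails an entire chain, chain-level accuracy can never exceed turn-level accuracy, and the gap widens as chains grow longer.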
### 📁 Repository Structure

```
ChronoScope/
├── data/                  # Dataset samples and loading scripts
├── eval/                  # Evaluation scripts and prompt templates
├── src/
│   ├── model_inference.py # Script to run LLMs on ChronoScope
│   └── metrics.py         # Implementation of StrictChain@1 and Drift metrics
├── results/               # Raw results and analysis notebooks
├── LICENSE
└── README.md
```
### ⚙️ Installation

```bash
# Clone the repository
git clone https://github.com/yashkumaratri/ChronoScope.git
cd ChronoScope

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
### 🚀 Running Evaluation
To evaluate a model on the **ChronoScope** benchmark, use the `run_eval.py` script. You can specify the model, the chain family (e.g., Carryover, Scope Switch), and the context setting. The flag names below are illustrative; check the script's help output for the exact options.

```bash
# Evaluate a specific model on a specific chain family (flags are illustrative).
python eval/run_eval.py \
    --model gpt-4o \
    --family carryover \
    --setting gold_context
```
### 📚 Citation

If you find our work or the ChronoScope dataset useful in your research, please cite our paper:
```bibtex
@inproceedings{atri2026chronoscope,
  title={Evaluating Temporal Consistency in Multi-Turn Language Models},
  author={Atri, Yash Kumar and Johnson, Steven L. and Hartvigsen, Tom},
  booktitle={Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2026},
  publisher={Association for Computational Linguistics},
  url={https://your-paper-url-here.pdf}
}
```

### 📬 Contact

Yash Kumar Atri (atri@virginia.edu)