Official implementation and dataset for the paper: "Evaluating Temporal Consistency in Multi-Turn Language Models" (ACL 2026).
As Large Language Models (LLMs) are increasingly deployed in interactive settings, they must maintain Temporal Scope Stability—the ability to preserve, override, or transfer time-scoped factual context across dialogue turns.
ChronoScope is a large-scale diagnostic benchmark designed to test this ability. It consists of over 1.4 million deterministically generated question chains grounded in Wikidata, spanning domains such as politics, sports, and business.

This repository provides:

- Benchmark: A comprehensive dataset for multi-turn temporal question answering.
- Metric Framework: New metrics for evaluating Strict Chain Consistency and Temporal Drift.
- Analysis: Extensive evaluation showing that SOTA models (GPT-4o, Gemini 1.5 Pro, Llama 3) frequently drift toward present-day assumptions, even when a historical context was correctly established.
ChronoScope organizes interactions into 11 distinct chain families that isolate specific patterns of temporal reasoning. Representative families include:
| Chain Family | Description |
|---|---|
| Carryover | Inheriting scope from a previous turn (e.g., "Who was CEO in 1998?" → "Who was the CFO?"). |
| Scope Switch | Explicitly overriding a previous temporal frame with a new one. |
| Cross-Entity | Shifting focus to a related entity while maintaining the same year. |
| Multi-Turn | Longer chains (3-6 turns) testing cumulative stability. |
| Temporal Narrative | Simulating chronological story-tracking across multiple years. |
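For concreteness, a single Carryover chain could be represented roughly as follows. This is a hypothetical sketch for illustration; the field names are not the exact dataset schema (see `data/` for the real loaders):

```python
# Hypothetical sketch of one Carryover chain; see data/ for the real schema.
chain = {
    "chain_id": "carryover-000001",   # illustrative ID format
    "family": "carryover",
    "turns": [
        {
            "question": "Who was the CEO of Apple in 1998?",
            "gold_answer": "Steve Jobs",
            "scope_year": 1998,  # temporal scope stated explicitly
        },
        {
            "question": "Who was the CFO?",
            "gold_answer": "Fred Anderson",
            "scope_year": 1998,  # inherited from turn 1, never restated
        },
    ],
}
```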
We evaluate models under three distinct context settings:
- Gold Context (Oracle): Prior assistant responses are replaced with gold answers to isolate reasoning from error propagation.
- Self-Conditioned: The model relies on its own previous answers, reflecting realistic interactive usage.
- Questions Only: Each question is answered in isolation, providing a context-free baseline that separates explicit temporal cues from those implied by dialogue history.
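As a concrete illustration, the sketch below shows how a chain might be turned into chat-style messages under each setting. It is a minimal sketch assuming an OpenAI-style message format and the hypothetical chain schema shown above, not the repository's exact code:

```python
# Minimal sketch of the three context settings, assuming chat-style messages
# and the hypothetical chain schema sketched above.
def build_messages(chain, setting, model_answers=None):
    turns = chain["turns"]
    if setting == "questions_only":
        # Baseline: the final question is asked with no dialogue history.
        return [{"role": "user", "content": turns[-1]["question"]}]

    messages = []
    for i, turn in enumerate(turns):
        messages.append({"role": "user", "content": turn["question"]})
        if i < len(turns) - 1:  # every turn except the one being asked
            if setting == "gold_context":
                # Oracle: prior assistant turns are replaced with gold answers.
                messages.append({"role": "assistant", "content": turn["gold_answer"]})
            else:  # "self_conditioned"
                # Realistic: the model is conditioned on its own earlier answers.
                messages.append({"role": "assistant", "content": model_answers[i]})
    return messages
```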
Our findings reveal a significant gap between models' static knowledge and their ability to maintain temporal scope:
- Temporal Drift: Models exhibit a strong bias toward the "present," often forgetting the temporal constraints established in Turn 1 by the time they reach Turn 3.
- Consistency Collapse: In self-conditioned settings, chain-level accuracy drops significantly compared to turn-level accuracy.
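To make the two notions of accuracy concrete, here is a minimal sketch of turn-level accuracy, StrictChain@1, and a simple drift rate. The data shapes, the `is_correct` scorer (e.g., normalized exact match), and the `present_day_answer` field are assumptions, not the exact implementation in `src/metrics.py`:

```python
# Minimal metric sketches; see src/metrics.py for the real implementation.
# Each chain is assumed to carry parallel lists of turns and model predictions.

def turn_accuracy(chains, is_correct):
    """Fraction of individual turns answered correctly."""
    results = [
        is_correct(pred, turn["gold_answer"])
        for chain in chains
        for pred, turn in zip(chain["predictions"], chain["turns"])
    ]
    return sum(results) / len(results)

def strict_chain_at_1(chains, is_correct):
    """Fraction of chains in which *every* turn is answered correctly,
    so a single wrong turn fails the whole chain."""
    return sum(
        all(
            is_correct(pred, turn["gold_answer"])
            for pred, turn in zip(chain["predictions"], chain["turns"])
        )
        for chain in chains
    ) / len(chains)

def temporal_drift_rate(chains, is_correct):
    """Among incorrect turns, how often the prediction matches the
    present-day answer instead of the time-scoped gold answer
    (assumes a hypothetical `present_day_answer` field)."""
    drifted = wrong = 0
    for chain in chains:
        for pred, turn in zip(chain["predictions"], chain["turns"]):
            if not is_correct(pred, turn["gold_answer"]):
                wrong += 1
                if is_correct(pred, turn.get("present_day_answer", "")):
                    drifted += 1
    return drifted / wrong if wrong else 0.0
```

Because one wrong turn fails an entire chain, chain-level accuracy can never exceed turn-level accuracy, and the gap widens as chains grow longer.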
### 📁 Repository Structure

```
ChronoScope/
├── data/                  # Dataset samples and loading scripts
├── eval/                  # Evaluation scripts and prompt templates
├── src/
│   ├── model_inference.py # Script to run LLMs on ChronoScope
│   └── metrics.py         # Implementation of StrictChain@1 and Drift metrics
├── results/               # Raw results and analysis notebooks
├── LICENSE
└── README.md
```
### ⚙️ Installation

```bash
# Clone the repository
git clone https://github.com/yashkumaratri/ChronoScope.git
cd ChronoScope

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
### 🚀 Running Evaluation
To evaluate a model on the **ChronoScope** benchmark, use the `run_eval.py` script. You can specify the model, the chain family (e.g., Carryover, Scope Switch), and the context setting. The flag names below are illustrative; check the script's help output for the exact options.

```bash
# Evaluate a specific model on a specific chain family (flags are illustrative).
python eval/run_eval.py \
    --model gpt-4o \
    --family carryover \
    --setting gold_context
```
### 📚 Citation

If you find our work or the ChronoScope dataset useful in your research, please cite our paper:
```bibtex
@inproceedings{atri2026chronoscope,
  title={Evaluating Temporal Consistency in Multi-Turn Language Models},
  author={Atri, Yash Kumar and Johnson, Steven L. and Hartvigsen, Tom},
  booktitle={Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2026},
  publisher={Association for Computational Linguistics},
  url={https://your-paper-url-here.pdf}
}
```

### 📬 Contact

Yash Kumar Atri (atri@virginia.edu)