# ChronoScope: Evaluating Temporal Consistency in Multi-Turn Language Models

Paper | Dataset | License: MIT

Official implementation and dataset for the paper: "Evaluating Temporal Consistency in Multi-Turn Language Models" (ACL 2026).


## 📌 Overview

As Large Language Models (LLMs) are increasingly deployed in interactive settings, they must maintain Temporal Scope Stability—the ability to preserve, override, or transfer time-scoped factual context across dialogue turns.

ChronoScope is a large-scale diagnostic benchmark designed to test this ability. It consists of over 1.4 million deterministically generated question chains grounded in Wikidata, spanning domains such as politics, sports, and business.

### Key Contributions

- **Benchmark:** A comprehensive dataset for multi-turn temporal question answering.
- **Metric Framework:** New metrics for evaluating Strict Chain Consistency and Temporal Drift.
- **Analysis:** Extensive evaluation showing that SOTA models (GPT-4o, Gemini 1.5 Pro, Llama 3) frequently drift toward present-day assumptions, even when a historical context was correctly established.

## 🛠 Dataset Structure

ChronoScope organizes interactions into 11 distinct chain families to isolate specific patterns of temporal reasoning:

| Chain Family | Description |
| --- | --- |
| Carryover | Inheriting scope from a previous turn (e.g., "Who was CEO in 1998?" → "Who was the CFO?"). |
| Scope Switch | Explicitly overriding a previous temporal frame with a new one. |
| Cross-Entity | Shifting focus to a related entity while maintaining the same year. |
| Multi-Turn | Longer chains (3–6 turns) testing cumulative stability. |
| Temporal Narrative | Simulating chronological story-tracking across multiple years. |
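To make the chain families concrete, here is a minimal sketch of what a Carryover chain looks like. The record schema and gold answers below are illustrative, not ChronoScope's actual data format:

```python
# Illustrative Carryover chain: turn 2 inherits the temporal scope
# ("in 1998") established in turn 1. Field names are hypothetical,
# not the dataset's actual schema.
chain = {
    "family": "carryover",
    "turns": [
        {"question": "Who was the CEO of Apple in 1998?", "gold": "Steve Jobs"},
        {"question": "Who was the CFO?", "gold": "Fred Anderson"},
    ],
}

# Turn 2 carries no explicit year: a temporally stable model must keep
# answering for 1998 rather than drifting to the present day.
follow_up = chain["turns"][1]["question"]
print(follow_up)  # Who was the CFO?
```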

## 🚀 Evaluation Settings

We evaluate models under three distinct context settings:

1. **Gold Context (Oracle):** Prior assistant responses are replaced with gold answers to isolate reasoning from error propagation.
2. **Self-Conditioned:** The model relies on its own previous answers, reflecting realistic interactive usage.
3. **Questions Only:** Each question is answered in isolation to provide a baseline for implicit vs. explicit cues.
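The three settings differ only in what dialogue history the model sees at each turn. A minimal sketch of the idea (function and field names here are assumptions for illustration, not the repo's API):

```python
def build_context(turns, step, setting, self_answers=None):
    """Assemble the dialogue history shown to the model at `step`.

    turns: list of {"question": ..., "gold": ...} dicts.
    setting: "gold", "self", or "questions_only".
    self_answers: the model's own prior answers (used for "self").
    """
    if setting == "questions_only":
        # Baseline: each question answered in isolation.
        return [turns[step]["question"]]
    history = []
    for i in range(step):
        history.append(turns[i]["question"])
        if setting == "gold":
            history.append(turns[i]["gold"])   # oracle prior answers
        else:
            history.append(self_answers[i])    # model's own prior answers
    history.append(turns[step]["question"])
    return history

turns = [
    {"question": "Who was the CEO of Apple in 1998?", "gold": "Steve Jobs"},
    {"question": "Who was the CFO?", "gold": "Fred Anderson"},
]
print(len(build_context(turns, 1, "gold")))           # 3: Q1, gold A1, Q2
print(build_context(turns, 1, "questions_only"))      # ['Who was the CFO?']
```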

## 📊 Results at a Glance

Our findings reveal a significant gap between models' static knowledge and their ability to maintain temporal scope:

- **Temporal Drift:** Models exhibit a strong bias toward the "present," often forgetting the temporal constraints established in Turn 1 by the time they reach Turn 3.
- **Consistency Collapse:** In self-conditioned settings, chain-level accuracy drops significantly compared to turn-level accuracy.
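The gap between turn-level and chain-level accuracy is easy to see with a toy strict-chain metric, where a chain counts only if every one of its turns is correct. A sketch in the spirit of StrictChain@1 (the actual implementation in `src/metrics.py` may differ):

```python
def turn_accuracy(chains):
    """Fraction of individual turns answered correctly."""
    turns = [ok for chain in chains for ok in chain]
    return sum(turns) / len(turns)

def strict_chain_accuracy(chains):
    """Fraction of chains in which *every* turn is correct."""
    return sum(all(chain) for chain in chains) / len(chains)

# Three 3-turn chains: per-turn accuracy looks decent (7/9 ≈ 0.78),
# but only one chain is fully consistent (1/3 ≈ 0.33).
chains = [[True, True, True], [True, False, True], [True, True, False]]
print(turn_accuracy(chains))          # 7/9 ≈ 0.78
print(strict_chain_accuracy(chains))  # 1/3 ≈ 0.33
```

A single late-chain drift error is enough to zero out a whole chain, which is why chain-level scores collapse even when turn-level scores stay high.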

## 📂 Repository Structure

```
├── data/                   # Dataset samples and loading scripts
├── eval/                   # Evaluation scripts and prompt templates
├── src/
│   ├── model_inference.py  # Script to run LLMs on ChronoScope
│   └── metrics.py          # Implementation of StrictChain@1 and Drift metrics
├── results/                # Raw results and analysis notebooks
├── LICENSE
└── README.md
```

## ⚙️ Installation & Usage

### 1. Environment Setup

```bash
# Clone the repository
git clone https://github.com/yashkumaratri/ChronoScope.git
cd ChronoScope

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Running Evaluation

To evaluate a model on the **ChronoScope** benchmark, use the `run_eval.py` script. You can specify the model, the chain family (e.g., Carryover, Scope Switch), and the context setting.

```bash
# Evaluate a specific model on a specific chain family
# (flag names below are illustrative placeholders)
python eval/run_eval.py --model gpt-4o --family carryover --setting gold

# Run the full benchmark across all 11 families
python eval/run_eval.py --model gpt-4o --all-families
```

Full evaluation scripts and documentation updates are on the way.

## 📝 Citation

If you find our work or the ChronoScope dataset useful in your research, please cite our paper:

```bibtex
@inproceedings{atri2026chronoscope,
  title={Evaluating Temporal Consistency in Multi-Turn Language Models},
  author={Atri, Yash Kumar and Johnson, Steven L. and Hartvigsen, Tom},
  booktitle={Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2026},
  publisher={Association for Computational Linguistics},
  url={https://your-paper-url-here.pdf}
}
```

Contact: Yash Kumar Atri (atri@virginia.edu)
