This repository contains code and data for the DiaLong paper, which introduces a dataset and framework for long-context memory evaluation in Large Language Models.
📥 Download the DiaLong dataset here.
📜 Read the full DiaLong paper here.
DiaLong is a dataset of long-context dialogues, each split into four distinct sessions between two users, with an accompanying list of true and false facts about each session. Looking to benchmark your LLM's memory? DiaLong tests a model's ability to actively retain and retrieve information across extended conversational contexts, a critical measure of its memory.
Large Language Model (LLM) performance is constrained by limited context length, which restricts the natural language processing tasks these models can perform. This limitation, known as forgetfulness, results from LLMs' inability to recall information outside their context window, a challenge exacerbated in long-context dialogues. Attempts to extend the attention window fall short: they demand impractical computational resources, and performance degrades on longer contexts.
We introduce DiaLong, a novel dataset and memory benchmark designed to rigorously evaluate how well current LLMs and emerging memory solutions sustain and retrieve information across prolonged dialogues. Our results highlight the need for further research into LLM memory to enable more coherent, accurate, and trustworthy conversational interactions.
Here is an overview of the memory task:
Please find the DiaLong dataset in the dialong.csv file.
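As a starting point, the dataset can be loaded and grouped by session. This is a minimal sketch, assuming a per-utterance layout with hypothetical column names (`session_id`, `speaker`, `utterance`); check `dialong.csv` for the actual schema.

```python
import csv
import io

# Illustrative excerpt mimicking an assumed dialong.csv layout:
# one row per utterance, tagged with a session id and speaker.
sample = """session_id,speaker,utterance
1,user_a,Hi! I just adopted a dog named Biscuit.
1,user_b,Congrats! What breed is Biscuit?
2,user_a,Work has been busy since the move.
2,user_b,How are you settling in?
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Group utterances by session so each session can be fed to a model in turn.
sessions = {}
for row in rows:
    sessions.setdefault(row["session_id"], []).append(
        f"{row['speaker']}: {row['utterance']}"
    )

for sid, turns in sessions.items():
    print(f"Session {sid}: {len(turns)} turns")
```

For the real file, replace the inline string with `open("dialong.csv")`.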
For those interested in modifying or extending the dataset, or evaluating their own models, we provide Google Colab notebooks to create long-context conversations with associated true and false facts, generate prompts, and run evaluation.
This notebook uses GPT-4 to make the multi-session chat conversations longer and more fluid, and generates accompanying true and false facts.
This notebook uses the created dataset to generate associated prompts for testing the ability of LLMs to differentiate between true and false facts.
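A prompt of this kind pairs a conversation with a candidate fact and asks the model to judge it. The template below is a hypothetical sketch of that structure, not the notebook's exact wording:

```python
# Hypothetical prompt template for true/false fact discrimination.
def build_fact_prompt(conversation: str, fact: str) -> str:
    return (
        "Below is a multi-session conversation between two users.\n\n"
        f"{conversation}\n\n"
        "Based only on the conversation above, is the following statement "
        f"TRUE or FALSE?\n\nStatement: {fact}\nAnswer:"
    )

conversation = (
    "user_a: I just adopted a dog named Biscuit.\n"
    "user_b: Congrats! What breed is Biscuit?"
)
prompt = build_fact_prompt(conversation, "user_a adopted a cat named Biscuit.")
print(prompt)
```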
We open-source our response generation and evaluation notebooks for GPT-3.5 and GPT-4. Plug and play with your own OpenAI API key or modify the response generation to benchmark your own models.
This notebook runs the generated prompts on GPT-3.5 and GPT-4, via the OpenAI API.
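In outline, each generated prompt becomes one chat request. The sketch below builds the request payload in the `openai>=1.0` client style (an assumption; the notebooks may use an older API version) without sending it, since sending requires a real `OPENAI_API_KEY`:

```python
# Sketch of a GPT-4 chat request; the system message wording is illustrative.
def make_request(prompt: str, model: str = "gpt-4") -> dict:
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Answer with TRUE or FALSE only."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0,  # deterministic answers for benchmarking
    }

req = make_request("Statement: user_a adopted a dog named Biscuit. Answer:")

# With a key configured, the payload could be sent as:
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.chat.completions.create(**req)
#   answer = resp.choices[0].message.content
print(req["model"])
```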
This notebook compares the true and false facts predicted in the LLM responses against the ground truth, and reports metrics for in-context and out-of-context prompting. In-context questions can be answered from the conversation within the LLM's context window; out-of-context questions have answers that lie further in the past, outside the context window.
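The evaluation step amounts to splitting predictions by whether the supporting evidence fell inside the context window and computing accuracy for each split. A minimal sketch, with illustrative field names and toy data:

```python
# Toy predictions: pred vs. gold TRUE/FALSE labels, plus whether the
# supporting evidence was inside the model's context window.
records = [
    {"pred": "TRUE",  "gold": "TRUE",  "in_context": True},
    {"pred": "FALSE", "gold": "TRUE",  "in_context": False},
    {"pred": "FALSE", "gold": "FALSE", "in_context": True},
    {"pred": "TRUE",  "gold": "FALSE", "in_context": False},
]

def accuracy(rows):
    return sum(r["pred"] == r["gold"] for r in rows) / len(rows)

in_ctx = [r for r in records if r["in_context"]]
out_ctx = [r for r in records if not r["in_context"]]
print(f"in-context accuracy: {accuracy(in_ctx):.2f}")
print(f"out-of-context accuracy: {accuracy(out_ctx):.2f}")
```

The gap between the two numbers is the benchmark's signal: a model with no long-term memory mechanism should score markedly lower on the out-of-context split.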
If you use the DiaLong dataset or memory evaluation framework, please cite us. TODO: Insert citation
