This repository contains code and data for the DiaLong paper, which introduces a dataset and framework for long-context memory evaluation in Large Language Models.
📥 Download the DiaLong dataset here.
📜 Read the full DiaLong paper here.
DiaLong is a dataset of long-context dialogues, each split into four distinct sessions between two users, with an accompanying list of true and false facts about each session. Looking to benchmark your LLM's memory? DiaLong tests a model's ability to actively retain and retrieve information across extended conversational contexts, a critical measure of its memory.
Large Language Model (LLM) performance is constrained by limited context length, which restricts the natural language processing tasks these models can perform. This limitation, known as forgetfulness, results from LLMs' inability to recall information outside their context window, a challenge exacerbated in long-context dialogues. Attempts to extend the attention window fall short: they demand impractical computational resources, and performance degrades on longer contexts.
We introduce DiaLong, a novel dataset and memory benchmark designed to rigorously evaluate how well current LLMs and emerging memory solutions sustain and retrieve information across prolonged dialogues. Our results highlight the need for further research into LLM memory to enable more coherent, accurate, and trustworthy conversational interactions.
Here is an overview of the memory task:
Please find the DiaLong dataset in the dialong.csv file.
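As a starting point, the dataset can be loaded and grouped by session. This is a minimal sketch, assuming a per-utterance layout with hypothetical column names (`session_id`, `speaker`, `utterance`); check `dialong.csv` for the actual schema.

```python
import csv
import io

# Illustrative excerpt mimicking an assumed dialong.csv layout:
# one row per utterance, tagged with a session id and speaker.
sample = """session_id,speaker,utterance
1,user_a,Hi! I just adopted a dog named Biscuit.
1,user_b,Congrats! What breed is Biscuit?
2,user_a,Work has been busy since the move.
2,user_b,How are you settling in?
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Group utterances by session so each session can be fed to a model in turn.
sessions = {}
for row in rows:
    sessions.setdefault(row["session_id"], []).append(
        f"{row['speaker']}: {row['utterance']}"
    )

for sid, turns in sessions.items():
    print(f"Session {sid}: {len(turns)} turns")
```

For the real file, replace the inline string with `open("dialong.csv")`.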
For those interested in modifying or extending the dataset, or evaluating their own models, we provide Google Colab notebooks to create long-context conversations with associated true and false facts, generate prompts, and run evaluation.
This notebook uses GPT-4 to make the multi-session chat conversations longer and more fluid, and generates accompanying true and false facts.
This notebook uses the created dataset to generate associated prompts for testing the ability of LLMs to differentiate between true and false facts.
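A prompt of this kind pairs a conversation with a candidate fact and asks the model to judge it. The template below is a hypothetical sketch of that structure, not the notebook's exact wording:

```python
# Hypothetical prompt template for true/false fact discrimination.
def build_fact_prompt(conversation: str, fact: str) -> str:
    return (
        "Below is a multi-session conversation between two users.\n\n"
        f"{conversation}\n\n"
        "Based only on the conversation above, is the following statement "
        f"TRUE or FALSE?\n\nStatement: {fact}\nAnswer:"
    )

conversation = (
    "user_a: I just adopted a dog named Biscuit.\n"
    "user_b: Congrats! What breed is Biscuit?"
)
prompt = build_fact_prompt(conversation, "user_a adopted a cat named Biscuit.")
print(prompt)
```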
We open-source our response generation and evaluation notebooks for GPT-3.5 and GPT-4. Plug and play with your own OpenAI API key or modify the response generation to benchmark your own models.
This notebook runs the generated prompts on GPT-3.5 and GPT-4, via the OpenAI API.
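In outline, each generated prompt becomes one chat request. The sketch below builds the request payload in the `openai>=1.0` client style (an assumption; the notebooks may use an older API version) without sending it, since sending requires a real `OPENAI_API_KEY`:

```python
# Sketch of a GPT-4 chat request; the system message wording is illustrative.
def make_request(prompt: str, model: str = "gpt-4") -> dict:
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Answer with TRUE or FALSE only."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0,  # deterministic answers for benchmarking
    }

req = make_request("Statement: user_a adopted a dog named Biscuit. Answer:")

# With a key configured, the payload could be sent as:
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.chat.completions.create(**req)
#   answer = resp.choices[0].message.content
print(req["model"])
```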
This notebook compares the true and false facts predicted in the LLM responses against the ground truth, and reports metrics for in-context and out-of-context prompting. In-context questions can be answered from the conversation within the LLM's context window; out-of-context questions have answers that lie further in the past, outside the context window.
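The evaluation step amounts to splitting predictions by whether the supporting evidence fell inside the context window and computing accuracy for each split. A minimal sketch, with illustrative field names and toy data:

```python
# Toy predictions: pred vs. gold TRUE/FALSE labels, plus whether the
# supporting evidence was inside the model's context window.
records = [
    {"pred": "TRUE",  "gold": "TRUE",  "in_context": True},
    {"pred": "FALSE", "gold": "TRUE",  "in_context": False},
    {"pred": "FALSE", "gold": "FALSE", "in_context": True},
    {"pred": "TRUE",  "gold": "FALSE", "in_context": False},
]

def accuracy(rows):
    return sum(r["pred"] == r["gold"] for r in rows) / len(rows)

in_ctx = [r for r in records if r["in_context"]]
out_ctx = [r for r in records if not r["in_context"]]
print(f"in-context accuracy: {accuracy(in_ctx):.2f}")
print(f"out-of-context accuracy: {accuracy(out_ctx):.2f}")
```

The gap between the two numbers is the benchmark's signal: a model with no long-term memory mechanism should score markedly lower on the out-of-context split.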
If you use the DiaLong dataset or memory evaluation framework, please cite us. TODO: Insert citation
