DiaLong: Benchmarking Reminiscence in Long Context Dialogues

This repository contains code and data for the DiaLong paper, which introduces a dataset and framework for long-context memory evaluation in Large Language Models.

📥 Download the DiaLong dataset here.
📜 Read the full DiaLong paper here.

Introduction

DiaLong is a dataset of long-context dialogues, split into four distinct sessions between two users, with an accompanying list of true and false facts about each session. Looking to benchmark your LLM's memory? DiaLong tests a model's ability to actively retain and retrieve information, serving as a direct measure of its memory within extended conversational contexts.

Background

Large Language Model (LLM) performance is constrained by limited context length, which restricts the natural language processing tasks a model can perform over long inputs. This limitation, often described as forgetfulness, stems from LLMs' inability to recall information outside their context window, a challenge that is especially acute in long-context dialogues. Attempts to extend the attention window fall short in practice: they demand impractical amounts of computation, and performance degrades as contexts grow longer.

Where does DiaLong come in?

We introduce DiaLong, a novel dataset and memory benchmark designed to rigorously evaluate how well current LLMs and emerging memory solutions sustain and retrieve information across prolonged dialogues. Our results highlight the need for further research into LLM memory to enable more coherent, accurate, and trustworthy conversational interactions.

Here is an overview of the memory task:

Access the Dataset

Please find the DiaLong dataset in the dialong.csv file.

Run the Code

For those interested in modifying or extending the dataset, or in evaluating their own models, we provide Google Colab notebooks to create long-context conversations with associated true and false facts, create prompts, and run evaluation.

Dataset Creation

This notebook uses GPT-4 to make the multi-session chat conversations longer and more fluid, and generates accompanying true and false facts.

Open In Colab

Prompt Creation

This notebook uses the created dataset to generate associated prompts for testing the ability of LLMs to differentiate between true and false facts.
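A prompt of this kind might be assembled as sketched below; the exact wording used in the DiaLong notebooks may differ, so treat this template as illustrative only:

```python
def build_prompt(conversation: str, fact: str) -> str:
    """Assemble a true/false memory probe from a conversation and a candidate fact.

    The phrasing here is a hypothetical template, not the one shipped in the
    DiaLong prompt-creation notebook.
    """
    return (
        "Below is a multi-session conversation between two users.\n\n"
        f"{conversation}\n\n"
        "Based only on the conversation above, is the following statement "
        "true or false?\n\n"
        f"Statement: {fact}\n"
        "Answer (True/False):"
    )
```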

Open In Colab

Response Generation and Evaluation

We open-source our response generation and evaluation notebooks for GPT-3.5 and GPT-4. Plug and play with your own OpenAI API key or modify the response generation to benchmark your own models.

Response Generation

This notebook runs the generated prompts through GPT-3.5 and GPT-4 via the OpenAI API.
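The core of the response-generation loop might look like the sketch below, assuming the official `openai` Python client (an `OpenAI()` instance with `OPENAI_API_KEY` set in the environment). The client is passed in rather than constructed here so the message-packaging step stays testable offline:

```python
def build_messages(prompt: str) -> list:
    """Package a single prompt into the chat-completions message format."""
    return [{"role": "user", "content": prompt}]

def generate_response(client, prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Send one prompt to the OpenAI chat-completions endpoint.

    `client` is an openai.OpenAI() instance; swap `model` for "gpt-4"
    (or your own backend) to benchmark other models.
    """
    resp = client.chat.completions.create(
        model=model,
        messages=build_messages(prompt),
        temperature=0,  # deterministic answers for benchmarking
    )
    return resp.choices[0].message.content
```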

Open In Colab

Evaluation

This notebook compares the true/false predictions extracted from the LLM responses against the ground truth, and reports metrics separately for in-context and out-of-context prompting. In-context refers to questions that can be answered from the conversation within the LLM's context window; out-of-context refers to questions whose answers lie further in the past, outside the context window.
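The split metric amounts to bucketed accuracy, which can be sketched as follows; the record field names (`predicted`, `ground_truth`, `in_context`) are illustrative assumptions, not the notebook's actual schema:

```python
def accuracy_by_context(records):
    """Accuracy split by whether the queried fact was in the context window.

    records: iterable of dicts with (assumed) keys
        'predicted'    -- model's True/False prediction
        'ground_truth' -- the labeled True/False value
        'in_context'   -- bool, whether the fact lies inside the window
    """
    buckets = {"in_context": [0, 0], "out_of_context": [0, 0]}  # [correct, total]
    for r in records:
        key = "in_context" if r["in_context"] else "out_of_context"
        buckets[key][1] += 1
        if r["predicted"] == r["ground_truth"]:
            buckets[key][0] += 1
    return {k: (c / n if n else 0.0) for k, (c, n) in buckets.items()}
```

Comparing the two numbers makes the memory gap concrete: a model with no long-term memory should score near chance on the out-of-context bucket while doing well in-context.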

Open In Colab

Citation

If you use the DiaLong dataset or memory evaluation framework, please cite us. TODO: Insert citation.
