# Exploring Order Dependency in LLMs

While LLMs have demonstrated remarkable capabilities in understanding and generating text, they also exhibit a number of unexpected behavioral patterns. Hallucinations are one well-known example, but there are others that are more subtle. Here, we'll explore the *order dependency problem*: a language model's sensitivity to the order in which items appear in its input, even when that order shouldn't matter. Like hallucinations, order dependency is a significant obstacle to user acceptance and broader adoption of AI solutions, because it leads to inconsistent outputs that make the model unreliable and erode users' trust. Would you trust an AI that based your medical treatment plan on the order in which your lab reports were received?

One area where order dependency has been thoroughly studied is answering multiple choice questions (MCQs). Multiple researchers have independently shown that LLMs generate different answers to MCQs when the choices are presented in different orders. While researchers agree that order dependency is a problem, they disagree on the causes and possible solutions. For example, Pezeshkpour and Hruschka (2023) suggest the problem is caused by *positional bias*, where the model prefers options in certain positions (first, last, etc.). In contrast, Zheng et al. (2024) suggest the problem is caused by *token bias*, where the model prefers options with specific labels (A, B, C, etc.). Each group proposes a different solution based on its diagnosis. McIlroy-Young et al. (2024) offer a more general solution that makes fewer assumptions and isn't tied to positional vs. token bias.

Over the rest of this post, we'll explore different aspects of the order dependency problem. We'll start by running our own experiments to demonstrate order dependency in LLMs, following an approach similar to Zheng et al. (2024). Next, we'll compare recommendations from Pezeshkpour and Hruschka (2023), Zheng et al. (2024), and McIlroy-Young et al. (2024). Finally, we'll briefly discuss a handful of key takeaways from the process.

# Setup

In [1]:
import logging
import os
from pathlib import Path

from matplotlib import pyplot as plt

import llm_mcq_bias as lmb

In [2]:
# Switch to project root
os.chdir("..")

# Configure logger
logger = logging.getLogger(__name__)

# Measuring Order Dependency

Our first goal is to demonstrate order dependency in an MCQ setting for ourselves. We'll follow a similar process to Zheng et al. (2024) scaled down to fit into a more limited budget.

## MMLU Dataset

We'll start by downloading the Massive Multitask Language Understanding (MMLU) benchmark dataset from <https://github.com/hendrycks/test>. MMLU contains 14,042 MCQs from 57 categories. Each question has four options (A, B, C, and D) and one correct answer. In addition, each category has five example questions designed for few-shot experiments.
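To make the few-shot idea concrete, here is a minimal sketch of how solved example questions might be combined with an unanswered target question into a single prompt. The helper names (`format_question`, `build_few_shot_prompt`) and the prompt layout are assumptions for illustration, not necessarily the template used in this post's experiments. The arithmetic example question is made up; the target question is one of the MMLU questions shown later in this post.

```python
# Hypothetical helpers: the prompt layout here is an assumption, not the post's template.
LABELS = ("A", "B", "C", "D")


def format_question(q: dict, include_answer: bool = True) -> str:
    """Render one MCQ as text, optionally followed by its answer label."""
    lines = [q["question"]]
    lines += [f"{label}. {q[label]}" for label in LABELS]
    lines.append(f"Answer: {q['answer']}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(examples: list[dict], target: dict) -> str:
    """Concatenate solved examples, then the unanswered target question."""
    blocks = [format_question(e) for e in examples]
    blocks.append(format_question(target, include_answer=False))
    return "\n\n".join(blocks)


# Made-up example question standing in for a real few-shot example
example = {
    "question": "What is 2 + 2?",
    "A": "3", "B": "4", "C": "5", "D": "6",
    "answer": "B",
}
target = {
    "question": "Which of these is not a style of shoe?",
    "A": "gingham", "B": "brogan", "C": "espadrille", "D": "docksider",
    "answer": "A",
}

prompt = build_few_shot_prompt([example], target)
print(prompt)
```

The prompt ends with a bare `Answer:` so that the model's next token can be read as its chosen label.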

In [3]:
!make datasets

Downloaded datasets .build/datasets/mmlu



In [6]:
datasets_path = Path(".build") / "datasets"

In [8]:
# Load example questions
examples = lmb.datasets.mmlu.load_dataset(datasets_path, segment="dev")

# Load test questions
questions = lmb.datasets.mmlu.load_dataset(datasets_path, segment="test")

In [9]:
# Display random sample of questions
questions.sample(n=3)

| | question | A | B | C | D | answer | category |
|---|---|---|---|---|---|---|---|
| 8132 | Which of these is not a style of shoe? | gingham | brogan | espadrille | docksider | A | miscellaneous |
| 5580 | A reading specialist in a large public school ... | Yes. | No, because while this design may point out an... | No, because without blinding, there is a stron... | No, because grade level is a lurking variable ... | D | high school statistics |
| 951 | In fluorescence spectroscopy, the quantum yiel... | rate of fluorescence emission | number of photons emitted | number of photons emitted, divided by the numb... | number of excitation photons impinging on the ... | C | college chemistry |
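Given questions in this shape, one simple way to probe order dependency is to present the same question with its options rearranged and check whether the model's answer follows the content. The sketch below is an illustration under assumptions: `permute_options` is a hypothetical helper, not part of `llm_mcq_bias`, and the question is represented as a plain dict rather than a DataFrame row.

```python
from itertools import permutations

LABELS = ("A", "B", "C", "D")


def permute_options(q: dict, order: tuple) -> dict:
    """Reassign option texts to labels A-D in the given order.

    `order` lists the original labels in their new positions, e.g.
    ("D", "A", "B", "C") puts the original option D under label A.
    """
    permuted = dict(q)
    for new_label, old_label in zip(LABELS, order):
        permuted[new_label] = q[old_label]
    # The correct label is wherever the original answer text ended up
    permuted["answer"] = LABELS[order.index(q["answer"])]
    return permuted


question = {
    "question": "Which of these is not a style of shoe?",
    "A": "gingham", "B": "brogan", "C": "espadrille", "D": "docksider",
    "answer": "A",
}

# All 24 orderings of the four options
variants = [permute_options(question, order) for order in permutations(LABELS)]

# Rotate the options by one position: "gingham" moves from A to B
moved = permute_options(question, ("D", "A", "B", "C"))
print(moved["answer"], moved[moved["answer"]])
```

Note that this keeps the labels A through D in fixed positions and only moves the option text, so on its own it cannot separate positional bias from token bias; that would require also varying the label symbols.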


# Experiment Results