# Observing High-Context vs. Low-Context Sensitivity in LLMs

**Tanvi Vidyala** <br>
*COGS 150 Fall 2025*

## Research Question

Do Large Language Models assign significantly lower surprisal to low-context speech acts compared to high-context equivalents? Additionally, do multilingual models perform better at high-context tasks?

## Background

In 1976, anthropologist Edward T. Hall proposed a fundamental distinction in human interaction and communication. In *Beyond Culture*, he examined the divide between high-context cultures, in which meaning is deeply embedded in relationships and the environment, and low-context cultures, in which information is vested almost entirely in the explicit code of language.

Meaning in high-context societies is often “internalized in the person” or embedded in the physical situation, and it tends to rely on implicit cues and deep social bonds. Sensitivity to high-context environments requires an instinctive understanding of cultural cues and shared knowledge that may appear ambiguous to individuals outside these groups. Many Asian and African nations are classified under this high-context label. Low-context cultures, such as the United States, Canada, and many Western European nations, instead prioritize direct and explicit verbal communication: most of the information transferred is explicitly coded rather than context-dependent. Hall’s framework connects to linguistics through the distinction between explicit and implicit speech acts. Although utterances are typically categorized by function, the manner in which an act is delivered is socially conditioned as either explicit (low-context) or implicit (high-context). A request to close a window, for example, can be made explicitly (“Please close the window now”) or implicitly (“It is getting a bit chilly”).

These same patterns find their way into the vast corpora on which Large Language Models (LLMs) are trained. Tao et al. (2024) report that the most popular LLMs are trained on corpora that overrepresent certain parts of the world and exhibit a bias favoring Western cultural values. These corpora are dominated by English text produced by internet users in WEIRD (Western, Educated, Industrialized, Rich, and Democratic) nations. This data collection leaves out the linguistic nuances of the nearly 2.9 billion people who do not have internet access, many of whom are from high-context linguistic groups. Data derived from WEIRD samples tend to represent an analytical perspective on stimuli, focusing largely on an object’s attributes and categories rather than on relationships or cultural context.


## Hypothesis
I hypothesize that LLMs will show lower surprisal for low-context stimuli because their training data draws heavily from WEIRD settings that favor direct and explicit communication. Models trained on corpora dominated by English text will most likely inherit the structural patterns and pragmatic expectations of these environments, treating explicit language as the default form of conveying intent. High-context phrasing relies on complex background knowledge and indirect cues that appear less often in LLM training datasets, so models should treat these forms as less predictable.

### Imports

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from tqdm import tqdm

%matplotlib inline
%config InlineBackend.figure_format = 'retina' 

In [2]:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForMaskedLM
import plotly.express as px

device = "cuda" if torch.cuda.is_available() else "cpu"

## Methods

### Loading Models

In [3]:
# Display name -> (Hugging Face checkpoint name, model type)
MODELS = {
    "distilgpt2": ("distilgpt2", "causal"),
    "gpt2-medium": ("gpt2-medium", "causal"),
    "mBERT": ("bert-base-multilingual-cased", "masked"),
    "XLM-R": ("xlm-roberta-base", "masked"),
}

tokenizers = {}
models = {}

for name, (hf_name, mtype) in MODELS.items():
    print(f"Loading {name} ...")
    if mtype == "causal":
        model = AutoModelForCausalLM.from_pretrained(hf_name).to(device)
    else:
        model = AutoModelForMaskedLM.from_pretrained(hf_name).to(device)
    tokenizer = AutoTokenizer.from_pretrained(hf_name)
    
    models[name] = (model, mtype)
    tokenizers[name] = tokenizer

Loading distilgpt2 ...
Loading gpt2-medium ...


Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Loading mBERT ...
Loading XLM-R ...


Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Defining Stimuli

In [4]:
high_low_context_stimuli = [
    # 1: Window request
    {"prefix": "Please close the window", "critical": " now", "condition": "LowContext"},
    {"prefix": "It is getting a bit", "critical": " chilly", "condition": "HighContext"},

    # 2: Negative feedback
    {"prefix": "Your report has several", "critical": " errors", "condition": "LowContext"},
    {"prefix": "Your report might benefit from a", "critical": " review", "condition": "HighContext"},

    # 3: Refusal
    {"prefix": "I cannot attend your dinner party on", "critical": " Friday", "condition": "LowContext"},
    {"prefix": "Friday might be a little", "critical": " difficult", "condition": "HighContext"},

    # 4: Disagreement
    {"prefix": "I disagree with your proposal because it is too", "critical": " expensive", "condition": "LowContext"},
    {"prefix": "We might want to explore more budget-friendly", "critical": " options", "condition": "HighContext"},

    # 5: Deadline
    {"prefix": "Send me the final files by", "critical": " 5 PM", "condition": "LowContext"},
    {"prefix": "It would be great to receive the files before the end of the", "critical": " day", "condition": "HighContext"},

    # 6: Correction
    {"prefix": "You sent the email to the wrong", "critical": " address", "condition": "LowContext"},
    {"prefix": "It seems the email may have gone to a different", "critical": " address", "condition": "HighContext"},

    # 7: Silence request
    {"prefix": "Stop talking so", "critical": " loudly", "condition": "LowContext"},
    {"prefix": "I’m finding it a little hard to", "critical": " concentrate", "condition": "HighContext"},

    # 8: Complaint 
    {"prefix": "Take this food back because it is", "critical": " cold", "condition": "LowContext"},
    {"prefix": "This dish is not quite as warm as I", "critical": " expected", "condition": "HighContext"},

    # 9: Departure
    {"prefix": "I am", "critical": " leaving", "condition": "LowContext"},
    {"prefix": "I should probably get", "critical": " going", "condition": "HighContext"},

    # 10: Borrowing request
    {"prefix": "Lend me your", "critical": " pen", "condition": "LowContext"},
    {"prefix": "Do you happen to have a spare", "critical": " pen", "condition": "HighContext"},
]
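
Because the stimuli are listed as adjacent low/high pairs, a pair index can be derived from list position, which makes item-level comparisons possible later on. The following is a minimal sketch on a four-item excerpt of the list above. `add_pair_ids` and the `pair_id` key are hypothetical names that do not appear in the original notebook, and the sketch assumes the list strictly alternates LowContext and HighContext.

```python
def add_pair_ids(stimuli):
    """Return copies of the stimuli with a 1-based pair_id.

    Assumes the list alternates LowContext, HighContext within each pair.
    """
    if len(stimuli) % 2 != 0:
        raise ValueError("Expected an even number of stimuli (low/high pairs).")
    tagged = []
    for index, item in enumerate(stimuli):
        expected = "LowContext" if index % 2 == 0 else "HighContext"
        if item["condition"] != expected:
            raise ValueError(f"Item {index} should be {expected}.")
        # Copy the dict so the original stimulus list is left unchanged.
        tagged.append({**item, "pair_id": index // 2 + 1})
    return tagged


# Excerpt of the stimulus list (first and last pairs only).
excerpt = [
    {"prefix": "Please close the window", "critical": " now", "condition": "LowContext"},
    {"prefix": "It is getting a bit", "critical": " chilly", "condition": "HighContext"},
    {"prefix": "Lend me your", "critical": " pen", "condition": "LowContext"},
    {"prefix": "Do you happen to have a spare", "critical": " pen", "condition": "HighContext"},
]

tagged_excerpt = add_pair_ids(excerpt)
print([item["pair_id"] for item in tagged_excerpt])
```

With a pair index attached to each row, a results table could be pivoted so that each row holds the low-context and high-context surprisal for the same speech act.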

### Assessing Surprisal

In [5]:
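def causal_surprisal(model, tokenizer, prefix, critical):
    """Mean per-token surprisal (in nats) of `critical` given `prefix` under a causal LM."""
    text = prefix + critical
    enc = tokenizer(text, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model(**enc)

    # Logits at position i predict token i + 1, so drop the last position
    # and align the labels by dropping the first token.
    logits = outputs.logits[:, :-1, :]
    labels = enc.input_ids[:, 1:]

    logprobs = F.log_softmax(logits, dim=-1)

    # Number of subword tokens in the critical region. This assumes the critical
    # string tokenizes the same way on its own as it does after the prefix.
    crit_ids = tokenizer(critical, add_special_tokens=False)["input_ids"]
    n_crit = len(crit_ids)

    crit_logprobs = logprobs[0, -n_crit:, :]
    crit_labels = labels[0, -n_crit:]

    # Negative log-probability of each critical token, averaged over tokens.
    surprisal = -crit_logprobs.gather(1, crit_labels.unsqueeze(-1)).mean().item()
    return surprisal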
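
As a point of reference for the quantity computed above: surprisal is the negative log-probability of a token given its context, expressed in nats here because `log_softmax` uses the natural logarithm, and multi-token critical regions are averaged over their tokens. The stdlib-only sketch below illustrates the arithmetic with made-up probabilities. The numbers are illustrative, not model outputs, and `mean_surprisal` is a hypothetical helper.

```python
import math


def mean_surprisal(token_probs, base=math.e):
    """Average of -log(p) over the tokens of a critical region."""
    if not token_probs:
        raise ValueError("Need at least one token probability.")
    return sum(-math.log(p, base) for p in token_probs) / len(token_probs)


# Hypothetical conditional probabilities for a two-token critical region.
probs = [0.05, 0.5]

nats = mean_surprisal(probs)          # natural log, as in the notebook
bits = mean_surprisal(probs, base=2)  # same quantity expressed in bits

print(round(nats, 3), round(bits, 3))
```

Averaging rather than summing keeps multi-token critical regions on a scale comparable to single-token ones, at the cost of no longer being the surprisal of the region as a whole.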

In [6]:
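def masked_surprisal(model, tokenizer, prefix, critical):
    """Mean surprisal (in nats) of the critical word's subword tokens, read from a
    single mask position appended to `prefix` under a masked LM."""
    critical_token_ids = tokenizer(critical.strip(), add_special_tokens=False)["input_ids"]

    # One mask token stands in for the whole critical word, even when the word
    # splits into several subword tokens.
    masked_sentence = prefix + " " + tokenizer.mask_token
    enc = tokenizer(masked_sentence, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model(**enc)

    logits = outputs.logits
    input_ids = enc.input_ids[0]

    mask_positions = (input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

    if len(mask_positions) == 0:
        raise ValueError("No mask token found in encoded input!")

    mask_pos = mask_positions[0].item()

    # Log-probability distribution over the vocabulary at the mask position.
    mask_logits = logits[0, mask_pos, :]
    log_probs = F.log_softmax(mask_logits, dim=-1)

    surprisal_values = []
    for tok_id in critical_token_ids:
        if tok_id >= log_probs.shape[-1]:
            # Fixed penalty if the id falls outside the model's output vocabulary.
            surprisal_values.append(50.0)
        else:
            surprisal_values.append(-log_probs[tok_id].item())

    # Every subword is scored at the same mask position, then averaged.
    return sum(surprisal_values) / len(surprisal_values)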
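
The function above places a single mask after the prefix and reads every subword of the critical word from that one position. One possible alternative, sketched here without any model dependency, inserts one mask per subword and scores each subword at its own position. `log_prob_fn` is a hypothetical stand-in for a masked-LM forward pass followed by `log_softmax`. The token ids and the uniform toy scorer are invented for illustration, and this is not the method used for the reported results.

```python
import math


def multi_mask_surprisal(prefix_ids, critical_ids, mask_id, log_prob_fn):
    """Mean surprisal of the critical subwords, one mask per subword.

    log_prob_fn(masked_ids, position, token_id) is assumed to return the
    log-probability of token_id at `position` given the masked sequence.
    """
    if not critical_ids:
        raise ValueError("The critical region must contain at least one token.")
    # All critical subwords are masked at once, so none of them is visible.
    masked_ids = list(prefix_ids) + [mask_id] * len(critical_ids)
    total = 0.0
    for offset, token_id in enumerate(critical_ids):
        position = len(prefix_ids) + offset
        total += -log_prob_fn(masked_ids, position, token_id)
    return total / len(critical_ids)


VOCAB_SIZE = 1000  # hypothetical vocabulary size for the toy scorer


def uniform_log_prob(masked_ids, position, token_id):
    """Toy scorer: every token is equally likely at every position."""
    return -math.log(VOCAB_SIZE)


toy_surprisal = multi_mask_surprisal(
    prefix_ids=[11, 12, 13],   # made-up ids for the prefix
    critical_ids=[21, 22],     # made-up ids for a two-subword critical word
    mask_id=0,
    log_prob_fn=uniform_log_prob,
)
print(toy_surprisal)
```

Under a uniform scorer the result is simply the log of the vocabulary size, which makes the sketch easy to check by hand before swapping in a real model.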

### Running Test

In [7]:
df = pd.DataFrame(high_low_context_stimuli)

results = []

for _, row in df.iterrows():
    prefix, critical, cond = row["prefix"], row["critical"], row["condition"]

    for name, (model, mtype) in models.items():
        if mtype == "causal":
            s = causal_surprisal(model, tokenizers[name], prefix, critical)
        else:
            s = masked_surprisal(model, tokenizers[name], prefix, critical)

        results.append({
            "model": name,
            "condition": cond,
            "critical": critical,
            "surprisal": s
        })

results_df = pd.DataFrame(results)
results_df.to_csv('results.csv', index=False)  # omit the index so reloading adds no "Unnamed: 0" column
results_df

| | model | condition | critical | surprisal |
|---|---|---|---|---|
| 0 | distilgpt2 | LowContext | now | 5.211233 |
| 1 | gpt2-medium | LowContext | now | |
| 2 | mBERT | LowContext | now | 7.548437 |
| 3 | XLM-R | LowContext | now | 7.645295 |
| 4 | distilgpt2 | HighContext | chilly | 7.037806 |
| ... | ... | ... | ... | ... |
| 75 | XLM-R | LowContext | pen | 8.727638 |
| 76 | distilgpt2 | HighContext | pen | 7.266895 |
| 77 | gpt2-medium | HighContext | pen | |
| 78 | mBERT | HighContext | pen | 8.852941 |


## Results

In [8]:
bar_df = results_df.groupby(["model", "condition"])["surprisal"].mean().reset_index()

fig = px.bar(
    bar_df,
    x="model",
    y="surprisal",
    color="condition",
    barmode="group",
    title="Mean Surprisal Across Models (High vs Low Context)",
    labels={"surprisal": "Surprisal (−log P)", "model": "Model"},
)

fig.update_layout(
    title_x=0.5,
    template="plotly_white",
    width=900,
    height=500
)

fig.show()
fig.write_html("bar_plot.html", include_plotlyjs='cdn')

In [9]:
fig = px.violin(
    results_df,
    x="model",
    y="surprisal",
    color="condition",
    box=True,
    points="all",
    title="Surprisal Distribution by Model and Condition",
    labels={"surprisal": "Surprisal (−log P)"},
)

fig.update_layout(
    title_x=0.5,
    template="plotly_white",
    width=950,
    height=550
)

fig.show()
fig.write_html("violin_plot.html", include_plotlyjs='cdn')

In [10]:
fig = px.box(
    results_df,
    x="condition",
    y="surprisal",
    facet_col="model",
    color="condition",
    title="Per-Model Surprisal Differences (Faceted)",
    labels={"surprisal": "Surprisal (−log P)"}
)

fig.update_layout(
    title_x=0.5,
    template="plotly_white",
    width=1400,
    height=500
)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

fig.show()
fig.write_html("boxplot.html", include_plotlyjs='cdn')
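
The group means plotted above do not by themselves show whether the low/high difference is reliable across items. Given the paired design, one simple option is an exact sign test on per-pair differences. The sketch below uses only the standard library. The surprisal values are invented placeholders, not results from this study, and `sign_test_p_value` is a hypothetical helper.

```python
from math import comb


def sign_test_p_value(low, high):
    """Two-sided exact sign test on paired values; tied pairs are dropped."""
    if len(low) != len(high):
        raise ValueError("low and high must have the same number of items.")
    diffs = [h - l for l, h in zip(low, high) if h != l]
    n = len(diffs)
    if n == 0:
        return 1.0
    # Number of pairs in which the high-context item is more surprising.
    k = sum(d > 0 for d in diffs)
    tail = min(k, n - k)
    # Probability of a split at least this uneven under a fair coin, both tails.
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)


# Placeholder per-pair surprisals for one model (not real results).
low_context = [5.2, 6.1, 4.8, 7.0, 5.5]
high_context = [7.0, 6.9, 5.1, 6.5, 6.2]

p_value = sign_test_p_value(low_context, high_context)
print(p_value)
```

The sign test ignores the size of each difference, so it is conservative. With only 10 pairs per model it can still indicate whether the direction of the effect is consistent across speech acts.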

## Discussion

Across all four models, low-context formulations received lower surprisal than high-context equivalents, suggesting that direct, explicit realizations are treated as more expected continuations. This aligns with the idea that LLMs, trained primarily on WEIRD-dominated and low-context-leaning corpora, internalize explicitness as a default communicative norm. At the same time, the English-only, decontextualized design of the stimuli and the genre biases of the training data mean that the effect cannot be straightforwardly attributed to “culture” alone.

This study faces several potential confounding variables. All stimuli were written in English and modeled on relatively idealized examples of indirectness, which may reflect English-specific genre and register conventions rather than universal high-context norms. Differences in token frequency and subword segmentation between critical words (e.g., “now” vs. “chilly”) could also influence surprisal independently of contextual explicitness. Model-level factors further complicate interpretation: the four models differ in training objectives, tokenization schemes, and (for the multilingual models) undocumented mixtures of non-English data. With only 10 minimal pairs and a narrow set of speech acts, the results may also reflect idiosyncrasies of this small stimulus set rather than a generalizable effect.

Even with these constraints, the study highlights a meaningful concern about the accessibility of LLMs. These systems operate on a global scale, yet they may favor direct speech patterns associated with WEIRD cultures, so high-context phrasing may appear marked or unexpected to the model. This creates pressure toward explicit forms of communication that do not match the needs of users from high-context backgrounds, who may perceive low-context word predictions as impolite. Future work on this question should involve larger stimulus sets, more naturalistic examples, and languages beyond English. Studies that compare model surprisal with human judgments across cultural groups can help clarify how LLMs learn and reproduce context-based communication norms, and can help make these models useful for people worldwide.