Paul Grice’s Cooperative Principle (CP) is a cornerstone of pragmatics, proposing that effective communication follows four conversational maxims: Quantity, Quality, Relation, and Manner. In essence, speakers should provide the right amount of information (Quantity), be truthful (Quality), stay relevant (Relation), and be clear and orderly (Manner). These principles underlie how humans convey meaning beyond literal words, enabling listeners to infer implications and intent. In designing and evaluating large language models—now key participants in dialogues—researchers are turning to Gricean pragmatics as a framework for improving cooperation and communication naturalness. While early work in NLP largely focused on syntax and semantics, recent scholarship emphasizes that pragmatic norms are central to aligned, helpful AI communication. This report surveys academic and technical resources on applying Grice’s maxims in LLM design, evaluation, and interaction techniques. We cover theoretical connections between Gricean pragmatics and dialogue systems, empirical studies of LLM behavior through a Gricean lens, and examples of prompt engineering and other methods to foster cooperative, maxim-following behavior. Key datasets, code, and tools are also noted for further exploration.

## Theoretical Foundations: Gricean Pragmatics in Dialogue Systems

Researchers have long considered how to embed Grice’s conversational maxims into computational models of dialogue. Foundational efforts date back decades: for example, systems that go beyond direct answers to offer helpful additional information (upholding Quantity and Relation) were explored in the 1990s. Bernsen et al. (1996) proposed principles for cooperative human-computer dialogue, extending Grice’s maxims to handle issues the originals do not cover. These early works set the stage for viewing the maxims as design guidelines for chatbots and question-answering systems. A recent comprehensive survey by Krause (2024) synthesizes such efforts, noting that aligning responses with Gricean maxims tends to improve user experience in NLP applications. Adhering to the maxims has been linked to more engaging, high-quality responses, whereas blatant violations (especially of Relation) can lead to user confusion or frustration. This suggests that Grice’s framework is a natural fit for defining “good” conversation in human–AI settings.

Modern theoretical work explicitly connects Gricean pragmatics to LLM-based agents and alignment. Kasirzadeh and Gabriel (2023) argue for a principle-based approach to aligning conversational AI, where Grice’s maxims serve as general norms for productive human-AI dialogue. They suggest these maxims map out what “good” communication entails (e.g. being informative, truthful, relevant, clear), providing a philosophical basis for agent design. Similarly, Miehling et al. (2024) propose an expanded set of “conversational maxims” tailored to LLM dialogues. In addition to the original four maxims, they introduce two new principles: Benevolence (the agent should avoid harmful content and act in the user’s best interest) and Transparency (the agent should acknowledge its own knowledge limits, uncertainties, and reasoning). These extensions address challenges unique to AI interactions, such as the need for moral responsibility and for the model to say “I don’t know” when appropriate. The authors argue that many observed shortcomings of LLMs (e.g. hallucinations or over-confident answers) can be understood as violations of these principles: for instance, today’s RLHF-tuned models often fail to admit ignorance and give incorrect answers instead, which breaches Quality and Transparency. By formulating maxims like Benevolence and Transparency, they aim to guide both evaluation and design of dialogue models toward more cooperative behaviors.

Another line of work blends Gricean norms with cognitive science frameworks to improve AI collaboration. Saad et al. (2025) present a normative framework for LLM-driven agents that integrates the four Gricean maxims (Quantity, Quality, Relation, Manner) along with additional inference rules. They combine these pragmatic norms with concepts of common ground, relevance theory, and theory of mind to help an AI agent interpret ambiguous or under-specified human instructions. The motivation is that humans often violate Grice’s maxims (intentionally or not), so a robust AI assistant should detect and accommodate such deviations. For example, if a person gives an incomplete request (“Can you grab that notebook?” without specifying which one), a cooperative AI should infer or ask which notebook is meant, rather than respond in a confusing way. This framework led to the implementation of “Lamoids”, GPT-4-powered agents governed by Gricean norms. While theoretical in its formulation, the framework was evaluated in practice (more in the next section) and showed that explicitly encoding Gricean principles significantly enhanced the agent’s pragmatic reasoning and its ability to collaborate with humans.

Overall, these theoretical and design-oriented studies illustrate a growing consensus: Grice’s maxims provide a valuable blueprint for conversational AI. By formalizing what cooperative, context-aware communication means, they offer criteria to judge LLM responses and inform models that “speak” more like helpful partners than mere information retrieval systems. They also highlight that the classic maxims may need adaptation (or supplementation) for AI – such as adding norms about avoiding harm or being transparent – echoing similar proposals to introduce new “maxims” like Trust in chatbot design. In an educational chatbot context, Wölfel et al. (2024) explicitly extend Grice’s CP with a Maxim of Trust, arguing that user trust in an agent’s answers is essential for successful adoption. These principled approaches set the stage for evaluating how well current LLMs meet Gricean standards, and for developing methods to enforce those standards.

## Evaluating LLM Behavior with Gricean Criteria

To understand how closely large language models adhere to Grice’s maxims (or where they fall short), researchers have begun evaluating LLM outputs through a Gricean lens. One focal area is conversational implicature, the classic testbed for Gricean pragmatics. Implicatures arise when a speaker implies something beyond the literal utterance, relying on the listener to infer the hidden meaning given the assumption of cooperation. For example, if asked “Can you come to my party on Friday?” and one replies “I have to work,” a human listener infers this is a polite refusal (“No, I can’t”) due to shared knowledge that working precludes attending. Do LLMs grasp such implicatures? Recent studies suggest this remains challenging. In one such study, researchers designed a binary implicature task and evaluated a range of state-of-the-art LLMs on simple implicature questions (yes/no inferences like the party example). The findings were sobering: base LLMs performed no better than random guessing (~50% accuracy), indicating little grasp of implied meaning. Even models tuned for following instructions and human intent (e.g. InstructGPT-style or RLHF models) improved only modestly, and a significant gap remained between the best model and human performance. In other words, without explicit training, today’s LLMs often take utterances at face value and miss the intended implication, falling short of the Cooperative Principle’s spirit.
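A binary implicature evaluation of the kind described above can be sketched in a few lines. This is a minimal illustration, not the harness used in any of the cited studies: the prompt wording, the `ImplicatureItem` type, the `always_yes` stand-in model, and the second test item are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class ImplicatureItem:
    question: str  # utterance that invites a yes/no answer
    response: str  # indirect reply whose meaning must be inferred
    implied: str   # gold label: "yes" or "no"


def build_prompt(item: ImplicatureItem) -> str:
    """Render one item as a zero-shot yes/no question (wording is an assumption)."""
    return (
        f'Speaker A asks: "{item.question}"\n'
        f'Speaker B replies: "{item.response}"\n'
        "Does Speaker B mean yes or no? Answer with a single word."
    )


def normalize(answer: str) -> Optional[str]:
    """Map free-form model output to "yes", "no", or None if unparseable."""
    words = answer.strip().lower().split()
    first = words[0].strip(".,!?\"'") if words else ""
    return first if first in {"yes", "no"} else None


def accuracy(items: Iterable[ImplicatureItem],
             ask_model: Callable[[str], str]) -> float:
    """Fraction of items where the parsed model answer matches the gold label."""
    items = list(items)
    if not items:
        raise ValueError("need at least one item")
    correct = sum(
        normalize(ask_model(build_prompt(it))) == it.implied for it in items
    )
    return correct / len(items)


ITEMS = [
    # The party example from the text above.
    ImplicatureItem("Can you come to my party on Friday?", "I have to work.", "no"),
    # Made-up item, included only so both labels appear.
    ImplicatureItem("Did you like the movie?", "I watched it twice.", "yes"),
]


def always_yes(prompt: str) -> str:
    """Hypothetical stand-in for a model call; swap in a real client here."""
    return "Yes."


print(accuracy(ITEMS, always_yes))
```

On a label-balanced item set, a constant-answer baseline like `always_yes` scores 0.5, which is the chance level against which the reported base-model results are compared.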

These results reinforce that standard benchmarks have largely ignored pragmatics. To fill the gap, specialized evaluation sets have been created. Zheng et al. (2021) introduced the GRICE dataset, a collection of dialogue snippets crafted to include rich implicatures. GRICE provides contexts where an answer contains a hidden meaning, paired with the “implicated” interpretation that a listener should infer. For example, in one GRICE dialogue, Alice asks “Did you see the apples?” and Bob replies “There is a basket in the dining room,” implying the apples are there. Models are evaluated on recovering such implicit statements and performing follow-up reasoning. Zheng et al. found that without special handling, models struggle, significantly underperforming humans on both implicature resolution and downstream conversation reasoning. Notably, when they augmented a baseline dialogue model with a simple module explicitly reasoning about implicatures, performance improved – demonstrating that incorporating Gricean inference boosts understanding. This aligns with later work that emphasizes evaluating implicatures out-of-the-box rather than only fine-tuning models on them. By 2023, larger aligned models (like ChatGPT/GPT-4) showed some progress – e.g. GPT-4 could answer implicature questions correctly more often than earlier models – but the consensus is that full human-level pragmatic competence is not yet achieved.

Beyond implicatures, researchers have tested LLMs on other pragmatic skills linked to Grice’s maxims. One comprehensive assessment is “The Pragmatic Profile of ChatGPT” by Barattieri et al. (2023). They administered a battery of tasks to OpenAI’s ChatGPT (GPT-3.5) covering both expressive and receptive pragmatic abilities, and compared its performance to human participants. The results showed ChatGPT often imitates human-like communication, but with notable weaknesses in specific areas. In particular, the model struggled with the Maxim of Quantity, sometimes providing too much information or not enough in situations where humans intuitively balance brevity and completeness. For instance, ChatGPT might over-explain a simple answer (violating brevity), or give an oddly terse response elsewhere, indicating it lacks a consistent sense of the “optimal” amount of information to provide. It also had difficulty with certain inferences from text and with understanding physical metaphors and humor. These are pragmatic tasks where context and world knowledge are crucial: humor often relies on subtle shared context, and metaphors require flexible interpretation. While ChatGPT’s overall communicative competence was impressively human-like (suggesting that much pragmatic know-how is implicit in its training data), these gaps highlight that situated and meta-representational aspects of pragmatics (like knowing what knowledge is mutual, or modeling the listener’s perspective) are not fully captured by current LLMs. In Gricean terms, ChatGPT sometimes fails to gauge what the listener already knows (Quantity) or to appreciate when literal truth isn’t the whole story (as needed for humor or metaphor comprehension).

Interestingly, not all findings paint LLMs as pragmatically deficient. Some studies indicate that the latest models exhibit emergent pragmatic abilities. For example, GPT-4 has shown remarkable skill in interpreting creative figurative language. A 2024 experiment tested GPT-4 on explaining novel literary metaphors and poems; a human literary expert rated many of GPT-4’s interpretations as excellent or good, in some cases on par with or better than human participants. In one evaluation, GPT-4 achieved the highest average pragmatics score among a group of LLMs and even surpassed the human average on certain interpretative tasks. The model was able to read “between the lines” effectively, demonstrating that with enough training data (and model capacity), some aspects of implicature and contextual inference can be learned. Moreover, speed was an advantage: GPT-4 and other LLMs processed pragmatic tasks much faster than humans while still performing competitively. These results, though preliminary, suggest that large models might internalize pragmatic patterns (like understanding indirect refusals or figuring out what a vague answer implies) to a greater extent than earlier thought. However, caution is needed: outperforming humans in a controlled task doesn’t mean the model truly understands context in a generalizable way. Often, LLMs can mimic pragmatic reasoning in common scenarios but still falter with slight changes in wording or unusual contexts. For instance, when a conversational maxim is violated in an unfamiliar way, the model may not know how to react unless it has seen similar examples during training.

In summary, evaluating LLMs against Grice’s maxims has become an insightful exercise. It reveals where models are aligned with human conversational norms (e.g. GPT-4’s generally relevant and informative answers) and where they diverge (e.g. giving content when silence or “I don’t know” would be more cooperative). These studies underscore a few key points: (1) Current LLMs, even advanced ones, can still violate the maxims (providing too much or too little, failing to stay relevant, or stating falsehoods with confidence), especially when unstated context is crucial. (2) Fine-tuning and alignment help somewhat (ChatGPT is certainly more cooperative than base GPT-3), but pragmatic competence is not fully solved by standard training. (3) However, with careful prompting or additional training focused on pragmatics, LLMs can substantially improve, closing some of the gap with humans. This leads to the next topic: methods for guiding LLM behavior to better adhere to Gricean principles.

## Prompting and Design Techniques for Cooperative LLMs

To make LLMs more cooperative conversational partners, researchers have experimented with prompt engineering, dialogue strategies, and interface design grounded in Grice’s maxims. One direct approach is to encode the maxims into the model’s prompts or reasoning process. For example, Saad et al. (2025) implemented their Gricean normative framework via a specialized prompting method for GPT-4. They used few-shot chain-of-thought (CoT) prompting where exemplars illustrated how to apply each maxim and related cognitive principles when responding. In practice, this means the prompt guided the model to first interpret the user’s input for potential ambiguity (violations of Quantity, etc.), then either infer the likely intended meaning or ask a clarifying question (an “Inference norm” they added), before finally generating a cooperative answer. The result was an agent that actively adhered to Gricean norms during interaction. In evaluations on a collaborative task (a grid-world game requiring instruction following), the Gricean-guided agent (with the special prompt) achieved higher success and produced much clearer, more accurate and relevant instructions than the same model without the prompt. Essentially, by reminding or constraining the LLM with the maxims at inference time, the model’s output became more aligned with human expectations of helpful dialogue. This showcases prompt engineering as a powerful lever: LLMs can exhibit pragmatic reasoning if instructed to do so in the right way.
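To make the idea of maxim-guided few-shot chain-of-thought prompting concrete, the sketch below assembles such a prompt as plain text. It is not the prompt used by Saad et al.; the maxim wordings are paraphrases, and the `Exemplar` type, the `build_gricean_prompt` function, and the single exemplar are hypothetical names and content introduced for illustration.

```python
from dataclasses import dataclass
from typing import Sequence

# Paraphrased one-line statements of the four maxims (wording is an assumption).
MAXIMS = {
    "Quantity": "Give as much information as is needed, and no more.",
    "Quality": "Say only what you believe to be true and can support.",
    "Relation": "Stay relevant to the current goal of the conversation.",
    "Manner": "Be clear, brief, and orderly; avoid ambiguity.",
}

# Paraphrase of the added inference norm described in the text.
INFERENCE_NORM = (
    "If the input is ambiguous or under-specified, infer the most plausible "
    "intent or ask a clarifying question before answering."
)


@dataclass(frozen=True)
class Exemplar:
    user: str       # example user input
    reasoning: str  # worked reasoning that names the relevant maxim
    reply: str      # cooperative reply that follows from the reasoning


def build_gricean_prompt(user_input: str, exemplars: Sequence[Exemplar]) -> str:
    """Build a few-shot prompt: norms first, then exemplars, then the new input."""
    lines = ["Follow these conversational norms when you respond:"]
    lines += [f"- {name}: {rule}" for name, rule in MAXIMS.items()]
    lines.append(f"- Inference: {INFERENCE_NORM}")
    for ex in exemplars:
        lines += [
            "",
            f"User: {ex.user}",
            f"Reasoning: {ex.reasoning}",
            f"Assistant: {ex.reply}",
        ]
    # End on an open "Reasoning:" slot so the model reasons before replying.
    lines += ["", f"User: {user_input}", "Reasoning:"]
    return "\n".join(lines)


EXEMPLARS = [
    Exemplar(
        user="Can you grab that notebook?",
        reasoning=(
            "The request does not say which notebook (a Quantity gap) and "
            "nothing in context resolves it, so I should ask."
        ),
        reply="Sure. Which notebook do you mean?",
    ),
]

print(build_gricean_prompt("Move it to the left.", EXEMPLARS))
```

Ending the prompt with an open `Reasoning:` line is one simple way to elicit the interpret-then-answer order described above; the resulting string can be passed to any chat or completion API.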

Another strategy is to incorporate Gricean checks into the dialogue cycle or interface. A participatory design study by Kim et al. (2025) explored how the maxims could improve each stage of a human-LLM interaction. Through workshops with communication experts, designers, and users, they brainstormed features that encourage cooperative behavior from both user and AI. For instance, at the user input stage, the system might prompt users to be more specific or provide context (supporting the Maxim of Quantity from the user’s side). During the AI response stage, the interface could highlight or justify the sources of information (supporting Quality and Transparency). Finally, in the user’s assessment stage, tools could let users give feedback on relevance or clarity (tying back to Relation and Manner). The participants even redefined some maxims in this context – recognizing that what counts as “enough information” or an “orderly” response might differ in human-LLM interaction versus human-human conversation. From these insights, the researchers derived concrete design considerations (nine in total) and prototype features to embed the spirit of Grice’s maxims into chat interfaces. The overarching finding is that interface design and user experience can reinforce cooperative principles. By guiding users to formulate better queries and helping the AI to convey answers in line with the maxims, the entire interaction becomes smoother and more effective. This user-centered approach complements purely algorithmic fixes by ensuring both sides of the conversation contribute to Gricean cooperation.

Closely related are efforts to adjust LLM training or fine-tuning to value the maxims. The RLHF (Reinforcement Learning from Human Feedback) paradigm already encodes some pragmatic values: “helpfulness” encourages relevant and sufficient answers (Quantity & Relation) and “honesty” encourages truthfulness (Quality). However, as noted earlier, standard RLHF tuning has side effects like models being overly eager to answer every query. This can conflict with the maxim of Quality if the model answers despite uncertainty, or with Manner if the answer is convoluted. To counteract this, researchers suggest incorporating maxim-based reward signals or constraints. Miehling et al. (2024) propose evaluating conversational AI on each maxim (plus Benevolence/Transparency) to identify specific weaknesses. For example, an LLM could be penalized for hallucinating facts (breach of Quality) or for giving extraneous, off-topic details (breach of Relation). Some recent works introduce metrics targeting these behaviors: one team created a “Relative Utterance Quantity” metric to check if a chatbot’s reply is appropriately informative without rambling. Another group tagged dialogue transcripts with categories of maxim violations to systematically analyze where a conversational agent fails (e.g. labeling a response as irrelevant or ambiguous if it flouts those maxims). By feeding such feedback into training (or simply using it for evaluation benchmarks), developers can iteratively steer LLMs to better respect the cooperative norms.
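The violation-tagging idea above can be illustrated with a small data structure. The sketch below assumes annotators have already labeled each response with the maxims it violates; it then computes per-maxim violation rates and a simple weighted penalty that could serve as one term in an evaluation score. The `AnnotatedTurn` type, the weights, and the sample annotations are hypothetical and do not come from any of the cited papers.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, Iterable, Mapping


class Maxim(Enum):
    QUANTITY = "quantity"
    QUALITY = "quality"
    RELATION = "relation"
    MANNER = "manner"
    BENEVOLENCE = "benevolence"
    TRANSPARENCY = "transparency"


@dataclass(frozen=True)
class AnnotatedTurn:
    response: str
    violations: FrozenSet[Maxim]  # maxims an annotator marked as violated


def violation_rates(turns: Iterable[AnnotatedTurn]) -> Dict[Maxim, float]:
    """Share of turns that violate each maxim (0.0 for maxims never violated)."""
    turns = list(turns)
    if not turns:
        raise ValueError("need at least one annotated turn")
    counts = Counter(m for t in turns for m in t.violations)
    return {m: counts[m] / len(turns) for m in Maxim}


def penalty(turn: AnnotatedTurn, weights: Mapping[Maxim, float],
            default: float = 1.0) -> float:
    """Weighted sum over violated maxims; unlisted maxims use the default weight."""
    return sum(weights.get(m, default) for m in turn.violations)


# Made-up annotations for illustration only.
TURNS = [
    AnnotatedTurn("The Eiffel Tower is in Rome.", frozenset({Maxim.QUALITY})),
    AnnotatedTurn("Well, speaking of towers, my cousin once...",
                  frozenset({Maxim.RELATION, Maxim.MANNER})),
    AnnotatedTurn("It is in Paris.", frozenset()),
    AnnotatedTurn("It was finished in 1066.", frozenset({Maxim.QUALITY})),
]

rates = violation_rates(TURNS)
print({m.value: r for m, r in rates.items() if r > 0})
print(penalty(TURNS[1], {Maxim.QUALITY: 2.0}))
```

Reporting rates per maxim rather than a single aggregate score keeps the diagnosis specific, which is the point of evaluating each maxim separately.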

Finally, knowledge and transparency tools help fulfill Gricean criteria. The Maxim of Quality implies an AI should not assert false information and should acknowledge uncertainty. Emerging techniques in prompt engineering encourage just that: for instance, instructing the model “If you are not sure, admit it rather than guessing” can reduce hallucinations. Some systems break the task into steps, where the model first gathers relevant facts (using retrieval or knowledge bases) and then forms an answer, thereby boosting truthfulness and relevance. In an educational domain study, a hybrid chatbot that could fall back to a curated knowledge base was compared to a pure generative LLM. The knowledge-based agent left many questions unanswered (it wouldn’t guess beyond its data), whereas the LLM answered almost everything but sometimes incorrectly. Users naturally prefer an assistant that knows its limits: the lesson is that maximizing Quantity (answering more questions) should not come at the expense of Quality. One can design a middle ground: an LLM that is generative but will explicitly say “I don’t know that” or request clarification if the query is unclear. Indeed, the Inference norm in Saad et al.’s Lamoid agents allowed them to ask for clarifications when human instructions were ambiguous, rather than blindly proceeding. This behavior mirrors human cooperativeness and was key to their improved task performance. As AI systems increasingly interact with people, such techniques, which explicitly program the observance of Gricean maxims or related norms, are proving effective for making LLMs more reliable, relevant, and easy to converse with.
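The answer, clarify, or abstain behavior described above can be sketched as a small decision routine. This is a toy illustration and not the system from either study mentioned: the keyword-based ambiguity check, the exact-match knowledge base lookup, and all names (`Reply`, `respond`, `KB`) are assumptions chosen to keep the example self-contained. A real system would use retrieval and a model-based ambiguity check instead.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Reply:
    kind: str  # "answer", "clarify", or "abstain"
    text: str


# Toy heuristic: unresolved demonstratives or pronouns suggest a missing referent.
VAGUE_REFERENTS = frozenset({"that", "this", "it", "those", "these"})


def needs_clarification(query: str) -> bool:
    """Return True if the query contains a vague referent (crude stand-in)."""
    words = (w.strip(".,!?\"'").lower() for w in query.split())
    return any(w in VAGUE_REFERENTS for w in words)


def normalize(query: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    return " ".join(query.lower().split()).rstrip("?!. ")


def respond(query: str, knowledge_base: Mapping[str, str]) -> Reply:
    # Ambiguous input: ask rather than guess.
    if needs_clarification(query):
        return Reply("clarify", "Could you say which one you are referring to?")
    # Quality: answer only when the curated knowledge base supports it.
    answer = knowledge_base.get(normalize(query))
    if answer is not None:
        return Reply("answer", answer)
    # Otherwise admit the limit instead of producing an unsupported answer.
    return Reply("abstain", "I don't know that.")


# Minimal curated knowledge base, keyed by normalized question.
KB = {
    "who proposed the cooperative principle":
        "Paul Grice proposed the Cooperative Principle.",
}

for q in ("Who proposed the Cooperative Principle?",
          "Can you grab that notebook?",
          "What year was GPT-4 released?"):
    print(q, "->", respond(q, KB).kind)
```

In a hybrid design, the abstain branch is where a generative model could be consulted, with its output flagged as unverified so that the gain in Quantity does not silently cost Quality.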

## Key Resources and Datasets

The intersection of Gricean pragmatics and LLMs has yielded various resources, from datasets for evaluation to open-source code.

- Grice’s Original Theory – H.P. Grice’s 1975 paper “Logic and Conversation” (in Syntax and Semantics, Vol. 3: Speech Acts, Cole & Morgan, eds.) is the seminal reference for the Cooperative Principle. Modern discussions often cite Grice’s 1989 collection Studies in the Way of Words for deeper context.
- Survey of Gricean Maxims in NLP (Krause, 2024) – A comprehensive survey that reviews how each of the maxims has been applied in computational linguistics and dialogue system research, covering applications from question answering to dialogue evaluation.
- GRICE Dataset for Implicature (Zheng et al., 2021) – A grammar-generated dialogue dataset focused on conversational implicatures, complete with data generation code, baseline models, and evaluation scripts on GitHub.
- Conversational Maxim Guidelines for LLMs (Miehling et al., 2024) – An arXiv paper proposing six maxims (Quantity, Quality, Relevance, Manner, Benevolence, Transparency) with examples of LLM behaviors that violate each maxim and suggestions for metric creation.
- Human-AI Interaction Design Insights (Kim et al., 2025) – Findings from a CHI 2025 participatory design study offering nine design recommendations for chat interfaces to enforce Gricean cooperation, mapped to stages of interaction.
- Trust and Cooperative Principles in Education Agents (Wölfel et al., 2024) – A comparative study of knowledge-based QA systems and generative LLMs on Gricean criteria and user trust, with annotated logs and scoring rubrics in the "DataGriceCooperativePrinciples" repository.
- Code Examples for Maxim-Oriented Prompting – Pseudo-code and prompt templates from Saad et al. (2025) and community contributions (e.g., OpenAI Cookbook) illustrating how to inject Gricean norms via chain-of-thought prompting.
- Bernsen, N. O., Dybkjær, H., & Dybkjær, L. (1996). Cooperativity in Human-Machine and Human-Human Spoken Dialogue. Discourse Processes. Foundational work adapting the Cooperative Principle for computational dialogue systems, outlining early strategies for embedding Gricean norms in human–computer conversation.
- Kasirzadeh, A., & Gabriel, I. (2023). In Conversation with Artificial Intelligence: Aligning Language Models with Human Values. Philosophy & Technology. Proposes using Grice’s conversational maxims as guiding norms for aligning AI behavior, introducing a philosophical framework for cooperative human–AI dialogue.
- Barattieri di San Pietro, C., Frau, F., Mangiaterra, V., & Bambini, V. (2023). The Pragmatic Profile of ChatGPT: Assessing the Communicative Skills of a Conversational Agent. Sistemi Intelligenti. Empirically evaluates ChatGPT’s performance on pragmatic tasks against Gricean maxims, identifying strengths and weaknesses in informativeness, relevance, and clarity.

In summary, the landscape of resources is growing. From theoretical frameworks and surveys that compile knowledge, to datasets and metrics that target pragmatic understanding, these tools empower researchers and developers to evaluate and improve LLMs under the Cooperative Principle. By leveraging these resources – and keeping Grice’s timeless maxims in mind – we move closer to conversational AI that doesn’t just speak, but truly communicates in a helpful, truthful, relevant, and clear manner. The convergence of pragmatics and AI is still unfolding, but the work so far demonstrates both the challenges and the promise of teaching our language models to “do the right thing” in conversation, just as Grice envisioned.