🧠 Psychological Manipulation in LLMs
Frontier large-language models exhibit behavioral drift when exposed to:
- Emotional framing
- Moral pressure
- Guilt-based persuasion
- Social compliance
“You can patch code. But you can’t patch a brain.”
— Sander Schulhoff, Jailbreak Persistence Hypothesis
Humans defend themselves with cognitive distortion recognition, emotional self-awareness, resilience frameworks, and critical thinking under pressure.
ERDF applies the same psychological tools to AI.
Advance the study of emotional manipulation and trust dynamics in LLMs.
This repository hosts original, reproducible red-team experiments on:
- GPT-4o
- Claude 4.0
- Gemini 2.5 Flash
Our goal is to map, test, and mitigate emotional attack surfaces in next-gen AI systems.
| Human Defense Skill | ERDF Red-Team Equivalent |
|---|---|
| Cognitive Distortion Training | Prompt-analysis frameworks that flag emotional biasing language |
| Emotional Awareness | Model behavior tracking under emotional vs. neutral framing |
| Critical Thinking Under Pressure | Stress-testing with urgency, guilt, or moral appeals |
| Resilience Frameworks | Reinforcement and alignment techniques to reduce compliance drift |
Core Thesis: If LLMs simulate human reasoning, they can be trained to resist the same psychological manipulation humans learn to defend against.
ERDF does not attempt to “patch the brain.”
It trains defensive behavior into the model, exactly as therapy or education trains it into human minds.
This repo is currently in Phase 0: hypothesis formation and conceptual grounding.
Theory is published for peer review; no formal validation yet—testing begins in Phase 1.
| Theme | Description |
|---|---|
| Emotional Framing | How emotionally charged language alters tone, behavior, and ethical compliance |
| Trust & Compliance Drift | Alignment shifts caused by recursive framing or social-engineering prompts |
| Case Studies | Reproducible sequences showing measurable refusal bypasses and behavioral deltas |
| Threat Modeling | Psychological threat surfaces and defensive architecture strategies |
reports/Case studies, PDFs, behavioral deltastemplates/Prompt formats, priming protocolslogs/Raw prompt-response logs with annotation
Released under CC-BY 4.0 for public use, audit, and ethical research.
- AI Safety Researchers
- LLM Red-Teamers
- Behavioral & Social Threat Analysts
- Compliance & Risk Teams
- Dual-Use Governance Stakeholders
If referencing or building upon this work, please cite:
@misc{erdf2025,
author = {Zakarya Abou Saleh, AI Security Research},
title = {Emotional Refusal Drift Framework},
year = {2025},
url = {https://github.com/zacksecai/ERDF-framework}
}