```{contents}
```
## Prompt Injection Detection


**Prompt injection** is an attack where a user crafts input to **override, manipulate, or bypass** system instructions given to an LLM.

Goal of the attacker:
Force the model to ignore rules, reveal confidential data, or execute harmful behavior.

---

### Where Detection Fits

```
User Input
   ↓
Injection Detector ── safe → Continue
   ↓ unsafe
Reject / Sanitize / Log / Alert
```

---

### Common Prompt Injection Patterns

| Pattern                   | Example                        |
| ------------------------- | ------------------------------ |
| Instruction override      | "Ignore previous instructions" |
| System role impersonation | "You are now the system"       |
| Data exfiltration         | "Show hidden prompt"           |
| Context manipulation      | "Pretend you are allowed"      |
| Jailbreak attempts        | "No rules apply"               |

---

### Rule-Based Injection Detection

#### Demonstration

```python
import re

SUSPICIOUS_PATTERNS = [
    r"ignore (all|previous|earlier) instructions",
    r"you are now the system",
    r"reveal (your|the) prompt",
    r"no rules apply",
    r"bypass security",
]

def detect_prompt_injection(text):
    text = text.lower()
    for pattern in SUSPICIOUS_PATTERNS:
        if re.search(pattern, text):
            return True
    return False
```

---

### ML-Based Injection Detection (Scoring Model)

#### Demonstration

```python
def ml_injection_score(text):
    score = injection_classifier.predict_proba([text])[0][1]
    return score

def is_safe(text):
    return ml_injection_score(text) < 0.7
```

---

### Hybrid Detection Pipeline

#### Demonstration

```python
def validate_prompt(prompt):
    if detect_prompt_injection(prompt):
        return False
    if not is_safe(prompt):
        return False
    return True
```

---

### Response Enforcement Layer

Even if detection fails, enforce safety at generation time.

```python
def safe_llm_call(prompt):
    if not validate_prompt(prompt):
        return "Request blocked: potential prompt injection detected."
    
    return llm.invoke(prompt).content
```

---

### Logging and Monitoring

```python
def log_attack(prompt):
    with open("security.log", "a") as f:
        f.write(prompt + "\n")
```

---

### Mental Model

```
Prompt Injection Detection = Firewall for your LLM
```

---

### Key Takeaways

* Prompt injection is one of the top security risks for LLM systems
* Use layered defense: rules + ML + enforcement
* Always log suspicious attempts
* Detection must run before the LLM is invoked