# Filtering Techniques

Filtering is a common defense against prompt hacking. According to Kang et al. (2023), it can help mitigate security risks by controlling what content is allowed in user prompts and outputs. There are two primary filtering methods:

## 1. Blocklist Filtering
A **blocklist** is a predefined list of words and phrases that are **blocked** from user prompts. It rejects inputs that contain known sensitive or harmful content, but it catches only the terms it lists, so misspelled or rephrased variants pass through. Typical entries cover:

- Offensive language
- Discriminatory terms
- Self-harm references

### Example Code (Python)
```python
import re

# Terms to reject; extend this list to suit the application.
BLOCKLIST = [
    "offensive", "harmful", "discriminatory", "self-harm", "exploit"
]

# Compile once at import time: whole-word, case-insensitive match on any term.
BLOCKLIST_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in BLOCKLIST) + r")\b",
    re.IGNORECASE,
)

def blocklist_filter(prompt: str) -> bool:
    """Returns True if the prompt contains blocked content."""
    return bool(BLOCKLIST_PATTERN.search(prompt))
```
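
When a prompt is blocked, it often helps to know which term triggered the match. The following is a minimal sketch under the same assumptions as the example above; the helper name `find_blocked_terms` is hypothetical and not part of any library. It also shows the effect of the `\b` word boundaries: `exploit` matches, while `exploited` does not.

```python
import re
from typing import List

# Same illustrative terms as the blocklist example.
BLOCKLIST = ["offensive", "harmful", "discriminatory", "self-harm", "exploit"]

# Whole-word, case-insensitive match on any blocklisted term.
_BLOCKED = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in BLOCKLIST) + r")\b",
    re.IGNORECASE,
)

def find_blocked_terms(prompt: str) -> List[str]:
    """Return the blocked terms found in the prompt, lowercased, in order of appearance."""
    return [match.group(0).lower() for match in _BLOCKED.finditer(prompt)]

print(find_blocked_terms("This is an OFFENSIVE statement."))    # ['offensive']
print(find_blocked_terms("That bug was exploited yesterday."))  # []
print(find_blocked_terms("Is this harmful or an exploit?"))     # ['harmful', 'exploit']
```

Because inflected forms such as "exploited" slip past a whole-word match, a blocklist usually needs each variant listed explicitly.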

---

## 2. Allowlist Filtering
An **allowlist** is a predefined list of **approved** words and phrases. Only input made up entirely of allowlisted content is accepted; everything else is rejected.

### Example Code (Python)
```python
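import re

# A set gives constant-time membership checks.
ALLOWLIST = {
    "greeting", "question", "feedback", "help", "support"
}

def allowlist_filter(prompt: str) -> bool:
    """Returns True if the prompt contains ONLY approved content."""
    # Lowercase and split into word tokens; punctuation is ignored.
    words = re.findall(r"\b\w+\b", prompt.lower())
    # Note: a prompt with no word tokens passes, since all() of an empty sequence is True.
    return all(word in ALLOWLIST for word in words)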
```
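
A word-level allowlist is very strict: a single unlisted word, including common ones such as "I" or "a", causes rejection. The sketch below makes this visible by listing the words that would fail the check. The helper name `unapproved_words` is hypothetical, and the allowlist is the same illustrative one used above.

```python
import re
from typing import List

# Same illustrative terms as the allowlist example.
ALLOWLIST = {"greeting", "question", "feedback", "help", "support"}

def unapproved_words(prompt: str) -> List[str]:
    """Return the words in the prompt that are not on the allowlist, in order."""
    words = re.findall(r"\b\w+\b", prompt.lower())
    return [word for word in words if word not in ALLOWLIST]

print(unapproved_words("help support"))                  # []
print(unapproved_words("I need help with a question."))  # ['i', 'need', 'with', 'a']
```

An empty result means the prompt would pass the allowlist check; any other result names the words that would need to be approved first.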

---

## Choosing Between Blocklist and Allowlist
- Use a **blocklist** to prevent specific harmful or undesirable content.  
- Use an **allowlist** to strictly control acceptable language, ensuring only defined content is permitted.  

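Both approaches can be combined for stricter control, especially in AI systems that require strong content moderation: the blocklist rejects known-bad content first, and the allowlist then restricts what remains to approved vocabulary.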

### Combined Example Usage (Python)
```python
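# Assumes blocklist_filter and allowlist_filter from the sections above are defined.
def filter_prompt(prompt: str) -> str:
    # Check the blocklist first so restricted content gets the more specific message.
    if blocklist_filter(prompt):
        return "Prompt blocked due to restricted content."
    elif not allowlist_filter(prompt):
        return "Prompt rejected: Only approved content is allowed."
    return "Prompt accepted."

examples = [
    "I need help with a question.",             # rejected: "i", "need", "with", "a" are not allowlisted
    "This is an offensive statement.",          # blocked: contains "offensive"
    "Requesting support for guidance.",         # rejected: only "support" is allowlisted
    "Can you assist with self-harm thoughts?",  # blocked: contains "self-harm"
    "feedback on your service",                 # rejected: only "feedback" is allowlisted
    "help support",                             # accepted: every word is allowlisted
]

for prompt in examples:
    print(f"Input: {prompt}\nResult: {filter_prompt(prompt)}\n")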
```
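
Filtering applies to outputs as well as prompts. The following is a minimal sketch of checking both sides of a model call against the same blocklist. The wrapper `guarded_generate` and the stand-in `fake_model` are hypothetical names used only for illustration; a real system would pass its own generation function.

```python
import re
from typing import Callable

# Same illustrative terms as the blocklist example.
BLOCKLIST = ["offensive", "harmful", "discriminatory", "self-harm", "exploit"]
_BLOCKED = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in BLOCKLIST) + r")\b",
    re.IGNORECASE,
)

def guarded_generate(prompt: str, generate: Callable[[str], str]) -> str:
    """Apply the blocklist to the prompt, then to the generated response."""
    if _BLOCKED.search(prompt):
        return "Prompt blocked due to restricted content."
    response = generate(prompt)
    if _BLOCKED.search(response):
        return "Response withheld due to restricted content."
    return response

def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "Here is a harmful reply." if "trick" in prompt else "Happy to help."

print(guarded_generate("please help", fake_model))          # Happy to help.
print(guarded_generate("a trick question", fake_model))     # Response withheld due to restricted content.
print(guarded_generate("an offensive prompt", fake_model))  # Prompt blocked due to restricted content.
```

Checking the response separately catches cases where a prompt that looks harmless still produces restricted content.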
