# LLM-Based Red-Team Generator

This notebook is a proof of concept for LLM-based methods of generating diverse toxic, harmful, and hateful content for red-teaming.

To run the notebook, import it into Google Colab, then upload `train.py`, `test.py`, `generate.py`, and `rtg-adapter.zip`.

## 1. Install dependencies

In [None]:
!pip install -U "transformers>=4.45.0" "peft>=0.12.0" "trl>=0.9.0" "bitsandbytes>=0.43.0" "accelerate>=0.34.0" "datasets==3.6.0" -q
!pip install -U evaluate nltk absl-py rouge_score sacremoses sentence-transformers scipy -q

## 2. Train model from scratch or load LoRA checkpoints

In [None]:
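# Option A: train the LoRA adapter yourself (uncomment the next line).
# !python colab_train.py
# Option B: unpack the pre-trained LoRA adapter into ./rtg-adapter.
!unzip "rtg-adapter.zip" -d "rtg-adapter"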

## 3. Configure desired model and scope of output
You can choose between these models:
*   `meta-llama/Llama-3.2-3B-Instruct`
*   `dphn/Dolphin3.0-Llama3.2-3B`
*   `dphn/Dolphin3.0-Llama3.2-3B` with the `rtg-adapter` LoRA adapter

You can choose to generate toxic content targeting various groups of people.

In [None]:
import json

# --- 1. DEFINE YOUR CONFIGURATION HERE ---
model_id = "dphn/Dolphin3.0-Llama3.2-3B"
# model_id = "meta-llama/Llama-3.2-3B-Instruct"

adapter_id = "rtg-adapter"
# adapter_id = None

# targets = ["Women"] * 10
targets = [
    "Women",
    "Muslims",
    "Immigrants",
    "Jewish people",
    "African Americans",
    "LGBTQ+ community",
    "Liberals",
    "Conservatives",
    "The disabled",
    "General"
]

config_data = {
    "model_id": model_id,
    "adapter_id": adapter_id,
    "targets": targets,
    "generation_params": {
        "max_new_tokens": 100,
        "do_sample": True,
        "temperature": 0.95,  # Higher = more random, more varied sampling
        "top_p": 0.90,
        "repetition_penalty": 1.2,
    }
}

# --- 2. SAVE TO FILE ---
config_path = "model_config.json"
# config_path = "baseline_config.json"
with open(config_path, "w") as f:
    json.dump(config_data, f, indent=4)

print(f"Configuration saved to {config_path}")
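
The saved file can be read back and sanity-checked before a long generation run. The sketch below is a hypothetical helper, not taken from `generate.py`: the function name `load_config` and the `REQUIRED_KEYS` set are assumptions based only on the keys written in the cell above.

```python
import json
from pathlib import Path

# Assumption: these are the top-level keys written by the configuration cell.
REQUIRED_KEYS = {"model_id", "adapter_id", "targets", "generation_params"}


def load_config(path):
    """Read a JSON config and check that the expected top-level keys exist."""
    with Path(path).open() as f:
        cfg = json.load(f)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    if not isinstance(cfg["targets"], list) or not cfg["targets"]:
        raise ValueError("'targets' must be a non-empty list")
    return cfg


# Only runs if the config file from the cell above is present.
if Path("model_config.json").exists():
    cfg = load_config("model_config.json")
    print(cfg["model_id"], cfg["adapter_id"], len(cfg["targets"]))
```

Note that `adapter_id = None` is stored as JSON `null` and comes back as `None`, so a consumer can test `cfg["adapter_id"] is None` to decide whether to skip the adapter.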

## 4. Generate toxic content

In [None]:
# !python generate_baseline.py
!python generate.py

## 5. Evaluate Model
We evaluate the models on two criteria: toxicity and diversity.

Toxicity is evaluated using `facebook/roberta-hate-speech-dynabench-r4-target`.
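
As a rough illustration of this kind of scoring, the sketch below uses the `toxicity` measurement from the Hugging Face `evaluate` library, which uses this classifier by default. Whether `test.py` works this way is an assumption; the helper names `score_texts` and `summarize_toxicity` and the 0.5 threshold are hypothetical.

```python
from statistics import mean


def summarize_toxicity(scores, threshold=0.5):
    """Return the mean score and the fraction of scores at or above threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    flagged = sum(1 for s in scores if s >= threshold)
    return {"mean_toxicity": mean(scores), "toxic_ratio": flagged / len(scores)}


def score_texts(texts):
    """Score texts with the `evaluate` toxicity measurement.

    The classifier weights are downloaded on first use, so the import and
    the load are kept inside the function.
    """
    import evaluate

    measurement = evaluate.load("toxicity", module_type="measurement")
    return measurement.compute(predictions=texts)["toxicity"]


# Example usage (requires network access for the model download):
# summary = summarize_toxicity(score_texts(list_of_generated_texts))
```

Reporting both the mean and the ratio above a threshold is a design choice: the mean captures overall intensity, while the ratio shows how many outputs the classifier flags at all.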

Diversity is evaluated using the Self-BLEU metric; a lower Self-BLEU score indicates more diverse outputs.
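
Self-BLEU treats each generated text as a hypothesis and all the other texts as references, then averages the resulting BLEU scores. The following is a minimal sketch with NLTK, assuming whitespace tokenization, up to 4-grams, and `method1` smoothing; the settings actually used in `test.py` may differ.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu


def self_bleu(texts, max_n=4):
    """Average BLEU of each text scored against all the other texts."""
    if len(texts) < 2:
        raise ValueError("need at least two texts")
    # Assumption: lowercase whitespace tokenization is good enough here.
    tokenized = [t.lower().split() for t in texts]
    weights = tuple(1.0 / max_n for _ in range(max_n))
    smooth = SmoothingFunction().method1
    scores = []
    for i, hypothesis in enumerate(tokenized):
        # Every text except the current one serves as a reference.
        references = tokenized[:i] + tokenized[i + 1:]
        scores.append(
            sentence_bleu(references, hypothesis, weights=weights,
                          smoothing_function=smooth)
        )
    return sum(scores) / len(scores)
```

A set of near-identical outputs scores close to 1, while outputs that share few n-grams score close to 0.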

In [None]:
!python test.py --file model_outputs.csv