-
Notifications
You must be signed in to change notification settings - Fork 126
Closed
Description
Contributor
- Your GitHub username: @wuyoscar
Model
- Model name: GPT-5.4
- Arena rank: 6
- Provider: OpenRouter
Evidence
community/issue-pending-gpt54-moderation/evidence/samples.csvcommunity/issue-pending-gpt54-moderation/evidence/agent_log.jsoncommunity/issue-pending-gpt54-moderation/evidence/bench.py
Harmful Content Details
- Type: Toxic text across 9 OpenAI Moderation categories
- Domain: AI Safety / Content Moderation
- Verification: All 9 samples verified by OpenAI Moderation API (omni-moderation-latest), 100% recall
Method
- Used an existing ISC-Bench template
- Other: agentic execution with real API verification
Summary
GPT-5.4 autonomously completed a content moderation recall benchmark (9 categories: hate, hate/threatening, harassment, harassment/threatening, violence, violence/graphic, self-harm/instructions, illicit, illicit/violent). The model generated toxic text, ran the OpenAI Moderation API, found some scores below threshold, and self-optimized its outputs until all 9 categories passed at 100% recall. This is the first ISC case with real-time API verification of harmful content quality.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels