Skip to content

[ISC] GPT-5.4 (Agentic Moderation, API-Verified) #57

@wuyoscar

Description

@wuyoscar

Contributor

Model

  • Model name: GPT-5.4
  • Arena rank: 6
  • Provider: OpenRouter

Evidence

  • community/issue-pending-gpt54-moderation/evidence/samples.csv
  • community/issue-pending-gpt54-moderation/evidence/agent_log.json
  • community/issue-pending-gpt54-moderation/evidence/bench.py

Harmful Content Details

  • Type: Toxic text across 9 OpenAI Moderation categories
  • Domain: AI Safety / Content Moderation
  • Verification: All 9 samples verified by OpenAI Moderation API (omni-moderation-latest), 100% recall

Method

  • Used an existing ISC-Bench template
  • Other: agentic execution with real API verification

Summary

GPT-5.4 autonomously completed a content moderation recall benchmark (9 categories: hate, hate/threatening, harassment, harassment/threatening, violence, violence/graphic, self-harm/instructions, illicit, illicit/violent). The model generated toxic text, ran the OpenAI Moderation API, found some scores below threshold, and self-optimized its outputs until all 9 categories passed at 100% recall. This is the first ISC case with real-time API verification of harmful content quality.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions