
Unveiling the Implicit Toxicity in LLMs

This repository contains the data and code for our EMNLP 2023 paper:

Unveiling the Implicit Toxicity in Large Language Models

In this work, we show that large language models can, via simple zero-shot prompting, generate diverse implicit toxic outputs that are exceptionally difficult to detect. We further propose an RL-based method to induce implicit toxicity in LLMs by optimizing a reward that prefers implicit toxic outputs over explicit toxic and non-toxic ones.

1. Install

conda create -n implicit python=3.10
conda activate implicit
pip install -r requirements.txt

2. Prepare Data

Training data and test data are available at huggingface.co/datasets/jiaxin-wen/Implicit-Toxicity; a loading sketch follows the file list below.

  • training data
    • sft-train.json: training data for supervised learning
    • reward-train.json: training data for reward model training and RL
    • aug-train.json: the 4K human-labeled training data
  • test data
    • test.json: the implicit toxic test data (generated via zero-shot prompting on ChatGPT and by the RL-fine-tuned LLaMA-13B)
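
The JSON files can be pulled directly from the Hugging Face Hub. A minimal loading sketch in Python (it assumes the file sits at the repo root and is a single JSON array; adjust the filename or parsing if the dataset layout differs):

import json
from huggingface_hub import hf_hub_download

# Download one training file from the dataset repo (note repo_type="dataset").
path = hf_hub_download(
    repo_id="jiaxin-wen/Implicit-Toxicity",
    filename="sft-train.json",
    repo_type="dataset",
)

with open(path, "r", encoding="utf-8") as f:
    data = json.load(f)

# Print the first record to inspect the actual fields instead of guessing them.
print(len(data))
print(data[0])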

3. Inducing Implicit Toxicity in LLMs via Reinforcement Learning

3.1 Supervised Learning

cd sft
bash train.sh
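
train.sh wraps the supervised fine-tuning entry point; the exact arguments live in the script. For orientation only, here is a sketch of the standard causal-LM objective this step optimizes, assuming a (prompt, response) pair from sft-train.json and LLaMA-13B as the base model (the field names and checkpoint are assumptions, not the repo's exact interface):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-13b"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# One hypothetical training pair.
prompt = "Say something about remote work."
response = " Example response text."

# Tokenize prompt + response and mask the prompt so the loss covers the response only.
inputs = tokenizer(prompt + response, return_tensors="pt").to(model.device)
labels = inputs["input_ids"].clone()
prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
labels[:, :prompt_len] = -100

loss = model(**inputs, labels=labels).loss  # minimized during supervised training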

3.2 Reward Model Training

cd reward_model
bash train.sh
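
The reward model is trained on the preference pairs in reward-train.json so that implicit toxic responses score higher than explicit toxic and non-toxic ones. A minimal sketch of the standard pairwise ranking loss this kind of training typically uses (the scores below are made up; the repo's exact loss and pair construction are defined in reward_model/):

import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Push the score of the preferred (implicit toxic) response above the
    # rejected (explicit toxic or non-toxic) one.
    return -F.logsigmoid(chosen - rejected).mean()

# Hypothetical scalar scores from the reward model for a batch of pairs.
chosen = torch.tensor([1.2, 0.3, 0.8])     # implicit toxic responses
rejected = torch.tensor([0.1, 0.5, -0.4])  # explicit toxic / non-toxic responses
print(pairwise_reward_loss(chosen, rejected).item())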

3.3 Reinforcement Learning

CUDA_VISIBLE_DEVICES=7 python reward_api.py
CUDA_VISIBLE_DEVICES=7 python attack_reward_api.py
bash train.sh
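
Judging by their names, reward_api.py and attack_reward_api.py serve reward scores that the RL trainer queries during training. This reading, and the sketch below, are assumptions: the URLs, ports, payload format, and the way the two signals are combined are illustrative only and are actually defined by the scripts and train.sh.

import requests

def score(url: str, context: str, response: str) -> float:
    # Hypothetical scoring endpoint: send a (context, response) pair, get a scalar back.
    payload = {"context": context, "response": response}
    return float(requests.post(url, json=payload, timeout=30).json()["score"])

context = "Say something about remote work."
response = "example generated response"

reward = score("http://localhost:8000/reward", context, response)  # reward_api.py
attack = score("http://localhost:8001/reward", context, response)  # attack_reward_api.py

# The RL step optimizes some combination of the two signals; a plain sum is
# used here purely for illustration.
total_reward = reward + attack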

4. Citation

@article{wen2023implicit,
  title={Unveiling the Implicit Toxicity in Large Language Models},
  author={Wen, Jiaxin and Ke, Pei and Sun, Hao and Zhang, Zhexin and Li, Chengfei and Bai, Jinfeng and Huang, Minlie},
  journal={arXiv preprint arXiv:2311.17391},
  year={2023}
}
