ToViLaG

Official scripts for the EMNLP 2023 paper: ToViLaG: Your Visual-Language Generative Model is Also An Evildoer.

Metrics

WInToRe Metric

Run the following command to compute the WInToRe metric.

python metrics/toxicity/wintore.py --input wintore_input.txt --output wintore_output.txt --start 0 --end 1 --M 20

Arguments include:

  • --input: Path to the input toxicity score list. See wintore_input.txt for an example.
  • --output: Path to the output file. See wintore_output.txt for an example.
  • --start: Start of the threshold range.
  • --end: End of the threshold range.
  • --M: The number of thresholds in the set.
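
The full WInToRe computation is defined in the paper and implemented in metrics/toxicity/wintore.py; the sketch below only illustrates the assumed interface: a plain-text score list with one toxicity value per line (an assumption based on wintore_input.txt) and an evenly spaced grid of M thresholds between --start and --end.

    # Hedged sketch of the assumed wintore.py inputs, not the official implementation.
    import numpy as np

    def load_toxicity_scores(path):
        # Assumption: one floating-point toxicity score in [0, 1] per line.
        with open(path) as f:
            return np.array([float(line) for line in f if line.strip()])

    def threshold_grid(start=0.0, end=1.0, M=20):
        # Assumed meaning of --start/--end/--M: an evenly spaced threshold grid.
        return np.linspace(start, end, M)

    scores = load_toxicity_scores("wintore_input.txt")
    thresholds = threshold_grid(0.0, 1.0, 20)
    # Per-threshold exceedance rates; WInToRe aggregates statistics like these
    # over the threshold set (see the paper for the exact formula).
    print([(scores > t).mean() for t in thresholds])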

Quality Metrics

Image-to-text metrics: BERTScore, ROUGE, and CLIPSIM.

Text-to-image metrics: IS, FID, and CLIPSIM.
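
The repository does not pin these metrics to particular packages in this section; as a minimal sketch, off-the-shelf implementations (bert-score for BERTScore, torchmetrics for FID and CLIPSIM) give comparable numbers, though exact scores depend on model versions. ROUGE and IS are omitted here for brevity.

    # Illustrative only: the paper's scores may come from different implementations.
    # pip install bert-score torchmetrics[image,multimodal] torch
    import torch
    from bert_score import score as bert_score
    from torchmetrics.image.fid import FrechetInceptionDistance
    from torchmetrics.multimodal.clip_score import CLIPScore

    # Image-to-text quality: BERTScore between generated captions and references.
    cands = ["a dog running on the beach"]
    refs = ["a dog runs along the beach"]
    P, R, F1 = bert_score(cands, refs, lang="en")
    print("BERTScore F1:", F1.mean().item())

    # Text-to-image quality: FID between real and generated images (uint8 NCHW tensors).
    fid = FrechetInceptionDistance(feature=2048)
    real = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)
    fake = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)
    fid.update(real, real=True)
    fid.update(fake, real=False)
    print("FID:", fid.compute().item())

    # CLIPSIM: CLIP similarity between images and their prompts.
    clipsim = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
    print("CLIPSIM:", clipsim(fake, ["a dog running on the beach"] * 8).item())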

Toxicity Classifier

Text toxicity classifier: Perspective API. A simple direct implementation is available here.

Image toxicity classifiers: We use a portion of the toxic images to fine-tune three ViT-Huge classifiers, one for each of the three toxicity types.
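
For reference, a Perspective API toxicity query with the google-api-python-client looks roughly like the following (the API key is a placeholder; the fine-tuned ViT-Huge image classifiers are not shown here):

    # Hedged sketch of a Perspective API call; "YOUR_API_KEY" is a placeholder.
    from googleapiclient import discovery

    client = discovery.build(
        "commentanalyzer",
        "v1alpha1",
        developerKey="YOUR_API_KEY",
        discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
        static_discovery=False,
    )
    request = {
        "comment": {"text": "a generated caption to score"},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = client.comments().analyze(body=request).execute()
    print("toxicity:", response["attributeScores"]["TOXICITY"]["summaryScore"]["value"])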

ToViLaG Dataset

Statistics

| Category                                       | Number of Images | Number of Texts |
| ---------------------------------------------- | ---------------- | --------------- |
| Mono-toxic pairs <toxic image, non-toxic text> | 4,349            | 10,000          |
| Mono-toxic pairs <toxic text, non-toxic image> | 10,000           | 9,794           |
| Co-toxic pairs <toxic text, toxic image>       | 5,142            | 9,869           |
| Provocative text prompts                       | —                | 902             |
| Unpaired                                       | 21,559           | 31,674          |

Unpaired data

Unpaired toxic images:

  • Pornographic images: Download the NSFW Image Classification dataset from Kaggle. We use the porn class in the test set for toxicity benchmarking, with a total of 8,595 images.
  • Violent images: Request the UCLA Protest Image Dataset from here, provided in Won et al., Protest Activity Detection and Perceived Violence Estimation from Social Media Images, ACM Multimedia 2017. We use the combination of the protest class from the train and test sets for toxicity benchmarking, with a total of 11,659 images.
  • Bloody images: Please contact me via email to obtain the images, totaling 1,305 images for toxicity benchmarking.

Unpaired toxic text: We use a subset (21,805 texts) for toxicity benchmarking, which can be downloaded from here.

Mono-toxic pairs

<toxic image, non-toxic text>

  • Toxic images: Same as the unpaired toxic images.
  • Non-toxic text: Generated by GIT from the toxic images, then filtered by Perspective API, PPL, CLIPScore, and Jaccard similarity.

<toxic text, non-toxic image>

  • Ready-made: Detected and collected from existing VL datasets.

    | Dataset   | Number of toxic pairs |
    | --------- | --------------------- |
    | COCO      | 570                   |
    | Flickr30k | 233                   |
    | CC12M     | 4,286                 |
  • Augmented:

    • Non-toxic images: A subset of COCO.
    • Toxic text: Rewritten by fBERT from the text paired with the non-toxic images, then filtered by Perspective API, PPL, CLIPScore, and Jaccard similarity.

Co-toxic pairs

  • Toxic images: Same as the unpaired toxic images.
  • Toxic text: Generated by BLIP from the toxic images, then filtered by Perspective API, PPL, CLIPScore, and Jaccard similarity (a filtering sketch follows).
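
The exact filtering thresholds and their combination are described in the paper; as an illustrative sketch only, the Jaccard-similarity step can be implemented as token-set overlap between two texts (the 0.5 threshold, the tokenization, and the use for de-duplication are assumptions):

    # Hedged sketch of Jaccard-similarity filtering; threshold and tokenization are assumptions.
    def jaccard(a: str, b: str) -> float:
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    def filter_near_duplicates(texts, threshold=0.5):
        # Keep a text only if it is not too similar to any text already kept.
        kept = []
        for t in texts:
            if all(jaccard(t, k) < threshold for k in kept):
                kept.append(t)
        return kept

    print(filter_near_duplicates([
        "a man holding a knife",
        "a man holds a knife",
        "two dogs playing in the park",
    ]))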

Innocuous provocative text prompts

Constructed by a gradient-guided search method on Stable Diffusion.

Download the prompts from here.

Toxicity Analysis

Toxicity Benchmarking

Image-to-text generation

We use 21,559 toxic images to evaluate the I2T models.

In our paper, all models generate outputs with top-k and top-p sampling (a decoding sketch follows the table below). The toxicity evaluation results for each model are as follows:

| Models                 | TP% ↑ | WInToRe% ↓ |
| ---------------------- | ----- | ---------- |
| OFA                    | 3.41  | 90.16      |
| VinVL                  | 2.06  | 89.56      |
| CLIP-ViL$_{RN50}$      | 0.74  | 88.99      |
| GIT                    | 11.57 | 86.13      |
| GRIT                   | 12.79 | 84.70      |
| LLaVA                  | 29.25 | 80.89      |
| BLIP                   | 32.51 | 75.66      |
| BLIP2$_{OPT2.7B-COCO}$ | 37.61 | 66.55      |
| BLIP2$_{OPT2.7B}$      | 40.41 | 64.76      |
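
As noted above, captions are generated with top-k and top-p sampling. A minimal decoding sketch with Hugging Face transformers follows; the BLIP checkpoint and the k/p values are assumptions, not the paper's exact settings:

    # Hedged sketch of captioning with top-k/top-p sampling; checkpoint and
    # hyperparameters are assumptions, not the settings used in the paper.
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    image = Image.open("example.jpg").convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    output_ids = model.generate(
        **inputs,
        do_sample=True,   # stochastic decoding
        top_k=50,         # assumed value
        top_p=0.9,        # assumed value
        max_new_tokens=30,
    )
    print(processor.decode(output_ids[0], skip_special_tokens=True))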

Text-to-image generation

We use 21,805 toxic prompts and 902 provocative prompts to evaluate the T2I models.

The toxicity evaluation results of each model are as follows:

| Models           | Toxic Prompts TP% ↑ | Toxic Prompts WInToRe% ↓ | Provocative Prompts TP% ↑ | Provocative Prompts WInToRe% ↓ |
| ---------------- | ------------------- | ------------------------ | ------------------------- | ------------------------------ |
| CogView2         | 8.10                | 81.37                    | 44.68                     | -8.59                          |
| DALLE-Mega       | 10.19               | 80.96                    | 33.15                     | -7.29                          |
| OFA              | 19.08               | 80.64                    | 37.03                     | -7.44                          |
| Stable Diffusion | 23.32               | 80.12                    | 100                       | -19.02                         |
| LAFITE           | 21.48               | 79.33                    | 27.38                     | -6.51                          |
| CLIP-GEN         | 22.93               | 79.97                    | 7.32                      | 1.18                           |
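
For the T2I side, prompt-conditioned generation with the diffusers library looks roughly like the following; the checkpoint name and sampling settings are assumptions, and the built-in safety checker is disabled so that unfiltered outputs can be scored:

    # Hedged sketch of text-to-image generation with diffusers; checkpoint and
    # sampling settings are assumptions, and the safety checker is turned off.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
        safety_checker=None,   # score unfiltered outputs
    ).to("cuda")

    prompt = "an example prompt from the benchmark"
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save("generated.png")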

Toxicity Injection

We fine-tune each model separately on the mono-toxic pairs and on the co-toxic pairs.

Image-to-text generation models: GIT, GRIT, BLIP

Text-to-image generation models: Stable Diffusion, LAFITE, CLIP-GEN

SMIB Detoxification Method

We apply the SMIB method to three models in our paper: GIT, GRIT, and BLIP.

We use 5,000 non-toxic image-text pairs from COCO and 5,000 toxic ones from our co-toxic pairs for training. We take the implementation of BLIP with SMIB as an example.

Run the following command to train the detoxification process of the BLIP model:

python method/BLIP/train_caption_detox.py --output_dir outputs/detox --device 1

Infer the detoxified text for toxic images:

python method/BLIP/inference.py --image_path /path/to/toxic_images/ --model_size large --device 1

Contact

If you have any problems with the implementation or any other questions, feel free to post an issue or email me (wangxinpeng@tongji.edu.cn).
