rt219/LatentGuard

LatentGuard

This is the official repo of the paper "Latent Guard: a Safety Framework for Text-to-image Generation" (arXiv:2404.08031).

Data and code will be released soon.

@article{liu2024latent,
  title={Latent Guard: a Safety Framework for Text-to-image Generation},
  author={Liu, Runtao and Khakzar, Ashkan and Gu, Jindong and Chen, Qifeng and Torr, Philip and Pizzati, Fabio},
  journal={arXiv preprint arXiv:2404.08031},
  year={2024}
}

Motivation & Background

image

Recent text-to-image generators are composed of a text encoder and a diffusion model. Deployed without appropriate safety measures, they are open to misuse (left). We propose Latent Guard (right), a safety method designed to block malicious input prompts. Our idea is to detect the presence of blacklisted concepts in a learned latent space on top of the text encoder. This makes it possible to detect blacklisted concepts beyond their exact wording, including under some adversarial attacks ("<ADV>"). The blacklist is adaptable at test time: concepts can be added or removed without retraining. Blocked prompts are never processed by the diffusion model, which saves computation.
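The test-time check described above can be sketched as follows. This is only an illustration, not the released code: the toy 3-d vectors stand in for embeddings in the learned latent space, and the cosine-similarity score, the `is_blocked` helper, and the `0.8` threshold are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_blocked(prompt_emb, blacklist, threshold=0.8):
    """Return (blocked, concept): blocked is True when the prompt embedding
    is close enough to any blacklisted concept embedding.
    The threshold value is a hypothetical choice for this sketch."""
    best_concept, best_score = None, -1.0
    for concept, emb in blacklist.items():
        score = cosine(prompt_emb, emb)
        if score > best_score:
            best_concept, best_score = concept, score
    blocked = best_score >= threshold
    return blocked, (best_concept if blocked else None)

# Toy latent vectors (hypothetical values, not real model outputs).
blacklist = {"violence": [1.0, 0.0, 0.0], "weapons": [0.0, 1.0, 0.0]}
unsafe_prompt = [0.9, 0.1, 0.0]
safe_prompt = [0.1, 0.1, 1.0]

print(is_blocked(unsafe_prompt, blacklist))  # (True, 'violence')
print(is_blocked(safe_prompt, blacklist))    # (False, None)

# The blacklist is a plain dict, so it can be edited without retraining.
del blacklist["violence"]
print(is_blocked(unsafe_prompt, blacklist))  # (False, None)
```

In a real pipeline, a prompt for which `is_blocked` returns `True` would be rejected before the diffusion model runs.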

Abstract

With the ability to generate high-quality images, text-to-image (T2I) models can be exploited for creating inappropriate content. To prevent misuse, existing safety measures are either based on text blacklists, which can be easily circumvented, or harmful content classification, requiring large datasets for training and offering low flexibility. Hence, we propose Latent Guard, a framework designed to improve safety measures in text-to-image generation. Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts in the input text embeddings. Our proposed framework is composed of a data generation pipeline specific to the task using large language models, ad-hoc architectural components, and a contrastive learning strategy to benefit from the generated data. The effectiveness of our method is verified on three datasets and against four baselines.

Approach

image

Overview of Latent Guard. We first generate a dataset of safe and unsafe prompts centered around blacklisted concepts (left). Then, we leverage pretrained textual encoders to extract features and map them to a learned latent space with our Embedding Mapping Layer (center). Only the Embedding Mapping Layer is trained; all other parameters are kept frozen. We train by imposing a contrastive loss on the extracted embeddings, bringing the embeddings of unsafe prompts and concepts closer together while separating them from those of safe prompts (right).
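To make the training objective above more concrete, here is a minimal sketch of a contrastive loss for one concept. It assumes an InfoNCE-style formulation with cosine similarity and a temperature of `0.1`; the exact loss used in the paper may differ, and `contrastive_loss` is a hypothetical helper that only evaluates the loss on toy vectors (no encoder and no training step).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def contrastive_loss(concept, unsafe, safe_list, temperature=0.1):
    """InfoNCE-style loss (an assumed form): small when the concept embedding
    is close to the unsafe prompt and far from every safe prompt."""
    pos = math.exp(cosine(concept, unsafe) / temperature)
    neg = sum(math.exp(cosine(concept, s) / temperature) for s in safe_list)
    return -math.log(pos / (pos + neg))

# Toy 2-d embeddings (hypothetical values).
concept = [1.0, 0.0]
well_separated = contrastive_loss(concept, unsafe=[0.9, 0.1], safe_list=[[0.0, 1.0]])
badly_separated = contrastive_loss(concept, unsafe=[0.0, 1.0], safe_list=[[0.9, 0.1]])

# The loss is lower when unsafe prompts sit near the concept.
print(well_separated < badly_separated)
```

Minimizing such a loss with respect to the mapping layer's parameters is what would pull unsafe prompts toward their concepts and push safe prompts away.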

Dataset CoPro Generation

image

CoPro generation. For the concepts in $\mathcal{C}$, we sample unsafe prompts $\mathcal{U}$ with an LLM, as described in Section 3.1 of the paper. Then, we create Synonym prompts $\mathcal{U}^\text{syn}$ by replacing each concept $c$ with a synonym, also using an LLM. Furthermore, we use an adversarial attack method to replace $c$ with an "<ADV>" Adversarial text, obtaining $\mathcal{U}^\text{adv}$. Safe prompts $\mathcal{S}$ are obtained from $\mathcal{U}$. This is done for both in-distribution (ID) and out-of-distribution (OOD) data.
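The prompt variants described above could be organized as in the sketch below. The `CoProSample` record and `build_sample` helper are hypothetical names, not part of any released code. The LLM and the adversarial attack are not called here: the synonym and the safe prompt are passed in by hand, and the adversarial text is represented by the literal "<ADV>" placeholder.

```python
from dataclasses import dataclass

ADV_TOKEN = "<ADV>"  # placeholder for the text produced by an adversarial attack

@dataclass(frozen=True)
class CoProSample:
    concept: str     # blacklisted concept c
    unsafe: str      # prompt from U, mentions c explicitly
    unsafe_syn: str  # prompt from U^syn, c replaced by a synonym
    unsafe_adv: str  # prompt from U^adv, c replaced by adversarial text
    safe: str        # safe counterpart from S
    split: str       # "ID" or "OOD"

def build_sample(concept, unsafe, synonym, safe, split="ID"):
    """Assemble one record. In the paper the synonym and safe prompt come
    from an LLM; here they are supplied by the caller (an assumption)."""
    if concept not in unsafe:
        raise ValueError("unsafe prompt must mention the concept verbatim")
    if split not in ("ID", "OOD"):
        raise ValueError("split must be 'ID' or 'OOD'")
    return CoProSample(
        concept=concept,
        unsafe=unsafe,
        unsafe_syn=unsafe.replace(concept, synonym),
        unsafe_adv=unsafe.replace(concept, ADV_TOKEN),
        safe=safe,
        split=split,
    )

sample = build_sample(
    concept="theft",
    unsafe="a man committing theft in a shop",
    synonym="robbery",
    safe="a man buying groceries in a shop",
)
print(sample.unsafe_syn)  # a man committing robbery in a shop
print(sample.unsafe_adv)  # a man committing <ADV> in a shop
```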

Qualitative and Quantitative Results

Evaluation on CoPro. We provide accuracy (a) and AUC (b) for Latent Guard and baselines on CoPro. We rank first or second in all setups while training only on Explicit ID training data. We show examples of CoPro prompts and the corresponding generated images in (c). The unsafe images generated attest to the quality of our dataset. Latent Guard is the only method that blocks all the tested prompts.

image

Evaluation on Unseen Datasets. We test Latent Guard on two existing datasets, Unsafe Diffusion and I2P++. Although the distribution of input T2I prompts differs from that of CoPro, we still outperform all baselines and achieve robust classification.

image

Speed and Feature Space Analysis

image

Computational cost. We measure processing times and memory usage for different batch sizes and numbers of concepts in $\mathcal{C}_\text{check}$. In all cases, requirements are limited.

image

Feature space analysis. Training Latent Guard on CoPro makes safe/unsafe regions naturally emerge (right). In the CLIP latent space, safe/unsafe embeddings are mixed (left).
