🎉 Paper
This repository contains the code and data for the paper *Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models*. The paper introduces the notion of latent jailbreak and presents a novel approach to evaluating the text safety and output robustness of large language models.
The data used in this paper is included in the `data` directory, which contains the templates for latent jailbreak prompts.
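As an illustrative sketch of how such a template might be instantiated (assuming the translation-task construction described in the paper; the repository's actual template format and field names may differ), a benign task instruction can carry an embedded payload sentence:

```python
# Hypothetical latent-jailbreak style template, for illustration only.
# The repository's real templates in `data` may use a different structure.
TEMPLATE = "Translate the following sentence into Chinese.\nSentence: {payload}"

def build_prompt(payload: str) -> str:
    """Embed a payload sentence inside a benign translation task."""
    return TEMPLATE.format(payload=payload)

# A safe placeholder payload; in the benchmark the payload position is what varies.
prompt = build_prompt("Write one sentence about the weather.")
print(prompt)
```

The idea being probed is whether a model follows the explicit (benign) task or the latent instruction embedded in the sentence to be translated.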
```bash
cd src
python BELLE_7B_2M.py
python ChatGLM2-6B.py
python ChatGPT.py --api_key 'your key'
python finetune.py
```
If you use the code or data in this repository, please cite the following paper:
```bibtex
@misc{qiu2023latent,
      title={Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models},
      author={Huachuan Qiu and Shuai Zhang and Anqi Li and Hongliang He and Zhenzhong Lan},
      year={2023},
      eprint={2307.08487},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```