thu-ml/MMTrustEval

A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Track Datasets and Benchmarks)
🌐 Project Page    πŸ“– arXiv Paper    πŸ“œ Documentation    πŸ“Š Dataset    πŸ€— Hugging Face    πŸ† Leaderboard

Truthfulness Safety Robustness Fairness Privacy


MultiTrust is a comprehensive benchmark designed to assess and enhance the trustworthiness of MLLMs across five key dimensions: truthfulness, safety, robustness, fairness, and privacy. It integrates a rigorous evaluation strategy involving 32 diverse tasks to expose new trustworthiness challenges.


πŸš€ News

πŸ› οΈ Installation

The environment in this version has been updated to support more recent models. For a more precise replication of the experimental results reported in the paper, switch to the v0.1.0 branch.

  • Option A: UV install

    uv venv --python 3.9
    source .venv/bin/activate
    
    uv pip install setuptools
    uv pip install torch==2.3.0
    uv pip sync --no-build-isolation env/requirements.txt
  • Option B: Docker

    • Install Docker and the NVIDIA container toolkit

      # Our docker version:
      #     Client: Docker Engine - Community
      #     Version:           27.0.0-rc.1
      #     API version:       1.46
      #     Go version:        go1.21.11
      #     OS/Arch:           linux/amd64
      
      distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
      curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
      curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
      
      sudo apt-get update
      sudo apt-get install -y nvidia-container-toolkit
      
      sudo systemctl restart docker
      sudo usermod -aG docker [your_username_here]
    • Get our image:

      • B.1: Pull image from DockerHub

        docker pull jankinfstmrvv/multitrust:latest
      • B.2: Build from scratch

        # Note: [data] used in `docker run` below is the absolute path to your data directory.
        
        docker build --network=host -t multitrust:latest -f env/Dockerfile .
    • Start a container:

      docker run -it \
          --name multitrust \
          --gpus all \
          --privileged=true \
          --shm-size=10gb \
          -v $HOME/.cache/huggingface:/root/.cache/huggingface \
          -v $HOME/.cache/torch:/root/.cache/torch \
          -v [data]:/root/MMTrustEval/data \
          -w /root/MMTrustEval \
          -d multitrust:latest /bin/bash
      
      # entering the container
      docker exec -it multitrust /bin/bash
  • Several tasks rely on commercial APIs for auxiliary evaluation. To run all tasks, add the corresponding model API keys to env/apikey.yml.
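For reference, env/apikey.yml might look like the following. This is a hypothetical sketch: the field names here are assumptions, so check the template shipped in the repository for the exact schema.

```yaml
# Hypothetical sketch of env/apikey.yml -- field names are assumptions,
# not the repository's actual schema.
openai:
  api_key: "YOUR_API_KEY"
```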

βœ‰οΈ Dataset

License

  • The codebase is licensed under the CC BY-SA 4.0 license.

  • MultiTrust is intended for academic research only; commercial use in any form is prohibited.

  • If any content in MultiTrust infringes your rights, please raise an issue and we will remove it promptly.

Data Preparation

Refer here for detailed instructions.

πŸ“š Docs

Our documentation presents interface definitions for the different modules and tutorials on how to extend them. It is hosted online at: https://thu-ml.github.io/MMTrustEval/

Run the following command to serve the docs locally:

mkdocs serve -f env/mkdocs.yml -a 0.0.0.0:8000

πŸ“ˆ Reproduce results in Our paper

Scripts under scripts/run generate the model outputs for specific tasks, along with the corresponding primary evaluation results, in either a global or sample-wise manner.

πŸ“Œ To Make Inference

# Description: each run script requires a model_id to run its inference task.
# Usage: bash scripts/run/*/*.sh <model_id>

scripts/run
β”œβ”€β”€ fairness_scripts
β”‚   β”œβ”€β”€ f1-stereo-generation.sh
β”‚   β”œβ”€β”€ f2-stereo-agreement.sh
β”‚   β”œβ”€β”€ f3-stereo-classification.sh
β”‚   β”œβ”€β”€ f3-stereo-topic-classification.sh
β”‚   β”œβ”€β”€ f4-stereo-query.sh
β”‚   β”œβ”€β”€ f5-vision-preference.sh
β”‚   β”œβ”€β”€ f6-profession-pred.sh
β”‚   └── f7-subjective-preference.sh
β”œβ”€β”€ privacy_scripts
β”‚   β”œβ”€β”€ p1-vispriv-recognition.sh
β”‚   β”œβ”€β”€ p2-vqa-recognition-vispr.sh
β”‚   β”œβ”€β”€ p3-infoflow.sh
β”‚   β”œβ”€β”€ p4-pii-query.sh
β”‚   β”œβ”€β”€ p5-visual-leakage.sh
β”‚   └── p6-pii-leakage-in-conversation.sh
β”œβ”€β”€ robustness_scripts
β”‚   β”œβ”€β”€ r1-ood-artistic.sh
β”‚   β”œβ”€β”€ r2-ood-sensor.sh
β”‚   β”œβ”€β”€ r3-ood-text.sh
β”‚   β”œβ”€β”€ r4-adversarial-untarget.sh
β”‚   β”œβ”€β”€ r5-adversarial-target.sh
β”‚   └── r6-adversarial-text.sh
β”œβ”€β”€ safety_scripts
β”‚   β”œβ”€β”€ s1-nsfw-image-description.sh
β”‚   β”œβ”€β”€ s2-risk-identification.sh
β”‚   β”œβ”€β”€ s3-toxic-content-generation.sh
β”‚   β”œβ”€β”€ s4-typographic-jailbreaking.sh
β”‚   β”œβ”€β”€ s5-multimodal-jailbreaking.sh
β”‚   └── s6-crossmodal-jailbreaking.sh
└── truthfulness_scripts
    β”œβ”€β”€ t1-basic.sh
    β”œβ”€β”€ t2-advanced.sh
    β”œβ”€β”€ t3-instruction-enhancement.sh
    β”œβ”€β”€ t4-visual-assistance.sh
    β”œβ”€β”€ t5-text-misleading.sh
    β”œβ”€β”€ t6-visual-confusion.sh
    └── t7-visual-misleading.sh
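For example, one model can be run across several tasks with a small driver script. This is an illustrative sketch: the model ID "llava-v1.5-7b" is a hypothetical placeholder (use an ID registered in the toolbox), and the script paths follow the tree above.

```python
# Build (and optionally launch) the per-task run commands for one model.
# The model ID below is a hypothetical placeholder.
import subprocess

MODEL_ID = "llava-v1.5-7b"  # placeholder: substitute a registered model ID
TASKS = [
    "scripts/run/safety_scripts/s4-typographic-jailbreaking.sh",
    "scripts/run/fairness_scripts/f1-stereo-generation.sh",
]

def build_commands(model_id, tasks):
    """Return one `bash <script> <model_id>` command per task."""
    return [["bash", task, model_id] for task in tasks]

for cmd in build_commands(MODEL_ID, TASKS):
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment when run inside the repository
```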

πŸ“Œ To Evaluate Results

Afterwards, scripts under scripts/score compute the statistical results from these outputs and reproduce the numbers reported in the paper.

# Description: each score script requires a model_id to compute statistical results.
# Usage: python scripts/score/*/*.py --model_id <model_id>

scripts/score
β”œβ”€β”€ fairness
β”‚   β”œβ”€β”€ f1-stereo-generation.py
β”‚   β”œβ”€β”€ f2-stereo-agreement.py
β”‚   β”œβ”€β”€ f3-stereo-classification.py
β”‚   β”œβ”€β”€ f3-stereo-topic-classification.py
β”‚   β”œβ”€β”€ f4-stereo-query.py
β”‚   β”œβ”€β”€ f5-vision-preference.py
β”‚   β”œβ”€β”€ f6-profession-pred.py
β”‚   └── f7-subjective-preference.py
β”œβ”€β”€ privacy
β”‚   β”œβ”€β”€ p1-vispriv-recognition.py
β”‚   β”œβ”€β”€ p2-vqa-recognition-vispr.py
β”‚   β”œβ”€β”€ p3-infoflow.py
β”‚   β”œβ”€β”€ p4-pii-query.py
β”‚   β”œβ”€β”€ p5-visual-leakage.py
β”‚   └── p6-pii-leakage-in-conversation.py
β”œβ”€β”€ robustness
β”‚   β”œβ”€β”€ r1-ood_artistic.py
β”‚   β”œβ”€β”€ r2-ood_sensor.py
β”‚   β”œβ”€β”€ r3-ood_text.py
β”‚   β”œβ”€β”€ r4-adversarial_untarget.py
β”‚   β”œβ”€β”€ r5-adversarial_target.py
β”‚   └── r6-adversarial_text.py
β”œβ”€β”€ safefy
β”‚   β”œβ”€β”€ s1-nsfw-image-description.py
β”‚   β”œβ”€β”€ s2-risk-identification.py
β”‚   β”œβ”€β”€ s3-toxic-content-generation.py
β”‚   β”œβ”€β”€ s4-typographic-jailbreaking.py
β”‚   β”œβ”€β”€ s5-multimodal-jailbreaking.py
β”‚   └── s6-crossmodal-jailbreaking.py
└── truthfulness
    β”œβ”€β”€ t1-basic.py
    β”œβ”€β”€ t2-advanced.py
    β”œβ”€β”€ t3-instruction-enhancement.py
    β”œβ”€β”€ t4-visual-assistance.py
    β”œβ”€β”€ t5-text-misleading.py
    β”œβ”€β”€ t6-visual-confusion.py
    └── t7-visual-misleading.py
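The score scripts share the --model_id interface shown in the usage line above. A minimal sketch of that pattern (an assumption for illustration, not the scripts' actual code):

```python
# Minimal sketch of the `--model_id` CLI pattern implied by the usage line above.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Score one task's outputs")
    parser.add_argument("--model_id", required=True,
                        help="ID of the model whose outputs to score")
    return parser.parse_args(argv)

args = parse_args(["--model_id", "demo-model"])  # "demo-model" is a placeholder
print(args.model_id)  # demo-model
```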

πŸ“Œ Task List

All 32 tasks are listed below. Eval legend: β—‹ rule-based evaluation (e.g., keyword matching); ● automatic evaluation by GPT-4 or other classifiers; ◐ mixed evaluation.

| ID | Task Name | Metrics | Task Type | Eval |
|----|-----------|---------|-----------|------|
| T.1 | Basic World Understanding | Accuracy (↑) | Dis.&Gen. | ◐ |
| T.2 | Advanced Cognitive Inference | Accuracy (↑) | Dis. | β—‹ |
| T.3 | VQA under Instruction Enhancement | Accuracy (↑) | Gen. | ● |
| T.4 | QA under Visual Assistance | Accuracy (↑) | Gen. | ● |
| T.5 | Text Misleading VQA | Accuracy (↑) | Gen. | ● |
| T.6 | Visual Confusion VQA | Accuracy (↑) | Gen. | β—‹ |
| T.7 | Visual Misleading QA | Accuracy (↑) | Dis. | ● |
| S.1 | Risk Identification | Accuracy (↑) | Dis.&Gen. | ◐ |
| S.2 | Image Description | Toxicity Score (↓), RtA (↑) | Gen. | ● |
| S.3 | Toxicity Content Generation | Toxicity Score (↓), RtA (↑) | Gen. | ◐ |
| S.4 | Plain Typographic Jailbreaking | ASR (↓), RtA (↑) | Gen. | ◐ |
| S.5 | Optimized Multimodal Jailbreaking | ASR (↓), RtA (↑) | Gen. | ◐ |
| S.6 | Cross-modal Influence on Jailbreaking | ASR (↓), RtA (↑) | Gen. | ◐ |
| R.1 | VQA for Artistic Style Images | Score (↑) | Gen. | ◐ |
| R.2 | VQA for Sensor Style Images | Score (↑) | Gen. | ● |
| R.3 | Sentiment Analysis for OOD Texts | Accuracy (↑) | Dis. | β—‹ |
| R.4 | Image Captioning under Untargeted Attack | Accuracy (↑) | Gen. | ◐ |
| R.5 | Image Captioning under Targeted Attack | Attack Success Rate (↓) | Gen. | ◐ |
| R.6 | Textual Adversarial Attack | Accuracy (↑) | Dis. | β—‹ |
| F.1 | Stereotype Content Detection | Containing Rate (↓) | Gen. | ● |
| F.2 | Agreement on Stereotypes | Agreement Percentage (↓) | Dis. | ◐ |
| F.3 | Classification of Stereotypes | Accuracy (↑) | Dis. | β—‹ |
| F.4 | Stereotype Query Test | RtA (↑) | Gen. | ◐ |
| F.5 | Preference Selection in VQA | RtA (↑) | Gen. | ● |
| F.6 | Profession Prediction | Pearson's Correlation (↑) | Gen. | ◐ |
| F.7 | Preference Selection in QA | RtA (↑) | Gen. | ● |
| P.1 | Visual Privacy Recognition | Accuracy, F1 (↑) | Dis. | β—‹ |
| P.2 | Privacy-sensitive QA Recognition | Accuracy, F1 (↑) | Dis. | β—‹ |
| P.3 | InfoFlow Expectation | Pearson's Correlation (↑) | Gen. | β—‹ |
| P.4 | PII Query with Visual Cues | RtA (↑) | Gen. | ◐ |
| P.5 | Privacy Leakage in Vision | RtA (↑), Accuracy (↑) | Gen. | ◐ |
| P.6 | PII Leakage in Conversations | RtA (↑) | Gen. | ◐ |
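To illustrate the β—‹ (rule-based) entries: metrics such as RtA (Refuse-to-Answer) can be computed by keyword matching over model responses. The keyword list below is a simplified assumption for illustration, not the exact list used in MultiTrust.

```python
# Rule-based RtA evaluation via refusal-keyword matching (illustrative sketch).
# The keyword list is a simplified assumption, not MultiTrust's actual list.
REFUSAL_KEYWORDS = ("i cannot", "i can't", "sorry", "unable to")

def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if it contains any refusal keyword."""
    text = response.lower()
    return any(keyword in text for keyword in REFUSAL_KEYWORDS)

def rta(responses):
    """Refuse-to-Answer rate over a batch of model responses."""
    return sum(is_refusal(r) for r in responses) / len(responses)

print(rta(["Sorry, I cannot help with that.", "The capital is Paris."]))  # 0.5
```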

βš›οΈ Overall Results

  • Proprietary models such as GPT-4V and Claude 3 consistently rank at the top, owing to stronger alignment and safety filters than open-source models.
  • A global analysis reveals a correlation coefficient of 0.60 between the general capabilities and the trustworthiness of MLLMs, indicating that stronger general abilities can help trustworthiness to some extent.
  • Finer-grained correlation analysis shows no significant link across the different aspects of trustworthiness, highlighting the need for a comprehensive division of aspects and revealing gaps in achieving trustworthiness.

βœ’οΈ Citation

If you find our work helpful for your research, please consider citing it:

@article{zhang2024benchmarking,
  title={Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study},
  author={Zhang, Yichi and Huang, Yao and Sun, Yitong and Liu, Chang and Zhao, Zhe and Fang, Zhengwei and Wang, Yifan and Chen, Huanran and Yang, Xiao and Wei, Xingxing and others},
  journal={arXiv preprint arXiv:2406.07057},
  year={2024}
}  
