Research project conducted at POSTECH (GSAI), 2025.
Vision-Language Models (VLMs) such as BLIP-2 achieve strong performance on multimodal tasks, but their internal reasoning remains opaque. Existing interpretability tools (Grad-CAM, token importance maps) provide static, correlation-based insights that fail to capture the causal and dynamic nature of multimodal decision-making.
We introduce the Structured Temporal Counterfactual Graph (STCG), an interpretability framework that represents VLM reasoning as a causally grounded graph combining three dimensions:
- Structural alignment — links between textual tokens and visual regions via cross-attention
- Counterfactual sensitivity — causal importance of specific connections measured through perturbation (center mask, color shift, patch shuffle, peripheral mask, saliency patch mask; three are sketched after this list)
- Temporal evolution — how attention patterns shift during response generation
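As a rough illustration of the image-level perturbations named above, here is a minimal sketch assuming square `uint8` RGB NumPy arrays. Function names, the grid size, and the hue offset are illustrative choices, not the notebook's exact implementation; peripheral masking would be the complement of `center_mask`, and the saliency patch mask additionally requires a saliency map.

```python
import numpy as np
import cv2  # OpenCV, as listed in the tech stack


def center_mask(img: np.ndarray, frac: float = 0.4) -> np.ndarray:
    """Zero out a central square covering `frac` of each side."""
    out = img.copy()
    h, w = img.shape[:2]
    dh, dw = int(h * frac) // 2, int(w * frac) // 2
    out[h // 2 - dh : h // 2 + dh, w // 2 - dw : w // 2 + dw] = 0
    return out


def color_shift(img: np.ndarray, hue_offset: int = 60) -> np.ndarray:
    """Rotate hue in HSV space, probing chromatic attributions."""
    hsv = cv2.cvtColor(img, cv2.COLOR_RGB2HSV)
    hsv[..., 0] = ((hsv[..., 0].astype(int) + hue_offset) % 180).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)


def patch_shuffle(img: np.ndarray, grid: int = 4, seed: int = 0) -> np.ndarray:
    """Cut the image into a grid of patches and permute them deterministically,
    destroying spatial organization while preserving local statistics."""
    h, w = img.shape[:2]
    ph, pw = h // grid, w // grid
    patches = [img[i * ph : (i + 1) * ph, j * pw : (j + 1) * pw].copy()
               for i in range(grid) for j in range(grid)]
    np.random.default_rng(seed).shuffle(patches)
    out = img.copy()
    for k, patch in enumerate(patches):
        i, j = divmod(k, grid)
        out[i * ph : (i + 1) * ph, j * pw : (j + 1) * pw] = patch
    return out
```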
The framework is applied to BLIP-2 (`Salesforce/blip2-flan-t5-xl`) on the VQA-v2 dataset. Key findings:
- Perturbations disrupting spatial organization (patch shuffling, peripheral masking) produce the largest internal weight variation ΔW (one plausible formulation of ΔW is sketched after this list)
- Color-shift perturbations induce strong answer probability changes, consistent with chromatic attribution patterns for color-identification tasks
- A dissociation was identified between internal disruption and semantic decision: Gaussian noise produces comparable ΔW values without affecting the final answer
- Content-bearing tokens ("colour", "car") establish stronger and more localized connections to visual regions than function tokens ("what", "the")
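This summary does not define ΔW precisely. One plausible formulation, assuming stacked cross-attention tensors from a clean and a perturbed forward pass, is the mean absolute change in weights, paired with the answer-probability change that exposes the dissociation noted above:

```python
import torch


def delta_w(attn_clean: torch.Tensor, attn_pert: torch.Tensor) -> float:
    """Mean absolute change in cross-attention weights between a clean and a
    perturbed pass; tensors shaped (layers, heads, text_tokens, regions).
    One plausible reading of the DeltaW reported above, not the exact one."""
    return (attn_clean - attn_pert).abs().mean().item()


def answer_prob_change(logits_clean: torch.Tensor,
                       logits_pert: torch.Tensor,
                       answer_id: int) -> float:
    """Drop in the probability assigned to the original answer token.
    A perturbation like Gaussian noise can yield a large delta_w while this
    stays near zero, i.e. the internal/semantic dissociation described above."""
    p_clean = logits_clean.softmax(-1)[answer_id]
    p_pert = logits_pert.softmax(-1)[answer_id]
    return (p_clean - p_pert).item()
```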
Evaluation was conducted on a limited number of examples. Causal faithfulness (−0.31) and Jaccard stability (0.23) indicate that explanations derived from a single forward pass remain fragile — a known challenge in VLM interpretability that motivates further work.
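For context, Jaccard stability can be read as the mean pairwise overlap between the top-k graph edges extracted across repeated runs; this is an assumed formulation, and the notebook may define it differently. Under that reading, a score of 0.23 means fewer than a quarter of the highlighted token-region links recur between runs.

```python
from itertools import combinations


def jaccard_stability(edge_sets: list[set]) -> float:
    """Mean pairwise Jaccard overlap between top-k edge sets from repeated
    explanation runs: 1.0 means identical explanations, 0.0 means disjoint."""
    pairs = list(combinations(edge_sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


# Example: two runs sharing one of three distinct edges score 1/3.
runs = [{"colour->r3", "car->r7"}, {"colour->r3", "car->r1"}]
print(jaccard_stability(runs))  # 0.333...
```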
├── STCG_dynamic_counterfactual.ipynb # Main notebook (BLIP-2 + STCG pipeline)
└── README.md
Open the notebook in Google Colab (recommended — requires GPU for real model).
Mock mode (default, no GPU needed):
- Runs the full STCG pipeline with synthetic deterministic attention weights (a hypothetical generator is sketched after this list)
- Validates the visualization and counterfactual pipeline without downloading multi-GB checkpoints
- Set `USE_MOCK_MODEL = True` in Cell 3
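As a rough illustration of what "synthetic deterministic attention weights" can mean here, the sketch below is a hypothetical generator, not the notebook's code: a fixed seed makes every run reproducible, and row normalization keeps the weights attention-like.

```python
import numpy as np


def mock_cross_attention(n_tokens: int, n_regions: int, seed: int = 42) -> np.ndarray:
    """Hypothetical mock-mode stand-in: deterministic pseudo-attention where
    each text token's weights over visual regions sum to 1."""
    rng = np.random.default_rng(seed)  # fixed seed => identical output per run
    w = rng.random((n_tokens, n_regions))
    return w / w.sum(axis=1, keepdims=True)
```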
Real mode (requires ~12 GB disk + GPU):
- Set `STCG_FORCE_REAL=1` or `USE_MOCK_MODEL = False`
- Default model: `Salesforce/blip2-flan-t5-xl`; use `Salesforce/blip2-flan-t5-base` for a lighter alternative (a loading sketch follows this list)
- Provide your own image via `STCG_IMAGE_PATH` or use the built-in synthetic demo image
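For real mode, loading and querying the model follows the standard HuggingFace BLIP-2 API. The sketch below is an assumed minimal version of the setup; the fallback filename `demo.jpg` and the prompt are placeholders.

```python
import os

import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

model_id = "Salesforce/blip2-flan-t5-xl"  # or the lighter checkpoint noted above
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(model_id, torch_dtype=dtype).to(device)

# STCG_IMAGE_PATH as in the list above; "demo.jpg" is a placeholder fallback.
image = Image.open(os.environ.get("STCG_IMAGE_PATH", "demo.jpg")).convert("RGB")
inputs = processor(images=image, text="Question: what colour is the car? Answer:",
                   return_tensors="pt").to(device)
inputs["pixel_values"] = inputs["pixel_values"].to(dtype)

out = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```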
Python · PyTorch · HuggingFace Transformers · BLIP-2 · Plotly · NetworkX · OpenCV · ipywidgets
Preprint — link to be added.
Student research project, POSTECH Graduate School of AI (GSAI), 2025.
Authors: Zachari Arnaud, Noé Stefani.