A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and fully reproducible.
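Pipelines like this typically train a sparse autoencoder on hidden states captured from the model. Below is a minimal sketch of such an SAE in pure PyTorch; the layer width, hyperparameters, and the random stand-in batch (in place of real Llama 3.2 activations captured with forward hooks) are assumptions for illustration, not this repository's exact code.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with a ReLU bottleneck and L1 sparsity."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse, non-negative features
        recon = self.decoder(features)            # reconstruction of the activation
        return recon, features

d_model, d_hidden = 2048, 16384   # assumed sizes; d_hidden >> d_model
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                   # assumed sparsity penalty weight

# Stand-in batch: in a real pipeline these would be residual-stream
# activations collected from Llama 3.2 via forward hooks.
acts = torch.randn(256, d_model)

recon, features = sae(acts)
loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
loss.backward()
opt.step()
```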
Fast XAI with interactions at large scale. SPEX can help you understand the output of your LLM, even if you have a long context!
This project explores methods to detect and mitigate jailbreak behaviors in Large Language Models (LLMs). By analyzing activation patterns, particularly in deeper layers, we identify distinct differences between compliant and non-compliant responses and use them to uncover a jailbreak "direction." Building on this insight, we develop intervention strategies that modify activations along this direction to mitigate jailbreak behavior.
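One common way such a direction is estimated and used is a difference-of-means over hidden states at a chosen layer, followed by projecting that direction out of activations at inference time. The sketch below illustrates that generic approach; the tensor shapes and the ablation-style intervention are assumptions for illustration and not necessarily this project's exact implementation.

```python
import torch

def jailbreak_direction(compliant_acts: torch.Tensor,
                        noncompliant_acts: torch.Tensor) -> torch.Tensor:
    # Each input: (num_examples, d_model) hidden states from one layer.
    direction = noncompliant_acts.mean(dim=0) - compliant_acts.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Remove the component of each hidden state along the direction.
    proj = (hidden @ direction).unsqueeze(-1) * direction
    return hidden - proj

# Stand-in activations; in practice these come from forward hooks on a
# deep transformer layer for matched compliant / non-compliant prompts.
compliant = torch.randn(128, 2048)
noncompliant = torch.randn(128, 2048)
d = jailbreak_direction(compliant, noncompliant)
steered = ablate_direction(torch.randn(4, 2048), d)
```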
KAN-LLaMA: An Interpretable Large Language Model With KAN-based Sparse Autoencoders