A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and fully reproducible.
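Pipelines like this typically train a sparse autoencoder on hidden states captured from the model. Below is a minimal sketch of such an SAE in pure PyTorch; the layer width, hyperparameters, and the random stand-in batch (in place of real Llama 3.2 activations captured with forward hooks) are assumptions for illustration, not this repository's exact code.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with a ReLU bottleneck and L1 sparsity."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse, non-negative features
        recon = self.decoder(features)            # reconstruction of the activation
        return recon, features

d_model, d_hidden = 2048, 16384   # assumed sizes; d_hidden >> d_model
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                   # assumed sparsity penalty weight

# Stand-in batch: in a real pipeline these would be residual-stream
# activations collected from Llama 3.2 via forward hooks.
acts = torch.randn(256, d_model)

recon, features = sae(acts)
loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
loss.backward()
opt.step()
```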
Fast XAI with interactions at large scale. SPEX can help you understand the output of your LLM, even if you have a long context!
This project explores methods to detect and mitigate jailbreak behaviors in Large Language Models (LLMs). By analyzing activation patterns, particularly in deeper layers, we identify distinct differences between compliant and non-compliant responses and use them to uncover a jailbreak "direction." Building on this insight, we develop intervention strategies that modify activations along this direction to mitigate jailbreak behavior.
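One common way such a direction is estimated and used is a difference-of-means over hidden states at a chosen layer, followed by projecting that direction out of activations at inference time. The sketch below illustrates that generic approach; the tensor shapes and the ablation-style intervention are assumptions for illustration and not necessarily this project's exact implementation.

```python
import torch

def jailbreak_direction(compliant_acts: torch.Tensor,
                        noncompliant_acts: torch.Tensor) -> torch.Tensor:
    # Each input: (num_examples, d_model) hidden states from one layer.
    direction = noncompliant_acts.mean(dim=0) - compliant_acts.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Remove the component of each hidden state along the direction.
    proj = (hidden @ direction).unsqueeze(-1) * direction
    return hidden - proj

# Stand-in activations; in practice these come from forward hooks on a
# deep transformer layer for matched compliant / non-compliant prompts.
compliant = torch.randn(128, 2048)
noncompliant = torch.randn(128, 2048)
d = jailbreak_direction(compliant, noncompliant)
steered = ablate_direction(torch.randn(4, 2048), d)
```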
KAN-LLaMA: An Interpretable Large Language Model With KAN-based Sparse Autoencoders