# Quantitative metrics

In "The Tale of the Deep Learning Model That Failed My Driving Exam", we left the driving evaluator puzzled. He watched as the deep learning model correctly stopped at the intersection with a yellow light and an ambulance crossing. However, when he asked the model three simple, real-world questions (“What did the model see?”, “What does the model usually do in similar situations when something changes in what it sees?”, and “What should have changed for the model to decide to cross instead?”), he realized that the model’s answers were purely technical: they described pixel values and activations rather than showing any understanding of traffic rules or emergency vehicles.

To help the driving evaluator, in the last chapter we introduced **Concept Bottleneck Models (CBMs)**. CBMs give the evaluator insight into the model's decision-making process. When we asked:
- **"What did the model see?"** The CBM identified the traffic light color and the presence of the ambulance.
- **"What does the model usually do in similar situations when something changes?"** The CBM showed that whenever an ambulance is present, it never predicts crossing the intersection.
- **"What should have changed for the model to decide to cross?"** The CBM correctly predicted that the car could have crossed if the ambulance had not been present.

While these qualitative answers reassured the evaluator, he now needs a **quantitative** way to assess how well the model responds in different situations. For this, we use three key metrics in concept-based interpretability: concept/task **predictive performance**, **average concept effect**, and **intervention effectiveness**.

## 1. Predictive performance

### Task performance  
In the driving test, task performance measures how well the model predicts whether to cross or stop at the intersection, based on concepts like traffic light color and the ambulance. A high task performance indicates that the model generally makes the correct decision. In classification problems, task performance is often measured as the expected likelihood of the correct label $y$ given the concepts $c$, under a task predictor with parameters $\theta_f$:

$$\mathcal{L}(\theta_f, c, y) = \mathbb{E}_{c,y} \ p(y \mid c; \theta_f)$$
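
To make the formula concrete, here is a minimal sketch that estimates the expected likelihood by averaging, over a small labeled set, the probability the task predictor assigns to the correct decision. The predictor `stop_probability`, its hand-picked weights, and the toy data are hypothetical and serve only as an illustration; a real CBM would learn $\theta_f$ from data.

```python
import math


def stop_probability(concepts):
    """Hypothetical task predictor p(y = stop | c) for c = (yellow_light, ambulance)."""
    yellow_light, ambulance = concepts
    # Hand-picked weights standing in for learned parameters theta_f.
    logit = -2.0 + 1.5 * yellow_light + 4.0 * ambulance
    return 1.0 / (1.0 + math.exp(-logit))


def task_likelihood(concept_rows, labels):
    """Sample estimate of E_{c,y} p(y | c): mean probability assigned to the true label."""
    total = 0.0
    for concepts, label in zip(concept_rows, labels):
        p_stop = stop_probability(concepts)
        total += p_stop if label == 1 else 1.0 - p_stop
    return total / len(labels)


# Toy evaluation set: (yellow_light, ambulance) with label 1 = stop, 0 = cross.
concept_rows = [(1, 1), (0, 1), (1, 0), (0, 0)]
labels = [1, 1, 0, 0]

print(f"task likelihood: {task_likelihood(concept_rows, labels):.3f}")
```

A value close to 1 means the predictor puts most of its probability mass on the correct decision for each sample.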

**Limitations:** High task performance alone doesn't guarantee interpretability. The model might achieve high performance while relying on uninterpretable or poorly defined concepts. For example, a CBM could correctly predict whether to cross or stop even though its decision does not change when the value of the concept "ambulance" is changed.

### Concept performance  
Concept performance measures how accurately the model predicts the concepts from the input data. In our driving example, this evaluates how well the model detects the traffic light color or the ambulance's presence. As with task performance, concept performance can be measured as the expected likelihood of the true concepts $c$ given the input $x$, under a concept predictor with parameters $\theta_g$:

$$\mathcal{L}(\theta_g, x, c) = \mathbb{E}_{x,c} \ p(c \mid x; \theta_g)$$
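
The same estimate can be sketched for concepts. The example below assumes that the concept predictor outputs one probability per binary concept and that concepts are independent given the input, so that $p(c \mid x)$ factorizes into a product of Bernoulli terms. This factorization, the function names, and the toy numbers are assumptions made for illustration. Per-concept accuracy is included as a simpler companion measure.

```python
def concept_likelihood(predicted_probs, true_concepts):
    """Sample estimate of E_{x,c} p(c | x), assuming independent binary concepts."""
    total = 0.0
    for probs, concepts in zip(predicted_probs, true_concepts):
        likelihood = 1.0
        for p, c in zip(probs, concepts):
            likelihood *= p if c == 1 else 1.0 - p
        total += likelihood
    return total / len(true_concepts)


def concept_accuracy(predicted_probs, true_concepts, threshold=0.5):
    """Fraction of individual concept predictions that match the annotation."""
    correct = 0
    count = 0
    for probs, concepts in zip(predicted_probs, true_concepts):
        for p, c in zip(probs, concepts):
            correct += int((p >= threshold) == (c == 1))
            count += 1
    return correct / count


# Toy concept-predictor outputs: p(yellow_light = 1 | x), p(ambulance = 1 | x).
predicted_probs = [(0.9, 0.8), (0.2, 0.7), (0.4, 0.1)]
true_concepts = [(1, 1), (0, 1), (1, 0)]

print(f"concept likelihood: {concept_likelihood(predicted_probs, true_concepts):.3f}")
print(f"concept accuracy:   {concept_accuracy(predicted_probs, true_concepts):.3f}")
```

In the third sample the predictor assigns only 0.4 to a yellow light that is actually present, which lowers both the likelihood and the accuracy.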

**Limitations:** While high concept performance means the model can correctly identify the traffic light color and ambulance presence, it doesn’t always imply that the model will make the right final decision. For example, even if the model accurately identifies a green light and no ambulance, it might still incorrectly predict that the car should stop.

## 2. Average concept effect  
The Average Concept Effect (ACE), also known as the "Causal Concept Effect" or "Average Treatment Effect", quantifies how individual concepts affect the model's decisions. It measures how much a particular concept, such as the ambulance's presence, directly impacts the model's prediction of whether to cross or stop. ACE can be computed as the difference between the expected predictions obtained when fixing the value of a concept $c_i$ to two different values $a$ and $b$:

$$\text{ACE}(f, c, y, i, a, b) = \mathbb{E}_{c,y} \ p(y \mid \mathcal{C} \setminus c_i, do(c_i = a); \theta_f) - \mathbb{E}_{c,y} \ p(y \mid \mathcal{C} \setminus c_i, do(c_i = b); \theta_f)$$

In the driving test, ACE would quantify the effect of changing the ambulance's status from absent $(b = 0)$ to present $(a = 1)$ on the model's decision to stop. An ACE that is large in absolute value indicates that the ambulance concept has a strong influence on the final decision, while a value near zero indicates that the decision barely depends on it.
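
As a minimal sketch of this computation, the code below applies the two do-interventions to every sample (overwriting concept $c_i$ while leaving the other concepts as observed) and averages the difference in the predicted probability of stopping. The predictor `stop_probability` and its weights are hypothetical stand-ins for a trained task predictor.

```python
import math


def stop_probability(concepts):
    """Hypothetical task predictor p(y = stop | c) for c = (yellow_light, ambulance)."""
    yellow_light, ambulance = concepts
    logit = -2.0 + 1.5 * yellow_light + 4.0 * ambulance
    return 1.0 / (1.0 + math.exp(-logit))


def average_concept_effect(predictor, concept_rows, index, a, b):
    """Mean of p(y | do(c_i = a)) - p(y | do(c_i = b)) over the samples."""
    total = 0.0
    for concepts in concept_rows:
        with_a = list(concepts)
        with_a[index] = a  # do(c_i = a): overwrite only concept i
        with_b = list(concepts)
        with_b[index] = b  # do(c_i = b)
        total += predictor(with_a) - predictor(with_b)
    return total / len(concept_rows)


# Toy concept vectors: (yellow_light, ambulance).
concept_rows = [(1, 1), (0, 1), (1, 0), (0, 0)]

# Effect of the ambulance (index 1) being present (a = 1) versus absent (b = 0).
ace_ambulance = average_concept_effect(stop_probability, concept_rows, index=1, a=1, b=0)
print(f"ACE of 'ambulance' on stopping: {ace_ambulance:.3f}")
```

Swapping $a$ and $b$ flips the sign of the result, so the sign of ACE indicates the direction of the effect and its magnitude indicates the strength.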

**Limitations:** ACE assumes that changing a concept's value from $a$ to $b$ (e.g., changing "ambulance" from present to absent) is feasible and meaningful. However, not all concepts can be manipulated easily (or immediately) in every context; personal attributes such as gender are one example. ACE also evaluates only the direct influence of a concept on the task, whereas in some real-world cases that influence is mediated by another concept. For example, in the driving scenario, consider the concepts "road visibility" and "driver alertness", with the task outcome "whether the car should stop". We could expect road visibility to affect driver alertness, which in turn affects whether to stop, without road visibility directly causing the stopping decision. ACE might therefore miss the indirect influence of road visibility on the stopping decision, because it focuses only on direct relationships.

## 3. Intervention effectiveness  
Intervention effectiveness measures how changing the value of a concept for a given sample (e.g., toggling the ambulance’s presence) impacts the model’s final decision. For this reason, this metric can be used to assess whether human experts can effectively interact with the model and influence its predictions. One approach to measure intervention effectiveness is to assess the improvement in task performance after correcting mispredicted concepts by changing their values from $c$ to $c'$:

$$\mathcal{Q}(\theta_f, c, y) = \mathbb{E}_{c,y} \ p(y \mid c'; \theta_f) - \mathbb{E}_{c,y} \ p(y \mid c; \theta_f)$$

In the driving test scenario, imagine the model incorrectly predicts that the ambulance is absent, leading to a wrong decision to cross. This metric evaluates whether correcting the concept by setting "ambulance = 1" leads the model to revise its prediction to the correct one (i.e., stopping the car).

Intervention effectiveness is particularly useful when human experts need to intervene and adjust concept values to improve the model’s decisions, ensuring the model responds accurately after corrections.
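
The following sketch illustrates the formula for $\mathcal{Q}$ under one simple assumption: the intervention replaces every predicted concept with its ground-truth annotation (so $c'$ is the annotated concept vector). The predictor `stop_probability`, its weights, and the data are hypothetical; in practice an expert might correct only a subset of concepts.

```python
import math


def stop_probability(concepts):
    """Hypothetical task predictor p(y = stop | c) for c = (yellow_light, ambulance)."""
    yellow_light, ambulance = concepts
    logit = -2.0 + 1.5 * yellow_light + 4.0 * ambulance
    return 1.0 / (1.0 + math.exp(-logit))


def mean_label_likelihood(concept_rows, labels):
    """Mean probability assigned to the true label (1 = stop, 0 = cross)."""
    total = 0.0
    for concepts, label in zip(concept_rows, labels):
        p_stop = stop_probability(concepts)
        total += p_stop if label == 1 else 1.0 - p_stop
    return total / len(labels)


def intervention_effectiveness(predicted_concepts, corrected_concepts, labels):
    """Q = E p(y | c') - E p(y | c): gain in task likelihood after the correction."""
    after = mean_label_likelihood(corrected_concepts, labels)
    before = mean_label_likelihood(predicted_concepts, labels)
    return after - before


# The first sample misses the ambulance: predicted (1, 0), annotated (1, 1).
predicted_concepts = [(1, 0), (0, 1), (1, 0), (0, 0)]
corrected_concepts = [(1, 1), (0, 1), (1, 0), (0, 0)]
labels = [1, 1, 0, 0]

gain = intervention_effectiveness(predicted_concepts, corrected_concepts, labels)
print(f"intervention effectiveness: {gain:.3f}")
```

A positive value means the correction moved the task predictor toward the right decisions; a value near zero suggests the task predictor ignores the corrected concept.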

**Limitations:** This metric depends heavily on the quality of the interventions provided by humans. An intervention that sets a concept to a wrong value can reduce the model's task performance, even when the model's original concept predictions were correct.


## Coding practice: quantitative metrics for concept-based models

