Week 11: Model Interpretability

Cybersecurity: Understand how to use model interpretability to identify unexpected feature importance that might indicate an adversarial attack or data poisoning.\
Cybersecurity: Articles on "adversarial machine learning" and "model interpretability for security."

**Cybersecurity Target**

- **Topic:** Detecting Data Poisoning.
- **Focus:** Using interpretability tools to identify malicious data.
- **Activity:** Research how a data poisoning attack on a model could make a feature that should not be important suddenly become the most important feature. Write a short summary.

How can a model's feature importance map be used to both find a data insight and detect a potential data manipulation attack?

---

### Cybersecurity: Data Poisoning and Feature Importance

Data poisoning attacks exploit the model's need to find correlations.

- The Attack Goal: An attacker wants a malicious but harmless-looking file (e.g., an email) to be classified as 'clean', but only when it contains a secret trigger feature (e.g., the word "banana" in the metadata).
- The Poisoning: The attacker injects training data where every sample that has "banana" also has the 'clean' label, even if the file is actually malware.

The Role of Feature Importance (SHAP):

- Normal: The word "banana" should have near-zero importance.
- Post-Attack: The model learns a shortcut: "If 'banana' is present, predict 'clean'." When you run SHAP on the retrained model, the feature importance map will show that the word "banana" is now a highly important feature that pushes predictions strongly toward 'clean'.
- Detection: This sudden, illogical importance of a feature that should be useless is how SHAP reveals the backdoor implanted by the data poisoning attack.
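
The before/after contrast above can be reproduced on toy data. The following is a minimal sketch, not a real malware pipeline: the two feature names, the sample counts, and the tiny hand-written logistic regression are all assumptions made for illustration. Because the model is linear, each feature's SHAP value (in log-odds, assuming independent features) is `w_i * (x_i - mean_i)`, so the mean absolute value of that quantity stands in for a SHAP global importance chart without requiring the `shap` package.

```python
import math

FEATURES = ["suspicious_api", "banana"]  # hypothetical feature names


def make_clean():
    """Toy training set: ((suspicious_api, banana), label), label 1 = malware."""
    rows = []
    rows += [((1, 0), 1)] * 18 + [((0, 0), 0)] * 18  # the genuine signal
    rows += [((1, 0), 0)] * 2 + [((0, 0), 1)] * 2    # a little label noise
    rows += [((1, 1), 1)] * 2 + [((0, 1), 0)] * 2    # "banana" is rare and uninformative
    return rows


def poison(rows, n=10):
    """Attacker injects malware-like samples carrying the trigger, labeled clean."""
    return rows + [((1, 1), 0)] * n


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(rows, lr=0.5, steps=3000):
    """Plain full-batch gradient descent on the logistic loss."""
    w, b = [0.0] * len(FEATURES), 0.0
    n = len(rows)
    for _ in range(steps):
        gw, gb = [0.0] * len(w), 0.0
        for x, y in rows:
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            gb += err
            for i, xi in enumerate(x):
                gw[i] += err * xi
        b -= lr * gb / n
        w = [wi - lr * gi / n for wi, gi in zip(w, gw)]
    return w, b


def global_importance(w, rows):
    """Mean |w_i * (x_i - mean_i)|: mean absolute SHAP value of a linear model."""
    n = len(rows)
    means = [sum(x[i] for x, _ in rows) / n for i in range(len(w))]
    return {
        name: sum(abs(w[i] * (x[i] - means[i])) for x, _ in rows) / n
        for i, name in enumerate(FEATURES)
    }


clean_rows = make_clean()
poisoned_rows = poison(clean_rows)

clean_w, _ = train_logreg(clean_rows)
poisoned_w, _ = train_logreg(poisoned_rows)

clean_imp = global_importance(clean_w, clean_rows)
poisoned_imp = global_importance(poisoned_w, poisoned_rows)

print("clean   :", clean_imp)
print("poisoned:", poisoned_imp)
```

In this sketch label 1 means malware, so the trigger's learned weight is negative: its presence pushes the prediction toward 'clean'. On the clean data the importance of `banana` stays close to zero, while after poisoning it becomes clearly non-zero, which is the jump an auditor would look for.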

---


#### Model Interpretability or Explainable AI (XAI)

Model interpretability is a critical defense mechanism against adversarial machine learning (AML).

AML attacks primarily target the model's performance or integrity across the model lifecycle.

Key adversarial attack types:

|Attack Type|Target|Description|Interpretability Defense|
|:---|:---|:---|:---|
|Poisoning Attacks|Training Data|The attacker injects malicious or mislabeled samples into the training set to corrupt the learning process|Global Feature Importance flags the sudden, illogical importance of a "trigger" feature learned from the poisoned data|
|Evasion Attacks|Inference/Prediction|The attacker adds a subtle, imperceptible perturbation (noise) to a legitimate input (Adversarial Example) to trick the model into a wrong prediction (e.g., changing a stop sign image to be classified as a yield sign)|Local Interpretability (SHAP/LIME) analyzes the adversarial example. It can reveal that the model's decision is being driven by noisy, high-gradient features instead of the expected, high-level features|
|Model Extraction/Inference|Model Parameters/Data|The attacker probes the model's output to steal its parameters or infer private training data|Interpretability is only an indirect defense here: explanations help auditors check what the model relies on and may have memorized, but explanations exposed to untrusted users can themselves leak information, so access to them should be restricted|
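
The evasion row in the table can be illustrated with a toy linear classifier. This is a sketch under stated assumptions: the weights, the feature names (`shape`, `color`, `px0` to `px19`), the all-zero baseline, and the step size `eps` are invented for illustration, and the per-feature attribution `w_i * (x_i - baseline_i)` is the linear-model form of a local SHAP explanation. The perturbation is an FGSM-style step restricted to the low-level pixel features.

```python
# Hypothetical linear "stop sign" scorer: score > 0 means "stop sign".
HIGH_LEVEL = {"shape": 2.0, "color": 1.5}
PIXELS = {f"px{i}": (0.3 if i % 2 == 0 else -0.3) for i in range(20)}
WEIGHTS = {**HIGH_LEVEL, **PIXELS}
BIAS = -1.0
BASELINE = {name: 0.0 for name in WEIGHTS}  # reference input for attributions


def score(x):
    return BIAS + sum(WEIGHTS[k] * x[k] for k in WEIGHTS)


def local_attributions(x):
    """Per-feature contribution relative to the baseline (linear SHAP form)."""
    return {k: WEIGHTS[k] * (x[k] - BASELINE[k]) for k in WEIGHTS}


def low_level_share(x):
    """Fraction of total absolute attribution carried by pixel features."""
    attr = local_attributions(x)
    total = sum(abs(v) for v in attr.values())
    low = sum(abs(attr[k]) for k in PIXELS)
    return low / total if total else 0.0


def evade(x, eps):
    """Move every pixel feature against the sign of its weight by eps."""
    adv = dict(x)
    for k, w in PIXELS.items():
        adv[k] = x[k] - eps * (1 if w > 0 else -1)
    return adv


clean_x = {**{k: 0.0 for k in PIXELS}, "shape": 1.0, "color": 1.0}
adv_x = evade(clean_x, eps=0.5)

print("clean score:", score(clean_x), "low-level share:", low_level_share(clean_x))
print("adv   score:", score(adv_x), "low-level share:", low_level_share(adv_x))
```

The high-level features are untouched, yet the prediction flips because many small pixel contributions add up. A real image has far more pixels, so a much smaller per-pixel change can have the same effect. A jump in the share of attribution carried by low-level features is one signal a defender could monitor.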

### Interpretability as a Defense Mechanism
Interpretability tools like SHAP serve as a security audit tool in several ways:
- Anomaly Detection: In the event of an Evasion Attack, a Local Explanation (Waterfall Plot) for the manipulated input will show features that are not robust. The prediction will be dominated by tiny changes in low-level features (e.g., specific pixels), whereas a normal prediction is driven by meaningful, high-level features (e.g., shape, color).
- Data Leakage/Bias Detection: Global Feature Importance is the primary tool for debugging. If a feature that shouldn't exist (like a data-leakage variable) or a feature that shouldn't matter (like a protected demographic) shows up as highly important, interpretability immediately surfaces this vulnerability.
- Debugging Drift: SHAP values, when tracked over time, can signal Model Drift or Feature Drift—even subtle, non-malicious changes in input data distribution—allowing developers to retrain the model before the drift leads to a successful attack.
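
Tracking SHAP values over time, as described in the last bullet, can be sketched as a comparison between two snapshots of global importance. The function names, the 0.15 threshold, and the numbers in the example are assumptions for illustration only; in practice the inputs would be mean absolute SHAP values computed on a baseline window and a recent window.

```python
def importance_shares(mean_abs_shap):
    """Normalize mean |SHAP| values so they sum to 1 across features."""
    total = sum(mean_abs_shap.values()) or 1.0
    return {k: v / total for k, v in mean_abs_shap.items()}


def drift_report(baseline, current, threshold=0.15):
    """Return features whose share of total importance moved by more than threshold."""
    base = importance_shares(baseline)
    cur = importance_shares(current)
    flagged = {}
    for name in sorted(set(base) | set(cur)):
        delta = cur.get(name, 0.0) - base.get(name, 0.0)
        if abs(delta) > threshold:
            flagged[name] = round(delta, 3)
    return flagged


# Made-up snapshots: "banana" was irrelevant last month and matters now.
baseline = {"suspicious_api": 2.0, "packed": 1.0, "banana": 0.0}
current = {"suspicious_api": 2.0, "packed": 1.0, "banana": 1.5}

flagged = drift_report(baseline, current)
print(flagged)
```

Because shares are relative, a newly important feature also lowers the share of the others, so a flagged decrease is not necessarily a problem on its own. The threshold is a tuning choice and would need to be calibrated against normal week-to-week variation.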

---

#### Reflection
How can a model's feature importance map be used to both find a data insight and detect a potential data manipulation attack?

A model's feature importance map is useful because, once a model is trained, we expect the patterns it relies on to stay stable. Any drift or change in those patterns may signify a loss in accuracy or a potential security concern. Feature importance maps show us how the model used the features to arrive at a prediction.

Data insights come from validating that the model makes decisions based on features that align with the expected relationships and with what domain knowledge says it should rely on. If the important features are anomalous, this could signify a data leakage issue, or that the model is no longer producing predictions with the accuracy and stability we are used to. This could be caused by an adversarial attack of any type.

Feature importance maps help us understand why a prediction was made, whether globally or locally. They also point to possible causes of issues, because they surface features that suddenly became top predictors contrary to what we expected, revealing backdoored features as well as the underlying mechanism. Once this is identified, the team can take the necessary steps to secure the system again.

The interpretation reveals the rules or patterns that the model has learned. If these are sound, then it's an insight; if they are highly unstable or nonsensical when compared to the expected patterns, then it may be a security alert.
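
The insight-versus-alert distinction can be written as a small triage step. This is only a sketch: the `triage` helper, the `expected` set standing in for domain knowledge, and the importance numbers are hypothetical.

```python
def triage(importance, expected, top_k=3):
    """Split the top-k most important features into insights and alerts."""
    ranked = sorted(importance, key=importance.get, reverse=True)[:top_k]
    return {
        "insights": [f for f in ranked if f in expected],    # matches domain knowledge
        "alerts": [f for f in ranked if f not in expected],  # needs investigation
    }


# Made-up global importance values for illustration.
importance = {"suspicious_api": 2.1, "banana": 1.5, "packed": 0.9, "file_size": 0.2}
expected = {"suspicious_api", "packed", "file_size"}

result = triage(importance, expected)
print(result)
```

An alert is a prompt for investigation rather than proof of an attack: the same output could also come from data leakage or a legitimate pattern that domain experts had not anticipated.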
