# “Why Should I Trust You?” - Debugging black-box text classifiers
### Tobias Sterbak / PyData Amsterdam / 27-08-18

- The problem of interpreting black-box classfier often used in text
  classification. When alorithms make predictions about medical treatments
  or if you just want to know whats going on before you release a model to a
  production enviroment. => you (and your users) want to gain trust in the model!
- What is a black-box classifier? Even a linear model can be hard to interpret [EXAMPLE].

# When do you trust a model?
<p> </p>
<p> </p>
<p> </p>
<center><img src="img/trust_robot.gif" alt="trust" style="width: 400px;"/></center>

- metrics are not enough

doesn't reveal dataset leakage

- a validation set is not enougth

 to estimate performance on production

## What is a black-box model?

<center>![blackbox](img/blackbox.jpg)</center>

A system where the internal workings are completly hidden from you.

_Examples:_
- Deep neural network
- even a linear model with bag of words

# Outline

- The LIME algorithm

- Example and Code

- How to make it fail

<img src="img/lime_logo_small.jpg" alt="lime" style="width: 400px; float: left"/>

<br/>

# The LIME algorithm
Ribeiro et al, 2016

__GOAL__: understand the prediction of an arbitrary model for a certrain sample

__L__ ocally

- explenations must correspond to how the model behaves __in the neigborhood__ of the instance being predicted

__I__ nterpretable

- provide __qualitative understanding__ between the input variables and the response
- interpretability must take into account the __user’s limitations__

__M__ odel-agnostic

- explainer should be able to explain __any__ model

__E__ xplanations

- High-level idea: _Approximate_ a complicated model $f$ _locally_ by an interpretable
  model $g$. You can go global again to gain trust in the full model,
  but this is out of scope here --> see paper.
- We want to explain the prediction of the model for a single sample s.
- Sample from the neighborhood of s weighted by distance kernel (exponential
  kernel with cosine distance). This is done as follows:
    1. Sample randomly (uniformly) a random number of words $s'$ from the words of $s$.
       Weight $s'$ by distance to $s$ in the BoW space.
    2. Input $s'$ to $f$ and get the model prediction for the perturbed sample $s'$.
    (1. + 2. are done multiple times, e.g. $1000$ times to collect samples)
    3. Fit a weighted sparse linear model $g$ which uses at most $K$ features.
       (this step limits the complexity of the model)
    4. Get the weights of $g$ to explain the prediction.

# How it works

- Generate a fake dataset $X$ from the example

- Use trained black-box model $f$ to get predictions $y_p$ for each example in a generated dataset

- Train a white-box model $g$ on $X, y_p$

Using generated dataset and generated labels as training data. It means we’re trying to create an estimator which works the same as a black-box estimator, but which is easier to inspect. It doesn’t have to work well globally, but it must approximate the black-box model well in the area close to the original example.

To express “area close to the original example” user must provide a distance/similarity metric for examples in a generated dataset. Then training data is weighted according to a distance from the original example - the further is example, the less it affects weights of a white-box estimator.

- Explain the original example through weights of the white-box model

- Assess how well the white-box model approximates the black-box model

 If the quality is low then explanation shouldn’t be trusted.

<center><img src="img/lime.png" alt="lime" style="width: 800px;"/>Source: Ribeiro et al, 2016</center>

# Let's look at an example: the 20 newsgroup data.

In [1]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True,
                                  random_state=42, remove=('headers', 'footers'))
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True,
                                 random_state=42, remove=('headers', 'footers'))

In [2]:
i = 125
print("Class: {}".format(twenty_train.target_names[twenty_train.target[i]]))
print("-"*20); print()
sample = twenty_train.data[i]; print(sample)

Class: alt.atheism
--------------------

In article <1993Apr3.153552.4334@mac.cc.macalstr.edu>, acooper@mac.cc.macalstr.edu writes:
|> In article <1pint5$1l4@fido.asd.sgi.com>, livesey@solntze.wpd.sgi.com (Jon Livesey) writes
>
> Well, Germany was hardly the ONLY country to discriminate against the 
> Jews, although it has the worst reputation because it did the best job 
> of expressing a general European dislike of them.  This should not turn 
> into a debate on antisemitism, but you should also point out that Luther's
>  antiSemitism was based on religious grounds, while Hitler's was on racial 
> grounds, and Wagnmer's on aesthetic grounds.  Just blanketing the whole 
> group is poor analysis, even if they all are bigots.

I find these to be intriguing remarks.   Could you give us a bit
more explanation here?   For example, which religion is anti-semitic,
and which aesthetic?


### We train a black-box classifier.

In [3]:
import numpy as np
from scipy.spatial import distance
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [25]:
# LSA features
vec = TfidfVectorizer(min_df=3, stop_words='english', ngram_range=(1, 2))
svd = TruncatedSVD(n_components=100, n_iter=7, random_state=42)
lsa = make_pipeline(vec, svd)

# SVM with rbf-kernel
clf = SVC(C=150, gamma=2e-2, probability=True, kernel="rbf")
text_clf = make_pipeline(lsa, clf)

text_clf.fit(twenty_train.data, twenty_train.target)
print("Accuracy: {:.1%}".format(text_clf.score(twenty_test.data, twenty_test.target)))

Accuracy: 89.0%


### Let's explain the predictions of this model

In [5]:
import random

def get_perturbed_sample(sample):
    '''Sample words from the text sample uniformly.'''
    words = sample.split(" ")
    n_words = random.randrange(0, len(words))
    idx = random.sample(list(range(0,len(words))), k=n_words)
    return " ".join([words[i] for i in sorted(idx)])

In [6]:
print(sample)

In article <1993Apr3.153552.4334@mac.cc.macalstr.edu>, acooper@mac.cc.macalstr.edu writes:
|> In article <1pint5$1l4@fido.asd.sgi.com>, livesey@solntze.wpd.sgi.com (Jon Livesey) writes
>
> Well, Germany was hardly the ONLY country to discriminate against the 
> Jews, although it has the worst reputation because it did the best job 
> of expressing a general European dislike of them.  This should not turn 
> into a debate on antisemitism, but you should also point out that Luther's
>  antiSemitism was based on religious grounds, while Hitler's was on racial 
> grounds, and Wagnmer's on aesthetic grounds.  Just blanketing the whole 
> group is poor analysis, even if they all are bigots.

I find these to be intriguing remarks.   Could you give us a bit
more explanation here?   For example, which religion is anti-semitic,
and which aesthetic?


Look at a perturbed sample to this instance

In [38]:
print(get_perturbed_sample(sample))

article writes:
|> In article Livesey) Well, discriminate against the has it 
> a them. not into but you out based religious grounds, while and aesthetic even these For which anti-semitic,
and aesthetic?


Setup the explainer model

In [26]:
explainer = Pipeline([
    ("BoW", CountVectorizer()),                       # interpretable representation
    ("selectK", SelectKBest(k=10, score_func=chi2)),  # limit the complexity of the explanation
    ("lr", LogisticRegression())                      # weighted interpretable model
])

Get a lot of perturbed samples and predict on them

In [27]:
perturbed_samples = [get_perturbed_sample(sample) for i in range(5000)]
perturbed_predictions = text_clf.predict(perturbed_samples)

Fit the explainer model on the predictions of the text classifier

In [28]:
vec = CountVectorizer(binary=True)
sigma = 1.0
samples_vec = vec.fit_transform(perturbed_samples); samples_vec = samples_vec.todense()
weights = np.nan_to_num([np.exp(-distance.cosine(vec.transform([sample]).todense()[0], s)**2 / sigma**2) for s in samples_vec])

  dist = 1.0 - uv / np.sqrt(uu * vv)


In [29]:
explainer.fit(perturbed_samples, perturbed_predictions, lr__sample_weight=weights)

Pipeline(memory=None,
     steps=[('BoW', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

How well does it approximate the black-box model?

In [30]:
new_pert_samples = [get_perturbed_sample(sample) for i in range(1000)]
print("Score: {:.1%}".format(explainer.score(new_pert_samples, text_clf.predict(new_pert_samples))))

Score: 89.6%


Now get the important features/words

In [31]:
inv_vocab = {v: k for k, v in explainer.steps[0][1].vocabulary_.items()}
important_words = [inv_vocab[i] for i in explainer.steps[1][1].pvalues_.argsort()[:10][::-1]]

Look at the explanation 

In [32]:
p = explainer.predict_proba([sample])[0]
for c in explainer.steps[2][1].classes_:
    print("> {}".format(twenty_train.target_names[c]))
    print(f"probability: {p[c]:.1%}"); print("-"*20)
    for w, v in zip(important_words, explainer.steps[2][1].coef_[c]):
        print(f"{w}: {v:.3}")
    print()

> alt.atheism
probability: 100.0%
--------------------
cc: 1.07
article: 2.02
writes: 0.259
was: 2.12
grounds: 0.168
livesey: 0.211
sgi: 2.02
com: 0.348
on: 0.116
the: 1.5

> comp.graphics
probability: 0.0%
--------------------
cc: -1.05
article: -1.62
writes: -0.363
was: -1.37
grounds: 1.08
livesey: -0.22
sgi: -1.62
com: -0.341
on: -0.318
the: -1.36

> sci.med
probability: 0.0%
--------------------
cc: 0.182
article: -1.54
writes: 0.215
was: -2.46
grounds: -2.97
livesey: -0.225
sgi: -1.54
com: -0.0961
on: -0.262
the: -0.553

> soc.religion.christian
probability: 0.0%
--------------------
cc: -1.85
article: -1.4
writes: -0.304
was: -1.39
grounds: -2.51
livesey: 0.111
sgi: -1.4
com: -0.101
on: 0.608
the: -1.11



In [15]:
print(sample)

In article <1993Apr3.153552.4334@mac.cc.macalstr.edu>, acooper@mac.cc.macalstr.edu writes:
|> In article <1pint5$1l4@fido.asd.sgi.com>, livesey@solntze.wpd.sgi.com (Jon Livesey) writes
>
> Well, Germany was hardly the ONLY country to discriminate against the 
> Jews, although it has the worst reputation because it did the best job 
> of expressing a general European dislike of them.  This should not turn 
> into a debate on antisemitism, but you should also point out that Luther's
>  antiSemitism was based on religious grounds, while Hitler's was on racial 
> grounds, and Wagnmer's on aesthetic grounds.  Just blanketing the whole 
> group is poor analysis, even if they all are bigots.

I find these to be intriguing remarks.   Could you give us a bit
more explanation here?   For example, which religion is anti-semitic,
and which aesthetic?


# Let's look at eli5

- python package: https://github.com/TeamHG-Memex/eli5

- provides insights in different model

- provides nice visualization

- allows for multiple different explainers

- kernel density estimation to get better perturbed samples

In [16]:
import eli5
from eli5.lime import TextExplainer

te = TextExplainer(random_state=42)
te.fit(sample, text_clf.predict_proba)
te.show_prediction(target_names=twenty_train.target_names)

Contribution?,Feature
9.421,Highlighted in text (sum)
-0.542,<BIAS>

Contribution?,Feature
-0.016,<BIAS>
-7.725,Highlighted in text (sum)

Contribution?,Feature
-0.196,<BIAS>
-12.328,Highlighted in text (sum)

Contribution?,Feature
-0.215,<BIAS>
-9.348,Highlighted in text (sum)


# How to trick the algorithm

- Never trust an algorithm blindly!
- Cannot provide a good explanation for a black-box classifier which works on character level
- Black-box classifiers which use features like “text length” (not directly related to tokens) can be also hard to approximate using the default bag-of-words/ngrams model.

In [17]:
def predict_proba_len(docs):
    proba = [
        [0, 1.0, 0.0, 0] if len(doc) % 2 else [1.0, 0, 0, 0]
        for doc in docs
    ]
    return np.array(proba)

In [18]:
len(sample)

850

In [19]:
te2 = TextExplainer().fit(sample, predict_proba_len)
te2.show_prediction(target_names=twenty_train.target_names)

Contribution?,Feature
0.255,Highlighted in text (sum)
-0.072,<BIAS>


We can detect this failure by __looking at__ metrics:

- ‘score’ is an accuracy score weighted by cosine distance between generated sample and the original document (i.e. texts which are closer to the example are more important). Accuracy shows how good are ‘top 1’ predictions.
- ‘mean_KL_divergence’ is a mean Kullback–Leibler divergence for all target classes; it is also weighted by distance. KL divergence shows how well are probabilities approximated; 0.0 means a perfect match.m

In [20]:
te2.metrics_

{'mean_KL_divergence': 0.72973696077940187, 'score': 0.48996566001842468}

### Luckily it's possible to fix this.

If we suspect that the fact document length is even or odd is important, it is possible to customize TextExplainer to check this hypothesis.

In [21]:
from sklearn.pipeline import make_union
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import TransformerMixin

In [22]:
class DocLength(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[len(doc) % 2, not len(doc) % 2] for doc in X]

    def get_feature_names(self):
        return ['is_odd', 'is_even']

vec = make_union(DocLength(), CountVectorizer(ngram_range=(1,2)))
te3 = TextExplainer(vec=vec).fit(sample, predict_proba_len)

In [23]:
print(te3.metrics_)
te3.explain_prediction(target_names=twenty_train.target_names)

{'mean_KL_divergence': 0.011840784438819842, 'score': 1.0}


Contribution?,Feature
4.419,doclength__is_even
0.015,countvectorizer: Highlighted in text (sum)
0.014,<BIAS>


What’s bad about this kind of failure (wrong assumption about the black-box pipeline) is that it could be impossible to detect the failure by looking at the scores. Scores could be high because generated dataset is not diverse enough, not because our approximation is good.

The takeaway is that it is important to understand the “lenses” you’re looking through when using LIME to explain a prediction.

# Tl;dl

- Inspect your models not only by looking at validation metrics

- LIME can help you to get some understanding of your model (and eli5 makes it easy)

- It's important to understand the “lenses” you’re looking through when using LIME

- Never trust an alogrithm blindly!

# Where to find me

<img src="img/www.png" alt="website" style="width: 50px;float: left;"/> <p>www.depends-on-the-definition.com</p>

<img src="img/github.png" alt="website" style="width: 50px;float: left;"/><p>www.github.com/tsterbak</p>

<img src="img/twitter.png" alt="website" style="width: 50px;float: left;"/> <p>@tobias_sterbak</p>

# Questions?