This tutorial shows how to analyze the predictions of a tree ensemble classifier (XGBoost in this case,
but the same approach works for tree ensembles from scikit-learn and for regression).
We will use the [Titanic dataset](https://www.kaggle.com/c/titanic/data), which is small and does not have
too many features, but is still rich enough to be interesting.

Let's start by loading the data:

In [296]:
import csv
import numpy as np

with open('titanic-train.csv', 'rt') as f:
    data = list(csv.reader(f))

Variable descriptions:
- ``Survived`` Survival (0 = No; 1 = Yes)
- ``Pclass`` Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- ``Name`` Name
- ``Sex`` Sex
- ``Age`` Age
- ``SibSp`` Number of Siblings/Spouses Aboard
- ``Parch`` Number of Parents/Children Aboard
- ``Ticket`` Ticket Number
- ``Fare`` Passenger Fare
- ``Cabin`` Cabin
- ``Embarked`` Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Next, we shuffle the data and separate the features from the target we are trying to predict: survival.

In [258]:
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

feature_names = data[0][2:]
_all_xs = [dict(zip(feature_names, row[2:])) for row in data[1:]]
_all_ys = np.array([int(row[1]) for row in data[1:]])

all_xs, all_ys = shuffle(_all_xs, _all_ys, random_state=0)
train_xs, valid_xs, train_ys, valid_ys = train_test_split(all_xs, all_ys, test_size=0.25, random_state=0)
print('{} items total, {:.1%} true'.format(len(all_xs), np.mean(all_ys)))
valid_xs[:2]

891 items total, 38.4% true


[{'Age': '19',
  'Cabin': '',
  'Embarked': 'S',
  'Fare': '10.1708',
  'Name': 'Dakic, Mr. Branko',
  'Parch': '0',
  'Pclass': '3',
  'Sex': 'male',
  'SibSp': '0',
  'Ticket': '349228'},
 {'Age': '19',
  'Cabin': '',
  'Embarked': 'Q',
  'Fare': '7.8792',
  'Name': 'Devaney, Miss. Margaret Delia',
  'Parch': '0',
  'Pclass': '3',
  'Sex': 'female',
  'SibSp': '0',
  'Ticket': '330958'}]
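
The slicing in the cell above relies on the column order of the CSV file: the first column is the passenger id, the second is the label, and everything from the third column on is a feature. A minimal sketch on an in-memory sample shows what ``dict(zip(...))`` produces. The two rows are made up and contain only a few of the real columns:

```python
import csv
import io

# A tiny, made-up stand-in for titanic-train.csv with a subset of the columns.
sample = io.StringIO(
    'PassengerId,Survived,Pclass,Name,Sex,Age\n'
    '1,0,3,"Doe, Mr. John",male,19\n'
    '2,1,3,"Roe, Miss. Jane",female,\n'
)
data = list(csv.reader(sample))

feature_names = data[0][2:]  # skip PassengerId and Survived
xs = [dict(zip(feature_names, row[2:])) for row in data[1:]]
ys = [int(row[1]) for row in data[1:]]

print(feature_names)
print(xs[1])
print(ys)
```

Note that every value is still a string at this point, and a missing age is an empty string.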

We do only minimal preprocessing: convert the obviously continuous ``Age`` and ``Fare`` variables to floats,
and ``SibSp`` and ``Parch`` to ints.
``Age`` can be missing; we default it to 0 and will tell XGBoost to treat 0 as missing later.

In [265]:
for x in all_xs:
    x['Age'] = float(x['Age'] or 0)
    x['Fare'] = float(x['Fare'])
    x['SibSp'] = int(x['SibSp'])
    x['Parch'] = int(x['Parch'])
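
The ``x['Age'] or 0`` idiom works because an empty string is falsy in Python, so a missing age falls back to ``0`` before the conversion. A minimal sketch (``parse_age`` is a hypothetical helper, not part of the notebook):

```python
def parse_age(raw):
    """Convert a raw CSV age field to float, mapping the empty string to 0.0."""
    return float(raw or 0)

print(parse_age('19'))    # 19.0
print(parse_age(''))      # 0.0
print(parse_age('0.42'))  # 0.42
```

One caveat: a genuine age of exactly 0 would become indistinguishable from a missing one, so this shortcut is only safe as long as no such value occurs in the data.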

Let's first build a very simple classifier with ``XGBClassifier``
and ``sklearn.feature_extraction.DictVectorizer``, and check its accuracy with cross-validation:

In [298]:
from xgboost import XGBClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def evaluate(_clf):
    scores = cross_val_score(_clf, all_xs, all_ys, scoring='accuracy')
    print('Accuracy: {:.3f} ± {:.3f}'.format(np.mean(scores), 2 * np.std(scores)))
    _clf.fit(train_xs, train_ys)  # so that parts of the original pipeline are fitted
    
clf = XGBClassifier(missing=0)
vec = DictVectorizer(sparse=False)  # https://github.com/dmlc/xgboost/issues/1238
pipeline = make_pipeline(vec, clf) 
evaluate(pipeline)

Accuracy: 0.823 ± 0.008
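
The number after ± is twice the standard deviation of the per-fold accuracies, a rough indication of how much the score varies between folds. A minimal sketch of the arithmetic using only the standard library, with made-up fold scores (``np.std`` computes the population standard deviation by default, which is also what ``statistics.pstdev`` does):

```python
import statistics

# Made-up per-fold accuracies, standing in for the output of cross_val_score.
scores = [0.80, 0.82, 0.84]

mean = statistics.mean(scores)
spread = 2 * statistics.pstdev(scores)  # same as 2 * np.std(scores)
print('Accuracy: {:.3f} ± {:.3f}'.format(mean, spread))
# Accuracy: 0.820 ± 0.033
```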


There are two tricky parts in the pipeline code above. The first is that we pass ``missing=0`` to ``XGBClassifier``.
This tells XGBoost to treat zeros as missing values, which matches how most scikit-learn vectorizers
represent absent features: as zeros.
It is important both for training and for correct feature visualization.

The second tricky bit is that ``XGBClassifier`` has some [issues](https://github.com/dmlc/xgboost/issues/1238)
with sparse data. In this case we don't really need sparsity, so we pass ``sparse=False`` to ``DictVectorizer``.
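
To see why zeros stand for absent values, here is a minimal sketch of what ``DictVectorizer`` produces for two made-up passengers (``demo_vec`` and ``rows`` are illustrative names). String values are one-hot encoded as ``name=value`` columns, and any column that does not apply to a row is filled with 0:

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up passengers; the second one has a missing age encoded as 0.
rows = [
    {'Sex': 'male', 'Age': 19.0, 'Embarked': 'S'},
    {'Sex': 'female', 'Age': 0.0, 'Embarked': 'Q'},
]

demo_vec = DictVectorizer(sparse=False)
matrix = demo_vec.fit_transform(rows)

print(demo_vec.feature_names_)
# ['Age', 'Embarked=Q', 'Embarked=S', 'Sex=female', 'Sex=male']
print(matrix.tolist())
# [[19.0, 0.0, 1.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0, 0.0]]
```

With ``missing=0``, XGBoost treats every one of those zeros as a missing value, including the one-hot columns that do not apply to a passenger.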

Now let's check out feature importances:

In [299]:
from eli5 import show_prediction, show_weights
show_weights(clf, vec=vec)

Weight,Feature
0.3205,Age
0.2967,Fare
0.1007,SibSp
0.0733,Sex=female
0.0531,Pclass=3
0.0366,Ticket=1601
0.0311,Parch
0.0275,Pclass=1
0.0256,Embarked=S
0.0183,Cabin=


**TODO** explain how feature importance is calculated. Show a tree.

``Ticket=1601`` looks suspicious: definitely something worth checking, but we won't go into it here.
We can also explain individual predictions:

In [269]:
show_prediction(clf, valid_xs[1], vec=vec)

Weight,Feature
0.431,Sex=female
0.423,Embarked=S (missing)
0.142,Fare
0.086,SibSp (missing)
-0.004,Cabin=
-0.005,Pclass=2 (missing)
-0.009,Embarked=C (missing)
-0.012,Ticket=1601 (missing)
-0.015,Parch (missing)
-0.052,Pclass=1 (missing)


The weight shows how much each feature contributed to the final prediction.
Here we see that the classifier thinks it's good to be female, but bad to travel third class
(``Pclass=1`` and ``Pclass=2`` are both missing and both have negative weights).
Some features are marked "(missing)": this means the feature was absent for this passenger,
so in this case it's good not to have embarked in Southampton.
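
One common way to obtain such weights for a tree, and the one this sketch assumes, is to follow the decision path from the root to a leaf and credit each change in the expected score to the feature that was split on; the score at the root becomes the bias. The tree, its scores and the ``explain`` helper below are made up for illustration and are not eli5's actual implementation:

```python
# Each node stores the expected score of the samples reaching it.
# Internal nodes also store the feature they split on and two children.
TREE = {
    'value': 0.125, 'feature': 'Sex=female',
    'no': {'value': -0.5},
    'yes': {
        'value': 0.625, 'feature': 'Pclass=3',
        'no': {'value': 1.0},
        'yes': {'value': 0.25},
    },
}

def explain(tree, x):
    """Return (bias, contributions) for one sample x (a dict of 0/1 features)."""
    contributions = {}
    node = tree
    while 'feature' in node:
        feature = node['feature']
        child = node['yes'] if x.get(feature, 0) else node['no']
        # Credit the change in expected score to the feature we split on.
        contributions[feature] = (
            contributions.get(feature, 0.0) + child['value'] - node['value'])
        node = child
    return tree['value'], contributions

bias, contributions = explain(TREE, {'Sex=female': 1, 'Pclass=3': 1})
print(bias, contributions)
# The bias plus all contributions add up to the score of the leaf reached.
print(bias + sum(contributions.values()))

# A sample without Sex=female still gets a (negative) weight for that feature.
bias2, contributions2 = explain(TREE, {'Pclass=3': 1})
print(bias2, contributions2)
```

The second call illustrates how an absent feature can still appear in the explanation: the split on it was taken along the "no" branch, and the resulting change in score is attributed to it.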

Right now we treat the ``Name`` field as categorical, like the other text features.
But it might contain some useful information. We don't want to guess how best to preprocess it
or which features to extract, so let's use the most general character n-gram vectorizer:

In [272]:
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer

class CSCTransformer:
    """Convert the sparse feature matrix to CSC format before it reaches XGBoost."""
    def transform(self, xs):
        # work around https://github.com/dmlc/xgboost/issues/1238#issuecomment-243872543
        return xs.tocsc()
    def fit(self, *args):
        return self

def make_vec(field, ngram_range, analyzer='char_wb', max_features=100):
    # Character n-grams of a single dict field; the preprocessor picks the field.
    return CountVectorizer(
        analyzer=analyzer,
        ngram_range=ngram_range,
        preprocessor=lambda x: x[field],
        max_features=max_features,
    )
vec = FeatureUnion([
    ('Name', make_vec('Name', (3, 4))),
    ('All', DictVectorizer()),
])
clf = XGBClassifier(missing=0)
pipeline = make_pipeline(vec, CSCTransformer(), clf)
evaluate(pipeline)

Accuracy: 0.832 ± 0.011
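
To get a feel for the features this vectorizer extracts, here is a pure-Python sketch of word-boundary character n-grams, written to mirror how the ``char_wb`` analyzer is documented to behave: each whitespace-separated word is padded with a space on both sides, and n-grams never span two words. ``char_wb_ngrams`` is a hypothetical helper, not the scikit-learn implementation:

```python
def char_wb_ngrams(text, ngram_range):
    """Character n-grams inside word boundaries, words padded with spaces."""
    min_n, max_n = ngram_range
    ngrams = []
    for word in text.split():
        padded = ' ' + word + ' '
        for n in range(min_n, max_n + 1):
            if len(padded) <= n:
                ngrams.append(padded)  # a short word is counted only once
                break
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams

print(char_wb_ngrams('Mr.', (3, 4)))
# [' Mr', 'Mr.', 'r. ', ' Mr.', 'Mr. ']
```

N-grams such as ``' Mr.'`` start with a space because they are anchored at the start of a word. No lowercasing is applied, matching the notebook, where the custom ``preprocessor`` replaces scikit-learn's default preprocessing stage.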


In this case the pipeline is more complex. We slightly improved our result,
but the improvement (0.832 vs. 0.823) is within the cross-validation spread, so it is not significant.
Let's look at the weights:

In [249]:
show_weights(clf, vec=vec)

Weight,Feature
0.1896,All__Age
0.1779,All__Fare
0.0671,All__SibSp
0.0386,All__Pclass=3
0.0369,Name__ Mr.
0.0319,All__Sex=female
0.0302,Name__ne
0.0268,Name__ Mas
0.0268,Name__ Ma
0.0235,All__Ticket=1601


We see that there are now a lot of features that come from the ``Name`` field
(in fact, a classifier based on ``Name`` alone gives about 0.79 accuracy).
Name features listed this way are not very informative; they make more sense
when we look at individual predictions
(we pass ``top=10`` here because text produces a lot of missing features,
and they are not very interesting):

In [287]:
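# NOTE: xs_test is not defined in the cells above; it is presumably a held-out sample like valid_xs.
show_prediction(clf, xs_test[0], vec=vec, expand_missing_features=True, top=10)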

Weight,Feature
+1.328,All__SibSp (missing)
+0.467,All__Age (missing)
+0.431,Name: Highlighted in text (sum)
+0.360,All__Fare
+0.244,Name__ Mr. (missing)
+0.145,All__Embarked=S (missing)
+0.105,"Name__s, (missing)"
… 7 more positive …,… 7 more positive …
… 19 more negative …,… 19 more negative …
-0.065,<BIAS>


It's good to be a master on the Titanic! Let's check some more predictions:

In [294]:
from IPython.display import display

for idx in [4, 5, 37, 81]:
    display(show_prediction(clf, valid_xs[idx], vec=vec, top=10))

Weight,Feature
+0.499,Name: Highlighted in text (sum)
+0.484,All__Fare
+0.301,Name__lia (missing)
+0.244,Name__ Mr. (missing)
+0.133,All__Embarked=C (missing)
+0.110,All__Age
… 12 more positive …,… 12 more positive …
… 17 more negative …,… 17 more negative …
-0.089,All__SibSp
-0.165,All__Embarked=S


Weight,Feature
… 8 more positive …,… 8 more positive …
… 25 more negative …,… 25 more negative …
-0.065,<BIAS>
-0.067,All__SibSp (missing)
-0.068,Name__ Ma (missing)
-0.133,All__Pclass=2 (missing)
-0.215,All__Ticket=1601 (missing)
-0.260,All__Age
-0.322,Name: Highlighted in text (sum)
-0.327,All__Cabin=


Weight,Feature
+0.483,All__Age (missing)
+0.318,All__SibSp (missing)
+0.244,Name__ Mr. (missing)
+0.227,All__Embarked=S (missing)
+0.222,All__Fare
+0.062,All__Sex=female
… 9 more positive …,… 9 more positive …
… 19 more negative …,… 19 more negative …
-0.057,All__Pclass=1 (missing)
-0.065,<BIAS>


Weight,Feature
+0.566,All__Age (missing)
+0.517,Name__ Ma (missing)
+0.244,Name__ Mr. (missing)
+0.227,All__Embarked=S (missing)
+0.182,All__SibSp
+0.180,All__Embarked=Q
+0.171,Name: Highlighted in text (sum)
+0.151,All__Fare
… 9 more positive …,… 9 more positive …
… 20 more negative …,… 20 more negative …


It looks like the name classifier tried to infer both gender and status from the title: "Mr." is bad
because women were saved first, and it's better to be "Mrs." (married) than "Miss.".
The name classifier also tries to pick up parts of names and surnames, especially endings,
perhaps as a proxy for social status.
It's especially bad to be "Mary" if you are from the third class.
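
One way to check this reading, sketched here under the assumption that names follow the ``Surname, Title. Given names`` pattern seen in the samples above, is to pull the title out explicitly and then compare survival rates per title. ``extract_title`` is a hypothetical helper, not part of the notebook:

```python
import re

# The title is the text between the comma and the first period after it.
TITLE_RE = re.compile(r',\s*([^.,]+)\.')

def extract_title(name):
    """Return the title in a 'Surname, Title. Given names' string, or None."""
    match = TITLE_RE.search(name)
    return match.group(1).strip() if match else None

print(extract_title('Dakic, Mr. Branko'))              # Mr
print(extract_title('Devaney, Miss. Margaret Delia'))  # Miss
```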