# Supplemental explanations

## For Section 3: Evaluating Feature Contributions

The following explanations are adapted from the paper: Palczewska, A., Palczewski, J., Robinson, R. M., & Neagu, D. (2014). Interpreting random forest classification models using a feature contribution method. In *Integration of reusable systems* (pp. 193-218). Springer, Cham.

### Probabilistic Interpretation of Classification Random Forest Models

This is a probabilistic interpretation of the forest prediction process. Rather than treating each tree's output as a hard decision, we'll treat it as a probability of class membership.

The classes in our model can be represented by the set $C = \{C_{1},C_{2},...,C_{K}\}$, where each $C_{k}$ is a class and there are $K$ classes in total. Additionally, we can define the set $\Delta_{K}$ by the following:

$$\Delta_{K} = \{(p_{1},...,p_{K}) \textrm{ } | \sum\limits_{k=1}^K p_{k} = 1 \textrm{ and } p_{k} \geq 0\}$$

which can be stated in plain English as, "the set of all tuples $(p_{1}, ..., p_{K})$ such that the $p$'s sum to 1 and every $p_{k}$ is greater than or equal to zero". There is one $p_{k}$ for each class. Each element of $\Delta_{K}$ is therefore a probability distribution over the $K$ classes, which is why its entries have to sum to 1. But what are the $p$'s? <br>

For an instance $i$ and a tree $t$, we define $\hat{Y}_{i,t}$ as the tree's prediction for that instance. To make explicit which class the prediction is about, we write $\hat{Y}_{(k)i,t}$ for the probability that tree $t$ classifies $i$ as being of class $k$. For each class there are $T$ of these $\hat{Y}_{(k)i,t}$'s, one per tree in the forest. Therefore, if we want to know the probability that the whole forest will predict $i$ to be class $k$, we average over the trees:

$$\hat{Y}_{(k)i} = \frac{1}{T} \sum\limits_{t=1}^T \hat{Y}_{(k)i,t}$$

We do this for all the classes, which yields the vector $(\hat{Y}_{(1)i}, ... , \hat{Y}_{(K)i})$. This is a probability distribution over all classes, i.e. an element of $\Delta_{K}$, and its entries are the $p$'s.

So to sum up: we represent the "decision" that each tree makes about an instance as a probability of class membership. For each class, we average these probabilities over all trees, which yields a probability distribution for the prediction of that instance.
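As a minimal sketch of this averaging, here is a toy example in plain Python. The per-tree probability vectors are made up for illustration, and the helper names `forest_prediction` and `in_simplex` are hypothetical, not part of any library.

```python
# Hypothetical per-tree predictions for a single instance i.
# Each row is one tree's probability vector over K = 3 classes.
tree_predictions = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]


def forest_prediction(tree_predictions):
    """Average the per-tree probability vectors, class by class."""
    T = len(tree_predictions)
    K = len(tree_predictions[0])
    return [sum(tree[k] for tree in tree_predictions) / T for k in range(K)]


def in_simplex(p, tol=1e-9):
    """Check that p is non-negative and sums to 1, i.e. lies in Delta_K."""
    return all(x >= -tol for x in p) and abs(sum(p) - 1.0) <= tol


y_hat = forest_prediction(tree_predictions)
print(y_hat, in_simplex(y_hat))
```

Because every tree's vector sums to 1, the class-wise average does too, so the forest prediction is again a probability distribution over the classes.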


### The Unanimity Condition: Using feature contributions to retrieve model predictions

**NOTE TO SELF:** *make sure to check logic of why U is necessary with someone -- AND CHECK HOW Y=0.5 DRAWS ARE RESOLVED*

A branch of a classification tree stops growing in one of two cases: when a node contains only instances of one class, or when further splitting won't improve the classification. If all terminal nodes contain instances of only one class, the forest satisfies the following unanimity condition (U, from the Palczewska et al 2014 paper):

"U: for every tree in the forest, local training instances in each terminal node are of the same class"

When this condition is satisfied, the forest's predictions can be recovered using the feature contributions. This is done with the following equation:

$$\hat Y_{i} = (Y^{r} + \sum\limits_{f} FC^{f}_{i}, 1 - Y^{r} - \sum\limits_{f} FC^{f}_{i})$$

where $Y^{r}$ is the coordinate-wise average of $Y_{mean}$ over all root nodes in the forest, and $FC^{f}_{i}$ is the contribution of feature $f$ to the prediction for instance $i$. <br>

The right side of the equation has two coordinates: $Y^{r} + \sum\limits_{f} FC^{f}_{i}$ and $1 - Y^{r} - \sum\limits_{f} FC^{f}_{i}$. These are simply the values of $\hat Y_{i}$ for each of the two classes in the binary setting. The first coordinate is the probability for whichever class has been designated as Class 1, and the second is the probability for the second class.

The above is for a binary classification use case, but can be generalized to multi-class scenarios (See Palczewska et al 2014 section 4).
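To make the recovery equation concrete, here is a minimal sketch for a single hand-built binary tree that satisfies U (every leaf has a $Y_{mean}$ of 0 or 1). It assumes that a feature's contribution is the change in $Y_{mean}$ between a node and the child on the instance's decision path, credited to the feature that node splits on. With only one tree, $Y^{r}$ is just the root's $Y_{mean}$. The tree, its values, and the function name `explain` are all hypothetical.

```python
# Hand-built binary classification tree (hypothetical values).
# y_mean is the fraction of local training instances that are Class 1.
tree = {
    "y_mean": 0.5, "feature": "x1", "threshold": 2.0,
    "left": {
        "y_mean": 0.25, "feature": "x2", "threshold": 1.0,
        "left": {"y_mean": 0.0},
        "right": {"y_mean": 1.0},
    },
    "right": {
        "y_mean": 0.75, "feature": "x2", "threshold": 3.0,
        "left": {"y_mean": 1.0},
        "right": {"y_mean": 0.0},
    },
}


def explain(tree, instance):
    """Walk the decision path, returning (root y_mean, per-feature contributions)."""
    node = tree
    y_root = node["y_mean"]
    contributions = {}
    while "feature" in node:  # leaves have no "feature" key
        f = node["feature"]
        child = node["left"] if instance[f] <= node["threshold"] else node["right"]
        # Credit the change in y_mean to the feature used for this split.
        contributions[f] = contributions.get(f, 0.0) + child["y_mean"] - node["y_mean"]
        node = child
    return y_root, contributions


instance = {"x1": 1.0, "x2": 5.0}
y_root, contributions = explain(tree, instance)
p_class1 = y_root + sum(contributions.values())
y_hat = (p_class1, 1 - p_class1)
print(y_root, contributions, y_hat)
```

The contributions telescope along the path, so the root value plus their sum lands exactly on the leaf's $Y_{mean}$, which under U is either 0 or 1.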

Why is the unanimity condition required to recover predictions using feature contributions? It comes down to how draws are resolved when a terminal node is impure. If a terminal node does not have a $Y_{mean}$ of 0 or 1, there are instances of both classes at that node, and further splitting won't improve the classification. If $Y_{mean} \neq 0.5$, the class for the node is determined by majority voting: taking $Y_{mean}$ to be the fraction of Class 1 instances, $Y_{mean} > 0.5$ gives Class 1, whereas $Y_{mean} < 0.5$ gives Class 2. However, if $Y_{mean} = 0.5$, majority voting gives a draw, which has to be resolved some other way, by random selection (**check this**). <br>

Using the above formula for $\hat Y_{i}$, if there are no draws, we can still recover predictions: if the first coordinate of $\hat Y_{i}$ is greater than 0.5, it's Class 1, and if it is less than 0.5, it's Class 2. However, if there was a draw and the first coordinate equals 0.5, we have no way of knowing which way the model resolved the draw. This is why we can't recover predictions using feature contributions if the unanimity condition doesn't hold. <br>
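The decision rule described above can be sketched as a small function. The name `recover_class` and the example probabilities are hypothetical. Returning `None` on a draw is just one way to signal that the label cannot be recovered, not something taken from the paper.

```python
def recover_class(p_class1):
    """Map a reconstructed Class 1 probability to a label, or None on a draw."""
    if p_class1 > 0.5:
        return "Class 1"
    if p_class1 < 0.5:
        return "Class 2"
    # Exactly 0.5: the reconstruction cannot tell us how the tie was broken.
    return None


# Hypothetical reconstructed values of Y^r plus the summed feature contributions.
labels = [recover_class(p) for p in (0.75, 0.25, 0.5)]
print(labels)
```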

Note that this is only relevant to classification models. A regression model has no analogue of the unanimity condition, since its predictions are simply the mean of the labels of all instances at a terminal node.

#### Bias and prediction reconstruction

The following is my analysis of how the various sources I've used to put together this tutorial relate to one another. Specifically, I am seeking to reconcile the Palczewska et al 2014 paper with the documentation for `treeinterpreter` (used in this tutorial) and with [this blog post](http://blog.datadive.net/interpreting-random-forests/) on calculating feature contributions for regression trees, on which the method in `treeinterpreter` is based. This is also addressed in the main tutorial under section 3A., in the **Feature Contributions: Regression** section. Here, however, I'm going to focus on the idea of "bias" in reconstructing the predictions of a tree using feature contributions.

The `treeinterpreter` README states that the predictions from the tree can be reconstructed using the equation 

    prediction = bias + feature_1_contribution + ... + feature_n_contribution 

Meanwhile, the blog post defines a prediction in terms of feature contributions as 

$$f(x) = c_{full} + \sum\limits^{K}_{k=1} contrib(x,k)$$

where $K$ is the number of features, $c_{full}$ is the value at the root node of the tree, and $contrib(x,k)$ is the contribution from the k-th feature in the feature vector $x$.
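As a rough sketch of this decomposition in the regression setting, the toy example below walks two hand-built regression trees and then averages the results. It assumes that each node's value is the mean target of its training instances, that $contrib(x,k)$ accumulates the change in node value at every split on feature $k$, and that forest-level quantities are plain averages over trees. The trees, their values, and the function name `decompose` are hypothetical.

```python
# Two hand-built regression trees (hypothetical values).
# "value" is the mean target at the node; features are column indices.
tree_a = {
    "value": 10.0, "feature": 0, "threshold": 5.0,
    "left": {"value": 6.0},
    "right": {
        "value": 14.0, "feature": 1, "threshold": 0.5,
        "left": {"value": 12.0},
        "right": {"value": 16.0},
    },
}
tree_b = {
    "value": 11.0, "feature": 1, "threshold": 0.5,
    "left": {"value": 9.0},
    "right": {"value": 13.0},
}


def decompose(tree, x, n_features):
    """Return (c_full, contrib) for one tree; contrib[k] plays the role of contrib(x, k)."""
    node = tree
    c_full = node["value"]
    contrib = [0.0] * n_features
    while "feature" in node:  # leaves have no "feature" key
        k = node["feature"]
        child = node["left"] if x[k] <= node["threshold"] else node["right"]
        contrib[k] += child["value"] - node["value"]
        node = child
    return c_full, contrib


x = [7.0, 1.0]
forest = [tree_a, tree_b]
parts = [decompose(t, x, n_features=2) for t in forest]

# Average the root values and the per-feature contributions over the trees.
bias = sum(c_full for c_full, _ in parts) / len(forest)
contributions = [sum(c[k] for _, c in parts) / len(forest) for k in range(2)]
prediction = bias + sum(contributions)
print(bias, contributions, prediction)
```

Here the two trees end at leaves with values 16 and 13, and the reconstructed prediction of 14.5 is exactly their average, which is what we'd expect if the decomposition is just a regrouping of the same sum.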

Finally, in the Palczewska paper we have: 

$$\hat Y_{i} = (Y^{r} + \sum\limits_{f} FC^{f}_{i}, 1 - Y^{r} - \sum\limits_{f} FC^{f}_{i})$$

The operative question in this section is: *Are these equations all referring to the same thing?*



The first two equations clearly match up directly: `bias` in the `treeinterpreter` equation is the same as $c_{full}$ in the blog's equation, and `feature_n_contribution` is the same as $contrib(x,n)$. <br>

These two equations also look a lot like the first element in the Palczewska version, with $Y^{r}$ playing the role of the bias. The difference is that $\hat Y_{i}$ has two elements. As discussed above, one element represents classification into the first class, the other into the second. This reflects the difference between calculating feature contributions for classification versus regression models: the Palczewska version is only for classification models, whereas the other two refer to regression models.

In the regression case, $Y_{mean}$ is directly the prediction made at a node, so we can always recover the predictions by summing up the feature contributions along a decision path. This is why (a) the unanimity condition is irrelevant (it doesn't make sense in the regression case), and (b) there's only one way to recover the prediction.

In the classification case, $Y_{mean}$ represents the probability of classifying $i$ into the class of interest at the node, so we have to take 1 minus this probability (in the binary case) to get the probability that we would classify it into the other class. This is also why, as discussed above, the unanimity condition matters for prediction recovery using this method. <br>

In conclusion, yes, the spirit of these three equations is the same!

### Feature contributions for multiclass models