In [1]:
# Latex is needed by the equation cells further down
from IPython.display import display, Latex

Rachael Creager

20 May 2018

This notebook will be used to guide the UPenn physics machine learning journal club discussion for the 21st.

Paper for discussion: 
* [Identifying the relevant dependencies of the neural network
response on characteristics of the input space](https://arxiv.org/pdf/1803.08782.pdf)

Related interesting references: 
* [Explaining NonLinear Classification Decisions with
Deep Taylor Decomposition](https://arxiv.org/pdf/1512.02479.pdf)
* [DeepTaylor talk](http://heatmapping.org/deeptaylor/)

## Section 1
### Introduction

* A neural network is a multi-parameter classifier whose parameters are determined at training time.
* In many applications, training sample and testing sample are not guaranteed to be congruent/similar
* To make sure we understand the model's behavior, we want to identify the NN inputs with the largest influence on the output
* In this paper, this effect will be analyzed by calculating a (2nd order) Taylor expansion of the NN function with respect to each input variable
* Reminder: [Taylor Decomposition](http://www.math.ucdenver.edu/~esulliva/Calculus3/Taylor.pdf)
    * Let f be an infinitely differentiable function (x,y) -> R in an open neighborhood around (a,b):

In [27]:
Latex(r"""\begin{equation}
f(x,y) = f(a,b) + f_x(a,b)(x-a) + f_y(a,b)(y-b) + \frac{1}{2!}[f_{xx}(a,b)(x-a)^2 + 2f_{xy}(a,b)(x-a)(y-b) + f_{yy}(a,b)(y-b)^2] + \cdots
\end{equation}""")

<IPython.core.display.Latex object>
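
As a quick numerical check of a second-order expansion, the sketch below compares a function with its second-order Taylor approximation near the expansion point. The test function `f` and its hand-written derivatives are assumptions chosen for illustration; they do not come from the paper.

```python
import math

# Hypothetical test function (an assumption for illustration): f(x, y) = exp(x) * sin(y)
def f(x, y):
    return math.exp(x) * math.sin(y)

def taylor2(x, y, a, b):
    """Second-order Taylor approximation of f around (a, b)."""
    ea, sb, cb = math.exp(a), math.sin(b), math.cos(b)
    f0 = ea * sb
    fx, fy = ea * sb, ea * cb                   # first derivatives at (a, b)
    fxx, fxy, fyy = ea * sb, ea * cb, -ea * sb  # second derivatives at (a, b)
    dx, dy = x - a, y - b
    return (f0 + fx * dx + fy * dy
            + 0.5 * (fxx * dx**2 + 2 * fxy * dx * dy + fyy * dy**2))

a, b = 0.0, 1.0
for h in (0.1, 0.01):
    err = abs(f(a + h, b + h) - taylor2(a + h, b + h, a, b))
    print(f"h = {h}: |f - taylor2| = {err:.2e}")
```

The leftover error is dominated by the third-order term, so it should shrink roughly like $h^3$ as the evaluation point approaches $(a,b)$.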

* Consider two types of features: 
    * First order: those from first-order Taylor coeffs. Should capture influence of single input elements
    * Second order: those from second-order Taylor coeffs. Should capture pairwise correlations or self-correlations (autocorrelation)
* Depending on the task, the influence of a given feature could vary over the input space
    * To deal with this, the metric used to measure influence of a feature is arithmetic mean of abs value:

In [24]:
Latex(r"""\begin{eqnarray}
<t_i> = \frac{1}{N} \sum_{j=1}^{N} |t_i (\{ x_j\})|
\end{eqnarray}""")

<IPython.core.display.Latex object>

where $t_i$ corresponds to the Taylor coefficient and $\{x_j\}$ is the set of input elements
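
A minimal sketch of how $<t_i>$ could be estimated for any function of the inputs. It rests on a few assumptions: derivatives are approximated with central finite differences (the paper's implementation may well use automatic differentiation), the coefficients are taken to be the partial derivatives themselves without the $1/2!$ prefactor, and `toy` is a hypothetical stand-in for a trained NN.

```python
import numpy as np

def mean_abs_taylor_coeffs(f, X, eps=1e-3):
    """Estimate <t_i> for all first- and second-order coefficients of f.

    f maps an (N, d) array to N outputs. Returns a length-d array of
    first-order values and a symmetric (d, d) array of second-order values.
    """
    N, d = X.shape
    E = np.eye(d) * eps          # row i is a small step along input i
    first = np.empty(d)
    second = np.empty((d, d))
    f0 = f(X)
    for i in range(d):
        fp, fm = f(X + E[i]), f(X - E[i])
        first[i] = np.mean(np.abs((fp - fm) / (2 * eps)))
        second[i, i] = np.mean(np.abs((fp - 2 * f0 + fm) / eps**2))
        for k in range(i + 1, d):
            mixed = (f(X + E[i] + E[k]) - f(X + E[i] - E[k])
                     - f(X - E[i] + E[k]) + f(X - E[i] - E[k])) / (4 * eps**2)
            second[i, k] = second[k, i] = np.mean(np.abs(mixed))
    return first, second

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))

# Hypothetical stand-in for a trained NN: f = 3*x1 + x1*x2
toy = lambda X: 3 * X[:, 0] + X[:, 0] * X[:, 1]

first, second = mean_abs_taylor_coeffs(toy, X)
print("first order :", first)
print("second order:\n", second)
```

For this toy function the only non-vanishing second derivative is the mixed one, so the off-diagonal entry of `second` should come out close to 1 and the diagonal close to 0.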

Questions:
* Can we guarantee a limit on the higher-order terms?

## Section 2: 
### Analysis of features of the input space for simple tasks

Let's start with an easy example :)

Consider a binary classification task with a two-dimensional input, $x_1$ and $x_2$. 
We'll consider four different tasks:

<img src="table.png">

The signal and background are Gaussian distributions, with the $(x_1,x_2)$ columns describing the center (i.e. $(x_1,x_2) = 0.5$ means the center of the distribution is $(0.5,0.5)$).

<img src="plots.png">

* For 1a, the signal and background are shifted from each other. $x_1$ and $x_2$ are uncorrelated and equally spread. 
    * In this case, $<t_1>$ and $<t_2>$ have high values, indicating that you get strong separation power from the marginal probability distributions of $x_1$ and $x_2$
    * You also get a non-zero correlation between $x_1$ and $x_2$ because of the relative position of the two distributions
* For 1b, the signal and background have the same center. $x_1$ and $x_2$ are equally spread, but with different correlations for signal and background
    * The first order features $x_1$ and $x_2$ are somewhat useful for classification -- around the origin, they are useless, but for large absolute values they give a bit of separation power (we see this since $<t_{x_1}>$ and $<t_{x_2}>$ are somewhat large)
    * Since the correlation is the key difference between the distributions here, we expect $<t_{x_1,x_2}>$ to be the largest term -- and we see this in the results!
* 1c is a mixture of cases 1a and 1b
* For 1d, the distributions have the same center. $x_1$ and $x_2$ are uncorrelated, but signal is more tightly "spread" than background
    * In this case, we would expect the first order terms to help in the case of large absolute values
    * Self-correlation might help as well, to try to get a handle on the "peakiness" of the distribution
    * The relationship between $x_1$ and $x_2$ won't help much, since they are uncorrelated
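
Toy samples of this kind can be generated along the following lines. The means, spreads and correlations below are placeholder assumptions picked only to mimic the qualitative description of each task; they are not the values from the table. `make_task` and `cov` are hypothetical helper names.

```python
import numpy as np

def make_task(mean_s, cov_s, mean_b, cov_b, n=5000, seed=0):
    """Draw n signal and n background points from two 2D Gaussians."""
    rng = np.random.default_rng(seed)
    sig = rng.multivariate_normal(mean_s, cov_s, size=n)
    bkg = rng.multivariate_normal(mean_b, cov_b, size=n)
    X = np.vstack([sig, bkg])
    y = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = signal, 0 = background
    return X, y

def cov(sigma, rho):
    """Covariance matrix with equal spread sigma and correlation rho."""
    return sigma**2 * np.array([[1.0, rho], [rho, 1.0]])

# Placeholder parameters (assumptions, NOT the values from the table)
tasks = {
    "1a": make_task([0.5, 0.5], cov(1, 0), [-0.5, -0.5], cov(1, 0)),
    "1b": make_task([0, 0], cov(1, 0.5), [0, 0], cov(1, -0.5)),
    "1c": make_task([0.5, 0.5], cov(1, 0.5), [-0.5, -0.5], cov(1, -0.5)),
    "1d": make_task([0, 0], cov(0.5, 0), [0, 0], cov(1, 0)),
}

for name, (X, y) in tasks.items():
    rho_s = np.corrcoef(X[y == 1].T)[0, 1]
    rho_b = np.corrcoef(X[y == 0].T)[0, 1]
    print(f"{name}: signal corr = {rho_s:+.2f}, background corr = {rho_b:+.2f}")
```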

## Section 3
### Analysis of the learning progress

* We will now consider task 1c to observe how $<t_i>$ may be used as a metric for learning
* This evaluation will be done by observing how the area under the curve (AUC) changes for the receiver operating characteristic (ROC) curve
    * Reminder: [ROC curve](http://gim.unmc.edu/dxtests/roc3.htm)
    * A ROC curve is a way of measuring the "goodness" of a classifier
    * A "perfect" curve has an AUC of 1
    * AUC = 0.5 means the classifier is no better than random guessing (completely garbage!)
    
<img src="roc.png">
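
For reference, the AUC can be computed directly from classifier scores without drawing the curve: it equals the probability that a randomly chosen signal event gets a higher score than a randomly chosen background event. The sketch below uses that pairwise definition; `roc_auc` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC as P(score of random signal event > score of random background event)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    sig, bkg = scores[labels], scores[~labels]
    # Compare every signal score with every background score; ties count one half.
    # This needs O(n_sig * n_bkg) memory, which is fine for small samples.
    greater = (sig[:, None] > bkg[None, :]).sum()
    ties = (sig[:, None] == bkg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(sig) * len(bkg))

print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # signal always scores higher
print(roc_auc([0.1, 0.3, 0.8, 0.9], [1, 1, 0, 0]))  # signal always scores lower
```

An AUC well below 0.5 means the score is systematically inverted, so flipping its sign would give a useful classifier again.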

In this case, we evaluate training over 350 iterations with AUC, and measure the relative contribution of the first- and second-order features using $<t_i>$
* In steps 0-30, we see AUC rise quickly, plateauing around 0.84
    * We can see the values of $<t_{x_1}>$, $<t_{x_2}>$ get large. This means the classifier quickly learned to use the first-order features to improve its performance. 
    * We also see a contribution from $<t_{x_1,x_2}>$, and minor contributions from the autocorrelation terms
* After the plateau (approx. iterations 50-100), we see a gradual rise in AUC by another 0.03 or so, reaching a final plateau around 0.85
    * In this second increase, it appears to be giving more influence to the second-order terms and decreasing the influence of the first-order terms

<img src="training.png">

## Section 4
### Application to a benchmark task from high-energy particle physics

Now let's test it on a trickier problem, the Higgs boson machine learning challenge dataset ([kaggle link](https://www.kaggle.com/c/higgs-boson/data))
* Searching for $H \rightarrow \tau^+ \tau^-$
* $\tau$ particles are notoriously difficult to identify/measure
* No "obvious" physics signature to distinguish $\tau$'s from Higgs decays versus other decays

These features make this problem particularly well-suited to NNs.

A full list of the 30 input variables is available [here, on pages 14-16](https://higgsml.lal.in2p3.fr/files/2014/04/documentation_v1.8.pdf).
* The **PRI_** quantities are "primary" variables, i.e. directly calculated from measurements
    * Some of these quantities (e.g. $\phi$) will provide no separation power due to symmetries of the problem
* The **DER_** quantities are derived using primary quantities
* In general, we expect the derived quantities to be more useful.

The authors trained a simple NN using all 30 input features. The trained classifier gave AUC = 0.92 with an approximate median significance (AMS) of 2.61. The 30 input features result in 495 first- and second-order Taylor coefficient features (30 first-order plus 465 unique second-order terms).
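
The number of Taylor coefficient features follows from simple counting: each input contributes one first-order term, and the second-order terms are the unique pairs $(i,j)$ with $i \le j$. A short sketch, with `n_taylor_features` as a hypothetical helper name:

```python
def n_taylor_features(n_inputs):
    """Number of first-order plus unique second-order Taylor coefficients."""
    first_order = n_inputs
    second_order = n_inputs * (n_inputs + 1) // 2  # pairs (i, j) with i <= j
    return first_order + second_order

print(n_taylor_features(2))   # toy tasks: x1, x2, x1x1, x1x2, x2x2
print(n_taylor_features(30))  # Higgs challenge inputs
```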

To evaluate the influence of each variable, the metrics $<t_i>$ are ranked:
<img src="metrics_ranked.png">
* The most important variable is the invariant mass of the hadronic $\tau$ plus lepton
    * Among first-order terms, it is the most important (ranked 10th)
    * Among important second-order terms, it appears in six of them
    * Using these, the authors determine that the NN is learning the tau invariant mass peak position and width
    * By this metric, the most important single term comes from correlation of the $\tau$+lepton mass and the ratio $\frac{lepton_{p_T}}{\tau_{p_T}}$
* As expected, terms such as the azimuthal angle $\phi$ had very little influence
* The authors tried re-training the algorithm using only the inputs contributing to the top 5% of $<t_i>$ terms
    * This reduced the list of inputs to 8 variables
    * Achieved similar/identical performance
    * Authors did not perform deeper analysis
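
The input-reduction step can be sketched as follows: rank all terms by $<t_i>$, keep the top fraction, and collect every input that appears in a surviving term. The scores and input names below are made up for illustration, and rounding the cut up with `ceil` is an assumption; the paper's exact procedure may differ.

```python
import math

def inputs_in_top_terms(scores, fraction=0.05):
    """Return the input names appearing in the highest-ranked fraction of terms.

    scores maps a tuple of input names, e.g. ("a",) for a first-order term
    or ("a", "b") for a second-order term, to its <t_i> value.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, math.ceil(fraction * len(ranked)))
    return sorted({name for term in ranked[:n_keep] for name in term})

# Made-up scores for three hypothetical inputs (not values from the paper)
scores = {
    ("mass",): 0.9, ("pt_ratio",): 0.2, ("phi",): 0.01,
    ("mass", "mass"): 0.7, ("mass", "pt_ratio"): 1.3, ("mass", "phi"): 0.02,
    ("pt_ratio", "pt_ratio"): 0.1, ("pt_ratio", "phi"): 0.01, ("phi", "phi"): 0.005,
}
print(inputs_in_top_terms(scores, fraction=0.25))
```

With only nine terms, a 25% cut is used here so that more than one term survives; for the 30-input case the same call with `fraction=0.05` would apply.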

Conclusions:
* $<t_i>$ is an interesting metric for how strongly each input feature influences the NN response
* Using some toy cases, we see the behavior we expect when using this metric to evaluate importance of various input terms
* Using a more complicated case, some obvious physics interpretations could be made

Rachael's questions/concerns:
* Do we really want to consider the entire space equally? Shouldn't some parts of the input space (i.e. your signal region versus control region) be given more "influence"?
* What are other metrics of Taylor coefficient "influence"?
* What about higher-order terms? Big-oh/little-oh constraints on higher-order terms?