# Evaluating the performance of a density estimator

We call a density estimator *proper* if the resulting estimate is a true density function, integrating to 1 over the entire feature space. Most density estimators in the literature are proper at least on paper, but can still turn out to be improper in practice due to numerical issues or other details of implementation.

The optimal strategy for evaluating a density estimate depends on the use case and whether the estimator is proper.


## Cross-validation via the log likelihood

*BLUF: Attractive because ... cross validation! - but use with caution.* 

Given a proper density estimator, the mean log \[estimated\] density over a hold-out dataset, which is the hold-out log likelihood divided by the sample size, is a well-known method of performance evaluation (for example, see [Schmidberger and Frank](https://link.springer.com/content/pdf/10.1007/11564126_26.pdf)). Given hold-out data $x = (x_1, ..., x_n)$ and some estimated density function $\hat p$, the log likelihood is

$$\sum_{i=1}^n \log \hat p (x_i) $$

Divided by $n$, this is a direct estimate of $\int p(x) \log \hat p(x) dx$, which is (up to sign) the only part of the KL divergence $\int p(x) \log \frac{p(x)}{\hat p(x)} dx$ that depends on $\hat p$. Maximizing the hold-out log likelihood therefore approximately minimizes the KL divergence. Intuitively, a high value of the likelihood means that the estimator located the high-density regions and placed relatively high density where the hold-out points actually appear.
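As a minimal sketch of this evaluation, the snippet below scores a hand-rolled 1-D Gaussian kernel density estimator at three bandwidths by its mean log density on hold-out data. The estimator, the bandwidth grid, the sample sizes, and the standard normal generative density are all assumptions chosen for illustration; none of them come from the references above.

```python
import math
import random


def kde_log_density(train, bandwidth):
    """Return a function x -> log of a 1-D Gaussian KDE fitted to `train`."""
    log_norm = math.log(len(train) * bandwidth * math.sqrt(2.0 * math.pi))

    def log_density(x):
        # log-sum-exp keeps the result finite even when every kernel is tiny at x
        exponents = [-0.5 * ((x - t) / bandwidth) ** 2 for t in train]
        m = max(exponents)
        return m + math.log(sum(math.exp(e - m) for e in exponents)) - log_norm

    return log_density


def mean_log_density(log_density, holdout):
    """Hold-out log likelihood divided by the number of hold-out points."""
    return sum(log_density(x) for x in holdout) / len(holdout)


rng = random.Random(0)
# Assumed simulation setup: train and hold-out samples from a standard normal.
train = [rng.gauss(0.0, 1.0) for _ in range(200)]
holdout = [rng.gauss(0.0, 1.0) for _ in range(200)]

# Hypothetical bandwidth grid: undersmoothed, moderate, oversmoothed.
scores = {
    h: mean_log_density(kde_log_density(train, h), holdout)
    for h in (0.02, 0.3, 3.0)
}
best_bandwidth = max(scores, key=scores.get)
```

Here the score is used the way a cross-validation loop would use it: the bandwidth with the highest mean hold-out log density is preferred. A full cross-validation would repeat this over several train/hold-out splits and average the scores.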

Although cross-validation on the log likelihood is attractive for its simplicity, two major caveats apply:
- [Hall (1987)](https://projecteuclid.org/download/pdf_1/euclid.aos/1176350606) showed that this approach can lead to "infinite loss and inconsistent estimation", depending on the tails of the target density and of the density estimator.
- This approach is relevant only for proper estimators. If the density is allowed to be improper, it's trivial to game the metric simply by assigning a very high density estimate to every point.
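Both caveats can be illustrated with a toy example. The sketch below is not a reproduction of Hall's analysis; it only shows the two mechanisms on made-up numbers. The uniform "estimate", the scaling factor of 1000, and the hold-out values are all hypothetical.

```python
import math


def mean_log_density(density, holdout):
    """Mean log density over hold-out points; -inf if any point gets zero density."""
    total = 0.0
    for x in holdout:
        d = density(x)
        if d <= 0.0:
            return -math.inf
        total += math.log(d)
    return total / len(holdout)


def uniform_density(x):
    """A proper density: uniform on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def inflated_density(x):
    """An improper 'density': same shape scaled by 1000, so it integrates to 1000."""
    return 1000.0 * uniform_density(x)


holdout = [0.1, 0.4, 0.7, 0.9]

# Scaling the estimate by a constant c adds log(c) to the score,
# without making the estimate any more informative.
proper_score = mean_log_density(uniform_density, holdout)
inflated_score = mean_log_density(inflated_density, holdout)

# A single hold-out point outside the estimated support gives an infinite loss.
tail_score = mean_log_density(uniform_density, holdout + [1.5])
```

The second case is the simplest version of the tail problem: an estimate whose tails are too light relative to the data can be penalized without bound by a handful of extreme hold-out points.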

## Evaluating improper estimators

Properness of a density estimate is often of no practical significance in real-world use cases. For example, an [isolation forest](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest) produces a score that orders points roughly by density: its anomaly score is high where density is low, so the negated score is high at points of high density. Although this function is not a proper density, it can be used for anomaly detection or mode detection just like any density estimate. We'd like a success metric that puts the isolation forest on equal footing with a proper density estimate.

The following metrics are the best we've identified for when the density estimate is not proper:
- The Spearman (rank-order) correlation: a density estimate attains the maximum (1) whenever it ranks the test points in the same order as the true density.
- The Pearson (usual) correlation: a density estimate attains the maximum (1) whenever it is of the form $\hat p(x) = m p(x) + b$ for some positive constant $m$ and scalar $b$.

Both of these correlations are between (a) the density estimate on a hold-out test set and (b) the \[true\] generative density evaluated at the test points. Of course, both of these are applicable only in simulation studies, where the generative density is known.
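The two correlations can be sketched in a few lines of plain Python. Everything in the simulation below is an assumption for illustration: the standard normal generative density, the sample size, and the two hypothetical improper "estimates" (an affine transform and a log transform of the true density). The rank helper assumes no ties; tied values would need average ranks.

```python
import math
import random
from statistics import NormalDist


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def ranks(values):
    """Ranks starting at 1, assuming no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = float(rank)
    return result


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


rng = random.Random(0)
true_dist = NormalDist(mu=0.0, sigma=1.0)  # assumed generative density
test_points = [rng.gauss(0.0, 1.0) for _ in range(500)]
true_density = [true_dist.pdf(x) for x in test_points]

# Two hypothetical improper "estimates" of the density at the test points.
affine_score = [5.0 * p + 3.0 for p in true_density]    # m * p(x) + b
monotone_score = [math.log(p) for p in true_density]    # negative values, not a density

pearson_affine = pearson(true_density, affine_score)
spearman_monotone = spearman(true_density, monotone_score)
pearson_monotone = pearson(true_density, monotone_score)
```

The affine score attains a Pearson correlation of 1, and the log score attains a Spearman correlation of 1 while its Pearson correlation falls short of 1. This is the sense in which Spearman rewards any order-preserving score, whereas Pearson also rewards getting the shape right up to scale and shift.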


## TODO

- Explain how [Deng and Wickham](https://vita.had.co.nz/papers/density-estimation.pdf) used the mean absolute error to compare estimated densities against the density used to simulate the data.

- Describe the integrated L1 and L2 losses and how they relate to the expectations of the L1 and L2 losses, respectively.

## Additional references

- [A lecture by Larry Wasserman](http://www.stat.cmu.edu/~larry/=sml/densityestimation.pdf). He highlights [Devroye and Gyorfi (1985)](http://luc.devroye.org/L1bookBW.pdf), endorsing integrated L1 loss due to interpretability and invariance under certain transformations.
- [Klein and Richardson](http://www.stat.cmu.edu/~lrichard/links/density_trees.pdf) review several theoretical properties of piecewise constant density estimators in terms of the MISE.
- [Seaman and Powell](https://www.researchgate.net/publication/224817410_An_Evaluation_of_the_Accuracy_of_Kernel_Density_Estimators_for_Home_Range_Analysis) use some kind of MISE (although the $f(x)$ in their denominator seems to conflict with relatively modern MISE definitions).

TODO: We're only scratching the surface of the literature here ...