# Evaluating the performance of a density estimator

The optimal strategy for evaluating a density estimator depends on (a) the properness and (b) the use case of the estimator under consideration.

We call a density estimator *proper* if the resulting estimate is a true density function, integrating to 1 over the entire feature space. Most density estimators in the literature are proper at least on paper, but can still turn out to be improper in practice due to numerical issues or other details of implementation.
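In one dimension, properness can be checked numerically by integrating the estimate over a range wide enough to hold essentially all of its mass. The following is a minimal sketch, not a definitive test: `integrate`, `normal_pdf`, and `unnormalized_score` are hypothetical names introduced here for illustration, and the integration range of $[-10, 10]$ is an assumption that suits these two functions only.

```python
import math

def integrate(f, lo, hi, n=20_000):
    """Approximate the integral of f over [lo, hi] with the trapezoidal rule."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, n):
        total += f(lo + k * h)
    return total * h

def normal_pdf(x):
    """A proper density: the standard normal."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def unnormalized_score(x):
    """Hypothetical improper estimate: same shape, missing the normalizing constant."""
    return math.exp(-0.5 * x * x)

proper_mass = integrate(normal_pdf, -10.0, 10.0)            # close to 1
improper_mass = integrate(unnormalized_score, -10.0, 10.0)  # close to sqrt(2 * pi)
```

A total mass noticeably different from 1 signals an improper estimate. In higher dimensions grid integration becomes impractical, so a check like this is mostly useful for low-dimensional sanity tests.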

## Log likelihood on hold-out data

Given a proper density estimator, the mean log \[estimated\] density over a hold-out dataset, or simply the hold-out log likelihood, is arguably the gold standard for performance evaluation (for example, see [Schmidberger and Frank](https://link.springer.com/content/pdf/10.1007/11564126_26.pdf)). Given hold-out data $x = (x_1, \dots, x_n)$ and some estimated density function $\hat p$, the mean log likelihood is

$$ L(\hat p, x) = \frac{1}{n} \sum_{i=1}^n \log \hat p (x_i) $$

A high value means that the estimator placed high density where the hold-out points actually fall, that is, it located the high-density regions. Because a proper estimate integrates to 1, it cannot place high density everywhere, so it is rewarded only for concentrating mass in the right places. Evaluating on a hold-out set ensures that the score does not reward overfitting. The log likelihood therefore works well in a cross-validation framework for model selection and hyperparameter tuning.
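As an illustration of hold-out scoring used for hyperparameter tuning, the sketch below picks a bandwidth for a simple one-dimensional Gaussian kernel density estimate. Everything in it is an assumption made for the example: the helper names `gaussian_kde` and `mean_log_likelihood`, the simulated standard normal data, the three candidate bandwidths, and the `1e-300` floor that guards against `log(0)` when the estimate underflows.

```python
import math
import random

def gaussian_kde(train, bandwidth):
    """Return a proper 1-D density estimate: an equal-weight mixture of Gaussians."""
    norm = 1.0 / (len(train) * bandwidth * math.sqrt(2.0 * math.pi))

    def p_hat(x):
        return norm * sum(math.exp(-0.5 * ((x - t) / bandwidth) ** 2) for t in train)

    return p_hat

def mean_log_likelihood(p_hat, holdout, floor=1e-300):
    """Mean log estimated density over hold-out points (floored to avoid log(0))."""
    return sum(math.log(max(p_hat(x), floor)) for x in holdout) / len(holdout)

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(400)]
train, holdout = data[:300], data[300:]

# Score an undersmoothed, a moderate, and an oversmoothed bandwidth.
scores = {
    h: mean_log_likelihood(gaussian_kde(train, h), holdout)
    for h in (0.01, 0.3, 5.0)
}
best_bandwidth = max(scores, key=scores.get)
```

The very small bandwidth overfits the training points and is penalized on the hold-out set, while the very large one smears mass over regions with no data. A single train/hold-out split is used here for brevity; averaging the score over several folds would follow the same pattern.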

If the density estimator is not proper, however, the log likelihood is no longer a meaningful metric: the score can be inflated without bound simply by assigning a very high density estimate to every point, so the most improper estimator wins.
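To see why, consider rescaling any estimate $\hat p$ by a constant $c > 1$ (the constant $c$ is introduced here purely for illustration). The rescaled function $c \hat p$ integrates to $c$ rather than 1, and every term of the log likelihood shifts by the same amount:

$$ \log \big( c \, \hat p (x_i) \big) = \log c + \log \hat p (x_i) $$

The score therefore increases with $c$ and can be made arbitrarily large, even though the rescaled estimate carries no more information about where the data lie than $\hat p$ itself.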

## Evaluating improper estimators

Although properness matters for scores that rely on it, such as the log likelihood, it is often of no practical significance in real-world use cases. For example, an [isolation forest](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest) can be read as assigning high values to points of high density (for instance, through its average path length, which is longer for points that are harder to isolate). Although this function is not a proper density, it can be used for anomaly detection or mode detection just like any density estimate, so we'd like a success metric that puts the isolation forest on equal footing with a proper density estimate.

The following metrics are the best we've identified for when the density estimate is not proper:
- The Spearman (rank-order) correlation; a density estimate attains the maximum (1) whenever it ranks the test points in the same order as the true density.
- The Pearson (linear) correlation; a density estimate attains the maximum (1) whenever it is of the form $\hat p(x) = m p(x) + b$ for some positive constant $m$ and scalar $b$.

Both correlations are computed between (a) the density estimate and (b) the \[true\] generative density, each evaluated on the same hold-out test set. Both are therefore applicable only in simulation studies, where the generative density is known.
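The two correlations can be sketched in plain Python as follows. This is an illustrative setup, not a reference implementation: the standard normal generative density, the helper names, and the two improper scores (`affine_score` and `monotone_score`) are all assumptions chosen to show when each correlation reaches 1.

```python
import math
import random

def pearson(a, b):
    """Pearson (linear) correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return result

def spearman(a, b):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def true_density(x):
    """Generative density, known only because the data are simulated."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

rng = random.Random(1)
test_points = [rng.gauss(0.0, 1.0) for _ in range(200)]
p_true = [true_density(x) for x in test_points]

# Two improper "estimates", both perfectly informative about the ordering.
affine_score = [3.0 * p + 2.0 for p in p_true]   # of the form m * p + b
monotone_score = [p ** 3 for p in p_true]        # order-preserving but nonlinear

pearson_affine = pearson(p_true, affine_score)
spearman_monotone = spearman(p_true, monotone_score)
pearson_monotone = pearson(p_true, monotone_score)
```

The affine score attains a Pearson correlation of 1 (up to floating-point error), while the cubed score attains a Spearman correlation of 1 but a Pearson correlation below 1. Which of the two is preferable depends on whether the use case needs only the ordering of points or also the relative magnitudes of the density.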


## Additional metrics

All of the following metrics require knowledge of the generative density and encourage properness of the density estimate:

Mean absolute error (MAE):
- [Deng and Wickham](https://vita.had.co.nz/papers/density-estimation.pdf) used the mean absolute error to compare estimated densities against the density used to simulate the data.

Mean integrated squared error (MISE):
- [Seaman and Powell](https://www.researchgate.net/publication/224817410_An_Evaluation_of_the_Accuracy_of_Kernel_Density_Estimators_for_Home_Range_Analysis) use a variant of the MISE (although the $f(x)$ in their denominator seems to conflict with more modern MISE definitions).
- [Klein and Richardson](http://www.stat.cmu.edu/~lrichard/links/density_trees.pdf) review several theoretical properties of piecewise constant density estimators in terms of the MISE.
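For a simulation study in one dimension, both quantities can be approximated on a grid. The sketch below assumes the common definition of the MISE as the expected integral of $(\hat p(x) - p(x))^2$, estimated by averaging over repeated simulated datasets, and it reads the MAE as the mean absolute difference over grid points; neither is necessarily the exact definition used in the papers cited above. The kernel density estimator, grid, bandwidth, and sample sizes are likewise illustrative assumptions.

```python
import math
import random

def true_density(x):
    """Generative density used to simulate the data (standard normal)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gaussian_kde(train, bandwidth):
    """Simple proper 1-D Gaussian kernel density estimate."""
    norm = 1.0 / (len(train) * bandwidth * math.sqrt(2.0 * math.pi))
    return lambda x: norm * sum(
        math.exp(-0.5 * ((x - t) / bandwidth) ** 2) for t in train
    )

STEP = 0.05
GRID = [-6.0 + STEP * k for k in range(241)]  # evaluation grid on [-6, 6]

def mean_absolute_error(p_hat):
    """Mean absolute difference between estimate and truth over the grid."""
    return sum(abs(p_hat(x) - true_density(x)) for x in GRID) / len(GRID)

def integrated_squared_error(p_hat):
    """Trapezoidal approximation of the integral of (p_hat - p)^2 over the grid."""
    sq = [(p_hat(x) - true_density(x)) ** 2 for x in GRID]
    return STEP * (sum(sq) - 0.5 * (sq[0] + sq[-1]))

def mise(n, bandwidth, rng, repetitions=20):
    """Monte Carlo estimate of the MISE: average ISE over simulated datasets."""
    total = 0.0
    for _ in range(repetitions):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += integrated_squared_error(gaussian_kde(sample, bandwidth))
    return total / repetitions

rng = random.Random(2)
mise_small_sample = mise(50, 0.4, rng)
mise_large_sample = mise(400, 0.4, rng)
mae_example = mean_absolute_error(gaussian_kde([rng.gauss(0.0, 1.0) for _ in range(400)], 0.4))
```

With the bandwidth held fixed, the larger sample should yield a smaller MISE because the variance of the estimate shrinks. Both metrics compare density values directly, so an estimate that is off by a constant factor is penalized, which is the sense in which they encourage properness.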

TODO: We're only scratching the surface of the literature here ...