Simplify KL divergence to (reduced) cross-entropy #369
This PR simplifies the actual Kullback-Leibler (KL) divergence calculated by the former `family$kl()` functions to the corresponding cross-entropy (now calculated by `family$ce()` functions). That is, the reference model's negative entropy is dropped (when regarding the KL divergence as the sum of the reference model's negative entropy and the cross-entropy of the submodel with respect to the reference model) or, equivalently, the reference model's entropy is dropped (when regarding the KL divergence as the cross-entropy of the submodel with respect to the reference model minus the reference model's entropy). Furthermore, for some families, the actual cross-entropy is reduced further to only those terms which would not cancel out when calculating the KL divergence. In the case of the Gaussian family, that reduced cross-entropy is modified further, yielding merely a proxy.
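For reference, the standard decomposition this relies on (textbook identity, not projpred-specific code) is

$$
\mathrm{KL}(p_{\mathrm{ref}} \,\|\, q_{\mathrm{sub}})
= \underbrace{-\,\mathbb{E}_{p_{\mathrm{ref}}}\!\left[\log q_{\mathrm{sub}}\right]}_{\mathrm{CE}(p_{\mathrm{ref}},\, q_{\mathrm{sub}})}
\;-\; \underbrace{\left(-\,\mathbb{E}_{p_{\mathrm{ref}}}\!\left[\log p_{\mathrm{ref}}\right]\right)}_{\mathrm{H}(p_{\mathrm{ref}})} ,
$$

so dropping the reference model's entropy $\mathrm{H}(p_{\mathrm{ref}})$ (equivalently, its negative entropy, depending on which way the decomposition is written) leaves only the cross-entropy $\mathrm{CE}(p_{\mathrm{ref}}, q_{\mathrm{sub}})$.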
The reason for all this is consistency in the case of custom reference models: previously, for the actual KL divergence, projpred assumed that the reference model was of the same family as the submodel. Typically, this is the case, but in general, custom reference models do not need to satisfy this assumption. Omitting the reference model's (negative) entropy from the actual KL divergence is not a problem because the actual KL divergence (output element `kl` of `.init_submodel()`, now called element `ce`) was only used in `search_forward()`, where it was minimized over all submodels of a given model size. Since the (negative) entropy of the reference model is a constant there, this PR is able to drop it without affecting the minimization (see the sketch below). In fact, the actual KL divergence was also passed forward to `varsel()`'s and `cv_varsel()`'s output, but there it didn't seem to be used apart from unit tests (which are adapted by this PR as necessary).
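A minimal, self-contained R sketch (toy discrete distributions, not projpred's internal code) illustrating why dropping the constant reference entropy does not change which candidate minimizes the criterion:

```r
# Toy illustration: for a fixed reference distribution p_ref, minimizing
# KL(p_ref || q) over candidate submodels q is equivalent to minimizing
# the cross-entropy CE(p_ref, q), because H(p_ref) is a constant.
p_ref <- c(0.2, 0.5, 0.3)

# Three hypothetical candidate submodel distributions:
q_candidates <- list(
  q1 = c(0.10, 0.60, 0.30),
  q2 = c(0.25, 0.45, 0.30),
  q3 = c(1, 1, 1) / 3
)

cross_entropy <- function(p, q) -sum(p * log(q))
kl_divergence <- function(p, q) sum(p * log(p / q))

ce_vals <- sapply(q_candidates, cross_entropy, p = p_ref)
kl_vals <- sapply(q_candidates, kl_divergence, p = p_ref)

# Both criteria select the same candidate; they differ only by the
# constant entropy of p_ref:
which.min(ce_vals) == which.min(kl_vals)                    # TRUE
all.equal(kl_vals, ce_vals - (-sum(p_ref * log(p_ref))))    # TRUE
```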