[ENH] MDNRegressor (Mixture Density Network)#796
fkiraly
left a comment
Thanks! Nice!
- question: why implement a NormalMixture when Mixture is available and should be capable of representing mixture of normals by a composition of Normal and Mixture?
- if we add NormalMixture, it should also be added in the API reference for distribution
My understanding of You could argue that we could call it
Add NormalMixture to distributions list. Also split and rename to silverman and scott, as this is the more Pythonic convention.
Initial testing shows ISJ provides improved convergence and final NLL scores. Bandwidths added as a separate module as this will be very useful for other kernel based estimators.
fkiraly
left a comment
Ah, I see. Yes, that makes sense.
Do you want to open an issue requesting an abstract mixture that can have row-wise different mixture weights? That could be extended from Mixture, if weights are not just a list but a matrix.
I did consider the following:
The downsides were:
Just as a side note: The natural extension to MDNs is probably something Flow/Kernel based, as opposed to extending this with non-Normal dists.
fkiraly
left a comment
Should be fine now, but one request:
the estimator MDNRegressor does not actually depend on pytorch-optimizer, except if the string "SOAP" is passed. But that only does aliasing, and since you are doing soft dependency checking there, there is no intrinsic dependency. Hence I would remove pytorch-optimizer from the dependency set.
I would also suggest replacing the try/except for dependency checking with _check_soft_dependencies (using severity="none").
XGBoostLSS
Neural conditional density estimation
how about "deep learning based regressors" instead?
Thanks for updating. I'll do that optimizer package stuff.
btw, other topic, did we want to have the discussion on the differentiable transformers sometime? Have not seen you in the meetups, and I really think we need a sync meet on that (at least once). If the meetup times are not convenient, feel free to start a scheduling discussion on the discord.
I've added the ngem loss (https://arxiv.org/html/2602.10602v1) to the PR below on my fork. It seems to show significant benefits at lower
It would be best to hold off and merge this all together to avoid minor breaking changes, but I am unclear on how to handle the
Yes, apologies there. I will reach out on Discord regarding the Diff Transform class.
fkiraly
left a comment
Yes, makes sense - adding losses can be done in a separate PR.
#### Reference Issues/PRs
No issue, but referenced [here](#796 (comment)).

#### What does this implement/fix? Explain your changes.
Mixture density networks typically use the negative log-likelihood (nll) objective, which can suffer from slow convergence and mode collapse. The natural gradient expectation maximization (ngem) objective can achieve up to 10x faster convergence while adding almost zero computational overhead, and scales well to high-dimensional data where nll fails.

[Learning Mixture Density via Natural Gradient Expectation Maximization](https://arxiv.org/html/2602.10602v1)

#### What should a reviewer concentrate their feedback on?
~General API structure for adding ngem specific learning rate. ngem typically performs better with lower lr, so adding an automated lr multiplier has been considered. Is this designed correctly?~

I have removed `ngem_lr_scaling` as it obscures the true `lr` and sets a bad standard if we wish to add more objectives that also benefit from different magnitudes of scaling.

#### Did you add any tests for the change?
Existing MDN tests to still pass. Add specific ngem tests also.
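For readers unfamiliar with the nll objective mentioned above, here is a minimal pure-Python sketch of the negative log-likelihood of a univariate normal mixture, computed with the log-sum-exp trick for numerical stability. The function names are illustrative assumptions and are not the PR's implementation, which would be written in PyTorch; the ngem objective is not shown here.

```python
import math


def log_normal_pdf(y: float, mu: float, sigma: float) -> float:
    """Log density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def mixture_nll(y, weights, mus, sigmas) -> float:
    """Negative log-likelihood of one observation under a normal mixture.

    Assumes weights are strictly positive and sum to one.
    """
    log_terms = [
        math.log(w) + log_normal_pdf(y, m, s)
        for w, m, s in zip(weights, mus, sigmas)
    ]
    # log-sum-exp: subtract the largest term before exponentiating
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))


def mean_nll(ys, params) -> float:
    """Average nll over a batch; params[i] = (weights, mus, sigmas) for row i."""
    return sum(mixture_nll(y, *p) for y, p in zip(ys, params)) / len(ys)
```

In an MDN, the weights, means, and standard deviations for each row would be produced by the network from the features, and this quantity would be averaged over the batch and minimised.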
Reference Issues/PRs
No issue opened.
What does this implement/fix? Explain your changes.
A new regressor implementation of Mixture Density Network (MDN) as per Bishop 1994 with noise regularisation as per Rothfuss 2019.
It also includes optional passing of pytorch activation functions and optimizers.
It implements a fully vectorised NormalMixture distribution where each row's weights are individually learnt and applied. It also implements a custom vectorised bisection for fast _ppf method calls.
Does your contribution introduce a new dependency? If yes, which one?
Yes, soft deps: PyTorch and pytorch-optimizer.
What should a reviewer concentrate their feedback on?
NormalMixture distribution. This adds a slightly opinionated design choice to the API.
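To make the row-wise mixture idea concrete, the following is a small pure-Python sketch of a normal mixture CDF and a bisection-based quantile function, applied to rows that each carry their own weights. It is a scalar, non-vectorised illustration under assumed names (`mixture_cdf`, `mixture_ppf`); the PR's NormalMixture is described as vectorised and may differ in bracketing and stopping rules.

```python
import math


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """CDF of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def mixture_cdf(x, weights, mus, sigmas) -> float:
    """CDF of a normal mixture: the weighted sum of the component CDFs."""
    return sum(w * normal_cdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))


def mixture_ppf(p, weights, mus, sigmas, tol=1e-10, max_iter=200) -> float:
    """Quantile of a normal mixture via bisection on the monotone CDF."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    # Start from a bracket around the components and widen it until it
    # contains the target probability.
    lo = min(mus) - max(sigmas)
    hi = max(mus) + max(sigmas)
    step = max(sigmas)
    while mixture_cdf(lo, weights, mus, sigmas) > p:
        lo -= step
        step *= 2.0
    step = max(sigmas)
    while mixture_cdf(hi, weights, mus, sigmas) < p:
        hi += step
        step *= 2.0
    # Halve the bracket until it is narrower than the tolerance.
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, weights, mus, sigmas) < p:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Each row has its own weights, means, and standard deviations.
rows = [
    ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]),
    ([0.9, 0.1], [0.0, 5.0], [1.0, 2.0]),
]
medians = [mixture_ppf(0.5, *row) for row in rows]
```

A vectorised version would keep one bracket per row in an array and update all brackets in each bisection step, which is presumably what makes the batched quantile calls fast.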
Did you add any tests for the change?
Yes. Standard library param tests for dist and est. Additional test for est coming.
Any other comments?
PR checklist
For all contributions
How to: add yourself to the all-contributors file in the skpro root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR). maintenance - CI, test framework, release. See here for full badge reference
For new estimators
docs/source/api_reference/taskname.rst, follow the pattern.
Examples section.
python_dependencies tag and ensured dependency isolation, see the estimator dependencies guide.