[ENH] _pmf_support method for BaseDistribution returning inspectable mass support#711
Conversation
- Modified _plot_single() in BaseDistribution to handle discrete distributions - For discrete PMF plots, extract support from scipy distribution and evaluate at integer points - Use stem plots instead of line plots for discrete PMF visualization - Handle infinite support bounds (e.g., Poisson) with reasonable limits - Maintain backward compatibility for continuous distributions This resolves the issue where Binomial and other discrete distribution PMFs were plotted as continuous curves instead of proper discrete mass functions.
- Added test_discrete_pmf_plotting() to verify that discrete distributions use stem plots for PMF visualization - Ensures the fix for issue sktime#416 doesn't regress
- Replace bare except Exception with specific exception types - Shorten long docstring line to comply with flake8 E501 - Maintain black formatting
fkiraly
left a comment
There was a problem hiding this comment.
Really nice, good idea!
Can we try to avoid the try/except though?
fkiraly
left a comment
There was a problem hiding this comment.
Thanks!
I think this is opening a can of worms though... the plotting assumes that the support is at integers, right? Which in general need not be the case, e.g., for Empirical.
To deal with this programmatically, we would need to be able to inspect each discrete - and also, potentially, mixed - distribution for the point masses.
Currently, this is not part of the API - see #244 for the issue to add this.
For now, I do think this PR is an improvement above the status quo, but I wonder whether we can do something quick and intermediate? E.g., a private function _support(lower, upper) that generates all discrete points between lower and upper?
- Add _support(lower, upper, max_points) method to BaseDistribution that returns support points within bounds - Default implementation assumes integer support (for distributions like Binomial) - Override _support in Empirical distribution to return actual empirical support points - Update _plot_single to use _support method for discrete PMF plotting instead of hardcoded integer range - This allows distributions with non-integer support (like Empirical) to be plotted correctly
- Override _support method in Delta distribution to return the actual support point(s) c - Handles both scalar and array cases, filtering points within the given bounds - Ensures Delta distributions with non-integer support points are plotted correctly
- Remove unnecessary blank lines in Delta and Empirical _support methods - Maintain consistent code formatting
fkiraly
left a comment
There was a problem hiding this comment.
Thanks for the additional thoughts about _support, I think this is great.
May I suggest to split this PR into two, one with _support and specific tests for it, and one about plotting, which stacks on it? I anticipate some discussion about _support.
Main questions about this PR as-is:
- how do we handle the issue that each entry can have different support, in a 2D shaped distribution, e.g., for
EmpiricalorDelta? This feels a bit tricky. - why are you deleting some of the existing tests?
Here - Q1: 2D Support Handling: The _support method is called on scalar distributions (after [iloc[i,j] extracts each entry), not on the full 2D distribution. This means each entry gets its own support calculation. In new pr (title - Incorrect pmf plotting) - Q2: Test Changes: The original test only checked that a stem plot was created, but didn't verify it contained actual data points (Added it back). The updated version checked for multiple support points to ensure the plotting worked correctly. |
17010db to
77d7ac3
Compare
…ibution support detection - Add _support(lower, upper, max_points) method to BaseDistribution - Returns empty array for continuous distributions - Returns non-negative integers for discrete distributions by default - Override in Empirical to return actual empirical support points within bounds - Override in Delta to return point mass locations within bounds - Add comprehensive test for _support method across distribution types
Ah, I see! So it can be called only on 0D-array (i.e., scalar) distributions, right? Although where all entries share the same support, it is a bit unfortunate that it cannot be called from the 2D one. Not sure how to solve this. Perhaps distinguish potential (class level) and actual (instance & entry level)? Anyway, that might be best discussed in a design issue and is not blocking for this PR. |
fkiraly
left a comment
There was a problem hiding this comment.
Great!
Could you kindly expand in the docstring that the _support (a) applies to pmf only, and (b) scalar distributions only?
fkiraly
left a comment
There was a problem hiding this comment.
Great, thanks!
Small change - to make the purpose clearer, could we rename to _pmf_support?
_pmf_support method for BaseDistribution returning inspectable mass support
fkiraly
left a comment
There was a problem hiding this comment.
Thanks!
I noticed that some implementations still deal with the 2D case - can you explain why this is there, if - according to your previous statements - the _pmf_support method is only for the 0D case.
|
can you explain why the 2D code was there and how you want to resolve it? |
The 2D logic in _pmf_support handled multi-dimensional distributions by aggregating support points across all elements. This contradicted the base class, which defines _pmf_support as 0D-only. I enforced this by raising NotImplementedError for ndim > 0 and removed redundant 2D handling. The code is now aligned with the docs, simpler, and all tests pass. |
Reference Issues/PRs
Related to #416 (discrete PMF plotting improvements). This PR provides the foundational infrastructure that enables more robust discrete distribution plotting in follow-up PRs.
What does this implement/fix? Explain your changes.
This PR introduces a new
_pmf_support(lower, upper, max_points)method to theBaseDistributionclass that provides a standardized, extensible way to obtain support points for discrete distributions within specified bounds.Key Changes:
skpro/distributions/base/_base.py): Added_pmf_supportmethod that returns empty array for continuous distributions and non-negative integers for discrete distributions by defaultskpro/distributions/empirical.py): Custom implementation that returns actual empirical support points within boundsskpro/distributions/delta.py): Implementation that returns point mass locations within boundsskpro/distributions/tests/test_proba_basic.py): Addedtest_pmf_support_method()covering all distribution typesThis addresses the need for extensible support point detection across different discrete distribution types, replacing hardcoded assumptions with a clean, overridable API.
Does your contribution introduce a new dependency? If yes, which one?
No, this change introduces no new dependencies.
What should a reviewer concentrate their feedback on?
Did you add any tests for the change?
Yes, I added comprehensive tests in
test_support_method()that cover:Any other comments?
This PR establishes the core infrastructure for proper discrete distribution support handling. It serves as a foundation for improved plotting capabilities (as demonstrated in related PRs) and future enhancements like full support API implementation (issue #244). The design prioritizes extensibility while maintaining backward compatibility.
PR checklist
For all contributions
How to: add yourself to the all-contributors file in the
skproroot directory (not theCONTRIBUTORS.md). Common badges:code- fixing a bug, or adding code logic.doc- writing or improving documentation or docstrings.bug- reporting or diagnosing a bug (get this pluscodeif you also fixed the bug in the PR).maintenance- CI, test framework, release.See here for full badge reference
For new estimators
docs/source/api_reference/taskname.rst, follow the pattern.Examplessection.python_dependenciestag and ensured dependency isolation, see the estimator dependencies guide.