[BUG] Fix MultiIndex quantile bug in BaseDistribution.loc indexing#697
Conversation
- Add _is_index_like() helper function to check for Index objects - Update _get_indexer_like_pandas to handle pd.Index inputs early - Fix _Indexer.__getitem__ to skip MultiIndex special case when elements are Index-like - Add test for MultiIndex loc indexing with Index objects Fixes sktime#678
Ah, did not see this one! Given mine has some errors in tests, if this is a go I can close my PR after merging this one. |
The When I ran the full test suite after implementing the MultiIndex fix, test_deps_info was failing because numba is installed but has an import-time error (incompatible with numpy 2.4). The existing The fix makes |
I see - good spot! Could you kindly move this to a different pull request? Mid-term, I think this should be deduplciated and use the |
@fkiraly I have compiled with your thoughts and will open a pr for |
|
@arnavk23 - Had a look at the PR and LGTM. Edit: I also tried installing |
fkiraly
left a comment
There was a problem hiding this comment.
Yes, looks good!
Although the if/else looks highly redundant, can you please check that?
…ktime#697) Fixes sktime#678 - Add `_is_index_like()` helper function to check for `Index` objects - Update `_get_indexer_like_pandas` to handle `pd.Index` inputs early - Fix `_Indexer.__getitem__` to skip `MultiIndex` special case when elements are Index-like - Add test for `MultiIndex` loc indexing with Index objects
#704) #### Reference Issues/PRs Addresses performance issues with EnbpiRegressor found when used with [Monte Carlo recursive probabilistic forecaster](sktime/sktime#9242). This optimization enables practical usage of conformal prediction intervals in large-scale MC forecasting scenarios. #### What does this implement/fix? Explain your changes. This PR updates ~~introduces `FastEnbpiRegressor`, a drop-in replacement for~~ `EnbpiRegressor` that provides ~10-100x speedup through two optimizations ~~1. Vectorized Aggregation in `_predict_proba()`: Replaces the O(n_train × n_test) nested loop with chunked vectorized numpy operations that process predictions in batches. This avoids the quadratic complexity of the original implementation.~~ ~~2. Optimized `_FastEmpiricalEnbpi` Distribution Class: A specialized `Empirical` subclass that:~~ - Skips expensive `_init_sorted()` initialization (~10x faster) - Implements custom `_sample()` using direct numpy indexing instead of pandas MultiIndex operations (~1000x faster sampling) - Stores raw prediction and error arrays for efficient resampling Edit: optimizations mentioned above (strikethrough) still apply but are now used on `EnbpiRegressor` and `Empirical` directly. **Performance Metrics:** - Single batch prediction: 1.2-1.5x faster - Sampling 100 from distribution: 2500x faster - Small MC workflow (10 steps, 100 samples each): 22x faster (9.9s → 0.45s) - Realistic-scale MC forecasting: ~100x faster (10+ minutes → 5 seconds) #### Does your contribution introduce a new dependency? No. The implementation uses only existing dependencies: NumPy, Pandas, and skpro's existing classes. #### What should a reviewer concentrate their feedback on? - ~~Correctness of numerical results: Verify that `_FastEmpiricalEnbpi` produces statistically equivalent samples and statistics as standard `Empirical`~~ - Appropriateness of skipping `_init_sorted()`: Confirm that quantiles computed on-demand from raw arrays are accurate - Custom `_sample()` implementation: Review the NumPy indexing logic for correctness across different batch and sample sizes - ~~Generalizability: he optimizations in the `_FastEmpiricalEnbpi` class could potentially benefit other estimators creating large empirical distributions (see "Any other comments" section)~~ #### Did you add any tests for the change? No, but I should. Tested with throw-away scripts for this draft. If ok-ed will produce tests. #### Any other comments? ~~The optimizations in `_FastEmpiricalEnbpi` are not EnbPI-specific. The performance findings suggest that:~~ - ~~Skipping `_init_sorted()` benefits any large Empirical distribution~~ - ~~Custom sampling using raw arrays is a general pattern applicable to distributions where parameters can be efficiently stored~~ ~~A follow-up enhancement (separate PR) could add a `skip_init_sorted` parameter to the base `Empirical` class or create a general `FastEmpirical` class in the distributions module to benefit other estimators.~~ ~~Note: the base branch here includes my fix for `quantile()` in the base distribution as I was unable to test without it. When #697 is merged, I will discard those changes and use main as the base branch.~~ Edit: base branch fixed and optimizations are now applied to original classes.
Reference Issues/PRs
Fixes #678
What does this implement/fix? Explain your changes.
_is_index_like()helper function to check forIndexobjects_get_indexer_like_pandasto handlepd.Indexinputs early_Indexer.__getitem__to skipMultiIndexspecial case when elementsare Index-like
MultiIndexloc indexing with Index objects