
[ENH] interface to ngboost #215

Merged
merged 16 commits into from
May 2, 2024

Conversation

ShreeshaM07
Contributor

@ShreeshaM07 ShreeshaM07 commented Mar 19, 2024

Reference Issues/PRs

Resolves #135

What does this implement/fix? Explain your changes.

This PR adds a way to use the ngboost regressor in skpro.
I tested it using the following piece of code:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# NGBoostRegressor is the estimator added in this PR
# (imported from the module introduced by this PR)

# Load Boston housing dataset
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
Y = raw_df.values[1::2, 2]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

ngb = NGBoostRegressor()._fit(X_train, Y_train)
Y_preds = ngb._predict(X_test)
Y_dists = ngb._pred_dist(X_test)

# test Mean Squared Error
test_MSE = mean_squared_error(Y_preds, Y_test)
print('Test MSE', test_MSE)

# test Negative Log Likelihood
test_NLL = -Y_dists.logpdf(Y_test).mean()
print('Test NLL', test_NLL)

Does your contribution introduce a new dependency? If yes, which one?

NGBRegressor has been added from the ngboost package.

What should a reviewer concentrate their feedback on?

What modifications still need to be done; also, some help on how to add the get_test_params function and on which parameters to send via FITTED_PARAMS_TO_FORWARD.

Did you add any tests for the change?

No

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors with any new badges I've earned :-)
    How to: add yourself to the all-contributors file in the skpro root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic; doc - writing or improving documentation or docstrings; bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR); maintenance - CI, test framework, release.
    See here for full badge reference
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
For new estimators
  • I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
  • I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
  • If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured
    dependency isolation, see the estimator dependencies guide.

@ShreeshaM07 ShreeshaM07 marked this pull request as draft March 19, 2024 10:53
@ShreeshaM07
Contributor Author

I have implemented _predict_proba, converting the return type of ngboost's normal distribution to the skpro Normal distribution. It must now be extended to the other distributions that ngboost supports and that skpro also supports. I will also work on implementing predict_interval and predict_quantile next. In the meantime, I will try to factor the return-type conversion into a function.

The current output upon calling

Y_pred_proba = ngb._predict_proba(X_test)
print(Y_pred_proba)

is:

[iter 0] loss=3.6479 val_loss=0.0000 scale=1.0000 norm=6.7178
[iter 100] loss=2.6977 val_loss=0.0000 scale=2.0000 norm=4.9421
[iter 200] loss=2.1456 val_loss=0.0000 scale=2.0000 norm=3.3556
[iter 300] loss=1.8761 val_loss=0.0000 scale=1.0000 norm=1.4400
[iter 400] loss=1.7477 val_loss=0.0000 scale=1.0000 norm=1.3337
<ngboost.distns.normal.Normal object at 0x736de3d40f50>
Normal(columns=RangeIndex(start=0, stop=1, step=1),
       index=RangeIndex(start=0, stop=102, step=1),
       mu=             0
0    16.962767
1    21.572000
2    16.014035
3    30.217625
4    19.818455
..         ...
97   11.209679
98   25.100160
99    7.723768
100  19.720101
101  24.292480

[102 rows x 1 columns],
       sigma=            0
0    1.206229
1    1.079737
2    1.241355
3    1.223454
4    1.187693
..        ...
97   1.292326
98   1.216184
99   1.304442
100  1.106799
101  1.176687

[102 rows x 1 columns])
Test MSE 16.048500328511995
Test NLL 4.412747994731287

Please let me know if it looks good or something needs to be changed.

Collaborator

@fkiraly fkiraly left a comment

Thanks a lot!

Some comments:

  • please look at the extension template, extension_templates/regression.py - you should not override predict_interval etc, there is no need to copy-paste everything from the base class.
  • please implement _predict_proba only, this should start with an underscore.
  • in it, it is sufficient to start at the line # Convert NGBoost distribution to skpro BaseDistribution and supply the code below. That is all you need to do. No need to copy-paste anything from the base class.
    • of course, for all distributions that ngboost could return. I.e., you need to check which distribution is being returned, and convert to the same skpro distribution. All ngboost distributions should have a counterpart in skpro.

@fkiraly fkiraly added enhancement module:regression probabilistic regression module interfacing algorithms Interfacing existing algorithms/estimators from third party packages labels Apr 22, 2024
@fkiraly
Copy link
Collaborator

fkiraly commented Apr 22, 2024

Also, you should add ngboost as a soft dependency to pyproject.toml, and via the python_dependencies tag to the estimator.

import numpy as np
import pandas as pd
from ngboost import NGBRegressor
from ngboost.distns import Normal
Collaborator

imports should happen only in the estimator


def __init__(
self,
Dist=Normal,
Collaborator

These need to change to strings.

variables should also be all lower_snake_case.
The "base learner" I would init with None, and internally replace with the sklearn default learner.

Contributor Author

Could you please guide me as to which sklearn default learner I should use instead of ngboost's default_tree_learner.

Collaborator

that's an sklearn learner - I would advise to hard-code it directly in __init__, i.e., if estimator is None: self.estimator_ = etc, with the precise learner spec found here: https://github.com/stanfordmlgroup/ngboost/blob/7da1ca94a8857b865cf39fca71b2564bad0c3e15/ngboost/learners.py

(I would also advise to rename the arg base to estimator, to be in line with sklearn here)
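A minimal sketch of the None-default pattern described above; the DecisionTreeRegressor hyperparameters are an assumption mirroring ngboost's default_tree_learner, and the helper name is hypothetical:

```python
from sklearn.tree import DecisionTreeRegressor


def resolve_base_learner(estimator=None):
    """Return the user-supplied base learner, or a hard-coded default.

    The default spec is assumed from ngboost/learners.py
    (default_tree_learner).
    """
    if estimator is None:
        # assumed ngboost default: depth-3 tree with friedman_mse criterion
        return DecisionTreeRegressor(criterion="friedman_mse", max_depth=3)
    return estimator
```

In the estimator itself, the resolved learner would be written to self.estimator_ rather than back to self.estimator, so the constructor argument stays untouched.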

pyproject.toml Outdated
@@ -47,6 +47,7 @@ dependencies = [
"scikit-base>=0.6.1,<0.8.0",
"scikit-learn>=0.24.0,<1.5.0",
"scipy<2.0.0,>=1.2.0",
"ngboost",
Collaborator

@fkiraly fkiraly Apr 23, 2024

not to the core dependencies please, to the optional dependencies.

please also include an upper bound, this should be current MINOR plus one
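As a sketch, the hunk could move into the optional dependencies instead; the extras group name and version bounds below are placeholders, not verified against skpro's actual pyproject.toml:

```toml
[project.optional-dependencies]
all_extras = [
    "ngboost>=0.4,<0.6",
]
```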

An NGBRegressor object that can be fit.
"""

_tags = {
Collaborator

this should include other tags as well: "authors" (you) and "maintainers"

from ngboost.distns import Laplace, LogNormal, Normal, Poisson, T

dist_ngboost = Normal
if self.dist == "Normal":
Collaborator

yes, very nice, that's what I meant.

I would recommend to use a dict here, that makes the logic easier to maintain and extend.
Let me know if you want pointers on how to concretely use a dict.
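A sketch of the dict-based dispatch; the string keys follow the parameter values visible in this thread ("Normal", "TDistribution", ...), while the helper name is hypothetical:

```python
# map the estimator's dist string to the ngboost distribution name;
# a parallel dict can map the same keys to skpro distribution classes
_NGB_DIST_NAMES = {
    "Normal": "Normal",
    "Laplace": "Laplace",
    "LogNormal": "LogNormal",
    "Poisson": "Poisson",
    "TDistribution": "T",
}


def lookup_ngb_dist_name(dist):
    """Resolve the user-facing distribution string to ngboost's name."""
    if dist not in _NGB_DIST_NAMES:
        raise ValueError(f"unsupported distribution: {dist!r}")
    return _NGB_DIST_NAMES[dist]
```

Adding a distribution then means adding one dict entry instead of another if/elif branch.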

Converted skpro distribution object.
"""
skpro_dist = None
if self.dist == "Normal":
Collaborator

as above, using a dict would help.

X = self._check_X(X)

# Convert NGBoost distribution to skpro BaseDistribution
pred_mean = self._predict(X=X)
Collaborator

The general idea makes sense, though I would do it differently:

  1. just call pred_dist from ngboost. This is a full parametric distribution.
  2. in the _ngb_dist_to_skpro, extract the parameters from the ngboost distribution and plug them into the skpro distribution.

No need to take detours via the mean or _predict, especially since some of the distributions are not directly parameterized in terms of the mean.
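The parameter plumbing can then be a simple key rename per distribution. The ngboost-side key names below are assumptions; the skpro side taking mu/sigma for Normal matches the output shown earlier in this thread:

```python
def ngb_params_to_skpro_kwargs(ngb_params, key_map):
    """Rename ngboost parameter keys to skpro constructor keys.

    key_map is a per-distribution translation, e.g. (assumed)
    {"loc": "mu", "scale": "sigma"} for the Normal distribution.
    """
    return {skp: ngb_params[ngb] for ngb, skp in key_map.items()}
```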

Contributor Author

I have now taken care of what ngboost actually returns and converted it to the corresponding value in skpro. Comments explain what each distribution is and what its parameters indicate.

self._is_fitted = True
return self

def _predict(self, X):
Collaborator

I would recommend, remove _predict and rely on the default behaviour entirely.

Contributor Author

I think it would be useful to keep _predict for the case where one only wants point predictions. I no longer use it in _predict_proba. Please do let me know if you think otherwise.

Collaborator

hm, I suppose it makes sense - not a blocker from my side.

early_stopping_rounds=self.early_stopping_rounds,
)
self.ngb.fit(X, y)
self._is_fitted = True
Collaborator

@fkiraly fkiraly Apr 23, 2024

no need to set _is_fitted to True, the boilerplate layer in fit already does this.

Collaborator

@fkiraly fkiraly left a comment

Great progress!

I left some comments above for the next iteration.

@ShreeshaM07
Contributor Author

@fkiraly could you please review the updates I have made to the code, and let me know if anything else needs to be modified.

return [params1, params2]


# #Load Boston housing dataset
Collaborator

kindly remove the commented code - you can also use check_estimator for testing

`create_test_instance` uses the first (or only) dictionary in `params`
"""
params1 = {"dist": "Normal"}
params2 = {
Collaborator

can you add more settings to cover all parameters?

It should also cover the default case, empty param set.

Collaborator

@ShreeshaM07, while we are checking, could you kindly move the empty param set to the first position? Not strictly needed, but easier to read.
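Following that suggestion, get_test_params could put the empty set first; a sketch with illustrative (assumed) non-default values:

```python
def get_test_params(parameter_set="default"):
    """Sketch of a test-parameter list with the default case first."""
    params1 = {}  # empty set first: exercises all defaults
    params2 = {"dist": "Normal"}
    params3 = {"dist": "Laplace", "natural_gradient": True}  # assumed values
    return [params1, params2, params3]
```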

self.ngb = NGBRegressor(
Dist=dist_ngboost,
Score=score,
Base=self.estimator,
Collaborator

@fkiraly fkiraly Apr 29, 2024

the estimator needs to be cloned - otherwise it will be mutated, and you violate the sklearn extension contract (as the failing test tells you)
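One way to satisfy that contract is sklearn.base.clone, so the stored hyperparameter object is never mutated by fitting; the helper name is hypothetical:

```python
from sklearn.base import clone
from sklearn.tree import DecisionTreeRegressor


def base_for_ngboost(estimator):
    """Clone the configured base learner before handing it to NGBRegressor,
    leaving the stored self.estimator untouched."""
    return clone(estimator)


# usage sketch: same hyperparameters, distinct object
template = DecisionTreeRegressor(max_depth=3)
fitted_copy = base_for_ngboost(template)
```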

Collaborator

@fkiraly fkiraly May 1, 2024

@ShreeshaM07, I think one of the failures might relate to this comment which you have not addressed?

Contributor Author

I have addressed the clone issue; the failure is not related to that. It is in the handling of the estimator, which is DecisionTreeRegressor although __init__ received None.

@fkiraly fkiraly changed the title [ENH] Adding an interface of ngboost to skpro [ENH] interface to ngboost Apr 29, 2024
"LogNormal": LogNormal,
}

kwargs["index"] = kwargs["mu"].index
Collaborator

I think this should be from X.index (from predict_proba), and y.index (from fit), not the index that ngboost returns.

Contributor Author

Yes, that makes sense. I have now used X.index in place of kwargs["mu"].index; however, I am still unable to remove

                kwargs[skp_param] = self._check_y(y=kwargs[skp_param])
                # returns a tuple so taking only first index of the tuple
                kwargs[skp_param] = kwargs[skp_param][0]

as removing it causes a shape mismatch error:
ValueError: shape mismatch: objects cannot be broadcast to a single shape. Mismatch is between arg 0 with shape (111,) and arg 3 with shape (10,).
Here 111 is the expected index length, while arg 3 carries the number of columns, which seems to be 10.

Collaborator

do you have an explanation why this happens?

@ShreeshaM07
Contributor Author

Nice, thanks! Really great addition!

I've started the CI tests to see whether they pass - I assume you have checked with check_estimator already.

Could you please help me with the check_estimator. Upon running

from sklearn.utils.estimator_checks import check_estimator
check_estimator(NGBoostRegressor())

It is throwing a TypeError: input must be a pd.DataFrame.

@fkiraly
Collaborator

fkiraly commented Apr 29, 2024

You need to use the check_estimator from skpro.utils, not sklearn.utils


Parameters
----------
dist : string , default = "Normal"
Collaborator

thanks for updating the docstring. Please check how the docs render in the readthedocs CI job.

Indentation matters! And this will render funnily, so kindly look at the other regressors and indent similarly.

You can check the result, how it looks, in the readthedocs CI job (click "details").


class NGBoostRegressor(BaseProbaRegressor):
"""
Constructor for all NGBoost models.
Collaborator

Here you probably want to describe the regressor instead, i.e., it is an interface to NGBRegressor.

@ShreeshaM07
Contributor Author

I have resolved the shape mismatch error using ravel(), which check_estimator suggested applying to y. I have also changed kwargs["columns"] to be y.columns now.

But some tests are failing in check_estimator, although it runs without raising errors:

FAILED: test_constructor[NGBoostRegressor]
FAILED: test_set_params_sklearn[NGBoostRegressor]
FAILED: test_fit_does_not_overwrite_hyper_params[NGBoostRegressor-ProbaRegressorBasic]
FAILED: test_fit_does_not_overwrite_hyper_params[NGBoostRegressor-ProbaRegressorXcolMixIxYnp]
FAILED: test_fit_does_not_overwrite_hyper_params[NGBoostRegressor-ProbaRegressorSurvival]
FAILED: test_input_output_contract[NGBoostRegressor]

Could you please help out as to why these are failing.

@fkiraly
Collaborator

fkiraly commented May 1, 2024

Could you please help out as to why these are failing.

Of course, will look at it later today.

Can you also describe what you have found out in debugging these failures? You can use raise_exceptions=True to get an error raise for debugging.

@ShreeshaM07
Contributor Author

Can you also describe what you have found out in debugging these failures? You can use raise_exceptions=True to get an error raise for debugging.

I have now resolved all the errors except test_fit_does_not_overwrite_hyper_params.
I have found it is caused by initialising the base learner/estimator to None in __init__ and then replacing it with DecisionTreeRegressor later on when estimator is left as the default None.

Can you please suggest what I can do to solve this issue. I tried moving the DecisionTreeRegressor initialisation, currently done in _fit, into __init__ itself when the argument is None, but that gives test_constructor and test_set_params_sklearn assertion errors.

@fkiraly
Collaborator

fkiraly commented May 1, 2024

See above - look at the error messages, they should tell you precisely what the problem is.
The parameters written to self must always remain unmutated and identical to what was passed in __init__.
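The contract can be illustrated with a minimal stub (names hypothetical): __init__ stores the argument unchanged, and fit writes any resolved default to a trailing-underscore attribute:

```python
class EstimatorStub:
    """Minimal illustration of the sklearn/skpro parameter contract."""

    def __init__(self, estimator=None):
        self.estimator = estimator  # stored exactly as passed, never mutated

    def fit(self):
        # the resolved default lives in a fitted attribute, not the parameter
        self.estimator_ = "default" if self.estimator is None else self.estimator
        return self


est = EstimatorStub().fit()
```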

@ShreeshaM07
Contributor Author

ShreeshaM07 commented May 1, 2024

See above - look at the error messages, they should tell you precisely what the problem is. The parameters written to self must always remain unmutated and identical to what was passed in __init__.

Yes, I have now made it so that self.estimator_ is passed to NGBRegressor, which keeps the __init__ params intact, and check_estimator reports all tests passed!

Could you please run the CI jobs again.

@ShreeshaM07 ShreeshaM07 marked this pull request as ready for review May 1, 2024 08:43
@fkiraly
Collaborator

fkiraly commented May 1, 2024

Great! Looks excellent, I will give it a review once the tests pass. From superficial inspection, looks ready to merge!

@fkiraly
Collaborator

fkiraly commented May 1, 2024

Hm, it looks like check_estimator tests are indeed passing! Very nice!

Well, with the exception of the sixth (index 5) parameter set - do you have an idea why that parameter set specifically is failing? It seems NaNs are being produced.


params6 = {
"dist": "TDistribution",
"natural_gradient": False,
Collaborator

this one seems to be the failing example. Any ideas?
I cannot spot any issue. Is this perhaps a bug in ngboost itself?

Collaborator

@fkiraly fkiraly May 1, 2024

t-distribution does have a non-degenerate natural gradient afaik, so that might not be it? But perhaps it's worth trying a different combination here, e.g., normal with natural gradient, and t-distribution plain.
But perhaps it's worth trying a different combination here, e.g., normal with natural gradient, and tdistribution plain.

Contributor Author

@ShreeshaM07 ShreeshaM07 May 1, 2024

I tried running ngboost's own regressor, and it still gave a Singular Matrix error for the load_diabetes dataset with the T-distribution. All other distributions work as expected. In case this was actually due to some mathematical issue, with the determinant of a matrix being 0 for the T-distribution on that data, I also tried the load_breast_cancer dataset; even there it gives a Singular Matrix error. So it may actually be the case that there is a bug in ngboost's T-distribution itself.

t-distribution does have non-degenerate natrual gradient afaik, so that might not be it? But perhaps it's worth trying a different combination here, e.g., normal with natural gradient, and tdistribution plain.

The error is not related to natural_gradient; all other distributions work when it is False.

Collaborator

@fkiraly fkiraly May 1, 2024

Great investigation!

What we could do: remove the parameter set in this PR, but file a bug report on skpro, with code that uses these parameters and the diabetes dataset. In the issue, we then try to remove the skpro interface and use pure ngboost code; that allows us to check whether it's something in the skpro API, or genuinely ngboost. If the issue remains, we can report it on the ngboost repo as a bug, with the pure ngboost code.

Collaborator

@fkiraly fkiraly left a comment

Nice, thanks!

This is ready to merge, could you open the issue on the skpro tracker to investigate the TDistribution failure? With code if you have it, otherwise just a quick note.

@ShreeshaM07
Contributor Author

Nice, thanks!

This is ready to merge, could you open the issue on the skpro tracker to investigate the TDistribution failure? With code if you have it, otherwise just a quick note.

I have now opened #291 along with the code causing the issue.

@fkiraly fkiraly merged commit 4441e55 into sktime:main May 2, 2024
36 checks passed
@ShreeshaM07 ShreeshaM07 deleted the dev branch May 6, 2024 06:24
fkiraly pushed a commit that referenced this pull request May 10, 2024
See #301 , #215 

This PR ensures `NGBoostRegressor` inherits the common conversions from
the `ngboost` distribution to `skpro` distribution from the
`NGBoostAdapter` class. This adapter is common to both the
`NGBoostSurvival` and `NGBoostRegressor`.
Labels
enhancement interfacing algorithms Interfacing existing algorithms/estimators from third party packages module:regression probabilistic regression module
Development

Successfully merging this pull request may close these issues.

[ENH] interface probabilistic regressors from ngboost package
2 participants