
[REVIEW] RF: Add Gamma and Inverse Gaussian loss criteria #4216

Merged · 42 commits · Oct 12, 2021

Conversation

venkywonka
Contributor

@venkywonka venkywonka commented Sep 21, 2021

This PR adds the Gamma and Inverse Gaussian criteria for training decision trees, along with modifications to the RF unit tests.


checklist:

  • Add Gamma and Inverse Gaussian Objective classes
  • Add C++ tests for above
  • Add remaining C++ tests for other objective functions: entropy and mean squared error
  • Add Python-level convergence tests for gamma and inverse gaussian (just like the one added for Poisson loss in [REVIEW] RF: Add Poisson deviance impurity criterion #4156)
  • Check for regressions by benchmarking on gbm-bench
  • Convergence plots showing that a model trained with a particular criterion performs better on its own loss metric than a baseline (mse)
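
For reference, here is a rough Python sketch of the per-sample deviances these criteria aim to minimize; it follows the power-2 and power-3 Tweedie deviance definitions used by sklearn's mean_tweedie_deviance and is only an illustration of the math, not the CUDA/C++ objective code added in this PR (function names are illustrative):

import numpy as np

def gamma_half_deviance(y, mu):
    # Tweedie power = 2: D(y, mu) = 2 * (log(mu / y) + y / mu - 1)
    return np.mean(np.log(mu / y) + y / mu - 1.0)

def inverse_gaussian_half_deviance(y, mu):
    # Tweedie power = 3: D(y, mu) = (y - mu)**2 / (y * mu**2)
    return 0.5 * np.mean((y - mu) ** 2 / (y * mu ** 2))

# mu is the node's mean prediction; a split is chosen so that the
# weighted sum of the children's deviances drops the most.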

@github-actions github-actions bot added the CUDA/C++ and Cython/Python labels Sep 21, 2021
@caryr35 caryr35 added this to PR-WIP in v21.10 Release via automation Sep 22, 2021
@venkywonka
Contributor Author

venkywonka commented Oct 5, 2021

  • No regressions on gbm-bench for existing classification and regression tasks due to this PR.
  • Due to a reduction in division operations inside the gain calculation of the mse objective, there is a slight increase in performance for mse regression without any loss in accuracy (a rough illustration of the rewrite is sketched below the plots).
[gbm-bench plots, main vs. tweedie branch: regression (mse, poisson) and classification (entropy, gini)]
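
Regarding the mse speed-up: a rough Python illustration of the kind of algebraic rewrite that reduces divisions in the MSE gain computation (illustrative only, with made-up function names; the actual change lives in the CUDA/C++ objective code):

import numpy as np

def mse_gain_naive(y_left, y_right):
    # recompute the variance of parent and children for each candidate split:
    # every np.var call costs extra divisions
    y_parent = np.concatenate([y_left, y_right])
    return (np.var(y_parent) * y_parent.size
            - np.var(y_left) * y_left.size
            - np.var(y_right) * y_right.size)

def mse_gain_from_sums(s_left, n_left, s_right, n_right):
    # same quantity computed from running label sums and counts: the
    # sum-of-squares terms cancel, leaving only three divisions per split
    s_parent = s_left + s_right
    n_parent = n_left + n_right
    return (s_left * s_left / n_left
            + s_right * s_right / n_right
            - s_parent * s_parent / n_parent)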

@venkywonka venkywonka added the improvement label Oct 5, 2021
@venkywonka venkywonka changed the title [WIP] RF: Add Gamma and Inverse Gaussian loss criteria [REVIEW] RF: Add Gamma and Inverse Gaussian loss criteria Oct 5, 2021
@ajschmidt8 ajschmidt8 removed the request for review from a team October 5, 2021 13:15
@ajschmidt8
Member

Removing ops-codeowners from the required reviews since it doesn't seem there are any file changes that we're responsible for. Feel free to add us back if necessary.

Contributor

@RAMitchell RAMitchell left a comment


Love to see the use of C++17 features and good use of the STL for clean code.

Is this PR significantly impacting compile times due to the new template instantiations?

Could you produce a couple of Python plots in the comments of this PR, of the same type that you did for Poisson, showing that these objectives reduce their accompanying loss function better than an MSE objective? I.e. train two models, one with the MSE objective and one with gamma, and evaluate the gamma training loss for both; the model trained with the gamma objective should have the better loss.

Edit: I guess your tests are doing this already, but it would be useful to verify it visually.

def test_poisson_convergence(lam, max_depth):
@pytest.mark.parametrize("split_criterion",
                         ["poisson", "gamma", "inverse_gaussian"])
def test_tweedie_convergence(max_depth, split_criterion):
    np.random.seed(33)
    bootstrap = None
    max_features = 1.0
    n_estimators = 1
    min_impurity_decrease = 1e-5
    n_datapoints = 100000

This amount of data is a bit excessive, you could dial this back to 1000.

@venkywonka
Contributor Author

> Love to see the use of C++17 features and good use of the STL for clean code.
>
> Is this PR significantly impacting compile times due to the new template instantiations?
>
> Could you produce a couple of Python plots in the comments of this PR, of the same type that you did for Poisson, showing that these objectives reduce their accompanying loss function better than an MSE objective? I.e. train two models, one with the MSE objective and one with gamma, and evaluate the gamma training loss for both; the model trained with the gamma objective should have the better loss.
>
> Edit: I guess your tests are doing this already, but it would be useful to verify it visually.

It is impacting compile time due to more template instantiations of the builder class... when I used ccache with ./build.sh, though, I didn't observe the slowdown.

@venkywonka
Contributor Author

Convergence plots of the tweedie criteria w.r.t. mse:

[convergence plots: tweedie deviance vs. tree depth for each criterion, compared against the mse baseline]

Script used:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import itertools

from cuml.ensemble import RandomForestRegressor as curfr
from sklearn.metrics import mean_tweedie_deviance

split_criterion_list = ["poisson", "gamma", "inverse_gaussian"]
max_depth_list = [2, 4, 6, 8, 12]
df = pd.DataFrame(columns=["loss", "mse_tweedie_deviance", "tweedie_tweedie_deviance", "depth"])

for split_criterion, max_depth in itertools.product(split_criterion_list, max_depth_list):
    np.random.seed(33)
    bootstrap = None
    max_features = 1.0
    n_estimators = 1
    min_impurity_decrease = 1e-5
    n_datapoints = 100000
    tweedie = {
        "poisson":
            {"power": 1,
                "gen": np.random.poisson, "args": [0.01]},
        "gamma":
            {"power": 2,
                "gen": np.random.gamma, "args": [2.0]},
        "inverse_gaussian":
            {"power": 3,
                "gen": np.random.wald, "args": [0.1, 2.0]}
    }
    # generating random dataset with tweedie distribution
    X = np.random.random((n_datapoints, 4)).astype(np.float32)
    y = tweedie[split_criterion]["gen"](*tweedie[split_criterion]["args"],
                                        size=n_datapoints).astype(np.float32)

    tweedie_preds = curfr(
        split_criterion=split_criterion,
        max_depth=max_depth,
        n_estimators=n_estimators,
        bootstrap=bootstrap,
        max_features=max_features,
        min_impurity_decrease=min_impurity_decrease).fit(X, y).predict(X)
    mse_preds = curfr(
        split_criterion=2,
        max_depth=max_depth,
        n_estimators=n_estimators,
        bootstrap=bootstrap,
        max_features=max_features,
        min_impurity_decrease=min_impurity_decrease).fit(X, y).predict(X)
    # tweedie deviance requires strictly positive predictions
    mask = mse_preds > 0
    mse_tweedie_deviance = mean_tweedie_deviance(
        y[mask], mse_preds[mask],
        power=tweedie[split_criterion]["power"])
    tweedie_tweedie_deviance = mean_tweedie_deviance(
        y[mask], tweedie_preds[mask],
        power=tweedie[split_criterion]["power"])

    df = df.append({
        "loss": split_criterion,
        "mse_tweedie_deviance": mse_tweedie_deviance,
        "tweedie_tweedie_deviance": tweedie_tweedie_deviance,
        "depth": max_depth,
    }, ignore_index=True)

    # model trained on tweedie data with
    # tweedie criterion must perform better on tweedie loss
    assert mse_tweedie_deviance >= tweedie_tweedie_deviance

matplotlib.use("Agg")
sns.set()
tweedies = ["poisson", "gamma", "inverse_gaussian"]
figs, axes = plt.subplots(nrows=len(tweedies), ncols=1, squeeze=True, figsize=(7, 12))
for loss, ax in zip(tweedies, axes):
    plot = sns.lineplot(data=df[df["loss"] == loss], x="depth", y="tweedie_tweedie_deviance", ax=ax, label=f"trained on {loss}")
    baseline = sns.lineplot(data=df[df["loss"] == loss], x="depth", y="mse_tweedie_deviance", ax=ax, label=f"trained on mse")
    ax.set_ylabel(f"{loss} deviance")
figs.tight_layout()
plt.savefig('convergence.png')

@venkywonka
Contributor Author

rerun tests

Contributor

@wphicks wphicks left a comment


Looks great! Just one missing doc block, but I didn't notice any other issues.

v21.12 Release automation moved this from PR-WIP to PR-Needs review Oct 11, 2021
@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.12@c3b5aec).
The diff coverage is n/a.


@@               Coverage Diff               @@
##             branch-21.12    #4216   +/-   ##
===============================================
  Coverage                ?   86.07%           
===============================================
  Files                   ?      231           
  Lines                   ?    18694           
  Branches                ?        0           
===============================================
  Hits                    ?    16090           
  Misses                  ?     2604           
  Partials                ?        0           
Flag       Coverage Δ
dask       47.01% <0.00%> (?)
non-dask   78.75% <0.00%> (?)

Flags with carried forward coverage won't be shown.


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c3b5aec...651d827.

v21.12 Release automation moved this from PR-Needs review to PR-Reviewer approved Oct 12, 2021
@dantegd
Copy link
Member

dantegd commented Oct 12, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit ead8ef2 into rapidsai:branch-21.12 Oct 12, 2021
v21.12 Release automation moved this from PR-Reviewer approved to Done Oct 12, 2021
rapids-bot bot pushed a commit that referenced this pull request Oct 29, 2021
* Some updates to RF documentation
* to be merged after #4216

Authors:
  - Venkat (https://github.com/venkywonka)

Approvers:
  - Rory Mitchell (https://github.com/RAMitchell)
  - Vinay Deshpande (https://github.com/vinaydes)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #4138
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
This PR adds the Gamma and Inverse Gaussian Criteria to train decision trees, along with modifications to rf unit tests.


---


checklist:
- [x] Add Gamma and Inverse Gaussian Objective classes
- [x] Add C++ tests for above
- [x] Add remaining C++ tests for other objective functions: entropy and mean squared error
- [x] Add python level convergence tests for gamma and inverse gaussian (just like the one added for Poisson loss in rapidsai#4156)
- [x] Check for regressions by benchmarking on gbm-bench
- [x] Convergence plots showing that a model trained with a particular criterion performs better on its own loss metric than a baseline (`mse`)

Authors:
  - Venkat (https://github.com/venkywonka)

Approvers:
  - Rory Mitchell (https://github.com/RAMitchell)
  - William Hicks (https://github.com/wphicks)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4216
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
* Some updates to RF documentation
* to be merged after rapidsai#4216

Authors:
  - Venkat (https://github.com/venkywonka)

Approvers:
  - Rory Mitchell (https://github.com/RAMitchell)
  - Vinay Deshpande (https://github.com/vinaydes)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4138