Replace Probabilities by LeafContent and define StableRulesRegressor #18

Merged

6 commits merged into main on Jun 16, 2023

Conversation

@rikhuijzer (Owner) commented Jun 15, 2023

Working on this PR also revealed, the hard way, that the contents of a leaf should be a `Vector` for both kinds of trees. The core of that is summarized as follows:

"""
Type which holds the values inside a leaf.
For classification, this is a vector of probabilities of each class.
For regression, this is a vector of one element.

!!! note
    Vectors of one element are not as performant as scalars, but the
    alternative here is to have two different types of leafs, which
    results in different types of trees also, which basically
    requires most functions then to become parametric.
"""
const LeafContent = Vector{Float64}

So, don't define two types for the contents.
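As a minimal sketch of why one type suffices (not code from the PR; `classification_leaf`, `regression_leaf`, and `predict_value` are hypothetical names for illustration), a single `Vector{Float64}` leaf can hold either class probabilities or a wrapped scalar, so downstream functions need no parametric split:

```julia
# A single alias covers both kinds of leaf (this definition is from the PR).
const LeafContent = Vector{Float64}

# Classification: a vector of class probabilities (sums to 1).
classification_leaf = LeafContent([0.2, 0.8])

# Regression: the predicted value wrapped in a one-element vector.
regression_leaf = LeafContent([23.6])

# A toy helper showing that one function can handle both leaf kinds:
# return the scalar for regression, or the most probable class index
# for classification.
predict_value(leaf::LeafContent) =
    length(leaf) == 1 ? only(leaf) : argmax(leaf)
```

With two leaf types, `predict_value` (and most tree-walking functions) would instead need a type parameter on the tree itself.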

Furthermore, this PR defines `StableRulesRegressor`, which works towards fixing #13.

The predictive performance of `StableRulesRegressor` is poor, though:

```
15×7 DataFrame
 Row │ Dataset   Model                   Hyperparameters    nfolds  AUC     RMS      1.96*SE
     │ String    String                  String             Int64   String  String   String
─────┼───────────────────────────────────────────────────────────────────────────────────────
   1 │ blobs     LGBMClassifier          (;)                    10  0.99             0.01
   2 │ blobs     LGBMClassifier          (max_depth = 2,)       10  0.99             0.01
   3 │ blobs     StableRulesClassifier   (n_trees = 50,)        10  1.00             0.00
   4 │ titanic   LGBMClassifier          (;)                    10  0.87             0.03
   5 │ titanic   LGBMClassifier          (max_depth = 2,)       10  0.85             0.02
   6 │ titanic   StableForestClassifier  (n_trees = 1500,)      10  0.85             0.02
   7 │ titanic   StableRulesClassifier   (n_trees = 1500,)      10  0.84             0.02
   8 │ haberman  LGBMClassifier          (;)                    10  0.71             0.06
   9 │ haberman  LGBMClassifier          (max_depth = 2,)       10  0.67             0.05
  10 │ haberman  StableForestClassifier  (n_trees = 1500,)      10  0.70             0.05
  11 │ haberman  StableRulesClassifier   (n_trees = 1500,)      10  0.67             0.05
  12 │ boston    LGBMRegressor           (;)                    10          0.70     0.06
  13 │ boston    LinearRegressor         (;)                    10          0.70     0.05
  14 │ boston    StableForestRegressor   (;)                    10          0.66     0.07
  15 │ boston    StableRulesRegressor    (n_trees = 1500,)      10          -1400.0  237.45
```

@rikhuijzer rikhuijzer marked this pull request as ready for review June 16, 2023 08:53
@rikhuijzer rikhuijzer changed the title Implement StableRulesRegressor Replace Probabilities by LeafContent Jun 16, 2023
@rikhuijzer rikhuijzer changed the title Replace Probabilities by LeafContent Replace Probabilities by LeafContent and define StableRulesRegressor Jun 16, 2023
@rikhuijzer rikhuijzer enabled auto-merge (squash) June 16, 2023 08:55
@rikhuijzer rikhuijzer merged commit c66c63c into main Jun 16, 2023
3 checks passed
@rikhuijzer rikhuijzer deleted the rh/regression-part-3 branch June 16, 2023 09:14
rikhuijzer added a commit that referenced this pull request Jun 21, 2023
in order to find the bug in the `StableRulesRegressor` (#18).

## Notes

The bug seems to be related to the scores stored in the rules being too high:
```julia
julia> include("test/mlj.jl")

julia> preds[1:5]
5-element Vector{Float64}:
 286.98408203125
 280.405224609375
 306.151708984375
 310.74091796875
 310.74091796875

julia> rulesmach.fitresult
StableRules model with 7 rules:
 if X[i, :x6] < 6.8 then 48.767 else 65.298 +
 if X[i, :x11] < 19.2 then 45.919 else 36.811 +
 if X[i, :x13] < 9.04 then 40.428 else 31.213 +
 if X[i, :x3] < 3.97 then 22.868 else 18.279 +
 if X[i, :x10] < 437.0 then 49.438 else 39.331 +
 if X[i, :x1] < 2.44953 then 50.514 else 39.592 +
 if X[i, :x5] < 0.52 then 36.275 else 29.05

julia> rulesmach.fitresult.weights
7-element Vector{Float16}:
 2.066
 1.897
 1.486
 0.778
 2.197
 2.275
 1.475

julia> rulesmach.fitresult.rules
7-element Vector{SIRUS.Rule}:
 SIRUS.Rule(TreePath(" X[i, :x6] < 6.8 "), [23.6], [31.6])
 SIRUS.Rule(TreePath(" X[i, :x11] < 19.2 "), [24.2], [19.4])
 SIRUS.Rule(TreePath(" X[i, :x13] < 9.04 "), [27.2], [21.0])
 SIRUS.Rule(TreePath(" X[i, :x3] < 3.97 "), [29.4], [23.5])
 SIRUS.Rule(TreePath(" X[i, :x10] < 437.0 "), [22.5], [17.9])
 SIRUS.Rule(TreePath(" X[i, :x1] < 2.44953 "), [22.2], [17.4])
 SIRUS.Rule(TreePath(" X[i, :x5] < 0.52 "), [24.6], [19.7])
```
So the summary built from the `rules` and `weights` looks fine, but the
`then` and `otherwise` contents make no sense, since `y` lies in a
different range:
```julia
julia> y[1:5]
5-element Vector{Float64}:
 24.0
 21.6
 34.7
 33.4
 36.2
```
It could be something else, but the values of the `then` and `otherwise`
clauses seem the most likely culprit.

On second thought, the weights seem the most likely culprit: those
weights make no sense, whereas the `then` and `otherwise` values could
plausibly correspond to `y` values.
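A quick back-of-the-envelope check supports the weights hypothesis. The following sketch reuses the numbers printed above (the variable names `weights`, `thens`, `prediction`, and `total` are hypothetical, not part of the package):

```julia
# Weights and `then` contents copied from the REPL output above.
weights = [2.066, 1.897, 1.486, 0.778, 2.197, 2.275, 1.475]
thens   = [23.6, 24.2, 27.2, 29.4, 22.5, 22.2, 24.6]

# The printed model shows weight * then (e.g. 2.066 * 23.6 ≈ 48.767),
# so a prediction appears to be the sum of the weighted terms:
prediction = sum(weights .* thens)  # ≈ 294, the same scale as preds (≈ 287–311)

# The weights sum to ≈ 12.17 instead of 1, which inflates the weighted
# sum by roughly 12× relative to `y` (whose values are around 24 here):
total = sum(weights)                # ≈ 12.17
prediction / total                  # ≈ 24.2, back in the range of `y`
```

In other words, if the weights were normalized to sum to one, the predictions would land in the right range, which points at the weight estimation rather than the leaf contents.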

Works towards fixing #16.