Replace Probabilities by LeafContent and define StableRulesRegressor #18

Merged

6 commits merged into main on Jun 16, 2023

Conversation

@rikhuijzer (Owner) commented Jun 15, 2023

Working on this PR also revealed, the hard way, that the contents of a leaf should be a `Vector` for both kinds of trees. The core of that is summarized as follows:

"""
Type which holds the values inside a leaf.
For classification, this is a vector of probabilities of each class.
For regression, this is a vector of one element.

!!! note
    Vectors of one element are not as performant as scalars, but the
    alternative here is to have two different types of leafs, which
    results in different types of trees also, which basically
    requires most functions then to become parametric.
"""
const LeafContent = Vector{Float64}

So, don't define two types for the contents.
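As a minimal sketch of why one type suffices (not code from the PR; `classification_leaf`, `regression_leaf`, and `predict_value` are hypothetical names for illustration), a single `Vector{Float64}` leaf can hold either class probabilities or a wrapped scalar, so downstream functions need no parametric split:

```julia
# A single alias covers both kinds of leaf (this definition is from the PR).
const LeafContent = Vector{Float64}

# Classification: a vector of class probabilities (sums to 1).
classification_leaf = LeafContent([0.2, 0.8])

# Regression: the predicted value wrapped in a one-element vector.
regression_leaf = LeafContent([23.6])

# A toy helper showing that one function can handle both leaf kinds:
# return the scalar for regression, or the most probable class index
# for classification.
predict_value(leaf::LeafContent) =
    length(leaf) == 1 ? only(leaf) : argmax(leaf)
```

With two leaf types, `predict_value` (and most tree-walking functions) would instead need a type parameter on the tree itself.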

Furthermore, this PR defines `StableRulesRegressor`, which works towards fixing #13.

The predictive performance of `StableRulesRegressor` is poor, though:

```
15×7 DataFrame
 Row │ Dataset   Model                   Hyperparameters    nfolds  AUC     RMS      1.96*SE
     │ String    String                  String             Int64   String  String   String
─────┼───────────────────────────────────────────────────────────────────────────────────────
   1 │ blobs     LGBMClassifier          (;)                    10  0.99             0.01
   2 │ blobs     LGBMClassifier          (max_depth = 2,)       10  0.99             0.01
   3 │ blobs     StableRulesClassifier   (n_trees = 50,)        10  1.00             0.00
   4 │ titanic   LGBMClassifier          (;)                    10  0.87             0.03
   5 │ titanic   LGBMClassifier          (max_depth = 2,)       10  0.85             0.02
   6 │ titanic   StableForestClassifier  (n_trees = 1500,)      10  0.85             0.02
   7 │ titanic   StableRulesClassifier   (n_trees = 1500,)      10  0.84             0.02
   8 │ haberman  LGBMClassifier          (;)                    10  0.71             0.06
   9 │ haberman  LGBMClassifier          (max_depth = 2,)       10  0.67             0.05
  10 │ haberman  StableForestClassifier  (n_trees = 1500,)      10  0.70             0.05
  11 │ haberman  StableRulesClassifier   (n_trees = 1500,)      10  0.67             0.05
  12 │ boston    LGBMRegressor           (;)                    10          0.70     0.06
  13 │ boston    LinearRegressor         (;)                    10          0.70     0.05
  14 │ boston    StableForestRegressor   (;)                    10          0.66     0.07
  15 │ boston    StableRulesRegressor    (n_trees = 1500,)      10          -1400.0  237.45
```

@rikhuijzer rikhuijzer marked this pull request as ready for review June 16, 2023 08:53
@rikhuijzer rikhuijzer changed the title Implement StableRulesRegressor Replace Probabilities by LeafContent Jun 16, 2023
@rikhuijzer rikhuijzer changed the title Replace Probabilities by LeafContent Replace Probabilities by LeafContent and define StableRulesRegressor Jun 16, 2023
@rikhuijzer rikhuijzer enabled auto-merge (squash) June 16, 2023 08:55
@rikhuijzer rikhuijzer merged commit c66c63c into main Jun 16, 2023
3 checks passed
@rikhuijzer rikhuijzer deleted the rh/regression-part-3 branch June 16, 2023 09:14
rikhuijzer added a commit that referenced this pull request Jun 21, 2023
in order to find the bug in the `StableRulesRegressor` (#18).

## Notes

The bug seems to be related to the scores stored in the rules being too high:
```julia
julia> include("test/mlj.jl")

julia> preds[1:5]
5-element Vector{Float64}:
 286.98408203125
 280.405224609375
 306.151708984375
 310.74091796875
 310.74091796875

julia> rulesmach.fitresult
StableRules model with 7 rules:
 if X[i, :x6] < 6.8 then 48.767 else 65.298 +
 if X[i, :x11] < 19.2 then 45.919 else 36.811 +
 if X[i, :x13] < 9.04 then 40.428 else 31.213 +
 if X[i, :x3] < 3.97 then 22.868 else 18.279 +
 if X[i, :x10] < 437.0 then 49.438 else 39.331 +
 if X[i, :x1] < 2.44953 then 50.514 else 39.592 +
 if X[i, :x5] < 0.52 then 36.275 else 29.05

julia> rulesmach.fitresult.weights
7-element Vector{Float16}:
 2.066
 1.897
 1.486
 0.778
 2.197
 2.275
 1.475

julia> rulesmach.fitresult.rules
7-element Vector{SIRUS.Rule}:
 SIRUS.Rule(TreePath(" X[i, :x6] < 6.8 "), [23.6], [31.6])
 SIRUS.Rule(TreePath(" X[i, :x11] < 19.2 "), [24.2], [19.4])
 SIRUS.Rule(TreePath(" X[i, :x13] < 9.04 "), [27.2], [21.0])
 SIRUS.Rule(TreePath(" X[i, :x3] < 3.97 "), [29.4], [23.5])
 SIRUS.Rule(TreePath(" X[i, :x10] < 437.0 "), [22.5], [17.9])
 SIRUS.Rule(TreePath(" X[i, :x1] < 2.44953 "), [22.2], [17.4])
 SIRUS.Rule(TreePath(" X[i, :x5] < 0.52 "), [24.6], [19.7])
```
So the summary built from the `rules` and `weights` looks fine, but the
`then` and `otherwise` contents make no sense, since `y` lies in a
different range:
```julia
julia> y[1:5]
5-element Vector{Float64}:
 24.0
 21.6
 34.7
 33.4
 36.2
```
It could be something else, but the values of the `then` and `otherwise`
clauses seem the most likely culprit.

On second thought, the weights seem the most likely culprit: those
weights make no sense, whereas the `then` and `otherwise` values could
plausibly correspond to `y` values.
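A quick back-of-the-envelope check supports the weights hypothesis. The following sketch reuses the numbers printed above (the variable names `weights`, `thens`, `prediction`, and `total` are hypothetical, not part of the package):

```julia
# Weights and `then` contents copied from the REPL output above.
weights = [2.066, 1.897, 1.486, 0.778, 2.197, 2.275, 1.475]
thens   = [23.6, 24.2, 27.2, 29.4, 22.5, 22.2, 24.6]

# The printed model shows weight * then (e.g. 2.066 * 23.6 ≈ 48.767),
# so a prediction appears to be the sum of the weighted terms:
prediction = sum(weights .* thens)  # ≈ 294, the same scale as preds (≈ 287–311)

# The weights sum to ≈ 12.17 instead of 1, which inflates the weighted
# sum by roughly 12× relative to `y` (whose values are around 24 here):
total = sum(weights)                # ≈ 12.17
prediction / total                  # ≈ 24.2, back in the range of `y`
```

In other words, if the weights were normalized to sum to one, the predictions would land in the right range, which points at the weight estimation rather than the leaf contents.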

Works towards fixing #16.