predict.cubist unable to predict properly using sample (cubistControl) #1

Closed
Laurae2 opened this issue Apr 2, 2016 · 0 comments

Laurae2 commented Apr 2, 2016

I am trying to fit a linear relation between three variables with Cubist (the vectors below come from a very large data set). However, predict() always breaks down on the fitted model, whereas predicting manually with the formula the model prints works fine.

Here are the vectors used:

v1 <- c(66787, 47194, 39871, 44990, 103933, 57240, 70013, 113002, 31145, 
47194, 64492, 36441, 197228, 202286, 14601, 25862, 120784, 84379, 
67224, 57301, 142191, 59581, 160405, 45648, 56873, 111957, 74430, 
84701, 0, 72055, 44301, 124789, 128377, 65629, 125613, 54780, 
78418, 36186, 59571, 148794, 17387, 79497, 47886, 160173, 100197, 
67793, 101231, 32230, 69549, 140863)

v2 <- c(187113, 244099, 142255, 116351, 179189, 127059, 174851, 233094, 
132003, 187143, 160828, 201573, 193093, 214011, 188806, 252668, 
173534, 355734, 160811, 215225, 204655, 221497, 175405, 126996, 
315174, 242112, 167534, 156679, 305221, 252339, 202403, 280700, 
206511, 257729, 184985, 291769, 108440, 259298, 252483, 213778, 
251058, 179890, 182320, 223046, 225751, 253243, 185440, 187539, 
169371, 254666)

v3 <- c(173569, 235079, 134319, 107179, 157561, 115288, 160548, 209768, 
128156, 177842, 147652, 193357, 151329, 171255, 186572, 248310, 
148230, 340352, 147043, 203874, 174868, 209683, 141529, 117734, 
304407, 219056, 152238, 139106, 306722, 237975, 191885, 256744, 
179724, 244784, 158690, 281340, 91991, 254696, 240824, 182605, 
248527, 164788, 172846, 189454, 205161, 239806, 164428, 181482, 
155141, 227106)

Expected answer: v1 = 3729.1 - 4.526*v3 + 4.54*v2

library(Cubist)
vdf <- data.frame(v2, v3)
set.seed(11111)
commiteeControl <- cubistControl(sample = 50, rules = 1)
commiteeModel <- cubist(x = vdf, y = v1, control = commiteeControl)
print(commiteeModel)
summary(commiteeModel)
predictions <- predict(commiteeModel, newdata = vdf)
sqrt(mean((predictions - v1)^2)) #RMSE
cor(predictions, v1)^2 #R^2

Output of print/summary:

> print(commiteeModel)

Call:
cubist.default(x = vdf, y = v1, control = commiteeControl)

Number of samples: 50 
Number of predictors: 2 

Number of committees: 1 
Number of rules: 1 
Other options: 50% sub-sampling
> summary(commiteeModel)

Call:
cubist.default(x = vdf, y = v1, control = commiteeControl)


Cubist [Release 2.07 GPL Edition]  Sat Apr 02 14:13:15 2016
---------------------------------

    Target attribute `outcome'

Read 25 cases (3 attributes) from undefined.data

Model:

  Rule 1: [25 cases, mean 88864.3, range 0 to 197228, est err 2566.0]

    outcome = 3729.1 - 4.526 v3 + 4.54 v2


Evaluation on training data (25 cases):

    Average  |error|             1965.2
    Relative |error|               0.05
    Correlation coefficient        1.00


    Attribute usage:
      Conds  Model

             100%    v2
             100%    v3


Evaluation on test data (25 cases):

    Average  |error|             3513.7
    Relative |error|               0.09
    Correlation coefficient        0.99


Time: 0.0 secs

Everything looks perfect... until:

> predictions <- predict(commiteeModel, newdata = vdf)
> sqrt(mean((predictions - v1)^2)) #RMSE
[1] 77990.2
> cor(predictions, v1)^2 #R^2
[1] 1.221792e-06

This does not make sense, because applying the formula from the summary by hand gives the expected answer:

> predictions <- 3729.1 - 4.526*v3 + 4.54*v2
> sqrt(mean((predictions - v1)^2)) #RMSE
[1] 3297.196
> cor(predictions, v1)^2 #R^2
[1] 0.9952073

Tested under:
R version 3.2.4 Revised
R version 3.2.3
Rgui and RStudio under Windows 7, Windows 8.1, Windows 10
Virtual machine and non-virtual machine
Different computers

Removing the line sample="0.5" init="3965" from $model cleared the issue in my case.

I put this in $model of the fitted object:

commiteeModel$model <- "id=\"Cubist 2.07 GPL Edition 2016-04-02\"\nprec=\"0\" globalmean=\"76794.88\" extrap=\"1\" insts=\"0\" ceiling=\"404572\" floor=\"0\"\natt=\"outcome\" mean=\"76794.8\" sd=\"49683.92\" min=\"0\" max=\"202286\"\natt=\"v2\" mean=\"209191\" sd=\"42482.83\" min=\"108440\" max=\"305221\"\natt=\"v3\" mean=\"193537.4\" sd=\"47135.79\" min=\"91991\" max=\"306722\"\nentries=\"1\"\nrules=\"1\"\nconds=\"0\" cover=\"25\" mean=\"76794.9\" loval=\"0\" hival=\"202286\" esterr=\"1456.4\"\ncoeff=\"144.4\" att=\"v2\" coeff=\"4.66\" att=\"v3\" coeff=\"-4.64\"\n"

instead of:

commiteeModel$model <- "id=\"Cubist 2.07 GPL Edition 2016-04-02\"\nprec=\"0\" globalmean=\"76794.88\" extrap=\"1\" insts=\"0\" ceiling=\"404572\" floor=\"0\"\natt=\"outcome\" mean=\"76794.8\" sd=\"49683.92\" min=\"0\" max=\"202286\"\natt=\"v2\" mean=\"209191\" sd=\"42482.83\" min=\"108440\" max=\"305221\"\natt=\"v3\" mean=\"193537.4\" sd=\"47135.79\" min=\"91991\" max=\"306722\"\nsample=\"0.5\" init=\"3965\"\nentries=\"1\"\nrules=\"1\"\nconds=\"0\" cover=\"25\" mean=\"76794.9\" loval=\"0\" hival=\"202286\" esterr=\"1456.4\"\ncoeff=\"144.4\" att=\"v2\" coeff=\"4.66\" att=\"v3\" coeff=\"-4.64\"\n"

And the prediction worked perfectly.
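
The manual edit above can also be done programmatically. This is a minimal sketch in base R, assuming the model string is newline-separated as in the two strings shown here; strip_sample_line is a hypothetical helper name, not part of the Cubist package.

```r
# Hypothetical helper (not part of Cubist): drop the line that starts with
# sample= from a newline-separated Cubist model string.
strip_sample_line <- function(model_string) {
  # Split into lines; strsplit() drops the trailing empty piece after the
  # final newline, so it is added back below.
  lines <- strsplit(model_string, "\n", fixed = TRUE)[[1]]
  keep <- !grepl("^sample=", lines)
  paste0(paste(lines[keep], collapse = "\n"), "\n")
}

# Shortened stand-in for commiteeModel$model, for illustration only.
example_model <- "att=\"v3\" mean=\"193537.4\"\nsample=\"0.5\" init=\"3965\"\nentries=\"1\"\nrules=\"1\"\n"
cat(strip_sample_line(example_model))
```

On a real fit, the assumed usage would be commiteeModel$model <- strip_sample_line(commiteeModel$model) before calling predict().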

Laurae2 added a commit to Laurae2/Cubist that referenced this issue Apr 2, 2016
Fixing the following reported issue: predict.cubist unable to predict properly using sample (cubistControl) topepo#1

tl;dr explanation: when the sample parameter is set in cubistControl, predictions break down immediately. This fix solves the issue without creating other issues.

Extra explanation about the fix:

There is probably a cleaner way to do it with a better regex, but this version works whether or not sample is defined. init has no impact overall.

It removes everything from "sample" up to, but not including, "entries".

I found no impact from removing redn (it shows up between sample and entries when committees are used).
redn = final error / (sum of errors / (number of committees - 1)) is just a calculated output value: Cubist reads it into its ErrReduction variable when the model is loaded, but it is not used at all to predict.
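
The removal described above could look roughly like the sketch below. The regex actually used in the commit is not shown in this issue, so this is one possible version under that description; drop_sample_block is a hypothetical name, and the redn value in the example is made up for illustration.

```r
# Hypothetical sketch, not the committed code: remove everything from the
# line starting with sample= up to (but not including) the entries= line.
drop_sample_block <- function(model_string) {
  # (?s) lets . match newlines, (?m) lets ^ match at line starts; the
  # lookahead keeps the entries= line in place.
  sub("(?sm)^sample=.*?(?=^entries=)", "", model_string, perl = TRUE)
}

# Illustrative model fragment; the redn value is invented.
with_redn <- "att=\"v3\"\nsample=\"0.5\" init=\"3965\"\nredn=\"0.9\"\nentries=\"1\"\nrules=\"1\"\n"
cat(drop_sample_block(with_redn))
```

A string without a sample line contains no match, so sub() returns it unchanged, consistent with the fix working whether or not sample is defined.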
topepo pushed a commit that referenced this issue Dec 11, 2016
topepo closed this Jan 16, 2017