This repository has been archived by the owner on Dec 6, 2023. It is now read-only.

model dependent on dataset order #179

Closed
odedbd opened this issue Apr 2, 2018 · 15 comments

Comments

odedbd commented Apr 2, 2018

I have run into a situation where simply reordering the input dataset changes the output model. I make sure that X and y are reordered jointly, of course. Repeating training on the same ordering of the dataset reproduces exactly the same results, so this is not general stochasticity; the model changes with the sample order in the dataset.

I have tried reading the knot_candidates and knot_search code to find where the data order could come into play, but I was unable to follow the code well enough to spot such a place.

I cannot share my actual data, so I will try to reproduce this with a demo dataset. In the meantime, I would be grateful to know whether this is expected, or whether it is as surprising to others as it is to me. I would appreciate any direction for testing this further, or suggestions for how to prevent it from happening, if that can be done.
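Roughly, this is the kind of check I have in mind (a minimal sketch with a purely synthetic stand-in; the shapes, signal, and noise level here are made up):

```python
import numpy as np
from pyearth import Earth

# Synthetic stand-in for the real data (shapes and noise level are made up).
rng = np.random.RandomState(0)
X = rng.uniform(size=(500, 10))
y = X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * rng.normal(size=500)

# Fit once on the original ordering and once on a joint permutation of X and y.
model_a = Earth().fit(X, y)
perm = rng.permutation(len(y))
model_b = Earth().fit(X[perm], y[perm])

# If the fit were independent of sample order, these summaries should match.
print(model_a.summary())
print(model_b.summary())
```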


odedbd commented Apr 2, 2018

UPDATE:

I tried printing the summaries for models trained on different orderings of the dataset. It seems that the knots are selected identically, but the pruning ends up with different models (different terms end up pruned).

This seems to indicate that the issue lies within the pruning process. I will try to take a look at that part of the code next.


odedbd commented Apr 2, 2018

UPDATE (2):

Looking at the traces of two models trained on two different orderings of the dataset, I noticed that while the forward pass trace is the same for both (except for the knot numbers, which differ because of the different ordering), the pruning trace differs from the 0th iteration onward. Specifically, this is the last line of the forward trace for both:

Forward Pass
--------------------------------------------------------------------
iter  parent  var  knot  mse        terms  gcv      rsq    grsq     
--------------------------------------------------------------------
...
52    0       1    355   7.745107   102    244.473  0.594  -11.781  
--------------------------------------------------------------------
Stopping Condition 0: Reached maximum number of terms

And these are the first lines of the pruning pass for both:

1:
Pruning Pass
--------------------------------------------------
iter  bf   terms  mse    gcv      rsq    grsq     
--------------------------------------------------
0     -    102    7.75   244.473  0.594  -11.781  
1     87   101    7.74   223.585  0.594  -10.689  
2:
Pruning Pass
--------------------------------------------------
iter  bf   terms  mse    gcv      rsq    grsq     
--------------------------------------------------
0     -    102    7.75   244.699  0.594  -11.792  
1     10   101    7.74   223.505  0.594  -10.684  

I find it curious that while pruning trace 1 shows the 0th iteration starting from the same values at which the forward pass finished, pruning trace 2 already shows slight differences at the 0th iteration. In the first iteration we can also see that different terms are being pruned. In the end this leads the two models to keep different terms unpruned, and even a different number of terms: four for one model and five for the other.

Any idea why the pruning seems to start from a different point than where the forward pass ended? Or am I misinterpreting the traces?


jcrudy commented Apr 2, 2018

@odedbd Thanks for this report. This is not desired behavior. I'm guessing there is some numerical instability in the pruning code. These things can be tough to track down, so it would of course help a lot if you could provide some way to reproduce it. If you're comfortable sharing data privately, feel free to email me (address is on my github page) and we can set up any method of transfer you're comfortable with (including an NDA if necessary). If not, any descriptive statistics you can give about your data set might help. For example, you can use numpy.linalg.cond to calculate the condition number of your data (and your transformed data).
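For example, something along these lines (just a sketch; X is your training data and model is assumed to be an already fitted Earth estimator):

```python
import numpy as np

# Condition number of the raw design matrix.
print(np.linalg.cond(X))

# Condition number of the transformed data, i.e. the basis function matrix
# produced by the fitted Earth model (assumed to exist here as `model`).
print(np.linalg.cond(model.transform(X)))
```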

Is one of the pruned models substantially worse than the other? Also, can you share system information such as operating system, python version, numpy version, etc?

Finally, as a workaround some people have used py-earth without pruning in a pipeline with an elastic net or similar model. Assuming the elastic net model is not sensitive to data ordering, this would potentially solve your problem (assuming this is actually causing a problem for you).


odedbd commented Apr 3, 2018

@jcrudy Thank you for your suggestions. I have sent you an email re sharing the data privately.

In the meantime, using ElasticNet sounds like an interesting idea for my use cases. I especially like the potential for shrinking coefficients non-sparsely with some ridge regularization. Is there anything more to it than setting allow_pruning=False when running fit, then using the transform method before running the ElasticNet fit? I guess I should be able to construct a scikit-learn Pipeline with the Earth estimator acting as a transformer, right?


jcrudy commented Apr 4, 2018

@odedbd You've got it exactly right. Just pass allow_pruning=False and use a Pipeline with ElasticNet. There's some discussion of this topic in issue #159 with @Fish-Soup.
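Something like this, for example (an untested sketch; ElasticNetCV is just one reasonable choice for the downstream model, and X and y are your training data):

```python
from pyearth import Earth
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import Pipeline

# Earth acts purely as a transformer here (pruning disabled); the elastic net
# then does the coefficient shrinkage/selection on the basis functions.
model = Pipeline([
    ('earth', Earth(allow_pruning=False)),
    ('enet', ElasticNetCV()),
])
model.fit(X, y)
y_pred = model.predict(X)
```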

jcrudy added the bug label Apr 4, 2018
jcrudy added the wontfix label Jun 8, 2018

jcrudy commented Jun 8, 2018

I've finally looked into this. The issue is caused by the pruning of terms with extremely similar contributions to MSE. These terms are so similar that choosing which to prune comes down to comparing the 15th digit or so. Changes in data ordering can affect this 15th digit due to completely reasonable numerical instability in various numpy algorithms. While it's slightly annoying that this behavior occurs, I'm thinking it isn't worth it to try to fix since it will generally not result in a worse model, just a different model.
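To give a sense of the scale involved, even a plain numpy sum changes in its last digits when the same numbers are added in a different order, and a difference of that size is all it takes to flip a tie between two nearly identical pruning candidates:

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=100000)

# Summing the same values in a different order typically changes the result
# somewhere around the 13th-16th significant digit; the difference is usually
# a tiny nonzero number rather than exactly 0.0.
print(np.sum(x) - np.sum(x[rng.permutation(len(x))]))
```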

@odedbd, there may be use cases I haven't thought of in which this behavior is harmful. If you have one, please let me know. For now, I'm marking this as wontfix.

jcrudy closed this as completed Jun 8, 2018

odedbd commented Jun 10, 2018

@jcrudy Thank you for taking the time to look into this. For my internal use case this produced a more problematic outcome, but I am unsure whether that was due to similar 15th digit instability or not. What's the best way for me to check this? Is there some verbosity flag I could set in order to see what the differences in MSE are during training?


jcrudy commented Jun 10, 2018

@odedbd Are you comfortable installing from source? I can make a branch that will print out the information you need during fitting.


odedbd commented Jun 11, 2018

@jcrudy That would be great. Please let me know what branch I should checkout and how to make sense of the printouts.


jcrudy commented Jun 12, 2018

@odedbd The branch is called issue_179. If you compare the _pruning.pyx file against master, you'll see that all I did was make it print out the basis function and loss after removal for each term at each step of pruning whenever verbose >= 2. It's not nicely formatted or anything, but should do what you need.

In case you're unfamiliar with how the pruning pass works, here it is: at each step of pruning, for every term in the model, the term is removed, the loss is calculated, and the term is replaced. The term whose removal resulted in the smallest increase in loss is then removed. The process repeats until no terms remain. The terms from the step that had the smallest loss (in terms of GCV) make up the final model.

So, using this higher verbosity, you should be able to see how different the losses are from removing different terms at each step.
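In rough Python pseudocode (a sketch of the logic only, not the actual Cython implementation in _pruning.pyx; loss_fn and gcv_fn stand in for the internal MSE and GCV computations):

```python
def pruning_pass(terms, loss_fn, gcv_fn):
    # terms   : list of term identifiers in the fitted model
    # loss_fn : maps a list of terms to the training loss (MSE) of the refit model
    # gcv_fn  : maps a list of terms to its GCV score
    current = list(terms)
    best, best_gcv = list(current), gcv_fn(current)

    while len(current) > 1:
        # Tentatively remove each term in turn and record the resulting loss.
        candidates = [(loss_fn([t for t in current if t is not c]), c) for c in current]

        # Permanently remove the term whose removal increases the loss the least.
        _, to_remove = min(candidates, key=lambda pair: pair[0])
        current = [t for t in current if t is not to_remove]

        # The final model is the subset with the smallest GCV seen so far.
        if gcv_fn(current) < best_gcv:
            best, best_gcv = list(current), gcv_fn(current)

    return best
```

When two candidates inside that min() differ only around the 15th digit, which one wins can depend on the data ordering, which is the instability discussed above. With the issue_179 branch and verbose >= 2, those per-term losses are exactly what gets printed at each step.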


odedbd commented Jun 19, 2018

@jcrudy thanks for setting up the branch. I will try to build the code and test it on my use case. I'll update with my findings.


odedbd commented Jun 24, 2018

@jcrudy I am unable to build the issue branch properly. The python setup.py install command seemed to complete without any errors. However, when I try from pyearth import Earth I get an exception:

ImportError: No module named _forward

I do see the following lines in the setup.py install output:
copying build\lib.win-amd64-2.7\pyearth\_forward.c -> build\bdist.win-amd64\egg\pyearth
copying build\lib.win-amd64-2.7\pyearth\_forward.pxd -> build\bdist.win-amd64\egg\pyearth
copying build\lib.win-amd64-2.7\pyearth\_forward.pyd -> build\bdist.win-amd64\egg\pyearth

My working environment is 64-bit Windows 10 with Python 2.7. I ran the build from the VS 64-bit command console.

Any suggestions on what I might need to do? I am currently setting up a new environment on Ubuntu 16.04 with Python 2.7, and I plan to try running the build there as well.


jcrudy commented Aug 3, 2018

@odedbd Apologies, I just saw that this post went unanswered. Did you change directories after installing with setup.py? If not, doing so will probably fix the problem.


odedbd commented Aug 6, 2018

@jcrudy What do you mean by "changed directories"? Changing directories as in the cd command, or moving folders from one place to another?


jcrudy commented Aug 6, 2018

@odedbd I mean as in the cd command. Often people have the problem you're seeing because they have tried to run py-earth from the source directory.
