Inconsistent results on different computers #166
UPDATE: I have run the above script within a Linux docker container on computer B as well and got different results from that. I have also tested on a few other Windows systems (versions 8 & 10), and all of these gave the same results as computer B. All these systems had identical virtual environments installed. Docker results:
@odedbd In general, py-earth will produce slightly different results on different machines. I believe it's due to differences in the underlying BLAS and LAPACK subroutines, but I don't know for sure. Are these differences causing any problems for you?
@jcrudy Thank you for getting back to me on this. I am not sure whether these differences are causing problems so much as a concern. We test our processes, including training, as part of a continuous deployment setup to catch regression bugs. These differences cause tests of the training process to break, since the testing machine is not necessarily identical to the development machine. Do you have any idea which part of the algorithm is more susceptible to such BLAS/LAPACK differences and could therefore be suspect? One idea I had was to test the earth R package, to see whether it displays these kinds of differences between machines.
@odedbd I haven't done much to analyze where machine-dependent differences occur. A little searching, though, indicates that in general floating point calculations can't be relied upon to yield identical results across machines, and the differences can depend on many aspects of both hardware and software. See this, for example. One way to ensure that results are at least very similar across machines is to give py-earth an easy problem to solve. That is, a problem in which the predictors are not very correlated with each other. That should at least help. When I run tests, I often make my assertions around model performance, which is generally much less sensitive than the exact terms and coefficients. I would be curious to know to what extent this occurs in the R package, so please post if you do decide to test it out.
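The machine-dependence described above shows up even at the level of basic floating point arithmetic: addition is not associative, so any library that reorders a sum (vectorization, parallel reductions, different BLAS kernels) can legitimately return slightly different results. A minimal, generic illustration (not py-earth-specific):

```python
# Floating point addition is not associative, so two machines (or two
# BLAS builds) that accumulate the same sum in a different order can
# produce slightly different results from identical inputs.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
print(left == right)  # False
```

In a stepwise algorithm like MARS, such a tiny discrepancy can flip a knot-selection decision, after which the fitted models diverge visibly.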
@jcrudy I will update if we decide to test the R package, thanks.
Just wondering if anyone has any updates or ideas on resolving this issue; it would be great to know.
@ficerr No updates, except to say that this should probably be considered expected behavior. Is there a problem it's causing for you? Perhaps I could suggest a workaround.
@jcrudy Yes, it is an issue, as the difference in the results from one PC to another is dramatic; it is not like a rounding error in the 10th decimal place. I am now trying to use rpy2 to run the R version of Earth and just use those results in Python, but I am not sure how much longer it will take, because there must be some communication overhead between Python and R, and, more importantly, I have not yet checked whether the results are the same across different PCs. What workaround do you mean? It would be great to know.
@ficerr When you say that the difference is dramatic, is this difference in the basis functions, the coefficients, or the predictions? While it is theoretically possible, due to the stepwise nature of the algorithm, to get large differences in prediction due to small numerical differences, I've never seen it happen. If it's happening to you, I would be interested to know more about your data set if possible. I am also interested to know whether the R package is giving you a similar problem. It's possible you've uncovered some bug or instability specific to py-earth, in which case it could be fixed. In terms of speed, the overhead of communication with R may not be a greater factor than the speed difference between the R package and py-earth, so your rpy2 implementation may actually be faster. However, it depends on your usage. Regarding workarounds, it depends on specifically what differences you're seeing and what your goals are. I only mean that if I know those things, I may be able to suggest something (but maybe not).
@jcrudy Across different computers: basis functions are the same, knots are the same (at least up to the 3rd decimal place; I do not know where to find the exact values for knots) after the forward pass, but the coefficients are different and the selected basis functions (after the pruning pass) are different, i.e. the results of the pruning pass are not the same across different PCs. Across Python and R: basis functions, knots, coefficients, and predictions are very different between the models, although I am not 100% sure the default settings are the same in Python and R (I called Earth in R the same way you did in https://gist.github.com/jcrudy/7221551). R is giving the same results across different PCs, so I am using the R solution now; however it is much slower because of the communication overhead between Python and R. I will try to make my code around it faster, but I doubt it will help, as profiling tells me that the longest part is the communication between Python and R (calling the R Earth function from Python takes up to 10 times longer compared to the Python Earth function). My data are just two real-valued vectors, x and y, with 500 rows each: mean(x)=-1.30, std(x)=1.50, mean(y)=52.20, std(y)=4.30.
@ficerr @odedbd found a similar instability with the pruning pass in #179. One workaround is to skip the pruning pass entirely. I currently consider this instability in the pruning pass to be not worth fixing, as so far I have not seen a case where it results in a model with substantially worse performance. However, you may have found that case. Can you tell me whether the predictions and predictive performance of your models differ much between machines? If so, I may change my mind (however, it still may be a long time before I can get around to attempting a fix). I am a little concerned that the R model gave very different predictions from the Python model. Generally, predictions should not differ too much between implementations. Of course, there are adjustable parameters that may differ. If there's anything more you can share about that, I'd be interested.
@jcrudy Well, the effect of that instability depends on how sensitive you want your model to be. Regarding the differences across computers: for instance, sometimes I get a very small difference in the slopes of the lines, 1-2% or so, but sometimes it is 10% or more. For some data the MSE, GCV, RSQ, and GRSQ are almost identical or even identical (at least to the 4th decimal place), but sometimes there is a 2-3% difference. As a result, on different computers the fitted lines (in the 2D case) of the model (and thus the forecasts) can look quite different, which leads to inconsistent results and the inability to reproduce them across computers (you use the same data and the same parameters but obtain different results). Imagine that instead of one line between knots at x=2 and x=5 with a slope of 0.8, because of a different pruning pass the model selects another knot in between (say at x=3), and instead of one line with slope 0.8 you obtain two lines with, say, slopes 25 and -30. The most annoying part is that although the Earth model in R is stable and can be used from Python, it takes up to 10 times longer to calculate (due to the communication overhead between Python and R) than the Python one, as I mentioned earlier. So for anyone who needs to run the model many times on large data sets, the processing time difference becomes too constraining. Thanks a lot for your effort; it would be really great if you could look at that instability issue.
@ficerr Okay, I'm convinced it is at least desirable to fix this. Changing this from wontfix to bug. It will probably not get fixed by me any time soon, unfortunately. If anyone reading this wants to attempt it, I would be happy to advise.
I also noticed this issue when fitting a model on my work vs. home computers. I've been working off the original example that @odedbd used, and I'm able to reproduce the "bug" that gives different results from the pruning pass onward. I'm testing this in two environments, my base Windows install and the Windows Subsystem for Linux, both with Python 3.6. I think the difference in results on the pruning pass and the selected Earth model is being driven by numpy.linalg.lstsq in _pruning.pyx (lines 99 and 144 at commit b209d19).
It looks like numpy.linalg.lstsq will produce different results depending on machine precision if an explicit rcond is not provided. I can get py-earth to give consistent results if I set rcond to something low like rcond=1e-3, but I assume that the final model is not as accurate if the precision is being truncated.
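The effect described above can be reproduced with numpy alone (the matrix below is a made-up near-singular example, not py-earth's actual bx): singular values below rcond times the largest are treated as zero, so the reported rank, and hence the solution, changes:

```python
import numpy as np

# A nearly rank-deficient system: the second column is almost a copy
# of the first, so the smallest singular value is tiny (~1.4e-10).
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-10],
              [1.0, 1.0 - 1e-10]])
b = np.array([1.0, 2.0, 3.0])

# Default cutoff (machine-precision based): the tiny singular value is
# kept, and the solution is dominated by it -> machine sensitive.
_, _, rank_default, sv = np.linalg.lstsq(A, b, rcond=None)

# Explicit cutoff: singular values < 1e-3 * sv.max() are zeroed,
# giving a stable (but truncated) minimum-norm solution.
_, _, rank_truncated, _ = np.linalg.lstsq(A, b, rcond=1e-3)

print(rank_default, rank_truncated)  # 2 1
```

With the default cutoff the ill-conditioned direction is retained, so small floating point differences between machines get amplified by the huge reciprocal of the tiny singular value; the explicit cutoff suppresses that direction entirely, which is exactly the trade-off against accuracy noted above.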
@johnmrudolph Thank you for this. If you get a chance, since you've got a reproducible test case in hand, could you see if the problem still exists in the v0.2dev branch? To install the branch, you should do:
Results on this branch might be different because I changed the lstsq function from numpy to scipy. However, I don't have a test case. |
I pulled the v0.2dev branch and reran the test. I'm still getting different results on the pruning pass and selected model. Scipy's lstsq has a different default rcond implementation, but it still appears to give different results on the pruning pass. Similar to numpy's lstsq, if I trim the precision then the earth models produce the same results on the pruning pass and model selection. I'll dig a little more into how scipy's lstsq estimates the default precision on both of the environments I'm testing. I'll also take a look at the input matrices to see if there are any differences in the values going into scipy's lstsq. Any other ideas?
Here is a test script: https://github.com/AresEkb/pyearth_issue166 It gives very different results on Windows and Linux machines (384 vs -704 for the last period). I run such a script on thousands of similar datasets. On the Windows machine the predictions seem to be better; on Linux a significant number of predictions are below zero. It seems that the problem with
Revisiting this issue. The inconsistent results between machines look to be caused by the least squares step in the pruning pass. The likely culprit is that the bx matrix generated by the forward pass is ill-conditioned, which causes the numpy least squares solution to be unstable across machines. As I mentioned earlier in the thread, a quick and dirty way to fix this is to set the lstsq rcond (or cond in the scipy implementation) to something low like 1e-3, which forces small singular values in the bx matrix to 0 and makes it easier for lstsq to find a consistent solution. That said, allowing the user to set the rcond parameter is not a good fix, because the user has no way of knowing whether the bx matrix generated by the forward pass will be near singular. The lstsq methods in numpy/scipy both assume that the coefficient matrix is full rank. There is some good background discussion on this, and on how R's lm function handles it differently, in "Differences in Linear Regression in R and Python [closed]". A better solution would be to "fix" the bx matrix before fitting the lstsq method in the pruning step to improve stability. I am not sure what the best approach would be. It looks like the R earth package has a routine in the underlying C code, called RegressAndFix, which drops any collinear terms in the bx matrix before running the backward pass. Alternatively, one could try regularizing or scaling the bx matrix to see if this makes the lstsq solution more stable. This would be a significant change to the existing py-earth MARS implementation, so I am not sure if it's worth pursuing.
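As a rough sketch of the RegressAndFix-style idea (drop_collinear and its rtol tolerance are hypothetical names here, not py-earth or R earth API), a rank-revealing pivoted QR can identify near-collinear columns of bx before the least squares fit:

```python
import numpy as np
from scipy.linalg import qr, lstsq

def drop_collinear(bx, rtol=1e-8):
    """Return indices of a well-conditioned column subset of bx.

    Pivoted QR orders columns so |diag(R)| is non-increasing; entries
    that are tiny relative to the first flag (near-)collinear columns.
    """
    _, R, piv = qr(bx, mode='economic', pivoting=True)
    d = np.abs(np.diag(R))
    keep = piv[d > rtol * d[0]]
    return np.sort(keep)

# Example: third column is an exact linear combination of the first two.
rng = np.random.RandomState(0)
bx = rng.rand(50, 2)
bx = np.column_stack([bx, bx[:, 0] + bx[:, 1]])
y = rng.rand(50)

cols = drop_collinear(bx)          # keeps 2 of the 3 columns
coef, *_ = lstsq(bx[:, cols], y)   # now a full-rank, stable problem
```

Which redundant column gets dropped depends on the pivot order, so this changes the reported basis but should leave the fitted values essentially unchanged, which mirrors what R's earth does before its backward pass.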
@johnmrudolph Thank you for all this info. While in some cases the pruning pass results differ across machines, the example posted by @AresEkb seems to differ in the forward pass. I do think near-singular matrices at various steps may be the issue. The forward pass has the zero_tol parameter for dealing with this. Apologies to everyone here for my relative absence from this thread and this problem. I have had little time available for this stuff lately, and that's likely to continue for a while. I would welcome any experiments or pull requests for this issue, and would be happy to advise on implementation details if anyone wants to work on it.
@jcrudy Thanks a lot! I added information on a run with zero_tol=1e-7 to the README.md. It gives the same results on both machines now. Also, it seems that the quality of the model is better now. For zero_tol=None: For zero_tol=1e-7:
@jcrudy I would be interested in trying a few experiments to attempt to resolve the inconsistent results in the pruning step. If you have any thoughts on what the preferred approach would be for handling the near-singular matrices (assuming that this is the cause), let me know. Otherwise I will probably trial a few different methods and report back on which appears to be the best approach; this will probably take some time. I didn't realize that @AresEkb was getting different results on the forward pass too, but it looks like that issue was resolved by setting a different zero_tol.
@johnmrudolph That's awesome. I think the general direction proposed in your previous comment is right. The forward pass uses an iterative update strategy based on the QR decomposition to update the least squares solution when adding new terms to the model. A similar strategy and much of the same code could (I think) be used in the pruning pass. @AresEkb Thank you very much for those results. Based on those, I think we can consider this problem solved for the forward pass (at least until someone discovers a data set where the problem persists with the higher zero_tol value).
@johnmrudolph FYI I just had a quick look at those notes. Only the very first part is relevant. The part that's really important is that you can do an update of a QR decomposition using Householder reflections, and that the lower right component of the R factor gives the error of the least squares problem. The code for building and updating the QR factorization is in _qr.pyx, and there are tests in test_qr.py. Again, not necessary to even look at that stuff, but just want to point you at it in case you decide to investigate that route. |
Thanks for the background information @jcrudy. I'll take a closer look at the .pyx files that handle the forward pass and QR factorization. I've mostly focused on the pruning step, so I will need to familiarize myself with this part of the code base. Your idea of repurposing the QR factorization seems like a good approach to start with. I'll check back after doing some investigating to determine if it's a good path forward.
Checking back in on this. I've had a chance to go through the forward pass in a little more detail to get a better understanding of how QR factorization is being applied in that step. The general idea is that we could update the pruning pass to leverage QR factorization in a similar way to the forward pass, instead of using lstsq. @jcrudy I could use a little more guidance on how to leverage the existing code base for solving the least squares problem with the QR factorization. Maybe there is a more direct way to compute the SSE from the QR factorization.
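On the question of computing the SSE more directly: for a least squares problem with design matrix B and response y, the thin QR factorization gives SSE = ||y||² − ||Qᵀy||², which appears to be the quantity jcrudy described as the lower-right component of the R factor. A small numpy sketch of the identity (generic, not the py-earth internals, and assuming B has full column rank):

```python
import numpy as np

def sse_via_qr(B, y):
    # For the least squares fit min ||B c - y||^2 with thin QR B = Q R:
    # fitted values are Q Q^T y, so SSE = ||y||^2 - ||Q^T y||^2.
    Q, _ = np.linalg.qr(B)   # 'reduced' mode: Q is m x n
    qty = Q.T @ y
    return float(y @ y - qty @ qty)

# Sanity check against explicitly computed residuals.
rng = np.random.RandomState(0)
B = rng.rand(50, 3)
y = rng.rand(50)
coef, *_ = np.linalg.lstsq(B, y, rcond=None)
direct = float(np.sum((y - B @ coef) ** 2))
print(abs(sse_via_qr(B, y) - direct) < 1e-10)  # True
```

The appeal for the pruning pass is that when a column is removed, the QR factorization (and hence this SSE) can be updated incrementally rather than re-solving the full least squares problem for every candidate deletion.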
I have the same issue but with a different method so the
Since there's no attribute to set the seed when the class object is initialized and then the fit function is called, I am setting the NumPy random seed to a fixed seed:
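Global seeding of that kind (a generic numpy sketch, not py-earth-specific) makes numpy's legacy global random stream reproducible within one process, but it cannot compensate for floating point differences between machines:

```python
import numpy as np

# Seeding the legacy global RandomState: identical draws per process.
np.random.seed(42)
first = np.random.rand(3)
np.random.seed(42)
second = np.random.rand(3)
print(np.array_equal(first, second))  # True

# The newer, explicitly passed Generator API is usually preferable
# when a library accepts a random_state/rng argument (py-earth's
# Earth class does not, hence the global seed above).
rng = np.random.default_rng(42)
```

So the seed guarantees the same random numbers everywhere; the divergence between machines comes from the numerical linear algebra, not the random draws.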
This seed is set globally at the first entry point of the application. The problem is that the forecasting results differ between different machines, although we are running inside a Docker image that depends on a Pipfile, so all the dependencies have the same versions, and the Python version is the same as well (3.9):
and from the Pipfile.lock:
The differences in the forecasted values are also significant to our application. Example:
As per my understanding, setting the seed to a specific number should ensure reproducibility, but is there anything I am missing here, or do I need to set the seed in some other way? Can anyone help with this?
@SandraSukarieh This is a known bug. Others have suggested that setting rcond or zero_tol to a very low value like 1e-7 helps with the reproducibility problem. But this project is largely abandoned, so don't expect a fix for this issue.
Could it have something to do with the processor's AVX and FMA instruction sets? The results on an older PC with AVX and a newer one with AVX2 differ. I have checked the numpy info with np.show_config(): the older machine lists different entries under "Supported SIMD extensions in this NumPy install" than the newer one. Both work with OpenBLAS, no MKL.
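For anyone comparing machines the same way, numpy can report its own build configuration; on recent versions this includes the BLAS/LAPACK backend and detected SIMD extensions (output varies by install):

```python
import numpy as np

# Prints the BLAS/LAPACK backend numpy was built against and, on
# recent numpy versions, the supported SIMD extensions. Diffing this
# output between two machines can explain small numeric differences.
np.show_config()
```

If the two machines report different backends or SIMD capabilities (e.g. AVX vs AVX2/FMA), differing summation and kernel strategies in the BLAS are a plausible source of the divergence.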
Hi,
I have recently updated the commit I am working with to #160. The commit I used to work with before the update was 30abe96 (Mar 31, 2016).
After updating I noticed that in some cases the model fit by the Earth estimator can be different between two computers running the exact same script, even when both computers have identical virtual environments, as indicated by "pip freeze" generating the exact same output.
The trained estimator is consistent when the script is run repeatedly on the same computer. Also, the trained estimator can be consistent between two different computers, but this is not guaranteed. So far I have been unable to find any consistent difference between computers where the trained estimator ends up different, except perhaps for a difference between i7 and i5 CPUs. I have no idea whether this could be relevant in any way.
This inconsistency does not occur when using the 30abe96 commit.
I was able to replicate this by slightly modifying one of the examples from the py-earth documentation, as shown below. I have turned the verbose output on, which shows that there are small differences between two runs of this code made on the same computer.
As you can see in the outputs from computer A and B below (both running Windows 8.1), there are a lot of similarities in the output, but also some significant differences. The numbers calculated in the forward pass are identical for the first five terms, but begin to differ, usually slightly, from the 6th term onward. The backward pass has different numbers for all the terms, and the selected iterations are different for the two cases, 21 vs. 24.
The final model basis displayed below also shows differences. On a private dataset (which I cannot share), the forward pass was identical, and the differences only began showing on the backward pass.
I'll be grateful for any idea on what might cause such an inconsistency in the trained estimator. Thank you for your help.
The modified example code:
The output I receive on computer A:
The output I receive on computer B: