SymbolicTransformer does not create added value features as expected #50
Looking at all the source code, I understand why the previously commented situation happens, and I have a recommendation to improve the SymbolicTransformer if you will consider it. You sort by fitness. Being honest, I checked all the code on GitHub and it seems that the calculations are correct. However, if I do it by hand I am able to fix the error, so I am leaving this here to report it and to propose a code fix. I installed gplearn through pip and I have version 0.2, but the code may differ, and for that reason I am not able to find the error on GitHub.
So, it is clear that everything works except the selection of the hall_of_fame programs. Hope it helps to improve the code. I say this because, using the Boston data of your example (from the documentation) and taking the first 300 samples as training, the results are the following (I used Ridge and Lars to be consistent with the doc example and the scripts above):
Curious values, as on Boston the original transform code from the doc example works better. However, using the dataset from my example, the results are (with Ridge and Lars):
Very curious... but I have no idea what exactly happens there.
Looking at other issues, I have seen that this problem was already reported in issue #42.
Hi @iblasi , thank you greatly for your in-depth description of the issue! 👍 You appear to be correct. I am working on fixing #42 in #60, and I believe I came across the same issue as you when debugging that problem. If you have the time, could you use the code from #60 to see if it generates the features as you would hope? FWIW, most raised issues have related to the regressor, so this bug may well have been present but not found for a while. OOB means "out of bag", as we use "bagging" to subsample the rows to evaluate.
@trevorstephens, I noticed some errors in my first example code; they have been fixed in case you want to test exactly that code to see the bug more clearly. I have tested your code and it still does not work properly.
I mean, you have already measured fitness, and you just want to sort by that score value. So just take the programs with the best fitness. You can still maintain the correlation check afterwards.
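A minimal sketch of that idea, assuming each evolved program exposes a `fitness_` attribute as gplearn's programs do (the `Program` stand-in class and the function name here are hypothetical, not gplearn's actual code):

```python
import numpy as np

class Program:
    """Stand-in for gplearn's evolved programs; only `fitness_` matters here."""
    def __init__(self, name, fitness):
        self.name = name
        self.fitness_ = fitness  # lower is better for error metrics like MAE

def select_hall_of_fame(programs, n_components):
    """Pick the n_components programs with the best (lowest) fitness."""
    fitness = np.array([p.fitness_ for p in programs])
    best = np.argsort(fitness)[:n_components]  # indices of the best programs
    return [programs[i] for i in best]

population = [Program("p%d" % i, f) for i, f in enumerate([3.1, 0.2, 1.7, 0.9, 2.4])]
chosen = select_hall_of_fame(population, 2)
print([p.name for p in chosen])  # -> ['p1', 'p3']
```

Note that for metrics where greater is better, the sort order would simply be reversed.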
I may be doing something wrong, so please check it, but the code works and gives the expected results.
@trevorstephens I now realize what I think you were trying to do with the correlation matrix. That worked for me also, although the results are not perfect with LARS; but that is logical, as I am creating features that are less correlated (most distinguished) and not the top-fitness ones. EDITED
Using the current code from gplearn-fix-second-issue-from-42, the output is:
The MAE is 1681.25844662 (it depends on the random_state used, but it is always high), so clearly it is not correct. Using the argmin selection mentioned above:
The MAE is 3.77298192689e-13, which is perfect. But you may take other features and not the highest-fitness ones. Using my proposed approach:
In summary, I am not sure that using correlation coefficients is the best choice, as you may miss the highest-fitness programs; I would prefer an approach that uses a linear selection by fitness.
Thanks again @iblasi , you are correct in that the algorithm tries to pick out the least correlated features of the hall_of_fame so that they can then be used in another model without as much collinearity in the features. I do see your point in terms of the potential removal of key programs from the group.

In order to maintain the idea of removing correlated programs, while keeping the best programs where possible, I think that my latest change might tick all the boxes. Currently the code checks for the most correlated pair of programs from the hall of fame, and then removes the one that is also most correlated with all the other programs left in the group. This could easily remove the top programs from the field, as they are also most likely to have many correlated "clones" in the generation. Instead, I now propose to find the most correlated pair as before, but then remove the one with the worse fitness of the two, and then iterate until the group is reduced to the required number of components.

Take a look at the current version of #60 and let me know what you think. Really appreciate your taking the time to dig into this one! 👍
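That iterative scheme can be sketched as follows. This is a simplified stand-alone version for illustration, not gplearn's actual implementation; the function name and the array layout (one row of predictions per program) are assumptions:

```python
import numpy as np

def reduce_hall_of_fame(outputs, fitness, n_components):
    """Iteratively drop one of the two most correlated programs,
    keeping whichever has the better (lower) fitness.

    outputs : (n_programs, n_samples) array of each program's predictions
    fitness : (n_programs,) array, lower is better
    Returns the list of surviving program indices.
    """
    keep = list(range(len(fitness)))
    while len(keep) > n_components:
        corr = np.abs(np.corrcoef(outputs[keep]))
        np.fill_diagonal(corr, 0.0)                 # ignore self-correlation
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        # of the most correlated pair, remove the one with worse fitness
        worse = keep[i] if fitness[keep[i]] > fitness[keep[j]] else keep[j]
        keep.remove(worse)
    return keep

rng = np.random.RandomState(0)
base = rng.normal(size=100)
outputs = np.vstack([base,
                     base + 0.01 * rng.normal(size=100),  # near-clone of row 0
                     rng.normal(size=100),
                     rng.normal(size=100)])
fitness = np.array([0.1, 0.5, 0.3, 0.4])
print(reduce_hall_of_fame(outputs, fitness, 3))  # -> [0, 2, 3]
```

In this toy run, rows 0 and 1 are the most correlated pair, and row 1 (the worse-fitness clone) is the one dropped, so the top program survives.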
@trevorstephens just perfect. Good job! Just one comment that does not change the final result, and may not improve speed much, about the code you use there. But as I said, it does not improve the result, as it performs the same operations.
That's a great observation @iblasi , it simplifies the code a fair bit.
Hi @trevorstephens ,
I am not sure if this is a bug, or if the documentation regarding SymbolicTransformer is not correctly focused.
I have put together a showcase of how SymbolicRegressor works and correctly predicts the equation that represents the dataset, while SymbolicTransformer does not work in the same way.
Starting with SymbolicRegressor, I made an "easy" dataset to check whether SymbolicRegressor gives me the correct result and good metrics.
This example gives a perfect result, and the MAE metric is ~perfect, as the output shows:
However, with SymbolicTransformer, although the training works well, the transform does not.
See the next example, the same as the previous one but with SymbolicTransformer:
I use Lars from sklearn to avoid Ridge's sparse weights and to find the best solution quickly for this easy, exact example. As can be seen in the results of this code (below), although the fitness becomes perfect during fit, the features generated with transform seem to be wrong. The problem does not come from Lars, as the last Lars example shows that adding "the feature" which is the target gives perfect accuracy.
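To illustrate that Lars itself is not the problem, here is a small self-contained check (the dataset and formula are made up for illustration, not the exact ones from my script): when a "perfect" feature, equal to the target, is appended to the inputs, Lars recovers it and the error is essentially zero.

```python
import numpy as np
from sklearn.linear_model import Lars
from sklearn.metrics import mean_absolute_error

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(300, 3))
y = X[:, 0] ** 2 - X[:, 1] * X[:, 2]   # an illustrative ground-truth formula

# Append the target itself as an extra column, mimicking a "perfect" GP feature
X_aug = np.hstack([X, y[:, None]])

est = Lars().fit(X_aug, y)
mae = mean_absolute_error(y, est.predict(X_aug))
print(mae)  # essentially zero: Lars latches onto the perfect feature
```

So if the transformed features really encoded the target, Lars would find them; a large MAE therefore points at the features produced by transform, not at the linear model.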
So I decided to inspect the fitted features created during fit, and some of them are perfect; however, the transform seems not to use them correctly in the gp_features it creates. Is this a bug? I am doing the same thing as explained in the SymbolicTransformer example.