Add variable importance #24
I would like to implement this. I think it should be optional (we would need to add, for instance, a boolean parameter compute_feature_importances=True/False). I have seen that in the earth R package (http://www.milbo.org/doc/earth-notes.pdf) they have three criteria. During the pruning pass we obtain, for each size of subset of basis functions, either the best one (in terms of RSS) or an estimate of the best one, depending on whether we do an exhaustive search or a greedy search (as implemented here and in the original paper). If I understand correctly, the three criteria are the following: nb_subsets (the number of pruning subsets in which a variable appears), rss (the total increase in RSS attributed to the variable over the pruning sequence), and gcv (the same, using GCV).
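A rough sketch of how these three criteria could be computed from a pruning trace. The trace structure and all names here are assumptions for illustration, not py-earth's actual internals:

```python
import numpy as np

def importances_from_trace(pruning_trace, n_vars):
    # pruning_trace: list of (basis_subset, gcv, rss) tuples, ordered from
    # the full model down to the smallest, where basis_subset is a set of
    # terms and each term is a frozenset of the variable indices it uses.
    # This structure is an assumption for illustration only.
    nb_subsets = np.zeros(n_vars)
    gcv_imp = np.zeros(n_vars)
    rss_imp = np.zeros(n_vars)
    for (prev_subset, prev_gcv, prev_rss), (subset, gcv, rss) in zip(
            pruning_trace[:-1], pruning_trace[1:]):
        # Credit the variables in the pruned terms with the resulting
        # increase in GCV and RSS at this pruning step.
        for term in prev_subset - subset:
            for v in term:
                gcv_imp[v] += gcv - prev_gcv
                rss_imp[v] += rss - prev_rss
    # nb_subsets: in how many subset sizes does each variable appear?
    for subset, _, _ in pruning_trace:
        for v in {v for term in subset for v in term}:
            nb_subsets[v] += 1
    return nb_subsets, gcv_imp, rss_imp
```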
So, would it be valuable to add these features to py-earth? If not, any other ideas? |
Yes, this would be a great contribution. I'd suggest that, instead of a parameter of the Earth constructor, it should be its own class or function. That way users wouldn't need to decide up front whether or not they want variable importance; they could decide after fitting the model. This may require storing some additional information in the pruning record, but not too much, I think. Methods other than evimp could be included in a similar way, perhaps with additional small changes to the Earth class. To be more clear, I'm thinking something like this:
This is just a general idea. You can probably come up with better names, and think about whether it should be a class or a function, what should be returned, whether training data need to be used (not for evimp, but perhaps for other methods?), etc. |
I implemented the internal code in _pruning.pyx for the three criteria I mentioned, but not yet the external API you suggest. Just to be sure I understand your idea: in "importance = Importance(model).fit(X, y)", what would an Importance class contain? Is importance a model which is exactly the same as model but which "filters" only the relevant features? Here is a plot of feature importance for the three criteria on the friedman1 dataset (sklearn.datasets.make_friedman1, http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman1.html): |
That's great. For the API, I was trying to think about how additional importance measures might be added later. Some of them would require training data as well as a fitted model. However, the result would still just be a vector of importance measurements, so I think there is no reason it needs to be a class. And since for now it's just the evimp method and needs no training data, it's probably best to keep things simple and just do:
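A minimal sketch of the simple function interface, assuming the fitted model stores its pruning record under an attribute (the names `variable_importance` and `pruning_record` are guesses for illustration):

```python
import numpy as np

def variable_importance(model, criterion='gcv'):
    # Hypothetical free function returning one importance value per
    # variable, read from the fitted model's stored pruning record.
    return np.asarray(model.pruning_record[criterion])
```

Here `criterion` could be 'gcv', 'rss', or 'nb_subsets'; separate functions such as a gcv_importance(model) would work equally well.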
You could either specify the criterion as an argument or come up with separate function names for each. Things worth thinking about:
|
To do it the sklearn way, you should have an attribute on the object called feature_importances_, i.e. no extra function. If it's costly to compute and you don't want to compute it by default, add an init param to say whether or not to compute it.
|
In this case it is definitely not costly to compute, and if that's the sklearn way then I guess we should do it that way. My only concern with that approach is that someone could fit a model without feature importance and later decide it is actually needed. For this particular set of importance measures we can just compute them every time, because they're very cheap. There is still the question of exactly which measure to use: gcv, mse, or subsets. How would sklearn handle such an option? Ideally, all three types of importance should be available after the model is fit. Perhaps it should be an n x 3 array? Is there anywhere sklearn uses feature_importances_ and expects a single value per feature? Another option is to include an init parameter to make the choice (for sklearn compatibility), but also have methods or functions that can compute the others later if needed. @mehdidc, given what @agramfort said, I suggest the following:
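A sketch of that combined approach: compute all three cheap criteria at fit time, expose the chosen one under the sklearn-standard name, and keep the others reachable afterwards. The parameter and method names are guesses, not the final API, and the fitting logic is omitted:

```python
import numpy as np

class Earth(object):
    # Sketch only: real forward/pruning passes omitted; names hypothetical.
    def __init__(self, feature_importance_type='gcv'):
        self.feature_importance_type = feature_importance_type

    def fit(self, X, y):
        # ... forward pass and pruning pass would run here ...
        # All three criteria are cheap, so compute and store every one
        # (zeros here stand in for the real computed scores).
        n = X.shape[1]
        self._importances = {k: np.zeros(n)
                             for k in ('gcv', 'rss', 'nb_subsets')}
        # Expose the chosen criterion under the sklearn-standard name,
        # keeping the others available for users who decide later.
        self.feature_importances_ = \
            self._importances[self.feature_importance_type]
        return self

    def get_feature_importances(self, criterion):
        # Retrieve any of the three criteria after fitting.
        return self._importances[criterion]
```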
|
That's what sklearn does for random forests.
|
@mehdidc, what is the status of this? I seem to remember you did some work on it already, but I can't recall how far you got. I'm marking it for 0.2, but if you think 0.1, feel free to change it. Also, assigning it to you. |
@jcrudy I am preparing a PR for this very soon. I think it is fine for 0.1, as it is orthogonal to the large changes you made to the forward pass; it only affects the pruning pass. |
@mehdidc Awesome. Just changed it to 0.1. |