This repository has been archived by the owner on Dec 6, 2023. It is now read-only.

Add variable importance #24

Closed

jcrudy opened this issue Mar 10, 2013 · 11 comments

Comments

@jcrudy
Collaborator

jcrudy commented Mar 10, 2013

No description provided.

@mehdidc
Contributor

mehdidc commented Apr 17, 2015

I would like to implement this. I think it should be optional (we would need to add, for instance, a boolean parameter compute_feature_importances=True/False).

I have seen that in the earth R package (http://www.milbo.org/doc/earth-notes.pdf) they have three criteria. During the pruning pass we obtain, for each size of subset of basis functions, either the best one (in terms of RSS) or an estimate of the best one, depending on whether we do an exhaustive search or a greedy search (as implemented here and in the original paper). The three criteria are, if I understand correctly, the following:

  1. Given the best subset of basis functions for each size, they count the number of times each variable occurs in these subsets (there is one subset per size, determined by the pruning pass).

  2. They compute the decrease in RSS when pruning a basis function from a subset, and for each variable they sum that decrease over all the subsets that include the variable.

  3. Same as 2), except that they use GCV instead of RSS.

So, would it be valuable to add these features to py-earth? If not, any other ideas?
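
A minimal sketch of this reading of the three criteria (the function and its input format are made up for illustration; this is not the earth package's or py-earth's code). It assumes we already have, from the pruning pass, the set of variables used by the best subset of each size, together with that subset's RSS and GCV:

import numpy as np

def pruning_importances(subset_variables, rss, gcv, n_features):
    """subset_variables: one set of variable indices per pruning-pass subset,
    ordered from the smallest subset to the largest; rss, gcv: the RSS and
    GCV of each of those subsets, in the same order."""
    nb_subsets = np.zeros(n_features)  # criterion 1: occurrence counts
    rss_score = np.zeros(n_features)   # criterion 2: summed RSS decrease
    gcv_score = np.zeros(n_features)   # criterion 3: summed GCV decrease
    for i, used in enumerate(subset_variables):
        for v in used:
            nb_subsets[v] += 1
        if i > 0:
            # credit the drop in RSS/GCV relative to the previous, smaller
            # subset to the variables used by the current subset
            for v in used:
                rss_score[v] += rss[i - 1] - rss[i]
                gcv_score[v] += gcv[i - 1] - gcv[i]
    return nb_subsets, rss_score, gcv_score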

@jcrudy
Collaborator Author

jcrudy commented Apr 18, 2015

Yes, this would be a great contribution. I'd suggest that instead of a parameter of the Earth constructor, it should be its own class or function. That way users wouldn't need to decide up front whether or not they want variable importance, but could decide after fitting the model. This may require storing some additional information in the pruning record, but not too much, I think. Methods other than evimp could be included in a similar way, perhaps with additional small changes to the Earth class. To be more concrete, I'm thinking of something like this:

from pyearth import Earth
from pyearth.importance import Importance
from your_data import X, y

model = Earth().fit(X, y)
importance = Importance(model).fit(X, y)

This is just a general idea. You can maybe come up with better names and think about whether it should be a class or a function, what should be returned, whether training data need be used (not for evimp, but for other methods?), etc.

@mehdidc
Contributor

mehdidc commented Jun 15, 2015

I implemented the internal code in _pruning.pyx for the three criteria I talked about, but not yet the external API you suggest. Just to be sure I understand your idea about what is done here: "importance = Importance(model).fit(X, y)". What would an Importance class contain? Is importance a model that is exactly the same as model but "filters" only the relevant features?

Here is a plot of feature importance with these three criteria using the friedman1 dataset (sklearn.datasets.make_friedman1, http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman1.html):
[feature importance plot, hosted on Imgur]

@jcrudy
Collaborator Author

jcrudy commented Jun 17, 2015

That's great. For the API, I was trying to think about how additional importance measurements might later be added. There are some that would require training data as well as a fitted model. However, the result would still just be a vector of importance measurements. So, I think there is no reason it needs to be a class. And, since for now it's just the evimp method and needs no training data, it's probably best to keep things simple and just do:

from pyearth import Earth
from pyearth.importance import variable_importance
from your_data import X, y

model = Earth().fit(X, y)
importance = variable_importance(model)

You could either specify the criterion as an argument or come up with separate function names for each.

Things worth thinking about:

  1. What happens if the model hasn't been pruned?
  2. If there are multiple outputs, is variable importance calculated separately for each output to give an importance matrix?
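
A rough, self-contained sketch of one possible answer to those two questions (everything here, including the input format, is hypothetical and only illustrative; it is not a proposal for py-earth's internals): refuse to compute importances for an unpruned model, and return one row per output when there are several.

import numpy as np

def resolve_importances(per_output_scores, criterion='gcv'):
    """per_output_scores: dict mapping each criterion name ('gcv', 'rss',
    'subsets') to a list with one importance vector per output, or None if
    the model was never pruned."""
    if per_output_scores is None:
        # Question 1: the criteria are defined by the pruning pass, so an
        # unpruned model has no importances to report.
        raise ValueError("variable importance requires a pruned model")
    vectors = per_output_scores[criterion]
    # Question 2: keep one row per output; a single vector can be recovered
    # with matrix.mean(axis=0) if a caller wants one value per feature.
    matrix = np.vstack(vectors)
    return matrix[0] if matrix.shape[0] == 1 else matrix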

@agramfort
Member

agramfort commented Jun 18, 2015 via email

@jcrudy
Collaborator Author

jcrudy commented Jun 24, 2015

In this case it is definitely not costly to compute, and if that's the sklearn way then I guess we should do it that way. My only concern with that approach is that someone could fit a model without feature importance and later decide it is actually needed. In the case of this particular set of feature importance measures, we can just compute them every time because they're very cheap.

There is still the issue of exactly which feature importance measure to use: gcv, mse, or subsets. How would sklearn handle such an option? Ideally, all three types of importance measurement should be available after model fit. Perhaps it should be an n x 3 array? Is there anywhere sklearn uses feature_importances_ and expects them to be a single value per feature? Another option is to include an init parameter to make the choice (for sklearn compatibility), but also have methods or functions that can compute them later if needed.

@mehdidc, given what @agramfort said, I suggest the following:

  1. Have an init parameter, importance_type (feel free to change the name) that can be 'gcv', 'mse', or 'subsets'. By default, it is 'gcv'. Have a feature_importances_ attribute on any fitted and pruned model. Docstrings should point out that pruning is required for currently implemented importance measurements.
  2. Add an importance module with all three importance functions. Have a function, importance, that takes an Earth object and an importance_type argument and returns an n x 1 array of importance values. When Earth computes importance at the end of the pruning pass, it can call this function. If someone decides to later compute a different importance measurement, the function can still be used on any pruned Earth model.
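
A usage sketch of this proposal (nothing here is implemented yet; importance_type, feature_importances_, and the pyearth.importance.importance function are the names suggested in the two points above, not an existing API). It uses the friedman1 data mentioned earlier:

from sklearn.datasets import make_friedman1
from pyearth import Earth
from pyearth.importance import importance  # hypothetical module and function, as proposed above

X, y = make_friedman1(n_samples=200, random_state=0)

# importance_type is chosen up front; feature_importances_ is filled in at
# the end of the pruning pass (point 1)
model = Earth(importance_type='gcv').fit(X, y)
print(model.feature_importances_)

# later, recompute a different measure on the same pruned model (point 2)
subset_counts = importance(model, importance_type='subsets')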

@agramfort
Member

> In this case it is definitely not costly to compute, and if that's the sklearn way then I guess we should do it that way. My only concern with that approach is that someone could fit a model without feature importance and later decide it is actually needed. In the case of this particular set of feature importance measures, we can just compute them every time because they're very cheap.

That's what sklearn does for random forest.

> There is still the issue of exactly which feature importance measure to use: gcv, mse, or subsets. How would sklearn handle such an option?

Not a problem: in sklearn there is only one implemented.

> Ideally, all three types of importance measurement should be available after model fit. Perhaps it should be an n x 3 array? Is there anywhere sklearn uses feature_importances_ and expects them to be a single value per feature?

Keep it one float per feature. Is there one method that is more standard? That would be a good default. If we want to support all, I would add a param in init that takes None (no feature importance computed) | "mse" | "gcv" ...

@jcrudy
Collaborator Author

jcrudy commented Mar 25, 2016

@mehdidc what is the status of this? I seem to remember you did some work on it already, but I can't recall how far you got with it. I'm marking it for 0.2, but if you think 0.1 feel free to change. Also, assigning to you.

@jcrudy jcrudy added this to the 0.2 milestone Mar 25, 2016
@mehdidc
Contributor

mehdidc commented Jun 23, 2016

@jcrudy I am preparing a PR for this very soon. I think it is fine for 0.1, since it is orthogonal to the large changes you made to the forward pass: it only affects the pruning pass.

@jcrudy jcrudy modified the milestones: 0.1, 0.2 Jun 23, 2016
@jcrudy
Collaborator Author

jcrudy commented Jun 23, 2016

@mehdidc Awesome. Just changed it to 0.1.

@jcrudy
Collaborator Author

jcrudy commented Jul 31, 2016

Implemented by @mehdidc in commit 2d70700. Closing this issue.

@jcrudy jcrudy closed this as completed Jul 31, 2016