This repository has been archived by the owner on Dec 6, 2023. It is now read-only.

Add variable importance #24

Closed

jcrudy opened this issue Mar 10, 2013 · 11 comments

Comments

@jcrudy
Collaborator

jcrudy commented Mar 10, 2013

No description provided.

@mehdidc
Contributor

mehdidc commented Apr 17, 2015

I would like to implement this. I think it should be optional (we would need to add, for instance, a boolean parameter compute_feature_importances=True/False).

I have seen that in the earth R package (http://www.milbo.org/doc/earth-notes.pdf) they have three criteria. During the pruning pass we obtain, for each size of subset of basis functions, either the best one (in terms of RSS) or an estimate of the best one, depending on whether we do an exhaustive search or a greedy search (as implemented here and in the original paper). The three criteria are, if I understand correctly, the following:

  1. Given the best subset of basis functions for each size, they count the number of times each variable occurs in these subsets (there is one subset per size, determined by the pruning pass).

  2. They compute the decrease in RSS when pruning a basis function from a subset, and for each variable they sum that decrease over all the subsets that include the variable.

  3. Same as 2), except that they use GCV instead of RSS.

So, would it be valuable to add these features to py-earth? If not, any other ideas?
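
A minimal sketch of this reading of the three criteria (the function and its input format are made up for illustration; this is not the earth package's or py-earth's code). It assumes we already have, from the pruning pass, the set of variables used by the best subset of each size, together with that subset's RSS and GCV:

import numpy as np

def pruning_importances(subset_variables, rss, gcv, n_features):
    """subset_variables: one set of variable indices per pruning-pass subset,
    ordered from the smallest subset to the largest; rss, gcv: the RSS and
    GCV of each of those subsets, in the same order."""
    nb_subsets = np.zeros(n_features)  # criterion 1: occurrence counts
    rss_score = np.zeros(n_features)   # criterion 2: summed RSS decrease
    gcv_score = np.zeros(n_features)   # criterion 3: summed GCV decrease
    for i, used in enumerate(subset_variables):
        for v in used:
            nb_subsets[v] += 1
        if i > 0:
            # credit the drop in RSS/GCV relative to the previous, smaller
            # subset to the variables used by the current subset
            for v in used:
                rss_score[v] += rss[i - 1] - rss[i]
                gcv_score[v] += gcv[i - 1] - gcv[i]
    return nb_subsets, rss_score, gcv_score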

@jcrudy
Collaborator Author

jcrudy commented Apr 18, 2015

Yes, this would be a great contribution. I'd suggest that instead of a parameter of the Earth constructor, it should be its own class or function. That way users wouldn't need to decide up front whether or not they want variable importance, but could decide after fitting the model. This may require storing some additional information in the pruning record, but not too much, I think. Methods other than evimp could be included in a similar way, perhaps with additional small changes to the Earth class. To be more concrete, I'm thinking of something like this:

from pyearth import Earth
from pyearth.importance import Importance
from your_data import X, y

model = Earth().fit(X, y)
importance = Importance(model).fit(X, y)

This is just a general idea. You can maybe come up with better names and think about whether it should be a class or a function, what should be returned, whether training data need be used (not for evimp, but for other methods?), etc.

@mehdidc
Contributor

mehdidc commented Jun 15, 2015

I implemented the internal code in _pruning.pyx for the three criteria I talked about, but not yet the external API you suggest. Just to be sure I understand your idea about what is done here: "importance = Importance(model).fit(X, y)". What would an Importance class contain? Is importance a model that is exactly the same as model but "filters" only the relevant features?

Here is a plot of feature importance with these three criteria using the friedman1 dataset (sklearn.datasets.make_friedman1, http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman1.html):
[feature importance plot, hosted on Imgur]

@jcrudy
Collaborator Author

jcrudy commented Jun 17, 2015

That's great. For the API, I was trying to think about how additional importance measurements might later be added. There are some that would require training data as well as a fitted model. However, the result would still just be a vector of importance measurements. So, I think there is no reason it needs to be a class. And, since for now it's just the evimp method and needs no training data, it's probably best to keep things simple and just do:

from pyearth import Earth
from pyearth.importance import variable_importance
from your_data import X, y

model = Earth().fit(X, y)
importance = variable_importance(model)

You could either specify the criterion as an argument or come up with separate function names for each.

Things worth thinking about:

  1. What happens if the model hasn't been pruned?
  2. If there are multiple outputs, is variable importance calculated separately for each output to give an importance matrix?
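
A rough, self-contained sketch of one possible answer to those two questions (everything here, including the input format, is hypothetical and only illustrative; it is not a proposal for py-earth's internals): refuse to compute importances for an unpruned model, and return one row per output when there are several.

import numpy as np

def resolve_importances(per_output_scores, criterion='gcv'):
    """per_output_scores: dict mapping each criterion name ('gcv', 'rss',
    'subsets') to a list with one importance vector per output, or None if
    the model was never pruned."""
    if per_output_scores is None:
        # Question 1: the criteria are defined by the pruning pass, so an
        # unpruned model has no importances to report.
        raise ValueError("variable importance requires a pruned model")
    vectors = per_output_scores[criterion]
    # Question 2: keep one row per output; a single vector can be recovered
    # with matrix.mean(axis=0) if a caller wants one value per feature.
    matrix = np.vstack(vectors)
    return matrix[0] if matrix.shape[0] == 1 else matrix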

@agramfort
Member

agramfort commented Jun 18, 2015 via email

@jcrudy
Collaborator Author

jcrudy commented Jun 24, 2015

In this case it is definitely not costly to compute, and if that's the sklearn way then I guess we should do it that way. My only concern with that approach is that someone could fit a model without feature importance and later decide it is actually needed. In the case of this particular set of feature importance measures, we can just compute them every time because they're very cheap.

There is still the issue of exactly which feature importance measure to use: gcv, mse, or subsets. How would sklearn handle such an option? Ideally, all three types of importance measurement should be available after model fit. Perhaps it should be an n x 3 array? Is there anywhere sklearn uses feature_importances_ and expects them to be a single value per feature? Another option is to include an init parameter to make the choice (for sklearn compatibility), but also have methods or functions that can compute them later if needed.

@mehdidc, given what @agramfort said, I suggest the following:

  1. Have an init parameter, importance_type (feel free to change the name) that can be 'gcv', 'mse', or 'subsets'. By default, it is 'gcv'. Have a feature_importances_ attribute on any fitted and pruned model. Docstrings should point out that pruning is required for currently implemented importance measurements.
  2. Add an importance module with all three importance functions. Have a function, importance, that takes an Earth object and an importance_type argument and returns an n x 1 array of importance values. When Earth computes importance at the end of the pruning pass, it can call this function. If someone decides to later compute a different importance measurement, the function can still be used on any pruned Earth model.
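
A usage sketch of this proposal (nothing here is implemented yet; importance_type, feature_importances_, and the pyearth.importance.importance function are the names suggested in the two points above, not an existing API). It uses the friedman1 data mentioned earlier:

from sklearn.datasets import make_friedman1
from pyearth import Earth
from pyearth.importance import importance  # hypothetical module and function, as proposed above

X, y = make_friedman1(n_samples=200, random_state=0)

# importance_type is chosen up front; feature_importances_ is filled in at
# the end of the pruning pass (point 1)
model = Earth(importance_type='gcv').fit(X, y)
print(model.feature_importances_)

# later, recompute a different measure on the same pruned model (point 2)
subset_counts = importance(model, importance_type='subsets')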

@agramfort
Member

> In this case it is definitely not costly to compute, and if that's the sklearn way then I guess we should do it that way. My only concern with that approach is that someone could fit a model without feature importance and later decide it is actually needed. In the case of this particular set of feature importance measures, we can just compute them every time because they're very cheap.

That's what sklearn does for random forest.

> There is still the issue of exactly which feature importance measure to use: gcv, mse, or subsets. How would sklearn handle such an option?

Not a problem: in sklearn there is only one implemented.

> Ideally, all three types of importance measurement should be available after model fit. Perhaps it should be an n x 3 array? Is there anywhere sklearn uses feature_importances_ and expects them to be a single value per feature?

Keep it one float per feature. Is there one method that is more standard? That would be a good default. If we want to support all, I would add a param in init that takes None (no feature importance computed) | "mse" | "gcv" ...

@jcrudy
Collaborator Author

jcrudy commented Mar 25, 2016

@mehdidc what is the status of this? I seem to remember you did some work on it already, but I can't recall how far you got with it. I'm marking it for 0.2, but if you think 0.1 feel free to change. Also, assigning to you.

@jcrudy jcrudy added this to the 0.2 milestone Mar 25, 2016
@mehdidc
Contributor

mehdidc commented Jun 23, 2016

@jcrudy I am preparing a PR for this very soon. I think it is fine for 0.1, since it is orthogonal to the large changes you made to the forward pass: it only affects the pruning pass.

@jcrudy jcrudy modified the milestones: 0.1, 0.2 Jun 23, 2016
@jcrudy
Collaborator Author

jcrudy commented Jun 23, 2016

@mehdidc Awesome. Just changed it to 0.1.

@jcrudy
Collaborator Author

jcrudy commented Jul 31, 2016

Implemented by @mehdidc in commit 2d70700. Closing this issue.

@jcrudy jcrudy closed this as completed Jul 31, 2016