[MRG] Add Information Gain and Information Gain Ratio feature selection functions #6534
Conversation
get_t4(c_prob, f_prob, f_count, fc_count, total)).mean(axis=0)


def ig(X, y):
Please use informative function names, e.g. info_gain and info_gain_ratio.
Renamed the functions in the next commit.
Please add some tests that check the expected values on simple edge cases where it's easy to compute the true value from the formula: e.g. the feature is constant and the binary target variable is split 50%/50%, the input feature and the target variable are equal, and so on.
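To make these edge cases concrete, here is a minimal sketch (not the PR's actual test code) that computes the expected values by hand from the definition IG(f) = H(y) - H(y|f); the helper `expected_ig` is hypothetical and only illustrates the numbers such tests should assert.

```python
import numpy as np
from scipy.stats import entropy


def expected_ig(f, y):
    """Information gain of a discrete feature f w.r.t. labels y, in bits."""
    f, y = np.asarray(f), np.asarray(y)
    h_y = entropy(np.bincount(y), base=2)                  # H(y)
    h_y_given_f = sum((f == v).mean() * entropy(np.bincount(y[f == v]), base=2)
                      for v in np.unique(f))               # H(y|f)
    return h_y - h_y_given_f


y = np.array([0, 0, 1, 1])                                 # 50%/50% binary target
assert np.isclose(expected_ig([0, 0, 0, 0], y), 0.0)       # constant feature: IG = 0
assert np.isclose(expected_ig(y, y), 1.0)                  # feature equals target: IG = H(y) = 1 bit
```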
@larsmans I think you might be interested in this PR.
@@ -226,6 +226,185 @@ def chi2(X, y):
    return _chisquare(observed, expected)


def _ig(fc_count, c_count, f_count, fc_prob, c_prob, f_prob, total):
same comment here: _info_gain.
@vpekar, since we're working towards a release and this is non-critical, you're best off bumping this thread for review after scikit-learn 0.18 is out.
Can you try to rebase on master? There seem to be some weird changes in there.
Can you maybe add an illustrative example of when this method works well or fails compared to what we already have?
@jnothman I added tests comparing with manually calculated values for IG and IGR, see "tests/test_info_gain.py": test_expected_value_info_gain and test_expected_value_info_gain_ratio.
clf = MultinomialNB(alpha=.01)

for func, name in [(chi2, "CHI2"), (info_gain, "IG"), (info_gain_ratio, "IGR"),
Just use the function name. There's plenty of space for the legend.
It might be worth adding to the legend the amount of time each feature scoring function took.
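A rough, self-contained sketch of what that could look like; the toy dataset, the k grid, and the scorers used here (chi2, f_classif) are stand-ins for the example's actual 20 newsgroups setup and the new functions.

```python
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

X, y = make_classification(n_samples=500, n_features=50, random_state=0)
X = np.abs(X)                       # chi2 and MultinomialNB need non-negative features
ks = [5, 10, 20, 40]

for func, name in [(chi2, "chi2"), (f_classif, "f_classif")]:
    t0 = time.time()
    accs = []
    for k in ks:
        Xk = SelectKBest(func, k=k).fit_transform(X, y)
        accs.append(cross_val_score(MultinomialNB(), Xk, y, cv=3).mean())
    elapsed = time.time() - t0
    plt.plot(ks, accs, label="%s (%.2f s)" % (name, elapsed))   # runtime in the legend

plt.xlabel("number of selected features")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```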
Comparison of feature selection functions
=========================================

This example illustrates performance of different feature selection functions
add "univariate"
Done
y_train, y_test = data_train.target, data_test.target
categories = data_train.target_names  # for case categories == None

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
I wonder how much the feature selection functions are affected by this rescaling.
I changed it to CountVectorizer; the accuracy dropped by ~5% for all the functions.
# apply feature selection
selector = SelectKBest(func, k)
This example is very slow (particularly for the sake of calculating MI; we might need to exclude it if it is truly this slow). Perhaps we should just calculate the func directly and perform the argsort here without SelectKBest. Thus the score is calculated once per example run.
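A sketch of that suggestion, assuming (as with the other univariate scorers) that the scoring function returns either an array of scores or a (scores, pvalues) tuple; the toy data and chi2 stand in for the example's vectorized training data and scorer.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.RandomState(0)
X_train = rng.poisson(1.0, size=(20, 10)).astype(float)   # stand-in for the vectorized training data
y_train = rng.randint(0, 2, size=20)
func, ks = chi2, [2, 5, 8]

result = func(X_train, y_train)                            # compute the scores only once
scores = result[0] if isinstance(result, tuple) else np.asarray(result)
ranking = np.argsort(scores)[::-1]                         # feature indices, best first

for k in ks:
    X_train_k = X_train[:, ranking[:k]]
    # ... fit the classifier on X_train_k and evaluate it for this cutoff ...
```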
Yes, it's MI that slows the example down a lot. It seems what slows it is the particular method used to calculate entropies, which is based on k-nearest neighbors. If MI were calculated once for all cutoff points, the example would be about 10 times faster (i.e., it would still take ~4.5 minutes on the test server, so I'm not sure it's worth it), and the code would get quite complicated. I guess it's best to just remove it from the example.
Yes, please remove MI from the example, but perhaps leave a comment that it is too slow for an example.
I have removed the MI function and added a comment at the top explaining why it is not shown in the example.
Also, MultinomialNB doesn't expose coef_ anymore, so we should certainly leave this out of the example.
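For reference, if the per-class keyword listing were kept anyway, MultinomialNB still exposes feature_log_prob_ (which coef_ used to alias). A small stand-alone sketch, with toy documents and class names in place of the example's data:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["spam spam offer", "cheap offer now", "meeting agenda notes", "project meeting today"]
labels = np.array([0, 0, 1, 1])
target_names = ["spam", "work"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(docs)
clf = MultinomialNB(alpha=0.01).fit(X_train, labels)

feature_names = np.asarray(vectorizer.get_feature_names_out())
for i, name in enumerate(target_names):
    top = np.argsort(clf.feature_log_prob_[i])[-3:]   # 3 highest log-probability terms per class
    print("%s: %s" % (name, " ".join(feature_names[top])))
```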
y : array-like, shape = (n_samples,)
    Target vector (class labels).

globalization : string
Perhaps "aggregate" or "pooling". Change "aver" to "mean".
Done.
otherwise lgtm
The plot shows the accuracy of a multinomial Naive Bayes classifier as a
function of the amount of the best features selected for training it using five
methods: chi-square, information gain, information gain ratio, F-test and
Kraskov et al's mutual information based on k-nearest neighbor distances.
update this
@@ -83,13 +83,7 @@
    help="Remove newsgroup information that is easily overfit: "
         "headers, signatures, and quoting.")
Please revert your changes to this file
@@ -24,7 +24,8 @@
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import SelectKBest, chi2, info_gain
Perhaps revert your changes to this file
y = [0, 1, 1, 0]


def mk_info_gain(k):
Just put this inline
Xtrans = scores.transform(Xsp)
assert_equal(Xtrans.shape, [Xsp.shape[0], 2])

# == doesn't work on scipy.sparse matrices
I don't get this comment
Xsp = csr_matrix(X, dtype=np.float)
scores, probs = info_gain_ratio(Xsp, y)
assert_almost_equal(scores[0], 0.25614, decimal=5)
This number is identical to that for IG. Can we tweak the example so that they differ?
I have traced the values with different data and now the number is different. I hope it's correct.
@@ -390,13 +576,17 @@ class SelectPercentile(_BaseFilter):
f_classif: ANOVA F-value between label/feature for classification tasks.
mutual_info_classif: Mutual information for a discrete target.
chi2: Chi-squared stats of non-negative features for classification tasks.
info_gain: Information Gain of features for classification tasks.
IMO, this is getting silly, especially now that we have "Read more in the User Guide". Maybe another PR should propose removing all this verbosity.
y : array-like, shape = (n_samples,)
    Target vector (class labels).

aggregate : string
add ", optional"
y : array-like, shape = (n_samples,)
    Target vector (class labels).

aggregate : string
add ", optional"
fc_count : array, shape = (n_features, n_classes)
total : int
"""
X = check_array(X, accept_sparse=['csr', 'coo'])
Why 'coo'? I think 'csc' might be most appropriate: I think X will be converted into CSC when safe_sparse_dot(Y.T, X) is performed below (assuming Y is CSR, as output by LabelBinarizer).
I have changed it to csc, though in my tests I didn't detect any big difference in performance.
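For context, a sketch of how those counts can be obtained, mirroring what chi2 already does in univariate_selection.py; the toy X and y are stand-ins, and the variable names follow the docstring above (the exact meaning of `total` in the PR is not reproduced here).

```python
import numpy as np
from scipy.sparse import csc_matrix
from sklearn.preprocessing import LabelBinarizer
from sklearn.utils import check_array
from sklearn.utils.extmath import safe_sparse_dot

X = csc_matrix(np.array([[1., 0.], [0., 2.], [3., 0.], [0., 1.]]))  # toy term counts
y = np.array([0, 1, 0, 1])

X = check_array(X, accept_sparse="csc")         # the format settled on in this thread
Y = LabelBinarizer().fit_transform(y)           # (n_samples, n_classes) indicator matrix
if Y.shape[1] == 1:                             # binary case: expand to two columns
    Y = np.hstack([1 - Y, Y])

fc_count = np.asarray(safe_sparse_dot(Y.T, X))  # (n_classes, n_features); transpose for the
                                                # (n_features, n_classes) layout in the docstring
f_count = np.asarray(X.sum(axis=0)).ravel()     # per-feature totals
c_count = Y.sum(axis=0)                         # per-class totals
```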
This requires a refresh and is close to being merged (at least Joel had only small points in the last review round). cc @StefanieSenger, who might be a good candidate to take over?
Thanks, @adrinjalali, I will try to take over.
I have addressed all the remaining issues, updated the PR to current repo standards and made a new PR for it: #28905
closing as superseded by #28905
Just writing a message here to mention that the information gain is already implemented in
The implementation in …
The one in …
The two functions produce different feature rankings: https://gist.github.com/vpekar/7b45b13d800f61d30b701034448705f9
Also see in the notebook the time it takes for the two functions to run.
Did you set
With
Yep, we probably have many more checks and additional overhead because the function is a bit more generic, with branching. One thing that is weird is the statistic computation: the definitions of the EMI and IG are the same and should result in the same values. I see differences in the implementations, so I should double-check.
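One quick sanity check for that: for a discrete (here binarized) feature, information gain is by definition the mutual information between the feature indicator and the class label, which sklearn.metrics.mutual_info_score computes exactly (in nats). A minimal sketch with synthetic data; the two printed values should agree to floating-point precision.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

rng = np.random.RandomState(0)
f = rng.randint(0, 2, size=1000)                 # binary feature-presence indicator
y = (f ^ (rng.rand(1000) < 0.2)).astype(int)     # class label correlated with f

h_y = entropy(np.bincount(y))                    # H(y), in nats
h_y_given_f = sum((f == v).mean() * entropy(np.bincount(y[f == v])) for v in (0, 1))
ig = h_y - h_y_given_f                           # IG = H(y) - H(y|f)

print(ig, mutual_info_score(f, y))               # information gain vs. exact discrete MI
```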
The commit implements the Information Gain [1] and Information Gain Ratio functions used for feature selection. These functions are commonly used in the filtering approach to feature selection in tasks such as text classification ([2], [3]). IG is implemented in the WEKA package.
The input parameters, output values, and tests of the functions follow the example of the chi-square function.
The coverage of sklearn.feature_selection.univariate_selection is 98%.
PEP8 and PyFlakes pass.
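For reference, the quantities described above correspond to the standard definitions below; the notation, and the assumption that a term feature f is binarized into presence/absence, are mine rather than the PR's. Note that IG and IGR coincide whenever the feature's own entropy H(f) equals 1 (e.g., with base-2 logarithms, a binary feature occurring in exactly half of the samples).

```latex
\mathrm{IG}(f) = H(C) - H(C \mid f)
              = -\sum_{c} P(c)\log P(c)
                + \sum_{v \in \{0,1\}} P(f{=}v) \sum_{c} P(c \mid f{=}v)\log P(c \mid f{=}v)

\mathrm{IGR}(f) = \frac{\mathrm{IG}(f)}{H(f)},
\qquad
H(f) = -\sum_{v \in \{0,1\}} P(f{=}v)\log P(f{=}v)
```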