Minimum redundancy maximum relevance (mRMR) feature selection #2547

Closed · wants to merge 10 commits into scikit-learn:master from AndreaBravi:mRMR


@AndreaBravi

Hi!

I have created a new class implementing mRMR filter feature selection.

pep8, pyflakes and nosetests all run successfully on the submitted code.

I am planning to create the documentation for this class as soon as I receive your approval.

Thanks!

Andrea

@coveralls

Coverage Status

Coverage remained the same when pulling 62cf55b on AndreaBravi:mRMR into d43a767 on scikit-learn:master.

@larsmans larsmans and 1 other commented on an outdated diff Oct 26, 2013
sklearn/feature_selection/multivariate_filtering.py
+ redundancy : array, shape=[n_features]
+ Redundancy of all the features
+ rule : string, default='diff'
+ Rule to combine relevance and redundancy, either
+ 'diff' - difference between the two
+ 'prod' - product between the two
+ X : array, shape=[n_samples, n_features]
+ Input dataset, must be either integer or categorical
+ y : array, shape=[n_samples]
+ Label vector, must be either integer or categorical
+
+ Methods
+ -------
+ _compute_mRMR(X, y)
+ Computes the minimum redundancy maximum relevance of each feature,
+ returning mask and score
@larsmans
larsmans Oct 26, 2013

The docstring should never refer to private methods.

@larsmans larsmans and 1 other commented on an outdated diff Oct 26, 2013
sklearn/feature_selection/multivariate_filtering.py
+ """
+ M = X.shape[1] # Number of features
+ self.n_features = M
+
+ # Computation of relevance and redundancy
+ relevance = np.zeros(M)
+ redundancy = np.zeros([M, M])
+ for m1 in range(0, M):
+ relevance[m1] = mutual_info_score(X[:, m1], y)
+ for m2 in range(m1+1, M):
+ redundancy[m1, m2] = mutual_info_score(X[:, m1],
+ X[:, m2])
+ redundancy[m2, m1] = redundancy[m1, m2]
+
+ self.relevance = relevance
+ self.redundancy = redundancy
@larsmans
larsmans Oct 26, 2013

Please postpone setting instance attributes until after the algorithm has succeeded. Otherwise, the estimator winds up in an inconsistent state (some attributes set, but not usable).
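
For instance, roughly like this (a sketch of the pattern only; the two helper names are hypothetical, not the PR's actual methods):

```python
def fit(self, X, y):
    # Hypothetical refactor: compute everything into locals first...
    relevance, redundancy = self._compute_relevance_redundancy(X, y)
    mask, score = self._sequential_search(relevance, redundancy)
    # ...and assign instance attributes only once nothing can fail anymore.
    self.relevance = relevance
    self.redundancy = redundancy
    self.mask, self.score = mask, score
    return self
```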

@larsmans larsmans and 1 other commented on an outdated diff Oct 26, 2013
sklearn/feature_selection/multivariate_filtering.py
+
+ self.X = X
+ self.y = y
+ self.mask, self.score = self._compute_mRMR(X, y)
+ return self
+
+ def _get_support_mask(self):
+ """
+ Returns
+ -------
+ support : array, dtype=bool, shape=[n_features]
+ Boolean mask with True for the selected features
+ """
+
+ support = np.zeros(self.n_features, dtype=bool)
+ support[[self.mask]] = True
@larsmans
larsmans Oct 26, 2013

What do the extra brackets do?

@AndreaBravi
AndreaBravi Oct 29, 2013

Nothing! Removed them
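
The corrected line, for reference:

```python
support[self.mask] = True  # plain fancy indexing; the nested brackets added nothing
```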

@amueller
scikit-learn member

It would be cool to have an example that compares this method with univariate selection and RFE.

@amueller
scikit-learn member

This looks good, thanks. I am not totally convinced by the tests, though. I am not familiar with the method but it would be good if the expected result from the model could be computed in an easy way. Currently it looks like the scores are some magic numbers and I don't know what they mean.

@AndreaBravi

Nice suggestion! I will insert that kind of example in the documentation as soon as I get familiar with Sphinx.

About the testing, I emulated what is done in the tests for mutual_information (sklearn.metrics.cluster); those also use arbitrary numbers. I am not aware of a theoretical value of mRMR that could be used for this purpose.

@coveralls

Coverage Status

Coverage remained the same when pulling c2d1852 on AndreaBravi:mRMR into d43a767 on scikit-learn:master.

@amueller
scikit-learn member

To create an example, you simply have to add a file whose name starts with plot_ under the examples folder.

@AndreaBravi

Thanks for the clarification; I was thinking of adding it to the description of the method inside feature_selection.rst.

By the way, once I have added the example to the examples folder, which rst file do I need to modify to make sure it gets published at http://scikit-learn.org/stable/auto_examples/index.html?

@coveralls

Coverage Status

Coverage remained the same when pulling 9d1ea9a on AndreaBravi:mRMR into d43a767 on scikit-learn:master.

@ddofer

Are there any plans to eventually merge this? I'd love to see mRMR/mutual-information feature selection decently implemented in Python (without needing the C executable), especially in scikit-learn.

@amueller
scikit-learn member

Sorry this has sat around for a while. We all seem to be pretty busy at the moment. I still think this is a cool addition.

@jnothman jnothman commented on the diff Jan 18, 2015
sklearn/feature_selection/multivariate_filtering.py
+from ..base import BaseEstimator
+from .base import SelectorMixin
+from ..metrics.cluster.supervised import mutual_info_score
+from ..utils.validation import array2d
+
+
+class MinRedundancyMaxRelevance(BaseEstimator, SelectorMixin):
+ """
+ Select the subset of features with minimal redundancy and maximal
+ relevance (mRMR) with the outcome.
+
+ IMPORTANT: This version only supports data in categorical or integer form.
+
+ Attributes
+ ----------
+ k : int, default=2
@jnothman
jnothman Jan 18, 2015

This should be in a Parameters section together with "rule".

@jnothman jnothman commented on the diff Jan 18, 2015
sklearn/feature_selection/multivariate_filtering.py
+ 'prod' - product between the two
+ X : array, shape=[n_samples, n_features]
+ Input dataset, must be either integer or categorical
+ y : array, shape=[n_samples]
+ Label vector, must be either integer or categorical
+
+ References
+ ----------
+ .. [1] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual
+ information: criteria of max-dependency, max-relevance, and
+ min-redundancy", IEEE Transactions on Pattern Analysis and Machine
+ Intelligence, Vol. 27, No. 8, pp.1226-1238, 2005.
+ """
+ def __init__(self, k=2, rule='diff'):
+ """
+ Parameters
@jnothman
jnothman Jan 18, 2015

Rather, this should move to the class docstring.
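
i.e. something like this numpydoc layout at the top of the class (the wording here is just a sketch):

```python
"""Select the k features with minimal redundancy and maximal
relevance (mRMR) with respect to the outcome.

Parameters
----------
k : int, default=2
    Number of features to select.
rule : {'diff', 'prod'}, default='diff'
    How relevance and redundancy are combined into a single score:
    'diff' takes their difference, 'prod' their product.
"""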

@jnothman jnothman commented on the diff Jan 18, 2015
sklearn/feature_selection/multivariate_filtering.py
+ """
+ self.k = k
+ self.rule = rule
+
+ def fit(self, X, y):
+ """
+ Parameters
+ ----------
+ X : array, shape=[n_samples, n_features]
+ Input dataset, must be either integer or categorical
+ y : array, shape=[n_samples]
+ Label vector, must be either integer or categorical
+ """
+ X = array2d(X)
+
+ self.X = X
@jnothman
jnothman Jan 18, 2015

I don't think these should be stored.

@jnothman jnothman commented on the diff Jan 18, 2015
sklearn/feature_selection/multivariate_filtering.py
+ """
+ Select the subset of features with minimal redundancy and maximal
+ relevance (mRMR) with the outcome.
+
+ IMPORTANT: This version only supports data in categorical or integer form.
+
+ Attributes
+ ----------
+ k : int, default=2
+ Number of features to select (selected_features)
+ mask : list, len=selected_features
+ Integer list of the features ordered by maximal relevance and
+ minimal redundancy
+ score : array, shape=[selected_features]
+ mRMR score associated to each entry in mask
+ relevance : array, shape=[n_features]
@jnothman
jnothman Jan 18, 2015

All attributes representing the model should have a _ suffix.
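
e.g. in fit():

```python
self.mask_ = mask    # trailing underscore marks an attribute learned during fit
self.score_ = score
```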

@jnothman jnothman commented on the diff Jan 18, 2015
sklearn/feature_selection/multivariate_filtering.py
+
+ Returns
+ -------
+ mask : list, len=selected_features
+ Integer list of the features ordered by maximal relevance and
+ minimal redundancy
+ score : list, len=selected_features
+ mRMR score associated to each entry in mask
+ """
+ M = X.shape[1] # Number of features
+
+ # Computation of relevance and redundancy
+ relevance = np.zeros(M)
+ redundancy = np.zeros([M, M])
+ for m1 in range(0, M):
+ relevance[m1] = mutual_info_score(X[:, m1], y)
@jnothman
jnothman Jan 18, 2015

It would be nice if we could reimplement mutual_info to support calculation over sparse feature spaces (in a CSC matrix).

@jnothman
jnothman Jan 18, 2015

Also, we could possibly calculate MI for all the features more efficiently.
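
One possible direction (a rough sketch; column_mutual_info is a hypothetical helper, assuming dense integer-coded inputs — sparse CSC support would need extra work):

```python
import numpy as np

def column_mutual_info(X, y):
    """Mutual information (in nats) between each discrete column of X and y."""
    n_samples, n_features = X.shape
    _, y_codes = np.unique(y, return_inverse=True)
    n_y = y_codes.max() + 1
    mi = np.empty(n_features)
    for j in range(n_features):
        _, x_codes = np.unique(X[:, j], return_inverse=True)
        n_x = x_codes.max() + 1
        # Build the joint contingency table with a single bincount.
        joint = np.bincount(x_codes * n_y + y_codes,
                            minlength=n_x * n_y).reshape(n_x, n_y)
        pxy = joint / float(n_samples)
        px = pxy.sum(axis=1, keepdims=True)
        py = pxy.sum(axis=0, keepdims=True)
        nz = pxy > 0  # 0 * log(0) terms contribute nothing
        mi[j] = np.sum(pxy[nz] * np.log(pxy[nz] / (px * py)[nz]))
    return mi
```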

@jnothman jnothman commented on the diff Jan 18, 2015
...eature_selection/tests/test_multivariate_filtering.py
+ [1, 3, 3],
+ [1, 3, 1]])
+
+y = np.array([3, 1, 3, 1, 3])
+
+
+def test_mRMR():
+ """
+ Test MinRedundancyMaxRelevance with default setting.
+ """
+
+ m = MinRedundancyMaxRelevance().fit(X, y)
+
+ assert_array_equal([2, 0], m.mask)
+
+ assert_array_equal(0.6730116670092563, m.score[0])
@jnothman
jnothman Jan 18, 2015

It would be better to compare to results in the literature, which I presume are not reported to 16 decimal places ;)

@jnothman jnothman commented on the diff Jan 18, 2015
sklearn/feature_selection/multivariate_filtering.py
+ redundancy[m1, m2] = mutual_info_score(X[:, m1],
+ X[:, m2])
+ redundancy[m2, m1] = redundancy[m1, m2]
+
+ # Sequential search optimization
+ mask = []
+ score = []
+ search_space = range(0, M)
+
+ score.append(max(relevance))
+ ind = int(relevance.argmax(0)) # Optimal feature
+ mask.append(ind)
+ search_space.pop(ind)
+
+ if self.rule == 'diff':
+ for m in range(0, self.k-1):
@jnothman
jnothman Jan 18, 2015

Rather than repeating yourself with the logic, you could define rule = np.subtract or rule = np.multiply
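
e.g. (an untested sketch against the PR's variable names):

```python
combine = {'diff': np.subtract, 'prod': np.multiply}[self.rule]
```

so the selection loop is written once and the rule only changes how the two terms are combined.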

@jnothman jnothman commented on the diff Jan 18, 2015
sklearn/feature_selection/multivariate_filtering.py
+ # Sequential search optimization
+ mask = []
+ score = []
+ search_space = range(0, M)
+
+ score.append(max(relevance))
+ ind = int(relevance.argmax(0)) # Optimal feature
+ mask.append(ind)
+ search_space.pop(ind)
+
+ if self.rule == 'diff':
+ for m in range(0, self.k-1):
+ tmp_score = relevance[search_space] - \
+ np.mean(redundancy[:, search_space]
+ .take(mask, axis=0), 0)
+ score.append(max(tmp_score))
@jnothman
jnothman Jan 18, 2015

Please use np.max when operating over an array

@jnothman
jnothman Jan 18, 2015

Although it may make more sense to just use score.append(tmp_score[ind]) after the following line.
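
i.e., roughly (assuming search_space is a plain Python list of remaining feature indices):

```python
ind = int(np.argmax(tmp_score))
score.append(tmp_score[ind])        # reuse the argmax, no second pass over the array
mask.append(search_space.pop(ind))  # pop returns the selected feature index
```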

@jnothman jnothman commented on the diff Jan 18, 2015
examples/plot_mRMR.py
+B = np.array([2] * 25 + [1] * 25 + [1] * 25 + [0] * 25)
+
+# Creating a feature able to classify 40% of the samples
+C = np.array([2] * 20 + [0] * 30 + [1] * 30 + [2] * 20)
+
+X = np.array([A, B, C]).T
+feature = ['A', 'B', 'C']
+
+# We will be using the following three selectors
+selectors = [('RFE', RFE(LogisticRegression(), 2)),
+ ('Uni', SelectKBest(chi2, k=2)),
+ ('mRMR', MinRedundancyMaxRelevance(k=2))]
+
+for name, selector in selectors:
+ k = selector.fit(X, y).get_support(True).tolist()
+ print name, 'selected %s and %s' % (feature[k[0]], feature[k[1]])
@jnothman
jnothman Jan 18, 2015

Please make this Python 3-compatible (i.e. use the function form of print with from __future__ import print_function)
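
i.e.:

```python
from __future__ import print_function

print(name, 'selected %s and %s' % (feature[k[0]], feature[k[1]]))
```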

@jnothman jnothman commented on the diff Jan 18, 2015
doc/modules/feature_selection.rst
@@ -247,10 +247,38 @@ features::
* :ref:`example_ensemble_plot_forest_importances_faces.py`: example
on face recognition data.
+.. _mRMR:
+
+Minimum Redundancy Maximum Relevance (mRMR)
+===============================================
+
+This filter feature selector was proposed by Peng et al. in 2005. mRMR
+identifies a subset of features having maximal mutual information with the
+target (i.e. relevance), and minimal mutual information with each other (i.e.
+redundancy).
+
+The algorithm expects discretized features. Peng et al. suggest using the mean
+and standard deviation of each feature for that purpose: for instance, dividing
+each feature into three levels according to whether a value lies below
+mean - std, between mean - std and mean + std, or above mean + std.
@jnothman
jnothman Jan 18, 2015

Maybe this transformation should be provided as part of the feature selector.
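
For instance, a hypothetical helper following Peng et al.'s mean/std suggestion (a sketch, not part of this PR):

```python
import numpy as np

def discretize_mean_std(X):
    """Ternarize each feature: -1 below mean - std, +1 above mean + std,
    0 in between (the discretization suggested by Peng et al.)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    D = np.zeros(X.shape, dtype=int)
    D[X < mu - sigma] = -1
    D[X > mu + sigma] = 1
    return D
```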

@MechCoder MechCoder closed this Nov 5, 2015
@MechCoder
scikit-learn member

Closed in favour of the other PR
