Feature selection based on f_score: Dummy columns #2359

Closed
amelio-vazquez-reina opened this Issue Aug 13, 2013 · 1 comment


@amelio-vazquez-reina

I found that feature selection based on f_classif (i.e. the ANOVA F-test) can break if a feature has a constant value (e.g. all zeros).

A simple way to reproduce this is to append all-zero columns to X in the example plot_feature_selection.py, as with the dummy columns below:

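import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Some noisy data not correlated with the target
E = np.random.uniform(0, 0.1, size=(len(iris.data), 20))

# Constant (all-zero) dummy features
dummy = np.zeros((iris.data.shape[0], 3))

# Add the noisy data and the dummy columns to the informative features
X = np.hstack((iris.data, E, dummy))
y = iris.target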

f_classif already emits a warning whenever multiple columns (features) are duplicates of each other. It may be a good idea to also warn the user when a feature is constant across all instances.
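
As a rough illustration of the kind of check suggested above, constant features could be detected on the caller's side before scoring. This is only a sketch using plain NumPy, not how scikit-learn implements anything: constant_columns is a hypothetical helper, and the random "informative" block is a stand-in for iris.data so that the snippet runs without loading a dataset.

```python
import numpy as np


def constant_columns(X):
    """Return the indices of columns whose value is the same in every row.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    X = np.asarray(X)
    # A column is constant exactly when its maximum equals its minimum.
    return np.flatnonzero(X.max(axis=0) == X.min(axis=0))


rng = np.random.RandomState(0)
informative = rng.normal(size=(150, 4))       # stand-in for iris.data (assumption)
noise = rng.uniform(0, 0.1, size=(150, 20))   # uncorrelated noise, as in the report
dummy = np.zeros((150, 3))                    # constant all-zero columns
X = np.hstack((informative, noise, dummy))

# Find the constant columns and drop them before any F-test based scoring.
const_idx = constant_columns(X)
keep = np.setdiff1d(np.arange(X.shape[1]), const_idx)
X_clean = X[:, keep]

print("constant columns:", const_idx)
print("shape after dropping them:", X_clean.shape)
```

The cleaned matrix X_clean can then be passed to f_classif or a selector built on it. scikit-learn also provides sklearn.feature_selection.VarianceThreshold, whose default threshold of 0.0 removes zero-variance features, which may be preferable inside a pipeline.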

@agramfort
scikit-learn member
@arjoly arjoly added the Enhancement label Jul 18, 2014
@agramfort agramfort closed this in #3744 Oct 8, 2014