
Feature selection based on f_score: Dummy columns #2359

Closed
amelio-vazquez-reina opened this issue Aug 13, 2013 · 1 comment · Fixed by #3744

Comments

@amelio-vazquez-reina

I found that feature_selection based on f_classif (i.e. the F-test) can break if a feature has a constant value (e.g. all zeros).

The easiest way to reproduce this is to add a few all-zero columns to X in the example plot_feature_selection.py; see the dummy columns below:

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Some noisy data, not correlated with the target
E = np.random.uniform(0, 0.1, size=(len(iris.data), 20))

# Add the noisy data and three constant (all-zero) dummy columns
# to the informative features
dummy = np.zeros((len(iris.data), 3))
X = np.hstack((iris.data, E, dummy))
y = iris.target

f_classif already throws a warning whenever multiple columns (multiple features) are duplicates of each other. It may be a good idea to also warn the user when a feature is constant across all instances.
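To make the failure mode concrete, here is a minimal sketch that runs f_classif on data containing constant columns. This assumes scikit-learn's f_classif API; the exact behavior for the constant columns (nan scores and/or a "Features ... are constant" warning) depends on the installed version:

```python
# Minimal reproduction sketch, assuming scikit-learn's f_classif API.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

iris = load_iris()
E = np.random.uniform(0, 0.1, size=(len(iris.data), 20))  # noise features
dummy = np.zeros((len(iris.data), 3))                     # constant columns
X = np.hstack((iris.data, E, dummy))
y = iris.target

F, pval = f_classif(X, y)
print(F[:4])   # informative iris features: large, finite F-scores
print(F[-3:])  # constant columns: degenerate (0/0) statistics
```

The constant columns have zero within-group and zero between-group variance, so the F-statistic is a 0/0 division, which is why a dedicated warning (or explicit handling) is useful.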

@agramfort
Member

Fine with me. A pull request is very welcome.

