
the use of chi2 with the iris dataset is incorrect #17286

Open
OfirKedem opened this issue May 20, 2020 · 9 comments · May be fixed by #23589
@OfirKedem

OfirKedem commented May 20, 2020

Describe the issue linked to the documentation:

Many of the examples in the documentation use the iris dataset, and some also use chi2 for feature selection. However, since the iris features are continuous, using chi2 as a scoring function is invalid (chi2 assumes count data).

Here are the examples I found with this error:

https://scikit-learn.org/stable/auto_examples/svm/plot_svm_anova.html#sphx-glr-auto-examples-svm-plot-svm-anova-py

https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

Suggest a potential alternative/fix

Change the dataset or the scoring function, or bin the data.

@glemaitre
Member

This seems correct. I think the easiest fix is to use a KBinsDiscretizer, since we might not have a fully categorical dataset available.
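A minimal sketch of that direction, binning the continuous iris features before scoring them with chi2 (the choice of `n_bins=4` and the use of `SelectKBest` are my own illustrative assumptions, not from the thread):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)

# Bin each continuous feature into ordinal categories so the values
# fed to chi2 are small non-negative integers rather than raw lengths.
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X)

# Keep the 2 highest-scoring binned features.
selector = SelectKBest(chi2, k=2).fit(X_binned, y)
print(selector.get_support())  # boolean mask over the 4 iris features
```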

@glemaitre glemaitre added the Bug label May 20, 2020
@glemaitre
Member

@OfirKedem Do you want to make a PR?

@jnothman
Member

jnothman commented May 20, 2020 via email

@glemaitre
Member

> (And I don't think we should discretise unless it's justified by the example use case.)

I was thinking that it would make sense with iris, since you have a kind of grouping related to size.

@2796gaurav

Hi @jnothman @glemaitre, I think we should provide better examples in the documentation. Let me know if you want this; I can make a PR.

@AfekAmiri

@2796gaurav Yes, please; it should be clearer for users. I think the documentation should be more accurate so that one can rely on it. Thank you!

@i-aki-y
Contributor

i-aki-y commented Jun 2, 2022

I think replacing chi2 with f_classif is sufficient and the right choice for these tasks.

@jnothman

> Is it problematic to use chi2 with continuous vars for a relative ranking?
> I.e. is it only invalid theoretically?

I think this usage is both theoretically and practically invalid.
When the input X is a continuous feature, the resulting score is sensitive to the scale of the input values. This means that chi2 tends to select features that simply have large absolute values rather than features correlated with the target.

See the following explanation.

The chi-square score is given by:

score \sim \sum_i (O_i - E_i)^2 / E_i,

where O_i and E_i are the observed and expected values for the i-th class.
If a scale factor a is applied to X (X -> aX), both O_i and E_i scale by a, so the score becomes a times larger:

\sum_i (a O_i - a E_i)^2 / (a E_i) = a \cdot score.

This means that features with large absolute values get large chi2 scores.
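The scaling argument above can be checked directly. A small sketch (my own, not from the thread) showing that multiplying X by a constant multiplies every chi2 score by the same constant:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

X, y = load_iris(return_X_y=True)

scores, _ = chi2(X, y)
scaled_scores, _ = chi2(10 * X, y)

# Each score is 10x larger (up to floating point), confirming the
# linear dependence of the chi2 score on the input scale.
print(scaled_scores / scores)
```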

Demo:

Using the above result, you can flip the outcome of the following example just by increasing the noise intensity:

https://scikit-learn.org/stable/auto_examples/svm/plot_svm_anova.html#sphx-glr-auto-examples-svm-plot-svm-anova-py

With this change:

```diff
- X = np.hstack((X, 2 * rng.random((X.shape[0], 36))))
+ X = np.hstack((X, 200 * rng.random((X.shape[0], 36))))
```

I can get the opposite result:

[figure: chi2 results]

The blue curve is the original and the orange is the noise-enlarged version.

See also the same experiment with the f_classif version, which is stable under scaling:

```diff
- ("anova", SelectPercentile(chi2)),
+ ("anova", SelectPercentile(f_classif)),
```

[figure: f_classif results]
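For contrast, the scale invariance of f_classif can be verified the same way. A small sketch (my own, not from the thread):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

X, y = load_iris(return_X_y=True)

f_scores, _ = f_classif(X, y)
f_scores_scaled, _ = f_classif(10 * X, y)

# The ANOVA F-statistic is a ratio of variances, so a common
# positive scale factor cancels and the scores are unchanged.
print(np.allclose(f_scores, f_scores_scaled))
```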

@AfekAmiri

AfekAmiri commented Jun 4, 2022 via email

@i-aki-y
Contributor

i-aki-y commented Jun 11, 2022

@AfekAmiri

> hence the counts won't change

I think this is not true, because chi2 assumes that the elements of the input X are counts. The scaling a*X means that all counts are multiplied by a; in other words, the values of the contingency table change by a factor of a.

Note that sklearn.feature_selection.chi2 applies a one-way chi-square test (the same as scipy.stats.chisquare) to each feature (each column of X).
If you have a contingency table and want to apply a chi-square test to it, scipy.stats.chi2_contingency might be what you want.

I think there is no sklearn counterpart of chi2_contingency for feature selection.
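To illustrate the chi2_contingency route mentioned above, here is a minimal sketch; binning a single iris feature (sepal length) at its quartiles is my own illustrative assumption, not from the thread:

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Bin one continuous feature at its quartiles, then build a
# class-by-bin contingency table of counts.
feature = X[:, 0]
bins = np.digitize(feature, np.quantile(feature, [0.25, 0.5, 0.75]))
table = np.zeros((3, 4))  # 3 classes x 4 bins
for cls, b in zip(y, bins):
    table[cls, b] += 1

# Chi-square test of independence on the contingency table.
stat, p, dof, expected = chi2_contingency(table)
print(stat, p, dof)
```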
