
the use of chi2 with the iris dataset is incorrect #17286

Open
OfirKedem opened this issue May 20, 2020 · 9 comments · May be fixed by #23589
@OfirKedem

OfirKedem commented May 20, 2020

Describe the issue linked to the documentation:

Many of the examples in the documentation use the iris dataset, and some also use chi2 for feature selection. However, since the iris features are continuous, using chi2 as a scoring function is invalid (chi2 assumes count data).

Here are the examples I found with this error:

https://scikit-learn.org/stable/auto_examples/svm/plot_svm_anova.html#sphx-glr-auto-examples-svm-plot-svm-anova-py

https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

Suggest a potential alternative/fix

Change the dataset or the scoring function, or bin the data.

@glemaitre
Member

This seems correct. I think the easiest fix is to use a KBinsDiscretizer, since we might not have a fully categorical dataset available.
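A minimal sketch of that direction, binning the continuous iris features before scoring them with chi2 (the choice of `n_bins=4` and the use of `SelectKBest` are my own illustrative assumptions, not from the thread):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)

# Bin each continuous feature into ordinal categories so the values
# fed to chi2 are small non-negative integers rather than raw lengths.
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X)

# Keep the 2 highest-scoring binned features.
selector = SelectKBest(chi2, k=2).fit(X_binned, y)
print(selector.get_support())  # boolean mask over the 4 iris features
```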

@glemaitre glemaitre added the Bug label May 20, 2020
@glemaitre
Member

@OfirKedem Do you want to make a PR?

@jnothman
Member

jnothman commented May 20, 2020 via email

@glemaitre
Member

> (And I don't think we should discretise unless it's justified by the example use case.)

I was thinking that it would make sense with iris, since you have a kind of grouping related to size.

@2796gaurav

Hi @jnothman @glemaitre, I think we should provide better examples in the documentation. Let me know if you want this; I can make a PR.

@AfekAmiri

@2796gaurav Yes, please; it should be clearer for users. I think the documentation should be more accurate so that one can rely on it. Thank you!

@i-aki-y
Contributor

i-aki-y commented Jun 2, 2022

I think replacing chi2 with f_classif is sufficient and the right choice for these tasks.

@jnothman

> Is it problematic to use chi2 with continuous vars for a relative ranking?
> I.e. is it only invalid theoretically?

I think this usage is both theoretically and practically invalid.
When the input X is a continuous feature, the resulting score is sensitive to the scale of the input values. This means that chi2 tends to select features that simply have large absolute values rather than features correlated with the target.

See the following explanation.

The chi-square score is given by:

score \sim \sum_i (O_i - E_i)^2 / E_i,

where O_i and E_i are the observed and expected values for the i-th class.
If a scale factor a is applied to X (X -> aX), both O_i and E_i scale by a, so the score becomes a times larger:

\sum_i (a O_i - a E_i)^2 / (a E_i) = a \cdot score.

This means that features with large absolute values get large chi2 scores.
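The scaling argument above can be checked directly. A small sketch (my own, not from the thread) showing that multiplying X by a constant multiplies every chi2 score by the same constant:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

X, y = load_iris(return_X_y=True)

scores, _ = chi2(X, y)
scaled_scores, _ = chi2(10 * X, y)

# Each score is 10x larger (up to floating point), confirming the
# linear dependence of the chi2 score on the input scale.
print(scaled_scores / scores)
```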

Demo:

Using the above result, you can flip the outcome of the following example just by increasing the noise intensity:

https://scikit-learn.org/stable/auto_examples/svm/plot_svm_anova.html#sphx-glr-auto-examples-svm-plot-svm-anova-py

With this change:

```diff
- X = np.hstack((X, 2 * rng.random((X.shape[0], 36))))
+ X = np.hstack((X, 200 * rng.random((X.shape[0], 36))))
```

I can get the opposite result:

[figure: chi2 results]

The blue curve is the original and the orange is the noise-enlarged version.

See also the same experiment with the f_classif version, which is stable under scaling:

```diff
- ("anova", SelectPercentile(chi2)),
+ ("anova", SelectPercentile(f_classif)),
```

[figure: f_classif results]
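For contrast, the scale invariance of f_classif can be verified the same way. A small sketch (my own, not from the thread):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

X, y = load_iris(return_X_y=True)

f_scores, _ = f_classif(X, y)
f_scores_scaled, _ = f_classif(10 * X, y)

# The ANOVA F-statistic is a ratio of variances, so a common
# positive scale factor cancels and the scores are unchanged.
print(np.allclose(f_scores, f_scores_scaled))
```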

@AfekAmiri

AfekAmiri commented Jun 4, 2022 via email

@i-aki-y
Contributor

i-aki-y commented Jun 11, 2022

@AfekAmiri

> hence the counts won't change

I think this is not true, because chi2 assumes that the elements of the input X are counts. The scaling a*X means that all counts are multiplied by a; in other words, the values of the contingency table change by a factor of a.

Note that sklearn.feature_selection.chi2 applies a one-way chi-square test (the same as scipy.stats.chisquare) to each feature (each column of X).
If you have a contingency table and want to apply a chi-square test to it, scipy.stats.chi2_contingency might be what you want.

I think there is no sklearn counterpart of chi2_contingency for feature selection.
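To illustrate the chi2_contingency route mentioned above, here is a minimal sketch; binning a single iris feature (sepal length) at its quartiles is my own illustrative assumption, not from the thread:

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Bin one continuous feature at its quartiles, then build a
# class-by-bin contingency table of counts.
feature = X[:, 0]
bins = np.digitize(feature, np.quantile(feature, [0.25, 0.5, 0.75]))
table = np.zeros((3, 4))  # 3 classes x 4 bins
for cls, b in zip(y, bins):
    table[cls, b] += 1

# Chi-square test of independence on the contingency table.
stat, p, dof, expected = chi2_contingency(table)
print(stat, p, dof)
```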
