the use of chi2 with the iris dataset is incorrect #17286
Comments
This seems correct. I think that the easiest is to use a |
@OfirKedem Do you want to make a PR? |
Is it problematic to use chi2 with continuous vars for a relative ranking?
I.e. is it only invalid theoretically?
(And I don't think we should discretise unless it's justified by the
example use case.)
|
I was thinking that it would make sense with iris, since there is a kind of grouping related to the size. |
Hi @jnothman @glemaitre, I think we should provide better examples in the documentation. Let me know if you want this; I can make a PR. |
@2796gaurav Yes please, it should be clearer for users. I think the documentation should be more accurate so that one can rely on it. Thank you! |
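I think replacing chi2 with f_classif is sufficient and the right choice for the tasks.
@jnothman, regarding "Is it problematic to use chi2 with continuous vars for a relative ranking? I.e. is it only invalid theoretically?":
I think that this is theoretically and practically invalid usage. When the input X is a continuous feature, the resulting score is sensitive to the scale of the input values. This means that chi2 tends to select the features that simply have large absolute values instead of correlated features.
See the following explanation. The chi-square is given by the following eq:
score \sim \sum_i (O_i - E_i)^2 / E_i,
where O_i and E_i are the observed and expected values of the i-th class.
If a scale factor a is applied to X (X -> aX), the score becomes a times larger:
\sum_i (a O_i - a E_i)^2 / (a E_i) = a \cdot score.
This means that features with large absolute values have large chi2 scores.
Demo: By using the above result, you can get the opposite result in the following example just by increasing the noise intensity:
https://scikit-learn.org/stable/auto_examples/svm/plot_svm_anova.html#sphx-glr-auto-examples-svm-plot-svm-anova-py
With this change:
- X = np.hstack((X, 2 * rng.random((X.shape[0], 36))))
+ X = np.hstack((X, 200 * rng.random((X.shape[0], 36))))
I can get the opposite result: the blue is the original and the orange is the noise-enlarged version.
See also the same experiment with the f_classif version, which is stable under scaling:
- ("anova", SelectPercentile(chi2)),
+ ("anova", SelectPercentile(f_classif)), |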
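The scale argument above can be checked on iris itself. The following is only a minimal sketch, not part of the linked example: it multiplies one column (sepal width, index 1) by an arbitrary factor of 1000, as a change of units would, and compares which feature chi2 and f_classif rank first. The factor and the choice of column are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2, f_classif

X, y = load_iris(return_X_y=True)

# Rescale a single feature (sepal width); the factor 1000 is arbitrary.
X_rescaled = X.copy()
X_rescaled[:, 1] *= 1000.0

chi2_before, _ = chi2(X, y)
chi2_after, _ = chi2(X_rescaled, y)
f_before, _ = f_classif(X, y)
f_after, _ = f_classif(X_rescaled, y)

# Index of the top-ranked feature under each score, before and after rescaling.
print("chi2 top feature:     ", chi2_before.argmax(), "->", chi2_after.argmax())
print("f_classif top feature:", f_before.argmax(), "->", f_after.argmax())
```

With chi2 the rescaled column moves to the top of the ranking even though its relationship with the target is unchanged, while the f_classif ranking is unaffected.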
I don't understand why the scale of the data would affect the result. Isn't
chi2 performed based on a contingency table, so that the counts won't
change if I rescale the data?
On Thu, 2 Jun 2022, 07:07 i-aki-y wrote:
I think replacing chi2 with f_classif is sufficient and the right choice
for the tasks.
@jnothman <https://github.com/jnothman>
Is it problematic to use chi2 with continuous vars for a relative ranking?
I.e. is it only invalid theoretically?
I think that this is theoretically and practically invalid usage.
When the input X is a continuous feature, the resulting score is
sensitive to the scale of the input values. This means that chi2 tends
to select the features that simply have large absolute values instead of
correlated features.
See the following explanation.
The chi-square is given by the following eq:
score \sim \sum_i (O_i - E_i)^2 / E_i,
where O_i and E_i are the observed and expected values of the i-th class.
If a scale factor a is applied to X (X -> aX), the score becomes a times
larger:
\sum_i (a O_i - a E_i)^2 / (a E_i) = a \cdot score.
This means that features with large absolute values have large chi2
scores.
Demo:
By using the above result, you can get the opposite result in the
following example with just increasing noise intensity:
https://scikit-learn.org/stable/auto_examples/svm/plot_svm_anova.html#sphx-glr-auto-examples-svm-plot-svm-anova-py
With this change:
- X = np.hstack((X, 2 * rng.random((X.shape[0], 36))))
+ X = np.hstack((X, 200 * rng.random((X.shape[0], 36))))
I can get the opposite result:
[image: chi2]
<https://user-images.githubusercontent.com/9190086/171563030-63581f03-f039-4013-8267-fb49a6023c0d.jpg>
The blue is the original and the orange is noise enlarged version.
See also the same experiment with the f_classif version, which is stable
under scaling.
- ("anova", SelectPercentile(chi2)),
+ ("anova", SelectPercentile(f_classif)),
[image: f_classif]
<https://user-images.githubusercontent.com/9190086/171563203-c488a7f3-e01a-4b2f-8584-e16c9dcc8a97.jpg>
|
Note that the I think that there is no sklearn's counterpart of the |
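On the contingency-table question, here is a hedged sketch of how the score appears to be computed. It assumes that sklearn.feature_selection.chi2 treats each feature value as a non-negative frequency and sums it per class, rather than binning samples into a table of counts. The variable names (observed, expected, manual) are illustrative only. If that reading is right, the manual score below should agree with the library output.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

X, y = load_iris(return_X_y=True)

# One-hot encode the three iris classes: shape (n_samples, 3).
Y = np.eye(3)[y]

# "Observed": per-class sums of the raw feature values (not counts of samples).
observed = Y.T @ X

# "Expected": class frequency times the total of each feature.
class_prob = Y.mean(axis=0)
expected = np.outer(class_prob, X.sum(axis=0))

manual = ((observed - expected) ** 2 / expected).sum(axis=0)
sk_scores, _ = chi2(X, y)

print(manual)
print(sk_scores)
```

Because observed and expected are both linear in X, multiplying a feature by a multiplies its score by a, which is why rescaling changes the result here while it would not for a table of counts.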
Describe the issue linked to the documentation:
Many of the examples in the documentation use the iris dataset, and some also use chi2 for feature selection. However, since the iris data is continuous, the use of chi2 as a metric is invalid.
Here are the examples I found with this error:
https://scikit-learn.org/stable/auto_examples/svm/plot_svm_anova.html#sphx-glr-auto-examples-svm-plot-svm-anova-py
https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
Suggest a potential alternative/fix
Change the dataset or the metric, or bin the data.
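For the "bin the data" option, one possible sketch is below. It is not a recommendation from the maintainers: the helper name binned_chi2, the choice of 3 uniform-width bins, and the use of KBinsDiscretizer with one-hot output are all assumptions made for illustration. After one-hot encoding, every column is a 0/1 indicator, so the per-class sums are genuine counts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)


def binned_chi2(X, y, n_bins=3):
    """Hypothetical helper: chi2 scores on one-hot bin indicators."""
    binner = KBinsDiscretizer(
        n_bins=n_bins, encode="onehot-dense", strategy="uniform"
    )
    # One indicator column per (feature, bin) pair.
    scores, _ = chi2(binner.fit_transform(X), y)
    return scores


scores = binned_chi2(X, y)
# Rescale by a power of two so the bin edges scale exactly in floating point.
scores_rescaled = binned_chi2(128.0 * X, y)

print(scores.shape)
print(np.allclose(scores, scores_rescaled))
```

Note that this yields one score per bin indicator rather than one per original feature, and the result depends on the binning choice.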