Describe the bug
When looking for correlation between features (for feature selection), I found that the sklearn implementation of the chi-squared test of independence produces significantly different results from the scipy.stats implementation.
My sample data contains 300 records, with 6 anonymized categorical features and the label. My focus is on feature A. The data is available in this folder on GitHub. The file sample300.csv contains the data, while the file chi2_showcase.ipynb contains the code demonstrating the mismatch.
For feature A, sklearn's SelectKBest() returned the lowest ranking, suggesting there is no correlation between A and the target. But scipy.stats.chi2_contingency() returned a very different result, suggesting the correlation is very high.
Because of the mismatch between the two, I performed a number of additional tests, described in detail in this article. The results suggest that the scipy implementation is correct, while the sklearn implementation is incorrect.
Steps/Code to Reproduce
Please see the two links given above, where I provided the full source code and the results.
The piece of code is quite standard:
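import pandas as pd
import sklearn.feature_selection as skfs
from sklearn.feature_selection import SelectKBest

# df, cat_feature_cols and label come from the full code linked above (not shown here)
fs = SelectKBest(score_func=skfs.chi2, k='all')
X, y = df[cat_feature_cols], df[label]
selector = fs.fit(X, y)
kbest = pd.DataFrame({'feature': X.columns, 'score': fs.scores_})
kbest.sort_values(by='score', ascending=False).reset_index()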
Expected Results
I would expect sklearn.feature_selection.SelectKBest(score_func=skfs.chi2) to return the same, or at least similar, results (p-value and chi2 statistic) as scipy.stats.chi2_contingency(). For feature A in my set, the expected results are:
chi2 = 127.497517
p-value = 1.445816e-29
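For reference, one common way to obtain a chi2 statistic and p-value from scipy.stats.chi2_contingency() is to pass it a contingency table of category counts. The following is a minimal sketch of that calculation, assuming made-up data (not sample300.csv) and hypothetical column names "A" and "label":

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data, NOT sample300.csv: 150 records per class.
# Class 0 has A in {0, 2} (75 records each); class 1 always has A == 1.
label = np.repeat([0, 1], 150)
a = np.concatenate([np.tile([0, 2], 75), np.ones(150, dtype=int)])
df = pd.DataFrame({"A": a, "label": label})

# Contingency table: rows are the categories of A, columns are the label values.
table = pd.crosstab(df["A"], df["label"])

# The result unpacks into the statistic, the p-value, the degrees of freedom,
# and the table of expected counts under independence.
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2_stat:.6f}, p-value = {p_value:.6e}, dof = {dof}")
```

In this synthetic case A determines the label completely, so the statistic is large and the p-value is very small.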
Actual Results
For feature A in my set, sklearn.feature_selection.chi2() (wrapped inside SelectKBest(score_func=skfs.chi2)) returned the lowest rank of all features, suggesting no correlation. Feature A has a score of 1.412797, while the other features score between 24 and 1647.
In contrast, scipy.stats.chi2_contingency() gave the highest rank to feature A, suggesting high correlation. The other tests described in the article suggest that the latter is correct.
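A mismatch of this kind can be reproduced on synthetic data. The sketch below assumes the categories are integer-coded and passed to chi2() directly; the data and column names are made up for illustration. The scikit-learn documentation states that chi2() expects non-negative features such as booleans or frequencies, so one possible explanation (not verified against sample300.csv) is that integer category codes are being treated as counts. In this example, one-hot encoding the feature first gives per-category scores whose sum matches the scipy statistic.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.feature_selection import chi2

# Made-up data, NOT sample300.csv: 150 records per class.
# Class 0 has A in {0, 2} (75 records each); class 1 always has A == 1.
label = np.repeat([0, 1], 150)
a = np.concatenate([np.tile([0, 2], 75), np.ones(150, dtype=int)])
df = pd.DataFrame({"A": a, "label": label})

# 1) sklearn chi2 on the integer codes: the code values are summed per class,
#    as if they were counts, so the category structure is lost.
raw_scores, raw_pvalues = chi2(df[["A"]], df["label"])

# 2) sklearn chi2 on one-hot columns: each column now holds 0/1 counts
#    for a single category of A.
onehot = pd.get_dummies(df["A"], prefix="A", dtype=float)
onehot_scores, onehot_pvalues = chi2(onehot, df["label"])

# 3) scipy chi2 on the contingency table of A versus the label.
table = pd.crosstab(df["A"], df["label"])
chi2_stat, p_value, dof, expected = chi2_contingency(table)

print("sklearn, integer codes:", raw_scores[0])
print("sklearn, one-hot (sum):", onehot_scores.sum())
print("scipy, contingency    :", chi2_stat)
```

Only the statistics are compared here. The per-column p-values from the one-hot run are based on a different number of degrees of freedom than the contingency-table test, so they are not directly comparable.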
Versions
System:
python: 3.7.7 (default, May 6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\pplaszczak\AppData\Local\Continuum\anaconda3\python.exe
machine: Windows-10-10.0.18362-SP0
Python dependencies:
pip: 21.0.1
setuptools: 52.0.0.post20210125
sklearn: 0.24.1
numpy: 1.19.2
scipy: 1.6.2
Cython: 0.29.22
pandas: 1.2.3
matplotlib: 3.3.4
joblib: 1.0.1
threadpoolctl: 2.1.0
Built with OpenMP: True