chi2 should support categorical data other than binary or document count #21455

@altanova

Description

Describe the bug

While looking for correlation between features and the target (for feature selection), I found that the sklearn implementation of the chi-square test of independence produces significantly different results from the scipy.stats implementation.

My sample data contains 300 records, with 6 anonymized categorical features and the label. My focus is on feature A. The data is available in this folder on GitHub. The file sample300.csv contains the data, and the notebook chi2_showcase.ipynb contains the code demonstrating the mismatch.

For feature A, sklearn's SelectKBest() returned the lowest ranking, suggesting there is no correlation between A and the target. But scipy.stats.chi2_contingency() returned a very different result, suggesting the correlation is very high.

Because of the mismatch between the two, I went a long way and performed a number of different tests, described in detail in this article. The results suggest that the scipy implementation is correct, while the sklearn implementation is incorrect.

Steps/Code to Reproduce

Please see the two links given above, where I provide the full source code and the results.
The relevant piece of code is quite standard:

import pandas as pd
import sklearn.feature_selection as skfs
from sklearn.feature_selection import SelectKBest

fs = SelectKBest(score_func=skfs.chi2, k='all')
X, y = df[cat_feature_cols], df[label]
selector = fs.fit(X, y)
kbest = pd.DataFrame({'feature': X.columns, 'score': fs.scores_})
kbest.sort_values(by='score', ascending=False).reset_index()

Expected Results

I would expect sklearn.feature_selection.SelectKBest(score_func=skfs.chi2) to return the same, or at least similar, results (chi2 statistic and p-value) as scipy.stats.chi2_contingency(). For feature A in my data set, the expected results are:

chi2 = 127.497517
p-value = 1.445816e-29
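
For reference, a scipy result of this kind comes from a contingency table of one feature against the label. The following is a minimal sketch of that computation, not the code from the notebook: the helper name contingency_chi2 and the toy frame are made up for illustration, and the toy values have nothing to do with sample300.csv.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def contingency_chi2(df, feature, label):
    """Chi-square test of independence between two categorical columns.

    Hypothetical helper: builds the feature-by-label contingency table
    and passes it to scipy.stats.chi2_contingency.
    """
    table = pd.crosstab(df[feature], df[label]).to_numpy()
    stat, p_value, dof, _expected = chi2_contingency(table)
    return stat, p_value, dof


# Made-up toy data, used only to show the call pattern.
toy = pd.DataFrame({
    "A": ["x", "x", "y", "y", "z", "z", "x", "y"],
    "label": [0, 0, 1, 1, 0, 1, 0, 1],
})

stat, p_value, dof = contingency_chi2(toy, "A", "label")
print(f"chi2 = {stat:.6f}, p-value = {p_value:.6e}, dof = {dof}")
```

Note that chi2_contingency applies Yates' continuity correction by default only when the table has one degree of freedom (a 2x2 table), so it does not affect a feature with more than two categories.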

Actual Results

For feature A in my data set, sklearn.feature_selection.chi2() (wrapped inside SelectKBest(score_func=skfs.chi2)) returned the lowest rank of all features, suggesting no correlation. Feature A has a score of 1.412797, while the other features score between 24 and 1647.

In contrast, scipy.stats.chi2_contingency() gave the highest rank to feature A, suggesting high correlation. The other tests described in the article suggest that the latter is correct.
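
One possible explanation, offered here as an assumption rather than a confirmed diagnosis: sklearn's chi2 is documented as expecting non-negative features such as booleans or frequencies (for example term counts), so integer category codes are treated as counts rather than as category labels. The sketch below uses synthetic data (not sample300.csv) to show the effect. It compares the score for raw integer codes, the summed scores of the one-hot encoded columns, and the scipy statistic computed without continuity correction.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.feature_selection import chi2

# Synthetic stand-in data (an assumption, not the data from this issue):
# a 4-level categorical feature stored as integer codes 0..3, and a binary
# label that depends on the category in a non-monotonic way.
rng = np.random.default_rng(0)
n = 300
a = rng.integers(0, 4, size=n)
y = ((a == 1) | (a == 2)).astype(int)
flip = rng.random(n) < 0.1           # 10% label noise
y = np.where(flip, 1 - y, y)

# 1) Integer codes passed directly: values are summed per class as counts.
score_codes, _ = chi2(a.reshape(-1, 1), y)

# 2) One-hot encoding: each column is a 0/1 indicator, which is the kind
#    of input chi2 is documented to expect. Sum the per-column scores.
onehot = pd.get_dummies(pd.Series(a)).to_numpy(dtype=float)
scores_per_column, _ = chi2(onehot, y)
score_onehot = scores_per_column.sum()

# 3) scipy on the category-by-label contingency table, without correction.
table = pd.crosstab(a, y).to_numpy()
stat, p_value, dof, _ = chi2_contingency(table, correction=False)

print("sklearn chi2, integer codes:   ", score_codes[0])
print("sklearn chi2, one-hot (summed):", score_onehot)
print("scipy chi2_contingency:        ", stat)
```

On this synthetic data the summed one-hot scores match the scipy statistic, while the score for the raw codes is far smaller. Whether the same mechanism accounts for feature A in my data would need to be checked against the notebook.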

Versions

System:
    python: 3.7.7 (default, May  6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\pplaszczak\AppData\Local\Continuum\anaconda3\python.exe
   machine: Windows-10-10.0.18362-SP0

Python dependencies:
          pip: 21.0.1
   setuptools: 52.0.0.post20210125
      sklearn: 0.24.1
        numpy: 1.19.2
        scipy: 1.6.2
       Cython: 0.29.22
       pandas: 1.2.3
   matplotlib: 3.3.4
       joblib: 1.0.1
threadpoolctl: 2.1.0

Built with OpenMP: True
