maximal information coefficient for feature selection #25771

Open
leeauk21 opened this issue Mar 6, 2023 · 5 comments

@leeauk21

leeauk21 commented Mar 6, 2023

Describe the workflow you want to enable

Maximal information coefficient (MIC) for feature selection.

Describe your proposed solution

Implement the maximal information coefficient (MIC) as a feature selection score.

Describe alternatives you've considered, if relevant

No response

Additional context

No response

@leeauk21 leeauk21 added Needs Triage Issue requires triage New Feature labels Mar 6, 2023
@Higgs32584
Contributor

I would like to look into doing this, but could you tell me where in the repo it should be built? I have not contributed to this repo before, and a more detailed description would be helpful. Thank you!

@glemaitre
Member

@leeauk21 Do you have more details regarding why this metric is useful?
Looking around to find the definition, I ended up on the following literature:
https://www.pnas.org/doi/abs/10.1073/pnas.1309933111

It seems that using mutual information directly is enough and MIC does not provide anything more.

So if you have additional thoughts or background, that would be useful for evaluating whether or not we should integrate this feature.
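
For context, a minimal sketch of the mutual-information scoring that already exists in `sklearn.feature_selection` (the synthetic data and parameters below are only illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic regression problem with a few informative features.
X, y = make_regression(n_samples=500, n_features=10, n_informative=3, random_state=0)

# Rank features by their estimated mutual information with the target
# and keep the top 3.
selector = SelectKBest(score_func=mutual_info_regression, k=3).fit(X, y)
print(selector.scores_)                    # per-feature MI estimates
print(selector.get_support(indices=True))  # indices of the selected features
```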

@ogrisel ogrisel changed the title maximal information coeffcient for feature selection maximal information coefficient for feature selection Mar 9, 2023
@leeauk21
Author

leeauk21 commented Mar 9, 2023

I think mutual information values can vary depending on how you bin continuous data, whereas theoretically MIC won't. But if MI works better, then there is no reason to implement it. I found the above reasoning in Machine Learning: A Probabilistic Perspective.
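
A rough sketch of that binning sensitivity, using the plug-in (histogram) estimator `sklearn.metrics.mutual_info_score` on arbitrarily binned continuous data (the toy relationship and bin counts are illustrative only; this is not a MIC implementation):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=2000)
y = x ** 2 + rng.normal(scale=0.1, size=2000)  # noisy nonlinear relationship

# The plug-in MI estimate on the same data changes with the bin count.
for bins in (5, 20, 100):
    x_binned = np.digitize(x, np.linspace(-1, 1, bins))
    y_binned = np.digitize(y, np.linspace(y.min(), y.max(), bins))
    print(bins, mutual_info_score(x_binned, y_binned))
```

Note that `mutual_info_regression` / `mutual_info_classif` in scikit-learn use a nearest-neighbors estimator rather than explicit binning, so the sensitivity above applies mainly to histogram-based estimates.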

@ogrisel
Member

ogrisel commented Mar 9, 2023

If someone can craft a use case where MIC stably returns good values while MI can catastrophically fail depending on the binning, then why not. Otherwise it sounds like YAGNI.

@ogrisel ogrisel removed the Needs Triage Issue requires triage label Mar 9, 2023
@Charlie-XIAO
Contributor

@ogrisel Hi, would this paper suffice to show that MIC is useful? The point is that, in most cases, MIC assigns similar scores to equally noisy relationships of different types (it is equitable), whereas MI does not. Though MIC is computationally more expensive, there are effective approximations that can be evaluated more easily.

The paper also gives practical suggestions on when and how to use MIC (starting from page 22 of the paper). It says:

In some cases, many methods will return a large number of relationships, and so the number of relationships detected is less important than their relative ranking. This happens in practice, for example in the gene expression analysis of Heller et al. (2016), in which several methods identified over half of the thousands of relationships in the data set as significant, as well as in the analysis of the WHO data set (see Section 8 of the paper, starting from page 24). In such cases, increasing the proportion of variable pairs identified as significant seems less important for scientific inquiry than having a meaningful way to prioritize the detected relationships for follow-up.

Thus, a promising strategy for exploratory data analysis is: first, to compute a statistic designed to identify a large number of significant relationships of all kinds, and then second, to compute an equitable statistic on all significant relationships, ensuring a ranking that is meaningful.

Therefore, I do think adding the MIC metric as an option would be useful in specific cases. That being said, it is up to you and other maintainers to decide whether it is a useful feature for scikit-learn.
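
For anyone who wants to experiment with MIC in the meantime, a sketch using the third-party `minepy` package (assuming it is installed and exposes the `MINE` class as in its documentation; the toy data and parameters are illustrative, and this is not a scikit-learn API):

```python
import numpy as np
from minepy import MINE  # third-party MIC implementation, not part of scikit-learn

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=1000)
y = np.cos(4 * np.pi * x) + rng.normal(scale=0.1, size=1000)  # noisy periodic relationship

# alpha and c below are the defaults suggested by the original MIC paper.
mine = MINE(alpha=0.6, c=15)
mine.compute_score(x, y)
print("MIC:", mine.mic())
```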
