maximal information coefficient for feature selection #25771

Open
leeauk21 opened this issue Mar 6, 2023 · 5 comments

@leeauk21

leeauk21 commented Mar 6, 2023

Describe the workflow you want to enable

Maximal information coefficient (MIC) for feature selection.

Describe your proposed solution

Implement the maximal information coefficient (MIC) as a feature selection score.

Describe alternatives you've considered, if relevant

No response

Additional context

No response

@leeauk21 leeauk21 added Needs Triage Issue requires triage New Feature labels Mar 6, 2023
@Higgs32584
Contributor

I would like to look into doing this, but could you tell me where in the repo it should be built? I have not contributed to this repo before, and a more detailed description would be helpful. Thank you!

@glemaitre
Member

@leeauk21 Do you have more details regarding why this metric is useful?
Looking around to find the definition, I ended up on the following literature:
https://www.pnas.org/doi/abs/10.1073/pnas.1309933111

It seems that using mutual information directly is enough and MIC does not provide anything more.

So if you have additional thoughts or background, that would be useful for evaluating whether or not we should integrate this feature.
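
For context, a minimal sketch of the mutual-information scoring that already exists in `sklearn.feature_selection` (the synthetic data and parameters below are only illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic regression problem with a few informative features.
X, y = make_regression(n_samples=500, n_features=10, n_informative=3, random_state=0)

# Rank features by their estimated mutual information with the target
# and keep the top 3.
selector = SelectKBest(score_func=mutual_info_regression, k=3).fit(X, y)
print(selector.scores_)                    # per-feature MI estimates
print(selector.get_support(indices=True))  # indices of the selected features
```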

@ogrisel ogrisel changed the title maximal information coeffcient for feature selection maximal information coefficient for feature selection Mar 9, 2023
@leeauk21
Author

leeauk21 commented Mar 9, 2023

I think mutual information values can vary depending on how you bin continuous data, whereas theoretically MIC won't. But if MI works better, then there is no reason to implement it. I found the above reasoning in Machine Learning: A Probabilistic Perspective.
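
A rough sketch of that binning sensitivity, using the plug-in (histogram) estimator `sklearn.metrics.mutual_info_score` on arbitrarily binned continuous data (the toy relationship and bin counts are illustrative only; this is not a MIC implementation):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=2000)
y = x ** 2 + rng.normal(scale=0.1, size=2000)  # noisy nonlinear relationship

# The plug-in MI estimate on the same data changes with the bin count.
for bins in (5, 20, 100):
    x_binned = np.digitize(x, np.linspace(-1, 1, bins))
    y_binned = np.digitize(y, np.linspace(y.min(), y.max(), bins))
    print(bins, mutual_info_score(x_binned, y_binned))
```

Note that `mutual_info_regression` / `mutual_info_classif` in scikit-learn use a nearest-neighbors estimator rather than explicit binning, so the sensitivity above applies mainly to histogram-based estimates.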

@ogrisel
Member

ogrisel commented Mar 9, 2023

If someone can craft a use case where MIC stably returns good values while MI can catastrophically fail depending on the binning, then why not. Otherwise it sounds like YAGNI.

@ogrisel ogrisel removed the Needs Triage Issue requires triage label Mar 9, 2023
@Charlie-XIAO
Contributor

@ogrisel Hi, would this paper suffice to show that MIC is useful? The point is that, in most cases, MIC assigns similar scores to equally noisy relationships of different types (it is equitable), whereas MI does not. Though MIC is computationally more expensive, there are effective approximations that can be evaluated more easily.

The paper also gives practical suggestions on when and how to use MIC (starting from page 22 of the paper). It says:

In some cases, many methods will return a large number of relationships, and so the number of relationships detected is less important than their relative ranking. This happens in practice, for example in the gene expression analysis of Heller et al. (2016), in which several methods identified over half of the thousands of relationships in the data set as significant, as well as in the analysis of the WHO data set (see Section 8 of the paper, starting from page 24). In such cases, increasing the proportion of variable pairs identified as significant seems less important for scientific inquiry than having a meaningful way to prioritize the detected relationships for follow-up.

Thus, a promising strategy for exploratory data analysis is: first, to compute a statistic designed to identify a large number of significant relationships of all kinds, and then second, to compute an equitable statistic on all significant relationships, ensuring a ranking that is meaningful.

Therefore, I do think adding the MIC metric as an option would be useful in specific cases. That being said, it is up to you and other maintainers to decide whether it is a useful feature for scikit-learn.
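
For anyone who wants to experiment with MIC in the meantime, a sketch using the third-party `minepy` package (assuming it is installed and exposes the `MINE` class as in its documentation; the toy data and parameters are illustrative, and this is not a scikit-learn API):

```python
import numpy as np
from minepy import MINE  # third-party MIC implementation, not part of scikit-learn

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=1000)
y = np.cos(4 * np.pi * x) + rng.normal(scale=0.1, size=1000)  # noisy periodic relationship

# alpha and c below are the defaults suggested by the original MIC paper.
mine = MINE(alpha=0.6, c=15)
mine.compute_score(x, y)
print("MIC:", mine.mic())
```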
