Incorporate a use_missing argument #142

Closed
omsuchak opened this issue Jun 30, 2020 · 2 comments

Comments

@omsuchak

omsuchak commented Jun 30, 2020

First, this is a lovely framework!

One suggestion: it would be very useful to extend the framework to accept sparse data and missing values. LightGBM supports this through its use_missing argument.

@alejandroschuler
Collaborator

Hey @omsuchak, thanks for the suggestion. There is no one "natural" or good way to generically handle missing data. If ngboost were to do this for you, we would be making a number of choices behind the scenes that would be obscured from the user.

If we limited ourselves to use cases where the base learner is a regression tree (like we do with the feature importances), there are some reasonable default choices for what to do with missing data. Implementing those strategies here is probably not crazy hard to do, but it's also not a trivial task. Either way, I'd want the user to have a transparent choice about what is going on. I'd be open to reviewing pull requests on that front as long as they satisfy that requirement, but it's not something I plan on working on myself in the foreseeable future. I'll close for now, but if anyone wants to try to add this, please feel free to comment.

@alejandroschuler
Collaborator

alejandroschuler commented Jun 30, 2020

As a practical note that might help you: for prediction problems it's typically hard to beat some sort of imputation (e.g. column mean) plus a missingness indicator feature per column. sklearn makes this easy. I'd recommend handling missing data in your feature matrix upfront, as a pre-processing step using those tools, before passing the data into ngboost.

As long as you apply the same ("trained") imputation strategy to your test set or future observations, you're not incurring any bias from doing this.
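
A minimal sketch of that pre-processing step, assuming scikit-learn's `SimpleImputer` with `strategy="mean"` and `add_indicator=True` is the tool meant here (the thread only says "sklearn"). The toy arrays and variable names are made up for illustration. The imputer is fit on the training data only, and the same fitted imputer is then applied to the test data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrices with missing entries (illustrative data only).
X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan]])
X_test = np.array([[np.nan, 40.0],
                   [5.0, np.nan]])

# Column-mean imputation; add_indicator=True appends one 0/1 missingness
# column for each feature that had missing values when the imputer was fit.
imputer = SimpleImputer(strategy="mean", add_indicator=True)

# Learn the column means from the training set only...
X_train_imp = imputer.fit_transform(X_train)
# ...and reuse those same "trained" means on the test set.
X_test_imp = imputer.transform(X_test)

print(X_train_imp)
print(X_test_imp)

# The imputed matrices can then be passed to ngboost as usual, e.g.
# (assuming ngboost is installed and y_train holds the targets):
# from ngboost import NGBRegressor
# model = NGBRegressor().fit(X_train_imp, y_train)
# preds = model.predict(X_test_imp)
```

Note that with these defaults an indicator column is only added for features that had missing values in the training data, so a feature that is first missing at prediction time gets imputed without a corresponding indicator.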
