transform xgboostImpute and rangerImpute into a generic function with methods for formula and data #73

Open
alexkowa opened this issue Nov 10, 2023 · 4 comments

Comments

@alexkowa
Member

@GregorDeCillia I included a new function xgboostImpute that is very similar to your rangerImpute function. At first sight, it performs very well. Both functions take a formula as their first input.
To make them more pipe-friendly and aligned with the other imputation functions, should we
A) simply change the order of parameters so that the first input is the data set (possibly breaking some users' existing code), or
B) create new generic functions that dispatch to a method based on the first input?

@GregorDeCillia and also @matthias-da @JohannesGuss what do you think?

@JohannesGuss
Collaborator

@alexkowa looks good! I just made some slight changes to rangerImpute() for the case of factors. Now, when a factor is imputed, the imputed value is randomly drawn using the predicted probabilities from the model output.
Maybe this should also be adopted in xgboostImpute().
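
The sampling idea can be sketched in base R as follows. This is only an illustration, not the code in rangerImpute(): the helper name `sample_from_probs` and the layout of the probability matrix (one row per observation, one column per factor level) are assumptions.

```r
# Hypothetical helper: draw one factor level per row from a matrix of
# predicted class probabilities (rows = observations, columns = levels).
sample_from_probs <- function(prob, levels = colnames(prob)) {
  stopifnot(is.matrix(prob), length(levels) == ncol(prob))
  # For each row, sample a single level with the row's probabilities as weights
  drawn <- apply(prob, 1, function(p) sample(levels, size = 1, prob = p))
  factor(drawn, levels = levels)
}

set.seed(1)
prob <- matrix(c(0.9, 0.1, 0.0,
                 0.2, 0.3, 0.5),
               nrow = 2, byrow = TRUE,
               dimnames = list(NULL, c("a", "b", "c")))
imputed <- sample_from_probs(prob)
```

Compared with always taking the most probable level, drawing from the probabilities keeps less frequent levels represented among the imputed values.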

I would personally opt for version A.

@alexkowa
Member Author

Good idea to sample. Yes, let's do that for XGBoost too if someone has time to implement it.

@GregorDeCillia
Contributor

Not sure if introducing breaking changes is worth it, although I totally agree that having the dataset as the first argument makes more sense, especially for use with the native pipe introduced in R 4.1. An S3 dispatch should allow this change without breaking old code.

rangerImpute.formula <- function(x, ...) {}
rangerImpute.data.frame <- function(x, ...) {}
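
A slightly fuller sketch of how such a dispatch could look is given below. Everything beyond the two method names is an assumption for illustration: the argument names `data` and `formula`, the helper `impute_core`, and the use of `lm()` as a stand-in for the actual ranger model so that the example runs with base R only.

```r
# Hypothetical generic: dispatches on the class of the first argument.
rangerImpute <- function(x, ...) UseMethod("rangerImpute")

# Old calling convention: formula first, data second.
rangerImpute.formula <- function(x, data, ...) {
  impute_core(formula = x, data = data, ...)
}

# Pipe-friendly convention: data first, formula second.
rangerImpute.data.frame <- function(x, formula, ...) {
  impute_core(formula = formula, data = x, ...)
}

# Placeholder for the shared imputation code (lm() stands in for ranger).
impute_core <- function(formula, data, ...) {
  target <- all.vars(formula)[1]   # left-hand-side variable
  miss <- is.na(data[[target]])
  fit <- lm(formula, data = data[!miss, , drop = FALSE])
  data[[target]][miss] <- predict(fit, newdata = data[miss, , drop = FALSE])
  data
}

df <- data.frame(y = c(1, 2, NA, 4, 5), x = c(1, 2, 3, 4, 5))
old_style <- rangerImpute(y ~ x, df)   # existing positional calls keep working
new_style <- rangerImpute(df, y ~ x)   # data first
# With R >= 4.1 the second form can be written as: df |> rangerImpute(y ~ x)
```

In this sketch only positional calls are covered; existing calls that name the first argument (e.g. `formula = ...`) may need additional handling.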

@matthias-da
Collaborator

At the end of the day, it would also be good to compare both rangerImpute and xgboostImpute with missRanger and mixgb (although both can be used in a chain to impute multivariate missingness), not only on precision measures (comparing imputed and original data values in a simulation) but also on coverage rates and root mean squared errors of estimators. I can do this when I have a bit of time. It might give an idea of whether imputation uncertainty and model uncertainty are treated well.

One argument I often hear against almost all imputation methods in VIM is that we only account for imputation uncertainty (drawing from predictive distributions; one can also think of PMM and midastouch) but not for model uncertainty, e.g. via a bootstrap, which would be very simple to implement, at least as an option.
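
To illustrate the distinction, here is a minimal sketch of a bootstrap option. It is not taken from VIM: the helper name `impute_boot` is hypothetical, `lm()` is used as a simple stand-in model, and the predictors are assumed to be fully observed.

```r
# Hypothetical helper: fit the model on a bootstrap resample of the complete
# cases (model uncertainty), then add residual noise to the predictions
# (imputation uncertainty).
impute_boot <- function(data, formula) {
  target <- all.vars(formula)[1]
  miss <- is.na(data[[target]])
  obs <- data[!miss, , drop = FALSE]
  # Resample observed rows with replacement so the fit varies between runs
  boot <- obs[sample(nrow(obs), replace = TRUE), , drop = FALSE]
  fit <- lm(formula, data = boot)
  pred <- predict(fit, newdata = data[miss, , drop = FALSE])
  # Draw from the predictive distribution around the fitted values
  noise <- rnorm(sum(miss), sd = summary(fit)$sigma)
  data[[target]][miss] <- pred + noise
  data
}

set.seed(42)
df <- data.frame(x = rnorm(50))
df$y <- 1 + 2 * df$x + rnorm(50)
df$y[1:10] <- NA
# Several imputed data sets, each based on a different bootstrap fit
imputations <- lapply(1:5, function(i) impute_boot(df, y ~ x))
```

Without the resampling step, every imputed data set would be based on the same fitted coefficients, which is the criticism described above.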

I recently implemented PMM and midastouch in function imputeRobust (just committed).
There is a function imputeRobustChain - this is very unfinished, ignore it.
There is a function imputeRobust that has a lot of enhancements in comparison to irmi:

  • complex formulas can be provided, e.g. log(income) ~ I(age^2) + region * whatever, for each variable with missing values.
  • PMM and midastouch are available, PMM as the default
  • model uncertainty is considered by different versions of robust bootstraps
  • and some more.
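
For readers unfamiliar with PMM (predictive mean matching), the basic idea can be sketched as below. This is not the imputeRobust implementation: the helper name `impute_pmm`, the `k` donor parameter, and the plain `lm()` fit are assumptions for illustration, and midastouch is not shown.

```r
# Hypothetical helper: predictive mean matching with k donors.
impute_pmm <- function(data, formula, k = 5) {
  target <- all.vars(formula)[1]
  miss <- is.na(data[[target]])
  fit <- lm(formula, data = data[!miss, , drop = FALSE])
  pred_obs <- predict(fit, newdata = data[!miss, , drop = FALSE])
  pred_mis <- predict(fit, newdata = data[miss, , drop = FALSE])
  y_obs <- data[[target]][!miss]
  data[[target]][miss] <- vapply(pred_mis, function(p) {
    # The k observed cases whose predictions are closest to p are the donors
    donors <- order(abs(pred_obs - p))[seq_len(min(k, length(y_obs)))]
    # Impute the observed value of one randomly chosen donor
    y_obs[donors[sample.int(length(donors), 1)]]
  }, numeric(1))
  data
}

set.seed(1)
d <- data.frame(x = 1:20, y = c(2 * (1:15), rep(NA, 5)))
out <- impute_pmm(d, y ~ x, k = 3)
```

Because imputed values are copied from observed cases, they always lie within the range of the observed data.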

What is missing is testing and code improvement (almost no checks are implemented). It's currently a working solution and, of course, there has been no time to do this for months :-( If somebody is interested...?

So, one might use the PMM and midastouch from imputeRobust in rangerImpute and xgboostImpute?
