This repository has been archived by the owner on May 19, 2021. It is now read-only.

Exploration and visualisation of missing data #15

Open
njtierney opened this issue Mar 29, 2016 · 7 comments
Comments

@njtierney
Collaborator

In my PhD research I work with medical data and there are often large amounts of it missing. In my attempts to explore missing data problems and make my life easier I have done some work on two packages: ggmissing with Di Cook, and mex with Damjan Vukcevic. But, as my PhD research continues, I have been finding it hard to dedicate some serious time to continue work on these packages.

I'd like to propose a project on one, or perhaps both of these packages.

A bit more about them:

ggmissing extends ggplot2 so that missing data can be visualised. This would basically involve creating a couple of ggplot2 geom_missing_* functions that could be added as a layer to a plot. For example, geom_missing_point() would add in and colour the missing points. You can see more about it on the GitHub repo, and at these slides.
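As a rough sketch of the idea (not ggmissing's actual implementation), missing values can be placed just below the observed range and coloured separately using existing ggplot2 layers. The helper `shift_missing()` and its `prop` argument are hypothetical names used only for illustration; `airquality` is a dataset that ships with R.

```r
library(ggplot2)

# Hypothetical helper: replace NAs with a value just below the observed range,
# so they show up in the plot margin instead of being dropped.
shift_missing <- function(x, prop = 0.1) {
  rng <- range(x, na.rm = TRUE)
  ifelse(is.na(x), rng[1] - prop * diff(rng), x)
}

plot_data <- data.frame(
  ozone   = shift_missing(airquality$Ozone),
  solar   = shift_missing(airquality$Solar.R),
  missing = is.na(airquality$Ozone) | is.na(airquality$Solar.R)
)

# Points with a missing value in either variable get their own colour
p <- ggplot(plot_data, aes(x = ozone, y = solar, colour = missing)) +
  geom_point()
print(p)
```

A geom_missing_point() layer would presumably wrap this kind of shifting and colouring so the user does not have to prepare the data by hand.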

mex is a missingness exploration package. It builds on some research that I have done into using decision trees to explore missing data. The original idea of the package was to create a framework, or even a recommended path, for handling missing data. One idea was to break it into exploring, modelling, and confirming.

Exploring would include:

  • Creating a better, faster version of Little's MCAR test
  • Tabulation of missing data
  • Use of t-tests/chi-squared tests to explore whether missingness affects values/counts
  • Tools and variations on functions from previous work in packages like MissingDataGUI
  • Incorporating visualisations from visdat
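
For instance, the tabulation and testing steps could look something like the following in base R. This is only a sketch using the built-in `airquality` dataset, not code from mex; the variable names are illustrative.

```r
# Tabulate missing values per variable
n_missing   <- colSums(is.na(airquality))
pct_missing <- round(100 * colMeans(is.na(airquality)), 1)
missing_summary <- data.frame(n_missing, pct_missing)
print(missing_summary)

# Does Temp differ between rows where Ozone is missing and rows where it is observed?
ozone_missing <- is.na(airquality$Ozone)
print(t.test(airquality$Temp ~ ozone_missing))

# Is missingness in Ozone associated with Month?
print(chisq.test(table(ozone_missing, airquality$Month)))
```

A small p-value in either test would suggest the missingness is related to other variables, which is evidence against the data being missing completely at random.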

Modelling would include:

  • Using machine learning methods to explore missing data.
  • Identifying clusters of missing data and then predicting these clusters with machine learning methods

Confirming might be something like:

  • Using cross validation to explore how accurate the missing data mechanism is

I'm very much open to suggestions about how to implement these ideas.

@greenLeopard

Snap! I have medical data with missing entries too. I'm interested in being able to visualise it and explore clusters of missingness, as well as other types of data inconsistencies (e.g. end time before start time).
I am hoping to bring a mockup of the kind of datasets that I use at work.

@jonocarroll

The mice package (Multivariate Imputation by Chained Equations in R) has some good tools for imputation (MCAR/otherwise).

Also have a look at VIM::aggr for producing a neat plot of missing data.

e.g. http://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/
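
A minimal sketch of both, assuming the VIM and mice packages are installed and using the built-in `airquality` dataset as a stand-in:

```r
# Assumes the VIM and mice packages are installed
library(VIM)
library(mice)

# Proportion of missing values per variable, plus the combinations of
# variables that are missing together
aggr(airquality, numbers = TRUE, sortVars = TRUE)

# Matrix of missing-data patterns (one row per distinct pattern)
patterns <- md.pattern(airquality)
print(patterns)
```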

@njtierney
Collaborator Author

Thanks Jonno,

VIM certainly does have some useful plots. What do you think about incorporating them into ggmissing?

@dicook

dicook commented Apr 5, 2016

  • Summary of missingness (like the norm or MissingDataGUI package)
  • Missingness map: more options for setting order of rows and columns that work for large data
  • Vignette
  • Enable imputed values to position the points

Keep the package simple. Primary purpose is to make ggplot2 graphics that include the missings in the plot.

@cpsievert

I'm not very familiar with ggmissing, but I'd like to know more about it!

BTW, here is a nice example of a scatterplot with margins for missing values: http://kbroman.org/d3panels/assets/test/scatterplot/

@jesse-jesse
Contributor

7 votes from the AuUnconf... :) Might be worth continuing discussions around this..

@jesse-jesse
Contributor

Nick created a channel on the AuUnconf slack account. Anyone interested can join discussions there also.

6 participants