missing values can be hidden in the presence of large (enough) N #18

Closed
njtierney opened this Issue Apr 15, 2016 · 8 comments

@njtierney
Collaborator

njtierney commented Apr 15, 2016

Sometimes if there is only one cell missing in a large dataset of a few thousand, you cannot see the missing cell.

So I think vis_miss and vis_dat should print a little message that just says:

There are X number of missing values in dataset

This could just be:

paste("There are", sum(is.na(df)), "number of missing values in dataset")

And perhaps if there are ZERO missing values, it could state that "No missing values found".
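
A minimal sketch of how the two cases above could be combined, in base R. The helper name `missing_message()` is hypothetical and not part of vis_miss or vis_dat; the wording of the messages is taken from the proposal above.

```r
# Hypothetical helper (not part of the package): build the message
# proposed above from the number of missing cells in `df`.
missing_message <- function(df) {
  n_miss <- sum(is.na(df))  # is.na() on a data frame returns a logical matrix
  if (n_miss == 0) {
    "No missing values found"
  } else {
    paste("There are", n_miss, "number of missing values in dataset")
  }
}

df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
message(missing_message(df))
```

Returning the string rather than printing it directly would let a plotting function decide whether to emit it as a message or place it in a plot caption.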

mdlincoln commented Apr 15, 2016

👍

I've also had the complementary issue, where almost all the values in a column are missing, but a few present values are too small to be seen on the plot.

I've tried using the alpha levels to indicate when all(is.na(x)), making completely missing rows translucent. I suppose if you wanted to get fancy, you could have 3 alpha levels: entirely present, entirely missing, and in between - but that might get visually confusing.
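
The three-level idea could be sketched roughly as follows in base R. This assumes the check is applied per column; the helper `missing_status()` and the specific alpha values are hypothetical choices for illustration, not anything from the package.

```r
# Hypothetical helper: classify each column as entirely present,
# entirely missing, or somewhere in between.
missing_status <- function(df) {
  vapply(df, function(x) {
    if (all(is.na(x))) {
      "entirely missing"
    } else if (!anyNA(x)) {
      "entirely present"
    } else {
      "in between"
    }
  }, character(1))
}

# Arbitrary alpha values, one per status (an assumption for illustration).
alpha_levels <- c(
  "entirely present" = 1,
  "in between"       = 0.6,
  "entirely missing" = 0.3
)

df <- data.frame(a = c(1, 2, 3), b = c(NA, "y", NA), c = c(NA, NA, NA))
status <- missing_status(df)
alpha <- alpha_levels[status]  # one alpha value per column
```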

njtierney commented Apr 19, 2016

Glad it's not just me having this problem!

I like the idea of using transparency, but I'm not sure how it scales to larger data, where there are more data points than pixels.

njtierney commented Apr 22, 2016

A possible solution here is to include a marginal histogram.

njtierney commented Apr 22, 2016

Or can we stick in a little strip along the bottom or top of the graphic to indicate whether there is data missing or present?

We also need to make sure that the names and positions of the barplot variables match those of the vis_dat plot.

mdlincoln commented Apr 22, 2016

I think I see where you are going with the histogram idea - but could you end up with the same problem, where a lot of missing values in one column end up obscuring the one missing value in another column because they would expand the scale of the histogram too much?

One other possibility is using geom_rug() to mark columns where any(is.na(x)).
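
A rough sketch of the geom_rug() idea, assuming ggplot2 is installed. The toy data frame, the long-format `cells` table, and the choice of geom_tile() for the grid are all assumptions for illustration; this is not how vis_dat or vis_miss build their plots.

```r
library(ggplot2)

df <- data.frame(a = c(1, NA, 3), b = c("x", "y", "z"), c = c(NA, NA, TRUE))

# Long-format grid of cells: one row per (row, variable) pair.
# as.vector() unrolls the is.na() matrix column by column, which
# matches the ordering of the two rep() calls.
cells <- data.frame(
  row      = rep(seq_len(nrow(df)), times = ncol(df)),
  variable = rep(names(df), each = nrow(df)),
  missing  = as.vector(is.na(df))
)

# Columns that contain at least one missing value.
has_missing <- names(df)[vapply(df, anyNA, logical(1))]
flagged <- data.frame(variable = has_missing)

# Tile plot of missingness, with a rug tick along the top edge
# above every column where any(is.na(x)) is TRUE.
p <- ggplot(cells, aes(x = variable, y = row)) +
  geom_tile(aes(fill = missing)) +
  geom_rug(data = flagged, aes(x = variable), sides = "t", inherit.aes = FALSE)
```

Because the rug marks only whether a column has any missing value, it does not depend on a shared count scale, so a column with a single missing value is flagged the same way as one with many.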

njtierney commented May 9, 2016

Yeah, you are absolutely right, we could run into the same problem.

I was thinking that some sort of bar could be placed above the columns to indicate whether there are any missings in that column; geom_rug() could be an interesting way to handle this.

Another option would be to include both the geom_rug() and the marginal histogram.

My only concern is that, in adding these features, the graph will become more "noisy" and hard to explain.

njtierney commented May 29, 2016

Commit 0fe211c provides a partial solution to this by indicating when there is <0.1% missing data. However, this currently only works for vis_miss and does not show up in vis_dat. That's the next step from here, I think.
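
For illustration only, a label of this kind could be computed along these lines in base R. This is a sketch under assumptions and not the code from the commit: the helper name `label_pct_missing()` and the exact label strings are hypothetical.

```r
# Hypothetical helper (not the code from the commit): build a label
# that stays informative when the share of missing cells is tiny.
label_pct_missing <- function(df) {
  pct <- mean(is.na(df)) * 100  # percentage of missing cells
  if (pct == 0) {
    "No missing values"
  } else if (pct < 0.1) {
    "Missing (<0.1%)"
  } else {
    paste0("Missing (", round(pct, 1), "%)")
  }
}

# One missing cell out of 4000: too small to see in the plot itself.
big <- data.frame(x = c(NA, seq_len(1999)), y = seq_len(2000))
label_pct_missing(big)
```

The explicit `<0.1%` branch avoids rounding a small but non-zero share down to a label that would read as 0%.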

njtierney commented Aug 1, 2016

At the moment I am happy with this solution.

njtierney closed this Aug 1, 2016
