-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
explore how to make visdat work with facetting #78
Comments
Some thoughts on this. I think that one good way forward, rather than (perhaps _only) supplying a "facet" argument as in the This is already kind of provided, I think, in This could instead be exported, and called something (slightly) better like These could then have a data %>%
group_by(grouping) %>%
# get the data structure
data_vis_dat() %>%
# perhaps vis_dat gains some S3 methods, so that it works with a grouped_df, and maybe has a special `.vis_dat` class?
vis_dat() This seems like a lot more work than just: vis_dat(data, facet = grouping) But it would allow for perhaps more flexible operations. I don't think I can use |
I want to pursue this idea, but at a later date |
Just a note, I repeatedly need this ability and so wrote a little hack using the See example below (this is data from the IPEDS data on higher ed institutions): If anyone wants to take my code and modify it to their own purpose, here you go (don't judge me, it was a rush job). This is custom for a specific purpose (IPEDS data), so will take a little work to generalize. And I'm not suggesting this as a good method for the actual
|
… methods in data_vis_dat and data_vis_cor. See #78
@jzadra I've worked on an approach for this in #149, how does this look to you? Currently I've just got library(visdat)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
vis_dat(airquality) vis_dat(airquality, facet = Month) vis_cor(airquality) vis_cor(airquality, facet = Month) airquality %>% data_vis_dat()
#> # A tibble: 918 × 4
#> rows variable valueType value
#> <int> <chr> <chr> <chr>
#> 1 1 Day integer 41
#> 2 1 Month integer 190
#> 3 1 Ozone integer 7.4
#> 4 1 Solar.R integer 67
#> 5 1 Temp integer 5
#> 6 1 Wind numeric 1
#> 7 2 Day integer 36
#> 8 2 Month integer 118
#> 9 2 Ozone integer 8
#> 10 2 Solar.R integer 72
#> # … with 908 more rows
airquality %>% group_by(Month) %>% data_vis_dat()
#> # A tibble: 765 × 5
#> # Groups: Month [5]
#> Month rows variable valueType value
#> <int> <int> <chr> <chr> <chr>
#> 1 5 1 Day integer 41
#> 2 5 1 Ozone integer 190
#> 3 5 1 Solar.R integer 7.4
#> 4 5 1 Temp integer 67
#> 5 5 1 Wind numeric 1
#> 6 5 2 Day integer 36
#> 7 5 2 Ozone integer 118
#> 8 5 2 Solar.R integer 8
#> 9 5 2 Temp integer 72
#> 10 5 2 Wind numeric 2
#> # … with 755 more rows
airquality %>% data_vis_cor()
#> # A tibble: 36 × 3
#> row_1 row_2 value
#> <chr> <chr> <dbl>
#> 1 Ozone Ozone 1
#> 2 Ozone Solar.R 0.348
#> 3 Ozone Wind -0.602
#> 4 Ozone Temp 0.698
#> 5 Ozone Month 0.165
#> 6 Ozone Day -0.0132
#> 7 Solar.R Ozone 0.348
#> 8 Solar.R Solar.R 1
#> 9 Solar.R Wind -0.0568
#> 10 Solar.R Temp 0.276
#> # … with 26 more rows
airquality %>% group_by(Month) %>% data_vis_cor()
#> # A tibble: 125 × 4
#> # Groups: Month [5]
#> Month row_1 row_2 value
#> <int> <chr> <chr> <dbl>
#> 1 5 Ozone Ozone 1
#> 2 5 Ozone Solar.R 0.243
#> 3 5 Ozone Wind -0.374
#> 4 5 Ozone Temp 0.554
#> 5 5 Ozone Day 0.302
#> 6 5 Solar.R Ozone 0.243
#> 7 5 Solar.R Solar.R 1
#> 8 5 Solar.R Wind -0.227
#> 9 5 Solar.R Temp 0.455
#> 10 5 Solar.R Day -0.0644
#> # … with 115 more rows Created on 2022-11-25 with reprex v2.0.2 Session infosessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.1 (2022-06-23)
#> os macOS Monterey 12.3.1
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Australia/Brisbane
#> date 2022-11-25
#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
#> cli 3.4.1 2022-09-23 [1] CRAN (R 4.2.0)
#> colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0)
#> curl 4.3.3 2022-10-06 [1] CRAN (R 4.2.0)
#> DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0)
#> digest 0.6.30 2022-10-18 [1] CRAN (R 4.2.0)
#> dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.0)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
#> evaluate 0.17 2022-10-07 [1] CRAN (R 4.2.0)
#> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
#> farver 2.1.1 2022-07-06 [1] CRAN (R 4.2.0)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0)
#> ggplot2 3.3.6 2022-05-03 [1] CRAN (R 4.2.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#> gtable 0.3.1 2022-09-01 [1] CRAN (R 4.2.0)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.2.0)
#> htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.2.0)
#> httr 1.4.4 2022-08-17 [1] CRAN (R 4.2.0)
#> knitr 1.40 2022-08-24 [1] CRAN (R 4.2.0)
#> labeling 0.4.2 2020-10-20 [1] CRAN (R 4.2.0)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#> mime 0.12 2021-09-28 [1] CRAN (R 4.2.0)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
#> pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
#> purrr 0.3.5 2022-10-06 [1] CRAN (R 4.2.0)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.0)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0)
#> R.utils 2.12.0 2022-06-28 [1] CRAN (R 4.2.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.0)
#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)
#> rmarkdown 2.17 2022-10-07 [1] CRAN (R 4.2.0)
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0)
#> scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
#> stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.0)
#> stringr 1.4.1 2022-08-20 [1] CRAN (R 4.2.0)
#> styler 1.7.0 2022-03-13 [1] CRAN (R 4.2.0)
#> tibble 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)
#> tidyr 1.2.1 2022-09-08 [1] CRAN (R 4.2.0)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.0)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
#> vctrs 0.4.2 2022-09-29 [1] CRAN (R 4.2.0)
#> visdat * 0.6.0.9000 2022-11-25 [1] local
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#> xfun 0.33 2022-09-12 [1] CRAN (R 4.2.0)
#> xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.0)
#> yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#>
#> ────────────────────────────────────────────────────────────────────────────── |
Hi @njtierney, I think this is a great addition! I think it would be nice if there was an option for how the facets were organized just like in ggplot, as far as number of cols/rows. In many of my use cases, having the data all in one column is much easier to understand at a glance when the grouping variable is continuous or ordinal. The other feature that would help is some sampling options for large data. Since I last posted, I greatly improved my function to be generalizable to any data (before it was just for IPEDS). In addition, it has the following features:
Drawbacks/Issues:
Anyways, I'll share this code in case any of it is useful. #' vis_dat for grouped data
#' @description Produce a vis_dat plot for ipeds data split by year with optional sampling.
#' `r lifecycle::badge('maturing')`
#'
#' Note that parallel processing is built in if a `future::plan()` is set
#' @importFrom magrittr "%>%"
#' @param ... bare, unquoted column(s) to use as the index to group by. Alternatively will accept a grouped df.
#' @param .sample_frac Percent of observations to sample from each year. Default "auto" samples down to 100,000 rows, split evenly between groups for vis_dat. For vis_miss and vis_value, "auto" uses all data.
#' @param method Which visdat function to use. One of "vis_dat", "vis_miss", or "vis_value". Accepts shorthand "dat", "val", and "miss".
#' @return visdat plot separated by grouping variable.
#' @examples
#' \dontrun{
#' diamonds %>% visdat_grouped(facet_group = cut)
#' }
#' @importFrom rlang .data
#' @export
visdat_grouped <- function(.data, ..., method = "vis_dat", .sample_frac = "auto") {
is_pregrouped <- dplyr::is_grouped_df(.data) #Does the data already have grouping structure?
#Set the visdat function to use
if(stringr::str_detect(method, "dat")) method <- "dat"
if(stringr::str_detect(method, "val")) method <- "val"
if(stringr::str_detect(method, "miss")) method <- "miss"
# for val and miss we want to see all the data, hence auto = 1
if((method == "val" | method == "miss") & .sample_frac == "auto") .sample_frac = 1
# Otherwise downsmample
if(.sample_frac == "auto") {
if(nrow(.data) > 100000) {
.sample_frac <- 100000 / nrow(.data)
cli::cli_alert_info("Large data, automatically down-sampling data at {round(.sample_frac * 100)}%. To disable or change, set .sample_frac to a value between 0 and 1.")
} else .sample_frac <- 1
}
#Group the data
if(is_pregrouped) {
.data <- .data %>%
tibble::add_column(group_index = dplyr::group_indices(.)) %>%
tidyr::unite(group_name, dplyr::group_vars(.), sep = "\n", remove = F) %>%
dplyr::arrange(group_index)
} else {
.data <- .data %>%
dplyr::group_by(...) %>%
tibble::add_column(group_index = dplyr::group_indices(.)) %>%
tidyr::unite(group_name, ..., sep = "\n", remove = F) %>%
dplyr::arrange(group_index)
}
# Do any sampling
if(.sample_frac < 1) {
.data <- .data %>%
dplyr::sample_frac(.sample_frac / dplyr::n_groups(.)) #Needs to be updated, as sample_frac() is superseded. However sample_frac applies the fraction to each group if the data is grouped.
} else cli::cli_alert_info("Using 100% of data, this may be slow.")
#Split the data
.data <- .data %>% dplyr::group_split(.keep = F)
#Methods for each visdat graph
if(method == "dat") {
plist <- .data %>%
furrr::future_map(function(...) {
.data <- tibble::as_tibble(...)
group_name <- .data %>% dplyr::distinct(group_name) %>% dplyr::pull(group_name)
group_index <- .data %>% dplyr::distinct(group_index) %>% dplyr::pull(group_index)
.data <- .data %>% dplyr::select(-group_name, -group_index)
p <- .data %>%
visdat::vis_dat(warn_large_data = F, sort_type = F, palette = "qual") +
ggplot2::labs(y = group_name) +
ggplot2::theme(plot.margin = ggplot2::margin(0, 5.5, 0, 5.5, "pt"))
if(group_index > 1) {
p <- p + ggplot2::theme(axis.text.x = ggplot2::element_blank(),
plot.margin = ggplot2::margin(0, 5.5, 0, 5.5, "pt"))
}
return(p)
})
}
if(method == "val") {
plist <- .data %>%
furrr::future_map(function(...) {
.data <- tibble::as_tibble(...)
group_name <- .data %>% dplyr::distinct(group_name) %>% dplyr::pull(group_name)
group_index <- .data %>% dplyr::distinct(group_index) %>% dplyr::pull(group_index)
.data <- .data %>% dplyr::select(-group_name, -group_index)
p <- .data %>%
dplyr::select(tidyselect::where(is.numeric)) %>%
visdat::vis_value() +
ggplot2::labs(y = group_name) +
ggplot2::theme(plot.margin = ggplot2::margin(0, 5.5, 0, 5.5, "pt"))
if(group_index > 1) {
p <- p + ggplot2::theme(axis.text.x = ggplot2::element_blank(),
plot.margin = ggplot2::margin(0, 5.5, 0, 5.5, "pt"))
}
return(p)
})
}
if(method == "miss") {
plist <- .data %>%
furrr::future_map(function(...) {
.data <- tibble::as_tibble(...)
group_name <- .data %>% dplyr::distinct(group_name) %>% dplyr::pull(group_name)
group_index <- .data %>% dplyr::distinct(group_index) %>% dplyr::pull(group_index)
.data <- .data %>% dplyr::select(-group_name, -group_index)
p <- .data %>%
dplyr::select(tidyselect::where(is.numeric)) %>%
visdat::vis_miss(show_perc = T, warn_large_data = F) +
ggplot2::labs(y = group_name) +
ggplot2::theme(plot.margin = ggplot2::margin(0, 5.5, 0, 5.5, "pt"))
if(group_index > 1) {
p <- p + ggplot2::theme(axis.text.x = ggplot2::element_blank(),
plot.margin = ggplot2::margin(0, 5.5, 0, 5.5, "pt"))
}
return(p)
})
}
patchwork::wrap_plots(plist, ncol = 1, guides = "collect")
}
|
as per Sam Firke's tweet:
https://twitter.com/samfirke/status/984425923243134976
The text was updated successfully, but these errors were encountered: