
visdat package #87

Closed
12 of 13 tasks
njtierney opened this issue Jan 4, 2017 · 58 comments

Comments

@njtierney
Contributor

njtierney commented Jan 4, 2017

Summary

  • What does this package do? (explain in 50 words or less)

visdat visualises R data frames so that you can quickly identify data structure and data types. This makes it easier to "get a look at the data" and to visually identify abnormalities in a dataset.
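
For a concrete sense of the basic usage, here is a minimal sketch (airquality is a built-in R dataset; both functions are exported by visdat):

library(visdat)

# overview of the class of each column and the position of missing values
vis_dat(airquality)

# focus only on missingness
vis_miss(airquality)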

  • Paste the full DESCRIPTION file inside a code block below.
Package: visdat
Title: Preliminary Visualisation of Data
Version: 0.0.5.9000
Authors@R: person("Nicholas", "Tierney", email = "nicholas.tierney@gmail.com", role = c("aut", "cre"))
Description: visdat makes it easy to visualise your whole dataset so that you
    can visually identify problems.
Depends:
    R (>= 3.2.2)
License: MIT + file LICENSE
LazyData: true
RoxygenNote: 5.0.1.9000
Imports:
    ggplot2,
    tidyr,
    dplyr,
    purrr,
    readr,
    plotly (>= 4.5.6),
    magrittr,
    stats
URL: https://njtierney.com/visdat
BugReports: https://github.com/njtierney/visdat/issues
Suggests:
    testthat,
    knitr,
    rmarkdown,
    vdiffr
VignetteBuilder: knitr

  • URL for the package (the development repository, not a stylized html page)

https://github.com/njtierney/visdat

  • Who is the target audience?

R users who want to explore their data, particularly when they first receive it.

  • Are there other R packages that accomplish the same thing? If so, what is different about yours?

In terms of visualising missing data as a heatmap, there are a few other packages that have worked on this. The mi package used to have a visualisation method for missing data, missing.pattern.plot - however this is no longer present in the latest versions. The Amelia package has missmap, but the default requires some more work to make the final output easier to read.

The VIM package provides visualisations for missing data, for example, the aggr function provides a histogram of the missingness present in each variable.

In terms of visualising the types of data in a dataset, the wakefield package provides the table_heat function for visualising column data types.

But what makes visdat different?

visdat adheres to the principle that R packages should try to do one thing: it is a simple package that specialises in visualising data frames. Amelia and mi focus on multiple imputation and missing data methods, VIM focusses on visualising missingness and imputation in data, and the wakefield package focusses on creating random, reproducible data.

Visualising missing data is not the main focus of those packages; because visdat is purely about visualising data frames, it has greater scope to do that one thing well.

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration with Travis CI and/or another service.

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package contains a paper.md with a high-level description.
    • The package is deposited in a long-term repository with the DOI:

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

It succeeds, but there are some notes.

checking R code for possible problems ... NOTE
vis_guess: no visible binding for global variable ‘valueGuess’
Undefined global functions or variables:
  valueGuess
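
One common way to silence a NOTE like this, shown only as a sketch (not necessarily the fix visdat ends up using), is to declare the variable for R CMD check:

# e.g. in a small file such as R/globals.R (hypothetical location)
if (getRversion() >= "2.15.1") utils::globalVariables("valueGuess")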

It does not yet have an rOpenSci footer image

I have not set up pre-commit hooks to ensure that README.md is always newer than README.Rmd, as I'm not sure what devtools::use_git_hook does.

I have not added #' @noRd to internal functions as I think it is still useful to have them documented, but I can change this if need be.

  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

  • Maëlle Salmon (@masalmon)

  • Jenny Bryan (@jennybc)

  • Andrew MacDonald (@aammd)

@seaaan
Contributor

seaaan commented Jan 4, 2017 via email

@noamross
Contributor

noamross commented Jan 4, 2017

That's fine @seaaan, I'll assign once I get through editor checks. Could you fill out the form at http://ropensci.org/onboarding/? We've started using that to help keep track of reviewers' areas of expertise.

@seaaan
Contributor

seaaan commented Jan 4, 2017 via email

@noamross
Contributor

noamross commented Jan 4, 2017

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Currently seeking additional reviewers. This package fits well into our visualization/reproducibility categories.

  • I note that vignette building fails without a recent version of plotly. I suggest you set minimum version numbers for packages in your DESCRIPTION file.
  • Below is goodpractice::gp() output, with some small things to be addressed. Test coverage is relatively low, which is understandable given the nature of the package, but I suggest you look into using the vdiffr testthat extension to create tests for expected visual output.
── GP visdat ─────────

It is good practice to

  ✖ write unit tests for all functions, and all package code in general. 12% of code lines are covered by test cases.

    R/compare_print.R:10:NA
    R/compare_print.R:11:NA
    R/compare_print.R:12:NA
    R/compare_print.R:13:NA
    R/fingerprint.R:11:NA
    ... and 287 more lines

  ✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows that are about 80 characters wide. Try make your lines shorter than 80 characters

    R/vis_compare.R:24:1
    R/vis_compare.R:27:1
    R/vis_compare.R:30:1
    R/vis_compare.R:33:1
    R/vis_dat.R:29:1
    ... and 3 more lines

  ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) expressions. They are error prone and result 1:0 if the expression on the right hand side is zero. Use seq_len() or seq_along() instead.

    R/vis_compare.R:46:26
    R/vis_dat.R:50:26
    R/vis_guess.R:42:26
    R/vis_miss.R:43:24
    R/vis_miss.R:74:26

  ✖ fix this R CMD check NOTE: vis_guess: no visible binding for global variable ‘valueGuess’ 
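
To illustrate the seq_len() point in the output above, here is a minimal sketch of the pitfall and the fix:

df <- data.frame()    # zero rows
1:nrow(df)            # 1:0, i.e. c(1, 0) -- a loop over this runs twice by mistake
seq_len(nrow(df))     # integer(0) -- a loop over this correctly runs zero times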


Reviewers: @seaaan @batpigandme
Due date: 2017-01-30

@noamross
Contributor

noamross commented Jan 4, 2017

Reviewers assigned: @seaaan @batpigandme

@njtierney
Contributor Author

Hi Everyone,

Thank you very much for this, this is great feedback, and a great process.

I'm not sure what my expected role here as submitter is, but I just wanted to say that I agree with the feedback given, and I have been looking for a good excuse to use vdiffr, so that's good. Thank you @noamross for your suggestions, and to @seaaan for volunteering.

I was also wondering if I am able to submit this paper to JOSS? I understand if this is too late to do so.

@noamross
Contributor

noamross commented Jan 5, 2017

Fine to submit to JOSS. To do so, check off the JOSS publication options above, and write a paper.md file per JOSS requirements. Let us know when you have done so, so that reviewers can do the whole review at once. You can save deposition in a repository for a DOI for after you have made changes post-review.

@noamross
Contributor

noamross commented Jan 5, 2017

Oh, and the comments I put above, e.g., testing, can be addressed once all other reviews are in, too. If they were major I would have held up assigning reviewers until they were addressed.

@njtierney
Contributor Author

OK great, just working on the JOSS paper now, will hopefully have it done this evening.

Re your comments @noamross

I note that vignette building fails without a recent version of plotly. I suggest you set minimum version numbers for packages in your DESCRIPTION file.

DESCRIPTION file has been updated to reflect this here

Below is goodpractice::gp() output, with some small things to be addressed. Test coverage is relatively low, which is understandable given the nature of the package, but I suggest you look into using the vdiffr testthat extension to create tests for expected visual output.

Here is my updated output from goodpractice::gp()

── GP visdat ────────────────────────────────

It is good practice to

  ✖ write unit tests for all functions,
    and all package code in general. At
    least some lines are not covered in
    this package.

    R/vis_compare.R:33:NA
    R/vis_compare.R:34:NA

───────────────────────────────────────────── 

Test coverage is now at 99% using vdiffr. I think I have used vdiffr correctly: it looks like the main use is to check that a plot reliably produces the same output, and the test uses expect_doppelganger().

Thank you again for the feedback, I will let you know when I add the paper.md doc.

@njtierney
Contributor Author

Ack, sorry if I jumped the gun replying to the comments.

@njtierney
Contributor Author

OK @noamross I have pushed my paper.md into the github repo under the folder visdat/paper/.

Thank you again!

@njtierney
Contributor Author

I've also updated the DESCRIPTION file.

@noamross
Contributor

noamross commented Jan 9, 2017

Great, @seaaan @batpigandme you can go ahead with your reviews. I've moved the due date to 1/30.

@seaaan
Contributor

seaaan commented Jan 15, 2017

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and URL, Maintainer and BugReports fields in DESCRIPTION
Paper (for packages co-submitting to JOSS)

The package contains a paper.md with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • [?] Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:
7.5h

Review Comments

SUMMARY

visdat is designed to help data analysts get a visual overview of the structure of a data set. Specifically, it shows data as a heatmap, with colors indicating the type of data in each cell, as well as any missing data. There are additional functions for focusing on missing data, guessing the types of data in cases where a column might have data of multiple types, and comparing two data sets. I think visdat is a very worthwhile project and have already started using it in my own work.

In general, the documentation is clear and the functions are easy to use. The code is pretty well-written, with the main issues I noticed being some code duplication and allowing some irrelevant warning messages to be presented to the user. In a few places, I suggest revisions to the current default behavior. Most notably, I suggest that the heatmaps should show the columns and rows in the same order as the data frame passed by the user. Currently, the rows are in reverse order and, in some functions, the columns get reordered by default.

In principle, I think this package is well worth accepting into rOpenSci. Below I've provided specific comments on many aspects of the package. Please view the comments as suggestions or ideas rather than as commands that must all be implemented.

When I found minor typos, I changed them and made a pull request rather than trying to describe them for you to change. See link to pull request here.

Thank you for the opportunity to review this package. I learned a lot in the process of doing so and it was an enjoyable experience. You have a great system set up here @noamross, @sckott, and other rOpenSci folks. It was very smooth and easy for me to figure out what to do.

Documentation

README

  • The README is very good! It gives a helpful overview of everything in the package and is clearly written, with good examples. All of the code examples worked on my computer. You might consider breaking the "experimental" parts of the package into a separate README like you do with the vignettes.

  • Link to the vignettes from the README.

  • The title of the github project is "A package to assist in visually testing a dataset", which doesn't describe the project that well. I suggest changing it to be the same title as on the DESCRIPTION file, which is effective in quickly summarizing what the package does.

  • You don't introduce the airquality data set at first usage, but you do introduce it after the second plot. Move the introduction to first use.

  • I don't understand this sentence: "When there is <0.1% of missingness, vis_miss indicates that there is >1% missingness."

Vignettes

  • I was able to build the vignettes on my computer without issue. They are both clear and helpful, just like the README. Good job!

  • Most of the rest of the "Using visdat" vignette overlaps with the README, so I'm not repeating comments here that apply to both.

  • Make sure you have permission to reproduce the image from R4DS.

  • The first few paragraphs of the "Using visdat" vignette are a bit unnecessary in my opinion. You could start much more simply, with something like: "When you get a new data set, you need to get a sense of what it contains and potential problems with it." and then continue with the discussion of different ways to approach this challenge (e.g. head, glimpse, ...).

  • In the "Using visdat" vignette, it says "missing data represented by black", but it shows up as gray on my computer.

  • I don't know why, but in the plotly section, a vis_dat() (non-interactive) plot of df_test appears between the first two interactive plots. I can't explain it, hopefully it's just a weird quirk on my computer.

  • There is a "future work and experimental features" section in the "Using visdat" vignette -- I suggest transferring that to the experimental vignette and just linking to it from the "Using visdat" vignette. Perhaps move the vis_guess() and the interactive plots to the experimental features vignette as well, since they seem to be evolving.

  • The plot in the experimental features vignette shows up small, I think you need to add knitr::opts_chunk$set(fig.width = 5, fig.height = 4) to the first chunk like you have in the other vignette.

Function documentation

  • Did you mean to export guess_type()?

  • Formatting: links, code, and other formatting need to be done with .Rd syntax. For example, for code, use \code{}, not backticks. For bold, use \strong{} instead of asterisks. See guidelines here.

  • Return values: The documentation doesn't include the "Value" section which is typically used to say what a function returns. I would add this. E.g. @return A \code{ggplot} object displaying the type of values in the data frame and the position of any missing values.

  • ?vis_compare: The documentation for this function needs to be updated. It is more a list of ideas for how to implement the function than a description of what it does.

  • ?vis_guess: First sentence of description seems to have an extra word or phrase, not sure exactly what was intended.

  • ?vis_miss_ly: The reference to vis_miss in the "See Also" section should be a link. Like this: \code{\link[visdat]{vis_miss}}

  • ?visdat: It's good that this page exists. However, it doesn't have any content -- add a brief overview of the package with links to the documentation page for the main functions.

Examples

  • Don't need to call library for either visdat or packages you use internally (only if you actually call a function from another package in the example code itself).

  • example(vis_compare) is a good example but gives a bunch of warning messages.

  • example(vis_dat): "palette" is misspelled.

  • example(vis_guess): gives warning message

Community guidelines

  • There are no guidelines for how to contribute, either in the README or in a CONTRIBUTING file.

JOSS

  • The paper seems more detailed than most other JOSS papers and goes into specific functions. Additionally, does JOSS allow figures? I didn't see any in the papers I looked at or see mentions of them in the author guidelines. But I'm not an expert on JOSS, so maybe someone else can weigh in on that.

  • The paper doesn't have any references in the "References" section, but does have inline references.

  • If you keep it, make sure you have permission to reproduce the image from R4DS.

Functionality

Common issues

There are a number of issues that apply to most or all of the functions in the package. Rather than repeat them under each function, I'll just put those comments together here.

  • It is unintuitive to me to have the rows in reverse order (i.e. row 1 at the bottom) and for the columns to be clustered by type, rather than appear in the order they appear in the data frame. I think the default behavior should be for rows and columns to appear in the same order as in the input. Additionally, always putting the titles at the top of the columns makes sense to me. Including sort_type as an option might be useful, but by default it should be off. I'm not quite sure of the use case for the flip argument, but maybe there is one!

  • The flip argument should be available for all of the functions if you provide it for any, for the sake of consistency. Possibly also the palette argument.

  • A number of the functions emit warning messages ("Warning message: attributes are not identical across measure variables; they will be dropped"), specifically when they are called with a data frame that has >1 factor column with different levels. This arises from the call tidyr::gather_(x, "variables", "value", names(x))$value. This message is not relevant to the end user and should be suppressed.

  • There is some code duplication across some of the functions, where you could replace it with a helper function. The code creating the plots is generally similar across functions. In addition, the following code blocks appear in essentially identical form in >1 function:

  dplyr::mutate(rows = seq_len(nrow(.))) %>%
  tidyr::gather_(key_col = "variables",
                 value_col = "valueType",
                 gather_cols = names(.)[-length(.)])

Using a helper function for this next one in particular would allow you to suppress the warning messages just once:

  d$value <- tidyr::gather_(x, "variables", "value", names(x))$value
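
A minimal sketch of such a helper (hypothetical name; x is the input data frame):

gather_values <- function(x){
  # reshape once and silence the irrelevant "attributes are not identical" warning
  suppressWarnings(
    tidyr::gather_(x, "variables", "value", names(x))$value
  )
}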

The code for flipping the rows is also duplicated:

 if (flip == TRUE){
    suppressMessages({
      vis_dat_plot +
        ggplot2::scale_y_reverse() +
        ggplot2::scale_x_discrete(position = "top",
                                  limits = type_order_index) +
        ggplot2::theme(axis.text.x = ggplot2::element_text(hjust = 0.5)) +
        ggplot2::labs(x = "")
    })
  } else if (flip == FALSE){
    vis_dat_plot
  }
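
A sketch of how the flip logic could live in a single helper (hypothetical name, based on the duplicated block above):

flip_plot <- function(plot, flip, type_order_index){
  if (!flip) return(plot)
  suppressMessages(
    plot +
      ggplot2::scale_y_reverse() +
      ggplot2::scale_x_discrete(position = "top",
                                limits = type_order_index) +
      ggplot2::theme(axis.text.x = ggplot2::element_text(hjust = 0.5)) +
      ggplot2::labs(x = "")
  )
}
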
  • vis_compare and vis_guess are indicated as being in beta, which seems also to apply to the plotly versions of the functions. The message that they emit is a helpful indication that they may change in the future. Before submitting to CRAN, however, you might consider moving those functions to the development version of the package and only uploading the functions with a stable API, then adding the beta version later once they stabilize. This is a judgment call; I think I tend towards being conservative on this issue personally.

  • This is a minor issue, but was so thoroughly drummed into me by CS professors that I have to mention it. In calls like if (sort_type == TRUE), the == TRUE is redundant, so it's equivalent to just write if (sort_type). Similarly, if (x == FALSE) is equivalent to if (!x). There are a number of these in the code for various functions.

vis_dat()

  • Maximum data frame size: the heatmaps stop working on my computer with very large data sets. E.g. library(nycflights13); vis_dat(flights) hangs for a minute or so and then displays an empty grid. It works with 10000 rows, but not 100000. Similar issue with many columns. Especially because this package is designed for people to get a sense of new data sets, where users may not have realized how big the data are, I think it makes sense to prevent this issue. Possibilities: (1) max_rows = 10000 parameter, which stops if nrow(data) > max_rows and gives a descriptive error message. Then the user can increase the maximum or subset their data frame. Might also need to consider max_cols because 10000 rows and 2 cols is very different from 10000 rows and 500 cols. (2) Downsample the non-missing rows somehow and indicate to the user where omissions were made. (3) ???

  • The palette argument doesn't cause color changes, as far as I can tell. I think you need to do vis_dat_plot <- vis_dat_plot + <palette>, whereas currently the value of vis_dat_plot isn't being updated in the blocks related to color palettes. Additionally, you could use ggplot2::scale_color_brewer(type = "qual", palette = "Set1") rather than explicitly writing out the color names, although the colors would be in a slightly different order than you have now. (See the sketch after this list.)

  • If the user provides a palette name, you should check that it's valid. If not, throw an error with a message. Currently, if you misspell a palette name, it just goes ahead with the default palette.
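
Regarding the palette point above, a minimal sketch of the reassignment pattern being described (the palette value "qual" and the choice of fill vs. colour scale are assumptions about visdat's internals):

if (palette == "qual"){
  # reassign so the scale is actually stored on the returned plot object;
  # use scale_fill_brewer() or scale_colour_brewer() depending on which
  # aesthetic the geom maps to the value type
  vis_dat_plot <- vis_dat_plot +
    ggplot2::scale_fill_brewer(type = "qual", palette = "Set1")
}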

Minor comments

  • There are some lines of commented-out code: # x = airquality and # mutate_each_(funs(fingerprint), tbl_vars(.)) %>%

  • Inside of the first if-block, there are some comments about arranging columns in order of missingness, but as far as I can tell, that is not done.

vis_miss()

  • It would be cool to indicate the % missing for each column individually (maybe in column labels?) like how you indicate the % missing for the data overall.

vis_compare()

  • This is a neat function! It would be cool if you could generalize it to also handle data frames of unequal dimensions, perhaps by showing which rows/columns are only in one data frame or the other. Just an idea I had though, not a requirement at all!

  • Warning message: "1: In if (dim(df1) != dim(df2)) { : the condition has length > 1 and only the first element will be used". This is a consequence of calling dim(df1) != dim(df2). The dim function returns a two-element vector for a data frame, so the comparison yields a logical vector of length two, but if() only uses its first element (i.e. it only compares the number of rows). Instead, use !identical(dim(df1), dim(df2)). (See the sketch after this list.)

  • In the error message about data frames of unequal dimensions, you mean to say "vis_compare only handles dataframes of identical dimensions", not "does not handle".
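
A small illustration of the dimension comparison (sketch only):

df1 <- data.frame(a = 1:3, b = 4:6)
df2 <- data.frame(a = 1:3, b = 4:6, c = 7:9)

dim(df1) != dim(df2)             # c(FALSE, TRUE): element-wise comparison
if (dim(df1) != dim(df2)) { }    # warns (errors in R >= 4.2); only the first element is used
!identical(dim(df1), dim(df2))   # TRUE: a single logical value, safe inside if()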

vis_guess

  • vis_guess lacks some of the parameters that vis_dat has, but should have the same options (except maybe not sort_type?)

  • Performance: Setting the locale once and then calling collectorGuess instead of guess_parser is about 33% faster on my computer. I think the savings come from avoiding repeated calls to the locale function and to stopifnot (inside the readr code).

# set the locale once, outside the apply loop
l <- readr::locale()

# output, nas and x refer to objects in the surrounding vis_guess() code
output[!nas] <- vapply(FUN = function(y) readr:::collectorGuess(y, locale_ = l),
                       X = x[!nas], FUN.VALUE = character(1))
  • If you're interested in converting to C++ for speed, I played around a little bit with implementing a vectorized version of guess_parser in readr. It's about 20X faster. Basically, I wrote a wrapper around collectorGuess that operates on a vector instead of single elements. This is faster because it does the looping in C++ rather than repeatedly transitioning from R to C++, and because it does some initial work once per vector instead of once per element. It is currently written as a modification of readr because that was more convenient, but I imagine it is possible to implement in your package as well. Let me know if you want to discuss that, but I'm definitely not an expert. To get it up and running, clone readr and then put this in readr/src/collectorGuess.cpp:
std::string collectorGuess2(const std::string& input, LocaleInfo locale) {

  // Work from strictest to most flexible
  if (canParse(input, isLogical, &locale))
    return "logical";
  if (canParse(input, isInteger, &locale))
    return "integer";
  if (canParse(input, isDouble, &locale))
    return "double";
  if (canParse(input, isNumber, &locale))
    return "number";
  if (canParse(input, isTime, &locale))
    return "time";
  if (canParse(input, isDate, &locale))
    return "date";
  if (canParse(input, isDateTime, &locale))
    return "datetime";

  // Otherwise can always parse as a character
  return "character";
}

// [[Rcpp::export]]
CharacterVector collectorGuessVector(CharacterVector input, List locale_) {
  LocaleInfo locale(locale_);
  CharacterVector output(input.size());

  if (input.size() == 0 || allMissing(input)) {
    CharacterVector result(1);
    result[0] = "character";
    return result;
  }

  for (int i = 0; i < input.size(); i++) {
    output[i] = collectorGuess2(std::string(input[i]), locale);
  }

  return output;
}

Then put this in readr/R/collectors.R:

#' @rdname parse_guess
#' @export
guess_parser_vector <- function(x, locale = default_locale()) {
  stopifnot(is.locale(locale))
  collectorGuessVector(x, locale)
}

Then rebuild readr. Back in visdat, call guess_parser_vector once with the whole vector of unknowns instead of calling guess_parser repeatedly.
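
For illustration, the call site in visdat might then look roughly like this (assuming readr exported guess_parser_vector as sketched above; output, nas and x as in the earlier snippet):

# one vectorised call instead of one guess per element
output[!nas] <- readr::guess_parser_vector(x[!nas])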

vis_miss_ly

  • The pop-up window is great! Can you make it the same as for ggplotly(vis_dat())? It's especially helpful to show the row number.

  • Why not have the same arguments for this function as for vis_miss?

vis_dat_ly()

  • With vis_dat(airquality) %>% ggplotly(), the window that appears on mouseover should say "type" instead of "valueType", "variable" instead of "variables" and "row" instead of "rows"

  • If you provide vis_miss_ly, it makes sense to me to also provide vis_dat_ly, and possibly also vis_compare_ly and vis_guess_ly. An initial implementation could just be vis_dat(x) %>% ggplotly(), but then you have the API in place at least.
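
A minimal sketch of such a wrapper (hypothetical function, simply delegating to ggplotly as suggested):

vis_dat_ly <- function(x){
  # interactive version: build the static plot, then hand it to plotly
  plotly::ggplotly(vis_dat(x))
}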

Tests

  • Cool use of vdiffr and generally good coverage. The tests don't all pass on my machine (they are mostly skipped), which I think is because I don't have the .svg reference files on my computer.

  • I suggest adding a test for each function using typical_data, because that currently emits a warning message. That way, once you stop the warning from occurring, you'll have a regression test for the future.

  • Maybe I don't understand how vdiffr works, but I'm confused about test-vis-dat.R where in the calls to vdiffr you always provide vis_dat_plot, never any of the other plots you created. Additionally, both of the plots you create using non-default palettes are given the same name, so you're overwriting one of them.

  • On my computer, the palette argument doesn't change the appearance of the vis_dat output, but the tests seem to pass, so can you check that you have a test that catches that issue? When I run vdiffr, vis-dat-qualitative-palette.svg and vis-dat-vanilla.svg appear to be the same.
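
For the two test points above, a sketch of how distinct doppelganger names keep the reference figures from overwriting each other (the test title and palette value are illustrative assumptions):

test_that("vis_dat output is stable across palettes", {
  vdiffr::expect_doppelganger("vis-dat-vanilla", vis_dat(airquality))
  vdiffr::expect_doppelganger("vis-dat-qualitative-palette",
                              vis_dat(airquality, palette = "qual"))
})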

@njtierney
Contributor Author

Thank you very much @seaaan for the review, and for taking the time to cover it in so much detail. I agree with you very much that this is a great process set up by rOpenSci.

@noamross and @sckott, is it OK for me to address these comments now, or should I wait for the second reviewer, to avoid doubling up?

I'm really looking forward to addressing these changes; I feel like visdat is going to be much better as a result of this process.

@noamross
Contributor

Thank you @seaaan for the thorough review. @njtierney you can reply here and make changes, but I suggest you use a branch or work locally so that @batpigandme may have a stable version to review.

Two quick notes:

  • @seaaan, it looks like @njtierney is using the development version of roxygen2, which allows one to write documentation in markdown syntax that is converted to .Rd syntax. This is fine.
  • @arfon, can you comment if the paper.md isn't in JOSS format. I believe images are allowed, correct?

@batpigandme

batpigandme commented Jan 16, 2017 via email

@seaaan
Contributor

seaaan commented Jan 16, 2017

Cool, I didn't know that roxygen2 would do that. That definitely makes it easier. For some reason, however, the markdown syntax didn't get converted to .Rd syntax in parts of the description section of vis_compare.Rd.

@arfon

arfon commented Jan 16, 2017

@arfon, can you comment if the paper.md isn't in JOSS format. I believe images are allowed, correct?

Images are ok. I'm not exactly sure what happens to those ```r code blocks - we'll see when I try and compile it later :-)

@njtierney
Contributor Author

@batpigandme At this stage I probably won't have time to address @seaaan's comments until the weekend, so no need to wait on me :)

Re markdown in the Rd files, @noamross and @seaaan, I need to add one more thing to the DESCRIPTION file to get the markdown syntax to work: Roxygen: list(markdown = TRUE) (more here), so Sean's comment makes sense.

@batpigandme

@njtierney apologies — I meant to synchronize my PR and review posting, but if you ignore the PR for a few more hours, you can pretend I did it at the same time

@seaaan
Contributor

seaaan commented Jul 5, 2017 via email

@seaaan
Contributor

seaaan commented Jul 8, 2017

Summary

I respond below to a few specific things, but in general for the sake of brevity I didn't respond when I agreed with your changes. So mentally insert "awesome change! you rock!" everywhere that I didn't respond :)

The two points of substance I have, as you'll see, are:

  • Please implement the warning for large data sets with vis_dat (to save us low-RAM types from ourselves). This is the only change that needs to be made, in my opinion, before acceptance and approval.
  • For speeding up vis_guess I do think a pull request to readr is the way to go. I think it's fine to approve the package in the meantime and have that as a future issue to work on.

Details

Did you mean to export guess_type()?

  • Yes, as I thought that it might be a useful function for users, although I can't think of a good use case outside of vis_guess right now. For the moment I will unexport guess_type.
To clarify, I'm not opposed to exporting it. What I meant to draw attention to (but didn't actually say!) was that its documentation makes it sound like an internal function not intended for public use. If exported, the documentation should be changed.

Formatting: links, code, and other formatting need to be done with .Rd syntax. For example, for code, use \code{}, not backticks. For bold, use \strong{} instead of asterisks.

  • I have now converted this over to use markdown with roxygen, this should be fixed now.
Based on Noam's comment, I was actually in the wrong here. Sorry about that!  

The paper doesn't have any references in the "References" section, but does have inline references.

  • These references are provided in the yaml, and in a .bib file; hopefully these get compiled with the JOSS paper. Perhaps @arfon can chime in here?
Sorry, my bad!

I agree that the rows should be in the same order as in the data frame. I would also prefer the column names to be on top of the visualisation.

  • I have now implemented this, although I think that the presentation of the column names is perhaps not as nice.
Just a minor comment here: for the new plots with the column names on top, I don't think you need the title "variables in dataset".

This [vis_compare] is a neat function! It would be cool if you could generalize it to also handle data frames of unequal dimensions, perhaps by showing which rows/columns are only in one data frame or the other. Just an idea I had though, not a requirement at all!

  • I am glad you like this! It is a difficult problem that I cannot work out: how to match rows/cols when the dimensions differ. I'm not quite sure what to do here to make this possible; I think this should be tackled in a future release of visdat, and it will be moved to a development branch.
Fair! I agree that it's a difficult problem. My thought was something along the lines of displaying a df the size of the bigger of the two dfs and indicating which rows and columns are present in the bigger and not in the smaller. That of course leaves the problem of some rows/cols missing in one df and some others missing in the other df, where you need to somehow display the union of the two. Though perhaps at this level the use case on such dissimilar dfs breaks down.

readr stuff for efficiency in vis_guess.

  • OK, so I think that this is a fantastic idea! But I'm not sure on the best way to move this into production of visdat, since it sounds like we would need to borrow a chunk of code from readr and then modify it. I wonder if perhaps the solution here is to do a pull request for this specific feature in readr?
I think, yeah, asking them to export a vectorised collectorGuess function from readr would be the way to go. It should be relatively straightforward from what I did last time to submit a pull request if they're open to it. It's possible that a vectorised function actually already exists in readr and I just missed it -- there's lots of code in that package!

If they don't want to export such a function, you could include a bunch of readr code in visdat, but it would be a lot of code because the parsing functions are complicated.

Answers to your questions

I agree with @batpigandme's answers to your questions.

Q. 3: I agree with Mara. Please do implement the warning and save me from myself: My dinky computer runs out of memory and dies trying to display large data frames.
Q. 4: Answered above.
Q. 5: Happily, the plotly issue is no more.

Great job on the changes!

@noamross
Contributor

noamross commented Jul 9, 2017

Thanks for continued great progress, everyone! Let me just chime in that it's fine to defer a more efficient vis_guess to a future version if it first requires updates to readr. Also, I'll put in a request for a warn_large_data argument, maybe with a default of interactive(), so that users can skip the prompt for big data frames when scripting.
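
A rough sketch of what such a guard might look like inside vis_dat()/vis_miss() (the argument name comes from the discussion above; the threshold and message are made-up placeholders):

# near the top of the function, before any reshaping or plotting
if (warn_large_data && nrow(x) * ncol(x) > 100000) {
  stop("x has more cells than vis_dat is designed to display quickly.\n",
       "Set warn_large_data = FALSE to plot it anyway.",
       call. = FALSE)
}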

@njtierney
Contributor Author

@batpigandme Thank you again for your comments! :)

Q. 2: I prefer the visual analogue to the data frame, but, again that's totally stylistic.

OK, I think I will go with my default of sort_type = TRUE, but this can still be changed by the user, so this should be OK. I might change this in the future though; I imagine as I teach more I might change my mind!

@njtierney
Contributor Author

@seaaan Thank you again for your comments! :D Likewise, please insert "Awesome! Thank you!" For each of these :)

Re guess_type - I think that this function, whilst kinda useful, doesn't really belong to the user, so I think the move to make it internal is better, and makes the purpose of the package, pre-exploratory visualisation, clearer.

Re x axis label of "variables in dataset" - I agree, this has been changed in a recent push to the cran branch, great point!

Re vis_compare, I need to think about this some more, for the moment, perhaps we can label this as an upcoming feature?

Re vectorized guess, perhaps we can post an issue on readr and ask the tidyverse team?

@noamross I have just added the warn_large_data param for both vis_dat and vis_miss - which you can see here

Just to clarify - I am working on getting a cut-down, preliminary version of visdat onto CRAN that just has vis_dat and vis_miss, and then we can go from there. Does that work for everyone?

@seaaan
Contributor

seaaan commented Jul 11, 2017 via email

@noamross
Contributor

Hi @seaaan and @batpigandme. @njtierney has informed me that he has addressed all of the above and harmonized the CRAN changes with the master branch of the repo. Please take a look at the current version and see if it addresses any outstanding concerns. If it does, let us know below and also check off any remaining boxes on your original reviews. If not, let us know the details.

@njtierney
Contributor Author

Thanks @noamross !

Important question - how do I best add you all into the DESCRIPTION file?

@noamross
Contributor

noamross commented Jul 28, 2017

We're not done until we're done @njtierney 😉

We are currently working with CRAN to get a rev (reviewer) role approved for the Authors@R field - it has been accepted in the past but inconsistently. Past authors have, if reviewers agree and authors desire it, added reviewers as ctb contributors with some extra comments in the metadata. See here for an example: https://github.com/ropensci/roadoi/blob/master/DESCRIPTION . As editor I should not be included in the authors field.

@njtierney
Contributor Author

OK, sure thing, that makes sense!

Would you mind if I added you all to the Thank Yous section at the end of visdat?

@seaaan
Contributor

seaaan commented Jul 28, 2017 via email

@njtierney
Contributor Author

Oh, thanks for picking up on that! I've just removed it now and added y'all in the Thank yous in the referenced commits above :D

Lemme know what you all think.

@batpigandme

Other than a renegade missing s (just submitted PR), it all looks great. @njtierney and visdat FTW!

@njtierney
Contributor Author

woohoo! Changes merged, thanks Mara! :D

@noamross
Contributor

Approved! Thanks @njtierney for submitting, and @batpigandme and @seaaan for your reviews and for hanging on for the long haul!

To-dos:

  • Transfer the repo to the rOpenSci organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You'll be made admin once you do.
  • Add the rOpenSci footer to the bottom of your README
[![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)
  • Fix any links in badges for CI and coverage to point to the ropensci URL. (We'll turn on the services on our end as needed)
  • Activate Zenodo watching the repo
  • Tag and create a release so as to create a Zenodo version and DOI
  • Submit to JOSS using the Zenodo DOI. We will tag it for expedited review.

Welcome aboard! I believe you already have plans for a blog post to follow this up.

@njtierney
Contributor Author

njtierney commented Aug 1, 2017

Hooray! 🎉

  1. visdat has been transferred, I've joined the visdat team
  2. I've added the rOpenSci footer
  3. It looks like some of the badges are working, except for appveyor and codecov?

[screenshot: README badges]

Just working on activating Zenodo and touching up the JOSS paper, will report back ASAP

@njtierney
Contributor Author

OK so I'm ready to create a Zenodo release, but before I go ahead and do that, I was wondering what the status is with adding the rOpenSci onboarding folk (editor/reviewers) to the DESCRIPTION?

@noamross
Contributor

noamross commented Aug 1, 2017

Reviewers may be added to DESCRIPTION with an annotated ctb role (as in the roadoi example), but don't list editors.

@njtierney
Contributor Author

Alrighty!

  1. Activated Zenodo watching the repo
  2. Tagged and created a release so as to create a Zenodo version and DOI

[screenshot: tagged GitHub release]

I'm not sure what to do about the badges for appveyor and codecov; could I get some help with that?

Working on submitting visdat to JOSS now, I might do this tomorrow morning as it's getting a bit late 😪

@njtierney
Contributor Author

OK...I think I'm done! Except I submitted to JOSS using the github repo not the Zenodo repo link: https://zenodo.org/record/837274 Oops!
