Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

submission: EndoMineR #153

Closed
12 of 14 tasks
sebastiz opened this issue Sep 22, 2017 · 51 comments
Closed
12 of 14 tasks

submission: EndoMineR #153

sebastiz opened this issue Sep 22, 2017 · 51 comments

Comments

@sebastiz
Copy link

sebastiz commented Sep 22, 2017

Summary

  • What does this package do? (explain in 50 words or less):

The goal of EndoMineR is to extract as much information as possible from semi-structured text endoscopy reports and their associated pathology specimens for The package extracts,cleans and manipulates the data in a standardized way for the purpose of automating audit and further research in Gastroenterology.

  • Paste the full DESCRIPTION file inside a code block below:
Package: EndoMineR
Title: Functions to mine endoscopic and associated pathology datasets
Version: 0.0.0.9000
Authors@R: person("Sebastian", "Zeki", email = "sebastiz@hotmail.com",
  role = c("aut", "cre"))
Description: This script comprises the functions that I use to clean up endoscopic 
             reports and pathology reports as well as many of the scripts used for analysis.  
             The scripts assume the endoscopy and histopathology data set is merged already but it can 
             also be used of course with the unmerged datasets.
Depends: R (>= 3.3.2)
Imports:
        ggplot2,
        data.table,
        reshape2,
        fuzzyjoin,
        circlize,
        magrittr,
        randomNames,
        generator,
        tidyr,
        directlabels,
        gplots,
        splitstackshape,
        googleVis,
        lattice,
        tm,
        gtools,
        lubridate,
        knitr,
        stringr,
        rlang,
        dplyr
License: GPL-3
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1.9000
Suggests: rmarkdown,testthat
VignetteBuilder: knitr
  • URL for the package (the development repository, not a stylized html page):
    https://github.com/sebastiz/EndoMineR

  • Please indicate which category or categories from our package fit policies this package falls under *and why(? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):

[e.g., "data extraction, because the package parses a scientific data file format"]

data extraction,and data munging as the data takes and cleans free text endoscopy and pathology reports and cleans and presents the data in formats needed by gastroenterologists

  • Who is the target audience?

Gastroenterologists,pathologists but potential for many medical specialities with electronic reporting systems

There are no other similar R packages to my knowledge

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license. Yes: GLP3
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration, including reporting of test coverage, using services such as Travis CI, Coeveralls and/or CodeCov.
  • I agree to abide by ROpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

Not yet as this is a pre-publication enquiry

- [ ] The package is deposited in a long-term repository with the DOI: 

Not yet as this is a pre-publication enquiry

- (*Do not submit your package separately to JOSS*)

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

RMD check passes

snake_case not used universally but this can be changed based on the outcome of the pre-submission enquiry

  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

@sebastiz sebastiz changed the title Pre-sunbmission enquiry for EndoMineR Pre-submission enquiry for EndoMineR Sep 22, 2017
@sckott
Copy link
Contributor

sckott commented Sep 22, 2017

hi @sebastiz - Is this a pre-submission inquiry OR are you submitting a package? Usually with pre-submissions you only need to briefly introduce the pkg and we discuss whether we'd like you to submit the package or not, see examples here https://github.com/ropensci/onboarding/issues?q=is%3Aissue+label%3A0%2Fpresubmission

@sebastiz
Copy link
Author

sebastiz commented Sep 23, 2017

@sckott sckott changed the title Pre-submission enquiry for EndoMineR submission: EndoMineR Sep 25, 2017
@sckott
Copy link
Contributor

sckott commented Sep 25, 2017

thanks @sebastiz

i changed the title to remove pre- part and removed the presubmission label

The first thing we (editors) will do is discuss fit for ropensci - we'll get back to you asap with that decision, if it's a fit, then we'll proceed as normal ...

@sckott
Copy link
Contributor

sckott commented Sep 27, 2017

hi @sebastiz

We're curious about how broadly this may be used. Are the input files somewhat standard? Or are there are a set of possible file formats/types you support? Can you envision other researchers using this? Are the files in data-raw types of files users would input?

@sebastiz
Copy link
Author

sebastiz commented Sep 28, 2017

Hi @sckott

There are over 10 million electronic endoscopy records across the UK alone. There is no standardised scripted approach to allow hospitals to analyse these records so that we can assess missed cancer rates/ quality of endoscopy and many other aspects as documented. This is really what this work addresses

In terms of broad applicability, the basic functions of the package can be applied to any endoscopic procedure (colonoscopy/ bronchoscopy/ cystoscopy) so that although it is designed as a gastroenterology package it is also a useful package for many other specialities to analyse their data. I have focussed on gastroenterology to give the project a focussed scope and a defined audience.

The files data-raw are typical input files which are always semi-structured text. The internal structure of the input files (as in headings used) may change from hospital to hospital which is why the Extractor function is designed to allow the user to define the extractable sections of a report. This gives the user to flexibility to extract the information they want and then apply the relevant downstream functions as needed.

Please let me know if you have any further questions and I will answer them as soon as I can.

@sckott
Copy link
Contributor

sckott commented Sep 30, 2017

thanks! Will get back to you soon

@sckott
Copy link
Contributor

sckott commented Oct 2, 2017

@sebastiz sorry for delay on this. We think it's a good fit, so we'll go ahead with the review.

Will assign an editor very soon ...

@karthik karthik self-assigned this Oct 2, 2017
@sckott sckott assigned karthik and unassigned karthik Oct 2, 2017
@karthik
Copy link
Member

karthik commented Oct 6, 2017

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Thank you for the submission. I have done a quick scan through the package and have also copied the output of gp from the package goodpractice. Please fix the issues noted below. Primarily you should strive to improve your test coverage. Statements in your tests like

expect_that(nrow(em)>0,equals(TRUE))

can just be

expect_true(...)

I'd recommend scanning through the testing chapter on Hadley's book, or for a more detailed description, Richie Cotton's book on testing. There are many prebuilt expectations in testthat and I see a bunch of your code is doing things a convoluted way (e.g. checking for classes etc).

There are also a lot of imports on your package. For example, I see both ggplot2 and lattice. Is there a good reason to import both? Can some imports be streamlined?

If you use Rstudio, please format the code to make it more readable. You can 'tidy' the code easily by going to the menu option code and choosing "reformat code". There are various readability issues, like for example spaces (or lack thereof) around assignment operators.

Your examples lack comments. This makes it hard for a new user to figure out what is going on.


I'm currently seeking reviewers and will update the thread once I have some folks lined up. Stay tuned. 📻

── GP EndoMineR ────────────────────────────────────────────────────────────────

It is good practice to

write unit tests for all functions, and all package code
in general. 32% of code lines are covered by test cases.

R/BarrettsFunctions.R:22:NA
R/BarrettsFunctions.R:23:NA
R/BarrettsFunctions.R:24:NA
R/BarrettsFunctions.R:25:NA
R/BarrettsFunctions.R:26:NA
... and 374 more lines

add a "URL" field to DESCRIPTION. It helps users find
information about your package online. If your package does not
have a homepage, add an URL to GitHub, or the CRAN package package
page.
add a "BugReports" field to DESCRIPTION, and point it to
a bug tracker. Many online code hosting services provide bug
trackers for free, https://github.com, https://gitlab.com, etc.
use '<-' for assignment instead of '='. '<-' is the
standard, and R users and developers are used it and it is easier
to read your code for them if you use '<-'.

R/BarrettsFunctions.R:261:7
R/BarrettsFunctions.R:263:7
R/BarrettsFunctions.R:264:7
R/BarrettsFunctions.R:367:7
R/BarrettsFunctions.R:368:7
... and 7 more lines

avoid long code lines, it is bad for readability. Also,
many people prefer editor windows that are about 80 characters
wide. Try make your lines shorter than 80 characters

R/BarrettsFunctions.R:1:1
R/BarrettsFunctions.R:2:1
R/BarrettsFunctions.R:3:1
R/BarrettsFunctions.R:4:1
R/BarrettsFunctions.R:55:1
... and 209 more lines

avoid sapply(), it is not type safe. It might return a
vector, or a list, depending on the input data. Consider using
vapply() instead.

R/CleanUp.R:504:19
R/EndoMineR.R:320:10

avoid 1:length(...), 1:nrow(...), 1:ncol(...),
1:NROW(...) and 1:NCOL(...) expressions. They are error prone and
result 1:0 if the expression on the right hand side is zero. Use
seq_len() or seq_along() instead.

R/EndoMineR.R:184:15

not import packages as a whole, as this can cause name
clashes between the imported packages. Instead, import only the
specific functions you need.
fix this R CMD check NOTE: Namespaces in Imports field
not imported from: ‘data.table’ ‘directlabels’ ‘generator’
‘googleVis’ ‘gtools’ ‘knitr’ ‘randomNames’ ‘splitstackshape’ All
declared Imports should be used.
fix this R CMD check WARNING: LaTeX errors when creating
PDF version. This typically indicates Rd problems.
fix this R CMD check ERROR: Re-running with no
redirection of stdout/stderr. Hmm ... looks like a package You may
want to clean up by 'rm -rf /tmp/RtmpVZkr8H/Rd2pdf7ee73c7d752e'
avoid 'T' and 'F', as they are just variables which are
set to the logicals 'TRUE' and 'FALSE' by default, but are not
reserved words and hence can be overwritten by the user. Hence,
one should always use 'TRUE' and 'FALSE' for the logicals.

R/CleanUp.R:NA:NA
R/CleanUp.R:NA:NA
R/CleanUp.R:NA:NA
R/CleanUp.R:NA:NA
R/CleanUp.R:NA:NA
... and 27 more lines

────────────────────────────────────────────────────────────────────────────────
Warning messages:
1: In readLines(filename) :
incomplete final line found on '/root/foo/EndoMineR/R/BarrettsFunctions.R'
2: In readLines(file) :
incomplete final line found on '/root/foo/EndoMineR/R/BarrettsFunctions.R'

@sebastiz
Copy link
Author

sebastiz commented Oct 9, 2017

OK. I am away for the next couple of days but I will get this done as soon as I am back

@sebastiz
Copy link
Author

sebastiz commented Nov 10, 2017

I've made a series of changes. I have 98% test coverage but the 2% left I think is incorrect so I havent covered it. I've also left two sapply dependent functions as they are although goodpractice doesn't like it. All the examples are fully explained and also the vignette is more comprehensive.

@sebastiz
Copy link
Author

sebastiz commented Nov 23, 2017

Hi @karthik is there anything else that needs to be done to move this forward?

@sckott sckott added the package label Nov 27, 2017
@karthik
Copy link
Member

karthik commented Nov 27, 2017

Hi @sebastiz Apologies from me. Nothing to do on your side. The reviewers I contacted never replied and I lost track. I will strive to assign new folks asap.

@karthik
Copy link
Member

karthik commented Nov 27, 2017

@sebastiz New reviewers pinged, I'll update the thread in a few days. Thanks for your patience.

@karthik
Copy link
Member

karthik commented Nov 27, 2017

Reviewer 1 is @RMHogervorst (assigned on 11/27/2017)

@karthik
Copy link
Member

karthik commented Nov 27, 2017

Reviewer 2 is @jonclayden (assigned on 11/27/2017). Review due December 18th

@RMHogervorst
Copy link

RMHogervorst commented Nov 29, 2017

Hi @sebastiz , when I install the package from github the vignette is not included. You could re-document the project in your rstudio session if you toggle knit vignettes. and
-either use devtools::build()

  • or devtools::build_vignettes()

and push your changes to github

you would not believe how many times I 've had this issue too

@sebastiz
Copy link
Author

sebastiz commented Nov 29, 2017

@RMHogervorst I have now done this. Another good place for the package explanation is https://sebastiz.github.io/EndoMineR/index.html

@sebastiz
Copy link
Author

sebastiz commented Jan 10, 2018

@karthik. I am currently correcting the documentation and am getting through it as fast as I can. Thanks for checking in.

@sebastiz
Copy link
Author

sebastiz commented Feb 7, 2018

@karthik Just to let you know that I am still working on this and should have it finished soon - just renaming functions and reflecting changes in the documentation as per the reviewers' comments

@sebastiz
Copy link
Author

sebastiz commented Mar 7, 2018

@karthik @RMHogervorst @jonclayden here is my response to the reviewers following my amendments

REVIEWER ONE

Design, data and documentation

  1. The vignette would benefit substantially from embedding the results of each code block inline. As well as making the effects of each function much more obvious to the reader, this would reveal that some of the code does not in fact work as expected. (The first line of the second code block, Mypath<-data(PathDataFrameFinalColon), sets Mypath to a character string, not a data frame, so it is not valid input to the extractor function. This code is also in the README.) Listing the bodies of functions, as you do in some places, is not very helpful to readers who aren't programmers.

This has now been amended. Where possible all code documented has an input and expected output. Bodies of functions have been removed. The extractor function has been completely re-written and works as intended.

  1. While the process for generating copious fake-but-plausible data for testing is ingenious, the full dataset of 2000 records seems to me too large for demonstration purposes—and presumably this is part of the reason that the code in the vignette is not run and its results shown. It also leads to a large data directory, which R CMD check picks up on, and fairly long run-times for the tests and examples. Could a much smaller sample of representative records be used to illustrate usage of the package?

Unfortunately the dataset has to be this size because of the nature of selecting subsets to show the user how to process different aspects. I could provide several smaller datasets but the overall data directory size may not change significantly as a result. The data has been compressed wherever possible and datasets which were no longer in use have been removed so that the directory is now much smaller

  1. Some interdependencies between functions are described in the vignette, in the sense that one must be run before the other, but perhaps these could be laid out graphically for clarity.

Ths documentation now contains several more images detailing how the functions are used.

  1. Are the assumptions stated at the beginning of the vignette likely to apply internationally? If information is not parsed correctly by package functions, do NAs appear in the new column? Are there consequences of silently missing relevant information in practice?

The assumptions are necessary to use the package and all medical records are organised by patient unique ID and date a test was performed. NAs do appear in the new column with incorrect parsing so the user should be aware that further data preparation is necessary in order to obtain the required output. The package does not contain package specific error messages at the moment but this may be included in future iterations

  1. There is no top-level documentation available from R, viz. ?EndoMineR, as required by the rOpenSci packaging guide.

At the moment I am struggling to generate this. Clicking on the package in Rstudio allows the documentation to be seen but for some reason ?EndoMineR doesn't seem to access the man files. This is currently the subject of a stackexchange question

  1. There are several typos in the code documentation and vignette (devtools::spell_check() shows up some, alongside several false positives), and the vignette would benefit from some proofreading.

All typos have now been corrected

Packaging

  1. The list of package dependencies is extensive, and I would suggest moving some dependencies to the Suggests section for ease of installation and use. In particular, the randomNames and generator packages seem to be only required for generating the test dataset, which is not a user operation, so these could surely be suggested packages. Nothing from splitstackshape is imported into the package namespace, and I don't see any reference to it, so perhaps it isn't needed at all. knitr is presumably just needed for building the vignette. Other packages are only needed in a small number of functions: circlize, for example, seems to only be needed by the PatientFlow_CircosPlots function, so if this function is not needed for every analysis the circlize package could be moved to Suggests, and its functions referred to using namespace notation, e.g. circlize::chordDiagram.

The DESCRIPTION list has now been shortened with many of the dependencies moved to Suggests

  1. Twelve of the tests produce warnings on my system (current macOS, Homebrew-installed R) of the forms "is.na() applied to non-(list or vector) of type 'NULL'" and "provided 9 variables to replace 1 variables".

The tests should now run OK with none of the warnings listed.

3.There is no package coverage badge in the README, as required by the packaging guide. There are also no community guidelines, which are stipulated above.

These have now been added

  1. The top-level docs directory is nonstandard and produces a NOTE from R CMD check. It should probably be removed or added to .Rbuildignore.
    All non standard directories are now in the .Rbuildignore

Code

  1. Return types from package functions are somewhat inconsistent. For example, many functions take and return a data frame, but BarrettsQuality_AnalysisBiopsyNumber returns a list containing a tibble and a ggplot. While tibbles can mostly be treated like data frames, they are presented a little differently. Plotting is a logically a separate function, so you might want to consider splitting it into a separate function.

All tibbles are now returned as dataframes. I am not keen to split the functions that return a tibble (now a dataframe) and a plot as the dataframe is often the data that feeds the plot but in numerical form, in case the user wants to have access to it. Splitting into a separate function feels like creating a new function for the same unique task. I will separate if absolutely necessary but it doesn't feel logical.

  1. Usage of the key extractor function seems a little complicated, with quite a bit of boilerplate code in the README example. The example could be simplified by vectorising the function, rather than requiring it to be called repeatedly, and using a character-mode vector rather than a list for the boundary labels. Note also that the construction seq_len(length(HistolTree)-1) is preferable to 1:(length(HistolTree)-1), since it handles the case where the length of HistolTree is one (and immediately produces an error if it is zero).

I agree this was a complicated function. This has now been completely re-written and is significantly simpler

  1. The naming convention for functions is a little opaque to me, although I recognise it may be more obvious to someone from the field. Several function names are also quite long. You may want to adopt some consistent structure, like object-verb or verb-object.

The functions have been renamed and shortened to be more logical and consistent

  1. Quite a few functions have non-meaningful argument names (x, y, z). Use of x for the main data source is quite common, but it would be better for other arguments to be named more clearly (and consistently, across functions).

These have all been changed to take more meaningful argument names.

  1. Could the various cleaning functions be applied wholesale within an overarching function, rather than having to call lots of smaller functions in similar ways? Perhaps this could have a form like EndoscChopperAll(Myendo, Meds='Medications', Instrument='Instrument', Indications='Indications', ProcPerformed='ProcedurePerformed'), so that the relevant column for each subfunction can be specified?

Such cleaning functions have now been created (the HistolAll,EndoAll and BarrettsAll) as suggested.

  1. Given the package's very heavy usage of regular expressions and text substitution, there may be some performance benefit in exploring regex functions other than those in the base package. (stringr is used in places, but there are also a lot of calls to gsub, grepl, etc.)

gsubs and grepls have been replace where possible. Sometimes it is not possible and so they have been left in sight. The functions run well over large datasets so at present this doesn't seem to be a performance issue

  1. A lot of the tests just check that the results are not NA. Stronger checks would provide better tests where possible.

REVIEWER TWO

  1. I feel as a first time user, I need to check the documentation for every function.

I very much hope that reading through the illustrated vignette and associated website will help the user to understand the package. he functions have been renamed, as have their arguments, to make the functions more intuitive to use. Some of the functions have been re-written altogether to make them simpler.

  1. Some of the texts are difficult to understand: "All patients have a unique identifier. At the departmental level this is likely to be a Trust specific hospital number." These things
    are probably related but I don't think a patient gets a hospital number?

Patients always get a unique hospital number which is the basis of attributing all data to a unique individual. The unique idenitifier is the same as a hospital number.

  1. I understand that these assumptions are the prerequisites for extracting information from freetext but they make for a bit of dry reading. Perhaps a quickstart would benefit those who already know or don't care about the assumptions?

The vignettes have all been substantially reorganised to make the assumptions etc part of background reading which doesn't have to be read.

  1. Another great piece to put into a seperate vignette would be 'Application to specific disease entities- Barrett’s oesophagus'

This has been separated out

  1. Reading through all of the documentation in the vignette, I believe the author is an experienced poweruser of R in gastro medicine. I think the package solves a need for the author and might be very useful for other people as well. However I think the package but mostly the documentation, should be more simplified for beginners, and it is really difficult to get into the beginner mindset, but you sometimes need to take the users more by the hand to guide them through the process. Because I believe that the package is well thought out in principle, for instance: I really like that most functions have a predictable API, first argument is dataframe, second the column.

Thank you. I agree that the first iteration was maybe a little complex. I have spent considerable time trying to re-write the documentation to make it easier to digest. The vignettes have been separated out and pkgdown has been used to create further documentation and tutorials.

notes for endominer

  1. I tried to look at the examples as if I was researcher, somewhat used to R, but by no means an experienced programmer. Someone with expert knowledge about endoscopy ( which I have not. So I don't know if the guidelines are correct, but it seems very well thought out ). That means for-loops are quite difficult and I just want to see what the package can do for me as a scientist-practitioner. So I went through the vignette, because that is what I would do as a user,to see what the package can do for me. I feel as if the author has put in a lot of thought on what information needs to be extracted from labrecords. In a way it is really three packages in 1. If the package will be extensively used and developed, it might need to be splitted into separate packages in the future.

I have rewritten the functions that contain for-loops and any functions that required boilerplate code (especially the Extractor function). The re-writing of many of the functions has been specifically to simplify their use and I hope this is reflected in the code examples in the vignette as well as the website.

General remarks

  1. It is not clear who the package is for, is it for physicians, researchers, endoscopists?

It is for all the above. I have incorporated this into the documentation.

  1. I feel as if there are too many specific functions. Some of those function could be unified perhaps?
    Wherever possible, functions that perform similar tasks have been unified. However on the basis that it is good practice for one function to perform one specific task I have been careful not allow this.

3.On the one hand I like that the package is very focused, on the other hand I feel as if it is too specific on some points
perhaps rewrite the main vignette into several vignettes, one high level overview and one specific for patient flow, one specific for extracting histology reports, one for endoscopy reports?

I have taken this on board and I hope the reviewer agrees with the vignette separation into EndoMineR principle (explaining assumptions etc/ Package overview (explaining all the cleaning functions both pathology and Endoscopy), Analysis (containing all the analysis functions eg Patient flow, surveillance etc) and disease specific vignette (ie Barrett's). This seems more logical.

  1. I don't think we need a specific tool for the merging of endoscopic datasetand a pathological dataset when we have dplyr for that specific task? Don'treinvent the wheel and I hope that most R-users are already familiar with that package.

I agree with the concept of not re-inventing the wheel but this can be useful for beginning R users especially as the merge is not a straightforward merge but allows merging with some variability of the date between endoscopy and pathology (which is the case in real world scenarios where the tissue samples can be received on a separate date to the day of the endoscopy).

  1. I like the stating of assumptions, but it also makes it hard to read the text. The assumptions and parts about standards could go into a separate vignette.

Thank you. This is now a separate vignette

  1. When I run the goodpractice function on the latest edition on github (pull friday 2017-12-15) I still see some uses of sapply where other functions would work.

sapply has now been repace with vapply

  1. Some of the functions use unclear assignments, as a user I often lose track of what single letter assignments stand for, since most users would use autocomplete a more descriptive assignment will not take that much time extra.
    example:
    f<- PolypTidyUpLocator(f,'SampleLocation')
    polyplocations <- PolypTidyUpLocator(f,'SampleLocation') # you would know what we are referencing

I have now renamed the arguments so it is more intuitive to understand

  1. There are quite some references to http://www.gastrodatascience.com which seems to be down

This is now functional but most references have been removed anyway

  1. I like the pictoral overview of the package. At first I thought the
    combination was too small to see the details, but the seperate images are used further on to
    explain the functions.

The large overview was too big so has been removed. The smaller images have been embellished and added to. Hopfeully this makes things clearer

  1. I see you use length(HistolTree)-1) generally seq_along or seq_length is
    recommended if you want to use a for loop.

This has now been removed and re-written as part of the Extractor function

  1. the authors uses CamelCase, where snake_case is generally recommended. The
    scheme is consistently used so it does not bother me.

I haven't changed this as long as it is OK

  1. The package size is quite large. As Jon mentioned reducing the size of the
    sampleset would bring this back.

I recieved warnings (with goodpractice::gp()) that entire packages were
imported and also ‘generator’ ‘knitr’ ‘randomNames’ ‘splitstackshape’ are
imported but not used.

  1. Perhaps writing function names as actionverbs would make intent more clear,
    for instance the 'EndoscChopperEndoscopist' function is really a function that
    extracts the name of the endoscopist. Perhaps ExtractEndoscopist would make
    more sense. (then again this clashes with the prefix_action scheme I proposed
    earlier, so I really leave this to the author)

I have re-written the names of many of the functions so that they are more intuitive. Hopefully this makes more sense

  1. I would really like a list of acronyms f.i. 'GRS reqquirement for the ADR' is very hard for me to understand.

The acronyms have been explained in the text as it is introduced. I hope this is now clearer

I found some interesting things that could be optimized in the funtion 'SurveySankey': you use both dplyr's group_by AND datatable, which also has groupings. AND reshape2 which could be succeeded by tidyr?

Vignette walk through

Extractor

  1. I think the example makes the extraction look to difficult. What would help me as a new user,
    is a small piece of freetext, and what it needs to be turned into (cleaned tidy
    data). I feel that the chopperfunction and subsequent histolTree function look
    more difficult than they really are. The histolTree function is very flexible
    but also difficult to understand for a new user. If you add a piece of freetext that
    contains "Hospital Number","Patient Name","DOB:","General Practitioner:",
    "Date of procedure:","Clinical Details:"". we would be able to see what the extraction does.

Extraction of sentences can also be done with the tokenizers package^[Although
perhaps you do not want to add more dependencies?]

  1. I have re-written the Extractor function specifically because I agree that it was too complex to follow. This is a lot simpler now. I haven't used the tokenizers package to reduce the number of dependencies but thank you for highlighting it.

cleaning

  1. (this might be a taste or non-english thing) I would rewrite a sentence like :
    "Once the extraction has been done there are various cleaning functions provided based around the extraction of likely columns." I read this sentence as if we would extract columns, but I think
    you meant to say you would extract information from the freetext into new columns.

Thank you. I agree this is a terribly written sentence and I have changed it.

  1. In general I don't think displaying the function such as endochopperendoscopist makes things more clear. I think it is a very useful function to extract the names. Using an example text before and after the function would make its use much clearer.

Thank you. I removed all function bodies from the explanatory text and have replaced it with input and expected output to make things clearer

  1. The function description of HistolChopperNumbOfBx shows some potential problems:
    HistolChopperNumbOfBx <- function(x, y, z){ function details... }
    it is unclear what x,y,and z are meant to be. x is probably a dataframe,
    y a column? using words and not single characters would really improve the
    usability of these functions.
    You might want to rewrite the part from HistolChopperNumbOfBx until the next
    heading. Perhaps showing examples of what it extracts?

I have re-written this function so that the arguments are clearer. I have also provided examples of what it extracts with a given input

The Analysis Functions

  1. I think the time related functions are very useful, but perhaps the dplyr
    functions group_by and arrange would do the same. I am not completely opposed
    to replicating functionality but I do not see what these functions add in value?

The functions carry out several dplyr and ggplot functions to provide output in a standardised way that many users of this kind of package would probably find extremely useful

  1. how<-HowManyTests(Myendo,'Indications','Dateofprocedure','Surv') returns
    mutliple results: both a table and a plot. I would advise against that, because
    you have to know how this works to fully use it. Perhaps 2 functions, or a
    plot method and a print method? I think it is not clear for users what to do with the
    endresult.

Thanks. I have explained how to use this in the vignette. Regarding the need for two functions I am not keen to split the functions that return a tibble (now a dataframe) and a plot as the dataframe is often the data that feeds the plot but in numerical form, in case the user wants to have access to it. Splitting into a separate function feels like creating a new function for the same unique task. I will separate if absolutely necessary but it doesn't feel logical.

Patient flow functions

  1. for me the images did not load inside the vignette. (I was offline at that time).

The images all load of the computers I have tried so this may be a local issue

  1. I quite like the visualization of patient flow in sankey and circos (?) plots.
    You might want to put these packages in the suggest, rather than import location, because you only use them once.

I have done as suggested

Assessment of quality functions

  1. 'a variety of different ways' seems redundant (but I'm not a native speaker).

Thank you I have changed this

  1. tt<-ListLookup(Myendo,'Findings',myNotableWords) did not show a barchart in my session.

This should show a barchart now

Summary

I think the package has many many timesavers but would be hard to use for r-beginners. Rewriting the examples in the vignette to include a 'before' and 'after' would be very helpful. As well as a quickstart. I would recommend to focus on the documentation first and functionality second.
Great job, and thank you for this work!
And thanks for giving me this opportunity to review the package.

@RMHogervorst
Copy link

RMHogervorst commented Mar 7, 2018

I'm looking forwards to checking it out ! It sounds as if you put in a lot of effort.

Regarding package docs (you have to make a seperate piece of roxygen with
the @name EndoMineR ) ;
In http://r-pkgs.had.co.nz/man.html#man-packages there is this example:

#' foo: A package for computating the notorious bar statistic.
#'
#' The foo package provides three categories of important functions:
#' foo, bar and baz.
#' 
#' @section Foo functions:
#' The foo functions ...
#'
#' @docType package
#' @name foo
NULL

@jonclayden
Copy link

jonclayden commented Mar 15, 2018

Thanks, @sebastiz. You've clearly done a lot of work on the package, and particularly the documentation, since I last looked at it. I just have a few small follow-up points:

  • Are there contribution guidelines somewhere? I didn't find them.
  • You can get ?EndoMineR working by making this name an alias of a new or existing roxygen2 block. The form given by @RMHogervorst above works fine, but without an alias it will only be accessible through the slightly more verbose package?EndoMineR.
  • Rebuilding the vignettes produces some warnings, which you may want to look at because they could lead to unexpected results in certain cases:
Building EndoMineR vignettes
Warning in if (grepl("[Ff]entanyl", dataframe$Medications)) { :
  the condition has length > 1 and only the first element will be used
Warning in if (grepl("[Mm]idazolam", dataframe$Medications)) { :
  the condition has length > 1 and only the first element will be used
Warning in if (grepl("[Pp]ethidine", dataframe$Medications)) { :
  the condition has length > 1 and only the first element will be used
Warning in if (grepl("[Pp]ropofol", dataframe$Medications)) { :
  the condition has length > 1 and only the first element will be used
`geom_smooth()` using method = 'loess'
Warning in if (grepl("[Ff]entanyl", dataframe$Medications)) { :
  the condition has length > 1 and only the first element will be used
Warning in if (grepl("[Mm]idazolam", dataframe$Medications)) { :
  the condition has length > 1 and only the first element will be used
Warning in if (grepl("[Pp]ethidine", dataframe$Medications)) { :
  the condition has length > 1 and only the first element will be used
Warning in if (grepl("[Pp]ropofol", dataframe$Medications)) { :
  the condition has length > 1 and only the first element will be used
  • I see some test failures with devtools::test() and the default reporter in the latest testthat, but I suspect it's a testthat bug rather than an issue on your side.

@sebastiz
Copy link
Author

sebastiz commented Mar 16, 2018

@jonclayden thanks. These have now been all corrected. The tests seem to be fine when I run devtools::check(). Hopefully the top level documentation is what is needed and should be accessible with ?EndoMiner. The CONTRIBUTING.md file has been added but the URLs may not be quite correct yet until (and if) the package is taken up by ropensci

@jonclayden
Copy link

jonclayden commented Mar 16, 2018

Great! This all looks good, so I think that's everything from me.

@sebastiz
Copy link
Author

sebastiz commented Mar 16, 2018

Hi @RMHogervorst just to let you know that the top level package documentation has been corrected as has Jon Clayden's latest corrections. Thanks

@RMHogervorst
Copy link

RMHogervorst commented Mar 21, 2018

The vignettes look beautiful!
Feels very complete. 3 separate vignettes are available, really focused on one task, very nice!
@sebastiz has more than adequately responded to all questions and all my points were answered. This is a go from me.

Beautiful work Sebastian !

Small note before you submit it to CRAN:
i downloaded the repo on 2018-03-21 (sebastiz/EndoMineR@4708e0d) and the vignettes were created, but for some reason are not included in the package (there is a link to the source and r but not the html version. But they are in the package.
So I don't know how that works. (problably a simple toggle somewhere?).

@sebastiz
Copy link
Author

sebastiz commented Mar 22, 2018

Thank you @RMHogervost really appreciate that

@sebastiz
Copy link
Author

sebastiz commented Mar 22, 2018

Hi @karthik can you tell me what the next steps are?

@sebastiz
Copy link
Author

sebastiz commented Apr 3, 2018

Hi @karthik I am keen to move forward with this is it has been accepted by both reviewers as there are some conferences coming up id like to describe it as. It would be great to say it had been accepted here when I do. Thanks

@sebastiz
Copy link
Author

sebastiz commented Apr 4, 2018

Hi @sckott and @karthik
I updated the package about a month ago and the reviewers have given it the thumbs up. What is the best way to move things forward from here?

@karthik
Copy link
Member

karthik commented Apr 5, 2018

Hi @sebastiz sorry this fell through the cracks. I will update with next steps in an hour or so. My apologies

@karthik
Copy link
Member

karthik commented Apr 5, 2018

Approved! 🎉 Thank you for submitting and @RMHogervorst and @jonclayden for thorough and timely reviews. To-dos:

  • Transfer the repo to the rOpenSci organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You'll be made admin once you do.

  • Add the rOpenSci footer to the bottom of your README

[![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)

  • Fix any links in badges for CI and coverage to point to the ropensci URL. (We'll turn on the services on our end as needed)

Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or long-form post with more narrative about its development. ((https://ropensci.org/blog/). If you are, @stefaniebutland will be in touch about content and timing.

@sebastiz
Copy link
Author

sebastiz commented Apr 6, 2018

OK great @karthik . I'll make the relevant changes. Happy to do tech-notes for this. I'd like to submit to JOSS. Can I still do this?

@stefaniebutland
Copy link
Member

stefaniebutland commented Apr 6, 2018

@sebastiz Great to hear that you'd like to publish a technote on this. We can publish at any time (in contrast to blog posts that are scheduled). Please submit a draft by pull request at your convenience and we can review before publishing.

Here are examples: https://ropensci.org/technotes/

Instructions on pull request and preview: https://github.com/ropensci/roweb2#contributing-a-blog-post. The only difference in YAML for technote is
categories:
- technotes

I will add the topicid before publishing.

Don't hesitate to ask any questions here.

@karthik
Copy link
Member

karthik commented Apr 6, 2018

I'd like to submit to JOSS. Can I still do this?

Yes of course! When you submit to JOSS, mention in the created thread that it was reviewed here and it will get fast tracked. Feel free to @ tag me there.

@sebastiz
Copy link
Author

sebastiz commented Apr 10, 2018

Hi @karthik I don't think I have any collaborator invitations on my github yet
https://github.com/sebastiz/EndoMineR/settings/collaboration. Also when I try to transfer to rOpenSci I get 'You don’t have the permission to create repositories on ropensci'.....

@karthik
Copy link
Member

karthik commented Apr 10, 2018

Please check your Github notifications for a pending invite.

image

If you don't find it I can cancel and resend it.

@sebastiz
Copy link
Author

sebastiz commented Apr 11, 2018

Hi @karthik Yes please resend. ?case sensitive?

@karthik
Copy link
Member

karthik commented Apr 11, 2018

Canceled and resent. Please try now.

@sebastiz
Copy link
Author

sebastiz commented Apr 13, 2018

OK thanks @karthik all done.

@karthik
Copy link
Member

karthik commented Apr 16, 2018

Thanks @sebastiz! All set here. 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants