Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSTORr package #86

Closed
12 of 13 tasks
benmarwick opened this issue Dec 20, 2016 · 14 comments
Closed
12 of 13 tasks

JSTORr package #86

benmarwick opened this issue Dec 20, 2016 · 14 comments

Comments

@benmarwick
Copy link
Contributor

Summary

  • What does this package do? (explain in 50 words or less):

The aim of this package is provide some simple functions in R to explore changes in word frequencies over time in a specific journal archive. It is designed to solve the problem of finding patterns and trends in the unstructured text content of a large number of scholarly journals articles from the JSTOR archive.

  • Paste the full DESCRIPTION file inside a code block below:
Package: JSTORr
Type: Package
Title: Simple text mining and document clustering of JSTOR journal articles
Version: 1.0.20161214
Date: 2016-01-03
Author: Ben Marwick
Maintainer: Ben Marwick <bmarwick@gmail.com>
Description: Simple exploratory text mining and document clustering of journal
    articles from JSTOR's Data for Research service. Go to
    http://dfr.jstor.org/, make a request for data (specify CSV as outout
    format and Word Counts as data type), then once you get a zip file, unzip
    it and start with one of the unpack functions and then you're ready to go
    with any of the other functions. For more details on installation and
    usage, see https://github.com/benmarwick/JSTORr
License: MIT + file LICENSE
Imports:
    ggplot2,
    reshape2,
    plyr,
    stringr,
    tm,
    openNLP,
    NLP,
    lda,
    XML,
    cluster,
    apcluster,
    ggdendro,
    FactoMineR,
    gridExtra,
    grid,
    data.table,
    snowfall,
    slam,
    digest,
    igraph,
    snow,
    devtools
Suggests: testthat
LazyData: true
  • URL for the package (the development repository, not a stylized html page):
    https://github.com/benmarwick/JSTORr

  • Who is the target audience?
    Researchers whose primary literature is published in journals archived by JSTOR

  • Are there other R packages that accomplish the same thing? If so, what is different about yours?
    Not that I'm aware of, this pkg is unique

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
    Currently these examples are in the readme
  • has a test suite.
  • has continuous integration with Travis CI and/or another service.

Publication options

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:
    It passes, but with many warnings, see the Travis logs https://travis-ci.org/benmarwick/JSTORr/builds/184114026, I could do with some help on how to fix those

  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

@brooksambrose has used the pkg and submitted a very nice PR recently. @juliasilge has recently transformed text mining in R with her pkg with @dgrtwo. @noamross saw this pkg submitted to JOSS and recommended I submit it here first.

@sckott
Copy link
Contributor

sckott commented Dec 22, 2016

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Currently seeking reviewers. It's a good fit and not overlapping.

  • Tests do succeed, but take a long time, would be nice to get these to run faster somehow
  • Below is goodpractice::gp() output.
It is good practice to

  ✖ write unit tests for all functions, and all package code
    in general. At least some lines are not covered in this package.

    R/JSTOR_1bigram.R:16:NA
    R/JSTOR_1bigram.R:17:NA
    R/JSTOR_1bigram.R:22:NA
    R/JSTOR_1bigram.R:23:NA
    R/JSTOR_1bigram.R:25:NA
    ... and 1154 more lines

  ✖ omit "Date" in DESCRIPTION. It is not required and it
    gets invalid quite often. A build date will be added to the package
    when you perform `R CMD build` on it.
  ✖ add a "URL" field to DESCRIPTION. It helps users find
    information about your package online. If your package does not
    have a homepage, add an URL to GitHub, or the CRAN package package
    page.
  ✖ add a "BugReports" field to DESCRIPTION, and point it to
    a bug tracker. Many online code hosting services provide bug
    trackers for free, https://github.com, https://gitlab.com, etc.
  ✖ use '<-' for assignment instead of '='. '<-' is the
    standard, and R users and developers are used it and it is easier
    to read your code for them if you use '<-'.

    R/JSTOR_clusterbywords.R:200:12
    R/JSTOR_unpack_multiple_archives.R:77:5
    R/JSTOR_unpack_multiple_archives.R:78:5
    R/JSTOR_unpack_multiple_archives.R:79:5
    R/JSTOR_unpack1grams.R:76:7
    ... and 5 more lines

  ✖ avoid long code lines, it is bad for readability. Also,
    many people prefer editor windows that are about 80 characters
    wide. Try make your lines shorter than 80 characters

    R/JSTOR_1bigram.R:21:1
    R/JSTOR_1bigram.R:31:1
    R/JSTOR_1bigram.R:34:1
    R/JSTOR_1bigram.R:41:1
    R/JSTOR_1bigram.R:50:1
    ... and 271 more lines

  ✖ avoid calling setwd(), it changes the global environment.
    If you need it, consider using on.exit() to restore the working
    directory.

    R/JSTOR_MALLET.R:21:3
    R/JSTOR_MALLET.R:27:20
    R/JSTOR_MALLET.R:49:1
    R/JSTOR_unpack1grams.R:30:3
    R/JSTOR_unpack1grams.R:118:3
    ... and 2 more lines

  ✖ avoid sapply(), it is not type safe. It might return a
    vector, or a list, depending on the input data. Consider using
    vapply() instead.

    R/JSTOR_MALLET_hotncoldtopics.R:107:65
    R/JSTOR_MALLET_topicsovertime.R:161:5
    R/JSTOR_MALLET_topicsovertime.R:191:19
    R/JSTOR_unpack_multiple_archives.R:67:9
    R/JSTOR_unpack1grams.R:63:9

  ✖ avoid the library() and require() functions, they change
    the global search path. If you need to use other packages, import
    them. If you need to load them explicitly, then consider
    loadNamespace() instead, or as a last resort, declare them as
    'Depends' dependencies.
  ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...),
    1:NROW(...) and 1:NCOL(...) expressions. They are error prone and
    result 1:0 if the expression on the right hand side is zero. Use
    seq_len() or seq_along() instead.

    R/JSTOR_clusterbywords.R:44:32
    R/JSTOR_clusterbywords.R:100:14
    R/JSTOR_dtmofnouns.R:43:44
    R/JSTOR_findassocs.R:160:39
    R/JSTOR_findassocs.R:166:39
    ... and 21 more lines

  -- more lines but not including for brevity sake

There's a bunch of non-ascii text in some data stored in the pkg, check out tools::showNonASCIIfile()


Reviewers: @leeper @kbenoit
Due date: 2017-02-03

@leeper
Copy link
Member

leeper commented Dec 23, 2016

I just took an initial look at this and am seeing lots of R CMD check warnings and notes (which are also shown on Travis). Most of them relate to dependencies because of how some of the packages are being used (namely using library() and unqualified package function calls). Most of these appear to be due to:

  1. unnecessary library() calls given that dependencies are already being imported (either switch these to Suggests and requireNamespace() or leave as imports without library()),
  2. missing imports of base packages (e.g., utils, stats), and
  3. ggplot2 variables (see this post for how to handle that).

Do you want to clean these up first before I review?

@sckott
Copy link
Contributor

sckott commented Dec 23, 2016

missing imports of base packages (e.g., utils, stats)

@benmarwick you could also just namespace these so you don't have to list them in imports, like stats::setNames

@benmarwick
Copy link
Contributor Author

Thanks, I'll clean these up and get back to you with an update here

@sckott
Copy link
Contributor

sckott commented Jan 6, 2017

both reviewers assigned.

@benmarwick ping back here when you get that stuff cleaned up

@kbenoit
Copy link

kbenoit commented Mar 14, 2017

Maybe this is in a holding stage but in the spirit of better late than never, I've refreshed my clone of this repo and refreshed the comments I started last month. Here they are.

Package Review

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (such as being a major contributor to the software).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and URL, Maintainer and BugReports fields in DESCRIPTION

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software been confirmed.
  • Performance: Any performance claims of the software been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Estimated hours spent reviewing: 2.5


Review Comments

Basic motivation and value added of the package

Below, I've included some practical issues that could be addressed to improve the package and which have fairly straightforward solutions (such as: use knitr for the README; write a vignette; etc.). Before turning to those, however, I will give some tough but I hope fair comments about what I the basic need for such a package.

In essence, this package is a wrapper around a set of text-based utilities to analyze texts with a specific data structure, namely the article texts and some document-level variables from JSTOR. Its stated purpose is to perform word analysis, clustering, and topic modelling on the texts, with the option of using the document-level variables to show topic prevalence across time. However these really have nothing to do with JSTOR, but rather to do with the list structure crated by the JSTOR_unpack* functions that transform the JSTOR .csv files into a list of a document-term matrixes and the document-level variables. The parts specific to JSTOR or article data, in other words, are only in having the time variable, and this could be part of any textual data structure.

When I first cloned the package, I thought it would have something to do with querying data from a JSTOR API, or special processing for JSTOR data, but in fact it has nothing to do with that. Rather it reads in a .csv or .tsv file with only the expectation that specific fields exist in the data. The interaction with JSTOR is entirely up to the user, and the package has to assume that the data provided by JSTOR keeps the same structure as it is expecting.

Now here I am going to be that annoying reviewer who suggests citing his or her own work, in a slightly different way, but there are packages to do this, including the one I've worked on, and which most users tell me is more straightforward than tm: the package quanteda, which can along with a companion package readtext, read in files from .zip, .csv, .tsv data and make that into a corpus, including automatic processing of the document variables, and then create document-feature matrixes that (at least in the current GitHub version) retain the document variables. This makes it easy to keep them for things like the stm package which can use additional variables for fitting correlated LDA models, or machine learning, etc. quanteda can do almost all that you are doing in this package including:

  • tokenizing and forming document-feature matrixes (and does it faster than tm)
  • removing stopwords, weighting, and frequency threshold-based feature selection
  • forming ngrams
  • similarity, distance, and correlation measures for documents (and keeps the matrixes sparse when doing so)
  • finding frequent words
  • providing translation tools to feed the dfm to other packages such as lda, topicmodels, stm, mallet, etc.
  • plotting wordclouds and other lexical visualisations

It does not directly filter nouns but we are hard at work on a package spacyr that will tag a quanteda corpus and add new token selection tools to select on parts of speech.

These are found at http://github.com/kbenoit/quanteda, http://github.com/kbenoit/readtext, and http://github.com/kbenoit/spacyr.

Ok, so now </annoyingselfpromotion>, but please convince me that this set of wrappers, which lacks the flexibility of a more basic approach, is better for most users, especially when some of the functions reimplement completely generic tools while purportedly doing so in a way that has something to do with JSTOR, such as JSTORE_findassocs. To my mind, this would be like creating a package called worldbank_lm which fitted linear models to the World Bank development indicators dataset, after first getting the instructions to download this dataset and load it through a wrapper to read.table called worldbank_readtable. Maybe there is a case to be made that this will help R-challenged biliometricians, but if so then this package should make that case. Otherwise, while not quite reinventing the wheel, it's creating an additional wheel inside which is wrapped the original wheel which might be better and more effective without the wrapper.

Another way to make this case, of course, is to convince me what is special about the wrapper set that extracts more value from the JSTOR data specifically. This goes beyond an ease of use motivation, because it argues that putting together components in a specially selected way leads to specific insights about journal articles.

Don't get me wrong, I have total respect for anyone taking their precious time and effort to write packages and to tidy them up sufficiently to submit them to rOpenSci, and I don't want to discourage that. But part of that process involves applying the rOpenSci standards, and I think this package needs much stronger motivation to meet those.

Note also that I am being slightly hypocritical in that two of the three packages I list above are not yet on CRAN (one because of installation issues on Windows involving the Python/Cython links), and quanteda I have not yet submitted to rOpenSci. But we have been working toward that with the standards firmly in mind. If you wanted to use more of quanteda for the scaffolding of this package I would be happy to advise or assist.

Now on to more mundane issues.

Installation problems

Overall, I could not test the package because I could not attach it. I followed the installation instructions in the README.md, and was able to build the package, but then got this:
After I build the package, it fails to load:

> library(JSTORr)
Error : .onLoad failed in loadNamespace() for 'rJava', details:
  call: dyn.load(file, DLLpath = DLLpath, ...)
  error: unable to load shared object '/Library/Frameworks/R.framework/Versions/3.3/Resources/library/rJava/libs/rJava.so':
  dlopen(/Library/Frameworks/R.framework/Versions/3.3/Resources/library/rJava/libs/rJava.so, 6): Library not loaded: @rpath/libjvm.dylib
  Referenced from: /Library/Frameworks/R.framework/Versions/3.3/Resources/library/rJava/libs/rJava.so
  Reason: image not found
In addition: Warning messages:
1: replacing previous importNLP::annotatebyggplot2::annotatewhen loadingJSTORr2: replacing previous importapcluster::similaritybyigraph::similaritywhen loadingJSTORrError: package or namespace load failed forJSTORr

I'm running:

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.3

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] igraph_1.0.1         Rcpp_0.12.9          magrittr_1.5         knitr_1.15.1         cluster_2.0.5        MASS_7.3-45          leaps_3.0           
 [8] munsell_0.4.3        scatterplot3d_0.3-38 colorspace_1.3-2     lattice_0.20-34      ggdendro_0.1-20      plyr_1.8.4           tools_3.3.2         
[15] grid_3.3.2           data.table_1.10.4    gtable_0.2.0         lazyeval_0.2.0       digest_0.6.12        assertthat_0.1       tibble_1.2          
[22] Matrix_1.2-8         NLP_0.1-10           lda_1.4.2            gridExtra_2.2.1      FactoMineR_1.35      apcluster_1.4.3      ggplot2_2.2.1       
[29] scales_0.4.1         flashClust_1.01-2    XML_3.98-1.5   

I admit to having the same problems when trying to get other Java-based R packages to work. (This is why I have given up on Java-R integration and shifted to the Python-based http://spaCy.io).

Namespace conflicts, imports

When building or running check, I got numerous warnings due to the very large number of packages whose namespace this package imports in their entirety. These can be eliminated by importing just the functions you need from the packages, rather than their entire namespaces:

* checking whether package ‘JSTORr’ can be installed ... WARNING
Found the following significant warnings:
  Warning: replacing previous import ‘NLP::annotate’ by ‘ggplot2::annotate’ when loading ‘JSTORr’
  Warning: replacing previous import ‘apcluster::similarity’ by ‘igraph::similarity’ when loading ‘JSTORr’
  Warning: replacing previous import ‘data.table::melt’ by ‘reshape2::melt’ when loading ‘JSTORr’
  Warning: replacing previous import ‘data.table::dcast’ by ‘reshape2::dcast’ when loading ‘JSTORr’
  Warning: replacing previous import ‘igraph::%>%’ by ‘stringr::%>%’ when loading ‘JSTORr’

It's overkill to import the whole package, e.g. in JSTOR_dtmofnouns.R:

#' @import tm NLP data.table openNLP

No demonstration of how it works, or vignette

Use README.Rmd instead of README.md, and linking this to commit hooks to ensure it is always knitted. The README.md contains executable code demonstrating the package, but contains no output. It would be far better to demonstrate what the package does through seeing the knitted output. Alternatively or even better, in addition, this could form part of a vignette with more extended examples.

Other issues

  • DESCRIPTION

    * checking DESCRIPTION meta-information ... NOTE
    Malformed Description field: should contain one or more complete sentences.
    

    This is easy to solve, and I think comes simply from not including a period at the end of the second sentence of DESCRIPTION.

  • Versioning

    The versioning is very non-standard and appears to incorporate the date. I suggest changing this to a more standard semantic versioning scheme.

  • Code formatting

    The code could be indented, so easy to do in RStudio with command-i. This would make it easier to read and more pleasing aesthetically. No matter what they tell you, use four spaces for indents. ;-)

  • function nomenclature

    I'm not crazy about the command nomenclature of JSTOR_* for every command, since yes while this avoids namespace clashes for those who don't use the package::function() syntax, it seems a bit tedious. At the least, I suggest jstor_ and at best, not using this scheme. It's just a preference however.

@sckott
Copy link
Contributor

sckott commented Mar 14, 2017

thanks for your review @kbenoit !

@benmarwick any updates on your last comment #86 (comment) ?

@sckott
Copy link
Contributor

sckott commented Apr 3, 2017

@benmarwick any updates on your last comment #86 (comment) ?

@benmarwick
Copy link
Contributor Author

Yes, thanks for the reminder, and @kbenoit for the comprehensive and thoughtful review. I'll post a more substantial response within this week.

@sckott
Copy link
Contributor

sckott commented Aug 11, 2017

@benmarwick should we continue to hold on this?

@benmarwick
Copy link
Contributor Author

ok to close this for now.

@sckott
Copy link
Contributor

sckott commented Aug 11, 2017

@benmarwick what do you mean? like you need a while before you can get back to this? if so we can keep the holding tag on it, and just revisit every once in a while

OR, are you saying you don't want to go through with the submission anymore?

@benmarwick
Copy link
Contributor Author

Holding sounds good, thanks!

@sckott
Copy link
Contributor

sckott commented Aug 11, 2017

okay, then I'll reopen

@sckott sckott reopened this Aug 11, 2017
@karthik karthik mentioned this issue Feb 2, 2018
19 tasks
chainsawriot added a commit to ropensci/readODS that referenced this issue Jun 24, 2022
- Post acceptance changes, Change README, DESCRIPTION with ropensci
Github links
- Remove COC
- Bump version
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants