JSTORr package #86
Comments
Editor checks:
Editor comments
Currently seeking reviewers. It's a good fit and not overlapping.
There's a bunch of non-ascii text in some data stored in the pkg, check out |
I just took an initial look at this and am seeing lots of
Do you want to clean these up first before I review? |
@benmarwick you could also just namespace these so you don't have to list them in imports, like |
Thanks, I'll clean these up and get back to you with an update here |
both reviewers assigned. @benmarwick ping back here when you get that stuff cleaned up |
Maybe this is in a holding stage but in the spirit of better late than never, I've refreshed my clone of this repo and refreshed the comments I started last month. Here they are.
Package Review
Documentation
The package includes all the following forms of documentation:
Functionality
Estimated hours spent reviewing: 2.5
Review Comments
Basic motivation and value added of the package
Below, I've included some practical issues that could be addressed to improve the package and which have fairly straightforward solutions (such as: use knitr for the README; write a vignette; etc.). Before turning to those, however, I will give some tough but I hope fair comments about what I see as the basic need for such a package.
In essence, this package is a wrapper around a set of text-based utilities to analyze texts with a specific data structure, namely the article texts and some document-level variables from JSTOR. Its stated purpose is to perform word analysis, clustering, and topic modelling on the texts, with the option of using the document-level variables to show topic prevalence across time. However, these really have nothing to do with JSTOR, but rather to do with the list structure created by the
When I first cloned the package, I thought it would have something to do with querying data from a JSTOR API, or special processing for JSTOR data, but in fact it has nothing to do with that. Rather, it reads in a .csv or .tsv file with only the expectation that specific fields exist in the data. The interaction with JSTOR is entirely up to the user, and the package has to assume that the data provided by JSTOR keeps the structure it expects.
Now here I am going to be that annoying reviewer who suggests citing his or her own work, in a slightly different way, but there are packages to do this, including the one I've worked on, and which most users tell me is more straightforward than tm: the package quanteda, which can, along with a companion package readtext, read in files from .zip, .csv, and .tsv data and make that into a corpus, including automatic processing of the document variables, and then create document-feature matrices that (at least in the current GitHub version) retain the document variables.
This makes it easy to keep them for things like the stm package which can use additional variables for fitting correlated LDA models, or machine learning, etc. quanteda can do almost all that you are doing in this package including:
It does not directly filter nouns but we are hard at work on a package spacyr that will tag a quanteda corpus and add new token selection tools to select on parts of speech. These are found at http://github.com/kbenoit/quanteda, http://github.com/kbenoit/readtext, and http://github.com/kbenoit/spacyr.
Ok, so now
Another way to make this case, of course, is to convince me what is special about the wrapper set that extracts more value from the JSTOR data specifically. This goes beyond an ease of use motivation, because it argues that putting together components in a specially selected way leads to specific insights about journal articles.
Don't get me wrong, I have total respect for anyone taking their precious time and effort to write packages and to tidy them up sufficiently to submit them to rOpenSci, and I don't want to discourage that. But part of that process involves applying the rOpenSci standards, and I think this package needs much stronger motivation to meet those.
Note also that I am being slightly hypocritical in that two of the three packages I list above are not yet on CRAN (one because of installation issues on Windows involving the Python/Cython links), and quanteda I have not yet submitted to rOpenSci. But we have been working toward that with the standards firmly in mind.
If you wanted to use more of quanteda for the scaffolding of this package I would be happy to advise or assist.
Now on to more mundane issues.
Installation problems
Overall, I could not test the package because I could not attach it. I followed the installation instructions in the
> library(JSTORr)
Error : .onLoad failed in loadNamespace() for 'rJava', details:
call: dyn.load(file, DLLpath = DLLpath, ...)
error: unable to load shared object '/Library/Frameworks/R.framework/Versions/3.3/Resources/library/rJava/libs/rJava.so':
dlopen(/Library/Frameworks/R.framework/Versions/3.3/Resources/library/rJava/libs/rJava.so, 6): Library not loaded: @rpath/libjvm.dylib
Referenced from: /Library/Frameworks/R.framework/Versions/3.3/Resources/library/rJava/libs/rJava.so
Reason: image not found
In addition: Warning messages:
1: replacing previous import ‘NLP::annotate’ by ‘ggplot2::annotate’ when loading ‘JSTORr’
2: replacing previous import ‘apcluster::similarity’ by ‘igraph::similarity’ when loading ‘JSTORr’
Error: package or namespace load failed for ‘JSTORr’
I'm running:
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.3
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] igraph_1.0.1 Rcpp_0.12.9 magrittr_1.5 knitr_1.15.1 cluster_2.0.5 MASS_7.3-45 leaps_3.0
[8] munsell_0.4.3 scatterplot3d_0.3-38 colorspace_1.3-2 lattice_0.20-34 ggdendro_0.1-20 plyr_1.8.4 tools_3.3.2
[15] grid_3.3.2 data.table_1.10.4 gtable_0.2.0 lazyeval_0.2.0 digest_0.6.12 assertthat_0.1 tibble_1.2
[22] Matrix_1.2-8 NLP_0.1-10 lda_1.4.2 gridExtra_2.2.1 FactoMineR_1.35 apcluster_1.4.3 ggplot2_2.2.1
[29] scales_0.4.1 flashClust_1.01-2 XML_3.98-1.5
I admit to having the same problems when trying to get other Java-based R packages to work. (This is why I have given up on Java-R integration and shifted to the Python-based http://spaCy.io.)
Namespace conflicts, imports
When building or running check, I got numerous warnings due to the very large number of packages whose namespace this package imports in their entirety. These can be eliminated by importing just the functions you need from the packages, rather than their entire namespaces:
It's overkill to import the whole package, e.g. in
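As a rough illustration of the selective-import idea in this section, here is a minimal sketch. It is not taken from JSTORr: the helper `summarise_counts` is hypothetical, and the base `stats` package stands in for any dependency. It shows the two usual options, a roxygen2 `@importFrom` tag and a fully qualified `pkg::fun()` call.

```r
# Hypothetical helper, not part of JSTORr; `stats` stands in for any dependency.

#' Summarise a numeric vector
#'
#' The tag below makes roxygen2 write `importFrom(stats, median)` to NAMESPACE
#' rather than `import(stats)`, so only `median` is imported.
#' @importFrom stats median
summarise_counts <- function(x) {
  c(
    median = median(x),    # available through the selective import
    sd     = stats::sd(x)  # fully qualified call: no import directive needed
  )
}

result <- summarise_counts(c(1, 2, 3, 4))
print(result)
```

Applied to the warnings in the log above, the same pattern would presumably mean calling, for example, `ggplot2::annotate()` or `igraph::similarity()` explicitly (or importing only the functions actually used), so that two fully imported namespaces no longer mask each other.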
No demonstration of how it works, or vignette
Use
Other issues
|
thanks for your review @kbenoit ! @benmarwick any updates on your last comment #86 (comment) ? |
@benmarwick any updates on your last comment #86 (comment) ? |
Yes, thanks for the reminder, and @kbenoit for the comprehensive and thoughtful review. I'll post a more substantial response within this week. |
@benmarwick should we continue to hold on this? |
ok to close this for now. |
@benmarwick what do you mean? like you need a while before you can get back to this? if so we can keep the OR, are you saying you don't want to go through with the submission anymore? |
Holding sounds good, thanks! |
okay, then I'll reopen |
- Post acceptance changes, Change README, DESCRIPTION with ropensci Github links - Remove COC - Bump version
Summary
The aim of this package is to provide some simple functions in R to explore changes in word frequencies over time in a specific journal archive. It is designed to solve the problem of finding patterns and trends in the unstructured text content of a large number of scholarly journal articles from the JSTOR archive.
URL for the package (the development repository, not a stylized html page):
https://github.com/benmarwick/JSTORr
Who is the target audience?
Researchers whose primary literature is published in journals archived by JSTOR
Are there other R packages that accomplish the same thing? If so, what is different about yours?
Not that I'm aware of, this pkg is unique
Requirements
Confirm each of the following by checking the box. This package:
Currently these examples are in the readme
Publication options
paper.md with a high-level description in the package root or in inst/.
Detail
Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:
It passes, but with many warnings; see the Travis logs https://travis-ci.org/benmarwick/JSTORr/builds/184114026. I could do with some help on how to fix those.
Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:
@brooksambrose has used the pkg and submitted a very nice PR recently. @juliasilge has recently transformed text mining in R with her pkg with @dgrtwo. @noamross saw this pkg submitted to JOSS and recommended I submit it here first.