-
-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Submission: jstor #189
Comments
👋 @tklebel Thank you for this submission. During your pre-submission review I completely missed that we had a similar submission a year ago #86 by @benmarwick Since that submission was stalled, we are good to proceed with yours. I checked in with Ben on this, and his suggestion was for you (and reviewers) to look through his work to see if there are ideas that might help your effort. So please take a look at https://github.com/benmarwick/JSTORr Editor checks:
Editor commentsYou are already aware of the
Not a problem. There is no requirement that both be timed together. JOSS submission can be made independently at anytime. Reviewer 1: @jsonbecker Reviewer 2: @rileymsmith19 |
Thanks @karthik! I cannot reproduce your errors and warnings from R CMD check, neither on travis, nor on my machine, nor Appveyor or the win-builder. Could you help me in debugging, or should we ignore them? You put my submission into Before my pre-submission, I look into @benmarwick s package, and read the review by @kbenoit thoroughly. I am reluctant to add similar features like which are currently implemented in |
Thanks for catching my mistake with the wrong tag. I have fixed it. Re. the issues from R CMD CHECK, I'll look again when I have a bit of time, but let's proceed since the reviewers will also be able to see if it's a replicable issue. For long lines, you can add # nolint at the end of those lines https://github.com/jimhester/lintr#project-configuration this way lintr will not see them. And lastly, I appreciate that you have looked into Ben's work and the relevant feedback on the package. I agree that it would be very helpful to include some use cases in your package. |
I added In the meantime I wrote an initial version for a case-study which is live here: https://tklebel.github.io/jstor/articles/analysing-n-grams.html In order to host this case-study which cannot be included as a vignette, I went ahead and added pkgdown. Finally, I somehow mistook the JOSS for the JSS. Since JOSS only requires a very short introductory paper, I included one in |
Both reviewers are now assigned. Thanks @rileymsmith19 and @jsonbecker 🙏! |
Since I made a few improvements over the last few days, I created a new version (0.2.6) and made a snapshot at Zenodo. I updated my initial post accordingly and will try to avoid any further changes until the reviews are in. |
I think the correct Github username for @rileymsmith19 is @EccRiley 😺 |
Package ReviewPlease check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
DocumentationThe package includes all the following forms of documentation:
When reading the description of
Functionality
Final approval (post-review)
Estimated hours spent reviewing: 2 -2.5hrs Review Comments
|
Thanks @jsonbecker for the thorough review! I'm eager to respond to the more substantial questions ( |
Absolutely. I plan to add a few more overall thoughts but with deadline looming I really wanted to get it in! I realize my review heavily focused on a suggestion that may have limited value and is a bit of my own preferences, but that’s largely because everything is otherwise in great shape!
http://www.jsonbecker.com/contact/
… On Feb 21, 2018, at 3:53 AM, Thomas Klebel ***@***.***> wrote:
Thanks @jsonbecker for the thorough review! I'm eager to respond to the more substantial questions (csv vs sqlite and file_name vs other_identifier), but I think I will wait for the review by @EccRiley, so I can address all comments in one go.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
I'm still eager to work on/discuss some changes, but wanted to wait for the second review. Do you happen to know until when you will be able to finish your review, @EccRiley? |
@EccRiley are you still able to do the review? |
I have been, yes. I could do a review. I also have some data if @JasonBecker would like to try it. |
@elinw Gentle ping checking in on your review. |
@karthik On it, should finish it up this weekend. |
This is still a draft Package Review
DocumentationThe package includes all the following forms of documentation:
Functionality
JSTOR is a really important data source for the study of scholarly publication on issues. JSTOR as a data source has wide coverage over many disciplines and also includes sources going back in time. As a data source JSTOR has some challenges and features that are a bit different than others. First, All of this to say that it is extremely helpful to have a package to help with managing this. Specifically the The package itself is solid, test coverage is good, documentation is complete, and good practices are followed. It does exactly what it says and works as described. The code seemed fine, and in particular the ways that it it made batch size and number of cores easy to specify was helpful to me. I appreciated the single interface to all the data types. The README is helpful in defining the structure and consistency of the data. CommentsThese are some additional comments for consideration. I have a JSTOR basic file (not full text) from my research that I used in my testing. It contains 1678 items, 42051 references, 12376335 bigrams. I did my testing mainly on RStudioServer (open source) installed on CentOS with 4 cores, although I also did some on my macbook air. I will say that I was happy to have a package that installed easily. At first I was surprised that the data were read to csv files. However, after some thought I feel it is very useful to have this as the main outcome as it is then up to the user to decide what kind of storage One thing I will mention is that I did find that for my data I could not use The vignettes are helpful and the new one in particular is a nice overview of the process of cleaning and analyzing the ngram data. However, I think it would be stronger if it discussed some of the issues For example for the references data I have:
And for the bigrams I had this.
(It is very easy to add custom stop words, I could figure out how just from following the vignette and realizing I also did not necessarily want to get rid of all numbers but one kind of troublesome number was e.g. 1998b. So it would probably be easy to write a function for that, though I ended up just listing out the ones I found. Another example, with JSTOR unless you specify otherwise, you get articles in multiple languages. The language names are not consistently coded so I used I do not have a problem with using the baseid as the main ID. This potentially lets me work with other interfaces to JSTOR, is consistently produced over time and is unique. If I have data from separate JSTOR queries it lets me find overlap easily. Final approval (post-review)
Estimated hours spent reviewing: 5Review Comments |
I would be happy with the renaming to I'm also good with database being out of scope. |
It took some time, but I think I am finished with the requested changes. Due to the long duration of the review process, there are many changes altogether, and only some are responses to the reviewers comments. In my post above I outlined five areas where I would make changes:
All of them are done as of now. The news file, containing all changes, is as follows (or in a prettier version online here: https://tklebel.github.io/jstor/news/index.html): jstor 0.3.0Breaking changes
If you want to terminate the proceses, at least on *nix-systems you need to kill
Importing data directly from zip-filesThere is a new set of functions which lets you directly import files from In the following example, we have a zip-archive from DfR and want to import First we specify what we want, and then we apply it to our zip-archive: # specify definition
import_spec <- jst_define_import(article = c(jst_get_article, jst_get_authors),
book = jst_get_book,
ngram1 = jst_get_ngram)
# apply definition to archive
jst_import_zip("zip_archive.zip",
import_spec = import_spec,
out_file = "out_path") If the archive contains also research reports, pamphlets or other ngrams, they # import multiple forms of ngrams
import_spec <- jst_define_import(article = c(jst_get_article, jst_get_authors),
book = jst_get_book,
ngram1 = jst_get_ngram,
ngram2 = jst_get_ngram,
ngram3 = jst_get_ngram) Note however that for larger archives, importing all ngrams takes a very long Before importing all files from a zip-archive, you can get a quick overview with New vignetteThe new New functions
Minor changes
I don't expect you to dive into all the changes, though. I hope that all your comments are addressed with the changes and that we can finish the onboarding soonish 😄 |
I'm sorry I'm slow on replying but this seems good. I love the import from zip. I have some things to focus on right now but by the end of the week I'm planning to have updated to the current development branch and try things out. I will also get you an example article. |
I am very happy with these changes. I especially like the set and structure of the vignettes and the separation of the "case study". This now serves not only as a well documented package but as solid documentation of the data it works with. This has my full 👍 |
25074331 I actually pulled the article and it's true, they cited the same page multiple times. It was striking because it made Greengard the most cited author but I had never heard of him. |
Thanks @elinw for the example. I added a few sentences about issues with references in the vignette and linked the article as an example. |
I actually don't think that the article is "wrong" it's just following the citation style of the journal it is published in. Other journals probably would have used Ibid and op. cit. or similar. |
@elinw I changed my language a bit to remove the certainty of the judgement, though I would still consider it to be an artifact since they use Ibid too, but only for some references. However there might be another reason for this which I am not aware of. I therefore rephrased like this: "[the article cites ...] which is possibly an artifact." |
@karthik just pinging if you could take a quick look at this issue. The current status seems to be:
How will this onboarding proceed? |
Yes, I am good with these changes, the last of which are mainly just polishing documentation etc. This is going to be really useful and encourage the use of JfR. I noticed the change from stop() to abort() and read about the differences, my suggestion is that you make that rlang::abort() explicitly, which you'll need to do for CRAN anyway. Great job! |
@elinw I'm actually not sure if I really prefer Apart from that, I am very happy for the thumbs up! This comes just in time before useR! 😃 |
Congrats @tklebel , your submission has been approved! 🎉 Thank you for submitting and @elinw and @jsonbecker for thorough and timely reviews. To-dos:
Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/technotes/) or long-form post with more narrative about its development. ((https://ropensci.org/blog/). If you are, @stefaniebutland will be in touch about content and timing. |
Great! Thank you @karthik for managing the application, and @jsonbecker and @elinw for your helpful comments! I transferred the repo, added the rOpenSci footer and changed all relevant links in the README. I'll see to make the package ready for CRAN and I will complete the suggestion by @elinw while doing so. I'd very much like to write a blog post with some narrative about the development of the package. @stefaniebutland, I should have time to write this post some time after July. |
@karthik I changed more or less all links I found, build systems (travis and appveyor) are working. The only missing part as far as I can see are admin rights for the repo, so I can update a few remaining things (mainly the link to the pkgdown site). Then we could close the issue, I think. |
Very glad to hear that @tklebel. This link will give you many examples of blog posts by authors of onboarded packages so you can get an idea of the style and length you prefer: https://ropensci.org/tags/review/. Here are some technical and editorial guidelines for contributing a blog post: https://github.com/ropensci/roweb2#contributing-a-blog-post. I'll mark my calendar to follow up with you in mid- August so we can agree on a publication date. We ask that you submit your draft post via pull request a week before the planned publication date so we can provide feedback. Happy to answer any questions as they come up. |
@tklebel I have blog post publication dates available Sept 18, Oct 2, 9, 16. If you're still interested, let me know your preferred date and mark your calendar to submit a draft a week in advance. Please see my comment above for more details. |
@stefaniebutland I'm still interested in writing a blog post. Regarding publication dates, I would prefer Oct the 9th, if that is ok for you. |
Glad to hear that @tklebel. Tuesday 2018-10-09 is yours. Please submit a draft via pull request by 2018-10-02 and don't hesitate to ask any questions here. |
@karthik is it possible, that I am still missing appropriate rights for the repo? I wanted to fix the link in the title of the repo, but I still couldn't do it (someone else fixed it for me in the meantime). |
@tklebel Can you please check now? |
@karthik perfect, thanks! |
Hi @tklebel. Checking to see if you're still on track to submit a draft bog post by 2018-10-02 for publication the following week. |
Hi @stefaniebutland. Is it ok if the post is more about why and how I wrote the package, including what I learned along the way, and less about how to use it? I have a lot of documentation, including a case study on the pkgdown site, so I don't think it would be helpful and interesting to write something similar again... If you are fine with that, than I would answer, that I am on track 😺 |
Sounds great @tklebel. More people will likely relate to this kind of post. Please include a short section noting & linking to the docs you have. |
Summary
The tool Data for Research (DfR) by JSTOR is a
valuable source for citation analysis and text mining.
jstor
provides functions and suggests workflows for importing
datasets from DfR.
URL for the package (the development repository, not a stylized html page): https://github.com/tklebel/jstor
Please indicate which category or categories from our package fit policies this package falls under *and why(? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):
The package should fit very well into the data extraction-category, because
it extracts data from
*.xml
-files for further use.The target audience would be scientists leveraging JSTOR/DfR for textual research.
Currently there is is no implementation in R to parse the meta data they deliver.
Although the data quality on references is not very consistent, one could try
to conduct citation analysis with a lot of control over the sample and the data
compared to GoogleScholar or Web of Science.
yours differ or meet our criteria for best-in-category?
No.
Pre-submission was approved by @karthik: #186
Requirements
Confirm each of the following by checking the box. This package:
Publication options
paper.md
matching JOSS's requirements with a high-level description in the package root or ininst/
.I would very much like to submit a paper to JOSS, however at the moment I don't
have a proper case-study yet. If possible, I would like to submit a
paper at a later stage.
Detail
Does
R CMD check
(ordevtools::check()
) succeed? Paste and describe any errors or warnings:Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
The only "problems" from
goodpractice::gp()
are a test coverage of 98% andlengthy lines in the test-files.
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:
If he would find the time to do it, a technical review by @jimhester could
probably help in resolving some performance issues, although he might urge me
to re-write the whole thing (again).
The text was updated successfully, but these errors were encountered: