Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load_all should not touch the DESCRIPTION file with roxygen2::update_collate #723

Closed
renozao opened this issue Feb 23, 2015 · 14 comments
Closed

Comments

@renozao
Copy link
Contributor

renozao commented Feb 23, 2015

Hi,

the recent change in load_all using update_collate (PR #649) has a deep impact when loading the package on parallel jobs, has it simply creates a massively invalid DESCRIPTION file, with each job writing bits of the file in random places.

More generally, I think load_all should not touch any of the package files: it is meant to load the package as is, and by all means should honor what the user wrote in the DESCRIPTION file, even if it is incorrect.

If collate order must be updated, the user can call collate_update, and may be suggested to do so if a computed collate order is not as in the original file.

I would really stick to the rule: load_all emulates what library does, i.e., read/load package files without trying to fix things, roxygen2 helps writing/generating package files.

Thank you.

@wch
Copy link
Member

wch commented Feb 23, 2015

I agree with the principle that load_all does one thing, and document does another.

However, if #649 were reverted, then bug #623 would come back. That is, it wouldn't even be possible to run document on packages that need the Collate field to be updated. So we'd need some other solution to that issue.

@gaborcsardi
Copy link
Member

Fix me if I am wrong, but my understanding is that devtools is for package development, and if you want to use the package, in your case by multiple R processes in parallel, then you can just install it, and load it with library.

Because if you really want to use devtools in parallel, then there are issues with all the generated files, not just DESCRIPTION, and these issues are hard to fix, essentially all writes would require locking.

@Geoff99
Copy link
Contributor

Geoff99 commented Feb 23, 2015

Hi @renozao
As the original author of the pull request, I feel slightly responsible for your problem, and would like to see if I can contribute to a solution. I don't have any experience running parallel jobs in R though. It would help me understand better if you could describe the workflow where devtools is used in parallel jobs on the same package. Like @gaborcsardi , I understand devtools is to be used during development, and I can't quite see how that can be done in parallel. But as I said, I have no experience with parallel processing, so I may be missing something obvious.

PS Your philosophy of keeping the concepts of the packages separate has some attractions to me, but there are things that devtools can't properly do unless it refreshes the DESCRIPTION and NAMESPACE files based on the roxygen2 comments in the packages R files (at least that's what I've observed with my own experiments, while trying to become totally roxygen based for developing my own package)

@renozao
Copy link
Contributor Author

renozao commented Feb 23, 2015

I am using it on a cluster with independent jobs, not real parallel R
session, so things work well and running in dev mode makes it easier to
quickly fix bugs.

If the purpose is to be able to run document, then shouldn't the
collate be updated only when effectively running document?

The current solution might actually be fine, as long as it only writes the
DESCRIPTION file when necessary. In my case it keeps writing it even if the
resulting Collate field is identical to the current one.

On Monday, February 23, 2015, Gábor Csárdi <notifications@github.com
javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

Fix me if I am wrong, but my understanding is that devtools is for package
development, and if you want to use the package, in your case by multiple R
processes in parallel, then you can just install it, and load it with
library.

Because if you really want to use devtools in parallel, then there are
issues with all the generated files, not just DESCRIPTION, and these
issues are hard to fix, essentially all writes would require locking.


Reply to this email directly or view it on GitHub
#723 (comment).

@gaborcsardi
Copy link
Member

The current solution might actually be fine, as long as it only writes the
DESCRIPTION file when necessary. In my case it keeps writing it even if the
resulting Collate field is identical to the current one.

@renozao Maybe so for your case.

But I think that the general point of "do not run devtools from multiple R processes on the same directory" still stands. AFAIK devtools does not do any locking, so your toolchain will probably break at some point.

@Geoff99
Copy link
Contributor

Geoff99 commented Feb 23, 2015

Thanks. I'll have to think a bit.

The problem I was having is that underneath the hood, document uses load_all --- it needs to source the R code, or (as load_all does) load the result of running it into an environment / namespace, so that roxygen2 has the information it needs about the package when running the roclets. And if the Collate order in the DESCRIPTION file was wrong, then load_all didn't work, so document didn't work, so load_all didn't work, so ...

I was enormously confused when it first happened to me because I resorted to commenting out new bits of my code to try and track down the error - at some stage document would suddenly work - which would fix the collate entry (which I didn't realise) - once my code was working I removed the comments, one by one, and then got all the way back to my original code - which suddenly was fine :-(

'nuff of my issue, I'll have a think about your problem.

@Geoff99
Copy link
Contributor

Geoff99 commented Feb 24, 2015

Hi @renozao

I agree with @gaborcsardi about the dangers of running multiple R processes on the same directory.

But putting aside those reservations, I have created a branch in my forked (for my own learning and development) copy of roxygen2 which does what you suggested, ie only tries to rewrite the DESCRIPTION file if the collate entry has changed. If you want to try it out, it is available at http://github.com/Geoff99/roxygen/tree/only-update-changed-collate

You'd have to download and install it from github (and probably, if you are using Windows as I do, you might have to restart your R session before it would be available).

Let me know if you try it out and it does what you want. It seems to work for me - but I'm not doing anything in parallel.

@renozao
Copy link
Contributor Author

renozao commented Feb 24, 2015

@gaborcsardi: when you say toolchain, do you mean in the case there is compiled code or even in plain R packages?
I know it seems safer to use installed packages when running jobs on the cluster, but I am using multiple packages, and it would require to re-install each of them after debugging from my local machine (which I sometimes forget), and by using devtools I am sure I am using the latest fixed version of all of them. It is just simpler and I had no issue until the update_collate change.
Full production runs will use standard installed packages.

@Geoff99: I understand the objective of calling update_collate in load_all, but I think it is better practice to have a conservative behavior of only updating if necessary, and notify the user about the update as well (with a message or a warning).
Will try your branch (on Ubuntu).

@hadley
Copy link
Member

hadley commented Feb 24, 2015

I'm not prepared to make any guarantees that load_all() will work in parallel, and I'm happy that it makes minor changes in order to make the package load correctly (and I expect there might be more in future to (e.g.) deal with broken namespaces). I'd be happy to review a pull request as long as it was extremely simple.

@hadley hadley closed this as completed Feb 24, 2015
@Geoff99
Copy link
Contributor

Geoff99 commented Feb 24, 2015

Hi @hadley

As the person who wanted (and still does) the update_collate addition so that document and load_all would work, I'm more than happy with that.

Based on my recent experiences, devtools / roxygen2 might well be OK with NAMESPACES already. I make a lot of mistakes while I'm learning, but having messed up my NAMESPACES several times over the last few days (eg while I tried to figure out which imported package I was using it was that promoted assorted S3 generics to S4 status, so I could importFrom the right place to get rid of assorted warnings), I think devtools copes OK with NAMESPACE mistakes. It will load_all and document (albeit with various grumbles), the resultant package just doesn't run quite properly, which trips up various test_thats which I am progressively building and tidying up, which tells me where to look to fix my own package up. In short, the closer I approximate your recommended package development cycle, the better things work :-) many thanks. (Of course tomorrow, I may well find a new and unusual way to break something).

I'll wait to hear back whether my branch makes life easier for the use case that initiated this issue, before I even think about submitting a pull request.. It's only a few lines, but not as crisply elegant as I would like. I'm not able to test it myself, and indeed, the more I thought about it, the harder I found it to think of how I might develop a reproducible parallel test_that anyway. And I am steadily becoming more of a fan of test_thats each day. Even just thinking about what I want to test and in which order clarifies my thoughts about the structure of my own package.

@gaborcsardi
Copy link
Member

when you say toolchain, do you mean in the case there is compiled code or even in plain R packages?

@renozao Even in plain R packages. E.g. if you export a new function via roxygen, then devtools will need to rewrite the NAMESPACE file, so there is a potential race condition if you do this from multiple R processes. The NAMESPACE file might get messed up.

It is just simpler and I had no issue until the update_collate change.

Maybe you were just lucky. Obviously, if devtools updates files less, there will be fewer race conditions. But this does not necessarily make the situation better. You'll get fewer crashes, yes, but they'll also be harder to debug.

@Geoff99
Copy link
Contributor

Geoff99 commented Feb 24, 2015

Even in plain R packages. E.g. if you export a new function via roxygen, then devtools will need to rewrite the NAMESPACE file, so there is a potential race condition if you do this from multiple R processes. The NAMESPACE file might get messed up

Hi @gaborcsardi . I totally agree with you about the possibility of race conditions (even though I have never yet felt confident enough to try anything remotely parallel in R).

Using the default arguments in devtools::load_all() makes it ignore the NAMESPACE export definitions in order to make development work simpler. So maybe (depending on the exact details of what is running in the parallel sessions, and how dependent they are on NAMESPACE imports --- versus say using the otherpkg::fun(...) approach) some races have been lost already, but the messed up NAMESPACE didn't break anything? And if roxygen is being used to generate all the NAMESPACE contents, devtools will fix the NAMESPACE file for you automagically when the package is built and installed the traditional way anyway.

@renozao
Copy link
Contributor Author

renozao commented Feb 25, 2015

To clarify my thoughts and use case:

  • I think having load_all fix/update some incorrect files like DESCRIPTION and NAMESPACE is fine and a good feature.
  • My jobs load a "frozen" package directory that was previously tested and loaded with load_all, so I am not expected any change to appear in those files.

What happens is that there is a race condition due to parallel jobs trying to re-write a file with its very same content.
@Geoff99, I forked and implemented a fix on your roxygen2 feature branch. Will send a push request shortly.

@lock
Copy link

lock bot commented Sep 18, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Sep 18, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants