Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for alternative default branches in GitHub #508

Closed
ha0ye opened this issue Jun 13, 2020 · 29 comments
Closed

Add support for alternative default branches in GitHub #508

ha0ye opened this issue Jun 13, 2020 · 29 comments

Comments

@ha0ye
Copy link

ha0ye commented Jun 13, 2020

remotes::install_github() assumes that the default branch is named "master", which is not always the case, e.g.

remotes::install_github("elizagrames/litsearchr")
#> Using github PAT from envvar GITHUB_PAT
#> Downloading GitHub repo elizagrames/litsearchr@master
#> Error in utils::download.file(url, path, method = method, quiet = quiet,  : 
#>   cannot open URL 'https://api.github.com/repos/elizagrames/litsearchr/tarball/master'

Created on 2020-06-13 by the reprex package (v0.3.0)

The GitHub API notes a default_branch field that could provide this information, e.g.:

remotes:::github_GET("repos/elizagrames/litsearchr")$default_branch
#> [1] "main"

Created on 2020-06-13 by the reprex package (v0.3.0)

So, I think this can be a drop-in for the default value for the ref argument in install_github. Some additional re-factoring might be needed if we wanted to bundle the GitHub API calls.

Note, to my knowledge, this issue was first reported by @elizagrames on twitter, and we chatted about a potential fix before writing up this issue. We're happy to work on a PR to this, and look into propagating the solution to other install_* functions, if so desired.

@jimhester
Copy link
Member

jimhester commented Jun 15, 2020

I think this would be a non-starter unless you can implement it only using the existing number of GitHub API calls, we already run into API limiting problems with the current number of calls, so we can't really do an additional call to lookup the default branch.

If you can work out a way to do this reusing the same number of API calls we could review a pull request for it.

@gaborcsardi
Copy link
Member

Might be possible with the graphql api.

@MyKo101
Copy link
Contributor

MyKo101 commented Jun 17, 2020

I don't know how far reaching this is inside the inner workings of remotes (i.e. in other functions), but I can't see this being hard to implement. Forgive me if it is more complicated (particularly with sources other than github), but to me it seems straightforward.

Running install_github() with ref="" finds the default branch automatically regardless of it's name.

remotes:::remote_download.github_remote() builds a url for the desired tarball directory and then downloads it. A references can be tacked on at the end of the url, but if you're downloading the default branch, you don't need it.

For example:
https://api.github.com/repos/r-lib/remotes/tarball/
and
https://api.github.com/repos/r-lib/remotes/tarball/master
download the same file.

Why not just change the default value for ref to be ref=""? What other consequences will there be? What other parts of remotes need changing? The only consequence for the user seems to be if they want to download @master and it isn't their default branch (which is highly unlikely).

@gaborcsardi
Copy link
Member

Running install_github() with ref="" finds the default branch automatically regardless of it's name.

Are you sure? It does not for me:

> remotes::install_github("gaborcsardi/secret", ref = "")
Error: Failed to install 'unknown package' from GitHub:
  HTTP error 404.
  No commit found for the ref

  Did you spell the repo owner (`gaborcsardi`) and repo name (`secret`) correctly?
  - If spelling is correct, check that you have the required permissions to access the repo.

@MyKo101
Copy link
Contributor

MyKo101 commented Jun 17, 2020

Yes, I see the problem now. There is also the github_DESCRIPTION() which attempts to grab the DESCRIPTION file from github. This is then parsed by remote_package_name.github_remote() to get the package name.

Setting ref="" worked for me because I already had the package installed, and so remote_package_name() used the local version of the package.

But again, the url created within the github_DESCRIPTION() function works if you don't supply any reference to it at all.

e.g.
https://api.github.com/repos/gaborcsardi/secret/contents/DESCRIPTION
is the same as
https://api.github.com/repos/gaborcsardi/secret/contents/DESCRIPTION?ref=x

So a quick if(ref=="") before adding the "?ref=x" would work.

I think the next issue would probably be github_commit(). From my understanding, it is trying to get the SHA reference for the branch we are going to download. Again though, if we don't use curl, the JSON file extracted with download() seems to be fine.

However, the curl method does indeed get two different files. I don't know enough about curl to be able to confirm whether this is feasible, but it seems that the two urls that can possibly get created (with or without a reference) point to either a JSON with details for the package as it stands on that branch (reference), or a list of all commits to the default branch (with slightly less information for each entry in the list). See here:

No reference:
https://api.github.com/repos/gaborcsardi/secret/commits
vs
Reference
https://api.github.com/repos/gaborcsardi/secret/commits/x

Someone more versed in curl could probably write the correct fetch command for this bit, again using if(ref=="") to distinguish between whether we're looking at the url for the current package on the branch (with reference) or the entire history of the branch (without reference) and then pull out the relevant SHA.

Again, no extra API calls. I imagine most other calls to github/git could have something like this around it (and presumably for others in the install_*() family).

@jhollist
Copy link

jhollist commented Jun 17, 2020

Currently, install_github() assumes that "master" is the default branch name, I know this hasn't worked its way out completely yet, but are there other names that could reasonably be assumed to be the default? I believe that "main" is the name GitHub is switching to for the default branch. Could check for the existence of either https://github.com/<user>/<repo>/tree/main or https://github.com/<user>/<repo>/tree/master and use whichever exists. If neither then user would need to know the ref name anyway.

@gaborcsardi
Copy link
Member

gaborcsardi commented Jun 18, 2020

It seems like HEAD can be used to refer to the default branch, even though this is not really documented anywhere. But a quick search shows that many people have found this over the years.

HEAD is the, well, the HEAD of the current branch in git, so this makes sense, since on GitHub the current branch is always the default one. Probably.

@MyKo101
Copy link
Contributor

MyKo101 commented Jun 18, 2020

That makes perfect sense. I've just double checked and all the links I've posted earlier work using ref="HEAD" and remotes::install_github("gaborcsardi/secret@HEAD") also works (whether secret is already installed or not.

Documentation for install_git() mentions using HEAD if you don't know the branch, but still defaults to "master", and works fine with ref="HEAD". I'd therefore assume that the git based repositories all use the same convention and this should therefore work for them all.

@jimhester
Copy link
Member

using HEAD seems a good workaround to me.

@HenrikBengtsson
Copy link
Contributor

HenrikBengtsson commented Jun 18, 2020

remotes::install_github() assumes that the default branch is named "master", which is not always the case ...

Hi, I'm sorry if this has already have been discussed, and it could be that it is the implicit assumption, but I don't see where it says that install_github() should install the default branch. The help pages says that ref Defaults to "master", which I have taken to be a de-facto standard we use in the R community. To me this discussion sounds more like a change of the default behavior of install_github() (from installing a pre-defined branch to installing the default branch) rather than a "bug fix".

I have designed all my GitHub repositories based on the fact that remotes::install_github() installs from pre-defined branch name (currently master) which is not the same as my default branch (which I name develop and is what I want PRs to be made towards), e.g.

> remotes:::github_GET("repos/HenrikBengtsson/future")$default_branch
[1] "develop"

Changing the behavior of install_github() would all of a sudden have users install from a branch that should not be installed from.

If I had a vote/say, and this is about migrating away from the branch name master to an new standard for the main branch (e.g. main), may I propose to have it default to ref = c("main", "master") where main is used with a fallback to master?

@gaborcsardi
Copy link
Member

Can you implement a fallback with a single GH API query and base R?

The other issue is that main might not be a consensus choice. Now it is a good time to get rid of any hard-coded branch names in remotes, and the using default branch is a good default for remotes as well. We would have probably used this default, had we known how to do it.

I agree that this will break some people's workflows. However, I am fairly sure that for most R package repos the default branch and the current remotes default is the same, and they won't experience any disruptions. We will also give them the freedom to name their default branch however they want.

@HenrikBengtsson
Copy link
Contributor

Righty-oh. Thanks for confirming the design decision. One more though, would it be possible to give the maintainer control over this via a config file in the repos? For example,

$ cat .remotes.yml
defaults:
  - ref=main

Or, by hijacking some existing file for the same purpose. The downside would be one-more round trip. The upside would be giving the maintainer full control of what end users install when they don't specify ref=....

Just a thought. Not a biggy but certainly an itch.

@gaborcsardi
Copy link
Member

I am afraid I don't quite understand. Where would you put that file, and which packages would it apply to?

@HenrikBengtsson
Copy link
Contributor

HenrikBengtsson commented Jun 19, 2020

Sorry for not being clear. I'd put it in the root of my repos. Just like .travis.yml etc.

@HenrikBengtsson
Copy link
Contributor

Another thought would be a field in the DESCRIPTION file

@gaborcsardi
Copy link
Member

Yes, but then what does it apply to? To the repo itself? So we would first download DESCRIPTION from the default branch, and if that specifies a different branch, then download the package from that one, otherwise from the default branch?

That's another round of queries again, and it also complicates the process substantially.

Isn't it just easier to make the branch that you want people to download the default? That's also what they see on the main repo page, shouldn't they get exactly that when they run install_github()?

@HenrikBengtsson
Copy link
Contributor

That's another round of queries again, and it also complicates the process substantially.

Yes.

Isn't it just easier to make the branch that you want people to download the default?

I think we discussed this elsewhere: In short, I want the default to be the develop branch (for PRs and visibility) and master the most recent stable release (for installs).

Long argument:
I'd like to have remote::install_github() to always install from a stable branch and not from a branch that is constantly in flux/under development. That is just a mess for both end-users and developers - there's a lot back and forth needed to establish what they're actually running if there are issues. Also, if I go and install a random commit point from someone else, I can't be sure if it happens to be in a broken state or not. So, I argue that remote::install_github() should default to a stable release point - not the flavor of the day. (I know, I'll die on this hill).

We don't have a standard for release branches or tags, so master is the only one can could be considered stable. Then you develop in another branch, in my case develop. I only merge releases to master. Now, I'd like for people to do PR to my develop branch and I'd like for drive-by-ers to see that what's happening even if it might be months before master is being updated (I actually think you're the one that said that the latter is an important feature).

@MyKo101
Copy link
Contributor

MyKo101 commented Jun 20, 2020

It's an interesting way to deal with your branches that I'd never really considered. Not quite as much of an edge-case as I had originally assumed. From your explanation it actually sounds like a really good workflow.

The problem is you essentially want two different "defaults". The one as applied by git/GitHub and "master" for two different purposes (PRs vs installations).

From what I gather, it seems like the creators of {remotes} intended to use the default branch, but weren't aware of using "HEAD" as an option and so just used "master" as that's the standard default. However, if git/GitHub implement to switch to something else (eg "main") as has been suggested, this would break {remotes}. Sure, the default in this package could be changed to "main", but then everybody would have to switch to "main" too (and a lot of people aren't happy with that).

Your use-case (as reasonable as it seems) is kinda exploiting a loop-hole in install_github(), and this PR would fix that loophole.

A solution (and unfortunately, it would probably take some work to setup in your workflow) would be to use a second account for stable releases vs development versions.

Commits have three levels: <user>/<repo>@<branch> (for simplicity). You're currently using one of those levels (branch) to differentiate your two "defaults", but this fix breaks that. It wouldn't be feasible to use repo, so the best option would be to use user. Rather than merging your "development" and "master" branch when you get a stable release, you push it to the other account. Similar to pushing stable releases to CRAN.

I know this isn't ideal for you or anybody else that uses this kind of workflow, but I'm fairly sure this change is going to have to happen at some point (sooner or later I don't know), so it's something to consider.

@HenrikBengtsson
Copy link
Contributor

In understand that the using the default branch will be a convenient solution to the upcoming change.

Note that there is a bigger discussion here: remotes::install_github() is creating a de-facto workflow standard among developers and users. It's ad[ao]pted by lots of people of these days. After install.packages() its the most common installation method in R right now. That is an important role. This workflow implies that you as a developer/maintainer should always attempt provide a working, stable version of your package in the default branch because you newer know who installs it and when. You cannot push a broken version and then go to bed without risking messing it up for someone somewhere out there.

In my own work, I've already have instructions how to install from the development branch install_github("user/pkg", ref="develop"). For packages on CRAN, I refer to install.packages() for installations. For the packages that are not yet on CRAN, I have now updated by instructions from install_github("user/pkg") to install_github("user/pkg", ref="master"). Right now, I could even prevent installing the package from GitHub by not having a master branch such that install_github("user/pkg") will give an error. If that becomes the default, that's not possible.

In this bigger picture there is also the support for declaring dependencies in the DESCRIPTION file:

Remotes:
  userA/pkgA
  userB/pkgB

This is not stable or safe because what's installed there can change second by second (contrary to stable releases on CRAN/Bioconductor). I argue each package maintainer should provide a stable release branch so that others can always point to that one. This is the only way we can guarantee some kind of stability outside of CRAN/Bioconductor releases. Something like:

Remotes:
  userA/pkgA@latest
  userB/pkgB@latest

This is a standard that only the remotes package has the momentum to introduce. I obviously argue that the default should then be ref="latest" (given that most people use master as their under-development branch). The second best alternative is not to have a default behavior at all and require everyone to specify argument ref per developers instructions.

My suggestion of having a .remotes.yml in the repos root would allow me as a maintainer to control the default installation reference (commit, branch, tag). Alternatively, one could imagine a DESCRIPTION field:

RemotesNotes:
  ref=latest

that the package maintainer can control. This would require an extra roundtrip in those cases, i.e. remotes::install_github() pulls down the default branch, parses DESCRIPTION (or .remotes.yml) to see if they specify a default installation branch and if different from what's been downloaded, then it does a second round of download to get the intended reference/commit.

This would actually be useful for some other repositories where their master branch is know to be broken and they're asking people in issue comments to install a specific commit until they've fixed up their master branch, which in some cases can take months because they have diverged too far.

In summary: This might be a good time to think through the long-term role of install_github().

PS. Does the changes proposed here, including the HEAD hack, transfer to any other Git services, e.g. GitLab, Bitbucket, Gitea, ...?

@jonkeane
Copy link

For what it's worth, remotes::install_git() already grabs the default branch when you run it against a github URL, so the change proposed by @gaborcsardi to use HEAD would bring install_guthub() inline with install_git()'s current behavior.

I wonder if the behavior that @HenrikBengtsson is looking for would be better served by using the special @*release syntax along with github tags+releases? (cf https://remotes.r-lib.org/articles/dependencies.html#github). It would require needing to tag/release on github which you say you're not currently doing, but would make allow you to advertise one install_github() that always installed the latest stable version.

@HenrikBengtsson
Copy link
Contributor

For what it's worth, remotes::install_git() already grabs the default branch when you run it against a github URL, so the change proposed by @gaborcsardi to use HEAD would bring install_guthub() inline with install_git()'s current behavior.

I think this is a good point - checking off consistency across the API is important.

I wonder if the behavior that @HenrikBengtsson is looking for would be better served by using the special @*release syntax along with github tags+releases? ...

This is actually a model I'm using right now. My current stable release is the (HEAD of) master branch - but I could easily tag the latest stable release. I tag every version with the package version number to x.y.z, which shows up under Releases on GitHub, so they can easily be installed via @ or ref="...".

It's on my to do list to learn how to do proper '*release' releases - I think it was @gaborcsardi who pointed this out to me.

I'd sign up on making new default would be rel="*release" but I guess that would then require everyone doing proper releases (which I'm not against) - maybe 'devtools' can provide/already provides this?

@MyKo101
Copy link
Contributor

MyKo101 commented Jun 20, 2020

I also like this idea. It would require more information regarding github releases taught from an R perspective. devtools has a release() function which pushes to CRAN, so it would probably be implementable in there.

It would also require that *release defaults to the current HEAD if there are no releases in the repo. It could be coded to check for a *release and then find HEAD otherwise, but then it has the same limitations already discussed (multiple API checks).

@gaborcsardi
Copy link
Member

gaborcsardi commented Jun 20, 2020

Changing to installing releases by default is not possible, as it would break tons of code.

The goal of remotes and install_github() always was and still is to support installing packages from places not supported directly by install.packages().

My personal view is that the main use case of install_github() is to install the development version of a package. After looking at about ~20 random popular R packages (from https://www.r-pkg.org/starred) it seems to me that this is also how people use it. I.e. the README typically tells you to install the stable version from CRAN, and that if you want the dev version, then use install_github().

I completely agree that people should not install dev versions of packages by default. E.g. I removed the instructions for installing dev versions from the READMEs of all my packages. I also think that the right tools for distributing R packages are CRAN and CRAN-like repositories.

@maelle
Copy link
Contributor

maelle commented Jun 21, 2020

@MyKo101 for info since you mention devtools, regarding GitHub releases, there's usethis::use_github_release(), with the GitHub release an item in the checklist produced by usethis::use_release_issue().

@jeroen
Copy link
Member

jeroen commented Jun 21, 2020

FYI, the remote HEAD is a a standard reference to the 'default' branch. For example the libgit2 docs for git_remote_default_branch says:

The default branch of a repository is the branch which HEAD points to. If the remote does not support reporting this information directly, it performs the guess as git does; that is, if there are multiple branches which point to the same commit, the first one is chosen. If the master branch is a candidate, it wins.

@MichaelChirico
Copy link
Contributor

MichaelChirico commented Jun 22, 2020

@HenrikBengtsson FWIW data.table is always including the commit ref installed from our master (=default, at the moment) branch as the Revision field to DESCRIPTION:

https://github.com/Rdatatable/data.table/blob/3c788b7914918e5b6e3c1d76bed42fcb1e61a69d/.gitlab-ci.yml#L49

This averts the problem you mention of not being able to track down which bleeding-edge install a user may be using when installing from master.

IIUC to do this remotes would require a little extra work & I'm not sure if it would require extra API calls...

My 2 cents is if you're installing with remotes::install_git{hub,lab}, you should be expecting a potentially unstable release; stable releases can be kept as tags and installed from there:

remotes::install_github('Rdatatable/data.table@1.12.2')

@gaborcsardi
Copy link
Member

gaborcsardi commented Jun 22, 2020

This averts the problem you mention of not being able to track down which bleeding-edge install a user may be using when installing from master.

IIUC to do this remotes would require a little extra work & I'm not sure if it would require extra API calls...

Not at all. :) No extra work is needed, remotes adds the git sha automatically if you install from git, GitHub, etc. sessioninfo::session_info() prints it automatically as well:

> remotes::install_github("r-lib/cli")
[...]
> packageDescription("cli")$RemoteSha
[1] "e3ca34656f5bb82df63bfc1c741e75acc79b13d9"
> library(cli)
> sessioninfo::session_info()
- Session info ---------------------------------------------------------------
[...]
- Packages -------------------------------------------------------------------
 package     * version    date       lib source
 assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.0)
 cli         * 2.0.2.9000 2020-06-22 [1] Github (r-lib/cli@e3ca346)
 crayon        1.3.4      2017-09-16 [1] CRAN (R 4.0.0)
 fansi         0.4.1      2020-01-08 [1] CRAN (R 4.0.0)
 glue          1.4.1      2020-05-13 [1] CRAN (R 4.0.0)
 remotes       2.1.1.9000 2020-06-18 [1] Github (r-lib/remotes@9b5dc29)
 sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.0)
 withr         2.2.0      2020-04-20 [1] CRAN (R 4.0.0)

[1] C:/Users/csard/R/win-library/4.0
[2] C:/Program Files/R/R-4.0.1/library

My 2 cents is if you're installing with remotes::install_git{hub,lab}, you should be expecting a potentially unstable release

Again, the purpose of install_github() is to install an unstable "release".

@MyKo101
Copy link
Contributor

MyKo101 commented Jul 13, 2020

Should this be closed with the new merges?

@jimhester
Copy link
Member

Closed by #510

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

11 participants