Skip to content

Conversation

@mguaypaq
Copy link
Member

Fixes #128.

After a long bout of debugging, we figured out that, when working from a clone of a fork of this repository, it's important to have up-to-date versions of the git-annex branch from the fork and from this repository before running git annex get (which will automatically reconcile information from all available git-annex branches). This can be done by configuring this repository as a remote called upstream for the clone, and running git fetch upstream before git annex get.

This pull request adds instructions to the readme to this effect.

Questions for the reviewers:

  1. Are the instructions clear enough?
  2. Is this too much of a niche case to include in the readme?
  3. Should it also be copied to data-single-subject?

@renelabounek, I can't tag you as a reviewer, but if you have any thoughts I'd love to hear them.

@mguaypaq mguaypaq added the documentation Improvements or additions to documentation label Aug 17, 2022
@mguaypaq mguaypaq requested a review from kousu August 17, 2022 14:45
@mguaypaq mguaypaq self-assigned this Aug 17, 2022
@renelabounek
Copy link
Contributor

@mguaypaq: Answers below

  1. Instructions are clear to me
  2. I think it is fine as forks>=10. Anyone of the forking pepole, can need to update database and certainly will forget about this for the first time command try. Git annex is a hassle...
  3. There is lower number of forks, but still >0

@kousu
Copy link
Contributor

kousu commented Aug 17, 2022

2. I think it is fine as forks>=10. Anyone of the forking pepole, can need to update database and certainly will forget about this for the first time command try. Git annex is a hassle...

Agreed. It really is.

I wonder if datalad handles this but I am pretty sure they just don't. Their docs on publishing are here https://handbook.datalad.org/en/latest/basics/basics-thirdparty.html#chapter-thirdparty and they don't mention forks at all

Screenshot 2022-08-17 at 14-16-41 site https _handbook datalad org fork at DuckDuckGo

git-annex, and hence datalad, are all really built around the idea of multiple repos owned by a single person, or a single repo shared by multiple people. In all the time I've worked with it I've never seen the workflow function well as soon as you try to go full-distributed version control, which is strange because that's git's primary selling point.

@kousu
Copy link
Contributor

kousu commented Aug 18, 2022

I wonder if this would be better fixed by adding a GitHub Workflow that syncs the git-annex branch from upstream regularly. Their API lets you find out who the upstream is, I used it over in:

https://github.com/neuropoly/gitea/blob/37cf5c9e9b9f18b5d56df5fa9db52d932659e3b3/.github/workflows/sync-upstream.yml#L112-L113

If we can get all forks to run that automatically we should avoid the problem. Right?

@mguaypaq
Copy link
Member Author

I like the idea of using automation to avoid this issue, but I think the regular instructions are still useful. Some thoughts (where I use upstream to refer to this repository, origin to refer to a fork, and local to refer to a clone of the fork):

  • origin may not be on github, actually, but just on some convenient server (on some internal network, for example). Though in that case a cron job could be used instead of a github action.
  • Modifications in local that are pushed to origin would mean that origin/git-annex would diverge from upstream/git-annex, with potential merge conflicts.
  • I'm not sure I want to end up debugging github action problems on someone else's repository.

@kousu
Copy link
Contributor

kousu commented Aug 18, 2022

I like the idea of using automation to avoid this issue, but I think the regular instructions are still useful. Some thoughts (where I use upstream to refer to this repository, origin to refer to a fork, and local to refer to a clone of the fork):

* `origin` may not be on github, actually, but just on some convenient server (on some internal network, for example). Though in that case a cron job could be used instead of a github action.

I was definitely thinking this would be a GitHub specific workaround.

* Modifications in `local` that are pushed to `origin` would mean that `origin/git-annex` would diverge from `upstream/git-annex`, with potential merge conflicts.

I think this is always a risk no matter where you're running git-annex. It somehow takes care to make sure that branch is always mergeable, every time it says like this

(merging upstream/git-annex into git-annex...)

Maybe it's doing something special to avoid merge conflicts that git by itself would run into, but if so we use git annex sync --only-annex instead of git merge upstream git-annex:git-annex.

* I'm not sure I want to end up debugging github action problems on someone else's repository.

This is the one that gives me the most pause. I don't want to do that either. If it's in our repos, we'll get notifications about problems, but in someone else's we won't see it until they come to us scratching their heads.

Copy link
Contributor

@kousu kousu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The root-ish cause is clicking the GitHub sync button, which is not git-annex aware. The current instructions neglect to mention this step. I think we should lean into that and give an all-in-one command line solution that handles syncing both branches in one reliable line, i.e. my latest suggestion.

@kousu
Copy link
Contributor

kousu commented Aug 24, 2022

Sorry it took me so long to review this!

@mguaypaq
Copy link
Member Author

Not at all, thanks for the review, the suggestions, and the thorough testing!

I like your earlier suggestion of:

git config remote.upstream.annex-readonly true
git sync --only-annex --no-content

because it should reliably solve the problem (that git annex can't find file contents) without interfering with the master branch.

With your last suggestion, I think the git pull upstream master possibly does too much: if there are separate changes in upstream:master, and also in origin:master or the local master, then it could merge unwanted changes into the local master, or have merge conflicts. (Here I'm imagining that someone is working from a fork because they want to make non-upstream changes to the dataset.)

@kousu
Copy link
Contributor

kousu commented Aug 24, 2022

Not at all, thanks for the review, the suggestions, and the thorough testing!

I like your earlier suggestion of:

git config remote.upstream.annex-readonly true
git sync --only-annex --no-content

because it should reliably solve the problem (that git annex can't find file contents) without interfering with the master branch.

With your last suggestion, I think the git pull upstream master possibly does too much: if there are separate changes in upstream:master, and also in origin:master or the local master, then it could merge unwanted changes into the local master, or have merge conflicts. (Here I'm imagining that someone is working from a fork because they want to make non-upstream changes to the dataset.)

Isn't that what clicking the Sync button on github and then running git pull does anyway? If there's merge conflicts they will be the same at git pull or git pull origin master. And if they need bomp.nii.gz, say to modify it, add better segmentations, etc, then they need to have both master and git-annex synced.

In fact I realized this while testing.

git config remote.upstream.annex-readonly true
git annex sync --only-annex --no-content

isn't enough. It has to be

git config remote.upstream.annex-readonly true
git annex sync --only-annex --no-content
git pull

Are you thinking about the case where they're if they're just submitting new data in a PR? In that case, they don't need to be synced up.


Maybe we're overcomplicating this. The designed UI is

git annex sync

to do everything in one, but I had reasons for moving away from that:

  1. it does a full download before any uploads (which isn't a problem if you're fully synced up, but is if you're just trying to make a small edit),
  2. it tries to sync with all remotes whether or not you have permission to them (but maybe we can deal with that with git config remote.<name>.annex-readonly?).
  3. It tries to sync all branches, apparently (but maybe we can deal with that with git config --global annex.synconlyannex true?)

Copy link
Contributor

@kousu kousu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But in any case I feel like my review was well heard, so make whatever updates you think are appropriate and feel free to merge then.

@mguaypaq
Copy link
Member Author

There's an unrelated failure in the Dataset Validator, when trying to download the dataset. A single file fails, with:

get sub-mniPilot1/anat/sub-mniPilot1_T1w.nii.gz (from amazon...) 
  download failed: HandshakeFailed (Error_Misc "Network.Socket.recvBuf: resource vanished (Connection reset by peer)")

  Unable to access these remotes: amazon

  Try making some of these repositories available:
  	5a5447a8-a9b8-49bc-8276-01a62632b502 -- [amazon]
   	5cdba4fc-8d50-4e89-bb0c-a3a4f9449666 -- julien@julien-macbook.local:~/code/spine-generic/data-multi-subject
   	9e4d13f3-30e1-4a29-8b86-670879928606 -- alex@NeuroPoly-MacBook-Pro.local:~/data/data-multi-subject
   	e405e14e-33b2-4a35-b7a7-3eeec054f0d4 -- sebeda@joplin.neuro.polymtl.ca:/mnt/nvme/sebeda/data-multi-subject

  (Note that these git remotes have annex-ignore set: origin)
failed

Hopefully it's just a random failure; I've re-started the job.

@mguaypaq mguaypaq merged commit 321798b into master Aug 29, 2022
@mguaypaq mguaypaq deleted the mguaypaq/work-from-fork branch August 29, 2022 20:16
@kousu
Copy link
Contributor

kousu commented Aug 29, 2022

Hopefully it's just a random failure; I've re-started the job.

That's the right call. This is #107!

mguaypaq added a commit to spine-generic/data-single-subject that referenced this pull request Aug 29, 2022
kousu pushed a commit to spine-generic/data-single-subject that referenced this pull request Aug 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Update readme instructions with git fetch upstream if using a fork

3 participants