Skip to content

Conversation

gsfr
Copy link
Member

@gsfr gsfr commented Feb 23, 2016

This PR converts our bootstrapping logic to use http rather than manipulating mongo directly. As such, bootstrapping now requires the api to be up. Ping @ryansanford re flywheel infra.

The code still contains a reference to the bootstrap branch of the testdata repo. When ready to merge this, testdata should be merged and the branch reference should be removed.

To test this code, persistent/testdata needs to be removed as it is not currently a git repo.

_id = result.inserted_id

log.info('Running %s as job %s to process %s %s' % (gear.name, str(_id), input.container_type, input.container_id))
log.info('Enqueuing %s as job %s to process %s %s' % (gear.name, str(_id), input.container_type, input.container_id))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kofalt
Copy link
Contributor

kofalt commented Feb 23, 2016

Seems reasonable. Not reviewing run.sh.

Can you speak to the schema changes? Lengths changed, restrictions were removed?

@gsfr
Copy link
Member Author

gsfr commented Feb 23, 2016

Can you speak to the schema changes? Lengths changed, restrictions were removed?

Our own bootstrap file did not pass schema validation. Current bootstrapping circumvents validation.

The input schemas are for input validation; the mongo schemas are sanity checks on our own code to prevent database corruption. Length and pattern restrictions don't constitute database corruption.

@rentzso: Could you please give a practical example of how the mongo schemas help? Can all the mongo schemas be reduced to the required fields, as that is all we actually care about in terms of internal consistency? Additional fields certainly don't affect us.

@rentzso
Copy link
Contributor

rentzso commented Feb 24, 2016

@gsfr a practical example is permissions where you want to validate that they are an array and that they follow a certain schema (_id + site).
This is a basic requirement for the API code and it's better to fail a request than to continue running in a corrupted state.

I think that this problem actually happened and was blocked by the mongo schema a while ago. In project creation the permissions are copied from group roles,
and group roles, at the time generated by the bootstrap process, were missing the site.

additionalProperties is also important to protect the API from potential field misspelling in the code.

@kofalt
Copy link
Contributor

kofalt commented Feb 24, 2016

SGTM 👍

bin/bootstrap.py Outdated
log.info('Packaging %s' % dirpath)
filepath = create_archive(dirpath, os.path.basename(dirpath), metadata, tempdir, filenames)
filename = os.path.basename(filepath)
metadata.get('acquisition', {}).get('files', [{}])[0]['name'] = filename
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metadata could be undefined if there is no metadata.json.
this line doesn't seem also to update the metadata if it doesn't have the 'acquisition' field or the 'files' subfield.

@rentzso
Copy link
Contributor

rentzso commented Feb 24, 2016

LGTM pending a comment

@ryansanford
Copy link
Contributor

I'm testing now. some tweaks will be required for docker env.

@gsfr
Copy link
Member Author

gsfr commented Feb 25, 2016

Added proper schema validation.

@ryansanford
Copy link
Contributor

@gsfr Required changes for /docker folder are pushed.

I'd like to know your thoughts on the below commit. If you approve, we should do similar in bin/run.sh, or take this as an opportunity to refactor the bootstrapping out of bin/run.sh and have both that, and docker reference the same scripts.

029ec55

RE Download size:
I question the use of git rather than curl to pull down the test data, because the current method with git resulted in a 660 MiB download (using transport compression) rather than 210 MB with curl. For now, I didn't change the use of curl for the docker bootstrap.

Gunnar Schaefer and others added 9 commits February 25, 2016 16:07
- run paster asynchronously
- improve mongo installation
- pull out mongo version
Now supports API enabled bootstrapping.
Don't reference branches from within source. This allows core and testdata repo to move without breaking not up-to-date versions of core. Also allows branches to be fully merged.

Still a similar branch reference in bin/run.sh
@gsfr
Copy link
Member Author

gsfr commented Feb 26, 2016

Thanks, @ryansanford. I added my final commit and rebased.

I don't like the commit hash in the code. My idea for run.sh is to use the tip of master by default and otherwise stay on whatever branch the user has manually checked out. I also don't like that we are currently implementing our own versioning in run.sh, based on curl and ETags.

An alternative approach could be to get the testdata branch that matches the current core branch, if it exists, else fall back to master.

The large data size is due to several iterations of zip files polluting the git history. I just updated the bootstrap branch on testdata to start fresh, without history. With that, a --single-branch clone comes in at 216MiB.

I guess that obliterated the commit hash you are referencing. 😧

@ryansanford
Copy link
Contributor

@gsfr

Nice work on reducing the download size for bootstrap branch of scitran/testdata. Right now downloading "master" branch comes in at 444.69 MiB. Do we run into the same issue there as we merge changes in?

You did invalidate that commit hash, but that's ok. =)

The final commit hash that would get pushed to bootstrap branch of scitran/core is the one after scitran/testdata is merged to master. See comments here. https://github.com/scitran/core/blob/bootstrap/docker/bootstrap-data.sh#L22-L23

The benefit is that, moving forward, merging changes to "master" of scitran/testdata would never break users who aren't at the tip of scitran/core. When the user of scitran/core decides to pull or rebase off the tip of "master" they get the new hash for the compatible testdata, and off they go.

When breaking changes are made to scitran/testdata, this makes the lives of downstream scitran/core developers easier, because they are no longer reliant on "getting the memo". It also means people merging scitran/testdata are no longer concerned with others "receiving the memo" or even properly understanding it.

@gsfr
Copy link
Member Author

gsfr commented Feb 26, 2016

I'm ready to drop the hammer on this today. Re testdata: I'm planning to blow away the history on master as well and temporarily keep the old stuff on a legacy branch.

@ryansanford: Standing by for your go-ahead on merging testdata.

@ryansanford
Copy link
Contributor

@gsfr
No objections to merging scitran/testdata

gsfr pushed a commit that referenced this pull request Feb 26, 2016
@gsfr gsfr merged commit 680930f into master Feb 26, 2016
@gsfr gsfr deleted the bootstrap branch February 26, 2016 21:51
gsfr pushed a commit that referenced this pull request Feb 26, 2016
gsfr pushed a commit that referenced this pull request Feb 26, 2016
gsfr pushed a commit that referenced this pull request Feb 26, 2016
@nagem nagem mentioned this pull request Jun 28, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants