
Backup GitHub information #20

Closed · brettcannon opened this issue Feb 11, 2017 · 24 comments

@brettcannon
Member

Better safe than sorry.

@dhimmel

dhimmel commented Feb 16, 2017

Versioned content history

Better safe than sorry.

Certainly!

Additionally, GitHub issues are editable and deletable. Accordingly, there is potential for abuse (e.g. deceptive revisions) that a diff-tracking backup system would help prevent. On the other hand, in the case of content that needs to be purged (e.g. copyrighted, malicious, or inappropriate material), persistence of old versions could be problematic.

I recently created a backup solution for a text corpus (dhimmel/thinklytics) that errs on the side of versioning. To back up the content and track history, this repo uses scheduled Travis CI builds to download and process the content. If successful, the CI job commits the changes back to GitHub. I'm not sure whether this method would scale to the activity level of the Python repositories, especially if you'd like to back up all content, including uploads and images attached to comments. So, just a thought.
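
For illustration, a minimal sketch of the download half of such a scheduled job, using the public issues endpoint; the token variable, repo name, and output path are placeholders rather than anything from the thinklytics setup:

```python
import json
import os

import requests

API = "https://api.github.com"
HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}


def fetch_issues(owner, repo):
    """Page through every issue and PR (open and closed) in owner/repo."""
    issues, page = [], 1
    while True:
        resp = requests.get(
            f"{API}/repos/{owner}/{repo}/issues",
            headers=HEADERS,
            params={"state": "all", "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return issues
        issues.extend(batch)
        page += 1


if __name__ == "__main__":
    # Write the result where a scheduled CI job could commit it back.
    with open("issues.json", "w") as f:
        json.dump(fetch_issues("python", "core-workflow"), f, indent=2, sort_keys=True)
```

A cron-scheduled CI build could run something like this and then commit the resulting JSON back, which is roughly the pattern described above.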

@Mariatta
Member

There are several options listed at https://help.github.com/articles/backing-up-a-repository/
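
For example, the simplest of those options, a mirror clone refreshed on a schedule, might look roughly like this sketch (repo URL and destination are placeholders):

```python
import pathlib
import subprocess


def mirror(repo_url, dest):
    """Create or refresh a bare mirror clone of repo_url at dest."""
    dest = pathlib.Path(dest)
    if dest.exists():
        # Fetch every ref and prune ones deleted upstream.
        subprocess.run(
            ["git", "--git-dir", str(dest), "remote", "update", "--prune"],
            check=True,
        )
    else:
        subprocess.run(["git", "clone", "--mirror", repo_url, str(dest)], check=True)


mirror("https://github.com/python/cpython.git", "cpython.git")
```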

@Mariatta
Member

Mariatta commented Jun 8, 2018

New GitHub Migrations API: https://developer.github.com/changes/2018-05-24-user-migration-api/
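
A hedged sketch of kicking off an organization migration with that API, based on its documented POST /orgs/{org}/migrations endpoint; the org, repository list, and token are placeholders, and the preview media type reflects that the API is still in preview:

```python
import os

import requests

resp = requests.post(
    "https://api.github.com/orgs/python/migrations",
    headers={
        "Authorization": f"token {os.environ['GITHUB_TOKEN']}",  # must be an org owner's token
        # Media type used while the Migrations API is in preview.
        "Accept": "application/vnd.github.wyandotte-preview+json",
    },
    json={"repositories": ["python/cpython"], "lock_repositories": False},
)
resp.raise_for_status()
migration = resp.json()
print(migration["id"], migration["state"])  # state starts out as "pending"
```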

@Mariatta
Member

Mariatta commented Jun 8, 2018

I haven't tried the new Migrations API, but I've tried one of the backup mechanisms mentioned in https://help.github.com/articles/backing-up-a-repository/, using GitHub Records Archiver.

I used my personal access token to run the script. It was able to back up these repos for me within the python organization before it ran into the API rate limit 😛

But for each of the projects that it did back up:

  • it downloaded the issues and PRs in both .md and .json formats
  • it made a git clone of the repo, retaining the git history

These are the projects it was able to back up before I used up all my available API calls (a quick way to check the remaining quota is sketched after the list):

community-starter-kit		psf-ca				pypi-cdn-log-archiver
docsbuild-scripts		psf-chef			pypi-salt
getpython3.com			psf-docs			pythondotorg
historic-python-materials	psf-fastly			raspberryio
mypy				psf-salt			teams
peps				psfoutreach
planet				pycon-code-of-conduct
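
For what it's worth, a quick way to check how much quota a token has left before a run, using the documented GET /rate_limit endpoint (a rough sketch):

```python
import os

import requests

resp = requests.get(
    "https://api.github.com/rate_limit",
    headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
)
resp.raise_for_status()
core = resp.json()["resources"]["core"]
print(f"{core['remaining']} of {core['limit']} core API calls remaining")
```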

@Mariatta
Member

Mariatta commented Jun 8, 2018

OK, I just read this about the Migrations API:

The Migrations API is only available to authenticated organization owners. The organization migrations API lets you move a repository from GitHub to GitHub Enterprise.

This is as far as I can go, since I'm not a Python organization owner :)

@brettcannon
Member Author

@Mariatta I would ask @ewdurbin if you can maybe become an org owner so you can continue to look into this (or bug him to do it 😉).

@ewdurbin
Member

ewdurbin commented Jun 8, 2018

I kicked off an archive for python/cpython just to see what it produces. Once it finishes I'll summarize the contents here and we can discuss if it fits our needs.

@Mariatta
Member

Mariatta commented Jun 8, 2018

Thanks @brettcannon and @ewdurbin :)

I archived my own project (black_out), and the output is [link expired]

It's not as huge as CPython, so I figured it might be easier to analyze.

@Mariatta
Member

Mariatta commented Jun 8, 2018

Never mind that link above, it timed out 😅 Here is the downloaded content:
f8244650-6b4c-11e8-8b72-4d7fe688e0a1.tar.gz

@ewdurbin
Member

The result of the Migrations API dump appears to have everything and is well organized.

Since the intention of the dump is migration from GitHub to GitHub Enterprise, and the dump is an official GitHub offering (although currently in preview), it seems to be the solution least likely to require regular maintenance beyond ensuring it runs and that we collect and store the tarball safely.

Summary of what's there; on a cursory glance these generally line up with GitHub API objects in JSON format:

schema.json: contains a version specifier for what I assume is the dump version, and a github_sha for what I assume is the version of the GitHub codebase that ran the dump.

repositories_NNNNNN.json: metadata about the repository's GitHub configuration, including the enabled settings (has_issues, has_wiki, has_downloads) as well as the labels, collaborators, and webhooks configured.

repositories: directory containing the actual git repos!

protected_branches.json: the configuration for branches that have specific requirements for merging. This includes review requirements, status checks, and enforcement settings.

users_NNNNNN.json and organizations_NNNNNN.json: metadata about all GitHub users and associated organizations that have interacted with the repository via commit, PR, PR review, or comment.

teams_NNNNNN.json: contains the various teams defined in the organization and their permissions on various repositories.

Beyond that we get into the primitives that comprise what we see as a "Pull Request" or "Issue"; again, these appear to line up 1:1 with JSON objects from the GitHub API (a quick sanity-check sketch over these files follows the list):

attachments and attachments_NNNNNN.json

pull_requests_NNNNNN.json and issues_NNNNNN.json

pull_request_reviews_NNNNNN.json

commit_comments_NNNNNN.json, issue_comments_NNNNNN.json, pull_request_review_comments_NNNNNN.json
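
For a quick sanity check of a dump like this, a small sketch that extracts the tarball and counts the records in each JSON file; the archive name is a placeholder and it assumes the JSON files sit at the top level of the extracted archive, as the listing above suggests:

```python
import json
import pathlib
import tarfile
from collections import Counter

ARCHIVE = "migration_archive.tar.gz"  # placeholder name for the downloaded tarball
DEST = pathlib.Path("migration_dump")

with tarfile.open(ARCHIVE) as tar:
    tar.extractall(DEST)

counts = Counter()
for path in sorted(DEST.glob("*.json")):
    if path.name == "schema.json":
        continue
    with path.open() as f:
        records = json.load(f)
    # Drop the numeric suffix so issues_000001.json and issues_000002.json
    # are tallied under the same key.
    kind = path.stem.rsplit("_", 1)[0]
    counts[kind] += len(records) if isinstance(records, list) else 1

for kind, n in counts.most_common():
    print(f"{kind}: {n}")
```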

@Mariatta
Member

Thanks for the update, @ewdurbin!
I'm a little curious how long the backup takes, but it doesn't matter much. The backup data looks great to me!! 😄

Will you be able to set up daily backups for the python GitHub org? (cpython is higher priority I would think 😇)
I assume this is something that can be stored within PSF's infrastructure.

Thanks!

@ewdurbin
Member

@Mariatta the backup took about 15 minutes to run, but it's asynchronous, so we can just kick one off and then poll for completion before pulling the archive.

The result was 320 MB, so I'm curious whether weekly might suffice for now?

If we stick with daily, what kind of retention would we want?

Daily backups for the past week, weekly backups for the past month, monthly backups forever?
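
For concreteness, a sketch of that retention rule; the tier boundaries and the "keep Mondays / keep the 1st of the month" choices are illustrative assumptions, not a decided policy:

```python
import datetime


def keep(archive_date, today):
    """True if an archive taken on archive_date should be retained under the rule."""
    age = (today - archive_date).days
    if age <= 7:
        return True  # daily tier: keep everything from the past week
    if age <= 31:
        return archive_date.weekday() == 0  # weekly tier: keep Mondays for a month
    return archive_date.day == 1  # monthly tier: keep the 1st of each month forever


today = datetime.date(2018, 6, 12)
archives = [today - datetime.timedelta(days=d) for d in range(120)]
retained = [d for d in archives if keep(d, today)]
print(f"{len(retained)} of {len(archives)} archives retained")
```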

@Mariatta
Member

Hmm, I don't know what the usual good backup practice is; open to suggestions.
I'm thinking at the very least we really should do daily backups. How long to keep them, I don't know :)

@hroncok
Contributor

hroncok commented Jun 12, 2018

How crazy would it be to stick everything into a git repo that would be hosted on GitHub but also mirrored somewhere else?

@ewdurbin
Member

That's probably not completely out of the realm of reason. The biggest concern there would be attachments and the notorious "git + big/binary files" limitations.

@dhimmel

dhimmel commented Jun 12, 2018

the biggest concern there would be attachments and the notorious "git + big/binary files" limitations.

For large binary files, I would suggest using Git LFS. GitHub supports LFS files up to two gigabytes in size. If your organization qualifies for GitHub Education, you can request a free LFS quota. It is also possible to use a GitHub repository with the LFS assets stored on GitLab; however, the interface is less user-friendly that way.

@hroncok
Contributor

hroncok commented Jun 12, 2018

Well, I assume there are limits for the attachments anyway. @ewdurbin, could you check what the biggest file there is?

@ewdurbin
Member

Limitations on attachments are documented here

@hroncok
Contributor

hroncok commented Jun 12, 2018

That's not that big. I mean of course versioning a 25 MB binary blob will eventually be crazy, but those attachments don't change over time IMHO.

@brettcannon
Member Author

I think Ernest's retention policy suggestion works.

@ewdurbin
Member

Okay, for the initial pass I'll set up a task to kick off the "migration" each day and fetch it once complete.

I think the archives can just be dropped in an S3 bucket with a little bit of structure and some retention policies to automatically clear out unnecessary archives.

Will post back here with more information.
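
For illustration, the S3 side of that could look roughly like the following; the bucket name and prefixes are placeholders, not the actual PSF setup:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="python-org-github-backups",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-daily-archives",
                "Filter": {"Prefix": "daily/"},
                "Status": "Enabled",
                # Drop daily archives once they are about a month old.
                "Expiration": {"Days": 31},
            },
            # No rule for a monthly/ prefix, so archives stored there are kept indefinitely.
        ]
    },
)
```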

@ewdurbin
Member

I never came back and updated this. We ended up using Backhub for this. It keeps daily snapshots for the past month and pushes archives to S3 as well.

@pradyunsg
Member

In that case, this issue can probably be closed! :)

@hugovk
Member

hugovk commented Jan 27, 2022

Backup is set up, and there have been no comments in the past ~2 years, so closing! 💾💾

@hugovk closed this as completed Jan 27, 2022