
Archive TC39's GitHub presence #4

Closed
littledan opened this issue May 27, 2018 · 34 comments

Comments

@littledan
Member

TC39 has historically had a script used to periodically checkpoint its GitHub presence for storage on Ecma's servers. This needs to be recreated now, as the current version is no longer working. cc @ecmageneva @keithamus

@keithamus
Member

Can we please get a list of requirements for what is needed? If regular source code snapshots are needed, then just git clone will suffice. If we need access to issue data, pull request data, and the source for each pull request, we can make use of the GraphQL API to get to all of that data relatively easily. If we need more info, such as the source code of forks and the issues and pull requests on those forks, that may be a more complicated endeavour.

Could someone please enumerate what the existing archival script recorded? This might be a useful starting point.

@allenwb
Member

allenwb commented May 29, 2018

The biggest concern is capturing metadata that isn't in git, in particular issue and PR comment threads.

The last time I tried, the current script worked but was slow, and occasionally failed because of rate limits or timeout issues.

Note that it is a raw dump, and currently there are no provisions for extracting data from the dump. But it should be possible.

So here is a scenario that Ecma needs to enable:

Assume it is the year 2070. GitHub (and git) are long-abandoned services and technologies. A history-of-technology PhD student wants to research the evolution of ES class features between ES 2015 and ES 2025. What does Ecma need to archive today to ensure that the raw material for that research will be available?

@allenwb
Member

allenwb commented May 29, 2018

The current backup script is in https://github.com/tc39/Ecma-secretariat

@IgnoredAmbience
Contributor

IgnoredAmbience commented May 30, 2018

The current backup script is in https://github.com/tc39/Ecma-secretariat

This repository doesn't appear to be visible to delegates.

@allenwb
Member

allenwb commented May 30, 2018

Fixed

@IgnoredAmbience
Contributor

Summary of that repository: the existing backup script is a wrapper around https://github.com/josegonzalez/python-github-backup

@keithamus
Member

I'm still unable to see the repo. Can someone please tell me how python-github-backup is invoked - specifically does it include the -F flag?

@IgnoredAmbience
Contributor

The line in question is: github-backup tc39 -o $BKUPDIR --all -O -P -t $TOKEN

The script was last modified on 11 Apr 2016, so it will correspond to version 0.9.0 of the python-github-backup tool.

@IgnoredAmbience
Contributor

Possibly the breaking change for running this script for Ecma was the dropping of Python 2 support?

@allenwb
Member

allenwb commented May 31, 2018

@keithamus you should now have access to the repo

@allenwb
Member

allenwb commented May 31, 2018

@IgnoredAmbience note that the script updates its copy of github-backup each time it is run.

@keithamus
Member

keithamus commented Jun 27, 2018

GitHub has started to offer a user migration API. With this API you can make a REST request to GitHub's migration endpoint, and it will kick off a process of .tar.gz'ing the following data:

  • attachments
  • bases
  • commit_comments
  • issue_comments
  • issue_events
  • issues
  • milestones
  • organizations
  • projects
  • protected_branches
  • pull_request_reviews
  • pull_requests
  • releases
  • repositories
  • review_comments
  • schema
  • users

You can then query for when the tar becomes available, and when it does, you will be given a URL to download the .tar.gz in its entirety.
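
For illustration, a rough sketch of that flow in Python, using the org-level variant of the migration endpoints (the token scope, repository list, and polling interval here are assumptions, not a tested implementation):

```python
import time
import requests

API = "https://api.github.com"
HEADERS = {
    "Authorization": "token YOUR_TOKEN",  # hypothetical token with sufficient org scope
    "Accept": "application/vnd.github+json",
}

# 1. Kick off a migration for the repositories to archive.
start = requests.post(
    f"{API}/orgs/tc39/migrations",
    headers=HEADERS,
    json={"repositories": ["tc39/ecma262"], "lock_repositories": False},
)
start.raise_for_status()
migration_id = start.json()["id"]

# 2. Poll until GitHub reports the archive has been exported.
while True:
    status = requests.get(f"{API}/orgs/tc39/migrations/{migration_id}", headers=HEADERS)
    status.raise_for_status()
    if status.json()["state"] == "exported":
        break
    time.sleep(60)

# 3. Download the finished .tar.gz from the URL GitHub provides (followed via redirect).
archive = requests.get(
    f"{API}/orgs/tc39/migrations/{migration_id}/archive",
    headers=HEADERS,
    allow_redirects=True,
)
archive.raise_for_status()
with open("tc39-migration.tar.gz", "wb") as f:
    f.write(archive.content)
```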

In other words, GitHub now has a blessed route to do (almost) everything that the python script does.

I say almost, because the big question we still have is: how important are forks? The backup script right now does --all, which implies -F, which goes ahead and downloads every fork in the tc39 network. With every proposal repo (which have anywhere from hundreds to thousands of forks) and the spec (6k forks and counting), you're talking north of 20-30,000 additional repository downloads. Forks (in general) offer a very low signal-to-noise ratio: many forks have 0 changes, some will have 1 or more changes which are already available in the central repo's PR data, and a small percentage will have commit data which was never pull-requested.

In summary (TL;DR): if we can forgo the requirement of downloading forks - which adds significant burden to the process - then GitHub has a recent, built-in, turnkey solution to this.

@littledan
Member Author

@keithamus Will this technique download forks that have PRs against the main repositories?

@ljharb
Member

ljharb commented Jun 27, 2018

If it includes the PR, does it need the full fork?

@allenwb
Member

allenwb commented Jun 27, 2018

We certainly don't need to archive thousands of working forks that never make their way back into the TC39 process. But here are the use cases we need to think about:

  • A delegate (or other active contributor to TC39) makes a fork of some proposal, extensively modifies it, and then presents it at one or more meetings as an alternative to the original proposal. The alternative is discussed by TC39 but ultimately isn't accepted.
  • Similar to above, except the discussion takes place as issues on the forked repository.

The content of such a forked repository should be archived as part of the TC39 deliberative record. I don't think we should accomplish this by trying to archive all forks, but we should have a documented process that the developer of such an alternative proposal can follow to make sure that it does get archived.

@ljharb
Member

ljharb commented Jun 27, 2018

Those forks should be transferred to TC39 in that case, I think (or forked to TC39, to achieve the same result).

@allenwb
Member

allenwb commented Jun 27, 2018

Those forks should be transferred to TC39 in that case, I think (or forked to TC39, to achieve the same result).

I agree. Actually, my main point is that we need to have a documented process and expectations to ensure this happens.

@keithamus
Member

I think if we can agree on a protocol that any fork presented in a meeting should be PR'd or transferred to the tc39 org, that would be vastly preferable to attempting to archive 20-30k+ forks. I don't want to sound like a broken record - but it pretty much hinges on this requirement, and downloading literal gigabytes of 99% duplicate data seems like a waste of time.

@keithamus
Member

An alternative could be that we iterate through every fork, check to see which forks have commits that don't feature in the source repo's history or PRs, and download only those forks. However, this would require a non-trivial amount of engineering effort and would still run into the same rate-limiting issues. So it seems like less of a tech problem and more of a people problem - adding rules and discipline can solve this more easily (and would be more procedurally correct).
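
A rough sketch of what that per-fork check could look like against the REST API (a default-branch-only comparison; the token and the ahead_by heuristic are assumptions, and forks whose unique commits live on non-default branches would be missed):

```python
import requests

API = "https://api.github.com"
HEADERS = {
    "Authorization": "token YOUR_TOKEN",  # hypothetical personal access token
    "Accept": "application/vnd.github+json",
}

def forks_with_unique_commits(owner, repo):
    """Yield full names of forks whose default branch is ahead of upstream."""
    base = requests.get(f"{API}/repos/{owner}/{repo}", headers=HEADERS)
    base.raise_for_status()
    base_branch = base.json()["default_branch"]

    page = 1
    while True:
        resp = requests.get(
            f"{API}/repos/{owner}/{repo}/forks",
            headers=HEADERS,
            params={"per_page": 100, "page": page},
        )
        resp.raise_for_status()
        forks = resp.json()
        if not forks:
            return
        for fork in forks:
            # Compare the upstream default branch with the fork's default branch.
            cmp = requests.get(
                f"{API}/repos/{owner}/{repo}/compare/"
                f"{base_branch}...{fork['owner']['login']}:{fork['default_branch']}",
                headers=HEADERS,
            )
            if cmp.ok and cmp.json().get("ahead_by", 0) > 0:
                yield fork["full_name"]
        page += 1

# Only forks that actually diverge would then be cloned/archived.
for name in forks_with_unique_commits("tc39", "ecma262"):
    print(name)
```

Even so, this still makes one comparison call per fork, so the rate-limiting concern above applies in full.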

@littledan
Member Author

It's true that there are some forks in use like what @allenwb describes, for example https://github.com/valtech-nyc/proposal-fsharp-pipelines . Let's encourage maintainers of these forks to establish a WIP PR against the main repository to opt into archiving.

For me, the high order bit is that we've been missing effective archives for some time, and getting started again will be a big benefit, even if it's not perfect in terms of coverage the first time.

@jorydotcom
Member

So I've been poking around at this a bit tonight & giving the endpoint @keithamus shared a shot.
Unfortunately I've run into two issues:

  1. It's not really supported yet; the Octokit SDK doesn't have any documentation on this so I kinda pieced it together and was able to get the expected responses.
  2. It doesn't seem to work for org repos. I was able to use the API to generate test archives of my own repos, but failed when I tried initiating an archive for small tc39 repos that I have admin rights to (like tc39/tc39-notes). It seems like the Orgs Migration endpoint will work, but you have to be an org owner - I'm not sure who that is to be honest. @littledan?

The github-backup script ran fine for me earlier until I was rate-limited; I may try it again overnight and see if it happens to work.

@allenwb @ecmageneva a nice feature might be to adopt something like what the WHATWG is doing, wherein they do snapshots as part of their build & deploy steps so you can go to a snapshot version at any given commit (they used to have a warning banner to make it more obvious you were looking at an outdated version of the spec; not sure what happened to that). It doesn't solve the document archival problem, but it does give access to a historical view at a given moment in time.

Here's their build script for reference.
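
For what it's worth, a minimal sketch of the per-commit snapshot idea (the build output and deploy paths are hypothetical, not WHATWG's actual layout):

```python
import shutil
import subprocess
from pathlib import Path

# Resolve the commit being deployed (assumes this runs inside a git checkout).
sha = subprocess.run(
    ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
).stdout.strip()

build_dir = Path("out")                               # hypothetical build output directory
snapshot_dir = Path("deploy/commit-snapshots") / sha  # hypothetical snapshot location

# Publish the freshly built spec both as the live copy and as a per-commit snapshot.
snapshot_dir.parent.mkdir(parents=True, exist_ok=True)
shutil.copytree(build_dir, snapshot_dir, dirs_exist_ok=True)
shutil.copytree(build_dir, Path("deploy/current"), dirs_exist_ok=True)
print(f"Snapshot available at commit-snapshots/{sha}/")
```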

@littledan
Member Author

Thanks for the great work here Jory!

Looks like a lot of people have owner access, including @jugglinmike @rwaldron @leobalter. I wouldn't mind giving @jorydotcom access if others are OK with it, cc @bterlson.

Adapting WHATWG's script sounds good to me, if we can make it work for us. Seems like really useful functionality.

@jorydotcom
Member

Good news, @ecmageneva! I ran Allen's script overnight and it definitely worked (took about 4 hours according to the timestamps). I'm betting the issue you had is related to @IgnoredAmbience's post & we need to update Python + pip on your computer.

Bad news: the resulting zip file is 484 MB, so I can't just email it to Patrick. I'll email you both to see if you're able to use Dropbox, or if I can just write to the NAS & y'all tell me where to put it.

@littledan RE adding some build functionality; do you think that would be something Yulia would be interested in discussing too?

@littledan
Member Author

I know it was raised in the TC39 meeting, but I honestly don't see much of a close dependency between archiving and the website or groups. If we just keep using GitHub, then this archiving strategy should "just work". Worth verifying of course. Cc @codehag

@codehag

codehag commented Aug 6, 2018

I agree that archiving should be treated separately, as if we try to work it into the website project we will lose focus and not do either well.

Regarding using GitHub for archiving, do we have a document outlining how this will work? I know that there is a crawler that Keith is working on, and that other people are working on getting the old archived material back into a browsable form. It would be great to have a meeting regarding this to understand where we are with everything.

@keithamus
Member

Apologies: I've been on a bit of a vacation and haven't been able to keep up to date with these things recently.

@jorydotcom I'd be happy to work with you on resolving the issues you have with GitHub's migrations feature. Using this would likely be preferable to the backup script - especially if it is getting continuously rate limited. The APIs are likely behind an ACL that means only owners can migrate. I would recommend we have owners try the migration endpoints.

@jorydotcom
Member

no worries, @keithamus - I totally agree the GH API would be preferable to the script & would be happy to work with you on the crawler &/or API issues. This is your domain, after all!

There's a small ad hoc history group forming with @allenwb & @ecmageneva that this work is specifically pertinent to. Would be great to have the archiving conversation tied in with that group if everyone agrees it makes sense.

@littledan
Member Author

@jorydotcom GitHub archiving seems like a very important history task. @IgnoredAmbience has been doing some great archiving work; maybe he would also be interested in this group.

@IgnoredAmbience
Contributor

I'm running on the assumption that all my archival work will go into this repository and thus be picked up by the tc39 org archive for ECMA when it is taken.

The one potential interface with the website would be to make the tc39.github.io/archives site fit better with the overall website design; I'd briefly discussed this with @codehag over IRC a couple of weeks back.

@jorydotcom
Member

@IgnoredAmbience would you be interested in joining an ad hoc discussion we're trying to arrange for the second week of September RE history? https://github.com/tc39/Reflector/issues/165

@ctcpip
Member

ctcpip commented May 22, 2024

resolved via https://github.com/tc39/archive

@ctcpip ctcpip closed this as completed May 22, 2024
@ecmageneva

Not sure where we are on this issue now.....
Just for those who are new here.

For several years Ecma tried to capture all TC39-related entries of the TC39 GitHub pages.
We never had a well-functioning solution where we could also search and find all the TC39 info again. Apparently GitHub did not offer a feature/service (I do not know what to call it) that would allow us to capture all TC39-relevant information into the Ecma private server (it was a Synology server; now I do not know) under the TC39 directory.

At the beginning, Allen (and myself, while the script still worked from Europe; later some timer issue did not allow it) and Jory gathered everything under TC39 GitHub with a script. But we always had a big problem: we only captured the files and never had effective software to search and present them. I complained about this in many TC39 meetings. So we put everything, roughly yearly (later even less frequently), into a huge ZIP file that got an Ecma file number (like TC39/2018/xyz). So, to be honest, in my opinion it was not a terribly useful exercise.

Then, for the last couple of years (I do not remember when Jory did the last script run), this is what we have done: we took all information from GitHub (practically via file duplication onto the Ecma private file server) that we regarded as key information for long-term Ecma storage (e.g. the contribution slides, GitHub drafts of the specs, a copy of the technical notes, etc.). We selected those documents that we regarded as relevant for Ecma's long-term storage; this is required by the WTO guidelines on how SDOs should work. So it is a manual process, and in my opinion we now have all relevant information in Ecma's long-term storage. The hand-picking of those documents is not too elegant, but it is OK; collecting the data into a ZIP file after each meeting takes about an hour of Secretariat work. We have been doing this for several years now, and the last run was immediately after the April 2024 meeting. So it works. Please note that according to Ecma rules we have to finish the meeting minutes and the ZIP file in less than three weeks, and we have always been able to meet that so far. We update the ZIP even after the three-week deadline, once the technical notes become available.

In parallel, GitHub (the MS company) also has a long-term storage project to save all the GitHub information for all GitHub users forever; I think it is somewhere in the Arctic region in Norway. But concretely what is behind it, what the plan is, and where they are, I just do not know.

So, this is the current situation. All in all we have a working solution, but of course - as with everything - it can also be improved.

@keithamus
Member

In parallel, GitHub (the MS company) also has a long-term storage project to save all the GitHub information for all GitHub users forever; I think it is somewhere in the Arctic region in Norway. But concretely what is behind it, what the plan is, and where they are, I just do not know.

This program is called the "Arctic Code Vault" and is situated in the Arctic World Archive in Svalbard, Norway. As to what is inside:

https://github.com/github/archive-program/blob/master/GUIDE.md#whats-inside

The snapshot consists of the HEAD of the default branch of each repository, minus any binaries larger than 100KB in size. (Repos with 250+ stars retained their binaries.) Each was packaged as a single TAR file.

@ljharb
Member

ljharb commented May 22, 2024

Sadly there's no way to see what's in there :-/ there's a bunch of valuable since-deleted content in there that'd be great to recover.
