The Omega Repo

Adam Bliss edited this page May 1, 2017 · 5 revisions

Definition

An omega-repo is a special type of mono-repo wherein the meta-repo and sub-repos live in one Git repository.

Usage

I have under review a command that will enable a full omega-repo workflow. To create a brand-new sub-repo:

$ git meta new my/sub/repo
Created new sub-repo my/sub/repo.  It is currently empty.  Please
stage changes and/or make a commit before finishing with 'git meta commit';
you will not be able to use 'git meta commit' until you do so.
$ touch my/sub/repo/README.md
$ git meta add .

Or, to create a sub-repo by importing existing history:

$ git meta new -i http://example.com/my-repo.git my/imported/repo

At this point, your new repositories are staged for commit:

On branch master.
Changes to be committed:
  (use "git meta reset HEAD <file>..." to unstage)

        new file:     my/imported/repo (submodule, newly created)
        new file:     my/sub/repo (submodule, newly created)
        new file:     my/sub/repo/README.md

You can commit them and push to a branch as normal:

$ git meta commit -m "created some sub-repos"
$ git meta push origin master:new-repos

It's important to note that the new command does not interact with any back-end and has no side-effects outside the repository on which it operates. The changes it makes and any created repositories (e.g., my/imported/repo) exist only in your local repository until pushed upstream.

Implementation

Implementing this scheme is easy and doesn't conflict with the existing proposal for mono-repos: when sub-repos are created, they simply have the same URL as the meta-repo. Specifically, each submodule's URL entry in the .gitmodules file is ".". Existing methods of opening, fetching from, and pushing to sub-repos remain the same. Alternatively, we could completely ignore the URLs configured for submodules.
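
As a rough illustration of the .gitmodules layout described above, the sketch below prints the stanza a sub-repo might get. It assumes that the "." entry is the submodule's url value; the helper name gitmodules_stanza is hypothetical and is not part of git-meta.

```shell
# Hypothetical helper: print a .gitmodules stanza for a sub-repo whose URL
# is "." (assumed here to mean "the same URL as the meta-repo").
gitmodules_stanza() {
    # $1 is the sub-repo path, used as both the submodule name and its path.
    printf '[submodule "%s"]\n\tpath = %s\n\turl = .\n' "$1" "$1"
}

gitmodules_stanza my/sub/repo
```

Every sub-repo gets the same url value, so only the name and path differ between stanzas.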

Sub-repo creation

A sub-repo is created by mounting a locally generated repository as a submodule. It must have at least one commit before you can push; beyond that, an ordinary push (e.g., git meta push) is sufficient to land a newly created repository.
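
The mounting step can be approximated with stock Git commands. This is only a sketch of the idea, not what git meta new actually does: the temporary directory, the commit messages, and the Example identity are made up, and the url value of "." is the same assumption as in the Implementation section.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/meta"
cd "$work/meta"

# 1. Generate the sub-repo locally and give it a first commit.
git init -q my/sub/repo
echo "hello" > my/sub/repo/README.md
git -C my/sub/repo add README.md
git -C my/sub/repo -c user.name=Example -c user.email=example@example.com \
    commit -q -m "initial commit"
sub_sha=$(git -C my/sub/repo rev-parse HEAD)

# 2. Describe the submodule in .gitmodules, using "." as its URL (assumption).
printf '[submodule "my/sub/repo"]\n\tpath = my/sub/repo\n\turl = .\n' > .gitmodules
git add .gitmodules

# 3. Mount it: record the sub-repo's commit as a gitlink (mode 160000).
git update-index --add --cacheinfo 160000,"$sub_sha",my/sub/repo

git -c user.name=Example -c user.email=example@example.com \
    commit -q -m "created a sub-repo"

# The meta-repo tree now holds a "commit" entry pointing at the sub-repo.
git ls-tree HEAD my/sub/repo
```

Nothing here touches a server; the new sub-repo exists only in the local meta-repo until it is pushed.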

Sub-repo deletion

Removing a submodule is sufficient to remove a sub-repo.

Sub-repo rename

Rename is effectively implemented as delete + create.

Summary

It's important to note that none of the lifecycle operations described above require interacting with the back-end. If a user locally creates a sub-repo then changes his or her mind, there is no "mess" left on the back-end.

Advantages

This approach solves several problems:

  • Submodule URLs -- Our current prototype relies on a unique capability of gitolite that allows repository names to contain nested paths, e.g. foo/bar/baz. No other widely-used Git hosting solution supports this feature. Without this feature, we would need a scheme to map nested paths to flat names, or translate URLs somewhere along the way.
  • Sub-repo lifecycle transitions -- There are many potential corner-cases, especially in the back-end, involved in creating, deleting, and renaming repositories. If sub-repos don't exist as separate back-end repositories, these issues mostly disappear, or at worst are resolved using normal Git conflict-resolution.
  • Sub-repo lifecycle APIs -- The Git protocol does not provide for server-side operations such as repository creation or deletion. Thus, such operations are tied to specific Git hosting solutions and fall outside the purview of git-meta. With the new approach, we can create, delete, and rename sub-repos using standard Git operations.
  • Sub-repo lifecycle changes are local -- When sub-repos must be backed one-to-one by back-end repos, you cannot make lifecycle changes without interacting with and manipulating the back-end. With the new approach, sub-repos can be created, removed, renamed, etc., with purely local operations.
  • Sub-repo lifecycle changes are first-class Git changes -- The entire history of a sub-repo lives in the mono-repo. If a user creates a sub-repo and never pushes the change that introduced it, it actually never happened.

Additionally, this approach has other advantages and allows for new possibilities:

  • Easier management -- Maintaining and working with a single repository on the back-end may dramatically simplify mono-repo maintenance.
  • Better mono-repo mobility -- Because everything lives in a single Git repository, it's extremely easy to move the mono-repo around. If a developer wants to work entirely locally, for example, it's very easy to fetch all necessary refs in one shot rather than issuing potentially thousands of fetch commands.
  • Any Git repo can be a mono-repo! -- With the new approach, any individual repository can be a mono-repo. A normal Github repository can be a mono-repo. You don't need to have a managed server instance with programmatic access for repository creation.
  • Mono-repos can be distributed again -- Because they are self-contained, mono-repos can be local, and they can be peers. Particularly, if we choose to ignore submodule URLs, we can have true remotes and leverage normal forking.

I think there are more advantages that I haven't considered. This approach gets us a little closer to the ideal of thinking of code as being in a single repository, where sub-repos are defined as an optimization to avoid the need to fetch and check out the whole world.

Disadvantages

I can think of a few disadvantages to this approach over having separate back-end repositories for sub-repos, and I'm sure others will find more:

  • We lose repo-level permissioning -- One advantage of breaking code into sub-repos is the ability to leverage, e.g., Github's ability to specify permissions on a per-repository basis. By using a single repository, we lose this. However, as we've discussed the mono-repo more, I've come to believe that this ability is not sufficient; we do want to think of the mono-repo as a single repo, and need the ability to specify permissions at a greater-than-repo level. For example, if all external repositories live in a single tree, we may wish to specify a single user (group) to review all external code additions.
  • Performance -- It may be less efficient to put all commits in a single repository, and with truly large systems, we may lose some ability to distribute work. I do not think this issue is truly a problem; the physical size of even very large repositories can be readily handled by modern distributed filesystems.
  • Discoverability -- We may get less use from built-in facilities for searching and navigation from, e.g., Github, if there are no formal sub-repos.
  • Cloning -- There is no way to quickly and easily clone a single sub-repo.
  • Submodule Refs -- You lose the ability to mirror meta-level refs inside the individual submodules, since pushing them to the server would cause a collision. But someday git meta will be able to manage these locally for you.
  • Poor vanilla performance -- The git submodule update --init command will be very slow, since by default it clones the entire omega-repo. By contrast, git meta open is unaffected, since it only fetches the named sha1.

Mitigation with namespaces

We may be able to mitigate some of the above disadvantages by using Git namespaces. We could establish a sub-repo namespace, where all branches in the meta-repo are mapped to branches in each sub-repo namespace; these branches and namespaces would be maintained by server-side hooks. Thus, there would be branches available to discovery tools to allow inspection of the heads in each sub-repo, and users could easily make local clones containing only refs from a specific sub-repo.
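
To make the ref layout concrete: Git stores the refs of a namespace foo under refs/namespaces/foo/, and a nested namespace such as foo/bar under refs/namespaces/foo/refs/namespaces/bar/. The sketch below computes where a sub-repo's branch would live if the sub-repo path were used directly as the namespace name. That naming choice is an assumption made for illustration, and the helper namespaced_ref is hypothetical.

```shell
# Hypothetical helper: map (namespace, ref) to the ref's storage location,
# following the nesting rule for Git namespaces.
# The body runs in a subshell so the IFS and globbing changes stay local.
namespaced_ref() (
    IFS=/      # split the namespace on "/"
    set -f     # do not glob-expand path components
    prefix=
    for part in $1; do
        prefix="${prefix}refs/namespaces/${part}/"
    done
    printf '%s%s\n' "$prefix" "$2"
)

# Where the master branch of sub-repo my/sub/repo would be stored.
namespaced_ref my/sub/repo refs/heads/master
```

A server-side hook maintaining these branches would need to write one such ref per sub-repo for each meta-repo branch.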

Experiment

@novalis and others raised questions about possible server-side performance problems with this scheme, especially for fetches and pushes. To address those concerns and see how a repository implemented this way would feel at scale, I created the generate-repo.js script and used it to create an enormous mono-repo, which I have hosted here: https://github.com/bpeabody/mongo. This repository has 260,000 commits on master and around 26,000 submodules.

Unfortunately, you can't clone that repository and begin using it like a mono-repo, because Github doesn't appear to support the uploadpack.allowReachableSHA1InWant setting that we need in order to fetch commits directly (neither do Gitlab or Bitbucket). You can, however, clone that repository and host it locally.
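
For a locally hosted copy, the setting can be enabled in the hosted repository's own configuration. The sketch below uses only stock Git commands to show a client fetching a single non-tip commit by its sha1. The temporary directories, file contents, and the Example identity are made up for illustration, and the tiny seeded repository merely stands in for a real omega-repo.

```shell
set -e
work=$(mktemp -d)

# "Server": a bare repository standing in for the locally hosted omega-repo.
git init -q --bare "$work/server.git"
git -C "$work/server.git" config uploadpack.allowReachableSHA1InWant true

# Seed it with two commits so the first one is reachable but not a branch tip.
git init -q "$work/seed"
cd "$work/seed"
echo one > file.txt
git add file.txt
git -c user.name=Example -c user.email=example@example.com commit -q -m "first"
first_sha=$(git rev-parse HEAD)
echo two > file.txt
git -c user.name=Example -c user.email=example@example.com commit -q -am "second"
git push -q "$work/server.git" HEAD:refs/heads/master

# Client: start empty and fetch exactly one commit by its sha1.
git init -q "$work/client"
cd "$work/client"
git fetch -q "$work/server.git" "$first_sha"
git cat-file -t "$first_sha"   # prints the object type of the fetched sha1
```

This is the same kind of direct fetch that opening a sub-repo relies on: the client asks for one named commit rather than for every ref in the repository.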

Findings so far are very promising. The clone operation shows the performance that would be expected from a repository with so many commits and submodules -- taking about 45s. Similarly, push and fetch take just a few seconds each. I have yet to see any evidence of the supralinear degradation that we feared, which would have been very evident in a repository of this size.