Skip to content
This repository

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Git/Cassandra Ruby Library

branch: master

Fetching latest commit…

Octocat-spinner-32-eaf2f5

Cannot retrieve the latest commit at this time

Octocat-spinner-32 lib
Octocat-spinner-32 scripts
Octocat-spinner-32 server
Octocat-spinner-32 spec
Octocat-spinner-32 .document
Octocat-spinner-32 .gitignore
Octocat-spinner-32 DESIGN.txt
Octocat-spinner-32 LICENSE
Octocat-spinner-32 README.rdoc
Octocat-spinner-32 Rakefile
Octocat-spinner-32 TODO.txt
Octocat-spinner-32 VERSION.yml
Octocat-spinner-32 example.conf.xml
Octocat-spinner-32 post-receive.rb
README.rdoc

Agitmemnon

Agitmemnon is a Ruby library for filling and interacting with a Cassandra cluster with a specific keyspace for storing Git repository data. This library can take a Git repo on disk and fill the Cassandra keyspace with all the data it needs, then can be used to get Git data back out of it.

Currently the Cassandra keyspace looks something like this:

<Keyspaces>
    <Keyspace Name="Agitmemnon">
        <ColumnFamily CompareWith="UTF8Type" Name="Objects"/>
        <ColumnFamily ColumnType="Super" CompareWith="UTF8Type" CompareSubcolumnsWith="UTF8Type" Name="Repositories"/>
        <ColumnFamily CompareWith="UTF8Type" Name="CommitDiffs"/>
        <ColumnFamily ColumnType="Super" CompareWith="UTF8Type" CompareSubcolumnsWith="UTF8Type" Name="RevTree"/>
        <ColumnFamily CompareWith="UTF8Type" Name="PackCache"/>
        <ColumnFamily CompareWith="UTF8Type" Name="PackCacheIndex"/>
    </Keyspace>
</Keyspaces>

The Repository column family keeps one row per repository, which holds all the current references and the last updated (push) time. The Objects column family holds one row per Git object, with the SHA-1 of the object as the key. This allows one global space for all the repositories objects. The CommitDiffs family is used to keep the diff of each commit for fast retrieval. Finally, the RevTree family is keeping the commit revlist and difftree for each repository (though I'm not sure if I want to do it this way yet - if so, I'll have to shard the rows for larger repos).

The Keyspace Schema is thusly:

Repositories (main_listing) {repo_handle => {'updated' => Time.now.to_i.to_s}} Repositories (projectname) {:heads, :tags, :remotes} Object (sha) {:type, :size, :data, :json (for commits/trees)} RevTree (projectname) {:sha => {:parents, :objects}} PackCacheIndex (projectname) {(cache_key) => (list of objects), …} PackCache (cache_key) {:size => (size), :count => (count), :data => (list of objects)} # CommitDiffs (sha) {'diff' => diff}

I've also included a small Sinatra app that uses Agitmemnon to run a GitWeb style web UI, but it's just for testing stuff out and isn't very fully featured yet.

Right now this works pretty well for smaller repositories. To scale up, we need to be able to fragment large blobs into multiple smaller Object entries and then pull them together on clones. I also want to add a column family for partial packfile caches, so I can precalculate groups of objects and pull hundreds of small commit/tree objects out of the db at once on clones.

For cloning, I've also created a git-daemon (by modifying the dulwich project) that will actually run clones out of Agitmemnon. You can find this project here:

github.com/schacon/agitmemnon-server

Soon, this project will help in scaling Git to the MOON!

Copyright

Copyright © 2009 Scott Chacon. See LICENSE for details.

Something went wrong with that request. Please try again.