Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
320 lines (243 sloc) 7.78 KB

% Duffer's Guide to git Internals % Vaibhav Sagar

I'm going to try to give two talks in one: a `git` internals talk as well as a Haskell beginner talk. That is because this is the first ever Haskell code I wrote from scratch. I hope that despite my relative inexperience you will learn something new.

I have had zero success explaining either git or Haskell to people by talking at them, so please feel free to stop me and ask questions. I'm also a very fast talker expecially when I'm nervous, so don't hesitate to tell me to slow down.

What

My Library

  • an API for git repositories
  • Pure Haskell

Demo

Code

Types

data GitObjectGeneric ref container entries
    = Blob {blobContent :: B.ByteString}
    | Tree {treeEntries :: container entries}
    | Commit
        { commitTreeRef    :: ref
        , commitParentRefs :: [ref]
        , commitAuthor     :: PersonTime
        , commitCommitter  :: PersonTime
        , commitSignature  :: Maybe B.ByteString
        , commitMessage    :: B.ByteString
        }
    | Tag
        { tagObjectRef  :: ref
        , tagObjectType :: B.ByteString
        , tagName       :: B.ByteString
        , tagTagger     :: PersonTime
        , tagAnnotation :: B.ByteString
        }
    deriving (Eq)

type GitObject = GitObjectGeneric Ref Set TreeEntry
- A Blob represents a file - A Tree represents a directory listing - A Commit represents a snapshot of a project at a point in time - A Tag names another `git` object.

Types

data TreeEntry = TreeEntry
    { entryPerms :: EntryPermission
    , entryName  :: B.ByteString
    , entryRef   :: Ref
    }

data PersonTime = PersonTime
    { personName :: B.ByteString
    , personMail :: B.ByteString
    , personTime :: B.ByteString
    , personTZ   :: B.ByteString
    }
    deriving (Eq)

data EntryPermission
    = Directory
    | Regular
    | Executable
    | SymbolicLink
    | SubModule
    deriving (Show, Eq)

type Ref  = B.ByteString
type Repo = FilePath
Included for completeness.

Functor

parseTree :: Parser GitObject
parseTree = Tree . fromList <$> many' parseTreeEntry
- `attoparsec` is easily the best library I've ever used - provides a compelling reason to learn about Functor/Applicative/Monad

Applicative

parseMessage :: Parser B.ByteString
parseMessage = endOfLine *> takeByteString
I think this reads pretty well.

Monad (Transformers)

type WithRepo = ReaderT Repo IO

hasLooseObject :: Ref -> WithRepo Bool
hasLooseObject ref = do
    path <- asks (sha1Path ref)
    liftIO $ doesFileExist path
- I was threading through the repository path as a parameter until I found a random blog post talking about using the Reader monad for configuration. It's a perfect fit for this problem. Also convinced me that monad transformers are a good idea.

Bit Twiddling

setMSBs :: (Bits t, Integral t) => [t] -> [Word8]
setMSBs [] = []
setMSBs (i:is) = map fromIntegral $ i : map setMSB is
This is where most of my bugs live. I think this comes back to why we have types in the first place: an invalid bytestring typechecks the same as a valid bytestring, just like all data is bytes in memory. Serialisation was much more straightforward.

Looking forward to those bidirectional parsing libraries becoming commonplace.

Streaming

getNextEntry = do
    Just (Right tLen) <- PA.parse parseTypeLen
    baseRef <- case fst tLen of
        OfsDeltaObject -> do
            Just (Right offset) <- PA.parse parseOffset
            return $ encodeOffset offset
        RefDeltaObject -> do
            Just (Right ref)    <- PA.parse parseBinRef
            return $ fst $ decode ref
        _              -> return ""
    remainder <- get
    let decompressed = PZ.decompress' PZ.defaultWindowBits remainder
    PB.drawByte
    Just levelByte <- PB.peekByte
    let level = getCompressionLevel levelByte
    return (uncurry encodeTypeLen tLen, baseRef, decompressed, level)
This code is gross, but it works! I asked for help with this and I think the consensus was that the libraries for streaming `zlib` decompression are less than ideal.

Layout

src
├── Duffer
│   ├── Loose
│   │   ├── Objects.hs
│   │   └── Parser.hs
│   ├── Loose.hs
│   ├── Misc.hs
│   ├── Pack
│   │   ├── Entries.hs
│   │   ├── File.hs
│   │   └── Parser.hs
│   ├── Pack.hs
│   ├── Plumbing.hs
│   ├── Unified.hs
│   ├── WithRepo.hs
│   └── WorkObject.hs
└── Duffer.hs
`git` has a packfile representation that is used for compression and network transfer. I have parsers for the loose and packed representations and additional logic for dealing with some quirks of the packfile format.

Testing

  • duffer uses its own repo to test itself.
- `git` hashes everything so if even one byte is off it is very obvious. - Free test cases every time a new commit is created

Functional Git

Data Structures

  • Merkle trees & Merkle DAGs
  • Not too different from an efficiently implemented functional data structure.

Design

  • Append-only object store and mutable ref store
  • Separating the immutable and the mutable? Sounds like this language I know...

More

What's next?

  • A better API
  • Multiple backends
Maybe `lens` would help me come up with a better API for my library?

There's no reason the filesystem has to be the backing store of a repository.

Ideas

  • A git-based project management system
  • Using git as the backend of an application

My experience with Haskell

Advice for beginners

  • Don't get too hung up on understanding monads before you write any code
  • Lots of support on IRC and r/haskell
  • IHaskell is a godsend and everyone should use it
- What worked for me was learning how to desugar `do`-notation. - Haskell stars like Gabriel Gonzalez, OCharles, and EKmett hang out there. - It's the one thing keeping me from moving to GHC 8.

Bad

  • Documentation is mostly terrible
  • Hard to find the right libraries
  • Very demanding learning curve
- random blog posts or r/haskell comments - more random blog posts - difficult to see the motivation

Some thoughts

  • I thought Haskell was pretty silly the first time I encountered it
  • Only after a couple of bad production experiences did I see the point
  • How else do we provide a compelling reason to learn it?

What can you do with this

Historical Revisionism

Git Objects as a Service

Further Reading

Better Implementations

Interesting Projects

About this project

RC

<style> .reveal section img { border: none; box-shadow: none; } </style>

https://www.recurse.com

In lieu of an employer slide I thought I'd mention RC. If you're interested check us out :)

Questions?