tomo-waka/gitrail

gitrail

A CLI tool that extracts commit history from a local Git repository and outputs it as JSON Lines (.jsonl) files, suitable for ingestion into data warehouses and analytical systems.

Features

  • Reads the local .git directory directly via isomorphic-git — no git CLI required at runtime
  • Outputs one commit per line in JSON Lines format
  • Explicit extraction modes: --mode snapshot for independent extraction, --mode incremental for differential extraction using a state file
  • Handles multi-branch extraction with cross-branch deduplication

Requirements

  • Node.js ≥ 22.0.0
  • A local Git repository (cloned and fetched via your preferred method — gitrail reads .git data directly and does not require the git CLI)

Installation

npm install -g gitrail

Quick Start

# One-time extraction from a local clone
gitrail -b main ./my-repo

# Continuous extraction — fetch remote changes, then extract new commits
git -C ./my-repo fetch origin
gitrail -m incremental -b origin/main -s ./gitrail-state.json --on-missing-state snapshot ./my-repo

See the User Guide for detailed workflow patterns including incremental setup, release-tag-based extraction, and CI configuration.

CLI Reference

gitrail [options] <repository-path>
| Parameter | Alias | Type | Required | Default | Description |
| --- | --- | --- | --- | --- | --- |
| <repository-path> | | positional | | | Local path to the Git repository |
| --mode | -m | snapshot \| incremental | | snapshot | Extraction mode. snapshot runs independently of state; incremental reads state to extract only new commits. |
| --branch <ref> | -b | string (repeatable) | | | Ref to traverse from. Specify one or more times. |
| --output-dir <path> | -o | string | | ./ | Directory for output .jsonl files |
| --output-prefix <string> | | string | | derived | Filename prefix (derived from remote origin if omitted) |
| --state <path> | -s | string | | | State file path. Required with --mode incremental. |
| --on-missing-state | | error \| snapshot | | error | Behavior when state file is absent. Only valid with --mode incremental. |
| --since-ref <ref> | | string | | | Exclude commits reachable from this ref (tag, branch, or hash). Snapshot mode only. |
| --since-date <ISO8601> | | string | | | Include only commits after this datetime. Snapshot mode only. |
| --rotate-lines <n> | | number | | | Start new file after n lines |
| --rotate-size <bytes> | | number | | | Start new file after n bytes |
| --quiet | -q | boolean | | false | Suppress progress and summary output |

Progress updates and the final summary are written to stderr; use --quiet to suppress them. Validation errors exit with code 1; runtime errors with code 2. See the User Guide for the full list of mutual exclusion rules.
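For scripted runs, the two documented exit codes can be handled separately. The following is a minimal Python sketch, assuming `gitrail` is available on `PATH`. The helper names `describe_exit` and `run_incremental` are hypothetical, and treating exit code 0 as success is an assumption based on convention (only codes 1 and 2 are documented above).

```python
import subprocess

# Codes 1 and 2 are documented above; 0 meaning success is an assumption.
EXIT_MEANINGS = {
    0: "success",
    1: "validation error (check flags and arguments)",
    2: "runtime error (extraction failed)",
}


def describe_exit(code: int) -> str:
    """Map a gitrail exit code to a short description."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_incremental(repo: str, branch: str, state: str) -> str:
    """Run an incremental extraction and describe the outcome.

    Uses only flags listed in the CLI reference table.
    """
    cmd = [
        "gitrail", "--quiet",
        "-m", "incremental",
        "-b", branch,
        "-s", state,
        "--on-missing-state", "snapshot",
        repo,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return describe_exit(result.returncode)
```

A scheduler could retry on a runtime error but fail fast on a validation error, since the latter will not succeed without changing the command line.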

Output

Each line in the output .jsonl file is a JSON object representing one commit:

{
  "oid": "a1b2c3d4...",
  "subject": "Fix null pointer in auth module",
  "body": "",
  "author": {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "timestamp": "2024-01-15T09:00:00+09:00"
  },
  "committer": {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "timestamp": "2024-01-15T09:05:00+09:00"
  },
  "parents": ["parenthash1"],
  "repository": { "name": "my-repo", "url": "https://github.com/org/my-repo" }
}
| Field | Description |
| --- | --- |
| oid | Full SHA-1 commit hash |
| subject | First line of the commit message |
| body | Remainder of the commit message (empty string if none) |
| author | Person who originally authored the changes |
| committer | Person who committed (may differ from author after rebase/cherry-pick) |
| author.timestamp / committer.timestamp | ISO 8601 datetime using the offset embedded in the commit object |
| parents | Array of parent commit hashes (empty for the initial commit; two entries for merge commits) |
| repository.name | Repository name derived from remote origin URL (falls back to directory name) |
| repository.url | Remote origin URL, or null if no remote is configured |
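As an illustration of how these fields might be consumed downstream, here is a minimal Python sketch that classifies a commit by its `parents` array and checks whether author and committer are the same person. The function names `classify` and `author_is_committer` are hypothetical, the sample record is made up, and treating any commit with two or more parents as a merge is an assumption.

```python
import json

# Sample record in the shape shown above; all values are illustrative.
line = json.dumps({
    "oid": "a1b2c3d4",
    "subject": "Merge branch 'feature'",
    "body": "",
    "author": {"name": "Jane Doe", "email": "jane@example.com",
               "timestamp": "2024-01-15T09:00:00+09:00"},
    "committer": {"name": "John Roe", "email": "john@example.com",
                  "timestamp": "2024-01-15T09:05:00+09:00"},
    "parents": ["parenthash1", "parenthash2"],
    "repository": {"name": "my-repo", "url": None},
})


def classify(commit: dict) -> str:
    """Classify a commit by its number of parents."""
    count = len(commit["parents"])
    if count == 0:
        return "root"      # initial commit
    if count == 1:
        return "regular"
    return "merge"         # two or more parents (assumption)


def author_is_committer(commit: dict) -> bool:
    """True when author and committer share the same name and email."""
    a, c = commit["author"], commit["committer"]
    return (a["name"], a["email"]) == (c["name"], c["email"])


commit = json.loads(line)
kind = classify(commit)
same_person = author_is_committer(commit)
```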

Output files are named <prefix>-<timestamp>-000001.jsonl, <prefix>-<timestamp>-000002.jsonl, and so on. The prefix is derived from the repository's remote origin URL; use --output-prefix to override. The timestamp segment (YYYYMMDDTHHmmssZ) is captured once per session, so all files from a single run share the same timestamp and will not overwrite files produced by earlier runs. Use --rotate-lines or --rotate-size to split output across multiple files.
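Because every file from one run shares a timestamp, downstream loaders can group files by run. The following Python sketch parses the naming scheme described above. The regular expression is inferred from the example filenames rather than taken from gitrail itself, and `group_by_run` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Pattern inferred from "<prefix>-<timestamp>-000001.jsonl"; an assumption,
# not a format guarantee. The prefix itself may contain hyphens.
PATTERN = re.compile(
    r"^(?P<prefix>.+)-(?P<timestamp>\d{8}T\d{6}Z)-(?P<seq>\d+)\.jsonl$"
)


def group_by_run(filenames):
    """Group output files by (prefix, timestamp), ordered by sequence number."""
    runs = defaultdict(list)
    for name in filenames:
        match = PATTERN.match(name)
        if match is None:
            continue  # not a gitrail-style output file
        key = (match["prefix"], match["timestamp"])
        runs[key].append((int(match["seq"]), name))
    return {key: [name for _, name in sorted(items)]
            for key, items in runs.items()}


files = [
    "my-repo-20240115T000000Z-000002.jsonl",
    "my-repo-20240115T000000Z-000001.jsonl",
    "my-repo-20240116T000000Z-000001.jsonl",
    "notes.txt",
]
runs = group_by_run(files)
```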

Note: Output line order is not guaranteed to be chronological. Sort by committer.timestamp in your downstream system.
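A downstream sort might look like the Python sketch below. Timestamps carry their original offsets, so comparing them as plain strings can give the wrong order; the sketch parses them into timezone-aware datetimes first. The helper name `sort_commits` and the two sample records are hypothetical, and the sketch assumes timestamps use a numeric offset as in the example above.

```python
import json
from datetime import datetime


def sort_commits(lines):
    """Parse JSONL lines and sort commits by committer timestamp."""
    commits = [json.loads(line) for line in lines if line.strip()]
    return sorted(
        commits,
        key=lambda c: datetime.fromisoformat(c["committer"]["timestamp"]),
    )


# Two illustrative records with different offsets.
# "b" sorts first as a string, but "a" is earlier in absolute time:
#   a = 2024-01-15T00:05:00 UTC, b = 2024-01-15T01:00:00 UTC
lines = [
    '{"oid": "b", "committer": {"timestamp": "2024-01-14T20:00:00-05:00"}}',
    '{"oid": "a", "committer": {"timestamp": "2024-01-15T09:05:00+09:00"}}',
]
ordered_oids = [c["oid"] for c in sort_commits(lines)]
```

In a data warehouse, the equivalent is to cast the field to a timezone-aware timestamp type before ordering.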

Documentation

  • User Guide — detailed workflows, mode explanations, and full CLI reference
  • Changelog — release history and notable changes by version

Developer Guide

  • Contributing Guide — local setup, quality checks, and pull request workflow
  • Architecture — layer responsibilities, end-to-end flow, and key design decisions
  • Git Traversal — DAG traversal, differential extraction modes, and deduplication strategy
  • Output Schema — JSONL format, field definitions, timestamp conversion, and file rotation

License

MIT
