git-privacy-filter

A git pre-commit and pre-push hook that scans what you're about to commit or push with OpenAI's privacy-filter (opf) model and blocks the operation when it detects any of:

| category        | covers                                   |
|-----------------|------------------------------------------|
| account_number  | bank / card numbers                      |
| private_address | street addresses                         |
| private_date    | birthdates, etc.                         |
| private_email   | email addresses                          |
| private_person  | names                                    |
| private_phone   | phone numbers                            |
| private_url     | URLs treated as private (e.g. webhooks)  |
| secret          | API keys, tokens, passwords, credentials |

Unlike regex-based tools (detect-secrets, gitleaks, ggshield), this uses opf's learned 1.5B-parameter sparse-MoE classifier, so it catches novel credential formats and PII that regex rules don't know about. The cost: a ~3 GB first-run model download and a few seconds of inference per commit.

Install

You need Python 3.11+ and the opf package available at hook-invocation time. git-privacy-filter declares opf as a dependency, so installing this tool pulls opf with it.

Pick one of:

# uv users (recommended)
uv tool install git+https://github.com/<owner>/git-privacy-filter

# pipx users
pipx install git+https://github.com/<owner>/git-privacy-filter

# pre-commit framework users — see "Via the pre-commit framework" below

Then preflight:

git-privacy-filter doctor

doctor reports whether opf is importable and whether the model weights are on disk. The first run triggers a ~3 GB download from HuggingFace into ~/.opf/privacy_filter/; doctor --load-model additionally loads the model to warm the torch caches.
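The two preflight checks doctor performs can be sketched in a few lines of Python. This is illustrative only: the function name, return shape, and weights path below are assumptions, not the tool's real internals.

```python
from importlib.util import find_spec
from pathlib import Path


def doctor_report(weights_dir: str = "~/.opf/privacy_filter") -> dict:
    """Sketch of the two preflight checks: is opf importable, and are
    the model weights already cached on disk?
    (Hypothetical helper -- not git-privacy-filter's actual code.)"""
    return {
        # find_spec returns None when the package can't be found,
        # without actually importing (and paying for) it.
        "opf_importable": find_spec("opf") is not None,
        "weights_cached": Path(weights_dir).expanduser().is_dir(),
    }


print(doctor_report())
```

Probing with find_spec rather than a bare import keeps the check cheap: it never triggers opf's own (heavy) import-time work.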

Install size and CPU-only default

git-privacy-filter pins torch to the CPU-only PyTorch wheels (torch==X.Y.Z+cpu), routed through https://download.pytorch.org/whl/cpu via [tool.uv.sources] in our pyproject.toml. This cuts ~3.7 GB of transitive NVIDIA CUDA libraries (nvidia-cudnn, nvidia-cublas, nvidia-cusolver, cuda-toolkit, triton, …) that the default PyPI torch wheel drags in on Linux.

Rough footprint with the CPU-only default:

| component         | size    |
|-------------------|---------|
| uv tool venv      | ~900 MB |
| opf model weights | ~2.8 GB |
| total             | ~3.7 GB |

Versus roughly 7.5 GB if you install with the default CUDA-enabled torch.

GPU users: at runtime the tool auto-detects via torch.cuda.is_available() (PyTorch exposes both NVIDIA CUDA and AMD ROCm devices under the same cuda namespace), so once the right torch wheel is in the environment, no code changes are needed. If auto-detection picks the wrong device, the GIT_PRIVACY_FILTER_DEVICE environment variable is the escape hatch.
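The resolution order just described (env-var override first, then auto-detect, then CPU) can be sketched as follows. pick_device is a hypothetical name, and the CUDA probe result is passed in explicitly so the sketch runs without torch installed:

```python
import os


def pick_device(cuda_available: bool) -> str:
    """Resolve the inference device: GIT_PRIVACY_FILTER_DEVICE wins,
    otherwise use "cuda" when torch reports an accelerator (NVIDIA
    CUDA or AMD ROCm), else fall back to "cpu".
    Illustrative sketch only, not the project's actual API."""
    override = os.environ.get("GIT_PRIVACY_FILTER_DEVICE")
    if override:
        return override
    return "cuda" if cuda_available else "cpu"


# At runtime the boolean would come from torch.cuda.is_available().
print(pick_device(cuda_available=False))  # cpu (unless the env var is set)
```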

NVIDIA / CUDA

uv tool install \
    --index https://download.pytorch.org/whl/cu121 \
    git+https://github.com/<owner>/git-privacy-filter

AMD / ROCm

Short version: host side first, then the Python side.

Host-side (Fedora 40+, no third-party repos):

sudo dnf install rocminfo rocm-hip rocm-smi rocm-clinfo rocm-opencl
sudo usermod -aG render,video "$USER"   # log out / back in for groups
rocminfo | grep -A1 "Marketing Name"    # confirm your GPU shows up
rpm -q rocm-hip --qf '%{VERSION}\n'     # note the major.minor (e.g. 6.2)

On Ubuntu/Debian: sudo apt install rocm reaches roughly the same state (AMD's own apt repo is the fallback when distros don't package ROCm natively).

Python-side — dedicated venv (recommended for ROCm):

uv tool install on this project doesn't work cleanly with the rocm6.2 torch wheels today: those wheels only cover cp39-cp312, and requires-python = ">=3.11" forces uv's lockfile to also resolve on cp313, which has no rocm6.2 torch wheel. Until the rocm index catches up, use a dedicated venv for the ROCm setup instead:

cd /path/to/git-privacy-filter

# Pin to Python 3.12 (rocm6.2 torch wheels don't exist for 3.13).
uv venv --python 3.12 .venv-rocm
source .venv-rocm/bin/activate

# Install torch + its ROCm-only transitive dependencies straight from the rocm index.
# Match the rocm6.x minor to whatever `rpm -q rocm-hip` printed.
uv pip install \
    --extra-index-url https://download.pytorch.org/whl/rocm6.2 \
    --index-strategy unsafe-best-match \
    'torch>=2.5' pytorch-triton-rocm

# Install opf + git-privacy-filter itself (torch is already satisfied).
uv pip install \
    --extra-index-url https://download.pytorch.org/whl/rocm6.2 \
    --index-strategy unsafe-best-match \
    -e .

# Verify
python -c 'import torch; print(torch.__version__, torch.cuda.is_available())'
# Expect: 2.5.1+rocm6.2  True   (PyTorch exposes HIP devices under "cuda")

To use it globally, symlink the venv's binary onto your PATH (or source the venv in your shell init):

ln -sf "$PWD/.venv-rocm/bin/git-privacy-filter" ~/.local/bin/git-privacy-filter

The hook scripts call plain git-privacy-filter, so whichever copy is first on $PATH wins; the ROCm-venv symlink above therefore takes precedence over any previously uv tool install-ed CPU version.

The commented pytorch-rocm blocks in pyproject.toml are kept as a reference for when the rocm6.2 index catches up to cp313; at that point the uv tool install --reinstall --python 3.12 flow becomes the cleaner path.

Troubleshooting: if torch.cuda.is_available() prints False, check group membership (id -nG should include render), /dev/kfd permissions, and dmesg | grep amdgpu. If you see HIP error: no kernel image is available, your GPU's gfx target isn't in the wheel's compiled set; set HSA_OVERRIDE_GFX_VERSION=11.0.0 (RDNA3) or 10.3.0 (RDNA2).

Note for PyPI consumers: [tool.uv.sources] is a uv-only directive; pip install git-privacy-filter (once we publish) will still pull the default CUDA-enabled torch. If you're a pip user on a CPU-only machine and want the small install, pre-install the CPU wheel:

pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install git-privacy-filter

Via the pre-commit framework

Add to your repo's .pre-commit-config.yaml:

repos:
  - repo: https://github.com/<owner>/git-privacy-filter
    rev: v0.1.0
    hooks:
      - id: git-privacy-filter         # runs on every commit
      - id: git-privacy-filter-push    # runs on every push

Then pre-commit install --hook-type pre-commit --hook-type pre-push.

Standalone (without the pre-commit framework)

From inside any git repo:

git-privacy-filter install

This writes small shell wrappers into the repo's hooks directory (respecting core.hooksPath) that exec git-privacy-filter precommit / prepush. A magic comment on the second line lets git-privacy-filter uninstall safely remove our hooks while leaving any other hooks you have in place untouched.
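The uninstall-safety check can be sketched like this. The marker text below is made up for illustration; the real magic comment may differ:

```python
MAGIC = "# managed by git-privacy-filter"  # hypothetical marker text


def is_our_hook(script: str) -> bool:
    """True only when the second line of a hook script carries the
    magic comment, so `uninstall` never touches hooks we didn't write.
    (Illustrative sketch, not the tool's real code.)"""
    lines = script.splitlines()
    return len(lines) >= 2 and lines[1].strip() == MAGIC


ours = "#!/bin/sh\n# managed by git-privacy-filter\nexec git-privacy-filter precommit\n"
theirs = "#!/bin/sh\nmake lint\n"
print(is_our_hook(ours), is_our_hook(theirs))  # True False
```

Keying on an exact marker line (rather than, say, grepping for the tool's name anywhere) means a user's hand-written hook that merely mentions git-privacy-filter is never mistaken for one of ours.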

Configure

Drop a .git-privacy-filter.toml at your repo root:

[allowlist]
# Skip these paths entirely — the scanner never even runs on them.
# Useful for test fixtures that intentionally contain credential-like
# patterns, lockfiles, vendored third-party code, etc.
paths = [
    "tests/fixtures/**",
    "**/*.lock",
    "vendor/**",
]

[categories]
# Optional. Omit both keys to enable all 8 categories.
# Use EITHER `enabled` OR `disabled`, not both.
enabled  = ["secret", "private_email"]   # only these two block
# disabled = ["private_person"]          # or: block everything except these

Malformed config is a hard error, not a silent "revert to defaults": a mistyped allowlist_paths that disabled every filter would be the worst kind of bug.
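A sketch of those fail-hard semantics (function name hypothetical): unknown keys, and enabled and disabled given together, raise instead of silently defaulting:

```python
ALL_CATEGORIES = {
    "account_number", "private_address", "private_date", "private_email",
    "private_person", "private_phone", "private_url", "secret",
}


def resolve_categories(section: dict) -> set:
    """Turn a parsed [categories] table into the set of blocking
    categories, raising on anything malformed rather than falling back
    to defaults. (Illustrative sketch, not the tool's real code.)"""
    unknown = set(section) - {"enabled", "disabled"}
    if unknown:
        raise ValueError(f"unknown [categories] keys: {sorted(unknown)}")
    if "enabled" in section and "disabled" in section:
        raise ValueError("use either `enabled` or `disabled`, not both")
    if "enabled" in section:
        bad = set(section["enabled"]) - ALL_CATEGORIES
        if bad:
            raise ValueError(f"unknown categories: {sorted(bad)}")
        return set(section["enabled"])
    if "disabled" in section:
        bad = set(section["disabled"]) - ALL_CATEGORIES
        if bad:
            raise ValueError(f"unknown categories: {sorted(bad)}")
        return ALL_CATEGORIES - set(section["disabled"])
    return ALL_CATEGORIES  # both keys omitted: everything blocks
```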

Inline allow markers

For one-off exceptions, annotate the offending line:

# Bare form — allows all categories on this line:
AWS_FAKE_KEY = "AKIA..."  # git-privacy-filter: allow

# Category-scoped — allows only `secret` here; private_email on the same
# line would still block:
USER_AND_TOKEN = ("alice@example.com", "ghp_...")  # git-privacy-filter: allow secret

# For a line that can't carry a trailing comment (e.g. a generated blob),
# use the previous line:
# git-privacy-filter: allow-next-line
1600_Pennsylvania_Avenue_NW_Washington_DC_20500

Any comment prefix works (#, //, --, ;, /*): we match on the git-privacy-filter: token, not on the comment syntax.
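Marker matching can be sketched with a single regex keyed on the token rather than the comment prefix. The pattern and function name below are illustrative, not the tool's real implementation:

```python
import re

# Key on the `git-privacy-filter:` token itself, so any comment syntax
# (#, //, --, ;, /*) in front of it works. Illustrative pattern only.
MARKER = re.compile(r"git-privacy-filter:\s*allow(?:-next-line)?\b((?:\s+\w+)*)")


def allow_marker(line: str):
    """None if the line has no marker; an empty set for the bare form
    (all categories allowed); a set of names for the scoped form."""
    m = MARKER.search(line)
    return set(m.group(1).split()) if m else None


print(allow_marker('key = "AKIA..."  # git-privacy-filter: allow'))        # set()
print(allow_marker('pair = (...)  // git-privacy-filter: allow secret'))   # {'secret'}
```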

Emergency bypass

git commit --no-verify skips all hooks, including this one. Use it sparingly: the commit itself leaves a paper trail, but nothing gets blocked.

Manual scan

# Scan the currently staged diff without committing:
git-privacy-filter scan

# Scan a range of commits (what a pre-push would see):
# (no CLI flag yet; invoke the pre-push code path directly):
echo "refs/heads/feature $(git rev-parse HEAD) refs/heads/feature $(git rev-parse origin/main)" \
    | git-privacy-filter prepush
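Each stdin line in that format is <local_ref> <local_sha> <remote_ref> <remote_sha>, which is what git itself writes to a pre-push hook. The per-ref decision logic can be sketched as follows (plan_refs is a hypothetical name):

```python
ZERO_SHA = "0" * 40  # git's all-zeros sha marks a missing ref


def plan_refs(stdin_text: str) -> list:
    """Parse pre-push stdin lines and decide what to scan per ref:
    deletions are skipped, brand-new branches scan everything reachable
    from the local sha, and updates scan the remote..local range.
    (Illustrative sketch, not the tool's real code.)"""
    plans = []
    for line in stdin_text.strip().splitlines():
        local_ref, local_sha, remote_ref, remote_sha = line.split()
        if local_sha == ZERO_SHA:      # branch deletion: nothing to scan
            continue
        if remote_sha == ZERO_SHA:     # new branch: scan all of local_sha
            plans.append((local_ref, local_sha))
        else:                          # update: scan remote_sha..local_sha
            plans.append((local_ref, f"{remote_sha}..{local_sha}"))
    return plans
```

Each (ref, range) pair would then feed the git log -p invocation described under "How it works".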

Development

Clone + sync:

git clone https://github.com/<owner>/git-privacy-filter
cd git-privacy-filter
uv sync --group test --group dev

Fast unit tests (the default — mocked opf, milliseconds):

uv run pytest

Tier-2 integration tests (tmp git repos, still mocked opf):

uv run pytest -m integration

Tier-3 real-model tests (loads the actual ~3 GB model — slow, requires the model to be cached, requires opf + torch installed):

uv run pytest -m real_model

On a machine without a GPU, set GIT_PRIVACY_FILTER_TEST_DEVICE=cpu (the default) so torch doesn't try to initialize CUDA.

Lint + format:

uv run ruff check
uv run ruff format

How it works, briefly

For git commit:

  1. git diff --cached -U0 --diff-filter=ACMR → the lines this commit would add (context and deletions are ignored).
  2. Group additions by path, drop allowlisted paths, join per-file additions with \n, feed each file's blob to opf.OPF.redact(text) once.
  3. Map each returned span's character offset back to (file, line, col) via the per-file line-start table we built during the join.
  4. Drop findings whose category is disabled in config, or whose line has a matching inline allow marker.
  5. Print a path:line:col [category] report and exit 1 if anything is left.

For git push, step 1 is git log -p --reverse -U0 <remote_sha>..<local_sha> per ref being pushed, and everything else is the same. A brand-new branch (zero remote_sha) is scanned as "everything reachable from local_sha"; a branch deletion (zero local_sha) is a no-op.
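Step 3's offset mapping is a classic line-start table plus binary search. A minimal sketch, with illustrative helper names:

```python
import bisect


def line_starts(blob: str) -> list:
    """Character offset at which each line of `blob` begins."""
    starts = [0]
    for i, ch in enumerate(blob):
        if ch == "\n":
            starts.append(i + 1)
    return starts


def locate(offset: int, starts: list) -> tuple:
    """Map a flat character offset in the joined per-file blob back to
    a 1-based (line, col) pair via binary search over the table.
    (Sketch of step 3 above; not the tool's real code.)"""
    idx = bisect.bisect_right(starts, offset) - 1
    return idx + 1, offset - starts[idx] + 1


blob = "clean line\ntoken = ghp_...\n"
starts = line_starts(blob)
print(locate(blob.index("ghp_"), starts))  # (2, 9)
```

Building the table once per file makes each of opf's returned spans an O(log n) lookup instead of a rescan of the blob.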

License

Licensed under either of

  * Apache License, Version 2.0 (LICENSE-APACHE)
  * MIT license (LICENSE-MIT)

at your option. This is the same dual-licensing convention used by rust-lang/rust, lightningdevkit/*, and most of the Rust ecosystem; it lets consumers pick the license that matches their own project's licensing obligations.

Upstream opf (code and model weights) is Apache-2.0; pulling it in as a runtime dependency doesn't change the licensing of our own code, but downstream redistributors of a combined work obviously inherit opf's Apache-2.0 requirements.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual-licensed as above, without any additional terms or conditions.
