SemShift


Catch risky meaning changes Git diff misses.

SemShift is a local-first review assistant for AI-rewritten and human-edited docs, prompts, policies, resumes, and research drafts. It flags likely semantic drift before you merge, publish, or submit text.

Current release line: v0.2.x alpha. The default backend is lexical + heuristic (tfidf). The optional SentenceTransformers backend computes semantic embeddings locally; its scores are similarity signals, not claims of legal, factual, or scientific authority.

5-Second Demo

Before:

We do not share personal data with third parties.

After:

We may share personal data with trusted partners.

SemShift:

CRITICAL: privacy commitment weakened.
Risk flag: third-party sharing.
Recommendation: hold approval until a human reviews the change.

Install

pip install semshift

Optional local embedding backend:

pip install "semshift[models]"

Development:

pip install -e ".[dev]"

Quick Start

semshift compare examples/old_policy.md examples/new_policy.md --mode policy
semshift compare examples/old_policy.md examples/new_policy.md --mode policy --json
semshift compare examples/old_policy.md examples/new_policy.md --mode policy --report semshift-report.md

Set input limits for large or generated files:

semshift compare old.md new.md --max-file-size 5242880 --max-chunks 2000

GitHub Action

name: SemShift Check

on:
  pull_request:
    paths:
      - "**/*.md"
      - "**/*.txt"
      - "**/*.yml"

permissions:
  contents: read
  pull-requests: write

jobs:
  semshift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
        with:
          fetch-depth: 0

      - uses: VeerajSai/SemShift@v0.2.0
        with:
          mode: policy
          fail_on: high
          pr_comment: "true"
          paths: "docs/**,prompts/**,**/*.md,**/*.txt"
          exclude_paths: ".github/workflows/**"
          model: tfidf
          report: semshift-report.md
          artifact_name: semshift-policy-report

Inputs include files, paths, exclude_paths, mode, fail_on, model, report, artifact_name, base_ref, pr_comment, github_token, max_file_size, and max_chunks.

Note: fail_on defaults to high. Use fail_on: none for warn-only mode.
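
The threshold behaviour described in the note can be pictured with a small sketch. This is not SemShift's implementation: the severity scale, its ordering, and the function name should_fail are assumptions made for illustration. Only the values high and none come from the note above.

```python
# Hypothetical severity scale; SemShift's real labels and ordering may differ.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def should_fail(drift_label: str, fail_on: str = "high") -> bool:
    """Return True when drift_label meets or exceeds the fail_on threshold."""
    threshold = fail_on.lower()
    if threshold == "none":
        return False  # warn-only mode: report, but never fail the check
    if threshold not in SEVERITY_ORDER:
        raise ValueError(f"unknown fail_on value: {fail_on!r}")
    label = drift_label.lower()
    if label not in SEVERITY_ORDER:
        return False  # in this sketch, unknown labels do not block
    return SEVERITY_ORDER.index(label) >= SEVERITY_ORDER.index(threshold)


print(should_fail("critical"))          # meets the default "high" threshold
print(should_fail("medium"))            # below the default threshold
print(should_fail("critical", "none"))  # warn-only never fails
```

With a scale like this, fail_on: high blocks on high and critical results and lets anything lower through as a warning.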

Python API

from semshift import compare_files, compare_text

result = compare_text(
    old="We do not share personal data.",
    new="We may share personal data with partners.",
    mode="policy",
)

print(result.drift_label)
print(result.summary)
print(result.risk_flags)
print(result.to_markdown())

file_result = compare_files("old_policy.md", "new_policy.md", mode="policy")
report = file_result.to_markdown()

Canonical result fields include drift_label, overall_score, drift_score, summary, matched_chunks, chunk_matches, claim_changes, tone_shift, risk_flags, warnings, and metadata. Results can be serialized with the methods to_dict(), to_json(), and to_markdown().
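
As a rough sketch of how a serialized result might be consumed downstream, the snippet below formats a one-line summary from a plain dictionary. The dictionary is a hand-written stand-in for the output of to_dict(). Its key names mirror the fields listed above, but the value shapes (a string label, a list of string flags) are assumptions, and brief is a hypothetical helper, not part of the library.

```python
import json

# Stand-in for result.to_dict(); the value shapes are assumed for illustration.
sample = {
    "drift_label": "critical",
    "summary": "Privacy commitment weakened.",
    "risk_flags": ["third-party sharing"],
}


def brief(result: dict) -> str:
    """Build a one-line summary, tolerating missing keys."""
    label = str(result.get("drift_label", "unknown")).upper()
    flags = ", ".join(result.get("risk_flags", [])) or "none"
    return f"{label}: {result.get('summary', '')} (flags: {flags})"


# Round-trip through JSON to mimic reading a saved report.
loaded = json.loads(json.dumps(sample))
line = brief(loaded)
print(line)
```

Reading keys with .get() keeps the helper usable if a field is absent in a given release.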

Modes

| Mode     | Maturity     | Best for                                   | Main signals                                   |
|----------|--------------|--------------------------------------------|------------------------------------------------|
| policy   | stable       | privacy policies, terms, consent language  | sharing, retention, rights, obligations        |
| prompt   | stable       | system prompts and instruction files       | safety rules, hidden instructions, scope       |
| research | experimental | research drafts and reports                | metrics, datasets, baselines, limitations      |
| resume   | experimental | resumes and bios                           | titles, metrics, company/project names         |
| readme   | experimental | README and support docs                    | install requirements, guarantees, scope        |
| default  | stable       | general text review                        | drift score, claims, tone, generic risk        |
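
To make "main signals" concrete, here is a toy rule in the spirit of the policy mode's sharing signal. It is an illustrative heuristic written for this example, not one of SemShift's actual rules. The regular expressions and the function name sharing_commitment_weakened are hypothetical.

```python
import re

# Hypothetical patterns: a negated sharing commitment, and any mention of sharing.
NEGATED_SHARE = re.compile(
    r"\b(?:do|does|will)\s+not\s+share\b|\bnever\s+share\b", re.IGNORECASE
)
MENTIONS_SHARE = re.compile(r"\bshar(?:e|es|ed|ing)\b", re.IGNORECASE)


def sharing_commitment_weakened(old: str, new: str) -> bool:
    """Flag when a 'do not share' commitment disappears but sharing is still discussed."""
    had_commitment = bool(NEGATED_SHARE.search(old))
    still_committed = bool(NEGATED_SHARE.search(new))
    return had_commitment and not still_committed and bool(MENTIONS_SHARE.search(new))


old = "We do not share personal data with third parties."
new = "We may share personal data with trusted partners."
print(sharing_commitment_weakened(old, new))
```

A rule this narrow will miss rephrasings such as "we refrain from disclosing", which is one reason heuristic flags still need human review.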

How It Works

SemShift combines transparent signals:

  1. Chunk alignment by headings and text structure.
  2. Lexical TF-IDF similarity by default, or optional local SentenceTransformers embeddings.
  3. Claim extraction, tone signals, and mode-specific risk rules.
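
Step 1 can be sketched as follows for Markdown input. This is a minimal illustration under the assumption that sections are keyed by their heading text. SemShift's own chunking may handle duplicate headings, renamed sections, and non-Markdown structure differently. The names split_by_headings and align are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

# Matches ATX-style Markdown headings such as "## Data Sharing".
HEADING = re.compile(r"^#{1,6}\s+(.*\S)\s*$")


def split_by_headings(text: str) -> Dict[str, str]:
    """Map each heading title to the body text that follows it."""
    sections: Dict[str, List[str]] = {"": []}  # "" collects text before any heading
    current = ""
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(1)
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    joined = {title: "\n".join(body).strip() for title, body in sections.items()}
    # Drop the preamble entry when nothing precedes the first heading.
    return {t: b for t, b in joined.items() if t or b}


def align(old: str, new: str) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Pair sections by title; None marks a section missing on one side."""
    old_s, new_s = split_by_headings(old), split_by_headings(new)
    titles = list(old_s) + [t for t in new_s if t not in old_s]
    return [(t, old_s.get(t), new_s.get(t)) for t in titles]


old_doc = "# Privacy\nWe do not share data.\n# Retention\n30 days."
new_doc = "# Privacy\nWe may share data.\n# Contact\nEmail us."
pairs = align(old_doc, new_doc)
for title, before, after in pairs:
    print(title, "|", before, "|", after)
```

Each aligned pair can then be scored independently, and sections present on only one side surface as additions or removals.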

TF-IDF is a lexical backend, not a true semantic model. Optional embedding models may download weights on first use; document text is processed locally unless you explicitly integrate external services.
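
The following standard-library sketch shows why a lexical score alone cannot tell a paraphrase from a reversal. It computes TF-IDF cosine similarity over two texts using the smoothed IDF formula ln((1 + n) / (1 + df)) + 1, which is a common choice. Whether SemShift uses this exact weighting or tokenization is an assumption, and tokenize and tfidf_cosine are names invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase and split into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_cosine(a: str, b: str) -> float:
    """Cosine similarity of TF-IDF vectors built from just these two texts."""
    docs = [Counter(tokenize(a)), Counter(tokenize(b))]
    n = len(docs)
    vocab = set(docs[0]) | set(docs[1])
    # Smoothed inverse document frequency over the two-document corpus.
    idf = {
        term: math.log((1 + n) / (1 + sum(term in doc for doc in docs))) + 1
        for term in vocab
    }
    vecs = [{term: count * idf[term] for term, count in doc.items()} for doc in docs]
    dot = sum(weight * vecs[1].get(term, 0.0) for term, weight in vecs[0].items())
    norms = [math.sqrt(sum(w * w for w in vec.values())) for vec in vecs]
    if norms[0] == 0 or norms[1] == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norms[0] * norms[1])


before = "We do not share personal data with third parties."
after = "We may share personal data with trusted partners."
score = tfidf_cosine(before, after)
print(round(score, 3))
```

The two sentences share most of their vocabulary, so the score stays well above zero even though the commitment is reversed. That gap is what the claim, tone, and risk rules are meant to cover.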

Benchmarks

SemShift includes a starter self-evaluation benchmark for regression tracking. See docs/benchmarks.md.

Do not treat starter benchmark numbers as independent validation. Human-labeled outside evaluation is still needed.

Compared To

| Tool             | What it catches             | What it misses                                        |
|------------------|-----------------------------|-------------------------------------------------------|
| Git diff         | exact text edits            | risk, claims, weakened obligations                    |
| diff-match-patch | text similarity             | domain-specific meaning changes                       |
| LLM judge        | broad qualitative review    | local determinism, reproducibility, privacy by default |
| Grammar checker  | style and grammar           | policy, prompt, research, and factual drift           |
| SemShift         | likely risky semantic drift | subtle context, truth verification, legal authority   |

Limitations

SemShift is:

  • not legal advice
  • not a fact-checker
  • not scientific authority
  • not a replacement for human review
  • likely to miss subtle context-dependent changes
  • likely to false-positive on harmless paraphrases
  • lexical + heuristic by default

Troubleshooting

semshift: command not found: Confirm the active environment is the one where you installed semshift.

Model import error: Install optional dependencies with pip install "semshift[models]", or use --model tfidf.

Slow first model run: SentenceTransformers may download weights and initialize on first use.

Windows path issues: Quote paths with spaces and prefer PowerShell-compatible quoting.

GitHub Action fork PRs: PR comments can be unavailable for forks with restricted permissions; the report artifact is still written.

No files matched: Pass files or paths, use actions/checkout@v5 with fetch-depth: 0, or check supported extensions. Use exclude_paths for generated files or workflow YAML.

Report too long: Long reports are truncated in the GitHub comment, which links to the workflow run where the configured report artifact is uploaded in full.

Roadmap

  • stronger external benchmark
  • NLI-based deep mode for contradiction/entailment checks
  • VS Code extension
  • web demo
  • docs site
  • more file formats

Author

Built by Veeraj Sai.

Citation

Please cite SemShift using CITATION.cff.

License

MIT. See LICENSE.

Security

Report vulnerabilities through GitHub Security Advisories. SemShift is local-first by default, but optional model downloads and external CI integrations should be reviewed in your environment.
