Catch risky meaning changes Git diff misses.
SemShift is a local-first review assistant for AI-rewritten and human-edited docs, prompts, policies, resumes, and research drafts. It flags likely semantic drift before you merge, publish, or submit text.
Current release line: v0.2.x alpha. The default backend is lexical + heuristic (tfidf). Optional SentenceTransformers embeddings run locally and add semantic similarity; they are not a claim of legal, factual, or scientific authority.
Before:
We do not share personal data with third parties.
After:
We may share personal data with trusted partners.
SemShift:
CRITICAL: privacy commitment weakened.
Risk flag: third-party sharing.
Recommendation: hold approval until a human reviews the change.
Install:

    pip install semshift

Optional local embedding backend:

    pip install "semshift[models]"

Development:

    pip install -e ".[dev]"

Compare two files:

    semshift compare examples/old_policy.md examples/new_policy.md --mode policy
    semshift compare examples/old_policy.md examples/new_policy.md --mode policy --json
    semshift compare examples/old_policy.md examples/new_policy.md --mode policy --report semshift-report.md

Use limits for large or generated files:

    semshift compare old.md new.md --max-file-size 5242880 --max-chunks 2000

Example GitHub Actions workflow:

    name: SemShift Check
    on:
      pull_request:
        paths:
          - "**/*.md"
          - "**/*.txt"
          - "**/*.yml"
    permissions:
      contents: read
      pull-requests: write
    jobs:
      semshift:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v5
            with:
              fetch-depth: 0
          - uses: VeerajSai/SemShift@v0.2.0
            with:
              mode: policy
              fail_on: high
              pr_comment: "true"
              paths: "docs/**,prompts/**,**/*.md,**/*.txt"
              exclude_paths: ".github/workflows/**"
              model: tfidf
              report: semshift-report.md
              artifact_name: semshift-policy-report

Inputs include `files`, `paths`, `exclude_paths`, `mode`, `fail_on`, `model`, `report`, `artifact_name`, `base_ref`, `pr_comment`, `github_token`, `max_file_size`, and `max_chunks`.
Note:
`fail_on` defaults to `high`. Use `fail_on: none` for warn-only mode.
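One way to picture the `fail_on` threshold is as a comparison on an ordered severity scale. The sketch below is an illustration only: the ordering `low < medium < high < critical` and the helper `should_fail` are hypothetical and not taken from SemShift's source. Only `high`, `none`, and the `CRITICAL` label appear in this README.

```python
# Hypothetical severity scale; SemShift's real labels and ordering may differ.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def should_fail(drift_label: str, fail_on: str = "high") -> bool:
    """Return True when drift_label is at or above the fail_on threshold."""
    threshold = fail_on.lower()
    if threshold == "none":
        # Warn-only mode: findings are reported but never fail the check.
        return False
    if threshold not in SEVERITY_ORDER:
        raise ValueError(f"unknown fail_on value: {fail_on!r}")
    label = drift_label.lower()
    if label not in SEVERITY_ORDER:
        # Labels outside the assumed scale do not fail the check in this sketch.
        return False
    return SEVERITY_ORDER.index(label) >= SEVERITY_ORDER.index(threshold)


if __name__ == "__main__":
    for label in ("medium", "high", "CRITICAL"):
        print(label, should_fail(label), should_fail(label, fail_on="none"))
```

Under these assumptions, a `CRITICAL` result fails a check configured with `fail_on: high`, while `fail_on: none` never fails.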
Python API:

    from semshift import compare_files, compare_text

    # Compare two strings in policy mode.
    result = compare_text(
        old="We do not share personal data.",
        new="We may share personal data with partners.",
        mode="policy",
    )
    print(result.drift_label)
    print(result.summary)
    print(result.risk_flags)
    print(result.to_markdown())

    # Compare two files and render the result as a Markdown report.
    file_result = compare_files("old_policy.md", "new_policy.md", mode="policy")
    report = file_result.to_markdown()

Canonical fields and methods include `drift_label`, `overall_score`, `drift_score`, `summary`, `matched_chunks`, `chunk_matches`, `claim_changes`, `tone_shift`, `risk_flags`, `warnings`, `metadata`, `to_dict()`, `to_json()`, and `to_markdown()`.
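As one way to consume a result downstream, the sketch below formats a one-line status from a dictionary shaped like the canonical fields. It assumes `to_dict()` returns keys named `drift_label`, `drift_score`, and `risk_flags`, with `risk_flags` as a list of strings. Those shapes and the helper `one_line_summary` are assumptions, so check the real output before relying on them.

```python
from typing import Any, Mapping


def one_line_summary(result: Mapping[str, Any]) -> str:
    """Format a compact status line from a result dictionary.

    Assumes keys drift_label, drift_score, and risk_flags (a list of strings).
    Missing keys fall back to neutral placeholders.
    """
    label = str(result.get("drift_label", "unknown"))
    score = result.get("drift_score")
    flags = result.get("risk_flags") or []
    score_text = f"{score:.2f}" if isinstance(score, (int, float)) else "n/a"
    flag_text = ", ".join(str(flag) for flag in flags) if flags else "none"
    return f"drift={label} score={score_text} flags={flag_text}"


if __name__ == "__main__":
    # Hand-written sample data, not real SemShift output.
    sample = {
        "drift_label": "critical",
        "drift_score": 0.8123,
        "risk_flags": ["third-party sharing"],
    }
    print(one_line_summary(sample))
```

With SemShift installed, the input would come from something like `compare_text(...).to_dict()`.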
| Mode | Maturity | Best for | Main signals |
|---|---|---|---|
| `policy` | stable | privacy policies, terms, consent language | sharing, retention, rights, obligations |
| `prompt` | stable | system prompts and instruction files | safety rules, hidden instructions, scope |
| `research` | experimental | research drafts and reports | metrics, datasets, baselines, limitations |
| `resume` | experimental | resumes and bios | titles, metrics, company/project names |
| `readme` | experimental | README and support docs | install requirements, guarantees, scope |
| `default` | stable | general text review | drift score, claims, tone, generic risk |
SemShift combines transparent signals:
- Chunk alignment by headings and text structure.
- Lexical TF-IDF similarity by default, or optional local SentenceTransformers embeddings.
- Claim extraction, tone signals, and mode-specific risk rules.
TF-IDF is a lexical backend, not a true semantic model. Optional embedding models may download weights on first use; document text is processed locally unless you explicitly integrate external services.
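To see why a lexical backend is not a semantic model, here is a minimal standard-library sketch of TF-IDF cosine similarity. It is not SemShift's implementation: the tokenizer, the smoothed IDF formula, and the function names are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector per document (raw counts, smoothed IDF)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    old = "We do not share personal data with third parties."
    new = "We may share personal data with trusted partners."
    paraphrase = "Customer information stays private."
    v_old, v_new, v_para = tfidf_vectors([old, new, paraphrase])
    print("old vs new:", round(cosine(v_old, v_new), 3))
    print("old vs paraphrase:", cosine(v_old, v_para))
```

The policy pair scores above zero because it shares several words even though the commitment was weakened, while the paraphrase shares no words with the old text and scores zero. Word overlap tracks wording, not meaning, which is why the lexical score is only one signal alongside claim extraction and risk rules.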
SemShift includes a starter self-evaluation benchmark for regression tracking. See docs/benchmarks.md.
Do not treat starter benchmark numbers as independent validation. Human-labeled outside evaluation is still needed.
| Tool | What it catches | What it misses |
|---|---|---|
| Git diff | exact text edits | risk, claims, weakened obligations |
| diff-match-patch | text similarity | domain-specific meaning changes |
| LLM judge | broad qualitative review | local determinism, reproducibility, privacy by default |
| Grammar checker | style and grammar | policy, prompt, research, and factual drift |
| SemShift | likely risky semantic drift | subtle context, truth verification, legal authority |
SemShift is:
- not legal advice
- not a fact-checker
- not scientific authority
- not a replacement for human review
- likely to miss subtle context-dependent changes
- likely to false-positive on harmless paraphrases
- lexical + heuristic by default
Troubleshooting:
- `semshift: command not found`: Confirm the active environment is the one where you installed semshift.
- Model import error: Install optional dependencies with `pip install "semshift[models]"`, or use `--model tfidf`.
- Slow first model run: SentenceTransformers may download weights and initialize on first use.
- Windows path issues: Quote paths with spaces and prefer PowerShell-compatible quoting.
- GitHub Action fork PRs: PR comments can be unavailable for forks with restricted permissions; the report artifact is still written.
- No files matched: Pass `files` or `paths`, use `actions/checkout@v5` with `fetch-depth: 0`, or check supported extensions. Use `exclude_paths` for generated files or workflow YAML.
- Report too long: GitHub comments are truncated and link to the workflow run where the configured report artifact is uploaded.

Roadmap:
- stronger external benchmark
- NLI-based deep mode for contradiction/entailment checks
- VS Code extension
- web demo
- docs site
- more file formats
Built by Veeraj Sai.
Please cite SemShift using CITATION.cff.
Licensed under MIT. See LICENSE.
Report vulnerabilities through GitHub Security Advisories. SemShift is local-first by default, but optional model downloads and external CI integrations should be reviewed in your environment.