SemShift

Catch risky meaning changes Git diff misses.

SemShift is a local-first review assistant for AI-rewritten and human-edited docs, prompts, policies, resumes, and research drafts. It flags likely semantic drift before you merge, publish, or submit text.

Current release line: v0.2.x alpha. The default backend (tfidf) is lexical and heuristic. The optional SentenceTransformers backend computes semantic embeddings locally; its output is a similarity signal, not a claim of legal, factual, or scientific authority.

5-Second Demo

Before:

We do not share personal data with third parties.

After:

We may share personal data with trusted partners.

SemShift:

CRITICAL: privacy commitment weakened.
Risk flag: third-party sharing.
Recommendation: hold approval until a human reviews the change.

Install

pip install semshift

Optional local embedding backend:

pip install "semshift[models]"

Development:

pip install -e ".[dev]"

Quick Start

semshift compare examples/old_policy.md examples/new_policy.md --mode policy
semshift compare examples/old_policy.md examples/new_policy.md --mode policy --json
semshift compare examples/old_policy.md examples/new_policy.md --mode policy --report semshift-report.md

Use limits for large or generated files:

semshift compare old.md new.md --max-file-size 5242880 --max-chunks 2000

GitHub Action

name: SemShift Check

on:
  pull_request:
    paths:
      - "**/*.md"
      - "**/*.txt"
      - "**/*.yml"

permissions:
  contents: read
  pull-requests: write

jobs:
  semshift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
        with:
          fetch-depth: 0

      - uses: VeerajSai/SemShift@v0.2.0
        with:
          mode: policy
          fail_on: high
          pr_comment: "true"
          paths: "docs/**,prompts/**,**/*.md,**/*.txt"
          exclude_paths: ".github/workflows/**"
          model: tfidf
          report: semshift-report.md
          artifact_name: semshift-policy-report

Inputs include files, paths, exclude_paths, mode, fail_on, model, report, artifact_name, base_ref, pr_comment, github_token, max_file_size, and max_chunks.

Note: fail_on defaults to high. Use fail_on: none for warn-only mode.
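
As a rough illustration of how a fail_on threshold can gate a check, here is a minimal sketch. The severity ordering and the should_fail helper are assumptions made for this example; the source confirms only the values high and none and the CRITICAL label from the demo, so this is not SemShift's actual implementation.

```python
# Assumed ordering from least to most severe; not confirmed by the SemShift docs.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def should_fail(drift_label: str, fail_on: str = "high") -> bool:
    """Return True when a drift label meets or exceeds the fail_on threshold.

    Hypothetical helper: "none" means warn-only, so nothing ever fails.
    """
    if fail_on.lower() == "none":
        return False
    label_rank = SEVERITY_ORDER.index(drift_label.lower())
    threshold_rank = SEVERITY_ORDER.index(fail_on.lower())
    return label_rank >= threshold_rank


if __name__ == "__main__":
    for label in ("medium", "high", "critical"):
        print(label, "fails with fail_on=high:", should_fail(label))
    print("critical fails with fail_on=none:", should_fail("critical", "none"))
```

Under these assumptions, a CRITICAL result like the one in the demo would fail the default check, while fail_on: none would only report it.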

Python API

from semshift import compare_files, compare_text

result = compare_text(
    old="We do not share personal data.",
    new="We may share personal data with partners.",
    mode="policy",
)

print(result.drift_label)
print(result.summary)
print(result.risk_flags)
print(result.to_markdown())

file_result = compare_files("old_policy.md", "new_policy.md", mode="policy")
report = file_result.to_markdown()

Canonical result fields include drift_label, overall_score, drift_score, summary, matched_chunks, chunk_matches, claim_changes, tone_shift, risk_flags, warnings, and metadata. The result also provides the methods to_dict(), to_json(), and to_markdown().

Modes

| Mode | Maturity | Best for | Main signals |
| --- | --- | --- | --- |
| `policy` | stable | privacy policies, terms, consent language | sharing, retention, rights, obligations |
| `prompt` | stable | system prompts and instruction files | safety rules, hidden instructions, scope |
| `research` | experimental | research drafts and reports | metrics, datasets, baselines, limitations |
| `resume` | experimental | resumes and bios | titles, metrics, company/project names |
| `readme` | experimental | README and support docs | install requirements, guarantees, scope |
| `default` | stable | general text review | drift score, claims, tone, generic risk |

How It Works

SemShift combines transparent signals:

Chunk alignment by headings and text structure.
Lexical TF-IDF similarity by default, or optional local SentenceTransformers embeddings.
Claim extraction, tone signals, and mode-specific risk rules.

TF-IDF is a lexical backend, not a true semantic model. Optional embedding models may download weights on first use; document text is processed locally unless you explicitly integrate external services.
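
To make the lexical limitation concrete, here is a minimal standard-library sketch of TF-IDF cosine similarity applied to the demo sentences. The tokenizer, the smoothed IDF formula, and all function names are assumptions chosen for illustration; they are not SemShift's tfidf backend.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build sparse TF-IDF vectors using a smoothed IDF (an assumed variant)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


old = "We do not share personal data with third parties."
new = "We may share personal data with trusted partners."
unrelated = "Install the package with pip before running the tests."

vec_old, vec_new, vec_unrelated = tfidf_vectors([old, new, unrelated])
sim_rewrite = cosine(vec_old, vec_new)
sim_unrelated = cosine(vec_old, vec_unrelated)

print(f"old vs rewrite:   {sim_rewrite:.2f}")
print(f"old vs unrelated: {sim_unrelated:.2f}")
```

The rewrite scores far closer to the original than the unrelated sentence does, yet the score alone cannot show that "do not share" became "may share". A lexical measure registers how much vocabulary changed, not the direction in which the meaning moved, which is why the claim, tone, and mode-specific rules sit alongside it.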

Benchmarks

SemShift includes a starter self-evaluation benchmark for regression tracking. See docs/benchmarks.md.

Do not treat starter benchmark numbers as independent validation. Human-labeled outside evaluation is still needed.

Compared To

| Tool | What it catches | What it misses |
| --- | --- | --- |
| Git diff | exact text edits | risk, claims, weakened obligations |
| diff-match-patch | text similarity | domain-specific meaning changes |
| LLM judge | broad qualitative review | local determinism, reproducibility, privacy by default |
| Grammar checker | style and grammar | policy, prompt, research, and factual drift |
| SemShift | likely risky semantic drift | subtle context, truth verification, legal authority |

Limitations

SemShift is:

not legal advice
not a fact-checker
not scientific authority
not a replacement for human review
likely to miss subtle context-dependent changes
likely to false-positive on harmless paraphrases
lexical + heuristic by default

Troubleshooting

semshift: command not found: Confirm the active environment is the one where you installed semshift.

Model import error: Install optional dependencies with pip install "semshift[models]", or use --model tfidf.

Slow first model run: SentenceTransformers may download weights and initialize on first use.

Windows path issues: Quote paths with spaces and prefer PowerShell-compatible quoting.

GitHub Action fork PRs: PR comments can be unavailable for forks with restricted permissions; the report artifact is still written.

No files matched: Pass files or paths, use actions/checkout@v5 with fetch-depth: 0, or check supported extensions. Use exclude_paths for generated files or workflow YAML.

Report too long: GitHub comments are truncated and link to the workflow run where the configured report artifact is uploaded.

Roadmap

stronger external benchmark
NLI-based deep mode for contradiction/entailment checks
VS Code extension
web demo
docs site
more file formats

Author

Built by Veeraj Sai.

Citation

Please cite SemShift using CITATION.cff.

License

MIT. See LICENSE.

Security

Report vulnerabilities through GitHub Security Advisories. SemShift is local-first by default, but optional model downloads and external CI integrations should be reviewed in your environment.
