A Claude Code / Claude Desktop skill that sets up and operates a local corpus-linguistics toolchain on macOS Apple Silicon — Python (NLTK, spaCy, Stanza), the IMS Open Corpus Workbench (CWB / cwb-encode / cqp), OPUS for parallel corpora, and helpers for building, annotating, and querying corpora compiled from raw text files, PDFs, HTML, or DOCX.
This skill is a first draft generated from research on the state of agent / MCP tooling for linguistic research. It has not been end-to-end tested yet. Specifically:
- `scripts/setup-python-env.sh` has been logically reviewed but not run against a fresh `uv` install on a clean machine.
- `scripts/setup-cwb.sh` assumes a Homebrew formula `cwb` exists; this needs verification on current Homebrew (a third-party tap may be required).
- `scripts/extract-text.py`, `scripts/concordance.py`, and `scripts/annotate.py` parse cleanly and have been argument-tested, but have not been run against real corpora end-to-end.
- `scripts/build-cwb-corpus.sh` follows the documented `cwb-encode` / `cwb-make` invocation but has not been exercised against real `.vrt` data.
- The parallel-corpus workflow in `references/parallel-corpora.md` is built from the documented OPUS / hunalign / `cwb-align-encode` pipeline, but the integration has not been smoke-tested.
If you want to help test, please:
- Try `scripts/setup-python-env.sh /tmp/corpus-test` on your platform and report what fails.
- Try `scripts/setup-cwb.sh` and report whether `brew install cwb` works directly or needs a tap.
- Try the full corpus-build pipeline (`extract-text.py` → `annotate.py` → `build-cwb-corpus.sh`) on a small directory of test files.
- Open issues for anything that surprises you: wrong package names, missing dependencies, broken Homebrew formulas, encoding issues, performance gotchas.
- PRs welcome, especially: Linux support for the bash scripts, clean handling of Apple Silicon MPS in spaCy transformer models, and additional language coverage in `setup-python-env.sh`.
This is a personal skill repo by @techczech (Dominik Lukeš); no warranty, MIT-licensed.
Once installed in your Claude environment, the skill triggers when you mention setting up NLTK / spaCy / Stanza, installing CWB, building a local corpus from raw files, tokenizing / POS-tagging / lemmatizing, running concordances or keyword statistics over local data, or working with parallel corpora outside Sketch Engine.
It covers two layers:
- Python ecosystem: NLTK, spaCy, and Stanza in a `uv`-managed virtualenv. Best for corpus building from messy sources, ad-hoc analysis, and smaller corpora.
- CWB (IMS Open Corpus Workbench): `cwb-encode`, `cwb-make`, `cqp`. Best for serious concordancing at scale, CQL queries, and durable indexed corpora.
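The concordance step in the Python layer can be sketched in a few lines. This is a hypothetical minimal KWIC routine in the spirit of `scripts/concordance.py`, not its actual implementation; the function name and parameters are illustrative:

```python
import re

def kwic(text: str, keyword: str, width: int = 30) -> list[str]:
    """Return keyword-in-context lines with `width` characters of context per side."""
    lines = []
    # \b word boundaries so "corpus" does not match inside "corpuscle"
    for m in re.finditer(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group()}] {right:<{width}}")
    return lines

sample = "The corpus is small. A corpus grows as you add texts to the corpus."
for line in kwic(sample, "corpus", width=20):
    print(line)
```

A real tool would tokenize first and walk a whole directory; this only shows the aligned-context idea behind KWIC output.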
It also provides a CLI replacement for ParaConc's parallel-concordancing workflow, since no MCP server for parallel corpora exists at the time of writing.
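The core operation behind a ParaConc-style parallel concordance reduces to filtering aligned sentence pairs. A hypothetical sketch, assuming 1:1 sentence alignment (real hunalign output can align 1:many, which this ignores):

```python
def parallel_kwic(src_sents: list[str], tgt_sents: list[str], keyword: str):
    """Return (source, target) pairs where the source sentence contains the keyword."""
    kw = keyword.lower()
    return [(s, t) for s, t in zip(src_sents, tgt_sents) if kw in s.lower()]

src = ["The corpus grew quickly.", "Nothing to see here."]
tgt = ["Das Korpus wuchs schnell.", "Hier gibt es nichts zu sehen."]
for s, t in parallel_kwic(src, tgt, "corpus"):
    print(s, "|||", t)
```

The CWB-based workflow in `references/parallel-corpora.md` does this with indexed alignment attributes rather than in-memory lists.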
The intended location is `~/gitrepos/02_workskills/corpus-tools-skill/`, or any folder Claude Code's skill discovery picks up. To clone into a custom workspace:
```sh
git clone https://github.com/techczech/corpus-tools-skill.git ~/gitrepos/02_workskills/corpus-tools-skill
```

The skill self-registers when Claude Code's skill discovery scans that folder. To package it as a `.skill` archive for distribution:

```sh
python3 ~/.claude/skills/skill-creator/scripts/package_skill.py ~/gitrepos/02_workskills/corpus-tools-skill
```

```
SKILL.md                 # the skill definition Claude reads
scripts/
  setup-python-env.sh    # uv venv + NLTK/spaCy/Stanza + standard data
  setup-cwb.sh           # Homebrew install of IMS CWB
  extract-text.py        # PDF/HTML/DOCX/TXT → UTF-8 text, recursive
  concordance.py         # KWIC concordance over a directory of .txt files
  annotate.py            # spaCy → vertical-text (.vrt) for cwb-encode
  build-cwb-corpus.sh    # cwb-encode + cwb-make + register in one shot
references/
  nltk-recipes.md        # tokenize, frequency, concordance, collocations, lemmas
  spacy-recipes.md       # batch annotation, log-likelihood keywords, dependency matching
  cwb-cheatsheet.md      # CWB CLI tools and CQL syntax inside cqp
  file-formats.md        # PDF/HTML/DOCX/EPUB conversion + encoding gotchas
  parallel-corpora.md    # ParaConc-replacement workflow with OPUS + CWB
```
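For orientation: CWB's vertical-text (`.vrt`) input, which `annotate.py` produces for `cwb-encode`, is one token per line with tab-separated positional attributes, wrapped in XML-like structural tags. A minimal sketch of that serialisation — the word/pos/lemma attribute set is an assumption here, not necessarily what the script emits:

```python
def to_vrt(sentences, text_id="t1"):
    """Serialise tagged sentences as CWB vertical text.

    `sentences` is a list of sentences, each a list of (word, pos, lemma)
    tuples -- one token per output line, attributes tab-separated.
    (The word/pos/lemma attribute set is an illustrative assumption.)
    """
    out = [f'<text id="{text_id}">']
    for sent in sentences:
        out.append("<s>")
        out.extend("\t".join(tok) for tok in sent)
        out.append("</s>")
    out.append("</text>")
    return "\n".join(out)

tagged = [[("The", "DET", "the"), ("cats", "NOUN", "cat"), (".", "PUNCT", ".")]]
print(to_vrt(tagged))
```

`cwb-encode` is then told about the same attributes (e.g. `-P pos -P lemma`) and the structural tags (`-S s -S text:0+id`) so the index matches what the `.vrt` file contains.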
Generated from a primary-source research report on the 2025–2026 state of agent / MCP tooling for linguistic research, archived in techczech/_LEARNINGLOG. That report identifies the lack of an MCP wrapper for parallel corpora and for general NLTK / Python NLP work, and recommends the code-generation pattern (Klemen et al. 2025) over a custom NLTK MCP — which is what this skill operationalises.
Sister skill: sketchengine-skill — for cloud-hosted Sketch Engine corpora via the ricCap/sketch-engine-mcp-server MCP.
MIT.