Keyword clustering and SEO content QA pipeline for Russian and Ukrainian e-commerce pages.
Turns a raw keyword list into clustered SEO groups, LLM-ready briefs, and measurable content checks: coverage, density, spam, wateriness, and nausea.
uv sync --frozen
uv run ruff check src tests
uv run mypy
uv run pytest
uv build

SEO specialists often start with a messy keyword export: duplicates, synonyms, mixed commercial and informational intent, raw search volumes, competitor notes, and a text that may or may not cover the required semantics.
Doing that by hand does not scale. The hard part is not writing one category text. The hard part is turning keyword research into a repeatable workflow:
- group raw keywords into usable clusters;
- separate primary, secondary, supporting, and commercial keywords;
- use TOP-10 SERP overlap and competitor research artifacts when they exist;
- write or generate content from the clustered brief;
- check the finished text against the expected keyword set;
- catch over-optimization before the page is shipped.
This repository is that workflow packaged as a Python project.
The workflow starts from a keyword export, usually CSV or a structured keyword list. That raw input is adapted into category JSON files so the rest of the pipeline can work with one contract instead of spreadsheets.
The project keeps several paths for that handoff:
- raw category JSON, such as `categories/{slug}/data/{slug}.json`;
- clustered clean JSON, such as `categories/{slug}/data/{slug}_clean.json`;
- master CSV sync into `_clean.json`;
- CSV restore/compare utilities for drift, missing keywords, and changed volumes.
The D+E fallback pattern prefers _clean.json first, then raw parsed JSON, and finally CSV. In practice, _clean.json is the main working artifact for the next stages.
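The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual loader: the function name `load_keywords`, the `{"keywords": [...]}` JSON layout, and the CSV location next to the JSON files are all assumptions made for the example.

```python
import csv
import json
from pathlib import Path


def load_keywords(category_dir: Path, slug: str) -> tuple[str, list[dict]]:
    """Return (source, keywords), trying _clean.json, then raw JSON, then CSV."""
    data_dir = category_dir / "data"
    clean_path = data_dir / f"{slug}_clean.json"
    raw_path = data_dir / f"{slug}.json"
    csv_path = data_dir / f"{slug}.csv"  # hypothetical CSV location

    # Prefer the clustered artifact, then the raw parsed JSON.
    for source, path in (("clean", clean_path), ("raw", raw_path)):
        if path.exists():
            with path.open(encoding="utf-8") as fh:
                payload = json.load(fh)
            # Assumed layout: {"keywords": [...]}; the real schema may differ.
            return source, list(payload.get("keywords", []))

    # Last resort: read rows straight from the CSV export.
    if csv_path.exists():
        with csv_path.open(encoding="utf-8", newline="") as fh:
            return "csv", list(csv.DictReader(fh))

    raise FileNotFoundError(f"no keyword data found for {slug!r}")
```

Returning the source label alongside the data lets later stages log or warn when they are working from a fallback rather than from the clustered file.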
Keywords are grouped into practical SEO roles:
| Group | Purpose |
|---|---|
| `primary` | Main target terms for H1, intro, and core page intent. |
| `secondary` | Related terms that expand the page without changing intent. |
| `supporting` | Long-tail and contextual phrases for semantic depth. |
| `commercial` | Buy/price/order modifiers, often used for meta or commercial signals. |
Duplicate and synonym cleanup tools normalize close variants, choose stronger winners by volume and phrase quality, and remove weaker duplicates.
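As an illustration of the "winner by volume" idea, here is a simplified sketch. It assumes keywords arrive as `(phrase, volume)` pairs and treats phrases as duplicates when they contain the same lowercased words in any order. The real tools also weigh phrase quality and morphology; here a shorter phrase is used as a crude tie-breaker only, and the names `normalize` and `pick_winners` are hypothetical.

```python
from collections import defaultdict


def normalize(phrase: str) -> tuple[str, ...]:
    """Order-insensitive key: lowercased tokens, sorted."""
    return tuple(sorted(phrase.lower().split()))


def pick_winners(keywords: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Keep one phrase per normalized group, preferring higher volume."""
    groups: dict[tuple[str, ...], list[tuple[str, int]]] = defaultdict(list)
    for phrase, volume in keywords:
        groups[normalize(phrase)].append((phrase, volume))
    # Highest volume wins; ties go to the shorter phrase (assumed quality proxy).
    winners = [
        max(variants, key=lambda kv: (kv[1], -len(kv[0])))
        for variants in groups.values()
    ]
    return sorted(winners, key=lambda kv: -kv[1])


# Made-up sample data for illustration.
sample = [
    ("купить воск", 900),
    ("воск купить", 300),
    ("Купить  воск", 100),
    ("воск для авто", 500),
]
winners = pick_winners(sample)
```

In this sample the three word-order and casing variants of "купить воск" collapse into one group, and the 900-volume variant is kept.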
The original workflow also supports a separate SERP TOP-10 file: keywords can be checked by URL intersection in Google results. If two queries share enough TOP-10 URLs, they can belong to the same cluster or synonym group; if their TOP-10 results differ, the keyword is treated as a separate search intent.
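A minimal sketch of the URL-intersection rule, assuming the SERP file has already been parsed into a mapping from keyword to its set of TOP-10 URLs. The threshold `min_shared=3` is an assumption chosen for the example; the source only says "enough" shared URLs. The sketch merges keywords transitively with a small union-find, which is one possible grouping strategy rather than the project's confirmed one.

```python
from itertools import combinations


def cluster_by_serp(serp: dict[str, set[str]], min_shared: int = 3) -> list[set[str]]:
    """Group keywords whose TOP-10 URL sets share at least min_shared URLs."""
    parent = {kw: kw for kw in serp}

    def find(kw: str) -> str:
        # Follow parent links to the group root, shortening the path as we go.
        while parent[kw] != kw:
            parent[kw] = parent[parent[kw]]
            kw = parent[kw]
        return kw

    for a, b in combinations(serp, 2):
        if len(serp[a] & serp[b]) >= min_shared:
            parent[find(a)] = find(b)  # merge the two groups

    clusters: dict[str, set[str]] = {}
    for kw in serp:
        clusters.setdefault(find(kw), set()).add(kw)
    return list(clusters.values())


# Made-up SERP data for illustration.
serp = {
    "a": {"u1", "u2", "u3", "u4"},
    "b": {"u2", "u3", "u4", "u5"},
    "c": {"u8", "u9"},
}
clusters = cluster_by_serp(serp)
```

Because merging is transitive, two keywords can land in one cluster through a shared neighbor even if their own overlap is below the threshold. A stricter variant would require every pair in a cluster to meet the threshold.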
In the full workflow, after clustering, the primary keyword, semantic
groups, entities, micro-intents, and product insights were used to generate a
categories/{slug}/research/RESEARCH_PROMPT.md file for an external web-research tool,
for example Perplexity Deep Research or an LLM agent with web search. The research
result was then saved into categories/{slug}/research/RESEARCH_DATA.md.
That research file is not just an attachment. It becomes the category brief for the next stage: product facts, competitor structure, user intent, content gaps, required blocks, FAQ ideas, and risks that should shape the final SEO text.
The public package keeps this stage as prompt/checklist/reference material and as a pipeline artifact checked by task generation. External research runners and generated research outputs are intentionally outside this repository.
The prompts/ directory documents a prepare/produce/deliver flow for LLM-assisted
writing. It reads clustered keywords, primary terms, entity dictionaries, content rules,
and optional research context before drafting or reviewing category content.
The validators check whether the final markdown actually satisfies the brief:
- primary keyword in H1 and intro;
- morphology-aware keyword coverage for Russian and Ukrainian;
- split coverage for core and commercial keywords;
- keyword density and stem-based over-spam detection;
- water percentage, classic nausea, academic nausea, and lemma repetition;
- H1/H2 structure, intro quality, blacklist terms, brand/city mentions, and meta sync.
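To make the first few checks concrete, here is a deliberately simplified sketch. It uses plain case-insensitive substring matching, whereas the validators described above are morphology-aware, so inflected forms that this sketch misses would be matched by the real checks. The function name `check_brief` and the result keys are hypothetical.

```python
import re


def check_brief(markdown: str, primary: str, keywords: list[str]) -> dict[str, object]:
    """Check primary keyword placement and exact-match keyword coverage."""
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip()]
    # H1 is the first line that starts with a single "# ".
    h1 = next((ln[2:].strip() for ln in lines if ln.startswith("# ")), "")
    # Intro is approximated as the first non-heading line.
    intro = next((ln for ln in lines if not ln.startswith("#")), "")
    body = re.sub(r"\s+", " ", markdown.lower())
    missing = [kw for kw in keywords if kw.lower() not in body]
    covered = len(keywords) - len(missing)
    return {
        "primary_in_h1": primary.lower() in h1.lower(),
        "primary_in_intro": primary.lower() in intro.lower(),
        "coverage": covered / len(keywords) if keywords else 1.0,
        "missing": missing,
    }


# Made-up page text for illustration.
text = "# Активная пена\n\nАктивная пена для мойки авто.\n\n## Как выбрать\n\nТекст."
result = check_brief(text, "активная пена", ["активная пена", "пена для мойки", "купить пену"])
```

Here "купить пену" is reported as missing, which is the kind of gap the coverage check is meant to surface before the page ships.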
It converts keyword research into a controlled checklist: what to target, what to keep out of the body, what to include in meta, and what to fix after the text is written.
It treats LLM output as a draft that must pass deterministic checks. The model can write, but the pipeline verifies coverage, structure, and over-optimization.
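One such deterministic check could look like the sketch below, which counts exact occurrences of a phrase and labels the result against two thresholds. The density formula (matched words as a share of all words) and the `warn`/`spam` values are assumptions for illustration; the project's actual thresholds and its stem-based matching are not reproduced here.

```python
import re


def density_report(text: str, keyword: str, warn: float = 2.0, spam: float = 3.0) -> dict[str, object]:
    """Exact-match keyword density in percent, with a coarse status label."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    phrase = keyword.lower().split()
    n = len(phrase)
    if not words or not phrase:
        return {"hits": 0, "density": 0.0, "status": "ok"}
    # Slide a window of len(phrase) words over the text and count exact matches.
    hits = sum(words[i:i + n] == phrase for i in range(len(words) - n + 1))
    density = 100.0 * hits * n / len(words)
    status = "spam" if density >= spam else "warning" if density >= warn else "ok"
    return {"hits": hits, "density": density, "status": status}


short = density_report("активная пена для мойки авто", "активная пена")
long_text = density_report("слово " * 99 + "пена", "пена")
```

Very short texts inflate density quickly, so a check like this is only meaningful on full-length page copy.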
The code handles Russian and Ukrainian text with language-aware tokenization, stopwords, stemming, lemmatization, and morphology-aware keyword matching.
| Area | What it does |
|---|---|
| Keyword JSON adaptation | Converts CSV/master/raw keyword data into category JSON and _clean.json artifacts used by the pipeline. |
| Keyword clustering | Builds clean keyword groups from raw/CSV/category data and separates intent roles. |
| SERP TOP-10 overlap | Uses search-result URL intersections to decide whether keywords belong in one cluster or represent different intents. |
| Synonym cleanup | Detects near-duplicates and normalizes competing keyword variants. |
| Research prompt workflow | Turns clustered keywords/entities/product insights into a research prompt contract; public docs preserve RESEARCH_PROMPT.md → RESEARCH_DATA.md as the brief workflow. |
| Content validation | Checks H1, intro, headings, keyword coverage, meta sync, and language-specific rules. |
| Density and spam | Finds exact, partial, stem, and substring overuse with warning/spam thresholds. |
| Water and nausea | Calculates Advego-like water, classic nausea, academic nausea, and lemma repetition. |
| Generation | Produces meta artifacts, SQL exports, semantic review files, catalog JSON, and checklists. |
| Repair and sync | Updates volumes, restores clean JSON from CSV, merges master keyword data, and fixes ordering or misplaced terms. |
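The water and nausea metrics in the table can be approximated as follows. This sketch uses commonly cited definitions, taken here as assumptions: water is the share of stopwords among all words, and classic nausea is the square root of the count of the most frequent significant word. The stopword set is a tiny illustrative subset, and words are counted as-is, whereas the pipeline works with lemmas.

```python
import math
import re
from collections import Counter

# Illustrative subset only; a real stopword list is much longer.
STOPWORDS = {"и", "в", "на", "для", "с", "по", "не", "что", "это"}


def tokenize(text: str) -> list[str]:
    """Lowercased runs of letters (Cyrillic included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def water_percent(tokens: list[str]) -> float:
    """Share of stopwords among all tokens, in percent."""
    if not tokens:
        return 0.0
    return 100.0 * sum(t in STOPWORDS for t in tokens) / len(tokens)


def classic_nausea(tokens: list[str]) -> float:
    """Square root of the count of the most frequent non-stopword."""
    significant = [t for t in tokens if t not in STOPWORDS]
    if not significant:
        return 0.0
    return math.sqrt(Counter(significant).most_common(1)[0][1])


tokens = tokenize("Пена для мойки и пена для авто и пена")
water = water_percent(tokens)
nausea = classic_nausea(tokens)
```

In this sample, 4 of 9 tokens are stopwords and "пена" appears three times, so the nausea value is the square root of 3.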
src/llm_keywords_pipeline/
  analyze/    category and metadata analysis for LLM briefs
  audit/      water, nausea, coverage, blacklist, H1, and quality audits
  core/       keyword matching, morphology, SEO rules, text utilities
  generate/   meta, checklist, semantic review, catalog, and export tools
  validate/   content, meta, heading, density, language, and data validators
  sync/       CSV/master-data synchronization and _clean.json repair
  fix/        focused cleanup utilities
  compare/    keyword distribution and dataset comparison helpers
tests/        pytest suite and public fixtures
prompts/      prepare/produce/deliver LLM workflow templates
docs/         architecture, testing, and public-version notes
Prerequisites: Python 3.12+ and uv.
git clone https://github.com/yastman/llm-keywords-pipeline.git
cd llm-keywords-pipeline
uv sync --frozen
uv run ruff check src tests
uv run mypy
uv run pytest

Run a public fixture through one of the packaged CLI tools:

uv run llm-keywords-audit-brands tests/fixtures/valid_content.md --json

Coverage:

uv run pytest --cov=src/llm_keywords_pipeline --cov-report=term-missing

Optional local database demo:

cp .env.example .env
docker compose up

The Compose stack starts MariaDB and Adminer on localhost. It is a local demo, not a production deployment recipe.
| Doc | What's in it |
|---|---|
| Architecture | Package map, workflow boundaries, and supported public surface. |
| Testing | Test tiers, default commands, CI parity, data-required skips. |
| Research Workflow | How clustered keywords became RESEARCH_PROMPT.md, RESEARCH_DATA.md, and a content brief. |
| Public Version | What is included, what is omitted, and why prompts remain public. |
| Prompt Templates | Prepare/produce/deliver LLM workflow templates. |
This public repository keeps the engineering structure, tests, fixtures, and prompt workflow. Production datasets, generated reports, SERP TOP-10 exports, and external LLM orchestration are intentionally excluded. Modules that depend on a specific data layout remain as reference implementations for compatible SEO projects.
MIT. See LICENSE.
Raw keywords are easy to export. This project turns them into clustered SEO work that can be written, checked, and improved.