# SHK · lit-import · v1 pipeline notebook

Use this notebook to run the pipeline step‑by‑step for any corpus (KJV/ASV/WEB, Strong's, KJV+Strong's, Nicene, etc.).

**Tip:** Run each cell in order. If a step fails, read the error message in the cell output.


## 0) Configure paths (edit me once)

- `REPO_ROOT`: absolute path to your local SHK repo
- `LIT_IMPORT`: path to the lit-import tool
- `API_ROOT`: output folder for the v1 API (normally `docs/data/v1`)

In [None]:
from pathlib import Path

# >>> EDIT THIS to your local checkout <<<
REPO_ROOT = Path(r"C:/path/to/SHK").resolve()
LIT_IMPORT = REPO_ROOT / "tools/lit-import"
API_ROOT = REPO_ROOT / "docs/data/v1"

REPO_ROOT, LIT_IMPORT, API_ROOT

## 1) Pick a spec (choose one)
- Bible (plain): `bible_en_kjv_plain.json`, `bible_en_asv_plain.json`, `bible_en_web_plain.json`
- Strong's: `strongs_lexicon.xml.json`
- Bible + Strong's: `bible_en_kjv_plus_strongs.json`
- General text: `exbib_en_nicene.json`

In [None]:
# Uncomment exactly one SPEC line; if several are active, the last assignment wins
SPEC = LIT_IMPORT / "src/shk_lit_import/specs/bible_en_kjv_plain.json"
# SPEC = LIT_IMPORT / "src/shk_lit_import/specs/bible_en_asv_plain.json"
# SPEC = LIT_IMPORT / "src/shk_lit_import/specs/bible_en_web_plain.json"
# SPEC = LIT_IMPORT / "src/shk_lit_import/specs/strongs_lexicon.xml.json"
# SPEC = LIT_IMPORT / "src/shk_lit_import/specs/bible_en_kjv_plus_strongs.json"
# SPEC = LIT_IMPORT / "src/shk_lit_import/specs/exbib_en_nicene.json"
SPEC
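
Before running the pipeline it can help to look inside the chosen spec. The sketch below is a hypothetical helper (`summarize_spec` is not part of lit-import) that reads a spec file and reports its `corpus_id`, any `source.urls`, and its top-level keys. It assumes only that the spec is a JSON object; the two field names are the ones this notebook mentions elsewhere, and other fields may differ per spec.

```python
import json
from pathlib import Path


def summarize_spec(spec_path):
    """Return a few fields from a spec file for a quick look (hypothetical helper)."""
    spec = json.loads(Path(spec_path).read_text(encoding="utf-8"))
    source = spec.get("source")
    # "source.urls" is assumed to be a list when present; fall back to an empty list
    urls = source.get("urls", []) if isinstance(source, dict) else []
    return {
        "corpus_id": spec.get("corpus_id"),
        "source_urls": urls,
        "top_level_keys": sorted(spec),
    }


# In this notebook: summarize_spec(SPEC)
```

An empty `source_urls` list is a hint that **Fetch** will only write a placeholder for this spec.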

## 2) Helper to run the CLI and show output
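This wraps the `shk-lit` CLI, invoked here as `python -m shk_lit_import.cli`, and captures both stdout and stderr. A non-zero exit code raises `RuntimeError`.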

In [None]:
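import subprocess
import sys
from typing import List, Optional

def run(cmd: List[str], cwd=None):
    """Run a command, echo its output, and raise if it exits non-zero."""
    print("$", " ".join(cmd))
    p = subprocess.run(cmd, cwd=cwd, text=True, capture_output=True)
    print(p.stdout)
    if p.stderr:
        # Show stderr even on success so warnings are not lost
        print(p.stderr, file=sys.stderr)
    if p.returncode != 0:
        raise RuntimeError(f"Command failed with code {p.returncode}")
    return p

def shk(cmd: str, extra: Optional[List[str]] = None):
    """Invoke the lit-import CLI for the selected SPEC, e.g. shk("fetch")."""
    extra = extra or []
    return run([sys.executable, "-m", "shk_lit_import.cli", "--spec", str(SPEC), cmd] + extra, cwd=LIT_IMPORT)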

## 3) Verify environment (editable install, CLI help)

In [None]:
# Ensure we're pointing at the right folders
assert LIT_IMPORT.exists(), f"Missing: {LIT_IMPORT}"
assert (LIT_IMPORT / "pyproject.toml").exists(), "pyproject.toml not found in tools/lit-import"
assert SPEC.exists(), f"Spec not found: {SPEC}"
print("Paths look good. Now showing CLI help…")
# If this fails with a "No module named" error, run step 4 (install) and retry
shk("-h")

## 4) (Optional) Install the package here
Only needed once per environment. If you already ran `pip install -e .` in your terminal, you can skip this.

In [None]:
run([sys.executable, "-m", "pip", "install", "-e", "."], cwd=LIT_IMPORT)
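
To decide whether this install step is needed, one option is to ask Python whether the package can already be imported. The sketch below uses the standard library's `importlib.util.find_spec`; `is_installed` is a hypothetical helper name, and `shk_lit_import` is the module name used by the CLI helper above.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located in this environment."""
    return importlib.util.find_spec(module_name) is not None


# In this notebook: is_installed("shk_lit_import")
```

If this returns `False` in the notebook kernel, the editable install has not been done for this environment, even if it was done in a different terminal environment.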

## 5) Fetch
Downloads source files into `tools/lit-import/data/raw/<corpus_id>/…` and records provenance.
**Note:** The current scaffold fetcher writes a placeholder unless the spec contains real URLs.

In [None]:
shk("fetch")

## 6) Normalize
Parses the raw data into normalized JSONL under `tools/lit-import/data/processed/<corpus_id>/…`.

In [None]:
shk("normalize")
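
A quick way to confirm that **Normalize** produced something is to count the records in each JSONL file. The sketch below is a hypothetical helper, not part of lit-import. It assumes only that the output files end in `.jsonl` and hold one JSON value per non-empty line; the file names and record layout are not assumed.

```python
import json
from pathlib import Path


def count_jsonl_records(folder):
    """Map each .jsonl file under folder (relative path) to its record count."""
    root = Path(folder)
    counts = {}
    for path in sorted(root.rglob("*.jsonl")):
        n = 0
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    json.loads(line)  # raises json.JSONDecodeError on a malformed record
                    n += 1
        counts[path.relative_to(root).as_posix()] = n
    return counts


# In this notebook: count_jsonl_records(LIT_IMPORT / "data/processed")
```

An empty result means no `.jsonl` files were found, which usually points back at the raw input from the **Fetch** step.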

## 7) Index (optional for plain Bible)
Builds crosswalks/frequencies if the corpus supports them.

In [None]:
shk("index")

## 8) Export pages (to docs/data/v1)
Writes browser-facing JSON to your `docs/data/v1/**` structure.

In [None]:
shk("export-pages", ["--out", str(API_ROOT)])
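
Once each stage has been run by hand, steps 5 to 8 can be chained. The sketch below is a hypothetical convenience wrapper (`run_stages` is not part of lit-import) that takes a step function with the same call shape as `shk` and passes `--out` only to `export-pages`, as in the cell above. Because `shk` raises on a non-zero exit code, the loop stops at the first failing stage.

```python
from typing import Callable, List, Sequence


def run_stages(
    step: Callable,
    out_dir: str,
    stages: Sequence[str] = ("fetch", "normalize", "index", "export-pages"),
) -> List[str]:
    """Run pipeline stages in order and return the names of those that completed."""
    completed = []
    for stage in stages:
        # Only the export stage takes an output folder
        extra = ["--out", out_dir] if stage == "export-pages" else None
        step(stage, extra)  # expected to raise if the stage fails
        completed.append(stage)
    return completed


# In this notebook: run_stages(shk, str(API_ROOT))
```

For a plain Bible corpus, where indexing is optional, pass `stages=("fetch", "normalize", "export-pages")` instead.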

## 9) Quick sanity checks (counts, manifest)
These are convenience checks so you can confirm outputs without leaving the notebook.

In [None]:
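import glob

def find(path_glob):
    # glob.glob already returns strings; sort for stable output
    return sorted(glob.glob(str(path_glob), recursive=True))

# The *.json pattern also matches manifest.json, so filter it out of the book list
kjv_books = [p for p in find(API_ROOT/"lit/bible/en/kjv/*.json") if not p.endswith("manifest.json")]

print("API root:", API_ROOT)
print("Strong's index:", find(API_ROOT/"lit/strongs/index.json"))
print("KJV manifest:", find(API_ROOT/"lit/bible/en/kjv/manifest.json"))
print("KJV per-book files:", len(kjv_books), "(first 5)", kjv_books[:5])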
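
The checks above only confirm that files exist. A slightly stronger check, sketched below, is to confirm that every exported `.json` file actually parses. `check_json_tree` is a hypothetical helper; it makes no assumption about the manifest or page schema and only tests for well-formed JSON.

```python
import json
from pathlib import Path


def check_json_tree(root):
    """Return (ok, bad): relative paths of .json files that parse, and those that do not."""
    root = Path(root)
    ok, bad = [], []
    for path in sorted(root.rglob("*.json")):
        rel = path.relative_to(root).as_posix()
        try:
            json.loads(path.read_text(encoding="utf-8"))
            ok.append(rel)
        except (json.JSONDecodeError, UnicodeDecodeError):
            bad.append(rel)
    return ok, bad


# In this notebook:
# ok, bad = check_json_tree(API_ROOT)
# print(len(ok), "files parse;", len(bad), "do not:", bad)
```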

## 10) Clean processed data (optional)
Remove `tools/lit-import/data/processed/<corpus_id>` and re-run to verify determinism.

In [None]:
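import shutil, json
# Derive the processed-data folder from the spec's corpus_id (":" becomes "_")
spec = json.loads(Path(SPEC).read_text(encoding='utf-8'))
corpus = spec.get('corpus_id','corpus').replace(':','_')
proc = LIT_IMPORT / "data/processed" / corpus
print("Removing:", proc)
shutil.rmtree(proc, ignore_errors=True)
# ignore_errors hides failures, so confirm the folder is gone (expect False)
proc.exists()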
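
Determinism is easier to verify with a fingerprint of the whole folder than by eye. The sketch below is a hypothetical helper (`tree_digest` is not part of lit-import) that folds every file's relative path and bytes into one SHA-256 digest. Take the digest before cleaning, clean, re-run **Normalize**, and compare.

```python
import hashlib
from pathlib import Path


def tree_digest(root):
    """Return one SHA-256 hex digest covering the paths and contents of all files under root."""
    root = Path(root)
    h = hashlib.sha256()
    # Sort so the digest does not depend on directory listing order
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(path.relative_to(root).as_posix().encode("utf-8"))
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


# In this notebook (run the first line BEFORE the clean cell):
# before = tree_digest(proc)
# ... clean, then shk("normalize") ...
# assert tree_digest(proc) == before, "normalize output changed between runs"
```

A mismatch does not say which file changed; timestamps or unordered keys written into the output are common causes worth checking first.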

---
### Notes
- Update the **spec** file with real URLs and `sha256` values to enable real fetching.
- For KJV/ASV/WEB OSIS parsing, place an `.xml` in `tools/lit-import/data/raw/<corpus_id>/` and rerun **Normalize** → **Export pages**.
- For Strong's, place the XML in `tools/lit-import/data/raw/lexicon_strongs/` (or fill `source.urls` in the spec) and run all steps.
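
To get a `sha256` value for the spec, hash the source file you downloaded. The sketch below uses only the standard library; `sha256_of_file` is a hypothetical helper name, and how lit-import compares the value (hex case, field location in the spec) is not confirmed here.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the lowercase hex SHA-256 of a file, read in chunks to limit memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# In this notebook, with a file you placed under data/raw/<corpus_id>/:
# sha256_of_file(LIT_IMPORT / "data/raw" / "<corpus_id>" / "<file>.xml")
```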