v0.1.4 — code/dataset/image ingest + Obsidian connector

@Sardor-M released this 23 Apr 08:16 · 139 commits to main since this release

What's in v0.1.4

  • Code repositories (source_type: code) — lumen add <github-url> shallow-clones into a tmpdir and cleans up afterward;
    skips build and dependency directories (… / .next / target), truncates files over 50 KB, and caps at 800 files per repo.
    Regex signature extraction for JS/TS/Python/Go/Rust/Java/C/C++/Ruby/C#. README.md / CONTRIBUTING.md / docs/ are
    ordered before source sections.
  • Datasets (source_type: dataset) — lumen add ./data.csv or lumen add https://huggingface.co/datasets/<id>.
    Native CSV / TSV / JSONL parsing with delimiter auto-detection, quoted-field + doubled-quote escapes; HuggingFace datasets
    fetched via /api/datasets/ plus /raw/main/README.md for the card. Schema table (column, inferred type, null count) +
    20-row preview rendered as markdown. Colocated README.md / dataset-card.md auto-inlined.
  • Images (source_type: image) — lumen add screenshot.png. SHA-256 of bytes into metadata, MIME inferred from
    extension. Optional OCR shells out to a local tesseract binary when on PATH; --no-ocr stores metadata only. Missing
    binary produces a brew install hint instead of failing. Supports .png / .jpg / .webp / .gif / .bmp / .tiff.
  • Obsidian connector (ConnectorType: obsidian) — lumen watch add obsidian ~/vault. Parses Web Clipper YAML
    frontmatter, promotes the source: URL into the ExtractionResult so re-clippings dedup by URL. Skips .obsidian and
    other dotfiles. Flat keys, inline arrays, block-list arrays all supported.
  • Auto-detection in detectSourceType: GitHub repo URL → code, HuggingFace dataset URL → dataset, .csv / .tsv
    / .jsonl / .ndjson → dataset, image extensions → image, directory containing .git → code.
  • New CLI flags: --as-dataset (force dataset handling for ambiguous text), --no-ocr (skip OCR on images).
    The AddInput object form in lib/add.ts accepts an options field that threads IngestOptions through the programmatic path.
  • 77 new tests across tests/code.test.ts (20), tests/dataset.test.ts (20), tests/image.test.ts (12, one
    conditionally skipped when tesseract is absent), tests/obsidian.test.ts (19). Full suite: 536 passing, 2 skipped.
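
The auto-detection rules above can be sketched roughly as follows. This is an illustration only, not Lumen's actual detectSourceType: the function name detectSourceTypeSketch, the "other" fallback value, the order of the checks, and the exact URL shapes matched are all assumptions.

```typescript
import { existsSync, statSync } from "node:fs";
import { extname, join } from "node:path";

// "other" stands in for whatever the pre-existing detection would return.
type Detected = "code" | "dataset" | "image" | "other";

// Extension lists taken from the release notes above.
const DATASET_EXTS = new Set([".csv", ".tsv", ".jsonl", ".ndjson"]);
const IMAGE_EXTS = new Set([".png", ".jpg", ".webp", ".gif", ".bmp", ".tiff"]);

function tryParseUrl(input: string): URL | null {
  try {
    return new URL(input);
  } catch {
    return null;
  }
}

function detectSourceTypeSketch(input: string): Detected {
  const url = /^https?:\/\//i.test(input) ? tryParseUrl(input) : null;
  if (url) {
    const parts = url.pathname.split("/").filter(Boolean);
    // Assumption: a dataset URL looks like huggingface.co/datasets/<id>.
    if (url.hostname === "huggingface.co" && parts[0] === "datasets" && parts.length >= 2) {
      return "dataset";
    }
    // Assumption: a repo URL is exactly github.com/<owner>/<repo>.
    if (url.hostname === "github.com" && parts.length === 2) return "code";
    return "other";
  }
  const ext = extname(input).toLowerCase();
  if (DATASET_EXTS.has(ext)) return "dataset";
  if (IMAGE_EXTS.has(ext)) return "image";
  // A local directory that contains .git is treated as a code repository.
  const isDir = existsSync(input) && statSync(input).isDirectory();
  if (isDir && existsSync(join(input, ".git"))) return "code";
  return "other";
}
```

Under these assumptions, detectSourceTypeSketch("./data.csv") yields "dataset" and detectSourceTypeSketch("screenshot.png") yields "image", while a deep GitHub link such as a blob URL falls through to "other".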

Changed

  • SourceType union widened to 'url' | 'pdf' | 'youtube' | 'arxiv' | 'file' | 'folder' | 'code' | 'dataset' | 'image'.
    No schema migration — existing source_type TEXT + metadata TEXT columns accept the new shapes.
  • ConnectorType union widened with 'obsidian'.
  • ingestInput(input, options?) — optional IngestOptions second argument, backwards-compatible.
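
A minimal sketch of why an optional second argument stays backwards-compatible. The SourceType union is copied from the notes above; the IngestOptions field names (asDataset, noOcr, forcedType), the IngestPlan shape, and the planIngest function are hypothetical stand-ins modeled on the CLI flags, not the real lib API.

```typescript
type SourceType =
  | "url" | "pdf" | "youtube" | "arxiv" | "file" | "folder"
  | "code" | "dataset" | "image";

// Hypothetical field names, modeled on the --as-dataset, --no-ocr and --type flags.
interface IngestOptions {
  asDataset?: boolean;
  noOcr?: boolean;
  forcedType?: SourceType;
}

// Hypothetical result type used only for this illustration.
interface IngestPlan {
  input: string;
  forcedType: SourceType | null;
  runOcr: boolean;
}

// Defaulting options to {} means existing one-argument call sites keep working.
function planIngest(input: string, options: IngestOptions = {}): IngestPlan {
  // An explicit forcedType wins over the asDataset shortcut in this sketch.
  const forcedType = options.forcedType ?? (options.asDataset ? "dataset" : null);
  return { input, forcedType, runOcr: !options.noOcr };
}
```

Calling planIngest("./a.csv") with no options behaves as before (no forced type, OCR enabled), while planIngest("notes.txt", { asDataset: true }) forces dataset handling.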

Fixes

  • parseJsonl: SCHEMA_SAMPLE_ROWS cap now counts only successfully-parsed object rows (previously incremented on
    malformed lines too).
  • dataset tests: global.fetch stubs switched to vi.stubGlobal + vi.unstubAllGlobals so they don't leak across test
    files.
  • obsidian connector: parseTarget and pull reject subdir paths that escape the vault via .. or absolute paths.
  • add command: --type flag is now actually threaded to the extractor as forcedType.
  • add command: action body wrapped in top-level try/catch for consistent error handling.
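
The vault-escape fix can be illustrated with a common containment check. This is a sketch under assumptions, not the connector's code: resolveInsideVault is a hypothetical helper, and it only inspects the path lexically, so symlinks inside the vault are not followed or checked.

```typescript
import { isAbsolute, relative, resolve, sep } from "node:path";

// Hypothetical helper: returns the resolved target, or throws if it leaves the vault.
function resolveInsideVault(vaultRoot: string, subdir: string): string {
  if (isAbsolute(subdir)) {
    throw new Error(`absolute paths are not allowed: ${subdir}`);
  }
  const root = resolve(vaultRoot);
  const target = resolve(root, subdir);
  // relative() starts with ".." when target sits outside root; on Windows it can
  // also return an absolute path when the two are on different drives.
  const rel = relative(root, target);
  if (rel === ".." || rel.startsWith(`..${sep}`) || isAbsolute(rel)) {
    throw new Error(`path escapes the vault: ${subdir}`);
  }
  return target;
}
```

With this check, "notes/daily" and "notes/../inbox" resolve inside the vault, while "../secrets", "notes/../../x", and "/etc" are rejected. Comparing the normalized relative path rather than searching the raw string for ".." avoids rejecting legitimate names such as "..drafts".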

Deferred

Tracked in docs/docs-temp/INGEST-EXPANSION-PLAN.md:

  • Tree-sitter-based code parsing — deferred to avoid a native-build dependency for now.
  • Claude Vision caption pass on compile for images — metadata.caption reserved as placeholder.
  • Native Parquet support — errors with a duckdb conversion hint instead of parsing.
  • In-house browser extension — Obsidian path validates the flow first.

Install

npm install -g lumen-kb

PRs

  • #10 — code/dataset/image ingest formats + Obsidian connector + v0.1.4 CHANGELOG

Full changelog: https://github.com/Sardor-M/Lumen/blob/main/CHANGELOG.md