What's in v0.1.4

Code repositories (source_type: code) — lumen add <github-url> shallow-clones into tmpdir and cleans up;
skips build-output directories such as .next / target, truncates files over 50 KB, caps at 800 files per repo. Regex signature extraction for
JS/TS/Python/Go/Rust/Java/C/C++/Ruby/C#. README.md / CONTRIBUTING.md / docs/ ordered before source sections.
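
To illustrate what regex-based signature extraction can look like, here is a minimal sketch covering three of the listed languages. The pattern table and the `extractSignatures` helper are hypothetical and are not taken from Lumen's source; the shipped patterns are likely broader.

```typescript
// Hypothetical sketch: per-language regexes that pull function signatures.
// These patterns are illustrative, not the ones Lumen ships.
const SIGNATURE_PATTERNS: Record<string, RegExp> = {
  ts: /^\s*(?:export\s+)?(?:async\s+)?function\s+\w+\s*\([^)]*\)/gm,
  py: /^\s*(?:async\s+)?def\s+\w+\s*\([^)]*\)/gm,
  go: /^\s*func\s+(?:\([^)]*\)\s*)?\w+\s*\([^)]*\)/gm,
};

function extractSignatures(source: string, lang: string): string[] {
  const pattern = SIGNATURE_PATTERNS[lang];
  if (!pattern) return [];
  // With the g flag, match() returns every match (or null when none).
  const matches = source.match(pattern) ?? [];
  // Trim leading indentation and newlines captured by ^\s*.
  return matches.map((m) => m.trim());
}
```

A regex pass like this misses multi-line parameter lists and nested parentheses, which is the usual trade-off against a real parser.
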
Datasets (source_type: dataset) — lumen add ./data.csv or lumen add https://huggingface.co/datasets/<id>.
Native CSV / TSV / JSONL parsing with delimiter auto-detection, quoted-field + doubled-quote escapes; HuggingFace datasets
fetched via /api/datasets/ plus /raw/main/README.md for the card. Schema table (column, inferred type, null count) +
20-row preview rendered as markdown. Colocated README.md / dataset-card.md auto-inlined.
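
The delimiter auto-detection and quote handling described above could be sketched roughly as follows. `detectDelimiter` and `parseLine` are hypothetical names, and the candidate delimiter list is an assumption; Lumen's actual parser may differ.

```typescript
// Hypothetical helper: pick the candidate delimiter that occurs most often
// in the header line. A real detector may also inspect later rows.
function detectDelimiter(headerLine: string): string {
  const candidates = [",", "\t", ";", "|"]; // assumed candidate set
  let best = ",";
  let bestCount = 0;
  for (const d of candidates) {
    const count = headerLine.split(d).length - 1;
    if (count > bestCount) {
      best = d;
      bestCount = count;
    }
  }
  return best;
}

// Split one record into fields, honoring "quoted fields" and "" escapes.
// This sketch works per line, so newlines inside quoted fields are not handled.
function parseLine(line: string, delimiter: string): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'; // doubled quote becomes one literal quote
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true; // opening quote
    } else if (ch === delimiter) {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}
```
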
Images (source_type: image) — lumen add screenshot.png. SHA-256 of bytes into metadata, MIME inferred from
extension. Optional OCR shells out to a local tesseract binary when on PATH; --no-ocr stores metadata only. Missing
binary produces a brew install hint instead of failing. Supports .png / .jpg / .webp / .gif / .bmp / .tiff.
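
As a rough illustration of the metadata step, the sketch below hashes the file bytes with Node's `crypto` module and looks up a MIME type by extension. The `describeImage` helper and the shape of its return value are assumptions for illustration, not Lumen's actual metadata schema.

```typescript
import { createHash } from "node:crypto";

// Extension -> MIME map for the formats listed above.
const MIME_BY_EXT: Record<string, string> = {
  ".png": "image/png",
  ".jpg": "image/jpeg",
  ".webp": "image/webp",
  ".gif": "image/gif",
  ".bmp": "image/bmp",
  ".tiff": "image/tiff",
};

// Hypothetical helper: returns the two metadata fields described above.
function describeImage(fileName: string, bytes: Uint8Array) {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot).toLowerCase() : "";
  return {
    // Hex-encoded SHA-256 digest of the raw bytes.
    sha256: createHash("sha256").update(bytes).digest("hex"),
    // MIME comes from the extension only; file contents are not sniffed.
    mime: MIME_BY_EXT[ext] ?? null,
  };
}
```
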
Obsidian connector (ConnectorType: obsidian) — lumen watch add obsidian ~/vault. Parses Web Clipper YAML
frontmatter, promotes the source: URL into the ExtractionResult so re-clippings dedup by URL. Skips .obsidian and
other dotfiles. Flat keys, inline arrays, block-list arrays all supported.
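
The three frontmatter shapes named above (flat keys, inline arrays, block-list arrays) can be handled without a full YAML parser. The following is a minimal sketch under that assumption; `parseFrontmatter` and `unquote` are hypothetical names, and nesting, multi-line strings, and commas inside quoted array items are deliberately out of scope.

```typescript
type FrontmatterValue = string | string[];

// Hypothetical minimal parser for flat keys, inline arrays, and block lists.
function parseFrontmatter(markdown: string): Record<string, FrontmatterValue> {
  const result: Record<string, FrontmatterValue> = {};
  const match = /^---\r?\n([\s\S]*?)\r?\n---/.exec(markdown);
  if (!match) return result;

  let listKey: string | null = null; // key currently collecting "- item" lines
  for (const line of match[1].split(/\r?\n/)) {
    const item = /^\s*-\s+(.*)$/.exec(line);
    if (item && listKey) {
      (result[listKey] as string[]).push(unquote(item[1].trim()));
      continue;
    }
    const pair = /^([A-Za-z0-9_-]+):\s*(.*)$/.exec(line);
    if (!pair) continue;
    const [, key, raw] = pair;
    listKey = null;
    if (raw === "") {
      result[key] = []; // block list follows on "- item" lines
      listKey = key;
    } else if (raw.startsWith("[") && raw.endsWith("]")) {
      const inner = raw.slice(1, -1).trim();
      result[key] = inner === "" ? [] : inner.split(",").map((s) => unquote(s.trim()));
    } else {
      result[key] = unquote(raw.trim());
    }
  }
  return result;
}

// Strip one pair of matching surrounding quotes, if present.
function unquote(value: string): string {
  return value.replace(/^(["'])(.*)\1$/, "$2");
}
```

With a parser like this, the `source` value is available as a plain string that a connector could use as the dedup key.
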
Auto-detection in detectSourceType: GitHub repo URL → code, HuggingFace dataset URL → dataset, .csv / .tsv
/ .jsonl / .ndjson → dataset, image extensions → image, directory containing .git → code.
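
The detection rules above can be read as an ordered chain of checks. The sketch below covers only the new cases; `detectNewSourceType` is a hypothetical stand-in for the real `detectSourceType`, the URL patterns are assumptions, and returning `null` stands for "fall through to the existing handling".

```typescript
import { existsSync } from "node:fs";
import * as path from "node:path";

type DetectedType = "code" | "dataset" | "image" | null;

const DATASET_EXTS = new Set([".csv", ".tsv", ".jsonl", ".ndjson"]);
const IMAGE_EXTS = new Set([".png", ".jpg", ".webp", ".gif", ".bmp", ".tiff"]);

// Hypothetical sketch of the new branches only.
function detectNewSourceType(input: string): DetectedType {
  // GitHub repo URL: owner/repo with nothing after it (assumed pattern).
  if (/^https?:\/\/github\.com\/[^/]+\/[^/]+\/?$/.test(input)) return "code";
  // HuggingFace dataset URL (assumed pattern).
  if (/^https?:\/\/huggingface\.co\/datasets\/.+/.test(input)) return "dataset";
  // Any other URL is left to the pre-existing detection.
  if (/^https?:\/\//.test(input)) return null;

  const ext = path.extname(input).toLowerCase();
  if (DATASET_EXTS.has(ext)) return "dataset";
  if (IMAGE_EXTS.has(ext)) return "image";
  // A directory containing .git is treated as a code repository.
  if (existsSync(path.join(input, ".git"))) return "code";
  return null;
}
```
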
New CLI flags: --as-dataset (force dataset handling for ambiguous text), --no-ocr (skip OCR on images).
lib/add.ts AddInput object form accepts an options field threading IngestOptions through the programmatic path.
77 new tests across tests/code.test.ts (20), tests/dataset.test.ts (20), tests/image.test.ts (12, one
conditionally skipped when tesseract is absent), tests/obsidian.test.ts (19). Full suite: 536 passing, 2 skipped.

Changed

SourceType union widened to 'url' | 'pdf' | 'youtube' | 'arxiv' | 'file' | 'folder' | 'code' | 'dataset' | 'image'.
No schema migration — existing source_type TEXT + metadata TEXT columns accept the new shapes.
ConnectorType union widened with 'obsidian'.
ingestInput(input, options?) — optional IngestOptions second argument, backwards-compatible.

Fixes

parseJsonl: SCHEMA_SAMPLE_ROWS cap now counts only successfully-parsed object rows (previously incremented on
malformed lines too).
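
The corrected counting behavior can be sketched as follows: the cap is checked against the number of rows actually collected, so malformed lines and non-object values no longer consume it. `sampleJsonlRows` is a hypothetical name, and the cap value of 3 is chosen only to keep the example small; the real `SCHEMA_SAMPLE_ROWS` value is not stated here.

```typescript
// Small illustrative cap; the real value is not stated in these notes.
const SCHEMA_SAMPLE_ROWS = 3;

function sampleJsonlRows(text: string): Record<string, unknown>[] {
  const rows: Record<string, unknown>[] = [];
  for (const line of text.split(/\r?\n/)) {
    // The cap counts collected rows, not lines visited.
    if (rows.length >= SCHEMA_SAMPLE_ROWS) break;
    if (line.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // malformed line: skipped, does not count toward the cap
    }
    // Only plain objects count as rows; numbers, strings, arrays are skipped.
    if (parsed !== null && typeof parsed === "object" && !Array.isArray(parsed)) {
      rows.push(parsed as Record<string, unknown>);
    }
  }
  return rows;
}
```
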
dataset tests: global.fetch stubs switched to vi.stubGlobal + vi.unstubAllGlobals so they don't leak across test
files.
obsidian connector: parseTarget and pull reject subdir paths that escape the vault via .. or absolute paths.
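
One common way to implement this kind of containment check is to resolve the candidate path and compare it against the vault root. The sketch below assumes that approach; `resolveInsideVault` is a hypothetical helper, not the connector's actual code, and it does not follow symlinks.

```typescript
import * as path from "node:path";

// Hypothetical helper: returns the resolved path or throws if it leaves the vault.
function resolveInsideVault(vaultRoot: string, subdir: string): string {
  if (path.isAbsolute(subdir)) {
    throw new Error(`subdir must be relative: ${subdir}`);
  }
  const root = path.resolve(vaultRoot);
  const target = path.resolve(root, subdir);
  // path.relative yields a path starting with ".." when target is outside root.
  const rel = path.relative(root, target);
  if (rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel)) {
    throw new Error(`subdir escapes the vault: ${subdir}`);
  }
  return target;
}
```

Checking the resolved path rather than scanning the string for `..` lets harmless inputs such as `notes/../other` through while still rejecting `notes/../../x`.
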
add command: --type flag is now actually threaded to the extractor as forcedType.
add command: action body wrapped in top-level try/catch for consistent error handling.

Deferred

Tracked in docs/docs-temp/INGEST-EXPANSION-PLAN.md:

Tree-sitter-based code parsing — avoids native-build dependency for now.
Claude Vision caption pass on compile for images — metadata.caption reserved as placeholder.
Native Parquet support — errors with a duckdb conversion hint instead of parsing.
In-house browser extension — Obsidian path validates the flow first.

Install

npm install -g lumen-kb

PRs

#10 — code/dataset/image ingest formats + Obsidian connector + v0.1.4 CHANGELOG

Full changelog: https://github.com/Sardor-M/Lumen/blob/main/CHANGELOG.md
