v0.1.4 — code/dataset/image ingest + Obsidian connector
What's in v0.1.4
- Code repositories (`source_type: code`): `lumen add <github-url>` shallow-clones into `tmpdir` and cleans up; /.next/target, truncates files over 50 KB, caps at 800 files per repo. Regex signature extraction for JS/TS/Python/Go/Rust/Java/C/C++/Ruby/C#. `README.md` / `CONTRIBUTING.md` / `docs/` ordered before source sections.
- Datasets (`source_type: dataset`): `lumen add ./data.csv` or `lumen add https://huggingface.co/datasets/<id>`. Native CSV / TSV / JSONL parsing with delimiter auto-detection, quoted-field + doubled-quote escapes; HuggingFace datasets fetched via `/api/datasets/` plus `/raw/main/README.md` for the card. Schema table (column, inferred type, null count) + 20-row preview rendered as markdown. Colocated `README.md` / `dataset-card.md` auto-inlined.
- Images (`source_type: image`): `lumen add screenshot.png`. SHA-256 of bytes into metadata, MIME inferred from extension. Optional OCR shells out to a local `tesseract` binary when on `PATH`; `--no-ocr` stores metadata only. Missing binary produces a `brew install` hint instead of failing. Supports `.png` / `.jpg` / `.webp` / `.gif` / `.bmp` / `.tiff`.
- Obsidian connector (`ConnectorType: obsidian`): `lumen watch add obsidian ~/vault`. Parses Web Clipper YAML frontmatter, promotes the `source:` URL into the `ExtractionResult` so re-clippings dedup by URL. Skips `.obsidian` and other dotfiles. Flat keys, inline arrays, block-list arrays all supported.
- Auto-detection in `detectSourceType`: GitHub repo URL → `code`, HuggingFace dataset URL → `dataset`, `.csv` / `.tsv` / `.jsonl` / `.ndjson` → `dataset`, image extensions → `image`, directory containing `.git` → `code`.
- New CLI flags: `--as-dataset` (force dataset handling for ambiguous text), `--no-ocr` (skip OCR on images).
- `lib/add.ts` `AddInput` object form accepts an `options` field threading `IngestOptions` through the programmatic path.
- 77 new tests across `tests/code.test.ts` (20), `tests/dataset.test.ts` (20), `tests/image.test.ts` (12, one conditionally skipped when `tesseract` is absent), `tests/obsidian.test.ts` (19). Full suite: 536 passing, 2 skipped.
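The delimiter auto-detection and doubled-quote handling described for datasets can be sketched roughly as follows. This is a minimal illustration and not Lumen's actual parser: `detectDelimiter` and `splitRow` are hypothetical names, and limiting the candidates to comma and tab is an assumption based on the CSV / TSV pairing in these notes.

```typescript
// Hypothetical helpers for illustration; not taken from the lumen-kb source.

/** Split one row, honoring quoted fields and "" as an escaped quote. */
function splitRow(line: string, delimiter: string): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (inQuotes) {
      if (ch === '"' && line.charAt(i + 1) === '"') {
        current += '"'; // doubled quote inside a quoted field is a literal quote
        i++;            // skip the second quote of the pair
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        current += ch;    // delimiters inside quotes are plain text
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === delimiter) {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

/** Pick the candidate that splits the header line into the most fields. */
function detectDelimiter(headerLine: string): string {
  const candidates = [",", "\t"]; // assumption: only CSV and TSV are considered
  let best = ",";
  let bestCount = 0;
  for (const d of candidates) {
    const count = splitRow(headerLine, d).length - 1;
    if (count > bestCount) {
      best = d;
      bestCount = count;
    }
  }
  return best;
}
```

Counting delimiters through `splitRow` rather than with a plain character count keeps a comma inside a quoted header from skewing the guess. The sketch works line by line, so it does not handle newlines embedded in quoted fields.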
Changed
- `SourceType` union widened to `'url' | 'pdf' | 'youtube' | 'arxiv' | 'file' | 'folder' | 'code' | 'dataset' | 'image'`. No schema migration: existing `source_type TEXT` + `metadata TEXT` columns accept the new shapes.
- `ConnectorType` union widened with `'obsidian'`.
- `ingestInput(input, options?)`: optional `IngestOptions` second argument, backwards-compatible.
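To illustrate why an optional second argument keeps existing callers working, here is a small sketch. The `SourceType` union is copied from the notes above; everything else is hypothetical: `IngestOptionsSketch` and its field names (`asDataset`, `noOcr`) are guesses modeled on the new CLI flags, and `ingestInputSketch` only describes what it would do instead of ingesting anything.

```typescript
// Union as listed in the release notes.
type SourceType =
  | "url" | "pdf" | "youtube" | "arxiv" | "file" | "folder"
  | "code" | "dataset" | "image";

// Hypothetical shape: the notes only say that IngestOptions exists.
interface IngestOptionsSketch {
  asDataset?: boolean;
  noOcr?: boolean;
}

// Hypothetical stand-in for ingestInput. The default value means
// one-argument call sites compile and behave as before.
function ingestInputSketch(input: string, options: IngestOptionsSketch = {}): string {
  const mode = options.asDataset ? "forced dataset" : "auto-detect";
  const ocr = options.noOcr ? "ocr off" : "ocr on";
  return `${input}: ${mode}, ${ocr}`;
}

// Exhaustive switch: widening SourceType without adding a case here
// becomes a compile-time error through the `never` assignment.
function isNewInV014(t: SourceType): boolean {
  switch (t) {
    case "code":
    case "dataset":
    case "image":
      return true;
    case "url":
    case "pdf":
    case "youtube":
    case "arxiv":
    case "file":
    case "folder":
      return false;
    default: {
      const unreachable: never = t;
      return unreachable;
    }
  }
}
```

A call such as `ingestInputSketch("notes.txt")` and one such as `ingestInputSketch("notes.txt", { noOcr: true })` both type-check, which is the backwards-compatibility property the notes describe.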
Fixes
- `parseJsonl`: `SCHEMA_SAMPLE_ROWS` cap now counts only successfully-parsed object rows (previously incremented on malformed lines too).
- `dataset` tests: `global.fetch` stubs switched to `vi.stubGlobal` + `vi.unstubAllGlobals` so they don't leak across test files.
- `obsidian` connector: `parseTarget` and `pull` reject subdir paths that escape the vault via `..` or absolute paths.
- `add` command: `--type` flag is now actually threaded to the extractor as `forcedType`.
- `add` command: action body wrapped in top-level `try/catch` for consistent error handling.
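The vault-escape check in the Obsidian fix can be approximated with Node's `path` module. The sketch below is one common way to do it, not the code in `parseTarget` or `pull`: `resolveInsideVault` is a hypothetical helper, and it does not follow symlinks, so a real implementation may need an extra `realpath` step.

```typescript
import * as path from "node:path";

// Hypothetical helper; the real parseTarget/pull checks may differ.
function resolveInsideVault(vaultRoot: string, subdir: string): string {
  // Reject absolute subdirs outright instead of letting them replace the root.
  if (path.isAbsolute(subdir)) {
    throw new Error(`absolute subdir not allowed: ${subdir}`);
  }
  const root = path.resolve(vaultRoot);
  const target = path.resolve(root, subdir);

  // After normalization, a path inside the vault is relative to the root
  // without a leading ".." segment.
  const rel = path.relative(root, target);
  const escapes =
    rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  if (escapes) {
    throw new Error(`subdir escapes the vault: ${subdir}`);
  }
  return target;
}
```

Checking the normalized relative path, rather than searching the raw string for `..`, also catches inputs like `notes/../../x` while still allowing harmless ones like `a/../b`.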
Deferred
Tracked in `docs/docs-temp/INGEST-EXPANSION-PLAN.md`:
- Tree-sitter-based code parsing: deferred to avoid a native-build dependency for now.
- Claude Vision caption pass on `compile` for images: `metadata.caption` reserved as placeholder.
- Native Parquet support: errors with a `duckdb` conversion hint instead of parsing.
- In-house browser extension: Obsidian path validates the flow first.
Install
npm install -g lumen-kb
PRs
- #10 — code/dataset/image ingest formats + Obsidian connector + v0.1.4 CHANGELOG
Full changelog: https://github.com/Sardor-M/Lumen/blob/main/CHANGELOG.md