Description
Problem
When ingesting Markdown content into NLWeb (e.g., via RSS feeds or direct file uploads), the absence of a default transformation mechanism leads to:
- Loss of Semantic Meaning: Markdown elements like headings, lists, and code blocks are processed as raw text, missing the chance to map them to Schema.org types (e.g., a setup section as a `HowTo` type), as sketched below.
- Reduced Vector Search Accuracy: Without structured metadata, vector search fails to disambiguate entities (e.g., "bug fix" as software vs. insect repellent).
Background
At iunera, we’ve encountered similar challenges in past projects involving structured enterprise data and big data processing. Our solution was a default transformation pipeline that enriches Markdown content by converting it into structured data, which proved highly effective in internal and client projects (we could then combine it with structured data from other sources; see jsonld-schemaorg-javatypes).
Although this approach has not been published before, we believe it can significantly boost vector search in NLWeb. iunera is generally willing to contribute this solution to NLWeb to enhance its ingestion capabilities.
Impact
NLWeb currently attracts tech-savvy users who often host content in Git repositories with Markdown READMEs. Supporting Markdown ingestion would let these early adopters index content in the format they already use most, broadening NLWeb’s user base and enhancing its conversational capabilities.
Proposed Solution
Add a default Markdown-to-JSON-LD transformation pipeline to NLWeb’s ingestion process, converting Markdown content into Schema.org JSON-LD for improved semantic indexing. This can leverage existing tools like iunera’s json-ld-markdown (demo) as a foundation.
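As a rough illustration of the parse → map → emit flow (assuming a plain-Python step with no external dependencies; this is not iunera's json-ld-markdown implementation), a sketch that maps a `## Setup` section with numbered steps to a Schema.org `HowTo` might look like this:

```python
import json
import re

def markdown_setup_to_howto(md_text: str) -> dict | None:
    """Map a '## Setup' section with numbered steps to a Schema.org HowTo.

    A deliberately simplistic sketch: a real pipeline would walk a full
    Markdown AST and support many more element-to-type mappings.
    """
    # Capture everything under the '## Setup' heading up to the next heading.
    match = re.search(r"^##\s+Setup\s*\n(.*?)(?=^#|\Z)", md_text,
                      re.MULTILINE | re.DOTALL)
    if not match:
        return None

    # Treat numbered list items as HowToStep entries.
    steps = re.findall(r"^\s*\d+\.\s+(.+)$", match.group(1), re.MULTILINE)
    return {
        "@context": "https://schema.org",
        "@type": "HowTo",
        "name": "Setup",
        "step": [
            {"@type": "HowToStep", "position": i, "text": text.strip()}
            for i, text in enumerate(steps, start=1)
        ],
    }

readme = """# My Project

## Setup
1. Clone the repository.
2. Install the dependencies.
3. Run the ingestion script.
"""

print(json.dumps(markdown_setup_to_howto(readme), indent=2))
```

A production pipeline would of course operate on a proper Markdown AST and cover more element-to-type mappings (tables, code blocks, FAQs), but the output shape would stay the same: one Schema.org JSON-LD object per recognizable construct, ready for NLWeb's existing embedding and indexing steps.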
Questions for Discussion
- Is Markdown ingestion within the scope and interest of the NLWeb project?
- Does the community see value in Markdown ingestion, or is this primarily an enterprise need we’ve identified?
- Which Schema.org types (e.g., for comparison tables, headings, lists, code blocks) would have the greatest impact on vector search accuracy?
- Should NLWeb integrate an existing tool like json-ld-markdown or develop a custom pipeline?
Expected Outcome
- A standardized Markdown ingestion pipeline for NLWeb.
- Enhanced vector search accuracy and conversational responses through semantic context (e.g., mapping a setup guide to a `HowTo` type).
- Community-driven mappings, allowing developers to contribute custom transformations in line with NLWeb’s collaborative ethos (see the sketch below).