Feature Request: Default Markdown NLWeb Ingestion #177

Open
@tim13337

Problem

When ingesting Markdown content into NLWeb (e.g., via RSS feeds or direct file uploads), the absence of a default transformation mechanism leads to:

  • Loss of Semantic Meaning: Markdown elements like headings, lists, and code blocks are processed as raw text, missing the chance to map them to Schema.org types (e.g., a setup section as a HowTo type).
  • Reduced Vector Search Accuracy: Without structured metadata, vector search cannot reliably disambiguate entities (e.g., “bug fix” in the software sense vs. insect repellent).

Background

At iunera, we’ve encountered similar challenges in past projects involving structured enterprise data and big data processing. Our solution was a default transformation pipeline that enriches Markdown content by converting it into structured data. This worked well in internal and client projects, and it let us combine the converted Markdown with structured data from other sources (see jsonld-schemaorg-javatypes).

Although we have not published this approach before, we believe it can significantly improve vector search in NLWeb. iunera is willing, in principle, to contribute this solution to NLWeb to enhance its ingestion capabilities.

Impact

NLWeb currently attracts tech-savvy users who often host content in Git repositories with Markdown READMEs. Supporting Markdown ingestion would let these early adopters index content in the format they use most, which would broaden NLWeb’s user base and improve its conversational capabilities.

Proposed Solution

Add a default Markdown-to-JSON-LD transformation pipeline to NLWeb’s ingestion process, converting Markdown content into Schema.org JSON-LD for improved semantic indexing. This can leverage existing tools like iunera’s json-ld-markdown (demo) as a foundation.
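To make the idea concrete, here is a minimal sketch of what such a transformation could look like. It is not NLWeb code and does not reflect how json-ld-markdown works internally; the function name `markdown_to_jsonld` and the mapping rule (a numbered list becomes a Schema.org HowTo, anything else becomes an Article) are assumptions chosen purely for illustration.

```python
import json
import re

# Hypothetical, simplified patterns: ATX headings and numbered list items only.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
STEP = re.compile(r"^\d+[.)]\s+(.*)$")


def markdown_to_jsonld(md: str) -> dict:
    """Illustrative mapping (an assumption, not an NLWeb API):
    numbered lists become HowTo steps, everything else an Article body."""
    title = None
    steps = []
    body = []
    for raw in md.splitlines():
        line = raw.strip()
        if not line:
            continue
        heading = HEADING.match(line)
        if heading:
            # Only the first heading is used, as the document name.
            if title is None:
                title = heading.group(2).strip()
            continue
        step = STEP.match(line)
        if step:
            steps.append(step.group(1).strip())
        else:
            body.append(line)

    doc = {"@context": "https://schema.org", "name": title or "Untitled"}
    if steps:
        doc["@type"] = "HowTo"
        doc["step"] = [
            {"@type": "HowToStep", "position": i, "text": text}
            for i, text in enumerate(steps, start=1)
        ]
        if body:
            doc["description"] = " ".join(body)
    else:
        doc["@type"] = "Article"
        doc["articleBody"] = " ".join(body)
    return doc


if __name__ == "__main__":
    sample = "# Setup\n\nInstall the tool.\n\n1. Clone the repo\n2. Run the installer\n"
    print(json.dumps(markdown_to_jsonld(sample), indent=2))
```

A real pipeline would need a proper Markdown parser and far richer rules (nested lists, code blocks, tables); the sketch only shows the shape of the output that the ingestion step would index.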

Questions for Discussion

  • Is Markdown ingestion within the scope and interest of the NLWeb project?
  • Does the community see value in Markdown ingestion, or is this primarily an enterprise need we’ve identified?
  • Which Schema.org types (e.g., for comparison tables, headings, lists, code blocks) would have the greatest impact on vector search accuracy?
  • Should NLWeb integrate an existing tool like json-ld-markdown or develop a custom pipeline?

Expected Outcome

  • A standardized Markdown ingestion pipeline for NLWeb.
  • Enhanced vector search accuracy and conversational responses through semantic context (e.g., mapping a setup guide to a HowTo type).
  • Community-driven mappings, allowing developers to contribute custom transformations in line with NLWeb’s collaborative ethos.
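
One possible shape for community-contributed mappings is a small registry keyed by Markdown block kind. Everything in the sketch below is hypothetical: the names `register_mapping` and `apply_mappings`, the `(kind, text)` block representation, and the choice of Schema.org types are assumptions for discussion, not an existing NLWeb interface.

```python
from typing import Callable, Dict, List, Tuple

Block = Tuple[str, str]          # (kind, text), e.g. ("code", "pip install x")
Mapper = Callable[[str], dict]   # turns block text into a JSON-LD node

_MAPPERS: Dict[str, Mapper] = {}  # hypothetical global registry


def register_mapping(kind: str) -> Callable[[Mapper], Mapper]:
    """Decorator that registers a mapper for one Markdown block kind."""
    def decorator(fn: Mapper) -> Mapper:
        _MAPPERS[kind] = fn
        return fn
    return decorator


@register_mapping("code")
def map_code(text: str) -> dict:
    return {"@type": "SoftwareSourceCode", "text": text}


@register_mapping("heading")
def map_heading(text: str) -> dict:
    return {"@type": "WebPageElement", "name": text}


def apply_mappings(blocks: List[Block]) -> List[dict]:
    """Apply the registered mapper per block; unknown kinds fall back to CreativeWork."""
    nodes = []
    for kind, text in blocks:
        mapper = _MAPPERS.get(kind)
        nodes.append(mapper(text) if mapper else {"@type": "CreativeWork", "text": text})
    return nodes
```

With a registry like this, a contributor could add a mapping for, say, comparison tables by registering one more function, without touching the core ingestion code.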
