Description
Problem
When ingesting Markdown content into NLWeb (e.g., via RSS feeds or direct file uploads), the absence of a default transformation mechanism leads to:
- Loss of Semantic Meaning: Markdown elements like headings, lists, and code blocks are processed as raw text, missing the chance to map them to Schema.org types (e.g., a setup section as a `HowTo` type), as sketched below.
- Reduced Vector Search Accuracy: Without structured metadata, vector search fails to disambiguate entities (e.g., "bug fix" as software vs. insect repellent).
Background
At iunera, we’ve encountered similar challenges in past projects involving structured enterprise data and big data processing. Our solution was a default transformation pipeline that enriches Markdown content by converting it into structured data, which proved highly effective in internal and client projects (we could then combine it with structured data from other sources; see jsonld-schemaorg-javatypes).
Although this approach has not been published before, we believe it can significantly boost vector search in NLWeb. iunera is generally willing to contribute this solution to NLWeb to enhance its ingestion capabilities.
Impact
NLWeb currently attracts tech-savvy users who often host content in Git repositories with Markdown READMEs. Supporting Markdown ingestion would let these early adopters index content in the format they already use most, broadening NLWeb’s user base and enhancing its conversational capabilities.
Proposed Solution
Add a default Markdown-to-JSON-LD transformation pipeline to NLWeb’s ingestion process, converting Markdown content into Schema.org JSON-LD for improved semantic indexing. This can leverage existing tools like iunera’s json-ld-markdown (demo) as a foundation.
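As a rough illustration of the parse → map → emit flow (assuming a plain-Python step with no external dependencies; this is not iunera's json-ld-markdown implementation), a sketch that maps a `## Setup` section with numbered steps to a Schema.org `HowTo` might look like this:

```python
import json
import re

def markdown_setup_to_howto(md_text: str) -> dict | None:
    """Map a '## Setup' section with numbered steps to a Schema.org HowTo.

    A deliberately simplistic sketch: a real pipeline would walk a full
    Markdown AST and support many more element-to-type mappings.
    """
    # Capture everything under the '## Setup' heading up to the next heading.
    match = re.search(r"^##\s+Setup\s*\n(.*?)(?=^#|\Z)", md_text,
                      re.MULTILINE | re.DOTALL)
    if not match:
        return None

    # Treat numbered list items as HowToStep entries.
    steps = re.findall(r"^\s*\d+\.\s+(.+)$", match.group(1), re.MULTILINE)
    return {
        "@context": "https://schema.org",
        "@type": "HowTo",
        "name": "Setup",
        "step": [
            {"@type": "HowToStep", "position": i, "text": text.strip()}
            for i, text in enumerate(steps, start=1)
        ],
    }

readme = """# My Project

## Setup
1. Clone the repository.
2. Install the dependencies.
3. Run the ingestion script.
"""

print(json.dumps(markdown_setup_to_howto(readme), indent=2))
```

A production pipeline would of course operate on a proper Markdown AST and cover more element-to-type mappings (tables, code blocks, FAQs), but the output shape would stay the same: one Schema.org JSON-LD object per recognizable construct, ready for NLWeb's existing embedding and indexing steps.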
Questions for Discussion
- Is Markdown ingestion within the scope and interest of the NLWeb project?
- Does the community see value in Markdown ingestion, or is this primarily an enterprise need we’ve identified?
- Which Schema.org types (e.g., for comparison tables, headings, lists, code blocks) would have the greatest impact on vector search accuracy?
- Should NLWeb integrate an existing tool like json-ld-markdown or develop a custom pipeline?
Expected Outcome
- A standardized Markdown ingestion pipeline for NLWeb.
- Enhanced vector search accuracy and conversational responses through semantic context (e.g., mapping a setup guide to a `HowTo` type).
- Community-driven mappings, allowing developers to contribute custom transformations in line with NLWeb’s collaborative ethos (see the sketch below).