SparkNLP 1213 Adding Markdown Reader #14618

danilojsl · 2025-07-01T22:33:06Z

Description

This PR introduces a new feature that enables reading and parsing Markdown files into a structured Spark DataFrame. This functionality allows for efficient processing and analysis of Markdown content, supporting use cases such as document indexing, content analysis, and downstream natural language processing (NLP) workflows.

Added
sparknlp.read().md(): This method accepts file paths of Markdown documents and loads their structured content into a DataFrame.

Usage Example:

partitioner = Partition(content_type="text/markdown").partition(md_directory)
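
For comparison with the Partition call above, here is a minimal sketch of calling the reader directly. It assumes the entry point is `sparknlp.read()`, as with Spark NLP's other readers, and that `.md()` takes a single path string; the helper name `read_markdown` is hypothetical and the output schema is not asserted here.

```python
def read_markdown(md_path: str):
    """Load Markdown files into a Spark DataFrame (sketch, see assumptions above)."""
    # Imported lazily so this snippet can be loaded without Spark NLP installed.
    import sparknlp

    # Start (or reuse) a Spark session configured for Spark NLP.
    sparknlp.start()

    # Assumption: .md() accepts a file or directory path, per the PR description.
    return sparknlp.read().md(md_path)
```

Calling `read_markdown("path/to/docs").printSchema()` is one way to inspect what the reader returns before wiring it into a pipeline.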

Motivation and Context

Structured Data Representation: By converting raw Markdown files into a structured DataFrame, we enable seamless integration with Spark’s powerful analytics and data processing capabilities. This is especially useful for content management systems, documentation pipelines, and knowledge bases.
Scalability: Leveraging Spark’s distributed architecture, this feature allows for efficient processing of large volumes of Markdown documents, enabling scalable workflows for big data and NLP applications.
Simplified Data Manipulation: Representing Markdown documents as DataFrames simplifies common data manipulation tasks such as searching, filtering, aggregation, and transformation, thus reducing complexity and improving productivity.
Enhanced Context for NLP Tasks: Transforming Markdown content into structured formats enables more effective content extraction and context-aware processing, and improves downstream NLP applications such as information retrieval, summarization, and question answering.

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

Copilot

Pull Request Overview

This PR adds support for reading and parsing Markdown files into Spark DataFrames, exposing both Scala and Python APIs and integrating markdown into the existing Partition utility.

Introduce MarkdownReader with parseMarkdown logic and a new .md() API in SparkNLPReader
Extend Partition to handle text/markdown content type and .md extension
Add Scala and Python tests plus example resources for validating Markdown ingestion
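
As a rough illustration of the second point, the sketch below shows one way a partitioner could route a file to a Markdown reader by content type or by extension. It is not the code in `Partition.scala`: the function `resolve_reader`, the lookup tables, and the `"md"` reader label are all hypothetical, and only the `text/markdown` and `.md` entries come from this PR.

```python
import os
from typing import Optional

# Hypothetical lookup tables; only the Markdown entries are taken from the PR text.
CONTENT_TYPE_TO_READER = {"text/markdown": "md"}
EXTENSION_TO_READER = {".md": "md"}


def resolve_reader(path: str, content_type: Optional[str] = None) -> str:
    """Pick a reader name, preferring an explicit content type over the extension."""
    if content_type is not None:
        key = content_type.strip().lower()
        if key not in CONTENT_TYPE_TO_READER:
            raise ValueError(f"Unsupported content type: {content_type}")
        return CONTENT_TYPE_TO_READER[key]

    # Fall back to the file extension when no content type is given.
    extension = os.path.splitext(path)[1].lower()
    if extension not in EXTENSION_TO_READER:
        raise ValueError(f"Unsupported file extension: {extension!r}")
    return EXTENSION_TO_READER[extension]
```

Giving the explicit content type priority is a design choice made for this sketch; the actual precedence in Partition may differ.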

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/main/scala/com/johnsnowlabs/reader/MarkdownReader.scala | Implement Markdown parsing logic |
| src/main/scala/com/johnsnowlabs/reader/SparkNLPReader.scala | Add `md` & `mdToHTMLElement` methods and docs |
| src/main/scala/com/johnsnowlabs/partition/Partition.scala | Register markdown support for contentType & ext |
| src/test/scala/com/johnsnowlabs/reader/MarkdownReaderTest.scala | New unit tests for MarkdownReader |
| src/test/resources/reader/md/simple.md | Fixture file for MarkdownReader tests |
| python/sparknlp/reader/sparknlp_reader.py | Add Python binding for `.md()` |
| python/test/sparknlp_test.py | Add Python test case for Markdown reader |

Comments suppressed due to low confidence (1)

python/sparknlp/reader/sparknlp_reader.py:399

The example metadata keys ('elementId', 'tag') don't match the actual keys ('level', 'paragraph') used by parseMarkdown. Update the docstring example for accuracy.

        |[{Title, Sample Markdown Document, {elementId -> ..., tag -> title}}]|
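
To make the key mismatch concrete, here is a toy parser in plain Python that emits elements whose metadata uses `level` and `paragraph`. It is a sketch only, not the `parseMarkdown` implementation from `MarkdownReader.scala`: the meaning given to each key (heading depth, running paragraph index), the `NarrativeText` type name, and the dictionary layout are assumptions made for illustration.

```python
import re
from typing import Dict, List

# ATX-style heading: one to six '#' characters, whitespace, then the title text.
HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


def parse_markdown(text: str) -> List[Dict[str, object]]:
    """Split Markdown text into Title and NarrativeText elements (toy parser)."""
    elements: List[Dict[str, object]] = []
    buffer: List[str] = []
    paragraph = 0

    def flush() -> None:
        # Emit the buffered lines as one paragraph element, if there are any.
        nonlocal paragraph
        if buffer:
            elements.append({
                "elementType": "NarrativeText",  # assumed type name
                "content": " ".join(buffer),
                "metadata": {"paragraph": str(paragraph)},  # assumed: paragraph index
            })
            paragraph += 1
            buffer.clear()

    for line in text.splitlines():
        stripped = line.strip()
        match = HEADING.match(stripped)
        if match:
            flush()
            elements.append({
                "elementType": "Title",
                "content": match.group(2),
                "metadata": {"level": str(len(match.group(1)))},  # assumed: heading depth
            })
        elif stripped:
            buffer.append(stripped)
        else:
            # A blank line closes the current paragraph.
            flush()
    flush()
    return elements
```

Under these assumptions, a heading line such as `# Sample Markdown Document` yields a `Title` element whose metadata carries `level` rather than `elementId` or `tag`.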

python/sparknlp/reader/sparknlp_reader.py

src/test/resources/reader/md/simple.md

src/main/scala/com/johnsnowlabs/reader/MarkdownReader.scala

examples/python/data-preprocessing/SparkNLP_Partition_Demo.ipynb

danilojsl added 2 commits July 1, 2025 17:05

[SPARKNLP-1213] Introducing MarkdownReader

77bea80

[SPARKNLP-1213] Adding python wrapper for Markdown reader

20dd6c7

danilojsl self-assigned this Jul 1, 2025

danilojsl added the new-feature Introducing a new feature label Jul 1, 2025

danilojsl changed the title ~~Feature/sparknlp 1213 Adding Markdown Reader~~ SparkNLP 1213 Adding Markdown Reader Jul 2, 2025

[SPARKNLP-1213] Adding demo notebook for Markdown reader and adds Par…

cd90530

…tition support for md files

danilojsl marked this pull request as ready for review July 2, 2025 17:16

danilojsl requested review from Copilot and DevinTDHa and removed request for Copilot July 2, 2025 17:17

Copilot AI reviewed Jul 2, 2025

danilojsl added 2 commits July 3, 2025 10:18

[SPARKNLP-1213] Addressing copilot suggestions

db00cb7

[SPARKNLP-1213] Corrects typo in partition demo notebook [skip test]

8c6e49a
