SparkNLP 1213 Adding Markdown Reader #14618

Open. danilojsl wants to merge 5 commits into master.

danilojsl (Contributor) commented Jul 1, 2025

Description

This PR introduces a new feature that enables reading and parsing Markdown files into a structured Spark DataFrame. This functionality allows for efficient processing and analysis of Markdown content, supporting use cases such as document indexing, content analysis, and downstream natural language processing (NLP) workflows.

Added
sparknlp.read().md(): This method accepts file paths of Markdown documents and loads their structured content into a DataFrame.

Usage Example:

partitioner = Partition(content_type="text/markdown").partition(md_directory)
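
To make "structured content" concrete, here is a minimal pure-Python sketch of how a Markdown document might be split into typed elements with metadata. It is not the PR's parseMarkdown implementation. The `Element` tuple, the `parse_markdown` helper, and the "NarrativeText" element type are assumptions made for illustration; the "Title" element type and the 'level' and 'paragraph' metadata keys are taken from the review comments on this PR.

```python
import re
from typing import Dict, List, NamedTuple


class Element(NamedTuple):
    """Hypothetical row shape: element type, text content, metadata map."""
    element_type: str
    content: str
    metadata: Dict[str, str]


# ATX-style headings only: one to six '#' characters followed by text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def parse_markdown(text: str) -> List[Element]:
    """Split Markdown text into heading and paragraph elements (illustrative only)."""
    elements: List[Element] = []
    paragraph_lines: List[str] = []
    paragraph_index = 0

    def flush() -> None:
        # Emit the buffered lines as one paragraph element, if there are any.
        nonlocal paragraph_index
        if paragraph_lines:
            elements.append(
                Element(
                    "NarrativeText",  # assumed element type name
                    " ".join(paragraph_lines),
                    {"paragraph": str(paragraph_index)},
                )
            )
            paragraph_index += 1
            paragraph_lines.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            elements.append(Element("Title", match.group(2), {"level": str(level)}))
        elif line.strip():
            paragraph_lines.append(line.strip())
        else:
            flush()  # a blank line ends the current paragraph
    flush()
    return elements


if __name__ == "__main__":
    sample = "# Sample Markdown Document\n\nFirst line\nsecond line.\n\n## Section\nBody."
    for element in parse_markdown(sample):
        print(element)
```

Each returned element corresponds to one entry in the kind of structured output the reader is described as producing; the real reader may recognize more constructs (lists, code blocks, tables) and use different type names.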

Motivation and Context

  • Structured Data Representation: By converting raw Markdown files into a structured DataFrame, we enable seamless integration with Spark’s powerful analytics and data processing capabilities. This is especially useful for content management systems, documentation pipelines, and knowledge bases.

  • Scalability: Leveraging Spark’s distributed architecture, this feature allows for efficient processing of large volumes of Markdown documents, enabling scalable workflows for big data and NLP applications.

  • Simplified Data Manipulation: Representing Markdown documents as DataFrames simplifies common data manipulation tasks such as searching, filtering, aggregation, and transformation, thus reducing complexity and improving productivity.

  • Enhanced Context for NLP Tasks: Transforming Markdown content into a structured format enables more effective content extraction and context-aware processing, which benefits downstream NLP applications such as information retrieval, summarization, and question answering.

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@danilojsl self-assigned this Jul 1, 2025
@danilojsl added the new-feature label (Introducing a new feature) Jul 1, 2025
@danilojsl changed the title from "Feature/sparknlp 1213 Adding Markdown Reader" to "SparkNLP 1213 Adding Markdown Reader" Jul 2, 2025
@danilojsl marked this pull request as ready for review July 2, 2025 17:16
@danilojsl requested review from Copilot and DevinTDHa and removed request for Copilot July 2, 2025 17:17

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds support for reading and parsing Markdown files into Spark DataFrames, exposing both Scala and Python APIs and integrating markdown into the existing Partition utility.

  • Introduce MarkdownReader with parseMarkdown logic and a new .md() API in SparkNLPReader
  • Extend Partition to handle text/markdown content type and .md extension
  • Add Scala and Python tests plus example resources for validating Markdown ingestion
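
The routing described in the second bullet can be sketched as a lookup on content type first, then on file extension. This is only an illustration of the idea, not the code in Partition.scala: the `resolve_reader` function, the two lookup tables, and the "text/html" entries are hypothetical, and only the "text/markdown" content type and the ".md" extension come from the PR.

```python
import os
from typing import Optional

# Hypothetical lookup tables; only the Markdown entries are named in the PR.
READER_BY_CONTENT_TYPE = {
    "text/markdown": "md",
    "text/html": "html",  # illustrative extra entry
}
READER_BY_EXTENSION = {
    ".md": "md",
    ".html": "html",  # illustrative extra entry
}


def resolve_reader(path: str, content_type: Optional[str] = None) -> str:
    """Pick a reader name from an explicit content type, else from the file extension."""
    if content_type is not None:
        if content_type not in READER_BY_CONTENT_TYPE:
            raise ValueError(f"Unsupported content type: {content_type}")
        return READER_BY_CONTENT_TYPE[content_type]

    extension = os.path.splitext(path)[1].lower()
    if extension not in READER_BY_EXTENSION:
        raise ValueError(f"Unsupported file extension: {extension!r}")
    return READER_BY_EXTENSION[extension]


if __name__ == "__main__":
    print(resolve_reader("docs/README.md"))
    print(resolve_reader("docs/", content_type="text/markdown"))
```

An explicit content type takes precedence here so that a directory path, which has no extension, can still be routed to the Markdown reader.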

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

File | Description
src/main/scala/com/johnsnowlabs/reader/MarkdownReader.scala | Implement Markdown parsing logic
src/main/scala/com/johnsnowlabs/reader/SparkNLPReader.scala | Add md & mdToHTMLElement methods and docs
src/main/scala/com/johnsnowlabs/partition/Partition.scala | Register markdown support for contentType & ext
src/test/scala/com/johnsnowlabs/reader/MarkdownReaderTest.scala | New unit tests for MarkdownReader
src/test/resources/reader/md/simple.md | Fixture file for MarkdownReader tests
python/sparknlp/reader/sparknlp_reader.py | Add Python binding for .md()
python/test/sparknlp_test.py | Add Python test case for Markdown reader

Comments suppressed due to low confidence (1)

python/sparknlp/reader/sparknlp_reader.py:399

  • The example metadata keys ('elementId', 'tag') don't match the actual keys ('level', 'paragraph') used by parseMarkdown. Update the docstring example for accuracy.
        |[{Title, Sample Markdown Document, {elementId -> ..., tag -> title}}]|
