SparkNLP 1213 Adding Markdown Reader #14618
Conversation
…tition support for md files
Pull Request Overview
This PR adds support for reading and parsing Markdown files into Spark DataFrames, exposing both Scala and Python APIs and integrating markdown into the existing Partition utility.
- Introduce `MarkdownReader` with `parseMarkdown` logic and a new `.md()` API in `SparkNLPReader`
- Extend `Partition` to handle the `text/markdown` content type and the `.md` extension
- Add Scala and Python tests plus example resources for validating Markdown ingestion
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
File | Description
---|---
src/main/scala/com/johnsnowlabs/reader/MarkdownReader.scala | Implement Markdown parsing logic
src/main/scala/com/johnsnowlabs/reader/SparkNLPReader.scala | Add md & mdToHTMLElement methods and docs
src/main/scala/com/johnsnowlabs/partition/Partition.scala | Register markdown support for contentType & ext
src/test/scala/com/johnsnowlabs/reader/MarkdownReaderTest.scala | New unit tests for MarkdownReader
src/test/resources/reader/md/simple.md | Fixture file for MarkdownReader tests
python/sparknlp/reader/sparknlp_reader.py | Add Python binding for .md()
python/test/sparknlp_test.py | Add Python test case for Markdown reader
Comments suppressed due to low confidence (1)
python/sparknlp/reader/sparknlp_reader.py:399
- The example metadata keys (`elementId`, `tag`) don't match the actual keys (`level`, `paragraph`) used by `parseMarkdown`. Update the docstring example for accuracy.

  `|[{Title, Sample Markdown Document, {elementId -> ..., tag -> title}}]|`
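Based on the keys called out in this comment, a corrected docstring row would presumably read as follows (the metadata value shown is illustrative, not taken from the actual implementation):

`|[{Title, Sample Markdown Document, {level -> 1}}]|`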
Description
This PR introduces a new feature that enables reading and parsing Markdown files into a structured Spark DataFrame. This functionality allows for efficient processing and analysis of Markdown content, supporting use cases such as document indexing, content analysis, and downstream natural language processing (NLP) workflows.
Added
- `sparknlp.read().md()`: This method accepts file paths of Markdown documents and loads their structured content into a DataFrame.

Usage Example:
Motivation and Context
Structured Data Representation: By converting raw Markdown files into a structured DataFrame, we enable seamless integration with Spark’s powerful analytics and data processing capabilities. This is especially useful for content management systems, documentation pipelines, and knowledge bases.
Scalability: Leveraging Spark’s distributed architecture, this feature allows for efficient processing of large volumes of Markdown documents, enabling scalable workflows for big data and NLP applications.
Simplified Data Manipulation: Representing Markdown documents as DataFrames simplifies common data manipulation tasks such as searching, filtering, aggregation, and transformation, thus reducing complexity and improving productivity.
Enhanced Context for NLP Tasks: Transforming Markdown content into structured formats enables more effective content extraction, context-aware processing, and enhances downstream NLP applications such as information retrieval, summarization, and question answering.