This project extracts the URL-fetching and Markdown-generation functionality from clipper.js and packages it as a library that you can add to your project with npm.
Scraper to Markdown is a lightweight JavaScript library that allows you to scrape articles or web pages and convert their content into Markdown format. This is particularly useful for archiving, content generation, or data processing tasks.
- Extracts main content from web pages.
- Converts HTML content into Markdown using Turndown.
- Handles GitHub Flavored Markdown (GFM) for better compatibility.
- Fallback mechanism for handling URLs that return raw Markdown.
- Built-in support for readability parsing via @mozilla/readability.
Install the library using npm:
npm install @mmiscool/scrape_to_markdown
import { scrapeToMarkdown } from '@mmiscool/scrape_to_markdown';
(async () => {
  const url = 'https://example.com/some-article';
  try {
    const markdown = await scrapeToMarkdown(url);
    console.log(markdown);
  } catch (error) {
    console.error('Error scraping the URL:', error);
  }
})();
The library can handle cases where the URL directly provides Markdown content. It will return the raw Markdown if no HTML is detected.
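The library's actual detection logic is not shown in this README. As a rough illustration of the idea, the sketch below guesses whether a fetched body is HTML or raw Markdown. The names `looksLikeHtml` and `markdownOrConvert` are hypothetical. The heuristic is also an assumption, not the library's implementation: it checks the Content-Type header first, then looks for a doctype or root-level tags near the start of the body.

```javascript
// Hypothetical helper: guess whether a fetched body is HTML or raw Markdown.
// The heuristic is an assumption for illustration only.
function looksLikeHtml(body, contentType = '') {
  // Trust an explicit HTML content type when the server sends one.
  if (/\b(text\/html|application\/xhtml\+xml)\b/i.test(contentType)) {
    return true;
  }
  // Otherwise inspect the start of the body for a doctype or root-level tags.
  const head = body.slice(0, 1024).trimStart().toLowerCase();
  return /^<!doctype html|^<html[\s>]|<(head|body)[\s>]/.test(head);
}

// Return raw Markdown untouched; hand HTML to a converter supplied by the caller.
function markdownOrConvert(body, contentType, convertHtml) {
  return looksLikeHtml(body, contentType) ? convertHtml(body) : body;
}
```

A heuristic like this can misfire on Markdown that embeds a literal `<body>` tag near the top, so it should be treated as a starting point only.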
Scrapes the content from the provided URL and converts it to Markdown.
- Parameters:
  - url: The URL of the web page to scrape.
- Returns: A Promise resolving to the Markdown content.
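Because the function returns a Promise, several URLs can be processed concurrently. The sketch below is one possible way to do that. `scrapeAll` is a hypothetical helper, not part of the library, and the scraping function is passed in as a parameter so the snippet runs on its own. In real use you would pass `scrapeToMarkdown`.

```javascript
// Hypothetical helper: scrape several URLs concurrently and keep going
// when individual URLs fail. `scrapeFn` is any function that takes a URL
// and returns a Promise of Markdown, for example scrapeToMarkdown.
async function scrapeAll(urls, scrapeFn) {
  // The async wrapper turns synchronous throws into rejections as well.
  const settled = await Promise.allSettled(
    urls.map(async (url) => scrapeFn(url))
  );
  // Pair each outcome with its URL so callers can see what failed.
  return settled.map((result, i) =>
    result.status === 'fulfilled'
      ? { url: urls[i], markdown: result.value }
      : { url: urls[i], error: String(result.reason) }
  );
}
```

`Promise.allSettled` is used instead of `Promise.all` so that one unreachable page does not discard the results of the others.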
Uses JSDOM and @mozilla/readability to extract and convert the primary content from a web page into Markdown.
Converts raw HTML input into Markdown.
Legacy scraper for handling edge cases or simpler scraping needs.
This library relies on the following NPM packages:
- axios for HTTP requests.
- cheerio for parsing HTML content.
- turndown for converting HTML to Markdown.
- turndown-plugin-gfm for GitHub Flavored Markdown support.
- @mozilla/readability for extracting readable content from web pages.
- jsdom for DOM simulation.
import { scrapeToMarkdown } from '@mmiscool/scrape_to_markdown';
(async () => {
  const url = 'https://medium.com/some-blog-post';
  const markdown = await scrapeToMarkdown(url);
  console.log(markdown);
})();
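For the archiving use case, the returned Markdown string can be written to disk with Node's built-in `fs` module. The following is a minimal sketch. `fileNameFromUrl` and `saveMarkdown` are hypothetical helpers, and the default `archive` directory is an assumption; none of them are provided by the library.

```javascript
// Hypothetical helper: derive a safe file name such as
// "example.com-some-article.md" from a URL.
function fileNameFromUrl(url) {
  const { hostname, pathname } = new URL(url);
  const slug = `${hostname}${pathname}`
    .replace(/[^a-zA-Z0-9.]+/g, '-') // collapse unsafe characters into a dash
    .replace(/^-+|-+$/g, '');        // trim leading and trailing dashes
  return `${slug}.md`;
}

// Hypothetical helper: write Markdown for a URL into outDir and
// return the path of the file that was written.
async function saveMarkdown(url, markdown, outDir = 'archive') {
  // Dynamic imports keep this snippet usable from CommonJS and ES modules.
  const { mkdir, writeFile } = await import('node:fs/promises');
  const { join } = await import('node:path');
  await mkdir(outDir, { recursive: true });
  const filePath = join(outDir, fileNameFromUrl(url));
  await writeFile(filePath, markdown, 'utf8');
  return filePath;
}
```

In real use, the `markdown` argument would be the string returned by `await scrapeToMarkdown(url)`.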
import { extract_from_html } from '@mmiscool/scrape_to_markdown';
const html = `
<article>
<h1>Example Article</h1>
<p>This is an example paragraph.</p>
</article>
`;
(async () => {
  const markdown = await extract_from_html(html);
  console.log(markdown);
})();
Clipper uses the following open source libraries:
- Mozilla Readability - For parsing article content
- Turndown - For converting HTML to Markdown
- Crawlee - For crawling websites
License: Apache 2.0