mmiscool/scrape_to_markdown

Scrape to markdown

This project extracts the URL-fetching and Markdown-generation functionality from clipper.js and packages it as a library that you can add to your project with npm.

Scrape to Markdown is a lightweight JavaScript library that scrapes articles or web pages and converts their content to Markdown. This is particularly useful for archiving, content generation, or data processing tasks.

Features

  • Extracts main content from web pages.
  • Converts HTML content into Markdown using Turndown.
  • Handles GitHub Flavored Markdown (GFM) for better compatibility.
  • Fallback mechanism for handling URLs that return raw Markdown.
  • Built-in support for readability parsing via @mozilla/readability.

Installation

Install the library using npm:

npm install @mmiscool/scrape_to_markdown

Usage

Import the Library

import { scrapeToMarkdown } from '@mmiscool/scrape_to_markdown';

Scrape a Web Page to Markdown

(async () => {
    const url = 'https://example.com/some-article';
    try {
        const markdown = await scrapeToMarkdown(url);
        console.log(markdown);
    } catch (error) {
        console.error('Error scraping the URL:', error);
    }
})();

Fallback for Raw Markdown URLs

The library can handle cases where the URL directly provides Markdown content. It will return the raw Markdown if no HTML is detected.
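
The README does not say how the library decides that a response is not HTML. The sketch below shows one way such a fallback could work. Both `looksLikeHtml` and `toMarkdown` are hypothetical names used for illustration only, and the heuristic is an assumption rather than the library's actual check.

```javascript
// Hypothetical helper: guesses whether a response body is an HTML document.
// Illustrative heuristic only, not the library's actual detection logic.
function looksLikeHtml(body) {
    // Only inspect the start of the body; that is where a doctype or <html> tag appears.
    const head = body.trimStart().slice(0, 1024).toLowerCase();
    return head.startsWith('<!doctype html') || /<(html|head|body)[\s>]/.test(head);
}

// Hypothetical wrapper: returns the text untouched when it is not HTML,
// otherwise hands it to a caller-supplied HTML-to-Markdown converter.
async function toMarkdown(body, convertHtml) {
    return looksLikeHtml(body) ? convertHtml(body) : body;
}
```

With a check like this, Markdown that contains a little inline HTML (for example a `<b>` tag) is still treated as raw Markdown, because only document-level tags near the start of the body count as HTML.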

API

scrapeToMarkdown(url: string): Promise<string>

Scrapes the content from the provided URL and converts it to Markdown.

  • Parameters:
    • url: The URL of the web page to scrape.
  • Returns: A Promise resolving to the Markdown content.
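
Because `scrapeToMarkdown` returns a Promise, several URLs can be scraped concurrently. The sketch below is one possible pattern, not part of the library: `scrapeAll` is a hypothetical helper, and the scraper is passed in as a parameter so the helper works with any function of the shape `(url) => Promise<string>`.

```javascript
// Hypothetical helper, not part of the library: scrapes several URLs
// concurrently and keeps going when individual URLs fail.
// `scrape` is any function of the shape (url) => Promise<string>,
// for example the library's scrapeToMarkdown.
async function scrapeAll(urls, scrape) {
    // The async wrapper turns synchronous throws into rejections as well.
    const settled = await Promise.allSettled(urls.map(async (url) => scrape(url)));
    return settled.map((result, i) => ({
        url: urls[i],
        ok: result.status === 'fulfilled',
        markdown: result.status === 'fulfilled' ? result.value : null,
        error: result.status === 'rejected' ? result.reason : null,
    }));
}
```

A call such as `await scrapeAll(['https://example.com/a', 'https://example.com/b'], scrapeToMarkdown)` would then yield one result object per URL, in input order, with failures reported in `error` instead of aborting the whole batch.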

extract_from_url(page: string): Promise<string>

Uses JSDOM and @mozilla/readability to extract and convert the primary content from a web page into Markdown.

extract_from_html(html: string): Promise<string>

Converts raw HTML input into Markdown.

oldScrapeToMarkdown(url: string): Promise<string>

Legacy scraper for handling edge cases or simpler scraping needs.

Dependencies

This library relies on NPM packages that include:

  • Turndown
  • JSDOM
  • @mozilla/readability

Examples

Scraping a Blog Post

import { scrapeToMarkdown } from '@mmiscool/scrape_to_markdown';

(async () => {
    const url = 'https://medium.com/some-blog-post';
    const markdown = await scrapeToMarkdown(url);
    console.log(markdown);
})();

Converting Raw HTML to Markdown

import { extract_from_html } from '@mmiscool/scrape_to_markdown';

const html = `
    <article>
        <h1>Example Article</h1>
        <p>This is an example paragraph.</p>
    </article>
`;

(async () => {
    const markdown = await extract_from_html(html);
    console.log(markdown);
})();

Credits

Clipper uses the following open source libraries:

License

  • Apache 2.0
