scrape2md: Web Content to Markdown Converter

scrape2md fetches web content and converts it to readable Markdown, suitable for both large language models and human readers. It extracts content from sources such as HTML pages and PDF documents and presents it as clean, structured Markdown.

It is meant to be fast and deployable on serverless platforms such as Cloudflare, without the deployment complexity of heavier scrapers that rely on headless browsers.

Features

Fetch and convert web content to Markdown.
Support for HTML and PDF content types.
Extract Open Graph data and convert it to Markdown.
Compatible with both browser and Node.js environments.
Can scrape the following sites:
- Most articles, via Mozilla Readability
- Twitter, via OG parsing + fxtwitter.com
- Reddit, via old.reddit.com
- YouTube, using generated subtitles
- Comments from sites like Hacker News
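
To illustrate the kind of URL rewriting the Reddit entry above implies, here is a minimal sketch. The helper name `toOldReddit` is hypothetical and is not taken from the library; the actual implementation may handle this differently.

```typescript
// Hypothetical helper (not part of the library's documented API):
// rewrites any reddit.com URL so it points at old.reddit.com,
// which serves simpler, server-rendered HTML.
function toOldReddit(input: string): string {
  const url = new URL(input);
  // Match the bare domain and subdomains such as www. or new.
  if (url.hostname === "reddit.com" || url.hostname.endsWith(".reddit.com")) {
    url.hostname = "old.reddit.com";
  }
  // Non-Reddit URLs pass through unchanged.
  return url.toString();
}

console.log(toOldReddit("https://www.reddit.com/r/typescript/comments/abc/"));
// prints "https://old.reddit.com/r/typescript/comments/abc/"
```

Rewriting the hostname before fetching keeps the scraper free of a headless browser, since the old interface can be parsed as plain HTML.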

Installation

pnpm install web-content-to-markdown

Usage

As a library:

import { fetchAndConvertToMarkdown } from 'web-content-to-markdown';

// Example: Convert content from a URL to Markdown
const url = 'https://www.youtube.com/watch?v=U7PUn1Pq0iM';
fetchAndConvertToMarkdown(url, fetch)
  .then(markdown => console.log(markdown))
  .catch(error => console.error(error));

Via the CLI:

pnpm tsx cli.ts 'https://www.youtube.com/watch?v=U7PUn1Pq0iM'

API

`fetchAndConvertToMarkdown(url: string, fetchFunc: typeof fetch): Promise<string>`

Fetches content from the specified URL and converts it to Markdown. The `fetchFunc` parameter lets you supply a custom fetch implementation, so the library can run in different environments.
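
Because `fetchFunc` has the same type as the standard `fetch`, one way to customize requests is to wrap an existing fetch. The sketch below is illustrative only: `withUserAgent` and the header value are assumptions, not something the library provides.

```typescript
// Hypothetical wrapper (not provided by the library): returns a
// fetch-compatible function that adds a default User-Agent header.
function withUserAgent(baseFetch: typeof fetch, userAgent: string): typeof fetch {
  return (input, init) => {
    // Start from the caller's headers; fall back to the headers of a
    // Request object so they are not lost when init is omitted.
    const headers = new Headers(
      init?.headers ?? (input instanceof Request ? input.headers : undefined)
    );
    // Only set the header if the caller did not supply one.
    if (!headers.has("User-Agent")) {
      headers.set("User-Agent", userAgent);
    }
    return baseFetch(input, { ...init, headers });
  };
}
```

Under these assumptions, the wrapped function can be passed in place of the plain `fetch`, for example `fetchAndConvertToMarkdown(url, withUserAgent(fetch, 'my-app/1.0'))`. Note that browsers may ignore or restrict a custom `User-Agent` header.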

Contributing

Contributions are welcome! Please submit a pull request or open an issue to suggest improvements or add new features.

License

This project is licensed under the MIT License; see the LICENSE file for details.

tarasglek/scrape2md
