yokingma/url-reader

URL READER

This project reads the content of URLs and returns the title, length, HTML, text, Markdown, and excerpt of each page.

Requires Node.js 20.11.0 or later ("node": ">=20.11.0").

Installation

yarn add url-reader
# or npm install url-reader

Usage

import URLReader from 'url-reader';

const reader = new URLReader();
await reader.init();

const results = await reader.read({
  urls: ['https://www.google.com'],
  timeout: 10000, // ms, default: 60000
  enableMarkdown: false, // default: true
  runScripts: 'dangerously', // runs scripts embedded in the HTML and fetches remote resources; disabled by default
});

Parsed Result:

interface IReaderResult {
  title: string;
  length: number;
  html: string;
  text: string;
  markdown?: string;
  excerpt: string;
}
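
As a minimal sketch of how a caller might consume a parsed result: the two helpers below, `preferredBody` and `summarize`, are hypothetical names introduced for illustration and are not part of url-reader. The sketch assumes `markdown` is absent when `enableMarkdown` is false, which the optional field suggests but the README does not state.

```typescript
// Mirrors the IReaderResult shape documented above.
interface IReaderResult {
  title: string;
  length: number;
  html: string;
  text: string;
  markdown?: string;
  excerpt: string;
}

// Hypothetical helper: prefer Markdown when it is present, otherwise
// fall back to the plain-text body.
function preferredBody(result: IReaderResult): string {
  return result.markdown ?? result.text;
}

// Hypothetical helper: a one-line summary suitable for logging.
function summarize(result: IReaderResult): string {
  return `${result.title} [length=${result.length}]: ${result.excerpt}`;
}
```

Falling back to `text` keeps downstream code working whether or not Markdown conversion was enabled for the request.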

Server

  • start server
git clone https://github.com/yokingma/url-reader.git
cd url-reader

# listens on port 3030 by default
yarn install && yarn run start
  • api
GET /reader?url=https://www.google.com

POST /reader
Body:
{
  "urls": ["https://www.google.com", "https://www.bing.com"]
}
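
The two endpoints above could be called from a client roughly as follows. This is a sketch under stated assumptions: the server is running locally on port 3030, it accepts a JSON body with a `Content-Type: application/json` header, and the response is JSON whose shape this README does not document. `buildGetUrl`, `buildPostInit`, and `readMany` are hypothetical helper names.

```typescript
// Assumption: the server from this README is running locally on port 3030.
const BASE_URL = 'http://localhost:3030';

// Build the GET URL. searchParams.set percent-encodes the target URL so that
// its own "?" and "&" characters are not read as parameters of /reader.
function buildGetUrl(target: string, base: string = BASE_URL): string {
  const endpoint = new URL('/reader', base);
  endpoint.searchParams.set('url', target);
  return endpoint.toString();
}

// Build the fetch() options for POST /reader with a JSON body.
function buildPostInit(urls: string[]): {
  method: string;
  headers: Record<string, string>;
  body: string;
} {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' }, // assumed to be expected
    body: JSON.stringify({ urls }),
  };
}

// Usage: needs a running server. The response shape is not documented here,
// so it is returned as unknown.
async function readMany(urls: string[]): Promise<unknown> {
  const res = await fetch(`${BASE_URL}/reader`, buildPostInit(urls));
  if (!res.ok) throw new Error(`reader request failed: ${res.status}`);
  return res.json();
}
```

Encoding the `url` parameter matters most when the target itself carries a query string, for example `https://example.com/?a=1&b=2`.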

Docker

docker build -t urlreader . # urlreader is your image's tag name

The service will listen on port 3030.

Tips

  • Puppeteer: when you install Puppeteer, it automatically downloads a recent version of Chrome for Testing (~170MB macOS, ~282MB Linux, ~280MB Windows) and a chrome-headless-shell binary.

Troubleshooting

  • Install error with Puppeteer
Error [ERR_TLS_CERT_ALTNAME_INVALID]: Hostname/IP does not match certificate's altnames...

Remove the .npmrc file and reinstall.

About

Convert URLs to JSON/Markdown/Text. Supports Docker deployment.
