LobeChat's built-in web crawling module for intelligent extraction of web content and conversion to Markdown format.
`@lobechat/web-crawler` is a core component of LobeChat responsible for intelligent web content crawling and processing. It extracts valuable content from various webpages, filters out distracting elements, and generates structured Markdown text.
- Intelligent Content Extraction: Identifies main content using the Mozilla Readability algorithm
- Multi-level Crawling Strategy: Supports multiple crawling implementations including basic crawling, Jina, and Browserless rendering
- Custom URL Rules: Handles specific website crawling logic through a flexible rule system
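To make the features above concrete, here is a minimal usage sketch. It assumes the package exposes a `Crawler` class with a `crawl` method that accepts a URL plus an optional list of implementations and returns the extracted content; these names and the result shape are illustrative assumptions, not documented API.

```typescript
// Minimal usage sketch; `Crawler`, `crawl`, and the result shape are assumptions for illustration.
import { Crawler } from '@lobechat/web-crawler';

const crawler = new Crawler();

// Crawl a page, optionally constraining which implementations may be tried.
// The multi-level strategy is assumed to fall through the listed impls in order.
const result = await crawler.crawl({
  impls: ['naive', 'jina', 'browserless'],
  url: 'https://example.com/articles/123',
});

console.log(result?.title);
console.log(result?.content); // Markdown extracted after Readability-based filtering
```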
Webpage structures are diverse and complex, so we welcome community contributions of crawling rules for specific websites. You can contribute as follows:
- Add new rules to the `urlRules.ts` file
- Rule example:
```typescript
// Example: handling specific websites
const urlRules = [
  // ... other URL matching rules
  {
    // URL matching pattern, supports regex
    urlPattern: 'https://example.com/articles/(.*)',
    // Optional: URL transformation, redirects to an easier-to-crawl version
    urlTransform: 'https://example.com/print/$1',
    // Optional: specify crawling implementations, supports 'naive', 'jina', and 'browserless'
    impls: ['naive', 'jina', 'browserless'],
    // Optional: content filtering configuration
    filterOptions: {
      // Whether to enable the Readability algorithm to filter out distracting elements
      enableReadability: true,
      // Whether to convert to plain text
      pureText: false,
    },
  },
];
```
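For contributors, the rule object above can be read against a type along the following lines. This is a sketch inferred from the example; the interface name `CrawlUrlRule` and the exact optional fields are assumptions rather than the package's actual type definitions.

```typescript
// Hypothetical rule shape inferred from the example above (not the package's actual types)
interface CrawlUrlRule {
  /** URL matching pattern; supports regex, and capture groups can be referenced by urlTransform */
  urlPattern: string;
  /** Optional rewrite target; '$1' refers to the first capture group of urlPattern */
  urlTransform?: string;
  /** Optional ordered list of crawling implementations to use */
  impls?: ('naive' | 'jina' | 'browserless')[];
  /** Optional content filtering configuration */
  filterOptions?: {
    /** Apply the Readability algorithm to strip distracting elements */
    enableReadability?: boolean;
    /** Return plain text instead of Markdown */
    pureText?: boolean;
  };
}
```

In the example, the `(.*)` group captured by `urlPattern` is substituted for `$1` in `urlTransform`, redirecting the crawl to a print-friendly version of the page.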
To contribute a rule:

- Fork the LobeChat repository
- Add or modify URL rules
- Submit a Pull Request describing:
  - Target website characteristics
  - Problems solved by the rule
  - Test cases (example URLs)
This is an internal module of LobeHub (`"private": true`), designed specifically for LobeChat and not published as a standalone package.