Isoxya plugin Crawler HTML

Isoxya plugin Crawler HTML provides a core run loop for the crawling engine: it parses each page as static HTML and extracts request metadata and outbound URLs. It is a plugin for the Isoxya web crawler.

https://hub.docker.com/r/tiredpixel/isoxya-plugin-crawler-html
https://github.com/tiredpixel/isoxya-plugin-crawler-html

Features

  • links parsed: <a href="http://example.com">link</a>

  • header redirects extracted: Location; HTTP Status 301, 302, 303, 307, 308

  • no-follow links respected: <a href="http://www.iana.org/domains/example" rel="nofollow">

  • base tags used for relative links: <base href="http://www.example.com/">

  • meta robots no-follow tags respected: <meta name="robots" content="nofollow">

  • header X-Robots-Tag no-follow respected: X-Robots-Tag: nofollow
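
The features above can be illustrated with a minimal Python sketch. This is not the plugin's own code or API: the names outbound_urls, LinkExtractor, and REDIRECT_STATUSES are hypothetical, and the order in which the rules are applied (redirect first, then X-Robots-Tag, then meta robots, then per-link rel) is an assumption made for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Redirect statuses listed in the feature list above.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collects <a href> targets plus <base> and nofollow signals."""

    def __init__(self):
        super().__init__()
        self.base_href = None       # first <base href>, if any
        self.page_nofollow = False  # set by <meta name="robots" content="nofollow">
        self.hrefs = []             # (href, is_nofollow) pairs

    def handle_starttag(self, tag, attrs):
        # Attribute values may be None (e.g. a bare attribute), so normalise to "".
        attr = {k: (v or "") for k, v in attrs}
        if tag == "base" and self.base_href is None and attr.get("href"):
            self.base_href = attr["href"]
        elif tag == "meta" and attr.get("name", "").lower() == "robots":
            tokens = [t.strip().lower() for t in attr.get("content", "").split(",")]
            if "nofollow" in tokens:
                self.page_nofollow = True
        elif tag == "a" and attr.get("href"):
            rel = attr.get("rel", "").lower().split()
            self.hrefs.append((attr["href"], "nofollow" in rel))


def outbound_urls(page_url, status, headers, body):
    """Return absolute URLs a crawler might follow from one response (assumed rule order)."""
    headers = {k.lower(): v for k, v in headers.items()}

    # Header redirects: follow Location for the listed statuses.
    if status in REDIRECT_STATUSES and "location" in headers:
        return [urljoin(page_url, headers["location"])]

    # Header-level nofollow applies to the whole page.
    robots = [t.strip().lower() for t in headers.get("x-robots-tag", "").split(",")]
    if "nofollow" in robots:
        return []

    parser = LinkExtractor()
    parser.feed(body)

    # Meta robots nofollow also applies to the whole page.
    if parser.page_nofollow:
        return []

    # Relative links resolve against <base href> when present, else the page URL.
    base = urljoin(page_url, parser.base_href) if parser.base_href else page_url
    return [urljoin(base, href) for href, nofollow in parser.hrefs if not nofollow]
```

For example, calling outbound_urls("http://example.com/page", 200, {}, html) on a page containing <base href="http://www.example.com/">, a relative link, and a rel="nofollow" link would return only the relative link, resolved against the base URL.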

Installation

Compile and boot locally:

docker compose up

Images are also published under the latest tag (for development) and version-specific tags (for production). Do not use the latest tag in production!

Licence

Copyright © Nic Williams. It is free software, released under the BSD 3-Clause licence, and may be redistributed under the terms specified in LICENSE.