config/checkstyle
gradle/wrapper
lib/embulk/filter
src
.gitignore
LICENSE.txt
README.md
build.gradle
gradlew
gradlew.bat


Crawler filter plugin for Embulk

A filter plugin for Embulk that crawls web pages, starting from the base URL held in a configured column of each input record.

Overview

  • Plugin type: filter

Configuration

  • target_key: name of the column that holds the base URL to crawl (string, required)
  • max_depth_of_crawling: maximum crawl depth (integer, default: unlimited)
  • number_of_crawlers: number of crawlers to run in parallel (integer, default: 1)
  • max_pages_to_fetch: maximum number of pages to fetch (integer, default: unlimited)
  • crawl_storage_folder: folder used for crawl storage (string, required)
  • politeness_delay: delay between requests (integer, default: null)
  • user_agent_string: User-Agent string sent with requests (string, default: null)
  • output_prefix: output prefix (string, default: "")
  • connection_timeout: connection timeout in milliseconds (integer, default: 30000)
  • socket_timeout: socket timeout in milliseconds (integer, default: 20000)
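
As a minimal sketch of how the options above fit together, the filter block below sets only the two required options and leaves the rest commented out at illustrative values. The column name url and the limit of 1000 pages are assumptions chosen for illustration, and the crawl_storage_folder value is simply copied from the example in this README; this README does not say what the %s placeholder expands to.

```yaml
filters:
  - type: crawler
    # Required: column that holds the base URL (column name "url" is assumed here)
    target_key: url
    # Required: value copied from this README's example; meaning of %s is not documented
    crawl_storage_folder: "/tmp/crawl/%s"
    # Optional settings, shown commented out so the documented defaults apply:
    # max_pages_to_fetch: 1000      # illustrative limit, default is unlimited
    # connection_timeout: 30000     # milliseconds (documented default)
    # socket_timeout: 20000         # milliseconds (documented default)
```

With the optional lines left commented out, the crawl depth and page count are unlimited according to the defaults listed above, so setting max_depth_of_crawling or max_pages_to_fetch is probably wise for anything beyond a small test.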

Example

in:
  type: mysql
  host: dbs04
  user: application
  password: XXXXXXXX
  database: iap
  query: |
    select url from companies limit 100
filters:
  - type: crawler
    target_key: url
    number_of_crawlers: 10
    max_depth_of_crawling: 4
    politeness_delay: 100
    crawl_storage_folder: "/tmp/crawl/%s"
out:
  type: stdout

Build

$ ./gradlew gem  # add -t to watch for file changes and rebuild continuously