toyama0919/embulk-filter-crawler
Crawler filter plugin for Embulk

Crawls web pages starting from the base URL held in a column of each input record.

Overview

  • Plugin type: filter

Configuration

  • target_key: name of the column that holds the base URL (string, required)
  • max_depth_of_crawling: maximum crawl depth (integer, default: unlimited)
  • number_of_crawlers: number of crawlers to run in parallel (integer, default: 1)
  • max_pages_to_fetch: maximum number of pages to fetch (integer, default: unlimited)
  • crawl_storage_folder: folder used for crawl storage (string, required)
  • politeness_delay: delay between requests (integer, default: null)
  • user_agent_string: User-Agent string sent with requests (string, default: null)
  • output_prefix: output prefix (string, default: "")
  • connection_timeout: connection timeout in milliseconds (integer, default: 30000)
  • socket_timeout: socket timeout in milliseconds (integer, default: 20000)

Example

in:
  type: mysql
  host: dbs04
  user: application
  password: XXXXXXXX
  database: iap
  query: |
    select url from companies limit 100
filters:
  - type: crawler
    target_key: url
    number_of_crawlers: 10
    max_depth_of_crawling: 4
    politeness_delay: 100
    crawl_storage_folder: "/tmp/crawl/%s"
out:
  type: stdout
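
The example above leaves several options at their defaults. As a rough sketch only, the filter block below sets the remaining documented options explicitly. All values are illustrative assumptions rather than recommendations, and the user agent name is hypothetical.

```yaml
filters:
  - type: crawler
    target_key: url                            # column holding the base URL
    crawl_storage_folder: "/tmp/crawl/%s"      # same value as in the example above
    max_pages_to_fetch: 1000                   # illustrative cap; default is unlimited
    user_agent_string: "example-crawler/1.0"   # hypothetical name; default is null
    connection_timeout: 10000                  # milliseconds; default is 30000
    socket_timeout: 10000                      # milliseconds; default is 20000
```

Capping max_pages_to_fetch and max_depth_of_crawling is worth considering, since both default to unlimited.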

Build

$ ./gradlew gem  # -t to watch change of files and rebuild continuously
