Crawling Framework


Crawling Framework provides the instruments to configure and run a crawler based on Storm Crawler. It mainly aims at easing the crawling of article-publishing sites such as news portals and blogs. With the help of the GUI tool that Crawling Framework provides, you can:

  1. Specify which sites to crawl.
  2. Configure URL inclusion and exclusion filters, thus controlling which sections of a site will be fetched (see the filter sketch after this list).
  3. Specify which page elements carry the article's publication name, title, and main body.
  4. Define tests which validate that the extraction rules work (see the extraction sketch below).
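To illustrate the inclusion/exclusion filters of item 2, here is a minimal, self-contained sketch of how such rules typically behave. The class name, patterns, and site are hypothetical, not Crawling Framework's actual API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration of include/exclude URL filtering; not the framework's actual API. */
public class UrlFilterSketch {

    // Fetch only article pages of an imaginary news site...
    private static final List<Pattern> INCLUDES = Arrays.asList(
            Pattern.compile("^https?://news\\.example\\.com/articles/.*"));

    // ...but skip comment pages and static assets.
    private static final List<Pattern> EXCLUDES = Arrays.asList(
            Pattern.compile(".*/comments(/.*)?$"),
            Pattern.compile(".*\\.(jpg|png|gif|css|js)$"));

    /** A URL is fetched only if it matches an include rule and no exclude rule. */
    public static boolean shouldFetch(String url) {
        boolean included = INCLUDES.stream().anyMatch(p -> p.matcher(url).matches());
        boolean excluded = EXCLUDES.stream().anyMatch(p -> p.matcher(url).matches());
        return included && !excluded;
    }

    public static void main(String[] args) {
        System.out.println(shouldFetch("https://news.example.com/articles/storm-crawler"));          // true
        System.out.println(shouldFetch("https://news.example.com/articles/storm-crawler/comments")); // false
        System.out.println(shouldFetch("https://news.example.com/about"));                           // false
    }
}
```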

Once the configuration is done, Crawling Framework runs a Storm Crawler based crawl following the rules specified in the configuration.
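To make items 3 and 4 of the list above concrete, here is a hedged sketch of what a CSS-selector extraction rule and a validating check could look like, written with jsoup. The selectors and page markup are hypothetical, not Crawling Framework's actual rule format:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

/** Hypothetical CSS-selector extraction with a self-check; not the framework's actual rule format. */
public class ExtractionRuleSketch {

    public static void main(String[] args) {
        // A trimmed article page standing in for a fetched document.
        String html = "<html><body>"
                + "<span class='site-name'>Example News</span>"
                + "<h1 class='headline'>Storm Crawler in Practice</h1>"
                + "<div class='article-body'><p>Body text goes here.</p></div>"
                + "</body></html>";

        Document doc = Jsoup.parse(html);

        // Extraction rules: one CSS selector per article field.
        String publication = doc.select("span.site-name").text();
        String title = doc.select("h1.headline").text();
        String body = doc.select("div.article-body").text();

        // A test in the spirit of item 4: fail loudly if a rule stops matching the page layout.
        if (title.isEmpty() || body.isEmpty()) {
            throw new IllegalStateException("Extraction rules no longer match the page layout");
        }
        System.out.println(publication + " | " + title + " | " + body);
    }
}
```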

Introduction

We have recorded a video on how to set up and use Crawling Framework. Click on the image below to watch it on YouTube.

[Video: Crawling Framework Intro]

Requirements

Crawling Framework writes its configuration to, and stores crawled data in, Elasticsearch. Before starting a crawl project, install Elasticsearch (Crawling Framework is tested to work with Elasticsearch 5.5.x).
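As a quick sanity check that a compatible Elasticsearch is reachable, something like the following can be used. It assumes the default endpoint http://localhost:9200; Elasticsearch's root response includes a version.number field, which should read 5.5.x:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

/** Prints Elasticsearch's root endpoint response, which includes version.number. */
public class EsVersionCheck {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:9200/").openConnection();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```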

Crawling Framework is a Java library which has to be extended to run a Storm Crawler topology, so a Java toolchain (JDK 8, Maven) is required.
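The shape of such an extension follows Storm Crawler's usual pattern. Below is a minimal sketch based on the storm-crawler archetype rather than Crawling Framework's own classes; the seed URL and the choice of bolts are illustrative:

```java
import com.digitalpebble.stormcrawler.ConfigurableTopology;
import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import com.digitalpebble.stormcrawler.indexing.StdOutIndexer;
import com.digitalpebble.stormcrawler.spout.MemorySpout;
import org.apache.storm.topology.TopologyBuilder;

/** Minimal Storm Crawler topology in the style of the storm-crawler archetype; illustrative only. */
public class CrawlTopology extends ConfigurableTopology {

    public static void main(String[] args) throws Exception {
        ConfigurableTopology.start(new CrawlTopology(), args);
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Seed URLs held in memory; a real crawl would read them from Elasticsearch.
        builder.setSpout("spout", new MemorySpout("https://news.example.com/"));

        builder.setBolt("fetch", new FetcherBolt()).shuffleGrouping("spout");
        builder.setBolt("parse", new JSoupParserBolt()).localOrShuffleGrouping("fetch");
        builder.setBolt("index", new StdOutIndexer()).localOrShuffleGrouping("parse");

        return submit("crawl", conf, builder);
    }
}
```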

Configuring and Running a Crawl

See the Crawling Framework Example project's documentation.

License

Copyright © 2017-2019 TokenMill UAB.

Distributed under the Apache License, Version 2.0.
