This repository has been archived by the owner on Feb 8, 2019. It is now read-only.

Docker Image for Apache Nutch, Elasticsearch and MongoDB

smartive/docker-nutch-elasticsearch-mongodb

Apache Nutch, Elasticsearch, MongoDB

This repo contains 1) a Dockerfile that builds Apache Nutch and 2) a docker-compose setup for using it with Elasticsearch and MongoDB.

Info: MongoDB is currently not attached or used.

Apache Nutch Docker Build

The Dockerfile provides a Docker Build of Apache Nutch published as smartive/nutch. There are two published builds:

Apache Nutch docker-compose Setup for Elasticsearch 2.3.* and 5.4.* and MongoDB

This repo nutch-elasticsearch-mongodb contains a docker-compose configuration for Apache Nutch with Elasticsearch 2.3.* / 5.4.* and MongoDB.

To get started, clone the repo and run:

git clone git@github.com:smartive/docker-nutch-elasticsearch-mongodb.git
cd ./docker-nutch-elasticsearch-mongodb && docker-compose up

This will fire up the nutchserver and webapp. Visit http://localhost:8080/.
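The services can take a moment to come up after `docker-compose up`. As a minimal sketch, assuming `curl` is installed on the host and that the webapp answers plain HTTP requests at the URL above, a second terminal could poll until it responds. The helper name `wait_for_url` and its default retry values are hypothetical and not part of this repo.

```shell
# Hypothetical helper: poll a URL until it answers or the attempts run out.
wait_for_url() {
  url="$1"
  attempts="${2:-30}"   # assumed default: 30 tries
  delay="${3:-2}"       # assumed default: 2 seconds between tries
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s and -o hide its output.
    if curl -fs -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example: `wait_for_url http://localhost:8080/ && echo "webapp is up"`.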

Manual Run

docker-compose run -p 8080:8080 -p 8081:8081 --name=manual_nutch --rm --entrypoint=bash nutch

Then, inside the container, create the seed file:

echo "https://smartive.ch/" > seed.txt

Then open regex-urlfilter.txt and replace its last line so that the crawl is limited to the domain smartive.ch:

vi nutch/conf/regex-urlfilter.txt
# Inside regex-urlfilter.txt replace the last line `+.` with:
+^https://smartive\.ch
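The same edit can be made without opening an editor. This is a minimal sketch that assumes the catch-all rule is a line consisting solely of `+.`, as described above. The function name `limit_crawl_to_smartive` is hypothetical.

```shell
# Hypothetical helper: swap the catch-all "+." rule for a smartive.ch-only rule.
limit_crawl_to_smartive() {
  filter_file="$1"   # e.g. nutch/conf/regex-urlfilter.txt
  tmp_file="$(mktemp)" || return 1
  # Match only a line that is exactly "+."; the doubled backslash in the
  # replacement writes a literal "\." into the file.
  sed 's|^+\.$|+^https://smartive\\.ch|' "$filter_file" > "$tmp_file" &&
    cat "$tmp_file" > "$filter_file"   # cat keeps the original file's permissions
  status=$?
  rm -f "$tmp_file"
  return "$status"
}
```

Usage inside the container: `limit_crawl_to_smartive nutch/conf/regex-urlfilter.txt`. All other filter rules are left untouched.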

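Then start the crawl. Here `-i` indexes the results, `-s` points to the seed, `crawldata` is the crawl directory, and `2` is the number of crawl rounds: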

nutch/bin/crawl -i -s seed.txt crawldata 2

To index into Elasticsearch from an existing crawl database without crawling again (replace the segment directory with one of your own):

/root/nutch/bin/nutch index crawldata/crawldb -linkdb crawldata/linkdb crawldata/segments/20170706210640
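The segment name in the command above is specific to one crawl run. Because segment directories are named by timestamp, the newest one can be picked by sorting the names. This is a minimal sketch that assumes the segments directory contains only segment directories. The helper name `latest_segment` is hypothetical.

```shell
# Hypothetical helper: print the name of the newest segment in a directory.
latest_segment() {
  segments_dir="$1"   # e.g. crawldata/segments
  # Names like 20170706210640 sort chronologically as plain text,
  # so the last entry after sorting is the most recent segment.
  ls -1 "$segments_dir" | sort | tail -n 1
}
```

It could then be used as `/root/nutch/bin/nutch index crawldata/crawldb -linkdb crawldata/linkdb "crawldata/segments/$(latest_segment crawldata/segments)"`.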

Credits

This Dockerfile and docker-compose setup are partly based on tpickett/mongo-elasticsearch-nutch.

Apache Nutch is a highly extensible and scalable open source web crawler software project. It is a mature, production-ready crawler.
