Nutch-Patch : Enhanced features for Nutch Crawler

This is a fork of Apache Nutch (branch 2.3), extended to meet the requirements of projects I have worked on.

Features:

Continuous Crawl

One Job to rule them all.

ContinuousCrawlerJob wraps up all Nutch stages (generate, fetch, parse, updateDB and index) into a single job.
Set the number of crawl cycles per run (cycles arg)
Set the crawl stage at which the first cycle starts (stage arg)
Inject pre-seeded URLs (inject arg)
Dynamically seed new URLs and auto-inject them (seedUrls arg)
Runs standalone or on the Nutch Server
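
The control flow described above can be sketched roughly as follows. This is an illustrative model only, not the fork's Java code: the function name `run_continuous_crawl`, the `stage_runners` mapping, and the way `inject` is passed in are all assumptions made for the example. Only the stage order and the meaning of the `cycles` and `stage` arguments come from the list above, and dynamic seeding (`seedUrls`) is not modeled.

```python
from typing import Callable, Dict, List, Optional

# Stage order as listed above; everything else in this sketch is hypothetical.
STAGES = ["generate", "fetch", "parse", "updateDB", "index"]


def run_continuous_crawl(
    stage_runners: Dict[str, Callable[[], None]],
    cycles: int,
    stage: str = "generate",
    inject: Optional[Callable[[], None]] = None,
) -> List[str]:
    """Run `cycles` crawl cycles, starting the first one at `stage`.

    Returns the ordered list of steps that were executed.
    """
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")

    executed: List[str] = []
    if inject is not None:
        # Assumption: injection happens once, before the first cycle.
        inject()
        executed.append("inject")

    start = STAGES.index(stage)
    for cycle in range(cycles):
        # Only the first cycle may start mid-way; later cycles run every stage.
        first = start if cycle == 0 else 0
        for name in STAGES[first:]:
            stage_runners[name]()
            executed.append(name)
    return executed


if __name__ == "__main__":
    # No-op runners stand in for the real Nutch stages.
    runners = {name: (lambda: None) for name in STAGES}
    print(run_continuous_crawl(runners, cycles=2, stage="parse"))
```

With `cycles=2` and `stage="parse"`, the first cycle runs parse, updateDB and index, and the second cycle runs all five stages.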

REST API (or an HTTP API, at the least ;) )

When ContinuousCrawlerJob runs on the Nutch Server, it exposes an HTTP API for common operations:

Create a new continuous crawl
Stop the current crawl
Get the current crawl status

Crawled-Content History / Versions

Whenever a page's content is re-fetched, the new content is compared to the previous content. If it has changed, the old content is saved to a separate MongoDB collection, gradually building a list of content versions with their fetch dates.
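
A minimal sketch of this compare-and-archive step, using in-memory dictionaries in place of the two MongoDB collections. The class name `ContentStore`, the field names, and the choice to leave the stored record untouched when the content is unchanged are assumptions for illustration, not the fork's actual schema or behavior.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


class ContentStore:
    """In-memory stand-in for the two MongoDB collections (hypothetical layout)."""

    def __init__(self) -> None:
        self.current: Dict[str, dict] = {}        # url -> latest content record
        self.history: Dict[str, List[dict]] = {}  # url -> older versions, oldest first

    def save_fetch(
        self, url: str, content: str, fetched_at: Optional[datetime] = None
    ) -> bool:
        """Store freshly fetched content; return True if an old version was archived."""
        fetched_at = fetched_at or datetime.now(timezone.utc)
        previous = self.current.get(url)
        archived = False
        if previous is not None and previous["content"] != content:
            # Content changed: move the old record, with its fetch date, to history.
            self.history.setdefault(url, []).append(previous)
            archived = True
        if previous is None or archived:
            self.current[url] = {"content": content, "fetched_at": fetched_at}
        return archived
```

Fetching the same URL with content "v1", "v1", then "v2" leaves "v2" as the current record and a single archived "v1" entry in the history.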

Ready-made Eclipse launchers are provided for each crawl stage, as well as for the continuous crawl job and the Nutch Server. They are intended for development in Eclipse, but can also be converted to IntelliJ launchers via Eclipser.

Notes

The fork is pre-configured to use MongoDB as Nutch's storage and Elasticsearch as the index storage.

Disclaimer:

Nutch-Patch is, at most, in beta. I'd be happy if it proves useful to others, but it's here mainly because I need it.


Repository: yaireclipse/nutch-patch