Groovy/Grails lightweight Apache Nutch alternative
GnutchGrailsPlugin.groovy v0.2.3.2 drafted Mar 21, 2015; grails version changed to 2.4.4 Feb 26, 2015

A very simple alternative to Apache Nutch, created in Grails.

Crawled data can be stored in files, saved to a database, or indexed with Apache Solr. Gnutch uses Apache Camel as its integration framework and Apache ActiveMQ as its messaging and integration patterns server.


Depends on :routing:1.2.4 or higher

Add these two lines to the plugins section of your BuildConfig.groovy:

    plugins {
        // ....
        build ":routing:1.2.4" // you may use a higher version
        build ":gnutch:"
        // ....
    }

After installation, the plugin creates a grails-app/routes folder in your application. It contains pre-built Camel routes that implement the base business logic, and you can modify them as you wish. The plugin also creates grails-app/conf/ehcache.xml with a plugin-specific cache definition.
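
To give a feel for what a class under grails-app/routes might look like, here is a minimal sketch. It assumes the routing plugin's convention of route classes that extend Camel's RouteBuilder. The class name SampleInputRoute, the directory /tmp/sample-input and the log category are hypothetical and are not among the routes gnutch generates.

```groovy
import org.apache.camel.builder.RouteBuilder

// Hypothetical route, for illustration only; not one of the generated gnutch routes.
class SampleInputRoute extends RouteBuilder {

    @Override
    void configure() {
        // Poll a (hypothetical) local directory for new files.
        from('file:///tmp/sample-input')
            // Single quotes keep ${file:name} as a Camel simple expression,
            // so Groovy does not try to interpolate it.
            .log('Received file ${file:name}')
            // Hand the message to Camel's log component.
            .to('log:sample.input')
    }
}
```

The generated routes follow the same from/to style, so a change usually comes down to editing an endpoint URI or adding a processing step between from and to.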

The following section is appended to $GNUTCH_HOME/grails-app/conf/Config.groovy:

 gnutch {
   // Define local directory which is used for receiving source-definition files.
   inputRoute = 'file:///tmp/gnutch-input'

   crawl {
     // Define fixed number of threads we use for crawling
     threads = 40
   }

   // post processors
   // define custom post processing closure which is called for HTML pages
   // contains org.w3c.dom.Document object representing the XHTML page just crawled
   postProcessorHTML = {ex ->}

   // define custom post processing closure which is called for the XML document received after XSLT
   // contains org.w3c.dom.Document object representing the XML result of the XSL transformation
   postProcessorXML = {ex ->}

   http {
     // User-Agent string. Ideally it contains the email address of the person responsible
     // for crawling, so that source owners can contact that person directly
     userAgent = 'GNutch crawler ( Contact:'

     // Since we use a shared thread pool for HTTP connections,
     // we define these two values

     // Maximum number of connections per host
     defaultMaxConnectionsPerHost = 40
     // Maximum number of total connections
     maxTotalConnections = 40
   }

   solr {
     // URL of the Solr server where we send crawled data for indexing
     serverUrl = 'http://localhost:8983/solr'
   }

   activemq {
     // URL of the AMQ message broker.
     // In the simple configuration it runs an embedded AMQ
     // More on possible connection strings here:
     // brokerURL = 'tcp://'
     brokerURL = 'vm://localhost'

     // For more complex cases it could contain external AMQ configuration
     //conf = 'classpath:activemq.xml'
   }
 }
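
The empty postProcessorHTML closure in the section above can be filled in along these lines. This is only a sketch under one assumption: that `ex` is the Camel exchange and that its in-message body is the org.w3c.dom.Document mentioned in the comments. The trimmed config string, the sample page and the map standing in for the exchange are all made up for the example, and Groovy's ConfigSlurper is used only to show that the closure survives config parsing.

```groovy
import javax.xml.parsers.DocumentBuilderFactory

// Trimmed, hypothetical copy of the gnutch section; only the closure matters here.
def script = '''
gnutch {
  // Assumption: ex.in.body is the org.w3c.dom.Document of the crawled page.
  postProcessorHTML = { ex ->
    def page = ex.in.body
    def titles = page.getElementsByTagName('title')
    // Return the text of the first <title> element, or null if there is none.
    titles.length > 0 ? titles.item(0).textContent : null
  }
}
'''
def config = new ConfigSlurper().parse(script)

// Stand-in for a crawled page, built with the JDK's DOM parser.
def xhtml = '<html><head><title>Example page</title></head><body/></html>'
def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
def doc = builder.parse(new ByteArrayInputStream(xhtml.getBytes('UTF-8')))

// Stand-in for the exchange: a map that answers ex.in.body.
def ex = ['in': [body: doc]]
def title = config.gnutch.postProcessorHTML.call(ex)
println "Title seen by the post-processor: ${title}"
```

In a real application the closure would live in Config.groovy, and the plugin's routes would call it; whether its return value is used is not stated in this README, so treat the return here as a convenience for the example.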