This is work in progress, please do not use yet!
SaushEngine is a simple and customizable search engine that lets you crawl almost anything, anywhere, for data. You can use it to crawl an intranet or a file server for documents, spreadsheets and slides, or even your own computer.
How it works
The search engine has two parts:
- Spider - goes out to collect data for the search engine. The Spider includes the message queue and the index database. The Spider crawls through URLs in the message queue, given initial seeds in the queue, processes the documents found and saves data into the index database. The Spider is controlled with a web interface.
- Digger - allows your user to search through the index database. The Digger algorithms are customizable and configurable.
The following are features found in SaushEngine:
- Multiple parallel crawlers for creating the search index
- Crawlers can process multiple document formats including HTML, Microsoft Word, Adobe PDF and many others
- Crawlers can be customized to search specific domains only
- Crawlers can search through NTLM-protected sites (SharePoint, etc.)
- Web-based crawler management interface
- Search can be specific to mime-types
- Search can be specific to domains or hosts
- Highly customizable and configurable settings for search algorithms
To install SaushEngine, follow these steps in sequence:
- Install JRuby (preferably use rbenv with the ruby-build plugin installed)
- Install Postgres (9.3). Make sure you have the correct permissions set up for your current user.
- Install RabbitMQ. On certain platforms it might already be installed
- Run `gem install bundler` to install Bundler, followed by `bundle install` to download and install the rest of the gems
- Run the `start` script (Linux variant only) to set up the database and the permissions
You can run SaushEngine in either development mode or production mode. (Production mode here only means you're running the servers as daemons with the environment set to production; it doesn't mean SaushEngine is production-capable yet.)
To run SaushEngine in development mode, just use Foreman:
$ foreman start
This should start your RabbitMQ server as well as the Spider and the Digger.
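Foreman reads the processes to run from a Procfile. A hypothetical Procfile for the three processes above might look like the following (the process names and commands are assumptions for illustration, not the repository's actual file):

```
rabbitmq: rabbitmq-server
spider: bundle exec ruby spider.rb
digger: bundle exec ruby digger.rb
```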
To run SaushEngine in production mode, the assumption is that RabbitMQ is already running; you then run only the Spider and the Digger in daemon mode.
To stop SaushEngine in production mode, use the `stop` script.
After starting up the Spider, you can proceed to configure and deploy your spiders!
- Go to http://localhost:5914 to see the Spider web interface. This interface allows you to control and configure the Spider's settings
- If you're crawling through NTLM authentication-protected sites, remember to add the necessary credentials in the settings page
- Add the seed URL into the queue
- Start up one or more spiders
- You should see the spiders running now, hard at work processing and adding pages to the index
With your spiders now hard at work, you can start using the search engine! Go to http://localhost:4199 to start using SaushEngine.
These are the basic components it needs:
- JRuby - Ruby implementation on top of the Java Virtual Machine
- RabbitMQ - an easy to use and robust messaging system
- Postgres - A powerful open source relational database
To find a list of Ruby libraries it needs, please view the Gemfile.
The SaushEngine spider crawls through the search space for documents, which it then adds to the search index (database).
To run multiple spiders at the same time, SaushEngine uses Celluloid to run parallel threads. Each thread runs independently, acting as a worker that consumes a queue (via RabbitMQ) of document URLs. As each spider consumes a URL, it processes the document and generates new URLs, which are published back into the same queue.
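The worker-pool idea can be sketched with plain stdlib threads and a `Queue`; this is illustrative only, since SaushEngine itself uses Celluloid actors and RabbitMQ rather than an in-process queue:

```ruby
require 'thread'

# URLs waiting to be crawled (stand-in for the RabbitMQ queue)
urls = Queue.new
%w[http://a.example/ http://b.example/ http://c.example/].each { |u| urls << u }

processed = Queue.new

# Two parallel workers, each independently consuming from the shared queue
workers = 2.times.map do
  Thread.new do
    loop do
      url = begin
        urls.pop(true)   # non-blocking pop; raises ThreadError when empty
      rescue ThreadError
        break
      end
      processed << url   # stand-in for fetching and indexing the document
    end
  end
end
workers.each(&:join)
```

Because each worker pulls from the same queue, adding more workers increases throughput without any coordination beyond the queue itself.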
This is the algorithm used by the Spider:
- Read a URL from the queue (assuming URLs in the queue are clean)
- Find the page for that URL, or create a new one
- Extract words from the page and put them into a words array
- Extract keywords from the page and put them at the front of the words array
- For every word in the words array:
  - Find the word or create a new one
  - Create a location with a position, which is the word's index in the array
- Extract links from the page
- For every link in the page:
  - If it is a relative URL, prepend the base URL
  - If authentication is required, add in the user name and password
  - Add it to the queue if there are fewer than n messages in the queue
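The steps above can be sketched as a single crawl step. This is a self-contained illustration: the in-memory `Index` class, the `process_url` helper and the queue-size cap are stand-ins invented for this sketch, since SaushEngine itself stores the index in Postgres and publishes URLs to RabbitMQ:

```ruby
require 'uri'
require 'set'

# Minimal in-memory stand-in for the index database
class Index
  attr_reader :pages, :locations
  def initialize
    @pages = Set.new
    @locations = []            # [word, url, position] triples
  end
end

MAX_QUEUE = 1000               # the "n messages" cap from the last step

def process_url(url, body, links, queue, index)
  index.pages << url                               # find-or-create the page
  body.downcase.split.each_with_index do |word, pos|
    index.locations << [word, url, pos]            # position = index in words array
  end
  links.each do |link|
    link = URI.join(url, link).to_s if URI(link).relative?
    queue << link if queue.size < MAX_QUEUE        # publish back to the queue
  end
end
```

The keyword step is omitted here for brevity; in the real algorithm, keywords are prepended so they get the lowest (most important) positions.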
Analyzing the retrieved documents
SaushEngine's spider analyses the documents it crawls depending on the type of document:
- HTML - Nokogiri, using customized logic to process an HTML file
- Any other types - Apache Tika, which extracts text from many different file formats. The supported file formats are listed in the Apache Tika documentation
The Digger allows your users to query the documents you have crawled and processed in your index database.
SaushEngine is a highly customizable and configurable search engine. You can customize the following:
The default built-in search algorithms are:
- Frequency of words
- Location of words
- Distance between one word and another
Each algorithm is assigned an importance percentage, which determines how much that algorithm contributes to the final ranking. You can tweak these weights accordingly; more importantly, you can add additional algorithms.
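A weighted combination like this can be expressed as a simple weighted sum. The algorithm names, the weights and the `combined_score` helper below are hypothetical illustrations, not SaushEngine's actual configuration keys:

```ruby
# Assumed importance percentages per algorithm (must sum to 1.0)
WEIGHTS = { frequency: 0.5, location: 0.3, distance: 0.2 }

# scores is a hash of normalized (0.0..1.0) scores per algorithm
def combined_score(scores)
  WEIGHTS.sum { |algo, weight| weight * scores.fetch(algo, 0.0) }
end
```

Adding a new algorithm then only means adding an entry to the weights and producing a normalized score for it.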
The frequency ranking algorithm is quite simple: a page that contains more occurrences of the search words is assumed to be more relevant.
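As a sketch of the idea (not SaushEngine's exact formula), a frequency score can simply count occurrences of each search word among the page's words:

```ruby
# Count how many times the search words appear in the page's word list
def frequency_score(page_words, search_words)
  search_words.sum { |w| page_words.count(w) }
end
```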
The location ranking algorithm is also very simple. The assumption here is that if a search word appears near the top of the document, the page is more relevant.
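One way to express this heuristic (an assumed formula, not necessarily SaushEngine's) is to score a word by how close its first occurrence is to position 0:

```ruby
# first_position: index of the word's first occurrence in the document;
# position 0 (the very top) scores 1.0, and the score decays from there
def location_score(first_position)
  1.0 / (1.0 + first_position)
end
```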
The distance ranking algorithm inspects the distance between the search words on every page. The closer the words are to each other on a page, the higher that page is ranked. For example, take a search for 'brown fox' in these two documents:
- The quick brown fox jumped over the lazy dog
- The brown dog chased after the fox.
Both will turn up in the search results, but document 1 will be more relevant: the distance (the number of words between the two terms) between 'brown' and 'fox' is 0 in document 1, while in document 2 it is 4.
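The distance measure from this example can be sketched as follows; the `word_distance` helper is illustrative, not SaushEngine's actual API:

```ruby
# Number of words between the first occurrences of two search terms
# (nil if either term is missing from the text)
def word_distance(text, word_a, word_b)
  words = text.downcase.gsub(/[^a-z\s]/, '').split
  i, j = words.index(word_a), words.index(word_b)
  return nil unless i && j
  (i - j).abs - 1
end

word_distance("The quick brown fox jumped over the lazy dog", "brown", "fox")  # => 0
word_distance("The brown dog chased after the fox.", "brown", "fox")           # => 4
```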
Document processing algorithms
SaushEngine has built-in processing capabilities to process HTML as well as various types of file formats supported by Apache Tika. You can customize or extend this accordingly.