This is work in progress, please do not use yet!
SaushEngine is a simple and customizable search engine that lets you crawl almost anything, anywhere, for data. You can use it to crawl an intranet or a file server for documents, spreadsheets and slides, or even your own computer.
How it works
The search engine has two parts:
- Spider - goes out to collect data for the search engine. The Spider includes the message queue and the index database. The Spider crawls through URLs in the message queue, given initial seeds in the queue, processes the documents found and saves data into the index database. The Spider is controlled with a web interface.
- Digger - allows your user to search through the index database. The Digger algorithms are customizable and configurable.
The following are features found in SaushEngine:
- Multiple parallel crawlers for creating the search index
- Crawlers can process multiple document formats including HTML, Microsoft Word, Adobe PDF and many others
- Crawlers can be customized to search specific domains only
- Crawlers can search through NTLM-protected sites (SharePoint, etc.)
- Web-based crawler management interface
- Search can be specific to mime-types
- Search can be specific to domains or hosts
- Highly customizable and configurable settings for search algorithms
To install SaushEngine, follow these steps in sequence:
- Install JRuby (preferably use rbenv with the ruby-build plugin installed)
- Install Postgres (9.3). Make sure you have the correct permissions set up for your current user.
- Install RabbitMQ. On certain platforms it might already be installed
- Run `gem install bundler` to install Bundler, followed by `bundle install` to download and install the rest of the gems
- Run the `start` script (Linux variant only) to set up the database and the permissions
You can run SaushEngine in either development mode or production mode. (Production mode here only means you're running the servers as daemons with the environment set to production; it doesn't mean SaushEngine is production-capable yet.)
To run SaushEngine in development mode, just use Foreman:
$ foreman start
This should start your RabbitMQ server as well as the Spider and the Digger.
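Foreman reads the processes to run from a Procfile. A hypothetical Procfile for the three processes above might look like the following (the process names and commands are assumptions for illustration, not the repository's actual file):

```
rabbitmq: rabbitmq-server
spider: bundle exec ruby spider.rb
digger: bundle exec ruby digger.rb
```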
To run SaushEngine in production mode, the assumption is that RabbitMQ is already running; you then run only the Spider and the Digger in daemon mode.
To stop SaushEngine in production mode, use the `stop` script.
After starting up the Spider, you can proceed to configure and deploy your spiders!
- Go to http://localhost:5914 to see the Spider web interface. This interface allows you to control and configure the Spider's settings
- If you're crawling through NTLM authentication-protected sites, remember to add the necessary credentials in the settings page
- Add the seed URL into the queue
- Start up one or more spiders
- You should see the spiders running now, hard at work processing and adding pages to the index
With your spiders now hard at work, you can start using the search engine! Go to http://localhost:4199 to start using SaushEngine.
These are the basic components it needs:
- JRuby - Ruby implementation on top of the Java Virtual Machine
- RabbitMQ - an easy to use and robust messaging system
- Postgres - A powerful open source relational database
To find a list of Ruby libraries it needs, please view the Gemfile.
The SaushEngine spider crawls through the search space for documents, which it then adds to the search index (database).
To run multiple spiders at the same time, SaushEngine uses Celluloid to run parallel threads. Each thread runs independently, acting as a worker that consumes a queue (via RabbitMQ) of document URLs. As each spider consumes a URL, it processes the document and generates new URLs, which are published back into the same queue.
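The worker-pool idea can be sketched with plain stdlib threads and a `Queue`; this is illustrative only, since SaushEngine itself uses Celluloid actors and RabbitMQ rather than an in-process queue:

```ruby
require 'thread'

# URLs waiting to be crawled (stand-in for the RabbitMQ queue)
urls = Queue.new
%w[http://a.example/ http://b.example/ http://c.example/].each { |u| urls << u }

processed = Queue.new

# Two parallel workers, each independently consuming from the shared queue
workers = 2.times.map do
  Thread.new do
    loop do
      url = begin
        urls.pop(true)   # non-blocking pop; raises ThreadError when empty
      rescue ThreadError
        break
      end
      processed << url   # stand-in for fetching and indexing the document
    end
  end
end
workers.each(&:join)
```

Because each worker pulls from the same queue, adding more workers increases throughput without any coordination beyond the queue itself.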
This is the algorithm used by the Spider:
- Read a URL from the queue (assuming URLs in the queue are clean)
- Find the page for that URL, or create a new one
- Extract words from the page and put them into a words array
- Extract keywords from the page and put them at the front of the words array
- For every word in the words array:
  - Find the word or create a new one
  - Create a location with a position, which is the word's index in the array
- Extract links from the page
- For every link in the page:
  - If it is a relative URL, prepend the base URL
  - If authentication is required, add in the user name and password
  - Add it to the queue if there are fewer than n messages in the queue
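The steps above can be sketched as a single crawl step. This is a self-contained illustration: the in-memory `Index` class, the `process_url` helper and the queue-size cap are stand-ins invented for this sketch, since SaushEngine itself stores the index in Postgres and publishes URLs to RabbitMQ:

```ruby
require 'uri'
require 'set'

# Minimal in-memory stand-in for the index database
class Index
  attr_reader :pages, :locations
  def initialize
    @pages = Set.new
    @locations = []            # [word, url, position] triples
  end
end

MAX_QUEUE = 1000               # the "n messages" cap from the last step

def process_url(url, body, links, queue, index)
  index.pages << url                               # find-or-create the page
  body.downcase.split.each_with_index do |word, pos|
    index.locations << [word, url, pos]            # position = index in words array
  end
  links.each do |link|
    link = URI.join(url, link).to_s if URI(link).relative?
    queue << link if queue.size < MAX_QUEUE        # publish back to the queue
  end
end
```

The keyword step is omitted here for brevity; in the real algorithm, keywords are prepended so they get the lowest (most important) positions.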
Analyzing the retrieved documents
SaushEngine's spider analyses the documents it crawls depending on the type of document:
- HTML - Nokogiri, using customized logic to process an HTML file
- Any other types - Apache Tika, which extracts text from many different file formats. The supported file formats are listed in the Apache Tika documentation
The Digger allows your users to query the documents you have crawled and processed in your index database.
SaushEngine is a highly customizable and configurable search engine. You can customize the following:
The default built-in search algorithms are:
- Frequency of words
- Location of words
- Distance between one word and another
Each algorithm is assigned an importance percentage, which determines how much that algorithm contributes to the final ranking. You can tweak these weights accordingly; more importantly, you can add additional algorithms.
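A weighted combination like this can be expressed as a simple weighted sum. The algorithm names, the weights and the `combined_score` helper below are hypothetical illustrations, not SaushEngine's actual configuration keys:

```ruby
# Assumed importance percentages per algorithm (must sum to 1.0)
WEIGHTS = { frequency: 0.5, location: 0.3, distance: 0.2 }

# scores is a hash of normalized (0.0..1.0) scores per algorithm
def combined_score(scores)
  WEIGHTS.sum { |algo, weight| weight * scores.fetch(algo, 0.0) }
end
```

Adding a new algorithm then only means adding an entry to the weights and producing a normalized score for it.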
The frequency ranking algorithm is quite simple: a page that contains more occurrences of the search words is assumed to be more relevant.
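As a sketch of the idea (not SaushEngine's exact formula), a frequency score can simply count occurrences of each search word among the page's words:

```ruby
# Count how many times the search words appear in the page's word list
def frequency_score(page_words, search_words)
  search_words.sum { |w| page_words.count(w) }
end
```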
The location ranking algorithm is also very simple. The assumption here is that if a search word appears near the top of the document, the page is more relevant.
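One way to express this heuristic (an assumed formula, not necessarily SaushEngine's) is to score a word by how close its first occurrence is to position 0:

```ruby
# first_position: index of the word's first occurrence in the document;
# position 0 (the very top) scores 1.0, and the score decays from there
def location_score(first_position)
  1.0 / (1.0 + first_position)
end
```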
The distance ranking algorithm inspects the distance between the search words on every page. The closer the words are to each other on a page, the higher that page is ranked. For example, take a search for 'brown fox' in these two documents:
- The quick brown fox jumped over the lazy dog
- The brown dog chased after the fox.
Both will turn up in the search results, but document 1 will be more relevant: the distance (the number of words between the two terms) between 'brown' and 'fox' is 0 in document 1, while in document 2 it is 4.
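The distance measure from this example can be sketched as follows; the `word_distance` helper is illustrative, not SaushEngine's actual API:

```ruby
# Number of words between the first occurrences of two search terms
# (nil if either term is missing from the text)
def word_distance(text, word_a, word_b)
  words = text.downcase.gsub(/[^a-z\s]/, '').split
  i, j = words.index(word_a), words.index(word_b)
  return nil unless i && j
  (i - j).abs - 1
end

word_distance("The quick brown fox jumped over the lazy dog", "brown", "fox")  # => 0
word_distance("The brown dog chased after the fox.", "brown", "fox")           # => 4
```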
Document processing algorithms
SaushEngine has built-in processing capabilities to process HTML as well as various types of file formats supported by Apache Tika. You can customize or extend this accordingly.