A fast, in-memory indexer of your data
Java Python Shell
Failed to load latest commit information.
build_scripts Various fixes. Hoboserver now appears to work. Dec 16, 2011
example/python Various fixes. Hoboserver now appears to work. Dec 16, 2011
src/org/styloot/hobo Added better docstring to CombinedIterator. Apr 2, 2012
test Since some parameters don't affect small data sets, we need a separat… Mar 30, 2012
.gitignore Added .gitignore Dec 16, 2011
LICENSE Added license Mar 11, 2012
build.xml Added benchmarks folder, and HasFeaturesBenchmark.java Mar 21, 2012
hobo.thrift Now hobo handles price filtering, and has a simplified api. Jan 25, 2012



Hobo serves a simple purpose - it is an in-memory index of a set of documents.

The main use case is as follows - you have a collection of documents which are tagged with some feature set. Hobo allows you to rapidly find the document ID's which contain the tagged features. It also allows search based on color, and based on an arbitrary cost function.


Apache Thrift 0.8

sl4j for logging.

When compiling, make sure you edit the build.xml file to point to the right location/version of sl4j.

See http://www.slf4j.org/manual.html for more details.


The server installation is easy:

$ ant compile jar
$ cp build/hobo.jar SOMEPLACE_IN_CLASSPATH

The python client installation is similarly easy:

$ ant python
$ mv build/python-hobo.tgz /tmp
$ cd /tmp
$ tar -xvzf python-hobo.tgz

Building thrift clients for other languages should be very simple as well.


Hobo has a simple data model. It is based on the concept of an item:

struct Item {
    string id;
    string category;
    list<string> features;
    byte red;
    byte green;
    byte blue;
    int cost;
    int quality;

A category is a string - subcategories are represented by prefixes. E.g., "/products/bags/hobo" is a subcategory of "/products/bags" or even "/prod".

Quality represents how important the item is - higher is better. Higher quality items come earlier in the list of results than lower quality items.

Cost represents how pricey the item is.

The red, green and blue values represent the color of the item. These values are signed bytes, i.e. (127,127,127) is white and (-128,-128,-128) is black.


HoboIndex supports one method: find(1:string category_name, 2:list<string> features, 3:byte red, 4:byte green, 5:byte blue, 6:double colorDist, 7:i32 cost_min, 8:i32 cost_max, 9:i32 page):

find("/products/bags/hobo", ["brass buckle", "leather"], 127, -128, -128, 10, 0, 25, 0)

will return a list of id's for items in /products/bags/hobo (or some subcategory), which also contain the features "brass buckle" and "leather", the color will have a CIE_LAB distance of 10 to rgb(255,0,0), and the cost will be between 0 and 25. If you are unfamiliar with color distance, check out the [wikipedia article](http://en.wikipedia.org/wiki/Color_difference) on the topic.

If no color filter is required, the color distance should be set to a negative value. If no cost filter is required, then cost_min should be zero and cost_max should be set to MAX_INT=pow(2,31)-1 (the maximum possible value). For example, the call:

find("/products/bags/hobo", ["brass buckle", "leather"], 0, 0, 0, -1, 0, MAX_INT, 0)

will return a list of id's for items in /products/bags/hobo (or some subcategory), which also contain the features "brass buckle" and "leather".

The results are paginated - the call find(..., 0) above will return the first $PAGESIZE items, find(..., 1) will return the items from PAGESIZE to 2 x PAGESIZE, etc.


The server is a read-only indexer. It is run as follows:

$ java org.styloot.hobo.thriftserver.Server INPUTFILE PORT PAGESIZE

The inputfile has the following format. Each row contains an item, represented as follows:


The id, category and features must not contain semicolons, commas or newlines.