Explore replacing Solr with Elasticsearch #552

Open
waldoj opened this Issue Oct 8, 2014 · 7 comments

@waldoj
Member
waldoj commented Oct 8, 2014

Solr is tough to install, and it presents a real obstacle to deploying The State Decoded. In the couple of years since I decided to use Solr, Elasticsearch has improved a great deal, and is now my personal default search software. (In fact, it's become my default data storage mechanism, too.) Elasticsearch is provided as DEB and RPM files, with proper init scripts etc., so installing it on Ubuntu, Debian, Red Hat, Fedora, and CentOS is trivial. We should consider moving away from Solr and to Elasticsearch, post-v1.0.

@waldoj waldoj added the Feature label Oct 8, 2014
@waldoj waldoj added this to the Future milestone Oct 8, 2014
@waldoj
Member
waldoj commented Oct 8, 2014

The catch with Elasticsearch is that it only supports JSON, with no other formats. However, that's not necessarily a problem. We can iterate through all laws, via the API, to generate JSON for each law and feed those records to Elasticsearch. We could even iterate through all structural units and index those, too, something that we don't currently do (because there's no XML for structural units).
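
A rough sketch of what "feed those records to Elasticsearch" could look like, assuming we use Elasticsearch's `_bulk` endpoint, which takes newline-delimited JSON (one action line, then one document line, per record). The index name `laws` and the field names (`section_number`, `catch_line`, `text`) are placeholders for illustration, not the actual output of the State Decoded API, and the sample record is made up.

```python
import json


def to_bulk_payload(laws, index_name="laws"):
    """Build a newline-delimited JSON body for Elasticsearch's _bulk API.

    `laws` is an iterable of dicts. The "section_number" key used as the
    document ID is a hypothetical field name, not a confirmed API field.
    """
    lines = []
    for law in laws:
        # Action line: tells Elasticsearch to index the document that follows.
        action = {"index": {"_index": index_name, "_id": law["section_number"]}}
        lines.append(json.dumps(action))
        # Source line: the law record itself, stored as-is.
        lines.append(json.dumps(law))
    if not lines:
        return ""
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"


# Made-up record standing in for one law fetched from the API.
sample_laws = [
    {"section_number": "1-1", "catch_line": "Example catch line", "text": "Example text."},
    {"section_number": "1-2", "catch_line": "Another example", "text": "More example text."},
]

payload = to_bulk_payload(sample_laws)
print(payload)
```

The resulting string would be POSTed to the `_bulk` endpoint. Structural units could go through the same function with a different index name.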

By storing JSON in Elasticsearch, rather than merely indexing it, we could even use Elasticsearch to serve up responses to many API requests. That would make a caching layer (e.g., Varnish, Memcached) unnecessary, and make it trivial for the site to consume its own API.
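
To illustrate the "serve API responses straight from Elasticsearch" idea: when you fetch a document by ID, Elasticsearch wraps the stored JSON in an envelope with `found` and `_source` keys. A minimal sketch, assuming the stored document is already shaped like our API response. The function name and the sample envelope contents are hypothetical.

```python
def api_response_from_es(es_doc):
    """Unwrap a get-by-ID response from Elasticsearch.

    Returns the stored JSON document if it was found, or None so the
    caller can fall back to a 404 (or to the database).
    """
    if not es_doc.get("found"):
        return None
    return es_doc["_source"]


# Made-up example of the envelope Elasticsearch returns for a stored law.
sample_hit = {
    "_index": "laws",
    "_id": "1-1",
    "found": True,
    "_source": {"section_number": "1-1", "catch_line": "Example catch line"},
}

print(api_response_from_es(sample_hit))
print(api_response_from_es({"_index": "laws", "_id": "9-9", "found": False}))
```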

@waldoj
Member
waldoj commented Oct 8, 2014

I don't think it would be particularly onerous to convert our schema (schema.xml) from Solr's format to Elasticsearch's. I have very limited experience with Solr's format, but I've done a bunch of work with Elasticsearch's, and I've found it to be quite straightforward.
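
As a sketch of how mechanical the conversion might be, here is one way to turn `<field>` definitions from a Solr schema into an Elasticsearch mapping. Everything specific here is an assumption: the sample XML is a made-up fragment rather than our actual schema.xml, the type table is my guess at reasonable equivalents, and the `{"mappings": {"properties": ...}}` shape and `"index": False` setting target recent Elasticsearch versions (older ones used a different layout). Analyzers, copy fields, and dynamic fields are not handled.

```python
import xml.etree.ElementTree as ET

# Assumed type equivalents; a real conversion would need to check analyzers too.
SOLR_TO_ES_TYPE = {
    "string": "keyword",
    "text_general": "text",
    "int": "integer",
    "date": "date",
}


def solr_fields_to_mapping(schema_xml):
    """Convert Solr <field> definitions into an Elasticsearch mapping dict."""
    root = ET.fromstring(schema_xml)
    properties = {}
    for field in root.iter("field"):
        # Unknown Solr types fall back to full-text "text".
        prop = {"type": SOLR_TO_ES_TYPE.get(field.get("type"), "text")}
        # Solr's indexed="false" roughly corresponds to "index": false.
        if field.get("indexed", "true") == "false":
            prop["index"] = False
        properties[field.get("name")] = prop
    return {"mappings": {"properties": properties}}


# Hypothetical schema fragment, for illustration only.
SAMPLE_SCHEMA = """
<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="text" type="text_general" indexed="true" stored="true"/>
    <field name="raw_xml" type="string" indexed="false" stored="true"/>
  </fields>
</schema>
"""

mapping = solr_fields_to_mapping(SAMPLE_SCHEMA)
print(mapping)
```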

@brianwc
brianwc commented Oct 8, 2014

I'm much more familiar with Solr, but I have the impression that they both have roughly the same features. What matters to me most when selecting a search backend is what query syntax it might force upon the users. Users of legal research tools are the worst. They want it all. They want a natural language search that will "just get" what they were looking for and they also want a super-powerful terms and connectors search capability. (And we're not talking about just boolean AND/NOT/OR; they like things like w/in 30, w/in para, wildcards, fuzzy, ranges, term appears x times, etc.) I know how Solr fares on this front, but am less familiar with what Elasticsearch's query syntax will look like. There's this page: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html and I think I understand it, and if so, it looks like it may use the same syntax Solr does, but it would be nice to have an item-by-item side-by-side comparison.

@waldoj
Member
waldoj commented Oct 8, 2014

They're both just interfaces to Lucene, so they should use the same syntax.
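
A small side-by-side sketch of what that means in practice: the same Lucene-syntax string goes into Solr's `q` parameter (with the standard `lucene` parser) and into Elasticsearch's `query_string` query. The field names `text` and `year` and the sample queries are made up for illustration. As far as I know, the Lucene syntax covers proximity, wildcards, fuzzy matching, ranges, and booleans, but has no direct equivalent of "w/in para" or "term appears x times", so those would need something beyond this.

```python
def build_queries(user_query, default_field="text"):
    """Return (Solr request params, Elasticsearch request body) for one query.

    The user's query string is passed through unchanged in both cases.
    """
    solr_params = {"q": user_query, "df": default_field, "defType": "lucene"}
    es_body = {
        "query": {
            "query_string": {"query": user_query, "default_field": default_field}
        }
    }
    return solr_params, es_body


# Illustrative Lucene-syntax queries; terms and field names are hypothetical.
examples = {
    "proximity (within 30 words)": '"breach contract"~30',
    "wildcard": "negligen*",
    "fuzzy": "trespass~",
    "range": "year:[2010 TO 2014]",
    "boolean": "landlord AND NOT tenant",
}

for label, query in examples.items():
    solr_params, es_body = build_queries(query)
    print(label, solr_params["q"], es_body["query"]["query_string"]["query"])
```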

@waldoj
Member
waldoj commented Oct 8, 2014

Oh, hey, this seems handy.

@waldoj
Member
waldoj commented Apr 19, 2015

👍
