Improved search feature (elasticsearch based, demo available) #455

Closed
wants to merge 22 commits into
from

@karmi

This pull request contains a proposed search feature overhaul for Rubygems.org, implemented with the elasticsearch search engine, via the Tire library.

Objectives

The main objective of the effort is to allow searching in more gem properties than just their names, notably in summaries, descriptions and authors; technically speaking, to increase both precision and recall of search at Rubygems.org.

Using a search engine — as opposed to a LIKE %term% database query — allows not only for better, faster searches, but also for advanced features such as a rich search query language, faceted navigation, and more.
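To make the contrast concrete, here is a toy sketch in plain Ruby. It is not elasticsearch or Tire code: the sample gems and the helper names (`sample_gems`, `like_search`, `build_index`, `index_search`) are made up for illustration. It compares a substring scan over names, which is roughly what a LIKE %term% query does, with a tiny inverted index built over names and summaries.

```ruby
# Invented sample records, for illustration only.
def sample_gems
  [
    { name: "rack",       summary: "a modular Ruby webserver interface" },
    { name: "sinatra",    summary: "classy web-development dressed in a DSL, built on rack" },
    { name: "beer_laser", summary: "lasers and beer" }
  ]
end

# Roughly what `name LIKE '%term%'` does: scan every name for a substring.
def like_search(gems, term)
  gems.select { |g| g[:name].include?(term) }.map { |g| g[:name] }
end

# A minimal inverted index: token => list of gem names,
# built over both the name and the summary.
def build_index(gems)
  index = Hash.new { |h, k| h[k] = [] }
  gems.each do |g|
    text   = "#{g[:name]} #{g[:summary]}".downcase
    tokens = text.split(/[^a-z0-9]+/).reject(&:empty?).uniq
    tokens.each { |t| index[t] << g[:name] }
  end
  index
end

# Look a single term up in the index; unknown terms yield an empty list.
def index_search(index, term)
  index.fetch(term.downcase, [])
end

index = build_index(sample_gems)
p like_search(sample_gems, "rack")  # only the gem named "rack"
p index_search(index, "rack")       # also finds sinatra via its summary
```

The name-only scan misses sinatra, while the index lookup finds it because "rack" appears in its summary; that is the gain in recall the objectives describe.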

Changelog

All the steps required for implementing the feature are committed on the search-steps branch, with extensive commit messages documenting the process. The important steps are:

  • 86240f2 and 3a17730 implement the simplest search with elasticsearch, adding model integration and using Tire in the controller.

  • fa7d1cd adds a complex mapping definition for the Rubygem model, allowing searches in gem summaries/descriptions, authors, dependencies, and more.

  • 8c6bee8 adds a sliding panel to the search results page which contains examples of searches with Lucene search query syntax.

Additional Cucumber scenarios were added to document the new search features. Some additional tweaks were required to run the test suite successfully on Travis CI.

Please review the branch compare page to see the full picture.

Demo Application

A demo application is available at http://rubygems-with-elasticsearch.herokuapp.com.

(UPDATE, **new demo server** here: ****

UPDATE: test servers terminated.

Try out simple searches such as rack or searching in authors: author:john and dependencies: uses:rack. More tips are available as in-page help.
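The field-prefixed queries above are ordinary values of the `query` URL parameter. A minimal sketch of building such URLs with the Ruby standard library follows; the helper name `search_url` is hypothetical, and the base URL is an assumption (a local development server rather than the demo host).

```ruby
require "uri"

# Build a search URL for a Lucene-style query such as "author:john".
# The default base URL is an assumption: a local development server.
def search_url(query, base = "http://localhost:3000/search")
  "#{base}?#{URI.encode_www_form(query: query)}"
end

puts search_url("rack")
puts search_url("author:john")
puts search_url("uses:rack")
```

`URI.encode_www_form` percent-encodes the colon, so the field prefix survives the round trip through the URL.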

The database contains only a limited subset of gems. The application is running on a free Heroku plan. The elasticsearch service is running on an Amazon EC2 t1.micro instance. Keep in mind that the application runs in a tweaked development mode (due to issues with assets etc.), so the performance of the demo application does not reflect performance in a real production environment.

If a dump of the Rubygems production database were available, I'd like to import it into the demo application database.

Further Development

If the proposed search implementation is considered desirable, a number of further developments are possible, e.g.:

  • more fine-grained score computation based on number of downloads, not straight sorting,
  • allow sorting the results by number of downloads, alphabetically, by created or updated time,
  • highlighting the relevant matched snippets from gem properties,
  • adding faceted search on authors,
  • linking to a specific matched version from search results,
  • displaying aggregated statistics such as authors with most gems, authors with most downloaded gems, etc.,
  • adding Tire's NewRelic instrumentation to track performance.
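The first item in the list above, using downloads as a scoring factor instead of a sort key, can be sketched in plain Ruby. This is only an illustration of the idea, not elasticsearch's scoring: the formula (text score multiplied by a logarithm of the download count), the helper names and the sample hits are all assumptions.

```ruby
# Combine a text-relevance score with popularity. The logarithm dampens
# very large download counts so that a weak text match with many
# downloads does not dominate the ranking. The formula is an assumption.
def boosted_score(text_score, downloads)
  text_score * Math.log10(downloads + 10)
end

# Order hits by the combined score, best first, and return their names.
def rank(hits)
  hits.sort_by { |h| -boosted_score(h[:score], h[:downloads]) }.map { |h| h[:name] }
end

# Invented sample hits: text score plus download count.
def sample_hits
  [
    { name: "rack-obscure", score: 2.0, downloads: 90 },
    { name: "rack",         score: 1.5, downloads: 999_990 },
    { name: "unrelated",    score: 0.1, downloads: 9_999_990 }
  ]
end

p rank(sample_hits)
```

With straight sorting on downloads the barely matching "unrelated" hit would come first; with the combined score it drops to the end while the popular, well-matching "rack" leads.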

Installation Instructions

To check out the search feature locally, assuming you have cloned the Rubygems.org repository, first set it up according to the instructions:

./script/setup

To import your local gems into the database, run:

bundle exec rake gemcutter:import:process

Then, install elasticsearch using your preferred method. On Mac OS X, the easiest way is to use Homebrew:

brew install elasticsearch

To import the gems from the database into elasticsearch, run:

bundle exec rake environment tire:import CLASS='Rubygem' FORCE=1
@travisbot

This pull request fails (merged 8c6bee8 into 7ac6d16).

@karmi

As noted, @travisbot requires some additional tweaks to run the test suite. (Still, Travis is very unreliable when running the full test suite, see http://travis-ci.org/#!/karmi/rubygems.org/builds)

[Edit] example of a successful test run: http://travis-ci.org/#!/karmi/rubygems.org/jobs/2286458

@evanphx
RubyGems member

Looks great! I don't have any experience with elasticsearch. Does it require running another service? If so, there aren't any details on how to get that service running, and we'll need that.

@nz
nz commented Aug 30, 2012

Looks great, @karmi!

@evanphx: I'd consider it a privilege to sponsor the search hosting on http://bonsai.io/

@karmi

@evanphx Thanks! Right now, the elasticsearch service for the demo application runs on an EC2 instance, provisioned with Chef. I think there would be no problem getting someone to sponsor the box, possibly including the elasticsearch.com company.

As @nz points out, Bonsai is available to host the search service as well -- though issues with the Tire library and Bonsai would have to be sorted out...

@evanphx
RubyGems member

I'm wary of having rubygems.org depend on an external service like bonsai.io. There is A LOT of traffic and I worry the site would depend too heavily on the reliability of something we don't have control over.

@karmi

I understand the concern, Evan.

However, if we want to make the search at Rubygems.org radically better, there's no way around depending on some external factor in one way or another: all the major search engines are external processes/services (except TSearch).

It's similar, in fact, to dependency on Redis (for tracking downloads) in the current codebase.

elasticsearch itself is open source and free, based on Lucene, written in Java, and can be run trivially on any real or virtual server. It's particularly suited to run in cloud environments such as Amazon AWS (i.e. with little latency to Heroku), but is in no way tied to or affiliated with Amazon.

(Regarding the work needed to set up, configure and maintain an elasticsearch server, I can handle such duties just fine.)

If we want to move forward with the proposed search functionality, I think we need to work on these points:

  1. Is the proposed feature something we want to use, eventually, at Rubygems.org? If so, let's discuss what steps are required to merge it into master and roll it into production.

  2. Can a full dump of the Rubygems production database be provided? If so, let me load it up to the demo application so everybody can try various kinds of searches and kick the tires on the feature.

  3. If everybody's happy with the proposed search feature, let's work on polishing it further -- the first thing is using Rubygem#downloads as a score boosting factor, not as a straightforward sorting criterion.

@evanphx
RubyGems member

I'm sorry, I wasn't clear. I don't have an issue running the elasticsearch service on the rubygems.org servers. I am worried about using a hosted elasticsearch service because then usage of it is dependent on a lot more (network conditions, cloud health, etc).

As for the running of it, that's very kind of you to offer to set up, configure, and maintain it, but that likely won't work because then you'd need to be effectively on-call all the time. We don't have an issue maintaining it, but I would like some guidance on how it should be configured, how much disk/memory it will use, etc.

@nz
nz commented Aug 31, 2012

@evanphx: ElasticSearch has a pretty good guide for self-hosting on Amazon: http://www.elasticsearch.org/tutorials/2011/08/22/elasticsearch-on-ec2.html

FWIW, we made the same offer to host the search at websolr when Solr was on the table a while back. I did some digging with qrush at one point into the question of traffic volume, and was completely comfortable with the numbers. Besides that, we literally are on call all the time :-)

That said I get the value of self-hosting for you here, and am happy to be available to talk tech when it comes to hosting ES. I'll idle in #gemcutter today (nz) if you want to talk more about capacity planning, which is almost always an experimental process.

@nz
nz commented Aug 31, 2012

Er, make that #rubygems :)

@karmi

Thanks for the clarification, Evan!

> I don't have an issue running the elasticsearch service on the rubygems.org servers.
> We don't have an issue maintaining it, but I would like some guidance into
> how it should be configured, how much disk/memory it will use, etc.

Perfect! elasticsearch is pretty easy to install and operate; in terms of required resources, for the Rubygems.org use case a modest machine will be more than enough.

I can certainly help with the installation and configuration of elasticsearch on your servers; just ask about the specifics! The easiest way is to use the Chef cookbook. Please see the tutorial at the elasticsearch.org site.

For the Rubygems.org use case, one elasticsearch node should be enough, though for proper failover and scalability, two nodes would be desirable. In terms of resources needed, elasticsearch needs mostly RAM. Any modest machine comparable to an EC2 small to large instance would be enough, assuming it has a couple of gigabytes of memory to spare. (Note that the demo application uses the micro instance and happily purrs along with just 613MB of RAM.)

Provided the database dump from Rubygems.org is available, I can do some capacity testing with the full set of data against EC2 instances.

@cmeiklejohn

@karmi

Looks like I'm getting some errors on the console when running the test suite, but the tests aren't failing. Is this something to be concerned with?

# Running tests:

..................................................................................................................................................................................................................................................................................................................................................................................................[REQUEST FAILED] curl -X GET "http://localhost:9200/test_rubygems/rubygem/_search?load%5Binclude%5D=versions&page=&per_page=30&size=30&pretty=true" -d '{"query":{"bool":{"should":[{"text":{"name":{"query":"bang!","type":"phrase_prefix","operator":"and","boost":100}}},{"query_string":{"query":"bang!","default_operator":"and"}}]}},"sort":[{"downloads":"desc"},{"name.raw":"asc"}],"filter":{"term":{"indexed":true}},"size":30}'
.[REQUEST FAILED] curl -X GET "http://localhost:9200/test_rubygems/rubygem/_search?load%5Binclude%5D=versions&page=&per_page=30&size=30&pretty=true" -d '{"query":{"bool":{"should":[{"text":{"name":{"query":"bang!","type":"phrase_prefix","operator":"and","boost":100}}},{"query_string":{"query":"bang!","default_operator":"and"}}]}},"sort":[{"downloads":"desc"},{"name.raw":"asc"}],"filter":{"term":{"indexed":true}},"size":30}'
.[REQUEST FAILED] curl -X GET "http://localhost:9200/test_rubygems/rubygem/_search?load%5Binclude%5D=versions&page=&per_page=30&size=30&pretty=true" -d '{"query":{"bool":{"should":[{"text":{"name":{"query":"bang!","type":"phrase_prefix","operator":"and","boost":100}}},{"query_string":{"query":"bang!","default_operator":"and"}}]}},"sort":[{"downloads":"desc"},{"name.raw":"asc"}],"filter":{"term":{"indexed":true}},"size":30}'
.[REQUEST FAILED] curl -X GET "http://localhost:9200/test_rubygems/rubygem/_search?load%5Binclude%5D=versions&page=&per_page=30&size=30&pretty=true" -d '{"query":{"bool":{"should":[{"text":{"name":{"query":"bang!","type":"phrase_prefix","operator":"and","boost":100}}},{"query_string":{"query":"bang!","default_operator":"and"}}]}},"sort":[{"downloads":"desc"},{"name.raw":"asc"}],"filter":{"term":{"indexed":true}},"size":30}'
........................................................................

Finished tests in 115.507678s, 3.9911 tests/s, 6.4411 assertions/s.

461 tests, 744 assertions, 0 failures, 0 errors, 0 skips
@cmeiklejohn cmeiklejohn closed this Sep 9, 2012
@cmeiklejohn

Ugh, whoops for the close. Github UI failure.

@cmeiklejohn cmeiklejohn reopened this Sep 9, 2012
@karmi

@cmeiklejohn Yes, that is intentional -- it's the Tire's STDERR output, coming from tests for "user enters invalid Lucene query", bang! in this case. See https://github.com/karmi/rubygems.org/blob/search-steps/test/functional/searches_controller_test.rb#L57-66 and https://github.com/karmi/rubygems.org/blob/search-steps/features/search.feature#L67-70
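For reference, the request body printed in the `[REQUEST FAILED]` lines can be rebuilt as a plain Ruby hash. This sketch uses only the standard `json` library and mirrors the JSON from the log; it is not the Tire DSL code from the controller, and the helper name `search_body` is hypothetical.

```ruby
require "json"

# Rebuild the search request body seen in the log output:
# a boolean query with a phrase-prefix match on the name plus a
# Lucene query_string query, sorted by downloads and then by name.
def search_body(query, size = 30)
  {
    query: {
      bool: {
        should: [
          { text: { name: { query: query, type: "phrase_prefix", operator: "and", boost: 100 } } },
          { query_string: { query: query, default_operator: "and" } }
        ]
      }
    },
    sort: [{ downloads: "desc" }, { "name.raw" => "asc" }],
    filter: { term: { indexed: true } },
    size: size
  }
end

puts JSON.generate(search_body("bang!"))
```

The `query_string` part is what rejects `bang!`: the input is parsed as Lucene syntax, so a malformed expression makes the whole request fail.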

@adkron

👍 I went to the test site and loved the functionality that it provides. What can we do to get this brought back up?

@qrush qrush and 1 other commented on an outdated diff Mar 28, 2013
app/models/rubygem.rb
@@ -12,7 +15,48 @@ class Rubygem < ActiveRecord::Base
validate :ensure_name_format
validates :name, :presence => true, :uniqueness => true
- after_create :update_unresolved
+ after_create :update_unresolved, :update_elasticsearch_index
+ after_touch :update_elasticsearch_index
+
+ tire do
@qrush
qrush Mar 28, 2013

All of this is perfect for a Concern module, something like Searchable.

class Rubygem < ActiveRecord::Base
  include Searchable

And that module has all of the necessary includes, methods, etc. Any thoughts about that approach?

@karmi
karmi Mar 28, 2013

Nothing against such an approach -- normally, I like to keep mapping/etc definitions inside the model, and since the after_create :update_unresolved hook was already there, I just followed the convention. Do you want to extract everything related to search to a module?
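A rough sketch of the extraction discussed in this thread, written so it runs without Rails. In the application this would more likely use ActiveSupport::Concern with an `included do` block; here a plain `self.included` hook is swapped in, and `FakeRecord` is a hypothetical stand-in for ActiveRecord::Base that only records callback names.

```ruby
# Hypothetical stand-in for ActiveRecord::Base: it only records
# which callbacks were registered, so the sketch runs without Rails.
class FakeRecord
  def self.callbacks
    @callbacks ||= Hash.new { |h, k| h[k] = [] }
  end

  def self.after_create(*names)
    callbacks[:after_create].concat(names)
  end

  def self.after_touch(*names)
    callbacks[:after_touch].concat(names)
  end
end

# Everything search-related lives in one module.
module Searchable
  # Register the indexing callbacks on whichever class includes the module.
  def self.included(base)
    base.after_create :update_elasticsearch_index
    base.after_touch  :update_elasticsearch_index
  end

  def update_elasticsearch_index
    # A real implementation would send the record to the search index here.
    :indexed
  end
end

class Rubygem < FakeRecord
  after_create :update_unresolved
  include Searchable
end

p Rubygem.callbacks[:after_create]
```

The model keeps its existing `after_create :update_unresolved` hook, and the search callbacks arrive through the `include` line alone.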

@qrush qrush commented on the diff Mar 28, 2013
features/support/env.rb
@@ -4,6 +4,9 @@
# instead of editing this one. Cucumber will automatically load all features/**/*.rb
# files.
+require 'webmock/cucumber' # Allow connections to elasticsearch
@qrush
qrush Mar 28, 2013

Does this mean the test suite is dependent on an elasticsearch install? How would this work on Travis, etc?

@karmi
karmi Mar 28, 2013

The Cucumber integration test is indeed dependent on Elasticsearch running, since that's the only way to test the feature end to end. Elasticsearch is available on Travis.

@qrush qrush and 1 other commented on an outdated diff Mar 28, 2013
app/controllers/searches_controller.rb
@@ -1,8 +1,31 @@
class SearchesController < ApplicationController
+ # Indicate incorrect query to the user
+ rescue_from Tire::Search::SearchRequestFailed do |error|
@qrush
qrush Mar 28, 2013

Does this cover the case where ES is completely unavailable/disconnected?

@karmi
karmi Mar 28, 2013

No, that would have to be handled by a separate rescue_from clause, displaying an error such as "We're sorry, search is currently not available".
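The two failure modes from this thread can be sketched in plain Ruby, without Rails or Tire. The error classes below are hypothetical stand-ins (in the controller the first case is `Tire::Search::SearchRequestFailed`), `run_search` is an invented helper, and the wording of the bad-query message is a placeholder.

```ruby
# Hypothetical stand-ins for the two failure modes.
class InvalidQueryError < StandardError; end      # the engine rejected the query syntax
class SearchUnavailableError < StandardError; end # the engine could not be reached

# Run a search block and translate failures into user-facing results.
def run_search(query)
  results = yield(query)
  { status: :ok, results: results }
rescue InvalidQueryError
  # Placeholder message: the user can correct the query and retry.
  { status: :bad_query, message: "Sorry, the query could not be understood. Please check its syntax." }
rescue SearchUnavailableError, Errno::ECONNREFUSED
  # Separate clause for the engine being down or disconnected.
  { status: :unavailable, message: "We're sorry, search is currently not available" }
end

p run_search("rack")  { |q| ["#{q} 1.4.1"] }
p run_search("bang!") { raise InvalidQueryError }
p run_search("rack")  { raise Errno::ECONNREFUSED }
```

Keeping the clauses separate lets the page tell the user whether retyping the query can help or whether the problem is on the server side.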

@qrush
RubyGems member

Left a few comments. I think we should get ES set up in rubygems/rubygems-aws soon so we can start playing with it... maybe we can wire up the test heroku app to give it a test first.

Some more feedback:

  • Let's get info on how to set up ES into CONTRIBUTING.md
  • Does anyone else have experience with maintaining a running cluster? What if there are problems? Who will get alerted, who will debug it, etc? (I'm trying to answer this now instead of when the fire is blazing)
  • What if search goes down, can we fall back to the old search?
@karmi

> I think we should get ES setup in rubygems/rubygems-aws soon

Please keep me in the loop, I'm the author of the Chef cookbook.

> Let's get info for how to get ES setup in CONTRIBUTING.md

I'll put it in, and force push the commits here.

> Does anyone else have experience with maintaining a running cluster?

I'm employed by Elasticsearch.com and have some experience with running Elasticsearch clusters :) Let's talk about the exact process.

> What if search goes down, can we fall back to the old search?

I don't think that's a good solution. A running Elasticsearch cluster shouldn't just go down -- we just need to ensure there is proper monitoring at the level of the service itself and at the EC2 level.

@qrush
RubyGems member

@karmi: This all sounds awesome :) Would you be willing to contribute that into https://github.com/rubygems/rubygems-aws ? I'm very sure our new ops contributors would be more than willing to help get everything set up.

"Shouldn't just go down" is not what I've seen...I'd rather account/test for that now when we're doing the switch and migration instead of when it's on fire.

@karmi

Yes, I'll set up the environment for rubygems-aws and add a pull request for Elasticsearch.

As for going down, all services and servers can go down :) But I think we need to come up with a process for that, instead of falling back on the SQL based search; that just doesn't feel right. There are many aspects here, e.g. having nodes properly distributed across AWS zones, having an automated strategy for recovering from backup or reindexing from scratch, etc.

@qrush
RubyGems member

Cool. That would be neat. We don't even have any of that in place for the main app yet (AFAIK)

@karmi karmi added a commit to karmi/rubygems-aws that referenced this pull request May 10, 2013
@karmi karmi [SEARCH] Added configuration for Elasticsearch nodes
This commit adds support for search nodes running Elasticsearch.

* The "elasticsearch" cookbook [https://github.com/elasticsearch/cookbook-elasticsearch/]
  has been added to the Cheffile

* A Vagrant VM named `search` has been added

* A `search` role has been added

* Node configurations (*.json) for Vagrant and EC2 have been added

* The Capistrano tasks have been updated to reflect the changes

To deploy in EC2:

    # Update packages
    #
    RUBYGEMS_EC2_SEARCH=abc-123.compute-1.amazonaws.com \
    DEPLOY_USER=ubuntu \
    DEPLOY_SSH_KEY=~/.ssh/mykey.pem \
      cap rubygems.org invoke COMMAND='sudo apt-get update' SUDO=true

    # Install Chef
    #
    RUBYGEMS_EC2_SEARCH=abc-123.compute-1.amazonaws.com \
    DEPLOY_USER=ubuntu \
    DEPLOY_SSH_KEY=~/.ssh/mykey.pem \
      cap rubygems.org invoke COMMAND='curl -# -L http://www.opscode.com/chef/install.sh | sudo bash -s --' SUDO=true

    # Run Chef
    #
    time \
    RUBYGEMS_EC2_SEARCH=abc-123.compute-1.amazonaws.com \
    DEPLOY_USER=ubuntu \
    DEPLOY_SSH_KEY=~/.ssh/mykey.pem \
      cap rubygems.org chef:search

Related: rubygems/rubygems.org#455
aaa86e5
@karmi karmi added a commit to karmi/rubygems-aws that referenced this pull request May 10, 2013
@karmi karmi [SEARCH] Added a template for Elasticsearch application initializer
The `elasticsearch_url` variable is set in the "secret/rubygems" data bag,
similar to setting PostgreSQL host, etc.

Alternatively, an environment variable `ELASTICSEARCH_URL` could be used.

Related: rubygems/rubygems.org#455
0f9c517
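A minimal sketch of the lookup this commit message describes, in plain Ruby. The helper name, the use of a plain hash in place of the data bag, the precedence (environment variable first) and the local default are all assumptions made for illustration; the actual initializer template is not reproduced here.

```ruby
# Resolve the Elasticsearch URL: environment variable first, then a
# settings hash (standing in for the data bag), then a local default.
# The precedence and the default value are assumptions.
def elasticsearch_url(env = ENV, settings = {})
  env["ELASTICSEARCH_URL"] || settings["elasticsearch_url"] || "http://localhost:9200"
end

puts elasticsearch_url({}, {})
puts elasticsearch_url({}, { "elasticsearch_url" => "http://search.internal:9200" })
puts elasticsearch_url({ "ELASTICSEARCH_URL" => "http://10.0.0.5:9200" }, {})
```

Passing the environment and settings in as arguments keeps the lookup easy to test without touching the real process environment.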
@karmi karmi referenced this pull request in rubygems/rubygems-aws May 10, 2013
Merged

Added Elasticsearch integration #122

@karmi karmi added a commit to karmi/rubygems-aws that referenced this pull request May 10, 2013
@karmi karmi [SEARCH] Added configuration for Elasticsearch nodes
02d8518
@karmi karmi added a commit to karmi/rubygems-aws that referenced this pull request May 10, 2013
@karmi karmi [SEARCH] Added a template for Elasticsearch application initializer
8e0e3dd
@karmi

Hi all, rebased the branch against current master and added some commits. There's a new test server available here:

http://54.235.152.92:3000/search?utf8=✓&query=name%3Arack

which has been created as part of the rubygems/rubygems-aws#122 pull request.

@vipulnsward vipulnsward and 1 other commented on an outdated diff May 11, 2013
features/step_definitions/gem_steps.rb
+ table.hashes.each do |row|
+ # p 'GOT TABLE ROW:', row, '-'*80
+ if row['downloads']
+ rubygem = FactoryGirl.create :rubygem_with_downloads, :name => row['name'], :downloads => row['downloads']
+ else
+ rubygem = FactoryGirl.create :rubygem, :name => row['name']
+ end
+
+ FactoryGirl.create(:version, :rubygem => rubygem) do |version|
+ version.number = row['version']
+ version.authors = row['authors'].split(/\s*,\s*/)
+ version.summary = row['summary']
+ version.description = row['description']
+
+ version.save
+ # p "CREATED RUBYGEM:", version.rubygem, version, '-'*80
@vipulnsward
vipulnsward May 11, 2013

this p could be removed now

@vipulnsward vipulnsward commented on an outdated diff May 11, 2013
features/step_definitions/gem_steps.rb
@@ -65,3 +70,24 @@
rubygem.ownerships.create :user => user
end
end
+
+Given /^gems with these properties exist:$/ do |table|
+ table.hashes.each do |row|
+ # p 'GOT TABLE ROW:', row, '-'*80
@karmi

@vipulnsward Commented out debug statements for Cucumber removed in karmi/rubygems.org@dcb887a.

@cicloid

Just saw the pull request, is there something in need of doing or testing, in order to move this forward?

@karmi

karmi opened this pull request a year ago

Guys, we just passed an anniversary with this pull request. What should be done with it? Should I close it?

/cc @qrush @evanphx @skottler

@vipulnsward

😢 I hope not.

@knappe

Can we get some more traction on this? This is a very intriguing feature set.

/cc @qrush @evanphx @skottler

@skottler
RubyGems member

@karmi can you please rebase?

@knappe the best way to help move this forward is to do a thorough code review.

karmi added some commits Aug 24, 2012
@karmi karmi [SEARCH] Added "tire" dependency for searching Rubygems.org with elas…
…ticsearch

elasticsearch is an open source search engine based on Lucene, with a RESTful HTTP interface and advanced distributed features.

Tire is a Ruby API/DSL for elasticsearch, with an out-of-the box ActiveRecord/ActiveModel integration.

See:

* "Tire": https://github.com/karmi/tire
* "elasticsearch": http://elasticsearch.org
e18cfb2
@karmi karmi [SEARCH] Allow connections to elasticsearch [localhost:9200] in tests…
… and Cucumber

NOTE: The `disable_net_connect!` call has to come *before* we load the application,
      because Tire checks for index existence on application boot, and shoots
      the entire test suite down.

      See <karmi/retire#136> for more information.
6632bc2
@karmi karmi [SEARCH] Added elementary Tire integration into the Rubygem model
* Added, that the Version model propagates touches to Rubygem [See: http://stackoverflow.com/a/11711477/95696]

* Added Tire ActiveRecord callbacks [See: https://github.com/karmi/tire#activemodel-integration]

* Added a simple mapping definition for Rubygem

* Added a simple `to_indexed_json` serialization for Tire

* Fixed incorrect test case in WebHookTest ("include an Authorization header"):
  1) use the `build`, not the `create` FactoryGirl strategy (to skip Tire indexing), and,
  2) use the _last_ HTTP request from WebMock registry (to skip Tire checking if the index exists)

* Fixed failing "Web Hooks" feature, using the _last_ HTTP request from WebMock registry (see above)

Import your current database with the following Rake task:

    $ bundle exec rake environment tire:import CLASS='Rubygem' FORCE=1

Check the index in your browser:

    <http://localhost:9200/development_rubygems/_search>
2e56fcb
@karmi karmi [SEARCH] Added the simplest possible search with elasticsearch
* Added simple query string search into SearchesController

* Recreate elasticsearch index in the SearchesController functional test setup and in the Cucumber `Before('@search')` callback

* Trigger index update in the FactoryGirl `after(:create)` callback

* Be more defensive in ApplicationHelper#short_info (in test, some gems don't have versions?)

Note: The "beer laser" => "beer_laser" Cucumber scenario fails,
      due to incorrect analysis of Gem names with underscore.
611774a
@karmi karmi [SEARCH] Added, that factories trigger `touch` callbacks after create
NOTE: We need to be absolutely sure the `Rubygem` instance is touched, because
      we rely on it being indexed in elasticsearch in integration tests.
70f0f1f
@karmi karmi [SEARCH] Added proper analyzer for Rubygem names
With the original (standard) analyzer, a Gem name like "url_mount" would be analyzed as "url_mount",
making searches for "url mount" (without the underscore) fail.

With the new analyzer, *tokens* are split by "special characters" defined by the `Patterns` module (.-_).

Try it out yourself:

  <http://localhost:9200/development_rubygems/_analyze?text=url_mount&field=name>

This change makes the "beer laser" => "beer_laser" Cucumber scenario pass.
2389438
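The effect of the analyzer change can be imitated in plain Ruby. This is a simplified illustration, not the analyzer configuration from the commit: it only splits names on the three characters mentioned above (`.`, `-`, `_`), and the helper names are hypothetical.

```ruby
# Split a gem name on the "special characters" (. - _) and lowercase
# the parts, imitating what the new analyzer does with names.
def name_tokens(name)
  name.downcase.split(/[.\-_]+/).reject(&:empty?)
end

# A query matches when every word in it is one of the name's tokens.
def name_matches?(name, query)
  (query.downcase.split - name_tokens(name)).empty?
end

p name_tokens("url_mount")
p name_matches?("url_mount", "url mount")
p name_matches?("beer_laser", "beer laser")
```

With a single unsplit token such as "url_mount", neither "url" nor "mount" would be found, which is why the "beer laser" scenario failed before this commit.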
@karmi karmi [SEARCH] Changed the search definition to a DSL-based syntax, added s…
…orting by downloads

* Used the DSL notation for defining the search: using a `match` prefix query on the "name" field,
  basically replicating the simple query string search with wildcards with a more performant version,
  and using a filter on the `indexed` property

* Added sorting of the results by downloads (descending)

* Added a Cucumber scenario for showing the more downloaded gems higher in search results

* Added supporting Cucumber code: a "I have a gem with downloads" and "I see these search results" step definitions
bfd2aa7
@karmi karmi [SEARCH] Changed, that search results are ordered first by downloads,…
… then alphabetically

* Changed the `name` property to multi-field, using the "keyword" analyzer on `name.raw` for searching
* Added the sort block with multiple sort fields
* Added a Cucumber scenario

NOTE: Now we should really stop and think twice about how to make the results more relevant.
      We should be able to get better search _precision_ by using the `Rubygem#downloads`
      counter as a factor affecting score, not just plainly sort on its value.
db6e1d3
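The ordering this commit describes (downloads descending, then name ascending) corresponds to a two-key sort. A plain Ruby sketch with invented sample data and a hypothetical helper name:

```ruby
# Order results by downloads (highest first), breaking ties alphabetically.
def sort_results(gems)
  gems.sort_by { |g| [-g[:downloads], g[:name]] }
end

# Invented sample data, for illustration only.
def sample_results
  [
    { name: "rack-test",  downloads: 500 },
    { name: "rack",       downloads: 9000 },
    { name: "rack-cache", downloads: 500 }
  ]
end

p sort_results(sample_results).map { |g| g[:name] }
```

As the note in the commit message points out, this is still plain sorting: a gem with many downloads ranks first however weakly it matches the query.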
@karmi karmi [SEARCH] Added a more complex mapping definition and serialization fo…
…r the Rubygem model

* Previously, only the `name`, `downloads` and `indexed` attributes were indexed,
  replicating the functionality of the current search feature.

* The `to_indexed_json` method was removed, relying on Tire's JSON serialization routines
  based on the model mapping definition.

* The `summary`, `description` and `author` gem properties were added, allowing much better
  search results _recall_, ie. allowing search in these fields as well and widening the search “net”.

* A gem which mentions "sinatra" in its summary/description will now be matched (with a lower score):
  <http://localhost:3000/search?query=sinatra>.

* A gem written by Florian Hanke will now be matched: http://localhost:3000/search?query=florian+hanke

* The `version` gem property was added, allowing searches based on gem versions, for instance looking
  for Sinatra 1.3.2: <http://localhost:3000/search?query=name:sinatra+version:1.3.2>. For improved
  usability, the link from the result listing _should_ lead to the relevant version page,
  ie. http://localhost:3000/gems/sinatra/versions/1.3.2, not the last version.

* The `depends` and `uses` gem properties were added, which index runtime gem dependencies and all
  gem dependencies, respectively. It allows searches such as <http://localhost:3000/search?query=depends:rack>
  (for gems which depend on rack) or http://localhost:3000/search?query=uses:rack (for gems which use rack in
  one way or other).

* The `created_at` and `updated_at` gem properties were added, which allow searching for gems updated in a specific
  period, for instance on August 26th: <http://localhost:3000/search?query=updated_at:[2012-08-26+TO+2012-08-27]>

* The `author`, `created_at` and `updated_at` also allow for a _faceted navigation_ in the future, ie. searching
  for certain gem while restricting the result to certain author or time.

* These properties also allow for computing statistics on the Rubygem collection, such as displaying authors
  with most gems, or authors of the most downloaded gems, etc.

You have to reindex the elasticsearch index, to pick up the new mapping and index records properly:

    $ bundle exec rake environment tire:import CLASS='Rubygem' FORCE=1

See the following resources for information on previous efforts to implement a better Rubygems.org search:

* https://groups.google.com/forum/#!topic/gemcutter/xIzyTmFdXVo/discussion
* http://florianhanke.com/blog/2011/02/13/a-better-rubygems-search.html
* http://blog.websolr.com/post/3505941785/rubygems-search-upgrade-2
* http://blog.websolr.com/post/3505969969/rubygems-search-upgrade-3
aa63c84
@karmi karmi [SEARCH] Mock HTTP responses to Elasticsearch in unit tests 79b6857
@karmi karmi [SEARCH] Added a more complex search query in the SearchesController#…
…show method

Previously, we have been searching gems based on their names only.

With the new, more complex mapping defined in the preceding commit, we can add a more complex search query as well.

We're using a boolean query, keeping the original match prefix query and adding the `query_string` query,
which uses the Lucene query syntax (field specification, boolean operators, wildcards,
fuzzy search, range and proximity searches, grouping, etc).

See:

* http://www.elasticsearch.org/guide/reference/query-dsl/query-string-query.html
* http://lucene.apache.org/core/3_6_1/queryparsersyntax.html
415ba28
@karmi karmi [SEARCH] Added a `rescue_from` failed search requests due to incorrec…
…t query syntax

While we exposed the most powerful way of searching to the user (the Lucene query syntax),
it can quite easily lead to application errors when users enter incorrect queries, such as `bang!` or `foo[]`.

Since this is an error on the user's part, and not the application part, we should display a friendly
error explanation and give the user a chance to correct the query.
f0d20d4
@karmi karmi [SEARCH] Added a "user enters a search query with incorrect syntax" C…
…ucumber scenario

Since the application uses Cucumber scenarios for validating its proper operation,
a scenario with user entering an incorrect search query ("bang!") has been added.
5c13726
@karmi karmi [SEARCH] Added the "Search Advanced" Cucumber feature
With the complex queries now available to users of the application, we should add
acceptance tests for the common scenarios.

We'll start with searching in summaries and descriptions (thanks to the `_all` field
automatically generated by elasticsearch).

Use this command to run all search features:

    $ bundle exec cucumber --tag @search

Use this command to run the "advanced search" feature:

    $ bundle exec cucumber features/search_advanced.feature
ac96085
@karmi karmi [SEARCH] Added a Cucumber scenario for searching in gem authors
    Given we now have a more complex search available
    When we search in the `author` field
    We should get some relevant results

* Added a "Searching in authors" scenario
* Added a step for creating more complex Rubygem records into the `gem_steps.rb` definition file

Use this command to run the scenario:

    $ bundle exec cucumber --name "Searching in authors" features/search_advanced.feature
0346d62
@karmi karmi [SEARCH] Refactored the search steps to a higher-level nested step "I…
… search for ..."

Instead of repeating the low-level steps:

    When I go to the homepage
    And I fill in "query" with "<query>"
    And I press "Search"

over and over in our scenarios, we will abstract these steps to a single step:

    When I search for "<query>"

The obvious benefit is less code duplication and more readable steps.
4eb11a8
@karmi karmi [SEARCH] Added "search tips" sliding panel at the search results page
* Added a second form with `query` input, to duplicate the query for easier correction/change
  at the results page

* Added a HTML partial with concrete, interactive examples of queries possible with Lucene,
  hidden by default

* Added a link and JavaScript code to toggle the sliding panel with search examples

* Added CSS styling for the new elements, added a "help.png" icon from the FamFamFam suite
d389af0
@karmi karmi [SEARCH] Added starting of "elasticsearch" in the Travis CI configura…
…tion
01f479b
@karmi karmi [SEARCH] Prevent indexing errors on Rubygem records without a version cf5beec
@karmi karmi [SEARCH] Added information about installing Elasticsearch into "Contr…
…ibution Guidelines"
382b3a6
@karmi karmi [SEARCH] Handle search engine being not available in user-friendly way c1f99aa
@karmi karmi [SEARCH] Changed, that errors when indexing to Elasticsearch are rescued
Previously, when an error occurred while saving the model into the Elasticsearch index,
the whole operation failed and an exception was raised.

This patch adds a `rescue` clause which logs the exception and swallows it.
06f2626
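The log-and-swallow behaviour described in this commit message can be sketched with the standard `logger` library. The helper name `safely_index` and the log message format are hypothetical; the actual patch is not reproduced here.

```ruby
require "logger"
require "stringio"

# Run an indexing block; on failure, log the error and carry on
# instead of letting the exception abort the surrounding save.
def safely_index(logger)
  yield
  true
rescue StandardError => e
  logger.error("[SEARCH] Indexing failed: #{e.class}: #{e.message}")
  false
end

log = StringIO.new
ok  = safely_index(Logger.new(log)) { raise IOError, "connection refused" }
puts ok          # false
puts log.string  # the logged error line
```

Swallowing the error keeps gem pushes working while the search engine is down, at the cost of records that are missing from the index until they are reindexed.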
@karmi

@skottler Rebased, fixed problems with Webmock stubbing, force pushed.

@karmi

I have terminated the EC2 instances for the demo application.

@jimmycuadra

For a project I'm working on, I'd like to be able to search gems based on gem specification metadata (the metadata hash attribute available from RubyGems 2.0 and up). I was investigating how gem searching is implemented currently, and after seeing that it was such a simple SQL query, figured someone had to be working on an ES-based search feature, and sure enough, here it is in this pull request.

Long story short, I'm very interested in seeing this rolled out and would like to help, since it's been sitting idle for quite some time. Is a code review still the blocker here?

@jimmycuadra

It also looks like the tire gem has been deprecated in favor of multiple gems hosted at elasticsearch/elasticsearch-ruby. This PR should be updated to use the new goods.

@karmi karmi closed this Dec 12, 2013