Getting Started

EgozyN edited this page May 28, 2015 · 3 revisions

HebMorph is an open-source effort for making Hebrew properly searchable with various available search technologies. The HebMorph project along with an Apache Lucene integration is available at https://github.com/synhershko/HebMorph and on Maven central.

elasticsearch-analysis-hebrew is an Elasticsearch plugin that makes it simple to use HebMorph from Elasticsearch.

Both HebMorph and this analysis plugin are provided under the AGPL3 license.

To start working with HebMorph using your local Elasticsearch instance:

  • Download Elasticsearch 1.5.2 if you don't have it yet:

https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.5.2.zip

Install the Hebrew analysis plugin using the following command:

~/elasticsearch-1.5.2$ bin/plugin --install analysis-hebrew --url https://bintray.com/artifact/download/synhershko/elasticsearch-analysis-hebrew/elasticsearch-analysis-hebrew-1.7.zip
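  • Tell the plugin where the hspell dictionary files are by adding this setting to your Elasticsearch configuration:
    hebrew.dict.path: /PATH/TO/HSPELL/FOLDER/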
  • Run Elasticsearch (change settings such as cluster.name first if needed):
~/elasticsearch-1.5.2$ bin/elasticsearch

  • Define index mappings so that the fields holding Hebrew text use the "hebrew" analyzer:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#string

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-put-mapping.html

It is often easier to pre-define index mappings in Elasticsearch via index templates:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-templates.html
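
As a minimal sketch of the mapping step, the Python snippet below builds a mapping that assigns the "hebrew" analyzer to a string field and wraps it in an index template. The index pattern `hebrew-*`, the document type `doc`, the field name `content`, the template name, and the node address `http://localhost:9200` are all assumptions made for illustration. The body layout follows the Elasticsearch 1.x mapping and template formats linked above.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed address of a local Elasticsearch node

# Mapping for a hypothetical "doc" type whose "content" field holds Hebrew text.
hebrew_mapping = {
    "doc": {
        "properties": {
            "content": {"type": "string", "analyzer": "hebrew"}
        }
    }
}

# Index template (Elasticsearch 1.x layout) that applies the mapping to every
# new index whose name matches the assumed pattern "hebrew-*".
hebrew_template = {
    "template": "hebrew-*",
    "mappings": hebrew_mapping,
}


def put_json(path, body):
    """Send `body` as JSON with an HTTP PUT to the node at ES_URL."""
    request = urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a node running and the plugin installed, calling `put_json("/_template/hebrew_template", hebrew_template)` should register the template, so that indexes created afterwards under a matching name pick up the "hebrew" analyzer for `content`.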

  • Index your content, making sure that Hebrew text goes into fields mapped to the "hebrew" analyzer via the index mappings from the previous step.

  • Queries to retrieve documents can be made using the Match Query family, see more details here:

http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/match-query.html

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-match-query.html

The analyzer used for indexing is also used for searching, so the same lemmatization process applies to the query text.
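
A plain match query needs no analyzer setting of its own, because the field's analyzer is applied to the query text. The sketch below only builds the request body; the field name `content` is an assumption, and the body would be sent to the `_search` endpoint of whichever index holds the documents.

```python
import json


def match_query(field, text):
    """Build a search body with a plain match query on `field`."""
    return {"query": {"match": {field: text}}}


# The analyzer mapped to the field ("hebrew" in this setup) processes the
# query text, so inflected forms are lemmatized the same way as at index time.
body = match_query("content", "נביעות")

# UTF-8 encoded JSON payload, ready to be sent as the request body.
payload = json.dumps(body).encode("utf-8")
```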

In addition, the "hebrew_query" and "hebrew_query_light" analyzers exist to support exact matches. Both respect "locking" a term at search time, which disables lemmatization for that term. For example, if a Match query for the term נביעות$ is sent with the "hebrew_query" analyzer specified, documents containing only "נביעה" will not show up in the search results. Disabling lemmatization this way also works in phrase match queries, as you would expect. Note that locking also prevents prefixes from being handled (by design), so you should expand your query to include the relevant prefixes yourself.
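
The locking behaviour can be sketched as a match query that overrides the analyzer at search time. The helper name `locked_match_query` and the field name `content` are hypothetical; the `analyzer` option is the standard per-query override of the match query family.

```python
def locked_match_query(field, text, phrase=False):
    """Build a match (or match_phrase) query analyzed with "hebrew_query"."""
    query_type = "match_phrase" if phrase else "match"
    return {
        "query": {
            query_type: {
                field: {"query": text, "analyzer": "hebrew_query"}
            }
        }
    }


# The trailing $ locks the term, so it is searched without lemmatization.
exact_plural = locked_match_query("content", "נביעות$")
```

Passing `phrase=True` produces the `match_phrase` variant of the same body, for locking terms inside a phrase query.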

A "hebrew_exact" analyzer is also available for query_string / match queries that should be matched exactly, without lemma expansion and without locking specific terms with the $ sign.

Because Hebrew uses quote marks to mark acronyms, the match query family is recommended over query_string. This is the official recommendation anyway. This plugin does not currently ship with a QueryParser implementation that can power query_string queries.

To see the laser-beam approach to Hebrew search in action, see http://code972.com/blog/2013/12/673-hebrew-search-done-right . The method shown there uses the "hebrew" analyzer for indexing and "hebrew_query" for searches, with lemmatization locking driven by a powerful UI.

  • HebMorph is released as open source, along with the hspell dictionary files. The commercial option provides further support for making Hebrew search even better, and it comes with a proprietary dictionary. For more information, check out http://code972.com/hebmorph.