HebMorph is an open-source effort to make Hebrew properly searchable with various available search technologies. The HebMorph project, along with an Apache Lucene integration, is available at https://github.com/synhershko/HebMorph and on Maven Central.
elasticsearch-analysis-hebrew is an Elasticsearch plugin which makes HebMorph usage from Elasticsearch simple.
Both HebMorph and this analysis plugin are provided under the AGPL3 license.
To start working with HebMorph using your local Elasticsearch instance:
- Download Elasticsearch 1.5.2 if you don't have it yet, then install the Hebrew analysis plugin using the following command:
~/elasticsearch-1.5.2$ bin/plugin --install analysis-hebrew --url https://bintray.com/artifact/download/synhershko/elasticsearch-analysis-hebrew/elasticsearch-analysis-hebrew-1.7.zip
- Download the Hebrew dictionary files. Open-source hspell files can be downloaded from https://github.com/synhershko/HebMorph/tree/master/hspell-data-files . You will need to tell Elasticsearch where the dictionary is located. This is done by adding the following line to the elasticsearch.yml file:
- Run Elasticsearch (make sure to change configurations like cluster.name beforehand if needed):
Elasticsearch is now available to use via REST over HTTP on http://localhost:9200/. Use your preferred REST client, or the Sense UI for easier interaction with Elasticsearch. Read more about it here: http://code972.com/blog/2014/11/76-elasticsearch-one-tip-a-day-the-sense-ui
Use "hebrew" as the analyzer name for fields containing Hebrew text. This will index them in a way that suits Hebrew. Consult Elasticsearch's documentation for more guidance, in particular:
It is often easier to pre-define index mappings in Elasticsearch via index templates:
Index your content. Make sure Hebrew text is indexed into fields defined to use the "hebrew" analyzer via the index mappings in the previous step.
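As a minimal sketch of such a template, the Python snippet below builds a request body that maps one text field to the "hebrew" analyzer. The index pattern `hebrew-*`, the document type `doc` and the field name `content` are illustrative assumptions, not names defined by the plugin. The body layout assumes the Elasticsearch 1.x template format (a `template` pattern plus per-type `mappings`).

```python
import json


def hebrew_index_template(pattern="hebrew-*", doc_type="doc", field="content"):
    """Build an index template body whose text field uses the "hebrew" analyzer.

    pattern, doc_type and field are hypothetical names chosen for illustration.
    """
    return {
        # Indices whose names match this pattern pick up the mapping below.
        "template": pattern,
        "mappings": {
            doc_type: {
                "properties": {
                    # "string" is the text field type in Elasticsearch 1.x.
                    field: {"type": "string", "analyzer": "hebrew"}
                }
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON body; it would be sent with PUT /_template/<template name>.
    print(json.dumps(hebrew_index_template(), indent=2, ensure_ascii=False))
```

The printed JSON can be pasted into Sense or any other REST client as the body of a `PUT /_template/<name>` request.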
Queries to retrieve documents can be made using the Match Query family. See more details here:
The same analyzer used for indexing is also used for searching, so the same lemmatization process is applied on both sides.
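A plain Match query against a field mapped to the "hebrew" analyzer could look like the sketch below. The field name `content` is an assumption for illustration. No analyzer is named in the query, so the analyzer from the field mapping is applied to the search text.

```python
import json


def hebrew_match_query(field, text):
    """Build a search body with a match query.

    No "analyzer" key is set, so the field's mapped analyzer
    ("hebrew" in this setup) analyzes the query text as well.
    """
    return {"query": {"match": {field: text}}}


if __name__ == "__main__":
    # "content" is a hypothetical field name; the body would be sent
    # with POST /<index>/_search.
    print(json.dumps(hebrew_match_query("content", "נביעה"), ensure_ascii=False))
```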
In addition, the "hebrew_query" and "hebrew_query_light" analyzers exist to support exact matches. Both respect "locking" a term while searching, which disables lemmatization for that term. For example, if a Match query for the term נביעות$ is sent with the "hebrew_query" analyzer specified for it, documents containing only "נביעה" will not show up in search results. Disabling lemmatization this way also works in phrase match queries, as you'd expect. Note that locking also prevents prefixes from being handled (by design), so you should expand your query to contain the relevant prefixes.
A "hebrew_exact" analyzer is also available for query_string / match queries that should be searched exactly, without lemma expansion and without locking specific terms with the $ sign.
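The sketch below shows one possible way to build such a query in Python: it locks a term with a trailing $ and, optionally, adds prefixed variants as alternative clauses. The field name, the helper names and the prefix list are assumptions made for illustration. Which prefixes are relevant depends on your data, and combining the variants in a `bool`/`should` query is only one way to do the expansion, not something the plugin prescribes.

```python
import json

# Common single-letter Hebrew prefixes; an illustrative list, not one
# supplied by the plugin.
HEBREW_PREFIXES = ["ה", "ו", "ב", "כ", "ל", "מ", "ש"]


def lock(term):
    """Append the $ marker that locks a term, unless it is already there."""
    return term if term.endswith("$") else term + "$"


def locked_match_query(field, term, prefixes=()):
    """Build a search body matching a locked term, plus prefixed variants."""
    bare = term.rstrip("$")
    variants = [lock(bare)] + [lock(prefix + bare) for prefix in prefixes]
    clauses = [
        {"match": {field: {"query": variant, "analyzer": "hebrew_query"}}}
        for variant in variants
    ]
    if len(clauses) == 1:
        return {"query": clauses[0]}
    # Any one of the variants is enough for a document to match.
    return {"query": {"bool": {"should": clauses}}}


if __name__ == "__main__":
    body = locked_match_query("content", "נביעות", prefixes=HEBREW_PREFIXES)
    print(json.dumps(body, indent=2, ensure_ascii=False))
```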
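To compare how the different analyzers treat the same text, Elasticsearch's `_analyze` endpoint can help. The sketch below only builds the request URLs. It assumes the plugin's analyzers can be referenced by name on that endpoint and that Elasticsearch listens on http://localhost:9200; the function name is hypothetical.

```python
from urllib.parse import urlencode


def analyze_url(text, analyzer, base="http://localhost:9200"):
    """Build an _analyze URL that runs `text` through the named analyzer."""
    # urlencode percent-encodes the Hebrew text as UTF-8.
    return base + "/_analyze?" + urlencode({"analyzer": analyzer, "text": text})


if __name__ == "__main__":
    # Issue a GET to each URL and compare the returned tokens.
    for name in ("hebrew", "hebrew_query", "hebrew_exact"):
        print(analyze_url("נביעות", name))
```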
Because Hebrew uses quote marks to mark acronyms, it is recommended to use the match family queries and not query_string. This is the official recommendation anyway. This plugin does not currently ship with a QueryParser implementation that can be used to power query_string queries.
To see an example of the laser-beam approach to Hebrew search in action, see http://code972.com/blog/2013/12/673-hebrew-search-done-right . The method shown there uses the "hebrew" analyzer for indexing and "hebrew_query" for searches, with lemmatization-locking supported by a powerful UI.
- HebMorph is released as open source, along with the hspell dictionary files. The commercial option grants you further support in making Hebrew search even better, and it comes with a proprietary dictionary. For more information, check out http://code972.com/hebmorph.