Skip to content
Elasticsearch lemmatizer for 15 languages
Java
Branch: master
Clone or download
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
lib Updated for elasticsearch 6.0.0 Feb 20, 2018
src Formatted code using formatter Aug 6, 2019
.gitignore Updated .gitignore Aug 6, 2019
LICENSE Added LICENSE Nov 26, 2013
README.markdown Update documentation May 10, 2019
plugin-descriptor.properties Updated for elasticsearch 7.0.0 Apr 11, 2019
pom.xml Updated for elasticsearch 7.3.1 Aug 23, 2019

README.markdown

LemmaGen Analysis for ElasticSearch

The LemmaGen Analysis plugin provides jLemmaGen lemmatizer as Elasticsearch token filter.

jLemmaGen is Java implementation of LemmaGen project (originally written in C++ and C#).

Instalation

Plugin

Beginning with elasticsearch 5 installation is following:

# specify elasticsearch version
#
export VERSION=6.0.0
./bin/elasticsearch-plugin install https://github.com/vhyza/elasticsearch-analysis-lemmagen/releases/download/v$VERSION/elasticsearch-analysis-lemmagen-$VERSION-plugin.zip

For older elasticsearch version see installation instructions in releases section.

Lexicon

WARNING: Beginning with elasticsearch 6.0 this plugin no longer provides built-in lexicons. There is separate lemmagen-lexicons repository with them.

Copy desired lexicon(s) from lemmagen-lexicons repository into elasticsearch config/lemmagen directory (keep the .lem extension).

For example to install Czech language support do:

cd elasticsearch
mkdir config/lemmagen
cd config/lemmagen
wget https://github.com/vhyza/lemmagen-lexicons/raw/master/free/lexicons/cs.lem

After plugin installation and elasticsearch restart you should see in logs something like:

[2018-02-20T17:46:09,038][INFO ][o.e.p.PluginsService] [1rZCAqs] loaded plugin [elasticsearch-analysis-lemmagen]

Usage

This plugin provides token filter of type lemmagen.

You need to specify lexicon or lexicon_path attribute.

  • lexicon - name of the file located in config/lemmagen (with or without .lem extension)
  • lexicon_path - relative path to the lexicon file from elasticsearch config directory

For example Czech lexicon can be specified with any of the following configuration:

{
    "index": {
        "analysis": {
            "filter": {
                "lemmagen_lexicon" : {
                    "type": "lemmagen",
                    "lexicon": "cs"
                },
                "lemmagen_lexicon_with_ext" : {
                    "type": "lemmagen",
                    "lexicon": "cs.lem"
                },
                "lemmagen_lexicon_path" : {
                    "type": "lemmagen",
                    "lexicon_path": "lemmagen/cs.lem"
                }
            }
        }
    }
}

Example

# Delete test index
#
curl -H "Content-Type: application/json" -X DELETE 'http://localhost:9200/lemmagen-test'

# Create index with lemmagen filter
#
curl -H "Content-Type: application/json" -X PUT 'http://localhost:9200/lemmagen-test' -d '{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "lemmagen_filter_en": {
            "type": "lemmagen",
            "lexicon": "en"
          }
        },
        "analyzer": {
          "lemmagen_en": {
            "type": "custom",
            "tokenizer": "uax_url_email",
            "filter": [
              "lemmagen_filter_en"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "lemmagen_en"
      }
    }
  }
}'

# Try it using _analyze api
#
curl -H "Content-Type: application/json" -X GET 'http://localhost:9200/lemmagen-test/_analyze?pretty' -d '{
  "text": "I am late.",
  "analyzer": "lemmagen_en"
}'

# RESPONSE:
#
# {
#   "tokens" : [
#     {
#       "token" : "I",
#       "start_offset" : 0,
#       "end_offset" : 1,
#       "type" : "<ALPHANUM>",
#       "position" : 0
#     },
#     {
#       "token" : "be",
#       "start_offset" : 2,
#       "end_offset" : 4,
#       "type" : "<ALPHANUM>",
#       "position" : 1
#     },
#     {
#       "token" : "late",
#       "start_offset" : 5,
#       "end_offset" : 9,
#       "type" : "<ALPHANUM>",
#       "position" : 2
#     }
#   ]
# }

# Index document
#
curl -H "Content-Type: application/json" -XPUT 'http://localhost:9200/lemmagen-test/_doc/1?refresh=wait_for' -d '{
    "user"         : "tester",
    "published_at" : "2013-11-15T14:12:12",
    "text"         : "I am late."
}'

# Search
#
curl -H "Content-Type: application/json" -X GET 'http://localhost:9200/lemmagen-test/_search?pretty' -d '{
  "query" : {
    "match" : {
      "text" : "is"
    }
  }
}'

# RESPONSE
#
# {
#   "took" : 2,
#   "timed_out" : false,
#   "_shards" : {
#     "total" : 5,
#     "successful" : 5,
#     "skipped" : 0,
#     "failed" : 0
#   },
#   "hits" : {
#     "total" : 1,
#     "max_score" : 0.2876821,
#     "hits" : [
#       {
#         "_index" : "lemmagen-test",
#         "_type" : "message",
#         "_id" : "1",
#         "_score" : 0.2876821,
#         "_source" : {
#           "user" : "tester",
#           "published_at" : "2013-11-15T14:12:12",
#           "text" : "I am late."
#         }
#       }
#     ]
#   }
# }

NOTE: lemmagen token filter doesn't lowercase. If you want your tokens to be lowercased, add lowercase token filter into your analyzer filters.

# Create index with lemmagen and lowercase filter
#
curl -H "Content-Type: application/json" -X PUT 'http://localhost:9200/lemmagen-lowercase-test' -d '{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "lemmagen_filter_en": {
            "type": "lemmagen",
            "lexicon": "en"
          }
        },
        "analyzer": {
          "lemmagen_lowercase_en": {
            "type": "custom",
            "tokenizer": "uax_url_email",
            "filter": [ "lemmagen_filter_en", "lowercase" ]
          }
        }
      }
    }
  },
  "mappings" : {
    "message" : {
      "properties" : {
        "text" : { "type" : "text", "analyzer" : "lemmagen_lowercase_en" }
      }
    }
  }
}'

Development

To copy dependencies located in lib directory to you local maven repository (~/.m2) run:

mvn initialize

and to create plugin package run following:

mvn package

After that build should be located in ./target/releases.

Credits

LemmaGen team for original C++, C# implementation

Michal Hlaváč for Java implementation of LemmaGen

License

All source codes are licensed under Apache License, Version 2.0.

Copyright 2018 Vojtěch Hýža <http://vhyza.eu>

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
You can’t perform that action at this time.