Skip to content

Files

Latest commit

 

History

History
105 lines (80 loc) · 3.01 KB

analysis-phonetic.asciidoc

File metadata and controls

105 lines (80 loc) · 3.01 KB

Phonetic analysis plugin

The Phonetic Analysis plugin provides token filters which convert tokens to their phonetic representation using Soundex, Metaphone, and a variety of other algorithms.

phonetic token filter

The phonetic token filter takes the following settings:

encoder

Which phonetic encoder to use. Accepts metaphone (default), double_metaphone, soundex, refined_soundex, caverphone1, caverphone2, cologne, nysiis, koelnerphonetik, haasephonetik, beider_morse, daitch_mokotoff.

replace

Whether or not the original token should be replaced by the phonetic token. Accepts true (default) and false. Not supported by beider_morse encoding.

PUT phonetic_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_metaphone"
            ]
          }
        },
        "filter": {
          "my_metaphone": {
            "type": "phonetic",
            "encoder": "metaphone",
            "replace": false
          }
        }
      }
    }
  }
}

GET phonetic_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Joe Bloggs" (1)
}
  1. Returns: J, joe, BLKS, bloggs

It is important to note that "replace": false can lead to unexpected behavior since the original and the phonetically analyzed version are both kept at the same token position. Some queries handle these stacked tokens in special ways. For example, the fuzzy match query does not apply {ref}/common-options.html#fuzziness[fuzziness] to stacked synonym tokens. This can lead to issues that are difficult to diagnose and reason about. For this reason, it is often beneficial to use separate fields for analysis with and without phonetic filtering. That way searches can be run against both fields with differing boosts and trade-offs (e.g. only run a fuzzy match query on the original text field, but not on the phonetic version).

Double metaphone settings

If the double_metaphone encoder is used, then this additional setting is supported:

max_code_len

The maximum length of the emitted metaphone token. Defaults to 4.

Beider Morse settings

If the beider_morse encoder is used, then these additional settings are supported:

rule_type

Whether matching should be exact or approx (default).

name_type

Whether names are ashkenazi, sephardic, or generic (default).

languageset

An array of languages to check. If not specified, then the language will be guessed. Accepts: any, common, cyrillic, english, french, german, hebrew, hungarian, polish, romanian, russian, spanish.