Elasticsearch term position similarity plugin

Elasticsearch custom similarity plugin to calculate score based on term position and payload so that terms closer to the beginning of a field have higher scores.

Build

./gradlew clean assemble

Note that Elasticsearch 7.x requires Java 12.

Install

Run ./scripts/install-plugin.sh

Restart Elasticsearch.

Examples

Run ./examples/position-similarity.sh

Advanced Scoring with Elasticsearch Similarity Plugins

What are Plugins

Plugins are a way to enhance the core Elasticsearch functionality in a custom manner.

https://www.elastic.co/guide/en/elasticsearch/plugins/current/intro.html

What is Similarity

A similarity (scoring/ranking model) defines how matching documents are scored. Similarity is per field, meaning that via the mapping one can define a different similarity per field.

Configuring a custom similarity is considered an expert feature, and the built-in similarities are most likely sufficient.

https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html

BM25 Similarity Scoring Formula

BM25 is the default similarity in Elasticsearch 7.x.

score(q,d) =
  ∑ (
      (k1 + 1)
    · idf(t)
    · tf(t in d) / [ tf(t in d) + k1 · (1 - b + b · document_length / avg(document_length)) ]
    ) (t in q)

Let's index some documents, run a match query and look at the explanation.

Create Elasticsearch Index

curl --header "Content-Type:application/json" -s -XDELETE "http://localhost:9200/test_index"

curl --header "Content-Type:application/json" -s -XPUT "http://localhost:9200/test_index" -d '
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "similarity": {
        "default": {
          "type": "BM25"
        }
      }
    }
  }
}
'

Create Mapping

curl --header "Content-Type:application/json" -XPUT 'localhost:9200/test_index/_mapping' -d '
{
  "properties": {
    "field1": {
      "type": "text"
    }
  }
}
'

Index Documents

curl --header "Content-Type:application/json" -s -XPUT "localhost:9200/test_index/_doc/1" -d '
{"field1" : "bar foo"}
'

curl --header "Content-Type:application/json" -s -XPUT "localhost:9200/test_index/_doc/2" -d '
{"field1" : "foo bar bar"}
'

curl --header "Content-Type:application/json" -s -XPUT "localhost:9200/test_index/_doc/3" -d '
{"field1" : "bar bar foo foo"}
'

curl --header "Content-Type:application/json" -s -XPOST "http://localhost:9200/test_index/_refresh"

| doc id | foo freq | doc length |
|--------|----------|------------|
| 1      | 1        | 2          |
| 2      | 1        | 3          |
| 3      | 2        | 4          |

Match Query

curl --header "Content-Type:application/json" -s "localhost:9200/test_index/_search?pretty=true" -d '
{
  "query": {
    "match": {
      "field1": "foo"
    }
  }
}
'

Match Query Results

{
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.16786805,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.16786805,
        "_source" : {
          "field1" : "bar bar foo foo"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.1546153,
        "_source" : {
          "field1" : "bar foo"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.13353139,
        "_source" : {
          "field1" : "foo bar bar"
        }
      }
    ]
  }
}

Document 3 has the highest score because it has a higher foo frequency than Document 1 and Document 2.
Documents 1 and 2 have the same foo frequency, but Document 1 has fewer terms, so it scores higher than Document 2.

Match Query Explanation

curl --header "Content-Type:application/json" -s "localhost:9200/test_index/_search?pretty=true" -d '
{
  "explain": true,
  "query": {
    "match": {
      "field1": "foo"
    }
  }
}
'

Note that the explanation is part of the Lucene API. The doc mentioned in the explanation is a Lucene document id and has nothing to do with the Elasticsearch _id field.

{
  "_explanation": {
    "value": 0.16786805,
    "description": "weight(field1:foo in 2) [PerFieldSimilarity], result of:",
    "details": [
      {
        "value": 0.16786805,
        "description": "score(freq=2.0), product of:",
        "details": [
          {
            "value": 2.2,
            "description": "boost",
            "details": []
          },
          {
            "value": 0.13353139,
            "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
            "details": [
              {
                "value": 3,
                "description": "n, number of documents containing term",
                "details": []
              },
              {
                "value": 3,
                "description": "N, total number of documents with field",
                "details": []
              }
            ]
          },
          {
            "value": 0.5714286,
            "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
            "details": [
              {
                "value": 2,
                "description": "freq, occurrences of term within document",
                "details": []
              },
              {
                "value": 1.2,
                "description": "k1, term saturation parameter",
                "details": []
              },
              {
                "value": 0.75,
                "description": "b, length normalization parameter",
                "details": []
              },
              {
                "value": 4,
                "description": "dl, length of field",
                "details": []
              },
              {
                "value": 3,
                "description": "avgdl, average length of field",
                "details": []
              }
            ]
          }
        ]
      }
    ]
  }
}
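The values in this explanation can be checked by hand. The following is a minimal Python sketch, not part of the plugin, that plugs the numbers from the explanation (k1 = 1.2, b = 0.75, N = n = 3) and the foo frequencies and field lengths from the table above into the BM25 formula. The function and variable names are illustrative only.

```python
import math

K1 = 1.2   # term saturation parameter, as reported in the explanation
B = 0.75   # length normalization parameter, as reported in the explanation

def bm25_idf(n, N):
    """idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq, dl, avgdl):
    """tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + K1 * (1 - B + B * dl / avgdl))

def bm25_score(freq, dl, avgdl, n, N):
    """Single-term BM25 score: boost * idf * tf, with boost = k1 + 1."""
    return (K1 + 1) * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)

# _id -> (foo frequency, field length), taken from the table above
docs = {"1": (1, 2), "2": (1, 3), "3": (2, 4)}
avgdl = sum(dl for _, dl in docs.values()) / len(docs)

# All three documents contain foo, so n = N = 3
scores = {doc_id: bm25_score(freq, dl, avgdl, n=3, N=3)
          for doc_id, (freq, dl) in docs.items()}

for doc_id in sorted(scores, key=scores.get, reverse=True):
    print(doc_id, scores[doc_id])
```

The computed scores should agree with the _score values in the match query results up to single-precision rounding, and the ordering should be Document 3, then 1, then 2.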

We Need a Better Score

The default scoring model works well, but the best scoring model will always be application specific. Let's say that we want to score documents based on the position of a matching term. For our example, we want to score Document 2 higher than Documents 1 and 3, because foo appears first in Document 2.

Similarity Plugins

Similarity plugins extend Elasticsearch with new similarities (scoring/ranking models).

Several steps are necessary to implement a scoring plugin that uses term positions and payloads and ignores term frequency, inverse document frequency and normalization.

TODO: Needs explanation

Build and Install Plugin

git clone -b 7.0.0 https://github.com/sdauletau/elasticsearch-position-similarity.git elasticsearch-position-similarity

cd elasticsearch-position-similarity

./gradlew clean assemble

/usr/local/opt/elasticsearch-7.0.0/bin/elasticsearch-plugin install file:///`pwd`/build/distributions/elasticsearch-position-similarity-7.0.0.zip

IMPORTANT: Restart Elasticsearch.

Create Elasticsearch Index

curl --header "Content-Type:application/json" -s -XDELETE "http://localhost:9200/test_index"

curl --header "Content-Type:application/json" -s -XPUT "http://localhost:9200/test_index" -d '
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "similarity": {
        "default": {
          "type": "BM25"
        }
      }
    },
    "analysis": {
      "analyzer": {
        "positionPayloadAnalyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "positionPayloadFilter"
          ]
        }
      },
      "filter": {
        "positionPayloadFilter": {
          "delimiter": "|",
          "encoding": "int",
          "type": "delimited_payload"
        }
      }
    }
  }
}
'
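To make the analyzer settings above more concrete, here is a rough Python sketch of what the positionPayloadAnalyzer chain does to a field value: split on whitespace, lowercase, then cut each token at the first "|" and keep the integer after it as the payload. This is an illustration, not the Lucene implementation. The asciifolding step is omitted, and the 4-byte big-endian layout used for the "int" encoding is an assumption about how the payload bytes are stored.

```python
import struct

def split_delimited_payloads(text, delimiter="|"):
    """Return (term, payload_bytes) pairs for a delimited-payload field value.

    Approximates: whitespace tokenizer -> lowercase -> delimited_payload.
    The asciifolding filter is left out of this sketch.
    """
    pairs = []
    for token in text.split():                      # whitespace tokenizer
        token = token.lower()                       # lowercase filter
        term, _, raw = token.partition(delimiter)   # split at the first delimiter
        # Assumed encoding: 4-byte big-endian signed integer
        payload = struct.pack(">i", int(raw)) if raw else None
        pairs.append((term, payload))
    return pairs

for term, payload in split_delimited_payloads("bar|0 foo|1"):
    print(term, payload)
```

The point of the sketch is that the indexed term is foo, not foo|1: the part after the delimiter travels with the term as a payload and is what a position-aware score can read back at query time.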

Create Mapping

curl --header "Content-Type:application/json" -XPUT 'localhost:9200/test_index/_mapping' -d '
{
  "properties": {
    "field1": {
      "type": "text"
    },
    "field2": {
      "type": "text",
      "term_vector": "with_positions_offsets_payloads",
      "analyzer": "positionPayloadAnalyzer"
    }
  }
}
'

Index Documents

curl --header "Content-Type:application/json" -s -XPUT "localhost:9200/test_index/_doc/1" -d '
{"field1" : "bar foo", "field2" : "bar|0 foo|1"}
'

curl --header "Content-Type:application/json" -s -XPUT "localhost:9200/test_index/_doc/2" -d '
{"field1" : "foo bar bar", "field2" : "foo|0 bar|1 bar|3"}
'

curl --header "Content-Type:application/json" -s -XPUT "localhost:9200/test_index/_doc/3" -d '
{"field1" : "bar bar foo foo", "field2" : "bar|0 bar|1 foo|2 foo|3"}
'

curl --header "Content-Type:application/json" -s -XPOST "http://localhost:9200/test_index/_refresh"

| doc id | foo freq | doc length | foo position |
|--------|----------|------------|--------------|
| 1      | 1        | 2          | 1            |
| 2      | 1        | 3          | 0            |
| 3      | 2        | 4          | 2            |
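The field2 values follow a simple pattern: each whitespace-separated token of field1 is suffixed with the delimiter and its zero-based position. A small helper like the one below could generate them on the client side before indexing. The helper name is hypothetical and is not part of the plugin.

```python
def with_position_payloads(text, delimiter="|"):
    """Append a zero-based position payload to every whitespace-separated token."""
    return " ".join(
        f"{token}{delimiter}{position}"
        for position, token in enumerate(text.split())
    )

print(with_position_payloads("bar foo"))
print(with_position_payloads("bar bar foo foo"))
```

For "foo bar bar" this helper produces bar|2 for the last token, whereas the indexed Document 2 above uses bar|3. The difference does not affect a query for foo.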

Match Query

curl --header "Content-Type:application/json" -s "localhost:9200/test_index/_search?pretty=true" -d '
{
  "query": {
    "position_match": {
      "query": {
        "match": {
          "field2": "foo"
        }
      }
    }
  }
}
'

Match Query Results

{
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "field1" : "foo bar bar",
          "field2" : "foo|0 bar|1 bar|3"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.8333333,
        "_source" : {
          "field1" : "bar foo",
          "field2" : "bar|0 foo|1"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.71428573,
        "_source" : {
          "field1" : "bar bar foo foo",
          "field2" : "bar|0 bar|1 foo|2 foo|3"
        }
      }
    ]
  }
}

Document 2 has the highest score because the term foo has the lowest position (0) in it.

Match Query Explanation

curl --header "Content-Type:application/json" -s "localhost:9200/test_index/_search?pretty=true" -d '
{
  "explain": true,
  "query": {
    "position_match": {
      "query": {
        "match": {
          "field2": "foo"
        }
      }
    }
  }
}
'

Note that the explanation is part of the Lucene API. The doc mentioned in the explanation is a Lucene document id and has nothing to do with the Elasticsearch _id field.

{
  "_shard": "[test_index][0]",
  "_node": "Raak6LCoRluN_7MJpzKDJA",
  "_index": "test_index",
  "_type": "_doc",
  "_id": "2",
  "_score": 1,
  "_source": {
    "field1": "foo bar bar",
    "field2": "foo|0 bar|1 bar|3"
  },
  "_explanation": {
    "value": 1,
    "description": "score(doc=1), sum of:",
    "details": [
      {
        "value": 1,
        "description": "score(field=field2, term=foo, pos=0, func=5/(5+0))",
        "details": []
      }
    ]
  }
}
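The func=5/(5+0) entry in the explanation suggests that the score of a matching term is 5 / (5 + pos). The Python sketch below reproduces the three scores from the results under two assumptions that are inferred from the output rather than taken from the plugin source: the constant is 5, and when a term occurs more than once (foo in Document 3) the lowest position is used.

```python
def position_score(positions, k=5.0):
    """Score a term by its earliest position: k / (k + pos).

    k = 5 and "use the lowest position" are assumptions inferred from
    the explain output and the result scores.
    """
    return k / (k + min(positions))

# _id -> positions of foo in field2, taken from the indexed payloads
foo_positions = {"1": [1], "2": [0], "3": [2, 3]}

scores = {doc_id: position_score(pos) for doc_id, pos in foo_positions.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for doc_id in ranking:
    print(doc_id, scores[doc_id])
```

With these assumptions the ranking is Document 2, then 1, then 3, and the values agree with the _score fields above up to single-precision rounding. Term frequency and field length play no part in the result.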
