Fast Cosine Similarity ElasticSearch Plugin

Update - Since this plugin was originally written many years ago, ElasticSearch and OpenSearch have deployed native ANN/KNN search capabilities within the product via hnswlib which is an even faster implementation.

Fast Cosine Similarity ElasticSearch Plugin

Extremely fast vector scoring on ElasticSearch 6.4.x+ using vector embeddings.

Fast Elasticsearch Vector Scoring

This Plugin allows you to score Elasticsearch documents based on embedding-vectors, using dot-product or cosine-similarity. Note, this is a linear search approach in its current version. For very large data sets, this is likely not a good choice for realtime search queries.

Depending on cluster size this plugin will likely stop being performant around +1 Million docs.

ElasticSearch is working on some native vector functionality, however, their current solution uses Locality Sensitive Hashing (LSH) approaches. That is to say, their current beta releases all result in approximate results with false-negatives (not all matching records found). If you need exact results, this new feature from the ES team will not be useful to you.

We are working on a solution to close this false-negative gap common with LSH, however we are not communicating a timeline on the delivery of this solution at this time.

Until that time, if you have a large corpus, we would recommend checking out the FAISS (pronounced: Face) project.

General

This plugin was ported from This elasticsearch 5.x vector scoring plugin and this discussion and lior-k's original contribution for ElasticSearch 5.5+ to achieve lightning fast result times when searching across millions of documents.
This port is for ElasticSearch 6.4+ utilizing the ScoreScript class which was officially split from SearchScript and thus incompatible < 6.4.x

Improvements

lior-k's implementation had some confusing variable assignments that did not consistently match with Cosine-Sim's mathematical definition. This has been updated in the code to more accurately reflect the mathematical definition.
Null pointer exceptions are now skipped (e.g. a document doesn't have a vector to compare against) allowing queries to complete successfully even in sparse datasets.
Ported for latest version of ElasticSearch.
Issues and Pull-Requests welcomed!

Elasticsearch version

Currently designed for Elasticsearch 6.4.x+
Plugin is NOT backwards compatible (see note above about ScoreScript class)
Will succesfully build for 6.4.0 and 6.4.1 (latest). Simply modify pom.xml with the correct version then follow maven build steps below.

Maven Build Steps

Clone the project
mvn package to compile the plugin as a zip file
In Elasticsearch run elasticsearch-plugin install file:/PATH_TO_ZIP to install plugin

Why embeddings?

Ultimately, by defining the field mapping as a binary value, by storing an embedded version of the vector you are able to take advantage of Lucene's direct API to achieve direct byte access without transformation.
When creating the document, Lucene encodes the embedding directly to binary, making read access blazing fast on the search side.
Does Lucene do the same with non-embedded vectors? Unsure, but the plugin supports that too if you want to store in [1.2934, -2.0349, ...., .039] format and try!

Usage

Documents

Each document you score should have a field containing the base64 representation of your vector. for example:

   {
   	"_id": 1,
   	....
   	"embeddedVector": "v7l48eAAAAA/s4VHwAAAAD+R7I5AAAAAv8MBMAAAAAA/yEI3AAAAAL/IWkeAAAAAv7s480AAAAC/v6DUgAAAAL+wJi0gAAAAP76VqUAAAAC/sL1ZYAAAAL/dyq/gAAAAP62FVcAAAAC/tQRvYAAAAL+j6ycAAAAAP6v1KcAAAAC/bN5hQAAAAL+u9ItAAAAAP4ckTsAAAAC/pmkjYAAAAD+cYpwAAAAAP5renEAAAAC/qY0HQAAAAD+wyYGgAAAAP5WrCcAAAAA/qzjTQAAAAD++LBzAAAAAP49wNKAAAAC/vu/aIAAAAD+hqXfAAAAAP4FfNCAAAAA/pjC64AAAAL+qwT2gAAAAv6S3OGAAAAC/gfMtgAAAAD/If5ZAAAAAP5mcXOAAAAC/xYAU4AAAAL+2nlfAAAAAP7sCXOAAAAA/petBIAAAAD9soYnAAAAAv5R7X+AAAAC/pgM/IAAAAL+ojI/gAAAAP2gPz2AAAAA/3FonoAAAAL/IHg1AAAAAv6p1SmAAAAA/tvKlQAAAAD/I2OMAAAAAP3FBiCAAAAA/wEd8IAAAAL94wI9AAAAAP2Y1IIAAAAA/rnS4wAAAAL9vriVgAAAAv1QxoCAAAAC/1/qu4AAAAL+inZFAAAAAv7aGA+AAAAA/lqYVYAAAAD+kNP0AAAAAP730BiAAAAA="
   }

Use this field mapping:

      "embeddedVector": {
        "type": "binary",
        "doc_values": true
      }

The vector can be of any dimension

Converting a vector to Base64

to convert an array of doubles to a base64 string we use these example methods:

Java

public static final String convertArrayToBase64(double[] array) {
	final int capacity = 8 * array.length;
	final ByteBuffer bb = ByteBuffer.allocate(capacity);
	for (int i = 0; i < array.length; i++) {
		bb.putDouble(array[i]);
	}
	bb.rewind();
	final ByteBuffer encodedBB = Base64.getEncoder().encode(bb);
	return new String(encodedBB.array());
}

public static double[] convertBase64ToArray(String base64Str) {
	final byte[] decode = Base64.getDecoder().decode(base64Str.getBytes());
	final DoubleBuffer doubleBuffer = ByteBuffer.wrap(decode).asDoubleBuffer();

	final double[] dims = new double[doubleBuffer.capacity()];
	doubleBuffer.get(dims);
	return dims;
}

Python

import base64
import numpy as np

dbig = np.dtype('>f8')

def decode_float_list(base64_string):
    bytes = base64.b64decode(base64_string)
    return np.frombuffer(bytes, dtype=dbig).tolist()

def encode_array(arr):
    base64_str = base64.b64encode(np.array(arr).astype(dbig)).decode("utf-8")
    return base64_str

Go

import(
    "math"
    "encoding/binary"
    "encoding/base64"
)
func convertArrayToBase64(array []float64) string {
	bytes := make([]byte, 0, 8*len(array))
	for _, a := range array {
		bits := math.Float64bits(a)
		b := make([]byte, 8)
		binary.BigEndian.PutUint64(b, bits)
		bytes = append(bytes, b...)
	}
 	encoded := base64.StdEncoding.EncodeToString(bytes)
	return encoded
}
func convertBase64ToArray(base64Str string) ([]float64, error) {
	decoded, err := base64.StdEncoding.DecodeString(base64Str)
	if err != nil {
		return nil, err
	}
 	length := len(decoded)
	array := make([]float64, 0, length/8)
 	for i := 0; i < len(decoded); i += 8 {
		bits := binary.BigEndian.Uint64(decoded[i : i+8])
		f := math.Float64frombits(bits)
		array = append(array, f)
	}
	return array, nil
}

Querying

Querying with encodings

Query for documents based on their cosine similarity:

For ES 6.4.x:

{
  "query": {
    "function_score": {
    "boost_mode" : "replace",
        "functions": [
          {
            "script_score": {
              "script": {
                  "source": "staysense",
                  "lang" : "fast_cosine",
                  "params": {
                      "field": "embeddedVector",
                      "cosine": true,
                      "encoded_vector" : "v+kopYAAAAA/wivkYAAAAD+wfJeAAAAAv8DL4QAAAAA/waYiwAAAAL+zAmvAAAAAv8c+aiAAAAC/07MyQAAAAL+ccr9AAAAAP9feCOAAAAC/y+ivYAAAAL/R34XgAAAAv+G8nuAAAAA/09hlwAAAAL/MkSWAAAAAP9EXn4AAAAC/zBBxYAAAAD/UY+3AAAAAP7zQSkAAAAC/zRijgAAAAA=="
                  }
              }
            }
          }
        ]
    }
  }
}

The example above shows a vector of 64 dimensions
Parameters:
1. field: The document field containing the base64 vector to compare against.
2. cosine: Boolean. if true - use cosine-similarity, else use dot-product.
3. encoded_vector: The encoded vector to compare to.

Querying with vectors

Query for documents based on their cosine similarity:

For ES 6.4.x:

{
  "query": {
    "function_score": {
    "boost_mode" : "replace",
        "functions": [
          {
            "script_score": {
              "script": {
                  "source": "staysense",
                  "lang" : "fast_cosine",
                  "params": {
                      "field": "embeddedVector",
                      "cosine": true,
                      "vector" : [
                      -0.09217305481433868, 0.010635560378432274, -0.02878434956073761, ... , 0.08279753476381302
                      ]
                  }
              }
            }
          }
        ]
    }
  }
}

The example above shows a vector of 64 dimensions
Parameters:
1. field: The document field containing the base64 vector to compare against.
2. cosine: Boolean. if true - use cosine-similarity, else use dot-product.
3. vector: The comma separated non-encoded vector to compare to.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
src/main		src/main
target		target
pom.xml		pom.xml
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fast Cosine Similarity ElasticSearch Plugin

Fast Elasticsearch Vector Scoring

General

Improvements

Elasticsearch version

Maven Build Steps

Why embeddings?

Usage

Documents

Converting a vector to Base64

Querying

Querying with encodings

Querying with vectors

About

Releases

Packages

Languages

wmelton/fast-cosine-similarity

Folders and files

Latest commit

History

Repository files navigation

Fast Cosine Similarity ElasticSearch Plugin

Fast Elasticsearch Vector Scoring

General

Improvements

Elasticsearch version

Maven Build Steps

Why embeddings?

Usage

Documents

Converting a vector to Base64

Querying

Querying with encodings

Querying with vectors

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages