Notes on HNSWLib binding for Elixir

Semantic search example on how to use HNSWLib

The code below can be run in an IEx session or in a Livebook.

elixir hnswlib.exs

We use the small model sentence-transformers/paraphrase-MiniLM-L6-v2 from Hugging Face to compute embeddings from text and run a semantic search.

This model produces embeddings of dimension 384.

Dependencies

Mix.install([
  {:bumblebee, "~> 0.5.0"},
  {:exla, "~> 0.7.0"},
  {:nx, "~> 0.7.0"},
  {:hnswlib, "~> 0.1.5"}
])

Nx.global_default_backend(EXLA.Backend)

Instantiate the hnswlib index

Metric

You need to endow the vector space with a metric by setting the space argument to one of:

[:l2, :ip, :cosine]

The first is the standard Euclidean metric, the second the inner product, and the third the pseudo-metric "cosine similarity".

We set the dimension to 384 and first use the :l2 metric to build the hnswlib index:

{:ok, index} =
  HNSWLib.Index.new(
    _space = :l2,
    _dim = 384,
    _max_elements = 200
  )

Nx.Serving

We build the Nx.Serving for our model: it downloads the model files from Hugging Face.

transformer = "sentence-transformers/paraphrase-MiniLM-L6-v2"

{:ok, %{model: _, params: _} = model_info} =
      Bumblebee.load_model({:hf, transformer})

{:ok, tokenizer} =
  Bumblebee.load_tokenizer({:hf, transformer})

serving =
  Bumblebee.Text.TextEmbedding.text_embedding(
    model_info,
    tokenizer,
    defn_options: [compiler: EXLA]
  )

Compute embeddings and add to the index

We check that our index is instantiated and empty:

HNSWLib.Index.get_current_count(index)
# {:ok, 0}

We compute our first embedding for the word "short":

input = "short"
# you compute the embedding
%{embedding: data} =
    Nx.Serving.run(serving, input)

You get:

%{
  embedding: #Nx.Tensor<
    f32[384]
    [-0.013410531915724277, 0.07099384069442749, -0.013070221990346909,...]
  >
}

You then append the embedding to your Index:

:ok = HNSWLib.Index.add_items(index, data)

HNSWLib.Index.save_index(index, "my_index.bin")

HNSWLib.Index.get_current_count(index)
# {:ok, 1}

You should see a file "my_index.bin" in your current directory.

When you append entries one by one, you can get the current count of the index with:

HNSWLib.Index.get_current_count(index)

This means you can use this count to uniquely identify an item when you persist the index.

You can also add items in batches. In that case you only get back the final count, so you may need to persist the embeddings (or a mapping from position to input) yourself if you want to identify each input; see the sketch below.
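As an illustration, here is a minimal sketch (not part of the library, with hypothetical inputs) that keeps an external map from index position to the original text while adding items one by one. It assumes the serving and index defined above.

# hypothetical inputs used only for illustration
inputs = ["first sentence", "second sentence"]

lookup =
  Enum.reduce(inputs, %{}, fn text, acc ->
    %{embedding: emb} = Nx.Serving.run(serving, text)
    # the next item receives the label equal to the current count
    {:ok, position} = HNSWLib.Index.get_current_count(index)
    :ok = HNSWLib.Index.add_items(index, emb)
    Map.put(acc, position, text)
  end)

# lookup now maps each label returned by a knn_query back to its text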

Let's enter another entry:

input = "tall"
# you get an embedding
%{embedding: data} =
    Nx.Serving.run(serving, input)

# you add the embedding to the index
:ok = HNSWLib.Index.add_items(index, data)

HNSWLib.Index.save_index(index, "my_index.bin")

HNSWLib.Index.get_current_count(index)
# {:ok, 2}

KNN search

You now run a knn_query from a text input - converted into an embedding - to look for the closest element present in the Index.

Let's find the closest item in the Index to the input "small". We expect to get "short", the first item.

input = "small"
# you compute the embedding of your query
%{embedding: query_data} =
  Nx.Serving.run(serving, input)

{:ok, labels, _d} =
    HNSWLib.Index.knn_query(
      index,
      query_data,
      k: 1
    )

You should get:

{:ok,
 #Nx.Tensor<
   u64[1][1]
   EXLA.Backend<host:0, 0.968243412.4269146128.215737>
   [
     [0]
   ]
 >,
 #Nx.Tensor<
   f32[1][1]
   EXLA.Backend<host:0, 0.968243412.4269146128.215739>
   [
     [0.2972676455974579]
   ]
 >}

This means that the nearest neighbour of the given input has the label 0 in the Index. This corresponds to the first entry, "short".

We can recover the embedding to compare:

{:ok, data} =
  HNSWLib.Index.get_items(
    index,
    Nx.to_flat_list(labels[0])
  )

hd(data) |> Nx.from_binary(:f32) |> Nx.stack()

The result is:

#Nx.Tensor<
  f32[1][384]
  EXLA.Backend<host:0, 0.968243412.4269146128.215745>
  [
    [-0.013410531915724277, 0.07099384069442749, -0.013070221990346909,...]
  ]
>
As expected, we recovered the first embedding.
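If you want to check this programmatically, here is a quick illustrative comparison (assuming data from the get_items call and serving are still in scope):

recovered = hd(data) |> Nx.from_binary(:f32)
%{embedding: original} = Nx.Serving.run(serving, "short")

# should return 1: the recovered embedding matches the freshly computed one
Nx.all_close(recovered, original) |> Nx.to_number()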

Change the norm

The model has been trained with the :cosine metric, so we now use it.

{:ok, index} =
  HNSWLib.Index.new(
    _space = :cosine,
    _dim = 384,
    _max_elements = 200
  )

We get the embedding:

#Nx.Tensor<
  f32[384]
  [-0.013410531915724277, 0.07099384069442749, -0.013070221990346909,...]
>

When we run the knn search, we again find the same "nearest neighbour", with of course a different distance.

{:ok,
 #Nx.Tensor<
   u64[1][1]
   EXLA.Backend<host:0, 0.905715445.2882404368.207646>
   [
     [0]
   ]
 >,
 #Nx.Tensor<
   f32[1][1]
   EXLA.Backend<host:0, 0.905715445.2882404368.207647>
   [
     [0.06562089920043945]
   ]
 >}

The recovered embedding is however different:

#Nx.Tensor<
  f32[1][384]
  EXLA.Backend<host:0, 0.905715445.2882404368.207652>
  [
    [-0.008871854282915592, 0.04696659371256828, -0.00864671915769577,...]
  ]
>

The reason is that the "sentence-transformers" model uses different settings than Bumblebee's defaults:

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

It performs mean-token pooling and normalises the vectors (see the sentence_transformers.models.Pooling and sentence_transformers.models.Normalize classes).

This blog post confirms this.
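To make this concrete, here is an illustrative Nx sketch of what mean-token pooling followed by L2 normalisation does. It is not the Bumblebee internals: the hidden state shape and values are assumptions, and the attention mask is ignored.

# hypothetical hidden state of shape {batch, seq_len, dim} = {1, 7, 384}
hidden_state = Nx.broadcast(0.5, {1, 7, 384})

# mean pooling: average over the token axis, {1, 7, 384} -> {1, 384}
pooled = Nx.mean(hidden_state, axes: [1])

# L2 normalisation: divide each embedding by its norm so it becomes a unit vector
norms = pooled |> Nx.pow(2) |> Nx.sum(axes: [1], keep_axes: true) |> Nx.sqrt()
normalised = Nx.divide(pooled, norms)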

To recover the same embedding, we can either:

  • normalise the tensors ourselves with the following transformation:
%{embedding: t_small} =
    Nx.Serving.run(serving, "short")

n_small =
    Nx.divide(t_small, Nx.LinAlg.norm(t_small))

HNSWLib.Index.add_items(index, n_small)
  • or rebuild the serving so that Bumblebee performs the mean pooling and L2 normalisation for us:

serving =
    Bumblebee.Text.TextEmbedding.text_embedding(
        model_info,
        tokenizer,
        defn_options: [compiler: EXLA],
        embedding_processor: :l2_norm,
        output_pool: :mean_pooling,
        output_attribute: :hidden_state
    )

When we just normalise the vectors, we recover the exact same vector. When we change the Bumblebee settings, the recovered embedding is almost identical:

[-0.03144508972764015, 0.12630629539489746, 0.018703171983361244,...]

Notes on vector spaces

A vector space of embeddings can be equipped with a (Euclidean) inner product. If $u = (u_1, \dots, u_n)$ and $v = (v_1, \dots, v_n)$ are two embeddings, the (Euclidean) inner product is defined as:

$$\langle u, v\rangle = u_1 v_1 + \cdots + u_n v_n$$

This inner product induces a Euclidean norm:

$$\lVert u\rVert = \sqrt{\langle u, u\rangle} = \sqrt{u_1^2 + \cdots + u_n^2}$$

Let $u_v$ be the perpendicular projection of $u$ on $v$. Then:

$$\langle u, v\rangle = \langle u_v, v\rangle = \lVert u\rVert\,\lVert v\rVert \cos\widehat{u,v}$$

The value below is known as the cosine similarity:

$$\left\langle \frac{u}{\lVert u\rVert}, \frac{v}{\lVert v\rVert}\right\rangle = \cos\widehat{u,v}$$

You will remark that the norm of any embedding $\frac{1}{\lVert u\rVert}u$ is 1. We say that such an embedding is $L_2$-normalised.

The previous formula shows that the inner product of normalised (aka unit) embeddings is the cosine of the angle between these "normalised" embeddings.

Source: https://en.wikipedia.org/wiki/Cosine_similarity

Note that the cosine similarity is not a distance.

The norm in turn induces a distance: $d(u, v) = \lVert u - v\rVert$.

By definition, $\lVert u - v\rVert^2 = \langle u - v, u - v\rangle$.

Expanding, we obtain:

$$\lVert u - v\rVert^2 = \lVert u\rVert^2 + \lVert v\rVert^2 - 2\langle u, v\rangle$$

Consider now two normalised vectors. We have:

$$\frac{1}{2}\lVert u - v\rVert^2 = 1 - \cos\widehat{u,v} = d_c(u, v)$$

This is commonly known as the cosine distance when the embeddings are normalised. It ranges from 0 to 2. Note that it is not a true distance metric.
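As a quick numerical illustration (an assumed example using Nx directly, not part of the walkthrough above), we can check the identity $\frac{1}{2}\lVert u - v\rVert^2 = 1 - \cos\widehat{u,v}$ on two normalised vectors:

u = Nx.tensor([1.0, 2.0, 3.0])
v = Nx.tensor([3.0, 1.0, 0.0])

# L2-normalise both vectors
u = Nx.divide(u, Nx.LinAlg.norm(u))
v = Nx.divide(v, Nx.LinAlg.norm(v))

half_squared_distance =
  Nx.subtract(u, v) |> Nx.LinAlg.norm() |> Nx.pow(2) |> Nx.divide(2) |> Nx.to_number()

one_minus_cosine =
  Nx.subtract(1, Nx.dot(u, v)) |> Nx.to_number()

# the two numbers agree up to floating-point error
{half_squared_distance, one_minus_cosine}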

Finally, note that since we are dealing with finite-dimensional vector spaces, all norms are equivalent (in a precise mathematical sense). This means that the limit points are always the same. However, the distance values can be quite different, and a clustering process can give significantly different results.