summary: A tutorial on retrieving documents/items with Elasticsearch and vector indexing methods.
id: how-to-create-large-scale-retrieval-system-using-elasticsearch
categories: Pytorch
tags: codelabs
status: Published 
authors: Sparsh A.
Feedback Link: https://github.com/sparsh-ai/reco-tutorials/issues

# Large-scale Document Retrieval using ElasticSearch

<!-- ------------------------ -->
## Introduction
Duration: 5

### What you'll learn?
- Lorem ipsum

### Why is this important?
- Lorem ipsum

### How it will work?
- Lorem ipsum

### Who is this for?
- Lorem ipsum

### Important resources
- Lorem ipsum

<!-- ------------------------ -->
## Understand the Process
Duration: 5

![elasticsearch_process](img/elasticsearch_process.png)

As shown in the chart above, there are two main steps in the embedding-based retrieval system using Elasticsearch:
1. Indexing: documents are first converted to vectors using deep learning models (aka embedding models). They are then indexed and stored on disk by Elasticsearch.
2. Retrieving: a user query is first converted to its vector representation. Elasticsearch then uses this query vector to evaluate the similarity against indexed documents and returns top-scored ones.
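
The two steps can be sketched without Elasticsearch at all. The snippet below is only an illustration under stated assumptions, not the pipeline used later in this tutorial: `toy_embed` is a hypothetical stand-in for a real embedding model (it hashes character trigrams into a fixed-size vector), and the "index" is just a NumPy matrix held in memory.

```python
import zlib
import numpy as np

DIM = 256  # assumed toy dimensionality, chosen only for this sketch


def toy_embed(text, dim=DIM):
    """Hypothetical stand-in for an embedding model: hashed character trigrams."""
    vec = np.zeros(dim, dtype=np.float32)
    padded = "  " + text.lower() + "  "
    for i in range(len(padded) - 2):
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % dim
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def build_index(documents):
    """Indexing: convert every document to a vector and store the matrix."""
    return np.stack([toy_embed(doc) for doc in documents])


def retrieve(query, documents, index, k=3):
    """Retrieving: embed the query, score every document, return the top k."""
    scores = index @ toy_embed(query)  # cosine similarity, vectors are unit-norm
    top = np.argsort(-scores)[:k]
    return [(documents[i], float(scores[i])) for i in top]


docs = ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"]
index = build_index(docs)
results = retrieve("toy story", docs, index, k=2)
print(results)
```

In the rest of the tutorial, the embedding model is USE and the storage and scoring are delegated to Elasticsearch.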

In this tutorial, we will use the Universal Sentence Encoder (USE), a model trained on a large public corpus to represent the semantic meaning of a sentence as a vector. Such pre-trained models usually provide a decent baseline for NLP tasks. In practice, training or fine-tuning an embedding model on the target application's data is usually needed to boost retrieval performance.

<!-- ------------------------ -->
## Setting up Elasticsearch
Duration: 5

### Download Elasticsearch (version 7.11.1)

In [None]:
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.11.1-linux-x86_64.tar.gz
!tar -xzvf elasticsearch-7.11.1-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.11.1

### Prep the elasticsearch server

In [2]:
import os
from subprocess import Popen, PIPE, STDOUT
es_subprocess = Popen(['elasticsearch-7.11.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda : os.setuid(1))

### Create a client connection to the local elasticsearch instance

<aside class="positive">
Wait a few minutes for the local Elasticsearch server to start before running the next cell.
</aside>
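
Instead of waiting a fixed amount of time, one option is to poll the server until it answers. The helper below is a sketch using only the standard library; the names `http_ok` and `wait_until_ready`, the 180-second limit and the 5-second interval are assumptions made for illustration, not part of the original notebook.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, max_wait=180.0, interval=5.0, sleep=time.sleep):
    """Call `probe()` until it returns True or `max_wait` seconds have elapsed."""
    waited = 0.0
    while True:
        if probe():
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval


# Usage (assumes the server started in the previous step listens on localhost:9200):
# wait_until_ready(lambda: http_ok("http://localhost:9200/"))
```

Passing the probe as a callable keeps the waiting logic independent of how readiness is checked.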

In [6]:
!curl -X GET "localhost:9200/"

{
  "name" : "e28d12ba1977",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "af6YoDvTRvGUSRA_omw61Q",
  "version" : {
    "number" : "7.11.1",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "ff17057114c2199c9c1bbecc727003a907c0db7a",
    "build_date" : "2021-02-15T13:44:09.394032Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}


### Install elasticsearch python api

In [None]:
!pip install -q elasticsearch

### Check if elasticsearch server is properly running in the background

In [8]:
from elasticsearch import Elasticsearch, helpers
es_client = Elasticsearch(['localhost'])
es_client.info()

{'cluster_name': 'elasticsearch',
 'cluster_uuid': 'af6YoDvTRvGUSRA_omw61Q',
 'name': 'e28d12ba1977',
 'tagline': 'You Know, for Search',
 'version': {'build_date': '2021-02-15T13:44:09.394032Z',
  'build_flavor': 'default',
  'build_hash': 'ff17057114c2199c9c1bbecc727003a907c0db7a',
  'build_snapshot': False,
  'build_type': 'tar',
  'lucene_version': '8.7.0',
  'minimum_index_compatibility_version': '6.0.0-beta1',
  'minimum_wire_compatibility_version': '6.8.0',
  'number': '7.11.1'}}

<!-- ------------------------ -->
## Embed Movielens Dataset
Duration: 10

### Download MovieLens dataset

In [None]:
!wget https://files.grouplens.org/datasets/movielens/ml-25m.zip --no-check-certificate
!unzip ml-25m.zip

### Read the dataset

In [10]:
import pandas as pd
data = pd.read_csv('ml-25m/movies.csv').drop_duplicates()
data.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


### Download USE - a pre-trained text embedding model

In [11]:
import tensorflow_hub as hub
from timeit import default_timer as timer
import json

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

### Define variables

In [None]:
INDEX_NAME = "movie_title"
BATCH_SIZE = 200
SEARCH_SIZE = 10
MAPPINGS = {
    'mappings': {'_source': {'enabled': 'true'},
                 'dynamic': 'true',
                 'properties': {'title_vector':
                                {'dims': 512, 'type': 'dense_vector'},
                                'movie_id': {'type': 'keyword'},
                                'genres': {'type': 'keyword'}
                                }
                 },
            'settings': {'number_of_replicas': 1, 'number_of_shards':2}
}

### Build index with document vectors

In [12]:
def index_movie_lens(df, num_doc=500):
  print('creating the {} index.'.format(INDEX_NAME))
  es_client.indices.delete(index=INDEX_NAME, ignore=[404])
  es_client.indices.create(index=INDEX_NAME, body=json.dumps(MAPPINGS))

  requests = []
  count = 0
  start = timer()

  for row_index, doc in df.iterrows():

    # cap the number of indexed documents to avoid a long waiting time
    if count + len(requests) >= num_doc:
      break

    # construct the request for the current document
    title_text = doc.title
    genres_text = doc.genres
    title_vector = embed([title_text]).numpy().tolist()[0]

    request = {
        "_op_type": "index",
        "_index": INDEX_NAME,
        "_id": row_index,
        "title": title_text,
        "genres": genres_text,
        "title_vector": title_vector,
        "movie_id": doc.movieId
    }

    requests.append(request)

    # flush a full batch; flushing after the append ensures no row is skipped
    if len(requests) >= BATCH_SIZE:
      helpers.bulk(es_client, requests)
      count += len(requests)
      requests.clear()
      if count % (BATCH_SIZE * 2) == 0:
        print("Indexed {} documents in {:.2f} seconds.".format(count, timer()-start))

  # index the remaining documents, if any
  if requests:
    helpers.bulk(es_client, requests)
    count += len(requests)
  end = timer()

  print("Done indexing {} documents in {:.2f} seconds".format(count, end-start))

In [13]:
index_movie_lens(data, num_doc=2000)

creating the movie_title index.
Indexed 400 documents in 29.64 seconds.
Indexed 800 documents in 52.69 seconds.
Indexed 1200 documents in 74.50 seconds.
Indexed 1600 documents in 96.52 seconds.
Indexed 2000 documents in 118.40 seconds.
Done indexing 2000 documents in 118.41 seconds
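
The batch-and-flush pattern used by `index_movie_lens` can also be isolated into a small generator, which makes the batching easy to check on its own. This is a sketch: `batched` is a hypothetical helper name, and in the tutorial each yielded batch would be passed to `helpers.bulk(es_client, batch)`.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 450 items split into batches of 200: the last batch holds the remainder.
sizes = [len(batch) for batch in batched(range(450), 200)]
print(sizes)  # [200, 200, 50]
```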


### Search with query vector

In [14]:
def return_top_movies(query):

  embedding_start = timer()
  query_vector = embed([query]).numpy().tolist()[0]
  embedding_time = timer() - embedding_start
  formula = "cosineSimilarity(params.query_vector, 'title_vector') + 1.0"

  script_query = {
      "script_score": {
          "query": {"match_all": {}},
          "script": {
              "source": formula,
              "params": {"query_vector": query_vector}
          }
      }
  }

  search_start = timer()
  response = es_client.search(
      index=INDEX_NAME,
      body={
          "size":SEARCH_SIZE,
          "query": script_query,
          "_source": {"includes": ["title", "genres"]}
      }
  )
  search_time = timer() - search_start

  print()
  print("{} total hits.".format(response["hits"]["total"]["value"]))
  
  for hit in response["hits"]["hits"]:

    print("id: {}, score: {}".format(hit["_id"], hit["_score"] - 1))
    print(hit["_source"])
    print()

In [15]:
return_top_movies("war")


2000 total hits.
id: 335, score: 0.5282537
{'genres': 'Adventure|Drama|War', 'title': 'War, The (1994)'}

id: 712, score: 0.43743240000000005
{'genres': 'Documentary', 'title': 'War Stories (1995)'}

id: 1493, score: 0.3954858000000001
{'genres': 'Drama', 'title': 'War at Home, The (1996)'}

id: 1362, score: 0.32700850000000004
{'genres': 'Romance|War', 'title': 'In Love and War (1996)'}

id: 550, score: 0.3104720999999999
{'genres': 'Documentary', 'title': 'War Room, The (1993)'}

id: 1828, score: 0.3056878999999999
{'genres': 'Action|Romance|Sci-Fi|Thriller', 'title': 'Armageddon (1998)'}

id: 1932, score: 0.3055576
{'genres': 'Adventure|Sci-Fi', 'title': 'Dune (1984)'}

id: 1265, score: 0.2961224
{'genres': 'Drama|War', 'title': 'Killing Fields, The (1984)'}

id: 1063, score: 0.2951368999999999
{'genres': 'Drama|War', 'title': 'Platoon (1986)'}

id: 1676, score: 0.2776046999999999
{'genres': 'Comedy', 'title': 'Senseless (1998)'}
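
The script formula adds `1.0` to the cosine similarity because Elasticsearch does not accept negative scores from a `script_score` query, while cosine similarity ranges from -1 to 1. `return_top_movies` subtracts 1 again before printing. The NumPy sketch below mirrors that arithmetic outside Elasticsearch; the function names are made up for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def shifted_score(query_vector, doc_vector):
    """Mirror of the script formula: cosine similarity shifted into [0, 2]."""
    return cosine_similarity(query_vector, doc_vector) + 1.0


query = [1.0, 0.0]
opposite = [-1.0, 0.0]
orthogonal = [0.0, 3.0]

print(shifted_score(query, query))       # 2.0
print(shifted_score(query, orthogonal))  # 1.0
print(shifted_score(query, opposite))    # 0.0
```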



<!-- ------------------------ -->
## Approximate Retrieval
Duration: 2

In the last step, we used a brute-force method (comparing the query vector against every vector in the index) to find similar movies. This gives exact results, but the cost grows linearly with the number of indexed vectors, so it is slow at scale. It will not work for industrial-scale retrieval, where we have to retrieve thousands of matching vectors per user, for millions of users, in near real time. To overcome this challenge, researchers developed a family of techniques called **Approximate Nearest Neighbour (ANN)** search. Instead of exhaustively scanning the full vector space, an ANN index organizes the vectors (with hashes, clusters, or graphs) so that a query examines only a small subset of candidates before returning the top-k neighbours. Accuracy drops slightly, but the gain in retrieval speed is usually worth the tradeoff. Read [this article](https://towardsdatascience.com/comprehensive-guide-to-approximate-nearest-neighbors-algorithms-8b94f057d6b6) to learn more about ANN algorithms.
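
For reference, exhaustive search is easy to write down in NumPy. The sketch below (random data, hypothetical function name `exact_top_k`) computes a distance to every stored vector, which is exactly the per-query work that ANN indexes try to avoid.

```python
import numpy as np


def exact_top_k(query, vectors, k):
    """Exhaustive search: score every vector, then keep the k closest."""
    distances = np.sum((vectors - query) ** 2, axis=1)  # squared L2, O(N * d)
    candidates = np.argpartition(distances, k - 1)[:k]  # unordered k smallest
    order = candidates[np.argsort(distances[candidates])]
    return order, distances[order]


rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 32)).astype(np.float32)
query = vectors[42] + 0.01  # a slightly perturbed copy of vector 42
ids, dists = exact_top_k(query, vectors, k=5)
print(ids[0])  # 42
```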

We'll go through a few common ANN algorithms using the open-source libraries nmslib and faiss:
- Locality-sensitive hashing
- Product quantization with inverted file
- Hierarchical Navigable Small World Graphs

#### Locality-sensitive hashing (LSH)
LSH is a classical hashing technique. Its core idea is to use multiple hash functions to map vectors to binary codes, so that closely related vectors are expected to be hashed to the same codes.
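
One common way to build such hash functions is with random hyperplanes: each hyperplane contributes one bit, depending on which side of it the vector falls. The class below is a toy sketch of that idea and is not claimed to match the internals of `faiss.IndexLSH`; the class name and parameters are assumptions for illustration.

```python
import numpy as np


class RandomHyperplaneLSH:
    """Toy LSH: one bit per random hyperplane, compared by Hamming distance."""

    def __init__(self, dim, num_bits, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(num_bits, dim))

    def encode(self, vectors):
        # The sign of the projection onto each hyperplane gives one bit.
        return (np.atleast_2d(vectors) @ self.planes.T > 0).astype(np.uint8)

    def hamming(self, code, codes):
        # Number of differing bits between `code` and each row of `codes`.
        return np.count_nonzero(codes != code, axis=1)


rng = np.random.default_rng(1)
base = rng.normal(size=16)
near = base + 0.01 * rng.normal(size=16)  # almost the same direction
far = -base                               # opposite direction

lsh = RandomHyperplaneLSH(dim=16, num_bits=32)
codes = lsh.encode(np.stack([near, far]))
query_code = lsh.encode(base)[0]
print(lsh.hamming(query_code, codes))
```

The nearby vector differs from the query in few bits, while the opposite vector flips every bit.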

#### Product quantization with inverted file (IVFPQ)
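Product quantization uses k-means as its core quantizer. It drastically increases the effective number of centroids by dividing each vector into many subvectors and running the quantizer on each subvector separately. The IVFPQ index relies on two levels of quantization: a coarse quantizer that assigns each vector to a partition (the inverted file), and a product quantizer that compresses the vectors stored in each partition.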
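
The product quantization step can be sketched in plain NumPy. The code below is a simplified illustration (a tiny hand-written k-means, no inverted file, made-up function names) rather than the faiss implementation. With 4 subvectors and 16 centroids each, only 4 × 16 centroids are stored, yet 16^4 = 65,536 distinct reconstructions are possible.

```python
import numpy as np


def kmeans(data, k, iters=10, seed=0):
    """Very small Lloyd's k-means; returns the (k, d) centroid matrix."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids


def pq_train(data, num_subvectors, num_centroids):
    """Train one codebook per subvector slice."""
    slices = np.split(data, num_subvectors, axis=1)
    return [kmeans(s, num_centroids) for s in slices]


def pq_encode(data, codebooks):
    """Replace each subvector by the id of its nearest centroid."""
    slices = np.split(data, len(codebooks), axis=1)
    codes = [
        ((s[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for s, cb in zip(slices, codebooks)
    ]
    return np.stack(codes, axis=1).astype(np.uint8)


def pq_decode(codes, codebooks):
    """Rebuild approximate vectors by concatenating the chosen centroids."""
    return np.hstack([cb[codes[:, i]] for i, cb in enumerate(codebooks)])


rng = np.random.default_rng(0)
data = rng.normal(size=(500, 8)).astype(np.float32)
codebooks = pq_train(data, num_subvectors=4, num_centroids=16)
codes = pq_encode(data, codebooks)   # 4 small integers per vector
approx = pq_decode(codes, codebooks)
print(codes.shape, approx.shape)  # (500, 4) (500, 8)
```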

#### Hierarchical Navigable Small World Graphs (HNSW)
This method explores a graph built on the closeness relation between a node, its neighbors, its neighbors' neighbors, and so on. HNSW stores the full-length vectors and the full graph structure in memory (RAM).
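
The graph exploration can be illustrated with a greedy walk on a single-layer nearest-neighbor graph. This is a deliberately simplified sketch: real HNSW adds a hierarchy of layers and keeps a list of candidates instead of a single current node, so the walk below can stop at a local minimum. All names are hypothetical.

```python
import numpy as np


def build_knn_graph(vectors, num_neighbors):
    """Link every node to its closest neighbors (brute force, for clarity)."""
    dists = ((vectors[:, None, :] - vectors[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(dists, np.inf)
    return np.argsort(dists, axis=1)[:, :num_neighbors]


def greedy_search(query, vectors, graph, entry_point=0):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry_point
    current_dist = float(np.sum((vectors[current] - query) ** 2))
    visited = 1
    while True:
        neighbors = graph[current]
        neighbor_dists = np.sum((vectors[neighbors] - query) ** 2, axis=1)
        best = int(np.argmin(neighbor_dists))
        visited += len(neighbors)
        if neighbor_dists[best] >= current_dist:
            return current, current_dist, visited  # no neighbor is closer
        current, current_dist = int(neighbors[best]), float(neighbor_dists[best])


rng = np.random.default_rng(0)
vectors = rng.normal(size=(300, 8))
graph = build_knn_graph(vectors, num_neighbors=8)
query = rng.normal(size=8)
node, dist, visited = greedy_search(query, vectors, graph)
print(node, dist, visited)
```

The walk only computes distances to the neighbors of the nodes it passes through, rather than to every stored vector.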

### Install libraries

In [None]:
!pip install faiss
!pip install nmslib
!apt-get install libomp-dev

In [17]:
import faiss
import nmslib

### Embed the documents

In [18]:
documents = data['title'].to_list()[:2000]
# note: embedding all titles in one call can run out of memory (OOM) for large document sets
embeddings = embed(documents).numpy()
embeddings.shape

(2000, 512)

<!-- ------------------------ -->
## Compare ANNs
Duration: 2

### Define the index wrapper classes

In [19]:
class DemoIndexLSH():
  def __init__(self, dimension, documents, embeddings):
    self.dimension = dimension
    self.documents = documents
    self.embeddings = embeddings

  def build(self, num_bits=8):
    self.index = faiss.IndexLSH(self.dimension, num_bits)
    self.index.add(self.embeddings)

  def query(self, input_embedding, k=5):
    distances, indices = self.index.search(input_embedding, k)

    return [(distance, self.documents[index]) for distance, index in zip(distances[0], indices[0])]

index_lsh = DemoIndexLSH(512, documents, embeddings)
index_lsh.build(num_bits=16)

In [20]:
class DemoIndexIVFPQ():
  def __init__(self, dimension, documents, embeddings):
    self.dimension = dimension
    self.documents = documents
    self.embeddings = embeddings

  def build(self,
            number_of_partition=2,
            number_of_subquantizers=2,
            subvector_bits=4):
    quantizer = faiss.IndexFlatL2(self.dimension)
    self.index = faiss.IndexIVFPQ(quantizer, 
                                  self.dimension,
                                  number_of_partition,
                                  number_of_subquantizers,
                                  subvector_bits)
    self.index.train(self.embeddings)
    self.index.add(self.embeddings)

  def query(self, input_embedding, k=5):
    distances, indices = self.index.search(input_embedding, k)

    return [(distance, self.documents[index]) for distance, index in zip(distances[0], indices[0])]

index_pq = DemoIndexIVFPQ(512, documents, embeddings)
index_pq.build()

In [21]:
class DemoHNSW():
  def __init__(self, dimension, documents, embeddings):
    self.dimension = dimension
    self.documents = documents
    self.embeddings = embeddings

  def build(self):
    self.index = nmslib.init(method='hnsw', space='cosinesimil')
    self.index.addDataPointBatch(self.embeddings)
    self.index.createIndex({'post': 2}, print_progress=True)

  def query(self, input_embedding, k=5):
    indices, distances = self.index.knnQuery(input_embedding, k)

    return [(distance, self.documents[index]) for distance, index in zip(distances, indices)]

index_hnsw = DemoHNSW(512, documents, embeddings)
index_hnsw.build()

In [22]:
class DemoIndexFlatL2():
  def __init__(self, dimension, documents, embeddings):
    self.dimension = dimension
    self.documents = documents
    self.embeddings = embeddings

  def build(self):
    self.index = faiss.IndexFlatL2(self.dimension)
    self.index.add(self.embeddings)

  def query(self, input_embedding, k=5):
    distances, indices = self.index.search(input_embedding, k)

    return [(distance, self.documents[index]) for distance, index in zip(distances[0], indices[0])]

index_flat = DemoIndexFlatL2(512, documents, embeddings)
index_flat.build()

### Define retrieval function

In [23]:
def return_ann_top_movies(ann_index, query, k=SEARCH_SIZE):
  query_vector = embed([query]).numpy()
  search_start = timer()
  top_docs = ann_index.query(query_vector, k)
  search_time = timer() - search_start
  print("search time: {:.2f} ms".format(search_time * 1000))
  return top_docs
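
Search time is only half of the comparison. One common way to measure quality is recall@k: the share of the exact top-k results that an approximate index also returns. The helper below is a sketch that works on `(distance, title)` lists in the shape returned by the `query()` methods above; `recall_at_k` is a hypothetical name and the sample lists are made up.

```python
def recall_at_k(approx_results, exact_results):
    """Fraction of the exact top-k titles that the approximate index also returned."""
    exact_titles = {title for _, title in exact_results}
    approx_titles = {title for _, title in approx_results}
    if not exact_titles:
        return 0.0
    return len(exact_titles & approx_titles) / len(exact_titles)


# Hand-made (distance, title) lists, not real search output.
exact = [(0.1, "A"), (0.2, "B"), (0.3, "C"), (0.4, "D")]
approx = [(0.1, "A"), (0.25, "C"), (0.5, "E"), (0.6, "F")]
print(recall_at_k(approx, exact))  # 0.5
```

In this tutorial, the exact list would come from `index_flat` and the approximate list from any of the other indexes, for the same query and the same `k`.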

### Retrieve the documents using different methods

In [24]:
return_ann_top_movies(index_flat, "romance")

search time: 0.83 ms


[(0.9557337, 'True Romance (1993)'),
 (1.2160164, 'Love Serenade (1996)'),
 (1.2626684, 'Love Affair (1994)'),
 (1.3447756, 'Kissed (1996)'),
 (1.3752131, 'In Love and War (1996)'),
 (1.380403, 'Casablanca (1942)'),
 (1.3832322, 'Flirt (1995)'),
 (1.3862598, 'Moonlight and Valentino (1995)'),
 (1.3862815, 'Hotel de Love (1996)'),
 (1.3907105, 'Intimate Relations (1996)')]

In [25]:
return_ann_top_movies(index_lsh, "romance")

search time: 0.26 ms


[(2.0, 'Visitors, The (Visiteurs, Les) (1993)'),
 (2.0, 'City Hall (1996)'),
 (2.0, 'Paradise Road (1997)'),
 (3.0, 'When a Man Loves a Woman (1994)'),
 (3.0, 'Cosi (1996)'),
 (3.0, 'Haunted World of Edward D. Wood Jr., The (1996)'),
 (3.0, 'Eddie (1996)'),
 (3.0, 'Ransom (1996)'),
 (3.0, 'Time to Kill, A (1996)'),
 (3.0, 'Mirage (1995)')]

In [26]:
return_ann_top_movies(index_pq, "romance")

search time: 0.21 ms


[(1.0712402, 'Streetcar Named Desire, A (1951)'),
 (1.0712402, 'Moonlight Murder (1936)'),
 (1.0847106, 'To Kill a Mockingbird (1962)'),
 (1.0847106, 'Meet John Doe (1941)'),
 (1.0867726, 'Moonlight and Valentino (1995)'),
 (1.0901787, 'Laura (1944)'),
 (1.0901787, 'Rebecca (1940)'),
 (1.0901787, 'African Queen, The (1951)'),
 (1.0901787, 'Gigi (1958)'),
 (1.0901787, 'Scarlet Letter, The (1926)')]

In [27]:
return_ann_top_movies(index_hnsw, "romance")

search time: 0.30 ms


[(0.47786677, 'True Romance (1993)'),
 (0.60800815, 'Love Serenade (1996)'),
 (0.6313339, 'Love Affair (1994)'),
 (0.67238766, 'Kissed (1996)'),
 (0.68760645, 'In Love and War (1996)'),
 (0.6916159, 'Flirt (1995)'),
 (0.6931299, 'Moonlight and Valentino (1995)'),
 (0.6931407, 'Hotel de Love (1996)'),
 (0.6953552, 'Intimate Relations (1996)'),
 (0.69853836, 'Love in Bloom (1935)')]
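
The distances printed by the different indexes are not on the same scale. `faiss.IndexFlatL2` reports squared L2 distances, while the nmslib `cosinesimil` space reports cosine distance (1 minus cosine similarity). For unit-length vectors the two are related by squared L2 = 2 × cosine distance, which is consistent with the outputs above (for example 0.9557 ≈ 2 × 0.4779 for "True Romance (1993)"). This reading assumes the USE embeddings are approximately unit-norm, which the outputs suggest but the tutorial does not state. The identity itself can be checked on random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=512)
b = rng.normal(size=512)
a /= np.linalg.norm(a)  # normalize both vectors to unit length
b /= np.linalg.norm(b)

squared_l2 = float(np.sum((a - b) ** 2))
cosine_distance = 1.0 - float(a @ b)
print(np.isclose(squared_l2, 2.0 * cosine_distance))  # True
```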

<!-- ------------------------ -->
## Conclusion
Duration: 2

Congratulations!

### What we've covered
- Lorem ipsum

### Next steps
- Lorem ipsum

### Links and References
- Lorem ipsum

### Have a Question?
- [Fill out this form](https://form.jotform.com/211377288388469)
- [Raise issue on Github](https://github.com/sparsh-ai/reco-tutorials/issues)