# Semantic Search With Searchkit and Elasticsearch

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/searchkit/searchkit/blob/main/notebooks/semantic-search.ipynb)

This notebook sets up an ingest pipeline and a search index to perform semantic search on a dataset of 5,000 IMDb movies. We will use the sentence-transformers `all-MiniLM-L6-v2` embedding model.

This is part of a guide on how to use Searchkit to build a semantic search application. You can find the full guide [here](https://searchkit.co/docs/guides/semantic-search/).

## Install packages and import modules
Before you start, install the required Python dependencies.

In [None]:
# install packages
!python3 -m pip install -qU sentence-transformers eland elasticsearch transformers

# import modules
import json
import pandas as pd
from elasticsearch import Elasticsearch
from getpass import getpass
from urllib.request import urlopen

## Deploy the MiniLM-L6-v2 model into Elasticsearch

We use eland to deploy the model into Elasticsearch, connecting to the deployment with a Cloud ID and an API key. The `--start` flag starts the model deployment as soon as the import finishes.


In [3]:
API_KEY = getpass("Elastic deployment API Key")
CLOUD_ID = getpass("Elastic deployment Cloud ID")
!eland_import_hub_model --cloud-id $CLOUD_ID --hub-model-id sentence-transformers/all-MiniLM-L6-v2 --task-type text_embedding --es-api-key $API_KEY --start
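
The pipeline and the query later in this notebook refer to the model as `sentence-transformers__all-minilm-l6-v2` rather than by its Hugging Face name. The sketch below illustrates that naming convention, assuming eland's default behaviour of lowercasing the hub ID and replacing `/` with `__`. `to_es_model_id` is a hypothetical helper written for this notebook, not part of eland.

```python
def to_es_model_id(hub_model_id: str) -> str:
    """Approximate the Elasticsearch model ID that eland derives from a hub ID.

    Assumption: eland lowercases the ID and replaces "/" with "__".
    """
    return hub_model_id.replace("/", "__").lower()


model_id = to_es_model_id("sentence-transformers/all-MiniLM-L6-v2")
print(model_id)
```

If the printed ID differs from the one shown under Trained Models in your deployment, use the ID from the deployment.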

## Connect to Elasticsearch

Create an Elasticsearch client instance with your deployment's `Cloud ID` and `API Key`. In this example, we reuse the `API_KEY` and `CLOUD_ID` values from the previous step.

In [None]:
es = Elasticsearch(
  cloud_id=CLOUD_ID,
  api_key=API_KEY,
  request_timeout=600
)

# should return cluster info if successfully connected
es.info()

## Create an Ingest pipeline

We need to create a text embedding ingest pipeline that generates vector embeddings for the `Plot` field.

The pipeline below defines an [inference](https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-processor.html) processor that passes the `Plot` field to the embedding model as `text_field` and stores the result under `plot_embedding`.

In [None]:
# ingest pipeline definition
PIPELINE_ID="imdb-movies-pipeline"

es.ingest.put_pipeline(
  id=PIPELINE_ID, 
  processors=[{
    "inference": {
      "model_id": "sentence-transformers__all-minilm-l6-v2",
      "target_field": "plot_embedding",
      "field_map": {
        "Plot": "text_field"
      }
    }
  }]
)
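
Before indexing anything, it can help to run a single sample document through the pipeline. The sketch below uses the simulate pipeline API for that, assuming the inference processor writes the vector to `plot_embedding.predicted_value` (the path used by the index mapping in this notebook). `simulate_embedding` is a hypothetical helper, and the error handling reflects an assumption about how a failed simulation is reported.

```python
def simulate_embedding(es, pipeline_id: str, plot: str) -> list:
    """Run one sample document through the pipeline without indexing it."""
    response = es.ingest.simulate(
        id=pipeline_id,
        docs=[{"_source": {"Plot": plot}}],
    )
    result = response["docs"][0]
    if "doc" not in result:
        # Assumption: a failed simulation returns an "error" entry instead of "doc".
        raise RuntimeError(f"Pipeline simulation failed: {result}")
    # The inference processor writes its output under target_field.
    return result["doc"]["_source"]["plot_embedding"]["predicted_value"]


# Usage (needs the `es` client and PIPELINE_ID defined in this notebook):
# vector = simulate_embedding(es, PIPELINE_ID, "A cowboy doll feels threatened by a new toy.")
# print(len(vector))  # should equal the `dims` value in the index mapping
```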

## Create Index with mappings

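We will now create an Elasticsearch index with the correct mapping before we index documents. The `dims` value of 384 matches the size of the embeddings produced by the MiniLM-L6-v2 model, and setting `default_pipeline` ensures every indexed document passes through the ingest pipeline.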

In [None]:
INDEX_NAME="imdb-movies-semantic-search"

SHOULD_DELETE_INDEX=True

INDEX_MAPPING = {
    "properties": {
      "plot_embedding": {
        "properties": {
          "predicted_value": {
            "type": "dense_vector",
            "dims": 384,
            "index": True,
            "similarity": "cosine"
          }
        }
      }
    }
  }

INDEX_SETTINGS = {
    "index": {
      "number_of_replicas": "1",
      "number_of_shards": "1",
      "default_pipeline": PIPELINE_ID
    }
}

# optionally delete an existing index before creating it
if SHOULD_DELETE_INDEX:
  if es.indices.exists(index=INDEX_NAME):
    print("Deleting existing %s" % INDEX_NAME)
    es.indices.delete(index=INDEX_NAME, allow_no_indices=True, ignore_unavailable=True)

print("Creating index %s" % INDEX_NAME)
es.indices.create(index=INDEX_NAME, mappings=INDEX_MAPPING, settings=INDEX_SETTINGS)
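
To confirm the index was created with the intended vector field, the mapping can be read back from the cluster. This is a minimal sketch, assuming the get mapping response is keyed by index name. `embedding_dims` is a hypothetical helper.

```python
def embedding_dims(es, index_name: str, field: str = "plot_embedding.predicted_value") -> int:
    """Return the `dims` setting of a dense_vector field in an index mapping."""
    mapping = es.indices.get_mapping(index=index_name)
    node = mapping[index_name]["mappings"]
    # Walk the nested "properties" objects down to the vector field.
    for part in field.split("."):
        node = node["properties"][part]
    return node["dims"]


# Usage (needs the `es` client and INDEX_NAME defined in this notebook):
# print(embedding_dims(es, INDEX_NAME))
```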


## Index data into Elasticsearch

Let's index the sample movie data using the ingest pipeline.

Note: Before we begin indexing, ensure you have [started your trained model deployment](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-deploy-model.html).
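
One way to check this from the notebook is to read the trained model statistics. The sketch below assumes the stats response exposes the deployment state under `deployment_stats.state`. `deployment_state` is a hypothetical helper.

```python
def deployment_state(es, model_id: str):
    """Return the deployment state of a trained model, or None if it is not deployed."""
    stats = es.ml.get_trained_models_stats(model_id=model_id)
    model_stats = stats["trained_model_stats"][0]
    # Assumption: `deployment_stats` is present only once a deployment exists.
    return model_stats.get("deployment_stats", {}).get("state")


# Usage (needs the `es` client defined in this notebook):
# print(deployment_state(es, "sentence-transformers__all-minilm-l6-v2"))
```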

In [None]:
url = "https://raw.githubusercontent.com/searchkit/searchkit/main/sample-data/movies/movies.json"
response = urlopen(url)
movies = json.loads(response.read())

actions = []
for movie in movies:
    actions.append({"index": {"_index": INDEX_NAME}})
    actions.append({
        "title": movie["Title"],
        "Plot": movie["Plot"],
        "Genre": movie["genres"],
    })
es.bulk(index=INDEX_NAME, operations=actions)
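
The cell above sends every movie in a single bulk request and does not inspect the response. Because each document runs through the inference pipeline, a smaller batch size may be more forgiving. The following is a sketch of a batched alternative that also collects per-document failures. `bulk_in_batches` and its `docs_per_batch` parameter are hypothetical, and 500 is an arbitrary default.

```python
def bulk_in_batches(es, index_name, operations, docs_per_batch=500):
    """Send action/document pairs to the Bulk API in smaller requests.

    Returns the response items that Elasticsearch reported as failed.
    """
    step = docs_per_batch * 2  # each document occupies two entries: action + source
    failed = []
    for start in range(0, len(operations), step):
        response = es.bulk(index=index_name, operations=operations[start:start + step])
        if response["errors"]:
            failed.extend(
                item for item in response["items"]
                if "error" in item.get("index", {})
            )
    return failed


# Usage (needs `es`, INDEX_NAME and `actions` from this notebook):
# failures = bulk_in_batches(es, INDEX_NAME, actions)
# print("%d documents failed" % len(failures))
```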

## Querying
The next step is to check that everything is working. We run a simple kNN search in which the query text is embedded by the same model, `sentence-transformers__all-minilm-l6-v2`, and compared against the stored plot embeddings.

In [8]:
query = {
  "field": "plot_embedding.predicted_value",
  "k": 10,
  "num_candidates": 50,
  "query_vector_builder": {
    "text_embedding": {
      "model_id": "sentence-transformers__all-minilm-l6-v2",
      "model_text": "a film about toys coming alive"
    }
  }
}

response = es.search(
    index=INDEX_NAME,
    source=["title", "Plot"],
    knn=query)

# flatten the hits into a DataFrame for display
results = pd.json_normalize(response.body['hits']['hits'])

# shows the result
results


| | _index | _id | _score | _source.title | _source.Plot |
|---|---|---|---|---|---|
| 0 | imdb-movies-semantic-search | ERERDooB48GkE85ltEPi | 0.729052 | Toy Story | A cowboy doll is profoundly threatened and jea... |
| 1 | imdb-movies-semantic-search | CxERDooB48GkE85ltEfi | 0.729052 | Toy Story | A cowboy doll is profoundly threatened and jea... |
| 2 | imdb-movies-semantic-search | BBERDooB48GkE85ltEvj | 0.723086 | How the Grinch Stole Christmas | Big budget remake of the classic cartoon about... |
| 3 | imdb-movies-semantic-search | ixERDooB48GkE85ltEvj | 0.718766 | Small Soldiers | When missile technology is used to enhance toy... |
| 4 | imdb-movies-semantic-search | jhERDooB48GkE85ltEPi | 0.713659 | Toy Story 2 | When Woody is stolen by a toy collector, Buzz ... |
| 5 | imdb-movies-semantic-search | hxERDooB48GkE85ltEfi | 0.713659 | Toy Story 2 | When Woody is stolen by a toy collector, Buzz ... |
| 6 | imdb-movies-semantic-search | TxERDooB48GkE85ltEbi | 0.709669 | Ponyo | An animated adventure about a five-year-old bo... |
| 7 | imdb-movies-semantic-search | RhERDooB48GkE85ltErj | 0.709669 | Ponyo | An animated adventure about a five-year-old bo... |
| 8 | imdb-movies-semantic-search | ZxERDooB48GkE85ltEbi | 0.706249 | The Passion of the Christ | A film detailing the final hours and crucifixi... |
| 9 | imdb-movies-semantic-search | YRERDooB48GkE85ltErj | 0.706249 | The Passion of the Christ | A film detailing the final hours and crucifixi... |
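
The `_score` values are not raw cosine similarities. For a `dense_vector` field with `"similarity": "cosine"`, the Elasticsearch documentation gives the kNN score as `(1 + cosine) / 2`, which keeps scores non-negative. The sketch below shows the conversion in both directions. The function names are hypothetical, and the top score of 0.729052 corresponds to a cosine similarity of roughly 0.458.

```python
import math


def cosine_similarity(a, b):
    """Plain-Python cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_score_from_cosine(cosine):
    """Map a cosine similarity in [-1, 1] to a kNN _score in [0, 1]."""
    return (1 + cosine) / 2


def cosine_from_knn_score(score):
    """Invert the mapping to recover the cosine similarity from a _score."""
    return 2 * score - 1


# Cosine similarity behind the top hit in the results table.
print(round(cosine_from_knn_score(0.729052), 3))
```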
