**Locally Hosted Semantic Reranker**

# Objectives

In this notebook we will:
- Load a semantic reranker into Elasticsearch with Eland
- Create a reranker inference API
- Modify the query to use the reranker as part of the query to gather contextual documents

# Setup

Here we do the following:
- Import the required libraries
- Create an Elasticsearch Python client connection


These should already be installed in your notebook environment.
Run the cell below only if they are missing.

In [1]:
%pip install -qU elasticsearch
%pip install -qU "eland[pytorch]"

Note: you may need to restart the kernel to use updated packages.


Import the required python libraries

In [2]:
import os
from elasticsearch import Elasticsearch, helpers, exceptions
from urllib.request import urlopen
from getpass import getpass
import json
import time

Create an Elasticsearch Python client

In [24]:
elastic_endpoint = os.getenv('ELASTIC_ENDPOINT')
elastic_api_key = os.getenv('ELASTIC_API_KEY')

es = Elasticsearch(
    elastic_endpoint,
    api_key=elastic_api_key,
)
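
If either environment variable is unset, `os.getenv` returns `None` and the client is created without a usable endpoint or key. A minimal sketch, assuming you would rather be prompted for a missing value; `resolve_setting` is a hypothetical helper for this notebook, not part of the Elasticsearch client, and it relies on the `getpass` import from the cell above.

```python
import os
from getpass import getpass


def resolve_setting(name, env=os.environ, prompt=getpass):
    """Return `name` from `env`, prompting for it if it is missing or empty.

    Hypothetical helper: `env` and `prompt` are injectable so the function
    can be exercised without a real environment or terminal.
    """
    value = env.get(name)
    if not value:
        # getpass hides the typed value, which suits API keys
        value = prompt(f"{name}: ")
    if not value:
        raise ValueError(f"{name} is required")
    return value
```

It could then be used as `elastic_endpoint = resolve_setting("ELASTIC_ENDPOINT")` and `elastic_api_key = resolve_setting("ELASTIC_API_KEY")` before constructing the client.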

# Upload Hugging Face model with Eland
Here we will use Eland's `eland_import_hub_model` command to upload a model from Hugging Face to Elasticsearch.

For this example we've chosen the [`cross-encoder/ms-marco-MiniLM-L-6-v2`](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) text similarity model.
<br><br>
**Note**:
While we are importing the model for use as a reranker, Eland's import command does not have a dedicated rerank task type, so we still import it with `text_similarity`. The `rerank` task type is applied later, when the inference endpoint is created.

In [25]:
MODEL_ID = "cross-encoder/ms-marco-MiniLM-L-6-v2"


In [None]:

!eland_import_hub_model \
  --url $elastic_endpoint \
  --es-api-key $elastic_api_key \
  --hub-model-id $MODEL_ID \
  --task-type text_similarity \
  --start \
  --clear-previous
  

# Create Inference Endpoint
Here we will:
- Create an inference endpoint that queries can use for reranking
- Deploy the reranking model we imported in the previous section

Key points about the `model_config`:
- `service` - in this case `elasticsearch` tells the inference API to use a model hosted locally in Elasticsearch
- `num_allocations` - sets the number of allocations to 1
    - Allocations are independent units of work for NLP tasks. Increasing this value increases concurrent throughput
- `num_threads` - sets the number of threads per allocation to 1
    - This controls how many threads each allocation uses during inference. Increasing it generally speeds up inference requests (to a point).
- `model_id` - the ID of the model as it is named in Elasticsearch
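
The `model_id` here differs from the Hugging Face ID: in this notebook `cross-encoder/ms-marco-MiniLM-L-6-v2` appears in Elasticsearch as `cross-encoder__ms-marco-minilm-l-6-v2`. A minimal sketch of that mapping, assuming the only transformations are replacing `/` with `__` and lowercasing. `to_es_model_id` is a hypothetical helper, not Eland's own function, so it is worth confirming the actual ID in Elasticsearch after the import.

```python
def to_es_model_id(hub_model_id: str) -> str:
    """Approximate the Elasticsearch model ID for a Hugging Face hub ID.

    Assumption: slashes become double underscores and the result is
    lowercased, matching the pair of IDs shown in this notebook.
    """
    return hub_model_id.replace("/", "__").lower()


print(to_es_model_id("cross-encoder/ms-marco-MiniLM-L-6-v2"))
# cross-encoder__ms-marco-minilm-l-6-v2
```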



In [29]:
model_config = {
    "service": "elasticsearch",
    "service_settings": {
        "num_allocations": 1,
        "num_threads": 1,
        "model_id": "cross-encoder__ms-marco-minilm-l-6-v2"
    },
    "task_settings": {
        "return_documents": True
    }
}


In [30]:

inference_id = "semantic-reranking"

create_endpoint = es.inference.put(
    inference_id=inference_id,
    task_type="rerank",
    body=model_config
)

create_endpoint.body

{'inference_id': 'semantic-reranking',
 'task_type': 'rerank',
 'service': 'elasticsearch',
 'service_settings': {'num_allocations': 1,
  'num_threads': 1,
  'model_id': 'cross-encoder__ms-marco-minilm-l-6-v2'},
 'task_settings': {'return_documents': True}}

### Verify it was created

Run the two cells in this section to verify that:
- The inference endpoint has been created
- The model has been deployed

You should see JSON output with information about the `semantic-reranking` endpoint

In [31]:
check_endpoint = es.inference.get(
    inference_id=inference_id,
)

check_endpoint.body

{'endpoints': [{'inference_id': 'semantic-reranking',
   'task_type': 'rerank',
   'service': 'elasticsearch',
   'service_settings': {'num_allocations': 1,
    'num_threads': 1,
    'model_id': 'cross-encoder__ms-marco-minilm-l-6-v2'},
   'task_settings': {'return_documents': True}}]}

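Verify the model was successfully deployed

Once the deployment is ready, the `state` field under `deployment_stats` in the output of the cell below should read `started`. A value of `starting` means the deployment is not ready yet; check the accompanying `reason` field.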




In [41]:
ES_MODEL_ID = "cross-encoder__ms-marco-minilm-l-6-v2"

model_info = es.ml.get_trained_models_stats(model_id=ES_MODEL_ID)

# Display the full stats response. The deployment state is reported under
# trained_model_stats[0]['deployment_stats']['state']
model_info
# model_info.body['trained_model_stats'][0]['deployment_stats']['nodes'][0]['routing_state']['routing_state']

ObjectApiResponse({'count': 1, 'trained_model_stats': [{'model_id': 'cross-encoder__ms-marco-minilm-l-6-v2', 'model_size_stats': {'model_size_bytes': 90892372, 'required_native_memory_bytes': 1229655976}, 'pipeline_count': 0, 'inference_stats': {'failure_count': 0, 'inference_count': 0, 'cache_miss_count': 0, 'missing_all_fields_count': 0, 'timestamp': 1728062249227}, 'deployment_stats': {'deployment_id': 'semantic-reranking', 'model_id': 'cross-encoder__ms-marco-minilm-l-6-v2', 'threads_per_allocation': 1, 'number_of_allocations': 1, 'queue_capacity': 1024, 'state': 'starting', 'reason': 'Could not assign 
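
A deployment can stay in `starting` for a while, so it may help to poll until it reports `started`. A minimal sketch, assuming the stats body has the shape shown in the output above (`trained_model_stats[0].deployment_stats.state`); `deployment_state` and `wait_until_started` are hypothetical helpers, and the timeout and interval defaults are arbitrary.

```python
import time


def deployment_state(stats: dict):
    """Return the deployment state from a trained-model stats body, or None."""
    models = stats.get("trained_model_stats", [])
    if not models:
        return None
    return models[0].get("deployment_stats", {}).get("state")


def wait_until_started(fetch_stats, timeout_s=120.0, interval_s=5.0, sleep=time.sleep):
    """Poll `fetch_stats()` until the deployment state is "started".

    `fetch_stats` is any zero-argument callable returning the stats body
    as a dict. Returns True if "started" was seen before the timeout,
    otherwise False.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if deployment_state(fetch_stats()) == "started":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

In this notebook the callable could be `lambda: es.ml.get_trained_models_stats(model_id=ES_MODEL_ID).body`. If the function returns `False`, the `reason` field in the stats output is the place to look.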

# Query with Reranking

This query contains a `text_similarity_reranker` retriever which:
1. Uses a standard retriever to:
    1. Perform a semantic (kNN) query against the chunked embeddings of the `semantic_body` field
    2. Return the top 2 inner hit chunks
2. Performs reranking, which:
    1. Takes as input the top 50 results from the previous search
      - `"rank_window_size": 50`
    2. Takes as input the user's question
      - `"inference_text": USER_QUESTION`
    3. Uses our previously created reranking API and model
      - `"inference_id": "semantic-reranking"`
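
The two-stage flow above can be illustrated in plain Python. This is a conceptual sketch only, not how Elasticsearch implements the retriever: `word_overlap` is a toy stand-in for the cross-encoder model, `rerank` is a hypothetical function, and the sample documents are made up.

```python
def word_overlap(query: str, text: str) -> int:
    """Toy relevance score: number of distinct query words found in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def rerank(docs, inference_text, score_fn=word_overlap, rank_window_size=50):
    """Rescore the first `rank_window_size` docs and sort them by the new score.

    Only documents inside the window are considered; the sort is stable,
    so ties keep their first-stage order.
    """
    window = docs[:rank_window_size]
    scored = [(score_fn(inference_text, doc), doc) for doc in window]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


# Made-up first-stage results, in the order the first retriever returned them
first_stage = [
    "great burgers and fries",
    "pizza place downtown",
    "good pizza and friendly staff",
]
print(rerank(first_stage, "good pizza"))
```

A smaller `rank_window_size` means fewer documents are scored by the (comparatively expensive) reranking model, at the cost of possibly missing relevant documents that the first stage ranked low.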


In [None]:
USER_QUESTION = "Where can I get good pizza?"

response = es.search(
    index="restaurant_reviews",
    body={
        "retriever": {
            "text_similarity_reranker": {
                "retriever": {
                    "standard": {
                        "query": {
                            "nested": {
                                "path": "semantic_body.inference.chunks",
                                "query": {
                                    "knn": {
                                        "field": "semantic_body.inference.chunks.embeddings",
                                        "query_vector_builder": {
                                            "text_embedding": {
                                                "model_id": "my-e5-endpoint",
                                                "model_text": USER_QUESTION
                                            }
                                        }
                                    }
                                },
                                "inner_hits": {
                                    "size": 2,
                                    "name": "restaurant_reviews.semantic_body",
                                    "_source": [
                                        "semantic_body.inference.chunks.text"
                                    ]
                                }
                            }
                        }
                    }
                },
                "field": "Review",
                "inference_id": "semantic-reranking",
                "inference_text": USER_QUESTION,
                "rank_window_size": 50
            }
        }
    }
)

response.raw

Print out the formatted response

In [None]:
for review in response.raw['hits']['hits']:
    print(f"Restaurant {review['_source']['Restaurant']} - Rating: {review['_source']['Rating']} - Reviewer: {review['_source']['Reviewer']}")
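
The loop above raises a `KeyError` if a hit lacks one of the `_source` fields, and it does not show the score assigned by the reranker. A minimal sketch of a more tolerant formatter, assuming the same `Restaurant`, `Rating` and `Reviewer` fields; `format_hits` is a hypothetical helper and the sample body is made up.

```python
def format_hits(response_body: dict):
    """Build one summary line per hit from a search response body."""
    lines = []
    for hit in response_body.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        lines.append(
            f"score={hit.get('_score')} | "
            f"Restaurant: {source.get('Restaurant', 'n/a')} | "
            f"Rating: {source.get('Rating', 'n/a')} | "
            f"Reviewer: {source.get('Reviewer', 'n/a')}"
        )
    return lines


# Made-up response body with the same shape as a search response
sample = {
    "hits": {
        "hits": [
            {"_score": 1.5, "_source": {"Restaurant": "A", "Rating": 4, "Reviewer": "B"}},
            {"_score": 0.2, "_source": {"Restaurant": "C"}},
        ]
    }
}
for line in format_hits(sample):
    print(line)
```

With the real response this would be called as `format_hits(response.raw)`.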
