## Context

When an irrelevant query such as "what is the door about" is sent, the search still returns chunks from the selected document even though none of them are relevant. The LLM then intermittently hallucinates, answering the question from those irrelevant chunks. This notebook looks at how to optimize the search; further analysis is required to ensure the LLM does not hallucinate.

In [None]:

# `client` is not imported from opensearchpy because the name is reused for the connection below
from opensearchpy import OpenSearch, RequestsHttpConnection
#"http://admin:admin@opensearch:9200"
client = OpenSearch(
            hosts=[{"host": "localhost", "port": "9200"}],
            http_auth=("admin", "admin"),
            use_ssl=False, #to run locally, changed from True to False
            connection_class=RequestsHttpConnection,
            retry_on_timeout=True
        )

query = {
    "size": 1000,
    "track_total_hits": True,
    "query" : {
        "match_all" : {}
    }
}

#redbox-data-integration-chunk-current

response = client.search(index='redbox-data-integration-chunk-current', body=query)
print(response)

## Define QUERY

In [39]:
from langchain_community.embeddings import BedrockEmbeddings
embedding_model = BedrockEmbeddings(region_name='eu-west-2', model_id="amazon.titan-embed-text-v2:0")
#query = "Data feminism begins by examining how power operates in the world today" #66
#query = "goodbye" #score is 0
#query = "what is this door about" #score is 3.3
query = "I don't know."
query_vector = embedding_model.embed_query(query)

In [None]:
client.indices.get_mapping(index='redbox-data-integration-chunk-current')

In [6]:
from pydantic import BaseModel
class AISettings(BaseModel):
    """Prompts and other AI settings"""

    # LLM settings
    context_window_size: int = 128_000
    llm_max_tokens: int = 1024

    # Prompts and LangGraph settings
    max_document_tokens: int = 1_000_000
    self_route_enabled: bool = False
    map_max_concurrency: int = 128
    stuff_chunk_context_ratio: float = 0.75
    recursion_limit: int = 50





    # Elasticsearch RAG and boost values
    rag_k: int = 30
    rag_num_candidates: int = 10
    rag_gauss_scale_size: int = 3
    rag_gauss_scale_decay: float = 0.5
    rag_gauss_scale_min: float = 1.1
    rag_gauss_scale_max: float = 2.0
    elbow_filter_enabled: bool = False
    match_boost: float = 1.0
    match_name_boost: float = 2.0
    match_description_boost: float = 0.5
    match_keywords_boost: float = 0.5
    knn_boost: float = 2.0
    similarity_threshold: float = 0.7

    # this is also the azure_openai_model
    #chat_backend: ChatLLMBackend = ChatLLMBackend()

    # settings for tool call
    tool_govuk_retrieved_results: int = 100
    tool_govuk_returned_results: int = 5

In [7]:
ai_settings = AISettings()

In [8]:
query_filter = [{
        "bool": {
            "should": [
                {"terms": {"metadata.file_name.keyword": ['natasha.boyse@digital.trade.gov.uk/1_The_power_chapter.pdf']}},
                {"terms": {"metadata.uri.keyword": ['natasha.boyse@digital.trade.gov.uk/1_The_power_chapter.pdf']}}
            ]
        }
    }, {"term": {"metadata.chunk_resolution.keyword": "normal"}}]

## Composite query

In [51]:
final_query = {"size": ai_settings.rag_k,
        "query": {
            "bool": {
                "should": [
                    {
                        "match": {
                            "text": {
                                "query": query,
                                "boost": ai_settings.match_boost,
                            }
                        },
                    },
                    {
                        "match": {
                            "metadata.name": {
                                "query": query,
                                "boost": ai_settings.match_name_boost,
                            }
                        }
                    },
                    {
                        "match": {
                            "metadata.description": {
                                "query": query,
                                "boost": ai_settings.match_description_boost,
                            }
                        }
                    },
                    {
                        "match": {
                            "metadata.keywords": {
                                "query": query,
                                "boost": ai_settings.match_keywords_boost,
                            }
                        }
                    },
                    {
                        "knn": {
                            "vector_field": {
                            "vector": query_vector,
                            "k": ai_settings.rag_num_candidates,
                            "boost": ai_settings.knn_boost}
                        }
                    },
                ],
                "filter": query_filter,
            }
        },
    }


In [None]:

final_response = client.search(index='redbox-data-integration-chunk-current', body=final_query)
final_response
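
Because the `should` clauses of a `bool` query are summed, a rough sketch of the arithmetic may help show why the BM25 clauses can swamp the kNN clause. The scores below are illustrative only, chosen to match the orders of magnitude noted in this notebook (BM25 up to about 66, kNN between 0 and 2). `combine_raw`, `min_max` and `combine_normalised` are hypothetical helpers, and min-max normalisation is just one candidate approach to hybrid scoring, not something the current query does.

```python
def combine_raw(bm25_score: float, knn_score: float) -> float:
    """Mimic a bool/should query: clause scores are simply added."""
    return bm25_score + knn_score


def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; empty or all-equal inputs are handled explicitly."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def combine_normalised(bm25: list[float], knn: list[float], knn_weight: float = 0.5) -> list[float]:
    """Weighted sum of min-max normalised scores (an assumed approach)."""
    bm25_n, knn_n = min_max(bm25), min_max(knn)
    return [(1 - knn_weight) * b + knn_weight * k for b, k in zip(bm25_n, knn_n)]


# Illustrative values only: three chunks scored by both retrievers.
bm25_scores = [66.0, 3.3, 0.0]
knn_scores = [0.4, 1.8, 1.0]

raw = [combine_raw(b, k) for b, k in zip(bm25_scores, knn_scores)]
normalised = combine_normalised(bm25_scores, knn_scores)

print("raw ranking winner:", raw.index(max(raw)))
print("normalised ranking winner:", normalised.index(max(normalised)))
```

With the raw sum, the chunk with the large BM25 score wins regardless of its semantic score; after normalisation the chunk with the best semantic score can come out on top. Which behaviour is preferable would need testing against real queries.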

## Keyword query only

BM25 search on document text, title, description and keywords

In [9]:
keyword_final_query = {"size": ai_settings.rag_k,
        "query": {
            "bool": {
                "should": [
                    {
                        "match": {
                            "text": {
                                "query": query,
                                "boost": ai_settings.match_boost,
                            }
                        },
                    },
                    {
                        "match": {
                            "metadata.name": {
                                "query": query,
                                "boost": ai_settings.match_name_boost,
                            }
                        }
                    },
                    {
                        "match": {
                            "metadata.description": {
                                "query": query,
                                "boost": ai_settings.match_description_boost,
                            }
                        }
                    },
                    {
                        "match": {
                            "metadata.keywords": {
                                "query": query,
                                "boost": ai_settings.match_keywords_boost,
                            }
                        }
                    },
                ],
                "filter": query_filter,
            }
        },
    }

In [None]:
response_keyword = client.search(index='redbox-data-integration-chunk-current', body=keyword_final_query)
response_keyword

BM25 query on document text only

In [35]:
text_keyword_final_query = {"size": ai_settings.rag_k,
                            #"min_score":0.01,
        "query": {
            "bool": {
                "should": [
                    {
                        "match": {
                            "text": {
                                "query": query,
                                "boost": ai_settings.match_boost,
                                #"analyzer": "stop",
                                
                            }
                        },
                    }
                ],
                "filter": query_filter,
            }
        },
    }

In [None]:
response_text_keyword = client.search(index='redbox-data-integration-chunk-current', body=text_keyword_final_query)
response_text_keyword
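
The commented-out `min_score` in the query above hints at one way to drop zero-score hits. One likely explanation for those hits is that `minimum_should_match` defaults to 0 when a `bool` query has a `filter` clause, so a chunk can pass the filter without matching any `should` clause. The sketch below shows both options. The helpers `with_min_score` and `require_should_match` are hypothetical, and the threshold 0.01 is an assumption that would need tuning.

```python
from copy import deepcopy


def with_min_score(query_body: dict, min_score: float) -> dict:
    """Return a copy of a search body with the top-level min_score set."""
    body = deepcopy(query_body)
    body["min_score"] = min_score
    return body


def require_should_match(query_body: dict) -> dict:
    """Return a copy whose bool query needs at least one should clause to match."""
    body = deepcopy(query_body)
    body["query"]["bool"]["minimum_should_match"] = 1
    return body


# Stand-in for text_keyword_final_query so the sketch is self-contained.
example_body = {
    "size": 30,
    "query": {
        "bool": {
            "should": [{"match": {"text": {"query": "what is this door about"}}}],
            "filter": [{"term": {"metadata.chunk_resolution.keyword": "normal"}}],
        }
    },
}

scored_only = with_min_score(example_body, 0.01)
matched_only = require_should_match(example_body)
# Either body could then be passed to client.search(index=..., body=...)
```

Both helpers copy the body rather than mutating it, so the original query can still be run for comparison.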

## Semantic query

In [52]:
knn_final_query = {"size": ai_settings.rag_k,
                   #"min_score": 1.9,
        "query": {
            "bool": {
                "must": [
                    {
                        "knn": {
                            "vector_field": {
                            "vector": query_vector,
                            "k": ai_settings.rag_num_candidates,
                            "boost": ai_settings.knn_boost,
                            
                            }
                        }
                    },
                ],
                "filter": query_filter,
            }
        },
    }

In [None]:
response_knn = client.search(index='redbox-data-integration-chunk-current', body=knn_final_query)
response_knn

In [None]:
response_knn["hits"]["hits"]
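
For an off-topic query this list may be empty, so downstream code needs a defined path for that case. A minimal sketch, assuming the standard search response shape (`hits.hits[]._source.text`); `extract_chunks` and `build_context` are hypothetical helpers and not part of the existing codebase.

```python
from typing import Optional


def extract_chunks(response: dict) -> list[str]:
    """Pull chunk text out of a search response, tolerating missing keys."""
    hits = response.get("hits", {}).get("hits", [])
    return [hit["_source"]["text"] for hit in hits if "text" in hit.get("_source", {})]


def build_context(response: dict) -> Optional[str]:
    """Join chunks for the prompt, or return None so the caller can skip the LLM call."""
    chunks = extract_chunks(response)
    if not chunks:
        return None
    return "\n\n".join(chunks)


# Fake responses standing in for response_knn.
empty_response = {"hits": {"hits": []}}
two_hit_response = {
    "hits": {"hits": [{"_source": {"text": "first chunk"}}, {"_source": {"text": "second chunk"}}]}
}

print(build_context(empty_response))
print(build_context(two_hit_response))
```

When `build_context` returns `None`, the caller could return a fixed "no relevant content" message or route to another tool instead of prompting the LLM with no context.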

## Recommendations

- BM25 keyword search does not remove stop words, which inflates scores and returns irrelevant results. An OpenSearch analyzer should be used to remove stop words. However, applying the analyzer to the query alone still returns irrelevant chunks, probably because stop words also need to be removed from the indexed documents. Removing stop words from the indexed text would affect semantic search, so a separate field derived from the 'Text' attribute, indexed with the `stop` analyzer, is needed for keyword search.

- Keyword search returns chunks even when their score is 0. We should set `min_score` for keyword search to a low value to filter out these irrelevant chunks. When the query is irrelevant, semantic search does not return any chunks. There may be a built-in cutoff threshold for the relevance score in OpenSearch k-NN but not in BM25; this needs to be verified.

- Relevance scores from BM25 are added to relevance scores from semantic search (cosine similarity). BM25 scores can be as high as 66, while the scaled cosine similarity score from OpenSearch is between 0 and 2. Therefore, keyword search has a much greater impact on the final score than semantic search, and adding the two scores directly does not make sense. In addition, boost parameters are applied as multipliers to each score, to give more weight to semantic search and to keyword matches on document titles. The impact of these boosts needs to be investigated. Further research on the best approach to implementing hybrid search is required.

- In the short term, we can remove keyword search and keep semantic search. This would address the issue with irrelevant queries, increasing the precision and therefore the accuracy of the search. However, semantic search does not handle acronyms well, so a long-term solution using hybrid search is required.

- When implementing the short-term solution, integration testing should verify that the RAG pipeline and the gadget/agent work when no chunks are returned:
  1) Ensuring the code handles the edge case where no chunks are returned.
  2) Ensuring the LLM does not hallucinate if no chunks are returned.
  3) Ensuring the agent/gadget selects other tools (for example gov.uk) to attempt to answer the question.
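
The first recommendation could be sketched as a multi-field mapping: the existing `text` field stays as it is, and a sub-field analysed with the built-in `stop` analyzer is added for BM25. The sub-field name `text.no_stop` and the helper `stop_word_match` are hypothetical, and the parent field is assumed to be of type `text`. Existing documents would need to be reindexed before the new sub-field is populated.

```python
# Hypothetical mapping fragment: adds a stop-word-filtered sub-field to `text`.
stop_field_mapping = {
    "properties": {
        "text": {
            "type": "text",
            "fields": {
                "no_stop": {"type": "text", "analyzer": "stop"},
            },
        }
    }
}


def stop_word_match(query_text: str, boost: float = 1.0) -> dict:
    """Build a match clause against the stop-word-filtered sub-field."""
    return {"match": {"text.no_stop": {"query": query_text, "boost": boost}}}


# Applying the mapping would look like this (not run here):
# client.indices.put_mapping(index="redbox-data-integration-chunk-current", body=stop_field_mapping)
clause = stop_word_match("what is this door about")
print(clause)
```

Because the query is analysed with the same analyzer as the field it targets, matching against `text.no_stop` removes stop words from both the query and the indexed tokens without a per-query `analyzer` override.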

## References

"The bool query takes a more-matches-is-better approach, so the score from each matching must or should clause will be added together to provide the final _score for each document."
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html#:~:text=The%20bool%20query%20takes%20a,final%20_score%20for%20each%20document.



Stop analyzers
https://opensearch.org/docs/2.0/opensearch/query-dsl/text-analyzers/

"The search term is analyzed by the same analyzer that was used for the specific document field at the time it was indexed. This means that your search term goes through the same analysis process as the document’s field."
https://opensearch.org/docs/latest/query-dsl/term-vs-full-text/
