# Hybrid Search with LangChain

This notebook demonstrates how to build a hybrid search pipeline using LangChain, Elasticsearch, and SentenceTransformers embeddings. You will learn how to index FAQ documents, generate semantic embeddings, and perform advanced retrieval that combines both vector (semantic) and keyword (lexical) search.

**Notebook Overview:**
- Install and import all required libraries and packages.
- Load and embed FAQ documents using a transformer model.
- Set up an Elasticsearch index to store both text and vector representations.
- Index documents and perform hybrid search queries that leverage both semantic and keyword matching.
- Use LangChain's abstractions to simplify retrieval and evaluation workflows.
- Evaluate retrieval performance using standard information retrieval metrics (Hit Rate, MRR).

**Why use LangChain?**
- **Unified Abstractions:** LangChain provides high-level interfaces for connecting to vector stores, retrievers, and embedding models, reducing boilerplate code.
- **Hybrid Search Support:** Easily combine vector and keyword search strategies for more robust and accurate retrieval.
- **Extensibility:** LangChain integrates with a wide range of backends (Elasticsearch, Pinecone, FAISS, etc.) and supports custom pipelines.
- **Productivity:** Simplifies complex retrieval workflows, making it easier to experiment, prototype, and scale advanced RAG (Retrieval-Augmented Generation) systems.

By the end of this notebook, you will understand how to use LangChain to build hybrid search that combines the strengths of semantic and lexical retrieval.

Start by installing LangChain and the related packages needed for hybrid search with Elasticsearch and Hugging Face embeddings.

In [None]:
%pip install -qU langchain langchain-elasticsearch langchain-huggingface sentence-transformers elasticsearch pandas tqdm

In [None]:
import json
import pandas as pd
from tqdm.auto import tqdm
from sentence_transformers import SentenceTransformer
from elasticsearch import Elasticsearch

## 1. Indexing stage

### Load FAQ Documents and Embedding Model
Load the FAQ documents from a JSON file into a Python list, then load the SentenceTransformer model that generates vector embeddings for the FAQ documents.

In [None]:
with open('documents-with-ids.json', 'rt') as f_in:
    documents = json.load(f_in)
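
The later cells read the fields `text`, `section`, `question`, `course`, and `id` from each record. A minimal sketch of a sanity check for that shape follows; the `missing_fields` helper and the sample record are hypothetical and not taken from `documents-with-ids.json`.

```python
# Fields that the indexing and retrieval cells read from each FAQ record.
REQUIRED_FIELDS = {"text", "section", "question", "course", "id"}


def missing_fields(record: dict) -> set:
    """Return the required fields that are absent from a record."""
    return REQUIRED_FIELDS - set(record)


# Hypothetical record, for illustration only.
sample = {
    "text": "Yes, you can still join.",
    "section": "General course-related questions",
    "question": "Course - Can I still join?",
    "course": "data-engineering-zoomcamp",
    "id": "abc12345",
}

complete = missing_fields(sample)
incomplete = missing_fields({"question": "record without an answer"})
print(complete, sorted(incomplete))
```

Running a check like this over `documents` before embedding would surface malformed records early, instead of as a `KeyError` in the middle of the indexing loop.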

In [None]:
model_name = 'multi-qa-MiniLM-L6-cos-v1'
model = SentenceTransformer(model_name)

### Generate Embeddings for Each Document
This cell generates vector embeddings for the question, answer text, and their concatenation for each FAQ document. These embeddings are used for semantic search.

In [None]:
for doc in tqdm(documents):
    question = doc['question']
    text = doc['text']
    qt = question + ' ' + text

    doc['question_vector'] = model.encode(question)
    doc['text_vector'] = model.encode(text)
    doc['question_text_vector'] = model.encode(qt)

  0%|          | 0/948 [00:00<?, ?it/s]

### Set Up Elasticsearch Index
This cell connects to a local Elasticsearch instance and creates an index with the appropriate settings and mappings for storing the embedded FAQ data.

In [None]:
es_client = Elasticsearch('http://localhost:9200')

index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "id": {"type": "keyword"},
            "question_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
            "text_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
            "question_text_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
        }
    }
}

index_name = "course-questions"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})
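
For reference, the `cosine` similarity set in the mapping compares vectors by direction rather than magnitude. The sketch below uses toy 3-dimensional vectors (the real embeddings have 384 dimensions). The rescaling to a non-negative score follows the formula Elasticsearch documents for this option; the helper names are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def es_cosine_score(a, b):
    """Rescale cosine from [-1, 1] to [0, 1], as documented for `cosine` fields."""
    return (1 + cosine_similarity(a, b)) / 2


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
orthogonal_score = es_cosine_score([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal, orthogonal_score)
```

Vectors pointing the same way score close to 1 regardless of length, which is why cosine suits sentence embeddings.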

In [None]:
# Index FAQ documents, along with metadata and generated embeddings, into the Elasticsearch index
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/948 [00:00<?, ?it/s]

## 2. Retrieval stage

In [None]:
# Import the LangChain modules for embeddings and Elasticsearch retrieval used by the hybrid search
from typing import Dict

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_elasticsearch import ElasticsearchRetriever

In [None]:
es_url = 'http://localhost:9200'

In [None]:
query = 'I just discovered the course. Can I still join it?'
course = "data-engineering-zoomcamp"

### Load Embedding Model
This cell loads the same SentenceTransformer model through LangChain's `HuggingFaceEmbeddings` wrapper, provided by the `langchain-huggingface` package installed above. It embeds user queries at search time, so query vectors live in the same space as the indexed document vectors.

In [None]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

### Construct Hybrid Search Query and Retriever
This cell defines a function to construct a hybrid search query combining vector and keyword search, and sets up the LangChain ElasticsearchRetriever for hybrid retrieval.

In [None]:
def hybrid_query(search_query: str) -> Dict:
    vector = embeddings.embed_query(search_query)  # same embeddings as for indexing
    return {
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": search_query,
                        "fields": ["question", "text", "section"],
                        "type": "best_fields",
                        "boost": 0.5,
                    }
                },
                "filter": {
                    "term": {
                        "course": course
                    }
                }
            }
        },
        "knn": {
            "field": "question_text_vector",
            "query_vector": vector,
            "k": 5,
            "num_candidates": 10000,
            "boost": 0.5,
            "filter": {
                "term": {
                    "course": course
                }
            }
        },
        "size": 5,
        # "rank": {"rrf": {}},
    }


hybrid_retriever = ElasticsearchRetriever.from_es_params(
    index_name=index_name,
    body_func=hybrid_query,
    content_field='text',
    url=es_url,
)
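
When a request carries both `query` and `knn`, Elasticsearch combines the two result sets as a disjunction and scores each hit as the sum of its boosted keyword score and its boosted kNN score. The sketch below illustrates that arithmetic. The function name, document IDs, and score values are made up for illustration.

```python
def combine_scores(keyword_scores, knn_scores, keyword_boost=0.5, knn_boost=0.5):
    """Sum boosted scores per document and rank by the combined score.

    A document found by only one side keeps just that side's boosted score.
    """
    combined = {}
    for doc_id, score in keyword_scores.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + keyword_boost * score
    for doc_id, score in knn_scores.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + knn_boost * score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores: BM25-style keyword scores are unbounded,
# while kNN scores on a cosine field stay within [0, 1].
keyword_scores = {"doc_a": 12.0, "doc_b": 8.0}
knn_scores = {"doc_a": 0.5, "doc_c": 1.0}

ranking = combine_scores(keyword_scores, knn_scores)
print(ranking)
```

Because the two score scales differ, equal boosts of `0.5` do not imply equal influence: in this toy example the keyword side dominates the ranking. Rank-based fusion is one way to sidestep the scale mismatch.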

### Display Hybrid Search Results
Run the retriever on the sample query, then print the question, course name, and score of each hit.

In [None]:
hybrid_results = hybrid_retriever.invoke(query)

In [None]:
for result in hybrid_results:
    print(result.metadata['_source']['question'], result.metadata['_source']['course'], result.metadata['_score'])

Course - Can I still join the course after the start date? data-engineering-zoomcamp 12.559245
Course - Can I follow the course after it finishes? data-engineering-zoomcamp 9.39959
Course - What can I do before the course starts? data-engineering-zoomcamp 7.306914
Course - Can I get support if I take the course in the self-paced mode? data-engineering-zoomcamp 7.1085525
Course - When will the course start? data-engineering-zoomcamp 6.7513986
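
The `hybrid_query` body contains a commented-out `"rank": {"rrf": {}}` entry. Reciprocal rank fusion (RRF) merges result lists by rank position instead of raw score. The sketch below shows the idea, assuming the common formulation with a rank constant of 60. The function name and document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, rank_constant=60):
    """Fuse ranked lists: each appearance adds 1 / (rank_constant + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical result lists from the keyword and vector sides.
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Documents that appear in both lists rise to the top, and the differing scales of BM25 and cosine scores no longer matter because only positions are used.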


## 3. Evaluation stage

### Load Ground Truth Data for Evaluation
Load and convert the ground truth data from a CSV file, which will be used to evaluate the retrieval performance.

In [None]:
df_ground_truth = pd.read_csv('ground-truth-data.csv')

In [None]:
# Convert the ground truth DataFrame into a list of dictionaries for easier iteration during evaluation
ground_truth = df_ground_truth.to_dict(orient='records')

### Define Hit Rate Metric
This cell defines the hit rate metric, which measures the fraction of queries with at least one relevant result in the top-k retrieved documents.

In [None]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)
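
As a worked example, the sketch below applies the same hit rate definition to four made-up queries with three results each. The relevance lists are invented for illustration.

```python
def hit_rate(relevance_total):
    """Fraction of queries whose result list contains at least one relevant hit."""
    hits = sum(1 for line in relevance_total if True in line)
    return hits / len(relevance_total)


# One list per query; True marks the position of the relevant document.
relevance_total = [
    [False, True, False],   # relevant document at rank 2
    [False, False, False],  # relevant document not retrieved
    [True, False, False],   # relevant document at rank 1
    [False, False, True],   # relevant document at rank 3
]

score = hit_rate(relevance_total)
print(score)
```

Three of the four queries contain a relevant result, so the hit rate is 3/4. The metric ignores where in the list the hit occurs.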

### Define Mean Reciprocal Rank (MRR) Metric
This cell defines the Mean Reciprocal Rank (MRR) metric, which measures the average reciprocal rank of the first relevant result for each query.

In [None]:
def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank, is_relevant in enumerate(line):
            if is_relevant:
                # Count only the first relevant hit for each query
                total_score += 1 / (rank + 1)
                break

    return total_score / len(relevance_total)
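
As a worked example, the sketch below computes MRR for four made-up queries. The relevance lists are invented for illustration, and only the first relevant hit per query contributes.

```python
def mrr(relevance_total):
    """Mean of 1 / rank of the first relevant hit; a miss contributes 0."""
    total_score = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total_score += 1 / rank
                break
    return total_score / len(relevance_total)


relevance_total = [
    [False, True, False],   # first relevant hit at rank 2 -> 1/2
    [False, False, False],  # no relevant hit             -> 0
    [True, False, False],   # first relevant hit at rank 1 -> 1
    [False, False, True],   # first relevant hit at rank 3 -> 1/3
]

score = mrr(relevance_total)
print(round(score, 4))
```

The result is (1/2 + 0 + 1 + 1/3) / 4 = 11/24. Unlike hit rate, MRR rewards placing the relevant document near the top.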

### Define Hybrid Retrieval Function
This cell defines a function to perform hybrid search in Elasticsearch by combining both dense vector (semantic) and keyword (lexical) search using LangChain.

In [None]:
def elastic_search_hybrid(field, query, course):
    def hybrid_query(search_query: str) -> Dict:
        vector = embeddings.embed_query(search_query)  # same embeddings as for indexing
        return {
            "query": {
                "bool": {
                    "must": {
                        "multi_match": {
                            "query": search_query,
                            "fields": ["question", "text", "section"],
                            "type": "best_fields",
                            "boost": 0.5,
                        }
                    },
                    "filter": {
                        "term": {
                            "course": course
                        }
                    }
                }
            },
            "knn": {
                "field": field,
                "query_vector": vector,
                "k": 5,
                "num_candidates": 10000,
                "boost": 0.5,
                "filter": {
                    "term": {
                        "course": course
                    }
                }
            },
            "size": 5,
            "_source": ["text", "section", "question", "course", "id"],
            # "rank": {"rrf": {}},
        }


    hybrid_retriever = ElasticsearchRetriever.from_es_params(
        index_name=index_name,
        body_func=hybrid_query,
        content_field='text',
        url=es_url,
    )

    hybrid_results = hybrid_retriever.invoke(query)

    result_docs = []

    for hit in hybrid_results:
        result_docs.append(hit.metadata['_source'])

    return result_docs

In [None]:
ground_truth[0]

{'question': 'When does the course begin?',
 'course': 'data-engineering-zoomcamp',
 'document': 'c02e79ef'}

In [None]:
question = ground_truth[0]['question']
course = ground_truth[0]['course']
elastic_search_hybrid('question_text_vector', question, course)

[{'section': 'General course-related questions',
  'question': 'Course - When will the course start?',
  'course': 'data-engineering-zoomcamp',
  'id': 'c02e79ef'},
 {'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  'id': '7842b56a'},
 {'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp',
  'id': 'a482086d'},
 {'section': 'Module 1: Docker and Terraform',
  'question': 'PGCLI - error column c.relhasoids does not exist',
  'course': 'data-engineering-zoomcamp',
  'id': 'c91ad8f2'},
 {'section': 'General course-related questions',
  'question': 'Course - What are the prerequisites for this course?',
  'course': 'data-engineering-zoomcamp',
  'id': '1f6520ca'}]

### Helper Function for Hybrid Search
This cell defines a helper function that runs hybrid search for a ground-truth query, using the embedding of the concatenated question and answer text as the vector field.

In [None]:
def question_text_hybrid(q):
    question = q['question']
    course = q['course']

    return elastic_search_hybrid('question_text_vector', question, course)

### Define Evaluation Function
This cell defines a function to evaluate the retrieval performance of a search function using hit rate and MRR metrics against the ground truth data.

In [None]:
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
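
The evaluation loop can be tried without Elasticsearch by passing a stub search function. The sketch below does this with compact versions of the two metrics. The `fake_search` function, its canned results, and the two ground-truth rows are hypothetical.

```python
def hit_rate(relevance_total):
    return sum(1 for line in relevance_total if True in line) / len(relevance_total)


def mrr(relevance_total):
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance_total)


def evaluate(ground_truth, search_function):
    """Run every ground-truth query and compare result IDs to the expected ID."""
    relevance_total = []
    for q in ground_truth:
        results = search_function(q)
        relevance_total.append([d["id"] == q["document"] for d in results])
    return {"hit_rate": hit_rate(relevance_total), "mrr": mrr(relevance_total)}


# Hypothetical canned results keyed by question text.
CANNED = {
    "q1": [{"id": "d2"}, {"id": "d1"}],  # expected document at rank 2
    "q2": [{"id": "d3"}, {"id": "d4"}],  # expected document missing
}


def fake_search(q):
    return CANNED[q["question"]]


toy_ground_truth = [
    {"question": "q1", "course": "c", "document": "d1"},
    {"question": "q2", "course": "c", "document": "d9"},
]

metrics = evaluate(toy_ground_truth, fake_search)
print(metrics)
```

Any function that accepts a ground-truth row and returns a list of dictionaries with an `id` key can be plugged in the same way, which makes it straightforward to compare keyword-only, vector-only, and hybrid retrieval.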

In [None]:
# Evaluate the hybrid search function using the ground truth data and print the hit rate and MRR metrics
evaluate(ground_truth, question_text_hybrid)

  0%|          | 0/4627 [00:00<?, ?it/s]

{'hit_rate': 0.9250054030689432, 'mrr': 0.8506231539514445}

On this ground-truth set, hybrid search with Elasticsearch reaches a hit rate of about 0.925 and an MRR of about 0.851.