# Improve Semantic Similarity with Reverse HyDE

It is common for the documents we want to retrieve to be longer than users' queries and to have a different format. To improve the accuracy of retrieving documents based on users' queries, we will generate hypothetical queries from each document and index their vector embeddings alongside the document, a technique known as Reverse HyDE.

Note that the original [HyDE technique](https://arxiv.org/abs/2212.10496) works on the incoming user queries: it generates a hypothetical document from each query and then uses that hypothetical document to retrieve the real documents. In Reverse HyDE, the processing is done when the documents are indexed, not at retrieval time, so query latency is not affected.

### Visual improvements

We will use the [rich library](https://github.com/Textualize/rich) to make the output more readable and suppress warning messages.

In [1]:
from rich.console import Console
from rich.panel import Panel
from rich.table import Table

# Create a shared console for rich-formatted output
console = Console()

In [2]:
import openai
from typing import List, Dict
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

## Reverse HyDE Implementation

We will create a class that generates hypothetical questions and retrieves the best-matching document by semantic similarity. In a real application, we can use a vector database for embedding storage, indexing, and retrieval.
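
As a rough illustration of what a vector store would do here, the sketch below keeps one entry per embedded text (the chunk itself and each hypothetical question), with every entry pointing back to its source chunk. The `TinyIndex` class and the `toy_embed` word-count embedder are hypothetical stand-ins so the example runs without an API key; a real setup would store model embeddings in a vector database.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    return dict(Counter(w for w in words if w))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyIndex:
    """Stores (vector, source chunk) pairs computed once, at indexing time."""

    def __init__(self, embed: Callable[[str], Vector]):
        self.embed = embed
        self.entries: List[Tuple[Vector, str]] = []

    def add(self, chunk: str, questions: List[str]) -> None:
        # Index the chunk itself plus every hypothetical question.
        for text in [chunk] + questions:
            self.entries.append((self.embed(text), chunk))

    def search(self, query: str) -> str:
        # Only the query is embedded at retrieval time.
        query_vec = self.embed(query)
        best = max(self.entries, key=lambda entry: cosine(query_vec, entry[0]))
        return best[1]


index = TinyIndex(toy_embed)
index.add(
    "The mitochondria is the powerhouse of the cell.",
    ["What generates energy in a cell?"],
)
index.add(
    "Python is a high-level, interpreted programming language.",
    ["What kind of language is Python?"],
)
print(index.search("What generates energy in a cell?"))
```

The point of the design is that all document-side embeddings are computed in `add`, so `search` pays for a single embedding call per query.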

In [3]:
import re


class ReverseHyde:
    def __init__(self, api_key: str):
        # Pass the key to the client explicitly and reuse one client for all calls.
        self.client = openai.OpenAI(api_key=api_key)
        self.model = "text-embedding-ada-002"

    def get_embedding(self, text: str) -> List[float]:
        response = self.client.embeddings.create(input=text, model=self.model)
        return response.data[0].embedding

    def generate_reverse_hyde(self, chunk: str, n: int = 3) -> List[str]:
        prompt = f"""

Given the following text chunk, generate {n} different questions that this chunk would be a good answer to:

Chunk: {chunk}

Questions:
1."""

        response = self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=100,
            n=1,
            stop=None,
            temperature=0.7,
        )

        lines = response.choices[0].message.content.strip().split('\n')
        # Strip list numbering such as "1. " or "2) " when present, so the first
        # question is kept even if the model continues the "1." from the prompt.
        questions = [re.sub(r'^\s*\d+[.)]\s*', '', line).strip() for line in lines]
        questions = [q for q in questions if q]
        console.print("Prompt:", prompt, style="bold white")
        console.print("Generated Questions:", questions, style="bold green")
        return questions

    def process_chunks(self, chunks: List[str]) -> Dict[str, List[str]]:
        processed_chunks = {}
        for chunk in chunks:
            processed_chunks[chunk] = self.generate_reverse_hyde(chunk)
        return processed_chunks

    def find_best_chunk(self, query: str, processed_chunks: Dict[str, List[str]]) -> str:
        query_embedding = self.get_embedding(query)
        
        best_similarity = -1
        best_chunk = None

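        for chunk, questions in processed_chunks.items():
            # For simplicity, chunk and question embeddings are recomputed on every
            # query; in practice, compute them once at indexing time and store them.
            chunk_embedding = self.get_embedding(chunk)
            question_embeddings = [self.get_embedding(q) for q in questions]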
            
            similarities = cosine_similarity(
                [query_embedding], 
                [chunk_embedding] + question_embeddings
            )[0]
            
            max_similarity = np.max(similarities)
            
            if max_similarity > best_similarity:
                best_similarity = max_similarity
                best_chunk = chunk

        return best_chunk
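
The scoring rule used by `find_best_chunk` can be illustrated on hand-made 2-D vectors: a chunk is scored by the best match among its own embedding and the embeddings of its questions. The vectors below are made up for illustration and are not real embeddings, and `chunk_score` is a hypothetical helper name.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def chunk_score(
    query: List[float],
    chunk_vec: List[float],
    question_vecs: List[List[float]],
) -> float:
    # A chunk's score is its best match among its own embedding and its questions.
    return max(cosine(query, vec) for vec in [chunk_vec] + question_vecs)


query = [1.0, 0.0]

# Chunk A: its own embedding is far from the query, but one question is close.
score_a = chunk_score(query, [0.0, 1.0], [[0.9, 0.1], [0.2, 0.8]])
# Chunk B: moderately close on its own, with no especially helpful question.
score_b = chunk_score(query, [0.6, 0.8], [[0.5, 0.5]])

print(f"chunk A: {score_a:.3f}, chunk B: {score_b:.3f}")
```

Chunk A wins here only because of its hypothetical question, which is the effect Reverse HyDE relies on when documents and queries are phrased differently.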

Load the API key from an environment variable (`load_dotenv` also reads a local `.env` file, if present).

In [4]:
from dotenv import load_dotenv

load_dotenv()

True

## Enriching the document index with LLM-generated hypothetical questions

In [5]:
import os
# Usage example
api_key = os.getenv("OPENAI_API_KEY")
reverse_hyde = ReverseHyde(api_key)

chunks = [
    "The mitochondria is the powerhouse of the cell.",
    "Python is a high-level, interpreted programming language.",
    "The American Civil War lasted from 1861 to 1865."
]

processed_chunks = reverse_hyde.process_chunks(chunks)
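
Because question generation calls the LLM once per chunk, it can be worth saving the result so the index does not have to be rebuilt on every run. A minimal sketch, assuming a hypothetical file name `processed_chunks.json` and a small hand-written sample in place of real LLM output:

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def save_index(processed: Dict[str, List[str]], path: Path) -> None:
    # Keys are chunk texts and values are lists of question strings,
    # so the mapping can be written as JSON without conversion.
    path.write_text(json.dumps(processed, indent=2, ensure_ascii=False), encoding="utf-8")


def load_index(path: Path) -> Dict[str, List[str]]:
    return json.loads(path.read_text(encoding="utf-8"))


# Illustrative sample only; in the notebook this would be `processed_chunks`.
sample = {
    "The American Civil War lasted from 1861 to 1865.": [
        "When did the American Civil War take place?",
        "How long did the American Civil War last?",
    ]
}

index_path = Path(tempfile.gettempdir()) / "processed_chunks.json"
save_index(sample, index_path)
restored = load_index(index_path)
print(restored == sample)
```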

## Query the enriched index

Once we have an index with multiple hypothetical questions for each document, we can use it to retrieve a document based on a real user's query.

In [8]:
query = "What generates energy in a cell?"
best_chunk = reverse_hyde.find_best_chunk(query, processed_chunks)

# Create a table for both query and best match
table = Table(show_header=True, header_style="bold yellow")
table.add_column("Query", style="bright_cyan", width=30)
table.add_column("Best Matching Chunk", style="bright_yellow", width=50)
table.add_row(query, best_chunk)

# Create a panel for the table
panel = Panel(
    table,
    title="[bold]Query and Best Match",
    border_style="white",
    expand=False
)

# Print the panel
console.print(panel)
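
Returning a single best chunk can be too strict when several chunks are relevant. One possible variation is to rank all chunks by their best similarity and keep the top few. The sketch below assumes the per-chunk similarity values have already been computed; the numbers are made up, and `top_k_chunks` is a hypothetical helper, not part of the class above.

```python
import heapq
from typing import Dict, List, Tuple


def top_k_chunks(similarities: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Rank chunks by their best similarity and return the top k."""
    scored = ((chunk, max(sims)) for chunk, sims in similarities.items() if sims)
    return heapq.nlargest(k, scored, key=lambda item: item[1])


# Made-up values: one list per chunk, covering the chunk embedding
# and the embeddings of its hypothetical questions.
similarities = {
    "The mitochondria is the powerhouse of the cell.": [0.81, 0.93, 0.88],
    "Python is a high-level, interpreted programming language.": [0.70, 0.72],
    "The American Civil War lasted from 1861 to 1865.": [0.69, 0.74, 0.71],
}

for chunk, score in top_k_chunks(similarities, k=2):
    print(f"{score:.2f}  {chunk}")
```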