# RAG

Retrieval-Augmented Generation (RAG) is a technique that enhances a Large Language Model's (LLM) ability to answer questions by providing it with relevant, up-to-date information from an external knowledge source. Instead of relying solely on the model's pre-trained knowledge, RAG first retrieves relevant documents and then uses that information to augment the prompt, allowing the LLM to generate a more accurate and contextually grounded response.
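The "augment" half of RAG amounts to placing the retrieved text in the prompt ahead of the question. The sketch below shows one way to assemble such a prompt. The function name `build_rag_prompt` and the template wording are hypothetical choices for illustration, not a format required by Llama 2 or llama-cpp-python.

```python
def build_rag_prompt(question: str, context_docs: list[str]) -> str:
    """Assemble a prompt that grounds the question in retrieved documents.

    Hypothetical template: real prompts are usually tuned to the specific model.
    """
    # Number each retrieved passage so it can be referred to in the answer.
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(context_docs))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_rag_prompt(
    "Where is Hitchin?",
    ["Hitchin is a market town in the North Hertfordshire district in Hertfordshire, England."],
)
print(prompt)
```

The resulting string would then be passed to the LLM as its input, so the model answers from the supplied context rather than only from what it memorized during training.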

This notebook implements the retrieval step of RAG. We convert the user's query and a small collection of documents into numerical vectors called embeddings, then compute the cosine similarity between the query's embedding and each document's embedding. The document with the highest similarity is treated as the one most semantically relevant to the question. In a full RAG pipeline, this retrieved text would then be inserted into the prompt as context for the LLM to generate its final answer; that generation step is not performed here.
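For reference, the cosine similarity between a query embedding $\mathbf{q}$ and a document embedding $\mathbf{d}$ is their dot product divided by the product of their norms. It ranges from $-1$ to $1$, and values closer to $1$ mean the two vectors point in more similar directions. A small worked example with made-up three-dimensional vectors (Llama 2 7B embeddings have 4096 dimensions):

$$
\cos(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}
$$

With $\mathbf{q} = (1, 2, 2)$ and $\mathbf{d} = (2, 1, 2)$:

$$
\cos(\mathbf{q}, \mathbf{d}) = \frac{1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2}{\sqrt{1^2 + 2^2 + 2^2}\,\sqrt{2^2 + 1^2 + 2^2}} = \frac{8}{3 \cdot 3} = \frac{8}{9} \approx 0.889
$$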

In [None]:
from pathlib import Path

import numpy as np
from llama_cpp import Llama

In [None]:
MODEL_ROOT = Path("../llama-cpp-python/models")
assert MODEL_ROOT.exists()

In [None]:
model_path = MODEL_ROOT / "text_gen/llama/llama-2-7b.Q4_0.gguf"
assert model_path.exists()

In [None]:
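llm = Llama(
    model_path=str(model_path),
    embedding=True,  # Enable embedding generation
    n_ctx=2048,  # Context window size in tokens
    verbose=True,  # Print llama.cpp loading and timing logs
    n_gpu_layers=-1,  # Offload all layers to the GPU, if llama.cpp was built with GPU support
)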

In [None]:
documents = [
    "The new Orion spacecraft is designed for deep-space missions to the Moon and Mars.",
    "The James Webb Space Telescope allows us to see the first galaxies ever formed.",
    "Big Ben is the nickname for the Great Bell of the striking clock at the north end of the Palace of Westminster.",
    "Hitchin is a market town in the North Hertfordshire district in Hertfordshire, England.",
]

# Create embeddings for each document
doc_embeddings = [llm.embed(doc) for doc in documents]
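Depending on the model's pooling setting, `llm.embed()` on a single string may return either one vector or one vector per token. For a base generative model such as Llama 2, llama.cpp typically applies no pooling, so each text comes back as a (tokens × dimensions) matrix, which appears to be why the similarity cell below averages over axis 0. The hypothetical helper `pool_embedding` makes that mean-pooling step explicit and also accepts an already-pooled 1-D vector. It is demonstrated on random stand-in arrays, not real model output.

```python
import numpy as np


def pool_embedding(embedding) -> np.ndarray:
    """Return a single 1-D vector for an embedding of either shape.

    - 2-D input (n_tokens, dim): mean-pool over the token axis.
    - 1-D input (dim,): already pooled, returned unchanged.
    """
    arr = np.asarray(embedding, dtype=float)
    if arr.ndim == 2:
        return arr.mean(axis=0)
    if arr.ndim == 1:
        return arr
    raise ValueError(f"Unexpected embedding shape: {arr.shape}")


# Stand-in data: 5 tokens x 8 dimensions, plus an already-pooled 8-d vector.
rng = np.random.default_rng(0)
per_token = rng.normal(size=(5, 8))
pooled = rng.normal(size=8)

print(pool_embedding(per_token).shape)  # (8,)
print(pool_embedding(pooled).shape)     # (8,)
```

Handling both shapes keeps the similarity code working if you later switch to a dedicated embedding model whose GGUF file specifies a pooling type.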

In [None]:
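user_query = "Where is Hitchin?"
query_embedding = llm.embed(user_query)

# Simple cosine similarity search
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# llama-2-7b has no pooling configured, so embed() is expected to return one
# vector per token; mean-pool over the token axis to get one vector per text.
query_vec = np.array(query_embedding).mean(axis=0)
doc_vecs = [np.array(doc_emb).mean(axis=0) for doc_emb in doc_embeddings]

# Score every document against the query; the highest score wins.
similarities = [cosine_similarity(query_vec, doc_vec) for doc_vec in doc_vecs]
most_relevant_doc_index = int(np.argmax(similarities))
retrieved_context = documents[most_relevant_doc_index]

print(f"🔍 Most relevant document found: '{retrieved_context}'")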
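Picking only the single best match with `np.argmax` is enough for four documents, but retrieval pipelines usually return the top k passages. The sketch below vectorizes the search, assuming each document has already been pooled into one 1-D vector. `top_k_similar` is a hypothetical helper and the toy vectors are invented for illustration.

```python
import numpy as np


def top_k_similar(query_vec, doc_matrix, k=2):
    """Return (indices, scores) of the k documents most similar to the query.

    doc_matrix has shape (n_docs, dim); query_vec has shape (dim,).
    """
    # Normalize so that a plain dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    # argsort is ascending: reverse for descending order, then keep the first k.
    top = np.argsort(scores)[::-1][:k]
    return top, scores[top]


# Toy 3-d vectors standing in for pooled document embeddings.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query = np.array([1.0, 0.1, 0.0])

idx, sc = top_k_similar(query, docs, k=2)
print(idx)  # [0 2]
```

In this notebook, the inputs would be the mean-pooled query vector and `np.stack` of the mean-pooled document vectors; the selected documents could then all be placed in the prompt as context.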