# CRAG (Compositional Retrieval-Augmented Generation) in RAG

CRAG (Compositional Retrieval-Augmented Generation) is an advanced variant of RAG (Retrieval-Augmented Generation) that enhances retrieval and response quality by combining multiple information sources. It is particularly useful for complex queries that require multiple reasoning steps.

🔹 How CRAG Works

✅ Step 1: Query Decomposition
Breaks down a complex query into simpler sub-queries.
Uses LLMs or dependency parsers to split queries.

✅ Step 2: Multi-Hop Retrieval
Retrieves relevant documents for each sub-query.
Uses FAISS, BM25, or Dense Retrieval models for search.

✅ Step 3: Context Fusion
Merges the documents retrieved for each sub-query into a single context.
Uses rerankers (such as ColBERT) to keep the most relevant results.

✅ Step 4: Final Generation
Passes the fused context to a generative model (e.g., GPT, LLaMA) for coherent response synthesis.


🔹 Example: CRAG in Action
Query:
"How does quantum computing impact cryptography?"

CRAG Breakdown:

1️⃣ Sub-query 1: "What is quantum computing?"

2️⃣ Sub-query 2: "How does quantum computing affect encryption?"

3️⃣ Retrieve documents, merge the information, and generate the final response.
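The breakdown above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: `decompose` is a hypothetical lookup table standing in for an LLM-based decomposer, `retrieve` scores documents by simple token overlap rather than FAISS or BM25, and `CORPUS`, `fuse`, and `crag_context` are names invented for this example.

```python
import re

# Toy corpus (assumption); in the notebook this role is played by `documents`.
CORPUS = [
    "Quantum computing uses quantum bits (qubits) to perform computations.",
    "Shor's Algorithm enables quantum computers to break RSA encryption.",
    "Post-quantum cryptography focuses on quantum-resistant encryption methods.",
    "Transformers are powerful deep learning models.",
]

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def decompose(query):
    """Hypothetical stand-in for an LLM decomposer: a fixed lookup table."""
    table = {
        "How does quantum computing impact cryptography?": [
            "What is quantum computing?",
            "How does quantum computing affect encryption?",
        ]
    }
    return table.get(query, [query])

def retrieve(sub_query, top_k=3):
    """Rank corpus entries by token overlap with the sub-query (toy scoring)."""
    q = tokenize(sub_query)
    scored = [(len(q & tokenize(doc)), doc) for doc in CORPUS]
    scored = [pair for pair in scored if pair[0] > 0]  # drop non-matching docs
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

def fuse(results_per_sub_query):
    """Merge hits across sub-queries, keeping each document once with its best score."""
    best = {}
    for hits in results_per_sub_query:
        for score, doc in hits:
            best[doc] = max(score, best.get(doc, 0))
    return [doc for doc, _ in sorted(best.items(), key=lambda kv: kv[1], reverse=True)]

def crag_context(query):
    """Decompose the query, retrieve per sub-query, and fuse the results."""
    return fuse([retrieve(sq) for sq in decompose(query)])

fused = crag_context("How does quantum computing impact cryptography?")
print("\n".join(fused))
```

The fused list would then be joined into the prompt for the generator. Keeping the best score per document is one possible fusion rule; a reranker could replace it.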



In [None]:
!pip install sentence-transformers faiss-cpu transformers


In [None]:
# Load the retrieval (embedding) model and the generation model
from transformers import pipeline
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

retriever = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
generator = pipeline("text2text-generation", model="google/flan-t5-base")



In [None]:
# Create a simple QA knowledge base
documents = [
    "Quantum computing uses quantum bits (qubits) to perform computations.",
    "Classical cryptography relies on mathematical hardness assumptions.",
    "Quantum computers can factorize large numbers exponentially faster.",
    "Shor’s Algorithm enables quantum computers to break RSA encryption.",
    "Post-quantum cryptography focuses on quantum-resistant encryption methods.",
    "Lattice-based cryptography is one approach to post-quantum security."
]

#encode documents into vectors
doc_embeddings = retriever.encode(documents, convert_to_numpy = True)

#create faiss index
index = faiss.IndexFlatL2(doc_embeddings.shape[1])
index.add(doc_embeddings)
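For intuition, `faiss.IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances. A rough pure-Python equivalent is sketched below; `flat_l2_search` and `toy_vectors` are hypothetical names used only for illustration, and the sketch ignores everything FAISS does for speed.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, top_k):
    """Compare the query against every stored vector and return the top_k closest."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    distances = [d for d, _ in scored[:top_k]]
    indices = [i for _, i in scored[:top_k]]
    return distances, indices

toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = flat_l2_search(toy_vectors, [1.0, 1.0], top_k=2)
print(distances, indices)  # [1.0, 2.0] [1, 0]
```

Smaller distances mean closer matches, which is why the retrieval function in the next cell can use the returned indices in order.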

In [None]:
# Define the CRAG retrieval function
def retrieve_docs(query, top_k=4):
    # Embed the query and look up its nearest neighbours in the FAISS index
    query_embedding = retriever.encode([query], convert_to_numpy=True)
    distances, indices = index.search(query_embedding, top_k)
    return [documents[i] for i in indices[0]]

#query
query = "How does quantum computing impact cryptography?"
retrieved_docs = retrieve_docs(query)
print(retrieved_docs)

['Post-quantum cryptography focuses on quantum-resistant encryption methods.', 'Quantum computing uses quantum bits (qubits) to perform computations.', 'Shor’s Algorithm enables quantum computers to break RSA encryption.', 'Lattice-based cryptography is one approach to post-quantum security.']


In [None]:
# Generate an answer using the retrieved context
def generate_response(query, retrieved_docs):

    # Combine retrieved documents into a single context string
    context = "\n".join(retrieved_docs)

    # Construct a more detailed prompt
    prompt = (
        "You are an AI assistant. Use the following retrieved documents to answer the question.\n\n"
        f"Retrieved Documents:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )
    response = generator(prompt, max_length=100)[0]['generated_text']
    return response

response = generate_response(query,retrieved_docs)
print(response)

uses quantum bits (qubits) to perform computations


# Adaptive RAG

Adaptive RAG is an advanced form of RAG that dynamically adjusts retrieval and generation strategies based on the complexity of a query, user context, or available knowledge. Unlike static RAG models that retrieve a fixed number of documents and generate answers, Adaptive RAG optimizes both retrieval and response generation for better efficiency and accuracy.

Key Concepts of Adaptive RAG

Dynamic Retrieval Selection:

Instead of fetching a fixed number of documents, Adaptive RAG varies how many documents it retrieves according to query complexity.
It can switch between different retrieval strategies (BM25, Dense Retrieval, Hybrid).

Adaptive Query Reformulation:

The model may rewrite the query dynamically to improve retrieval.
It can break down complex queries into sub-questions.

Context-Aware Generation:

The system adapts generation based on retrieved content, user history, or prior responses.
It can adjust response length and style depending on the use case.

Feedback Integration:

The model learns from user interactions to improve retrieval and generation over time.
It can adapt based on explicit signals (user feedback) or implicit ones (engagement metrics).
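One common way to realise the hybrid strategy mentioned under Dynamic Retrieval Selection is reciprocal rank fusion (RRF), which merges ranked lists from different retrievers. The sketch below is an assumption about how this could be done, not something this notebook implements; the document ids and the two input rankings are made up, and `k=60` is simply a commonly used smoothing constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each document scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical rankings from a sparse (BM25) and a dense retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear near the top of both lists rise to the front, while documents found by only one retriever are still kept.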

How Adaptive RAG Works

Query Understanding:
The system classifies the query (simple, complex, multi-turn, vague).

Dynamic Retrieval Adjustment:
If the query is complex, it retrieves more documents or uses a hybrid retrieval method.

Context Incorporation:
Previous conversation history is considered.

Adaptive Response Generation:
The system tailors the response based on confidence, knowledge coverage, and user needs.

Feedback Loop:
Adjusts retrieval and generation based on past interactions.
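The first two stages and the feedback loop can be sketched as follows. Everything here is a simplified assumption: the word-count thresholds, the `STRATEGIES` table, and the `FeedbackTuner` class are hypothetical and serve only to show how a query label and past feedback might drive the retrieval depth.

```python
def classify_query(query, history=()):
    """Label a query as multi-turn, vague, complex, or simple (toy heuristics)."""
    words = query.split()
    if history:
        return "multi-turn"
    if len(words) < 3:
        return "vague"
    if len(words) > 7:
        return "complex"
    return "simple"

# Hypothetical mapping from query class to retrieval settings.
STRATEGIES = {
    "simple": {"k": 2, "retriever": "dense"},
    "vague": {"k": 3, "retriever": "hybrid"},
    "complex": {"k": 5, "retriever": "hybrid"},
    "multi-turn": {"k": 4, "retriever": "dense"},
}

class FeedbackTuner:
    """Raises retrieval depth after unhelpful answers and lowers it after helpful ones."""

    def __init__(self, min_k=1, max_k=8):
        self.min_k = min_k
        self.max_k = max_k
        self.offset = 0

    def record(self, helpful):
        self.offset += -1 if helpful else 1

    def adjust(self, k):
        return max(self.min_k, min(self.max_k, k + self.offset))

def plan_retrieval(query, tuner, history=()):
    """Pick a strategy for the query class, then apply the feedback offset to k."""
    label = classify_query(query, history)
    plan = dict(STRATEGIES[label], label=label)
    plan["k"] = tuner.adjust(plan["k"])
    return plan

tuner = FeedbackTuner()
first_plan = plan_retrieval("How does quantum computing work?", tuner)
tuner.record(helpful=False)  # the user marked the answer as unhelpful
second_plan = plan_retrieval("How does quantum computing work?", tuner)
print(first_plan["k"], second_plan["k"])  # 2 3
```

A production system would likely replace the word-count classifier with a learned one, but the control flow would look similar.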


In [None]:
!pip install -q langchain langchain-community transformers faiss-cpu

In [None]:
# Set up the FAISS vector index
# (imports use the langchain-community and langchain-text-splitters packages)
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from transformers import AutoTokenizer
# Load sample knowledge
documents = ["Quantum computing uses qubits.",
             "Post-quantum cryptography is resistant to quantum attacks.",
             "Transformers are powerful deep learning models."]

#convert into langchain document format
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=10)
docs = text_splitter.create_documents(documents)

# Use Hugging Face embeddings
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")


#create faiss vector store
vector_store = FAISS.from_documents(docs, embedding_model)

In [None]:
#define adaptive retrieval function
def adaptive_retrieve(query, complexity="auto"):
    """Dynamically adjust retrieval based on query complexity."""
    if complexity == "auto":
        complexity = "high" if len(query.split()) > 7 else "low"

    k = 5 if complexity == "high" else 2  # Retrieve more for complex queries
    docs = vector_store.similarity_search(query, k=k)
    return [doc.page_content for doc in docs]


In [None]:
# Define adaptive response generation
from transformers import pipeline

# Load the Hugging Face extractive QA model
qa_model = pipeline("question-answering", model="deepset/roberta-base-squad2")

def adaptive_generate(query, retrieved_docs):
    if isinstance(retrieved_docs, list):
        retrieved_docs = " ".join(retrieved_docs)  # Convert list to a single string

    # The question-answering pipeline takes a question and a context
    qa_input = {
        "question": query,
        "context": retrieved_docs
    }

    output = qa_model(qa_input)
    print("Model Output:", output)  # Debugging step

    # Use .get() so a missing 'answer' key does not raise a KeyError
    response = output.get('answer', 'No response generated')

    return response



In [None]:
#test adaptive rag system
query = "How does quantum computing work?"
retrieved_docs = adaptive_retrieve(query)
response = adaptive_generate(query, retrieved_docs)

print("Query:", query)
print("Retrieved Docs:", retrieved_docs)
print("Generated Response:", response)

Model Output: {'score': 0.5914841294288635, 'start': 23, 'end': 29, 'answer': 'qubits'}
Query: How does quantum computing work?
Retrieved Docs: ['Quantum computing uses qubits.', 'Post-quantum cryptography is resistant to quantum attacks.']
Generated Response: qubits
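The `score` field in the model output above could also drive adaptation: if the QA model is not confident, the system can retry with a larger `k`. The sketch below shows that idea with stub functions in place of `adaptive_retrieve` and `adaptive_generate`; the `min_score` threshold, the `k_schedule`, and all stub names are assumptions made for illustration.

```python
def answer_with_escalation(query, retrieve, answer, k_schedule=(2, 5), min_score=0.5):
    """Try increasing retrieval depths until the answer score reaches min_score."""
    best = None
    for k in k_schedule:
        docs = retrieve(query, k)
        result = answer(query, docs)
        if best is None or result["score"] > best["score"]:
            best = dict(result, k=k)  # remember which depth produced this answer
        if result["score"] >= min_score:
            break
    return best

# Stubs standing in for the notebook's retriever and QA model (hypothetical).
STUB_DOCS = ["doc one", "doc two", "doc three", "doc four"]

def stub_retrieve(query, k):
    return STUB_DOCS[:k]

def stub_answer(query, docs):
    # Pretend the model is only confident once it sees at least three documents.
    return {"answer": "qubits", "score": 0.8 if len(docs) >= 3 else 0.3}

result = answer_with_escalation("How does quantum computing work?", stub_retrieve, stub_answer)
print(result)  # {'answer': 'qubits', 'score': 0.8, 'k': 5}
```

The function returns the best-scoring attempt even when no attempt clears the threshold, so the caller always gets an answer along with the depth that produced it.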
