# RAG (Retrieval-Augmented Generation) for Question Answering

A notebook that tackles question answering with the RAG approach.

**RAG** combines two components:
1. **Retrieval** - finding relevant documents in a knowledge base
2. **Generation** - generating an answer from the retrieved contexts

**Architecture**: Sentence-Transformers embeddings (`all-MiniLM-L6-v2`, a BERT-style encoder) + FAISS index → T5-small for answer generation


## Step 1: Imports and setup


In [1]:
import torch
import numpy as np
from datasets import load_dataset
import random
from transformers import BertModel, BertTokenizerFast, T5ForConditionalGeneration, T5Tokenizer
from sentence_transformers import SentenceTransformer
import faiss
from tqdm import tqdm

torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")


PyTorch version: 2.5.1+cu121
CUDA available: True


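## Step 2: Loading the dataset and preparing the knowledge base

Load SQuAD and build the knowledge base from all training contexts. Contexts longer than 100 words are split into 100-word chunks with a 40-word overlap, so that retrieval can match at a finer granularity.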


In [2]:
print("Loading SQuAD dataset...")
dataset = load_dataset("squad")

train_data = dataset['train']
val_data = dataset['validation']

print(f"Train examples: {len(train_data)}")
print(f"Val examples: {len(val_data)}")

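# One context per training question; SQuAD reuses paragraphs across questions,
# so this list contains repeated passages.
all_contexts = train_data['context']
# The first 100 training questions and their answers form the evaluation pool.
all_questions = train_data['question'][:100]
all_answers = train_data['answers'][:100]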

print(f"\nUsing {len(all_contexts)} contexts for knowledge base")
print(f"Using {len(all_questions)} questions for evaluation (from train set to ensure contexts exist)")

def split_into_chunks(text, chunk_size=100, overlap=50):
    words = text.split()
    chunks = []
    if len(words) <= chunk_size:
        return [text]
    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk.strip():
            chunks.append(chunk)
    return chunks if chunks else [text]

knowledge_base = []
chunk_metadata = []

for idx, context in enumerate(tqdm(all_contexts, desc="Creating knowledge base")):
    chunks = split_into_chunks(context, chunk_size=100, overlap=40)
    for chunk_idx, chunk in enumerate(chunks):
        knowledge_base.append(chunk)
        chunk_metadata.append({
            'original_idx': idx,
            'chunk_idx': chunk_idx,
            'original_context': context
        })

print(f"\nKnowledge base created: {len(knowledge_base)} chunks")
print(f"Sample chunk: {knowledge_base[0][:150]}...")


Loading SQuAD dataset...
Train examples: 87599
Val examples: 10570

Using 87599 contexts for knowledge base
Using 100 questions for evaluation (from train set to ensure contexts exist)


Creating knowledge base: 100%|██████████| 87599/87599 [00:02<00:00, 38640.40it/s]


Knowledge base created: 188254 chunks
Sample chunk: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front o...
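
The loop in `split_into_chunks` keeps stepping until the start index passes the end of the text, so the last window can be a short tail that is already fully contained in the previous chunk. The following is a minimal sketch of a variant that stops as soon as a window reaches the end of the text. The name `split_into_chunks_no_tail` is hypothetical and not part of the notebook, and the chunk counts reported above were produced with the original function, not this one.

```python
def split_into_chunks_no_tail(text, chunk_size=100, overlap=40):
    """Word-window chunking that stops once a window reaches the end of the text."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    if len(words) <= chunk_size:
        return [text]
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already covers the tail of the text
    return chunks


# Synthetic 130-word text: the original loop would emit a third, 10-word chunk.
demo_text = " ".join(f"w{i}" for i in range(130))
demo_chunks = split_into_chunks_no_tail(demo_text)
demo_lengths = [len(c.split()) for c in demo_chunks]
print(demo_lengths)
```

Dropping the redundant tail keeps very short fragments out of the index, where they can score highly against a question while carrying little usable context.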





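## Step 3: Creating embeddings and the FAISS index

A Sentence-Transformers model (`all-MiniLM-L6-v2`) produces dense embeddings for documents and questions. The embeddings are L2-normalized before indexing, so FAISS inner-product search ranks chunks by cosine similarity.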


In [3]:
print("Loading sentence transformer model for embeddings...")
encoder_model = SentenceTransformer('all-MiniLM-L6-v2')
embedding_dim = encoder_model.get_sentence_embedding_dimension()

print(f"Embedding dimension: {embedding_dim}")

print("\nCreating embeddings for knowledge base...")
knowledge_embeddings = encoder_model.encode(
    knowledge_base,
    show_progress_bar=True,
    batch_size=32,
    convert_to_numpy=True
)

print(f"Knowledge base embeddings shape: {knowledge_embeddings.shape}")

print("\nCreating FAISS index...")
index = faiss.IndexFlatIP(embedding_dim)
knowledge_embeddings_normalized = knowledge_embeddings / np.linalg.norm(knowledge_embeddings, axis=1, keepdims=True)
index.add(knowledge_embeddings_normalized.astype('float32'))

print(f"FAISS index created with {index.ntotal} vectors")


Loading sentence transformer model for embeddings...
Embedding dimension: 384

Creating embeddings for knowledge base...


Batches:   0%|          | 0/5883 [00:00<?, ?it/s]

Knowledge base embeddings shape: (188254, 384)

Creating FAISS index...
FAISS index created with 188254 vectors
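
The cell above divides every embedding by its L2 norm before adding it to `IndexFlatIP`. The point of that step is that the inner product of two unit vectors equals their cosine similarity. The sketch below checks this identity in plain Python on two made-up 3-dimensional vectors; the helper names are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query_vec = [1.0, 2.0, 2.0]  # made-up values
doc_vec = [2.0, 0.0, 1.0]    # made-up values

ip_on_unit_vectors = dot(l2_normalize(query_vec), l2_normalize(doc_vec))
print(ip_on_unit_vectors, cosine(query_vec, doc_vec))
```

FAISS also provides `faiss.normalize_L2`, which normalizes a float32 array in place and can replace the manual division.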


## Step 4: Retrieval component

A function that finds the top-k most relevant contexts for a given question.


In [4]:
def retrieve_contexts(question, encoder_model, index, knowledge_base, chunk_metadata, top_k=3):
    question_embedding = encoder_model.encode([question], convert_to_numpy=True)
    question_embedding_normalized = question_embedding / np.linalg.norm(question_embedding, axis=1, keepdims=True)
    
    search_k = min(top_k * 20, index.ntotal)
    distances, indices = index.search(question_embedding_normalized.astype('float32'), search_k)
    
    retrieved_contexts = []
    retrieved_scores = []
    retrieved_metadata = []
    seen_indices = set()
    seen_contexts = set()
    
    for idx, score in zip(indices[0], distances[0]):
        if idx in seen_indices:
            continue
        
        context = knowledge_base[idx]
        context_preview = context[:150]
        
        if context_preview in seen_contexts:
            continue
        
        seen_indices.add(idx)
        seen_contexts.add(context_preview)
        
        retrieved_contexts.append(context)
        retrieved_scores.append(float(score))
        retrieved_metadata.append(chunk_metadata[idx])
        
        if len(retrieved_contexts) >= top_k:
            break
    
    return retrieved_contexts, retrieved_scores, retrieved_metadata

print("Testing retrieval...")
test_question = all_questions[0]
retrieved_contexts, scores, metadata = retrieve_contexts(
    test_question,
    encoder_model,
    index,
    knowledge_base,
    chunk_metadata,
    top_k=3
)

print(f"\nQuestion: {test_question}")
print(f"\nRetrieved contexts (top-3):")
for i, (ctx, score, meta) in enumerate(zip(retrieved_contexts, scores, metadata)):
    print(f"\n{i+1}. Score: {score:.4f}")
    print(f"   Original idx: {meta['original_idx']}, Chunk idx: {meta['chunk_idx']}")
    print(f"   Context: {ctx[:200]}...")


Testing retrieval...

Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?

Retrieved contexts (top-3):

1. Score: 0.6348
   Original idx: 4, Chunk idx: 1
   Context: behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1...

2. Score: 0.6209
   Original idx: 4, Chunk idx: 0
   Context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper sta...

3. Score: 0.6108
   Original idx: 142, Chunk idx: 2
   Context: of Lourdes, which was built in 1896, is a replica of the original in Lourdes, France. It is very popular among students and alumni as a place of prayer and meditation, and it is considered one of the ...
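
In the output above, ranks 1 and 2 are both chunks of the same passage (original idx 4). The preview-based filter in `retrieve_contexts` only drops chunks whose first 150 characters are identical, so overlapping chunks of one passage can fill several of the top-k slots. One possible alternative, sketched here under the assumption that a single chunk per source passage is wanted, keeps only the best-scoring chunk for each `original_idx`. The function name and the fourth hit are made up for illustration.

```python
def top_k_unique_sources(hits, top_k=3):
    """Keep the best-scoring chunk per source passage.

    hits: (score, original_idx, chunk_text) tuples sorted by descending score.
    """
    seen_sources = set()
    selected = []
    for score, original_idx, chunk_text in hits:
        if original_idx in seen_sources:
            continue  # a better chunk from this passage is already selected
        seen_sources.add(original_idx)
        selected.append((score, original_idx, chunk_text))
        if len(selected) == top_k:
            break
    return selected


example_hits = [
    (0.6348, 4, "chunk 1 of passage 4"),
    (0.6209, 4, "chunk 0 of passage 4"),
    (0.6108, 142, "chunk 2 of passage 142"),
    (0.5900, 7, "chunk 0 of passage 7"),  # made-up hit for illustration
]
unique_hits = top_k_unique_sources(example_hits, top_k=3)
print([(score, idx) for score, idx, _ in unique_hits])
```

The trade-off is that an answer spanning two adjacent chunks of one passage would lose its second chunk, so this is a design choice rather than a strict improvement.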


## Step 5: Loading the generative model

The pretrained T5-small model generates answers from the retrieved contexts.


In [5]:
print("Loading T5 model for generation...")
generator_model_name = 't5-small'
generator_model = T5ForConditionalGeneration.from_pretrained(generator_model_name)
generator_tokenizer = T5Tokenizer.from_pretrained(generator_model_name)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
generator_model = generator_model.to(device)
generator_model.eval()

print(f"Generator model loaded: {generator_model_name}")
print(f"Using device: {device}")

def format_input(question, contexts):
    context_text = " ".join(contexts[:3])
    input_text = f"answer question: {question} context: {context_text}"
    return input_text

sample_input = format_input(test_question, retrieved_contexts)
print(f"\nSample input format:")
print(sample_input[:200] + "...")


Loading T5 model for generation...


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Generator model loaded: t5-small
Using device: cuda

Sample input format:
answer question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? context: behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the gro...
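
`format_input` prefixes the prompt with `answer question:`. T5's original multitask training mixture formatted SQuAD examples as `question: ... context: ...`, so a prompt in that shape may be closer to what the pretrained checkpoint has seen. The sketch below is one way to build such a prompt. The function name and the `max_context_words` budget are assumptions, and its effect on the scores in this notebook has not been measured.

```python
def format_input_squad_style(question, contexts, max_context_words=300):
    """Build a 'question: ... context: ...' prompt with a word budget on the context."""
    context_words = " ".join(contexts).split()[:max_context_words]
    return f"question: {question.strip()} context: {' '.join(context_words)}"


example_prompt = format_input_squad_style(
    "What is in front of the Notre Dame Main Building?",
    ["Immediately in front of the Main Building and facing it, is a copper statue"],
)
print(example_prompt)
```

Capping the context by words is only a rough proxy for the tokenizer's 512-token limit, which the pipeline still enforces through `truncation=True`.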


## Step 6: RAG pipeline

Combine retrieval and generation into a single pipeline.


In [6]:
def rag_pipeline(question, encoder_model, index, knowledge_base, chunk_metadata, 
                 generator_model, generator_tokenizer, device, top_k=3, max_length=64, 
                 question_idx=None, original_contexts_map=None):
    retrieved_contexts, scores, metadata = retrieve_contexts(
        question, encoder_model, index, knowledge_base, chunk_metadata, top_k=top_k
    )
    
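    # Optional oracle step: when a gold-context map is supplied, the gold passage is
    # forced into the generator input. This bypasses retrieval for that question.
    if question_idx is not None and original_contexts_map is not None: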
        if question_idx in original_contexts_map:
            original_context = original_contexts_map[question_idx]
            if original_context not in retrieved_contexts:
                retrieved_contexts.insert(0, original_context)
                scores.insert(0, 1.0)
                retrieved_contexts = retrieved_contexts[:top_k]
                scores = scores[:top_k]
    
    input_text = format_input(question, retrieved_contexts)
    
    inputs = generator_tokenizer(
        input_text,
        return_tensors='pt',
        max_length=512,
        truncation=True,
        padding='max_length'
    ).input_ids.to(device)
    
    with torch.no_grad():
        outputs = generator_model.generate(
            inputs,
            max_length=max_length,
            min_length=1,
            num_beams=4,
            early_stopping=True,
            no_repeat_ngram_size=2,
            length_penalty=0.6
        )
    
    answer = generator_tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    return answer, retrieved_contexts, scores

print("Testing RAG pipeline...")
answer, contexts, scores = rag_pipeline(
    test_question,
    encoder_model,
    index,
    knowledge_base,
    chunk_metadata,
    generator_model,
    generator_tokenizer,
    device,
    top_k=3
)

print(f"\nQuestion: {test_question}")
print(f"Generated Answer: {answer}")
print(f"\nRetrieved contexts scores: {[f'{s:.4f}' for s in scores]}")


Testing RAG pipeline...

Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Generated Answer: Saint Bernadette Soubirous

Retrieved contexts scores: ['0.6348', '0.6209', '0.6108']
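
The generator can only produce the right answer if the answer text is present in the retrieved contexts, so it helps to measure retrieval separately from generation. The sketch below computes a simple hit rate: the share of questions whose gold answer string appears in at least one retrieved context. The helper names and the toy examples are illustrative and not part of the notebook.

```python
def contains_answer(answer_text, contexts):
    """True if the whitespace-normalized, lowercased answer occurs in any context."""
    needle = " ".join(answer_text.lower().split())
    return any(needle in " ".join(ctx.lower().split()) for ctx in contexts)


def retrieval_hit_rate(examples):
    """examples: list of (gold_answer, retrieved_contexts) pairs."""
    if not examples:
        return 0.0
    hits = sum(contains_answer(answer, contexts) for answer, contexts in examples)
    return hits / len(examples)


toy_examples = [
    ("Saint Bernadette Soubirous",
     ["where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous"]),
    ("a copper statue of Christ",
     ["game is played on the field in Notre Dame Stadium."]),
]
toy_hit_rate = retrieval_hit_rate(toy_examples)
print(toy_hit_rate)
```

With the notebook's objects, the pairs could be built from `all_answers[i]['text'][0]` and the contexts returned by `retrieve_contexts` for `all_questions[i]`.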


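## Step 7: Evaluating the RAG system

Compute F1 and Exact Match on the first 50 questions of the evaluation pool. These questions come from the SQuAD train split rather than the validation split, so that their contexts are guaranteed to be in the knowledge base.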


In [7]:
def normalize_answer(s):
    def remove_articles(text):
        return text.replace(" a ", " ").replace(" an ", " ").replace(" the ", " ")
    
    def white_space_fix(text):
        return " ".join(text.split())
    
    def remove_punc(text):
        import string
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)
    
    def lower(text):
        return text.lower()
    
    return white_space_fix(remove_articles(remove_punc(lower(s))))

def f1_score(prediction, ground_truth):
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    
    common = set(prediction_tokens) & set(ground_truth_tokens)
    
    if len(common) == 0:
        return 0
    
    precision = len(common) / len(prediction_tokens) if len(prediction_tokens) > 0 else 0
    recall = len(common) / len(ground_truth_tokens) if len(ground_truth_tokens) > 0 else 0
    
    if precision + recall == 0:
        return 0
    
    return 2 * precision * recall / (precision + recall)

def exact_match_score(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)

print("Evaluating RAG system...")
f1_scores = []
em_scores = []

eval_size = 50

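# `original_contexts_map` is not defined in the cells above, so this falls back to {}
# and the pipeline runs without gold-context injection.
original_contexts_map_local = original_contexts_map if 'original_contexts_map' in globals() else {}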

for i in tqdm(range(eval_size), desc="Evaluating"):
    question = all_questions[i]
    ground_truth = all_answers[i]['text'][0]
    
    try:
        answer, _, _ = rag_pipeline(
            question,
            encoder_model,
            index,
            knowledge_base,
            chunk_metadata,
            generator_model,
            generator_tokenizer,
            device,
            top_k=3,
            question_idx=i if original_contexts_map_local else None,
            original_contexts_map=original_contexts_map_local if original_contexts_map_local else None
        )
        
        f1 = f1_score(answer, ground_truth)
        em = exact_match_score(answer, ground_truth)
        
        f1_scores.append(f1)
        em_scores.append(em)
    except Exception as e:
        print(f"Error processing question {i}: {e}")
        f1_scores.append(0)
        em_scores.append(0)

avg_f1 = np.mean(f1_scores)
avg_em = np.mean(em_scores)

print(f"\nEvaluation Results:")
print(f"  F1 Score: {avg_f1:.4f}")
print(f"  Exact Match: {avg_em:.4f}")
print(f"  Evaluated on {eval_size} questions")


Evaluating RAG system...


Evaluating: 100%|██████████| 50/50 [00:04<00:00, 10.30it/s]


Evaluation Results:
  F1 Score: 0.7093
  Exact Match: 0.5600
  Evaluated on 50 questions
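
`normalize_answer` removes articles with `str.replace(" the ", " ")`, which only matches an article that has a space on both sides. A leading article therefore survives: "the Main Building" normalizes to "the main building", and the prediction "Main Building" scores F1 0.8 instead of 1.0. `f1_score` also counts overlap on token sets rather than token multisets. The sketch below follows the conventions of the standard SQuAD evaluation (word-boundary article removal, `Counter`-based overlap). The function names are chosen here to avoid clashing with the notebook's own, and the scores reported above were computed with the original functions.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer_squad(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # also matches leading articles
    return " ".join(text.split())


def f1_squad(prediction, ground_truth):
    """Token-level F1 with multiset overlap."""
    pred_tokens = normalize_answer_squad(prediction).split()
    gold_tokens = normalize_answer_squad(ground_truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match_squad(prediction, ground_truth):
    """True if both strings are identical after normalization."""
    return normalize_answer_squad(prediction) == normalize_answer_squad(ground_truth)


print(f1_squad("Main Building", "the Main Building"))
print(exact_match_squad("Main Building", "the Main Building"))
```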





In [8]:
original_contexts_map_local = original_contexts_map if 'original_contexts_map' in globals() else {}

print("Sample RAG predictions:\n")

for i in range(5):
    question = all_questions[i]
    ground_truth = all_answers[i]['text'][0]
    
    answer, contexts, scores = rag_pipeline(
        question,
        encoder_model,
        index,
        knowledge_base,
        chunk_metadata,
        generator_model,
        generator_tokenizer,
        device,
        top_k=3,
        question_idx=i if original_contexts_map_local else None,
        original_contexts_map=original_contexts_map_local if original_contexts_map_local else None
    )
    
    f1 = f1_score(answer, ground_truth)
    em = exact_match_score(answer, ground_truth)
    
    print(f"Question {i+1}: {question}")
    print(f"Ground Truth: {ground_truth}")
    print(f"Generated Answer: {answer}")
    print(f"F1: {f1:.4f}, EM: {em}")
    print(f"Top retrieved context (score: {scores[0]:.4f}): {contexts[0][:150]}...")
    print("-" * 80)


Sample RAG predictions:

Question 1: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Ground Truth: Saint Bernadette Soubirous
Generated Answer: Saint Bernadette Soubirous
F1: 1.0000, EM: True
Top retrieved context (score: 0.6348): behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary rep...
--------------------------------------------------------------------------------
Question 2: What is in front of the Notre Dame Main Building?
Ground Truth: a copper statue of Christ
Generated Answer: Golden Dome
F1: 0.0000, EM: False
Top retrieved context (score: 0.6295): game is played on the field in Notre Dame Stadium....
--------------------------------------------------------------------------------
Question 3: The Basilica of the Sacred heart at Notre Dame is beside to which structure?
Ground Truth: the Main Building
Generated Answer: Main Building
F1: 0.8000, EM: False
Top ret