# Wikipedia RAG System Setup
## Data Sourcing and Preparation for Retrieval-Augmented Generation

This notebook covers:
1. Data Sourcing - Downloading Wikipedia and evaluation benchmarks
2. Data Preparation - Parsing, cleaning, chunking, and indexing
3. Building FAISS index for efficient retrieval

## 1. Installation and Setup

In [1]:
# Install required packages
!pip install -q wikipedia-api
!pip install -q datasets
!pip install -q sentence-transformers
!pip install -q faiss-cpu
!pip install -q mwparserfromhell
!pip install -q tqdm
!pip install -q requests
!pip install -q beautifulsoup4
!pip install -q lxml

  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for wikipedia-api (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m31.4/31.4 MB[0m [31m68.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m256.5/256.5 kB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
import os
import json
import pickle
import re
import requests
from pathlib import Path
from typing import List, Dict, Tuple
import xml.etree.ElementTree as ET
from tqdm.auto import tqdm
import numpy as np
import pandas as pd

# Sentence transformers and FAISS
from sentence_transformers import SentenceTransformer
import faiss

# For Wikipedia parsing
import mwparserfromhell
from datasets import load_dataset

print("✓ All imports successful")

✓ All imports successful


In [None]:
from google.colab import drive
drive.mount('/content/drive')

## 2. Project Directory Structure Setup

In [3]:
# Create structured project directory
PROJECT_ROOT = Path("./rag_wikipedia_project")
DATA_DIR = PROJECT_ROOT / "data"
RAW_DATA_DIR = DATA_DIR / "raw"
PROCESSED_DATA_DIR = DATA_DIR / "processed"
BENCHMARKS_DIR = DATA_DIR / "benchmarks"
EMBEDDINGS_DIR = DATA_DIR / "embeddings"
INDICES_DIR = DATA_DIR / "indices"

# Create all directories
for directory in [RAW_DATA_DIR, PROCESSED_DATA_DIR, BENCHMARKS_DIR, EMBEDDINGS_DIR, INDICES_DIR]:
    directory.mkdir(parents=True, exist_ok=True)
    print(f"✓ Created: {directory}")

print("\n✓ Project structure created successfully")

✓ Created: rag_wikipedia_project/data/raw
✓ Created: rag_wikipedia_project/data/processed
✓ Created: rag_wikipedia_project/data/benchmarks
✓ Created: rag_wikipedia_project/data/embeddings
✓ Created: rag_wikipedia_project/data/indices

✓ Project structure created successfully


## 3. Data Sourcing
### 3.1 Download Wikipedia Corpus

In [4]:
# Using HuggingFace's wikipedia dataset (cleaner than raw dumps)
# For production, you might want to use the full Wikipedia dump from dumps.wikimedia.org

print("Downloading Wikipedia dataset...")
print("Note: This will download a sample. For full Wikipedia, use dumps.wikimedia.org")

# Download Wikipedia dataset (20220301 version, English)
# Using a smaller subset for demonstration - adjust as needed
wikipedia_dataset = load_dataset(
    "wikimedia/wikipedia",
    "20231101.en",
    split="train[:10000]",  # Remove [:10000] for full dataset
    cache_dir=str(RAW_DATA_DIR)
)

print(f"✓ Downloaded {len(wikipedia_dataset)} Wikipedia articles")
print(f"Sample article title: {wikipedia_dataset[0]['title']}")

Downloading Wikipedia dataset...
Note: This will download a sample. For full Wikipedia, use dumps.wikimedia.org


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

Resolving data files:   0%|          | 0/41 [00:00<?, ?it/s]

Downloading data:   0%|          | 0/41 [00:00<?, ?files/s]

20231101.en/train-00000-of-00041.parquet:   0%|          | 0.00/420M [00:00<?, ?B/s]

20231101.en/train-00001-of-00041.parquet:   0%|          | 0.00/351M [00:00<?, ?B/s]

20231101.en/train-00002-of-00041.parquet:   0%|          | 0.00/329M [00:00<?, ?B/s]

20231101.en/train-00003-of-00041.parquet:   0%|          | 0.00/331M [00:00<?, ?B/s]

20231101.en/train-00004-of-00041.parquet:   0%|          | 0.00/307M [00:00<?, ?B/s]

20231101.en/train-00005-of-00041.parquet:   0%|          | 0.00/244M [00:00<?, ?B/s]

20231101.en/train-00006-of-00041.parquet:   0%|          | 0.00/266M [00:00<?, ?B/s]

20231101.en/train-00007-of-00041.parquet:   0%|          | 0.00/228M [00:00<?, ?B/s]

20231101.en/train-00008-of-00041.parquet:   0%|          | 0.00/248M [00:00<?, ?B/s]

20231101.en/train-00009-of-00041.parquet:   0%|          | 0.00/227M [00:00<?, ?B/s]

20231101.en/train-00010-of-00041.parquet:   0%|          | 0.00/234M [00:00<?, ?B/s]

20231101.en/train-00011-of-00041.parquet:   0%|          | 0.00/232M [00:00<?, ?B/s]

20231101.en/train-00012-of-00041.parquet:   0%|          | 0.00/239M [00:00<?, ?B/s]

20231101.en/train-00013-of-00041.parquet:   0%|          | 0.00/241M [00:00<?, ?B/s]

20231101.en/train-00014-of-00041.parquet:   0%|          | 0.00/223M [00:00<?, ?B/s]

20231101.en/train-00015-of-00041.parquet:   0%|          | 0.00/235M [00:00<?, ?B/s]

20231101.en/train-00016-of-00041.parquet:   0%|          | 0.00/503M [00:00<?, ?B/s]

20231101.en/train-00017-of-00041.parquet:   0%|          | 0.00/231M [00:00<?, ?B/s]

20231101.en/train-00018-of-00041.parquet:   0%|          | 0.00/231M [00:00<?, ?B/s]

20231101.en/train-00019-of-00041.parquet:   0%|          | 0.00/195M [00:00<?, ?B/s]

20231101.en/train-00020-of-00041.parquet:   0%|          | 0.00/225M [00:00<?, ?B/s]

20231101.en/train-00021-of-00041.parquet:   0%|          | 0.00/216M [00:00<?, ?B/s]

20231101.en/train-00022-of-00041.parquet:   0%|          | 0.00/202M [00:00<?, ?B/s]

20231101.en/train-00023-of-00041.parquet:   0%|          | 0.00/213M [00:00<?, ?B/s]

20231101.en/train-00024-of-00041.parquet:   0%|          | 0.00/221M [00:00<?, ?B/s]

20231101.en/train-00025-of-00041.parquet:   0%|          | 0.00/221M [00:00<?, ?B/s]

20231101.en/train-00026-of-00041.parquet:   0%|          | 0.00/208M [00:00<?, ?B/s]

20231101.en/train-00027-of-00041.parquet:   0%|          | 0.00/214M [00:00<?, ?B/s]

20231101.en/train-00028-of-00041.parquet:   0%|          | 0.00/188M [00:00<?, ?B/s]

20231101.en/train-00029-of-00041.parquet:   0%|          | 0.00/218M [00:00<?, ?B/s]

20231101.en/train-00030-of-00041.parquet:   0%|          | 0.00/204M [00:00<?, ?B/s]

20231101.en/train-00031-of-00041.parquet:   0%|          | 0.00/215M [00:00<?, ?B/s]

20231101.en/train-00032-of-00041.parquet:   0%|          | 0.00/214M [00:00<?, ?B/s]

20231101.en/train-00033-of-00041.parquet:   0%|          | 0.00/203M [00:00<?, ?B/s]

20231101.en/train-00034-of-00041.parquet:   0%|          | 0.00/219M [00:00<?, ?B/s]

20231101.en/train-00035-of-00041.parquet:   0%|          | 0.00/224M [00:00<?, ?B/s]

20231101.en/train-00036-of-00041.parquet:   0%|          | 0.00/610M [00:00<?, ?B/s]

20231101.en/train-00037-of-00041.parquet:   0%|          | 0.00/674M [00:00<?, ?B/s]

20231101.en/train-00038-of-00041.parquet:   0%|          | 0.00/538M [00:00<?, ?B/s]

20231101.en/train-00039-of-00041.parquet:   0%|          | 0.00/465M [00:00<?, ?B/s]

20231101.en/train-00040-of-00041.parquet:   0%|          | 0.00/422M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/6407814 [00:00<?, ? examples/s]

✓ Downloaded 10000 Wikipedia articles
Sample article title: Anarchism


In [None]:
# Alternative: Download from Wikipedia XML dumps (for full corpus)
# Uncomment below to download the actual Wikipedia dump

def download_wikipedia_dump(dump_url: str, output_path: Path):
    """
    Download Wikipedia XML dump.
    Example URL: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
    """
    if output_path.exists():
        print(f"✓ Wikipedia dump already exists at {output_path}")
        return

    print(f"Downloading Wikipedia dump from {dump_url}...")
    print("WARNING: This is a very large file (>20GB compressed)")

    response = requests.get(dump_url, stream=True)
    total_size = int(response.headers.get('content-length', 0))

    with open(output_path, 'wb') as f, tqdm(
        total=total_size,
        unit='B',
        unit_scale=True,
        desc='Downloading'
    ) as pbar:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
            pbar.update(len(chunk))

    print(f"✓ Download complete: {output_path}")

# Uncomment to download full Wikipedia dump
# WIKI_DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
# download_wikipedia_dump(WIKI_DUMP_URL, RAW_DATA_DIR / "wikipedia_dump.xml.bz2")

print("✓ Wikipedia sourcing complete (using HuggingFace dataset)")

### 3.2 Download Evaluation Benchmarks

In [7]:
# Download TruthfulQA
print("Downloading TruthfulQA benchmark...")
truthfulqa = load_dataset("truthfulqa/truthful_qa", "generation", cache_dir=str(BENCHMARKS_DIR))
truthfulqa.save_to_disk(str(BENCHMARKS_DIR / "truthfulqa"))
print(f"✓ TruthfulQA downloaded: {len(truthfulqa['validation'])} questions")

'''
# Download FEVER
print("\nDownloading FEVER benchmark...")
fever = load_dataset("fever/fever", cache_dir=str(BENCHMARKS_DIR))
fever.save_to_disk(str(BENCHMARKS_DIR / "fever"))
print(f"✓ FEVER downloaded: {len(fever['train'])} train samples")
'''
'''
# Download RAGTruth (if available) or create placeholder
print("\nNote: RAGTruth benchmark setup...")
# RAGTruth might need to be downloaded from the official source
# For now, we'll note where it should be placed
ragtruth_dir = BENCHMARKS_DIR / "ragtruth"
ragtruth_dir.mkdir(exist_ok=True)
print(f"✓ RAGTruth directory created at: {ragtruth_dir}")
print("  (Place RAGTruth data files here if available)")
'''

print("\n✓ All benchmarks downloaded successfully")

Downloading TruthfulQA benchmark...


Saving the dataset (0/1 shards):   0%|          | 0/817 [00:00<?, ? examples/s]

✓ TruthfulQA downloaded: 817 questions

Downloading FEVER benchmark...


RuntimeError: Dataset scripts are no longer supported, but found fever.py

## 4. Data Preparation
### 4.1 Parse and Clean Wikipedia Data

In [8]:
def clean_wikipedia_text(text: str) -> str:
    """
    Clean Wikipedia text by removing markup, special characters, etc.
    """
    # Parse wiki markup
    wikicode = mwparserfromhell.parse(text)

    # Remove templates, references, etc.
    for template in wikicode.filter_templates():
        try:
            wikicode.remove(template)
        except:
            pass

    # Get plain text
    plain_text = wikicode.strip_code()

    # Remove extra whitespace
    plain_text = re.sub(r'\s+', ' ', plain_text)

    # Remove special characters but keep punctuation
    plain_text = re.sub(r'[^\w\s.,!?;:\-\'"()]', '', plain_text)

    return plain_text.strip()

def parse_wikipedia_articles(dataset) -> List[Dict[str, str]]:
    """
    Parse and clean Wikipedia articles from the dataset.
    """
    cleaned_articles = []

    for article in tqdm(dataset, desc="Parsing Wikipedia articles"):
        try:
            cleaned_text = clean_wikipedia_text(article['text'])

            if len(cleaned_text) > 100:  # Filter out very short articles
                cleaned_articles.append({
                    'id': article['id'],
                    'title': article['title'],
                    'text': cleaned_text,
                    'url': article.get('url', '')
                })
        except Exception as e:
            print(f"Error processing article {article.get('title', 'Unknown')}: {e}")
            continue

    return cleaned_articles

# Parse and clean the Wikipedia dataset
print("Parsing and cleaning Wikipedia articles...")
cleaned_articles = parse_wikipedia_articles(wikipedia_dataset)

# Save cleaned articles
with open(PROCESSED_DATA_DIR / "cleaned_articles.json", 'w', encoding='utf-8') as f:
    json.dump(cleaned_articles, f, ensure_ascii=False, indent=2)

print(f"\n✓ Cleaned {len(cleaned_articles)} articles")
print(f"✓ Saved to: {PROCESSED_DATA_DIR / 'cleaned_articles.json'}")

# Display sample
print("\nSample cleaned article:")
print(f"Title: {cleaned_articles[0]['title']}")
print(f"Text preview: {cleaned_articles[0]['text'][:300]}...")

Parsing and cleaning Wikipedia articles...


Parsing Wikipedia articles:   0%|          | 0/10000 [00:00<?, ?it/s]


✓ Cleaned 9986 articles
✓ Saved to: rag_wikipedia_project/data/processed/cleaned_articles.json

Sample cleaned article:
Title: Anarchism
Text preview: Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typically including nation-states, and capitalism. Anarchism advocates for the replacement of the state ...


### 4.2 Chunk Text into Sentence-Level Fragments

In [10]:
import nltk
# Download both punkt and punkt_tab to ensure compatibility
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # Add this line
from nltk.tokenize import sent_tokenize

def chunk_text_into_sentences(text: str, max_chunk_size: int = 3) -> List[str]:
    """
    Chunk text into sentence-level fragments.
    Combines sentences into chunks of max_chunk_size sentences.
    """
    sentences = sent_tokenize(text)
    chunks = []

    for i in range(0, len(sentences), max_chunk_size):
        chunk = ' '.join(sentences[i:i + max_chunk_size])
        if len(chunk.strip()) > 20:  # Filter very short chunks
            chunks.append(chunk.strip())

    return chunks

def create_document_chunks(articles: List[Dict[str, str]]) -> List[Dict[str, any]]:
    """
    Create chunks from all articles with metadata.
    """
    all_chunks = []
    chunk_id = 0

    for article in tqdm(articles, desc="Chunking articles"):
        chunks = chunk_text_into_sentences(article['text'])

        for chunk_idx, chunk_text in enumerate(chunks):
            all_chunks.append({
                'chunk_id': chunk_id,
                'article_id': article['id'],
                'article_title': article['title'],
                'chunk_index': chunk_idx,
                'text': chunk_text,
                'url': article.get('url', '')
            })
            chunk_id += 1

    return all_chunks

# Create chunks
print("Creating sentence-level chunks...")
document_chunks = create_document_chunks(cleaned_articles)

# Save chunks
with open(PROCESSED_DATA_DIR / "document_chunks.json", 'w', encoding='utf-8') as f:
    json.dump(document_chunks, f, ensure_ascii=False, indent=2)

print(f"\n✓ Created {len(document_chunks)} chunks from {len(cleaned_articles)} articles")
print(f"✓ Saved to: {PROCESSED_DATA_DIR / 'document_chunks.json'}")

# Statistics
chunk_lengths = [len(chunk['text']) for chunk in document_chunks]
print(f"\nChunk statistics:")
print(f"  Average chunk length: {np.mean(chunk_lengths):.0f} characters")
print(f"  Min chunk length: {np.min(chunk_lengths)} characters")
print(f"  Max chunk length: {np.max(chunk_lengths)} characters")

# Display sample chunk
print("\nSample chunk:")
print(f"Article: {document_chunks[10]['article_title']}")
print(f"Text: {document_chunks[10]['text']}")

Creating sentence-level chunks...


Chunking articles:   0%|          | 0/9986 [00:00<?, ?it/s]


✓ Created 212957 chunks from 9986 articles
✓ Saved to: rag_wikipedia_project/data/processed/document_chunks.json

Chunk statistics:
  Average chunk length: 427 characters
  Min chunk length: 21 characters
  Max chunk length: 34161 characters

Sample chunk:
Article: Anarchism
Text: History Pre-modern era Before the creation of towns and cities, established authority did not exist. It was after the institution of authority that anarchistic ideas were espoused as a reaction. The most notable precursors to anarchism in the ancient world were in China and Greece.


### 4.3 Generate Embeddings using Sentence Transformer

In [11]:
# Initialize sentence transformer model
print("Loading Sentence Transformer model...")
model_name = "sentence-transformers/all-MiniLM-L6-v2"  # Fast and efficient
# Alternative: "sentence-transformers/all-mpnet-base-v2" for better quality

embedding_model = SentenceTransformer(model_name)
print(f"✓ Loaded model: {model_name}")
print(f"  Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

Loading Sentence Transformer model...


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

✓ Loaded model: sentence-transformers/all-MiniLM-L6-v2
  Embedding dimension: 384


In [12]:
def generate_embeddings(chunks: List[Dict], model: SentenceTransformer, batch_size: int = 32) -> np.ndarray:
    """
    Generate embeddings for all chunks.
    """
    texts = [chunk['text'] for chunk in chunks]

    print(f"Generating embeddings for {len(texts)} chunks...")
    embeddings = model.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True
    )

    return embeddings

# Generate embeddings
embeddings = generate_embeddings(document_chunks, embedding_model)

print(f"\n✓ Generated embeddings shape: {embeddings.shape}")
print(f"  Number of chunks: {embeddings.shape[0]}")
print(f"  Embedding dimension: {embeddings.shape[1]}")

Generating embeddings for 212957 chunks...


Batches:   0%|          | 0/6655 [00:00<?, ?it/s]


✓ Generated embeddings shape: (212957, 384)
  Number of chunks: 212957
  Embedding dimension: 384


In [13]:
# Save embeddings
embeddings_file = EMBEDDINGS_DIR / "chunk_embeddings.npy"
np.save(embeddings_file, embeddings)
print(f"✓ Saved embeddings to: {embeddings_file}")

# Save metadata separately for easy access
metadata = [
    {
        'chunk_id': chunk['chunk_id'],
        'article_title': chunk['article_title'],
        'text': chunk['text']
    }
    for chunk in document_chunks
]

with open(EMBEDDINGS_DIR / "chunk_metadata.json", 'w', encoding='utf-8') as f:
    json.dump(metadata, f, ensure_ascii=False, indent=2)

print(f"✓ Saved metadata to: {EMBEDDINGS_DIR / 'chunk_metadata.json'}")

✓ Saved embeddings to: rag_wikipedia_project/data/embeddings/chunk_embeddings.npy
✓ Saved metadata to: rag_wikipedia_project/data/embeddings/chunk_metadata.json


### 4.4 Build FAISS Index for Efficient Similarity Search

In [14]:
def build_faiss_index(embeddings: np.ndarray, index_type: str = "IVF") -> faiss.Index:
    """
    Build FAISS index for efficient similarity search.

    Args:
        embeddings: Numpy array of embeddings
        index_type: Type of index - "Flat" for exact search, "IVF" for approximate
    """
    dimension = embeddings.shape[1]

    if index_type == "Flat":
        # Exact search (slower but accurate)
        index = faiss.IndexFlatL2(dimension)
        index.add(embeddings)
        print(f"✓ Built Flat index (exact search)")

    elif index_type == "IVF":
        # Approximate search (faster for large datasets)
        nlist = min(100, embeddings.shape[0] // 10)  # Number of clusters
        quantizer = faiss.IndexFlatL2(dimension)
        index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

        # Train the index
        print("Training IVF index...")
        index.train(embeddings)
        index.add(embeddings)
        index.nprobe = 10  # Number of clusters to search
        print(f"✓ Built IVF index (approximate search) with {nlist} clusters")

    else:
        raise ValueError(f"Unknown index type: {index_type}")

    return index

# Build FAISS index
print("Building FAISS index...")
faiss_index = build_faiss_index(embeddings, index_type="IVF")

print(f"\n✓ FAISS index built successfully")
print(f"  Total vectors: {faiss_index.ntotal}")
print(f"  Index type: {type(faiss_index).__name__}")

Building FAISS index...
Training IVF index...
✓ Built IVF index (approximate search) with 100 clusters

✓ FAISS index built successfully
  Total vectors: 212957
  Index type: IndexIVFFlat


In [15]:
# Save FAISS index
index_file = INDICES_DIR / "faiss_index.bin"
faiss.write_index(faiss_index, str(index_file))
print(f"✓ Saved FAISS index to: {index_file}")

# Save index configuration
config = {
    'model_name': model_name,
    'embedding_dimension': embeddings.shape[1],
    'num_chunks': embeddings.shape[0],
    'index_type': type(faiss_index).__name__,
    'created_at': pd.Timestamp.now().isoformat()
}

with open(INDICES_DIR / "index_config.json", 'w') as f:
    json.dump(config, f, indent=2)

print(f"✓ Saved index configuration to: {INDICES_DIR / 'index_config.json'}")

✓ Saved FAISS index to: rag_wikipedia_project/data/indices/faiss_index.bin
✓ Saved index configuration to: rag_wikipedia_project/data/indices/index_config.json


## 5. Test the Retrieval System

In [16]:
def search_similar_chunks(query: str, model: SentenceTransformer, index: faiss.Index,
                         chunks: List[Dict], k: int = 5) -> List[Dict]:
    """
    Search for similar chunks given a query.
    """
    # Generate query embedding
    query_embedding = model.encode([query], convert_to_numpy=True)

    # Search in FAISS index
    distances, indices = index.search(query_embedding, k)

    # Retrieve chunks
    results = []
    for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
        if idx < len(chunks):  # Valid index
            results.append({
                'rank': i + 1,
                'distance': float(dist),
                'article_title': chunks[idx]['article_title'],
                'text': chunks[idx]['text'],
                'chunk_id': chunks[idx]['chunk_id']
            })

    return results

# Test queries
test_queries = [
    "What is machine learning?",
    "History of artificial intelligence",
    "How does photosynthesis work?"
]

print("Testing retrieval system...\n")
for query in test_queries:
    print(f"Query: {query}")
    print("=" * 80)

    results = search_similar_chunks(query, embedding_model, faiss_index, document_chunks, k=3)

    for result in results:
        print(f"\n[Rank {result['rank']}] Article: {result['article_title']}")
        print(f"Distance: {result['distance']:.4f}")
        print(f"Text: {result['text'][:200]}...")

    print("\n" + "=" * 80 + "\n")

Testing retrieval system...

Query: What is machine learning?

[Rank 1] Article: Artificial intelligence
Distance: 0.6172
Text: There are several kinds of machine learning. Unsupervised learning analyzes a stream of data and finds patterns and makes predictions without any other guidance. Supervised learning requires a human t...

[Rank 2] Article: Artificial intelligence
Distance: 0.6943
Text: This definition stipulates the ability of systems to synthesize information as the manifestation of intelligence, similar to the way it is defined in biological intelligence. Evaluating approaches to ...

[Rank 3] Article: Artificial intelligence
Distance: 0.8587
Text: When a new observation is received, that observation is classified based on previous experience. There are many kinds of classifiers in use. The decision tree is the simplest and most widely used symb...


Query: History of artificial intelligence

[Rank 1] Article: Artificial intelligence
Distance: 0.6667
Text: This, along with c

## 6. Summary and Next Steps

In [17]:
# Generate summary report
summary = {
    "Project Directory": str(PROJECT_ROOT),
    "Wikipedia Articles Downloaded": len(cleaned_articles),
    "Total Chunks Created": len(document_chunks),
    "Embedding Dimension": embeddings.shape[1],
    "FAISS Index Type": type(faiss_index).__name__,
    "Model Used": model_name,
    "Benchmarks Downloaded": ["TruthfulQA", "FEVER", "RAGTruth (placeholder)"],
    "Files Created": {
        "Cleaned Articles": str(PROCESSED_DATA_DIR / "cleaned_articles.json"),
        "Document Chunks": str(PROCESSED_DATA_DIR / "document_chunks.json"),
        "Embeddings": str(EMBEDDINGS_DIR / "chunk_embeddings.npy"),
        "Metadata": str(EMBEDDINGS_DIR / "chunk_metadata.json"),
        "FAISS Index": str(INDICES_DIR / "faiss_index.bin"),
        "Index Config": str(INDICES_DIR / "index_config.json")
    }
}

print("="*80)
print("SETUP COMPLETE - SUMMARY")
print("="*80)
for key, value in summary.items():
    if isinstance(value, dict):
        print(f"\n{key}:")
        for k, v in value.items():
            print(f"  - {k}: {v}")
    elif isinstance(value, list):
        print(f"\n{key}:")
        for item in value:
            print(f"  - {item}")
    else:
        print(f"\n{key}: {value}")

print("\n" + "="*80)
print("NEXT STEPS:")
print("="*80)
print("""
1. Integrate with a language model (e.g., GPT, Llama) for RAG
2. Evaluate using the downloaded benchmarks (TruthfulQA, FEVER)
3. Fine-tune retrieval parameters (chunk size, k value, etc.)
4. Implement re-ranking for better results
5. Add caching for frequently asked queries
6. Scale up to full Wikipedia dataset if needed
7. Implement evaluation metrics (precision, recall, F1)
""")

# Save summary
with open(PROJECT_ROOT / "setup_summary.json", 'w') as f:
    json.dump(summary, f, indent=2)

print(f"\n✓ Summary saved to: {PROJECT_ROOT / 'setup_summary.json'}")

SETUP COMPLETE - SUMMARY

Project Directory: rag_wikipedia_project

Wikipedia Articles Downloaded: 9986

Total Chunks Created: 212957

Embedding Dimension: 384

FAISS Index Type: IndexIVFFlat

Model Used: sentence-transformers/all-MiniLM-L6-v2

Benchmarks Downloaded:
  - TruthfulQA
  - FEVER
  - RAGTruth (placeholder)

Files Created:
  - Cleaned Articles: rag_wikipedia_project/data/processed/cleaned_articles.json
  - Document Chunks: rag_wikipedia_project/data/processed/document_chunks.json
  - Embeddings: rag_wikipedia_project/data/embeddings/chunk_embeddings.npy
  - Metadata: rag_wikipedia_project/data/embeddings/chunk_metadata.json
  - FAISS Index: rag_wikipedia_project/data/indices/faiss_index.bin
  - Index Config: rag_wikipedia_project/data/indices/index_config.json

NEXT STEPS:

1. Integrate with a language model (e.g., GPT, Llama) for RAG
2. Evaluate using the downloaded benchmarks (TruthfulQA, FEVER)
3. Fine-tune retrieval parameters (chunk size, k value, etc.)
4. Implement re-

In [19]:
from google.colab import drive
drive.mount('/content/drive')

# Copy a file to your Drive
# Copy the whole folder to your Drive (MyDrive is the default root in Drive)
!cp -r /content/rag_wikipedia_project /content/drive/MyDrive/

Mounted at /content/drive


## 7. Utility Functions for Loading the System

In [18]:
def load_rag_system(project_root: str = "./rag_wikipedia_project"):
    """
    Load the complete RAG system for use in other notebooks/scripts.

    Returns:
        dict with model, index, chunks, and search function
    """
    project_root = Path(project_root)

    # Load configuration
    with open(project_root / "data" / "indices" / "index_config.json", 'r') as f:
        config = json.load(f)

    # Load model
    model = SentenceTransformer(config['model_name'])

    # Load FAISS index
    index = faiss.read_index(str(project_root / "data" / "indices" / "faiss_index.bin"))

    # Load chunks
    with open(project_root / "data" / "processed" / "document_chunks.json", 'r') as f:
        chunks = json.load(f)

    print(f"✓ RAG system loaded successfully")
    print(f"  Model: {config['model_name']}")
    print(f"  Chunks: {len(chunks)}")
    print(f"  Index vectors: {index.ntotal}")

    return {
        'model': model,
        'index': index,
        'chunks': chunks,
        'config': config,
        'search': lambda query, k=5: search_similar_chunks(query, model, index, chunks, k)
    }

# Example usage:
# rag_system = load_rag_system()
# results = rag_system['search']("What is quantum computing?", k=5)