In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Building a Knowledge Base and Implementing a Hybrid RAG System with **Web Search Integration**

1. What is Retrieval-Augmented Generation (RAG)?
2. How Does RAG Work?
3. Why Use RAG Instead of a Plain Prompt?
4. Understanding a Knowledge Base
5. Hybrid RAG System with Web Search Integration

  - Leverage multiple knowledge sources - Combining private documents with web knowledge
  - Ensure consistent ranking - Using embedding similarity as a unified scoring method
  - Handle the "freshness vs. depth" tradeoff - Local documents provide depth on specific topics while web results provide up-to-date information


<!-- - Demo: querying the LLM with RAG
    - Find all the studies on topic X that discuss method Y

- What is a knowledge base?
    - Storing extracted information from research papers

- Creating a searchable vector database

- Implementing RAG
    - Connecting the LLM to the knowledge base
    - Using embedding for document retrieval -->



## 1. What is Retrieval-Augmented Generation (RAG)?

Retrieval‑Augmented Generation (RAG) is a two‑step architecture that **first retrieves** relevant passages from an external knowledge store (vector database, search API, etc.) and then **conditions the LLM** on those passages to generate the answer.  
Because the response is grounded in real documents, RAG:

* mitigates hallucinations and improves factual accuracy;  
* keeps outputs up‑to‑date without re‑training the model;  
* sidesteps context‑length limits by feeding the LLM only the top‑k most relevant chunks.


reference: https://cloud.google.com/use-cases/retrieval-augmented-generation

<!-- ![RAG](https://arxiv.org/html/2405.06211v2/x1.png)credit:A Survey on RAG Meeting LLMs -->

<img src="https://arxiv.org/html/2405.06211v2/x1.png" alt="RAG" width="800"/>

# 🧠 Discussion Question:
In what scenarios would RAG be particularly valuable compared to using a standard LLM? Think about domains with rapidly changing information or specialized knowledge.


<details>
<summary>Click here for answers</summary>

| When to Use RAG                 | Why It Wins                            |
|--------------------------------|----------------------------------------|
| Fast-changing domains          | Keeps answers fresh                    |
| Deep niche or internal data    | Taps into proprietary sources          |
| Large documents                | Retrieves only what’s needed           |
| Traceable answers              | Provides citations/sources             |


</details>

## 2. How Does RAG Work?

  ### Retrieval and Pre-processing:
  1. RAG first converts the user’s query into a vector embedding and searches an external vector database (e.g., FAISS, Pinecone) for semantically similar documents.
  2. The most relevant documents are retrieved and optionally undergo light preprocessing, such as truncation or reformatting for compatibility with the language model.
  3. Unlike traditional text search, RAG primarily relies on semantic similarity search rather than techniques like stemming or stop-word removal.


  ### Grounded Generation:
  1. The retrieved context is appended to the user’s original query before being fed into the pre-trained LLM.
  2. This augmented prompt allows the LLM to generate more precise, factually grounded, and contextually relevant responses based on the retrieved information.
  3. By dynamically fetching external knowledge, RAG enhances response accuracy without requiring fine-tuning of the LLM itself.

reference: https://cloud.google.com/use-cases/retrieval-augmented-generation
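
The retrieval step described above can be sketched with the standard library alone. This is a toy illustration, not the pipeline built later in this notebook: the bag-of-words `embed` function stands in for a real embedding model, and `embed`, `cosine`, `retrieve`, and the three-sentence corpus are hypothetical names and data chosen for the example.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (word -> count); a stand-in for a real model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, corpus, k=2):
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

corpus = [
    "sea level rise threatens coastal cities",
    "vector databases store document embeddings",
    "retrieval augmented generation grounds answers in documents",
]
top = retrieve("how do vector databases store embeddings", corpus, k=1)
print(top)
```

A real system would replace `embed` with a dense embedding model and the `sorted` call with a vector index, but the flow stays the same: embed the query, score every candidate, and keep the top k.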



## 3. Why Use RAG Instead of a Plain Prompt?

### From Plain Prompting ➜ RAG‑Enhanced Prompting

| Stage | Without RAG | With RAG |
|-------|-------------|----------|
| **Persona** | Set the model’s role (e.g., “expert reviewer”). | Same |
| **Query** | User’s question. | Converted to an **embedding** and used to search the vector DB. |
| **Context** | – | **Top‑k retrieved chunks** (most semantically similar passages). |
| **Final prompt** | **Persona + Query** | **Persona + Query + Retrieved Context** |
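
The last row of the table can be made concrete with a small prompt builder. This is a minimal sketch: `build_prompt` is a hypothetical helper, and placing the context before the question is one possible layout rather than a requirement.

```python
def build_prompt(persona, query, retrieved_chunks=None):
    """Assemble the final prompt; with no chunks this reduces to plain prompting."""
    parts = [persona]
    if retrieved_chunks:
        # Number the chunks so the model (and the reader) can cite them
        context = "\n".join(
            f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
        )
        parts.append("Context:\n" + context)
    parts.append("Question: " + query)
    return "\n\n".join(parts)

plain = build_prompt("You are an expert reviewer.", "What is RAG?")
augmented = build_prompt(
    "You are an expert reviewer.",
    "What is RAG?",
    ["RAG retrieves passages before generating an answer."],
)
print(augmented)
```

Numbering the chunks is a small design choice that makes it easier to ask the model for inline citations later.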

---

### Key Benefits for Prompt Engineering

1. **Automatic context injection**  
   No more copy‑pasting background text—the retriever supplies it on the fly.

2. **Escapes context‑window limits**  
   You can index gigabytes of PDFs offline and only send a few hundred tokens to the LLM.

3. **Higher factual accuracy**  
   The model is *grounded* in real passages → fewer hallucinations.

4. **Traceable answers**  
   Retrieved snippets can be surfaced as inline citations, making peer review easier.

5. **Lower cost than “just shove everything in”**  
   Shorter prompts = fewer tokens passed to the LLM, especially critical for large models.

---

### Why Not Simply Paste All Documents into the Prompt?

* **Hard limit:** Every LLM has a finite context window; many models cap out at 4k–32k tokens, and literature corpora often dwarf that.  
* **No built‑in search:** The LLM treats the entire prompt uniformly; it doesn’t “index” and rank passages internally.  
* **Wasted tokens:** Irrelevant sections still consume context and cost money.  

> **RAG solves this** by performing *external semantic search*, injecting only the most relevant evidence, and keeping the model focused and inexpensive.
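
One way to picture "injecting only the most relevant evidence" is a greedy packer that walks the ranked chunks and stops once a size budget is reached. A minimal sketch, assuming the chunks are already sorted by relevance and measuring size in characters for simplicity (production systems usually count tokens); `pack_context` and `budget` are hypothetical names.

```python
def pack_context(ranked_chunks, budget):
    """Greedily keep ranked chunks while their total length fits the budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        if used + len(chunk) > budget:
            break  # stop at the first chunk that would overflow the budget
        selected.append(chunk)
        used += len(chunk)
    return selected

ranked = ["a" * 400, "b" * 400, "c" * 400]
context = pack_context(ranked, budget=1000)
print(len(context), sum(len(c) for c in context))
```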



## 4. Understanding a Knowledge Base

A **knowledge base** is a central repository that stores information—usually as document chunks plus rich metadata—in a format that is **easy to embed, search, and update**.

In the context of scientific literature, the KB:

* **Organises** extracted elements of each paper (title, abstract, methods, results, figures, DOI).  
* **Indexes** those chunks as embeddings, enabling **semantic search** rather than plain keywords.  
* **Feeds** retrieved passages to an LLM so answers are grounded in verifiable sources.  
* **Evolves**—new PDFs or web articles can be ingested at any time with no model fine‑tuning.

### Key Benefits

| Benefit | Why it matters to researchers |
|---------|--------------------------------|
| **Comprehensive access** | Query thousands of papers in seconds. |
| **Reduced hallucination** | LLM responses cite real passages, not guesses. |
| **Domain depth** | You control the corpus—e.g., oncology only, or climate datasets only. |
| **Transparency & traceability** | Citations (DOI, PubMed ID, URL) travel with each answer. |
| **Versioning** *(optional)* | Snapshots let you reproduce a review at a specific date. |
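
To make "document chunks plus rich metadata" concrete, here is one possible shape for a single knowledge-base entry. It is an illustrative sketch only: the `KBEntry` class, its field names, and the file name are hypothetical and are not the format used later in this notebook.

```python
from dataclasses import dataclass, field

@dataclass
class KBEntry:
    """One searchable unit: a text chunk plus the metadata needed to cite it."""
    text: str
    source: str                                   # e.g. file name or URL
    section: str = ""                             # e.g. "abstract", "methods"
    metadata: dict = field(default_factory=dict)  # e.g. {"doi": ..., "year": ...}

    def citation(self):
        """Short human-readable pointer back to where the chunk came from."""
        label = f"{self.source} [{self.section}]" if self.section else self.source
        year = self.metadata.get("year")
        return f"{label} ({year})" if year else label

entry = KBEntry(
    text="This study examines the impact of rising sea levels...",
    source="zhang2023.json",  # hypothetical file name
    section="abstract",
    metadata={"year": 2023},
)
print(entry.citation())
```

Keeping the citation data next to the text is what lets retrieved passages be surfaced as traceable sources.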


# 🧪 Interactive Knowledge Base Demo

In [2]:
# Simulated knowledge base search interface
import ipywidgets as widgets
from IPython.display import display, HTML

# Simulated knowledge base with scientific papers
kb = {
    "climate": [
        {"title": "Rising Sea Levels and Urban Infrastructure", "author": "Zhang et al.", "year": 2023,
         "abstract": "This study examines the impact of rising sea levels on urban infrastructure in 15 major coastal cities."},
        {"title": "Climate Adaptation Strategies for Coastal Communities", "author": "Johnson et al.", "year": 2022,
         "abstract": "A comprehensive review of adaptation strategies implemented in vulnerable coastal regions worldwide."},
        {"title": "Economic Analysis of Sea Level Rise", "author": "Patel et al.", "year": 2023,
         "abstract": "This paper projects the economic costs associated with various sea level rise scenarios through 2100."}
    ],
    "ai": [
        {"title": "Advancements in Retrieval-Augmented Generation", "author": "Sharma et al.", "year": 2023,
         "abstract": "This paper explores recent improvements in RAG systems for specialized domain applications."},
        {"title": "LLMs in Scientific Literature Analysis", "author": "Lee et al.", "year": 2022,
         "abstract": "Analysis of how large language models can accelerate systematic literature reviews."},
        {"title": "Vector Databases for Scientific Knowledge", "author": "Garcia et al.", "year": 2023,
         "abstract": "Comparison of vector database technologies for scientific paper retrieval and analysis."}
    ]
}

# Create search interface
search_input = widgets.Text(
    value='',
    placeholder='Search (try "climate" or "ai")',
    description='Search:',
    disabled=False
)

search_output = widgets.Output()

def search_kb(query):
    # Normalize the query so "Climate " and "climate" hit the same key
    topic = query.strip().lower()
    if topic in kb:
        results = kb[topic]
        html = "<h3>Search Results:</h3><ul>"
        for paper in results:
            html += f"<li><b>{paper['title']}</b> ({paper['year']}) by {paper['author']}<br>{paper['abstract']}</li>"
        html += "</ul>"
        return HTML(html)
    else:
        return HTML("<p>No results found. Try 'climate' or 'ai'.</p>")

def on_search_change(change):
    with search_output:
        search_output.clear_output()
        if change['type'] == 'change' and change['name'] == 'value':
            display(search_kb(change['new']))

search_input.observe(on_search_change, names='value')

# Display the search interface
display(search_input)
display(search_output)


## 5. Hybrid RAG System with Web Search Integration

A hybrid RAG system combines local document knowledge with fresh web results to provide comprehensive, up-to-date answers.
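
The idea of ranking local and web candidates on one scale can be sketched as follows. This assumes every candidate already carries a `score` computed against the query with the same embedding model; `merge_results` and the sample scores are hypothetical and only illustrate the merge step.

```python
def merge_results(local_hits, web_hits, k=3):
    """Merge two candidate lists and keep the k highest-scoring items overall."""
    combined = local_hits + web_hits  # new list; the inputs are left untouched
    combined.sort(key=lambda hit: hit["score"], reverse=True)
    return combined[:k]

local_hits = [
    {"text": "local chunk A", "type": "local", "score": 0.82},
    {"text": "local chunk B", "type": "local", "score": 0.40},
]
web_hits = [
    {"text": "web snippet X", "type": "web", "score": 0.75},
    {"text": "web snippet Y", "type": "web", "score": 0.55},
]
top = merge_results(local_hits, web_hits, k=3)
print([(hit["type"], hit["score"]) for hit in top])
```

Because both sources are scored the same way, neither is favored by construction; the `type` field is kept so the answer can still say where each passage came from.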


### Environment Setup

Serper is a Google Search API.

- In this workshop, we provide a `SERPER_API_KEY` in the `api_keys.txt` file.

- If you want to run with your own key, go to https://serper.dev/ to get a `SERPER_API_KEY`. The free plan is 100 searches / month.  

- Place PDFs in the `/content/drive/MyDrive/Colab_Notebooks/AI/arxiv_pdfs/` directory.
- Load the API keys and install all dependencies:


In [None]:
import os

api_keys_path = '/content/drive/MyDrive/Colab_Notebooks/AI/api_keys.txt'

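# Each non-empty line is expected to look like KEY=value
with open(api_keys_path) as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith('#') or '=' not in line:
            continue  # skip blanks, comments, and malformed lines
        key, value = line.split('=', 1)  # split once: values may contain '='
        os.environ[key.strip()] = value.strip()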

In [None]:
import openai

openai_api_key = os.environ['OPENAI_API_KEY']

In [None]:
!pip install -q PyPDF2 faiss-cpu sentence-transformers transformers python-dotenv requests torch

### Import Libraries
<!--
- PyPDF2: For reading and parsing PDF documents
- faiss-cpu: Facebook AI's efficient similarity search library, crucial for vector indexing
- sentence-transformers: For creating high-quality document and query embeddings
- transformers: For accessing pre-trained language models
- python-dotenv: For securely managing environment variables and API keys
- requests: For making HTTP requests to web search APIs
- torch: The deep learning framework that powers our models -->

In [None]:
import os
import PyPDF2
import numpy as np
import requests
import torch
from google.colab import userdata
from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer
from transformers import pipeline, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity
from typing import List, Dict, Tuple, Optional
import time
import pickle
import json
import getpass
import re
import shutil
import glob

# FAISS (Facebook AI Similarity Search) is a library for fast similarity
# search over dense vectors such as text embeddings. It scales nearest-neighbor
# search well beyond what keyword- or hash-based lookups can handle.
try:
    import faiss
except ImportError as exc:
    # Nothing under the faiss package can be imported if faiss itself is missing
    raise ImportError("FAISS is not installed; run `pip install faiss-cpu` first.") from exc


### Preprocess
Papers whose title is recorded as 'Not Found', or that have no title field at all, are excluded.

In [None]:
#PAPERS_DIR = '/content/drive/MyDrive/Colab_Notebooks/AI/arxiv_markdowns2'
#OUTPUT_DIR = '/content/drive/MyDrive/Colab_Notebooks/AI/filtered_json_files'

PAPERS_DIR = '/content/drive/MyDrive/Colab_Notebooks/AI/arxiv_json'
OUTPUT_DIR = '/content/drive/MyDrive/Colab_Notebooks/AI/filtered_json_files2'

if os.path.exists(OUTPUT_DIR):
    for file in glob.glob(os.path.join(OUTPUT_DIR, "*")):
        os.remove(file)
else:
    os.makedirs(OUTPUT_DIR)

json_files = glob.glob(os.path.join(PAPERS_DIR, "*.json"))
print(f"Found {len(json_files)} JSON files in {PAPERS_DIR}")

valid_files_count = 0
invalid_files_count = 0

# Process each JSON file
for file_path in json_files:
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)

        if 'title' in data and data['title'] != "Not Found":
            dest_path = os.path.join(OUTPUT_DIR, os.path.basename(file_path))
            shutil.copy2(file_path, dest_path)
            valid_files_count += 1
        else:
            invalid_files_count += 1

    except json.JSONDecodeError:
        print(f"Error: Could not decode JSON in file {file_path}")
    except Exception as e:
        print(f"Error processing file {file_path}: {str(e)}")

print(f"Valid files copied: {valid_files_count}")
print(f"Invalid files excluded: {invalid_files_count}")

# Check what's in the output directory
paper_files = glob.glob(os.path.join(OUTPUT_DIR, "*.json"))
print(f"Found {len(paper_files)} JSON files in {OUTPUT_DIR}")

for file in paper_files[:5]:
    print(f" - {os.path.basename(file)}")

Found 197 JSON files in /content/drive/MyDrive/Colab_Notebooks/AI/arxiv_json
Valid files copied: 197
Invalid files excluded: 0
Found 197 JSON files in /content/drive/MyDrive/Colab_Notebooks/AI/filtered_json_files2
 - 2210.11630v1.json
 - 2210.10723v2.json
 - 2303.00077v1.json
 - 2302.13681v2.json
 - 2112.02969v1.json


In [4]:
load_dotenv()

class Config:
    #JSON_DIR = '/content/drive/MyDrive/Colab_Notebooks/AI/filtered_json_files'
    JSON_DIR = '/content/drive/MyDrive/Colab_Notebooks/AI/filtered_json_files2'
    DRIVE_BASE_DIR = "/content/drive/MyDrive/Colab_Notebooks/AI/section6/data"
    #VECTOR_DB_DIR = os.path.join(DRIVE_BASE_DIR, "vector_db")
    VECTOR_DB_DIR = os.path.join(DRIVE_BASE_DIR, "vector_db2")
    LOCAL_CHUNK_SIZE = 512  # Characters per document chunk. Controls how large each document chunk should be (512 characters)
    OVERLAP_SIZE = int(LOCAL_CHUNK_SIZE * 0.1)  # 10% overlap
    EMBEDDING_MODEL = "all-mpnet-base-v2" # This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
    RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2" # Model used for re-ranking search results
    # LLM_MODEL = "google/flan-t5-base"
    USE_OPENAI = True
    OPENAI_MODEL = "gpt-4"
    MAX_CONTEXT_LENGTH = 1000 # Maximum context length to provide to the LLM
    ANSWER_MIN_LENGTH = 100 # Controls the length of generated answers
    ANSWER_MAX_LENGTH = 1000
    MIN_RESPONSE_WORDS = 25 # Minimum words required for a valid response

config = Config()

os.makedirs(config.VECTOR_DB_DIR, exist_ok=True)

OVERLAP_SIZE = int(config.LOCAL_CHUNK_SIZE * 0.1)  # 10% overlap


### Document Processing

<!-- This section handles loading, parsing, and chunking of local PDF documents.


This section handles loading, parsing, and chunking of local PDF documents:

The load_and_chunk_pdfs() function identifies all PDF files in the specified directory
For each PDF, it:

- Opens and reads the file using PyPDF2
- Extracts text from all pages
- Divides the text into smaller chunks based on LOCAL_CHUNK_SIZE
- Creates document objects with text, source information, and type
- Handles errors gracefully if a PDF can't be processed -->



Chunking is a crucial step in RAG systems, as it:

- Makes retrieval more precise by allowing the system to retrieve only the most relevant chunks
- Keeps context sizes manageable for both embedding generation and LLM input
- Improves search efficiency by creating more specific document fragments
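
The sliding-window idea behind chunking with overlap can be shown in isolation. This is a simplified sketch in the spirit of the loader below, using tiny numbers so the overlap is visible; `chunk_text` is a hypothetical helper, and the notebook itself uses `LOCAL_CHUNK_SIZE = 512` with a 10% overlap.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into windows of chunk_size characters sharing `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # the window has reached the end of the text
    return chunks

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=2)
print(chunks)
```

Each chunk repeats the last two characters of the previous one, so a sentence cut at a boundary still appears intact in at least part of a neighboring chunk.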

In [None]:
def load_and_chunk_json_files(json_dir: str) -> List[Dict]:
    """Load and chunk JSON documents that contain extracted research paper information"""
    documents = []
    os.makedirs(json_dir, exist_ok=True)

    for filename in os.listdir(json_dir):
        if filename.endswith(".json"):
            path = os.path.join(json_dir, filename)
            try:
                with open(path, 'r', encoding='utf-8') as file:
                    # Load JSON content
                    paper_data = json.load(file)

                    # Process title and abstract
                    if 'title' in paper_data and paper_data['title']:
                        title = paper_data['title']
                        # Abstract is usually shorter so combine with title
                        abstract = paper_data.get('abstract', '')
                        title_abstract = f"TITLE: {title}\n\nABSTRACT: {abstract}"

                        documents.append({
                            'text': title_abstract,
                            'source': f"{filename} [Title & Abstract]",
                            'type': 'local'
                        })

                    # Process each section in the paper
                    for section_name, section_content in paper_data.items():
                        # Skip non-text fields and fields handled separately (acknowledgements and references are added below)
                        if section_name in ['title', 'abstract', 'filename', 'authors', 'keywords', 'references', 'acknowledgements'] or not section_content:
                            continue

                        # Handle sections that could be text or dictionary
                        if isinstance(section_content, dict):
                            # Some JSON formats might have sections as nested dictionaries
                            section_text = section_content.get('text', '')
                        else:
                            section_text = section_content

                        if not section_text or not isinstance(section_text, str):
                            continue

                        # For longer sections, divide into chunks
                        if len(section_text) > config.LOCAL_CHUNK_SIZE:
                            chunks = []
                            for i in range(0, len(section_text), config.LOCAL_CHUNK_SIZE - config.OVERLAP_SIZE):
                                # Make sure we don't go beyond the text length
                                end_idx = min(i + config.LOCAL_CHUNK_SIZE, len(section_text))
                                chunks.append(section_text[i:end_idx])

                                # Break if this was the last chunk
                                if end_idx == len(section_text):
                                    break

                            for i, chunk in enumerate(chunks):
                                documents.append({
                                    'text': chunk,
                                    'source': f"{filename} [{section_name.capitalize()} - Chunk {i+1}/{len(chunks)}]",
                                    'type': 'local'
                                })
                        else:
                            documents.append({
                                'text': section_text,
                                'source': f"{filename} [{section_name.capitalize()}]",
                                'type': 'local'
                            })

                    # Process acknowledgements separately if present
                    if 'acknowledgements' in paper_data and paper_data['acknowledgements']:
                        documents.append({
                            'text': paper_data['acknowledgements'],
                            'source': f"{filename} [Acknowledgements]",
                            'type': 'local'
                        })

                    # Process references as a separate document
                    if 'references' in paper_data and paper_data['references']:
                        # References could be a list of objects or a string
                        if isinstance(paper_data['references'], list):
                            # Format reference entries
                            ref_text = ""
                            for i, ref in enumerate(paper_data['references']):
                                if isinstance(ref, dict):
                                    # Format structured reference
                                    authors = ref.get('authors', '')
                                    title = ref.get('title', '')
                                    year = ref.get('year', '')

                                    if isinstance(authors, list):
                                        authors = ', '.join(authors)

                                    ref_entry = f"[{i+1}] {authors} ({year}). {title}"
                                    ref_text += ref_entry + "\n"
                                else:
                                    # Handle case where reference is a string
                                    ref_text += f"[{i+1}] {ref}\n"
                        else:
                            # References is a string
                            ref_text = paper_data['references']

                        documents.append({
                            'text': ref_text,
                            'source': f"{filename} [References]",
                            'type': 'local'
                        })

                    # Add authors and metadata as a separate document for better author-based retrieval
                    metadata = ""
                    if 'authors' in paper_data and paper_data['authors']:
                        if isinstance(paper_data['authors'], list):
                            authors = ', '.join(paper_data['authors'])
                        else:
                            authors = paper_data['authors']
                        metadata += f"AUTHORS: {authors}\n"

                    if 'keywords' in paper_data and paper_data['keywords']:
                        if isinstance(paper_data['keywords'], list):
                            keywords = ', '.join(paper_data['keywords'])
                        else:
                            keywords = paper_data['keywords']
                        metadata += f"KEYWORDS: {keywords}\n"

                    if metadata:
                        documents.append({
                            'text': metadata,
                            'source': f"{filename} [Metadata]",
                            'type': 'local'
                        })

            except Exception as e:
                print(f"Error processing {filename}: {e}")
                import traceback
                traceback.print_exc()  # Print the full traceback for debugging

    print(f"Loaded {len(documents)} document chunks from {json_dir}")
    return documents

### Vector Database Setup

This section creates the vector database used for efficient similarity search.

We load the chunked documents from the JSON directory, embed them, and store the embeddings in a FAISS (Facebook AI Similarity Search) index. FAISS is particularly useful here because it:

- Provides extremely fast similarity search, even with millions of vectors
- Optimizes memory usage through efficient storage methods
- Supports GPU acceleration when available
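
Conceptually, an exact ("flat") L2 index compares the query against every stored vector and returns the closest ones. The pure-Python sketch below shows that idea only; it is not how FAISS is implemented, and `l2_search` and the sample vectors are hypothetical.

```python
def l2_search(index_vectors, query, k=2):
    """Exact nearest-neighbour search: compare the query with every stored vector."""
    scored = []
    for position, vector in enumerate(index_vectors):
        squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
        scored.append((squared_distance, position))
    scored.sort()  # smallest distance first
    return scored[:k]

stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
hits = l2_search(stored, query=[1.0, 2.0], k=2)
print(hits)
```

The returned positions play the role of document ids: they are used to look up the matching chunk in the `documents` list that is saved alongside the index.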

We use a **reuse_existing** flag: set it to `False` on the first run to build the index, then set it to `True` after the first successful run to reload the saved vector database instead of re-encoding the documents.


In [None]:
def save_vector_db(embedding_model, index, documents, save_dir: str = config.VECTOR_DB_DIR):
    """Save the vector database and related components to disk"""
    os.makedirs(save_dir, exist_ok=True)

    # Save documents
    with open(os.path.join(save_dir, "documents.pkl"), "wb") as f:
        pickle.dump(documents, f)

    faiss.write_index(index, os.path.join(save_dir, "faiss_index.bin"))

    # Save embedding model path
    with open(os.path.join(save_dir, "model_name.txt"), "w") as f:
        f.write(config.EMBEDDING_MODEL)

    print(f"Vector database saved to {save_dir}")

In [None]:
def load_vector_db(save_dir: str = config.VECTOR_DB_DIR) -> Tuple[Optional[SentenceTransformer], Optional[faiss.Index], List[Dict]]:
    """Load the vector database and related components from disk"""
    try:
        # Check if all required files exist
        if not all(os.path.exists(os.path.join(save_dir, f)) for f in
                   ["documents.pkl", "faiss_index.bin", "model_name.txt"]):
            print(f"Missing vector database files in {save_dir}")
            return None, None, []

        with open(os.path.join(save_dir, "documents.pkl"), "rb") as f:
            documents = pickle.load(f)

        index = faiss.read_index(os.path.join(save_dir, "faiss_index.bin"))

        # Load embedding model
        with open(os.path.join(save_dir, "model_name.txt"), "r") as f:
            model_name = f.read().strip()

        # Check if model name matches config
        if model_name != config.EMBEDDING_MODEL:
            print(f"Warning: Saved model ({model_name}) differs from configured model ({config.EMBEDDING_MODEL})")
            print("Using the configured model, but consider rebuilding the index for consistency")

        embedding_model = SentenceTransformer(config.EMBEDDING_MODEL)

        print(f"Vector database loaded from {save_dir}")
        print(f"Loaded {len(documents)} document chunks and FAISS index with {index.ntotal} vectors")

        return embedding_model, index, documents

    except Exception as e:
        print(f"Error loading vector database: {e}")
        import traceback
        traceback.print_exc()
        return None, None, []

In [None]:
# Handles the case of empty document collections
# Initializes the SentenceTransformer model specified in our config
# Encodes all document chunks into dense vector embeddings
# Creates a FAISS index optimized for L2 distance calculations
# Adds the document embeddings to the index
def setup_vector_db(documents, reuse_existing: bool = False, save_dir: str = config.VECTOR_DB_DIR):
    """Set up the vector database with document embeddings, with option to reuse existing DB"""
    # Try to load existing vector DB if requested
    if reuse_existing:
        embedding_model, index, loaded_documents = load_vector_db(save_dir)
        if embedding_model is not None and index is not None and loaded_documents:
            return embedding_model, index, loaded_documents
        print("Could not load existing vector DB, creating new one...")

    # Create new vector DB
    if not documents:
        print("Warning: No documents found. Vector DB will be empty.")
        dimension = 768  # Embedding size of all-mpnet-base-v2; other models may differ
        empty_index = faiss.IndexFlatL2(dimension)
        return None, empty_index, []

    embedding_model = SentenceTransformer(config.EMBEDDING_MODEL)
    print("Encoding documents...")
    embeddings = embedding_model.encode(
        [doc['text'] for doc in documents],
        show_progress_bar=True,
        batch_size=32
    )

    dimension = embeddings.shape[1]
    print('Embedding dimension is', dimension)
    index = faiss.IndexFlatL2(dimension)
    index.add(np.array(embeddings).astype('float32'))

    # Save the newly created vector DB
    save_vector_db(embedding_model, index, documents, save_dir)

    return embedding_model, index, documents

# 🧠 Knowledge Check:

What's the difference between L2 distance and cosine similarity for vector search?

<details>
<summary>Click here for answers</summary>

L2 Distance (Euclidean Distance):

* Measures the straight-line distance between two points in vector space
* Sensitive to both the direction and magnitude (length) of vectors.
* If two embeddings point in the same direction but have different magnitudes, L2 distance still treats them as some distance apart

Cosine Similarity:

* Measures the cosine of the angle between two vectors
* Only considers the direction of vectors, not their magnitude
* Two embeddings that point in the same direction have a cosine similarity of 1, regardless of their magnitudes
* Range is [-1, 1], with 1 meaning perfectly similar, 0 meaning orthogonal, and -1 meaning opposite





</details>
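
The difference can be checked numerically. The sketch below uses hand-picked 2-D vectors and hypothetical helper names (`l2_distance`, `cosine_sim`, `normalize`); it also verifies the standard identity that for unit-length vectors, squared L2 distance equals 2 − 2 · cosine similarity, which is why the two measures rank normalized embeddings identically.

```python
import math

def l2_distance(a, b):
    """Straight-line (Euclidean) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.hypot(*v)
    return [x / norm for x in v]

a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the magnitude

print(l2_distance(a, b))  # non-zero, because the magnitudes differ
print(cosine_sim(a, b))   # approximately 1, because the direction is identical

# For unit vectors: squared L2 distance == 2 - 2 * cosine similarity
ua, ub = normalize(a), normalize([2.0, 1.0])
lhs = l2_distance(ua, ub) ** 2
rhs = 2 - 2 * cosine_sim(ua, ub)
print(abs(lhs - rhs) < 1e-9)
```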

**reuse_existing** flag

In [None]:
# Keep this False for the first run; change it to True afterwards to reuse the saved vector database
reuse_existing = False

In [None]:
start_time = time.time()

if reuse_existing:
    print("Trying to reuse existing vector database...")
    embedding_model, index, documents = setup_vector_db([], reuse_existing=True)

    if embedding_model is None or len(documents) == 0:
        print("No existing vector database found or it was empty. Creating new one...")
        # documents = load_and_chunk_pdfs(config.PDF_DIR)
        documents = load_and_chunk_json_files(config.JSON_DIR)
        print(f"Loaded {len(documents)} document chunks")
        embedding_model, index, documents = setup_vector_db(documents)
else:
    # Load documents and create new vector DB
    print("Loading documents...")
    documents = load_and_chunk_json_files(config.JSON_DIR)
    # documents = load_and_chunk_pdfs(config.PDF_DIR)
    print(f"Loaded {len(documents)} document chunks")

    print("Setting up vector database...")
    embedding_model, index, documents = setup_vector_db(documents)

end_time = time.time()
elapsed_time = end_time - start_time
minutes, seconds = divmod(elapsed_time, 60)
print(f"Process completed in {int(minutes)} min {seconds:.2f} sec.")

Loading documents...
Loaded 585 document chunks from /content/drive/MyDrive/Colab_Notebooks/AI/filtered_json_files2
Loaded 585 document chunks
Setting up vector database...


Encoding documents...



Embedding dimension is 768
Vector database saved to /content/drive/MyDrive/Colab_Notebooks/AI/section6/data/vector_db2
Process completed in 0 min 22.22 sec.


### Web Search Retriever
The WebSearchRetriever class adds external web search to the pipeline. It is initialized with a Serper (Google Search API) key.

The search() method:

- Builds a JSON query for the search API
- Catches request errors and timeouts and returns an empty list instead of raising
- Normalizes the organic search results
- Returns them in a standardized format compatible with our local results


This integration allows the system to:

- Access up-to-date information beyond what's in our local documents
- Provide answers to questions not covered by local knowledge
- Verify or complement information found in local documents

The error handling ensures the system continues functioning even if the web search fails.
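
The normalization step can be tried offline before wiring in the real API. The sketch below is a minimal illustration, not the class itself: `normalize_serper_results` is a hypothetical helper name, and `sample_payload` is a hand-written fragment that only assumes the response carries an `organic` list with `title`, `link`, and `snippet` fields, as the retriever code expects.

```python
from typing import Dict, List


def normalize_serper_results(payload: Dict) -> List[Dict]:
    """Map the 'organic' entries of a search response onto the local-result schema."""
    normalized = []
    for item in payload.get("organic", []):
        title = item.get("title", "No title")
        normalized.append({
            "content": item.get("snippet", ""),  # snippet plays the role of the chunk text
            "title": title,
            "url": item.get("link"),
            "source": f"Web: {title}",
            "type": "web",                        # lets later steps tell web from local
        })
    return normalized


# Hypothetical response fragment, written by hand for illustration only
sample_payload = {
    "organic": [
        {"title": "Example page", "link": "https://example.com/a", "snippet": "A short snippet."},
        {"link": "https://example.com/b"},  # missing title and snippet
    ]
}

results = normalize_serper_results(sample_payload)
for r in results:
    print(r["source"], "|", r["url"], "|", repr(r["content"]))
```

Entries with missing fields fall back to defaults rather than raising, so one malformed search hit does not break the rest of the pipeline.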

In [None]:
class WebSearchRetriever:
    """Handles web search integration"""

    def __init__(self, api_key: str):
        # Use the key passed in by the caller; an empty key disables web search
        self.api_key = api_key
        self.search_url = "https://google.serper.dev/search"
        self.headers = {'X-API-KEY': self.api_key}

    def search(self, query: str, num_results: int = 5) -> List[Dict]:
        """Perform web search"""
        if not self.api_key:
            print("Warning: No API key provided for web search")
            return []

        payload = {'q': query, 'num': num_results, 'hl': 'en', 'gl': 'us'} # hl: language, gl: country

        try:
            response = requests.post(
                self.search_url,
                headers=self.headers,
                json=payload, # Send as JSON
                timeout=10
            )
            response.raise_for_status()

            results = []
            # "Organic" results are the natural listings ranked by relevance to the
            # query, as opposed to paid or sponsored results. They earn their
            # placement through the search engine's algorithm, not advertising.
            for result in response.json().get('organic', []):
                results.append({
                    'content': result.get('snippet', ''),
                    'title': result.get('title', 'No title'),
                    'url': result.get('link'),
                    'source': f"Web: {result.get('title', 'No title')}",
                    'type': 'web'
                })
            return results

        except Exception as e:
            print(f"Web search failed: {e}")
            return []

# .get() avoids a KeyError when the key is unset; search() then warns and returns []
web_retriever = WebSearchRetriever(os.environ.get('SERPER_API_KEY', ''))

### Hybrid RAG System

The HybridRAGSystem class is the heart of our implementation. It combines:

- Local document retrieval
- Web search results
- Answer generation

Key components include:

- Initialization: Stores the documents, FAISS index, embedding model, and web retriever, and sets up the OpenAI client when it is enabled
- hybrid_retrieve():

    - Retrieves results from both local and web sources
    - Deduplicates them to avoid redundant information
    - Re-embeds the remaining results and ranks them by cosine similarity to the query
    - Returns the top_k most relevant results


- generate_answer():

    - Creates a structured prompt with retrieved context
    - Formats sources for proper citation
    - Generates a coherent answer using the LLM
    - Includes error handling for various failure modes


- _create_fallback_answer():

    - Provides a graceful fallback when LLM generation fails
    - Formats raw sources in a readable way
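
The ranking step in hybrid_retrieve() can be illustrated with plain NumPy. This is a sketch under stated assumptions: `rank_by_cosine` is a hypothetical helper, and the two-dimensional vectors are toy stand-ins for the 768-dimensional sentence embeddings the notebook produces. It also assumes no embedding is the zero vector.

```python
import numpy as np


def rank_by_cosine(query_vec, result_vecs, results, top_k=3):
    """Return (result, score) pairs sorted by cosine similarity to the query."""
    q = np.asarray(query_vec, dtype="float32")
    m = np.asarray(result_vecs, dtype="float32")
    # Normalize to unit length so the dot product equals cosine similarity
    q_unit = q / np.linalg.norm(q)
    m_unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = m_unit @ q_unit
    order = np.argsort(-sims)[:top_k]  # highest similarity first
    return [(results[i], float(sims[i])) for i in order]


# Toy data: local and web results share one embedding space
toy_results = [
    {"source": "local_chunk", "type": "local"},
    {"source": "Web: page A", "type": "web"},
    {"source": "Web: page B", "type": "web"},
]
toy_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query_vector = [1.0, 0.2]

ranked = rank_by_cosine(query_vector, toy_vectors, toy_results, top_k=2)
for res, score in ranked:
    print(f"{score:.3f}  {res['source']}")
```

Because every candidate is scored by the same model against the same query vector, a local chunk and a web snippet become directly comparable, which the raw FAISS scores and the search engine's own ordering are not.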



The deduplication strategy is particularly important as it:

- Prevents redundant information from overwhelming the context
- Improves context quality by increasing information diversity
- Optimizes token usage when sending context to the LLM
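
The deduplication rules can be sketched as a standalone function: web results are keyed by URL, local chunks by a prefix of their text. `deduplicate` and the toy records are hypothetical and written for illustration; unlike the class below, which hashes the 512-character prefix, this sketch stores the prefix string directly.

```python
from typing import Dict, List


def deduplicate(results: List[Dict], signature_chars: int = 512) -> List[Dict]:
    """Keep the first occurrence of each web URL and each local text prefix."""
    seen_urls, seen_texts = set(), set()
    unique = []
    for res in results:
        if res.get("type") == "web":
            url = res.get("url") or ""
            if url:  # results without a URL cannot be compared, so keep them
                if url in seen_urls:
                    continue
                seen_urls.add(url)
        else:
            signature = res.get("text", "")[:signature_chars]
            if signature in seen_texts:
                continue
            seen_texts.add(signature)
        unique.append(res)
    return unique


toy = [
    {"type": "local", "text": "FAISS stores dense vectors."},
    {"type": "web", "url": "https://example.com/a", "content": "Snippet A"},
    {"type": "local", "text": "FAISS stores dense vectors."},              # duplicate chunk
    {"type": "web", "url": "https://example.com/a", "content": "Snippet A again"},  # duplicate URL
    {"type": "web", "url": "https://example.com/b", "content": "Snippet B"},
]

unique = deduplicate(toy)
print(len(toy), "->", len(unique))  # 5 -> 3
```

This catches exact repeats only. Near-duplicates, such as the same article mirrored on two URLs, would need a similarity threshold on the embeddings instead.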

In [None]:
class HybridRAGSystem:
    """Combines local and web retrieval with generation"""

    def __init__(self, documents, local_index, embedding_model, web_retriever):
        self.local_index = local_index
        self.embedding_model = embedding_model
        self.documents = documents
        self.web_retriever = web_retriever
        # Defaults so generate_answer() can safely check both attributes
        self.openai_client = None
        self.llm = None  # optional local fallback model (e.g. a T5 pipeline); not loaded here

        if config.USE_OPENAI:
            try:
                self.openai_client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
                print("OpenAI client initialized")
            except Exception as e:
                print(f"Error initializing OpenAI client: {e}")
                self.openai_client = None
                config.USE_OPENAI = False

    def hybrid_retrieve(self, query: str, top_k: int = 10) -> List[Dict]:
        """Retrieve from both local and web sources with deduplication"""
        combined_results = []

        # Create query embedding
        query_embedding = self.embedding_model.encode([query])
        query_embedding_array = np.array(query_embedding).astype('float32')

        # ntotal is the number of vectors that have been added to the FAISS index
        if self.documents and self.local_index.ntotal > 0:
            # local_scores: similarity scores (or distances, depending on the index type) for each match
            # local_indices: positions of the matched vectors in the index
            local_scores, local_indices = self.local_index.search(
                query_embedding_array,
                min(top_k, self.local_index.ntotal))
            # FAISS pads missing neighbours with -1, which would otherwise select documents[-1]
            combined_results.extend([self.documents[i] for i in local_indices[0] if i != -1])

        combined_results.extend(self.web_retriever.search(query, num_results=top_k))

        # Deduplication
        unique_results = []
        seen_urls = set()
        seen_texts = set()

        for res in combined_results:
            # Web results: deduplicate by URL
            if res.get('type') == 'web':
                url = res.get('url', '')
                if url and url in seen_urls:
                    continue
                if url:
                    seen_urls.add(url)
            # Local results: deduplicate by a hash of the leading text
            else:
                text_hash = hash(res.get('text', '')[:512])  # First 512 characters as signature
                if text_hash in seen_texts:
                    continue
                seen_texts.add(text_hash)
            unique_results.append(res)

        if not unique_results:
            return []

        texts = [f"{res.get('title', '')} {res.get('text', res.get('content', ''))}"
                for res in unique_results]

        # The two sources were retrieved with different algorithms, so their native
        # scores are not comparable. Re-embedding every result and scoring it against
        # the query with cosine similarity gives a single, unified ranking metric.
        result_embeddings = self.embedding_model.encode(texts)
        similarities = cosine_similarity(
            query_embedding,
            result_embeddings
        )[0]

        results_with_scores = list(zip(unique_results, similarities))
        results_with_scores.sort(key=lambda x: x[1], reverse=True) # sort the similarity score in descending order

        return [r for r, _ in results_with_scores[:top_k]]

    def generate_answer(self, query: str, context: List[Dict]) -> str:
        """Generate answer with source citations using GPT-4 or fallback to T5"""
        if not context:
            return "I couldn't find any relevant information to answer your question."

        # Try using OpenAI's GPT-4 if available
        if config.USE_OPENAI and self.openai_client:
            try:
                return self._generate_gpt4_answer(query, context)
            except Exception as e:
                print(f"Error using GPT-4 for answer generation: {e}")
                print("Falling back to T5...")

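        # Fallback to a local T5 model, if one was attached as self.llm
        if getattr(self, 'llm', None) is None:
            return "Neither GPT-4 nor the fallback LLM is available. Please check your configuration."

        # _generate_t5_answer must be defined alongside the T5 pipeline
        return self._generate_t5_answer(query, context)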


    def _generate_gpt4_answer(self, query: str, context: List[Dict]) -> str:
        """Generate literature review using GPT-4"""
        try:
            # Format sources for GPT-4
            sources_text = ""
            for i, doc in enumerate(context):
                source_type = "Local" if doc.get('type') == 'local' else "Web"
                source_id = f"[{i+1}]"
                source_title = doc.get('source', doc.get('title', 'Unknown'))
                source_content = doc.get('text', doc.get('content', ''))[:500]  # Limit content length

                sources_text += f"{source_id} {source_type}: {source_title}\n{source_content}\n\n"

            # Create system message with instructions
            system_message = """You are an academic research assistant creating a comprehensive literature review.
            Follow these requirements strictly:
            1. Structure your response as a formal academic literature review.
            2. MUST cite ALL sources provided using their numeric identifiers [1], [2], etc.
            3. Every paragraph should include at least one citation.
            4. ALL provided sources must be cited at least once.
            5. Do not invent or assume information not present in the sources.
            6. Use a scholarly tone and academic style throughout.
            7. Synthesize information across sources rather than summarizing each source separately.
            8. Include sections: Introduction, Current Research Trends, Methodological Approaches, Research Gaps, Future Directions, and Conclusion.
            9. End with a numbered reference list of all sources.
            """

            # Create user message with query and sources
            user_message = f"""Question: {query}

            Please create a literature review using ALL of the following sources:

            {sources_text}

            Remember to cite EVERY source at least once using its numeric identifier [X].
            """

            # Call the GPT-4 API. Raise the temperature for more creative answers,
            # or lower it for more deterministic ones.
            response = self.openai_client.chat.completions.create(
                model=config.OPENAI_MODEL,
                messages=[
                    {"role": "system", "content": system_message},
                    {"role": "user", "content": user_message}
                ],
                temperature=0.3,
                max_tokens=2000
            )

            # Extract and return the answer
            if response.choices and len(response.choices) > 0:
                answer = response.choices[0].message.content.strip()

                # Check if all sources are cited
                missing_citations = []
                for i in range(1, len(context) + 1):
                    if f"[{i}]" not in answer:
                        missing_citations.append(str(i))

                # Add note about missing citations if any
                if missing_citations:
                    note = f"\n\nNote: This literature review should have cited sources {', '.join(missing_citations)} but failed to do so. Consider these sources for a more comprehensive understanding."
                    answer += note

                return answer
            else:
                return self._create_fallback_answer(context, query)

        except Exception as e:
            print(f"Error in GPT-4 answer generation: {e}")
            # Return the error if in debug mode, otherwise use fallback
            if os.environ.get("DEBUG") == "1":
                return f"GPT-4 Error: {str(e)}\n\n{self._create_fallback_answer(context, query)}"
            return self._create_fallback_answer(context, query)


    def _create_fallback_answer(self, context: List[Dict], query: str) -> str:
        """Improved fallback answer formatting"""
        answer = [f"Here are relevant sources about {query}:"]
        for i, doc in enumerate(context, 1):
            content = doc.get('text', doc.get('content', ''))[:250]
            source = doc.get('source', doc.get('title', f"Source {i}"))
            answer.append(f"[{i}] {source}: {content}")
        return "\n".join(answer) + "\n\n[System: Answer generation failed - showing raw sources]"


print("Initializing RAG system...")
rag_system = HybridRAGSystem(documents, index, embedding_model, web_retriever)
print("RAG system initialized")

Initializing RAG system...
OpenAI client initialized
RAG system initialized
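
One caveat about the citation check in `_generate_gpt4_answer`: it looks for the literal substring `[i]`, so a grouped citation such as `[2,4]` is reported as missing for both sources. A more tolerant check could be sketched as follows. `find_missing_citations` is a hypothetical helper, not part of the class above, and it assumes citations are comma-separated integers inside square brackets (ranges like `[1-3]` are not handled).

```python
import re
from typing import List


def find_missing_citations(answer: str, num_sources: int) -> List[int]:
    """Return the source numbers in 1..num_sources that the answer never cites."""
    cited = set()
    # Matches [3], [2,4], [1, 5]: one or more integers separated by commas
    for group in re.findall(r"\[(\d+(?:\s*,\s*\d+)*)\]", answer):
        cited.update(int(n) for n in group.split(","))
    return [i for i in range(1, num_sources + 1) if i not in cited]


sample_answer = "Trend A is well documented [1]. Two labs report similar findings [2,4]. See also [5]."
print(find_missing_citations(sample_answer, 6))  # [3, 6]
```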


### Demo Execution

In [None]:
def run_demo(query: str, top_k: int = 6):
    """Complete RAG demo execution"""
    print(f"\n{'='*50}\nQuery: {query}\n{'='*50}")

    print("Retrieving results...")
    results = rag_system.hybrid_retrieve(query, top_k=top_k)

    # Check for an empty result set before printing the listing
    if not results:
        print("No results found.")
        return "No results found to answer the query."

    print(f"\nRetrieved {len(results)} Documents:")
    for i, res in enumerate(results):
        source = f"[Source {i+1}] {res.get('source', res.get('title', 'Unknown'))}"
        content = res.get('text', res.get('content', ''))
        print(f"{source}\n{content[:200]}...\n")

    print("Generating answer...")
    answer = rag_system.generate_answer(query, results)
    # print(f"\nGenerated Answer:\n{answer}\n")
    return answer

In [None]:
run_demo("What are the research trends about Large Language Model")


Query: What are the research trends about Large Language Model
Retrieving results...

Retrieved 6 Documents:
[Source 1] Web: Five Emerging Trends in Large Language Models - Aragon Research
Large language models (LLMs) are a critical element of the emerging generative AI technology stack, powering a variety of applications that range from chat to computer vision....

[Source 2] 2102.02503v1.json [Title & Abstract]
TITLE: Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models

ABSTRACT: ...

[Source 3] Web: Future of Large Language Models (LLMs) | AnnotationBox
Three vital changes that researchers are focusing on include enhancing model efficiency, reducing biases, and improving factual accuracy in ......

[Source 4] Web: Large Language Models 2024 Year in Review and 2025 Trends
Expect more research studies evaluating the capabilities of large language models versus human experts, as well as the increased use of LLMs ......

[Source 5] Web: The Future 

'Introduction\n\nLarge Language Models (LLMs) are becoming increasingly important in the field of generative AI technology, with applications ranging from chat to computer vision [1]. The current research trends, capabilities, limitations, and societal impacts of these models are the focus of several studies [2][4]. This literature review aims to provide an overview of the current state of research on LLMs, the methodological approaches employed, the gaps in the research, and the potential future directions.\n\nCurrent Research Trends\n\nThe current research trends in LLMs are centered around enhancing model efficiency, reducing biases, and improving factual accuracy [3]. There is also an increasing focus on evaluating the capabilities of LLMs versus human experts [4]. Future trends include fact-checking with real-time data integration, synthetic training data, and sparse expertise [5]. These trends reflect the growing importance of LLMs in various applications and the need to improve 

In [None]:
run_demo("Please give me the latest research on natural language processing")


Query: Please give me the latest research on natural language processing
Retrieving results...

Retrieved 6 Documents:
[Source 1] Web: (PDF) Natural language processing: state of the art, current trends ...
PDF | Natural language processing (NLP) has recently gained much attention for representing and analyzing human language computationally....

[Source 2] Web: Natural Language Processing - Google Research
Natural Language Processing (NLP) research at Google focuses on algorithms that apply at scale, across languages, and across domains....

[Source 3] Web: Advancements in natural language processing: Implications ...
This research delves into the latest advancements in Natural Language Processing (NLP) and their broader implications, challenges, and future directions....

[Source 4] Web: Natural language processing | Massachusetts Institute of Technology
MIT researchers make language models scalable self-learners. The scientists used a natural language-based logical inference datase

"Introduction\n\nNatural Language Processing (NLP) is a rapidly evolving field that has garnered significant attention in recent years for its potential to computationally represent and analyze human language [1]. This literature review aims to provide an overview of the current research trends, methodological approaches, research gaps, and future directions in the field of NLP.\n\nCurrent Research Trends\n\nThe current trends in NLP research are largely driven by advancements in machine learning and deep learning [5]. These advancements have facilitated the development of sophisticated language models that are capable of tasks such as text generation and summarization [6]. Furthermore, research at institutions like Google and the Massachusetts Institute of Technology (MIT) has focused on creating algorithms that can be applied at scale, across languages, and across domains [2,4]. \n\nMethodological Approaches\n\nThe methodological approaches in NLP research are diverse and innovative.

In [None]:
run_demo("What are the papers about Large Language Model")


Query: What are the papers about Large Language Model
Retrieving results...

Retrieved 6 Documents:
[Source 1] Web: 30 Important Research Papers to Understand Large Language Models
This article presents a curated list of 30 important research papers that provide deep insights into the development and functioning of large language models....

[Source 2] 2406.13138v2.json [Metadata]
AUTHORS: Because They Are Large Language Models, Philip Resnik
...

[Source 3] 2305.12152v2.json [Metadata]
AUTHORS: with Large Language Models, Dominik Stammbach\({}^{\mathbf{\varepsilon}}\) &Vilem Zouhar\({}^{\mathbf{\varepsilon}}\) &Alexander Hoyle\({}^{\mathbf{\text{M}}}\), Mrinmaya Sachan\({}^{\mathbf{...

[Source 4] Web: (PDF) Large Language Models: A Comprehensive Survey of its ...
This survey paper provides a comprehensive overview of LLMs, including their history, architecture, training methods, applications, and challenges....

[Source 5] 2308.00683v1.json [Metadata]
AUTHORS: Options for Large Lang

"Introduction\n\nLarge Language Models (LLMs) have been a significant area of research in the field of Natural Language Processing (NLP). These models have been instrumental in understanding and generating human-like text, thereby revolutionizing the way we interact with machines [4]. This literature review aims to provide an overview of the current research trends, methodological approaches, research gaps, and future directions in the field of LLMs.\n\nCurrent Research Trends\n\nThe current research trends in LLMs are largely focused on their development, functioning, and applications. The curated list of 30 important research papers provides deep insights into these areas [1]. Some of the most prominent LLMs include GPT, LLaMA, and PaLM, each having unique characteristics and applications [6]. LLMs have also been used in source code pretraining, demonstrating their versatility [5].\n\nMethodological Approaches\n\nThe methodological approaches to LLMs involve understanding their archi

# 💡 Ask your own question!

Your turn! Try your own question.


In [None]:
your_question = "What are the latest developments in quantum computing?"  # Replace with your own question
your_answer = run_demo(your_question)
print(f"\nGenerated Answer:\n{your_answer}\n")


Query: What are the latest developments in quantum computing?
Retrieving results...

Retrieved 6 Documents:
[Source 1] Web: The latest developments in quantum science and technology ...
Many more advancements in quantum technology are yet to come. Secure communication through metropolitan-scale entangled quantum networks, quantum machine clusters for high-end computation, and quantum...

[Source 2] Web: Quantum Computing: Breakthroughs, Challenges & What's Ahead
Breakthroughs in Quantum Computing in 2024 · 1. Increased Qubit Stability and Error Correction · 2. Quantum Supremacy Milestones · 3. Advancements in Quantum Algorithms · 4. Commercial Quantum Cloud S...

[Source 3] Web: The Quantum Insider: Quantum Computing News & Top Stories
Find the latest Quantum Computing news, data, market research, and insights. To stay up to date with the quantum market click here!...

[Source 4] Web: Quantum computing | Massachusetts Institute of Technology
Quantum computing · MIT engineers advance t

## Resources

- LangChain Documentation: https://python.langchain.com/docs/get_started/introduction
- FAISS Documentation: https://github.com/facebookresearch/faiss
- Sentence Transformers: https://www.sbert.net/
- RAG from scratch: https://github.com/langchain-ai/rag-from-scratch/tree/main