<a href="https://colab.research.google.com/github/snehal-sd/RAG/blob/main/Tensorflow_RAG_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to NLP for RAG

Thanks for coming! I've tried to include as many resources as possible, but the biggest part of this notebook is, of course, the code!
We'll run the main recommended flow together, then build a RAG chatbot on a different set of data.

# Generating and Using High-Quality Embeddings in RAG

- Generate embeddings and use them to retrieve documents.
- Generate text embeddings using Universal Sentence Encoder.
- Implement FAISS for fast nearest-neighbor search.
- Re-rank to improve retrieval quality in RAG.

---

- **Embeddings** convert text into **high-dimensional numerical vectors** that capture meaning.
- They allow machines to **compare text meaning mathematically** instead of relying on exact words.
- Used in **search engines, chatbots, and RAG systems**.
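
To make "compare text meaning mathematically" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embedding vectors. The three-dimensional vectors and the `toy_embeddings` name are made up for illustration; real models such as Universal Sentence Encoder produce hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: values near 1.0 mean similar direction."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "embeddings", hand-written for this example only
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "volcano": [0.0, 0.1, 0.9],
}

sim_cat_kitten = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
sim_cat_volcano = cosine_similarity(toy_embeddings["cat"], toy_embeddings["volcano"])
print(f"cat vs kitten:  {sim_cat_kitten:.3f}")
print(f"cat vs volcano: {sim_cat_volcano:.3f}")
```

Vectors that point in similar directions score close to 1, while unrelated ones score close to 0, with no exact word overlap required.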

📌 **Alternatives:**
- [`intfloat/e5-large-v2`](https://huggingface.co/intfloat/e5-large-v2) – Another top-tier embedding model.
- [`sentence-transformers/all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) – More lightweight option.

🧘☸️ **All This Will Change**
- The only constant in AI is change, so be [ready to re-run evaluations](https://github.com/RUC-NLPIR/FlashRAG)
- [Code Embeddings Eval](https://github.com/CoIR-team/coir)


# Step 1: Install Dependencies

In [4]:
!pip install faiss-cpu
!pip install wikipedia-api



## Step 1.5: Fetch some documents from Wikipedia

In [5]:
import wikipediaapi
import os
import random

# Set up Wikipedia API with a custom User-Agent
wiki_wiki = wikipediaapi.Wikipedia(user_agent='MyResearchBot/1.0 (https://themultiverse.school/; liz@themultiverse.school)', language='en')

# Define categories and sample articles from different domains
categories = {
  "Computers": {
      "Computer science": ["Abstract Data Structure", "Algorithm", "Object Oriented Programming", "Scripting", "Python (programming language)"],
      "CyberSecurity": ["TCP IP", "Internet Protocol", "Computer Network", "TCP Packet", "LoRA", "OSI Model", "Layer 1", "Layer 2", "Layer 3", "Layer 4", "Layer 5", "Layer 6", "Layer 7", "Layer 8", "International Organization for Standardization", "OSHA"],
      "Machine Learning": ["Artificial intelligence", "Machine learning", "Deep learning", "Computer vision", "Natural language processing", "FAISS", "Embedding"]
  },
  "Biology": {
    "Cell biology": ["Cytoskeleton", "Cell membrane", "Endoplasmic reticulum", "Golgi apparatus", "Apoptosis"],
    "Genetics": ["Epigenetics", "Gene expression", "CRISPR", "DNA replication", "Genetic drift"],
    "Food Web": ["Trophic cascade", "Keystone species", "Ecological pyramid", "Energy flow (ecology)", "Biogeochemical cycle"],
    "Microbiology": ["Bacteriophage", "Gram-positive bacteria", "Archaea", "Biofilm", "Extremophiles"],
    "Human anatomy": ["Circulatory system", "Endocrine system", "Nervous system", "Musculoskeletal system", "Digestive system"],
    "Mitochondria": ["ATP synthase", "Mitochondrial DNA", "Oxidative phosphorylation", "Mitochondrial diseases", "Endosymbiotic theory"],
    "Phylogenetics": ["Cladistics", "Evolutionary tree", "Molecular phylogenetics", "Common descent", "Homology (biology)"]
  },
  "Chemistry": {
    "Organic chemistry": ["Functional groups", "Alkene", "Aromaticity", "Polymerization", "Carbohydrates"],
    "Inorganic chemistry": ["Coordination complex", "Transition metal", "Crystallography", "Lanthanides", "Actinides"],
    "Analytical chemistry": ["Chromatography", "Spectroscopy", "Mass spectrometry", "Electrochemical analysis", "Nuclear magnetic resonance"],
    "Physical chemistry": ["Quantum chemistry", "Thermodynamics", "Statistical mechanics", "Chemical kinetics", "Molecular dynamics"],
    "Biochemistry": ["Enzyme kinetics", "Protein folding", "Lipid metabolism", "Glycolysis", "Signal transduction"]
  },
  "Geology": {
    "Plate tectonics": ["Subduction zone", "Mid-ocean ridge", "Continental drift", "Transform fault", "Rift valley"],
    "Mineralogy": ["Silicate minerals", "Feldspar", "Quartz", "Mohs scale of mineral hardness", "Crystal habit"],
    "Volcano": ["Stratovolcano", "Shield volcano", "Pyroclastic flow", "Volcanic explosivity index", "Supervolcano"],
    "Earthquake": ["Seismic wave", "Richter scale", "Fault mechanics", "Liquefaction", "Tsunami"],
    "Geological history of Earth": ["Hadean eon", "Cambrian explosion", "Snowball Earth", "K-Pg extinction event", "Great Oxygenation Event"],
    "Igneous Rock": ["Basalt", "Granite", "Magma differentiation", "Intrusive rock", "Plutonic rock"]
  },
  "History": {
    "World War II": ["Battle of Stalingrad", "Manhattan Project", "D-Day", "Holocaust", "Blitzkrieg"],
    "Ancient Egypt": ["Pharaoh", "Hieroglyphics", "Valley of the Kings", "Mummification", "Great Pyramid of Giza"],
    "Renaissance": ["Humanism (Renaissance)", "Leonardo da Vinci", "Medici family", "Florence during the Renaissance", "Printing press"],
    "Industrial Revolution": ["Steam engine", "Factory system", "Textile industry", "Urbanization", "Luddites"],
    "Cold War": ["Cuban Missile Crisis", "Space Race", "Berlin Wall", "McCarthyism", "NATO"]
  },
  "Art": {
    "Impressionism": ["Claude Monet", "Edgar Degas", "Pierre-Auguste Renoir", "Plein air painting", "Color theory"],
    "Cubism": ["Pablo Picasso", "Georges Braque", "Analytic Cubism", "Synthetic Cubism", "Still Life with Chair Caning"],
    "Renaissance art": ["Michelangelo", "Sistine Chapel ceiling", "Raphael", "Leonardo da Vinci’s notebooks", "Linear perspective"],
    "Sculpture": ["Rodin", "Bronze casting", "Marble sculpture", "Gothic sculpture", "Greek classical sculpture"],
    "Abstract art": ["Wassily Kandinsky", "Color field painting", "Abstract expressionism", "Suprematism", "De Stijl"],
    "Dadaism": ["Marcel Duchamp", "Readymades", "Cabaret Voltaire", "Tristan Tzara", "Anti-art movement"],
    "Absurdism": ["Albert Camus", "The Myth of Sisyphus", "Theatre of the Absurd", "Samuel Beckett", "Waiting for Godot"]
  }
}

# Create the cache directory
cache_dir = "data/wikipedia_cache"
os.makedirs(cache_dir, exist_ok=True)

# Function to save Wikipedia text to cache
def save_to_cache(title, text):
    filename = os.path.join(cache_dir, f"{title.replace(' ', '_')}.txt")
    with open(filename, "w", encoding="utf-8") as file:
        file.write(text)

# Function to load Wikipedia text from cache
def load_from_cache(title):
    filename = os.path.join(cache_dir, f"{title.replace(' ', '_')}.txt")
    if os.path.exists(filename):
        with open(filename, "r", encoding="utf-8") as file:
            return file.read()
    return None

# Function to fetch Wikipedia page content with caching
def get_wikipedia_text(title):
    # Check if the page is already cached
    cached_text = load_from_cache(title)
    if cached_text:
        return cached_text

    page = wiki_wiki.page(title)

    if page.exists():
        # If it's a disambiguation page, follow the first few linked pages
        if "may refer to:" in page.text[:200]:
            linked_pages = list(page.links.keys())[:5]  # Grab first few related links
            for linked_title in linked_pages:
                sub_page = wiki_wiki.page(linked_title)
                if sub_page.exists() and len(sub_page.text) > 500:  # Ensure meaningful content
                    # Cache under the requested title so the next lookup of `title` hits the cache
                    save_to_cache(title, sub_page.text[:2000])
                    return sub_page.text[:2000]

        # Save the fetched page to cache and return it
        save_to_cache(title, page.text[:2000])
        return page.text[:2000]

    return None

# Function to get additional Wikipedia pages from related categories
def get_category_pages(category_name, max_pages=5):
    category_page = wiki_wiki.page(f"Category:{category_name}")
    pages = []

    if category_page.exists():
        for title, page in category_page.categorymembers.items():
            if page.ns == 0:  # Only fetch articles (not subcategories)
                text = get_wikipedia_text(title)
                if text:
                    pages.append(text)
                if len(pages) >= max_pages:
                    break
    return pages

# Fetch and store documents
documents = []

for category, topics in categories.items():
    for topic in topics:
        print("Getting ", topic)
        for page in topics[topic]:
            text = get_wikipedia_text(page)
            if text:
                documents.append(text)

    # Also pull a few pages from the Wikipedia category
    documents.extend(get_category_pages(category, max_pages=5))


print(f"Collected {len(documents)} Wikipedia documents. Cached files saved in: {cache_dir}")





Getting  Computer science
Getting  CyberSecurity
Getting  Machine Learning
Getting  Cell biology
Getting  Genetics
Getting  Food Web
Getting  Microbiology
Getting  Human anatomy
Getting  Mitochondria
Getting  Phylogenetics
Getting  Organic chemistry
Getting  Inorganic chemistry
Getting  Analytical chemistry
Getting  Physical chemistry
Getting  Biochemistry
Getting  Plate tectonics
Getting  Mineralogy
Getting  Volcano
Getting  Earthquake
Getting  Geological history of Earth
Getting  Igneous Rock
Getting  World War II
Getting  Ancient Egypt
Getting  Renaissance
Getting  Industrial Revolution
Getting  Cold War
Getting  Impressionism
Getting  Cubism
Getting  Renaissance art
Getting  Sculpture
Getting  Abstract art
Getting  Dadaism
Getting  Absurdism
Collected 200 Wikipedia documents. Cached files saved in: data/wikipedia_cache


#### Step 2: Encode Embeddings with TensorFlow Universal Sentence Encoder
This notebook takes a TensorFlow-based approach: the Universal Sentence Encoder (USE), loaded from TensorFlow Hub, maps each document to a 512-dimensional vector.


In [6]:
import tensorflow_hub as hub

# Load USE model
use_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Generate embeddings
document_embeddings = use_model(documents)

In [7]:
query = input("Enter a question: ")

Enter a question: What are the contributions of Claude Monet to the world?


## Calculate Similarity (several methods)

#### FAISS vs. Other Approximate Nearest Neighbor (ANN) Methods

| Feature            | **FAISS** (Facebook AI Similarity Search) | **General ANN Methods** (e.g., HNSW, BallTree, KDTree) |
|-------------------|----------------------------------|--------------------------------|
| **Speed**        | Highly optimized for fast search, supports GPU acceleration | Varies by method; HNSW is fast, but KDTree struggles in high dimensions |
| **Scalability**  | Handles millions to billions of vectors efficiently | Some ANN methods scale well (HNSW), others (KDTree, BallTree) do not |
| **Accuracy**     | Configurable for exact or approximate search | Varies; HNSW offers high accuracy, others may sacrifice accuracy for speed |
| **Memory Usage** | Optimized via quantization and compression | Varies; HNSW uses more memory, others may be more efficient |
| **Implementation Complexity** | Easy-to-use library with multiple indexing strategies | Some methods require deeper parameter tuning and domain expertise |
| **Parallelization** | Supports multi-threading and GPU acceleration | Many ANN methods lack GPU support |
| **Use Case Fit** | Best for large-scale, high-dimensional similarity search | Some ANN methods (KDTree, BallTree) perform better in low dimensions |
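
As a reference point for the table, exact search is just a linear scan over every stored vector. The sketch below shows that scan in plain NumPy, using squared L2 distance (the same metric a flat L2 index reports). The function name `exact_l2_search` and the `toy_` variables are illustrative and not part of FAISS.

```python
import numpy as np

def exact_l2_search(vectors, query, top_k=3):
    """Return (squared distances, indices) of the top_k rows nearest to query."""
    vectors = np.asarray(vectors, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    # Squared Euclidean distance from the query to every stored vector
    squared_distances = np.sum((vectors - query) ** 2, axis=1)
    # Sort ascending: a smaller distance means a closer match
    nearest = np.argsort(squared_distances, kind="stable")[:top_k]
    return squared_distances[nearest], nearest

# Toy 2-dimensional data, made up for this example
toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype=np.float32)
toy_distances, toy_indices = exact_l2_search(toy_vectors, [0.9, 0.1], top_k=2)
print(toy_indices, toy_distances)
```

The cost grows linearly with the number of vectors, which is the work approximate methods try to avoid at larger scales.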


In [8]:
import numpy as np
import faiss

# Create a FAISS index
dimension = document_embeddings.shape[1]  # Get dimension from your embeddings
index = faiss.IndexFlatL2(dimension)  # L2 distance index

# Convert embeddings to numpy float32 array if not already
embeddings_np = np.array(document_embeddings).astype(np.float32)

# Add vectors to the index
index.add(embeddings_np)

## 3. Using Embeddings for Retrieval

Now, let’s query the knowledge base using FAISS and find the most relevant documents.

In [9]:
# Function to retrieve top-k documents
def retrieve_top_k_documents(query, top_k=3):
    query_embedding = use_model([query])
    ## convert to numpy array
    query_embedding = np.array(query_embedding).astype(np.float32)
    _, indices = index.search(query_embedding, top_k)
    return [documents[i] for i in indices[0]]

retrieved_docs = retrieve_top_k_documents(query, 5)

# Display results
print("\nTop Retrieved Documents:")
for idx, doc in enumerate(retrieved_docs, 1):
    print(f"{idx}. {doc}")


Top Retrieved Documents:
1. Cubism is an early-20th-century avant-garde art movement begun in Paris that revolutionized painting and the visual arts, and influenced artistic innovations in music, ballet, literature, and architecture. Cubist subjects are analyzed, broken up, and reassembled in an abstract form—instead of depicting objects from a single perspective, the artist depicts the subject from multiple perspectives to represent the subject in a greater context. Cubism has been considered the most influential art movement of the 20th century. The term cubism is broadly associated with a variety of artworks produced in Paris (Montmartre and Montparnasse) or near Paris (Puteaux) during the 1910s and throughout the 1920s.
The movement was pioneered in partnership by Pablo Picasso and Georges Braque, and joined by Jean Metzinger, Albert Gleizes, Robert Delaunay, Henri Le Fauconnier, Juan Gris, and Fernand Léger. One primary influence that led to Cubism was the representation of three

## 5. Re-Ranking: Use a Neural Re-Ranker to Improve Retrieval Precision

Problem: The retrieved documents may not always be perfectly ranked.
Solution: Re-rank the results!

In [None]:
import tensorflow as tf
import numpy as np
from transformers import TFBertModel, BertTokenizer

# Load Pretrained Transformer Model (BERT for ranking)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

class RankingModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.bert = TFBertModel.from_pretrained("bert-base-uncased")
        # NOTE: this scoring head is randomly initialized. Until the model is fine-tuned
        # on relevance labels, its scores (and the resulting order) are not meaningful.
        self.score_layer = tf.keras.layers.Dense(1, activation="linear")

    def call(self, inputs, training=False):
        input_ids, attention_mask = inputs
        bert_outputs = self.bert(input_ids, attention_mask=attention_mask, training=training)
        pooled_output = bert_outputs.pooler_output
        scores = self.score_layer(pooled_output)
        return scores

def tokenize_texts(texts, max_length=128):
    """Tokenizes and converts texts into BERT input format."""
    encodings = tokenizer(
        texts,
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_tensors="tf"
    )
    return encodings["input_ids"], encodings["attention_mask"]

def rerank_documents(query, retrieved_docs):
    """Re-ranks documents based on query relevance using BERT."""
    # Create pairs of query and each document
    pairs = [f"{query} [SEP] {doc}" for doc in retrieved_docs]

    # Tokenize all pairs
    input_ids, attention_mask = tokenize_texts(pairs)

    # Build and use the ranking model
    ranking_model = RankingModel()

    # Get scores by passing through the model
    scores = ranking_model([input_ids, attention_mask], training=False).numpy().flatten()

    # Return re-ranked documents
    ranked_indices = np.argsort(-scores)  # Sort in descending order
    reranked_docs = [retrieved_docs[i] for i in ranked_indices]

    return reranked_docs



# Apply Re-Ranking
reranked_docs = rerank_documents(query, retrieved_docs)

# Display Results
print("\nTop Re-Ranked Documents:")
for idx, doc in enumerate(reranked_docs, 1):
    print(f"{idx}. {doc}")

# Exercise: Implement a Simple RAG-Based Q&A System, then compare models

Objective

Students will:
1. Use Wikipedia articles as the basis for a RAG system.
2. Use BERT embeddings to create a FAISS index of the documents, and compare HNSW and non-HNSW versions.
3. Accept queries and retrieve the best documents to answer each question.
4. Use a neural re-ranker to re-rank the retrieved results.
5. Use [DeepSeek's R1 distillation of Qwen](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) (or an internally hosted LLM of your choice) to turn the retrieved documents into an answer to the question.
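
For the HNSW versus non-HNSW comparison, one possible yardstick is recall@k: the share of the exact top-k neighbors that the approximate index also returns. The sketch below is one way to compute it. `recall_at_k` and the `toy_` arrays are illustrative names, and the index arrays are assumed to have the shape `(num_queries, k)` that FAISS `search` returns.

```python
import numpy as np

def recall_at_k(exact_indices, approx_indices):
    """Average fraction of exact top-k neighbors that the approximate search also found."""
    exact_indices = np.asarray(exact_indices)
    approx_indices = np.asarray(approx_indices)
    hits = 0
    for exact_row, approx_row in zip(exact_indices, approx_indices):
        # Order within the top-k does not matter, so compare the rows as sets
        hits += len(set(exact_row.tolist()) & set(approx_row.tolist()))
    return hits / exact_indices.size

# Made-up results for two queries with k = 3
toy_exact = [[0, 1, 2], [5, 6, 7]]
toy_approx = [[0, 2, 9], [5, 6, 7]]
toy_recall = recall_at_k(toy_exact, toy_approx)
print(f"recall@3 = {toy_recall:.2f}")
```

In the exercise, `exact_indices` would come from searching the flat index and `approx_indices` from searching the HNSW index with the same queries and the same `k`.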

In [None]:
## Using Groq: generate a Groq API key and store it as the Colab secret GROQ_API_KEY
!pip install groq

In [None]:
## here's where your code goes
## given the context and the user query
## generate a response using a language model
from google.colab import userdata
groq_key = userdata.get('GROQ_API_KEY')

from groq import Groq

client = Groq(api_key=groq_key)

prompt = "Using the context above, answer the user's query:"
# Join the re-ranked documents themselves (the original comprehension repeated a stale `doc` variable)
context = "\n".join(reranked_docs)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": f"{context}  {prompt}  {query}",
        }
    ],
    model="deepseek-r1-distill-qwen-32b",
)

print(chat_completion.choices[0].message.content)

In [1]:
# Load model directly from HuggingFace
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and model
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)





In [1]:
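def generate_answer(question, documents, max_input_tokens=4096, max_new_tokens=512):
    # Combine the question and the list of document strings into a single prompt
    prompt = f"Question: {question}\nDocuments:\n" + "\n".join(documents) + "\nAnswer:"

    # Tokenize the prompt (the 4096-token input limit is an assumed default) and move it to the model's device
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_input_tokens).to(model.device)

    # Sample a response; max_new_tokens bounds the answer length independently of the prompt length,
    # and do_sample=True is required for temperature and top_p to take effect
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.7, top_p=0.9)

    # Decode only the newly generated tokens, skipping the echoed prompt
    prompt_length = inputs["input_ids"].shape[1]
    answer = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)

    return answer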




In [2]:
question = "What is the contribution of Claude Monet to the world?"
# Pass the re-ranked documents as a list of strings; a separate name avoids overwriting the `documents` corpus
context_docs = list(reranked_docs)

answer = generate_answer(question, context_docs)
print("Generated Answer:\n", answer)


### Hierarchical Navigable Small Worlds (HNSW) Indexing

What is HNSW?

HNSW (Hierarchical Navigable Small Worlds) is an advanced approximate nearest neighbor (ANN) algorithm that speeds up retrieval while maintaining high accuracy. Unlike FAISS's `IndexFlatL2`, which compares the query against every stored vector, it builds a multi-layered graph for efficient navigation.

How It Works
- Hierarchical Structure: Vectors are arranged in layers; the top has fewer nodes, lower layers have more.
- Graph-Based Navigation: Each point connects to neighbors, forming a graph. A search:
  - Starts at the entry point in the top layer
  - Navigates greedily toward the query
  - Descends layer by layer for precision
- Small World Properties: Most nodes are reachable in a few steps.
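
The "navigate toward the query" step can be sketched as a greedy walk on a single-layer graph. This is a simplified illustration, not the FAISS implementation: real HNSW repeats a walk like this on each layer from top to bottom and tracks a list of candidates on the bottom layer rather than a single node. The function name and toy graph are made up for the example.

```python
import numpy as np

def greedy_graph_search(points, neighbors, query, entry_point):
    """Walk the graph from entry_point, always moving to the neighbor closest to query."""
    points = np.asarray(points, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    current = entry_point
    current_distance = float(np.sum((points[current] - query) ** 2))
    path = [current]
    while True:
        best, best_distance = current, current_distance
        for candidate in neighbors[current]:
            distance = float(np.sum((points[candidate] - query) ** 2))
            if distance < best_distance:
                best, best_distance = candidate, distance
        if best == current:
            # No neighbor is closer: this local minimum is returned as the answer
            return current, path
        current, current_distance = best, best_distance
        path.append(current)

# Toy data: five points on a line, each linked to its immediate neighbors
toy_points = [[0.0], [1.0], [2.0], [3.0], [4.0]]
toy_neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
toy_nearest, toy_path = greedy_graph_search(toy_points, toy_neighbors, [3.2], entry_point=0)
print(toy_nearest, toy_path)
```

On this chain graph the walk needs one hop per point. The sparse upper layers of HNSW exist to provide long-range links so that far fewer hops are needed.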

Advantages Over Flat Indexing
- **Scalability**: Handles millions of vectors with sub-linear search time; flat indices are linear.
- **Speed-Accuracy Trade-off**: Tunable via `M` and `ef_construction`.
- **Memory Trade-off**: Uses more memory than a flat index, because the graph links are stored alongside the vectors; in exchange, each search visits only a small fraction of the vectors.
- **Real-World Performance**: Outperforms many ANN algorithms when properly tuned.

When to Use HNSW
- Dataset >10,000 documents
- Query speed is critical
- High recall accuracy (>95%) needed without a full scan
- Sufficient memory available for graph storage
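
To judge whether memory is sufficient, a back-of-the-envelope estimate can help. The sketch below assumes float32 vectors and, as a rough rule of thumb rather than an exact figure, about `2 * M` four-byte neighbor ids per vector for the graph. `estimate_index_bytes` is a hypothetical helper, and actual usage depends on the library and its settings.

```python
def estimate_index_bytes(num_vectors, dimension, hnsw_m=None):
    """Rough size of float32 vectors, plus HNSW links when hnsw_m is given."""
    vector_bytes = num_vectors * dimension * 4  # 4 bytes per float32 component
    if hnsw_m is None:
        return vector_bytes  # flat index: vectors only
    # Assumption: about 2 * M four-byte neighbor ids per vector for the graph
    link_bytes = num_vectors * hnsw_m * 2 * 4
    return vector_bytes + link_bytes

# One million 512-dimensional vectors, with and without an M = 16 graph
toy_flat_bytes = estimate_index_bytes(1_000_000, 512)
toy_hnsw_bytes = estimate_index_bytes(1_000_000, 512, hnsw_m=16)
print(f"flat: {toy_flat_bytes / 1e9:.2f} GB, HNSW: {toy_hnsw_bytes / 1e9:.2f} GB")
```

Under these assumptions the graph adds a few percent on top of the raw vectors at this dimension; the overhead is proportionally larger for low-dimensional embeddings or a higher `M`.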

In [None]:
import faiss
import numpy as np

# Create HNSW index
embedding_size = document_embeddings.shape[1]  # 512 for Universal Sentence Encoder v4
M = 16  # Number of graph neighbors per vector (higher = better recall, but more memory and slower builds)
ef_construction = 200  # Controls index quality (higher = better recall but slower to build)

# Create the index
index = faiss.IndexHNSWFlat(embedding_size, M)
index.hnsw.efConstruction = ef_construction
index.hnsw.efSearch = 128  # Controls search accuracy/speed trade-off

# Add vectors to the index (FAISS expects a float32 NumPy array, not a TensorFlow tensor)
index.add(np.array(document_embeddings).astype(np.float32))