# Steps in Retrieval-Augmented Generation (RAG)

This notebook demonstrates the **indexing phase of a Retrieval-Augmented
Generation (RAG) pipeline**.

The focus is on the core preprocessing steps required before retrieval
and generation can occur.


## RAG Indexing Pipeline Overview

The indexing stage of a RAG system typically consists of the following steps:

1. **Document Loading**  
   Load raw documents from different sources (PDF, DOCX, Markdown, etc.).

2. **Document Splitting**  
   Split documents into smaller, semantically meaningful chunks
   suitable for embedding and retrieval.

3. **Document Embedding and Storage**  
   Convert text chunks into vector embeddings that can later be stored
   in a vector database for similarity search.

This notebook focuses on **loading, splitting, and embedding**; storing the embeddings in a vector database is not covered.


In [None]:
import getpass
import os
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader
from langchain_text_splitters import CharacterTextSplitter, MarkdownHeaderTextSplitter
from langchain_openai.embeddings import OpenAIEmbeddings
import numpy as np
import copy

In [None]:
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

## Step 1: Document Loading

Document loading is the first step in the RAG indexing pipeline.

LangChain provides different loaders for handling various document formats,
such as PDF and DOCX. Each loader returns `Document` objects that hold the
extracted text (`page_content`) together with metadata such as the source path.


### Loading Documents with `PyPDFLoader`

The PDF loader reads a document page by page and converts each page
into a `Document` object.


In [None]:
loader_pdf = PyPDFLoader("../../data/docs/Introduction_to_Data_and_Data_Science.pdf")

In [None]:
pages_pdf = loader_pdf.load()

In [None]:
pages_pdf

PDF text often contains excessive whitespace and line breaks.
To improve downstream processing, the page content is normalized
by collapsing every run of whitespace (spaces, tabs, and line breaks)
into a single space.
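
As a small illustration of the normalization described above, the sketch below applies the same `' '.join(text.split())` idiom to a made-up string. The sample text and the helper name `normalize_whitespace` are assumptions for illustration and do not come from the course files.

```python
# Made-up string standing in for text extracted from a PDF page.
raw_text = "Introduction  to\nData   Science \n\n Lecture\t1 "


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim both ends."""
    # str.split() with no arguments splits on any whitespace run
    # (spaces, tabs, newlines) and drops empty strings.
    return " ".join(text.split())


clean_text = normalize_whitespace(raw_text)
print(repr(clean_text))
```

Note that this also removes paragraph breaks, so it suits cases where only the running text matters.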


In [None]:
pages_pdf_cut = copy.deepcopy(pages_pdf)  # deep copy so the originally loaded pages stay unmodified

In [None]:
' '.join(pages_pdf_cut[0].page_content.split())

In [None]:
# Normalize whitespace on every page of the copy
for page in pages_pdf_cut:
    page.page_content = ' '.join(page.page_content.split())

In [None]:
pages_pdf_cut

In [None]:
pages_pdf[0].page_content, pages_pdf_cut[0].page_content
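
The comparison above only shows a difference because the cleaned pages are a deep copy. The sketch below illustrates why, using a hypothetical `FakeDocument` dataclass as a stand-in for a loaded page (it is not LangChain's `Document` class): a shallow copy of the list still shares the same page objects, so editing them would also change the originals.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a loaded page, for illustration only."""
    page_content: str
    metadata: dict = field(default_factory=dict)


original = [FakeDocument("Data   Science\n101", {"page": 0})]

shallow = list(original)        # new list, but the same FakeDocument objects
deep = copy.deepcopy(original)  # new list and new FakeDocument objects

# Editing the deep copy leaves the original page untouched.
deep[0].page_content = " ".join(deep[0].page_content.split())
after_deep = original[0].page_content
print(repr(after_deep))

# Editing through the shallow copy changes the original page as well.
shallow[0].page_content = " ".join(shallow[0].page_content.split())
after_shallow = original[0].page_content
print(repr(after_shallow))
```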

### Loading Documents with `Docx2txtLoader`

DOCX files can be loaded using `Docx2txtLoader`,
which extracts raw text from Word documents and returns
the whole file as a single `Document`.


In [None]:
loader_docx = Docx2txtLoader("../../data/docs/Introduction_to_Data_and_Data_Science.docx")

In [None]:
pages = loader_docx.load()
# Collapse whitespace in the extracted text
for page in pages:
    page.page_content = ' '.join(page.page_content.split())

In [None]:
pages[0].page_content

## Step 2: Document Splitting

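Large documents must be split into smaller chunks before embedding.

Smaller chunks:
- Improve retrieval accuracy, because each embedding represents one focused passage rather than a whole document
- Fit within the input limit of the embedding model and the context window of the generation model
- Stay semantically coherent when the split happens at natural boundaries such as sentences or sections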


### Character-Based Text Splitting

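Character-based splitting breaks the text at a separator (a period in this
notebook) and merges the resulting pieces into chunks of up to `chunk_size`
characters; a single piece longer than that is kept whole. An optional
overlap (`chunk_overlap`) repeats the end of one chunk at the start of the
next to preserve context between chunks.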
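
To make the split-then-merge idea concrete, here is a minimal pure-Python sketch. It is a simplified illustration under stated assumptions, not LangChain's implementation: the function name `split_with_overlap`, the sample text, and the choice to rejoin pieces with the separator plus a space are all hypothetical.

```python
def split_with_overlap(text, separator=".", chunk_size=500, chunk_overlap=50):
    """Split on `separator`, then greedily merge pieces into chunks of at most
    `chunk_size` characters, carrying trailing pieces (up to `chunk_overlap`
    characters in total) over into the next chunk."""
    joiner = separator + " "
    pieces = [p.strip() for p in text.split(separator) if p.strip()]

    def joined_len(parts):
        return len(joiner.join(parts))

    chunks, current = [], []
    for piece in pieces:
        # Close the current chunk if adding this piece would exceed the limit.
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(joiner.join(current))
            # Collect trailing pieces that fit inside the overlap budget.
            overlap = []
            for prev in reversed(current):
                if joined_len([prev] + overlap) > chunk_overlap:
                    break
                overlap.insert(0, prev)
            current = overlap
            # Drop overlap pieces if they leave no room for the new piece.
            while current and joined_len(current + [piece]) > chunk_size:
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(joiner.join(current))
    return chunks


sample = ("Data is everywhere. Data science studies data. "
          "Models learn patterns. Evaluation matters.")
chunks = split_with_overlap(sample, separator=".", chunk_size=50, chunk_overlap=25)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

In this sketch a piece that is longer than `chunk_size` on its own still becomes a chunk, and the final separator of each chunk is dropped.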


In [None]:
len(pages[0].page_content)

In [None]:
char_splitter = CharacterTextSplitter(separator=".", chunk_size=500, chunk_overlap=50)

In [None]:
pages_chat_split = char_splitter.split_documents(pages)

In [None]:
pages_chat_split

In [None]:
len(pages_chat_split)

In [None]:
len(pages_chat_split[16].page_content)

### Markdown Header-Based Text Splitting

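When a document's text contains Markdown-style headers (lines starting with
`#`, `##`, and so on), a Markdown-aware splitter can split at those headers
and record each header's text in the chunk metadata, preserving logical
sections such as titles and headings.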
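
The idea can be sketched in plain Python as follows. This is an illustrative approximation and not the logic of `MarkdownHeaderTextSplitter`; the function name `split_by_headers`, the `(metadata, content)` return shape, and the sample text are assumptions made for this example.

```python
def split_by_headers(text, headers_to_split_on):
    """Return a list of (metadata, content) pairs, one per headed section."""
    # Check longer markers first so "##" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    sections = []
    metadata = {}  # header label -> text of the header currently in effect
    buffer = []    # content lines collected since the last header

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            sections.append((dict(metadata), content))
        buffer.clear()

    for line in text.splitlines():
        stripped = line.strip()
        for marker, label in markers:
            if stripped.startswith(marker + " "):
                flush()
                # A new header invalidates deeper-level headers seen so far.
                for other_marker, other_label in headers_to_split_on:
                    if len(other_marker) > len(marker):
                        metadata.pop(other_label, None)
                metadata[label] = stripped[len(marker):].strip()
                break
        else:
            buffer.append(line)
    flush()
    return sections


sample = """# Intro to Data Science
## What is data?
Data is recorded information.
## What is data science?
It studies data.
"""
sections = split_by_headers(sample, [("#", "Course Title"), ("##", "Lecture Title")])
for meta, content in sections:
    print(meta, "->", content)
```

Keeping the header text as metadata means a later character-based split can still carry the course and lecture titles along with every chunk.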


In [None]:
loader2_docx = Docx2txtLoader("../../data/docs/Introduction_to_Data_and_Data_Science_2.docx")


In [None]:
pages2 = loader2_docx.load()

In [None]:
pages2

In [None]:
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Course Title"), ("##", "Lecture Title")]
)

In [None]:
pages_md_split = md_splitter.split_text(pages2[0].page_content)

In [None]:
# Collapse whitespace inside each header-delimited section
for section in pages_md_split:
    section.page_content = ' '.join(section.page_content.split())

In [None]:
pages_chat_split2 = char_splitter.split_documents(pages_md_split)

In [None]:
pages_chat_split2

## Step 3: Text Embedding

After splitting, text chunks are converted into dense vector embeddings.

These embeddings capture semantic meaning and enable
similarity search during retrieval.


### Generating Embeddings with OpenAI

Each text chunk is mapped to a high-dimensional vector representation
(1,536 dimensions for `text-embedding-ada-002`) using an embedding model.


In [None]:
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

In [None]:
vector1 = embeddings.embed_query(pages_chat_split2[3].page_content)
vector2 = embeddings.embed_query(pages_chat_split2[5].page_content)
vector3 = embeddings.embed_query(pages_chat_split2[18].page_content)

In [None]:
print(len(vector1), len(vector2), len(vector3))

### Measuring Similarity Between Embeddings

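Cosine similarity, the dot product of two vectors divided by the product
of their norms, is commonly used to measure semantic similarity between
text chunks. The dot products and the norms are computed separately here;
when the norms are approximately 1, the dot product alone already equals
the cosine similarity.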
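
For reference, the full cosine similarity formula can be sketched as below. The example uses only the standard library and made-up three-dimensional vectors so that it runs without an API key; with NumPy the same quantity is `np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))`.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors: v2 points in the same direction as v1, v3 is orthogonal to it.
v1 = [1.0, 0.0, 1.0]
v2 = [2.0, 0.0, 2.0]
v3 = [0.0, 3.0, 0.0]

sim_same_direction = cosine_similarity(v1, v2)
sim_orthogonal = cosine_similarity(v1, v3)
print(sim_same_direction, sim_orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.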


In [None]:
np.dot(vector1, vector2), np.dot(vector1, vector3), np.dot(vector2, vector3)

In [None]:
np.linalg.norm(vector1), np.linalg.norm(vector2), np.linalg.norm(vector3)
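
If the norms printed above come out at approximately 1, the vectors are unit-length and the dot products can be read directly as cosine similarities. The sketch below shows how such scores could then rank chunks against a query. The three-dimensional vectors, the chunk labels, and the helper names `normalize` and `dot` are made up for illustration; real embeddings would come from the embedding model.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (Euclidean norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy "embeddings", normalized to unit length.
chunk_vectors = {
    "chunk about data": normalize([0.9, 0.1, 0.0]),
    "chunk about models": normalize([0.1, 0.9, 0.1]),
    "chunk about ethics": normalize([0.0, 0.2, 0.9]),
}
query_vector = normalize([0.8, 0.2, 0.1])

# For unit vectors the dot product is the cosine similarity.
scores = {name: dot(query_vector, vec) for name, vec in chunk_vectors.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A vector database performs the same kind of ranking at scale, typically with approximate nearest-neighbor search instead of comparing the query against every stored vector.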

## Summary

This notebook demonstrated the indexing steps of a RAG pipeline:

- Loading documents from PDF and DOCX sources  
- Cleaning and normalizing raw text  
- Splitting documents into manageable chunks  
- Generating vector embeddings for semantic retrieval  
- Comparing embeddings using similarity metrics  

These steps form the foundation for building
retrieval-augmented generation systems.
