# Steps in Retrieval-Augmented Generation (RAG)

This notebook demonstrates the **indexing phase** of a
Retrieval-Augmented Generation (RAG) pipeline.

The indexing phase prepares external knowledge so it can later
be retrieved and injected into a language model during generation.


## RAG Indexing Pipeline Overview

The indexing stage of a RAG system consists of three core steps:

1. **Document Loading**  
   Load raw documents from different sources (PDF, DOCX, Markdown, etc.).

2. **Document Splitting**  
   Split documents into smaller, semantically meaningful chunks
   suitable for embedding and retrieval.

3. **Document Embedding and Storage**  
   Convert text chunks into vector embeddings and store them
   in a vector database for similarity search.

This notebook focuses exclusively on these indexing steps.


In [62]:
import getpass
import os
import copy
import numpy as np

from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader
from langchain_text_splitters import CharacterTextSplitter, MarkdownHeaderTextSplitter
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document
from langchain_chroma import Chroma


In [63]:
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

## Step 1: Document Loading

Document loading is the first step in the RAG indexing pipeline.

LangChain provides specialized loaders for handling different
file formats while preserving metadata and document structure.
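
When a project mixes file formats, the loader can be chosen from the file extension. The sketch below is illustrative only: `LOADER_BY_SUFFIX` and `pick_loader_name` are hypothetical names, and the function returns the loader's class name as a string so that it runs without LangChain installed.

```python
from pathlib import Path

# Hypothetical lookup table: file suffix -> name of the LangChain loader
# that this notebook uses for that format.
LOADER_BY_SUFFIX = {
    ".pdf": "PyPDFLoader",
    ".docx": "Docx2txtLoader",
}


def pick_loader_name(path: str) -> str:
    """Return the loader name for a path, or raise for unsupported formats."""
    suffix = Path(path).suffix.lower()
    try:
        return LOADER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"No loader registered for {suffix!r}") from None


print(pick_loader_name("../../data/docs/Introduction_to_Data_and_Data_Science.pdf"))
```

In a real pipeline the table would map to the loader classes themselves rather than to their names.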


In [64]:
loader_pdf = PyPDFLoader("../../data/docs/Introduction_to_Data_and_Data_Science.pdf")

In [65]:
pages_pdf = loader_pdf.load()
pages_pdf

[Document(metadata={'producer': 'Microsoft® Word for Microsoft 365', 'creator': 'Microsoft® Word for Microsoft 365', 'creationdate': '2023-11-09T10:16:34+02:00', 'author': 'Hristina  Hristova', 'moddate': '2023-11-09T10:16:34+02:00', 'source': '../../data/docs/Introduction_to_Data_and_Data_Science.pdf', 'total_pages': 6, 'page': 0, 'page_label': '1'}, page_content='Analysis vs Analytics \nAlright! So… \nLet’s discuss the not-so-obvious differences \nbetween the terms analysis and analytics. \nDue to the similarity of the words, some people \nbelieve they share the same meaning, and thus \nuse them interchangeably. Technically, this \nisn’t correct. There is, in fact, a distinct \ndifference between the two. And the reason \nfor one often being used instead of the other \nis the lack of a transparent understanding \nof both. \nSo, let’s clear this up, shall we? \nFirst, we will start with analysis. \nConsider the following… \nYou have a huge dataset containing data of \nvarious types. I

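PDF text often contains hard line breaks and runs of extra whitespace
left over from the page layout. To simplify downstream processing,
each page's content is normalized by collapsing every run of whitespace
(spaces, tabs, and newlines) into a single space.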
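
The normalization idiom can be tried on a short string first. This is a minimal sketch with a made-up sample (`raw` is not taken from the PDF). It relies on `str.split()` with no arguments, which splits on any run of whitespace and drops leading and trailing whitespace.

```python
# Made-up sample that mimics PDF extraction artifacts:
# trailing spaces, hard line breaks, a tab, and a double space.
raw = "Analysis vs Analytics \nAlright! So... \nLet's  discuss\tthe differences \n"

# split() with no separator breaks on any whitespace run;
# joining with a single space collapses those runs.
normalized = " ".join(raw.split())

print(repr(normalized))
```

The trade-off is that paragraph breaks are lost as well, which is acceptable here because the splitter used later cuts on periods rather than on blank lines.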


In [66]:
pages_pdf_cut = copy.deepcopy(pages_pdf)  # deep copy so the originally loaded pages stay unchanged

In [67]:
# Collapse every run of whitespace in each page into a single space
for page in pages_pdf_cut:
    page.page_content = ' '.join(page.page_content.split())

pages_pdf_cut

[Document(metadata={'producer': 'Microsoft® Word for Microsoft 365', 'creator': 'Microsoft® Word for Microsoft 365', 'creationdate': '2023-11-09T10:16:34+02:00', 'author': 'Hristina  Hristova', 'moddate': '2023-11-09T10:16:34+02:00', 'source': '../../data/docs/Introduction_to_Data_and_Data_Science.pdf', 'total_pages': 6, 'page': 0, 'page_label': '1'}, page_content='Analysis vs Analytics Alright! So… Let’s discuss the not-so-obvious differences between the terms analysis and analytics. Due to the similarity of the words, some people believe they share the same meaning, and thus use them interchangeably. Technically, this isn’t correct. There is, in fact, a distinct difference between the two. And the reason for one often being used instead of the other is the lack of a transparent understanding of both. So, let’s clear this up, shall we? First, we will start with analysis. Consider the following… You have a huge dataset containing data of various types. Instead of tackling the entire da

In [68]:
pages_pdf[0].page_content, pages_pdf_cut[0].page_content

('Analysis vs Analytics \nAlright! So… \nLet’s discuss the not-so-obvious differences \nbetween the terms analysis and analytics. \nDue to the similarity of the words, some people \nbelieve they share the same meaning, and thus \nuse them interchangeably. Technically, this \nisn’t correct. There is, in fact, a distinct \ndifference between the two. And the reason \nfor one often being used instead of the other \nis the lack of a transparent understanding \nof both. \nSo, let’s clear this up, shall we? \nFirst, we will start with analysis. \nConsider the following… \nYou have a huge dataset containing data of \nvarious types. Instead of tackling the entire \ndataset and running the risk of becoming overwhelmed, \nyou separate it into easier to digest chunks \nand study them individually and examine how \nthey relate to other parts. And that’s analysis \nin a nutshell. \nOne important thing to remember, however, \nis that you perform analyses on things that \nhave already happened in the

### Loading Documents with `Docx2txtLoader`

DOCX files are loaded using `Docx2txtLoader`,
which extracts raw text from Word documents.


In [69]:
loader_docx = Docx2txtLoader("../../data/docs/Introduction_to_Data_and_Data_Science.docx")

In [70]:
pages = loader_docx.load()
# Collapse every run of whitespace into a single space
for page in pages:
    page.page_content = ' '.join(page.page_content.split())

pages[0].page_content

"Analysis vs Analytics Alright! So… Let’s discuss the not-so-obvious differences between the terms analysis and analytics. Due to the similarity of the words, some people believe they share the same meaning, and thus use them interchangeably. Technically, this isn’t correct. There is, in fact, a distinct difference between the two. And the reason for one often being used instead of the other is the lack of a transparent understanding of both. So, let’s clear this up, shall we? First, we will start with analysis. Consider the following… You have a huge dataset containing data of various types. Instead of tackling the entire dataset and running the risk of becoming overwhelmed, you separate it into easier to digest chunks and study them individually and examine how they relate to other parts. And that’s analysis in a nutshell. One important thing to remember, however, is that you perform analyses on things that have already happened in the past. Such as using an analysis to explain how a

## Step 2: Document Splitting

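Large documents must be split into smaller chunks before embedding.

Well-sized chunks:
- Improve retrieval precision, because each embedding covers a narrower topic
- Fit within the embedding model's input limit and the language model's context window
- Stay semantically coherent when split points follow sentence or section boundaries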


### Character-Based Text Splitting

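`CharacterTextSplitter` cuts the text at a separator (here, a period)
and merges the resulting pieces into chunks of at most `chunk_size`
characters. An optional `chunk_overlap` repeats the end of one chunk at
the start of the next to preserve context between chunks.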
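
The split-then-merge idea can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: `pack_chunks` is a hypothetical name, pieces are stripped and rejoined with the separator plus a space, and a single piece longer than `chunk_size` is emitted as an oversized chunk.

```python
def pack_chunks(text: str, separator: str = ".", chunk_size: int = 500,
                chunk_overlap: int = 50) -> list[str]:
    """Cut on `separator`, then greedily pack the pieces into chunks."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    joiner = separator + " "
    chunks: list[str] = []
    current: list[str] = []

    for piece in pieces:
        candidate = joiner.join(current + [piece])
        if current and len(candidate) > chunk_size:
            chunks.append(joiner.join(current))
            # Carry trailing pieces forward while they fit in the overlap budget.
            carried: list[str] = []
            for prev in reversed(current):
                if len(joiner.join([prev] + carried)) > chunk_overlap:
                    break
                carried.insert(0, prev)
            # Drop the overlap if it would push the next chunk over the limit.
            if len(joiner.join(carried + [piece])) > chunk_size:
                carried = []
            current = carried
        current.append(piece)

    if current:
        chunks.append(joiner.join(current))
    return chunks


demo_chunks = pack_chunks("One. Two. Three. Four. Five.", chunk_size=12, chunk_overlap=5)
print(demo_chunks)
```

With these small limits, each chunk repeats the last sentence of the previous one, which is the role `chunk_overlap` plays in the cells that follow. The separator is dropped at chunk boundaries, which is consistent with the chunks in the notebook output ending without a trailing period.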


In [71]:
len(pages[0].page_content)

8259

In [72]:
char_splitter = CharacterTextSplitter(separator=".", chunk_size=500, chunk_overlap=50)

In [73]:
pages_chat_split = char_splitter.split_documents(pages)

In [74]:
pages_chat_split

[Document(metadata={'source': '../../data/docs/Introduction_to_Data_and_Data_Science.docx'}, page_content='Analysis vs Analytics Alright! So… Let’s discuss the not-so-obvious differences between the terms analysis and analytics. Due to the similarity of the words, some people believe they share the same meaning, and thus use them interchangeably. Technically, this isn’t correct. There is, in fact, a distinct difference between the two. And the reason for one often being used instead of the other is the lack of a transparent understanding of both'),
 Document(metadata={'source': '../../data/docs/Introduction_to_Data_and_Data_Science.docx'}, page_content='So, let’s clear this up, shall we? First, we will start with analysis. Consider the following… You have a huge dataset containing data of various types. Instead of tackling the entire dataset and running the risk of becoming overwhelmed, you separate it into easier to digest chunks and study them individually and examine how they relate

In [75]:
len(pages_chat_split)

21

In [76]:
len(pages_chat_split[16].page_content)

382

### Markdown Header-Based Splitting

When documents contain structured headers,
a Markdown-aware splitter preserves logical sections
and records each header (here, the course title and lecture title)
in the metadata of the resulting chunks.
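
To make the mechanism concrete, here is a simplified sketch of header-based splitting in plain Python. It is not the `MarkdownHeaderTextSplitter` implementation: `split_by_headers` is a hypothetical helper that returns plain dictionaries instead of `Document` objects and ignores edge cases such as headers inside code fences.

```python
def split_by_headers(markdown: str, headers: list[tuple[str, str]]) -> list[dict]:
    """Group body lines under their nearest headers; headers become metadata."""
    # Check longer prefixes first so "##" is never mistaken for "#".
    ordered = sorted(headers, key=lambda h: len(h[0]), reverse=True)
    sections: list[dict] = []
    metadata: dict[str, str] = {}
    body: list[str] = []

    def flush() -> None:
        # Emit the lines collected so far as one section.
        if body:
            sections.append({"metadata": dict(metadata), "content": "\n".join(body)})
            body.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        for prefix, name in ordered:
            if stripped.startswith(prefix + " "):
                flush()
                # A new header invalidates any deeper-level headers seen earlier.
                for other_prefix, other_name in headers:
                    if len(other_prefix) > len(prefix):
                        metadata.pop(other_name, None)
                metadata[name] = stripped[len(prefix):].strip()
                break
        else:
            if stripped:
                body.append(stripped)
    flush()
    return sections


sample = "# Course\n\n## Lecture A\n\nText a1\nText a2\n\n## Lecture B\n\nText b"
demo_sections = split_by_headers(sample, [("#", "Course Title"), ("##", "Lecture Title")])
for section in demo_sections:
    print(section)
```

Each section carries the headers that were active when its text appeared, so the lecture title changes between sections while the course title is shared.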


In [77]:
loader2_docx = Docx2txtLoader("../../data/docs/Introduction_to_Data_and_Data_Science_2.docx")


In [78]:
pages2 = loader2_docx.load()

In [79]:
pages2

[Document(metadata={'source': '../../data/docs/Introduction_to_Data_and_Data_Science_2.docx'}, page_content="# Introduction to Data and Data Science\n\n## Analysis vs Analytics\n\nAlright! So…\nLet’s discuss the not-so-obvious differences\nbetween the terms analysis and analytics.\nDue to the similarity of the words, some people\nbelieve they share the same meaning, and thus\nuse them interchangeably. Technically, this\nisn’t correct. There is, in fact, a distinct\ndifference between the two. And the reason\nfor one often being used instead of the other\nis the lack of a transparent understanding\nof both.\nSo, let’s clear this up, shall we?\nFirst, we will start with analysis.\nConsider the following…\nYou have a huge dataset containing data of\nvarious types. Instead of tackling the entire\ndataset and running the risk of becoming overwhelmed,\nyou separate it into easier to digest chunks\nand study them individually and examine how\nthey relate to other parts. And that’s analysis\ni

In [80]:
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Course Title"), ("##", "Lecture Title")]
)

In [81]:
pages_md_split = md_splitter.split_text(pages2[0].page_content)

In [82]:
# Collapse every run of whitespace into a single space
for section in pages_md_split:
    section.page_content = ' '.join(section.page_content.split())

In [83]:
pages_char_split2 = char_splitter.split_documents(pages_md_split)

In [84]:
pages_char_split2 

[Document(metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Analysis vs Analytics'}, page_content='Alright! So… Let’s discuss the not-so-obvious differences between the terms analysis and analytics. Due to the similarity of the words, some people believe they share the same meaning, and thus use them interchangeably. Technically, this isn’t correct. There is, in fact, a distinct difference between the two. And the reason for one often being used instead of the other is the lack of a transparent understanding of both. So, let’s clear this up, shall we? First, we will start with analysis'),
 Document(metadata={'Course Title': 'Introduction to Data and Data Science', 'Lecture Title': 'Analysis vs Analytics'}, page_content='Consider the following… You have a huge dataset containing data of various types. Instead of tackling the entire dataset and running the risk of becoming overwhelmed, you separate it into easier to digest chunks and study them individu

## Step 3: Document Embedding and Storage

After splitting, text chunks are converted into dense vector embeddings.
These embeddings are stored in a vector database to enable
efficient similarity search during retrieval.


### Generating Embeddings with OpenAI

Each text chunk is mapped to a high-dimensional vector
(1,536 dimensions for `text-embedding-ada-002`)
that captures its semantic meaning.


In [85]:
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

In [86]:
# Embed three chunks from the Markdown-aware split (pages_char_split2, defined above)
vector1 = embeddings.embed_query(pages_char_split2[3].page_content)
vector2 = embeddings.embed_query(pages_char_split2[5].page_content)
vector3 = embeddings.embed_query(pages_char_split2[18].page_content)

In [87]:
print(len(vector1), len(vector2), len(vector3))

1536 1536 1536


### Measuring Similarity Between Embeddings

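Cosine similarity, the dot product of two vectors divided by the
product of their norms, is commonly used to measure semantic similarity.
The embeddings returned here have a norm of approximately 1, so the
plain dot product already equals the cosine similarity.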
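
The relationship between the dot product and cosine similarity can be checked on small vectors without calling the embedding API. This is a minimal sketch using only the standard library. The vectors and the helper names `cosine_similarity` and `normalize` are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


a, b = [3.0, 4.0], [4.0, 3.0]
unit_a, unit_b = normalize(a), normalize(b)

# For unit-length vectors the denominator is 1, so the dot product
# alone gives the cosine similarity.
dot_of_units = sum(x * y for x, y in zip(unit_a, unit_b))
print(cosine_similarity(a, b), dot_of_units)
```

This is why the notebook can compare embeddings with `np.dot` alone once the norms are confirmed to be close to 1.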


In [88]:
np.dot(vector1, vector2), np.dot(vector1, vector3), np.dot(vector2, vector3)

(np.float64(0.8791284497943926),
 np.float64(0.8000235828747099),
 np.float64(0.7934993700101873))

In [89]:
np.linalg.norm(vector1), np.linalg.norm(vector2), np.linalg.norm(vector3)

(np.float64(0.9999999518969219),
 np.float64(0.9999999432048748),
 np.float64(0.9999999688261215))

### Creating a Chroma Vector Store

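A vector store persists embeddings and enables
efficient similarity-based retrieval.
Passing `persist_directory` writes the collection to disk,
so it can be reopened later without re-embedding the documents.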
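
Conceptually, similarity-based retrieval ranks stored vectors by their similarity to a query vector. The sketch below does this with a full scan over a small dictionary. It is a toy illustration with made-up vectors and a hypothetical `top_k` helper. Production vector stores typically rely on approximate nearest-neighbor indexes rather than scanning every vector.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2):
    """Return (id, score) pairs for the k stored vectors most similar to query."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up 2-dimensional "embeddings" keyed by document ID.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}

for doc_id, score in top_k([1.0, 0.1], toy_vectors, k=2):
    print(doc_id, round(score, 3))
```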


In [90]:
vectorstore = Chroma.from_documents(
    documents=pages_char_split2,
    embedding=embeddings,
    persist_directory="./vectorstore/rag-practice",
)

In [91]:
vectorstore_from_directory = Chroma(persist_directory="./vectorstore/rag-practice", embedding_function=embeddings)

### Inspecting, Adding, Updating, and Deleting Documents

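Vector stores support full lifecycle management of documents,
including insertion, updates, and deletion.

Document IDs are generated when documents are added, so they differ
between runs. Calling `get` with an ID that is not in the collection
returns empty lists, as several outputs below show. Substitute IDs from
your own collection before running the update and delete cells.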


In [92]:
vectorstore_from_directory.get(ids="123ef422-8ec3-4cd7-89ad-e095e77998fd", include=["embeddings"])

{'ids': ['123ef422-8ec3-4cd7-89ad-e095e77998fd'],
 'embeddings': array([[ 0.00478017, -0.01535145,  0.02508651, ...,  0.02121745,
         -0.01364157, -0.00687695]], shape=(1, 1536)),
 'documents': None,
 'uris': None,
 'included': ['embeddings'],
 'data': None,
 'metadatas': None}

In [93]:
added_document = Document(page_content='Alright! So… Let’s discuss the not-so-obvious differences between the terms analysis and analytics. Due to the similarity of the words, some people believe they share the same meaning, and thus use them interchangeably. Technically, this isn’t correct. There is, in fact, a distinct difference between the two. And the reason for one often being used instead of the other is the lack of a transparent understanding of both. So, let’s clear this up, shall we? First, we will start with analysis', 
                          metadata={'Course Title': 'Introduction to Data and Data Science', 
                                    'Lecture Title': 'Analysis vs Analytics'})

In [94]:
vectorstore_from_directory.add_documents([added_document])

['6fa33e86-e9c7-4781-ae2f-4b951aeccbd2']

In [95]:
vectorstore_from_directory.get("9ef73267-b650-4011-82ec-025a21e1095d")

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': []}

In [96]:
updated_document = Document(page_content='Great! We hope we gave you a good idea about the level of applicability of the most frequently used programming and software tools in the field of data science. Thank you for watching!', 
                            metadata={'Course Title': 'Introduction to Data and Data Science', 
                                     'Lecture Title': 'Programming Languages & Software Employed in Data Science - All the Tools You Need'})

In [97]:
vectorstore_from_directory.update_document(document_id="9ef73267-b650-4011-82ec-025a21e1095d", document=updated_document)

In [98]:
vectorstore_from_directory.get("9ef73267-b650-4011-82ec-025a21e1095d")

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': []}

In [99]:
vectorstore_from_directory.delete("9ef73267-b650-4011-82ec-025a21e1095d")

In [100]:
vectorstore_from_directory.get("9ef73267-b650-4011-82ec-025a21e1095d")

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': []}

## Summary

This notebook demonstrated the **indexing phase of a RAG pipeline**:

- Loading documents from PDF and DOCX sources  
- Cleaning and normalizing raw text  
- Splitting documents using structural and character-based strategies  
- Generating semantic embeddings  
- Storing and managing embeddings in a Chroma vector store  

These steps form the foundation for retrieval-augmented
generation systems.
