# Retrieval-Augmented Generation (RAG Pipeline)

![rag flow image](flow.png "RAG FLOW")

# Data Ingestion Types
## Step 1: Loading Data

In [3]:
# Load the environment variables

import os
from dotenv import load_dotenv

load_dotenv()

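# For LangSmith - observability and tracing
# load_dotenv() has already copied LANGCHAIN_API_KEY from .env into os.environ,
# so only the tracing flag needs to be set here
os.environ["LANGCHAIN_TRACING_V2"] = "true"
if not os.getenv("LANGCHAIN_API_KEY"):
    raise RuntimeError("LANGCHAIN_API_KEY is not set; add it to your .env file")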

### Reading from Text file

In [11]:
from langchain_community.document_loaders import TextLoader

# Creating a loader to read the speech from a text file
loader = TextLoader("data/speech.txt")

# Load the entire file into a list of Document objects
text_documents = loader.load()
text_documents

[Document(metadata={'source': 'data/speech.txt'}, page_content="Friends, family, and honored guests,\n\nToday, as we gather here together, I am filled with profound gratitude and joy. Life's most precious moments are those we share with the people we cherish most, and this occasion is no exception.\n\nWe live in a world of endless possibilities, where each day brings new opportunities to make a difference, to show kindness, and to lift each other up. As I look around this room, I see faces illuminated with hope, hearts full of dreams, and spirits united in purpose.\n\nLet us remember that our greatest strength lies not in our individual achievements, but in our collective ability to support, inspire, and empower one another. Every smile we share, every hand we extend in friendship, and every word of encouragement we offer creates ripples of positive change that extend far beyond this moment.\n\nIn times of challenge, let us be reminded of our resilience. In moments of triumph, let us c

### Reading from web using bs4 (web scraper)

In [12]:
from langchain_community.document_loaders import WebBaseLoader
import bs4

# Creates a loader that fetches content from web URLs
# bs_kwargs=dict(...) - Creates a dictionary of keyword arguments to pass to BeautifulSoup
# parse_only=bs4.SoupStrainer(...) - SoupStrainer is a BeautifulSoup filter that tells the parser to only process specific HTML elements
loader = WebBaseLoader(
    web_path=("https://python.langchain.com/docs/tutorials/rag/",),
    # keyword arguments forwarded to BeautifulSoup
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("theme-doc-markdown markdown", "main-wrapper mainWrapper_z2l0")
        )
    ),
)

# Fetch the page and load it into a list of Document objects
text_documents = loader.load()
text_documents

[Document(metadata={'source': 'https://python.langchain.com/docs/tutorials/rag/'}, page_content='IntroductionTutorialsBuild a Question Answering application over a Graph DatabaseTutorialsBuild a simple LLM application with chat models and prompt templatesBuild a ChatbotBuild a Retrieval Augmented Generation (RAG) App: Part 2Build an Extraction ChainBuild an AgentTaggingBuild a Retrieval Augmented Generation (RAG) App: Part 1Build a semantic search engineBuild a Question/Answering system over SQL dataSummarize TextHow-to guidesHow-to guidesHow to use tools in a chainHow to use a vectorstore as a retrieverHow to add memory to chatbotsHow to use example selectorsHow to add a semantic layer over graph databaseHow to invoke runnables in parallelHow to stream chat model responsesHow to add default invocation args to a RunnableHow to add retrieval to chatbotsHow to use few shot examples in chat modelsHow to do tool/function callingHow to install LangChain packagesHow to add examples to the pr

### Reading from PDF

In [16]:
from langchain_community.document_loaders import PyPDFLoader

# Creates a loader that fetches content from PDF files
loader = PyPDFLoader("data/somatosensory.pdf")

docs = loader.load()
docs



[Document(metadata={'producer': 'Prince 20150210 (www.princexml.com)', 'creator': 'PyPDF', 'creationdate': '', 'title': 'Anatomy of the Somatosensory System', 'source': 'data/somatosensory.pdf', 'total_pages': 4, 'page': 0, 'page_label': '1'}, page_content='This is a sample document to\nshowcase page-based formatting. It\ncontains a chapter from a Wikibook\ncalled Sensory Systems. None of the\ncontent has been changed in this\narticle, but some content has been\nremoved.\nAnatomy of the Somatosensory System\nFROM WIKIBOOKS1\nOur somatosensory system consists of sensors in the skin\nand sensors in our muscles, tendons, and joints. The re-\nceptors in the skin, the so called cutaneous receptors, tell\nus about temperature (thermoreceptors), pressure and sur-\nface texture (mechano receptors), and pain (nociceptors).\nThe receptors in muscles and joints provide information\nabout muscle length, muscle tension, and joint angles.\nCutaneous receptors\nSensory information from Meissner corpu

## Step 2. Transform Data

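`RecursiveCharacterTextSplitter` splits each document into overlapping chunks:

- Each chunk holds at most 1000 characters (`chunk_size=1000`); the splitter prefers to break on paragraph, line, and word boundaries, so chunks are often shorter
- Consecutive chunks overlap by up to 200 characters (`chunk_overlap=200`) so that context is preserved across chunk boundaries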

```
Chunk 1: [........1000 characters........]
                                  [--200--]
Chunk 2:                          [--200--][....800 new characters....]
                                                                 [--200--]
Chunk 3:                                                         [--200--][....800 new characters....]
```
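
The overlap pattern in the diagram can be sketched in plain Python. This is a simplified fixed-window splitter, not how `RecursiveCharacterTextSplitter` works internally (the real splitter also tries to break on separators such as paragraphs and lines). `chunk_text` is a hypothetical helper name used only for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # each new chunk starts 800 characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample_text = "abcdefghij" * 250  # 2500 characters of dummy text
sample_chunks = chunk_text(sample_text)
print([len(c) for c in sample_chunks])  # → [1000, 1000, 900]
```

The last 200 characters of each chunk are repeated at the start of the next one, which is the shared `[--200--]` region in the diagram.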

In [17]:
# Splitting PDF documents into chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Splitting the documents into chunks
chunks = text_splitter.split_documents(docs)
chunks

[Document(metadata={'producer': 'Prince 20150210 (www.princexml.com)', 'creator': 'PyPDF', 'creationdate': '', 'title': 'Anatomy of the Somatosensory System', 'source': 'data/somatosensory.pdf', 'total_pages': 4, 'page': 0, 'page_label': '1'}, page_content='This is a sample document to\nshowcase page-based formatting. It\ncontains a chapter from a Wikibook\ncalled Sensory Systems. None of the\ncontent has been changed in this\narticle, but some content has been\nremoved.\nAnatomy of the Somatosensory System\nFROM WIKIBOOKS1\nOur somatosensory system consists of sensors in the skin\nand sensors in our muscles, tendons, and joints. The re-\nceptors in the skin, the so called cutaneous receptors, tell\nus about temperature (thermoreceptors), pressure and sur-\nface texture (mechano receptors), and pain (nociceptors).\nThe receptors in muscles and joints provide information\nabout muscle length, muscle tension, and joint angles.\nCutaneous receptors\nSensory information from Meissner corpu

## Step 3. Embed the Data (converting chunks to vectors)

In [22]:
# Vector Embedding - convert text into vectors
from langchain_community.embeddings import OllamaEmbeddings

# Create an OllamaEmbeddings object backed by the "nomic-embed-text" model
# (requires a running Ollama server with that model available)
embeddings = OllamaEmbeddings(model="nomic-embed-text")

[[1.1335922479629517,
  1.297784447669983,
  -3.1430249214172363,
  -1.2948276996612549,
  0.3822382986545563,
  -0.5976055860519409,
  0.5071829557418823,
  0.5967442989349365,
  -0.3790108859539032,
  -0.7892819046974182,
  0.3583195209503174,
  0.2771032750606537,
  1.38645601272583,
  0.3838447630405426,
  0.9232196807861328,
  -0.0772995799779892,
  0.11699500679969788,
  -0.3965011239051819,
  0.5749255418777466,
  0.32725629210472107,
  -1.084980845451355,
  0.2913624942302704,
  -0.9616391062736511,
  0.05742543563246727,
  0.9762870073318481,
  1.1455177068710327,
  -0.21889927983283997,
  -0.42080625891685486,
  -1.2221890687942505,
  -0.6180261969566345,
  0.6379814147949219,
  -0.5506751537322998,
  0.15731088817119598,
  -0.13174808025360107,
  -1.0492315292358398,
  -1.003218412399292,
  1.4923182725906372,
  -0.13562387228012085,
  -1.3526687622070312,
  1.2927227020263672,
  1.036705732345581,
  -0.9377143383026123,
  -0.009990320540964603,
  -1.3813856840133667,
  0.01

## Step 4. Storing the vectors in Chroma DB

In [None]:
# To store the vectors
from langchain_community.vectorstores import Chroma

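# ChromaDB -> embed the first 10 chunks and store their vectors in the database
db = Chroma.from_documents(chunks[:10], embeddings)

# Embed the same chunks directly to inspect the raw vectors
# (embed_documents expects a list of strings, not Document objects)
embeddings.embed_documents([chunk.page_content for chunk in chunks[:10]])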

In [35]:
# Querying the vector database directly (similarity search, not LLM prompting)
query = "How many pages are there in the document?"

# Chroma calculates how similar this query vector is to all document chunk vectors
# It returns a list of the most semantically similar document chunks, ranked by relevance
results = db.similarity_search(query) # for input query text, returns the most similar chunks

results[0].page_content

'This is a sample document to\nshowcase page-based formatting. It\ncontains a chapter from a Wikibook\ncalled Sensory Systems. None of the\ncontent has been changed in this\narticle, but some content has been\nremoved.\nAnatomy of the Somatosensory System\nFROM WIKIBOOKS1\nOur somatosensory system consists of sensors in the skin\nand sensors in our muscles, tendons, and joints. The re-\nceptors in the skin, the so called cutaneous receptors, tell\nus about temperature (thermoreceptors), pressure and sur-\nface texture (mechano receptors), and pain (nociceptors).\nThe receptors in muscles and joints provide information\nabout muscle length, muscle tension, and joint angles.\nCutaneous receptors\nSensory information from Meissner corpuscles and rapidly\nadapting afferents leads to adjustment of grip force when\nobjects are lifted. These afferents respond with a brief\nburst of action potentials when objects move a small dis-\ntance during the early stages of lifting. In response to'

1) Query Embedding:

- Takes your text query "How many pages are there in the document?"
- Passes it through the same embedding model used for your documents
- Converts it into a high-dimensional vector (typically 384-1536 dimensions)
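
To make the idea concrete without a running model, here is a toy stand-in for an embedding function. It is a sketch under loud assumptions: `toy_embed` is a hypothetical hashed bag-of-words helper, not what `nomic-embed-text` does, and its vectors carry no semantic meaning. It only shows that the query and the documents go through the same function, so their vectors share one space and one dimensionality.

```python
import hashlib


def toy_embed(text: str, dim: int = 16) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag-of-words counts."""
    vector = [0.0] * dim
    for token in text.lower().split():
        # md5 gives a hash that is stable across runs (unlike the built-in hash())
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vector[index] += 1.0
    return vector


query_vector = toy_embed("How many pages are there in the document?")
chunk_vector = toy_embed("This is a sample document to showcase page-based formatting.")

# Both vectors come from the same function, so they can be compared element-wise
assert len(query_vector) == len(chunk_vector) == 16
```

A real embedding model replaces the hashing step with a learned mapping, but the contract is the same: text in, fixed-length vector out.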


2) Vector Comparison:

- Calculates a similarity (or distance) score between your query vector and the stored document vectors
- A common metric is cosine similarity: cos(θ) = (A·B)/(||A||·||B||); the metric actually used depends on how the vector store is configured
- Higher cosine values (closer to 1) indicate greater similarity
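
The cosine formula above can be written out in a few lines of plain Python. This is a minimal sketch for intuition, not the code a vector store runs internally; `cosine_similarity` is an illustrative helper name.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return cos(theta) = (A . B) / (||A|| * ||B||) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal → 0.0
```

Because the result depends only on the angle between the vectors, scaling either vector by a positive constant leaves the score unchanged.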


3) Ranking & Retrieval:

- Sorts all documents by their similarity scores
- By default, returns the top 4 most similar documents
- Each returned document maintains its metadata and content
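
The ranking step can be sketched as a brute-force top-k search. The assumptions here are stated plainly: the records are hypothetical dictionaries, their vectors are already unit-normalized (so the dot product equals cosine similarity), and every record is scored exhaustively. A real vector store may use an approximate index instead of a full sort.

```python
def top_k(query_vector: list[float], records: list[dict], k: int = 4) -> list[dict]:
    """Return the k records whose vectors score highest against the query."""
    def score(record: dict) -> float:
        # Dot product; equals cosine similarity when both vectors have unit length
        return sum(q * v for q, v in zip(query_vector, record["vector"]))

    # sorted() is stable, so records with equal scores keep their original order
    return sorted(records, key=score, reverse=True)[:k]


# Hypothetical records: content and metadata travel together with the vector
records = [
    {"page_content": "chunk A", "metadata": {"page": 0}, "vector": [1.0, 0.0]},
    {"page_content": "chunk B", "metadata": {"page": 1}, "vector": [0.0, 1.0]},
    {"page_content": "chunk C", "metadata": {"page": 2}, "vector": [0.6, 0.8]},
    {"page_content": "chunk D", "metadata": {"page": 3}, "vector": [-1.0, 0.0]},
]

best = top_k([1.0, 0.0], records, k=2)
print([r["page_content"] for r in best])  # → ['chunk A', 'chunk C']
```

Each returned record still carries its `metadata` and `page_content`, mirroring how the search results above expose both.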

In [34]:
# FAISS Database
from langchain_community.vectorstores import FAISS

# To store the vectors
db_faiss = FAISS.from_documents(chunks[:10], embeddings)

results = db_faiss.similarity_search(query)
results[0].page_content

'This is a sample document to\nshowcase page-based formatting. It\ncontains a chapter from a Wikibook\ncalled Sensory Systems. None of the\ncontent has been changed in this\narticle, but some content has been\nremoved.\nAnatomy of the Somatosensory System\nFROM WIKIBOOKS1\nOur somatosensory system consists of sensors in the skin\nand sensors in our muscles, tendons, and joints. The re-\nceptors in the skin, the so called cutaneous receptors, tell\nus about temperature (thermoreceptors), pressure and sur-\nface texture (mechano receptors), and pain (nociceptors).\nThe receptors in muscles and joints provide information\nabout muscle length, muscle tension, and joint angles.\nCutaneous receptors\nSensory information from Meissner corpuscles and rapidly\nadapting afferents leads to adjustment of grip force when\nobjects are lifted. These afferents respond with a brief\nburst of action potentials when objects move a small dis-\ntance during the early stages of lifting. In response to'