# RAG Components

![RAG](https://miro.medium.com/v2/0*AJ4H1E-SdY4oXwxC)

These parameters control how a LangChain text splitter cuts a document into chunks:

chunk_size → maximum length of each chunk, measured by length_function (characters by default).

chunk_overlap → number of characters carried over from the end of the previous chunk into the next one, so context is not lost at chunk boundaries.

separators → list of strings to try splitting on, in priority order (used by RecursiveCharacterTextSplitter). Default: ["\n\n", "\n", " ", ""].

length_function → function used to measure chunk length. The default is len, i.e. character count; you can swap in a token counter such as one built on tiktoken.

keep_separator → whether to keep the separator string in the resulting chunks.

![Chunk](https://miro.medium.com/1*2G5plo83o4l9PcriBG4ncA.png)
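
To make chunk_size and chunk_overlap concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is only an illustration: LangChain's splitters split on separators first and then merge the pieces, so this is not their implementation, and the function name `sliding_window_chunks` is made up for this example.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the "carry over" that chunk_overlap describes.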


In [1]:
import os
from dotenv import load_dotenv, find_dotenv

# Load environment variables from .env
_ = load_dotenv(find_dotenv())

# Access Groq API key (raises KeyError if GROQ_API_KEY is not set)
groq_api_key = os.environ["GROQ_API_KEY"]

In [2]:
from langchain_groq import ChatGroq

llamaChatModel = ChatGroq(
    model="llama3-70b-8192",   # you can also try "llama3-8b-8192" for cheaper runs
    temperature=0.2
)

# Call the model
response = llamaChatModel.invoke("Tell me a fun fact about space.")
print(response.content)

Here's one:

**There is a giant storm on Jupiter that has been raging for at least 187 years!**

The Great Red Spot, as it's called, is a persistent anticyclonic storm on Jupiter, which means that it's a high-pressure region with clockwise rotation. It's so large that three Earths could fit inside it.

The storm was first observed in 1831, and it's been continuously monitored since then. Despite its incredible longevity, the Great Red Spot is actually shrinking, and its color has changed from a deep red to more of a pale pink over the years.

Isn't that just mind-blowing?


In [None]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("ms_dhoni.txt")


loaded_data = loader.load()

In [4]:
loaded_data

[Document(metadata={'source': 'ms_dhoni.txt'}, page_content="Mahendra Singh Dhoni, often known simply as MS Dhoni, is one of the greatest cricketers in the history of the game. \nHe was born on 7th July 1981 in Ranchi, Jharkhand, India. Dhoni is celebrated not only for his exceptional cricketing skills \nbut also for his calm demeanor, sharp decision-making abilities, and inspirational leadership on and off the field.\n\nDhoni made his debut for the Indian cricket team in December 2004 against Bangladesh. He initially caught attention with his \naggressive batting style, unorthodox wicket-keeping, and ability to finish matches under pressure. Over the years, he transformed \ninto a complete cricketer and one of the finest captains cricket has ever seen.\n\nKnown as 'Captain Cool,' Dhoni led the Indian team to several historic victories. Under his leadership, India won the inaugural \nT20 World Cup in 2007, the 2011 ICC Cricket World Cup, and the 2013 ICC Champions Trophy, making him th

In [5]:
loaded_data[0].page_content


"Mahendra Singh Dhoni, often known simply as MS Dhoni, is one of the greatest cricketers in the history of the game. \nHe was born on 7th July 1981 in Ranchi, Jharkhand, India. Dhoni is celebrated not only for his exceptional cricketing skills \nbut also for his calm demeanor, sharp decision-making abilities, and inspirational leadership on and off the field.\n\nDhoni made his debut for the Indian cricket team in December 2004 against Bangladesh. He initially caught attention with his \naggressive batting style, unorthodox wicket-keeping, and ability to finish matches under pressure. Over the years, he transformed \ninto a complete cricketer and one of the finest captains cricket has ever seen.\n\nKnown as 'Captain Cool,' Dhoni led the Indian team to several historic victories. Under his leadership, India won the inaugural \nT20 World Cup in 2007, the 2011 ICC Cricket World Cup, and the 2013 ICC Champions Trophy, making him the only captain to have \nwon all three major ICC trophies. H

In [6]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

In [8]:
type(text_splitter)

langchain_text_splitters.character.CharacterTextSplitter

In [9]:
texts = text_splitter.create_documents([loaded_data[0].page_content])


In [10]:
texts

[Document(page_content='Mahendra Singh Dhoni, often known simply as MS Dhoni, is one of the greatest cricketers in the history of the game. \nHe was born on 7th July 1981 in Ranchi, Jharkhand, India. Dhoni is celebrated not only for his exceptional cricketing skills \nbut also for his calm demeanor, sharp decision-making abilities, and inspirational leadership on and off the field.\n\nDhoni made his debut for the Indian cricket team in December 2004 against Bangladesh. He initially caught attention with his \naggressive batting style, unorthodox wicket-keeping, and ability to finish matches under pressure. Over the years, he transformed \ninto a complete cricketer and one of the finest captains cricket has ever seen.'),
 Document(page_content="Known as 'Captain Cool,' Dhoni led the Indian team to several historic victories. Under his leadership, India won the inaugural \nT20 World Cup in 2007, the 2011 ICC Cricket World Cup, and the 2013 ICC Champions Trophy, making him the only captai
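
The output above shows whole paragraphs packed together until the next one would push the chunk past chunk_size. A simplified sketch of that split-then-merge idea follows. It assumes no overlap and is not the library's actual code; `split_and_merge` and the word-counting lambda are hypothetical names used only for illustration.

```python
from typing import Callable


def split_and_merge(
    text: str,
    separator: str = "\n\n",
    chunk_size: int = 1000,
    length_function: Callable[[str], int] = len,
) -> list[str]:
    """Split on `separator`, then greedily pack the pieces into chunks."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and length_function(candidate) > chunk_size:
            chunks.append(current)  # the open chunk is full; start a new one
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Three 40-character paragraphs: two fit within 90 characters, the third does not.
paragraphs = "\n\n".join(["a" * 40, "b" * 40, "c" * 40])
by_chars = split_and_merge(paragraphs, chunk_size=90)
print([len(c) for c in by_chars])  # [82, 40]

# Same logic, but measuring length in words instead of characters.
by_words = split_and_merge(
    "one two\n\nthree four\n\nfive",
    chunk_size=4,
    length_function=lambda s: len(s.split()),
)
print(by_words)  # ['one two\n\nthree four', 'five']
```

Note that in this sketch a single paragraph longer than chunk_size is kept whole, because there is only one separator to split on.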

# RecursiveCharacterTextSplitter

In [11]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=26,
    chunk_overlap=4
)

text = recursive_splitter.split_text(loaded_data[0].page_content)

In [12]:
text

['Mahendra Singh Dhoni,',
 'often known simply as MS',
 'MS Dhoni, is one of the',
 'the greatest cricketers',
 'in the history of the',
 'the game.',
 'He was born on 7th July',
 '1981 in Ranchi,',
 'Jharkhand, India. Dhoni',
 'is celebrated not only',
 'for his exceptional',
 'cricketing skills',
 'but also for his calm',
 'demeanor, sharp',
 'decision-making',
 'abilities, and',
 'and inspirational',
 'leadership on and off the',
 'the field.',
 'Dhoni made his debut for',
 'for the Indian cricket',
 'team in December 2004',
 'against Bangladesh. He',
 'He initially caught',
 'attention with his',
 'aggressive batting style,',
 'unorthodox',
 'wicket-keeping, and',
 'and ability to finish',
 'matches under pressure.',
 'Over the years, he',
 'he transformed',
 'into a complete cricketer',
 'and one of the finest',
 'captains cricket has ever',
 'seen.',
 "Known as 'Captain Cool,'",
 'Dhoni led the Indian team',
 'to several historic',
 'victories. Under his',
 'his leadership, India
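
The idea behind recursive splitting can be sketched as follows: try the coarsest separator first, and only fall back to a finer one for pieces that are still too long. This is a simplified illustration under stated assumptions (no chunk_overlap, separators are dropped at chunk boundaries), not the library's implementation; `recursive_split` is a hypothetical name.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split on the coarsest separator first, falling back to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the open chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


sentence = (
    "Mahendra Singh Dhoni, often known simply as MS Dhoni, "
    "is one of the greatest cricketers"
)
parts = recursive_split(sentence, chunk_size=26)
print(parts)
# ['Mahendra Singh Dhoni,', 'often known simply as MS',
#  'Dhoni, is one of the', 'greatest cricketers']
```

The boundaries land in the same places as in the notebook output above; the repeated words there ("MS", "the") come from chunk_overlap=4, which this sketch leaves out.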

# Embeddings
An embedding is a vector of numbers that captures the meaning of a piece of text (or other data), so that semantically similar inputs end up close together and machines can compare and search them by meaning.

In [40]:
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

In [41]:
chunks_of_text = [
    "Hi there!",
    "Hello!",
    "What's your name?",
    "Bond, James Bond",
    "Hello Bond!",
]

In [42]:
embeddings = embeddings_model.embed_documents(chunks_of_text)


In [44]:
len(embeddings[0])


1536

In [45]:
print(embeddings[0][:5])


[-0.020325319841504097, -0.007096723187714815, -0.022839006036520004, -0.026279456913471222, -0.037527572363615036]
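
Embedding vectors are usually compared with cosine similarity, which measures the angle between two vectors. The sketch below uses made-up 3-dimensional vectors in place of the real 1536-dimensional ones, so the numbers are purely illustrative and do not come from any model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings.
toy_embeddings = {
    "Hi there!": [0.9, 0.1, 0.0],
    "Hello!": [0.8, 0.2, 0.1],
    "Bond, James Bond": [0.1, 0.9, 0.3],
}

query = toy_embeddings["Hi there!"]
for text, vector in toy_embeddings.items():
    print(f"{text!r}: {cosine_similarity(query, vector):.3f}")
```

With these toy values the two greetings score much closer to each other than either does to "Bond, James Bond", which is the behavior a real embedding model is meant to provide.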


# Vector Stores (a.k.a. Vector Databases)
A vector store keeps embeddings in a database that is indexed for fast similarity search.

In [49]:
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_chroma import Chroma

# Load the document and split it into chunks; embedding and storing in the vector store happen in a later cell.
loaded_document = TextLoader('ms_dhoni.txt').load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

chunks_of_text = text_splitter.split_documents(loaded_document)

In [50]:
len(chunks_of_text)

11

In [51]:
chunks_of_text

[Document(metadata={'source': 'ms_dhoni.txt'}, page_content='Mahendra Singh Dhoni, often known simply as MS Dhoni, is one of the greatest cricketers in the history of the game. \nHe was born on 7th July 1981 in Ranchi, Jharkhand, India. Dhoni is celebrated not only for his exceptional cricketing skills \nbut also for his calm demeanor, sharp decision-making abilities, and inspirational leadership on and off the field.\n\nDhoni made his debut for the Indian cricket team in December 2004 against Bangladesh. He initially caught attention with his \naggressive batting style, unorthodox wicket-keeping, and ability to finish matches under pressure. Over the years, he transformed \ninto a complete cricketer and one of the finest captains cricket has ever seen.'),
 Document(metadata={'source': 'ms_dhoni.txt'}, page_content="Known as 'Captain Cool,' Dhoni led the Indian team to several historic victories. Under his leadership, India won the inaugural \nT20 World Cup in 2007, the 2011 ICC Cricke

# ChromaDB
ChromaDB is an open-source vector database designed for AI applications. It stores and indexes embeddings (high-dimensional vectors) to enable efficient similarity search and retrieval. Commonly used in RAG (Retrieval-Augmented Generation) pipelines, it provides fast nearest-neighbor queries, persistence, and APIs for integrating with LLM frameworks like LangChain and LlamaIndex.

In [52]:
vector_db = Chroma.from_documents(chunks_of_text, OpenAIEmbeddings())


In [53]:
vector_db

<langchain_chroma.vectorstores.Chroma at 0x17825ba50>

In [57]:
question = "csk means?"

response = vector_db.similarity_search(question)

print(response[0].page_content)

Beyond his international career, Dhoni made a massive impact in the Indian Premier League (IPL). As the captain of the 
Chennai Super Kings (CSK), he led the team to multiple IPL titles and playoff appearances. His connection with CSK fans runs 
deep, and he is adored as 'Thala' in Chennai, meaning 'leader' in Tamil. His calmness, humility, and loyalty towards the 
franchise made him an IPL legend.
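
Conceptually, a similarity search embeds the query and ranks the stored chunks by how close their vectors are to it. The toy store below sketches that flow with a brute-force scan. Everything in it is an assumption made for illustration: `ToyVectorStore` is not Chroma, the word-count `embed` function is a crude stand-in for a real embedding model, and production vector databases use approximate nearest-neighbor indexes rather than scanning every row.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory store that keeps (text, vector) pairs and scans them all."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, Counter]] = []

    def add_texts(self, texts: list[str]) -> None:
        for text in texts:
            self._rows.append((text, embed(text)))

    def similarity_search(self, query: str, k: int = 4) -> list[str]:
        query_vec = embed(query)
        ranked = sorted(
            self._rows, key=lambda row: cosine(query_vec, row[1]), reverse=True
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "Dhoni was born in Ranchi in 1981.",
    "As captain of the Chennai Super Kings (CSK), he won multiple IPL titles.",
    "India won the 2011 ICC Cricket World Cup under his leadership.",
])
results = store.similarity_search("csk means?", k=1)
print(results[0])
```

Because the toy embedding only matches exact words, it finds the CSK sentence through the shared token "csk"; a real embedding model can also match on meaning when no words overlap.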


# Vector Store as Retriever
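A retriever wraps a vector store behind a simple query interface: given a question, it embeds the query and returns the k chunks whose embeddings are most similar to it.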

In [58]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("asmtech.txt")

In [61]:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

loaded_document = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

chunks_of_text = text_splitter.split_documents(loaded_document)

embeddings = OpenAIEmbeddings()

vector_db = FAISS.from_documents(chunks_of_text, embeddings)

In [62]:
vector_db

<langchain_community.vectorstores.faiss.FAISS at 0x179937c90>

In [68]:
retriever = vector_db.as_retriever(search_kwargs={"k": 4})


In [69]:
response = retriever.invoke("bse code?")
response

[Document(metadata={'source': 'asmtech.txt'}, page_content='8. Summary Table\nFeature\tDetails\nFounded / HQ\t1992, Bengaluru, India\nPublic Listing\tBSE (ASMTEC), since 1994\nGlobal Presence\tOffices & delivery centers in USA, UK, Singapore, Canada, Mexico, Japan\nCore Services\tEngineering, Product R&D, IoT, Digital Engineering, Infrastructure\nKey Verticals\tAutomotive, Avionics, High-Tech, Medical, Semiconductor\nKey Technologies\tEV, ADAS, Robotics, VR/AR, Wafer Packaging, Dx\nSemiconductor Focus\tJV with HHV; design-led equipment manufacturing for semicon & solar\nRecent Investments\t₹510 cr MoU, ₹1,701M equity raise, Semcon UK acquisition\nFinancial Highlights\tStrong revenue growth, rising profitability, global listing presence\nLeadership\tExperienced board with engineering and business leadership pedigree\nIndustry Positioning\tSEMI member, recognized for precision engineering and innovation'),
 Document(metadata={'source': 'asmtech.txt'}, page_content='Acquiring 10 acres of 

In [70]:
len(response)


4
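
The retriever pattern itself is small: fix the search settings once (here k), then expose a single `invoke(query)` call. The sketch below illustrates it with a keyword-overlap search in place of a real vector search. `ToyRetriever`, `keyword_search`, and `DOCS` are hypothetical names; the sample rows are taken from the summary table in the output above.

```python
import re
from typing import Callable

DOCS = [
    "Public Listing: BSE (ASMTEC), since 1994",
    "Founded / HQ: 1992, Bengaluru, India",
    "Core Services: Engineering, Product R&D, IoT",
    "Key Verticals: Automotive, Avionics, High-Tech, Medical, Semiconductor",
]


def keyword_search(query: str, k: int) -> list[str]:
    """Rank DOCS by shared words with the query (a crude similarity proxy)."""
    query_words = set(re.findall(r"\w+", query.lower()))

    def score(doc: str) -> int:
        return len(query_words & set(re.findall(r"\w+", doc.lower())))

    return sorted(DOCS, key=score, reverse=True)[:k]


class ToyRetriever:
    """Hypothetical wrapper: stores the search settings, exposes invoke()."""

    def __init__(self, search: Callable[[str, int], list[str]], k: int = 4) -> None:
        self._search = search
        self._k = k

    def invoke(self, query: str) -> list[str]:
        return self._search(query, self._k)


retriever = ToyRetriever(keyword_search, k=2)
hits = retriever.invoke("bse code?")
print(len(hits))  # 2
print(hits[0])    # Public Listing: BSE (ASMTEC), since 1994
```

As in the notebook, the number of results is set by k, not by how many chunks are actually relevant, so lower-ranked hits may be only loosely related to the question.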