# RAG

### Document Loading

#### Text

In [1]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./RAGFiles/LangchainRetrieval.txt")
loader.load()

[Document(metadata={'source': './RAGFiles/LangchainRetrieval.txt'}, page_content="Retrieval\nMany LLM applications require user-specific data that is not part of the model's training set. The primary way of accomplishing this is through Retrieval Augmented Generation (RAG). In this process, external data is retrieved and then passed to the LLM when doing the generation step.\n\nLangChain provides all the building blocks for RAG applications - from simple to complex. This section of the documentation covers everything related to the retrieval step - e.g. the fetching of the data. Although this sounds simple, it can be subtly complex. This encompasses several key modules.\n\nIllustrative diagram showing the data connection process with steps: Source, Load, Transform, Embed, Store, and Retrieve.\n\nDocument loaders\nDocument loaders load documents from many different sources. LangChain provides over 100 different document loaders as well as integrations with other major providers in the s
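Conceptually, a text loader reads the file and wraps its contents in a document object that carries the source path as metadata. The sketch below mimics that idea with the standard library only. `SimpleDocument` and `load_text` are hypothetical names for illustration, not LangChain APIs, and the real loader may do more (for example encoding detection).

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str, encoding: str = "utf-8") -> list:
    """Read one file and return it as a single-element list of documents."""
    content = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=content, metadata={"source": path})]


# Demo on a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "sample.txt")
    Path(sample_path).write_text("Retrieval\nMany LLM applications ...", encoding="utf-8")
    sample_docs = load_text(sample_path)

print(len(sample_docs), sample_docs[0].metadata)
```

One file becomes one document here, which is why the output above is a list with a single `Document`.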

#### PDF Loading
Loading PDFs requires the `pypdf` package.

In [None]:
!uv pip install pypdf

In [10]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("./RAGFiles/Excel_Course_Document.pdf")
pages = loader.load_and_split()

In [11]:
pages[1]

Document(metadata={'producer': 'Skia/PDF m119 Google Docs Renderer', 'creator': 'PyPDF', 'creationdate': '', 'title': 'Excel Course Document', 'source': './RAGFiles/Excel_Course_Document.pdf', 'total_pages': 8, 'page': 1, 'page_label': '2'}, page_content="Whatyou'll learn\n● ABeginner'sGuidetoMicrosoftExcel-MicrosoftExcel,LearnExcel,Spreadsheets,Formulas,Shortcuts,Macros● KnowledgeofalltheessentialExcelformulas● BecomeproﬁcientinExceldatatoolslikeSorting,Filtering,DatavalidationsandDataimporting● MasterExcel'smostpopularlookupfunctionssuchasVlookup,Hlookup,IndexandMatch● HarnessfullpotentialofExcelbycreatingPivottableswithslicers● MakegreatpresentationsusingtheConditionalandTableformattingoptions● VisuallyenchantviewersusingBarcharts,ScatterPlots,Histogramsetc.● IncreaseyourefﬁciencybylearninghowtocreateanduseimportantExcelshortcuts● ExplorefunandexcitingusecasesofExcel\nRequirements\nYouwillneedaPCwithanyversionofExcelinstalledinit\nWhothiscourseisfor\nAnyonecurioustomasterExcelfrombe

#### Loading Directory of Files
The `unstructured` package is needed for loading files from a directory.

In [None]:
!uv pip install unstructured

In [15]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('./RAGFiles/', glob="**/*.txt")

In [16]:
docs = loader.load()

In [17]:
len(docs)

4

In [18]:
docs[1]

Document(metadata={'source': 'RAGFiles/LangchainRetrieval.txt'}, page_content="Retrieval Many LLM applications require user-specific data that is not part of the model's training set. The primary way of accomplishing this is through Retrieval Augmented Generation (RAG). In this process, external data is retrieved and then passed to the LLM when doing the generation step.\n\nLangChain provides all the building blocks for RAG applications - from simple to complex. This section of the documentation covers everything related to the retrieval step - e.g. the fetching of the data. Although this sounds simple, it can be subtly complex. This encompasses several key modules.\n\nIllustrative diagram showing the data connection process with steps: Source, Load, Transform, Embed, Store, and Retrieve.\n\nDocument loaders Document loaders load documents from many different sources. LangChain provides over 100 different document loaders as well as integrations with other major providers in the space,

In [20]:
loader = DirectoryLoader('./RAGFiles/', glob="**/*.txt", show_progress=True)
docs = loader.load()

100%|██████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 76.79it/s]
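To see which files a pattern such as `**/*.txt` selects, it can help to try the same glob with `pathlib`. This is a minimal sketch that assumes the loader's glob behaves like `Path.glob`. `find_files` is a hypothetical helper, and the files are created in a temporary directory purely for the demo.

```python
from pathlib import Path
import tempfile


def find_files(root: str, pattern: str = "**/*.txt") -> list:
    """Return files under root matching the glob pattern, sorted for stable output."""
    return sorted(p for p in Path(root).glob(pattern) if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.txt").write_text("top-level text file")
    (root / "nested" / "b.txt").write_text("nested text file")
    (root / "c.pdf").write_text("ignored: wrong extension")
    matched = [p.relative_to(root).as_posix() for p in find_files(tmp)]

print(matched)  # ['a.txt', 'nested/b.txt']
```

The `**/` prefix makes the match recursive, so text files in subfolders are picked up while the PDF is skipped.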


#### Loading CSV File

In [23]:
from langchain_community.document_loaders.csv_loader import CSVLoader


loader = CSVLoader(file_path='./RAGFiles/Movie_collection_dataset.csv')
data = loader.load()

In [24]:
print(data)

[Document(metadata={'source': './RAGFiles/Movie_collection_dataset.csv', 'row': 0}, page_content='Collection: 48000\nMarketin_expense: 20.1264\nBudget: 36524.125\nLead_ Actor_Rating: 7.825\nLead_Actress_rating: 8.095\nTrailer_views: 527367\nGenre: Thriller\nNum_multiplex: 494\n3D_available: YES'), Document(metadata={'source': './RAGFiles/Movie_collection_dataset.csv', 'row': 1}, page_content='Collection: 43200\nMarketin_expense: 20.5462\nBudget: 35668.655\nLead_ Actor_Rating: 7.505\nLead_Actress_rating: 7.65\nTrailer_views: 494055\nGenre: Drama\nNum_multiplex: 462\n3D_available: NO'), Document(metadata={'source': './RAGFiles/Movie_collection_dataset.csv', 'row': 2}, page_content='Collection: 69400\nMarketin_expense: 20.5458\nBudget: 39912.675\nLead_ Actor_Rating: 7.485\nLead_Actress_rating: 7.57\nTrailer_views: 547051\nGenre: Comedy\nNum_multiplex: 458\n3D_available: NO'), Document(metadata={'source': './RAGFiles/Movie_collection_dataset.csv', 'row': 3}, page_content='Collection: 66800

In [26]:
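# csv_args is forwarded to csv.DictReader. Passing fieldnames replaces the header row,
# so the file's own header line is then read as an ordinary data row.
loader = CSVLoader(file_path='./RAGFiles/Movie_collection_dataset.csv', csv_args={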
    'delimiter': ',',
    'quotechar': '"',
    'fieldnames': ['Genre', 'Budget', 'Actor_rating']
})

data = loader.load()
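The row-per-document output shown earlier can be approximated with `csv.DictReader`: each row becomes a block of `column: value` lines plus a row index in the metadata. The sketch below is an illustration under that assumption, not the loader's actual code. `rows_to_documents` and the two-column `SAMPLE_CSV` are made up for the example.

```python
import csv
import io

SAMPLE_CSV = "Genre,Budget\nThriller,36524.125\nDrama,35668.655\n"


def rows_to_documents(csv_text: str, source: str, **csv_args) -> list:
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    reader = csv.DictReader(io.StringIO(csv_text), **csv_args)
    documents = []
    for i, row in enumerate(reader):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append({"page_content": content, "metadata": {"source": source, "row": i}})
    return documents


# Default: the first line is the header, so two data rows come back.
with_header = rows_to_documents(SAMPLE_CSV, "sample.csv")

# Explicit fieldnames: the header line is no longer special and becomes row 0.
with_fieldnames = rows_to_documents(SAMPLE_CSV, "sample.csv", fieldnames=["genre", "budget"])

print(with_header[0]["page_content"])
print(len(with_header), len(with_fieldnames))  # 2 3
```

The second call shows a side effect worth knowing about: once `fieldnames` is supplied, the header line is returned as data.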

### Splitting the document - Chunking

#### Recursively split by character

In [None]:
!uv pip install langchain-text-splitters

In [30]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./RAGFiles/LangchainRetrieval.txt")
text = loader.load()

In [31]:
text

[Document(metadata={'source': './RAGFiles/LangchainRetrieval.txt'}, page_content="Retrieval\nMany LLM applications require user-specific data that is not part of the model's training set. The primary way of accomplishing this is through Retrieval Augmented Generation (RAG). In this process, external data is retrieved and then passed to the LLM when doing the generation step.\n\nLangChain provides all the building blocks for RAG applications - from simple to complex. This section of the documentation covers everything related to the retrieval step - e.g. the fetching of the data. Although this sounds simple, it can be subtly complex. This encompasses several key modules.\n\nIllustrative diagram showing the data connection process with steps: Source, Load, Transform, Embed, Store, and Retrieve.\n\nDocument loaders\nDocument loaders load documents from many different sources. LangChain provides over 100 different document loaders as well as integrations with other major providers in the s

**Now let's split this document into smaller chunks that are easier to manage**

In [32]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
    length_function=len,
)

In [33]:
texts = text_splitter.split_documents(text)
print(texts[0])
print(texts[1])
print(texts[2])

page_content='Retrieval' metadata={'source': './RAGFiles/LangchainRetrieval.txt'}
page_content='Many LLM applications require user-specific data that is not part of the model's training set. The primary way of accomplishing this is through Retrieval Augmented Generation (RAG). In this process,' metadata={'source': './RAGFiles/LangchainRetrieval.txt'}
page_content='In this process, external data is retrieved and then passed to the LLM when doing the generation step.' metadata={'source': './RAGFiles/LangchainRetrieval.txt'}
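The idea behind recursive splitting is to try the coarsest separator first (paragraph breaks), and to fall back to finer ones (line breaks, spaces, single characters) only for pieces that are still too long. The following is a simplified sketch of that idea, not the library's implementation. `recursive_split` is a hypothetical function, and chunk overlap is left out for brevity.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits into the running chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece  # start a new chunk with this piece
        else:
            # Piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]


sample = (
    "Retrieval\n"
    "Many LLM applications require user-specific data. "
    "The primary way of accomplishing this is through RAG.\n\n"
    "LangChain provides all the building blocks for RAG applications."
)
chunks = recursive_split(sample, chunk_size=60)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

As in the notebook output, the short title line ends up as its own chunk because the sentence that follows cannot be merged with it without exceeding the size limit.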


### Embedding

#### Embedding
We will use the `all-minilm` embedding model, served by Ollama, to embed the documents.

In [None]:
!uv pip install -U langchain-ollama

In [2]:
from langchain_ollama.embeddings import OllamaEmbeddings
embeddings_model = OllamaEmbeddings(model="all-minilm", base_url="http://localhost:11434")

In [6]:
# Let's generate embeddings for some sample texts
embeddings = embeddings_model.embed_documents(
    [
        "Hi",
        "What's up!",
        "Learning LangChain",
        "You should learn it from Start-Tech Academy"
    ]
)
len(embeddings), len(embeddings[0])

(4, 384)

In [7]:
# And this is what an embedding vector looks like
embeddings[0]

[-0.09052501,
 0.04040882,
 0.023812361,
 0.05894962,
 -0.022979809,
 -0.047248665,
 0.045013472,
 0.015817596,
 -0.048258033,
 -0.037713077,
 -0.019081255,
 0.021408616,
 -0.0047364486,
 -0.043372754,
 0.06001875,
 0.0591188,
 -0.027947573,
 -0.05916007,
 -0.12443087,
 -0.035721906,
 -0.006241241,
 0.032463346,
 -0.037928227,
 0.024765002,
 -0.04265622,
 -0.04248118,
 0.04591171,
 0.098640464,
 -0.050003096,
 -0.035199627,
 0.07089405,
 0.03303076,
 0.02658049,
 0.0002342745,
 0.0037552768,
 0.030471213,
 -0.078262694,
 -0.12030923,
 0.018163208,
 0.02267945,
 -0.0017624528,
 -0.023479862,
 0.003023576,
 0.024255471,
 0.044092495,
 -0.039990067,
 0.020245856,
 0.01095916,
 0.028747993,
 0.012303512,
 -0.0913725,
 -0.06810243,
 0.006091563,
 -0.012554665,
 0.092805825,
 0.027888702,
 -0.031197315,
 -0.025163416,
 0.07829983,
 -0.073466815,
 -0.06698586,
 0.01381016,
 -0.14288008,
 0.008689508,
 0.020703785,
 0.0002464315,
 -0.059241712,
 -0.06532808,
 -0.03800207,
 -0.061940506,
 -0.00

In [9]:
# Another sample: embedding a single query
embedded_query = embeddings_model.embed_query("What was the name mentioned in the conversation?")
embedded_query[:5]

[-0.08780007, 0.123271614, -0.01809491, 0.07764353, -0.0016213083]
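Embedding vectors become useful once they are compared with each other. Cosine similarity is one common measure for this. The sketch below computes it by hand on made-up 3-dimensional vectors, which are not outputs of `all-minilm`. Which distance a particular vector store uses is a configuration detail not covered here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors: doc_a points roughly the same way as the query, doc_b does not.
query_vec = [0.9, 0.1, 0.0]
doc_a = [0.8, 0.2, 0.0]
doc_b = [0.0, 0.1, 0.9]

sim_a = cosine_similarity(query_vec, doc_a)
sim_b = cosine_similarity(query_vec, doc_b)
print(round(sim_a, 3), round(sim_b, 3))
```

A value close to 1 means the two vectors point in nearly the same direction, and a value near 0 means they are almost unrelated.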

### Vector Storage

#### Chroma
We will use Chroma DB to store the embeddings.

In [None]:
!uv pip install langchain-chroma

In [14]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_chroma import Chroma

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader("./RAGFiles/LangchainRetrieval.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=20)
documents = text_splitter.split_documents(raw_documents)
db = Chroma.from_documents(documents, embeddings_model)

Created a chunk of size 760, which is longer than the specified 500
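The warning above is expected with a single-separator splitter: it cuts only at the separator (a blank line by default), so a paragraph longer than `chunk_size` stays whole. Here is a minimal sketch of that behaviour. `split_on_separator` is a hypothetical function written for illustration and ignores overlap.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Merge separator-delimited pieces into chunks; never cut inside a piece."""
    chunks, oversized, current = [], [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        current = piece
        if len(piece) > chunk_size:
            # Nothing finer to split on, so the piece is kept as one long chunk.
            oversized.append(len(piece))
    if current:
        chunks.append(current)
    return chunks, oversized


demo_text = "short paragraph" + "\n\n" + "x" * 30 + "\n\n" + "tail"
demo_chunks, too_long = split_on_separator(demo_text, chunk_size=20)
print([len(c) for c in demo_chunks], too_long)  # [15, 30, 4] [30]
```

If a hard upper bound on chunk size matters, a recursive splitter that falls back to finer separators is the safer choice.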


**Let us check that document search works in the vector DB**

In [16]:
query = "What is text embedding and how does langchain help in doing it"
docs = db.similarity_search(query)
print(docs[1].page_content)

Document loaders
Document loaders load documents from many different sources. LangChain provides over 100 different document loaders as well as integrations with other major providers in the space, like AirByte and Unstructured. LangChain provides integrations to load all types of documents (HTML, PDF, code) from all types of locations (private S3 buckets, public websites).


**If the query is already embedded, we can search by vector directly**

In [20]:
embedding_vector = embeddings_model.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
print(docs[0].page_content)

Text embedding models
Another key part of retrieval is creating embeddings for documents. Embeddings capture the semantic meaning of the text, allowing you to quickly and efficiently find other pieces of a text that are similar. LangChain provides integrations with over 25 different embedding providers and methods, from open-source to proprietary API, allowing you to choose the one best suited for your needs. LangChain provides a standard interface, allowing you to easily swap between models.
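Under the hood, a similarity search embeds the query, scores it against every stored chunk, and returns the best matches. The toy version below shows those steps end to end. `toy_embed` is a hypothetical word-count "embedding" standing in for a real model, so its rankings reflect shared words rather than meaning.

```python
from collections import Counter
import math


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_search(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, toy_embed(c)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "document loaders load documents from many sources",
    "embeddings capture the semantic meaning of text",
    "vector stores index embeddings for fast search",
]
results = similarity_search("what do embeddings capture", toy_chunks, k=2)
print(results)
```

A real vector store precomputes and indexes the chunk embeddings instead of recomputing them for every query, but the ranking logic follows the same pattern.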


### Retrievers
A retriever combines an embedding model and a vector store behind a single interface for fetching the documents relevant to a query.

In [23]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader("./RAGFiles/LangchainRetrieval.txt").load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=20)
documents = text_splitter.split_documents(raw_documents)
db = Chroma.from_documents(documents, embeddings_model)

In [24]:
retriever = db.as_retriever()

In [25]:
# Like other LangChain runnables, a retriever can be invoked directly
docs = retriever.invoke("What is text embedding and how does langchain help in doing it")

In [26]:
len(docs)

4
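The four results are consistent with a default of `k=4` for the underlying similarity search. The sketch below illustrates the retriever idea as a thin wrapper that binds a search function and a fixed `k` behind an `invoke` method. `ToyRetriever`, `keyword_search`, and `CORPUS` are hypothetical names, and the keyword matching is only a placeholder for vector search.

```python
CORPUS = [
    "Document loaders load documents from many sources.",
    "Text splitters break documents into chunks.",
    "Embedding models turn text into vectors.",
    "Vector stores index embeddings for search.",
    "Retrievers return documents for a query.",
]


def keyword_search(query, k):
    """Hypothetical search backend: rank corpus entries by shared lowercase words."""
    query_words = set(query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda text: len(query_words & set(text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


class ToyRetriever:
    """Hypothetical wrapper exposing a search function through invoke()."""

    def __init__(self, search_fn, k=4):
        self.search_fn = search_fn
        self.k = k

    def invoke(self, query):
        return self.search_fn(query, self.k)


question = "how do retrievers return documents"
default_hits = ToyRetriever(keyword_search).invoke(question)
narrow_hits = ToyRetriever(keyword_search, k=2).invoke(question)
print(len(default_hits), len(narrow_hits))  # 4 2
```

In LangChain, the analogous knob is the `search_kwargs` argument of `as_retriever`, for example `db.as_retriever(search_kwargs={"k": 2})`.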

### Final RAG

In [29]:
# This is the final RAG pipeline, which combines all the elements into one chain
from langchain_ollama.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """Answer the question based only on the following context:

{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
llm = ChatOllama(
    base_url="http://localhost:11434",
    model="qwen2.5:0.5b",
)


def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

chain.invoke("What is text embedding and how does langchain help in doing it")


"Text embedding refers to capturing the semantic meaning of a piece of text using an algorithm or model that identifies and quantifies the relationships between words. It allows for the quick identification and retrieval of similar texts, making it useful for summarization, document classification, and other natural language processing tasks.\n\nLangChain helps in doing this by providing integrations with various embedding providers, such as OpenAI's Autoencoders, Google Cloud's AutoML, and many others from Amazon Web Services (AWS) and Microsoft Azure. These integrations allow users to load and apply the models to their data, enabling them to quickly find similar documents."
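To make the data flow of the chain easier to follow, here is a plain-Python sketch of the same steps: retrieve, format the context, fill the prompt, and call the model. `toy_retriever` and `echo_llm` are hypothetical stubs, so no Ollama server is needed. The stub returns a placeholder string, not a real answer.

```python
PROMPT_TEMPLATE = """Answer the question based only on the following context:

{context}

Question: {question}
"""


def toy_retriever(question):
    """Hypothetical retriever returning fixed chunks regardless of the question."""
    return [
        "Embeddings capture the semantic meaning of text.",
        "LangChain provides a standard interface for embedding models.",
    ]


def format_docs(docs):
    """Join retrieved chunks into one context string."""
    return "\n\n".join(docs)


def echo_llm(prompt):
    """Hypothetical stand-in for a chat model: reports the prompt size instead of answering."""
    return f"[stub answer based on a {len(prompt)}-character prompt]"


def rag_chain(question):
    # Same shape as the dict step in the chain: build context and pass the question through.
    inputs = {"context": format_docs(toy_retriever(question)), "question": question}
    prompt = PROMPT_TEMPLATE.format(**inputs)
    return prompt, echo_llm(prompt)


prompt_text, answer = rag_chain("What is text embedding?")
print(prompt_text)
print(answer)
```

Printing the filled prompt like this is also a handy way to check what context the model actually receives before judging its answer.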