## RAG Pipeline with VectorDB 

In [4]:
## Data Ingestion
from langchain_community.document_loaders import TextLoader
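# Read the file as UTF-8 explicitly; the platform default encoding can garble non-ASCII characters.
loader = TextLoader('speech.txt', encoding='utf-8')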

In [5]:
docs = loader.load()
docs

[Document(metadata={'source': 'speech.txt'}, page_content='Today, I stand before you to speak about a concept that is both powerful and empowering — Democracy.\n\nDemocracy is not just a form of government; it is the voice of the people. It is the system where every individual, regardless of wealth, class, or background, has an equal say in how they are governed. It is the belief that power belongs to the people — and that governments should exist to serve, not to rule.\n\nIn a true democracy, decisions are not made behind closed doors, but out in the open. Rights are not granted by rulers, but guaranteed by law. It gives us the freedom to speak, to vote, to protest, and most importantly — to dream.\n\nBut democracy is not a one-time event; it is a daily responsibility. It is built by our participation, protected by our awareness, and strengthened by our unity. It demands that we stay informed, ask questions, and hold our leaders accountable.\n\nAs citizens, we must remember: sil
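
If dashes or quotes in loaded text come out as stray sequences beginning with `â€`, a likely cause is that a UTF-8 file was decoded with a legacy code page such as cp1252 (a common default on Windows). The sketch below reproduces that effect using only the standard library; the sample sentence and the temporary file are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample text containing a non-ASCII dash (U+2014).
original = "powerful and empowering \u2014 Democracy."

# Write the text to a temporary file as UTF-8 bytes.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as f:
    f.write(original)
    path = f.name

try:
    # Decoding with the wrong code page turns the dash into three stray characters.
    with open(path, encoding="cp1252") as f:
        garbled = f.read()
    # Decoding with the encoding the file was written in restores the text.
    with open(path, encoding="utf-8") as f:
        clean = f.read()
finally:
    os.remove(path)

print(repr(garbled))
print(repr(clean))
```

With LangChain's `TextLoader`, passing `encoding='utf-8'` should have the same effect as the second `open` call here.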

In [6]:
import os
from dotenv import load_dotenv
load_dotenv()
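# Re-export the key only if it is set; assigning None to os.environ raises a TypeError.
groq_api_key = os.getenv("GROQ_API_KEY")
if groq_api_key:
    os.environ["GROQ_API_KEY"] = groq_api_key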

## Web based Loader

In [8]:
from langchain_community.document_loaders import WebBaseLoader
import bs4

## Load only the post title, header, and body from the HTML page

loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-title", "post-content", "post-header")
        )
    ),
)
docs = loader.load()
docs

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistake
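
The `parse_only=bs4.SoupStrainer(class_=...)` argument restricts parsing to elements that carry one of the listed CSS classes, so navigation bars, sidebars, and footers never reach `page_content`. As a rough illustration of that filtering idea (not BeautifulSoup's actual implementation), the sketch below uses the standard library's `html.parser`; the `ClassFilter` class and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassFilter(HTMLParser):
    """Collect text only from elements whose class is in `keep` (hypothetical helper)."""

    def __init__(self, keep):
        super().__init__()
        self.keep = set(keep)
        self.depth = 0      # greater than 0 while inside a kept element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Start counting at a kept element, then track every nested tag.
        if self.depth > 0 or self.keep.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())


sample_html = """
<nav class="menu">Home | Archive</nav>
<h1 class="post-title">LLM Powered Autonomous Agents</h1>
<div class="post-content"><p>Planning<br>Memory</p></div>
<footer class="site-footer">Powered by a static site generator</footer>
"""

parser = ClassFilter(keep=("post-title", "post-content", "post-header"))
parser.feed(sample_html)
kept_text = parser.parts
print(kept_text)
```

Only the title and the post body survive; the `nav` and `footer` text is dropped because neither element has a class in the keep list.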

In [9]:
## PDF Loader

from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader('Automating_records.pdf')
docs = loader.load()
docs

Ignoring wrong pointing object 58 0 (offset 0)
Ignoring wrong pointing object 62 0 (offset 0)


[Document(metadata={'producer': 'Mac OS X 10.11.6 Quartz PDFContext', 'creator': 'pdftk 1.41 - www.pdftk.com', 'creationdate': "D:20180729005033Z00'00'", 'moddate': "D:20180729005033Z00'00'", 'source': 'Automating_records.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Automating Records \nManagement\nChristine Shervington\nIntroduction\nThis paper will give an outline of an automated current records \nmanagement programme in an Australian institution, the University of \nWestern A ustralia. The University A rchivist is involved in the organisation \nand care of both the archival records, and the current records management \nprogramme and office, a position which provides an overview of the day to \nday administrative needs placed on the records as well as the historical \nexpectations held by administrators and researchers.\nRecent literature1 on automation of current and non-current records \nmanagement has made much of the fact that traditionally, archival find

In [10]:
## Split 
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
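    chunk_size=1000,      # maximum chunk length, as measured by length_function
    chunk_overlap=200,    # characters shared between neighbouring chunks
    length_function=len   # measure length in characters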
)
chunk_docs = text_splitter.split_documents(docs)
chunk_docs

[Document(metadata={'producer': 'Mac OS X 10.11.6 Quartz PDFContext', 'creator': 'pdftk 1.41 - www.pdftk.com', 'creationdate': "D:20180729005033Z00'00'", 'moddate': "D:20180729005033Z00'00'", 'source': 'Automating_records.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Automating Records \nManagement\nChristine Shervington\nIntroduction\nThis paper will give an outline of an automated current records \nmanagement programme in an Australian institution, the University of \nWestern A ustralia. The University A rchivist is involved in the organisation \nand care of both the archival records, and the current records management \nprogramme and office, a position which provides an overview of the day to \nday administrative needs placed on the records as well as the historical \nexpectations held by administrators and researchers.\nRecent literature1 on automation of current and non-current records \nmanagement has made much of the fact that traditionally, archival find
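
`chunk_size` and `chunk_overlap` are easiest to picture as a sliding window. The sketch below is a deliberately simplified fixed-width version: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line boundaries, which this code does not attempt. The function name `sliding_chunks` and the dummy text are hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(2500))  # 2500 characters of dummy text
chunks = sliding_chunks(text, chunk_size=1000, chunk_overlap=200)

for i, chunk in enumerate(chunks):
    print(i, len(chunk))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.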

## Vector Embedding and Vector Store

In [21]:

from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

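# The model must already be pulled in Ollama (e.g. `ollama pull gemma:2b`).
# gemma:2b is a general-purpose LLM; a dedicated embedding model such as
# "nomic-embed-text" may give more relevant retrieval results.
embeddings = OllamaEmbeddings(model="gemma:2b")

# Embed every chunk and build the FAISS vector store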
db = FAISS.from_documents(chunk_docs, embeddings)
db


<langchain_community.vectorstores.faiss.FAISS at 0x2c90364ea50>

In [23]:
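query = "Another refinement which is currently under consideration is the institution of a bar-code checking system."
# similarity_search returns the k most similar chunks (k=4 by default), best match first.
result = db.similarity_search(query)
# Index 1 is the second-ranked chunk; result[0] is the closest match.
print(result[1].page_content)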


138 AUTOMATING RECORDS MANAGEMENT
These systems may be more powerful in that they access material at 
document level but the running costs of such operations are prohibitive in 
an education institution facing financial cut-backs annually.
There are however, several enhancements or variations which could have 
been made if a package other than DATATR1EVE had been used. For 
instance, although it was originally planned to include a procedure to 
maintain statistical details such as title of officer and date of recent use of 
each file, this proved too expensive in computer time because of the need 
for the programme to interact with the master file.
Consequently, although this feature is recognised by experts in the field 
as one of the great benefits of automation8 it has been abandoned in this 
instance. However, it is still possible to identify recent users, as officers’ 
titles are listed on the face sheet as a directive when the file is checked out.
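
Conceptually, `FAISS.from_documents` embeds each chunk into a vector, and `similarity_search` embeds the query and returns the nearest chunks. The sketch below mimics that flow with a toy word-count embedding and a brute-force cosine ranking. Both are stand-ins chosen for illustration (the pipeline above uses Ollama embeddings and a FAISS index), and `embed`, `cosine`, `search`, and the sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query, store, k=4):
    """Return the k stored texts most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


chunks = [
    "A bar-code checking system is under consideration for file tracking.",
    "The university archivist cares for archival and current records.",
    "Running costs of document-level systems are prohibitive.",
]
# The "vector store": each chunk kept alongside its precomputed vector.
store = [(chunk, embed(chunk)) for chunk in chunks]

result = search("institution of a bar-code checking system", store, k=2)
print(result[0])
```

A real embedding model is meant to capture meaning rather than exact word overlap, so it can match paraphrases that this toy version would miss.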
