GETTING STARTED WITH RAG APPLICATIONS
1. LOAD THE SOURCE FILE -> PDFs, Text files
2. TRANSFORMATION OF THE LOADED DATA
- BREAKING INTO CHUNKS
3. CREATE EMBEDDINGS FOR THE CHUNKS OF DOCUMENTS -> STORE THEM INTO VECTOR DB -> QUERY USING SIMILARITY SEARCH FROM CHROMADB

In [None]:
# loading text files
from langchain_community.document_loaders import TextLoader
loader = TextLoader('test.txt')
result = loader.load()
print(result[0].page_content)

In [None]:
# loading pdf files
# requires the pypdf package (pip install pypdf)
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader('abstract.pdf')
result = loader.load()
print(result)

In [None]:
# data transformation
# breaking the loaded data into chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=100
)
splitted_docs = text_splitter.split_documents(result) # split the loaded documents into smaller, overlapping chunks
# print(splitted_docs[:5])
for item in splitted_docs[:5]:
    print(item.page_content)
    print("_")
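
To make chunk_size and chunk_overlap concrete, here is a minimal sketch of a fixed-size sliding window. It is a simplification and not the algorithm RecursiveCharacterTextSplitter uses (that splitter prefers to break on separators such as paragraphs, lines and spaces). The function name chunk_text and the sample string are made up for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence that
    straddles a boundary still appears whole in at least one chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=5)
for piece in pieces:
    print(piece)
```

Each printed chunk starts with the last five characters of the previous one. A larger overlap preserves more context across boundaries at the cost of storing more near-duplicate text.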

In [None]:
# creating embeddings for the document chunks
from langchain_ollama import OllamaEmbeddings # using ollama embeddings
embeddings_object = OllamaEmbeddings(model="phi")

VECTOR DB

Vector databases store data as vectors (arrays of numbers, also called embeddings) and are built to search them quickly. An embedding captures characteristics of the original data, such as the meaning of a text or the visual content of an image, so that similar items end up with nearby vectors. The main purpose of a vector database is to find similar items efficiently. For example, to find images that look like a given picture, or texts with a similar meaning, the database compares vectors rather than raw data, which stays fast even over a large collection.
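
As a rough illustration of "finding similar items", the sketch below ranks a few hand-made 3-dimensional vectors by cosine similarity to a query vector. The vectors and labels are invented for the example, and cosine similarity is only one common measure; the metric a real vector database uses depends on how it is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical stored vectors; real embeddings have hundreds of dimensions.
stored = {
    "cat": [1.0, 0.0, 0.0],
    "kitten": [0.9, 0.1, 0.0],
    "car": [0.0, 0.0, 1.0],
}
query_vector = [1.0, 0.0, 0.1]

# Sort labels from most similar to least similar to the query vector.
ranked = sorted(
    stored,
    key=lambda name: cosine_similarity(query_vector, stored[name]),
    reverse=True,
)
print(ranked)
```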

In [None]:
# storing embeddings into vector db
# requires the chromadb package (pip install chromadb)
from langchain_community.vectorstores import Chroma # Chroma vector db, used through its LangChain integration
db = Chroma.from_documents(splitted_docs[:5], embeddings_object)

In [None]:
# query the vector db for similar embeddings
query = "who are the team members"
res = db.similarity_search(query) # note this does not use llm
for item in res:
    print(item.page_content)
    print("_")

What happens behind the scenes is that no LLM is involved at this stage. The query is first converted into an embedding with the same embedding model that was used for the document chunks, and the vector db then returns the chunks whose embeddings are most similar to the query embedding.
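
The lookup can be sketched end to end in plain Python. Everything here is a stand-in: toy_embed is a hypothetical word-count "embedding" used in place of a real model, toy_similarity_search is not Chroma's implementation, and the example chunks are invented. The point is only that documents and query go through the same embedding function before they are compared.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_similarity_search(query: str, indexed: list, k: int = 2) -> list[str]:
    """Embed the query like the chunks, then return the k closest chunks."""
    query_vec = toy_embed(query)
    ranked = sorted(indexed, key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


toy_chunks = [
    "the team members are listed in the appendix",
    "the dataset contains ten thousand images",
    "results are reported in section four",
]
# "Storing" step: keep each chunk next to its embedding.
indexed = [(chunk, toy_embed(chunk)) for chunk in toy_chunks]

top = toy_similarity_search("who are the team members", indexed, k=1)
print(top[0])
```

A real embedding model captures meaning rather than exact word overlap, so it can also match a query that shares no words with the relevant chunk.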

NOW LET US USE LLMS TO QUERY

In [None]:
# loading open source llm
from langchain_community.llms import Ollama
llm = Ollama(model="phi")
print(llm)

In [None]:
# design the chat prompt template
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the provided context.
<context>{context}</context>
Question: {input}""")

In [None]:
# Introducing chains
# create a stuff-documents chain: all retrieved chunks are inserted ("stuffed") into {context}
from langchain.chains.combine_documents import create_stuff_documents_chain
document_chain = create_stuff_documents_chain(llm, prompt)

In [None]:
# creating a retriever: an interface that fetches relevant chunks from the vector store
retriever = db.as_retriever()

In [None]:
# creating the retrieval chain: the retriever fetches chunks, the document chain answers from them
from langchain.chains import create_retrieval_chain
retriever_chain = create_retrieval_chain(retriever, document_chain)

In [None]:
response = retriever_chain.invoke({"input": "what is the content of the pdf"})
print(response)

In [None]:
print(response["answer"])
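
To see what the retrieval chain does conceptually, here is a minimal sketch in plain Python. It is not LangChain's implementation: toy_retrieval_chain, stuff_documents, fake_retrieve and fake_llm are hypothetical names, the retriever and the model are replaced by fakes, and joining chunks with a blank line is an assumption made for the sketch. The returned dictionary mirrors the "input", "context" and "answer" keys seen in the response above.

```python
PROMPT_TEMPLATE = """Answer the question based only on the provided context.
<context>{context}</context>
Question: {input}"""


def stuff_documents(docs: list[str]) -> str:
    """'Stuff' step: put every retrieved chunk into one context string."""
    return "\n\n".join(docs)


def toy_retrieval_chain(question: str, retrieve, llm) -> dict:
    """Retrieve chunks, fill the prompt with them, then ask the model."""
    docs = retrieve(question)
    prompt_text = PROMPT_TEMPLATE.format(context=stuff_documents(docs), input=question)
    return {"input": question, "context": docs, "answer": llm(prompt_text)}


def fake_retrieve(question: str) -> list[str]:
    # Stand-in for retriever: a real one would run a similarity search.
    return ["Chunk A about the abstract.", "Chunk B about the team."]


def fake_llm(prompt_text: str) -> str:
    # Stand-in for the model: echoes the prompt so the filled template is visible.
    return "PROMPT RECEIVED:\n" + prompt_text


toy_response = toy_retrieval_chain("what is the content of the pdf", fake_retrieve, fake_llm)
print(toy_response["answer"])
```

Because every retrieved chunk is placed in a single prompt, this approach only works while the combined chunks fit in the model's context window.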