<a href="https://colab.research.google.com/github/udituen/AgriQuery/blob/main/RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Learning Goals
Convert documents or texts into embeddings

Store and search those embeddings by relevance

📌 Suggested Steps
Intro to embeddings:

OpenAI embeddings doc

Try generating embeddings with text-embedding-3-small

Learn FAISS (or Chroma):

Store vectors

Perform a similarity search

Get top-k relevant chunks

Practice Exercise
Take a set of articles (e.g., Wikipedia summaries), chunk them, embed them, store in FAISS, and build a simple script to return top 3 relevant passages for a user query.


User Query
   ⬇️
Retriever (FAISS + Embeddings) → Top K relevant chunks
   ⬇️
LLM (GPT-3.5/4) → Generates answer using those chunks as context

Lets use HuggingFace Sentence Transformer!!!

In [None]:
!pip install -U sentence-transformers


1. upload documents to drive
2. access documents from drive
3. read a document
4. save content in a variable
5. clean document - regex other new methods
6. chunk document - nltk/spacy/langchain
7. embed document - sentenceTransform method

In [None]:
!pip install langchain-community


In [2]:
##
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
FILE_PATH = '/content/drive/MyDrive/Career/AgriQuery/documents/vasular_plant.pdf'

Using PyMuPDF for document extraction

REF: https://pymupdf.readthedocs.io/en/latest/the-basics.html\





Method / Attribute

Description

Document.page_count

the number of pages (int)

Document.metadata

the metadata (dict)

Document.get_toc()

get the table of contents (list)

Document.load_page()

read a Page

In [5]:
!pip install --upgrade pymupdf

Collecting pymupdf
  Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.1/24.1 MB[0m [31m53.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pymupdf
Successfully installed pymupdf-1.26.3


In [10]:
from langchain_community.document_loaders import PyMuPDFLoader

In [11]:
loader = PyMuPDFLoader(FILE_PATH)
# data = loader.load()

###  PyPDFLoader with langchain_community

In [12]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyMuPDFLoader(FILE_PATH)

pages = []

# consume what I need, when I need it.
async for page in loader.alazy_load():
  pages.append(page)

In [None]:
print(pages[1].page_content)

234	
Journal of the Indian Institute of Science  VOL 91:3  July–Sept. 2011  journal.library.iisc.ernet.in
Sanjay P. Sane and Amit K. Singh
Steudle, 2001; Tyree and Zimmermann, 1983). 
This debate has continued at many levels - from 
the basic physical properties of water and nature 
of its motion in capillaries, to which experimental 
techniques are most appropriate for measuring 
internal pressures in the vascular structure of 
plant xylem, and what is the biological response 
of plants to stresses due to water shortage or excess 
salinity. In this article, we review the history and 
recent literature on water transport in plants with 
a focus on the tools and techniques and major 
experimental challenges in the field.
1.1.  The Physical Properties of Water
The ubiquity of water often makes us lose sight of 
the fact that the physical properties of water are 
rather anomalous as compared to other liquids of its 
kind. Water itself is odd because it is liquid at room 
temperature, in s

Let's Chunk!!


Chuncking breaks down or splits documents into managable sizes called chunks for further embedding task. this is needed to avoid exceeding token length when using paid services for embedding

we'll use RecursiveCharacterTextSplitter from langchain_community.textsplitter

- We have an array of length 10 pages
each list contains a page
- we want to use the chunking method to break them down into semantically coherent sentences
- then we use sentence embedding by hugging face to embed them an save in a vector DB!

In [None]:
len(pages)

10

In [13]:
from langchain_text_splitters.character import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
import itertools

chunk_size: The maximum size of a chunk, where size is determined by the length_function.
chunk_overlap: Target overlap between chunks. Overlapping chunks helps to mitigate loss of information when context is divided between chunks.
length_function: Function determining the chunk size.
is_separator_regex: Whether the separator list (defaulting to ["\n\n", "\n", " ", ""]) should be interpreted as regex.


REF: https://python.langchain.com/docs/how_to/recursive_text_splitter/

In [15]:
# initialise the chunk method
splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    # length_function=len
    )

In [16]:
chunked_docs = []
for page in pages:
  docs = [Document(page_content=f'{page}')] #convert to langchain document
  texts = splitter.split_documents(docs) #apply textsplitter
  chunked_docs.append(texts)

In [None]:
print(len(chunked_docs))

10


In [17]:
chunked_docs = list(itertools.chain(*chunked_docs))  # flatten the list of pages

In [18]:
# lets see how the split looks like
texts = []
for i, chunks in enumerate(chunked_docs):
  text = chunks.page_content
  texts.append(text)

In [19]:
## for embeddings

from sentence_transformers import SentenceTransformer

In [20]:
model = SentenceTransformer('all-MiniLM-L6-v2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [21]:
embeddings = model.encode(texts, show_progress_bar=True)

Batches:   0%|          | 0/7 [00:00<?, ?it/s]

In [None]:
len(embeddings[0])

384

Store emeddings in store (with and without langchain framework)

Let's use FAISS - Facebook AI similarity search

In [23]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0.post1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.0 kB)
Downloading faiss_cpu-1.11.0.post1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m31.3/31.3 MB[0m [31m31.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: faiss-cpu
Successfully installed faiss-cpu-1.11.0.post1


In [25]:
import faiss
import numpy as np

In [26]:
embeddings.shape

(214, 384)

In [27]:
# create an embedding with dimension of the columns

embedding_dim = embeddings.shape[1]

In [28]:
# create an index for each column

index = faiss.IndexFlatL2(embedding_dim)

In [29]:
# map the index to the embeddings i.e add vectors to store

index.add(np.array(embeddings))

In [30]:
%pip install -qU langchain-huggingface

%pip install -qU langchain_community faiss-cpu

Create embeddings with sentence transformers and store in FAISS via langchain

https://python.langchain.com/docs/integrations/text_embedding/sentence_transformers/

In [31]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

In [34]:
#huggingface sentence transformers embedding model
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# store text as well as the embddings in faiss vector store using langchain wrappers

vectorstore = FAISS.from_documents(documents=chunked_docs, embedding=embedding_model)

vectorstore.save_local("./vector_db")

In [None]:

%pip install langchain-openai

Collecting langchain-openai
  Downloading langchain_openai-0.3.28-py3-none-any.whl.metadata (2.3 kB)
Downloading langchain_openai-0.3.28-py3-none-any.whl (70 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m70.6/70.6 kB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: langchain-openai
Successfully installed langchain-openai-0.3.28


SetUp RAG Pipeline (Retriever + LLM)

In [42]:
# from langchain_openai import ChatOpenAI  # using openai requires a obtaining a paid api
from langchain.chains import RetrievalQA

In [None]:
# initialise OpenAI LLM
# llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

In [35]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_community.llms import HuggingFacePipeline
import torch

https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

In [36]:
from huggingface_hub import login
login(new_session=False)

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [37]:
# define the model
model_name = "meta-llama/Llama-2-7b-chat-hf"

# load the tokenizer and model on cpu/gpu

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]



In [38]:
hf_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    # max_new_tokens=512,
    # temperature=0.7,
    )

Device set to use cpu


In [40]:
# Create a retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k":3}
    )


In [43]:
# setup RAG Chain (retrieval + generator)
from langchain_community.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=hf_pipeline)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff"  # simplest RAG Pattern - inject docs as context
)


In [None]:
# Query RAG pipeline

query = "What is this document about?"
response = qa_chain.invoke(query)

print("\nRAG LLM Response")
print(response)

Save or load FAISSfor later use

In [None]:
# save
# vectorstore.save_local("/content/drive/MyDrive/Career/AgriQuery/")

# load
# loaded_vectorstore = FAISS.load_local("vectorstore/", embeddings=embedding_model)

NameError: name 'embedding_model' is not defined

In [None]:
%pip install transformers langchain langchain_community sentence-transformers faiss-cpu




In [None]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
# from langchain_openai import ChatOpenAI


# retriever = ...  # Your retriever
llm = llm

system_prompt = (
    "Use the given context to answer the question. "
    "If you don't know the answer, say you don't know. "
    "Use three sentence maximum and keep the answer concise. "
    "Context: {context}"
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(retriever, question_answer_chain)

chain.invoke({"input": query})

{'input': 'What is this document about?',
 'context': [Document(id='e6bf066f-f999-4dd0-aede-f9c560e2fba0', metadata={}, page_content="basis? Trends in Plant Science 4, 372–375.' metadata={'producer': 'Adobe PDF Library 8.0', 'creator': 'Adobe InDesign CS3 (5.0)', 'creationdate': '2011-10-19T11:35:38+05:30', 'source': '/content/drive/MyDrive/Career/AgriQuery/documents/vasular_plant.pdf', 'file_path':"),
  Document(id='d1e773b8-d63b-448f-8d82-994965fdb322', metadata={}, page_content='\'total_pages\': 10, \'format\': \'PDF 1.5\', \'title\': \'\', \'author\': \'\', \'subject\': \'\', \'keywords\': \'\', \'moddate\': \'2011-10-19T11:35:51+05:30\', \'trapped\': \'\', \'modDate\': "D:20111019113551+05\'30\'", \'creationDate\': "D:20111019113538+05\'30\'", \'page\': 3}'),
  Document(id='bfebcfae-ee85-4ad2-ba80-0283a35702fa', metadata={}, page_content="biomechanics, eco-physiology and evolutionary biology.' metadata={'producer': 'Adobe PDF Library 8.0', 'creator': 'Adobe InDesign CS3 (5.0)', 'c