Learning Goals
- Convert documents or texts into embeddings
- Store and search those embeddings by relevance

📌 Suggested Steps
- Intro to embeddings:
  - OpenAI embeddings doc
  - Try generating embeddings with text-embedding-3-small
- Learn FAISS (or Chroma):
  - Store vectors
  - Perform a similarity search
  - Get top-k relevant chunks

Practice Exercise
Take a set of articles (e.g., Wikipedia summaries), chunk them, embed them, store in FAISS, and build a simple script to return top 3 relevant passages for a user query.


User Query
   ⬇️
Retriever (FAISS + Embeddings) → Top K relevant chunks
   ⬇️
LLM (GPT-3.5/4) → Generates answer using those chunks as context

Let's use a HuggingFace Sentence Transformer!

In [None]:
!pip install -U sentence-transformers




1. Upload documents to Drive
2. Access documents from Drive
3. Read a document
4. Save the content in a variable
5. Clean the document - regex or other methods
6. Chunk the document - nltk/spacy/langchain
7. Embed the document - SentenceTransformer
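
Step 5 (cleaning) is not implemented later in this notebook, so here is a minimal sketch of what a regex-based cleaner could look like. It assumes the usual problems with PDF-extracted text: words hyphenated across line breaks, hard-wrapped lines, and runs of whitespace. `clean_text` is a hypothetical helper, not part of any library used here.

```python
import re

def clean_text(raw: str) -> str:
    """Normalize text extracted from a PDF page (hypothetical helper)."""
    # Re-join words hyphenated across a line break: "trans-\nport" -> "transport".
    # Caveat: this also merges genuine compounds that happen to break at the hyphen.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Treat blank lines as paragraph breaks and keep them
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, collapse newlines, tabs and repeated spaces into one space
    paragraphs = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in paragraphs if p)

sample = "Water trans-\nport in plants \nis   driven by\ttranspiration.\n\nXylem  carries water."
cleaned = clean_text(sample)
print(cleaned)
```

Cleaning before chunking means the splitter works on real paragraph boundaries rather than on the PDF's line wrapping.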

In [None]:
!pip install langchain-community




In [None]:
# mount Google Drive so the notebook can read the documents stored there
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
FILE_PATH = '/content/drive/MyDrive/Career/AgriQuery/documents/vasular_plant.pdf'

Using PyMuPDF for document extraction

REF: https://pymupdf.readthedocs.io/en/latest/the-basics.html





| Method / Attribute | Description |
| --- | --- |
| Document.page_count | the number of pages (int) |
| Document.metadata | the metadata (dict) |
| Document.get_toc() | get the table of contents (list) |
| Document.load_page() | read a Page |

In [None]:
!pip install --upgrade pymupdf

Collecting pymupdf
  Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.26.3


In [None]:
import pymupdf

print(pymupdf.__doc__)

PyMuPDF 1.26.3: Python bindings for the MuPDF 1.26.3 library (rebased implementation).
Python 3.11 running on linux (64-bit).



In [None]:
from langchain_community.document_loaders import PyMuPDFLoader

In [None]:
loader = PyMuPDFLoader(FILE_PATH)
# data = loader.load()

###  PyMuPDFLoader with langchain_community

In [None]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(FILE_PATH)

pages = []

# load lazily, one Document per PDF page (top-level `async for` works in notebook cells)
async for page in loader.alazy_load():
  pages.append(page)

In [None]:
print(pages[1].page_content)

234	
Journal of the Indian Institute of Science  VOL 91:3  July–Sept. 2011  journal.library.iisc.ernet.in
Sanjay P. Sane and Amit K. Singh
Steudle, 2001; Tyree and Zimmermann, 1983). 
This debate has continued at many levels - from 
the basic physical properties of water and nature 
of its motion in capillaries, to which experimental 
techniques are most appropriate for measuring 
internal pressures in the vascular structure of 
plant xylem, and what is the biological response 
of plants to stresses due to water shortage or excess 
salinity. In this article, we review the history and 
recent literature on water transport in plants with 
a focus on the tools and techniques and major 
experimental challenges in the field.
1.1.  The Physical Properties of Water
The ubiquity of water often makes us lose sight of 
the fact that the physical properties of water are 
rather anomalous as compared to other liquids of its 
kind. Water itself is odd because it is liquid at room 
temperature, in s

Let's Chunk!!


Chunking splits documents into manageable pieces, called chunks, before embedding. This keeps each input within the embedding model's maximum token length, which matters especially when using paid embedding services.

We'll use RecursiveCharacterTextSplitter from langchain_text_splitters.

- We have a list of 10 pages; each element holds one page
- We want to split each page into chunks that keep paragraphs and sentences together where possible
- Then we embed the chunks with a Hugging Face sentence-embedding model and save them in a vector DB!

In [None]:

len(pages)

10

In [None]:
from langchain_text_splitters.character import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

chunk_size: The maximum size of a chunk, where size is determined by the length_function.
chunk_overlap: Target overlap between chunks. Overlapping chunks helps to mitigate loss of information when context is divided between chunks.
length_function: Function determining the chunk size.
is_separator_regex: Whether the separator list (defaulting to ["\n\n", "\n", " ", ""]) should be interpreted as regex.


REF: https://python.langchain.com/docs/how_to/recursive_text_splitter/
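
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal fixed-window sketch. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on the separators listed above); it only illustrates how the two parameters interact. `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap (illustration only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(len(chunk), chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.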

In [None]:
# initialise the text splitter; sizes are counted in characters because length_function=len

splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50, length_function=len)

In [None]:
chunked_docs = []
for page in pages:
  # each page is already a LangChain Document, so split it directly;
  # formatting it with f'{page}' would put the object's repr (metadata included) into the text
  texts = splitter.split_documents([page])
  chunked_docs.append(texts)  # one list of chunks per page

In [None]:
print(len(chunked_docs))

10


In [None]:
import itertools

# flatten the per-page lists into a single list of chunks
chunked_docs = list(itertools.chain.from_iterable(chunked_docs))

In [None]:
# collect the raw text of every chunk; this is what gets embedded
texts = [chunk.page_content for chunk in chunked_docs]

In [None]:
from sentence_transformers import SentenceTransformer

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
embeddings = model.encode(texts, show_progress_bar=True)

Batches:   0%|          | 0/7 [00:00<?, ?it/s]

In [None]:
len(embeddings[0])

384

Store embeddings in a vector store (with and without the LangChain framework)

Let's use FAISS (Facebook AI Similarity Search)

In [None]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl (31.3 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.11.0


In [None]:
import faiss
import numpy as np

In [None]:
embeddings.shape

(214, 384)

In [None]:
# the index needs the dimensionality of each embedding vector (number of columns)

embedding_dim = embeddings.shape[1]

In [None]:
# create a flat index that does exact search using L2 (Euclidean) distance

index = faiss.IndexFlatL2(embedding_dim)

In [None]:
# add the embedding vectors to the index (FAISS expects float32)

index.add(np.asarray(embeddings, dtype="float32"))
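
The cells above fill the index but never query it. As a sketch of what an exact flat L2 search computes, here is a pure-Python stand-in on toy 3-dimensional vectors. `squared_l2` and `l2_top_k` are hypothetical helpers, and the vectors are made up for illustration.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def l2_top_k(vectors: list[list[float]], query: list[float], k: int):
    """Brute-force search: return (distances, indices) of the k nearest vectors."""
    scored = [(squared_l2(v, query), i) for i, v in enumerate(vectors)]
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# toy "embeddings"; real ones from all-MiniLM-L6-v2 have 384 dimensions
vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.0, 0.0]
distances, indices = l2_top_k(vectors, query, k=2)
print(indices)
```

With the FAISS index built above, the corresponding call would be along the lines of `distances, indices = index.search(model.encode([query_text]), 3)`, where `query_text` is the user's question. The returned `indices[0]` can then be used to look up the matching entries in `texts`.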

In [None]:
%pip install -qU langchain-huggingface

%pip install -qU langchain_community faiss-cpu

Create embeddings with sentence transformers and store in FAISS via langchain

https://python.langchain.com/docs/integrations/text_embedding/sentence_transformers/

In [None]:
# HuggingFaceEmbeddings now lives in the langchain-huggingface package installed above
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

In [None]:
# Hugging Face sentence-transformers embedding model
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# store the text as well as the embeddings in a FAISS vector store using the LangChain wrappers

vectorstore = FAISS.from_documents(documents=chunked_docs, embedding=embedding_model)


In [None]:
%pip install langchain-openai

Collecting langchain-openai
  Downloading langchain_openai-0.3.28-py3-none-any.whl.metadata (2.3 kB)
Downloading langchain_openai-0.3.28-py3-none-any.whl (70 kB)
Installing collected packages: langchain-openai
Successfully installed langchain-openai-0.3.28


Set Up the RAG Pipeline (Retriever + LLM)

In [None]:
from langchain_openai import ChatOpenAI  # using OpenAI requires a paid API key
from langchain.chains import RetrievalQA

In [None]:
# initialise OpenAI LLM
# llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_community.llms import HuggingFacePipeline
import torch

https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

In [None]:
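# define the model
# NOTE: this is a gated repo; request access on the model page and
# authenticate with a Hugging Face token before running this cell
model_name = "meta-llama/Llama-2-7b-chat-hf"

# load the tokenizer and model; device_map="auto" places the weights on GPU when available, else CPU

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")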

OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Llama-2-7b-chat-hf.
401 Client Error. (Request ID: Root=1-68758476-4bd4b6ed5cae221814329731;aa95cdf0-eacd-4ef7-931b-1640de9131d6)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/config.json.
Access to model meta-llama/Llama-2-7b-chat-hf is restricted. You must have access to it and be authenticated to access it. Please log in.

In [None]:
# Create a retriever
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k":3})

In [None]:
# set up the RAG chain (retriever + generator)
# NOTE: `llm` must be defined first, e.g. by uncommenting the ChatOpenAI cell above

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff"  # simplest RAG pattern - inject docs as context
)


NameError: name 'llm' is not defined
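
Whichever LLM ends up in `llm`, the "stuff" pattern named in the chain above amounts to pasting all retrieved chunks into a single prompt. The sketch below shows the idea in plain Python. The prompt wording and the helper `build_stuff_prompt` are hypothetical and are not LangChain's actual template. The example chunks are paraphrased from the page printed earlier.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate retrieved chunks into one context block followed by the question."""
    # number the chunks so the answer can be traced back to its source passage
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# stand-ins for the top-k chunks a retriever would return
retrieved = [
    "The physical properties of water are rather anomalous compared to other liquids.",
    "We review the history and recent literature on water transport in plants.",
    "The debate covers which techniques best measure pressures in plant xylem.",
]
prompt = build_stuff_prompt("What is this document about?", retrieved)
print(prompt)
```

Because every chunk goes into one prompt, `k` and `chunk_size` together determine how much of the model's context window the retrieved text uses.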

In [None]:
# Query RAG pipeline

query = "What is this document about?"
response = qa_chain.run(query)

print("\nRAG LLM Response")
print(response)

NameError: name 'qa_chain' is not defined

Save or load FAISS for later use

In [None]:
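# save the index and the stored documents to Drive
vectorstore.save_local("/content/drive/MyDrive/Career/AgriQuery/")

# load from the same folder; recent LangChain versions require explicitly allowing
# pickle deserialization, so only do this for files you created yourself
# loaded_vectorstore = FAISS.load_local("/content/drive/MyDrive/Career/AgriQuery/", embeddings=embedding_model, allow_dangerous_deserialization=True)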

In [None]:
%pip install transformers langchain langchain_community sentence-transformers faiss-cpu


