# **HuggingFace and LangChain RAG to query your own PDF Files**

Sample proof-of-concept code that uses HuggingFace and LangChain to read your own PDF file and answer questions about it.

**1. Install necessary libraries**

These libraries are essential for:

**transformers and sentence-transformers:** To work with Hugging Face models for text embedding.

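**faiss-gpu:** For efficient similarity search over the document embeddings. If you are running this locally on a machine without a GPU, install faiss-cpu instead.

**langchain and langchain_community:** For building the RAG pipeline. langchain_community provides the third-party integrations (document loaders, embeddings, vector stores, LLM wrappers).

**pypdf:** For loading and processing the PDF document.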


In [1]:
!pip install transformers sentence-transformers faiss-gpu langchain pypdf langchain_community


Collecting sentence-transformers
  Downloading sentence_transformers-3.2.0-py3-none-any.whl.metadata (10 kB)
Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Collecting langchain
  Downloading langchain-0.3.3-py3-none-any.whl.metadata (7.1 kB)
Collecting pypdf
  Downloading pypdf-5.0.1-py3-none-any.whl.metadata (7.4 kB)
Collecting langchain_community
  Downloading langchain_community-0.3.2-py3-none-any.whl.metadata (2.8 kB)
Collecting langchain-core<0.4.0,>=0.3.10 (from langchain)
  Downloading langchain_core-0.3.10-py3-none-any.whl.metadata (6.3 kB)
Collecting langchain-text-splitters<0.4.0,>=0.3.0 (from langchain)
  Downloading langchain_text_splitters-0.3.0-py3-none-any.whl.metadata (2.3 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.134-py3-none-any.whl.metadata (13 kB)
Collecting tenacity!=8.4.0,<9.0.0,>=8.1.0 (from langchain)
  Downloading tenacity-8.5.0-py3-none-a

**2. Load the PDF document**

This code utilizes PyPDFLoader from LangChain to load the content of your PDF file into documents.

In [11]:
from langchain_community.document_loaders import PyPDFLoader

# The next two lines mount Google Drive so the PDF can be read from it (Colab only).
# Comment them out if you are reading the file from elsewhere.
from google.colab import drive
drive.mount('/content/drive')

loader = PyPDFLoader("/content/drive/My Drive/your_pdf_file.pdf")  # Replace with your PDF file path
documents = loader.load()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**3. Split the document into chunks**

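This step breaks the PDF content into smaller, overlapping chunks using RecursiveCharacterTextSplitter so that each chunk can be embedded and retrieved on its own. chunk_size (maximum characters per chunk) and chunk_overlap (characters shared between consecutive chunks) can be adjusted.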

In [12]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
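
To make the effect of chunk_size and chunk_overlap concrete, here is a minimal pure-Python sketch of a fixed-width sliding window. It is a simplified stand-in, not what RecursiveCharacterTextSplitter actually does (the real splitter also tries to break on separators such as paragraphs and sentences). The function name `split_with_overlap` and the sample string are made up for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-width sliding window; a simplification, not LangChain's algorithm."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last 3 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.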

**4. Create embeddings for the chunks**

This code initializes a Hugging Face embedding model using HuggingFaceEmbeddings and specifies the model to use (all-mpnet-base-v2 in this case) for generating text embeddings.

In [13]:
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
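
An embedding model maps text to a vector so that texts with similar meaning end up close together. The sketch below shows how such closeness is commonly measured with cosine similarity. The three-dimensional vectors are hand-made toy values (all-mpnet-base-v2 produces 768-dimensional vectors), and the variable names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" chosen by hand for illustration, not produced by a model.
invoice_a = [0.9, 0.1, 0.0]
invoice_b = [0.8, 0.2, 0.1]
weather = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(invoice_a, invoice_b)
sim_unrelated = cosine_similarity(invoice_a, weather)
print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

The two invoice-like vectors score close to 1, while the unrelated one scores close to 0.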



**5. Store embeddings in a vector store**

This uses FAISS to create a vector store, enabling efficient storage and retrieval of document chunks based on their embeddings.

In [14]:
from langchain_community.vectorstores import FAISS

db = FAISS.from_documents(texts, embeddings)
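
Conceptually, the vector store answers one question: which stored chunks are nearest to the query vector? The sketch below does this by brute force with Euclidean (L2) distance, which, as far as I know, is the default for LangChain's FAISS wrapper. FAISS itself uses optimized index structures, so this only illustrates the idea. The `store` contents, vectors, and function names are invented for the example.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical chunks paired with hand-made toy vectors.
store = [
    ("Invoice total is due in 30 days.", [0.9, 0.1, 0.0]),
    ("Payment terms and late fees.", [0.8, 0.2, 0.1]),
    ("Forecast: rain on Tuesday.", [0.0, 0.2, 0.9]),
]

hits = top_k([0.9, 0.1, 0.05], store, k=2)
for text, dist in hits:
    print(f"{dist:.3f}  {text}")
```

With the real pipeline, `db.similarity_search(query)` plays the role of `top_k`, embedding the query first and then searching the index.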

**6. Create a retrieval QA chain**

This sets up the RetrievalQA chain, which uses a specified language model (google/flan-t5-large here) to answer questions and the db vector store to retrieve the relevant chunks. Make sure to set your Hugging Face API token.

In [15]:
import os

from google.colab import userdata
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFaceHub

# Read the Hugging Face API token from Colab secrets and expose it as the
# HUGGINGFACEHUB_API_TOKEN environment variable, which HuggingFaceHub reads.
# Outside Colab, set this environment variable yourself or pass the token
# via the huggingfacehub_api_token argument.
sec_key = userdata.get('HUGGINGFACEHUB_API_TOKEN')
os.environ['HUGGINGFACEHUB_API_TOKEN'] = sec_key

llm = HuggingFaceHub(repo_id="google/flan-t5-large", model_kwargs={"temperature":0.5, "max_length":512})
# flan-t5-large is a decent model to experiment with; it also runs fine locally on CPU.
# The xl and xxl versions of the model take a lot of resources and will consume your daily free limit on Colab.

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=db.as_retriever())
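
The "stuff" chain type simply places all retrieved chunks into a single prompt together with the question. The sketch below illustrates that idea in plain Python. The prompt wording and the `build_stuff_prompt` helper are hypothetical and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate every retrieved chunk into one context block, then ask."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunks, standing in for what the retriever would return.
retrieved = [
    "Invoice total is due in 30 days.",
    "Payment terms and late fees.",
]
prompt = build_stuff_prompt("When is payment due?", retrieved)
print(prompt)
```

Because every chunk goes into one prompt, the combined text has to fit within the model's input limit. That is the main reason to keep chunk_size and the number of retrieved chunks modest with a small model.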

**7. Query the RAG system**

This code demonstrates how to query the RAG system with a question. Replace "what is the content of the document about?" with your own query to get answers based on the PDF content.

Together, these steps give you a working proof-of-concept RAG pipeline for querying PDF documents. Remember to replace placeholders like "your_pdf_file.pdf" with your actual file path and to adjust parameters such as chunk_size, chunk_overlap and the model name as needed.

In [16]:
query = "what is the content of the document about?"
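# invoke() replaces the deprecated run(); the answer is under the "result" key.
result = qa.invoke({"query": query})["result"]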
print(result)

