# **Demo: MultiPDF QA Retriever with FAISS and LangChain**

In this demo, you will learn how to use LangChain and FAISS to build a question-answering retriever over multiple PDFs. The demo uses three recent generative AI research papers in PDF form. You will load and split the documents, embed them, create and persist a vector database, build a retriever, and use a RetrievalQA chain to ask questions and get answers.

## **Steps to Perform:**

*   Step 1: Importing the Necessary Libraries
*   Step 2: Loading and Splitting
*   Step 3: Loading the OpenAI Embeddings
*   Step 4: Creating and Loading the Database
*   Step 5: Creating and Using the Retriever
*   Step 6: Passing the Query



### **Step 1: Importing the Necessary Libraries**

In [1]:
import os
os.makedirs("Gen_AI_Papers", exist_ok=True)


In [2]:
import os

# Package layout used by the later cells: community integrations live in
# langchain_community, OpenAI wrappers in langchain_openai
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAI, OpenAIEmbeddings


In [None]:

os.chdir(".")  # replace "." with the path to your working directory
os.getcwd()


'D:\\Desktop\\AMJ Group\\Teaching\\Class Materials\\AGS_Advanced_Generative_AI_Building_LLM_Applications_ILT_Material\\Demo\\L7_Benchmark_and_Evaluation_of_LLM_Capabilities_Part_1'

In [4]:
from pathlib import Path
sorted([p.name for p in Path("Gen_AI_Papers").glob("*.pdf")])


['Chan et al. - 2023 - Harms from Increasingly Agentic Algorithmic Systems.pdf',
 'LeCun - A Path Towards Autonomous Machine Intelligence Version 0.9.2, 2022-06-27.pdf',
 'Wei et al. - 2023 - Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.pdf']

### **Step 2: Loading and Splitting**


*   Use the `Gen_AI_Papers` directory created in Step 1, which holds the PDFs.
*   Load the PDF documents in the directory.
*   Split the documents into smaller chunks using the **RecursiveCharacterTextSplitter**.

In [5]:
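# Start from empty lists so re-running the cell does not accumulate duplicates
documents, texts = [], []

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pathlib import Path

def long(p):
    # Windows only: prefix the absolute path with \\?\ so long paths are accepted
    return "\\\\?\\" + str(Path(p).resolve())

# Load every PDF in the folder; PyPDFLoader returns one Document per page
for p in Path("Gen_AI_Papers").glob("*.pdf"):
    documents.extend(PyPDFLoader(long(p)).load())

# Split pages into chunks of at most 1000 characters with a 200-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

# Confirm that only the 3 sources are present, then show the page and chunk counts
sorted({d.metadata.get("source") for d in texts})[:10], len(documents), len(texts)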





(['\\\\?\\D:\\Desktop\\AMJ Group\\Teaching\\Class Materials\\AGS_Advanced_Generative_AI_Building_LLM_Applications_ILT_Material\\Demo\\L7_Benchmark_and_Evaluation_of_LLM_Capabilities_Part_1\\Gen_AI_Papers\\Chan et al. - 2023 - Harms from Increasingly Agentic Algorithmic Systems.pdf',
  '\\\\?\\D:\\Desktop\\AMJ Group\\Teaching\\Class Materials\\AGS_Advanced_Generative_AI_Building_LLM_Applications_ILT_Material\\Demo\\L7_Benchmark_and_Evaluation_of_LLM_Capabilities_Part_1\\Gen_AI_Papers\\LeCun - A Path Towards Autonomous Machine Intelligence Version 0.9.2, 2022-06-27.pdf',
  '\\\\?\\D:\\Desktop\\AMJ Group\\Teaching\\Class Materials\\AGS_Advanced_Generative_AI_Building_LLM_Applications_ILT_Material\\Demo\\L7_Benchmark_and_Evaluation_of_LLM_Capabilities_Part_1\\Gen_AI_Papers\\Wei et al. - 2023 - Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.pdf'],
 130,
 565)
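
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a minimal, pure-Python sketch. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to cut at separators such as paragraph breaks and spaces, so its chunks are usually shorter than `chunk_size`); the function name `split_with_overlap` and the sample text are hypothetical and exist only for this illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical 2500-character sample text
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])
```

Each chunk begins with the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.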

In [6]:
texts

[Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-01-12T01:06:30+00:00', 'author': '', 'keywords': '', 'moddate': '2023-01-12T01:06:30+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': '\\\\?\\D:\\Desktop\\AMJ Group\\Teaching\\Class Materials\\AGS_Advanced_Generative_AI_Building_LLM_Applications_ILT_Material\\Demo\\L7_Benchmark_and_Evaluation_of_LLM_Capabilities_Part_1\\Gen_AI_Papers\\Wei et al. - 2023 - Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.pdf', 'total_pages': 43, 'page': 0, 'page_label': '1'}, page_content='Chain-of-Thought Prompting Elicits Reasoning\nin Large Language Models\nJason Wei Xuezhi Wang Dale Schuurmans Maarten Bosma\nBrian Ichter Fei Xia Ed H. Chi Quoc V . Le Denny Zhou\nGoogle Research, Brain Team\n{jasonwei,dennyzhou}@google.com\nAbstract\nWe explore how gene

In [7]:
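# Re-split the existing chunks with a slightly larger window. Every chunk in `texts`
# is already at most 1000 characters, so nothing is split further and the count is unchanged.
doc_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
split_texts = doc_splitter.split_documents(texts)
print(len(split_texts))  # Number of chunks after re-splitting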


565


In [8]:
split_texts

[Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-01-12T01:06:30+00:00', 'author': '', 'keywords': '', 'moddate': '2023-01-12T01:06:30+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': '\\\\?\\D:\\Desktop\\AMJ Group\\Teaching\\Class Materials\\AGS_Advanced_Generative_AI_Building_LLM_Applications_ILT_Material\\Demo\\L7_Benchmark_and_Evaluation_of_LLM_Capabilities_Part_1\\Gen_AI_Papers\\Wei et al. - 2023 - Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.pdf', 'total_pages': 43, 'page': 0, 'page_label': '1'}, page_content='Chain-of-Thought Prompting Elicits Reasoning\nin Large Language Models\nJason Wei Xuezhi Wang Dale Schuurmans Maarten Bosma\nBrian Ichter Fei Xia Ed H. Chi Quoc V . Le Denny Zhou\nGoogle Research, Brain Team\n{jasonwei,dennyzhou}@google.com\nAbstract\nWe explore how gene

### **Step 3: Loading the OpenAI Embeddings**

In [None]:
# --- Embeddings + FAISS index ---
import os
try:
    from langchain_openai import OpenAIEmbeddings
except ImportError:
    from langchain.embeddings import OpenAIEmbeddings  # fallback for older LC

# If needed for this session only (placeholder value; replace with your own key):
os.environ.setdefault("OPENAI_API_KEY", "<your-openai-api-key>")


# Use default OpenAI embeddings (or specify model="text-embedding-3-small")
embeddings = OpenAIEmbeddings()  # model="text-embedding-3-small"

from langchain_community.vectorstores import FAISS
db = FAISS.from_documents(texts, embeddings)

# Confirm index size
db.index.ntotal


565
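
`FAISS.from_documents` embeds each chunk and stores one vector per chunk, which is why `db.index.ntotal` equals the number of chunks. The sketch below mimics that idea with a toy embedding. The vocabulary, the `toy_embed` function, and the sample chunks are all assumptions made for illustration; a real embedding model produces dense vectors with far more dimensions.

```python
import math

# Hypothetical vocabulary; a real embedding model does not work from a word list
VOCAB = ["chain", "thought", "prompting", "agent", "harm", "world", "model"]


def toy_embed(text: str) -> list[float]:
    """Count vocabulary words and L2-normalise; a stand-in for a real embedding model."""
    words = text.lower().split()
    counts = [float(words.count(term)) for term in VOCAB]
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else counts


# Made-up chunks standing in for the split PDF text
chunks = [
    "chain of thought prompting improves reasoning",
    "harm from an agent acting in the world",
    "a world model predicts future states",
]
index = [toy_embed(c) for c in chunks]  # one vector per chunk
print(len(index), len(index[0]))  # number of vectors, dimensions per vector
```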

### **Step 4: Creating and Loading the Database**

*   Create a database to store the embedded text.
*   Load the database to bring it back into memory from the disk.



In [10]:
# --- Step 4: Creating and Loading the Database ---

# Save the FAISS index
db.save_local("faiss_index")

# Load the FAISS index back from disk
db_reloaded = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)

# Confirm it reloaded properly
db_reloaded.index.ntotal



565
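
`save_local` writes the vector store to a folder and `load_local` reads it back. Part of that saved state is stored with Python's `pickle`, which is why loading requires `allow_dangerous_deserialization=True` and should only be done with files you created yourself. A minimal round-trip sketch of the same idea follows; the helper names `save_store` and `load_store` and the file name `store.pkl` are hypothetical and are not LangChain's actual on-disk format.

```python
import pickle
import tempfile
from pathlib import Path


def save_store(folder: Path, vectors: list[list[float]], texts: list[str]) -> None:
    """Persist vectors and their texts to a single pickle file inside folder."""
    folder.mkdir(parents=True, exist_ok=True)
    with open(folder / "store.pkl", "wb") as fh:
        pickle.dump({"vectors": vectors, "texts": texts}, fh)


def load_store(folder: Path) -> dict:
    """Read the store back; unpickling can run arbitrary code, so only load trusted files."""
    with open(folder / "store.pkl", "rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "toy_index"
    save_store(folder, [[0.0, 1.0], [1.0, 0.0]], ["chunk A", "chunk B"])
    reloaded = load_store(folder)

# Same sanity check as in the cell above: the reloaded store has the same size
print(len(reloaded["vectors"]), reloaded["texts"])
```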

### **Step 5: Creating and Using the Retriever**

*   Create a retriever using the vector database.
*   Use the retriever to get relevant documents for a specific query.



In [11]:
# --- Step 5: Creating and Using the Retriever ---

# Create retriever from the reloaded FAISS index
retriever = db_reloaded.as_retriever()

# Example query (invoke() replaces the deprecated get_relevant_documents())
docs = retriever.invoke("What is Toolformer?")

# Peek at first result
docs[0].page_content[:500]





'Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2020. Transformers as soft reasoners over\nlanguage. IJCAI.\nKarl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher\nHesse, and John Schulman. 2021. Training veriﬁers to solve math word problems. arXiv preprint\narXiv:2110.14168.\nJacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of\ndeep bidirectional transformers for language understanding. NAACL.\nHonghua Dong, Jiayuan Mao,'
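
A vector retriever returns the nearest chunks whether or not they are actually relevant. The result above is a bibliography fragment, which suggests that the corpus has little to say about Toolformer. One way to notice such weak matches is to look at the distances and optionally apply a cutoff. The sketch below shows this with a brute-force search over toy vectors; `top_k`, its default `k=4`, and the `max_distance` parameter are assumptions made for this illustration and are not part of the FAISS or LangChain API.

```python
import heapq
import math


def top_k(query_vec, index, k=4, max_distance=None):
    """Return up to k (distance, position) pairs, nearest first, optionally dropping weak matches."""
    scored = [(math.dist(query_vec, vec), pos) for pos, vec in enumerate(index)]
    nearest = heapq.nsmallest(k, scored)  # smallest Euclidean distances first
    if max_distance is not None:
        nearest = [(d, pos) for d, pos in nearest if d <= max_distance]
    return nearest


# Toy 2-dimensional vectors standing in for embedded chunks
index = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query = [0.0, 1.0]

print(top_k(query, index, k=2))                    # two nearest, regardless of quality
print(top_k(query, index, k=2, max_distance=0.5))  # only matches within the cutoff
```

Without a cutoff, the second result is returned even though it is noticeably farther from the query than the first; with the cutoff, only the close match remains.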

### **Step 6: Passing the Query**

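*   Pass the query to the vector database.
*   Print the content of the most relevant document.
*   Build a `RetrievalQA` chain that passes the retrieved chunks to an LLM and returns a generated answer.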



In [12]:
# --- Step 6: Passing the Query ---

query = "A fundamental limitation of HMMs"
docs = db_reloaded.similarity_search(query, k=1)

print(docs[0].page_content)



senting uncertainty about a word being predicted comes down to producing a vector whose
components are scores or probabilities for each word (or discrete token) in the dictionary.
But this approach doesn’t work for high-dimensional continuous modalities, such as video.
To represent such data, it is necessary to eliminate irrelevant information about the variable
to be modeled through an encoder, as in the JEPA. Furthermore, the high-dimensionality
of the signal precludes the representation of uncertainty through a normalized distribution.
Second, current models are only capable of very limited forms of reasoning. The absence
of abstract latent variables in these models precludes the exploration of multiple interpre-
tations of a percept and the search for optimal courses of action to achieve a goal. In fact,
dynamically specifying a goal in such models is essentially impossible.
8.3.2 Reward is not enough


In [13]:
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=retriever
)

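# invoke() replaces the deprecated run(); the answer is stored under the "result" key
answer = qa.invoke({"query": "Summarize the key idea of Chain-of-Thought prompting"})["result"]
print(answer)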




 Chain-of-Thought prompting is a method that improves the ability of large language models to perform complex reasoning by providing a series of intermediate reasoning steps, called a "chain of thought," as exemplars in prompting. This approach has been shown to outperform standard prompting on various reasoning tasks, and it allows language models to decompose multi-step problems into smaller, more manageable steps.
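
The `chain_type="stuff"` setting means that all retrieved chunks are placed ("stuffed") into a single prompt together with the question. A rough sketch of that assembly step is shown below. The function name `build_stuff_prompt`, the prompt wording, and the sample chunks are assumptions for illustration; LangChain's actual prompt template differs.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up chunks standing in for what the retriever would return
retrieved = [
    "Chain-of-thought prompting adds intermediate reasoning steps.",
    "It improves performance on arithmetic reasoning tasks.",
]
prompt = build_stuff_prompt("What is chain-of-thought prompting?", retrieved)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits in the model's context window.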


### **Conclusion**

In this demo, you built a multi-PDF question-answering retriever with LangChain and FAISS. You loaded and split the documents, embedded them, saved and reloaded a vector database, created a retriever, and used it with a RetrievalQA chain to answer questions. You can apply the same pipeline to your own document collections.