### Objective

Set up a clean, reproducible development environment for building a LangChain-based QA app that can:

- Read PDF files

- Break them into chunks

- Create embeddings

- Retrieve relevant context using FAISS

- Use OpenAI's GPT-4.1 to generate answers


## Step 1: Project Setup

### 1.1: Create Your Project Folder and Open It in VS Code
Why? Keeping everything in one folder ensures modularity, version control, and easier sharing.

- Open your terminal or file explorer

- Create the folder, either through the file explorer or from the terminal:

``mkdir rag_bot``

``cd rag_bot``

- Open this folder in VS Code. If you have the VS Code CLI set up, run:

``code .``

- Otherwise, launch VS Code and open the rag_bot folder you created from there.

### 1.2: Create a Virtual Python Environment
Why? Virtual environments isolate your project’s dependencies so they don’t interfere with other Python projects on your machine.

Windows: 

``python -m venv rag_bot_env``

``rag_bot_env\Scripts\activate``

Mac: 

``python3 -m venv rag_bot_env``

``source rag_bot_env/bin/activate``

Once activated, your terminal should show (rag_bot_env) before the prompt.
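
If the prompt prefix is hard to spot, the activation can also be confirmed from Python itself. The following is a minimal sketch; the helper name `in_virtualenv` is made up for illustration and relies only on the standard library.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
print("Interpreter path:", sys.executable)
```

When the environment is active, the interpreter path should point inside the environment folder you created.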

### 1.3: Install Required Libraries

Install the LangChain packages (v0.3 or later) along with the supporting libraries.

``pip install python-dotenv langchain-community langchain-openai pypdf faiss-cpu streamlit``

Install any other libraries as needed.

### 1.4: Generate requirements.txt

Why? Captures your current environment so anyone else can recreate it exactly.

``pip freeze > requirements.txt``

This command creates a new requirements.txt file in your current project directory, listing the libraries installed above along with their dependencies.

Alternatively, either copy the whole requirements.txt from my folder, or create a requirements.txt file in your project directory, paste in the package names above, and then run:

``pip install -r requirements.txt``
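
To check that the installation worked, one option is to look up each package's installed version from Python. This is a rough sketch using the standard library's `importlib.metadata`; the `REQUIRED` list and the `installed_versions` helper are illustrative names, and the list simply repeats the packages from step 1.3.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names from the pip install command in step 1.3.
REQUIRED = [
    "python-dotenv",
    "langchain-community",
    "langchain-openai",
    "pypdf",
    "faiss-cpu",
    "streamlit",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:<22} {ver or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED was probably installed into a different environment, so it is worth checking that the virtual environment is active.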


### 1.5: OPENAI_API_KEY

Create a .env file in your project directory with your OpenAI API key:

``OPENAI_API_KEY=your_openai_api_key_here``

### Step 2: Import Required Libraries

Explanation:
These imports bring in the modules needed for loading PDFs, splitting text, creating embeddings, interacting with OpenAI's GPT-4.1, and constructing the LangChain Expression Language (LCEL) pipeline.

In [1]:
import os
from dotenv import load_dotenv

# LangChain components
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS

from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableMap, RunnablePassthrough


### Step 3: Load OpenAI API Key
Explanation:
This step securely loads your OpenAI API key from the .env file, avoiding hardcoding sensitive information.

In [2]:
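# Load environment variables from .env file
load_dotenv()

# Fail early with a clear message if the key was not found
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set. Add it to your .env file.")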
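
To confirm which key was picked up without printing it in full, a small masking helper can be used. This is only a sketch: `mask_secret` is a hypothetical helper and not part of python-dotenv or LangChain.

```python
import os


def mask_secret(value: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters of a secret."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


key = os.environ.get("OPENAI_API_KEY")
if key:
    print("OPENAI_API_KEY loaded:", mask_secret(key))
else:
    print("OPENAI_API_KEY is missing; check the .env file in the project directory.")
```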

### Step 4: Load and Preview PDF Document

Explanation:
This step uses PyPDFLoader to read the PDF and loads each page as a separate document, allowing for easier processing.

In [4]:
# Specify the path to your PDF file
data = r"C:\Users\u116503.GLOBAL\OneDrive - Bio-Rad Laboratories Inc\Documents\PycharmProjects\streamlit-pdf-qa-main\data"  # Replace with your actual data path
pdf_path = os.path.join(data, "IntroToUSEconomyHousingMarket.pdf")  # Replace with your actual PDF file

# Load the PDF
loader = PyPDFLoader(pdf_path)
documents = loader.load()

# Preview the number of pages and content of the first page
print(f"✅ Loaded {len(documents)} pages from the PDF.")
print(f"\n🔹 Sample Page Content:\n{documents[0].page_content[:500]}...")


✅ Loaded 3 pages from the PDF.

🔹 Sample Page Content:
https://crsreports.congress.gov 
 
Updated January 3, 2023
Introduction to U.S. Economy: Housing Market
The Housing Market  
Real estate and the housing market play an important role in 
the U.S. economy. At the individual level, roughly 65% of 
occupied housing units are owner occupied, homes are 
often a substantial source of household wealth in the United 
States, and housing construction provides widespread 
employment. At the aggregate level, housing accounts for a 
significant portion of a...


### Step 5: Split Document into Chunks

Explanation:
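Splitting the document into chunks keeps each piece of text small enough to fit comfortably within the language model's context limit, and the overlap between adjacent chunks preserves context across chunk boundaries. Note that chunk_size and chunk_overlap are measured in characters, not tokens.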

In [5]:
# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # Maximum number of characters per chunk
    chunk_overlap=200    # Overlap between chunks to maintain context
)

# Split the documents into chunks
splits = text_splitter.split_documents(documents)

# Preview the number of chunks and content of the first chunk
print(f"\n✅ Created {len(splits)} text chunks.")
print(f"\n🔹 Sample Chunk:\n{splits[0].page_content[:500]}...")


✅ Created 13 text chunks.

🔹 Sample Chunk:
https://crsreports.congress.gov 
 
Updated January 3, 2023
Introduction to U.S. Economy: Housing Market
The Housing Market  
Real estate and the housing market play an important role in 
the U.S. economy. At the individual level, roughly 65% of 
occupied housing units are owner occupied, homes are 
often a substantial source of household wealth in the United 
States, and housing construction provides widespread 
employment. At the aggregate level, housing accounts for a 
significant portion of a...
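
To see what chunk size and overlap mean in practice, here is a deliberately simplified sketch that cuts a string into fixed-width windows. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break at separators such as paragraph and line breaks); `sliding_chunks` is a made-up helper that only illustrates the overlap idea.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

Each printed chunk starts with the last 4 characters of the previous one, which is the same role the 200-character overlap plays in the cell above.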


### Step 6: Create Vector Store with OpenAI Embeddings
Explanation:
This step converts the text chunks into vector embeddings using OpenAI's text-embedding-3-large model and stores them in a FAISS vector store for efficient similarity search.


In [6]:
# Initialize OpenAI embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Create a FAISS vector store from the document chunks
vectorstore = FAISS.from_documents(splits, embeddings)

# Create a retriever to fetch relevant documents
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})  # Retrieves top 3 relevant chunks
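
Conceptually, the retriever embeds the question and returns the chunks whose vectors lie closest to it. The toy sketch below shows that idea with made-up 3-dimensional vectors and Euclidean distance. The metric is an assumption for illustration (the actual one depends on how the index is configured), and `top_k` is a hypothetical helper, not a FAISS or LangChain function.

```python
import math


def top_k(query, vectors, k=3):
    """Return the indices of the k vectors closest to query (Euclidean distance)."""
    distances = [(math.dist(query, vec), idx) for idx, vec in enumerate(vectors)]
    distances.sort()
    return [idx for _, idx in distances[:k]]


# Made-up 3-dimensional "embeddings"; real embedding vectors have far more dimensions.
chunk_vectors = [
    [0.9, 0.1, 0.0],  # chunk 0
    [0.0, 1.0, 0.0],  # chunk 1
    [0.8, 0.2, 0.1],  # chunk 2
    [0.0, 0.0, 1.0],  # chunk 3
]
query_vector = [1.0, 0.0, 0.0]

print(top_k(query_vector, chunk_vectors, k=2))
```

Chunks 0 and 2 point in nearly the same direction as the query, so they are the ones returned. A real vector store performs the same kind of lookup, using index structures built to handle many more vectors.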


### Step 7: Define Prompt Template
Explanation:
The prompt instructs GPT-4.1 to answer the question using only the provided context, and to reply "I don't know" when the answer isn't present in that context.

In [7]:
# Define the prompt template
template = """
You are a helpful assistant. Use the following context to answer the user's question.
If the answer is not in the context, say "I don't know".

Context:
{context}

Question:
{question}

Answer:
"""
prompt = PromptTemplate.from_template(template)
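
To preview roughly what the model will receive, the same template text can be filled in with plain `str.format`. This sketch assumes the template uses f-string-style `{placeholders}` (the default for `PromptTemplate.from_template`); the context and question values are invented for the example.

```python
template = """
You are a helpful assistant. Use the following context to answer the user's question.
If the answer is not in the context, say "I don't know".

Context:
{context}

Question:
{question}

Answer:
"""

# Hypothetical values standing in for retrieved text and a user question.
filled = template.format(
    context="Housing accounts for a significant portion of economic activity.",
    question="Why does housing matter to the economy?",
)
print(filled)
```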


### Step 8: Create the LCEL Chain
Explanation:
This pipeline first retrieves relevant context, then formats it with the question using the prompt, and finally generates an answer using GPT-4.1.

In [8]:
# Initialize the ChatOpenAI model
llm = ChatOpenAI(model="gpt-4.1")

# Construct the LCEL pipeline
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
)
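
The `|` operator in the chain above composes steps so that the output of one becomes the input of the next. The toy sketch below mimics that behaviour with a tiny class. It is not LangChain's implementation: `Step` and the three stages are hypothetical stand-ins meant only to show the data flow.

```python
class Step:
    """A toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # a | b builds a new step that runs a, then feeds its result to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Hypothetical stages mirroring retrieve -> prompt -> model.
retrieve = Step(lambda q: {"context": "housing text", "question": q})
build_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda p: f"(answer based on {len(p)} characters of prompt)")

chain = retrieve | build_prompt | fake_llm
print(chain.invoke("What is the main topic?"))
```

The first stage turns the question into a dictionary, the second turns the dictionary into prompt text, and the third turns the prompt into an answer, which is the same shape as the retriever, prompt, and model stages of the real chain.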


### Step 9: Ask a Question and Get an Answer
Explanation:
This step sends the user's question through the pipeline and prints out the model's response.

In [9]:
# Define your question
question = "What is the main topic of this document?"

# Invoke the pipeline with the question
response = rag_chain.invoke(question)

# Display the answer
print("\n💬 GPT-4's Answer:")
print(response.content)



💬 GPT-4's Answer:
The main topic of this document is an introduction to the U.S. economy with a focus on the housing market, including its role in individual wealth, employment, and its broader effects on economic activity.


### Step 10: Display Retrieved Source Chunks

Explanation:
This step shows the specific chunks of the document that were retrieved to answer the question, including page numbers and text snippets, providing transparency and traceability.

In [10]:
# Retrieve the source documents used to answer the question
retrieved_docs = retriever.invoke(question)

# Display the sources
print("\n📚 Sources Used:")
for i, doc in enumerate(retrieved_docs, 1):
    page = doc.metadata.get("page", "N/A")
    snippet = doc.page_content[:300].replace("\n", " ") + "..."
    print(f"\n Source #{i}")
    print(f" Page: {page}")
    print(f" Text Snippet:\n{snippet}")



📚 Sources Used:

 Source #1
 Page: 2
 Text Snippet:
Introduction to U.S. Economy: Housing Market  https://crsreports.congress.gov | IF11327 · VERSION 10 · UPDATED    Lida R. Weinstock, Analyst Macroeconomic Policy    IF11327     Disclaimer  This document was prepared by the Congressional Research Service (CRS). CRS serves as nonpartisan shared staff ...

 Source #2
 Page: 2
 Text Snippet:
reproduced and distributed in its entirety without permission from CRS. However, as a CRS Report may include  copyrighted images or material from a third party, you may need to obtain the permissio n of the copyright holder if you  wish to copy or otherwise use copyrighted material....

 Source #3
 Page: 0
 Text Snippet:
https://crsreports.congress.gov    Updated January 3, 2023 Introduction to U.S. Economy: Housing Market The Housing Market   Real estate and the housing market play an important role in  the U.S. economy. At the individual level, roughly 65% of  occupied housing units are owner occup
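
The page numbers in the output above start at 0, which can be confusing when checking against the PDF. One possible way to present sources in a reader-friendly form is sketched below. `describe_source` is a hypothetical helper, and it assumes the metadata dictionary carries `source` and `page` keys.

```python
def describe_source(metadata: dict) -> str:
    """Build a short citation such as 'file.pdf, page 3' from chunk metadata."""
    source = metadata.get("source", "unknown source")
    page = metadata.get("page")
    if page is None:
        return source
    # The loader output above counts pages from 0, so add 1 for display.
    return f"{source}, page {page + 1}"


# Hypothetical metadata dictionaries shaped like doc.metadata.
examples = [
    {"source": "IntroToUSEconomyHousingMarket.pdf", "page": 0},
    {"source": "IntroToUSEconomyHousingMarket.pdf", "page": 2},
    {},
]
for meta in examples:
    print(describe_source(meta))
```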

In [11]:
# Define your question
question = "Where is the university of Georgia?"

# Invoke the pipeline with the question
response = rag_chain.invoke(question)

# Display the answer
print("\n💬 GPT-4's Answer:")
print(response.content)



💬 GPT-4's Answer:
I don't know.


### Note:
GPT-4.1 receives all 3 retrieved sources as context, not just the first one.
Here's what happens:

- Retriever Stage:

``retriever = vectorstore.as_retriever(search_kwargs={"k": 3})``

This retrieves the top 3 most relevant chunks (based on vector similarity to the question).

- Prompt Stage:

All 3 retrieved chunks are inserted into the prompt template under the {context} variable. Because the chain in Step 8 passes the retriever output to the prompt directly, the chunks are rendered as the string form of the Document list, so the prompt contains each chunk's metadata along with its text.

- LLM Stage:

GPT-4.1 receives that full prompt and is free to use any or all of those 3 chunks to generate the final answer.

It may summarize, rephrase, or even synthesize across the chunks depending on the content and quality of the input.
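
If you would rather control exactly what text ends up in {context}, a common pattern is a small helper that joins only the chunk text. The sketch below uses a stand-in class so it runs without LangChain. `FakeDoc` and `format_docs` are illustrative names, and the wiring shown in the final comment is one possible way to use the helper rather than what the chain above does.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Join the text of each retrieved chunk, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    FakeDoc("First chunk.", {"page": 0}),
    FakeDoc("Second chunk.", {"page": 2}),
]
print(format_docs(docs))

# In the chain, the helper could be piped after the retriever, for example:
#   {"context": retriever | format_docs, "question": RunnablePassthrough()}
```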

### Thank You - Next Steps
- Convert this to deployable, modular code
- Create a Streamlit application
- Deploy it