# **Progressive LM - PDF to Web Search RAG System**

This notebook implements a progressive language model system that:
1. Searches a PDF document first for answers
2. Falls back to web search if information is not found in the PDF
3. Uses free-tier APIs (Google Gemini, Groq, Tavily)

**Cost**: $0, as long as your usage stays within each provider's free-tier limits. A good fit for students.

---

## **Setup: Install Dependencies**

In [None]:
!pip install langchain langchain-community langchain-google-genai langchain-groq langchain-core langchain-tavily langchain-text-splitters faiss-cpu pymupdf -q

---

## **Configure API Keys**

In [None]:
import os
from getpass import getpass

# Free tier API keys - no cost for students
os.environ["GOOGLE_API_KEY"] = getpass("Enter GOOGLE GEMINI API KEY: ")
os.environ["GROQ_API_KEY"] = getpass("Enter GROQ API KEY: ")
os.environ["TAVILY_API_KEY"] = getpass("Enter TAVILY API KEY: ")
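
Re-running the cell above prompts for all three keys again. The sketch below shows one way to prompt only when a key is missing. The names `ensure_key` and `prompt_fn` are hypothetical helpers introduced for illustration and are not part of any library.

```python
import os
from getpass import getpass

def ensure_key(name, prompt_fn=getpass):
    """Return the key stored in os.environ[name], prompting only if it is missing.

    `ensure_key` and `prompt_fn` are illustrative names, not a library API.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        # Nothing stored yet: ask the user and strip stray whitespace from pasting
        value = prompt_fn(f"Enter {name}: ").strip()
        if not value:
            raise ValueError(f"{name} is required")
        os.environ[name] = value
    return value
```

To use it, call `ensure_key("GOOGLE_API_KEY")`, `ensure_key("GROQ_API_KEY")`, and `ensure_key("TAVILY_API_KEY")` in place of the three assignments.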

---

## **Upload PDF File**

In [None]:
from google.colab import files
import tempfile
import os

# Upload the PDF file through the Colab file picker
uploaded = files.upload()
file_name = list(uploaded.keys())[0]

# Write the uploaded bytes to a temporary location and keep the path
file_path = os.path.join(tempfile.gettempdir(), file_name)
with open(file_path, "wb") as f:
    f.write(uploaded[file_name])

---

## **Initialize Models and Tools**

In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_groq import ChatGroq
from langchain_tavily import TavilySearch
from langchain_text_splitters import RecursiveCharacterTextSplitter

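# Free-tier models: Gemini for embeddings, Groq for the LLM
# Model names change over time; if one is rejected, pick a currently listed model from the provider's docs.
embedding_model = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
# temperature=0 keeps the sufficiency check and the answers as repeatable as possible
llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)
# Web search fallback: return at most 3 general-topic results
tool = TavilySearch(max_results=3, topic="general")
# Split pages into chunks of up to 1000 characters, with up to 200 characters shared between neighbours
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)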
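
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking. It uses a fixed-width sliding window, which is an assumption made for illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, so its real chunks will differ. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows where neighbours share chunk_overlap characters.

    Illustration only: the real splitter breaks on separators, not at fixed offsets.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each new window advances
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.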

---

## **Define Helper Functions**

In [None]:
def format_docs(docs):
    """Join the retrieved chunks into one context string, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)
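
A quick way to see what the helper produces, using stand-in objects instead of real LangChain documents. The `SimpleNamespace` stubs are an assumption for illustration; the only attribute the helper relies on is `page_content`.

```python
from types import SimpleNamespace

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Stand-ins for retrieved documents; only `page_content` is needed here
docs = [
    SimpleNamespace(page_content="First chunk."),
    SimpleNamespace(page_content="Second chunk."),
]
context = format_docs(docs)
print(context)
```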

---

## **Create Prompts**

In [None]:
from langchain_core.prompts import ChatPromptTemplate

answer_determination_prompt = ChatPromptTemplate.from_template("""
You are an AI assistant tasked with determining if the provided context from a PDF contains sufficient information to answer a user's question.

Context from PDF: {context}

User Question: {question}

First, carefully analyze if the context provides adequate information to answer the question.

If the context contains sufficient information to answer the question, respond with a complete and accurate answer based ONLY on the provided context.

If the context does NOT contain sufficient information to fully answer the question, respond with exactly: "[NEED_WEB_SEARCH]"

Your response:
""")

web_search_prompt = ChatPromptTemplate.from_template("""
You are an AI assistant helping a user with their question.

User Question: {question}

Web Search Results: {web_results}

Using the web search results, provide a comprehensive and accurate answer to the user's question.
Make sure to cite sources from the search results where appropriate.
""")

---

## **Process PDF and Create Vector Store**

In [None]:
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS

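# --- PDF Processing Function ---
def process_pdf(file_path):
    """Load a PDF, split it into chunks, and index the chunks in FAISS."""
    loader = PyMuPDFLoader(file_path)
    docs = loader.load()
    chunks = text_splitter.split_documents(docs)
    # Embed every chunk and build an in-memory FAISS index
    vector_store = FAISS.from_documents(chunks, embedding_model)
    return vector_store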

In [None]:
vector_store = process_pdf(file_path)

---

## **Create Retriever**

In [None]:
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
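
Conceptually, a retriever with `k=3` returns the three chunks whose embeddings lie closest to the question's embedding. The sketch below is a toy nearest-neighbour search over made-up 2-D vectors using Euclidean distance. The vectors, the distance choice, and the name `top_k` are assumptions for illustration; real embeddings have hundreds of dimensions or more, and FAISS uses optimized index structures.

```python
import math

def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k vectors closest to query_vec (Euclidean distance)."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: math.dist(query_vec, chunk_vecs[i]),
    )
    return ranked[:k]

# Made-up 2-D "embeddings" for five chunks
vectors = [(0.0, 0.0), (1.0, 1.0), (0.9, 1.2), (5.0, 5.0), (1.5, 1.0)]
nearest = top_k((1.0, 1.0), vectors, k=3)
print(nearest)  # [1, 2, 4]
```

A larger `k` gives the model more context to work with but also more irrelevant text, which can make the sufficiency check less reliable.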

---

## **Build Processing Chains**

In [None]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

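# PDF path: retrieve chunks, then ask the LLM to answer or emit the fallback sentinel
determination_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough(),
    }
    | answer_determination_prompt
    | llm
    | StrOutputParser()
)

# Web path: run a Tavily search on the question, then ask the LLM to answer from the results
web_search_chain = (
    {
        "question": RunnablePassthrough(),
        "web_results": lambda x: tool.invoke({"query": x}),
    }
    | web_search_prompt
    | llm
    | StrOutputParser()
)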
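
For readers new to the `|` syntax, the data flow of these chains can be approximated in plain Python: the leading dict passes the same input to every branch and collects the results, and each later step receives the previous step's output. This is a simplified analogy, not how LangChain implements it, and `fan_out` and `pipe` are hypothetical names.

```python
def fan_out(branches):
    """Return a step that feeds the same input to every branch and collects a dict."""
    def step(value):
        return {key: branch(value) for key, branch in branches.items()}
    return step

def pipe(*steps):
    """Return a function that runs the steps left to right, like `a | b | c`."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run

toy_chain = pipe(
    fan_out({
        "context": lambda q: f"(chunks retrieved for: {q})",  # stands in for retriever | format_docs
        "question": lambda q: q,                              # stands in for RunnablePassthrough()
    }),
    # Stands in for the prompt template: fill the placeholders from the dict
    lambda inputs: f"Context: {inputs['context']}\nQuestion: {inputs['question']}",
)
prompt_text = toy_chain("What is FAISS?")
print(prompt_text)
```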

---

## **Define Agent Function**

In [None]:
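def agent(question):
    """Answer from the PDF when possible; otherwise fall back to web search."""
    pdf_response = determination_chain.invoke(question)
    # The determination prompt tells the model to emit this sentinel when the PDF context is insufficient
    if "[NEED_WEB_SEARCH]" in pdf_response:
        print("\nℹ️ Info not found in PDF. Searching the web...")
        return web_search_chain.invoke(question)
    return pdf_response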
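
The routing logic can be exercised without any API keys by swapping the two chains for stub functions. The sketch below does that. The stubs `fake_pdf` and `fake_web` and the function `route` are hypothetical and exist only to show the sentinel-based fallback.

```python
SENTINEL = "[NEED_WEB_SEARCH]"

def route(question, answer_from_pdf, answer_from_web):
    """Try the PDF answerer first; fall back to the web answerer on the sentinel."""
    pdf_response = answer_from_pdf(question)
    if SENTINEL in pdf_response:
        return "web", answer_from_web(question)
    return "pdf", pdf_response

def fake_pdf(question):
    # Pretend the PDF only covers questions that mention "capital"
    return "Paris" if "capital" in question else SENTINEL

def fake_web(question):
    return f"web answer for: {question}"

print(route("What is the capital of France?", fake_pdf, fake_web))
print(route("What is the weather today?", fake_pdf, fake_web))
```

Because the check is a substring match, a response that wraps the sentinel in extra text still triggers the fallback.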

---

## **Interactive Chat Loop**

Ask questions about your PDF! The system will:
1. First search the PDF for answers
2. If not found, search the web

Type 'exit' to quit.

In [None]:
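# Ask questions interactively
while True:
    query = input("\nAsk a question about your PDF (or type 'exit'): ").strip()
    if query.lower() == "exit":
        break
    if not query:
        continue  # ignore empty input instead of sending a blank question to the model
    answer = agent(query)
    print("\n✉️ Answer:", answer)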