In [14]:
!pip install langchain faiss-cpu sentence-transformers pymupdf langchain-google-genai langchain-community

Collecting pymupdf
  Downloading pymupdf-1.26.1-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.1-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.26.1


In [23]:
from google.colab import files

uploaded = files.upload()


Saving Paper 1.pdf to Paper 1.pdf
Saving Paper 2.pdf to Paper 2.pdf
Saving Paper 3.pdf to Paper 3.pdf


In [27]:
import fitz  # PyMuPDF

def extract_text_from_uploaded_pdfs(uploaded_files):
    all_texts = []
    for filename in uploaded_files.keys():
        with fitz.open(filename) as pdf:
            text = ""
            for page in pdf:
                text += page.get_text() or ""
            all_texts.append((text, filename))  # Store (text, filename)
    return all_texts

documents = extract_text_from_uploaded_pdfs(uploaded)


In [28]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = []

for text, filename in documents:
    split_docs = text_splitter.create_documents([text], metadatas=[{"source": filename}])
    chunks.extend(split_docs)

print(f"✅ Total Chunks Created: {len(chunks)}")
# 👀 Preview the first few chunks with their source filenames

print("🔍 Preview of Chunks and Their Sources:\n")
for i, doc in enumerate(chunks[:5]):  # Show only the first 5 to avoid flooding the output
    print(f"Chunk {i+1} from {doc.metadata['source']}:\n{doc.page_content}...\n{'-'*80}")


✅ Total Chunks Created: 430
🔍 Preview of Chunks and Their Sources:

Chunk 1 from Paper 1.pdf:
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗†
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
bas
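
To make the `chunk_size=1000, chunk_overlap=200` settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification, not the library's algorithm: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks, while this sketch cuts purely by character position. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    Hypothetical helper for illustration only.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return pieces


sample = "x" * 2500
print([len(piece) for piece in split_with_overlap(sample)])  # → [1000, 1000, 900]
```

The overlap means a sentence that straddles a chunk boundary is likely to appear whole in at least one chunk, at the cost of storing some text twice.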

In [29]:
from langchain_community.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")


In [30]:
from langchain_community.vectorstores import FAISS

# Build the vector DB from chunks
vectorstore = FAISS.from_documents(chunks, embedding_model)

# Save the vectorstore if you want (optional)
# vectorstore.save_local("my_vectorstore")
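
Conceptually, the vector store answers a query by embedding it and returning the chunks whose vectors lie closest to the query vector. The sketch below shows that idea as a brute-force search in plain Python. It assumes Euclidean (L2) distance, which is, as far as I know, what the LangChain FAISS wrapper uses by default; the helper names `l2_distance` and `top_k` and the 2-dimensional toy vectors are hypothetical (the real embeddings from `all-MiniLM-L6-v2` have 384 dimensions).

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, doc_vecs, k=4):
    """Return (index, distance) pairs for the k vectors nearest to query_vec.

    Brute force: every stored vector is compared against the query.
    """
    scored = [(idx, l2_distance(query_vec, vec)) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy stand-ins for chunk embeddings.
toy_vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
nearest = top_k([1.0, 0.0], toy_vectors, k=2)
print([idx for idx, _ in nearest])  # → [1, 2]
```

A dedicated index library does the same job with optimized native code, which matters once the number of chunks grows well beyond the 430 in this notebook.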


In [31]:
import os
from getpass import getpass
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains import RetrievalQA

# Set your Gemini API key without hardcoding it in the notebook
if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Gemini API key: ")

# Initialize the Gemini 2.0 Flash model; max_output_tokens leaves room for paragraph-length answers
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0.7,
    convert_system_message_to_human=True,
    max_output_tokens=1024  # or higher if needed
)

# Create RAG QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

# Ask your question
question = "What are RAGs and what are the components of RAG?"
result = qa_chain.invoke({"query": question})

# Show the paragraph-style answer
print("✅ Answer:\n", result["result"])

# Show only the PDF source names
print("\n📄 Sources used:")
source_names = {doc.metadata["source"] for doc in result["source_documents"]}
for source in sorted(source_names):  # sort for a stable display order
    print("-", source)




✅ Answer:
 RAGs (Retrieval-Augmented Generation models) are a type of model that combines a retrieval component with a generator.

The components of RAG are:

1.  **Retriever:** This component (pη(z|x)) is based on DPR (Dense Passage Retrieval). It retrieves the top K documents relevant to the input query. DPR uses a bi-encoder architecture with BERTBASE to create dense representations of documents and queries.
2.  **Generator:** The generator produces the final output based on the retrieved documents. The paper mentions using BART as the generator.

There are two variants of RAG discussed: RAG-Sequence and RAG-Token, which differ in how they marginalize over the retrieved documents during generation. RAG-Token may perform best because it can generate responses that combine content from several documents.

📄 Sources used:
- Paper 2.pdf
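
The chain above retrieves a handful of chunks and passes them to the model together with the question. With the default "stuff" strategy, the retrieved texts are concatenated into a single prompt. The sketch below illustrates that assembly step in plain Python. The prompt wording is illustrative rather than LangChain's exact template, and the helper names `build_stuff_prompt` and `unique_sources`, as well as the toy chunks, are hypothetical.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate the retrieved chunk texts and append the question."""
    context = "\n\n".join(chunk["text"] for chunk in retrieved_chunks)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


def unique_sources(retrieved_chunks):
    """List each source filename once, in order of first appearance."""
    return list(dict.fromkeys(chunk["source"] for chunk in retrieved_chunks))


# Toy stand-ins for the retriever's output.
toy_chunks = [
    {"text": "RAG combines a retriever with a generator.", "source": "Paper 2.pdf"},
    {"text": "The retriever returns the top K documents.", "source": "Paper 2.pdf"},
]
prompt = build_stuff_prompt("What are the components of RAG?", toy_chunks)
print(prompt)
print(unique_sources(toy_chunks))  # → ['Paper 2.pdf']
```

This also explains why only one filename appears under "Sources used" even though several chunks were retrieved: all of them came from the same PDF.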
