<a href="https://colab.research.google.com/github/ruchira559/chat-with-pdf/blob/main/research_and_prototyping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# 1. Core LangChain packages; langchain-classic supplies the legacy chains (e.g. RetrievalQA)
!pip install -q -U langchain langchain-community langchain-groq langchain-huggingface langchain-classic

# 2. Specialized Utilities
!pip install -q -U langchain-text-splitters chromadb pypdf sentence-transformers

print("All libraries installed. PLEASE RESTART RUNTIME (Runtime > Restart session).")

✅ Block 1: All libraries installed. PLEASE RESTART RUNTIME (Runtime > Restart session).


In [1]:
import os
from google.colab import userdata

# Provider-specific LangChain packages (import paths current as of 2026)
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# RetrievalQA is a legacy chain; in LangChain 1.0 it lives in the langchain-classic package
from langchain_classic.chains import RetrievalQA

# Groq API key, read from Colab's Secrets panel
groq_api_key = userdata.get('GROQ_API_KEY')

print("Modules imported using 2026-standard paths.")



✅ Block 2: Modules imported using 2026-standard paths.
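
The import cell reads the key through `google.colab.userdata`, which is only available inside Colab. A minimal sketch of a fallback for running the same notebook elsewhere follows; the helper name `load_secret` is hypothetical, and reading the key from an environment variable is an assumption about how it would be supplied outside Colab.

```python
import os


def load_secret(name):
    """Return a secret from Colab's Secrets panel, else from the environment.

    `load_secret` is a hypothetical helper, not part of any library.
    """
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not running in Colab, or the secret is not available to this notebook.
        pass
    # Assumption: outside Colab the key is exported as an environment variable.
    return os.environ.get(name)


groq_api_key = load_secret("GROQ_API_KEY")
```

If neither source provides the key, `load_secret` returns `None`, so it is worth checking the value before constructing the LLM client.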


In [2]:
# 1. Load and Chunk the PDF
loader = PyPDFLoader("data.pdf")
pages = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ".", " ", ""]
)
chunks = text_splitter.split_documents(pages)

In [7]:
# Verification Print
print(f"Document loaded: {len(pages)} pages found.")
print(f"Document split into {len(chunks)} smaller chunks.")
print("-" * 30)
print(f"Sample from Chunk #1:\n{chunks[0].page_content[:200]}...")

Document loaded: 12 pages found.
Document split into 57 smaller chunks.
------------------------------
Sample from Chunk #1:
Vol.:(0123456789)
The International Journal of Life Cycle Assessment 
https://doi.org/10.1007/s11367-024-02405-8
DATA AVAILABILITY , DATA QUALITY
Testing the use of a large language model (LLM) for pe...
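
To make the effect of `chunk_size=1000` and `chunk_overlap=150` easier to see, here is a simplified sketch of overlapping chunking. It is a plain fixed-width sliding window, not the separator-aware logic of `RecursiveCharacterTextSplitter`, which prefers to cut at paragraph, line, and sentence boundaries. The function name `sliding_chunks` and the synthetic sample text are made up for illustration.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=150):
    """Split text into fixed-width windows that share `chunk_overlap` characters.

    Simplified stand-in for illustration; it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 2500-character sample (hypothetical data, not the PDF).
sample = "".join(str(i % 10) for i in range(2500))
parts = sliding_chunks(sample)
print([len(p) for p in parts])
print(parts[0][-150:] == parts[1][:150])  # neighbouring chunks share 150 characters
```

The overlap means a sentence that straddles a chunk boundary is still likely to appear whole in at least one chunk, at the cost of some duplicated text in the index.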


In [3]:
# 2. Embed each chunk and index the vectors in an in-memory Chroma collection
embed_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=embed_model,
    collection_name="pdf_knowledge_base"
)
print("Vector Database created successfully!")
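
Conceptually, the vector store answers a query by embedding it and returning the chunks whose vectors lie closest to the query vector. The sketch below shows that idea with hand-written 3-dimensional toy vectors and cosine similarity. The vectors, chunk names, and helper functions are all hypothetical; real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k):
    """Return the names of the k stored vectors most similar to query_vec."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy "index": chunk name -> embedding (made-up values for illustration).
toy_store = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.7, 0.7, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
nearest = top_k([0.9, 0.1, 0.0], toy_store, k=2)
print(nearest)
```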

In [4]:
# 3. Set up the LLM (Llama 3.1 8B served by Groq)
llm = ChatGroq(
    groq_api_key=groq_api_key,
    model_name="llama-3.1-8b-instant",
    temperature=0  # minimize sampling randomness so fact checks are repeatable
)

In [5]:
# 4. Final Chain Assembly
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
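    chain_type="stuff",  # place all retrieved chunks into a single prompt
    retriever=vector_db.as_retriever(search_kwargs={"k": 7}),  # fetch the 7 nearest chunks
    return_source_documents=True  # keep the retrieved chunks in the response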
)

print(f"Vector Store ready with {len(chunks)} chunks.")

✅ Block 3: Logic Complete. Vector Store ready with 57 chunks.
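
The `"stuff"` chain type can be pictured as simple prompt assembly: the retrieved chunks are joined into one context block and sent to the LLM together with the question. The sketch below illustrates the idea only. The template wording and the function name `build_stuff_prompt` are hypothetical and are not the prompt that `RetrievalQA` uses internally.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Join all retrieved chunks into one context block ahead of the question.

    Hypothetical template for illustration; RetrievalQA ships its own prompt.
    """
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunk texts standing in for retriever results.
example_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
prompt = build_stuff_prompt("What does the document say?", example_chunks)
print(prompt)
```

With `k=7` and `chunk_size=1000`, the context block can reach roughly 7,000 characters, which has to fit in the model's context window along with the question and the answer.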


In [9]:
def check_fact(query):
    """Run one question through the QA chain and print the model's answer."""
    print(f"\n Querying PDF: {query}")
    response = qa_chain.invoke({"query": query})
    print(f" Answer: {response['result']}")

# Test cases based on the MacMaster & Sinistore (2024) paper
verification_tests = [
    "What was the success rate for technology coverage according to the results?",
    "Does the study conclude that LLMs can reduce practitioner bias?",
    "What were the success rates for temporal and geographic coverage?"
]

for test in verification_tests:
    check_fact(test)


 Querying PDF: What was the success rate for technology coverage according to the results?
 Answer: According to the results, the initial technology coverage test had a success rate of 73%. However, when the test was repeated for two scenarios where contextual clues in the prompt were obscured, the model had 100% success in reasoning and scoring.

 Querying PDF: Does the study conclude that LLMs can reduce practitioner bias?
 Answer: Yes, the study concludes that LLMs can reduce practitioner bias. According to the text, outsourcing DQA to artificial intelligence (A.I.) can "eliminate practitioner's biases" and reduce liability for practitioners.

 Querying PDF: What were the success rates for temporal and geographic coverage?
 Answer: According to the text, the success rates for temporal and geographic coverage are as follows:

For temporal coverage, the LLM was successful on the initial attempt, as seen in Table 5. This suggests that the temporal coverage test had a 100% success rate
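
Because the chain was built with `return_source_documents=True`, each response also carries the retrieved chunks under the `source_documents` key, which makes it possible to check an answer against the text it was drawn from. The sketch below formats those sources as one line each. `FakeDocument` and the sample response are stand-ins invented so the snippet runs without the chain; with the real chain, the same function could be given the dictionary returned by `qa_chain.invoke(...)`.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in exposing the two attributes used here: page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_sources(response, preview_chars=80):
    """Return one line per source document: index, page number, text preview."""
    lines = []
    for i, doc in enumerate(response.get("source_documents", []), start=1):
        page = doc.metadata.get("page", "?")
        # Collapse whitespace so the preview fits on one line.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"[{i}] page {page}: {preview}")
    return lines


# Hypothetical response shaped like the chain's output (made-up content).
fake_response = {
    "result": "placeholder answer",
    "source_documents": [
        FakeDocument("First   retrieved\nchunk text.", {"page": 3}),
        FakeDocument("Second chunk.", {}),
    ],
}
source_lines = summarize_sources(fake_response)
for line in source_lines:
    print(line)
```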