# 📄 Session 3: Document Loaders in LangChain v0.3 (with Groq)

**Objective:**  
Learn how to load and inspect documents (PDFs) in LangChain using `PyPDFLoader`.  

**Why This Matters:**  
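- Loading is the first step in a RAG pipeline: it converts unstructured files (PDFs) into `Document` objects that hold text plus metadata.  
- Metadata such as the source path, page number, and author supports filtering and contextual retrieval later.  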


## ✅ Step 1: Install Required Libraries
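We’ll use:
- **LangChain v0.3** (`langchain`)  
- **LangChain Community** (`langchain_community`, which provides `PyPDFLoader`)  
- **LangChain Groq** (`langchain-groq`, used in Step 5)  
- **pypdf** (PDF parsing)  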


In [1]:
!pip install -q langchain==0.3.27 langchain-groq==0.3.8 pypdf==6.1.2 langchain_community==0.3.31



In [2]:
!pip show langchain langchain-groq pypdf langchain_community

Name: langchain
Version: 0.3.27
Summary: Building applications with LLMs through composability
Home-page: 
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.12/dist-packages
Requires: langchain-core, langchain-text-splitters, langsmith, pydantic, PyYAML, requests, SQLAlchemy
Required-by: langchain-community
---
Name: langchain-groq
Version: 0.3.8
Summary: An integration package connecting Groq and LangChain
Home-page: 
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.12/dist-packages
Requires: groq, langchain-core
Required-by: 
---
Name: pypdf
Version: 6.1.2
Summary: A pure-python PDF library capable of splitting, merging, cropping, and transforming PDF files
Home-page: 
Author: 
Author-email: Mathieu Fenniak <biziqe@mathieu.fenniak.net>
License: 
Location: /usr/local/lib/python3.12/dist-packages
Requires: 
Required-by: 
---
Name: langchain-community
Version: 0.3.31
Summary: Community contributed LangChain integrations.
Home-page: 
Author: 
Auth
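
The `pip show` check above can also be done from Python. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper names `installed_version` and `check_pins` are hypothetical, and the expected versions are simply the pins from the install cell.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell in this notebook.
EXPECTED = {
    "langchain": "0.3.27",
    "langchain-groq": "0.3.8",
    "pypdf": "6.1.2",
    "langchain-community": "0.3.31",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_pins(expected):
    """Map each package name to (expected, installed, matches)."""
    report = {}
    for name, wanted in expected.items():
        found = installed_version(name)
        report[name] = (wanted, found, found == wanted)
    return report


for name, (wanted, found, ok) in check_pins(EXPECTED).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: expected {wanted}, found {found} [{status}]")
```

A mismatch here usually means the runtime was not restarted after installing, or that another package pulled in a different version.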

## ✅ Step 2: Setup Groq API Key
We’ll again use Colab **Secrets Manager**.


In [3]:
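from google.colab import userdata

# Load the Groq API key from Colab Secrets.
# userdata.get raises if the secret is missing or notebook access is not granted,
# so catch those cases instead of relying only on a falsy return value.
try:
    GROQ_API_KEY = userdata.get('GROQ_API_KEY')
except (userdata.SecretNotFoundError, userdata.NotebookAccessError):
    GROQ_API_KEY = None

if GROQ_API_KEY:
    print("✅ Groq API key retrieved!")
else:
    print("❌ Please add GROQ_API_KEY in Colab Secrets.")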


✅ Groq API key retrieved!
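
Outside Colab, `google.colab.userdata` is not available. A common alternative is to read the key from an environment variable. The sketch below assumes the variable is also named `GROQ_API_KEY`; the helper `get_groq_api_key` is hypothetical and not part of any library.

```python
import os


def get_groq_api_key(env=None):
    """Return the Groq API key from an environment mapping, or None if unset or blank."""
    env = os.environ if env is None else env
    key = env.get("GROQ_API_KEY", "").strip()
    return key or None


# Stored under a separate name so the key loaded from Colab Secrets is not overwritten.
GROQ_API_KEY_FROM_ENV = get_groq_api_key()

if GROQ_API_KEY_FROM_ENV:
    print("Key found in environment.")
else:
    print("GROQ_API_KEY is not set in the environment.")
```

Passing a plain dictionary as `env` makes the helper easy to check without touching the real environment.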


## ✅ Step 3: Load a PDF with PyPDFLoader
We’ll use `PyPDFLoader` from LangChain community loaders.  
Make sure to upload a sample PDF (like `scholarship_info.pdf`) to Colab first.  


In [5]:
from langchain_community.document_loaders import PyPDFLoader

# Replace with your uploaded file path in Colab
pdf_path = "/content/scholarship_info.pdf"

loader = PyPDFLoader(pdf_path)
docs = loader.load()

print(f"✅ Loaded {len(docs)} documents (pages).")


✅ Loaded 1 documents (pages).
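
If the upload step was skipped or the path is wrong, the loader fails with an error that can be hard to read. A quick pre-check can help. This is a minimal sketch using only the standard library; `looks_like_pdf` is a hypothetical helper, and it assumes the file begins with the `%PDF-` header, which is true for most PDFs but not guaranteed for files with leading junk bytes.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if path is an existing file that starts with the PDF header bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(5) == b"%PDF-"


pdf_path = "/content/scholarship_info.pdf"

if looks_like_pdf(pdf_path):
    print("File looks like a PDF; it can be passed to PyPDFLoader.")
else:
    print(f"Missing or not a PDF: {pdf_path}")
```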


## ✅ Step 4: Inspect Loaded Documents
Let’s look at the **first page** text and metadata.


In [6]:
print("Sample Document Object:\n")
print(docs[0])

print("\nMetadata of first page:\n", docs[0].metadata)
print("\nText Content of first page:\n", docs[0].page_content[:500])


Sample Document Object:

page_content='Title: Scholarship Information 2025 
 
1. Eligibility: 
- Open to students in India pursuing undergraduate degrees. 
- Annual family income must be below ₹6,00,000. 
- Minimum 60% marks in the last qualifying exam. 
 
2. Documents Required: 
- Income certificate 
- Aadhaar card 
- Bank passbook 
- Marksheet 
 
3. Deadline: October 15, 2025 
 
4. Benefits: 
- ₹10,000 per semester for tuition 
- Book allowance of ₹3,000 per year 
 
5. How to Apply: 
Visit https://scholarships.gov.in and register under the NSP portal.' metadata={'producer': 'Microsoft® Word 2019', 'creator': 'Microsoft® Word 2019', 'creationdate': '2025-08-07T19:16:29+05:30', 'author': 'Preethesh Poojary', 'moddate': '2025-08-07T19:16:29+05:30', 'source': '/content/scholarship_info.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}

Metadata of first page:
 {'producer': 'Microsoft® Word 2019', 'creator': 'Microsoft® Word 2019', 'creationdate': '2025-08-07T19:16:29+05:30', 'author'
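
The metadata is a plain dictionary, so it can be turned into something readable with ordinary Python. The sketch below copies a subset of the values from the output above into a literal dictionary so that it runs on its own; in the notebook you would pass `docs[0].metadata` instead. The helper `describe_page` is hypothetical. Note that `page` is a zero-based index, while `page_label` is the label printed in the PDF.

```python
from datetime import datetime

# Subset of the metadata shown in the cell output above.
metadata = {
    "creationdate": "2025-08-07T19:16:29+05:30",
    "source": "/content/scholarship_info.pdf",
    "total_pages": 1,
    "page": 0,
    "page_label": "1",
}


def describe_page(meta):
    """Build a short human-readable citation from loader metadata."""
    created = datetime.fromisoformat(meta["creationdate"])
    page_number = meta["page"] + 1  # convert the zero-based index to a 1-based number
    return (
        f"{meta['source']} "
        f"(page {page_number} of {meta['total_pages']}, "
        f"created {created.date().isoformat()})"
    )


print(describe_page(metadata))
```

A string like this is handy later for showing the user where a retrieved passage came from.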

## ✅ Step 5: Use Groq LLM to Summarize Document
Now let’s ask a Groq-hosted model (`openai/gpt-oss-20b` in the code below) to summarize the first page.  


In [7]:
from langchain_groq import ChatGroq

llm = ChatGroq(
    model="openai/gpt-oss-20b",
    api_key=GROQ_API_KEY,
    temperature=0.3,
    max_tokens=200
)

sample_text = docs[0].page_content

summary = llm.invoke(f"Summarize this text in 3 bullet points:\n\n{sample_text}")
print("Groq LLM Summary:\n", summary.content)


Groq LLM Summary:
 - **Eligibility & Requirements**: Open to Indian undergraduates with family income < ₹6,00,000 and ≥ 60 % marks; must submit income certificate, Aadhaar, bank passbook, and marksheet.  
- **Benefits**: ₹10,000 per semester toward tuition plus a ₹3,000 annual book allowance.  
- **Application**: Deadline 15 Oct 2025; apply via https://scholarships.gov.in by registering on the NSP portal.
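
The sample page is short, but longer pages can exceed what you want to send in a single request. One simple option is to trim the text before building the prompt. This is a minimal sketch; `build_summary_prompt` is a hypothetical helper, and the 4000-character default is an arbitrary assumption, not a Groq limit.

```python
def build_summary_prompt(text, max_chars=4000, bullets=3):
    """Build a summarization prompt, trimming the input text to max_chars characters."""
    trimmed = text.strip()
    if len(trimmed) > max_chars:
        # Cut at the character budget and mark the cut so the model knows text is missing.
        trimmed = trimmed[:max_chars].rstrip() + " [truncated]"
    return f"Summarize this text in {bullets} bullet points:\n\n{trimmed}"


# Small demonstration with a deliberately tiny budget.
print(build_summary_prompt("Eligibility: open to undergraduate students.", max_chars=12))
```

In the notebook, the result could be passed straight to the model, for example `llm.invoke(build_summary_prompt(docs[0].page_content))`. Character-based trimming is crude; the next notebook on text splitting covers more careful approaches.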


## 📝 Exercise
1. Load a different PDF (e.g., your course notes) and extract metadata.  
2. Modify the summarization prompt to generate:
   - A **title** for the document.  
   - A **FAQ-style summary**.  
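3. Compare results across models. The IDs below were suggested when this notebook was written and may since have been retired, so check Groq’s current model list before using them:
   - `llama3-8b-8192`  
   - `mixtral-8x7b-32768`  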
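
As a starting point for exercise 2, the prompt variants can be kept in one place and selected by name. This is only a sketch: the prompt wording, the `PROMPTS` dictionary, and the `build_prompt` helper are all illustrative assumptions, so adjust them freely.

```python
# Illustrative prompt templates; the wording is an assumption, not a recommended standard.
PROMPTS = {
    "title": (
        "Suggest one concise title for the following document. "
        "Reply with the title only.\n\n{text}"
    ),
    "faq": (
        "Write an FAQ-style summary of the following document "
        "as 3 to 5 question-and-answer pairs.\n\n{text}"
    ),
}


def build_prompt(kind, text):
    """Fill the named template with the document text."""
    try:
        template = PROMPTS[kind]
    except KeyError:
        raise ValueError(f"Unknown prompt kind: {kind!r}") from None
    return template.format(text=text)


print(build_prompt("title", "Scholarship Information 2025"))
```

Each prompt can then be sent with `llm.invoke(...)` exactly as in Step 5, once per model you want to compare.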


## 🎯 Summary
- Learned how to load PDFs using `PyPDFLoader`.  
- Explored document objects (content + metadata).  
- Used Groq LLM to summarize loaded text.  

**Next Notebook → Text Splitting & Preprocessing**  
