# ✂️ Session 4: Text Splitting & Preprocessing (Groq + LangChain v0.3)

**Objective:**  
Learn how to split large documents into smaller chunks using `RecursiveCharacterTextSplitter`.  

**Why This Matters:**  
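- LLMs have context length limits (e.g., 8k or 32k tokens), so a long document may not fit in a single prompt.  
- Splitting lets a retriever return only the relevant chunks instead of the whole document, while each chunk stays large enough to preserve context.  
- Overlap between chunks means a sentence cut at one chunk boundary usually still appears whole in the neighboring chunk.  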
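
To make the overlap point concrete, here is a minimal sketch using a plain fixed-width character window. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally; the helper `sliding_chunks` and the sample sentence are made up for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "Deadline: October 15, 2025. Apply online."  # hypothetical sample sentence

no_overlap = sliding_chunks(text, chunk_size=20, chunk_overlap=0)
with_overlap = sliding_chunks(text, chunk_size=20, chunk_overlap=10)

# Does any single chunk contain the full date?
print(any("October 15, 2025" in c for c in no_overlap))
print(any("October 15, 2025" in c for c in with_overlap))
```

Without overlap the date is cut in two at the chunk boundary; with overlap, one of the windows contains it in full.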


## ✅ Step 1: Install Required Libraries
We already installed LangChain v0.3 and the Groq integration; this cell pins the versions so that all dependencies are in place.


In [1]:
!pip install -q langchain==0.3.27 langchain-groq==0.3.8 pypdf==6.1.2 langchain_community==0.3.31


[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m323.6/323.6 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m27.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m135.8/135.8 kB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m64.7/64.7 kB[0m [31m1.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.9/50.9 kB[0m [31m864.0 kB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests==2.32.4, but you have requests 2.32.5 which is incompatible.[0m[31m
[0m

In [2]:
!pip show langchain langchain-groq pypdf langchain_community

Name: langchain
Version: 0.3.27
Summary: Building applications with LLMs through composability
Home-page: 
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.12/dist-packages
Requires: langchain-core, langchain-text-splitters, langsmith, pydantic, PyYAML, requests, SQLAlchemy
Required-by: langchain-community
---
Name: langchain-groq
Version: 0.3.8
Summary: An integration package connecting Groq and LangChain
Home-page: 
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.12/dist-packages
Requires: groq, langchain-core
Required-by: 
---
Name: pypdf
Version: 6.1.2
Summary: A pure-python PDF library capable of splitting, merging, cropping, and transforming PDF files
Home-page: 
Author: 
Author-email: Mathieu Fenniak <biziqe@mathieu.fenniak.net>
License: 
Location: /usr/local/lib/python3.12/dist-packages
Requires: 
Required-by: 
---
Name: langchain-community
Version: 0.3.31
Summary: Community contributed LangChain integrations.
Home-page: 
Author: 
Auth

## ✅ Step 2: Load PDF (Re-use from Previous Session)
Upload your `scholarship_info.pdf` to Colab if not already present.


In [4]:
from langchain_community.document_loaders import PyPDFLoader

pdf_path = "/content/scholarship_info.pdf"
loader = PyPDFLoader(pdf_path)
docs = loader.load()

print(f"✅ Loaded {len(docs)} pages.")


✅ Loaded 1 pages.
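
If the file was never uploaded, the loader fails with a less obvious error. The following is an optional sketch of a pre-check you could run before creating the loader; `require_pdf` is a hypothetical helper, not part of LangChain.

```python
from pathlib import Path


def require_pdf(path: str) -> Path:
    """Return the path as a Path object, or raise a clear error if it is unusable."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"{p} not found. Upload it to the Colab session first.")
    if p.suffix.lower() != ".pdf":
        raise ValueError(f"{p} does not look like a PDF file.")
    return p


# Usage in this notebook (assumes the path from the cell above):
# pdf_path = str(require_pdf("/content/scholarship_info.pdf"))
```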


## ✅ Step 3: Split Text into Chunks
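We’ll use `RecursiveCharacterTextSplitter`, which tries to split on paragraph breaks first and falls back to line breaks, spaces, and finally single characters.  
Parameters:  
- `chunk_size=500` → Max characters per chunk  
- `chunk_overlap=200` → Max characters shared between consecutive chunks (to preserve context)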
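
The "recursive" idea can be sketched in a few lines of plain Python: try the coarsest separator first, and only fall back to a finer one for pieces that are still too long. This is a simplified illustration, not the library's implementation. In particular it does not merge small pieces back together up to `chunk_size` and it ignores overlap, both of which the real splitter handles. `recursive_split` and the sample text are hypothetical.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split text so every piece is at most chunk_size characters (no merging, no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        # Parts that are still too long are split again with finer separators.
        pieces.extend(recursive_split(part, chunk_size, finer))
    return pieces


sample = (
    "Eligibility: open to undergraduates.\n\n"
    "Deadline: October 15, 2025.\n\n"
    "Benefits: tuition support and a book allowance."
)
pieces = recursive_split(sample, chunk_size=40)
for piece in pieces:
    print(len(piece), repr(piece))
```

The first two paragraphs fit within 40 characters and stay intact; the third is too long, so it is broken on spaces. Because this sketch skips the merge step, that paragraph ends up as single words, which is exactly the fragmentation the real splitter avoids by recombining small pieces.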


In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=200
)

splits = splitter.split_documents(docs)
print(f"✅ Split into {len(splits)} chunks.")

# Preview first chunk
print("\n--- First Chunk ---\n")
print(splits[0].page_content[:500])


✅ Split into 2 chunks.

--- First Chunk ---

Title: Scholarship Information 2025 
 
1. Eligibility: 
- Open to students in India pursuing undergraduate degrees. 
- Annual family income must be below ₹6,00,000. 
- Minimum 60% marks in the last qualifying exam. 
 
2. Documents Required: 
- Income certificate 
- Aadhaar card 
- Bank passbook 
- Marksheet 
 
3. Deadline: October 15, 2025 
 
4. Benefits: 
- ₹10,000 per semester for tuition 
- Book allowance of ₹3,000 per year 
 
5. How to Apply:


## ✅ Step 4: Use Groq LLM to Summarize a Chunk
Let’s take one chunk and summarize it with a Groq-hosted model (`openai/gpt-oss-20b`).


In [6]:
from google.colab import userdata
from langchain_groq import ChatGroq

# Load API key
GROQ_API_KEY = userdata.get('GROQ_API_KEY')

llm = ChatGroq(
    model="openai/gpt-oss-20b",
    api_key=GROQ_API_KEY,
    temperature=0.3,
    max_tokens=200
)

sample_chunk = splits[0].page_content
summary = llm.invoke(f"Summarize this chunk in 3 bullet points:\n\n{sample_chunk}")

print("Groq LLM Summary:\n", summary.content)


Groq LLM Summary:
 - **Eligibility & Application**: Open to Indian undergraduates with family income < ₹6,00,000 and ≥ 60 % marks; must submit income certificate, Aadhaar, bank passbook, and marksheet by the October 15, 2025 deadline.  
- **Benefits**: Receives ₹10,000 per semester toward tuition and a ₹3,000 annual book allowance.  
- **Process**: Apply online (details not provided) using the required documents and meeting the stated criteria.
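
To summarize every chunk rather than just the first, the same prompt can be applied in a loop. The sketch below keeps the LLM call behind a plain callable so it runs without an API key; `build_summary_prompt`, `summarize_chunks`, and `fake_llm` are hypothetical helpers, and the commented last line shows how the notebook's `llm` and `splits` could be plugged in.

```python
from typing import Callable


def build_summary_prompt(chunk: str, bullets: int = 3) -> str:
    """Build the same summarization prompt used in the cell above."""
    return f"Summarize this chunk in {bullets} bullet points:\n\n{chunk}"


def summarize_chunks(chunks: list[str], ask: Callable[[str], str],
                     bullets: int = 3) -> list[str]:
    """Send one prompt per chunk to `ask` and collect the replies in order."""
    return [ask(build_summary_prompt(chunk, bullets)) for chunk in chunks]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to show the data flow."""
    return f"[summary of a {len(prompt)}-character prompt]"


demo = summarize_chunks(["chunk one", "chunk two"], fake_llm)
print(demo)

# With the objects from this notebook (one API call per chunk):
# summaries = summarize_chunks([d.page_content for d in splits],
#                              lambda p: llm.invoke(p).content)
```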


## ✅ Step 5: Why Chunking Matters
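Try asking the LLM about **scholarship deadlines** using:  
1. The whole document (may fail once the document is longer than the model's context window).  
2. A chunked + RAG pipeline (sends only the relevant chunks to the model; covered in the next session).  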
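
A rough way to judge whether the whole document would fit is to estimate its token count from its length. The sketch below assumes about 4 characters per token, which is only a rule of thumb for English text; real tokenizers differ by model and language. `estimate_tokens` and `fits_context` are hypothetical helpers.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumption: ~4 characters per token)."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(text: str, context_window: int,
                 reserved_for_answer: int = 200,
                 chars_per_token: float = 4.0) -> bool:
    """True if the estimated prompt tokens plus the reserved answer tokens fit the window."""
    return estimate_tokens(text, chars_per_token) + reserved_for_answer <= context_window


short_doc = "a" * 4_000    # placeholder text, roughly 1,000 estimated tokens
long_doc = "a" * 40_000    # placeholder text, roughly 10,000 estimated tokens

print(fits_context(short_doc, context_window=8_000))
print(fits_context(long_doc, context_window=8_000))

# With this notebook's data:
# full_text = "\n".join(d.page_content for d in docs)
# print(estimate_tokens(full_text))
```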


## 📝 Exercise
1. Change `chunk_size` to 250 and then to 1000, and compare the number of chunks with the result for 500.  
2. Change `chunk_overlap` to 0, 100, and 300 → observe differences (keep `chunk_overlap` smaller than `chunk_size`).  
3. Write a prompt to extract **key dates** from the first 2 chunks.  
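
For the overlap exercise, it helps to measure how much text two consecutive chunks actually share. The helper below is a small sketch (`shared_overlap` is a hypothetical name); the measured overlap can be smaller than `chunk_overlap`, since the splitter builds overlap from whole pieces of text.

```python
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, 0, -1):  # try the longest candidate first
        if left.endswith(right[:size]):
            return right[:size]
    return ""


print(repr(shared_overlap("abcdef", "defghi")))

# With the chunks from this notebook:
# for a, b in zip(splits, splits[1:]):
#     print(len(shared_overlap(a.page_content, b.page_content)))
```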


## 🎯 Summary
- Split PDF into manageable chunks.  
- Learned why overlap preserves context.  
- Summarized a chunk with Groq LLM.  

**Next Notebook → Vector Stores in LangChain (Embeddings + FAISS)**  
