## 👉 Working with your own data (via Data Loaders)
## 👉 Retrieval-Augmented Generation (RAG)

# Agenda:
1. Data loader - how to work with custom data
2. Introduction to RAG - Retrieval-Augmented Generation
3. Splitters
4. Embeddings
5. Vector Store
6. Retriever
7. Top K

Next: RAG with LCEL

### 🔹 1. Working with Custom Data (Data Loaders)

LLMs like GPT are powerful, but they don’t know your private PDFs, docs, or databases.
LangChain provides Document Loaders to bring that data into your application.

✨ Steps:

1. Load data (PDF, TXT, CSV, URL, DB, JSON, etc.).
2. Split it into chunks (LLMs can’t handle huge docs at once).
3. Store the chunks in a vector database (for similarity search).
4. Use a RAG pipeline → ask questions over your data.

#### Example: Loading a PDF

In [1]:
from langchain_community.document_loaders import PyPDFLoader

# 1. Load your PDF
loader = PyPDFLoader("sample.pdf")
docs = loader.load()

print(docs[0].page_content[:500])  # See first 500 chars


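Note: if this cell fails with `ModuleNotFoundError: No module named 'langchain_community'`, install the dependencies first, e.g. `pip install langchain-community pypdf`.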
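
To make it clearer what a loader hands back, here is a minimal sketch of a plain-text loader that uses only the standard library. `SimpleDocument` and `load_text_file` are hypothetical names for illustration, not LangChain classes; they only mirror the `page_content` and `metadata` fields seen on the loaded docs above.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loader's output: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str) -> list[SimpleDocument]:
    """Read one UTF-8 text file and wrap it in a single document."""
    text = Path(path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": path})]
```

A PDF loader does the same job with more work per file: it typically extracts the text page by page and records details such as the source file in the metadata.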

#### Example: Splitting Documents

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(docs)

print(len(chunks))
print(chunks[0].page_content)
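
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter in plain Python. It is only a sketch: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, which this version does not do. `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
print(split_with_overlap(sample, chunk_size=10, chunk_overlap=3))
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last 3 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.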

### 🔹 2. RAG (Retrieval-Augmented Generation) — Basic Concept

Problem: LLMs often "hallucinate" (make up answers).
Solution: Instead of relying only on the model’s memory → give it retrieved data from your documents.

✨ How RAG Works

1. User question → "What is space weather?"
2. The retriever (backed by a vector DB) finds the chunks of your docs most relevant to the question.
3. The LLM receives both (prompt + retrieved chunks) and gives a grounded answer.
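
The last step, combining the question with the retrieved chunks, can be sketched as plain string building. The prompt wording and the `build_rag_prompt` helper are assumptions for illustration; real chains ship their own templates.

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Put the retrieved text in front of the question so the LLM answers from it."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is not enough, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up chunk standing in for what a retriever would return.
example_chunks = ["Space weather describes conditions in space driven by the Sun."]
print(build_rag_prompt("What is space weather?", example_chunks))
```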

#### 🔹 Simple RAG Example with FAISS

In [None]:
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# 1. Embeddings (turn text into vectors)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# 2. Store in FAISS (local vector DB)
vectorstore = FAISS.from_documents(chunks, embeddings)

# 3. Retriever
retriever = vectorstore.as_retriever()

# 4. RAG Chain
llm = ChatOpenAI(model="gpt-3.5-turbo")
qa = RetrievalQA.from_chain_type(llm, retriever=retriever)

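# 5. Ask Question
query = "Summarize this document in 3 bullet points."
# .run() is deprecated in recent LangChain releases; .invoke() returns a dict
result = qa.invoke({"query": query})

print(result["result"])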
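
To make the embeddings → vector store → retriever steps less of a black box, here is a toy version in plain Python. It swaps in word-count vectors for real embeddings and a brute-force cosine-similarity scan for a FAISS index, so it only illustrates the idea of returning the top `k` most similar chunks. All names here (`embed`, `cosine`, `top_k`) and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (real embeddings are dense float vectors)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "solar flares affect space weather",
    "the cafeteria opens at nine",
    "space weather can disrupt satellites",
]
print(top_k("what is space weather", toy_chunks, k=2))
```

In LangChain, the number of chunks returned can be set when creating the retriever, for example `vectorstore.as_retriever(search_kwargs={"k": 2})`.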


🔹 Key Idea:

Without RAG → Model answers from its own training knowledge (may be outdated/wrong).

With RAG → Model answers from your documents (more reliable, context-aware).

✅ You can use loaders for:

* PDFs → PyPDFLoader
* Word docs → Docx2txtLoader
* Websites → WebBaseLoader
* Notion, Slack, DBs → dedicated loaders

# MCP Server (we will learn this later)


Here are some **short, simple notes** on the MCP server, so that a quick read gives a clear idea.

---

# 📒 Easy Notes on MCP Server

### 🔹 What is MCP Server?

👉 **MCP (Model Context Protocol) server = fetches the needed information from the right place and feeds it to the LLM.**
It is basically a **middle layer**: instead of accessing every data source or API directly, the LLM gets its data through the MCP server.

---

### 🔹 Why MCP Server?

* An LLM doesn't know everything → it is limited to its own knowledge.
* When custom data / company data is needed, it is supplied through MCP.
* It is safer, because the LLM doesn't need direct DB/API access.

---

### 🔹 How it Works (Flow)

1. **User asks a question** → e.g., “What is the remote work policy?”
2. **LLM Agent** → calls the MCP server.
3. **MCP Server** → fetches the info from the relevant system/doc (PDF, DB, API).
4. **LLM** → generates the answer using that data.
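
The flow above can be sketched in plain Python. This is not the official MCP SDK or its wire format (the real protocol exchanges JSON-RPC messages between a client and a server); it only mimics the shape of the idea: the agent asks a middle layer for a named tool, and the middle layer is the only part that touches the data. `ToyToolServer`, `lookup_policy`, and the policy text are all made up for illustration.

```python
from typing import Callable


class ToyToolServer:
    """Hypothetical middle layer: the only component allowed to touch the data."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., str]] = {}

    def register(self, name: str, func: Callable[..., str]) -> None:
        """Expose a function under a tool name the agent can call."""
        self._tools[name] = func

    def list_tools(self) -> list[str]:
        """Let the agent discover which tools exist."""
        return sorted(self._tools)

    def call(self, name: str, **arguments: str) -> str:
        """Run a registered tool; anything not registered is refused."""
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**arguments)


# Made-up data source standing in for a PDF, DB, or API.
POLICIES = {"remote work": "Placeholder text for the remote work policy."}


def lookup_policy(topic: str) -> str:
    return POLICIES.get(topic.lower(), "No policy found.")


server = ToyToolServer()
server.register("lookup_policy", lookup_policy)

# Steps 2-4 of the flow: the agent calls the server, then builds its prompt from the result.
info = server.call("lookup_policy", topic="Remote work")
prompt = f"Context: {info}\nQuestion: What is the remote work policy?"
print(prompt)
```

Because the agent can only reach tools that were registered, the data source stays behind the server rather than being handed to the LLM directly.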

---

### 🔹 Example Use Cases

* **HR Policies** → fetch a policy from the employee handbook.
* **Customer Support** → fetch customer history from the CRM.
* **Finance** → fetch information from the payroll or invoice DB.
* **DevOps** → run a server health check or fetch logs.

---

### 🔹 Benefits

✅ **Grounded Answers** → fewer hallucinations.
✅ **Standardized Access** → every data source is accessed in one format.
✅ **Security** → access can be restricted and controlled.
✅ **Reusable** → once an MCP server is built, agents can use it easily.

---

### 🔹 Relation with RAG

* **RAG** = building answers from your own data (PDFs, docs, DBs).
* **MCP** = exposing that retriever / tool as a **standard server**.

👉 In short:
**RAG = concept** (retrieve + generate)
**MCP = technology** (standard way to serve data/tools to LLM)

---

# 📝 Quick Summary

* **MCP server = middleman** → LLM ↔ Data/API.
* **Job** → fetch only the information that is needed.
* **Use Cases** → HR, Finance, Support, DevOps, Knowledge base.
* **Benefit** → safe, reusable, fewer hallucinations, standardized.

---