## 👉 Working with your own data (via Data Loaders)
## 👉 Retrieval-Augmented Generation (RAG)

# Agenda:
1. Data loaders - how to work with custom data
2. Introduction to RAG - Retrieval-Augmented Generation
3. Splitters
4. Embeddings
5. Vector Store
6. Retriever
7. Top K

Next: RAG with LCEL

### 🔹 1. Working with Custom Data (Data Loaders)

LLMs like GPT are powerful, but they don't know your private PDFs, docs, or databases.
LangChain provides Document Loaders to bring that data in.

✨ Steps:

1. Load data (PDF, TXT, CSV, URL, DB, JSON, etc.).
2. Split it into chunks (an LLM's context window is limited, so huge documents can't be passed in at once).
3. Store the chunks in a vector database (for similarity search).
4. Use a RAG pipeline → ask questions over your data.

#### Example: Loading a PDF

In [1]:
from langchain_community.document_loaders import PyPDFLoader

# 1. Load your PDF
loader = PyPDFLoader("sample.pdf")
docs = loader.load()

print(docs[0].page_content[:500])  # See first 500 chars


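Note: `ModuleNotFoundError: No module named 'langchain_community'` means the package is not installed in the notebook environment. Install `langchain-community` (and `pypdf`, which `PyPDFLoader` depends on), then re-run the cell.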

#### Example: Splitting Documents

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(docs)

print(len(chunks))
print(chunks[0].page_content)
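
To see what `chunk_size` and `chunk_overlap` mean, here is a minimal pure-Python sketch. It uses a plain fixed-size sliding window, which is a simplification: `RecursiveCharacterTextSplitter` prefers to cut on separators such as paragraph and line breaks. The function name `split_text` and the sample string are made up for illustration.

```python
def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return pieces


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_text(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last 3 characters of the previous one. The overlap keeps a sentence that straddles a boundary from being lost in both chunks.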

### 🔹 2. RAG (Retrieval-Augmented Generation): Basic Concept

Problem: LLMs often "hallucinate" (make up answers).
Solution: Instead of relying only on the model's memory → give it retrieved data from your documents.

✨ How RAG Works

1. User question → "What is space weather?"
2. Retriever (vector DB) finds the chunks of your docs that are most relevant to the question.
3. LLM combines both (prompt + retrieved chunks) to give a grounded answer.
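
The retrieval step can be sketched without any library. This is a toy under stated assumptions: word counts stand in for a real embedding model, a Python list stands in for the vector store, and the three sample chunks are invented. The `k` parameter is the "Top K" from the agenda: how many of the most similar chunks to return.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, stored_chunks: list, k: int = 2) -> list:
    """Return the k chunks most similar to the question (Top K)."""
    q = embed(question)
    ranked = sorted(stored_chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "Space weather describes conditions in space driven by solar activity.",
    "The cafeteria opens at nine in the morning.",
    "Solar flares are one cause of space weather events.",
]
top = retrieve("What is space weather?", toy_chunks, k=2)
print(top)
```

With `k=2` the two space-weather chunks come back and the cafeteria chunk is dropped, because it shares no words with the question.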

#### 🔹 Simple RAG Example with FAISS

In [None]:
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# 1. Embeddings (turn text into vectors)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# 2. Store in FAISS (local vector DB)
vectorstore = FAISS.from_documents(chunks, embeddings)

# 3. Retriever (Top K: k is the number of most similar chunks returned; 4 is the default)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

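# 4. RAG chain (RetrievalQA is a legacy chain; newer LangChain releases favor LCEL)
llm = ChatOpenAI(model="gpt-3.5-turbo")
qa = RetrievalQA.from_chain_type(llm, retriever=retriever)

# 5. Ask a question (.invoke replaces the deprecated .run)
query = "Summarize this document in 3 bullet points."
result = qa.invoke({"query": query})

print(result["result"])  # the answer text is stored under the "result" key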


🔹 Key Idea:

Without RAG → the model answers from its own knowledge (may be outdated or wrong).

With RAG → the model answers from your documents (more reliable, context-aware).
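
The difference between the two cases comes down to what goes into the prompt. A minimal sketch, assuming a hypothetical helper `build_prompt` and an instruction wording chosen only for illustration:

```python
from typing import List, Optional


def build_prompt(question: str, context_chunks: Optional[List[str]] = None) -> str:
    """Build a plain prompt (no RAG) or a grounded prompt (with retrieved chunks)."""
    if not context_chunks:
        # Without RAG: the model can rely only on what it learned in training.
        return f"Question: {question}\nAnswer:"
    # With RAG: retrieved chunks are numbered and placed ahead of the question.
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


plain = build_prompt("What is space weather?")
grounded = build_prompt(
    "What is space weather?",
    ["Space weather describes conditions in space driven by solar activity."],
)
print(grounded)
```

Chains such as `RetrievalQA` do this assembly internally, so the sketch only makes the idea visible.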

✅ You can use loaders for:

* PDFs → PyPDFLoader
* Word docs → Docx2txtLoader
* Websites → WebBaseLoader
* Notion, Slack, DBs → dedicated loaders
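
One way to use this mapping in code is a small lookup that picks a loader class name from the source string. This is only a sketch: `pick_loader_name` and `LOADER_BY_SUFFIX` are hypothetical names, and `TextLoader` and `CSVLoader` are additions not in the list above (both live in `langchain_community.document_loaders`). The function returns the class name as a string so that it runs without LangChain installed.

```python
from pathlib import Path
from urllib.parse import urlparse

# File extension -> loader class name (names only, nothing is imported here).
LOADER_BY_SUFFIX = {
    ".pdf": "PyPDFLoader",
    ".docx": "Docx2txtLoader",
    ".txt": "TextLoader",
    ".csv": "CSVLoader",
}


def pick_loader_name(source: str) -> str:
    """Return the loader class name that fits a file path or URL."""
    if urlparse(source).scheme in ("http", "https"):
        return "WebBaseLoader"  # web pages
    suffix = Path(source).suffix.lower()
    if suffix not in LOADER_BY_SUFFIX:
        raise ValueError(f"No loader registered for {source!r}")
    return LOADER_BY_SUFFIX[suffix]


print(pick_loader_name("sample.pdf"))
print(pick_loader_name("https://example.com/post"))
```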

# MCP Server (we will learn this later)


Here are some **short, easy notes** on MCP servers, so that a quick read gives a clear idea.

---

# 📒 Easy Notes on MCP Server

### 🔹 What is MCP Server?

👉 **MCP (Model Context Protocol) server = fetches the needed information from the right place and feeds it to the LLM.**
It is basically a **middle layer**: instead of accessing every data source or API directly, the LLM gets its data through the MCP server.

---

### 🔹 Why MCP Server?

* An LLM doesn't know everything → it is limited to its own knowledge.
* When custom or company data is needed, it is supplied through MCP.
* It is safer, because the LLM doesn't have to be given direct DB/API access.

---

### 🔹 How it Works (Flow)

1. **User asks a question** → e.g., "What is the remote work policy?"
2. **LLM Agent** → calls the MCP server.
3. **MCP Server** → fetches the info from the relevant system/doc (PDF, DB, API).
4. **LLM** → generates the answer using that data.
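
The flow above can be sketched as a toy, in-process request handler. MCP messages are JSON-RPC 2.0, and `tools/list` and `tools/call` are method names from the protocol. Everything else here is hypothetical: the `HANDBOOK` data, the `search_handbook` tool, and the `handle_request` function. A real server would use an MCP SDK and a transport such as stdio or HTTP.

```python
import json

# Hypothetical in-memory "handbook" standing in for a real PDF, DB, or API.
HANDBOOK = {
    "remote work": "Employees may work remotely up to three days per week.",
}


def search_handbook(topic: str) -> str:
    """Hypothetical tool: look up a policy by topic."""
    return HANDBOOK.get(topic.lower(), "No matching policy found.")


TOOLS = {"search_handbook": search_handbook}


def handle_request(raw: str) -> str:
    """Handle one JSON-RPC 2.0 style request and return the JSON response."""
    request = json.loads(raw)
    req_id = request.get("id")
    method = request.get("method")
    params = request.get("params", {})

    def error(code: int, message: str) -> str:
        return json.dumps({"jsonrpc": "2.0", "id": req_id,
                           "error": {"code": code, "message": message}})

    if method == "tools/list":
        result = {"tools": [{"name": name} for name in TOOLS]}
    elif method == "tools/call":
        tool = TOOLS.get(params.get("name"))
        if tool is None:
            return error(-32602, "Unknown tool")  # JSON-RPC "Invalid params"
        text = tool(**params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": text}]}
    else:
        return error(-32601, "Method not found")
    return json.dumps({"jsonrpc": "2.0", "id": req_id, "result": result})


# Step 2: the agent calls the server. Step 3: the server fetches the info.
request = json.dumps({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "search_handbook", "arguments": {"topic": "Remote work"}},
})
response = json.loads(handle_request(request))
print(response["result"]["content"][0]["text"])
```

The LLM never touches `HANDBOOK` directly. It only sees the text that the tool returns, which is then used in step 4 to generate the answer.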

---

### 🔹 Example Use Cases

* **HR Policies** → fetch a policy from the employee handbook.
* **Customer Support** → fetch customer history from the CRM.
* **Finance** → fetch information from the payroll or invoice DB.
* **DevOps** → fetch server health checks or logs.

---

### 🔹 Benefits

✅ **Grounded Answer** → reduces hallucination.
✅ **Standardized Access** → every data source is accessed in one format.
✅ **Security** → access is limited and controlled.
✅ **Reusable** → once an MCP server is built, agents can use it easily.

---

### 🔹 Relation with RAG

* **RAG** = building answers from your own data (PDF, Docs, DB).
* **MCP** = exposing that retriever / tool as a **standard server**.

👉 In other words:
**RAG = concept** (retrieve + generate)
**MCP = technology** (standard way to serve data/tools to LLM)

---

# 📝 Quick Summary

* **MCP server = middleman** → LLM ↔ Data/API.
* **Job** → fetch only the information that is needed.
* **Use Cases** → HR, Finance, Support, DevOps, Knowledge base.
* **Benefit** → safe, reusable, fewer hallucinations, standardized.

---