<a href="https://colab.research.google.com/github/rodiwaa/learnings-pocs/blob/main/notebooks/yt_rag_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# YT RAG System
Learning RAG components from the excellent CampusX playlist on YouTube.

Scope
- RAG
  - Doc Loaders
  - Text Splitters
  - Vector Stores
  - Retrievers
- Advanced RAG (Future)
  - UI (Streamlit, Chainlit, Gradio, React)
  - Evaluations
    - Ragas (4 metrics)
    - LangSmith (traces, tags)
  - Indexing
  - Retrieval
    - Pre-retrieval
      - LLM query rewriting
      - Multi-query
      - Domain-aware routing
    - During retrieval (search strategy)
      - MMR
      - Hybrid (semantic, BM25, keyword)
      - Reranking (algorithmic, LLM-based)
    - Post-retrieval
      - Contextual compression
  - Augmentation
    - Prompt templating
    - Grounding (use context only, else say IDK)
    - Context window optimisation
  - Generation
    - Citations
    - Guardrails
  - System Design
    - Multimodal
    - Agentic (web search, routers)
    - Memory-based (from previous conversations)



In [None]:
!pip install langchain langgraph langsmith langchain-community wikipedia langchain-openai chromadb python-dotenv

## Ingestion Module

In [None]:
from google.colab import drive
from dotenv import load_dotenv
import os

MOUNT_PATH="/content/drive"
drive.mount(MOUNT_PATH, force_remount=True)

ENV_PATH=f"{MOUNT_PATH}/MyDrive/Projects/.env/.env"
print(ENV_PATH)

load_dotenv(dotenv_path=ENV_PATH)

# FIXME: load credentials and API keys from an external .env file
#   1. fetch the .env from Google Drive
#   2. load it with python-dotenv
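
A quick check after `load_dotenv` can confirm that the keys actually made it into the environment, without printing their values. This is a minimal sketch: the `missing_keys` helper and the `REQUIRED_KEYS` list are hypothetical names, and `OPENAI_API_KEY` is assumed to be the only key needed here because `OpenAIEmbeddings` reads it by default.

```python
import os

# Assumption: only the OpenAI key is needed for this notebook.
REQUIRED_KEYS = ["OPENAI_API_KEY"]


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_keys(REQUIRED_KEYS)
if missing:
    # Report names only, never the secret values themselves.
    print(f"Missing keys: {', '.join(missing)}")
else:
    print("All required keys are set")
```

If a key is reported missing, the usual suspects are a wrong `ENV_PATH` or a Drive mount that did not complete.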


In [None]:
from langchain_core.documents import Document

# custom docs
docs = [
    Document(page_content="Rohit is software engineer"),
    Document(page_content="Rohit is AI engineer"),
    Document(page_content="Rohit is data engineer"),
    Document(page_content="Rohit is an astronaut")
]
print(f"{len(docs)} docs added")

In [None]:
# from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings()

vectorstore = Chroma.from_documents(
    documents = docs,
    embedding = embedding_model,
    collection_name = "temp-documents"
)
print("vs created")

In [None]:
# retrieve documents from the vector store

retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 2}  # return at most 2 documents
)
print("retriever created 2")

# Document ids are strings, so pass the id as "1234" rather than an int
vectorstore.add_documents([Document(page_content="Scott is an astronaut", id="1234")])
print("new doc added")

In [None]:
# test vector store retrieval

query = "who all are astronauts?"
docs = retriever.invoke(query)

for doc in docs:
  print(doc.page_content)
# NOTE: if this prints every doc, the retriever is not capping results;
# as_retriever() takes the cap as search_kwargs={"k": N}
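
To see what `search_type="mmr"` is doing, here is a small pure-Python sketch of maximal marginal relevance. It is an illustration under assumptions, not the Chroma or LangChain implementation: the 2-D vectors are made-up stand-ins for embeddings, `mmr_select` and `cosine` are hypothetical helpers, and the candidate prefetch step (`fetch_k`) is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Pick up to k doc indices, trading relevance against redundancy.

    lambda_mult=1.0 ranks by relevance only; lower values favour diversity.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected doc.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Made-up 2-D "embeddings", for illustration only.
query_vec = [1.0, 0.0]
doc_vecs = [
    [0.95, 0.30],   # 0: most relevant
    [0.94, 0.33],   # 1: near-duplicate of 0
    [0.90, -0.45],  # 2: relevant, but points in a different direction
    [0.00, 1.00],   # 3: unrelated
]

print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # diversity-aware
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # relevance only
```

With `lambda_mult=0.5` the near-duplicate (index 1) is skipped in favour of index 2, while `lambda_mult=1.0` falls back to plain relevance ranking and picks indices 0 and 1.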

## Retrievers

In [None]:
from langchain_community.retrievers import WikipediaRetriever

retriever = WikipediaRetriever(top_k_results=2, lang="en")
query = "2024 IPL"
docs = retriever.invoke(query)
print(f"{len(docs)} docs retrieved")  # a bare len(docs) is not displayed unless it is the last expression
docs

# for i, doc in enumerate(docs):
#   print(doc.page_content)
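
Dumping `docs` directly is noisy because each Wikipedia page is long. A compact preview could look like the sketch below. The `preview` helper is hypothetical, and the `title` metadata key is an assumption about what the retriever attaches, so it is read with `.get` and a fallback. The stand-in objects only mimic the `page_content` and `metadata` attributes so the example runs without network access.

```python
from types import SimpleNamespace


def preview(docs, max_chars=200):
    """Return one short line per document: index, title, collapsed snippet."""
    lines = []
    for i, doc in enumerate(docs, start=1):
        # Assumption: the title lives under metadata["title"].
        title = doc.metadata.get("title", "(no title)")
        # Collapse newlines and repeated spaces before truncating.
        snippet = " ".join(doc.page_content.split())[:max_chars]
        lines.append(f"{i}. {title}: {snippet}")
    return lines


# Stand-ins for retrieved documents, for illustration only.
sample = [
    SimpleNamespace(
        page_content="First   sample\narticle body.",
        metadata={"title": "Sample A"},
    ),
    SimpleNamespace(page_content="Second body.", metadata={}),
]

for line in preview(sample):
    print(line)
```

In the notebook, `preview(docs)` can be called on the list returned by `retriever.invoke(query)` in the same way.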