<a href="https://colab.research.google.com/github/rj2663972/RAG-Application/blob/main/RAG_Pipeline_Updated.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Install Dependencies**

In [None]:
!pip install langchain_community langchainhub chromadb langchain langchain_openai

**Load the OpenAI API Key**

In [None]:
from google.colab import userdata
import os
os.environ['OPENAI_API_KEY'] = userdata.get('open-api-key')

**Use WebBaseLoader to Scrape a Website and Prepare Documents**

In [None]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(web_paths=["https://www.educosys.com/course/genai"])

docs = loader.load()
print(f"Loaded {len(docs)} raw documents (web pages).")

**Split the Documents into Chunks**

In [None]:
# The splitters live in the langchain_text_splitters package, installed alongside langchain.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)
splits = text_splitter.split_documents(docs)

print(f"Created {len(splits)} chunks. Example chunk length: {len(splits[0].page_content)} chars")
print("----- Sample chunk start -----")
print(splits[0].page_content[:500])
print("----- Sample chunk end -----")
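
To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and newlines). The helper name `sliding_window_chunks` and the sample string are hypothetical, chosen only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters.

    Hypothetical helper for illustration; not part of LangChain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one. In the notebook's setting (`chunk_size=800`, `chunk_overlap=150`), the same idea means a sentence falling on a chunk boundary is likely to appear whole in at least one chunk.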

**Embed the Chunks and Store Them in a Chroma Vector Store**

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma  # langchain.vectorstores is deprecated

PERSIST_DIR = "./chroma_db"
embeddings = OpenAIEmbeddings()

# Reuse the on-disk DB if it exists; otherwise embed the chunks and create it.
if os.path.exists(PERSIST_DIR) and os.listdir(PERSIST_DIR):
    print("Loading existing Chroma DB from disk...")
    vectorstore = Chroma(persist_directory=PERSIST_DIR, embedding_function=embeddings)
else:
    print("Creating new Chroma DB from documents (embedding may take a while)...")
    vectorstore = Chroma.from_documents(splits, embeddings, persist_directory=PERSIST_DIR)
    try:
        # Persist explicitly to avoid re-embedding on the next run.
        # Newer Chroma versions persist automatically, so a failure here is harmless.
        vectorstore.persist()
        print(f"Persisted Chroma DB to {PERSIST_DIR}")
    except Exception as e:
        print("Persist failed (non-fatal):", e)

# Expose a retriever that returns the 5 chunks most similar to the query.
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
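
Conceptually, the retriever embeds the query and returns the `k` stored chunks whose embeddings are closest to it. The sketch below shows that idea with hand-made 3-dimensional vectors and cosine similarity. The vectors, the labels, and the helper names `cosine_similarity` and `top_k` are all assumptions for illustration; the distance metric and index Chroma actually uses depend on its configuration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the labels of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:k]]


# Toy "embeddings"; real OpenAI embeddings have far more dimensions.
toy_vectors = {
    "recordings": [0.9, 0.1, 0.0],
    "pricing": [0.1, 0.9, 0.0],
    "syllabus": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

top_two = top_k(toy_query, toy_vectors, k=2)
print(top_two)
# ['recordings', 'pricing']
```

Raising `k` gives the LLM more context at the cost of more prompt tokens; the notebook uses `k=5`.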

**Create Prompt and LLM**

In [None]:
from langchain import hub
from langchain_openai import ChatOpenAI

prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

In [None]:
def format_docs(docs):
    """
    Build a single context string out of retrieved docs.
    Each doc will include a small excerpt and its source metadata (if present).
    """
    parts = []
    for i, d in enumerate(docs):
        src = d.metadata.get("source", f"doc_{i}")
        excerpt = d.page_content.strip().replace("\n", " ")
        # include only a reasonable excerpt (to limit tokens)
        excerpt_snippet = excerpt[:800]
        parts.append(f"[Source: {src}]\n{excerpt_snippet}")
    return "\n\n---\n\n".join(parts)

**Final RAG Pipeline**

In [None]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

rag_chain = ({"context": retriever | format_docs, "question": RunnablePassthrough()}
             | prompt
             | llm
             | StrOutputParser())
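
The `|` chain can be hard to read at first. As a rough sketch of the data flow only (not of how LangChain implements it), the same steps can be written as plain function calls: the question goes to the retriever to build `context`, is passed through unchanged as `question`, and the resulting dict feeds the prompt and then the model. Every function below is a hypothetical stub standing in for the real component, so no API key is needed.

```python
def fake_retriever(question: str) -> list[str]:
    """Stub for `retriever`: always returns two canned chunks."""
    return ["chunk A about recordings", "chunk B about pricing"]


def join_chunks(chunks: list[str]) -> str:
    """Stub for `format_docs`: merge chunks into one context string."""
    return "\n\n---\n\n".join(chunks)


def fake_prompt(inputs: dict) -> str:
    """Stub for `prompt`: fill a template from the inputs dict."""
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"


def fake_llm(prompt_text: str) -> str:
    """Stub for `llm | StrOutputParser()`: returns a placeholder string."""
    return f"[stub answer to a {len(prompt_text)}-character prompt]"


def build_inputs(question: str) -> dict:
    # Mirrors {"context": retriever | format_docs, "question": RunnablePassthrough()}:
    # both entries receive the same input, the raw question.
    return {
        "context": join_chunks(fake_retriever(question)),
        "question": question,
    }


def run_chain(question: str) -> str:
    return fake_llm(fake_prompt(build_inputs(question)))


demo_inputs = build_inputs("Are recordings available?")
demo_answer = run_chain("Are recordings available?")
print(demo_inputs["question"])
print(demo_answer)
```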

**Test the RAG Pipeline**

In [None]:
rag_chain.invoke("Are the recordings of the course available? For how long?")