### Routing (filter_unique_query)

![rag flow image](routing-flow.png "RAG FLOW")

Website: https://chaidocs.vercel.app/youtube/getting-started

### Logical Routing (filter_unique_query)

![rag flow image](logical-router.png "RAG FLOW")

# Data Ingestion

In [10]:
import os
from dotenv import load_dotenv
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from openai import OpenAI
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.document_loaders import RecursiveUrlLoader
from bs4 import BeautifulSoup

load_dotenv()

True

### Load only one specific page

In [None]:
# Pass only valid URLs; an empty string in the list makes the request fail
loader = WebBaseLoader(["https://chaidocs.vercel.app/youtube/getting-started/"])

docs = loader.load()

In [7]:
docs[0]

Document(metadata={'source': 'https://chaidocs.vercel.app/youtube/getting-started/', 'title': 'Getting Started | Chai aur Docs', 'description': 'Vision behind Chai aur Docs', 'language': 'en'}, page_content=' Getting Started | Chai aur Docs\n  Skip to content        Chai aur Docs        Search  CtrlK      Cancel             YouTube Instagram LinkedIn GitHub X                                            Getting Started       Chai aur HTML       Welcome    HTML Intro    Emmet Crash Course    Common HTML Tags          Chai aur Git       Welcome    Git and GitHub    Terminology    Behind the scenes    Branches in Git    Diff, Stash, Tags    Managing History    Collaborate with Github          Chai aur C++       Welcome    C++ Intro    First Program in C++    Variables & Constants    Data Types    Operators    Control Flow    Loops    Functions          Chai aur Django       Welcome    Django Intro    Jinja Templates App    Tailwind Integration    Models    Relationships & Forms          Cha

In [22]:
docs = loader.load()
docs[0].metadata

{'source': 'https://chaidocs.vercel.app/youtube/getting-started',
 'content_type': 'text/html; charset=utf-8',
 'title': 'Getting Started | Chai aur Docs',
 'description': 'Vision behind Chai aur Docs',
 'language': 'en'}

### Scrape all the links from the website

In [27]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def extract_links(url, base_domain="chaidocs.vercel.app"):
    # Fetch the page content (fail fast on network or HTTP errors)
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Find all links
    links = []
    for a_tag in soup.find_all('a', href=True):
        href = a_tag['href']
        # Convert relative URLs to absolute
        full_url = urljoin(url, href)
        # Only include links from the same domain
        if base_domain in full_url:
            links.append(full_url)
    
    return links

# Base URL
base_url = "https://chaidocs.vercel.app/youtube/chai-aur-html/introduction/"

# Get all links
all_links = extract_links(base_url)
print(f"Found {len(all_links)} links")

# Now use these links with a loader
from langchain_community.document_loaders import WebBaseLoader

# Create a list to store all documents
all_documents = []

# Load each URL
for link in all_links:
    try:
        loader = WebBaseLoader(link)
        documents = loader.load()
        all_documents.extend(documents)
        print(f"Loaded: {link}")
    except Exception as e:
        print(f"Error loading {link}: {e}")

print(f"Total documents loaded: {len(all_documents)}")

Found 65 links
Loaded: https://chaidocs.vercel.app/youtube/chai-aur-html/introduction/#_top
Loaded: https://chaidocs.vercel.app/
Loaded: https://chaidocs.vercel.app/youtube/getting-started/
Loaded: https://chaidocs.vercel.app/youtube/chai-aur-html/welcome/
Loaded: https://chaidocs.vercel.app/youtube/chai-aur-html/introduction/
Loaded: https://chaidocs.vercel.app/youtube/chai-aur-html/emmit-crash-course/
Loaded: https://chaidocs.vercel.app/youtube/chai-aur-html/html-tags/
Loaded: https://chaidocs.vercel.app/youtube/chai-aur-git/welcome/
Loaded: https://chaidocs.vercel.app/youtube/chai-aur-git/introduction/
Loaded: https://chaidocs.vercel.app/youtube/chai-aur-git/terminology/
Loaded: https://chaidocs.vercel.app/youtube/chai-aur-git/behind-the-scenes/
Loaded: https://chaidocs.vercel.app/youtube/chai-aur-git/branches/
Loaded: https://chaidocs.vercel.app/youtube/chai-aur-git/diff-stash-tags/
Loaded: https://chaidocs.vercel.app/youtube/chai-aur-git/managing-history/
Loaded: https://chaidocs.
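The log above includes the same page twice, once with a `#_top` fragment and once without, so some documents are loaded more than once. A minimal sketch of a clean-up step that could run between `extract_links` and the loading loop is shown below. The helper name `normalize_links` is hypothetical (it is not part of the original notebook), and it matches on the exact hostname rather than on a substring of the URL.

```python
from urllib.parse import urldefrag, urlparse


def normalize_links(links, base_domain="chaidocs.vercel.app"):
    """Drop URL fragments, keep exact-host matches, de-duplicate in order.

    Hypothetical helper, not part of the original notebook.
    """
    seen = set()
    unique = []
    for link in links:
        # urldefrag splits "page/#_top" into ("page/", "_top")
        url, _fragment = urldefrag(link)
        # Compare the parsed hostname instead of a substring test
        if urlparse(url).hostname != base_domain:
            continue
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


sample = [
    "https://chaidocs.vercel.app/youtube/chai-aur-html/introduction/#_top",
    "https://chaidocs.vercel.app/youtube/chai-aur-html/introduction/",
    "https://example.com/?ref=chaidocs.vercel.app",
    "https://chaidocs.vercel.app/",
]
print(normalize_links(sample))
```

With something like this in place, the loop would iterate over `normalize_links(all_links)`; the document count would then be lower than the 65 reported here.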

In [34]:
print(len(all_documents))
all_documents[0]

65


Document(metadata={'source': 'https://chaidocs.vercel.app/youtube/chai-aur-html/introduction/#_top', 'title': 'Introduction to HTML | Chai aur Docs', 'description': 'Learn the essentials of HTML and HTML5.', 'language': 'en'}, page_content=' Introduction to HTML | Chai aur Docs\n  Skip to content        Chai aur Docs        Search  CtrlK      Cancel             YouTube Instagram LinkedIn GitHub X                                            Getting Started       Chai aur HTML       Welcome    HTML Intro    Emmet Crash Course    Common HTML Tags          Chai aur Git       Welcome    Git and GitHub    Terminology    Behind the scenes    Branches in Git    Diff, Stash, Tags    Managing History    Collaborate with Github          Chai aur C++       Welcome    C++ Intro    First Program in C++    Variables & Constants    Data Types    Operators    Control Flow    Loops    Functions          Chai aur Django       Welcome    Django Intro    Jinja Templates App    Tailwind Integration    Models

### Get all the section titles; each one becomes a collection in the vector DB

In [53]:
all_documents[10].metadata.get('source').split('/')[-3]

'chai-aur-git'

In [60]:
section_titles = []
for document in all_documents:
    # The section slug is the third path segment from the end, e.g. 'chai-aur-git'
    title = document.metadata.get('source').split('/')[-3]

    if "chai-aur" in title:
        section_titles.append(title)

### Dynamically extracted section titles

In [61]:
len(section_titles)
print(set(section_titles))

{'chai-aur-c', 'chai-aur-django', 'chai-aur-html', 'chai-aur-devops', 'chai-aur-sql', 'chai-aur-git'}
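Indexing with `split('/')[-3]` depends on every URL having the same depth and a trailing slash. As a hedged alternative, the sketch below walks the parsed path and returns the first segment that starts with `chai-aur`. The function name `section_from_url` and the `prefix` parameter are assumptions made for illustration.

```python
from urllib.parse import urlparse


def section_from_url(url, prefix="chai-aur"):
    """Return the first path segment starting with `prefix`, or None.

    Hypothetical helper, not part of the original notebook.
    """
    # urlparse(...).path excludes the query string and the #fragment
    segments = [s for s in urlparse(url).path.split("/") if s]
    for segment in segments:
        if segment.startswith(prefix):
            return segment
    return None


print(section_from_url("https://chaidocs.vercel.app/youtube/chai-aur-git/branches/"))
print(section_from_url("https://chaidocs.vercel.app/youtube/chai-aur-git/branches"))
print(section_from_url("https://chaidocs.vercel.app/"))
```

This variant gives the same slug whether or not the URL ends with a slash or carries a fragment such as `#_top`.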


In [31]:
# Types of Text Splitters: https://python.langchain.com/docs/concepts/text_splitters/
# Chunking func
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=200
)

# Apply the splitter to all documents
split_docs = text_splitter.split_documents(all_documents)
split_docs

[Document(metadata={'source': 'https://chaidocs.vercel.app/youtube/chai-aur-html/introduction/#_top', 'title': 'Introduction to HTML | Chai aur Docs', 'description': 'Learn the essentials of HTML and HTML5.', 'language': 'en'}, page_content='Introduction to HTML | Chai aur Docs'),
 Document(metadata={'source': 'https://chaidocs.vercel.app/youtube/chai-aur-html/introduction/#_top', 'title': 'Introduction to HTML | Chai aur Docs', 'description': 'Learn the essentials of HTML and HTML5.', 'language': 'en'}, page_content='Skip to content        Chai aur Docs        Search  CtrlK      Cancel             YouTube Instagram LinkedIn GitHub X                                            Getting Started       Chai aur HTML       Welcome    HTML Intro    Emmet Crash Course    Common HTML Tags          Chai aur Git       Welcome    Git and GitHub    Terminology    Behind the scenes    Branches in Git    Diff, Stash, Tags    Managing History    Collaborate with Github          Chai aur C++       Welc

In [32]:
len(split_docs)

553
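To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window chunker. It is only an illustration of how overlap works: `RecursiveCharacterTextSplitter` itself splits on separators first and is not implemented this way. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Fixed-size character windows where each window re-reads the
    last `chunk_overlap` characters of the previous one.

    Simplified illustration only; not how LangChain's splitter works.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


text = "".join(str(i % 10) for i in range(2600))
chunks = sliding_chunks(text)
print(len(chunks), [len(c) for c in chunks])
# The tail of one chunk equals the head of the next
print(chunks[0][-200:] == chunks[1][:200])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.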

In [37]:
# Embedder function
embedder = OpenAIEmbeddings(
    model="text-embedding-3-large",
    api_key=os.getenv("OPEN_API_KEY")
)

In [101]:
SYSTEM_PROMPT_TO_DECIDE_COLLECTION = f"""
    You are a helpful agent. You will be given a document chunk; based on your understanding of its content, categorize it into exactly one of the section titles listed below. Do not categorize it into any option outside the list.
    DO NOT ADD ANY TEXT OTHER THAN THE SECTION TITLE
    
    section titles:
    {set(section_titles)}
"""

In [102]:
print(SYSTEM_PROMPT_TO_DECIDE_COLLECTION)



    You are a helpful agent. You will be given a document chunk; based on your understanding of its content, categorize it into exactly one of the section titles listed below. Do not categorize it into any option outside the list.
    DO NOT ADD ANY TEXT OTHER THAN THE SECTION TITLE

    section titles:
    {'chai-aur-c', 'chai-aur-django', 'chai-aur-html', 'chai-aur-devops', 'chai-aur-sql', 'chai-aur-git'}



### Test on one chunk of the split documents

In [103]:
client = OpenAI(api_key=os.getenv("OPEN_API_KEY"))

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT_TO_DECIDE_COLLECTION},
        {"role": "user", "content": split_docs[455].page_content}
    ]
)

# Print the response
print("\nUser Query:", split_docs[455].page_content)
print("\nAssistant Response:", response.choices[0].message.content)


User Query: Essential VS Code Extensions for HTML
Enhance your HTML coding experience with these recommended VS Code extensions:

HTML Snippets – Quickly insert common HTML structures.
Live Server – Automatically refreshes your browser as you edit your HTML.

Emmet for HTML Productivity
Emmet is built-in with VS Code and allows you to rapidly generate HTML code using short abbreviations. This significantly speeds up your coding workflow. No need to manually type lengthy tags—let Emmet handle it for you.
Spend some time getting familiar with Emmet shortcuts to greatly improve your productivity. Learn more about Emmet at the official website.
Start your journey with ChaiCode 
All of our courses are available on chaicode.com. Feel free to check them out.
     
Complied by: Hitesh Choudhary Last updated: Apr 20, 2025   PreviousWelcomeNext Emmet Crash Course    
Contribute
    
Community
   
Sponsor

Assistant Response: chai-aur-html
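The model's reply is later used directly as a Qdrant collection name, so a reply with stray quotes, whitespace, or extra words would silently create an unintended collection. A minimal sketch of a guard is shown below; `pick_collection` and its `fallback` parameter are hypothetical names, not part of the original notebook.

```python
def pick_collection(raw_label, allowed, fallback=None):
    """Clean the model's reply and accept it only if it is a known section.

    Hypothetical helper, not part of the original notebook.
    """
    # Remove surrounding whitespace, then any wrapping quotes or backticks
    cleaned = raw_label.strip().strip("'\"`").lower()
    return cleaned if cleaned in allowed else fallback


allowed_sections = {"chai-aur-html", "chai-aur-git", "chai-aur-sql"}
print(pick_collection(" 'chai-aur-html'\n", allowed_sections))
print(pick_collection("I think this is chai-aur-git", allowed_sections))
```

A caller could skip the chunk, retry the request, or route it to a catch-all collection when the result is `None`.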


In [104]:
for chunk in split_docs:
    # Ask the model which section (collection) this chunk belongs to
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT_TO_DECIDE_COLLECTION},
            {"role": "user", "content": chunk.page_content}
        ]
    )

    # Strip stray whitespace so the reply is usable as a collection name
    collection_for_chunk = response.choices[0].message.content.strip()

    # Creates the collection on first use; later calls add to the existing one
    QdrantVectorStore.from_documents(
        documents=[chunk],
        url="http://localhost:6333",
        collection_name=collection_for_chunk,
        embedding=embedder  # OpenAI embedder
    )

print('Ingestion done')

KeyboardInterrupt: 
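The loop above performs one Qdrant write per chunk, which is slow across 553 chunks and may be why the run was interrupted. One possible restructuring, sketched here under that assumption, is to collect the labels first and then write each collection once. `group_by_label` is a hypothetical helper, and plain strings stand in for the `Document` chunks.

```python
from collections import defaultdict


def group_by_label(chunks, labels):
    """Group chunks by their assigned collection label, preserving order.

    Hypothetical helper, not part of the original notebook.
    """
    grouped = defaultdict(list)
    for chunk, label in zip(chunks, labels):
        grouped[label].append(chunk)
    return dict(grouped)


# Plain strings stand in for Document objects in this sketch
demo_chunks = ["<ul> tag", "git status", "<table> tag", "git --version"]
demo_labels = ["chai-aur-html", "chai-aur-git", "chai-aur-html", "chai-aur-git"]
grouped = group_by_label(demo_chunks, demo_labels)
print(grouped)
```

Each value in `grouped` could then be passed as `documents=` to a single `QdrantVectorStore.from_documents` call for that collection name.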

### LLM Generated Collections

![rag flow image](llm-generated-collections.png "RAG FLOW")

# Data Retrieval

In [106]:
user_query = "what are the common HTML tags"

In [107]:
SYSTEM_PROMPT_TO_GET_DATA = f"""
    You are a helpful AI assistant. Given the user's query, categorize it into exactly one of the provided section titles.
    
    DO NOT ADD ANY TEXT OTHER THAN THE SECTION TITLE

    section titles:
    {set(section_titles)}
"""

In [108]:
# Initialize the OpenAI client
client = OpenAI(api_key=os.getenv("OPEN_API_KEY"))

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT_TO_GET_DATA},
        {"role": "user", "content": user_query}
    ],
    temperature=0.7
)

# Print the response
print("\nUser Query:", user_query)
print("\nAssistant Response:", response.choices[0].message.content)

collection_name = response.choices[0].message.content


User Query: what are the common HTML tags

Assistant Response: chai-aur-html


### Connect to the collection suggested by the model

In [114]:
# Open the existing collection with QdrantVectorStore (imported earlier from
# langchain_qdrant) instead of the deprecated langchain.vectorstores.Qdrant
vector_store = QdrantVectorStore.from_existing_collection(
    embedding=embedder,  # OpenAI embedder
    collection_name=collection_name,
    url="http://localhost:6333",  # or your remote Qdrant URL
)

results = vector_store.similarity_search(user_query)

In [115]:
results

[Document(metadata={'source': 'https://chaidocs.vercel.app/youtube/chai-aur-html/html-tags/', 'title': 'Common HTML Tags | Chai aur Docs', 'description': 'Actionable things you should know.', 'language': 'en', '_id': '8b6b8a65-a1f3-4fdd-8a10-43c39c0e1410', '_collection_name': 'chai-aur-html'}, page_content='Tags for Text Content    HTML Tags for Lists    HTML Tags for Tables    HTML Tags for Forms    HTML Tags for Media    HTML Tags for Linking and Metadata     Script Tag Variations      HTML Tags for Semantic and Meta Content    Attributes for HTML Tags     HTML5 attributes      HTML5 tags    Conclusion           Common HTML Tags         Focus on the Essentials   Remember, you don’t need to master HTML to become a web developer. Focus on the basics and move on quickly. HTML5 only adds a few new tags and attributes to the classic HTML vocabulary. Although accessibility is very important, we’ll cover that later—especially if you plan to build web applications with JavaScript or modern f

In [118]:
for r in results:
    print(r.page_content, r.metadata.get('source'))

Tags for Text Content    HTML Tags for Lists    HTML Tags for Tables    HTML Tags for Forms    HTML Tags for Media    HTML Tags for Linking and Metadata     Script Tag Variations      HTML Tags for Semantic and Meta Content    Attributes for HTML Tags     HTML5 attributes      HTML5 tags    Conclusion           Common HTML Tags         Focus on the Essentials   Remember, you don’t need to master HTML to become a web developer. Focus on the basics and move on quickly. HTML5 only adds a few new tags and attributes to the classic HTML vocabulary. Although accessibility is very important, we’ll cover that later—especially if you plan to build web applications with JavaScript or modern frameworks. https://chaidocs.vercel.app/youtube/chai-aur-html/html-tags/
HTML Tags for Lists

<ul> – Unordered list
<ol> – Ordered list
<li> – List item

HTML Tags for Tables

<table> – Table container
<tr> – Table row
<td> – Table cell

HTML Tags for Forms

<form> – Form container
<input> – Input field
<text

In [119]:
user_query = "How to check git version"

In [122]:
# Initialize the OpenAI client
client = OpenAI(api_key=os.getenv("OPEN_API_KEY"))

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT_TO_GET_DATA},
        {"role": "user", "content": user_query}
    ],
    temperature=0.7
)

# Print the response
print("\nUser Query:", user_query)
print("\nAssistant Response:", response.choices[0].message.content)

collection_name = response.choices[0].message.content


# Open the existing collection with QdrantVectorStore (imported earlier from
# langchain_qdrant) instead of the deprecated langchain.vectorstores.Qdrant
vector_store = QdrantVectorStore.from_existing_collection(
    embedding=embedder,  # OpenAI embedder
    collection_name=collection_name,
    url="http://localhost:6333",  # or your remote Qdrant URL
)

results = vector_store.similarity_search(user_query)

print("--------------------------------")

for r in results:
    print(f"page_content: {r.page_content}, source: {r.metadata.get('source')}")


User Query: How to check git version

Assistant Response: chai-aur-git
--------------------------------
page_content: Check your git version
To check your git version, you can run the following command:
Terminal windowgit --version
This command will display the version of git installed on your system. Git is a very stable software and don’t get any breaking changes in majority of the cases, at least in my experience.
Repository
A repository is a collection of files and directories that are stored together. It is a way to store and manage your code. A repository is like a folder on your computer, but it is more than just a folder. It can contain other files, folders, and even other repositories. You can think of a repository as a container that holds all your code.
There is a difference between a software on your system vs tracking a particular folder on your system. At any point you can run the following command to see the current state of your repository:
Terminal windowgit status, s
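The two retrieval cells above repeat the same classify-then-search steps. A rough sketch of how they might be folded into one function follows. The names `route_and_search`, `classify`, and `search` are hypothetical: `classify` stands in for the chat-completion call and `search` for opening the collection and calling `similarity_search`, so the example runs without OpenAI or Qdrant.

```python
def route_and_search(query, classify, search, allowed_sections):
    """Classify the query into a section, then search only that collection.

    classify(query) -> str and search(collection_name, query) -> list are
    hypothetical callables, not part of the original notebook.
    """
    collection_name = classify(query).strip()
    if collection_name not in allowed_sections:
        raise ValueError(f"Unexpected section from router: {collection_name!r}")
    return collection_name, search(collection_name, query)


# Stubs so the sketch runs offline
def fake_classify(query):
    return "chai-aur-git" if "git" in query.lower() else "chai-aur-html"


def fake_search(collection_name, query):
    return [f"[{collection_name}] result for: {query}"]


sections = {"chai-aur-git", "chai-aur-html"}
name, hits = route_and_search("How to check git version", fake_classify, fake_search, sections)
print(name, hits)
```

Passing the classifier and the search step as arguments keeps the routing logic testable on its own; in the notebook they would wrap the `client.chat.completions.create` call and the Qdrant lookup.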