In [1]:
import os


from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma

All support for the `google.generativeai` package has ended. It will no longer be receiving 
updates or bug fixes. Please switch to the `google.genai` package as soon as possible.
See README for more details:

https://github.com/google-gemini/deprecated-generative-ai-python/blob/main/README.md

  import google.generativeai as genai


In [2]:
!pip install -U google-genai

Collecting google-genai
  Downloading google_genai-1.64.0-py3-none-any.whl.metadata (53 kB)
Collecting websockets<15.1.0,>=13.0.0 (from google-genai)
  Downloading websockets-15.0.1-cp310-cp310-win_amd64.whl.metadata (7.0 kB)
Collecting sniffio (from google-genai)
  Downloading sniffio-1.3.1-py3-none-any.whl.metadata (3.9 kB)
Downloading google_genai-1.64.0-py3-none-any.whl (728 kB)

In [5]:
from google import genai

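# Read the API key from the environment; never hard-code keys in a notebook.
# (The variable name GEMINI_API_KEY is an assumption; set it before starting Jupyter.)
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])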

print("Gemini ready")

Gemini ready


In [6]:
documents = []

data_folder = "data"

for file in os.listdir(data_folder):
    
    if file.endswith(".pdf"):
        
        loader = PyPDFLoader(os.path.join(data_folder, file))
        
        documents.extend(loader.load())

print("Total pages loaded:", len(documents))

Total pages loaded: 178


In [7]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100
)

chunks = text_splitter.split_documents(documents)

print("Chunks created:", len(chunks))

Chunks created: 898


In [8]:
embedding = SentenceTransformerEmbeddings(
    model_name="all-MiniLM-L6-v2"
)

print("Embedding ready")

Loading weights: 100%|█████████████████████| 103/103 [00:00<00:00, 151.87it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |
------------------------+------------+
embeddings.position_ids | UNEXPECTED |

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Embedding ready


In [9]:
vectorstore = Chroma.from_documents(
    chunks,
    embedding,
    persist_directory="./db"
)

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

print("Vector DB ready")

Vector DB ready


In [12]:
def ask_dbms_assistant(question):

    # invoke() is the current retriever entry point; get_relevant_documents() is deprecated
    docs = retriever.invoke(question)

    context = "\n".join([doc.page_content for doc in docs])

    prompt = f"""
    You are a DBMS expert assistant.

    Answer using the context below.

    Context:
    {context}

    Question:
    {question}

    Answer:
    """

    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt
    )

    return response.text

In [13]:
question = "What is normalization in DBMS?"

answer = ask_dbms_assistant(question)

print(answer)

Database normalization is the process of organizing the attributes of the database to reduce or eliminate data redundancy (having the same data but at different places).

It is crucial because data redundancy unnecessarily increases the size of the database and can lead to inconsistency problems during insert, delete, and update operations. Normalization helps to eliminate these anomalies and improve the overall quality and design of the database.

According to E.F.Codd, the inventor of the Relational Database, the goals of normalization include:
*   Vacating all repeated data from the database.
*   Removing undesirable deletion, insertion, and update anomalies.
*   Making a proper and useful relationship between tables.

By breaking down complex data structures into simpler tables, normalization makes it easier to manage, update, and retrieve data, leading to an improved and more flexible database design.


In [15]:
# Create large chunks (Experiment 1)

text_splitter_large = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

chunks_large = text_splitter_large.split_documents(documents)

print("Large chunks created:", len(chunks_large))

Large chunks created: 472


In [16]:
# Create vector database with large chunks

vectorstore_large = Chroma.from_documents(
    chunks_large,
    embedding,
    persist_directory="./db_large"
)

retriever_large = vectorstore_large.as_retriever(search_kwargs={"k": 3})

print("Large chunk vector DB ready")

Large chunk vector DB ready


In [17]:
def ask_large_chunks(question):

    docs = retriever_large.invoke(question)

    context = "\n".join([doc.page_content for doc in docs])

    prompt = f"""
    Answer using context below.

    Context:
    {context}

    Question:
    {question}
    """

    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt
    )

    return response.text

In [25]:
questions = [
"What is normalization?",
"What is primary key?",
"Explain ACID properties"
]

for q in questions:

    print("\nQUESTION:", q)

    print("\nFixed chunk answer:")
    print(ask_dbms_assistant(q))

    print("\nLarge chunk answer:")
    print(ask_large_chunks(q))

    print("\n-------------------------")


QUESTION: What is normalization?

Fixed chunk answer:
Database normalization is the process of organizing the attributes of the database to reduce or eliminate data redundancy (having the same data but at different places). It helps to eliminate anomalies, improve the overall quality of the database, ensure data consistency, and simplify data management.

Large chunk answer:
Normalization involves organizing data into tables and applying rules to ensure data is stored in a consistent and efficient manner. By reducing data redundancy and ensuring data integrity, it helps to eliminate anomalies and improve the overall quality of the database.

-------------------------

QUESTION: What is primary key?

Fixed chunk answer:
Based on the context, a primary key is a proper subset chosen from super keys that can be used to identify unique rows (tuples) in a given relationship. The context also states that "Such keys" (which can be used as a primary key) are known as Candidate keys. If a prima

In [None]:
## Experiment 1: Comparison of Chunking Strategies

In this experiment, two different chunk sizes were used to divide the DBMS notes:

- Small chunks (500 characters with a 100-character overlap)
- Large chunks (1000 characters with a 200-character overlap)

The small chunk method created 898 chunks, while the large chunk method created 472 chunks.
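
To make the relationship between chunk size, overlap, and chunk count concrete, here is a minimal sketch of a fixed-window splitter. It is a simplified stand-in, not the `RecursiveCharacterTextSplitter` used in this notebook (which also tries to break on separators such as paragraphs and sentences); `split_fixed` and the sample text are hypothetical.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "x" * 5000  # stand-in for the text of the loaded pages
small = split_fixed(sample, chunk_size=500, chunk_overlap=100)
large = split_fixed(sample, chunk_size=1000, chunk_overlap=200)
print(len(small), len(large))  # prints: 13 6
```

Doubling both the chunk size and the overlap roughly halves the number of chunks, which matches the 898 versus 472 counts above.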

### Observations

Across the three test questions, the small chunk method generally gave more accurate and detailed answers. A likely reason is that smaller chunks let the retriever return narrower, more specific passages.

On the other hand, the large chunk method sometimes provided shorter or incomplete answers. This may happen because larger chunks contain more mixed information, which can make it harder for the retriever to select the most precise content.

For example, in the question about the primary key, the small chunk method provided a clearer explanation, while the large chunk method returned a less complete answer.

### Conclusion

Based on these observations, the small chunk strategy performed better on these DBMS notes. Smaller chunks appear to improve retrieval precision, which helps the model generate more accurate answers, although three questions are too few to generalize from.

However, large chunks can still be useful when concepts require broader context.
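
One way to check the retrieval-precision explanation directly is to look at what each retriever returns before the model sees it. The sketch below is only an illustration: `summarize_retrieval`, `FakeDoc`, and `FakeRetriever` are hypothetical names, and the fake retriever stands in for any object that exposes `.invoke(question)` and returns documents with a `page_content` attribute.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


class FakeRetriever:
    """Hypothetical stand-in exposing the same .invoke(question) method."""

    def __init__(self, docs):
        self._docs = docs

    def invoke(self, question):
        return self._docs


def summarize_retrieval(retriever, question, preview_chars=60):
    """Return rank, length, and a short preview for each retrieved chunk."""
    docs = retriever.invoke(question)
    return [
        {"rank": i, "chars": len(d.page_content), "preview": d.page_content[:preview_chars]}
        for i, d in enumerate(docs, start=1)
    ]


demo = FakeRetriever([
    FakeDoc("A primary key uniquely identifies each tuple."),
    FakeDoc("Candidate keys are minimal super keys."),
])
for row in summarize_retrieval(demo, "What is primary key?"):
    print(row)
```

In this notebook the same helper could be called with `retriever` and `retriever_large` for the same question, so the retrieved passages can be compared side by side rather than only the final answers.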

In [23]:
def ask_improved_prompt(question):

    docs = retriever.invoke(question)

    context = "\n".join([doc.page_content for doc in docs])

    prompt = f"""
    You are an expert in DBMS.

    Explain the answer clearly in simple words.

    Give example if possible.

    Context:
    {context}

    Question:
    {question}

    Provide detailed explanation:
    """

    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt
    )

    return response.text

In [24]:
question = "What is transaction in DBMS?"

print("Basic Prompt Answer:\n")
print(ask_dbms_assistant(question))

print("\n\nImproved Prompt Answer:\n")
print(ask_improved_prompt(question))

Basic Prompt Answer:

A transaction is a collection of operations that performs a single logical function in a database application. It is also defined as a single logical unit of work that accesses and possibly modifies the contents of a database. A set of logically related operations is known as a transaction.


Improved Prompt Answer:

As an expert in DBMS, let me explain what a transaction is in simple words, drawing from the context you've provided.

---

### What is a Transaction in DBMS?

In the world of Database Management Systems (DBMS), a **transaction** is a fundamental concept that ensures the reliability and integrity of your data, especially when multiple operations or users are involved.

Think of it as a **single, indivisible unit of work** that performs a specific logical function in the database. This unit of work comprises a collection of one or more database operations (like reading, inserting, updating, or deleting data) that are *logically related* and must either

In [None]:
## Experiment 2: Prompt Engineering Comparison

In this experiment, I tested how different prompt styles affect the quality of answers generated by the system.

Two prompts were used:

1. Basic prompt – simple instruction to answer using the context
2. Improved prompt – detailed instruction asking the model to explain clearly and provide examples

### Observations

The basic prompt generated correct answers, but the explanations were often short and lacked detail.

The improved prompt generated more detailed, clearer, and easier-to-understand answers. The model also explained concepts in a more structured way, similar to how a teacher explains.

For example, when asked what a transaction is, the improved prompt produced a longer, structured explanation, while the basic prompt returned a three-sentence definition.

### Conclusion

Prompt design has a significant impact on answer quality. A well-written prompt improves clarity, completeness, and usefulness of the generated answers.

This shows that prompt engineering is an important part of building effective RAG systems.
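
For a cleaner comparison, the two prompts can be built from one template so that only the instructions differ. This is a minimal sketch under that assumption: `build_prompt` and the two constants are hypothetical names, and unlike the notebook's prompts both variants end with the same `Answer:` cue.

```python
BASIC_INSTRUCTIONS = (
    "You are a DBMS expert assistant.\n\n"
    "Answer using the context below."
)

IMPROVED_INSTRUCTIONS = (
    "You are an expert in DBMS.\n\n"
    "Explain the answer clearly in simple words.\n\n"
    "Give example if possible."
)


def build_prompt(instructions: str, context: str, question: str) -> str:
    """Assemble a prompt in which only the instruction block varies."""
    return (
        f"{instructions}\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n\n"
        "Answer:"
    )


context = "A transaction is a single logical unit of work."
question = "What is transaction in DBMS?"
basic = build_prompt(BASIC_INSTRUCTIONS, context, question)
improved = build_prompt(IMPROVED_INSTRUCTIONS, context, question)
print(improved)
```

Keeping the context and question formatting identical means any difference in the answers can be attributed to the instructions alone.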