AI-Powered Study Assistant using Retrieval-Augmented Generation (RAG)

Course: Computer Networks

This project implements an AI-powered study assistant designed to help students
answer questions from Computer Networks course materials using Retrieval-Augmented Generation (RAG).

The system processes academic documents such as lecture notes and textbooks,
retrieves relevant content using vector similarity search, and generates
answers using an open-source language model.

Technology Stack

This project is implemented entirely using open-source tools:

Language Model: Mistral (via Ollama, running locally)
Embedding Model: Sentence-Transformers (`all-MiniLM-L6-v2`)
Vector Database: ChromaDB (embedded, local)
Document Processing: PDF-based academic materials
Environment: Jupyter Notebook (Python)

Open-source models were chosen to avoid API costs and to better understand
the practical constraints of local deployment.

Part 1: Data Collection and Understanding

1.1 Dataset Overview

For this project, I collected academic materials from my Computer Networks course.
The dataset consists of lecture notes and reference material in PDF format, covering multiple layers of the network stack.

Types of documents:
- Lecture slide PDFs provided during coursework
- Reference-style notes explaining networking concepts
- Text-heavy PDFs with occasional diagrams and tables

The documents primarily cover the following topics:
- OSI and TCP/IP reference models
- Physical and Data Link layers
- Network layer concepts such as IP and routing
- Transport layer protocols including TCP and UDP

1.2 Document Structure and Formatting

Most documents follow a semi-structured format with headings, bullet points,
and short explanatory paragraphs. However, the structure is not consistent
across all PDFs.

Some documents are slide-based with minimal text per page, while others are
dense text documents resembling textbook chapters. Diagrams are often embedded
as images, and tables are sometimes split across pages.

1.3 Observed Challenges in the Dataset

After inspecting the raw PDFs, I observed several challenges that affect
automatic text processing:

1. Inconsistent formatting: Different PDFs use different heading styles, making it difficult to rely on document structure alone.
2. Broken text flow: In some cases, sentences are split across lines or pages
   during text extraction.
3. Tables and diagrams: Tables are converted into plain text with lost alignment,
   and diagrams do not contain meaningful extractable text.
4. Technical terminology: Networking concepts include abbreviations and protocol
   names (e.g., TCP, UDP, ARP) that require accurate retrieval to avoid confusion.

These challenges reflect real-world academic data and motivate the need for
careful chunking, retrieval, and prompt design in later stages of the project.

In [1]:
import pdfplumber
from pathlib import Path

data_path = Path("../data/raw")

# Sort so the file order (and the sample inspected below) is stable across runs
pdf_files = sorted(data_path.glob("*.pdf"))
print(f"Found {len(pdf_files)} PDF files")

sample_pdf = pdf_files[0]
print(f"Inspecting file: {sample_pdf.name}")

with pdfplumber.open(sample_pdf) as pdf:
    first_page = pdf.pages[0]
    # extract_text() may return None or an empty string for image-only pages
    text = first_page.extract_text() or ""

print("----- Extracted Text (First Page) -----")
print(text[:1500])

Found 6 PDF files
Inspecting file: Computer-Networks-Notes-3-TutorialsDuniya.pdf
----- Extracted Text (First Page) -----
Download FREE Computer Science Notes at TutorialsDuniya.com
UNIT- I
Introduction
An interconnected collection of autonomous computers is called a computer network. Two
m
computers are said to be interconnected if they are able to exchange the information. If one
computer can forcibly start, stop and control another one, the computers are not autonomous. A
system with one control unit and many slaves is not a network, nor iso a large computer with
remote printers and terminals.
c
.
In a Distributed system, the existence of multiple autonomoaus computers is transparent(i.e., not
visible) to the user. He can type a command to run a program and it runs. It is up to the operating
y
system to select the best processor, find and transport all the files to that processor, and put the
i
results in the appropriate place.
n
The user of a distributed system is not aware of that 

In [2]:
from pathlib import Path
import pdfplumber

data_path = Path("../data/raw")

documents = []

for pdf_file in data_path.glob("*.pdf"):
    with pdfplumber.open(pdf_file) as pdf:
        full_text = ""
        for page in pdf.pages:
            text = page.extract_text()
            if text:
                full_text += text + "\n"
        
        documents.append({
            "source": pdf_file.name,
            "text": full_text
        })

print(f"Total documents loaded: {len(documents)}")
for doc in documents:
    print(f"{doc['source']} → {len(doc['text'])} characters")

Total documents loaded: 6
Computer-Networks-Notes-3-TutorialsDuniya.pdf → 224599 characters
ComputerNetworks.pdf → 2270403 characters
Unit-1.pdf → 33729 characters
Unit-3.pdf → 36733 characters
Unit-4.pdf → 54481 characters
Unit-5.pdf → 66347 characters


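The extracted text contains extraction artifacts: stray single characters on their own lines (possibly from a page watermark), letters fused into words (for example, "iso" and "autonomoaus" in the sample above), inconsistent line breaks, and a repeated site banner. This reflects the real-world nature of academic PDFs and will influence chunking and retrieval quality in later stages.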
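
The stray single-letter lines and the repeated site banner in the sample output above could be filtered with a small heuristic before chunking. The following is a minimal sketch and was not part of the original pipeline; the function name `clean_extracted_text` and the banner pattern are assumptions made for illustration, and the sample string is copied from the first-page output.

```python
import re

# Assumption: the repeated site banner seen in the sample output is noise.
BANNER = re.compile(r"^Download FREE .* at TutorialsDuniya\.com$")

def clean_extracted_text(text):
    """Drop banner lines and one-character lines, then rejoin the rest."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if len(stripped) <= 1:
            # Empty lines and stray single characters (e.g. watermark letters).
            continue
        if BANNER.match(stripped):
            continue
        kept.append(stripped)
    # Joining with spaces repairs sentences that were split across lines.
    return " ".join(kept)

sample = (
    "Download FREE Computer Science Notes at TutorialsDuniya.com\n"
    "UNIT- I\n"
    "Introduction\n"
    "An interconnected collection of autonomous computers is called a computer network. Two\n"
    "m\n"
    "computers are said to be interconnected if they are able to exchange the information.\n"
)
cleaned = clean_extracted_text(sample)
print(cleaned)
```

This heuristic has clear limits: it would also drop a legitimate one-character line, it flattens headings into the running text, and it cannot repair letters that were fused into words during extraction.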

In [3]:
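def fixed_size_chunking(text, chunk_size=500, overlap=100):
    """Split text into chunk_size-character windows that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        # A stride of zero or less would make the loop below run forever.
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    stride = chunk_size - overlap
    chunks = []
    start = 0

    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += stride

    return chunks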

In [4]:
all_chunks = []

for doc in documents:
    chunks = fixed_size_chunking(doc["text"])
    for i, chunk in enumerate(chunks):
        all_chunks.append({
            "source": doc["source"],
            "chunk_id": i,
            "text": chunk
        })

print(f"Total chunks created: {len(all_chunks)}")

Total chunks created: 6719


In [5]:
print(all_chunks[0]["source"])
print(all_chunks[0]["text"][:500])

Computer-Networks-Notes-3-TutorialsDuniya.pdf
Download FREE Computer Science Notes at TutorialsDuniya.com
UNIT- I
Introduction
An interconnected collection of autonomous computers is called a computer network. Two
m
computers are said to be interconnected if they are able to exchange the information. If one
computer can forcibly start, stop and control another one, the computers are not autonomous. A
system with one control unit and many slaves is not a network, nor iso a large computer with
remote printers and terminals.
c
.
In a Distribut


For the baseline RAG implementation, I used a fixed-size chunking strategy with 500 characters per chunk and 100 characters overlap. This approach is simple to implement and serves as a reference point for later experiments. However, it does not preserve sentence or semantic boundaries, which may affect retrieval quality.
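
As a sanity check on the numbers above, the chunk count for this strategy follows from the stride (chunk size minus overlap, here 400 characters): a document of n characters yields ceil(n / 400) chunks. The sketch below applies this to the character counts printed when the PDFs were loaded; `expected_chunk_count` is a helper name introduced here for illustration.

```python
def expected_chunk_count(num_chars, chunk_size=500, overlap=100):
    """Number of windows a sliding-window chunker produces: one per stride step."""
    stride = chunk_size - overlap
    return -(-num_chars // stride)  # ceiling division without floats

# Character counts as printed when the PDFs were loaded.
doc_lengths = {
    "Computer-Networks-Notes-3-TutorialsDuniya.pdf": 224599,
    "ComputerNetworks.pdf": 2270403,
    "Unit-1.pdf": 33729,
    "Unit-3.pdf": 36733,
    "Unit-4.pdf": 54481,
    "Unit-5.pdf": 66347,
}

per_doc = {name: expected_chunk_count(n) for name, n in doc_lengths.items()}
total_chunks = sum(per_doc.values())

for name, count in per_doc.items():
    print(f"{name}: {count} chunks")
print(total_chunks)  # 6719, the total reported above
```

The per-document breakdown also shows that ComputerNetworks.pdf contributes 5677 of the 6719 chunks, so roughly 84% of the baseline index comes from a single file.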

In [6]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")



In [7]:
texts = [chunk["text"] for chunk in all_chunks]

embeddings = embedding_model.encode(
    texts,
    show_progress_bar=True
)

print(f"Embeddings shape: {embeddings.shape}")

Batches: 100%|██████████| 210/210 [02:58<00:00,  1.18it/s]

Embeddings shape: (6719, 384)





For the baseline system, I used the open-source SentenceTransformers model all-MiniLM-L6-v2, which produces 384-dimensional embeddings. This model provides a good trade-off between embedding quality and computational efficiency: all 6,719 baseline chunks were embedded locally in about three minutes. Using a local embedding model also avoids API costs and makes the results easier to reproduce.

In [8]:
import chromadb
from chromadb.config import Settings

In [9]:
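# Assuming ChromaDB 0.4 or later: passing persist_directory to Client() alone
# gives an in-memory client, so PersistentClient is used to write to disk.
chroma_client = chromadb.PersistentClient(
    path="../vectorstore",
    settings=Settings(anonymized_telemetry=False)
)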

collection = chroma_client.get_or_create_collection(
    name="computer_networks_baseline"
)

In [11]:
ids = [f"chunk_{i}" for i in range(len(all_chunks))]

metadatas = [
    {
        "source": chunk["source"],
        "chunk_id": chunk["chunk_id"]
    }
    for chunk in all_chunks
]

batch_size = 500

for i in range(0, len(all_chunks), batch_size):
    batch_texts = texts[i:i + batch_size]
    batch_embeddings = embeddings[i:i + batch_size]
    batch_metadatas = metadatas[i:i + batch_size]
    batch_ids = ids[i:i + batch_size]

    collection.add(
        documents=batch_texts,
        embeddings=batch_embeddings.tolist(),
        metadatas=batch_metadatas,
        ids=batch_ids
    )

    print(f"Inserted batch {i // batch_size + 1}")
print(f"Stored {len(ids)} chunks in ChromaDB")

Inserted batch 1
Inserted batch 2
Inserted batch 3
Inserted batch 4
Inserted batch 5
Inserted batch 6
Inserted batch 7
Inserted batch 8
Inserted batch 9
Inserted batch 10
Inserted batch 11
Inserted batch 12
Inserted batch 13
Inserted batch 14
Stored 6719 chunks in ChromaDB


In [13]:
print(collection.count())

6719


In [14]:
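def retrieve_chunks(query, n_results=5):
    # query_texts is embedded by the collection's embedding function. No custom
    # function was set, so ChromaDB uses its default ONNX build of
    # all-MiniLM-L6-v2, the same model used for the stored chunk embeddings.
    results = collection.query(
        query_texts=[query],
        n_results=n_results
    )

    # Results are nested per query; index 0 selects the single query sent above.
    retrieved_docs = results["documents"][0]
    retrieved_metadata = results["metadatas"][0]

    return retrieved_docs, retrieved_metadata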

In [15]:
query = "What is a computer network?"
docs, meta = retrieve_chunks(query)

for i, doc in enumerate(docs):
    print(f"\n--- Retrieved Chunk {i+1} ---")
    print("Source:", meta[i]["source"])
    print(doc[:300])




--- Retrieved Chunk 1 ---
Source: Unit-1.pdf
ween nodes are established using either cable media or wireless media. The best-
known computer network is the Internet.
To build a computer network is defining what a network is and understanding how it is
used to help a business meet its objectives. A network is a connected collection of devices a

--- Retrieved Chunk 2 ---
Source: Unit-1.pdf
Smartzworld.com Smartworld.asia
Course Material –Lecture Notes
UNIT I
FUNDAMENTALS & LINK LAYER
Building a network
A computer network or data network is a telecommunications network which allows
computers to exchange data. In computer networks, networked computing devices pass data to
each other alo

--- Retrieved Chunk 3 ---
Source: Computer-Networks-Notes-3-TutorialsDuniya.pdf
Download FREE Computer Science Notes at TutorialsDuniya.com
UNIT- I
Introduction
An interconnected collection of autonomous computers is called a computer network. Two
m
computers are said to be interconnected if they are abl

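For the baseline system, I used simple top-k semantic similarity search with ChromaDB (k = 5). The retriever returns the chunks whose embeddings are closest to the query embedding, without any re-ranking or filtering. Because the collection was created with default settings, ChromaDB ranks by squared L2 distance rather than cosine similarity. all-MiniLM-L6-v2 normalizes its embeddings to unit length, so the two metrics should produce the same ordering here.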
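
The relationship between cosine similarity and squared L2 distance for unit-length vectors can be checked with a few lines of plain Python: for unit vectors a and b, ||a - b||^2 = 2 - 2 cos(a, b), so sorting by ascending distance and by descending similarity gives the same order. The vectors below are made-up 3-dimensional stand-ins for real embeddings, and all names are illustrative.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    # For unit-length vectors the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up 3-dimensional vectors standing in for 384-dimensional embeddings.
query = normalize([1.0, 2.0, 0.5])
chunks = {
    "chunk_a": normalize([1.0, 1.8, 0.4]),
    "chunk_b": normalize([0.1, 0.2, 3.0]),
    "chunk_c": normalize([2.0, 0.5, 0.5]),
}

by_cosine = sorted(chunks, key=lambda k: cosine_similarity(query, chunks[k]), reverse=True)
by_l2 = sorted(chunks, key=lambda k: squared_l2(query, chunks[k]))

print(by_cosine)
print(by_l2)
```

If the embeddings were not normalized, the two rankings could differ, and the distance metric would need to be chosen explicitly when creating the collection.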

In [16]:
def build_prompt(context_chunks, question):
    context = "\n\n".join(context_chunks)
    
    prompt = f"""
You are a study assistant for a Computer Networks course.
Answer the question using ONLY the context provided below.
If the answer is not present, say that clearly.

Context:
{context}

Question:
{question}

Answer:
"""
    return prompt

In [17]:
import ollama

def generate_answer(prompt, model="mistral"):
    response = ollama.chat(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response["message"]["content"]

In [19]:
question = "What is a computer network?"

retrieved_docs, _ = retrieve_chunks(question)
prompt = build_prompt(retrieved_docs, question)
answer = generate_answer(prompt)

print(answer)

 A computer network is an interconnected collection of autonomous computers that allows them to exchange data. In this context, it refers to a telecommunications network which enables computers to communicate and exchange data using cable media or wireless media. The best-known example of such a network is the Internet.


Evaluation Questions: 

What is a computer network?

What is the difference between a LAN and a WAN?

Explain the OSI model and its layers.

What is the role of the Transport Layer?

What is the difference between TCP and UDP?

What is packet switching?

What causes network congestion?

What is the purpose of DNS?

How does error detection work in data communication?

When would you prefer UDP over TCP?

In [None]:
test_questions = [
    "What is a computer network?",
    "What is the difference between a LAN and a WAN?",
    "Explain the OSI model and its layers.",
    "What is the role of the Transport Layer?",
    "What is the difference between TCP and UDP?",
    "What is packet switching?",
    "What causes network congestion?",
    "What is the purpose of DNS?",
    "How does error detection work in data communication?",
    "When would you prefer UDP over TCP?"
]

baseline_answers = {}

for q in test_questions:
    docs, _ = retrieve_chunks(q)
    prompt = build_prompt(docs, q)
    answer = generate_answer(prompt)
    baseline_answers[q] = answer
    
    print("\n" + "="*80)
    print("QUESTION:", q)
    print("ANSWER:\n", answer)


QUESTION: What is a computer network?
ANSWER:
  A computer network is an interconnected collection of autonomous computers that are able to exchange information. It allows computers to communicate with each other using cable media or wireless media for data transfer in the form of packets, as described in the context provided. The best-known example of a computer network is the Internet.

QUESTION: What is the difference between a LAN and a WAN?
ANSWER:
  A Local Area Network (LAN) is typically used within a single building or a small geographical area to connect devices such as personal computers, workstations, and servers. On the other hand, a Wide Area Network (WAN) spans larger geographic areas, often connecting multiple LANs or networks across cities, countries, or even globally. The main difference lies in their scale and the distances they cover. A LAN is usually owned and controlled by a single organization or individual, while a WAN may be operated by service providers like I

Baseline RAG Results

The baseline RAG system performed well on direct factual questions such as definitions. However, answers to conceptual and multi-layer questions were sometimes incomplete or verbose. In some cases, chunks that start or end mid-sentence and text-extraction artifacts affected answer clarity. These observations motivate improvements in chunking and prompting strategies explored in later experiments.

In [5]:
import re

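def sentence_based_chunking(text, min_length=40):
    """Split text into sentences and keep those with at least min_length characters."""
    # Split after '.', '!' or '?' when followed by whitespace. This is a heuristic:
    # abbreviations such as "e.g." also trigger a split, and line breaks inside a
    # sentence are kept as-is.
    sentences = re.split(r'(?<=[.!?])\s+', text)

    chunks = []
    for sentence in sentences:
        sentence = sentence.strip()
        # Fragments shorter than min_length (headings, stray characters, but also
        # short sentences) are dropped.
        if len(sentence) >= min_length:
            chunks.append(sentence)

    return chunks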

In [7]:
import pdfplumber
from pathlib import Path

data_path = Path("../data/raw")

documents = []

for pdf_path in data_path.glob("*.pdf"):
    with pdfplumber.open(pdf_path) as pdf:
        text = ""
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"

    documents.append({
        "source": pdf_path.name,
        "text": text
    })

print(f"Total documents loaded: {len(documents)}")

Total documents loaded: 6


In [8]:
import re

def sentence_based_chunking(text, min_length=40):
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s.strip() for s in sentences if len(s.strip()) >= min_length]

In [9]:
sentence_chunks = []

for doc in documents:
    chunks = sentence_based_chunking(doc["text"])
    for i, chunk in enumerate(chunks):
        sentence_chunks.append({
            "source": doc["source"],
            "chunk_id": i,
            "text": chunk
        })

print(f"Total sentence-based chunks: {len(sentence_chunks)}")

Total sentence-based chunks: 18418


In [10]:
sentence_chunks[0]["text"]

'Download FREE Computer Science Notes at TutorialsDuniya.com\nUNIT- I\nIntroduction\nAn interconnected collection of autonomous computers is called a computer network.'

In [11]:
texts = [chunk["text"] for chunk in sentence_chunks]

metadatas = [
    {
        "source": chunk["source"],
        "chunk_id": chunk["chunk_id"]
    }
    for chunk in sentence_chunks
]

ids = [f"sent_chunk_{i}" for i in range(len(sentence_chunks))]

print(len(texts), len(metadatas), len(ids))

18418 18418 18418


In [12]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = embedding_model.encode(
    texts,
    show_progress_bar=True
)

Batches: 100%|██████████| 576/576 [02:38<00:00,  3.64it/s]


In [13]:
import chromadb

chroma_client = chromadb.Client()

sentence_collection = chroma_client.create_collection(
    name="computer_networks_sentence_chunks"
)

In [None]:
batch_size = 1000

for i in range(0, len(texts), batch_size):
    sentence_collection.add(
        documents=texts[i:i+batch_size],
        embeddings=embeddings[i:i+batch_size].tolist(),
        metadatas=metadatas[i:i+batch_size],
        ids=ids[i:i+batch_size]
    )

print("Sentence-based chunks stored in ChromaDB")

Sentence-based chunks stored in ChromaDB


In [15]:
def retrieve_sentence_chunks(query, k=5):
    results = sentence_collection.query(
        query_texts=[query],
        n_results=k
    )
    return results["documents"][0], results["metadatas"][0]

In [16]:
question = "What is a computer network and how is it different from a distributed system?"

In [17]:
def build_prompt(retrieved_docs, question):
    context = "\n".join(retrieved_docs)
    prompt = f"""
Use the following context to answer the question clearly and concisely.

Context:
{context}

Question:
{question}

Answer:
"""
    return prompt

In [20]:
retrieved_docs, _ = retrieve_sentence_chunks(question)
prompt = build_prompt(retrieved_docs, question)

answer_sentence = generate_answer(prompt)

print(answer_sentence)

 A computer network is a collection of interconnected devices, such as computers and servers, that can communicate with each other to exchange data. It is primarily based on hardware connections like cables or wireless links.

On the other hand, a distributed system is a collection of independent computers that appear to its users as a single coherent system. The distinction between a network and a distributed system lies not just in the hardware but also in the software (operating system) they use. In a distributed system, processes running on various nodes interact with each other using a set of protocols, creating an illusion of a unified system.

The goals of a computer network include:
1. Communication: Facilitate communication and data exchange between different devices.
2. Resource Sharing: Allow sharing of resources like printers, storage devices, and applications across the networked devices.
3. Collaboration: Enable collaboration among users in real-time by facilitating inter

Retrieval-Augmented Generation (RAG) Study Assistant
Subject: Computer Networks

1. Dataset Preparation
- PDF collection
- Text extraction

2. Baseline RAG (Fixed-size Chunking)
- Chunking strategy
- Embedding & retrieval
- Baseline answer

3. Improved RAG (Sentence-based Chunking)
- Sentence-based chunking
- Embedding & retrieval
- Improved answer

4. Comparison & Analysis

5. Conclusion

In this assignment, a Retrieval-Augmented Generation system was implemented for the Computer Networks domain using open-source tools. The baseline used fixed-size chunking (500 characters with a 100-character overlap). Although it generated relevant answers, chunks often began or ended mid-sentence, so the retrieved context contained incomplete semantic units.

With sentence-based chunking, chunks are split at sentence boundaries, and in the example examined the generated response was clear and easy to follow. This comparison is qualitative: the two runs used different questions and different prompt templates, so the improvement cannot be attributed to chunking alone. Even so, the results suggest that thoughtful preprocessing and chunking strategies can improve the effectiveness of RAG systems without relying on proprietary APIs.
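
One possible variation, not evaluated in the experiments above, sits between the two strategies compared here: packing whole sentences into chunks up to a character budget, so that chunks keep sentence boundaries but carry more context than a single sentence. The sketch below is only an illustration; the function name `packed_sentence_chunking`, the `max_chars` value, and the sample text are hypothetical choices.

```python
import re

def packed_sentence_chunking(text, max_chars=500):
    """Group consecutive sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Illustrative text, paraphrased from the course notes.
sample = (
    "A computer network is an interconnected collection of autonomous computers. "
    "Two computers are interconnected if they can exchange information. "
    "In a distributed system, the existence of multiple computers is transparent to the user."
)
packed = packed_sentence_chunking(sample, max_chars=150)
for chunk in packed:
    print(repr(chunk))
```

With a 150-character budget the first two sentences share a chunk and the third starts a new one. Whether this helps retrieval on the course PDFs would need to be tested with the same questions and prompt used for the baseline.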