# Task 1 — Domain Choice & Importance  
## Student Support & Education Knowledge Assistant

University students often struggle to locate accurate and timely academic information—such as course policies, advising procedures, immigration requirements, scholarship rules, and campus support services. This domain is important because student-facing information is spread across multiple PDFs, emails, portals, and handbooks. Without a structured retrieval system, students rely on outdated or incomplete knowledge, which leads to confusion and increases the burden on academic advisors and support staff.

Our RAG assistant is designed to answer practical questions such as:
- “What steps do I follow when changing my major?”
- “When should I request an updated I-20?”
- “Where can I find tutoring, mental-health, or academic support services?”
- “What are my graduation or course-planning requirements?”

These questions require retrieval-based grounding because the answers depend on institution-specific documents that an LLM cannot reliably reproduce from its internal knowledge. Policies vary by university and are updated frequently. Grounding responses in the relevant documents improves accuracy, reduces the risk of hallucination, and gives students context-specific guidance that can be traced back to a source.


# Task 2 — Build the Knowledge Base & Ingestion Pipeline

In this project, our RAG system will act as a Student Support & Education Knowledge Assistant. To support this, we first need to build a small, structured knowledge base that represents realistic university resources. In a real deployment, this knowledge base might include PDFs, web pages, policy manuals, email templates, and advising guides. For this assignment, we will simulate a university knowledge base using a set of text documents that capture common student-support topics (e.g., changing majors, I-20 updates, advising procedures).

Our goal in Task 2 is to:
- Organize documents into a simple folder-based knowledge base.
- Implement an ingestion pipeline that:
  - loads all the documents from a folder,
  - extracts their text content, and
  - applies basic normalization (such as removing extra line breaks, spaces, and simple headers/footers).

We will design this pipeline to be reusable, so that future documents can be added to the knowledge base with minimal changes. This mirrors how real-world RAG systems routinely ingest updated policies and resources.
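
Header and footer removal is listed among the normalization goals. One way it could be sketched, assuming headers and footers are lines that repeat verbatim across most files, is shown below. The helper name `strip_repeated_lines`, the `min_fraction` threshold, and the sample texts are illustrative assumptions, not part of the project code.

```python
from collections import Counter
from typing import Dict


def strip_repeated_lines(docs: Dict[str, str], min_fraction: float = 0.75) -> Dict[str, str]:
    """Drop lines that occur in at least `min_fraction` of the documents.

    Assumption: headers/footers are lines repeated verbatim across files.
    """
    # With fewer than two documents there is nothing to compare against.
    if len(docs) < 2:
        return dict(docs)

    # Count in how many documents each distinct non-empty line appears.
    line_doc_counts: Counter = Counter()
    for text in docs.values():
        unique_lines = {line.strip() for line in text.splitlines() if line.strip()}
        line_doc_counts.update(unique_lines)

    threshold = min_fraction * len(docs)
    boilerplate = {line for line, n in line_doc_counts.items() if n >= threshold}

    cleaned = {}
    for name, text in docs.items():
        kept = [line for line in text.splitlines() if line.strip() not in boilerplate]
        cleaned[name] = "\n".join(kept).strip()
    return cleaned


# Hypothetical sample texts sharing one header line and one footer line.
sample_docs = {
    "a.txt": "Example University - Student Handbook\nChange of major steps.\nPage footer",
    "b.txt": "Example University - Student Handbook\nI-20 update steps.\nPage footer",
    "c.txt": "Example University - Student Handbook\nAdvising FAQ.\nPage footer",
}
cleaned_docs = strip_repeated_lines(sample_docs)
print(cleaned_docs["a.txt"])
```

A frequency-based rule like this only makes sense once the knowledge base holds several documents from the same source; it would run on the raw text before line breaks are collapsed.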

In [1]:
import os

KB_DIR = "student_support_kb"

os.makedirs(KB_DIR, exist_ok=True)

docs = {
    "changing_major.txt": """
Title: Changing Your Major - Process Overview

To change your major, students must follow these steps:
1. Review degree requirements for the new major.
2. Schedule an appointment with their academic advisor.
3. Submit the official "Change of Major" form through the student portal.
4. Wait for confirmation email from the Registrar's Office.

International students must also verify how the change may impact their program end date and immigration status.
""",

    "i20_update_process.txt": """
Title: I-20 Update and SEVIS Information

International students should request an updated I-20 when:
- They change their major or academic program.
- Their program end date changes.
- Their funding source changes significantly.

To request an I-20 update:
1. Submit an online request form through the International Services portal.
2. Upload any required financial or academic documents.
3. Allow 7–10 business days for processing.
""",

    "academic_advising_faq.txt": """
Title: Academic Advising – Frequently Asked Questions

Q: How do I find my academic advisor?
A: Students can view their assigned advisor in the student information system under the "Advising" tab.

Q: How often should I meet my advisor?
A: At least once per semester, and whenever considering major changes, withdrawals, or graduation planning.

Q: Can advisors help with course planning?
A: Yes, advisors help map out course sequences, prerequisites, and graduation timelines.
""",

    "student_support_services.txt": """
Title: Student Support & Campus Services Overview

The university provides several support services:
- Tutoring center for academic subjects.
- Writing center for help with assignments.
- Counseling and mental-health services.
- Career services for resume reviews and interview preparation.

Students can access details, hours, and booking links through the student portal's "Support & Resources" section.
"""
}

for filename, content in docs.items():
    path = os.path.join(KB_DIR, filename)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content.strip())

print(f"Knowledge base directory created: {KB_DIR}")
print("Files in knowledge base:")
print(os.listdir(KB_DIR))

Knowledge base directory created: student_support_kb
Files in knowledge base:
['student_support_services.txt', 'changing_major.txt', 'academic_advising_faq.txt', 'i20_update_process.txt']


## Task 2 — Ingestion & Normalization Pipeline

To support our RAG system, we implemented a simple but reusable ingestion pipeline that loads all documents from a designated knowledge base folder and prepares them for downstream processing (chunking, embeddings, and retrieval).

1. We store our university-related resources (e.g., changing majors, I-20 updates, advising FAQs, support services) as individual text files in a folder named `student_support_kb`.  
2. The ingestion function iterates over all `.txt` files in this folder, reads their content, and stores each document in a Python dictionary with:
   - a unique integer `doc_id`,
   - the original `filename`,
   - the unmodified `raw_text`, and
   - the cleaned `normalized_text`.
3. The `normalized_text` field is produced by a lightweight normalization step:
   - strip leading/trailing whitespace,
   - replace line breaks with spaces,
   - collapse runs of whitespace into a single space.

This keeps the original meaning intact while making the text cleaner and more uniform for chunking and embeddings. The final output of Task 2 is a list of normalized documents, which becomes the foundation of our RAG pipeline. Because `os.listdir` returns files in arbitrary order, the `doc_id` assigned to a given file can differ between runs or machines.

In [2]:
import os
import re

KB_DIR = "student_support_kb"

def normalize_text(text: str) -> str:
    """
    Basic normalization for document text:
    - strip leading/trailing whitespace
    - replace newlines with spaces
    - collapse multiple spaces into a single space
    """
    text = text.strip()
    text = text.replace("\n", " ")
    text = re.sub(r"\s+", " ", text)
    return text

def load_documents_from_folder(folder_path: str):
    """
    Load all .txt documents from a folder and return a list of dicts:
    [
      {
        'doc_id': int,
        'filename': str,
        'raw_text': str,
        'normalized_text': str
      },
      ...
    ]
    """
    documents = []
    doc_id = 0

    for filename in os.listdir(folder_path):
        if not filename.lower().endswith(".txt"):
            continue

        file_path = os.path.join(folder_path, filename)
        with open(file_path, "r", encoding="utf-8") as f:
            raw_text = f.read()

        normalized = normalize_text(raw_text)

        documents.append({
            "doc_id": doc_id,
            "filename": filename,
            "raw_text": raw_text,
            "normalized_text": normalized
        })
        doc_id += 1

    return documents

documents = load_documents_from_folder(KB_DIR)

print(f"Total documents loaded: {len(documents)}\n")
for doc in documents:
    print(f"Doc ID: {doc['doc_id']}, File: {doc['filename']}")
    print(f"Preview (normalized): {doc['normalized_text'][:120]}...")
    print("-" * 80)


Total documents loaded: 4

Doc ID: 0, File: student_support_services.txt
Preview (normalized): Title: Student Support & Campus Services Overview The university provides several support services: - Tutoring center fo...
--------------------------------------------------------------------------------
Doc ID: 1, File: changing_major.txt
Preview (normalized): Title: Changing Your Major - Process Overview To change your major, students must follow these steps: 1. Review degree r...
--------------------------------------------------------------------------------
Doc ID: 2, File: academic_advising_faq.txt
Preview (normalized): Title: Academic Advising – Frequently Asked Questions Q: How do I find my academic advisor? A: Students can view their a...
--------------------------------------------------------------------------------
Doc ID: 3, File: i20_update_process.txt
Preview (normalized): Title: I-20 Update and SEVIS Information International students should request an updated I-20 when: - Th

# Task 3 — Implement Multiple Chunking Strategies

In this task, we design and compare multiple chunking strategies for our student-support knowledge base. Chunking is necessary because large documents must be broken down into smaller, semantically meaningful pieces before we compute embeddings and perform retrieval.

For our Student Support & Education Knowledge Assistant, we expect queries like:
- “What steps do I follow when changing my major?”
- “When should I request an updated I-20?”
- “How often should I meet my advisor?”

These questions are usually answered within short sections or procedure-style blocks in the documents. Therefore, it is important that our chunking preserves local context without making chunks too long.

In this project, we implement and compare two strategies:
1. Sentence-based adaptive chunks – We split text into sentences and then group nearby sentences into chunks of manageable length.  
2. Sliding window over sentences – We create overlapping chunks using a fixed-size window of sentences, which helps retain context across boundaries.

Later, we will analyze which strategy works better for our student-support domain when combined with embeddings and retrieval.
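
Both strategies depend on sentence splitting, and a period-based splitter treats numbered-list markers such as `2.` as sentence ends. The sketch below contrasts a naive split with a list-aware variant. The second regex is a heuristic assumption (a lone digit followed by a period is taken to be a list marker), and the sample text is made up for illustration.

```python
import re
from typing import List

# Naive rule: split after ., ! or ? when followed by whitespace.
NAIVE_SPLIT = re.compile(r"(?<=[.!?])\s+")
# Same rule, but do not split right after a single-digit marker such as "2."
# (heuristic assumption: a lone digit plus a period is a list marker).
LIST_AWARE_SPLIT = re.compile(r"(?<=[.!?])(?<!\b\d\.)\s+")


def split_sentences(text: str, pattern: re.Pattern) -> List[str]:
    """Collapse whitespace, then split the text with the given pattern."""
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in pattern.split(text) if s]


sample = (
    "To change your major, follow these steps: 1. Review degree requirements. "
    "2. Meet your advisor. 3. Submit the form."
)
naive = split_sentences(sample, NAIVE_SPLIT)
list_aware = split_sentences(sample, LIST_AWARE_SPLIT)

print(len(naive), len(list_aware))
for sentence in list_aware:
    print(sentence)
```

The list-aware variant keeps each marker attached to its step, at the cost of wrongly merging a sentence that happens to end in a single digit. It also ignores markers with two or more digits.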

In [3]:
import re
from typing import List, Dict

def split_into_sentences(text: str) -> List[str]:
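    """
    Very simple sentence splitter based on punctuation (. ? !).
    Known limitation: numbered-list markers such as "2." end in a period,
    so they are separated from the steps they label and can become
    one-token "sentences". This is acceptable for our small knowledge base.
    """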
    text = re.sub(r"\s+", " ", text).strip()
    sentences = re.split(r'(?<=[.!?])\s+', text)
    sentences = [s.strip() for s in sentences if s.strip()]
    return sentences

def chunk_sentences_adaptive(
    documents: List[Dict],
    max_words_per_chunk: int = 80
) -> List[Dict]:
    """
    Strategy A: Sentence-based adaptive chunking.

    For each document:
    - split into sentences,
    - group consecutive sentences until we reach max_words_per_chunk,
    - start a new chunk when the limit is exceeded.

    Returns a list of chunk dictionaries:
    [
      {
        'doc_id': int,
        'chunk_id': int,
        'source_filename': str,
        'text': str
      },
      ...
    ]
    """
    all_chunks = []
    global_chunk_id = 0

    for doc in documents:
        sentences = split_into_sentences(doc["normalized_text"])
        current_chunk_sentences = []
        current_word_count = 0

        for sent in sentences:
            sent_word_count = len(sent.split())
            if current_chunk_sentences and (current_word_count + sent_word_count > max_words_per_chunk):
                chunk_text = " ".join(current_chunk_sentences)
                all_chunks.append({
                    "doc_id": doc["doc_id"],
                    "chunk_id": global_chunk_id,
                    "source_filename": doc["filename"],
                    "text": chunk_text
                })
                global_chunk_id += 1
                current_chunk_sentences = []
                current_word_count = 0

            current_chunk_sentences.append(sent)
            current_word_count += sent_word_count

        if current_chunk_sentences:
            chunk_text = " ".join(current_chunk_sentences)
            all_chunks.append({
                "doc_id": doc["doc_id"],
                "chunk_id": global_chunk_id,
                "source_filename": doc["filename"],
                "text": chunk_text
            })
            global_chunk_id += 1

    return all_chunks

sentence_chunks = chunk_sentences_adaptive(documents, max_words_per_chunk=80)
print(f"Total chunks created (Strategy A - sentence-based adaptive): {len(sentence_chunks)}\n")

for chunk in sentence_chunks[:5]:
    print(f"Chunk ID: {chunk['chunk_id']}, Doc ID: {chunk['doc_id']}, File: {chunk['source_filename']}")
    print(f"Text: {chunk['text']}")
    print("-" * 80)


Total chunks created (Strategy A - sentence-based adaptive): 4

Chunk ID: 0, Doc ID: 0, File: student_support_services.txt
Text: Title: Student Support & Campus Services Overview The university provides several support services: - Tutoring center for academic subjects. - Writing center for help with assignments. - Counseling and mental-health services. - Career services for resume reviews and interview preparation. Students can access details, hours, and booking links through the student portal's "Support & Resources" section.
--------------------------------------------------------------------------------
Chunk ID: 1, Doc ID: 1, File: changing_major.txt
Text: Title: Changing Your Major - Process Overview To change your major, students must follow these steps: 1. Review degree requirements for the new major. 2. Schedule an appointment with their academic advisor. 3. Submit the official "Change of Major" form through the student portal. 4. Wait for confirmation email from the Registrar'

Our second strategy uses a sliding window over sentences. Instead of creating variable-length chunks based on word count, we:

- Split each document into sentences.
- Form chunks using a fixed-size window of consecutive sentences.
- Slide the window with some overlap.

This overlap helps preserve context across boundaries. With a window of two sentences and a stride of one, every pair of adjacent sentences appears together in one chunk, so an instruction that spans two sentences is never cut apart by a chunk boundary. This strategy is often useful in FAQ-like or step-by-step documents, where meaning is closely tied to neighboring sentences.
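
To make the overlap concrete, here is a minimal sketch of the windowing step on a toy FAQ. The function name `sliding_windows` and the sample sentences are assumptions for illustration; the loop mirrors the window/stride idea described above.

```python
from typing import List


def sliding_windows(items: List[str], window_size: int = 2, stride: int = 1) -> List[List[str]]:
    """Return overlapping windows of `window_size` items, advancing by `stride`."""
    if window_size < 1 or stride < 1:
        # A stride of 0 would never advance, so reject it up front.
        raise ValueError("window_size and stride must be >= 1")

    windows = []
    start = 0
    while start < len(items):
        end = min(start + window_size, len(items))
        windows.append(items[start:end])
        if end == len(items):  # the last window reached the end of the list
            break
        start += stride
    return windows


# Hypothetical FAQ sentences.
faq = [
    "Q: How often should I meet my advisor?",
    "A: At least once per semester.",
    "Q: Can advisors help with course planning?",
    "A: Yes.",
]
windows = sliding_windows(faq, window_size=2, stride=1)
for window in windows:
    print(" | ".join(window))
```

With a stride of 1 and at least `window_size` sentences, a document of `n` sentences yields `n - window_size + 1` windows, so four sentences give three chunks here.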

In [4]:
from typing import List, Dict

def chunk_sentences_sliding(
    documents: List[Dict],
    window_size: int = 2,
    stride: int = 1
) -> List[Dict]:
    """
    Strategy B: Sliding window over sentences.

    For each document:
    - split into sentences,
    - create overlapping chunks of `window_size` sentences,
    - move the window forward by `stride` sentences each time.

    Returns a list of chunk dictionaries:
    [
      {
        'doc_id': int,
        'chunk_id': int,
        'source_filename': str,
        'text': str
      },
      ...
    ]
    """
    all_chunks = []
    global_chunk_id = 0

    for doc in documents:
        sentences = split_into_sentences(doc["normalized_text"])
        n = len(sentences)

        if n == 0:
            continue

        start_idx = 0
        while start_idx < n:
            end_idx = min(start_idx + window_size, n)
            window_sents = sentences[start_idx:end_idx]
            chunk_text = " ".join(window_sents)

            all_chunks.append({
                "doc_id": doc["doc_id"],
                "chunk_id": global_chunk_id,
                "source_filename": doc["filename"],
                "text": chunk_text
            })
            global_chunk_id += 1
            if end_idx == n:
                break
            start_idx += stride

    return all_chunks

sliding_chunks = chunk_sentences_sliding(documents, window_size=2, stride=1)

print(f"Total chunks created (Strategy B - sliding window): {len(sliding_chunks)}\n")

for chunk in sliding_chunks[:5]:
    print(f"Chunk ID: {chunk['chunk_id']}, Doc ID: {chunk['doc_id']}, File: {chunk['source_filename']}")
    print(f"Text: {chunk['text']}")
    print("-" * 80)

Total chunks created (Strategy B - sliding window): 25

Chunk ID: 0, Doc ID: 0, File: student_support_services.txt
Text: Title: Student Support & Campus Services Overview The university provides several support services: - Tutoring center for academic subjects. - Writing center for help with assignments.
--------------------------------------------------------------------------------
Chunk ID: 1, Doc ID: 0, File: student_support_services.txt
Text: - Writing center for help with assignments. - Counseling and mental-health services.
--------------------------------------------------------------------------------
Chunk ID: 2, Doc ID: 0, File: student_support_services.txt
Text: - Counseling and mental-health services. - Career services for resume reviews and interview preparation.
--------------------------------------------------------------------------------
Chunk ID: 3, Doc ID: 0, File: student_support_services.txt
Text: - Career services for resume reviews and interview preparation. St

## Comparison of Chunking Strategies

We implemented and compared two different chunking strategies for our Student Support & Education Knowledge Assistant.

1. Strategy A – Sentence-Based Adaptive Chunking  
   - We split each document into sentences and then group consecutive sentences into chunks up to a maximum word limit.  
   - This leads to variable-length chunks: simple FAQs may form a single short chunk, while procedure-style texts form slightly longer chunks that still remain manageable for embeddings.  
   - This strategy is useful when we want chunks that are compact but still contain a complete thought, such as a full answer or a list of steps.

2. Strategy B – Sliding Window over Sentences  
   - We again split documents into sentences, but now we form fixed-size overlapping windows.  
   - This causes more chunks to be generated, with intentional overlap between them.  
   - Overlap is helpful when important context spans across sentence boundaries, ensuring that different chunks still preserve most of the meaning needed to answer a query.

In our student-support domain, both strategies are reasonable:

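- For procedural questions, the adaptive sentence-based chunks work well because they tend to keep the full procedure in a single chunk or a small number of chunks. In this knowledge base every document is shorter than the 80-word limit, so Strategy A yields exactly one chunk per document.
- For FAQ-style information, the sliding-window strategy pairs each question with the answer that follows it, which can help retrieval when queries are phrased differently from the original text. The cost is a larger number of chunks (25 versus 4 here), some of which are short fragments that carry little meaning on their own.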

We will later evaluate these strategies more concretely when we connect them with embeddings and retrieval, and then decide which one is more suitable as the default for our RAG pipeline.

In [5]:
def chunk_stats(chunks, label: str):
    lengths = [len(c["text"].split()) for c in chunks]
    total = len(lengths)
    avg_len = sum(lengths) / total if total > 0 else 0
    min_len = min(lengths) if lengths else 0
    max_len = max(lengths) if lengths else 0

    print(f"=== {label} ===")
    print(f"Total chunks: {total}")
    print(f"Average chunk length (words): {avg_len:.2f}")
    print(f"Min chunk length (words): {min_len}")
    print(f"Max chunk length (words): {max_len}")
    print()

def chunks_per_doc(chunks):
    counts = {}
    for c in chunks:
        counts.setdefault(c["source_filename"], 0)
        counts[c["source_filename"]] += 1
    return counts

chunk_stats(sentence_chunks, "Strategy A - Sentence-Based Adaptive")
chunk_stats(sliding_chunks, "Strategy B - Sliding Window")

print("Chunks per document (Strategy A):")
for fname, count in chunks_per_doc(sentence_chunks).items():
    print(f"  {fname}: {count}")
print()

print("Chunks per document (Strategy B):")
for fname, count in chunks_per_doc(sliding_chunks).items():
    print(f"  {fname}: {count}")

=== Strategy A - Sentence-Based Adaptive ===
Total chunks: 4
Average chunk length (words): 66.00
Min chunk length (words): 56
Max chunk length (words): 73

=== Strategy B - Sliding Window ===
Total chunks: 25
Average chunk length (words): 16.16
Min chunk length (words): 7
Max chunk length (words): 31

Chunks per document (Strategy A):
  student_support_services.txt: 1
  changing_major.txt: 1
  academic_advising_faq.txt: 1
  i20_update_process.txt: 1

Chunks per document (Strategy B):
  student_support_services.txt: 4
  changing_major.txt: 8
  academic_advising_faq.txt: 5
  i20_update_process.txt: 8


# Task 4 — Embeddings + Retrieval Using NumPy

In this task, we move from raw text chunks to vector representations that can be used for semantic search and retrieval.

Our goals are:
- Compute open-source embeddings for all text chunks using a publicly available model.
- Compute embeddings for the same chunks using a larger model that stands in for a closed-source, API-based model.
- Store all embeddings in NumPy arrays, which makes it easy to:
  - compute cosine similarity manually, and
  - retrieve the most relevant chunks for a given query.

By comparing the lightweight and the larger embedding model on the same chunked dataset, we can analyze differences in:
- retrieval quality,
- robustness to paraphrased queries, and
- overall usefulness in our student-support domain.
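
Retrieval quality can be made measurable with a small set of labeled queries. The sketch below computes a hit rate at k. The function name `hit_rate_at_k`, the labeled queries, and the keyword-based stand-in retriever are all assumptions for illustration; in practice the retriever would wrap an embedding model.

```python
from typing import Callable, List, Tuple


def hit_rate_at_k(
    labeled_queries: List[Tuple[str, str]],
    retrieve_filenames: Callable[[str, int], List[str]],
    k: int = 1,
) -> float:
    """Fraction of queries whose expected file appears in the top-k results."""
    if not labeled_queries:
        return 0.0
    hits = sum(
        1
        for query, expected_file in labeled_queries
        if expected_file in retrieve_filenames(query, k)
    )
    return hits / len(labeled_queries)


def keyword_retriever(query: str, k: int) -> List[str]:
    """Hypothetical stand-in retriever based on keyword matching."""
    ranked = []
    if "major" in query.lower():
        ranked.append("changing_major.txt")
    if "i-20" in query.lower():
        ranked.append("i20_update_process.txt")
    ranked.append("academic_advising_faq.txt")  # fallback result
    return ranked[:k]


# Hypothetical labeled queries: (question, file expected to contain the answer).
labeled = [
    ("How do I switch my major?", "changing_major.txt"),
    ("When should I request an updated I-20?", "i20_update_process.txt"),
    ("Where can I find tutoring?", "student_support_services.txt"),
]
rate = hit_rate_at_k(labeled, keyword_retriever, k=1)
print(f"hit rate@1: {rate:.2f}")
```

Running the same labeled set against each embedding model, including paraphrased versions of each question, gives a like-for-like comparison that does not depend on raw similarity scores.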

In [6]:
!pip -q install sentence-transformers

from sentence_transformers import SentenceTransformer
import numpy as np

open_source_model_name = "sentence-transformers/all-MiniLM-L6-v2"

open_source_model = SentenceTransformer(open_source_model_name)

print("Open-source embedding model loaded:", open_source_model_name)

Open-source embedding model loaded: sentence-transformers/all-MiniLM-L6-v2


In [7]:
chunks_for_embeddings = sentence_chunks

def embed_chunks_open_source(chunks, model):
    """
    Compute open-source embeddings for a list of chunk dicts.

    Input:
      - chunks: list of dicts, each with a 'text' field
      - model: a SentenceTransformer model

    Output:
      - embeddings: NumPy array of shape (num_chunks, embedding_dim)
    """
    texts = [c["text"] for c in chunks]
    embeddings = model.encode(
        texts,
        show_progress_bar=True,
        convert_to_numpy=True
    )
    embeddings = embeddings.astype("float32")
    return embeddings

open_source_embeddings = embed_chunks_open_source(chunks_for_embeddings, open_source_model)

print("Embeddings shape:", open_source_embeddings.shape)
print("Number of chunks embedded:", len(chunks_for_embeddings))

Embeddings shape: (4, 384)
Number of chunks embedded: 4


## Cosine Similarity and Top-k Retrieval

After converting all chunks into vector embeddings, we need a way to measure how relevant each chunk is to a user query. We use cosine similarity as our similarity metric:

\[
\text{cosine\_similarity}(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}
\]

This measures the angle between two vectors and ignores their absolute magnitude. In the context of embeddings:

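- If the cosine similarity is close to 1, the vectors point in nearly the same direction, which we interpret as semantic similarity.
- If it is close to 0 (or negative), we treat the texts as unrelated. The typical score range depends on the embedding model, so scores are best compared within one model rather than across models.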

Our retrieval pipeline with open-source embeddings will work as follows:

1. Embed the user’s query using the same open-source model we used for the chunks.
2. Compute cosine similarity between the query embedding and all chunk embeddings stored in a NumPy array.
3. Sort the chunks by similarity score.
4. Return the top-k most relevant chunks, which we will later feed into the RAG generation step.
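
The four steps above can be traced on toy vectors before running them on real embeddings. This is a minimal sketch; the function name `top_k_cosine` and the 2-D vectors are assumptions chosen so the scores are easy to check by hand.

```python
import numpy as np


def top_k_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 2):
    """Return (index, score) pairs for the k rows most similar to the query."""
    eps = 1e-10  # guards against division by zero for all-zero vectors
    q = query_vec / (np.linalg.norm(query_vec) + eps)
    d = doc_matrix / (np.linalg.norm(doc_matrix, axis=1, keepdims=True) + eps)
    scores = d @ q                    # cosine similarity of each row with q
    order = np.argsort(-scores)[:k]   # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]


# Toy "chunk embeddings": same direction, 45 degrees off, orthogonal, opposite.
toy_docs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([2.0, 0.0])  # longer than row 0, but same direction

toy_results = top_k_cosine(toy_query, toy_docs, k=2)
print(toy_results)
```

Row 0 scores about 1.0 even though the query is twice as long, which shows that only direction matters. Row 1 scores about 0.707, the cosine of 45 degrees.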

In [8]:
import numpy as np

def cosine_similarity_matrix(query_vec: np.ndarray, doc_matrix: np.ndarray) -> np.ndarray:
    """
    Compute cosine similarity between a single query vector and all document vectors.

    query_vec: shape (D,)
    doc_matrix: shape (N, D)

    returns: shape (N,) similarity scores
    """
    query_vec = query_vec.reshape(1, -1)
    dot_products = np.dot(query_vec, doc_matrix.T)[0]
    query_norm = np.linalg.norm(query_vec)
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    eps = 1e-10
    similarities = dot_products / (query_norm * doc_norms + eps)

    return similarities


def retrieve_top_k_open_source(
    query: str,
    model,
    doc_embeddings: np.ndarray,
    chunks: list,
    k: int = 3
):
    """
    Given a query string:
    - embed the query using the open-source model,
    - compute cosine similarity against all chunk embeddings,
    - return the top-k most similar chunks and their scores.
    """
    query_embedding = model.encode(query, convert_to_numpy=True).astype("float32")
    sims = cosine_similarity_matrix(query_embedding, doc_embeddings)
    top_k_indices = np.argsort(-sims)[:k]
    results = []
    for idx in top_k_indices:
        results.append({
            "chunk_index": int(idx),
            "score": float(sims[idx]),
            "chunk": chunks[idx]
        })

    return results


test_query = "What are the steps to change my major?"

top_results = retrieve_top_k_open_source(
    query=test_query,
    model=open_source_model,
    doc_embeddings=open_source_embeddings,
    chunks=chunks_for_embeddings,
    k=3
)

print("Query:", test_query)
print("\nTop-3 retrieved chunks (open-source embeddings):\n")

for r in top_results:
    ch = r["chunk"]
    print(f"Score: {r['score']:.4f}")
    print(f"From file: {ch['source_filename']}")
    print(f"Text: {ch['text']}")
    print("-" * 80)

Query: What are the steps to change my major?

Top-3 retrieved chunks (open-source embeddings):

Score: 0.7445
From file: changing_major.txt
Text: Title: Changing Your Major - Process Overview To change your major, students must follow these steps: 1. Review degree requirements for the new major. 2. Schedule an appointment with their academic advisor. 3. Submit the official "Change of Major" form through the student portal. 4. Wait for confirmation email from the Registrar's Office. International students must also verify how the change may impact their program end date and immigration status.
--------------------------------------------------------------------------------
Score: 0.2866
From file: i20_update_process.txt
Text: Title: I-20 Update and SEVIS Information International students should request an updated I-20 when: - They change their major or academic program. - Their program end date changes. - Their funding source changes significantly. To request an I-20 update: 1. Submit

## Closed-Source–Style Embeddings

To approximate the behavior of a high-performance closed-source embedding model without using a paid API, we use the E5-Large-V2 embedding model (`intfloat/e5-large-v2`). It is considerably larger than MiniLM and produces 1024-dimensional embeddings instead of 384-dimensional ones.

E5-Large-V2 is itself open-source, so it serves only as a proxy for commercial closed-source models. We chose it because:

- it is a much larger model than MiniLM,
- larger embedding models typically outperform smaller ones in retrieval,
- it represents the higher-capacity end of the freely available embedding models.

Comparing MiniLM with E5-Large-V2 is therefore a small-versus-large comparison between two open models, which we use as a stand-in for an open-vs-closed comparison without relying on paid APIs.
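
One detail matters when using E5 models: the model card for `intfloat/e5-large-v2` states that input texts should start with `query: ` or `passage: `. The sketch below shows how those prefixes could be added before encoding. The helper names are assumptions, and the recorded outputs in this notebook were produced without prefixes.

```python
from typing import List

# Prefixes described in the E5 model card.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def as_e5_passages(texts: List[str]) -> List[str]:
    """Prefix knowledge-base chunks before encoding them with an E5 model."""
    return [PASSAGE_PREFIX + text for text in texts]


def as_e5_query(query: str) -> str:
    """Prefix a user question before encoding it with an E5 model."""
    return QUERY_PREFIX + query


example_passages = as_e5_passages(["To change your major, students must follow these steps."])
example_query = as_e5_query("What are the steps to change my major?")
print(example_passages[0])
print(example_query)

# Hypothetical usage with a loaded SentenceTransformer `model`:
#   chunk_vectors = model.encode(as_e5_passages(texts), convert_to_numpy=True)
#   query_vector = model.encode(as_e5_query(question), convert_to_numpy=True)
```

Adding the prefixes would change the similarity scores, so results with and without them should not be mixed in one comparison.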

In [9]:
!pip -q install sentence-transformers

from sentence_transformers import SentenceTransformer
import numpy as np

closed_source_like_model_name = "intfloat/e5-large-v2"
closed_source_like_model = SentenceTransformer(closed_source_like_model_name)

print("Closed-source–style model loaded:", closed_source_like_model_name)

Closed-source–style model loaded: intfloat/e5-large-v2


In [10]:
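def embed_chunks_closed_source_like(chunks, model):
    """
    Compute embeddings for a list of chunk dicts with the larger model.

    Note: the E5 model card recommends prefixing passages with "passage: "
    and queries with "query: "; the texts here are encoded without prefixes,
    which may lower retrieval quality.
    """
    texts = [c["text"] for c in chunks]
    embeddings = model.encode(
        texts,
        show_progress_bar=True,
        convert_to_numpy=True
    )
    return embeddings.astype("float32")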

closed_source_like_embeddings = embed_chunks_closed_source_like(
    chunks_for_embeddings,
    closed_source_like_model
)

print("Closed-source–style embeddings shape:", closed_source_like_embeddings.shape)
print("Number of chunks embedded:", len(chunks_for_embeddings))

Closed-source–style embeddings shape: (4, 1024)
Number of chunks embedded: 4


In [11]:
def retrieve_top_k_closed_source_like(
    query: str,
    model,
    doc_embeddings: np.ndarray,
    chunks: list,
    k: int = 3
):
    query_embedding = model.encode(query, convert_to_numpy=True).astype("float32")
    sims = cosine_similarity_matrix(query_embedding, doc_embeddings)
    top_k_indices = np.argsort(-sims)[:k]

    results = []
    for idx in top_k_indices:
        results.append({
            "chunk_index": int(idx),
            "score": float(sims[idx]),
            "chunk": chunks[idx]
        })
    return results


query = "What are the steps to change my major?"

print("=== Closed-source–style (E5-Large-V2) Retrieval ===\n")

top_closed_like = retrieve_top_k_closed_source_like(
    query=query,
    model=closed_source_like_model,
    doc_embeddings=closed_source_like_embeddings,
    chunks=chunks_for_embeddings,
    k=3
)

for r in top_closed_like:
    ch = r["chunk"]
    print(f"Score: {r['score']:.4f} | File: {ch['source_filename']}")
    print("Text:", ch["text"])
    print("-" * 80)

=== Closed-source–style (E5-Large-V2) Retrieval ===

Score: 0.8665 | File: changing_major.txt
Text: Title: Changing Your Major - Process Overview To change your major, students must follow these steps: 1. Review degree requirements for the new major. 2. Schedule an appointment with their academic advisor. 3. Submit the official "Change of Major" form through the student portal. 4. Wait for confirmation email from the Registrar's Office. International students must also verify how the change may impact their program end date and immigration status.
--------------------------------------------------------------------------------
Score: 0.7910 | File: academic_advising_faq.txt
Text: Title: Academic Advising – Frequently Asked Questions Q: How do I find my academic advisor? A: Students can view their assigned advisor in the student information system under the "Advising" tab. Q: How often should I meet my advisor? A: At least once per semester, and whenever considering major changes, withd

# Task 5 — Full RAG Query → Retrieve → Generate Loop

In this task, we integrate all previous components into a complete Retrieval-Augmented Generation (RAG) workflow:

1. User Query  
   The user asks a natural-language question related to student support (e.g., “When should I request an updated I-20?”).

2. Query Embedding  
   The query is converted into a vector using the same embedding model used for the knowledge base.

3. Retrieval (NumPy Cosine Similarity)  
   We compute similarity against all document chunk embeddings and select the top-k most relevant chunks.

4. Context Construction  
   Retrieved chunks are concatenated into a structured “context section” that will guide the model’s answer.

5. LLM Generation  
   A lightweight, free-to-use conversational model generates a grounded answer based only on retrieved context.

6. Comparison: With RAG vs Without RAG  
   We compare:
   - the model’s answer without any retrieval, and  
   - the model’s answer with retrieved context,  
   showing how RAG reduces hallucinations and improves correctness.

This completes the end-to-end pipeline for our prototype RAG system.
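The retrieval step (step 3) can be illustrated in isolation. The block below is a minimal sketch, not the notebook's own retriever: the function name `top_k_cosine` and the toy 3-dimensional vectors are assumptions made for illustration, standing in for real embedding-model output.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=3):
    """Return (row_index, score) pairs for the k rows most similar to query_vec."""
    # Normalise both sides so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    # argsort is ascending, so reverse it and keep the first k indices.
    top_idx = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top_idx]

# Toy vectors (hypothetical) in place of real chunk embeddings.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.1, 0.0])

results = top_k_cosine(query, docs, k=2)
print(results)
```

Row 0 points in almost the same direction as the query, so it ranks first; row 2 is orthogonal to the query and is excluded when `k=2`.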

In [12]:
!pip -q install transformers accelerate

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

llm_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(llm_model_name)
llm_model = AutoModelForCausalLM.from_pretrained(llm_model_name)

rag_pipeline = pipeline(
    "text-generation",
    model=llm_model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.3,
    top_p=0.9
)

print("RAG generation model loaded:", llm_model_name)

Device set to use cuda:0


RAG generation model loaded: TinyLlama/TinyLlama-1.1B-Chat-v1.0


In [13]:
def build_rag_prompt(query, retrieved_chunks):
    """
    Construct a prompt that contains:
    - the retrieved context
    - clear instructions for grounded answering

    Returns a text prompt to feed into the LLM.
    """
    context_texts = [c["chunk"]["text"] for c in retrieved_chunks]
    context_block = "\n\n---\n".join(context_texts)

    prompt = f"""
You are a Student Support Assistant. Use ONLY the information provided in the context to answer the question.

Context:
{context_block}

---

Question: {query}

Answer in clear, concise sentences based strictly on the given context.
"""
    return prompt.strip()

In [14]:
def rag_answer(
    query,
    retriever_fn,
    embedding_model,
    doc_embeddings,
    chunks,
    generator_pipeline,
    k=3
):
    retrieved = retriever_fn(
        query=query,
        model=embedding_model,
        doc_embeddings=doc_embeddings,
        chunks=chunks,
        k=k
    )
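    prompt = build_rag_prompt(query, retrieved)
    # By default the text-generation pipeline returns the prompt followed by
    # the completion, so `result` begins with the full prompt text.
    # Passing return_full_text=False to the pipeline call would return only
    # the completion.
    result = generator_pipeline(prompt)[0]["generated_text"]

    return retrieved, result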

In [15]:
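def no_rag_answer(query, generator_pipeline):
    """Baseline: ask the model directly, with no retrieved context."""
    prompt = f"Answer the following question:\n\n{query}\n\n"
    # The returned text starts with the prompt, which the pipeline echoes by default.
    result = generator_pipeline(prompt)[0]["generated_text"]
    return result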

In [16]:
query = "When should I request an updated I-20?"

retrieved_chunks, rag_output = rag_answer(
    query=query,
    retriever_fn=retrieve_top_k_closed_source_like,
    embedding_model=closed_source_like_model,
    doc_embeddings=closed_source_like_embeddings,
    chunks=chunks_for_embeddings,
    generator_pipeline=rag_pipeline,
    k=3
)
no_rag_output = no_rag_answer(query, rag_pipeline)

print("========== WITHOUT RAG ==========\n")
print(no_rag_output)
print("\n=================================\n")

print("========== WITH RAG ==========\n")
print(rag_output)
print("\n=================================\n")

print("Retrieved Chunks Used:\n")
for c in retrieved_chunks:
    print("---")
    print(c["chunk"]["text"])


Answer the following question:

When should I request an updated I-20?

Answers:

1. Request an updated I-20 when you are required to do so by the U.S. Embassy or Consulate.

2. Request an updated I-20 when you are required to do so by the U.S. Department of Homeland Security.

3. Request an updated I-20 when you are required to do so by the U.S. Department of State.

4. Request an updated I-20 when you are required to do so by the U.S. Citizenship and Immigration Services (USCIS).

5. Request an updated I-20 when you are required to do so by the U.S. Immigration and Customs Enforcement (ICE).

6. Request an updated I-20 when you are required to do so by the U.S. Department of Labor.

7. Request an updated I-20 when you are required to do so by the U.S. Department of Education.

8. Request an updated I-20 when you are required to do so by the U.S. Department of Agriculture.

9. Request an updated I-20 when you are required to do so by the U.S. Department of Energy.

10. Request an upd

In [17]:
query = "How do I change my major?"

retrieved_chunks, rag_response = rag_answer(
    query=query,
    retriever_fn=retrieve_top_k_open_source,  # or retrieve_top_k_closed_source_like
    embedding_model=open_source_model,  # or closed_source_like_model
    doc_embeddings=open_source_embeddings,  # or closed_source_like_embeddings
    chunks=chunks_for_embeddings,
    generator_pipeline=rag_pipeline,
    k=3
)

print("Query:", query)
print("\nRetrieved Chunks:")
for i, chunk in enumerate(retrieved_chunks, 1):
    print(f"\nChunk {i} (Score: {chunk['score']:.4f}):")
    print(chunk['chunk']['text'])

print("\nGenerated Response:")
print(rag_response)

Query: How do I change my major?

Retrieved Chunks:

Chunk 1 (Score: 0.7189):
Title: Changing Your Major - Process Overview To change your major, students must follow these steps: 1. Review degree requirements for the new major. 2. Schedule an appointment with their academic advisor. 3. Submit the official "Change of Major" form through the student portal. 4. Wait for confirmation email from the Registrar's Office. International students must also verify how the change may impact their program end date and immigration status.

Chunk 2 (Score: 0.2866):
Title: I-20 Update and SEVIS Information International students should request an updated I-20 when: - They change their major or academic program. - Their program end date changes. - Their funding source changes significantly. To request an I-20 update: 1. Submit an online request form through the International Services portal. 2. Upload any required financial or academic documents. 3. Allow 7–10 business days for processing.

Chunk 3 (S

# Task 6 — Analysis & Reflection

## 1. Chunking Strategies: What Worked and Why

We implemented two different chunking strategies for our student-support knowledge base: (A) sentence-based adaptive chunks and (B) sliding-window chunks over sentences. The sentence-based adaptive approach grouped consecutive sentences until a word limit was reached (around 80 words). In practice, this produced compact chunks that often contained a complete idea, such as the full list of steps for changing a major or a self-contained FAQ answer. This worked especially well for procedural content, where students expect a clear start-to-end flow.

The sliding-window strategy created overlapping chunks (e.g., 2 sentences per chunk with a stride of 1). This increased the total number of chunks but preserved more continuity across sentence boundaries. This was helpful in cases where important context was split across multiple sentences, such as explanations followed by caveats. However, the overlap also led to redundant results during retrieval and slightly higher computation. For our domain, we found the adaptive sentence-based strategy to be a better default: it balanced context preservation with efficiency, while the sliding-window strategy remained a useful backup when we wanted extra robustness around FAQ-style content.
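The two strategies described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the notebook's exact code: the function names, the naive regex sentence splitter, and the choice to start a new chunk before the word limit is exceeded are all assumptions.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def adaptive_chunks(sentences, max_words=80):
    """Group consecutive sentences until adding one more would exceed max_words."""
    chunks, current, count = [], [], 0
    for s in sentences:
        n = len(s.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(s)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def sliding_window_chunks(sentences, window=2, stride=1):
    """Overlapping chunks of `window` sentences, advancing by `stride` each time."""
    if len(sentences) <= window:
        return [" ".join(sentences)] if sentences else []
    return [
        " ".join(sentences[i:i + window])
        for i in range(0, len(sentences) - window + 1, stride)
    ]

text = "Review the requirements. Meet your advisor. Submit the form. Wait for confirmation."
sents = split_sentences(text)
adaptive = adaptive_chunks(sents, max_words=6)   # small limit so the toy text splits
windows = sliding_window_chunks(sents, window=2, stride=1)

print(len(adaptive), "adaptive chunks;", len(windows), "sliding-window chunks")
```

On this four-sentence example the adaptive strategy yields two non-overlapping chunks, while the sliding window yields three overlapping ones, which mirrors the redundancy trade-off noted above.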

## 2. Open-Source vs Closed-Source–Style Embeddings

For embeddings, we compared a lightweight open-source model (MiniLM) with a larger, closed-source–style model (E5-Large-V2). MiniLM performed reasonably well on straightforward queries; for example, when we asked about “steps to change my major,” it consistently retrieved chunks from the correct policy document. However, when we paraphrased questions or used more indirect phrasing, MiniLM sometimes retrieved less specific chunks or mixed in generic advising content.

The E5-Large-V2 embeddings behaved more like a “premium” embedding service. The same paraphrased queries produced more stable and focused retrieval results. E5-Large-V2 was better at mapping semantically similar questions (e.g., “update my I-20” vs “request a new I-20 after changing program”) to the correct sections in the knowledge base. The trade-off is that E5-Large-V2 is heavier and slower than MiniLM, but it offers higher retrieval quality and robustness, which is valuable in a high-stakes student-support context.
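One way to make this comparison less anecdotal is to score each retriever on a small set of labelled queries. The sketch below is hypothetical: `hit_rate_at_k`, the keyword-overlap `stub_retrieve`, and the toy corpus are assumptions for illustration only. In practice the stub would be replaced by a thin wrapper around each embedding-based retriever, and the labelled set would include paraphrased queries.

```python
def hit_rate_at_k(labelled_queries, retrieve, k=3):
    """Fraction of queries whose expected source file appears in the top-k results."""
    hits = 0
    for query, expected_file in labelled_queries:
        results = retrieve(query, k)
        files = [r["chunk"]["source_filename"] for r in results]
        hits += expected_file in files
    return hits / len(labelled_queries)

def stub_retrieve(query, k):
    # Stand-in retriever (assumption): ranks by word overlap, not by embeddings.
    corpus = {
        "changing_major.txt": "change major form portal",
        "academic_advising_faq.txt": "find advisor advising tab",
    }
    q_words = set(query.lower().split())
    scored = [
        {
            "score": len(q_words & set(text.split())),
            "chunk": {"source_filename": name, "text": text},
        }
        for name, text in corpus.items()
    ]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:k]

labelled = [
    ("how do i change my major", "changing_major.txt"),
    ("how do i find my advisor", "academic_advising_faq.txt"),
]
rate = hit_rate_at_k(labelled, stub_retrieve, k=1)
print(f"hit rate@1: {rate:.2f}")
```

Running the same labelled set through both embedding models would give a single number per model, making differences on paraphrased queries easier to see.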

## 3. Retrieval Quality: Successes and Limitations

Overall, our NumPy-based cosine similarity retrieval worked effectively for the size of our knowledge base. For common questions such as changing majors, updating an I-20, or finding advising information, the top-k chunks were directly relevant and allowed the RAG model to generate grounded, policy-aligned answers. This was especially visible when comparing answers with RAG vs answers without RAG: without retrieval, the language model tended to hallucinate generic university policies, whereas with RAG, it echoed concrete details like specific steps, processing times, and references to portals.

However, there are still limitations. Our knowledge base is small and well-structured; in a real university, documents might be longer, messier, and partially overlapping. With more documents, we might see more near-duplicate chunks in top-k results or occasional misses when questions are extremely vague. We also did not implement advanced re-ranking, hybrid (BM25 + embedding) search, or metadata filters (e.g., by date or department), which would be necessary to scale this system to a full campus environment.
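As one example of the kind of post-processing that could address near-duplicate chunks, the sketch below filters a ranked result list so that no kept chunk is too similar to one already kept. This was not implemented in the notebook; the function name `dedupe_ranked` and the 0.95 threshold are assumptions for illustration.

```python
import numpy as np

def dedupe_ranked(ranked_idx, embeddings, threshold=0.95):
    """Walk results in rank order, skipping any chunk whose cosine similarity
    to an already kept chunk is at or above `threshold`."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i in ranked_idx:
        if all(float(normed[i] @ normed[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy embeddings (hypothetical): rows 0 and 1 are nearly identical.
emb = np.array([
    [1.0, 0.0],
    [0.999, 0.01],
    [0.0, 1.0],
])
kept = dedupe_ranked([0, 1, 2], emb)
print(kept)  # [0, 2]
```

A filter like this would typically run on a larger candidate list (for example the top 10) before truncating to the final k, so that removing duplicates does not shrink the context.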

## 4. Making the System Production-Ready

To make this RAG assistant production-ready for a real university, several improvements would be required. First, the ingestion pipeline would need to handle multiple formats and support scheduled re-ingestion when policies change. Second, we would need a more sophisticated retrieval layer, possibly combining a vector database (e.g., FAISS or pgvector) with keyword search and metadata filtering to improve both speed and precision. Third, we would want to add guardrails and logging: tracking which chunks were used, monitoring for hallucinations, and providing clear citations or links back to the original policy documents.
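The citation idea can be sketched with the result structure already used in this notebook, where each result has a `score` and a `chunk` with `text` and `source_filename`. The helper name `build_cited_context` and the bracketed numbering format are assumptions; this is a possible starting point rather than a finished guardrail.

```python
def build_cited_context(retrieved):
    """Format retrieved chunks as numbered context entries and return them
    together with a source list that can be logged or shown to the user."""
    lines, sources = [], []
    for n, r in enumerate(retrieved, start=1):
        ch = r["chunk"]
        lines.append(f"[{n}] ({ch['source_filename']}) {ch['text']}")
        sources.append({"ref": n, "file": ch["source_filename"], "score": r["score"]})
    return "\n\n".join(lines), sources

# Sample results shaped like the retriever output (texts shortened for illustration).
sample = [
    {"score": 0.87, "chunk": {"source_filename": "changing_major.txt",
                              "text": "To change your major, students must follow these steps."}},
    {"score": 0.79, "chunk": {"source_filename": "academic_advising_faq.txt",
                              "text": "Students can view their assigned advisor under the Advising tab."}},
]
context, sources = build_cited_context(sample)
print(context)
print(sources)
```

The numbered context could replace the plain concatenation in the prompt, with an added instruction asking the model to cite entries by number, while `sources` gives a record of which chunks were used for each answer.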

Finally, from a user-experience perspective, the assistant should integrate with existing student portals and authentication systems, so that responses can be personalized (e.g., program-specific requirements) while still remaining grounded in official documents. Despite these gaps, our prototype demonstrates all the essential components of an end-to-end RAG system: ingestion, preprocessing, multiple chunking strategies, open vs closed-source–style embeddings, NumPy-based retrieval, and a grounded generation loop that clearly improves answer quality over a non-RAG baseline.
