---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: AI Engineering

### 📋 **Topic**: You Can Just Build Things

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---

## Welcome!

In our first four lectures, we've covered how:
1. We can call LLMs via APIs and get structured responses
2. We can build lexical search with BM25
3. We can build semantic search with embeddings
4. We can combine lexical and semantic search into hybrid search

Today you will put it all together by building a Retrieval Augmented Generation (RAG) system.
- This is a bot that answers questions about Fordham University.
- You will use real data scraped from the Fordham website.


Your RAG pipeline will look like this:

```
User Question
     ↓
1. RETRIEVE: Find relevant documents (search!)
     ↓
2. AUGMENT: Stuff those documents into a prompt
     ↓
3. GENERATE: Ask an LLM to answer using the context
     ↓
Answer
```


---

# 1. Look at your data

In `data/fordham-website.zip` you'll find **~9,500 Markdown files** scraped from Fordham's website. Each file is one page: admissions info, program descriptions, faculty pages, financial aid, campus life, and more.

Your task: **look at the data**
- The first step in any AI engineering or data science project should always be to familiarize yourself with the data.
- I cannot stress this enough: without this step, it's hard to build anything useful.

Tips:
- Unzip the archive and look at some of the files. 
- Open a few in a text editor. 
- Get a feel for what you're working with.
- The first line of every file is always the **URL** of the page it was scraped from. The rest is the page content converted to Markdown. Here's an example, `gabelli-school-of-business_veterans.md`:

```markdown
https://www.fordham.edu/gabelli-school-of-business/veterans

# Military Veterans & Active Duty Members of the Military

## Transform Your Knowledge & Skills Into a Business Career for the Future

As a veteran or an active duty member of the United States Armed Services,
you have gained or are currently acquiring the invaluable organizational,
leadership, analytics, and technical knowledge and skills that hiring
managers seek. These transferrable skills provide a major advantage in
emerging, business-related industries where innovation, a global mind-set,
and the ability to lead individuals and teams in the continuously evolving
work environment, are critical for success.

By completing a graduate or undergraduate business degree at the Gabelli
School of Business, you can prepare for a lifelong career in some of
today's fastest-growing fields. ...

### Study at a Top-Ranked, Military-Friendly University

The Gabelli School of Business is part of Fordham University, the only
New York City university to be among those ranked "Best for Vets" by
Military Times. ...

### Learn How the Yellow Ribbon Program Works

The Yellow Ribbon GI Education Enhancement Program, or the Yellow Ribbon
Program, is a part of the Post-9/11 Veterans Educational Assistance Act
of 2008. ...
```

The filenames mirror the URL structure: underscores replace path separators (e.g. `gabelli-school-of-business_veterans.md` came from `/gabelli-school-of-business/veterans`). Some files are short (a few lines), others are quite long.
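
The filename convention described above can be sketched as a small helper. This is only an illustration: `filename_to_url` and `BASE_URL` are hypothetical names, and the mapping assumes that every underscore in a filename stands for a `/` in the URL path, which would be wrong for any page whose original path contained an underscore. The URL on the first line of each file remains the reliable source.

```python
from pathlib import PurePosixPath

# Assumption: all pages live under this host, as in the example file's first line.
BASE_URL = "https://www.fordham.edu"


def filename_to_url(filename: str) -> str:
    """Rebuild a page URL from a scraped filename.

    Assumes every underscore in the filename replaced a "/" in the URL path.
    """
    stem = PurePosixPath(filename).stem  # drops any folder prefix and the ".md" suffix
    return f"{BASE_URL}/{stem.replace('_', '/')}"


print(filename_to_url("gabelli-school-of-business_veterans.md"))
# https://www.fordham.edu/gabelli-school-of-business/veterans
```

Comparing the rebuilt URL with the first line of each file is a quick way to find pages where the convention does not hold.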

- Once you've looked around, load the files into Python. Python's built-in `zipfile` module can read zip archives without extracting to disk. Load them into a list of dictionaries or a DataFrame with at least two fields: the filename (or a clean page name) and the content.

In [1]:
# Placeholder for your implementation
from pathlib import Path
import zipfile

zip_path = Path("fordham-website.zip")

def clean_page_name(path_str: str) -> str:
    """
    Turn a filename such as 'gabelli-school-of-business_veterans.md' into a
    clean page name ('Gabelli School Of Business Veterans').
    """
    # Take just the file name (last component)
    fname = Path(path_str).name
    # Strip extension
    stem = Path(fname).stem
    # Replace separators with spaces and title-case
    clean = stem.replace("-", " ").replace("_", " ").strip().title()
    return clean

docs = []

with zipfile.ZipFile(zip_path, "r") as zf:
    for info in zf.infolist():
        # Skip directories
        if info.is_dir():
            continue
        
        # Read file bytes and decode as text
        content_bytes = zf.read(info.filename)
        try:
            text = content_bytes.decode("utf-8")
        except UnicodeDecodeError:
            # Fallback if some file has a different encoding
            text = content_bytes.decode("latin-1", errors="ignore")
        
        page_name = clean_page_name(info.filename)
        
        docs.append({
            "page_name": page_name,   # clean page name derived from the filename
            "content": text           # full file content as a string
        })

len(docs), docs[:3]  # quick sanity check


(9560,
 [{'page_name': 'Index',
   'content': 'https://www.fordham.edu/\n\n## Doing Good That Becomes Greater As The Jesuit University of New York\n\nWe’re located in New York City—driven by our Jesuit values and tackling today’s most pressing issues at the center of the world stage.\n\n## We Are Leaders, Dreamers, Achievers, And Doers\n\nWith sound hearts, strong minds, and the wisdom to take charge, generations of Rams have found what they have needed to grow—the opportunities, connections, and support of this community.\n\n## From Winding Elms to Bustling City Blocks\n\nWith residential campuses in the Bronx and Manhattan, as well as campuses in Westchester and London, Fordham provides endless opportunities to start working toward your career and building the life you want.\n\n\n## We’re Drawn to Where We’re Needed Most\n\nExplore how our values come to life: how Fordham’s students, faculty, and alumni contribute to society and make lives better.\n\n**Notice of Nondisc

---

# 2. Chunk the Documents

Some pages may be too long to embed as a single unit, and later on they may be too long to fit into the LLM's prompt during the generation step. For this reason, most RAG systems break large documents into smaller **chunks**.

> 📚 **TERM: Chunking**  
> Splitting documents into smaller, self-contained pieces for embedding and retrieval. The goal is chunks that are small enough to be specific, but large enough to be meaningful.

Your task: **write a function that splits each document into chunks.**

Things to think about:
- What's a reasonable chunk size? (Think about what fits in a prompt vs. what's too vague)
- Should you split on sentences? Paragraphs? A fixed character/word count?
- Should chunks overlap? What happens if an answer spans two chunks?
- How do you keep track of which document each chunk came from? You may need that information down the line.
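
To make the questions above about paragraphs and overlap concrete, here is a minimal sketch of one alternative to fixed-size splitting: pack whole paragraphs into chunks and repeat the last paragraph of each chunk at the start of the next. The function name `chunk_by_paragraph` and its defaults are assumptions for illustration, not a recommended configuration.

```python
def chunk_by_paragraph(text: str, max_chars: int = 900, overlap_paragraphs: int = 1) -> list[str]:
    """Greedily pack whole paragraphs into chunks of roughly max_chars characters.

    The last `overlap_paragraphs` paragraphs of a finished chunk are repeated at
    the start of the next one. The limit is approximate: a single long paragraph
    is kept whole, and carried-over paragraphs can push a chunk past max_chars.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []

    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
        current.append(para)

    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
for piece in chunk_by_paragraph(sample, max_chars=40):
    print(repr(piece))
```

Because paragraphs stay intact, an answer is less likely to be cut mid-sentence, at the cost of less uniform chunk sizes.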

In [8]:
def chunk_documents(
    docs: list[dict],
    chunk_size: int = 900,
    overlap: int = 0,
) -> list[dict]:
    """
    Split documents into fixed-size character chunks (default 900 chars).
    Each chunk keeps metadata so you know which document it came from.

    Args:
        docs: List of dicts with "page_name" and "content" keys
        chunk_size: Target number of characters per chunk
        overlap: Number of characters to overlap between chunks (0 = no overlap)

    Returns:
        List of chunk dicts with page_name, source_url, chunk_index, content
    """
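    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would stop the loop below from advancing
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    chunks = []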

    for doc in docs:
        page_name = doc["page_name"]
        text = doc["content"]

        # First line is the URL; rest is content
        lines = text.strip().split("\n")
        source_url = lines[0].strip() if lines else ""
        body = "\n".join(lines[1:]) if len(lines) > 1 else ""

        i = 0
        chunk_index = 0

        while i < len(body):
            chunk_text = body[i: i + chunk_size]
            chunk_dict = {
                "page_name": page_name,
                "source_url": source_url,
                "chunk_index": chunk_index,
                "content": chunk_text,
            }
            chunks.append(chunk_dict)
            chunk_index += 1
            i += chunk_size - overlap  # overlap=0 advances by a full chunk

    return chunks

# Run chunking on docs (from cell above)
chunks = chunk_documents(docs)
print(f"Created {len(chunks):,} chunks from {len(docs):,} documents")
print(f"Example chunk: {chunks[0]}")


Created 52,694 chunks from 9,560 documents
Example chunk: {'page_name': 'Index', 'source_url': 'https://www.fordham.edu/', 'chunk_index': 0, 'content': '\n## Doing Good That Becomes Greater As The Jesuit University of New York\n\nWe’re located in New York City—driven by our Jesuit values and tackling today’s most pressing issues at the center of the world stage.\n\n## We Are Leaders, Dreamers, Achievers, And Doers\n\nWith sound hearts, strong minds, and the wisdom to take charge, generations of Rams have found what they have needed to grow—the opportunities, connections, and support of this community.\n\n## From Winding Elms to Bustling City Blocks\n\nWith residential campuses in the Bronx and Manhattan, as well as campuses in Westchester and London, Fordham provides endless opportunities to start working toward your career and building the life you want.\n\n\n## We’re Drawn to Where We’re Needed Most\n\nExplore how our values come to life: how Fordham’s students, faculty

---

# 3. Embed the Chunks

Now we need to turn each chunk into a vector so we can search over them. You've done this before in Lecture 4.

Your task: **embed all chunks using an embedding model.**

Tips:
- You could use a local model or an API model. What are the tradeoffs?
- This will take a while if you do it serially. You might want to use async/batch.
- Once you've created your embeddings, you may want to save them to disk so you don't have to redo this step every time.
- You'll need to embed queries with the **same model** at search time.
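
The tip about saving embeddings to disk can be sketched as a small cache wrapper. The name `load_or_embed` and the `embed_fn` parameter are hypothetical; `embed_fn` stands in for whatever function turns a list of texts into a 2-D array. The row-count check is a deliberately weak freshness test: it would not notice a changed chunking strategy or a different embedding model, so delete the file when either changes.

```python
from pathlib import Path
from typing import Callable

import numpy as np


def load_or_embed(
    texts: list[str],
    embed_fn: Callable[[list[str]], np.ndarray],
    cache_path: str = "chunk_embeddings.npy",
) -> np.ndarray:
    """Return cached embeddings if the row count matches `texts`, else recompute and save.

    cache_path should end in ".npy", because np.save appends that suffix otherwise.
    """
    path = Path(cache_path)
    if path.exists():
        cached = np.load(path)
        if cached.shape[0] == len(texts):
            return cached  # cache hit: skip the expensive embedding call
    embeddings = np.asarray(embed_fn(texts))
    np.save(path, embeddings)
    return embeddings


# Usage (assuming `texts` and an embedding function already exist in the notebook):
# chunk_embeddings = load_or_embed(texts, embed_fn=my_embedding_function)
```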

In [10]:
import sys
from pathlib import Path

# Ensure helpers is on path (run notebook from lectures/ or project root)
_lectures = Path.cwd() / "lectures"
if (_lectures / "helpers.py").exists():
    sys.path.insert(0, str(_lectures))
elif (Path.cwd() / "helpers.py").exists():
    sys.path.insert(0, str(Path.cwd()))

from helpers import get_local_model, batch_embed_local, batch_embed_openai, batch_cosine_similarity
from dotenv import load_dotenv
import numpy as np

load_dotenv()

# "local" = Hugging Face model run locally | "openai" = OpenAI API (used here)
EMBED_MODEL_TYPE = "openai"

LOCAL_MODEL_NAME = "all-MiniLM-L6-v2"
OPENAI_MODEL_NAME = "text-embedding-3-small"


def get_embedding_model(model_type: str = EMBED_MODEL_TYPE):
    """Load the embedding model. Use same model at search time."""
    if model_type == "local":
        return get_local_model(LOCAL_MODEL_NAME)
    if model_type == "openai":
        return None  # API handles calls; no model object
    raise ValueError(f"model_type must be 'local' or 'openai', got {model_type}")


def _truncate_for_openai(text: str, max_chars: int = 8000) -> str:
    """Simple safeguard: trim very long chunks before sending to OpenAI."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars]


def embed_texts(texts: list[str], model_type: str = EMBED_MODEL_TYPE, show_progress: bool = True) -> np.ndarray:
    """Embed a list of texts. Returns (n_texts, dim) numpy array."""
    if model_type == "local":
        return batch_embed_local(texts, model_name=LOCAL_MODEL_NAME, show_progress=show_progress)
    if model_type == "openai":
        safe_texts = [_truncate_for_openai(t) for t in texts]
        return batch_embed_openai(safe_texts, model=OPENAI_MODEL_NAME, batch_size=50, verbose=show_progress)
    raise ValueError(f"model_type must be 'local' or 'openai', got {model_type}")


# Set to True to only embed 100 chunks for a quick test
TEST_MODE = False
n_chunks = 100 if TEST_MODE else len(chunks)
chunks_to_embed = chunks[:n_chunks]

texts = [c["content"] for c in chunks_to_embed]
chunk_embeddings = embed_texts(texts, model_type=EMBED_MODEL_TYPE)
print(f"Embedded {len(chunks_to_embed):,} chunks → shape {chunk_embeddings.shape}")

# Optional: save embeddings so you don't redo this step
np.save("chunk_embeddings.npy", chunk_embeddings)


Embedded 500/52694 texts
Embedded 1000/52694 texts
Embedded 1500/52694 texts
Embedded 2000/52694 texts
Embedded 2500/52694 texts
Embedded 3000/52694 texts
Embedded 3500/52694 texts
Embedded 4000/52694 texts
Embedded 4500/52694 texts
Embedded 5000/52694 texts
Embedded 5500/52694 texts
Embedded 6000/52694 texts
Embedded 6500/52694 texts
Embedded 7000/52694 texts
Embedded 7500/52694 texts
Embedded 8000/52694 texts
Embedded 8500/52694 texts
Embedded 9000/52694 texts
Embedded 9500/52694 texts
Embedded 10000/52694 texts
Embedded 10500/52694 texts
Embedded 11000/52694 texts
Embedded 11500/52694 texts
Embedded 12000/52694 texts
Embedded 12500/52694 texts
Embedded 13000/52694 texts
Embedded 13500/52694 texts
Embedded 14000/52694 texts
Embedded 14500/52694 texts
Embedded 15000/52694 texts
Embedded 15500/52694 texts
Embedded 16000/52694 texts
Embedded 16500/52694 texts
Embedded 17000/52694 texts
Embedded 17500/52694 texts
Embedded 18000/52694 texts
Embedded 18500/52694 texts
Embedded 19000/52694 

---

# 4. Retrieve

Now build the **R** in RAG. Given a user's question, find the most relevant chunks.

Your task: **write a retrieval function that takes a question and returns the most relevant chunks.**

Tips:
- You can use lexical or semantic search or both!
- How many chunks should you retrieve? Too few and you might miss the answer; too many and you'll overwhelm the LLM (and pay more tokens)
- Try a few test questions and eyeball whether the retrieved chunks are relevant. For example:
  - "What programs does the Gabelli School of Business offer?"
  - "How do I apply for financial aid?"
  - "Where is Fordham's campus?"
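
For the semantic route, the core of retrieval is a cosine-similarity ranking. The sketch below shows one way to compute it with plain NumPy. `top_k_cosine` is a hypothetical name and not the course helper; it assumes no embedding is the zero vector, since that would cause a division by zero.

```python
import numpy as np


def top_k_cosine(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Return (row_index, cosine_similarity) for the k rows of doc_embs closest to query_emb."""
    q = query_emb / np.linalg.norm(query_emb)                        # unit-length query
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)   # unit-length rows
    scores = d @ q                                                   # dot product of unit vectors = cosine
    top = np.argsort(-scores)[:k]                                    # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in top]


# Toy 2-D "embeddings": row 0 points the same way as the query, row 1 is orthogonal.
toy_embs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([1.0, 0.0])
print(top_k_cosine(toy_query, toy_embs, k=2))
```

Row 0 ranks first and row 2 second, because row 1 has zero similarity with the query.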

In [None]:
from typing import List, Dict, Any


def retrieve_chunks(
    question: str,
    k: int = 5,
    model_type: str = EMBED_MODEL_TYPE,
) -> List[Dict[str, Any]]:
    """Retrieve the top-k most relevant chunks for a question.

    Args:
        question: User question in plain text
        k: Number of chunks to return
        model_type: "local" or "openai" (must match how `chunk_embeddings` were created)

    Returns:
        List of chunk dicts augmented with rank and similarity score
    """
    # 1. Embed the question using the same model type
    if model_type == "local":
        model = get_embedding_model("local")
        query_emb = model.encode(question, convert_to_numpy=True)
    elif model_type == "openai":
        q_text = _truncate_for_openai(question)
        q_embs = batch_embed_openai([q_text], model=OPENAI_MODEL_NAME, batch_size=1, verbose=False)
        query_emb = q_embs[0]
    else:
        raise ValueError("model_type must be 'local' or 'openai'")

    # 2. Compute cosine similarity against all chunk embeddings
    scores = batch_cosine_similarity(query_emb, chunk_embeddings)

    # 3. Get top-k indices
    top_idx = np.argsort(-scores)[:k]

    results: List[Dict[str, Any]] = []
    for rank, idx in enumerate(top_idx, start=1):
        chunk = chunks_to_embed[idx]
        results.append(
            {
                "rank": rank,
                "score": float(scores[idx]),
                **chunk,
            }
        )

    return results


# Quick sanity check
sample_question = "How do I apply for financial aid?"
retrieved = retrieve_chunks(sample_question, k=3, model_type=EMBED_MODEL_TYPE)
for r in retrieved:
    print(f"Rank {r['rank']} | score={r['score']:.4f} | page={r['page_name']}")
    print(r["content"][:200].replace("\n", " "), "...\n")
    print("-" * 80)



Rank 1 | score=0.7064 | page=School Of Professional And Continuing Studies Admissions And Aid Financial Aid And Scholarships
Do I have to apply every year?**Yes. Financial aid is awarded on an annual basis.  **I don't think I will be eligible for aid. Should I still apply?**Absolutely! It is important to apply so that you c ...

--------------------------------------------------------------------------------
Rank 2 | score=0.6225 | page=Student Financial Services Undergraduate Financial Aid Current Students
 # Financial Aid Guidance for Current Students  Are you a current student looking to renew your financial aid or apply for new aid? We’ve got you covered.  Navigate this page:[Eligibility](#eligibilit ...

--------------------------------------------------------------------------------
Rank 3 | score=0.5989 | page=Student Financial Services Undergraduate Financial Aid Professional  Continuing Studies Students
 receive an award offer by email within four weeks of the completion of 
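
The tips in this section mention combining lexical and semantic search. One common way to merge two ranked lists is reciprocal rank fusion (RRF). It is sketched here as an assumption about how you might do it, not necessarily the method used in the hybrid search lecture. The function name and the constant `rrf_k=60` (a conventional default) are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[int]], rrf_k: int = 60) -> list[tuple[int, float]]:
    """Fuse several ranked lists of chunk ids into one list, best first.

    Each id scores sum(1 / (rrf_k + rank)) over the lists it appears in,
    where rank starts at 1 for the top result of each list.
    """
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy example: chunk ids as returned by a lexical and a semantic retriever.
lexical_ranking = [3, 1, 2]
semantic_ranking = [1, 4, 3]
fused = reciprocal_rank_fusion([lexical_ranking, semantic_ranking])
print([chunk_id for chunk_id, _ in fused])  # [1, 3, 4, 2]
```

Chunks that appear near the top of both lists (here 1 and 3) rise above chunks that only one retriever found.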

---

# 5. Generate

Now build the **G** in RAG. Take the retrieved chunks and pass them to an LLM along with the user's question.

Your task: **write a function that takes a question and the retrieved chunks, builds a prompt, and calls an LLM to generate an answer.**

Tips:
- How should you structure the prompt? The LLM needs to know: (1) what is the context of the application, (2) what is the question, (3) what it should include in its answer
- What should the LLM do if the context doesn't contain the answer?
- Start with a cheap model; try a better one when you've figured out the pipeline
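
One possible answer to the prompt-structure question is sketched below: number each retrieved chunk, include its page name and URL, cap the total context size, and tell the model what to do when the context is insufficient. `build_prompt`, the `max_context_chars` default, and the citation instruction are all assumptions for illustration; the chunk keys (`page_name`, `source_url`, `content`) follow the chunk dictionaries created earlier.

```python
def build_prompt(question: str, retrieved_chunks: list[dict], max_context_chars: int = 6000) -> str:
    """Format retrieved chunks as numbered sources followed by the question.

    Chunks are added in rank order until the character budget is used up.
    """
    parts: list[str] = []
    used = 0
    for n, chunk in enumerate(retrieved_chunks, start=1):
        block = (
            f"[{n}] {chunk.get('page_name', 'Unknown')} "
            f"({chunk.get('source_url', 'no url')})\n{chunk['content'].strip()}"
        )
        if parts and used + len(block) > max_context_chars:
            break  # always keep the first chunk, drop the rest once over budget
        parts.append(block)
        used += len(block)

    context = "\n\n---\n\n".join(parts)
    return (
        f"Context:\n\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer using only the context above and cite sources by number, e.g. [1]. "
        "If the context does not contain the answer, say so."
    )


toy_chunks = [
    {"page_name": "Page A", "source_url": "https://www.fordham.edu/", "content": "Some text."},
    {"page_name": "Page B", "content": "More text."},
]
print(build_prompt("What is on page A?", toy_chunks))
```

Numbering the sources gives the model a short handle to cite, which also makes it easier to show the user where an answer came from.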

In [None]:
import litellm

def generate_answer(
    question: str,
    retrieved_chunks: list[dict],
    model: str = "gpt-4o-mini",
) -> str:
    """Build a prompt from question + chunks, call LLM, return answer."""
    context_parts = []
    for c in retrieved_chunks:
        context_parts.append(f"[Source: {c.get('page_name', 'Unknown')}]\n{c['content']}")
    context = "\n\n---\n\n".join(context_parts)

    system = (
        "You are a helpful assistant that answers questions about Fordham University "
        "using only the provided context. If the context does not contain the answer, "
        "say so clearly. Do not make up information."
    )
    user = f"Context:\n\n{context}\n\nQuestion: {question}"

    response = litellm.completion(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content.strip()


The Full-time MBA (FTMBA) program at Fordham University is offered through the Gabelli School of Business and includes an onboarding component called Gabelli Launch, which is required for all incoming first-year students. This program starts in July and consists of both online and in-person elements, focusing on preparatory academic and experiential learning before the fall semester. It emphasizes skills in leadership development, teamwork, communication, technical skills with data analytics, problem-solving, and critical thinking.

The FTMBA program is also STEM-designated, reflecting its integration of foundational business disciplines with technology, engineering, and science, preparing graduates for careers in a growing digital economy. The program fosters a close-knit community among students and utilizes advanced tools like Augmented Reality (AR) and Virtual Reality (VR) to enhance the classroom experience.


---

# 6. Wire everything together

Combine the previous steps into a simple function that takes in a question and returns an answer.

Your task: **write a `rag(question)` function that retrieves relevant chunks and generates an answer.**

In [18]:
def rag(question: str, k: int = 5, model: str = "gpt-4o-mini") -> str:
    """Question in, answer out: retrieve the top-k chunks, then generate an answer from them."""
    chunks = retrieve_chunks(question, k=k, model_type=EMBED_MODEL_TYPE)
    return generate_answer(question, chunks, model=model)


# Example: question is the input; it flows to retrieve, then to generate_answer
user_question = input("Enter your question about Fordham University: ")
answer = rag(user_question)
print(answer)


Apostolos Filippas is an Assistant Professor at Fordham University's Gabelli School of Business, specifically focusing on Information, Technology, and Operations. He joined Fordham in 2019 and is also a research affiliate at the MIT Initiative on the Digital Economy. He is an economist who works on market design and the economics of technology, primarily in the context of online platforms. He teaches courses in Web Analytics and E-Commerce. His research addresses topics such as the design of reputation and pricing systems in online marketplaces, the economic and public policy implications of the "sharing economy," and the design of social media platforms. Additionally, his work has been featured in various mainstream media outlets.


---

# 7. Evaluate, experiment and improve

Your RAG system works, but there's always room to make it better.

Your task: **evaluate, experiment, and improve your system**

Tips:
- How do you know that your system is working or that your changes are improving it?
- Try different questions: where does it do well? Where does it struggle?
- Adjust the number of retrieved chunks: what happens with more or fewer?
- Try different chunking strategies: bigger chunks? Smaller? Overlap?
- Try a different embedding model: does it change retrieval quality?
- Improve the prompt: can you get better, more concise answers?
- Add source attribution: can the system tell the user which pages the answer came from?
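
To answer the first question in the list (how do you know a change helped?), one simple option is a small hand-labeled set of questions with the page you expect to see, scored by hit rate at k. The sketch below uses toy stand-ins so it runs on its own. `hit_rate_at_k`, the `expected_page` key, and the toy data are all hypothetical, and the metric covers retrieval only, not answer quality.

```python
from typing import Callable


def hit_rate_at_k(
    eval_set: list[dict],
    retrieve_fn: Callable[[str, int], list[dict]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected page shows up in the top-k retrieved chunks."""
    if not eval_set:
        return 0.0
    hits = 0
    for example in eval_set:
        retrieved = retrieve_fn(example["question"], k)
        pages = {chunk["page_name"] for chunk in retrieved}
        if example["expected_page"] in pages:
            hits += 1
    return hits / len(eval_set)


# Toy stand-ins: swap in your real retriever and your own labeled questions.
toy_index = {
    "How do I apply for financial aid?": ["Financial Aid", "Admissions"],
    "Where is Fordham's campus?": ["Athletics", "Dining"],
}


def toy_retrieve(question: str, k: int) -> list[dict]:
    return [{"page_name": name} for name in toy_index[question][:k]]


toy_eval_set = [
    {"question": "How do I apply for financial aid?", "expected_page": "Financial Aid"},
    {"question": "Where is Fordham's campus?", "expected_page": "Campuses"},
]
print(hit_rate_at_k(toy_eval_set, toy_retrieve, k=2))  # 1 hit out of 2 -> 0.5
```

With the notebook's retriever you could pass something like `lambda q, k: retrieve_chunks(q, k=k)` as `retrieve_fn` and compare the score before and after each change.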

In [None]:
# Placeholder for your implementation

---

# 8. (Optional) Make it an app

So far your RAG system lives inside a notebook. That's great for development, but nobody is going to use your Jupyter notebook to ask questions about Fordham. Let's turn it into a real web app.

> 📚 **TERM: Streamlit**  
> A Python library that turns plain Python scripts into interactive web apps. You write Python (no HTML, CSS, or JavaScript) and Streamlit renders it as a web page with inputs, buttons, and formatted output. It's the fastest way to go from "I have a function" to "I have a web app."

Your task: **create a Streamlit app that lets a user type a question about Fordham and get an answer from your RAG system.**

To get started:
- Install it: `uv pip install streamlit`
- A Streamlit app is just a `.py` file (not a notebook). Create something like `scripts/fordham_rag_app.py`
- Run it: `streamlit run scripts/fordham_rag_app.py` (this opens a browser tab with your app)

Tips:
- Check out the [Streamlit docs](https://docs.streamlit.io/); the "Get started" tutorial is very short
- Your best bet is to vibecode your way to this. You'll be surprised how fast you can get it up and running

---

# Summary

## What You Built

| Step | What You Did | What It Does |
|------|-------------|-------------|
| **Load** | Read 9,500+ Fordham web pages | Get raw content |
| **Chunk** | Split pages into smaller pieces | Make content searchable and promptable |
| **Embed** | Turn chunks into vectors | Enable semantic search |
| **Retrieve** | Find relevant chunks for a question | The **R** in RAG |
| **Generate** | Ask an LLM to answer using the chunks | The **G** in RAG |
| **RAG** | Wire it all together | Question in, answer out |

## The Big Picture

RAG is one of the most common patterns in AI engineering today. What you built here is the same core architecture behind tools like ChatGPT with search, Perplexity, enterprise Q&A bots, and more. The details get more sophisticated (vector databases, reranking, query rewriting, evaluation) but the pattern is the same:

**Find relevant stuff → give it to an LLM → get an answer.**

You can just build things.