# COMP8420 Student Agent – Dataset Preparation

This notebook prepares the dataset for the Student Agent project. The dataset includes:

1. Lecture content extracted from COMP8420 PDF slides  
2. Practical files converted from Jupyter Notebooks (.ipynb)  
3. All content is stored in plain `.txt` files to be used in a Retrieval-Augmented Generation (RAG) pipeline


In [7]:
from langchain_community.document_loaders import PyMuPDFLoader
from pathlib import Path

# Set your input/output paths
pdf_folder = Path("/Users/shaimonrahman/Desktop/COMP8420/Lectures")  # Change this
output_folder = Path("/Users/shaimonrahman/Desktop/COMP8420/Assignment_3/StudentAgentDataset/COMP8420/lectures")
output_folder.mkdir(parents=True, exist_ok=True)

# Extract each lecture PDF into a .txt file
for pdf_file in pdf_folder.glob("*.pdf"):
    loader = PyMuPDFLoader(str(pdf_file))
    documents = loader.load()
    full_text = "\n".join([doc.page_content for doc in documents])
    txt_filename = pdf_file.stem + ".txt"
    with open(output_folder / txt_filename, "w", encoding="utf-8") as f:
        f.write(full_text)
    print(f"Extracted: {pdf_file.name} → {txt_filename}")


Extracted: COMP8420-W6 - Dev LLMs - Fine-tuning.pdf → COMP8420-W6 - Dev LLMs - Fine-tuning.txt
Extracted: COMP8420-W9-v2.pdf → COMP8420-W9-v2.txt
Extracted: COMP8420-W4 - Use LLMs - Text processing.pdf → COMP8420-W4 - Use LLMs - Text processing.txt
Extracted: COMP8420-W1.pdf → COMP8420-W1.txt
Extracted: HEAL.pdf → HEAL.txt
Extracted: NLP_Guest_Lecture_HMC_V2.pdf → NLP_Guest_Lecture_HMC_V2.txt
Extracted: COMP8420-W13.pdf → COMP8420-W13.txt
Extracted: COMP8420-W10.pdf → COMP8420-W10.txt
Extracted: COMP8420-W7 - Und LLMs - Risk and future.pdf → COMP8420-W7 - Und LLMs - Risk and future.txt
Extracted: COMP8420-W8 - Dev LLMs - Humanoid AI - 2.pdf → COMP8420-W8 - Dev LLMs - Humanoid AI - 2.txt
Extracted: COMP8420-W5 - Dev LLMs - Multimodal LLMs.pdf → COMP8420-W5 - Dev LLMs - Multimodal LLMs.txt
Extracted: COMP8420-W11-v2.pdf → COMP8420-W11-v2.txt
Extracted: COMP8420-W2-v1.pdf → COMP8420-W2-v1.txt
Extracted: COMP8420-W3 - Und LLMs - Foundation models.pdf → COMP8420-W3 - Und LLMs - Foundation m

In [8]:
import nbformat

# Define paths for notebooks and output .txt
prac_folder = Path("/Users/shaimonrahman/Desktop/COMP8420/Prac")  # Folder with your .ipynb files
output_folder = Path("/Users/shaimonrahman/Desktop/COMP8420/Assignment_3/StudentAgentDataset/COMP8420/practicals")
output_folder.mkdir(parents=True, exist_ok=True)

# Function to extract markdown + code
def extract_notebook_text(nb_path):
    nb = nbformat.read(open(nb_path, "r", encoding="utf-8"), as_version=4)
    content = []
    for cell in nb.cells:
        if cell.cell_type in ['markdown', 'code']:
            content.append(cell.source)
    return "\n\n".join(content)

# Convert all .ipynb files to .txt
for nb_file in prac_folder.glob("*.ipynb"):
    text = extract_notebook_text(nb_file)
    txt_name = nb_file.stem + ".txt"
    with open(output_folder / txt_name, "w", encoding="utf-8") as f:
        f.write(text)
    print(f"Converted: {nb_file.name} → {txt_name}")



Converted: COMP8420-week3-solution.ipynb → COMP8420-week3-solution.txt
Converted: COMP8420-week4-solution.ipynb → COMP8420-week4-solution.txt
Converted: COMP8420-week5-solution.ipynb → COMP8420-week5-solution.txt
Converted: COMP8420-week2-solution.ipynb → COMP8420-week2-solution.txt
Converted: COMP8420-week4-practice.ipynb → COMP8420-week4-practice.txt
Converted: COMP8420-week7-solution.ipynb → COMP8420-week7-solution.txt
Converted: COMP8420-week3-practice.ipynb → COMP8420-week3-practice.txt
Converted: COMP8420-week6-solution.ipynb → COMP8420-week6-solution.txt
Converted: COMP8420_week1_solution.ipynb → COMP8420_week1_solution.txt
Converted: COMP8420_week12_solution.ipynb → COMP8420_week12_solution.txt
Converted: COMP8420_week9_solution.ipynb → COMP8420_week9_solution.txt
Converted: COMP8420_week8_solution.ipynb → COMP8420_week8_solution.txt
Converted: COMP8420_week11_solution.ipynb → COMP8420_week11_solution.txt


In [9]:
import json
from pathlib import Path

# Output directory
output_dir = Path("/Users/shaimonrahman/Desktop/COMP8420/Assignment_3/StudentAgentDataset/COMP8420")
output_dir.mkdir(parents=True, exist_ok=True)

# Define Q&A dataset
qa_data = [
    {
        "question": "What is a foundation model in NLP?",
        "answer": "A foundation model is a large pre-trained model trained on massive datasets, serving as the base for fine-tuning on specific NLP tasks."
    },
    {
        "question": "What are examples of foundation models?",
        "answer": "Examples include GPT-3.5, Claude, PaLM, BERT, LLaMA, and Mistral."
    },
    {
        "question": "What does fine-tuning mean in the context of LLMs?",
        "answer": "Fine-tuning is the process of continuing training on a pre-trained model using a smaller, task-specific dataset."
    },
    {
        "question": "What is the difference between prompt tuning and fine-tuning?",
        "answer": "Prompt tuning adjusts input formatting without altering the model, while fine-tuning retrains the model on new data."
    },
    {
        "question": "What was covered in Week 6 of COMP8420?",
        "answer": "Week 6 focused on fine-tuning large language models, covering parameter-efficient fine-tuning and adapter-based methods."
    },
    {
        "question": "When is the COMP8420 project presentation due?",
        "answer": "The presentation is scheduled for Week 13, Friday, June 6th, 2025."
    },
    {
        "question": "What is the final deadline for Assignment 3?",
        "answer": "Assignment 3 (code + report) is due during the exam period on June 17th, 2025."
    },
    {
        "question": "What type of model should we use for our project?",
        "answer": "You can use OpenAI’s GPT-3.5 or open-source models like LLaMA2 or Mistral, depending on your goals and data."
    },
    {
        "question": "How do we evaluate our NLP project?",
        "answer": "Use metrics like BLEU, ROUGE, accuracy, and ablation studies to compare performance against baselines or alternatives."
    },
    {
        "question": "What are embedding models used for?",
        "answer": "They convert text into vector form for similarity search and retrieval, commonly used in RAG pipelines."
    },
    {
        "question": "How do we retrieve answers in a RAG system?",
        "answer": "Text chunks are embedded into vectors and searched via similarity to retrieve relevant content for generation."
    },
    {
        "question": "How are lectures delivered in COMP8420?",
        "answer": "Lectures are delivered via PDFs and practical notebooks, combining theory and hands-on exercises."
    },
    {
        "question": "What technologies are used in this course?",
        "answer": "Hugging Face, PyTorch, LangChain, OpenAI APIs, and vector databases like FAISS."
    },
    {
        "question": "Can we use ChatGPT in our project?",
        "answer": "Yes, but your project must demonstrate additional engineering beyond just using the API."
    },
    {
        "question": "What is parameter-efficient fine-tuning?",
        "answer": "It refers to fine-tuning techniques like LoRA and Adapters that update only a small subset of the model."
    },
    {
        "question": "What should we cover in the presentation?",
        "answer": "Your project title, real-world problem, methodology, expected outcome, and team contributions."
    },
    {
        "question": "What’s the length of the presentation?",
        "answer": "You must present for 3–4 minutes during the Week 13 Practice Workshop."
    },
    {
        "question": "How can we improve our mark?",
        "answer": "Make your project novel, apply evaluation methods, and clearly explain your work in the report and presentation."
    },
    {
        "question": "What are some real-world NLP tasks?",
        "answer": "Text classification, summarization, question answering, entity recognition, translation, and dialogue generation."
    },
    {
        "question": "Can we build an agent using LangChain?",
        "answer": "Yes, LangChain is commonly used to implement RAG-based agents using LLMs and vector stores."
    }
]

# Save as qna.json
with open(output_dir / "qna.json", "w", encoding="utf-8") as f:
    json.dump(qa_data, f, indent=2)

print("qna.json saved successfully.")


qna.json saved successfully.


In [10]:
import json
from pathlib import Path

# Output directory (same as before)
output_dir = Path("/Users/shaimonrahman/Desktop/COMP8420/Assignment_3/StudentAgentDataset/COMP8420")

# Define the deadlines
deadlines = [
    {
        "task": "Team Registration",
        "due_date": "2025-05-30",
        "type": "Workshop"
    },
    {
        "task": "Project Presentation (Week 13)",
        "due_date": "2025-06-06",
        "type": "Practice Workshop"
    },
    {
        "task": "Final Report + Code Submission",
        "due_date": "2025-06-17",
        "type": "Exam Period"
    }
]

# Save to JSON
with open(output_dir / "deadlines.json", "w", encoding="utf-8") as f:
    json.dump(deadlines, f, indent=2)

print("deadlines.json saved successfully.")


deadlines.json saved successfully.


In [11]:
course_info_text = """
Course Title: COMP8420 – Advanced Natural Language Processing (S1 2025)

Description:
This course teaches students how to apply modern Natural Language Processing (NLP) techniques using large language models (LLMs). It focuses on real-world applications and responsible development practices. Topics include foundation models, prompt engineering, fine-tuning, RAG pipelines, privacy, security, and AI agent design.

Teaching Team:
- Dr. Qiongkai Xu (Lecturer)
- Prof. Longbing Cao (Supervisor)
- Mr. Weijun Li (TA)

Key Technologies:
- Hugging Face, PyTorch, OpenAI APIs, LangChain, FAISS

Assessments:
- Assignment 1: Text Classification
- Assignment 2: Text Generation
- Assignment 3: Team Project (Presentation + Code + Report)

Objective:
Prepare students to build and deploy intelligent NLP systems with ethical awareness and practical skills.
"""

# Save to .txt
with open(output_dir / "course_info.txt", "w", encoding="utf-8") as f:
    f.write(course_info_text.strip())

print("course_info.txt saved successfully.")


course_info.txt saved successfully.


In [12]:
import json
from pathlib import Path

# Output directory
output_dir = Path("/Users/shaimonrahman/Desktop/COMP8420/Assignment_3/StudentAgentDataset/COMP8420")

# Announcements dataset
announcements = [
    {
        "title": "Assignment 3 – Presentation Reminder",
        "content": "Don't forget to prepare a 3–4 minute talk about your project for the Week 13 workshop (Friday June 6th)."
    },
    {
        "title": "Assignment 3 Submission",
        "content": "Final code and report must be submitted by June 17th during the exam period. Late submissions will not be accepted without special consideration."
    },
    {
        "title": "Week 7 Workshop Topic",
        "content": "We'll explore risks and safety issues in LLMs. Please review the Week 7 slides before the workshop."
    },
    {
        "title": "Week 10 Team Registration",
        "content": "Make sure to form your group and register your project title by the Week 10 workshop."
    }
]

# Save as JSON
with open(output_dir / "announcements.json", "w", encoding="utf-8") as f:
    json.dump(announcements, f, indent=2)

print("announcements.json saved successfully.")


announcements.json saved successfully.


In [13]:
# Sample discussion Q&A
discussions = [
    {
        "question": "Do we have to use LangChain for Assignment 3?",
        "answer": "No, LangChain is optional. You can use any framework that supports RAG or LLM integration."
    },
    {
        "question": "Can we use ChatGPT API?",
        "answer": "Yes, but remember that you must demonstrate your own engineering effort in addition to using the API."
    },
    {
        "question": "Is it okay to work solo on Assignment 3?",
        "answer": "Projects should be completed in teams of two unless you’ve received special permission."
    },
    {
        "question": "What’s the expected length of the presentation?",
        "answer": "Each team should present for 3–4 minutes during the Week 13 practice workshop."
    },
    {
        "question": "Can we include public datasets in our project?",
        "answer": "Yes, you may include public datasets as long as they’re relevant to your topic and cited properly."
    }
]

# Save as JSON
with open(output_dir / "discussions.json", "w", encoding="utf-8") as f:
    json.dump(discussions, f, indent=2)

print("discussions.json saved successfully.")


discussions.json saved successfully.


In [14]:
pip install langchain langchain-community faiss-cpu openai tiktoken


Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0.tar.gz (70 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m70.2/70.2 kB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
Building wheels for collected packages: faiss-cpu


  Building wheel for faiss-cpu (pyproject.toml) ... [?25lerror
  [1;31merror[0m: [1msubprocess-exited-with-error[0m
  
  [31m×[0m [32mBuilding wheel for faiss-cpu [0m[1;32m([0m[32mpyproject.toml[0m[1;32m)[0m did not run successfully.
  [31m│[0m exit code: [1;36m1[0m
  [31m╰─>[0m [31m[12 lines of output][0m
  [31m   [0m running bdist_wheel
  [31m   [0m running build
  [31m   [0m running build_py
  [31m   [0m running build_ext
  [31m   [0m building 'faiss._swigfaiss' extension
  [31m   [0m swigging faiss/faiss/python/swigfaiss.i to faiss/faiss/python/swigfaiss_wrap.cpp
  [31m   [0m swig -python -c++ -Doverride= -doxygen -Ifaiss -I/private/var/folders/n9/zw6f9z6j7tbf93ys9qbgldpw0000gn/T/pip-build-env-ea7i9ihs/overlay/lib/python3.11/site-packages/numpy/_core/include -Ifaiss -I/usr/local/include -o faiss/faiss/python/swigfaiss_wrap.cpp faiss/faiss/python/swigfaiss.i
  [31m   [0m Traceback (most recent call last):
  [31m   [0m   File "/Users/shaimonra

In [11]:
import os
from pathlib import Path
import nbformat
from langchain_community.document_loaders import PyMuPDFLoader, JSONLoader, TextLoader
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Set your base dataset path
dataset_path = Path("/Users/shaimonrahman/Desktop/COMP8420/Assignment_3/StudentAgentDataset/COMP8420")
persist_path = dataset_path / "chroma_store"
persist_path.mkdir(parents=True, exist_ok=True)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=100)
all_chunks = []

# 1. Load Lecture PDFs
lecture_path = dataset_path / "lectures"
for pdf_file in lecture_path.glob("*.pdf"):
    loader = PyMuPDFLoader(str(pdf_file))
    docs = loader.load()
    chunks = text_splitter.split_documents(docs)
    all_chunks.extend(chunks)

# 2. Load Practical Notebooks (.ipynb)
pracs_path = dataset_path / "practicals"
for ipynb_file in pracs_path.glob("*.ipynb"):
    nb = nbformat.read(open(ipynb_file, "r", encoding="utf-8"), as_version=4)
    text = ""
    for cell in nb.cells:
        if cell.cell_type in ["markdown", "code"]:
            text += cell.source + "\n\n"
    doc = Document(page_content=text, metadata={"source": ipynb_file.name})
    chunks = text_splitter.split_documents([doc])
    all_chunks.extend(chunks)

# 3. Load JSON Files (qna, deadlines, etc.)
json_files = ["qna.json", "deadlines.json", "announcements.json", "discussions.json"]
for json_file in json_files:
    loader = JSONLoader(
        file_path=str(dataset_path / json_file),
        jq_schema=".[]",
        text_content=False
    )
    docs = loader.load()
    chunks = text_splitter.split_documents(docs)
    all_chunks.extend(chunks)

# 4. Load course_info.txt
info_loader = TextLoader(str(dataset_path / "course_info.txt"))
docs = info_loader.load()
chunks = text_splitter.split_documents(docs)
all_chunks.extend(chunks)

# 5. Embed and save to Chroma
embedding = OpenAIEmbeddings(disallowed_special=())
vectorstore = Chroma.from_documents(
    documents=all_chunks,
    embedding=embedding,
    persist_directory=str(persist_path)
)
vectorstore.persist()

print(f"Done. Total chunks embedded: {len(all_chunks)}")


Done. Total chunks embedded: 34


  vectorstore.persist()


In [12]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
from langchain.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Load embedding and Chroma vector store
embedding = OpenAIEmbeddings(disallowed_special=())
persist_path = "/Users/shaimonrahman/Desktop/COMP8420/Assignment_3/StudentAgentDataset/COMP8420/chroma_store"
vectorstore = Chroma(persist_directory=persist_path, embedding_function=embedding)

# Create retriever
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 4})

# Load chat model (you can use "gpt-3.5-turbo" or any model you prefer)
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Set up the QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)

# Ask a question
query = "When is Assignment 3 due and how do we submit it?"
response = qa_chain.invoke({"query": query})

print("Answer:")
print(response["result"])

print("\nSources:")
for doc in response["source_documents"]:
    print("-", doc.metadata.get("source", "Unknown"))


  vectorstore = Chroma(persist_directory=persist_path, embedding_function=embedding)


Answer:
Assignment 3 (code + report) is due during the exam period on June 17th, 2025. The final code and report must be submitted by that date. Late submissions will not be accepted without special consideration.

Sources:
- /Users/shaimonrahman/Desktop/COMP8420/Assignment_3/StudentAgentDataset/COMP8420/qna.json
- /Users/shaimonrahman/Desktop/COMP8420/Assignment_3/StudentAgentDataset/COMP8420/announcements.json
- /Users/shaimonrahman/Desktop/COMP8420/Assignment_3/StudentAgentDataset/COMP8420/discussions.json
- /Users/shaimonrahman/Desktop/COMP8420/Assignment_3/StudentAgentDataset/COMP8420/qna.json
