# 🛠️ Active Learning Workshop: Implementing an Inverted Matrix (Jupyter + GitHub Edition)
## 🔍 Workshop Theme
*Readable, correct, and collaboratively reviewed code—just like in the real world.*


Welcome to the 90-minute workshop! In this hands-on session, your team will build an **Inverted Index** pipeline, the foundation of many intelligent systems that need fast and relevant access to text data — such as AI agents.

### 👥 Team Guidelines
- Work in teams of 3.
- Submit one completed Jupyter Notebook per team.
- The final notebook must contain **Markdown explanations** and **Python code**.
- Push your notebook to GitHub and share the `.git` link before class ends.

---
## 🔧 Workshop Tasks Overview

1. **Document Collection**
2. **Tokenizer Implementation**
3. **Normalization Pipeline (Stemming, Stop Words, etc.)**
4. **Build and Query the Inverted Index**

> Each step includes a sample **talking point**. Your team must add your own custom **Markdown + code cells** with a **second talking point**, and test your Inverted Index with **2 phrase queries**.




## 🧠 Learning Objectives
- Implement an **Inverted Index** (the "Inverted Matrix" of this workshop's title) over real-world text data as part of an NLP pipeline.
- Build **Jupyter Notebooks** with well-structured code and clear Markdown documentation.
- Use **Git and GitHub** for collaborative version control and code sharing.
- Identify and articulate coding issues ("**talking points**") and insert them directly into peer notebooks.
- Practice **collaborative debugging**, professional peer feedback, and improve code quality.

## 🧩 Workshop Structure (90 Minutes)
1. **Instructor Use Case Introduction** *(15 min)* – Set up teams of 3 people. Read and understand the workshop, plus submission instructions. Seek assistance if needed.
2. **Team Jupyter Notebook Development** *(45 min)* – Manual IR and Inverted Matrix coding + Markdown documentation (work as teams)
3. **Push to GitHub** *(15 min)* – Teams commit and push initial notebooks. **Include your names so the team that developed the Inverted Index code is easy to identify**.
4. **Instructor Review** – The instructor will circulate, take notes, and provide coaching as needed during the **Peer Review Round**.
5. **Email Delivery** *(15 min)* – Each team sends the instructor one email **with the `.git` link** to the GitHub repo. The email subject is: PROG8245 - Inverted Matrix Workshop, Team #_____.


## 💻 Submission Checklist
- ✅ `IR_InvertedMatrix_Workshop.ipynb` with:
  - Demo code: Document Collection, Tokenizer, Normalization Pipeline, and Inverted Index.
  - Markdown explanations for each major step
  - **Labeled talking point(s)** and 2 phrase query tests
- ✅ `README.md` with:
  - Dataset description
  - Team member names
  - Link to the dataset and license (if public)
- ✅ GitHub Repo:
  - Public repo named `IR-invertedmatrix-workshop`
  - This is a group effort, so **choose one member of the team** to publish the repo
  - At least **one commit containing one meaningful talking point**

## 📄 Step 1: Document Collection


### 🗣 Instructor Talking Point:
> We begin by gathering a text corpus. To build a robust index, your vocabulary should include **over 2000 unique words**. You can use scraped articles, academic papers, or open datasets.

### 🔧 Your Task:
- Collect at least 20 text documents.
- Ensure the vocabulary exceeds 2000 unique words.
- Load the documents into a list for processing.
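
One way to check the vocabulary requirement is to count the unique lowercase word tokens across all documents. The sketch below is a minimal illustration, not part of the original notebook: the helper name `vocabulary` and the two-sentence `sample_docs` corpus are assumptions made for the example.

```python
import re

def vocabulary(documents):
    """Return the set of unique lowercase word tokens across all documents."""
    vocab = set()
    for text in documents:
        vocab.update(re.findall(r"\b\w+\b", text.lower()))
    return vocab

# Hypothetical two-document corpus, used only to illustrate the check
sample_docs = ["The cat sat on the mat.", "The dog sat on the log."]
vocab = vocabulary(sample_docs)
print(len(vocab))         # 7 unique words: the, cat, sat, on, mat, dog, log
print(len(vocab) > 2000)  # False: this toy corpus is far too small
```

Running the same check on your own `documents` list tells you whether you need to collect more text before moving on.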


In [1]:
# Example: Load text files from a folder
import os

def load_documents(folder_path):
    documents = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                documents.append(file.read())
    return documents

# Replace 'data/' with the path to your own document folder
documents = load_documents('data/')
print(f"Loaded {len(documents)} documents.")


Loaded 20 documents.


## ✂️ Step 2: Tokenizer


### 🗣 Instructor Talking Point:
> The tokenizer breaks raw text into a stream of words (tokens). This is the foundation for every later step in IR and NLP.

### 🔧 Your Task:
- Implement a basic tokenizer that splits text into lowercase words.
- Handle punctuation removal and basic non-alphanumeric filtering.


In [2]:
import re

def tokenize(text):
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens

# Test on one document
tokens = tokenize(documents[0])
print(tokens[:20])  # Preview first 20 tokens


['business', 'analyst', 'top', 'companies', 'accenture', 'capgemini', 'infosys', 'deloitte', 'pwc', 'ey', 'loblaws', 'bell', 'shopify', 'td', 'core', 'skills', 'business', 'acumen', 'stakeholder', 'communication']
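
A possible talking point for this step is how the regex treats punctuation inside words. The sketch below repeats the `tokenize` function from the cell above and runs it on a made-up sentence (the sentence is an assumption chosen to show edge cases).

```python
import re

def tokenize(text):
    # Same pattern as the workshop tokenizer: runs of letters, digits, or underscores
    return re.findall(r"\b\w+\b", text.lower())

# Apostrophes and hyphens are not word characters, so they split tokens
print(tokenize("Don't over-think it: e-mail me at 9am!"))
# ['don', 't', 'over', 'think', 'it', 'e', 'mail', 'me', 'at', '9am']

# Underscores count as word characters, so this stays in one piece
print(tokenize("snake_case"))
# ['snake_case']
```

Whether fragments such as `'t'` or `'e'` are acceptable depends on your corpus; if they are not, the pattern or a later filtering step would need to change.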


## 🔁 Step 3: Normalization Pipeline (Stemming, Stop Word Removal, etc.)


### 🗣 Instructor Talking Point:
> Now we normalize tokens: convert to lowercase, remove stop words, apply stemming or affix stripping. This reduces redundancy and enhances search accuracy.

### 🔧 Your Task:
- Use `nltk` to remove stopwords and apply stemming.


In [3]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def normalize_tokens(tokens):
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

# Example: normalize one document
norm_tokens = normalize_tokens(tokens)
print(norm_tokens[:20])


['busi', 'analyst', 'top', 'compani', 'accentur', 'capgemini', 'infosi', 'deloitt', 'pwc', 'ey', 'loblaw', 'bell', 'shopifi', 'td', 'core', 'skill', 'busi', 'acumen', 'stakehold', 'commun']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mathe\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
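
To see what the two normalization steps do without any NLTK dependency, here is a deliberately simplified sketch. The stop-word list and suffix rules are hypothetical and much cruder than NLTK's stop-word corpus and `PorterStemmer`; the point is only to show filtering followed by affix stripping.

```python
# Hypothetical mini stop-word list and suffix rules, for illustration only
STOP_WORDS = {"the", "a", "an", "and", "of", "with", "is", "are", "in"}
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def strip_suffix(token):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def normalize(tokens):
    """Drop stop words, then strip one suffix from each remaining token."""
    return [strip_suffix(t) for t in tokens if t not in STOP_WORDS]

print(normalize(["the", "systems", "are", "designed", "with", "testing"]))
# ['system', 'design', 'test']
```

Different surface forms collapse to one index term, which is why a query for "design" can also match documents that say "designed".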


## 🔍 Step 4: Inverted Index


### 🗣 Instructor Talking Point:
> We now map each normalized token to the list of document IDs in which it appears. This is the core structure that allows fast Boolean and phrase queries.

### 🔧 Your Task:
- Build the inverted index using a dictionary.
- Add code to support phrase queries using positional indexing.


In [4]:
from collections import defaultdict
# Positional index stored as nested dicts: token -> {doc_id: [positions]}
def build_position_inverted_index(documents):
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(documents):
        tokens = normalize_tokens(tokenize(text))
        for position, token in enumerate(tokens):
            index[token][doc_id].append(position)
    return index

inverted_index = build_position_inverted_index(documents)
print(dict(inverted_index))


{'busi': defaultdict(<class 'list'>, {0: [0, 16, 31, 39], 4: [110, 600], 6: [506], 9: [232], 14: [299], 16: [7, 9], 17: [10], 19: [41]}), 'analyst': defaultdict(<class 'list'>, {0: [1, 32], 7: [4], 8: [26], 11: [13, 46], 16: [3, 8, 15, 23, 30, 36], 18: [11]}), 'top': defaultdict(<class 'list'>, {0: [2], 3: [2], 5: [2], 6: [526, 542], 11: [25], 13: [3], 19: [1]}), 'compani': defaultdict(<class 'list'>, {0: [3], 2: [0, 45], 3: [3], 4: [44, 329, 462], 5: [3], 6: [568], 9: [26], 11: [3, 30], 13: [4], 14: [277], 15: [111], 17: [0, 23], 19: [2]}), 'accentur': defaultdict(<class 'list'>, {0: [4]}), 'capgemini': defaultdict(<class 'list'>, {0: [5]}), 'infosi': defaultdict(<class 'list'>, {0: [6]}), 'deloitt': defaultdict(<class 'list'>, {0: [7]}), 'pwc': defaultdict(<class 'list'>, {0: [8]}), 'ey': defaultdict(<class 'list'>, {0: [9]}), 'loblaw': defaultdict(<class 'list'>, {0: [10], 11: [51]}), 'bell': defaultdict(<class 'list'>, {0: [11]}), 'shopifi': defaultdict(<class 'list'>, {0: [12], 5:
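
The talking point above mentions Boolean queries as well as phrase queries. As a rough sketch, a Boolean AND only needs the document IDs from the index, not the positions. The helper names and the three toy documents below are assumptions for illustration; the index layout mirrors the nested-dict format used in the cell above.

```python
from collections import defaultdict

def build_positional_index(token_lists):
    """Map token -> {doc_id: [positions]} for already-normalized token lists."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in enumerate(token_lists):
        for position, token in enumerate(tokens):
            index[token][doc_id].append(position)
    return index

def boolean_and(index, terms):
    """Return sorted doc IDs that contain every term, in any order."""
    doc_sets = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

# Hypothetical, already-normalized documents
toy_docs = [
    ["design", "system", "team"],
    ["system", "design"],
    ["team", "lead"],
]
toy_index = build_positional_index(toy_docs)
print(boolean_and(toy_index, ["design", "system"]))  # [0, 1]
print(boolean_and(toy_index, ["design", "lead"]))    # []
```

Using `index.get(term, {})` rather than `index[term]` avoids silently adding empty entries to the `defaultdict` when a query term is not in the vocabulary.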

## 🧪 Test: Phrase Queries


### 🗣 Instructor Talking Point:
> A phrase query requires the exact sequence of terms (e.g., "machine learning"). To support this, extend the inverted index to store positions, not just docIDs.

### 🔧 Your Task:
- Implement 2 phrase queries.
- Demonstrate that they return the correct documents.
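
To make the role of positions concrete, the sketch below runs a phrase match over a tiny hand-written positional index. The function name `phrase_match` and the index contents are hypothetical; both documents contain both terms, but only one has them in the queried order.

```python
def phrase_match(index, terms):
    """Return sorted doc IDs where the terms occur at consecutive positions."""
    if not terms or any(term not in index for term in terms):
        return []
    # Candidate documents must contain every term
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in candidates:
        later = [set(index[term][doc_id]) for term in terms[1:]]
        # The term at offset i + 1 must sit i + 1 positions after the first term
        if any(all(start + i + 1 in pos for i, pos in enumerate(later))
               for start in index[terms[0]][doc_id]):
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical positional index: token -> {doc_id: [positions]}
toy_index = {
    "design": {0: [0], 1: [1]},
    "system": {0: [1], 1: [0]},
}
print(phrase_match(toy_index, ["design", "system"]))  # [0]
print(phrase_match(toy_index, ["system", "design"]))  # [1]
```

An index that stored only document IDs could not tell these two queries apart, since both terms appear in both documents.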


In [5]:
def phrase_query_search_stem(phrase, inverted_index):
    # normalize_tokens already removes stop words and stems, so the query goes
    # through the same pipeline as the indexed documents. Stemming a second
    # time is unnecessary and can change some terms so they no longer match.
    stemmed_terms = normalize_tokens(tokenize(phrase))
    print(f"Stemmed terms for query '{phrase}': {stemmed_terms}")

    if not stemmed_terms:
        return []

    # Postings for each term: {doc_id: [positions]}; empty if the term is unseen
    postings = [inverted_index.get(term, {}) for term in stemmed_terms]

    # Candidate documents must contain every query term
    common_docs = set(postings[0].keys())
    for p in postings[1:]:
        common_docs &= set(p.keys())

    results = []
    for doc_id in sorted(common_docs):
        positions_lists = [p[doc_id] for p in postings]
        # Term i must appear exactly i positions after the first term
        for pos in positions_lists[0]:
            if all((pos + i) in positions_lists[i] for i in range(1, len(stemmed_terms))):
                results.append(doc_id)
                break
    return results


In [6]:
query1 = "design"
query2 = "design systems"
query3 = "Experience with design systems"

matching_docs1 = phrase_query_search_stem(query1, inverted_index)
matching_docs2 = phrase_query_search_stem(query2, inverted_index)
matching_docs3 = phrase_query_search_stem(query3, inverted_index)

print("Query 1 matched documents:", matching_docs1)
print("Query 2 matched documents:", matching_docs2)
print("Query 3 matched documents:", matching_docs3)


Stemmed terms for query 'design': ['design']
Stemmed terms for query 'design systems': ['design', 'system']
Stemmed terms for query 'Experience with design systems': ['experi', 'design', 'system']
Query 1 matched documents: [4, 6, 7, 8, 9, 14, 18]
Query 2 matched documents: [6, 7]
Query 3 matched documents: [6]
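
One way to demonstrate that the results are correct is to compare them against a brute-force scan that ignores the index and looks for the phrase directly in each document's normalized token list. The sketch below uses hypothetical helper names and toy token lists.

```python
def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside doc_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

def brute_force_phrase_search(token_lists, phrase_tokens):
    """Scan every document; slow, but simple enough to trust as a reference."""
    return [doc_id for doc_id, tokens in enumerate(token_lists)
            if contains_phrase(tokens, phrase_tokens)]

# Hypothetical, already-normalized documents
toy_docs = [
    ["experi", "design", "system"],
    ["system", "design"],
    ["design", "team", "system"],
]
print(brute_force_phrase_search(toy_docs, ["design", "system"]))  # [0]
```

In the notebook, the token lists would come from `normalize_tokens(tokenize(doc))` for each entry in `documents`, and the brute-force result should equal the sorted output of the index-based search for the same query.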


## Talking Points

### **Query 1: "design"**
- **Stemmed Term:** `['design']`
- **Type:** Single keyword search
- **Match Strategy:** Finds documents containing the stemmed term `'design'`
- **Result:** Returns all documents where a form of "design" (e.g., "designs", "designed") appears
- **Talking Point:**  
  This is a broad query. Since `'design'` is a commonly used term, it retrieves many documents (7 of 20 here) that mention design in any context.

---

### **Query 2: "design systems"**
- **Stemmed Terms:** `['design', 'system']`
- **Type:** Phrase query (2 terms)
- **Match Strategy:** Searches for documents where `'design'` is immediately followed by `'system'`
- **Result:** Returns fewer, more context-specific matches
- **Talking Point:**  
  This query is more targeted. It narrows results to documents that mention design systems explicitly, which is useful when looking for focused content.

---

### **Query 3: "Experience with design systems"**
- **Stemmed Terms:** `['experi', 'design', 'system']` (`'with'` is an NLTK stop word, so `normalize_tokens` drops it)
- **Type:** Full phrase query (4 words, 3 terms after stop-word removal)
- **Match Strategy:** Requires all stemmed terms to appear in **sequence** in a document
- **Result:** Returns a single document (`6`) in this corpus
- **Talking Point:**  
  This is a highly specific query. It retrieves only documents whose normalized text contains the full phrase in order. This improves precision but may limit recall if the exact phrasing isn't present.

---

### 🧠 Summary:
- Short, broad queries (like Query 1) have high recall but low precision.
- Two-term phrases (like Query 2) balance precision and recall.
- Long, structured phrases (like Query 3) offer high precision but may miss loosely worded matches.
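
The precision and recall trade-off in this summary can be made concrete with a small calculation. In the sketch below, the retrieved lists are the query outputs shown earlier, while the `relevant` set is a hypothetical hand-made judgment (no relevance labels exist in the notebook), so the numbers illustrate the method rather than report a measured result.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical relevance judgment: assume only documents 6 and 7 are truly
# about "design systems"
relevant = {6, 7}

for name, retrieved in [
    ("Query 1", [4, 6, 7, 8, 9, 14, 18]),
    ("Query 2", [6, 7]),
    ("Query 3", [6]),
]:
    p, r = precision_recall(retrieved, relevant)
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
# Query 1: precision=0.29, recall=1.00
# Query 2: precision=1.00, recall=1.00
# Query 3: precision=1.00, recall=0.50
```

Under that assumed relevance set, the broad query finds everything but with many false positives, and the longest phrase stays precise while missing one relevant document.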
