# 🛠️ Active Learning Workshop: Implementing an Inverted Matrix (Jupyter + GitHub Edition)
## 🔍 Workshop Theme
*Readable, correct, and collaboratively reviewed code—just like in the real world.*


Welcome to the 90-minute workshop! In this hands-on session, your team will build an **Inverted Index** pipeline, the foundation of many intelligent systems, such as AI agents, that need fast and relevant access to text data.

### 👥 Team Guidelines
- Work in teams of 3.
- Submit one completed Jupyter Notebook per team.
- The final notebook must contain **Markdown explanations** and **Python code**.
- Push your notebook to GitHub and share the `.git` link before class ends.

---
## 🔧 Workshop Tasks Overview

1. **Document Collection**
2. **Tokenizer Implementation**
3. **Normalization Pipeline (Stemming, Stop Words, etc.)**
4. **Build and Query the Inverted Index**

> Each step includes a sample **talking point**. Your team must add your own custom **Markdown + code cells** with a **second talking point**, and test your Inverted Index with **2 phrase queries**.




## 🧠 Learning Objectives
- Implement an **Inverted Index** on real-world text data as part of an NLP pipeline.
- Build **Jupyter Notebooks** with well-structured code and clear Markdown documentation.
- Use **Git and GitHub** for collaborative version control and code sharing.
- Identify and articulate coding issues ("**talking points**") and insert them directly into peer notebooks.
- Practice **collaborative debugging**, professional peer feedback, and improve code quality.

## 🧩 Workshop Structure (90 Minutes)
1. **Instructor Use Case Introduction** *(15 min)* – Set up teams of 3 people. Read and understand the workshop, plus submission instructions. Seek assistance if needed.
2. **Team Jupyter Notebook Development** *(45 min)* – Manual IR and Inverted Matrix coding + Markdown documentation (work as teams)
3. **Push to GitHub** *(15 min)* – Teams commit and push initial notebooks. **Make sure to include your names so it is easy to identify the team that developed the code**.
4. **Instructor Review** – The instructor will circulate, take notes, and provide coaching as needed during the **Peer Review Round**.
5. **Email Delivery** *(15 min)* – Each team sends the instructor an email **with the *.git link** to the GitHub repo **(one email/team)**. The email subject is: PROG8245 - Inverted Matrix  Workshop, Team #_____.


## 💻 Submission Checklist
- ✅ `IR_InvertedMatrix_Workshop.ipynb` with:
  - Demo code: Document Collection, Tokenizer, Normalization Pipeline, and Inverted Index.
  - Markdown explanations for each major step
  - **Labeled talking point(s)** and 2 phrase query tests
- ✅ `README.md` with:
  - Dataset description
  - Team member names
  - Link to the dataset and license (if public)
- ✅ GitHub Repo:
  - Public repo named `IR-invertedmatrix-workshop`
  - This is a group effort, so **choose one member of the team** to publish the repo
  - At least **one commit containing one meaningful talking point**

## 📄 Step 1: Document Collection


### 🗣 Instructor Talking Point:
> We begin by gathering a text corpus. To build a robust index, your vocabulary should include **over 2000 unique words**. You can use scraped articles, academic papers, or open datasets.

### 🔧 Your Task:
- Collect at least 20 text documents.
- Ensure the vocabulary exceeds 2000 unique words.
- Load the documents into a list for processing.
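
One way to check the 2000-word requirement is to count distinct lowercase word tokens across the corpus. The sketch below is only an assumed approach: `vocabulary_size` is a hypothetical helper that is not part of the workshop code, and it uses a simple regex instead of the tokenizer from Step 2.

```python
import re

def vocabulary_size(documents):
    """Count distinct lowercase word tokens across all documents."""
    vocabulary = set()
    for text in documents:
        vocabulary.update(re.findall(r"\w+", text.lower()))
    return len(vocabulary)

# Toy corpus for illustration only; pass your own loaded documents instead.
sample_docs = ["Python and Java jobs", "Java developer job in Python"]
print(vocabulary_size(sample_docs))  # 7
```

This counts raw word forms. After stop word removal and stemming the number of distinct terms will usually be smaller, so it may be worth checking the count again after Step 3.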


In [33]:
# Example: Load text files from a folder
import os

def load_documents(folder_path):
    documents = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                documents.append(file.read())
    return documents

# Replace 'data/' with your actual folder
documents = load_documents('data/')
print(f"Loaded {len(documents)} documents.")


Loaded 16 documents.


## ✂️ Step 2: Tokenizer


### 🗣 Instructor Talking Point:
> The tokenizer breaks raw text into a stream of words (tokens). This is the foundation for every later step in IR and NLP.

### 🔧 Your Task:
- Implement a basic tokenizer that splits text into lowercase words.
- Handle punctuation removal and basic non-alphanumeric filtering.


In [34]:
import re

def tokenize(text):
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens

# Test on one document
tokens = tokenize(documents[0])
print(tokens[:20])  # Preview first 20 tokens


['about', 'the', 'job', 'swift', 'python', 'java', 'go', 'verilog', 'typescript', 'javascript', 'c', 'or', 'c', 'coding', 'experience', 'required', 'this', 'is', 'a', 'great']
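
The preview above contains `'c', 'or', 'c'`, which suggests that the `\b\w+\b` pattern drops symbols from names such as "C++" or "C#" (an assumption about the source text, which is not shown here). As a possible talking point, the sketch below compares the original pattern with a hypothetical alternative, `tokenize_keep_symbols`, that keeps trailing `+` and `#` characters. It is an illustration, not the workshop's required tokenizer.

```python
import re

def tokenize_keep_symbols(text):
    # A run of ASCII letters, digits, or underscores, optionally followed by '+' or '#'.
    # Unlike \w, this pattern does not match non-ASCII letters.
    return re.findall(r"[a-z0-9_]+[+#]*", text.lower())

sample = "C or C++, plus C#."
print(re.findall(r"\b\w+\b", sample.lower()))  # ['c', 'or', 'c', 'plus', 'c']
print(tokenize_keep_symbols(sample))           # ['c', 'or', 'c++', 'plus', 'c#']
```

Whichever pattern you choose, queries must be tokenized the same way as the documents, otherwise the query terms will not line up with the index keys.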


## 🔁 Step 3: Normalization Pipeline (Stemming, Stop Word Removal, etc.)


### 🗣 Instructor Talking Point:
> Now we normalize tokens: convert to lowercase, remove stop words, and apply stemming or affix stripping. This collapses word variants into a single term, which shrinks the index and improves recall.

### 🔧 Your Task:
- Use `nltk` to remove stopwords and apply stemming.
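
To make the idea of affix stripping concrete, here is a deliberately naive sketch in plain Python. The stop word set, the suffix list, and the helper names (`strip_suffix`, `normalize_simple`) are all hypothetical and chosen for illustration. This is not the Porter algorithm, and its output will differ from `nltk`'s stemmer.

```python
# Tiny illustrative lists; real stop word lists and stemmers are far larger.
STOP_WORDS = {"a", "an", "the", "is", "this", "or", "and", "of", "to"}
SUFFIXES = ("ing", "ed", "es", "s")

def strip_suffix(token):
    """Remove the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def normalize_simple(tokens):
    """Drop stop words, then strip a suffix from each remaining token."""
    return [strip_suffix(t) for t in tokens if t not in STOP_WORDS]

print(normalize_simple(["this", "is", "a", "coding", "job", "required", "skills"]))
# ['cod', 'job', 'requir', 'skill']
```

Note that the result for "coding" is `'cod'`, while the Porter output shown in this notebook is `'code'`. Differences like this are why a tested stemmer is preferable to hand-written suffix rules.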


In [35]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def normalize_tokens(tokens):
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

# Example: normalize one document
norm_tokens = normalize_tokens(tokens)
print(norm_tokens[:20])


['job', 'swift', 'python', 'java', 'go', 'verilog', 'typescript', 'javascript', 'c', 'c', 'code', 'experi', 'requir', 'great', 'opportun', 'supplement', 'incom', 'look', 'longer', 'full']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yogeshkumar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 🔍 Step 4: Inverted Index


### 🗣 Instructor Talking Point:
> We now map each normalized token to the list of document IDs in which it appears. This is the core structure that allows fast Boolean and phrase queries.

### 🔧 Your Task:
- Build the inverted index using a dictionary.
- Add code to support phrase queries using positional indexing.
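
Before indexing the full corpus, it can help to see the positional structure on a toy input. The sketch below assumes the documents are already tokenized and normalized. `build_toy_index` and the two toy documents are hypothetical and exist only for illustration.

```python
from collections import defaultdict

def build_toy_index(token_lists):
    """Map term -> {doc_id: [positions]} for already-normalized token lists."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in enumerate(token_lists):
        for position, token in enumerate(tokens):
            index[token][doc_id].append(position)
    # Convert to plain dicts so the result prints cleanly
    return {term: dict(postings) for term, postings in index.items()}

toy_docs = [
    ["machine", "learning", "job"],
    ["learning", "machine", "learning"],
]
toy_index = build_toy_index(toy_docs)
print(toy_index["learning"])  # {0: [1], 1: [0, 2]}
print(toy_index["machine"])   # {0: [0], 1: [1]}
```

Here "machine learning" occurs as a phrase in document 0 (positions 0 and 1) and in document 1 (positions 1 and 2), which is exactly the information a phrase query needs.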


In [36]:
from collections import defaultdict

def build_position_inverted_index(documents):
    index = defaultdict(lambda: defaultdict(list))  # term -> {doc_id: [positions]}
    for doc_id, text in enumerate(documents):
        tokens = normalize_tokens(tokenize(text))  # assumes you already have tokenize and normalize_tokens()
        for position, token in enumerate(tokens):
            index[token][doc_id].append(position)
    return index

inverted_index = build_position_inverted_index(documents)
print(dict(inverted_index))


{'job': defaultdict(<class 'list'>, {0: [0], 2: [0, 611], 3: [0, 320, 339, 344], 6: [0], 7: [3, 9], 10: [0], 11: [0, 153]}), 'swift': defaultdict(<class 'list'>, {0: [1, 94], 10: [192]}), 'python': defaultdict(<class 'list'>, {0: [2, 95], 5: [7], 15: [15]}), 'java': defaultdict(<class 'list'>, {0: [3, 96], 5: [8], 10: [194]}), 'go': defaultdict(<class 'list'>, {0: [4, 97]}), 'verilog': defaultdict(<class 'list'>, {0: [5, 98]}), 'typescript': defaultdict(<class 'list'>, {0: [6, 99], 3: [187], 10: [135, 181]}), 'javascript': defaultdict(<class 'list'>, {0: [7, 100], 10: [134, 180]}), 'c': defaultdict(<class 'list'>, {0: [8, 9, 101, 102], 5: [9], 10: [186]}), 'code': defaultdict(<class 'list'>, {0: [10, 110], 2: [153, 252, 599], 10: [131]}), 'experi': defaultdict(<class 'list'>, {0: [11], 2: [166, 378, 523], 3: [60, 176, 215, 236, 278, 306, 410], 5: [10, 37, 52, 103], 6: [268, 287], 9: [51], 10: [78, 144, 172, 184, 196, 213], 11: [142, 253]}), 'requir': defaultdict(<class 'list'>, {0: [12
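
The talking point mentions Boolean queries as well as phrase queries. A Boolean AND over this structure only needs the document IDs, not the positions. The sketch below is one possible implementation: `boolean_and` is a hypothetical helper, and `index_fragment` is a small fragment copied by hand from the output above.

```python
def boolean_and(index, terms):
    """Return sorted doc IDs containing every term (terms must already be normalized)."""
    if not terms:
        return []
    doc_sets = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*doc_sets))

# Fragment of the printed index, in the form term -> {doc_id: [positions]}
index_fragment = {
    "python": {0: [2, 95], 5: [7], 15: [15]},
    "java": {0: [3, 96], 5: [8], 10: [194]},
}
print(boolean_and(index_fragment, ["python", "java"]))  # [0, 5]
```

A term that is missing from the index contributes an empty set, so the whole AND query returns no documents.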

## 🧪 Test: Phrase Queries


### 🗣 Instructor Talking Point:
> A phrase query requires the exact sequence of terms (e.g., "machine learning"). To support this, extend the inverted index to store positions, not just docIDs.

### 🔧 Your Task:
- Implement 2 phrase queries.
- Demonstrate that they return the correct documents.
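
One way to demonstrate correctness is to compare the index-based result with a brute-force scan of the normalized token lists. The sketch below assumes you keep such a list per document. `brute_force_phrase_match` and the toy data are hypothetical and are not part of the workshop code.

```python
def brute_force_phrase_match(token_lists, phrase_tokens):
    """Return doc IDs whose token list contains phrase_tokens as a contiguous run."""
    n = len(phrase_tokens)
    if n == 0:
        return []
    matches = []
    for doc_id, tokens in enumerate(token_lists):
        # Slide a window of length n over the document's tokens
        if any(tokens[i:i + n] == phrase_tokens for i in range(len(tokens) - n + 1)):
            matches.append(doc_id)
    return matches

toy_docs = [
    ["collabor", "close", "team"],
    ["close", "collabor"],
    ["team", "collabor", "close"],
]
print(brute_force_phrase_match(toy_docs, ["collabor", "close"]))  # [0, 2]
```

Document 1 contains both terms but in the wrong order, so it is excluded. If the index-based search and the brute-force scan disagree on your corpus, that is a good talking point to investigate.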


In [39]:
def phrase_query_search_stem(phrase, inverted_index):
    # normalize_tokens() (Step 3) already removes stop words and stems each token,
    # so the query goes through exactly the same pipeline as the documents.
    # A second stemming pass is avoided because it could alter already-stemmed terms.
    stemmed_terms = normalize_tokens(tokenize(phrase))
    print(stemmed_terms)

    if not stemmed_terms:
        return []

    # One postings dict ({doc_id: [positions]}) per query term
    postings = [inverted_index.get(term, {}) for term in stemmed_terms]

    # Only documents that contain every term can match the phrase
    common_docs = set(postings[0])
    for p in postings[1:]:
        common_docs &= set(p)

    results = []
    for doc_id in sorted(common_docs):  # sorted for a deterministic result order
        positions_lists = [set(p[doc_id]) for p in postings]
        # Term i of the phrase must sit exactly i positions after the first term
        for pos in sorted(positions_lists[0]):
            if all((pos + i) in positions_lists[i] for i in range(1, len(stemmed_terms))):
                results.append(doc_id)
                break
    return results


In [41]:
query1 = "design"
query2 = "Collaborate closely"

matching_docs1 = phrase_query_search_stem(query1, inverted_index)
matching_docs2 = phrase_query_search_stem(query2, inverted_index)

print("Query 1 matched documents:", matching_docs1)
print("Query 2 matched documents:", matching_docs2)


['design']
['collabor', 'close']
Query 1 matched documents: [2, 3, 5, 6, 10, 14]
Query 2 matched documents: [3]
