# 🧠 NLP Foundations Workshop: From Preprocessing to TF-IDF


**Duration**: 90 minutes  
**Team Size**: 3 students  
**Objective**: Build an NLP pipeline from scratch to implement and test six foundational concepts in Natural Language Processing in preparation for Vector Space Models and Cosine Similarity.
### Team:
- **Zhimin Xiong** 
- **Yu-Chen Chou**
- **Haysam Elamin**


## Step 1: Presenting the Six Core NLP Concepts

### 🔹 Term-Document Incidence Matrix

The **Term-Document Incidence Matrix** is a binary matrix that shows whether a term $t$ appears in a document $d$.

- Rows represent terms in the vocabulary  
- Columns represent documents in the corpus  
- Each entry $w_{t,d}$ is defined as:

$$
w_{t,d} =
\begin{cases}
1 & \text{if } t \in d \\
0 & \text{otherwise}
\end{cases}
$$

This is a **binary representation** — it only records the **presence or absence** of a term, not how many times it appears.

---

#### ✅ Why Use It?

- It’s the **simplest form** of representing document contents using structured data.
- Useful for:
  - Boolean search and keyword filters
  - Document classification based on keyword sets
  - Building foundational **retrieval systems**
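- Helps in detecting whether **all query terms exist** in a document (Boolean "AND" queries). This is only a first filter for phrase queries, which also need word-order information that the matrix does not store.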

---

#### 📘 Example

Suppose we have 3 documents:

- **Doc1**: "machine learning is fun"  
- **Doc2**: "deep learning is powerful"  
- **Doc3**: "machine learning and deep models"

The vocabulary extracted from all three is:

**Vocabulary** = {machine, learning, is, fun, deep, powerful, and, models}

The Term-Document Incidence Matrix would look like:

| Term       | Doc1 | Doc2 | Doc3 |
|------------|------|------|------|
| machine    | 1    | 0    | 1    |
| learning   | 1    | 1    | 1    |
| is         | 1    | 1    | 0    |
| fun        | 1    | 0    | 0    |
| deep       | 0    | 1    | 1    |
| powerful   | 0    | 1    | 0    |
| and        | 0    | 0    | 1    |
| models     | 0    | 0    | 1    |

For example:
- $w_{\text{machine}, \text{Doc1}} = 1$ → "machine" is in Doc1
- $w_{\text{powerful}, \text{Doc1}} = 0$ → "powerful" is not in Doc1

This matrix is particularly helpful when implementing **Boolean retrieval systems** and as a first-pass filter for **phrase matching**.


In [7]:
# 📘 Example: Term-Document Incidence Matrix

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample corpus from the Markdown example
docs = [
    "machine learning is fun",          # Doc1
    "deep learning is powerful",        # Doc2
    "machine learning and deep models"  # Doc3
]

# Use binary=True to indicate presence/absence (1 or 0)
vectorizer = CountVectorizer(binary=True)

# Fit and transform the corpus
X = vectorizer.fit_transform(docs)

# Create a labeled DataFrame
incidence_matrix = pd.DataFrame(X.toarray(),
                                index=["Doc1", "Doc2", "Doc3"],
                                columns=vectorizer.get_feature_names_out())

# Display the incidence matrix
print("🔎 Term-Document Incidence Matrix:")
display(incidence_matrix)


🔎 Term-Document Incidence Matrix:


Unnamed: 0,and,deep,fun,is,learning,machine,models,powerful
Doc1,0,0,1,1,1,1,0,0
Doc2,0,1,0,1,1,0,0,1
Doc3,1,1,0,0,1,1,1,0


In [8]:
# Terms to check
term1 = "machine"
term2 = "learning"
 
# Check if both terms occur in the same document (both columns == 1)
docs_with_both_terms = incidence_matrix[
    (incidence_matrix[term1] == 1) & (incidence_matrix[term2] == 1)
]
 
print(f"Documents containing both '{term1}' and '{term2}':")
print(docs_with_both_terms)

Documents containing both 'machine' and 'learning':
      and  deep  fun  is  learning  machine  models  powerful
Doc1    0     0    1   1         1        1       0         0
Doc3    1     1    0   0         1        1       1         0


🗣️ **Instructor Talking Point**: This code demonstrates how the presence or absence of a term in a document is encoded as a binary matrix, which is foundational for Boolean retrieval. Explain this with respect to how a future AI agent (chatbot) builds context.
<br/>
<br/>
🧠 **Student Talking Point**: Add a phrase query (e.g., 'machine learning') and explain your reasoning as to how you would check if both terms occur in a single document using this matrix.
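
One way to approach the phrase-query question is sketched below. It assumes simple whitespace tokenization and reuses the three example documents; `contains_all_terms` and `contains_phrase` are hypothetical helper names, not part of the workshop code. The point it illustrates is that the incidence matrix can confirm both terms are present (Boolean AND), but adjacency has to be checked against the token order.

```python
# Sketch: Boolean AND on presence data vs. a true phrase check.
docs = {
    "Doc1": "machine learning is fun",
    "Doc2": "deep learning is powerful",
    "Doc3": "machine learning and deep models",
}

def contains_all_terms(text, terms):
    """Boolean AND: every term is present somewhere in the document."""
    vocab = set(text.lower().split())
    return all(t in vocab for t in terms)

def contains_phrase(text, phrase):
    """Phrase match: the terms appear consecutively and in order."""
    tokens = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))

and_hits = [name for name, text in docs.items()
            if contains_all_terms(text, ["deep", "learning"])]
phrase_hits = [name for name, text in docs.items()
               if contains_phrase(text, "deep learning")]

print("AND match:   ", and_hits)
print("Phrase match:", phrase_hits)
```

Doc3 contains both "deep" and "learning" but not next to each other, so it passes the AND test and fails the phrase test.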

### 🔹 Term Frequency (TF)

**Term Frequency (TF)** measures how frequently a term $t$ appears in a document $d$.

$$
tf_{t,d} = f_{t,d}
$$

Where $f_{t,d}$ is the raw count of term $t$ in document $d$.

---

#### ✅ Why Use It?

- TF reflects the importance of a word **within a specific document**.
- A higher TF means the term is likely central to the topic of that document.
- It's used as the **first step** in vectorizing text for machine learning tasks like classification, clustering, or information retrieval.

TF is most effective when combined with **IDF** (Inverse Document Frequency) to balance against very common terms across the corpus.

---

#### 📘 Example

Let’s say we have this document:

> **Doc1**: `"machine learning is fun and machine learning is useful"`

Calculate raw term counts:

| Term     | Raw TF $(f_{t,d})$ |
|----------|--------------------|
| machine  | 2                  |
| learning | 2                  |
| is       | 2                  |
| fun      | 1                  |
| and      | 1                  |
| useful   | 1                  |

If normalized (total of 9 words):

- $tf(\text{"machine"}, \text{Doc1}) = \frac{2}{9} \approx 0.22$
- $tf(\text{"learning"}, \text{Doc1}) = \frac{2}{9} \approx 0.22$

This simple frequency can then be used as input to weighting schemes such as **TF-IDF**, which adjusts these values based on how rare the words are across multiple documents.


In [9]:
# 📘 Example: Term Frequency (TF)

import pandas as pd
from collections import Counter

# Sample document
doc1 = "machine learning is fun and machine learning is useful"

# Tokenize the document (simple lowercase + split)
tokens = doc1.lower().split()

# Count term frequencies
tf_raw = Counter(tokens)

# Total number of words
total_terms = len(tokens)

# Compute normalized TF
tf_normalized = {term: count / total_terms for term, count in tf_raw.items()}

# Display results
print("🔢 Raw Term Frequencies:")
display(pd.DataFrame(tf_raw.items(), columns=["Term", "Raw TF"]))

print("\n📏 Normalized Term Frequencies:")
display(pd.DataFrame(tf_normalized.items(), columns=["Term", "TF (Normalized)"]))


🔢 Raw Term Frequencies:


Unnamed: 0,Term,Raw TF
0,machine,2
1,learning,2
2,is,2
3,fun,1
4,and,1
5,useful,1



📏 Normalized Term Frequencies:


Unnamed: 0,Term,TF (Normalized)
0,machine,0.222222
1,learning,0.222222
2,is,0.222222
3,fun,0.111111
4,and,0.111111
5,useful,0.111111


🗣️ **Instructor Talking Point**: Here we count how often each term appears in a single document and normalize it. This is the simplest way to represent word importance within a document. Explain this with respect to how a future AI agent (chatbot) builds context.
<br/>
<br/>
🧠 **Student Talking Point**: Use this TF output to compare with another document. Which terms are likely to be most important in Doc1 based on their normalized TF? Explain your reasoning.

### 🔹 Log Frequency Weight

To reduce the impact of very frequent terms, **log frequency weighting** is applied.

$$
w_{t,d} =
\begin{cases}
1 + \log_{10}(f_{t,d}) & \text{if } f_{t,d} > 0 \\
0 & \text{if } f_{t,d} = 0
\end{cases}
$$

This transformation reduces the skew caused by terms that appear many times in a document. Instead of allowing their raw frequency to dominate, we scale their contribution **logarithmically**.

---

#### ✅ Why Use It?

- Frequent terms are not always the most **important** terms.
- Log scaling ensures that:
  - Words with a raw count of 1 are preserved ($1 + \log_{10}(1) = 1$),
  - But words with very high counts (e.g., 1000) don’t dominate the document vector.

This helps **normalize the influence** of repetitive terms and improve the **numerical stability** of document representations in models.

---

#### 📘 Example

Let’s say we have a document with the following raw term counts:

| Term     | Raw TF $f_{t,d}$ | Log Frequency Weight $w_{t,d}$ |
|----------|------------------|-------------------------------|
| machine  | 1                | $1 + \log_{10}(1) = 1$        |
| learning | 3                | $1 + \log_{10}(3) \approx 1.477$ |
| data     | 10               | $1 + \log_{10}(10) = 2$       |

So even though "data" appears 10 times, its log-weighted value is **just 2**, making it more comparable to less frequent but potentially more meaningful terms like "learning".

This makes log frequency weighting especially useful when preparing inputs for **TF-IDF** weighting or **document clustering**.


In [10]:
# 📘 Example: Log Frequency Weighting

import pandas as pd
import numpy as np
from collections import Counter

# Sample document with varying term frequencies
doc = "machine learning data data data learning learning learning machine data data data data"

# Tokenize and count raw term frequencies
tokens = doc.lower().split()
raw_tf = Counter(tokens)

# Compute log frequency weights
log_weighted_tf = {
    term: 1 + np.log10(freq) if freq > 0 else 0
    for term, freq in raw_tf.items()
}

# Build and display the result as a DataFrame
df = pd.DataFrame({
    "Term": raw_tf.keys(),
    "Raw TF (f_{t,d})": raw_tf.values(),
    "Log Weight (w_{t,d})": log_weighted_tf.values()
})

print("📊 Log Frequency Weighting:")
display(df)


📊 Log Frequency Weighting:


Unnamed: 0,Term,"Raw TF (f_{t,d})","Log Weight (w_{t,d})"
0,machine,2,1.30103
1,learning,4,1.60206
2,data,7,1.845098


🗣️ **Instructor Talking Point**: Note how 'data' has a high frequency, but its impact is smoothed by log weighting, making it comparable to 'learning'. Explain this with respect to how a future AI agent (chatbot) builds context.
<br/>
<br/>
🧠 **Student Talking Point**: Try adjusting the number of times a word appears and observe how the log scale compresses large values.
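
To see the compression directly, the short sketch below applies the log frequency formula from this section to a few raw counts. The counts are illustrative values chosen for this example, and `log_frequency_weight` is a hypothetical helper name.

```python
import math

def log_frequency_weight(count):
    """Return 1 + log10(count) for positive counts, and 0 otherwise."""
    return 1 + math.log10(count) if count > 0 else 0.0

# Illustrative raw counts spanning three orders of magnitude
counts = [0, 1, 10, 100, 1000]
weights = {c: log_frequency_weight(c) for c in counts}

for c, w in weights.items():
    print(f"raw count {c:>4} -> log weight {w:.3f}")
```

Each tenfold increase in the raw count adds only 1 to the weight, so a term that appears 1000 times weighs 4 times as much as a term that appears once, not 1000 times as much.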

### 🔹 Document Frequency (DF)

**Document Frequency** is the number of documents in which a term $t$ appears:

$$
df_t = |\{ d \in D : t \in d \}|
$$

Where:
- $df_t$ is the document frequency of term $t$
- $D$ is the set of all documents in the corpus
- $t \in d$ means the term $t$ appears in document $d$

---

#### ✅ Why Use It?

- It helps you understand **how common or rare** a word is across the entire document set.
- Words with **high DF** (e.g., “the”, “and”) occur in many documents and are often **less informative**.
- Words with **low DF** are more likely to be **specific and meaningful** for distinguishing between documents.
- DF is a key ingredient in calculating **Inverse Document Frequency (IDF)**.

---

#### 📘 Example

Suppose you have the following three documents:

- **Doc1**: "machine learning is fun"  
- **Doc2**: "deep learning is powerful"  
- **Doc3**: "machine learning and deep models"

Now, let’s compute the Document Frequency:

| Term     | Document Frequency ($df_t$) |
|----------|-----------------------------|
| machine  | 2 (Doc1, Doc3)              |
| learning | 3 (Doc1, Doc2, Doc3)        |
| deep     | 2 (Doc2, Doc3)              |
| models   | 1 (Doc3)                    |

The term **"learning"** appears in all three documents → **high DF**, which means it’s **less useful for distinguishing** between them.

The term **"models"** appears in only one document → **low DF**, meaning it could be a **useful keyword** for that specific document.


In [11]:
# 📘 Example: Document Frequency (DF)

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus from the Markdown example
docs = [
    "machine learning is fun",          # Doc1
    "deep learning is powerful",        # Doc2
    "machine learning and deep models"  # Doc3
]

# Use CountVectorizer to extract term-document matrix (raw counts)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Get feature names and document-term matrix as array
terms = vectorizer.get_feature_names_out()
X_array = X.toarray()

# Calculate document frequency for each term
df_counts = (X_array > 0).sum(axis=0)

# Format as a DataFrame
df_table = pd.DataFrame({
    "Term": terms,
    "Document Frequency (df_t)": df_counts
}).sort_values("Document Frequency (df_t)", ascending=False)

print("📊 Document Frequency (DF) Table:")
display(df_table)


📊 Document Frequency (DF) Table:


Unnamed: 0,Term,Document Frequency (df_t)
4,learning,3
1,deep,2
3,is,2
5,machine,2
0,and,1
2,fun,1
6,models,1
7,powerful,1


🗣️ **Instructor Talking Point**: Notice how common terms like 'learning' appear in all documents, while more specific terms like 'fun' or 'models' appear in only one.
<br/>
<br/>
🧠 **Student Talking Point**: Choose a term and explain how its document frequency could affect downstream TF-IDF weighting.
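
As a cross-check on the table, document frequency can also be computed without scikit-learn. The sketch below assumes whitespace tokenization and the same three example documents; converting each document to a set ensures that a document contributes at most 1 to a term's count, no matter how often the term repeats inside it.

```python
from collections import Counter

docs = [
    "machine learning is fun",          # Doc1
    "deep learning is powerful",        # Doc2
    "machine learning and deep models"  # Doc3
]

# Count each term once per document by updating with the set of its tokens
df_counts = Counter()
for text in docs:
    df_counts.update(set(text.lower().split()))

# Print terms from most to least widespread, ties broken alphabetically
for term, df in sorted(df_counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{term:<10} df = {df}")
```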

### 🔹 Inverse Document Frequency (IDF)

**Inverse Document Frequency (IDF)** measures how rare or informative a term is across the entire corpus:

$$
idf_t = \log_{10} \left( \frac{N}{df_t} \right)
$$

Where:
- $N$ is the total number of documents in the corpus  
- $df_t$ is the number of documents that contain the term $t$

---

#### ✅ Why Use It?

- IDF is used to **downweight common terms** and **upweight rare ones**.
- Words like “the”, “and”, or “data” appear frequently and are less helpful in distinguishing documents.
- Terms that appear in **fewer documents** are often **more informative** and **discriminative**.
- IDF is a core component of **TF-IDF**, a widely used technique in search engines, document classification, and clustering.

---

#### 📘 Example

Let’s say we have **5 documents** total, and the following document frequencies:

| Term     | $df_t$ | $idf_t = \log_{10}(N / df_t)$ |
|----------|--------|-------------------------------|
| machine  | 3      | $\log_{10}(5 / 3) \approx 0.22$ |
| entropy  | 1      | $\log_{10}(5 / 1) \approx 0.70$ |
| the      | 5      | $\log_{10}(5 / 5) = 0.00$       |

- The term **"entropy"** appears in only one document, so its IDF is **high** → it’s a **rare and informative term**.
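- The term **"the"** appears in all five documents, so its IDF is **0** → it carries no weight for distinguishing documents.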


In [12]:
# 📘 Example: Inverse Document Frequency (IDF)

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents (5 total)
docs = [
    "machine learning is powerful",
    "deep learning is advanced",
    "entropy measures randomness",
    "machine learning and AI are evolving",
    "the science of machine learning"
]

# Total number of documents
N = len(docs)

# Use CountVectorizer to get document-term matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()
X_array = X.toarray()

# Compute document frequency for each term
df_counts = (X_array > 0).sum(axis=0)

# Compute IDF using log base 10
idf_values = np.log10(N / df_counts)

# Build a DataFrame for display
idf_table = pd.DataFrame({
    "Term": terms,
    "Document Frequency (df_t)": df_counts,
    "IDF (log10(N / df_t))": idf_values
}).sort_values("IDF (log10(N / df_t))", ascending=False)

print("📊 Inverse Document Frequency (IDF) Table:")
display(idf_table)


📊 Inverse Document Frequency (IDF) Table:


Unnamed: 0,Term,Document Frequency (df_t),IDF (log10(N / df_t))
0,advanced,1,0.69897
1,ai,1,0.69897
2,and,1,0.69897
3,are,1,0.69897
4,deep,1,0.69897
5,entropy,1,0.69897
6,evolving,1,0.69897
10,measures,1,0.69897
11,of,1,0.69897
12,powerful,1,0.69897


🗣️ **Instructor Talking Point**: IDF adjusts for the fact that some words are common across all documents — this is critical in improving document relevance in search systems.
<br/>
<br/>
🧠 **Student Talking Point**: Choose a low-IDF and high-IDF term from this output and explain why they behave differently.
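
The contrast between low-IDF and high-IDF terms can be checked numerically. The sketch below plugs the document frequencies from the five-document Markdown example of this section ($N = 5$) into the base-10 IDF formula; the dictionary name `example_df` is an assumption made for illustration.

```python
import math

N = 5  # total number of documents in the example
example_df = {"machine": 3, "entropy": 1, "the": 5}  # df_t values from the example table

# idf_t = log10(N / df_t)
idf = {term: math.log10(N / df) for term, df in example_df.items()}

for term, value in idf.items():
    print(f"{term:<8} df = {example_df[term]}  idf = {value:.2f}")
```

A term found in every document gets an IDF of exactly 0, while a term found in a single document gets the maximum value for this corpus size, $\log_{10}(5)$.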

### 🔹 TF-IDF Weighting

**TF-IDF (Term Frequency–Inverse Document Frequency)** scores each term $t$ in document $d$ based on how frequent and how rare it is:

$$
w_{t,d} = \left(1 + \log_{10}(f_{t,d})\right) \times \log_{10} \left( \frac{N}{df_t} \right)
$$

Where:
- $f_{t,d}$ is the raw count of term $t$ in document $d$
- $df_t$ is the number of documents that contain term $t$
- $N$ is the total number of documents in the corpus

---

#### ✅ Why Use It?

- TF-IDF balances **term importance within a document** (TF) against **term commonality across all documents** (IDF).
- It **boosts rare, relevant words** while **suppressing frequent, generic words**.
- TF-IDF is foundational in:
  - Information Retrieval (search engines)
  - Document similarity
  - Feature engineering for classification or clustering

---

#### 📘 Example

Suppose we have:

- $f_{\text{machine}, \text{Doc1}} = 3$
- $df_{\text{machine}} = 2$
- $N = 5$ total documents

Then:

- TF part: $1 + \log_{10}(3) \approx 1 + 0.477 = 1.477$
- IDF part: $\log_{10}(5 / 2) \approx 0.398$
- TF-IDF weight:

$$
w_{\text{machine}, \text{Doc1}} = 1.477 \times 0.398 \approx 0.588
$$

This means "machine" is **important within Doc1**, but since it's found in other documents too, the overall weight is **moderated**.

TF-IDF creates a **sparse, weighted vector representation** of documents, ready for:
- Cosine similarity
- Clustering
- Search ranking
- Input into classical machine learning models
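
The arithmetic in the worked example can be reproduced with a few lines of Python. This is a minimal sketch of the formula as written in this section; `tfidf_weight` and its parameter names are hypothetical, chosen only for this illustration.

```python
import math

def tfidf_weight(f_td, df_t, n_docs):
    """(1 + log10(f_td)) * log10(n_docs / df_t); 0 when the term is absent."""
    if f_td <= 0:
        return 0.0
    return (1 + math.log10(f_td)) * math.log10(n_docs / df_t)

# Values from the example: f = 3, df = 2, N = 5
w = tfidf_weight(f_td=3, df_t=2, n_docs=5)
print(f"w(machine, Doc1) = {w:.3f}")
```

The explicit zero branch matters: without it, an absent term would pick up the leading 1 from the TF part and receive a non-zero weight.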


In [13]:
# 📘 Example: TF-IDF Weighting (Manual Computation)

import pandas as pd
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

# Sample corpus of 5 documents
docs = [
    "machine learning is powerful",
    "deep learning is advanced",
    "entropy measures randomness",
    "machine learning and AI are evolving",
    "the science of machine learning"
]

# Total number of documents
N = len(docs)

# Vectorize (raw term frequencies)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()
X_array = X.toarray()

# Compute Document Frequencies
df = (X_array > 0).sum(axis=0)
idf = np.log10(N / df)

# Manual TF-IDF: apply (1 + log10(tf)) * idf, with weight 0 where tf == 0
# np.maximum(X_array, 1) keeps log10 away from 0; np.where then zeroes absent terms
tf_log = np.where(X_array > 0, 1 + np.log10(np.maximum(X_array, 1)), 0.0)
tfidf = tf_log * idf

# Create a DataFrame for visual inspection
tfidf_df = pd.DataFrame(tfidf, columns=terms, index=[f"Doc{i+1}" for i in range(N)])

print("📊 TF-IDF Weighted Matrix (Manual Computation):")
display(tfidf_df.round(3))


📊 TF-IDF Weighted Matrix (Manual Computation):


Unnamed: 0,advanced,ai,and,are,deep,entropy,evolving,is,learning,machine,measures,of,powerful,randomness,science,the
Doc1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.398,0.097,0.222,0.0,0.0,0.699,0.0,0.0,0.0
Doc2,0.699,0.0,0.0,0.0,0.699,0.0,0.0,0.398,0.097,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Doc3,0.0,0.0,0.0,0.0,0.0,0.699,0.0,0.0,0.0,0.0,0.699,0.0,0.0,0.699,0.0,0.0
Doc4,0.0,0.699,0.699,0.699,0.0,0.0,0.699,0.0,0.097,0.222,0.0,0.0,0.0,0.0,0.0,0.0
Doc5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.097,0.222,0.0,0.699,0.0,0.0,0.699,0.699


🗣️ **Instructor Talking Point**: We combined TF and IDF manually — useful for seeing how each part of the formula shapes the final result.
<br/>
<br/>
🗣️ **Instructor Talking Point**: Document Frequency (DF) counts how many documents contain a specific term, showing how common it is across the corpus.
Inverse Document Frequency (IDF) does the opposite—it measures how rare or informative a term is by applying a logarithmic scale to the inverse of DF.
So DF rises as a term appears in more documents, while IDF falls, which gives higher weight to rare terms.
Together, they balance relevance: DF tells us "how many use this term," while IDF tells us "how useful is this term for distinguishing documents."
IDF is critical for reducing noise from overly common words.
<br/>
<br/>
🧠 **Student Talking Point**: Pick one row (a document) and explain which term seems most important and why, based on the TF-IDF weights.

## Step 2: Document Collection

In [14]:
import re
import glob

input_dir = 'sample_docs/'

# sort documents by filenames
def sorted_doc_filenames(path=".", prefix="doc", suffix=".txt"):
    # Get all matching files like doc1.txt, doc2.txt, ...
    files = glob.glob(f"{path}/{prefix}*[0-9]{suffix}")
    
    # Sort numerically based on number in filename
    # (re.escape makes the "." in the suffix match a literal dot)
    files.sort(key=lambda x: int(re.search(rf"{re.escape(prefix)}(\d+){re.escape(suffix)}", x).group(1)))
    
    return files

# Load documents; file_paths is a list of file paths
def load_documents(file_paths):
    documents = []
    for file_path in file_paths:
        with open(file_path, 'r', encoding='utf-8') as f:
            documents.append(f.read())
    return documents

# load documents to list
file_paths = sorted_doc_filenames(path=input_dir, prefix="doc", suffix=".txt")
corpus = load_documents(file_paths)
print(f"Loaded {len(corpus)} documents.")


Loaded 20 documents.


## Step 3: Implement a Tokenizer

In [15]:

from typing import List
def tokenize(text: str) -> List[str]:
    return text.lower().split()

# Example
tokenize("Machine Learning is Fun!")


['machine', 'learning', 'is', 'fun!']
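
Note that the whitespace tokenizer above leaves punctuation attached to tokens (`'fun!'`). A possible alternative, sketched here under the assumption that only letters, digits, and apostrophes should survive, is a regex-based tokenizer; `tokenize_words` is a hypothetical name and not part of the workshop code.

```python
import re
from typing import List

def tokenize_words(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Example
print(tokenize_words("Machine Learning is Fun!"))
```

With this variant, "Fun!" and "fun" map to the same token, which keeps term counts from being split across punctuation variants.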

## Step 4: Text Normalization Pipeline

In [16]:

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

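# Build the stopword set and stemmer once, not once per token
# (requires the NLTK stopwords corpus: nltk.download('stopwords'))
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def normalize(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # drop digits and punctuation
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stopwords
    tokens = [STEMMER.stem(t) for t in tokens]            # reduce words to stems
    return tokens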


## Step 5: Build and Test the Pipeline


Using the six concepts and the preprocessing pipeline above, implement a full pipeline that:
- Preprocesses text
- Applies vectorization
- Computes all six concept metrics
- Tests with one phrase query per concept


In [21]:
# Applies vectorization
# Use binary=True to indicate presence/absence (1 or 0)
vectorizer = CountVectorizer(binary=True)

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

### Term-Document Incidence Matrix


In [None]:
# Create the original document-term matrix
doc_term_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
 
# Transpose to get term-document format
term_doc_df = doc_term_df.T
 
# Rename index and columns
term_doc_df.index.name = 'term'
term_doc_df.columns.name = 'document_id'
 
print(term_doc_df.head())

document_id  0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  \
term                                                                          
00            0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   
0055          0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   
10            0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   
10_           0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   
120kvolt      0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   

document_id  16  17  18  19  
term                         
00            0   0   0   0  
0055          0   0   0   0  
10            0   0   0   0  
10_           0   0   0   0  
120kvolt      0   0   0   0  


### Term Frequency (TF)

In [23]:
# Load documents into a list called corpus
file_paths = sorted_doc_filenames(path=input_dir, prefix="doc", suffix=".txt")
corpus = load_documents(file_paths)
print(f"Loaded {len(corpus)} documents from '{input_dir}'.")

# List to store TF results for each document
all_tf_raw = []
all_tf_normalized = []

# Now, iterate through the loaded corpus (which is a list of documents)
for i, doc_content in enumerate(corpus):
    print(f"\n--- Processing Document {i+1} (from file: {file_paths[i].split('/')[-1]}) ---")
    
    # Tokenize the document (simple lowercase + split)
    tokens = doc_content.lower().split()

    # Count term frequencies
    tf_raw = Counter(tokens)

    # Total number of words
    total_terms = len(tokens)

    # Compute normalized TF
    tf_normalized = {term: count / total_terms for term, count in tf_raw.items()}

    # Store results for each document
    all_tf_raw.append(tf_raw)
    all_tf_normalized.append(tf_normalized)

    # Display results for the current document
    print("🔢 Raw Term Frequencies:")
    display(pd.DataFrame(tf_raw.items(), columns=["Term", "Raw TF"]))

    print("\n📏 Normalized Term Frequencies:")
    display(pd.DataFrame(tf_normalized.items(), columns=["Term", "TF (Normalized)"]))

# Optional: You can further process or combine all_tf_raw and all_tf_normalized
# For example, to see all unique terms and their raw counts across all documents:
print("\n--- Aggregated Raw Term Frequencies Across All Documents ---")
aggregated_raw_tf = Counter()
for tf in all_tf_raw:
    aggregated_raw_tf.update(tf)
display(pd.DataFrame(aggregated_raw_tf.items(), columns=["Term", "Aggregated Raw TF"]))

# You might want to create a DataFrame where each row is a document and columns are terms
# (This is often a step towards creating a Term-Frequency Matrix for further analysis)
print("\n--- Term Frequency Matrix (Normalized) ---")
# Get all unique terms across all documents
all_unique_terms = sorted(list(aggregated_raw_tf.keys()))

tf_matrix_normalized = []
for tf_norm_doc in all_tf_normalized:
    row = {term: tf_norm_doc.get(term, 0) for term in all_unique_terms}
    tf_matrix_normalized.append(row)

tf_df = pd.DataFrame(tf_matrix_normalized)
display(tf_df)

Loaded 20 documents from 'sample_docs/'.

--- Processing Document 1 (from file: doc1.txt) ---
🔢 Raw Term Frequencies:


Unnamed: 0,Term,Raw TF
0,i,5
1,am,3
2,sure,1
3,some,1
4,bashers,1
...,...,...
85,islanders,1
86,lose,1
87,final,1
88,game.,1



📏 Normalized Term Frequencies:


Unnamed: 0,Term,TF (Normalized)
0,i,0.036496
1,am,0.021898
2,sure,0.007299
3,some,0.007299
4,bashers,0.007299
...,...,...
85,islanders,0.007299
86,lose,0.007299
87,final,0.007299
88,game.,0.007299



--- Processing Document 2 (from file: doc2.txt) ---
🔢 Raw Term Frequencies:


Unnamed: 0,Term,Raw TF
0,my,1
1,brother,1
2,is,1
3,in,1
4,the,1
5,market,1
6,for,1
7,a,1
8,high-performance,2
9,video,1



📏 Normalized Term Frequencies:


Unnamed: 0,Term,TF (Normalized)
0,my,0.018868
1,brother,0.018868
2,is,0.018868
3,in,0.018868
4,the,0.018868
5,market,0.018868
6,for,0.018868
7,a,0.018868
8,high-performance,0.037736
9,video,0.018868



--- Processing Document 3 (from file: doc3.txt) ---
🔢 Raw Term Frequencies:


Unnamed: 0,Term,Raw TF
0,finally,1
1,you,10
2,said,1
3,what,3
4,dream,1
...,...,...
157,butter?,1
158,arms,1
159,personel,1
160,russian,1



📏 Normalized Term Frequencies:


Unnamed: 0,Term,TF (Normalized)
0,finally,0.004149
1,you,0.041494
2,said,0.004149
3,what,0.012448
4,dream,0.004149
...,...,...
157,butter?,0.004149
158,arms,0.004149
159,personel,0.004149
160,russian,0.004149



--- Processing Document 4 (from file: doc4.txt) ---
🔢 Raw Term Frequencies:


Unnamed: 0,Term,Raw TF
0,think!,1
1,it's,1
2,the,17
3,scsi,5
4,card,2
...,...,...
73,out,1
74,processes,1
75,wanting,1
76,irrespective,1



📏 Normalized Term Frequencies:


Unnamed: 0,Term,TF (Normalized)
0,think!,0.006944
1,it's,0.006944
2,the,0.118056
3,scsi,0.034722
4,card,0.013889
...,...,...
73,out,0.006944
74,processes,0.006944
75,wanting,0.006944
76,irrespective,0.006944



--- Processing Document 5 (from file: doc5.txt) ---
🔢 Raw Term Frequencies:


Unnamed: 0,Term,Raw TF
0,1),1
1,i,8
2,have,6
3,an,3
4,old,1
...,...,...
74,as,1
75,"above,",1
76,beckup,1
77,can,1



📏 Normalized Term Frequencies:


Unnamed: 0,Term,TF (Normalized)
0,1),0.008
1,i,0.064
2,have,0.048
3,an,0.024
4,old,0.008
...,...,...
74,as,0.008
75,"above,",0.008
76,beckup,0.008
77,can,0.008



--- Processing Document 6 (from file: doc6.txt) ---
🔢 Raw Term Frequencies:


Unnamed: 0,Term,Raw TF
0,back,1
1,in,1
2,high,1
3,school,1
4,i,1
5,worked,1
6,as,1
7,a,2
8,lab,1
9,assistant,1



📏 Normalized Term Frequencies:


Unnamed: 0,Term,TF (Normalized)
0,back,0.015873
1,in,0.015873
2,high,0.015873
3,school,0.015873
4,i,0.015873
5,worked,0.015873
6,as,0.015873
7,a,0.031746
8,lab,0.015873
9,assistant,0.015873



--- Processing Document 7 (from file: doc7.txt) ---
🔢 Raw Term Frequencies:


Unnamed: 0,Term,Raw TF
0,ae,1
1,is,1
2,in,1
3,dallas...try,1
4,214/241-6060,1
5,or,1
6,214/241-0055.,1
7,tech,1
8,support,1
9,may,1



📏 Normalized Term Frequencies:


Unnamed: 0,Term,TF (Normalized)
0,ae,0.043478
1,is,0.043478
2,in,0.043478
3,dallas...try,0.043478
4,214/241-6060,0.043478
5,or,0.043478
6,214/241-0055.,0.043478
7,tech,0.043478
8,support,0.043478
9,may,0.043478



--- Processing Document 8 (from file: doc8.txt) ---
🔢 Raw Term Frequencies:


Unnamed: 0,Term,Raw TF
0,[stuff,1
1,deleted],1
2,"ok,",1
3,here's,1
4,the,4
...,...,...
96,excellent,1
97,job.,1
98,impartial,1
99,"also,",1



📏 Normalized Term Frequencies:


Unnamed: 0,Term,TF (Normalized)
0,[stuff,0.007143
1,deleted],0.007143
2,"ok,",0.007143
3,here's,0.007143
4,the,0.028571
...,...,...
96,excellent,0.007143
97,job.,0.007143
98,impartial,0.007143
99,"also,",0.007143



--- Processing Document 9 (from file: doc9.txt) ---
🔢 Raw Term Frequencies:


Unnamed: 0,Term,Raw TF
0,"yeah,",1
1,it's,1
2,the,3
3,second,1
4,one.,1
5,and,3
6,i,2
7,believe,1
8,that,2
9,price,1



📏 Normalized Term Frequencies:


Unnamed: 0,Term,TF (Normalized)
0,"yeah,",0.016667
1,it's,0.016667
2,the,0.05
3,second,0.016667
4,one.,0.016667
5,and,0.05
6,i,0.033333
7,believe,0.016667
8,that,0.033333
9,price,0.016667



--- Processing Document 10 (from file: doc10.txt) ---
🔢 Raw Term Frequencies:


Unnamed: 0,Term,Raw TF
0,if,1
1,a,2
2,christian,1
3,means,1
4,someone,1
5,who,1
6,believes,1
7,in,1
8,the,3
9,divinity,1



📏 Normalized Term Frequencies:


Unnamed: 0,Term,TF (Normalized)
0,if,0.014286
1,a,0.028571
2,christian,0.014286
3,means,0.014286
4,someone,0.014286
5,who,0.014286
6,believes,0.014286
7,in,0.014286
8,the,0.042857
9,divinity,0.014286



--- Processing Document 11 (from file: doc11.txt) ---
🔢 Raw Term Frequencies:


Unnamed: 0,Term,Raw TF
0,the,2
1,blood,2
2,of,1
3,lamb.,1
4,this,1
5,will,1
6,be,2
7,a,1
8,hard,1
9,"task,",1



📏 Normalized Term Frequencies:


Unnamed: 0,Term,TF (Normalized)
0,the,0.052632
1,blood,0.052632
2,of,0.026316
3,lamb.,0.026316
4,this,0.026316
5,will,0.026316
6,be,0.052632
7,a,0.026316
8,hard,0.026316
9,"task,",0.026316



--- Processing Document 12 (from file: doc12.txt) ---
🔢 Raw Term Frequencies:


Unnamed: 0,Term,Raw TF
0,>say,1
1,they,1
2,have,1
3,a,1
4,"""history",1
5,of,1
6,untrustworthy,1
7,"behavoir[sic]""?",1



📏 Normalized Term Frequencies:


Unnamed: 0,Term,TF (Normalized)
0,>say,0.125
1,they,0.125
2,have,0.125
3,a,0.125
4,"""history",0.125
5,of,0.125
6,untrustworthy,0.125
7,"behavoir[sic]""?",0.125



--- Processing Document 13 (from file: doc13.txt) ---
🔢 Raw Term Frequencies:


Unnamed: 0,Term,Raw TF
0,930418,1
1,do,3
2,what,3
3,thou,1
4,wilt,1
...,...,...
592,stars.,1
593,love,2
594,"law,",1
595,will.,1



📏 Normalized Term Frequencies:


Unnamed: 0,Term,TF (Normalized)
0,930418,0.000840
1,do,0.002519
2,what,0.002519
3,thou,0.000840
4,wilt,0.000840
...,...,...
592,stars.,0.000840
593,love,0.001679
594,"law,",0.000840
595,will.,0.000840



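--- Processing Document 14 (from file: doc14.txt) ---

🔢 Raw Term Frequencies:

| Index | Term    | Raw TF |
|-------|---------|--------|
| 0     | how     | 1      |
| 1     | about   | 1      |
| 2     | kirlian | 1      |
| 3     | imaging | 1      |
| 4     | ?       | 1      |
| 5     | i       | 1      |
| 6     | believe | 1      |
| 7     | the     | 1      |
| 8     | faq     | 1      |
| 9     | for     | 1      |

📏 Normalized Term Frequencies:

| Index | Term    | TF (Normalized) |
|-------|---------|-----------------|
| 0     | how     | 0.02            |
| 1     | about   | 0.02            |
| 2     | kirlian | 0.02            |
| 3     | imaging | 0.02            |
| 4     | ?       | 0.02            |
| 5     | i       | 0.02            |
| 6     | believe | 0.02            |
| 7     | the     | 0.02            |
| 8     | faq     | 0.02            |
| 9     | for     | 0.02            |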



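--- Processing Document 15 (from file: doc15.txt) ---

🔢 Raw Term Frequencies:

| Index | Term           | Raw TF |
|-------|----------------|--------|
| 0     | there          | 1      |
| 1     | is             | 1      |
| 2     | no             | 1      |
| 3     | notion         | 1      |
| 4     | of             | 1      |
| 5     | heliocentric,  | 1      |
| 6     | or             | 1      |
| 7     | even           | 1      |
| 8     | galacticentric | 1      |
| 9     | either.        | 1      |

📏 Normalized Term Frequencies:

| Index | Term           | TF (Normalized) |
|-------|----------------|-----------------|
| 0     | there          | 0.090909        |
| 1     | is             | 0.090909        |
| 2     | no             | 0.090909        |
| 3     | notion         | 0.090909        |
| 4     | of             | 0.090909        |
| 5     | heliocentric,  | 0.090909        |
| 6     | or             | 0.090909        |
| 7     | even           | 0.090909        |
| 8     | galacticentric | 0.090909        |
| 9     | either.        | 0.090909        |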



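--- Processing Document 16 (from file: doc16.txt) ---

🔢 Raw Term Frequencies:

| Index | Term      | Raw TF |
|-------|-----------|--------|
| 0     | in        | 2      |
| 1     | the       | 8      |
| 2     | following | 1      |
| 3     | report:   | 1      |
| 4     | _turkey   | 1      |
| ...   | ...       | ...    |
| 90    | "wolf"    | 1      |
| 91    | just      | 1      |
| 92    | once      | 1      |
| 93    | too       | 1      |

📏 Normalized Term Frequencies:

| Index | Term      | TF (Normalized) |
|-------|-----------|-----------------|
| 0     | in        | 0.016667        |
| 1     | the       | 0.066667        |
| 2     | following | 0.008333        |
| 3     | report:   | 0.008333        |
| 4     | _turkey   | 0.008333        |
| ...   | ...       | ...             |
| 90    | "wolf"    | 0.008333        |
| 91    | just      | 0.008333        |
| 92    | once      | 0.008333        |
| 93    | too       | 0.008333        |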



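--- Processing Document 17 (from file: doc17.txt) ---

🔢 Raw Term Frequencies:

| Index | Term   | Raw TF |
|-------|--------|--------|
| 0     | many   | 1      |
| 1     | thanks | 1      |
| 2     | to     | 7      |
| 3     | those  | 1      |
| 4     | who    | 1      |
| ...   | ...    | ...    |
| 79    | ....)  | 1      |
| 80    | jvc    | 1      |
| 81    | mdp    | 1      |
| 82    | series | 1      |

📏 Normalized Term Frequencies:

| Index | Term   | TF (Normalized) |
|-------|--------|-----------------|
| 0     | many   | 0.008403        |
| 1     | thanks | 0.008403        |
| 2     | to     | 0.058824        |
| 3     | those  | 0.008403        |
| 4     | who    | 0.008403        |
| ...   | ...    | ...             |
| 79    | ....)  | 0.008403        |
| 80    | jvc    | 0.008403        |
| 81    | mdp    | 0.008403        |
| 82    | series | 0.008403        |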



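--- Processing Document 18 (from file: doc18.txt) ---

🔢 Raw Term Frequencies:

| Index | Term       | Raw TF |
|-------|------------|--------|
| 0     | .........  | 1      |
| 1     | i,         | 1      |
| 2     | some       | 1      |
| 3     | years      | 1      |
| 4     | ago,       | 1      |
| ...   | ...        | ...    |
| 61    | beware     | 1      |
| 62    | explosive  | 1      |
| 63    | properties | 1      |
| 64    | wd40       | 1      |

📏 Normalized Term Frequencies:

| Index | Term       | TF (Normalized) |
|-------|------------|-----------------|
| 0     | .........  | 0.012658        |
| 1     | i,         | 0.012658        |
| 2     | some       | 0.012658        |
| 3     | years      | 0.012658        |
| 4     | ago,       | 0.012658        |
| ...   | ...        | ...             |
| 61    | beware     | 0.012658        |
| 62    | explosive  | 0.012658        |
| 63    | properties | 0.012658        |
| 64    | wd40       | 0.012658        |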



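--- Processing Document 19 (from file: doc19.txt) ---

🔢 Raw Term Frequencies:

| Index | Term    | Raw TF |
|-------|---------|--------|
| 0     | the     | 64     |
| 1     | supreme | 4      |
| 2     | court   | 5      |
| 3     | seems   | 2      |
| 4     | to      | 19     |
| ...   | ...     | ...    |
| 377   | matter  | 1      |
| 378   | you.    | 1      |
| 379   | perry   | 1      |
| 380   | metzger | 1      |

📏 Normalized Term Frequencies:

| Index | Term    | TF (Normalized) |
|-------|---------|-----------------|
| 0     | the     | 0.086370        |
| 1     | supreme | 0.005398        |
| 2     | court   | 0.006748        |
| 3     | seems   | 0.002699        |
| 4     | to      | 0.025641        |
| ...   | ...     | ...             |
| 377   | matter  | 0.001350        |
| 378   | you.    | 0.001350        |
| 379   | perry   | 0.001350        |
| 380   | metzger | 0.001350        |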



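--- Processing Document 20 (from file: doc20.txt) ---

🔢 Raw Term Frequencies:

| Index | Term            | Raw TF |
|-------|-----------------|--------|
| 0     | ed>1.           | 1      |
| 1     | all             | 6      |
| 2     | of              | 8      |
| 3     | us              | 1      |
| 4     | that            | 9      |
| ...   | ...             | ...    |
| 172   | should          | 1      |
| 173   | countersteering | 1      |
| 174   | knowledge       | 1      |
| 175   | our             | 1      |

📏 Normalized Term Frequencies:

| Index | Term            | TF (Normalized) |
|-------|-----------------|-----------------|
| 0     | ed>1.           | 0.003650        |
| 1     | all             | 0.021898        |
| 2     | of              | 0.029197        |
| 3     | us              | 0.003650        |
| 4     | that            | 0.032847        |
| ...   | ...             | ...             |
| 172   | should          | 0.003650        |
| 173   | countersteering | 0.003650        |
| 174   | knowledge       | 0.003650        |
| 175   | our             | 0.003650        |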



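--- Aggregated Raw Term Frequencies Across All Documents ---

| Index | Term            | Aggregated Raw TF |
|-------|-----------------|-------------------|
| 0     | i               | 61                |
| 1     | am              | 9                 |
| 2     | sure            | 2                 |
| 3     | some            | 8                 |
| 4     | bashers         | 1                 |
| ...   | ...             | ...               |
| 1587  | (imho)          | 1                 |
| 1588  | reduce          | 1                 |
| 1589  | further.        | 1                 |
| 1590  | countersteering | 1                 |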
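
The aggregated counts above can be reproduced by summing per-document token counts. The sketch below is a minimal illustration, not the notebook's actual code: it assumes tokens are produced by lowercasing and splitting on whitespace (which would explain why entries such as `further.` keep their punctuation), and the function name `aggregate_raw_tf` and the toy corpus are hypothetical.

```python
from collections import Counter


def aggregate_raw_tf(docs):
    """Sum raw token counts over every document in the corpus."""
    total = Counter()
    for text in docs:
        # Assumption: lowercase + whitespace split, so "law," and "law"
        # are counted as two different terms.
        total.update(text.lower().split())
    return total


# Toy corpus for illustration only; this is not the workshop dataset.
toy_docs = ["Do what thou wilt", "do what you love"]
aggregated = aggregate_raw_tf(toy_docs)
print(aggregated.most_common(2))
```

Stripping punctuation before counting would merge variants such as `you`, `you.` and `you!` into one term, which is a natural talking point for the normalization step.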



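--- Term Frequency Matrix (Normalized) ---

| Index | "... | "an | "and | "aura"... | "before | "geneva | "greater" | "grey | "history | "holocaust" | ... | years | years, | years. | yesterday | yet | you | you! | you. | you?] | your |
|-------|------|-----|------|-----------|---------|---------|-----------|-------|----------|-------------|-----|-------|--------|--------|-----------|-----|-----|------|------|-------|------|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.007299 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.018868 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.004149 | 0.004149 | 0.0 | 0.0 | 0.004149 | ... | 0.0 | 0.004149 | 0.0 | 0.0 | 0.0 | 0.041494 | 0.004149 | 0.0 | 0.0 | 0.012448 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.006944 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.015873 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.043478 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.007143 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.007143 |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |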
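
The normalized values in this section are consistent with dividing each raw count by the total number of tokens in the document (for example, a term that occurs once in an 8-token document scores 1/8 = 0.125). The following is a rough sketch of that calculation and of assembling a document-by-term matrix. It is written under assumptions: the tokenizer (lowercase plus whitespace split) is inferred from the output, and the names `normalized_tf` and `tf_matrix` are hypothetical, not taken from the notebook.

```python
from collections import Counter


def normalized_tf(text):
    """Raw count of each token divided by the document's total token count."""
    counts = Counter(text.lower().split())  # assumed tokenizer
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


def tf_matrix(docs):
    """Return (vocabulary, rows): one row of normalized TF values per document."""
    tfs = [normalized_tf(doc) for doc in docs]
    vocab = sorted(set().union(*tfs))  # sorted union of all terms
    rows = [[tf.get(term, 0.0) for term in vocab] for tf in tfs]
    return vocab, rows


# Toy corpus for illustration only; this is not the workshop dataset.
docs = ["machine learning is fun", "deep learning is powerful"]
vocab, rows = tf_matrix(docs)
for doc_id, row in enumerate(rows):
    print(doc_id, dict(zip(vocab, row)))
```

Most entries in such a matrix are zero because each document uses only a small part of the vocabulary, which matches the sparse rows shown above.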


## Step 6: The Workshop


One team member must push the final notebook to GitHub and send the `.git` URL to the instructor before the end of class.




## 🧠 Learning Objectives
- Implement the foundations of **Vector Space Proximity** algorithms on real-world data as part of an NLP pipeline.
- Build **Jupyter Notebooks** with well-structured code and clear Markdown documentation.
- Use **Git and GitHub** for collaborative version control and code sharing.
- Identify and articulate coding issues ("**talking points**") and insert them directly into peer notebooks.
- Practice **collaborative debugging**, professional peer feedback, and improve code quality.

## 🧩 Workshop Structure (90 Minutes)
1. **Instructor Use Case Introduction** *(15 min)* – Form teams of three. Read the workshop and submission instructions, and ask for help if anything is unclear.
2. **Team Jupyter Notebook Development** *(45 min)* – Implement the NLP pipeline and the six IR basics techniques, with Markdown documentation (work as a team).
3. **Push to GitHub** *(15 min)* – Teams commit and push initial notebooks. **Make sure to include your names so it is easy to identify the team that developed the code**.
4. **Instructor Review** – The instructor will circulate, take notes, and provide coaching as needed during the **Peer Review Round**.
5. **Email Delivery** *(15 min)* – Each team sends the instructor an email **with the `.git` link** to the GitHub repo **(one email per team)**. The email subject is: PROG8245 - IR Basics & Vector Space Proximity Foundations Workshop, Team #_____.


## 💻 Submission Checklist
- ✅ `IRBasics_VectorSpaceProximity.ipynb` with:
  - Demo code: Document Collection, Tokenizer, Normalization Pipeline, Inverted Index and the six concepts.
  - Markdown explanations for each major step
  - **Labeled talking point(s)** (1-2 per concept)
- ✅ `README.md` with:
  - Dataset description
  - Team member names
  - Link to the dataset and license (if public)
- ✅ GitHub Repo:
  - Public repo named `IRBasics-VectorSpaceProximity-workshop`
  - This is a group effort, so **choose one member of the team** to publish the repo
  - At least **one commit containing one meaningful talking point**

## 🔚 Conclusion


This workshop prepares you for our next session on **Vector Space Proximity** and **Cosine Similarity**.
