## Vector Space Model (VSM) for Information Retrieval

In this notebook, we implement the Vector Space Model using TF–IDF weighted document vectors and compute similarity between queries and documents using three measures:

- **Inner Product Similarity**
- **Cosine Similarity**
- **Jaccard Similarity**

We test the model on the following queries:

1. `q1`: large language models for information retrieval and ranking  
2. `q2`: LLM for information retrieval and Ranking  
3. `q3`: query Reformulation in information retrieval  
4. `q4`: ranking Documents  
5. `q5`: Optimizing recommendation systems with LLMs by leveraging item metadata

The implementation follows these steps:

1. **Load the Document–Term and Inverted Index files**  
   - Use the files generated in Lab 1 to access the TF–IDF weights of terms in each document.

2. **Preprocess each query**  
   - Tokenize the query text.  
   - Remove stop words.  
   - Apply stemming (e.g., Porter Stemmer).

3. **For each similarity measure (Inner Product, Cosine, Jaccard):**  
   - Compute the similarity score between the query and every document.  
   - Use a **binary weighting scheme** for queries:  
     - Weight = 1 if the term appears in the query.  
     - Weight = 0 if the term does not appear.  
   - Rank the documents in **descending order of similarity**.  
   - Display the ranked documents for each query.
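
With binary query weights, the three measures reduce to the forms sketched below. The notation is introduced here for illustration: $w_{t,d}$ is the TF–IDF weight of term $t$ in document $d$, and $q$ is the set of distinct query stems. The Jaccard form is the weighted variant computed by this notebook's code; other generalizations of Jaccard to weighted vectors exist.

```latex
\begin{aligned}
\mathrm{IP}(q, d) &= \sum_{t \in q \cap d} w_{t,d} \\
\cos(q, d) &= \frac{\mathrm{IP}(q, d)}{\sqrt{|q|}\,\sqrt{\sum_{t \in d} w_{t,d}^{2}}} \\
\mathrm{Jaccard}(q, d) &= \frac{\mathrm{IP}(q, d)}{|q| + \sum_{t \in d} w_{t,d} - \mathrm{IP}(q, d)}
\end{aligned}
```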


In [2]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from collections import defaultdict
import re

# -------------------------------
# 1️⃣ Load TF–IDF Weighted Inverted Index
# -------------------------------
inverted_path = "results/inverted_index_weighted.txt"

data = pd.read_csv(
    inverted_path, 
    sep="\t", 
    header=None, 
    names=["term", "doc", "freq", "tfidf"]
)

# Build dictionary: {doc: {term: tfidf}}
doc_dict = defaultdict(dict)
for _, row in data.iterrows():
    term = str(row["term"]).lower()
    doc = str(row["doc"]).replace(".txt", "")
    doc_dict[doc][term] = float(row["tfidf"])

# -------------------------------
# 2️⃣ Preprocessing Function for Queries
# -------------------------------
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

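def preprocess(text):
    # Lowercase, tokenize on word characters, drop stop words, then stem
    tokens = re.findall(r"\b\w+\b", text.lower())
    stems = [stemmer.stem(t) for t in tokens if t not in stop_words]
    # Binary query weights: keep each stem once, in first-seen order
    return list(dict.fromkeys(stems))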

# -------------------------------
# 3️⃣ Queries
# -------------------------------
queries = {
    "q1": "large language models for information retrieval and ranking",
    "q2": "LLM for information retrieval and Ranking",
    "q3": "query Reformulation in information retrieval",
    "q4": "ranking Documents",
    "q5": "Optimizing recommendation systems with LLMs by leveraging item metadata"
}

# -------------------------------
# 4️⃣ Vector Space Model Scoring Functions
# -------------------------------
def inner_product(query_terms, doc_terms):
    score = 0.0
    for term in query_terms:
        if term in doc_terms:
            score += doc_terms[term] * 1  # query weight = 1
    return score

def cosine_similarity(query_terms, doc_terms):
    numerator = inner_product(query_terms, doc_terms)
    # Euclidean norm of the document's TF–IDF vector
    doc_norm = np.sqrt(sum(w * w for w in doc_terms.values()))
    # Binary query weights: the norm is sqrt(number of query terms)
    query_norm = np.sqrt(len(query_terms))
    if doc_norm == 0 or query_norm == 0:
        return 0.0
    return numerator / (doc_norm * query_norm)

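def jaccard_similarity(query_terms, doc_terms):
    # Weighted Jaccard with binary query weights:
    # overlap / (|query| + sum of document weights - overlap).
    # This matches sum(min) / sum(max) when every document weight lies in [0, 1].
    intersection = sum(doc_terms[t] for t in query_terms if t in doc_terms)
    union = len(query_terms) + sum(doc_terms.values()) - intersection
    return intersection / union if union != 0 else 0.0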

# -------------------------------
# 5️⃣ Compute and Rank Similarities
# -------------------------------
def rank_documents(query_text):
    query_terms = preprocess(query_text)
    scores = []

    for doc, terms in doc_dict.items():
        inner = inner_product(query_terms, terms)
        cosine = cosine_similarity(query_terms, terms)
        jaccard = jaccard_similarity(query_terms, terms)
        scores.append((doc, inner, cosine, jaccard))

    df = pd.DataFrame(scores, columns=["Document", "InnerProduct", "Cosine", "Jaccard"])
    df = df.sort_values(by="Cosine", ascending=False)
    return df.reset_index(drop=True)

# -------------------------------
# 6️⃣ Run for All Queries
# -------------------------------
for q_id, q_text in queries.items():
    print(f"\n🔹 Results for {q_id}: {q_text}")
    results = rank_documents(q_text)
    display(results.head(10))



🔹 Results for q1: large language models for information retrieval and ranking


| | Document | InnerProduct | Cosine | Jaccard |
|:--|:--|:--|:--|:--|
| 0 | D4 | 0.5302 | 0.216419 | 0.040066 |
| 1 | D2 | 0.5103 | 0.208325 | 0.036633 |
| 2 | D3 | 0.3755 | 0.153312 | 0.028665 |
| 3 | D1 | 0.3278 | 0.133832 | 0.024114 |
| 4 | D5 | 0.3063 | 0.125068 | 0.021062 |
| 5 | D6 | 0.1717 | 0.070091 | 0.013283 |



🔹 Results for q2: LLM for information retrieval and Ranking


| | Document | InnerProduct | Cosine | Jaccard |
|:--|:--|:--|:--|:--|
| 0 | D2 | 0.5715 | 0.285745 | 0.048151 |
| 1 | D4 | 0.5123 | 0.25611 | 0.045533 |
| 2 | D5 | 0.3346 | 0.167329 | 0.026737 |
| 3 | D6 | 0.2286 | 0.114291 | 0.021032 |
| 4 | D1 | 0.2126 | 0.106306 | 0.018157 |
| 5 | D3 | 0.1972 | 0.098609 | 0.017486 |



🔹 Results for q3: query Reformulation in information retrieval


| | Document | InnerProduct | Cosine | Jaccard |
|:--|:--|:--|:--|:--|
| 0 | D4 | 1.0441 | 0.521968 | 0.097403 |
| 1 | D1 | 0.5081 | 0.254065 | 0.044517 |
| 2 | D5 | 0.1636 | 0.081814 | 0.012897 |
| 3 | D3 | 0.0443 | 0.022152 | 0.003876 |
| 4 | D6 | 0.043 | 0.021498 | 0.00389 |
| 5 | D2 | 0.0424 | 0.0212 | 0.00342 |



🔹 Results for q4: ranking Documents


| | Document | InnerProduct | Cosine | Jaccard |
|:--|:--|:--|:--|:--|
| 0 | D2 | 0.4507 | 0.318687 | 0.045116 |
| 1 | D1 | 0.1307 | 0.092424 | 0.013349 |
| 2 | D5 | 0.0477 | 0.033735 | 0.004416 |
| 3 | D6 | 0.0 | 0.0 | 0.0 |
| 4 | D4 | 0.0 | 0.0 | 0.0 |
| 5 | D3 | 0.0 | 0.0 | 0.0 |



🔹 Results for q5: Optimizing recommendation systems with LLMs by leveraging item metadata


| | Document | InnerProduct | Cosine | Jaccard |
|:--|:--|:--|:--|:--|
| 0 | D3 | 0.6425 | 0.242865 | 0.046449 |
| 1 | D6 | 0.3639 | 0.137531 | 0.026496 |
| 2 | D4 | 0.2945 | 0.111293 | 0.020354 |
| 3 | D5 | 0.2596 | 0.098136 | 0.016652 |
| 4 | D2 | 0.2198 | 0.083075 | 0.014441 |
| 5 | D1 | 0.0598 | 0.022604 | 0.004024 |


## 🧩 Explanation

### 🔹 Inverted Index
- Loaded from the **Lab 1 output** `results/inverted_index_weighted.txt` (tab-separated columns: *term*, *document*, *frequency*, *TF–IDF*).  
- Allows efficient retrieval of documents containing each term.

---

### 🔹 Document Dictionary (`doc_dict`)
- Maps each **document ID** to a dictionary of its terms and their TF–IDF weights.  
- Example (illustrative values):  
  `{ "D1": {"cat": 0.23, "mat": 0.18, ...}, "D2": {...} }`
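
As a minimal sketch of how such a mapping can be built, the snippet below parses a few tab-separated rows with the standard library instead of pandas. The sample rows, the helper name `build_doc_dict`, and the variable `toy_doc_dict` are hypothetical and only mirror the four-column layout the notebook reads.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows in the assumed layout: term, doc, freq, tfidf.
sample = "cat\tD1.txt\t2\t0.23\nmat\tD1.txt\t1\t0.18\ncat\tD2.txt\t1\t0.11\n"

def build_doc_dict(handle):
    """Return {doc_id: {term: tfidf}} from tab-separated rows."""
    doc_dict = defaultdict(dict)
    for term, doc, _freq, tfidf in csv.reader(handle, delimiter="\t"):
        doc_id = doc.replace(".txt", "")  # "D1.txt" -> "D1"
        doc_dict[doc_id][term.lower()] = float(tfidf)
    return dict(doc_dict)

toy_doc_dict = build_doc_dict(io.StringIO(sample))
print(toy_doc_dict)
```

Keying the outer dictionary by document makes each document's full weight vector available in one lookup, which is what the cosine and Jaccard norms need.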

---

### 🔹 Query Preprocessing
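- Lowercases the query and splits it into word tokens.  
- Removes **stop words** (NLTK English list).  
- Applies **stemming** (Porter stemmer) to reduce each word to its stem.  
- Uses **binary weighting** for queries: each term in the query has weight **1**, every other term **0**.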
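
The binary weighting step can be sketched on its own, as below. This is an illustration under stated assumptions: `TOY_STOP_WORDS` is a hypothetical mini stop list standing in for NLTK's English list, `binary_query_vector` is a made-up helper name, and stemming is deliberately left out so the snippet has no external dependencies.

```python
import re

# Hypothetical mini stop list for illustration; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"for", "and", "in", "with", "by", "the", "of"}

def binary_query_vector(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, tokenize, drop stop words, and give each distinct term weight 1."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    kept = [t for t in tokens if t not in stop_words]
    # Stemming is omitted here; the notebook applies PorterStemmer at this point.
    return {term: 1 for term in dict.fromkeys(kept)}

print(binary_query_vector("LLM for information retrieval and Ranking"))
# {'llm': 1, 'information': 1, 'retrieval': 1, 'ranking': 1}
```

Terms absent from the dictionary implicitly carry weight 0, so a repeated query word still contributes only once.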

---

### 🔹 Similarity Measures

| Measure | Description |
|:---------|:-------------|
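| **Inner Product** | Sum of the TF–IDF weights of the query terms found in the document; unnormalized, so documents with many or heavily weighted matching terms score higher. |
| **Cosine Similarity** | Inner product divided by the product of the query and document vector norms, which removes the effect of vector length (most commonly used). |
| **Jaccard Similarity** | Inner product divided by the combined weight of query and document minus their overlap, so matches count for less in documents with a large total weight. |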
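
A small worked example may make the three measures concrete. The sketch below uses a hypothetical three-term document whose weights were picked so the arithmetic is easy to check by hand; `similarities` is an illustrative helper, not a function from the notebook, and it assumes binary query weights.

```python
import math

def similarities(query_terms, doc_weights):
    """Return (inner product, cosine, weighted Jaccard) for binary query weights."""
    q = set(query_terms)
    overlap = sum(w for t, w in doc_weights.items() if t in q)
    doc_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    query_norm = math.sqrt(len(q))
    cosine = overlap / (doc_norm * query_norm) if doc_norm and query_norm else 0.0
    union = len(q) + sum(doc_weights.values()) - overlap
    jaccard = overlap / union if union else 0.0
    return overlap, cosine, jaccard

# Hypothetical weights chosen so the arithmetic is easy to follow.
doc = {"rank": 0.3, "retriev": 0.4, "model": 0.5}
ip, cos, jac = similarities(["rank", "retriev"], doc)
print(round(ip, 4), round(cos, 4), round(jac, 4))  # 0.7 0.7 0.28
```

Here the inner product is 0.3 + 0.4 = 0.7, the norms multiply to sqrt(0.5) * sqrt(2) = 1, and the Jaccard denominator is 2 + 1.2 - 0.7 = 2.5.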

---

### 🔹 Ranking
- Documents are **sorted by cosine similarity** in descending order (`sort_values(by="Cosine", ascending=False)`).  
- To rank by **another similarity metric**, change the `by` argument to `"InnerProduct"` or `"Jaccard"`.
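
The choice of metric can change the order. As a sketch, the snippet below re-ranks three rows copied from the q3 results in this notebook using plain Python; `rank_by` is a hypothetical helper introduced for illustration.

```python
# Scores for three documents copied from the q3 results in this notebook.
q3_scores = [
    {"Document": "D3", "InnerProduct": 0.0443, "Cosine": 0.022152, "Jaccard": 0.003876},
    {"Document": "D6", "InnerProduct": 0.0430, "Cosine": 0.021498, "Jaccard": 0.003890},
    {"Document": "D2", "InnerProduct": 0.0424, "Cosine": 0.021200, "Jaccard": 0.003420},
]

def rank_by(rows, metric):
    """Return document IDs ordered by the given metric, highest first (hypothetical helper)."""
    ordered = sorted(rows, key=lambda row: row[metric], reverse=True)
    return [row["Document"] for row in ordered]

print(rank_by(q3_scores, "Cosine"))   # ['D3', 'D6', 'D2']
print(rank_by(q3_scores, "Jaccard"))  # ['D6', 'D3', 'D2']
```

D3 and D6 swap places between the two rankings, which is why the displayed order depends on the sort column.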
