# 05. Vector Space Model (VSM)

## Table of Contents
1. [Introduction](#introduction)
2. [Theory: Vector Space Model](#theory)
3. [Document Vectors](#vectors)
4. [Cosine Similarity](#cosine)
5. [Ranked Retrieval](#ranking)
6. [Summary](#summary)

---

## 1. Introduction <a name="introduction"></a>

The **Vector Space Model (VSM)** changed Information Retrieval by introducing:
- **Partial matching**: A document can be relevant to a degree, even if it lacks some query terms
- **Ranking**: Results are ordered by similarity score
- **Geometric interpretation**: Documents and queries are represented as vectors

### Why VSM?
Boolean retrieval is rigid: a document either matches the query or it does not. VSM allows:
- ✓ Ranking documents by relevance
- ✓ Partial matching (query terms need not all appear)
- ✓ Term weighting (some terms are more important than others)

---

## 2. Theory: Vector Space Model <a name="theory"></a>

### Core Idea:
Represent documents and queries as **vectors in a high-dimensional space**, where each dimension corresponds to a term.

### Example:
Vocabulary: {नेपाल (Nepal), हिमाल (mountain), शिक्षा (education), पर्यटक (tourist)}

```
Document 1: "नेपाल हिमालको देश" → [2, 1, 0, 0]
Document 2: "नेपाल पर्यटन"         → [1, 0, 0, 1]
Query:      "नेपाल हिमाल"         → [1, 1, 0, 0]
```

The counts are illustrative and assume stemming has already mapped हिमालको to हिमाल and पर्यटन to पर्यटक.
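A minimal sketch of how such count vectors could be built from a fixed vocabulary. The token lists and the names `toy_vocabulary`, `toy_docs`, and `to_vector` are hypothetical English placeholders, not part of the notebook's corpus or code.

```python
from collections import Counter

# Hypothetical, already-preprocessed toy data (English placeholders).
toy_vocabulary = ["education", "mountain", "nepal", "tourism"]
toy_docs = {
    "doc_a": ["nepal", "mountain", "nepal"],
    "doc_b": ["nepal", "tourism"],
}

def to_vector(tokens, vocabulary):
    """Return one term count per vocabulary entry, in vocabulary order."""
    counts = Counter(tokens)  # A missing term counts as 0
    return [counts[term] for term in vocabulary]

toy_vectors = {doc_id: to_vector(tokens, toy_vocabulary)
               for doc_id, tokens in toy_docs.items()}
print(toy_vectors)
# {'doc_a': [0, 1, 2, 0], 'doc_b': [0, 0, 1, 1]}
```

Each position in a vector corresponds to one vocabulary term, so every document has the same number of dimensions regardless of its length.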

### Geometric Interpretation:
```
         हिमाल ↑
               |
          Doc1 •
              /|
        Query• | 
            /  |
           /   • Doc2
          /____________________→ नेपाल
```

**Similarity is measured by the angle between vectors**
- Small angle = High similarity
- Large angle = Low similarity
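To make the angle concrete, the sketch below computes it for the three example vectors from the block above. The helper `angle_degrees` is a hypothetical name introduced here for illustration.

```python
import math

def angle_degrees(vec_a, vec_b):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))

doc1 = [2, 1, 0, 0]
doc2 = [1, 0, 0, 1]
query = [1, 1, 0, 0]

print(round(angle_degrees(query, doc1), 1))  # 18.4
print(round(angle_degrees(query, doc2), 1))  # 60.0
```

Document 1 forms the smaller angle with the query, so it would be ranked above Document 2.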

### Term Weighting:
Not all terms are equally important!

1. **Term Frequency (TF)**: How often term appears in document
   - More occurrences → Higher weight

2. **Document Frequency (DF)**: How many documents contain term
   - Common terms (high DF) → Lower weight
   - Rare terms (low DF) → Higher weight

For now, we'll use simple **term frequency** weighting.
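The two statistics above can be computed as in this small sketch. The corpus `toy_corpus` is a hypothetical, pre-tokenized example with English placeholder terms.

```python
from collections import Counter

# Hypothetical pre-tokenized corpus (English placeholders for readability).
toy_corpus = {
    "d1": ["nepal", "mountain", "nepal"],
    "d2": ["nepal", "tourism"],
    "d3": ["nepal", "education", "education"],
}

# Term frequency: occurrences of each term within one document.
tf = {doc_id: Counter(tokens) for doc_id, tokens in toy_corpus.items()}

# Document frequency: number of documents containing each term at least once.
df = Counter()
for tokens in toy_corpus.values():
    df.update(set(tokens))  # set() so a term counts once per document

print(tf["d3"]["education"])         # 2
print(df["nepal"], df["education"])  # 3 1
```

Here "nepal" appears in every document (high DF), so it says little about which document is most relevant, while "education" appears in only one.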

---

## 3. Document Vectors <a name="vectors"></a>

In [1]:
from pathlib import Path
from collections import Counter
import math

# Load preprocessing utilities
DATA_DIR = Path('../data')

def load_documents(data_dir):
    documents = {}
    for file_path in sorted(data_dir.glob('doc*.txt')):
        with open(file_path, 'r', encoding='utf-8') as f:
            documents[file_path.stem] = f.read()
    return documents

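def load_stopwords(file_path):
    stopwords = set()
    with open(file_path, 'r', encoding='utf-8') as f:
        next(f, None)  # Skip the CSV header row (no error if the file is empty)
        for line in f:
            word = line.strip()
            if word:  # Ignore blank lines
                stopwords.add(word)
    return stopwords

def load_stemming_dict(file_path):
    stem_dict = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        next(f, None)  # Skip the CSV header row (no error if the file is empty)
        for line in f:
            parts = line.strip().split(',')
            if len(parts) == 2:  # Keep only well-formed two-column rows
                stem_dict[parts[0]] = parts[1]
    return stem_dict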

def tokenize(text):
    tokens = text.split()
    cleaned = []
    for token in tokens:
        token = token.strip('।,.!?;:"\'-()[]{}/')
        if token and any('\u0900' <= c <= '\u097F' for c in token):
            cleaned.append(token)
    return cleaned

def preprocess_text(text, stopwords, stem_dict):
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in stopwords]
    tokens = [stem_dict.get(t, t) for t in tokens]
    return tokens

documents = load_documents(DATA_DIR)
stopwords = load_stopwords(DATA_DIR / 'nepali_stopwords.csv')
stem_dict = load_stemming_dict(DATA_DIR / 'nepali_stemming.csv')

preprocessed_docs = {}
for doc_id, text in documents.items():
    preprocessed_docs[doc_id] = preprocess_text(text, stopwords, stem_dict)

print(f"✓ Loaded {len(preprocessed_docs)} documents")

✓ Loaded 10 documents


In [2]:
def build_vocabulary(preprocessed_docs):
    """
    Build vocabulary from all documents.
    
    Returns: sorted list of unique terms
    """
    vocab = set()
    for terms in preprocessed_docs.values():
        vocab.update(terms)
    return sorted(vocab)

def build_term_frequency_vectors(preprocessed_docs, vocabulary):
    """
    Build document vectors using term frequency weighting.
    
    For each document, count how many times each term appears.
    
    Parameters:
    -----------
    preprocessed_docs : dict
        Document ID → list of terms
    vocabulary : list
        Ordered list of all unique terms
    
    Returns:
    --------
    dict : Document ID → vector (list of term frequencies)
    """
    vectors = {}
    
    # Create term → index mapping
    term_to_idx = {term: idx for idx, term in enumerate(vocabulary)}
    
    for doc_id, terms in preprocessed_docs.items():
        # Count term frequencies
        term_counts = Counter(terms)
        
        # Build vector
        vector = [0] * len(vocabulary)
        for term, count in term_counts.items():
            idx = term_to_idx[term]
            vector[idx] = count
        
        vectors[doc_id] = vector
    
    return vectors

# Build vocabulary and vectors
vocabulary = build_vocabulary(preprocessed_docs)
doc_vectors = build_term_frequency_vectors(preprocessed_docs, vocabulary)

print("✓ Built document vectors")
print(f"  Vocabulary size: {len(vocabulary)}")
print(f"  Vector dimensions: {len(vocabulary)}")
print(f"  Number of vectors: {len(doc_vectors)}")

✓ Built document vectors
  Vocabulary size: 398
  Vector dimensions: 398
  Number of vectors: 10


In [3]:
# Visualize sample vectors
def show_vector_sample(doc_vectors, vocabulary, sample_terms, num_docs=3):
    """
    Display sample document vectors for specific terms.
    """
    # Map each term to its position in the vector
    term_to_idx = {term: idx for idx, term in enumerate(vocabulary)}
    
    print("\n📊 Document Vectors (sample):")
    print("="*80)
    
    # Header
    header = "Doc ID    "
    for term in sample_terms:
        header += f"{term[:10]:<12}"
    print(header)
    print("="*80)
    
    # Show first few documents
    for doc_id in sorted(doc_vectors.keys())[:num_docs]:
        row = f"{doc_id:<10}"
        for term in sample_terms:
            if term in term_to_idx:
                idx = term_to_idx[term]
                value = doc_vectors[doc_id][idx]
                row += f"{value:<12}"
            else:
                row += f"{'N/A':<12}"
        print(row)
    
    print("="*80)
    print("\nNote: Numbers show how many times each term appears in the document")

sample_terms = ['नेपाल', 'हिमाल', 'शिक्षा', 'पर्यटक', 'स्वास्थ्य']
show_vector_sample(doc_vectors, vocabulary, sample_terms, num_docs=5)


📊 Document Vectors (sample):
Doc ID    नेपाल       हिमाल       शिक्षा      पर्यटक      स्वास्थ्य   
doc01     6           1           0           0           0           
doc02     4           5           0           4           0           
doc03     5           0           5           0           0           
doc04     5           0           0           0           0           
doc05     6           0           0           0           0           

Note: Numbers show how many times each term appears in the document


---

## 4. Cosine Similarity <a name="cosine"></a>

**Cosine Similarity** is the cosine of the angle between two vectors: the smaller the angle, the closer the value is to 1.

### Formula:

$$
\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{||A|| \times ||B||}
$$

Where:
- $A \cdot B$ = Dot product
- $||A||$ = Length (magnitude) of vector A
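As a worked example, take the query vector $A = [1, 1, 0, 0]$ and the Document 1 vector $B = [2, 1, 0, 0]$ from the Section 2 example:

$$
A \cdot B = (1)(2) + (1)(1) + (0)(0) + (0)(0) = 3, \quad ||A|| = \sqrt{1^2 + 1^2} = \sqrt{2}, \quad ||B|| = \sqrt{2^2 + 1^2} = \sqrt{5}
$$

$$
\text{cosine\_similarity}(A, B) = \frac{3}{\sqrt{2} \times \sqrt{5}} = \frac{3}{\sqrt{10}} \approx 0.949
$$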

### Properties:
- Range: [0, 1] for non-negative vectors
- 1 = Identical direction (most similar)
- 0 = Perpendicular (no similarity)

### Why Cosine?
- **Length-invariant**: Long and short documents comparable
- **Angle-based**: Measures orientation, not magnitude
- **Efficient**: Fast to compute with sparse vectors
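The sketch below illustrates two of these points under stated assumptions: vectors are stored sparsely as `{term: weight}` dictionaries (a hypothetical representation, the notebook itself uses dense lists), and a document that is ten times longer with the same term proportions gets the same score.

```python
import math

def sparse_cosine(weights_a, weights_b):
    """Cosine similarity for sparse vectors stored as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    shared = weights_a.keys() & weights_b.keys()
    dot = sum(weights_a[t] * weights_b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors with English placeholder terms.
short_doc = {"nepal": 1, "mountain": 1}
long_doc = {term: 10 * count for term, count in short_doc.items()}  # Same proportions, 10x longer
toy_query = {"nepal": 1, "tourism": 1}

same_score = math.isclose(sparse_cosine(toy_query, short_doc),
                          sparse_cosine(toy_query, long_doc))
print(same_score)  # True
```

Because zero entries are never stored, the work is proportional to the number of distinct terms in the two texts rather than to the vocabulary size.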

In [4]:
def dot_product(vec1, vec2):
    """
    Calculate dot product of two vectors.
    
    Formula: sum(vec1[i] * vec2[i] for all i)
    """
    return sum(v1 * v2 for v1, v2 in zip(vec1, vec2))

def vector_magnitude(vec):
    """
    Calculate magnitude (length) of a vector.
    
    Formula: sqrt(sum(vec[i]^2 for all i))
    """
    return math.sqrt(sum(v ** 2 for v in vec))

def cosine_similarity(vec1, vec2):
    """
    Calculate cosine similarity between two vectors.
    
    Returns a value between 0 and 1 for non-negative vectors:
    - 1 = same direction (identical term proportions)
    - 0 = no shared terms (or an all-zero vector)
    """
    dot_prod = dot_product(vec1, vec2)
    mag1 = vector_magnitude(vec1)
    mag2 = vector_magnitude(vec2)
    
    # Avoid division by zero
    if mag1 == 0 or mag2 == 0:
        return 0.0
    
    return dot_prod / (mag1 * mag2)

print("✓ Similarity functions defined")

✓ Similarity functions defined


In [5]:
# Test: Compare similarity between documents
print("\n📐 Document Similarity Matrix:")
print("="*70)

# Sample documents
sample_docs = ['doc01', 'doc02', 'doc03', 'doc07']

# Header
header = "      "
for doc in sample_docs:
    header += f"{doc:<10}"
print(header)
print("="*70)

# Calculate pairwise similarities
for doc1 in sample_docs:
    row = f"{doc1:<6}"
    for doc2 in sample_docs:
        sim = cosine_similarity(doc_vectors[doc1], doc_vectors[doc2])
        row += f"{sim:.4f}    "
    print(row)

print("="*70)
print("\nInterpretation:")
print("  - Diagonal = 1.0 (document compared to itself)")
print("  - Higher values = More similar documents")


📐 Document Similarity Matrix:
      doc01     doc02     doc03     doc07     
doc01 1.0000    0.3028    0.2664    0.1719    
doc02 0.3028    1.0000    0.1950    0.1198    
doc03 0.2664    0.1950    1.0000    0.2258    
doc07 0.1719    0.1198    0.2258    1.0000    

Interpretation:
  - Diagonal = 1.0 (document compared to itself)
  - Higher values = More similar documents


---

## 5. Ranked Retrieval <a name="ranking"></a>

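Now we can rank documents by their similarity to a query: convert the query into a vector in the same space as the documents, score it against every document with cosine similarity, and sort the documents by score.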

In [6]:
def query_to_vector(query_text, vocabulary, stopwords, stem_dict):
    """
    Convert query text to a vector in the same space as documents.
    
    Parameters:
    -----------
    query_text : str
        Raw query text
    vocabulary : list
        Ordered vocabulary
    
    Returns:
    --------
    list : Query vector
    """
    # Preprocess query
    query_terms = preprocess_text(query_text, stopwords, stem_dict)
    
    # Count term frequencies
    term_counts = Counter(query_terms)
    
    # Build vector
    term_to_idx = {term: idx for idx, term in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    
    for term, count in term_counts.items():
        if term in term_to_idx:
            idx = term_to_idx[term]
            vector[idx] = count
    
    return vector

def ranked_retrieval(query_text, doc_vectors, vocabulary, stopwords, stem_dict, top_k=5):
    """
    Retrieve and rank documents by similarity to query.
    
    Parameters:
    -----------
    query_text : str
        User's query
    doc_vectors : dict
        Document vectors
    top_k : int
        Number of top results to return
    
    Returns:
    --------
    list : [(doc_id, similarity_score), ...] sorted by score
    """
    # Convert query to vector
    query_vector = query_to_vector(query_text, vocabulary, stopwords, stem_dict)
    
    # Calculate similarity with each document
    scores = []
    for doc_id, doc_vector in doc_vectors.items():
        score = cosine_similarity(query_vector, doc_vector)
        scores.append((doc_id, score))
    
    # Sort by score (descending)
    scores.sort(key=lambda x: x[1], reverse=True)
    
    # Return top k
    return scores[:top_k]

print("✓ Ranked retrieval functions defined")

✓ Ranked retrieval functions defined
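Because `ranked_retrieval` always returns the first `top_k` entries, the list can include documents that share no terms with the query and score 0.0. One possible variant, sketched here with the hypothetical helpers `cosine` and `rank_nonzero` and made-up toy vectors, drops those documents before truncating.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_nonzero(query_vector, vectors, top_k=5):
    """Rank documents by cosine score, dropping documents with a score of zero."""
    scored = [(doc_id, cosine(query_vector, vec)) for doc_id, vec in vectors.items()]
    matches = [(doc_id, score) for doc_id, score in scored if score > 0.0]
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return matches[:top_k]

# Hypothetical toy vectors over a four-term vocabulary.
toy_doc_vectors = {"a": [2, 1, 0, 0], "b": [0, 0, 3, 0], "c": [1, 0, 0, 1]}
ranked_ids = [doc_id for doc_id, _ in rank_nonzero([1, 1, 0, 0], toy_doc_vectors)]
print(ranked_ids)  # ['a', 'c']
```

Whether to hide zero-score documents is a design choice; keeping them, as the notebook does, makes it easy to see which documents had no overlap with the query.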


In [7]:
# Example Query 1
query1 = "नेपाल हिमाल पर्यटन"
print(f"\n🔍 Query: '{query1}'")
print("="*70)

results = ranked_retrieval(query1, doc_vectors, vocabulary, stopwords, stem_dict, top_k=5)

print(f"\n{'Rank':<6} {'Doc ID':<10} {'Score':<10} {'Preview'}")
print("="*70)

for rank, (doc_id, score) in enumerate(results, 1):
    preview = documents[doc_id][:50].replace('\n', ' ')
    print(f"{rank:<6} {doc_id:<10} {score:.4f}    {preview}...")

print("="*70)


🔍 Query: 'नेपाल हिमाल पर्यटन'

Rank   Doc ID     Score      Preview
1      doc02      0.7256    हिमाल र पर्यटन  नेपाल हिमालको देश हो। यहाँ विश्वको...
2      doc01      0.3836    नेपालको इतिहास र संस्कृति  नेपाल दक्षिण एशियामा अव...
3      doc09      0.3518    वातावरण र जलवायु  नेपालको भौगोलिक विविधताले गर्दा ...
4      doc05      0.3303    भाषा र साहित्य  नेपालमा धेरै भाषाहरू बोलिन्छन्। ने...
5      doc04      0.2887    कृषि र अर्थतन्त्र  नेपालको अर्थतन्त्र मुख्यतः कृषि...


In [8]:
# Example Query 2
query2 = "शिक्षा प्रविधि विश्वविद्यालय"
print(f"\n🔍 Query: '{query2}'")
print("="*70)

results = ranked_retrieval(query2, doc_vectors, vocabulary, stopwords, stem_dict, top_k=5)

print(f"\n{'Rank':<6} {'Doc ID':<10} {'Score':<10} {'Title'}")
print("="*70)

for rank, (doc_id, score) in enumerate(results, 1):
    title = documents[doc_id].split('\n')[0]
    print(f"{rank:<6} {doc_id:<10} {score:.4f}    {title}")

print("="*70)


🔍 Query: 'शिक्षा प्रविधि विश्वविद्यालय'

Rank   Doc ID     Score      Title
1      doc03      0.7089    शिक्षा र प्रविधि
2      doc07      0.0550    स्वास्थ्य सेवा
3      doc01      0.0000    नेपालको इतिहास र संस्कृति
4      doc02      0.0000    हिमाल र पर्यटन
5      doc04      0.0000    कृषि र अर्थतन्त्र


In [9]:
# Example Query 3
query3 = "स्वास्थ्य अस्पताल चिकित्सा"
print(f"\n🔍 Query: '{query3}'")
print("="*70)

results = ranked_retrieval(query3, doc_vectors, vocabulary, stopwords, stem_dict, top_k=5)

print(f"\n{'Rank':<6} {'Doc ID':<10} {'Score':<10} {'Title'}")
print("="*70)

for rank, (doc_id, score) in enumerate(results, 1):
    title = documents[doc_id].split('\n')[0]
    print(f"{rank:<6} {doc_id:<10} {score:.4f}    {title}")

print("="*70)
print("\n💡 Note: Documents with score 0.0 have no query terms in common")


🔍 Query: 'स्वास्थ्य अस्पताल चिकित्सा'

Rank   Doc ID     Score      Title
1      doc07      0.6055    स्वास्थ्य सेवा
2      doc01      0.0000    नेपालको इतिहास र संस्कृति
3      doc02      0.0000    हिमाल र पर्यटन
4      doc03      0.0000    शिक्षा र प्रविधि
5      doc04      0.0000    कृषि र अर्थतन्त्र

💡 Note: Documents with score 0.0 have no query terms in common


---

## 6. Summary <a name="summary"></a>

### What We Learned:

1. **Vector Space Model**
   - Documents and queries as vectors
   - Geometric interpretation of relevance
   - Partial matching and ranking

2. **Term Frequency Vectors**
   - Count how often each term appears
   - Higher frequency → Higher weight
   - Vocabulary defines vector dimensions

3. **Cosine Similarity**
   - Measures the angle between vectors
   - Range: 0 (no shared terms) to 1 (identical direction)
   - Length-invariant (fair for different doc sizes)

4. **Ranked Retrieval**
   - Order documents by similarity score
   - Show most relevant results first
   - Unlike Boolean retrieval, returns graded results instead of an unordered match set

### Advantages over Boolean:
- ✓ **Ranked results**: Best matches first
- ✓ **Partial matching**: Some query terms can be missing
- ✓ **Graded relevance**: Similarity scores
- ✓ **User-friendly**: Free-text queries, no Boolean operators required

### Limitations:
- ✗ **Vocabulary mismatch**: Synonyms are not handled
- ✗ **No rarity weighting**: Raw term frequency gives a very common term the same weight per occurrence as a rare one
- ✗ **No context**: Word order is ignored
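Two of these limitations can be seen in a few lines. The token lists below are hypothetical English examples chosen for readability.

```python
from collections import Counter

# Word order: two token sequences with opposite meanings produce the same
# bag-of-words counts, so their vectors are identical.
tokens_a = ["dog", "bites", "man"]
tokens_b = ["man", "bites", "dog"]
same_counts = Counter(tokens_a) == Counter(tokens_b)
print(same_counts)  # True

# Vocabulary mismatch: synonyms occupy different dimensions, so a query for
# "physician" shares no terms with a document that only says "doctor".
query_terms = Counter(["physician"])
doc_terms = Counter(["doctor", "hospital"])
shared_terms = query_terms.keys() & doc_terms.keys()
print(len(shared_terms))  # 0
```

With no shared terms the dot product is zero, so the cosine score is 0.0 even though the two texts are about the same topic.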

### Next Steps:
In the next notebook (`06_tf_idf_ranking.ipynb`), we will:
- Implement TF-IDF weighting
- Give rare terms higher importance
- Penalize common terms
- Improve ranking quality

### Research References:
- Gerard Salton: Pioneer of VSM (1970s)
- Manning et al., "Introduction to Information Retrieval", Chapter 6
- VSM underlies many later ranking approaches, including TF-IDF scoring
- Lucene and Elasticsearch historically used a VSM-based TF-IDF similarity as their default scoring