# 00. Data Expansion & Synthetic Generation

To support advanced IR topics such as **Topic Modeling (LDA)**, **Neural IR**, and **Evaluation**, we need more data than our initial 10 documents.

This notebook generates:
1. **Synthetic Corpus**: 50 documents across 5 topics (10 per topic).
2. **Evaluation Dataset**: `relevance_judgments.json` (queries and graded relevant documents).
3. **Dummy Embeddings**: `word_vectors.json` (simulated Word2Vec vectors for the Neural IR demo).

**Note:** We use a simple template-based generator to produce grammatically acceptable Nepali sentences for algorithmic testing. The resulting documents may be semantically nonsensical.

In [5]:
import os
import json
import random
import numpy as np
from pathlib import Path

DATA_DIR = Path('../data')
DATA_DIR.mkdir(exist_ok=True)

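## 1. Synthetic Corpus Generation

We define 5 topics, each with a keyword vocabulary and a few template sentences. Each document consists of three randomly chosen template sentences followed by five randomly sampled keywords. With 10% probability, one extra sentence is appended as cross-topic noise.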

In [6]:
topics = {
    "politics": {
        "keywords": ["सरकार", "मन्त्री", "चुनाव", "संसद", "संविधान", "राजनीति", "नेता", "जनता", "लोकतन्त्र", "अधिकार"],
        "sentences": [
            "सरकारले जनताको अधिकार सुनिश्चित गर्नुपर्छ।",
            "नयाँ संविधान जारी भएपछि चुनाव भयो।",
            "संसदमा मन्त्रीले भाषण गरे।",
            "राजनीति देशको मेरुदण्ड हो।"
        ]
    },
    "sports": {
        "keywords": ["फुटबल", "क्रिकेट", "खेलकुद", "मैदान", "खेलाडी", "प्रतियोगिता", "गोल", "ब्याटिङ", "जित", "हार"],
        "sentences": [
            "नेपालले क्रिकेट प्रतियोगिता जित्यो।",
            "खेलाडीले मैदानमा अभ्यास गरे।",
            "फुटबल खेल रोमाञ्चक थियो।",
            "खेलकुदले स्वास्थ्यलाई फाइदा गर्छ।"
        ]
    },
    "technology": {
        "keywords": ["कम्प्युटर", "इन्टरनेट", "मोबाइल", "सफ्टवेयर", "डिजिटल", "प्रविधि", "डाटा", "अनलाइन", "वेबसाइट", "एप"],
        "sentences": [
            "आजकल सबै काम इन्टरनेटबाट हुन्छ।",
            "मैले नयाँ मोबाइल किने।",
            "कम्प्युटर प्रविधिले विकास ल्याएको छ।",
            "सफ्टवेयर इन्जिनियरहरू कोड लेख्छन्।"
        ]
    },
    "travel": {
        "keywords": ["हिमाल", "पर्यटन", "पदयात्रा", "होटल", "पोखरा", "सगरमाथा", "पर्यटक", "यात्रा", "प्राकृतिक", "दृश्य"],
        "sentences": [
            "नेपाल हिमालको देश हो।",
            "पोखरामा धेरै पर्यटक आउँछन्।",
            "सगरमाथा चढ्न विदेशीहरू आउँछन्।",
            "हामी पदयात्रामा गयौं।"
        ]
    },
    "culture": {
        "keywords": ["चाडपर्व", "संस्कृति", "दशैं", "तिहार", "मन्दिर", "पूजा", "परम्परा", "धर्म", "जात्रा", "भेषभुषा"],
        "sentences": [
            "नेपालमा धेरै चाडपर्व मनाइन्छ।",
            "दशैं नेपालीहरूको महान चाड हो।",
            "हामी मन्दिरमा पूजा गर्छौं।",
            "हाम्रो संस्कृति धेरै धनी छ।"
        ]
    }
}

def generate_document(topic_name, doc_id):
    data = topics[topic_name]
    # Base content
    content = [random.choice(data["sentences"]) for _ in range(3)]
    
    # Add random keywords for density
    extra_keywords = random.sample(data["keywords"], 5)
    content.append(" ".join(extra_keywords) + "।")
    
    # Mix in one sentence from a *different* topic as noise (10% chance)
    if random.random() < 0.1:
        other_topic = random.choice([t for t in topics if t != topic_name])
        content.append(random.choice(topics[other_topic]["sentences"]))
        
    text = " ".join(content)
    
    filename = DATA_DIR / f"doc{doc_id:03d}_{topic_name}.txt"
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(text)
    return filename

# Generate 50 documents (10 per topic)
doc_count = 10  # Last existing ID; new documents start at 11 (assuming docs 1-10 exist)
generated_files = []

print("Generating documents...")
for topic in topics:
    for _ in range(10):
        doc_count += 1
        filepath = generate_document(topic, doc_count)
        generated_files.append(filepath.name)
        
print(f"✓ Generated {len(generated_files)} new documents.")

Generating documents...
✓ Generated 50 new documents.
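
As a quick sanity check on the generated corpus, the per-topic document count can be recovered from the filenames alone. The sketch below is only illustrative: `count_docs_by_topic` is a hypothetical helper (not part of the notebook), and it assumes names of the form `doc011_politics.txt`. Files without a `_topic` suffix, which may be the case for the original 10 documents, are skipped.

```python
from collections import Counter

def count_docs_by_topic(filenames):
    """Count documents per topic from names like 'doc011_politics.txt'."""
    counts = Counter()
    for name in filenames:
        stem = name.rsplit(".", 1)[0]        # 'doc011_politics'
        _, _, topic = stem.partition("_")    # text after the first underscore
        if topic:                            # skip names without a topic suffix
            counts[topic] += 1
    return dict(counts)

# Illustrative filenames; in the notebook one could pass `generated_files`.
example = ["doc011_politics.txt", "doc012_politics.txt",
           "doc021_sports.txt", "doc001.txt"]
print(count_docs_by_topic(example))
```

Passing `generated_files` from the cell above should yield 10 documents for each of the 5 topics.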


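## 2. Evaluation Ground Truth

To evaluate our search engine with metrics such as NDCG and MAP, we need ground-truth relevance judgments. We create a JSON file that maps each query to relevant document IDs, derived from the topic encoded in each generated filename. On-topic documents receive grade 3, and a small random fraction of the remaining documents receive grade 1 to simulate partial relevance.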

In [7]:
relevance_data = {
    "queries": {
        "q1": "नेपालको राजनीति र सरकार",
        "q2": "फुटबल र क्रिकेट खेल",
        "q3": "कम्प्युटर प्रविधि",
        "q4": "हिमाल आरोहण पर्यटन",
        "q5": "नेपाली संस्कृति र चाडपर्व"
    },
    "assessments": {}
}

# Auto-generate assessments based on filenames
# Files are named like doc011_politics.txt
all_files = sorted([f.name for f in DATA_DIR.glob("*.txt")])

topic_map = {
    "q1": "politics",
    "q2": "sports",
    "q3": "technology",
    "q4": "travel",
    "q5": "culture"
}

for q_id, topic in topic_map.items():
    relevance_data["assessments"][q_id] = {}
    
    for fname in all_files:
        doc_id = fname.split('.')[0] # doc011_politics
        
        score = 0
        if topic in fname: # Highly relevant
            score = 3
        elif random.random() < 0.05: # Random noise/partial relevance
            score = 1
            
        if score > 0:
            relevance_data["assessments"][q_id][doc_id] = score

with open(DATA_DIR / 'relevance_judgments.json', 'w', encoding='utf-8') as f:
    json.dump(relevance_data, f, indent=4, ensure_ascii=False)
    
print("✓ Generated relevance_judgments.json")

✓ Generated relevance_judgments.json
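
To illustrate how these graded judgments might later be consumed, here is a minimal sketch of NDCG@k and average precision for a single query. The function names, the linear-gain DCG variant, and the `min_grade=1` cutoff for binary relevance are assumptions made for this example. They are not definitions taken from the evaluation notebook, and the ranked list is made up.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(ranked_doc_ids, judgments, k=10):
    """NDCG@k for one query; `judgments` maps doc_id -> graded relevance."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def average_precision(ranked_doc_ids, judgments, min_grade=1):
    """Average precision, treating grades >= min_grade as relevant (assumption)."""
    relevant = {d for d, g in judgments.items() if g >= min_grade}
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank   # precision at this relevant hit
    return total / len(relevant)

# Illustrative judgments in the same shape as one entry of "assessments".
judgments = {"doc011_politics": 3, "doc012_politics": 3, "doc025_sports": 1}
ranking = ["doc012_politics", "doc030_sports", "doc011_politics"]  # hypothetical system output
print(f"NDCG@10 = {ndcg_at_k(ranking, judgments):.3f}")
print(f"AP      = {average_precision(ranking, judgments):.3f}")
```

With the file written above, the judgments for one query would be `relevance_data["assessments"]["q1"]`. MAP is then the mean of `average_precision` over all queries.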


## 3. Dummy Vectors for Neural IR

Training a Word2Vec model on 60 small documents would not produce meaningful embeddings. For the Neural IR notebook, we instead generate **dummy pre-trained vectors**.

We enforce semantic relationships manually:
- 'politics' words will be close to each other in vector space.
- 'sports' words will be far from 'politics' words.

In [8]:
def create_dummy_vectors(topics, dim=50):
    vocab_vectors = {}
    
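    # Generate a base center vector for each topic. Centers are drawn from
    # [-1, 1) rather than [0, 1) so that different topics are roughly
    # orthogonal; all-positive centers would keep cross-topic cosine
    # similarity high.
    topic_centers = {t: np.random.uniform(-1, 1, dim) for t in topics}
    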
    for topic_name, data in topics.items():
        center = topic_centers[topic_name]
        
        for word in data["keywords"]:
            # Word vector = Topic Center + Small Random Noise
            noise = np.random.normal(0, 0.1, dim)
            vector = center + noise
            vocab_vectors[word] = vector.tolist()
            
    return vocab_vectors

vectors = create_dummy_vectors(topics)

with open(DATA_DIR / 'word_vectors.json', 'w', encoding='utf-8') as f:
    json.dump(vectors, f, indent=4, ensure_ascii=False)
    
print(f"✓ Generated word_vectors.json with {len(vectors)} words.")
print("  Note: 'सरकार' and 'मन्त्री' should be mathematically close now.")

✓ Generated word_vectors.json with 50 words.
  Note: 'सरकार' and 'मन्त्री' should be mathematically close now.
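
One way to check that claim is to compare cosine similarities within and across topics. The sketch below is a minimal illustration under stated assumptions: `cosine_similarity` is a hypothetical helper written in pure Python, and the 3-dimensional toy vectors are hand-made stand-ins, not values from `word_vectors.json`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only: two 'politics' words and one 'sports' word.
toy = {
    "सरकार": [1.0, 0.1, 0.0],
    "मन्त्री": [0.9, 0.2, 0.0],
    "फुटबल": [0.0, 0.1, 1.0],
}

same_topic = cosine_similarity(toy["सरकार"], toy["मन्त्री"])
cross_topic = cosine_similarity(toy["सरकार"], toy["फुटबल"])
print(f"same topic: {same_topic:.3f}, cross topic: {cross_topic:.3f}")
```

To run the same check on the generated data, the `vectors` dictionary from the cell above (or the contents of `word_vectors.json` loaded with `json.load`) can be used in place of `toy`.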
