# 10. Cross-Lingual Information Retrieval (PLACEHOLDER)

## Introduction

**Cross-Lingual IR (CLIR)** enables searching documents in one language using queries in another language.

### Example:
```
Query (English): "Mount Everest tourism"
Documents (Nepali): About सगरमाथा and पर्यटन
Goal: Retrieve relevant Nepali documents!
```

### Applications:
- Multilingual search engines
- International news retrieval
- Digital libraries
- E-commerce

---

## Approaches to CLIR

### 1. **Query Translation**
```
English Query → Machine Translation → Nepali Query → Search
```
- Translate query to document language
- Use existing monolingual IR

### 2. **Document Translation**
```
Nepali Documents → Machine Translation → English Documents → Search
```
- Translate all documents
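- Expensive up front, since the whole collection must be translated and re-indexed, but each document gives the MT system more context than a short query does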
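
The pipeline above can be sketched in a few lines, assuming a toy word-by-word Nepali → English dictionary in place of a real MT system. The names `ne_en_dict`, `translate_document`, `build_inverted_index`, and `search` are illustrative and not taken from any library; the documents are made up.

```python
from collections import defaultdict

# Toy Nepali -> English dictionary (hypothetical; a real system would use MT).
ne_en_dict = {
    "नेपाल": "nepal",
    "हिमाल": "mountain",
    "पर्यटन": "tourism",
    "शिक्षा": "education",
    "स्वास्थ्य": "health",
}

def translate_document(nepali_text, dictionary):
    """Translate word by word; words missing from the dictionary are dropped."""
    return [dictionary[t] for t in nepali_text.split() if t in dictionary]

def build_inverted_index(translated_docs):
    """Map each English term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in translated_docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index

def search(query, index):
    """Return ids of documents containing every query term (boolean AND)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

nepali_docs = {
    "d1": "नेपाल हिमाल पर्यटन",
    "d2": "शिक्षा स्वास्थ्य",
    "d3": "नेपाल शिक्षा",
}

# Offline step: translate the whole collection once, then index it in English.
translated = {d: translate_document(text, ne_en_dict) for d, text in nepali_docs.items()}
index = build_inverted_index(translated)

print(sorted(search("Nepal tourism", index)))  # ['d1']
print(sorted(search("education", index)))      # ['d2', 'd3']
```

The translation cost is paid once per document at indexing time, so query-time search is ordinary monolingual retrieval.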

### 3. **Interlingual Approach**
```
Query + Documents → Universal Representation → Match
```
- Map both to language-independent space
- The shared space is typically learned from parallel or comparable corpora
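
As a rough illustration of matching in a language-independent space, the sketch below assumes a hand-built concept inventory in which words from both languages map to shared concept ids. The inventory `concept_of`, the ids such as `C_EVEREST`, and the helper names are all hypothetical; real systems learn the shared representation from data rather than listing it by hand.

```python
# Hypothetical concept inventory: words from either language map to a shared,
# language-independent concept id.
concept_of = {
    "everest": "C_EVEREST", "सगरमाथा": "C_EVEREST",
    "tourism": "C_TOURISM", "पर्यटन": "C_TOURISM",
    "education": "C_EDUCATION", "शिक्षा": "C_EDUCATION",
    "health": "C_HEALTH", "स्वास्थ्य": "C_HEALTH",
}

def to_concepts(text):
    """Project text in either language onto the set of known concept ids."""
    return {concept_of[t] for t in text.lower().split() if t in concept_of}

def jaccard(a, b):
    """Overlap between two concept sets, in [0, 1]."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def rank(query, docs):
    """Score every document against the query in concept space."""
    q = to_concepts(query)
    scores = {doc_id: jaccard(q, to_concepts(text)) for doc_id, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    "d1": "सगरमाथा पर्यटन",
    "d2": "शिक्षा स्वास्थ्य",
}

print(rank("Mount Everest tourism", docs))  # [('d1', 1.0), ('d2', 0.0)]
```

Neither the query nor the documents are translated here; both sides are compared only after projection into the shared space.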

---

## Placeholder Note

**A full CLIR system typically relies on one or more of:**
- Machine translation systems
- Bilingual dictionaries
- Cross-lingual word embeddings
- Parallel corpora

---

## Simple Concept: Bilingual Dictionary

In [1]:
# Simple English-Nepali dictionary (would normally be loaded from file)
en_ne_dict = {
    'nepal': 'नेपाल',
    'mountain': 'हिमाल',
    'education': 'शिक्षा',
    'health': 'स्वास्थ्य',
    'tourism': 'पर्यटन',
    'culture': 'संस्कृति',
    'history': 'इतिहास',
    'government': 'सरकार',
    'economy': 'अर्थतन्त्र',
    'everest': 'सगरमाथा',
}

def translate_query(english_query, dictionary):
    """
    Simple query translation using bilingual dictionary.
    
    Parameters:
    -----------
    english_query : str
        Query in English
    dictionary : dict
        English → Nepali dictionary
    
    Returns:
    --------
    str : Translated query in Nepali
    """
    # Tokenize and lowercase
    tokens = english_query.lower().split()
    
    # Translate each word
    nepali_tokens = []
    for token in tokens:
        if token in dictionary:
            nepali_tokens.append(dictionary[token])
        # else: word not in dictionary, skip
    
    return ' '.join(nepali_tokens)

# Example
print("📚 Cross-Lingual Query Translation Example:")
print("="*70)

english_queries = [
    "Nepal mountain tourism",
    "education and health",
    "Mount Everest"
]

for eng_query in english_queries:
    nep_query = translate_query(eng_query, en_ne_dict)
    print(f"English: {eng_query}")
    print(f"Nepali:  {nep_query}")
    print()

print("="*70)
print("\n📌 This is a PLACEHOLDER implementation.")
print("📌 Real CLIR requires: MT systems, word embeddings, or parallel corpora.")

📚 Cross-Lingual Query Translation Example:
======================================================================
English: Nepal mountain tourism
Nepali:  नेपाल हिमाल पर्यटन

English: education and health
Nepali:  शिक्षा स्वास्थ्य

English: Mount Everest
Nepali:  सगरमाथा

======================================================================

📌 This is a PLACEHOLDER implementation.
📌 Real CLIR requires: MT systems, word embeddings, or parallel corpora.
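
To connect a translated query to retrieval, here is a minimal self-contained sketch that ranks a toy Nepali collection by term overlap. The small dictionary `en_ne`, the three documents, and the helper names are made up for illustration, and the scoring is a plain term-count sum rather than a production ranking function.

```python
from collections import Counter

# Small illustrative dictionary and collection (not real data).
en_ne = {
    "nepal": "नेपाल",
    "tourism": "पर्यटन",
    "everest": "सगरमाथा",
    "health": "स्वास्थ्य",
}

nepali_docs = {
    "d1": "सगरमाथा पर्यटन नेपाल",
    "d2": "नेपाल स्वास्थ्य",
    "d3": "सगरमाथा",
}

def translate(query):
    """Word-by-word query translation; untranslatable words are skipped."""
    return [en_ne[t] for t in query.lower().split() if t in en_ne]

def rank(query):
    """Rank documents by how many translated query terms they contain."""
    q_terms = translate(query)
    scores = {}
    for doc_id, text in nepali_docs.items():
        tf = Counter(text.split())
        score = sum(tf[t] for t in q_terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

print(rank("Mount Everest tourism"))  # [('d1', 2), ('d3', 1)]
```

Because "mount" has no dictionary entry, only two of the three query words contribute to the score, which is the kind of coverage gap the challenges below describe.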


---

## Challenges in CLIR

### 1. **Translation Ambiguity**
- One word may have multiple translations
- Context needed for disambiguation

### 2. **Out-of-Vocabulary (OOV) Words**
- Named entities
- Technical terms
- New words
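
One simple way to handle OOV words is to pass them through untranslated and report them, instead of silently dropping them. The sketch below assumes this pass-through policy; `translate_with_oov` is a hypothetical helper, and pass-through only helps when the documents contain the term in the query's script, so a transliteration step would usually be needed as well.

```python
def translate_with_oov(query, dictionary):
    """Translate known words; keep unknown words as-is and report them."""
    translated, oov = [], []
    for token in query.lower().split():
        if token in dictionary:
            translated.append(dictionary[token])
        else:
            translated.append(token)  # pass-through fallback
            oov.append(token)
    return " ".join(translated), oov

# Toy dictionary; the place name "Pokhara" is deliberately missing.
en_ne = {"nepal": "नेपाल", "tourism": "पर्यटन"}

text, missing = translate_with_oov("Pokhara tourism", en_ne)
print(text)     # pokhara पर्यटन
print(missing)  # ['pokhara']
```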

### 3. **Query Translation Errors**
- Wrong translation breaks retrieval
- Mitigation: keep several candidate translations per query term, optionally weighted, instead of committing to a single one
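
A minimal sketch of this mitigation, assuming a dictionary that stores several weighted candidates per English word. The weights below are made up; in practice they might come from translation probabilities estimated on a parallel corpus. The names `en_ne_multi`, `expand_query`, and `score` are illustrative.

```python
# Hypothetical dictionary with several weighted candidates per English word.
en_ne_multi = {
    "mountain": [("हिमाल", 0.75), ("पहाड", 0.25)],
    "tourism": [("पर्यटन", 1.0)],
}

def expand_query(query, dictionary):
    """Return a weighted bag of Nepali terms covering all candidates."""
    weights = {}
    for token in query.lower().split():
        for term, w in dictionary.get(token, []):
            weights[term] = weights.get(term, 0.0) + w
    return weights

def score(doc_text, weighted_query):
    """Sum the weights of query terms that occur in the document."""
    doc_terms = set(doc_text.split())
    return sum(w for term, w in weighted_query.items() if term in doc_terms)

q = expand_query("mountain tourism", en_ne_multi)
print(score("पहाड पर्यटन", q))  # 1.25
print(score("हिमाल", q))        # 0.75
```

A document that uses the less likely translation still matches, only with a lower contribution from that term.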

### 4. **Resource Scarcity**
- Low-resource languages lack:
  - Parallel corpora
  - MT systems
  - Bilingual dictionaries

---

## Modern Approaches

### 1. **Cross-Lingual Word Embeddings**
```python
# PLACEHOLDER - Requires training
# Map English and Nepali words to shared space
# similarity("mountain", "हिमाल") = high
```
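
To make the idea concrete, the sketch below uses made-up 3-dimensional vectors that stand in for embeddings already aligned into one shared space. The vectors, the `shared_space` table, and the helper names are assumptions for illustration only; real aligned embeddings would have hundreds of dimensions and come from a trained model.

```python
import math

# Made-up vectors standing in for embeddings that have already been aligned
# into one shared English-Nepali space.
shared_space = {
    "mountain": [0.9, 0.1, 0.0],
    "हिमाल": [0.8, 0.2, 0.1],
    "education": [0.0, 0.9, 0.3],
    "शिक्षा": [0.1, 0.8, 0.4],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, candidates):
    """Return the candidate whose vector is closest to the given word."""
    return max(candidates, key=lambda c: cosine(shared_space[word], shared_space[c]))

nepali_vocab = ["हिमाल", "शिक्षा"]
print(nearest("mountain", nepali_vocab))   # हिमाल
print(nearest("education", nepali_vocab))  # शिक्षा
```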

### 2. **Multilingual Neural Models**
```python
# PLACEHOLDER - Requires deep learning
# mBERT, XLM-R: Pretrained multilingual models
# Directly match queries and documents across languages
```

### 3. **Zero-Shot Translation**
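- Translate between a language pair for which no direct parallel data exists
- Pivot through a high-resource language (usually English)
- Unsupervised methods that learn from monolingual data only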
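
Pivoting can be sketched with two toy dictionaries chained through English. The French entries, the `compose` helper, and the resulting `fr_ne` table are illustrative assumptions; a real system would chain MT models or much larger lexicons.

```python
# Toy dictionaries (illustrative only). There is no direct French -> Nepali
# resource here, so English serves as the pivot.
fr_en = {"montagne": "mountain", "tourisme": "tourism", "santé": "health"}
en_ne = {"mountain": "हिमाल", "tourism": "पर्यटन"}

def compose(src_to_pivot, pivot_to_tgt):
    """Build a source -> target dictionary by chaining through the pivot."""
    return {
        src: pivot_to_tgt[pivot]
        for src, pivot in src_to_pivot.items()
        if pivot in pivot_to_tgt
    }

fr_ne = compose(fr_en, en_ne)
print(fr_ne)  # {'montagne': 'हिमाल', 'tourisme': 'पर्यटन'}
```

The entry for "santé" is lost because its pivot word has no Nepali translation, which shows how coverage gaps in either step carry over to the composed result.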

---

## Summary

### Key Concepts:

1. **CLIR** breaks language barriers in search
2. **Query Translation** is the simplest approach
3. **Bilingual Resources** are essential
4. **Modern Methods** use embeddings and neural models

### Evaluation:
- Use CLEF, NTCIR datasets
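- Compare against a monolingual retrieval baseline, for example by reporting CLIR effectiveness as a percentage of monolingual effectiveness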
- Measure translation quality impact
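
The comparison with a monolingual baseline can be sketched with mean average precision (MAP). The relevance judgments and rankings below are made up, and the function names are illustrative; a real evaluation would read judgments and runs from a test collection.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision values at each rank where a relevant doc appears."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs, qrels):
    """Mean of per-query average precision over all judged queries."""
    return sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)

# Made-up relevance judgments and rankings for two queries.
qrels = {"q1": {"d1", "d2"}, "q2": {"d3"}}
monolingual_run = {"q1": ["d1", "d2", "d4"], "q2": ["d3", "d1"]}
clir_run = {"q1": ["d1", "d4", "d2", "d5"], "q2": ["d1", "d3"]}

mono_map = mean_average_precision(monolingual_run, qrels)
clir_map = mean_average_precision(clir_run, qrels)
print(f"monolingual MAP: {mono_map:.3f}")                             # 1.000
print(f"CLIR MAP:        {clir_map:.3f}")                             # 0.667
print(f"CLIR as % of monolingual: {100 * clir_map / mono_map:.1f}%")  # 66.7%
```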

### For Full Implementation:
You would need:
- **Google Translate API** or similar MT service
- **Cross-lingual word embeddings** (MUSE, VecMap)
- **Multilingual BERT** for neural approaches
- **Parallel corpora** for training

### Real-World Systems:
- **Google**: Multilingual search
- **Bing**: Cross-language retrieval
- **European Parliament**: Document retrieval in 24 languages

### References:
- Manning et al., "Introduction to Information Retrieval", Chapter 9.4
- CLEF (Conference and Labs of the Evaluation Forum, formerly the Cross-Language Evaluation Forum)
- NTCIR (NII Testbeds and Community for Information access Research)
- Ruder et al. (2019): "A Survey of Cross-lingual Word Embedding Models"