# Problem 3: Multilingual Retrieval-Augmented Generation (25 points)

Implement a multilingual search and retrieval-augmented generation (RAG) system using the OPUS Books dataset, which contains parallel text in English and Italian. You will create a system that can search across languages and generate content based on the retrieved passages.

## Problem 3(a): Setting up the vector search system (8 points)

- Use sentence-transformers' multilingual model `paraphrase-multilingual-MiniLM-L12-v2`
- Create vector embeddings for the OPUS Books text passages
- Build a FAISS index for efficient similarity search
- Save and load the index for reuse
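
For intuition about what a flat L2 index computes, the following is a minimal NumPy sketch of exact nearest-neighbour search by squared L2 distance, which is the quantity `faiss.IndexFlatL2` reports. The function name `flat_l2_search` and the random toy vectors are illustrative assumptions, not part of the assignment. A brute-force reference like this can be used to sanity-check a FAISS index on a small sample.

```python
import numpy as np


def flat_l2_search(database: np.ndarray, queries: np.ndarray, k: int):
    """Exact k-nearest-neighbour search by squared L2 distance (brute force).

    Hypothetical helper for illustration. Returns (D, I) in the same layout
    as a FAISS search: one row per query, columns sorted by ascending distance.
    """
    database = np.asarray(database, dtype=np.float32)
    queries = np.asarray(queries, dtype=np.float32)
    # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2, computed for all pairs at once
    sq_dists = (
        (queries ** 2).sum(axis=1, keepdims=True)
        - 2.0 * queries @ database.T
        + (database ** 2).sum(axis=1)
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from rounding
    order = np.argsort(sq_dists, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order


# Toy data standing in for real sentence embeddings (assumption: 384 dimensions)
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(20, 384)).astype(np.float32)

# Querying with stored vectors: each should find itself first, at distance ~0
D, I = flat_l2_search(toy_embeddings, toy_embeddings[:3], k=5)
print(I[:, 0])
print(D.shape)
```

Because the queries are passed as a matrix, the same call handles one query or many, which mirrors how a FAISS index accepts a batch of query vectors.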

In [None]:
# ── Required packages (Python 3.12.6, tested versions) ─────────────────────
# Pinned install (recommended for reproducibility):
# !pip install \
#     torch==2.10.0 \
#     numpy==2.4.2 \
#     faiss-cpu==1.13.2 \
#     sentence-transformers==5.2.3 \
#     transformers==5.2.0 \
#     datasets==4.5.0 \
#     tqdm==4.67.2 \
#     sentencepiece==0.2.1 \
#     accelerate==1.12.0
#
# Latest versions (may need minor adjustments):
# !pip install torch numpy faiss-cpu sentence-transformers transformers datasets tqdm sentencepiece accelerate

import os

# macOS: prevent Metal (MPS) memory conflicts when loading multiple large models
os.environ.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", "0.0")

import numpy as np
import torch
import faiss
import time
import json
from tqdm import tqdm
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from transformers import MBartForConditionalGeneration, MBart50Tokenizer

#### Loading and Processing the Dataset

In [None]:
# Load 1000 parallel English-Italian text pairs from OPUS Books
dataset = load_dataset("opus_books", "en-it", split="train[:1000]")

texts_en = [item['translation']['en'] for item in dataset]
texts_it = [item['translation']['it'] for item in dataset]

print(f"Loaded {len(texts_en)} English passages and {len(texts_it)} Italian passages.")
print(f"\nExample English : {texts_en[0][:100]}...")
print(f"Example Italian : {texts_it[0][:100]}...")

#### Initialization

In [None]:
# Initialize the multilingual sentence transformer for creating embeddings
# Model name: 'paraphrase-multilingual-MiniLM-L12-v2'  (hint: pass device='cpu')
model = None  # YOUR CODE HERE

# Initialize the mBART model and tokenizer for multilingual text generation
# Model name: 'facebook/mbart-large-50-many-to-many-mmt'
# Note: use MBart50Tokenizer (not AutoTokenizer); pass dtype=torch.float32 (named torch_dtype in
# transformers releases that predate the rename) and device_map="cpu" to the model
generator_model = None       # YOUR CODE HERE
generator_tokenizer = None   # YOUR CODE HERE

#### Create Embeddings

In [None]:
# Generate embeddings for all English and Italian passages, then combine them.
# The final array should have shape (2000, embedding_dim):
#   - rows 0..999   : English embeddings
#   - rows 1000..1999: Italian embeddings

embeddings_en = None  # YOUR CODE HERE: encode texts_en
embeddings_it = None  # YOUR CODE HERE: encode texts_it
embeddings    = None  # YOUR CODE HERE: stack into one array (hint: np.vstack)

print(f"Embedding shape: {embeddings.shape}")

#### FAISS Indexing for Efficient Similarity Search

In [None]:
def build_faiss_index(embeddings: np.ndarray) -> faiss.IndexFlatL2:
    """
    Build a FAISS flat L2 index from the given embedding matrix.

    Parameters
    ----------
    embeddings : np.ndarray, shape (n, d) -- should be float32

    Returns
    -------
    faiss.IndexFlatL2
    """
    # YOUR CODE HERE
    # 1. Create a faiss.IndexFlatL2 with dimension d = embeddings.shape[1]
    # 2. Add the embeddings to the index (ensure dtype is float32)
    pass


index = build_faiss_index(embeddings)
print(f"Index contains {index.ntotal} vectors.")

# Save the index to disk and reload it
# YOUR CODE HERE
# faiss.write_index(index, "multilingual_index.faiss")
# index = faiss.read_index("multilingual_index.faiss")
# print("Index saved and reloaded successfully.")

## Problem 3(b): Implement Multilingual Search (8 points)

- Create a search function that accepts queries in either English or Italian
- Add metadata filtering capability to search in specific languages
- Return top-k most relevant passages with scores
- Implement efficient batch processing for multiple queries
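
The filtering and batching bullets interact: if the index is asked for exactly k neighbours and the results are then filtered by language, fewer than k may survive. One possible approach is to fetch more neighbours than needed and truncate after filtering. The sketch below post-processes the `(D, I)` arrays that a batched index search returns. The name `postprocess_batch`, its parameters, and the toy arrays are hypothetical, and the sketch assumes English rows precede Italian rows in the index, as in the embedding layout from Problem 3(a).

```python
import numpy as np


def postprocess_batch(D, I, texts_en, texts_it, lang=None, k=5):
    """Turn batched search output into one result list per query.

    D, I : arrays of shape (n_queries, n_fetch) with distances and row ids,
           where n_fetch may be larger than k (over-fetching).
    Assumes rows [0, len(texts_en)) are English and the rest Italian.
    """
    n_en = len(texts_en)
    all_results = []
    for dists, ids in zip(D, I):
        hits = []
        for d, i in zip(dists, ids):
            if i < 0:  # FAISS pads with -1 when fewer neighbours exist
                continue
            i = int(i)
            if i < n_en:
                text, lang_code = texts_en[i], "en"
            else:
                text, lang_code = texts_it[i - n_en], "it"
            if lang is None or lang_code == lang:
                hits.append({"text": text, "lang": lang_code, "score": float(d)})
            if len(hits) == k:  # stop once k results survive the filter
                break
        all_results.append(hits)
    return all_results


# Toy example: 2 English + 2 Italian passages, 2 queries, 4 neighbours fetched
toy_en = ["a sea voyage", "a quiet village"]
toy_it = ["un viaggio per mare", "un villaggio tranquillo"]
toy_D = np.array([[0.1, 0.2, 0.9, 1.1], [0.3, 0.4, 0.8, 1.0]], dtype=np.float32)
toy_I = np.array([[0, 2, 1, 3], [3, 1, 2, 0]])

italian_only = postprocess_batch(toy_D, toy_I, toy_en, toy_it, lang="it", k=2)
print([[r["text"] for r in hits] for hits in italian_only])
```

In the notebook, `D` and `I` would come from encoding a list of queries in one `model.encode` call and passing the resulting matrix to a single `index.search` call.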

#### Implementing Multilingual Search

In [None]:
def search(query: str, lang: str | None = None, k: int = 5) -> list[dict]:
    """
    Perform semantic search over the multilingual FAISS index.

    Parameters
    ----------
    query : str       -- query text in English or Italian
    lang  : str|None  -- restrict results to 'en', 'it', or None for both
    k     : int       -- number of results to return

    Returns
    -------
    list[dict] -- each dict has keys 'text' (str), 'lang' (str), 'score' (float);
                  'score' is the squared L2 distance from the index, so lower means more similar
    """
    # YOUR CODE HERE
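    # 1. Encode the query: query_vector = model.encode([query])   (float32, shape (1, d))
    # 2. Search the index. When lang is set, fetch more than k neighbours so that
    #    k results remain after filtering (e.g. n_fetch = index.ntotal, else n_fetch = k):
    #      D, I = index.search(query_vector, n_fetch)
    # 3. For each (distance d, index i) pair in zip(D[0], I[0]):
    #      if i < len(texts_en): text = texts_en[i],                  lang_code = 'en'
    #      else:                  text = texts_it[i - len(texts_en)], lang_code = 'it'
    # 4. If lang is not None, keep only results where lang_code == lang; then keep the first k
    # 5. Return list of dicts with 'text', 'lang', 'score' (score = float(d))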
    pass

#### Testing Multilingual Search

In [None]:
# Test queries in both languages
queries = {
    "English": "stories about adventure and discovery",
    "Italian": "storie di avventura e scoperta"
}

for lang_name, query_text in queries.items():
    print(f"\n{'='*60}")
    print(f"{lang_name} query: '{query_text}'")
    print('='*60)

    start = time.perf_counter()  # monotonic, high-resolution timer for short intervals
    results = search(query_text, k=3)
    elapsed = time.perf_counter() - start

    print(f"Search time: {elapsed*1000:.2f} ms  |  {len(results)} result(s)")
    for i, r in enumerate(results, 1):
        print(f"  [{i}] ({r['lang'].upper()}, score={r['score']:.4f})  {r['text'][:90]}...")

# Cross-lingual: English query filtered to Italian results only
print(f"\n{'='*60}")
print("English query -> Italian results only")
print('='*60)
results_it = search("stories about adventure and discovery", lang='it', k=3)
for i, r in enumerate(results_it, 1):
    print(f"  [{i}] ({r['lang'].upper()}, score={r['score']:.4f})  {r['text'][:90]}...")

## Problem 3(c): Adding Retrieval-Augmented Generation (RAG) Capabilities (9 points)

In this section you will add RAG functionality by combining the search system above with
the mBART-large-50 language model.

### Tasks

1. **Content generation** -- implement `generate_content(prompt, context)` that feeds a
   retrieved passage plus an instruction prompt into the mBART model and returns the
   generated text.

2. **Single-document RAG** -- implement `rag_single(query, prompt)` that retrieves the
   single most relevant passage and generates content based on it.

3. **Multi-document RAG** -- implement `rag_group(query, prompt, k)` that retrieves the
   top-k passages, concatenates them as context, and generates a comparative response.

4. **Prompt strategies** -- experiment with different prompt types:
   - *Recommendation prompt*: `"Write a short book recommendation based on this excerpt:"`
   - *Comparative prompt*: `"Compare and contrast these book excerpts, discussing their themes and style:"`

5. **Testing** -- run the functions with:
   - English query: `"stories about adventure and discovery"`
   - Italian query:  `"storie di avventura e scoperta"`
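
As a small illustration of tasks 1 to 3, the following sketch shows one way to assemble the generator input from a prompt and retrieved results, and to pick an mBART-50 language code from a result's `lang` field. The helper names `build_generator_input` and `mbart_code`, the separator default, and the toy results are assumptions for illustration. Only the codes `en_XX` and `it_IT` come from mBART-50 itself.

```python
# mBART-50 language codes for the two languages used in this problem
MBART_CODES = {"en": "en_XX", "it": "it_IT"}


def mbart_code(lang: str) -> str:
    """Map the 'en'/'it' labels used by the search results to mBART-50 codes."""
    return MBART_CODES[lang]


def build_generator_input(prompt: str, results: list[dict], sep: str = "\n---\n") -> str:
    """Join the retrieved texts with a separator and prepend the instruction."""
    context = sep.join(r["text"] for r in results)
    return f"{prompt}\n{context}"


# Toy results in the same format that the search function is specified to return
toy_results = [
    {"text": "They set sail at dawn.", "lang": "en", "score": 0.42},
    {"text": "Salparono all'alba.", "lang": "it", "score": 0.57},
]

generator_input = build_generator_input(
    "Compare and contrast these book excerpts:", toy_results
)
print(generator_input)
print(mbart_code(toy_results[1]["lang"]))
```

Keeping input assembly separate from generation makes it easy to inspect exactly what the model receives, which helps when comparing prompt strategies.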

#### Implementing RAG Capabilities

In [None]:
def generate_content(prompt: str, context: str) -> str:
    """
    Generate text using the mBART model conditioned on a prompt and context.

    Parameters
    ----------
    prompt  : str -- instruction for the model
    context : str -- retrieved passage(s) to base the generation on

    Returns
    -------
    str -- generated text
    """
    # YOUR CODE HERE
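    # 1. Build the input string: input_text = f"{prompt}\n{context}"
    # 2. Tokenize (set generator_tokenizer.src_lang first, e.g. "en_XX" or "it_IT",
    #    to match the input language):
    #      inputs = generator_tokenizer(input_text, return_tensors="pt",
    #                                   max_length=512, truncation=True)
    # 3. Generate; forced_bos_token_id selects the output language
    #    ("en_XX" for English, "it_IT" for Italian):
    #      outputs = generator_model.generate(
    #          **inputs,
    #          forced_bos_token_id=generator_tokenizer.convert_tokens_to_ids("en_XX"))
    #    (generator_tokenizer.lang_code_to_id["en_XX"] gives the same id where that
    #    attribute is available)
    # 4. Decode and return: generator_tokenizer.decode(outputs[0], skip_special_tokens=True)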
    pass


def rag_single(query: str, prompt: str) -> str:
    """
    Retrieve the single most relevant passage and generate content based on it.

    Parameters
    ----------
    query  : str -- search query
    prompt : str -- generation instruction

    Returns
    -------
    str -- generated text
    """
    # YOUR CODE HERE
    # 1. Call search(query, k=1) to get the top result
    # 2. Extract the 'text' field from the result
    # 3. Call generate_content(prompt, context) and return the output
    pass


def rag_group(query: str, prompt: str, k: int = 3) -> str:
    """
    Retrieve the top-k passages and generate content based on all of them.

    Parameters
    ----------
    query  : str -- search query
    prompt : str -- generation instruction
    k      : int -- number of passages to retrieve

    Returns
    -------
    str -- generated text
    """
    # YOUR CODE HERE
    # 1. Call search(query, k=k) to get the top-k results
    # 2. Join retrieved texts with "\n---\n" as a separator
    # 3. Call generate_content(prompt, context) and return the output
    pass

#### Testing RAG Capabilities

In [None]:
# Single-document RAG
print("=" * 60)
print("Single-Document RAG")
print("=" * 60)
result = rag_single(
    query="stories about adventure and discovery",
    prompt="Write a short book recommendation based on this excerpt:"
)
print(result)

# Multi-document RAG (English query)
print("\n" + "=" * 60)
print("Multi-Document RAG  (English query)")
print("=" * 60)
result = rag_group(
    query="stories about adventure and discovery",
    prompt="Compare and contrast these book excerpts, discussing their themes and style:",
    k=3
)
print(result)

# Multi-document RAG (Italian query)
print("\n" + "=" * 60)
print("Multi-Document RAG  (Italian query)")
print("=" * 60)
result = rag_group(
    query="storie di avventura e scoperta",
    prompt="Confronta e analizza questi estratti di libri, discutendo i temi e lo stile:",
    k=3
)
print(result)

## Bonus (+5 points): Semantic Caching

Implement a `SemanticCache` class to avoid redundant searches for repeated or highly
similar queries.

- Store each query's embedding alongside its results and the `lang` and `k` it was run with.
- For a new query, compute cosine similarity against all cached embeddings.
- If the maximum similarity is at or above a threshold (e.g. 0.95), return the cached results.
- Otherwise run a fresh search and add the result to the cache.

Measure and report the cache hit rate and average query latency with and without caching.
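
To make the cache logic concrete without loading any model, here is a minimal sketch that takes the encoder and the search function as arguments. `ToySemanticCache`, `encode_fn`, `search_fn`, and the toy bag-of-words encoder are all hypothetical stand-ins. In the notebook they would correspond to `model.encode` and `search`. The sketch matches cached entries on `lang` and `k` as well as similarity, an assumption made so that a cached unfiltered result is not returned for a filtered query.

```python
import time
import numpy as np


class ToySemanticCache:
    def __init__(self, encode_fn, search_fn, threshold: float = 0.95):
        self.encode_fn = encode_fn
        self.search_fn = search_fn
        self.threshold = threshold
        self.entries = []  # list of (unit-norm embedding, lang, k, results)
        self.hits = 0
        self.misses = 0

    def search_with_cache(self, query, lang=None, k=5):
        emb = np.asarray(self.encode_fn(query), dtype=np.float32)
        emb = emb / (np.linalg.norm(emb) + 1e-12)  # unit norm: dot product = cosine
        best_sim, best_results = -1.0, None
        for cached_emb, c_lang, c_k, c_results in self.entries:
            if (c_lang, c_k) != (lang, k):
                continue  # only reuse results produced with the same settings
            sim = float(emb @ cached_emb)
            if sim > best_sim:
                best_sim, best_results = sim, c_results
        if best_results is not None and best_sim >= self.threshold:
            self.hits += 1
            return best_results
        self.misses += 1
        results = self.search_fn(query, lang, k)
        self.entries.append((emb, lang, k, results))
        return results

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


VOCAB = ["adventure", "discovery", "stories", "love", "war"]


def toy_encode(query):
    """Toy bag-of-words encoder standing in for a sentence embedding model."""
    words = query.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def toy_search(query, lang=None, k=5):
    """Toy search standing in for the real index lookup."""
    return [{"text": f"result for {query!r}", "lang": lang or "en", "score": 0.0}]


cache = ToySemanticCache(toy_encode, toy_search)
start = time.perf_counter()
for q in ["stories about adventure", "adventure stories", "stories about war"]:
    cache.search_with_cache(q, k=3)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"hit rate: {cache.hit_rate:.2f}, total time: {elapsed_ms:.3f} ms")
```

Running the same query list once through the cache and once through the plain search function, each wrapped in a timer like the one above, gives the two average latencies to compare. With a real sentence encoder that returns a 2-D array for a one-element list, the single row would be taken before normalising.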

In [None]:
class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        """
        Parameters
        ----------
        threshold : float -- cosine similarity at or above which a cache hit is declared
        """
        self.threshold = threshold
        self.cache = []  # list of (query_embedding np.ndarray, lang, k, results list)

    def search_with_cache(self, query: str, lang: str | None = None, k: int = 5) -> list:
        """
        Return cached results for semantically similar past queries;
        otherwise run a fresh search and cache the result.

        Parameters
        ----------
        query : str
        lang  : str|None
        k     : int

        Returns
        -------
        list[dict] -- same format as search()
        """
        # YOUR CODE HERE
        # 1. Encode the query with model.encode([query])
        # 2. Among cached entries stored with the same lang and k, find the one whose
        #    embedding has the highest cosine similarity to the new embedding
        #      if that similarity >= self.threshold: return its cached results  (cache hit)
        # 3. Cache miss: call search(query, lang, k), append (embedding, lang, k, results)
        #    to self.cache, and return the results
        pass