
# Cross-Language Retrieval

In this notebook, you will evaluate models on the task of cross-language retrieval. We will use a sample of the first paragraphs of Wikipedia articles. Sometimes, a Wikipedia article in one language will be a translation of the article in another; in other cases, articles cover the same topic but are not translations. In either case, we use the links between Wikipedia articles in different languages as ground truth for our evaluation.

Since we often want to enrich the context information available to a language model with retrieval results, we will evaluate not only whether the exact matching document ranks highest, but also whether the matching document ranks in the top $k$.

Work through the notebook and complete code and text cells marked **TODO**.

We start by installing the `sentence-transformers` library.

In [2]:
%pip install -U sentence-transformers



We then download a sample of the first paragraphs of Wikipedia articles in six languages.

In [3]:
!wget https://raw.githubusercontent.com/dasmiq/cs6120-assignment5/refs/heads/main/sample-6lang.jsonl

--2025-12-09 04:33:51--  https://raw.githubusercontent.com/dasmiq/cs6120-assignment5/refs/heads/main/sample-6lang.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7514418 (7.2M) [text/plain]
Saving to: ‘sample-6lang.jsonl’


2025-12-09 04:33:51 (113 MB/s) - ‘sample-6lang.jsonl’ saved [7514418/7514418]



In [4]:
import json

articles = []

# Each line of the JSONL file holds one JSON record.
with open('sample-6lang.jsonl', mode='r', encoding='utf-8') as f:
    for line in f:
        articles.append(json.loads(line))

len(articles)

11838

We include articles from the three most prevalent Wikipedia languages—English, German, and French—and from three other languages in non-Latin scripts—Chinese, Arabic, and Greek. The dataset includes fields for the `text` of the paragraph, as well as the (lower-cased) `title` of the article and `lang` for the language code. Finally, each record contains the Wikidata `id` used to link related articles in different languages. For convenience, the records have been sorted by `id` and `lang`.
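
Later cells rely on this sort order: the n-th document in one language is assumed to describe the same Wikidata item as the n-th document in every other language. The following is a minimal sketch of how that assumption could be checked. The helper names `ids_by_language` and `is_aligned` and the toy records are hypothetical and only mimic the fields described above.

```python
from collections import defaultdict

def ids_by_language(records):
    """Group Wikidata ids by language code, preserving record order."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["lang"]].append(rec["id"])
    return dict(grouped)

def is_aligned(records):
    """Return True if every language lists the same ids in the same order."""
    id_lists = list(ids_by_language(records).values())
    return all(ids == id_lists[0] for ids in id_lists)

# Hypothetical toy records with the same fields as the dataset (not real data).
toy = [
    {"id": "Q1", "lang": "de", "title": "a", "text": "..."},
    {"id": "Q1", "lang": "en", "title": "a", "text": "..."},
    {"id": "Q2", "lang": "de", "title": "b", "text": "..."},
    {"id": "Q2", "lang": "en", "title": "b", "text": "..."},
]

print(is_aligned(toy))      # every language has ids Q1, Q2 in order
print(is_aligned(toy[:3]))  # English is missing Q2, so alignment fails
```

Running `is_aligned(articles)` on the real data would confirm whether index-based matching is safe before any evaluation is done.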

If you read a few of these languages (or translate them), you can look at a set of paragraphs and see that most pairs are not translations of each other.

In [5]:
articles[6:12]

[{'id': 'Q1005289',
  'lang': 'ar',
  'title': 'قانون الجنسية الكندي',
  'url': 'https://ar.wikipedia.org/wiki/%D9%82%D8%A7%D9%86%D9%88%D9%86%20%D8%A7%D9%84%D8%AC%D9%86%D8%B3%D9%8A%D8%A9%20%D8%A7%D9%84%D9%83%D9%86%D8%AF%D9%8A',
  'text': 'قانون الجنسية الكندي، يشار إليها أيضًا بالجنسية الكندية، هو وضع قانوني يمنح الشخص الطبيعي حقوقًا ومسؤوليات محددة في كندا. نشأ في عام ، وصار معلمًا هامًا في عملية استقلال كندا عن المملكة المتحدة مع دخول قانون الجنسية الكندية الأول حيز التنفيذ. تخضع الجنسية الكندية الآن لقانون الجنسية لعام 1977، الذي خضع لعدة تعديلات مهمة منذ دخوله حيز التنفيذ. كما ساهمت المحاكم الفيدرالية، من خلال قانونها القضائي، في توضيح التعريف القانوني للجنسية الكندية.'},
 {'id': 'Q1005289',
  'lang': 'de',
  'title': 'kanadische staatsangehörigkeit',
  'url': 'https://de.wikipedia.org/wiki/Kanadische%20Staatsangeh%C3%B6rigkeit',
  'text': 'Die kanadische Staatsbürgerschaft ( bzw. Canadian Citizenship) ist die Staatsbürgerschaft Kanadas, die im engeren Sinne seit 1947 existiert.'},

We load a sentence embedding model, `LaBSE`, that was trained on several languages, including the six we work with here.

In [6]:
from sentence_transformers import SentenceTransformer
import numpy as np
labse = SentenceTransformer('sentence-transformers/LaBSE')

To demonstrate finding similar paragraphs, we encode the text of the first twelve records, which gives us a 768-dimensional embedding vector for each one.

In [7]:
encoded = labse.encode([r['text'] for r in articles[0:12]])
encoded.shape

(12, 768)

Because LaBSE returns unit-length embeddings, multiplying this $12 \times 768$ matrix by its transpose gives a symmetric $12 \times 12$ matrix of cosine similarities between all pairs of paragraphs. The diagonal entries are, as expected, approximately 1. Records 0 to 5 are the six language versions of one article and records 6 to 11 are those of another, so in the first six rows the first six columns are higher than the last six, and in the last six rows the last six columns are higher than the first six.

In [8]:
encoded @ encoded.T

array([[1.        , 0.76256883, 0.7722026 , 0.7502576 , 0.5858218 ,
        0.6954646 , 0.37287953, 0.32353517, 0.33085033, 0.2717247 ,
        0.32900903, 0.20401588],
       [0.76256883, 0.99999994, 0.7690512 , 0.92336786, 0.63490856,
        0.65711933, 0.4801091 , 0.4884625 , 0.4031633 , 0.36106873,
        0.42383695, 0.26584467],
       [0.7722026 , 0.7690512 , 1.0000002 , 0.7562181 , 0.5307793 ,
        0.6586367 , 0.3545522 , 0.36299706, 0.35417843, 0.26405597,
        0.3212558 , 0.21048519],
       [0.7502576 , 0.92336786, 0.7562181 , 0.99999976, 0.6517093 ,
        0.6516982 , 0.41325855, 0.4034108 , 0.35402954, 0.3217687 ,
        0.3496586 , 0.21814035],
       [0.5858218 , 0.63490856, 0.5307793 , 0.6517093 , 1.0000002 ,
        0.541567  , 0.3823491 , 0.3980208 , 0.36890098, 0.3174026 ,
        0.33768153, 0.28169265],
       [0.6954646 , 0.65711933, 0.6586367 , 0.6516982 , 0.541567  ,
        0.9999998 , 0.27841944, 0.29993683, 0.30385193, 0.21876109,
        0.2602014 ,
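
The matrix product only equals cosine similarity when each row has unit length, which the near-1 diagonal above suggests is the case for these embeddings. As a small sketch on made-up vectors (the random `vectors` array is a stand-in, not real embeddings), the equivalence can be checked like this:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 8))  # hypothetical toy "embeddings"

# Scale each row to unit length.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Dot products between unit-length rows ...
gram = unit @ unit.T

# ... match the explicit cosine formula applied to the unscaled rows.
norms = np.linalg.norm(vectors, axis=1)
cosine = (vectors @ vectors.T) / np.outer(norms, norms)

print(np.allclose(gram, cosine))        # True
print(np.allclose(np.diag(gram), 1.0))  # True
```

If a model did not normalize its output, the same scaling step could be applied before the matrix product.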

## Evaluating Retrieval

To introduce the problem, we take some example Chinese paragraphs to use as queries and English paragraphs to use as candidate results to search through.

In [9]:
query_articles = [r['text'] for r in articles if r['lang'] == 'zh']
result_articles = [r['text'] for r in articles if r['lang'] == 'en']

To make the example clearer, we will use different numbers of queries and results.

In [10]:
qembed = labse.encode(query_articles[0:200])
rembed = labse.encode(result_articles[0:500])

Multiplying the query embeddings by the result embeddings, we get a $200 \times 500$ queries-by-results matrix.

In [11]:
sim = qembed @ rembed.T
sim.shape

(200, 500)

We use NumPy's `argmax` function along the second dimension (`axis=1`) to get the index of the top-ranked result for each query.

In [12]:
argmax = np.argmax(sim, axis=1)
argmax

array([454,   1,   2, 282, 410, 120,  13,   7, 162,   9,  10,  11,  12,
        13,  14,  15, 153, 212, 247,  19,  20,  21,  22,  23, 372, 372,
        26,  27,  28,  29,  82,  31,  32,  33,  34,  35,  36,  66,  93,
        39, 397,  41, 383,  83,  44,  45,  46,  47,  48,  49,  50, 245,
       296, 139,  54,  55,  56,  57,  58,  59, 266,  61,  62, 171,  64,
        31,  29,  67,  68,  69,  70,  71, 242, 372,  74,  75,  76,  77,
       253,  79,  80,  81,  82,  83, 160,  85,  86, 492,  88,  17,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99,  32, 101, 102, 103,
       104, 307, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120,  82, 122, 123, 124,  14, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       273, 144, 424, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 120, 165, 166, 167, 171,
       169, 170, 171, 147, 173, 174, 175, 176, 177, 178, 179, 18

Since the query and result documents are in the same order, matching Chinese and English documents have the same index. This allows us to compute the accuracy, or "recall at 1", of Chinese-to-English retrieval.

In [13]:
sum([a==b for (a, b) in zip(range(len(argmax)), argmax)])/len(argmax)

np.float64(0.785)
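
To make the computation concrete, here is a small worked sketch on a made-up similarity matrix (the values in `sim_toy` are invented for illustration). A query counts as a hit only when its top-scoring candidate has the same index as the query.

```python
import numpy as np

# Hypothetical 4-query x 5-candidate similarity matrix (made-up values).
sim_toy = np.array([
    [0.9, 0.1, 0.2, 0.3, 0.0],  # query 0: best candidate is 0 (correct)
    [0.2, 0.4, 0.8, 0.1, 0.3],  # query 1: best candidate is 2 (wrong)
    [0.1, 0.2, 0.7, 0.3, 0.6],  # query 2: best candidate is 2 (correct)
    [0.0, 0.1, 0.2, 0.5, 0.9],  # query 3: best candidate is 4 (wrong)
])

top1 = np.argmax(sim_toy, axis=1)        # top-ranked candidate per query
expected = np.arange(sim_toy.shape[0])   # query i should retrieve candidate i
recall_at_1_toy = float(np.mean(top1 == expected))

print(top1)             # [0 2 2 4]
print(recall_at_1_toy)  # 0.5
```

The vectorized comparison `top1 == expected` gives the same result as the `zip`-based sum used in the cell above.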

Your first task is to compute the recall at 1 for Arabic, Chinese, French, German, and Greek query documents matching English documents. Use the first 1000 English documents as the candidates you will search through.

In [14]:
candidates = labse.encode(result_articles[0:1000])

For each of the other five languages, construct embeddings for the first 1000 documents and measure how often the most similar English document is the matching one.

In [15]:
# TODO: Compute and print the recall at 1 for X-English retrieval
# where X \in {ar,de,el,fr,zh}

languages = ['ar', 'de', 'el', 'fr', 'zh']
recall_at_1 = {}

for lang in languages:
    # Collect first 1000 documents for this language
    query_texts = [r['text'] for r in articles if r['lang'] == lang][:1000]

    # Encode queries
    qembed = labse.encode(query_texts, convert_to_numpy=True)

    # Compute similarity with English candidate embeddings
    sim = qembed @ candidates.T  # (#queries × 1000)

    # Top-1 retrieved index
    predicted = np.argmax(sim, axis=1)

    # Ground truth: matching docs share the same index
    true_indices = np.arange(len(predicted))

    # Compute recall@1
    recall = np.mean(predicted == true_indices)
    recall_at_1[lang] = recall

# Print results
for lang, score in recall_at_1.items():
    print(f"{lang}-to-English recall@1: {score:.3f}")

ar-to-English recall@1: 0.868
de-to-English recall@1: 0.885
el-to-English recall@1: 0.884
fr-to-English recall@1: 0.856
zh-to-English recall@1: 0.717


We often use retrieved documents to provide extra context to a language model. In that case, we might retrieve more than one document per query to increase the likelihood that useful documents are in the top $k$. For each of the five non-English languages, write code to evaluate the **recall at k** (R@k), i.e., the proportion of queries for which the correct document was anywhere in the top k results.
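
Stated as a formula, under this notebook's convention that query $i$ matches candidate $i$: let $Q$ be the number of queries and $s_{ij}$ the similarity between query $i$ and candidate $j$ (these symbols are introduced here for illustration only). Then

$$
\mathrm{rank}_i = 1 + \left|\{\, j : s_{ij} > s_{ii} \,\}\right|,
\qquad
\mathrm{R@}k = \frac{1}{Q} \sum_{i=1}^{Q} \mathbb{1}\left[\mathrm{rank}_i \le k\right].
$$

With $k = 1$ and no ties in the scores, this reduces to the recall at 1 computed earlier.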

In [16]:
# TODO: Write a function to compute recall at k

def recall_at_k(sim_matrix, k):
    """
    Computes Recall@k for a similarity matrix.
    sim_matrix: np.ndarray of shape (num_queries, num_candidates)
    k: number of top results to consider
    """
    # Get top-k retrieved candidate indices for each query
    top_k = np.argsort(-sim_matrix, axis=1)[:, :k]

    # Ground truth: correct match for query i is candidate i
    true_idx = np.arange(sim_matrix.shape[0])[:, None]

    # Check if true index appears in top-k list
    hits = (top_k == true_idx).any(axis=1)

    # Proportion of queries for which correct doc was in top-k
    return hits.mean()
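
As an alternative sketch, the rank of the correct document can be computed once and then reused for any number of cutoffs, which avoids sorting every row. The names `correct_ranks` and `recall_at_ks` and the toy matrix are hypothetical. Note that this version counts only strictly higher scores, so it can differ from the `argsort` version above when scores are exactly tied.

```python
import numpy as np

def correct_ranks(sim_matrix):
    """1-based rank of candidate i for query i (strictly higher scores rank first)."""
    n = sim_matrix.shape[0]
    own = sim_matrix[np.arange(n), np.arange(n)][:, None]  # score of the correct match
    return 1 + (sim_matrix > own).sum(axis=1)

def recall_at_ks(sim_matrix, ks):
    """Recall at each cutoff in ks, computed from a single pass over the ranks."""
    ranks = correct_ranks(sim_matrix)
    return {k: float(np.mean(ranks <= k)) for k in ks}

# Hypothetical 4-query x 5-candidate similarity matrix (made-up values).
sim_toy = np.array([
    [0.9, 0.1, 0.2, 0.3, 0.0],  # correct match is ranked 1st
    [0.2, 0.4, 0.8, 0.1, 0.3],  # correct match is ranked 2nd
    [0.1, 0.2, 0.7, 0.3, 0.6],  # correct match is ranked 1st
    [0.0, 0.1, 0.6, 0.5, 0.9],  # correct match is ranked 3rd
])

print(correct_ranks(sim_toy))            # [1 2 1 3]
print(recall_at_ks(sim_toy, [1, 2, 3]))  # {1: 0.5, 2: 0.75, 3: 1.0}
```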

In [17]:
# TODO: Compute and print recall at 5 and recall at 10 for X-English retrieval
# where X \in {ar,de,el,fr,zh}

languages = ['ar', 'de', 'el', 'fr', 'zh']

for lang in languages:
    print(f"\nEvaluating {lang}-to-English...")

    # Retrieve first 1000 docs in that language
    query_texts = [r['text'] for r in articles if r['lang'] == lang][:1000]

    # Encode queries
    qembed = labse.encode(query_texts, convert_to_numpy=True)

    # Compute similarity to English candidates
    sim = qembed @ candidates.T  # (1000 × 1000)

    # Compute recall@5 and recall@10
    r5  = recall_at_k(sim, 5)
    r10 = recall_at_k(sim, 10)

    print(f"Recall@5 :  {r5:.3f}")
    print(f"Recall@10:  {r10:.3f}")


Evaluating ar-to-English...
Recall@5 :  0.945
Recall@10:  0.959

Evaluating de-to-English...
Recall@5 :  0.949
Recall@10:  0.962

Evaluating el-to-English...
Recall@5 :  0.948
Recall@10:  0.957

Evaluating fr-to-English...
Recall@5 :  0.929
Recall@10:  0.944

Evaluating zh-to-English...
Recall@5 :  0.831
Recall@10:  0.876


## Different Retrieval Strategies

**TODO**: Not all languages perform equally well using the LaBSE model. Your task is to find an alternative retrieval method that _improves performance for at least one language_ while _not degrading performance for other languages_.

You are free to use any open encoder or generative models available on huggingface. Here are three ideas to get you started. You only need to implement one improvement, although you may keep other dead-ends in the notebook.

1. Find other embedding models on huggingface that work better for, e.g., Chinese, while maintaining performance on the other languages.
1. LaBSE was trained on translation pairs, but Wikipedia articles are not necessarily translations of each other. Use the remaining articles in the dataset to fine-tune LaBSE (or another model). [This huggingface guide to fine-tuning sentence embeddings](https://huggingface.co/blog/train-sentence-transformers) may be helpful.
1. Instead of using embeddings, you could use a generative model to try to directly output the title of the English article given the foreign-language title and article. This approach is known as [generative retrieval](https://arxiv.org/abs/2404.14851).

What you try is up to you. Describe your approach and use the recall at k function above to evaluate your results.

I chose Strategy #1: try a different multilingual embedding model from Hugging Face, specifically `intfloat/multilingual-e5-large`, with the goal of improving Chinese retrieval while keeping the other languages at least as strong as with LaBSE.



**Improved Retrieval Strategy Using multilingual-e5-large: Approach**

With LaBSE, R@1 ranges from 0.856 to 0.885 for Arabic, German, Greek, and French, but reaches only 0.717 for Chinese.
To improve retrieval quality, we replace LaBSE with a more recent multilingual embedding model, `intfloat/multilingual-e5-large`.

This model is trained for text similarity and search tasks, and in the evaluation below it handles Chinese considerably better than LaBSE.
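
E5 models expect each input to start with a role marker, either `query: ` or `passage: `. As I understand the model card, `query: ` on both sides is suggested for symmetric tasks such as matching parallel texts, which is the setting used in this notebook, while pairing `query: ` with `passage: ` is meant for asymmetric search. The helper below is a hypothetical sketch of how the prefixing could be made explicit; the asymmetric variant was not evaluated here.

```python
def with_e5_prefix(texts, role="query"):
    """Prepend the 'query: ' or 'passage: ' marker that E5 models expect."""
    if role not in ("query", "passage"):
        raise ValueError(f"role must be 'query' or 'passage', got {role!r}")
    return [f"{role}: {text}" for text in texts]

# Symmetric use: both sides get the "query: " marker.
print(with_e5_prefix(["Berlin ist die Hauptstadt Deutschlands."]))

# Asymmetric alternative: candidates get the "passage: " marker instead.
print(with_e5_prefix(["Berlin is the capital of Germany."], role="passage"))
```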

In [18]:
# Load E5 Model and Encode

from sentence_transformers import SentenceTransformer
import numpy as np

# Load improved multilingual model
e5 = SentenceTransformer("intfloat/multilingual-e5-large")

def encode_e5(texts):
    # E5 expects a "query: " or "passage: " prefix on every input.
    # The "query: " prefix is used for both queries and candidates here.
    return e5.encode(["query: " + t for t in texts], convert_to_numpy=True)


In [19]:
# Compute English Candidate Embeddings (once)

english_texts = [r['text'] for r in articles if r['lang'] == 'en'][:1000]
english_embed_e5 = encode_e5(english_texts)


In [20]:
# Evaluate Recall@5 and Recall@10

languages = ['ar', 'de', 'el', 'fr', 'zh']
e5_results = {}

for lang in languages:
    print(f"\nEvaluating {lang}-to-English using E5...")

    # First 1000 docs for this language
    query_texts = [r['text'] for r in articles if r['lang'] == lang][:1000]

    # Encode queries
    qembed = encode_e5(query_texts)

    # Compute similarity via dot product, as in the LaBSE evaluation above
    sim = qembed @ english_embed_e5.T

    # Compute R@5 and R@10
    r5  = recall_at_k(sim, 5)
    r10 = recall_at_k(sim, 10)

    e5_results[lang] = (r5, r10)

    print(f"Recall@5 : {r5:.3f}")
    print(f"Recall@10: {r10:.3f}")



Evaluating ar-to-English using E5...
Recall@5 : 0.960
Recall@10: 0.975

Evaluating de-to-English using E5...
Recall@5 : 0.982
Recall@10: 0.983

Evaluating el-to-English using E5...
Recall@5 : 0.974
Recall@10: 0.980

Evaluating fr-to-English using E5...
Recall@5 : 0.963
Recall@10: 0.972

Evaluating zh-to-English using E5...
Recall@5 : 0.915
Recall@10: 0.944


**Summary of Improvement**

I replaced the LaBSE model with a newer multilingual embedding model, multilingual-e5-large.
This model is optimized for retrieval tasks, whereas LaBSE was trained primarily on translation pairs.

**Why This Helps**

A likely reason for the Chinese improvement is that E5 is trained with contrastive objectives on large, diverse text-pair corpora; this explanation was not tested in this notebook.

German, French, and Greek remain strong and also improve, consistent with E5's training on roughly 100 languages.

Arabic improves as well (R@5 rises from 0.945 to 0.960).

**Outcome**

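The observed results for R@5 and R@10 (R@1 was not recomputed for E5) are:

| Language | LaBSE R@5 | E5 R@5 | LaBSE R@10 | E5 R@10 |
|----------|-----------|--------|------------|---------|
| ar       | 0.945     | 0.960  | 0.959      | 0.975   |
| de       | 0.949     | 0.982  | 0.962      | 0.983   |
| el       | 0.948     | 0.974  | 0.957      | 0.980   |
| fr       | 0.929     | 0.963  | 0.944      | 0.972   |
| zh       | 0.831     | 0.915  | 0.876      | 0.944   |

Chinese, the weakest language under LaBSE, improves the most: R@5 rises from 0.831 to 0.915 and R@10 from 0.876 to 0.944.

Every other language also improves on both metrics.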

Therefore, the new retrieval method satisfies the requirement:
it improves at least one language without degrading the others.
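
The requirement can also be checked mechanically. The sketch below copies the Recall@5 and Recall@10 values printed by the two evaluation cells into dictionaries and tests that no score drops and at least one rises. The function name `satisfies_requirement` is hypothetical, and the check covers only the two metrics measured for both models.

```python
# (Recall@5, Recall@10) per language, copied from the cell outputs above.
labse_scores = {
    "ar": (0.945, 0.959), "de": (0.949, 0.962), "el": (0.948, 0.957),
    "fr": (0.929, 0.944), "zh": (0.831, 0.876),
}
e5_scores = {
    "ar": (0.960, 0.975), "de": (0.982, 0.983), "el": (0.974, 0.980),
    "fr": (0.963, 0.972), "zh": (0.915, 0.944),
}

def satisfies_requirement(baseline, candidate):
    """True if no score drops and at least one score rises."""
    deltas = [
        c - b
        for lang in baseline
        for b, c in zip(baseline[lang], candidate[lang])
    ]
    return min(deltas) >= 0 and max(deltas) > 0

for lang in labse_scores:
    (b5, b10), (c5, c10) = labse_scores[lang], e5_scores[lang]
    print(f"{lang}: R@5 {b5:.3f} -> {c5:.3f} ({c5 - b5:+.3f}), "
          f"R@10 {b10:.3f} -> {c10:.3f} ({c10 - b10:+.3f})")

print(satisfies_requirement(labse_scores, e5_scores))  # True
```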