<a href="https://colab.research.google.com/github/tomonari-masada/course2025-nlp/blob/main/02_retrieval_with_msmarco.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Information Retrieval Practice


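* That said, experiments that search over millions of documents take a very long time, so...

* We use the MS MARCO v1.1 dataset.
  * The task is to pick the passage relevant to the query from among 10 passages.

* Set the runtime type to GPU beforehand.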
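To make the selection task concrete, here is a minimal sketch on a single record. The record below is made up (three passages instead of ten) and only mimics the `query` / `passages` layout used later in this notebook; the `scores` array is a hypothetical model output, not a real result.

```python
import numpy as np

# Hypothetical toy record that mimics the layout of one MS MARCO v1.1 example.
example = {
    "query": "what is tf-idf",
    "passages": {
        "is_selected": [0, 1, 0],
        "passage_text": [
            "Bananas are rich in potassium.",
            "TF-IDF weights a term by its frequency and rarity.",
            "The capital of France is Paris.",
        ],
    },
}

# Made-up relevance scores from some retrieval model, one per passage.
scores = np.array([0.1, 0.7, 0.2])

top = int(np.argmax(scores))                   # index of the top-ranked passage
hit = example["passages"]["is_selected"][top]  # 1 if that passage is labeled relevant
print(top, hit)
```

Averaging `hit` over all queries gives the Prec@1 value reported by the evaluation loops in this notebook.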

In [None]:
from tqdm.auto import tqdm
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

## Loading the dataset

In [None]:
ds = load_dataset("microsoft/ms_marco", "v1.1")

In [None]:
ds

In [None]:
ds["train"][0]

## Sparse retrieval

### Preparing the data

In [None]:
# Passage texts of the first 10 training examples (one list per query).
[p["passage_text"] for p in ds["train"][:10]["passages"]]

In [None]:
# Flatten the passages of every training example into one list for fitting TF-IDF.
corpus = []
for i in tqdm(range(len(ds["train"]))):
  corpus.extend(ds["train"][i]['passages']['passage_text'])

In [None]:
len(corpus)

### (0) `sklearn.feature_extraction.text.TfidfVectorizer`

* The `TfidfVectorizer` parameters are chosen somewhat arbitrarily.

In [None]:
vectorizer = TfidfVectorizer(min_df=0.001, max_df=0.2, stop_words="english")
vectorizer.fit(corpus)

* Check the vocabulary size.

In [None]:
len(vectorizer.get_feature_names_out())

In [None]:
query_embedding = vectorizer.transform([ds["train"][0]['query']]).todense()
query_embedding

In [None]:
passage_text_embeddings = vectorizer.transform(ds["train"][0]['passages']['passage_text']).todense()
passage_text_embeddings

In [None]:
relevance_scores = ds["train"][0]['passages']['is_selected']
relevance_scores

In [None]:
relevance_scores[np.argmax(query_embedding @ passage_text_embeddings.T)]

In [None]:
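# Evaluate Prec@1 over the training split: for each query, rank its candidate
# passages by TF-IDF dot product and check whether the top one is labeled relevant.
num_correct = 0
for i in range(len(ds["train"])):
  query_embedding = vectorizer.transform([ds["train"][i]['query']])
  passage_text_embeddings = vectorizer.transform(ds["train"][i]['passages']['passage_text'])
  relevance_scores = ds["train"][i]['passages']['is_selected']
  # Convert the sparse 1 x n score matrix to a dense array before taking argmax.
  scores = (query_embedding @ passage_text_embeddings.T).toarray()
  num_correct += relevance_scores[np.argmax(scores)]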
  if (i + 1) % 100 == 0:
    print(f"{i + 1} queries processed, Prec@1: {num_correct / (i + 1):.4f}")

## Dense retrieval

### (1) `google-bert/bert-large-uncased`

In [None]:
model = SentenceTransformer("google-bert/bert-large-uncased")

In [None]:
query_embedding = model.encode([ds["train"][0]['query']])
query_embedding

In [None]:
query_embedding.shape

In [None]:
passage_text_embeddings = model.encode(ds["train"][0]['passages']['passage_text'])

In [None]:
passage_text_embeddings.shape

In [None]:
relevance_scores = ds["train"][0]['passages']['is_selected']
relevance_scores

In [None]:
relevance_scores[np.argmax(query_embedding @ passage_text_embeddings.T)]

In [None]:
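# Same Prec@1 evaluation as for TF-IDF, but scoring passages by the dot product
# of dense embeddings. This encodes every query and passage, so it is slow without a GPU.
num_correct = 0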
for i in range(len(ds["train"])):
  query_embedding = model.encode([ds["train"][i]['query']])
  passage_text_embeddings = model.encode(ds["train"][i]['passages']['passage_text'])
  relevance_scores = ds["train"][i]['passages']['is_selected']
  num_correct += relevance_scores[np.argmax(query_embedding @ passage_text_embeddings.T)]
  if (i + 1) % 100 == 0:
    print(f"{i + 1} queries processed, Prec@1: {num_correct / (i + 1):.4f}")
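The scores above are raw dot products, which depend on vector length as well as direction. The sketch below, using made-up 2-D vectors, shows how the top-ranked passage can differ between dot product and cosine similarity; `cosine_scores` is a hypothetical helper written for this illustration.

```python
import numpy as np

def cosine_scores(query_vec, passage_mat):
    """Cosine similarity between one query vector and each row of passage_mat."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_mat / np.linalg.norm(passage_mat, axis=1, keepdims=True)
    return p @ q

# Made-up embeddings, for illustration only.
query = np.array([1.0, 0.0])
passages = np.array([
    [3.0, 3.0],  # long vector, 45 degrees away from the query
    [0.9, 0.1],  # short vector, nearly parallel to the query
])

dot = passages @ query
cos = cosine_scores(query, passages)
print(np.argmax(dot), np.argmax(cos))
```

With sentence-transformers, passing `normalize_embeddings=True` to `model.encode` returns unit-length vectors, so the dot product then equals cosine similarity. Which scoring works better depends on how the model was trained, so it may be worth trying both.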

### (2) `ibm-granite/granite-embedding-125m-english`
* An embedding model from IBM
  * https://arxiv.org/abs/2508.21085

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ibm-granite/granite-embedding-125m-english")

In [None]:
query_embedding = model.encode([ds["train"][0]['query']])
query_embedding

In [None]:
query_embedding.shape

In [None]:
# Prec@1 evaluation with the Granite embeddings, scored by dot product as before.
num_correct = 0
for i in range(len(ds["train"])):
  query_embedding = model.encode([ds["train"][i]['query']])
  passage_text_embeddings = model.encode(ds["train"][i]['passages']['passage_text'])
  relevance_scores = ds["train"][i]['passages']['is_selected']
  num_correct += relevance_scores[np.argmax(query_embedding @ passage_text_embeddings.T)]
  if (i + 1) % 100 == 0:
    print(f"{i + 1} queries processed, Prec@1: {num_correct / (i + 1):.4f}")

# Exercise
* Pick an embedding model from the leaderboard below and run the same evaluation.
* What kinds of models achieve high accuracy (Prec@1)?

* MTEB Leaderboard
  * https://huggingface.co/spaces/mteb/leaderboard
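For the exercise, it may help to wrap the evaluation loop in a function that accepts any encoder. The following is one possible sketch, not the only way to structure it: `prec_at_1`, `toy_encode`, `KEYWORDS`, and `toy_examples` are hypothetical names introduced here, and the toy encoder merely stands in for a real model so that the block runs without downloading anything.

```python
import numpy as np

def prec_at_1(encode, examples):
    """Fraction of examples whose top-scored passage is labeled relevant.

    `encode` is any callable that maps a list of strings to a 2-D array
    (one row per string), e.g. `model.encode` of a SentenceTransformer.
    """
    num_correct = 0
    num_total = 0
    for ex in examples:
        q = np.asarray(encode([ex["query"]]))
        p = np.asarray(encode(ex["passages"]["passage_text"]))
        scores = (q @ p.T).ravel()  # dot-product score for each passage
        num_correct += ex["passages"]["is_selected"][int(np.argmax(scores))]
        num_total += 1
    return num_correct / num_total

# Hypothetical stand-in encoder: counts of a few hand-picked keywords.
KEYWORDS = ["cat", "dog", "bird"]

def toy_encode(texts):
    return np.array(
        [[t.lower().split().count(k) for k in KEYWORDS] for t in texts],
        dtype=float,
    )

# Made-up examples in the same layout as the MS MARCO records.
toy_examples = [
    {"query": "cat food",
     "passages": {"is_selected": [0, 1],
                  "passage_text": ["dog toys", "cat food brands"]}},
    {"query": "bird song",
     "passages": {"is_selected": [1, 0],
                  "passage_text": ["dog barking", "bird song at dawn"]}},
]

print(prec_at_1(toy_encode, toy_examples))
```

With a real model, a call such as `prec_at_1(model.encode, ds["train"].select(range(1000)))` would evaluate on the first 1,000 queries only, which keeps the turnaround short when comparing several models.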