# Playground with basic experimental setup (not using modules)

In this notebook, I took the workflow of `basic_setup.ipynb` 
and created a small mock corpus to figure out why the performance 
of the "plain vanilla" method is so (incredibly!) bad.

The idea: given a small corpus whose documents cover very different topics, and a query about one of those topics, the search should clearly score the relevant document highest. (It does not!)

Things I tried:
- Different models (BERT-like models: mini-bert, bert-base, roberta, deberta, distilbert; e5-small and -base; gte-small and -base; and some more)
- Different indices (Flat, HNSW, LSH, IVF, ...)
- Encoding: the embedding vector is either the [CLS] token or the average pool of the last hidden layer
- Text preprocessing: removing / substituting things like numbers, abbreviations, ...
- Text splitting: documents split into overlapping "sentences" that are embedded individually
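
The text-splitting variant from the last bullet can be sketched roughly as follows. This is a minimal sketch, not the splitter used in the experiments: the function name `split_overlapping`, the naive sentence regex, and the `window` / `stride` parameters are all assumptions for illustration.

```python
import re

def split_overlapping(text, window=3, stride=2):
    """Split text into chunks of `window` sentences, moving `stride` sentences per step."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace (assumption)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    for start in range(0, len(sentences), stride):
        chunks.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # the last chunk already reaches the end of the document
    return chunks

doc = "A one. B two. C three. D four. E five."
print(split_overlapping(doc))
```

With `stride < window`, neighbouring chunks share sentences, so each chunk keeps some context from the previous one.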


I tried these things on the mini test corpus and on the subset of the TIRA dataset that contains only relevant documents.
(Experiments on the full corpus were only run now and then because embedding the full corpus takes a while.)


What I noticed:
- The output scores of the search are not well distributed, but I guess that for similarity search the distribution of scores doesn't matter as much as the rank order.
- The rank order is also not consistently good; some documents seem to be preferred -> embeddings closer to the center??
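
The suspicion in the last bullet (some documents are preferred because their embeddings sit closer to the center) can be probed with a quick diagnostic. The following is a minimal sketch, assuming the embeddings are available as a NumPy array; `mean_pairwise_cosine` and the synthetic data are hypothetical and only illustrate the idea. If random pairs of embeddings have a high average cosine similarity, all vectors share a dominant common direction, and subtracting the mean vector removes it.

```python
import numpy as np

def mean_pairwise_cosine(emb):
    """Average cosine similarity over all distinct pairs of rows."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(emb)
    return (sims.sum() - n) / (n * (n - 1))  # drop the diagonal (self-similarity = 1)

rng = np.random.default_rng(0)
# Synthetic stand-in for document embeddings: random vectors plus a large shared offset
emb = rng.normal(size=(200, 32)) + 5.0

before = mean_pairwise_cosine(emb)
after = mean_pairwise_cosine(emb - emb.mean(axis=0, keepdims=True))
print(f"before centering: {before:.3f}, after centering: {after:.3f}")
```

Running the same function on the real document embeddings would show whether centering is worth trying before indexing.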



Still, the performance is very bad and stays bad. Further experiments will include fine-tuning the model and adding a reranker.


-----

Documents: of the 2300 that appear in the qrels, about 350 have no abstract.

About 1700 documents are between 50 and 350 tokens long (after full preprocessing).

The minimum is 5 tokens and the maximum is over 5000 tokens.

## Setup

- install libraries (if necessary)
- all imports here
- connecting to tira & printoptions etc.

In [1]:
#!pip install transformers faiss-gpu faiss-cpu torch
#!pip install tira ir-datasets python-terrier
#!pip install sentence-transformers

In [1]:
import os
import time
import json
import re
import importlib
import random
import glob

import numpy as np
import pandas as pd
import torch
import pyterrier as pt
import faiss

# Encoder and Tokenizer models
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer

# Tira and Pyterrier Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.third_party_integrations import ir_datasets
from tira.rest_api_client import Client

In [2]:
# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()

# Print options for pandas
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
pd.set_option("display.precision", 4)
pd.set_option("display.max_rows", None)
pd.set_option('display.float_format', '{:.5f}'.format)

# Use GPU if available
if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
print(f"device: {device}")

# TODO: set seed!!

PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


device: cpu
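
For the open `TODO: set seed!!` in the setup cell, a seeding helper could look like the sketch below. The name `set_seed` and the default seed are assumptions; torch is only seeded if it can be imported.

```python
import random
import numpy as np

def set_seed(seed=42):
    """Seed Python, NumPy and (if installed) PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # torch not installed: only Python and NumPy are seeded

set_seed(0)
first = np.random.rand(3)
set_seed(0)
second = np.random.rand(3)
print(np.array_equal(first, second))
```

This matters here because the corpus subset is sampled with `np.random.choice`, so runs are only comparable when the seed is fixed.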


## The Dataset

### instantiate the dataset

In [3]:
DATA_PATH = "."
CORPUS_PATH = os.path.join(DATA_PATH, "dataset_corpus.json")

# Load the dataset
if os.path.exists(CORPUS_PATH):
    with open(CORPUS_PATH, "r") as f:
        corpus = json.load(f)
else:
    dataset = ir_datasets.load("ir-lab-sose-2024/ir-acl-anthology-20240504-training")
    corpus = dataset.docs_store().docs
    with open(CORPUS_PATH, "w") as f:
        json.dump(obj=corpus, fp=f, indent=2, ensure_ascii=False)
    del dataset # Free space? or is this unnecessary??

print(f"{len(corpus)} documents.")

126958 documents.


In [None]:
# corpus is originally a dict: {"docno": ["docno", "text"], }
#               now like this: {"docno": "text", ...}  #  easier to handle.

dict_corpus = {v[0]: v[1] for v in corpus.values()}

#### create subset of corpus - only documents that appear in qrels

In [26]:
# Test corpus of only relevant document (+ a few nonrelevant)
dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')

relevant_docnos = dataset.get_qrels()["docno"].unique()

# some random choice of non-relevant docs added to corpus subset
nonrelevant_docnos = list(corpus.keys() - set(relevant_docnos))

nonrelevant_docnos = np.random.choice(nonrelevant_docnos, size=0)
relevant_docnos = list(relevant_docnos) + list(nonrelevant_docnos)

corpus = {k: corpus[k] for k in relevant_docnos}
print(f"{len(corpus)} relevant documents. (relevant to dev-set)")

#### batch the corpus

In [5]:
def batch_corpus(corpus, batch_size):
    corpus_keys = list(corpus.keys())
    for anker in range(0, len(corpus), batch_size):
        batch_keys = corpus_keys[anker:anker+batch_size]
        yield {k: corpus[k] for k in batch_keys}

### preprocessing

#### cleaning up

In [7]:
# Most common abbreviations in corpus and other small things to substitute
abbrevations = {
    "e.g.": "for example",
    "E.g.": "for example",
    "U.S.": "united states",
    "w.r.t.": "with respect to",
    "i.e.": "that is",
    "i.i.d.": "independent and identically distributed",
    "i.i.": "independent and identically",
    "v.s.": "versus", "vs.": "versus",
    "etc.": "and so on", # TODO: would "et cetera" be better, or is that too exotic?
    "1st": "first", "2nd": "second", "3rd": "third", "4th": "fourth", "5th": "fifth",
    "e2e": "end-to-end",
    "E2E": "end-to-end",
    "iii)": "", "ii)": "", "i)": "", "iv)": "", "v)": "",
    "?": ".", "!": ".",
    "a)": "", "b)": "", "c)": "", "d)": "", "e)": ""
}

# Very common letter-number-combinations that will not be substituted
letter_number_exceptions = ["L2","F1","L1","F2","seq2seq","Seq2Seq","word2vec","Word2Vec","2D"]

def preprocess_text(text, lower=False, years=False, percentages=False, numbers=False, 
                        letter_numbers=False, abbrev=False, special_characters=False):
    # The order of the steps matters!
    if lower:
        text = text.lower()
    if years:
        text = re.sub(r'\b(19|20)\d{2}\b', 'YEAR', text)
    if percentages:
        text = re.sub(r'\b\d{1,3}(?:,\d{3})*(?:\.\d+)?%', "PERCENTAGE", text)

    # all remaining numbers
    if numbers:
        text = re.sub(r'\b\d{1,3}(?:,\d{3})*(?:\.\d+)?\b', 'NUMBER', text)

    # Remove words that are combinations of letters and numbers 
    # (except L2, F1, word2vec, ... common in corpus and probably important for context)
    if letter_numbers:
        #pattern = r'\b(?!(L2|F1|L1|F2|seq2seq|word2vec|Seq2Seq|Word2Vec|2D)\b)\w*\d+\w*\b'
        pattern = rf'\b(?!({"|".join(letter_number_exceptions)})\b)\w*\d+\w*\b'
        # If the text was lowercased above, match the exceptions case-insensitively
        text = re.sub(pattern, '', text, flags=re.IGNORECASE if lower else 0)

    # Substitute most common abbreviations
    if abbrev:
        for abbrevation, substitution in abbrevations.items():
            text = text.replace(abbrevation, substitution)

    # Remove all characters that are not normal text
    if special_characters:
        text = re.sub(r'[^a-zA-Z0-9\s\-\.\,]', '', text)

    # Put a period after the paper title; if BERT is used, maybe put a [SEP] token after the title instead?
    #text = re.sub(r'\n\n', ". ", text)
    if len(text.split("\n\n")) < 2:
        text += "."
    else:
        text = re.sub(r'\n\n', ". ", text)

    # Replace consecutive whitespace with a single blank.
    text = re.sub(r'\s+', ' ', text).strip()

    return text
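
One interaction in `preprocess_text` is worth checking in isolation: lowercasing happens before the letter-number filter, so mixed-case entries such as `F1` in the exception list only protect `f1` if the match is case-insensitive. A minimal sketch of that effect, using a shortened, hypothetical exception list and a hypothetical helper `strip_letter_numbers`:

```python
import re

EXCEPTIONS = ["L2", "F1", "word2vec", "2D"]  # shortened list, for illustration only

def strip_letter_numbers(text, ignore_case=False):
    """Remove tokens containing digits unless they are listed in EXCEPTIONS."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = rf'\b(?!(?:{"|".join(EXCEPTIONS)})\b)\w*\d+\w*\b'
    text = re.sub(pattern, "", text, flags=flags)
    return re.sub(r"\s+", " ", text).strip()

text = "we report f1 on wmt14 using word2vec"  # already lowercased
print(strip_letter_numbers(text))                    # "f1" is removed
print(strip_letter_numbers(text, ignore_case=True))  # "f1" is kept
```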

## The Model / The Retrieval System

### the model

In [3]:
model_name_to_type_map = { # map for models that do not use AutoModel
    "paraphrase-MiniLM-L6-v2": [SentenceTransformer, None],
}

def load_model(name, tokenizer_name=""):
    if name in model_name_to_type_map:
        model_class, tokenizer_class = model_name_to_type_map[name]
    else:
        model_class, tokenizer_class = AutoModel, AutoTokenizer

    if model_class is SentenceTransformer:
        # SentenceTransformer models are loaded through the constructor
        model = model_class(name)
    else:
        model = model_class.from_pretrained(name)

    if tokenizer_class is None:
        tokenizer = None  # SentenceTransformer tokenizes internally
    elif tokenizer_name:
        tokenizer = tokenizer_class.from_pretrained(tokenizer_name)
    else:
        tokenizer = tokenizer_class.from_pretrained(name)
    return model, tokenizer

In [4]:
# Load the model (TinyBERT or another) 

#model_name = "models/bert-tiny-pt-mlm"
model_name = "prajjwal1/bert-mini"
#model_name = "microsoft/deberta-base" # CAUTION: tokenizer error! FIXME
#model_name = 'intfloat/e5-base-v2' # add "query: " before queries and "passage: " before passages!
#model_name = "thenlper/gte-small"
#model_name = "thenlper/gte-base"  # best
#model_name = "olm/olm-roberta-base-dec-2022"  ## not so good
#model_name = 'allenai/specter' # Use with average=False and FlatIP instead of FlatL2! Specialized in scientific papers
#model_name = "sap-ai-research/BERT-Large-Contrastive-Self-Supervised-ACL2020"

model, tokenizer = load_model(model_name)
model = model.to(device)

### the embedding (of document corpus)

In [5]:
def average_pool(last_hidden_states, attention_mask):
    """ Calculates average pooling of hidden states (with attention mask) """
    # mask paddings with 0 -> ignore in average calculation
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0) 
    #last_hidden = last_hidden_states # without using mask (is this even worth considering?)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


def encode(model, tokenizer, texts, max_length=512, avg_pool=False): # avg. doc length = 144 (after preprocessing only those with abstract.)
    """ Encode texts with model """
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
    inputs.to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        last_hidden_states = outputs.hidden_states[-1]
        if avg_pool: 
            return average_pool(last_hidden_states, inputs["attention_mask"])
        else: # [CLS] embeddings
            return last_hidden_states[:,0,:]
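
To see why `average_pool` zeroes out padding positions before averaging, here is the same computation on a toy batch in plain NumPy. This is only an illustrative sketch: `masked_mean` and the numbers are hypothetical, with the third token standing in for a padding position.

```python
import numpy as np

def masked_mean(hidden, mask):
    """Mean over the token axis, counting only positions where mask == 1."""
    m = mask[..., None].astype(hidden.dtype)       # [batch, tokens, 1]
    return (hidden * m).sum(axis=1) / m.sum(axis=1)

# One sequence, three tokens, hidden size 2; the last token is padding
hidden = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

masked = masked_mean(hidden, mask)   # averages only the two real tokens
unmasked = hidden.mean(axis=1)       # padding pulls the mean far away
print(masked, unmasked)
```

Without the mask, the embedding of a short document would depend on how long the other documents in its batch are.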


In [18]:
def encode_documents(corpus, model, tokenizer, batch_size, avg_pool=False, 
                     normalize=False, preprocess=False, **preprocess_params):

    embeddings = None  # will be np.array of shape [num_docs, embedding_size]
    docnos = []        # for embedding-vector index to docno translation

    n_batches = -(-len(corpus) // batch_size)  # ceiling division
    for j, batch in enumerate(batch_corpus(corpus, batch_size=batch_size)):
        print(f"\rBatch {j+1:3d}/{n_batches} ", end="")

        # corpus must be dict_corpus! {"docno1": "text1", ...}
        docnos += list(batch.keys())
        texts = list(batch.values())

        if preprocess:
            texts = [preprocess_text(t, **preprocess_params) for t in texts]

        #if "e5" in model_name.lower(): # in preprocess params?
        #    texts = ["passage: "+t for t in texts]
        
        batch_embeddings = encode(model=model, tokenizer=tokenizer, texts=texts, avg_pool=avg_pool)

        if embeddings is None:
            embeddings = batch_embeddings
        else:
            embeddings = torch.cat([embeddings, batch_embeddings], dim=0)

    if normalize:
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)

    return docnos, embeddings # TODO: yield docnos, embeddings!? -> more memory-efficient?
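
The commented-out e5 branch in `encode_documents` reflects the note next to the e5 model name: queries get a `"query: "` prefix and documents a `"passage: "` prefix. A small helper for this could look like the sketch below; the name `add_e5_prefix` and its signature are assumptions, not part of the original notebook.

```python
def add_e5_prefix(texts, is_query, model_name):
    """Prepend the e5 prefix to each text; leave texts unchanged for other models."""
    if "e5" not in model_name.lower():
        return list(texts)
    prefix = "query: " if is_query else "passage: "
    return [prefix + t for t in texts]

print(add_e5_prefix(["dense retrieval"], is_query=True, model_name="intfloat/e5-base-v2"))
print(add_e5_prefix(["Some abstract."], is_query=False, model_name="intfloat/e5-base-v2"))
```

Keeping both cases in one helper makes it harder to prefix the documents but forget the queries, or the other way round.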


In [19]:
# Encode the document corpus
batch_size = 500
avg_pool   = False
normalize  = False
preprocess = False
preprocess_params = {
    "lower": True,
    "numbers": True,
    "letter_numbers": True,
    "abbrev": True,
    "special_characters": True,
}

docnos, embeddings = encode_documents(corpus, model, tokenizer, batch_size, normalize=normalize,
                                      avg_pool=avg_pool, preprocess=preprocess, **preprocess_params)
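# Convert to a CPU NumPy array: np.isnan, np.save and FAISS all expect NumPy input
embeddings = embeddings.cpu().numpy()
print("embeddings shape:", embeddings.shape)

embedding_size = embeddings.shape[1]
if np.isnan(embeddings).any():
    print("WARNING: NaN values found in the embeddings!")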

Batch   1/254 


In [None]:
with open("encoded_corpus/bert-tiny-ft-mlm-ep3-docnos.txt", "w") as f:
    json.dump(docnos, f, ensure_ascii=False)
with open("encoded_corpus/bert-tiny-ft-mlm-ep3-embeddings.npy", "wb") as f:
    np.save(f, embeddings)

In [6]:
with open("encoded_corpus/bert-tiny-ft-mlm-ep3-docnos.txt", "r") as f:
    docnos = json.load(f)
with open("encoded_corpus/bert-tiny-ft-mlm-ep3-embeddings.npy", "rb") as f:
    embeddings = np.load(f)

### the index (faiss)

In [None]:
# Create a FAISS index
embedding_size = embeddings.shape[1]

index_name = "IVF"
metric = "IP"
index_params = {
    "nlist": 500,  # IVF: number of clusters
    #"n_bits": 2*embedding_size,  # LSH: number of hash bits
}

# Build the index factory string, e.g. "Flat", "LSH", "IVF500,Flat" or "HNSW16"
if index_name == "IVF":
    index_string = f"IVF{index_params.get('nlist', 100)},Flat"
elif index_name == "HNSW":
    index_string = f"HNSW{index_params.get('M', 16)}"
else:
    index_string = index_name

index = faiss.index_factory(embedding_size, index_string, faiss.METRIC_INNER_PRODUCT if metric == "IP" else faiss.METRIC_L2)

# additional parameters
if "efConstruction" in index_params and hasattr(index, 'hnsw'):
    index.hnsw.efConstruction = index_params['efConstruction']
if "efSearch" in index_params and hasattr(index, 'hnsw'):
    index.hnsw.efSearch = index_params['efSearch']



In [None]:
# Add the Embeddings to the index

# Normalize embeddings if using inner product similarity
if metric == "IP":
    faiss.normalize_L2(embeddings)

# Train the index if necessary (IVF indices must be trained before vectors are added)
if not index.is_trained:
    index.train(embeddings)

index.add(embeddings)
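
Because approximate indices add another possible source of error, an exact brute-force search is a useful reference to compare the FAISS results against. The following is a minimal NumPy sketch, assuming cosine scoring (inner product on L2-normalized vectors); `exact_top_k` and the toy vectors are hypothetical.

```python
import numpy as np

def exact_top_k(query, doc_embeddings, k=10):
    """Return indices and cosine scores of the k best documents, best first."""
    q = query / np.linalg.norm(query)
    d = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    scores = d @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ids, scores = exact_top_k(np.array([2.0, 0.1]), docs, k=2)
print(ids, scores)
```

If the exact ranking is just as poor as the indexed one, the index type is not the bottleneck and the embeddings themselves are the place to look.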


In [25]:
INDEX_DIR = "./indexe"
index_file = "ivf_10000_IP-bert-tiny-ft-mlm-ep3.index"
faiss.write_index(index, os.path.join(INDEX_DIR, index_file))

In [10]:
INDEX_DIR = "./indexe"
index_file = "ivf_10000_IP-bert-tiny-ft-mlm-ep3.index"
index = faiss.read_index(os.path.join(INDEX_DIR, index_file))

## The Retrieval

In [11]:
# Now with the dataset queries
dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')

In [18]:
# Run through queries and search the relevant documents

#name = "vanilla_mini-bert"  # name of this run
#name = "mini-bert_with-tokens"
run_name = "tinybert_finetuned-ivf10000_flatip"

run = []
for i, row in enumerate(dataset.get_topics(variant="description").to_dict(orient="records")):
    query = row["query"]
    print(query)

    if preprocess:
        query = preprocess_text(query, **preprocess_params)

    # Encode the query
    query_embedding = encode(model, tokenizer, [query], avg_pool=avg_pool).cpu().numpy() # TODO: gpu variant
    query_embedding = query_embedding.astype(np.float32)  # FAISS expects float32 input
    if metric == "IP":  # mirror the document side, which is only normalized for inner product
        faiss.normalize_L2(query_embedding)

    # Search in the Index
    scores, candidates = index.search(query_embedding, k=10)

    if metric == "L2":
        # NOTE: distance2score is not defined in this notebook; it has to map
        # smaller distances to higher scores for the descending sort below
        scores = distance2score(scores)

    # Results should already be sorted; sort again just to be safe:
    results = sorted(list(zip(scores[0], candidates[0])), key=lambda x: x[0], reverse=True)

    for j, (score, candidate) in enumerate(results):
        run.append({ "qid": row["qid"], "docno": docnos[candidate],
                     "rank": j+1, "score": score, "name": run_name})

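The retrieval loop calls `distance2score` for the L2 metric, but that helper is not defined in this section. One possible definition, shown purely as an assumption, maps smaller distances to larger scores so that sorting in descending order still puts the nearest document first:

```python
import numpy as np

def distance2score(distances):
    """Hypothetical helper: map L2 distances (>= 0) to scores in (0, 1]."""
    return 1.0 / (1.0 + np.asarray(distances, dtype=np.float32))

toy_scores = distance2score([[0.0, 1.0, 3.0]])
print(toy_scores)
```

Any strictly decreasing function would give the same ranking; only the score values written to the run file would differ.
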
In [19]:
RUN_DIR = "./runs"
os.makedirs(RUN_DIR, exist_ok=True)
runfile = os.path.join(RUN_DIR, run_name+"_run.txt")

with open(runfile, "w") as f:
    for item in run:
        # writes the same fields that persist_and_normalize_run() writes
        f.write(f"{item['qid']} 0 {item['docno']} {item['rank']} {item['score']} {run_name}\n") 

## Evaluation

In [14]:
# Some baselines that were executed in TIRA
bm25_baseline = tira.pt.from_submission('ir-benchmarks/tira-ir-starter/BM25 (tira-ir-starter-pyterrier)', dataset)
sparse_cross_encoder = tira.pt.from_submission('ir-benchmarks/fschlatt/sparse-cross-encoder-4-512', dataset)
rank_zephyr = tira.pt.from_submission('workshop-on-open-web-search/fschlatt/rank-zephyr', dataset)

In [23]:

run_files = sorted(list(glob.glob(os.path.join(RUN_DIR, "*.txt"))))
methods = [pt.io.read_results(run_file_path) for run_file_path in run_files]
run_names = [os.path.splitext(os.path.basename(path))[0] for path in run_files]

pt.Experiment(
    [bm25_baseline, sparse_cross_encoder, rank_zephyr] + methods,
    dataset.get_topics(),
    dataset.get_qrels(),
    ["ndcg_cut.10", "recip_rank", "recall_100", "map"],
    names=["BM 25 (Baseline)", "Sparse Cross Encoder", "RankZephyr"] + run_names
)

| | name | ndcg_cut.10 | recip_rank | recall_100 | map |
|---|---|---|---|---|---|
| 0 | BM 25 (Baseline) | 0.37404 | 0.57988 | 0.60133 | 0.26231 |
| 1 | Sparse Cross Encoder | 0.36646 | 0.61298 | 0.60133 | 0.24126 |
| 2 | RankZephyr | 0.34707 | 0.56841 | 0.60133 | 0.26749 |
| 3 | gte_base-prepr_avgpool-ivf_5000_IP_run | 0.07413 | 0.15825 | 0.04237 | 0.02603 |
| 4 | gte_base-prepr_avgpool-ivf_500_IP_run | 0.07396 | 0.15058 | 0.04583 | 0.026 |
| 5 | gte_base-vanilla-ivf_10000_IP_run | 0.10077 | 0.20662 | 0.0617 | 0.04235 |
| 6 | mini-bert_with-tokens_run | 0.06224 | 0.14412 | 0.02821 | 0.01344 |
| 7 | mini_bert-full_preprocessing_run | 0.05907 | 0.15809 | 0.02533 | 0.01325 |
| 8 | tinybert_finetuned-FlatIP_run | 0.01203 | 0.04216 | 0.00608 | 0.00365 |
| 9 | tinybert_finetuned-ivf10000_flatip_run | 0.01203 | 0.04216 | 0.00608 | 0.00365 |
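
When reading these numbers, note that the retrieval loop in this notebook keeps only `k=10` results per query, so for runs produced that way `recall_100` cannot exceed what ten documents can cover. The toy sketch below (hypothetical document ids and helper names) shows how reciprocal rank and recall@k behave on a single short ranked list.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

ranked = ["d7", "d2", "d9"]             # only three results were retrieved
relevant = {"d2", "d4", "d5", "d9"}     # four documents are relevant

print(reciprocal_rank(ranked, relevant))
print(recall_at_k(ranked, relevant, k=100))
```

Here recall@100 is capped at 3/4 regardless of ranking quality, because the result list holds fewer documents than there are relevant ones.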
