## About this Notebook

This notebook is adapted from [Ensemble MPNETV2+BGE+GTE with Faiss-GPU](https://www.kaggle.com/code/medali1992/ensemble-mpnetv2-bge-gte-with-faiss-gpu). The original author reported the individual model scores in their notebook, and I experimented with a few parameters (see the versions below) to raise the LB score from 0.291 to 0.300. Thanks to the original author for sharing the notebook openly. I expect that trying other parameter combinations could improve the score further.

## Individual Model LB Scores
- `BAAI/bge-large-en-v1.5` → `LB=0.257`
- `sentence-transformers/all-mpnet-base-v2` → `LB=0.251`
- `Alibaba-NLP/gte-base-en-v1.5` → `LB=0.281`

## Version 1
The base notebook → `LB=0.291`

## Version 3
`weight1, weight2, weight3 = 0.36, 0.33, 0.32` → `LB=0.292`

## Version 4
`weight1, weight2, weight3 = 0.46, 0.34, 0.3` → `LB=0.300`

## Version 5
`weight1, weight2, weight3 = 0.5, 0.35, 0.29` → `LB=0.300`

## A Mistake
While checking this notebook before publishing, I realized I had mixed up the order of the embeddings in the ensemble, so the weights for BGE and MPNET are swapped between the question-text vectors and the misconception vectors. The score still improved. If you build on this notebook, you can either correct the weight order or leave it as is, as long as it improves the score. I hope this notebook is helpful to you.
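To make the mix-up concrete, here is a minimal sketch using random stand-in vectors rather than real model outputs. The helper `weighted_sum`, the dictionary keys, and the array shapes are all hypothetical and introduced only for illustration. The idea is that keying the weights by model name and applying the same mapping to both sides makes it hard for the two sides to drift apart. Which assignment was originally intended is not stated, so the mapping below simply follows the order used for the question texts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings; the real notebook gets these from the three models.
# Shapes (num_rows, dim) are purely illustrative.
names = ["gte", "bge", "mpnet"]
query_vecs = {n: rng.normal(size=(4, 8)) for n in names}
corpus_vecs = {n: rng.normal(size=(6, 8)) for n in names}

# Version 5 weights, keyed by model name (assumed assignment, see lead-in).
weights = {"gte": 0.5, "bge": 0.35, "mpnet": 0.29}


def weighted_sum(vecs_by_model, weights_by_model):
    """Hypothetical helper: weighted sum of per-model embedding matrices."""
    return sum(weights_by_model[name] * vecs for name, vecs in vecs_by_model.items())


# Same mapping on both sides keeps queries and misconceptions consistent.
ensemble_query = weighted_sum(query_vecs, weights)
ensemble_corpus = weighted_sum(corpus_vecs, weights)

# The swapped variant: BGE and MPNET weights exchanged on the corpus side only.
swapped = {"gte": 0.5, "bge": 0.29, "mpnet": 0.35}
swapped_corpus = weighted_sum(corpus_vecs, swapped)

print(ensemble_query.shape, ensemble_corpus.shape)
print("swap changes the corpus vectors:", not np.allclose(ensemble_corpus, swapped_corpus))
```

Whether the consistent or the swapped order scores better on the LB is an empirical question that this sketch does not answer.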


# Setting

In [None]:
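K = 25    # number of misconceptions retrieved per question-answer pair
VER = 5   # notebook version tag
BS = 16   # batch size used when encoding
D = 1024  # embedding dimension of the FAISS index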


DATA_PATH = "/kaggle/input/eedi-mining-misconceptions-in-mathematics"
BGE_MODEL_PATH = "/kaggle/input/bge-weights-version1/bge_trained_model_version3"
GTE_BASE_MODEL_PATH = "/kaggle/input/mod-gte-base-weights/gte-base-weights/gte-base_trained_model_version2"
MPNETV2_MODEL_PATH = "/kaggle/input/mpnet-weights-version1/mpnetV2_trained_model_version3"

# Install

In [None]:
!pip uninstall -qq -y \
polars

In [None]:
!python -m pip install -qq --no-index --find-links=/kaggle/input/eedi-library-from-sinchiro \
polars\
sentence-transformers\
faiss-gpu

# Import 

In [None]:
import os
import gc

import polars as pl
import numpy as np
from scipy import stats
import torch
import torch.nn.functional as F

import faiss

from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
import sentence_transformers

assert pl.__version__ == "1.7.1"
assert sentence_transformers.__version__ == "3.1.1"

# Data Load

In [None]:
test = pl.read_csv(f"{DATA_PATH}/test.csv")
misconception_mapping = pl.read_csv(f"{DATA_PATH}/misconception_mapping.csv")

# Preprocess

In [None]:
common_col = [
    "QuestionId",
    "ConstructName",
    "SubjectName",
    "QuestionText",
    "CorrectAnswer",
]

test_long = (
    test
    .select(
        pl.col(common_col + [f"Answer{alpha}Text" for alpha in ["A", "B", "C", "D"]])
    )
    .unpivot(
        index=common_col,
        variable_name="AnswerType",
        value_name="AnswerText",
    )
    .with_columns(
        pl.concat_str(
            [
               '<Construct> ' +  pl.col("ConstructName"),
               '<Subject> ' + pl.col("SubjectName"),
               '<Question> '+ pl.col("QuestionText"),
               '<Answer> ' + pl.col("AnswerText"),
            ],
            separator=" ",
        ).alias("AllText"),
        pl.col("AnswerType").str.extract(r"Answer([A-D])Text$").alias("AnswerAlphabet"),
    )
    .with_columns(
        pl.concat_str(
            [pl.col("QuestionId"), pl.col("AnswerAlphabet")], separator="_"
        ).alias("QuestionId_Answer"),
    )
    .sort("QuestionId_Answer")
)
test_long.head()
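For readers less familiar with polars, the text that ends up in `AllText` can be sketched in plain Python. The example row is made up for illustration and `build_all_text` is a hypothetical helper, not part of this notebook.

```python
def build_all_text(construct, subject, question, answer):
    """Hypothetical helper mirroring the concat_str template above."""
    parts = [
        "<Construct> " + construct,
        "<Subject> " + subject,
        "<Question> " + question,
        "<Answer> " + answer,
    ]
    # Same single-space separator as in the polars expression.
    return " ".join(parts)


# Made-up example row, not taken from the competition data.
example = build_all_text(
    construct="Add fractions with different denominators",
    subject="Adding and Subtracting Fractions",
    question="What is 1/2 + 1/3?",
    answer="2/5",
)
print(example)
```

Each question produces four such strings, one per answer option, and the rows for the correct answer are filtered out later when the submission is built.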

# Sentence transformer models

### Utils

In [None]:
def encode_texts(test_long, misconception_mapping, model_path, batch_size=8, progress_bar=True):
    model = SentenceTransformer(model_path, local_files_only=True, trust_remote_code=True)
    model.to(device)
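    # Wrapped in DataParallel, but encode() is called on model.module below,
    # so the wrapper's multi-GPU forward pass is not used during encoding
    model = torch.nn.DataParallel(model)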
    model.eval()
    
    # Encode all text from the test_long DataFrame
    all_text_vec = model.module.encode(test_long["AllText"].to_list(), batch_size=batch_size , normalize_embeddings=True, show_progress_bar=progress_bar)
    
    # Encode misconception names from the misconception_mapping DataFrame
    misconception_mapping_vec = model.module.encode(misconception_mapping["MisconceptionName"].to_list(), batch_size=batch_size, normalize_embeddings=True, show_progress_bar=progress_bar)
    
    torch.cuda.empty_cache()
    gc.collect()
    
    return all_text_vec, misconception_mapping_vec

def search_faiss(k, d, vectors_to_add, query_vectors):
    """
    Perform a FAISS search with L2 distance.
    
    Parameters:
        k (int): Number of nearest neighbors to search for.
        d (int): Dimension of the vectors.
        vectors_to_add (numpy.ndarray): The vectors to add to the FAISS index.
        query_vectors (numpy.ndarray): The vectors to search for the nearest neighbors.
        
    Returns:
        D (numpy.ndarray): The squared L2 distances to the k nearest neighbors.
        I (numpy.ndarray): The indices of the k nearest neighbors.
    """
    # Create the index
    index = faiss.IndexFlatL2(d)
    
    # Add vectors to the index
    index.add(vectors_to_add)
    
    # Search for k nearest neighbors
    D, I = index.search(query_vectors, k)
    
    return D, I

def ensemble_majority_vote(*indices):
    """
    Apply ensembling with majority voting across multiple index arrays.
    
    Parameters:
        indices (numpy.ndarray): Variable number of index arrays to ensemble.
        
    Returns:
        numpy.ndarray: The majority-voted indices.
    """
    # Stack indices along a new axis (shape: (num_searches, num_queries, k))
    stacked_indices = np.stack(indices, axis=0)
    
    # Apply mode to find the majority vote along the first axis (searches)
    majority_indices, _ = stats.mode(stacked_indices, axis=0)
    
    # Remove the extra dimension added by mode and return the majority-voted indices
    return majority_indices.squeeze()

def ensemble_random_choice(*indices):
    """
    Apply ensembling with random choice across multiple index arrays.
    
    Parameters:
        indices (numpy.ndarray): Variable number of index arrays to ensemble.
        
    Returns:
        numpy.ndarray: Randomly selected indices from the given index arrays.
    """
    # Stack indices along a new axis (shape: (num_searches, num_queries, k))
    stacked_indices = np.stack(indices, axis=0)
    
    # Number of searches (i.e., how many index arrays we have)
    num_searches = stacked_indices.shape[0]
    
    # For each query and each neighbor position (k), randomly pick which of the
    # available index arrays the value is taken from
    random_choices = np.random.randint(0, num_searches, size=stacked_indices.shape[1:])
    
    # Use the random choices to pick the corresponding indices
    random_indices = np.choose(random_choices, stacked_indices)
    
    return random_indices
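Since `encode_texts` returns unit-normalized embeddings and `search_faiss` ranks by L2 distance, it may help to see why that ranking matches a cosine-similarity ranking for a single model. The following is a minimal NumPy-only sketch with random stand-in vectors; it does not call FAISS, and `l2_normalize` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)


def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Random stand-ins for misconception (corpus) and question (query) embeddings.
corpus = l2_normalize(rng.normal(size=(50, 16)))
queries = l2_normalize(rng.normal(size=(5, 16)))

# Cosine similarity of unit vectors is just the dot product.
cos_sim = queries @ corpus.T

# Squared L2 distance, computed by brute force.
sq_l2 = ((queries[:, None, :] - corpus[None, :, :]) ** 2).sum(axis=-1)

# For unit vectors: ||q - c||^2 = 2 - 2 * cos(q, c)
identity_holds = np.allclose(sq_l2, 2.0 - 2.0 * cos_sim)

# Smallest distance and largest similarity therefore give the same top-k.
k = 5
top_by_l2 = np.argsort(sq_l2, axis=1)[:, :k]
top_by_cos = np.argsort(-cos_sim, axis=1)[:, :k]

print(identity_holds, np.array_equal(top_by_l2, top_by_cos))
```

Note that a weighted sum of unit vectors is generally not unit length, so this equivalence is only guaranteed for the individual models, not for the ensemble vectors built later unless they are re-normalized.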

## GTE-base Model

In [None]:
gte_base_all_text_vec, gte_misconception_mapping_vec = encode_texts(test_long, misconception_mapping, GTE_BASE_MODEL_PATH, BS)
gte_base_all_text_vec = np.pad(gte_base_all_text_vec, ((0, 0), (0, 256)), mode='constant')
gte_misconception_mapping_vec = np.pad(gte_misconception_mapping_vec, ((0, 0), (0, 256)), mode='constant')
print(gte_base_all_text_vec.shape)
print(gte_misconception_mapping_vec.shape)
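The padding above brings the GTE embeddings up to `D = 1024` so they can be summed with the BGE embeddings. A minimal sketch, assuming 768-dimensional inputs (which is what padding by 256 to reach 1024 implies) and using random stand-in vectors, shows that appending zeros leaves norms and dot products unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for unit-normalized 768-dimensional embeddings (assumed size).
a = rng.normal(size=(3, 768))
a /= np.linalg.norm(a, axis=1, keepdims=True)
b = rng.normal(size=(4, 768))
b /= np.linalg.norm(b, axis=1, keepdims=True)

# Same padding call as in the cell above: 256 zeros appended to each row.
a_pad = np.pad(a, ((0, 0), (0, 256)), mode="constant")
b_pad = np.pad(b, ((0, 0), (0, 256)), mode="constant")

dots_preserved = np.allclose(a @ b.T, a_pad @ b_pad.T)
norms_preserved = np.allclose(np.linalg.norm(a_pad, axis=1), 1.0)

print(a_pad.shape, b_pad.shape)
print(dots_preserved, norms_preserved)
```

So padding by itself does not change a single model's similarity scores; it only makes the array shapes compatible for the weighted sum.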

## MPNETV2 Model

In [None]:
mpnetv2_base_all_text_vec, mpnetv2_misconception_mapping_vec = encode_texts(test_long, misconception_mapping, MPNETV2_MODEL_PATH, BS)
mpnetv2_base_all_text_vec = np.pad(mpnetv2_base_all_text_vec, ((0, 0), (0, 256)), mode='constant')
mpnetv2_misconception_mapping_vec = np.pad(mpnetv2_misconception_mapping_vec, ((0, 0), (0, 256)), mode='constant')
print(mpnetv2_base_all_text_vec.shape)
print(mpnetv2_misconception_mapping_vec.shape)

## BGE Model

In [None]:
bge_base_all_text_vec, bge_misconception_mapping_vec = encode_texts(test_long, misconception_mapping, BGE_MODEL_PATH, BS)
print(bge_base_all_text_vec.shape)
print(bge_misconception_mapping_vec.shape)

## Model ensemble

In [None]:
weight1, weight2, weight3 = 0.5, 0.35, 0.29


# ensemble_text_vecs = np.mean(np.stack([gte_base_all_text_vec, bge_base_all_text_vec, mpnetv2_base_all_text_vec]), axis=0)
# ensemble_misconception_vecs = np.mean(np.stack([gte_misconception_mapping_vec, bge_misconception_mapping_vec, mpnetv2_misconception_mapping_vec]), axis=0)
# _, ensemble_indices = search_faiss(K, D, ensemble_misconception_vecs, ensemble_text_vecs)

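# Question-text side: weight1 -> GTE, weight2 -> BGE, weight3 -> MPNET
ensemble_text_vecs = (weight1 * gte_base_all_text_vec + weight2 * bge_base_all_text_vec + weight3 * mpnetv2_base_all_text_vec)

# Misconception side: weight2 -> MPNET, weight3 -> BGE, i.e. swapped relative to the
# line above (see "A Mistake"). Left unchanged because the reported LB scores used this order.
ensemble_misconception_vecs = (weight1 * gte_misconception_mapping_vec + weight2 * mpnetv2_misconception_mapping_vec + weight3 * bge_misconception_mapping_vec)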

_, ensemble_indices = search_faiss(K, D, ensemble_misconception_vecs, ensemble_text_vecs)
print(ensemble_text_vecs.shape)
print(ensemble_misconception_vecs.shape)
print(ensemble_indices.shape)

del ensemble_text_vecs, ensemble_misconception_vecs, gte_base_all_text_vec, bge_base_all_text_vec, mpnetv2_base_all_text_vec, gte_misconception_mapping_vec, mpnetv2_misconception_mapping_vec, bge_misconception_mapping_vec
_ = gc.collect()

# Make Submit File

In [None]:
submission = (
    test_long.with_columns(
        pl.Series(ensemble_indices[:, :K].tolist()).alias("MisconceptionId")
    )
    .with_columns(
        pl.col("MisconceptionId").map_elements(
            lambda x: " ".join(map(str, x)), return_dtype=pl.String
        )
    ).filter(
        pl.col("CorrectAnswer") != pl.col("AnswerAlphabet")
    ).select(
        pl.col(["QuestionId_Answer", "MisconceptionId"])
    ).sort("QuestionId_Answer")
)

In [None]:
submission.head()

In [None]:
submission.write_csv("submission.csv")
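Before submitting, it can be worth checking that every `MisconceptionId` string has the shape the code above is meant to produce: `K = 25` space-separated integer IDs. The sketch below is a hypothetical plain-Python check, not part of the original notebook. The uniqueness requirement is my own assumption; it is mainly relevant if you switch to `ensemble_majority_vote` or `ensemble_random_choice`, which pick values per neighbor position and can therefore repeat an ID within a row.

```python
def check_prediction_string(value, k=25):
    """Hypothetical check: exactly k space-separated, unique, integer IDs."""
    ids = value.split(" ")
    return len(ids) == k and len(set(ids)) == k and all(i.isdigit() for i in ids)


# Made-up strings for illustration.
good = " ".join(str(i) for i in range(25))
bad = " ".join(["7"] * 25)  # right length, but the same ID repeated

good_ok = check_prediction_string(good)
bad_ok = check_prediction_string(bad)
print(good_ok, bad_ok)
```

In the notebook, the same function could be applied to each value of `submission["MisconceptionId"].to_list()`.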
