# <b><span style='color:#F1A424'>|</span> EEDI: <span style='color:#F1A424'>Gemma 2b-it + RAG </span><span style='color:#ABABAB'> [Inference]</span></b>

***

In this competition, the goal is to develop an NLP model that predicts the affinity between misconceptions and incorrect answers (distractors) in multiple-choice questions. The solution suggests candidate misconceptions for each distractor, making it easier for expert human teachers to tag distractors with misconceptions.

### <b><span style='color:#F1A424'>Table of Contents</span></b> <a class='anchor' id='top'></a>
<div style=" background-color:#3b3745; padding: 13px 13px; border-radius: 8px; color: white">
    <li><a href="#introduction">Introduction</a></li>
    <li><a href="#install_libraries">Install Libraries</a></li>
    <li><a href="#import_libraries">Import Libraries</a></li>
    <li><a href="#configuration">Configuration</a></li>
    <li><a href="#load_data">Load Data</a></li>
    <li><a href="#utils">Utils</a></li>
    <li><a href="#preprocessing">Pre-processing</a></li>
    <li><a href="#model">Model</a></li>
    <li><a href="#dataset">Dataset</a></li>
    <li><a href="#embedding_function">Embedding Function</a></li>
    <li><a href="#create_embeddings_qa">Create Q&amp;A Embeddings</a></li>
    <li><a href="#prompt_gemma">Prompt Gemma</a></li>
    <li><a href="#create_embeddings_misc">Create Misconception Embeddings</a></li>
    <li><a href="#retrieval">Retrieval</a></li>
    <li><a href="#submission">Submission</a></li>
</div>


# <b><span style='color:#F1A424'>|</span> Introduction</b><a class='anchor' id='introduction'></a> [↑](#top)

***

In this notebook, we embed every misconception text in the `misconception_mapping.csv` file with a Hugging Face 🤗 model. We then pass each test question and wrong answer to a Gemma model, together with a similar training example retrieved by embedding similarity, and ask it to predict a misconception for that pair. Finally, we embed these predictions and retrieve the 25 most similar misconceptions from the mapping file. These are our predictions.
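
For context on why 25 candidates are returned: as far as we know, the competition scores submissions with mean average precision at 25 (MAP@25), and each question-answer pair has a single correct misconception. That scoring rule is background knowledge, not something stated in this notebook. Under that assumption, a minimal local scorer could look like the sketch below; `average_precision_at_k` and `map_at_k` are hypothetical helper names and the ids are made up.

```python
from typing import Sequence


def average_precision_at_k(true_id: int, ranked_ids: Sequence[int], k: int = 25) -> float:
    """AP@k when exactly one item is relevant: 1 / rank, or 0 if it is not in the top k."""
    for rank, candidate in enumerate(ranked_ids[:k], start=1):
        if candidate == true_id:
            return 1.0 / rank
    return 0.0


def map_at_k(true_ids: Sequence[int], ranked_lists: Sequence[Sequence[int]], k: int = 25) -> float:
    """Mean of AP@k over all question-answer pairs."""
    if len(true_ids) != len(ranked_lists):
        raise ValueError("true_ids and ranked_lists must have the same length")
    scores = [average_precision_at_k(t, r, k) for t, r in zip(true_ids, ranked_lists)]
    return sum(scores) / len(scores) if scores else 0.0


# Toy check: the first label is ranked 1st, the second 2nd, the third is missed.
score = map_at_k([7, 3, 9], [[7, 1, 2], [5, 3, 8], [4, 6, 0]])
print(round(score, 3))
```

Because only the rank of the single correct id matters, placing the right misconception higher in the list of 25 is rewarded more than merely including it.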


### <b><span style='color:#F1A424'>References</span></b>


- The Hugging Face models used here were downloaded directly from Hugging Face (see the notebook [here][1]).
- This notebook is based on @cdeotte's work from the AES-2 competition [here][2].
- Nischay Dhankhar's [Gemma-Asking LLM to generate prompt][3]
- [Open LLM Leaderboard][4]

[1]: https://www.kaggle.com/code/cdeotte/download-huggingface-models
[2]: https://www.kaggle.com/code/cdeotte/rapids-svr-starter-cv-0-830-lb-0-804
[3]: https://www.kaggle.com/code/nischaydnk/gemma-asking-llm-to-generate-prompt
[4]: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard

# <b><span style='color:#F1A424'>|</span> Install Libraries</b><a class='anchor' id='install_libraries'></a> [↑](#top)

***

Install all the required libraries for this notebook.

In [None]:
!mkdir /kaggle/working/gemma/
!cp /kaggle/input/gemma-pytorch/gemma_pytorch-main/gemma/* /kaggle/working/gemma/
!pip install --no-index --no-deps /kaggle/input/immutabledict/immutabledict-4.1.0-py3-none-any.whl

In [None]:
import sys 

sys.path.append("/kaggle/working/") 

# <b><span style='color:#F1A424'>|</span> Import Libraries</b><a class='anchor' id='import_libraries'></a> [↑](#top)

***

Import all the required libraries for this notebook.

In [None]:
import contextlib
import gc
import math
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import random
import re
import torch
import torch.nn.functional as F
import transformers


from gemma.config import GemmaConfig, get_config_for_7b, get_config_for_2b
from gemma.model import GemmaForCausalLM
from gemma.tokenizer import Tokenizer
from glob import glob
from sklearn.manifold import TSNE
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
from transformers import AutoModel,AutoTokenizer
from typing import Any, Dict, List


os.environ["CUDA_VISIBLE_DEVICES"]="0,1"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Current device is: {device}")
print('pandas version' ,pd.__version__)
print('transformers version', transformers.__version__)
!mkdir output

# <b><span style='color:#F1A424'>|</span> Configuration</b><a class='anchor' id='configuration'></a> [↑](#top)

***

Central repository for this notebook's hyperparameters.

In [None]:
class config:
    BATCH_SIZE = 128
    DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    MAX_LEN = 128
    MAX_LEN_QA = 512
    MODEL = "intfloat/e5-base-v2"
    

class paths:
    HF_MODELS = "/kaggle/input/download-and-save-huggingface-models/"
    MISCONCEPTION_CSV = "/kaggle/input/eedi-mining-misconceptions-in-mathematics/misconception_mapping.csv"
    OUTPUT = "/kaggle/working/output/"
    TRAIN_CSV = "/kaggle/input/eedi-mining-misconceptions-in-mathematics/train.csv"
    TEST_CSV = "/kaggle/input/eedi-mining-misconceptions-in-mathematics/test.csv"
    

# <b><span style='color:#F1A424'>|</span> Load Data</b><a class='anchor' id='load_data'></a> [↑](#top)

***

Load data.

In [None]:
df_train = pd.read_csv(paths.TRAIN_CSV)
df_test = pd.read_csv(paths.TEST_CSV)
misconception_mapping = pd.read_csv(paths.MISCONCEPTION_CSV)
print(f"Train dataframe shape: {df_train.shape}")
print(f"Test dataframe shape: {df_test.shape}")
print(f"Misconception mapping dataframe shape: {misconception_mapping.shape}")
display(df_train.head())
display(df_test.head())
display(misconception_mapping.head())

# <b><span style='color:#F1A424'>|</span> Utils</b><a class='anchor' id='utils'></a> [↑](#top)

***

Utility functions.
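
`create_misconception` below keeps only the answer options that have a labelled misconception and samples one of them with `random.choice`; missing labels arrive from pandas as float `NaN`. The filter can be sketched in isolation as follows. `labelled_options` and `pick_option` are hypothetical helpers, and the guard for rows without any label is an addition for illustration, not something the notebook does.

```python
import math
import random
from typing import List, Optional, Tuple


def labelled_options(misconceptions: List[object]) -> List[Tuple[int, object]]:
    """Keep (position, name) pairs whose name is not a float NaN."""
    return [
        (i, name) for i, name in enumerate(misconceptions)
        if not (isinstance(name, float) and math.isnan(name))
    ]


def pick_option(misconceptions: List[object], rng: random.Random) -> Optional[Tuple[int, object]]:
    """Sample one labelled option, or return None when nothing is labelled."""
    options = labelled_options(misconceptions)
    return rng.choice(options) if options else None


nan = float("nan")
kept = labelled_options([nan, "Adds instead of multiplies", nan, "Ignores the negative sign"])
print(kept)
print(pick_option([nan, nan, nan, nan], random.Random(0)))
```

Without such a guard, `random.choice` raises an `IndexError` on an empty list, which would happen for any training row that has no labelled misconception.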

In [None]:
def create_example(df: pd.DataFrame, idx: int) -> str:
    """
    Create examples for RAG.
    """
    QuestionText = df.loc[idx, "QuestionText"]
    AnswerAText = df.loc[idx, "AnswerAText"]
    AnswerBText = df.loc[idx, "AnswerBText"]
    AnswerCText = df.loc[idx, "AnswerCText"]
    AnswerDText = df.loc[idx, "AnswerDText"]
    SubjectName = df.loc[idx, "SubjectName"]
    ConstructName = df.loc[idx, "ConstructName"]
    text = f"""
    ###
    ConstructName: {ConstructName}
    ###
    Subject: {SubjectName}
    ###
    Question: {QuestionText} \n
    ###
    Answers:
    (A) - {AnswerAText} \n
    (B) - {AnswerBText} \n
    (C) - {AnswerCText} \n
    (D) - {AnswerDText} \n
    """
    return text


def create_misconception(df: pd.DataFrame, idx: int) -> str:
    """
    Create misconception for RAG.
    """
    QuestionText = df.loc[idx, "QuestionText"]
    AnswerAText = df.loc[idx, "AnswerAText"]
    AnswerBText = df.loc[idx, "AnswerBText"]
    AnswerCText = df.loc[idx, "AnswerCText"]
    AnswerDText = df.loc[idx, "AnswerDText"]
    SubjectName = df.loc[idx, "SubjectName"]
    ConstructName = df.loc[idx, "ConstructName"]
    MisconceptionNameA = df.loc[idx, "MisconceptionNameA"]
    MisconceptionNameB = df.loc[idx, "MisconceptionNameB"]
    MisconceptionNameC = df.loc[idx, "MisconceptionNameC"]
    MisconceptionNameD = df.loc[idx, "MisconceptionNameD"]
    answer_list = [AnswerAText, AnswerBText, AnswerCText, AnswerDText]
    misc_list = [MisconceptionNameA, MisconceptionNameB, MisconceptionNameC, MisconceptionNameD]
    # Keep only the answers that have a labelled misconception (missing labels are float NaN)
    filtered_data = [(index, item) for index, item in enumerate(misc_list) if not (isinstance(item, float) and np.isnan(item))]
    random_index, random_value = random.choice(filtered_data)
    text = f"""
    ###
    ConstructName: {ConstructName}
    ###
    Subject: {SubjectName}
    ###
    Question: {QuestionText}
    ###
    Wrong Answer:
    - {answer_list[random_index]}
    ###
    Misconception:
    {random_value}
    """
    return text


def create_text(df: pd.DataFrame, idx: int, letter: str) -> str:
    """
    Joins all available text information into a single string
    """
    QuestionText = df.loc[idx, "QuestionText"]
    WrongAnswer = df.loc[idx, f"Answer{letter}Text"]
    SubjectName = df.loc[idx, "SubjectName"]
    ConstructName = df.loc[idx, "ConstructName"]
    text = f"""
    ###
    ConstructName: {ConstructName}
    ###
    Subject: {SubjectName}
    ###
    Question: {QuestionText} \n
    ###
    Wrong Answer:
    - {WrongAnswer} \n
    ###
    Misconception:
    ???
    """
    return text


def ensure_row_vector(arr: np.ndarray) -> np.ndarray:
    """
    Ensures that the input array is a row vector.
    If the input array is one-dimensional, it reshapes it into a row vector (1, n).
    If the array is already two-dimensional, it returns the array as is.
    :param arr: Input array, which can be 1D or 2D.
    :return arr: numpy.ndarray: A 2D array with shape (1, n) if the input was 1D, or the original
    array if it was already 2D.
    """
    if arr.ndim == 1:  # Check if the array has one dimension
        return arr.reshape(1, -1)  # Reshape to row vector
    else:
        return arr
    

def most_similar_embeddings_train(embeddings: np.ndarray, index: int, K: int = 5) -> np.ndarray:
    """
    Finds the K most similar embeddings to the embedding at the given index based on cosine similarity.

    :param embeddings: An array of shape (N, M) where N is the number of embeddings and M is the embedding dimension.
    :param index: The index of the embedding to compare against the others.
    :param K: The number of most similar embeddings to return (default is 5).
    :return most_similar_indices: An array of indices of the K most similar embeddings, excluding `index` itself.
    """
    # Get the embedding for the given index
    query_embedding = embeddings[index]
    
    # Compute the cosine similarity between the query embedding and all other embeddings
    dot_product = np.dot(embeddings, query_embedding)
    norm_query = np.linalg.norm(query_embedding)
    norms = np.linalg.norm(embeddings, axis=1)
    cosine_similarity = dot_product / (norm_query * norms)
    
    # Sort the indices by similarity in descending order, exclude the index itself
    most_similar_indices = np.argsort(-cosine_similarity)  # Sort in descending order
    most_similar_indices = most_similar_indices[most_similar_indices != index]  # Exclude the input index
    
    # Return the top K most similar indices
    return most_similar_indices[:K]


def most_similar_embeddings_test(embed_test: np.ndarray, embed_train: np.ndarray, index: int, K: int = 5) -> np.ndarray:
    """
    Finds the K embeddings in `embed_train` most similar to the embedding at the given index in `embed_test`, using cosine similarity.

    :param embed_test: An array of shape (N, M) where N is the number of query embeddings and M is the embedding dimension.
    :param embed_train: An array of shape (P, M) where P is the number of candidate embeddings and M is the embedding dimension.
    :param index: The index of the query embedding in `embed_test`.
    :param K: The number of most similar embeddings to return from `embed_train` (default is 5).
    :return most_similar_indices: An array of indices into `embed_train` of the K most similar embeddings.
    """
    # Get the embedding for the given index from array embed_test
    query_embedding = embed_test[index]
    
    # Compute the cosine similarity between the query embedding from embed_test and all embeddings in embed_train
    dot_product = np.dot(embed_train, query_embedding)
    norm_query = np.linalg.norm(query_embedding)
    norms = np.linalg.norm(embed_train, axis=1)
    cosine_similarity = dot_product / (norm_query * norms)
    
    # Sort the indices of embed_train by similarity in descending order
    most_similar_indices = np.argsort(-cosine_similarity)  # Sort in descending order
    
    # Return the top K most similar indices from embed_train
    return most_similar_indices[:K]


def sep():
    print("-"*100)

# <b><span style='color:#F1A424'>|</span> Pre-processing</b><a class='anchor' id='preprocessing'></a> [↑](#top)

***

    

In [None]:
for letter in ["A", "B", "C", "D"]:
    # === Preprocess Train ===
    df_train = df_train.merge(
        misconception_mapping[['MisconceptionId', 'MisconceptionName']].rename(
            columns={'MisconceptionName': f'MisconceptionName{letter}'}
        ),
        left_on=f"Misconception{letter}Id", right_on='MisconceptionId', how='left'
    )
    df_train.drop("MisconceptionId", axis=1, inplace=True)

# <b><span style='color:#F1A424'>|</span> Model</b><a class='anchor' id='model'></a> [↑](#top)

***

    

In [None]:
# Load the model
VARIANT = "2b-it" 
MACHINE_TYPE = "cuda" 
weights_dir = '/kaggle/input/gemma/pytorch/2b-it/2' 

@contextlib.contextmanager
def _set_default_tensor_type(dtype: torch.dtype):
    """Temporarily sets the default torch dtype to the given dtype."""
    torch.set_default_dtype(dtype)
    try:
        yield
    finally:
        # Restore float32 even if model loading raises
        torch.set_default_dtype(torch.float)

# Model Config.
model_config = get_config_for_2b() if "2b" in VARIANT else get_config_for_7b()
model_config.tokenizer = os.path.join(weights_dir, "tokenizer.model")
# model_config.quant = "quant" in VARIANT

# Model.
device = torch.device(MACHINE_TYPE)
with _set_default_tensor_type(model_config.get_dtype()):
    model = GemmaForCausalLM(model_config)
    ckpt_path = os.path.join(weights_dir, f'gemma-{VARIANT}.ckpt')
    model.load_weights(ckpt_path)
    model = model.to(device).eval()

# <b><span style='color:#F1A424'>|</span> Dataset</b><a class='anchor' id='dataset'></a> [↑](#top)

***
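
The `mean_pooling` helper in this section averages the token embeddings of each text while ignoring padding positions, which matters because every input is padded to `max_length`. The arithmetic can be checked on a tiny example. The sketch below uses NumPy instead of PyTorch purely for illustration, and `masked_mean` is a hypothetical name.

```python
import numpy as np


def masked_mean(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over real tokens only (mask == 1)."""
    mask = attention_mask[..., None].astype(np.float32)   # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(axis=1)       # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)        # avoid division by zero
    return summed / counts


# One sentence, three token slots, hidden size 2; the last slot is padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]], dtype=np.float32)
mask = np.array([[1, 1, 0]])

pooled = masked_mean(hidden, mask)
print(pooled)  # the padding vector [100, 100] does not affect the result
```

A plain mean over all three slots would be dominated by the padding vector, which is why the attention mask is applied before averaging.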

    

In [None]:
def mean_pooling(model_output: Any, attention_mask: torch.Tensor) -> torch.Tensor:
    """Averages the token embeddings of each sequence, ignoring padding positions."""
    last_hidden_state = model_output.last_hidden_state.detach().cpu()
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, 1)
    sum_mask = input_mask_expanded.sum(1)
    sum_mask = torch.clamp(sum_mask, min=1e-9)  # Avoid division by zero
    mean_embeddings = sum_embeddings / sum_mask
    return mean_embeddings


class EmbedQA(Dataset):
    def __init__(self, df, tokenizer, config):
        self.config = config
        self.df = df
        self.tokenizer = tokenizer
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, item):
        text = create_example(self.df, item)
        tokens = self.tokenizer(
            text, None, add_special_tokens=True,
            padding='max_length', truncation=True,
            max_length=self.config.MAX_LEN_QA, return_tensors="pt"
        )
        tokens = {k:v.squeeze(0) for k,v in tokens.items()}
        return tokens


class EmbedMisconception(Dataset):
    def __init__(self, df, tokenizer, config):
        self.config = config
        self.texts = df['MisconceptionName'].values
        self.tokenizer = tokenizer
        
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, item):
        text = self.texts[item]
        tokens = self.tokenizer(
            text, None, add_special_tokens=True,
            padding='max_length', truncation=True,
            max_length=self.config.MAX_LEN, return_tensors="pt"
        )
        tokens = {k:v.squeeze(0) for k,v in tokens.items()}
        return tokens

# <b><span style='color:#F1A424'>|</span> Embedding Function</b><a class='anchor' id='embedding_function'></a> [↑](#top)

***

    

In [None]:
def get_embeddings_qa(
        config = config,
        model_name: str = '',
        batch_size: int = 32
    ):

    # Path to the local copy of the Hugging Face model
    MODEL_PATH = paths.HF_MODELS + config.MODEL.replace("/","_")
    
    # === Load Model ===
    model = AutoModel.from_pretrained(MODEL_PATH , trust_remote_code=True)
    model = model.to(config.DEVICE)
    model.eval()
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH , trust_remote_code=True)
    
    # === Dataset & DataLoader ===
    train_dataset = EmbedQA(df_train, tokenizer, config)
    test_dataset = EmbedQA(df_test, tokenizer, config)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
    
    # === Compute Train Embeddings ===
    all_train_embeddings = []
    for batch in tqdm(train_loader, total=len(train_loader)):
        input_ids = batch["input_ids"].to(config.DEVICE)
        attention_mask = batch["attention_mask"].to(config.DEVICE)
        with torch.no_grad():
            with torch.cuda.amp.autocast(enabled=True):
                model_output = model(input_ids=input_ids, attention_mask=attention_mask)
        sentence_embeddings = mean_pooling(model_output, attention_mask.detach().cpu())
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1) # Normalize the embeddings
        sentence_embeddings = sentence_embeddings.squeeze(0).detach().cpu().numpy()
        sentence_embeddings = ensure_row_vector(sentence_embeddings)
        all_train_embeddings.extend(sentence_embeddings)
    all_train_embeddings = np.array(all_train_embeddings)

    # === Compute Test Embeddings ===
    all_test_embeddings = []
    for batch in tqdm(test_loader, total=len(test_loader)):
        input_ids = batch["input_ids"].to(config.DEVICE)
        attention_mask = batch["attention_mask"].to(config.DEVICE)
        with torch.no_grad():
            with torch.cuda.amp.autocast(enabled=True):
                model_output = model(input_ids=input_ids,attention_mask=attention_mask)
        sentence_embeddings = mean_pooling(model_output, attention_mask.detach().cpu())
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1) # Normalize the embeddings
        sentence_embeddings = sentence_embeddings.squeeze(0).detach().cpu().numpy()
        sentence_embeddings = ensure_row_vector(sentence_embeddings)
        all_test_embeddings.extend(sentence_embeddings)
    all_test_embeddings = np.array(all_test_embeddings)

    # === Clean memory ===
    del train_dataset, test_dataset
    del train_loader, test_loader
    del model, tokenizer
    del model_output, sentence_embeddings, input_ids, attention_mask
    gc.collect()
    torch.cuda.empty_cache()

    return all_train_embeddings, all_test_embeddings



def get_embeddings(
        config = config,
        model_name: str = '',
        batch_size: int = 128,
    ):

    # Path to the local copy of the Hugging Face model
    MODEL_PATH = paths.HF_MODELS + config.MODEL.replace("/","_")
    
    # === Load Model ===
    model = AutoModel.from_pretrained(MODEL_PATH , trust_remote_code=True)
    model = model.to(config.DEVICE)
    model.eval()
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH , trust_remote_code=True)
    
    # === Dataset & DataLoader ===
    misconception_mapping_dataset = EmbedMisconception(misconception_mapping, tokenizer, config)
    misconception_preds_dataset = EmbedMisconception(misconception_preds, tokenizer, config)
    misconception_mapping_loader = DataLoader(misconception_mapping_dataset, batch_size=batch_size, shuffle=False)
    misconception_preds_loader = DataLoader(misconception_preds_dataset, batch_size=batch_size, shuffle=False)
    
    # === Compute Misconceptions Embeddings ===
    all_misconception_mapping_embeddings = []
    for batch in tqdm(misconception_mapping_loader, total=len(misconception_mapping_loader)):
        input_ids = batch["input_ids"].to(config.DEVICE)
        attention_mask = batch["attention_mask"].to(config.DEVICE)
        with torch.no_grad():
            with torch.cuda.amp.autocast(enabled=True):
                model_output = model(input_ids=input_ids, attention_mask=attention_mask)
        sentence_embeddings = mean_pooling(model_output, attention_mask.detach().cpu())
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1) # Normalize the embeddings
        sentence_embeddings = sentence_embeddings.squeeze(0).detach().cpu().numpy()
        sentence_embeddings = ensure_row_vector(sentence_embeddings)
        all_misconception_mapping_embeddings.extend(sentence_embeddings)
    all_misconception_mapping_embeddings = np.array(all_misconception_mapping_embeddings)

    # === Compute Predictions Embeddings ===
    all_misconception_preds_embeddings = []
    for batch in tqdm(misconception_preds_loader, total=len(misconception_preds_loader)):
        input_ids = batch["input_ids"].to(config.DEVICE)
        attention_mask = batch["attention_mask"].to(config.DEVICE)
        with torch.no_grad():
            with torch.cuda.amp.autocast(enabled=True):
                model_output = model(input_ids=input_ids,attention_mask=attention_mask)
        sentence_embeddings = mean_pooling(model_output, attention_mask.detach().cpu())
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1) # Normalize the embeddings
        sentence_embeddings = sentence_embeddings.squeeze(0).detach().cpu().numpy()
        sentence_embeddings = ensure_row_vector(sentence_embeddings)
        all_misconception_preds_embeddings.extend(sentence_embeddings)
    all_misconception_preds_embeddings = np.array(all_misconception_preds_embeddings)

    # === Clean memory ===
    del misconception_mapping_dataset, misconception_preds_dataset
    del misconception_mapping_loader, misconception_preds_loader
    del model, tokenizer
    del model_output, sentence_embeddings, input_ids, attention_mask
    gc.collect()
    torch.cuda.empty_cache()

    return all_misconception_mapping_embeddings, all_misconception_preds_embeddings

# <b><span style='color:#F1A424'>|</span> Create Q&A Embeddings</b><a class='anchor' id='create_embeddings_qa'></a> [↑](#top)

***

    

In [None]:
train_embeddings = []
test_embeddings = []

# === Compute Embeddings ===
name = paths.OUTPUT + config.MODEL.replace("/","_") + ".npy"
print(f"Computing embeddings for {config.MODEL}")
train_embeddings, test_embeddings = get_embeddings_qa(
    model_name=config.MODEL, config=config, batch_size=32
)

gc.collect()
print('Train embeddings have shape', train_embeddings.shape )
print('Test embeddings have shape', test_embeddings.shape )

# === Compute t-SNE projections ===
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
X = tsne.fit_transform(train_embeddings)
df_train["x1"] = X[:, 0]
df_train["x2"] = X[:, 1]

### <b><span style='color:#F1A424'>Visualize</span></b>


In [None]:
import plotly.express as px
from sklearn.preprocessing import LabelEncoder

# Encode the column used for colouring
col = "ConstructName"
label_encoder = LabelEncoder()
df_train[f'{col}_encoded'] = label_encoder.fit_transform(df_train[col])

# Create scatter plot with color based on the encoded column
fig = px.scatter(
    df_train, x='x1', y='x2', color=f'{col}_encoded', 
    hover_data={col: True, "QuestionId": True},
    color_continuous_scale=px.colors.sequential.Viridis
)
fig.update_layout(
    title=f'{col} | t-SNE projections of embeddings',
    width=1200,  # Set the width of the plot
    height=800  # Set the height of the plot
)
fig.show()

### <b><span style='color:#F1A424'>Check one Q&A train example</span></b>


In [None]:
original_index = np.random.choice(df_train.index)
similar_index = most_similar_embeddings_train(train_embeddings, original_index, 5)
print(f"Original index: {original_index} | Similar Index: {similar_index}")
print(create_example(df_train, original_index)), sep()
print(create_example(df_train, similar_index[0]))

### <b><span style='color:#F1A424'>Check one Q&A test example</span></b>


In [None]:
original_index = np.random.choice(df_test.index)
similar_index = most_similar_embeddings_test(test_embeddings, train_embeddings, original_index, 5)
print(f"Original index: {original_index} | Similar Index: {similar_index}")
print(f"Test question:\n {create_example(df_test, original_index)}"), sep()
print(f"Most similar:\n {create_example(df_train, similar_index[0])}")

# <b><span style='color:#F1A424'>|</span> Prompt Gemma</b><a class='anchor' id='prompt_gemma'></a> [↑](#top)

***
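
The prompt in this section asks Gemma to answer as `###Misconception: <YOUR-ANSWER-HERE>###`, and the responses are then cleaned with plain string replacement. Small models do not always follow the requested format, so a more defensive parser may help. The sketch below is one possible approach, not what this notebook runs; `extract_misconception` is a hypothetical helper.

```python
import re

# Text after "###Misconception:" up to the next "###" or the end of the response.
_PATTERN = re.compile(r"###\s*Misconception:\s*(.*?)\s*(?:###|$)", re.DOTALL | re.IGNORECASE)


def extract_misconception(response: str) -> str:
    """Return the tagged misconception text, or the whole response as a fallback."""
    match = _PATTERN.search(response)
    text = match.group(1) if match else response
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(text.split())


print(extract_misconception("###Misconception: Adds instead of multiplies###"))
print(extract_misconception("Sure!\n###Misconception: Confuses area\nwith perimeter"))
```

Falling back to the full response keeps a usable string for the embedding step even when the model ignores the template.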

    

In [None]:
def create_prompt(question: str, index: int, verbose: bool = False) -> str:
    similar_index = most_similar_embeddings_test(test_embeddings, train_embeddings, index, 5)
    example = create_misconception(df_train, similar_index[0])
    
    if verbose:
        print(example)
        
    prompt_for_llm = (
        "<start_of_turn>user\nYou are an expert mathematician. You are presented with a math problem question and a wrong answer. "
        "Your task is to say which common misconception is associated with the incorrect answer. "
        "It is important that your answer describes a generic error and is not tailored to the particular question you are presented with. "
        "Your answer should be formatted as: ###Misconception: <YOUR-ANSWER-HERE>###\n"
        f"Here is an example: {example}\n"
        f"Here is the question you must answer: {question}\n"
        "<end_of_turn>\n<start_of_turn>model\n"
    )
    response = model.generate(
        prompt_for_llm,
        device=device,
        output_len=128,
    )
    return response

output = []
for index in tqdm(df_test.index):
    question_id = df_test.loc[index, "QuestionId"]
    correct_answer = df_test.loc[index, "CorrectAnswer"]
    for letter in ["A", "B", "C", "D"]:
        # Only distractors (wrong answers) need a misconception
        if letter == correct_answer:
            continue
        question_id_answer = str(question_id) + "_" + letter
        question = create_text(df_test, index, letter)
        response = create_prompt(question, index, False)
        output.append([question_id_answer, response])
        
misconception_preds = pd.DataFrame(output)
misconception_preds.columns = ["QuestionId_Answer", "MisconceptionName"]
# Strip the answer template (opening tag and closing "###") and newlines from the raw responses
misconception_preds["MisconceptionName"] = (
    misconception_preds["MisconceptionName"]
    .str.replace("###Misconception:", "", regex=False)
    .str.replace("###", "", regex=False)
    .str.replace("\n", " ", regex=False)
    .str.strip()
)
misconception_preds["len"] = misconception_preds["MisconceptionName"].apply(lambda x: len(x.split()))
misconception_preds.head()

# <b><span style='color:#F1A424'>|</span> Create Misconception Embeddings</b><a class='anchor' id='create_embeddings_misc'></a> [↑](#top)

***

    

In [None]:
all_misconception_mapping_embeddings = []
all_misconception_preds_embeddings = []

# === Compute Embeddings ===
name = paths.OUTPUT + config.MODEL.replace("/","_") + ".npy"
print(f"Computing embeddings for {config.MODEL}")
all_misconception_mapping_embeddings, all_misconception_preds_embeddings = get_embeddings(
    model_name=config.MODEL, config=config, batch_size=config.BATCH_SIZE
)

gc.collect()
print('Misconception mapping embeddings have shape', all_misconception_mapping_embeddings.shape)
print('Misconception predictions embeddings have shape', all_misconception_preds_embeddings.shape)

# === Compute t-SNE projections ===
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
X = tsne.fit_transform(all_misconception_mapping_embeddings)
misconception_mapping["x1"] = X[:, 0]
misconception_mapping["x2"] = X[:, 1]

### <b><span style='color:#F1A424'>Visualize</span></b>


In [None]:
import plotly.express as px
from sklearn.preprocessing import LabelEncoder

columns = ['MisconceptionAId', 'MisconceptionBId', 'MisconceptionCId', 'MisconceptionDId']
all_identifiers = df_train[columns].melt()['value']
identifier_counts = all_identifiers.value_counts(ascending=True)
train_ids = [int(idx) for idx in identifier_counts.index]
misconception_mapping["unseen"] = 1
idx = misconception_mapping[misconception_mapping["MisconceptionId"].isin(train_ids)].index
misconception_mapping.loc[idx, "unseen"] = 0

fig = px.scatter(
    misconception_mapping, x='x1', y='x2', 
    color='unseen',
    hover_data={"MisconceptionName": True},
#     color_continuous_scale=px.colors.sequential.Viridis
)
fig.update_layout(
    title=f'MisconceptionName | t-SNE projections of embeddings',
    width=1200,  # Set the width of the plot
    height=800  # Set the height of the plot
)
fig.show()

### <b><span style='color:#F1A424'>Check one train example</span></b>


In [None]:
original_index = np.random.choice(misconception_mapping.index)
similar_index = most_similar_embeddings_train(all_misconception_mapping_embeddings, original_index, 5)
print(f"Original index: {original_index} | Similar Index: {similar_index}")
print(misconception_mapping.loc[original_index, "MisconceptionName"]), sep()
print(misconception_mapping.loc[similar_index[0], "MisconceptionName"])

### <b><span style='color:#F1A424'>Check one test example</span></b>


In [None]:
original_index = np.random.choice(misconception_preds.index)
similar_index = most_similar_embeddings_test(all_misconception_preds_embeddings, all_misconception_mapping_embeddings, original_index, 5)
print(f"Original index: {original_index} | Similar Index: {similar_index}")
print(f"Prediction:\n {misconception_preds.loc[original_index, 'MisconceptionName']}"), sep()
print(f"Most similar:\n {misconception_mapping.loc[similar_index[0], 'MisconceptionName']}")

# <b><span style='color:#F1A424'>|</span> Retrieval</b><a class='anchor' id='retrieval'></a> [↑](#top)

***

For each predicted misconception we will retrieve the top-25 most similar misconceptions from `misconception_mapping.csv`.
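
The loop in this section ranks the mapping embeddings one prediction at a time. Since cosine similarity between unit-length vectors is just a dot product, all rankings can also be computed with a single matrix multiplication. The following is a sketch of that idea, not the notebook's method; `top_k_ids` is a hypothetical helper and the toy arrays are made up.

```python
from typing import List

import numpy as np


def top_k_ids(pred_emb: np.ndarray, mapping_emb: np.ndarray, ids: np.ndarray, k: int = 25) -> List[str]:
    """For each prediction, return the k most similar ids as a space-separated string."""
    # Normalise here so the function is also correct for unnormalised inputs.
    pred = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    mapping = mapping_emb / np.linalg.norm(mapping_emb, axis=1, keepdims=True)
    similarity = pred @ mapping.T                     # (n_preds, n_misconceptions)
    k = min(k, mapping.shape[0])
    order = np.argsort(-similarity, axis=1)[:, :k]    # best match first
    return [" ".join(map(str, ids[row])) for row in order]


# Toy data: three "misconceptions" in 2-D and two predictions.
mapping_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ids = np.array([10, 11, 12])
pred_emb = np.array([[0.9, 0.1], [0.1, 2.0]])

print(top_k_ids(pred_emb, mapping_emb, ids, k=2))
```

For the few thousand misconceptions and test predictions handled here the per-row loop is fast enough, so the batched form is mainly a convenience when experimenting with larger candidate sets.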

In [None]:
misconception_cols = ["MisconceptionAId", "MisconceptionBId", "MisconceptionCId", "MisconceptionDId"]

def remove_duplicates(lst: List[Any]) -> List[Any]:
    seen = set()
    result = []
    for item in lst:
        if item not in seen:
            result.append(item)
            seen.add(item)
    return result

def get_misconceptions_test(idx: int, df: pd.DataFrame, embeddings_test: np.ndarray, embeddings_train: np.ndarray) -> str:
    """Returns the ids of the 25 misconceptions most similar to prediction `idx` as a space-separated string."""
    similar_index = most_similar_embeddings_test(embeddings_test, embeddings_train, idx, 100)
    misconception_ids = df.loc[similar_index, "MisconceptionId"]
    misconception_ids = misconception_ids[:25] # get top 25
    assert len(misconception_ids) == 25
    misconception_ids = ' '.join(map(str, misconception_ids))
    return misconception_ids

misconception_preds["pred"] = ""
    
for idx in tqdm(misconception_preds.index):
    misconception_preds.loc[idx, "pred"] = get_misconceptions_test(
        idx, misconception_mapping, all_misconception_preds_embeddings, all_misconception_mapping_embeddings
    )
    
display(misconception_preds)

# <b><span style='color:#F1A424'>|</span> Submission</b><a class='anchor' id='submission'></a> [↑](#top)

***

Structure our predictions in the required format for our `submission.csv` file.
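
Each row of `submission.csv` pairs a `QuestionId_Answer` key with a space-separated list of misconception ids, as produced in the retrieval step. A quick sanity check before writing the file can catch malformed rows. The sketch below assumes a limit of 25 ids per row (the number retrieved in the retrieval step); `validate_prediction` is a hypothetical helper and the example ids are made up.

```python
def validate_prediction(pred: str, max_ids: int = 25) -> bool:
    """Check that a prediction holds 1..max_ids unique non-negative integer ids separated by spaces."""
    tokens = pred.split()
    if not 1 <= len(tokens) <= max_ids:
        return False
    if not all(token.isascii() and token.isdigit() for token in tokens):
        return False
    ids = [int(token) for token in tokens]
    return len(set(ids)) == len(ids)


rows = {
    "1869_B": "706 1672 2532",   # well-formed (made-up ids)
    "1869_C": "706 706",         # duplicate id
    "1869_D": "",                # empty prediction
}
for key, pred in rows.items():
    print(key, validate_prediction(pred))
```

Applied to the real frame, this could be run as `submission["MisconceptionId"].map(validate_prediction).all()` before calling `to_csv`.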

In [None]:
submission = misconception_preds[["QuestionId_Answer", "pred"]].copy()
submission.columns = ["QuestionId_Answer", "MisconceptionId"]
submission

In [None]:
submission.to_csv("submission.csv", index=False)