# Kaggle LLM Science Exam Perplexity Ranking Ensemble

This code is based on the discussion and notebooks from [Psi](https://www.kaggle.com/code/philippsinger) and [Takamichi Toda](https://www.kaggle.com/takamichitoda):

- https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/424242
- https://www.kaggle.com/code/philippsinger/h2ogpt-perplexity-ranking
- https://www.kaggle.com/code/takamichitoda/llm-perplexity-ranking-ensemble/notebook

Rather than using or training an LLM to predict the correct answer from the question and its multiple-choice options, we instead:
- Prepend the question to each candidate answer and run the resulting text through a pretrained LLM (inference only)
- For each token, record the log-probability the model assigns to it
- Compute the perplexity over these log-probabilities, treating each question-answer pair as a small dataset; the perplexity measures how well the model predicts that text, and the correct answer should be the one the model finds least surprising
- Sort the question-answer pairs by perplexity and take the answer with the lowest perplexity as the model's answer to the question
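
The steps above can be sketched without a model, assuming the per-token log-probabilities have already been collected. The helper names (`perplexity`, `rank_answers`) and the numbers in `toy_logprobs` are hypothetical and only illustrate the ranking step; they are not taken from the referenced notebooks.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability of the tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def rank_answers(logprobs_by_option):
    """Return option labels sorted from lowest to highest perplexity."""
    scores = {label: perplexity(lps) for label, lps in logprobs_by_option.items()}
    return sorted(scores, key=scores.get)

# Hypothetical natural-log probabilities for three "question + answer" texts.
toy_logprobs = {
    "A": [-2.0, -1.5, -3.0],
    "B": [-0.5, -0.7, -0.6],
    "C": [-1.0, -1.2, -0.9],
}

ranking = rank_answers(toy_logprobs)
# The competition expects up to three space-separated labels per question.
top3 = " ".join(ranking[:3])
```

Here option `B` has the highest average log-probability, so it gets the lowest perplexity and is ranked first.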

In [1]:
# !pip install bitsandbytes

In [11]:
import gc
import os

import numpy as np
import pandas as pd
import torch
from torch import nn
from tqdm import tqdm

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

In [6]:
import os
from pathlib import Path

# TODO: adjust these paths for Colab or Kaggle if needed. No change is needed when running
# locally or in Docker with a volume mapped for the Hugging Face cache.
cwd = Path(os.getcwd())
data_dir = cwd / 'data'
model_dir = cwd / 'models'

In [7]:
import os
import subprocess
from pathlib import Path
import zipfile

COMPETITION='kaggle-llm-science-exam'

# fix as needed
# data_dir = Path(os.getcwd()) / 'data'
data_dir.mkdir(exist_ok=True)

file_path = data_dir / f'{COMPETITION}.zip'

if not file_path.exists():
    # Download the competition archive with the Kaggle CLI (requires configured API credentials)
    subprocess.run(['kaggle', 'competitions', 'download', '-p', str(data_dir), '-c', COMPETITION], check=True)
    # Extract with zipfile instead of shelling out to unzip
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        zip_ref.extractall(data_dir)

In [8]:
MODELS = ["mistralai/Mistral-7B-v0.1",
          "01-ai/Yi-6B"]
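
How the scores from the models in `MODELS` are combined is not shown in this part of the notebook. One simple possibility, assumed here purely for illustration, is to average each option's perplexity-like score across models and rank on the average. The function name `ensemble_ranking` and the scores in `toy_scores` are hypothetical.

```python
def ensemble_ranking(scores_by_model):
    """Rank options by their mean score across models (lower is better).

    scores_by_model: {model_name: {option_label: score}}
    """
    options = sorted(next(iter(scores_by_model.values())))
    n_models = len(scores_by_model)
    mean_score = {
        opt: sum(scores[opt] for scores in scores_by_model.values()) / n_models
        for opt in options
    }
    return sorted(options, key=mean_score.get)

# Hypothetical scores, not real model outputs.
toy_scores = {
    "mistralai/Mistral-7B-v0.1": {"A": 2.0, "B": 1.0, "C": 3.0},
    "01-ai/Yi-6B": {"A": 1.0, "B": 2.5, "C": 3.5},
}

ensemble = ensemble_ranking(toy_scores)
```

Averaging raw scores lets a confident model outweigh an uncertain one; averaging ranks instead would weight every model equally. Which of the two suits this task is not established here.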

In [10]:
train_df = pd.read_csv(data_dir / 'train.csv')
test_df = pd.read_csv(data_dir / 'test.csv')
test_df.head()

|   | id | prompt | A | B | C | D | E |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Which of the following statements accurately d... | MOND is a theory that reduces the observed mis... | MOND is a theory that increases the discrepanc... | MOND is a theory that explains the missing bar... | MOND is a theory that reduces the discrepancy ... | MOND is a theory that eliminates the observed ... |
| 1 | 1 | Which of the following is an accurate definiti... | Dynamic scaling refers to the evolution of sel... | Dynamic scaling refers to the non-evolution of... | Dynamic scaling refers to the evolution of sel... | Dynamic scaling refers to the non-evolution of... | Dynamic scaling refers to the evolution of sel... |
| 2 | 2 | Which of the following statements accurately d... | The triskeles symbol was reconstructed as a fe... | The triskeles symbol is a representation of th... | The triskeles symbol is a representation of a ... | The triskeles symbol represents three interloc... | The triskeles symbol is a representation of th... |
| 3 | 3 | What is the significance of regularization in ... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... |
| 4 | 4 | Which of the following statements accurately d... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... |


In [12]:
# https://www.kaggle.com/code/philippsinger/h2ogpt-perplexity-ranking/notebook

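class Perplexity(nn.Module):
    """Per-sequence mean next-token cross-entropy (the log of the perplexity)."""

    def __init__(self, reduce: bool = True):
        super().__init__()
        self.reduce = reduce
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, logits, labels):
        # Align predictions with targets: the logits at position t predict the token at t + 1
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()

        # One mean cross-entropy value per sequence in the batch
        perplexity = []
        for i in range(labels.shape[0]):
            perplexity.append(self.loss_fn(shift_logits[i], shift_labels[i]))
        perplexity = torch.stack(perplexity, dim=0)
        # Note: exp() is disabled, so this returns the mean loss (log-perplexity);
        # exp is monotonic, so the ranking of answers is unchanged
        # perplexity = torch.exp(perplexity)
        if self.reduce:
            perplexity = perplexity.mean()
        return perplexity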
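
The shift-by-one alignment used in `forward` can be checked by hand on a toy sequence. The sketch below reimplements the same mean next-token loss for a single sequence in plain Python; the names `log_softmax` and `mean_shifted_nll` are hypothetical, and unlike `nn.CrossEntropyLoss` it does not skip labels set to an ignore index.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def mean_shifted_nll(logits, labels):
    """Mean negative log-likelihood of each token given the preceding ones.

    logits: list of length seq_len, each a list of vocab_size floats
    labels: list of seq_len token ids
    """
    shift_logits = logits[:-1]  # the last position has no next token to predict
    shift_labels = labels[1:]   # the first token has no preceding context
    nll = [-log_softmax(row)[tok] for row, tok in zip(shift_logits, shift_labels)]
    return sum(nll) / len(nll)

# Toy sequence: 3 tokens, vocabulary of size 2, uniform logits at every position.
toy_logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
toy_labels = [0, 1, 0]

loss = mean_shifted_nll(toy_logits, toy_labels)  # log(2) for a uniform 2-token model
ppl = math.exp(loss)                             # approximately 2, the vocabulary size
```

A model that is uniform over its vocabulary has a perplexity equal to the vocabulary size, which makes this a convenient sanity check.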

In [19]:
# https://www.kaggle.com/code/philippsinger/h2ogpt-perplexity-ranking/notebook
def precision_at_k(r, k):
    """Precision at k"""
    assert k <= len(r)
    assert k != 0
    return sum(int(x) for x in r[:k]) / k

def MAP_at_3(predictions, true_items):
    """Score is mean average precision at 3"""
    U = len(predictions)
    map_at_3 = 0.0
    for u in range(U):
        user_preds = predictions[u]
        user_true = true_items[u]
        user_results = [1 if item == user_true else 0 for item in user_preds]
        for k in range(min(len(user_preds), 3)):
            map_at_3 += precision_at_k(user_results, k+1) * user_results[k]
    return map_at_3 / U

# NB: the version above does not stop at the first correct prediction.
# That matters only if the model outputs the same answer more than once, in which case
# the repeated correct answer would be counted again.
# Stopping after the first hit is specified in the MAP@3 description in the competition rules:
# https://www.kaggle.com/competitions/kaggle-llm-science-exam
# The function is redefined here with that short circuit, for clarity.
def MAP_at_3(predictions, true_items):
    """Score is mean average precision at 3"""
    U = len(predictions)
    map_at_3 = 0.0
    for u in range(U):
        user_preds = predictions[u]
        user_true = true_items[u]
        user_results = [1 if item == user_true else 0 for item in user_preds]
        for k in range(min(len(user_preds), 3)):
            map_at_3 += precision_at_k(user_results, k+1) * user_results[k]
            if user_results[k] == 1:
                break
    return map_at_3 / U
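
The effect of the short circuit can be seen on a single question with a duplicated prediction. This is a minimal sketch using a hypothetical helper, `average_precision_at_3`, that follows the same arithmetic as the two `MAP_at_3` definitions and exposes the `break` as a flag.

```python
def average_precision_at_3(preds, truth, stop_at_first_hit):
    """Average precision over the top 3 predictions for one question."""
    hits = [1 if p == truth else 0 for p in preds[:3]]
    score = 0.0
    for k, hit in enumerate(hits):
        # precision at k+1, counted only when position k is a hit
        score += hit * sum(hits[:k + 1]) / (k + 1)
        if hit and stop_at_first_hit:
            break
    return score

# The correct answer "A" is predicted twice.
dup_preds = ["A", "A", "B"]
without_break = average_precision_at_3(dup_preds, "A", stop_at_first_hit=False)
with_break = average_precision_at_3(dup_preds, "A", stop_at_first_hit=True)

# With distinct predictions both variants agree.
second_place = average_precision_at_3(["B", "A", "C"], "A", stop_at_first_hit=True)
```

Without the short circuit the duplicated answer scores 2.0 for one question, which exceeds the maximum of 1.0; with it the score is 1.0. A correct answer in second place scores 0.5 either way.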