# Mixtral Baseline

This notebook outlines possible strategies for the competition and provides inference code for the pretrained Mixtral model, which could serve as a strong baseline for this task.

First of all, does it make sense to use a pretrained model for this task? The answer is likely yes, and this notebook aims to validate that through leaderboard evaluation. <br>
A pretrained generative LLM like Mixtral can be viewed as an approximation of a human evaluator. Given precise assessment criteria in the prompt, its scores will probably correlate well with human scores. Without such criteria, it may be biased relative to human assessment. <br>
Supervised training on the provided training set can be seen as calibrating the model to the human assessment criteria. At the same time, an LLM like Mixtral has likely seen similar texts during pretraining, so it could in principle evaluate essays much as humans do, especially since human evaluation involves some subjectivity even with defined criteria. <br>
Evaluating with Mixtral in a zero-shot fashion therefore gives an estimate of how closely a general LLM's evaluations align with human judgments under specific assessment criteria.
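
To put a number on "how closely the evaluations align", one option is quadratic weighted kappa, a common agreement measure for ordinal scores. The choice of metric is an assumption here (the notebook itself does not name one), and the function name and toy scores below are purely illustrative. A minimal pure-Python sketch:

```python
def quadratic_weighted_kappa(y_true, y_pred, min_score=1, max_score=6):
    """Agreement between two lists of integer scores; 1.0 means perfect agreement."""
    n_labels = max_score - min_score + 1
    n = len(y_true)

    # Observed joint counts and the two marginal histograms.
    observed = [[0] * n_labels for _ in range(n_labels)]
    hist_true = [0] * n_labels
    hist_pred = [0] * n_labels
    for t, p in zip(y_true, y_pred):
        observed[t - min_score][p - min_score] += 1
        hist_true[t - min_score] += 1
        hist_pred[p - min_score] += 1

    numerator = 0.0
    denominator = 0.0
    for i in range(n_labels):
        for j in range(n_labels):
            # Disagreements are penalized by the squared score distance.
            weight = (i - j) ** 2 / (n_labels - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / n
            numerator += weight * observed[i][j]
            denominator += weight * expected

    # The denominator is zero only when both raters give the same constant score.
    if denominator == 0:
        return 1.0
    return 1.0 - numerator / denominator


# Toy example with made-up scores (not competition data).
human_scores = [1, 2, 3, 4, 5, 6]
llm_scores = [1, 2, 3, 4, 5, 6]
print(quadratic_weighted_kappa(human_scores, llm_scores))
```

On a labeled sample, the same function could be applied to the human scores and the scores returned by the model.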

In [None]:
!pip install -U /kaggle/input/bitsandbytes-0-42-0-py3-none-any-whl/bitsandbytes-0.42.0-py3-none-any.whl -qq
!pip install --no-index /kaggle/input/making-wheels-of-necessary-packages-for-hf-llms/bitsandbytes-0.42.0-py3-none-any.whl --find-links=/kaggle/input/making-wheels-of-necessary-packages-for-hf-llms -qq
!pip install --no-index /kaggle/input/making-wheels-of-necessary-packages-for-hf-llms/accelerate-0.27.2-py3-none-any.whl --find-links=/kaggle/input/making-wheels-of-necessary-packages-for-hf-llms -qq
!pip install --no-index /kaggle/input/making-wheels-of-necessary-packages-for-hf-llms/transformers-4.38.1-py3-none-any.whl --find-links=/kaggle/input/making-wheels-of-necessary-packages-for-hf-llms -qq
!pip install --no-index /kaggle/input/making-wheels-of-necessary-packages-for-hf-llms/optimum-1.17.1-py3-none-any.whl --find-links=/kaggle/input/making-wheels-of-necessary-packages-for-hf-llms -qq

In [None]:
import gc
import re
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, AutoConfig

## Estimating inference parameters

As inference with an LLM can be slow and memory-intensive, it makes sense to roughly estimate the size of the input data and explore approaches for optimization. 

In [None]:
train = pd.read_csv('/kaggle/input/learning-agency-lab-automated-essay-scoring-2/train.csv')

train["length"] = train["full_text"].map(lambda x: len(x.split(" ")))
train.head()

In [None]:
sns.histplot(train, x='length')
plt.xscale('log')

The histogram shows that most essays contain fewer than 1000 words.

Furthermore, one could hypothesize that the quality of an essay is approximately uniform throughout the text. If so, the first and second parts of an essay, evaluated independently, should receive similar scores. Thus, we can limit the essay length in the input to 500 words (with the defaults used below, 5 chunks of 100 words each).
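
Before settling on a word cap, it may help to check how much text the cap actually discards. The sketch below is a hypothetical helper (`truncation_stats` is not part of the original notebook) that reports the share of essays that fit entirely under the cap and the share of all words that survive truncation. It counts words by splitting on single spaces, as the length estimate above does.

```python
def truncation_stats(texts, max_words=500):
    """Return (share of essays fully covered, share of all words kept) for a word cap."""
    lengths = [len(text.split(" ")) for text in texts]
    fully_covered = sum(length <= max_words for length in lengths) / len(lengths)
    words_kept = sum(min(length, max_words) for length in lengths) / sum(lengths)
    return fully_covered, words_kept


# Toy example: one 300-word essay and one 1000-word essay (made-up data).
toy_essays = [" ".join(["w"] * 300), " ".join(["w"] * 1000)]
covered, kept = truncation_stats(toy_essays, max_words=500)
print(f"essays fully covered: {covered:.2f}, words kept: {kept:.2f}")
```

With the training data loaded above, the call would be `truncation_stats(train["full_text"], max_words=500)`.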

This also leads to the idea of multiple assessments for a single essay; we can split it into several chunks, evaluate them independently, and then take the final score as the mean or mode of the chunk scores.
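
The mean and the mode can disagree when chunk scores are spread out, so the choice matters. The following sketch compares the two. The helper name `aggregate_scores`, the tie-breaking rule for the mode, and the half-up rounding are assumptions made for illustration; the default of 3 mirrors the fallback used later in the notebook.

```python
import math
from statistics import mean, multimode


def aggregate_scores(chunk_scores, method="mean", default=3):
    """Combine per-chunk scores into a single integer score."""
    if not chunk_scores:
        # No chunk produced a usable score.
        return default
    if method == "mean":
        # Round half up so that 3.5 becomes 4 (built-in round() would round half to even).
        return math.floor(mean(chunk_scores) + 0.5)
    if method == "mode":
        # multimode returns every most-frequent value; ties go to the lowest score.
        return min(multimode(chunk_scores))
    raise ValueError(f"unknown method: {method}")


print(aggregate_scores([2, 2, 5], method="mean"))  # mean of 3.0 stays 3
print(aggregate_scores([2, 2, 5], method="mode"))  # most frequent score is 2
```

The mean is pulled toward outlier chunks, while the mode ignores them; which behaves better here is an empirical question.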

So for now, let's prepare utilities for such an evaluation.

## Inference

In [None]:
def get_prompt(essay):
    
    return f"""
Your task is to score a student essay. \n\n
Read the essay below carefully and assign it a score from the following range: [1, 2, 3, 4, 5, 6]. \n\n
1 - is a bad essay, 6 - very good essay.
Avoid additional text description in your answer. \n\n
Your answer should consist of only a single digit score, for example: Answer: 4 \n\n

Essay to evaluate: \n\n
{essay}
"""

In [None]:
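def get_prompts(essay, chunk_size=100, max_chunks=5):
    """Represent essay as several chunks of `chunk_size` words each."""
    essay_words = essay.split(" ")
    # Always produce at least one chunk, so essays shorter than
    # `chunk_size` are still scored instead of getting the default.
    n_chunks = max(1, len(essay_words) // chunk_size)
    prompts = [
        get_prompt(" ".join(essay_words[k * chunk_size: (k + 1) * chunk_size]))
        for k in range(n_chunks)
    ]
    return prompts[:max_chunks]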
    

In [None]:
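def parse_output(decoded_output):
    """Parse valid single-digit scores (1 to 6) from model output."""
    # Restrict matches to the allowed range so stray digits such as 0 or 9 are ignored.
    single_digits = re.findall(r'\b[1-6]\b', decoded_output)
    single_digits = [int(s) for s in single_digits]
    return single_digits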

In [None]:
def evaluate_essay(essay, model, chunk_size=100, max_chunks=5):
    """Evaluate single essay by chunks."""
    scores = []
    query_prompts = get_prompts(essay, chunk_size=chunk_size, max_chunks=max_chunks)
    
    for query_prompt in query_prompts:
        messages = [
            {
                "role": "user",
                "content": query_prompt
            }
        ]
        
        inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
        with torch.no_grad():
            encoded_output = model.generate(
                inputs,
                max_new_tokens=20,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id
            )
        # Decode only the newly generated tokens, so digits that appear in the
        # prompt or in the essay are never parsed as scores.
        decoded_output = tokenizer.decode(
            encoded_output[0][inputs.shape[-1]:],
            skip_special_tokens=True
        ).strip()
        
        score = parse_output(decoded_output)
        scores.extend(score)
    
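    if scores:
        # Round to the nearest integer (half up) instead of truncating,
        # which would systematically bias the score downwards.
        score = int(np.floor(np.mean(scores) + 0.5))
    else:
        # No score could be parsed: fall back to a middle score.
        score = 3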
    
    return score

In [None]:
MODEL_PATH = "/kaggle/input/mixtral/pytorch/8x7b-instruct-v0.1-hf/1"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

config = AutoConfig.from_pretrained(MODEL_PATH)
# Gradient checkpointing only matters for training; it is not needed for inference.
config.gradient_checkpointing = True

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quantization_config,
    config=config
)
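
As a rough sanity check on why 4-bit quantization is used, the weight memory can be estimated from the parameter count alone. The figure of about 46.7 billion total parameters for Mixtral 8x7B is an assumption taken from the publicly reported model size. The estimate ignores activations, the KV cache, quantization constants, and any layers kept in higher precision, so treat it as a lower bound.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Assumption: approximate total parameter count reported for Mixtral 8x7B.
N_PARAMS = 46.7e9

for name, bits in [("bf16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{name}: ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Comparing these numbers with the available GPU memory shows which precisions are feasible at all.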

In [None]:
test = pd.read_csv('/kaggle/input/learning-agency-lab-automated-essay-scoring-2/test.csv')
test.head()

In [None]:
def create_submission(test, model, chunk_size=100, max_chunks=5):
    ids = []
    scores = []
    for i in tqdm(range(len(test))):
        
        essay_id = test["essay_id"].loc[i]
        essay = test["full_text"].loc[i]
        
        score = evaluate_essay(essay, model, chunk_size=chunk_size, max_chunks=max_chunks)
        
        scores.append(score)
        ids.append(essay_id)
        
        torch.cuda.empty_cache()
        gc.collect()
    
    submission = pd.DataFrame()
    submission["essay_id"] = ids
    submission["score"] = scores
    
    return submission

In [None]:
submission = create_submission(test, model, chunk_size=100, max_chunks=3)
submission.to_csv("submission.csv", index=False)

Inference on the full test set needs to be optimized (at least with batched inference). In its current version, the notebook times out upon submission.
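
One possible shape for batched inference is sketched below. It groups prompts of similar length into batches (to limit padding) and restores the original order afterwards. The names `score_in_batches` and `generate_fn` are hypothetical. In this notebook, `generate_fn` would need to wrap the tokenizer (with padding enabled, typically left padding for decoder-only generation) and a single `model.generate` call per batch; here a stub stands in for the model so the sketch runs on its own.

```python
def score_in_batches(prompts, generate_fn, batch_size=8):
    """Run generate_fn on length-sorted batches; return outputs in the original order."""
    # Sort indices by prompt length so each batch contains similarly sized prompts.
    order = sorted(range(len(prompts)), key=lambda i: len(prompts[i]))
    outputs = [None] * len(prompts)
    for start in range(0, len(order), batch_size):
        batch_idx = order[start:start + batch_size]
        batch_outputs = generate_fn([prompts[i] for i in batch_idx])
        # Write each output back to the position of its original prompt.
        for i, output in zip(batch_idx, batch_outputs):
            outputs[i] = output
    return outputs


def stub_generate(batch):
    """Stand-in for the model: returns one fake answer per prompt."""
    return [f"Answer for {len(prompt)}-char prompt" for prompt in batch]


demo_prompts = ["a long prompt here", "short", "medium one"]
print(score_in_batches(demo_prompts, stub_generate, batch_size=2))
```

Whether this is enough to avoid the timeout depends on the batch size that fits in memory, which would have to be measured.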

In [None]:
submission.head()