# Language Models Essay Scoring Benchmark

#### Motivation

This notebook aims to achieve three main goals:

1. **Model Selection:** Evaluate the performance of various language models, both local and remote, to determine which models work best for the competition and provide general insights.

2. **Inference Speed:** Measure each model's inference time to judge whether the model is feasible within the competition's constraints.

3. **Prompt Engineering:** Experiment with different prompting techniques to compare results and explore how prompting can influence model performance.

This notebook utilizes [ai-microcore](https://github.com/Nayjest/ai-microcore) for querying various locally deployed and remote language models.

**While raw scores generated by LLMs may not be ideal for direct submissions, they could form a crucial part of a high-performing solution**, especially if you're not focused on the Efficiency Prize.

For further analysis, raw essay scores generated by different LLMs are stored in the following file: `/kaggle/working/preds.csv` (see notebook output).

#### General Notebook Settings

In [None]:
VERBOSE = True
QUANTIZE = True
LIMIT_ROWS = 50

#### Offline installation of LLM packages, logging setup, system info, and torch & transformers preparation ([see details here](https://www.kaggle.com/code/nayjest/kaggle-llm-notebook-bootstrap))

In [None]:
__import__('kaggle_llm_notebook_bootstrap').bootstrap()

#### Imports

In [None]:
import pandas as pd, timers, sklearn.metrics, os, kaggle_secrets, re, json
import microcore as mc
from microcore import ui
from datetime import datetime
from tqdm.notebook import tqdm

#### Read Essays Dataset

In [None]:
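# On Kaggle the dataset lives under /kaggle/input/; elsewhere a relative path is used.
DATA_ROOT = '/kaggle/input/' if mc.utils.is_kaggle() else ''
INPUT = f"{DATA_ROOT}learning-agency-lab-automated-essay-scoring-2/train.csv"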
tdf = pd.read_csv(INPUT)
tdf.tail(2)

In [None]:
defaults = dict(
    api_type=mc.ApiType.TRANSFORMERS,
    chat_mode=True,
    default_args=dict(
        do_sample=True,
    ),
    init_params=dict(
        quantize_4bit=QUANTIZE if mc.utils.is_kaggle() else True,
        always_clear_mem=True,
        gradient_checkpointing=True,
    ),
)

configs = {
    'gemma-1.1-2b': dict(
        model='/kaggle/input/gemma/transformers/1.1-2b-it/1',
        init_params=dict(
            quantize_4bit=False,
            always_clear_mem=True,
        ),
    ),
    'llama3-8b': dict(
        model='/kaggle/input/llama-3/transformers/8b-chat-hf/1/',
        default_args=dict(
            do_sample=True,
            eos_token_id=128009,
        ),
    ),
    'mistral-7b': dict(
        model='/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1',
    ),
    'mixtral-8x-7b(Kaggle T4x2)': dict(
        model='/kaggle/input/mixtral/pytorch/8x7b-instruct-v0.1-hf/1',
    ),
    'gemma-7b': dict(
        model='/kaggle/input/gemma/transformers/7b-it/3',
    ),
    'gpt-3.5-turbo': dict(
        model='gpt-3.5-turbo',
        api_type=mc.ApiType.OPEN_AI,
        api_key = kaggle_secrets.UserSecretsClient().get_secret('OPENAI_API_KEY'),
    ),
    'gpt-4-turbo': dict(
        model='gpt-4-turbo',
        api_type=mc.ApiType.OPEN_AI,
        api_key = kaggle_secrets.UserSecretsClient().get_secret('OPENAI_API_KEY'),
    ),
    'mistral-large': dict(
        model='mistral-large-latest',
        api_type=mc.ApiType.OPEN_AI,
        api_base='https://api.mistral.ai/v1/',
        api_key = kaggle_secrets.UserSecretsClient().get_secret('MISTRAL_API_KEY'),
    ),
    'mixtral-8x22b': dict(
        model='open-mixtral-8x22b',
        api_type=mc.ApiType.OPEN_AI,
        api_base='https://api.mistral.ai/v1/',
        api_key = kaggle_secrets.UserSecretsClient().get_secret('MISTRAL_API_KEY'),
    ),
    'mixtral-8x7b': dict(
        model='open-mixtral-8x7b',
        api_type=mc.ApiType.OPEN_AI,
        api_base='https://api.mistral.ai/v1/',
        api_key = kaggle_secrets.UserSecretsClient().get_secret('MISTRAL_API_KEY'),
    ),
    'claude-3-opus': dict(
        model='claude-3-opus-20240229',
        llm_default_args=dict(max_tokens=2500),
        api_type=mc.ApiType.ANTHROPIC,
        api_key =kaggle_secrets.UserSecretsClient().get_secret('ANTHROPIC_API_KEY'),
    ),
}

#### Check if model works correctly

In [None]:
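def configure(config_name: str):
    """Activate the named model configuration in microcore."""
    custom = configs[config_name]
    # Remote API configs are used as-is; local ones are shallow-merged over `defaults`.
    # `dict(custom)` makes a copy so that the entry in `configs` is not mutated below.
    cfg = dict(custom) if custom.get('api_type') else {**defaults, **custom}
    cfg['use_logging'] = VERBOSE
    mc.configure(**cfg).describe()
    if mc.config().uses_local_model():
        mc.utils.show_vram_usage()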
# configure('mixtral-8x22b')

# timers.timer()
# out = mc.llm("count from 1 to 5")
# timers.speed(len(out), 'characters')
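
Note that `{**defaults, **configs[config_name]}` is a shallow merge: a nested dict such as `init_params` or `default_args` in a model config replaces the default one wholesale instead of being merged key by key. This is presumably why `llama3-8b` repeats `do_sample=True` and `gemma-1.1-2b` repeats `always_clear_mem=True`. The standalone sketch below illustrates the difference with plain dictionaries. The values are simplified stand-ins rather than the real configuration, and `deep_merge` is a hypothetical helper that is not part of this notebook or of ai-microcore.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Hypothetical helper: recursively merge `override` into a copy of `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Simplified stand-ins for `defaults` and one entry of `configs`.
toy_defaults = {
    "chat_mode": True,
    "init_params": {"quantize_4bit": True, "gradient_checkpointing": True},
}
toy_model = {"model": "some/path", "init_params": {"quantize_4bit": False}}

shallow = {**toy_defaults, **toy_model}
deep = deep_merge(toy_defaults, toy_model)

print(shallow["init_params"])  # only the keys given by the model config survive
print(deep["init_params"])     # default keys are kept and overridden one by one
```

With the shallow merge used in `configure`, any nested setting that a model should inherit has to be repeated in that model's own config.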

In [None]:
how_to_score="""# Holistic Rating for Source-Based Writing

After reading the essay and completing the analytical rating form, assign a holistic score based on the rubric
below. For the following evaluations you will need to use a grading scale between 1 (minimum) and 6
(maximum). As with the analytical rating form, the distance between each grade (e.g., 1-2, 3-4, 4-5) should be
considered equal.
SCORE OF 6: An essay in this category demonstrates clear and consistent mastery, although it may have a
few minor errors. A typical essay effectively and insightfully develops a point of view on the issue and
demonstrates outstanding critical thinking; the essay uses clearly appropriate examples, reasons, and other
evidence taken from the source text(s) to support its position; the essay is well organized and clearly focused,
demonstrating clear coherence and smooth progression of ideas; the essay exhibits skillful use of language,
using a varied, accurate, and apt vocabulary and demonstrates meaningful variety in sentence structure; the
essay is free of most errors in grammar, usage, and mechanics.
SCORE OF 5: An essay in this category demonstrates reasonably consistent mastery, although it will have
occasional errors or lapses in quality. A typical essay effectively develops a point of view on the issue and
demonstrates strong critical thinking; the essay generally using appropriate examples, reasons, and other
evidence taken from the source text(s) to support its position; the essay is well organized and focused,
demonstrating coherence and progression of ideas; the essay exhibits facility in the use of language, using
appropriate vocabulary demonstrates variety in sentence structure; the essay is generally free of most errors in
grammar, usage, and mechanics.
SCORE OF 4: An essay in this category demonstrates adequate mastery, although it will have lapses in
quality. A typical essay develops a point of view on the issue and demonstrates competent critical thinking; the
essay using adequate examples, reasons, and other evidence taken from the source text(s) to support its
position; the essay is generally organized and focused, demonstrating some coherence and progression of ideas
exhibits adequate; the essay may demonstrate inconsistent facility in the use of language, using generally
appropriate vocabulary demonstrates some variety in sentence structure; the essay may have some errors in
grammar, usage, and mechanics.
SCORE OF 3: An essay in this category demonstrates developing mastery, and is marked by ONE OR
MORE of the following weaknesses: develops a point of view on the issue, demonstrating some critical
thinking, but may do so inconsistently or use inadequate examples, reasons, or other evidence taken from the
source texts to support its position; the essay is limited in its organization or focus, or may demonstrate some
lapses in coherence or progression of ideas displays; the essay may demonstrate facility in the use of language,
but sometimes uses weak vocabulary or inappropriate word choice and/or lacks variety or demonstrates
problems in sentence structure; the essay may contain an accumulation of errors in grammar, usage, and
mechanics.
SCORE OF 2: An essay in this category demonstrates little mastery, and is flawed by ONE OR MORE of
the following weaknesses: develops a point of view on the issue that is vague or seriously limited, and
demonstrates weak critical thinking; the essay provides inappropriate or insufficient examples, reasons, or
other evidence taken from the source text to support its position; the essay is poorly organized and/or focused,
or demonstrates serious problems with coherence or progression of ideas; the essay displays very little facility
in the use of language, using very limited vocabulary or incorrect word choice and/or demonstrates frequent
problems in sentence structure; the essay contains errors in grammar, usage, and mechanics so serious that
meaning is somewhat obscured.
SCORE OF 1: An essay in this category demonstrates very little or no mastery, and is severely flawed by
ONE OR MORE of the following weaknesses: develops no viable point of view on the issue, or provides little
or no evidence to support its position; the essay is disorganized or unfocused, resulting in a disjointed or
incoherent essay; the essay displays fundamental errors in vocabulary and/or demonstrates severe flaws in
sentence structure; the essay contains pervasive errors in grammar, usage, or mechanics that persistently
interfere with meaning.
"""

In [None]:
prompts = dict(
    default=[
        """
        {{how_to_score}}

        # Target essay
        {{essay}}

        # Task
        Please read the essay and assign a score of 1,2,3,4,5,6 where 6 is the best.
        Output only a single number with no explanation.
        """,
    ],
    simple=[
        """
        Please read the following essay and assign a score of 1,2,3,4,5,6 where 6 is the best.

        {{essay}}

        Output only a single number with no explanation.
        """,
    ],
)

@timers.with_timer('Prediction')
def predict(essay, **kwargs):
    kwargs = {
        "prompt":prompts['default'],
        "how_to_score": how_to_score,
        "essay": essay,
        **kwargs
    }
    prompt = mc.fmt(kwargs.pop('prompt'), **kwargs)
    return mc.llm(prompt).parse_number(default=0, rounding=True)
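
`predict` relies on `parse_number(default=0, rounding=True)` to turn the model's reply into a score, and models do not always obey "output only a single number". The sketch below shows one possible stricter parser that accepts only a standalone digit from 1 to 6 and otherwise returns a default. `extract_score` is a hypothetical standalone function written for illustration; it is not part of ai-microcore and does not describe how `parse_number` works internally.

```python
import re


def extract_score(reply: str, default: int = 0) -> int:
    """Return the first standalone integer in 1..6 found in `reply`, else `default`."""
    # The lookarounds reject digits that belong to a longer number such as "10" or "16".
    match = re.search(r"(?<!\d)[1-6](?!\d)", reply)
    return int(match.group()) if match else default


for reply in ["4", "Score: 5/6", " 3\n", "10", "no number here"]:
    print(repr(reply), "->", extract_score(reply))
```

Returning an out-of-range default such as 0 makes failed parses easy to spot in `preds.csv`, but such rows still count as wrong predictions in the accuracy and kappa figures.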

In [None]:
RESET = False
BENCH_FILE = 'bench.csv'
PREDS_FILE = 'preds.csv'
bench = pd.read_csv(BENCH_FILE) if os.path.exists(BENCH_FILE) and not RESET else pd.DataFrame({
    "model": [],
    "prompt": [],
    "acc": [], "acc%": [],
    "kappa": [],
    "dur": [],
})  # .set_index(['model', 'prompt'])

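# `LIMIT_ROWS or None` keeps all rows when the limit is 0 or None.
if os.path.exists(PREDS_FILE) and not RESET:
    preds = pd.read_csv(PREDS_FILE)[:LIMIT_ROWS or None]
else:
    # Copy before renaming the columns so that a slice of `tdf` is not modified.
    preds = tdf[['essay_id', 'score']][:LIMIT_ROWS or None].copy()
    preds.columns = ['id', 'gt']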

display(bench)
display(preds)

#### Benchmarking & Execution Loop Functions 

In [None]:
def benchmark(model, prompt_id, gt_scores, pred_scores, durations) -> dict:
    """Summarize one model/prompt run: exact matches, quadratic weighted kappa, mean duration."""
    kappa = round(sklearn.metrics.cohen_kappa_score(gt_scores, pred_scores, weights='quadratic'), 4)
    acc = int((gt_scores == pred_scores).sum())  # number of exact matches
    tot = len(gt_scores)
    pacc = round(100 * acc / tot)  # accuracy in percent
    dur = round(sum(durations) / len(durations), 2)  # mean seconds per essay
    return {
        'model': model,
        'prompt': prompt_id,
        'acc': acc,
        'acc%': pacc,
        'kappa': kappa,
        'dur': dur,
    }
    
def predict_all(df, prompt_id) -> pd.DataFrame:
    sub = df[['essay_id']].copy()
    sub['score'] = 0
    sub['dur'] = 0.0  # float column: durations are fractional seconds
    for i, row in tqdm(df.iterrows(), total=len(df), desc=f"Scoring {len(df)} essays, prompt: {prompt_id}..."):
        s = datetime.now()
        predicted = predict(row.full_text, prompt=prompts[prompt_id][0])
        sub.loc[i, 'score'] = predicted
        sub.loc[i, 'dur'] = (datetime.now() - s).total_seconds()
        
        if predicted == row.score:
            print(ui.green(f"{i+1}: score {predicted} (correct!)"))
        else:
            print(ui.red(f"{i+1}: score {predicted} (gt={row.score})"))
        
    return sub
    
    
def not_benchmarked_prompts(model):
    return [prompt for prompt in prompts if f"{model}:{prompt}" not in preds.columns]

not_benchmarked_prompts('gpt-4-turbo')
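
The `kappa` column reported by `benchmark` is the quadratic weighted kappa, $\kappa = 1 - \frac{\sum_{i,j} w_{ij} O_{ij}}{\sum_{i,j} w_{ij} E_{ij}}$ with $w_{ij} = (i-j)^2/(N-1)^2$, where $O$ is the observed confusion matrix and $E$ the matrix expected from the two score histograms. The pure-Python sketch below spells this out for a fixed 1 to 6 scale. Fixing the label range is an assumption of this sketch: scikit-learn builds its weights from the positions of the labels that actually occur, so the two can differ when the observed scores have gaps (for example only 1, 3 and 6). `quadratic_weighted_kappa` is an illustrative name, not a function used elsewhere in this notebook.

```python
def quadratic_weighted_kappa(gt, pred, min_score=1, max_score=6):
    """Quadratic weighted kappa for integer scores on a fixed scale (illustrative sketch)."""
    if len(gt) != len(pred):
        raise ValueError("gt and pred must have the same length")
    n = max_score - min_score + 1
    observed = [[0] * n for _ in range(n)]
    for g, p in zip(gt, pred):
        # Guard against out-of-range values such as the 0 used for unparsable replies.
        if not (min_score <= g <= max_score and min_score <= p <= max_score):
            raise ValueError(f"score out of range: gt={g}, pred={p}")
        observed[g - min_score][p - min_score] += 1

    total = len(gt)
    gt_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = gt_hist[i] * pred_hist[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0:
        # Degenerate case: every ground-truth and predicted score is the same value.
        return 1.0
    return 1.0 - numerator / denominator


print(quadratic_weighted_kappa([1, 2, 3], [1, 2, 3]))  # perfect agreement
print(quadratic_weighted_kappa([1, 6], [6, 1]))        # maximal disagreement
```

Because the penalty grows with the squared distance, predicting 5 for a 6 costs far less than predicting 1, which is why kappa can rank models differently from plain accuracy.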


#### Inference

In [None]:
df = tdf[:LIMIT_ROWS].copy() if LIMIT_ROWS else tdf.copy()


target_models = [  # Leave empty to run it for all models
    'mixtral-8x-7b(Kaggle T4x2)'
]

for model in configs:
    if target_models and model not in target_models: continue
    target_prompts = not_benchmarked_prompts(model)
    if not target_prompts: continue
        
    print(ui.magenta(f"Starting inference using {model}"))
    configure(model)
    
    for prompt_id in target_prompts:
        
        print(ui.magenta(f"{model} x {prompt_id}"))
        
        sub = predict_all(df, prompt_id)
        
        preds[f"{model}:{prompt_id}"] = sub.score
        preds.to_csv(PREDS_FILE, index=False)
        
        bench_res = benchmark(model, prompt_id, df.score, sub.score, sub.dur)
        # Overwrite the existing row for this model/prompt pair, or append a new row.
        bench.loc[bench[(bench.model == model) & (bench.prompt == prompt_id)].index.to_list() or len(bench)] = bench_res
        bench = bench.sort_values(by='kappa', ascending=False)
        bench.to_csv(BENCH_FILE, index=False)
        display(bench)
        display(preds)
        
print("Done")    

#### Display the Results

In [None]:
display(bench)
display(preds)
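
For further analysis outside the notebook, `preds.csv` can be read back with nothing but the standard library. The sketch below assumes the layout produced above: an `id` column, a `gt` column, and one `model:prompt` column per run, with no missing values. The CSV text is made-up toy data used only to keep the example self-contained; it does not contain benchmark results, and `accuracy_by_column` is a hypothetical helper.

```python
import csv
import io


def accuracy_by_column(csv_text: str) -> dict:
    """Map (model, prompt_id) to the share of predictions that exactly match `gt`."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    pred_columns = [name for name in rows[0] if name not in ("id", "gt")]
    result = {}
    for column in pred_columns:
        # Column names follow the f"{model}:{prompt_id}" pattern used in this notebook.
        model, prompt_id = column.rsplit(":", 1)
        # Compare as floats because pandas may write scores as "3" or "3.0".
        hits = sum(float(row[column]) == float(row["gt"]) for row in rows)
        result[(model, prompt_id)] = hits / len(rows)
    return result


# Made-up toy rows, NOT real benchmark output.
toy_csv = """id,gt,model-a:default,model-a:simple
e1,3,3,4
e2,4,4,3
e3,2,3,2
e4,5,5,5
"""

for key, accuracy in accuracy_by_column(toy_csv).items():
    print(key, accuracy)
```

To analyze a real run, pass the contents of the saved file instead, for example `accuracy_by_column(open('preds.csv').read())`.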


#### If something goes wrong...

In [None]:
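def drop_data(model):
    """Remove all benchmark rows and prediction columns for `model`."""
    global bench, preds
    # Resetting the index keeps row labels contiguous, so appending at `len(bench)` stays safe.
    bench = bench[bench['model'] != model].reset_index(drop=True)
    bench.to_csv(BENCH_FILE, index=False)
    for prompt_id in prompts:
        if f'{model}:{prompt_id}' in preds.columns:
            preds = preds.drop(columns=[f'{model}:{prompt_id}'])
    preds.to_csv(PREDS_FILE, index=False)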
    
# drop_data('gpt-4-turbo')
# display(bench)
# display(preds.head(3))


In [None]:
bench.to_csv(BENCH_FILE, index=False) 
preds.to_csv(PREDS_FILE, index=False)