## Tuning of the evaluator

The purpose of this notebook is to measure how closely an LLM's evaluations of LLM-generated solutions agree with human evaluations.

* It creates a tuning set by selecting random tasks from the whole benchmark set.
* It contains code that compares the match scores given to the LLM-generated solutions by a human and by an LLM, using a confusion matrix.

If the disagreement is too high, the prompt in the evaluation.py file might need adjusting. The notebook also contains code to help analyze the cases where the match scores differed.

When ready, the evaluator code is used in the "Benchmarking" notebook.

##### Imports

In [13]:
# !pip install -r requirements.txt

In [1]:
import os, sys
import json
import zipfile
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
from io import StringIO
import importlib
from collections import defaultdict
import random
from pprint import pprint, pformat, PrettyPrinter
import re
from tqdm import tqdm

In [None]:
from geobenchx.constants import DATA_FOLDER, RESULTS_FOLDER, MODEL_CLAUDE, MODEL_GEMINI_ADV, MODEL_GPT_41, MODEL_GPT_mini, MODEL_O3, MODEL_CLAUDE_ADV4
import geobenchx.dataclasses
importlib.reload(geobenchx.dataclasses)
from geobenchx.dataclasses import TaskSet, Task, Solution

import geobenchx.utils
importlib.reload(geobenchx.utils)
from geobenchx.utils import generate_timestamp_id, get_dataframe_info, get_solution_code


import geobenchx.agent
importlib.reload(geobenchx.agent)
from geobenchx.agent import execute_task

import geobenchx.evaluation
importlib.reload(geobenchx.evaluation)
from geobenchx.evaluation import score_task_solution, generate_eval_stats_evaluator, score_solutions_set

#### Selecting random tasks for annotation

In [4]:
source_tasks = 'tasks_and_reference_solutions.json' # name of file with tasks and reference solutions (ground truth solutions)

In [3]:
tuning_tasks_filename = 'evaluator_tuning_set.json' # name of the file with tasks, each with a reference solution, a candidate solution and a manual match score

In [4]:
# Selecting the tasks from 'source_tasks' for the evaluator tuning set, or reading them from an already existing file with the tuning tasks set

if os.path.exists(os.path.join(DATA_FOLDER, tuning_tasks_filename)): 
    evaluator_tuning_tasks = TaskSet.read_from_file(tuning_tasks_filename, DATA_FOLDER)   
else:
    tasks = TaskSet.read_from_file(source_tasks, DATA_FOLDER)
    evaluator_tuning_tasks = tasks.sample_stratified(40)

In [5]:
# Checking the size of the tuning set and its composition by types of the selected tasks

print(len(evaluator_tuning_tasks))
evaluator_tuning_tasks.get_labels_counts()

50


{<TaskLabels.MERGE_VISUALIZE: 'Merge, Visualize'>: 8,
 <TaskLabels.TASK_SET_01: 'Task Set 01'>: 8,
 <TaskLabels.SPATIAL_OPERATIONS: 'Spatial operations'>: 16,
 <TaskLabels.TASK_SET_03: 'Task Set 03'>: 16,
 <TaskLabels.VAGUE: 'Vague'>: 4,
 <TaskLabels.HEATMAPS_CONTOUR_LINES: 'Heatmaps, Contour Lines'>: 14,
 <TaskLabels.TASK_SET_04: 'Task Set 04'>: 14,
 <TaskLabels.PROCESS_MERGE_VISUALIZE: 'Process, Merge, Visualize'>: 12,
 <TaskLabels.TASK_SET_02: 'Task Set 02'>: 12,
 <TaskLabels.HARD: 'Hard'>: 1}

In [None]:
# saving the tuning set for evaluation if needed

evaluator_tuning_tasks.save_to_file(tuning_tasks_filename, DATA_FOLDER)

### Attention!

In the evaluator_tuning_set.json file in the repository, the LLM solutions are already generated. 

If a new file is generated using the above part of the notebook:
1. Generate solutions with an LLM of your choice, and
2. Score them manually by comparing the reference and candidate solutions, either in the GUI (run tasks_editor) or directly in the JSON file, entering the match score and match reasoning under the keys "match_score_Human" (required) and "match_reasoning_Human" (optional) in the new file.
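
Before running the LLM evaluation, it can help to confirm that every task already has a human score. The sketch below is only illustrative: it assumes the tasks are available as a list of plain dictionaries keyed by "task_ID" and "match_score_Human", which may not match the actual layout of the tuning file, and `find_unscored` is a hypothetical helper that is not part of geobenchx.

```python
def find_unscored(tasks, score_key="match_score_Human"):
    """Return the IDs of task dicts that have no human match score yet.

    The comparison uses `is None` so that a valid score of 0 is not
    mistaken for a missing one.
    """
    return [task.get("task_ID") for task in tasks if task.get(score_key) is None]


# Hypothetical records, for illustration only
example_tasks = [
    {"task_ID": "A", "match_score_Human": 2},
    {"task_ID": "B", "match_score_Human": None},
    {"task_ID": "C"},
    {"task_ID": "D", "match_score_Human": 0},
]

unscored = find_unscored(example_tasks)
print(unscored)
```

In this toy data, tasks "B" and "C" are reported as unscored, while "D" is kept because 0 is a legitimate match score.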

### Evaluate solutions in the tuning set

The LLM scores will be saved directly in the tuning tasks file. 

After scoring the set with an LLM, proceed to the next part of the notebook to calculate how close the LLM's evaluations are to the human scores.

Repeat for any LLM you plan to use for evaluations.

In [10]:
# Select the model to evaluate the solutions

model = MODEL_GPT_41
# The default temperature for evaluating tasks is 0. To change it, use the line below. For OpenAI's o3-mini, use temperature = None

# temperature = 


In [11]:
model

'gpt-4.1-2025-04-14'

In [12]:
# Run evaluation of the whole set by the selected model
# The LLM match scores will be saved directly in the tuning tasks file.
# Proceed to the next part of the notebook to see how close the LLM's scores are to the human scores

score_solutions_set(tuning_tasks_filename, DATA_FOLDER, model, skip_scored=False)

  0%|          | 0/50 [00:00<?, ?it/s]

Task ID: TASK_250309_135125_530802
Task text: Map agricultural GDP contribution by region.


  2%|▏         | 1/50 [00:05<04:07,  5.05s/it]

Matching score: 1
input tokens: 15145, output tokens: 246
Task ID: TASK_250309_135125_315340
Task text: Map air quality index for major cities in South Asia?


  4%|▍         | 2/50 [00:07<02:45,  3.45s/it]

Matching score: 2
input tokens: 14859, output tokens: 115
Task ID: TASK_250309_135125_310610
Task text: What is the total length of railways within areas that received more than 3 feet of snow this season in the USA?


  6%|▌         | 3/50 [00:14<03:54,  4.99s/it]

Matching score: 1
input tokens: 15809, output tokens: 453
Task ID: TASK_250309_135125_367674
Task text: Make a heatmap showing population concentration in earthquake-affected zones


  8%|▊         | 4/50 [00:20<04:12,  5.49s/it]

Matching score: 0
input tokens: 15168, output tokens: 402
Task ID: TASK_250309_135125_914148
Task text: Visualize agricultural contribution to GDP worldwide.


 10%|█         | 5/50 [00:25<03:59,  5.31s/it]

Matching score: 1
input tokens: 15090, output tokens: 180
Task ID: TASK_250309_135125_397320
Task text: Generate a heatmap of population density around African mineral extraction sites


 12%|█▏        | 6/50 [00:29<03:41,  5.04s/it]

Matching score: 0
input tokens: 14953, output tokens: 251
Task ID: TASK_250309_135125_170096
Task text: Show population density patterns across regions.


 14%|█▍        | 7/50 [00:34<03:27,  4.81s/it]

Matching score: 0
input tokens: 15014, output tokens: 251
Task ID: TASK_250309_135125_251255
Task text: Make contour lines of snow accumulation for areas with railway stations in the USA


 16%|█▌        | 8/50 [00:42<04:09,  5.93s/it]

Matching score: 2
input tokens: 15386, output tokens: 568
Task ID: TASK_250309_135125_488600
Task text: Create a heatmap showing population density near earthquake epicenters


 18%|█▊        | 9/50 [00:49<04:11,  6.14s/it]

Matching score: 0
input tokens: 14933, output tokens: 222
Task ID: TASK_250309_135125_435973
Task text: Generate a heatmap of power station density in regions with water scarcity


 20%|██        | 10/50 [00:55<04:10,  6.26s/it]

Matching score: 0
input tokens: 15260, output tokens: 310
Task ID: TASK_250309_135125_299410
Task text: What are main agriculture cultures in Blefuscu?


 22%|██▏       | 11/50 [00:58<03:26,  5.29s/it]

Matching score: 2
input tokens: 14850, output tokens: 110
Task ID: TASK_250309_135125_304935
Task text: Visualize water usage patterns in Great Lakes region of Africa


 24%|██▍       | 12/50 [01:01<02:46,  4.39s/it]

Matching score: 2
input tokens: 14860, output tokens: 105
Task ID: TASK_250309_135125_932053
Task text: How many African power stations are located in countries with significant forest loss?


 26%|██▌       | 13/50 [01:09<03:30,  5.70s/it]

Matching score: 0
input tokens: 16577, output tokens: 562
Task ID: TASK_250309_135125_222041
Task text: How does electric consumption vary across Arctic nations?


 28%|██▊       | 14/50 [01:14<03:18,  5.51s/it]

Matching score: 0
input tokens: 15245, output tokens: 334
Task ID: TASK_250309_135125_332083
Task text: How many mineral extraction facilities in Algeria are located within areas with population density over 1000 people per square km?


 30%|███       | 15/50 [01:20<03:12,  5.49s/it]

Matching score: 2
input tokens: 15190, output tokens: 417
Task ID: TASK_250309_135125_811107
Task text: What is the total length of railways in Brazilian states with GDP per capita above national average?


 32%|███▏      | 16/50 [01:27<03:20,  5.88s/it]

Matching score: 0
input tokens: 15017, output tokens: 307
Task ID: TASK_250309_135125_536420
Task text: Compare population density between flood-affected and non-flood-affected areas in Bangladesh during August 2018


 34%|███▍      | 17/50 [01:32<03:09,  5.76s/it]

Matching score: 0
input tokens: 15015, output tokens: 250
Task ID: TASK_250309_135125_748943
Task text: Show rural population percentages worldwide.


 36%|███▌      | 18/50 [01:38<03:05,  5.81s/it]

Matching score: 0
input tokens: 15251, output tokens: 332
Task ID: TASK_250309_135125_710732
Task text: How many mineral extraction facilities in Africa are located in countries with rapid urbanization?


 38%|███▊      | 19/50 [01:49<03:42,  7.19s/it]

Matching score: 0
input tokens: 15864, output tokens: 445
Task ID: TASK_250309_135125_498483
Task text: Show the distribution of water stress in Mediterranean coastal areas?


 40%|████      | 20/50 [01:58<03:52,  7.76s/it]

Matching score: 0
input tokens: 15173, output tokens: 438
Task ID: TASK_250309_135125_418513
Task text: How many power stations in Africa are located within 10 km of major railways?


 42%|████▏     | 21/50 [02:04<03:33,  7.35s/it]

Matching score: 2
input tokens: 15340, output tokens: 334
Task ID: TASK_250309_135125_459898
Task text: Map the distribution of renewable water resources in arid regions.


 44%|████▍     | 22/50 [02:07<02:46,  5.95s/it]

Matching score: 2
input tokens: 14860, output tokens: 113
Task ID: TASK_250309_135125_365343
Task text: How many rivers flow through areas with significant forest coverage in South America?


 46%|████▌     | 23/50 [02:32<05:21, 11.91s/it]

Matching score: 2
input tokens: 15458, output tokens: 672
Task ID: TASK_250309_135125_381719
Task text: Calculate the total population affected by floods in Peru during February 2018


 48%|████▊     | 24/50 [02:37<04:08,  9.56s/it]

Matching score: 2
input tokens: 15147, output tokens: 234
Task ID: TASK_250309_135125_918547
Task text: What is the average GDP per capita in 2020 for countries that experienced earthquakes in the last 30 days?


 50%|█████     | 25/50 [02:47<04:04,  9.77s/it]

Matching score: 2
input tokens: 15378, output tokens: 553
Task ID: TASK_250309_135125_913900
Task text: Show rates of deforestation over the last decade.


 52%|█████▏    | 26/50 [02:52<03:20,  8.35s/it]

Matching score: 2
input tokens: 15198, output tokens: 271
Task ID: TASK_250309_135125_460208
Task text: Map total population distribution globally.


 54%|█████▍    | 27/50 [02:56<02:42,  7.07s/it]

Matching score: 2
input tokens: 15111, output tokens: 280
Task ID: TASK_250309_135125_797835
Task text: Visualize fertility rates across Sub-Saharan Africa


 56%|█████▌    | 28/50 [03:03<02:32,  6.93s/it]

Matching score: 0
input tokens: 15238, output tokens: 352
Task ID: TASK_250309_135125_996706
Task text: Create contour lines of snow accumulation near major water bodies


 58%|█████▊    | 29/50 [03:13<02:48,  8.02s/it]

Matching score: 1
input tokens: 15490, output tokens: 650
Task ID: TASK_250309_135125_763412
Task text: Create contour lines from Chile's population density data in relation to rivers


 60%|██████    | 30/50 [03:25<03:05,  9.28s/it]

Matching score: 1
input tokens: 15545, output tokens: 620
Task ID: TASK_250309_135125_265245
Task text: Create a heatmap of power station density in regions with high energy demand


 62%|██████▏   | 31/50 [03:30<02:31,  7.97s/it]

Matching score: 0
input tokens: 14927, output tokens: 240
Task ID: TASK_250309_135125_419069
Task text: Generate a heatmap of USA population density in counties with reported tuberculosis cases


 64%|██████▍   | 32/50 [03:37<02:15,  7.53s/it]

Matching score: 0
input tokens: 15152, output tokens: 343
Task ID: MV2502221355_03890300
Task text: Compare the freshwater withdrawal between African countries with and without significant railway networks


 66%|██████▌   | 33/50 [03:51<02:41,  9.48s/it]

Matching score: 2
input tokens: 15824, output tokens: 836
Task ID: TASK_250309_135125_470604
Task text: Make a heatmap of power station density in regions with high forest depletion


 68%|██████▊   | 34/50 [03:59<02:27,  9.24s/it]

Matching score: 1
input tokens: 15907, output tokens: 644
Task ID: TASK_250309_135125_554817
Task text: Generate contour lines from snow cover data for 2023-2024 season and compare with railway stations locations in the northern states


 70%|███████   | 35/50 [04:06<02:04,  8.33s/it]

Matching score: 0
input tokens: 15468, output tokens: 468
Task ID: TASK_250309_135125_663887
Task text: Compare environmental protection spending across developed nations.


 72%|███████▏  | 36/50 [04:09<01:34,  6.74s/it]

Matching score: 2
input tokens: 14861, output tokens: 126
Task ID: TASK_250309_135125_898861
Task text: Map the distribution of agricultural productivity in sub-Saharan Africa.


 74%|███████▍  | 37/50 [04:15<01:26,  6.64s/it]

Matching score: 0
input tokens: 15232, output tokens: 424
Task ID: TASK_250309_135125_429627
Task text: Calculate the average electric power consumption in African countries with power stations capacity above 1000MW


 76%|███████▌  | 38/50 [04:24<01:27,  7.28s/it]

Matching score: 2
input tokens: 15437, output tokens: 416
Task ID: TASK_250309_135125_544781
Task text: Map marine protected areas in coastal nations.


 78%|███████▊  | 39/50 [04:26<01:04,  5.82s/it]

Matching score: 2
input tokens: 14856, output tokens: 114
Task ID: TASK_250309_135125_826615
Task text: Show agricultural GDP contribution in Nordic countries


 80%|████████  | 40/50 [04:34<01:02,  6.29s/it]

Matching score: 0
input tokens: 15241, output tokens: 425
Task ID: TASK_250309_135125_191153
Task text: Show water withdrawal patterns in BRICS nations.


 82%|████████▏ | 41/50 [04:43<01:04,  7.20s/it]

Matching score: 0
input tokens: 15294, output tokens: 653
Task ID: TASK_250309_135125_279238
Task text: Make a heatmap showing population concentration in flood-risk zones of South America


 84%|████████▍ | 42/50 [04:48<00:53,  6.68s/it]

Matching score: 0
input tokens: 14949, output tokens: 182
Task ID: TASK_250309_135125_305778
Task text: How many power stations in Africa are located in economically strategic regions?


 86%|████████▌ | 43/50 [04:54<00:44,  6.36s/it]

Matching score: 0
input tokens: 14857, output tokens: 133
Task ID: TASK_250309_135125_868995
Task text: Compare healthcare expenditure across Middle Eastern countries.


 88%|████████▊ | 44/50 [05:02<00:40,  6.70s/it]

Matching score: 0
input tokens: 15066, output tokens: 460
Task ID: TASK_250309_135125_565545
Task text: Create a heatmap of earthquakes magnitude in Indonesia, Malaysia and Phillippines


 90%|█████████ | 45/50 [05:43<01:25, 17.12s/it]

Matching score: 2
input tokens: 15236, output tokens: 794
Task ID: TASK_250309_135125_478472
Task text: Map digital literacy rates in Southeast Asian nations.


 92%|█████████▏| 46/50 [05:46<00:51, 12.95s/it]

Matching score: 2
input tokens: 14857, output tokens: 109
Task ID: TASK_250309_135125_861416
Task text: Visualize regional economic patterns using GDP per capita.


 94%|█████████▍| 47/50 [05:57<00:36, 12.33s/it]

Matching score: 1
input tokens: 15163, output tokens: 296
Task ID: TASK_250309_135125_130168
Task text: How many mineral extraction facilities in Africa are located in countries with negative net migration?


 96%|█████████▌| 48/50 [06:30<00:37, 18.52s/it]

Matching score: 0
input tokens: 16058, output tokens: 588
Task ID: TASK_250309_135125_720013
Task text: Show electric consumption in G7 and BRICS nations


 98%|█████████▊| 49/50 [06:44<00:17, 17.21s/it]

Matching score: 0
input tokens: 15380, output tokens: 509
Task ID: TASK_250309_135125_908870
Task text: Visualize total population distribution by country.


100%|██████████| 50/50 [06:50<00:00,  8.20s/it]

Matching score: 2
input tokens: 15124, output tokens: 235
Total input tokens: 762313, total output tokens: 18372





### Evaluate comparisons

1. Count how many of the human and LLM match scores agree.
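
As a minimal sketch of what this count involves, the code below builds a confusion matrix and an agreement rate from two parallel lists of scores. It is not the implementation of `generate_eval_stats_evaluator`; the helper names `confusion_matrix` and `agreement_rate` and the toy score lists are assumptions made for illustration.

```python
from collections import Counter


def confusion_matrix(human_scores, llm_scores, levels=(0, 1, 2)):
    """Count score pairs; rows are human scores, columns are LLM scores."""
    pairs = Counter(zip(human_scores, llm_scores))
    return [[pairs[(h, l)] for l in levels] for h in levels]


def agreement_rate(human_scores, llm_scores):
    """Share of tasks where the human and LLM gave the same score."""
    matches = sum(h == l for h, l in zip(human_scores, llm_scores))
    return matches / len(human_scores)


# Toy scores, for illustration only
human = [0, 0, 1, 2, 2, 2]
llm = [0, 1, 1, 2, 2, 0]

matrix = confusion_matrix(human, llm)
rate = agreement_rate(human, llm)

for level, row in zip((0, 1, 2), matrix):
    print(f"Human_{level}", row)
print(f"agreement: {rate:.2f}")
```

The diagonal of the matrix holds the agreeing tasks. Off-diagonal cells show the direction of disagreement, for example whether the LLM tends to score higher or lower than the human.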

In [13]:
# Generate the matrix of LLM scores vs Human scores, the share of tasks for which the scores are the same, and its confidence interval using the Wilson score formula
generate_eval_stats_evaluator(tuning_tasks_filename, DATA_FOLDER)

         LLM_0  LLM_1  LLM_2
Human_0     23      1      0
Human_1      0      5      0
Human_2      1      1     19
0.94
(0.8378290831116182, 0.979385029651026)
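
The printed agreement and interval can be checked by hand from the matrix above. The sketch below sums the diagonal (23 + 5 + 19 = 47 of 50 tasks) and applies the standard Wilson score interval for a binomial proportion with z = 1.96. The interval method is an assumption: it is chosen because it reproduces the printed bounds to about three decimals, not because the evaluation.py source confirms it. `wilson_interval` is a hypothetical helper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Confusion matrix copied from the output above (rows: Human, columns: LLM)
matrix = [
    [23, 1, 0],
    [0, 5, 0],
    [1, 1, 19],
]

agree = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
low, high = wilson_interval(agree, total)

print(agree / total)
print((low, high))
```

Unlike the simpler normal-approximation interval, the Wilson interval stays within [0, 1] when the agreement rate is close to 1, which matters for a set of only 50 tasks.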


In [None]:
# Select tasks with particular combination of Human and LLM score
tasks_selected = [task for task in evaluator_tuning_tasks if task.match_score_LLM is not None and (task.match_score_Human.value, task.match_score_LLM.value)==(2, 0)]
tasks_selected

[|                                  | Task details                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
 |:---------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [17]:
# Select one task from the tuning set to see the task details, reference and candidate solutions, and scores

check = [task for task in evaluator_tuning_tasks if task.task_ID == 'TASK_250309_135125_530802']
check

[|                                  | Task details                                                                                                                                                                                                                                                                                                     |
 |:---------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | task_ID                          | TASK_250309_135125_530802                                                                                                                                                                                                                                                       

### When ready, the evaluator code is used in the "Benchmarking" notebook.