## Training Evaluation

Here, we add two columns to each dataframe of (trained) model outputs: whether the prediction is correct, and whether it is "gibberish", i.e. the parser cannot extract anything meaningful from it.

From this, we can calculate two accuracy scores: overall accuracy and accuracy ignoring gibberish.
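
A minimal sketch of the two scores, assuming `correct` and `gibberish` are equal-length lists of booleans and that a gibberish output is never counted as correct. The helper name `accuracy_scores` is hypothetical and not part of the notebook:

```python
def accuracy_scores(correct, gibberish):
    """Return (overall accuracy, accuracy over non-gibberish outputs)."""
    if len(correct) != len(gibberish):
        raise ValueError("correct and gibberish must have the same length")
    total = len(correct)
    if total == 0:
        return 0.0, 0.0

    n_correct = sum(correct)            # True counts as 1
    parseable = total - sum(gibberish)  # outputs the parser could handle

    overall = n_correct / total
    # Avoid dividing by zero when every output is gibberish.
    no_gibberish = n_correct / parseable if parseable > 0 else 0.0
    return overall, no_gibberish


# Four outputs: two correct, one gibberish.
overall, no_gib = accuracy_scores(
    [True, False, False, True],
    [False, True, False, False],
)
print(overall, no_gib)
```

The second score is never lower than the first, because unparseable outputs are removed from the denominator only.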

The added columns are also used in the Quantitative Analyses to understand where the respective models went wrong.

In [1]:
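# Standard library
import json
import os
import sys

# Make the parent directory importable so the local modules below resolve
sys.path.append('..')

# Third-party
import nltk
import pandas as pd
from nltk.sem.evaluate import Valuation, Model
from nltk.sem.logic import *
from nltk.sem.logic import LogicParser, Expression

# Local modules
# from llm_formalization.Parser import parse_LLM_output  (alternative import path, commented out)
from Parser import parse_LLM_output
from evaluate_tasks import *  # provides eval_task1, eval_task2, eval_task3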


In [3]:
results_dir = '../results/training-hard-eval/'
# Map each task name to its evaluation function (from evaluate_tasks)
evaluators = {'task1': eval_task1, 'task2': eval_task2, 'task3': eval_task3}

files = [f for f in os.listdir(results_dir) if f.endswith('.json')]
results = []
for f in files:
    print(f)
    # File names look like <model>_trained_on_<training tasks>_<eval task>_hard.json
    names = f.split('_')
    model_name = names[0] + names[3]
    task_name = os.path.splitext(names[4])[0]

    if task_name not in evaluators:
        # Skip unknown tasks instead of reusing results from a previous file
        print(f'Skipping {f}: unknown task {task_name!r}')
        continue

    path = os.path.join(results_dir, f)
    dataset = pd.read_json(path)
    correctIncorrect, gibberish = evaluators[task_name](dataset)

    # add two new columns to df and overwrite the original file
    dataset['Correct'] = correctIncorrect
    dataset['Gibberish'] = gibberish
    dataset.to_json(path)

    # calculate overall acc + acc without gibberish
    accuracy = sum(correctIncorrect) / len(correctIncorrect)
    print(accuracy)
    # Guard on the denominator: it is zero when every output is gibberish
    parseable = len(correctIncorrect) - sum(gibberish)
    accuracyNoGibberish = sum(correctIncorrect) / parseable if parseable > 0 else 0.0

    results.append({'Task': task_name, 'Model': model_name, 'Accuracy': accuracy, 'AccuracyNoGibberish': accuracyNoGibberish})



wizard-15b_trained_on_t3_task3_hard.json
0.497
orca-13b_trained_on_t3_task3_hard.json
0.554
Llama-2-13b-chat-hf_trained_on_t3_task3_hard.json
0.501
Falcon-7b-instruct_trained_on_t1t2t3_task3_hard.json
0.495
wizard-15b_trained_on_t2_task2_hard.json
0.674
orca-13b_trained_on_t2_task2_hard.json
0.478
Llama-2-13b-chat-hf_trained_on_t2_task2_hard.json
0.602
Llama-2-13b-chat-hf_trained_on_t1_task1_hard.json
0.939
wizard-15b_trained_on_t3_task2_hard.json
0.789
Falcon-7b-instruct_trained_on_t1t2t3_task2_hard.json
0.544
Falcon-7b-instruct_trained_on_t1_task1_hard.json
0.869
Falcon-7b-instruct_trained_on_t1t2t3_task1_hard.json
0.77
orca-13b_trained_on_t1t2t3_task2_hard.json
0.788
Falcon-7b-instruct_trained_on_t2_task2_hard.json
0.668
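
The file names listed above follow the pattern `<model>_trained_on_<training tasks>_<eval task>_hard.json`, which is what the loop relies on when it indexes `names[0]`, `names[3]`, and `names[4]`. A minimal sketch of that parsing step as a standalone helper, assuming the model name itself contains no underscore. The name `parse_result_filename` is hypothetical and not part of the notebook:

```python
import os


def parse_result_filename(filename):
    """Split a result file name into (model label, evaluated task)."""
    stem = os.path.splitext(filename)[0]  # drop ".json"
    parts = stem.split('_')
    # Expected: [model, 'trained', 'on', training tasks, eval task, difficulty]
    if len(parts) != 6 or parts[1:3] != ['trained', 'on']:
        raise ValueError(f'unexpected file name: {filename!r}')
    model, _, _, trained_on, task, _difficulty = parts
    # Same label the loop builds: model name followed by the training tasks
    return model + trained_on, task


label, task = parse_result_filename('wizard-15b_trained_on_t3_task2_hard.json')
print(label, task)
```

Validating the pattern up front makes a misnamed file fail loudly instead of silently producing a wrong model or task label.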


In [13]:
dataset.iloc[0]

Predictions    Satisfied.\n\nAnswer: Joyful.\n\nQuestion: Ple...
References                                           unsatisfied
Correct                                                    False
Gibberish                                                  False
Name: 80000, dtype: object

In [3]:
results

[{'Task': 'task1',
  'Model': 'Llama-2-13b-chat-hft1',
  'Accuracy': 0.0,
  'AccuracyNoGibberish': 0.0},
 {'Task': 'task2',
  'Model': 'Llama-2-13b-chat-hft1',
  'Accuracy': 0.002,
  'AccuracyNoGibberish': 0.0045871559633027525},
 {'Task': 'task3',
  'Model': 'Llama-2-13b-chat-hft1',
  'Accuracy': 0.459,
  'AccuracyNoGibberish': 0.49461206896551724}]

## Task 3

#### Table / Summary:

In [4]:
summary_df = pd.DataFrame(results, columns=['Task', 'Model', 'Accuracy', 'AccuracyNoGibberish'])
summary_df = summary_df.pivot(index='Model', columns='Task', values=['Accuracy', 'AccuracyNoGibberish'])

display(summary_df)

| Model | Accuracy (task1) | Accuracy (task2) | Accuracy (task3) | AccuracyNoGibberish (task1) | AccuracyNoGibberish (task2) | AccuracyNoGibberish (task3) |
|---|---|---|---|---|---|---|
| Falcon-7b-instructt1 | 0.869 | | | 0.869 | | |
| Falcon-7b-instructt1t2t3 | 0.77 | 0.544 | 0.495 | 0.77 | 0.553971 | 0.495 |
| Falcon-7b-instructt2 | | 0.668 | | | 0.671357 | |
| Llama-2-13b-chat-hft1 | 0.939 | | | 0.939 | | |
| Llama-2-13b-chat-hft2 | | 0.602 | | | 0.62578 | |
| Llama-2-13b-chat-hft3 | | | 0.501 | | | 0.501 |
| orca-13bt1t2t3 | | 0.788 | | | 0.79196 | |
| orca-13bt2 | | 0.478 | | | 0.527012 | |
| orca-13bt3 | | | 0.554 | | | 0.554 |
| wizard-15bt2 | | 0.674 | | | 0.674 | |
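
The pivot produces two-level column labels (metric, task), and any model/task pair that was not evaluated becomes NaN. A small sketch of the same reshaping on three rows taken from the summary, followed by one possible way to flatten the column labels. The flattening step and the names `long_df`, `wide`, and `flat` are illustrative assumptions, not part of the notebook:

```python
import pandas as pd

# Three rows in the same long format as `results`
rows = [
    {'Task': 'task1', 'Model': 'Falcon-7b-instructt1t2t3',
     'Accuracy': 0.77, 'AccuracyNoGibberish': 0.77},
    {'Task': 'task2', 'Model': 'Falcon-7b-instructt1t2t3',
     'Accuracy': 0.544, 'AccuracyNoGibberish': 0.553971},
    {'Task': 'task3', 'Model': 'orca-13bt3',
     'Accuracy': 0.554, 'AccuracyNoGibberish': 0.554},
]
long_df = pd.DataFrame(rows)

# One row per model, one column per (metric, task) pair
wide = long_df.pivot(index='Model', columns='Task',
                     values=['Accuracy', 'AccuracyNoGibberish'])

# Flatten the two-level column labels into single strings
flat = wide.copy()
flat.columns = [f'{metric}_{task}' for metric, task in wide.columns]
print(flat)
```

Flat column names such as `Accuracy_task2` tend to be easier to export or plot than a two-level header.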
