## Training Evaluation

Here we annotate the dataframes that contain the outputs of the trained models with two columns: whether each output is correct, and whether it is "gibberish", i.e. the parser cannot extract anything meaningful from it.

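From these annotations we calculate two accuracy scores: overall accuracy (correct outputs divided by all outputs) and accuracy ignoring gibberish (correct outputs divided by the outputs that are not gibberish).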
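As a minimal sketch of the two scores, assuming `correct` and `gibberish` are equally long lists of booleans and that a gibberish output is never counted as correct (the helper name `accuracy_scores` is hypothetical and not part of `evaluate_tasks`):

```python
def accuracy_scores(correct, gibberish):
    """Return (overall accuracy, accuracy ignoring gibberish)."""
    total = len(correct)
    if total == 0:
        return 0.0, 0.0
    n_correct = sum(correct)
    # Outputs the parser could make sense of.
    n_parsable = total - sum(gibberish)
    overall = n_correct / total
    # Guard against the case where every output is gibberish.
    no_gibberish = n_correct / n_parsable if n_parsable > 0 else 0.0
    return overall, no_gibberish


# Toy data: 4 outputs, 2 correct, 1 gibberish.
overall, no_gib = accuracy_scores(
    [True, False, False, True],
    [False, True, False, False],
)
print(overall, no_gib)
```

With these toy values the overall accuracy is 2/4 and the accuracy ignoring gibberish is 2/3, since one of the four outputs is dropped from the denominator.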

The annotated dataframes are also used in the Quantitative Analyses to understand where the respective models went wrong.
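The following sketch illustrates how the two annotation columns might be used for such an error analysis. The column names `Predictions`, `References`, `Correct` and `Gibberish` match the dataframes in this notebook; the toy rows, the `Outcome` column and the `outcome` helper are made up for illustration.

```python
import pandas as pd

# Toy stand-in for one annotated results file.
df = pd.DataFrame({
    "Predictions": ["satisfied", "???", "unsatisfied", "satisfied"],
    "References": ["satisfied", "unsatisfied", "satisfied", "unsatisfied"],
    "Correct": [True, False, False, False],
    "Gibberish": [False, True, False, False],
})


def outcome(row):
    """Collapse the two flags into a single category per output."""
    if row["Gibberish"]:
        return "gibberish"
    return "correct" if row["Correct"] else "wrong"


df["Outcome"] = df.apply(outcome, axis=1)
counts = df["Outcome"].value_counts().to_dict()

# Parsable but wrong outputs are the interesting cases for error analysis.
parsable_errors = df[~df["Correct"] & ~df["Gibberish"]]
print(counts)
print(parsable_errors[["Predictions", "References"]])
```

Separating "wrong" from "gibberish" shows whether a model fails because it gives a well-formed but incorrect answer or because its output cannot be parsed at all.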

In [1]:
import json
import os
import sys

import nltk
import pandas as pd
from nltk.sem.evaluate import Valuation, Model
from nltk.sem.logic import *
from nltk.sem.logic import LogicParser, Expression

# make the project modules in the parent directory importable
sys.path.append('..')

#from llm_formalization.Parser import parse_LLM_output
from Parser import parse_LLM_output
from evaluate_tasks import *


In [3]:
results_dir = os.path.join('..', 'results', 'training-eval')
files = [f for f in os.listdir(results_dir) if f.endswith('.json')]
results = []
for f in files:
    print(f)
    # file names look like <model>_trained_on_<training set>_<task>.json
    names = f.split('_')
    model_name = names[0] + names[3]
    task_name = os.path.splitext(names[4])[0]

    path = os.path.join(results_dir, f)
    dataset = pd.read_json(path)

    if task_name == "task1":
        correctIncorrect, gibberish = eval_task1(dataset)
    elif task_name == "task2":
        correctIncorrect, gibberish = eval_task2(dataset)
    elif task_name == "task3":
        correctIncorrect, gibberish = eval_task3(dataset)
    else:
        # unknown task: skip instead of reusing the values of the previous file
        print('Skipping unknown task: ' + task_name)
        continue

    # add two new columns to the df and overwrite the original file
    dataset['Correct'] = correctIncorrect
    dataset['Gibberish'] = gibberish
    dataset.to_json(path)

    # calculate overall acc + acc without gibberish
    accuracy = sum(correctIncorrect) / len(correctIncorrect)
    print(accuracy)
    # check the denominator directly so that an all-gibberish file cannot divide by zero
    parsable = len(correctIncorrect) - sum(gibberish)
    if parsable > 0:
        accuracyNoGibberish = sum(correctIncorrect) / parsable
    else:
        accuracyNoGibberish = 0.0

    results.append({'Task': task_name, 'Model': model_name, 'Accuracy': accuracy, 'AccuracyNoGibberish': accuracyNoGibberish})



wizard-15b_trained_on_t2_task3.json
0.007
Falcon-7b-instruct_trained_on_t3_task3.json
0.561
Falcon-7b-instruct_trained_on_t3_task2.json
0.0
wizard-15b_trained_on_t2_task2.json
0.896
Llama-2-13b-chat-hf_trained_on_t3_task1.json
0.0
wizard-15b_trained_on_t1_task1.json
0.017
orca-13b_trained_on_t3_task1.json
0.0
wizard-15b_trained_on_t1t2t3_task3.json
0.584
Falcon-7b-instruct_trained_on_t1t2t3_task1.json
0.754
Falcon-7b-instruct_trained_on_t1_task3.json
0.302
Llama-2-13b-chat-hf_trained_on_t2_task3.json
0.289
orca-13b_trained_on_t2_task3.json
0.665
orca-13b_trained_on_t2_task2.json
0.848
orca-13b_trained_on_t1t2t3_task1.json
0.72
Llama-2-13b-chat-hf_trained_on_t2_task2.json
0.896
Falcon-7b-instruct_trained_on_t1_task2.json
0.0
Falcon-7b-instruct_trained_on_t2_task1.json
0.0
wizard-15b_trained_on_t1t2t3_task2.json
0.692
Llama-2-13b-chat-hf_trained_on_t1_task1.json
0.985
wizard-15b_trained_on_t3_task1.json
0.281
Llama-2-13b-chat-hf_trained_on_t1t2t3_task1.json
0.736
orca-13b_trained_on_t1_t

In [13]:
dataset.iloc[0]

Predictions    Satisfied.\n\nAnswer: Joyful.\n\nQuestion: Ple...
References                                           unsatisfied
Correct                                                    False
Gibberish                                                  False
Name: 80000, dtype: object

In [15]:
results

[{'Task': 'task3',
  'Model': 'Falcon-7b-instructt3',
  'Accuracy': 0.561,
  'AccuracyNoGibberish': 0.561},
 {'Task': 'task2',
  'Model': 'Falcon-7b-instructt3',
  'Accuracy': 0.0,
  'AccuracyNoGibberish': 0.0},
 {'Task': 'task1',
  'Model': 'Llama-2-13b-chat-hft3',
  'Accuracy': 0.0,
  'AccuracyNoGibberish': 0.0},
 {'Task': 'task3',
  'Model': 'Falcon-7b-instructt1',
  'Accuracy': 0.302,
  'AccuracyNoGibberish': 0.4886731391585761},
 {'Task': 'task3',
  'Model': 'Llama-2-13b-chat-hft2',
  'Accuracy': 0.289,
  'AccuracyNoGibberish': 0.4256259204712813},
 {'Task': 'task2',
  'Model': 'Llama-2-13b-chat-hft2',
  'Accuracy': 0.896,
  'AccuracyNoGibberish': 0.896},
 {'Task': 'task2',
  'Model': 'Falcon-7b-instructt1',
  'Accuracy': 0.0,
  'AccuracyNoGibberish': 0.0},
 {'Task': 'task1',
  'Model': 'Falcon-7b-instructt2',
  'Accuracy': 0.0,
  'AccuracyNoGibberish': 0.0},
 {'Task': 'task1',
  'Model': 'Llama-2-13b-chat-hft1',
  'Accuracy': 0.0,
  'AccuracyNoGibberish': 0.0},
 {'Task': 'task2',

## Task 3

#### Table / Summary:

In [4]:
summary_df = pd.DataFrame(results, columns=['Task', 'Model', 'Accuracy', 'AccuracyNoGibberish'])
summary_df = summary_df.pivot(index='Model', columns='Task', values=['Accuracy', 'AccuracyNoGibberish'])

display(summary_df)

| Model | Accuracy (task1) | Accuracy (task2) | Accuracy (task3) | AccuracyNoGibberish (task1) | AccuracyNoGibberish (task2) | AccuracyNoGibberish (task3) |
|---|---|---|---|---|---|---|
| Falcon-7b-instructt1 | 0.909 | 0.0 | 0.302 | 0.909 | 0.0 | 0.488673 |
| Falcon-7b-instructt1t2t3 | 0.754 | 0.728 | 0.565 | 0.754 | 0.747433 | 0.565 |
| Falcon-7b-instructt2 | 0.0 | 0.889 | 0.006 | 0.0 | 0.889 | 0.222222 |
| Falcon-7b-instructt3 | 0.0 | 0.0 | 0.561 | 0.0 | 0.0 | 0.561 |
| Llama-2-13b-chat-hft1 | 0.985 | 0.004 | 0.455 | 0.985 | 0.023392 | 0.482503 |
| Llama-2-13b-chat-hft1t2t3 | 0.736 | 0.548 | 0.631 | 0.736 | 0.872611 | 0.631 |
| Llama-2-13b-chat-hft2 | 0.0 | 0.896 | 0.289 | 0.0 | 0.896 | 0.425626 |
| Llama-2-13b-chat-hft3 | 0.0 | 0.099 | 0.867 | 0.0 | 0.274238 | 0.867 |
| orca-13bt1 | 0.699 | 0.0 | 0.495 | 0.699 | 0.0 | 0.495 |
| orca-13bt1t2t3 | 0.72 | 0.896 | 0.562 | 0.72 | 0.896 | 0.562 |
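
The model labels in this table fuse the base model name with the training set (for example `Falcon-7b-instructt1`), because the two filename parts are concatenated without a separator. A more readable label could be derived along the following lines. This is only a sketch: `parse_result_filename` is a hypothetical helper, and it assumes every file follows the `<model>_trained_on_<training set>_<task>.json` pattern seen in the printed file names.

```python
import os


def parse_result_filename(filename):
    """Split '<model>_trained_on_<training set>_<task>.json' into its parts."""
    stem = os.path.splitext(filename)[0]
    model, sep, rest = stem.partition("_trained_on_")
    if not sep:
        raise ValueError(f"unexpected file name: {filename}")
    # The task is the last underscore-separated part.
    train_set, _, task = rest.rpartition("_")
    return model, train_set, task


model, train_set, task = parse_result_filename(
    "wizard-15b_trained_on_t1t2t3_task3.json"
)
label = f"{model} ({train_set})"
print(label, task)
```

Splitting on the literal `_trained_on_` marker rather than on fixed positions keeps the helper working even if a model name itself contains an underscore.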
