# LLM Evaluation of AUT Response Utility

This notebook uses a large language model (Llama-3), accessed through the together.ai API, to rate the utility of a test set of responses to the Alternative Uses Task (AUT).

## Importing Necessary Libraries & Data

In [None]:
import pandas as pd
import together
import requests
import time
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Importing test set
test_set = pd.read_excel('/content/test_set2.xlsx')

## Getting Rating From LLM

In [None]:
# API call to together.ai
api_key = ""  # Insert your together.ai API key here
def together_call(prompt: str, model: str, temp: float) -> str:
    url = "https://api.together.xyz/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    data = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temp
    }
    try:
        # A timeout keeps a stalled connection from blocking the whole scoring loop
        response = requests.post(url, headers=headers, json=data, timeout=60)
        response.raise_for_status()
        output = response.json()
        return output['choices'][0]['message']['content']
    except Exception as e:
        print(f"Error during API call: {e}")
        return None
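
`together_call` returns `None` when a request fails, so a single transient error ends up as an "Error" rating for that row. One possible mitigation, not part of the original notebook, is to retry failed calls with exponential backoff. The sketch below assumes a hypothetical helper `call_with_retry`; the attempt count and delays are illustrative defaults, not values recommended by together.ai.

```python
import time
from typing import Callable, Optional


def call_with_retry(
    call: Callable[[], Optional[str]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Run `call` until it returns a non-None value or attempts run out.

    `call_with_retry` is a hypothetical helper for illustration only.
    """
    for attempt in range(max_attempts):
        result = call()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # Wait base_delay, then 2x, then 4x ... before the next attempt
            sleep(base_delay * (2 ** attempt))
    return None
```

Inside `score_dataset`, the call could then be wrapped as `call_with_retry(lambda: together_call(prompt, model, temp))`.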

In [None]:
# Prompt creation
def create_prompt(item, response):
    return f'''
The Alternative Uses Task (AUT) is a task used to measure divergent thinking. Divergent
thinking is a process often linked to creativity, where participants must provide as many
answers/solutions as possible to a given question. When taking the AUT, a subject is asked to
think of as many uses as possible for a given object.

Eg. Object: brick
Alternative use: doorstop

A creative use is defined as one that is both original and useful, where originality
outweighs utility. For now, it is important to score utility carefully.

Utility

The utility of a use is determined by how usable the property is. The utility score is
divided into three categories - low, medium and high. The assignment of scores is explained in
more detail below:

(low) Unusable or Difficult to realize: This score should be assigned to uses that are impossible to
realize or difficult to achieve.

Eg. Object: Book
Alternative usage: Fishing float

The stated use in this example should be assigned the utility score of low. An essential
characteristic of a fishing float is that it floats. A book sinks, therefore it is impossible
to use a book as a fishing float.

Eg. Object: Belt
Alternative Usage: Fishing Rod

The stated use in this example should be assigned the utility score of low. A belt alone
(which could function as a line) does not make a fishing rod; one also needs additional
objects, in this case a stick/rod and bait.

(medium) Reasonably realizable or Easily realized: This score should be assigned when the use is
reasonably realizable or easily achievable. Consider uses that require (very) minor modifications to the
named use.

Eg. Object: Can
Alternative Usage: Camera Tripod

The listed use in this example should be assigned the utility score of medium. A can can be
used as a tripod but this will require multiple adjustments in some cases. For example, to
adjust the height multiple cans will need to be used.

Eg. Object: Stick
Alternative Usage: Fork

The listed use in this example should be assigned the utility score of medium. A stick can be
used well as a replacement for a fork. However, in some cases, a stick will need to
be made sharper, but generally, a stick works well as a fork.

(high) Always realizable: This score should be assigned to uses that are always
realizable. Consider uses that do not require modifications of the named use or uses
intended for the given object.

Eg. Object: Can
Alternative Usage: Pen holder

The stated use in this example should be assigned the utility score of high. A tin can can be
used as a pen holder without any modifications.

You are an expert reviewer. You will be provided with human responses to the AUT task. You need to rate the utility of objects and their alternate
uses as described above. You will be given multiple responses together and have to rate each one. Make your best judgement and provide a rating
of either low, medium or high for every response. However, you must make sure that you are unbiased and ensure your previous ratings do not impact
your later ratings. Format your response such that all the utility ratings are comma separated in the first line.

item,response,rating
brick, coaster pans, high
paperclip, injure, medium
book, doorstop, high
fork, comb your hair, medium
can, picture frame, low
belt, pinchers, low

item: {item}, response: {response}
'''

In [None]:
# Build a prompt for each object/response pair and request a rating from the model
def score_dataset(df, model, temp):
    ratings = []
    for index, row in df.iterrows():
        item = row['object']
        response = row['translated_response']
        prompt = create_prompt(item, response)
        rating = together_call(prompt, model, temp)
        if rating:
            ratings.append(rating.strip())
        else:
            ratings.append("Error")
        time.sleep(0.6) # Pause to not hit rate limit
    return ratings

# Applying to the test set
test_set['rating'] = score_dataset(test_set, "meta-llama/Llama-3-70b-chat-hf", 0)

In [None]:
# Exporting the rated test set to CSV so the raw ratings can be cleaned up
test_set.to_csv('test_set.csv', index=False)
from google.colab import files
files.download('test_set.csv')
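
The model's reply is stored as free text, so it can contain extra words or punctuation around the label. A minimal sketch of one way to pre-clean such replies is shown here, assuming a hypothetical helper `extract_rating` that keeps the first of `low`, `medium` or `high` found in the text. It is not the cleaning procedure used for this notebook, and replies that mention more than one label would still need a manual check.

```python
import re
from typing import Optional

VALID_RATINGS = ("low", "medium", "high")

# Match any valid rating as a whole word, ignoring case
_RATING_PATTERN = re.compile(
    r"\b(" + "|".join(VALID_RATINGS) + r")\b", re.IGNORECASE
)


def extract_rating(raw: str) -> Optional[str]:
    """Return the first valid rating in a model reply, or None if there is none.

    `extract_rating` is a hypothetical helper for illustration only.
    """
    match = _RATING_PATTERN.search(raw)
    return match.group(1).lower() if match else None
```

Applied to the data frame, this could look like `test_set['rating'].map(extract_rating)`; rows that come back as `None` (for example the "Error" placeholder) would have to be handled before the labels are encoded.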

## Calculation of Evaluation Metrics

In [None]:
# Importing cleaned file
test_set = pd.read_csv('/content/test_set(1).csv')

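# Changing labels to numbers; the encoder is fitted on the human labels, so `transform`
# raises a ValueError if `rating` contains a label that never occurs in `final_category`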
label_encoder = LabelEncoder()
test_set['final_category_num'] = label_encoder.fit_transform(test_set['final_category'])
test_set['rating_num'] = label_encoder.transform(test_set['rating'])

# Calculating evaluation metrics using numerical labels
accuracy = accuracy_score(test_set['final_category_num'], test_set['rating_num'])
precision = precision_score(test_set['final_category_num'], test_set['rating_num'], average='weighted')
recall = recall_score(test_set['final_category_num'], test_set['rating_num'], average='weighted')
f1 = f1_score(test_set['final_category_num'], test_set['rating_num'], average='weighted')

# Printing results
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')

conf_matrix = confusion_matrix(test_set['final_category_num'], test_set['rating_num'])

# Plotting confusion matrix
plt.figure(figsize=(10, 7))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
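
To make the `average='weighted'` setting concrete, the sketch below recomputes weighted precision by hand on a small made-up example: each class's precision is weighted by the number of true instances of that class. The function name `weighted_precision` and the toy labels are assumptions for illustration; they are not taken from the test set.

```python
from collections import Counter


def weighted_precision(y_true, y_pred):
    """Per-class precision averaged with weights equal to each class's true count."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        predicted = sum(1 for p in y_pred if p == label)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        # A class that is never predicted contributes a precision of 0
        precision = correct / predicted if predicted else 0.0
        score += precision * n_true / total
    return score


# Toy labels, invented for this example
y_true = ["low", "low", "medium", "high"]
y_pred = ["low", "medium", "medium", "high"]
print(weighted_precision(y_true, y_pred))  # 0.875
```

With the same weighting, recall reduces to overall accuracy, because each class's recall is multiplied by its share of the true labels. The accuracy and recall values printed above are therefore expected to be identical.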
