In [1]:
from rank import prepare_DescriptionModel, prepare_VisionModel, Ranking
import pandas as pd
import os
import pickle


Since all of our evaluators use the OpenAI API, the first step is to set your OpenAI API key.

In [2]:
api_key = "Your OpenAI API Key"
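
A key typed directly into a notebook is easy to leak when the notebook is shared. As one possible alternative, the sketch below reads the key from an environment variable. The helper 'load_api_key' and the variable name 'OPENAI_API_KEY' are assumptions made for illustration; neither is part of the 'rank' module.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable.

    The helper and the default variable name are illustrative assumptions.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook.")
    return key
```

With the variable set in the shell, 'api_key = load_api_key()' could replace the hardcoded string.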

## Pairwise Comparison

### Evaluator: gpt4turbo, Description: Hessel's 

To measure the accuracy of description models, which evaluate captions using text descriptions of the cartoons, we first call the function 'prepare_DescriptionModel'.

The 'comparison_method' parameter takes one of three string values in all our functions and classes: "Pairwise", "Overall", and "BestPick".

"Pairwise" compares two candidate captions at a time, while "Overall" and "BestPick" compare groups of ten captions from different sources, such as human submissions from different ranking levels or captions generated by different language models.

In overall comparisons, the evaluator compares the overall funniness of the group of model-generated captions against each group of contestant-submitted captions. In best pick comparisons, the evaluator first picks the funniest caption from each of the two groups and then chooses the funnier of the two.

To compare two candidate captions at a time using the descriptions from Hessel et al., set 'comparison_method="Pairwise", Hessel=True'. The 'prepare_DescriptionModel' function then returns a dataframe of Hessel's dataset, which is the only input needed for this comparison. After that, we can create an instance of the 'Ranking' class.

In 'Ranking', 'annotation_type' can be either "Description" or "Image", and 'description_generator' can be "gpt-4-vision", "gpt-4-turbo", or "Hessel" (a later cell also passes "gpt-4o-vision"). Initializing 'Ranking' requires 'comparison_method', 'evaluator', 'annotation_type', and 'apiKey', plus 'description_generator' for description-based runs and any arguments specific to the chosen comparison. After initialization, calling the 'rank()' method returns the ranking accuracy.
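
Because these options are plain strings, a typo such as "Bestpick" is easy to make. The following is a minimal sketch of a pre-flight check built only from the values listed above. The helper 'check_ranking_args' is hypothetical and not part of the 'rank' module, which may well perform its own validation.

```python
# Values named in the text; the helper itself is a hypothetical convenience.
COMPARISON_METHODS = {"Pairwise", "Overall", "BestPick"}
ANNOTATION_TYPES = {"Description", "Image"}


def check_ranking_args(comparison_method, annotation_type):
    """Raise ValueError early if an argument is not one of the documented strings."""
    if comparison_method not in COMPARISON_METHODS:
        raise ValueError(
            f"comparison_method must be one of {sorted(COMPARISON_METHODS)}, got {comparison_method!r}"
        )
    if annotation_type not in ANNOTATION_TYPES:
        raise ValueError(
            f"annotation_type must be one of {sorted(ANNOTATION_TYPES)}, got {annotation_type!r}"
        )


check_ranking_args("Pairwise", "Description")  # passes silently
```

Calling the check before constructing 'Ranking' would surface a misspelled option before any API request is made.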

In [3]:
testing = prepare_DescriptionModel(comparison_method = "Pairwise", Hessel = True)
model = Ranking(comparison_method = "Pairwise", evaluator = "gpt-4-turbo", annotation_type = "Description",
                  description_generator = "Hessel", testing = testing, apiKey = api_key, num_pairs = 1)
accuracy = model.rank()

100%|██████████| 1/1 [00:00<00:00,  1.01it/s]

The length of results is 1





### Evaluator: gpt4oV, Raw image

To measure the accuracy of vision models, which evaluate captions using the raw cartoon images, we first call the function 'prepare_VisionModel'.

The 'image_pairs' value returned by 'prepare_VisionModel' is the only comparison-specific argument needed for the vision model. It is a list in which each element has the form '[contest_number, base64_image, captionA, captionB, label]'.
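
To make that layout concrete, here is a small sketch that builds one such entry by hand with the standard 'base64' module. The helper 'make_image_pair', the placeholder bytes, the captions, and the label value "A" are all assumptions for illustration; the actual label encoding is defined in the 'rank' code.

```python
import base64


def make_image_pair(contest_number, image_bytes, caption_a, caption_b, label):
    """Build one entry in the [contest_number, base64_image, captionA, captionB, label] layout."""
    # Encode the raw image bytes as a base64 text string.
    base64_image = base64.b64encode(image_bytes).decode("utf-8")
    return [contest_number, base64_image, caption_a, caption_b, label]


# Placeholder bytes stand in for a real image read with open(path, "rb").read().
pair = make_image_pair(1, b"not a real image", "Caption A", "Caption B", "A")
print(len(pair))  # 5
```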

In [4]:
image_pairs = prepare_VisionModel(comparison_method = "Pairwise")
model = Ranking(comparison_method = "Pairwise", evaluator = "gpt-4o", annotation_type = "Image",
                 image_pairs = image_pairs, apiKey = api_key, num_pairs = 1)
accuracy = model.rank()

100%|██████████| 1/1 [00:04<00:00,  4.34s/it]

The length of results is 1





## Overall Comparison

### Evaluator: gpt4turbo, Description Generator: gpt4V

To use descriptions generated by a specific model, such as gpt4V, we need a dataframe of descriptions for the 100 testing cartoons. With our 5-shot technique, each testing cartoon is accompanied by five example cartoons, for 500 example cartoons in total.

The dataframe is called 'Dtesting' and has three columns, ['cnum', 'canny', 'uncanny'], which stand for ['contest_number', 'canny description', 'uncanny description']. 'Deg' is a dataframe of model-generated (for example, gpt4V) descriptions for all distinct cartoons in Hessel's dataset. We select 500 examples from it using a fixed random seed.
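
Seeded selection is what keeps the example set identical across runs. The sketch below shows the idea with the standard 'random' module on a made-up pool of contest numbers. The helper 'select_examples', the pool, and the seed value 0 are assumptions; the seed and selection logic actually used live in the 'rank' code.

```python
import random


def select_examples(contest_numbers, k, seed):
    """Pick k distinct contest numbers reproducibly."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return rng.sample(list(contest_numbers), k)


pool = range(1, 601)  # made-up contest numbers for illustration
first = select_examples(pool, k=500, seed=0)
second = select_examples(pool, k=500, seed=0)
print(first == second)  # True
```

On a dataframe, pandas offers the same behaviour through 'DataFrame.sample(n=..., random_state=...)'.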

With 'Dtesting' and 'Deg' prepared, we can call 'prepare_DescriptionModel'. 

Unlike pairwise comparison with descriptions from Hessel et al., overall and best pick comparisons also take 'Deg' and 'Dtesting' as inputs, in addition to 'comparison_method="Overall", GPT4V=True' or 'comparison_method="BestPick", GPT4V=True'.

The 'prepare_DescriptionModel' function then returns four variables: 'deg', 'eg', 'dtesting', and 'captions'. 'deg' is a list of '[canny, uncanny]' descriptions for 600 example cartoons, and 'eg' is a list of 600 captions and labels (only 500 example cartoons are actually used; see the code for details). 'dtesting' is a list of descriptions for the 100 testing cartoons, taken from 'Dtesting'. 'captions' is a list of human contestant entries ranked in the top 10 and at #1000-#1009.

In addition, in group comparisons we flip the order of the two caption groups for each cartoon so that the recalibrated decision rule can be applied, which is why each cartoon contributes two results.
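
The flipping step can be pictured as judging every cartoon in both presentation orders. The sketch below illustrates only that idea and why one cartoon produces two results. The recalibrated decision rule itself is not reproduced here, and 'evaluate_with_flip', 'accuracy_percent', and the stub 'biased_judge' are hypothetical names, not functions from the 'rank' module.

```python
def evaluate_with_flip(group_a, group_b, true_winner, judge):
    """Judge one cartoon twice: as (A, B) and with the groups swapped.

    judge(first, second) is a hypothetical callable returning "first" or "second".
    true_winner is "a" or "b". Returns one boolean per presentation order.
    """
    original = judge(group_a, group_b)
    flipped = judge(group_b, group_a)
    if true_winner == "a":
        return [original == "first", flipped == "second"]
    return [original == "second", flipped == "first"]


def accuracy_percent(outcomes):
    """Percentage of correct outcomes."""
    return 100.0 * sum(outcomes) / len(outcomes)


def biased_judge(first, second):
    """Stub judge with pure position bias: always prefers the group shown first."""
    return "first"


outcomes = evaluate_with_flip(["human caption"], ["model caption"], "a", biased_judge)
print(len(outcomes), accuracy_percent(outcomes))  # 2 50.0
```

A judge that only follows position scores exactly 50% under this scheme, which is one reason to evaluate both orders.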

In [5]:
folder_path = '/Users/chenjiayi/Desktop/humor/D/gpt4V_100descriptions'

Dtesting = pd.DataFrame(columns=['cnum', 'canny', 'uncanny']) # description df for 100 testing cartoons
for filename in os.listdir(folder_path):
    if filename != ".DS_Store":
        file_path = os.path.join(folder_path, filename)
        df = pd.read_csv(file_path)
        Dtesting = pd.concat([Dtesting, df], ignore_index=True)

file_path = '/Users/chenjiayi/Desktop/humor/D/gpt4V_description/example.csv'
Deg1 = pd.read_csv(file_path)
file_path = '/Users/chenjiayi/Desktop/humor/D/gpt4V_description/example_new.csv'
Deg2 = pd.read_csv(file_path)
Deg = pd.concat([Deg1, Deg2], ignore_index=True)

deg, eg, dtesting, captions = prepare_DescriptionModel(comparison_method = "Overall", GPT4V = True, 
                                                       Deg = Deg, Dtesting = Dtesting)

model = Ranking(comparison_method = "Overall", evaluator = "gpt-4-turbo", annotation_type = "Description",
                description_generator = "gpt-4-vision", deg = deg, dtesting = dtesting, eg = eg,
                captions = captions, apiKey = api_key, num_pairs = 1)
accuracy = model.rank()

100%|██████████| 1/1 [00:01<00:00,  1.54s/it]

The length of results is 2





### Evaluator: gpt4turbo-V, Raw image

For group comparisons ("Overall" and "BestPick"), we fix the list of contest numbers to be evaluated so that different evaluators can be compared on the same cartoons. The list used with the description model is named 'cartoons', and the one used with the vision model is named 'cartoons_v'. The list is passed as an argument when calling 'prepare_VisionModel'.

In group comparisons, 'prepare_VisionModel' returns 'img' and 'captions' in addition to 'image_pairs'. These are the images and the human contestant entries ranked in the top 10 and at #1000-#1009 for the contest numbers being evaluated.
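
A fixed list like this is typically saved once and reloaded in every run. As a minimal sketch of that round trip with the standard 'pickle' module, the snippet below writes a list to a temporary file and reads it back. The contest numbers and the file name are placeholders, not the contents of the actual pickle file.

```python
import os
import pickle
import tempfile

contest_numbers = [101, 202, 303]  # placeholder values for illustration

# Write the list once to a temporary location...
path = os.path.join(tempfile.mkdtemp(), "cartoons_example.pkl")
with open(path, "wb") as f:
    pickle.dump(contest_numbers, f)

# ...and reload it later, so every evaluator sees the same cartoons.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == contest_numbers)  # True
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.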

In [6]:
file_path = "/Users/chenjiayi/Desktop/humor/cartoon_forGroupComparison.pkl"
with open(file_path, 'rb') as file:
    cartoons_v = pickle.load(file)
    
image_pairs, img, captions = prepare_VisionModel(comparison_method = "Overall", 
                                                 cartoons_GroupComparison = cartoons_v)

model = Ranking(comparison_method = "Overall", evaluator = "gpt-4-turbo", annotation_type = "Image",
                 image_pairs = image_pairs, img = img, captions = captions, apiKey = api_key, num_pairs = 1)
accuracy = model.rank()

100%|██████████| 1/1 [00:12<00:00, 12.00s/it]

The length of results is 2





## Best Pick Comparison

### Evaluator: gpt4turbo, Description Generator: gpt4oV

Similarly, once the gpt4oV-generated descriptions of the 100 testing cartoons and of all distinct cartoons in Hessel's dataset are prepared, we call 'prepare_DescriptionModel' and then 'Ranking'.

In [7]:
file_path = '/Users/chenjiayi/Desktop/humor/D/gpt4oV_description/cartoon_pairwise_o.csv'
Dtesting = pd.read_csv(file_path)
Dtesting = Dtesting[['cnum', 'canny', 'uncanny']] # description df for 100 testing cartoons

file_path = '/Users/chenjiayi/Desktop/humor/D/gpt4oV_description/example_o.csv'
Deg1 = pd.read_csv(file_path)
file_path = '/Users/chenjiayi/Desktop/humor/D/gpt4oV_description/new_example_o.csv'
Deg2 = pd.read_csv(file_path)
Deg = pd.concat([Deg1, Deg2], ignore_index=True)
    
deg, eg, dtesting, captions = prepare_DescriptionModel(comparison_method = "BestPick", GPT4oV = True, 
                                                       Deg = Deg, Dtesting = Dtesting)
 
model = Ranking(comparison_method = "BestPick", evaluator = "gpt-4-turbo", annotation_type = "Description",
                description_generator = "gpt-4o-vision", deg = deg, dtesting = dtesting, eg = eg,
                captions = captions, apiKey = api_key, num_pairs = 1)
accuracy = model.rank()

100%|██████████| 1/1 [00:13<00:00, 13.55s/it]

The length of results is 2





### Evaluator: gpt4oV, Raw image

Similarly, once the 'image_pairs' of example cartoons, the images to be evaluated, and the corresponding human contestant entries ranked in the top 10 and at #1000-#1009 are prepared, we call 'prepare_VisionModel' and then 'Ranking'.

In [8]:
image_pairs, img, captions = prepare_VisionModel(comparison_method = "BestPick", 
                                                 cartoons_GroupComparison = cartoons_v)

model = Ranking(comparison_method = "BestPick", evaluator = "gpt-4o", annotation_type = "Image",
                 image_pairs = image_pairs, img = img, captions = captions, apiKey = api_key, num_pairs = 1)
accuracy = model.rank()

100%|██████████| 1/1 [00:09<00:00,  9.37s/it]

The length of results is 2





## Overall Comparison with evaluator gpt4turbo and gpt4V-generated descriptions

Here is how we run a complete overall comparison over 100 cartoons. In this cell 'num_pairs' is omitted and the run covers all 100 testing cartoons. The results include the flipped-group experiments, so 100 cartoons yield 200 results.

In [10]:
folder_path = '/Users/chenjiayi/Desktop/humor/D/gpt4V_100descriptions'

Dtesting = pd.DataFrame(columns=['cnum', 'canny', 'uncanny']) # description df for 100 testing cartoons
for filename in os.listdir(folder_path):
    if filename != ".DS_Store":
        file_path = os.path.join(folder_path, filename)
        df = pd.read_csv(file_path)
        Dtesting = pd.concat([Dtesting, df], ignore_index=True)

file_path = '/Users/chenjiayi/Desktop/humor/D/gpt4V_description/example.csv'
Deg1 = pd.read_csv(file_path)
file_path = '/Users/chenjiayi/Desktop/humor/D/gpt4V_description/example_new.csv'
Deg2 = pd.read_csv(file_path)
Deg = pd.concat([Deg1, Deg2], ignore_index=True)

deg, eg, dtesting, captions = prepare_DescriptionModel(comparison_method = "Overall", GPT4V = True, 
                                                       Deg = Deg, Dtesting = Dtesting)

model = Ranking(comparison_method = "Overall", evaluator = "gpt-4-turbo", annotation_type = "Description",
                description_generator = "gpt-4-vision", deg = deg, dtesting = dtesting, eg = eg,
                captions = captions, apiKey = api_key)
accuracy = model.rank()
accuracy

100%|██████████| 100/100 [03:39<00:00,  2.19s/it]

The length of results is 200





71.5