# Measure Model Agreement Rate

To check whether the models share the same best and worst transcripts, we compare each model's maximum and minimum score across all transcripts. Matching extremes across models would point to transcripts that are consistently easy or hard, regardless of which model is evaluated.

Under both the bipartite and the vector-similarity metric, every model's maximum is reported for the Trevor Noah transcript listed as `Trevor_Noah_3` in the output, and every model's minimum is reported for the Ali Wong transcript. The pattern is the same for Gemma, Gemma2, Phi, and Llama: the Trevor Noah transcript consistently yields the best results and the Ali Wong transcript the poorest.

### Imports

In [2]:
import sys

import nltk
import pandas as pd
import seaborn as sns
from thefuzz import fuzz

# Make the local `humor` package importable from the parent directory.
sys.path.append("..")

from humor.bipartite_metric import bipartite_metric
from humor.vector_similarity_metric import vector_similarity_metric


In [3]:
from pathlib import Path

# All answer files live in the same dataset directory.
DATA_DIR = Path("/home/ada/humor/data/stand_up_dataset")

ground_truth = pd.read_csv(DATA_DIR / "standup_data.csv")
gemma = pd.read_csv(DATA_DIR / "gemma_answers.csv")
phi_model = pd.read_csv(DATA_DIR / "phi3_mini_quotes.csv")
gemma2 = pd.read_csv(DATA_DIR / "gemma2 - gemma2.csv")
llama = pd.read_csv(DATA_DIR / "llama - llama.csv")

### Experiment

In [4]:
# Score each model with the bipartite metric, then print the column-wise maximum and minimum.
gemma_metric = bipartite_metric(gemma, ground_truth)
print("Gemma")
print("Maximum:", gemma_metric.max())
print("Minimum:", gemma_metric.min())
gemma2_metric = bipartite_metric(gemma2, ground_truth)
print("\nGemma2")
print("Maximum:", gemma2_metric.max())
print("Minimum:", gemma2_metric.min())
phi_metric = bipartite_metric(phi_model, ground_truth)
print("\nPhi")
print("Maximum:", phi_metric.max())
print("Minimum:", phi_metric.min())
llama_metric = bipartite_metric(llama, ground_truth)
print("\nLLama")
print("Maximum:", llama_metric.max())
print("Minimum:", llama_metric.min())

Gemma
Maximum: comedian    Trevor_Noah_3
score                76.0
dtype: object
Minimum: comedian    Ali_Wong
score       5.708333
dtype: object

Gemma2
Maximum: comedian    Trevor_Noah_3
score              72.125
dtype: object
Minimum: comedian    Ali_Wong
score       3.233333
dtype: object

Phi
Maximum: comedian    Trevor_Noah_3
score                60.2
dtype: object
Minimum: comedian    Ali_Wong
score          3.875
dtype: object

LLama
Maximum: comedian    Trevor_Noah_3
score                69.7
dtype: object
Minimum: comedian    Ali_Wong
score          4.375
dtype: object
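
One caveat when reading the output above: it shows a `comedian` and a `score` entry per result, which suggests the metric functions return a pandas DataFrame with those two columns. If so, `DataFrame.max()` and `DataFrame.min()` reduce each column independently, and the `comedian` value printed is the alphabetically last or first name rather than the transcript that earned the printed score. The sketch below contrasts the column-wise reduction with a row-wise lookup via `idxmax` and `idxmin`. The helper name `best_and_worst`, the transcript `Comedian_X`, and all toy scores are hypothetical and serve only to illustrate the difference.

```python
import pandas as pd


def best_and_worst(metric: pd.DataFrame) -> tuple[pd.Series, pd.Series]:
    """Return the full rows holding the highest and the lowest score."""
    best = metric.loc[metric["score"].idxmax()]
    worst = metric.loc[metric["score"].idxmin()]
    return best, worst


# Hypothetical toy scores, not taken from the dataset.
toy = pd.DataFrame(
    {
        "comedian": ["Ali_Wong", "Comedian_X", "Trevor_Noah_3"],
        "score": [40.0, 90.0, 10.0],
    }
)

# Column-wise reduction: name and score are maximised separately.
column_wise = toy.max()

# Row-wise lookup: name and score come from the same row.
best, worst = best_and_worst(toy)

print("column-wise max:", column_wise["comedian"], column_wise["score"])
print("row with max score:", best["comedian"], best["score"])
print("row with min score:", worst["comedian"], worst["score"])
```

In this toy frame the column-wise maximum pairs `Trevor_Noah_3` with a score of 90.0, even though that score belongs to `Comedian_X`. Applying the row-wise lookup to the real metric frames would confirm whether the reported best and worst transcripts hold.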


In [5]:
# Repeat the comparison with the vector-similarity metric (this overwrites the *_metric variables from the previous cell).
gemma_metric = vector_similarity_metric(gemma, ground_truth)
print("Gemma")
print("Maximum:", gemma_metric.max())
print("Minimum:", gemma_metric.min())
gemma2_metric = vector_similarity_metric(gemma2, ground_truth)
print("\nGemma2")
print("Maximum:", gemma2_metric.max())
print("Minimum:", gemma2_metric.min())
phi_metric = vector_similarity_metric(phi_model, ground_truth)
print("\nPhi")
print("Maximum:", phi_metric.max())
print("Minimum:", phi_metric.min())
llama_metric = vector_similarity_metric(llama, ground_truth)
print("\nLLama")
print("Maximum:", llama_metric.max())
print("Minimum:", llama_metric.min())

Gemma
Maximum: comedian    Trevor_Noah_3
score           75.245221
dtype: object
Minimum: comedian    Ali_Wong
score       9.920471
dtype: object

Gemma2
Maximum: comedian    Trevor_Noah_3
score           75.209508
dtype: object
Minimum: comedian    Ali_Wong
score       7.960477
dtype: object

Phi
Maximum: comedian    Trevor_Noah_3
score           55.204354
dtype: object
Minimum: comedian    Ali_Wong
score       3.842477
dtype: object

LLama
Maximum: comedian    Trevor_Noah_3
score           64.899307
dtype: object
Minimum: comedian     Ali_Wong
score       10.034131
dtype: object
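
To express the overlap as a number, one option is an agreement rate: the share of models whose best (or worst) transcript matches the most common choice. The following is a minimal sketch, assuming each metric result is a DataFrame with `comedian` and `score` columns. The function `agreement_rate`, the model keys, and the toy scores are hypothetical and are not part of the `humor` package.

```python
from collections import Counter

import pandas as pd


def agreement_rate(metrics: dict[str, pd.DataFrame], kind: str = "max") -> tuple[str, float]:
    """Return the most commonly picked transcript and the fraction of models picking it.

    kind="max" compares best transcripts, kind="min" compares worst transcripts.
    """
    picks = []
    for frame in metrics.values():
        scores = frame["score"]
        idx = scores.idxmax() if kind == "max" else scores.idxmin()
        picks.append(frame.loc[idx, "comedian"])
    transcript, count = Counter(picks).most_common(1)[0]
    return transcript, count / len(picks)


# Hypothetical toy scores for three models over three transcripts.
transcripts = ["T1", "T2", "T3"]
toy_metrics = {
    "model_a": pd.DataFrame({"comedian": transcripts, "score": [70.0, 20.0, 50.0]}),
    "model_b": pd.DataFrame({"comedian": transcripts, "score": [65.0, 30.0, 40.0]}),
    "model_c": pd.DataFrame({"comedian": transcripts, "score": [30.0, 10.0, 80.0]}),
}

best_pick, best_rate = agreement_rate(toy_metrics, kind="max")
worst_pick, worst_rate = agreement_rate(toy_metrics, kind="min")

print(f"best: {best_pick} ({best_rate:.2f}), worst: {worst_pick} ({worst_rate:.2f})")
```

A rate of 1.0 means every model agrees on the transcript. With the four models in this notebook, the same function could be called on a dictionary of their metric frames, once per metric.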
