# Human Agreement

To quantify agreement among the evaluators, we use Percentage Agreement (PA), a common measure of the proportion of annotators who assign the same label. Here PA is computed per item as the share of evaluators who sided with the majority label, then averaged over items. Human evaluators achieved relatively high agreement across all stand-up comedy transcripts, with an overall average of 86.7%. This suggests that participants generally identified similar humorous quotes in the transcripts.

Humor is widely recognized as a subjective experience, heavily influenced by individual preferences, cultural background, and personal sense of humor. This variability is reflected in the results: some transcripts, such as Anthony Jeselnik's with 90.2% agreement, achieved higher agreement scores, while others, such as Ali Wong's with 83.7%, displayed lower levels of consensus. The variability between transcripts suggests that, while humor can be universally understood to a certain extent, individual differences among evaluators affect their judgments of what constitutes "funny" content. Some participants may find most of a transcript humorous, while others may find only select parts amusing, or none at all.
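The Percentage Agreement score described above can be illustrated on a small example. The sketch below uses a made-up table (`toy`, with hypothetical comedians "A" and "B"), not the study data, and assumes the same layout as the real file: one row per item with a `funny` count out of 11 raters.

```python
import pandas as pd

# Toy data, invented for illustration only (not taken from the study).
toy = pd.DataFrame({
    "comedian": ["A", "A", "B"],
    "funny": [11, 6, 2],          # raters (out of 11) who marked the item funny
})
num_raters = 11
toy["not_funny"] = num_raters - toy["funny"]

# Per-item agreement: share of raters who sided with the majority label.
toy["agreement"] = toy[["funny", "not_funny"]].max(axis=1) / num_raters

# Per-comedian Percentage Agreement: mean per-item agreement, in percent.
per_comedian = toy.groupby("comedian")["agreement"].mean() * 100
print(per_comedian)
```

With two labels and 11 raters, the majority share of a single item can never fall below 6/11 (about 54.5%), so PA values under this definition are best read against that floor rather than against zero.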

In [14]:
import pandas as pd
import torch
from torchmetrics.nominal import FleissKappa

In [25]:
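# Load the human annotations; the 'funny' column holds, for each item,
# the number of participants who labeled it funny.
human = pd.read_csv("/home/ada/humor/data/stand_up_dataset/human_ans - Sheet1.csv")
num_participants = 11
# Everyone who did not label the item funny is counted as 'not_funny'.
human['not_funny'] = num_participants - human['funny']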

## Percent Agreement

In [30]:
results = []
# Per-item agreement: share of participants who sided with the majority label
# (num_participants is defined in the data-loading cell above).
human['agreement_proportion'] = human[['funny', 'not_funny']].max(axis=1) / num_participants
for comedian, group in human.groupby('comedian'):
    # Mean per-item agreement for this comedian, as a percentage.
    percentage_agreement = group['agreement_proportion'].mean() * 100
    results.append({'comedian': comedian, 'percentage_agreement': percentage_agreement})

results_df = pd.DataFrame(results)
# "Overall" is the unweighted mean of the per-comedian percentages.
overall_agreement = results_df['percentage_agreement'].mean()
results_df.loc[len(results_df.index)] = ["Overall", overall_agreement]
results_df


|   | comedian | percentage_agreement |
|---|---|---|
| 0 | Ali_Wong | 83.732057 |
| 1 | Anthony_Jeselnik | 90.151515 |
| 2 | Hasan_Minhaj | 85.454545 |
| 3 | Jimmy_Yang | 87.012987 |
| 4 | Joe_List | 88.484848 |
| 5 | John_Mulaney | 85.314685 |
| 6 | Overall | 86.691773 |


## Fleiss' Kappa

In [17]:
humor_data = human[['funny', 'not_funny']].values 
humor_tensor = torch.tensor(humor_data, dtype=torch.long)

In [18]:
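results = []
for comedian, group in human.groupby('comedian'):
    # Rows are items, columns are per-category rating counts (funny, not_funny).
    humor_data = group[['funny', 'not_funny']].values
    humor_tensor = torch.tensor(humor_data, dtype=torch.long)
    # mode="counts" tells torchmetrics the input already holds category counts.
    metric = FleissKappa(mode="counts")
    kappa = metric(humor_tensor)
    results.append({'comedian': comedian, 'fleiss_kappa': kappa.item()})

results_df = pd.DataFrame(results)
# "Overall" is the unweighted mean of the per-comedian kappas,
# not a kappa computed on the pooled data.
overall_kappa = results_df['fleiss_kappa'].mean()
results_df.loc[len(results_df.index)] = ["Overall", float(overall_kappa)]
results_df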


|   | comedian | fleiss_kappa |
|---|---|---|
| 0 | Ali_Wong | 0.14256 |
| 1 | Anthony_Jeselnik | 0.376295 |
| 2 | Hasan_Minhaj | 0.127973 |
| 3 | Jimmy_Yang | 0.366601 |
| 4 | Joe_List | 0.219733 |
| 5 | John_Mulaney | 0.294862 |
| 6 | Overall | 0.254671 |
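As a cross-check on the library call, Fleiss' kappa can also be computed directly from a counts matrix using the standard formula. The sketch below is a minimal NumPy version, not the torchmetrics implementation. The function name `fleiss_kappa` and the `toy_counts` matrix are hypothetical, and it assumes every item was rated by the same number of raters.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa from an (items x categories) matrix of rating counts.

    Assumes each row sums to the same number of raters.
    """
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()

    # Observed agreement per item: fraction of rater pairs that agree.
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()

    # Chance agreement from the overall share of each category.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.sum(p_j ** 2)

    return (p_bar - p_e) / (1 - p_e)

# Toy counts, invented for illustration: columns are (funny, not_funny).
toy_counts = [[11, 0], [0, 11], [6, 5]]
print(fleiss_kappa(toy_counts))
```

Because kappa subtracts the agreement expected by chance, it can be much lower than the percentage agreement when one label dominates the data. That may be one reason the kappa values here are modest even though the percentage agreement is in the mid-80s.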
