# Annotation Guideline information

Given an utterance Ut labeled with an emotion Et, the annotators were asked to extract the set of causal spans CS(Ut) that sufficiently represent the causes of the emotion Et. If the cause of Et was latent, i.e., there was no explicit causal span in the dialog, the annotators wrote down the assumed causes that they inferred from the text.

The annotators were asked to look for the causal spans of Ut in the whole dialog, not only in the past history H(Ut).
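
To make the guideline concrete, here is a minimal sketch that pairs each annotated utterance with its cause spans and notes whether each evidence turn lies before, at, or after the target utterance. It assumes the record layout printed further down in this notebook (keys `turn`, `emotion`, `expanded emotion cause evidence`, `expanded emotion cause span`). The helper name `collect_cause_records` is hypothetical, and any evidence entry that is not an integer turn number is simply labeled `other`.

```python
def collect_cause_records(dialogue):
    """Return one record per utterance that carries cause annotations."""
    records = []
    for utt in dialogue:
        evidence = utt.get('expanded emotion cause evidence', [])
        spans = utt.get('expanded emotion cause span', [])
        if not evidence and not spans:
            continue  # e.g. neutral utterances have no cause annotation

        # Locate each evidence turn relative to the target utterance
        positions = []
        for ev in evidence:
            if not isinstance(ev, int):
                positions.append('other')   # assumption: non-turn marker
            elif ev < utt['turn']:
                positions.append('past')    # inside the history H(Ut)
            elif ev == utt['turn']:
                positions.append('self')    # the target utterance itself
            else:
                positions.append('future')  # later in the dialog
        records.append({
            'turn': utt['turn'],
            'emotion': utt['emotion'],
            'spans': spans,
            'evidence_position': positions,
        })
    return records


# First three turns of the sample dialogue printed in this notebook
sample_dialogue = [
    {'turn': 1, 'speaker': 'A', 'utterance': 'Hey , you wanna see a movie tomorrow ?',
     'emotion': 'happiness', 'expanded emotion cause evidence': [1],
     'expanded emotion cause span': ['see a movie tomorrow ?'], 'type': ['no-context']},
    {'turn': 2, 'speaker': 'B', 'utterance': 'Sounds like a good plan . What do you want to see ?',
     'emotion': 'happiness', 'expanded emotion cause evidence': [1],
     'expanded emotion cause span': ['see a movie tomorrow ?'], 'type': ['inter-personal']},
    {'turn': 3, 'speaker': 'A', 'utterance': 'How about Legally Blonde .', 'emotion': 'neutral'},
]

records = collect_cause_records(sample_dialogue)
for record in records:
    print(record)
```

On this sample, turn 1 points at itself and turn 2 points at a past turn; a `future` label would mark a cause found after the target utterance, which the guideline allows.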

In [1]:
import json
import plotly.express as px
import pandas as pd

In [2]:
# Use a context manager so the file handle is closed after loading
with open('data_level0.json', encoding="utf-8") as f:
    json_file = json.load(f)

In [3]:
# Index 0: Test file (https://github.com/declare-lab/RECCON/blob/main/data/original_annotation/dailydialog_test.json)
# Index 1: Train file (https://github.com/declare-lab/RECCON/blob/main/data/original_annotation/dailydialog_train.json)
# Index 2: Valid file (https://github.com/declare-lab/RECCON/blob/main/data/original_annotation/dailydialog_valid.json)
dataset_div = ['Test', 'Train', 'Validation']
for index, label in enumerate(dataset_div):
    print(f"{label} size: {len(json_file[index])}")

Test size: 225
Train size: 834
Validation size: 47


In [4]:
# Test the waters with the train set: print the first dialogue only
train = json_file[1]

def parse_dialogue_dict(dialogue: dict) -> None:
    
    print(dialogue)

for key in train.keys():
    
    # Layer 1: Extract the full list of utterances and turns
    # Each key contains a dialogue set
    dialogue_set = train[key][0]
    
    # Layer 2
    for dialogue in dialogue_set:
        parse_dialogue_dict(dialogue)
        print("\n")
    break

{'turn': 1, 'speaker': 'A', 'utterance': 'Hey , you wanna see a movie tomorrow ?', 'emotion': 'happiness', 'expanded emotion cause evidence': [1], 'expanded emotion cause span': ['see a movie tomorrow ?'], 'type': ['no-context']}


{'turn': 2, 'speaker': 'B', 'utterance': 'Sounds like a good plan . What do you want to see ?', 'emotion': 'happiness', 'expanded emotion cause evidence': [1], 'expanded emotion cause span': ['see a movie tomorrow ?'], 'type': ['inter-personal']}


{'turn': 3, 'speaker': 'A', 'utterance': 'How about Legally Blonde .', 'emotion': 'neutral'}


{'turn': 4, 'speaker': 'B', 'utterance': "Ah , my girlfriend wanted to see that movie . I have to take her later so I don't want to watch it ahead of time . How about The Cube ?", 'emotion': 'neutral'}


{'turn': 5, 'speaker': 'A', 'utterance': "Isn't that a scary movie ?", 'emotion': 'neutral'}


{'turn': 6, 'speaker': 'B', 'utterance': "How scary can it be ? Come on , it'll be fun .", 'emotion': 'neutral'}


{'turn': 7

# Layer 1

- Find the distribution of the emotions (Bar chart)
- Find the distribution of type (Bar chart)

In [5]:
def increment_type_counter(type_counter: dict, type_list: list) -> None:
    # Count each cause type; the loop variable avoids shadowing the built-in `type`
    for cause_type in type_list:
        type_counter[cause_type] = type_counter.get(cause_type, 0) + 1

def increment_utterance_counter(utter_counter: list, utterance: str) -> None:
    # Record the utterance length in space-separated tokens
    utter_counter.append(len(utterance.split(" ")))

emotion_counter = {}
type_counter = {}

num_utterance_list = []
utterance_length_list = []

for key in train.keys():
    
    # Layer 1: Extract the full list of utterances and turns
    # Each key contains an utterance list
    dialogue_set = train[key][0]
    num_utterance_list.append(len(dialogue_set))
    
    # Layer 2: Go one utterance at a time
    for utterance in dialogue_set:
        emotion = utterance['emotion']
        
        # For emotion counts
        emotion_counter[emotion] = emotion_counter.get(emotion, 0) + 1
        
        # For emotion-cause-type counts
        if 'type' in utterance:
            increment_type_counter(type_counter, utterance['type'])
        else:
            type_counter['empty'] = type_counter.get('empty', 0) + 1
        
        increment_utterance_counter(utterance_length_list, utterance['utterance'])

In [13]:
# Find the distribution of the emotions
emotion_counter_df = pd.DataFrame(emotion_counter.items(), columns=['emotion', 'count'])

fig = px.bar(emotion_counter_df, x='emotion', y='count', title='Emotion Counts')
fig.update_layout(hovermode="x")
fig.show()

In [12]:
# Find the distribution of type 
type_counter_df = pd.DataFrame(type_counter.items(), columns=['type', 'count'])

fig = px.bar(type_counter_df, x='type', y='count', title='Emotion Cause Type Counts')
fig.update_layout(hovermode="x")
fig.show()

# Layer 2

- Find the distribution of the expanded emotion cause span (Histogram)
- Find the number of utterances per dialog (Histogram + Total count)
- Find the number of words per utterance (Histogram + Total count)
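
The first item, the distribution of the expanded emotion cause span, could be approached along these lines. This is a minimal sketch that assumes each `expanded emotion cause span` entry is a string and measures its length in space-separated tokens, the same way utterance length is measured in this notebook. The helper name `span_word_lengths` is hypothetical.

```python
from collections import Counter


def span_word_lengths(dialogue_set):
    """Collect the token length of every annotated cause span."""
    lengths = []
    for utterance in dialogue_set:
        for span in utterance.get('expanded emotion cause span', []):
            if isinstance(span, str):  # assumption: spans are plain strings
                lengths.append(len(span.split(" ")))
    return lengths


# Two annotated turns taken from the sample dialogue printed in this notebook
sample_dialogue = [
    {'turn': 1, 'utterance': 'Hey , you wanna see a movie tomorrow ?', 'emotion': 'happiness',
     'expanded emotion cause evidence': [1],
     'expanded emotion cause span': ['see a movie tomorrow ?'], 'type': ['no-context']},
    {'turn': 2, 'utterance': 'Sounds like a good plan . What do you want to see ?', 'emotion': 'happiness',
     'expanded emotion cause evidence': [1],
     'expanded emotion cause span': ['see a movie tomorrow ?'], 'type': ['inter-personal']},
    {'turn': 3, 'utterance': 'How about Legally Blonde .', 'emotion': 'neutral'},
]

span_lengths = span_word_lengths(sample_dialogue)
length_histogram = Counter(span_lengths)  # span length -> number of spans
print(span_lengths)
print(length_histogram)
```

To plot the result like the other histograms here, the list can be accumulated over every dialogue in `train`, wrapped in a one-column `pd.DataFrame`, and passed to `px.histogram`.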

In [11]:
# Find the number of utterances per dialog (Histogram + Total count)
print(f"Total number of utterances: {sum(num_utterance_list)}")
num_utterance_df = pd.DataFrame(num_utterance_list, columns=['utterance_per_dialogue'])

fig = px.histogram(num_utterance_df, x="utterance_per_dialogue")
fig.update_layout(bargap=0.5, hovermode="x")
fig.show()

Total number of utterances: 8206


In [14]:
# Find the number of words per utterance
print(f"Total number of words: {sum(utterance_length_list)}")
utterance_length_df = pd.DataFrame(utterance_length_list, columns=['utterance_length'])

fig = px.histogram(utterance_length_df, x="utterance_length")
fig.update_layout(bargap=0.5, hovermode="x")
fig.show()

Total number of words: 112553


# Self Analysis

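- The train split contains 8,206 utterances across 834 dialogues (about 9.8 utterances per dialogue).
- These utterances contain 112,553 space-separated tokens in total (about 13.7 per utterance). Punctuation marks set off by spaces are counted as tokens.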
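
The totals can be complemented with a few summary statistics. The sketch below is an illustration only: `summarize` is a hypothetical helper, and the input list is made-up toy data rather than values from the dataset. In the notebook it would be applied to `num_utterance_list` and `utterance_length_list`.

```python
from statistics import mean, median


def summarize(values):
    """Return the total, mean, median, and maximum of a list of counts."""
    return {
        'total': sum(values),
        'mean': round(mean(values), 2),
        'median': median(values),
        'max': max(values),
    }


# Toy data for illustration only, not taken from the dataset
toy_counts = [2, 4, 6, 8]
summary = summarize(toy_counts)
print(summary)
```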