## Converting the GAP corpus into ConvoKit Format

This notebook describes how to convert the GAP corpus into ConvoKit format.

The original version of the GAP corpus can be downloaded from:

https://github.com/gmfraser/gap-corpus


### The GAP corpus

The Group Affect and Performance (GAP) Corpus was collected at the University of the Fraser Valley (UFV, Canada). The original dataset comprises 28 group meetings of two to four members each, with a total of 84 participants. All recorded conversations are in English.

In each meeting, group members complete a Winter Survival Task (WST), a group decision-making exercise in which participants rank 15 items according to their importance in a hypothetical plane crash scenario. Participants first rank the items individually. Each group is then given a maximum of 15 minutes to complete the WST.

For this notebook, we are using the following files:
- <b>Individual-Level Meeting Data.xlsx</b> contains information about each speaker.
- <b>Group-Level Meeting Data.xlsx</b> contains information about each group.
- The folder <b>Merged/No-Punctuation/</b> contains 28 transcripts, including the metadata about each utterance.
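
Before running the cells below, it can help to confirm that the downloaded repository has the layout this notebook expects. The following is a minimal sketch, assuming the archive was unpacked into gap-corpus-master (the directory used in the cells below); the helper name missing_inputs is hypothetical.

```python
import os

# Relative locations used later in this notebook (below the unpacked repository root).
DATA_DIR = "Final-Corpus-Transcripts-Annotations-Data"
EXPECTED = [
    os.path.join(DATA_DIR, "Group-Individual-Data", "Individual-Level Meeting Data.xlsx"),
    os.path.join(DATA_DIR, "Group-Individual-Data", "Group-Level Meeting Data.xlsx"),
    os.path.join(DATA_DIR, "Merged", "No-Punctuation"),
]

def missing_inputs(root):
    """Return the expected files/folders that do not exist under root."""
    return [p for p in EXPECTED if not os.path.exists(os.path.join(root, p))]

# "gap-corpus-master" is an assumption; replace it with your own directory.
print(missing_inputs("gap-corpus-master"))
```

An empty list means all three inputs were found.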

In [1]:
from tqdm import tqdm 
import pandas as pd
from convokit import Corpus, Speaker, Utterance, download
import re
import glob, os, csv

### 1. Creating Speakers

There are a total of 84 speakers in the GAP corpus.
We will read the metadata for each speaker from <b>Individual-Level Meeting Data.xlsx</b>.

We include the following information for each participant:
- Year at UFV
- Gender
- English: first or second language
- AIS: Absolute Individual Score
- AII: Absolute Individual Influence
- Ind_TE: Time Expectations 
- Ind_WW: Worked Well Together
- Ind_TM: Time Management
- Ind_Eff: Efficiency
- Ind_QW: Overall Quality of Work
- Ind_Sat: Overall Satisfaction 
- Ind_Lead: Leadership
- Group Number

In [2]:
## replace with your directory
speaker_data = "gap-corpus-master/Final-Corpus-Transcripts-Annotations-Data/Group-Individual-Data/Individual-Level Meeting Data.xlsx"
speaker_df = pd.read_excel(speaker_data)
# we add an additional column to indicate which group the speaker belongs to
speaker_df["Group Number"] = speaker_df["Group Member"].str.split('.').str[0]

In [3]:
speaker_df.head()

Unnamed: 0,Group Member,Year at UFV,Gender,English,AIS,AII,Ind_TE,Ind_WW,Ind_TM,Ind_Eff,Ind_QW,Ind_Sat,Ind_Lead,Group Number
0,1.Blue,1,2,1,64.0,76.0,5,5.0,5.0,5.0,5.0,5.0,2,1
1,1.Pink,4,2,1,88.0,40.0,5,5.0,5.0,5.0,5.0,5.0,5,1
2,1.Green,6,1,2,85.0,12.0,4,5.0,5.0,4.0,5.0,4.6,4,1
3,2.Pink,3,1,1,92.0,48.0,4,4.0,5.0,5.0,4.0,4.4,3,2
4,2.Blue,4,2,1,68.0,20.0,5,5.0,5.0,5.0,5.0,5.0,4,2


We convert the dataframe to a dictionary and create a Speaker object for each group member, adding the metadata.

In [4]:
speaker_meta = speaker_df.set_index('Group Member').T.to_dict()

In [5]:
corpus_speakers = {k: Speaker(id = k, meta = v) for k,v in speaker_meta.items()}

We can now verify that there are 84 participants in the GAP corpus.

In [6]:
print("Number of speakers in the corpus: {}/84".format(len(corpus_speakers)))

Number of speakers in the corpus: 84/84
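
As an additional sanity check, group sizes can be derived from the speaker ids alone, since each id starts with the group number. A minimal sketch using the five ids shown in the table above (in the notebook, list(corpus_speakers) would take the place of the hard-coded sample):

```python
from collections import Counter

# Sample ids copied from speaker_df.head(); in the notebook, use list(corpus_speakers).
speaker_ids = ["1.Blue", "1.Pink", "1.Green", "2.Pink", "2.Blue"]

# The group number is everything before the first dot.
group_sizes = Counter(speaker_id.split(".")[0] for speaker_id in speaker_ids)
print(dict(group_sizes))

# Groups whose size falls outside the documented range of two to four members.
unexpected = {g: n for g, n in group_sizes.items() if not 2 <= n <= 4}
print(unexpected)
```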


Checking a speaker from the corpus, we see that the metadata is now included.

In [7]:
print("Metadata for the speaker: ", corpus_speakers["11.Pink"].meta)

Metadata for the speaker:  {'Year at UFV': 3, 'Gender': 2, 'English': 1, 'AIS': 70.0, 'AII': 38.0, 'Ind_TE': 2, 'Ind_WW': 3.0, 'Ind_TM': 4.0, 'Ind_Eff': 2.0, 'Ind_QW': 3.0, 'Ind_Sat': 2.8, 'Ind_Lead': 4, 'Group Number': '11'}


### 2. Creating Utterance Objects

Utterances can be found in the folder <b>Merged/No-Punctuation/</b>. Each group conversation is recorded in a separate CSV file that includes the sentence-level annotations.

Each utterance in the GAP corpus has the following information, which is aligned with the Utterance schema from ConvoKit:

- idx: unique utterance id, e.g. 1.Green.70
- speaker: speaker name with group number, e.g. 1.Green
- root: id of the first utterance of each group, e.g. 1.Pink.1
- reply_to: previous idx, e.g. 1.Blue.105
- timestamp: start time of the utterance, stored as a string, e.g. 00:19.7
- text: sentence of utterance, without punctuation

Additional metadata includes:

- Duration: in seconds and milliseconds
- Sentiment: whether the sentence bears any positive or negative sentiment
- Decision: denotes a group-decision process; possible values include Proposal, Acceptance, Rejection, and Confirmation
- Private: whether the speaker is referring to a private item
- Survival Item: what survival item was mentioned
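
The id scheme above makes it possible to recover the group and speaker from an utterance id without a lookup. A minimal sketch, assuming every id has the form group.colour.number as in the examples; parse_utterance_id is a hypothetical helper, not part of ConvoKit:

```python
def parse_utterance_id(idx):
    """Split an id such as '1.Green.70' into its parts."""
    # rsplit from the right so only the trailing utterance number is cut off
    speaker_id, number = idx.rsplit(".", 1)
    group = speaker_id.split(".", 1)[0]
    return {"group": group, "speaker": speaker_id, "number": int(number)}

print(parse_utterance_id("1.Green.70"))
```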


In [8]:
# The folder "Merged/No-Punctuation" contains 28 CSV files.

all_meetings = glob.glob("gap-corpus-master/Final-Corpus-Transcripts-Annotations-Data/Merged/No-Punctuation/*.csv")

In [9]:
utterance_corpus = {}

for meeting in tqdm(all_meetings):
    
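    #print("Meeting Name: ",  meeting)
    df = pd.read_csv(meeting)
    # the first utterance of the meeting serves as the root of the conversation
    root = df["Participant"][0]
    # build the per-utterance metadata once per meeting instead of once per row
    meta_records = df.drop(columns = ["Participant", "Start", "Sentence"]).to_dict("records")

    for index in range(len(df)):
        idx = df["Participant"][index]
        # strip the utterance number to get the speaker id, e.g. 1.Green.70 -> 1.Green
        speaker = re.sub(r"\.\d+", "", idx)
        start = df["Start"][index]
        sentence = df["Sentence"][index]
        # each utterance replies to the one directly before it in the transcript
        if index > 0:
            reply_to = df["Participant"][index - 1]
        else:
            reply_to = None
        
        meta = meta_records[index]
        
        utterance_corpus[idx] = Utterance(id=idx, speaker=corpus_speakers[speaker], root = root, reply_to = reply_to, timestamp = start, text=sentence, meta = meta)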

100%|██████████████████████████████████████████████████████████████████████████████████| 28/28 [00:45<00:00,  1.63s/it]
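
The loop above links each utterance to the one immediately before it in the transcript, so every conversation becomes a single linear chain. The same rule in isolation, as a sketch on a hand-made list of ids (the ids are illustrative, not taken from a transcript):

```python
# Utterance ids in transcript order (illustrative values).
ids = ["1.Pink.1", "1.Blue.1", "1.Pink.2", "1.Green.1"]

# Pair every id with its predecessor; the first utterance has no parent.
reply_to = dict(zip(ids, [None] + ids[:-1]))
root = ids[0]

for idx in ids:
    print(idx, "->", reply_to[idx])
```

Because reply_to is purely positional, it records transcript order rather than who was actually being addressed.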


We can examine an Utterance object to verify that it contains, among other things, an id, the speaker, the actual sentence, and the metadata.

In [10]:
utterance_corpus['1.Pink.5']

Utterance({'obj_type': 'utterance', '_owner': None, 'meta': {'End': '00:21.6', 'Duration': '00:01.8', 'Sentiment': nan, 'Decision': 'Proposal', 'Private': nan, 'Survival Item': 'Cigarette Lighter'}, '_id': '1.Pink.5', 'speaker': Speaker({'obj_type': 'speaker', '_owner': None, 'meta': {'Year at UFV': 4, 'Gender': 2, 'English': 1, 'AIS': 88.0, 'AII': 40.0, 'Ind_TE': 5, 'Ind_WW': 5.0, 'Ind_TM': 5.0, 'Ind_Eff': 5.0, 'Ind_QW': 5.0, 'Ind_Sat': 5.0, 'Ind_Lead': 5, 'Group Number': '1'}, '_id': '1.Pink'}), 'conversation_id': '1.Pink.1', '_root': '1.Pink.1', 'reply_to': '1.Blue.6', 'timestamp': '00:19.7', 'text': '"So I would say cigarette lighter is two"'})
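
The timestamp, End and Duration fields are stored as strings. If numeric values are needed for analysis, they can be converted after the fact. A minimal sketch, assuming the colon-separated format seen in the output above (for example 00:19.7); to_seconds is a hypothetical helper:

```python
def to_seconds(stamp):
    """Convert 'MM:SS.s' (or 'HH:MM:SS.s') to seconds as a float."""
    seconds = 0.0
    # Fold from the left: each colon shifts the running total by a factor of 60.
    for part in stamp.strip().split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

print(to_seconds("00:19.7"))
print(to_seconds("00:21.6"))
```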

### 3. Creating a Corpus from a List of Utterances

To instantiate a Corpus, we pass it a list of the Utterance objects stored in the utterance_corpus dictionary.

In [11]:
utterance_list = list(utterance_corpus.values())
gap_corpus = Corpus(utterances=utterance_list, version = 1)

Let's take a look at a random utterance in the corpus.

In [12]:
gap_corpus.random_utterance()

Utterance({'obj_type': 'utterance', '_owner': <convokit.model.corpus.Corpus object at 0x0000022E0B52B7C0>, 'meta': {'End': '00:14.6', 'Duration': '00:01.0', 'Sentiment': nan, 'Decision': nan, 'Private': nan, 'Survival Item': 'Cigarette Lighter'}, '_id': '13.Yellow.6', 'speaker': Speaker({'obj_type': 'speaker', '_owner': <convokit.model.corpus.Corpus object at 0x0000022E0B52B7C0>, 'meta': {'Year at UFV': 2, 'Gender': 1, 'English': 1, 'AIS': 77.0, 'AII': 24.0, 'Ind_TE': 4, 'Ind_WW': 4.0, 'Ind_TM': 3.0, 'Ind_Eff': 4.0, 'Ind_QW': 3.0, 'Ind_Sat': 3.6, 'Ind_Lead': 4, 'Group Number': '13'}, '_id': '13.Yellow'}), 'conversation_id': '13.Yellow.1', '_root': '13.Yellow.1', 'reply_to': '13.Yellow.5', 'timestamp': '00:13.6', 'text': '"Cigarette lighter"'})
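
Empty annotation cells arrive as NaN, as in the Sentiment, Decision and Private fields above. NaN is a float value rather than a missing-value marker, and Python's json module writes it as the non-standard token NaN. One option, sketched here under the assumption that None is an acceptable placeholder for a missing annotation, is to clean each metadata dictionary before building the Utterance; clean_meta is a hypothetical helper:

```python
import math

def clean_meta(meta):
    """Return a copy of meta with float NaN values replaced by None."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in meta.items()
    }

# Values copied from the utterance shown above.
raw = {
    "Sentiment": float("nan"),
    "Decision": float("nan"),
    "Private": float("nan"),
    "Survival Item": "Cigarette Lighter",
}
print(clean_meta(raw))
```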

Looking at some quick stats:

In [13]:
gap_corpus.print_summary_stats()

Number of Speakers: 84
Number of Utterances: 8009
Number of Conversations: 28


### 4. Adding Metadata

Each conversation has associated metadata, which can be found in <b>Group-Level Meeting Data.xlsx</b>. We will read the metadata and attach it to each conversation in the corpus.

The metadata we will include is:

- Group Number
- Meeting Size
- Meeting Length in Minutes
- AGS: Absolute Group Score
- Group_TE: Time Expectations
- Group_WW: Worked Well Together
- Group_TM: Time Management
- Group_Eff: Efficiency
- Group_QW: Overall Quality of Work
- Group_Sat: Overall Satisfaction


In [14]:
group_data = "gap-corpus-master/Final-Corpus-Transcripts-Annotations-Data/Group-Individual-Data/Group-Level Meeting Data.xlsx"
group_df = pd.read_excel(group_data)
group_df.head()

Unnamed: 0,Meeting,Meeting Size,Meeting Length in Minutes,AGS,Group_TE,Group_WW,Group_TM,Group_Eff,Group_QW,Group_Sat
0,1,3,9.75,78,4.67,5.0,5.0,4.67,5.0,4.87
1,2,2,9.83,68,4.5,4.5,5.0,5.0,4.5,4.7
2,3,2,8.05,68,4.5,4.5,4.5,5.0,4.5,4.7
3,4,3,3.95,86,4.67,4.33,4.67,4.33,4.67,4.53
4,5,4,12.6,80,3.25,4.0,4.25,4.25,3.75,3.9


In [15]:
group_meta = group_df.set_index('Meeting').T.to_dict()

In [16]:
for convo in gap_corpus.iter_conversations():
    
    convo_id = convo.get_id()
    group_number = convo_id.split(".")[0]
    
    convo.meta['Group Number'] = group_number
    convo.meta.update(group_meta[int(group_number)])

If we check the second conversation, it will now include the added metadata.

In [17]:
gap_corpus.get_conversation("2.Pink.1").meta

{'Group Number': '2',
 'Meeting Size': 2.0,
 'Meeting Length in Minutes': 9.83,
 'AGS': 68.0,
 'Group_TE': 4.5,
 'Group_WW': 4.5,
 'Group_TM': 5.0,
 'Group_Eff': 5.0,
 'Group_QW': 4.5,
 'Group_Sat': 4.7}
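
Note that Meeting Size and AGS come back as floats (2.0, 68.0) although they appear as whole numbers in the table. This is a side effect of transposing: when all columns are numeric, .T converts the frame to a single common dtype, here float. A sketch of an alternative, to_dict(orient="index"), which builds the same nested dictionary without transposing; the two-row frame is copied from the group-level table above:

```python
import pandas as pd

# First two rows of the group-level table shown earlier (subset of columns).
group_df = pd.DataFrame({
    "Meeting": [1, 2],
    "Meeting Size": [3, 2],
    "Meeting Length in Minutes": [9.75, 9.83],
    "AGS": [78, 68],
})

via_transpose = group_df.set_index("Meeting").T.to_dict()
via_index = group_df.set_index("Meeting").to_dict(orient="index")

print(via_transpose[2])  # integer columns are upcast to float
print(via_index[2])      # each column keeps its own type
```

Whether the float values matter depends on how the metadata is used downstream; the numbers themselves are unchanged.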

We will finally add the corpus name as corpus-level metadata.

In [18]:
gap_corpus.meta['name'] = 'GAP corpus'

### 5. Saving to disk

As the last step, we save the corpus for later use. By default, saved datasets are written to .convokit/saved-corpora in your home directory.

In [19]:
gap_corpus.dump("gap-corpus")
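
To use the corpus in a later session, it can be loaded back from the same location. A minimal sketch, assuming the default save location under the home directory (the path used with meta_index below); saved_corpus_path is a hypothetical helper, and the load is skipped if the directory does not exist:

```python
import os.path

def saved_corpus_path(name):
    """Build the default ConvoKit save location for a corpus name."""
    return os.path.join(os.path.expanduser("~"), ".convokit", "saved-corpora", name)

path = saved_corpus_path("gap-corpus")

if os.path.isdir(path):
    # Imported here so the path helper also works without ConvoKit installed.
    from convokit import Corpus
    reloaded = Corpus(filename=path)
    reloaded.print_summary_stats()
else:
    print("No saved corpus found at", path)
```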

After saving, we can inspect the metadata index of the dataset directly, without loading it.

In [20]:
from convokit import meta_index
import os.path
meta_index(filename = os.path.join(os.path.expanduser("~"), ".convokit/saved-corpora/gap-corpus"))

{'utterances-index': {'End': "<class 'str'>",
  'Duration': "<class 'str'>",
  'Sentiment': "<class 'float'>",
  'Decision': "<class 'float'>",
  'Private': "<class 'float'>",
  'Survival Item': "<class 'float'>"},
 'speakers-index': {'Year at UFV': "<class 'int'>",
  'Gender': "<class 'int'>",
  'English': "<class 'int'>",
  'AIS': "<class 'float'>",
  'AII': "<class 'float'>",
  'Ind_TE': "<class 'int'>",
  'Ind_WW': "<class 'float'>",
  'Ind_TM': "<class 'float'>",
  'Ind_Eff': "<class 'float'>",
  'Ind_QW': "<class 'float'>",
  'Ind_Sat': "<class 'float'>",
  'Ind_Lead': "<class 'int'>",
  'Group Number': "<class 'str'>"},
 'conversations-index': {'Group Number': "<class 'str'>",
  'Meeting Size': "<class 'float'>",
  'Meeting Length in Minutes': "<class 'float'>",
  'AGS': "<class 'float'>",
  'Group_TE': "<class 'float'>",
  'Group_WW': "<class 'float'>",
  'Group_TM': "<class 'float'>",
  'Group_Eff': "<class 'float'>",
  'Group_QW': "<class 'float'>",
  'Group_Sat': "<class 'fl