## Converting the CANDOR Corpus into ConvoKit format 

This notebook helps people working with the CANDOR Corpus quickly transform it into ConvoKit format.
You can request the CANDOR Corpus separately from https://betterup-data-requests.herokuapp.com/ and then run this notebook to build a ConvoKit version of the corpus.

In [1]:
from tqdm import tqdm
from convokit import Corpus, Speaker, Utterance
from collections import defaultdict
import pandas as pd

Below, replace CANDOR_PATH with the local directory path of your copy of the CANDOR Corpus. The CANDOR Corpus directory should contain three files (survey.tsv plus the two transcript files, transcript_backbiter.tsv and transcript_cliffhanger.tsv) and one subdirectory, raw, which holds the transcripts for all three transcription methods.

In [2]:
# replace this with the directory where your CANDOR corpus is saved;
# keep a trailing slash, because the paths below are built by string concatenation
CANDOR_PATH = '<YOUR CANDOR CORPUS DIRECTORY>'
# ls: raw  survey.tsv  transcript_backbiter.tsv  transcript_cliffhanger.tsv
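
Before going further, it can help to confirm that the directory has the expected layout. The sketch below is only an illustration: the helper name `missing_candor_entries` is hypothetical, and the list of expected entries is copied from the listing above, so adjust it if your copy of the corpus is organized differently.

```python
import os

# Entries expected directly under the corpus directory, taken from the
# listing above; adjust if your copy of the corpus is laid out differently.
EXPECTED_ENTRIES = ["raw", "survey.tsv", "transcript_backbiter.tsv", "transcript_cliffhanger.tsv"]


def missing_candor_entries(candor_path, expected=EXPECTED_ENTRIES):
    """Return the expected entries that are absent from candor_path."""
    return [name for name in expected if not os.path.exists(os.path.join(candor_path, name))]


# Example use (uncomment once CANDOR_PATH points at your local copy):
# print(missing_candor_entries(CANDOR_PATH))
```

An empty list means every expected entry was found.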

In [3]:
survey_path = CANDOR_PATH + "survey.tsv"
survey = pd.read_csv(survey_path, delimiter='\t')

### Creating Speaker List

We build the speaker list by going through the survey and extracting every speaker who appears in it. Every participant who took part in a video call during the experiment filled in the survey before and after the call, since the survey was a required part of CANDOR data collection.

In [4]:
all_speakers = list(set(survey['user_id'].to_list() + survey['partner_id'].to_list()))
len(all_speakers)

1454

In [5]:
corpus_speakers = {k: Speaker(id = k, meta = {}) for k in all_speakers}

In [6]:
print("number of speakers in the data = {}".format(len(corpus_speakers)))

number of speakers in the data = 1454
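
The speaker extraction above can also be written so that missing ids are dropped explicitly. This is a minimal sketch on a toy table: the ids `u1` to `u3` are made up, only the column names `user_id` and `partner_id` come from the survey, and whether the real survey contains missing ids is not something this notebook checks.

```python
import pandas as pd

# Toy stand-in for the survey table; the ids are invented for illustration.
toy_survey = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "partner_id": ["u2", "u1", None],  # one missing partner id on purpose
})

# Stack both id columns, drop missing values, and keep each id once.
ids = pd.concat([toy_survey["user_id"], toy_survey["partner_id"]]).dropna().unique()
toy_speakers = sorted(ids)
print(toy_speakers)  # ['u1', 'u2', 'u3']
```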


### Creating Utterance List

Now we extract all utterances from the corpus, conversation by conversation. Each conversation is stored as its own folder inside the "raw" folder, so we go through these folders and extract the utterances one conversation at a time.

Note that there are three versions of the transcript for each conversation, corresponding to three different ways of processing the audio transcription: Audiophile, Cliffhanger, and Backbiter. For consistency, we recommend sticking with one type of transcription when constructing the corpus. Set the "transcription_type" variable to your intended processing method, and refer to the paper for details on each method.

CANDOR corpus paper: https://www.science.org/doi/epdf/10.1126/sciadv.adf3197

In [8]:
import os
conversations_list_path = CANDOR_PATH + "raw/"
conversations = [d for d in os.listdir(conversations_list_path) if os.path.isdir(os.path.join(conversations_list_path, d))]
transcription_type = "cliffhanger" # or "backbiter" or "audiophile"
len(conversations)

99
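
Since a typo in `transcription_type` only shows up later as a missing-file error, one option is to validate the value when the transcript path is built. The helper below is a hypothetical sketch, not part of ConvoKit or the CANDOR tooling; it assumes the `raw/<convo_id>/transcription/transcript_<type>.csv` layout that the next cell reads from.

```python
import os

# The three processing methods named above, lowercased to match file names
# such as transcript_cliffhanger.csv.
VALID_TYPES = ("audiophile", "cliffhanger", "backbiter")


def transcript_path(raw_dir, convo_id, transcription_type):
    """Build the path to one conversation's transcript, validating the type first."""
    if transcription_type not in VALID_TYPES:
        raise ValueError(
            f"unknown transcription type {transcription_type!r}; expected one of {VALID_TYPES}"
        )
    return os.path.join(raw_dir, convo_id, "transcription", f"transcript_{transcription_type}.csv")
```

Using `os.path.join` here also removes the need for a trailing slash on the directory path.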

Note that the fields of ConvoKit Utterance objects are:
Utterance(id=..., speaker=..., conversation_id=..., reply_to=..., timestamp=..., text=..., meta=...)

In [9]:
utt_id_count = 0
corpus_utterances = {}
for convo_id in conversations:
    meta_path = f"{conversations_list_path}{convo_id}/metadata.json"
    transcription_path = f"{conversations_list_path}{convo_id}/transcription/transcript_{transcription_type}.csv"
    transcription = pd.read_csv(transcription_path)
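    for index, row in transcription.iterrows():
        # the first turn of a conversation replies to nothing (a real None, not the string "None");
        # every later turn replies to the utterance right before it
        reply_to = None if row['turn_id'] == 0 else str(utt_id_count - 1)
        # keep every column except the speaker and the text as utterance metadata
        meta = {k: v for k, v in row.items() if k != "speaker" and k != "utterance"}
        utt = Utterance(id=str(utt_id_count), speaker=corpus_speakers[row['speaker']], conversation_id=str(convo_id), reply_to=reply_to, timestamp=row['start'], text=row['utterance'], meta=meta)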
        corpus_utterances[utt_id_count] = utt
        utt_id_count += 1

print(utt_id_count)

30075
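
The id and reply_to bookkeeping in the loop can be checked in isolation. The sketch below uses plain dictionaries instead of ConvoKit objects, with made-up conversation ids and a hypothetical helper name. It shows a single global counter assigning ids across conversations, while the first turn of each conversation keeps a real `None` as its reply_to. Note that calling `str()` on `None` would give the string `"None"`, which is not the same as having no parent.

```python
def assign_reply_chain(turn_ids_by_convo):
    """Assign global string ids and reply_to links across conversations."""
    records = []
    next_id = 0
    for convo_id, turn_ids in turn_ids_by_convo.items():
        for turn_id in turn_ids:
            # a conversation's first turn has no parent; later turns reply to the previous id
            reply_to = None if turn_id == 0 else str(next_id - 1)
            records.append({"id": str(next_id), "conversation_id": convo_id, "reply_to": reply_to})
            next_id += 1
    return records


records = assign_reply_chain({"convo_a": [0, 1, 2], "convo_b": [0, 1]})
roots = [r["id"] for r in records if r["reply_to"] is None]
print(roots)  # ['0', '3']
```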


In [10]:
utterance_list = list(corpus_utterances.values())

In [11]:
CANDOR_corpus = Corpus(utterances=utterance_list)

### Updating Conversation Info

Here we update the conversation-level information, in particular the metadata from the surveys that participants filled in.

Each conversation is a video call between two people, and each participant filled in one survey, so we have two surveys per conversation. We organize the metadata as follows:

convo.meta = {"survey field name" : {"sp_A id" : "sp_A survey value", "sp_B id" : "sp_B survey value"}, ... }

We chose this way of organizing the metadata because we usually focus on a few survey fields and analyze the values from the two participants. This format allows us to extract such information quickly.

Feel free to modify the format to suit your research or work needs.
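
To make the layout concrete, here is a small sketch that pivots two survey rows into the field-first format and then compares one field across both participants. The rows are plain dictionaries with invented values; only the field names `pre_affect` and `affect` come from the survey, and the helper name `field_major_meta` is hypothetical.

```python
def field_major_meta(row_a, row_b, skip=("convo_id", "user_id")):
    """Pivot two survey rows into {field: {speaker_id: value}}."""
    sp_a, sp_b = row_a["user_id"], row_b["user_id"]
    return {
        field: {sp_a: row_a[field], sp_b: row_b[field]}
        for field in row_a
        if field not in skip
    }


# Invented survey rows for one conversation.
row_a = {"user_id": "sp_A", "convo_id": "c1", "pre_affect": 5.0, "affect": 7.0}
row_b = {"user_id": "sp_B", "convo_id": "c1", "pre_affect": 4.0, "affect": 8.0}

meta = field_major_meta(row_a, row_b)
# With the field as the outer key, a per-speaker comparison is a single lookup per field.
change = {sp: meta["affect"][sp] - meta["pre_affect"][sp] for sp in meta["affect"]}
print(change)  # {'sp_A': 2.0, 'sp_B': 4.0}
```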

In [12]:
print("number of conversations in the dataset = {}".format(len(CANDOR_corpus.get_conversation_ids())))

number of conversations in the dataset = 99


Below we see what the surveys from the two participants of a random conversation look like.

In [13]:
convo = CANDOR_corpus.random_conversation()
survey[survey["convo_id"] == convo.id]

Unnamed: 0.1,Unnamed: 0,user_id,partner_id,convo_id,date,survey_duration_in_seconds,time_zone,pre_affect,pre_arousal,technical_quality,...,my_conscientious,my_neurotic,my_open,your_extraversion,your_agreeable,your_conscientious,your_neurotic,your_open,who_i_talked_to_most_past24,most_common_format_past24
1794,0,5b5e7e643bac1d0001f9bf28,5f07070d75038e05e9fcf515,8bf369ea-ab22-4723-ba24-caeff2204a9e,2020-08-13,2793,8.0,5.0,5.0,1.0,...,2.0,4.333333,5.0,2.333333,4.666667,3.0,3.0,3.666667,,
1795,1,5f07070d75038e05e9fcf515,5b5e7e643bac1d0001f9bf28,8bf369ea-ab22-4723-ba24-caeff2204a9e,2020-08-13,5638,10.0,5.0,5.0,1.0,...,4.0,2.0,4.666667,3.666667,3.666667,2.666667,3.666667,4.0,,


In [23]:
for convo in CANDOR_corpus.iter_conversations():
    convo_id = convo.id
    row1 = survey[survey['convo_id'] == convo_id].iloc[0]
    row2 = survey[survey['convo_id'] == convo_id].iloc[1]
    sp_A = row1['user_id']
    sp_B = row2['user_id']
    metadata = {}
    for field in list(row1.index[1:]):
        if field != "convo_id" and field != 'user_id':
            field_values = {sp_A : row1[field], sp_B : row2[field]}
            metadata.update({field : field_values})
    convo.meta = metadata

# Use the block below instead if you want the metadata formatted as: convo.meta = {sp_A : {field : value, ...}, sp_B : {field : value, ...}}
# for convo in CANDOR_corpus.iter_conversations():
#     convo_id = convo.id
#     row1 = survey[survey['convo_id'] == convo_id].iloc[0]
#     row2 = survey[survey['convo_id'] == convo_id].iloc[1]
#     metadata = {row1['user_id'] : {}, row2['user_id'] : {}}
#     for row in [row1, row2]:
#         for k, v in row.items():
#             if k != "convo_id":
#                 metadata[row['user_id']].update({k : v})
#     convo.meta = metadata
#     convo.meta.update({"speaker_A" : row1['user_id']})
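
If the field-first metadata has already been built, it can also be converted to the speaker-first layout afterwards without rereading the survey. The following is a minimal sketch on plain dictionaries; the helper name `to_speaker_major` and the example values are made up for illustration.

```python
def to_speaker_major(field_major):
    """Turn {field: {speaker: value}} into {speaker: {field: value}}."""
    speaker_major = {}
    for field, per_speaker in field_major.items():
        for speaker_id, value in per_speaker.items():
            speaker_major.setdefault(speaker_id, {})[field] = value
    return speaker_major


field_major = {"affect": {"sp_A": 7.0, "sp_B": 8.0}, "arousal": {"sp_A": 7.0, "sp_B": 7.0}}
speaker_major = to_speaker_major(field_major)
print(speaker_major["sp_B"])  # {'affect': 8.0, 'arousal': 7.0}
```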

In [25]:
convo = CANDOR_corpus.random_conversation()
convo.meta

ConvoKitMeta({'partner_id': {'5e52af8dd120e7000bc35826': '5e8ab4b84d3d6775b807e9ba', '5e8ab4b84d3d6775b807e9ba': '5e52af8dd120e7000bc35826'}, 'date': {'5e52af8dd120e7000bc35826': '2020-10-20', '5e8ab4b84d3d6775b807e9ba': '2020-10-20'}, 'survey_duration_in_seconds': {'5e52af8dd120e7000bc35826': 3524, '5e8ab4b84d3d6775b807e9ba': 2726}, 'time_zone': {'5e52af8dd120e7000bc35826': 5.0, '5e8ab4b84d3d6775b807e9ba': 5.0}, 'pre_affect': {'5e52af8dd120e7000bc35826': 4.0, '5e8ab4b84d3d6775b807e9ba': 7.0}, 'pre_arousal': {'5e52af8dd120e7000bc35826': 4.0, '5e8ab4b84d3d6775b807e9ba': 4.0}, 'technical_quality': {'5e52af8dd120e7000bc35826': 1.0, '5e8ab4b84d3d6775b807e9ba': 2.0}, 'conv_length': {'5e52af8dd120e7000bc35826': 32.0, '5e8ab4b84d3d6775b807e9ba': 31.0}, 'affect': {'5e52af8dd120e7000bc35826': 7.0, '5e8ab4b84d3d6775b807e9ba': 8.0}, 'arousal': {'5e52af8dd120e7000bc35826': 7.0, '5e8ab4b84d3d6775b807e9ba': 7.0}, 'overall_affect': {'5e52af8dd120e7000bc35826': 7.0, '5e8ab4b84d3d6775b807e9ba': 7.0}, '

In [26]:
convo.get_speaker_ids()

['5e8ab4b84d3d6775b807e9ba', '5e52af8dd120e7000bc35826']

### Save the Corpus

In [27]:
SAVE_PATH = '<YOUR DIRECTORY TO SAVE CORPUS>'
CANDOR_corpus.dump(f"CANDOR-corpus-{transcription_type}", base_path=SAVE_PATH)

In [28]:
from convokit import meta_index
meta_index(filename = f"{SAVE_PATH}/CANDOR-corpus-{transcription_type}")

{'utterances-index': {'turn_id': ["<class 'int'>"],
  'start': ["<class 'float'>"],
  'stop': ["<class 'float'>"],
  'interval': ["<class 'float'>"],
  'delta': ["<class 'float'>"],
  'questions': ["<class 'int'>"],
  'end_question': ["<class 'bool'>"],
  'overlap': ["<class 'bool'>"],
  'n_words': ["<class 'int'>"]},
 'speakers-index': {},
 'conversations-index': {'partner_id': ["<class 'dict'>"],
  'date': ["<class 'dict'>"],
  'survey_duration_in_seconds': ['bin'],
  'time_zone': ["<class 'dict'>"],
  'pre_affect': ["<class 'dict'>"],
  'pre_arousal': ["<class 'dict'>"],
  'technical_quality': ["<class 'dict'>"],
  'conv_length': ["<class 'dict'>"],
  'affect': ["<class 'dict'>"],
  'arousal': ["<class 'dict'>"],
  'overall_affect': ["<class 'dict'>"],
  'overall_arousal': ["<class 'dict'>"],
  'overall_memory_rating': ["<class 'dict'>"],
  'begin_affect': ["<class 'dict'>"],
  'begin_arousal': ["<class 'dict'>"],
  'begin_memory_rating': ["<class 'dict'>"],
  'begin_memory_text': [
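
In the index above, `survey_duration_in_seconds` is listed as `'bin'` rather than as a dict. One plausible explanation, stated here as an assumption and not verified against this run, is that the values taken from the pandas rows are NumPy integers, which the standard `json` module cannot serialize, so the field ends up stored in binary form. If plain JSON metadata is preferred, a sketch like the one below could convert such values to native Python types before they are assigned to `convo.meta`. The helper name `to_native` is hypothetical; it relies only on the `.item()` method that NumPy scalars provide.

```python
def to_native(value):
    """Return a plain Python value for NumPy-style scalars, recursing into dicts."""
    if isinstance(value, dict):
        return {key: to_native(inner) for key, inner in value.items()}
    # NumPy scalars expose .item(), which returns the equivalent Python object
    if hasattr(value, "item") and callable(value.item):
        return value.item()
    return value
```

For example, `metadata.update({field : to_native(field_values)})` would be one place to apply it in the conversation loop.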

### Retrieve Corpus

In [29]:
my_CANDOR_corpus = Corpus(filename=f"{SAVE_PATH}/CANDOR-corpus-{transcription_type}")

In [30]:
my_CANDOR_corpus.print_summary_stats()

Number of Speakers: 186
Number of Utterances: 30075
Number of Conversations: 99
