# Converting Fora Dataset to ConvoKit format

Notebook Contributors: Yash Chatha, Laerdon Kim

This notebook helps people working with the Fora corpus quickly convert it into ConvoKit format.
Details about the construction of the corpus are available in the original paper (please cite this paper if you use the corpus):
`Schroeder, H., Roy, D., & Kabbara, J. (2024). Fora: A corpus and framework for the study of facilitated dialogue. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 13985–14001). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.754`

This notebook was written to run in Google Colab: the setup cells mount Google Drive and read the raw data from there.

# Installations + Setup

In [None]:
! pip install convokit

In [2]:
from tqdm import tqdm
from convokit import Corpus, Speaker, Utterance
from collections import defaultdict, Counter
import pandas as pd
import numpy as np

In [3]:
# mount Google Drive so the notebook can read the raw CSV

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
%cd /content/drive/MyDrive/corpus_resources

/content/drive/MyDrive/corpus_resources


## Reading raw CSV data

We begin by reading the raw CSV file into a pandas DataFrame.

In [47]:
corpus_raw = pd.read_csv('data.csv')

In [48]:
corpus_raw.head()

Unnamed: 0.1,Unnamed: 0,id,collection_title,collection_id,SpeakerTurn,audio_start_offset,audio_end_offset,duration,conversation_id,speaker_id,...,source_type,Personal story,Personal experience,Express affirmation,Specific invitation,Provide example,Open invitation,Make connections,Express appreciation,Follow up question
0,0,"[225700000, 225700001, 225700002, 225700003, 2...",Maine ED 2050,150,1,0.387,167.35,164.947,2257,42121,...,zoom,0,0,0,1,0,1,0,0,0
1,1,[225700005],Maine ED 2050,150,2,167.862,176.817,8.955,2257,42122,...,zoom,0,0,0,0,0,0,0,0,0
2,2,[225700006],Maine ED 2050,150,3,176.94,204.537,27.597,2257,42121,...,zoom,0,0,1,0,0,0,0,0,0
3,3,[225700007],Maine ED 2050,150,4,204.537,205.362,0.825,2257,42122,...,zoom,0,0,0,0,0,0,0,0,0
4,4,[225700008],Maine ED 2050,150,5,205.5,215.575,10.075,2257,42121,...,zoom,0,0,0,0,0,1,0,0,0


Rename the unnamed column 'Unnamed: 0' to 'original_index'. This column records the row order of the unmodified data.

In [49]:
corpus_raw.rename( columns={'Unnamed: 0':'original_index'}, inplace=True )
corpus_raw.head()

Unnamed: 0,original_index,id,collection_title,collection_id,SpeakerTurn,audio_start_offset,audio_end_offset,duration,conversation_id,speaker_id,...,source_type,Personal story,Personal experience,Express affirmation,Specific invitation,Provide example,Open invitation,Make connections,Express appreciation,Follow up question
0,0,"[225700000, 225700001, 225700002, 225700003, 2...",Maine ED 2050,150,1,0.387,167.35,164.947,2257,42121,...,zoom,0,0,0,1,0,1,0,0,0
1,1,[225700005],Maine ED 2050,150,2,167.862,176.817,8.955,2257,42122,...,zoom,0,0,0,0,0,0,0,0,0
2,2,[225700006],Maine ED 2050,150,3,176.94,204.537,27.597,2257,42121,...,zoom,0,0,1,0,0,0,0,0,0
3,3,[225700007],Maine ED 2050,150,4,204.537,205.362,0.825,2257,42122,...,zoom,0,0,0,0,0,0,0,0,0
4,4,[225700008],Maine ED 2050,150,5,205.5,215.575,10.075,2257,42121,...,zoom,0,0,0,0,0,1,0,0,0


We convert each 'start_time' value from a string to a datetime object.

In [50]:
from datetime import datetime, timedelta

for i, date in enumerate(corpus_raw['start_time']):
    corpus_raw.loc[i, 'start_time'] = datetime.strptime(date, "%m/%d/%y %H:%M")
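
The row-by-row loop above can also be written as one vectorized call. The following is a minimal sketch, assuming every raw string follows the same `%m/%d/%y %H:%M` pattern; the two sample values are made up for illustration. Note that the result holds pandas `Timestamp` values rather than plain `datetime` objects.

```python
import pandas as pd

# Hypothetical sample values in the "%m/%d/%y %H:%M" layout used above.
sample = pd.DataFrame({"start_time": ["11/02/20 23:05", "01/15/21 09:30"]})

# Parse the whole column at once instead of looping over rows.
sample["start_time"] = pd.to_datetime(sample["start_time"], format="%m/%d/%y %H:%M")

print(sample["start_time"].iloc[0])
```

Passing an explicit `format` keeps the parsing strict, so a malformed date raises an error instead of being silently guessed.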

## Creating speaker metadata

Following this, we create a list of all the unique speaker_ids within the CSV.

In [51]:
all_speakers = list(set(corpus_raw['speaker_id'].to_list()))

In [52]:
len(all_speakers)

1776

In [53]:
corpus_speakers = {}
for speaker_id in all_speakers:
    # isolate all the rows which have a particular speaker_id
    speaker_csv = corpus_raw[corpus_raw['speaker_id'] == speaker_id]
    speaker_metadata = {'speaker_name' : speaker_csv['speaker_name'].iloc[0],
                        'is_fac' : speaker_csv['is_fac'].iloc[0]}
    corpus_speakers.update({speaker_id : Speaker(id = speaker_id, meta = speaker_metadata)})
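
The loop above filters the whole DataFrame once per speaker. As a hedged alternative, the same per-speaker metadata can be collected in a single pass. This sketch uses a toy DataFrame: only speaker 42122 ("Mark") comes from the notebook output, and the other name is a placeholder.

```python
import pandas as pd

# Toy rows; "Example Facilitator" is a made-up placeholder name.
toy = pd.DataFrame({
    "speaker_id": [42121, 42122, 42121],
    "speaker_name": ["Example Facilitator", "Mark", "Example Facilitator"],
    "is_fac": [True, False, True],
})

# Keep the first row seen for each speaker, then index the metadata by speaker_id.
speaker_meta = (
    toy.drop_duplicates("speaker_id")
       .set_index("speaker_id")[["speaker_name", "is_fac"]]
       .to_dict("index")
)

print(speaker_meta[42122])
```

Each entry of `speaker_meta` could then be passed as the `meta` argument of `Speaker(id=..., meta=...)`, as in the cell above.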

In [54]:
print("number of speakers in the data = {}".format(len(corpus_speakers)))

number of speakers in the data = 1776


In [55]:
print(corpus_speakers[42122]) # observe Mark

Speaker(id: 42122, vectors: [], meta: {'speaker_name': 'Mark', 'is_fac': False})


In [56]:
corpus_raw = corpus_raw.sort_values(by='original_index')

## Creating a list of utterances

Our next step is to create a list of utterances from the DataFrame. To do this, we populate the arguments passed to the Utterance constructor for each row. Please note that the raw .csv provides 'id' as a string that encodes a list of ints (for example, '[225700005]'). Because ConvoKit utterance ids must be strings, the cell below uses that string unchanged as the utterance id.
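
If the individual integer ids are needed, the 'id' string can be parsed back into a list of ints. The helper below is a minimal sketch: `parse_id_list` is a hypothetical name, not part of ConvoKit or the Fora release, and it assumes the string is a complete Python-style list literal.

```python
import ast

def parse_id_list(raw_id: str) -> list[int]:
    """Turn a string such as '[225700005]' into a list of ints."""
    parsed = ast.literal_eval(raw_id)  # safely evaluates the list literal
    return [int(x) for x in parsed]

print(parse_id_list("[225700005]"))
print(parse_id_list("[273300002, 273300003]"))
```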

In [57]:
import os

In [58]:
# next goal: get a list of Utterances

# Note that the fields of ConvoKit Utterance objects are: Utterance(id=...,
# speaker =..., conversation_id =..., reply_to=..., timestamp=..., text =..., meta =...)

corpus_utterances = []
prev_utt_id = None
for index, row in tqdm(corpus_raw.iterrows()):
    # create arguments to put into the utterance

    # note that `id` must be a str.
    current_id = row['id']

    # adding metadata: every column except the id and the text itself
    meta = {k: v for k, v in row.items() if k not in ('id', 'words')}

    current_speaker = row['speaker_id']

    # note that `conversation_id` must be a str.
    current_conversation_id = row['conversation_id']

    # the first turn of a conversation replies to nothing; every later turn
    # replies to the utterance immediately before it
    current_reply_to = None if row['SpeakerTurn'] == 1 else prev_utt_id

    # audio_start_offset is measured in seconds from the start of the recording
    current_timestamp = row['start_time'] + timedelta(seconds=row['audio_start_offset'])
    str_timestamp = current_timestamp.isoformat()
    current_text = row['words']

    utterance = Utterance(id = current_id,
                          speaker = corpus_speakers[current_speaker],
                          conversation_id = str(current_conversation_id),
                          reply_to = current_reply_to,
                          timestamp = str_timestamp,
                          text = str(current_text),
                          meta=meta)

    corpus_utterances.append(utterance)

    # remember this id so that the next utterance can reply to it
    prev_utt_id = current_id

39911it [00:04, 9126.11it/s]
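
The reply chain and the timestamps can be checked in isolation, without ConvoKit. The sketch below runs on made-up rows. `link_rows` is a hypothetical helper, and it assumes that rows are already sorted in their original order and that `audio_start_offset` is measured in seconds.

```python
from datetime import datetime, timedelta

# Toy rows in original order; all values are illustrative only.
rows = [
    {"id": "[100]", "conversation_id": 1, "SpeakerTurn": 1, "audio_start_offset": 0.5},
    {"id": "[101]", "conversation_id": 1, "SpeakerTurn": 2, "audio_start_offset": 10.0},
    {"id": "[200]", "conversation_id": 2, "SpeakerTurn": 1, "audio_start_offset": 0.0},
]
start_time = datetime(2020, 11, 2, 23, 5)

def link_rows(rows, start_time):
    """Return (id, reply_to, ISO timestamp) for each row."""
    linked = []
    prev_id = None
    for row in rows:
        # The first turn of a conversation has no parent utterance.
        reply_to = None if row["SpeakerTurn"] == 1 else prev_id
        # Offsets are added as seconds, not days.
        timestamp = start_time + timedelta(seconds=row["audio_start_offset"])
        linked.append((row["id"], reply_to, timestamp.isoformat()))
        # Update only after reply_to is computed, so a row never replies to itself.
        prev_id = row["id"]
    return linked

linked = link_rows(rows, start_time)
for entry in linked:
    print(entry)
```

Updating `prev_id` at the end of the loop body is the important design choice: setting it before computing `reply_to` would make every utterance reply to itself.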


In [59]:
corpus_object = Corpus(utterances=corpus_utterances)

With the Corpus object created, we can print summary statistics and inspect individual conversations and utterances.

In [60]:
corpus_object.print_summary_stats()

Number of Speakers: 1776
Number of Utterances: 39911
Number of Conversations: 262


In [61]:
random_convo = corpus_object.random_conversation()
print(random_convo)
print(corpus_object.random_utterance().meta)

Conversation('id': '2733', 'utterances': ['[273300000]', '[273300001]', '[273300002, 273300003]', '[273300004, 273300005, 273300006, 273300007]', '[273300008]', '[273300009]', '[273300010]', '[273300011]', '[273300012]', '[273300013]', '[273300014]', '[273300015]', '[273300016]', '[273300017]', '[273300018]', '[273300019]', '[273300020]', '[273300021]', '[273300022]', '[273300023]', '[273300024]', '[273300025]', '[273300026]', '[273300027]', '[273300028]', '[273300029]', '[273300030]', '[273300031]', '[273300032]', '[273300033]', '[273300034]', '[273300035]', '[273300036]', '[273300037]', '[273300038]', '[273300039]', '[273300040]', '[273300041]', '[273300042]', '[273300043]', '[273300044]', '[273300045]', '[273300046]', '[273300047, 273300048]', '[273300049]', '[273300050]', '[273300051]', '[273300052]', '[273300053]', '[273300054, 273300055, 273300056]', '[273300057]', '[273300058]', '[273300059]', '[273300060]', '[273300061]', '[273300062]', '[273300063, 273300064, 273300065]', '[27

In [62]:
print("number of conversations in the dataset = {}".format(len(corpus_object.get_conversation_ids())))

number of conversations in the dataset = 262


## Creating conversation metadata

The following attributes in the Fora dataset are conversation-wide. Their descriptions are taken from the dataset's column_descriptions:

    * collection_id: Numeric identifier of the conversation collection.

    * conversation_id: Unique identifier for the conversation.

    * cofacilitated: Boolean representing whether the current conversation has more than one facilitator.

    * annotated: Boolean representing whether the conversation was annotated by human experts for facilitation strategies and personal sharing.

    * start_time: Date of the conversation start time. Likely reliable as the date the conversation happened, but may be approximate due to potential delay in uploading.

    * source_type: String providing information about the type of audio input (e.g., Zoom, Hearth, iPhone).

    * location: Represents the location of the conversation, typically a town or neighborhood. About 1/3 of conversations do not have a value for this field and are marked "Unknown."
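
Before copying these fields from a single row per conversation, it can be worth confirming that they really are constant within each conversation. The following is a minimal sketch on a toy DataFrame; the rows are illustrative, and only a subset of the fields is shown for brevity.

```python
import pandas as pd

# Subset of the conversation-level fields, for illustration.
conversation_fields = ["collection_id", "source_type", "location"]

# Toy rows; the "Unknown" location is a placeholder value.
toy = pd.DataFrame({
    "conversation_id": [2257, 2257, 831, 831],
    "collection_id": [150, 150, 106, 106],
    "source_type": ["zoom", "zoom", "zoom", "zoom"],
    "location": ["Unknown", "Unknown", "Downtown", "Downtown"],
})

# Count distinct values of each field within each conversation.
per_convo = toy.groupby("conversation_id")[conversation_fields].nunique(dropna=False)

# Any conversation with more than one distinct value in a field is inconsistent.
inconsistent = per_convo[(per_convo > 1).any(axis=1)]
print(inconsistent.empty)
```

If `inconsistent` is empty, taking the first row of each conversation loses no information for these fields.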


In [63]:
conversation_metadata_headers = ['collection_id', 'conversation_id', 'cofacilitated', 'annotated', 'start_time', 'source_type', 'location']

In [64]:
for convo in corpus_object.iter_conversations():
    convo_id = convo.id
    convo_row = corpus_raw[corpus_raw['conversation_id'] == int(convo_id)].iloc[0]
    metadata = {}
    for field_name in conversation_metadata_headers:
        field_value = convo_row[field_name]
        metadata.update({field_name : field_value})
    convo.meta = metadata

In [69]:
# sample a random conversation and query its metadata

random_convo = corpus_object.random_conversation()
print(random_convo.meta)

ConvoKitMeta({'collection_id': 106, 'conversation_id': 831, 'cofacilitated': True, 'annotated': False, 'start_time': datetime.datetime(2020, 11, 2, 23, 5), 'source_type': 'zoom', 'location': 'Downtown'})


In [66]:
SAVE_PATH = '/content/drive/MyDrive/corpus_resources'
corpus_object.dump("fora-corpus", base_path=SAVE_PATH)

from convokit import meta_index
meta_index(filename = f"{SAVE_PATH}/fora-corpus")

{'utterances-index': {'original_index': ["<class 'int'>"],
  'collection_title': ["<class 'str'>"],
  'collection_id': ["<class 'int'>"],
  'SpeakerTurn': ["<class 'int'>"],
  'audio_start_offset': ["<class 'float'>"],
  'audio_end_offset': ["<class 'float'>"],
  'duration': ["<class 'float'>"],
  'conversation_id': ["<class 'int'>"],
  'speaker_id': ["<class 'int'>"],
  'speaker_name': ["<class 'str'>"],
  'is_fac': ["<class 'bool'>"],
  'cofacilitated': ["<class 'bool'>"],
  'annotated': ["<class 'bool'>"],
  'start_time': ['bin'],
  'location': ["<class 'str'>"],
  'source_type': ["<class 'str'>"],
  'Personal story': ["<class 'int'>"],
  'Personal experience': ["<class 'int'>"],
  'Express affirmation': ["<class 'int'>"],
  'Specific invitation': ["<class 'int'>"],
  'Provide example': ["<class 'int'>"],
  'Open invitation': ["<class 'int'>"],
  'Make connections': ["<class 'int'>"],
  'Express appreciation': ["<class 'int'>"],
  'Follow up question': ["<class 'int'>"]},
 'speakers