In [1]:
# import os
# os.chdir('../..')

In [2]:
from convokit import Corpus, Utterance, User




## Merging two corpora

Let's take a look at the characteristics of our input corpora before the merge. Apart from the summary statistics, notice that User 'foxtrot' appears in both corpora with inconsistent User metadata: the key 'yellow' maps to 'food' in corpus1 but to 'mood' in corpus2.

The root field in each Utterance identifies the Conversation it belongs to (the id of the Utterance that starts that Conversation). In this case, while there are 2 conversations in each corpus, 1 conversation (with root 2) appears in both corpora, so there are only 3 distinct conversations in total.

### Corpus 1

In [4]:
corpus1 = Corpus(utterances = [
            Utterance(id="0", root="0", text="hello world", user=User(name="alice")),
            Utterance(id="1", root="0", reply_to="0", text="my name is bob", user=User(name="bob")),
            Utterance(id="2", root="2", text="this is a sentence", user=User(name="foxtrot", meta={"yellow": "food"})),
        ])

In [24]:
corpus1.print_summary_stats()

Number of Users: 3
Number of Utterances: 3
Number of Conversations: 2


### Corpus 2

In [6]:
corpus2 = Corpus(utterances = [
            Utterance(id="3", root="3", text="i like pie", user=User(name="charlie", meta={"what": "a mood", "hey": "food"})),
            Utterance(id='4', root='3', reply_to='3', text="sentence galore", user=User(name="echo")),
            Utterance(id='2', root='2', text="this is a sentence", user=User(name="foxtrot", meta={"yellow": "mood", "hello": "world"})),
        ])

In [25]:
corpus2.print_summary_stats()

Number of Users: 3
Number of Utterances: 3
Number of Conversations: 2


Let's attempt a merge:

In [8]:
corpus3 = corpus1.merge(corpus2)



In [26]:
corpus3.print_summary_stats()

Number of Users: 5
Number of Utterances: 5
Number of Conversations: 3


### Merging user metadata

Because User 'foxtrot' had conflicting metadata, the metadata attached to the later utterance (i.e. the one in corpus2) took precedence for the conflicting key 'yellow'. We verify this below. Note too that the non-conflicting key-value pair ('hello': 'world') from corpus2 has been added to the merged metadata as well.

In [10]:
corpus3.get_user('foxtrot').meta

{'yellow': 'mood', 'hello': 'world'}
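The result above behaves like an ordinary dictionary update in which the second corpus wins on conflicting keys. Here is a rough sketch with plain dicts; it only mirrors the observed outcome and makes no claim about how ConvoKit implements it internally.

```python
# Metadata for 'foxtrot' as given in corpus1 and corpus2 above.
foxtrot_in_corpus1 = {"yellow": "food"}
foxtrot_in_corpus2 = {"yellow": "mood", "hello": "world"}

# Later entries override earlier ones on shared keys;
# keys that appear in only one dict are carried over unchanged.
merged_meta = {**foxtrot_in_corpus1, **foxtrot_in_corpus2}

print(merged_meta)  # {'yellow': 'mood', 'hello': 'world'}
```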

The Users above were created without their lists of corresponding Utterances and Conversations. Corpus has a method for updating these User lists.

In [28]:
print(list(corpus3.iter_users()))
user_echo = corpus3.get_user('echo')
print()
user_echo.print_user_stats()

[User({'obj_type': 'user', '_owner': <convokit.model.corpus.Corpus object at 0x138ae8470>, 'meta': {}, '_id': 'alice', '_name': 'alice'}), User({'obj_type': 'user', '_owner': <convokit.model.corpus.Corpus object at 0x138ae8470>, 'meta': {}, '_id': 'bob', '_name': 'bob'}), User({'obj_type': 'user', '_owner': <convokit.model.corpus.Corpus object at 0x138ae8470>, 'meta': {'yellow': 'mood', 'hello': 'world'}, '_id': 'foxtrot', '_name': 'foxtrot'}), User({'obj_type': 'user', '_owner': <convokit.model.corpus.Corpus object at 0x138ae8470>, 'meta': {'what': 'a mood', 'hey': 'food'}, '_id': 'charlie', '_name': 'charlie'}), User({'obj_type': 'user', '_owner': <convokit.model.corpus.Corpus object at 0x138ae8470>, 'meta': {}, '_id': 'echo', '_name': 'echo'})]

Number of Utterances: 1
Number of Conversations: 1


### Merging Utterance and Corpus metadata 

We briefly demonstrate how Utterance and Corpus metadata are merged. This is also handled by merge(); we simply make its effects explicit here. In addition, we deliberately construct the corpora with conflicting data and metadata so that the warning behavior is visible.

(Note that if Utterances have the same id but different data, the Utterance from the other Corpus is ignored and a warning is printed, though the User metadata is still kept.)

### Corpus 4

In [14]:
corpus4 = Corpus(utterances = [
            Utterance(id='0', root='0', text="hello world", user=User(name="alice"), meta={'in': 'wonderland'}),
            Utterance(id='1', root='0', reply_to='0', text="my name is bob", user=User(name="bob"), meta={'fu': 'bu'})
        ])
corpus4.add_meta('AB', 1)
corpus4.add_meta('CD', 2)


In [30]:
corpus4.print_summary_stats()

Number of Users: 2
Number of Utterances: 2
Number of Conversations: 1


### Corpus 5

In [16]:
corpus5 = Corpus(utterances = [
            Utterance(id='0', root='0', text="hello world", user=User(name="alice"), meta={'in': 'the hat'}),
            Utterance(id='1', root='0', reply_to='0', text="my name is bobbb", user=User(name="bob"), meta={'barrel': 'roll'})
        ])
corpus5.add_meta('AB', 3)
corpus5.add_meta('EF', 3)

In [31]:
corpus5.print_summary_stats()

Number of Users: 2
Number of Utterances: 2
Number of Conversations: 1


In [18]:
corpus6 = corpus4.merge(corpus5)

Utterance('id': '1', 'root': 0, 'reply-to': 0, 'user': User('id': bob, 'meta': {}), 'timestamp': None, 'text': 'my name is bob', 'meta': {'fu': 'bu'})
Utterance('id': '1', 'root': 0, 'reply-to': 0, 'user': User('id': bob, 'meta': {}), 'timestamp': None, 'text': 'my name is bobbb', 'meta': {'barrel': 'roll'})
Ignoring second corpus's utterance.


In [32]:
corpus6.print_summary_stats()

Number of Users: 2
Number of Utterances: 2
Number of Conversations: 1


In [20]:
corpus6.meta

{'AB': 3, 'CD': 2, 'EF': 3}
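To see in advance which Corpus metadata keys would be overwritten, one can compare the two metadata dicts before merging. The helper below is hypothetical (`find_conflicting_keys` is not part of ConvoKit) and works on plain dicts holding the values set on corpus4 and corpus5 above.

```python
def find_conflicting_keys(first_meta, second_meta):
    """Return the keys present in both dicts whose values differ."""
    shared_keys = first_meta.keys() & second_meta.keys()
    return sorted(key for key in shared_keys if first_meta[key] != second_meta[key])


# Corpus-level metadata as set with add_meta() on corpus4 and corpus5.
corpus4_meta = {"AB": 1, "CD": 2}
corpus5_meta = {"AB": 3, "EF": 3}

print(find_conflicting_keys(corpus4_meta, corpus5_meta))  # ['AB']
```

Only 'AB' is in conflict, which is why it is the only key whose value changed (from 1 to 3) in the merged metadata, while 'CD' and 'EF' were carried over as they were.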

In [21]:
corpus6.get_utterance('1')

Utterance({'obj_type': 'utterance', '_owner': <convokit.model.corpus.Corpus object at 0x138b1c668>, 'meta': {'fu': 'bu'}, '_id': '1', 'user': User({'obj_type': 'user', '_owner': <convokit.model.corpus.Corpus object at 0x138b1c668>, 'meta': {}, '_id': 'bob', '_name': 'bob'}), 'root': '0', 'reply_to': '0', 'timestamp': None, 'text': 'my name is bob'})

In [22]:
corpus6.get_utterance('0')

Utterance({'obj_type': 'utterance', '_owner': <convokit.model.corpus.Corpus object at 0x138b1c668>, 'meta': {'in': 'the hat'}, '_id': '0', 'user': User({'obj_type': 'user', '_owner': <convokit.model.corpus.Corpus object at 0x138b1c668>, 'meta': {}, '_id': 'alice', '_name': 'alice'}), 'root': '0', 'reply_to': None, 'timestamp': None, 'text': 'hello world'})
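The two outputs above are consistent with a simple rule: when both corpora hold an Utterance with the same id and the same data, the metadata is merged with the second corpus taking precedence; when the data differs, the second Utterance is dropped entirely. The sketch below models that rule with plain dicts. It is an illustration under assumptions, not ConvoKit's implementation: `merge_utterances` is a hypothetical helper, and it compares only the text field, whereas the library may compare other fields too.

```python
def merge_utterances(first, second):
    """Merge two {utterance id: {"text": str, "meta": dict}} mappings.

    Returns the merged mapping and the ids from `second` that were ignored.
    """
    # Copy the first mapping so the inputs are left untouched.
    merged = {
        utt_id: {"text": utt["text"], "meta": dict(utt["meta"])}
        for utt_id, utt in first.items()
    }
    ignored_ids = []
    for utt_id, utt in second.items():
        if utt_id not in merged:
            # New utterance: take it as is.
            merged[utt_id] = {"text": utt["text"], "meta": dict(utt["meta"])}
        elif merged[utt_id]["text"] != utt["text"]:
            # Same id, different data: keep the first and record the conflict.
            ignored_ids.append(utt_id)
        else:
            # Same id, same data: merge metadata, second takes precedence.
            merged[utt_id]["meta"].update(utt["meta"])
    return merged, ignored_ids


# Data copied from corpus4 and corpus5 above.
corpus4_utts = {
    "0": {"text": "hello world", "meta": {"in": "wonderland"}},
    "1": {"text": "my name is bob", "meta": {"fu": "bu"}},
}
corpus5_utts = {
    "0": {"text": "hello world", "meta": {"in": "the hat"}},
    "1": {"text": "my name is bobbb", "meta": {"barrel": "roll"}},
}

merged, ignored_ids = merge_utterances(corpus4_utts, corpus5_utts)
print(merged["0"]["meta"])  # {'in': 'the hat'}
print(merged["1"]["meta"])  # {'fu': 'bu'}
print(ignored_ids)          # ['1']
```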

For the most part, however, as long as the data is well behaved (e.g. User/Utterance/Conversation/Corpus objects do not have different values for the same metadata key, and Utterances with the same id have the same data), one should expect to see no warnings when using merge().

In [34]:
list(list(corpus6.iter_conversations())[0].iter_utterances())

[Utterance({'obj_type': 'utterance', '_owner': <convokit.model.corpus.Corpus object at 0x138b1c668>, 'meta': {'in': 'the hat'}, '_id': '0', 'user': User({'obj_type': 'user', '_owner': <convokit.model.corpus.Corpus object at 0x138b1c668>, 'meta': {}, '_id': 'alice', '_name': 'alice'}), 'root': '0', 'reply_to': None, 'timestamp': None, 'text': 'hello world'}),
 Utterance({'obj_type': 'utterance', '_owner': <convokit.model.corpus.Corpus object at 0x138b1c668>, 'meta': {'fu': 'bu'}, '_id': '1', 'user': User({'obj_type': 'user', '_owner': <convokit.model.corpus.Corpus object at 0x138b1c668>, 'meta': {}, '_id': 'bob', '_name': 'bob'}), 'root': '0', 'reply_to': '0', 'timestamp': None, 'text': 'my name is bob'})]

In [None]:
corpus6.dump('temp-corpus', './')