In [1]:
from model import Corpus, Utterance, User


In [2]:
def stat_summary(corpus: Corpus) -> None:
    print("Number of conversations:", len(list(corpus.iter_conversations())))
    print("Number of utterances:", len(list(corpus.iter_utterances())))
    print("Number of users:", len(list(corpus.iter_users())))

## Merging two corpora

Let's look at the input corpora before the merge. Beyond the summary statistics, notice that User 'foxtrot' appears in both corpora with inconsistent User metadata: the key 'yellow' maps to 'food' in corpus 1 and to 'mood' in corpus 2.

The root field in each Utterance identifies the Conversation that the Utterance belongs to. Each corpus contains 2 conversations, but 1 conversation (with root 2) appears in both corpora, so there are only 3 distinct conversations in total.
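The conversation count can be checked by hand: each distinct root value marks one conversation, so the merged total is the size of the union of the two root sets. The following is a minimal sketch using plain tuples rather than the Corpus class; the variable names are illustrative and not part of the library.

```python
# (utterance id, root) pairs copied from the two corpora defined in the cells below
corpus1_pairs = [(0, 0), (1, 0), (2, 2)]
corpus2_pairs = [(3, 3), (4, 3), (2, 2)]

# Each distinct root corresponds to one conversation
roots1 = {root for _, root in corpus1_pairs}
roots2 = {root for _, root in corpus2_pairs}

shared_roots = roots1 & roots2   # conversations present in both corpora
all_roots = roots1 | roots2      # conversations after de-duplication

print(len(roots1), len(roots2))  # 2 2
print(sorted(shared_roots))      # [2]
print(len(all_roots))            # 3
```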

### Corpus 1

In [3]:
corpus1 = Corpus(utterances = [
            Utterance(id=0, root=0, text="hello world", user=User(name="alice")),
            Utterance(id=1, root=0, reply_to=0, text="my name is bob", user=User(name="bob")),
            Utterance(id=2, root=2, text="this is a sentence", user=User(name="foxtrot", meta={"yellow": "food"})),
        ])

In [4]:
stat_summary(corpus1)

Number of conversations: 2
Number of utterances: 3
Number of users: 3


### Corpus 2

In [5]:
corpus2 = Corpus(utterances = [
            Utterance(id=3, root=3, text="i like pie", user=User(name="charlie", meta={"what": "a mood", "hey": "food"})),
            Utterance(id=4, root=3, reply_to=3, text="sentence galore", user=User(name="echo")),
            Utterance(id=2, root=2, text="this is a sentence", user=User(name="foxtrot", meta={"yellow": "mood", "hello": "world"})),
        ])

In [6]:
stat_summary(corpus2)

Number of conversations: 2
Number of utterances: 3
Number of users: 3


Let's attempt a merge:

In [7]:
corpus3 = corpus1.merge(corpus2)



In [8]:
stat_summary(corpus3)

Number of conversations: 3
Number of utterances: 5
Number of users: 5


### Merging user metadata

Because User 'foxtrot' had conflicting metadata, the User metadata attached to the latest utterance (i.e. the utterance in corpus2) takes precedence for the conflicting key 'yellow'. We verify this below. Note too that the non-conflicting key-value pair ('hello': 'world') has been added to the metadata as well.

In [9]:
corpus3.get_user('foxtrot').meta

{'yellow': 'mood', 'hello': 'world'}
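The result above is what a plain dictionary update would produce, with the second corpus winning on conflicting keys. The sketch below reproduces it with ordinary dicts; whether the library implements the merge this way internally is an assumption, and the variable names are illustrative.

```python
# User metadata for 'foxtrot' as given in corpus1 and corpus2
meta_corpus1 = {"yellow": "food"}
meta_corpus2 = {"yellow": "mood", "hello": "world"}

# The later mapping wins on conflicting keys; keys unique to either side are kept
merged_meta = {**meta_corpus1, **meta_corpus2}

print(merged_meta)  # {'yellow': 'mood', 'hello': 'world'}
```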

Users are not initialized with their lists of corresponding utterances and conversations, so these lists start out empty. Corpus has a method, update_users_data(), for populating them.

In [10]:
print(list(corpus3.iter_users()))
user_echo = corpus3.get_user('echo')
print("Number of utterances for User echo:", len(list(user_echo.iter_utterances())))
print("Number of conversations for User echo:", len(list(user_echo.iter_conversations())))

[User([('name', 'bob')]), User([('name', 'foxtrot')]), User([('name', 'echo')]), User([('name', 'alice')]), User([('name', 'charlie')])]
Number of utterances for User echo: 0
Number of conversations for User echo: 0


In [11]:
corpus3.update_users_data()

In [12]:
print("Number of utterances for User echo:", len(list(user_echo.iter_utterances())))
print("Number of conversations for User echo:", len(list(user_echo.iter_conversations())))

Number of utterances for User echo: 1
Number of conversations for User echo: 1
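Conceptually, populating the per-User lists amounts to grouping utterances by their author. The sketch below shows that grouping with plain tuples and dictionaries; it is an illustration of the idea only, not the actual implementation of update_users_data(), and all names in it are hypothetical.

```python
from collections import defaultdict

# (utterance id, root, user name) for the five utterances in the merged corpus
utterances = [
    (0, 0, "alice"),
    (1, 0, "bob"),
    (2, 2, "foxtrot"),
    (3, 3, "charlie"),
    (4, 3, "echo"),
]

utts_by_user = defaultdict(list)   # user name -> utterance ids
convos_by_user = defaultdict(set)  # user name -> roots of conversations joined

for utt_id, root, user in utterances:
    utts_by_user[user].append(utt_id)
    convos_by_user[user].add(root)

print(len(utts_by_user["echo"]), len(convos_by_user["echo"]))  # 1 1
```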


### Merging Utterance metadata 

We briefly demonstrate how Utterance metadata is merged. (Note that if two Utterances have the same id but different data, the Utterance from the other Corpus is ignored and a warning is printed, though the User metadata is still kept.)

In [14]:
corpus4 = Corpus(utterances = [
            Utterance(id=0, root=0, text="hello world", user=User(name="alice"), meta={'in': 'wonderland'}),
            Utterance(id=1, root=0, reply_to=0, text="my name is bob", user=User(name="bob"), meta={'fu': 'bu'})
        ])
corpus4.add_meta('AB', 1)
corpus4.add_meta('CD', 2)
corpus5 = Corpus(utterances = [
            Utterance(id=0, root=0, text="hello world", user=User(name="alice"), meta={'in': 'the hat'}),
            Utterance(id=1, root=0, reply_to=0, text="my name is bob", user=User(name="bob"), meta={'barrel': 'roll'})
        ])

In [15]:
corpus6 = corpus4.merge(corpus5)
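The rule described in this section can be sketched with plain dictionaries. Two points in the sketch are assumptions that the notebook does not confirm: that "different data" refers to the non-metadata fields (here text, user and root), and that on conflicting metadata keys the other corpus wins, as it does for User metadata. The helper merge_utterances is hypothetical and not part of the library.

```python
import warnings

def merge_utterances(mine: dict, other: dict) -> dict:
    """Hypothetical helper: merge two {id: utterance-dict} mappings."""
    # Copy so that neither input is modified
    merged = {uid: dict(utt, meta=dict(utt["meta"])) for uid, utt in mine.items()}
    for uid, utt in other.items():
        if uid not in merged:
            merged[uid] = dict(utt, meta=dict(utt["meta"]))
            continue
        kept = merged[uid]
        # Assumption: "data" means the non-metadata fields
        same_data = all(kept[f] == utt[f] for f in ("text", "user", "root"))
        if same_data:
            # Assumption: the other corpus wins on conflicting metadata keys
            kept["meta"].update(utt["meta"])
        else:
            warnings.warn(f"Utterance {uid} differs between corpora; keeping the original")
    return merged

# Utterances of corpus4 and corpus5, reduced to plain dicts
utts4 = {
    0: {"text": "hello world", "user": "alice", "root": 0, "meta": {"in": "wonderland"}},
    1: {"text": "my name is bob", "user": "bob", "root": 0, "meta": {"fu": "bu"}},
}
utts5 = {
    0: {"text": "hello world", "user": "alice", "root": 0, "meta": {"in": "the hat"}},
    1: {"text": "my name is bob", "user": "bob", "root": 0, "meta": {"barrel": "roll"}},
}

merged = merge_utterances(utts4, utts5)
print(merged[0]["meta"])  # {'in': 'the hat'}
print(merged[1]["meta"])  # {'fu': 'bu', 'barrel': 'roll'}
```

Under these assumptions, an Utterance in the other mapping whose text differs would trigger the warning and leave the original entry unchanged.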

