demos passing additional arguments to transformers at transform time.

In [1]:
import os

note: the chdir below is only here because otherwise convokit won't properly import CRAFT.

In [2]:
os.chdir('/home/justine/research/Cornell-Conversational-Analysis-Toolkit/')

In [3]:
import convokit

Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.


TODO: replace the local path below with an actual download call.

In [5]:
corpus = convokit.Corpus(filename='/kitchen/convokit_corpora/tennis-corpus/',
                        utterance_end_index=199)

In [6]:
corpus.print_summary_stats()

Number of Users: 9
Number of Utterances: 200
Number of Conversations: 100


In [7]:
test_utt_id = '1681_14.a'
utt = corpus.get_utterance(test_utt_id)

In [8]:
utt.text

"Yeah, but many friends went with me, Japanese guy. So I wasn't -- I wasn't like homesick. But now sometimes I get homesick."

a transformer that does _not_ take in an additional argument:

In [9]:
from convokit.text_processing import TextProcessor
from convokit import Utterance

In [10]:
def preprocess_text(text):
    text = text.replace(' -- ', ' ')
    return text

In [11]:
prep = TextProcessor(proc_fn=preprocess_text, output_field='clean_text')
corpus = prep.transform(corpus)

In [12]:
utt.get_info('clean_text')

"Yeah, but many friends went with me, Japanese guy. So I wasn't I wasn't like homesick. But now sometimes I get homesick."

In [13]:
from convokit.transformer import Transformer

a toy transformer that takes in an extra argument at transform and transform_utterance.

In [14]:
class AddToken(Transformer):
    
    def __init__(self, output_field):
        self.output_field = output_field
        
    def transform(self, corpus, to_add=' * '):
        for utt in corpus.iter_utterances():
            utt.set_info(self.output_field, utt.text + to_add)
        return corpus
    
    def transform_utterance(self, utt, to_add=' * '):
        if isinstance(utt, str):
            utt = Utterance(text=utt)
        utt.set_info(self.output_field, utt.text + to_add)
        return utt

In [15]:
adder = AddToken('add_test')

In [16]:
corpus = adder.transform(corpus)

default behavior:

In [17]:
utt.get_info('add_test')

"Yeah, but many friends went with me, Japanese guy. So I wasn't -- I wasn't like homesick. But now sometimes I get homesick. * "

ad-hoc behavior:

In [18]:
corpus = adder.transform(corpus, to_add='&&')

In [19]:
utt.get_info('add_test')

"Yeah, but many friends went with me, Japanese guy. So I wasn't -- I wasn't like homesick. But now sometimes I get homesick.&&"

utterance level:

In [21]:
adder.transform_utterance('I -- I am confused.').get_info('add_test')

'I -- I am confused. * '

In [22]:
adder.transform_utterance('I -- I am confused.', '&&').get_info('add_test')

'I -- I am confused.&&'

pipelining:

In [23]:
from convokit.convokitPipeline import ConvokitPipeline

In [24]:
test_pipe = ConvokitPipeline([
    ('prep', TextProcessor(preprocess_text, 'clean_text_pipe')),
    ('add', AddToken('add_pipe_test'))
])

no additional arguments -- default behavior

In [26]:
corpus = test_pipe.transform(corpus)

In [27]:
utt.get_info('add_pipe_test')

"Yeah, but many friends went with me, Japanese guy. So I wasn't -- I wasn't like homesick. But now sometimes I get homesick. * "

additional arguments to pass in, of form stepname__argumentname:

In [28]:
corpus = test_pipe.transform(corpus, add__to_add='&&')

In [29]:
utt.get_info('add_pipe_test')

"Yeah, but many friends went with me, Japanese guy. So I wasn't -- I wasn't like homesick. But now sometimes I get homesick.&&"

note the pipeline fails if you pass an argument that the target step's transform doesn't accept (here, `fake` for the `prep` step):

In [30]:
corpus = test_pipe.transform(corpus, add__to_add='&&', prep__fake=8)

TypeError: transform() got an unexpected keyword argument 'fake'
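
A rough sketch of how `stepname__argumentname` routing could work, to make the convention and the failure above concrete. This is not ConvoKit's actual implementation: `ToyPipeline`, `split_step_kwargs`, `clean`, and `add_token` are hypothetical names made up for illustration, and the steps here chain plain strings rather than writing to utterance fields (so, unlike the corpus example, the token is appended to the cleaned text).

```python
def split_step_kwargs(step_names, **kwargs):
    """Group keyword arguments named stepname__argname by step name."""
    routed = {name: {} for name in step_names}
    for key, value in kwargs.items():
        # split on the first double underscore only
        step, sep, arg = key.partition('__')
        if not sep or step not in routed:
            raise ValueError(f"expected stepname__argname, got {key!r}")
        routed[step][arg] = value
    return routed


class ToyPipeline:
    """Hypothetical stand-in for a pipeline; steps are (name, function) pairs."""

    def __init__(self, steps):
        self.steps = steps

    def transform(self, data, **kwargs):
        routed = split_step_kwargs([name for name, _ in self.steps], **kwargs)
        for name, fn in self.steps:
            # each step only sees the keyword arguments addressed to it
            data = fn(data, **routed[name])
        return data


def clean(text):
    # takes no extra arguments, like the 'prep' step above
    return text.replace(' -- ', ' ')


def add_token(text, to_add=' * '):
    # takes one optional extra argument, like the 'add' step above
    return text + to_add


pipe = ToyPipeline([('prep', clean), ('add', add_token)])
default_out = pipe.transform('What -- is going on')
adhoc_out = pipe.transform('What -- is going on', add__to_add='&&')
```

Under this sketch, `pipe.transform('x', prep__fake=8)` raises a `TypeError` for the same reason as the cell above: the argument is routed to `clean`, which doesn't accept a `fake` keyword.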

also works at utterance level:

In [32]:
test_pipe.transform_utterance('What -- is going on').get_info('add_pipe_test')

'What -- is going on * '

In [33]:
test_pipe.transform_utterance('What -- is going on', add__to_add='&&')\
    .get_info('add_pipe_test')

'What -- is going on&&'
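
One possible way to catch the `prep__fake` kind of mistake before calling a step is to inspect the step's signature with the standard library's `inspect` module. This is a minimal sketch, not something ConvoKit is known to do; `accepts_kwarg`, `transform_plain`, and `transform_with_arg` are hypothetical names used only for illustration.

```python
import inspect


def accepts_kwarg(fn, name):
    """Return True if fn can be called with a keyword argument called `name`."""
    params = inspect.signature(fn).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if name in params and params[name].kind in keyword_kinds:
        return True
    # a **kwargs catch-all accepts any keyword
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def transform_plain(corpus):
    # hypothetical step with no extra arguments
    return corpus


def transform_with_arg(corpus, to_add=' * '):
    # hypothetical step with one extra argument
    return corpus


plain_ok = accepts_kwarg(transform_plain, 'fake')
extra_ok = accepts_kwarg(transform_with_arg, 'to_add')
```

Whether to fail loudly (as the pipeline above does) or silently drop unknown arguments is a design choice; failing loudly makes typos in step or argument names easier to spot.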