# WikiConv Create Conversations

This notebook presents conversations from the WikiConv dataset in several forms. In particular, it shows the final version of selected conversations and how each conversation developed over time. It also provides a framework for printing randomly selected final conversations together with a link to the corresponding Wikipedia page.

In [1]:
#import relevant modules
from datetime import datetime, timedelta
from convokit import Corpus, User, Utterance, Conversation, download
import re
import random

In [2]:
# Load the 2003 wikiconv corpus (feel free to change this to a year of your preference)
wikiconv_corpus = Corpus(filename=download('wikiconv-2003'))

Dataset already exists at /home/jonathan/.convokit/downloads/wikiconv-2003


Some basic facts about this subset of the corpus: it contains 91,787 conversations and 140,265 utterances.

In [3]:
len(list(wikiconv_corpus.iter_conversations()))

91787

In [4]:
len(list(wikiconv_corpus.iter_utterances()))

140265

From the corpus, we will randomly select conversations to print out based on two parameters:
1. `number_of_conversations` - how many conversations we want to print out
2. `conversation_min_length` - the minimum number of utterances we want in each conversation

In [5]:
#Randomly choose the requested number of conversations from the full conversation list
def print_random_conversations(conversation_list, number_of_conversations, conversation_min_length,  conversation_corpus): 
    randomly_generated_conversation_list = []
    while (len(randomly_generated_conversation_list) != number_of_conversations):
        new_conversation = random.randint(0, (len(conversation_list)-1))
        new_conversation_id = conversation_list[new_conversation]
        conversation_ids_list = new_conversation_id.get_utterance_ids()
        #Compare Conversation objects (not the integer index) so duplicates are skipped
        if (new_conversation_id not in randomly_generated_conversation_list 
            and (len(conversation_ids_list) >= conversation_min_length)):
            randomly_generated_conversation_list.append(new_conversation_id)
        
    return randomly_generated_conversation_list
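
The rejection loop above keeps drawing until it has collected enough conversations, so it can loop indefinitely when too few conversations meet the minimum length. A minimal alternative sketch, assuming only that each conversation exposes `get_utterance_ids()` as in the cell above, filters first and then samples without replacement using `random.sample`. The `sample_conversations` name and the `FakeConversation` stand-in are hypothetical and exist only to keep the example self-contained.

```python
import random


class FakeConversation:
    """Hypothetical stand-in for a ConvoKit Conversation (illustration only)."""

    def __init__(self, utterance_ids):
        self._utterance_ids = list(utterance_ids)

    def get_utterance_ids(self):
        return self._utterance_ids


def sample_conversations(conversations, number_of_conversations,
                         conversation_min_length, seed=None):
    """Return up to `number_of_conversations` distinct conversations that
    have at least `conversation_min_length` utterances."""
    eligible = [c for c in conversations
                if len(c.get_utterance_ids()) >= conversation_min_length]
    rng = random.Random(seed)
    # Cap the sample size so random.sample never raises ValueError.
    k = min(number_of_conversations, len(eligible))
    return rng.sample(eligible, k)


# Toy data: conversations with 1, 2, 3, 1 and 5 utterances.
toy_conversations = [FakeConversation(range(n)) for n in (1, 2, 3, 1, 5)]
picked = sample_conversations(toy_conversations, 2, 2, seed=0)
```

With a real corpus, the same function could be called on `list(wikiconv_corpus.iter_conversations())`; if fewer conversations qualify than were requested, it returns all that qualify instead of looping.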


Here we'll get a set of random conversations from the corpus based on our specifications (3 conversations, each with a minimum length of 2 utterances). The output is a list of Conversation objects.

In [6]:
conversation_list = list(wikiconv_corpus.iter_conversations())
number_of_conversations_to_print = 3
conversation_min_length = 2

random_conversations = print_random_conversations(conversation_list, number_of_conversations_to_print,
                                                     conversation_min_length, wikiconv_corpus)
print (random_conversations)

[<convokit.model.Conversation object at 0x7faa2e7f0320>, <convokit.model.Conversation object at 0x7faa2ed34f28>, <convokit.model.Conversation object at 0x7faa2ed60710>]


Next, the conversation metadata stores information about the Wikipedia page that the conversation comes from.
We will use that information to print a link to the associated Wikipedia page for each conversation.


In [7]:
def wikipedia_link_info(conversation):
    page_title = conversation.meta['page_title']
    page_title = re.sub(r'\s+', '_', page_title)
    page_type = conversation.meta['page_type']
    link_value = "https://en.wikipedia.org/w/index.php?title="+page_type+":"+page_title
    
    return link_value

for conversation in random_conversations:
    print(wikipedia_link_info(conversation))
    conversation_ids_list = conversation.get_utterance_ids()

https://en.wikipedia.org/w/index.php?title=talk:Housewife
https://en.wikipedia.org/w/index.php?title=talk:Dance_music_(traditional)
https://en.wikipedia.org/w/index.php?title=talk:Main_Page
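
Page titles can contain characters that are not safe in a URL query string (for example `&` or `?`). The following is a hedged sketch of a more defensive link builder using the standard library's `urllib.parse.quote`. The function name `build_talk_page_url` is hypothetical, and the URL shape is copied from `wikipedia_link_info` above.

```python
import re
from urllib.parse import quote


def build_talk_page_url(page_type, page_title):
    """Build an index.php link from a page type and a page title."""
    # Collapse runs of whitespace into underscores, as wikipedia_link_info does.
    title = re.sub(r"\s+", "_", page_title.strip())
    # Percent-encode everything except characters that are harmless in titles.
    encoded = quote(f"{page_type}:{title}", safe=":/()_")
    return "https://en.wikipedia.org/w/index.php?title=" + encoded


url_plain = build_talk_page_url("talk", "Dance music (traditional)")
url_special = build_talk_page_url("talk", "AT&T")
print(url_plain)
print(url_special)
```

For ordinary titles the result matches the links printed above; only titles with reserved characters come out differently.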


Now that we have the conversations and the Wikipedia pages where they live, we want to print each conversation's final form from the utterance data. To do this, we first need to compute the correct order of the comments.

The corpus does not guarantee that the comments are returned in the right order, so we will compute this ordering now.
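
The idea behind the reordering can be sketched on toy data: group comments by the comment they reply to, then walk the resulting tree depth-first so that every reply appears under its parent. This is a simplified illustration, not the notebook's own algorithm (which also handles parents that were modified, deleted, or restored). The `thread_order` function and the dictionary layout are assumptions made for the example.

```python
from collections import defaultdict


def thread_order(comments):
    """Order comment ids so each reply follows its parent (depth-first).

    `comments` maps id -> (reply_to_id_or_None, timestamp).
    Comments whose parent is missing from `comments` are treated as roots.
    """
    children = defaultdict(list)
    roots = []
    for comment_id, (reply_to, _timestamp) in comments.items():
        if reply_to is None or reply_to not in comments:
            roots.append(comment_id)
        else:
            children[reply_to].append(comment_id)

    def by_time(comment_id):
        return comments[comment_id][1]

    ordered = []
    # Explicit stack; push latest first so the earliest comment is popped first.
    stack = sorted(roots, key=by_time, reverse=True)
    while stack:
        current = stack.pop()
        ordered.append(current)
        stack.extend(sorted(children[current], key=by_time, reverse=True))
    return ordered


toy_comments = {
    "a": (None, 1),   # root comment
    "c": ("a", 3),    # later reply to a
    "b": ("a", 2),    # earlier reply to a
    "d": ("b", 4),    # reply to b
}
toy_order = thread_order(toy_comments)
print(toy_order)  # ['a', 'b', 'd', 'c']
```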


In [8]:
#For comments whose reply-to id matches nothing in the conversation, sort them by timestamp, most recent first
def sort_by_timestamp(conversation_ids_list, conversation_corpus):
    list_of_utterances = []
    for id_val in conversation_ids_list:
        utterance_value = conversation_corpus.get_utterance(id_val)
        timestamp_val = utterance_value.timestamp
        tuple_val = (id_val, timestamp_val)
        list_of_utterances.append(tuple_val)

    sorted_utterance_list = sorted(list_of_utterances, key = lambda x:x[1])
    sorted_utterance_list.reverse()
    id_list = [i[0] for i in sorted_utterance_list]
    return (id_list)

In [9]:
#Find cases in which an utterance's reply to is to a comment in the chain that has been modified, deleted or restored
def check_lists_for_match(x, conversation_ids_list, utterance, next_utterance_value, conversation_corpus):
    modification_list = utterance.meta['modification']
    deletion_list = utterance.meta['deletion']
    restoration_list = utterance.meta['restoration']
    if (len(modification_list)>0):
        for utterance_val in modification_list:
            if (utterance_val.id == next_utterance_value.reply_to):
                conversation_ids_list.insert(x+1, next_utterance_value.id)
    if (len(deletion_list)>0):
        for utterance_val in deletion_list:
            if (utterance_val.id == next_utterance_value.reply_to):
                conversation_ids_list.insert(x+1, next_utterance_value.id)
    if (len(restoration_list)>0):
        for utterance_val in restoration_list:
            if (utterance_val.id == next_utterance_value.reply_to):
                conversation_ids_list.insert(x+1, next_utterance_value.id)

In [10]:
# Build the conversation flow correctly and add utterances if the reply-to id matches the current utterance in the list
def add_utterance(conversation_ids_list, next_utterance_value, conversation_corpus):
    if next_utterance_value.id in conversation_ids_list:
        return conversation_ids_list
    elif (next_utterance_value.reply_to is None):
        conversation_ids_list.append(next_utterance_value.id)
    else:
        for x in range(0,len(conversation_ids_list)):
            utterance_id = conversation_ids_list[x]
            if (utterance_id == next_utterance_value.reply_to):
                conversation_ids_list.insert(x+1, next_utterance_value.id)
            else:
                check_lists_for_match(x, conversation_ids_list, conversation_corpus.get_utterance(utterance_id), next_utterance_value, conversation_corpus)

    return conversation_ids_list

In [11]:
#The order of the returned conversation ids is not guaranteed; compute the correct ordering 
def find_correct_order(conversation_ids_list, conversation_corpus):
    correct_list_order = []
    #if the conversation has only one comment, return the conversation list
    if (len(conversation_ids_list) == 1 ):
        return conversation_ids_list

    #When the conversation has more than one comment, find the correct order of the comments
    if (len(conversation_ids_list) >1):
        #Fail-safe: stop after 20 passes so the loop terminates even if some utterances can never be placed
        number_of_iterations = 0
        while (number_of_iterations <20 and len(correct_list_order) != len(conversation_ids_list)):
            for utterance_id in conversation_ids_list:
                correct_list_order = add_utterance(correct_list_order, conversation_corpus.get_utterance(utterance_id), conversation_corpus)
            number_of_iterations+=1

        #In some of the conversations, new utterances will be added that don't reply directly to the current conversation
        #Instead, these new utterances are part of the topic at hand (under the same conversation root) and are sorted by recency
        if (len(correct_list_order) != len(conversation_ids_list)):
            difference_in_sets = set(conversation_ids_list).difference(correct_list_order)
            timestamp_sorted_difference = sort_by_timestamp(list(difference_in_sets), conversation_corpus)
            correct_list_order.extend(timestamp_sorted_difference)
    return correct_list_order


And so, we can compute the correct order of utterances in each randomly selected conversation.

In [12]:
for conversation in random_conversations:
    conversation_ids_list = conversation.get_utterance_ids()
    print ('Original Order of IDs:' + str(conversation_ids_list))
    print('Correct Order of IDs:' + str(find_correct_order(conversation_ids_list, wikiconv_corpus)))
    print ('\n')

Original Order of IDs:['990187.2761.2761', '992726.3054.3054', '990639.3380.3380', '992726.3596.3596', '16071065.3661.3661']
Correct Order of IDs:['990187.2761.2761', '992726.3054.3054', '990639.3380.3380', '992726.3596.3596', '16071065.3661.3661']


Original Order of IDs:['1782876.60.60', '16232568.2769.2768', '1791042.774.774', '1791069.1156.1156', '1817544.1517.1517', '16232568.1623.1623', '1817651.2355.2355', '16232568.2607.2608']
Correct Order of IDs:['1782876.60.60', '16232568.1623.1623', '16232568.2607.2608', '16232568.2769.2768', '1817651.2355.2355', '1817544.1517.1517', '1791069.1156.1156', '1791042.774.774']


Original Order of IDs:['782044.0.14815', '782044.0.15026']
Correct Order of IDs:['782044.0.14815', '782044.0.15026']




Next, we print out the final form of the conversations.

In [13]:
#Print the conversation text from the list of conversation ids
def print_final_conversation(random_conversations, conversation_corpus):
    for conversation in random_conversations:
        print(wikipedia_link_info(conversation))
        conversation_ids_list = conversation.get_utterance_ids()
        #First correctly reorder the comments
        ordered_list = find_correct_order(conversation_ids_list, conversation_corpus)
        #For each utterance, print the text present if the utterance has not been deleted
        for utterance_id in ordered_list:
            utterance_value = conversation_corpus.get_utterance(utterance_id)
            if (utterance_value.text != " "):
                print (utterance_value.text)
                date_time_val = datetime.fromtimestamp(utterance_value.timestamp).strftime('%H:%M %d-%m-%Y')
                formatted_user_name = "--" + str(utterance_value.user.name) + "  " + str(date_time_val)
                print (formatted_user_name)
        print ('\n\n')
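
Note that `datetime.fromtimestamp` converts to the local time zone of the machine running the notebook, which is probably why printed times can differ from the `(UTC)` signatures embedded in the comment text. A small sketch that formats in UTC instead is given here. `format_utc` is a hypothetical helper, and the `float(...)` cast assumes timestamps are epoch seconds that may arrive as either strings or numbers.

```python
from datetime import datetime, timezone


def format_utc(timestamp):
    """Format epoch seconds as 'HH:MM dd-mm-YYYY' in UTC."""
    moment = datetime.fromtimestamp(float(timestamp), tz=timezone.utc)
    return moment.strftime('%H:%M %d-%m-%Y')


epoch_label = format_utc(0)          # numeric input
string_label = format_utc("86400")   # string input, one day after the epoch
print(epoch_label)   # 00:00 01-01-1970
print(string_label)  # 00:00 02-01-1970
```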

In [14]:
print_final_conversation(random_conversations,  wikiconv_corpus)

https://en.wikipedia.org/w/index.php?title=talk:Housewife
I strongly disapprove the sentence ''Men are increasingly adopting the role of homemaker''. Unless someone will take care of finding a more twisted but proper and accurate sentence to express this, I will just remove the sentence as being strongly misleading. a Wikipedia:WikiWomen
--Anthere  13:39 02-06-2003
I agree, this sort of statement desperatly needs a cite ("According to a research, Men are increasingly adopting the role of homemaker.." would sound more valid (in a junk science newspaper way...)) but I think we should think of a replacement for that statement...   17:59 2 Jun 2003 (UTC)
--Rotem Dan  16:41 02-06-2003
even better, in country xxx, according to the study from institute yyy, the number of hours men spent cleaning water closets and bathrooms during a year in 1960 and in 2000, compared to equivalent hours per women.
--Anthere  14:44 02-06-2003
lol )   20:41 2 Jun 2003 (UTC)
--Rotem Dan  16:41 02-06-2003
Thanks f

Let's create a compact helper so that the default values are easy to change.

In [15]:
def change_defaults_print_final(conversation_list, number_of_conversations, conversation_min_length,  
                                conversation_corpus):
    #Use the function's own parameters rather than notebook-level globals
    random_conversations = print_random_conversations(conversation_list, number_of_conversations,
                                                     conversation_min_length, conversation_corpus)
    print_final_conversation(random_conversations, conversation_corpus)

In [16]:
conversation_list = list(wikiconv_corpus.iter_conversations())
number_of_conversations_to_print = 1
conversation_min_length = 2

change_defaults_print_final(conversation_list, number_of_conversations_to_print, conversation_min_length,
                            wikiconv_corpus)

https://en.wikipedia.org/w/index.php?title=user_talk:Tarquin





Finally, we will create a method that prints the final comment along with the intermediate steps (modifications, deletions, and restorations) recorded for it.

In [17]:
def sort_changes_by_timestamp(modification_list, deletion_list, restoration_list,  original_utterance):
    text_time_tuple_list = []
    if (original_utterance is not None):
        text_time_original  = (original_utterance.text,original_utterance.timestamp,
                           original_utterance.user.name, 'original')
        text_time_tuple_list.append(text_time_original)
        

    for utterance in modification_list:
        text_time= (utterance.text, utterance.timestamp,
                    utterance.user.name, 'modification')
        text_time_tuple_list.append(text_time)
    
    for utterance in deletion_list:
        text_time= ('', utterance.timestamp,
                    utterance.user.name, 'deletion')
        text_time_tuple_list.append(text_time)
        
    for utterance in restoration_list:
        text_time= (utterance.text, utterance.timestamp,
                    utterance.user.name, 'restoration')
        text_time_tuple_list.append(text_time)
            
    #Sort oldest first; float() keeps the order numeric even if a timestamp is stored as a string
    text_time_tuple_list.sort(key=lambda x: float(x[1]))
    
    
    
    return text_time_tuple_list
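
To make the merged timeline concrete, here is a toy version of the same idea: tag each revision with the kind of action, pool them, and sort by timestamp. The `Revision` tuple and the `merge_history` helper are hypothetical stand-ins for the utterance objects stored in `meta`, not ConvoKit types.

```python
from collections import namedtuple

# Hypothetical stand-in for an utterance revision (illustration only).
Revision = namedtuple("Revision", ["text", "timestamp", "user"])


def merge_history(original, modifications, deletions, restorations):
    """Return (action, text, timestamp, user) tuples, oldest first."""
    timeline = []
    if original is not None:
        timeline.append(("original", original.text,
                         original.timestamp, original.user))
    for action, revisions in (("modification", modifications),
                              ("deletion", deletions),
                              ("restoration", restorations)):
        for rev in revisions:
            # A deletion leaves no visible text, so record an empty string.
            text = "" if action == "deletion" else rev.text
            timeline.append((action, text, rev.timestamp, rev.user))
    # float() guards against timestamps stored as strings (an assumption).
    timeline.sort(key=lambda entry: float(entry[2]))
    return timeline


toy_timeline = merge_history(
    Revision("first draft", 10, "A"),
    modifications=[Revision("second draft", 30, "A")],
    deletions=[Revision("second draft", 40, "B")],
    restorations=[Revision("second draft", 50, "A")],
)
toy_actions = [entry[0] for entry in toy_timeline]
print(toy_actions)
```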
        
    

In [18]:
def print_intermediate_conversation(random_conversations, conversation_corpus):
    for conversation in random_conversations:
        conversation_ids_list = conversation.get_utterance_ids()
        #First correctly reorder the comments
        ordered_list = find_correct_order(conversation_ids_list, conversation_corpus)
        #For each utterance, print the text present if the utterance has not been deleted
        for utterance_id in ordered_list:
            utterance_value = conversation_corpus.get_utterance(utterance_id)
            if (utterance_value.text != " "):
                final_comment =  utterance_value.text
                date_time_val = datetime.fromtimestamp(utterance_value.timestamp).strftime('%H:%M %d-%m-%Y')
                formatted_user_name = "--" + str(utterance_value.user.name) + "  " + str(date_time_val)
                
        
                final_timestamp = utterance_value.timestamp
                modification_list = utterance_value.meta['modification']
                deletion_list = utterance_value.meta['deletion']
                restoration_list = utterance_value.meta['restoration']
                
                sorted_timestamps = sort_changes_by_timestamp(modification_list, deletion_list, restoration_list,
                                                             utterance_value.meta['original'])
                
                if (len(sorted_timestamps)>0):
                    print(wikipedia_link_info(conversation))
                    print ('Final Comment')
                    print (final_comment)
                    print (formatted_user_name)
                    
                    for value in sorted_timestamps:
                        print ('\n')
                        print (value[3])
                        print (value[0])
                        formatted_user_name = "--" + str(value[2]) + "  " + str(datetime.fromtimestamp(float(value[1])).strftime('%H:%M %d-%m-%Y'))
                        #str(datetime.fromtimestamp(value[1]).strftime('%H:%M %d-%m-%Y'))
                        print (formatted_user_name)

                        

Our method to quickly print out intermediate conversations is defined below (only comments with recorded intermediate versions, such as an original, modification, deletion, or restoration, are shown).

In [19]:
def change_defaults_print_intermediate(conversation_list, number_of_conversations, conversation_min_length,  
                                conversation_corpus):
    #Use the function's own parameters rather than notebook-level globals
    random_conversations = print_random_conversations(conversation_list, number_of_conversations,
                                                     conversation_min_length, conversation_corpus)
    print_intermediate_conversation(random_conversations, conversation_corpus)

Here, the flow of different conversations is shown: the final comment is displayed first, followed by the corresponding actions from earliest to latest.

In [20]:
conversation_list = list(wikiconv_corpus.iter_conversations())
number_of_conversations_to_print = 10
conversation_min_length = 3

change_defaults_print_intermediate(conversation_list, number_of_conversations_to_print, conversation_min_length,
                            wikiconv_corpus)

https://en.wikipedia.org/w/index.php?title=user_talk:80.255/archive_2
Final Comment
 What you suggest would work for administrative counties whose name differs from traditinal Counties, such as South Gloucestershire and Gloucestershire; however, it obviously wouldn't work for those Counties where this is not the case (which is most of them), and I think any system must be consistent. I think what was adopted for the welsh Counties is fine - thus Gloucestershire (traditional) and South Gloucestershire (administrative) for the corresponding articles, which can then be inter-referential without too much differentiating underbrush.  This implies:
 ''is in the traditional county County of Gloucestershire and also lies within the administrative county of South Gloucestershire'' 
Capitalisation is of no help to a blind user with a text reader; or a search engine. Still, in "the ideal world", we wouldn't have to deal with such reactionary viewpoints. 
 In an "ideal world" ''you'' wouldn't have