# WikiConv Create Conversations

This notebook presents conversations from the WikiConv dataset in several forms. In particular, it shows the final version of selected conversations and how each conversation developed over time. It also provides a framework for printing randomly selected final conversations along with a link to the corresponding Wikipedia page.

In [1]:
#import relevant modules
from datetime import datetime, timedelta
from convokit import Corpus, download
import re
import random

In [2]:
# Load the 2003 wikiconv corpus (feel free to change this to a year of your preference)
wikiconv_corpus = Corpus(filename=download('wikiconv-2003'))

Downloading wikiconv-2003 to /kitchen/convokit_corpora_jpc/wikiconv-2003
Downloading wikiconv-2003 from http://zissou.infosci.cornell.edu/convokit/datasets/wikiconv-corpus/corpus-zipped/2003/full.corpus.zip (38.7MB)... Done


Some basic facts about this subset of the corpus: it contains 91,787 conversations and 140,265 utterances.

In [3]:
len(list(wikiconv_corpus.iter_conversations()))

91787

In [4]:
len(list(wikiconv_corpus.iter_utterances()))

140265

From the corpus, we will randomly select conversations to print out based on two parameters:
1. number_of_conversations - how many conversations we want to print out
2. conversation_min_length - the minimum number of utterances we want in each conversation

In [5]:
#Randomly chooses the requested number of conversations to print from the entire conversation set
def print_random_conversations(conversation_list, number_of_conversations, conversation_min_length,  conversation_corpus): 
    randomly_generated_conversation_list = []
    while (len(randomly_generated_conversation_list) != number_of_conversations):
        new_conversation = random.randint(0, (len(conversation_list)-1))
        new_conversation_id = conversation_list[new_conversation]
        conversation_ids_list = new_conversation_id.get_utterance_ids()
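        #Compare conversation ids (not the random index) so the same conversation is never selected twice
        already_selected_ids = [conversation.id for conversation in randomly_generated_conversation_list]
        if (new_conversation_id.id not in already_selected_ids 
            and (len(conversation_ids_list) >= conversation_min_length)):
            randomly_generated_conversation_list.append(new_conversation_id)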
        
    return randomly_generated_conversation_list
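
The selection loop above keeps drawing until it has enough conversations, so it never finishes if fewer conversations meet the minimum length than were requested. As a minimal sketch of an alternative (not the notebook's own method), one could filter first and then draw without replacement. The name `sample_conversations` and its `length_of` and `seed` parameters are hypothetical and introduced only for illustration.

```python
import random

def sample_conversations(conversations, k, min_length, length_of=len, seed=None):
    """Return up to k distinct items whose length is at least min_length.

    length_of is a hypothetical hook: for ConvoKit conversations it could be
    lambda c: len(c.get_utterance_ids()); for plain lists, len is enough.
    """
    # Keep only the conversations that satisfy the length requirement
    eligible = [c for c in conversations if length_of(c) >= min_length]
    rng = random.Random(seed)
    # random.sample draws without replacement, so no position is picked twice;
    # capping at len(eligible) guarantees the call terminates
    return rng.sample(eligible, min(k, len(eligible)))

# Toy stand-ins for conversations: each inner list is a list of utterance ids
toy_conversations = [["a"], ["a", "b"], ["a", "b", "c"], ["x", "y"]]
picked = sample_conversations(toy_conversations, 2, 2, seed=0)
```

With the corpus loaded above, the same idea would presumably be called as `sample_conversations(conversation_list, 3, 2, length_of=lambda c: len(c.get_utterance_ids()))`.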


Here we'll get a set of random conversations from the corpus based on our specifications (print out 3 conversations, each with a minimum length of 2). The output is a list of serialized conversations.

In [6]:
conversation_list = list(wikiconv_corpus.iter_conversations())
number_of_conversations_to_print = 3
conversation_min_length = 2

random_conversations = print_random_conversations(conversation_list, number_of_conversations_to_print,
                                                     conversation_min_length, wikiconv_corpus)
print (random_conversations)

[Conversation({'obj_type': 'conversation', 'meta': {'page_id': '383784', 'page_title': 'Matty j', 'page_type': 'user_talk'}, 'vectors': [], 'tree': None, 'owner': <convokit.model.corpus.Corpus object at 0x7f0d5a31d390>, 'id': '2081953.389.389'}), Conversation({'obj_type': 'conversation', 'meta': {'page_id': '187772', 'page_title': 'Large numbers', 'page_type': 'talk'}, 'vectors': [], 'tree': None, 'owner': <convokit.model.corpus.Corpus object at 0x7f0d5a31d390>, 'id': '1053969.145.145'}), Conversation({'obj_type': 'conversation', 'meta': {'page_id': '269015', 'page_title': 'Creationism/Archive 6', 'page_type': 'talk'}, 'vectors': [], 'tree': None, 'owner': <convokit.model.corpus.Corpus object at 0x7f0d5a31d390>, 'id': '1324714.25382.25382'})]


The conversation metadata stores information about the Wikipedia page that each conversation comes from.
We will use that information to build and print a link to the associated Wikipedia page for each conversation.


In [7]:
def wikipedia_link_info(conversation):
    page_title = conversation.meta['page_title']
    #Raw string avoids an invalid escape sequence warning for \s
    page_title = re.sub(r'\s+', '_', page_title)
    page_type = conversation.meta['page_type']
    link_value = "https://en.wikipedia.org/w/index.php?title="+page_type+":"+page_title
    
    return link_value

for conversation in random_conversations:
    print(wikipedia_link_info(conversation))

https://en.wikipedia.org/w/index.php?title=user_talk:Matty_j
https://en.wikipedia.org/w/index.php?title=talk:Large_numbers
https://en.wikipedia.org/w/index.php?title=talk:Creationism/Archive_6
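
Page titles can contain characters such as `&` or `?` that have a special meaning inside a query string. The sketch below shows one way to percent-encode them with the standard library's `urllib.parse.quote`; the function name `build_talk_page_url` is hypothetical, and the URL pattern is simply the one used in `wikipedia_link_info` above.

```python
import re
from urllib.parse import quote

def build_talk_page_url(page_title, page_type):
    """Build a Wikipedia index.php link, percent-encoding unsafe characters."""
    # Collapse runs of whitespace into single underscores, as in the notebook
    title = re.sub(r'\s+', '_', page_title)
    # Keep ':' and '/' readable (namespace separator and subpages), encode the rest
    encoded = quote(page_type + ":" + title, safe=":/")
    return "https://en.wikipedia.org/w/index.php?title=" + encoded

simple_url = build_talk_page_url("Large numbers", "talk")
tricky_url = build_talk_page_url("AT&T  history", "talk")
```

For titles made only of letters, digits, underscores, and slashes, this yields the same link as the function above.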


Now that we have the conversations and the Wikipedia pages where they live, we want to print each conversation's final form from the utterance data. To do this, we first need to compute the correct order of the comments.

The corpus does not guarantee that utterance IDs are returned in reply order, so we compute that order here.


In [8]:
#For any comments that do not have matching reply-to ids, sort these comments by recency (most recent first)
def sort_by_timestamp(conversation_ids_list, conversation_corpus):
    list_of_utterances = []
    for id_val in conversation_ids_list:
        utterance_value = conversation_corpus.get_utterance(id_val)
        timestamp_val = utterance_value.timestamp
        tuple_val = (id_val, timestamp_val)
        list_of_utterances.append(tuple_val)

    sorted_utterance_list = sorted(list_of_utterances, key = lambda x:x[1])
    sorted_utterance_list.reverse()
    id_list = [i[0] for i in sorted_utterance_list]
    return (id_list)

In [9]:
#Find cases in which an utterance's reply to is to a comment in the chain that has been modified, deleted or restored
def check_lists_for_match(x, conversation_ids_list, utterance, next_utterance_value, conversation_corpus):
    modification_list = utterance.meta['modification']
    deletion_list = utterance.meta['deletion']
    restoration_list = utterance.meta['restoration']
    if (len(modification_list)>0):
        for utterance_val in modification_list:
            if (utterance_val['id'] == next_utterance_value.reply_to):
                conversation_ids_list.insert(x+1, next_utterance_value.id)
    if (len(deletion_list)>0):
        for utterance_val in deletion_list:
            if (utterance_val['id'] == next_utterance_value.reply_to):
                conversation_ids_list.insert(x+1, next_utterance_value.id)
    if (len(restoration_list)>0):
        for utterance_val in restoration_list:
            if (utterance_val['id'] == next_utterance_value.reply_to):
                conversation_ids_list.insert(x+1, next_utterance_value.id)

In [10]:
# Build the conversation flow correctly and add utterances if the reply-to id matches the current utterance in the list
def add_utterance(conversation_ids_list, next_utterance_value, conversation_corpus):
    if next_utterance_value.id in conversation_ids_list:
        return conversation_ids_list
    elif (next_utterance_value.reply_to is None):
        conversation_ids_list.append(next_utterance_value.id)
    else:
        for x in range(0,len(conversation_ids_list)):
            utterance_id = conversation_ids_list[x]
            if (utterance_id == next_utterance_value.reply_to):
                conversation_ids_list.insert(x+1, next_utterance_value.id)
            else:
                check_lists_for_match(x, conversation_ids_list, conversation_corpus.get_utterance(utterance_id), next_utterance_value, conversation_corpus)

    return conversation_ids_list

In [11]:
#The order of the returned conversation ids is not guaranteed; compute the correct ordering 
def find_correct_order(conversation_ids_list, conversation_corpus):
    correct_list_order = []
    #if the conversation has only one comment, return the conversation list
    if (len(conversation_ids_list) == 1 ):
        return conversation_ids_list

    #When the conversation has more than one comment, find the correct order of the comments
    if (len(conversation_ids_list) >1):
        #Fail safe: cap the number of passes at 20 so the loop always terminates
        number_of_iterations = 0
        while (number_of_iterations <20 and len(correct_list_order) != len(conversation_ids_list)):
            for utterance_id in conversation_ids_list:
                correct_list_order = add_utterance(correct_list_order, conversation_corpus.get_utterance(utterance_id), conversation_corpus)
            number_of_iterations+=1

        #In some of the conversations, new utterances will be added that don't reply directly to the current conversation
        #Instead, these new utterances are part of the topic at hand (under the same conversation root) and are sorted by recency
        if (len(correct_list_order) != len(conversation_ids_list)):
            difference_in_sets = set(conversation_ids_list).difference(correct_list_order)
            timestamp_sorted_difference = sort_by_timestamp(list(difference_in_sets), conversation_corpus)
            correct_list_order.extend(timestamp_sorted_difference)
    return correct_list_order


We can now compute the correct order of utterances in each randomly selected conversation.

In [12]:
for conversation in random_conversations:
    conversation_ids_list = conversation.get_utterance_ids()
    print ('Original Order of IDs:' + str(conversation_ids_list))
    print('Correct Order of IDs:' + str(find_correct_order(conversation_ids_list, wikiconv_corpus)))
    print ('\n')

Original Order of IDs:['2081953.413.389', '2081953.389.389']
Correct Order of IDs:['2081953.389.389', '2081953.413.389']


Original Order of IDs:['1053969.145.145', '1054046.1046.1046', '1054707.1680.1680', '3744612.1995.1995']
Correct Order of IDs:['1053969.145.145', '1054707.1680.1680', '3744612.1995.1995', '1054046.1046.1046']


Original Order of IDs:['1344757.132.30543', '1344757.132.30573', '1344757.132.30760', '1344757.132.31076', '1344757.132.30860', '1344757.132.31731', '1344757.132.32775', '1344757.132.32926', '1329814.35766.35766']
Correct Order of IDs:['1344757.132.30543', '1344757.132.30573', '1344757.132.30760', '1344757.132.31076', '1344757.132.30860', '1344757.132.31731', '1344757.132.32775', '1344757.132.32926', '1329814.35766.35766']
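
To make the idea of reply-based ordering concrete, here is a minimal sketch on toy data. It is a simplified alternative to the functions above, not a reimplementation of them: it places each reply directly after its parent with a depth-first walk, orders siblings by timestamp, and ignores the modification, deletion, and restoration records. The function name `thread_order` and the toy dictionaries are assumptions made for illustration.

```python
from collections import defaultdict

def thread_order(utterances):
    """Order toy utterances so each reply follows its parent.

    utterances: list of dicts with 'id', 'reply_to', and 'timestamp' keys.
    """
    known_ids = {u["id"] for u in utterances}
    children = defaultdict(list)
    roots = []
    # Visiting in timestamp order keeps siblings (and roots) oldest first
    for u in sorted(utterances, key=lambda u: u["timestamp"]):
        parent = u["reply_to"]
        if parent is None or parent not in known_ids:
            # No parent inside this conversation: treat as a top-level comment
            roots.append(u["id"])
        else:
            children[parent].append(u["id"])

    ordered = []
    stack = list(reversed(roots))
    while stack:
        current = stack.pop()
        ordered.append(current)
        # Push replies in reverse so the earliest reply is visited first
        stack.extend(reversed(children[current]))
    return ordered

toy_utterances = [
    {"id": "b", "reply_to": "a", "timestamp": 2},
    {"id": "a", "reply_to": None, "timestamp": 1},
    {"id": "c", "reply_to": "a", "timestamp": 3},
    {"id": "d", "reply_to": "b", "timestamp": 4},
]
toy_order = thread_order(toy_utterances)
```

Here `d` replies to `b`, so it is placed before `c` even though `c` was posted earlier.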




Next, we print out the final form of the conversations.

In [13]:
#Print the conversation text from the list of conversation ids
def print_final_conversation(random_conversations, conversation_corpus):
    for conversation in random_conversations:
        print(wikipedia_link_info(conversation))
        conversation_ids_list = conversation.get_utterance_ids()
        #First correctly reorder the comments
        ordered_list = find_correct_order(conversation_ids_list, conversation_corpus)
        #For each utterance, print the text present if the utterance has not been deleted
        for utterance_id in ordered_list:
            utterance_value = conversation_corpus.get_utterance(utterance_id)
            if (utterance_value.text != " "):
                print (utterance_value.text)
                date_time_val = datetime.fromtimestamp(utterance_value.timestamp).strftime('%H:%M %d-%m-%Y')
                formatted_user_name = "--" + str(utterance_value.user.name) + "  " + str(date_time_val)
                print (formatted_user_name)
        print ('\n\n')
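
Note that `datetime.fromtimestamp` without a timezone argument converts to the local time of the machine running the notebook, so the printed times can differ between machines. A minimal sketch of a signature formatter pinned to UTC follows; the helper name `format_signature` is hypothetical.

```python
from datetime import datetime, timezone

def format_signature(user_name, timestamp):
    """Format a '--user  HH:MM DD-MM-YYYY' signature line in UTC."""
    # float() accepts both numeric timestamps and numeric strings
    when = datetime.fromtimestamp(float(timestamp), tz=timezone.utc)
    return "--" + str(user_name) + "  " + when.strftime('%H:%M %d-%m-%Y')

epoch_signature = format_signature("Heron", 0)
```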

In [14]:
print_final_conversation(random_conversations,  wikiconv_corpus)

https://en.wikipedia.org/w/index.php?title=user_talk:Matty_j
 Baseball/temp 
--Ktsquare  20:56 23-12-2003
Hi Matty, I noticed you have contributed to many bios of MLB players, I'd like to ask for your opinion whether it's time to move the rewritten Baseball/temp to Baseball. Please comment at Talk:Baseball/temp. Thanks.  
--Ktsquare  20:56 23-12-2003



https://en.wikipedia.org/w/index.php?title=talk:Large_numbers
Can I suggest that we include only pure numbers in this article, not distances and other measurements?  Would anyone object if I deleted the astronomical distances, since they are only large numbers when expressed in small units?  I suppose I should go further and say that Avogradro's number is also just an arbitrary unit, but I shan't, because I feel I'm on a slippery slope towards excluding everything!   
--Heron  05:16 18-06-2003
Let me put this another way.  I think the present article should be, as it mostly is, about the mathematics of large numbers.  Other large quanti



Let's create a compact method that makes it easy to change the default values.

In [15]:
def change_defaults_print_final(conversation_list, number_of_conversations, conversation_min_length,  
                                conversation_corpus):
    #Use the function's own arguments rather than the notebook-level globals
    random_conversations = print_random_conversations(conversation_list, number_of_conversations,
                                                     conversation_min_length, conversation_corpus)
    print_final_conversation(random_conversations, conversation_corpus)

In [16]:
conversation_list = list(wikiconv_corpus.iter_conversations())
number_of_conversations_to_print = 1
conversation_min_length = 2
#Refresher: wikiconv_corpus is the corpus loaded at the top of this notebook

change_defaults_print_final(conversation_list, number_of_conversations_to_print, conversation_min_length,
                            wikiconv_corpus)

https://en.wikipedia.org/w/index.php?title=talk:Main_Page
age pump|Village pump]]. See talk:Wikipedia category schemes for general discussion of the category scheme on Wikipedia's Main Page.'''
'''See Wikipedia talk:Selected Articles on the Main Page for discussion of (and recommendations for) the Selected Articles on the Main Page. See below for more discussion of particular issues regarding the Main Page (e.g., whether to include a particular category on the page). Please add your additions at the bottom.'''

Some older talk has been archived to
talk:Main Page/Archive 1
talk:Main Page/Archive 2
talk:Main Page/Archive 3
talk:Main Page/Archive 4
talk:Main Page/Archive 5
talk:Main Page/Archive 6
--Schneelocke  08:26 05-09-2003







Finally, we will create a method that prints out each final comment together with the intermediate steps (original text, modifications, deletions, and restorations) that led to it.

In [17]:
def sort_changes_by_timestamp(modification_list, deletion_list, restoration_list,  original_utterance):
    text_time_tuple_list = []
    if (original_utterance is not None):
        text_time_original  = (original_utterance['text'],original_utterance['timestamp'],
                           original_utterance['speaker.id'], 'original')
        text_time_tuple_list.append(text_time_original)
        

    for utterance in modification_list:
        text_time= (utterance['text'], utterance['timestamp'],
                    utterance['speaker.id'], 'modification')
        text_time_tuple_list.append(text_time)
    
    for utterance in deletion_list:
        text_time= ('', utterance['timestamp'],
                    utterance['speaker.id'], 'deletion')
        text_time_tuple_list.append(text_time)
        
    for utterance in restoration_list:
        text_time= (utterance['text'], utterance['timestamp'],
                    utterance['speaker.id'], 'restoration')
        text_time_tuple_list.append(text_time)
            
    #Sort from earliest to latest; cast to float so numeric strings compare by value
    text_time_tuple_list.sort(key=lambda x: float(x[1]))

    return text_time_tuple_list
        
    

In [18]:
def print_intermediate_conversation(random_conversations, conversation_corpus):
    for conversation in random_conversations:
        conversation_ids_list = conversation.get_utterance_ids()
        #First correctly reorder the comments
        ordered_list = find_correct_order(conversation_ids_list, conversation_corpus)
        #For each utterance, print the text present if the utterance has not been deleted
        for utterance_id in ordered_list:
            utterance_value = conversation_corpus.get_utterance(utterance_id)
            if (utterance_value.text != " "):
                final_comment =  utterance_value.text
                date_time_val = datetime.fromtimestamp(utterance_value.timestamp).strftime('%H:%M %d-%m-%Y')
                formatted_user_name = "--" + str(utterance_value.user.name) + "  " + str(date_time_val)
                
        
                final_timestamp = utterance_value.timestamp
                modification_list = utterance_value.meta['modification']
                deletion_list = utterance_value.meta['deletion']
                restoration_list = utterance_value.meta['restoration']
                
                sorted_timestamps = sort_changes_by_timestamp(modification_list, deletion_list, restoration_list,
                                                             utterance_value.meta['original'])
                
                if (len(sorted_timestamps)>0):
                    print(wikipedia_link_info(conversation))
                    print ('Final Comment')
                    print (final_comment)
                    print (formatted_user_name)
                    
                    for value in sorted_timestamps:
                        print ('\n')
                        print (value[3])
                        print (value[0])
                        formatted_user_name = "--" + str(value[2]) + "  " + str(datetime.fromtimestamp(float(value[1])).strftime('%H:%M %d-%m-%Y'))
                        #str(datetime.fromtimestamp(value[1]).strftime('%H:%M %d-%m-%Y'))
                        print (formatted_user_name)

                        

Our method to quickly print out intermediate conversations is defined below (only utterances with a recorded original, modification, deletion, or restoration are shown).

In [19]:
def change_defaults_print_intermediate(conversation_list, number_of_conversations, conversation_min_length,  
                                conversation_corpus):
    #Use the function's own arguments rather than the notebook-level globals
    random_conversations = print_random_conversations(conversation_list, number_of_conversations,
                                                     conversation_min_length, conversation_corpus)
    print_intermediate_conversation(random_conversations, conversation_corpus)

Here, the flow of different conversations is shown: for each comment, the final text is displayed first, followed by the actions that affected it, from earliest to latest.
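
The earliest-to-latest ordering can be illustrated on toy data. This is a minimal sketch under stated assumptions: the record keys `text` and `timestamp` mirror the ones read by the notebook's code, the toy values are invented, and the helper name `edit_timeline` is hypothetical. Timestamps are cast to `float` before sorting so that numeric strings such as `"9"` and `"10"` compare by value rather than alphabetically.

```python
def edit_timeline(original, modifications, deletions, restorations):
    """Merge edit records into (action, text, timestamp) tuples, earliest first."""
    events = []
    if original is not None:
        events.append(("original", original["text"], float(original["timestamp"])))
    for record in modifications:
        events.append(("modification", record["text"], float(record["timestamp"])))
    for record in deletions:
        # A deletion leaves no text behind, so store an empty string
        events.append(("deletion", "", float(record["timestamp"])))
    for record in restorations:
        events.append(("restoration", record["text"], float(record["timestamp"])))
    # Sort on the numeric timestamp (third element of each tuple)
    return sorted(events, key=lambda event: event[2])

toy_timeline = edit_timeline(
    original={"text": "First draft", "timestamp": "9"},
    modifications=[{"text": "First draft, edited", "timestamp": "10"}],
    deletions=[{"timestamp": "11.5"}],
    restorations=[{"text": "First draft, edited", "timestamp": 12}],
)
toy_actions = [action for action, _, _ in toy_timeline]
```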

In [20]:
conversation_list = list(wikiconv_corpus.iter_conversations())
number_of_conversations_to_print = 10
conversation_min_length = 3

change_defaults_print_intermediate(conversation_list, number_of_conversations_to_print, conversation_min_length,
                            wikiconv_corpus)

https://en.wikipedia.org/w/index.php?title=talk:Book_of_Revelation/Archive_1
Final Comment
OK, I can go look up some info. I am a little unsure of which part is considered sweeping. I assume that we agree that this historical interpretation is common among non-Christians and secular Bible scholars. It will be easy for me to get references to the Catholic part of this claim; this view is what their new publications have taught, for well over 20 years. References will be forthcoming. As for the viewpoint of liberal protestants, this will take a bit more work! 
--RK  21:39 24-04-2003


original
OK, I can go look up some info. I am a little unsure of which part is considered sweeping. I assume that we agree that this historical interpretation is common among non-Christians and secular Bible scholars. It will be easy for me to get references to the Catholic part of this claim; this view is what their new publications have taught, for well over 20 years. References will be forthcoming. As fo

