<a href="https://colab.research.google.com/github/zahra-arjm/wikipedia_discussions/blob/main/Wiki_Project_Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a report of a mini-project for visualizing wikipedia conversations on summer 2022. The data was gathered by Christine De Kock and Andreas Vlachos. The code was written by me ([Zahra Arjmandi Lari](mailto:z.arjmandi@gmail.com)). Part of the credit for the ideas on "how to visualize" goes back to Tom Stafford.

## The Story In Short
Wikipedia pages have a [talk/discussion page](https://en.wikipedia.org/wiki/Help:Talk_pages) where editors discuss how to improve a page. In some occassions, the editors have different opinions on some topics and the talk page will be tagged as "dispute" by editors. In some cases the editors resolve the disputes on their own; in other cases the disputes are "escalated" and needs mediation.
Wikipedia suggests its users to use hierarchy of disagreement, proposed by Paul Graham [1](#refs) to resolve disputes constructively. Christine De Kock and Andreas Vlachos have annotated ~200 conversation from wikipedia talk pages categorized as "dispute". They have labeled each part of the conversation (utterance) accordingly which consist of ~4000 utterance [2](#refs). Part of the labels are on the basis of Graham's proposed hierarchy of disagreement [1](#refs) which addresses the "rebuttal tactics". Graham suggest 7 levels for disagreements 
which (starts from name-calling at the bottom or DH0 to refuting the central point or DH7). The other part of labels are called "resolution tactics" and (attemps to promote 
understanding and consensus.) See table 1 [table 1 from ref2; ask permission]. For more information see [2](#refs).
[add overall analysis and results?]
### Our Question
What we wanted to know was "what is the difference in how conversation flows in "escalated" versus "non-escalated" discussions?"
### What We Did In Summary
I divided data into escalated and non-escalated conversations. For each utterance, picked one or multiple label(s). Then, built the transition matrix and visualized the graphs; each label served as a node and the transition probability between each two nodes, became the strength of the edges.
First, we wanted to know whether escalated conversation lack higher order DHs or not. I drawed the graph such that in case of multilabel utterance, the higher label is shown (for more information see [1](#refs)). Our graphs showed that escalated discussion do have at least the same amount of higher order DHs.[reference to related cells]
Then we thought that maybe the higher order DHs are accompanied with lower order DHs in escalated conversations. Potentially, this could neutralize the effect of using higher order resolution techniques. [add what we observed and reference to related cells]
In the end, we had the idea of dividing transitions such that each label has the chance to be part of the graph.[add what we observed]






###What happens in the code


## **Part 1: Downloading dataset and import libraries**

Start by downlaoding/uploading raw data file (frequency_data.json)

In [1]:
# import data

from google.colab import files

# upload "frequency_data.json"
uploaded = files.upload()

#download the raw data from a git repo



Saving frequency_data.json to frequency_data (1).json


Import libraries. Numpy for handling matrices and json for handling the raw data file (opening and reading .json format). Pyvis and networkx are the two libraries for visulaizing graphs. I have used IPython to help in displaying the graphs (.html format) inside this notebook.

In [2]:

#import needed libraries
import numpy as np
import json

# install pyvis
# pyvis is a network visualization library based on Networkx 
#which allow graphs to be interactive. You can also save them as html
!pip install pyvis
from pyvis.network import Network

# to be able to display output graph which is in html format
from IPython.core.display import display, HTML

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/



## **Part 2: Data cleaning and wrangling**

Next, open the data file and load it. Let's have a look at how the data format:

In [3]:

# Opening JSON file
raw_data = open('frequency_data.json')
  
# returns JSON object as 
# a dictionary
raw_data = json.load(raw_data)

raw_data[0:5]



[{'utt_labels': [['Coordinating edits'],
   ['Contextualisation'],
   ['Providing clarification'],
   ['DH5: Counterargument'],
   ['DH5: Counterargument'],
   ['DH4: Repeated argument'],
   ['DH4: Repeated argument'],
   ['Suggesting a compromise', 'Coordinating edits'],
   ['Coordinating edits'],
   ['Coordinating edits'],
   ['Coordinating edits'],
   ['DH1: Ad hominem/ad argument'],
   ['DH4: Repeated argument'],
   ['DH4: Stating your stance'],
   ['Other'],
   ['Coordinating edits'],
   ['Coordinating edits'],
   ['DH5: Counterargument'],
   ['DH4: Stating your stance'],
   ['DH4: Stating your stance'],
   ['DH4: Stating your stance'],
   ['DH4: Repeated argument'],
   ['DH4: Stating your stance']],
  'escalation': 0},
 {'utt_labels': [['Contextualisation'],
   ['Coordinating edits'],
   ['Providing clarification'],
   ['DH1: Ad hominem/ad argument', 'DH5: Counterargument'],
   ['DH5: Counterargument'],
   ['DH5: Counterargument',
    'DH0: Name calling/hostility',
    'DH1: Ad h

The raw_data is a list containing multiple dictionaries. Each dictionary has two keys: 

*   escalation: which shows the type of coversation (either escalated with value 1 or non-escalated with value 0)
*   utt_labels: which is shows each utterance's label. Its value is a list of lists containing each utterance label(s). These lists are in the same order that happend in the conversation.



To get an idea of the labels' frequency, I built a dictionary contining all the labels and their frequnecies:

In [4]:
#count frequency of each label in a dictionary

from collections import Counter
super_dict_labels = {}
for i in range(len(raw_data)):


  dict_list = ([dict(Counter(x)) for x in raw_data[i]['utt_labels']])
  for item in dict_list:
    for k, v in item.items():
      if k in super_dict_labels.keys():
        super_dict_labels[k] += v
      else:
        super_dict_labels[k] = v
super_dict_labels

{'Coordinating edits': 972,
 'Contextualisation': 208,
 'Providing clarification': 144,
 'DH5: Counterargument': 988,
 'DH4: Repeated argument': 572,
 'Suggesting a compromise': 89,
 'DH1: Ad hominem/ad argument': 566,
 'DH4: Stating your stance': 433,
 'Other': 25,
 'DH0: Name calling/hostility': 65,
 'DH3: Policing the discussion': 249,
 'Asking questions': 92,
 'Conceding / recanting': 59,
 'DH6: Refutation': 300,
 'DH2: Attempted derailing/off-topic': 65,
 'DH-1: Bailing out': 37,
 'DH7: Refuting the central point': 19,
 "I don't know": 7,
 'Other: Quote': 1,
 'DH5: Counterargument with new evidence / reasoning': 8,
 "DH6: Refutation of opponent's argument (with evidence or reasoning)": 5,
 'DH1: Attacks to the person or argument': 9,
 'DH4: Stating your stance without evidence or reasoning': 5,
 'DH2: Attempted derailing / off-topic comments': 1}

There are multiple labels. All DH labels are in format of DH[a digit] exept for 'DH-1: Bailing out'. Let's change that:

In [5]:
# change label "DH-1: Bailing out" to formal format "DH1: Bailing out"
for conversation in raw_data:
  for labels in conversation['utt_labels']:
    #check if DH- is there
    if "DH-1: Bailing out" in labels:
      idx = labels.index("DH-1: Bailing out")
      labels[idx] = "DH1: Bailing out"


#count frequency of each label in a dictionary after changing DH-1 to DH1

super_dict_labels = {}
for i in range(len(raw_data)):


  dict_list = ([dict(Counter(x)) for x in raw_data[i]['utt_labels']])
  for item in dict_list:
    for k, v in item.items():
      if k in super_dict_labels.keys():
        super_dict_labels[k] += v
      else:
        super_dict_labels[k] = v
super_dict_labels

{'Coordinating edits': 972,
 'Contextualisation': 208,
 'Providing clarification': 144,
 'DH5: Counterargument': 988,
 'DH4: Repeated argument': 572,
 'Suggesting a compromise': 89,
 'DH1: Ad hominem/ad argument': 566,
 'DH4: Stating your stance': 433,
 'Other': 25,
 'DH0: Name calling/hostility': 65,
 'DH3: Policing the discussion': 249,
 'Asking questions': 92,
 'Conceding / recanting': 59,
 'DH6: Refutation': 300,
 'DH2: Attempted derailing/off-topic': 65,
 'DH1: Bailing out': 37,
 'DH7: Refuting the central point': 19,
 "I don't know": 7,
 'Other: Quote': 1,
 'DH5: Counterargument with new evidence / reasoning': 8,
 "DH6: Refutation of opponent's argument (with evidence or reasoning)": 5,
 'DH1: Attacks to the person or argument': 9,
 'DH4: Stating your stance without evidence or reasoning': 5,
 'DH2: Attempted derailing / off-topic comments': 1}

To build a transition matrix, I needed to simplify the multi-label utterances. The 'pick_label' function gets the list of labels and the method, and outputs the final label(s).
I needed to either reduce multi-labels to one label or distribute the transition from i to i+1 th utterance between all the labels; such that each pair of label is involved equally.
the function 'pick_label' accepts three methods:


*   all: the output will be the same as the input. It gives back all the labels
*   max: if the labels contain a DH label, the output will be the highest DH. Otherwise, the output is the label with the highest frequency

*   min: if the labels contain a DH label, the output will be the lowest DH. Otherwise, the output is the label with the highest frequency


\begin{align}
\frac{1}{\text{(no of labels in ith utterance)} \text{(no labels in (i+1)th utterance)}}
\end{align}



In [6]:

# a function to pick one of the lables if there are multiple labels available for a part of an utter
def pick_label(list_of_labels, method):
  ''' 
  This function picks one of the lables in case there are multiple labels
   available.
  In case both DH and non-DH labels are availabe it picks the DHs.
  In case all of the labels are non-DH it picks the one with the highest frequency.

  Args:
  list_of_labels (list): labels for part of the utterance
  method (string): if max or min, picks the highest or lowest DH. If all gives back all labels
  '''
  
  # first check no of labels or if method is 'all'
  if len(list_of_labels) == 1 or method == 'all':
    final_label = list_of_labels
  else: #case of multiple labels
    #check how many DHs are present
    count_DH = 0
    for item in list_of_labels:
      if item[0:2] == 'DH':
        count_DH += 1
    if count_DH == 0:
      #pick the one with the highest frequency
      label_freqs = [super_dict_labels[x] for x in list_of_labels]
      final_label = [list(super_dict_labels.keys())[list(super_dict_labels.values()).index(max(label_freqs))]]
    elif count_DH == 1:
      # pick the DH
      final_label = [i for i in list_of_labels if i.startswith('DH')]
    else:
      
      #extract DH_no s
      DH_no = [int(i[2]) for i in list_of_labels if i.startswith('DH')]
      # select the DH based on the method specified
      if method == 'min':
        final_label = [i for i in list_of_labels 
                       if (i.startswith('DH') and int(i[2]) == min(DH_no))]
      elif method == 'max':
        final_label = [i for i in list_of_labels 
                       if (i.startswith('DH') and int(i[2]) == max(DH_no))]                                                 

  return final_label


Now, build the transition matrix. The matrix contains all the labels plus a label called 'end'. The last list in each 'utter_label' is followed by 'end' (or end of the conversation).
Here, we can change the method of extracting labels in 'pick_label' function by changing 'method' variable.
The 'transition_all' variable is a 3-dimensional array. On the first layer, we have the transitions between every two labels, for the non-escalated conversations (or label 0 on escalation), and on the second layer, we have the same thing for escalated conversations. I created three versions of transition matrix to try differnt methods of picking labels on the data.

In [7]:
# create 3 np arrays to keep transitions for esc and non-esc conversations
# 24 unique labels + 1 row/col for "end"
#it's a 25*25*2 matrix. Layer 0 is for non_esc and 1 is for escelated conversations
transition_all = np.zeros((len(super_dict_labels)+1, len(super_dict_labels)+1, 2), float)
transition_max = np.zeros_like(transition_all)
transition_min = np.zeros_like(transition_all)

# put the list of labels into labels_long variable
labels_long = list(super_dict_labels.keys())
#add 'end' label to the list of labels
labels_long.append('end')
#choose which method to apply
methods = ['all', 'max', 'min']
# fill out the transition matrix
for conversation in raw_data:
  for idx_labels in range(len(conversation['utt_labels'])):

    #loop through methods
    for method in methods:
      # find from and to labels
      from_labels = pick_label(conversation['utt_labels'][idx_labels], method)
      len_from = len(from_labels)
      
      # two for loops in case we wanted to more than 1 label:
      for from_label in from_labels:
        #find index of the label in the list of all labels
        from_idx = labels_long.index(from_label)
    
        if idx_labels == len(conversation['utt_labels'])-1:
          #if it's the last part of conversation the to_label is 'end'
          to_idx = 24
          if method == 'all':              
            transition_all[from_idx,to_idx,int(conversation['escalation'])] += 1/(len_from*len_to)
          elif method == 'min':
            transition_min[from_idx,to_idx,int(conversation['escalation'])] += 1/(len_from*len_to)
          else:
            transition_max[from_idx,to_idx,int(conversation['escalation'])] += 1/(len_from*len_to)
        else:
          to_labels = pick_label(conversation['utt_labels'][idx_labels+1], method)
          len_to = len(to_labels)
          for to_label in to_labels:
            #find index of the label in the list of all labels
            to_idx = labels_long.index(to_label)
            #add transition to transition matrix
            if method == 'all':              
              transition_all[from_idx,to_idx,int(conversation['escalation'])] += 1/(len_from*len_to)
            elif method == 'min':
              transition_min[from_idx,to_idx,int(conversation['escalation'])] += 1/(len_from*len_to)
            else:
              transition_max[from_idx,to_idx,int(conversation['escalation'])] += 1/(len_from*len_to)
            

View long labels and then shorten them. Later, I will use short labels for node identification in the graph.

In [8]:
# view long labels
print(labels_long)

['Coordinating edits', 'Contextualisation', 'Providing clarification', 'DH5: Counterargument', 'DH4: Repeated argument', 'Suggesting a compromise', 'DH1: Ad hominem/ad argument', 'DH4: Stating your stance', 'Other', 'DH0: Name calling/hostility', 'DH3: Policing the discussion', 'Asking questions', 'Conceding / recanting', 'DH6: Refutation', 'DH2: Attempted derailing/off-topic', 'DH1: Bailing out', 'DH7: Refuting the central point', "I don't know", 'Other: Quote', 'DH5: Counterargument with new evidence / reasoning', "DH6: Refutation of opponent's argument (with evidence or reasoning)", 'DH1: Attacks to the person or argument', 'DH4: Stating your stance without evidence or reasoning', 'DH2: Attempted derailing / off-topic comments', 'end']


In [9]:
#short labels
labels_short = ['Coords',
 'Context',
 'clarify',
 'DH5',
 'DH4:Repeat',
 'Suggest',
 'DH1:Ad',
 'DH4:stance',
 'Other',
 'DH0',
 'DH3',
 'Ask',
 'Conced',
 'DH6',
 'DH2',
 'DH1:Bail',
 'DH7',
 "Don't know",
 'Other:Quote',
 'DH5:arg with',
 "DH6:with evid",
 'DH1:Attack',
 'DH4:without evid',
 'DH2:off',
 'end']

Multiple categories of the same DH are among labels. Since we need to simplify the most, I merged them!
Then, I created another filtered list of the labels and updated the 'trasiton_all' matrix accordingly.

In [10]:
# sum instances of similar DHs
rep_DHs = ['DH1', 'DH2', 'DH4', 'DH5', 'DH6']
#keep indices of duplicates
to_be_rmvd_idx = []
for DH in rep_DHs:

  #find DH instances
  idx = [labels_short.index(i) for i in labels_short if i.startswith(DH)]
  to_be_rmvd_idx.extend(idx)
  #sum similir DHs cols and keep it in desired shape
  last_col = np.sum(transition_all[:,idx,:], axis = 1)[:,None,:]
  #sum all DH4s rows and keep it in desired shape 
  last_row = np.sum(transition_all[idx,:,:], axis = 0)[None,:,:]
  diagonal_elements = np.sum(transition_all[idx,idx,:],axis = 0)[None,None,:]
  #add computed rows and columns to the transition_all array
  transition_all = np.vstack((np.hstack((transition_all, last_col)), np.hstack((last_row, diagonal_elements))))

  # method 'min'
  #sum similir DHs cols and keep it in desired shape
  last_col = np.sum(transition_min[:,idx,:], axis = 1)[:,None,:]
  #sum all DH4s rows and keep it in desired shape 
  last_row = np.sum(transition_min[idx,:,:], axis = 0)[None,:,:]
  diagonal_elements = np.sum(transition_min[idx,idx,:],axis = 0)[None,None,:]
  transition_min = np.vstack((np.hstack((transition_min, last_col)), np.hstack((last_row, diagonal_elements))))

  # method 'max'
  #sum similir DHs cols and keep it in desired shape
  last_col = np.sum(transition_max[:,idx,:], axis = 1)[:,None,:]
  #sum all DH4s rows and keep it in desired shape 
  last_row = np.sum(transition_max[idx,:,:], axis = 0)[None,:,:]
  diagonal_elements = np.sum(transition_max[idx,idx,:],axis = 0)[None,None,:] 
  transition_max = np.vstack((np.hstack((transition_max, last_col)), np.hstack((last_row, diagonal_elements))))
#append careated labels to label list 
labels_short.extend(rep_DHs)
labels_long.extend(rep_DHs)

#create a new list of lables and remove previous duplicates
labels_short_filtered = [label for i, label in enumerate(labels_short) if i not in to_be_rmvd_idx]
labels_long_filtered = [label for i, label in enumerate(labels_long) if i not in to_be_rmvd_idx]

#filter transition matrix for to be removed indices
transition_all = np.delete(transition_all, to_be_rmvd_idx, axis = 0)
transition_all = np.delete(transition_all, to_be_rmvd_idx, axis = 1)

transition_min = np.delete(transition_min, to_be_rmvd_idx, axis = 0)
transition_min = np.delete(transition_min, to_be_rmvd_idx, axis = 1)

transition_max = np.delete(transition_max, to_be_rmvd_idx, axis = 0)
transition_max = np.delete(transition_max, to_be_rmvd_idx, axis = 1)

Let's see the total number of transitions in each layer (escalated and non-escalated conversations)

In [11]:
np.sum(np.sum(transition_max, axis=0), axis=0)

array([1928., 1827.])

We have ~2000 transition in each layer. To simplify, I merged labels with low inward transitions (because label 'end' does not have any outward transition). I summed over each column over both layers.
After filtering, I gathered all the filtered labels into label 'Other'. Since there was an 'Other' label in the given labels, first, I checked if 'Other' label has survived filtering! To make the code simpler I have used three different cells for each method.

In [12]:
#filter the columns with less than a threshold of the inward transitions
threshold_in = 40 # overall 4000 transitions. 40 is one percent

#find filter indices
filter = (np.sum(np.sum(transition_all, axis=0), axis=1)) < threshold_in

# find index of other in label list
other_idx = labels_short.index('Other')



#check if "other" label has survived!
if not filter[other_idx]: #if it's survived on its own

  #sum all filtered labels into other column/row
  #and add it with 'Other' col
  transition_all[:,other_idx,:] += np.sum(transition_all[:,filter,:], axis = 1)
  #sum all DH4s rows and keep it in desired shape 
  transition_all[other_idx,:,:] += np.sum(transition_all[filter,:,:], axis = 0)

  # filter to find the above threshold ones
  filter = (np.sum(np.sum(transition_all, axis=0), axis=1)) > threshold_in
  trans_flt_all = transition_all[filter,:,:][:,filter,:]
else: #if 'Other' transitions was lower than the threshold on its own and therefore was removed
  #sum filtered labels col and keep it in desired shape
  last_col = np.sum(transition_all[:,filter,:], axis = 1)[:,None,:]
  diagonal_elements = np.sum(transition_all[filter,filter,:],axis = 0)[None,None,:]
  #check if col sum is below threshod after summation
  if last_col.sum() + diagonal_elements.sum() < threshold_in:
    filter = (np.sum(np.sum(transition_all, axis=0), axis=1)) > threshold_in
    trans_flt_all = transition_all[filter,:,:][:,filter,:]
  else:
    #sum all removed rows and keep it in desired shape 
    last_row = np.sum(transition_all[filter,:,:], axis = 0)[None,:,:]
    

    #add 'Other' sum to transition matrix
    transition_all = np.vstack((np.hstack((transition_all, last_col)), np.hstack((last_row, diagonal_elements))))
      # filter to find the remaining ones
    filter = (np.sum(np.sum(transition_all, axis=0), axis=1)) > threshold_in

    trans_flt_all = transition_all[filter,:,:][:,filter,:]

#filter labels list accordingnly
labels_long_flt_all = [item for (item, cond) in zip(labels_long_filtered, filter) if cond]
labels_short_flt_all = [item for (item, cond) in zip(labels_short_filtered, filter) if cond]

#check if Other has survived this time
if trans_flt_all.shape[0] != len(labels_long_flt_all): #'Other' label transition has survived after summation
  labels_long_flt_all.append('Other')
  labels_short_flt_all.append('Other')

In [13]:
#method 'min'

#find filter indices
filter = (np.sum(np.sum(transition_all, axis=0), axis=1)) < threshold_in

# find index of other in label list
other_idx = labels_short.index('Other')


#check if "other" label has survived!
if not filter[other_idx]: #if it's survived on its own

  #sum all filtered labels into other column/row
  #and add it with 'Other' col
  transition_min[:,other_idx,:] += np.sum(transition_min[:,filter,:], axis = 1)
  #sum all DH4s rows and keep it in desired shape 
  transition_min[other_idx,:,:] += np.sum(transition_min[filter,:,:], axis = 0)

  # filter to find the above threshold ones
  filter = (np.sum(np.sum(transition_min, axis=0), axis=1)) > threshold_in
  trans_flt_min = transition_min[filter,:,:][:,filter,:]
else: #if 'Other' transitions was lower than the threshold on its own and therefore was removed
  #sum filtered labels col and keep it in desired shape
  last_col = np.sum(transition_min[:,filter,:], axis = 1)[:,None,:]
  diagonal_elements = np.sum(transition_min[filter,filter,:],axis = 0)[None,None,:]
  #check if col sum is below threshod after summation
  if last_col.sum() + diagonal_elements.sum() < threshold_in:
    filter = (np.sum(np.sum(transition_min, axis=0), axis=1)) > threshold_in
    trans_flt_min = transition_min[filter,:,:][:,filter,:]
  else:
    #sum all removed rows and keep it in desired shape 
    last_row = np.sum(transition_min[filter,:,:], axis = 0)[None,:,:]
    

    #add 'Other' sum to transition matrix
    transition_min = np.vstack((np.hstack((transition_min, last_col)), np.hstack((last_row, diagonal_elements))))
      # filter to find the remaining ones
    filter = (np.sum(np.sum(transition_min, axis=0), axis=1)) > threshold_in

    trans_flt_min = transition_min[filter,:,:][:,filter,:]

#filter labels list accordingnly
labels_long_flt_min = [item for (item, cond) in zip(labels_long_filtered, filter) if cond]
labels_short_flt_min = [item for (item, cond) in zip(labels_short_filtered, filter) if cond]

#check if Other has survived this time
if trans_flt_min.shape[0] != len(labels_long_flt_min): #'Other' label transition has survived after summation
  labels_long_flt_min.append('Other')
  labels_short_flt_min.append('Other')

In [14]:
#method 'max'

#find filter indices
filter = (np.sum(np.sum(transition_all, axis=0), axis=1)) < threshold_in

# find index of other in label list
other_idx = labels_short.index('Other')


#check if "other" label has survived!
if not filter[other_idx]: #if it's survived on its own

  #sum all filtered labels into other column/row
  #and add it with 'Other' col
  transition_max[:,other_idx,:] += np.sum(transition_max[:,filter,:], axis = 1)
  #sum all DH4s rows and keep it in desired shape 
  transition_max[other_idx,:,:] += np.sum(transition_max[filter,:,:], axis = 0)

  # filter to find the above threshold ones
  filter = (np.sum(np.sum(transition_max, axis=0), axis=1)) > threshold_in
  trans_flt_max = transition_max[filter,:,:][:,filter,:]
else: #if 'Other' transitions was lower than the threshold on its own and therefore was removed
  #sum filtered labels col and keep it in desired shape
  last_col = np.sum(transition_max[:,filter,:], axis = 1)[:,None,:]
  diagonal_elements = np.sum(transition_max[filter,filter,:],axis = 0)[None,None,:]
  #check if col sum is below threshod after summation
  if last_col.sum() + diagonal_elements.sum() < threshold_in:
    filter = (np.sum(np.sum(transition_max, axis=0), axis=1)) > threshold_in
    trans_flt_max = transition_max[filter,:,:][:,filter,:]
  else:
    #sum all removed rows and keep it in desired shape 
    last_row = np.sum(transition_max[filter,:,:], axis = 0)[None,:,:]
    

    #add 'Other' sum to transition matrix
    transition_max = np.vstack((np.hstack((transition_max, last_col)), np.hstack((last_row, diagonal_elements))))
      # filter to find the remaining ones
    filter = (np.sum(np.sum(transition_max, axis=0), axis=1)) > threshold_in

    trans_flt_max = transition_max[filter,:,:][:,filter,:]

#filter labels list accordingnly
labels_long_flt_max = [item for (item, cond) in zip(labels_long_filtered, filter) if cond]
labels_short_flt_max = [item for (item, cond) in zip(labels_short_filtered, filter) if cond]

#check if Other has survived this time
if trans_flt_max.shape[0] != len(labels_long_flt_max): #'Other' label transition has survived after summation
  labels_long_flt_max.append('Other')
  labels_short_flt_max.append('Other')

## **Part 3: graph for filtered transitions and fixed nodes**

For the visualizing nodes, we wanted to divide the graph into two parts. A part for DH labels which are ordered based on their level. The other part would be all other lables. In the next cell, I separated the labels and also sorted them alphabetically.




In [15]:
#sort the labels alphabetically and get old indices for both DH and non-DHs:
#method 'all'
mask = ['DH' in s for s in labels_short_flt_all]
sorted_DH_all = sorted(((name, index) for index, name in 
                         enumerate([item for (item, cond) in zip(labels_short_flt_all, mask) if cond]))
                          , reverse=True)
sorted_non_DH_all = sorted((name, index) for index, name in 
                         enumerate([item for (item, cond) in zip(labels_short_flt_all, mask) if not cond]))

#method 'min'
mask = ['DH' in s for s in labels_short_flt_min]
sorted_DH_min = sorted(((name, index) for index, name in 
                         enumerate([item for (item, cond) in zip(labels_short_flt_min, mask) if cond]))
                          , reverse=True)
sorted_non_DH_min = sorted((name, index) for index, name in 
                         enumerate([item for (item, cond) in zip(labels_short_flt_min, mask) if not cond]))

#method 'max'
mask = ['DH' in s for s in labels_short_flt_max]
sorted_DH_max = sorted(((name, index) for index, name in 
                         enumerate([item for (item, cond) in zip(labels_short_flt_max, mask) if cond]))
                          , reverse=True)
sorted_non_DH_max = sorted((name, index) for index, name in 
                         enumerate([item for (item, cond) in zip(labels_short_flt_max, mask) if not cond]))


In the end, I built six (three, number of methods by two, escalated and non-escalated) graphs. 
The size of each node is proportional to the number of inward transitions. The width of each edge is proportional to the number of transition between the two nodes (except for the transitions to node 'end').
I added a variable 'threshod' to omit the eadges with less number of transition. If any edge transition is lower than the 'threshod' the edge won't be shown in the graph.
I also have removed the transitions from an edge to itself. It helped declutter the graph substantially. Besides, it meant that the conversation remained in a state for a longer time (which we did not care about).

The graphs will be save as 'html' files in the current directory (after running next cell, open 'files' on the left menu bar; you should see two files with names "nonesc_fixed.html" and "esc_fixed.html"; the files are downloadable).


First, let's look at the 'max' graphs:

In [25]:
# create two empty directed Network instance, with certain size in pixels
non_esc_net_max = Network(height='900px', width='900px',directed =True)
esc_net_max = Network(height='900px', width='900px',directed =True)
# a threshold for showing the frequency of transitions
threshold = 12


#keep track of node index
idx_node = 0
# add nodes to network 
#left column for DHs & right col for the rest
# index is the sorted index in DH or non-DH (sotrted_DH or sorted_non_DH)
# idx is the index in the filtered label matrix (all the labels)
# each node would be part of an ellipse for the visualization purposes
# each group of nodes is part of a half ellipse (two different circles)


# number of nodes in the left part of the ellipse; all DHs
node_no_col = 7 #len(sorted_DH_max)
# generate coordinates for all the nodes in the left part
y = np.linspace(0,600,node_no_col)
x = -np.sqrt(300**2 - ((y-300)**2)/1.1) - 50
# add nodes from sorted_DH
for node, _ in sorted_DH_max:
  idx = labels_short_flt_max.index(node)

  #check its frequency by summing the column
  value = int(np.sum(trans_flt_max[:,idx,0]))
  
  non_esc_net_max.add_node(idx_node, label=node, value=value*4,
               x= int(x[idx_node]), y=int(y[idx_node]))
               
  value = int(np.sum(trans_flt_max[:,idx,1]))
  
  esc_net_max.add_node(idx_node, label=node, value=value*4,
               x= int(x[idx_node]), y=int(y[idx_node]))
               
  idx_node += 1

# number of nodes in the left part of the circle
node_no_col = len(sorted_non_DH_max)
#positions of the nodes for the right part
y = np.linspace(0,600,node_no_col)
x = (np.sqrt(300**2 - ((y-300)**2)/1.1) + 50)

#add non-DH nodes to the graph
for node, _ in sorted_non_DH_max:
  # idx of the node on labels_short list
  idx = labels_short_flt_max.index(node)

  #check its frequency by summing the column
  value = int(np.sum(trans_flt_max[:,idx,0]))
  # print(value)
  non_esc_net_max.add_node(idx_node, label=node, value=value*4,
               x= int(x[idx_node - len(sorted_DH_max)]),
               y=int(y[idx_node - len(sorted_DH_max)]))
              
  value = int(np.sum(trans_flt_all[:,idx,1]))
  # print(value)
  esc_net_max.add_node(idx_node, label=node, value=value*4,
               x= int(x[idx_node - len(sorted_DH_max)]),
               y=int(y[idx_node - len(sorted_DH_max)]))
             
  idx_node += 1
  



# add edges from DHs to both
idx_source = 0
for source, _ in sorted_DH_max:
  # idx of the node on labels_short list
  idx = labels_short_flt_max.index(source)
  # print(node, idx)


  idx_target = 0
  for target, _ in sorted_DH_max:
    #remove transitions to the source itself
    if target == source:
      idx_target += 1
      continue
  # idx of the node on labels_short list
    idx_t = labels_short_flt_max.index(target)
    weight = int(trans_flt_max[idx,idx_t,0])
    # weight = penfunc(trans_freq,5)
    
    if weight > threshold:
      non_esc_net_max.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
    weight = int(trans_flt_max[idx,idx_t,1])
    # weight = penfunc(trans_freq,5)
    
    if weight > threshold:
      esc_net_max.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)      
    
      
    idx_target += 1
  for target, _ in sorted_non_DH_max:
    #remove transitions to the source itself
    if target == source:
      idx_target += 1
      continue    
  # idx of the node on labels_short list
    idx_t = labels_short_flt_max.index(target)
    weight = int(trans_flt_max[idx,idx_t,0])
    if target == 'end' and weight > 0:
      weight = threshold + 1
    if weight > threshold:
      non_esc_net_max.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
      
    weight = int(trans_flt_max[idx,idx_t,1])
    if target == 'end' and weight > 0:
      weight = threshold + 1
    if weight > threshold:
      esc_net_max.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
      
    idx_target += 1
  idx_source += 1
# add edges from non_DHs to both
for source, _ in sorted_non_DH_max:
  # idx of the node on labels_short list
  idx = labels_short_flt_max.index(source)
  # print(node, idx)

 
  idx_target = 0
  for target, _ in sorted_DH_max:
  
  # idx of the node on labels_short list
    idx_t = labels_short_flt_max.index(target)
    weight = int(transition_all[idx,idx_t,0])
    if weight > threshold:
      non_esc_net_max.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
      
    weight = int(trans_flt_all[idx,idx_t,1])
    if weight > threshold:
      esc_net_max.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
      
    idx_target += 1
  
  for target, _ in sorted_non_DH_max:
    #remove transitions to the source itself
    if target == source:
      idx_target += 1
      continue   
  # idx of the node on labels_short list
    idx_t = labels_short_flt_max.index(target)
    weight = int(trans_flt_max[idx,idx_t,0])
    # weight = penfunc(trans_freq,5)
    if target == 'end':
      weight = threshold + 1
    if weight > threshold:
      non_esc_net_max.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
    weight = int(trans_flt_max[idx,idx_t,1])
    # weight = penfunc(trans_freq,5)
    if target == 'end':
      weight = threshold + 1
    if weight > threshold:
      esc_net_max.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
      
    idx_target += 1
  idx_source += 1


# toggle_physics method changes the position of nodes based on the strength
# of the edges. Since we wanted to have the same positions for both graphs, this 
# option is off
non_esc_net_max.toggle_physics(False)
esc_net_max.toggle_physics(False)




#create and save the graphs in current directory
non_esc_net_max.show("nonesc_fixed_max.html")
esc_net_max.show("esc_fixed_max.html")


In [26]:
#show the escalated graph
display(HTML('esc_fixed_max.html'))
#you can drag the nodes or zoom/unzoom the plot 

In [20]:
#show the non-escalated graph
display(HTML('nonesc_fixed_max.html'))
#you can drag the nodes or zoom/unzoom the plot 

We could see that using less higher order 'DH' arguments is not the case for 'escalated' vs 'non-escalated' conversations. However, it seems that using higher DHs did not associate with more "cooperative" conversations in "escalated" discussions.
Now we built the graph with 'min' method to see if lower order DHs were more present in "escalated" conversations. 

In [27]:
# create two empty directed Network instance, with certain size in pixels
non_esc_net_min = Network(height='900px', width='900px',directed =True)
esc_net_min = Network(height='900px', width='900px',directed =True)
# a threshold for showing the frequency of transitions
threshold = 12


#keep track of node index
idx_node = 0
# add nodes to network 
#left column for DHs & right col for the rest
# index is the sorted index in DH or non-DH (sotrted_DH or sorted_non_DH)
# idx is the index in the filtered label matrix (all the labels)
# each node would be part of an ellipse for the visualization purposes
# each group of nodes is part of a half ellipse (two different circles)


# number of nodes in the left part of the ellipse; all DHs
node_no_col = 7 #len(sorted_DH_min)
# generate coordinates for all the nodes in the left part
y = np.linspace(0,600,node_no_col)
x = -np.sqrt(300**2 - ((y-300)**2)/1.1) - 50
# add nodes from sorted_DH
for node, _ in sorted_DH_min:
  idx = labels_short_flt_min.index(node)

  #check its frequency by summing the column
  value = int(np.sum(trans_flt_min[:,idx,0]))
  
  non_esc_net_min.add_node(idx_node, label=node, value=value*4,
               x= int(x[idx_node]), y=int(y[idx_node]))
               
  value = int(np.sum(trans_flt_min[:,idx,1]))
  
  esc_net_min.add_node(idx_node, label=node, value=value*4,
               x= int(x[idx_node]), y=int(y[idx_node]))
               
  idx_node += 1

# number of nodes in the left part of the circle
node_no_col = len(sorted_non_DH_min)
#positions of the nodes for the right part
y = np.linspace(0,600,node_no_col)
x = (np.sqrt(300**2 - ((y-300)**2)/1.1) + 50)

#add non-DH nodes to the graph
for node, _ in sorted_non_DH_min:
  # idx of the node on labels_short list
  idx = labels_short_flt_min.index(node)

  #check its frequency by summing the column
  value = int(np.sum(trans_flt_min[:,idx,0]))
  # print(value)
  non_esc_net_min.add_node(idx_node, label=node, value=value*4,
               x= int(x[idx_node - len(sorted_DH_min)]),
               y=int(y[idx_node - len(sorted_DH_min)]))
              
  value = int(np.sum(trans_flt_all[:,idx,1]))
  # print(value)
  esc_net_min.add_node(idx_node, label=node, value=value*4,
               x= int(x[idx_node - len(sorted_DH_min)]),
               y=int(y[idx_node - len(sorted_DH_min)]))
             
  idx_node += 1
  



# add edges from DHs to both
idx_source = 0
for source, _ in sorted_DH_min:
  # idx of the node on labels_short list
  idx = labels_short_flt_min.index(source)
  # print(node, idx)


  idx_target = 0
  for target, _ in sorted_DH_min:
    #remove transitions to the source itself
    if target == source:
      idx_target += 1
      continue
  # idx of the node on labels_short list
    idx_t = labels_short_flt_min.index(target)
    weight = int(trans_flt_min[idx,idx_t,0])
    # weight = penfunc(trans_freq,5)
    
    if weight > threshold:
      non_esc_net_min.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
    weight = int(trans_flt_min[idx,idx_t,1])
    # weight = penfunc(trans_freq,5)
    
    if weight > threshold:
      esc_net_min.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)      
    
      
    idx_target += 1
  for target, _ in sorted_non_DH_min:
    #remove transitions to the source itself
    if target == source:
      idx_target += 1
      continue    
  # idx of the node on labels_short list
    idx_t = labels_short_flt_min.index(target)
    weight = int(trans_flt_min[idx,idx_t,0])
    if target == 'end' and weight > 0:
      weight = threshold + 1
    if weight > threshold:
      non_esc_net_min.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
      
    weight = int(trans_flt_min[idx,idx_t,1])
    if target == 'end' and weight > 0:
      weight = threshold + 1
    if weight > threshold:
      esc_net_min.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
      
    idx_target += 1
  idx_source += 1
# add edges from non_DHs to both
for source, _ in sorted_non_DH_min:
  # idx of the node on labels_short list
  idx = labels_short_flt_min.index(source)
  # print(node, idx)

 
  idx_target = 0
  for target, _ in sorted_DH_min:
  
  # idx of the node on labels_short list
    idx_t = labels_short_flt_min.index(target)
    weight = int(transition_all[idx,idx_t,0])
    if weight > threshold:
      non_esc_net_min.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
      
    weight = int(trans_flt_all[idx,idx_t,1])
    if weight > threshold:
      esc_net_min.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
      
    idx_target += 1
  
  for target, _ in sorted_non_DH_min:
    #remove transitions to the source itself
    if target == source:
      idx_target += 1
      continue   
  # idx of the node on labels_short list
    idx_t = labels_short_flt_min.index(target)
    weight = int(trans_flt_min[idx,idx_t,0])
    # weight = penfunc(trans_freq,5)
    if target == 'end':
      weight = threshold + 1
    if weight > threshold:
      non_esc_net_min.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
    weight = int(trans_flt_min[idx,idx_t,1])
    # weight = penfunc(trans_freq,5)
    if target == 'end':
      weight = threshold + 1
    if weight > threshold:
      esc_net_min.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
      
    idx_target += 1
  idx_source += 1


# toggle_physics method changes the position of nodes based on the strength
# of the edges. Since we wanted to have the same positions for both graphs, this 
# option is off
non_esc_net_min.toggle_physics(False)
esc_net_min.toggle_physics(False)




#create and save the graphs in current directory
non_esc_net_min.show("nonesc_fixed_min.html")
esc_net_min.show("esc_fixed_min.html")


In [20]:
#show the non-escalated graph
display(HTML('nonesc_fixed_min.html'))
#you can drag the nodes or zoom/unzoom the plot

In [21]:
#show the non-escalated graph
display(HTML('esc_fixed_min.html'))
#you can drag the nodes or zoom/unzoom the plot

It seems like lower order DHs were actually used more in "escalated" conversations. This means that although higher order DHs were used in "escalated" as much as "non-escalated" conversation, but these higher DHs were accompanied by lower DHs (especially DH1); this could be part of the reason why less coordination was seen.
Given both graphs, we had the idea of giving all the labels equall opportunities to be seen; we built the graphs with "all" method.

In [28]:
# create two empty directed Network instance, with certain size in pixels
non_esc_net_all = Network(height='900px', width='900px',directed =True)
esc_net_all = Network(height='900px', width='900px',directed =True)
# a threshold for showing the frequency of transitions
threshold = 12


#keep track of node index
idx_node = 0
# add nodes to network 
#left column for DHs & right col for the rest
# index is the sorted index in DH or non-DH (sotrted_DH or sorted_non_DH)
# idx is the index in the filtered label matrix (all the labels)
# each node would be part of an ellipse for the visualization purposes
# each group of nodes is part of a half ellipse (two different circles)


# number of nodes in the left part of the ellipse; all DHs
node_no_col = 7 #len(sorted_DH_all)
# generate coordinates for all the nodes in the left part
y = np.linspace(0,600,node_no_col)
x = -np.sqrt(300**2 - ((y-300)**2)/1.1) - 50
# add nodes from sorted_DH
for node, _ in sorted_DH_all:
  idx = labels_short_flt_all.index(node)

  #check its frequency by summing the column
  value = int(np.sum(trans_flt_all[:,idx,0]))
  
  non_esc_net_all.add_node(idx_node, label=node, value=value*4,
               x= int(x[idx_node]), y=int(y[idx_node]))
               
  value = int(np.sum(trans_flt_all[:,idx,1]))
  
  esc_net_all.add_node(idx_node, label=node, value=value*4,
               x= int(x[idx_node]), y=int(y[idx_node]))
               
  idx_node += 1

# number of nodes in the left part of the circle
node_no_col = len(sorted_non_DH_all)
#positions of the nodes for the right part
y = np.linspace(0,600,node_no_col)
x = (np.sqrt(300**2 - ((y-300)**2)/1.1) + 50)

#add non-DH nodes to the graph
for node, _ in sorted_non_DH_all:
  # idx of the node on labels_short list
  idx = labels_short_flt_all.index(node)

  #check its frequency by summing the column
  value = int(np.sum(trans_flt_all[:,idx,0]))
  # print(value)
  non_esc_net_all.add_node(idx_node, label=node, value=value*4,
               x= int(x[idx_node - len(sorted_DH_all)]),
               y=int(y[idx_node - len(sorted_DH_all)]))
              
  value = int(np.sum(trans_flt_all[:,idx,1]))
  # print(value)
  esc_net_all.add_node(idx_node, label=node, value=value*4,
               x= int(x[idx_node - len(sorted_DH_all)]),
               y=int(y[idx_node - len(sorted_DH_all)]))
             
  idx_node += 1
  



# add edges from DHs to both
idx_source = 0
for source, _ in sorted_DH_all:
  # idx of the node on labels_short list
  idx = labels_short_flt_all.index(source)
  # print(node, idx)


  idx_target = 0
  for target, _ in sorted_DH_all:
    #remove transitions to the source itself
    if target == source:
      idx_target += 1
      continue
  # idx of the node on labels_short list
    idx_t = labels_short_flt_all.index(target)
    weight = int(trans_flt_all[idx,idx_t,0])
    # weight = penfunc(trans_freq,5)
    
    if weight > threshold:
      non_esc_net_all.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
    weight = int(trans_flt_all[idx,idx_t,1])
    # weight = penfunc(trans_freq,5)
    
    if weight > threshold:
      esc_net_all.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)      
    
      
    idx_target += 1
  for target, _ in sorted_non_DH_all:
    #remove transitions to the source itself
    if target == source:
      idx_target += 1
      continue    
  # idx of the node on labels_short list
    idx_t = labels_short_flt_all.index(target)
    weight = int(trans_flt_all[idx,idx_t,0])
    if target == 'end' and weight > 0:
      weight = threshold + 1
    if weight > threshold:
      non_esc_net_all.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
      
    weight = int(trans_flt_all[idx,idx_t,1])
    if target == 'end' and weight > 0:
      weight = threshold + 1
    if weight > threshold:
      esc_net_all.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
      
    idx_target += 1
  idx_source += 1
# add edges from non_DHs to both
for source, _ in sorted_non_DH_all:
  # idx of the node on labels_short list
  idx = labels_short_flt_all.index(source)
  # print(node, idx)

 
  idx_target = 0
  for target, _ in sorted_DH_all:
  
  # idx of the node on labels_short list
    idx_t = labels_short_flt_all.index(target)
    weight = int(transition_all[idx,idx_t,0])
    if weight > threshold:
      non_esc_net_all.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
      
    weight = int(trans_flt_all[idx,idx_t,1])
    if weight > threshold:
      esc_net_all.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
      
    idx_target += 1
  
  for target, _ in sorted_non_DH_all:
    #remove transitions to the source itself
    if target == source:
      idx_target += 1
      continue   
  # idx of the node on labels_short list
    idx_t = labels_short_flt_all.index(target)
    weight = int(trans_flt_all[idx,idx_t,0])
    # weight = penfunc(trans_freq,5)
    if target == 'end':
      weight = threshold + 1
    if weight > threshold:
      non_esc_net_all.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
    weight = int(trans_flt_all[idx,idx_t,1])
    # weight = penfunc(trans_freq,5)
    if target == 'end':
      weight = threshold + 1
    if weight > threshold:
      esc_net_all.add_edge(idx_source, idx_target, 
                  weight=weight,
                  value=weight)
      
    idx_target += 1
  idx_source += 1


# toggle_physics method changes the position of nodes based on the strength
# of the edges. Since we wanted to have the same positions for both graphs, this 
# option is off
non_esc_net_all.toggle_physics(False)
esc_net_all.toggle_physics(False)




#create and save the graphs in current directory
non_esc_net_min.show("nonesc_fixed_all.html")
esc_net_all.show("esc_fixed_all.html")


In [21]:
#show the non-escalated graph
display(HTML('nonesc_fixed_all.html'))
#you can drag the nodes or zoom/unzoom the plot

In [24]:
#show the non-escalated graph
display(HTML('esc_fixed_all.html'))
#you can drag the nodes or zoom/unzoom the plot

Comparing the left and right parts of the graphs in "escalated" and "non-escalated" conversation, using "all" labels shows less transitions between the left part (rebutal tactics) and right part (resolusional tactics) in the "escalated" graph. This means that although authors in escalated conversations used almost the same level of reasoning to discuss their points of view, but it did not encourge using more resolusional arguments such as "Asking quesions", "providing a clarification", or "suggesting a compromise".

<a name='refs'></a>
***References***:
1. Paul Graham. 2008. How to disagree. http://www.paulgraham.com/disagree.html
2. Christine DeKock, Andrew Velachos. 2022. [ask for more info]