# Notebook 5: Conversion for Sankey diagram

The contract life cycles were constructed by combining all documents, their labels (preliminary labels and those from notebook #4), and their timestamps. This notebook formats those life cycles as input for the subsequent construction of a Sankey diagram.

The results are reported in section 4.3.3 of the thesis.

Table of Contents:
* [5.1 Loading contract life cycles as list](#load)
* [5.2 Exploratory analysis of underlying distribution](#explore)
* [5.3 Format contract life cycles for Sankey diagram](#sankey)

In [None]:
#loading required packages
import pickle
from collections import Counter
from copy import deepcopy
import matplotlib.pyplot as plt

# 5.1 Loading contract life cycles <a id="load"></a>

In [None]:
# loading contract life cycles from notebook #4

#load list_contracts
    
print(len(list_contracts)) #print number of contracts to ensure correct length
print(list_contracts[0]) #print sample contract life cycle to ensure correct format
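
The load statement itself is not shown in the cell above. As a minimal sketch, assuming `list_contracts` was saved with `pickle` (which is imported at the top of this notebook), a round trip could look as follows. The file name `list_contracts_example.pkl` and the toy data are hypothetical and only serve as illustration; the real path from notebook #4 is not known here.

```python
import pickle
import tempfile
from pathlib import Path

# Toy stand-in for the real data: each contract is a list of document labels.
toy_contracts = [['NDA', 'Agreement', 'Amendment'], ['Offer'], ['Agreement', 'SOW']]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, not the path used in notebook #4.
    pickle_path = Path(tmp_dir) / 'list_contracts_example.pkl'

    # Save the list of contract life cycles.
    with open(pickle_path, 'wb') as f:
        pickle.dump(toy_contracts, f)

    # Load it again, as this notebook would do with the real file.
    with open(pickle_path, 'rb') as f:
        loaded_contracts = pickle.load(f)

print(len(loaded_contracts))   # number of contracts
print(loaded_contracts[0])     # sample contract life cycle
```

Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code.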

# 5.2 Exploratory analysis of underlying distribution <a id="explore"></a>

In [None]:
# retrieve dictionary of contract lengths
t = [len(contract) for contract in list_contracts] #list of contract lengths
t_d = dict(sorted(Counter(t).items())) #aggregate and sort contract length distribution
print(t_d)
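
To make the shape of the resulting dictionary concrete, here is a small sketch on toy data. The contracts below are invented for illustration and are not taken from the thesis data set.

```python
from collections import Counter

# Toy contracts: each contract is a list of document labels.
toy_contracts = [
    ['NDA', 'Agreement', 'Amendment'],
    ['Offer'],
    ['Agreement', 'SOW'],
    ['Agreement'],
]

lengths = [len(contract) for contract in toy_contracts]   # [3, 1, 2, 1]
length_dist = dict(sorted(Counter(lengths).items()))      # keys: contract length, values: number of contracts
print(length_dist)  # {1: 2, 2: 1, 3: 1}
```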

In [None]:
# visualize contract length distribution
t_p = deepcopy(t_d)
base = 0
for x in t_p:
    new_p = (t_p[x]/len(list_contracts))*100 + base # iteratively increase cumulative percentage (0-100)
    t_p[x] = new_p #update cumulative percentage
    base = new_p #update basis for new iteration

# plot distribution
fig, ax = plt.subplots()
ax.plot([*t_p], list(t_p.values()))

ax.set(xlabel='Number of documents per contract', ylabel='Cumulative %')
plt.show()

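# print cumulative percentage for selection of contract lengths
# (a length that does not occur in the data falls back to the closest smaller length instead of raising a KeyError)
for i in [1, 2, 5, 10, 20]:
    p = max((v for k, v in t_p.items() if k <= i), default=0) #cumulative value for all lengths up to i
    print('Percentage of contracts containing up to', i, 'documents:', p)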
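
The cumulative computation can also be sketched with `itertools.accumulate`. This is an alternative formulation on toy numbers, not the code used for the thesis results; the helper `share_up_to` is a hypothetical name introduced for illustration.

```python
from itertools import accumulate

# Toy distribution: contract length -> number of contracts (must be sorted by length).
length_dist = {1: 2, 2: 1, 4: 1}
total = sum(length_dist.values())

# Running total of contracts per length: {1: 2, 2: 3, 4: 4}
cumulative = dict(zip(length_dist, accumulate(length_dist.values())))

def share_up_to(n):
    """Fraction of contracts with at most n documents (hypothetical helper)."""
    count = max((c for length, c in cumulative.items() if length <= n), default=0)
    return count / total

print(share_up_to(1))  # 0.5
print(share_up_to(3))  # 0.75 (no contract has exactly 3 documents)
print(share_up_to(4))  # 1.0
```

Looking up the largest length not exceeding `n` avoids a `KeyError` when a requested length does not occur in the data.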

# 5.3 Format contract life cycles for Sankey diagram <a id="sankey"></a>

In [None]:
text_labels = ['Agreement', 'Amendment', 'Attachment', 'LOI', 'NDA', 'Offer', 'SOW']
colors = ['#0B86F3', '#F3B40B', '#92D53C', '#3CD5A2', '#7A440E', '#9F0BF3', '#EAF74C']  # red: #F53B13

# define functions for formatting

# retrieve list of the document types of the first document (num = 0), second document (num = 1) etc. across all contracts
def dist_list(num):
    l = list()
    for i in range(len(list_contracts)): #access each contract
        try:
            l.append(list_contracts[i][num]) #add desired document of contract to list
        except IndexError:
            continue
    return l #return list of documents

# iterate through all document types and analyse their next labels
# num is the list index of the next document, so num must be >= 1
def label_new(num):
    for s, i in enumerate(text_labels): #s: index of document type for color etc.
        next_labels = list() #create list of next labels
        for contract in list_contracts:
            try:
                if contract[num-1] == i: #access contract depending on previous document type
                    next_labels.append(contract[num]) #add next document type

            except IndexError: #catch error if contract is not long enough
                continue

        dist_analysis(next_labels, i, num, s) #analyse list of next labels

# analyse distribution of next labels
# l: list of next document types, y: current document type, n: position of current document, s: index of y in text_labels/colors
def dist_analysis(l, y, n, s):
    counter = dict(Counter(l)) #aggregate next document types
    print(':{} {}'.format(str(y)+'_'+str(n), colors[s])) #print node + color
    for label in counter:
        print('{} [{}] {} {}.5'.format(str(y)+'_'+str(n), counter[label], str(label)+'_'+str(n+1), colors[s])) #print how many documents flow to each next document type
    print()
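
To illustrate the line format these functions print, the following sketch builds the same node and flow lines for a single document type on toy data. The function `flow_lines` and the toy contracts are hypothetical and only mirror the logic of `label_new` and `dist_analysis`; unlike those functions, it returns the lines instead of printing them.

```python
from collections import Counter

# Toy contracts: each contract is a list of document labels.
toy_contracts = [['NDA', 'Agreement'], ['NDA', 'Agreement'], ['NDA', 'SOW'], ['Offer']]

def flow_lines(contracts, prev_label, num, color):
    """Node line plus one flow line per document type that follows prev_label.

    prev_label is matched at list index num-1, the next document is read at
    index num (num >= 1). Contracts that are too short are skipped.
    """
    next_labels = [c[num] for c in contracts if len(c) > num and c[num - 1] == prev_label]
    lines = [':{}_{} {}'.format(prev_label, num, color)]  # node with color
    for label, count in Counter(next_labels).items():
        # source node, [number of documents], target node, color with .5 suffix
        lines.append('{}_{} [{}] {}_{} {}.5'.format(prev_label, num, count, label, num + 1, color))
    return lines

for line in flow_lines(toy_contracts, 'NDA', 1, '#7A440E'):
    print(line)
# :NDA_1 #7A440E
# NDA_1 [2] Agreement_2 #7A440E.5
# NDA_1 [1] SOW_2 #7A440E.5
```

Each flow line reads as "source node, amount in brackets, target node, color", so the document at list index 0 appears as `_1`, the one at index 1 as `_2`, and so on.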

In [None]:
# generate input for Sankey diagram

zero_dist = dist_list(0) #get initial distribution
for i in text_labels:
    n = zero_dist.count(i) #get number of documents for each class
    if n > 0:
        s = text_labels.index(str(i)) #access index of document type
        print(':{} {}'.format(str(i)+'_'+str(0), colors[s])) #print node with color
        print('{} [{}] {} {}.5 \n'.format(str(i)+'_'+str(0), n, str(i)+'_'+str(1), colors[s])) #print flow with color

for i in range(1, 5): #analyze flows between the first five documents of each contract (list indices 0-4)
    label_new(i)

# print final colored nodes 
for label in text_labels:
    s = text_labels.index(str(label)) #access index of document type
    print(':{} {}'.format(str(label)+'_'+str(5), colors[s])) #print node with respective color

In [None]:
# for visualization purposes: the first flow/connection (e.g. Agreement_0 -> Agreement_1) was constructed to show how many contracts end after the first document
# -> crop the output picture up to the Agreement_1 etc. nodes