## MA SNA Automated Generation of Directed Network Graph
## DS Discovery Project 
## Team 1: Adam, Win
## Dataset Edited by Jacob Jan de Ridder, 2020, for the Akkadian Epistolary Corpora project and released under the Creative Commons Attribution Share-Alike license 3.0.
### (The following md description has been written up by Niek Veldhuis for a similar project for a select group of Neo-Assyrian letters in ORACC. For a copy of the original, see: https://colab.research.google.com/drive/1rM9fgzHGWxEnj0TYTE5sJyjvS4PRanOo?usp=sharing)

> The purpose of this notebook is to make a preliminary investigation on how to best go about the automated parsing of the textual data in the MA letters and administrative texts into a prosopographic network structure. The impetus for automatic parsing is multifold, and I will list two main reasons here as I think of them: 1) Automated parsing is a faster method to grow the network from the textual data than a parsing approach that involves humans scanning through the texts making the structural decisions themselves. 2) It would be difficult to contrive a human annotation system that would yield a consistent prosopographic structure when multiple people scan the texts and try to match up their results. Thus the reproducibility of a human-based approach is weak. An automated approach would produce the same result each time whatever algorithm is decided upon. Afterwards, anybody can go from the textual data to the generated network structure with ease and confidence that they can reproduce our results.


> Nevertheless, such an approach has costs, some of which may make one wonder why we should not simply revert to human parsing. (1) Developing a parsing algorithm is time-consuming in itself, and one hopes that the total development time does not exceed the time a human would need to scan the texts. This concern is moot if the text corpus is sufficiently large, but the corpus of SAA letters is not unmanageably large: each book contains roughly 250 to 350 letters, and none of these letters is very long. (2) For any automatic approach, we must still read through the texts to determine which rules would be best to add to the parsing algorithm. This could involve a substantial amount of reading in itself. (3) One also hopes to find rules that generate appropriate nodes and edges for the network. This can be confounded by the complex nature of the language itself: since we are dealing with letters and not administrative/economic documents, the language is more free-flowing and less formulaic, except in the greeting section.



> In that regard, for an automated parsing technique to be viable, we must attempt to use as few rules as possible to develop the algorithm, so that it doesn't become overly complex, which both consumes time and confuses the researcher who comes in after us. The role of this notebook is to address the last issue raised in the preceding paragraph, i.e. can we really find these rules to generate a useful network? To do so, we will take a focused look at two parts of speech to determine how they can guide us in our endeavor: prepositions and verbs.

## Research Questions:
* How to disambiguate the Neo-Assyrian prosopography?
* How to model the flow of information?
* How to represent an itinerary (i.e. people moving to places)?
* How to calculate nearest-neighbor probabilities for proximate geographic names?



# 0 Before you get started...
1. Make sure you are in "playground" mode. Go to "File"->"Open in playground mode". This will ensure that any changes you make to the code will not be saved.
2. Use SHIFT+ENTER to step through the notebook. This command both runs the current cell and advances to the next cell.

# 1 Code Introduction

Our code begins with some basic steps:
1. Mount drive
2. Import modules
3. Setup folders

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import zipfile
import json
import requests
from tqdm import tqdm
import os
import errno
import re
import random
import numpy as np
import sys
import copy


import networkx as nx
#import matplotlib.pyplot as plt
#from ipywidgets import interact
#import ipywidgets as widgets
#from ipywidgets import Layout
#from IPython.display import display, clear_output

In [None]:
#Set folder for local drive
#folder = 'C://Users/jason/OneDrive/Documents/ORACC_MASTER/'

#Set folder for remote drive
folder = '/content/drive/My Drive/tcma/'

#importing utils for the method which downloads the current text json files
os.chdir(folder + 'network/utils/')
from utils import oracc_download

# This is a user defined module that searches through the texts to find the entities in the text that
# are people and places, to be imported as nodes into the network
os.chdir(folder + 'network/')
import rank_parser4 as rp

# 2 Extract Lemmatization from JSON
The code in this notebook parses the [ORACC](http://oracc.museum.upenn.edu) `JSON` files of the TCMA corpus to extract lemmatization data.

The output contains text IDs, line IDs, lemmas, and (potentially) other data.

## 2.1 The parsejson() function
The parsejson() function will "dig into" the `json` file (transformed into a dictionary) until it finds the relevant data. The `json` file consists of a hierarchy of cdl nodes; only the lowest nodes contain lemmatization data. The function goes down this hierarchy by calling itself when another cdl node is encountered. For more information about the data hierarchy in the ORACC `json` files, see [ORACC Open Data](http://oracc.org/doc/opendata/index.html).

The argument of the parsejson() function is a `JSON` object, essentially a `Python` dictionary that initially contains the entire contents of the original `JSON` file. The code takes the key cdl, the value of which is a list of `JSON` dictionaries. Iterating through these dictionaries, the function calls itself whenever a dictionary contains another cdl node, with this lower-level dictionary as argument. In this way the function digs deeper and deeper into the `JSON` tree until it no longer encounters a cdl key. At that point we are at the level of individual words. The code checks for a key f; if it exists, the value of that key (another dictionary) is appended to the list lemm_l. The list lemm_l, which is initialized outside of the function proper, will become a list of dictionaries, where each dictionary represents a single word.

The variable id_text consists of a project abbreviation, such as blms or cams/gkab plus a text ID, in the format cams/gkab/P338616 or dcclt/Q000039. The id_text is a global variable that is defined each time parsejson() is called in the main process. Therefore, it can be accessed from within the function and is added to the lemmatization data of every word.

The variable ftype denotes whether a node is a year name (yn) or not. This feature is specific to Ur III texts.

The field word_id consists of three parts, namely a text ID, line ID, and word ID, in the format Q000039.76.2 meaning: the second word in line 76 of text object Q000039. Note that 76 is not a line number strictly speaking but an object reference within the text object. Things like horizontal rulings, columns, and breaks also get object references. The word_id field allows us to put lines, breaks, and horizontal drawings together in the proper order.

The field label is a human-legible label that refers to a line or another part of the text; it may look like o i 23 (obverse column 1 line 23) or r v 23' (reverse column 5 line 23 prime). The label field is used in online [ORACC](http://oracc.org/) editions to indicate line numbers.

The fields extent, scope, and state give metatextual data about the condition of the object; they capture the number of broken lines or columns and similar information.
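To make the recursion concrete, here is a minimal, self-contained sketch on a toy document. It is not the notebook's `parsejson()`: the helper `collect_words()` and the toy data are hypothetical, and the sketch passes the text ID and label as arguments and returns a list instead of relying on the global variables `meta_d` and `lemm_l`. Real ORACC files contain many more keys than shown here.

```python
def collect_words(node, id_text, label=None, out=None):
    """Recursively walk nested 'cdl' lists and collect the word-level 'f' dictionaries."""
    if out is None:
        out = []
    for child in node.get("cdl", []):
        if "label" in child:            # a line-start node carries the human-readable label
            label = child["label"]
        if "cdl" in child:              # a lower-level chunk: dig deeper
            collect_words(child, id_text, label, out)
        if "f" in child:                # a word node: keep its lemmatization data
            word = dict(child["f"])
            word["id_word"] = child["ref"]
            word["label"] = label
            word["id_text"] = id_text
            out.append(word)
    return out


# Toy structure (an assumption for illustration, far simpler than a real text)
toy_text = {
    "cdl": [
        {"node": "c", "cdl": [
            {"node": "d", "type": "line-start", "label": "o 1"},
            {"node": "l", "ref": "X000001.1.1", "f": {"form": "ṭup-pi₂", "cf": "ṭuppu"}},
            {"node": "l", "ref": "X000001.1.2", "f": {"form": "ša", "cf": "ša"}},
        ]},
    ]
}

words = collect_words(toy_text, "X000001")
for w in words:
    print(w["id_word"], w["label"], w["form"])
```

Each returned dictionary corresponds to one row of the later words dataframe.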

In [None]:
def parsejson(text):
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject: 
            #print('cdl in JSON')
            parsejson(JSONobject)
        if "label" in JSONobject:
            meta_d["label"] = JSONobject['label']
        if "f" in JSONobject:
            lemma = JSONobject["f"]
            if "ftype" in JSONobject:   # you don't need this - useful for distinguishing between regular text and year names 
                lemma['ftype'] = JSONobject['ftype']
            lemma["id_word"] = JSONobject["ref"]
            lemma['label'] = meta_d["label"]
            lemma["id_text"] = meta_d["id_text"]
            lemm_l.append(lemma)
            #print('Appending Lemma: ' + str(lemma))
        if "strict" in JSONobject and JSONobject["strict"] == "1":
            lemma = {key: JSONobject[key] for key in dollar_keys}
            lemma["id_word"] = JSONobject["ref"]
            lemma["id_text"] = meta_d["id_text"]
            lemm_l.append(lemma)
    return

## 2.2 Call the parsejson() function for every `JSON` file
The code in this cell will iterate through the list of projects entered above (1.1). For each project the `JSON` zip file is located in the directory jsonzip, named PROJECT.zip. The zip file contains a directory that is called corpusjson that contains a `JSON` file for every text that is available in that corpus. The files are called after their text IDs in the pattern P######.json (or Q######.json or X######.json).

The function namelist() of the zipfile package is used to create a list of the names of all the files in the ZIP. From this list we select all the file names in the corpusjson directory with extension .json (this way we exclude the name of the directory itself).

Each of these files is read from the zip file and loaded with the command json.loads(), which transforms the string into a proper `JSON` object.

This `JSON` object (essentially a Python dictionary), which is called data_json is now sent to the parsejson() function. The function adds lemmata to the lemm_l list. In the end, lemm_l will contain as many list elements as there are words in all the texts in the projects requested.

The dictionary meta_d is created to hold temporary information. The value of the key id_text is updated in the main process every time a new `JSON` file is opened and sent to the parsejson() function. The parsejson() function itself will change values or add new keys, depending on the information found while iterating through the `JSON` file. When a new lemma row is created, parsejson() will supply data such as id_text, label and (potentially) other information from meta_d.
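The file selection described above can be tried without downloading anything. The following sketch builds a tiny ZIP archive in memory and then reads it back the same way; the project name `demo` and the file contents are invented for illustration and are not part of the TCMA data.

```python
import io
import json
import zipfile

# Build a small stand-in for a project ZIP (hypothetical content)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("demo/corpusjson/", "")                       # the directory entry itself
    z.writestr("demo/corpusjson/P000001.json", json.dumps({"cdl": []}))
    z.writestr("demo/corpusjson/X000002.json", json.dumps({"cdl": []}))
    z.writestr("demo/catalogue.json", json.dumps({"members": {}}))

with zipfile.ZipFile(buffer) as z:
    # keep only the text files: inside corpusjson and ending in .json
    names = [n for n in z.namelist() if "corpusjson" in n and n.endswith(".json")]
    # the text ID is the seven characters before the .json extension
    texts = {n[-12:-5]: json.loads(z.read(n).decode("utf-8")) for n in names}

print(sorted(texts))
```

The directory entry and the catalogue are filtered out, so only the two text files remain.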

Now let's download the JSON files for the TCMA project

The project files are the following (when entering them in the list below, make sure to separate the entries with commas):


"tcma/ali1"
"tcma/amarna"
"tcma/assur"
"tcma/barri"
"tcma/bazmusian"
"tcma/billa"
"tcma/brak"
"tcma/chuera"
"tcma/emar"
"tcma/fekheriye"
"tcma/giricano"
"tcma/hana"
"tcma/haradum"
"tcma/hatti"
"tcma/kalhu"
"tcma/kartn"
"tcma/kulishinas"
"tcma/miscellaneous"
"tcma/nineveh"
"tcma/nippur"
"tcma/nuzi"
"tcma/qitar"
"tcma/rimah"
"tcma/suri"
"tcma/taban"
"tcma/tsa1"
"tcma/tsh1"
"tcma/ugarit"

In [None]:
os.chdir(folder + 'network/')

projects = [
     "tcma/ali1",
      "tcma/amarna",
      "tcma/assur",
      "tcma/barri",
      "tcma/bazmusian",
      "tcma/billa",
      #"tcma/brak",
      "tcma/chuera",
      "tcma/emar",
      "tcma/fekheriye",
      "tcma/giricano",
      "tcma/hana",
      "tcma/haradum",
      "tcma/hatti",
      "tcma/kalhu",
      "tcma/kartn",
      "tcma/kulishinas",
      "tcma/miscellaneous",
      "tcma/nineveh",
      "tcma/nippur",
      "tcma/nuzi",
      "tcma/qitar",
      "tcma/rimah",
      "tcma/suri",
      #"tcma/taban",
      "tcma/tsa1",
      "tcma/tsh1",
      "tcma/ugarit"

]
#projects = oracc_download(projects,'tcma') #DOWNLOAD REDUNDANCY
projects

['tcma/ali1',
 'tcma/amarna',
 'tcma/assur',
 'tcma/barri',
 'tcma/bazmusian',
 'tcma/billa',
 'tcma/chuera',
 'tcma/emar',
 'tcma/fekheriye',
 'tcma/giricano',
 'tcma/hana',
 'tcma/haradum',
 'tcma/hatti',
 'tcma/kalhu',
 'tcma/kartn',
 'tcma/kulishinas',
 'tcma/miscellaneous',
 'tcma/nineveh',
 'tcma/nippur',
 'tcma/nuzi',
 'tcma/qitar',
 'tcma/rimah',
 'tcma/suri',
 'tcma/tsa1',
 'tcma/tsh1',
 'tcma/ugarit']

Now let's extract the JSON from the downloaded ZIP files and convert the JSON data into a pandas dataframe

In [None]:
lemm_l = []
meta_d = {"label": None, "id_text": None}
dollar_keys = ["extent", "scope", "state"]

df_cat = pd.DataFrame()
used_pnums = []
cat_d = {}
for project in projects:
  print('Project: ' + str(project))
  z = zipfile.ZipFile('jsonzip/' + project.replace('/','-') + '.zip')
  #print(file + " does not exist or is not a proper ZIP file")
  files = z.namelist()     # list of all the files in the ZIP
  files = [name for name in files if "corpusjson" in name and name[-5:] == '.json']
  cat_file = z.read(project + '/catalogue.json').decode('utf-8')
  cat_json = json.loads(cat_file)
  cat_d.update(dict(cat_json['members']))
  #df_cat = pd.concat([df_cat,pd.DataFrame(cat_json['members']).T])                          #that holds all the P, Q, and X numbers.

  for filename in tqdm(files):       #iterate over the file names
      id_text = filename[-12:-5]
      if id_text in used_pnums:
        continue
      else:
        used_pnums.append(id_text)
      meta_d["id_text"] = id_text

      st = z.read(filename).decode('utf-8')         #read and decode the json file of one particular text
      data_json = json.loads(st)                # make it into a json object (essentially a dictionary)
      #print(str(data_json))
      parsejson(data_json)               # and send to the parsejson() function
  z.close()

df_cat = pd.DataFrame(cat_d).T
words_df = pd.DataFrame(lemm_l)
#words_df


Project: tcma/ali1


100%|██████████| 24/24 [00:00<00:00, 135.84it/s]


Project: tcma/amarna


100%|██████████| 3/3 [00:00<00:00, 22.74it/s]


Project: tcma/assur


100%|██████████| 999/999 [00:02<00:00, 356.13it/s]


Project: tcma/barri


100%|██████████| 4/4 [00:00<00:00, 28.19it/s]


Project: tcma/bazmusian


100%|██████████| 7/7 [00:00<00:00, 49.92it/s]


Project: tcma/billa


100%|██████████| 66/66 [00:00<00:00, 274.78it/s]


Project: tcma/chuera


100%|██████████| 97/97 [00:00<00:00, 195.74it/s]


Project: tcma/emar


100%|██████████| 3/3 [00:00<00:00, 23.22it/s]


Project: tcma/fekheriye


100%|██████████| 14/14 [00:00<00:00, 105.98it/s]


Project: tcma/giricano


100%|██████████| 15/15 [00:00<00:00, 95.52it/s]


Project: tcma/hana


100%|██████████| 3/3 [00:00<00:00, 22.88it/s]


Project: tcma/haradum


100%|██████████| 2/2 [00:00<00:00, 464.43it/s]


Project: tcma/hatti


100%|██████████| 11/11 [00:00<00:00, 63.73it/s]


Project: tcma/kalhu


100%|██████████| 1/1 [00:00<00:00,  6.95it/s]


Project: tcma/kartn


100%|██████████| 62/62 [00:00<00:00, 104.56it/s]


Project: tcma/kulishinas


100%|██████████| 10/10 [00:00<00:00, 70.52it/s]


Project: tcma/miscellaneous


100%|██████████| 28/28 [00:00<00:00, 160.32it/s]


Project: tcma/nineveh


100%|██████████| 2/2 [00:00<00:00, 247.46it/s]


Project: tcma/nippur


100%|██████████| 4/4 [00:00<00:00, 29.29it/s]


Project: tcma/nuzi


100%|██████████| 1/1 [00:00<00:00,  7.96it/s]


Project: tcma/qitar


100%|██████████| 1/1 [00:00<00:00, 96.29it/s]


Project: tcma/rimah


100%|██████████| 124/124 [00:00<00:00, 178.21it/s]


Project: tcma/suri


100%|██████████| 1/1 [00:00<00:00, 470.16it/s]


Project: tcma/tsa1


100%|██████████| 17/17 [00:00<00:00, 115.52it/s]


Project: tcma/tsh1


100%|██████████| 254/254 [00:01<00:00, 184.71it/s]


Project: tcma/ugarit


100%|██████████| 5/5 [00:00<00:00, 29.76it/s]


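# 3 Data Structuring

## 3.1 Add new columns into the dataframe
This step adds new columns to the dataframe, such as the line ID (`id_line`), a normalized form (`norm1`) and the lemma (`lemma`). The merge that would add the dossier from the catalogue is currently commented out.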
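As a small illustration of how the `id_line` and `lemma` columns are derived, the sketch below applies the same string operations to a three-row toy dataframe. The rows are modelled on the first rows of the output shown further down, but the frame itself is a stand-in, not the real `words_df`.

```python
import pandas as pd

toy = pd.DataFrame({
    "id_word": ["X001009.3.1", "X001009.3.2", "X001009.4.1"],
    "form": ["ṭup-pi₂", "5", "ša"],
    "cf": ["ṭuppu", "", "ša"],
    "gw": ["tablet", "", "of"],
    "pos": ["N", "n", "DET"],
})

# the line reference is the second component of the word ID
toy["id_line"] = toy["id_word"].str.split(".").str[1].astype(int)

# lemma = citation form + [guide word] + part of speech
toy["lemma"] = toy["cf"] + "[" + toy["gw"] + "]" + toy["pos"]
# unlemmatized words fall back to the written form
toy.loc[toy["cf"] == "", "lemma"] = toy["form"] + "[NA]NA"

print(toy[["id_line", "lemma"]])
```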

In [None]:
words_df = words_df.fillna('')   # replace NaN (Not a Number) with empty string

findreplace = {' ' : '-', ',' : ''}
words_df = words_df.replace({'gw' : findreplace, 'sense' : findreplace}, regex=True)

words_df['id_line'] = [int(wordid.split('.')[1]) for wordid in words_df['id_word']]

words_df["norm1"] = words_df["norm"]
words_df.loc[words_df["norm1"] == "" , 'norm1'] = words_df['form']

words_df['lemma'] = words_df["cf"] + "[" + words_df["gw"] + "]" + words_df["pos"]
words_df.loc[words_df["cf"] == "" , 'lemma'] = words_df['form'] + "[NA]NA"
words_df.loc[words_df["form"] == "", 'lemma'] = ""

#words_df = words_df.merge(df_cat[['dossier_list']],how='left',left_on='id_text',right_index=True)

d = words_df.to_dict(orient='index')

words_df

Unnamed: 0,lang,form,delim,gdl,cf,gw,sense,norm,pos,epos,...,ftype,headform,contrefs,norm0,base,morph,cont,id_line,norm1,lemma
0,akk-x-midass,ṭup-pi₂,,"[{'v': 'ṭup', 'id': 'X001009.3.1.0', 'delim': ...",ṭuppu,tablet,tablet,ṭuppi,N,N,...,,,,,,,,3,ṭuppi,ṭuppu[tablet]N
1,akk-x-midass,5,,"[{'n': 'n', 'sexified': '5(diš)', 'form': '5',...",,,,,n,,...,,,,,,,,3,5,5[NA]NA
2,akk-x-midass,{udu}zi-bu-tu-MEŠ,,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'v...",zibbatu,tail,fat-tailed-sheep,zibbutu,N,N,...,,,,,,,,3,zibbutu,zibbatu[tail]N
3,akk-x-midass,ša,,"[{'v': 'ša', 'id': 'X001009.4.1.0'}]",ša,of,of,ša,DET,DET,...,,,,,,,,4,ša,ša[of]DET
4,akk-x-midass,E₂.GAL{+lim},,"[{'gg': 'logo', 'gdl_type': 'logo', 'group': [...",ēkallu,palace,palace,ēkalle,N,N,...,,,,,,,,4,ēkalle,ēkallu[palace]N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
130846,akk-x-midass,tu-še-ri-id-šu-nu,,"[{'v': 'tu', 'id': 'P282524.15.2.0', 'break': ...",warādu,descend,send-down,tušēridšunu,V,V,...,,,,,,,,15,tušēridšunu,warādu[descend]V
130847,akk-x-midass,x,,"[{'x': 'ellipsis', 'id': 'P282524.15.3.0', 'br...",,,,,u,,...,,,,,,,,15,x,x[NA]NA
130848,,,,,,,,,,,...,,,,,,,,16,,
130849,,,,,,,,,,,...,,,,,,,,17,,


In [None]:
#df cat
df_cat

Unnamed: 0,langs,project,designation,subgenre,id_text,language,museum_no,object_type,period,provenience,...,primary_publication,seal_information,seal_id,citation,date_of_origin,object_preservation,surface_preservation,acquisition_history,findspot_remarks,join_information
X001001,0x01000000,tcma/ali1,Ali 1,List,X001001,Middle Assyrian,IM 82993,tablet,Middle Assyrian,Atmannu (mod. Tell Ali),...,,,,,,,,,,
X001002,0x01000000,tcma/ali1,Ali 2,List,X001002,Middle Assyrian,IM 82979,tablet,Middle Assyrian,Atmannu (mod. Tell Ali),...,,,,,,,,,,
X001003,0x01000000,tcma/ali1,Ali 3,List,X001003,Middle Assyrian,IM 82902,tablet,Middle Assyrian,Atmannu (mod. Tell Ali),...,,,,,,,,,,
X001004,0x01000000,tcma/ali1,Ali 4,List,X001004,Middle Assyrian,IM 82987,tablet,Middle Assyrian,Atmannu (mod. Tell Ali),...,,,,,,,,,,
X001005,0x01000000,tcma/ali1,Ali 5,List,X001005,Middle Assyrian,IM 82971,tablet,Middle Assyrian,Atmannu (mod. Tell Ali),...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
P282522,0x01000000,tcma/ugarit,RS 18.0054 a,Diplomatic letter,P282522,Middle Assyrian,,tablet,Middle Assyrian,Ugarit (mod. Ras Shamra),...,,,,,,,,,,
P282523,0x01000000,tcma/ugarit,RS 18.0268,Diplomatic letter,P282523,Middle Assyrian,,tablet,Middle Assyrian,Ugarit (mod. Ras Shamra),...,,,,,,,,,,
P282524,0x01000000,tcma/ugarit,RS 19.0122,Diplomatic letter,P282524,Middle Assyrian,,tablet,Middle Assyrian,Ugarit (mod. Ras Shamra),...,,,,,,,,,,
P347884,0x01000000,tcma/ugarit,RS 34.0165,Diplomatic letter,P347884,Middle Babylonian,,tablet,Middle Assyrian,Ugarit (mod. Ras Shamra),...,,,,,,,,,,


## This notebook needs to be changed for TCMA from here onward (aa)

### NOTE (wm): brak and taban do not work into the dataframe.

1. Decide what to do with PNs in a given text 
* link to each other (co-mention network)
  1. For each PN in 'epos' that is not yn in 'ftype', add the 'id_text' + 'id_word' + 'norm1' + 'lemma' into a new data frame (df_nodelist)
  2. For each PN in 'epos' with the same 'id_text' add each 'id_word' to one of two columns: 'source', 'target' in a new data frame (df_edgelist).

* consider linking all related PNs (kinship network) in a subsequent step
2. Consider whether PNs should be merged, and if so, how to address homonyms
3. Adding metadata? For example, in the case of the NA letters, PNs were tagged with 'sender' or 'recipient' for a directed network graph.
4. Adding date info...



## Node List
* Include RN (royal names) and DN (divine names) in addition to PN
* Remove the separators from the Id (formerly id_word); just remove the periods
* Keep id_text
* Keep norm1 as Label
* lemma: no change


In [None]:
# changing the frame to co mention - win 
df_nodelist_old = words_df[( (words_df.epos == "PN") | (words_df.epos == "DN") | (words_df.epos == "RN")) & (words_df.ftype != "yn")]

""" add the 'id_text' +'id_word' + 'norm1' + 'lemma' into a new data frame (df_nodelist)
add each 'id_word' to one of two columns: 'source', 'target' in a new data frame (df_edgelist)."""
size = df_nodelist_old.shape[0]
print("How many rows: ",size, "\nHow many columns: ", df_nodelist_old.shape[1])


How many rows:  12415 
How many columns:  26


In [None]:
# keeping only certain columns

drop = ['lang', 'form', 'delim', 'gdl', 'cf', 'gw' ,'sense', 'norm', 'pos' ,  
        'label', 'extent', 'scope', 'state', 'ftype', 'headform', 'contrefs', 
        'norm0', 'base', 'morph' ,'cont' ,'id_line']

df_nodelist = df_nodelist_old.drop(drop,axis = 1)
df_nodelist = df_nodelist.copy()

df_nodelist.insert(5, "ancient_recipient", "")
df_nodelist.insert(5, "ancient_author", "")
df_nodelist.insert(5, "occurencebracket", "")

df_nodelist.rename(columns = {'id_word':'Id'}, inplace = True)
df_nodelist.rename(columns = {'norm1':'Label'}, inplace = True)

df_nodelist['occurencebracket'] = df_nodelist['lemma'].str.extract(r"\[([0-9 _]+)\]")

df_nodelist

Unnamed: 0,epos,Id,id_text,Label,lemma,occurencebracket,ancient_author,ancient_recipient
14,PN,X001009.8.3,X001009,Ekuza,Ekuza[1]PN,1,,
16,PN,X001009.10.1,X001009,Iddin-Marduk,Iddin-Marduk[1]PN,1,,
31,PN,X001014.4.3,X001014,Ekuza,Ekuza[1]PN,1,,
35,PN,X001014.7.1,X001014,Iddin-Marduk,Iddin-Marduk[1]PN,1,,
63,PN,X001001.9.2,X001001,Takbaru,Takbaru[1]PN,1,,
...,...,...,...,...,...,...,...,...
130554,PN,P347884.53.4,P347884,Ziti,Ziti[1]PN,1,,
130621,DN,P347884.63.9,P347884,Šamaš,Šamaš[1]DN,1,,
130625,DN,P347884.64.3,P347884,Šamaš,Šamaš[1]DN,1,,
130653,DN,P347884.68.4,P347884,Šamaš,Šamaš[1]DN,1,,


In [None]:
#dedup
df_nodelist = df_nodelist.drop_duplicates(subset=['id_text', 'lemma'], keep='first')

In [None]:
nodesize = df_nodelist.shape[0]
catsizeold = df_cat.shape[0]

print("nodesize: ",nodesize)
print("catsizeold: ",catsizeold)



nodesize:  10450
catsizeold:  1779


## Meeting with Dr. Anderson, Oct 4
--
When making the node list we want to include the attributes for either ancient author or ancient recipient if the name corresponds

If the id_text is the same and the ancient author in the data frame df_cat equals the Label of a row in the node list:

- Add the ancient_author attribute for that node to our node list
- Two columns in node_list, ancient_author and ancient_recipient, with the text id as a string

Include designation, subgenre, period, provenience, archive from df_cat in the node list
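One way to attach the catalogue fields to every node of the same text is a single left merge on `id_text`. This is a sketch on invented toy data, not the loop used in the cells below; the names "Author A", "Recipient B" and the helper column `is_author` are hypothetical. It also assumes the catalogue is indexed by text ID, as `df_cat` is.

```python
import pandas as pd

# Toy node list and catalogue (invented values)
nodes = pd.DataFrame({
    "Id": ["P00000111", "P00000121", "X00000211"],
    "id_text": ["P000001", "P000001", "X000002"],
    "Label": ["Author A", "Recipient B", "Name C"],
})
cat = pd.DataFrame(
    {
        "ancient_author": ["Author A", ""],
        "ancient_recipient": ["Recipient B", ""],
        "subgenre": ["Letter", "List"],
    },
    index=["P000001", "X000002"],
)

# copy the catalogue fields onto every node of the same text
meta = cat[["ancient_author", "ancient_recipient", "subgenre"]]
nodes = nodes.merge(meta, how="left", left_on="id_text", right_index=True)

# flag the node whose Label matches the author of its text (exact string match only)
nodes["is_author"] = nodes["Label"] == nodes["ancient_author"]
print(nodes)
```

An exact string comparison will miss names that are spelled differently in the catalogue and in the text, so a real matching step would likely need some normalization first.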


## Meeting Oct 11: remember to download again for the updates

- check the nodelist author/recipient again
- gephi type
- two different edgelists
- long term: additional markers around the name to avoid co-mention
- next steps: limu automated classification of the yn pn

## meeting oct 18
- Directed edge between author and recipient
- 1 = Undirected between panumena to both
- Column called type “directed” or “undirected”
- Delete repeated names dedup function
- In the nodelist too

BREAK

- the two tags are useful
- In the lemma column take the stuff from the bracket
- Make a column that has that number num
PN or etc attribute

In [None]:
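# Flag catalogue entries without a usable author (missing, non-string, or empty).
# Assigning through .loc avoids the chained assignment df_cat.iloc[i][...] = ...,
# which pandas does not guarantee to write back to df_cat.
has_author = df_cat["ancient_author"].apply(lambda a: isinstance(a, str) and len(a) > 0)
df_cat.loc[~has_author, "ancient_author"] = "NaN"

# .copy() so that later edits do not operate on a slice of df_cat
df_cat_filter = df_cat[has_author].copy()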

catsize = df_cat_filter.shape[0]

print("catsize: ",catsize)

df_cat_filter.loc[:,"ancient_author"]

catsize:  140


P270976      Aššur-uballiṭ I
P271024      Aššur-uballiṭ I
P281732      Bābu-aha-iddina
P281733      Bābu-aha-iddina
P281734      Bābu-aha-iddina
                 ...        
P282667    Tukultī-Ninurta I
P282669      Salmānu-mušabši
P531248        Aššur-tappūtī
P347884    Tukultī-Ninurta I
P503260             Bēlu-būr
Name: ancient_author, Length: 140, dtype: object

In [None]:
# Remove the period separators from every Id (e.g. X001009.8.3 -> X00100983)
df_nodelist["Id"] = df_nodelist["Id"].str.replace(".", "", regex=False)

df_nodelist

Unnamed: 0,epos,Id,id_text,Label,lemma,occurencebracket,ancient_author,ancient_recipient
14,PN,X00100983,X001009,Ekuza,Ekuza[1]PN,1,,
16,PN,X001009101,X001009,Iddin-Marduk,Iddin-Marduk[1]PN,1,,
31,PN,X00101443,X001014,Ekuza,Ekuza[1]PN,1,,
35,PN,X00101471,X001014,Iddin-Marduk,Iddin-Marduk[1]PN,1,,
63,PN,X00100192,X001001,Takbaru,Takbaru[1]PN,1,,
...,...,...,...,...,...,...,...,...
130290,RN,P34788442,P347884,Salmanu-ašared,Salmanu-ašared[Shalmaneser-I]RN,,,
130327,RN,P347884152,P347884,Tudhuliya,Tudhaliya[Tudhaliya-IV]RN,,,
130443,DN,P347884342,P347884,Adad,Adad[1]DN,1,,
130445,DN,P347884344,P347884,Šamaš,Šamaš[1]DN,1,,


In [None]:
#df nodelist matching with cat through lists

for i in range(catsize):
  df_nodelist.loc[df_nodelist["id_text"] == str(df_cat_filter.iat[i,4]),"ancient_author"] = df_cat_filter.iloc[i]["ancient_author"]
  df_nodelist.loc[df_nodelist["id_text"] == str(df_cat_filter.iat[i,4]),"ancient_recipient"] = df_cat_filter.iloc[i]["ancient_recipient"]



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value, self.name)


In [None]:
#designation, subgenre, period, provenience, archive matching
for j in range(df_cat.shape[0]):
  df_nodelist.loc[df_nodelist["id_text"] == str(df_cat.iat[j,4]),"designation"] = df_cat.iloc[j]["designation"]
  df_nodelist.loc[df_nodelist["id_text"] == str(df_cat.iat[j,4]),"subgenre"] = df_cat.iloc[j]["subgenre"]
  df_nodelist.loc[df_nodelist["id_text"] == str(df_cat.iat[j,4]),"period"] = df_cat.iloc[j]["period"]
  df_nodelist.loc[df_nodelist["id_text"] == str(df_cat.iat[j,4]),"provenience"] = df_cat.iloc[j]["provenience"]
  df_nodelist.loc[df_nodelist["id_text"] == str(df_cat.iat[j,4]),"archive"] = df_cat.iloc[j]["archive"]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


In [None]:
df_nodelist

Unnamed: 0,epos,Id,id_text,Label,lemma,occurencebracket,ancient_author,ancient_recipient,designation,subgenre,period,provenience,archive
14,PN,X00100983,X001009,Ekuza,Ekuza[1]PN,1,,,Ali 9,Envelope(?),Middle Assyrian,Atmannu (mod. Tell Ali),Flock-master’s archive
16,PN,X001009101,X001009,Iddin-Marduk,Iddin-Marduk[1]PN,1,,,Ali 9,Envelope(?),Middle Assyrian,Atmannu (mod. Tell Ali),Flock-master’s archive
31,PN,X00101443,X001014,Ekuza,Ekuza[1]PN,1,,,Ali 14,Document,Middle Assyrian,Atmannu (mod. Tell Ali),Flock-master’s archive
35,PN,X00101471,X001014,Iddin-Marduk,Iddin-Marduk[1]PN,1,,,Ali 14,Document,Middle Assyrian,Atmannu (mod. Tell Ali),Flock-master’s archive
63,PN,X00100192,X001001,Takbaru,Takbaru[1]PN,1,,,Ali 1,List,Middle Assyrian,Atmannu (mod. Tell Ali),Flock-master’s archive
...,...,...,...,...,...,...,...,...,...,...,...,...,...
130290,RN,P34788442,P347884,Salmanu-ašared,Salmanu-ašared[Shalmaneser-I]RN,,Tukultī-Ninurta I,šar Ugarit,RS 34.0165,Diplomatic letter,Middle Assyrian,Ugarit (mod. Ras Shamra),Urtenu archive
130327,RN,P347884152,P347884,Tudhuliya,Tudhaliya[Tudhaliya-IV]RN,,Tukultī-Ninurta I,šar Ugarit,RS 34.0165,Diplomatic letter,Middle Assyrian,Ugarit (mod. Ras Shamra),Urtenu archive
130443,DN,P347884342,P347884,Adad,Adad[1]DN,1,Tukultī-Ninurta I,šar Ugarit,RS 34.0165,Diplomatic letter,Middle Assyrian,Ugarit (mod. Ras Shamra),Urtenu archive
130445,DN,P347884344,P347884,Šamaš,Šamaš[1]DN,1,Tukultī-Ninurta I,šar Ugarit,RS 34.0165,Diplomatic letter,Middle Assyrian,Ugarit (mod. Ras Shamra),Urtenu archive


In [None]:
#saving the df node_list
node = pd.DataFrame(df_nodelist) 

In [None]:
df = pd.DataFrame(df_nodelist) 
    
# saving the dataframe 
f = "result_df_nodelist_dedup.csv"

with open(folder+f, 'w', encoding = 'utf-8-sig') as f:
  df.to_csv(f) 

## Oct 20: do this for every ORACC project?
Completed dedup, down to around 10k entries. Also extracted the bracket number (if present) and the epos (PN, DN, etc.).

nodedir discussion, some names are not the same. take substring?

1. reproduce for the other datasets
2. geographic location
3. n grams



In [None]:
#filter nodelist for directed edges
df_nodedir = df_nodelist[(df_nodelist.ancient_author != "")]
# we want to save NODE DIR 
df = pd.DataFrame(df_nodedir) 
    
# saving the dataframe 
f = "result_df_nodedirtest.csv"

with open(folder+f, 'w', encoding = 'utf-8-sig') as f:
  df.to_csv(f) 

## Edge List

For the edge list, keep Id. The Source and Target columns hold the Id values without the periods.

Each edge gets a column called Weight (the value would be 1.0); the ids go into Source/Target.

There will be new rows, so this is a different (and bigger) data frame.

Edgelist note
Weight = 1
Note: make it runnable
Change runtime
Changed if statement for edgelist

ancient_author	ancient_recipient
df_cat note

--

If the node has this attribute, ancient_author, make them the source for every node in that text.

If the node in the same text has ancient recipient then make the weight of the edge = 2, else make edge weight 1
Don’t want two edges between these
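The rule in these notes can be sketched as follows. This is only one reading of the notes, on invented toy data: the function `author_edges()` is hypothetical, and it identifies the author and recipient nodes by an exact match between `Label` and the catalogue fields, which the real names may not satisfy.

```python
import pandas as pd

def author_edges(nodes):
    """One directed edge from the author node to every other node of the same text."""
    rows = []
    for id_text, group in nodes.groupby("id_text", sort=False):
        authors = group[group["Label"] == group["ancient_author"]]
        for _, author in authors.iterrows():
            for _, node in group.iterrows():
                if node["Id"] == author["Id"]:
                    continue                      # no self-loops
                # weight 2 if the target is the recipient, otherwise 1
                weight = 2.0 if node["Label"] == node["ancient_recipient"] else 1.0
                rows.append({"Source": author["Id"], "Target": node["Id"],
                             "id_text": id_text, "Weight": weight, "Type": "directed"})
    return pd.DataFrame(rows, columns=["Source", "Target", "id_text", "Weight", "Type"])


# Toy node list (invented values)
toy_nodes = pd.DataFrame({
    "Id": ["T00000111", "T00000121", "T00000131", "T00000211", "T00000221"],
    "id_text": ["T000001", "T000001", "T000001", "T000002", "T000002"],
    "Label": ["A", "B", "C", "D", "E"],
    "ancient_author": ["A", "A", "A", "", ""],
    "ancient_recipient": ["B", "B", "B", "", ""],
})

edges = author_edges(toy_nodes)
print(edges)
```

Because only the author is ever a source, each pair of nodes receives at most one edge, which is in line with the last remark above.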



In [None]:
#list of unique id text
textlist = df_nodelist.id_text.unique()
print("HOW MANY DIFFERENT TEXTS:",len(textlist))

HOW MANY DIFFERENT TEXTS: 1540


In [None]:
#create frame for edgelist
import timeit
import pandas as pd 
d = {'Id': [], 'id_text': [], 'weight': [], 'Source': [], 'Target': [], 'Type': []}
df_edgelist = pd.DataFrame(data=d)
df_edgelist

Unnamed: 0,Id,id_text,weight,Source,Target,Type


In [None]:
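start = 0
end = len(textlist)   # 1540 texts in this run; lower this to process the texts in chunks

rows = []   # collect the edges in a list; appending to a DataFrame row by row is very slow
for k in range(start,end):
  match = df_nodelist.loc[df_nodelist['id_text'] == textlist[k]]
  ids = match["Id"].tolist()
  length = len(ids)
  counteredge = 0
  for i in range(length):
    for j in range(length):
      if i != j:
        # one row per ordered pair, so every co-mention is listed twice (A-B and B-A)
        rows.append([ids[i], textlist[k], 1.0, ids[i], ids[j], "undirected"])
        counteredge += 1

  print("Current progress: ",np.round((k-start)/(end-start)*100,2),'%', "----> Iteration:",k)
  print("NODES:", length, "EDGES: ", counteredge)
  print("==========")

# build the edge list in one step, with the same columns as the empty frame created above
df_edgelist = pd.DataFrame(rows, columns=['Id', 'id_text', 'weight', 'Source', 'Target', 'Type'])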
      
  

Current progress:  0.0 % ----> Iteration: 0
NODES: 2 EDGES:  2
Current progress:  0.06 % ----> Iteration: 1
NODES: 2 EDGES:  2
Current progress:  0.13 % ----> Iteration: 2
NODES: 1 EDGES:  0
Current progress:  0.19 % ----> Iteration: 3
NODES: 2 EDGES:  2
Current progress:  0.26 % ----> Iteration: 4
NODES: 5 EDGES:  20
Current progress:  0.32 % ----> Iteration: 5
NODES: 4 EDGES:  12
Current progress:  0.39 % ----> Iteration: 6
NODES: 1 EDGES:  0
Current progress:  0.45 % ----> Iteration: 7
NODES: 3 EDGES:  6
Current progress:  0.52 % ----> Iteration: 8
NODES: 7 EDGES:  42
Current progress:  0.58 % ----> Iteration: 9
NODES: 6 EDGES:  30
Current progress:  0.65 % ----> Iteration: 10
NODES: 1 EDGES:  0
Current progress:  0.71 % ----> Iteration: 11
NODES: 2 EDGES:  2
Current progress:  0.78 % ----> Iteration: 12
NODES: 4 EDGES:  12
Current progress:  0.84 % ----> Iteration: 13
NODES: 2 EDGES:  2
Current progress:  0.91 % ----> Iteration: 14
NODES: 2 EDGES:  2
Current progress:  0.97 % ---->
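
The edge counts in this output follow from the pairing rule in the loop: a text with n nodes yields n(n-1) rows, because every ordered pair of distinct nodes is written once. A small check with `itertools` reproduces those counts (the helper names are hypothetical). It also shows that keeping each pair only once would halve the number of rows, which may be worth considering since the rows are labelled "undirected".

```python
from itertools import combinations, permutations


def ordered_pairs(node_ids):
    """Every (source, target) pair of distinct positions, as in the nested loop."""
    return list(permutations(node_ids, 2))


def unordered_pairs(node_ids):
    """Each pair once, ignoring direction."""
    return list(combinations(node_ids, 2))


for n in (1, 2, 5, 7):
    ids = list(range(n))
    print("NODES:", n,
          "ordered pairs:", len(ordered_pairs(ids)),
          "unordered pairs:", len(unordered_pairs(ids)))
```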

In [None]:
# Save the edge list. The full frame is too large to build in one run,
# so process and save it in batches (for example 1000 texts at a time)
# by adjusting start and end above.
out_name = "result"+str(start)+"_"+str(end)+"result_df_edgelist.csv"

df_edgelist.to_csv(folder+out_name, encoding = 'utf-8-sig')


# Original Continuation of Automated Generation

# 4 Letter Metadata
Here we extract the metadata from the letters, namely the author, recipient and the place from where the letter was sent. We can use the catalogue data to get this information instead of searching through the text data. Even so, we still need to map the sender information from catalogue data (not yet done), which is in normalized form, onto the lemmatized forms of those same names as they occur in the text data. The JSON format of the catalogue is transformed into a pandas dataframe (df_cat) and a python dictionary (d_cat) with the P numbers as the indices.

If the author and/or recipient are only partially preserved in the text (or not at all), this is indicated with square brackets in the catalogue fields `ancient_author` and `ancient_recipient`. These square brackets (as well as round brackets and question marks) are removed, so that "the king" in the catalogue of one letter and "the k\[ing\]" or "(the king)" or "the king?" in another become the same recipient. If the author or recipient is "unknown", the entry is removed, since it is unlikely that all "unknown"s represent the same person.

In the current data set the catalogue is much larger than the set of texts that are included. Since some of the nodes and edges are derived directly from the catalogue, the irrelevant catalogue entries need to be removed from `df_cat`.

In [None]:
cat_json = json.loads(cat_file)
df_cat = pd.DataFrame(cat_json['members']).T
"""CHANGED TO FIT DATA, -win

df_cat["ancient_author"] = df_cat["ancient_author"].str.replace(r'\[|\]|\(|\)|\?', '', regex=True)
df_cat["ancient_recipient"] = df_cat["ancient_recipient"].str.replace(r'\[|\]|\(|\)|\?', '', regex=True)


df_cat["ancient_author"] = df_cat["ancient_author"].str.replace("’", "ʾ", regex = True)
df_cat["ancient_recipient"] = df_cat["ancient_recipient"].str.replace("’’", "ʾ", regex = True)


df_cat = df_cat[['ancient_author','ancient_recipient','provenience','dossier_list','title']]
df_cat = df_cat[df_cat['ancient_author'] != "unknown"]
df_cat = df_cat[df_cat['ancient_recipient'] != "unknown"]
df_cat = df_cat.fillna('')
df_cat = df_cat[df_cat.index.isin(used_pnums)]
"""
df_cat
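
The cleaning steps described in this section (strip brackets and question marks, drop "unknown" correspondents, keep only the texts in use) can be illustrated on a toy catalogue in plain Python. This is a minimal sketch, assuming a dictionary keyed by P number; the names `clean_name` and `clean_catalogue` and the toy entries are invented for the example and are not part of the notebook.

```python
import re

# characters that mark damaged or uncertain readings in the catalogue
BRACKETS = re.compile(r"[\[\]()?]")


def clean_name(name):
    """Strip square brackets, round brackets and question marks from a name."""
    return BRACKETS.sub("", name).strip()


def clean_catalogue(catalogue, used_pnums):
    """Return cleaned entries for the texts in use, dropping unknown correspondents."""
    cleaned = {}
    for pnum, entry in catalogue.items():
        if pnum not in used_pnums:
            continue  # catalogue entry for a text outside the data set
        author = clean_name(entry.get("ancient_author", ""))
        recipient = clean_name(entry.get("ancient_recipient", ""))
        if "unknown" in (author, recipient):
            continue  # separate "unknown"s need not be the same person
        cleaned[pnum] = {"ancient_author": author, "ancient_recipient": recipient}
    return cleaned


# toy catalogue (P numbers invented for illustration)
toy_catalogue = {
    "P000001": {"ancient_author": "the k[ing]", "ancient_recipient": "(the king)"},
    "P000002": {"ancient_author": "the king?", "ancient_recipient": "unknown"},
    "P000003": {"ancient_author": "the king", "ancient_recipient": "the king"},
}
toy_cleaned = clean_catalogue(toy_catalogue, used_pnums={"P000001", "P000002"})
print(toy_cleaned)
```

Only the first entry survives: the second has an unknown recipient and the third is not among the texts in use.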

Next, transform the names found in the catalogue to lemma format, compatible with the Proper Noun format found in the text data. In order to do so, we create a dictionary of all Proper Nouns in `words_df` with the Citation Form as key and the Lemma as value. This dictionary is used to replace the bare names (such as Aššur-reṣuwa) with the full lemma (Aššur-reṣuwa\[high-ranking-intelligence-agent\]PN). 

> Note 1: If the text data have multiple variants for the Guide Word of a PN (because of namesakes and/or because of inconsistencies in the data set), this approach may have unexpected results. An example is Adad-isse'a who is represented as Adad-isse'a\[governor of Mazamu'a\]PN and as Adad-isse'a\[military-officer-active-in-the-west-possibly-a-governor\]PN. The dictionary will recognize only one key Adad-isse'a.

> Note 2: If the catalogue mentions more than one ancient author or ancient recipient, this is not properly dealt with at the moment.

In [None]:
ProperNouns_df = words_df[words_df["pos"].str.contains(".N$")]
ProperNouns_d = dict(zip(ProperNouns_df['cf'], ProperNouns_df['lemma']))
"""

df_cat['ancient_author'] = df_cat['ancient_author'].replace(ProperNouns_d, regex=False) 


  # regex=False ensures that only full matches are replaced. So Aššur in Aššur-reṣuwa is not replaced by Aššur[1]DN.
df_cat['ancient_recipient'] = df_cat['ancient_recipient'].replace(ProperNouns_d, regex=False) 

"""
d_cat = df_cat.to_dict(orient='index')

Inspect the results of the replacement action.

In [None]:
df_cat
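
The limitation described in Note 1 can be made concrete with a few (Citation Form, Lemma) pairs. The sketch below uses the names quoted in this section as toy input; the variable names are hypothetical. It shows that a dictionary built from such pairs keeps only the last lemma per Citation Form, and how the ambiguous names could be listed for manual review.

```python
from collections import defaultdict

# toy (citation form, lemma) pairs, using the examples quoted in the text
toy_pairs = [
    ("Adad-isse'a", "Adad-isse'a[governor of Mazamu'a]PN"),
    ("Adad-isse'a", "Adad-isse'a[military-officer-active-in-the-west-possibly-a-governor]PN"),
    ("Aššur-reṣuwa", "Aššur-reṣuwa[high-ranking-intelligence-agent]PN"),
]

# dict(...) keeps only the last lemma seen for each citation form
lookup = dict(toy_pairs)

# collect every lemma per citation form to find the ambiguous names
variants = defaultdict(set)
for cf, lemma in toy_pairs:
    variants[cf].add(lemma)
ambiguous = {cf: sorted(lemmas) for cf, lemmas in variants.items() if len(lemmas) > 1}

print(lookup["Adad-isse'a"])
print(ambiguous)
```

A list like `ambiguous` could be checked by hand before the replacement is applied to the catalogue names.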

# 5 Building the Network
In this section we build the nodes and edges of the network with two relational types. The first type consists of directed edges from the sender to the recipient. The information for these edges is currently collected from the catalogue data. (At some later point, this information will be discovered in the letters themselves.)

The second type of edge involves the entities that are parsed from the letters themselves. These entities are connected to the sender of the letter because in a sense they are "informing" the sender.

We may not always want to include all the entities found in the letters. Therefore, there are two modes we can set according to our specifications.

Set the mode to one of two options:
1. pnonly - only look for proper names in the letters to be nodes in the network (ignoring the patterns from the rank parser)
2. entities - search for all people and place entities in the letters as found by the rank parser

In [None]:
mode = 'pnonly'  # alternative: 'entities'

If we select the "entities" mode, we must instantiate a class from the rank_parser4 module with the appropriate pattern formulae for our corpus (here called "corpus-saao"). This will find the entities in the text that match those patterns.

In [None]:
if mode == 'entities':
  saao = rp.Parses('corpus-saao')

## 5.1 Generate Edges List

This function takes the sender/recipient data from the catalogue and converts it into directed edges for the network.

In [None]:
def process_author_recips(tl,d_cat):
  l = []
  for text_id in tl:
    author = d_cat.get(text_id,{}).get('ancient_author','AUTH')
    author_id = text_id + '.AUTH.ID'
    recip = d_cat.get(text_id,{}).get('ancient_recipient','RECIP')
    recip_id = text_id + '.RECIP.ID'
    l.append({
        'Source': author_id,
        'Target': recip_id,
        'Type': 'Directed',
        'edge_type': 'letter',
        'Weight': 1.0,
        'source_lems': author,
        'target_lems': recip,
        'text_id': text_id        
    })
  df_auths = pd.DataFrame(l)
  return df_auths
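
To see what a single row produced by `process_author_recips` contains, the same logic can be restated with plain dictionaries. This is a sketch for illustration only: the helper name `author_recipient_edge` and the toy catalogue are assumptions. It shows in particular that a text missing from the catalogue gets the placeholder lemmas 'AUTH' and 'RECIP', which are filtered out of the edge list at a later stage.

```python
def author_recipient_edge(text_id, d_cat):
    """Plain-dict version of one row built by process_author_recips."""
    entry = d_cat.get(text_id, {})  # empty dict if the text is not in the catalogue
    return {
        "Source": text_id + ".AUTH.ID",
        "Target": text_id + ".RECIP.ID",
        "Type": "Directed",
        "edge_type": "letter",
        "Weight": 1.0,
        "source_lems": entry.get("ancient_author", "AUTH"),
        "target_lems": entry.get("ancient_recipient", "RECIP"),
        "text_id": text_id,
    }


# toy catalogue (identifiers and names invented for illustration)
toy_d_cat = {
    "P000001": {"ancient_author": "sender-name", "ancient_recipient": "recipient-name"}
}
known = author_recipient_edge("P000001", toy_d_cat)
missing = author_recipient_edge("P000002", toy_d_cat)
print(known["source_lems"], "->", known["target_lems"])
print(missing["source_lems"], "->", missing["target_lems"])
```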

This method takes the information from the parsed entities found in a letter and converts them into edges for the network that are directed toward the sender of that letter. This function is not used if mode = pnonly.

In [None]:
def process_text_entities(infos,text_id,dossier,mode='pnonly'):
    l = []

    author = d_cat.get(text_id,{}).get('ancient_author','AUTH')
    author_id = text_id + '.AUTH.ID'
    recip = d_cat.get(text_id,{}).get('ancient_recipient','RECIP')
    recip_id = text_id + '.RECIP.ID'
    l.append({
        'Source': author_id,
        'Target': recip_id,
        'Type': 'Directed',
        #'Label': 'info-flow',
        'Weight': 1.0,
        'source_lems': author,
        'target_lems': recip,
        'text_id': text_id,
        'dossier_list': dossier        
    })
    if mode == 'entities':
      i = 0
      if not re.search(r'[\(\[]',author):
          i = i + 1
      if not re.search(r'[\(\[]',recip):
          i = i + 2
      while i < len(infos):
          info = infos[i]
          lemmas = [x['lemma'] for x in info]
          word_ids = [x['id_word'] for x in info]
          #print(str(lemmas))
          #if lemmas == ['šarru[king]N'] or lemmas == ['šarru[king]N','bēlu[lord]N']:
          #    i = i + 1
          #    continue
          data = {
              'Source': ','.join(word_ids),
              'Target': author_id,
              'Type': 'Directed',
              #'Label': 'info-flow',
              'Weight': 1.0,
              'source_lems': ' '.join(lemmas),
              'target_lems': author,
              'text_id': text_id,
              'dossier_list': dossier
          }
          l.append(data)
          i = i + 1

    df_edges = pd.DataFrame(l)
    return df_edges

This method skips the entity parsing and is used only for finding proper names in the letters. Doing this will simplify the network.

In [None]:
def process_text_pns(df_one,d_cat,text_id):
  author = d_cat.get(text_id,{}).get('ancient_author','AUTH')
  author_id = text_id + '.AUTH.ID'

  df_pns = df_one[df_one['pos'].str.contains('^.N$')]
  df_pns = df_pns[~df_pns['pos'].str.contains('MN$')] # exclude Month names
  l = []
  for i,row in df_pns.iterrows():
    #if not re.search(r'[\(\[]',author) and i == 0:
    #  continue
    l.append({
        'Source': row['id_word'],
        'Target': author_id,
        'Type': 'Directed',
        'edge_type': 'info-flow',
        'Weight': 1.0,
        'source_lems': row['lemma'],
        'target_lems': author,
        'text_id': text_id
    })
  df_edges = pd.DataFrame(l)
  return df_edges

The following code iterates through every letter in our corpus and uses the previous functions to build the edges of the network.

In [None]:
tl = list(set(words_df['id_text']))
col_list = ['lemma','pos','id_word']
df_auths = process_author_recips(tl,d_cat)
edge_frames = []  # one edge frame per text, concatenated once after the loop
for text_id in tqdm(tl,position=0):
    df_one = words_df[words_df['id_text'] == text_id].reset_index()[col_list]
    if mode == 'entities':
      d_one = df_one.to_dict(orient='records')
      d_one = [[x] for x in d_one]
      all_parses_all = rp.fit_all(saao,d_one)
      max_parses_all = [p.max_parses for p in all_parses_all]
      infos = [x[0]['filled']['entity']['forms'] for x in max_parses_all]
      print(text_id)
      print(infos)

      # process_text_entities requires the dossier and defaults to mode='pnonly',
      # so both are passed explicitly here
      dossier = d_cat.get(text_id,{}).get('dossier_list','')
      df_edges = process_text_entities(infos,text_id,dossier,mode=mode)
      edge_frames.append(df_edges)
    elif mode == 'pnonly':
      df_edges = process_text_pns(df_one,d_cat,text_id)
      edge_frames.append(df_edges)
df_edges_all = pd.concat(edge_frames) if edge_frames else pd.DataFrame()


Small modifications to the edge list.

If a Proper Noun (not sender or recipient) appears more than once in a single letter, this results in multiple edges between this PN and the sender of the letter. Such duplicate edges (where source, target and text ID are the same) are removed.

Also removed are edges that involve source/target of the type "AUTH", or "RECIP" (those are unknown) and edges that involve a source or target lemmatized as 'x\[1\]SN' etc.

In [None]:
if mode == 'pnonly':
  df_edges_all = pd.concat([df_edges_all,df_auths])
df_edges_all = df_edges_all.reset_index()
df_edges_all['Id'] = df_edges_all.index
df_edges_all = df_edges_all.drop(['index'],axis=1)
df_edges_all['weight'] = df_edges_all['Weight'].astype('int')
df_edges_all = df_edges_all.drop_duplicates(["source_lems", "target_lems", "text_id"])
df_edges_all = df_edges_all[df_edges_all['source_lems'] != 'AUTH']
df_edges_all = df_edges_all[~df_edges_all['target_lems'].isin(['AUTH', 'RECIP'])]
# raw strings keep the backslashes in the pattern intact
df_edges_all = df_edges_all[~df_edges_all['source_lems'].str.contains(r'^[xX]\[.+\].N$')]
df_edges_all = df_edges_all[~df_edges_all['target_lems'].str.contains(r'^[xX]\[.+\].N$')]
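
The filtering steps above can be checked on a handful of toy edges without pandas. The sketch below is an illustration under assumptions: the helper name `tidy_edges` and the toy lemmas are invented, and the pattern is the one used in the cell above.

```python
import re

# lemma of a broken or illegible name, such as x[1]SN
PLACEHOLDER = re.compile(r"^[xX]\[.+\].N$")


def tidy_edges(edges):
    """Drop duplicate, unknown and illegible edges from a list of edge dicts."""
    seen = set()
    kept = []
    for edge in edges:
        key = (edge["source_lems"], edge["target_lems"], edge["text_id"])
        if key in seen:
            continue  # same source, target and text already recorded
        seen.add(key)
        if edge["source_lems"] == "AUTH" or edge["target_lems"] in ("AUTH", "RECIP"):
            continue  # sender or recipient is unknown
        if PLACEHOLDER.search(edge["source_lems"]) or PLACEHOLDER.search(edge["target_lems"]):
            continue  # name is not preserved
        kept.append(edge)
    return kept


# toy edges (lemmas and text ids invented for illustration)
toy_raw = [
    {"source_lems": "A[1]PN", "target_lems": "B[1]PN", "text_id": "T1"},
    {"source_lems": "A[1]PN", "target_lems": "B[1]PN", "text_id": "T1"},
    {"source_lems": "x[1]SN", "target_lems": "B[1]PN", "text_id": "T1"},
    {"source_lems": "C[1]PN", "target_lems": "AUTH", "text_id": "T2"},
    {"source_lems": "A[1]PN", "target_lems": "B[1]PN", "text_id": "T2"},
]
toy_kept = tidy_edges(toy_raw)
print(toy_kept)
```

The same pair of names in two different texts is kept twice, since only repetitions within one text count as duplicates.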

Merge edges that connect the same source to the same target, and sum their weights.

In [None]:
# 'sum' (the pandas aggregation name) adds up the weights of the merged edges
df_edges_merged = df_edges_all.groupby(
    ['source_lems', 'target_lems', 'edge_type']).agg(
        {'text_id' : list, 'weight' : 'sum'}).reset_index()
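
The effect of this grouping step can be illustrated with plain dictionaries: edges that share source, target and edge type collapse into one record that lists the texts involved and sums the weights. The helper name `merge_edges` and the toy edges are assumptions made for this sketch.

```python
from collections import defaultdict


def merge_edges(edges):
    """Group edges by (source, target, edge_type); list the texts and sum the weights."""
    merged = defaultdict(lambda: {"text_id": [], "weight": 0})
    for edge in edges:
        key = (edge["source_lems"], edge["target_lems"], edge["edge_type"])
        merged[key]["text_id"].append(edge["text_id"])
        merged[key]["weight"] += edge["weight"]
    return dict(merged)


# toy edges (lemmas and text ids invented for illustration)
toy_unmerged = [
    {"source_lems": "A[1]PN", "target_lems": "B[1]PN", "edge_type": "info-flow",
     "text_id": "T1", "weight": 1},
    {"source_lems": "A[1]PN", "target_lems": "B[1]PN", "edge_type": "info-flow",
     "text_id": "T2", "weight": 1},
    {"source_lems": "A[1]PN", "target_lems": "B[1]PN", "edge_type": "letter",
     "text_id": "T3", "weight": 1},
]
toy_merged = merge_edges(toy_unmerged)
for key, value in toy_merged.items():
    print(key, value)
```

Because `edge_type` is part of the key, a letter edge and an info-flow edge between the same two names remain separate records.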

## 5.2 Create initial Directed Graph
NetworkX can create a network directly from the edge list that was created in Pandas. The node list is derived from the edge list. Node attributes need to be added separately.

In [None]:
G = nx.from_pandas_edgelist(df_edges_merged, source = 'source_lems', target = 'target_lems',
                            edge_attr = ['text_id', 'weight', 'edge_type'],
                            create_using = nx.DiGraph())

In [None]:
G.remove_edges_from(nx.selfloop_edges(G))

In [None]:
nodecolors = {'PN' : '#525252', 'DN' :'#00ffff', 'RN' : '#8469e7'}
degree = {name : G.degree[name] for name in G.nodes}
edge_size = {(a, b) : G[a][b]['weight'] for (a,b) in G.edges}
edge_color = {(source, target): 'blue' if G[source][target]['edge_type'] == 'letter' 
                 else 'green' for source, target in G.edges}
"""
eigenvector = nx.eigenvector_centrality(G)

nx.set_node_attributes(G, degree, "degree")
nx.set_node_attributes(G, eigenvector, "eigenvector_centrality")
nx.set_edge_attributes(G, edge_color, "edge_color")
nx.set_edge_attributes(G, edge_size, "edge_size")
"""

## 5.3 Save Results


In [None]:
import pickle
# use a context manager so the pickle file is closed after writing
with open(folder + 'network/tcma-network.p', 'wb') as fh:
  pickle.dump(G, fh)
df_edges_all.to_csv(folder + 'network/tcma2022_edges_info.csv',encoding='utf-8',index=False)
df_nodes.to_csv(folder + 'network/tcma2022_nodes_info.csv',encoding='utf-8',index=False)

# Next Steps

* work with the dates in the texts:
Limmu eponyms are to be linked to their corresponding min / max date (see link: https://docs.google.com/spreadsheets/d/1fS9v1CgpEzw0o0fi-h1JoPPtN88HDuqmLNeYP41zuwc/edit?usp=sharing)
