# Python script for analyzing the EDH dataset
*Created by: Vojtech Kase, Petra Hermankova*


Requirements:
*   Google Colab account 
*   Access to Sciencedata.dk, or alternatively a local copy of the dataset in JSON
*   Basic knowledge of Python (how to run scripts in Python notebooks)



In [0]:
### REQUIREMENTS - imports the libraries (the sddk package is installed at the end of this cell)
import numpy as np
import math
import pandas as pd
import sys
import requests
from urllib.request import urlopen 
from bs4 import BeautifulSoup

import io

# to avoid errors, we sometimes use time.sleep(N) before retrying a request
import time

# the input data typically have a JSON structure
import json
import getpass

import datetime as dt

!pip install --ignore-installed --index-url https://test.pypi.org/simple/ --no-deps sddk ### our own package under construction, always install to have up-to-date version
import sddk

Looking in indexes: https://test.pypi.org/simple/
Collecting sddk
  Downloading https://test-files.pythonhosted.org/packages/b9/98/f9ffbef66a2909b3cf2be3555230dfac6ffff4eb515f5433c1fbee3c9876/sddk-0.0.8-py3-none-any.whl
Installing collected packages: sddk
Successfully installed sddk-0.0.8


## Establishing connection to the Sciencedata.dk: configure session and group URL

In [0]:
### configure session and groupurl
### in the case of "SDAM_root", the group owner is Vojtech with username 648597@au.dk
s, sddk_url = sddk.configure_session_and_url("SDAM_root")   # Vojtech: Which user and password am I supposed to enter here? Vojtech's or mine? If Vojtech's than I don't know his password and will fail. If I enter mine, it does not work :(
    

sciencedata.dk username (format '123456@au.dk'): 648597@au.dk
sciencedata.dk password: ··········
personal connection established
connection with shared folder established with you as its owner
endpoint variable has been configured to: https://sciencedata.dk/files/SDAM_root/


## Connecting to the preprocessed and enriched JSON file / dataframe from sciencedata.dk


In [0]:
### Once the connection has been successfully established, we can load the data from sciencedata.dk into a pandas dataframe
### See the pandas documentation to learn how to navigate dataframes: https://pandas.pydata.org/pandas-docs/version/0.23.4/index.html
EDH_df = pd.DataFrame(s.get(sddk_url + "SDAM_data/EDH/EDH_inscriptions_rich.json").json())
EDH_df.set_index("id", inplace=True) ### index by "id": with your own index column some pandas queries become simpler, e.g. EDH_df.loc["HD000004"] returns one particular inscription
EDH_df.head(5) ### use ".head(5)" to inspect first 5 rows of the dataframe

id,diplomatic_text,literature,trismegistos_uri,findspot_ancient,not_before,type_of_inscription,work_status,edh_geography_uri,not_after,country,province_label,transcription,material,height,width,findspot_modern,depth,commentary,uri,responsible_individual,last_update,language,modern_region,letter_size,type_of_monument,people,year_of_find,findspot,present_location,external_image_uris,religion,fotos,geography,military,social_economic_legal_history,coordinates,text_cleaned,origdate_text,objecttype
HD000001,D M / NONIAE P F OPTATAE / ET C IVLIO ARTEMONI...,"AE 1983, 0192.; M. Annecchino, Puteoli 4/5, 19...",https://www.trismegistos.org/text/251193,"Cumae, bei",71,epitaph,provisional,https://edh-www.adw.uni-heidelberg.de/edh/geog...,130,Italy,Latium et Campania (Regio I),D(is) M(anibus) / Noniae P(ubli) f(iliae) Opta...,"Marmor, geädert / farbig",33 cm,34 cm,"Cuma, bei",2.7 cm,(C): 2. Hälfte 1. - Anfang 2. Jh. - AE; Ende ...,https://edh-www.adw.uni-heidelberg.de/edh/insc...,Feraudi,2014-04-07,Latin,Campania,3.2-2 cm,tabula,"[{'cognomen': 'Optata', 'person_id': '1', 'gen...",,,,,,,,,,"40.8471577,14.0550756",Dis Manibus Noniae Publi filiae Optatae et Cai...,71 AD – 130 AD,"[Tafel, 257]"
HD000002,C SEXTIVS PARIS / QVI VIXIT / ANNIS LXX,"AE 1983, 0080. (A); A. Ferrua, RAL 36, 1981, 1...",https://www.trismegistos.org/text/265631,Roma,51,epitaph,no image,https://edh-www.adw.uni-heidelberg.de/edh/geog...,200,Italy,Roma,C(aius) Sextius Paris / qui vixit / annis LXX,marble: rocks - metamorphic rocks,28 cm,85 cm,Roma,,AE 1983: Breite: 35 cm.,https://edh-www.adw.uni-heidelberg.de/edh/insc...,Feraudi,2014-04-07,Latin,Lazio,4 cm,tabula,"[{'age: years': '70', 'cognomen': 'Paris', 'ge...",1937,"Via Nomentana, S. Alessandro, Kirche",,,,,,,,"41.895466,12.482324",Caius Sextius Paris qui vixit annis LXX ...,51 AD – 200 AD,"[Tafel, 257]"
HD000003,[ ]VMMIO [ ] / [ ]ISENNA[ ] / [ ] XV[ ] / [ ] / [,"AE 1983, 0518. (B); J. González, ZPE 52, 1983,...",https://www.trismegistos.org/text/220675,,131,honorific inscription,provisional,https://edh-www.adw.uni-heidelberg.de/edh/geog...,170,Spain,Baetica,[P(ublio) M]ummio [P(ubli) f(ilio)] / [Gal(eri...,marble: rocks - metamorphic rocks,(37) cm,(34) cm,Tomares,(12) cm,(B): [S]isenna ist falscher Kasus; folgende E...,https://edh-www.adw.uni-heidelberg.de/edh/insc...,Feraudi,2006-08-31,Latin,Sevilla,4.5-3 cm,statue base,"[{'nomen': 'Mummius+', 'cognomen': 'Sisenna+ R...",before 1975,,"Sevilla, Privatbesitz",,,,,,,"37.37281,-6.04589",Publio Mummio Publi filio Galeria Sisennae Rut...,131 AD – 170 AD,"[Statuenbasis, 57]"
HD000004,[ ]AVS[ ]LLA / M PORCI NIGRI SER / DOMINAE VEN...,"AE 1983, 0533. (B); A.U. Stylow, Gerión 1, 198...",https://www.trismegistos.org/text/222102,Ipolcobulcula,151,votive inscription,checked with photo,https://edh-www.adw.uni-heidelberg.de/edh/geog...,200,Spain,Baetica,[---?]AV(?)S(?)[---]L(?)L(?)A / M(arci) Porci ...,limestone: rocks - clastic sediments,(39) cm,27 cm,Carcabuey,18 cm,Material: lokaler grauer Kalkstein. (B): Styl...,https://edh-www.adw.uni-heidelberg.de/edh/insc...,Gräf,2015-03-27,Latin,Córdoba,2.5 cm,altar,"[{'cognomen': '[---]', 'status': 'slaves', 'pe...",before 1979,,"Carcabuey, Grupo Escolar",[http://cil-old.bbaw.de/test06/bilder/datenban...,names of pagan deities,,,,,"37.4442,-4.27471",AVSLLA Marci Porci Nigri serva dominae Veneri ...,151 AD – 200 AD,"[Altar, 29]"
HD000005,[ ] L SVCCESSVS / [ ] L L IRENAEVS / [ ] C L T...,"AE 1983, 0078. (B); A. Ferrua, RAL 36, 1981, 1...",https://www.trismegistos.org/text/265629,Roma,1,epitaph,no image,https://edh-www.adw.uni-heidelberg.de/edh/geog...,200,Italy,Roma,[---] l(ibertus) Successus / [---] L(uci) l(ib...,,,,Roma,,(B): Z. 3: C(ai) l(ibertae) Tyches.,https://edh-www.adw.uni-heidelberg.de/edh/insc...,Feraudi,2010-01-04,Latin,Lazio,,stele,"[{'status': 'freedmen / freedwomen', 'name': '...",,Via Cupa (ehem. Vigna Nardi),,,,,,,,"41.895466,12.482324",libertus Successus Luci libertus Irenaeus C...,1 AD – 200 AD,"[Stele, 250]"
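
The benefit of indexing by `id` can be sketched on a toy dataframe. The three rows below are shortened copies of values from the preview above, and the names `toy_df`, `by_column` and `by_label` are illustrative only. This is a minimal sketch, not part of the original workflow.

```python
import pandas as pd

# Toy stand-in for the EDH dataframe (shortened values from the preview above)
toy_df = pd.DataFrame({
    "id": ["HD000001", "HD000002", "HD000004"],
    "type_of_inscription": ["epitaph", "epitaph", "votive inscription"],
    "province_label": ["Latium et Campania (Regio I)", "Roma", "Baetica"],
})

# Without an index: build a boolean mask on the "id" column, then take the first match
by_column = toy_df[toy_df["id"] == "HD000004"].iloc[0]

# With "id" as the index: a direct label lookup
indexed = toy_df.set_index("id")
by_label = indexed.loc["HD000004"]

print(by_label["type_of_inscription"])
```

Both lookups return the same record; the indexed version is simply shorter to write and read.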


In [0]:
len(EDH_df)

80270

# Working offline (if the connection to Sciencedata.dk fails)
You need to have an offline version of the enriched JSON file.

In [0]:
# for uploading offline files from the local computer (loading may take a few minutes in case of large files)

from google.colab import files
uploaded = files.upload()

In [0]:
EDH_df = pd.read_json("EDH_inscriptions_rich.json") # pandas loads the JSON file into a new dataframe
EDH_df.set_index("id", inplace=True) ### indexing by ID
EDH_df.head(5) ### use ".head(5)" to inspect first 5 rows of the dataframe

In [0]:
# Inspect how many rows and columns we have
EDH_df.shape

## Subsetting the dataset

In [0]:
# Inspect all unique values within "type_of_inscription"
EDH_df["type_of_inscription"].unique()

array(['epitaph', 'honorific inscription', 'votive inscription',
       'defixio', 'owner/artist inscription', 'owner/artist inscription?',
       'mile-/leaguestone', 'acclamation', 'boundary inscription',
       'building/dedicatory inscription', None, 'votive inscription?',
       'military diploma', 'building/dedicatory inscription?', 'epitaph?',
       'honorific inscription?', 'identification inscription',
       'public legal inscription', 'private legal inscription',
       'boundary inscription?', 'label', 'label?', 'list',
       'private legal inscription?', 'calendar',
       'identification inscription?', 'list?', 'seat inscription',
       'elogium', 'assignation inscription', 'seat inscription?',
       'elogium?', 'prayer', 'acclamation?', 'defixio?', 'calendar?',
       'letter', 'mile-/leaguestone?', 'adnuntiatio',
       'public legal inscription?', 'prayer?', 'letter?',
       'assignation inscription?', 'military diploma?'], dtype=object)

In [0]:
# Example of how to subset the dataset, this time based on a specific string at the start of the type of inscription
EDH_miles = EDH_df[EDH_df["type_of_inscription"].str.startswith("mile-/lea", na=False)]
len(EDH_miles) ### shows how many records in the dataset fulfil the condition

1679
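
Why the subset uses `str.startswith(..., na=False)` rather than an exact comparison can be sketched on a toy series. The values are taken from the list of unique inscription types above; the series itself and the names `types`, `mask` and `exact` are illustrative.

```python
import pandas as pd

# Toy column with a missing value and an uncertain "?" variant
types = pd.Series(
    ["epitaph", "mile-/leaguestone", None, "mile-/leaguestone?", "votive inscription"]
)

# na=False turns the missing value into False, so the mask is purely boolean
mask = types.str.startswith("mile-/lea", na=False)
print(mask.tolist())

# An exact comparison would miss the uncertain "mile-/leaguestone?" variant
exact = types == "mile-/leaguestone"
print(int(mask.sum()), int(exact.sum()))
```

The prefix match keeps both the certain and the uncertain milestones, while records without any type are excluded instead of breaking the filter.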

In [0]:
EDH_miles.head(2) # shows the first (2) rows of the dataset

id,diplomatic_text,literature,trismegistos_uri,findspot_ancient,not_before,type_of_inscription,work_status,edh_geography_uri,not_after,country,province_label,transcription,material,height,width,findspot_modern,depth,commentary,uri,responsible_individual,last_update,language,modern_region,letter_size,type_of_monument,people,year_of_find,findspot,present_location,external_image_uris,religion,fotos,geography,military,social_economic_legal_history,coordinates,text_cleaned,origdate_text,objecttype
HD000024,D N / VALENTIN[ ] / VICTORI AC TRIVMPHATORI [ ...,"AE 1983, 0575.; L. Dos Santos - P. Le Roux - A...",https://www.trismegistos.org/text/226605,"Bracara Augusta - Lucus Augusti, inter",364,mile-/leaguestone,provisional,https://edh-www.adw.uni-heidelberg.de/edh/geog...,375,Portugal,Hispania citerior,D(omino) n(ostro) / Valentin[iano] / victori a...,Granit: rocks - magmatic rocks,(107) cm,53 cm,Romarigães,,(B): AE 1980: kleinere Abweichungen in Lesung...,https://edh-www.adw.uni-heidelberg.de/edh/insc...,Feraudi,2013-12-03,Latin,,7-10 cm,mile-/leaguestone,"[{'cognomen': 'Valentinianus+', 'person_id': '...",,,Mus. Pio XII Braga,,,,,,data available,"41.86757,-8.63463",Domino nostro Valentiniano victori ac triumpha...,364 AD – 375 AD,"[Meilen-/Leugenstein, 89]"
HD000177,]AV[ ] / [ ] VIII CON[,"AE 1983, 0572.; L. Dos Santos - P. Le Roux - A...",https://www.trismegistos.org/text/226604,"Bracara Augusta - Lucus Augusti, inter",1,mile-/leaguestone,provisional,https://edh-www.adw.uni-heidelberg.de/edh/geog...,300,Portugal,Hispania citerior,------]AV(?)[---] / [imp(erator)?] VIII con[s(...,Granit: rocks - magmatic rocks,(22) cm,(53) cm,Arcozelo,,Meilenstein der via 19 des Itinerarium Antoni...,https://edh-www.adw.uni-heidelberg.de/edh/insc...,Feraudi,2013-12-05,Latin,Braga,8 cm,mile-/leaguestone,"[{'gender': 'male', 'praenomen': '[-]', 'nomen...",,"Pfarrkirche, sekundär verwendet",Mus. Pio XII Braga,,,,,,,"41.665548,-8.531168",AV imp(erator) VIII con[sul &,1 AD – 300 AD,"[Meilen-/Leugenstein, 89]"


In [0]:
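# how to keep only the dated ones: startswith("") is True for every string, and na=False drops the records without a date
EDH_miles_date = EDH_miles[EDH_miles["origdate_text"].str.startswith("", na=False)]
len(EDH_miles_date) ### how many dated milestones are there?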



1658
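
The trick of matching on an empty prefix can be compared with the more explicit `notna()` on a toy series. This is a sketch under the assumption that `origdate_text` holds either a string or a missing value; the variable names are illustrative.

```python
import pandas as pd

# Toy dating column: two dated records and one undated record
dates = pd.Series(["364 AD – 375 AD", None, "1 AD – 300 AD"])

# Every string starts with the empty string; missing values become False via na=False
via_startswith = dates.str.startswith("", na=False)

# The same selection, stated directly as "value is not missing"
via_notna = dates.notna()

print(via_startswith.tolist(), via_notna.tolist())
```

For a column that contains only strings and missing values the two masks are identical, so `notna()` may be the more readable choice.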

In [0]:
# number of milestones with geolocation (records where "coordinates" is not empty)
len(EDH_miles[EDH_miles["coordinates"].notnull()])
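
In the preview above, `coordinates` appears as a single string such as `40.8471577,14.0550756`. A minimal sketch of turning such strings into numeric columns follows. It assumes the order is latitude first, then longitude (an assumption based on the sample values, not confirmed by the dataset documentation), and the names `coords`, `latitude` and `longitude` are illustrative.

```python
import pandas as pd

# Toy coordinates column with one missing value (sample values from the preview above)
coords = pd.Series(["40.8471577,14.0550756", None, "37.4442,-4.27471"], name="coordinates")

# Split each "a,b" string into two columns; the missing record stays missing
parts = coords.str.split(",", expand=True)

# Convert the text parts to numbers (missing values become NaN)
latitude = pd.to_numeric(parts[0])   # assumed to be latitude
longitude = pd.to_numeric(parts[1])  # assumed to be longitude

print(latitude.tolist(), longitude.tolist())
```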

In [0]:
# selects only the milestones in the province Sardinia
EDH_miles_sardinia = EDH_miles[EDH_miles["province_label"].str.startswith("Sardinia", na=False)]
len(EDH_miles_sardinia)


6

### Saving the subset as CSV file

In [0]:
# If you need to save the subset as a CSV file and download it to your local computer
from google.colab import files
EDH_miles.to_csv('EDH_milestones.csv') 
files.download('EDH_milestones.csv')

In [0]:
# saves the subset as CSV and downloads it to the local computer
from google.colab import files
EDH_miles_sardinia.to_csv('EDH_milestones_sardinia.csv') 
files.download('EDH_milestones_sardinia.csv')

## Inscriptions from one province (Example of Sardinia)

In [0]:
EDH_df["province_label"].unique()

In [0]:
# subset based on the name of the province
EDH_sardinia = EDH_df[EDH_df["province_label"].str.startswith("Sardinia", na=False)]
len(EDH_sardinia) ### how many inscriptions are there?

In [0]:
# saves the subset as CSV and downloads it to the local computer
from google.colab import files
EDH_sardinia.to_csv('EDH_all_sardinia.csv') 
files.download('EDH_all_sardinia.csv')

### Example of Thrace

In [0]:
### to get a smaller dataset: all inscriptions from the province Thracia
EDH_thracia = EDH_df[EDH_df["province_label"].str.startswith("Thracia", na=False)]
len(EDH_thracia) ### how many inscriptions are there?

In [0]:
# saves the subset as CSV and downloads it to the local computer
from google.colab import files
EDH_thracia.to_csv('EDH_all_thracia.csv') 
files.download('EDH_all_thracia.csv')

### Example of Moesia Inferior

In [0]:
### to get a smaller dataset: all inscriptions from the province Moesia inferior
EDH_moesia_inf = EDH_df[EDH_df["province_label"].str.startswith("Moesia inf", na=False)]
len(EDH_moesia_inf) ### how many inscriptions are there?

In [0]:
# saves the subset as CSV and downloads it to the local computer
from google.colab import files
EDH_moesia_inf.to_csv('EDH_all_moesia_inf.csv') 
files.download('EDH_all_moesia_inf.csv')

# Working with one CSV file

Use this section if you prefer to work with one CSV file (containing a subset of all data) instead of the large JSON.

The aim is to find all inscriptions that mention a road, people using the road, or any of the establishments and buildings associated with roads.

In [0]:
# loads CSV and displays first three records to check
Sardinia = pd.read_csv('EDH_all_sardinia.csv', sep=',')
Sardinia.head(3)
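
Note that a CSV written from an indexed dataframe stores the index as an ordinary first column, so a plain `pd.read_csv` brings `id` back as a regular column. The sketch below shows the difference with `index_col`. It uses an in-memory buffer and made-up toy ids instead of the real `EDH_all_sardinia.csv`, so every value in it is illustrative.

```python
import io
import pandas as pd

# Toy dataframe indexed by "id" (ids and values are invented for illustration)
toy = pd.DataFrame(
    {"province_label": ["Sardinia", "Sardinia"], "language": ["Latin", "Greek"]},
    index=pd.Index(["toy_1", "toy_2"], name="id"),
)

# Write to an in-memory buffer instead of a file so the sketch runs anywhere
buffer = io.StringIO()
toy.to_csv(buffer)

# Default read: "id" comes back as an ordinary column with a fresh 0..n-1 index
buffer.seek(0)
plain = pd.read_csv(buffer)

# index_col="id" restores the original index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="id")

print(list(plain.columns), restored.index.name)
```

Either form works for the searches that follow; restoring the index only matters if you want label lookups such as `.loc`.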

In [0]:
# keeps only the inscriptions whose language is exactly one of the values in the list (here 'Latin' or 'Greek')
language = ['Latin', 'Greek']
sardinia_lang = Sardinia.loc[Sardinia['language'].isin(language)]
sardinia_lang.head(2)


In [0]:
# using partial strings to find specific inscriptions, https://stackoverflow.com/questions/11350770/select-by-partial-string-from-a-pandas-dataframe
# example of a one-term search, using a regular expression; na=False skips records without a transcription
Sardinia[Sardinia['transcription'].str.contains(r'viat', na=False)]


In [0]:
# list-based search: finds all inscriptions containing any of the terms in the list roads_vocab
# raw strings (r'...') are needed so that \b is read as a regex word boundary and not as a backspace character
roads_vocab = [r'\bvia\b', r'\bviat', r'\bmansio', r'\bmutatio', r'\bmilia', r'millia', r'\bpassuum', r'\bcaput', r'\bpons', r'\bpont'] # The list still needs tweaking and more testing, plus expanding to Greek (Petra)
Sardinia[Sardinia['transcription'].str.contains('|'.join(roads_vocab), na=False)]
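
A pitfall worth checking in vocabulary searches like this one: in an ordinary Python string `'\b'` is the backspace character, and only in a raw string (`r'\b'`) does it reach the regex engine as a word boundary. The sketch below illustrates the difference on invented toy texts; the variable names are illustrative.

```python
import pandas as pd

plain_pattern = '\bvia\b'   # '\b' is a backspace character here, so this is 5 characters long
raw_pattern = r'\bvia\b'    # r'\b' is the regex word boundary

# Toy texts: a whole word "via", "via" inside another word, and a missing value
texts = pd.Series(["in via Appia", "obviam", None])

# The plain pattern looks for backspace + "via" + backspace and finds nothing
plain_hits = texts.str.contains(plain_pattern, na=False)

# The raw pattern matches "via" only as a whole word
raw_hits = texts.str.contains(raw_pattern, na=False)

# Several terms are combined into one alternation with "|"
combined = '|'.join([r'\bvia\b', r'\bpont'])
combined_hits = texts.str.contains(combined, na=False)

print(plain_hits.tolist(), raw_hits.tolist(), combined_hits.tolist())
```

If a list-based search returns suspiciously few hits, checking for missing `r` prefixes is a good first step.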


### List based search for an entire JSON dataset

In [0]:
# for uploading offline files from the local computer (loading will take a few minutes in case of large files)

from google.colab import files
uploaded = files.upload()

In [0]:
EDH_df = pd.read_json("EDH_inscriptions_rich.json") # pandas loads the JSON file into a new dataframe
EDH_df.set_index("id", inplace=True) ### index is the "ID"
EDH_df.head(3)

In [0]:
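# list-based search: finds all inscriptions containing any of the terms in the list
# raw strings (r'...') are needed so that \b is read as a regex word boundary and not as a backspace character
roads_vocab = [r'\bvia\b', r'\bviat', r'\bmansio', r'\bmutatio', r'\bmilia', r'millia', r'\bpassuum', r'\bcaput', r'\bpons', r'\bpont']
EDH_roads_vocab = EDH_df[EDH_df['transcription'].str.contains('|'.join(roads_vocab), na=False)]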
len(EDH_roads_vocab)

In [0]:
# saves the subset as CSV and downloads it to the local computer
from google.colab import files
EDH_roads_vocab.to_csv('EDH_roads_vocab.csv') 
files.download('EDH_roads_vocab.csv')