This notebook extracts additional data from the EDH XML/EpiDoc files, because some information is missing from the data returned by the API.


In [1]:
### REQUIREMENTS
import numpy as np
import math
import pandas as pd

import sys
### we do a lot of requests during the scraping; some use the requests package, some use urllib
import requests
from urllib.request import urlopen 
from urllib.parse import quote  
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import re

import zipfile
import io

# to avoid errors, we sometimes use time.sleep(N) before retrying a request
import time
# the input data typically have a JSON structure
import json
import getpass

import datetime as dt
# for simple parallel computing:
from concurrent.futures import ThreadPoolExecutor

import sddk

# google sheets integration:
import gspread
from gspread_dataframe import get_as_dataframe, set_with_dataframe
from google.oauth2 import service_account # based on google-auth library


In [2]:
conf = sddk.configure("SDAM_root", "648597@au.dk")

sciencedata.dk username (format '123456@au.dk'): 648597@au.dk
sciencedata.dk password: ········
connection with shared folder established with you as its owner
endpoint variable has been configured to: https://sciencedata.dk/files/SDAM_root/


In [3]:
# to access the Google Sheet, you need a Google Service Account key JSON file
# I have mine located in my personal space on sciencedata.dk, so I read it from there:

# (1) read the file and parse its content
try:
    file_data = conf[0].get("https://sciencedata.dk/files/ServiceAccountsKey.json").json()
except Exception:
    print("cannot find file ServiceAccountsKey.json")
    raise  # without the key, the credentials below cannot be built
# (2) transform the content into a credentials object
credentials = service_account.Credentials.from_service_account_info(file_data)
# (3) specify your usage of the credentials
scoped_credentials = credentials.with_scopes(['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive'])
# (4) use the constrained credentials for authentication of gspread package
gc = gspread.Client(auth=scoped_credentials)
# (5) establish connection with spreadsheets specified by their url
EDH_overview = gc.open_by_url("https://docs.google.com/spreadsheets/d/164MLxVcCZg95Bzf9fVyD1-iCA5V97eM3KAFllyhTvt4/edit?usp=sharing")

Now we turn to the download section of the EDH website, where we can find zip archives containing xml files with individual inscriptions. Instead of downloading them manually, we will download them directly into our Python environment.
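As a minimal sketch of the idea, a zip archive can be opened straight from bytes held in memory, without writing anything to disk. The archive below is built locally as a stand-in for a downloaded response body; the member names and contents are made up for illustration.

```python
import io
import zipfile

# Build a small zip archive in memory (a stand-in for resp.content from requests).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("xml/HD000001.xml", "<TEI><title>Grabinschrift auf Tafel</title></TEI>")
    zf.writestr("xml/HD000002.xml", "<TEI><title>Grabinschrift</title></TEI>")
    zf.writestr("readme.txt", "not an inscription")
zip_bytes = buffer.getvalue()

# Open the archive directly from the bytes and list only the XML members.
zipped = zipfile.ZipFile(io.BytesIO(zip_bytes))
xml_names = [name for name in zipped.namelist() if name.endswith(".xml")]

# Read one member and decode it into text.
first_xml = zipped.read(xml_names[0]).decode("utf-8")
print(xml_names)
print(first_xml)
```

The same two calls, `zipfile.ZipFile(io.BytesIO(...))` and `namelist()`, are what the cells below apply to the real archives.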

In [4]:
# extract the download page
resp = requests.get("https://edh-www.adw.uni-heidelberg.de/data/export", headers={"User-Agent" : ""})
url_text = resp.text

In [5]:
# extract urls of individual zip archives for download
# (raw string, so the pattern needs no backslash escapes)
download_urls = re.findall(r"download/edhEpidocDump_HD.+", url_text)
download_urls

['download/edhEpidocDump_HD000001-HD010000.zip',
 'download/edhEpidocDump_HD010001-HD020000.zip',
 'download/edhEpidocDump_HD020001-HD030000.zip',
 'download/edhEpidocDump_HD030001-HD040000.zip',
 'download/edhEpidocDump_HD040001-HD050000.zip',
 'download/edhEpidocDump_HD050001-HD060000.zip',
 'download/edhEpidocDump_HD060001-HD070000.zip',
 'download/edhEpidocDump_HD070001-HD082046.zip']
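The `.+` in the pattern matches up to the end of the line, so it returns clean paths only when nothing follows the link on the same line of the page source. The sketch below compares it with a stricter pattern on a hypothetical HTML fragment; the markup is an assumption, not a copy of the EDH page.

```python
import re

# Hypothetical fragment with two links on one line; the real page markup may differ.
sample_html = (
    '<a href="download/edhEpidocDump_HD000001-HD010000.zip">part 1</a> '
    '<a href="download/edhEpidocDump_HD010001-HD020000.zip">part 2</a>'
)

# Greedy pattern: one match running from the first link to the end of the line.
greedy = re.findall(r"download/edhEpidocDump_HD.+", sample_html)

# Stricter pattern: stops at the ".zip" suffix, so each link is matched separately.
strict = re.findall(r"download/edhEpidocDump_HD\d+-HD\d+\.zip", sample_html)

print(len(greedy), len(strict))
print(strict)
```

If the page layout ever changes, the stricter pattern should keep returning one path per archive.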

In [8]:
# check how many files we have

url_base = "https://edh-www.adw.uni-heidelberg.de/"

filenames = []
for d_url in download_urls:
    url = url_base + d_url
    print(url)
    resp = requests.get(url, headers={'User-Agent': ''})
    zipped = zipfile.ZipFile(io.BytesIO(resp.content))
    ### names of all files within the zipped directory
    namelist = zipped.namelist()
    namelist = [file for file in namelist if file.endswith(".xml")]
    filenames.extend(namelist)
len(filenames)

https://edh-www.adw.uni-heidelberg.de/download/edhEpidocDump_HD000001-HD010000.zip
https://edh-www.adw.uni-heidelberg.de/download/edhEpidocDump_HD010001-HD020000.zip
https://edh-www.adw.uni-heidelberg.de/download/edhEpidocDump_HD020001-HD030000.zip
https://edh-www.adw.uni-heidelberg.de/download/edhEpidocDump_HD030001-HD040000.zip
https://edh-www.adw.uni-heidelberg.de/download/edhEpidocDump_HD040001-HD050000.zip
https://edh-www.adw.uni-heidelberg.de/download/edhEpidocDump_HD050001-HD060000.zip
https://edh-www.adw.uni-heidelberg.de/download/edhEpidocDump_HD060001-HD070000.zip
https://edh-www.adw.uni-heidelberg.de/download/edhEpidocDump_HD070001-HD082046.zip


81156

In [11]:
# collecting filenames from first url for testing

url_base = "https://edh-www.adw.uni-heidelberg.de/"

url = url_base + download_urls[0]
resp = requests.get(url, headers={"User-Agent" : ""})
zipped = zipfile.ZipFile(io.BytesIO(resp.content))
### names of all files within the zipped directory
namelist = zipped.namelist()

In [12]:
len(namelist)

9930

In [13]:
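def get_filecontent_from_filename(filename, zipped):
    """Return the content of one archive member as a string, or None on failure."""
    try:
        # note: str() on bytes yields the repr ("b'...'"), not decoded text;
        # this is why the outputs below start with b' and show escapes such as \xc2\xa9
        return str(zipped.read(filename))
    except Exception:
        return None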
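A short sketch of the difference between `str()` and `.decode()` on bytes, using a made-up sample value. With `str()`, newlines and non-ASCII characters stay as escape sequences, so `splitlines()` finds a single line and an empty byte string becomes the three-character string `b''`. This may explain the `\xe2\x80\x93` sequences, the empty `text_edition` column and the entries of length 3 seen further down, although that link is an inference from the outputs rather than something verified against the files.

```python
# Made-up sample: an en dash and a newline, encoded as UTF-8 bytes.
raw = "71 AD \u2013 130 AD\nGrabinschrift".encode("utf-8")

as_repr = str(raw)             # repr of the bytes object, starts with b'
as_text = raw.decode("utf-8")  # real text with a real en dash and newline

repr_lines = as_repr.splitlines()  # one line: "\n" is two literal characters here
text_lines = as_text.splitlines()  # two lines

empty_repr_len = len(str(b""))     # length of the string "b''"

print(as_repr)
print(text_lines)
print(empty_repr_len)
```

Swapping `str(...)` for `.decode("utf-8")` in the reader function would give decoded text, at the cost of changing every output recorded in this notebook.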

In [14]:
# test with first ten files within the namelist
edh_filecontents = {}

for filename in namelist[:10]:
    edh_filecontents[filename] = get_filecontent_from_filename(filename, zipped)
# transform it into dataframe

In [15]:
# let's try to parse it as xml
soup = BeautifulSoup(edh_filecontents[namelist[0]])
soup

b'<?xml version="1.0" encoding="UTF-8"?><?xml-model href="http://www.stoa.org/epidoc/schema/latest/tei-epidoc.rng" schematypens="http://relaxng.org/ns/structure/1.0"?><tei xml:base="ex-epidoctemplate.xml" xml:lang="de" xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0">\n    <teiheader>\n        <filedesc>\n            <titlestmt>\n                <title>Grabinschrift auf Tafel</title>\n            </titlestmt>  \n            <publicationstmt>\n                <authority>Epigraphische Datenbank Heidelberg</authority>\n                <idno type="URI">http://edh-www.adw.uni-heidelberg.de/edh/inschrift/HD000001</idno>\n                <idno type="TM">251193</idno><idno type="localID">HD000001</idno>\n                <availability>\n                    <p>\xc2\xa9 Heidelberg Academy of Sciences and Humanities</p>\n                    <licence target="http://creativecommons.org/licenses/by-sa/4.0/">This file is licensed under the Creative Commons Attribution-ShareAlike 4.0 license.\n

In [16]:
# define function for data parsing
def get_data_from_filename(filename):
    """Parse one raw XML string from edh_filecontents into a flat dict; return None on failure."""
    try:
        soup = BeautifulSoup(edh_filecontents[filename])
        xml_data = {}
        # identifiers: EDH id (last part of the URI) and Trismegistos id
        idno_uri = soup.find("idno", attrs={"type" : "URI"}).get_text()
        xml_data["idno_uri"] = idno_uri.rpartition("/")[2]
        xml_data["idno_tm"] = soup.find("idno", attrs={"type" : "TM"}).get_text()
        # "ref" attributes of all placename tags
        placenames_refs = []
        try:
            placenames = soup.find_all("placename")
            for placename in placenames:
                placenames_refs.append(placename["ref"])
        except Exception: placenames_refs = []
        xml_data["placenames_refs"] = placenames_refs
        # edition text and dating
        text_tag = soup.find("div", attrs={"type" : "edition"})
        xml_data["text_edition"] = " ".join(text_tag.get_text().splitlines()[1:])
        xml_data["origdate_text"] = soup.find("origdate").get_text().replace("\n", "")
        # optional fields: fall back to an empty string when a tag or attribute is missing
        try:
            layout_execution = soup.layout.find("rs")["ref"]
            xml_data["layout_execution"] = layout_execution.rpartition("/")[2]
        except Exception: xml_data["layout_execution"] = ""
        try: xml_data["layout_execution_text"] = soup.layout.rs.get_text()
        except Exception: xml_data["layout_execution_text"] = ""
        try:
            support_objecttype = soup.support.find("objecttype")["ref"]
            xml_data["support_objecttype"] = support_objecttype.rpartition("/")[2]
        except Exception: xml_data["support_objecttype"] = ""
        try: xml_data["support_objecttype_text"] = soup.support.objecttype.get_text()
        except Exception: xml_data["support_objecttype_text"] = ""
        try:
            support_material = soup.support.find("material")["ref"]
            xml_data["support_material"] = support_material.rpartition("/")[2]
        except Exception: xml_data["support_material"] = ""
        try: xml_data["support_material_text"] = soup.support.material.get_text()
        except Exception: xml_data["support_material_text"] = ""
        try:
            support_decoration = soup.support.find("rs")["ref"]
            xml_data["support_decoration"] = support_decoration.rpartition("/")[2]
        except Exception: xml_data["support_decoration"] = ""
        try:
            keywords_term = soup.keywords.find("term")["ref"]
            xml_data["keywords_term"] = keywords_term.rpartition("/")[2]
        except Exception: xml_data["keywords_term"] = ""
        try: xml_data["keywords_term_text"] = soup.keywords.get_text().replace("\n", "")
        except Exception: xml_data["keywords_term_text"] = ""
        return xml_data
    except Exception:
        # files without the mandatory tags (e.g. empty files) yield None
        return None
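As an alternative sketch, the same kind of lookup can be done with the standard library's `xml.etree.ElementTree` and an explicit TEI namespace, which keeps the original tag casing (`objectType`, `origDate`). The sample record is hypothetical and heavily shortened, the vocabulary path after `http://www.eagle-network.eu/` is an assumption, and the helpers `text_of` and `ref_id` are illustrative names rather than part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened EpiDoc record; real EDH files contain much more.
sample_xml = """<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <idno type="URI">http://edh-www.adw.uni-heidelberg.de/edh/inschrift/HD000001</idno>
    <idno type="TM">251193</idno>
    <material ref="http://www.eagle-network.eu/voc/material/lod/48">Marmor</material>
  </teiHeader>
</TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def text_of(root, path):
    """Return the stripped text of the first match, or "" if nothing matches."""
    element = root.find(path, NS)
    return (element.text or "").strip() if element is not None else ""


def ref_id(root, path):
    """Return the last path segment of the "ref" attribute, or "" if absent."""
    element = root.find(path, NS)
    ref = element.get("ref", "") if element is not None else ""
    return ref.rpartition("/")[2]


# Parse from bytes so the encoding declaration is honoured.
root = ET.fromstring(sample_xml.encode("utf-8"))
record = {
    "idno_uri": text_of(root, ".//tei:idno[@type='URI']").rpartition("/")[2],
    "idno_tm": text_of(root, ".//tei:idno[@type='TM']"),
    "support_material": ref_id(root, ".//tei:material"),
    "support_material_text": text_of(root, ".//tei:material"),
    "support_objecttype": ref_id(root, ".//tei:objectType"),  # missing in the sample
}
print(record)
```

The two helpers replace the repeated try/except blocks: a missing element simply yields an empty string.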

In [17]:
# test with first ten files within the namelist
edh_xml_data = []

for filename in namelist[:10]:
    edh_filecontents[filename] = get_filecontent_from_filename(filename, zipped)
    edh_xml_data.append(get_data_from_filename(filename))
# transform it into dataframe
pd.DataFrame(edh_xml_data)

Unnamed: 0,idno_uri,idno_tm,placenames_refs,text_edition,origdate_text,layout_execution,layout_execution_text,support_objecttype,support_objecttype_text,support_material,support_material_text,support_decoration,keywords_term,keywords_term_text
0,HD000001,251193,"[http://www.trismegistos.org/place/033152, htt...",,71 AD \xe2\x80\x93 130 AD\n ...,21,unbestimmt,257,Tafel,,"Marmor, ge\xc3\xa4dert / farbig",1000,92,\n Grabinschrift\n ...
1,HD000002,265631,"[http://www.trismegistos.org/place/000172, htt...",,51 AD \xe2\x80\x93 200 AD\n ...,21,unbestimmt,257,Tafel,48.0,Marmor,1000,92,\n Grabinschrift\n ...
2,HD000003,220675,"[http://www.trismegistos.org/place/025443, htt...",,131 AD \xe2\x80\x93 170 AD\n ...,21,unbestimmt,57,Statuenbasis,48.0,Marmor,1000,69,\n Ehreninschrift\n ...
3,HD000004,222102,"[http://www.trismegistos.org/place/025443, htt...",,151 AD \xe2\x80\x93 200 AD\n ...,21,unbestimmt,29,Altar,60.0,Kalkstein,1000,80,\n Weihinschrift\n ...
4,HD000005,265629,"[http://www.trismegistos.org/place/000172, htt...",,1 AD \xe2\x80\x93 200 AD\n ...,21,unbestimmt,250,Stele,138.0,unbestimmt,1000,92,\n Grabinschrift\n ...
5,HD000006,222924,"[http://www.trismegistos.org/place/025443, htt...",,71 AD \xe2\x80\x93 150 AD\n ...,21,unbestimmt,250,Stele,60.0,Kalkstein,1000,92,\n Grabinschrift\n ...
6,HD000007,265588,"[http://www.trismegistos.org/place/000172, htt...",,100 BC \xe2\x80\x93 51 BC\n ...,21,unbestimmt,257,Tafel,71.0,Travertin,1000,92,\n Grabinschrift\n ...
7,HD000008,265611,"[http://www.trismegistos.org/place/000172, htt...",,101 AD \xe2\x80\x93 200 AD\n ...,21,unbestimmt,257,Tafel,48.0,Marmor,1000,92,\n Grabinschrift\n ...
8,HD000009,168722,[],,201 AD \xe2\x80\x93 300 AD\n ...,21,unbestimmt,276,Tessera,108.0,"Blei, Zinn",1000,76,\n Defixio\n
9,HD000010,244297,[],,101 AD \xe2\x80\x93 200 AD\n ...,21,unbestimmt,78,Urne,138.0,unbestimmt,1000,92,\n Grabinschrift\n ...


In [18]:
EDH_xml_cols = pd.DataFrame(pd.DataFrame(edh_xml_data).columns, columns=["columns"])
EDH_xml_cols

Unnamed: 0,columns
0,idno_uri
1,idno_tm
2,placenames_refs
3,text_edition
4,origdate_text
5,layout_execution
6,layout_execution_text
7,support_objecttype
8,support_objecttype_text
9,support_material


In [77]:
# uncomment to export to gsheets
# set_with_dataframe(EDH_overview.add_worksheet("EDH_xml_cols", 1, 1), EDH_xml_cols)

# Extract xml files content from the zip files as raw strings 

In [19]:
%%time

# main loop

url_base = "https://edh-www.adw.uni-heidelberg.de/"

for d_url in download_urls:
    url = url_base + d_url
    print(url)
    resp = requests.get(url, headers={'User-Agent': ''})
    zipped = zipfile.ZipFile(io.BytesIO(resp.content))
    ### names of all files within the zipped directory
    namelist = zipped.namelist()
    for filename in namelist:
        # get_filecontent_from_filename already handles read errors itself
        edh_filecontents[filename] = get_filecontent_from_filename(filename, zipped)
https://edh-www.adw.uni-heidelberg.de/download/edhEpidocDump_HD000001-HD010000.zip
https://edh-www.adw.uni-heidelberg.de/download/edhEpidocDump_HD010001-HD020000.zip
https://edh-www.adw.uni-heidelberg.de/download/edhEpidocDump_HD020001-HD030000.zip
https://edh-www.adw.uni-heidelberg.de/download/edhEpidocDump_HD030001-HD040000.zip
https://edh-www.adw.uni-heidelberg.de/download/edhEpidocDump_HD040001-HD050000.zip
https://edh-www.adw.uni-heidelberg.de/download/edhEpidocDump_HD050001-HD060000.zip
https://edh-www.adw.uni-heidelberg.de/download/edhEpidocDump_HD060001-HD070000.zip
https://edh-www.adw.uni-heidelberg.de/download/edhEpidocDump_HD070001-HD082046.zip
CPU times: user 7.53 s, sys: 563 ms, total: 8.1 s
Wall time: 28.9 s


In [20]:
len(edh_filecontents)

81156

In [80]:
sddk.write_file("SDAM_data/EDH/edh_raw_xmls_20201109.json", edh_filecontents, conf)

A file with the same name ("edh_raw_xmls_20201109.json") already exists in this location.
Press Enter to overwrite it or choose different path and filename: 
Your <class 'dict'> object has been succefully written as "https://sciencedata.dk/files/SDAM_root/SDAM_data/EDH/edh_raw_xmls_20201109.json"


In [21]:
# let's look at what is within the raw xml strings
# how many references to http://www.eagle-network.eu/ are there?
eagle_n = 0
for filename in edh_filecontents.keys():
    eagle_n = eagle_n + str(edh_filecontents[filename]).count("http://www.eagle-network.eu/")
eagle_n

400479

In [24]:
edh_xml_lens = []
for filename in edh_filecontents.keys():
    edh_xml_lens.append((filename, len(edh_filecontents[filename])))

In [26]:
# perhaps there are some strange values
sorted(edh_xml_lens, key=lambda x: x[1])

[('./xml/HD070403.xml', 3),
 ('./xml/HD071377.xml', 3),
 ('./xml/HD071378.xml', 3),
 ('./xml/HD072745.xml', 3),
 ('./xml/HD072755.xml', 3),
 ('./xml/HD055405.xml', 6077),
 ('./xml/HD055406.xml', 6077),
 ('./xml/HD071495.xml', 6166),
 ('./xml/HD071485.xml', 6173),
 ('./xml/HD073406.xml', 6179),
 ('./xml/HD051245.xml', 6189),
 ('./xml/HD051249.xml', 6189),
 ('./xml/HD071472.xml', 6190),
 ('./xml/HD051252.xml', 6193),
 ('./xml/HD051253.xml', 6193),
 ('./xml/HD071508.xml', 6194),
 ('./xml/HD073408.xml', 6195),
 ('./xml/HD051246.xml', 6206),
 ('./xml/HD051250.xml', 6206),
 ('./xml/HD071494.xml', 6206),
 ('./xml/HD076796.xml', 6207),
 ('./xml/HD071487.xml', 6211),
 ('./xml/HD071483.xml', 6222),
 ('./xml/HD051247.xml', 6223),
 ('./xml/HD016471.xml', 6226),
 ('./xml/HD033450.xml', 6226),
 ('./xml/HD071505.xml', 6226),
 ('./xml/HD031206.xml', 6229),
 ('./xml/HD072660.xml', 6229),
 ('./xml/HD013883.xml', 6231),
 ('./xml/HD013877.xml', 6233),
 ('./xml/HD013880.xml', 6233),
 ('./xml/HD054082.xml',

In [31]:
# the longest files; note that [-10:-1] leaves out the very last (longest) entry
sorted(edh_xml_lens, key=lambda x: x[1])[-10:-1]

[('./xml/HD026625.xml', 59673),
 ('./xml/HD044434.xml', 61868),
 ('./xml/HD026775.xml', 65168),
 ('./xml/HD053580.xml', 66633),
 ('./xml/HD032316.xml', 69544),
 ('./xml/HD043295.xml', 77805),
 ('./xml/HD043289.xml', 81016),
 ('./xml/HD044445.xml', 85459),
 ('./xml/HD056719.xml', 101147)]

In [36]:
soup = BeautifulSoup(edh_filecontents['./xml/HD056719.xml'])
soup.find("idno", attrs={"type" : "URI"}).get_text()

'http://edh-www.adw.uni-heidelberg.de/edh/inschrift/HD056719'

# Parse the xml data

In [37]:
edh_xml_data = []
for filename in edh_filecontents.keys():
    edh_xml_data.append(get_data_from_filename(filename))

In [38]:
# remove empty
#edh_xml_data = [elem for elem in edh_xml_data if elem != None]
# how many we have
# last time we had 81143
len(edh_xml_data)

81156

In [42]:
# look at invalid
[el for el in edh_xml_data if el is None]

[None, None, None, None, None]

In [43]:
# take only valid data
edh_xml_data_f = [el for el in edh_xml_data if el is not None]

In [44]:
len(edh_xml_data_f)

81151

In [45]:
# make a dataframe from the valid records
edh_xml_data_df = pd.DataFrame(edh_xml_data_f)
edh_xml_data_df.head(5)

Unnamed: 0,idno_uri,idno_tm,placenames_refs,text_edition,origdate_text,layout_execution,layout_execution_text,support_objecttype,support_objecttype_text,support_material,support_material_text,support_decoration,keywords_term,keywords_term_text
0,HD000001,251193,"[http://www.trismegistos.org/place/033152, htt...",,71 AD \xe2\x80\x93 130 AD\n ...,21,unbestimmt,257,Tafel,,"Marmor, ge\xc3\xa4dert / farbig",1000,92,\n Grabinschrift\n ...
1,HD000002,265631,"[http://www.trismegistos.org/place/000172, htt...",,51 AD \xe2\x80\x93 200 AD\n ...,21,unbestimmt,257,Tafel,48.0,Marmor,1000,92,\n Grabinschrift\n ...
2,HD000003,220675,"[http://www.trismegistos.org/place/025443, htt...",,131 AD \xe2\x80\x93 170 AD\n ...,21,unbestimmt,57,Statuenbasis,48.0,Marmor,1000,69,\n Ehreninschrift\n ...
3,HD000004,222102,"[http://www.trismegistos.org/place/025443, htt...",,151 AD \xe2\x80\x93 200 AD\n ...,21,unbestimmt,29,Altar,60.0,Kalkstein,1000,80,\n Weihinschrift\n ...
4,HD000005,265629,"[http://www.trismegistos.org/place/000172, htt...",,1 AD \xe2\x80\x93 200 AD\n ...,21,unbestimmt,250,Stele,138.0,unbestimmt,1000,92,\n Grabinschrift\n ...


In [46]:
sddk.write_file("SDAM_data/EDH/edh_xml_data_2020-11-10.json", edh_xml_data_df, conf)

Your <class 'pandas.core.frame.DataFrame'> object has been succefully written as "https://sciencedata.dk/files/SDAM_root/SDAM_data/EDH/edh_xml_data_2020-11-10.json"


# Read the data back

In [42]:
edh_xml_data_df = sddk.read_file("SDAM_data/EDH/edh_xml_data_2020-11-10.json", "df", conf)

In [43]:
len(edh_xml_data_df)

81151