EDH ETL
This repository contains scripts for accessing, extracting, and transforming epigraphic datasets from the Epigraphic Database Heidelberg. The repository will serve as a template for future SDAM collaborative research projects in accessing and analysing large digital datasets.
The scripts access the main dataset via a web API, transform it into one dataframe object, merge and enrich these data with geospatial data and additional data from XML files, and save the outcome to the SDAM project directory on sciencedata.dk and the finished product on Zenodo. Since the most important data files are in a public folder, you can use and re-run our analyses even without a sciencedata.dk account and access to our team folder. If you face any issues with accessing the data, please contact us at sdam.cas@list.au.dk.
A separate Python package, sddk, was created specifically for accessing sciencedata.dk from Python (see https://github.com/sdam-au/sddk). If you want to save the dataset in a different location, the scripts can be easily modified.
- Petra Heřmánková, SDAM project, petra.hermankova@cas.au.dk
- Vojtěch Kaše, SDAM project, vojtech.kase@gmail.com
DATASET 2022: Heřmánková, Petra, & Kaše, Vojtěch. (2022). EDH_text_cleaned_2022_11_03 (v2.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.7303886
SCRIPTS 2022: Heřmánková, Petra, & Kaše, Vojtěch. (2022). sdam-au/EDH_ETL: Scripts (v2.0). Zenodo. https://doi.org/10.5281/zenodo.7303867
The 2022 dataset contains 81,883 cleaned and streamlined Latin inscriptions from the Epigraphic Database Heidelberg (EDH, https://edh-www.adw.uni-heidelberg.de/), aggregated on 2022/11/03 and created for the purpose of a quantitative study of epigraphic trends by the Social Dynamics in the Ancient Mediterranean Project (SDAM, http://sdam.au.dk). The dataset contains 69 attributes with original and streamlined data. Compared to the 2021 dataset, there are 407 more inscriptions and 5 fewer attributes containing redundant legacy data; the entire dataset remains approximately the same size (260 MB in 2022 compared to 234 MB in 2021), but some of the attributes are streamlined. Some attributes were removed as they are no longer available due to changes in the EDH itself, e.g. `edh_geography_uri`, `external_image_uris`, `fotos`, `geography`, `military`, `social_economic_legal_history`, `uri`; and some new attributes were added due to the streamlining of the ETL process, e.g. `pleiades_id`. For a full overview, see the Metadata section.
Metadata
EDH 2022 dataset metadata with descriptions for all attributes
DATASET 2021: Heřmánková, Petra, & Kaše, Vojtěch. (2021). EDH_text_cleaned_2021_01_21 (v1.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4888168
SCRIPTS 2021: Heřmánková, Petra, & Kaše, Vojtěch. (2021). sdam-au/EDH_ETL: Scripts (v2.0). Zenodo. https://doi.org/10.5281/zenodo.6478243
Metadata
EDH 2021 dataset metadata with descriptions for all attributes.
The original data come from two sources:
- the EpiDoc XML files available at https://edh.ub.uni-heidelberg.de/data (inscriptions)
- the web API available at https://edh.ub.uni-heidelberg.de/data/api (inscriptions and geospatial data)
The scripts merge data from these two sources into a Pandas dataframe, which is then exported into one JSON file for further usage. As noted above, the sddk Python package can be used to access sciencedata.dk from Python (see https://github.com/sdam-au/sddk), and the scripts can be easily modified if you want to save the dataset in a different location. You can access the file without having to log in to sciencedata.dk. Here is the path to the file on sciencedata.dk:
the `SDAM_root/SDAM_data/EDH/public` folder on sciencedata.dk, or alternatively https://sciencedata.dk/public/b6b6afdb969d378b70929e86e58ad975/
To access the files created in previous steps of the ETL process, you can use the dataset from the public folder, or rerun all the scripts on your own.
The dataset produced by the scripts in this repository is called EDH_text_cleaned_[timestamp].json and is published on Zenodo in all its versions; for details and links, see the How to cite us section above.
Additionally, the identical dataset can be accessed via sciencedata.dk: the `SDAM_root/SDAM_data/EDH/public` folder, or alternatively https://sciencedata.dk/public/b6b6afdb969d378b70929e86e58ad975/.
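For instance, the published JSON can be read directly from the public folder into a Pandas dataframe. A minimal sketch, assuming the 2022 file name (the names carry a timestamp, so adjust as needed):

```python
import pandas as pd

# Public sciencedata.dk folder with the EDH files (no login required)
PUBLIC_URL = "https://sciencedata.dk/public/b6b6afdb969d378b70929e86e58ad975/"

# NOTE: the file name is timestamped and assumed here; adjust it to the
# version you need. Depending on how the JSON was exported, you may have
# to pass orient=... to read_json.
df = pd.read_json(PUBLIC_URL + "EDH_text_cleaned_2022_11_03.json")
print(df.shape)  # the 2022 version should have 81,883 rows and 69 columns
```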
We use Python scripts (Jupyter notebooks) for accessing the API and extracting data from it, parsing the XML files for additional metadata, and combining these two resources into one. Subsequently, we use both R and Python for further cleaning and transforming the data. The scripts can be found in the scripts folder and are named according to the sequence in which they should be run.
The data exposed via the API are easily accessible and can be extracted with R or Python in a rather straightforward way. First, we extract the geocoordinates from the public API, using script 1_0.
Extracting geographical coordinates
| | File | Source commentary |
|---|---|---|
| input | `edhGeographicData.json` | containing all EDH geographies, loaded from https://edh.ub.uni-heidelberg.de/data/api |
| output | `EDH_geo_dict_[timestamp].json` | |
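To illustrate what this step produces, here is a minimal sketch that turns the downloaded geography file into a lookup dictionary keyed by geography ID; the field names (`features`, `id`) are assumptions for illustration, so inspect the real file to confirm its structure:

```python
import json

# Load the bulk geography file obtained from the EDH data API
with open("edhGeographicData.json", encoding="utf-8") as f:
    geo_raw = json.load(f)

# Build a dictionary keyed by geography ID for fast lookups later.
# NOTE: "features" and "id" are assumed key names for illustration.
geo_dict = {}
for record in geo_raw.get("features", []):
    geo_id = record.get("id")
    if geo_id is not None:
        geo_dict[geo_id] = record

with open("EDH_geo_dict.json", "w", encoding="utf-8") as f:
    json.dump(geo_dict, f, ensure_ascii=False)
```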
As a next step, we use the public API to download all the inscriptions. Obtaining the whole dataset of circa 81,000+ inscriptions as a Python dataframe takes about 12 minutes (see the respective script 1_1). We decided to save the dataframe as a JSON file for reasons of interoperability between Python and R.
Extracting all inscriptions from API
| | File | Source commentary |
|---|---|---|
| input | requests to https://edh.ub.uni-heidelberg.de/data/api | |
| output | `EDH_onebyone[timestamp].json` | |
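A minimal sketch of the download step, assuming the API's inscription search endpoint accepts an `hd_nr` parameter and returns matches under an `items` key (verify both against the API description at https://edh.ub.uni-heidelberg.de/data/api before relying on this):

```python
import requests
import pandas as pd

API = "https://edh.ub.uni-heidelberg.de/data/api/inschrift/suche"

# Fetch inscriptions one by one via their HD numbers.
# NOTE: "hd_nr" and the "items" response key are assumptions based on the
# public API description; the demo stops after 100 records.
records = []
for n in range(1, 101):
    resp = requests.get(API, params={"hd_nr": f"HD{n:06d}"}, timeout=30)
    if resp.ok:
        records.extend(resp.json().get("items", []))

df = pd.DataFrame(records)
df.to_json("EDH_onebyone_demo.json")  # JSON keeps the data usable from R too
```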
However, the dataset from the API is a simplified one (when compared with the records online and in the XML), primarily intended for queries in the web interface. For instance, the API data encode all information about dating by means of two variables: "not_before" and "not_after". This made us curious about how the data translate dating information like "around the middle of the 4th century CE". Therefore, we decided to enrich the JSON created from the API with data from the original XML files, which also include some additional variables (see script 1_2).
Extracting XML files
| | File | Source commentary |
|---|---|---|
| input | `edhEpidocDump_HD[first_number]-HD[last_number].zip` | downloaded from https://edh.ub.uni-heidelberg.de/data/download |
| output | `EDH_xml_data_[timestamp].json` | |
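As an illustration, a minimal sketch of pulling the original dating information out of a single EpiDoc XML file; `origDate` and the TEI namespace are standard EpiDoc, but the attribute names and the file name are assumptions to check against the actual dump:

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Parse one EpiDoc file from the unzipped dump (placeholder file name)
tree = ET.parse("HD000001.xml")
orig_date = tree.find(".//tei:origDate", TEI)

if orig_date is not None:
    # Free-text dating such as "around the middle of the 4th century CE"
    # sits in the element text; machine-readable bounds in the attributes.
    print(orig_date.text)
    print(orig_date.get("notBefore-custom"), orig_date.get("notAfter-custom"))
```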
To enrich the JSON with the geodata extracted in script 1_0, we have developed the following script (see script 1_3).
Merging geographies, API, and XML files
| | File | Source commentary |
|---|---|---|
| input 1 | `EDH_geographies_raw.json` | https://edh.ub.uni-heidelberg.de/data/download |
| input 2 | `EDH_onebyone[timestamp].json` | |
| input 3 | `EDH_xml_data_[timestamp].json` | |
| output | `EDH_merged_[timestamp].json` | |
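A minimal sketch of the merging idea in Pandas, assuming the API and XML dataframes share the HD number in an `id` column and the geography dictionary from script 1_0 is keyed the same way (the actual key names may differ):

```python
import json
import pandas as pd

api_df = pd.read_json("EDH_onebyone_demo.json")
xml_df = pd.read_json("EDH_xml_data.json")

# Join API and XML records on the shared HD identifier.
# NOTE: "id" as the shared key is an assumption for illustration.
merged = api_df.merge(xml_df, on="id", how="left", suffixes=("", "_xml"))

# Attach coordinates from the geography dictionary built in script 1_0.
with open("EDH_geo_dict.json", encoding="utf-8") as f:
    geo_dict = json.load(f)
merged["coordinates"] = merged["id"].map(
    lambda i: geo_dict.get(str(i), {}).get("coordinates"))

merged.to_json("EDH_merged_demo.json")
```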
In the next step, we clean and streamline the API attributes in a reproducible way in R (see script 1_4), so they are ready for any future analysis. We keep the original attributes along with the new clean ones.
Cleaning and streamlining attributes
| | File | Source commentary |
|---|---|---|
| input | `EDH_merged_[timestamp].json` | The script works with a JSON file containing all merged inscriptions. |
| output | `EDH_attrs_cleaned_[timestamp].json` | |
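The repository performs this step in R (script 1_4); purely as an illustration of the principle, here is a Python sketch that streamlines the two dating attributes while keeping the originals:

```python
import pandas as pd

df = pd.read_json("EDH_merged_demo.json")

# Coerce the API's string dating bounds to numbers; the original columns
# stay in place alongside the new clean ones.
# NOTE: illustration only; the real cleaning rules live in the R script.
df["not_before_clean"] = pd.to_numeric(df["not_before"], errors="coerce")
df["not_after_clean"] = pd.to_numeric(df["not_after"], errors="coerce")

df.to_json("EDH_attrs_cleaned_demo.json")
```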
The cleaning of the text of the inscriptions is done in script 1_5.
Cleaning and streamlining of the text of the inscription
| | File | Source commentary |
|---|---|---|
| input | `EDH_attrs_cleaned_[timestamp].json` | The script works with a JSON file containing all inscriptions with their streamlined attributes. |
| output | `EDH_text_cleaned_[timestamp].json` | |
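A minimal sketch of the kind of cleaning such a step performs, stripping common epigraphic editorial marks with regular expressions; the actual rules in script 1_5 are more elaborate, and the `transcription` column name is an assumption:

```python
import re
import pandas as pd

def clean_text(text):
    """Strip common editorial marks from an inscription transcription."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r"[\[\]()<>{}]", "", text)  # editorial brackets
    text = text.replace("/", " ")             # line-division slashes
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

df = pd.read_json("EDH_attrs_cleaned_demo.json")
# NOTE: "transcription" is an assumed column name for illustration.
df["text_cleaned"] = df["transcription"].map(clean_text)
df.to_json("EDH_text_cleaned_demo.json")
```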
The following scripts document basic usage for Python and R (they do not change the dataset; they only demonstrate access to the data using both languages).
Script demonstrating loading the dataset into Python via sciencedata.dk (with or without credentials), using the sddk package.
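Loading via sddk might look like the following sketch; the call pattern follows the sddk README (https://github.com/sdam-au/sddk), so check it against the current version of the package, and note that the timestamped file name is a placeholder:

```python
import sddk

# Configure a session pointing at the shared folder; sciencedata.dk
# credentials are requested interactively (not needed for the public URL).
conf = sddk.configure("SDAM_root", "your_username@au.dk")

# Read the published JSON straight into a Pandas dataframe ("df").
# NOTE: the file name below is a placeholder; adjust to the version you need.
df = sddk.read_file("SDAM_data/EDH/public/EDH_text_cleaned_2022_11_03.json", "df", conf)
```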
Script demonstrating loading the dataset into R via sciencedata.dk (without credentials).
Heřmánková, P., Kaše, V., & Sobotkova, A. (2021). Inscriptions as data: Digital epigraphy in macro-historical perspective. Journal of Digital History, 1(1), 99. https://doi.org/10.1515/jdh-2021-1004
- the article works with version 1 of the dataset, but version 2 follows the same principles. Some attribute names, as well as the contents of the dataset, may vary in version 2 (reflecting changes made by the EDH).