# INTRODUCTION

This notebook explores the potential of the Epigraphic Database Heidelberg web API ([EDH API](https://edh-www.adw.uni-heidelberg.de/data/api)) in combination with sciencedata.dk as data storage (see more about our current progress in using sciencedata.dk [here](https://docs.google.com/document/d/1sojHsxkcAbZH9DpWFuHDomQwTZHPQv_WaAxO_erP6FE/edit?usp=sharing)).

The ambition here is to use cloud-based solutions as much as possible, without any dependence on local machines. At the same time, we do not want to rely completely on Google services.

In [0]:
### REQUIREMENTS
import numpy as np
import math
import pandas as pd

import sys
### we make a lot of requests during the scraping, some with the requests package, some with urllib
import requests
from urllib.request import urlopen 
from urllib.parse import quote  
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

# to avoid errors, we sometimes use time.sleep(N) before retrying a request
import time
# the input data typically have a JSON structure
import json
import getpass

import datetime as dt
# for simple parallel computing:
from concurrent.futures import ThreadPoolExecutor
### google drive
from google.colab import drive
#import gspread
#from gspread_dataframe import get_as_dataframe, set_with_dataframe

!pip install --ignore-installed --index-url https://test.pypi.org/simple/ --no-deps sddk ### our own package under construction, always install to have up-to-date version
import sddk

Looking in indexes: https://test.pypi.org/simple/
Collecting sddk
  Using cached https://test-files.pythonhosted.org/packages/65/8b/d682c15a7335215ac119538ad8455b408cd7e8be4f6614678888dd2c88ed/sddk-0.0.7-py3-none-any.whl
Installing collected packages: sddk
Successfully installed sddk-0.0.7


## Configure session and URL

In [0]:
### configure session and url
### in the case of "SDAM_root", the group owner is Vojtech with username 648597@au.dk
s, sddk_url = sddk.configure_session_and_url("SDAM_root")

sciencedata.dk username (format '123456@au.dk'): 648597@au.dk
sciencedata.dk password: ··········
personal connection established
group connection established with you as owner
endpoint for requests has been configured to: https://sciencedata.dk/files/SDAM_root/


# EDH via API

The basic form of a request is as follows:
```
https://edh-www.adw.uni-heidelberg.de/data/api/inscriptions/search?
```
To build a query based on an inscription number, you have to specify the parameter **hd_nr**, as here:

```
https://edh-www.adw.uni-heidelberg.de/data/api/inscriptions/search?hd_nr=1
```
(Feel free to explore this in the browser.)

Here we use the function ```requests.get()``` to make our requests from Python.
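
As a small aside, the query string can also be assembled with `urllib.parse.urlencode` instead of manual string concatenation. The sketch below is only illustrative: the helper name `build_query_url` is hypothetical, and `hd_nr` is the only search parameter taken from the text above.

```python
from urllib.parse import urlencode

# Base URL of the EDH inscription search, as given above
URL_FORM = "https://edh-www.adw.uni-heidelberg.de/data/api/inscriptions/search?"

def build_query_url(**params):
    """Return the search URL with `params` encoded as a query string.

    `build_query_url` is a hypothetical helper, not part of any EDH client.
    """
    return URL_FORM + urlencode(params)

# prints the same URL as the hd_nr=1 example above
print(build_query_url(hd_nr=1))
```

With `requests`, a similar result is available by passing `params={"hd_nr": 1}` to `requests.get()`, which encodes the query string itself.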

## One inscription query example

In [0]:
%%time
inscription_number = 100
URL_form = "https://edh-www.adw.uni-heidelberg.de/data/api/inscriptions/search?"

response = requests.get(URL_form + "hd_nr=" + str(inscription_number))
response
json_data = response.json()
print(json_data)

{'limit': '20', 'total': 1, 'items': [{'last_update': '2015-05-21', 'transcription': 'D[---] / ANELI[---] / BERVE[---] / P[---]IT[------', 'work_status': 'provisional', 'diplomatic_text': 'D[ ] / ANELI[ ] / BERVE[ ] / P[ ]IT[', 'responsible_individual': 'Gräf', 'country': 'Spain', 'uri': 'https://edh-www.adw.uni-heidelberg.de/edh/inschrift/HD000100', 'language': 'Latin', 'trismegistos_uri': 'https://www.trismegistos.org/text/226731', 'id': 'HD000100', 'findspot_modern': 'El Burgo de Osma', 'edh_geography_uri': 'https://edh-www.adw.uni-heidelberg.de/edh/geographie/9371', 'literature': 'AE 1983, 0597.; C. García Merino, in: Homenaje al Prof. Martin Almagro Basch 3 (Madrid 1983) 355, Nr. 2; lám. 1, 2. - AE 1983.', 'findspot_ancient': 'Uxama', 'modern_region': 'Soria', 'commentary': ' Text in vier Zeilen, nahezu unlesbar.', 'province_label': 'Hispania citerior', 'type_of_monument': 'stele'}]}
CPU times: user 13.6 ms, sys: 1.32 ms, total: 14.9 ms
Wall time: 836 ms


In [0]:
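def get_inscription_data(num):
  """Return the record for inscription number `num`, or {} if it cannot be retrieved."""
  for attempt in range(2):  # one initial attempt plus one retry
    try:
      response = requests.get(URL_form + "hd_nr=" + str(num))
      json_data_items = response.json()["items"]
      # the "items" list may be empty, so guard the [0] lookup
      return json_data_items[0] if json_data_items else {}
    except (requests.exceptions.RequestException, ValueError, KeyError):
      if attempt == 0:
        time.sleep(1)  # wait a moment before retrying
  return {}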

for num in range(1, 10):
  print(get_inscription_data(num))


{'depth': '2.7 cm', 'findspot_modern': 'Cuma, bei', 'width': '34 cm', 'diplomatic_text': 'D M / NONIAE P F OPTATAE / ET C IVLIO ARTEMONI / PARENTIBVS / LIBERTIS LIBERTABVSQVE / POSTERISQVE EORVM / C IVLIVS C F OPTATVS / FILIVS', 'material': 'Marmor, geädert / farbig', 'edh_geography_uri': 'https://edh-www.adw.uni-heidelberg.de/edh/geographie/11843', 'modern_region': 'Campania', 'uri': 'https://edh-www.adw.uni-heidelberg.de/edh/inschrift/HD000001', 'not_after': '0130', 'people': [{'cognomen': 'Optata', 'name': 'Noniae P.f. Optatae', 'person_id': '1', 'nomen': 'Nonia', 'gender': 'female'}, {'cognomen': 'Artemo', 'name': 'C. Iulio Artemoni', 'person_id': '2', 'praenomen': 'C.', 'nomen': 'Iulius', 'gender': 'male'}, {'nomen': 'Iulius', 'praenomen': 'C.', 'gender': 'male', 'cognomen': 'Optatus', 'name': 'C. Iulius C.f. Optatus', 'person_id': '3'}, {'name': 'Noniae P.f. Optatae', 'person_id': '4', 'cognomen': 'Optata', 'gender': 'female', 'nomen': 'Nonia'}, {'name': 'C. Iulio Artemoni', 'per
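
The retry-once logic in `get_inscription_data` can be generalized to any number of attempts. The following is a minimal sketch, not the notebook's actual implementation: `fetch_with_retries`, its parameters, and the stand-in `flaky_fetch` function are all hypothetical, and the fetch function is passed in as an argument so the pattern can be tried without network access.

```python
import time

def fetch_with_retries(fetch, num, attempts=3, delay=1.0):
    """Call fetch(num) up to `attempts` times, sleeping `delay` seconds between tries.

    Returns {} if every attempt raises, mirroring the empty-dict fallback above.
    All names here are hypothetical.
    """
    for attempt in range(attempts):
        try:
            return fetch(num)
        except Exception:
            # pause before the next attempt, but not after the last one
            if attempt < attempts - 1:
                time.sleep(delay)
    return {}

# Stand-in for a real request: fails on the first call, succeeds afterwards
calls = {"count": 0}

def flaky_fetch(num):
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated network failure")
    return {"id": "HD%06d" % num}

result = fetch_with_retries(flaky_fetch, 100, delay=0.01)
print(result)
```

In real use, `fetch` would be a function that performs the request and parses the JSON, and the broad `except Exception` would ideally be narrowed to the errors that are actually expected.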

In [0]:
%%time
### the actual data are stored under the key "items"
pd.DataFrame(json_data["items"])


CPU times: user 2.87 ms, sys: 0 ns, total: 2.87 ms
Wall time: 2.82 ms


| | findspot_ancient | findspot_modern | id | diplomatic_text | uri | edh_geography_uri | literature | trismegistos_uri | work_status | province_label | type_of_monument | language | last_update | modern_region | transcription | commentary | responsible_individual | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Uxama | El Burgo de Osma | HD000100 | D[ ] / ANELI[ ] / BERVE[ ] / P[ ]IT[ | https://edh-www.adw.uni-heidelberg.de/edh/insc... | https://edh-www.adw.uni-heidelberg.de/edh/geog... | AE 1983, 0597.; C. García Merino, in: Homenaje... | https://www.trismegistos.org/text/226731 | provisional | Hispania citerior | stele | Latin | 2015-05-21 | Soria | D[---] / ANELI[---] / BERVE[---] / P[---]IT[--... | Text in vier Zeilen, nahezu unlesbar. | Gräf | Spain |


# Inscriptions one by one (using simple parallel computing)

In [0]:
from concurrent.futures import ThreadPoolExecutor

In [0]:
%%time
#### TEST without parallel computing:
all_inscriptions = []
for num in range(1,200): 
  currently_parsed = get_inscription_data(num)
  # get_inscription_data returns a single dict, so append it (extend would add only its keys)
  all_inscriptions.append(currently_parsed)

CPU times: user 2.23 s, sys: 113 ms, total: 2.34 s
Wall time: 2min 48s


In [0]:
%%time
### TEST with parallel computing
### to make N requests in parallel, we split the inscription numbers into consecutive batches: [1,2,3], [4,5,6], [7,8,9]
all_inscriptions = []
for num in range(1,200, 100): 
  actual_nums = list(range(num, num+100))
  with ThreadPoolExecutor(max_workers=100) as pool:
    currently_parsed = list(pool.map(get_inscription_data,actual_nums))
  all_inscriptions.extend(currently_parsed)

CPU times: user 3.76 s, sys: 257 ms, total: 4.02 s
Wall time: 13.2 s


OK, the test clearly demonstrates that using 100 workers in parallel is more than 10 times faster (13.2 s versus 2 min 48 s). Let's scale it up for the whole dataset.
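
The batching pattern used in the test above can be isolated as follows. This is a sketch under stated assumptions: `fetch_in_batches` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `get_inscription_data` so that the example runs without contacting the EDH server. It relies on the fact that `pool.map` returns results in the order of its inputs.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_in_batches(fetch, start, stop, batch_size, max_workers):
    """Apply `fetch` to every number in range(start, stop), one batch at a time."""
    results = []
    for first in range(start, stop, batch_size):
        # cap the last batch so that it never runs past `stop`
        batch = range(first, min(first + batch_size, stop))
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map yields results in input order, whatever the completion order
            results.extend(pool.map(fetch, batch))
    return results

def fake_fetch(num):
    # hypothetical stand-in for get_inscription_data(num)
    return {"id": "HD%06d" % num}

records = fetch_in_batches(fake_fetch, 1, 11, batch_size=4, max_workers=4)
print(len(records), records[0]["id"], records[-1]["id"])
```

Processing one batch at a time keeps the number of simultaneous requests bounded by the batch size, which is presumably gentler on the server than submitting every number at once.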

In [15]:
%%time
### main run of the function
all_inscriptions = []
for num in range(1,90000, 200): 
  actual_nums = list(range(num, num+200))
  with ThreadPoolExecutor(max_workers=300) as pool:
    currently_parsed = list(pool.map(get_inscription_data,actual_nums))
  all_inscriptions.extend(currently_parsed)

CPU times: user 32min 3s, sys: 2min 34s, total: 34min 38s
Wall time: 1h 50min 50s


In [0]:
inscriptions_data_df = pd.DataFrame(all_inscriptions)

In [17]:
s.put(sddk_url + "SDAM_data/EDH/EDH_onebyone.json", data=inscriptions_data_df.to_json())

<Response [204]>