# Introduction

* **Description**

This script contains a series of functions for extracting numerical dates (e.g. an interval 101-200) from a historical dataset expressing this information in a textual form (e.g. "2nd c. AD"). The functions have been developed especially for extracting dates from the PHI dataset, but might be reused for other applications.

At the core of the script is the function `date_extractor()`. This function takes as input a textual date (e.g. "s. III/II a." [= "3rd or 2nd c. BC"]) and returns a dictionary of dating values, e.g.:
```python
{"not_before" : -300, "not_after" : -101, "date_tags": ["range", "cent", "morece"]}
```
where `"date_tags"` contains tags specifying what kind of dating it is: `range` means that it is an interval; `cent` means that the interval is based on information about centuries; and `morece` implies that there is more than one century.
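As a rough illustration of how such a dictionary might be consumed downstream, the sketch below reads the tags and derives the length of the interval. The helper name `summarize_dating` is hypothetical and not part of the script; the key names simply follow the example above.

```python
def summarize_dating(dating):
    """Return a short human-readable summary of a dating dictionary."""
    start, stop = dating["not_before"], dating["not_after"]
    tags = dating.get("date_tags", [])
    # inclusive number of years covered by the interval
    n_years = stop - start + 1
    kind = "century-based" if "cent" in tags else "other"
    return f"{start} to {stop} ({n_years} years, {kind})"

example = {"not_before": -300, "not_after": -101, "date_tags": ["range", "cent", "morece"]}
print(summarize_dating(example))
```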

`date_extractor()` relies upon a number of other functions designed to extract individual types of dates:
* `extract_ante_and_post(datation, dating)` looks for words like "post", "ante", "not before" etc. and modifies the numerical dating (in the `dating` dictionary) accordingly. E.g. "ante 305BC" means all years before the stop year `-305`; "not before the reign of Trajan" means all years after `97`.
* `extract_period(datation)` checks whether the textual datation contains a period (like "reign of Tiberius") which can be translated into an interval (14-37).
* `parse_centuries()` extracts intervals for individual centuries. It also deals with cases in which several centuries are present (e.g. "s. III/II a.") and even those where one is BC and another AD (e.g. "1st c. BC-1st c. AD").
* `modify_by_phase(datation, dating)` modifies the range in `dating` by checking for words like "beginning", "early", "late" and "end". We use these parameters:
  * "beginning": first 10% of the range (defined by "start" and "stop" in the `dating` dictionary)
  * "early": first 25% of the range
  * "late": last 25% of the range
  * "end": last 10% of the range
  * "ca.": extends the range by adding 10% on the left and 10% on the right
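The effect of these parameters on a single century can be worked through with a small stand-alone sketch. The function name `narrow_by_phase` and its signature are assumptions made for illustration only; it mirrors the percentages listed above rather than reproducing the script's own implementation.

```python
def narrow_by_phase(start, stop, phase):
    """Sketch: shrink or extend (start, stop) according to a phase keyword."""
    fractions = {"beginning": 0.10, "early": 0.25, "late": 0.25, "end": 0.10, "ca.": 0.10}
    shift = round(abs(stop - start) * fractions[phase])
    if phase in ("beginning", "early"):
        return start, start + shift      # keep the first part of the range
    if phase in ("late", "end"):
        return stop - shift, stop        # keep the last part of the range
    return start - shift, stop + shift   # "ca.": widen on both sides

# 2nd century BC, i.e. the interval (-200, -101)
for phase in ["beginning", "early", "late", "end", "ca."]:
    print(phase, narrow_by_phase(-200, -101, phase))
```

For the 2nd century BC this gives (-200, -175) for "early" and (-210, -91) for "ca.".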


* **data input**
  * `PHI_merged_[timestamp].json`
  * `PHI_overview` gsheet
* **data output**
  * `PHI_dated_[timestamp].json`
  * `PHI_overview` gsheet

* **author**: Vojtěch Kaše
* **last complete run**: 2020-06-30


# Requirements

In [1]:
import numpy as np  # as usual
import math
import pandas as pd
import re

import sys
import requests
from bs4 import BeautifulSoup
import json

import datetime as dt
# for simple parallel computing:
from concurrent.futures import ThreadPoolExecutor

import gspread
from gspread_dataframe import get_as_dataframe, set_with_dataframe
from google.oauth2 import service_account # based on google-auth library

import sddk

# Authentication

In [2]:
# login to sciencedata 
conf = sddk.configure("SDAM_root", "648597@au.dk")

sciencedata.dk username (format '123456@au.dk'): 648597@au.dk
sciencedata.dk password: ········
connection with shared folder established with you as its owner
endpoint variable has been configured to: https://sciencedata.dk/files/SDAM_root/


In [3]:
# to access gsheet, you need Google Service Account key json file
# I have mine located in my personal space on sciencedata.dk, so I read it from there:

# (1) read the file and parse its content
file_data = conf[0].get("https://sciencedata.dk/files/ServiceAccountsKey.json").json()
# (2) transform the content into credentials object
credentials = service_account.Credentials.from_service_account_info(file_data)
# (3) specify your usage of the credentials
scoped_credentials = credentials.with_scopes(['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive'])
# (4) use the constrained credentials for authentication of gspread package
gc = gspread.Client(auth=scoped_credentials)

PHI_overview = gc.open_by_url("https://docs.google.com/spreadsheets/d/1zfTw0Hf304maBmrYvaMxRLnv1zfAVFixrtGTTsLCcT4/edit?usp=sharing")

# Read data

In [4]:
# read the PHI dataset from sciencedata.dk
PHI = sddk.read_file("SDAM_data/PHI/PHI_lemmatized_20201217.json", "df", conf)
# older version used during development: PHI = sddk.read_file("SDAM_data/PHI/PHI_enriched_raw.json", "df", conf)
# print first 5 rows of the data
PHI.head(5)

Unnamed: 0,URL,Book,Text,hdr1,hdr2,tildeinfo,note,lines,metadata,data,...,PHI_ID,string_pythia,clean_text_conservative,clean_text_interpretive_word,clean_text_interpretive_sentence,clean_text_pythia,sents,sents_N,lem_sents,lemmata
0,/text/1?location=1701&patt=&bookid=4&offset=0&...,IG I³,1,Regions\nAttica (IG I-III),IG I³\n1,Att. — Ath.: Akr. — stoich. 35 — c. 510-500 a....,,12,1\n\n\n\n5\n\n\n\n\n10\n\n,ἔδοχσεν το͂ι δέμοι· τ̣[ὸς ἐ Σ]αλαμ̣[ῖνι κλερόχ...,...,1,ἔδοχσεν τοι δέμοι τ[ὸς ἐ σ]αλαμ[ῖνι κλερόχ]ος ...,ἔδοχσεν το͂ι δέμοι ταλαμος οἰκε͂ν ἐᾶ Σαλαμῖνι ...,ἔδοχσεν το͂ι δέμοι τὸς ἐ Σαλαμῖνι κλερόχος οἰκ...,ἔδοχσεν το͂ι δέμοι τὸς ἐ Σαλαμῖνι κλερόχος οἰκ...,ἔδοχσεν τοι δέμοι τὸς ἐ σαλαμῖνι κλερόχος οἰκε...,[ἔδοχσεν τοι δέμοι τὸς ἐ σαλαμῖνι κλερόχος οἰκ...,1,"[[ἔδοχσεν, δέμοι, Σαλαμίς, κλερόχος, οἰκεν, Σα...","[ἔδοχσεν, δέμοι, Σαλαμίς, κλερόχος, οἰκεν, Σαλ..."
1,/text/2?location=1701&patt=&bookid=4&offset=0&...,IG I³,2,Regions\nAttica (IG I-III),IG I³\n2,Att. — non-stoich. — c. 500 a.,,14,1\n\n\n\n5\n\n\n\n\n10\n\n\n\n,[․․8-9․․․]ν̣ βολ — — — — — — — — — —\n[․6-7․․]...,...,2,[--------9---]ν βολ ---------- [------7--] α ἑ...,ν βολ α ℎεκον σιον γνοσθε͂ι δὲ ν ἀτεχνος μὲ π ...,ν βολ α ℎεκον σιον γνοσθε͂ι δὲ ν ἀτεχνος μὲ π ...,"ν βολ ․ α ⋮ ℎεκον σιον, γνοσθε͂ι δὲ ν ἀτεχνος ...",ν βολ α ἑκον σιον γνοσθει δὲ ν ἀτεχνος μὲ π ἄλ...,[ν βολ α ἑκον σιον γνοσθει δὲ ν ἀτεχνος μὲ π ἄ...,1,"[[βολ, ἑκών, σίον, γνοσθει, ἄτεχνος, μεδὲ, κελ...","[βολ, ἑκών, σίον, γνοσθει, ἄτεχνος, μεδὲ, κελε..."
2,/text/3?location=1701&patt=&bookid=4&offset=0&...,IG I³,3,Regions\nAttica (IG I-III),IG I³\n3,Att. — stoich. 21 — 490-480 a.,,13,1\n\n\n\n5\n\n\n\n\n10\n\n\n,[․]αρ[․․․․]ι ℎερακλειο[․․5․․]\n[․]αρ̣ο#⁷[․] τι...,...,3,[-]αρ[----]ι ἑρακλειο[-----] [-]αρο [-] τιθένα...,αρι ℎερακλειο αρο τιθέναι τὸς ἀέτας τριάκοντα ...,αρι ℎερακλειο αρο τιθέναι τὸς ἀθλοθέτας τριάκο...,αρι ℎερακλειο αρο τιθέναι τὸς ἀθλοθέτας τριάκο...,αρ ι ἑρακλειο αρο τιθέναι τὸς ἀθλοθέτας τριάκο...,[αρ ι ἑρακλειο αρο τιθέναι τὸς ἀθλοθέτας τριάκ...,1,"[[ἑρακλειο, ἀρόω, τίθημι, ἀθλοθέτης, ἀνήρ, ἄγο...","[ἑρακλειο, ἀρόω, τίθημι, ἀθλοθέτης, ἀνήρ, ἄγον..."
3,/text/4?location=1701&patt=&bookid=4&offset=0&...,IG I³,4,Regions\nAttica (IG I-III),IG I³\n4,Att. — stoich. 38 — 485/4 a.,,56,face A.1\n\n\n\n5\n\n\n\n\n10\n\n\n\n\n15\n\n\...,[․․․․․․․․․․․․․․․․․․38․․․․․․․․․․․․․․․․․․]\n[․․․...,...,4,[--------------------------------------] [----...,δέ τις αν ἒ φρορὰν μ ντέκοντα δχμὰς τ ας ℎες π...,ἐὰν δέ τις αν ἒ φρορὰν μὲ πεντέκοντα δραχμὰς τ...,ἐὰν δέ τις αν ⋮ ἒ φρορὰν ⋮ μὲ πεντέκοντα ⋮ δρα...,ἐὰν δέ τις αν ἒ φρορὰν μὲ πεντέκοντα δραχμὰς τ...,[ἐὰν δέ τις αν ἒ φρορὰν μὲ πεντέκοντα δραχμὰς ...,2,"[[τὶς, φρορὰν, πεντέκοντα, δραχμή, τ, πρᾶχσιν,...","[τὶς, φρορὰν, πεντέκοντα, δραχμή, τ, πρᾶχσιν, ..."
4,/text/5?location=1701&patt=&bookid=4&offset=0&...,IG I³,5,Regions\nAttica (IG I-III),IG I³\n5,Att. — c. 500 a.,,6,1\n\n\n\n5\n,[ἔδοχσε]ν [⋮ τε͂ι βολε͂ι] ⋮ καὶ [τ]ο͂ι δέμοι ⋮...,...,5,[ἔδοχσε]ν [ τει βολει] καὶ [τ]οι δέμοι ὅτε παρ...,ν καὶ ο͂ι δέμοι ℎότε Παραιβάτες λεια θν τὸς ℎι...,ἔδοχσεν τε͂ι βολε͂ι καὶ το͂ι δέμοι ℎότε Παραιβ...,ἔδοχσεν ⋮ τε͂ι βολε͂ι ⋮ καὶ το͂ι δέμοι ⋮ ℎότε ...,ἔδοχσεν τει βολει καὶ τοι δέμοι ὅτε παραιβάτες...,[ἔδοχσεν τει βολει καὶ τοι δέμοι ὅτε παραιβάτε...,1,"[[ἔδοχσεν, τει, βολει, δέμοι, παραιβάτες, γραμ...","[ἔδοχσεν, τει, βολει, δέμοι, παραιβάτες, γραμμ..."


# Raw date column

In the PHI dataset, the datation information is usually contained in the "tildeinfo" column. "tildeinfo" takes the form of a list, with individual elements separated by " — ". Unfortunately, this list does not have a fully consistent structure. Typically, the datation information is the last element of the list (e.g. "Dacia Sup. — Tibiscum (Jupa) — 2nd/3rd c. AD", PH298501), but not always (e.g. "N. Black Sea — Pantikapaion (Kerch) — 1st c. BC — IosPE IV 253", PH183001). Thus, our first task is to extract the element which most probably contains the datation information.


In [5]:
def get_date_from_tildeinfo(tildeinfo):
  try:
    tildeinfo_list = tildeinfo.split("— ")
    datation = tildeinfo_list[-1]
    for el in tildeinfo_list:
      if any(time_indicator in el for time_indicator in [" a.", " p.", "BC", "AD", "period", "reign"]):
        datation = el.partition("\n")[0]
        break
  except: 
    datation = ""
  return datation 

In [6]:
# test 1
get_date_from_tildeinfo("N. Black Sea — Pantikapaion (Kerch) — 1st c. BC — IosPE IV 253")

'1st c. BC '

In [7]:
# test 2
get_date_from_tildeinfo("Att. — Athens: Agora — stoich. 29 — 301/0-295/4 a. — *Hesp. 13.1944.242,7 — *SEG 24.119; 29.93")

'301/0-295/4 a. '
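Both results above keep a trailing space, because the string is split on "— " and the elements are not stripped. Should that matter downstream, one option is to strip the selected element, as in the hedged sketch below. `get_date_stripped` and `TIME_INDICATORS` are hypothetical names; this is a variant for illustration, not the function applied to the dataset in the next cell.

```python
TIME_INDICATORS = [" a.", " p.", "BC", "AD", "period", "reign"]

def get_date_stripped(tildeinfo):
    """Variant sketch: return the date element without surrounding whitespace."""
    if not isinstance(tildeinfo, str):
        return ""  # e.g. missing values
    elements = tildeinfo.split("—")
    datation = elements[-1]  # fall back to the last element
    for el in elements:
        # test the unstripped element so that " a." / " p." can still match
        if any(indicator in el for indicator in TIME_INDICATORS):
            datation = el.partition("\n")[0]
            break
    return datation.strip()

print(get_date_stripped("N. Black Sea — Pantikapaion (Kerch) — 1st c. BC — IosPE IV 253"))
```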

In [8]:
# application on the whole dataset
PHI["raw_date"] = PHI.apply(lambda row: get_date_from_tildeinfo(row["tildeinfo"]), axis=1)

# Generating a sample
For development purposes, the functions below were first tested on a representative sample from the dataset, containing every 500th inscription, i.e. inscriptions PH2501, PH3001, ..., PH218501 etc.

In [9]:
# generate sample for testing purposes:
PHI_by_500 = PHI[PHI["PHI_ID"].isin(range(1, 300000, 500))]
PHI_by_500.head(5)

Unnamed: 0,URL,Book,Text,hdr1,hdr2,tildeinfo,note,lines,metadata,data,...,string_pythia,clean_text_conservative,clean_text_interpretive_word,clean_text_interpretive_sentence,clean_text_pythia,sents,sents_N,lem_sents,lemmata,raw_date
0,/text/1?location=1701&patt=&bookid=4&offset=0&...,IG I³,1,Regions\nAttica (IG I-III),IG I³\n1,Att. — Ath.: Akr. — stoich. 35 — c. 510-500 a....,,12,1\n\n\n\n5\n\n\n\n\n10\n\n,ἔδοχσεν το͂ι δέμοι· τ̣[ὸς ἐ Σ]αλαμ̣[ῖνι κλερόχ...,...,ἔδοχσεν τοι δέμοι τ[ὸς ἐ σ]αλαμ[ῖνι κλερόχ]ος ...,ἔδοχσεν το͂ι δέμοι ταλαμος οἰκε͂ν ἐᾶ Σαλαμῖνι ...,ἔδοχσεν το͂ι δέμοι τὸς ἐ Σαλαμῖνι κλερόχος οἰκ...,ἔδοχσεν το͂ι δέμοι τὸς ἐ Σαλαμῖνι κλερόχος οἰκ...,ἔδοχσεν τοι δέμοι τὸς ἐ σαλαμῖνι κλερόχος οἰκε...,[ἔδοχσεν τοι δέμοι τὸς ἐ σαλαμῖνι κλερόχος οἰκ...,1,"[[ἔδοχσεν, δέμοι, Σαλαμίς, κλερόχος, οἰκεν, Σα...","[ἔδοχσεν, δέμοι, Σαλαμίς, κλερόχος, οἰκεν, Σαλ...",c. 510-500 a.
500,/text/501?location=1701&patt=&bookid=4&offset=...,IG I³,486,Regions\nAttica (IG I-III),IG I³\n486,Att. — stoich. — s. V a.,,11,\n1\n\n\n\n5\n\n\n\n\n10,— — — — — —\n[v?․]Ε#⁷ — — — —\n[v?]ολοι #⁷#⁷ —...,...,— — — — — —\n[v?․]Ε#⁷ — — — —\n[v?]ολοι #⁷#⁷ —...,Ε ολοι χρυσίον τον πετ σταθμὸ μὸν το͂ χρ ΔΔΔΔ ...,Ε ολοι χρυσίον τον πεταλ σταθμὸν σταθμὸν το͂ χ...,․Ε ολοι χρυσίον τον πεταλ σταθμὸν σταθμὸν το͂ ...,․Ε ολοι χρυσίον τον πεταλ σταθμὸν σταθμὸν το͂ ...,[․Ε ολοι χρυσίον τον πεταλ σταθμὸν σταθμὸν το͂...,1,"[[ολοι, χρυσίον, πεταλ, σταθμόν, σταθμόν, χρυσ...","[ολοι, χρυσίον, πεταλ, σταθμόν, σταθμόν, χρυσί...",s. V a.
1000,/text/1001?location=1701&patt=&bookid=4&offset...,IG I³,886,Regions\nAttica (IG I-III),IG I³\n886,Att. — Athens: Akropolis — c. 440? a. — IG I² ...,,2,1\n,— — —ένης vacat\nvacat,...,— — —ένης vacat\nvacat,ένης,ένης,ένης,ένης,[ένης],1,[[νέω]],[νέω],c. 440? a.
1500,/text/1501?location=1701&patt=&bookid=4&offset...,IG I³,1314,Regions\nAttica (IG I-III),IG I³\n1314,Att. — Salamis: Koulouri — c. 420-410? a. — IG...,,1,1,Χαιρέδημος. Λυκέας.,...,Χαιρέδημος. Λυκέας.,Χαιρέδημος Λυκέας,Χαιρέδημος Λυκέας,Χαιρέδημος. Λυκέας.,Χαιρέδημος. Λυκέας.,"[Χαιρέδημος., Λυκέας.]",2,"[[χαιρέδημος], [λυκέη]]","[χαιρέδημος, λυκέη]",c. 420-410? a.
2108,/text/2501?location=1700&patt=&bookid=5&offset...,IG II²,284,Regions\nAttica (IG I-III)\nAttica,IG II²\n284,Att. — stoich. 28 — ante 336/5,,17,1\n\n\n\n5\n\n\n\n\n10\n\n\n\n\n15\n\n,․․․7․․․#⁷#⁷#⁷․․․․․․․․18․․․․․․․․\n․․․κράτης κα[...,...,------- ------------------ ---κράτης κα[ὶ ----...,κράτης καν ὑπὸ τῶν λη υλῆι ὑπὸ τοὺς π προεδρεύ...,κράτης καὶ ἑάλωσαν ὑπὸ τῶν ληιστῶν ἐψήφισθαι τ...,"κράτης καὶ ἑάλωσαν ὑπὸ τῶν ληιστῶν, ἐψήφισθαι ...",κράτης καὶ ἑάλωσαν ὑπὸ τῶν ληιστῶν ἐψήφισθαι τ...,[κράτης καὶ ἑάλωσαν ὑπὸ τῶν ληιστῶν ἐψήφισθαι ...,1,"[[κρατύς, ἁλίσκομαι, ληιστῶν, ψηφίζω, βουλή, π...","[κρατύς, ἁλίσκομαι, ληιστῶν, ψηφίζω, βουλή, πρ...",ante 336/5


# Parse ante quem and post quem

In [10]:
### simple demonstration of the logic
datation = "not before 304 AD"
match = re.search(r"(not\s(before|bef\.)\s|non\sante\s)(\d+)", datation, flags=re.IGNORECASE)
if match:
  dating_update = {"start" : int(match.groups()[2]), "type" : "post"}
dating_update

{'start': 304, 'type': 'post'}

In [11]:
def extract_ante_and_post(datation, dating=None):
  if dating==None: dating = {"type" : "unknown"}
  if "unknown" in dating["type"]: 
    # if "NOT BEFORE"
    match = re.search("(not\s(before|bef\.)\s|non\sante\s)(\-?\d+)(\s|$)",  datation, flags=re.IGNORECASE)
    if match:
      if "AD" not in datation:
        start = (int(match.groups()[2]) * -1)
        dating_update = {"start" : start, "type" : "post"}
      else:
        dating_update = {"start" : int(match.groups()[2]), "type" : "post"}
    # if "BEFORE"
    else:
      match = re.search('(before\s|ante\s)(\-?\d+)(\s|$)', datation, flags=re.IGNORECASE)
      if match:
        if "AD" not in datation:
          dating_update = {"stop" : (int(match.groups()[1]) * -1) - 1, "type" : "ante"}
        else:
          dating_update = {"stop" : int(match.groups()[1]) - 1, "type" : "ante"}
      # if "NOT AFTER"
      else:
        match = re.search("(not\safter\s|non\spost\s)(\-?\d+)(\s|$)",  datation, flags=re.IGNORECASE)
        if match:
          if "AD" not in datation:
            dating_update = {"stop" : (int(match.groups()[1]) * -1), "type" : "ante"}
          else:
            dating_update = {"stop" : int(match.groups()[1]), "type" : "ante"}
        # if "AFTER"
        else:
            match = re.search('(after\s|aft.\s|post\s)(\-?\d+)(\s|$)', datation, flags=re.IGNORECASE)
            if match:
              if "AD" not in datation:
                dating_update = {"start" : (int(match.groups()[1]) * -1) + 1, "type" : "post"}
              else:
                dating_update = {"start" : int(match.groups()[1]) + 1, "type" : "post"}
            else:
              dating_update = dating
  elif "exact+or" in dating["type"]: 
    # if "NOT BEFORE"
    match = re.search("(not\s(before|bef\.)\s|non\sante\s)",  datation, flags=re.IGNORECASE)
    if match:
      dating_update = {"start" : dating["exact"], "or": {"start" : dating["or"]["exact"], "exact" : None}, "exact" : None, "type" : "post+or"}
    # if "BEFORE"
    else:
      match = re.search('(before\s|ante\s)', datation, flags=re.IGNORECASE)
      if match:
        dating_update = {"stop" : dating["exact"], "or": {"stop" : dating["or"]["exact"], "exact" : None}, "exact" : None, "type" : "ante+or"}
      # if "NOT AFTER"
      else:
        match = re.search("(not\safter\s|non\spost\s)",  datation, flags=re.IGNORECASE)
        if match:
              dating_update = {"stop" : dating["exact"], "or": {"stop" : dating["or"]["exact"], "exact" : None}, "exact" : None, "type" : "ante+or"}
        # if "AFTER"
        else:
            match = re.search('(after\s|aft.\s|post\s)', datation, flags=re.IGNORECASE)
            if match:
              dating_update = {"start" : dating["exact"], "or": {"start" : dating["or"]["exact"], "exact" : None}, "exact" : None, "type" : "post+or"}
            else:
              dating_update = dating
  elif "range" in dating["type"]:
    # if "NOT BEFORE"
    match = re.search("(not\s(before|bef\.)\s|non\sante\s)",  datation, flags=re.IGNORECASE)
    if match:
      dating_update = {"start" : dating["start"], "stop":None, "type" : dating["type"]+"+post"}
    # if "BEFORE"
    else:
      match = re.search('(before\s|ante\s)', datation, flags=re.IGNORECASE)
      if match:
        dating_update = {"stop" : int(dating["start"]) - 1, "start":None, "type" : dating["type"]+"+ante"}  
      # if "NOT AFTER"
      else:
        match = re.search("(not\safter\s|non\spost\s)",  datation, flags=re.IGNORECASE)
        if match:
          dating_update = {"stop" : dating["stop"], "start":None,"type" : dating["type"]+"+ante"}
        # if "AFTER"
        else:
          match = re.search('(after\s|aft.\s|post\s)', datation, flags=re.IGNORECASE)
          if match:
            dating_update = {"start" : int(dating["stop"]) + 1, "stop":None,"type" : dating["type"]+"+post"}
          else:
            dating_update = dating
  else: 
    #datation = re.sub("(not\s(after|bef\.)\s|non\spost)", "ante\s", datation)
    dating_update = dating
  if ("shortly" in datation) and ("shortly" not in dating_update["type"]):
    dating_update["type"] = dating_update["type"] + "+shortly"
  return dating_update

In [12]:
extract_ante_and_post("after 4th c. BC", {"type" : "range", "start": -400, "stop": -301})

{'start': -300, 'stop': None, 'type': 'range+post'}

In [13]:
extract_ante_and_post("shortly after 230 AD")

{'start': 231, 'type': 'post+shortly'}

In [14]:
datation, dating = "not after reign of Trajan", {"start": 98, "stop" : 117, "type" : "range+period", "era": None}
extract_ante_and_post(datation, dating)

{'stop': 117, 'start': None, 'type': 'range+period+ante'}

In [15]:
# example with "unknown"
dating = {"type" : "unknown", "era": None}
for datation in ["non post 230 AD", "shortly after 320 BC", "not bef. 114 BC","not after 317 AD", "before 200 AD", "Ante 114 BC", "post 2nd century BC"]:
  print(datation, extract_ante_and_post(datation, dating))

non post 230 AD {'stop': 230, 'type': 'ante'}
shortly after 320 BC {'start': -319, 'type': 'post+shortly'}
not bef. 114 BC {'start': -114, 'type': 'post'}
not after 317 AD {'stop': 317, 'type': 'ante'}
before 200 AD {'stop': 199, 'type': 'ante'}
Ante 114 BC {'stop': -115, 'type': 'ante'}
post 2nd century BC {'type': 'unknown', 'era': None}
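The outputs above follow a simple convention: BC years are stored as negative numbers, "before" and "after" exclude the stated year, and "not after" and "not before" include it. The arithmetic can be summarized in a compact sketch, assuming a hypothetical helper `bound_for` that is not part of the script:

```python
def bound_for(phrase, year, era="BC"):
    """Sketch: map a qualifier and a year to a (key, value) pair."""
    signed = -year if era == "BC" else year   # BC years are stored as negative integers
    rules = {
        "not before": ("start", signed),      # inclusive lower bound
        "after":      ("start", signed + 1),  # exclusive lower bound
        "not after":  ("stop", signed),       # inclusive upper bound
        "before":     ("stop", signed - 1),   # exclusive upper bound
    }
    return rules[phrase]

print(bound_for("before", 114, "BC"))     # ('stop', -115)
print(bound_for("not after", 317, "AD"))  # ('stop', 317)
print(bound_for("after", 320, "BC"))      # ('start', -319)
```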


# Parse periods

In [16]:
# read periods from our external coding
periods = get_as_dataframe(PHI_overview.worksheet("periods"))
periods

Unnamed: 0,period,start,stop,type,era,source,notes,link
0,Roman imp,-31,410,range+period,AD,PeriodO,,http://n2t.net/ark:/99152/p08m57hqcc5
1,Rom. Imp,-31,410,range+period,BC/AD,PeriodO,,http://n2t.net/ark:/99152/p08m57hqcc5
2,aet. imp.,-31,410,range+period,BC/AD,PeriodO,,http://n2t.net/ark:/99152/p08m57hqcc5
3,aet. Rom.,-146,324,range+period,BC/AD,,,
4,Roman period,-146,324,range+period,BC/AD,,,
5,reign of Hadrian,117,138,range+period,AD,PeriodO,,http://n2t.net/ark:/99152/p0jrrjbntfj
6,reign of Justinian,527,565,range+period,AD,PeriodO,,http://n2t.net/ark:/99152/p06c6g3r7ht
7,reign of Ant. Pius,138,161,range+period,AD,PeriodO,,http://n2t.net/ark:/99152/p06c6g3drk4
8,reign of Augustus,-27,14,range+period,BC/AD,PeriodO,,http://n2t.net/ark:/99152/p06c6g3xnmx
9,reign of Tiberius,14,37,range+period,AD,PeriodO,,http://n2t.net/ark:/99152/p0jrrjbts8w


In [17]:
periods_dict = periods.set_index("period").T.to_dict()
periods_dict["reign of Claudius"]

{'start': 41,
 'stop': 54,
 'type': 'range+period',
 'era': 'AD',
 'source': 'PeriodO',
 'notes': nan,
 'link': 'http://n2t.net/ark:/99152/p0jrrjb8spw'}

In [18]:
def extract_period(datation, dating=None):
  if (dating==None): dating = {"type" : "unknown"} 
  if dating["type"] == "unknown":
    for key in periods_dict.keys():
      if periods_dict[key]["notes"] != "alone":
        if re.search(key, datation, flags=re.IGNORECASE): # use lower cases to match everything
          dating_update = periods_dict[key]
          break
      elif re.search("^\s?" + key + "\s?$", datation):
          dating_update = periods_dict[key]
          break
      else:
          dating_update = {"type" : "unknown"}
    return dating_update
  else:
    return dating

In [19]:
# example:
for datation in ["Roman Imperial", "reign of Augustus", "Antonine period", "Christian Anderson", "Christian ","Book about Byzantine", " Byzantine",]:
  print({datation : extract_period(datation)})

{'Roman Imperial': {'start': -31, 'stop': 410, 'type': 'range+period', 'era': 'AD', 'source': 'PeriodO', 'notes': nan, 'link': 'http://n2t.net/ark:/99152/p08m57hqcc5'}}
{'reign of Augustus': {'start': -27, 'stop': 14, 'type': 'range+period', 'era': 'BC/AD', 'source': 'PeriodO', 'notes': nan, 'link': 'http://n2t.net/ark:/99152/p06c6g3xnmx'}}
{'Antonine period': {'start': 96, 'stop': 192, 'type': 'range+period', 'era': 'AD', 'source': 'PeriodO', 'notes': nan, 'link': 'http://n2t.net/ark:/99152/p06c6g34zjk'}}
{'Christian Anderson': {'type': 'unknown'}}
{'Christian ': {'start': 1, 'stop': 2000, 'type': 'range+period', 'era': 'AD', 'source': 'Vojtech', 'notes': 'alone', 'link': nan}}
{'Book about Byzantine': {'type': 'unknown'}}
{' Byzantine': {'start': 324, 'stop': 1453, 'type': 'range+period', 'era': 'AD', 'source': 'PeriodO', 'notes': 'alone', 'link': 'http://n2t.net/ark:/99152/p0m63njtm6w'}}
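The last four results show the effect of the `notes` column: a period flagged as "alone" is matched only when it makes up the whole string, whereas other periods are matched anywhere in the text. A self-contained sketch of that distinction follows, using a made-up two-entry table. `toy_periods` and `find_period` are hypothetical names; the year values are copied from the outputs above.

```python
import re

toy_periods = {
    "reign of Augustus": {"start": -27, "stop": 14, "notes": None},
    "Byzantine": {"start": 324, "stop": 1453, "notes": "alone"},
}

def find_period(datation):
    """Return the name of the first matching period, or None."""
    for name, info in toy_periods.items():
        if info["notes"] == "alone":
            # must be the entire string, apart from optional surrounding whitespace
            pattern = r"^\s?" + re.escape(name) + r"\s?$"
            flags = 0
        else:
            # may occur anywhere in the text, case-insensitively
            pattern = re.escape(name)
            flags = re.IGNORECASE
        if re.search(pattern, datation, flags=flags):
            return name
    return None

for text in ["during the Reign of Augustus", " Byzantine", "Book about Byzantine"]:
    print(repr(text), "->", find_period(text))
```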


# Parse "/" for individual dates

The "/" character is used in several different cases, each of which requires a slightly different approach. Here we parse cases in which it is used within individual date numbers, e.g. "114/3 BC", which is translated as an interval (-114, -113). However, if there is a longer range between the two numbers, the "/" character is treated as "or" and the alternative date is extracted into the "or" key within the dictionary.
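To make the two readings concrete, the sketch below completes an abbreviated second number from the first ("114/3" becomes 114 and 113) and then decides between "interval" and "or" by the distance between the two years. The helper name `read_slash` and the `max_gap` argument are illustrative assumptions; era handling (the sign of BC years) is left out for brevity.

```python
import re

def read_slash(datation, max_gap=3):
    """Sketch: classify 'N/M' as a short interval or as two alternatives."""
    match = re.search(r"(\d+)/(\d+)", datation)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # borrow the missing leading digits of the second number from the first
    missing = len(first) - len(second)
    if missing > 0:
        second = first[:missing] + second
    year1, year2 = int(first), int(second)
    kind = "range" if abs(year1 - year2) < max_gap else "or"
    return kind, year1, year2

print(read_slash("114/3"))    # ('range', 114, 113)
print(read_slash("340/39"))   # ('range', 340, 339)
print(read_slash("150/200"))  # ('or', 150, 200)
```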

In [20]:
# no capture groups are needed for a simple containment test
ors = PHI[PHI["raw_date"].str.contains(r"\d+/\d+")]["raw_date"].tolist()
len(ors)

9909

In [21]:
# examples of more "/" combined with "-"
[datation for datation in ors if re.search(r'(\d+)(\/)(\d+).?-.?(\d+)(\/)(\d+)', datation)][:10]

['441/0-440/39 a.',
 '430/29-427/6 a.',
 '418/7-415/4 a. ',
 '409/8-407/6 a.',
 '413/2-405/4 a.',
 '447/6-433/2 a.',
 '447/6-433/2 a.',
 '447/6-433/2 a.',
 '447/6-433/2 a.',
 '447/6-433/2 a.']

In [22]:
# examples of more "/" combined with " or "
[datation for datation in ors if re.search(r'(\d+)(\/)(\d+)().?(\sor\s).?(\d+)(\/)(\d+)', datation)][:10]

['238/9 or 242/3',
 '238/9 or 242/3',
 '262/3 or 266/7',
 '321/0 or 318/7',
 '175/4 or 172/1',
 '329/8 or 323/2',
 '340/39 or 313/2',
 '180/79 or 179/8 BC',
 '148/7 or 147/6 BC',
 '147/6 or 146/5 BC']

In [23]:
def complete_numbers(datation, date1, date2):
  # if the second number contains less numerals, try to complete it
  len_diff = len(date1) - len(date2)
  if len_diff > 0:
    date2 = date1[:len_diff] + date2
  # transform it into integer
  date1 = int(date1)
  date2 = int(date2)
  if ("AD" not in datation) and (date1 > date2):
    date1 = date1 * -1
    date2 = date2 * -1
    #if date1 > date2:
       #  date1, date2 = date2, date1
  return date1, date2

def match_or(datation, dating=None):
  if dating==None: dating = {"type" : "unknown"}
  if dating["type"] == "unknown":
    if (len(re.findall("[a-zA-Z]", datation)) < 4) or (re.search("BC|AD|early|late|beg|end|after|post|before|ante", datation)):
      matches = re.findall(r'(\d+)(\/)(\d+)', str(datation), flags=re.IGNORECASE)
      if len(matches) != 0:
          date1, date2 = complete_numbers(datation, matches[0][0], matches[0][2])
          if len(matches) > 1: # if there is more than one match
            date3, date4 = complete_numbers(datation, matches[1][0], matches[1][2])         
          #if date1 > date2:
          #  date1, date2 = date2, date1
          if re.search(r'(\d+)(\/)(\d+)().?(\sor\s).?(\d+)(\/)(\d+)', datation):
            if abs(date1 - date4) < 5: # if it is something like "331/0 or 330/29 BC"
              dating.update({"start" : date1, "stop": date4, "type" : "range"})
            else: # treat the or numbers as an alternative range
              dating.update({"start" : date1, "stop": date2, "type" : "range+or", "or": {"start" : date3, "stop": date4, "type" : "range"}})
          elif re.search(r'(\d+)(\/)(\d+).?-.?(\d+)(\/)(\d+)', datation):
            dating.update({"start" : date1, "stop": date4, "type" : "range"})
          else:
            if abs(date1 - date2) < 3:
              dating.update({"start" : date1, "stop": date2, "type" : "range"})
            else:
              dating.update({"exact" : date1, "or": {"exact" : date2, "type" : "exact"}, "type" : "exact+or"})
          dating = extract_ante_and_post(datation, dating)
          return dating
          #if dating_update["type"] == "post":
          #   return {"start" : date1, "or": {"start" : date2, "type" : "post"}, "type" : "post+or"}
          #elif extract_ante_and_post(datation, dating)["type"] == "ante":
          #  return {"stop" : date1, "or": {"stop" : date2, "type" : "ante"}, "type" : "ante+or"}
          #else:
          #  return {"exact" : date1, "or": {"exact" : date2, "type" : "exact"}, "type" : "exact+or"}
      else:
        return dating
  return {"type" : "unknown"}

In [24]:
match_or('ASAA 6/7 (1923/4) 446, 161')

{'type': 'unknown'}

In [25]:
match_or(" 12/1 BC")

{'type': 'range', 'start': -12, 'stop': -11}

In [26]:
match_or("229/30 or 230/1")

{'type': 'range', 'start': 229, 'stop': 231}

In [27]:
match_or("27/6 or 17/8")

{'type': 'range+or',
 'start': -27,
 'stop': -26,
 'or': {'start': 17, 'stop': 18, 'type': 'range'}}

In [28]:
match_or("14/13 or 13/12 BC")

{'type': 'range', 'start': -14, 'stop': -12}

In [29]:
match_or("aft. 14/13 or 13/12 BC")

{'start': -11, 'stop': None, 'type': 'range+post'}

In [30]:
match_or("139/8-122/1 BC")

{'type': 'range', 'start': -139, 'stop': -121}

# Parse phase

In [31]:
# parametrization 
early_late = 0.25 # i.e. first or last 25% of the range
beginning_end = 0.1 # i.e. first or last 10% of the range
middle = 0.05 # i.e. 5% left of the middle, 5% right of the middle
ca = 0.1 # i.e. plus 10% of the range on the left side and plus 10% on the right side

In [32]:
# application of this function requires that you already have a dating dictionary having either start and stop or an exact date (for "ca.")
def modify_by_phase(datation, dating):
  if (not "phase" in dating["type"]) and (not "morece" in dating["type"]):
    try:
      start, stop = dating["start"], dating["stop"]
      try: 
        duration = abs(dating["stop"] - dating["start"])
      except:
        duration = 1
      datation = datation.lower()
      if "firsthalf" in datation:
        dating["stop"] = start + round(duration * 0.5)
        dating["type"] = dating["type"] + "+phase+firsthalf"
      if "secondhalf" in datation:
        dating["start"] = start + round(duration * 0.5)
        dating["type"] = dating["type"] + "+phase+secondhalf"
      if "early" in datation:
        coef = early_late
        dating["stop"] = start + round(duration * coef)
        dating["type"] = dating["type"] + "+phase+early"
      if "late" in datation:
        if "late antiquity" not in datation:
          coef = early_late
          dating["start"] = stop - round(duration * coef)
          dating["type"] = dating["type"] + "+phase+late"
      if re.search("(init\.\s|beginning|beg\.?\s)", datation):
        coef = beginning_end
        dating["stop"] = start + round(duration * coef)
        dating["type"] = dating["type"] + "+phase+beg"
      if re.search("(end\s|fin.\s)", datation):
        coef = beginning_end
        dating["start"] = stop - round(duration * coef)
        dating["type"] = dating["type"] + "+phase+end"
      if re.search("(middle|mid\.?\s|med\.\s)", datation):
        coef = middle # that means: "middle 2nd c. AD" => 146 - 155
        dating_avr = (dating["start"] + dating["stop"]) / 2
        dating["start"] = round(dating_avr - (coef * duration))
        dating["stop"] = round(dating_avr + (coef * duration))
        dating["type"] = dating["type"] + "+phase+middle"
      if re.search("ca\.\s", datation):
        dating["type"] = dating["type"] + "+phase+ca"
        if ("exact" in dating["type"]) or duration < 10:
          dating.update({"start" : dating["exact"] - 5, "stop" : dating["exact"] + 5})
          dating["exact"] = None
        else:
          dating["start"] = start - round(duration * ca)
          dating["stop"] = stop + round(duration * ca)
      return dating
    except:
      return dating
  else: 
    return dating

In [33]:
# example 0
modify_by_phase("beg. 5th c. BC", {"start" : -500, "stop" : -401, "type" : "range"})

{'start': -500, 'stop': -490, 'type': 'range+phase+beg'}

In [34]:
# example 1: "ca." in case of individual date
datation = "ca. 200 BC"
dating = {"start" : None, "stop" : None, "exact" : -200, "type" : "exact", "era" : "BC"}
print(modify_by_phase(datation, dating))

{'start': -205, 'stop': -195, 'exact': None, 'type': 'exact+phase+ca', 'era': 'BC'}


In [35]:
#  example 2: "ca." in case of century
datation = "ca. s. II BC"
dating = {"start" : -200, "stop" : -101, "type" : "range+cent"}
print(modify_by_phase(datation, dating))

{'start': -210, 'stop': -91, 'type': 'range+cent+phase+ca'}


In [36]:
#  example 3: "early"
datation = "early 2nd BC"
dating = {"start" : -200, "stop" : -101, "type" : "range+cent"}
print(modify_by_phase(datation, dating))

{'start': -200, 'stop': -175, 'type': 'range+cent+phase+early'}


# Parse centuries

In [37]:
# read centuries table from gsheet
centuries_df = get_as_dataframe(PHI_overview.worksheet("centuries"))
centuries_df.set_index("arabic", inplace=True)
centuries_df

arabic,roman,start_BC,stop_BC,start_AD,stop_AD
8th,VIII,-800,-701,701,800
7th,VII,-700,-601,601,700
6th,VI,-600,-501,501,600
4th,IV,-400,-301,301,400
5th,V,-500,-401,401,500
3rd,III,-300,-201,201,300
2nd,II,-200,-101,101,200
1st,I,-100,-1,1,100


In [38]:
arabics = centuries_df.index.tolist()
arabics

['8th', '7th', '6th', '4th', '5th', '3rd', '2nd', '1st']

In [39]:
centuries_df["roman"].tolist()

['VIII', 'VII', 'VI', 'IV', 'V', 'III', 'II', 'I']

In [40]:
# navigating through the dataframe using index and ".loc[]"
centuries_df.loc["3rd"]["roman"]

'III'

In [41]:
any(re.search(arabic, "2nd c. AD") for arabic in arabics)

True

In [42]:
def parse_centuries(datation, dating=None):
  if dating==None: dating = {"type" : "unknown"}
  if dating["type"] == "unknown": 
    if any(cent for cent in centuries_df["roman"].tolist() if (re.search("^\s?" + cent + "($|/)", datation)) or (re.search("(s|c)\.", datation))):
      for roman, arabic in zip(centuries_df["roman"].tolist(), arabics):
          match = re.search(r"(^|\s|/)(" + roman + r")($|\s|/)", datation)
          if match:
            datation = re.sub(match[2], arabic, datation)
    cents = [cent for cent in centuries_df.index.tolist() if re.search(cent, datation)] # we have the centuries mentioned, but not in right order :-)
    if len(cents) > 0:
      cents_list = re.split("-|/|\sor\s|\sand|&|,", datation)
      if len(cents_list) > 1:
        try:
          century1 = [re.sub(".*" + num + ".*", num, cents_list[0])  for num in arabics if num in cents_list[0]][0]
          century2 = [re.sub(".*" + num + ".*", num, cents_list[1])  for num in arabics if num in cents_list[1]][0]
          if (" AD" in datation) and (" BC" not in datation): # if explicit AD and only AD:
            start = modify_by_phase(cents_list[0], {"start" : centuries_df.loc[century1]["start_AD"], "stop" : centuries_df.loc[century1]["stop_AD"], "type" : "cent"})["start"]
            stop = modify_by_phase(cents_list[1], {"start" : centuries_df.loc[century2]["start_AD"], "stop" : centuries_df.loc[century2]["stop_AD"], "type" : "cent"})["stop"]
            era = "AD"
          elif (" BC" in cents_list[0]) and (" AD" in cents_list[1]):
            start = modify_by_phase(cents_list[0], {"start" : centuries_df.loc[century1]["start_BC"], "stop" : centuries_df.loc[century1]["stop_BC"], "type" : "cent"})["start"]
            stop = modify_by_phase(cents_list[1], {"start" : centuries_df.loc[century2]["start_AD"], "stop" : centuries_df.loc[century2]["stop_AD"], "type" : "cent"})["stop"]
            era = "BC/AD"
          else:
            start = modify_by_phase(cents_list[0], {"start" : centuries_df.loc[century1]["start_BC"], "stop" : centuries_df.loc[century1]["stop_BC"], "type" : "cent"})["start"]
            stop = modify_by_phase(cents_list[1], {"start" : centuries_df.loc[century2]["start_BC"], "stop" : centuries_df.loc[century2]["stop_BC"], "type" : "cent"})["stop"]
            era = "BC"
          dating_update = {"start" : start, "stop" : stop, "era" : era, "type" : "range+cent+morece"}
        except:
          try: # try to identify at least the first of them 
            century = [re.sub(".*" + num + ".*", num, datation)  for num in arabics if num in datation][0]
            if " AD" in datation: 
              start = centuries_df.loc[century]["start_AD"]
              stop = centuries_df.loc[century]["stop_AD"]
              era = "AD"
            else:
              start = centuries_df.loc[century]["start_BC"]
              stop = centuries_df.loc[century]["stop_BC"]
              era = "BC"
            dating_update = {"start" : start, "stop" : stop, "era" : era, "type" : "range+cent"}
            dating_update = modify_by_phase(datation, dating_update)
          except:
            dating_update = dating
      elif len(cents) == 1:
        century = [re.sub(".*" + num + ".*", num, datation)  for num in arabics if num in datation][0]
        if " AD" in datation: 
          start = centuries_df.loc[century]["start_AD"]
          stop = centuries_df.loc[century]["stop_AD"]
          era = "AD"
        else:
          start = centuries_df.loc[century]["start_BC"]
          stop = centuries_df.loc[century]["stop_BC"]
          era = "BC"
        dating_update = {"start" : start, "stop" : stop, "era" : era, "type" : "range+cent"}
        dating_update = modify_by_phase(datation, dating_update)
      else:
        dating_update = {"type": "unknown"}
      return dating_update
    else:
      return dating
  else:
    return dating

In [43]:
# example 1
datation = "III/II" # "a." and "p." have already been replaced by "BC" and "AD"
parse_centuries(datation)

{'start': -300, 'stop': -101, 'era': 'BC', 'type': 'range+cent+morece'}

In [44]:
datation = "fin. s. III/II" # "a." and "p." have already been replaced by "BC" and "AD"
parse_centuries(datation)

{'start': -211, 'stop': -101, 'era': 'BC', 'type': 'range+cent+morece'}

In [45]:
# example
datation = "late 1st c. AD" # "a." and "p." have already been replaced by "BC" and "AD"
parse_centuries(datation, {"type": "unknown"})

{'start': 75, 'stop': 100, 'era': 'AD', 'type': 'range+cent+phase+late'}

In [46]:
# example 
parse_centuries("2nd/3rd c. AD", {"era": "AD", "type": "unknown"})

{'start': 101, 'stop': 300, 'era': 'AD', 'type': 'range+cent+morece'}

In [47]:
# example
parse_centuries("late 2nd c. BC/beginning 1st c. BC")

{'start': -126, 'stop': -90, 'era': 'BC', 'type': 'range+cent+morece'}
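
The century boundaries in the examples above follow a simple pattern: the n-th century AD runs from year 100(n-1)+1 to year 100n, and the n-th century BC from -100n to -100(n-1)-1. The following is a minimal sketch of that arithmetic. It assumes that `centuries_df` stores exactly these boundaries (which is consistent with the outputs shown, but the table itself is defined elsewhere); the helper name `century_bounds` is hypothetical and not part of the script.

```python
def century_bounds(century, era):
    """Return (start, stop) years of the n-th century in the given era.

    Hypothetical helper for illustration; the script itself looks these
    values up in centuries_df instead of computing them.
    """
    if century < 1:
        raise ValueError("century must be a positive integer")
    if era == "AD":
        return (century - 1) * 100 + 1, century * 100
    if era == "BC":
        return -century * 100, -(century - 1) * 100 - 1
    raise ValueError("era must be 'AD' or 'BC'")

# "III/II" read as BC: start of the 3rd century BC to the end of the 2nd century BC
start = century_bounds(3, "BC")[0]
stop = century_bounds(2, "BC")[1]
print(start, stop)  # -300 -101
```

Taking the start of the first century and the stop of the second mirrors how `parse_centuries()` combines two centuries into one `range+cent+morece` interval.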

# Simple dates and ranges
This is the last function we apply, since it ignores all other potentially relevant information and looks only for individual dates and straightforward ranges.

In [48]:
def simple_dates_and_ranges(datation, dating=None):
  if dating is None: dating = {"type" : "unknown"}
  if dating["type"] == "unknown":
    if " AD" in datation:
      dating["era"] = "AD" 
      try:
        date_both = re.search(r'(\d+)(\-)(\d+)', datation, flags=re.IGNORECASE).groups()
        if any(time_indicator in datation for time_indicator in [" BC", " AD", "prob", "possi"]):
          date1, date2 = date_both[0], date_both[2]
          len_diff = len(date1) - len(date2)
          if len_diff > 0:
            date2 = date1[:len_diff] + date2
          dating.update({"start" : int(date1), "stop" : int(date2), "type" : "range"})
      except:  
        try:
          match = re.search(r'\d+', datation, flags=re.IGNORECASE)
          if any(time_indicator in datation for time_indicator in [" BC", " AD", "prob", "possi"]):
            dating.update({"exact" : int(match[0]), "type": "exact"})
        except:
          pass
    else:
      try:
        date_both = re.search(r'(\d+)(\-)(\d+)', datation, flags=re.IGNORECASE).groups()
        #if any(time_indicator in datation for time_indicator in [" BC", " AD", "prob", "possi"]):
        try:
          date1, date2 = date_both[0], date_both[2]
          len_diff = len(date1) - len(date2)
          if len_diff > 0:
            date2 = date1[:len_diff] + date2
          date1, date2 = int(date1) * -1, int(date2) * -1
          if date1 < date2:
            dating.update({"start" : date1, "stop" : date2, "type" : "range", "era" : "BC"})
        except:
          pass
      except:  
        match = re.match(r"^\s?(\d+)\s?$", datation) # if it is just one number and nothing else (single numbers are good)
        if match:
          dating.update({"exact" : int(match[1]) * -1, "type": "exact"})
        else:
          if any(time_indicator in datation for time_indicator in [" BC", " AD", "prob", "possi"]):
            match = re.search(r'\d+', datation, flags=re.IGNORECASE)
            if match: # guard: a time indicator may be present without any digits
              dating.update({"exact" : int(match[0]) * -1, "type": "exact"})
  return dating

In [49]:
simple_dates_and_ranges('ASAA 6/7 (1923/4) 446, 161')

{'type': 'unknown'}

In [50]:
simple_dates_and_ranges("4th or 5th c. AD", {"type": "century"})

{'type': 'century'}

In [51]:
simple_dates_and_ranges("320 BC")

{'type': 'exact', 'exact': -320}

In [52]:
simple_dates_and_ranges("213 AD")

{'type': 'exact', 'era': 'AD', 'exact': 213}

In [53]:
simple_dates_and_ranges("213-4 AD")

{'type': 'range', 'era': 'AD', 'start': 213, 'stop': 214}

In [54]:
simple_dates_and_ranges("124-15 BC")

{'type': 'range', 'start': -124, 'stop': -115, 'era': 'BC'}
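
The last two examples rely on the expansion of abbreviated end years: when the second number has fewer digits than the first, the missing leading digits are borrowed from the first. A minimal standalone sketch of that step follows; the helper name `expand_abbreviated_range` is hypothetical, and the sketch leaves out the era handling (negating BC years) that `simple_dates_and_ranges()` performs afterwards.

```python
import re

def expand_abbreviated_range(text):
    """Return (first, second) as ints, completing an abbreviated second year.

    Hypothetical helper for illustration; signs for BC years are not applied here.
    """
    match = re.search(r"(\d+)-(\d+)", text)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    missing = len(first) - len(second)
    if missing > 0:
        # borrow the leading digits of the first year, e.g. "213" + "4" -> "214"
        second = first[:missing] + second
    return int(first), int(second)

print(expand_abbreviated_range("213-4 AD"))   # (213, 214)
print(expand_abbreviated_range("124-15 BC"))  # (124, 115)
```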

# Main function

In [55]:
# function to modify the structure of the data: "start"/"stop" become "not_before"/"not_after" and "type" becomes a list of "date_tags"
def change_dict_structure(dictionary):
  try:
    for tup in [["not_before", "start"], ["not_after", "stop"]]:
      try: dictionary[tup[0]] = int(dictionary.pop(tup[1]))
      except: dictionary[tup[0]] = None
    try: dictionary.update({"not_before" : int(dictionary["exact"]), "not_after" : int(dictionary["exact"])})
    except: pass
    try: dictionary["date_tags"] = dictionary.pop("type").split("+")
    except: dictionary["date_tags"] = ["unknown"]
    #for key in dictionary.keys():
    #  if key in ["exact"]:
    #    del dictionary[key]
    return dictionary
  except: 
    print("some problem")
    pass

In [56]:
def date_extractor(datation):
  dating = {"start": None, "stop": None, "exact": None, "or": None, "type": "unknown", "era": None}
  datation = re.sub(r"(\s+$|^\s)", "", datation) # remove spaces at the beginning and end
  # replace " a." and " p." by " BC" and " AD" when at the end of datation or before "/" or "-"
  datation = re.sub(r"(\s)(a\.)(\-|\/|$)", r"\1BC\3", datation)
  datation = re.sub(r"(\s)(p\.)(\-|\/|$)", r"\1AD\3", datation)
  datation = re.sub(r"^(\s?c\.)(\s\d+)", r"ca.\2", datation) # "c." -> "ca." if at the beginning of the string and followed by numbers
  datation = re.sub("mid-", "mid. ", datation)
  datation = re.sub("1st half", "firsthalf", datation) 
  datation = re.sub("2nd half", "secondhalf", datation) 
  datation = re.sub(r"\(II?\)", "", datation) 
  datation = re.sub(r"(\(|\)|\[|\])", "", datation)   # remove brackets
  # UNCERTAINTY
  if "?" in datation:
    datation = datation.replace("?", "")
    dating["certainty"] = "?"
  #if datation[0] == "[":
  #  datation = datation[1:-1]
  #  dating["certainty"] = "?"
  # ranges combining BC and AD:
  match = re.search(r"(\d+)(\s(a\.|BC))(\-)(\d+)(\s(p\.|AD))", datation) # e.g. "27 a.-14 p." or "27 BC-14 AD"
  if match:
    dating.update({"start" : int(match.groups()[0]) * -1, "stop" : int(match.groups()[4]), "era" : "BC/AD", "type" : "range"})
  # simple ante quem and postquem
  dating.update(extract_ante_and_post(datation, dating))
  # PERIODS
  dating.update(extract_period(datation, dating))
  # CENTURIES
  dating.update(parse_centuries(datation, dating))
  # "year/year"
  if dating["type"] == "unknown":
    dating.update(match_or(datation, dating)) # find all "/" instances linked with individual years
  # if we still don't know:
  dating.update(simple_dates_and_ranges(datation, dating))
  # extract phases (e.g. "early", "middle", "late", "beginning" etc.)
  dating.update(modify_by_phase(datation, dating))
  # extract ante quem and post quem for ranges
  try:
    dating.update(extract_ante_and_post(datation, dating))
  except:
    pass
  if dating["type"]=="unknown":
    dating.update({"start": None, "stop": None, "exact": None, "or": None, "type": "unknown", "era" : None})
  dating = change_dict_structure(dating)
  if dating["or"] is not None:
    dating["or"] = change_dict_structure(dating["or"])
  del dating["exact"], dating["era"]
  for key in ["certainty", "or", "link"]:
    dating.setdefault(key, None) # make sure these keys are always present
  return dating

In [57]:
date_extractor("med. s. V a.")

{'or': None,
 'not_before': -455,
 'not_after': -446,
 'date_tags': ['range', 'cent', 'phase', 'middle'],
 'certainty': None,
 'link': None}

In [58]:
datation = "5th (or 4th?) c. BC"
date_extractor(datation)

{'or': None,
 'certainty': '?',
 'not_before': -500,
 'not_after': -301,
 'date_tags': ['range', 'cent', 'morece'],
 'link': None}

In [59]:
date_extractor("14/13 or 13/12 BC") 

{'or': None,
 'not_before': -14,
 'not_after': -12,
 'date_tags': ['range'],
 'certainty': None,
 'link': None}

# Various examples

In [60]:
date_extractor("after 216/5") 

{'or': None,
 'not_before': -214,
 'not_after': None,
 'date_tags': ['range', 'post'],
 'certainty': None,
 'link': None}

In [61]:
date_extractor("c. 145 BC")

{'or': None,
 'not_before': -150,
 'not_after': -140,
 'date_tags': ['exact', 'phase', 'ca'],
 'certainty': None,
 'link': None}

In [62]:
date_extractor("after 14/13 or 13/12 BC") 

{'or': None,
 'not_before': -11,
 'not_after': None,
 'date_tags': ['range', 'post'],
 'certainty': None,
 'link': None}

In [63]:
date_extractor("s. II/III AD")

{'or': None,
 'not_before': 101,
 'not_after': 300,
 'date_tags': ['range', 'cent', 'morece'],
 'certainty': None,
 'link': None}

In [64]:
date_extractor("fin. s. I a./s. I p.")

{'or': None,
 'not_before': -11,
 'not_after': 100,
 'date_tags': ['range', 'cent', 'morece'],
 'certainty': None,
 'link': None}

In [65]:
date_extractor("2nd-early 3rd c. AD")

{'or': None,
 'not_before': 101,
 'not_after': 226,
 'date_tags': ['range', 'cent', 'morece'],
 'certainty': None,
 'link': None}

In [66]:
date_extractor("320 BC")

{'or': None,
 'not_before': -320,
 'not_after': -320,
 'date_tags': ['exact'],
 'certainty': None,
 'link': None}

In [67]:
date_extractor("post 120 AD")

{'or': None,
 'not_before': 121,
 'not_after': None,
 'date_tags': ['post'],
 'certainty': None,
 'link': None}

In [68]:
date_extractor("after 4th c. BC")

{'or': None,
 'not_before': -300,
 'not_after': None,
 'date_tags': ['range', 'cent', 'post'],
 'certainty': None,
 'link': None}
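
Since `not_before` or `not_after` can be `None` for open-ended datings ("post", "ante"), code that consumes these dictionaries has to treat a missing bound as unbounded. The following is a minimal sketch of such a check. The helper name `year_in_dating` is hypothetical, and treating a completely undated record as a non-match is an assumption made here, not a rule taken from the script.

```python
def year_in_dating(year, dating):
    """Return True if `year` falls within the interval described by `dating`.

    Hypothetical helper for illustration. A bound of None is read as
    "unbounded on that side"; a record with no bounds at all never matches.
    """
    not_before = dating.get("not_before")
    not_after = dating.get("not_after")
    if not_before is None and not_after is None:
        return False  # assumption: undated records match nothing
    if not_before is not None and year < not_before:
        return False
    if not_after is not None and year > not_after:
        return False
    return True

print(year_in_dating(-13, {"not_before": -14, "not_after": -12}))   # True
print(year_in_dating(150, {"not_before": 121, "not_after": None}))  # True
print(year_in_dating(100, {"not_before": 121, "not_after": None}))  # False
```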

In [69]:
# testing/example
for datation in ["Byzantine", "non ante s. II a.", "140/39 BC", "Rom. Imp", "reign of Augustus", "ante 140 BC", "late Antonine period", "IosPE IV 348"]:
  print(datation, date_extractor(datation))

Byzantine {'or': None, 'source': 'PeriodO', 'notes': 'alone', 'link': 'http://n2t.net/ark:/99152/p0m63njtm6w', 'not_before': 324, 'not_after': 1453, 'date_tags': ['range', 'period'], 'certainty': None}
non ante s. II a. {'or': None, 'not_before': -200, 'not_after': None, 'date_tags': ['range', 'cent', 'post'], 'certainty': None, 'link': None}
140/39 BC {'or': None, 'not_before': -140, 'not_after': -139, 'date_tags': ['range'], 'certainty': None, 'link': None}
Rom. Imp {'or': None, 'source': 'PeriodO', 'notes': nan, 'link': 'http://n2t.net/ark:/99152/p08m57hqcc5', 'not_before': -31, 'not_after': 410, 'date_tags': ['range', 'period'], 'certainty': None}
reign of Augustus {'or': None, 'source': 'PeriodO', 'notes': nan, 'link': 'http://n2t.net/ark:/99152/p06c6g3xnmx', 'not_before': -27, 'not_after': 14, 'date_tags': ['range', 'period'], 'certainty': None}
ante 140 BC {'or': None, 'not_before': None, 'not_after': -141, 'date_tags': ['ante'], 'certainty': None, 'link': None}
late Antonine pe

In [70]:
datation = "Nécrop.Myr. 224,5"
date_extractor(datation)

{'or': None,
 'not_before': None,
 'not_after': None,
 'date_tags': ['unknown'],
 'certainty': None,
 'link': None}

In [71]:
datation = " non ante med. s. II p."
date_extractor(datation)

{'or': None,
 'not_before': 146,
 'not_after': None,
 'date_tags': ['range', 'cent', 'phase', 'middle', 'post'],
 'certainty': None,
 'link': None}

In [72]:
date_extractor("not bef. the Antonine period")

{'or': None,
 'source': 'PeriodO',
 'notes': nan,
 'link': 'http://n2t.net/ark:/99152/p06c6g34zjk',
 'not_before': 96,
 'not_after': None,
 'date_tags': ['range', 'period', 'post'],
 'certainty': None}

In [73]:
date_extractor("1st BC/1st AD")

{'or': None,
 'not_before': -100,
 'not_after': 100,
 'date_tags': ['range', 'cent', 'morece'],
 'certainty': None,
 'link': None}

In [74]:
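# note: this is a bibliographic reference, not a date; the "c." in "Disc." and the Roman numeral "II" appear to be misread as a century (a false positive)
date_extractor("Newton, Disc. II 742-43:91")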

{'or': None,
 'not_before': -200,
 'not_after': -101,
 'date_tags': ['range', 'cent'],
 'certainty': None,
 'link': None}

In [75]:
datation = "27 a.-14 p."
match = re.search(r"(\d+)(\s(a\.|BC))(\-)(\d+)(\s(p\.|AD))", datation) # same pattern as in date_extractor()
if match:
  dating = {"start" : int(match.groups()[0]) * -1, "stop" : match.groups()[4], "era" : "BC/AD", "type" : "range"}
dating

{'start': -27, 'stop': '14', 'era': 'BC/AD', 'type': 'range'}
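
Note that `stop` in the output above is still the string `'14'`, because regex groups are always strings; inside `date_extractor()` the value is only converted later, by the `int()` call in `change_dict_structure()`. A minimal sketch that converts both bounds at the point of extraction could look as follows. The names `BC_AD_RANGE` and `parse_bc_ad_range` are hypothetical and not part of the script.

```python
import re

# Hypothetical names, for illustration only.
BC_AD_RANGE = re.compile(r"(\d+)\s(?:a\.|BC)-(\d+)\s(?:p\.|AD)")

def parse_bc_ad_range(datation):
    """Parse a range that starts BC and ends AD, e.g. "27 a.-14 p."."""
    match = BC_AD_RANGE.search(datation)
    if match is None:
        return None
    return {
        "start": -int(match.group(1)),  # BC years are stored as negative integers
        "stop": int(match.group(2)),    # converted here instead of later
        "era": "BC/AD",
        "type": "range",
    }

print(parse_bc_ad_range("27 a.-14 p."))
# {'start': -27, 'stop': 14, 'era': 'BC/AD', 'type': 'range'}
```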

# Testing with sample 1 (by 500)

In [76]:
test_list = PHI_by_500["raw_date"].tolist()
for datation in test_list:
  print(datation, date_extractor(datation))

c. 510-500 a.  {'or': None, 'not_before': -511, 'not_after': -499, 'date_tags': ['range', 'phase', 'ca'], 'certainty': None, 'link': None}
s. V a. {'or': None, 'not_before': -500, 'not_after': -401, 'date_tags': ['range', 'cent'], 'certainty': None, 'link': None}
c. 440? a.  {'or': None, 'certainty': '?', 'not_before': -445, 'not_after': -435, 'date_tags': ['exact', 'phase', 'ca'], 'link': None}
c. 420-410? a.  {'or': None, 'certainty': '?', 'not_before': -421, 'not_after': -409, 'date_tags': ['range', 'phase', 'ca'], 'link': None}
ante 336/5 {'or': None, 'not_before': None, 'not_after': -337, 'date_tags': ['range', 'ante'], 'certainty': None, 'link': None}
204/3? {'or': None, 'certainty': '?', 'not_before': -204, 'not_after': -203, 'date_tags': ['range'], 'link': None}
med. s. III a. {'or': None, 'not_before': -255, 'not_after': -246, 'date_tags': ['range', 'cent', 'phase', 'middle'], 'certainty': None, 'link': None}
post med. s. II p.  {'or': None, 'not_before': 156, 'not_after': Non

In [77]:
PHI_list_of_dict = []
for inscription_date_tuple in list(zip(PHI_by_500["PHI_ID"].tolist(), PHI_by_500["tildeinfo"].tolist(),  PHI_by_500["raw_date"].tolist())):
  dating = date_extractor(inscription_date_tuple[2])
  data_dict = {"PHI_ID": inscription_date_tuple[0], "tildeinfo" : inscription_date_tuple[1], "raw_date" : inscription_date_tuple[2]}
  data_dict.update(dating)
  PHI_list_of_dict.append(data_dict)

In [78]:
PHI_by_500_dates_v9 = pd.DataFrame(PHI_list_of_dict)
PHI_by_500_dates_v9.head(5)

Unnamed: 0,PHI_ID,tildeinfo,raw_date,or,not_before,not_after,date_tags,certainty,link,source,notes
0,1,Att. — Ath.: Akr. — stoich. 35 — c. 510-500 a....,c. 510-500 a.,,-511.0,-499.0,"[range, phase, ca]",,,,
1,501,Att. — stoich. — s. V a.,s. V a.,,-500.0,-401.0,"[range, cent]",,,,
2,1001,Att. — Athens: Akropolis — c. 440? a. — IG I² ...,c. 440? a.,,-445.0,-435.0,"[exact, phase, ca]",?,,,
3,1501,Att. — Salamis: Koulouri — c. 420-410? a. — IG...,c. 420-410? a.,,-421.0,-409.0,"[range, phase, ca]",?,,,
4,2501,Att. — stoich. 28 — ante 336/5,ante 336/5,,,-337.0,"[range, ante]",,,,


In [79]:
PHI_by_500_dates_v9.columns

Index(['PHI_ID', 'tildeinfo', 'raw_date', 'or', 'not_before', 'not_after',
       'date_tags', 'certainty', 'link', 'source', 'notes'],
      dtype='object')

In [80]:
PHI_by_500_dates_v9 = PHI_by_500_dates_v9[['PHI_ID', 'tildeinfo', 'raw_date', 'not_before', 'not_after', 'certainty', 'or',
       'date_tags', 'source', 'notes', 'link']]

In [81]:
#set_with_dataframe(PHI_overview.add_worksheet("PHI_by_500_dates_v9", 1,1), PHI_by_500_dates_v9)

# Testing with sample 2 (by 200)

In [82]:
# generate sample for testing purposes:
PHI_by_200 = PHI[PHI["PHI_ID"].isin(range(0, 300000, 200))]
PHI_by_200.head(5)

Unnamed: 0,URL,Book,Text,hdr1,hdr2,tildeinfo,note,lines,metadata,data,...,string_pythia,clean_text_conservative,clean_text_interpretive_word,clean_text_interpretive_sentence,clean_text_pythia,sents,sents_N,lem_sents,lemmata,raw_date
199,/text/200?location=1701&patt=&bookid=4&offset=...,IG I³,193,Regions\nAttica (IG I-III),IG I³\n193,Att. — stoich. 39? — 450-435,,7,\n1\n\n\n\n5\n,— — — — — — — — — — — — — — — — — — — — — — —\...,...,----------------------- [----------------]ονδε...,ονδε Λ ταῦτα μ ℎο γραμματεὺτο ἐμ πόλει δὲ κολα...,ονδε Λ ταῦτα μ ἀναγραφσάτο ℎο γραμματεὺς ℎο τε...,ονδε Λ ταῦτα μ ․ ἀναγραφσάτο ℎο γραμματεὺς ℎο ...,ονδε λ ταῦτα μ ἀναγραφσάτο ὁ γραμματεὺς ὁ τες ...,[ονδε λ ταῦτα μ ἀναγραφσάτο ὁ γραμματεὺς ὁ τες...,2,"[[ονδε, οὗτος, ἀναγραφσάτο, γραμματεύς, τες, β...","[ονδε, οὗτος, ἀναγραφσάτο, γραμματεύς, τες, βο...",450-435
399,/text/400?location=1701&patt=&bookid=4&offset=...,IG I³,388,Regions\nAttica (IG I-III),IG I³\n388,Att. — stoich. — 420-405 a.,,17,\n1\n\n\n\n5\n\n\n\n\n10\n\n\n\n\n15\n,— — — — — — — — — — — —\n— — — — Ι̣#⁷ — — — — ...,...,---------------- ι --------------- 0 χρυ[σίς σ...,Ι ΙΙ χρυ οἰνοχ οἰνοχόε χρυσίο σ Κυζικενό ℎέκτα...,Ι ΙΙ χρυσίς σταθμὸν ταύτες Η Δ οἰνοχόε ἀργυρᾶ ...,Ι ΙΙ χρυσίς σταθμὸν ταύτες Η Δ οἰνοχόε ἀργυρᾶ ...,ι χρυσίς σταθμὸν ταύτες οἰνοχόε ἀργυρᾶ σταθμόν...,[ι χρυσίς σταθμὸν ταύτες οἰνοχόε ἀργυρᾶ σταθμό...,1,"[[χρυσίς, σταθμόν, ταύτες, οἰνοχόε, ἀργύρεος, ...","[χρυσίς, σταθμόν, ταύτες, οἰνοχόε, ἀργύρεος, σ...",420-405 a.
599,/text/600?location=1701&patt=&bookid=4&offset=...,IG I³,564,Regions\nAttica (IG I-III),IG I³\n564,Att. — Athens: Akropolis — c. 500-475? a. — IG...,,1,1,Ἐπιγέν[ες — — —].,...,Ἐπιγέν[ες — — —].,Ἐπιγέν,Ἐπιγένες,Ἐπιγένες .,Ἐπιγένες .,[Ἐπιγένες .],1,[[ἐπιγένες]],[ἐπιγένες],c. 500-475? a.
799,/text/800?location=1701&patt=&bookid=4&offset=...,IG I³,703,Regions\nAttica (IG I-III),IG I³\n703,Att. — Athens: Akropolis — c. 500-480? a. — IG...,,2,1\n,[ἄρ]γ̣ματα Θότιμ— — — ἀνέθ[εκε — — —] /\n—2-3—...,...,[ἄρ]γ̣ματα Θότιμ— — — ἀνέθ[εκε — — —] /\n—2-3—...,γματα Θότιμ ἀνέθ ενο τεσε,ἄργματα Θότιμ ἀνέθεκε ενο τεσε,ἄργματα Θότιμ ἀνέθεκε ενο τεσε .,ἄργματα Θότιμ ἀνέθεκε ενο τεσε .,[ἄργματα Θότιμ ἀνέθεκε ενο τεσε .],1,"[[ἄργμα, θότιμ, ἀνέθεκε, ενο, τεσε]]","[ἄργμα, θότιμ, ἀνέθεκε, ενο, τεσε]",c. 500-480? a.
999,/text/1000?location=1701&patt=&bookid=4&offset...,IG I³,885,Regions\nAttica (IG I-III),IG I³\n885,Att. — Athens: Akropolis — c. 440? a. — IG I² ...,,3,1\n\n,[τόνδε Πυρε͂]ς ἀνέθεκε Πολυμνέστ̣ο φίλο[ς ℎυιὸ...,...,[τόνδε Πυρε͂]ς ἀνέθεκε Πολυμνέστ̣ο φίλο[ς ℎυιὸ...,ς ἀνέθεκε Πολυμνέστο φίλο εὐξάμενος δεκάτεν Πα...,τόνδε Πυρε͂ς ἀνέθεκε Πολυμνέστο φίλος ℎυιὸς εὐ...,τόνδε Πυρε͂ς ἀνέθεκε Πολυμνέστο φίλος ℎυιὸς εὐ...,τόνδε Πυρε͂ς ἀνέθεκε Πολυμνέστο φίλος ℎυιὸς εὐ...,[τόνδε Πυρες ἀνέθεκε Πολυμνέστο φίλος ℎυιὸς εὐ...,2,"[[ὅδε, πυρες, ἀνέθεκε, πολυμνέστο, φίλος, ℎυιὸ...","[ὅδε, πυρες, ἀνέθεκε, πολυμνέστο, φίλος, ℎυιὸς...",c. 440? a.


In [83]:
date_extractor("c. 450 a. ")

{'or': None,
 'not_before': -455,
 'not_after': -445,
 'date_tags': ['exact', 'phase', 'ca'],
 'certainty': None,
 'link': None}

In [84]:
for datation in PHI_by_200["raw_date"].tolist()[:50]:
  print(datation, date_extractor(datation))

450-435 {'or': None, 'not_before': -450, 'not_after': -435, 'date_tags': ['range'], 'certainty': None, 'link': None}
420-405 a. {'or': None, 'not_before': -420, 'not_after': -405, 'date_tags': ['range'], 'certainty': None, 'link': None}
c. 500-475? a.  {'or': None, 'certainty': '?', 'not_before': -502, 'not_after': -473, 'date_tags': ['range', 'phase', 'ca'], 'link': None}
c. 500-480? a.  {'or': None, 'certainty': '?', 'not_before': -502, 'not_after': -478, 'date_tags': ['range', 'phase', 'ca'], 'link': None}
c. 440? a.  {'or': None, 'certainty': '?', 'not_before': -445, 'not_after': -435, 'date_tags': ['exact', 'phase', 'ca'], 'link': None}
c. 450 a.  {'or': None, 'not_before': -455, 'not_after': -445, 'date_tags': ['exact', 'phase', 'ca'], 'certainty': None, 'link': None}
SEG 22.73,adn. {'or': None, 'not_before': None, 'not_after': None, 'date_tags': ['unknown'], 'certainty': None, 'link': None}
c. 525 a.  {'or': None, 'not_before': -530, 'not_after': -520, 'date_tags': ['exact', 'ph

In [85]:
PHI_by_200["dating_dict"] = PHI_by_200.apply(lambda row: date_extractor(row["raw_date"]), axis=1)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [86]:
PHI_by_200.head(5)

Unnamed: 0,URL,Book,Text,hdr1,hdr2,tildeinfo,note,lines,metadata,data,...,clean_text_conservative,clean_text_interpretive_word,clean_text_interpretive_sentence,clean_text_pythia,sents,sents_N,lem_sents,lemmata,raw_date,dating_dict
199,/text/200?location=1701&patt=&bookid=4&offset=...,IG I³,193,Regions\nAttica (IG I-III),IG I³\n193,Att. — stoich. 39? — 450-435,,7,\n1\n\n\n\n5\n,— — — — — — — — — — — — — — — — — — — — — — —\...,...,ονδε Λ ταῦτα μ ℎο γραμματεὺτο ἐμ πόλει δὲ κολα...,ονδε Λ ταῦτα μ ἀναγραφσάτο ℎο γραμματεὺς ℎο τε...,ονδε Λ ταῦτα μ ․ ἀναγραφσάτο ℎο γραμματεὺς ℎο ...,ονδε λ ταῦτα μ ἀναγραφσάτο ὁ γραμματεὺς ὁ τες ...,[ονδε λ ταῦτα μ ἀναγραφσάτο ὁ γραμματεὺς ὁ τες...,2,"[[ονδε, οὗτος, ἀναγραφσάτο, γραμματεύς, τες, β...","[ονδε, οὗτος, ἀναγραφσάτο, γραμματεύς, τες, βο...",450-435,"{'or': None, 'not_before': -450, 'not_after': ..."
399,/text/400?location=1701&patt=&bookid=4&offset=...,IG I³,388,Regions\nAttica (IG I-III),IG I³\n388,Att. — stoich. — 420-405 a.,,17,\n1\n\n\n\n5\n\n\n\n\n10\n\n\n\n\n15\n,— — — — — — — — — — — —\n— — — — Ι̣#⁷ — — — — ...,...,Ι ΙΙ χρυ οἰνοχ οἰνοχόε χρυσίο σ Κυζικενό ℎέκτα...,Ι ΙΙ χρυσίς σταθμὸν ταύτες Η Δ οἰνοχόε ἀργυρᾶ ...,Ι ΙΙ χρυσίς σταθμὸν ταύτες Η Δ οἰνοχόε ἀργυρᾶ ...,ι χρυσίς σταθμὸν ταύτες οἰνοχόε ἀργυρᾶ σταθμόν...,[ι χρυσίς σταθμὸν ταύτες οἰνοχόε ἀργυρᾶ σταθμό...,1,"[[χρυσίς, σταθμόν, ταύτες, οἰνοχόε, ἀργύρεος, ...","[χρυσίς, σταθμόν, ταύτες, οἰνοχόε, ἀργύρεος, σ...",420-405 a.,"{'or': None, 'not_before': -420, 'not_after': ..."
599,/text/600?location=1701&patt=&bookid=4&offset=...,IG I³,564,Regions\nAttica (IG I-III),IG I³\n564,Att. — Athens: Akropolis — c. 500-475? a. — IG...,,1,1,Ἐπιγέν[ες — — —].,...,Ἐπιγέν,Ἐπιγένες,Ἐπιγένες .,Ἐπιγένες .,[Ἐπιγένες .],1,[[ἐπιγένες]],[ἐπιγένες],c. 500-475? a.,"{'or': None, 'certainty': '?', 'not_before': -..."
799,/text/800?location=1701&patt=&bookid=4&offset=...,IG I³,703,Regions\nAttica (IG I-III),IG I³\n703,Att. — Athens: Akropolis — c. 500-480? a. — IG...,,2,1\n,[ἄρ]γ̣ματα Θότιμ— — — ἀνέθ[εκε — — —] /\n—2-3—...,...,γματα Θότιμ ἀνέθ ενο τεσε,ἄργματα Θότιμ ἀνέθεκε ενο τεσε,ἄργματα Θότιμ ἀνέθεκε ενο τεσε .,ἄργματα Θότιμ ἀνέθεκε ενο τεσε .,[ἄργματα Θότιμ ἀνέθεκε ενο τεσε .],1,"[[ἄργμα, θότιμ, ἀνέθεκε, ενο, τεσε]]","[ἄργμα, θότιμ, ἀνέθεκε, ενο, τεσε]",c. 500-480? a.,"{'or': None, 'certainty': '?', 'not_before': -..."
999,/text/1000?location=1701&patt=&bookid=4&offset...,IG I³,885,Regions\nAttica (IG I-III),IG I³\n885,Att. — Athens: Akropolis — c. 440? a. — IG I² ...,,3,1\n\n,[τόνδε Πυρε͂]ς ἀνέθεκε Πολυμνέστ̣ο φίλο[ς ℎυιὸ...,...,ς ἀνέθεκε Πολυμνέστο φίλο εὐξάμενος δεκάτεν Πα...,τόνδε Πυρε͂ς ἀνέθεκε Πολυμνέστο φίλος ℎυιὸς εὐ...,τόνδε Πυρε͂ς ἀνέθεκε Πολυμνέστο φίλος ℎυιὸς εὐ...,τόνδε Πυρε͂ς ἀνέθεκε Πολυμνέστο φίλος ℎυιὸς εὐ...,[τόνδε Πυρες ἀνέθεκε Πολυμνέστο φίλος ℎυιὸς εὐ...,2,"[[ὅδε, πυρες, ἀνέθεκε, πολυμνέστο, φίλος, ℎυιὸ...","[ὅδε, πυρες, ἀνέθεκε, πολυμνέστο, φίλος, ℎυιὸς...",c. 440? a.,"{'or': None, 'certainty': '?', 'not_before': -..."


In [87]:
# extract data from dating_dict to individual columns:
for key in ["not_before", "not_after", "or", "date_tags", "certainty", "link"]:
  PHI_by_200[key] = PHI_by_200.apply(lambda row: row["dating_dict"][key], axis=1)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
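
The message above is pandas' `SettingWithCopyWarning`: `PHI_by_200` was created by filtering `PHI`, so pandas cannot tell whether new columns are meant for the sample or for the original. Taking an explicit `.copy()` of the sample is the usual way to avoid it. The sketch below also expands the dictionaries into columns in a single step instead of one `apply` per key. It runs on a small made-up DataFrame, and `fake_extractor` is a hypothetical stand-in for `date_extractor()`.

```python
import pandas as pd

def fake_extractor(raw_date):
    # Hypothetical stand-in for date_extractor(), for illustration only.
    return {"not_before": None, "not_after": None, "date_tags": ["unknown"]}

# Made-up data with the two columns the sketch needs.
df = pd.DataFrame({
    "PHI_ID": [0, 100, 200, 300, 400],
    "raw_date": ["450-435", "s. V a.", "c. 440? a.", "161 AD", "ante 336/5"],
})

# .copy() makes the sample an independent DataFrame, so adding columns
# to it does not raise SettingWithCopyWarning.
sample = df[df["PHI_ID"].isin(range(0, 1000, 200))].copy()

# Run the extractor once per row, then turn the dictionaries into columns.
dicts = sample["raw_date"].apply(fake_extractor)
expanded = pd.DataFrame(dicts.tolist(), index=sample.index)
sample = sample.join(expanded)

print(sample.columns.tolist())
# ['PHI_ID', 'raw_date', 'not_before', 'not_after', 'date_tags']
```

Passing `index=sample.index` keeps the new columns aligned with the original row labels, which matters here because the filtered sample does not have a contiguous index.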



In [88]:
PHI_by_200

Unnamed: 0,URL,Book,Text,hdr1,hdr2,tildeinfo,note,lines,metadata,data,...,lem_sents,lemmata,raw_date,dating_dict,not_before,not_after,or,date_tags,certainty,link
199,/text/200?location=1701&patt=&bookid=4&offset=...,IG I³,193,Regions\nAttica (IG I-III),IG I³\n193,Att. — stoich. 39? — 450-435,,7,\n1\n\n\n\n5\n,— — — — — — — — — — — — — — — — — — — — — — —\...,...,"[[ονδε, οὗτος, ἀναγραφσάτο, γραμματεύς, τες, β...","[ονδε, οὗτος, ἀναγραφσάτο, γραμματεύς, τες, βο...",450-435,"{'or': None, 'not_before': -450, 'not_after': ...",-450.0,-435.0,,[range],,
399,/text/400?location=1701&patt=&bookid=4&offset=...,IG I³,388,Regions\nAttica (IG I-III),IG I³\n388,Att. — stoich. — 420-405 a.,,17,\n1\n\n\n\n5\n\n\n\n\n10\n\n\n\n\n15\n,— — — — — — — — — — — —\n— — — — Ι̣#⁷ — — — — ...,...,"[[χρυσίς, σταθμόν, ταύτες, οἰνοχόε, ἀργύρεος, ...","[χρυσίς, σταθμόν, ταύτες, οἰνοχόε, ἀργύρεος, σ...",420-405 a.,"{'or': None, 'not_before': -420, 'not_after': ...",-420.0,-405.0,,[range],,
599,/text/600?location=1701&patt=&bookid=4&offset=...,IG I³,564,Regions\nAttica (IG I-III),IG I³\n564,Att. — Athens: Akropolis — c. 500-475? a. — IG...,,1,1,Ἐπιγέν[ες — — —].,...,[[ἐπιγένες]],[ἐπιγένες],c. 500-475? a.,"{'or': None, 'certainty': '?', 'not_before': -...",-502.0,-473.0,,"[range, phase, ca]",?,
799,/text/800?location=1701&patt=&bookid=4&offset=...,IG I³,703,Regions\nAttica (IG I-III),IG I³\n703,Att. — Athens: Akropolis — c. 500-480? a. — IG...,,2,1\n,[ἄρ]γ̣ματα Θότιμ— — — ἀνέθ[εκε — — —] /\n—2-3—...,...,"[[ἄργμα, θότιμ, ἀνέθεκε, ενο, τεσε]]","[ἄργμα, θότιμ, ἀνέθεκε, ενο, τεσε]",c. 500-480? a.,"{'or': None, 'certainty': '?', 'not_before': -...",-502.0,-478.0,,"[range, phase, ca]",?,
999,/text/1000?location=1701&patt=&bookid=4&offset...,IG I³,885,Regions\nAttica (IG I-III),IG I³\n885,Att. — Athens: Akropolis — c. 440? a. — IG I² ...,,3,1\n\n,[τόνδε Πυρε͂]ς ἀνέθεκε Πολυμνέστ̣ο φίλο[ς ℎυιὸ...,...,"[[ὅδε, πυρες, ἀνέθεκε, πολυμνέστο, φίλος, ℎυιὸ...","[ὅδε, πυρες, ἀνέθεκε, πολυμνέστο, φίλος, ℎυιὸς...",c. 440? a.,"{'or': None, 'certainty': '?', 'not_before': -...",-445.0,-435.0,,"[exact, phase, ca]",?,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165100,/text/299000?location=1673&patt=&bookid=736&of...,"IDR III,2",309,Regions\nThrace and the Lower Danube (IG X)\nD...,"IDR III,2\n309",Dacia Sup. — Ulpia Traiana-Sarmizegetusa — 2nd...,,7,1\n\n\n\n5\n\n,Deae Regi(nae)\nAel(ia) Primi-\ntiva ex vot(o)...,...,"[[deae, reginae, aelia, primitiva, ex, voto, p...","[deae, reginae, aelia, primitiva, ex, voto, pr...",2nd/3rd c. AD,"{'or': None, 'not_before': 101, 'not_after': 3...",101.0,300.0,,"[range, cent, morece]",,
165300,/text/299200?location=1673&patt=&bookid=736&of...,"IDR III,2",506,Regions\nThrace and the Lower Danube (IG X)\nD...,"IDR III,2\n506",Dacia Sup. — Ulpia Traiana-Sarmizegetusa — 2nd...,,12,frg. a\n\n\n\nfrg. b\n\n\n\n\nfrg. c\n\n,[— — — — —]\n[— —]MI[— —]\n[— — — — —]\n\n[— —...,...,"[[l, m]]","[l, m]",2nd/3rd c. AD,"{'or': None, 'not_before': 101, 'not_after': 3...",101.0,300.0,,"[range, cent, morece]",,
165500,/text/299400?location=1673&patt=&bookid=737&of...,"IDR III,3",83,Regions\nThrace and the Lower Danube (IG X)\nD...,"IDR III,3\n83",Dacia Sup. — Micia (Vețel) — 2nd/3rd c. AD — C...,,6,1\n\n\n\n5\n,I(ovi) O(ptimo) M(aximo)\nvet(erani) et c(ives...,...,"[[iovi, optimo, maximo, veterani, et, cives, r...","[iovi, optimo, maximo, veterani, et, cives, ro...",2nd/3rd c. AD,"{'or': None, 'not_before': 101, 'not_after': 3...",101.0,300.0,,"[range, cent, morece]",,
165700,/text/299600?location=1673&patt=&bookid=737&of...,"IDR III,3",232,Regions\nThrace and the Lower Danube (IG X)\nD...,"IDR III,3\n232",Dacia Sup. — Germisara: Geoagiu — 161 AD,,7,1\n\n\n\n5\n\n,Aesculapio\net Hygiae\nsacrum\nP(ublius) Furiu...,...,"[[et, hygiae, sacrum, publius, furius, saturni...","[et, hygiae, sacrum, publius, furius, saturnin...",161 AD,"{'or': None, 'not_before': 161, 'not_after': 1...",161.0,161.0,,[exact],,


In [89]:
#set_with_dataframe(PHI_overview.add_worksheet("PHI_by_200_v1", 1,1), PHI_by_200)

# Application on the whole dataset

In [90]:
%%time
def apply_function(row):
  # fall back to an "unknown" dating if date_extractor() fails on a row
  try: 
    return date_extractor(row["raw_date"])
  except Exception:
    return {"not_before": None, "not_after": None, "or": None, "date_tags": ["unknown"], "link" : None, "certainty" : None}

PHI["dating_dict"] = PHI.apply(apply_function, axis=1)
# extract data from dating_dict to individual columns:
for key in ["not_before", "not_after", "or", "date_tags", "certainty", "link"]:
  PHI[key] = PHI.apply(lambda row: row["dating_dict"][key], axis=1)

CPU times: user 1min 1s, sys: 542 ms, total: 1min 2s
Wall time: 1min 2s


In [91]:
PHI.head(20)

Unnamed: 0,URL,Book,Text,hdr1,hdr2,tildeinfo,note,lines,metadata,data,...,lem_sents,lemmata,raw_date,dating_dict,not_before,not_after,or,date_tags,certainty,link
0,/text/1?location=1701&patt=&bookid=4&offset=0&...,IG I³,1,Regions\nAttica (IG I-III),IG I³\n1,Att. — Ath.: Akr. — stoich. 35 — c. 510-500 a....,,12,1\n\n\n\n5\n\n\n\n\n10\n\n,ἔδοχσεν το͂ι δέμοι· τ̣[ὸς ἐ Σ]αλαμ̣[ῖνι κλερόχ...,...,"[[ἔδοχσεν, δέμοι, Σαλαμίς, κλερόχος, οἰκεν, Σα...","[ἔδοχσεν, δέμοι, Σαλαμίς, κλερόχος, οἰκεν, Σαλ...",c. 510-500 a.,"{'or': None, 'not_before': -511, 'not_after': ...",-511.0,-499.0,,"[range, phase, ca]",,
1,/text/2?location=1701&patt=&bookid=4&offset=0&...,IG I³,2,Regions\nAttica (IG I-III),IG I³\n2,Att. — non-stoich. — c. 500 a.,,14,1\n\n\n\n5\n\n\n\n\n10\n\n\n\n,[․․8-9․․․]ν̣ βολ — — — — — — — — — —\n[․6-7․․]...,...,"[[βολ, ἑκών, σίον, γνοσθει, ἄτεχνος, μεδὲ, κελ...","[βολ, ἑκών, σίον, γνοσθει, ἄτεχνος, μεδὲ, κελε...",c. 500 a.,"{'or': None, 'not_before': -505, 'not_after': ...",-505.0,-495.0,,"[exact, phase, ca]",,
2,/text/3?location=1701&patt=&bookid=4&offset=0&...,IG I³,3,Regions\nAttica (IG I-III),IG I³\n3,Att. — stoich. 21 — 490-480 a.,,13,1\n\n\n\n5\n\n\n\n\n10\n\n\n,[․]αρ[․․․․]ι ℎερακλειο[․․5․․]\n[․]αρ̣ο#⁷[․] τι...,...,"[[ἑρακλειο, ἀρόω, τίθημι, ἀθλοθέτης, ἀνήρ, ἄγο...","[ἑρακλειο, ἀρόω, τίθημι, ἀθλοθέτης, ἀνήρ, ἄγον...",490-480 a.,"{'or': None, 'not_before': -490, 'not_after': ...",-490.0,-480.0,,[range],,
3,/text/4?location=1701&patt=&bookid=4&offset=0&...,IG I³,4,Regions\nAttica (IG I-III),IG I³\n4,Att. — stoich. 38 — 485/4 a.,,56,face A.1\n\n\n\n5\n\n\n\n\n10\n\n\n\n\n15\n\n\...,[․․․․․․․․․․․․․․․․․․38․․․․․․․․․․․․․․․․․․]\n[․․․...,...,"[[τὶς, φρορὰν, πεντέκοντα, δραχμή, τ, πρᾶχσιν,...","[τὶς, φρορὰν, πεντέκοντα, δραχμή, τ, πρᾶχσιν, ...",485/4 a.,"{'or': None, 'not_before': -485, 'not_after': ...",-485.0,-484.0,,[range],,
4,/text/5?location=1701&patt=&bookid=4&offset=0&...,IG I³,5,Regions\nAttica (IG I-III),IG I³\n5,Att. — c. 500 a.,,6,1\n\n\n\n5\n,[ἔδοχσε]ν [⋮ τε͂ι βολε͂ι] ⋮ καὶ [τ]ο͂ι δέμοι ⋮...,...,"[[ἔδοχσεν, τει, βολει, δέμοι, παραιβάτες, γραμ...","[ἔδοχσεν, τει, βολει, δέμοι, παραιβάτες, γραμμ...",c. 500 a.,"{'or': None, 'not_before': -505, 'not_after': ...",-505.0,-495.0,,"[exact, phase, ca]",,
5,/text/6?location=1701&patt=&bookid=4&offset=0&...,IG I³,6,Regions\nAttica (IG I-III),IG I³\n6,Att. — stoich. 23/11 — ante 460 a.,,160,face A.BM 309.1\n\n\n\n5\n\n\n\n\n10\n\n\n\n\n...,— — — — — — — — — — — — —\n[․․․․․․15․․․․․․․] δ...,...,"[[δραχμεισι, τες, μένος, δεμο, πόλεον, δοκέω, ...","[δραχμεισι, τες, μένος, δεμο, πόλεον, δοκέω, ἀ...",ante 460 a.,"{'or': None, 'not_before': None, 'not_after': ...",,-461.0,,[ante],,
6,/text/7?location=1701&patt=&bookid=4&offset=0&...,IG I³,7,Regions\nAttica (IG I-III),IG I³\n7,Att. — stoich. 40 — 460-450,,28,frg. a.1\n\n\n\n5\n\n\n\n\n10\n\n\n\n13\n\n\nf...,[ἔδοχσεν τε͂]ι βο[λ]ε͂[ι καὶ το͂ι δέμοι· ․․6․․...,...,"[[ἔδοχσεν, τει, βολει, δέμοι, πρυτανεύω, γραμμ...","[ἔδοχσεν, τει, βολει, δέμοι, πρυτανεύω, γραμμα...",460-450,"{'or': None, 'not_before': -460, 'not_after': ...",-460.0,-450.0,,[range],,
7,/text/8?location=1701&patt=&bookid=4&offset=0&...,IG I³,8,Regions\nAttica (IG I-III),IG I³\n8,Att. — stoich. 32 — 460-450,,26,frg. a.1\n\n\n\n5\n\n\n\n\n10\n\n\nfrg. b.12\n...,[․․5․․]#⁷ον ℎὰ ο[․․․․․․․․․21․․․․․․․․․․]\nα περ...,...,"[[οὗτος, δέμω, ἀντίβιος, λέγω, ἄλλος, Καλλίμαχ...","[οὗτος, δέμω, ἀντίβιος, λέγω, ἄλλος, Καλλίμαχο...",460-450,"{'or': None, 'not_before': -460, 'not_after': ...",-460.0,-450.0,,[range],,
8,/text/9?location=1701&patt=&bookid=4&offset=0&...,IG I³,9,Regions\nAttica (IG I-III),IG I³\n9,Att. — stoich. 24 — c. 458 a.,,17,1\n\n\n\n5\n\n\n\n\n10\n\n\n\n\n15\n\n,[ἔδοχσεν τε͂ι βο]λε̣͂ι καὶ το͂[ι δέμ]-\n[οι· ․...,...,"[[ἔδοχσεν, τει, βολει, δέμοι, ντὶς, πρυτανεύω,...","[ἔδοχσεν, τει, βολει, δέμοι, ντὶς, πρυτανεύω, ...",c. 458 a.,"{'or': None, 'not_before': -463, 'not_after': ...",-463.0,-453.0,,"[exact, phase, ca]",,
9,/text/10?location=1701&patt=&bookid=4&offset=0...,IG I³,10,Regions\nAttica (IG I-III),IG I³\n10,Att. — stoich. 22 — 469-450,,28,1\n\n\n\n5\n\n\n\n\n10\n\n\n\n\n15\n\n\n\n\n20...,[ἔδο]ξεν τῆι βολῆι καὶ τῶι δ[ή]-\n[μωι· Ἀ]καμα...,...,"[[δοκέω, βολῆι, δῆμος, ἀκαμαντὶς, πρυτανεύω, ν...","[δοκέω, βολῆι, δῆμος, ἀκαμαντὶς, πρυτανεύω, νά...",469-450,"{'or': None, 'not_before': -469, 'not_after': ...",-469.0,-450.0,,[range],,


In [92]:
sddk.write_file("SDAM_data/PHI/PHI_20201217.json", PHI, conf)

Your <class 'pandas.core.frame.DataFrame'> object has been succefully written as "https://sciencedata.dk/files/SDAM_root/SDAM_data/PHI/PHI_20201217.json"
