<a href="https://colab.research.google.com/github/sdam-au/PHI_ETL/blob/master/scripts/1_3_EXTRACTING_DATES.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

This script contains a series of functions for extracting numerical dates (e.g. as an interval 101-200) from a historical dataset expressing this information in textual form (e.g. "2nd c. AD"). The functions have been developed especially for extracting dates from the PHI dataset, but might be reused for other applications.

At the core of the script is the function `date_extractor()`. It takes a textual date as input (e.g. "s. III/II a." [= "3rd/2nd c. BC"]) and returns a dictionary of dating values, e.g.:
```python
{"start" : -300, "stop" : -101, "type": "range+cent+morece"}
```
where `"type"` contains tags specifying what kind of dating it is: `range` means that it is an interval; `cent` means that the interval is based on information about centuries; and `morece` implies that there is more than one century.
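
Because the tags are joined with `+`, a later step can split the string to test for a single tag. The snippet below is a minimal sketch of that idea; the variable names are illustrative and not part of the script, which itself relies on plain substring checks.

```python
# Illustrative only: inspect the tags stored under "type".
dating = {"start": -300, "stop": -101, "type": "range+cent+morece"}

tags = dating["type"].split("+")       # individual tags as a list
is_interval = "range" in tags          # the dating is an interval
spans_centuries = "morece" in tags     # it covers more than one century

print(tags, is_interval, spans_centuries)
```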

`date_extractor()` relies upon a number of other functions designed to extract individual types of dates:
* `extract_ante_and_post(datation, dating)` looks for words like "post", "ante", "not before" etc. and modifies the numerical dating (in the `dating` dictionary) accordingly. E.g. "ante 305 BC" means all years before 305 BC, so the stop year is `-306`; "not before the reign of Trajan" means all years from `98` onward.
* `extract_period(datation)` checks whether the textual datation contains a period (like "reign of Tiberius") that can be translated into an interval (14-37).
* `parse_centuries()` extracts intervals for individual centuries. It also handles cases in which more than one century is present (e.g. "s. III/II a."), including those where one century is BC and the other AD (e.g. "1st c. BC-1st c. AD").
* `modify_by_phase(datation, dating)` modifies the ranges in `dating` by checking for words like "beginning", "early", "late" and "end". We use these parameters:
  * "beginning": first 10% of the range (defined by "start" and "stop" in the `dating` dictionary)
  * "early": first 25% of the range
  * "late": last 25% of the range
  * "end": last 10% of the range
  * "ca.": extends the range by adding 10% on the left and 10% on the right
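
The percentages listed above can be sketched as a small standalone function. This is an illustration under stated assumptions rather than the script's own code: `phase_window` and `PHASE_FRACTIONS` are hypothetical names, and rounding uses Python's built-in `round`.

```python
# Hypothetical helper mirroring the percentages listed above.
PHASE_FRACTIONS = {
    "beginning": 0.10,
    "early": 0.25,
    "late": 0.25,
    "end": 0.10,
    "ca.": 0.10,
}

def phase_window(start, stop, phase):
    """Return (start, stop) adjusted for the given phase keyword."""
    duration = abs(stop - start)
    delta = round(duration * PHASE_FRACTIONS[phase])
    if phase in ("beginning", "early"):
        return start, start + delta      # keep the first part of the range
    if phase in ("late", "end"):
        return stop - delta, stop        # keep the last part of the range
    return start - delta, stop + delta   # "ca.": widen on both sides

print(phase_window(-200, -101, "early"))  # (-200, -175)
print(phase_window(-200, -101, "ca."))    # (-210, -91)
```

For the 2nd century BC (-200 to -101) the duration is 99 years, so "early" keeps the first 25 years and "ca." widens the range by 10 years on each side.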


# Requirements

In [1]:
import numpy as np
import math
import pandas as pd
import re

import sys
import requests
from bs4 import BeautifulSoup
import json

import datetime as dt
# for simple parallel computing:
from concurrent.futures import ThreadPoolExecutor

import gspread
from gspread_dataframe import get_as_dataframe, set_with_dataframe
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [2]:
# our own package for reading the data
!pip install sddk
import sddk

Collecting sddk
  Downloading https://files.pythonhosted.org/packages/bf/96/3ae43f2d8ac06fc16ba111916970e5a1f3b96a3e41732fa3f099e2e5cd1c/sddk-2.6-py3-none-any.whl
Installing collected packages: sddk
Successfully installed sddk-2.6


# Authentication

In [66]:
# login to sciencedata 
conf = sddk.configure("SDAM_root", "648597@au.dk")

sciencedata.dk username (format '123456@au.dk'): 648597@au.dk
sciencedata.dk password: ··········
connection with shared folder established with you as its owner
endpoint variable has been configured to: https://sciencedata.dk/files/SDAM_root/


In [487]:
### authorize google sheets
auth.authenticate_user()
gc = gspread.authorize(GoogleCredentials.get_application_default())
# establish connection with particular sheet by its url:
PHI_overview = gc.open_by_url("https://docs.google.com/spreadsheets/d/1zfTw0Hf304maBmrYvaMxRLnv1zfAVFixrtGTTsLCcT4/edit?usp=sharing")

# Read data

In [7]:
# read the PHI dataset from sciencedata.dk
PHI = sddk.read_file("SDAM_data/PHI/PHI_enriched_raw.json", "df", conf)

In [8]:
# print first 5 rows of the data
PHI.head(5)

Unnamed: 0,URL,Book,Text,hdr1,hdr2,tildeinfo,note,lines,metadata,data,filename,PHI_ID
0,[/text/237443?location=1365&patt=&bookid=411&b...,[CSCA],"[5 (1972) 169,3]",[Regions\nAttica (IG I-III)\nAttica],"[CSCA\n5 (1972) 169,3]",[Att. — Athens: Akropolis — stoich. 28 — 303/2...,{},[4],[1\n\n\n],[ἐπὶ Λε[ωστράτου ἄρχοντος ἐπὶ τῆς Κ]-\nεκρο[πί...,[CSCA.csv],[237443]
1,[/text/237444?location=1365&patt=&bookid=411&b...,[CSCA],"[5 (1972) 169,4]",[Regions\nAttica (IG I-III)\nAttica],"[CSCA\n5 (1972) 169,4]",[Att. — Athens: EM — stoich. 35 — 306-302 BC],{},[15],[1\n\n\n\n5\n\n\n\n\n10\n\n\n\n\n15],[[․]ομε̣[․․․․․․․․․․․․․28․․․․․․․․․․․․․ τῆς]\nπρ...,[CSCA.csv],[237444]
2,[/text/237445?location=1365&patt=&bookid=411&b...,[CSCA],"[5 (1972) 173,5]",[Regions\nAttica (IG I-III)\nAttica],"[CSCA\n5 (1972) 173,5]",[Att. — Athens — non-stoich.],{},[16],[1\n\n\n\n5\n\n\n\n\n10\n\n\n\n\n15\n],[— — — — —υτω․․․․11․․․․․\n— — — — —ωντωι̣․․․․9...,[CSCA.csv],[237445]
3,[/text/237446?location=1365&patt=&bookid=411&b...,[CSCA],"[9 (1976) 9,4]",[Regions\nAttica (IG I-III)\nAttica],"[CSCA\n9 (1976) 9,4]",[Att. — Athens: Eth.Mus. — 4th c. BC — SEG 26....,{},[2],[A.1\nB.1],[ψῆφος ⋮ δημοσία.\nΔ ․],[CSCA.csv],[237446]
4,[/text/237447?location=1365&patt=&bookid=411&b...,[CSCA],"[9 (1976) 11,16]",[Regions\nAttica (IG I-III)\nAttica],"[CSCA\n9 (1976) 11,16]",[Att. — Athens: Eth.Mus. — 4th c. BC — SEG 26....,{},[2],[A.1\nB.1],[ψῆφος ⋮ δημοσία.\nΓ Ε],[CSCA.csv],[237447]


In [9]:
# Unfortunately, transferring the dataset between Python and R left the cells in most columns holding a LIST OF VALUES (of length 1) instead of the VALUE itself.
# In that case, one simple transformation is needed.
# Perhaps it will not be needed in the future.

def lists_to_values(list_or_value):
  if isinstance(list_or_value, list):
    value = list_or_value[0]
  else: 
    value = list_or_value
  return value
for column in PHI.columns:
  PHI[column] = PHI.apply(lambda row: lists_to_values(row[column]), axis=1)

In [10]:
PHI.head(5)

Unnamed: 0,URL,Book,Text,hdr1,hdr2,tildeinfo,note,lines,metadata,data,filename,PHI_ID
0,/text/237443?location=1365&patt=&bookid=411&bo...,CSCA,"5 (1972) 169,3",Regions\nAttica (IG I-III)\nAttica,"CSCA\n5 (1972) 169,3",Att. — Athens: Akropolis — stoich. 28 — 303/2 BC,{},4,1\n\n\n,ἐπὶ Λε[ωστράτου ἄρχοντος ἐπὶ τῆς Κ]-\nεκρο[πίδ...,CSCA.csv,237443
1,/text/237444?location=1365&patt=&bookid=411&bo...,CSCA,"5 (1972) 169,4",Regions\nAttica (IG I-III)\nAttica,"CSCA\n5 (1972) 169,4",Att. — Athens: EM — stoich. 35 — 306-302 BC,{},15,1\n\n\n\n5\n\n\n\n\n10\n\n\n\n\n15,[․]ομε̣[․․․․․․․․․․․․․28․․․․․․․․․․․․․ τῆς]\nπρυ...,CSCA.csv,237444
2,/text/237445?location=1365&patt=&bookid=411&bo...,CSCA,"5 (1972) 173,5",Regions\nAttica (IG I-III)\nAttica,"CSCA\n5 (1972) 173,5",Att. — Athens — non-stoich.,{},16,1\n\n\n\n5\n\n\n\n\n10\n\n\n\n\n15\n,— — — — —υτω․․․․11․․․․․\n— — — — —ωντωι̣․․․․9․...,CSCA.csv,237445
3,/text/237446?location=1365&patt=&bookid=411&bo...,CSCA,"9 (1976) 9,4",Regions\nAttica (IG I-III)\nAttica,"CSCA\n9 (1976) 9,4",Att. — Athens: Eth.Mus. — 4th c. BC — SEG 26.1...,{},2,A.1\nB.1,ψῆφος ⋮ δημοσία.\nΔ ․,CSCA.csv,237446
4,/text/237447?location=1365&patt=&bookid=411&bo...,CSCA,"9 (1976) 11,16",Regions\nAttica (IG I-III)\nAttica,"CSCA\n9 (1976) 11,16",Att. — Athens: Eth.Mus. — 4th c. BC — SEG 26.1...,{},2,A.1\nB.1,ψῆφος ⋮ δημοσία.\nΓ Ε,CSCA.csv,237447


# Raw date column

In the PHI dataset, the datation information is usually contained in the "tildeinfo" column. "tildeinfo" has the form of a list, with individual elements separated by " — ". Unfortunately, this list does not have a fully consistent structure. Typically, the datation information is the last element of the list (e.g. "Dacia Sup. — Tibiscum (Jupa) — 2nd/3rd c. AD" - PH298501), but not always (e.g. "N. Black Sea — Pantikapaion (Kerch) — 1st c. BC — IosPE IV 253" - PH183001). Thus, our first task is to extract the element that most probably contains the datation information.


In [12]:
def get_date_from_tildeinfo(tildeinfo):
  try:
    tildeinfo_list = tildeinfo.split("— ")
    datation = tildeinfo_list[-1]
    for el in tildeinfo_list:
      if any(time_indicator in el for time_indicator in [" a.", " p.", "BC", "AD", "period", "reign"]):
        datation = el
        break
  except: 
    datation = ""
  return datation 

In [15]:
# test 1
get_date_from_tildeinfo("N. Black Sea — Pantikapaion (Kerch) — 1st c. BC — IosPE IV 253")

'1st c. BC '

In [16]:
# test 2
get_date_from_tildeinfo("Att. — Athens: Agora — stoich. 29 — 301/0-295/4 a. — *Hesp. 13.1944.242,7 — *SEG 24.119; 29.93")

'301/0-295/4 a. '

In [17]:
# application on the whole dataset
PHI["raw_date"] = PHI.apply(lambda row: get_date_from_tildeinfo(row["tildeinfo"]), axis=1)

# Generating a sample
For development purposes, the functions below were first tested on a representative sample of the dataset containing every 500th inscription, i.e. inscriptions PH2501, PH3001, ..., PH218501 etc.

In [18]:
# generate sample for testing purposes:
PHI_by_500 = PHI[PHI["PHI_ID"].isin(range(1, 300000, 500))]
PHI_by_500.head(5)

Unnamed: 0,URL,Book,Text,hdr1,hdr2,tildeinfo,note,lines,metadata,data,filename,PHI_ID,raw_date
625,/text/262001?location=1040&patt=&bookid=517&of...,Panamara,316,Regions\nAsia Minor\nCaria,Panamara\n316,IStr 427,{},2,1\n,"Διῒ Πανημέρῳ, κόμαι Εὐτύχεως· ἐπὶ ἱερέω[ς] Κλα...",Panamara.csv,262001,IStr 427
912,/text/266001?location=1673&patt=&bookid=587&of...,St.Pont. III,110a,Regions\nAsia Minor\nPontus and Paphlagonia,St.Pont. III\n110a,Pont. — Amasia — Rom. Imp. period,{},6,1\n\n\n\n5\n,Αὐρ(ηλίῳ) Φιλο-\nμούσῳ ❦\nἀρχιάτρῳ\nῬω[μᾶνα(?)...,St.Pont.-III.csv,266001,Rom. Imp. period
1260,/text/231001?location=1365&patt=&bookid=394&of...,Agora XV,130,Regions\nAttica (IG I-III)\nAttica,Agora XV\n130,Att. — Athens: Agora — 220/19,{},149,1\n\n\n\n5\n\n\n\n\n10\n\n\n\n\n15\n\n\n\n\n20...,ἐπὶ Μενεκράτου ἄρχοντος ἐπὶ τῆς Οἰνεῖδος ἕκτη-...,Agora-XV.csv,231001,220/19
1944,/text/79501?location=126&patt=&bookid=23&offse...,"IG XII,8",311,"Regions\nAegean Islands, incl. Crete (IG XI-[X...","IG XII,8\n311",Thasos,{},7,1\n\n\n\n5\n\n,Ἐπικράτης Κτησιφῶντος\nΠυθίων Περιθύμου\nΠίν...,IG-XII-8.csv,79501,Thasos
2502,/text/218501?location=1497&patt=&bookid=367&of...,Koptos à Kosseir,163,"Regions\nEgypt, Nubia and Cyrenaïca\nEgypt and...",Koptos à Kosseir\n163,Eg. — el-Boueib,{},2,1\n,Κάλχων̣\nΠανὶ Χρ(υσόδωτηι).,Koptos---Kosseir.csv,218501,el-Boueib


# Parse ante quem and post quem

In [328]:
### simple demonstration of the logic
datation = "not before 304 AD"
match = re.search("(not\s(before|bef\.)\s|non\sante\s)(\d+)",  datation, flags=re.IGNORECASE)
if match:
  dating_update = {"start" : int(match.groups()[2]), "type" : "post"}
dating_update

{'start': 304, 'type': 'post'}

In [400]:
def extract_ante_and_post(datation, dating=None):
  if dating==None: dating = {"type" : "unknown"}
  if "unknown" in dating["type"]: 
    # if "NOT BEFORE"
    match = re.search("(not\s(before|bef\.)\s|non\sante\s)(\-?\d+)(\s|$)",  datation, flags=re.IGNORECASE)
    if match:
      if "AD" not in datation:
        start = (int(match.groups()[2]) * -1)
        dating_update = {"start" : start, "type" : "post"}
      else:
        dating_update = {"start" : int(match.groups()[2]), "type" : "post"}
    # if "BEFORE"
    else:
      match = re.search('(before\s|ante\s)(\-?\d+)(\s|$)', datation, flags=re.IGNORECASE)
      if match:
        if "AD" not in datation:
          dating_update = {"stop" : (int(match.groups()[1]) * -1) - 1, "type" : "ante"}
        else:
          dating_update = {"stop" : int(match.groups()[1]) - 1, "type" : "ante"}
      # if "NOT AFTER"
      else:
        match = re.search("(not\safter\s|non\spost\s)(\-?\d+)(\s|$)",  datation, flags=re.IGNORECASE)
        if match:
          if "AD" not in datation:
            dating_update = {"stop" : (int(match.groups()[1]) * -1), "type" : "ante"}
          else:
            dating_update = {"stop" : int(match.groups()[1]), "type" : "ante"}
        # if "AFTER"
        else:
            match = re.search(r'(after\s|aft\.\s|post\s)(\-?\d+)(\s|$)', datation, flags=re.IGNORECASE)  # dot escaped so "aft." is matched literally
            if match:
              if "AD" not in datation:
                dating_update = {"start" : (int(match.groups()[1]) * -1) + 1, "type" : "post"}
              else:
                dating_update = {"start" : int(match.groups()[1]) + 1, "type" : "post"}
            else:
              dating_update = dating
  elif "exact+or" in dating["type"]: 
    # if "NOT BEFORE"
    match = re.search("(not\s(before|bef\.)\s|non\sante\s)",  datation, flags=re.IGNORECASE)
    if match:
      dating_update = {"start" : dating["exact"], "or": {"start" : dating["or"]["exact"], "exact" : None}, "exact" : None, "type" : "post+or"}
    # if "BEFORE"
    else:
      match = re.search('(before\s|ante\s)', datation, flags=re.IGNORECASE)
      if match:
        dating_update = {"stop" : dating["exact"], "or": {"stop" : dating["or"]["exact"], "exact" : None}, "exact" : None, "type" : "ante+or"}
      # if "NOT AFTER"
      else:
        match = re.search("(not\safter\s|non\spost\s)",  datation, flags=re.IGNORECASE)
        if match:
              dating_update = {"stop" : dating["exact"], "or": {"stop" : dating["or"]["exact"], "exact" : None}, "exact" : None, "type" : "ante+or"}
        # if "AFTER"
        else:
            match = re.search(r'(after\s|aft\.\s|post\s)', datation, flags=re.IGNORECASE)  # dot escaped so "aft." is matched literally
            if match:
              dating_update = {"start" : dating["exact"], "or": {"start" : dating["or"]["exact"], "exact" : None}, "exact" : None, "type" : "post+or"}
            else:
              dating_update = dating
  elif "range" in dating["type"]:
    # if "NOT BEFORE"
    match = re.search("(not\s(before|bef\.)\s|non\sante\s)",  datation, flags=re.IGNORECASE)
    if match:
      dating_update = {"start" : dating["start"], "stop":None, "type" : dating["type"]+"+post"}
    # if "BEFORE"
    else:
      match = re.search('(before\s|ante\s)', datation, flags=re.IGNORECASE)
      if match:
        dating_update = {"stop" : int(dating["start"]) - 1, "start":None, "type" : dating["type"]+"+ante"}  
      # if "NOT AFTER"
      else:
        match = re.search("(not\safter\s|non\spost\s)",  datation, flags=re.IGNORECASE)
        if match:
          dating_update = {"stop" : dating["stop"], "start":None,"type" : dating["type"]+"+ante"}
        # if "AFTER"
        else:
          match = re.search(r'(after\s|aft\.\s|post\s)', datation, flags=re.IGNORECASE)  # dot escaped so "aft." is matched literally
          if match:
            dating_update = {"start" : int(dating["stop"]) + 1, "stop":None,"type" : dating["type"]+"+post"}
          else:
            dating_update = dating
  else: 
    #datation = re.sub("(not\s(after|bef\.)\s|non\spost)", "ante\s", datation)
    dating_update = dating
  if ("shortly" in datation) and ("shortly" not in dating_update["type"]):
    dating_update["type"] = dating_update["type"] + "+shortly"
  return dating_update

In [401]:
extract_ante_and_post("after 4th c. BC", {"type" : "range", "start": -400, "stop": -301})

{'start': -300, 'stop': None, 'type': 'range+post'}

In [402]:
extract_ante_and_post("shortly after 230 AD")

{'start': 231, 'type': 'post+shortly'}

In [331]:
datation, dating = "not after reign of Trajan", {"start": 98, "stop" : 117, "type" : "range+period", "era": None}
extract_ante_and_post(datation, dating)

{'start': None, 'stop': 117, 'type': 'range+period+ante'}

In [332]:
# example with "unknown"
dating = {"type" : "unknown", "era": None}
for datation in ["non post 230 AD", "shortly after 320 BC", "not bef. 114 BC","not after 317 AD", "before 200 AD", "Ante 114 BC", "post 2nd century BC"]:
  print(datation, extract_ante_and_post(datation, dating))

non post 230 AD {'stop': 230, 'type': 'ante'}
shortly after 320 BC {'start': -319, 'type': 'post+shortly'}
not bef. 114 BC {'start': -114, 'type': 'post'}
not after 317 AD {'stop': 317, 'type': 'ante'}
before 200 AD {'stop': 199, 'type': 'ante'}
Ante 114 BC {'stop': -115, 'type': 'ante'}
post 2nd century BC {'type': 'unknown', 'era': None}


# Parse periods

In [335]:
# read periods from our external coding
periods = get_as_dataframe(PHI_overview.worksheet("periods"))
periods

Unnamed: 0,period,start,stop,type,era,source,notes,link
0,Roman imp,-31,410,range+period,AD,PeriodO,,http://n2t.net/ark:/99152/p08m57hqcc5
1,Rom. Imp,-31,410,range+period,BC/AD,PeriodO,,http://n2t.net/ark:/99152/p08m57hqcc5
2,aet. imp.,-31,410,range+period,BC/AD,PeriodO,,http://n2t.net/ark:/99152/p08m57hqcc5
3,aet. Rom.,-146,324,range+period,BC/AD,,,
4,Roman period,-146,324,range+period,BC/AD,,,
5,reign of Hadrian,117,138,range+period,AD,PeriodO,,http://n2t.net/ark:/99152/p0jrrjbntfj
6,reign of Justinian,527,565,range+period,AD,PeriodO,,http://n2t.net/ark:/99152/p06c6g3r7ht
7,reign of Ant. Pius,138,161,range+period,AD,PeriodO,,http://n2t.net/ark:/99152/p06c6g3drk4
8,reign of Augustus,-27,14,range+period,BC/AD,PeriodO,,http://n2t.net/ark:/99152/p06c6g3xnmx
9,reign of Tiberius,14,37,range+period,AD,PeriodO,,http://n2t.net/ark:/99152/p0jrrjbts8w


In [336]:
periods_dict = periods.set_index("period").T.to_dict()
periods_dict["reign of Claudius"]

{'era': 'AD',
 'link': 'http://n2t.net/ark:/99152/p0jrrjb8spw',
 'notes': nan,
 'source': 'PeriodO',
 'start': 41,
 'stop': 54,
 'type': 'range+period'}

In [337]:
def extract_period(datation, dating=None):
  if (dating==None): dating = {"type" : "unknown"} 
  if dating["type"] == "unknown":
    for key in periods_dict.keys():
      if key.lower() in datation.lower(): # use lower cases to match everything
        period = periods_dict[key]   
        dating_update = period
        break
      else:
        dating_update = {"type" : "unknown"}
    return dating_update
  else:
    return dating


In [338]:
# example:
for datation in ["Roman Imperial", "reign of Augustus", "Antonine period"]:
  print({datation : extract_period(datation)})

{'Roman Imperial': {'start': -31, 'stop': 410, 'type': 'range+period', 'era': 'AD', 'source': 'PeriodO', 'notes': nan, 'link': 'http://n2t.net/ark:/99152/p08m57hqcc5'}}
{'reign of Augustus': {'start': -27, 'stop': 14, 'type': 'range+period', 'era': 'BC/AD', 'source': 'PeriodO', 'notes': nan, 'link': 'http://n2t.net/ark:/99152/p06c6g3xnmx'}}
{'Antonine period': {'start': 96, 'stop': 192, 'type': 'range+period', 'era': 'AD', 'source': 'PeriodO', 'notes': nan, 'link': 'http://n2t.net/ark:/99152/p06c6g34zjk'}}


# Parse "/" for individual dates

The "/" character is used in several different cases, each of which requires a slightly different approach. Here we parse cases in which it joins individual date numbers, e.g. "114/3 BC", which is translated as the interval (-114, -113). However, if there is a larger gap between the two numbers, the "/" character is treated as "or" and the alternative date is extracted into the "or" key within the dictionary.
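
The second number of such a pair is often abbreviated, so it has to be completed from the leading digits of the first one. The sketch below shows only that completion step; `expand_slash_pair` is a hypothetical helper, and it deliberately leaves out the BC/AD sign handling.

```python
import re

def expand_slash_pair(text):
    """Hypothetical helper: return both numbers of an 'N/M' pair as integers,
    completing an abbreviated second number from the first one."""
    match = re.search(r"(\d+)/(\d+)", text)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    missing = len(first) - len(second)
    if missing > 0:
        # borrow the missing leading digits from the first number
        second = first[:missing] + second
    return int(first), int(second)

print(expand_slash_pair("114/3 BC"))   # (114, 113)
print(expand_slash_pair("229/30"))     # (229, 230)
```

Negating both numbers for BC dates would be a separate step applied to the returned pair.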

In [92]:
ors = PHI_by_500[PHI_by_500["raw_date"].str.contains(r"(\d+)(\/)(\d+)")]["raw_date"].tolist()
len(ors)

9664

In [108]:
# examples of more "/" combined with "-"
[datation for datation in ors if re.search(r'(\d+)(\/)(\d+).?-.?(\d+)(\/)(\d+)', datation)][:10]

['139/8-122/1 BC ',
 '217/8-226/7 AD?',
 '331/0-330/29 a.',
 'c. 231/0-230/29',
 'c. 176/5-170/69',
 'c. 176/5-170/69',
 '167/6-165/4? a.',
 '74/3-63/2 a.',
 '132/3-137/8',
 '132/3-137/8']

In [124]:
# examples of more "/" combined with " or "
[datation for datation in ors if re.search(r'(\d+)(\/)(\d+)().?(\sor\s).?(\d+)(\/)(\d+)', datation)][:10]

['27/6 or 17/8?',
 '186/5, 162/1 or 151/0 BC?',
 '186/5, 162/1 or 151/0 BC?',
 '14/13 or 9/8 BC',
 '14/13 or 13/12 BC ',
 '129/130 or 245/246 AD [229/230 AD (Tataki, Ed. Pr.)] ',
 '268/7 or 265/4 a. ',
 '267/6 or 265/4 a. ',
 '564/5 or 464/5 AD',
 '578/9 or 494/5 AD']

In [354]:
def complete_numbers(datation, date1, date2):
  # if the second number contains less numerals, try to complete it
  len_diff = len(date1) - len(date2)
  if len_diff > 0:
    date2 = date1[:len_diff] + date2
  # transform it into integer
  date1 = int(date1)
  date2 = int(date2)
  if ("AD" not in datation) and (date1 > date2):
    date1 = date1 * -1
    date2 = date2 * -1
    #if date1 > date2:
       #  date1, date2 = date2, date1
  return date1, date2

def match_or(datation, dating=None):
  if dating==None: dating = {"type" : "unknown"}
  if dating["type"] == "unknown":
    matches = re.findall(r'(\d+)(\/)(\d+)', str(datation), flags=re.IGNORECASE)
    if len(matches) != 0:
        date1, date2 = complete_numbers(datation, matches[0][0], matches[0][2])
        if len(matches) > 1: # if there is more than one match
          date3, date4 = complete_numbers(datation, matches[1][0], matches[1][2])         
        #if date1 > date2:
        #  date1, date2 = date2, date1
        if re.search(r'(\d+)(\/)(\d+)().?(\sor\s).?(\d+)(\/)(\d+)', datation):
          if abs(date1 - date4) < 5: # if it is something like "331/0 or 330/29 BC"
            dating.update({"start" : date1, "stop": date4, "type" : "range"})
          else: # treat the or numbers as an alternative range
            dating.update({"start" : date1, "stop": date2, "type" : "range+or", "or": {"start" : date3, "stop": date4, "type" : "range"}})
        elif re.search(r'(\d+)(\/)(\d+).?-.?(\d+)(\/)(\d+)', datation):
          dating.update({"start" : date1, "stop": date4, "type" : "range"})
        else:
          if abs(date1 - date2) < 3:
            dating.update({"start" : date1, "stop": date2, "type" : "range"})
          else:
            dating.update({"exact" : date1, "or": {"exact" : date2, "type" : "exact"}, "type" : "exact+or"})
        dating = extract_ante_and_post(datation, dating)
        return dating
        #if dating_update["type"] == "post":
        #   return {"start" : date1, "or": {"start" : date2, "type" : "post"}, "type" : "post+or"}
        #elif extract_ante_and_post(datation, dating)["type"] == "ante":
        #  return {"stop" : date1, "or": {"stop" : date2, "type" : "ante"}, "type" : "ante+or"}
        #else:
        #  return {"exact" : date1, "or": {"exact" : date2, "type" : "exact"}, "type" : "exact+or"}
    else:
      return dating
  return {"type" : "unknown"}

In [356]:
match_or("5th or 4th c. BC")

{'type': 'unknown'}

In [357]:
match_or(" 12/1 BC")

{'start': -12, 'stop': -11, 'type': 'range'}

In [358]:
match_or("229/30 or 230/1")

{'start': 229, 'stop': 231, 'type': 'range'}

In [359]:
match_or("27/6 or 17/8")

{'or': {'start': 17, 'stop': 18, 'type': 'range'},
 'start': -27,
 'stop': -26,
 'type': 'range+or'}

In [360]:
match_or("14/13 or 13/12 BC")

{'start': -14, 'stop': -12, 'type': 'range'}

In [361]:
match_or("aft. 14/13 or 13/12 BC")

{'start': -11, 'stop': None, 'type': 'range+post'}

In [362]:
match_or("139/8-122/1 BC")

{'start': -139, 'stop': -121, 'type': 'range'}

# Parse phase

In [197]:
# parametrization 
early_late = 0.25 # i.e. first or last 25% of the range
beginning_end = 0.1 # i.e. first or last 10% of the range
middle = 0.05 # i.e. 5% left of the middle, 5% right of the middle
ca = 0.1 # i.e. plus 10% of the range on the left side and plus 10% on the right side

In [198]:
# application of this function requires that you already have a dating dictionary having either start and stop or an exact date (for "ca.")
def modify_by_phase(datation, dating):
  if (not "phase" in dating["type"]) and (not "morece" in dating["type"]):
    try:
      start, stop = dating["start"], dating["stop"]
      try: 
        duration = abs(dating["stop"] - dating["start"])
      except:
        duration = 1
      datation = datation.lower()
      if "early" in datation:
        coef = early_late
        dating["stop"] = start + round(duration * coef)
        dating["type"] = dating["type"] + "+phase+early"
      if "late" in datation:
        if "late antiquity" not in datation:
          coef = early_late
          dating["start"] = stop - round(duration * coef)
          dating["type"] = dating["type"] + "+phase+late"
      if re.search("(beginning|beg\.?\s)", datation):
        coef = beginning_end
        dating["stop"] = start + round(duration * coef)
        dating["type"] = dating["type"] + "+phase+beg"
      if re.search("(end\s|fin.\s)", datation):
        coef = beginning_end
        dating["start"] = stop - round(duration * coef)
        dating["type"] = dating["type"] + "+phase+end"
      if re.search("(middle|mid\.?\s|med\.\s)", datation):
        coef = middle # that means: "middle 2nd c. AD" => 146 - 155
        dating_avr = (dating["start"] + dating["stop"]) / 2
        dating["start"] = round(dating_avr - (coef * duration))
        dating["stop"] = round(dating_avr + (coef * duration))
        dating["type"] = dating["type"] + "+phase+middle"
      if re.search("ca\.\s", datation):
        dating["type"] = dating["type"] + "+phase+ca"
        if ("exact" in dating["type"]) or duration < 10:
          dating.update({"start" : dating["exact"] - 5, "stop" : dating["exact"] + 5})
          dating["exact"] = None
        else:
          dating["start"] = start - round(duration * ca)
          dating["stop"] = stop + round(duration * ca)
      return dating
    except:
      return dating
  else: 
    return dating

In [403]:
# example 1: "ca." in case of individual date
datation = "ca. 200 BC"
dating = {"start" : None, "stop" : None, "exact" : -200, "type" : "exact", "era" : "BC"}
print(modify_by_phase(datation, dating))

{'start': -205, 'stop': -195, 'exact': None, 'type': 'exact+phase+ca', 'era': 'BC'}


In [155]:
#  example 2: "ca." in case of century
datation = "ca. s. II BC"
dating = {"start" : -200, "stop" : -101, "type" : "range+cent"}
print(modify_by_phase(datation, dating))

{'start': -210, 'stop': -91, 'type': 'range+cent+phase+ca'}


In [156]:
#  example 3: "early"
datation = "early 2nd BC"
dating = {"start" : -200, "stop" : -101, "type" : "range+cent"}
print(modify_by_phase(datation, dating))

{'start': -200, 'stop': -175, 'type': 'range+cent+phase+early'}


# Parse centuries

In [157]:
# read centuries table from gsheet
centuries_df = get_as_dataframe(PHI_overview.worksheet("centuries"))
centuries_df.set_index("arabic", inplace=True)
centuries_df

arabic,roman,start_BC,stop_BC,start_AD,stop_AD
8th,VIII,-800,-701,701,800
7th,VII,-700,-601,601,700
6th,VI,-600,-501,501,600
4th,IV,-400,-301,301,400
5th,V,-500,401,401,500
3rd,III,-300,-201,201,300
2nd,II,-200,-101,101,200
1st,I,-100,-1,1,100


In [158]:
arabics = centuries_df.index.tolist()
arabics

['8th', '7th', '6th', '4th', '5th', '3rd', '2nd', '1st']

In [159]:
centuries_df["roman"].tolist()

['VIII', 'VII', 'VI', 'IV', 'V', 'III', 'II', 'I']

In [160]:
# navigating through the dataframe using index and ".loc[]"
centuries_df.loc["3rd"]["roman"]

'III'

In [163]:
any(re.search(arabic, "2nd c. AD") for arabic in arabics)

True

In [233]:
datation = "III bagrt"
cent = "III"
match =  re.search("^\s?" + cent + "(\s|$|/|-)", datation)
if match:
  print(match)

<_sre.SRE_Match object; span=(0, 4), match='III '>


In [396]:
def parse_centuries(datation, dating=None):
  if dating==None: dating = {"type" : "unknown"}
  if dating["type"] == "unknown": 
    if any(cent for cent in centuries_df["roman"].tolist() if (re.search("^\s?" + cent + "($|/)", datation)) or (re.search("(s|c)\.", datation))):
      for roman, arabic in zip(centuries_df["roman"].tolist(), arabics):
        datation = re.sub(roman, arabic, datation)
    cents = [cent for cent in centuries_df.index.tolist() if re.search(cent, datation)] # we have the centuries mentioned, but not in right order :-)
    if len(cents) > 0:
      cents_list = re.split("-|/|\sor\s", datation)
      if len(cents_list) > 1:
        try:
          century1 = [re.sub(".*" + num + ".*", num, cents_list[0])  for num in arabics if num in cents_list[0]][0]
          century2 = [re.sub(".*" + num + ".*", num, cents_list[1])  for num in arabics if num in cents_list[1]][0]
          if (" AD" in datation) and (" BC" not in datation): # if explicit AD and only AD:
            start = modify_by_phase(cents_list[0], {"start" : centuries_df.loc[century1]["start_AD"], "stop" : centuries_df.loc[century1]["stop_AD"], "type" : "cent"})["start"]
            stop = modify_by_phase(cents_list[1], {"start" : centuries_df.loc[century2]["start_AD"], "stop" : centuries_df.loc[century2]["stop_AD"], "type" : "cent"})["stop"]
            era = "AD"
          elif (" BC" in cents_list[0]) and (" AD" in cents_list[1]):
            start = modify_by_phase(cents_list[0], {"start" : centuries_df.loc[century1]["start_BC"], "stop" : centuries_df.loc[century1]["stop_BC"], "type" : "cent"})["start"]
            stop = modify_by_phase(cents_list[1], {"start" : centuries_df.loc[century2]["start_AD"], "stop" : centuries_df.loc[century2]["stop_AD"], "type" : "cent"})["stop"]
            era = "BC/AD"
          else:
            start = modify_by_phase(cents_list[0], {"start" : centuries_df.loc[century1]["start_BC"], "stop" : centuries_df.loc[century1]["stop_BC"], "type" : "cent"})["start"]
            stop = modify_by_phase(cents_list[1], {"start" : centuries_df.loc[century2]["start_BC"], "stop" : centuries_df.loc[century2]["stop_BC"], "type" : "cent"})["stop"]
            era = "BC"
          dating_update = {"start" : start, "stop" : stop, "era" : era, "type" : "range+cent+morece"}
        except:
          dating_update = dating
      elif len(cents) == 1:
        century = [re.sub(".*" + num + ".*", num, datation)  for num in arabics if num in datation][0]
        if " AD" in datation: 
          start = centuries_df.loc[century]["start_AD"]
          stop = centuries_df.loc[century]["stop_AD"]
          era = "AD"
        else:
          start = centuries_df.loc[century]["start_BC"]
          stop = centuries_df.loc[century]["stop_BC"]
          era = "BC"
        dating_update = {"start" : start, "stop" : stop, "era" : era, "type" : "range+cent"}
        dating_update = modify_by_phase(datation, dating_update)
      else:
        dating_update = {"type": "unknown"}
      return dating_update
    else:
      return dating
  else:
    return dating

In [240]:
# example 1
datation = "III/II" # "a." and "p." have already been replaced by "BC" and "AD"
parse_centuries(datation)

{'era': 'BC', 'start': -300, 'stop': -101, 'type': 'range+cent+morece'}

In [241]:
datation = "fin. s. III/II" # "a." and "p." have already been replaced by "BC" and "AD"
parse_centuries(datation)

{'era': 'BC', 'start': -211.0, 'stop': -101, 'type': 'range+cent+morece'}

In [242]:
# example: "late" narrows a single century to its last quarter
datation = "late 1st c. AD" # "a." and "p." have already been replaced by "BC" and "AD"
parse_centuries(datation, {"type": "unknown"})

{'era': 'AD', 'start': 75.0, 'stop': 100, 'type': 'range+cent+phase+late'}

In [243]:
# example 
parse_centuries("2nd/3rd c. AD", {"era": "AD", "type": "unknown"})

{'era': 'AD', 'start': 101, 'stop': 300, 'type': 'range+cent+morece'}

In [245]:
# example
parse_centuries("late 2nd c. BC/beginning 1st c. BC")

{'era': 'BC', 'start': -126.0, 'stop': -90.0, 'type': 'range+cent+morece'}

# Simple dates and ranges
This is the last function we apply, since it ignores all other potentially relevant information and looks only for individual dates and straightforward ranges.

In [444]:
def simple_dates_and_ranges(datation, dating=None):
  if dating==None: dating = {"type" : "unknown"}
  if dating["type"] == "unknown":
    if " AD" in datation:
      dating["era"] = "AD" 
      try:
        date_both = re.search('(\d+)(\-)(\d+)', datation, flags=re.IGNORECASE).groups()
        if any(time_indicator in datation for time_indicator in [" BC", " AD", "prob", "possi"]):
          date1, date2 = date_both[0], date_both[2]
          len_diff = len(date1) - len(date2)
          if len_diff > 0:
            date2 = date1[:len_diff] + date2
          dating.update({"start" : int(date1), "stop" : int(date2), "type" : "range"})
      except:  
        try:
          match = re.search('\d+', datation, flags=re.IGNORECASE)
          if any(time_indicator in datation for time_indicator in [" BC", " AD", "prob", "possi"]):
            dating.update({"exact" : int(match[0]), "type": "exact"})
        except:
          pass
    else:
      try:
        date_both = re.search('(\d+)(\-)(\d+)', datation, flags=re.IGNORECASE).groups()
        if any(time_indicator in datation for time_indicator in [" BC", " AD", "prob", "possi"]):
          date1, date2 = date_both[0], date_both[2]
          len_diff = len(date1) - len(date2)
          if len_diff > 0:
            date2 = date1[:len_diff] + date2
          date1, date2 = int(date1) * -1, int(date2) * -1
          if date1 < date2:
            dating.update({"start" : date1, "stop" : date2, "type" : "range", "era" : "BC"})
      except:  
        match = re.match(r"^\s?(\d+)\s?$", datation) # a bare number with nothing else is read as a BC year
        if match:
          dating.update({"exact" : int(match[1]) * -1, "type": "exact"})
        else:
          if any(time_indicator in datation for time_indicator in [" BC", " AD", "prob", "possi"]):
            match = re.search(r'\d+', datation)
            if match: # guard: an indicator such as "prob" can occur without any digits
              dating.update({"exact" : int(match[0]) * -1, "type": "exact"})
  return dating

In [368]:
simple_dates_and_ranges("4th or 5th c. AD", {"type": "century"})

{'type': 'century'}

In [370]:
simple_dates_and_ranges("320 BC")

{'exact': -320, 'type': 'exact'}

In [318]:
simple_dates_and_ranges("213 AD")

{'era': 'AD', 'exact': 213, 'type': 'exact'}

In [319]:
simple_dates_and_ranges("213-4 AD")

{'era': 'AD', 'start': 213, 'stop': 214, 'type': 'range'}

In [320]:
simple_dates_and_ranges("124-15 BC")

{'era': 'BC', 'start': -124, 'stop': -115, 'type': 'range'}
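The abbreviated end years in the last two examples ("213-4", "124-15") are completed by borrowing the leading digits of the start year. A minimal standalone sketch of that step follows; `expand_end_year` is a hypothetical helper name used only for illustration and is not defined in the script.

```python
def expand_end_year(start, end):
    """Complete an abbreviated end year with the leading digits of the start year."""
    len_diff = len(start) - len(end)
    if len_diff > 0:
        # e.g. start="213", end="4": borrow "21" and append "4"
        end = start[:len_diff] + end
    return int(start), int(end)

print(expand_end_year("213", "4"))    # (213, 214)
print(expand_end_year("124", "15"))   # (124, 115)
print(expand_end_year("100", "150"))  # (100, 150)
```

For BC dates the function above then multiplies both values by -1, which is how "124-15 BC" becomes the interval -124 to -115.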

# Main function

In [464]:
def date_extractor(datation):
  dating = {"start": None, "stop": None, "exact": None, "or": None, "type": "unknown", "era": None}
  #replace " a." and " p." when at the end of datation or before "/" or "-"
  datation = re.sub("(\s)(a\.)(\-|\/|$)", r"\1BC\3", datation)
  datation = re.sub("(\s)(p\.)(\-|\/|$)", r"\1AD\3", datation)
  # UNCERTAINTY
  if "?" in datation:
    datation = datation.replace("?", "")
    dating["certainty"] = "?"
  if datation.startswith("["): # also safe when datation is an empty string
    datation = datation[1:-1]
    dating["certainty"] = "?"
  # ranges combining BC and AD:
  match = re.search(r"(\d+)(\s(a\.|BC))(\-)(\d+)(\s(p\.|AD))", datation)
  if match:
    # convert both years to integers so that "stop" is not stored as a string
    dating.update({"start" : int(match.groups()[0]) * -1, "stop" : int(match.groups()[4]), "era" : "BC/AD", "type" : "range"})
  # simple ante quem and postquem
  dating.update(extract_ante_and_post(datation, dating))
  # PERIODS
  dating.update(extract_period(datation, dating))
  # CENTURIES
  dating.update(parse_centuries(datation, dating))
  # "year/year"
  if dating["type"] == "unknown":
    dating.update(match_or(datation, dating)) # find all "/" instances linked with individual years
  # if we still don't know:
  dating.update(simple_dates_and_ranges(datation, dating))
  # extract phases (e.g. "early", "middle", "late", "beginning" etc.)
  dating.update(modify_by_phase(datation, dating))
  # extract ante quem and post quem for ranges
  try:
    dating.update(extract_ante_and_post(datation, dating))
  except:
    pass
  if dating["type"]=="unknown":
    dating.update({"start": None, "stop": None, "exact": None, "or": None, "type": "unknown", "era" : None})
  return dating 
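
The first steps of `date_extractor()` normalise the era abbreviations and record uncertainty markers before any dates are parsed. The sketch below isolates those steps so they can be tried on their own. `preprocess` is a hypothetical helper name, and handling the brackets with `startswith` and `strip` is an assumption that differs slightly from the slicing used above.

```python
import re

def preprocess(datation):
    """Return (cleaned datation, certainty flag), mirroring the first steps of date_extractor()."""
    # " a." -> "BC" and " p." -> "AD", only at the end of the string or before "/" or "-"
    datation = re.sub(r"(\s)(a\.)(-|/|$)", r"\1BC\3", datation)
    datation = re.sub(r"(\s)(p\.)(-|/|$)", r"\1AD\3", datation)
    certainty = None
    if "?" in datation:
        datation = datation.replace("?", "")
        certainty = "?"
    if datation.startswith("["):
        # assumption: a bracketed date is uncertain; strip the brackets on both sides
        datation = datation.strip("[]")
        certainty = "?"
    return datation, certainty

print(preprocess("27 a.-14 p."))           # ('27 BC-14 AD', None)
print(preprocess("fin. s. I a./s. I p."))  # ('fin. s. I BC/s. I AD', None)
print(preprocess("[320 BC?]"))             # ('320 BC', '?')
```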

In [465]:
date_extractor("5th or 4th c. BC")

{'era': 'BC',
 'exact': None,
 'or': None,
 'start': -500,
 'stop': -301,
 'type': 'range+cent+morece'}

In [466]:
date_extractor("14/13 or 13/12 BC") 

{'era': None,
 'exact': None,
 'or': None,
 'start': -14,
 'stop': -12,
 'type': 'range'}

In [469]:
date_extractor("after 216/5") 

{'era': None,
 'exact': None,
 'or': None,
 'start': -214,
 'stop': None,
 'type': 'range+post'}

In [467]:
date_extractor("after 14/13 or 13/12 BC") 

{'era': None,
 'exact': None,
 'or': None,
 'start': -11,
 'stop': None,
 'type': 'range+post'}

In [447]:
date_extractor("s. II/III AD")

{'era': 'AD',
 'exact': None,
 'or': None,
 'start': 101,
 'stop': 300,
 'type': 'range+cent+morece'}

In [470]:
date_extractor("fin. s. I a./s. I p.")

{'era': 'BC/AD',
 'exact': None,
 'or': None,
 'start': -11.0,
 'stop': 100,
 'type': 'range+cent+morece'}

In [471]:
date_extractor("2nd-early 3rd c. AD")

{'era': 'AD',
 'exact': None,
 'or': None,
 'start': 101,
 'stop': 226.0,
 'type': 'range+cent+morece'}

In [472]:
date_extractor("320 BC")

{'era': None,
 'exact': -320,
 'or': None,
 'start': None,
 'stop': None,
 'type': 'exact'}

In [473]:
date_extractor("post 120 AD")

{'era': None,
 'exact': None,
 'or': None,
 'start': 121,
 'stop': None,
 'type': 'post'}

In [474]:
date_extractor("after 4th c. BC")

{'era': 'BC',
 'exact': None,
 'or': None,
 'start': -300,
 'stop': None,
 'type': 'range+cent+post'}

In [475]:
# testing/example
for datation in ["Byzantine", "non ante s. II a.", "140/39 BC", "Rom. Imp", "reign of Augustus", "ante 140 BC", "late Antonine period", "IosPE IV 348"]:
  print(datation, date_extractor(datation))

Byzantine {'start': None, 'stop': None, 'exact': None, 'or': None, 'type': 'unknown', 'era': None}
non ante s. II a. {'start': -200, 'stop': None, 'exact': None, 'or': None, 'type': 'range+cent+post', 'era': 'BC'}
140/39 BC {'start': -140, 'stop': -139, 'exact': None, 'or': None, 'type': 'range', 'era': None}
Rom. Imp {'start': -31, 'stop': 410, 'exact': None, 'or': None, 'type': 'range+period', 'era': 'BC/AD', 'source': 'PeriodO', 'notes': nan, 'link': 'http://n2t.net/ark:/99152/p08m57hqcc5'}
reign of Augustus {'start': -27, 'stop': 14, 'exact': None, 'or': None, 'type': 'range+period', 'era': 'BC/AD', 'source': 'PeriodO', 'notes': nan, 'link': 'http://n2t.net/ark:/99152/p06c6g3xnmx'}
ante 140 BC {'start': None, 'stop': -141, 'exact': None, 'or': None, 'type': 'ante', 'era': None}
late Antonine period {'start': 168, 'stop': 192, 'exact': None, 'or': None, 'type': 'range+period+phase+late', 'era': 'AD', 'source': 'PeriodO', 'notes': nan, 'link': 'http://n2t.net/ark:/99152/p06c6g34zjk'}

In [476]:
datation = "Nécrop.Myr. 224,5"
date_extractor(datation)

{'era': None,
 'exact': None,
 'or': None,
 'start': None,
 'stop': None,
 'type': 'unknown'}

In [477]:
datation = " non ante med. s. II p."
date_extractor(datation)

{'era': 'AD',
 'exact': None,
 'or': None,
 'start': 146.0,
 'stop': None,
 'type': 'range+cent+phase+middle+post'}

In [478]:
date_extractor("not bef. the Antonine period")

{'era': 'AD',
 'exact': None,
 'link': 'http://n2t.net/ark:/99152/p06c6g34zjk',
 'notes': nan,
 'or': None,
 'source': 'PeriodO',
 'start': 96,
 'stop': None,
 'type': 'range+period+post'}

In [479]:
date_extractor("1st BC/1st AD")

{'era': 'BC/AD',
 'exact': None,
 'or': None,
 'start': -100,
 'stop': 100,
 'type': 'range+cent+morece'}

In [480]:
date_extractor("Newton, Disc. II 742-43:91")

{'era': None,
 'exact': None,
 'or': None,
 'start': None,
 'stop': None,
 'type': 'unknown'}

In [481]:
datation = "27 a.-14 p."
match = re.search(r"(\d+)(\s(a\.|BC))(\-)(\d+)(\s(p\.|AD))", datation)
if match:
  dating = {"start" : int(match.groups()[0]) * -1, "stop" : int(match.groups()[4]), "era" : "BC/AD", "type" : "range"}
dating

{'era': 'BC/AD', 'start': -27, 'stop': 14, 'type': 'range'}

In [482]:
test_list = PHI_by_500["raw_date"].tolist()
for datation in test_list:
  print(datation, date_extractor(datation))

IStr 427 {'start': None, 'stop': None, 'exact': None, 'or': None, 'type': 'unknown', 'era': None}
Rom. Imp. period {'start': -31, 'stop': 410, 'exact': None, 'or': None, 'type': 'range+period', 'era': 'BC/AD', 'source': 'PeriodO', 'notes': nan, 'link': 'http://n2t.net/ark:/99152/p08m57hqcc5'}
220/19 {'start': -220, 'stop': -219, 'exact': None, 'or': None, 'type': 'range', 'era': None}
Thasos {'start': None, 'stop': None, 'exact': None, 'or': None, 'type': 'unknown', 'era': None}
el-Boueib {'start': None, 'stop': None, 'exact': None, 'or': None, 'type': 'unknown', 'era': None}
ca. 100-150 AD {'start': 95, 'stop': 155, 'exact': None, 'or': None, 'type': 'range+phase+ca', 'era': 'AD'}
267-668 AD {'start': 267, 'stop': 668, 'exact': None, 'or': None, 'type': 'range', 'era': 'AD'}
1st c. BC  {'start': -100, 'stop': -1, 'exact': None, 'or': None, 'type': 'range+cent', 'era': 'BC'}
Roman period?  {'start': -146, 'stop': 324, 'exact': None, 'or': None, 'type': 'range+period', 'era': 'BC/AD', '

In [483]:
PHI_list_of_dict = []
for inscription_date_tuple in list(zip(PHI_by_500["PHI_ID"].tolist(), PHI_by_500["tildeinfo"].tolist(),  PHI_by_500["raw_date"].tolist())):
  dating = date_extractor(inscription_date_tuple[2])
  data_dict = {"PHI_ID": inscription_date_tuple[0], "tildeinfo" : inscription_date_tuple[1], "raw_date" : inscription_date_tuple[2]}
  data_dict.update(dating)
  PHI_list_of_dict.append(data_dict)

In [484]:
PHI_by_500_dates_v8 = pd.DataFrame(PHI_list_of_dict)
PHI_by_500_dates_v8.head(5)

| | PHI_ID | tildeinfo | raw_date | start | stop | exact | or | type | era | source | notes | link | certainty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 262001 | IStr 427 | IStr 427 | | | | | unknown | | | | | |
| 1 | 266001 | Pont. — Amasia — Rom. Imp. period | Rom. Imp. period | -31.0 | 410.0 | | | range+period | BC/AD | PeriodO | | http://n2t.net/ark:/99152/p08m57hqcc5 | |
| 2 | 231001 | Att. — Athens: Agora — 220/19 | 220/19 | -220.0 | -219.0 | | | range | | | | | |
| 3 | 79501 | Thasos | Thasos | | | | | unknown | | | | | |
| 4 | 218501 | Eg. — el-Boueib | el-Boueib | | | | | unknown | | | | | |


In [489]:
set_with_dataframe(PHI_overview.add_worksheet("PHI_by_500_dates_v8", 1,1), PHI_by_500_dates_v8)

# Match "/" for individual dates

In [None]:
def match_or(datation):
  match = re.search(r'(\d+)(\/)(\d+)', str(datation), flags=re.IGNORECASE)
  if match != None:
      date1 = match.groups()[0]
      date2 = match.groups()[2]
      len_diff = len(date1) - len(date2)
      if len_diff > 0:
        date2 = date1[:len_diff] + date2
      date1 = int(date1)
      date2 = int(date2)
      if "AD" not in str(datation):
        date1 = date1 * -1
        date2 = date2 * -1
      return {"exact" : date1, "or": date2, "type" : "or"}
  else:
      return {"type" : "unknown"}


In [None]:
ors = PHI_by_500[PHI_by_500["raw_date"].str.contains(r'\d+/\d+')]["raw_date"].tolist() # no capture groups, so pandas raises no match-group warning
print(ors)

[' 220/19', ' 12/1 BC', ' 116/5 BC', ' 229/30 or 230/1 AD ', ' ante 336/5', ' 204/3?', ' 73/4 AD', ' ca. 163/2 BC', ' 128/127 BC ', ' shortly after 208/7 BC (or 207/6) ', ' 154/5 AD', ' 66/7 AD']



In [None]:
# example/testing
for our_or in ors:
  print({our_or: match_or(our_or)})


{' 220/19': {'exact': -220, 'or': -219, 'type': 'or'}}
{' 12/1 BC': {'exact': -12, 'or': -11, 'type': 'or'}}
{' 116/5 BC': {'exact': -116, 'or': -115, 'type': 'or'}}
{' 229/30 or 230/1 AD ': {'exact': 229, 'or': 230, 'type': 'or'}}
{' ante 336/5': {'exact': -336, 'or': -335, 'type': 'or'}}
{' 204/3?': {'exact': -204, 'or': -203, 'type': 'or'}}
{' 73/4 AD': {'exact': 73, 'or': 74, 'type': 'or'}}
{' ca. 163/2 BC': {'exact': -163, 'or': -162, 'type': 'or'}}
{' 128/127 BC ': {'exact': -128, 'or': -127, 'type': 'or'}}
{' shortly after 208/7 BC (or 207/6) ': {'exact': -208, 'or': -207, 'type': 'or'}}
{' 154/5 AD': {'exact': 154, 'or': 155, 'type': 'or'}}
{' 66/7 AD': {'exact': 66, 'or': 67, 'type': 'or'}}
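
`match_or()` as defined above reads only the first "year/year" pair, so in strings such as "229/30 or 230/1 AD" the second pair is ignored. One possible extension, sketched here under the assumption that every pair should be collected, uses `re.finditer`. `all_slash_years` is a hypothetical helper and not part of the script.

```python
import re

def all_slash_years(datation):
    """Collect every "year/year" pair in a string as a list of (first, second) integers."""
    datation = str(datation)
    sign = 1 if "AD" in datation else -1  # same era rule as match_or(): no "AD" means BC
    pairs = []
    for match in re.finditer(r"(\d+)/(\d+)", datation):
        date1, date2 = match.groups()
        len_diff = len(date1) - len(date2)
        if len_diff > 0:
            # complete an abbreviated second year, e.g. "229/30" -> 229 and 230
            date2 = date1[:len_diff] + date2
        pairs.append((int(date1) * sign, int(date2) * sign))
    return pairs

print(all_slash_years(" 229/30 or 230/1 AD "))                 # [(229, 230), (230, 231)]
print(all_slash_years(" shortly after 208/7 BC (or 207/6) "))  # [(-208, -207), (-207, -206)]
```

Taking the minimum and maximum over all pairs gives an enclosing interval, which is consistent with the -14 to -12 range returned for "14/13 or 13/12 BC" in the main function examples.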


# Under development

In [None]:
PHI_by_500[PHI_by_500["raw_date"].str.contains("/")]

Unnamed: 0,URL,Book,Text,hdr1,hdr2,tildeinfo,note,lines,metadata,data,filename,PHI_ID,raw_date
1260,/text/231001?location=1365&patt=&bookid=394&of...,Agora XV,130,Regions\nAttica (IG I-III)\nAttica,Agora XV\n130,Att. — Athens: Agora — 220/19,{},149,1\n\n\n\n5\n\n\n\n\n10\n\n\n\n\n15\n\n\n\n\n20...,ἐπὶ Μενεκράτου ἄρχοντος ἐπὶ τῆς Οἰνεῖδος ἕκτη-...,Agora-XV.csv,231001,220/19
7001,/text/298501?location=1237&patt=&bookid=735&of...,"IDR III,1",153,Regions\nThrace and the Lower Danube (IG X)\nD...,"IDR III,1\n153",Dacia Sup. — Tibiscum (Jupa) — 2nd/3rd c. AD,{},8,1\n\n\n\n5\n\n\n,D(is) M(anibus)\nP(ublius) Ael(ius) Claudia-\n...,IDR-III-1.csv,298501,2nd/3rd c. AD
15498,/text/243501?location=1498&patt=&bookid=474&of...,IGLSyr 4,"1271B,g",Regions\nGreater Syria and the East\nSyria and...,"IGLSyr 4\n1271B,g","Syr., Laodik. — Laodicea — 12/1 BC",{},2,1\n,"ζλʹ {²sc. ἔτους}², ἑκκα-\nιδέκατον.",IGLSyr-4.csv,243501,12/1 BC
16567,/text/299501?location=1237&patt=&bookid=737&of...,"IDR III,3",182,Regions\nThrace and the Lower Danube (IG X)\nD...,"IDR III,3\n182",Dacia Sup. — Micia (Vețel)? — Chimindia — 2nd/...,{},9,frg. a\n\n\n\n\nfrg. b\n\n\n,[— — — — — — — — — — — —]\n[— —] quondam Pompo...,IDR-III-3.csv,299501,2nd/3rd c. AD
24698,/text/64501?location=915&patt=&bookid=1&offset...,ID,2057,"Regions\nAegean Islands, incl. Crete (IG XI-[X...",ID\n2057,Delos — 116/5 BC,{},6,1\n\n\n\n5\n,"Διονύσιος\nΔιονυσίου\nΣφή<τ>τιος, ἱερεὺς\nγενό...",ID.csv,64501,116/5 BC
35543,/text/217001?location=1497&patt=&bookid=362&of...,"Bernand, Inscr. Métr.",113,"Regions\nEgypt, Nubia and Cyrenaïca\nEgypt and...","Bernand, Inscr. Métr.\n113",Eg. — Naukratis (Kōm Giéif) — 3rd/2nd c. BC?,{},4,1\n\n\n,Νειλούσσης ἀλόχου τήνδ’ εἰκόνα Παρθενοπαί[ου]\...,Bernand--Inscr.-M-tr..csv,217001,3rd/2nd c. BC?
38001,/text/228501?location=1497&patt=&bookid=389&of...,Delta I,662198,"Regions\nEgypt, Nubia and Cyrenaïca\nEgypt and...","Delta I\n662,198",Eg. — Naukratis — 6th/5th c. BC,{},1,1,Κ̣α̣ρό̣φνης με ἀνέθηκε τἀπό̣[λλοˉνι το͂ι Μ]ιλα...,Delta-I.csv,228501,6th/5th c. BC
38501,/text/229001?location=1497&patt=&bookid=389&of...,Delta I,712699,"Regions\nEgypt, Nubia and Cyrenaïca\nEgypt and...","Delta I\n712,699",Eg. — Naukratis — 6th/5th c. BC,{},1,1,[Ἀ]πόλλ[ωνός ἐˉμι].,Delta-I.csv,229001,6th/5th c. BC
58576,/text/291501?location=1&patt=&bookid=172&offse...,SEG,21:506,Regions,SEG\n21:506,Att. — 229/30 or 230/1 AD — IG II² 1064+,{},69,\n\n\nfrg. a.1\n\n\n\n5\n\n\n\n\n10\nfrg. b.10...,"IG II(2) 1064+, Oliver, Hesp. Suppl. VI (1941)...",SEG.csv,291501,229/30 or 230/1 AD
60698,/text/152001?location=1&patt=&bookid=172&offse...,SEG,26:759,Regions,SEG\n26:759,Makedonia (Mygdonia) — Thessalonike — 2nd/3rd ...,{},3,6\n\n,. . . εἰ δέ\n[τις ἕτ]ε̣ρος κα-\n[ταθῇ ἕτερον π...,SEG.csv,152001,2nd/3rd c. AD


In [None]:
re.match("")

In [None]:
set_with_dataframe(PHI_overview.add_worksheet("PHI_by_500_dates_v1", 1, 1), PHI_by_500)