# Undate: computing with uncertain and partially-unknown dates

`Undate` is an ambitious, in-progress effort to develop a pragmatic Python library for computation and analysis of temporal information in humanistic and cultural data, with a particular emphasis on uncertain, incomplete, or imprecise dates and with support for multiple calendars.

Researchers in the humanities often work with historical or cultural data, and knowing when particular materials were created or events happened is important for understanding the context, interpreting correctly, and determining relationships and sequencing. However, these kinds of materials rarely have full-precision dates with a known year, month, and day. In some contexts, scholars may be happy if they can determine even just a century based on handwriting or mentions of historic coins.

Humanistic and cultural data also often includes dates in different calendars, or even a mix of calendars within the same project or system. It's important to preserve the original date and calendar information, but it's also valuable to convert dates to a standard calendar so they can be compared and sorted together. `Undate` objects are calendar aware and calendar explicit, with a default of the Gregorian calendar. Currently, we support parsing and calendar conversion for dates in the Hebrew _Anno Mundi_ calendar and Islamic _Hijri_ calendar.

This notebook demonstrates current use and functionality of the core `Undate` and `UndateInterval` objects, along with some examples and use-cases from specific projects.

## Basic functionality

Like Python's builtin `datetime.date` object, an `Undate` can be initialized by specifying numeric values for year, month, and day.

We can print them using the default serialization (ISO8601, or YYYY-MM-DD), and we can compare them.

In [19]:
import datetime

from undate.undate import Undate

# these are equivalent
dt_november7 = datetime.date(2000, 11, 7)
november7 = Undate(2000, 11, 7)

We can print them out. By default, both of these dates will be displayed in ISO8601 format (YYYY-MM-DD).

In [20]:
print(dt_november7)

2000-11-07


In [21]:
print(november7)

2000-11-07


We can also compare them. Is this the same date?

In [22]:
bool(november7 == dt_november7) 

True

Unlike Python's `datetime.date`, an `Undate` can be initialized without providing all values for year, month, and day.

We can create Undate instances for the month of November in 2000, for the year 2000, or even for November 7th in some unknown year.

`Undate` also has an optional `label` field, since it's sometimes useful to attach a label to a date.

In [23]:
# November 2000
november = Undate(2000, 11, label="November 2000")
# Year 2000
year2k = Undate(2000, label="Y2K")
# November 7 in an unknown year
november7_some_year = Undate(month=11, day=7, label="Some November 7")
# let's reinitialize our first date with a label too
november7 = Undate(2000, 11, 7, label="November 7, 2000")

# sometimes names are important
easter1916 = Undate(1916, 4, 23, label="Easter 1916")

Each of these `Undate` objects can be displayed in a standard format, and also has information about the precision of the date and duration information.

In [24]:
for example_date in [november, year2k, november7_some_year, november7, easter1916]:
    print(f"\n{example_date.label}: {example_date}")
    print(f"Date precision: {example_date.precision}")
    print(f"Duration in days: {example_date.duration().days}")


November 2000: 2000-11
Date precision: MONTH
Duration in days: 30

Y2K: 2000
Date precision: YEAR
Duration in days: 366

Some November 7: --11-07
Date precision: DAY
Duration in days: 1

November 7, 2000: 2000-11-07
Date precision: DAY
Duration in days: 1

Easter 1916: 1916-04-23
Date precision: DAY
Duration in days: 1


We can also do some simple calculations, like checking whether one date falls within another date.

In [25]:
november in year2k

True

In [26]:
november7 in year2k

True

In [27]:
november7 in november

True

In [28]:
easter1916 in year2k

False

In [29]:
november7_some_year in year2k

False
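
One way to think about these containment checks is as a comparison of date ranges: a date falls within another when its earliest and latest possible days both lie inside the other's range. The following is a minimal sketch of that idea using only the standard library; `DateRange` and its `contains` method are hypothetical names for illustration, not part of `undate`, and the library's own logic may differ (for example, in how it treats dates with an unknown year).

```python
from datetime import date
from typing import NamedTuple


class DateRange(NamedTuple):
    """Hypothetical stand-in for a date with earliest/latest possible days."""

    earliest: date
    latest: date

    def contains(self, other: "DateRange") -> bool:
        # other falls within self when both of its bounds lie inside self's bounds
        return self.earliest <= other.earliest and other.latest <= self.latest


year_2000 = DateRange(date(2000, 1, 1), date(2000, 12, 31))
november_2000 = DateRange(date(2000, 11, 1), date(2000, 11, 30))
easter_1916 = DateRange(date(1916, 4, 23), date(1916, 4, 23))

print(year_2000.contains(november_2000))  # True
print(year_2000.contains(easter_1916))  # False
```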

## Partially unknown values

We can also initialize an `Undate` object with string values, when a date is only partially known. We use the character **X** to indicate an unknown digit, following the notation used in the [Extended Date Time Format (EDTF)](https://www.loc.gov/standards/datetime/).

In [69]:
someyear_1900s = Undate("19XX", label="1900s")
late2022 = Undate(2022, "1X", label="late 2022")

for example_date in [someyear_1900s, late2022]:
    print(f"\n{example_date.label}: {example_date}")
    print(f"Date precision: {example_date.precision}")


1900s: 19XX
Date precision: YEAR

late 2022: 2022-1X
Date precision: MONTH


When an `Undate` instance is initialized, internally the class calculates earliest and latest possible values for that date in the Gregorian calendar.

This means that some comparisons are possible even without precise information.

For instance, is a year sometime during the 1900s before a month in late 2022?

In [70]:
someyear_1900s < late2022

True

But uncertain dates with the same initial values aren't equal, since they are uncertain:

In [32]:
late2022 == Undate(2022, "1X")

False
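
To make the idea of earliest and latest possible values concrete, here is a minimal standard-library sketch of how bounds could be derived from values with **X** placeholders. The helper names `field_bounds` and `date_bounds` are hypothetical, and the approach (substituting 0 and 9 for unknown digits, then clamping to the valid range) is an assumption for illustration rather than a description of how `undate` is implemented.

```python
import calendar
from datetime import date


def field_bounds(value: str, lowest: int, highest: int) -> tuple[int, int]:
    """Smallest and largest integers a digit string with X placeholders can mean."""
    smallest = int(value.replace("X", "0"))
    largest = int(value.replace("X", "9"))
    # clamp to the valid range for this field, e.g. months 1-12
    return max(smallest, lowest), min(largest, highest)


def date_bounds(year: str, month: str = "XX") -> tuple[date, date]:
    """Earliest and latest possible Gregorian days for a partially known date."""
    min_year, max_year = field_bounds(year, 1, 9999)
    min_month, max_month = field_bounds(month, 1, 12)
    last_day = calendar.monthrange(max_year, max_month)[1]
    return date(min_year, min_month, 1), date(max_year, max_month, last_day)


century_earliest, century_latest = date_bounds("19XX")
late_earliest, late_latest = date_bounds("2022", "1X")
print(century_earliest, century_latest)  # 1900-01-01 1999-12-31
print(late_earliest, late_latest)  # 2022-10-01 2022-12-31
# every possible day in the 1900s comes before every possible day in late 2022
print(century_latest < late_earliest)  # True
```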

The `Undate` class has properties to return `year`, `month`, and `day` if they are known. They are returned as strings to allow for partially unknown dates, and return `None` when a value is unknown.

Here are some examples from the dates we created earlier.

In [122]:
# november7 = Undate(2000, 11, 7)
assert november7.year == "2000"
assert november7.month == "11"
assert november7.day == "07"

# year2k = Undate(2000, label="Y2K")
assert year2k.year == "2000"
assert year2k.month is None
assert year2k.day is None

# someyear_1900s = Undate("19XX", label="1900s")
assert someyear_1900s.year == "19XX"

## Date Intervals

Like many other date libraries, `undate` includes support for intervals.  An `UndateInterval` is a date range between two `Undate` objects. Intervals can be open-ended, allow for optional labels, and can calculate duration if enough information is known.

In [77]:
from undate import UndateInterval

nineteenth_c = UndateInterval(Undate(1801), Undate(1900), label="19th century")
nineteenth_c

<UndateInterval '19th century' (1801/1900)>

An `UndateInterval` has an earliest and a latest value for the start and end of the date range. Since those are `Undate` instances, they also have earliest and latest values.

The duration of an interval is calculated based on the difference between the last day and the first day in the range.

In [78]:
print(nineteenth_c.earliest.earliest)

1801-01-01


In [82]:
print(nineteenth_c.latest.latest)

1900-12-31


In [83]:
nineteenth_c.duration()

Timedelta(36524, dtype='timedelta64[D]')
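
The same figure can be reproduced with plain `datetime.date` arithmetic. This short check assumes the count is inclusive of both the first and the last day, which is our reading of the output above rather than a statement about the library's internals.

```python
from datetime import date

first_day = date(1801, 1, 1)
last_day = date(1900, 12, 31)

# subtracting dates gives the gap between them; add one to count both endpoints
inclusive_days = (last_day - first_day).days + 1
print(inclusive_days)  # 36524
```

The total is 100 years of 365 days plus 24 leap days (1804 through 1896; 1900 is not a leap year in the Gregorian calendar).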

Intervals can also be open-ended.  Here are a couple of examples:

In [84]:
UndateInterval(latest=Undate(2000))  # before 2000

<UndateInterval ../2000>

In [85]:
UndateInterval(Undate(1900))  # after 1900

<UndateInterval 1900/>

## Parsing dates in supported formats

Initializing an `Undate` directly with year, month, day values is useful, but often we want to parse text dates in known formats directly and work with them as data.

The `undate` library has an extensive converter class ([`BaseDateConverter`](https://undate-python.readthedocs.io/en/latest/undate/converters.html)), which can be extended for parsing dates in specific formats and also for parsing and converting dates from other calendars.  Parsing is implemented with the Python library [Lark](https://lark-parser.readthedocs.io/en/stable/).

Currently, we support ISO8601 and some portions of the Extended Date Time Format (EDTF).  

### ISO8601

**ISO 8601** is an international standard for dates (see [Wikipedia ISO 8601 entry](https://en.wikipedia.org/wiki/ISO_8601) for more details). For calendar dates, this format uses the familiar **YYYY-MM-DD** notation for full dates and **YYYY-MM** for year and month. Some earlier versions of the specification allowed formats like **--MM-DD** for dates when month and day are known but the year is not.

A converter can be used directly through its class, or selected by the name of the converter when parsing.

Here are some examples. In this case, we set the default converter to ISO8601 so that the string format will serialize the date back out to the original format.

In [113]:
from undate.date import DatePrecision
from undate.converters.iso8601 import ISO8601DateFormat

# ensure output converter is ISO8601 (currently the default)
Undate.DEFAULT_CONVERTER = "ISO8601"

day = Undate.parse("1985-04-12", "ISO8601")
assert str(day) == "1985-04-12"
assert day.precision == DatePrecision.DAY

yearmonth = Undate.parse("1985-04", "ISO8601")
assert str(yearmonth) == "1985-04"
assert yearmonth.precision == DatePrecision.MONTH

year = Undate.parse("1985", "ISO8601")
assert str(year) == "1985"
assert year.precision == DatePrecision.YEAR

monthday = Undate.parse("--04-12", "ISO8601")
assert str(monthday) == "--04-12"
assert monthday.precision == DatePrecision.DAY
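
As a rough illustration of how precision follows from the shape of the string, here is a small standard-library sketch that classifies the four ISO8601 forms used above. It is not how `undate` parses dates (the library uses a Lark grammar); the regular expressions and the `iso_precision` helper are hypothetical and cover only these forms.

```python
import re

# YYYY, YYYY-MM, or YYYY-MM-DD
YEAR_FIRST = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")
# --MM-DD (month and day with no year)
MONTH_DAY = re.compile(r"--(\d{2})-(\d{2})")


def iso_precision(text: str) -> str:
    """Return 'YEAR', 'MONTH', or 'DAY' for a supported ISO8601-style date."""
    if MONTH_DAY.fullmatch(text):
        return "DAY"
    match = YEAR_FIRST.fullmatch(text)
    if match is None:
        raise ValueError(f"not a supported ISO8601 date: {text!r}")
    _year, month, day = match.groups()
    if day:
        return "DAY"
    if month:
        return "MONTH"
    return "YEAR"


for text in ["1985-04-12", "1985-04", "1985", "--04-12"]:
    print(text, iso_precision(text))
```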

If you try to parse something that isn't supported by the format or the parser, the method raises a `ValueError` exception with the error message from the parser.

In [100]:
try:
    Undate.parse("????-04-12", "ISO8601")
except ValueError as err:
    print(err)

invalid literal for int() with base 10: '????'


### Extended Date Time Format

Since the EDTF format includes both dates and intervals, parsing an EDTF can return either an `Undate` or an `UndateInterval`.  

Here are some examples.   

EDTF and ISO8601 use the same format for full precision day, year-month, and year dates.

In [127]:
day = Undate.parse("1985-04-12", "EDTF")
assert day.format("EDTF") == "1985-04-12"
assert day.precision == DatePrecision.DAY

yearmonth = Undate.parse("1985-04", "EDTF")
assert yearmonth.format("EDTF") == "1985-04"
assert yearmonth.precision == DatePrecision.MONTH

year = Undate.parse("1985", "EDTF")
assert year.format("EDTF") == "1985"
assert year.precision == DatePrecision.YEAR

EDTF uses **X** to indicate unspecified digits. Here's the example from above with an unknown year. 

If we specify a different formatter, we can output the date in a different format than we used for parsing.

In [130]:
monthday = Undate.parse("XXXX-04-12", "EDTF")
assert monthday.format("EDTF") == "XXXX-04-12"
assert monthday.format("ISO8601") == "--04-12"
assert monthday.precision == DatePrecision.DAY

The EDTF format includes notation for intervals; parsing an EDTF interval returns an `UndateInterval`. Here are some examples from the Library of Congress documentation on EDTF. Note that the start and end date of the interval don't have to use the same date precision.

In [131]:
# Example 1
year_range = Undate.parse("1964/2008", "EDTF")
assert isinstance(year_range, UndateInterval)
assert year_range.earliest == Undate(1964)
assert year_range.latest == Undate(2008)
# Example 2
month_range = Undate.parse("2004-06/2006-08", "EDTF")
assert isinstance(month_range, UndateInterval)
assert month_range.earliest == Undate(2004, 6)
assert month_range.latest == Undate(2006, 8)
# Example 3
day_range = Undate.parse("2004-02-01/2005-02-08", "EDTF")
assert isinstance(day_range, UndateInterval)
assert day_range.earliest == Undate(2004, 2, 1)
assert day_range.latest == Undate(2005, 2, 8)
# Example 4 
day_month_range = Undate.parse("2004-02-01/2005-02", "EDTF")
assert isinstance(day_month_range, UndateInterval)
assert day_month_range.earliest == Undate(2004, 2, 1)
assert day_month_range.latest == Undate(2005, 2)
assert day_month_range.earliest.precision == DatePrecision.DAY
assert day_month_range.latest.precision == DatePrecision.MONTH
# Example 5
day_year_range = Undate.parse("2004-02-01/2005", "EDTF")
assert isinstance(day_year_range, UndateInterval)
assert day_year_range.earliest == Undate(2004, 2, 1)
assert day_year_range.latest == Undate(2005)
assert day_year_range.earliest.precision == DatePrecision.DAY
assert day_year_range.latest.precision == DatePrecision.YEAR
# Example 6 
year_month_range = Undate.parse("2005/2006-02", "EDTF")
assert isinstance(year_month_range, UndateInterval)
assert year_month_range.earliest == Undate(2005)
assert year_month_range.latest == Undate(2006, 2)
assert year_month_range.earliest.precision == DatePrecision.YEAR
assert year_month_range.latest.precision == DatePrecision.MONTH

EDTF also supports open intervals. Here are some examples of those:

In [133]:
import datetime

interval = Undate.parse("1985-04-12/..", "EDTF")
assert isinstance(interval, UndateInterval)
assert interval.earliest == datetime.date(1985, 4, 12)
assert interval.earliest.precision == DatePrecision.DAY
assert interval.latest is None

interval = Undate.parse("1985-04/..", "EDTF")
assert isinstance(interval, UndateInterval)
assert interval.earliest == Undate(1985, 4)
assert interval.earliest.precision == DatePrecision.MONTH
assert interval.latest is None

interval = Undate.parse("1985/..", "EDTF")
assert isinstance(interval, UndateInterval)
assert interval.earliest == Undate(1985)
assert interval.earliest.precision == DatePrecision.YEAR
assert interval.latest is None

EDTF also supports negative years and years with more than four digits; for the latter, the **Y** prefix is used to indicate that the number is a year.

In [137]:
neg_year = Undate.parse("-1985", "EDTF")
assert neg_year.year == "-1985"
assert Undate(-1985).format("EDTF") == "-1985"

assert Undate.parse("Y170000002", "EDTF").year == "170000002"
assert Undate(170000002).format("EDTF") == "Y170000002"
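
The formatting rule can be sketched in a few lines of plain Python. The `edtf_year` helper below is hypothetical and follows our reading of the EDTF specification (four-digit years, a leading minus sign for negative years, and a **Y** prefix once a year needs more than four digits); it is not the formatter `undate` uses.

```python
def edtf_year(year: int) -> str:
    """Format an integer year in EDTF style (illustrative sketch only)."""
    if abs(year) > 9999:
        # years with more than four digits take the Y prefix
        return f"Y{year}"
    sign = "-" if year < 0 else ""
    # shorter years are zero-padded to four digits
    return f"{sign}{abs(year):04d}"


print(edtf_year(1985))  # 1985
print(edtf_year(-1985))  # -1985
print(edtf_year(170000002))  # Y170000002
print(edtf_year(85))  # 0085
```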

## Calendars

`undate` includes a [BaseCalendarConverter](https://undate-python.readthedocs.io/en/latest/undate/converters.html#undate.converters.base.BaseCalendarConverter), a special case of the `BaseDateConverter` that is used for format parsing and conversion like ISO8601 and EDTF. In addition to the `parse()` method that all converters must implement, calendar converters have logic for returning the minimum and maximum month and day, the first and last month as integers (since some calendars don't start the year on month 1), and a `to_gregorian()` method to convert into a standard Gregorian date. We use the [convertdate](https://github.com/fitnr/convertdate) Python library for the actual numeric conversion.

### Gregorian calendar

An `Undate` instance always has a calendar defined; we use the Gregorian calendar if a calendar is not specified.

Here's an example from one of the `Undate` instances we defined earlier:

In [139]:
november7.calendar

<Calendar.GREGORIAN: 'gregorian'>

### Islamic Hijri calendar

In [159]:
from undate import Calendar

# Monday, 7 Jumādā I 1243 Hijrī (26 November, 1827 CE); Jumada I = month 5
hijri_date = Undate.parse("7 Jumādā I 1243", "Hijri") 
assert hijri_date == Undate(1243, 5, 7, calendar="Hijri")
assert hijri_date.calendar == Calendar.HIJRI
assert hijri_date.precision == DatePrecision.DAY

We preserve the numeric values of the date in the original calendar, but internally `Undate` converts to the Gregorian calendar for comparison with other dates.

In [162]:
assert hijri_date.year == "1243"
assert hijri_date.month == "05"
assert hijri_date.day == "07"
print(hijri_date.earliest)  # Gregorian equivalent

1827-11-26
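
For readers curious about the arithmetic behind this kind of conversion, here is a standard-library sketch of the widely used tabular (arithmetic) Hijri calendar, in which months alternate between 30 and 29 days and leap days follow a 30-year cycle. The `hijri_to_gregorian` function and `HIJRI_EPOCH` constant are hypothetical names for illustration; this is not the code that `undate` or `convertdate` actually run, although it reproduces the Gregorian dates shown in this notebook.

```python
from datetime import date

# 1 Muharram 1 AH in the proleptic Gregorian calendar (19 July 622 CE)
HIJRI_EPOCH = date(622, 7, 19)


def hijri_to_gregorian(year: int, month: int, day: int) -> date:
    """Convert a tabular Hijri date to a proleptic Gregorian date."""
    # days in the months before this one: 30, 29, 30, 29, ...
    days_in_prior_months = (59 * (month - 1) + 1) // 2
    # 354 days per common year, plus leap days from the 30-year cycle
    days_in_prior_years = (year - 1) * 354 + (3 + 11 * year) // 30
    offset = days_in_prior_years + days_in_prior_months + day - 1
    return date.fromordinal(HIJRI_EPOCH.toordinal() + offset)


print(hijri_to_gregorian(1243, 5, 7))  # 1827-11-26
print(hijri_to_gregorian(495, 7, 1))  # 1102-04-28
```

Because Python's `datetime.date` is limited to years 1 through 9999, this sketch only works for Hijri dates that fall inside that range.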


By default, the original text value of the parsed date and the calendar are preserved in the label of the `Undate` object:

In [184]:
print(hijri_date.label)

7 Jumādā I 1243 Hijrī


As with other formats, we support different date precisions:

In [201]:
from undate.date import Date

# month and year only
hijri_yearmonth = Undate.parse("Rajab 495", "Hijri") 
assert hijri_yearmonth == Undate(495, 7, calendar="Hijri")  # Rajab is month 7
assert hijri_yearmonth.calendar == Calendar.HIJRI
assert hijri_yearmonth.precision == DatePrecision.MONTH
 # Gregorian earliest/latest
assert hijri_yearmonth.earliest == Date(1102, 4, 28)
assert hijri_yearmonth.latest == Date(1102, 5, 27)
print(f"{hijri_yearmonth.earliest}/{hijri_yearmonth.latest}")  # Gregorian date range

1102-04-28/1102-05-27


In [203]:
# year only
hijri_year = Undate.parse("441", "Hijri") 
assert hijri_year == Undate(441, calendar="Hijri")
assert hijri_year.calendar == Calendar.HIJRI
assert hijri_year.precision == DatePrecision.YEAR
# Gregorian earliest/ latest
assert hijri_year.earliest == Date(1049, 6, 11)
assert hijri_year.latest == Date(1050, 5, 31)
print(f"{hijri_year.earliest}/{hijri_year.latest}")  # Gregorian date range

1049-06-11/1050-05-31


### Hebrew Anno Mundi calendar

Support for the Hebrew calendar is similar to that for the Islamic calendar.

In [191]:
# 26 Tammuz 4816: Tammuz = month 4 (17 July, 1056 Gregorian)
hebrew_date = Undate.parse("26 Tammuz 4816", "Hebrew") 
assert hebrew_date == Undate(4816, 4, 26, calendar="Hebrew")
assert hebrew_date.calendar == Calendar.HEBREW
assert hebrew_date.precision == DatePrecision.DAY
print(hebrew_date.earliest)  # Gregorian equivalent
print(hebrew_date.label)

1056-07-17
26 Tammuz 4816 Anno Mundi


In [198]:
 # year month
hebrew_yearmonth = Undate.parse("Ṭevet 5362", "Hebrew") 
assert hebrew_yearmonth == Undate(5362, 10, calendar="Hebrew")  # Ṭevet = month 10
assert hebrew_yearmonth.calendar == Calendar.HEBREW
assert hebrew_yearmonth.precision == DatePrecision.MONTH
print(f"{hebrew_yearmonth.earliest}/{hebrew_yearmonth.latest}")  # Gregorian date range

1601-12-25/1602-01-22


In [199]:
# year
hebrew_year = Undate.parse("4932", "Hebrew") 
assert hebrew_year == Undate(4932, calendar="Hebrew")
assert hebrew_year.calendar == Calendar.HEBREW
assert hebrew_year.precision == DatePrecision.YEAR
print(f"{hebrew_year.earliest}/{hebrew_year.latest}")  # Gregorian date range

1171-09-09/1172-09-27


Because we preserve the numeric date values in the original calendar, two `Undate` objects with the same numeric day, month, and year values represent different dates if they use different calendars. This also means that we can preserve the precision of the date in the original calendar (such as a month or a year), even when that doesn't neatly map to a month or year in the Gregorian calendar, since they may have a different number of days.

Since `Undate` converts to the common Gregorian calendar for comparison and determines earliest and latest possible dates, `Undate` instances with different calendars can be used together.

In [231]:
# 21 Rajab 1023 Hijrī (27 August 1614 CE) 
rajab21 = Undate.parse("21 Rajab 1023", "Hijri")
# 3 Tishrei 5370 Anno Mundi (1 October 1609 CE) 
tishrei3 = Undate.parse("3 Tishrei 5370", "Hebrew")

In [215]:
rajab21

<Undate '21 Rajab 1023 Hijrī' 1023-07-21 (Hijri)>

In [216]:
tishrei3

<Undate '3 Tishrei 5370 Anno Mundi' 5370-07-03 (Hebrew)>

In [230]:
import pandas as pd

calendars = ["Gregorian", "Hebrew", "Hijri"]

calendar_dates = {
    "text":  ["21 Rajab 1023", "Rajab 1023", "1023", "3 Tishrei 5370", "Tishrei 5370", "5370", "2 June 1663", "June 1663", "1663"],
    "calendar": ["Hijri", "Hijri", "Hijri", "Hebrew", "Hebrew", "Hebrew", "Gregorian", "Gregorian", "Gregorian"],
    # we pre-supply the numeric values in this case, since we don't yet have a text parser for Gregorian dates
    "numeric": [(1023, 7, 21), (1023, 7), (1023,), (5370, 7, 3), (5370, 7), (5370,), (1663, 6, 2), (1663, 6), (1663,)]
}

cal_dates_df = pd.DataFrame.from_dict(calendar_dates)
# initialize an undate from the numeric values with the specified calendar
cal_dates_df['undate'] = cal_dates_df.apply(lambda row: Undate(*row.numeric, calendar=row.calendar), axis=1)
# string representation of how you would initialize an undate object with numbers and calendar
cal_dates_df['undate_str'] = cal_dates_df.apply(lambda row: f"Undate({', '.join([str(n) for n in row.numeric])}, calendar='{row.calendar}')", axis=1)
cal_dates_df['precision'] = cal_dates_df.undate.apply(lambda x: str(x.precision).lower())
cal_dates_df['earliest_gregorian'] = cal_dates_df.undate.apply(lambda x: x.earliest)
cal_dates_df['latest_gregorian'] = cal_dates_df.undate.apply(lambda x: x.latest)
cal_dates_df['duration'] = cal_dates_df.undate.apply(lambda x: x.duration().days)
cal_dates_df

| | text | calendar | numeric | undate | undate_str | precision | earliest_gregorian | latest_gregorian | duration |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 21 Rajab 1023 | Hijri | (1023, 7, 21) | 1023-07-21 | Undate(1023, 7, 21, calendar='Hijri') | day | 1614-08-27 | 1614-08-27 | 1 |
| 1 | Rajab 1023 | Hijri | (1023, 7) | 1023-07 | Undate(1023, 7, calendar='Hijri') | month | 1614-08-07 | 1614-09-05 | 30 |
| 2 | 1023 | Hijri | (1023,) | 1023 | Undate(1023, calendar='Hijri') | year | 1614-02-11 | 1615-01-30 | 354 |
| 3 | 3 Tishrei 5370 | Hebrew | (5370, 7, 3) | 5370-07-03 | Undate(5370, 7, 3, calendar='Hebrew') | day | 1609-10-01 | 1609-10-01 | 1 |
| 4 | Tishrei 5370 | Hebrew | (5370, 7) | 5370-07 | Undate(5370, 7, calendar='Hebrew') | month | 1609-09-29 | 1609-10-28 | 30 |
| 5 | 5370 | Hebrew | (5370,) | 5370 | Undate(5370, calendar='Hebrew') | year | 1609-09-29 | 1610-09-17 | 354 |
| 6 | 2 June 1663 | Gregorian | (1663, 6, 2) | 1663-06-02 | Undate(1663, 6, 2, calendar='Gregorian') | day | 1663-06-02 | 1663-06-02 | 1 |
| 7 | June 1663 | Gregorian | (1663, 6) | 1663-06 | Undate(1663, 6, calendar='Gregorian') | month | 1663-06-01 | 1663-06-30 | 30 |
| 8 | 1663 | Gregorian | (1663,) | 1663 | Undate(1663, calendar='Gregorian') | year | 1663-01-01 | 1663-12-31 | 365 |


This table shows dates with varying precision from three different calendars, with their numeric values in the original calendar and earliest and latest Gregorian dates.

* * *

Because internally we convert to a common calendar, these dates can be used together.

In [238]:
june1663 = Undate(1663, 6)

sorted_mix = sorted([rajab21, tishrei3, june1663])
sorted_mix

[<Undate '3 Tishrei 5370 Anno Mundi' 5370-07-03 (Hebrew)>,
 <Undate '21 Rajab 1023 Hijrī' 1023-07-21 (Hijri)>,
 <Undate 1663-06 (Gregorian)>]

In [239]:
print([d.earliest for d in sorted_mix])

[Date('1609-10-01', dtype='datetime64[D]'), Date('1614-08-27', dtype='datetime64[D]'), Date('1663-06-01', dtype='datetime64[D]')]
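
The principle at work here is that the original text and calendar are kept as they were recorded, while ordering relies on the Gregorian bounds. The sketch below shows that pattern with plain tuples and `datetime.date`, using Gregorian values taken from the table above. The `Record` type and the choice of `(earliest, latest)` as the sort key are assumptions for illustration, not a description of how `Undate` implements comparison.

```python
from datetime import date
from typing import NamedTuple


class Record(NamedTuple):
    text: str  # date as written in the source
    calendar: str  # calendar of the original date
    earliest: date  # earliest possible Gregorian day
    latest: date  # latest possible Gregorian day


records = [
    Record("June 1663", "Gregorian", date(1663, 6, 1), date(1663, 6, 30)),
    Record("21 Rajab 1023", "Hijri", date(1614, 8, 27), date(1614, 8, 27)),
    Record("3 Tishrei 5370", "Hebrew", date(1609, 10, 1), date(1609, 10, 1)),
]

# order by Gregorian bounds; the original values stay untouched
chronological = sorted(records, key=lambda r: (r.earliest, r.latest))
for record in chronological:
    print(f"{record.earliest}  {record.text} ({record.calendar})")
```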


## Implementation details 

Internally, we use `numpy.datetime64` and `numpy.timedelta64` to store converted dates; we implemented shims to make these objects look a bit more like the built-in Python `datetime.date` and `datetime.timedelta` objects, since they are easier to work with and the first version of `undate` used them.

We switched to `numpy` for dates so that we could support a wider range of years. Python `datetime.date` only supports years from 1 through 9999.

In [240]:
datetime.MINYEAR, datetime.MAXYEAR

(1, 9999)
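
To see what this limit means for the kinds of years used earlier in this notebook, the following standard-library check tries to build a `datetime.date` for several years, including the negative and very large years from the EDTF examples. The `fits_in_datetime` helper is a hypothetical name used only for this illustration.

```python
import datetime


def fits_in_datetime(year: int) -> bool:
    """Return True if datetime.date can represent 1 January of this year."""
    try:
        datetime.date(year, 1, 1)
    except (ValueError, OverflowError):
        # raised when the year is outside MINYEAR..MAXYEAR
        return False
    return True


for year in (-1985, 0, 1, 9999, 10000, 170000002):
    print(year, fits_in_datetime(year))
```

Only the years 1 and 9999 in this list can be represented; the rest are rejected.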

The popular data analysis library Pandas is much more limited: in spite of using `numpy.datetime64` internally, Pandas methods for parsing dates and converting to `Timestamp` objects don't support dates before 1677 CE or after 2262 CE.

In contrast, with day-level units `numpy.datetime64` supports a range from 2.5e16 BC to 2.5e16 AD (see [NumPy Datetime units documentation](https://numpy.org/doc/stable/reference/arrays.datetime.html#datetime-units)).

In [241]:
pd.Timestamp.min.year, pd.Timestamp.max.year

(1677, 2262)