06.02 Data Format for a New Future
At present the Sutta Central data format leaves much to be desired. The Canon is naturally a tree, and it would be ideal if the data format natively supported defining trees or hierarchies. By a tree I simply mean a format that explicitly defines parent-child relationships, so that you don't need to know anything about the data to know that one element is a child of another. For example DN 1, the Sutta, is a child of DN, the Division.
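To make "explicit parent-child relationships" concrete, here is a minimal sketch of such a tree in Python. The `Node` class and its field names are my own invention for illustration, not part of any existing Sutta Central code.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


# eq=False keeps comparisons identity-based, which avoids recursing
# through the parent/children links when two nodes are compared.
@dataclass(eq=False)
class Node:
    """One element of the hierarchy (hypothetical helper class)."""
    uid: str
    kind: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        # Record the relationship in both directions.
        child.parent = self
        self.children.append(child)
        return child

    def ancestors(self) -> Iterator["Node"]:
        # Walk up the tree from this node towards the root.
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


dn = Node("dn", "division")
dn1 = dn.add(Node("dn1", "sutta"))

print(dn1.parent.uid)
print([n.uid for n in dn1.ancestors()])
```

Nothing here depends on knowing what a Division or a Sutta is: the relationship is carried by the structure itself.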
There are two requirements for the data format:
- It should be able to represent a tree/hierarchy.
- It should be text based for version control, human usability and future-proofing.
As far as I know there are only three widely used candidates which fulfil these requirements: XML, JSON and YAML. There are some much lesser-known ones, like Candle and SDL, which are very nearly language specific, and may be the love child of a single individual.
XML (eXtensible Markup Language) is an industry standard and has a diverse and mature ecosystem of tools and utilities.
<division>
  <uid>sn</uid>
  <name>Saṁyutta Nikaya</name>
  <subdivision>
    <uid>sn1</uid>
    <name>Devatā Saṁyutta</name>
    <vagga>
      <uid>sn1.vagga1</uid>
      <name>Nala Vagga</name>
      <sutta>
        <uid>sn1.1</uid>
        <name>Oghataraṇa</name>
      </sutta>
      <sutta>
        <uid>sn1.2</uid>
        <name>Nimokkha</name>
      </sutta>
      <sutta>
        <uid>sn1.3</uid>
        <name>Upanīyati</name>
      </sutta>
    </vagga>
    <vagga>
      <uid>sn1.vagga2</uid>
      <name>Nandana Vagga</name>
      <sutta>
        <uid>sn1.11</uid>
        <name>Nandana</name>
      </sutta>
    </vagga>
  </subdivision>
</division>
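As a rough illustration of what loading this looks like, here is a sketch using `xml.etree.ElementTree` from the Python standard library on a trimmed-down copy of the example. The `walk` helper and the `NODE_TAGS` set are assumptions of mine: XML itself does not say which child elements are tree nodes and which are properties, so the reader has to be told.

```python
import xml.etree.ElementTree as ET

# A trimmed-down copy of the XML example above.
XML = """
<division>
  <uid>sn</uid>
  <name>Saṁyutta Nikaya</name>
  <subdivision>
    <uid>sn1</uid>
    <name>Devatā Saṁyutta</name>
    <vagga>
      <uid>sn1.vagga1</uid>
      <name>Nala Vagga</name>
      <sutta>
        <uid>sn1.1</uid>
        <name>Oghataraṇa</name>
      </sutta>
    </vagga>
  </subdivision>
</division>
"""

# Element names that count as nodes of the tree; everything else
# (uid, name) is treated as a property of its parent node.
NODE_TAGS = {"division", "subdivision", "vagga", "sutta"}


def walk(element, depth=0):
    """Yield (depth, tag, uid, name) for every node, parents first."""
    yield depth, element.tag, element.findtext("uid"), element.findtext("name")
    for child in element:
        if child.tag in NODE_TAGS:
            yield from walk(child, depth + 1)


root = ET.fromstring(XML.strip())
rows = list(walk(root))
for depth, tag, uid, name in rows:
    print("  " * depth + f"{tag}: {uid} ({name})")
```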
JSON (JavaScript Object Notation) is fat-free and very widely used for data exchange on the Internet.
{"divisions": [{
  "uid": "sn",
  "name": "Samyutta Nikaya",
  "subdivisions": [{
    "uid": "sn1",
    "name": "Devatā Saṃyutta",
    "vaggas": [{
      "uid": "sn1.vagga1",
      "name": "Nala Vagga",
      "suttas": [{
        "uid": "sn1.1",
        "name": "Oghataraṇa"
      }, {
        "uid": "sn1.2",
        "name": "Nimokkha"
      }, {
        "uid": "sn1.3",
        "name": "Upanīyati"
      }]
    }, {
      "uid": "sn1.vagga2",
      "name": "Nandana Vagga",
      "suttas": [{
        "uid": "sn1.11",
        "name": "Nandana"}
      ]}
    ]}
  ]}
]}
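For comparison, here is a sketch of reading this layout with Python's standard `json` module, again on a trimmed-down copy. It shows two things that I think matter for hand-edited data: the consumer has to know the name of the child list at every level, and a strict parser rejects small slips such as a trailing comma.

```python
import json

# A trimmed-down, strictly valid version of the JSON example above.
TEXT = """
{"divisions": [{
  "uid": "sn",
  "name": "Samyutta Nikaya",
  "subdivisions": [{
    "uid": "sn1",
    "name": "Devatā Saṃyutta",
    "vaggas": [{
      "uid": "sn1.vagga1",
      "name": "Nala Vagga",
      "suttas": [
        {"uid": "sn1.1", "name": "Oghataraṇa"},
        {"uid": "sn1.2", "name": "Nimokkha"}
      ]
    }]
  }]
}]}
"""

data = json.loads(TEXT)

# Each level keeps its children under a differently named key, so the
# code has to spell out the key for every level.
sutta_uids = [
    sutta["uid"]
    for division in data["divisions"]
    for subdivision in division["subdivisions"]
    for vagga in subdivision["vaggas"]
    for sutta in vagga["suttas"]
]
print(sutta_uids)

# A trailing comma is enough to make the whole document unreadable.
try:
    json.loads('{"uid": "sn1.1", "name": "Oghataraṇa",}')
    rejected = False
except json.JSONDecodeError as error:
    rejected = True
    print("rejected:", error.msg)
```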
YAML is still emerging. It is a superset of JSON, but has an alternative more concise and human-friendly notation, and is extensible.
divisions:
  - uid: sn
    name: Samyutta Nikaya
    subdivisions:
      - uid: sn1
        name: Devatā Saṃyutta
        vaggas:
          - uid: sn1.vagga1
            name: Nala Vagga
            suttas:
              - uid: sn1.1
                name: Oghataraṇa
              - uid: sn1.2
                name: Nimokkha
              - uid: sn1.3
                name: Upanīyati
          - uid: sn1.vagga2
            name: Nandana Vagga
            suttas:
              - uid: sn1.11
                name: Nandana
YAML 2 - Using keys instead of lists.
sn:
  type: division
  name: Samyutta Nikaya
  sn1:
    type: subdivision
    name: Devatā Saṃyutta
    sn1.vagga1:
      type: vagga
      name: Nala Vagga
      sn1.1:
        type: sutta
        name: Oghataraṇa
      sn1.2:
        type: sutta
        name: Nimokkha
      sn1.3:
        type: sutta
        name: Upanīyati
    sn1.vagga2:
      type: vagga
      name: Nandana Vagga
      sn1.11:
        type: sutta
        name: Nandana
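Once loaded, the keyed layout is just nested mappings. The sketch below assumes a YAML loader has already produced the plain Python dict shown (it is written out by hand here so the example needs no third-party library), and it assumes a convention of my own: any key whose value is itself a mapping is a child node, and any other key is a property.

```python
# Hand-written stand-in for what a YAML loader would plausibly return
# for the keyed layout above (trimmed to a few nodes).
tree = {
    "sn": {
        "type": "division",
        "name": "Samyutta Nikaya",
        "sn1": {
            "type": "subdivision",
            "name": "Devatā Saṃyutta",
            "sn1.vagga1": {
                "type": "vagga",
                "name": "Nala Vagga",
                "sn1.1": {"type": "sutta", "name": "Oghataraṇa"},
                "sn1.2": {"type": "sutta", "name": "Nimokkha"},
            },
        },
    },
}


def walk(mapping, depth=0):
    """Yield (depth, uid, type) for every node, parents before children."""
    for key, value in mapping.items():
        # Assumed convention: a mapping value is a child node, a scalar
        # value (type, name) is a property of the current node.
        if isinstance(value, dict):
            yield depth, key, value.get("type")
            yield from walk(value, depth + 1)


nodes = list(walk(tree))
for depth, uid, kind in nodes:
    print("  " * depth + f"{uid} [{kind}]")

# Looking up a child by uid is a key access rather than a list scan.
print(tree["sn"]["sn1"]["name"])
```

The trade-off is that uids and property names share one namespace, so a property could never be called something that is also a uid.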
YAML 3 - Using custom types.
sn: !Division
  name: Samyutta Nikaya
  sn1: !Subdivision
    name: Devatā Saṃyutta
    sn1.vagga1: !Vagga
      name: Nala Vagga
      sn1.1: !Sutta
        name: Oghataraṇa
      sn1.2: !Sutta
        name: Nimokkha
      sn1.3: !Sutta
        name: Upanīyati
    sn1.vagga2: !Vagga
      name: Nandana Vagga
      sn1.11: !Sutta
        name: Nandana
I have chosen to show some alternative tree structures using YAML, as YAML is by far the most comprehensible at a glance.
XML is very widely used and it has an unsurpassed ecosystem of tools and utilities. Everyone is familiar with XML, even if they don't like it. A legitimate criticism of XML is that it is a Markup Language which has been shoehorned into its sometime role as a data format (however, the libraries which load that data are a joy to use). XML does many things, but it doesn't do any of them elegantly. XML makes my eyes bleed; I suppose it's all the angle brackets.
JSON is an Object Notation rather than a Markup Language, which means it is designed for defining objects rather than marking up text. The extreme simplicity of JSON, its extremely limited feature set, is why it is so widely used on the internet for communications. It's easy to use and easy to parse. JSON is based on JavaScript's object literal syntax and looks very much like Python's dict and list literals (the main difference being that Python spells true, false and null as True, False and None), and both languages can parse it out of the box. To a large extent JSON's strengths are also its weaknesses: it simply cannot be extended. The fact that there is no such thing as custom JSON means everything can understand it, but it also makes it limited. To me, JSON looks like something the dog barfed up. It's A LOT of work to make it look pretty, and even if you do that work (as I have done for the example above) it still looks a bit hairy.
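How close JSON really is to Python source can be checked directly. This is a small sketch using only the standard library; the `translated` and `parallels` fields are invented for illustration.

```python
import ast
import json

text = '{"uid": "sn1.1", "translated": true, "parallels": null}'

# The json module in the standard library parses it directly.
parsed = json.loads(text)
print(parsed)

# Read as a Python literal, the same text fails, because Python
# spells these constants True, False and None.
try:
    ast.literal_eval(text)
    python_accepts = True
except (ValueError, SyntaxError):
    python_accepts = False
print(python_accepts)

# Text that sticks to strings, lists and objects reads the same both ways.
simple = '{"uid": "sn1.1", "name": "Oghataraṇa"}'
same = json.loads(simple) == ast.literal_eval(simple)
print(same)
```

So the overlap is large but not complete, and in practice you would always go through a JSON parser rather than evaluate the text as code.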
YAML, like JSON, is also an Object Notation; it is even defined in terms of not being a Markup Language (the name stands for "YAML Ain't Markup Language"). YAML is a superset of JSON (as of YAML 1.2), meaning that valid JSON is valid YAML. Essentially, they both parse into the same thing (an object tree). YAML has mature libraries for all popular languages (Java, Perl, Python, Ruby etc). One of the strengths of YAML is that it is easily transformed into JSON. Another strength is that it has advanced features like built-in references (this means you can define a thing in one place, and insert it in multiple places). YAML is Pythonic both in appearance and philosophy (i.e. DRY - Don't Repeat Yourself). YAML is also fully extensible. Much like JSON, the strengths of YAML are also its weaknesses: the more of the awesome cool extensibility features of YAML you use, the less useful the data becomes in interchange, because it's more work to make another program understand your data. Here a well-defined transform to JSON might be a good approach. YAML is elegant and superbly human usable; in fact it's so human comprehensible it seems hard to believe it actually is a data format.
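One possible shape for such a transform, sketched under my own assumptions: the input is the object tree a YAML loader would plausibly give for the keyed layout (written by hand here, so no YAML library is needed), and the output uses a uniform `children` list, a key name I have made up, so that any JSON consumer can walk it without knowing the level names. The `to_plain` function is hypothetical.

```python
import json


def to_plain(uid, node):
    """Turn one keyed node into a uniform dict with an explicit children list."""
    plain = {"uid": uid}
    children = []
    for key, value in node.items():
        if isinstance(value, dict):
            # A mapping value is taken to be a child node.
            children.append(to_plain(key, value))
        else:
            # Anything else (type, name) is copied over as a property.
            plain[key] = value
    if children:
        plain["children"] = children
    return plain


# Hand-written stand-in for a loaded fragment of the keyed layout.
loaded = {
    "sn1.vagga2": {
        "type": "vagga",
        "name": "Nandana Vagga",
        "sn1.11": {"type": "sutta", "name": "Nandana"},
    },
}

plain = [to_plain(uid, node) for uid, node in loaded.items()]

# ensure_ascii=False keeps diacritics readable in the output.
output = json.dumps(plain, ensure_ascii=False, indent=2)
print(output)
```

The humans would edit the concise form, and other programs would only ever see the plain JSON.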
Having investigated it, while the conservative choice would be XML, I feel the awesome choice is YAML.