This repository has been archived by the owner on Jun 11, 2021. It is now read-only.

06.02 Data Format for a New Future

stuindhamma edited this page Jul 29, 2014 · 1 revision

At present the Sutta Central data format leaves much to be desired. The Canon is naturally a tree, and ideally the data format would natively support defining trees, or hierarchies. By a tree I simply mean a format that explicitly defines parent-child relationships, so that you don't need to know anything about the data to know that one element is a child of another. For example DN 1, the Sutta, is a child of DN, the Division.
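
As a rough illustration of what "explicit parent-child relationships" buys you, here is a minimal Python sketch. The `children` key, the `dn2` entry and the `parent_map` helper are assumptions made up for this example, not part of any Sutta Central format. The point is only that the parent of each node falls out of the structure, with no need to parse the uid.

```python
# Nested form: the structure itself says which node is the parent.
canon = {
    "uid": "dn",
    "type": "division",
    "children": [
        {"uid": "dn1", "type": "sutta", "children": []},
        {"uid": "dn2", "type": "sutta", "children": []},
    ],
}


def parent_map(node, parent=None, out=None):
    """Return {uid: parent uid} by walking the tree; the root maps to None."""
    if out is None:
        out = {}
    out[node["uid"]] = parent
    for child in node["children"]:
        parent_map(child, node["uid"], out)
    return out


parents = parent_map(canon)
print(parents["dn1"])  # the parent of DN 1 is read off the nesting
```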

There are two requirements for the data format:

  1. It should be able to represent a tree/hierarchy.
  2. It should be text based for version control, human usability and future-proofing.

As far as I know, only three widely used candidates fulfil these requirements: XML, JSON and YAML. There are some much lesser-known ones, like Candle and SDL, which are very nearly language specific and may be the love child of a single individual.

XML (eXtensible Markup Language) is an industry standard and has a diverse and mature ecosystem of tools and utilities.

<division>
    <uid>sn</uid>
    <name>Saṁyutta Nikaya</name>
    <subdivision>
        <uid>sn1</uid>
        <name>Devatā Saṁyutta</name>
        <vagga>
            <uid>sn1.vagga1</uid>
            <name>Nala Vagga</name>
            <sutta>
                <uid>sn1.1</uid>
                <name>Oghataraṇa</name>
            </sutta>
            <sutta>
                <uid>sn1.2</uid>
                <name>Nimokkha</name>
            </sutta>
            <sutta>
                <uid>sn1.3</uid>
                <name>Upanīyati</name>
            </sutta>
        </vagga>
        <vagga>
            <uid>sn1.vagga2</uid>
            <name>Nandana Vagga</name>
            <sutta>
                <uid>sn1.11</uid>
                <name>Nandana</name>
            </sutta>
        </vagga>
    </subdivision>
</division>
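
To give a feel for loading this kind of XML, here is a small sketch using Python's standard `xml.etree.ElementTree` module on a shortened copy of the sample. The `walk` helper is hypothetical, and it assumes that `<uid>` and `<name>` are always leaf fields while every other element is a tree node.

```python
import xml.etree.ElementTree as ET

XML = """\
<division>
    <uid>sn</uid>
    <name>Saṁyutta Nikaya</name>
    <subdivision>
        <uid>sn1</uid>
        <name>Devatā Saṁyutta</name>
        <vagga>
            <uid>sn1.vagga1</uid>
            <name>Nala Vagga</name>
            <sutta>
                <uid>sn1.1</uid>
                <name>Oghataraṇa</name>
            </sutta>
        </vagga>
    </subdivision>
</division>
"""


def walk(element, depth=0):
    """Yield (depth, tag, uid, name) for every element that has a <uid> child."""
    uid = element.findtext("uid")
    if uid is not None:
        yield depth, element.tag, uid, element.findtext("name")
    for child in element:
        # <uid> and <name> are fields of the node, not nodes themselves.
        if child.tag not in ("uid", "name"):
            yield from walk(child, depth + 1)


rows = list(walk(ET.fromstring(XML)))
for depth, tag, uid, name in rows:
    print("    " * depth + f"{tag}: {uid} ({name})")
```

Note that the element name (`division`, `vagga`, `sutta`) carries the node type here, so the walker does not need a list of child keys.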

JSON (JavaScript Object Notation) is fat-free and very widely used for data exchange on the Internet.

{"divisions": [{
    "uid": "sn",
    "name": "Samyutta Nikaya",
    "subdivisions": [{
        "uid": "sn1",
        "name": "Devatā Saṃyutta",
        "vaggas": [{
            "uid": "sn1.vagga1",
            "name": "Nala Vagga",
            "suttas": [{
                "uid": "sn1.1",
                "name": "Oghataraṇa"
                }, {
                "uid": "sn1.2",
                "name": "Nimokkha"
                }, {
                "uid": "sn1.3",
                "name": "Upanīyati"
                }]
            }, {
            "uid": "sn1.vagga2",
            "name": "Nandana Vagga",
            "suttas": [{
                "uid": "sn1.11",
                "name": "Nandana"}
            ]}
        ]}
    ]}
]}
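
For comparison, a sketch of reading this shape with Python's standard `json` module, on a shortened copy of the sample. The `CHILD_KEYS` tuple and `iter_uids` helper are assumptions for illustration. One thing the sketch makes visible: because each level stores its children under a differently named key, the reader has to be told those key names in advance.

```python
import json

JSON_TEXT = """
{"divisions": [{
    "uid": "sn",
    "name": "Samyutta Nikaya",
    "subdivisions": [{
        "uid": "sn1",
        "name": "Devatā Saṃyutta",
        "vaggas": [{
            "uid": "sn1.vagga2",
            "name": "Nandana Vagga",
            "suttas": [{"uid": "sn1.11", "name": "Nandana"}]
        }]
    }]
}]}
"""

# The walker must know which keys hold child lists at each level.
CHILD_KEYS = ("divisions", "subdivisions", "vaggas", "suttas")


def iter_uids(node):
    """Yield every uid in document order."""
    if "uid" in node:
        yield node["uid"]
    for key in CHILD_KEYS:
        for child in node.get(key, []):
            yield from iter_uids(child)


uids = list(iter_uids(json.loads(JSON_TEXT)))
print(uids)
```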

YAML is still emerging. It is a superset of JSON but offers an alternative notation that is more concise and human-friendly, and it is extensible.

divisions:
-   uid: sn
    name: Samyutta Nikaya
    subdivisions:
    -   uid: sn1
        name: Devatā Saṃyutta
        vaggas:
        -   uid: sn1.vagga1
            name: Nala Vagga
            suttas:
            -   uid: sn1.1
                name: Oghataraṇa
            -   uid: sn1.2
                name: Nimokkha
            -   uid: sn1.3
                name: Upanīyati

        -   uid: sn1.vagga2
            name: Nandana Vagga
            suttas:
            -   uid: sn1.11
                name: Nandana

YAML 2 - Using keys instead of lists.

sn:
    type: division
    name: Samyutta Nikaya
    sn1:
        type: subdivision
        name: Devatā Saṃyutta
        sn1.vagga1:
            type: vagga
            name: Nala Vagga
            sn1.1:
                type: sutta
                name: Oghataraṇa
            sn1.2:
                type: sutta
                name: Nimokkha
            sn1.3:
                type: sutta
                name: Upanīyati
        sn1.vagga2:
            type: vagga
            name: Nandana Vagga
            sn1.11:
                type: sutta
                name: Nandana
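
The list-based and key-based layouts carry the same information, and one can be derived from the other. Below is a hedged Python sketch of that conversion working on already-loaded data (plain dicts and lists, as a YAML or JSON loader would produce). The `to_keyed` helper and the `CHILD_KEYS` mapping are made up for this example.

```python
# Maps the plural list key of the first layout to the "type" value of the second.
CHILD_KEYS = {
    "divisions": "division",
    "subdivisions": "subdivision",
    "vaggas": "vagga",
    "suttas": "sutta",
}


def to_keyed(nodes, node_type):
    """Convert a list of {uid, name, <children>} dicts into a uid-keyed dict."""
    keyed = {}
    for node in nodes:
        entry = {"type": node_type, "name": node["name"]}
        for list_key, child_type in CHILD_KEYS.items():
            if list_key in node:
                # Children become sibling keys of "type" and "name".
                entry.update(to_keyed(node[list_key], child_type))
        keyed[node["uid"]] = entry
    return keyed


list_form = {"divisions": [{
    "uid": "sn",
    "name": "Samyutta Nikaya",
    "subdivisions": [{
        "uid": "sn1",
        "name": "Devatā Saṃyutta",
        "vaggas": [{
            "uid": "sn1.vagga2",
            "name": "Nandana Vagga",
            "suttas": [{"uid": "sn1.11", "name": "Nandana"}],
        }],
    }],
}]}

keyed_form = to_keyed(list_form["divisions"], "division")
print(keyed_form["sn"]["sn1"]["sn1.vagga2"]["sn1.11"])
```

One design consequence worth keeping in mind: in the key-based layout, child uids share a namespace with the `type` and `name` fields, so a uid could never be called `type` or `name`.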

YAML 3 - Using custom types.

sn: !Division
    name: Samyutta Nikaya
    sn1: !Subdivision
        name: Devatā Saṃyutta
        sn1.vagga1: !Vagga
            name: Nala Vagga
            sn1.1: !Sutta
                name: Oghataraṇa
            sn1.2: !Sutta
                name: Nimokkha
            sn1.3: !Sutta
                name: Upanīyati
        sn1.vagga2: !Vagga
            name: Nandana Vagga
            sn1.11: !Sutta
                name: Nandana

I have chosen to show some alternative tree structures using YAML, as YAML is by far the most comprehensible at a glance.

Considerations

XML is very widely used and has an unsurpassed ecosystem of tools and utilities. Everyone is familiar with XML, even if they don't like it. A legitimate criticism is that XML is a Markup Language that has been shoehorned into its occasional role as a data format (however, the libraries which load that data are a joy to use). XML does many things, but it doesn't do any of them elegantly. XML makes my eyes bleed; I suppose it's all the angle brackets.

JSON is an Object Notation rather than a Markup Language, which means it is designed for defining objects rather than marking up text. Its extreme simplicity and deliberately limited feature set are why it is so widely used on the internet for communications. It's easy to use and easy to parse: JSON is very nearly a subset of JavaScript syntax and closely resembles Python's literal syntax, and both languages ship with built-in JSON support. To a large extent JSON's strengths are also its weaknesses: it simply cannot be extended. The fact that there is no such thing as custom JSON means everything can understand it, but it also makes it limited. To me, JSON looks like something the dog barfed up; it's A LOT of work to make it look pretty, and even if you do that work (as I have done for the example above) it still looks a bit hairy.
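
One caveat on reading JSON as Python code: the two agree on strings, numbers, lists and dicts, but JSON spells its constants `true`, `false` and `null`, which are not Python literals. The sketch below shows the difference using only the standard library. The `translated` and `parent` fields are invented for the example.

```python
import ast
import json

simple = '{"uid": "sn1.1", "name": "Oghataraṇa"}'
with_constants = '{"uid": "sn1.1", "translated": true, "parent": null}'

# For plain strings and dicts, the JSON parser and Python's literal parser agree.
same_result = json.loads(simple) == ast.literal_eval(simple)

# The json module maps true/null onto True/None.
record = json.loads(with_constants)

# Python's own literal parser rejects the JSON spellings of the constants.
try:
    ast.literal_eval(with_constants)
    python_accepts = True
except (ValueError, SyntaxError):
    python_accepts = False

print(same_result, record["translated"], record["parent"], python_accepts)
```

In practice this means always going through a JSON parser rather than evaluating the text as source code.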

YAML, like JSON, is an Object Notation; its very name, "YAML Ain't Markup Language", defines it as not being a Markup Language. YAML is a superset of JSON, meaning that valid JSON is valid YAML. Essentially, they both parse into the same thing (an object tree). YAML has mature libraries for all popular languages (Java, Perl, Python, Ruby etc). One of the strengths of YAML is that it is easily transformed into JSON. Another is that it has advanced features like built-in references, which let you define a thing in one place and insert it in multiple places. YAML is Pythonic both in appearance and philosophy (i.e. DRY - Don't Repeat Yourself). YAML is also fully extensible. Much like JSON, the strengths of YAML are also its weaknesses: the more of the awesome cool extensibility features of YAML you use, the less useful the data becomes in interchange, because it's more work to make another program understand your data. Here a well-defined transform to JSON might be a good approach. YAML is elegant and superbly human usable; in fact it's so human comprehensible that it seems hard to believe it actually is a data format.
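
What a "well-defined transform to JSON" might look like can be sketched in Python. This assumes the custom-typed YAML has already been loaded into typed objects (the loading step depends on the YAML library and is not shown). The class names mirror the tags in the custom-types example, while the `children` field and the `to_plain` helper are assumptions for illustration only.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Node:
    uid: str
    name: str
    children: list = field(default_factory=list)


# One class per custom type; they add no behaviour of their own here.
class Division(Node): pass
class Subdivision(Node): pass
class Vagga(Node): pass
class Sutta(Node): pass


def to_plain(node):
    """Flatten a typed node into plain dicts, recording the type as a string."""
    return {
        "type": type(node).__name__.lower(),
        "uid": node.uid,
        "name": node.name,
        "children": [to_plain(child) for child in node.children],
    }


tree = Division("sn", "Samyutta Nikaya", [
    Subdivision("sn1", "Devatā Saṃyutta", [
        Vagga("sn1.vagga2", "Nandana Vagga", [
            Sutta("sn1.11", "Nandana"),
        ]),
    ]),
])

# ensure_ascii=False keeps the Pali diacritics readable in the output.
as_json = json.dumps(to_plain(tree), ensure_ascii=False, indent=2)
print(as_json)
```

The custom type survives the transform as an ordinary `type` string, so any program that reads JSON can consume the result without knowing about the YAML extensions.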

Having investigated it, while the conservative choice would be XML, I feel the awesome choice is YAML.
