Skip to content

Abstract Syntax Tree as JSON

Engelbert Niehaus edited this page May 1, 2019 · 22 revisions

What is an Abstract Syntax Tree (AST)?

In wtf_wikipedia the concept of an Abstract Syntax Tree (AST) is a tree representation of the WikiMedia syntax of the source text in Wiki Markdown. The data in the Wiki article in stored in a JSON structure. The JSON is valuable for data management of extracted content elements. The AST can be represented in a JSON as well, in which each node of the tree denotes a content element (e.g. paragraph, header/title, image, mathematical expression) occurring in the source text downloaded via the MediaWiki API with wtf.fetch(...). The syntax tree is "abstract" in a sense that is not representing a special output format in detail (e.g. HTML, LaTeX, MarkDown,...). The AST nodes are encode in the Wiki Markdown and tree structure can be used to generate the different output formats by application of the appropriate tree node handler for the title, image, sentences to the AST.

Similar to programming languages the abstract syntax trees can be derived from a concrete syntax trees, traditionally generated by parsing a given string compliant with a "strange" grammar of the Wiki source text.

Current Implementation of an AST

A Wikipedia or Wikiversity article is decomposed into an AST in the following order. The current documentation is refering to wtf_wikipedia version 7.1

Document Node src/01-document

A wiki source article is parsed and the root node of the AST is of type Document.

  • The root node of the Abstract Syntax Tree (AST) Document stored in an object in /src/01-document/index.js
const main = function(wiki, options) {
  options = options || {};
  wiki = wiki || '';
  let r = {
    type: 'page',
    sections: [],
    interwiki: {},
    categories: [],
    coordinates: [],
    citations: []
  };
  ...
  • The Document object is created in src/01-document/document.js.
  • A Document object is decomposed in sections defined in src/02-section as object Section
  • A Section object is decomposed in paragraphs defined in src/03-paragraph as object Paragraph
  • A Paragraph object is decomposed in sentences defined in src/04-sentence as object Sentence
  • A Sentence object could contain other subobject like Tables in src/table or subobjects like References, Citations or mathematical expression MathExpr (currently not defined).

The root node of the AST is the r object. It is populated with the sections, citations, ... during parsing of the wiki source document. See method createNode4AST() for a generic approach.

Storage of AST

The downloaded Wiki sources of an article is stored in wtf.wiki. The wiki source text is processed into an abstract syntax tree that could be stored in wtf.tree. If developers want to export the JSON AST into a well formated text file use:

var vTreeString = JSON.stringify(wtf.tree,null,4); 
  • The string of the ASt can be used in the test cases in tests/basis_ast.test.js.
  • For debugging and designing new export formats.

Example of an AST

The following example is currently not an available feature in wtf_wikipedia. It could serve as basis for further generation of other export formats for formats that will never be implemented in wtf_wikipedia.

Wiki Page >-> wtf_wikipedia.js >-> AST >-> ast2odf.js >-> Open Document Format

{
   "language": "en",
   "domain": "wikiversity",
   "article": "Water",
   "ast": [
       {
           "type":"page",
           "value":"",
           "children":[
              {
                   "type":"sentence",
                   "value":"My first sentence.",
                  "children":[]
             },
           {
                   "type":"sentence",
                   "value":"My Second sentence.",
                  "children":[]
             },
          ]
       },
       {
           "type":"title",
           "value":"My Title",
           "children":null

       },
       {
           "type":"math",
           "value":"\\sum_{k=1}^{n} k^2",
           "children":[]

       },
    ]
}

Create a Node for the AST

The method createNode4AST() creates a very simple node for the Abstract Syntax Tree (AST) by returning a hash with just the type attribute. This AST node can be populated with additional attributes that may be relevant for generation of output formats. E.g. a Section needs a title and depth attribute as implemented in /sec/section/index.js l.40 with function splitSections().

  const createNode4AST = function(nodeid) {
    return {
               "type":nodeid
           }
  }

A tree node for Section will be created with createNode4AST("Section") and populated with more content parsed in the section body. The contentlist (is this equivalent to templates attribute in section hash??? or are these really templates).

A node-specific constructor could use switch command for adding type specific additional attributes.

  const createNode4AST = function(nodeid) {
     let ast_node = {
               "type":nodeid
           };
     switch (nodeid) {
        case "Paragraph","TextBlock","BulletList","Sentence":
           ast_node.contentlist = new ContentList()
        break;
        case "EnumList":
           ast_node.enumtype = "arabic"; //used for output: > 1. 2. 3.  
           // enumtype = "alph" > a) b) c) - not used in release 5.0
           ast_node.contentlist = new ContentList()
        break;
        case "Section":
           ast_node.title = "";
           ast_node.depth = -1;
           ast_node.templates = [];
           ast_node.contentlist = new ContentList()
        break;
        default:
    
    }
    return ast_node
  }

Branching in the AST - ContentList

The main branching element for parsing the children is the ContentList. It decomposes the part of the wiki source in a sequence content elements (branches of the tree node). The elements in the content list can processed further and decomposed into sub trees by parsing the decomposed wiki text fragments. ContentList performes two abstract tasks:

  • block level: decomposition into Paragraph, Table, EnumList, BulletList, MathBlock, ...
  • sentence level: decomposition into sentence fragments Bold, Italics, Underline, MathInline, ImageInline (like icons), ...

Javascript Parser

  • MediaWiki Parser Parsoid (see https://www.mediawiki.org/wiki/Parsoid) parses Wiki-Markdown to HTML and HTML to Wiki-Markdown. parsoid can be used to replace the parser inside wtf_wikipedia with the output format HTML with just the Wiki markdown fetch wtf_fetch and the application of parsoid on the downloaded/fetched wiki source text.
  • Use a generic parser generator for specific expressions e.g. with PEGJS. Can be used to generate a Javascript parser for specific purpose in wtf_wikipedia