The data_def language

data_def is a dict containing a "data" keyword which can contain the following keywords. Unknown keywords will be ignored:

"sort" Only valid for a DataTreeGrab.JSONtree called through DataTreeShell. With it you can sort any number of lists in the tree through their childdicts before importing using:
data_value(path, page, list).sort(key=lambda l: (l[childkeys[0]], l[childkeys[1]], l[childkeys[2]]))
It contains a list of sort-definition dicts, each containing two keywords like:
"sort":[{"path":["schedule"], "childkeys":["station","unixtime"]}],
- "path" being a list of keywords or indexes to follow to the target list
- "childkeys" being a list of keys in the dicts contained in the list to be sorted. You can add a maximum of 3 keys.
"init-path" containing a path_def (also called the init_def) to the start-node. If it results in more than one end-node, only the first one will be used.
"iter" containing a list of dicts with each a "key-path" and a "values" keyword. Can be omitted if only a single set is needed.
"key-path" containing a path_def (also called the key_def) to a set of key-nodes
"values" containing a list of path_defs (each also called a value_def) to the data values to be linked to every key-node. They are numbered from 1 up, with the key-node itself being number 0. If a value_def results in multiple end-values, they will be contained within a list.
"values2" Version 1.3.3 introduced the possibility to store node references. Using these to jump directly back to a node can give a significant speed increase; I have seen searches run more than 30% faster. Using this, however, makes your data_defs incompatible with older versions of DataTreeGrab. By leaving the old "values" tree intact and creating a new adapted "values2" tree, your data_def will run with both the newer and the older versions.
"--remark--"
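
As a rough sketch of how these keywords fit together, the Python dict below defines a data_def for a hypothetical JSON page holding a "schedule" list. The key names "schedule", "station", "unixtime" and "title" are assumptions made up for illustration. The loop at the end does not call DataTreeGrab; it only mimics in plain Python what the "sort" keyword is described as doing.

```python
# A hypothetical data_def; all key names in the paths are made up.
data_def = {
    "data": {
        "sort": [{"path": ["schedule"], "childkeys": ["station", "unixtime"]}],
        "init-path": [{"key": "schedule"}],
        "iter": [
            {
                "key-path": [{"path": "all"}],
                "values": [
                    [{"key": "station"}],   # value 1
                    [{"key": "unixtime"}],  # value 2
                    [{"key": "title"}],     # value 3
                ],
            }
        ],
        "--remark--": "value 0 is the key-node itself",
    },
    "timezone": "UTC",
}

# Sample page data, also made up.
page = {
    "schedule": [
        {"station": "b", "unixtime": 20, "title": "News"},
        {"station": "a", "unixtime": 30, "title": "Film"},
        {"station": "a", "unixtime": 10, "title": "Quiz"},
    ]
}

# Plain-Python imitation of the "sort" step: follow "path" to the target
# list, then sort it on the listed childkeys.
for sort_def in data_def["data"]["sort"]:
    target = page
    for step in sort_def["path"]:
        target = target[step]
    childkeys = sort_def["childkeys"]
    target.sort(key=lambda item: tuple(item[k] for k in childkeys))

titles = [item["title"] for item in page["schedule"]]
print(titles)
```

The list is ordered first on "station" and then on "unixtime", which is the order the key-nodes would then be met in.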

Next to the "data" keyword it can contain the following property values. They are used in several data manipulation functions:

"datetimestring" Defaults to u"%Y-%m-%d %H:%M:%S"
"date-sequence" Defaults to ["y","m","d"]
"date-splitter" Defaults to "-"
"month-names" No default
"relative-weekdays" No default
"str-list-splitter" Defaults to "\|"
"time-splitter" Defaults to ":"
"time-type" Defaults to [24]
"timezone" Defaults to "UTC"
"value-filters" No default
"weekdays" No default

If you use the DataTreeShell wrapper class, some more keywords are recognized on the root-level of the data_def. They will be further explained there.

"accept-header" A string
"autoclose-tags" A list
"data-format" A string
"date-range-splitter" Defaults to "~"
"default-item-count" Defaults to 0
"empty-values" Defaults to [None, ""]
"enclose-with-html-tag" Defaults to False
"encoding" A string
"item-range-splitter" Defaults to "-"
"text_replace" A list of regex pairs
"unquote_html" A list of regexes
"url" A string or a list
"url-data" A dict
"url-date-type" Defaults to 0
"url-date-format" A format string. Defaults to None
"url-date-multiplier" Defaults to 1
"url-header" A dict
"url-type" An integer, not defined, but used by tv_grab_py_API.
"url-weekdays" A list
"values" A dict containing a set of link_defs.
This is a different one from the "values" keyword contained within the "data" keyword. See the data_def language Link extension for a description.

Escaping special characters

Special characters sometimes need to be escaped. Be aware that if you read in your data_def through the JSON parser, you might have to double escape. For instance: "split":[["/",-1],["\\.",0]] (to get a name without the path or the extension). For the splitting in Python you need to escape the dot, but one level of escaping is removed by the JSON parser. If you create your data_def as a Python dict/list structure a single escape suffices.
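
The difference between the two levels of escaping can be checked with the standard json and re modules alone. The sketch below is independent of DataTreeGrab; the file name "file.tar" is just sample input.

```python
import json
import re

# The "split" definition as it appears in a JSON file: the dot is double escaped.
json_text = r'{"split": [["/", -1], ["\\.", 0]]}'
from_json = json.loads(json_text)["split"]

# The same definition written directly as a Python structure: a single
# escape (here through a raw string) gives the same two-character pattern.
as_python = [["/", -1], [r"\.", 0]]

pattern = from_json[1][0]
parts = re.split(pattern, "file.tar")

# Without the escape the dot matches any character, so nothing useful is left.
unescaped_parts = re.split(".", "file.tar")

print(from_json == as_python, parts)
```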

path_def

A path_def is a list of node_defs and exists in several variants:

init_def
key_def
value_def

When walking through this list every node_def is evaluated to a list of zero or more nodes relative to the current node. On every node in this list the next node_def is applied, until either no node is found or the last node_def is reached. In the end this results in a list of zero or more nodes. In special situations this can be the NULLnode(). In a key_def or a value_def the final node_def can contain data manipulation functions and a pointer to the kind of data to be fetched. If missing this will be for HTML the text property and for JSON the value property. This final node_def can be either a separate one without node selection parameters or can be contained in the last node_def.
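
The walk described above can be sketched in plain Python over nested dicts and lists. This is not the DataTreeGrab implementation: select_nodes and walk_path are hypothetical helpers that only understand "path":"all" and "key", and the sample tree is made up.

```python
def select_nodes(node, node_def):
    """Return the nodes selected by one (simplified) node_def, relative to node."""
    if node_def.get("path") == "all":
        if isinstance(node, dict):
            return list(node.values())
        if isinstance(node, list):
            return list(node)
        return []
    if "key" in node_def:
        key = node_def["key"]
        if isinstance(node, dict) and key in node:
            return [node[key]]
        if isinstance(node, list) and isinstance(key, int) and 0 <= key < len(node):
            return [node[key]]
        return []
    # No selection keyword: the next node_def is applied to the same node.
    return [node]


def walk_path(start, path_def):
    """Apply every node_def in turn to all nodes found so far."""
    nodes = [start]
    for node_def in path_def:
        found = []
        for node in nodes:
            found.extend(select_nodes(node, node_def))
        nodes = found
        if not nodes:
            break  # no node found, so the walk ends with an empty result
    return nodes


tree = {"schedule": [{"title": "News"}, {"title": "Film"}, {"name": "no title"}]}
titles = walk_path(tree, [{"key": "schedule"}, {"path": "all"}, {"key": "title"}])
nothing = walk_path(tree, [{"key": "missing"}, {"key": "title"}])
print(titles, nothing)
```

Each node_def widens or narrows the current list of nodes, which is why a key_def can end in many key-nodes while a failing step simply yields an empty list.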

Node selection statements

A node_def is a dict defining zero or more nodes relative to the current node. To that effect you can use several keywords, some generic, some specific to HTMLtree or JSONtree. As of version 1.1.3 syntax and behaviour has been made more consistent. Any <value> can also be a link dict as described below.
For an HTMLtree you need at least a "path", "tag", "tags" or "index" keyword, where "tag" and "tags" can coexist with "index". They are checked in that order, so a "tags" keyword is ignored if a "path" or "tag" is present.
For a JSONtree you need at least a "path", "key", "keys" or "index" keyword. An "index" can coexist with a "keys" list. They are checked in that order.
If none of those keywords is present, the next node_def is applied to the same node, but first the node_def is checked for any "link" or "name" statements. It is best to combine those with a regular node_def, but sometimes you cannot, mainly if you need a link from your starting node or if you want to store multiple links on one node.

About values:

During node selection you compare values in your DataTree to values given by you in your data_def. These can be values like "a text value" or 257, or can also be links to earlier encountered values you have stored with the "link" keyword. Instead of a value you supply a link dict with:

At least the "link" keyword containing an integer pointing to an earlier saved value.
Optional in case of a numerical value the "calc" keyword containing a list of first "plus" or "min" followed by an integer. This latter value is added to or subtracted from the stored value.
Optional in case of the index keyword, the "previous" or "next" keyword, meaning to select all results with a lower or a higher index than the stored value. The content after the keyword is not used and ignored.

Of course you can also use data manipulation statements next to the data extraction statements from the final node_def syntax to adapt a value before storing it.
In case of HTML only text values are valid (except for indexes, which are integers). Other kinds of values will not give an error, but simply won't match. For JSON you can use any valid JSON datatype (string, integer, float, boolean, null), depending of course on where you use it. A key for instance can only match an integer or string. Everywhere <value> or <list of values> is indicated you can use a link dict.
Stored links in the key_def are passed on to its value_defs and from one value_def to the next.
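
How a link dict stands in for a literal value can be sketched as follows. The helper resolve_value and the links store are hypothetical and only model the "link" and "calc" keywords as described above; the "previous"/"next" keywords are left out.

```python
def resolve_value(value, links):
    """Return a literal value unchanged, or resolve a link dict against links."""
    if not (isinstance(value, dict) and "link" in value):
        return value
    stored = links[value["link"]]
    calc = value.get("calc")
    if calc:
        operation, amount = calc
        if operation == "plus":
            stored = stored + amount
        elif operation == "min":
            stored = stored - amount
    return stored


# Suppose an earlier node_def stored the index 4 under link number 1.
links = {1: 4}

literal = resolve_value("a text value", links)
next_index = resolve_value({"link": 1, "calc": ["plus", 2]}, links)
earlier_index = resolve_value({"link": 1, "calc": ["min", 1]}, links)
print(literal, next_index, earlier_index)
```

A node_def such as {"index": {"link": 1, "calc": ["plus", 2]}} would in this reading select the child two positions after the stored index.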

Generic keywords:

"path":`<keyword>`

This can have the following values:

"all", selecting all child-nodes.
The result differs on whether you use it in an init_def, key_def or one of the value_defs. The first one can only result in one end-node, so after the first successful hit the rest is ignored. In a key_def every resulting end-node is added as an independent key-node. A value_def acts somewhere in between. If it results in more than one end-value, they are gathered into one list for that key-node/value combination.
"parent", selecting the parent-node.
Be aware of the difference between an HTMLtree and a JSONtree. As a JSON value can not contain children, the default start-point there is the containing dict/list, so you do not first go to the parent to access the siblings. In an HTMLtree you do first go to the parent to access the siblings. This is only relevant for the first node_def in a value_def
"root", jumping to the root-node.
This can be relevant where in JSON a main-list contains pointers to other data-lists. See Example 1
Any integer value previously used with the "node" keyword to store a reference to a node, jumping directly back to that node.

"index":`<value>`

This can either contain an absolute child_index value or a "link" statement.

Be aware that the index value for a json dict has at present no real meaning as the json parser decodes it to a python dict, which has no predictable ordering. In the future an adapted json parser will be added that bypasses the python dict.

See Example 1 and Example 2

"node":`<integer>`

Store a reference to every qualifying node under this number. If there is no node selection keyword present in the node_def, the current node is referenced. Use the "path" keyword with the integer value to jump back to this node instead of slowly crawling to it.
This keyword is introduced with version 1.3.3. Using it makes your data_def incompatible with older versions, but can increase speed significantly.

"link":`<integer>`

Store the value for every qualifying node under this number to be referenced later. If there is no node selection keyword present in the node_def, the value is extracted from the current node. Use data extraction and data manipulation statements from the final node_def syntax to define the value to be stored. See Example 1 and Example 2. You can use this stored value later on to select an index, tag, key, attribute or childkey with this value.

"name":`<dict>`

This is a special keyword. The containing dict follows the syntax for the final node_def and the resulting value, extracted from the current node is stored. When at the end of the path_def the final value is retrieved it will be put in a dict with this stored name as a keyword. This is especially handy where in HTML the name and value are stored in 'independent' tags. You can define successive names in one value_def to create nested dicts, but normally that is not advisable. You should never add a name keyword to an init_def or a key_def. It will be applied and any successive value search will fail as the result will not be a DataNode but a dict. See Example 3

"--remark--":`<text>`

While any unknown keyword will be ignored and could be used to add clarification remarks, use this one to avoid it clashing with future keywords. In general any keyword in form of "--*****--" will suffice.

HTML keywords:

"tag":`<value>`

Select the tag(s) with this name. Tag-names are always lower-case. The value can be a "link".

"tags":`<list of values>`

The same as above, but with a list of multiple values/"links".

"attrs":`<dict>|<list of dicts>`

A dict of attribute names and contents that must be present. Use None/null to not select on the content of the attribute. Attribute-names are always lower-case. If you want to give multiple possible values, contain them in a list. An empty list is the opposite of None/null, so in essence it invalidates the statement as no match is possible. Use None/null to match attributes without value instead. It is also possible to give a negative selection: instead of the content value, add a dict with the "not" keyword followed by a list of values to not select. You can use a "link" statement as a value in the "attrs" dict. All attribute/value pairs must be satisfied. If you want to give multiple alternative sets, combine them as a list of dicts under the "attrs" keyword (so "or" instead of "and").

"notattrs":`<dict>|<list of dicts>`

A dict of attribute names and contents that may not be present. Use None/null to deselect on the presence of the attribute irrespective of the content. Attribute-names are always lower-case. If you want to give multiple possible values, contain them in a list. An empty list is the opposite of None/null, so in essence it invalidates the statement as no match is possible. Use None/null to match attributes without value instead. You can use a "link" statement as a value in the "notattrs" dict. The difference with using the "not" keyword together with "attrs" is that there the attribute must be present, but not with those values, whereas here the attribute may or may not be present, only not with the given values. All attribute/value pairs must be satisfied. If you want to give multiple alternative sets, combine them as a list of dicts under the "notattrs" keyword (so "or" instead of "and").
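
The selection rules for "attrs" and "notattrs" can be modelled in a few lines of plain Python. This is a simplified reading of the two descriptions above, not the library's code: attrs_match and notattrs_match are hypothetical helpers, a node is reduced to a dict of its attributes, and "notattrs" is only handled as a single dict.

```python
def _value_ok(actual, wanted):
    if wanted is None:                      # None/null: any content will do
        return True
    if isinstance(wanted, dict) and "not" in wanted:
        return actual not in wanted["not"]  # negative selection
    if isinstance(wanted, list):            # an empty list can never match
        return actual in wanted
    return actual == wanted


def attrs_match(node_attrs, attrs):
    """True if every pair in one of the alternative sets is satisfied."""
    alternatives = attrs if isinstance(attrs, list) else [attrs]
    return any(
        all(name in node_attrs and _value_ok(node_attrs[name], wanted)
            for name, wanted in one_set.items())
        for one_set in alternatives
    )


def notattrs_match(node_attrs, notattrs):
    """True if none of the named attributes is present with an unwanted value."""
    for name, unwanted in notattrs.items():
        if name not in node_attrs:
            continue                        # an absent attribute is fine here
        if unwanted is None:
            return False                    # present at all is already wrong
        values = unwanted if isinstance(unwanted, list) else [unwanted]
        if node_attrs[name] in values:
            return False
    return True


node = {"class": "title", "id": "t1"}       # made-up attributes of one tag
print(attrs_match(node, {"class": ["title", "subtitle"]}))
print(attrs_match(node, {"href": {"not": ["x"]}}))   # "href" must be present
print(notattrs_match(node, {"href": ["x"]}))         # "href" may be absent
```

The last two calls show the difference the text points out: a "not" inside "attrs" still requires the attribute, while "notattrs" accepts a tag that lacks it.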

"text":`<text>`

Select on tags containing this text. Case-sensitive. The full text must match but can also be a "link" statement.

"tail":`<text>`

Select on tags containing this tailing text. Case-sensitive. The full text must match but can also be a "link" statement.

JSON keywords:

In general "key", "keys", "childkeys" and "notchildkeys" follow the same syntax as respectively "tag", "tags", "attrs" and "notattrs". A "keys" keyword containing a dict is for backward compatibility seen as a "childkeys" statement.

"key":`<value>`

Select the child with this key value. Be aware that here a JSON dict and a JSON list are treated the same, a list value having its list index as a key. The only difference is that a dict has an alpha-numerical key and a list an integer key. So for a JSON list this is the same as "index". The value can be a "link".

"keys":`<list of values>`

Use this if you want to select multiple children on key value.

"childkeys":`<dict>|<list of dicts>`

Select the child(ren) containing these key/value pairs in the dict. Be aware of the difference with the previous two. These are the keys contained in the child dict/list, while the previous two look in the node's own key-list. You can use it to select among a set of similar nodes that contain a combination of properties (like the attribute in HTML) and a desired value. Once you have selected this containing node on the basis of its properties, you select the one with the value. If you want to give multiple possible values, contain them in a list. An empty list is the opposite of None/null, so in essence it invalidates the statement as no match is possible. It is also possible to give a negative selection: instead of the content value, add a dict with the "not" keyword followed by a list of values to not select. You can use a "link" statement as a value. If you want to give multiple alternative sets, combine them as a list of dicts under the "childkeys" keyword (so "or" instead of "and").

"notchildkeys":`<dict>|<list of dicts>`

A dict of keys and contents that may not be present in the child. Use None/null to deselect on the presence of the key irrespective of the content. If you want to give multiple possible values, contain them in a list. An empty list is the opposite of None/null, so in essence it invalidates the statement as no match is possible. You can use a "link" statement as a value in the "notchildkeys" dict. The difference with using the "not" keyword together with "childkeys" is that there the childkey must be present, but not with those values, whereas here the childkey may or may not be present, only not with the given values. All key/value pairs must be satisfied. If you want to give multiple alternative sets, combine them as a list of dicts under the "notchildkeys" keyword (so "or" instead of "and").
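
The two-step use of "childkeys" (first select the containing node on its properties, then pick the value inside it) might look like this in plain Python. The helper has_childkeys, the "credits" data and its key names are all hypothetical, and treating None as "any value" is an assumption by analogy with "attrs".

```python
def has_childkeys(child, childkeys):
    """True if the child dict contains every listed key with an accepted value."""
    if not isinstance(child, dict):
        return False
    for key, wanted in childkeys.items():
        if key not in child:
            return False
        if wanted is None:
            continue                      # assumption: any value is accepted
        values = wanted if isinstance(wanted, list) else [wanted]
        if child[key] not in values:
            return False
    return True


# Made-up JSON list of similar nodes, each with a property and a value.
credits = [
    {"role": "actor", "name": "A. Actor"},
    {"role": "director", "name": "B. Director"},
    {"role": "actor", "name": "C. Star"},
]

# Roughly what a value_def like
#   [{"childkeys": {"role": "actor"}}, {"key": "name"}]
# is meant to express: select on the property, then take the value.
actors = [c["name"] for c in credits if has_childkeys(c, {"role": "actor"})]
crew = [c["name"] for c in credits
        if has_childkeys(c, {"role": ["director", "writer"]})]
print(actors, crew)
```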

The final node_def

This is the data definition in the last node_def of a key_def or value_def. It can either be joined with the last actual node definition or be placed in a separate last node_def. The same language is used to define a link-value and a name-value.

Data extraction statements

First there is the selection of what data is returned. Only one selection can be made. If you need multiple values from this node use a separate value_def for each value. If omitted this defaults to "select":"text" for HTML and "select":"value" for JSON. This default applies both when no selection statement is given and when an invalid value is given to "select". The first in the order listed below will be used and any later one ignored. The resulting value can be string, numeric, boolean or None/null, but through the Data manipulation statements it can also be converted to datetime, date, time, timedelta values or a list.

"value":`<text>`

Return this text, but only if the node was found.

"attr":`<name>`

Return the content of the named attribute.

"select":`<keyword>`

Depending on whether it is a HTMLtree or a JSONtree it can select one of the following properties:

"index"
"tag"
"text"
"tail"
"key"
"value"
"presence"
This is a special selection. It will return True if exactly one node is found. Else False is returned.
"inclusive text"
In HTML, as tags can be simple layout statements like <i> or <b>, text can be spread over multiple levels of child nodes. This statement will concatenate the text and tail values from the child nodes to the node's text value. By default it only goes down one level, but you can add a "depth" keyword to indicate going deeper down. You can also add an "include" or "exclude" keyword with a list of tags to include or exclude. If "include" is present, "exclude" is ignored. For any tag being excluded (directly or by not being present in "include") only the tail value is added. The text value and the text/tail values from any children are excluded. Concatenated text parts are separated by a space.
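
The "inclusive text" rule can be illustrated with the standard xml.etree.ElementTree module, whose elements have the same text/tail split. ElementTree is used here only as a stand-in for an HTMLtree, and inclusive_text is a hypothetical helper following the description above, not the library's function.

```python
import xml.etree.ElementTree as ET


def inclusive_text(node, depth=1, include=None, exclude=None):
    """Join the node's text with the text and tail of its children."""
    parts = []
    if node.text and node.text.strip():
        parts.append(node.text.strip())

    def add_children(parent, level):
        for child in parent:
            if include is not None:          # "include" overrules "exclude"
                skipped = child.tag not in include
            else:
                skipped = exclude is not None and child.tag in exclude
            if not skipped:
                if child.text and child.text.strip():
                    parts.append(child.text.strip())
                if level < depth:
                    add_children(child, level + 1)
            # The tail belongs to the parent's text flow, so it is always added.
            if child.tail and child.tail.strip():
                parts.append(child.tail.strip())

    add_children(node, 1)
    return " ".join(parts)


paragraph = ET.fromstring("<p>Hello <b>bold</b> world <i>it</i>!</p>")
full = inclusive_text(paragraph)
without_i = inclusive_text(paragraph, exclude=["i"])
only_b = inclusive_text(paragraph, include=["b"], exclude=["b"])
print(full)
print(without_i)
```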

Data manipulation statements

After selecting the value, several transformations can be applied to the value. Add any number of them, but any keyword can only exist once. With some ("sub","split") you can add multiple implementations. They are given in the order the keyword presence is checked. Be aware that any special character must be escaped.

"lower":

Convert the value to lower-case. The content after the keyword is not used and ignored. Only applicable on string values!

"upper":

Convert the value to upper-case. The content after the keyword is not used and ignored. Only applicable on string values!

"capitalize":

Convert the first character to upper-case and the rest to lower-case. The content after the keyword is not used and ignored. Only applicable on string values!

"ascii-replace":`<list>`

The containing list can consist of up to 3 values:

- The character or character-string to put in place of non-ascii characters.
- The character or character-string to put in place of the characters listed in the third value.
- The list of characters to replace with the second value. This should be in the form of a string with the characters enclosed in []. This replacement will be done before the general replacement.

Only applicable on string values!

"lstrip":`<string>`

Strip the contained string, if present, from the left side of the value. Only applicable on string values!

"rstrip":`<string>`

Strip the contained string, if present, from the right side of the value. Only applicable on string values!

"sub":`<list>`

Replace sub-string 0 with sub-string 1. You can add multiple replacements after each other in the list. Only applicable on string values!

"split":`<list>`

Split the value on the first item in the list. Next return the numbered items supplied next, concatenating them with the split string. You can number up from 0 or down from -1. If the split is made on any of '\\s', '\\t', '\\n', '\\r', '\\f', '\\v', ' ' the concatenating is done with just a space (use '\\s*' to split on any number of whitespace characters). If you want to do multiple splits, join them after each other in a containing list. If a split fails the original string is maintained. Only applicable on string values!
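
A minimal sketch of the "split" behaviour, written from the description above rather than taken from the library. apply_split is a hypothetical helper; it joins the picked items with the pattern text as given, and after a failed split it simply carries on with the value it had, both of which are assumptions.

```python
import re

# Patterns for which the picked items are joined with a single space.
WHITESPACE_SPLITS = {"\\s", "\\t", "\\n", "\\r", "\\f", "\\v", " "}


def apply_split(value, split_def):
    """Apply one split ([pattern, index, ...]) or a list of such splits."""
    if not isinstance(value, str):
        return value
    splits = split_def if isinstance(split_def[0], list) else [split_def]
    for pattern, *indexes in splits:
        parts = re.split(pattern, value)
        try:
            picked = [parts[i] for i in indexes]
        except IndexError:
            continue  # the split failed: keep the value as it was
        joiner = " " if pattern in WHITESPACE_SPLITS else pattern
        value = joiner.join(picked)
    return value


# The example from the escaping section: name without path or extension.
name = apply_split("/home/user/file.tar", [["/", -1], ["\\.", 0]])
first_and_last = apply_split("a b c", ["\\s", 0, 2])
day_month = apply_split("2024-05-17", ["-", 2, 1])
unchanged = apply_split("abc", ["/", 3])
print(name, first_and_last, day_month, unchanged)
```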

"multiplier":`<integer>`

Multiply with the containing integer, converting first to integer. On failure the original value is maintained.

"divider":`<integer>`

Divide by the containing integer, converting first to integer. On failure the original value is maintained.

"replace":`<dict>`

If the value converted to lower-case exists as a key-value in the dict, replace it with its value. If it does not replace with None.

"default":`<value>`

If the value is None return this value. Use especially in conjunction with the previous keyword.

"first":

Only return the first found value. The content after the keyword is not used and ignored.

"last":

Only return the last found value. The content after the keyword is not used and ignored.

"type":

Force the value to be of the named type. On failure the original value is maintained.

"timestamp"
Apply:
value = datetime.datetime.fromtimestamp(float(value), self.utc)
"datetimestring"
Apply:
date = self.timezone.localize(datetime.datetime.strptime(value, dts))
value = self.utc.normalize(date.astimezone(self.utc))
Where dts is either defined within the node_def through the "datetimestring" keyword or through self.datetimestring as defined in DATAtree. Also self.timezone is the value defined in DATAtree defaulting to UTC.
"time"
On the basis of the self.time_type value it is determined whether this is a 12 or a 24 hour time. This can be overruled in the node_def through the "time-type" keyword. If a 12 hour type is defined, the end of the string is checked for an "am" or "pm" value (or the corresponding values defined in self.time_type/"time-type"). Next the value is split on self.time_splitter or on an ad-hoc character provided in the node_def through the "time-splitter" keyword. The resulting values are used as hour, minutes and seconds to construct a datetime.time object, adding 12 hours if a pm value was indicated.
"timedelta"
Apply:
value = datetime.timedelta(seconds = int(value))
"date"
Similar to the time function, but using "date-splitter"/self.date_splitter to split the value and "date-sequence"/self.date_sequence to define the order and returning a datetime.date object. In case the months are not numerical the self.month_names list is used to replace text with a number. A missing or invalid day/month/year value is replaced with the current one.
"datestamp"
Apply:
value = datetime.date.fromtimestamp(float(value))
"relative-weekday"
Like the "replace" function, but look in self.relative_weekdays.keys() for the value and replace if found
"string"
Force the value to be unicode.
"int"
Try to force the value to be an integer. Return 0 on failure.
"float"
Try to force the value to be a float. Return 0 on failure.
"boolean"
If value is a boolean return value.
If value is an integer or float return True if not equal to 0.
If value is a string or unicode return True if not empty and not equal to "0".
Else return False
"lower-ascii"
A simpler variant to the "ascii-replace" function:
- replace spaces and slashes with an underscore
- remove !(),
- remove any accents by replacing for instance á with a
- replace @ with a
- apply value = value.encode('ascii','replace') replacing any non-ascii character with a ?
"list"
Force the value to be a list. Use this in combination with a "path":"all" in a value_def to force an empty or single value to also return as an empty list or a list with a single item.
"str-list" (version 1.1)
Split a string value and return a list. Define the character to split on with the "str-list-splitter" keyword or use the default value in self.str_list_splitter as defined in DATAtree. If a keyword "omit-empty-list-items" with value True is encountered, all None/null values and empty string values are omitted from the list.
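
The "time" conversion can be sketched as below. parse_time is a hypothetical helper, not the library's code. It assumes that a 12 hour "time-type" looks like [12, "am", "pm"], which the text above only hints at, and it handles the 12 o'clock edge cases in the conventional way, which the text does not specify.

```python
import datetime


def parse_time(value, time_type=(24,), splitter=":"):
    """Turn a time string into a datetime.time, for 12 or 24 hour notation."""
    text = value.strip().lower()
    is_am = is_pm = False
    if time_type[0] == 12:
        # Assumed layout: [12, <am text>, <pm text>]
        am_text = time_type[1] if len(time_type) > 1 else "am"
        pm_text = time_type[2] if len(time_type) > 2 else "pm"
        if text.endswith(pm_text):
            is_pm = True
            text = text[:-len(pm_text)].strip()
        elif text.endswith(am_text):
            is_am = True
            text = text[:-len(am_text)].strip()
    fields = [int(part) for part in text.split(splitter)]
    fields += [0] * (3 - len(fields))      # missing minutes/seconds become 0
    hour, minute, second = fields[:3]
    if is_pm and hour < 12:
        hour += 12                         # "adding 12 hours" for a pm value
    elif is_am and hour == 12:
        hour = 0                           # 12:xx am is just after midnight
    return datetime.time(hour, minute, second)


evening = parse_time("9:30 pm", time_type=(12, "am", "pm"))
midnight = parse_time("12:15 am", time_type=(12, "am", "pm"))
dotted = parse_time("7.45.10", splitter=".")
print(evening, midnight, dotted)
```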

"member-off":`<value_filters key-value>`

DATAtree contains a property self.value_filters. This is a dict where each key contains a list of values. If the retrieved value is not found in this list the value is set to NULLnode and thereby the containing key-node is flagged invalid and dumped. Further value searching around this key-node is stopped. See Example 4.
This can be relevant if this node can not be used to select the key-nodes, because of its location in the tree or because it is shared by multiple key-nodes.

The data_def language URL extension

To extract URL data with DataTreeGrab.get_url() you can use URL_def statements. They can be used in the "url", "url-header" and "url-data" keywords. The first is a list of strings and functions that are concatenated to a url string. The other two are dicts of header and post-data items to send with the url. We suggest using the Requests package to fetch your url.
An example URL definition.
URL functions are numbered, where numbers up to 99 are reserved for system provided functions and higher numbers can be used for self defined functions. If you want to create your own functions, follow the scheme as in the example below. You can use your own keywords in data_def and the url_data dict. We however suggest to prefix them with user-. This to prevent clashing with future extensions.

from DataTreeGrab import DataTreeShell, is_data_value, data_value

class DataTree(DataTreeShell):
    def __init__(self, data_def, warnaction = "default"):
        <My own initialisation code>
        DataTreeShell.__init__(self, data_def, warnaction = warnaction, warngoal = <My own logger>)
        <My own code to overrule any default initialization>
        
    def add_on_url_functions(self, urlid, data = None):
        if urlid == 101:
            <My own url_function 101>
            return <My return value>

        elif urlid == 102:
            <My own url_function 102>
            return <My return value>

A URL_def consists of a list with the first item being the function number (sent to the function's urlid parameter) followed by data-items used by that function (sent to the function's data parameter). A function not needing data can also be called with just the number. Any string values will be taken as literal values! If the first list item is not an integer, function 0 is assumed.
A function can use several sources of data to determine what output to return:

A value supplied by your data_def
A value supplied through the url_data dict supplied on calling DataTreeGrab.get_url().
A value supplied in the URL_def in your data_def. This often is a pointer to data in the url_data dict.
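
How a "url" list of literal strings and URL_defs is turned into one string can be sketched as follows. build_url is a hypothetical helper, not DataTreeGrab.get_url(); it only knows function 0, taken here to be a plain lookup in the url_data dict with "url-var" as the fallback key. The host name and the url_data keys are made up.

```python
def build_url(url_def, url_data):
    """Concatenate literal strings and the results of (function 0) URL_defs."""
    parts = []
    for item in url_def:
        if isinstance(item, str):
            parts.append(item)              # strings are literal values
            continue
        call = item if isinstance(item, list) else [item]
        if call and isinstance(call[0], int):
            urlid, data = call[0], call[1:]
        else:
            urlid, data = 0, call           # no number given: function 0
        if urlid != 0:
            raise NotImplementedError("only function 0 is sketched here")
        key = data[0] if data else "url-var"
        parts.append(str(url_data[key]))
    return "".join(parts)


url = build_url(
    ["https://example.com/guide/", ["channel"], "?day=", [0, "offset"]],
    {"channel": "bbc1", "offset": 2},
)
print(url)
```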

Available URL functions

0: Needs one string value, being a data-key in the url_data dict, whose value is returned. For example [0, "datavalue"] will return the value in url_data["datavalue"]. If the key does not exist in url_data a warning is issued. As said before, you can leave out the number 0 and just call ["datavalue"]. If you just call 0 or [0] a key value of "url-var" is assumed, so if it exists url_data["url-var"] is returned.
4: Returns a range. Uses no extra values in data. Uses url_data["count"] and url_data["cnt-offset"] and from your data_def "default-item-count" and "item-range-splitter". "default-item-count" is the default for url_data["count"] and itself like url_data["cnt-offset"] defaults to 0. "item-range-splitter" defaults to "-".
It returns a range from url_data["cnt-offset"] * url_data["count"] + 1 to url_data["cnt-offset"] * url_data["count"] + url_data["count"], the two values being separated by "item-range-splitter". So with a cnt-offset of 0, a count of 100 and the default splitter, "1-100" is returned. Raising cnt-offset by one will return "101-200".
If your returning page holds a value for the number of items in the page, use the "page-item-count" keyword to hold a path_def pointing to it. The same goes for the total number and the "total-item-count" keyword. You can use these values to calculate any next page fetch (currentree here is the name of the DataTreeShell object). E.g.:

total_item_count = currentree.searchtree.find_data_value(currentree.data_value(['data',"total-item-count"],list))
current_item_count = currentree.searchtree.find_data_value(currentree.data_value(['data',"page-item-count"],list))

Another system that for instance ttvdb.com uses is to number the pages:

"first-page":[{"key":"links"},{ "key": "first"}],
"last-page":[{"key":"links"},{ "key": "last"}],
"next-page":[{"key":"links"},{ "key": "next"}],
"previous-page":[{"key":"links"},{ "key": "prev"}],

11: Returns a date value. Uses 1 key-value in data pointing to a value in the url_data dict, defaulting to "offset". This is an integer giving the days from the current_day as set with the DataTreeShell.set_current_date() function. Further from your data_def are used: "url-date-type" defaulting to 0, "url-date-format" defaulting to None, "url-date-multiplier" defaulting to 1 and "url-weekdays" defaulting to an empty list.
"url-date-type" determines in what format the date is returned:
- 0 If a valid format string is supplied through the "url-date-format" keyword that is used to format the date. Else the offset value in url_data["offset"] is returned.
- 1 A timestamp is returned multiplied by the value in the "url-date-multiplier" keyword.
- 2 The numerical weekday is determined with Monday being 0. If "url-weekdays" holds a list of exactly 7 items the value representing that weekday is returned, else the number. These can be any string or integer value, so if you need weekday numbers 1 through 7 starting on Sunday fill "url-weekdays" with [2,3,4,5,6,7,1].
14: Returns a date range. Uses 2 key-values in data pointing to values in the url_data dict, defaulting to "start" and "end". These are two integers giving the days from the current_day as set with the DataTreeShell.set_current_date() function. Further from your data_def are used: "url-date-type" defaulting to 0, "url-date-format" defaulting to None, "url-date-multiplier" defaulting to 1 and "url-weekdays" defaulting to an empty list and "date-range-splitter" defaulting to "~".
Two dates as described in function 11 are returned separated by the string in "date-range-splitter"
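
The arithmetic behind function 4 and the weekday mapping of function 11 (with "url-date-type" 2) can be checked with a short sketch. item_range and url_weekday are hypothetical helpers written from the descriptions above; they are not the library's functions and take the relevant values as plain arguments.

```python
import datetime


def item_range(url_data, data_def):
    """Function 4 as described: a "start<splitter>end" range for one page."""
    count = url_data.get("count", data_def.get("default-item-count", 0))
    offset = url_data.get("cnt-offset", 0)
    splitter = data_def.get("item-range-splitter", "-")
    start = offset * count + 1
    end = offset * count + count
    return "%d%s%d" % (start, splitter, end)


def url_weekday(date, url_weekdays=()):
    """Function 11, type 2: weekday number (Monday is 0) or its replacement."""
    weekday = date.weekday()
    if len(url_weekdays) == 7:
        return url_weekdays[weekday]
    return weekday


first_page = item_range({"count": 100, "cnt-offset": 0}, {})
second_page = item_range({"count": 100, "cnt-offset": 1}, {})

sunday = datetime.date(2024, 1, 7)          # a Sunday
plain = url_weekday(sunday)
sunday_first = url_weekday(sunday, [2, 3, 4, 5, 6, 7, 1])
print(first_page, second_page, plain, sunday_first)
```

With the [2,3,4,5,6,7,1] list a Sunday maps to 1 and a Monday to 2, which is the Sunday-first numbering mentioned above.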

The data_def language Link extension

The "values" keyword at the root of your data_def contains a dict with each keyword containing a link_def. When calling DataTreeShell.link_values(linkdata) a dict with every link_def replaced by its resulting value is returned. DataTreeShell.extract_datalist() will do that with the data-lists associated with every key-node as returned in DATAtree.result. The resulting list of dicts is placed in DataTreeShell.result.
This has two advantages:

Working with a dict in your program is easier as the meaning of every value is implied by the keyword and adding a new keyword can not disturb the other keywords.
You can combine multiple result values from the data-list into one value. For instance a date and a time into a full datetime value. Or if there are two or more possible locations in the page from where a specific value can be found you can select the first valid one from a list of descending quality. Also the functions in DATAtree are more about basic data extraction. Here you can add all kinds of specialized custom functions.

A link_def dict contains at least one of four keywords. The first one found in this order will be used. Successive primary keywords (except for "default") are ignored.
Some Link_def examples.

Primary keywords

"varid": It must contain an integer value that is an index in the extracted linkdata list. If an invalid index is supplied a warning is issued. Next to "varid" the following secondary keywords are recognized and handled in order:

"max length"
"min length"
"regex"
"type"
"calc"

"funcid": It must contain an integer value pointing to a Link function, where numbers up to 99 are reserved for system provided functions and higher numbers can be used for self defined functions. A number pointing to a non-existing function will result in a warning. If you want to create your own functions, follow the scheme as in the example further down. To prevent warnings about empty return values, because for instance it handles mailing you some message, give it a fid of 200 or higher.
Next to "funcid" a "data" keyword containing a list of data to be delivered to the function can be added. Every item in the list can be either a link_def dict or a concrete data-value. There is no limit other than that of your system on the level of nesting of the function calls. Next to "funcid" and "data" the following secondary keywords are recognized and handled in order:

"regex"
"type"
"calc"

"value": Returns the containing value.
"default": When none of the three above returns a valid value this containing value is returned. By default None and "" are seen as invalid, but you can change this list through the "empty-values" keyword. It must contain a list, but that can be an empty one.

Secondary keywords

"max length" and "min length":
You can limit the maximum and minimum value or length of a variable. int and float values are tested on their value; str, unicode, list and dict values are tested on their length. If the value is too big, small, short or long, None is returned and a warning is issued.
"regex": This must contain a valid regex returning at least one valid group and is applied to the result of the coexisting "varid" or "funcid". On failure None is returned and a warning is issued.
"type": Must contain a string value. Forces the returning value of the coexisting "varid" or "funcid" to be of this type. On failure None is returned and a warning is issued. An invalid type only raises a warning, but is further ignored. Valid values are:
- "string": Apply: unicode(value)
- "lower": Apply: unicode(value).lower()
- "upper": Apply: unicode(value).upper()
- "capitalize": Apply: unicode(value).capitalize()
- "int": Apply: int(value)
- "float": Apply: float(value)
- "bool": Apply: bool(value)
"calc": Contains a dict with one or more of the below keywords with an int or float value.
- "multiplier":
- "divider":

Example add_on_link_functions implementation:

from DataTreeGrab import DataTreeShell, is_data_value, data_value

class DataTree(DataTreeShell):
    def __init__(self, data_def, warnaction = "default"):
        <My own initialisation code>
        DataTreeShell.__init__(self, data_def, warnaction = warnaction, warngoal = <My own logger>)
        <My own code to overrule any default initialization>
        
    def add_on_link_functions(self, fid, data = None, default = None):
        if fid == 101:
            <My own link_function 101>
            return <My return value>

        elif fid == 201:
            <My own link_function 201 that does not give a return value>
            return None

The tvgrabpyAPI add_on_link_functions.

Available Link functions

0: Strip data[1] from the end of data[0] if present and make sure it's unicode
1: Strip data[1] from the start of data[0] if present and make sure it's unicode (new in version 1.3.3; previously present in tvgrabpyAPI as function 109)
2: Concatenate string-parts in data and make sure it's unicode
3: Get 1 or more parts of a path in data[0]. data[1] is an integer or a list of integers pointing to the index(es) of the resulting list after the split. Concatenate the parts again with a "/".
4: Combine the date value in data[0] with the time value in data[1]. If data[0] is not a datetime.date object, self.current_date is used. data[2] must be an integer, but is currently not used. If data[3] is a time object and, when combined with data[0], it lies before the datetime.datetime object created above, one day is added (to correct for a possible passing of midnight).
5: Return True (or data[2]) if data[1] is present in data[0], else False (or data[3])
6: Compare the values data[0] and data[1] returning data[2] or True if equal, data[3] or False if unequal and data[4] or None if one of them is None.
7: Return string data[1] (if present) on data[0] being True. Else data[2]
8: Return the longest not empty text value in data
9: Return the first not empty value in data
10: Look for data[2] in list data[0] and return the corresponding value in list data[1]. If not found return data[3] or None if absent
11: Look for data[1] in the keys from dict data[0] and return the corresponding value
12: Remove data[1] from the string in data[0] if present and make sure it's unicode (new in version 1.3.3)
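
To make the descriptions of functions 0, 1, 12 and 10 more concrete, here are plain Python approximations. They are sketches based only on the one-line descriptions above; the function names are hypothetical, and details such as whitespace trimming or whether function 12 removes every occurrence are assumptions.

```python
def strip_end(data):
    """Like function 0: strip data[1] from the end of data[0] if present."""
    text, part = str(data[0]), str(data[1])
    if part and text.endswith(part):
        text = text[:-len(part)]
    return text


def strip_start(data):
    """Like function 1: strip data[1] from the start of data[0] if present."""
    text, part = str(data[0]), str(data[1])
    if part and text.startswith(part):
        text = text[len(part):]
    return text


def remove_part(data):
    """Like function 12: remove data[1] from data[0] (here: every occurrence)."""
    return str(data[0]).replace(str(data[1]), "")


def lookup(data):
    """Like function 10: find data[2] in list data[0] and return the value
    at the same position in list data[1], else data[3] or None."""
    keys, values, wanted = data[0], data[1], data[2]
    fallback = data[3] if len(data) > 3 else None
    if wanted in keys:
        index = keys.index(wanted)
        if index < len(values):
            return values[index]
    return fallback


print(strip_end(["News (repeat)", " (repeat)"]))
print(lookup([["nl", "en"], ["Dutch", "English"], "en"]))
```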

Glossary

accept-header
autoclose-tags
caller_id
current_date
current_ordinal
child_index
data_def
data-format
DATAnode
DATAtree
date-range-splitter
date-sequence
date-splitter
datetimestring
default-item-count
empty-values
enclose-with-html-tag
encoding
init_def
item-range-splitter
key_def
key-node
link_def
link-value
month-names
name-value
node_def
NULLnode
path_def
.print_searchtree
relative-weekdays
root-node
severity
.show_result
start_node
str-list-splitter
text_replace
time-splitter
time-type
timezone
unquote_html
URL_def
url
url-data
url-date-format
url-date-multiplier
url-date-type
url-header
url-type
url-weekdays
value_def
value-filters
warngoal
weekdays
