Hika van den Hoven edited this page May 5, 2017 · 45 revisions

The syntax

The classes

DataTreeGrab.DATAtree()

Options:

This is the class that HTMLtree and JSONtree are derived from. You can use it to create your own tree definition. It contains several settable attributes that are used during data extraction:

  • .print_searchtree = False
    Debug setting. If set to True, the result of every step through the tree is sent to .fle
  • .show_result = False
    Debug setting. If set to True, the end result is sent to .fle
  • .fle = output
    The output target for the two debug settings above. It holds the output parameter given on creation, which should be a file object that is already open.
  • .show_progress = False
    As the .extract_datalist() function can take several minutes to complete, you might choose to run it in a background thread. To show progress in the main thread at the same time, set this flag to True.
  • .progress_queue = Queue()
    If .show_progress is True while .extract_datalist() runs, two-part tuples indicating the progress are inserted into this Queue() object. The first tuple, sent once the key-nodes have been extracted, is 0 followed by the number of key-nodes found. Each subsequent tuple is a value from 1 up to that total, again followed by the total. You can, for instance, divide the first tuple value by the second to get a progress fraction to show on screen.
  • .quit = False
    While .extract_datalist() runs, this flag is checked before the dataset for each found key-node is extracted. If it is True, the extraction is aborted immediately. Because extraction can take several minutes on large datasets, the flag is mostly intended for multithreaded environments, to abort safely on a user abort or on a fatal error in another thread. Checks of this flag might later be added in other locations.
  • .extract_from_parent = False
    True for JSON and False for HTML. This is to equalize the behaviour of the two tree-types and has to do with JSON data being only present at the end-point, without any children, while in HTML data can be present everywhere. Only potentially relevant on creating your own tree definition.
  • .result = []
    The result of a search
  • .current_date
    Today as a datetime.date object. Used by many date related functions as a default. Settable with the set_current_date() function.
  • .current_ordinal
    Today as an ordinal. Used by many date related functions as a default. Settable with the set_current_date() function.
  • .month_names = []
    A list of the month-names. This can be either short or long names, just what this page uses. As the first month is 1, the first item should be a dummy one.
    See the "date" keyword
  • .weekdays = []
    A list of weekdays starting with Monday. This can be either short or long names, just what this page uses.
  • .relative_weekdays = {}
    A dict with daynames like "today", "yesterday" as key and their relative position to today as value.
    e.g. {"yesterday":-1,"today":0,"tomorrow":1}. This dict will be updated with the weekdays above and their position relative to today, up to a week in the future.
    See the "relative-weekday" keyword
  • .datetimestring = u"%Y-%m-%d %H:%M:%S"
    See the "datetimestring" keyword
  • .time_splitter = u':'
    See the "time" keyword
  • .time_type = [24]
    See the "time" keyword
  • .date_sequence = ["y","m","d"]
    See the "date" keyword
  • .date_splitter = u'-'
    See the "date" keyword
  • .utc = pytz.utc
    See the "timestamp" and "datetimestring" keyword
  • .timezone = pytz.utc
    Any datetime will be stored in the result as a UTC value. However, the time extracted from the page can be given in another timezone. That timezone should be stored in this property. Use the set_timezone function to change it.
    See the "datetimestring" keyword
  • .value_filters = {}
    See the "member-off" keyword
  • .str_list_splitter = '|'
    See the "str-list" keyword
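The interplay of .show_progress, .progress_queue and .quit can be sketched as follows. This is a minimal, self-contained Python 3 illustration and not DataTreeGrab code: FakeTree and its extract_datalist() are hypothetical stand-ins that only mimic the tuple protocol described above.

```python
import queue
import threading


class FakeTree:
    """Hypothetical stand-in that mimics the progress protocol described above."""

    def __init__(self, key_nodes):
        self.key_nodes = key_nodes
        self.show_progress = False
        self.progress_queue = queue.Queue()
        self.quit = False
        self.result = []

    def extract_datalist(self):
        total = len(self.key_nodes)
        if self.show_progress:
            # First tuple: 0 followed by the number of key-nodes found.
            self.progress_queue.put((0, total))
        for index, node in enumerate(self.key_nodes, start=1):
            # The quit flag is checked before each key-node is processed.
            if self.quit:
                return
            self.result.append(node.upper())  # placeholder for real extraction
            if self.show_progress:
                self.progress_queue.put((index, total))


def run_with_progress(tree):
    """Run the extraction in a background thread and collect percentages."""
    tree.show_progress = True
    worker = threading.Thread(target=tree.extract_datalist)
    worker.start()
    # A real program would poll the queue while the worker is still running;
    # joining first keeps this example deterministic.
    worker.join()
    percentages = []
    while not tree.progress_queue.empty():
        done, total = tree.progress_queue.get()
        if total:
            percentages.append(100 * done // total)
    return percentages


tree = FakeTree(["a", "b", "c", "d"])
progress = run_with_progress(tree)
```

Setting tree.quit = True from another thread would make the stand-in return before the next key-node is handled, which is the behaviour the .quit description outlines.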

And some functions:

.set_timezone([timezone = None])

Set the timezone to be used. timezone can be either a tzinfo object or a valid timezone string like "Europe/Amsterdam". If no timezone is given the data_def is queried for a timezone keyword, defaulting to "UTC". If an invalid timezone is given a warning is set and "UTC" is used. This function is automatically called on initialization.

.set_current_date([cdate = None])

Set the default date to be used by the several datetime functions. This can be a datetime.datetime object, a datetime.date object or an ordinal. If no value is given self.timezone.normalize(datetime.datetime.now(pytz.utc).astimezone(self.timezone)).date() is used. This function is automatically called on initialization.
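The three accepted input types can be reduced to one date and its ordinal. The sketch below is an assumption about how such a normalisation might look in Python 3; to_date_and_ordinal is a hypothetical helper, not part of DataTreeGrab, and its default is a plain UTC date rather than the timezone-aware expression quoted above.

```python
import datetime


def to_date_and_ordinal(cdate=None):
    """Hypothetical helper: normalise cdate to a (date, ordinal) pair."""
    # datetime.datetime is a subclass of datetime.date, so test it first.
    if isinstance(cdate, datetime.datetime):
        current = cdate.date()
    elif isinstance(cdate, datetime.date):
        current = cdate
    elif isinstance(cdate, int):
        current = datetime.date.fromordinal(cdate)
    else:
        # Stand-in default: today's date in UTC. The library itself uses
        # its configured timezone here, as described above.
        current = datetime.datetime.now(datetime.timezone.utc).date()
    return current, current.toordinal()
```

The pair corresponds to the .current_date and .current_ordinal attributes listed earlier.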

.check_data_def(data_def)

Store a data_def and extract any values to fill the above properties. Runs the above two functions.

.find_start_node([data_def = None])

By default the start-node for any search is the root. This function lets you define another start-node. It looks in data_def for an "init-path" keyword containing an init_def leading to the new start-node. If data_def is omitted an earlier stored data_def is used.

.find_data_value(path_def[, start_node = None, link_values = None])

This function lets you extract a list of zero or more data-values and is used by the next function. path_def is a list of node_defs to follow. start_node, if undefined, is either the root or the node set through the above function. If defined, it should be one of the DATAnodes in the tree. link_values is for internal use, to store and transfer values during a search as defined by the "link" keyword; if you call this function yourself, leave it at None.

.extract_datalist([data_def = None])

This is the main function. It looks in data_def first for the "key-path" keyword to find the key-nodes and next for the "values" keyword to find the corresponding values. A list of multiple "key-path", "values" sets can be contained within the "iter" keyword. If data_def is omitted an earlier stored data_def is used.

.is_data_value(searchpath, [dtype = None, empty_is_false = False])

Calls the DataTreeGrab.is_data_value function with the data_def as searchtree.

.data_value(searchpath, [dtype = None, default = None])

Calls the DataTreeGrab.data_value function with the data_def as searchtree.

DataTreeGrab.HTMLtree()

Options:

  • data
  • autoclose_tags = []
  • print_tags = False
  • output=sys.stdout
  • warnaction = "default"
  • warngoal = sys.stderr
  • caller_id = 0

This is the wrapper around DATAtree to read in the HTMLpage and to deal with the specific HTML properties. It contains the following additional attributes:

  • .print_tags = print_tags
    Debug setting. If set to True, the result of parsing through the HTML page is sent to .fle
  • .autoclose_tags = autoclose_tags
    The page is checked for imbalances between opening and closing tags. Tags found to be unbalanced are added to this list. On creation you can already supply tags you know to have this problem.
  • .root = HTMLnode(self, 'root')
    The root node from where to build the tree
  • .start_node = self.root
    The node to start any search

DataTreeGrab.JSONtree()

Options:

This is the wrapper around DATAtree to read in the JSONpage and to deal with the specific JSON properties. At present it only accepts, through the jsonparser, converted data. It contains the following additional attributes:

  • .root = JSONnode(self, data, key='ROOT')
    The root node from where to build the tree.
  • .start_node = self.root
    The node to start any search

DataTreeGrab.DATAnode()

Options:

  • dtree
  • parent = None

The basic node definition, containing the basic attributes and several search functions. Some of these are empty functions that are implemented in the HTMLnode and JSONnode subclasses.

Attributes:

  • .dtree = dtree
  • .parent = parent
  • .children = []
  • .value = None
  • .child_index = 0
  • .level = 0
  • .is_root = bool(self.parent==None)
  • .get_children([path_def=None, link_values=None])
    The first node_def in the path_def is used and then subtracted on further calls by the matching nodes. This is the main function used by the tree-functions.
  • .print_tree()
    Used by the print_searchtree property

Functions detailed in the subclasses and used by get_children:

  • .match_node([node_def=None, link_values=None])
    Check if the node matches the node_def returning True or False
  • .find_name(node_def)
    Check and retrieve a value for the "name" keyword
  • .find_value(node_def)
    Retrieve the value for the final node_def
  • .print_node([print_all=False])
    Used by the print_tree function and the show_result property

DataTreeGrab.HTMLnode()

Options:

  • dtree
  • data = None
  • parent = None

Attributes:

  • .tag = u''
  • .text = u''
  • .tail = u''
  • .attributes = {}
  • .match_node([tag=None, attributes = None, node_def = None, link_values = None, last_node_def = False])
  • .get_child(self, tag = None, attributes = None)
  • .get_attribute(name)
  • .is_attribute(name[, value = None])

DataTreeGrab.JSONnode()

Options:

  • dtree
  • data = None
  • parent = None
  • key = None

Attributes:

  • .type = "value"
  • .key = key
  • .keys = []
  • .key_index = {}
  • .value = None
  • .get_child(key)
  • .match_node([node_def = None, link_values = None, last_node_def = False])

DataTreeGrab.NULLnode()

DataTreeGrab.DataTreeShell()

Options:

This is a wrapper around HTMLtree and JSONtree. It allows for extraction from a URL, autodetects whether the data is JSON or HTML, and has functionality to further process the datalists coming out of the tree into a dict. Here you can combine multiple values, add your own functions, etc. See also the DataTreeShell page.

The functions

DataTreeGrab.version()

A function returning a tuple with:

  1. The name
  2. major version number
  3. minor version number
  4. patch number
  5. patch date "yyyymmdd"
  6. boolean on whether it is beta
  7. boolean on whether it is alpha

You can use it to check the module version when initializing your script.
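A version check based on this tuple could look like the sketch below. meets_minimum is a hypothetical helper and sample_version is an invented value that merely follows the seven-item layout listed above; in a real script the tuple would come from DataTreeGrab.version().

```python
def meets_minimum(version_info, minimum):
    """Return True if the (major, minor, patch) part is at least `minimum`."""
    # Items 2 to 4 of the tuple (indexes 1 to 3) are major, minor and patch.
    return tuple(version_info[1:4]) >= tuple(minimum)


# Hypothetical sample value in the layout listed above; not a real release.
sample_version = ("DataTreeGrab", 1, 2, 0, "20170101", False, False)
```

Tuple comparison in Python is element by element, so (1, 2, 0) counts as newer than (1, 1, 3) without any string parsing.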

DataTreeGrab.is_data_value(searchpath, searchtree, [dtype = None, empty_is_false = False])

This function searches searchtree following the keywords in searchpath. searchtree is a dict/list tree like a data_def; searchpath is a list of string/integer keys. If the requested value exists, its type is checked against dtype. If dtype is None, any type suffices except a None/null value. If empty_is_false is set to True, an empty string, list or dict is also seen as invalid. It returns either True or False. Next to valid Python types and classes as accepted by the Python isinstance function, dtype also accepts the strings "string" and "list". As of version 1.1.3 dtype can also be a tuple of multiple types and classes.

  • If dtype = float an int type will also give True.
  • str, unicode or "string" will accept both str and unicode values
  • list, tuple or "list" will accept both list and tuple values
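The rules above can be illustrated with a small re-implementation. This is a sketch written for Python 3 from the description only, not the library's code: is_data_value_sketch is a hypothetical name, unicode does not exist as a separate type in Python 3, and treating a bare key as a one-item path is an assumption.

```python
def is_data_value_sketch(searchpath, searchtree, dtype=None, empty_is_false=False):
    """Illustrative re-implementation of the behaviour described above."""
    # Assumption: a bare key is treated as a one-item path.
    if not isinstance(searchpath, (list, tuple)):
        searchpath = [searchpath]
    # Walk the dict/list tree one key at a time.
    node = searchtree
    for key in searchpath:
        if isinstance(node, dict):
            if key not in node:
                return False
            node = node[key]
        elif isinstance(node, (list, tuple)):
            if not isinstance(key, int) or not 0 <= key < len(node):
                return False
            node = node[key]
        else:
            return False
    # A None/null value is never valid.
    if node is None:
        return False
    if (empty_is_false and isinstance(node, (str, list, tuple, dict))
            and len(node) == 0):
        return False
    if dtype is None:
        return True
    # Normalise dtype to a tuple so single types and tuples share one path.
    wanted = dtype if isinstance(dtype, tuple) else (dtype,)
    accepted = []
    for item in wanted:
        if item in (str, "string"):
            accepted.append(str)
        elif item in (list, tuple, "list"):
            accepted.extend([list, tuple])
        elif item is float:
            accepted.extend([float, int])  # an int also satisfies float
        else:
            accepted.append(item)
    return isinstance(node, tuple(accepted))
```

Widening the requested types before the isinstance call keeps the three special cases from the list above in one place.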

DataTreeGrab.data_value(searchpath, searchtree, [dtype = None, default = None])

This function first calls the above function and, on True, returns the value. On False it returns default if set. Otherwise, if dtype is str, unicode, dict, list or tuple, an empty string, dict or list is returned. It does NOT accept a tuple of types for dtype, as a single type is needed to know which empty value to return!
Be aware that a default of None means no default. So if dtype is str, unicode, dict, list or tuple, the implied default of "", {} or [] is returned!
For instance:

dversion = data_value('version', v, int, 0)

is the same as

dversion = v['version'] if (isinstance(v, dict) and 'version' in v and isinstance(v["version"], int)) else 0

or

if isinstance(v, dict) and 'version' in v and isinstance(v["version"], int):
    dversion = v['version']
else:
    dversion = 0
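The fallback order, including the implied empty default, can be sketched as follows. This is an illustration in Python 3 based on the description only; data_value_sketch is a hypothetical name, and its type check is a plain isinstance call without the int/float and str/unicode widening that the library applies.

```python
def data_value_sketch(searchpath, searchtree, dtype=None, default=None):
    """Illustrative re-implementation of the fallback rules described above."""
    # The example above passes a bare key, so accept that as a one-item path.
    if not isinstance(searchpath, (list, tuple)):
        searchpath = [searchpath]
    node = searchtree
    found = True
    for key in searchpath:
        if isinstance(node, dict) and key in node:
            node = node[key]
        elif (isinstance(node, (list, tuple)) and isinstance(key, int)
              and 0 <= key < len(node)):
            node = node[key]
        else:
            found = False
            break
    if found and node is not None and (dtype is None or isinstance(node, dtype)):
        return node
    if default is not None:
        return default
    # A default of None means "no default": fall back to an empty value
    # matching dtype, as the text above warns.
    if dtype is str:
        return ""
    if dtype is dict:
        return {}
    if dtype in (list, tuple):
        return []
    return None
```

With this sketch a missing key requested as list yields [] rather than None, which is the pitfall the warning above points at.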

DataTreeGrab.extend_list(base_list, extend_list)

Basically this function will extend base_list with extend_list and return the result, but it will also check a few things.

  • If base_list is not a list it will make it a list containing base_list.
  • If extend_list is not a list it will use append instead of extend.
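The two checks can be sketched in a few lines. extend_list_sketch is a hypothetical re-implementation based on the description only; whether the library modifies base_list in place or returns a new list is not stated here, so the sketch works on a copy.

```python
def extend_list_sketch(base_list, extend_list):
    """Illustrative re-implementation of the two checks listed above."""
    # Work on a copy so the caller's list is left untouched (a choice made
    # for this sketch; the library may modify base_list in place).
    result = list(base_list) if isinstance(base_list, list) else [base_list]
    if isinstance(extend_list, list):
        result.extend(extend_list)
    else:
        result.append(extend_list)
    return result
```

Using append for non-list values matters for strings in particular: extending a list with "ab" would otherwise add the characters "a" and "b" separately.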