DataTreeGrab

  • Some examples
  • The latest stable release
  • Go to the WIKI
  • Go to tvgrabpyAPI

Spin-off python module for extracting structured data from HTML and JSON pages.
It is at the heart of the tv_grab_py_API and was initially named just DataTree,
but as that name is already taken in the Python library, it was renamed to
DataTreeGrab.

Requirements

  • Python 2.7.9 or higher (Python 3.x is currently not supported)
  • The pytz module

Installation

  • Make sure Python 2.7.9 or higher is installed (especially important under Windows)
  • Make sure the pytz module mentioned above is installed on your system
  • Download the latest stable release and unpack it into a directory
  • Run from that directory:
    • under Linux: sudo ./setup.py install
    • under Windows, depending on how you installed Python:
      • setup.py install
      • or: python setup.py install

Main advantages

  • It gives you a highly dependable dataset from a potentially changing source.
  • You can easily respond to changes in the source without touching your code.
  • You can make the data_def available at a central location while distributing
    the application, giving your users easy access to (automated) updates.

Known issues

  • Adding warning rules to DataTreeShell prior to DataTree initialization places the general rule before the added rules. Fixed in version 1.3.1.

Release notes

With version 1.4.0 10-07-2017

  • Introduced a pre-conversion of the data_def to a more machine-friendly and thus faster format. During conversion the data_defs are validated and any defaults are filled in. Because of this, a lot of validation code during parsing could be removed, yielding a further speed increase.
    (Some 50% relative to 1.3.3 and 65% compared to 1.3.2)
  • During a complete review of the code to adapt it to the converted data_def format, several inconsistencies in data_def keyword handling were found and corrected.
  • It should be compatible with older implementations.

With version 1.3.4 18-06-2017

  • Some minor fixes

With version 1.3.3 17-05-2017

  • With the introduction of the "node" keyword to store references to nodes, a significant speed increase of some 30% can be achieved. It does need adaptations to existing data_defs. The "values2" keyword is introduced as an alternative set of value_defs, leaving the original set for backward compatibility.

With version 1.3.2 27-11-2016

  • Fixed missing signals from the extract_datalist function on an empty result

With version 1.3.1 19-11-2016

  • Fixed the general warning rule being placed before warning rules added to DataTreeShell prior to DataTree initialization (see Known issues)

With version 1.3.0 9-11-2016

  • First non-beta release
  • Added functionality to show progress while running the extract_datalist function in a multi-threading environment
  • Added a flag to abort the extract_datalist function in a multi-threading environment

With version 1.2.5 15-10-2016

  • added a print_datatree function to DataTreeShell
  • Some cosmetic updates to the internal print functions

With version 1.2.4 30-9-2016

  • Implemented the "text_replace" keyword to search and replace in the HTML data before importing
  • Implemented the "unquote_html" keyword to correct ", < and > occurrences in HTML text by replacing them with the proper &quot;, &lt; and &gt;
  • Made it possible to read a partial HTML page resulting from an "HTTP incomplete read" by checking for and appending a missing </html> and/or </body> tag. If more than the tail part is missing, it will probably fail later on your search. (Any tag with a missing closing tag is assumed to be self-closing. This prevents HTML errors, except on the enclosing <html> tag. However, if a closing tag is inadvertently missing, it can change the tree hierarchy, making the search fail even when the data is present. So this will only work if all the other higher closing tags are in the download.)
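The tail-repair step described above can be sketched in plain Python (a conceptual sketch, not the module's actual code; the helper name `repair_tail` is illustrative):

```python
def repair_tail(html):
    """Append </body> and/or </html> if they are missing from the tail
    of a partially downloaded HTML page."""
    tail = html.rstrip().lower()
    if not tail.endswith("</html>"):
        if "</body>" not in tail:
            html += "</body>"
        html += "</html>"
    return html

print(repair_tail("<html><body><p>x</p>"))
# <html><body><p>x</p></body></html>
```

As the release note says, this only recovers a missing tail; if closing tags higher up the tree are missing, the hierarchy changes and a later search may still fail.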

With version 1.2.3 25-9-2016

  • Implemented the "url-relative-weekdays" keyword
  • Some bug fixes
  • Updates to the test module

With version 1.2.2 31-8-2016

  • Updates to the test module
  • Some code cleanup

With version 1.2.1 22-8-2016

  • Updates to the test module

With version 1.2.0 20-8-2016

  • Implemented a data_def test module

With version 1.1.4 23-7-2016

  • Implemented a stripped and extended Warnings framework
  • Added optional sorting before extraction of part of a JSON tree
  • Some fixes

With version 1.1.3 9-7-2016

  • More unified HTML and JSON parsing with the added keywords "notchildkeys" and "tags",
    the renamed keyword "childkeys" and extended functionality for some of the others.
    It also allows using a linked value in most cases.
  • Added the selection keyword "inclusive text" for HTML to include text in sub-tags like
    "i", "b" etc.
  • Added support for a tuple with multiple dtype values in the is_data_value function.
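The tuple-of-dtypes support mirrors Python's own `isinstance`, which natively accepts a tuple of types. A minimal sketch of the idea (the helper below is illustrative, not the actual `is_data_value` signature):

```python
def is_value_of_type(data, key, dtypes):
    """Return True if data[key] exists and its value matches any of the
    given dtypes (a single type or a tuple of types)."""
    return key in data and isinstance(data[key], dtypes)

print(is_value_of_type({"n": 5}, "n", (int, float)))    # True
print(is_value_of_type({"n": "5"}, "n", (int, float)))  # False
```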

With version 1.1.2 5-7-2016

  • A new warnings category for invalid data imports into a tree
  • A new search keyword "notattrs"

With version 1.1 28-6-2016 we have, next to some patches, added several new features:

  • Added support for 12 hour time values
  • Added the str-list type
  • Added a warnings framework
  • Added a DataTreeShell class with pre- and post-processing functionality.
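The 12-hour time support can be illustrated with the standard library (a sketch of the concept, not the module's internal parser):

```python
from datetime import datetime, time

# 12-hour time values carry an AM/PM marker: %I parses the 12-hour
# hour and %p the AM/PM part.
parsed = datetime.strptime("07:30 PM", "%I:%M %p").time()
print(parsed)  # 19:30:00
```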

It reads the page into a Node-based tree, from which you can, on the basis of a JSON
data file, extract your data into a list of items. For this a special data_def language
has been developed. It first extracts a list of keyNodes and then extracts for each
of them the same data list. During the extraction several data-manipulation
functions are available.
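The two-phase extraction described above can be sketched in plain Python (a conceptual sketch, not the DataTreeGrab API; all names below are illustrative):

```python
# Phase 1 walks a path to the list of key nodes; phase 2 extracts the
# same field list from every key node.
page = {
    "library": {
        "abstracts": [
            {"name": "News", "id": 1},
            {"name": "Sports", "id": 2},
        ]
    }
}

def extract_items(tree, node_path, fields):
    nodes = tree
    for step in node_path:        # phase 1: descend to the key nodes
        nodes = nodes[step]
    # phase 2: pull the same fields out of each key node
    return [[n.get(f) for f in fields] for n in nodes]

rows = extract_items(page, ["library", "abstracts"], ["name", "id"])
print(rows)  # [['News', 1], ['Sports', 2]]
```

In the real module the path and fields are not hard-coded but described in the JSON data_def, so the source layout can change without touching your code.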

Check the WIKI for the syntax. Here is a short, incomplete list of possible keywords:

path-dict keywords:

  • "path": "all", "root", "parent"
  • "key":
  • "keys":{"":{"link":1},"":""} (selection on child presence)
  • "tag":""
  • "attrs":{"":{"link":1},"":{"not":[]},"":"","":null}
  • "index":{"link":1}

selection-keywords:

  • "select": "key", "text", "tag", "index", "value"
  • "attr":""
  • "link":1 (create a link)
  • "link-index":1 (create a link)

link examples

[{"key":"abstract_key", "link":1},
 "root",{"key":"library"},"all",{"key":"abstracts"},
 {"keys":{"abstract_key":{"link":1}}},
 {"key":"name","default":""}]

[...,{"attr":"value", "ascii-replace":["ss","s", "[-!?(), ]"], "link":1}]

[...,{"tag":"img", "attrs":{"class": {"link":1}},"attr":"src"}]

selection-format keywords:

  • "lower","upper","capitalize"
  • "ascii-replace":["ss","s", "[-!?(), ]"]
  • "lstrip", "rstrip":"')"
  • "sub":["",""]
  • "split":[["/",-1],["\.",0,1]]
  • "multiplier", "divider":1000 (for timestamp)
  • "replace":{"tv":2, "radio":12}
  • "default":
  • "type":
    • "datetimestring","timestamp","time","timedelta","date","datestamp", "relative-weekday","string", "lower-ascii","int", "float","boolean","list",
  • "member-off"
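A few of these format keywords map onto common string operations. A plain-Python illustration (the exact data_def semantics shown in the comments are assumptions; see the WIKI for the authoritative definitions):

```python
import re

value = "Radio/City FM.mp3"

# "split":[["/",-1]] -- assumed to take the part after the last "/"
part = value.split("/")[-1]          # "City FM.mp3"

# "sub":["",""] -- assumed to be a regular-expression substitution
name = re.sub(r"\.mp3$", "", part)   # "City FM"

# "lower" -- lower-case formatting
print(name.lower())                  # city fm
```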