Development

Diffport involves the following units:

  1. Command line interface. The code for this is in file cli.py.
  2. Core module which reads config and delegates tasks to watchers. This is in the file core.py.
  3. Store is an abstraction over the area where diffport saves snapshots. It is defined in store.py. Adding a new store means adding another class that inherits from the Store abstract class. As an example, see the class StoreDirectory, which keeps snapshots in a directory.
  4. DB connection. A few functions related to database connection are in connection.py.
  5. Watchers. The actual watchers are defined in watchers.py, along with their report templates in templates.py. We dissect watchers in more detail below.
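
To illustrate the store abstraction from item 3, here is a minimal sketch of a directory-backed store. The names StoreSketch, JsonDirectoryStore, save and load, and the choice of JSON files, are all assumptions made for illustration; the real interface is whatever the Store abstract class in store.py declares.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class StoreSketch(ABC):
    """Hypothetical stand-in for diffport's Store abstract class."""

    @abstractmethod
    def save(self, name: str, snapshot: dict) -> None:
        """Persist a snapshot under the given name."""

    @abstractmethod
    def load(self, name: str) -> dict:
        """Return the snapshot previously saved under the given name."""


class JsonDirectoryStore(StoreSketch):
    """Keeps each snapshot as one JSON file inside a directory (assumed layout)."""

    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def save(self, name: str, snapshot: dict) -> None:
        (self.directory / f"{name}.json").write_text(json.dumps(snapshot))

    def load(self, name: str) -> dict:
        return json.loads((self.directory / f"{name}.json").read_text())


# Round-trip one snapshot through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    store = JsonDirectoryStore(tmp)
    store.save("first", {"config": ["raw_tables"], "data": []})
    loaded = store.load("first")
```

A new kind of store would follow the same pattern: subclass the abstract class and implement whatever methods it requires.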

Watchers

A watcher is defined by a set of functions grouped together (as static methods) in a class. We don't maintain any state in a watcher; the class structure exists only to modularize the functionality. The abstract class Watcher enforces the following structure for these methods:

diffport.watchers.Watcher
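
As a rough sketch only, assuming the abstract class declares the three methods discussed in the sections below, it might look like this. The name WatcherSketch and the argument names are assumptions; the real signatures live in watchers.py and may differ.

```python
from abc import ABC, abstractmethod
from typing import Any


class WatcherSketch(ABC):
    """Hypothetical outline of an abstract watcher with three static methods."""

    @staticmethod
    @abstractmethod
    def take_snapshot(db: Any, config: Any) -> dict:
        """Return a serializable snapshot for the given config."""

    @staticmethod
    @abstractmethod
    def diff(old: dict, new: dict) -> Any:
        """Return an object describing what changed between two snapshots."""

    @staticmethod
    @abstractmethod
    def report(diff: Any) -> str:
        """Render the diff object as a string report."""


class Incomplete(WatcherSketch):
    """Implements only one of the three required methods."""

    @staticmethod
    def take_snapshot(db, config):
        return {"config": config, "data": []}


# abc checks abstract methods at instantiation time, so this raises TypeError.
try:
    Incomplete()
    instantiation_failed = False
except TypeError:
    instantiation_failed = True
```

Note that Python's abc machinery only complains when the class is instantiated. Static methods called directly on the class, such as Incomplete.take_snapshot, still work even if other abstract methods are missing.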

After reading the main config.yaml file, the core module of diffport invokes each involved watcher to take a snapshot, passing it a db object (which is a dataset instance) and that watcher's config as read from the yaml file.

Any new watcher needs to implement a new class with internally consistent methods (meaning that its diff method accepts the output from its own take_snapshot method). In what follows, we describe the general structure of these methods using the example of a simple watcher SchemaTables with the following config passed in:

# Input config to SchemaTables
config = ["raw_tables", "processed_tables"]

take_snapshot

The snapshot output from a watcher is expected to be a serializable dictionary object. Although not required, it is useful to include the config used for taking the snapshot so that the diffing function can run quick checks or use some metadata from it. As an example of a snapshot returned by a watcher, here is the output from our SchemaTables example:

# Output snapshot
{
  "config": ["raw_tables", "processed_tables"],
  "data": [("raw_tables", ["table_one_raw", "table_two_raw"]),
           ("processed_tables", ["the_only_processed_table"])]
}

Diffport core will now save this snapshot in its store, along with the snapshots collected from other watchers.
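
A take_snapshot method producing the snapshot above could be sketched as follows. This is not the actual SchemaTables code: the helper list_tables is hypothetical, and a plain dict stands in for the real db object so that the example runs without a database.

```python
from typing import Dict, List


def list_tables(db: Dict[str, List[str]], schema: str) -> List[str]:
    """Hypothetical helper: return the sorted table names in a schema.

    Here `db` is a plain dict standing in for the real database handle.
    """
    return sorted(db.get(schema, []))


class SchemaTablesSketch:
    """Illustrative watcher that records which tables each schema contains."""

    @staticmethod
    def take_snapshot(db, config):
        # Keep the config next to the data so that diff can consult it later.
        return {
            "config": config,
            "data": [(schema, list_tables(db, schema)) for schema in config],
        }


fake_db = {
    "raw_tables": ["table_two_raw", "table_one_raw"],
    "processed_tables": ["the_only_processed_table"],
}
snapshot = SchemaTablesSketch.take_snapshot(
    fake_db, ["raw_tables", "processed_tables"]
)
```

If a store serializes snapshots as JSON, the tuples in "data" come back as lists when the snapshot is loaded again, so a diff method should not depend on the difference.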

diff

The diff method of a watcher takes in two snapshots generated by its own take_snapshot method and returns an object representing the diff in those snapshots. As an example, consider that our SchemaTables watcher saved the following two snapshots at some points in time:

# Snapshot old
old = {
  "config": ["raw_tables", "processed_tables"],
  "data": [("raw_tables", ["table_one_raw", "table_two_raw"]),
           ("processed_tables", ["the_only_processed_table"])]
}

# Snapshot new
new = {
  "config": ["raw_tables", "processed_tables"],
  "data": [("raw_tables", ["table_one_raw", "table_two_raw", "table_three_raw"]),
           ("processed_tables", [])]
}

After finding the difference, the diff method might return a diff object like so (the current implementation actually does return this):

# Diff output
{
  "config": ["raw_tables", "processed_tables"],
  "data": [["raw_tables", { "removed": [], "added": ["table_three_raw"] }],
           ["processed_tables", { "removed": ["the_only_processed_table"], "added": [] }]]
}

Notice that we also pass along the config. This is not required for this watcher, but some watchers (like NumberOfRows) use some information from config to generate the final report.
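
The diff shown above can be reproduced with a few set operations. The function below is a minimal sketch under the assumption that snapshots have the "config" and "data" layout used in this section; the name diff_snapshots is hypothetical and the actual SchemaTables implementation may be written differently.

```python
def diff_snapshots(old, new):
    """Compare two SchemaTables-style snapshots schema by schema."""
    old_tables = {schema: set(tables) for schema, tables in old["data"]}
    data = []
    for schema, tables in new["data"]:
        before = old_tables.get(schema, set())
        after = set(tables)
        data.append([
            schema,
            {
                "removed": sorted(before - after),  # in old but not in new
                "added": sorted(after - before),    # in new but not in old
            },
        ])
    # Pass the config along so that report can use it if needed.
    return {"config": new["config"], "data": data}


old = {
    "config": ["raw_tables", "processed_tables"],
    "data": [("raw_tables", ["table_one_raw", "table_two_raw"]),
             ("processed_tables", ["the_only_processed_table"])],
}
new = {
    "config": ["raw_tables", "processed_tables"],
    "data": [("raw_tables", ["table_one_raw", "table_two_raw", "table_three_raw"]),
             ("processed_tables", [])],
}
result = diff_snapshots(old, new)
```

This sketch only walks the schemas present in the new snapshot, which is enough when both snapshots were taken with the same config.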

report

After a diff is calculated, the report function generates a string report for the diff. The reports from all the enabled watchers are concatenated and returned as the final report by diffport. To generate their own chunk of the report, watchers rely on jinja2 templates present in templates.py. The expected format of a template is markdown, since it's easy to maintain and can be converted to other formats with tools like pandoc.
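
As an illustration of the reporting step, the sketch below renders the SchemaTables diff from the previous section with a small jinja2 template that emits markdown. The template text and the names REPORT_TEMPLATE and render_report are assumptions made for this example; the real templates are the ones in templates.py.

```python
from jinja2 import Template

# Hypothetical markdown template: one sub-heading per schema, one bullet per change.
REPORT_TEMPLATE = Template(
    "## Schema tables\n"
    "{% for schema, changes in data %}"
    "\n### {{ schema }}\n"
    "{% for table in changes['added'] %}- added `{{ table }}`\n{% endfor %}"
    "{% for table in changes['removed'] %}- removed `{{ table }}`\n{% endfor %}"
    "{% endfor %}"
)


def render_report(diff):
    """Render a SchemaTables-style diff object as a markdown string."""
    return REPORT_TEMPLATE.render(data=diff["data"])


diff = {
    "config": ["raw_tables", "processed_tables"],
    "data": [
        ["raw_tables", {"removed": [], "added": ["table_three_raw"]}],
        ["processed_tables", {"removed": ["the_only_processed_table"], "added": []}],
    ],
}
report = render_report(diff)
```

Because every watcher returns a plain markdown string, the core only needs to join the chunks to produce the final report.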