Diffport consists of the following units:
- Command line interface. The code for this is in the file cli.py.
- Core module, which reads the config and delegates tasks to watchers. This is in the file core.py.
- Store, an abstraction over the area where diffport saves snapshots. It is defined in store.py. Adding a new store means adding another class inheriting from the Store abstract class. As an example, see the class StoreDirectory, which keeps snapshots in a directory.
- DB connection. A few functions related to the database connection are in connection.py.
- Watchers. Actual watchers are defined in watchers.py along with their report templates in templates.py. We will dissect watchers in more detail later.
A watcher is defined by a group of functions collected (as static methods) in a class. We don't maintain any state in a watcher; the class structure exists only to modularize the functionality. The structure of these methods is enforced by the abstract class Watcher (diffport.watchers.Watcher).
After reading the main config.yaml file, the core module of diffport invokes each involved watcher to take a snapshot, providing a db object (which is a dataset instance) and that watcher's config as read from the yaml file.
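The interface described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of watchers.py: the method names take_snapshot, diff, and report come from this document, while the exact signatures and the toy ConstantWatcher class are assumptions made for the example.

```python
from abc import ABC, abstractmethod


class Watcher(ABC):
    """Hypothetical sketch of the watcher interface; signatures are assumed."""

    @staticmethod
    @abstractmethod
    def take_snapshot(db, config):
        """Return a serializable dict describing the current database state."""

    @staticmethod
    @abstractmethod
    def diff(old, new):
        """Return a diff object computed from two snapshots of this watcher."""

    @staticmethod
    @abstractmethod
    def report(diff):
        """Return a string report for a diff produced by this watcher."""


class ConstantWatcher(Watcher):
    """Toy watcher (not part of diffport) showing how the methods fit together."""

    @staticmethod
    def take_snapshot(db, config):
        # A real watcher would query `db`; this one only echoes its config.
        return {"config": config, "data": list(config)}

    @staticmethod
    def diff(old, new):
        return {"config": new["config"], "changed": old["data"] != new["data"]}

    @staticmethod
    def report(diff):
        return "changed" if diff["changed"] else "unchanged"


snapshot = ConstantWatcher.take_snapshot(None, ["raw_tables"])
print(ConstantWatcher.report(ConstantWatcher.diff(snapshot, snapshot)))
```

Because the methods are static, no watcher instance is needed to call them. With the abc module, the abstract-method check only fires when a class is instantiated, so in this sketch the base class serves mostly as documentation of the required methods.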
Any new watcher needs to implement a new class with internally consistent methods (meaning that its diff method accepts the output from its own take_snapshot method). In what follows, we describe the general structure of these methods using the example of a simple watcher, SchemaTables, with the following config passed in:
# Input config to SchemaTables
config = ["raw_tables", "processed_tables"]
Snapshot output from a watcher is expected to be a serializable dictionary object. Although not required, it helps to include the config used for taking the snapshot so that the diffing function can run quick checks or use some metadata from it. As an example of a snapshot returned from a watcher, here is the output from our SchemaTables example:
# Output snapshot
{
    "config": ["raw_tables", "processed_tables"],
    "data": [("raw_tables", ["table_one_raw", "table_two_raw"]),
             ("processed_tables", ["the_only_processed_table"])]
}
Diffport core will now save this snapshot in its store along with the snapshots collected from other watchers.
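One practical consequence of requiring snapshots to be serializable is that tuples may not survive a round trip through the store. The sketch below assumes JSON as the serialization format, purely for illustration; this document does not say which format the store actually uses.

```python
import json

snapshot = {
    "config": ["raw_tables", "processed_tables"],
    "data": [("raw_tables", ["table_one_raw", "table_two_raw"]),
             ("processed_tables", ["the_only_processed_table"])],
}

# Assumed format: JSON. Tuples are written as JSON arrays and read back as lists.
restored = json.loads(json.dumps(snapshot))

print(type(snapshot["data"][0]).__name__)  # tuple
print(type(restored["data"][0]).__name__)  # list
```

Under this assumption, a diff method should unpack the (schema, tables) pairs rather than rely on them being tuples, since a snapshot loaded from the store may contain lists instead.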
The diff method of a watcher takes in two snapshots generated by its own take_snapshot method and returns an object representing the diff between those snapshots. As an example, consider that our SchemaTables watcher saved the following two snapshots at some points in time:
# Snapshot old
old = {
    "config": ["raw_tables", "processed_tables"],
    "data": [("raw_tables", ["table_one_raw", "table_two_raw"]),
             ("processed_tables", ["the_only_processed_table"])]
}
# Snapshot new
new = {
    "config": ["raw_tables", "processed_tables"],
    "data": [("raw_tables", ["table_one_raw", "table_two_raw", "table_three_raw"]),
             ("processed_tables", [])]
}
After finding the difference, the diff method might return a diff object like so (the current implementation does return this):
# Diff output
{
    "config": ["raw_tables", "processed_tables"],
    "data": [["raw_tables", {"removed": [], "added": ["table_three_raw"]}],
             ["processed_tables", {"removed": ["the_only_processed_table"], "added": []}]]
}
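For illustration, a diff function producing this output from the two snapshots could look like the sketch below. It is not the actual SchemaTables code: the function name diff_schema_tables and the set-based approach are assumptions, chosen only to reproduce the shape of the diff object shown above.

```python
def diff_schema_tables(old, new):
    """Hypothetical diff: per schema, list the tables removed and added."""
    # Unpack the (schema, tables) pairs; works for tuples and lists alike.
    old_tables = {schema: set(tables) for schema, tables in old["data"]}
    new_tables = {schema: set(tables) for schema, tables in new["data"]}

    data = []
    for schema in new["config"]:
        before = old_tables.get(schema, set())
        after = new_tables.get(schema, set())
        data.append([schema, {
            "removed": sorted(before - after),
            "added": sorted(after - before),
        }])
    return {"config": new["config"], "data": data}


old = {
    "config": ["raw_tables", "processed_tables"],
    "data": [("raw_tables", ["table_one_raw", "table_two_raw"]),
             ("processed_tables", ["the_only_processed_table"])],
}
new = {
    "config": ["raw_tables", "processed_tables"],
    "data": [("raw_tables", ["table_one_raw", "table_two_raw", "table_three_raw"]),
             ("processed_tables", [])],
}

print(diff_schema_tables(old, new))
```

The sketch iterates over the schemas in the new snapshot's config, so a schema present only in the old config would be ignored. Whether the real watcher handles a changed config differently is not covered here.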
Notice that we also pass along the config. This is not required for this watcher, but some watchers (like NumberOfRows) use some information from the config to generate the final report.
After a diff is calculated, the report function generates a string report for the diff. The reports from all the enabled watchers are concatenated and returned as the final report by diffport. To generate their own chunk of the report, watchers rely on jinja2 templates present in templates.py. The expected format of a template is markdown, since it is easy to maintain and can be converted to other formats easily using tools like pandoc.
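As a rough sketch of this step, the snippet below renders a diff object shaped like the SchemaTables example into markdown with a jinja2 template. The template text, the SCHEMA_TABLES_TEMPLATE name, and the report signature are all hypothetical; the real templates live in templates.py and may look quite different.

```python
from jinja2 import Template  # third-party dependency: pip install jinja2

# Hypothetical markdown template, written for this example only.
SCHEMA_TABLES_TEMPLATE = """\
### Schema tables
{% for schema, changes in data %}

#### {{ schema }}
{% for table in changes["added"] %}
- added: {{ table }}
{% endfor %}
{% for table in changes["removed"] %}
- removed: {{ table }}
{% endfor %}
{% endfor %}
"""


def report(diff):
    """Render a SchemaTables-style diff object as a markdown string."""
    # trim_blocks/lstrip_blocks keep the block tags from leaving blank lines.
    template = Template(SCHEMA_TABLES_TEMPLATE, trim_blocks=True, lstrip_blocks=True)
    return template.render(data=diff["data"])


diff_output = {
    "config": ["raw_tables", "processed_tables"],
    "data": [["raw_tables", {"removed": [], "added": ["table_three_raw"]}],
             ["processed_tables", {"removed": ["the_only_processed_table"], "added": []}]],
}

print(report(diff_output))
```

Keeping the template as plain markdown means the concatenated report can be passed to a converter such as pandoc without further processing.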