# Synthorus Workflows #

Synthorus has three main workflows:
1. make a model
2. make and run a simulator
3. make reports.

These are shown schematically in the following figure.

```{figure} figures/synthorus_high_level_workflows.svg
---
scale: 50%
align: center
name: synthorus_high_level_workflows
---
A schematic diagram showing a high-level view of Synthorus workflows.
```


## Making a Model ##

Making a model involves providing a model specification. A model specification describes:
- model meta-data
- privacy protection and other parameter values
- data sources for reference data
- the random variables and their states
- the cross-tables
- the entities, their fields, and relationships
- simulation parameters and values.

The steps to make a model are as follows.
1. Datasets named in the model specification are loaded from the specified data sources.
2. Clean cross-tables are computed.
3. Privacy protection is applied to create noisy cross-tables.
4. The noisy cross-tables are used to create a probabilistic graphical model (PGM) for each entity.
5. A simulator specification is formed which describes the model entities and how to generate records for entities.
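Steps 2 and 3 can be illustrated with a small standard-library sketch. It does not use Synthorus and makes several assumptions: the source does not say which privacy mechanism Synthorus applies, so Laplace noise is used here purely as a familiar example, and the names `clean_cross_table`, `noisy_cross_table`, `epsilon`, and the treatment of a zero sensitivity are all hypothetical.

```python
import random
from collections import Counter
from itertools import product

# Eight reference rows covering every combination of three binary
# random variables X, Y and Z.
ROWS = [dict(zip("XYZ", states)) for states in product("yn", repeat=3)]


def clean_cross_table(rows, rvs):
    """Count how often each combination of states occurs for the given rvs."""
    return Counter(tuple(row[rv] for rv in rvs) for row in rows)


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_cross_table(clean, epsilon, sensitivity, rng):
    """Add Laplace noise to each count (assumed mechanism, for illustration)."""
    if sensitivity == 0:
        # Assumption: zero sensitivity means the counts pass through unchanged.
        return dict(clean)
    scale = sensitivity / epsilon
    # Clamp at zero so every entry remains a usable non-negative weight.
    return {
        key: max(0.0, count + laplace_noise(scale, rng))
        for key, count in clean.items()
    }


rng = random.Random(0)
clean = clean_cross_table(ROWS, ["X", "Y"])
noisy = noisy_cross_table(clean, epsilon=1.0, sensitivity=1.0, rng=rng)
print(dict(clean))
print(noisy)
```

The clean table holds exact counts, while the noisy table holds perturbed, non-negative weights over the same state combinations. Only the noisy table would feed the later model-building steps.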

The result of making a model is a collection of files which are placed into a model definition folder. The files are:
1. a JSON definition of the model ('model_spec.json')
2. a JSON definition of an index of model components ('model_index.json')
3. a JSON definition of the synthetic data simulator ('simulator_spec.json')
4. Compiled Knowledge PGMs for each entity ('pgms/{_entity_}.py')
5. noisy cross-tables, if requested for saving ('noisy_cross_tables/{_cross_table_}.pkl')
6. clean cross-tables, if requested for saving ('clean_cross_tables/{_cross_table_}.pkl')

In the given filenames, {_entity_} and {_cross_table_} are the names of an entity and cross-table respectively, as referenced by the JSON files.
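As a quick sanity check on a model definition folder, the file layout above can be verified with `pathlib`. This is a minimal sketch, not part of Synthorus: the helper `missing_model_files` is hypothetical, and it only checks the files that are always expected (the cross-table files are optional).

```python
from pathlib import Path

# File names taken from the list above; the helper itself is hypothetical.
REQUIRED_FILES = ("model_spec.json", "model_index.json", "simulator_spec.json")


def missing_model_files(model_dir: Path, entities: list[str]) -> list[str]:
    """Return the relative paths of expected files that are absent from model_dir."""
    expected = list(REQUIRED_FILES)
    # One compiled PGM module is expected per entity.
    expected += [f"pgms/{entity}.py" for entity in entities]
    return [name for name in expected if not (model_dir / name).is_file()]
```

For example, `missing_model_files(Path('my_model'), ['_default_entity_'])` returns an empty list when the three JSON files and `pgms/_default_entity_.py` are all present.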


## Making and Running a Simulator ##

A simulator is a Python object of class `Simulator`. A Simulator object can be constructed from files in a model definition folder. Essentially, making the Simulator object involves:
1. loading a PGM for each entity
2. loading the simulator specification file
3. constructing and configuring a Simulator object.

When a simulator is run, it needs to know what to do with the generated records. This is described by a SimRecorder object. `SimRecorder` is an abstract base class with several concrete implementations provided. A simple SimRecorder can be created by constructing a `DebugRecorder`. A DebugRecorder writes records to stdout (or to a file) as they are generated.

A simulator is run by calling the `run` method of a Simulator object. The `run` method takes a SimRecorder object and, optionally, the number of iterations to run the simulation for.
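The division of labour between a simulator and a recorder can be sketched with toy classes. Everything below is a hypothetical stand-in written for illustration: `ToyRecorder`, `PrintRecorder`, `ListRecorder`, and `ToySimulator` are not Synthorus classes, and the real `Simulator` and `SimRecorder` signatures may differ.

```python
import random
from abc import ABC, abstractmethod


class ToyRecorder(ABC):
    """Hypothetical stand-in for the recorder role; not the Synthorus API."""

    @abstractmethod
    def write(self, entity: str, record: dict) -> None:
        """Receive one generated record for the named entity."""


class PrintRecorder(ToyRecorder):
    """Print each record as it is generated (similar in spirit to a debug recorder)."""

    def write(self, entity: str, record: dict) -> None:
        print(entity, record)


class ListRecorder(ToyRecorder):
    """Collect records in memory for later inspection."""

    def __init__(self) -> None:
        self.records: list[tuple[str, dict]] = []

    def write(self, entity: str, record: dict) -> None:
        self.records.append((entity, record))


class ToySimulator:
    """Generate random records and hand each one to a recorder."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)

    def run(self, recorder: ToyRecorder, iterations: int = 10) -> None:
        for _ in range(iterations):
            # A real simulator would sample from a PGM; here states are uniform.
            record = {rv: self._rng.choice("yn") for rv in "XYZ"}
            recorder.write("_default_entity_", record)


ToySimulator().run(PrintRecorder(), iterations=3)
```

Because the simulator only depends on the abstract recorder interface, swapping `PrintRecorder` for `ListRecorder` changes where records go without touching the generation logic.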


## Making Reports ##

Reports produced by Synthorus are:
1. detailed cross-table report ('crosstabs.csv')
2. system specification report ('report_on_model_spec.txt')
3. privacy report ('report_on_privacy.txt')
4. utility report ('report_on_utility.txt')
5. detailed utility results ('utility_results.csv')

The report filenames are shown in parentheses. These files are placed in a `reports` directory, which is a subdirectory of the model definition folder.

Most of these reports (1, 2, and 3) are computationally cheap and are created as a by-product of making a model. The other two reports (4 and 5) are potentially computationally expensive, so they are designed to be run using a separate script.



## Spec File ##

Generating synthetic data with Synthorus involves creating a `Simulator` object and a probabilistic graphical model (PGM) for each data entity of the simulator. These can be constructed directly in code (which will be demonstrated in the next notebook). However, normally these objects will be defined and serialised to a "model definition folder", as shown in the workflows schematic.

There are multiple ways to construct the files in a model definition folder. Internally, Synthorus uses a `ModelSpec` object. The `ModelSpec` class is a Pydantic `BaseModel` and so can be easily serialized to, and deserialized from, JSON-formatted text.

It is possible to manually construct a `ModelSpec` object or its equivalent JSON. However, this can be tedious for all but trivial Synthorus systems. A more human-friendly way to specify a Synthorus system is to use a "spec file".

A spec file is Python code that defines a spec file dictionary. This is a very flexible format as any valid Python code is permitted. This means that loading a spec file has full access to your Python environment. Thus sharing or using a spec file should be treated in the same way as sharing Python code.
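The following standard-library sketch shows why this matters. It is not how Synthorus loads spec files (that mechanism is not described here). It simply executes a Python file with `runpy.run_path` and reads a module-level `spec` dictionary. The helper `load_spec_dict`, the file name `toy_spec.py`, and the use of plain string keys are all assumptions for illustration.

```python
import runpy
import tempfile
from pathlib import Path

# A toy spec file. Any statement in it runs when the file is loaded.
SPEC_SOURCE = '''
import os                      # a spec file may import anything
LOADED_BY_PID = os.getpid()    # and run arbitrary code at load time
spec = {"states": "infer_distinct"}
'''


def load_spec_dict(path: Path) -> dict:
    """Execute a Python file and return its module-level `spec` dictionary."""
    namespace = runpy.run_path(str(path))
    return namespace["spec"]


with tempfile.TemporaryDirectory() as tmp:
    spec_path = Path(tmp) / "toy_spec.py"
    spec_path.write_text(SPEC_SOURCE)
    print(load_spec_dict(spec_path))  # prints {'states': 'infer_distinct'}
```

The dictionary is only available after the whole file has executed, so any side effects in the file have already happened by the time the spec is read.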

A safe way to share a spec file is to convert it to a `ModelSpec` object and share the serialized JSON.



Normally you would load a spec file from the file system using the function `load_spec_file`. This returns a `ModelSpec` object. Internally, `load_spec_file` reads the spec file dictionary, then calls `interpret_spec_file` to get the `ModelSpec` object.

The following example reads the spec file `spec_tiny.py` in `synthorus_demos.demo_files.spec_files` and uses `load_spec_file` to create a `ModelSpec`.

```python
from synthorus_demos.utils.file_helper import cat
from synthorus_demos.demo_files import SPEC_FILES

spec_file = SPEC_FILES / 'spec_tiny.py'

cat(spec_file)
```

```python
"""
This is an example minimal Synthorus spec file.
"""
from synthorus.spec_file.keys import *

spec = {
    # default for all random variables
    states: infer_distinct,

    datasources: {
        'xyz': {
            data_format: csv,
            sensitivity: 0,
            inline: """
                X,Y,Z
                y,y,y
                y,y,n
                y,n,y
                y,n,n
                n,y,y
                n,y,n
                n,n,y
                n,n,n
                """
        }
    }
}
```


The following code loads the spec file, converts it to a `ModelSpec`, then prints the JSON.


```python
from synthorus.spec_file.interpret_spec_file import load_spec_file
from synthorus.model.model_spec import ModelSpec

model_spec: ModelSpec = load_spec_file(spec_file)

# Render the JSON to stdout.
print(model_spec.model_dump_json(indent=2))
```

```
{
  "name": "spec_tiny",
  "author": "_unknown_",
  "comment": "This is an example minimal Synthorus spec file.",
  "roots": [],
  "rng_n": 4,
  "datasources": {
    "xyz": {
      "sensitivity": 0.0,
      "rvs": [
        "X",
        "Y",
        "Z"
      ],
      "dataset_spec": {
        "type": "csv",
        "weight": null,
        "rv_map": null,
        "rv_define": {},
        "input": {
          "type": "inline",
          "inline": "X,Y,Z\ny,y,y\ny,y,n\ny,n,y\ny,n,n\nn,y,y\nn,y,n\nn,n,y\nn,n,n\n"
        },
        "sep": ",",
        "header": true,
        "skip_blank_lines": true,
        "skip_initial_space": true
      },
      "non_distribution_rvs": []
    }
  },
  "rvs": {
    "Y": {
      "states": "infer_distinct",
      "ensure_none": false
    },
    "X": {
      "states": "infer_distinct",
      "ensure_none": false
    },
    "Z": {
      "states": "infer_distinct",
      "ensure_none": false
    }
  },
  "crosstabs": {
    "_Y": {
      "rvs": [
        "Y"
```

This spec file defines one datasource, "xyz", that includes three random variables: "X", "Y", and "Z".

System random variables can be explicitly defined in a spec file using an "rvs" section. The demo `spec_tiny.py` does not include this section, so one is created internally, with a definition for every random variable seen in any datasource. Each random variable definition needs to define the possible states of the random variable. Including `states: infer_distinct` at the top level of the spec file dictionary means that `states: infer_distinct` is inherited by every random variable definition. The value `infer_distinct` means that the possible states of a random variable are defined as "all distinct values seen for that random variable in the datasources".
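The idea behind `infer_distinct` can be sketched with the standard `csv` module. This is an illustration only, not the Synthorus implementation: the function `infer_distinct_states` is hypothetical, and the ordering of states (first appearance) is an assumption.

```python
import csv
import io
from textwrap import dedent

# The same inline data as the "xyz" datasource in spec_tiny.py.
INLINE = dedent("""\
    X,Y,Z
    y,y,y
    y,y,n
    y,n,y
    y,n,n
    n,y,y
    n,y,n
    n,n,y
    n,n,n
    """)


def infer_distinct_states(csv_text: str) -> dict[str, list[str]]:
    """Map each column name to its distinct values, in order of first appearance."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    states: dict[str, list[str]] = {}
    for row in reader:
        for rv, value in row.items():
            seen = states.setdefault(rv, [])
            if value not in seen:
                seen.append(value)
    return states


print(infer_distinct_states(INLINE))
```

For this datasource each of X, Y, and Z ends up with the two states `y` and `n`, since those are the only values that appear in any column.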



## Demo Output Directory ##

The following examples need to write files to a model definition directory. We use a managed demo output directory for this. The following code snippet gives access to a managed demo output directory.
```
from synthorus_demos.utils.output_directory import output_directory

with output_directory('my_directory') as my_dir:
    ...
```

Code within the `with` context will have access to a directory named "my_directory", available as a `Path` object `my_dir`. The directory "my_directory" will be created empty as a subdirectory of the _demo_out_ directory. The _demo_out_ directory is configured using local configuration variable "DEMO_OUT", or if "DEMO_OUT" is not defined, a temporary directory is created using `tempfile`.
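The behaviour described above can be approximated with a short context manager. This is a sketch under stated assumptions, not the actual helper: `toy_output_directory` is a hypothetical name, and it assumes "DEMO_OUT" is read from the process environment, whereas the real helper may read it from a different local configuration source.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def toy_output_directory(name: str):
    """Yield an empty directory called `name` under a base output directory."""
    # Assumption: DEMO_OUT comes from the environment. Fall back to a fresh
    # temporary directory when it is not set.
    base = os.environ.get("DEMO_OUT")
    base_dir = Path(base) if base else Path(tempfile.mkdtemp())
    target = base_dir / name
    if target.exists():
        shutil.rmtree(target)  # start from an empty directory each time
    target.mkdir(parents=True)
    yield target


with toy_output_directory("my_directory") as my_dir:
    print(my_dir.name, "is empty:", not any(my_dir.iterdir()))
```

Recreating the directory on entry keeps repeated demo runs from seeing stale files left by an earlier run.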

IMPORTANT: `output_directory` is only needed to run these demonstrations. It is not needed for normal usage of Synthorus.

```python
from synthorus_demos.utils.output_directory import output_directory

with output_directory('my_directory') as my_dir:
    print(my_dir)
```

```
C:\Users\barry\Development\synthorus\out\my_directory
```


Synthorus provides a function `make_model_definition_files` for taking a model spec and creating model definition files.


```python
from synthorus.utils.print_function import NO_LOG
from synthorus_demos.utils.file_helper import print_file_tree
from synthorus.workflows.make_model_definition_files import make_model_definition_files

with output_directory('my_model_definition_files') as model_definition_dir:

    make_model_definition_files(model_spec, model_definition_dir, overwrite=True, log=NO_LOG)

    # Show what files got created
    print('-------------------------------------------')
    print_file_tree(model_definition_dir)
    print('-------------------------------------------')
```


```
-------------------------------------------
my_model_definition_files/
  clean_cross_tables/
    _X.pkl
    _Y.pkl
    _Z.pkl
  model_index.json
  model_spec.json
  noisy_cross_tables/
    _X.pkl
    _Y.pkl
    _Z.pkl
  pgms/
    _default_entity_.py
  reports/
    crosstabs.csv
    report_on_model_spec.txt
    report_on_privacy.txt
  simulator_spec.json
-------------------------------------------
```
