# Spec File #

A spec file is a Python file with a single dictionary defined in it. As a spec file is a Python file, comments start with a hash, #.

The module `synthorus.spec_file.keys` defines all the reserved words as strings. This can be imported in a spec file to make
it easier to read, e.g.,
```
from synthorus.spec_file.keys import *
```
You can find many example spec files in the package, `synthorus_demos.demo_files.spec_files`.

When Synthorus interprets a nested dictionary in a spec file, a missing key-value pair will be inherited from outer dictionaries, if available.
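
The inheritance rule can be pictured with the standard library's `collections.ChainMap`, which looks up a key in an inner dictionary first and falls back to an outer one. This is only a sketch of the idea, not how Synthorus implements it. The datasource names and file locations are hypothetical, and placing `sensitivity` at the top level is an assumption made for illustration.

```python
from collections import ChainMap

# Hypothetical spec: 'sensitivity' is set once on the outer dictionary.
spec = {
    'sensitivity': 1,
    'datasources': {
        'people': {'location': 'people.csv'},                # no sensitivity
        'pets': {'location': 'pets.csv', 'sensitivity': 2},  # overrides outer
    },
}


def effective(inner, outer):
    """View of `inner` where missing keys fall back to `outer`."""
    return ChainMap(inner, outer)


people = effective(spec['datasources']['people'], spec)
pets = effective(spec['datasources']['pets'], spec)

print(people['sensitivity'])  # 1, inherited from the outer dictionary
print(pets['sensitivity'])    # 2, defined on the datasource itself
```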


A spec file dictionary should comply with the format below. The format is described using a modified BNF. Specifically,

* upper case is used to denote a grammar component; lower case denotes a string literal unless otherwise indicated

* the pipe symbol, `|`, separates alternative formats.

* `XXXX := {` ... `}` is used to indicate the value is a dictionary.

* `XXXX := [` ... `]` is used to indicate the value is a list (or tuple or set).


```
SPEC := {
    name:    STRING,       # model name, default is the loaded module file stem.
    comment: STRING,       # model comment, default is the loaded module doc string.
    author:  STRING,       # model author, default is the loaded module __author__ value.

    roots:                        # optional, where to find files; default is the current working directory
        STRING |                  # a directory path
        [ STRING, ...]            # a list of directory paths

    rng_n: POSITIVE_INTEGER,      # random number generator security level (for Differential Privacy)
                                  #     4 is equivalent to AES128,
                                  #     5 is equivalent to AES192,
                                  #     6 is equivalent to AES256.

    datasources: {
        DATASOURCE_ID: DATASOURCE_SPEC,
        ...
    },

    rvs: {                       # if omitted, all rvs mentioned in any datasource are used, each with an empty spec
        RV_ID: RV_SPEC,
        ...                      # optional additional entries
    },

    crosstabs:                   # if omitted, then empty
        CROSSTABS_LIST |         # list, tuple or set of crosstab specs
        CROSSTABS_DICT,          # dict mapping crosstab ids to crosstab specs

    parameters: {                # optional simulation parameters
        FIELD_ID: STATE,
        ...                      # optional additional entries
    },

    entities: {                  # optional simulation entities, default is a single entity with all rvs.
        ENTITY_ID: ENTITY_SPEC,
        ...                      # optional additional entries
    },
}

DATASOURCE_SPEC :=
    TEXT_DATASOURCE |                   # Text based datasource, like CSV
    BINARY_DATASOURCE |                 # A binary based datasource, like Parquet
    DBMS_DATASOURCE |                   # A database with an ODBC or psycopg driver
    FUNCTION_DATASOURCE                 # Mathematically defined dataset

TEXT_DATASOURCE := {                     # A text datasource can be inline data or a text file
    sensitivity: NON_NEG_NUMBER,         # Differential Privacy parameter
    weight: None    |                    # no weight column provided (i.e., every row has weight 1)
            INTEGER |                    # index of weight column (just like Python array index)
            STRING,                      # name of weight column.
    rvs: None |                          # use the rv names as per the data file header line
         RV_MAP |                        # map rv ids to columns
         RV_LIST,                        # rvs ids in column order (rv id of None or '' means remove column)
    define: None | DEFINE_COLUMNS_SPEC,  # mathematically define additional columns.
    condition: None | RV_LIST,           # rvs that should not be considered as providing a distribution

    location: STRING | None,             # A string file path to data, or None for 'inline' data
    inline: STRING | None,               # inline data, or None for file data at a 'location'
    data_format: csv |                   # comma separated text file
                 tsv |                   # tab separated text file
                 table_builder |         # ABS TableBuilder CSV format
                 None,                   # infer data_format from 'location' file extension
    sep: STRING,                         # explicit separator override
    header: BOOLEAN,                     # is the first line a header line (default is True)
    skip_blank_lines: BOOLEAN,           # skip blank lines (default is True)
}

BINARY_DATASOURCE := {                   # A binary datasource cannot be inline data.
    sensitivity: NON_NEG_NUMBER,         # Differential Privacy parameter
    weight: None    |                    # no weight column provided (i.e., every row has weight 1)
            INTEGER |                    # index of weight column (just like Python array index)
            STRING,                      # name of weight column.
    rvs: None   |                        # use the rv names as per the data file header line
         RV_MAP |                        # map rv ids to columns
         RV_LIST,                        # rvs ids in column order (rv id of None or '' means remove column)
    define: None | DEFINE_COLUMNS_SPEC,  # mathematically define additional columns
    condition: None | RV_LIST,           # rvs that should not be considered as providing a distribution

    data_format: pickle  |               # pickled Pandas dataframe
                 parquet |               # Parquet file
                 feather,                # Feather file

    location: STRING | None              # A string file path, default is the datasource name with appropriate extension
}

FUNCTION_DATASOURCE := {                 # (implies: sensitivity = 0 and condition = input, unless specified directly)
    data_format: function | None,        # optional as the 'function' key gives it away.
    function: STRING,                    # a Python expression using input rvs
    input: {
        RV_ID: STATES,                   # list of states or number of states
        ...
    },
    output: RV_ID | None                 # optional, default is the datasource name.
}

DBMS_DATASOURCE := {
    data_format: odbc | postgres,
    sensitivity: NON_NEG_NUMBER,         # Differential Privacy parameter
    condition: None | RV_LIST,           # rvs that should not be considered as providing a distribution

    table: STRING,                       # name of table in the database
    schema: STRING | None,               # optional schema where to find the table, default taken from config.DB_SCHEMA
    rvs: RV_LIST | None,                 # optional restriction of the columns to query, default is all table columns

    connection: None | {                 # optional connection dictionary
        # these are database connection parameters
        # a value of None means look up the parameter in config using DB_{PARAMETER}
        CONN_PARAM: STRING | INTEGER | None,
        ...
    }
}

DEFINE_COLUMNS_SPEC := {
    RV_ID:                                 # a new column added with this name.
           COLUMN_FUNCTION_SPEC |          # values are a function of other columns
           COLUMN_GROUP_SPEC,              # values are a grouping of another column
    ...
}

COLUMN_FUNCTION_SPEC := {
    function: STRING,                    # a Python expression using input columns
    input: RV_LIST,                      # column names to use for input, after any remapping
    delete_input: BOOLEAN                # delete the input columns (default is False)
}

COLUMN_GROUP_SPEC := {
    grouping:
        group_cut |                      # create groups from a single column using Pandas 'cut'.
        group_qcut |                     # create groups from a single column using Pandas 'qcut'.
        group_normalise,                 # group values just as categories (multiple input columns permitted)
    input: RV_LIST,                      # the source columns to group
    size: POSITIVE_INTEGER,              # how many groups
    delete_input: BOOLEAN                # delete the input column (default is False)
}

RV_MAP := {
    RV_ID: STRING | INTEGER,             # map rv id to column (name or index)
    ...
}

RV_SPEC := {
    states: STATES_SPEC,
    ensure_none: BOOLEAN | None,         # optional, default is False, ensure states include None
    dataset: DATASOURCE_ID | None,       # optional, distribution datasource for this rv
}

STATES_SPEC :=
    STATES             |                 # defined list of states
    infer_distinct     |                 # infer from data sources
    infer_range        |                 # infer from data sources
    infer_max                            # infer from data sources

STATES :=
    STATE_LIST |                         # list of named states
    POSITIVE_INTEGER |                   # number of states, equivalent to Python range(n)
    STATE_RANGE                          # equivalent to Python range(start, stop, step)

STATE_RANGE := {
    start: INTEGER | None,
    stop: INTEGER,
    step: INTEGER | None,
}

CROSSTABS_LIST := [        # list, tuple, set
        CROSSTAB_SPEC,
        ...                # optional additional entries
]

CROSSTABS_DICT := {
        CROSSTAB_ID: CROSSTAB_SPEC,
        ...                # optional additional entries
}

CROSSTAB_SPEC :=
    RV_LIST |              # only define as rv list if the datasource is obvious
    CROSSTAB_DICT

CROSSTAB_DICT := {
    rvs: RV_LIST,
    epsilon: POSITIVE_NUMBER,       # Differential Privacy number for noise injection
    min_cell_size: NON_NEG_NUMBER,  # crosstab rows with weight below this value are removed
    need_sensitivity: BOOLEAN,      # If True, then no noise applied if sensitivity == 0 (even if min_cell_size > 0)
    max_add_rows: POSITIVE_NUMBER,  # Differential Privacy limit on adding rows that had zero weight
    datasource: None | DATASOURCE_ID
}

ENTITY_SPEC := {
    rvs: RV_LIST | None,                     # fields to populate by sampling random variables
    fields: FIELDS_DICT | None,              # fields that are computed, not sampled
    id_field: STRING | None,                 # name of the entity 'id' field (default is _id_)
    count_field: STRING | None,              # name of the entity 'count' field (default is _count_)
    foreign_field: STRING | None,            # name of child entity foreign key field (default is {name}_{id_field})
    parent: ENTITY_ID | None,                # parent entity (default is None)
    cardinality: CARDINALITY_SPEC | None     # define number of records per parent entity (default is 1)
}

FIELDS_DICT := {
    FIELD_ID:
        FIELD_CONST_SPEC |
        FIELD_SUM_SPEC |
        FIELD_FUNCTION_SPEC |
        FIELD_SAMPLE_SPEC,
    ...                                      # optional additional entries
}

FIELD_CONST_SPEC := {
    value: STATE,                            # field state
}

FIELD_SAMPLE_SPEC := {
    sample: RV_ID
}

FIELD_SUM_SPEC := {
    value: STATE | None,                                        # initial field state (default is 0)
    sum: [FIELD_ID | NUMBER, ...] | FIELD_ID | NUMBER | None,   # field update method
}

FIELD_FUNCTION_SPEC := {
    value: STATE | None,                 # initial field state (default is None)
    function: STRING,                    # a Python expression using input RVs
    input: RV_LIST,                      # RV names to use for input
}

CARDINALITY_SPEC :=
    [ CARDINALITY_SPEC, ... ] |                   # list/set of cardinality specs (stop if any indicates stop)
    NON_NEG_NUMBER |                              # stop if entity 'count' field >= this value
    FIELD_ID |                                    # stop if entity 'count' field >= the value of this rv
    {field: FIELD_ID, limit: NON_NEG_NUMBER} |    # stop after rv value is >= the given limit value
    {field: FIELD_ID, limit: FIELD_ID} |          # stop after rv value is >= the given limit variable
    {field: FIELD_ID, state: STATE | STATE_LIST}  # stop after rv is in one of the specified states


RV_LIST := RV_ID | [RV_ID, ...]              # list or tuple with at least one entry (or just a single RV_ID)
ENTITY_LIST := ENTITY_ID | [ENTITY_ID, ...]  # list or tuple with at least one entry (or just a single ENTITY_ID)
STATE_LIST := [STATE, ...]                   # list, tuple or set with at least one entry

STATE := STRING | NUMBER | None

DATASOURCE_ID := ID
RV_ID := ID
CROSSTAB_ID := ID
ENTITY_ID := ID
FIELD_ID := ID
CONN_PARAM := ID

BOOLEAN := True | False | yes | no | 1 | 0
```
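
The `STATES` alternatives above can be expanded into an explicit list of states along the following lines. This is a sketch only, and `expand_states` is a hypothetical helper, not a Synthorus function. It assumes that a `start` of None means 0 and a `step` of None means 1, matching Python's `range`.

```python
def expand_states(states):
    """Expand a STATES value into an explicit list (illustrative only)."""
    if isinstance(states, bool):
        # bool is a subclass of int in Python, so reject it explicitly
        raise TypeError('a boolean is not a valid STATES value')
    if isinstance(states, int):
        # POSITIVE_INTEGER: number of states, equivalent to range(n)
        if states <= 0:
            raise ValueError('number of states must be positive')
        return list(range(states))
    if isinstance(states, dict):
        # STATE_RANGE: equivalent to range(start, stop, step)
        start = states.get('start')
        step = states.get('step')
        return list(range(
            0 if start is None else start,   # assumed default
            states['stop'],
            1 if step is None else step,     # assumed default
        ))
    # STATE_LIST: list, tuple or set of named states
    return list(states)


print(expand_states(3))                                    # [0, 1, 2]
print(expand_states({'start': 1, 'stop': 10, 'step': 3}))  # [1, 4, 7]
print(expand_states({'stop': 4}))                          # [0, 1, 2, 3]
print(expand_states(['low', 'high']))                      # ['low', 'high']
```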

`STRING :=` a normal Python string constant

`ID :=` a normal Python string constant containing only the characters: a-zA-Z0-9_-.,

`NUMBER :=` a normal Python numerical constant (float or int)

`NON_NEG_NUMBER :=` a normal Python numerical constant (float or int), >= 0

`POSITIVE_NUMBER :=` a normal Python numerical constant (float or int), > 0

`INTEGER :=` a normal Python integer constant

`NON_NEG_INTEGER :=` a normal Python integer constant, >= 0

`POSITIVE_INTEGER :=` a normal Python integer constant, > 0

`None :=` either the Python None object or the omission of the 'key:value' pair from the dictionary.
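
To illustrate the treatment of None, the two hypothetical datasource dictionaries below describe the same thing once None entries are dropped. `drop_none` is an illustrative helper, not part of Synthorus, and the file name is made up.

```python
def drop_none(spec_dict):
    """Return a copy of spec_dict without the keys whose value is None."""
    return {key: value for key, value in spec_dict.items() if value is not None}


explicit = {'location': 'pets.csv', 'weight': None, 'rvs': None}
omitted = {'location': 'pets.csv'}

# Both forms reduce to the same dictionary.
print(drop_none(explicit) == drop_none(omitted))  # True
```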
