# Background #

Here we briefly cover background concepts needed for using the Synthorus library. The topics covered are:
- synthetic data
- random variables
- probabilistic graphical models
- datasets and datasources
- cross-tables
- protecting privacy
- entities and relationships
- entity fields
- synthetic data utility
- synthetic data accuracy.

Synthorus is built on the Compiled Knowledge library for probabilistic graphical models. Compiled Knowledge documentation can be found at [compiled-knowledge.readthedocs.io](https://compiled-knowledge.readthedocs.io/).

## Synthetic Data ##

Synthetic data is artificial data that is generated by algorithms to represent some real-world or hypothetical system.


### use cases ###


Synthetic data is used instead of data collected from a real system when:
1. data from the real system contains private or sensitive information that should not be disclosed
2. there is no real system, only a hypothetical system
3. it is difficult to collect enough data from a real system.

The primary target for Synthorus is (1) where private or sensitive information needs to be protected.

Synthetic data may be used for things like:

- education and training
- testing workflows, algorithms and data processing pipelines
- training machine learning models.


### suitability ###

When using synthetic data for some use case, it is important to consider how suitable the synthetic data is for the use case.
Suitability of synthetic data for a use case is considered using two broad categories of criteria: (1) privacy protection and (2) accuracy.
There is typically a tradeoff between privacy protection and accuracy: the more privacy protection, the less accuracy, and
vice versa. There may also be computational considerations for synthetic data suitability, such as time to create the synthetic
data or the volume that can be created.


### generation ###

There are many approaches to generating synthetic data. Approaches include:

1. take existing data and modify it
2. construct a simulator based on knowledge of the represented system, then run the simulator to generate data
3. train a generative model on reference data (from the represented system), then draw samples from the generative model.

Approach (1) may be suitable when privacy protection is not important. Examples include upsampling algorithms such as
Synthetic Minority Oversampling Technique (SMOTE).

Approach (2) may be useful when there are no data available from the represented system, and where knowledge about the represented system is theoretical.
Examples include Discrete Event Simulation for systems with distinct events, Continuous Simulation for systems described by continuous variables,
and Agent-Based Modeling, which simulates the actions of individual agents. An advantage of this approach is that privacy can never be compromised
as private or sensitive data is not used. Another advantage is the limitless supply of synthetic data. A disadvantage is that the quality of synthetic data
is only as good as the quality of knowledge used to build the simulation.

Approach (3) is common when there is reference data available from the represented system. Examples include
Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Probabilistic Graphical Models (PGMs).
An advantage of these techniques is that once the generative model is defined the supply of synthetic data is limitless.
If the reference data contains private or sensitive information, there are two ways that privacy may be protected.
(1) The reference data may be altered prior to use, to remove or reduce private information. (2) A model may be
trained on the reference data, then the model altered to remove or reduce private information.

At its core Synthorus uses approach (3). Specifically, Probabilistic Graphical Models (PGMs) are used. However, Synthorus
allows multiple PGMs and hard-coded knowledge to be combined into a simulator for generating complex synthetic data.

## Random variables ##

A probabilistic model defines a joint probability distribution over a set of random variables.

All probabilistic models start with a set of random variables. These are variables that represent something in the world (real or hypothetical) that we want to represent in the model.

Each random variable has a set of possible values (sometimes called states of the random variable).

To be a proper model (and not just maths) every random variable of the model needs to be given a meaning (i.e. what does it relate to in the world), and each possible value of each random variable needs to be specified and defined.

In our case we consider only _discrete_ random variables, where the set of possible values is finite. Often we deal with _Boolean_ random variables that have only two possible values (true and false). Sometimes the number of possible values of a random variable may be large (e.g. 10s, 100s, or even 1000s of possible values).

It is possible for a random variable to only have one possible state. However, in that case the value of that random variable is always definitely known (i.e., the only possible value). It doesn't really make sense for a PGM to contain a random variable with zero possible values.

The set of random variables and their possible values do not change for a given model. If you add or remove a random variable or possible value, then you have defined a new model.


## PGMs ##

A Probabilistic Graphical Model (PGM) is a specific kind of probabilistic model, and thus defines a joint probability distribution over a set of random variables.

A PGM may be sampled to provide an unlimited series of samples from a joint distribution over one or more random variables. Samples from a PGM form an empirical distribution.
As the number of samples increases, the empirical distribution more closely matches the joint probability distribution defined by the PGM.
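
The convergence described above can be illustrated with a minimal sketch. It does not use Synthorus or Compiled Knowledge: the joint distribution, the variable meanings, and all function names are assumptions made for this example, and the distribution is stored as a plain dictionary rather than a PGM.

```python
import random
from collections import Counter

# Hypothetical joint distribution over two Boolean random variables (rain, wet).
joint = {
    (True, True): 0.27,
    (True, False): 0.03,
    (False, True): 0.14,
    (False, False): 0.56,
}

def sample_joint(joint, n, rng):
    """Draw n independent joint states according to their probabilities."""
    states = list(joint)
    probs = [joint[s] for s in states]
    return rng.choices(states, weights=probs, k=n)

def empirical(samples):
    """Empirical distribution: relative frequency of each observed joint state."""
    counts = Counter(samples)
    total = len(samples)
    return {state: count / total for state, count in counts.items()}

def total_variation(p, q):
    """Half the L1 distance between two distributions over the same states."""
    states = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in states)

rng = random.Random(0)
for n in (100, 10_000, 100_000):
    tv = total_variation(joint, empirical(sample_joint(joint, n, rng)))
    print(f"{n:>7} samples: total variation distance = {tv:.4f}")
```

The printed distance is expected to shrink, on average, as the number of samples grows.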

Internally a PGM represents its joint probability distribution using _factors_ (which are sets of random variables) and a _potential function_ that maps states
of a factor's random variables to a real number. This is managed by the Compiled Knowledge library, documented at [compiled-knowledge.readthedocs.io](https://compiled-knowledge.readthedocs.io/).
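
As a rough illustration of factors and potential functions, the sketch below recovers a joint distribution by multiplying factor potentials and normalising. The variable names, the potential values, and the brute-force enumeration are assumptions for this example only; this is not how Compiled Knowledge represents or evaluates a PGM, and enumeration is only feasible for tiny models.

```python
from itertools import product

# Hypothetical random variables and their possible states.
states = {"rain": (False, True), "wet": (False, True)}

# Each factor is a tuple of random variables plus a potential function,
# stored here as a dict from the factor's joint states to a non-negative real.
factors = [
    (("rain",), {(False,): 7.0, (True,): 3.0}),
    (("rain", "wet"), {
        (False, False): 4.0, (False, True): 1.0,
        (True, False): 1.0, (True, True): 9.0,
    }),
]

def joint_distribution(states, factors):
    """Multiply the potentials for every joint state, then normalise."""
    names = list(states)
    weights = {}
    for values in product(*(states[name] for name in names)):
        assignment = dict(zip(names, values))
        weight = 1.0
        for variables, potential in factors:
            weight *= potential[tuple(assignment[v] for v in variables)]
        weights[values] = weight
    z = sum(weights.values())  # normalising constant
    return {values: w / z for values, w in weights.items()}

joint = joint_distribution(states, factors)
print(joint[(True, True)])  # probability of rain=True, wet=True (27/65)
```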


## Datasets and Datasources ##

Within Synthorus a _dataset_ represents reference data from a system (the system to represent with synthetic data).
Although in principle synthetic data can be used as a dataset, we generally reserve _dataset_ to mean reference data for training PGMs, and reserve _synthetic data_
to mean the output of a Synthorus simulation.

Logically a dataset is a table of data where columns represent random variables and each row represents an _instance_ (also known as a _sample_, _record_, _row_, _datapoint_, or _joint states_).

The _length_ of a dataset is the number of instances in the dataset.

Each instance in a dataset has an instance weight. An instance weight represents a _weight of evidence_.
The value of an instance weight is notionally 1, but can be other values to represent multiple or fractional evidence.

Within Synthorus dataset values and weights are immutable and cannot be updated.

A _dataset spec_ is a specification for how to access a dataset. Synthorus provides dataset spec classes for different kinds of datasets, including:
- CSV file or inline text (with generalised separators and line spacing)
- Table Builder file or inline text (as created by the [Australian Bureau of Statistics](https://www.abs.gov.au/statistics/microdata-tablebuilder/tablebuilder))
- Pickled Pandas DataFrame object, as a file
- Parquet file
- Feather file
- A database SQL query (using ODBC or Postgres)
- A mathematical function (with a defined domain).

A _datasource spec_ (or just _datasource_) is composed of a dataset spec and other parameters. Those parameters support privacy protection and the statistical integrity of data.


## Cross-tables ##

A cross-table records the total weight for possible combinations of states for some random variables. Its primary purpose is to represent an empirical distribution over joint states of the random variables.

Practically, a cross-table is a mapping from states of the cross-table random variables to a weight. Instances with weight zero are not explicitly represented in a cross-table.

A cross-table is constructed from a dataset by projecting the dataset onto a subset of the dataset's random variables, then summing the weights of equivalent joint states.
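
The projection and summing can be sketched in a few lines of plain Python. The dataset, column names, and the `cross_table` function are hypothetical and chosen for illustration; Synthorus's own data structures may differ.

```python
from collections import defaultdict

# Hypothetical dataset: column names, then (instance, weight) pairs.
columns = ("age_group", "sex", "smoker")
dataset = [
    (("young", "F", "no"), 1.0),
    (("young", "F", "yes"), 1.0),
    (("young", "M", "no"), 1.0),
    (("old", "F", "no"), 2.0),  # weight 2 represents two units of evidence
    (("old", "F", "no"), 0.5),  # fractional evidence is also allowed
]

def cross_table(columns, dataset, rvs):
    """Project the dataset onto `rvs` and sum the weights of equal joint states."""
    indexes = [columns.index(rv) for rv in rvs]
    table = defaultdict(float)
    for instance, weight in dataset:
        key = tuple(instance[i] for i in indexes)
        table[key] += weight
    # Joint states with zero total weight are not stored.
    return {key: w for key, w in table.items() if w != 0}

ct = cross_table(columns, dataset, ("age_group", "sex"))
print(ct)
# {('young', 'F'): 2.0, ('young', 'M'): 1.0, ('old', 'F'): 2.5}
```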

Cross-tables are a crucial concept in Synthorus as cross-tables are used for constructing PGMs. Importantly, if a set of random variables appears together in some cross-table,
then the joint probability distribution of those random variables can be represented in generated synthetic data. If a set of random variables does not appear together in some cross-table,
then the joint probability distribution of those random variables may not be accurate.

If the joint probability distribution of a set of random variables is accurate, then _all_ statistics involving those random variables will be accurate. This includes statistics like
conditional probabilities, correlation, joint entropy, etc.

For example, consider a dataset with random variables A, B, C, D and E. And consider two cross-tables from that dataset, one with random variables A, B and C, and the other
with C, D and E. Then correlation between A and C will be accurately modelled in synthetic data as will correlation between C and E. But correlation between A and E may not
be accurate.
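
A reduced version of this example (variables A, C and E only) can be worked through numerically. The sketch below assumes that the two cross-tables are combined as P(A, C) · P(E | C); that is just one simple way to join tables sharing C, used here for illustration, and is not a description of how Synthorus constructs its PGMs. The reference data is hypothetical.

```python
from collections import defaultdict
from itertools import product

# Hypothetical reference data over (A, C, E) in which A always equals E,
# while C is independent of both. Each instance has weight 1.
data = {(0, 0, 0): 1.0, (1, 0, 1): 1.0, (0, 1, 0): 1.0, (1, 1, 1): 1.0}

def marginal(dist, keep):
    """Sum weights over the positions not listed in `keep`, then normalise."""
    out = defaultdict(float)
    for state, w in dist.items():
        out[tuple(state[i] for i in keep)] += w
    total = sum(out.values())
    return {k: w / total for k, w in out.items()}

p_ac = marginal(data, (0, 1))  # cross-table over A and C
p_ce = marginal(data, (1, 2))  # cross-table over C and E
p_c = marginal(data, (1,))

# Model built only from the two cross-tables: P(a, c, e) = P(a, c) * P(e | c).
model = {
    (a, c, e): p_ac.get((a, c), 0.0) * p_ce.get((c, e), 0.0) / p_c[(c,)]
    for a, c, e in product((0, 1), repeat=3)
}

def prob_a_equals_e(dist):
    """Probability that A and E take the same value."""
    total = sum(dist.values())
    return sum(w for (a, _, e), w in dist.items() if a == e) / total

print(marginal(model, (0, 1)) == p_ac)  # True: the A, C distribution is preserved
print(prob_a_equals_e(data))            # 1.0 in the reference data
print(prob_a_equals_e(model))           # 0.5 in the model
```

The model reproduces both cross-tables exactly, yet the strong dependence between A and E is lost because no cross-table contains both.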

Within Synthorus, cross-tables are the mechanism used to protect privacy. Namely, before the cross-tables are used to construct PGMs, they are adjusted to preserve privacy (using Differential Privacy algorithms,
explained next).


## Protecting Privacy ##

Privacy leakage happens when information can be used to identify an individual (or infer sensitive information).
An individual is "identified" when a link can be established between private information and the individual.
Protecting privacy requires more than removing or replacing all personal identifiers. It is important to
remove or replace all other information that may, alone or in combination with other information, allow an
individual to be identifiable.

Synthorus implements privacy protection measures to ensure that the potential for privacy leakage is managed.
Within Synthorus, datasets are used to create _clean cross-tables_. The clean cross-tables may contain
data that potentially compromises privacy. The clean cross-tables are modified to create _noisy cross-tables_
that have a lower risk of compromising privacy. The noisy cross-tables are used to create PGMs, which are combined
with theoretical system knowledge to define a synthetic data simulation. The simulation is run to generate synthetic data.

Privacy is protected using two steps to create a noisy cross-table from its corresponding clean cross-table.
1. Differential Privacy is applied by adding Laplacian noise.
2. Min-cell-size Suppression is performed to remove rows with a weight below a threshold.

Efficient algorithms for these steps are described and evaluated in the publication: Suresh, S., Zhang, G., Liu, B., Drake, B. (2025).
Cost-Efficient and Privacy-Preserving Synthesis of Complex Sensitive Data. ACS 2025: Proceedings of the 2025 Australasian Computer Science Week. https://dl.acm.org/doi/10.1145/3727166.3727170.


### Differential Privacy ###

In the first step, Synthorus adds random Laplacian noise values to every cross-table count. The amount of noise
is controlled by two parameters: _sensitivity_ and _epsilon_.

Parameter _sensitivity_ captures the inherent privacy risk of a dataset. Namely, it is the maximum possible effect a single record can have on a function's output, and it quantifies how much a query's result can change if one person's data is added or removed. If a dataset contains no private information, its sensitivity is zero. If a person can have
at most one record in the dataset, its sensitivity is one.

Parameter _epsilon_ captures the privacy budget used by a cross-table and quantifies the level of privacy loss of using
the cross-table. Lower _epsilon_ values indicate stronger privacy but lower accuracy. Higher _epsilon_ values provide less privacy protection but more accurate results. The total privacy budget for a synthetic data simulator is the sum of the _epsilon_ values for all cross-tables used to construct the simulator. Commonly, a total budget in the range 2 to 10 may be considered to offer low privacy protection, whereas a budget below 1 may be considered to provide strong privacy protection.

Synthorus allows the privacy budget to be distributed arbitrarily across cross-tables, so that a synthetic data engineer can focus accuracy on critical combinations of random variables.

Differential Privacy defines a noise scale, $\sigma$ = _sensitivity_ / _epsilon_. Internally, a Laplacian random variate
with zero mean and scale $\sigma$ (that is, variance $2\sigma^2$) is added to each weight of a clean cross-table. If this step causes a weight
to go negative, the cross-table row is removed (i.e., weight is set to zero).
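
The noise step can be sketched as follows, assuming a cross-table is held as a dictionary from joint states to weights. The function name and the example table are hypothetical. The sketch perturbs only the rows present in the clean cross-table, and it draws variates with ordinary floating point arithmetic, so it does not have the secure sampling properties described under random number generation. It is an illustration of the arithmetic, not a privacy-safe implementation.

```python
import random

def add_laplace_noise(clean, sensitivity, epsilon, rng):
    """Return a noisy copy of a cross-table (dict of joint state -> weight)."""
    if sensitivity == 0:
        return dict(clean)  # no private information, so no noise is added
    scale = sensitivity / epsilon
    noisy = {}
    for state, weight in clean.items():
        # The difference of two exponential variates with mean `scale`
        # is a Laplace variate with zero mean and scale `scale`.
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        if weight + noise > 0:  # rows that are not positive are removed
            noisy[state] = weight + noise
    return noisy

clean = {("young", "F"): 120.0, ("young", "M"): 95.0, ("old", "F"): 2.0}
noisy = add_laplace_noise(clean, sensitivity=1.0, epsilon=0.5, rng=random.Random(0))
print(noisy)
```

With _sensitivity_ = 1 and _epsilon_ = 0.5 the noise scale is 2, so large weights change little in relative terms while small weights may change substantially or disappear.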

Note that if the sensitivity of a dataset is zero, then any cross-table constructed from that dataset has no risk to privacy so its epsilon value is effectively zero (even if a synthetic data engineer sets epsilon to some other value).
For such cross-tables the clean and noisy versions are identical (unless Min-cell-size Suppression is enforced). These
cross-tables will not contribute to the privacy budget of the synthetic data.


### min-cell-size suppression ###

In the second step, Synthorus enforces minimum cell size using a weight _threshold_. That is, if any cross-table row has a weight less than the threshold then the cross-table row is removed (i.e., weight is set to zero).
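
A minimal sketch of this step, again assuming a cross-table is a dictionary from joint states to weights (the function name and values are illustrative only):

```python
def suppress_small_cells(cross_table, threshold):
    """Remove every row whose weight is less than the threshold."""
    return {state: w for state, w in cross_table.items() if w >= threshold}

noisy = {("young", "F"): 12.3, ("young", "M"): 2.1, ("old", "F"): 7.0}
print(suppress_small_cells(noisy, threshold=5))
# {('young', 'F'): 12.3, ('old', 'F'): 7.0}
```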

Minimum cell size suppression comes after Differential Privacy and is thus a post-processing step. Post-processing of a differentially private result cannot
weaken its privacy guarantee, so suppression has no bearing on the Differential Privacy property.


### random number generation ###

Implementations of Differential Privacy are vulnerable to statistical attacks caused by the approximation of real values to floating point numbers. Synthorus produces random variates for Differential Privacy using a mathematically sound technique to make such an attack infeasible.

For details of random variate production, see:
Holohan, N., & Braghin, S. (2021, October). Secure random sampling in differential privacy.
In European Symposium on Research in Computer Security (pp. 523-542). Springer, Cham. https://doi.org/10.1007/978-3-030-88428-4_26.

The security of random variates is controlled by a parameter _n_. Higher values of _n_ provide better security, but linearly increase the time taken to produce each random variate. Each value of _n_ can be related to an equivalent encryption level. For example:
- _n_ = 4 is equivalent to AES128
- _n_ = 5 is equivalent to AES192
- _n_ = 6 is equivalent to AES256.


## Entities and Relationships ##


### entities ###

In data modeling, an entity represents a class of objects or a concept from a modelled system. An entity is distinctly definable, such as a person, place, event, diagnosis, etc.

An instance of an entity represents a specific example of the entity. For example "Abraham Lincoln" may be an instance of the "American President" entity.

Each entity is associated with a set of attributes. Every instance of an entity will have a value for each attribute of the entity. (Here we consider _missing_ or _null_ to be a value.)

In databases, an entity typically is implemented as a table. Table rows are instances, table columns are attributes, and each cell in a table has a value.


### relationships ###

In data modeling, a relationship is an association or connection between two or more entities. For example, an entity "Person" may have a relationship, called "Residence", to another entity "Address". The Residence relationship allows a data model to capture that some particular person resides at some particular address.


### relationship cardinality ###

In data modeling, relationship cardinality describes how many instances of one entity can be related to instances of another entity. In general, each side of the relationship cardinality is a range. E.g., the cardinality of entities X and Y might be _a_-_b_:_c_-_d_, meaning that for any instance of Y the minimum number of X instances is _a_ and the maximum is _b_. Similarly, for any instance of X the minimum number of Y instances is _c_ and the maximum is _d_. If the minimum and maximum of a range is the same value, then just that value is written.

If a range can be any value, then an asterisk (\*) can be used. If the minimum of a range is greater than zero, then only the upper value is marked with an asterisk.

Here are some common examples.

__1:1__ means a single instance of one entity is related to a single instance of the other entity.

__1:0-1__ means an instance of the first entity is optionally related to an instance of the second entity, but every instance of the second entity is related to exactly one instance of the first entity.

__1:\*__ means each single instance of the first entity is related to any number (including zero) of instances of the second entity. But every instance of the second entity is related to exactly one instance of the first entity. Also called a "one-to-many" relationship.

__1:1-\*__ means each single instance of the first entity is related to one or more instances of the second entity. Every instance of the second entity is related to exactly one instance of the first entity.

__\*:\*__ means each single instance of the first entity is related to any number (including zero) of instances of the second entity, and vice versa. Also called a "many-to-many" relationship.


Note that in theory any many-to-many relationship can be converted to a pair of one-to-many relationships by introducing an auxiliary entity. For example if entities X and Y have relationship cardinality \*:\* this can be replaced by introducing entity R with relationship X and R being 1:\* and relationship R and Y being \*:1.
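
As a small illustration of this conversion, the sketch below turns a many-to-many Person and Address relationship into an auxiliary entity with one foreign key per side. The entity names, field names, and key values are hypothetical.

```python
# Hypothetical many-to-many relationship between Person (X) and Address (Y),
# given as pairs of primary keys (person_id, address_id).
residence_pairs = [(1, 10), (1, 11), (2, 10)]

# Auxiliary entity R: one record per pair, with its own primary key and a
# foreign key to each side. Person and R are 1:*, R and Address are *:1.
residences = [
    {"id": r_id, "person_id": person_id, "address_id": address_id}
    for r_id, (person_id, address_id) in enumerate(residence_pairs, start=1)
]

def addresses_of(person_id):
    """Follow the two one-to-many relationships from a person to addresses."""
    return sorted(r["address_id"] for r in residences if r["person_id"] == person_id)

print(addresses_of(1))  # [10, 11]
```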


### Synthorus entity cardinality ###

In the current version of Synthorus, an entity can have an unlimited number of 1:\* relationships, but no more than one \*:1 relationship. This means that the graph of relationships forms a tree (or a forest in general). If entities X and Y have a 1:\* relationship we refer to X as the _parent_ entity and Y as a _child_ entity.


### record cardinality ###

When a synthetic data simulation is running the simulator must produce multiple records of a child entity for a given record of a parent entity. We cast this into a general framework.

Synthorus identifies all the root entities (those with no parent) then for each root entity calls `run_node`. Method `run_node` generates records for the entity it is given. After generating a record for entity X, Synthorus will then call `run_node` for each child entity of X.

Note that this implies that when generating a record for entity X, there will be a _parent_ record, which is the last record generated for X's parent entity (or the special _parameters_ record if X has no parent). Similarly, there are the _ancestor_ records, which are the parent record, the parent's parent record, and so on up to the special parameters record.

The method `run_node` needs to decide how many records to generate for its current entity, which will be related to the current parent record. The number of child records for a given parent record is referred to as the parent's _record cardinality_.

Theoretically, a particular record cardinality is a random variable, conditioned on the _current context_. The current context is the collection of ancestor records plus any previous records for the current parent record. We consider the current context as a collection of _fields_, each with a current _value_.

Synthorus implements record cardinality as _stopping conditions_. Specifically, before generating a record for entity X, the stopping conditions are tested on the current context. If the stopping condition is "true" then no more records are generated for the current parent record, otherwise a new record is generated.

This is a very flexible approach to record cardinality. Synthorus provides several methods for defining record cardinality by attaching stopping conditions to an entity. Possible stopping conditions include the following.
1. Fixed Limit (_n_): the generating process for a parent record should stop once _n_ child records are generated.
2. Variable Limit (_f_): the generating process for a parent record should stop once _n_ child records are generated, where _n_ is the value of context field _f_.
3. Fixed Comparison (_f_, _t_): the generating process for a parent record should stop if the value of a context field _f_ reaches some threshold _t_. Typically, _f_ is a field of the current parent record.
4. Variable Comparison (_f~1~_, _f~2~_): the generating process for a parent record should stop if the value of a context field _f~1~_ reaches the value of a context field _f~2~_.
5. Field State (_f_, _x_): the generating process for a parent record should stop if the value of context field _f_ is _x_. Typically, _f_ is a field of the current entity, and thus refers to the previous record of the entity.
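
The recursive generation process and two of the stopping conditions above can be sketched as a toy simulator. This is not Synthorus's implementation: the `Entity` class, the signature of `run_node`, the shape of the context (here only the parent record and the previous sibling records, rather than all ancestors), and the patient and visit entities are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import Callable

def fixed_limit(n):
    """Fixed Limit: stop once n child records exist for the parent record."""
    return lambda context: len(context["previous"]) >= n

def variable_limit(field_name):
    """Variable Limit: stop once the count reaches a field of the parent record."""
    return lambda context: len(context["previous"]) >= context["parent"][field_name]

@dataclass
class Entity:
    name: str
    make_record: Callable[[dict, random.Random], dict]  # (parent record, rng) -> record
    stop: Callable[[dict], bool]                        # context -> True to stop
    children: list = field(default_factory=list)

def run_node(entity, parent, rng, output):
    """Generate the records of `entity` for one parent record, then recurse."""
    previous = []  # records already generated for this parent record
    while not entity.stop({"parent": parent, "previous": previous}):
        record = entity.make_record(parent, rng)
        previous.append(record)
        output.setdefault(entity.name, []).append(record)
        for child in entity.children:
            run_node(child, record, rng, output)

visit = Entity("visit",
               lambda parent, rng: {"reason": rng.choice(["checkup", "injury"])},
               stop=variable_limit("n_visits"))
patient = Entity("patient",
                 lambda parent, rng: {"n_visits": rng.randint(0, 3)},
                 stop=fixed_limit(3), children=[visit])

output = {}
parameters = {}  # stands in for the special parameters record of a root entity
run_node(patient, parameters, random.Random(0), output)
print(len(output["patient"]), "patients,", len(output.get("visit", [])), "visits")
```

Because the stopping condition is tested before each record is generated, a parent whose `n_visits` is zero produces no child records at all.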


### primary and foreign keys ###

In practice, relationships between entities are implemented using primary and foreign keys. This is also the case for Synthorus.

Every instance of an entity has a primary key, which is a special attribute that uniquely identifies each instance of the entity.

A foreign key is a special attribute of one entity that references the primary key of another entity. The value of the foreign key for any instance refers to exactly one instance in the referenced entity.

Synthorus implements relationships by including a foreign key in each child record that refers to the primary key of its parent record. Primary keys are generated automatically as consecutive serial numbers.



## Entity Fields ##

Recall that in data modelling, each entity is associated with a set of attributes.

Within Synthorus, the attributes of an entity are called _fields_. In addition to the attributes of the entity, Synthorus maintains a collection of _special fields_ that are used to manage the generation of synthetic data. The special fields are:
1. ID Field: this field is the primary key for the entity. The default name for an ID field is `_id_`.
2. Foreign ID Field: this field records the relationship of a child entity to a parent entity. The default name for a Foreign ID Field is `_{parent_entity}_{id_field}` where `{parent_entity}` is the name of the parent entity and `{id_field}` is the name of the parent entity's ID field.
3. Count Field: this field records the value of a counter, starting from zero, as child records are generated for a given parent record.

All other fields need to be configured by the synthetic data engineer. Those fields can be either:
1. A Sampled Field: which refers to a random variable, that will be sampled by the simulator; each sampled field is associated with exactly one random variable.
2. A Function Field: whose value for a record is an arbitrary function of the current context.

For convenience, Synthorus provides predefined functions for function fields in addition to arbitrary Python functions. These include counting functions, copy functions, and summing functions.
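
To give a feel for function fields, the sketch below treats each one as a Python callable that receives the current context. The context layout, the helper names `copy_from_parent` and `sum_previous`, and the field names are hypothetical; they are not the predefined functions shipped with Synthorus.

```python
def copy_from_parent(field_name):
    """Function field that copies a field value from the parent record."""
    return lambda context: context["parent"][field_name]

def sum_previous(field_name):
    """Function field that sums a field over the previously generated records."""
    return lambda context: sum(r[field_name] for r in context["previous"])

# Hypothetical context: the parent record and earlier records for that parent.
context = {
    "parent": {"_id_": 7, "region": "north"},
    "previous": [{"cost": 10.0}, {"cost": 2.5}],
}

function_fields = {
    "region": copy_from_parent("region"),
    "cost_so_far": sum_previous("cost"),
}
record = {name: fn(context) for name, fn in function_fields.items()}
print(record)  # {'region': 'north', 'cost_so_far': 12.5}
```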


## Synthetic Data Utility ##


Whether some specific synthetic data is suitable for a given use case depends on many factors, which vary with the use case. Such factors characterise the _utility_ of the synthetic data and may relate to:
1. how accurately the synthetic data represents the system being modeled (real-world or hypothetical system)
2. the risk of private or sensitive information being disclosed
3. the volume of synthetic data available
4. how expensive it is to create synthetic data (time and resources).

Questions of accuracy are complex with many subtleties and difficulties. They are discussed in detail below.

Synthorus quantifies risks to privacy using Differential Privacy. Privacy risk is defined using _sensitivity_ for datasets and _epsilon_ for cross-tables. Privacy protection is implemented as additive Laplacian noise.

Synthorus places privacy guarantees on the synthetic data generator (i.e., on the simulator). This is done by placing privacy guarantees on the noisy cross-tables used to create the simulator's PGMs. This means there are no limits to the amount of synthetic data produced. That is, changing the number of synthetic data records does not affect privacy risk.

Synthorus aims to be computationally efficient by combining strategies. The process of creating synthetic data is split into two parts: creating a simulator and running the simulator. As much of the expensive computation as possible is pushed into the first part, creating the simulator. Running a simulator is designed to be computationally lean. In particular, building the simulator uses Knowledge Compilation so that sampling and evaluating are as efficient as possible. This functionality is provided by the [Compiled Knowledge](https://compiled-knowledge.readthedocs.io) library.


## Synthetic Data Accuracy ##

When considering the accuracy of some synthetic data the following criteria may be used:
- field validity
- conditioned field validity
- referential integrity
- statistical accuracy.


### field validity ###

Field validity is respected when each value of a field within the synthetic data is a valid value.

In Synthorus, the synthetic data engineer has total control over what values are considered to be valid for a given field. A common choice for defining valid values for a sampled field is to use the set of values seen in reference data for the field's associated random variable.


### conditioned field validity ###

In principle, the value of a field may only be valid in relation to the value of another field. For example, a diagnosis of "ovarian cancer" for a patient may be considered invalid if the "sex" field of the patient has value "male".

Conditioned field validity is respected when each value of a field within the synthetic data is a valid value, knowing the values of other related fields.

In Synthorus, conditioned field validity can be assured by including a cross-table for related random variables. If the cross-table has no entry for a particular combination of random variable values, then the probability of generating that combination is zero.
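
The effect can be seen in a simple sketch that samples joint states in proportion to cross-table weights. The cross-table and the sampling function are hypothetical and stand in for a PGM built from the table; combinations with no entry can never be drawn.

```python
import random

# Hypothetical cross-table over (sex, diagnosis). The combination
# ("male", "ovarian cancer") has no entry, so its weight is zero.
cross_table = {
    ("female", "ovarian cancer"): 3.0,
    ("female", "asthma"): 40.0,
    ("male", "asthma"): 45.0,
}

def sample_rows(cross_table, n, rng):
    """Draw n joint states with probability proportional to their weight."""
    states = list(cross_table)
    weights = [cross_table[s] for s in states]
    return rng.choices(states, weights=weights, k=n)

rows = sample_rows(cross_table, 10_000, random.Random(0))
print(("male", "ovarian cancer") in rows)  # False
```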

`* * * * * * * * * TODO * * * * * * * * *`

Note that Differential Privacy may add entries to a cross-table and thus cause synthetic data to violate conditioned field validity. This can be managed by using fine-grained controls for how Differential Privacy is applied to a cross-table.


### referential integrity ###

Referential integrity is the property of data that the value of every foreign key is an actual value of the associated primary key.

Synthorus guarantees referential integrity by construction.
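
For data from other sources, or as a sanity check, referential integrity can be tested directly. The sketch below assumes records are dictionaries; the field names follow the default naming described under Entity Fields for a hypothetical parent entity called `patient`, and the function name is an assumption.

```python
def has_referential_integrity(parents, children, id_field, foreign_id_field):
    """True if every child's foreign key is the primary key of some parent."""
    parent_ids = {p[id_field] for p in parents}
    return all(c[foreign_id_field] in parent_ids for c in children)

patients = [{"_id_": 1}, {"_id_": 2}]
visits = [
    {"_id_": 1, "_patient__id_": 1},
    {"_id_": 2, "_patient__id_": 1},
    {"_id_": 3, "_patient__id_": 2},
]
print(has_referential_integrity(patients, visits, "_id_", "_patient__id_"))  # True
```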

### statistical accuracy ###

The statistical accuracy of a synthetic dataset quantifies the quality of the empirical joint probability distribution of the synthetic data. Before even considering the practical questions of how to do this, there are theoretical questions, "What is the reference 'true' joint probability distribution" and "What are the measures of quality?"

Theoretically sound measures of distribution quality might include [Kullback–Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence)
(KL Divergence, also called relative entropy and I-divergence) or the more intuitive [Histogram Intersection](https://doi.org/10.1007/BF00130487).
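
Both measures are straightforward to compute for small discrete distributions. The sketch below assumes each distribution is a dictionary from joint states to probabilities; the function names are illustrative and are not part of the Synthorus API.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats. Infinite if q is zero somewhere that p is not."""
    total = 0.0
    for state, p_x in p.items():
        if p_x == 0:
            continue  # terms with p(x) = 0 contribute nothing
        q_x = q.get(state, 0.0)
        if q_x == 0:
            return math.inf
        total += p_x * math.log(p_x / q_x)
    return total

def histogram_intersection(p, q):
    """Sum of state-wise minima: 1 for identical distributions, 0 for disjoint."""
    return sum(min(p_x, q.get(state, 0.0)) for state, p_x in p.items())

reference = {"a": 0.5, "b": 0.5}
synthetic = {"a": 0.25, "b": 0.75}
print(kl_divergence(reference, synthetic))           # 0.5 * ln(4/3), about 0.144
print(histogram_intersection(reference, synthetic))  # 0.75
```

Note that KL divergence is not symmetric, so the choice of which distribution is treated as the reference matters.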

The good thing about these measures is that if the synthetic data distribution is 100% accurate, then all possible statistics and probabilities will be accurate. However, these measures may be expensive to compute over large state spaces. Furthermore, these measures may require many synthetic data records to get a meaningful empirical joint distribution, and may also evaluate far more than is needed.

For example, the utility of a dataset may depend only on the statistical relationships between small subsets of random variables. Or it may only be concerned with marginal distributions of random variables, or correlations between pairs of random variables.

Synthorus avoids the problem of needing large numbers of synthetic data records by using the joint probability distribution from the probabilistic models of the simulator. The probabilistic models guarantee that as the number of synthetic data records increases, the empirical probability distribution approaches the distribution defined by the models. This has three advantages.
1. It removes effects of sampling errors in the evaluation of utility.
2. Evaluation can be performed without any synthetic data being generated.
3. Synthorus probabilistic models can be directly and efficiently queried for most standard probabilities and statistics.

We note that when using Synthorus, a synthetic data engineer specifies cross-tables to capture their understanding of what statistical relationships are important. Specifically, if some random variables do not appear together in a cross-table, then the statistical relationships between those random variables are not _directly_ important. For this reason, Synthorus uses specified cross-tables as the core factorisation for measuring statistical accuracy.

Synthorus uses the specified reference datasets to represent the reference 'true' joint probability distribution.