# Randomness #

This section explains how Synthorus uses and manages randomness.

Synthorus uses randomness in two places within supported workflows.
One is in the production of noisy cross-tables from clean cross-tables, and the other
is when sampling probabilistic graphical models. These two uses of randomness each have different requirements.

## SafeRandom for privacy protection ##

Synthorus is designed to make privacy guarantees on synthetic data. It does this by making privacy guarantees
on a synthetic data simulator. In particular, these guarantees derive from the Differential Privacy techniques
applied to cross-tables to make noisy cross-tables. It is the noise added to the cross-tables that provides
privacy assurances.
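As a rough illustration of how noise turns a clean cross-table into a noisy one, the sketch below adds Laplace noise to every cell count. It is not Synthorus's implementation. The helper name `add_laplace_noise`, the `epsilon` and `sensitivity` parameters, and the example table are all assumptions made for this sketch, and Laplace noise is used only as a common example. The generator is seeded purely so the sketch is repeatable, which is exactly what should not be done for real privacy protection.

```python
import random


def add_laplace_noise(table, epsilon, sensitivity=1.0, rng=None):
    """Return a copy of `table` with Laplace(0, sensitivity / epsilon) noise on every cell.

    Hypothetical helper, for illustration only.
    """
    rng = rng or random.Random()
    scale = sensitivity / epsilon  # Laplace scale b
    noisy = {}
    for cell, count in table.items():
        # The difference of two independent exponential variates with mean b
        # is a Laplace variate with mean 0 and scale b.
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        noisy[cell] = count + noise
    return noisy


# A made-up clean cross-table of counts over two variables.
clean = {("Yes", "High"): 12, ("Yes", "Low"): 30, ("No", "High"): 7, ("No", "Low"): 51}

# Seeded only to make the sketch repeatable; never seed a generator used for privacy.
noisy = add_laplace_noise(clean, epsilon=0.5, rng=random.Random(0))
for cell, value in noisy.items():
    print(cell, round(value, 2))
```

A smaller `epsilon` gives a larger scale and therefore noisier counts.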

Researchers have demonstrated that implementations of differential privacy can be vulnerable to statistical attacks.
See:
Holohan, N., & Braghin, S. (2021, October). Secure random sampling in differential privacy.
In European Symposium on Research in Computer Security (pp. 523-542). Springer, Cham. https://doi.org/10.1007/978-3-030-88428-4_26.

Holohan & Braghin provide an algorithm to protect against such attacks. They describe a method that wraps a potentially
vulnerable random number generator to create random variates that are robust to these forms of attack. In Synthorus,
this is implemented as `SafeRandom`.
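To give a feel for what wrapping a generator can look like, here is a minimal sketch of one general way to build a single variate from several underlying draws. It rests on an assumption: that the wrapper combines `n` independent draws. Whether this matches the construction in Holohan & Braghin or the code in Synthorus is not confirmed here, and `combined_gauss` is a hypothetical name.

```python
import math
import random


def combined_gauss(rng, n, mu=0.0, sigma=1.0):
    """Combine n independent standard Gaussian draws into one Gaussian variate.

    Hypothetical helper, for illustration only.
    """
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    # The sum of n independent N(0, 1) variates is N(0, n),
    # so dividing by sqrt(n) restores unit variance.
    return mu + sigma * total / math.sqrt(n)


os_rng = random.SystemRandom()
print(combined_gauss(os_rng, n=4))
```

The cost of each variate grows linearly with `n`, since each one needs `n` underlying draws.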

As per Holohan & Braghin, `SafeRandom` takes a parameter, `n`, which controls the level of protection. Higher values of `n` provide more protection, but the time to generate a random variate is linear in `n`. They show that particular values of `n` give protection equivalent to [AES encryption](https://en.wikipedia.org/wiki/Advanced_Encryption_Standard) at a given key size. In particular:
- n = 4 is equivalent to AES128,
- n = 5 is equivalent to AES192,
- n = 6 is equivalent to AES256.

Synthorus's `SafeRandom` specifically wraps Python's `random.SystemRandom` class, which uses sources
of randomness provided by the operating system (such as `/dev/urandom` on Unix or `CryptGenRandom` on Windows).

`SafeRandom` combined with `random.SystemRandom` provides cryptographic security for random numbers used in privacy protection.
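For reference, the short sketch below uses only the Python standard library to show how `random.SystemRandom` behaves on its own. It does not involve `SafeRandom`, and the variable names are illustrative.

```python
import random

os_rng = random.SystemRandom()

# SystemRandom is a subclass of random.Random, so the usual methods are available.
print(isinstance(os_rng, random.Random))  # True
print(os_rng.random())      # float in [0, 1)
print(os_rng.randrange(4))  # integer in [0, 4)

# Seeding is ignored, and there is no internal state to save or restore.
os_rng.seed(12345)
try:
    os_rng.getstate()
    state_available = True
except NotImplementedError:
    state_available = False
print(state_available)  # False
```

Because the values come from the operating system rather than from a seeded algorithm, a sequence drawn from `random.SystemRandom` cannot be reproduced by reseeding.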

The `SafeRandom` class provides random variates from selected distributions. Here are some examples.

In [1]:
from synthorus.noise.safe_random import SafeRandom

# Create a random number generator with AES128 equivalent protection.
rng = SafeRandom(n=4)

Here are random variates, uniform in the interval [0, 1)...

In [2]:
[rng.random() for _ in range(5)]

[0.5953962027330629,
 0.11791977728408987,
 0.23749613518316548,
 0.21885498659993996,
 0.20676661704734167]

Here are random integers, uniform in the interval [0, x)...


In [3]:
x = 4
[rng.uniform(x) for _ in range(10)]


[3, 0, 2, 2, 0, 2, 0, 3, 0, 2]

Here are Gaussian random variates, with mean `mu` and standard deviation `sigma`...

In [4]:
[rng.gauss(mu=0, sigma=1) for _ in range(5)]


[-0.7828611460624882,
 -1.2768140490217574,
 -0.6025894264524065,
 -0.5729192789642424,
 1.095563239121768]

Here are Laplace random variates, with mean `mu` and scale `b`, i.e., variance `2 b^2`...


In [5]:
[rng.laplace(mu=0, b=1) for _ in range(5)]

[2.3601250268683835,
 -0.23585466729922358,
 1.109011848082121,
 -2.741604893934928,
 -0.11221176967783801]

Here are Binomial random variates, given `n` draws with success probability `p`, i.e., mean `n p` and variance `n p (1 - p)`...

In [6]:
[rng.binomial(n=10, p=0.5) for _ in range(5)]

[4, 6, 6, 6, 3]
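The moments quoted above for the Laplace and Binomial distributions can be checked empirically. The sketch below does so with standard-library stand-ins rather than `SafeRandom`, so it runs without Synthorus installed. The seeded generator and the sample size are assumptions chosen only to make the check repeatable.

```python
import random
import statistics

check_rng = random.Random(42)  # seeded stand-in generator, for a repeatable check only
samples = 50_000

# Laplace(mu=0, b) as the difference of two exponential variates with mean b.
b = 1.0
laplace = [check_rng.expovariate(1 / b) - check_rng.expovariate(1 / b) for _ in range(samples)]

# Binomial(n, p) as the number of successes in n Bernoulli trials.
n, p = 10, 0.5
binomial = [sum(check_rng.random() < p for _ in range(n)) for _ in range(samples)]

print('Laplace variance: ', statistics.pvariance(laplace), 'expected', 2 * b ** 2)
print('Binomial mean:    ', statistics.fmean(binomial), 'expected', n * p)
print('Binomial variance:', statistics.pvariance(binomial), 'expected', n * p * (1 - p))
```

The empirical values should land close to the expected ones, with the gap shrinking as `samples` grows.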

## Sampling for data generation ##

Synthorus also uses randomness when sampling a probabilistic graphical model (PGM) for generating synthetic data records. It is important that the records have the statistical properties defined by the PGM. They should avoid statistical biases and spurious correlations.

The functionality of sampling a PGM is delegated to Compiled Knowledge, in particular a so-called [WMC Direct Sampler](https://compiled-knowledge.readthedocs.io/en/latest/2_08_sampling.html#wmc-direct-sampler).

This sampler is based on [inverse transform sampling](https://en.wikipedia.org/wiki/Inverse_transform_sampling), which is a type of independent sampling: each record is drawn independently, so nearby records in a sequence of sampled records are not correlated.
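The idea behind inverse transform sampling can be sketched for a single discrete variable as follows. This is a generic illustration, not the WMC Direct Sampler itself. The helper name `inverse_transform_sample`, the states, and the probabilities are all made up for the example.

```python
import bisect
import itertools
import random
from collections import Counter


def inverse_transform_sample(states, probabilities, rng):
    """Draw one state by inverting the cumulative distribution at a uniform variate."""
    cumulative = list(itertools.accumulate(probabilities))
    u = rng.random() * cumulative[-1]  # uniform in [0, total probability)
    # Index of the first cumulative value strictly greater than u.
    index = bisect.bisect_right(cumulative, u)
    # Guard against floating-point rounding at the upper end.
    return states[min(index, len(states) - 1)]


states = ['1', '2', '3']
probabilities = [0.2, 0.5, 0.3]  # made-up distribution

sample_rng = random.Random(0)  # seeded only so the sketch is repeatable
draws = [inverse_transform_sample(states, probabilities, sample_rng) for _ in range(10_000)]
counts = Counter(draws)
for state in states:
    print(state, counts[state] / len(draws))
```

Each draw uses a fresh uniform variate and no information from earlier draws, which is why successive samples are independent.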

The algorithm used by the sampler is described and evaluated in the publication:
Suresh, S., Drake, B. (2025). Sampling of Large Probabilistic Graphical Models Using Arithmetic Circuits. AI 2024: Advances in Artificial Intelligence. AI 2024. Lecture Notes in Computer Science, vol 15443. https://doi.org/10.1007/978-981-96-0351-0_13.

Ultimately, these samplers call Python's `random.random` and `random.randrange` functions to generate
floating-point and integer random variates.

Here is an example of the WMC Direct Sampler, using the example `student` PGM from Compiled Knowledge.

In [7]:
from ck import example
from ck.pgm import PGM

student_pgm: PGM = example.Student()

print('Random Variables:', *student_pgm.rvs)


Random Variables: difficult intelligent grade sat letter


Here we compile and create the sampler...

In [8]:
from ck.sampling.sampler import Sampler
from ck.pgm_compiler import DEFAULT_PGM_COMPILER
from ck.pgm_circuit.wmc_program import WMCProgram

wmc = WMCProgram(DEFAULT_PGM_COMPILER(student_pgm))
sampler: Sampler = wmc.sample_direct()

Here we generate random records using the sampler, showing them in CSV format.

In [9]:
print(*student_pgm.rvs, sep=', ')
for sample in sampler.take(5):
    print(student_pgm.state_str(sample))

difficult, intelligent, grade, sat, letter
No, No, 2, Low, No
Yes, Yes, 3, Low, Yes
Yes, Yes, 1, High, No
Yes, No, 2, Low, Yes
Yes, Yes, 3, High, Yes
