# Meta-config generation for synthetic data

We generate configuration files ("meta-configs") that can in turn generate synthetic data. The whole process is seeded and therefore reproducible, which makes it useful for testing the pipeline. The working directory contains one directory per number of covariates.
In each of these directories, we generate a set of configurations that produce causal graphs with that number of covariates. The functional form that links each covariate to its parents is one of the following:

1. Linear non-Gaussian datasets
2. Non-linear Gaussian datasets with parametric assumptions
    1. The invertible function is a polynomial of degree 3
    2. The invertible function is $x + \sin(x)$
3. Non-linear Gaussian datasets with no parametric assumptions

We explain below how each of these datasets is generated. The causal graph generators are separate entities, and we sample from five different causal graph types:

1. Chains (only one correct ordering and very sparse)
2. V-Structures (many correct orderings and sparse) - one wired covariate and n-1 normal covariates
3. Forks (many correct orderings and sparse) - one normal covariate and n-1 wired covariates
4. Erdős-Rényi (many correct orderings and dense) with p set to 0.65
5. Full (only one correct ordering and dense)
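
The five graph types above can be sketched as adjacency matrices. This is only an illustration under stated assumptions: `sketch_adjacency` is a hypothetical helper, not the `ocd.data.scm.GraphGenerator` used by the configurations, and it assumes that nodes are already indexed in a valid causal order, so every edge points from a lower to a higher index.

```python
import numpy as np

def sketch_adjacency(graph_type: str, n: int, p: float = 0.65, seed: int = 0) -> np.ndarray:
    """Return an n x n 0/1 matrix A where A[i, j] = 1 means an edge i -> j.

    Hypothetical helper for illustration only. Nodes are assumed to be
    indexed in a valid causal order, so A is strictly upper triangular.
    """
    rng = np.random.default_rng(seed)
    A = np.zeros((n, n), dtype=int)
    if graph_type == "chain":
        A[np.arange(n - 1), np.arange(1, n)] = 1   # 0 -> 1 -> ... -> n-1
    elif graph_type == "v_structure":
        A[: n - 1, n - 1] = 1                      # n-1 parents share one child
    elif graph_type == "fork":
        A[0, 1:] = 1                               # one parent has n-1 children
    elif graph_type == "erdos_renyi":
        # keep each forward edge independently with probability p
        A = np.triu(rng.random((n, n)) < p, k=1).astype(int)
    elif graph_type == "full":
        A = np.triu(np.ones((n, n), dtype=int), k=1)  # every forward edge
    else:
        raise ValueError(f"unknown graph_type: {graph_type}")
    return A

for graph_type in ["chain", "v_structure", "fork", "erdos_renyi", "full"]:
    print(graph_type, sketch_adjacency(graph_type, n=4).sum(), "edges")
```

Chains, V-structures and forks each have n-1 edges, while the full graph has n(n-1)/2, which is what "sparse" and "dense" refer to in the list.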


In [1]:
import numpy as np

def generate_graph_generator_args(n: int):
    # graph generator
    graph_generator_args = {'n': n, 'seed': np.random.randint(1000)}
    type_ind = np.random.randint(5)
    if type_ind == 0 or n <= 2:
        graph_generator_args['graph_type'] = 'chain'
    elif type_ind == 1:
        graph_generator_args['graph_type'] = 'fork' if np.random.randint(2) else 'v_structure'
    elif type_ind == 2:
        graph_generator_args['graph_type'] = 'full'
    elif type_ind == 3:
        graph_generator_args['graph_type'] = 'erdos_renyi'
        graph_generator_args['p'] = 0.65
    else:
        graph_generator_args['graph_type'] = 'fork'
    
    return graph_generator_args
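
The branching in `generate_graph_generator_args` does not pick the five graph types with equal probability. The sketch below works out the exact distribution, assuming only that `np.random.randint(k)` is uniform over `0..k-1`; `graph_type_probabilities` is a hypothetical helper written for this illustration.

```python
from collections import defaultdict
from fractions import Fraction

def graph_type_probabilities(n: int) -> dict:
    """Exact distribution over graph types implied by the branching above."""
    if n <= 2:
        # the `n <= 2` guard forces a chain regardless of type_ind
        return {"chain": Fraction(1)}
    fifth = Fraction(1, 5)
    probs = defaultdict(Fraction)
    probs["chain"] += fifth            # type_ind == 0
    probs["fork"] += fifth / 2         # type_ind == 1 and the coin flip is 1
    probs["v_structure"] += fifth / 2  # type_ind == 1 and the coin flip is 0
    probs["full"] += fifth             # type_ind == 2
    probs["erdos_renyi"] += fifth      # type_ind == 3
    probs["fork"] += fifth             # type_ind == 4 (the else branch)
    return dict(probs)

for name, prob in graph_type_probabilities(5).items():
    print(f"{name}: {prob}")
```

For n > 2, forks are therefore drawn three times as often as V-structures.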

In [2]:
def get_conf(scm_generator, scm_generator_args, observation_size):
    dataset_args = {
        'seed': np.random.randint(1000),
        'scm_generator': scm_generator,
        'scm_generator_args': scm_generator_args,
        'observation_size': observation_size
    }

    conf = {
        'class_path': 'lightning_toolbox.DataModule',
        'init_args': {
            'dataset': 'ocd.data.SyntheticOCDDataset',
            'dataset_args': dataset_args,
            'val_size': 0.1,
            'batch_size': 128,
        },
    }
    
    return conf

The following cell imports some prerequisites and sets the random seed for reproducibility.

In [3]:
import typing as th
from ruamel import yaml
np.random.seed(100)

## Functional form Generator

Here we cover the three types of synthetic SCMs that we consider. For each one, we describe the data-generating process and provide code that, when run, generates the configurations for the synthetic data.

### Linear non-Gaussian datasets

In these datasets, each covariate equals a linear combination of its parent covariates plus some noise. The noise is sampled from a non-Gaussian distribution, either uniform or Laplace. The following cell defines a function that generates the configurations; see its docstring for details.
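
As an illustration of this data-generating process, here is a minimal sketch of ancestral sampling from a linear non-Gaussian SCM. It is not the code in `ocd.data.synthetic.LinearNonGaussianSCMGenerator`: the function `sample_linear_non_gaussian`, the interpretation of the weight range as a uniform interval, and the noise scales are all assumptions made for this example.

```python
import numpy as np

def sample_linear_non_gaussian(adjacency, n_samples, noise_type="laplace",
                               weight_range=(-1.0, 1.0), seed=0):
    """Sample x_j = sum_i W[i, j] * x_i + eps_j on a DAG.

    Assumes adjacency[i, j] = 1 means i -> j and that nodes are indexed in a
    valid causal order (strictly upper-triangular adjacency matrix).
    """
    rng = np.random.default_rng(seed)
    adjacency = np.asarray(adjacency)
    n = adjacency.shape[0]
    # one weight per existing edge, drawn uniformly from weight_range (assumption)
    weights = adjacency * rng.uniform(*weight_range, size=(n, n))
    if noise_type == "laplace":
        noise = rng.laplace(size=(n_samples, n))
    elif noise_type == "uniform":
        noise = rng.uniform(-1.0, 1.0, size=(n_samples, n))
    else:
        raise ValueError(f"unknown noise_type: {noise_type}")
    x = np.zeros((n_samples, n))
    for j in range(n):
        # parents of j have smaller indices, so their columns are already filled
        x[:, j] = x[:, :j] @ weights[:j, j] + noise[:, j]
    return x, weights

# chain 0 -> 1 -> 2
chain = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])
samples, weights = sample_linear_non_gaussian(chain, n_samples=1000, noise_type="uniform")
print(samples.shape)
```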

In [4]:
def generate_linear_non_gaussan_configurations(n_cov: th.List[int],
                                               n_configs: th.List[int],
                                               observation_sizes: th.List[int]):
    """
    This function generates a set of yaml files that contain the meta-data required to generate the synthetic data.
    The files are formatted as follows:
    
    >> linear_non_gaussian_{n_cov}_{observation_size}_{graph_type}_{noise_type}.yaml
    
    For more information, check out ocd/data/synthetic/linear_non_gaussian_scm_generator.py
    
    Args:
        n_cov: A list where n_cov[i] indicates the number of covariates in the ith set of configurations
        n_configs: A list where n_configs[i] indicates the number of configurations to generate for the ith set of configurations
        observation_sizes: A list where observation_sizes[i] indicates the number of observations to generate for the ith set of configurations
    """
    
    for n, observation_size, n_config in zip(n_cov, observation_sizes, n_configs):
        for _ in range(n_config):
            
            scm_generator_args = {}
            scm_generator_args['graph_generator'] = 'ocd.data.scm.GraphGenerator'
            scm_generator_args['graph_generator_args'] = generate_graph_generator_args(n)    
            scm_generator_args['weight'] = [-1.0, 1.0]
            
            scm_generator_args['seed'] = np.random.randint(1000)
            scm_generator_args['noise_type'] = "laplace" if np.random.randint(2) == 0 else "uniform"

            scm_generator = 'ocd.data.synthetic.LinearNonGaussianSCMGenerator'
            
            conf = get_conf(scm_generator, scm_generator_args, observation_size)
            
            conf_name = f"linear_non_gaussian_{n}_{observation_size}_{scm_generator_args['graph_generator_args']['graph_type']}_{scm_generator_args['noise_type']}.yaml"
            
            # write conf to conf_name in yaml format
            with open(conf_name, 'w') as f:
                # yaml.dump(conf, f)
                yaml.safe_dump(conf, f, indent=4)
                

            

Run the following cell to generate the actual configurations. It writes all the YAML files to the working directory. You can change the seed to generate different configurations; the set of configurations we have already generated can be reproduced by setting the seed to 100.

In [5]:
np.random.seed(100)

ns = [2, 3, 4, 5, 10, 25, 50, 100]
n_configs = [30, 30, 30, 30, 15, 10, 3, 3]
observation_size = [500, 500, 1000, 1000, 10000, 10000, 10000, 10000]

generate_linear_non_gaussan_configurations(ns, n_configs, observation_size)
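
The file name is fully determined by the number of covariates, the observation size, the graph type and the noise type. The sketch below enumerates the names that can occur for one setting; `possible_config_names` is a hypothetical helper that mirrors the naming pattern in the function above and the `n <= 2` guard in `generate_graph_generator_args`.

```python
from itertools import product

GRAPH_TYPES = ["chain", "fork", "v_structure", "full", "erdos_renyi"]
NOISE_TYPES = ["laplace", "uniform"]

def possible_config_names(n: int, observation_size: int) -> list:
    """List every file name that can be produced for one (n, observation_size) pair."""
    # with n <= 2 the graph generator arguments always use a chain
    graph_types = ["chain"] if n <= 2 else GRAPH_TYPES
    return [
        f"linear_non_gaussian_{n}_{observation_size}_{graph}_{noise}.yaml"
        for graph, noise in product(graph_types, NOISE_TYPES)
    ]

print(len(possible_config_names(2, 500)))    # names available for n = 2
print(len(possible_config_names(5, 1000)))   # names available for n = 5
```

Because the files are opened in `'w'` mode, configurations that map to the same name overwrite one another, so the number of files on disk for a given number of covariates can be smaller than the corresponding entry of `n_configs`.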

### Parametric non-linear Gaussian datasets

In this set, each linking function applies a non-linear transformation to a random linear combination of the covariate's parents. The non-linear transformation is either a polynomial of degree 3 or $x + \sin(x)$. The noise is sampled from a Gaussian distribution. The following cell defines the function that generates the configurations.
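
Both transformations are invertible because they are strictly increasing: the derivative of $x^3 + t$ is $3x^2$ and the derivative of $x + \sin(x)$ is $1 + \cos(x)$, and each vanishes only at isolated points. A small numerical check of this property is sketched below. The helper names are hypothetical, and the cubic omits the standardization step that the configured function applies first (an increasing affine map, so it does not affect monotonicity).

```python
import numpy as np

def cube_and_shift(x, t=3.0):
    # simplified version of the cubic transformation, without standardization
    return x**3 + t

def sin_plus_x(x):
    return np.sin(x) + x

def is_strictly_increasing(func, grid):
    """Check that func increases between every pair of consecutive grid points."""
    return bool(np.all(np.diff(func(grid)) > 0))

grid = np.linspace(-10.0, 10.0, 2001)
print(is_strictly_increasing(cube_and_shift, grid))
print(is_strictly_increasing(sin_plus_x, grid))
print(is_strictly_increasing(np.sin, grid))  # sin alone is not monotone
```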

In [8]:

s_func = """def func(x):
    return numpy.log(1 + numpy.exp(x))"""

def get_t_func_1(t):
    return f"""def func(x):
    x_mean = numpy.mean(x)
    x_std = numpy.std(x)
    if x_std == 0:
        x_std = 1
    x = (x - x_mean) / x_std
    return x**3 + {t}"""

def get_t_func_2():
    return f"""def func(x):
    return numpy.sin(x) + x"""
    
def non_linear_gaussan_configurations(n_cov: th.List[int],
                                      n_configs: th.List[int],
                                      observation_sizes: th.List[int]):
    """
    This function generates a set of yaml files that contain the meta-data required to generate the synthetic data.
    The files are formatted as follows:
    
    >> parametric_non_linear_gaussian_{n_cov}_{observation_size}_{graph_type}_{type}.yaml
    
    type: sin_plus_x or cube_dislocate
    
    For more information check out ocd/data/synthetic/non_linear_invertible_gaussian_scm_generator.py
    Args:
        n_cov: A list where n_cov[i] indicates the number of covariates in the ith set of configurations
        n_configs: A list where n_configs[i] indicates the number of configurations to generate for the ith set of configurations
        observation_sizes: A list where observation_sizes[i] indicates the number of observations to generate for the ith set of configurations
    """
      
    for n, observation_size, n_config in zip(n_cov, observation_sizes, n_configs):
        for _ in range(n_config):
            
            scm_generator_args = {}
            scm_generator_args['graph_generator'] = 'ocd.data.scm.GraphGenerator'
            scm_generator_args['graph_generator_args'] = generate_graph_generator_args(n)    
            
            # 0 -> cube and dislocate, 1 -> sin(x) + x
            # (named t_type to avoid shadowing the built-in `type`)
            t_type = np.random.randint(2)
            
            scm_generator_args['seed'] = np.random.randint(1000)
            scm_generator_args['std'] = 1.0
            scm_generator_args['mean'] = 0.0
            scm_generator_args['weight_s'] = [0.5, 1.5]
            scm_generator_args['weight_t'] = [0.5, 1.5]
            # s is the softplus function, passed as source code
            scm_generator_args['s_function'] = {
                'function_descriptor': s_func,
                'function_of_interest': 'func'
            }
            scm_generator_args['s_function_signature'] = 'softplus'
            
            if t_type == 0:
                # Cube and dislocate function
                type_naming = 'cube_dislocate'
                scm_generator_args['t_function'] = {
                    'function_descriptor': get_t_func_1(np.random.randint(10)),
                    'function_of_interest': 'func'
                }
                scm_generator_args['t_function_signature'] = 'cube_and_dislocate'
            else:
                # Sine function
                type_naming = 'sin_plus_x'
                scm_generator_args['t_function'] = {
                    'function_descriptor': get_t_func_2(),
                    'function_of_interest': 'func'
                }
                scm_generator_args['t_function_signature'] = 'sin_plus_x'

            scm_generator = 'ocd.data.synthetic.InvertibleModulatedGaussianSCMGenerator'
            
            # argument order must match get_conf(scm_generator, scm_generator_args, observation_size)
            conf = get_conf(scm_generator, scm_generator_args, observation_size)
            
            conf_name = f"parametric_non_linear_gaussian_{n}_{observation_size}_{scm_generator_args['graph_generator_args']['graph_type']}_{type_naming}.yaml"
            
            # write conf to conf_name in yaml format
            with open(conf_name, 'w') as f:
                # yaml.dump(conf, f)
                yaml.safe_dump(conf, f, indent=4)
                

            

To generate the configurations, run the following block. These datasets use a larger `observation_size`.

In [9]:
np.random.seed(100)
ns = [2, 3, 4, 5, 10, 25, 50, 100]
n_configs = [10, 10, 10, 10, 10, 10, 3, 3]
observation_size = [10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000]

non_linear_gaussan_configurations(ns, n_configs, observation_size)

### Non-parametric generators with Gaussian Processes



Here we generate the linking functions by sampling from a Gaussian process. A function sampled this way is non-linear and most likely invertible.
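
To illustrate what sampling a function from a Gaussian process means, here is a minimal sketch that draws function values on a grid from a zero-mean process with an RBF kernel. It is not the code in `ocd/data/synthetic/gaussian_process.py`: the helper `sample_gp_function` and the kernel parametrization $k(x, x') = \sigma^2 \exp(-\gamma (x - x')^2)$ are assumptions, chosen to match the `gamma` and `variance` names used in the configuration.

```python
import numpy as np

def sample_gp_function(x, gamma=1.0, variance=1.0, jitter=1e-6, seed=0):
    """Draw one sample of f(x) from a zero-mean GP with an RBF kernel.

    Assumed kernel: k(a, b) = variance * exp(-gamma * (a - b)**2).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    sq_dist = (x[:, None] - x[None, :]) ** 2
    kernel = variance * np.exp(-gamma * sq_dist)
    # a small diagonal jitter keeps the Cholesky factorization numerically stable
    chol = np.linalg.cholesky(kernel + jitter * np.eye(len(x)))
    # if z ~ N(0, I), then chol @ z ~ N(0, kernel + jitter * I)
    return chol @ rng.standard_normal(len(x))

inputs = np.linspace(-3.0, 3.0, 50)
values = sample_gp_function(inputs, gamma=1.0, variance=1.0)
print(values.shape)
```

The Cholesky factorization costs on the order of the cube of the number of input points, which is one reason sampling from a Gaussian process becomes expensive as the number of observations grows.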

In [12]:

    
def non_parametric_gaussian_process_generator(
                                    n_cov: th.List[int],
                                    n_configs: th.List[int],
                                    observation_sizes: th.List[int]):
    """
    This function generates a set of yaml files that contain the meta-data required to generate the synthetic data.
    The files are formatted as follows:
    
    >> non_parametric_non_linear_gaussian_{n_cov}_{observation_size}_{graph_type}.yaml
    
    For more information check out ocd/data/synthetic/gaussian_process.py
    
    Args:
        n_cov: A list where n_cov[i] indicates the number of covariates in the ith set of configurations
        n_configs: A list where n_configs[i] indicates the number of configurations to generate for the ith set of configurations
        observation_sizes: A list where observation_sizes[i] indicates the number of observations to generate for the ith set of configurations
    """
      
    for n, observation_size, n_config in zip(n_cov, observation_sizes, n_configs):
        for _ in range(n_config):
            
            scm_generator_args = {}
            scm_generator_args['graph_generator'] = 'ocd.data.scm.GraphGenerator'
            scm_generator_args['graph_generator_args'] = generate_graph_generator_args(n)    
            

            scm_generator_args['seed'] = np.random.randint(1000)
            scm_generator_args['noise_std'] = 1.0
            scm_generator_args['noise_mean'] = 0.0
            scm_generator_args['s_gamma_rbf_kernel'] = 1.0
            scm_generator_args['s_variance_rbf_kernel'] = 1.0
            scm_generator_args['s_mean_function_weights'] = [0.0, 0.0]
            scm_generator_args['t_gamma_rbf_kernel'] = 1.0
            scm_generator_args['t_variance_rbf_kernel'] = 1.0
            scm_generator_args['t_mean_function_weights'] = [0.0, 0.0]
            
            scm_generator = 'ocd.data.synthetic.GaussianProcessBasedSCMGenerator'
            
            # argument order must match get_conf(scm_generator, scm_generator_args, observation_size)
            conf = get_conf(scm_generator, scm_generator_args, observation_size)
            
            conf_name = f"non_parametric_non_linear_gaussian_{n}_{observation_size}_{scm_generator_args['graph_generator_args']['graph_type']}.yaml"
            
            # write conf to conf_name in yaml format
            with open(conf_name, 'w') as f:
                # yaml.dump(conf, f)
                yaml.safe_dump(conf, f, indent=4)
                

            

To generate the configurations, run the following block. These datasets use a smaller `observation_size` because they are computationally expensive to generate.

In [11]:
np.random.seed(100)

ns = [2, 3, 4, 5, 10, 25, 50, 100]
n_configs = [10, 10, 10, 10, 10, 10, 3, 3]
observation_size = [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]

non_parametric_gaussian_process_generator(ns, n_configs, observation_size)