Problem Description
With the auto-detection of constraints, we are now able to easily discover and understand which constraints to apply to complex, multi-table datasets. However, there's no good way to list or store the constraints that should be used for a given dataset. This makes it difficult to perform demos or benchmarks of datasets with constraints.
We should add a way to store the constraints that should be applied to a dataset, and load them to apply them to a synthesizer.
Expected behavior
Add the set_constraints method to all synthesizers. Given a JSON file of constraints, the method should instantiate and apply all the constraints to the synthesizer.
The get_constraints method should also gain a new parameter, output_filepath, which defaults to None. If provided, the method should write the constraints currently applied to the synthesizer to the given JSON file.
Constraints JSON File Spec
The constraints for a dataset can be specified as a list of dictionaries, where each dictionary defines a particular constraint. Each dictionary has the following keys:
class_name: The name of the constraint class. This can be a class in either the sdv.cag module or the sdv.cag.sandbox module. Ad-hoc programmable constraints are not supported here! They must be added to the sandbox module to be used.
parameters: A dictionary that contains all the input parameters for that particular constraint class.
The constraints should be listed in the order in which they need to be applied.
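To make the spec concrete, here is a minimal sketch of what such a file could contain, built in Python and serialized with the standard json module. The table name, column names, and parameter names are illustrative assumptions, not a confirmed schema for any particular constraint class.

```python
import json

# Illustrative only: the table, column, and parameter names below are
# assumptions made for this example, not a confirmed schema.
constraints = [
    {
        "class_name": "FixedCombinations",
        "parameters": {
            "table_name": "guests",
            "column_names": ["city", "state"],
        },
    },
    {
        "class_name": "Inequality",
        "parameters": {
            "table_name": "guests",
            "low_column_name": "checkin_date",
            "high_column_name": "checkout_date",
        },
    },
]

# The list order is the order in which the constraints would be applied.
constraints_json = json.dumps(constraints, indent=2)
print(constraints_json)
```

Each entry carries exactly the two keys described above, and the top-level value is a list so that ordering is preserved.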
<synthesizer>.set_constraints
Given the file of constraints, this method should create the constraint objects and add them to the synthesizer.
- If any constraints have already been added to the synthesizer, then this method should raise an error stating that we cannot set constraints if constraints have already been added to the synthesizer.
- This method should add constraints from the file one at a time. The constraint classes should be found in the main sdv.cag module or in sdv.cag.sandbox (we should check both places, preferring the sdv.cag module).
- If a constraint cannot be added for some reason (e.g. the class is not found, or a table/column it references cannot be found), then the method should warn the user that the constraint cannot be added and skip to the next constraint.
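The loading rules above could be sketched roughly as follows. This is not the SDV implementation: the two dictionaries are hypothetical stand-ins for looking a class up in sdv.cag and then sdv.cag.sandbox, Inequality is a stand-in class, and build_constraints is a made-up helper name. It also assumes the first bullet means the method errors when constraints already exist, and it does not model the table/column check.

```python
import json
import warnings


class Inequality:
    """Stand-in for a constraint class found in the main module."""

    def __init__(self, low_column_name, high_column_name):
        self.low_column_name = low_column_name
        self.high_column_name = high_column_name


# Hypothetical lookup tables standing in for sdv.cag and sdv.cag.sandbox.
MAIN_CLASSES = {"Inequality": Inequality}
SANDBOX_CLASSES = {}


def build_constraints(specs, existing_constraints=()):
    """Instantiate constraints one at a time, warning on and skipping failures."""
    if existing_constraints:
        raise ValueError(
            "Cannot set constraints: constraints have already been added."
        )
    built = []
    for spec in specs:
        name = spec.get("class_name")
        # Prefer the main module, then fall back to the sandbox.
        constraint_class = MAIN_CLASSES.get(name) or SANDBOX_CLASSES.get(name)
        if constraint_class is None:
            warnings.warn(f"Constraint {name!r} cannot be added: class not found.")
            continue
        try:
            built.append(constraint_class(**spec.get("parameters", {})))
        except TypeError as error:
            # Raised here when the parameters do not match the constructor.
            warnings.warn(f"Constraint {name!r} cannot be added: {error}")
    return built


specs = json.loads("""[
  {"class_name": "Inequality",
   "parameters": {"low_column_name": "checkin_date",
                  "high_column_name": "checkout_date"}},
  {"class_name": "NotARealConstraint", "parameters": {}}
]""")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    built = build_constraints(specs)
```

In this run the first entry is instantiated, and the second produces a warning and is skipped, so the remaining constraints still load.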
Parameters:
filepath (str, required): A filepath of the constraints JSON file, which should be in the format specified in the previous section.
Output: None
NOTE: After setting constraints from a file, a user should still be able to add additional constraints through the add_constraints method.
<synthesizer>.get_constraints
This function already exists. Currently it returns a list of all the constraint objects that the synthesizer contains.
We should modify this function to include a parameter called output_filepath. If provided, the function should write all the constraint information to a JSON file at that path.
Parameters:
output_filepath (str, optional): An optional string containing the path to a JSON file to write the constraints. The JSON file should not already exist in the filesystem. Defaults to None.
Output: The function should always return a list of constraints. If the output filepath is provided, then it should additionally write a file with the constraints JSON.
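A rough sketch of the proposed behavior, assuming a simplified synthesizer and a constraint class that keeps its input parameters in a `parameters` attribute. SketchSynthesizer, the stand-in Inequality class, and that attribute are hypothetical names used only for illustration. Opening the file in exclusive-creation mode ("x") is one way to enforce that the file does not already exist.

```python
import json
import os
import tempfile


class Inequality:
    """Stand-in constraint that remembers its own input parameters."""

    def __init__(self, low_column_name, high_column_name):
        self.parameters = {
            "low_column_name": low_column_name,
            "high_column_name": high_column_name,
        }


class SketchSynthesizer:
    """Hypothetical, heavily simplified synthesizer."""

    def __init__(self):
        self._constraints = []

    def add_constraints(self, constraints):
        self._constraints.extend(constraints)

    def get_constraints(self, output_filepath=None):
        if output_filepath is not None:
            payload = [
                {"class_name": type(c).__name__, "parameters": c.parameters}
                for c in self._constraints
            ]
            # Mode "x" raises FileExistsError if the file already exists.
            with open(output_filepath, "x") as output_file:
                json.dump(payload, output_file, indent=2)
        # The list is returned whether or not a file was written.
        return list(self._constraints)


synthesizer = SketchSynthesizer()
synthesizer.add_constraints([Inequality("checkin_date", "checkout_date")])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "constraints.json")
    returned = synthesizer.get_constraints(output_filepath=path)
    with open(path) as written_file:
        written = json.load(written_file)
```

The written file has the same list-of-dictionaries shape as the spec, so in principle it could be passed straight back to set_constraints.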
Additional context
Moved to Community from datacebo/sdv-enterprise#2060