KeyError in CTGANSynthesizer when applying FixedCombinations constraint #1717

Closed
pvk-developer opened this issue Dec 12, 2023 · 1 comment · Fixed by #1718
Labels: bug

@pvk-developer

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.8.0
  • Python version: Any
  • Operating System: Any

Error Description

When CTGANSynthesizer estimates the number of columns it will generate, it indexes the dictionary of transformers directly by column name. If a constraint that produces categorical or boolean output has been applied, that lookup can fail for one of the constrained columns and raises the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[15], line 19
      8 my_constraint = {
      9     'constraint_class': 'FixedCombinations',
     10     'constraint_parameters': {
     11         'column_names': ['high_spec', 'degree_type']
     12     }
     13 }
     15 my_synthesizer.add_constraints(constraints=[
     16     my_constraint
     17 ])
---> 19 my_synthesizer.fit(data)

File ~/Projects/sdv-dev/SDV/sdv/single_table/base.py:436, in BaseSynthesizer.fit(self, data)
    434 self._data_processor.reset_sampling()
    435 self._random_state_set = False
--> 436 processed_data = self._preprocess(data)
    437 self.fit_processed_data(processed_data)

File ~/Projects/sdv-dev/SDV/sdv/single_table/ctgan.py:192, in CTGANSynthesizer._preprocess(self, data)
    190 self.validate(data)
    191 self._data_processor.fit(data)
--> 192 self._print_warning(data)
    194 return self._data_processor.transform(data)

File ~/Projects/sdv-dev/SDV/sdv/single_table/ctgan.py:167, in CTGANSynthesizer._print_warning(self, data)
    165 def _print_warning(self, data):
    166     """Print a warning if the number of columns generated is over 1000."""
--> 167     dict_generated_columns = self._estimate_num_columns(data)
    168     if sum(dict_generated_columns.values()) > 1000:
    169         header = {'Original Column Name  ': 'Est # of Columns (CTGAN)'}

File ~/Projects/sdv-dev/SDV/sdv/single_table/ctgan.py:157, in CTGANSynthesizer._estimate_num_columns(self, data)
    154     num_generated_columns[column] = 11
    156 elif sdtypes[column] in {'categorical', 'boolean'}:
--> 157     if transformers[column] is None:
    158         num_categories = data[column].fillna(np.nan).nunique(dropna=False)
    159         num_generated_columns[column] = num_categories

KeyError: 'high_spec'
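
The failure mode can be illustrated without SDV. The following is a minimal sketch, assuming the constraint replaces the original columns with a single combined column, so that the transformers dictionary is keyed by the combined name while the sdtypes still list the original columns. The dictionaries and the combined column name `high_spec#degree_type` are illustrative stand-ins, not SDV's internal objects.

```python
# Hypothetical stand-ins for the state after the constraint has been applied.
sdtypes = {'high_spec': 'categorical', 'degree_type': 'categorical'}
transformers = {'high_spec#degree_type': None}  # assumed combined column name


def lookup_directly(column):
    """Index the dictionary the same way the traceback does."""
    return transformers[column]


try:
    lookup_directly('high_spec')
    error = None
except KeyError as exc:
    # The original column is not a key, so direct indexing fails.
    error = exc

print(repr(error))
```

Any column that appears in the metadata but not in the transformers dictionary would fail the same way.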

Steps to reproduce

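from sdv.datasets.demo import download_demo
from sdv.single_table import CTGANSynthesizer
from sdv.constraints import FixedCombinations

# Load the demo dataset that contains the 'high_spec' and 'degree_type' columns.
data, metadata = download_demo('single_table', 'student_placements')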

my_synthesizer = CTGANSynthesizer(metadata)

my_constraint = {
    'constraint_class': 'FixedCombinations',
    'constraint_parameters': {
        'column_names': ['high_spec', 'degree_type']
    }
}

my_synthesizer.add_constraints(constraints=[
    my_constraint
])

my_synthesizer.fit(data)
pvk-developer added the bug and new labels and removed the new label on Dec 12, 2023
npatki changed the title from "KeyError in CTGANSynthesizer during _estimate_num_columns when applying constraints that return a categorical or boolean" to "KeyError in CTGANSynthesizer when applying constraints that return a categorical or boolean" on Dec 12, 2023
npatki changed the title from "KeyError in CTGANSynthesizer when applying constraints that return a categorical or boolean" to "KeyError in CTGANSynthesizer when applying FixedCombinations constraint" on Dec 12, 2023
@npatki

npatki commented Dec 12, 2023

Updated the title to something more user-facing. Right now, FixedCombinations is the only constraint that produces categorical or boolean columns. But we should also validate that this new column estimation logic won't break with any custom constraints.

Note that the estimation logic is meant to be just that -- an estimation. So to fix this issue, it's not necessary to compute the constraint and get the exact # of columns.
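
One way to keep the estimate approximate is to tolerate missing transformer entries instead of computing the constraint. The sketch below is an assumption about how that could look, not the change merged in #1718. `estimate_num_columns` is a hypothetical helper, plain dictionaries stand in for the DataFrame and for SDV's internals, and the constant 11 is taken from the traceback, where its branch is assumed to be the numerical one.

```python
def estimate_num_columns(data, sdtypes, transformers):
    """Rough per-column estimate; a hypothetical helper, not SDV's code."""
    estimates = {}
    for column, values in data.items():
        sdtype = sdtypes.get(column)
        if sdtype == 'numerical':
            # Constant seen in the traceback; assumed to be the numerical branch.
            estimates[column] = 11
        elif sdtype in {'categorical', 'boolean'}:
            # .get() returns None for columns the constraint removed,
            # so the estimate falls back to counting distinct values.
            if transformers.get(column) is None:
                estimates[column] = len(set(values))
    return estimates


data = {
    'high_spec': ['Science', 'Commerce', 'Science'],
    'degree_type': ['Sci&Tech', 'Comm&Mgmt', 'Sci&Tech'],
    'salary': [1.0, 2.0, 3.0],
}
sdtypes = {
    'high_spec': 'categorical',
    'degree_type': 'categorical',
    'salary': 'numerical',
}
# Assumed state after the constraint: only the combined column is registered.
transformers = {'high_spec#degree_type': None, 'salary': None}

result = estimate_num_columns(data, sdtypes, transformers)
print(result)
```

Because the lookup never raises, the same approach would also cover custom constraints that add or remove columns, at the cost of a less precise count.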
