Environment Details
Please indicate the following details about the environment in which you found the bug:
CTGAN version: 0.8.0
Python version: 3.9.6
Pandas version: 2.0.3
Operating System: Ubuntu 22
Error Description
Hello, when trying to use CTGAN to sample 18,000 rows of synthetic data for the COMPAS dataset (https://www.kaggle.com/datasets/danofer/compass), I came across the following error:
ValueError: Shape of passed values is (18000, 2), indices imply (18000, 3)
Although the error is raised by Pandas, it originates in the function "_inverse_transform_continuous" in the file "data_transformer.py", specifically at the line:
data = pd.DataFrame(column_data[:, :2], columns=list(gm.get_output_sdtypes()))
Note that I am not sure whether the Pandas version is the cause.
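For readers following along, the same Pandas error can be triggered outside CTGAN whenever a two-column array is paired with three column names. The sketch below is only an illustration of that mismatch: the array and the three names are hypothetical stand-ins for column_data[:, :2] and list(gm.get_output_sdtypes()), not values taken from the library.

```python
import numpy as np
import pandas as pd

# Stand-in for the slice column_data[:, :2]: 5 rows, exactly 2 columns.
values = np.zeros((5, 2))

# Hypothetical names standing in for list(gm.get_output_sdtypes()) in the
# case where it yields three entries instead of two.
names = ["output_0", "output_1", "output_2"]

message = None
try:
    pd.DataFrame(values, columns=names)
except ValueError as exc:
    # Pandas rejects the frame because 2 data columns cannot fill 3 names.
    message = str(exc)
    print(message)
```

If this reading is right, the question is why three output names are reported for this column when the code slices out only two.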
Steps to reproduce
First, I trained a model on the COMPAS data (file: cox-violent-parsed_filt.csv) after removing the following columns: to_remove_compas = ["id", "name", "first", "last", "dob", "c_jail_in", "c_jail_out", "c_charge_desc", "r_offense_date", "r_charge_desc", "r_jail_in", "violent_recid", "vr_offense_date", "screening_date"]
I then tried to sample synthetic data using ctgan.sample(18000) and immediately got this error:
Traceback (most recent call last):
  File "data_generation.py", line 177, in <module>
    fake = ctgan.sample(18000)
  File "/venv/lib/python3.9/site-packages/ctgan/synthesizers/base.py", line 50, in wrapper
    return function(self, *args, **kwargs)
  File "/venv/lib/python3.9/site-packages/ctgan/synthesizers/ctgan.py", line 498, in sample
    return self._transformer.inverse_transform(data)
  File "/venv/lib/python3.9/site-packages/ctgan/data_transformer.py", line 218, in inverse_transform
    recovered_column_data = self._inverse_transform_continuous(
  File "/venv/lib/python3.9/site-packages/ctgan/data_transformer.py", line 192, in _inverse_transform_continuous
    data = pd.DataFrame(column_data[:, :2], columns=list(gm.get_output_sdtypes()))
  File "/venv/lib/python3.9/site-packages/pandas/core/frame.py", line 758, in __init__
    mgr = ndarray_to_mgr(
  File "/venv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 337, in ndarray_to_mgr
    _check_values_indices_shape_match(values, index, columns)
  File "/venv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 408, in _check_values_indices_shape_match
    raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (18000, 2), indices imply (18000, 3)
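The reproduction steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the reporter's actual script: the helper names, the rule that every non-numeric column is passed as discrete, and the epoch count are all hypothetical. The missing-value report is an extra diagnostic that the thread does not confirm as the cause; it is included only because missing values in continuous columns seem worth ruling out.

```python
import pandas as pd

# Columns removed before training, as listed in the report.
TO_REMOVE_COMPAS = [
    "id", "name", "first", "last", "dob", "c_jail_in", "c_jail_out",
    "c_charge_desc", "r_offense_date", "r_charge_desc", "r_jail_in",
    "violent_recid", "vr_offense_date", "screening_date",
]


def prepare_compas(df, to_remove=TO_REMOVE_COMPAS):
    """Drop the listed columns and guess which remaining ones are discrete."""
    data = df.drop(columns=[c for c in to_remove if c in df.columns])
    # Assumption: every non-numeric column is treated as discrete.
    discrete_columns = data.select_dtypes(exclude="number").columns.tolist()
    return data, discrete_columns


def missing_value_report(data):
    """Return the count of missing values for each column that has any."""
    counts = data.isna().sum()
    return counts[counts > 0]


def train_and_sample(data, discrete_columns, n_rows=18000, epochs=10):
    """Fit CTGAN and sample; epochs=10 is an arbitrary value for a quick run."""
    from ctgan import CTGAN  # imported lazily so the helpers above work without it

    model = CTGAN(epochs=epochs)
    model.fit(data, discrete_columns)
    return model.sample(n_rows)
```

A typical run would load the file with pd.read_csv("cox-violent-parsed_filt.csv"), pass the result through prepare_compas, print missing_value_report(data), and then call train_and_sample(data, discrete_columns).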
Hi @bronval, thanks for opening this issue. I was able to reproduce this with the same dataset when using CTGAN, but not when using CTGANSynthesizer. I suspect that the pre-processing is slightly better with the latter, and in general we recommend that people use CTGANSynthesizer from the sdv library, as it has a significantly better user experience. (Under the hood, CTGANSynthesizer calls out to the original CTGAN library.)
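For anyone who wants to try that route, a minimal sketch of sampling through sdv might look like the following. It assumes an SDV 1.x installation in which SingleTableMetadata and CTGANSynthesizer are available (newer releases may prefer a different metadata class); the function name, row count default, and epoch count are hypothetical choices for illustration.

```python
def sample_with_sdv(data, n_rows=18000, epochs=300):
    """Fit SDV's CTGANSynthesizer on a DataFrame and sample n_rows rows."""
    # Imported lazily; assumes an SDV 1.x installation.
    from sdv.metadata import SingleTableMetadata
    from sdv.single_table import CTGANSynthesizer

    # Let SDV infer column types from the data instead of listing
    # discrete columns by hand.
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(data)

    synthesizer = CTGANSynthesizer(metadata, epochs=epochs)
    synthesizer.fit(data)
    return synthesizer.sample(num_rows=n_rows)
```

It is worth reviewing the detected metadata before fitting, since automatic detection can misclassify columns.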