ValueError: mismatch of shapes when sampling data for compas dataset #329

bronval · 2024-01-23T13:13:30Z

Environment Details

Please indicate the following details about the environment in which you found the bug:

CTGAN version: 0.8.0
Python version: 3.9.6
Pandas version: 2.0.3
Operating System: Ubuntu 22

Error Description

Hello, when trying to use CTGAN to sample 18000 synthetic rows for the compas dataset (https://www.kaggle.com/datasets/danofer/compass), I ran into the following error:
ValueError: Shape of passed values is (18000, 2), indices imply (18000, 3)

Although the error is raised by Pandas, it originates in the function "_inverse_transform_continuous" in the file "data_transformer.py", specifically at the line
data = pd.DataFrame(column_data[:, :2], columns=list(gm.get_output_sdtypes()))

Note that I am not sure whether the Pandas version is the cause.
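For readers who want to see the Pandas side of the error in isolation, here is a minimal sketch that does not use CTGAN at all. It only shows that passing a two-column array together with three column names raises the same ValueError. The three names below are placeholders chosen for illustration; they are not claimed to be what gm.get_output_sdtypes() actually returns, and build_frame is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def build_frame(values, column_names):
    """Wrap the DataFrame construction so the failure is easy to isolate."""
    return pd.DataFrame(values, columns=column_names)


# Stand-in for the sampled array; the real one comes from the generator.
column_data = np.zeros((18000, 5))

# Placeholder names (assumption): three names for a slice with only two columns.
three_names = ["normalized", "component", "is_null"]

try:
    build_frame(column_data[:, :2], three_names)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# With a matching number of names, the same slice builds without error.
ok_frame = build_frame(column_data[:, :2], three_names[:2])
print(ok_frame.shape)
```

If this reading is right, the mismatch comes from the number of output names rather than from Pandas itself, but that is an inference from the traceback and not something confirmed in this thread.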

Steps to reproduce

First, I trained a model on the compas data (file: cox-violent-parsed_filt.csv) after removing the following columns:
to_remove_compas = ["id", "name", "first", "last", "dob", "c_jail_in", "c_jail_out", "c_charge_desc", "r_offense_date", "r_charge_desc", "r_jail_in", "violent_recid", "vr_offense_date", "screening_date"]

I then tried to sample synthetic data using ctgan.sample(18000) and immediately got this error:

Traceback (most recent call last):
  File "data_generation.py", line 177, in <module>
    fake = ctgan.sample(18000)
  File "/venv/lib/python3.9/site-packages/ctgan/synthesizers/base.py", line 50, in wrapper
    return function(self, *args, **kwargs)
  File "/venv/lib/python3.9/site-packages/ctgan/synthesizers/ctgan.py", line 498, in sample
    return self._transformer.inverse_transform(data)
  File "/venv/lib/python3.9/site-packages/ctgan/data_transformer.py", line 218, in inverse_transform
    recovered_column_data = self._inverse_transform_continuous(
  File "/venv/lib/python3.9/site-packages/ctgan/data_transformer.py", line 192, in _inverse_transform_continuous
    data = pd.DataFrame(column_data[:, :2], columns=list(gm.get_output_sdtypes()))
  File "/venv/lib/python3.9/site-packages/pandas/core/frame.py", line 758, in __init__
    mgr = ndarray_to_mgr(
  File "/venv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 337, in ndarray_to_mgr
    _check_values_indices_shape_match(values, index, columns)
  File "/venv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 408, in _check_values_indices_shape_match
    raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (18000, 2), indices imply (18000, 3)


srinify · 2024-02-28T22:43:53Z

Hi @bronval, thanks for opening this issue. I was able to reproduce this with the same dataset when using CTGAN, but not when using CTGANSynthesizer. I suspect that the pre-processing is slightly better in the latter. In general, we recommend that people use CTGANSynthesizer from the sdv library, as it has a significantly better user experience. (Under the hood, CTGANSynthesizer calls out to the original CTGAN library.)

Here's a Colab notebook I created where you can see CTGANSynthesizer working perfectly with the same dataset: https://colab.research.google.com/drive/1NJROe2bq28G_IL1Kr6MsP8i5hFuUeM_A?usp=sharing

(Note: You'll need to upload the cox-violent-parsed_filt.csv dataset to Colab in the File view when you run / fork this notebook)

srinify · 2024-04-17T14:41:34Z

Hi there @bronval, I'm closing this issue for now. Please feel free to re-open it if it is still relevant to you :)

bronval added the labels bug (Something isn't working) and new (Label applied to new issues) on Jan 23, 2024

bronval mentioned this issue Jan 23, 2024

fix shape mismatch in function inverse_transform_continuous #330

Open

srinify closed this as completed Apr 17, 2024

srinify added the label resolution:out of scope (CTGAN is not designed to solve this problem) and removed the label new (Label applied to new issues) on Apr 17, 2024

srinify mentioned this issue Oct 30, 2024

Surface error to user during fit if training data contains null values #414

Closed

srinify added the label resolution:WAI (The software is working as intended) and removed the label resolution:out of scope (CTGAN is not designed to solve this problem) on Oct 30, 2024
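The cross-reference to #414 suggests that null values in the training data may be what triggers this error, although the thread does not say so explicitly. Under that assumption, a quick check before fitting could look like the sketch below. The helper columns_with_nulls and the toy DataFrame are hypothetical and only illustrate the idea; they are not part of CTGAN or sdv.

```python
import pandas as pd


def columns_with_nulls(df):
    """Return {column name: null count} for every column that has nulls."""
    counts = df.isna().sum()
    return counts[counts > 0].to_dict()


# Toy data (assumption): stands in for the filtered compas DataFrame.
toy = pd.DataFrame(
    {
        "age": [25, None, 40],
        "priors_count": [0, 2, 1],
        "c_charge_degree": ["F", "M", None],
    }
)

null_report = columns_with_nulls(toy)
print(null_report)

# One possible way to handle the nulls before fitting is to drop those rows.
# Whether dropping or imputing is appropriate depends on the analysis.
cleaned = toy.dropna()
print(len(cleaned))
```

If the report is non-empty for the real data, handling those columns before calling fit, or switching to CTGANSynthesizer as suggested above, would be reasonable things to try.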
