ValueError: mismatch of shapes when sampling data for compas dataset #329

Closed
bronval opened this issue Jan 23, 2024 · 2 comments
Labels: bug (Something isn't working), resolution:WAI (The software is working as intended)

Comments

bronval commented Jan 23, 2024

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • CTGAN version: 0.8.0
  • Python version: 3.9.6
  • Pandas version: 2.0.3
  • Operating System: Ubuntu 22

Error Description

Hello, when trying to use CTGAN to sample 18,000 synthetic rows for the COMPAS dataset (https://www.kaggle.com/datasets/danofer/compass), I came across the following error:
ValueError: Shape of passed values is (18000, 2), indices imply (18000, 3)

Although the error is raised by Pandas, it originates in the function "_inverse_transform_continuous" in the file "data_transformer.py", specifically at the line
data = pd.DataFrame(column_data[:, :2], columns=list(gm.get_output_sdtypes()))

Note that I am not sure whether the Pandas version is the cause.
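
The mismatch described above can be reproduced in isolation with a few lines of Pandas. This is a minimal sketch, not CTGAN code: the array contents, the column names ("a", "b", "c") and the helper build_frame are all made up for illustration. It only shows that slicing an array to two columns while passing three column names produces a ValueError of the same form as the one in this report; it does not explain why three names are returned in the reported case.

```python
import numpy as np
import pandas as pd


def build_frame(values, column_names):
    """Hypothetical helper: try to build a DataFrame and return (frame, error_message)."""
    try:
        return pd.DataFrame(values, columns=column_names), None
    except ValueError as exc:
        return None, str(exc)


# Placeholder data: 5 rows, 4 columns of zeros (not real transformer output).
column_data = np.zeros((5, 4))

# Two values per row with two names: shapes agree.
ok_frame, ok_error = build_frame(column_data[:, :2], ["a", "b"])

# Two values per row with three names: Pandas rejects the construction.
bad_frame, bad_error = build_frame(column_data[:, :2], ["a", "b", "c"])

print(ok_frame.shape)
print(bad_error)
```

Comparing the number of names against the width of the sliced array before building the DataFrame is a quick way to confirm whether this is the failing condition in a given environment.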

Steps to reproduce

First, I trained a model on the COMPAS data (file: cox-violent-parsed_filt.csv) after removing the following columns:
to_remove_compas = ["id", "name", "first", "last", "dob", "c_jail_in", "c_jail_out", "c_charge_desc", "r_offense_date", "r_charge_desc", "r_jail_in", "violent_recid", "vr_offense_date", "screening_date"]
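
The preparation and training step could look roughly like the sketch below. Only the column list, the CTGAN class and the sample(18000) call come from this report; the helper names, the rule for choosing discrete columns (every non-numeric column) and the epochs value are assumptions, since the original training code was not posted. The small demo frame uses placeholder values, not rows from the dataset.

```python
import pandas as pd

# Column list copied from the report.
TO_REMOVE_COMPAS = [
    "id", "name", "first", "last", "dob", "c_jail_in", "c_jail_out",
    "c_charge_desc", "r_offense_date", "r_charge_desc", "r_jail_in",
    "violent_recid", "vr_offense_date", "screening_date",
]


def prepare_compas(frame):
    """Drop the listed columns; names missing from the frame are skipped."""
    return frame.drop(columns=TO_REMOVE_COMPAS, errors="ignore")


def guess_discrete_columns(frame):
    """Assumption: treat every non-numeric column as discrete."""
    return [
        name for name in frame.columns
        if not pd.api.types.is_numeric_dtype(frame[name])
    ]


def fit_and_sample(frame, n_rows, epochs=10):
    """Train CTGAN and sample n_rows. The epochs value is an arbitrary placeholder."""
    # Imported lazily so the helpers above can be used without ctgan installed.
    from ctgan import CTGAN

    model = CTGAN(epochs=epochs)
    model.fit(frame, discrete_columns=guess_discrete_columns(frame))
    return model.sample(n_rows)


# Tiny placeholder frame to show the preparation step only.
demo = pd.DataFrame({
    "id": [1, 2],
    "name": ["x", "y"],
    "age": [30, 41],
    "sex": ["Male", "Female"],
})
prepared = prepare_compas(demo)
print(list(prepared.columns))
print(guess_discrete_columns(prepared))
```

With the real file, the call would be along the lines of fit_and_sample(prepare_compas(pd.read_csv("cox-violent-parsed_filt.csv")), 18000).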

I then tried to sample synthetic data using ctgan.sample(18000) and immediately got this error:

Traceback (most recent call last):
  File "data_generation.py", line 177, in <module>
    fake = ctgan.sample(18000)
  File "/venv/lib/python3.9/site-packages/ctgan/synthesizers/base.py", line 50, in wrapper
    return function(self, *args, **kwargs)
  File "/venv/lib/python3.9/site-packages/ctgan/synthesizers/ctgan.py", line 498, in sample
    return self._transformer.inverse_transform(data)
  File "/venv/lib/python3.9/site-packages/ctgan/data_transformer.py", line 218, in inverse_transform
    recovered_column_data = self._inverse_transform_continuous(
  File "/venv/lib/python3.9/site-packages/ctgan/data_transformer.py", line 192, in _inverse_transform_continuous
    data = pd.DataFrame(column_data[:, :2], columns=list(gm.get_output_sdtypes()))
  File "/venv/lib/python3.9/site-packages/pandas/core/frame.py", line 758, in __init__
    mgr = ndarray_to_mgr(
  File "/venv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 337, in ndarray_to_mgr
    _check_values_indices_shape_match(values, index, columns)
  File "/venv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 408, in _check_values_indices_shape_match
    raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (18000, 2), indices imply (18000, 3)
bronval added the bug and new labels on Jan 23, 2024
srinify commented Feb 28, 2024

Hi @bronval, thanks for opening this issue. I was able to reproduce the error with the same dataset when using CTGAN, but not when using CTGANSynthesizer. I suspect the pre-processing is slightly better in the latter. In general, we recommend using CTGANSynthesizer from the sdv library, as it has a significantly better user experience. (Under the hood, CTGANSynthesizer calls the original CTGAN library.)

Here's a Colab notebook I created where you can see CTGANSynthesizer working perfectly with the same dataset: https://colab.research.google.com/drive/1NJROe2bq28G_IL1Kr6MsP8i5hFuUeM_A?usp=sharing

(Note: You'll need to upload the cox-violent-parsed_filt.csv dataset to Colab in the File view when you run / fork this notebook)
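
For readers who cannot open the notebook, the recommended route might look like the sketch below. It is not a copy of the linked notebook. It assumes an SDV 1.x release in which SingleTableMetadata and sdv.single_table.CTGANSynthesizer are available (newer releases may prefer a different metadata class), and the function name sample_with_sdv is made up for illustration.

```python
def sample_with_sdv(frame, n_rows):
    """Fit SDV's CTGANSynthesizer on a DataFrame and return n_rows synthetic rows."""
    # Imported lazily so that defining this function does not require sdv.
    from sdv.metadata import SingleTableMetadata
    from sdv.single_table import CTGANSynthesizer

    # Let SDV infer column types from the data instead of listing them by hand.
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(frame)

    synthesizer = CTGANSynthesizer(metadata)
    synthesizer.fit(frame)
    return synthesizer.sample(num_rows=n_rows)
```

Usage would be along the lines of sample_with_sdv(pd.read_csv("cox-violent-parsed_filt.csv"), 18000), after removing whichever columns are not wanted.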

srinify commented Apr 17, 2024

Hi there @bronval, I'm closing this issue for now. Please feel free to re-open it if it is still relevant to you :)

srinify closed this as completed on Apr 17, 2024
srinify added the resolution:out of scope label and removed the new label on Apr 17, 2024
srinify added the resolution:WAI label and removed the resolution:out of scope label on Oct 30, 2024