Doubts on the usage of conditional sampling #322

Closed
tonydp03 opened this issue Oct 31, 2023 · 4 comments
Labels
question General question about the software resolution:resolved The issue was fixed, the question was answered, etc.

Comments

@tonydp03

Environment details

If you are already running CTGAN, please indicate the following details about the environment in
which you are running it:

  • CTGAN version: 0.7.5
  • Python version: 3.10.11
  • Operating System: MacOS 14.0 Sonoma

Problem description

I'm trying to generate samples from the example dataset adult.csv, conditioned on the column "sex" with value "Female". However, it doesn't seem to work.

What I already tried

I tried wrapping the value in double quotes ("...") or single quotes ('...') and tried other categories and values, but the result doesn't change. The one-hot vector that is generated contains only zeros.

The command is the following:

python ctgan/__main__.py examples/csv/adult.csv examples/csv/synthetic_adult_cond.csv --save test_model_cond.p -d workclass,education,marital-status,occupation,relationship,race,sex,native-country,income --verbose --epochs 10 --sample_condition_column sex --sample_condition_column_value Female

(note that the number of epochs is very low just for testing the command and reproducing the error). The traceback is the following:

Traceback (most recent call last):
  File "[...]/CTGAN/ctgan/__main__.py", line 103, in <module>
    main()
  File "[...]/CTGAN/ctgan/__main__.py", line 91, in main
    sampled = model.sample(
  File "[...]/ctgan/lib/python3.10/site-packages/ctgan/synthesizers/base.py", line 50, in wrapper
    return function(self, *args, **kwargs)
  File "[...]/ctgan/lib/python3.10/site-packages/ctgan/synthesizers/ctgan.py", line 465, in sample
    condition_info = self._transformer.convert_column_name_value_to_id(
  File "[...]/ctgan/lib/python3.10/site-packages/ctgan/data_transformer.py", line 260, in convert_column_name_value_to_id
    raise ValueError(f"The value `{value}` doesn't exist in the column `{column_name}`.")
ValueError: The value `Female` doesn't exist in the column `sex`.

Any hint? Am I using it wrong?

@tonydp03 tonydp03 added new Label applied to new issues question General question about the software labels Oct 31, 2023
@npatki
Contributor

npatki commented Oct 31, 2023

Hi @tonydp03,

Nice to meet you. From looking at the raw CSV of input data, it seems that there is a leading space before every value. So in this case, the value you're conditioning on should be " Female" (with the leading space) instead of Female (with no space).
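To see the mismatch concretely, here is a minimal sketch using only the Python standard library. The three-line `SAMPLE` string is a hypothetical excerpt written in the same "comma followed by a space" style described above, not the real adult.csv, and `distinct_values` is a helper name introduced only for this illustration.

```python
import csv
import io

# Hypothetical excerpt: header without spaces, values preceded by a space.
SAMPLE = "age,workclass,sex\n39, State-gov, Male\n28, Private, Female\n"


def distinct_values(text, column, skipinitialspace=False):
    """Return the sorted distinct values of one column of CSV text."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=skipinitialspace)
    return sorted({row[column] for row in reader})


# Read as-is: every value keeps its leading space, so "Female" is not present.
raw = distinct_values(SAMPLE, "sex")

# Read with skipinitialspace=True: whitespace after each delimiter is dropped.
clean = distinct_values(SAMPLE, "sex", skipinitialspace=True)

print([repr(v) for v in raw])
print([repr(v) for v in clean])
```

Printing the values with `repr` makes the leading space visible. With the command from the original report, the equivalent fix would be to quote the value so the shell preserves the space, for example `--sample_condition_column_value " Female"`, or to strip the whitespace from the CSV before training.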

BTW if your project allows for it, I would recommend accessing the CTGAN model through the SDV library. The SDV is a publicly available Python SDK that allows you to generate synthetic data using a variety of synthesizers such as CTGAN. It also provides convenient wrappers for data pre- and post-processing, should you want to modify that. And you can use conditional sampling with it too.

Some resources:

@npatki npatki added under discussion Issue is currently being discussed and removed new Label applied to new issues labels Oct 31, 2023
@tonydp03
Author

tonydp03 commented Nov 2, 2023

Hi @npatki,

Thanks for your answer. I simply assumed the test dataset could be used "out-of-the-box" and didn't notice the leading space at the beginning of the column value. I will give it another try, for sure.

Thanks for the resources too. For the moment, we were just testing the usage of CTGAN to generate synthetic data, as we were positively impressed by the results shown in the paper. In parallel, we're also testing the usage of the SDV library, as it seems an interesting tool.

@tonydp03
Author

tonydp03 commented Nov 2, 2023

One more thing: is it correct that, in __main__.py, the function fit is called even when the model is loaded? I expected it to be called only when the model has not been trained yet and I'm creating a new one.
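The behavior expected in this question can be sketched as a "load if saved, otherwise train" pattern. This is only an illustration of that pattern, not the logic in CTGAN's __main__.py: `TinyModel`, `load_or_train`, and `demo` are hypothetical names, and the model state is stored as JSON purely to keep the example self-contained.

```python
import json
import os
import tempfile


class TinyModel:
    """Hypothetical stand-in for a synthesizer; counts how often fit() runs."""

    def __init__(self, rows=None, fit_calls=0):
        self.rows = rows or []
        self.fit_calls = fit_calls

    def fit(self, rows):
        self.fit_calls += 1
        self.rows = list(rows)

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"rows": self.rows, "fit_calls": self.fit_calls}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(**json.load(f))


def load_or_train(path, rows):
    """Reuse a saved model when one exists; train and save only otherwise."""
    if os.path.exists(path):
        return TinyModel.load(path)
    model = TinyModel()
    model.fit(rows)
    model.save(path)
    return model


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.json")
        first = load_or_train(path, [1, 2, 3])   # no file yet: trains
        second = load_or_train(path, [1, 2, 3])  # file exists: loads only
        return first.fit_calls, second.fit_calls


first_calls, second_calls = demo()
print(first_calls, second_calls)
```

In this sketch the second call returns the stored state without calling fit again, which is the behavior the question describes as expected.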

@npatki
Contributor

npatki commented Apr 17, 2024

Hi @tonydp03, my apologies for the late reply.

The current recommended approach is to use CTGAN via the SDV library as described above. I can answer your usage questions and help you troubleshoot any issues with your project.

Unfortunately I'm unable to go through any detailed lines of code with you. Please also note that some code in the repo may be deprecated or unsupported so I would always recommend the docs for the latest supported usage.

Thanks and please feel free to file a new issue with additional questions or feature requests.

@npatki npatki closed this as completed Apr 17, 2024
@npatki npatki added resolution:resolved The issue was fixed, the question was answered, etc. and removed under discussion Issue is currently being discussed labels Apr 17, 2024