
Fix SettingWithCopyWarning (may be leading to a numerical calculation bug) #215

Closed
davebulaval opened this issue May 5, 2022 · 2 comments · Fixed by #263
Labels: bug (Something isn't working)

@davebulaval commented May 5, 2022

When running the TVAE or CTGAN tutorial, one gets a lot of SettingWithCopyWarning messages due to the following line: data[column_name] = data[column_name].to_numpy().flatten()

While digging into the code to track the warning down, I found a bug.

I was not quite sure what the flatten operation on a single column of data is meant to accomplish, but since it is the line raising the warning, I started debugging there. It turns out that this kind of data[column_name] = value assignment can lead to an actual bug.
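For context, here is a minimal sketch of why this assignment pattern is risky (the DataFrame and the extra column are illustrative, not taken from the CTGAN codebase):

```python
import pandas as pd

# Illustrative only: a single-column slice taken from a larger DataFrame.
full = pd.DataFrame({"student_id.value": [1.0, 2.0, 3.0], "other": [0, 1, 2]})
data = full[["student_id.value"]]

# Depending on the pandas version, this assignment raises SettingWithCopyWarning:
# pandas cannot tell whether `data` is a view or a copy of `full`, so it cannot
# guarantee where the write ends up. That ambiguity is exactly what can turn into
# a silent numerical mismatch downstream.
data["student_id.value"] = data["student_id.value"].to_numpy().flatten()
```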

Using the TVAE tutorial for the student_id.value column, one can obtain the following output table after the _transform_continuous method.

[Screenshot: output table of _transform_continuous for the student_id.value column]

By reading the comment in the _transform_continuous method, it is clear that column 0 must hold exactly the same values as the transformed vector generated by the underlying transformer package, since it is the normalized value.

#  Converts the transformed data to the appropriate output format.
#  The first column (ending in '.normalized') stays the same,
#  but the label encoded column (ending in '.component') is one hot encoded.
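In other words, for a continuous column each output row should look like [normalized_value, one_hot(component)]. A toy illustration of that layout (my own snippet, not CTGAN code):

```python
import numpy as np

def to_output_row(normalized: float, component: int, n_components: int) -> np.ndarray:
    """Toy version of the layout described above: column 0 keeps the normalized
    value, the remaining columns one-hot encode the mixture component."""
    one_hot = np.zeros(n_components)
    one_hot[component] = 1.0
    return np.concatenate(([normalized], one_hot))

# e.g. to_output_row(-0.25646, component=2, n_components=4)
# returns [-0.25646, 0., 0., 1., 0.]
```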

By looking at idx=16, we can see that the value of column 0 is -0.60451, whereas according to the transformed output shown below it should be -0.25646.

[Screenshot: the transformed output, showing student_id.value.normalized = -0.25646 at idx=16]

In total, in this example, there are 70 mismatches between the column transformed["student_id.value.normalized"] and the first column of the output (checked with np.equal(transformed["student_id.value.normalized"].to_numpy(), output[:, 0])).
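For reference, that check can be wrapped into a small helper (the function name is mine; transformed is the DataFrame produced by the transformer and output is the array returned by _transform_continuous):

```python
import numpy as np
import pandas as pd

def count_normalized_mismatches(transformed: pd.DataFrame, output: np.ndarray,
                                column: str = "student_id.value.normalized") -> int:
    """Count rows where the '.normalized' column differs from the first output column."""
    matches = np.equal(transformed[column].to_numpy(), output[:, 0])
    return int((~matches).sum())
```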

To fix:

I will push a PR, but: if we switch data[column_name] = data[column_name].to_numpy().flatten() to flatten_data = pd.DataFrame({column_name: data[column_name].to_numpy().flatten()}), it fixes all the mismatches (see the sketch below). However, the one-hot encoded columns change compared to the initial code, so I am not sure if this is OK.
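In isolation, the proposed change looks roughly like the sketch below (the function wrapper is mine, just to make the snippet self-contained; inside CTGAN the new DataFrame would replace data for the subsequent steps of _transform_continuous):

```python
import pandas as pd

def flatten_column(data: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """Return a fresh single-column DataFrame instead of assigning back into `data`.

    Original pattern (raises SettingWithCopyWarning when `data` is a slice):
        data[column_name] = data[column_name].to_numpy().flatten()
    """
    # Proposed pattern from this issue: build a new DataFrame so no write ever
    # targets a possibly-shared slice of the original data.
    return pd.DataFrame({column_name: data[column_name].to_numpy().flatten()})
```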

@davebulaval added the bug and pending review labels on May 5, 2022
@npatki (Contributor) commented Jul 14, 2022

Hello, thanks for filing this issue with all the details and screenshots. I just want to acknowledge that we've read through it.

However, the one-hot encoded columns change compared to the initial code, so I am not sure if this is OK.

CTGAN has a specific cross-entropy algorithm that requires the categorical columns to be properly one-hot encoded in order to avoid mode collapse.
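So before merging such a change, it is worth verifying that the '.component' columns remain valid one-hot vectors afterwards; a quick sketch of such a check (my own snippet, not part of the CTGAN test suite):

```python
import numpy as np

def is_valid_one_hot(component_columns: np.ndarray) -> bool:
    """Check that every row of the given '.component' block is a valid one-hot
    vector: all entries are 0 or 1 and each row sums to exactly 1."""
    binary = np.isin(component_columns, (0.0, 1.0)).all()
    single_active = np.allclose(component_columns.sum(axis=1), 1.0)
    return bool(binary and single_active)
```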

Let's keep this bug open for tracking. We'd need to handle this with caution and make sure there are no unintended consequences of this change. We can reach out when we have the bandwidth to fully investigate and review such a change.

@npatki removed the pending review label Jul 14, 2022
@npatki changed the title from "Annoying SettingWithCopyWarning Lead to Bug Report" to "Fix SettingWithCopyWarning (may be leading to a numerical calculation bug)" Jul 14, 2022
@davebulaval (Author) commented

From my experimentation, it seems like it would impact the mode computation.

@amontanez24 added this to the 0.7.0 milestone Jan 20, 2023
@amontanez24 self-assigned this Jan 20, 2023