Integer categoricals being sampled as strings instead of integer values #194

Closed
LihuaXiong2020 opened this issue Sep 17, 2020 · 4 comments · Fixed by #252

@LihuaXiong2020

  • SDV version: 0.4.0
  • Python version: 3.6.8
  • Operating System: Windows

Description & What I did

  • Specified some columns of the DataFrame as categorical in the metadata.
  • After SDV was trained and the sampled data generated, ran evaluate(metadata, orig_data, sample_data).
  • A TypeError is raised, complaining "ufunc 'isnan' not supported for input types...".
  • Debugged and found that the sampled categorical columns are stored with dtype "object", so they need to be converted to a numerical type manually before calling evaluate().
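
The failure and the manual workaround described above can be sketched without SDV. This is only an illustration under assumptions: the `sampled` DataFrame is a hand-built stand-in for sampled output, and it assumes the integer categories came back as strings in an object column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a sampled table whose integer categorical
# column came back as strings stored in an object column.
sampled = pd.DataFrame({'int': ['1', '2', '1', '3']}, dtype=object)

# np.isnan has no loop for object arrays, so it raises a TypeError.
try:
    np.isnan(sampled['int'].to_numpy())
    isnan_failed = False
except TypeError:
    isnan_failed = True

# Manual workaround: cast the column back to the dtype it had in the
# original data (assumed here to be int64) before evaluating.
original_dtype = np.dtype('int64')
sampled['int'] = sampled['int'].astype(original_dtype)
```

After the cast, the column is numeric again, which is the manual step the report says is needed before calling evaluate().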
@csala
Contributor

csala commented Sep 18, 2020

Hi @LihuaXiong2020! Thanks for reporting this.

If I understand this correctly, the problem is not really about sdmetrics but rather about "Integer categoricals being sampled as strings instead of integer values". Would you mind updating the title accordingly?

I can actually reproduce it like this:

import pandas as pd

from sdv import Metadata, SDV

# Single table with one integer column
df = pd.DataFrame({
    'int': [1, 2, 1, 3],
})
orig_data = {'a': df}

# Declare the integer column as categorical in the metadata
fields_metadata = {
    'int': {
        'type': 'categorical'
    },
}
metadata = Metadata()
metadata.add_table('a', df, fields_metadata=fields_metadata)

# Fit, sample, and inspect the dtypes of the sampled table
sdv = SDV()
sdv.fit(metadata, orig_data)
print(sdv.sample('a', sample_children=False).dtypes)

Which outputs:

int    object
dtype: object

As a technical note, the changes to fix this will need to be made in two places:

  • RDT: Make the CategoricalEncoder learn the column dtype and restore it when reverse transforming.
  • SDV: Remove the dropna().astype(dtype) call (the entire loop) inside Metadata.reverse_transform, which is unnecessary and incorrectly converts all the categorical columns to object.
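
The first point, learning the dtype at fit time and restoring it on reverse transform, can be illustrated with a toy encoder. This is a minimal sketch, not the RDT implementation: the class name `DtypePreservingEncoder` and its internals are hypothetical, and it ignores missing values.

```python
import pandas as pd


class DtypePreservingEncoder:
    """Toy categorical encoder (hypothetical, not RDT's CategoricalEncoder)."""

    def fit(self, column):
        # Remember the dtype so it can be restored later.
        self.dtype = column.dtype
        self.categories = column.unique().tolist()
        return self

    def transform(self, column):
        # Map each category to its integer position.
        codes = {category: i for i, category in enumerate(self.categories)}
        return column.map(codes)

    def reverse_transform(self, codes):
        # Map positions back to categories, then restore the learned dtype.
        values = codes.map(dict(enumerate(self.categories)))
        return values.astype(self.dtype)


original = pd.Series([1, 2, 1, 3], name='int')
encoder = DtypePreservingEncoder().fit(original)
restored = encoder.reverse_transform(encoder.transform(original))
```

With the dtype stored during fit, `restored` has the same dtype as `original` and no later cast is needed. A real encoder would also have to handle NaN values, which this sketch leaves out.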

@csala csala added the bug Something isn't working label Sep 18, 2020
@csala csala added this to the 0.4.2 milestone Sep 18, 2020
@LihuaXiong2020 LihuaXiong2020 changed the title from "Categorical data incompatible with SDMetrics" to "Integer categoricals being sampled as strings instead of integer values" Sep 18, 2020
@LihuaXiong2020
Author

Hi @csala, thanks for looking into it. Yes, I agree that the problem should be solved by your proposed change to learn the original dtype.

@csala csala modified the milestones: 0.4.2, 0.4.3 Sep 19, 2020
@csala csala modified the milestones: 0.4.3, 0.4.4 Sep 28, 2020
@csala csala modified the milestones: 0.4.4, 0.4.5 Oct 6, 2020
@csala csala modified the milestones: 0.4.5, 0.4.6 Oct 16, 2020
@JagdishKolhe

JagdishKolhe commented Nov 4, 2020

I am also getting the TypeError complaining "ufunc 'isnan' not supported for input types...".

I am not sure whether it is related to this issue. I am using SDV 0.4.5.

@csala
Contributor

csala commented Nov 4, 2020

> I am also getting the TypeError complaining "ufunc 'isnan' not supported for input types...".
>
> I am not sure whether it is related to this issue. I am using SDV 0.4.5.

Hi @JagdishKolhe, this sounds like it could be a separate topic.

Would you mind opening a new issue with all the details about what you executed and the output that you obtained?
